doc/guides/howto/debug_troubleshoot.rst

   1 ..  SPDX-License-Identifier: BSD-3-Clause
   2     Copyright(c) 2018 Intel Corporation.
   3
   4 Debug & Troubleshoot guide
   5 ==========================
   6
   7 DPDK applications can be designed to have simple or complex pipeline processing
   8 stages making use of single or multiple threads. Applications can also use poll
   9 mode hardware devices, which help offload CPU cycles. It is common to find
  10 solutions designed with:
  11
  12 * single or multiple primary processes
  13
  14 * single primary and single secondary
  15
  16 * single primary and multiple secondaries
  17
  18 In all the above cases, it is tedious to isolate, debug, and understand various
  19 behaviors which occur randomly or periodically. The goal of this guide is to
  20 consolidate a few commonly seen issues for reference, and then to isolate and
  21 identify the root cause through step-by-step debugging at various stages.
  22
  23 .. note::
  24
  25  It is difficult to cover all possible issues in a single attempt. With
  26  feedback and suggestions from the community, more cases can be covered.
  27
  28
  29 Application Overview
  30 --------------------
  31
  32 By making use of the application model as a reference, we can discuss multiple
  33 causes of issues in the guide. Let us assume the sample makes use of a single
  34 primary process, with various processing stages running on multiple cores. The
  35 application may also make use of Poll Mode Drivers (PMD) and libraries like
  36 service cores, mempool, mbuf, eventdev, cryptodev, QoS, and ethdev.
  37
  38 The overview of an application modeled using PMD is shown in
  39 :numref:`dtg_sample_app_model`.
  40
  41 .. _dtg_sample_app_model:
  42
  43 .. figure:: img/dtg_sample_app_model.*
  44
  45    Overview of pipeline stage of an application
  46
  47
  48 Bottleneck Analysis
  49 -------------------
  50
  51 A couple of factors that drive the design decision could be the platform, scale
  52 factor, and target. These distinct preferences lead to multiple combinations
  53 that are built using the PMDs and libraries of DPDK. The compiler, library
  54 mode, and optimization flags are components that should be kept constant,
  55 as they affect the application too.
  56
  57
  58 Is there a mismatch in packet (received < desired) rate?
  59 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
  60
  61 The RX port and associated core are shown in :numref:`dtg_rx_rate`.
  62
  63 .. _dtg_rx_rate:
  64
  65 .. figure:: img/dtg_rx_rate.*
  66
  67    RX packet rate compared against received rate.
  68
  69 #. Is the RX configuration set up correctly?
  70
  71    * Check that the port speed and duplex match the desired values with
  72      ``rte_eth_link_get``.
  73
  74    * Check ``DEV_RX_OFFLOAD_JUMBO_FRAME`` is set with ``rte_eth_dev_info_get``.
  75
  76    * Check promiscuous mode with ``rte_eth_promiscuous_get`` if the drops do
  77      not occur for a unique MAC address.
  78
  79 #. Is the drop isolated to a certain NIC only?
  80
  81    * Make use of ``rte_eth_stats_get`` to identify the cause of the drops.
  82
  83    * If there are mbuf drops, check ``nb_desc`` for the RX descriptors, as it
  84      might not be sufficient for the application.
  85
  86    * If ``rte_eth_stats_get`` shows drops on specific RX queues, ensure the RX
  87      lcore threads have enough cycles for ``rte_eth_rx_burst`` on the port
  88      queue pair.
  89
  90    * If traffic is redirected to a specific port queue pair, ensure the RX
  91      lcore threads get enough cycles.
  92
  93    * Check the RSS configuration with ``rte_eth_dev_rss_hash_conf_get`` if the
  94      spread is not even and is causing drops.
  95
  96    * If PMD stats are not updating, then there might be an offload or
  97      configuration which is dropping the incoming traffic.
  98
  99 #. Are drops still seen?
 100
 101    * If there are multiple port queue pairs, it might be the RX thread, RX
 102      distributor, or event RX adapter not having enough cycles.
 103
 104    * If there are drops seen for RX adapter or RX distributor, try using
 105      ``rte_prefetch_non_temporal`` which hints to the core that the mbuf in
 106      the cache is temporary.
 107
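     The port-level RX checks above can be collected in one helper. This is a
     minimal sketch, assuming an initialized EAL and a started port; the name
     ``dbg_rx_port_check`` is hypothetical, and the return value of
     ``rte_eth_link_get`` is ignored because its type differs between releases.

     .. code-block:: c

        #include <inttypes.h>
        #include <stdio.h>
        #include <string.h>
        #include <rte_ethdev.h>

        /* Hypothetical helper: print link settings and RX drop counters. */
        static void
        dbg_rx_port_check(uint16_t port_id)
        {
            struct rte_eth_link link;
            struct rte_eth_stats stats;

            memset(&link, 0, sizeof(link));
            rte_eth_link_get(port_id, &link);
            /* link_speed is in Mbps; link_duplex is 1 for full duplex. */
            printf("port %u: speed %u, duplex %u, up %u, promiscuous %d\n",
                   (unsigned int)port_id, (unsigned int)link.link_speed,
                   (unsigned int)link.link_duplex,
                   (unsigned int)link.link_status,
                   rte_eth_promiscuous_get(port_id));

            if (rte_eth_stats_get(port_id, &stats) != 0)
                return;
            /* imissed: packets dropped by hardware, e.g. RX queues full. */
            printf("port %u: ipackets %" PRIu64 ", imissed %" PRIu64
                   ", ierrors %" PRIu64 "\n", (unsigned int)port_id,
                   stats.ipackets, stats.imissed, stats.ierrors);
        }

     Calling the helper periodically and comparing successive outputs shows
     whether ``imissed`` grows while the application is under load.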
 108
 109 Are there packet drops at receive or transmit?
 110 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 111
 112 The RX-TX port and associated cores are shown in :numref:`dtg_rx_tx_drop`.
 113
 114 .. _dtg_rx_tx_drop:
 115
 116 .. figure:: img/dtg_rx_tx_drop.*
 117
 118    RX-TX drops
 119
 120 #. At RX
 121
 122    * Identify if there are multiple RX queues configured for the port by
 123      checking ``nb_rx_queues`` using ``rte_eth_dev_info_get``.
 124
 125    * If ``rte_eth_stats_get`` shows drops in ``q_errors``, check if the RX
 126      thread is configured to fetch packets from the port queue pair.
 127
 128    * If ``rte_eth_stats_get`` shows drops in ``rx_nombuf``, check if the RX
 129      thread has enough cycles to consume the packets from the queue.
 130
 131 #. At TX
 132
 133    * If the TX rate is falling behind the application fill rate, identify if
 134      there are enough descriptors with ``rte_eth_dev_info_get`` for TX.
 135
 136    * Check that ``nb_pkts`` in ``rte_eth_tx_burst`` covers multiple packets.
 137
 138    * Check ``rte_eth_tx_burst`` invokes the vector function call for the PMD.
 139
 140    * If ``oerrors`` is incrementing, TX packet validations are failing. Check
 141      if there are queue-specific offload failures.
 142
 143    * If the drops occur for large size packets, check the MTU and multi-segment
 144      support configured for the NIC.
 145
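     The per-queue and port-level counters named above could be read as
     follows. This is a minimal sketch, assuming a started port;
     ``dbg_port_drops`` is a hypothetical name, and the per-queue arrays are
     those of ``struct rte_eth_stats`` in the ethdev API this guide refers to.

     .. code-block:: c

        #include <inttypes.h>
        #include <stdio.h>
        #include <rte_ethdev.h>

        /* Hypothetical helper: print per-queue and port-level drop counters. */
        static void
        dbg_port_drops(uint16_t port_id)
        {
            struct rte_eth_stats stats;
            unsigned int q;

            if (rte_eth_stats_get(port_id, &stats) != 0)
                return;

            /* Only the first RTE_ETHDEV_QUEUE_STAT_CNTRS queues are counted. */
            for (q = 0; q < RTE_ETHDEV_QUEUE_STAT_CNTRS; q++) {
                if (stats.q_ipackets[q] == 0 && stats.q_errors[q] == 0)
                    continue;
                printf("port %u queue %u: ipackets %" PRIu64
                       ", errors %" PRIu64 "\n", (unsigned int)port_id, q,
                       stats.q_ipackets[q], stats.q_errors[q]);
            }

            /* rx_nombuf: RX mbuf allocation failures; oerrors: failed TX. */
            printf("port %u: rx_nombuf %" PRIu64 ", oerrors %" PRIu64 "\n",
                   (unsigned int)port_id, stats.rx_nombuf, stats.oerrors);
        }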
 146
 147 Are there object drops at the producer point for the ring library?
 148 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 149
 150 The producer point for the ring is shown in :numref:`dtg_producer_ring`.
 151
 152 .. _dtg_producer_ring:
 153
 154 .. figure:: img/dtg_producer_ring.*
 155
 156    Producer point for Rings
 157
 158 #. Performance issue isolation at producer
 159
 160    * Use ``rte_ring_dump`` to validate that the ``RING_F_SP_ENQ`` flag is set
 161      for all single-producer rings.
 162
 163    * There should be sufficient ``rte_ring_free_count`` at any point in time.
 164
 165    * Extreme stalls in the dequeue stage of the pipeline will cause
 166      ``rte_ring_full`` to be true.
 167
 168
 169 Are there object drops at the consumer point for the ring library?
 170 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 171
 172 The consumer point for the ring is shown in :numref:`dtg_consumer_ring`.
 173
 174 .. _dtg_consumer_ring:
 175
 176 .. figure:: img/dtg_consumer_ring.*
 177
 178    Consumer point for Rings
 179
 180 #. Performance issue isolation at consumer
 181
 182    * Use ``rte_ring_dump`` to validate that the ``RING_F_SC_DEQ`` flag is set
 183      for all single-consumer rings.
 184
 185    * If the actual dequeue falls behind the desired burst dequeue, the enqueue
 186      stage is not filling up the ring as required.
 187
 188    * Extreme stalls in the enqueue will cause ``rte_ring_empty`` to be true.
 189
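     The producer and consumer side ring checks can share one helper. This is
     a minimal sketch, assuming the caller already holds a valid ring pointer
     (for example from ``rte_ring_lookup``); ``dbg_ring_check`` is a
     hypothetical name.

     .. code-block:: c

        #include <stdio.h>
        #include <rte_ring.h>

        /* Hypothetical helper: print ring occupancy, then the full dump. */
        static void
        dbg_ring_check(const struct rte_ring *r)
        {
            /* A ring that stays full points at a stalled consumer,
             * a ring that stays empty points at a stalled producer. */
            printf("ring %s: used %u, free %u, full %d, empty %d\n",
                   r->name, rte_ring_count(r), rte_ring_free_count(r),
                   rte_ring_full(r), rte_ring_empty(r));

            /* The dump includes the flags the ring was created with. */
            rte_ring_dump(stdout, r);
        }

     Sampling the helper at the enqueue and dequeue stages helps tell a slow
     consumer from a slow producer.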
 190
 191 Is there a variance in packet or object processing rate in the pipeline?
 192 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 193
 194 Memory objects close to the NUMA node are shown in :numref:`dtg_mempool`.
 195
 196 .. _dtg_mempool:
 197
 198 .. figure:: img/dtg_mempool.*
 199
 200    Memory objects have to be close to the device per NUMA.
 201
 202 #. Stalls in the processing pipeline can be attributed to MBUF release delays.
 203    These can be narrowed down to:
 204
 205    * Heavy processing cycles at single or multiple processing stages.
 206
 207    * Cache is spread due to the increased stages in the pipeline.
 208
 209    * CPU thread responsible for TX is not able to keep up with the burst of
 210      traffic.
 211
 212    * Extra cycles spent linearizing multi-segment buffers and on software
 213      offloads like checksum, TSO, and VLAN strip.
 214
 215    * Packet buffer copy in fast path also results in stalls in MBUF release if
 216      not done selectively.
 217
 218    * Application logic sets the reference count with ``rte_mbuf_refcnt_set``
 219      higher than the desired value, frequently uses ``rte_pktmbuf_prefree_seg``,
 220      and does not release the MBUF back to the mempool.
 221
 222 #. Lower performance between pipeline stages can be narrowed down as follows:
 223
 224    * The NUMA instance for packets or objects from NIC, mempool, and ring
 225      should be the same.
 226
 227    * Drops on a specific socket can be due to insufficient objects in the pool.
 228      Use ``rte_mempool_get_count`` or ``rte_mempool_avail_count`` to monitor
 229      when drops occur.
 230
 231    * Try prefetching the content in processing pipeline logic to minimize the
 232      stalls.
 233
 234 #. Performance issue can be due to special cases
 235
 236    * Check if the MBUF is contiguous with ``rte_pktmbuf_is_contiguous``, as
 237      certain offloads require it.
 238
 239    * Use ``rte_mempool_cache_create`` for user threads that require access to
 240      mempool objects.
 241
 242    * If the variance is absent for larger huge pages, try ``rte_mem_lock_page``
 243      on the objects, packets, and lookup tables to isolate the issue.
 244
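     The NUMA placement and pool occupancy checks could be combined as shown
     here. This is a minimal sketch, assuming a valid port and the mempool
     that feeds its RX queues; ``dbg_mempool_check`` is a hypothetical name.

     .. code-block:: c

        #include <stdio.h>
        #include <rte_ethdev.h>
        #include <rte_mempool.h>

        /* Hypothetical helper: report pool occupancy and NUMA placement. */
        static void
        dbg_mempool_check(uint16_t port_id, const struct rte_mempool *mp)
        {
            /* A negative value means the socket could not be determined. */
            int port_socket = rte_eth_dev_socket_id(port_id);

            printf("pool %s: available %u, in use %u\n", mp->name,
                   rte_mempool_avail_count(mp), rte_mempool_in_use_count(mp));

            if (port_socket >= 0 && mp->socket_id != port_socket)
                printf("pool %s is on socket %d, port %u is on socket %d\n",
                       mp->name, mp->socket_id, (unsigned int)port_id,
                       port_socket);
        }

     An available count that keeps falling while traffic is steady suggests
     that MBUFs are not being released back to the pool.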
 245
 246 Is there a variance in cryptodev performance?
 247 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 248
 249 The crypto device and PMD are shown in :numref:`dtg_crypto`.
 250
 251 .. _dtg_crypto:
 252
 253 .. figure:: img/dtg_crypto.*
 254
 255    CRYPTO and interaction with PMD device.
 256
 257 #. Performance issue isolation for enqueue
 258
 259    * Ensure cryptodev, resources, and enqueue are running on the same NUMA node.
 260
 261    * Isolate the cause of errors in err_count using ``rte_cryptodev_stats_get``.
 262
 263    * Parallelize the enqueue threads across multiple queue pairs.
 264
 265 #. Performance issue isolation for dequeue
 266
 267    * Ensure cryptodev, resources, and dequeue are running on the same NUMA node.
 268
 269    * Isolate the cause of errors in err_count using ``rte_cryptodev_stats_get``.
 270
 271    * Parallelize the dequeue threads across multiple queue pairs.
 272
 273 #. Performance issue isolation for crypto operation
 274
 275    * If the cryptodev software-assist is in use, ensure the library is built
 276      with the right (SIMD) flags, or check if the queue pair is using the CPU
 277      ISA (feature_flags AVX|SSE|NEON) using ``rte_cryptodev_info_get``.
 278
 279    * If the cryptodev hardware-assist is in use, ensure both firmware and
 280      drivers are up to date.
 281
 282 #. Configuration issue isolation
 283
 284    * Identify cryptodev instances with ``rte_cryptodev_count`` and
 285      ``rte_cryptodev_info_get``.
 286
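     The cryptodev checks above could be walked through in one pass. This is a
     minimal sketch, assuming device identifiers run from 0 to
     ``rte_cryptodev_count() - 1``, which is common but not guaranteed;
     ``dbg_cryptodev_check`` is a hypothetical name.

     .. code-block:: c

        #include <inttypes.h>
        #include <stdio.h>
        #include <rte_cryptodev.h>

        /* Hypothetical helper: print placement, ISA flags and error counts. */
        static void
        dbg_cryptodev_check(void)
        {
            uint8_t nb_devs = rte_cryptodev_count();
            uint8_t dev_id;

            for (dev_id = 0; dev_id < nb_devs; dev_id++) {
                struct rte_cryptodev_info info;
                struct rte_cryptodev_stats stats;

                rte_cryptodev_info_get(dev_id, &info);
                printf("cryptodev %u (%s): socket %d, SSE %d, AVX %d, NEON %d\n",
                       (unsigned int)dev_id, info.driver_name,
                       rte_cryptodev_socket_id(dev_id),
                       !!(info.feature_flags & RTE_CRYPTODEV_FF_CPU_SSE),
                       !!(info.feature_flags & RTE_CRYPTODEV_FF_CPU_AVX),
                       !!(info.feature_flags & RTE_CRYPTODEV_FF_CPU_NEON));

                if (rte_cryptodev_stats_get(dev_id, &stats) != 0)
                    continue;
                printf("cryptodev %u: enqueue errors %" PRIu64
                       ", dequeue errors %" PRIu64 "\n", (unsigned int)dev_id,
                       stats.enqueue_err_count, stats.dequeue_err_count);
            }
        }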
 287
 288 Is user function performance not as expected?
 289 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 290
 291 The custom worker function is shown in :numref:`dtg_distributor_worker`.
 292
 293 .. _dtg_distributor_worker:
 294
 295 .. figure:: img/dtg_distributor_worker.*
 296
 297    Custom worker function performance drops.
 298
 299 #. Performance issue isolation
 300
 301    * Functions running on CPU cores without context switches perform best.
 302      Identify the lcore with ``rte_lcore_id`` and the lcore index mapping to
 303      the CPU using ``rte_lcore_index``.
 304
 309    * Use ``rte_thread_get_affinity`` to isolate functions running on the same
 310      CPU core.
 311
 312 #. Configuration issue isolation
 313
 314    * Identify the core role (RTE, OFF, or SERVICE) using ``rte_eal_lcore_role``.
 315      Check that performance functions are mapped to run on the cores.
 316
 317    * For high-performance execution logic, ensure it runs on the correct NUMA
 318      node and on a non-master core.
 319
 320    * Analyze run logic with ``rte_dump_stack``, ``rte_dump_registers`` and
 321      ``rte_memdump`` for more insights.
 322
 323    * Make use of objdump to ensure the opcodes match the desired state.
 324
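     The lcore role, index, and NUMA mapping can be listed in one loop. This
     is a minimal sketch, assuming an initialized EAL; ``dbg_lcore_roles`` is
     a hypothetical name.

     .. code-block:: c

        #include <stdio.h>
        #include <rte_eal.h>
        #include <rte_lcore.h>

        /* Hypothetical helper: list every lcore that is not switched off. */
        static void
        dbg_lcore_roles(void)
        {
            unsigned int lcore_id;

            for (lcore_id = 0; lcore_id < RTE_MAX_LCORE; lcore_id++) {
                enum rte_lcore_role_t role = rte_eal_lcore_role(lcore_id);
                const char *name;

                if (role == ROLE_OFF)
                    continue;
                if (role == ROLE_RTE)
                    name = "RTE";
                else if (role == ROLE_SERVICE)
                    name = "SERVICE";
                else
                    name = "other";

                /* The index is -1 for lcores outside the EAL core list. */
                printf("lcore %u: index %d, socket %u, role %s\n", lcore_id,
                       rte_lcore_index((int)lcore_id),
                       rte_lcore_to_socket_id(lcore_id), name);
            }
        }

     Comparing the printed socket with the socket of the NIC and mempool shows
     whether a worker function runs on the intended NUMA node.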
 325
 326 Are the execution cycles for dynamic service functions not frequent?
 327 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 328
 329 Service functions on service cores are shown in :numref:`dtg_service`.
 330
 331 .. _dtg_service:
 332
 333 .. figure:: img/dtg_service.*
 334
 335    Functions running on service cores
 336
 337 #. Performance issue isolation
 338
 339    * For services configured for parallel execution,
 340      ``rte_service_lcore_count`` should be equal to
 341      ``rte_service_lcore_count_services``.
 342
 343    * A service that runs in parallel on all cores should report
 344      ``RTE_SERVICE_CAP_MT_SAFE`` for ``rte_service_probe_capability``, and
 345      ``rte_service_map_lcore_get`` should return a unique lcore.
 346
 347    * Check whether the execution cycles for dynamic service functions are
 348      infrequent.
 349
 350    * If services share the lcore, overall execution should fit the budget.
 351
 352 #. Configuration issue isolation
 353
 354    * Check if service is running with ``rte_service_runstate_get``.
 355
 356    * Generic debug via ``rte_service_dump``.
 357
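     The service checks above could be scripted as follows. This is a minimal
     sketch, assuming service identifiers run from 0 to
     ``rte_service_get_count() - 1``; ``dbg_service_check`` is a hypothetical
     name.

     .. code-block:: c

        #include <stdint.h>
        #include <stdio.h>
        #include <rte_service.h>

        /* Hypothetical helper: print run state and MT safety per service. */
        static void
        dbg_service_check(void)
        {
            uint32_t nb_services = rte_service_get_count();
            uint32_t id;

            printf("service lcores: %d\n", (int)rte_service_lcore_count());

            for (id = 0; id < nb_services; id++) {
                const char *name = rte_service_get_name(id);

                printf("service %u (%s): runstate %d, MT safe %d\n",
                       (unsigned int)id, name != NULL ? name : "unknown",
                       (int)rte_service_runstate_get(id),
                       (int)rte_service_probe_capability(id,
                               RTE_SERVICE_CAP_MT_SAFE));
            }

            /* UINT32_MAX requests a dump of all registered services. */
            rte_service_dump(stdout, UINT32_MAX);
        }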
 358
 359 Is there a bottleneck in the performance of eventdev?
 360 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 361
 362 #. Check for generic configuration
 363
 364    * Ensure the event devices created are on the right NUMA node using
 365      ``rte_event_dev_count`` and ``rte_event_dev_socket_id``.
 366
 367    * Check for event stages if the events are looped back into the same queue.
 368
 369    * If the failure is on the enqueue stage for events, check the queue depth
 370      with ``rte_event_dev_info_get``.
 371
 372 #. If there are performance drops in the enqueue stage
 373
 374    * Use ``rte_event_dev_dump`` to dump the eventdev information.
 375
 376    * Periodically check stats for queues and ports to identify starvation.
 377
 378    * Check the in-flight events for the desired queue for enqueue and dequeue.
 379
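     The generic eventdev checks could be gathered as shown here. This is a
     minimal sketch, assuming device identifiers run from 0 to
     ``rte_event_dev_count() - 1``; ``dbg_eventdev_check`` is a hypothetical
     name.

     .. code-block:: c

        #include <stdio.h>
        #include <rte_eventdev.h>

        /* Hypothetical helper: print placement and limits, then dump. */
        static void
        dbg_eventdev_check(void)
        {
            uint8_t nb_devs = rte_event_dev_count();
            uint8_t dev_id;

            for (dev_id = 0; dev_id < nb_devs; dev_id++) {
                struct rte_event_dev_info info;

                if (rte_event_dev_info_get(dev_id, &info) != 0)
                    continue;
                printf("eventdev %u (%s): socket %d, max queues %u, "
                       "max enqueue depth %u, max events %d\n",
                       (unsigned int)dev_id, info.driver_name,
                       rte_event_dev_socket_id(dev_id),
                       (unsigned int)info.max_event_queues,
                       (unsigned int)info.max_event_port_enqueue_depth,
                       (int)info.max_num_events);

                /* Driver-specific state, useful to spot starved ports. */
                rte_event_dev_dump(dev_id, stdout);
            }
        }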
 380
 381 Is there a variance in traffic manager?
 382 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 383
 384 The Traffic Manager on the TX interface is shown in :numref:`dtg_qos_tx`.
 385
 386 .. _dtg_qos_tx:
 387
 388 .. figure:: img/dtg_qos_tx.*
 389
 390    Traffic Manager just before TX.
 391
 392 #. Identify if the variance from expected behavior is due to insufficient CPU
 393    cycles. Use ``rte_tm_capabilities_get`` to fetch the features for
 394    hierarchies, WRED, and priority schedulers that can be offloaded to hardware.
 395
 396 #. Undesired flow drops can be narrowed down to WRED, priority, and rate
 397    limiters.
 398
 399 #. Isolate the flow in which the undesired drops occur. Use
 400    ``rte_tm_get_number_of_leaf_nodes`` and the flow table to pin down the leaf
 401    where drops occur.
 402
 403 #. Check stats with ``rte_tm_node_stats_update`` and ``rte_tm_node_stats_read``
 404    for drops in the hierarchy, schedulers, and WRED configurations.
 405
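     Pinning down the leaf where drops occur could look like this. This is a
     minimal sketch, assuming the port driver implements the traffic manager
     API and that leaf node identifiers match the TX queue identifiers;
     ``dbg_tm_leaf_drops`` is a hypothetical name.

     .. code-block:: c

        #include <inttypes.h>
        #include <stdio.h>
        #include <string.h>
        #include <rte_common.h>
        #include <rte_tm.h>

        /* Hypothetical helper: print packet and drop counts per leaf node. */
        static void
        dbg_tm_leaf_drops(uint16_t port_id)
        {
            struct rte_tm_error err;
            uint32_t n_leaf = 0;
            uint32_t node_id;

            if (rte_tm_get_number_of_leaf_nodes(port_id, &n_leaf, &err) != 0)
                return;

            for (node_id = 0; node_id < n_leaf; node_id++) {
                struct rte_tm_node_stats stats;
                uint64_t mask = 0;
                uint64_t dropped = 0;
                unsigned int c;

                memset(&stats, 0, sizeof(stats));
                /* The last argument before err is 0: do not clear counters. */
                if (rte_tm_node_stats_read(port_id, node_id, &stats, &mask,
                                           0, &err) != 0)
                    continue;

                /* Sum the drop counters over all packet colors. */
                for (c = 0; c < RTE_DIM(stats.leaf.n_pkts_dropped); c++)
                    dropped += stats.leaf.n_pkts_dropped[c];

                printf("leaf %u: packets %" PRIu64 ", dropped %" PRIu64 "\n",
                       (unsigned int)node_id, stats.n_pkts, dropped);
            }
        }

     Counters that the driver does not report stay at zero here, so the
     returned mask should be consulted before trusting a zero drop count.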
 406
 407 Is the packet not in the expected format?
 408 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 409
 410 Packet capture before and after processing is shown in :numref:`dtg_pdump`.
 411
 412 .. _dtg_pdump:
 413
 414 .. figure:: img/dtg_pdump.*
 415
 416    Capture points of Traffic at RX-TX.
 417
 418 #. To isolate possible packet corruption in the processing pipeline,
 419    carefully staged packet captures are to be implemented.
 420
 421    * First, isolate at NIC entry and exit.
 422
 423      Use pdump in the primary to allow the secondary to access the port-queue
 424      pair. The packets get copied over in the RX|TX callback by the secondary
 425      process using ring buffers.
 426
 427    * Second, isolate at pipeline entry and exit.
 428
 429      Use hooks or callbacks in the middle of the pipeline stage to copy the
 430      packets, which can be shared with the secondary debug process via
 431      user-defined custom rings.
 432
 433 .. note::
 434
 435    Use similar analysis for object and metadata corruption.
 436
 437
 438 Does the issue still persist?
 439 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 440
 441 The issue can be further narrowed down to the following causes.
 442
 443 #. If there is vendor or application specific metadata, check for errors due
 444    to metadata error flags. Dumping private metadata in the objects can give
 445    insight into details for debugging.
 446
 447 #. If multiple processes are used for either data or configuration, check for
 448    possible errors in the secondary process where the configuration fails and
 449    possible data corruption in the data plane.
 450
 451 #. Random drops in the RX or TX when starting other applications are an
 452    indication of the effect of a noisy neighbor. Try using the cache
 453    allocation technique to minimize the effect between applications.
 454
 455
 456 How to develop custom code to debug?
 457 ------------------------------------
 458
 459 #. For an application that runs as the primary process only, debug functionality
 460    is added in the same process. It can be invoked by a timer callback, a
 461    service core, or a signal handler.
 462
 463 #. For an application that runs as multiple processes, debug functionality is
 464    added in a standalone secondary process.
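
     For the primary-only case, a signal-triggered dump could be structured as
     shown here. This is a minimal sketch, assuming ``SIGUSR1`` is free for
     debug use; the ``dbg_*`` names are hypothetical. The handler only sets a
     flag, and the dump runs later from normal thread context.

     .. code-block:: c

        #include <inttypes.h>
        #include <signal.h>
        #include <stdio.h>
        #include <rte_ethdev.h>

        static volatile sig_atomic_t dbg_dump_requested;

        /* Keep the handler async-signal-safe: record the request only. */
        static void
        dbg_signal_handler(int signum)
        {
            (void)signum;
            dbg_dump_requested = 1;
        }

        /* Call once after rte_eal_init(). */
        static void
        dbg_install(void)
        {
            signal(SIGUSR1, dbg_signal_handler);
        }

        /* Call from a control thread or a slow-path loop iteration. */
        static void
        dbg_poll(uint16_t port_id)
        {
            struct rte_eth_stats stats;

            if (!dbg_dump_requested)
                return;
            dbg_dump_requested = 0;

            if (rte_eth_stats_get(port_id, &stats) == 0)
                printf("port %u: ipackets %" PRIu64 ", opackets %" PRIu64
                       ", imissed %" PRIu64 "\n", (unsigned int)port_id,
                       stats.ipackets, stats.opackets, stats.imissed);
        }

     Sending ``SIGUSR1`` to the process then produces one dump at the next
     call to ``dbg_poll``, without adding work to the fast path.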