..  SPDX-License-Identifier: BSD-3-Clause
    Copyright(c) 2018 Intel Corporation.

Debug & Troubleshoot guide
==========================

DPDK applications can be designed to have simple or complex pipeline processing
stages making use of single or multiple threads. Applications can also use poll
mode hardware devices, which helps in offloading CPU cycles. It is common to
find solutions designed with

* single or multiple primary processes

* single primary and single secondary

* single primary and multiple secondaries

In all the above cases, it is tedious to isolate, debug, and understand various
behaviors which occur randomly or periodically. The goal of the guide is to
consolidate a few commonly seen issues for reference, and then to identify the
root cause through step-by-step debugging at various stages.

.. note::

   It is difficult to cover all possible issues in a single attempt. With
   feedback and suggestions from the community, more cases can be covered.

Application Overview
--------------------

By making use of the application model as a reference, we can discuss multiple
causes of issues in the guide. Let us assume the sample makes use of a single
primary process, with various processing stages running on multiple cores. The
application may also make use of Poll Mode Drivers and libraries like service
cores, mempool, mbuf, eventdev, cryptodev, QoS, and ethdev.

The overview of an application modeled using PMD is shown in
:numref:`dtg_sample_app_model`.

.. _dtg_sample_app_model:

.. figure:: img/dtg_sample_app_model.*

   Overview of pipeline stage of an application

Bottleneck Analysis
-------------------

Factors that drive the design decisions could be the platform, scale factor,
and target. These distinct preferences lead to multiple combinations that are
built using PMDs and libraries of DPDK. The compiler, library mode, and
optimization flags, even when held constant, affect the application too.

Is there a mismatch in packet (received < desired) rate?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The RX port and associated core are shown in :numref:`dtg_rx_rate`.

.. _dtg_rx_rate:

.. figure:: img/dtg_rx_rate.*

   RX packet rate compared against received rate.

#. Is the configuration for RX set up correctly?

   * Identify if the port speed and duplex match the desired values with
     ``rte_eth_link_get``.

   * Check promiscuous mode if the drops do not occur for a unique MAC address
     with ``rte_eth_promiscuous_get``.

#. Is the drop isolated to certain NICs only?

   * Make use of ``rte_eth_stats_get`` to identify the cause of the drops.

   * If there are mbuf drops, check ``nb_desc`` for the RX descriptors, as it
     might not be sufficient for the application.

   * If ``rte_eth_stats_get`` shows drops on specific RX queues, ensure the RX
     lcore threads have enough cycles for ``rte_eth_rx_burst`` on the port
     queue pair.

   * If traffic is redirected to a specific port queue pair, ensure the RX
     lcore threads get enough cycles.

   * Check the RSS configuration with ``rte_eth_dev_rss_hash_conf_get`` if the
     spread is not even and is causing drops.

   * If the PMD stats are not updating, then there might be an offload or
     configuration which is dropping the incoming traffic.

#. Are drops still seen?

   * If there are multiple port queue pairs, it might be the RX thread, RX
     distributor, or event RX adapter not having enough cycles.

   * If there are drops seen for the RX adapter or RX distributor, try using
     ``rte_prefetch_non_temporal``, which informs the core that the mbuf in
     cache is temporary. A scripted first pass over these checks is sketched
     after this list.
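
Below is a minimal sketch, assuming a recent DPDK release, of a hypothetical
``check_rx_setup`` helper that reports the link state, promiscuous mode, and
the drop counters discussed above.

.. code-block:: c

   #include <inttypes.h>
   #include <stdio.h>
   #include <rte_ethdev.h>

   /* Hypothetical helper: report the RX-side state of one port. */
   static void
   check_rx_setup(uint16_t port_id)
   {
       struct rte_eth_link link;
       struct rte_eth_stats stats;
       int q;

       /* Speed and duplex should match the desired values. */
       if (rte_eth_link_get_nowait(port_id, &link) == 0)
           printf("port %u: %u Mbps, %s duplex, link %s\n", port_id,
                  link.link_speed,
                  link.link_duplex == RTE_ETH_LINK_FULL_DUPLEX ?
                          "full" : "half",
                  link.link_status ? "up" : "down");

       /* Promiscuous mode explains drops of frames to foreign MACs. */
       printf("port %u: promiscuous %d\n", port_id,
              rte_eth_promiscuous_get(port_id));

       /* imissed, rx_nombuf and q_errors[] point at descriptor or mbuf
        * shortages on specific queues. */
       if (rte_eth_stats_get(port_id, &stats) == 0) {
           printf("port %u: imissed %" PRIu64 " rx_nombuf %" PRIu64 "\n",
                  port_id, stats.imissed, stats.rx_nombuf);
           for (q = 0; q < RTE_ETHDEV_QUEUE_STAT_CNTRS; q++)
               if (stats.q_errors[q] != 0)
                   printf("  rxq %d: q_errors %" PRIu64 "\n",
                          q, stats.q_errors[q]);
       }
   }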

Are there packet drops at receive or transmit?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The RX-TX port and associated cores are shown in :numref:`dtg_rx_tx_drop`.

.. _dtg_rx_tx_drop:

.. figure:: img/dtg_rx_tx_drop.*

   RX-TX drops

#. At RX

   * Identify if there are multiple RX queues configured for the port by
     checking ``nb_rx_queues`` using ``rte_eth_dev_info_get``.

   * If ``rte_eth_stats_get`` shows drops in ``q_errors``, check if the RX
     thread is configured to fetch packets from the port queue pair.

   * If ``rte_eth_stats_get`` shows drops in ``rx_nombuf``, check if the RX
     thread has enough cycles to consume the packets from the queue.

#. At TX

   * If the TX rate is falling behind the application fill rate, identify if
     there are enough descriptors with ``rte_eth_dev_info_get`` for TX.

   * Check if ``nb_pkts`` in ``rte_eth_tx_burst`` covers multiple packets, and
     handle partially accepted bursts, as in the sketch after this list.

   * Check if ``rte_eth_tx_burst`` invokes the vector function call for the
     PMD.

   * If ``oerrors`` are getting incremented, TX packet validations are
     failing. Check if there are queue-specific offload failures.

   * If the drops occur for large size packets, check the MTU and
     multi-segment support configured for the NIC.
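
Partial TX bursts are a common source of silent drops: ``rte_eth_tx_burst``
may accept fewer than ``nb_pkts`` packets when the TX descriptor ring is
short. A minimal retry-loop sketch follows; the drop accounting is left as a
placeholder, and ``rte_pktmbuf_free_bulk`` assumes DPDK 19.11 or later.

.. code-block:: c

   #include <rte_ethdev.h>
   #include <rte_mbuf.h>

   /* Resubmit packets the PMD did not accept instead of leaking them. */
   static inline void
   send_burst(uint16_t port_id, uint16_t queue_id,
              struct rte_mbuf **pkts, uint16_t nb_pkts)
   {
       uint16_t sent = 0;

       while (sent < nb_pkts) {
           uint16_t n = rte_eth_tx_burst(port_id, queue_id,
                                         pkts + sent, nb_pkts - sent);

           if (n == 0) {
               /* TX ring still full: free the remainder and account
                * the drops, or back off and retry. */
               rte_pktmbuf_free_bulk(pkts + sent, nb_pkts - sent);
               break;
           }
           sent += n;
       }
   }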

Are there object drops at the producer point for the ring library?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The producer point for the ring is shown in :numref:`dtg_producer_ring`.

.. _dtg_producer_ring:

.. figure:: img/dtg_producer_ring.*

   Producer point for Rings

#. Performance issue isolation at the producer

   * Use ``rte_ring_dump`` to validate that the single producer flag is set
     to '1'.

   * There should be sufficient ``rte_ring_free_count`` at any point in time;
     a probe for this is sketched after this list.

   * Extreme stalls in the dequeue stage of the pipeline will cause
     ``rte_ring_full`` to be true.
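
The producer-side checks can be combined into a small probe, as in the sketch
below; ``objs`` and ``n`` stand for whatever batch the enqueue stage produces.

.. code-block:: c

   #include <stdio.h>
   #include <rte_ring.h>

   /* Probe back-pressure at the producer point of a ring. */
   static void
   probe_ring_producer(struct rte_ring *r, void **objs, unsigned int n)
   {
       unsigned int enq;

       /* A persistently low free count means the dequeue stage stalls. */
       if (rte_ring_free_count(r) < n)
           printf("ring %s: only %u free slots\n",
                  r->name, rte_ring_free_count(r));

       enq = rte_ring_enqueue_burst(r, objs, n, NULL);
       if (enq < n)
           printf("ring %s: %u of %u objects dropped at enqueue\n",
                  r->name, n - enq, n);

       /* The dump also shows whether the single-producer flag is set. */
       if (rte_ring_full(r))
           rte_ring_dump(stdout, r);
   }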

Are there object drops at the consumer point for the ring library?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The consumer point for the ring is shown in :numref:`dtg_consumer_ring`.

.. _dtg_consumer_ring:

.. figure:: img/dtg_consumer_ring.*

   Consumer point for Rings

#. Performance issue isolation at the consumer

   * Use ``rte_ring_dump`` to validate that the single consumer flag is set
     to '1'.

   * If the desired burst dequeue falls behind the actual dequeue, the
     enqueue stage is not filling up the ring as required; a probe for this
     is sketched after this list.

   * Extreme stalls in the enqueue will lead to ``rte_ring_empty`` being
     true.
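
Symmetrically, the consumer point can be probed as below; ``BURST`` is an
illustrative batch size.

.. code-block:: c

   #include <stdio.h>
   #include <rte_ring.h>

   #define BURST 32

   /* Probe starvation at the consumer point of a ring. */
   static void
   probe_ring_consumer(struct rte_ring *r)
   {
       void *objs[BURST];
       unsigned int deq = rte_ring_dequeue_burst(r, objs, BURST, NULL);

       /* An empty ring after a short dequeue means the enqueue stage is
        * not filling the ring as required. */
       if (deq < BURST && rte_ring_empty(r))
           printf("ring %s: dequeued %u of %u, producer lagging\n",
                  r->name, deq, BURST);

       /* ... process and release the deq objects here ... */
   }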

Is there a variance in packet or object processing rate in the pipeline?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Memory objects close to NUMA are shown in :numref:`dtg_mempool`.

.. _dtg_mempool:

.. figure:: img/dtg_mempool.*

   Memory objects have to be close to the device per NUMA.

#. Stalls in the processing pipeline can be attributed to MBUF release delays.
   These can be narrowed down to

   * Heavy processing cycles at single or multiple processing stages.

   * Cache usage is spread due to the increased stages in the pipeline.

   * The CPU thread responsible for TX is not able to keep up with the burst
     of traffic.

   * Extra cycles to linearize multi-segment buffers and software offloads
     like checksum, TSO, and VLAN strip.

   * Packet buffer copies in the fast path also result in stalls in MBUF
     release if not done selectively.

   * Application logic sets ``rte_mbuf_refcnt_set`` to higher than the
     desired value, frequently uses ``rte_pktmbuf_prefree_seg``, and does not
     release the MBUF back to the mempool.

#. Lower performance between the pipeline processing stages can be due to

   * The NUMA instance for packets or objects from the NIC, mempool, and ring
     not being the same.

   * Drops on a specific socket due to insufficient objects in the pool. Use
     ``rte_mempool_avail_count`` or ``rte_mempool_in_use_count`` to monitor
     the pool when the drops occur.

   * Missing prefetches: try prefetching the content in the processing
     pipeline logic to minimize the stalls.

#. Performance issues can be due to special cases

   * Check if the MBUF is contiguous with ``rte_pktmbuf_is_contiguous``, as
     certain offloads require the same.

   * Use ``rte_mempool_cache_create`` for user threads that require access to
     mempool objects; a NUMA-aware pool setup is sketched after this list.

   * If the variance is absent for larger huge pages, then try
     ``rte_mem_lock_page`` on the objects, packets, and lookup tables to
     isolate the issue.
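
A NUMA-aware pool setup plus a depletion monitor is sketched below; the pool
name, size, and cache size are illustrative values only.

.. code-block:: c

   #include <stdio.h>
   #include <rte_ethdev.h>
   #include <rte_mbuf.h>
   #include <rte_mempool.h>

   /* Create the packet pool on the same NUMA node as the port. */
   static struct rte_mempool *
   create_pool_near_port(uint16_t port_id)
   {
       int socket = rte_eth_dev_socket_id(port_id);

       return rte_pktmbuf_pool_create("rx_pool", 8192, 256, 0,
                                      RTE_MBUF_DEFAULT_BUF_SIZE, socket);
   }

   /* A pool that stays nearly exhausted leads to rx_nombuf drops. */
   static void
   watch_pool(const struct rte_mempool *mp)
   {
       printf("pool %s: avail %u, in use %u\n", mp->name,
              rte_mempool_avail_count(mp), rte_mempool_in_use_count(mp));
   }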

Is there a variance in cryptodev performance?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The crypto device and PMD are shown in :numref:`dtg_crypto`.

.. _dtg_crypto:

.. figure:: img/dtg_crypto.*

   CRYPTO and interaction with PMD device.

#. Performance issue isolation for enqueue

   * Ensure the cryptodev, its resources, and the enqueue thread are running
     on the same NUMA node.

   * Isolate the cause of errors by checking ``err_count`` using
     ``rte_cryptodev_stats_get``.

   * Parallelize the enqueue thread across multiple queue pairs.

#. Performance issue isolation for dequeue

   * Ensure the cryptodev, its resources, and the dequeue thread are running
     on the same NUMA node.

   * Isolate the cause of errors by checking ``err_count`` using
     ``rte_cryptodev_stats_get``.

   * Parallelize the dequeue thread across multiple queue pairs.

#. Performance issue isolation for crypto operations

   * If cryptodev software-assist is in use, ensure the library is built with
     the right (SIMD) flags, or check if the queue pair is using the CPU ISA
     via the feature_flags AVX|SSE|NEON using ``rte_cryptodev_info_get``.

   * If cryptodev hardware-assist is in use, ensure both firmware and drivers
     are up to date.

#. Configuration issue isolation

   * Identify cryptodev instances with ``rte_cryptodev_count`` and
     ``rte_cryptodev_info_get``, as in the sketch below.
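
The configuration and error checks can be gathered into one pass over all
crypto devices, sketched below assuming a recent DPDK release for the
``RTE_CRYPTODEV_FF_*`` flag names.

.. code-block:: c

   #include <inttypes.h>
   #include <stdio.h>
   #include <rte_cryptodev.h>

   /* Report SIMD capabilities and error counters per crypto device. */
   static void
   check_cryptodevs(void)
   {
       uint8_t nb_devs = rte_cryptodev_count();
       uint8_t id;

       for (id = 0; id < nb_devs; id++) {
           struct rte_cryptodev_info info;
           struct rte_cryptodev_stats stats;

           rte_cryptodev_info_get(id, &info);
           printf("cryptodev %u (%s): socket %d AVX2 %d SSE %d NEON %d\n",
                  id, info.driver_name, rte_cryptodev_socket_id(id),
                  !!(info.feature_flags & RTE_CRYPTODEV_FF_CPU_AVX2),
                  !!(info.feature_flags & RTE_CRYPTODEV_FF_CPU_SSE),
                  !!(info.feature_flags & RTE_CRYPTODEV_FF_CPU_NEON));

           /* Non-zero err_count separates enqueue from dequeue failures. */
           if (rte_cryptodev_stats_get(id, &stats) == 0)
               printf("  enq_err %" PRIu64 " deq_err %" PRIu64 "\n",
                      stats.enqueue_err_count, stats.dequeue_err_count);
       }
   }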

Is the user function performance not as expected?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The custom worker function is shown in :numref:`dtg_distributor_worker`.

.. _dtg_distributor_worker:

.. figure:: img/dtg_distributor_worker.*

   Custom worker function performance drops.

#. Performance issue isolation

   * Functions running on CPU cores without context switches are the best
     performing scenarios. Identify the lcore with ``rte_lcore_id`` and the
     lcore-to-CPU index mapping with ``rte_lcore_index``.

   * Use ``rte_thread_get_affinity`` to isolate functions running on the same
     CPU core.

#. Configuration issue isolation

   * Identify the core role using ``rte_eal_lcore_role`` to identify RTE,
     OFF, SERVICE and NON_EAL. Check that performance functions are mapped to
     run on the correct cores; a layout dump is sketched after this list.

   * For high-performance execution logic, ensure it is running on the
     correct NUMA node and worker core.

   * Analyze the run logic with ``rte_dump_stack`` and ``rte_memdump`` for
     more insights.

   * Make use of objdump to ensure the opcode is matching the desired state.
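
A quick dump of the lcore layout, as sketched below, helps verify that hot
functions are mapped to the intended cores; it assumes a recent DPDK release
that defines ``ROLE_NON_EAL``.

.. code-block:: c

   #include <stdio.h>
   #include <rte_lcore.h>

   /* Print the role and index of every configured lcore. */
   static void
   dump_lcore_layout(void)
   {
       unsigned int lcore;

       for (lcore = 0; lcore < RTE_MAX_LCORE; lcore++) {
           enum rte_lcore_role_t role = rte_eal_lcore_role(lcore);

           if (role == ROLE_OFF)
               continue;
           printf("lcore %u: index %d role %s\n",
                  lcore, rte_lcore_index(lcore),
                  role == ROLE_RTE ? "RTE" :
                  role == ROLE_SERVICE ? "SERVICE" : "NON_EAL");
       }
   }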

Are the execution cycles for dynamic service functions not frequent?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Service functions on service cores are shown in :numref:`dtg_service`.

.. _dtg_service:

.. figure:: img/dtg_service.*

   Functions running on service cores

#. Performance issue isolation

   * Services configured for parallel execution should have
     ``rte_service_lcore_count`` equal to
     ``rte_service_lcore_count_services``.

   * A service meant to run in parallel on all cores should return
     ``RTE_SERVICE_CAP_MT_SAFE`` for ``rte_service_probe_capability``, and
     ``rte_service_map_lcore_get`` should return a unique lcore.

   * If the execution cycles for dynamic service functions are infrequent,
     verify the runstate and lcore mapping with the configuration checks
     below.

   * If services share the lcore, the overall execution should fit the
     budget.

#. Configuration issue isolation

   * Check if the service is running with ``rte_service_runstate_get``.

   * Generic debug via ``rte_service_dump``, as in the sketch below.
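
A per-service check might look like the sketch below; ``service_id`` would
come from ``rte_service_get_by_name`` for the service under investigation.

.. code-block:: c

   #include <stdio.h>
   #include <rte_service.h>

   /* Report the runstate and MT-safety, then dump the statistics. */
   static void
   check_service(uint32_t service_id)
   {
       printf("service %u: runstate %d, MT safe %d\n", service_id,
              rte_service_runstate_get(service_id),
              rte_service_probe_capability(service_id,
                                           RTE_SERVICE_CAP_MT_SAFE));

       /* Includes call counts and cycles when stats are enabled. */
       rte_service_dump(stdout, service_id);
   }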

Is there a bottleneck in the performance of eventdev?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

#. Check for generic configuration

   * Ensure the event devices created are on the right NUMA node, using
     ``rte_event_dev_count`` and ``rte_event_dev_socket_id``.

   * Check for event stages if the events are looped back into the same
     queue.

   * If the failure is on the enqueue stage for events, check the queue depth
     with ``rte_event_dev_info_get``.

#. If there are performance drops in the enqueue stage

   * Use ``rte_event_dev_dump`` to dump the eventdev information, as in the
     sketch after this list.

   * Periodically check the stats for queue and port to identify starvation.

   * Check the in-flight events for the desired queue for enqueue and
     dequeue.
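
A sweep over the event devices, sketched below, covers the NUMA and queue
depth checks and dumps the internal state for starvation analysis.

.. code-block:: c

   #include <stdio.h>
   #include <rte_eventdev.h>

   /* Report placement and limits for every event device. */
   static void
   check_eventdevs(void)
   {
       uint8_t nb_devs = rte_event_dev_count();
       uint8_t id;

       for (id = 0; id < nb_devs; id++) {
           struct rte_event_dev_info info;

           rte_event_dev_info_get(id, &info);
           printf("eventdev %u (%s): socket %d, max queues %u, "
                  "max in-flight events %d\n",
                  id, info.driver_name, rte_event_dev_socket_id(id),
                  info.max_event_queues, info.max_num_events);

           /* Ports, queues, and in-flight counts for this device. */
           rte_event_dev_dump(id, stdout);
       }
   }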

Is there a variance in traffic manager?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The Traffic Manager on the TX interface is shown in :numref:`dtg_qos_tx`.

.. _dtg_qos_tx:

.. figure:: img/dtg_qos_tx.*

   Traffic Manager just before TX.

#. Identify whether a variance from the expected behavior is due to
   insufficient CPU cycles. Use ``rte_tm_capabilities_get`` to fetch features
   for hierarchies, WRED, and priority schedulers which can be offloaded to
   hardware.

#. Undesired flow drops can be narrowed down to WRED, priority, and rate
   limiters.

#. Isolate the flow in which the undesired drops occur. Use
   ``rte_tm_get_number_of_leaf_nodes`` and the flow table to pin down the
   leaf node where the drops occur.

#. Check the stats using ``rte_tm_node_stats_update`` and
   ``rte_tm_node_stats_read`` for drops in the hierarchy, scheduler, and WRED
   configurations, as in the sketch below.
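
A minimal sketch of these capability and statistics reads follows;
``leaf_id`` is illustrative and would be iterated over the leaf nodes the
hierarchy created, the available fields depend on the driver, and the
``RTE_COLOR_GREEN`` index assumes a recent DPDK release.

.. code-block:: c

   #include <inttypes.h>
   #include <stdio.h>
   #include <rte_tm.h>

   /* Locate the dropping leaf node via per-node statistics. */
   static void
   check_tm_leaf(uint16_t port_id, uint32_t leaf_id)
   {
       struct rte_tm_error err;
       struct rte_tm_capabilities cap;
       struct rte_tm_node_stats stats;
       uint64_t mask;
       uint32_t n_leaves;

       if (rte_tm_capabilities_get(port_id, &cap, &err) == 0)
           printf("port %u TM: max levels %u\n", port_id, cap.n_levels_max);

       if (rte_tm_get_number_of_leaf_nodes(port_id, &n_leaves, &err) == 0)
           printf("port %u TM: %u leaf nodes\n", port_id, n_leaves);

       /* clear=0 keeps the counters accumulating across reads. */
       if (rte_tm_node_stats_read(port_id, leaf_id, &stats, &mask, 0,
                                  &err) == 0)
           printf("leaf %u: %" PRIu64 " pkts, %" PRIu64 " green drops\n",
                  leaf_id, stats.n_pkts,
                  stats.leaf.n_pkts_dropped[RTE_COLOR_GREEN]);
   }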

Is the packet in the unexpected format?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Packet capture before and after processing is shown in :numref:`dtg_pdump`.

.. _dtg_pdump:

.. figure:: img/dtg_pdump.*

   Capture points of Traffic at RX-TX.

#. To isolate possible packet corruption in the processing pipeline,
   carefully staged packet captures have to be implemented.

   * First, isolate at NIC entry and exit.

     Use pdump in the primary process to allow a secondary process to access
     the port-queue pair. The packets get copied over in the RX|TX callback
     by the secondary process using ring buffers.

   * Second, isolate at pipeline entry and exit.

     Using hooks or callbacks, capture the packets in the middle of the
     pipeline stage and copy them, so that they can be shared with the
     secondary debug process via user-defined custom rings; a capture hook is
     sketched below.
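
Such a capture hook might look like the sketch below; ``debug_ring`` and
``clone_pool`` are hypothetical objects the application would create at
startup, and the callback can be registered at NIC entry with
``rte_eth_add_rx_callback``.

.. code-block:: c

   #include <rte_ethdev.h>
   #include <rte_mbuf.h>
   #include <rte_ring.h>

   static struct rte_ring *debug_ring;      /* polled by debug process */
   static struct rte_mempool *clone_pool;   /* indirect mbufs for clones */

   /* RX callback: clone packets into the debug ring without disturbing
    * the fast path. Clones share the data buffer, so no payload copy. */
   static uint16_t
   capture_cb(uint16_t port, uint16_t queue, struct rte_mbuf **pkts,
              uint16_t nb_pkts, uint16_t max_pkts, void *user_param)
   {
       uint16_t i;

       RTE_SET_USED(port);
       RTE_SET_USED(queue);
       RTE_SET_USED(max_pkts);
       RTE_SET_USED(user_param);

       for (i = 0; i < nb_pkts; i++) {
           struct rte_mbuf *dup = rte_pktmbuf_clone(pkts[i], clone_pool);

           /* If the debug ring is full, drop the clone, not the packet. */
           if (dup != NULL && rte_ring_enqueue(debug_ring, dup) != 0)
               rte_pktmbuf_free(dup);
       }
       return nb_pkts;
   }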

.. note::

   Use similar analysis for objects and metadata corruption.

Does the issue still persist?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The issue can be further narrowed down to the following causes.

#. If there is vendor or application specific metadata, check for errors due
   to META data error flags. Dumping the private meta-data in the objects can
   give insight into details for debugging.

#. If there are multiple processes for either data or configuration, check
   for possible errors in the secondary process where the configuration
   fails, and for possible data corruption in the data plane.

#. Random drops in RX or TX when other applications are opened indicate the
   effect of a noisy neighbor. Try using the cache allocation technique to
   minimize the effect between applications.

How to develop a custom code to debug?
--------------------------------------

#. For an application that runs as the primary process only, debug
   functionality is added in the same process. This can be invoked by a timer
   call-back, a service core, or a signal handler, as in the sketch below.

#. For an application that runs as multiple processes, add the debug
   functionality in a standalone secondary process.
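
For the single-process case, a sketch of a signal-driven debug hook is shown
below. The dump functions are not async-signal-safe, so production code would
typically have the handler only set a flag that a service core or timer
callback acts on.

.. code-block:: c

   #include <signal.h>
   #include <stdio.h>
   #include <rte_mempool.h>
   #include <rte_ring.h>

   /* SIGUSR1 dumps mempool and ring state without stopping the app. */
   static void
   debug_signal_handler(int signum)
   {
       (void)signum;
       rte_mempool_list_dump(stdout);   /* every mempool: size, flags */
       rte_ring_list_dump(stdout);      /* every ring: head/tail state */
   }

   static void
   install_debug_hook(void)
   {
       signal(SIGUSR1, debug_signal_handler);
   }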