..  SPDX-License-Identifier: BSD-3-Clause
    Copyright(c) 2018 Intel Corporation.

Debug & Troubleshoot guide
==========================

DPDK applications can be designed to have simple or complex pipeline processing
stages making use of single or multiple threads. Applications can also use poll
mode hardware devices, which helps in offloading CPU cycles. It is common to
find solutions designed with

* single or multiple primary processes

* single primary and single secondary

* single primary and multiple secondaries

In all the above cases, it is tedious to isolate, debug, and understand various
behaviors which occur randomly or periodically. The goal of the guide is to
consolidate a few commonly seen issues for reference, and then to identify the
root cause through step-by-step debugging at various stages.

.. note::

   It is difficult to cover all possible issues in a single attempt. With
   feedback and suggestions from the community, more cases can be covered.

Application Overview
--------------------

By making use of the application model as a reference, we can discuss multiple
causes of issues in the guide. Let us assume the sample makes use of a single
primary process, with various processing stages running on multiple cores. The
application may also make use of Poll Mode Drivers and libraries like service
cores, mempool, mbuf, eventdev, cryptodev, QoS, and ethdev.

The overview of an application modeled using PMD is shown in
:numref:`dtg_sample_app_model`.

.. _dtg_sample_app_model:

.. figure:: img/dtg_sample_app_model.*

   Overview of pipeline stage of an application

Bottleneck Analysis
-------------------

Factors that drive the design decisions could be the platform, scale factor,
and target. These distinct preferences lead to multiple combinations that are
built using PMDs and libraries of DPDK. The compiler, library mode, and
optimization flags, even when held constant, affect the application too.

Is there a mismatch in packet (received < desired) rate?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The RX port and associated core are shown in :numref:`dtg_rx_rate`.

.. _dtg_rx_rate:

.. figure:: img/dtg_rx_rate.*

   RX packet rate compared against received rate.

#. Is the configuration for RX set up correctly?

   * Identify if the port speed and duplex match the desired values with
     ``rte_eth_link_get``.

   * Check promiscuous mode if the drops do not occur for a unique MAC address
     with ``rte_eth_promiscuous_get``.

#. Is the drop isolated to certain NICs only?

   * Make use of ``rte_eth_stats_get`` to identify the cause of the drops.

   * If there are mbuf drops, check ``nb_desc`` for the RX descriptors, as it
     might not be sufficient for the application.

   * If ``rte_eth_stats_get`` shows drops on specific RX queues, ensure the RX
     lcore threads have enough cycles for ``rte_eth_rx_burst`` on the port
     queue pair.

   * If traffic is redirected to a specific port queue pair, ensure the RX
     lcore threads get enough cycles.

   * Check the RSS configuration with ``rte_eth_dev_rss_hash_conf_get`` if the
     spread is not even and is causing drops.

   * If the PMD stats are not updating, then there might be an offload or
     configuration which is dropping the incoming traffic.

#. Are drops still seen?

   * If there are multiple port queue pairs, it might be the RX thread, RX
     distributor, or event RX adapter not having enough cycles.

   * If there are drops seen for the RX adapter or RX distributor, try using
     ``rte_prefetch_non_temporal``, which informs the core that the mbuf in
     cache is temporary. A scripted first pass over these checks is sketched
     after this list.
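
Below is a minimal sketch, assuming a recent DPDK release, of a hypothetical
``check_rx_setup`` helper that reports the link state, promiscuous mode, and
the drop counters discussed above.

.. code-block:: c

   #include <inttypes.h>
   #include <stdio.h>
   #include <rte_ethdev.h>

   /* Hypothetical helper: report the RX-side state of one port. */
   static void
   check_rx_setup(uint16_t port_id)
   {
       struct rte_eth_link link;
       struct rte_eth_stats stats;
       int q;

       /* Speed and duplex should match the desired values. */
       if (rte_eth_link_get_nowait(port_id, &link) == 0)
           printf("port %u: %u Mbps, %s duplex, link %s\n", port_id,
                  link.link_speed,
                  link.link_duplex == RTE_ETH_LINK_FULL_DUPLEX ?
                          "full" : "half",
                  link.link_status ? "up" : "down");

       /* Promiscuous mode explains drops of frames to foreign MACs. */
       printf("port %u: promiscuous %d\n", port_id,
              rte_eth_promiscuous_get(port_id));

       /* imissed, rx_nombuf and q_errors[] point at descriptor or mbuf
        * shortages on specific queues. */
       if (rte_eth_stats_get(port_id, &stats) == 0) {
           printf("port %u: imissed %" PRIu64 " rx_nombuf %" PRIu64 "\n",
                  port_id, stats.imissed, stats.rx_nombuf);
           for (q = 0; q < RTE_ETHDEV_QUEUE_STAT_CNTRS; q++)
               if (stats.q_errors[q] != 0)
                   printf("  rxq %d: q_errors %" PRIu64 "\n",
                          q, stats.q_errors[q]);
       }
   }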

Are there packet drops at receive or transmit?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The RX-TX port and associated cores are shown in :numref:`dtg_rx_tx_drop`.

.. _dtg_rx_tx_drop:

.. figure:: img/dtg_rx_tx_drop.*

   RX-TX drops

#. At RX

   * Identify if there are multiple RX queues configured for the port by
     checking ``nb_rx_queues`` using ``rte_eth_dev_info_get``.

   * If ``rte_eth_stats_get`` shows drops in ``q_errors``, check if the RX
     thread is configured to fetch packets from the port queue pair.

   * If ``rte_eth_stats_get`` shows drops in ``rx_nombuf``, check if the RX
     thread has enough cycles to consume the packets from the queue.

#. At TX

   * If the TX rate is falling behind the application fill rate, identify if
     there are enough descriptors with ``rte_eth_dev_info_get`` for TX.

   * Check if ``nb_pkts`` in ``rte_eth_tx_burst`` covers multiple packets, and
     handle partially accepted bursts, as in the sketch after this list.

   * Check if ``rte_eth_tx_burst`` invokes the vector function call for the
     PMD.

   * If ``oerrors`` are getting incremented, TX packet validations are
     failing. Check if there are queue-specific offload failures.

   * If the drops occur for large size packets, check the MTU and
     multi-segment support configured for the NIC.
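
Partial TX bursts are a common source of silent drops: ``rte_eth_tx_burst``
may accept fewer than ``nb_pkts`` packets when the TX descriptor ring is
short. A minimal retry-loop sketch follows; the drop accounting is left as a
placeholder, and ``rte_pktmbuf_free_bulk`` assumes DPDK 19.11 or later.

.. code-block:: c

   #include <rte_ethdev.h>
   #include <rte_mbuf.h>

   /* Resubmit packets the PMD did not accept instead of leaking them. */
   static inline void
   send_burst(uint16_t port_id, uint16_t queue_id,
              struct rte_mbuf **pkts, uint16_t nb_pkts)
   {
       uint16_t sent = 0;

       while (sent < nb_pkts) {
           uint16_t n = rte_eth_tx_burst(port_id, queue_id,
                                         pkts + sent, nb_pkts - sent);

           if (n == 0) {
               /* TX ring still full: free the remainder and account
                * the drops, or back off and retry. */
               rte_pktmbuf_free_bulk(pkts + sent, nb_pkts - sent);
               break;
           }
           sent += n;
       }
   }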

Are there object drops at the producer point for the ring library?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The producer point for the ring is shown in :numref:`dtg_producer_ring`.

.. _dtg_producer_ring:

.. figure:: img/dtg_producer_ring.*

   Producer point for Rings

#. Performance issue isolation at the producer

   * Use ``rte_ring_dump`` to validate that the single producer flag is set
     to '1'.

   * There should be sufficient ``rte_ring_free_count`` at any point in time;
     a probe for this is sketched after this list.

   * Extreme stalls in the dequeue stage of the pipeline will cause
     ``rte_ring_full`` to be true.
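
The producer-side checks can be combined into a small probe, as in the sketch
below; ``objs`` and ``n`` stand for whatever batch the enqueue stage produces.

.. code-block:: c

   #include <stdio.h>
   #include <rte_ring.h>

   /* Probe back-pressure at the producer point of a ring. */
   static void
   probe_ring_producer(struct rte_ring *r, void **objs, unsigned int n)
   {
       unsigned int enq;

       /* A persistently low free count means the dequeue stage stalls. */
       if (rte_ring_free_count(r) < n)
           printf("ring %s: only %u free slots\n",
                  r->name, rte_ring_free_count(r));

       enq = rte_ring_enqueue_burst(r, objs, n, NULL);
       if (enq < n)
           printf("ring %s: %u of %u objects dropped at enqueue\n",
                  r->name, n - enq, n);

       /* The dump also shows whether the single-producer flag is set. */
       if (rte_ring_full(r))
           rte_ring_dump(stdout, r);
   }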

Are there object drops at the consumer point for the ring library?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The consumer point for the ring is shown in :numref:`dtg_consumer_ring`.

.. _dtg_consumer_ring:

.. figure:: img/dtg_consumer_ring.*

   Consumer point for Rings

#. Performance issue isolation at the consumer

   * Use ``rte_ring_dump`` to validate that the single consumer flag is set
     to '1'.

   * If the desired burst dequeue falls behind the actual dequeue, the
     enqueue stage is not filling up the ring as required; a probe for this
     is sketched after this list.

   * Extreme stalls in the enqueue will lead to ``rte_ring_empty`` being
     true.
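
Symmetrically, the consumer point can be probed as below; ``BURST`` is an
illustrative batch size.

.. code-block:: c

   #include <stdio.h>
   #include <rte_ring.h>

   #define BURST 32

   /* Probe starvation at the consumer point of a ring. */
   static void
   probe_ring_consumer(struct rte_ring *r)
   {
       void *objs[BURST];
       unsigned int deq = rte_ring_dequeue_burst(r, objs, BURST, NULL);

       /* An empty ring after a short dequeue means the enqueue stage is
        * not filling the ring as required. */
       if (deq < BURST && rte_ring_empty(r))
           printf("ring %s: dequeued %u of %u, producer lagging\n",
                  r->name, deq, BURST);

       /* ... process and release the deq objects here ... */
   }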

Is there a variance in packet or object processing rate in the pipeline?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Memory objects close to NUMA are shown in :numref:`dtg_mempool`.

.. _dtg_mempool:

.. figure:: img/dtg_mempool.*

   Memory objects have to be close to the device per NUMA.

#. Stalls in the processing pipeline can be attributed to MBUF release delays.
   These can be narrowed down to

   * Heavy processing cycles at single or multiple processing stages.

   * Cache usage is spread due to the increased stages in the pipeline.

   * The CPU thread responsible for TX is not able to keep up with the burst
     of traffic.

   * Extra cycles to linearize multi-segment buffers and software offloads
     like checksum, TSO, and VLAN strip.

   * Packet buffer copies in the fast path also result in stalls in MBUF
     release if not done selectively.

   * Application logic sets ``rte_mbuf_refcnt_set`` to higher than the
     desired value, frequently uses ``rte_pktmbuf_prefree_seg``, and does not
     release the MBUF back to the mempool.

#. Lower performance between the pipeline processing stages can be due to

   * The NUMA instance for packets or objects from the NIC, mempool, and ring
     not being the same.

   * Drops on a specific socket due to insufficient objects in the pool. Use
     ``rte_mempool_avail_count`` or ``rte_mempool_in_use_count`` to monitor
     the pool when the drops occur.

   * Missing prefetches: try prefetching the content in the processing
     pipeline logic to minimize the stalls.

#. Performance issues can be due to special cases

   * Check if the MBUF is contiguous with ``rte_pktmbuf_is_contiguous``, as
     certain offloads require the same.

   * Use ``rte_mempool_cache_create`` for user threads that require access to
     mempool objects; a NUMA-aware pool setup is sketched after this list.

   * If the variance is absent for larger huge pages, then try
     ``rte_mem_lock_page`` on the objects, packets, and lookup tables to
     isolate the issue.
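
A NUMA-aware pool setup plus a depletion monitor is sketched below; the pool
name, size, and cache size are illustrative values only.

.. code-block:: c

   #include <stdio.h>
   #include <rte_ethdev.h>
   #include <rte_mbuf.h>
   #include <rte_mempool.h>

   /* Create the packet pool on the same NUMA node as the port. */
   static struct rte_mempool *
   create_pool_near_port(uint16_t port_id)
   {
       int socket = rte_eth_dev_socket_id(port_id);

       return rte_pktmbuf_pool_create("rx_pool", 8192, 256, 0,
                                      RTE_MBUF_DEFAULT_BUF_SIZE, socket);
   }

   /* A pool that stays nearly exhausted leads to rx_nombuf drops. */
   static void
   watch_pool(const struct rte_mempool *mp)
   {
       printf("pool %s: avail %u, in use %u\n", mp->name,
              rte_mempool_avail_count(mp), rte_mempool_in_use_count(mp));
   }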

Is there a variance in cryptodev performance?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The crypto device and PMD are shown in :numref:`dtg_crypto`.

.. _dtg_crypto:

.. figure:: img/dtg_crypto.*

   CRYPTO and interaction with PMD device.

#. Performance issue isolation for enqueue

   * Ensure the cryptodev, its resources, and the enqueue thread are running
     on the same NUMA node.

   * Isolate the cause of errors by checking ``err_count`` using
     ``rte_cryptodev_stats_get``.

   * Parallelize the enqueue thread across multiple queue pairs.

#. Performance issue isolation for dequeue

   * Ensure the cryptodev, its resources, and the dequeue thread are running
     on the same NUMA node.

   * Isolate the cause of errors by checking ``err_count`` using
     ``rte_cryptodev_stats_get``.

   * Parallelize the dequeue thread across multiple queue pairs.

#. Performance issue isolation for crypto operations

   * If cryptodev software-assist is in use, ensure the library is built with
     the right (SIMD) flags, or check if the queue pair is using the CPU ISA
     via the feature_flags AVX|SSE|NEON using ``rte_cryptodev_info_get``.

   * If cryptodev hardware-assist is in use, ensure both firmware and drivers
     are up to date.

#. Configuration issue isolation

   * Identify cryptodev instances with ``rte_cryptodev_count`` and
     ``rte_cryptodev_info_get``, as in the sketch below.
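
The configuration and error checks can be gathered into one pass over all
crypto devices, sketched below assuming a recent DPDK release for the
``RTE_CRYPTODEV_FF_*`` flag names.

.. code-block:: c

   #include <inttypes.h>
   #include <stdio.h>
   #include <rte_cryptodev.h>

   /* Report SIMD capabilities and error counters per crypto device. */
   static void
   check_cryptodevs(void)
   {
       uint8_t nb_devs = rte_cryptodev_count();
       uint8_t id;

       for (id = 0; id < nb_devs; id++) {
           struct rte_cryptodev_info info;
           struct rte_cryptodev_stats stats;

           rte_cryptodev_info_get(id, &info);
           printf("cryptodev %u (%s): socket %d AVX2 %d SSE %d NEON %d\n",
                  id, info.driver_name, rte_cryptodev_socket_id(id),
                  !!(info.feature_flags & RTE_CRYPTODEV_FF_CPU_AVX2),
                  !!(info.feature_flags & RTE_CRYPTODEV_FF_CPU_SSE),
                  !!(info.feature_flags & RTE_CRYPTODEV_FF_CPU_NEON));

           /* Non-zero err_count separates enqueue from dequeue failures. */
           if (rte_cryptodev_stats_get(id, &stats) == 0)
               printf("  enq_err %" PRIu64 " deq_err %" PRIu64 "\n",
                      stats.enqueue_err_count, stats.dequeue_err_count);
       }
   }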

Is the user function performance not as expected?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The custom worker function is shown in :numref:`dtg_distributor_worker`.

.. _dtg_distributor_worker:

.. figure:: img/dtg_distributor_worker.*

   Custom worker function performance drops.

#. Performance issue isolation

   * Functions running on CPU cores without context switches are the best
     performing scenarios. Identify the lcore with ``rte_lcore_id`` and the
     lcore-to-CPU index mapping with ``rte_lcore_index``.

   * Use ``rte_thread_get_affinity`` to isolate functions running on the same
     CPU core.

#. Configuration issue isolation

   * Identify the core role using ``rte_eal_lcore_role`` to identify RTE,
     OFF, SERVICE and NON_EAL. Check that performance functions are mapped to
     run on the correct cores; a layout dump is sketched after this list.

   * For high-performance execution logic, ensure it is running on the
     correct NUMA node and worker core.

   * Analyze the run logic with ``rte_dump_stack`` and ``rte_memdump`` for
     more insights.

   * Make use of objdump to ensure the opcode is matching the desired state.
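
A quick dump of the lcore layout, as sketched below, helps verify that hot
functions are mapped to the intended cores; it assumes a recent DPDK release
that defines ``ROLE_NON_EAL``.

.. code-block:: c

   #include <stdio.h>
   #include <rte_lcore.h>

   /* Print the role and index of every configured lcore. */
   static void
   dump_lcore_layout(void)
   {
       unsigned int lcore;

       for (lcore = 0; lcore < RTE_MAX_LCORE; lcore++) {
           enum rte_lcore_role_t role = rte_eal_lcore_role(lcore);

           if (role == ROLE_OFF)
               continue;
           printf("lcore %u: index %d role %s\n",
                  lcore, rte_lcore_index(lcore),
                  role == ROLE_RTE ? "RTE" :
                  role == ROLE_SERVICE ? "SERVICE" : "NON_EAL");
       }
   }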

Are the execution cycles for dynamic service functions not frequent?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Service functions on service cores are shown in :numref:`dtg_service`.

.. _dtg_service:

.. figure:: img/dtg_service.*

   Functions running on service cores

#. Performance issue isolation

   * Services configured for parallel execution should have
     ``rte_service_lcore_count`` equal to
     ``rte_service_lcore_count_services``.

   * A service meant to run in parallel on all cores should return
     ``RTE_SERVICE_CAP_MT_SAFE`` for ``rte_service_probe_capability``, and
     ``rte_service_map_lcore_get`` should return a unique lcore.

   * If the execution cycles for dynamic service functions are infrequent,
     verify the runstate and lcore mapping with the configuration checks
     below.

   * If services share the lcore, the overall execution should fit the
     budget.

#. Configuration issue isolation

   * Check if the service is running with ``rte_service_runstate_get``.

   * Generic debug via ``rte_service_dump``, as in the sketch below.
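
A per-service check might look like the sketch below; ``service_id`` would
come from ``rte_service_get_by_name`` for the service under investigation.

.. code-block:: c

   #include <stdio.h>
   #include <rte_service.h>

   /* Report the runstate and MT-safety, then dump the statistics. */
   static void
   check_service(uint32_t service_id)
   {
       printf("service %u: runstate %d, MT safe %d\n", service_id,
              rte_service_runstate_get(service_id),
              rte_service_probe_capability(service_id,
                                           RTE_SERVICE_CAP_MT_SAFE));

       /* Includes call counts and cycles when stats are enabled. */
       rte_service_dump(stdout, service_id);
   }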

Is there a bottleneck in the performance of eventdev?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

#. Check for generic configuration

   * Ensure the event devices created are on the right NUMA node, using
     ``rte_event_dev_count`` and ``rte_event_dev_socket_id``.

   * Check for event stages if the events are looped back into the same
     queue.

   * If the failure is on the enqueue stage for events, check the queue depth
     with ``rte_event_dev_info_get``.

#. If there are performance drops in the enqueue stage

   * Use ``rte_event_dev_dump`` to dump the eventdev information, as in the
     sketch after this list.

   * Periodically check the stats for queue and port to identify starvation.

   * Check the in-flight events for the desired queue for enqueue and
     dequeue.
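
A sweep over the event devices, sketched below, covers the NUMA and queue
depth checks and dumps the internal state for starvation analysis.

.. code-block:: c

   #include <stdio.h>
   #include <rte_eventdev.h>

   /* Report placement and limits for every event device. */
   static void
   check_eventdevs(void)
   {
       uint8_t nb_devs = rte_event_dev_count();
       uint8_t id;

       for (id = 0; id < nb_devs; id++) {
           struct rte_event_dev_info info;

           rte_event_dev_info_get(id, &info);
           printf("eventdev %u (%s): socket %d, max queues %u, "
                  "max in-flight events %d\n",
                  id, info.driver_name, rte_event_dev_socket_id(id),
                  info.max_event_queues, info.max_num_events);

           /* Ports, queues, and in-flight counts for this device. */
           rte_event_dev_dump(id, stdout);
       }
   }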

Is there a variance in traffic manager?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The Traffic Manager on the TX interface is shown in :numref:`dtg_qos_tx`.

.. _dtg_qos_tx:

.. figure:: img/dtg_qos_tx.*

   Traffic Manager just before TX.

#. Identify whether a variance from the expected behavior is due to
   insufficient CPU cycles. Use ``rte_tm_capabilities_get`` to fetch features
   for hierarchies, WRED, and priority schedulers which can be offloaded to
   hardware.

#. Undesired flow drops can be narrowed down to WRED, priority, and rate
   limiters.

#. Isolate the flow in which the undesired drops occur. Use
   ``rte_tm_get_number_of_leaf_nodes`` and the flow table to pin down the
   leaf node where the drops occur.

#. Check the stats using ``rte_tm_node_stats_update`` and
   ``rte_tm_node_stats_read`` for drops in the hierarchy, scheduler, and WRED
   configurations, as in the sketch below.
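
A minimal sketch of these capability and statistics reads follows;
``leaf_id`` is illustrative and would be iterated over the leaf nodes the
hierarchy created, the available fields depend on the driver, and the
``RTE_COLOR_GREEN`` index assumes a recent DPDK release.

.. code-block:: c

   #include <inttypes.h>
   #include <stdio.h>
   #include <rte_tm.h>

   /* Locate the dropping leaf node via per-node statistics. */
   static void
   check_tm_leaf(uint16_t port_id, uint32_t leaf_id)
   {
       struct rte_tm_error err;
       struct rte_tm_capabilities cap;
       struct rte_tm_node_stats stats;
       uint64_t mask;
       uint32_t n_leaves;

       if (rte_tm_capabilities_get(port_id, &cap, &err) == 0)
           printf("port %u TM: max levels %u\n", port_id, cap.n_levels_max);

       if (rte_tm_get_number_of_leaf_nodes(port_id, &n_leaves, &err) == 0)
           printf("port %u TM: %u leaf nodes\n", port_id, n_leaves);

       /* clear=0 keeps the counters accumulating across reads. */
       if (rte_tm_node_stats_read(port_id, leaf_id, &stats, &mask, 0,
                                  &err) == 0)
           printf("leaf %u: %" PRIu64 " pkts, %" PRIu64 " green drops\n",
                  leaf_id, stats.n_pkts,
                  stats.leaf.n_pkts_dropped[RTE_COLOR_GREEN]);
   }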

Is the packet in the unexpected format?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Packet capture before and after processing is shown in :numref:`dtg_pdump`.

.. _dtg_pdump:

.. figure:: img/dtg_pdump.*

   Capture points of Traffic at RX-TX.

#. To isolate possible packet corruption in the processing pipeline,
   carefully staged packet captures have to be implemented.

   * First, isolate at NIC entry and exit.

     Use pdump in the primary process to allow a secondary process to access
     the port-queue pair. The packets get copied over in the RX|TX callback
     by the secondary process using ring buffers.

   * Second, isolate at pipeline entry and exit.

     Using hooks or callbacks, capture the packets in the middle of the
     pipeline stage and copy them, so that they can be shared with the
     secondary debug process via user-defined custom rings; a capture hook is
     sketched below.
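
Such a capture hook might look like the sketch below; ``debug_ring`` and
``clone_pool`` are hypothetical objects the application would create at
startup, and the callback can be registered at NIC entry with
``rte_eth_add_rx_callback``.

.. code-block:: c

   #include <rte_ethdev.h>
   #include <rte_mbuf.h>
   #include <rte_ring.h>

   static struct rte_ring *debug_ring;      /* polled by debug process */
   static struct rte_mempool *clone_pool;   /* indirect mbufs for clones */

   /* RX callback: clone packets into the debug ring without disturbing
    * the fast path. Clones share the data buffer, so no payload copy. */
   static uint16_t
   capture_cb(uint16_t port, uint16_t queue, struct rte_mbuf **pkts,
              uint16_t nb_pkts, uint16_t max_pkts, void *user_param)
   {
       uint16_t i;

       RTE_SET_USED(port);
       RTE_SET_USED(queue);
       RTE_SET_USED(max_pkts);
       RTE_SET_USED(user_param);

       for (i = 0; i < nb_pkts; i++) {
           struct rte_mbuf *dup = rte_pktmbuf_clone(pkts[i], clone_pool);

           /* If the debug ring is full, drop the clone, not the packet. */
           if (dup != NULL && rte_ring_enqueue(debug_ring, dup) != 0)
               rte_pktmbuf_free(dup);
       }
       return nb_pkts;
   }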

.. note::

   Use similar analysis for objects and metadata corruption.

Does the issue still persist?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The issue can be further narrowed down to the following causes.

#. If there is vendor or application specific metadata, check for errors due
   to META data error flags. Dumping the private meta-data in the objects can
   give insight into details for debugging.

#. If there are multiple processes for either data or configuration, check
   for possible errors in the secondary process where the configuration
   fails, and for possible data corruption in the data plane.

#. Random drops in RX or TX when other applications are opened indicate the
   effect of a noisy neighbor. Try using the cache allocation technique to
   minimize the effect between applications.

How to develop a custom code to debug?
--------------------------------------

#. For an application that runs as the primary process only, debug
   functionality is added in the same process. This can be invoked by a timer
   call-back, a service core, or a signal handler, as in the sketch below.

#. For an application that runs as multiple processes, add the debug
   functionality in a standalone secondary process.
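
For the single-process case, a sketch of a signal-driven debug hook is shown
below. The dump functions are not async-signal-safe, so production code would
typically have the handler only set a flag that a service core or timer
callback acts on.

.. code-block:: c

   #include <signal.h>
   #include <stdio.h>
   #include <rte_mempool.h>
   #include <rte_ring.h>

   /* SIGUSR1 dumps mempool and ring state without stopping the app. */
   static void
   debug_signal_handler(int signum)
   {
       (void)signum;
       rte_mempool_list_dump(stdout);   /* every mempool: size, flags */
       rte_ring_list_dump(stdout);      /* every ring: head/tail state */
   }

   static void
   install_debug_hook(void)
   {
       signal(SIGUSR1, debug_signal_handler);
   }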