doc/guides/prog_guide/writing_efficient_code.rst

..  SPDX-License-Identifier: BSD-3-Clause
    Copyright(c) 2010-2014 Intel Corporation.

Writing Efficient Code
======================

This chapter provides some tips for developing efficient code using the DPDK.
For additional and more general information,
please refer to the *Intel® 64 and IA-32 Architectures Optimization Reference Manual*,
which is a valuable reference for writing efficient code.

Memory
------

This section describes some key memory considerations when developing applications in the DPDK environment.

Memory Copy: Do not Use libc in the Data Plane
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Many libc functions are available in the DPDK, via the Linux* application environment.
This can ease the porting of applications and the development of the configuration plane.
However, many of these functions are not designed for performance.
Functions such as memcpy() or strcpy() should not be used in the data plane.
To copy small structures, the preference is for a simpler technique, such as plain assignment,
that can be optimized by the compiler.
Refer to the *VTune™ Performance Analyzer Essentials* publication from Intel Press for recommendations.
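
As a minimal sketch of such a simpler technique, a small structure can be copied by plain assignment.
The ``flow_key`` structure and the ``flow_key_copy()`` helper are hypothetical names used only for illustration;
they are not part of the DPDK API.

.. code-block:: c

    #include <stdint.h>

    /* Hypothetical 16-byte key; the name and fields are illustrative only. */
    struct flow_key {
        uint32_t src_ip;
        uint32_t dst_ip;
        uint16_t src_port;
        uint16_t dst_port;
        uint32_t proto;
    };

    /*
     * Copy by plain assignment. The size is known at compile time, so the
     * compiler can typically emit a few moves instead of a library call.
     */
    static inline void
    flow_key_copy(struct flow_key *dst, const struct flow_key *src)
    {
        *dst = *src;
    }

Whether this beats a library copy depends on the compiler and the structure size, so it is worth measuring.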

For specific functions that are called often,
it is also a good idea to provide a self-made optimized function, which should be declared as static inline.

The DPDK API provides an optimized rte_memcpy() function.

Memory Allocation
~~~~~~~~~~~~~~~~~

Other functions of libc, such as malloc(), provide a flexible way to allocate and free memory.
In some cases, using dynamic allocation is necessary,
but using malloc-like functions in the data plane is not advised because
managing a fragmented heap can be costly and the allocator may not be optimized for parallel allocation.

If you really need dynamic allocation in the data plane, it is better to use a memory pool of fixed-size objects.
This API is provided by librte_mempool.
This data structure provides several services that increase performance, such as memory alignment of objects,
lockless access to objects, NUMA awareness, bulk get/put and per-lcore cache.
The rte_malloc() function uses a similar concept to mempools.
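
To illustrate why a pool of fixed-size objects is cheap, the following is a simplified, single-threaded sketch in plain C.
It is not librte_mempool and offers none of its lockless, NUMA-aware or per-lcore cache features.
All names (``obj_pool``, ``obj_pool_get()``, ``obj_pool_put()``) and the sizes are assumptions made for the example.

.. code-block:: c

    #include <stddef.h>

    #define POOL_OBJS 4     /* illustrative capacity */
    #define OBJ_SIZE  64    /* illustrative object size, one assumed cache line */

    /* A slot holds either user data or, while free, a link to the next free slot. */
    union obj_slot {
        union obj_slot *next;
        _Alignas(OBJ_SIZE) unsigned char data[OBJ_SIZE];
    };

    struct obj_pool {
        union obj_slot slots[POOL_OBJS];
        union obj_slot *free_head;
    };

    /* Thread every slot onto the free list. */
    static void
    obj_pool_init(struct obj_pool *p)
    {
        p->free_head = NULL;
        for (size_t i = 0; i < POOL_OBJS; i++) {
            p->slots[i].next = p->free_head;
            p->free_head = &p->slots[i];
        }
    }

    /* O(1) get: pop the head of the free list, or return NULL when empty. */
    static void *
    obj_pool_get(struct obj_pool *p)
    {
        union obj_slot *slot = p->free_head;

        if (slot == NULL)
            return NULL;
        p->free_head = slot->next;
        return slot->data;
    }

    /* O(1) put: push the object back onto the free list. */
    static void
    obj_pool_put(struct obj_pool *p, void *obj)
    {
        union obj_slot *slot = obj;

        slot->next = p->free_head;
        p->free_head = slot;
    }

Because every object has the same size, get and put are constant-time pointer updates and the storage never fragments.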

Concurrent Access to the Same Memory Area
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Read-Write (RW) access operations by several lcores to the same memory area can generate a lot of data cache misses,
which are very costly.
It is often possible to use per-lcore variables, for example, in the case of statistics.
There are at least two solutions for this:

*   Use RTE_PER_LCORE variables. Note that in this case, data on lcore X is not available to lcore Y.

*   Use a table of structures (one per lcore). In this case, each structure must be cache-aligned.

Read-mostly variables can be shared among lcores without performance losses if there are no RW variables in the same cache line.
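
The table-of-structures option can be sketched in plain C as follows.
The 64-byte cache line, the ``MAX_LCORES`` limit and all function names are assumptions made for the example;
a DPDK application would normally rely on the alignment macros and lcore limits that DPDK itself provides.

.. code-block:: c

    #include <stdint.h>

    #define CACHE_LINE 64   /* assumed cache line size */
    #define MAX_LCORES 8    /* illustrative limit, not a DPDK constant */

    /*
     * One entry per lcore. Aligning the first member to the cache line size
     * also pads the structure to a full line, so two lcores never write to
     * the same line.
     */
    struct lcore_stats {
        _Alignas(CACHE_LINE) uint64_t rx_packets;
        uint64_t tx_packets;
    };

    static struct lcore_stats stats[MAX_LCORES];

    /* Each lcore updates only its own entry. */
    static inline void
    stats_count_rx(unsigned int lcore_id, uint64_t n)
    {
        stats[lcore_id].rx_packets += n;
    }

    /*
     * Sum all entries. This sketch assumes the caller runs while the workers
     * are not updating their entries, for example after they have stopped.
     */
    static uint64_t
    stats_total_rx(void)
    {
        uint64_t total = 0;

        for (unsigned int i = 0; i < MAX_LCORES; i++)
            total += stats[i].rx_packets;
        return total;
    }

Unlike per-lcore variables, the table stays visible to every lcore, which is convenient when a control thread has to aggregate the values.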

NUMA
~~~~

On a NUMA system, it is preferable to access local memory since remote memory access is slower.
In the DPDK, the memzone, ring, rte_malloc and mempool APIs provide a way to allocate memory on a specific socket.

Sometimes, it can be a good idea to duplicate data to optimize speed.
For read-mostly variables that are often accessed,
it should not be a problem to keep them in one socket only, since data will be present in cache.

Distribution Across Memory Channels
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Modern memory controllers have several memory channels that can load or store data in parallel.
Depending on the memory controller and its configuration,
the number of channels and the way the memory is distributed across the channels varies.
Each channel has a bandwidth limit,
meaning that if all memory access operations are done on the first channel only, there is a potential bottleneck.

By default, the :ref:`Mempool Library <Mempool_Library>` spreads the addresses of objects among memory channels.

Locking Memory Pages
~~~~~~~~~~~~~~~~~~~~

The underlying operating system is allowed to load/unload memory pages at its own discretion.
These page loads could impact the performance, as the process is on hold when the kernel fetches them.

To avoid this, you can pre-load the pages and lock them into memory with the ``mlockall()`` call.

.. code-block:: c

    /* Needs <sys/mman.h>, <errno.h> and <string.h>. */
    if (mlockall(MCL_CURRENT | MCL_FUTURE)) {
        RTE_LOG(NOTICE, USER1, "mlockall() failed with error \"%s\"\n",
                strerror(errno));
    }

Communication Between lcores
----------------------------

To provide message-based communication between lcores,
it is advised to use the DPDK ring API, which provides a lockless ring implementation.

The ring supports bulk and burst access,
meaning that it is possible to read several elements from the ring with only one costly atomic operation
(see :doc:`ring_lib`).
Performance is greatly improved when using bulk access operations.

The code that dequeues messages may look similar to the following:

.. code-block:: c

    #define MAX_BULK 32

    void *obj_table[MAX_BULK];
    unsigned int count;

    while (1) {
        /* Process as many elements as can be dequeued. */
        count = rte_ring_dequeue_burst(ring, obj_table, MAX_BULK, NULL);
        if (unlikely(count == 0))
            continue;

        my_process_bulk(obj_table, count);
    }

PMD
---

The DPDK Poll Mode Driver (PMD) is also able to work in bulk/burst mode,
so that code common to each call of the send or receive function is executed once per burst rather than once per packet.

Avoid partial writes.
When PCI devices write to system memory through DMA,
it costs less if the write operation is on a full cache line as opposed to part of it.
In the PMD code, actions have been taken to avoid partial writes as much as possible.

Lower Packet Latency
~~~~~~~~~~~~~~~~~~~~

Traditionally, there is a trade-off between throughput and latency.
An application can be tuned to achieve a high throughput,
but the end-to-end latency of an average packet will typically increase as a result.
Similarly, the application can be tuned to have, on average,
a low end-to-end latency, at the cost of lower throughput.

In order to achieve higher throughput,
the DPDK attempts to amortize the cost of processing each packet individually by processing packets in bursts.

Using the testpmd application as an example,
the burst size can be set on the command line to a value of 32 (also the default value).
This allows the application to request 32 packets at a time from the PMD.
The testpmd application then immediately attempts to transmit all the packets that were received,
in this case, all 32 packets.

The packets are not transmitted until the tail pointer is updated on the corresponding TX queue of the network port.
This behavior is desirable when tuning for high throughput because
the cost of tail pointer updates to both the RX and TX queues can be spread
across 32 packets,
effectively hiding the relatively slow MMIO cost of writing to the PCIe* device.
However, this is not very desirable when tuning for low latency because
the first packet that was received must also wait for another 31 packets to be received.
It cannot be transmitted until the other 31 packets have also been processed because
the NIC will not know to transmit the packets until the TX tail pointer has been updated,
which is not done until all 32 packets have been processed for transmission.

To consistently achieve low latency, even under heavy system load,
the application developer should avoid processing packets in bursts.
The testpmd application can be configured from the command line to use a burst value of 1.
This will allow a single packet to be processed at a time, providing lower latency,
but with the added cost of lower throughput.

Locks and Atomic Operations
---------------------------

This section describes some key considerations when using locks and atomic
operations in the DPDK environment.

Locks
~~~~~

On x86, atomic operations imply a lock prefix before the instruction,
causing the processor's LOCK# signal to be asserted during execution of the following instruction.
This has a big impact on performance in a multicore environment.

Performance can be improved by avoiding lock mechanisms in the data plane.
They can often be replaced by other solutions, such as per-lcore variables.
Also, some locking techniques are more efficient than others.
For instance, the Read-Copy-Update (RCU) algorithm can frequently replace simple rwlocks.

Atomic Operations: Use C11 Atomic Builtins
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

DPDK generic rte_atomic operations are implemented by __sync builtins. These
__sync builtins result in full barriers on aarch64, which are unnecessary
in many use cases. They can be replaced by __atomic builtins that conform to
the C11 memory model and provide finer memory order control.

So replacing the rte_atomic operations with __atomic builtins might improve
performance for aarch64 machines.

Some typical optimization cases are listed below:

Atomicity
^^^^^^^^^

Some use cases require atomicity alone; the ordering of the memory operations
does not matter. For example, the packet statistics counters need to be
incremented atomically but do not need any particular memory ordering.
So, RELAXED memory ordering is sufficient.
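
A minimal sketch of such a counter, using the GCC/Clang __atomic builtins, could look like this.
The counter and function names are hypothetical.

.. code-block:: c

    #include <stdint.h>

    /* Hypothetical statistics counter shared by several lcores. */
    static uint64_t rx_packets;

    /* The increment is atomic but imposes no ordering on other memory accesses. */
    static inline void
    count_rx(uint64_t n)
    {
        __atomic_fetch_add(&rx_packets, n, __ATOMIC_RELAXED);
    }

    /* A relaxed load is enough to read a value that is only reported. */
    static inline uint64_t
    read_rx(void)
    {
        return __atomic_load_n(&rx_packets, __ATOMIC_RELAXED);
    }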

One-way Barrier
^^^^^^^^^^^^^^^

Some use cases allow for memory reordering in one way while requiring memory
ordering in the other direction.

For example, the memory operations before the spinlock lock are allowed to
move into the critical section, but the memory operations in the critical section
are not allowed to move above the lock. In this case, the full memory barrier
in the compare-and-swap operation can be replaced with ACQUIRE memory order.
On the other hand, the memory operations after the spinlock unlock are allowed
to move into the critical section, but the memory operations in the critical
section are not allowed to move below the unlock. So the full barrier in the
store operation can use RELEASE memory order.
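
The spinlock case can be sketched with the __atomic builtins as shown below.
This is an illustration only, not the DPDK spinlock implementation, and the ``sketch_*`` names are made up for the example.

.. code-block:: c

    #include <stdbool.h>

    struct sketch_spinlock {
        int locked;   /* 0 = free, 1 = held */
    };

    /* One attempt to take the lock; ACQUIRE applies only when the swap succeeds. */
    static inline bool
    sketch_trylock(struct sketch_spinlock *sl)
    {
        int expected = 0;

        return __atomic_compare_exchange_n(&sl->locked, &expected, 1, false,
                                           __ATOMIC_ACQUIRE, __ATOMIC_RELAXED);
    }

    /* Spin until the lock is taken. */
    static inline void
    sketch_lock(struct sketch_spinlock *sl)
    {
        while (!sketch_trylock(sl))
            ;
    }

    /* RELEASE keeps the critical section from moving below the unlock. */
    static inline void
    sketch_unlock(struct sketch_spinlock *sl)
    {
        __atomic_store_n(&sl->locked, 0, __ATOMIC_RELEASE);
    }

A production lock would usually add a pause or yield hint inside the spin loop; that is omitted here to keep the memory ordering visible.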

Reader-Writer Concurrency
^^^^^^^^^^^^^^^^^^^^^^^^^

Lock-free reader-writer concurrency is one of the common use cases in DPDK.

The payload, that is, the data that the writer wants to communicate to the reader,
can be written with RELAXED memory order. However, the guard variable should
be written with RELEASE memory order. This ensures that the store to guard
variable is observable only after the store to payload is observable.

Correspondingly, on the reader side, the guard variable should be read
with ACQUIRE memory order. The payload, that is, the data the writer communicated,
can be read with RELAXED memory order. This ensures that, if the store to
guard variable is observable, the store to payload is also observable.
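
The guard and payload pattern described above can be sketched as follows.
The variable and function names are hypothetical, and the example publishes a single value once.

.. code-block:: c

    #include <stdbool.h>
    #include <stdint.h>

    static uint32_t payload;   /* data the writer communicates */
    static int guard;          /* 0 = nothing published, 1 = payload is valid */

    /* Writer: RELAXED store to the payload, then RELEASE store to the guard. */
    static void
    writer_publish(uint32_t value)
    {
        __atomic_store_n(&payload, value, __ATOMIC_RELAXED);
        __atomic_store_n(&guard, 1, __ATOMIC_RELEASE);
    }

    /* Reader: ACQUIRE load of the guard, then RELAXED load of the payload. */
    static bool
    reader_try_consume(uint32_t *out)
    {
        if (__atomic_load_n(&guard, __ATOMIC_ACQUIRE) == 0)
            return false;
        *out = __atomic_load_n(&payload, __ATOMIC_RELAXED);
        return true;
    }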

Coding Considerations
---------------------

Inline Functions
~~~~~~~~~~~~~~~~

Small functions can be declared as static inline in the header file.
This avoids the cost of a call instruction (and the associated context saving).
However, this technique is not always efficient; it depends on many factors including the compiler.
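
As a small illustration, a helper of the following kind is a typical candidate for a static inline definition in a header.
The function name is hypothetical, and the mask is assumed to be a power of two minus one.

.. code-block:: c

    #include <stdint.h>

    /* Advance an index in a power-of-two sized table, wrapping at the end. */
    static inline uint32_t
    ring_index_next(uint32_t idx, uint32_t mask)
    {
        return (idx + 1) & mask;
    }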

Branch Prediction
~~~~~~~~~~~~~~~~~

The Intel® C/C++ Compiler (icc)/gcc built-in helper functions likely() and unlikely()
allow the developer to indicate if a code branch is likely to be taken or not.
For instance:

.. code-block:: c

    if (likely(x > 1))
        do_stuff();
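
Hints of this kind are commonly built on the ``__builtin_expect()`` builtin of GCC and Clang.
The sketch below uses local ``sketch_likely()`` and ``sketch_unlikely()`` macros, which are stand-ins defined only for this example,
together with a hypothetical ``checked_div()`` function whose error path is marked as unlikely.

.. code-block:: c

    /* Local stand-ins for the hint macros; the names are made up for this example. */
    #define sketch_likely(x)   __builtin_expect(!!(x), 1)
    #define sketch_unlikely(x) __builtin_expect(!!(x), 0)

    /* Divide a by b; the division-by-zero error path is hinted as rare. */
    static int
    checked_div(int a, int b, int *out)
    {
        if (sketch_unlikely(b == 0))
            return -1;

        *out = a / b;
        return 0;
    }

The hint changes only how the compiler lays out the code; the result of the condition is unaffected.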

Setting the Target CPU Type
---------------------------

The DPDK supports CPU microarchitecture-specific optimizations by means of the RTE_MACHINE option.
The degree of optimization depends on the compiler's ability to optimize for a specific microarchitecture,
therefore it is preferable to use the latest compiler versions whenever possible.

If the compiler version does not support the specific feature set (for example, the Intel® AVX instruction set),
the build process gracefully degrades to the latest feature set that is supported by the compiler.

Since the build and runtime targets may not be the same,
the resulting binary also contains a platform check that runs before the
main() function and checks if the current machine is suitable for running the binary.

Along with compiler optimizations,
a set of preprocessor defines is automatically added to the build process (regardless of the compiler version).
These defines correspond to the instruction sets that the target CPU should be able to support.
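
Code can test such defines to select an implementation at compile time.
The sketch below is an assumption-laden illustration: it checks the compiler's own predefined macros
(``__AVX2__`` and ``__SSE4_2__``) as stand-ins, since the exact names of the defines added by the build process are not listed here,
and ``selected_impl()`` is a hypothetical function.

.. code-block:: c

    /*
     * Report which code path was compiled in. The macros tested here are the
     * compiler's predefined ones, used as stand-ins for build-provided defines.
     */
    static const char *
    selected_impl(void)
    {
    #if defined(__AVX2__)
        return "avx2";
    #elif defined(__SSE4_2__)
        return "sse4.2";
    #else
        return "scalar";
    #endif
    }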