+Both memsegs and memzones are stored using ``rte_fbarray`` structures. Please
+refer to the *DPDK API Reference* for more information.
+
+
+Multiple pthread
+----------------
+
+DPDK usually pins one pthread per core to avoid the overhead of task switching.
+This allows for significant performance gains, but lacks flexibility and is not always efficient.
+
+Power management helps to improve CPU efficiency by limiting the CPU runtime
+frequency. Alternatively, the idle cycles can be utilized to take advantage of
+the full capability of the CPU.
+
+By taking advantage of cgroup, the CPU utilization quota can be simply assigned.
+This gives another way to improve CPU efficiency. However, there is a prerequisite:
+DPDK must handle the context switching between multiple pthreads per core.
+
+For further flexibility, it is useful to set pthread affinity not only to a CPU but to a CPU set.
+
+EAL pthread and lcore Affinity
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+The term "lcore" refers to an EAL thread, which is really a Linux/FreeBSD pthread.
+"EAL pthreads" are created and managed by EAL and execute the tasks issued by *remote_launch*.
+In each EAL pthread, there is a TLS (Thread Local Storage) called *_lcore_id* for unique identification.
+As EAL pthreads usually bind 1:1 to the physical CPU, the *_lcore_id* is typically equal to the CPU ID.
+
+When using multiple pthreads, however, the binding is no longer always 1:1 between an EAL pthread and a specified physical CPU.
+The EAL pthread may have affinity to a CPU set, and as such the *_lcore_id* will not be the same as the CPU ID.
+For this reason, there is an EAL long option '--lcores' defined to assign the CPU affinity of lcores.
+For a specified lcore ID or ID group, the option allows setting the CPU set for that EAL pthread.
+
+The format pattern::
+
+    --lcores='<lcore_set>[@cpu_set][,<lcore_set>[@cpu_set],...]'
+
+'lcore_set' and 'cpu_set' can each be a single number, a range or a group.
+
+A number is a "digit([0-9]+)"; a range is "<number>-<number>"; a group is
+"(<number|range>[,<number|range>,...])".
+
+If a '@cpu_set' value is not supplied, the value of 'cpu_set' will default to
+the value of 'lcore_set'.
+
+    ::
+
+      For example, "--lcores='1,2@(5-7),(3-5)@(0,2),(0,6),7-8'" means starting 9 EAL threads:
+        lcore 0 runs on cpuset 0x41 (cpu 0,6);
+        lcore 1 runs on cpuset 0x2 (cpu 1);
+        lcore 2 runs on cpuset 0xe0 (cpu 5,6,7);
+        lcores 3, 4 and 5 run on cpuset 0x5 (cpu 0,2);
+        lcore 6 runs on cpuset 0x41 (cpu 0,6);
+        lcore 7 runs on cpuset 0x80 (cpu 7);
+        lcore 8 runs on cpuset 0x100 (cpu 8).
+
+Using this option, for each given lcore ID, the associated CPUs can be assigned.
+It is also compatible with the pattern of the corelist ('-l') option.
+
+non-EAL pthread support
+~~~~~~~~~~~~~~~~~~~~~~~
+
+It is possible to use the DPDK execution context with any user pthread (a.k.a. non-EAL pthreads).
+In a non-EAL pthread, the *_lcore_id* is always LCORE_ID_ANY, which identifies that it is not an EAL thread with a valid, unique *_lcore_id*.
+Some libraries will use an alternative unique ID (e.g. TID), some will not be impacted at all, and some will work but with limitations (e.g. the timer and mempool libraries).
+
+All these impacts are mentioned in :ref:`known_issue_label` section.
+
+Public Thread API
+~~~~~~~~~~~~~~~~~
+
+There are two public APIs, ``rte_thread_set_affinity()`` and ``rte_thread_get_affinity()``, introduced for threads.
+When they are used in any pthread context, the corresponding Thread Local Storage (TLS) values will be set/retrieved.
+
+Those TLS include *_cpuset* and *_socket_id*:
+
+* *_cpuset* stores the CPUs bitmap to which the pthread is affinitized.
+
+* *_socket_id* stores the NUMA node of the CPU set. If the CPUs in the CPU set belong to different NUMA nodes, the *_socket_id* will be set to SOCKET_ID_ANY.
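+
+As a hedged illustration, the following sketch affinitizes the calling pthread
+to a CPU set and reads it back; the CPU numbers are arbitrary assumptions for
+the example.
+
+.. code-block:: c
+
+   #include <rte_lcore.h>
+
+   /* Sketch: affinitize the calling pthread to CPUs 2 and 3, then read
+    * the affinity back. The CPU numbers are arbitrary for illustration. */
+   static int
+   pin_current_thread(void)
+   {
+       rte_cpuset_t cpuset;
+
+       CPU_ZERO(&cpuset);
+       CPU_SET(2, &cpuset);
+       CPU_SET(3, &cpuset);
+
+       /* On success, the *_cpuset* and *_socket_id* TLS are updated. */
+       if (rte_thread_set_affinity(&cpuset) != 0)
+           return -1;
+
+       /* Retrieves the TLS copy of the affinity. */
+       rte_thread_get_affinity(&cpuset);
+       return 0;
+   }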
+
+
+Control Thread API
+~~~~~~~~~~~~~~~~~~
+
+It is possible to create Control Threads using the public API
+``rte_ctrl_thread_create()``.
+Those threads can be used for management/infrastructure tasks and are used
+internally by DPDK for multi process support and interrupt handling.
+
+Those threads will be scheduled on CPUs that are part of the original process CPU
+affinity, from which the dataplane and service lcores are excluded.
+
+For example, on an 8-CPU system, starting a DPDK application with -l 2,3
+(dataplane cores), then depending on the affinity configuration (which can be
+controlled with tools like taskset on Linux or cpuset on FreeBSD):
+
+- with no affinity configuration, the Control Threads will end up on
+  CPUs 0-1 and 4-7.
+- with affinity restricted to 2-4, the Control Threads will end up on
+ CPU 4.
+- with affinity restricted to 2-3, the Control Threads will end up on
+ CPU 2 (master lcore, which is the default when no CPU is available).
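+
+As a hedged sketch (the ``monitor_loop`` routine and the thread name are made
+up for this example), a control thread can be created as follows:
+
+.. code-block:: c
+
+   #include <pthread.h>
+   #include <rte_lcore.h>
+
+   /* Hypothetical management routine; a placeholder for this sketch. */
+   static void *
+   monitor_loop(void *arg)
+   {
+       (void)arg;
+       /* Periodic management/infrastructure work would run here. */
+       return NULL;
+   }
+
+   static int
+   start_monitor(void)
+   {
+       pthread_t tid;
+
+       /* EAL schedules this thread on a CPU outside the dataplane and
+        * service lcore set, as described above. */
+       return rte_ctrl_thread_create(&tid, "monitor", NULL,
+                                     monitor_loop, NULL);
+   }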
+
+.. _known_issue_label:
+
+Known Issues
+~~~~~~~~~~~~
+
++ rte_mempool
+
+ The rte_mempool uses a per-lcore cache inside the mempool.
+ For non-EAL pthreads, ``rte_lcore_id()`` will not return a valid number.
+  So for now, when rte_mempool is used with non-EAL pthreads, the put/get operations bypass the default mempool cache, which incurs a performance penalty.
+ Only user-owned external caches can be used in a non-EAL context in conjunction with ``rte_mempool_generic_put()`` and ``rte_mempool_generic_get()`` that accept an explicit cache parameter.
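+
+  As a hedged sketch (``mp`` is assumed to be an already-created mempool), a
+  non-EAL pthread can use a user-owned cache as follows:
+
+  .. code-block:: c
+
+     #include <rte_mempool.h>
+
+     /* Use an explicit, user-owned cache from a non-EAL pthread.
+      * "mp" is assumed to be an existing mempool. */
+     static int
+     non_eal_get_put(struct rte_mempool *mp)
+     {
+         struct rte_mempool_cache *cache;
+         void *obj;
+
+         /* Create a private cache of up to 32 objects on any NUMA node. */
+         cache = rte_mempool_cache_create(32, SOCKET_ID_ANY);
+         if (cache == NULL)
+             return -1;
+
+         if (rte_mempool_generic_get(mp, &obj, 1, cache) == 0)
+             rte_mempool_generic_put(mp, &obj, 1, cache);
+
+         rte_mempool_cache_free(cache);
+         return 0;
+     }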
+
++ rte_ring
+
+  rte_ring supports multi-producer enqueue and multi-consumer dequeue.
+  However, it is non-preemptive; this has the knock-on effect of making rte_mempool non-preemptible.
+
+ .. note::
+
+ The "non-preemptive" constraint means:
+
+    - a pthread doing multi-producer enqueues on a given ring must not
+      be preempted by another pthread doing a multi-producer enqueue on
+      the same ring.
+    - a pthread doing multi-consumer dequeues on a given ring must not
+      be preempted by another pthread doing a multi-consumer dequeue on
+      the same ring.
+
+    Bypassing this constraint may cause the 2nd pthread to spin until the 1st one is scheduled again.
+    Moreover, if the 1st pthread is preempted by a context that has a higher priority, it may even cause a deadlock.
+
+  This means that use cases involving preemptible pthreads should consider using rte_ring carefully.
+
+ 1. It CAN be used for preemptible single-producer and single-consumer use case.
+
+ 2. It CAN be used for non-preemptible multi-producer and preemptible single-consumer use case.
+
+ 3. It CAN be used for preemptible single-producer and non-preemptible multi-consumer use case.
+
+  4. It MAY be used by preemptible multi-producer and/or preemptible multi-consumer pthreads whose scheduling policies are all SCHED_OTHER (cfs), SCHED_IDLE or SCHED_BATCH. Users SHOULD be aware of the performance penalty before using it.
+
+  5. It MUST NOT be used by multi-producer/consumer pthreads whose scheduling policies are SCHED_FIFO or SCHED_RR.
+
+ Alternatively, applications can use the lock-free stack mempool handler. When
+ considering this handler, note that:
+
+ - It is currently limited to the aarch64 and x86_64 platforms, because it uses
+ an instruction (16-byte compare-and-swap) that is not yet available on other
+ platforms.
+ - It has worse average-case performance than the non-preemptive rte_ring, but
+ software caching (e.g. the mempool cache) can mitigate this by reducing the
+ number of stack accesses.
+
++ rte_timer
+
+ Running ``rte_timer_manage()`` on a non-EAL pthread is not allowed. However, resetting/stopping the timer from a non-EAL pthread is allowed.
+
++ rte_log
+
+  In non-EAL pthreads, there is no per-thread loglevel or logtype; global loglevels are used.
+
++ misc
+
+ The debug statistics of rte_ring, rte_mempool and rte_timer are not supported in a non-EAL pthread.
+
+cgroup control
+~~~~~~~~~~~~~~
+
+The following is a simple example of cgroup control usage: there are two pthreads (t0 and t1) doing packet I/O on the same core ($cpu).
+We expect only 50% of CPU time to be spent on packet I/O.
+
+ .. code-block:: console
+
+ mkdir /sys/fs/cgroup/cpu/pkt_io
+ mkdir /sys/fs/cgroup/cpuset/pkt_io
+
+      echo $cpu > /sys/fs/cgroup/cpuset/pkt_io/cpuset.cpus
+
+ echo $t0 > /sys/fs/cgroup/cpu/pkt_io/tasks
+ echo $t0 > /sys/fs/cgroup/cpuset/pkt_io/tasks
+
+ echo $t1 > /sys/fs/cgroup/cpu/pkt_io/tasks
+ echo $t1 > /sys/fs/cgroup/cpuset/pkt_io/tasks
+
+      cd /sys/fs/cgroup/cpu/pkt_io
+      echo 100000 > cpu.cfs_period_us
+      echo 50000 > cpu.cfs_quota_us
+
+
+Malloc
+------
+
+The EAL provides a malloc API to allocate any-sized memory.
+
+The objective of this API is to provide malloc-like functions to allow
+allocation from hugepage memory and to facilitate application porting.
+The *DPDK API Reference* manual describes the available functions.
+
+Typically, these kinds of allocations should not be done in data plane
+processing because they are slower than pool-based allocation and make
+use of locks within the allocation and free paths.
+However, they can be used in configuration code.
+
+Refer to the rte_malloc() function description in the *DPDK API Reference*
+manual for more information.
+
+Cookies
+~~~~~~~
+
+When CONFIG_RTE_MALLOC_DEBUG is enabled, the allocated memory contains
+overwrite protection fields to help identify buffer overflows.
+
+Alignment and NUMA Constraints
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+The rte_malloc() takes an align argument that can be used to request a memory
+area that is aligned on a multiple of this value (which must be a power of two).
+
+On systems with NUMA support, a call to the rte_malloc() function will return
+memory that has been allocated on the NUMA socket of the core which made the call.
+A set of APIs is also provided to allow memory to be explicitly allocated on a
+NUMA socket directly, or allocated on the NUMA socket where another logical core
+is located, in the case where the memory is to be used by a logical core other
+than the one doing the memory allocation.
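+
+As a hedged sketch (the size, alignment and socket number below are arbitrary
+assumptions for the example):
+
+.. code-block:: c
+
+   #include <rte_malloc.h>
+
+   /* Allocate a 1 KB buffer aligned on 64 bytes from the caller's NUMA
+    * socket, and another one explicitly on socket 1, then free both. */
+   static int
+   alloc_example(void)
+   {
+       void *local = rte_malloc("example", 1024, 64);
+       void *remote = rte_malloc_socket("example", 1024, 64, 1);
+
+       if (local == NULL || remote == NULL) {
+           rte_free(local);   /* rte_free() on NULL is a no-op */
+           rte_free(remote);
+           return -1;
+       }
+
+       rte_free(local);
+       rte_free(remote);
+       return 0;
+   }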
+
+Use Cases
+~~~~~~~~~
+
+This API is meant to be used by an application that requires malloc-like
+functions at initialization time.
+
+For allocating/freeing data at runtime, in the fast-path of an application,
+the memory pool library should be used instead.
+
+Internal Implementation
+~~~~~~~~~~~~~~~~~~~~~~~
+
+Data Structures
+^^^^^^^^^^^^^^^
+
+There are two data structure types used internally in the malloc library:
+
+* struct malloc_heap - used to track free space on a per-socket basis
+
+* struct malloc_elem - the basic element of allocation and free-space
+ tracking inside the library.
+
+Structure: malloc_heap
+""""""""""""""""""""""
+
+The malloc_heap structure is used to manage free space on a per-socket basis.
+Internally, there is one heap structure per NUMA node, which allows us to
+allocate memory to a thread based on the NUMA node on which this thread runs.
+While this does not guarantee that the memory will be used on that NUMA node,
+it is no worse than a scheme where the memory is always allocated on a fixed
+or random node.
+
+The key fields of the heap structure and their function are described below
+(see also the figure below):
+
+* lock - the lock field is needed to synchronize access to the heap.
+ Given that the free space in the heap is tracked using a linked list,
+ we need a lock to prevent two threads manipulating the list at the same time.
+
+* free_head - this points to the first element in the list of free nodes for
+ this malloc heap.
+
+* first - this points to the first element in the heap.
+
+* last - this points to the last element in the heap.
+
+.. _figure_malloc_heap:
+
+.. figure:: img/malloc_heap.*
+
+ Example of a malloc heap and malloc elements within the malloc library
+
+
+.. _malloc_elem:
+
+Structure: malloc_elem
+""""""""""""""""""""""
+
+The malloc_elem structure is used as a generic header structure for various
+blocks of memory.
+It is used in two different ways - both shown in the diagram above:
+
+#. As a header on a block of free or allocated memory - normal case
+
+#. As a padding header inside a block of memory
+
+The most important fields in the structure and how they are used are described below.
+
+Malloc heap is a doubly-linked list, where each element keeps track of its
+previous and next elements. Due to the fact that hugepage memory can come and
+go, neighboring malloc elements may not necessarily be adjacent in memory.
+Also, since a malloc element may span multiple pages, its contents may not
+necessarily be IOVA-contiguous either - each malloc element is only guaranteed
+to be virtually contiguous.
+
+.. note::
+
+    If the usage of a particular field in one of the above two usages is not
+    described, the field can be assumed to have an undefined value in that
+    situation. For example, for padding headers only the "state" and "pad"
+    fields have valid values.
+
+* heap - this pointer is a reference back to the heap structure from which
+ this block was allocated.
+ It is used for normal memory blocks when they are being freed, to add the
+ newly-freed block to the heap's free-list.
+
+* prev - this pointer points to the previous header element/block in memory. When
+ freeing a block, this pointer is used to reference the previous block to
+ check if that block is also free. If so, and the two blocks are immediately
+ adjacent to each other, then the two free blocks are merged to form a single
+ larger block.
+
+* next - this pointer points to the next header element/block in memory. When
+ freeing a block, this pointer is used to reference the next block to check
+ if that block is also free. If so, and the two blocks are immediately
+ adjacent to each other, then the two free blocks are merged to form a single
+ larger block.
+
+* free_list - this is a structure pointing to previous and next elements in
+ this heap's free list.
+ It is only used in normal memory blocks; on ``malloc()`` to find a suitable
+ free block to allocate and on ``free()`` to add the newly freed element to
+ the free-list.
+
+* state - This field can have one of three values: ``FREE``, ``BUSY`` or
+ ``PAD``.
+ The former two are to indicate the allocation state of a normal memory block
+ and the latter is to indicate that the element structure is a dummy structure
+ at the end of the start-of-block padding, i.e. where the start of the data
+ within a block is not at the start of the block itself, due to alignment
+ constraints.
+ In that case, the pad header is used to locate the actual malloc element
+ header for the block.
+
+* pad - this holds the length of the padding present at the start of the block.
+ In the case of a normal block header, it is added to the address of the end
+ of the header to give the address of the start of the data area, i.e. the
+ value passed back to the application on a malloc.
+ Within a dummy header inside the padding, this same value is stored, and is
+ subtracted from the address of the dummy header to yield the address of the
+ actual block header.
+
+* size - the size of the data block, including the header itself.
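+
+Putting these fields together, a simplified sketch of the element header
+(illustrative only; the real definition lives inside the malloc library and
+differs in detail) looks like this:
+
+.. code-block:: c
+
+   #include <stddef.h>
+   #include <stdint.h>
+   #include <sys/queue.h>
+
+   struct malloc_heap;   /* the per-socket heap described above */
+
+   enum elem_state { ELEM_FREE, ELEM_BUSY, ELEM_PAD };
+
+   struct malloc_elem {
+       struct malloc_heap *heap;           /* heap this block came from */
+       struct malloc_elem *prev;           /* previous element in memory */
+       struct malloc_elem *next;           /* next element in memory */
+       LIST_ENTRY(malloc_elem) free_list;  /* links in the heap's free list */
+       enum elem_state state;              /* FREE, BUSY or PAD */
+       uint32_t pad;                       /* start-of-block padding length */
+       size_t size;                        /* data size, header included */
+   };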
+
+Memory Allocation
+^^^^^^^^^^^^^^^^^
+
+On EAL initialization, all preallocated memory segments are set up as part of the
+malloc heap. This setup involves placing an :ref:`element header<malloc_elem>`
+with ``FREE`` state at the start of each virtually contiguous segment of memory.
+The ``FREE`` element is then added to the ``free_list`` for the malloc heap.
+
+This setup also happens whenever memory is allocated at runtime (if supported),
+in which case newly allocated pages are also added to the heap, merging with any
+adjacent free segments if there are any.
+
+When an application makes a call to a malloc-like function, the malloc function
+will first index the ``lcore_config`` structure for the calling thread, and
+determine the NUMA node of that thread.
+The NUMA node is used to index the array of ``malloc_heap`` structures which is
+passed as a parameter to the ``heap_alloc()`` function, along with the
+requested size, type, alignment and boundary parameters.
+
+The ``heap_alloc()`` function will scan the free_list of the heap, and attempt
+to find a free block suitable for storing data of the requested size, with the
+requested alignment and boundary constraints.
+
+When a suitable free element has been identified, the pointer to be returned
+to the user is calculated.
+The cache-line of memory immediately preceding this pointer is filled with a
+struct malloc_elem header.
+Because of alignment and boundary constraints, there could be free space at
+the start and/or end of the element, resulting in the following behavior:
+
+#. Check for trailing space.
+ If the trailing space is big enough, i.e. > 128 bytes, then the free element
+ is split.
+ If it is not, then we just ignore it (wasted space).
+
+#. Check for space at the start of the element.
+ If the space at the start is small, i.e. <=128 bytes, then a pad header is
+ used, and the remaining space is wasted.
+ If, however, the remaining space is greater, then the free element is split.
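+
+As a hedged illustration of these two checks (the constants and the helper
+below are made up for the example; the library's internals differ):
+
+.. code-block:: c
+
+   #include <stddef.h>
+   #include <stdint.h>
+
+   #define HDR_SIZE  64   /* illustrative: cache line holding the header */
+   #define MIN_SPLIT 128  /* illustrative: leftover space below this is wasted */
+
+   /* Given a free element spanning [elem, elem + elem_size) and the aligned
+    * data pointer chosen inside it, decide whether the leading and trailing
+    * remainders are big enough to split off as new free elements. */
+   static void
+   split_decision(uintptr_t elem, size_t elem_size,
+                  uintptr_t data, size_t data_size,
+                  int *split_front, int *split_back)
+   {
+       size_t leading = (data - HDR_SIZE) - elem;
+       size_t trailing = (elem + elem_size) - (data + data_size);
+
+       /* Trailing space: split if big enough, otherwise it is wasted. */
+       *split_back = trailing > MIN_SPLIT;
+
+       /* Leading space: split if big enough; otherwise a pad header (if
+        * needed) covers it and the remainder is wasted. */
+       *split_front = leading > MIN_SPLIT;
+   }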
+
+The advantage of allocating the memory from the end of the existing element is
+that no adjustment of the free list needs to take place - the existing element
+on the free list just has its size value adjusted, and the next/previous elements
+have their "prev"/"next" pointers redirected to the newly created element.
+
+In the case where there is not enough memory in the heap to satisfy an allocation
+request, EAL will attempt to allocate more memory from the system (if supported)
+and, following a successful allocation, will retry the reservation. In
+a multiprocessing scenario, all primary and secondary processes will synchronize
+their memory maps to ensure that any valid pointer to DPDK memory is guaranteed
+to be valid at all times in all currently running processes.
+
+Failure to synchronize memory maps in one of the processes will cause allocation
+to fail, even though some of the processes may have allocated the memory
+successfully. The memory is not added to the malloc heap unless the primary
+process has ensured that all other processes have mapped this memory successfully.
+
+Any successful allocation event will trigger a callback, for which user
+applications and other DPDK subsystems can register. Additionally, validation
+callbacks will be triggered before allocation if the newly allocated memory
+would exceed a threshold set by the user, giving a chance to allow or deny the
+allocation.
+
+.. note::
+
+    Any allocation of new pages has to go through the primary process. If the
+    primary process is not active, no memory will be allocated, even if it was
+    theoretically possible to do so. This is because the primary process's
+    memory map acts as an authority on what should or should not be mapped,
+    while each secondary process has its own, local memory map. Secondary
+    processes do not update the shared memory map; they only copy its contents
+    to their local memory map.
+
+Freeing Memory
+^^^^^^^^^^^^^^
+
+To free an area of memory, the pointer to the start of the data area is passed
+to the free function.
+The size of the ``malloc_elem`` structure is subtracted from this pointer to get
+the element header for the block.
+If this header is of type ``PAD`` then the pad length is further subtracted from
+the pointer to get the proper element header for the entire block.
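+
+A hedged sketch of that pointer arithmetic, reusing the simplified
+``malloc_elem`` layout from the structure sketch above (the real header size
+differs):
+
+.. code-block:: c
+
+   #include <stdint.h>
+
+   /* Recover the element header for a data pointer passed to free(). */
+   static struct malloc_elem *
+   elem_from_data(void *data)
+   {
+       struct malloc_elem *elem = (struct malloc_elem *)
+           ((uintptr_t)data - sizeof(struct malloc_elem));
+
+       /* A PAD dummy header sits at the end of the start-of-block padding;
+        * subtracting the pad length yields the real block header. */
+       if (elem->state == ELEM_PAD)
+           elem = (struct malloc_elem *)((uintptr_t)elem - elem->pad);
+
+       return elem;
+   }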
+
+From this element header, we get pointers to the heap from which the block was
+allocated and to where it must be freed, as well as the pointer to the previous
+and next elements. These next and previous elements are then checked to see if
+they are also ``FREE`` and are immediately adjacent to the current one, and if
+so, they are merged with the current element. This means that we can never have
+two ``FREE`` memory blocks adjacent to one another, as they are always merged
+into a single block.
+
+If deallocating pages at runtime is supported, and the free element encloses
+one or more pages, those pages can be deallocated and removed from the heap.
+If DPDK was started with command-line parameters for preallocating memory
+(``-m`` or ``--socket-mem``), then those pages that were allocated at startup
+will not be deallocated.
+
+Any successful deallocation event will trigger a callback, for which user
+applications and other DPDK subsystems can register.