X-Git-Url: http://git.droids-corp.org/?a=blobdiff_plain;f=doc%2Fguides%2Fprog_guide%2Fenv_abstraction_layer.rst;h=fff1c06349420b06a8c176b25736b565f9887ac4;hb=fdb7f1a2119f1fc5c4c11851eb89d5b0622395d7;hp=25eb281fdabba2cce32059e46ddff5297f4a8334;hpb=4a22e6ee3d2f8be8afd5b374a8916e232ab7fe97;p=dpdk.git diff --git a/doc/guides/prog_guide/env_abstraction_layer.rst b/doc/guides/prog_guide/env_abstraction_layer.rst index 25eb281fda..fff1c06349 100644 --- a/doc/guides/prog_guide/env_abstraction_layer.rst +++ b/doc/guides/prog_guide/env_abstraction_layer.rst @@ -103,7 +103,7 @@ Multi-process Support ~~~~~~~~~~~~~~~~~~~~~ The Linuxapp EAL allows a multi-process as well as a multi-threaded (pthread) deployment model. -See chapter 2.20 +See chapter :ref:`Multi-process Support ` for more details. Memory Mapping Discovery and Memory Reservation @@ -115,8 +115,7 @@ The physical address of the reserved memory for that memory zone is also returne .. note:: - Memory reservations done using the APIs provided by the rte_malloc library are also backed by pages from the hugetlbfs filesystem. - However, physical address information is not available for the blocks of memory allocated in this way. + Memory reservations done using the APIs provided by rte_malloc are also backed by pages from the hugetlbfs filesystem. Xen Dom0 support without hugetbls ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ @@ -167,8 +166,10 @@ CPU Feature Identification The EAL can query the CPU at runtime (using the rte_cpu_get_feature() function) to determine which CPU features are available. -User Space Interrupt and Alarm Handling -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ +User Space Interrupt Event +~~~~~~~~~~~~~~~~~~~~~~~~~~ + ++ User Space Interrupt and Alarm Handling in Host Thread The EAL creates a host thread to poll the UIO device file descriptors to detect the interrupts. Callbacks can be registered or unregistered by the EAL functions for a specific interrupt event @@ -177,8 +178,51 @@ The EAL also allows timed callbacks to be used in the same way as for NIC interr .. note:: - The only interrupts supported by the DPDK Poll-Mode Drivers are those for link status change, - i.e. link up and link down notification. + In DPDK PMD, the only interrupts handled by the dedicated host thread are those for link status change + (link up and link down notification) and for sudden device removal. + + ++ RX Interrupt Event + +The receive and transmit routines provided by each PMD don't limit themselves to execute in polling thread mode. +To ease the idle polling with tiny throughput, it's useful to pause the polling and wait until the wake-up event happens. +The RX interrupt is the first choice to be such kind of wake-up event, but probably won't be the only one. + +EAL provides the event APIs for this event-driven thread mode. +Taking linuxapp as an example, the implementation relies on epoll. Each thread can monitor an epoll instance +in which all the wake-up events' file descriptors are added. The event file descriptors are created and mapped to +the interrupt vectors according to the UIO/VFIO spec. +From bsdapp's perspective, kqueue is the alternative way, but not implemented yet. + +EAL initializes the mapping between event file descriptors and interrupt vectors, while each device initializes the mapping +between interrupt vectors and queues. In this way, EAL actually is unaware of the interrupt cause on the specific vector. +The eth_dev driver takes responsibility to program the latter mapping. + +.. note:: + + Per queue RX interrupt event is only allowed in VFIO which supports multiple MSI-X vector. In UIO, the RX interrupt + together with other interrupt causes shares the same vector. In this case, when RX interrupt and LSC(link status change) + interrupt are both enabled(intr_conf.lsc == 1 && intr_conf.rxq == 1), only the former is capable. + +The RX interrupt are controlled/enabled/disabled by ethdev APIs - 'rte_eth_dev_rx_intr_*'. They return failure if the PMD +hasn't support them yet. The intr_conf.rxq flag is used to turn on the capability of RX interrupt per device. + ++ Device Removal Event + +This event is triggered by a device being removed at a bus level. Its +underlying resources may have been made unavailable (i.e. PCI mappings +unmapped). The PMD must make sure that on such occurrence, the application can +still safely use its callbacks. + +This event can be subscribed to in the same way one would subscribe to a link +status change event. The execution context is thus the same, i.e. it is the +dedicated interrupt host thread. + +Considering this, it is likely that an application would want to close a +device having emitted a Device Removal Event. In such case, calling +``rte_eth_dev_close()`` can trigger it to unregister its own Device Removal Event +callback. Care must be taken not to close the device from the interrupt handler +context. It is necessary to reschedule such closing operation. Blacklisting ~~~~~~~~~~~~ @@ -295,8 +339,8 @@ Known Issues The rte_mempool uses a per-lcore cache inside the mempool. For non-EAL pthreads, ``rte_lcore_id()`` will not return a valid number. - So for now, when rte_mempool is used with non-EAL pthreads, the put/get operations will bypass the mempool cache and there is a performance penalty because of this bypass. - Support for non-EAL mempool cache is currently being enabled. + So for now, when rte_mempool is used with non-EAL pthreads, the put/get operations will bypass the default mempool cache and there is a performance penalty because of this bypass. + Only user-owned external caches can be used in a non-EAL context in conjunction with ``rte_mempool_generic_put()`` and ``rte_mempool_generic_get()`` that accept an explicit cache parameter. + rte_ring @@ -314,7 +358,7 @@ Known Issues be preempted by another pthread doing a multi-consumer dequeue on the same ring. - Bypassing this constraint it may cause the 2nd pthread to spin until the 1st one is scheduled again. + Bypassing this constraint may cause the 2nd pthread to spin until the 1st one is scheduled again. Moreover, if the 1st pthread is preempted by a context that has an higher priority, it may even cause a dead lock. This does not mean it cannot be used, simply, there is a need to narrow down the situation when it is used by multi-pthread on the same core. @@ -325,11 +369,6 @@ Known Issues 3. It MUST not be used by multi-producer/consumer pthreads, whose scheduling policies are SCHED_FIFO or SCHED_RR. - ``RTE_RING_PAUSE_REP_COUNT`` is defined for rte_ring to reduce contention. It's mainly for case 2, a yield is issued after number of times pause repeat. - - It adds a sched_yield() syscall if the thread spins for too long while waiting on the other thread to finish its operations on the ring. - This gives the preempted thread a chance to proceed and finish with the ring enqueue/dequeue operation. - + rte_timer Running ``rte_timer_manager()`` on a non-EAL pthread is not allowed. However, resetting/stopping the timer from a non-EAL pthread is allowed. @@ -366,3 +405,222 @@ We expect only 50% of CPU spend on packet IO. echo 50000 > pkt_io/cpu.cfs_quota_us +Malloc +------ + +The EAL provides a malloc API to allocate any-sized memory. + +The objective of this API is to provide malloc-like functions to allow +allocation from hugepage memory and to facilitate application porting. +The *DPDK API Reference* manual describes the available functions. + +Typically, these kinds of allocations should not be done in data plane +processing because they are slower than pool-based allocation and make +use of locks within the allocation and free paths. +However, they can be used in configuration code. + +Refer to the rte_malloc() function description in the *DPDK API Reference* +manual for more information. + +Cookies +~~~~~~~ + +When CONFIG_RTE_MALLOC_DEBUG is enabled, the allocated memory contains +overwrite protection fields to help identify buffer overflows. + +Alignment and NUMA Constraints +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +The rte_malloc() takes an align argument that can be used to request a memory +area that is aligned on a multiple of this value (which must be a power of two). + +On systems with NUMA support, a call to the rte_malloc() function will return +memory that has been allocated on the NUMA socket of the core which made the call. +A set of APIs is also provided, to allow memory to be explicitly allocated on a +NUMA socket directly, or by allocated on the NUMA socket where another core is +located, in the case where the memory is to be used by a logical core other than +on the one doing the memory allocation. + +Use Cases +~~~~~~~~~ + +This API is meant to be used by an application that requires malloc-like +functions at initialization time. + +For allocating/freeing data at runtime, in the fast-path of an application, +the memory pool library should be used instead. + +Internal Implementation +~~~~~~~~~~~~~~~~~~~~~~~ + +Data Structures +^^^^^^^^^^^^^^^ + +There are two data structure types used internally in the malloc library: + +* struct malloc_heap - used to track free space on a per-socket basis + +* struct malloc_elem - the basic element of allocation and free-space + tracking inside the library. + +Structure: malloc_heap +"""""""""""""""""""""" + +The malloc_heap structure is used to manage free space on a per-socket basis. +Internally, there is one heap structure per NUMA node, which allows us to +allocate memory to a thread based on the NUMA node on which this thread runs. +While this does not guarantee that the memory will be used on that NUMA node, +it is no worse than a scheme where the memory is always allocated on a fixed +or random node. + +The key fields of the heap structure and their function are described below +(see also diagram above): + +* lock - the lock field is needed to synchronize access to the heap. + Given that the free space in the heap is tracked using a linked list, + we need a lock to prevent two threads manipulating the list at the same time. + +* free_head - this points to the first element in the list of free nodes for + this malloc heap. + +.. note:: + + The malloc_heap structure does not keep track of in-use blocks of memory, + since these are never touched except when they are to be freed again - + at which point the pointer to the block is an input to the free() function. + +.. _figure_malloc_heap: + +.. figure:: img/malloc_heap.* + + Example of a malloc heap and malloc elements within the malloc library + + +.. _malloc_elem: + +Structure: malloc_elem +"""""""""""""""""""""" + +The malloc_elem structure is used as a generic header structure for various +blocks of memory. +It is used in three different ways - all shown in the diagram above: + +#. As a header on a block of free or allocated memory - normal case + +#. As a padding header inside a block of memory + +#. As an end-of-memseg marker + +The most important fields in the structure and how they are used are described below. + +.. note:: + + If the usage of a particular field in one of the above three usages is not + described, the field can be assumed to have an undefined value in that + situation, for example, for padding headers only the "state" and "pad" + fields have valid values. + +* heap - this pointer is a reference back to the heap structure from which + this block was allocated. + It is used for normal memory blocks when they are being freed, to add the + newly-freed block to the heap's free-list. + +* prev - this pointer points to the header element/block in the memseg + immediately behind the current one. When freeing a block, this pointer is + used to reference the previous block to check if that block is also free. + If so, then the two free blocks are merged to form a single larger block. + +* next_free - this pointer is used to chain the free-list of unallocated + memory blocks together. + It is only used in normal memory blocks; on ``malloc()`` to find a suitable + free block to allocate and on ``free()`` to add the newly freed element to + the free-list. + +* state - This field can have one of three values: ``FREE``, ``BUSY`` or + ``PAD``. + The former two are to indicate the allocation state of a normal memory block + and the latter is to indicate that the element structure is a dummy structure + at the end of the start-of-block padding, i.e. where the start of the data + within a block is not at the start of the block itself, due to alignment + constraints. + In that case, the pad header is used to locate the actual malloc element + header for the block. + For the end-of-memseg structure, this is always a ``BUSY`` value, which + ensures that no element, on being freed, searches beyond the end of the + memseg for other blocks to merge with into a larger free area. + +* pad - this holds the length of the padding present at the start of the block. + In the case of a normal block header, it is added to the address of the end + of the header to give the address of the start of the data area, i.e. the + value passed back to the application on a malloc. + Within a dummy header inside the padding, this same value is stored, and is + subtracted from the address of the dummy header to yield the address of the + actual block header. + +* size - the size of the data block, including the header itself. + For end-of-memseg structures, this size is given as zero, though it is never + actually checked. + For normal blocks which are being freed, this size value is used in place of + a "next" pointer to identify the location of the next block of memory that + in the case of being ``FREE``, the two free blocks can be merged into one. + +Memory Allocation +^^^^^^^^^^^^^^^^^ + +On EAL initialization, all memsegs are setup as part of the malloc heap. +This setup involves placing a dummy structure at the end with ``BUSY`` state, +which may contain a sentinel value if ``CONFIG_RTE_MALLOC_DEBUG`` is enabled, +and a proper :ref:`element header` with ``FREE`` at the start +for each memseg. +The ``FREE`` element is then added to the ``free_list`` for the malloc heap. + +When an application makes a call to a malloc-like function, the malloc function +will first index the ``lcore_config`` structure for the calling thread, and +determine the NUMA node of that thread. +The NUMA node is used to index the array of ``malloc_heap`` structures which is +passed as a parameter to the ``heap_alloc()`` function, along with the +requested size, type, alignment and boundary parameters. + +The ``heap_alloc()`` function will scan the free_list of the heap, and attempt +to find a free block suitable for storing data of the requested size, with the +requested alignment and boundary constraints. + +When a suitable free element has been identified, the pointer to be returned +to the user is calculated. +The cache-line of memory immediately preceding this pointer is filled with a +struct malloc_elem header. +Because of alignment and boundary constraints, there could be free space at +the start and/or end of the element, resulting in the following behavior: + +#. Check for trailing space. + If the trailing space is big enough, i.e. > 128 bytes, then the free element + is split. + If it is not, then we just ignore it (wasted space). + +#. Check for space at the start of the element. + If the space at the start is small, i.e. <=128 bytes, then a pad header is + used, and the remaining space is wasted. + If, however, the remaining space is greater, then the free element is split. + +The advantage of allocating the memory from the end of the existing element is +that no adjustment of the free list needs to take place - the existing element +on the free list just has its size pointer adjusted, and the following element +has its "prev" pointer redirected to the newly created element. + +Freeing Memory +^^^^^^^^^^^^^^ + +To free an area of memory, the pointer to the start of the data area is passed +to the free function. +The size of the ``malloc_elem`` structure is subtracted from this pointer to get +the element header for the block. +If this header is of type ``PAD`` then the pad length is further subtracted from +the pointer to get the proper element header for the entire block. + +From this element header, we get pointers to the heap from which the block was +allocated and to where it must be freed, as well as the pointer to the previous +element, and via the size field, we can calculate the pointer to the next element. +These next and previous elements are then checked to see if they are also +``FREE``, and if so, they are merged with the current element. +This means that we can never have two ``FREE`` memory blocks adjacent to one +another, as they are always merged into a single block.