doc/guides/prog_guide/gpudev.rst

   1 .. SPDX-License-Identifier: BSD-3-Clause
   2    Copyright (c) 2021 NVIDIA Corporation & Affiliates
   3
   4 General-Purpose Graphics Processing Unit Library
   5 ================================================
   6
   7 When mixing networking activity with task processing on a GPU device,
   8 there may be the need to put in communication the CPU with the device
   9 in order to manage the memory, synchronize operations, exchange info, etc..
  10
  11 By means of the generic GPU interface provided by this library,
  12 it is possible to allocate a chunk of GPU memory and use it
  13 to create a DPDK mempool with external mbufs having the payload
  14 on the GPU memory, enabling any network interface card
  15 (which support this feature like Mellanox NIC)
  16 to directly transmit and receive packets using GPU memory.
  17
  18 Additionally, this library provides a number of functions
  19 to enhance the dialog between CPU and GPU.
  20
  21 Out of scope of this library is to provide a wrapper for GPU specific libraries
  22 (e.g. CUDA Toolkit or OpenCL), thus it is not possible to launch workload
  23 on the device or create GPU specific objects
  24 (e.g. CUDA Driver context or CUDA Streams in case of NVIDIA GPUs).
  25
  26 This library is optional in DPDK
  27 and can be disabled with ``-Ddisable_libs=gpudev``.
  28
  29
  30 Features
  31 --------
  32
  33 This library provides a number of features:
  34
  35 - Interoperability with device-specific library through generic handlers.
  36 - Allocate and free memory on the device.
  37 - Register CPU memory to make it visible from the device.
  38 - Communication between the CPU and the device.
  39
  40 The whole CPU - GPU communication is implemented
  41 using CPU memory visible from the GPU.
  42
  43
  44 API Overview
  45 ------------
  46
  47 Child Device
  48 ~~~~~~~~~~~~
  49
  50 By default, DPDK PCIe module detects and registers physical GPU devices
  51 in the system.
  52 With the gpudev library is also possible to add additional non-physical devices
  53 through an ``uint64_t`` generic handler (e.g. CUDA Driver context)
  54 that will be registered internally by the driver as an additional device (child)
  55 connected to a physical device (parent).
  56 Each device (parent or child) is represented through a ID
  57 required to indicate which device a given operation should be executed on.
  58
  59 Memory Allocation
  60 ~~~~~~~~~~~~~~~~~
  61
  62 gpudev can allocate on an input given GPU device a memory area
  63 returning the pointer to that memory.
  64 Later, it's also possible to free that memory with gpudev.
  65 GPU memory allocated outside of the gpudev library
  66 (e.g. with GPU-specific library) cannot be freed by the gpudev library.
  67
  68 Memory Registration
  69 ~~~~~~~~~~~~~~~~~~~
  70
  71 gpudev can register a CPU memory area to make it visible from a GPU device.
  72 Later, it's also possible to unregister that memory with gpudev.
  73 CPU memory registered outside of the gpudev library
  74 (e.g. with GPU specific library) cannot be unregistered by the gpudev library.
  75
  76 CPU mapping
  77 ~~~~~~~~~~~
  78
  79 gpudev can map into the CPU address space a GPU memory address allocated with gpudev.
  80 gpudev returns a pointer the CPU can use to access (ready or write) GPU memory.
  81 Later, it's also possible to unmap that memory with gpudev.
  82 GPU memory CPU mapped outside of the gpudev library (e.g. with GPU specific library)
  83 cannot be unmapped by the gpudev library.
  84
  85 Memory Barrier
  86 ~~~~~~~~~~~~~~
  87
  88 Some GPU drivers may need, under certain conditions,
  89 to enforce the coherency of external devices writes (e.g. NIC receiving packets)
  90 into the GPU memory.
  91 gpudev abstracts and exposes this capability.
  92
  93 Communication Flag
  94 ~~~~~~~~~~~~~~~~~~
  95
  96 Considering an application with some GPU task
  97 that's waiting to receive a signal from the CPU
  98 to move forward with the execution.
  99 The communication flag allocates a CPU memory GPU-visible ``uint32_t`` flag
 100 that can be used by the CPU to communicate with a GPU task.
 101
 102 Communication list
 103 ~~~~~~~~~~~~~~~~~~
 104
 105 By default, DPDK pulls free mbufs from a mempool to receive packets.
 106 Best practice, especially in a multithreaded application,
 107 is to no make any assumption on which mbufs will be used
 108 to receive the next bursts of packets.
 109 Considering an application with a GPU memory mempool
 110 attached to a receive queue having some task waiting on the GPU
 111 to receive a new burst of packets to be processed,
 112 there is the need to communicate from the CPU
 113 the list of mbuf payload addresses where received packet have been stored.
 114 The ``rte_gpu_comm_*()`` functions are responsible to create a list of packets
 115 that can be populated with receive mbuf payload addresses
 116 and communicated to the task running on the GPU.
 117
 118
 119 CUDA Example
 120 ------------
 121
 122 In the example below, there is a pseudo-code to give an example
 123 about how to use functions in this library in case of a CUDA application.
 124
 125 .. code-block:: c
 126
 127    //////////////////////////////////////////////////////////////////////////
 128    ///// gpudev library + CUDA functions
 129    //////////////////////////////////////////////////////////////////////////
 130    #define GPU_PAGE_SHIFT 16
 131    #define GPU_PAGE_SIZE (1UL << GPU_PAGE_SHIFT)
 132
 133    int main()
 134    {
 135        struct rte_gpu_flag quit_flag;
 136        struct rte_gpu_comm_list *comm_list;
 137        int nb_rx = 0;
 138        int comm_list_entry = 0;
 139        struct rte_mbuf *rx_mbufs[max_rx_mbufs];
 140        cudaStream_t cstream;
 141        struct rte_mempool *mpool_payload, *mpool_header;
 142        struct rte_pktmbuf_extmem ext_mem;
 143        int16_t dev_id;
 144        int16_t port_id = 0;
 145
 146        /* Initialize CUDA objects (cstream, context, etc..). */
 147        /* Use gpudev library to register a new CUDA context if any. */
 148
 149        /* Let's assume the application wants to use the default context of the GPU device 0. */
 150        dev_id = 0;
 151
 152        /* Create an external memory mempool using memory allocated on the GPU. */
 153        ext_mem.elt_size = mbufs_headroom_size;
 154        ext_mem.buf_len = RTE_ALIGN_CEIL(mbufs_num * ext_mem.elt_size, GPU_PAGE_SIZE);
 155        ext_mem.buf_iova = RTE_BAD_IOVA;
 156        ext_mem.buf_ptr = rte_gpu_mem_alloc(dev_id, ext_mem.buf_len, 0);
 157        rte_extmem_register(ext_mem.buf_ptr, ext_mem.buf_len, NULL, ext_mem.buf_iova, GPU_PAGE_SIZE);
 158        rte_dev_dma_map(rte_eth_devices[port_id].device,
 159                ext_mem.buf_ptr, ext_mem.buf_iova, ext_mem.buf_len);
 160        mpool_payload = rte_pktmbuf_pool_create_extbuf("gpu_mempool", mbufs_num,
 161                                                       0, 0, ext_mem.elt_size,
 162                                                       rte_socket_id(), &ext_mem, 1);
 163
 164        /*
 165         * Create CPU - device communication flag.
 166         * With this flag, the CPU can tell to the CUDA kernel to exit from the main loop.
 167         */
 168        rte_gpu_comm_create_flag(dev_id, &quit_flag, RTE_GPU_COMM_FLAG_CPU);
 169        rte_gpu_comm_set_flag(&quit_flag , 0);
 170
 171        /*
 172         * Create CPU - device communication list.
 173         * Each entry of this list will be populated by the CPU
 174         * with a new set of received mbufs that the CUDA kernel has to process.
 175         */
 176        comm_list = rte_gpu_comm_create_list(dev_id, num_entries);
 177
 178        /* A very simple CUDA kernel with just 1 CUDA block and RTE_GPU_COMM_LIST_PKTS_MAX CUDA threads. */
 179        cuda_kernel_packet_processing<<<1, RTE_GPU_COMM_LIST_PKTS_MAX, 0, cstream>>>(quit_flag->ptr, comm_list, num_entries, ...);
 180
 181        /*
 182         * For simplicity, the CPU here receives only 2 bursts of mbufs.
 183         * In a real application, network activity and device processing should overlap.
 184         */
 185        nb_rx = rte_eth_rx_burst(port_id, queue_id, &(rx_mbufs[0]), max_rx_mbufs);
 186        rte_gpu_comm_populate_list_pkts(comm_list[0], rx_mbufs, nb_rx);
 187        nb_rx = rte_eth_rx_burst(port_id, queue_id, &(rx_mbufs[0]), max_rx_mbufs);
 188        rte_gpu_comm_populate_list_pkts(comm_list[1], rx_mbufs, nb_rx);
 189
 190        /*
 191         * CPU waits for the completion of the packets' processing on the CUDA kernel
 192         * and then it does a cleanup of the received mbufs.
 193         */
 194        while (rte_gpu_comm_cleanup_list(comm_list[0]));
 195        while (rte_gpu_comm_cleanup_list(comm_list[1]));
 196
 197        /* CPU notifies the CUDA kernel that it has to terminate. */
 198        rte_gpu_comm_set_flag(&quit_flag, 1);
 199
 200        /* gpudev objects cleanup/destruction */
 201        rte_gpu_mem_free(dev_id, ext_mem.buf_len);
 202
 203        return 0;
 204    }
 205
 206    //////////////////////////////////////////////////////////////////////////
 207    ///// CUDA kernel
 208    //////////////////////////////////////////////////////////////////////////
 209
 210    void cuda_kernel(uint32_t * quit_flag_ptr, struct rte_gpu_comm_list *comm_list, int comm_list_entries)
 211    {
 212        int comm_list_index = 0;
 213        struct rte_gpu_comm_pkt *pkt_list = NULL;
 214
 215        /* Do some pre-processing operations. */
 216
 217        /* GPU kernel keeps checking this flag to know if it has to quit or wait for more packets. */
 218        while (*quit_flag_ptr == 0) {
 219            if (comm_list[comm_list_index]->status_d[0] != RTE_GPU_COMM_LIST_READY)
 220                continue;
 221
 222            if (threadIdx.x < comm_list[comm_list_index]->num_pkts)
 223            {
 224                /* Each CUDA thread processes a different packet. */
 225                packet_processing(comm_list[comm_list_index]->addr, comm_list[comm_list_index]->size, ..);
 226            }
 227            __threadfence();
 228            __syncthreads();
 229
 230            /* Wait for new packets on the next communication list entry. */
 231            comm_list_index = (comm_list_index+1) % comm_list_entries;
 232        }
 233
 234        /* Do some post-processing operations. */
 235    }