X-Git-Url: http://git.droids-corp.org/?a=blobdiff_plain;f=doc%2Fguides%2Fprog_guide%2Fvhost_lib.rst;h=b892eec67a90130563ef2ea0bb5309161d090029;hb=362f06f9a430d65e96a8bb8e972570bedd3ece5d;hp=f47473609322ee1cd878e5168e5accba1caedb13;hpb=768274ebbd5e532134420146cc26b1ff4b607740;p=dpdk.git diff --git a/doc/guides/prog_guide/vhost_lib.rst b/doc/guides/prog_guide/vhost_lib.rst index f474736093..b892eec67a 100644 --- a/doc/guides/prog_guide/vhost_lib.rst +++ b/doc/guides/prog_guide/vhost_lib.rst @@ -63,16 +63,20 @@ The following is an overview of some key Vhost API functions: 512). * zero copy is really good for VM2VM case. For iperf between two VMs, the - boost could be above 70% (when TSO is enableld). + boost could be above 70% (when TSO is enabled). - * for VM2NIC case, the ``nb_tx_desc`` has to be small enough: <= 64 if virtio - indirect feature is not enabled and <= 128 if it is enabled. + * For zero copy in VM2NIC case, guest Tx used vring may be starved if the + PMD driver consume the mbuf but not release them timely. - This is because when dequeue zero copy is enabled, guest Tx used vring will - be updated only when corresponding mbuf is freed. Thus, the nb_tx_desc - has to be small enough so that the PMD driver will run out of available - Tx descriptors and free mbufs timely. Otherwise, guest Tx vring would be - starved. + For example, i40e driver has an optimization to maximum NIC pipeline which + postpones returning transmitted mbuf until only tx_free_threshold free + descs left. The virtio TX used ring will be starved if the formula + (num_i40e_tx_desc - num_virtio_tx_desc > tx_free_threshold) is true, since + i40e will not return back mbuf. + + A performance tip for tuning zero copy in VM2NIC case is to adjust the + frequency of mbuf free (i.e. adjust tx_free_threshold of i40e driver) to + balance consumer and producer. * Guest memory should be backended with huge pages to achieve better performance. Using 1G page size is the best. @@ -83,6 +87,14 @@ The following is an overview of some key Vhost API functions: of those segments, thus the fewer the segments, the quicker we will get the mapping. NOTE: we may speed it by using tree searching in future. + * zero copy can not work when using vfio-pci with iommu mode currently, this + is because we don't setup iommu dma mapping for guest memory. If you have + to use vfio-pci driver, please insert vfio-pci kernel module in noiommu + mode. + + * The consumer of zero copy mbufs should consume these mbufs as soon as + possible, otherwise it may block the operations in vhost. + - ``RTE_VHOST_USER_IOMMU_SUPPORT`` IOMMU support will be enabled when this flag is set. It is disabled by @@ -92,10 +104,63 @@ The following is an overview of some key Vhost API functions: from accessing memory the virtio device isn't allowed to, when the feature is negotiated and an IOMMU device is declared. - However, this feature enables vhost-user's reply-ack protocol feature, - which implementation is buggy in Qemu v2.7.0-v2.9.0 when doing multiqueue. - Enabling this flag with these Qemu version results in Qemu being blocked - when multiple queue pairs are declared. + - ``RTE_VHOST_USER_POSTCOPY_SUPPORT`` + + Postcopy live-migration support will be enabled when this flag is set. + It is disabled by default. + + Enabling this flag should only be done when the calling application does + not pre-fault the guest shared memory, otherwise migration would fail. + + - ``RTE_VHOST_USER_LINEARBUF_SUPPORT`` + + Enabling this flag forces vhost dequeue function to only provide linear + pktmbuf (no multi-segmented pktmbuf). + + The vhost library by default provides a single pktmbuf for given a + packet, but if for some reason the data doesn't fit into a single + pktmbuf (e.g., TSO is enabled), the library will allocate additional + pktmbufs from the same mempool and chain them together to create a + multi-segmented pktmbuf. + + However, the vhost application needs to support multi-segmented format. + If the vhost application does not support that format and requires large + buffers to be dequeue, this flag should be enabled to force only linear + buffers (see RTE_VHOST_USER_EXTBUF_SUPPORT) or drop the packet. + + It is disabled by default. + + - ``RTE_VHOST_USER_EXTBUF_SUPPORT`` + + Enabling this flag allows vhost dequeue function to allocate and attach + an external buffer to a pktmbuf if the pkmbuf doesn't provide enough + space to store all data. + + This is useful when the vhost application wants to support large packets + but doesn't want to increase the default mempool object size nor to + support multi-segmented mbufs (non-linear). In this case, a fresh buffer + is allocated using rte_malloc() which gets attached to a pktmbuf using + rte_pktmbuf_attach_extbuf(). + + See RTE_VHOST_USER_LINEARBUF_SUPPORT as well to disable multi-segmented + mbufs for application that doesn't support chained mbufs. + + It is disabled by default. + + - ``RTE_VHOST_USER_ASYNC_COPY`` + + Asynchronous data path will be enabled when this flag is set. Async data + path allows applications to register async copy devices (typically + hardware DMA channels) to the vhost queues. Vhost leverages the copy + device registered to free CPU from memory copy operations. A set of + async data path APIs are defined for DPDK applications to make use of + the async capability. Only packets enqueued/dequeued by async APIs are + processed through the async data path. + + Currently this feature is only implemented on split ring enqueue data + path. + + It is disabled by default. * ``rte_vhost_driver_set_features(path, features)`` @@ -160,6 +225,84 @@ The following is an overview of some key Vhost API functions: Receives (dequeues) ``count`` packets from guest, and stored them at ``pkts``. +* ``rte_vhost_crypto_create(vid, cryptodev_id, sess_mempool, socket_id)`` + + As an extension of new_device(), this function adds virtio-crypto workload + acceleration capability to the device. All crypto workload is processed by + DPDK cryptodev with the device ID of ``cryptodev_id``. + +* ``rte_vhost_crypto_free(vid)`` + + Frees the memory and vhost-user message handlers created in + rte_vhost_crypto_create(). + +* ``rte_vhost_crypto_fetch_requests(vid, queue_id, ops, nb_ops)`` + + Receives (dequeues) ``nb_ops`` virtio-crypto requests from guest, parses + them to DPDK Crypto Operations, and fills the ``ops`` with parsing results. + +* ``rte_vhost_crypto_finalize_requests(queue_id, ops, nb_ops)`` + + After the ``ops`` are dequeued from Cryptodev, finalizes the jobs and + notifies the guest(s). + +* ``rte_vhost_crypto_set_zero_copy(vid, option)`` + + Enable or disable zero copy feature of the vhost crypto backend. + +* ``rte_vhost_async_channel_register(vid, queue_id, features, ops)`` + + Register a vhost queue with async copy device channel. + Following device ``features`` must be specified together with the + registration: + + * ``async_inorder`` + + Async copy device can guarantee the ordering of copy completion + sequence. Copies are completed in the same order with that at + the submission time. + + Currently, only ``async_inorder`` capable device is supported by vhost. + + * ``async_threshold`` + + The copy length (in bytes) below which CPU copy will be used even if + applications call async vhost APIs to enqueue/dequeue data. + + Typical value is 512~1024 depending on the async device capability. + + Applications must provide following ``ops`` callbacks for vhost lib to + work with the async copy devices: + + * ``transfer_data(vid, queue_id, descs, opaque_data, count)`` + + vhost invokes this function to submit copy data to the async devices. + For non-async_inorder capable devices, ``opaque_data`` could be used + for identifying the completed packets. + + * ``check_completed_copies(vid, queue_id, opaque_data, max_packets)`` + + vhost invokes this function to get the copy data completed by async + devices. + +* ``rte_vhost_async_channel_unregister(vid, queue_id)`` + + Unregister the async copy device channel from a vhost queue. + +* ``rte_vhost_submit_enqueue_burst(vid, queue_id, pkts, count)`` + + Submit an enqueue request to transmit ``count`` packets from host to guest + by async data path. Enqueue is not guaranteed to finish upon the return of + this API call. + + Applications must not free the packets submitted for enqueue until the + packets are completed. + +* ``rte_vhost_poll_enqueue_completed(vid, queue_id, pkts, count)`` + + Poll enqueue completion status from async data path. Completed packets + are returned to applications through ``pkts``. + Vhost-user Implementations -------------------------- @@ -219,16 +362,16 @@ Guest memory requirement * Memory pre-allocation - For non-zerocopy, guest memory pre-allocation is not a must. This can help - save of memory. If users really want the guest memory to be pre-allocated - (e.g., for performance reason), we can add option ``-mem-prealloc`` when - starting QEMU. Or, we can lock all memory at vhost side which will force - memory to be allocated when mmap at vhost side; option --mlockall in - ovs-dpdk is an example in hand. + For non-zerocopy non-async data path, guest memory pre-allocation is not a + must. This can help save of memory. If users really want the guest memory + to be pre-allocated (e.g., for performance reason), we can add option + ``-mem-prealloc`` when starting QEMU. Or, we can lock all memory at vhost + side which will force memory to be allocated when mmap at vhost side; + option --mlockall in ovs-dpdk is an example in hand. - For zerocopy, we force the VM memory to be pre-allocated at vhost lib when - mapping the guest memory; and also we need to lock the memory to prevent - pages being swapped out to disk. + For async and zerocopy data path, we force the VM memory to be + pre-allocated at vhost lib when mapping the guest memory; and also we need + to lock the memory to prevent pages being swapped out to disk. * Memory sharing @@ -240,3 +383,62 @@ Vhost supported vSwitch reference For more vhost details and how to support vhost in vSwitch, please refer to the vhost example in the DPDK Sample Applications Guide. + +Vhost data path acceleration (vDPA) +----------------------------------- + +vDPA supports selective datapath in vhost-user lib by enabling virtio ring +compatible devices to serve virtio driver directly for datapath acceleration. + +``rte_vhost_driver_attach_vdpa_device`` is used to configure the vhost device +with accelerated backend. + +Also vhost device capabilities are made configurable to adopt various devices. +Such capabilities include supported features, protocol features, queue number. + +Finally, a set of device ops is defined for device specific operations: + +* ``get_queue_num`` + + Called to get supported queue number of the device. + +* ``get_features`` + + Called to get supported features of the device. + +* ``get_protocol_features`` + + Called to get supported protocol features of the device. + +* ``dev_conf`` + + Called to configure the actual device when the virtio device becomes ready. + +* ``dev_close`` + + Called to close the actual device when the virtio device is stopped. + +* ``set_vring_state`` + + Called to change the state of the vring in the actual device when vring state + changes. + +* ``set_features`` + + Called to set the negotiated features to device. + +* ``migration_done`` + + Called to allow the device to response to RARP sending. + +* ``get_vfio_group_fd`` + + Called to get the VFIO group fd of the device. + +* ``get_vfio_device_fd`` + + Called to get the VFIO device fd of the device. + +* ``get_notify_area`` + + Called to get the notify area info of the queue.