X-Git-Url: http://git.droids-corp.org/?a=blobdiff_plain;f=doc%2Fguides%2Fprog_guide%2Fvhost_lib.rst;h=171e0096f6d4ca44de6f122cc12ec173ff47a6f4;hb=e0ad8d2bdafcd74eb960bf96507fed11cc97d58c;hp=a52fa50e9714f375cc6c5db225bbb3ab58e67ea1;hpb=42683a7da7b22891836836d179dd35e1cdac230a;p=dpdk.git

diff --git a/doc/guides/prog_guide/vhost_lib.rst b/doc/guides/prog_guide/vhost_lib.rst
index a52fa50e97..171e0096f6 100644
--- a/doc/guides/prog_guide/vhost_lib.rst
+++ b/doc/guides/prog_guide/vhost_lib.rst
@@ -1,135 +1,444 @@
-..  BSD LICENSE
-    Copyright(c) 2010-2014 Intel Corporation. All rights reserved.
-    All rights reserved.
-
-    Redistribution and use in source and binary forms, with or without
-    modification, are permitted provided that the following conditions
-    are met:
-
-    * Redistributions of source code must retain the above copyright
-    notice, this list of conditions and the following disclaimer.
-    * Redistributions in binary form must reproduce the above copyright
-    notice, this list of conditions and the following disclaimer in
-    the documentation and/or other materials provided with the
-    distribution.
-    * Neither the name of Intel Corporation nor the names of its
-    contributors may be used to endorse or promote products derived
-    from this software without specific prior written permission.
-
-    THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
-    "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
-    LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
-    A PARTICULAR PURPOSE ARE DISCLAIMED.
-    IN NO EVENT SHALL THE COPYRIGHT
-    OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
-    SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
-    LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
-    DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
-    THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
-    (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
-    OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+..  SPDX-License-Identifier: BSD-3-Clause
+    Copyright(c) 2010-2016 Intel Corporation.
 
 Vhost Library
 =============
 
-The vhost library implements a user space vhost driver. It supports both vhost-cuse
-(cuse: user space character device) and vhost-user(user space socket server).
-It also creates, manages and destroys vhost devices for corresponding virtio
-devices in the guest. Vhost supported vSwitch could register callbacks to this
-library, which will be called when a vhost device is activated or deactivated
-by guest virtual machine.
+The vhost library implements a user space virtio net server allowing the user
+to manipulate the virtio ring directly. In other words, it allows the user
+to fetch/put packets from/to the VM virtio net device. To achieve this, a
+vhost library should be able to:
+
+* Access the guest memory:
+
+  For QEMU, this is done by using the ``-object memory-backend-file,share=on,...``
+  option, which means QEMU will create a file to serve as the guest RAM.
+  The ``share=on`` option allows another process to map that file, which
+  means it can access the guest RAM.
+
+* Know all the necessary information about the vring:
+
+  Information such as where the available ring is stored. Vhost defines some
+  messages (passed through a Unix domain socket file) to tell the backend all
+  the information it needs to know how to manipulate the vring.
+
 Vhost API Overview
 ------------------
 
-* Vhost driver registration
+The following is an overview of some key Vhost API functions:
+
+* ``rte_vhost_driver_register(path, flags)``
+
+  This function registers a vhost driver into the system. ``path`` specifies
+  the Unix domain socket file path.
+
+  Currently supported flags are:
+
+  - ``RTE_VHOST_USER_CLIENT``
+
+    DPDK vhost-user will act as the client when this flag is given. See below
+    for an explanation.
+
+  - ``RTE_VHOST_USER_NO_RECONNECT``
+
+    When DPDK vhost-user acts as the client it will keep trying to reconnect
+    to the server (QEMU) until it succeeds. This is useful in two cases:
+
+    * When QEMU is not started yet.
+    * When QEMU restarts (for example due to a guest OS reboot).
+
+    This reconnect option is enabled by default. However, it can be turned off
+    by setting this flag.
+
+  - ``RTE_VHOST_USER_IOMMU_SUPPORT``
+
+    IOMMU support will be enabled when this flag is set. It is disabled by
+    default.
+
+    Enabling this flag makes it possible to use guest vIOMMU to protect vhost
+    from accessing memory the virtio device isn't allowed to, when the feature
+    is negotiated and an IOMMU device is declared.
+
+  - ``RTE_VHOST_USER_POSTCOPY_SUPPORT``
+
+    Postcopy live-migration support will be enabled when this flag is set.
+    It is disabled by default.
+
+    Enabling this flag should only be done when the calling application does
+    not pre-fault the guest shared memory, otherwise migration would fail.
+
+  - ``RTE_VHOST_USER_LINEARBUF_SUPPORT``
+
+    Enabling this flag forces the vhost dequeue function to only provide
+    linear pktmbuf (no multi-segmented pktmbuf).
+
+    The vhost library by default provides a single pktmbuf for a given
+    packet, but if for some reason the data doesn't fit into a single
+    pktmbuf (e.g., TSO is enabled), the library will allocate additional
+    pktmbufs from the same mempool and chain them together to create a
+    multi-segmented pktmbuf.
+
+    However, the vhost application needs to support the multi-segmented
+    format. If the vhost application does not support that format and requires
+    large buffers to be dequeued, this flag should be enabled to force only
+    linear buffers (see RTE_VHOST_USER_EXTBUF_SUPPORT) or drop the packet.
+
+    It is disabled by default.
+
+  - ``RTE_VHOST_USER_EXTBUF_SUPPORT``
+
+    Enabling this flag allows the vhost dequeue function to allocate and
+    attach an external buffer to a pktmbuf if the pktmbuf doesn't provide
+    enough space to store all data.
+
+    This is useful when the vhost application wants to support large packets
+    but doesn't want to increase the default mempool object size nor to
+    support multi-segmented mbufs (non-linear). In this case, a fresh buffer
+    is allocated using rte_malloc() which gets attached to a pktmbuf using
+    rte_pktmbuf_attach_extbuf().
+
+    See RTE_VHOST_USER_LINEARBUF_SUPPORT as well to disable multi-segmented
+    mbufs for applications that don't support chained mbufs.
+
+    It is disabled by default.
+
+  - ``RTE_VHOST_USER_ASYNC_COPY``
+
+    Asynchronous data path will be enabled when this flag is set. Async data
+    path allows applications to register async copy devices (typically
+    hardware DMA channels) to the vhost queues. Vhost leverages the copy
+    device registered to free the CPU from memory copy operations. A set of
+    async data path APIs is defined for DPDK applications to make use of
+    the async capability. Only packets enqueued/dequeued by async APIs are
+    processed through the async data path.
+
+    Currently this feature is only implemented on split ring enqueue data
+    path.
+
+    It is disabled by default.
+
+  - ``RTE_VHOST_USER_NET_COMPLIANT_OL_FLAGS``
+
+    Since v16.04, the vhost library forwards checksum and GSO requests for
+    packets received from a virtio driver by filling Tx offload metadata in
+    the mbuf. This behavior is inconsistent with other drivers but it is left
+    untouched for existing applications that might rely on it.
+
+    This flag disables the legacy behavior and instead asks vhost to simply
+    populate Rx offload metadata in the mbuf.
+
+    It is disabled by default.
+
+* ``rte_vhost_driver_set_features(path, features)``
+
+  This function sets the feature bits the vhost-user driver supports. The
+  vhost-user driver could be vhost-user net, yet it could be something else,
+  say, vhost-user SCSI.
+
+* ``rte_vhost_driver_callback_register(path, vhost_device_ops)``
+
+  This function registers a set of callbacks, to let DPDK applications take
+  the appropriate action when some events happen. The following events are
+  currently supported:
+
+  * ``new_device(int vid)``
+
+    This callback is invoked when a virtio device becomes ready. ``vid``
+    is the vhost device ID.
+
+  * ``destroy_device(int vid)``
+
+    This callback is invoked when a virtio device is paused or shut down.
+
+  * ``vring_state_changed(int vid, uint16_t queue_id, int enable)``
+
+    This callback is invoked when a specific queue's state is changed, for
+    example to enabled or disabled.
+
+  * ``features_changed(int vid, uint64_t features)``
+
+    This callback is invoked when the features are changed. For example,
+    ``VHOST_F_LOG_ALL`` will be set/cleared at the start/end of live
+    migration, respectively.
+
+  * ``new_connection(int vid)``
+
+    This callback is invoked on a new vhost-user socket connection. If DPDK
+    acts as the server the device should not be deleted before the
+    ``destroy_connection`` callback is received.
+
+  * ``destroy_connection(int vid)``
+
+    This callback is invoked when the vhost-user socket connection is closed.
+    It indicates that the device with id ``vid`` is no longer in use and can
+    be safely deleted.
+
+* ``rte_vhost_driver_disable/enable_features(path, features)``
+
+  These functions disable/enable some features. For example, they can be used
+  to disable mergeable buffers and TSO features, both of which are enabled by
+  default.
+
+* ``rte_vhost_driver_start(path)``
+
+  This function triggers the vhost-user negotiation. It should be invoked at
+  the end of initializing a vhost-user driver.
+
+* ``rte_vhost_enqueue_burst(vid, queue_id, pkts, count)``
+
+  Transmits (enqueues) ``count`` packets from host to guest.
+
+* ``rte_vhost_dequeue_burst(vid, queue_id, mbuf_pool, pkts, count)``
+
+  Receives (dequeues) ``count`` packets from guest, and stores them in ``pkts``.
+
+* ``rte_vhost_crypto_create(vid, cryptodev_id, sess_mempool, socket_id)``
+
+  As an extension of new_device(), this function adds virtio-crypto workload
+  acceleration capability to the device. All crypto workload is processed by
+  DPDK cryptodev with the device ID of ``cryptodev_id``.
+
+* ``rte_vhost_crypto_free(vid)``
 
-    rte_vhost_driver_register registers the vhost driver into the system.
-    For vhost-cuse, character device file will be created under the /dev directory.
-    Character device name is specified as the parameter.
-    For vhost-user, a unix domain socket server will be created with the parameter as
-    the local socket path.
+  Frees the memory and vhost-user message handlers created in
+  rte_vhost_crypto_create().
 
-* Vhost session start
+* ``rte_vhost_crypto_fetch_requests(vid, queue_id, ops, nb_ops)``
 
-    rte_vhost_driver_session_start starts the vhost session loop.
-    Vhost session is an infinite blocking loop.
-    Put the session in a dedicate DPDK thread.
+  Receives (dequeues) ``nb_ops`` virtio-crypto requests from guest, parses
+  them to DPDK Crypto Operations, and fills the ``ops`` with parsing results.
 
-* Callback register
+* ``rte_vhost_crypto_finalize_requests(queue_id, ops, nb_ops)``
 
-    Vhost supported vSwitch could call rte_vhost_driver_callback_register to
-    register two callbacks, new_destory and destroy_device.
-    When virtio device is activated or deactivated by guest virtual machine,
-    the callback will be called, then vSwitch could put the device onto data
-    core or remove the device from data core by setting or unsetting
-    VIRTIO_DEV_RUNNING on the device flags.
+  After the ``ops`` are dequeued from Cryptodev, finalizes the jobs and
+  notifies the guest(s).
 
-* Read/write packets from/to guest virtual machine
+* ``rte_vhost_crypto_set_zero_copy(vid, option)``
 
-    rte_vhost_enqueue_burst transmit host packets to guest.
-    rte_vhost_dequeue_burst receives packets from guest.
+  Enable or disable the zero copy feature of the vhost crypto backend.
 
-* Feature enable/disable
+* ``rte_vhost_async_channel_register(vid, queue_id, config, ops)``
 
-    Now one negotiate-able feature in vhost is merge-able.
-    vSwitch could enable/disable this feature for performance consideration.
+  Register an async copy device channel for a vhost queue after the vring
+  is enabled. The following device ``config`` must be specified together
+  with the registration:
 
-Vhost Implementation
---------------------
+  * ``features``
 
-Vhost cuse implementation
-~~~~~~~~~~~~~~~~~~~~~~~~~
-When vSwitch registers the vhost driver, it will register a cuse device driver
-into the system and creates a character device file. This cuse driver will
-receive vhost open/release/IOCTL message from QEMU simulator.
+    This field is used to specify async copy device features.
 
-When the open call is received, vhost driver will create a vhost device for the
-virtio device in the guest.
+    ``RTE_VHOST_ASYNC_INORDER`` indicates that the async copy device
+    guarantees that copies complete in the same order in which they
+    were submitted.
 
-When VHOST_SET_MEM_TABLE IOCTL is received, vhost searches the memory region
-to find the starting user space virtual address that maps the memory of guest
-virtual machine. Through this virtual address and the QEMU pid, vhost could
-find the file QEMU uses to map the guest memory.
-Vhost maps this file into its
-address space, in this way vhost could fully access the guest physical memory,
-which means vhost could access the shared virtio ring and the guest physical
-address specified in the entry of the ring.
+    Currently, only ``RTE_VHOST_ASYNC_INORDER`` capable devices are
+    supported by vhost.
 
-The guest virtual machine tells the vhost whether the virtio device is ready
-for processing or is de-activated through VHOST_NET_SET_BACKEND message.
-The registered callback from vSwitch will be called.
+  Applications must provide the following ``ops`` callbacks for the vhost
+  library to work with the async copy devices:
 
-When the release call is released, vhost will destroy the device.
+  * ``transfer_data(vid, queue_id, descs, opaque_data, count)``
 
-Vhost user implementation
-~~~~~~~~~~~~~~~~~~~~~~~~~
-When vSwitch registers a vhost driver, it will create a unix domain socket server
-into the system. This server will listen for a connection and process the vhost message from
-QEMU simulator.
+    vhost invokes this function to submit copy data to the async devices.
+    For non-async_inorder capable devices, ``opaque_data`` could be used
+    for identifying the completed packets.
 
-When there is a new socket connection, it means a new virtio device has been created in
-the guest virtual machine, and the vhost driver will create a vhost device for this virtio device.
+  * ``check_completed_copies(vid, queue_id, opaque_data, max_packets)``
 
-For messages with a file descriptor, the file descriptor could be directly used in the vhost
-process as it is already installed by unix domain socket.
+    vhost invokes this function to get the copy data completed by async
+    devices.
 
-    * VHOST_SET_MEM_TABLE
-    * VHOST_SET_VRING_KICK
-    * VHOST_SET_VRING_CALL
-    * VHOST_SET_LOG_FD
-    * VHOST_SET_VRING_ERR
+* ``rte_vhost_async_channel_register_thread_unsafe(vid, queue_id, config, ops)``
 
-For VHOST_SET_MEM_TABLE message, QEMU will send us information for each memory region and its
-file descriptor in the ancillary data of the message. The fd is used to map that region.
+  Register an async copy device channel for a vhost queue without
+  performing any locking.
 
-There is no VHOST_NET_SET_BACKEND message as in vhost cuse to signal us whether virtio device
-is ready or should be stopped.
-VHOST_SET_VRING_KICK is used as the signal to put the vhost device onto data plane.
-VHOST_GET_VRING_BASE is used as the signal to remove vhost device from data plane.
+  This function is only safe to call in vhost callback functions
+  (i.e., struct vhost_device_ops).
+
+* ``rte_vhost_async_channel_unregister(vid, queue_id)``
+
+  Unregister the async copy device channel from a vhost queue.
+  Unregistration will fail if the vhost queue has in-flight
+  packets that are not completed.
+
+  Unregistering async copy devices in vring_state_changed() may
+  fail, as this API tries to acquire the spinlock of the vhost
+  queue. The recommended way is to unregister async copy
+  devices for all vhost queues in destroy_device(), when a
+  virtio device is paused or shut down.
+
+* ``rte_vhost_async_channel_unregister_thread_unsafe(vid, queue_id)``
+
+  Unregister the async copy device channel for a vhost queue without
+  performing any locking.
+
+  This function is only safe to call in vhost callback functions
+  (i.e., struct vhost_device_ops).
+
+* ``rte_vhost_submit_enqueue_burst(vid, queue_id, pkts, count, comp_pkts, comp_count)``
+
+  Submit an enqueue request to transmit ``count`` packets from host to guest
+  by async data path.
+  Successfully enqueued packets may either have completed their transfer or
+  still be held by DMA engines when this API call returns: completed packets
+  are returned in ``comp_pkts``, while the others are not guaranteed to
+  have finished.
+
+  Applications must not free the packets submitted for enqueue until the
+  packets are completed.
+
+* ``rte_vhost_poll_enqueue_completed(vid, queue_id, pkts, count)``
+
+  Poll enqueue completion status from async data path. Completed packets
+  are returned to applications through ``pkts``.
+
+* ``rte_vhost_async_get_inflight(vid, queue_id)``
+
+  This function returns the number of in-flight packets for the vhost
+  queue using async acceleration.
+
+* ``rte_vhost_clear_queue_thread_unsafe(vid, queue_id, **pkts, count)``
+
+  Clear in-flight packets that were submitted to the DMA engine in the vhost
+  async data path. Completed packets are returned to applications through
+  ``pkts``.
+
+Vhost-user Implementations
+--------------------------
+
+Vhost-user uses Unix domain sockets for passing messages. This means the DPDK
+vhost-user implementation has two options:
+
+* DPDK vhost-user acts as the server.
+
+  DPDK will create a Unix domain socket server file and listen for
+  connections from the frontend.
+
+  Note, this is the default mode, and the only mode before DPDK v16.07.
+
+
+* DPDK vhost-user acts as the client.
+
+  Unlike the server mode, this mode doesn't create the socket file;
+  it just tries to connect to the server (which is responsible for creating
+  the file instead).
+
+  When the DPDK vhost-user application restarts, DPDK vhost-user will try to
+  connect to the server again. This is how the "reconnect" feature works.
+
+  .. Note::
+     * The "reconnect" feature requires **QEMU v2.7** (or above).
+
+     * The vhost supported features must be exactly the same before and
+       after the restart. For example, if TSO is disabled and then enabled,
+       nothing will work and undefined issues might happen.
+
+No matter which mode is used, once a connection is established, DPDK
+vhost-user will start receiving and processing vhost messages from QEMU.
+
+For messages with a file descriptor, the file descriptor can be used directly
+in the vhost process as it is already installed by the Unix domain socket.
+
+The supported vhost messages are:
+
+* ``VHOST_SET_MEM_TABLE``
+* ``VHOST_SET_VRING_KICK``
+* ``VHOST_SET_VRING_CALL``
+* ``VHOST_SET_LOG_FD``
+* ``VHOST_SET_VRING_ERR``
+
+For the ``VHOST_SET_MEM_TABLE`` message, QEMU will send information for each
+memory region and its file descriptor in the ancillary data of the message.
+The file descriptor is used to map that region.
+
+``VHOST_SET_VRING_KICK`` is used as the signal to put the vhost device into
+the data plane, and ``VHOST_GET_VRING_BASE`` is used as the signal to remove
+the vhost device from the data plane.
 
 When the socket connection is closed, vhost will destroy the device.
 
+Guest memory requirement
+------------------------
+
+* Memory pre-allocation
+
+  For the non-async data path, guest memory pre-allocation is not
+  required, which can help save memory. If users really want the guest memory
+  to be pre-allocated (e.g., for performance reasons), the ``-mem-prealloc``
+  option can be added when starting QEMU. Alternatively, all memory can be
+  locked on the vhost side, which forces it to be allocated at mmap time;
+  the ``--mlockall`` option in ovs-dpdk is one such example.
+
+  For the async data path, the VM memory is forced to be pre-allocated by
+  the vhost library when mapping the guest memory; the memory must also be
+  locked to prevent pages being swapped out to disk.
+
+* Memory sharing
+
+  Make sure the ``share=on`` QEMU option is given. vhost-user will not work
+  with a QEMU version without shared memory mapping.
+
 Vhost supported vSwitch reference
 ---------------------------------
 
-For more vhost details and how to support vhost in vSwitch, please refer to vhost example in the
-DPDK Sample Applications Guide.
+For more vhost details and how to support vhost in vSwitch, please refer to
+the vhost example in the DPDK Sample Applications Guide.
+
+Vhost data path acceleration (vDPA)
+-----------------------------------
+
+vDPA supports selective datapath in vhost-user lib by enabling virtio ring
+compatible devices to serve the virtio driver directly for datapath
+acceleration.
+
+``rte_vhost_driver_attach_vdpa_device`` is used to configure the vhost device
+with an accelerated backend.
+
+Also vhost device capabilities are made configurable to adapt to various
+devices. Such capabilities include supported features, protocol features,
+and queue number.
+
+Finally, a set of device ops is defined for device specific operations:
+
+* ``get_queue_num``
+
+  Called to get the supported queue number of the device.
+
+* ``get_features``
+
+  Called to get the supported features of the device.
+
+* ``get_protocol_features``
+
+  Called to get the supported protocol features of the device.
+
+* ``dev_conf``
+
+  Called to configure the actual device when the virtio device becomes ready.
+
+* ``dev_close``
+
+  Called to close the actual device when the virtio device is stopped.
+
+* ``set_vring_state``
+
+  Called to change the state of the vring in the actual device when vring state
+  changes.
+
+* ``set_features``
+
+  Called to set the negotiated features to the device.
+
+* ``migration_done``
+
+  Called to allow the device to respond to RARP sending.
+
+* ``get_vfio_group_fd``
+
+  Called to get the VFIO group fd of the device.
+
+* ``get_vfio_device_fd``
+
+  Called to get the VFIO device fd of the device.
+
+* ``get_notify_area``
+
+  Called to get the notify area info of the queue.
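
Note on the guest-memory mechanism described in the patch above (the ``share=on`` backing file, and the file descriptor carried in the ancillary data of ``VHOST_SET_MEM_TABLE``): the following is a minimal sketch in plain POSIX C, not DPDK or QEMU code. It assumes a Linux-like system. ``send_fd``, ``recv_fd``, ``demo_share_region``, ``REGION_SIZE`` and the temporary file path are hypothetical names chosen for illustration, and a ``socketpair()`` inside a single process stands in for the real vhost-user socket between QEMU and the backend.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/socket.h>
#include <sys/uio.h>
#include <unistd.h>

#define REGION_SIZE 4096 /* hypothetical size of one "guest memory" region */

/* Send one file descriptor as SCM_RIGHTS ancillary data. */
static int send_fd(int sock, int fd)
{
    char byte = 0; /* ancillary data must travel with at least one data byte */
    struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
    union { char buf[CMSG_SPACE(sizeof(int))]; struct cmsghdr align; } u;
    struct msghdr msg = { 0 };
    struct cmsghdr *cmsg;

    assert(sock >= 0 && fd >= 0);
    memset(&u, 0, sizeof(u));
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = u.buf;
    msg.msg_controllen = sizeof(u.buf);

    cmsg = CMSG_FIRSTHDR(&msg);
    cmsg->cmsg_level = SOL_SOCKET;
    cmsg->cmsg_type = SCM_RIGHTS;
    cmsg->cmsg_len = CMSG_LEN(sizeof(int));
    memcpy(CMSG_DATA(cmsg), &fd, sizeof(int));

    return sendmsg(sock, &msg, 0) == 1 ? 0 : -1;
}

/* Receive one file descriptor; returns it, or -1 on error. */
static int recv_fd(int sock)
{
    char byte;
    struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
    union { char buf[CMSG_SPACE(sizeof(int))]; struct cmsghdr align; } u;
    struct msghdr msg = { 0 };
    struct cmsghdr *cmsg;
    int fd = -1;

    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = u.buf;
    msg.msg_controllen = sizeof(u.buf);

    if (recvmsg(sock, &msg, 0) != 1)
        return -1;
    cmsg = CMSG_FIRSTHDR(&msg);
    if (cmsg == NULL || cmsg->cmsg_level != SOL_SOCKET ||
        cmsg->cmsg_type != SCM_RIGHTS ||
        cmsg->cmsg_len != CMSG_LEN(sizeof(int)))
        return -1;
    memcpy(&fd, CMSG_DATA(cmsg), sizeof(int));
    return fd;
}

/* Returns 0 if the "backend" mapping sees what the "frontend" wrote. */
int demo_share_region(void)
{
    char path[] = "/tmp/guest-ram-XXXXXX"; /* hypothetical backing file */
    const char payload[] = "descriptor ring";
    int sv[2] = { -1, -1 };
    int fd, peer_fd = -1, ret = -1;
    char *front = MAP_FAILED, *back = MAP_FAILED;

    fd = mkstemp(path);
    if (fd < 0)
        return -1;
    unlink(path); /* the open descriptor keeps the file alive */

    if (ftruncate(fd, REGION_SIZE) != 0 ||
        socketpair(AF_UNIX, SOCK_STREAM, 0, sv) != 0)
        goto out;

    /* Frontend side (QEMU's role): map the file and pass its descriptor. */
    front = mmap(NULL, REGION_SIZE, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (front == MAP_FAILED || send_fd(sv[0], fd) != 0)
        goto out;

    /* Backend side (vhost's role): receive the descriptor and map it too. */
    peer_fd = recv_fd(sv[1]);
    if (peer_fd < 0)
        goto out;
    back = mmap(NULL, REGION_SIZE, PROT_READ | PROT_WRITE, MAP_SHARED,
                peer_fd, 0);
    if (back == MAP_FAILED)
        goto out;

    /* A write through one shared mapping is visible through the other. */
    memcpy(front, payload, sizeof(payload));
    ret = memcmp(back, payload, sizeof(payload)) == 0 ? 0 : -1;
out:
    if (back != MAP_FAILED)
        munmap(back, REGION_SIZE);
    if (front != MAP_FAILED)
        munmap(front, REGION_SIZE);
    if (peer_fd >= 0)
        close(peer_fd);
    if (sv[0] >= 0) {
        close(sv[0]);
        close(sv[1]);
    }
    close(fd);
    return ret;
}
```

Calling ``demo_share_region()`` returns 0 when the data written through the first mapping can be read back through the mapping created from the received descriptor. The sketch covers only the descriptor passing and the shared mapping; it leaves out the per-region information that the patch says accompanies each descriptor.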