DPDK Xen Based Packet-Switching Solution
========================================

DPDK provides a para-virtualization packet switching solution based on the Xen hypervisor's Grant Table (see Note 1 below),
which provides simple and fast packet switching capability between guest domains and the host domain based on MAC address or VLAN tag.

This solution comprises two components:
a Poll Mode Driver (PMD) as the front end in the guest domain and a switching back end in the host domain.
XenStore is used to exchange configuration information between the PMD front end and the switching back end,
including grant reference IDs for the shared Virtio RX/TX rings,
the MAC address, device state, and so on.
XenStore is an information storage space shared between domains; see further information on XenStore below.

The front end PMD can be found in the DPDK directory lib/librte_pmd_xenvirt and the back end example in examples/vhost_xen.

The PMD front end and switching back end use shared Virtio RX/TX rings as a para-virtualized interface.
The Virtio ring is created by the front end, and Grant Table references for the ring are passed to the host.
The switching back end maps those Grant Table references and creates shared rings in a mapped address space.

The following diagram describes the functionality of the DPDK Xen Packet-Switching Solution.

.. _figure_dpdk_xen_pkt_switch:

.. figure:: img/dpdk_xen_pkt_switch.*

   Functionality of the DPDK Xen Packet Switching Solution.

Note 1: The Xen hypervisor uses a mechanism called a Grant Table to share memory between domains
(`http://wiki.xen.org/wiki/Grant Table <http://wiki.xen.org/wiki/Grant%20Table>`_).

A diagram of the design is shown below, where "gva" is the Guest Virtual Address,
which is the data pointer of the mbuf, and "hva" is the Host Virtual Address:

.. _figure_grant_table:

.. figure:: img/grant_table.*

In this design, a Virtio ring is used as a para-virtualized interface for better performance than a Xen private ring
when packet switching to and from a VM.
The additional performance is gained by avoiding the system call and memory map that each memory copy requires with a Xen private ring.

Device Creation
---------------

Poll Mode Driver Front End
~~~~~~~~~~~~~~~~~~~~~~~~~~

* Mbuf pool allocation:

  To use the Xen switching solution, the DPDK application should use rte_mempool_gntalloc_create()
  to reserve mbuf pools during initialization (see the sketch after this list).
  rte_mempool_gntalloc_create() creates a mempool with objects from memory allocated and managed via gntalloc/gntdev.

  The DPDK now supports construction of mempools from allocated virtual memory through the rte_mempool_xmem_create() API.

  This front end constructs mempools based on memory allocated through the xen_gntalloc driver.
  rte_mempool_gntalloc_create() allocates Grant pages, maps them to a contiguous virtual address space,
  and calls rte_mempool_xmem_create() to build the mempool.
  The Grant IDs for all Grant pages are passed to the host through XenStore.

* Virtio Ring Creation:

  The Virtio queue size is defined as 256 by default in the VQ_DESC_NUM macro.
  Using the queue setup function,
  Grant pages are allocated based on the ring size and are mapped to a contiguous virtual address space to form the Virtio ring.
  Normally, one ring is comprised of several pages.
  Their Grant IDs are passed to the host through XenStore.

  There is no requirement that this memory be physically contiguous.

* Interrupt and Kick:

  There are no interrupts in DPDK Xen Switching as both the front and back ends work in polling mode.
  There is no requirement for notification.

* Feature Negotiation:

  Currently, feature negotiation through XenStore is not supported.

* Packet Reception & Transmission:

  With mempools and Virtio rings created, the front end can operate Virtio devices,
  as it does in the Virtio PMD for KVM Virtio devices, with the exception that the host
  does not require notifications or deal with interrupts.

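For illustration, the minimal sketch below shows how a front end application might reserve such an mbuf pool.
It assumes that rte_mempool_gntalloc_create(), declared by the xenvirt PMD headers, takes the same arguments as rte_mempool_create();
the pool name and the NB_MBUF and MBUF_SIZE constants are illustrative rather than taken from the example code.

.. code-block:: c

    #include <stdlib.h>

    #include <rte_debug.h>
    #include <rte_lcore.h>
    #include <rte_mbuf.h>
    #include <rte_mempool.h>

    /* Illustrative sizing constants (not taken from the example code). */
    #define NB_MBUF   8192
    #define MBUF_SIZE (2048 + sizeof(struct rte_mbuf) + RTE_PKTMBUF_HEADROOM)

    /* rte_mempool_gntalloc_create() is declared by the librte_pmd_xenvirt headers;
     * its signature is assumed here to mirror rte_mempool_create(). */

    static struct rte_mempool *
    create_xen_mbuf_pool(void)
    {
        /* Reserve an mbuf pool backed by gntalloc pages so that the host
         * back end can map it via the Grant references published in XenStore. */
        struct rte_mempool *mp = rte_mempool_gntalloc_create("xen_mbuf_pool",
                NB_MBUF, MBUF_SIZE,
                32,                                      /* per-lcore cache size */
                sizeof(struct rte_pktmbuf_pool_private),
                rte_pktmbuf_pool_init, NULL,             /* pool constructor */
                rte_pktmbuf_init, NULL,                  /* per-mbuf constructor */
                rte_socket_id(), 0);

        if (mp == NULL)
            rte_exit(EXIT_FAILURE, "Cannot create gntalloc mbuf pool\n");
        return mp;
    }
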
XenStore is a database that stores guest and host information in the form of (key, value) pairs.
The following is an example of the information generated during the startup of the front end PMD in a guest VM (domain ID 1):

.. code-block:: console

    xenstore -ls /local/domain/1/control/dpdk
    0_mempool_gref="3042,3043,3044,3045"
    0_mempool_va="0x7fcbc6881000"
    0_tx_vring_gref="3049"
    0_rx_vring_gref="3053"
    0_ether_addr="4e:0b:d0:4e:aa:f1"

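For reference, a host process can read any of these keys with libxenstore, which is installed with the xen-devel package.
The short sketch below is illustrative rather than part of the example code; it reads the mempool Grant references published by guest domain 1:

.. code-block:: c

    #include <stdio.h>
    #include <stdlib.h>

    #include <xs.h>   /* libxenstore; the header may be <xenstore.h> on newer Xen */

    int main(void)
    {
        struct xs_handle *xsh = xs_daemon_open();
        unsigned int len;
        char *val;

        if (xsh == NULL) {
            fprintf(stderr, "cannot connect to xenstored\n");
            return EXIT_FAILURE;
        }

        /* Read the Grant references of mempool 0 for guest domain 1
         * (the same key shown in the listing above). */
        val = xs_read(xsh, XBT_NULL, "/local/domain/1/control/dpdk/0_mempool_gref", &len);
        if (val != NULL) {
            printf("0_mempool_gref=%s\n", val);
            free(val);
        }

        xs_daemon_close(xsh);
        return EXIT_SUCCESS;
    }
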
Multiple mempools and multiple Virtio devices may exist in the guest domain; the first number in each key is the index, starting from zero.

The idx#_mempool_va key stores the guest virtual address for mempool idx#.

The idx#_ether_addr key stores the MAC address of the guest Virtio device.

For idx#_rx_vring_gref, idx#_tx_vring_gref, and idx#_mempool_gref, the value is a list of Grant references.
Taking the idx#_mempool_gref node as an example, the host maps those Grant references to a contiguous virtual address space.
The real Grant reference information is stored in this virtual address space,
where (gref, pfn) pairs follow each other with -1 as the terminator.

.. _figure_grant_refs:

.. figure:: img/grant_refs.*

   Mapping Grant references to a contiguous virtual address space

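As a sketch of how that layout can be walked, the fragment below counts the entries up to the -1 terminator.
The structure name and field widths are assumptions for illustration and are not the definitions used in examples/vhost_xen:

.. code-block:: c

    #include <stdint.h>

    /* Assumed layout of one entry in the mapped Grant reference area:
     * (gref, pfn) pairs follow each other, terminated by gref == -1. */
    struct gref_pfn_pair {
        int32_t  gref;   /* Grant reference; -1 terminates the list */
        uint32_t pfn;    /* guest page frame number of the granted page */
    };

    /* Count the Grant pages described in the mapped area. */
    static unsigned int
    count_grant_pages(const struct gref_pfn_pair *tbl)
    {
        unsigned int n = 0;

        while (tbl[n].gref != -1)
            n++;
        return n;
    }
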
After all gref# IDs are retrieved, the host maps them to a contiguous virtual address space.
With the guest mempool virtual address, the host establishes a 1:1 address mapping.
With multiple guest mempools, the host establishes multiple address translation regions.

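Because the mapping is 1:1, a guest buffer address translates to a host address by a constant per-region offset.
The helper below is a hypothetical sketch of that translation; the structure and function names do not appear in the example code:

.. code-block:: c

    #include <stddef.h>
    #include <stdint.h>

    /* One address translation region, i.e. one guest mempool that the host
     * has mapped through its Grant references. */
    struct xen_mem_region {
        uintptr_t guest_va;  /* guest base address (idx#_mempool_va in XenStore) */
        uintptr_t host_va;   /* host address the Grant pages were mapped to */
        size_t    len;       /* total length of the mapped area */
    };

    /* Translate a guest virtual address (gva) into a host virtual address (hva). */
    static void *
    gva_to_hva(const struct xen_mem_region *r, uintptr_t gva)
    {
        if (gva < r->guest_va || gva >= r->guest_va + r->len)
            return NULL;                                   /* outside this region */
        return (void *)(r->host_va + (gva - r->guest_va)); /* constant offset */
    }
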
Switching Back End
~~~~~~~~~~~~~~~~~~

The switching back end monitors changes in XenStore.
When the back end detects that a new Virtio device has been created in a guest domain, it will:

#. Retrieve Grant and configuration information from XenStore.

#. Map and create a Virtio ring.

#. Map mempools in the host and establish address translation between the guest address and host address.

#. Select a free VMDQ pool, set its affinity with the Virtio device, and set the MAC/VLAN filter.

Packet Reception
----------------

When packets arrive from an external network, the MAC/VLAN filter classifies packets into queues in one VMDQ pool.
As each pool is bonded to a Virtio device in some guest domain, the switching back end will:

#. Fetch an available entry from the Virtio RX ring.

#. Get gva, and translate it to hva.

#. Copy the contents of the packet to the memory buffer pointed to by gva.

The DPDK application in the guest domain, based on the PMD front end,
polls the shared Virtio RX ring for available packets and receives them on arrival.

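Since the front end is exposed as a regular DPDK Ethernet device, the guest application polls it with the standard ethdev API.
A minimal sketch, assuming port 0 is the eth_xenvirt0 device and that the port has already been configured and started:

.. code-block:: c

    #include <rte_ethdev.h>
    #include <rte_mbuf.h>

    #define BURST_SIZE 32

    /* Poll the shared Virtio RX ring through the xenvirt PMD and free the
     * packets after processing (port 0 is assumed to be eth_xenvirt0). */
    static void
    rx_poll_loop(uint8_t port_id)
    {
        struct rte_mbuf *bufs[BURST_SIZE];

        for (;;) {
            uint16_t nb_rx = rte_eth_rx_burst(port_id, 0, bufs, BURST_SIZE);
            uint16_t i;

            for (i = 0; i < nb_rx; i++) {
                /* ... process the packet here ... */
                rte_pktmbuf_free(bufs[i]);
            }
        }
    }
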
Packet Transmission
-------------------

When a Virtio device in one guest domain is to transmit a packet,
it puts the virtual address of the packet's data area into the shared Virtio TX ring.

The packet switching back end is continuously polling the Virtio TX ring.
When new packets are available for transmission from a guest, it will:

#. Fetch an available entry from the Virtio TX ring.

#. Get gva, and translate it to hva.

#. Copy the packet from hva to the host mbuf's data area.

#. Compare the destination MAC address with all the MAC addresses of the Virtio devices it manages.
   If a match exists, it directly copies the packet to the matched Virtio RX ring (see the sketch after this list).
   Otherwise, it sends the packet out through hardware.

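The MAC-based decision in the last step can be pictured with the simplified sketch below.
find_dev_by_mac() and enqueue_to_virtio_rx() are hypothetical placeholders for the device lookup and ring-copy logic in examples/vhost_xen:

.. code-block:: c

    #include <rte_ethdev.h>
    #include <rte_ether.h>
    #include <rte_mbuf.h>

    struct virtio_dev;                                                     /* back end device state (placeholder) */
    struct virtio_dev *find_dev_by_mac(const struct ether_addr *mac);      /* hypothetical lookup */
    void enqueue_to_virtio_rx(struct virtio_dev *dev, struct rte_mbuf *m); /* hypothetical ring copy */

    /* Decide where a packet polled from a guest TX ring should go. */
    static void
    switch_packet(uint8_t port_id, struct rte_mbuf *m)
    {
        struct ether_hdr *eh = rte_pktmbuf_mtod(m, struct ether_hdr *);
        struct virtio_dev *dst = find_dev_by_mac(&eh->d_addr);

        if (dst != NULL)
            enqueue_to_virtio_rx(dst, m);        /* VM-to-VM: copy into the matched RX ring */
        else
            rte_eth_tx_burst(port_id, 0, &m, 1); /* no match: send out through the hardware port */
    }
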
The packet switching back end is for demonstration purposes only.
Users can implement their own switching logic based on this example.
In this example, only one physical port on the host is supported.
Multiple segments are not supported. The largest supported mbuf size is 4 KB.
When the back end is restarted, all front ends must also be restarted.

Running the Application
-----------------------

The following describes the steps required to run the application.

Validated Environment
~~~~~~~~~~~~~~~~~~~~~

Host:

    Xen-hypervisor: 4.2.2

    Distribution: Fedora release 18

    Xen development package (including Xen, Xen-libs, xen-devel): 4.2.3

Guest:

    Distribution: Fedora 16 and 18

Xen Host Prerequisites
~~~~~~~~~~~~~~~~~~~~~~

Note that the following commands might not be the same on different Linux* distributions.

* Install the xen-devel package:

  .. code-block:: console

     yum install xen-devel.x86_64

* Start xend if it is not already started:

  .. code-block:: console

     /etc/init.d/xend start

* Mount xenfs if it is not already mounted:

  .. code-block:: console

     mount -t xenfs none /proc/xen

* Enlarge the limit for the xen_gntdev driver:

  .. code-block:: console

     modprobe -r xen_gntdev
     modprobe xen_gntdev limit=1000000

The default limit for earlier versions of the xen_gntdev driver is 1024.
That is insufficient to support the mapping of multiple Virtio devices into multiple VMs,
so it is necessary to enlarge the limit by reloading this module.
The default limit of recent versions of xen_gntdev is 1048576.
A rough calculation of this limit is:

    limit = nb_mbuf# * VM#

In DPDK examples, nb_mbuf# is normally 8192, so with four VMs, for example, the limit should be at least 32768.

Building and Running the Switching Backend
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

#. Edit config/common_linuxapp, and change the default configuration value for the following two items:

   .. code-block:: console

      CONFIG_RTE_LIBRTE_XEN_DOM0=y
      CONFIG_RTE_LIBRTE_PMD_XENVIRT=n

#. Build the package:

   .. code-block:: console

      make install T=x86_64-native-linuxapp-gcc

#. Ensure that RTE_SDK and RTE_TARGET are correctly set. Build the switching example:

   .. code-block:: console

      make -C examples/vhost_xen/

#. Load the Xen DPDK memory management module and preallocate memory:

   .. code-block:: console

      insmod ./x86_64-native-linuxapp-gcc/build/lib/librte_eal/linuxapp/xen_dom0/rte_dom0_mm.ko
      echo 2048 > /sys/kernel/mm/dom0-mm/memsize-mB/memsize

   On Xen Dom0, there is no hugepage support.
   Under Xen Dom0, the DPDK uses a special memory management kernel module
   to allocate chunks of physically contiguous memory.
   Refer to the *DPDK Getting Started Guide* for more information on memory management in the DPDK.
   In the above command, 4 GB of memory is reserved (2048 pages of 2 MB each) for the DPDK.

#. Load uio_pci_generic and bind one Intel NIC controller to it:

   .. code-block:: console

      modprobe uio_pci_generic
      python tools/dpdk-devbind.py -b uio_pci_generic 0000:09:00.0

   In this case, 0000:09:00.0 is the PCI address of the NIC controller.

#. Run the switching back end example:

   .. code-block:: console

      examples/vhost_xen/build/vhost-switch -c f -n 3 --xen-dom0 -- -p1

   The --xen-dom0 option instructs the DPDK to use the Xen kernel module to allocate memory.

The vm2vm parameter enables/disables packet switching in software.
Disabling vm2vm means that packets transmitted by a VM always go out through the Ethernet port
and are not switched to another VM.

The stats parameter controls the printing of Virtio-net device statistics.
The parameter specifies the interval (in seconds) at which to print statistics;
an interval of 0 seconds disables statistics printing.

Xen PMD Frontend Prerequisites
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

#. Install the xen-devel package for accessing XenStore:

   .. code-block:: console

      yum install xen-devel.x86_64

#. Mount xenfs, if it is not already mounted:

   .. code-block:: console

      mount -t xenfs none /proc/xen

#. Enlarge the default limit for the xen_gntalloc driver:

   .. code-block:: console

      modprobe -r xen_gntalloc
      modprobe xen_gntalloc limit=6000

In Linux kernels before version 3.8-rc5 (January 15th, 2013),
a critical defect occurs when a guest heavily allocates Grant pages.
The Grant driver allocates fewer pages than expected, which causes kernel memory corruption.
This happens, for example, when a guest uses the v1 format of a Grant Table entry and allocates
more than 8192 Grant pages (this number might be different on different hypervisor versions).
To work around this issue, set the limit for the gntalloc driver to 6000.
(The kernel normally allocates hundreds of Grant pages with one Xen front end per virtualized device.)
If the kernel allocates a lot of Grant pages, for example, if the user uses multiple net front devices,
it is best to upgrade the gntalloc driver.
This defect has been fixed in kernel version 3.8-rc5 and later.

Building and Running the Front End
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

#. Edit config/common_linuxapp, and change the default configuration value:

   .. code-block:: console

      CONFIG_RTE_LIBRTE_XEN_DOM0=n
      CONFIG_RTE_LIBRTE_PMD_XENVIRT=y

#. Build the package:

   .. code-block:: console

      make install T=x86_64-native-linuxapp-gcc

#. Enable hugepages. Refer to the *DPDK Getting Started Guide* for instructions on
   how to use hugepages in the DPDK.

#. Run TestPMD. Refer to the *DPDK TestPMD Application User Guide* for detailed parameter usage.

   .. code-block:: console

      ./x86_64-native-linuxapp-gcc/app/testpmd -c f -n 4 --vdev="eth_xenvirt0,mac=00:00:00:00:00:11"

As an example, to run two TestPMD instances over two Xen Virtio devices:

.. code-block:: console

    --vdev="eth_xenvirt0,mac=00:00:00:00:00:11" --vdev="eth_xenvirt1,mac=00:00:00:00:00:22"

Usage Examples: Injecting a Packet Stream Using a Packet Generator
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Run TestPMD in a guest VM:

.. code-block:: console

    ./x86_64-native-linuxapp-gcc/app/testpmd -c f -n 4 --vdev="eth_xenvirt0,mac=00:00:00:00:00:11" -- -i --eth-peer=0,00:00:00:00:00:22

Example output from vhost-switch would be:

.. code-block:: console

    DATA:(0) MAC_ADDRESS 00:00:00:00:00:11 and VLAN_TAG 1000 registered.

The above message indicates that device 0 has been registered with MAC address 00:00:00:00:00:11 and VLAN tag 1000.
Any packets received on the NIC with these values are placed on the device's receive queue.

Configure a packet stream in the packet generator, setting the destination MAC address to 00:00:00:00:00:11 and the VLAN to 1000.
The guest Virtio device receives these packets and sends them out with destination MAC address 00:00:00:00:00:22.

Run TestPMD in guest VM1:

.. code-block:: console

    ./x86_64-native-linuxapp-gcc/app/testpmd -c f -n 4 --vdev="eth_xenvirt0,mac=00:00:00:00:00:11" -- -i --eth-peer=0,00:00:00:00:00:22

Run TestPMD in guest VM2:

.. code-block:: console

    ./x86_64-native-linuxapp-gcc/app/testpmd -c f -n 4 --vdev="eth_xenvirt0,mac=00:00:00:00:00:22" -- -i --eth-peer=0,00:00:00:00:00:33

Configure a packet stream in the packet generator, and set the destination MAC address to 00:00:00:00:00:11 and the VLAN to 1000.
The packets received by the Virtio device in guest VM1 will be forwarded to the Virtio device in guest VM2 and
then sent out through hardware with destination MAC address 00:00:00:00:00:33.

The packet flow is: packet generator -> switching back end -> Virtio in guest VM1 -> switching back end -> Virtio in guest VM2 -> switching back end -> wire.