..  SPDX-License-Identifier: BSD-3-Clause
    Copyright(c) 2018 6WIND S.A.

.. _switch_representation:

Switch Representation within DPDK Applications
==============================================

Introduction
------------

Network adapters with multiple physical ports and/or SR-IOV capabilities
usually support the offload of traffic steering rules between their virtual
functions (VFs), sub-functions (SFs), physical functions (PFs) and ports.

Like standard Ethernet switches, this involves a combination of automatic
MAC learning and manual configuration. For most purposes it is managed by
the host system and fully transparent to users and applications.

On the other hand, applications typically found on hypervisors that process
layer 2 (L2) traffic (such as OVS) need to steer traffic themselves
according to their own criteria.

Without a standard software interface to manage traffic steering rules
between VFs, SFs, PFs and the various physical ports of a given device,
applications cannot take advantage of these offloads; software processing is
mandatory even for traffic which ends up re-injected into the device it
originates from.

This document describes how such steering rules can be configured through
the DPDK flow API (**rte_flow**), with emphasis on the SR-IOV use case
(PF/VF steering) using a single physical port for clarity. The same logic
applies to any number of ports without necessarily involving SR-IOV.

Besides SR-IOV, a sub-function (SF) is a portion of the PCI device. An SF
netdev has its own dedicated queues (txq, rxq) and supports E-Switch
representation offload similar to existing PF and VF representors. An SF
shares PCI-level resources with other SFs and/or with its parent PCI
function.

Sub-functions are created on demand and coexist with VFs. The number of SFs
is limited by hardware resources.

Port Representors
-----------------

In many cases, traffic steering rules cannot be determined in advance;
applications usually have to process a bit of traffic in software before
thinking about offloading specific flows to hardware.

Applications therefore need the ability to receive and inject traffic to
various device endpoints (other VFs, SFs, PFs or physical ports) before
connecting them together. Device drivers must provide means to hook the
"other end" of these endpoints and to refer to them when configuring flow
rules.

This role is left to so-called "port representors" (also known as "VF
representors" in the specific context of VFs, "SF representors" in the
specific context of SFs), which are to DPDK what the Ethernet switch
device driver model (**switchdev**) [1]_ is to Linux, and which can be
thought of as a software "patch panel" front-end for applications.

- DPDK port representors are implemented as additional virtual Ethernet
  device (**ethdev**) instances, spawned on an as-needed basis through
  configuration parameters passed to the driver of the underlying
  device::

   -a pci:dbdf,representor=vf0
   -a pci:dbdf,representor=vf[0-3]
   -a pci:dbdf,representor=vf[0,5-11]
   -a pci:dbdf,representor=sf1
   -a pci:dbdf,representor=sf[0-1023]
   -a pci:dbdf,representor=sf[0,2-1023]

- As virtual devices, they may be more limited than their physical
  counterparts, for instance by exposing only a subset of device
  configuration callbacks and/or by not necessarily having Rx/Tx capability.

- Among other things, they can be used to assign MAC addresses to the
  resource they represent.

- Applications can tell port representors apart from other physical or
  virtual ports by checking the ``dev_flags`` field within their device
  information structure for the ``RTE_ETH_DEV_REPRESENTOR`` bit-field.

  .. code-block:: c

     struct rte_eth_dev_info {
         ...
         uint32_t dev_flags; /**< Device flags */
         ...
     };

- The device or group relationship of ports can be discovered using the
  switch ``domain_id`` field within the device's switch information
  structure. By default the switch ``domain_id`` of a port will be
  ``RTE_ETH_DEV_SWITCH_DOMAIN_ID_INVALID`` to indicate that the port doesn't
  support the concept of a switch domain. Ports which do support the concept
  will be allocated a unique switch ``domain_id``, and ports within the same
  switch domain will share the same ``domain_id``. The switch ``port_id`` is
  used to specify the port ID in terms of the switch, so in the case of
  SR-IOV devices the switch ``port_id`` would represent the virtual function
  identifier of the port.

  .. code-block:: c

     /**
      * Ethernet device associated switch information
      */
     struct rte_eth_switch_info {
         const char *name; /**< switch name */
         uint16_t domain_id; /**< switch domain id */
         uint16_t port_id; /**< switch port id */
     };

.. [1] `Ethernet switch device driver model (switchdev)
   <https://www.kernel.org/doc/Documentation/networking/switchdev.txt>`_

- For some PMDs, the memory usage of representors is huge when the number of
  representors grows, because mbufs are allocated for each descriptor of
  every Rx queue. Polling a large number of ports also brings more CPU load,
  cache misses and latency. A shared Rx queue can be used to share one Rx
  queue between the PF and representors in the same Rx domain.
  ``RTE_ETH_DEV_CAPA_RXQ_SHARE`` in the device info indicates this
  capability. Setting a non-zero share group in the Rx queue configuration
  enables sharing, and ``share_qid`` identifies the shared Rx queue within
  the group. Polling any member port can then receive packets of all member
  ports in the group, with the originating port ID saved in the mbuf.

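  As an illustration (a hedged sketch, not part of the original text),
  testpmd exposes this capability through its ``--rxq-share`` parameter;
  an invocation with VF representors and shared Rx queues might look like::

     dpdk-testpmd -a pci:dbdf,representor=vf[0-3] -- -i --rxq-share=2

  Here ``--rxq-share=2`` requests Rx queue sharing grouped per two ports;
  whether sharing actually takes effect depends on the PMD reporting
  ``RTE_ETH_DEV_CAPA_RXQ_SHARE``.
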
Basic SR-IOV
------------

"Basic" in the sense that it is not managed by applications, which
nonetheless expect traffic to flow between the various endpoints and the
outside as if everything was linked by an Ethernet hub.

The following diagram pictures a setup involving a device with one PF, two
VFs and one shared physical port

::

   .-------------.                 .-------------. .-------------.
   | hypervisor  |                 |    VM 1     | |    VM 2     |
   | application |                 | application | | application |
   `--+----------'                 `----------+--' `--+----------'
      |                                      |       |
      |                                      |       |
   .--+--.                               .---+--. .--+---.
   |  PF |                               | VF 1 | | VF 2 |
   `--+--'                               `---+--' `--+---'
      |                                      |       |
      `-----------.     .--------------------'       |
                  |     |     .----------------------'
                  |     |     |
               .--+-----+-----+--.
               | interconnection |
               `--------+--------'
                        |
                   .---------.
                   | physical|
                   |  port 0 |
                   `---------'

- A DPDK application running on the hypervisor owns the PF device, which is
  arbitrarily assigned port index 3.

- Both VFs are assigned to VMs and used by unknown applications; they may be
  DPDK-based or anything else.

- Interconnection is not necessarily done through a true Ethernet switch and
  may not even exist as a separate entity. The role of this block is to show
  that something brings PF, VFs and physical ports together and enables
  communication between them, with a number of built-in restrictions.

Subsequent sections in this document describe means for DPDK applications
running on the hypervisor to freely assign specific flows between PF, VFs
and physical ports based on traffic properties, by managing this
interconnection.

Controlled SR-IOV
-----------------

Initialization
~~~~~~~~~~~~~~

When a DPDK application gets assigned a PF device and is deliberately not
started in `basic SR-IOV`_ mode, any traffic coming from physical ports is
received by the PF according to default rules, while VFs remain isolated.

::

   .-------------.                 .-------------. .-------------.
   | hypervisor  |                 |    VM 1     | |    VM 2     |
   | application |                 | application | | application |
   `--+----------'                 `----------+--' `--+----------'
      |                                      |       |
      |                                      |       |
   .--+--.                               .---+--. .--+---.
   |  PF |                               | VF 1 | | VF 2 |
   `--+--'                               `------' `------'
      |
   .--+----------------------.
   | managed interconnection |
   `------------+------------'
                |
           .---------.
           | physical|
           |  port 0 |
           `---------'

In this mode, interconnection must be configured by the application to
enable VF communication, for instance by explicitly directing traffic with a
given destination MAC address to VF 1 and allowing traffic with the same
source MAC address to come out of it.

For this to work, hypervisor applications need a way to refer to either VF 1
or VF 2 in addition to the PF. This is addressed by `VF representors`_.

VF Representors
~~~~~~~~~~~~~~~

VF representors are virtual but standard DPDK network devices (albeit with
limited capabilities) created by PMDs when managing a PF device.

Since they represent VF instances used by other applications, configuring
them (e.g. assigning a MAC address or setting up promiscuous mode) affects
interconnection accordingly. If supported, they may also be used as two-way
communication ports with VFs (assuming **switchdev** topology).

::

      .-------------.                    .-------------. .-------------.
      | hypervisor  |                    |    VM 1     | |    VM 2     |
      | application |                    | application | | application |
      `--+---+---+--'                    `----------+--' `--+----------'
         |   |   |                                  |       |
         |   |   `-------------------.              |       |
         |   `---------.             |              |       |
         |             |             |              |       |
   .-----+-----. .-----+-----. .-----+-----.        |       |
   | port_id 3 | | port_id 4 | | port_id 5 |        |       |
   `-----+-----' `-----+-----' `-----+-----'        |       |
         |             |             |              |       |
      .--+--.    .-----+-----. .-----+-----.    .---+--. .--+---.
      |  PF |    | VF 1 rep. | | VF 2 rep. |    | VF 1 | | VF 2 |
      `--+--'    `-----+-----' `-----+-----'    `---+--' `--+---'
         |             |             |              |       |
         `-----.       |             |              |       |
               |       |             |              |       |
            .--+-------+-------------+--------------+-------+--.
            |             managed interconnection              |
            `------------------------+-------------------------'
                                     |
                                .----+-----.
                                | physical |
                                |  port 0  |
                                `----------'

- VF representors are assigned arbitrary port indices 4 and 5 in the
  hypervisor application and are respectively associated with VF 1 and VF 2.

- They can't be dissociated; even if VF 1 and VF 2 were not connected,
  representors could still be used for configuration.

- In this context, port index 3 can be thought of as a representor for
  physical port 0.

As previously described, the "interconnection" block represents a logical
concept. Interconnection occurs when hardware configuration enables traffic
flows from one place to another (e.g. physical port 0 to VF 1) according to
configured flow rules.

This is discussed in more detail in `traffic steering`_.

Traffic Steering
~~~~~~~~~~~~~~~~

In the following diagram, each meaningful traffic origin or endpoint as seen
by the hypervisor application is tagged with a unique letter from A to F.

::

      .-------------.                    .-------------. .-------------.
      | hypervisor  |                    |    VM 1     | |    VM 2     |
      | application |                    | application | | application |
      `--+---+---+--'                    `----------+--' `--+----------'
         |   |   |                                  |       |
         |   |   `-------------------.              |       |
         |   `---------.             |              |       |
         |             |             |              |       |
   .----(A)----. .----(B)----. .----(C)----.        |       |
   | port_id 3 | | port_id 4 | | port_id 5 |        |       |
   `-----+-----' `-----+-----' `-----+-----'        |       |
         |             |             |              |       |
      .--+--.    .-----+-----. .-----+-----.    .---+--. .--+---.
      |  PF |    | VF 1 rep. | | VF 2 rep. |    | VF 1 | | VF 2 |
      `--+--'    `-----+-----' `-----+-----'    `--(D)-' `-(E)--'
         |             |             |              |       |
         `-----.       |             |              |       |
               |       |             |              |       |
            .--+-------+-------------+--------------+-------+--.
            |             managed interconnection              |
            `------------------------+-------------------------'
                                     |
                                .---(F)----.
                                | physical |
                                |  port 0  |
                                `----------'

- **A**: PF.
- **B**: port representor for VF 1.
- **C**: port representor for VF 2.
- **D**: VF 1 proper.
- **E**: VF 2 proper.
- **F**: physical port.

Although uncommon, some devices do not enforce a one to one mapping between
PF and physical ports. For instance, by default all ports of **mlx4**
adapters are available to all their PF/VF instances, in which case
additional ports appear next to **F** in the above diagram.

Assuming no interconnection is provided by default in this mode, setting up
a `basic SR-IOV`_ configuration involving physical port 0 could be broken
down into the following rules:

- **A to F**: let everything through.
- **F to A**: PF MAC as destination.

- **A to D**, **E to D** and **F to D**: VF 1 MAC as destination.
- **D to A**: VF 1 MAC as source and PF MAC as destination.
- **D to E**: VF 1 MAC as source and VF 2 MAC as destination.
- **D to F**: VF 1 MAC as source.

- **A to E**, **D to E** and **F to E**: VF 2 MAC as destination.
- **E to A**: VF 2 MAC as source and PF MAC as destination.
- **E to D**: VF 2 MAC as source and VF 1 MAC as destination.
- **E to F**: VF 2 MAC as source.

Devices may additionally support advanced matching criteria such as
IPv4/IPv6 addresses or TCP/UDP ports.

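For instance (a hedged sketch using the testpmd flow syntax introduced
later in this document; actual pattern support varies by device), HTTP
traffic addressed to VF 1 could be singled out before broader MAC rules::

   flow create 3 ingress
      pattern eth dst is {VF 1 MAC} / ipv4 / tcp dst is 80 / end
      actions port_id id 4 / end
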
The combination of matching criteria with target endpoints fits well with
**rte_flow** [6]_, which expresses flow rules as combinations of patterns
and actions.

Enhancing **rte_flow** with the ability to make flow rules match and target
these endpoints provides a standard interface to manage their
interconnection without introducing new concepts and a whole new API to
implement them. This is described in `flow API (rte_flow)`_.

.. [6] :doc:`Generic flow API (rte_flow) <rte_flow>`

Flow API (rte_flow)
-------------------

Extensions
~~~~~~~~~~

Compared to creating a brand new dedicated interface, **rte_flow** was
deemed flexible enough to manage representor traffic only with minor
extensions:

- Using physical ports, PF, SF, VF or port representors as targets.

- Affecting traffic that is not necessarily addressed to the DPDK port ID a
  flow rule is associated with (e.g. forcing VF traffic redirection to PF).

- Rule-based packet counters.

- The ability to combine several identical actions for traffic duplication
  (e.g. VF representor in addition to a physical port).

- Dedicated actions for traffic encapsulation / decapsulation before
  reaching an endpoint.

Traffic Direction
~~~~~~~~~~~~~~~~~

From an application standpoint, "ingress" and "egress" flow rule attributes
apply to the DPDK port ID they are associated with. They select a traffic
direction for matching patterns, but have no impact on actions.

When matching traffic coming from or going to a different place than the
immediate port ID a flow rule is associated with, these attributes keep
their meaning while applying to the chosen origin, as highlighted by the
following diagram

::

      .-------------.                    .-------------. .-------------.
      | hypervisor  |                    |    VM 1     | |    VM 2     |
      | application |                    | application | | application |
      `--+---+---+--'                    `----------+--' `--+----------'
         |   |   |                                  |       |
         |   |   `-------------------.              |       |
         |   `---------.             |              |       |
         | ingress     | ingress     | ingress      |       |
         | egress      | egress      | egress       |       |
   .----(A)----. .----(B)----. .----(C)----.        |       |
   | port_id 3 | | port_id 4 | | port_id 5 |        |       |
   `-----+-----' `-----+-----' `-----+-----'        |       |
         |             |             |              |       |
      .--+--.    .-----+-----. .-----+-----.    .---+--. .--+---.
      |  PF |    | VF 1 rep. | | VF 2 rep. |    | VF 1 | | VF 2 |
      `--+--'    `-----+-----' `-----+-----'    `--(D)-' `-(E)--'
         |             |             |              |       |
         |             |             |      egress  |       | egress
         |             |             |      ingress |       | ingress
         `-----.       |             |              |       |
               |       |             |              |       |
            .--+-------+-------------+--------------+-------+--.
            |             managed interconnection              |
            `------------------------+-------------------------'
                                     |
                                .---(F)----.
                                | physical |
                                |  port 0  |
                                `----------'

Ingress and egress are defined as relative to the application creating the
flow rule.

For instance, matching traffic sent by VM 2 would be done through an ingress
flow rule on VF 2 (**E**). Likewise for incoming traffic on the physical
port (**F**). This also applies to **C** and **A** respectively.

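As a hedged sketch in testpmd flow syntax (port and VF numbers taken from
the diagram above, PMD support assumed), matching traffic sent by VM 2
combines the ingress direction with a VF 2 origin, anticipating the
"transfer" attribute described below::

   testpmd> flow create 3 ingress transfer pattern vf id is 2 / end
      actions queue index 6 / end
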
Transferring Traffic
~~~~~~~~~~~~~~~~~~~~

Without Port Representors
^^^^^^^^^^^^^^^^^^^^^^^^^

`Traffic direction`_ describes how an application could match traffic coming
from or going to a specific place reachable from a DPDK port ID. This makes
sense when the traffic in question is normally seen (i.e. sent or received)
by the application creating the flow rule (e.g. as in "redirect all traffic
coming from VF 1 to local queue 6").

However this does not force such traffic to take a specific route. Creating
a flow rule on **A** matching traffic coming from **D** is only meaningful
if it can be received by **A** in the first place, otherwise doing so simply
has no effect.

A new flow rule attribute named "transfer" is necessary for that. Combining
it with "ingress" or "egress" and a specific origin requests a flow rule to
be applied at the lowest level

::

           ingress only               :      ingress + transfer
                                      :
   .-------------. .-------------.    :  .-------------. .-------------.
   | hypervisor  | |    VM 1     |    :  | hypervisor  | |    VM 1     |
   | application | | application |    :  | application | | application |
   `------+------' `--+----------'    :  `------+------' `--+----------'
          |           |  | traffic    :         |           |  | traffic
    .----(A)----.     |  v            :   .----(A)----.     |  v
    | port_id 3 |     |               :   | port_id 3 |     |
    `-----+-----'     |               :   `-----+-----'     |
          |           |               :         |           |
       .--+--.    .---+--.            :      .--+--.    .---+--.
       |  PF |    | VF 1 |            :      |  PF |    | VF 1 |
       `--+--'    `--(D)-'            :      `--+--'    `--(D)-'
          |           |  | traffic    :         | ^         |  | traffic
          |           |  v            :         | | traffic |  v
       .--+-----------+--.            :      .--+-----------+--.
       | interconnection |            :      | interconnection |
       `--------+--------'            :      `--------+--------'
                |                     :               |
           .---(F)----.               :          .---(F)----.
           | physical |               :          | physical |
           |  port 0  |               :          |  port 0  |
           `----------'               :          `----------'

With "ingress" only, traffic is matched on **A** and thus still goes to
physical port **F** by default

.. code-block:: console

   testpmd> flow create 3 ingress pattern vf id is 1 / end
      actions queue index 6 / end

With "ingress + transfer", traffic is matched on **D** and is therefore
successfully assigned to queue 6 on **A**

.. code-block:: console

   testpmd> flow create 3 ingress transfer pattern vf id is 1 / end
      actions queue index 6 / end

With Port Representors
^^^^^^^^^^^^^^^^^^^^^^

When port representors exist, implicit flow rules with the "transfer"
attribute (described in `without port representors`_) are assumed to
exist between them and their represented resources. These may be immutable.

In this case, traffic is received by default through the representor and
neither the "transfer" attribute nor traffic origin in flow rule patterns
are necessary. They simply have to be created on the representor port
directly and may target a different representor as described in `PORT_ID
action`_.

Implicit traffic flow with port representor

::

      .-------------.         .-------------.
      | hypervisor  |         |    VM 1     |
      | application |         | application |
      `--+-------+--'         `----------+--'
         |       |                       |
         |       `-----.                 |
         |             |                 |
   .----(A)----. .----(B)----.           |
   | port_id 3 | | port_id 4 |           |
   `-----+-----' `-----+-----'           |
         |             |                 |
      .--+--.    .-----+-----.       .---+--.
      |  PF |    | VF 1 rep. |       | VF 1 |
      `--+--'    `-----+-----'       `--(D)-'
         |             |                 |
   .-----|-------------|-----------------|----.
   |     |             |                 |    |
   |     |             `-----------------'    |
   |     |                                    |
   `-----|------------------------------------'
         |
    .---(F)----.
    | physical |
    |  port 0  |
    `----------'

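As a hedged sketch (testpmd syntax, numbers from the diagram above):
redirecting what VF 1 sends to local queue 6 then only takes a plain rule
on its representor (port 4), with no "transfer" attribute or origin item::

   testpmd> flow create 4 ingress pattern eth / end
      actions queue index 6 / end
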
Pattern Items And Actions
~~~~~~~~~~~~~~~~~~~~~~~~~

PORT Pattern Item
^^^^^^^^^^^^^^^^^

Matches traffic originating from (ingress) or going to (egress) a physical
port of the underlying device.

Using this pattern item without specifying a port index matches the physical
port associated with the current DPDK port ID by default. As described in
`traffic steering`_, specifying it should be rarely needed.

- Matches **F** in `traffic steering`_.

PORT Action
^^^^^^^^^^^

Directs matching traffic to a given physical port index.

- Targets **F** in `traffic steering`_.

PORT_ID Pattern Item
^^^^^^^^^^^^^^^^^^^^

Matches traffic originating from (ingress) or going to (egress) a given DPDK
port ID.

Normally only supported if the port ID in question is known by the
underlying PMD and related to the device the flow rule is created against.

This must not be confused with the `PORT pattern item`_ which refers to the
physical port of a device. ``PORT_ID`` refers to a ``struct rte_eth_dev``
object on the application side (also known as "port representor" depending
on the kind of underlying device).

- Matches **A**, **B** or **C** in `traffic steering`_.

PORT_ID Action
^^^^^^^^^^^^^^

Directs matching traffic to a given DPDK port ID.

Same restrictions as `PORT_ID pattern item`_.

- Targets **A**, **B** or **C** in `traffic steering`_.

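A hedged sketch combining the two (testpmd syntax; exact token support
varies by version and PMD): traffic seen on the VF 1 representor (**B**)
redirected to the VF 2 representor (**C**)::

   flow create 3 ingress transfer pattern port_id id is 4 / end
      actions port_id id 5 / end
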
PF Pattern Item
^^^^^^^^^^^^^^^

Matches traffic originating from (ingress) or going to (egress) the physical
function of the current device.

If supported, should work even if the physical function is not managed by
the application and thus not associated with a DPDK port ID. Its behavior is
otherwise similar to `PORT_ID pattern item`_ using PF port ID.

- Matches **A** in `traffic steering`_.

PF Action
^^^^^^^^^

Directs matching traffic to the physical function of the current device.

Same restrictions as `PF pattern item`_.

- Targets **A** in `traffic steering`_.

VF Pattern Item
^^^^^^^^^^^^^^^

Matches traffic originating from (ingress) or going to (egress) a given
virtual function of the current device.

If supported, should work even if the virtual function is not managed by
the application and thus not associated with a DPDK port ID. Its behavior is
otherwise similar to `PORT_ID pattern item`_ using VF port ID.

Note this pattern item does not match VF representor traffic which, as a
separate entity, should be addressed through its own port ID.

- Matches **D** or **E** in `traffic steering`_.

VF Action
^^^^^^^^^

Directs matching traffic to a given virtual function of the current device.

Same restrictions as `VF pattern item`_.

- Targets **D** or **E** in `traffic steering`_.

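A hedged sketch using both (testpmd syntax, PMD support assumed): traffic
coming from VF 1 (**D**) steered directly to VF 2 (**E**)::

   flow create 3 ingress transfer pattern vf id is 1 / end
      actions vf id 2 / end
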
\*_ENCAP actions
^^^^^^^^^^^^^^^^

These actions are named according to the protocol they encapsulate traffic
with (e.g. ``VXLAN_ENCAP``) and using specific parameters (e.g. VNI for
VXLAN).

While they modify traffic and can be used multiple times (order matters),
unlike `PORT_ID action`_ and friends, they have no impact on steering.

As described in `actions order and repetition`_ this means they are useless
if used alone in an action list; the resulting traffic gets dropped unless
combined with either ``PASSTHRU`` or other endpoint-targeting actions.

\*_DECAP actions
^^^^^^^^^^^^^^^^

They perform the reverse of `\*_ENCAP actions`_ by popping protocol headers
from traffic instead of pushing them. They can be used multiple times as
well.

Note that using these actions on non-matching traffic results in undefined
behavior. It is recommended to match the protocol headers to decapsulate on
the pattern side of a flow rule in order to use these actions, or otherwise
make sure only matching traffic goes through.

Actions Order and Repetition
~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Flow rules are currently restricted to at most a single action of each
supported type, performed in an unpredictable order (or all at once). To
repeat actions in a predictable fashion, applications have to make rules
pass-through and use priority levels.

It's now clear that PMD support for chaining multiple non-terminating flow
rules of varying priority levels is prohibitively difficult to implement
compared to simply allowing multiple identical actions performed in a
defined order by a single flow rule.

- This change is required to support protocol encapsulation offloads and the
  ability to perform them multiple times (e.g. VLAN then VXLAN).

- It makes the ``DUP`` action redundant since multiple ``QUEUE`` actions can
  be combined for duplication.

- The (non-)terminating property of actions must be discarded. Instead, flow
  rules themselves must be considered terminating by default (i.e. dropping
  traffic if there is no specific target) unless a ``PASSTHRU`` action is
  present.

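Under this model, duplication is simply repetition; a hedged sketch in
testpmd syntax (``{PF MAC}`` standing for an actual address)::

   flow create 3 ingress
      pattern eth dst is {PF MAC} / end
      actions queue index 1 / queue index 2 / end

Both ``QUEUE`` actions receive a copy of matching traffic.
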
Switching Examples
------------------

This section provides practical examples based on the established testpmd
flow command syntax [2]_, in the context described in `traffic steering`_.

::

      .-------------.                    .-------------. .-------------.
      | hypervisor  |                    |    VM 1     | |    VM 2     |
      | application |                    | application | | application |
      `--+---+---+--'                    `----------+--' `--+----------'
         |   |   |                                  |       |
         |   |   `-------------------.              |       |
         |   `---------.             |              |       |
         |             |             |              |       |
   .----(A)----. .----(B)----. .----(C)----.        |       |
   | port_id 3 | | port_id 4 | | port_id 5 |        |       |
   `-----+-----' `-----+-----' `-----+-----'        |       |
         |             |             |              |       |
      .--+--.    .-----+-----. .-----+-----.    .---+--. .--+---.
      |  PF |    | VF 1 rep. | | VF 2 rep. |    | VF 1 | | VF 2 |
      `--+--'    `-----+-----' `-----+-----'    `--(D)-' `-(E)--'
         |             |             |              |       |
         `-----.       |             |              |       |
               |       |             |              |       |
            .--+-------+-------------+--------------+-------+--.
            |             managed interconnection              |
            `------------------------+-------------------------'
                                     |
                                .---(F)----.
                                | physical |
                                |  port 0  |
                                `----------'

By default, the PF (**A**) can communicate with the physical port it is
associated with (**F**), while VF 1 (**D**) and VF 2 (**E**) are isolated
and restricted to communicate with the hypervisor application through their
respective representors (**B** and **C**) if supported.

Examples in subsequent sections apply to hypervisor applications only and
are based on port representors **A**, **B** and **C**.

.. [2] :ref:`Flow syntax <testpmd_rte_flow>`

Associating VF 1 with Physical Port 0
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Assign all port traffic (**F**) to VF 1 (**D**) indiscriminately through
their representors

::

   flow create 3 ingress pattern / end actions port_id id 4 / end
   flow create 4 ingress pattern / end actions port_id id 3 / end

More practical example with MAC address restrictions

::

   flow create 3 ingress
      pattern eth dst is {VF 1 MAC} / end
      actions port_id id 4 / end

   flow create 4 ingress
      pattern eth src is {VF 1 MAC} / end
      actions port_id id 3 / end

Sharing Broadcasts
~~~~~~~~~~~~~~~~~~

From outside to PF and VFs

::

   flow create 3 ingress
      pattern eth dst is ff:ff:ff:ff:ff:ff / end
      actions port_id id 3 / port_id id 4 / port_id id 5 / end

Note ``port_id id 3`` is necessary, otherwise only VFs would receive
matching traffic.

From PF to outside and VFs

::

   flow create 3 egress
      pattern eth dst is ff:ff:ff:ff:ff:ff / end
      actions port / port_id id 4 / port_id id 5 / end

From VFs to outside and PF

::

   flow create 4 ingress
      pattern eth dst is ff:ff:ff:ff:ff:ff src is {VF 1 MAC} / end
      actions port_id id 3 / port_id id 5 / end

   flow create 5 ingress
      pattern eth dst is ff:ff:ff:ff:ff:ff src is {VF 2 MAC} / end
      actions port_id id 3 / port_id id 4 / end

Similar ``33:33:*`` rules based on known MAC addresses should be added for
IPv6 traffic.

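For example, a hedged sketch of such an IPv6 multicast rule for VF 1
(testpmd syntax; spec/mask matching support varies by device)::

   flow create 4 ingress
      pattern eth dst spec 33:33:00:00:00:00 dst mask ff:ff:00:00:00:00
         src is {VF 1 MAC} / end
      actions port_id id 3 / port_id id 5 / end
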
Encapsulating VF 2 Traffic in VXLAN
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Assuming pass-through flow rules are supported

::

   flow create 5 ingress
      pattern eth src is {VF 2 MAC} / end
      actions vxlan_encap vni 42 / passthru / end

   flow create 5 egress
      pattern vxlan vni is 42 / end
      actions vxlan_decap / passthru / end

Here ``passthru`` is needed since, as described in `actions order and
repetition`_, flow rules are otherwise terminating; if supported, a rule
without a target endpoint will drop traffic.

Without pass-through support, ingress encapsulation on the destination
endpoint might not be supported and the action list must provide one

::

   flow create 5 ingress
      pattern eth src is {VF 2 MAC} / end
      actions vxlan_encap vni 42 / port_id id 3 / end

   flow create 3 ingress
      pattern vxlan vni is 42 / end
      actions vxlan_decap / port_id id 5 / end