MLX5 poll mode driver
=====================
The MLX5 poll mode driver library (**librte_net_mlx5**) provides support
for **Mellanox ConnectX-4**, **Mellanox ConnectX-4 Lx**, **Mellanox
ConnectX-5**, **Mellanox ConnectX-6**, **Mellanox ConnectX-6 Dx**, **Mellanox
ConnectX-6 Lx**, **Mellanox BlueField** and **Mellanox BlueField-2** families
of 10/25/40/50/100/200 Gb/s adapters as well as their virtual functions (VF)
in SR-IOV context.
Information and documentation about these adapters can be found on the
`Mellanox website <http://www.mellanox.com>`__.
There is also a `section dedicated to this poll mode driver
<http://www.mellanox.com/page/products_dyn?product_family=209&mtag=pmd_for_dpdk>`__.
Design
------
Besides its dependency on libibverbs (that implies libmlx5 and associated
kernel support), librte_net_mlx5 relies heavily on system calls for control
operations such as querying/updating the MTU and flow control parameters.
For security reasons and robustness, this driver only deals with virtual
memory addresses.

- DevX allows access to firmware objects
- Direct Rules manages flow steering at the low-level hardware layer

Enabling librte_net_mlx5 causes DPDK applications to be linked against
libibverbs.
Features
--------
- Multi arch support: x86_64, POWER8, ARMv8, i686.
- Multiple TX and RX queues.
- Support for scattered TX frames.
- Advanced support for scattered Rx frames with tunable buffer attributes.
- IPv4, IPv6, TCPv4, TCPv6, UDPv4 and UDPv6 RSS on any number of queues.
- RSS using different combinations of fields: L3 only, L4 only or both,
and source only, destination only or both.
- RX VLAN stripping.
- TX VLAN insertion.
- RX CRC stripping configuration.
- TX mbuf fast free offload.
- Promiscuous mode on PF and VF.
- Multicast promiscuous mode on PF and VF.
- Hardware checksum offloads.
- Support for multiple rte_flow groups.
- Per packet no-inline hint flag to disable packet data copying into Tx descriptors.
- Hardware LRO.
- Hairpin.
- Multiple-thread flow insertion.
- Matching on IPv4 Internet Header Length (IHL).
- Matching on GTP extension header with raw encap/decap action.
- Matching on Geneve TLV option header with raw encap/decap action.
- RSS support in sample action.
- E-Switch mirroring and jump.
- E-Switch mirroring and modify.
- 21844 flow priorities for ingress or egress flow groups greater than 0 and for any transfer
  flow group.
- Flow metering, including meter policy API.
- Flow meter hierarchy.
- Flow integrity offload API.
- Connection tracking.
- Sub-Function representors.
- Sub-Function.

Limitations
-----------
- Windows support:

  On Windows, the features are limited:

  - Promiscuous mode is not supported
  - The following rules are supported:

    - IPv4/UDP with CVLAN filtering
    - Unicast MAC filtering

  - Additional rules are supported from WinOF2 version 2.70:

    - IPv4/TCP with CVLAN filtering
    - L4 steering rules for port RSS of UDP, TCP and IP

- For secondary process:
- Forked secondary process not supported.
- When using Verbs flow engine (``dv_flow_en`` = 0), multi-tagged (QinQ) match is not supported.

- When using DV flow engine (``dv_flow_en`` = 1), flow pattern with any VLAN
  specification will match only single-tagged packets unless the ETH item
  ``type`` field is 0x88A8 or the VLAN item ``has_more_vlan`` field is 1.
  The flow rule::

    flow create 0 ingress pattern eth / ipv4 / end ...

  Will match any ipv4 packet.
  The flow rules::

    flow create 0 ingress pattern eth / vlan / end ...
    flow create 0 ingress pattern eth has_vlan is 1 / end ...
    flow create 0 ingress pattern eth type is 0x8100 / end ...

  Will match single-tagged packets only, with any VLAN ID value.
  The flow rules::

    flow create 0 ingress pattern eth type is 0x88A8 / end ...
    flow create 0 ingress pattern eth / vlan has_more_vlan is 1 / end ...

  Will match multi-tagged packets only, with any VLAN ID value.

- A flow pattern with 2 sequential VLAN items is not supported.

- VLAN pop offload command:
  - Flow rules having a VLAN pop offload command as one of their actions and
    lacking a match on VLAN as one of their items are not supported.
  - The command is not supported on egress traffic in NIC mode.
- VLAN push offload is not supported on ingress traffic in NIC mode.
- VLAN set PCP offload is not supported on existing headers.
size and ``txq_inline_min`` settings and may be from 2 (worst case forced by maximal
inline settings) to 58.
- Match on VXLAN supports the following fields only:

  - VNI
  - Last reserved 8-bits

  Last reserved 8-bits matching is only supported when using DV flow
  engine (``dv_flow_en`` = 1).
  For ConnectX-5, the UDP destination port must be the standard one (4789).
  Group zero's behavior may differ depending on FW.
  Matching a value equal to 0 (value & mask) is not supported.
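  As an illustration (assuming the testpmd flow syntax, with the VNI value
  and queue index as arbitrary examples), a VNI match could be expressed
  as::

    flow create 0 ingress pattern eth / ipv4 / udp / vxlan vni is 100 / end actions queue index 0 / end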
- L3 VXLAN and VXLAN-GPE tunnels cannot be supported together with MPLSoGRE and MPLSoUDP.
- Match on Geneve header supports the following fields only:

  - OAM
  - protocol type
  - options length

  Currently, the only supported options length value is 0.

- Match on Geneve TLV option is supported on the following fields:

  - Class
  - Type
  - Length
  - Data

  Only one Class/Type/Length Geneve TLV option is supported per shared device.
  Class/Type/Length fields must be specified as well as masks.
  Class/Type/Length specified masks must be full.
  Matching Geneve TLV option without specifying data is not supported.
  Matching Geneve TLV option with ``data & mask == 0`` is not supported.
- VF: flow rules created on VF devices can only match traffic targeted at the
configured MAC addresses (see ``rte_eth_dev_mac_addr_add()``).
- Match on GTP tunnel header item supports the following fields only:

  - v_pt_rsv_flags: E flag, S flag, PN flag
  - msg_type
  - teid

- Match on GTP extension header only for GTP PDU session container (next
  extension header type = 0x85).
- Match on GTP extension header is not supported in group 0.

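  As an illustration of the GTP matching fields above (the TEID value and
  queue index are arbitrary examples in testpmd flow syntax)::

    flow create 0 ingress pattern eth / ipv4 / udp / gtp teid is 1234 / end actions queue index 0 / end
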
- No Tx metadata goes to the E-Switch steering domain for Flow group 0.
  Flows within group 0 with a set metadata action are rejected by hardware.
the device. In case of ungraceful program termination, some entries may
remain present and should be removed manually by other means.
- Buffer split offload is supported with regular Rx burst routine only,
  no MPRQ feature or vectorized code can be engaged.

- When Multi-Packet Rx queue is configured (``mprq_en``), a Rx packet can be
externally attached to a user-provided mbuf with having EXT_ATTACHED_MBUF in
ol_flags. As the mempool for the external buffer is managed by PMD, all the
reduce the requested Tx size or adjust data inline settings with
``txq_inline_max`` and ``txq_inline_mpw`` devargs keys.
- To enable packet send scheduling on mbuf timestamps, the ``tx_pp``
  parameter must be specified.
  When the PMD sees RTE_MBUF_DYNFLAG_TX_TIMESTAMP_NAME set on a packet
  being sent, it tries to synchronize the time the packet appears on
  the wire with the specified packet timestamp. If the specified timestamp
  is in the past, it is ignored; if it is in the distant future, it is
  capped to some reasonable value (in the range of seconds).
  These specific cases ("too late" and "distant future") can optionally be
  reported via device xstats to assist applications in detecting
  time-related problems.

  The timestamp upper "too-distant-future" limit
  at the moment of invoking the Tx burst routine
  can be estimated as the ``tx_pp`` option (in nanoseconds) multiplied by 2^23.
  Please note that for the testpmd txonly mode,
  the limit is deduced from the expression::

    (n_tx_descriptors / burst_size + 1) * inter_burst_gap

  No packet reordering according to timestamps is performed,
  neither within a packet burst nor between packets; it is entirely the
  application's responsibility to generate packets and their timestamps
  in the desired order. A timestamp can be put only on the first packet
  in a burst to schedule the entire burst.

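  For example, the feature can be exercised with testpmd in txonly mode,
  assuming a device at ``<PCI_BDF>`` and an illustrative 500 ns scheduling
  granularity (placeholder values, not recommendations)::

    dpdk-testpmd -a <PCI_BDF>,tx_pp=500 -- --forward-mode=txonly

  In this mode, the testpmd ``set txtimes`` command can be used to assign
  timestamps to the generated packets.
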
- E-Switch decapsulation Flow:
- can be applied to PF port only.
- The input buffer, providing the removal size, is not validated.
- The buffer size must match the length of the headers to be removed.
- ICMP (code/type/identifier/sequence number) / ICMP6 (code/type) matching, IP-in-IP and MPLS flow matching are all
mutually exclusive features which cannot be supported together
(see :ref:`mlx5_firmware_config`).
TCP header (122B).
- Rx queue with LRO offload enabled, receiving a non-LRO packet, can forward
it with size limited to max LRO size, not to max RX packet length.
  - LRO can be used with outer header of TCP packets of the standard format:
    eth (with or without vlan) / ipv4 or ipv6 / tcp / payload

    Other TCP packets (e.g. with MPLS label) received on an Rx queue with LRO
    enabled will be received with bad checksum.
  - LRO packet aggregation is performed by HW only for packet sizes larger than
    ``lro_min_mss_size``. This value is reported on device start, when debug
    mode is enabled.

- CRC:

  - ``DEV_RX_OFFLOAD_KEEP_CRC`` cannot be supported with decapsulation
    for some NICs (such as ConnectX-6 Dx, ConnectX-6 Lx, and BlueField-2).
    The capability bit ``scatter_fcs_w_decap_disable`` shows NIC support.

- TX mbuf fast free:

  - The fast free offload assumes all mbufs being sent originate from the
    same memory pool and that there are no extra references to the mbufs
    (the reference counter of each mbuf equals 1 on the tx_burst call).
    The latter means there must be no externally attached buffers in the
    mbufs. It is the application's responsibility to provide correct mbufs
    if the fast free offload is engaged. The mlx5 PMD implicitly produces
    mbufs with externally attached buffers if the MPRQ option is enabled,
    hence the fast free offload is neither supported nor advertised if
    MPRQ is enabled.

- Sample flow:

  - Supports ``RTE_FLOW_ACTION_TYPE_SAMPLE`` action only within NIC Rx and
    E-Switch steering domain.
  - For E-Switch Sampling flow with sample ratio > 1, additional actions are not
    supported in the sample actions list.
  - For ConnectX-5, the ``RTE_FLOW_ACTION_TYPE_SAMPLE`` is typically used as
    the first action in the E-Switch egress flow when combined with header
    modify or encapsulation actions.
  - For NIC Rx flow, supports ``MARK``, ``COUNT``, ``QUEUE``, ``RSS`` in the
    sample actions list.
  - For E-Switch mirroring flow, supports ``RAW ENCAP``, ``Port ID``,
    ``VXLAN ENCAP``, ``NVGRE ENCAP`` in the sample actions list.
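  As an illustration in testpmd flow syntax (the indexes, ratio and queue
  numbers are arbitrary examples), a NIC Rx flow mirroring every second
  packet to queue 0 could be written as::

    set sample_actions 0 queue index 0 / end
    flow create 0 ingress pattern eth / end actions sample ratio 2 index 0 / queue index 1 / end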

- Modify Field flow:

  - Supports the 'set' operation only for the ``RTE_FLOW_ACTION_TYPE_MODIFY_FIELD`` action.
  - Modification of an arbitrary place in a packet via the special ``RTE_FLOW_FIELD_START`` Field ID is not supported.
  - Modification of the 802.1Q Tag, VXLAN Network or GENEVE Network IDs is not supported.
  - Encapsulation levels are not supported, can modify outermost header fields only.
  - Offsets must be 32-bits aligned, cannot skip past the boundary of a field.

- The IPv6 header item 'proto' field, indicating the next header protocol,
  should not be set to an extension header.
  If the next header is an extension header, it should not be specified in
  the IPv6 header item 'proto' field.
  The last extension header item 'next header' field can specify the
  following header protocol type.

- Hairpin:

  - Hairpin between two ports supports only manual binding and explicit Tx flow
    mode. For single-port hairpin, all combinations of auto/manual binding and
    explicit/implicit Tx flow mode are supported.
  - Hairpin in switchdev SR-IOV mode is not supported.

- Meter:

  - All the meter colors with drop action will be counted only by the global drop statistics.
  - Yellow detection is only supported with ASO metering.
  - Red color must be with drop action.
  - Meter statistics are supported only for drop case.
  - A meter action created with a pre-defined policy must be the last action in
    the flow, except for the single case where the policy actions are:

    - green: NULL or END.
    - yellow: NULL or END.
    - RED: DROP / END.

  - The only supported meter policy actions:

    - green: QUEUE, RSS, PORT_ID, REPRESENTED_PORT, JUMP, DROP, MARK and SET_TAG.
    - yellow: QUEUE, RSS, PORT_ID, REPRESENTED_PORT, JUMP, DROP, MARK and SET_TAG.
    - RED: must be DROP.

  - Policy actions of RSS for green and yellow should have the same configuration except queues.
  - Meter profile packet mode is supported.
  - Meter profiles of RFC2697, RFC2698 and RFC4115 are supported.

- Integrity:

  - Integrity offload is enabled for the **ConnectX-6** family.
  - Verification bits provided by the hardware are ``l3_ok``, ``ipv4_csum_ok``, ``l4_ok``, ``l4_csum_ok``.
  - ``level`` value 0 references outer headers.
  - Multiple integrity items are not supported in a single flow rule.
  - Flow rule items supplied by the application must explicitly specify the
    network headers referred to by the integrity item.
    For example, if the integrity item mask sets the ``l4_ok`` or ``l4_csum_ok``
    bits, a reference to the L4 network header, TCP or UDP, must be in the rule
    pattern as well::

      flow create 0 ingress pattern integrity level is 0 value mask l3_ok value spec l3_ok / eth / ipv6 / end …

    or::

      flow create 0 ingress pattern integrity level is 0 value mask l4_ok value spec 0 / eth / ipv4 proto is udp / end …

- Connection tracking:

  - Cannot co-exist with ASO meter, ASO age action in a single flow rule.
  - Flow rule insertion rate and memory consumption need more optimization.
  - 256 ports maximum.
  - 4M connections maximum.

- Multi-thread flow insertion:

  - In order to achieve the best insertion rate, the application should manage the flows per lcore.
  - It is better to disable memory reclaim by setting ``reclaim_mem_mode`` to 0 to accelerate the flow object allocation and release with cache.

Statistics
----------
Compilation options
~~~~~~~~~~~~~~~~~~~
The ibverbs libraries can be linked with this PMD in a number of ways,
configured by the ``ibverbs_link`` build option:

- ``shared`` (default): the PMD depends on some .so files.

- ``dlopen``: Split the dependencies glue in a separate library
  loaded when needed by dlopen.
  It makes dependencies on libibverbs and libmlx5 optional,
  and has no performance impact.

- ``static``: Embed the static flavor of the dependencies libibverbs and libmlx5
  in the PMD shared library or the executable static binary.
Environment variables
~~~~~~~~~~~~~~~~~~~~~
- ``MLX5_GLUE_PATH``

  A list of directories in which to search for the rdma-core "glue" plug-in,
  separated by colons or semi-colons.
- ``MLX5_SHUT_UP_BF``

  Configures HW Tx doorbell register as IO-mapped.
Run-time configuration
~~~~~~~~~~~~~~~~~~~~~~
- librte_net_mlx5 brings kernel network interfaces up during initialization
  because it is affected by their state. Forcing them down prevents packet
  reception.
- **ethtool** operations on related kernel interfaces also affect the PMD.
Run as non-root
^^^^^^^^^^^^^^^

In order to run as a non-root user,
some capabilities must be granted to the application::

   setcap cap_sys_admin,cap_net_admin,cap_net_raw,cap_ipc_lock+ep <dpdk-app>

Below are the reasons each capability is needed:

``cap_sys_admin``
   When using physical addresses (PA mode), with Linux >= 4.0,
   for access to ``/proc/self/pagemap``.

``cap_net_admin``
   For device configuration.

``cap_net_raw``
   For raw ethernet queue allocation through kernel driver.

``cap_ipc_lock``
   For DMA memory pinning.

Driver options
^^^^^^^^^^^^^^

- ``rxq_cqe_comp_en`` parameter [int]
  A nonzero value enables the compression of CQE on RX side. This feature
  saves PCI bandwidth and improves performance. Enabled by default.
  Different compression formats are supported in order to achieve the best
  performance for different traffic patterns. The default format depends on
  the Multi-Packet Rx queue configuration: Hash RSS format is used in case
  MPRQ is disabled, Checksum format is used in case MPRQ is enabled.

  Specifying 2 as a ``rxq_cqe_comp_en`` value selects Flow Tag format for
  a better compression rate in case of RTE Flow Mark traffic.
  Specifying 3 as a ``rxq_cqe_comp_en`` value selects Checksum format.
  Specifying 4 as a ``rxq_cqe_comp_en`` value selects L3/L4 Header format for
  a better compression rate in case of mixed TCP/UDP and IPv4/IPv6 traffic.
  CQE compression format selection requires DevX to be enabled. If DevX is
  not enabled/supported, the value is reset to 1 by default.
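  For instance, the L3/L4 Header format could be selected at device probe
  via a device argument (``<PCI_BDF>`` is a placeholder)::

    dpdk-testpmd -a <PCI_BDF>,rxq_cqe_comp_en=4 -- -i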
Supported on:
  - x86_64 with ConnectX-4, ConnectX-4 Lx, ConnectX-5, ConnectX-6, ConnectX-6 Dx,
    ConnectX-6 Lx, BlueField and BlueField-2.
  - POWER9 and ARMv8 with ConnectX-4 Lx, ConnectX-5, ConnectX-6, ConnectX-6 Dx,
    ConnectX-6 Lx, BlueField and BlueField-2.
- ``rxq_pkt_pad_en`` parameter [int]
Supported on:
  - x86_64 with ConnectX-4, ConnectX-4 Lx, ConnectX-5, ConnectX-6, ConnectX-6 Dx,
    ConnectX-6 Lx, BlueField and BlueField-2.
  - POWER8 and ARMv8 with ConnectX-4 Lx, ConnectX-5, ConnectX-6, ConnectX-6 Dx,
    ConnectX-6 Lx, BlueField and BlueField-2.
- ``mprq_en`` parameter [int]
A nonzero value enables configuring Multi-Packet Rx queues. Rx queue is
configured as Multi-Packet RQ if the total number of Rx queues is
  ``rxqs_min_mprq`` or more. Disabled by default.
Multi-Packet Rx Queue (MPRQ a.k.a Striding RQ) can further save PCIe bandwidth
  by posting a single large buffer for multiple packets. Instead of posting a
  buffer per packet, one large buffer is posted to receive multiple packets
  on it. An MPRQ buffer consists of multiple fixed-size strides
  and each stride receives one packet. MPRQ can improve throughput for
  small-packet traffic.
  When MPRQ is enabled, the MTU can be larger than the size of
  a user-provided mbuf even if DEV_RX_OFFLOAD_SCATTER isn't enabled. The PMD will
  configure a stride size large enough to accommodate the MTU as long as the
device allows. Note that this can waste system memory compared to enabling Rx
scatter and multi-segment packet.
The size of Rx queue should be bigger than the number of strides.
- ``mprq_log_stride_size`` parameter [int]

  Log 2 of the size of a stride for Multi-Packet Rx queue. Configuring a smaller
  stride size can save some memory and reduce the probability of a depletion of
  all available strides due to unreleased packets by an application. If the
  configured value is not in the range of device capability, the default value
  will be set with a warning message. The default value is 11, which is 2048
  bytes per stride, valid only if ``mprq_en`` is set. With
  ``mprq_log_stride_size`` set, it is possible for a packet to span across
  multiple strides. This mode allows support of jumbo frames (9K) with MPRQ.
  The memcpy of some packets (or part of a packet if Rx scatter is configured)
  may be required in case there is no space left for a head room at the end of
  a stride, which incurs some performance penalty.

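  As an illustrative combination (the values are placeholder examples, not
  tuning advice), MPRQ with 2048-byte strides could be requested as::

    dpdk-testpmd -a <PCI_BDF>,mprq_en=1,mprq_log_stride_size=11 -- -i
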
- ``mprq_max_memcpy_len`` parameter [int]
The maximum length of packet to memcpy in case of Multi-Packet Rx queue. Rx
it is not recommended and may prevent NIC from sending packets over
some configurations.
  For ConnectX-4 and ConnectX-4 Lx NICs, the automatically configured value
  is insufficient for some traffic, because they require at least all L2 headers
  to be inlined. For example, Q-in-Q adds 4 bytes to the default 18 bytes
  of Ethernet and VLAN, thus ``txq_inline_min`` must be set to 22.
  MPLS would add 4 bytes per label. The final value must account for all
  possible L2 encapsulation headers used in a particular environment.

  Please note that this minimal data inlining disengages the eMPW feature
  (Enhanced Multi-Packet Write), because the latter does not support partial
  packet inlining. This is not very critical since minimal data inlining is mostly required
- ``txq_mpw_en`` parameter [int]
A nonzero value enables Enhanced Multi-Packet Write (eMPW) for ConnectX-5,
  ConnectX-6, ConnectX-6 Dx, ConnectX-6 Lx, BlueField and BlueField-2.
  eMPW allows the Tx burst function to pack up multiple packets
  in a single descriptor session in order to save PCI bandwidth
  and improve performance at the cost of a slightly higher CPU usage.
  When ``txq_inline_mpw`` is set along with ``txq_mpw_en``,
  the Tx burst function copies the entire packet data into the Tx descriptor
  instead of including a pointer to the packet.
  The Enhanced Multi-Packet Write feature is enabled by default if the NIC
  supports it, and can be disabled by explicitly specifying 0 for the
  ``txq_mpw_en`` option.
variable "MLX5_SHUT_UP_BF" value is used. If there is no "MLX5_SHUT_UP_BF",
the default ``tx_db_nc`` value is zero for ARM64 hosts and one for others.
- ``tx_pp`` parameter [int]

  If a nonzero value is specified, the driver creates all the necessary internal
  objects to provide accurate packet send scheduling on mbuf timestamps.
  A positive value specifies the scheduling granularity in nanoseconds;
  packet sending will be accurate up to the specified granularity. The allowed
  range is from 500 to 1 million nanoseconds. A negative value specifies the
  modulo of granularity and engages a special test mode to check the scheduling
  rate. By default (if ``tx_pp`` is not specified), the send scheduling on
  timestamps feature is disabled.

- ``tx_skew`` parameter [int]

  The parameter adjusts the send packet scheduling on timestamps and represents
  the average delay between the beginning of the transmit descriptor processing
  by the hardware and the appearance of the actual packet data on the wire. The
  value should be provided in nanoseconds and is valid only if the ``tx_pp``
  parameter is specified. The default value is zero.

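  For example, both parameters could be combined in one device argument
  string (the granularity and skew values below are placeholders only)::

    dpdk-testpmd -a <PCI_BDF>,tx_pp=500,tx_skew=8 -- -i
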
- ``tx_vec_en`` parameter [int]
  A nonzero value enables Tx vector on ConnectX-5, ConnectX-6, ConnectX-6 Dx,
  ConnectX-6 Lx, BlueField and BlueField-2 NICs
  if the number of global Tx queues on the port is less than ``txqs_max_vec``.
  The parameter is deprecated and ignored.
- ``rx_vec_en`` parameter [int]
24 bits. The actual supported width can be retrieved in runtime by
series of rte_flow_validate() trials.
  - 3, this engages tunnel offload mode. In E-Switch configuration, that
    mode implicitly activates ``dv_xmeta_en=1``.

  +------+-----------+-----------+-------------+-------------+
  | Mode | ``MARK``  | ``META``  | ``META`` Tx | FDB/Through |
  +======+===========+===========+=============+=============+
  | 0    | 24 bits   | 32 bits   | 32 bits     | no          |
  +------+-----------+-----------+-------------+-------------+
  | 1    | 24 bits   | vary 0-32 | 32 bits     | yes         |
  +------+-----------+-----------+-------------+-------------+
  | 2    | vary 0-24 | 32 bits   | 32 bits     | yes         |
  +------+-----------+-----------+-------------+-------------+
  If there is no E-Switch configuration, the ``dv_xmeta_en`` parameter is
  ignored and the device is configured to operate in the legacy mode (0)
  of the extensive metadata features. The legacy Verbs supports FLAG and
  MARK metadata actions over NIC Rx steering domain only.
  Setting the META value to zero in a flow action means there is no item
  provided and the receiving datapath will not report in mbufs that metadata
  is present.
  Setting the MARK value to zero in a flow action means the zero FDIR ID value
  will be reported on packet receiving.

  For the MARK action, the last 16 values in the full range are reserved for
  internal PMD purposes (to emulate the FLAG action). The valid range for the
  MARK action values is 0-0xFFEF for the 16-bit mode and 0-0xFFFFEF
  for the 24-bit mode; flows with a MARK action value outside
  the specified range will be rejected.

- ``dv_flow_en`` parameter [int]
  A nonzero value enables the DV flow steering, assuming it is supported.
  Enabled by default if supported.
- ``lacp_by_user`` parameter [int]

  A nonzero value enables the control of LACP traffic by the user application.
  When a bond exists in the driver, by default it should be managed by the
  kernel and therefore LACP traffic should be steered to the kernel.
  If this devarg is set to 1, it allows the user to manage the bond
  directly and not steer LACP traffic to the kernel.

  Disabled by default (set to 0).

- ``mr_ext_memseg_en`` parameter [int]
  A nonzero value enables extending memseg when registering DMA memory.
  Enabled by default.
- ``mr_mempool_reg_en`` parameter [int]

  A nonzero value enables implicit registration of DMA memory of all mempools
  except those having ``RTE_MEMPOOL_F_NON_IO``. This flag is set automatically
  for mempools populated with non-contiguous objects or those without IOVA.
  The effect is that when a packet from a mempool is transmitted,
  its memory is already registered for DMA in the PMD and no registration
  will happen on the data path. The tradeoff is extra work on the creation
  of each mempool and increased HW resource use if some mempools
  are not used with MLX5 devices.

  Enabled by default.

- ``representor`` parameter [list]
This parameter can be used to instantiate DPDK Ethernet devices from
  existing port (PF, VF or SF) representors configured on the device.
  It is a standard parameter whose format is described in
  :ref:`ethernet_device_standard_device_arguments`.

  For instance, to probe VF port representors 0 through 2::

    <PCI_BDF>,representor=vf[0-2]

  To probe SF port representors 0 through 2::

    <PCI_BDF>,representor=sf[0-2]

  To probe VF port representors 0 through 2 on both PFs of a bonding device::

    <Primary_PCI_BDF>,representor=pf[0,1]vf[0-2]
- ``max_dump_files_num`` parameter [int]
If this parameter is not specified, by default PMD will set
the smallest value supported by HW.
- ``hp_buf_log_sz`` parameter [int]

  The total data buffer size of a hairpin queue (logarithmic form), in bytes.
  The PMD will set the data buffer size to 2 ** ``hp_buf_log_sz``, for both Rx and Tx.
  The allowed range of the value is specified by the firmware, and
  initialization will fail if the value is out of range.
  The range of the value is currently from 11 to 19, and the supported frame
  size of a single packet for hairpin is from 512B to 128KB. This might change
  if a different firmware release is used. Using a small value can
  reduce memory consumption but will not work with a large frame. If the value
  is too large, memory consumption will be high and some potential performance
  degradation will be introduced.
  By default, the PMD will set this value to 16, which means that 9KB jumbo
  frames will be supported.

- ``reclaim_mem_mode`` parameter [int]

  Caching some resources on flow destroy makes flow recreation more efficient,
  while some systems may require that all the resources be reclaimed after
  a flow is destroyed.
  The ``reclaim_mem_mode`` parameter provides the option for the user to
  configure whether the resource cache is needed or not.

  There are three options to choose from:

  - 0. Flow resources will be cached as usual. The cached resources help
    the flow insertion rate.

  - 1. Only the DPDK PMD level resources reclaim is enabled.

  - 2. Both DPDK PMD level and rdma-core low level will be configured as
    reclaimed mode.

  By default, the PMD will set this value to 0.

+- ``sys_mem_en`` parameter [int]
+
+ A non-zero value makes the PMD memory management allocate memory from the
+ system by default, without requiring an explicit rte memory flag.
+
+ By default, the PMD will set this value to 0.
+
+- ``decap_en`` parameter [int]
+
+ Some devices do not support FCS (frame checksum) scattering for
+ tunnel-decapsulated packets.
+ If set to 0, this option forces the FCS feature and rejects tunnel
+ decapsulation in the flow engine for such devices.
+
+ By default, the PMD will set this value to 1.
+
+- ``allow_duplicate_pattern`` parameter [int]
+
+ There are two options to choose:
+
+ - 0. Prevent insertion of rules with the same pattern items on non-root table.
+ In this case, only the first rule is inserted and the following rules are
+ rejected and error code EEXIST is returned.
+
+ - 1. Allow insertion of rules with the same pattern items.
+ In this case, all rules are inserted but only the first rule takes effect,
+ the next rule takes effect only if the previous rules are deleted.
+
+ By default, the PMD will set this value to 1.
+
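+As an illustration only (``<PCI_BDF>`` is a placeholder and this particular
+combination of the run-time parameters described above is arbitrary, not a
+recommendation), several device arguments can be passed together in a single
+string::
+
+   dpdk-testpmd -a <PCI_BDF>,reclaim_mem_mode=1,decap_en=0,hp_buf_log_sz=14 -- -i
+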
.. _mlx5_firmware_config:
Firmware configuration
FLEX_PARSER_PROFILE_ENABLE=1
-- enable ICMP/ICMP6 code/type fields matching::
+- enable ICMP(code/type/identifier/sequence number) / ICMP6(code/type) fields matching::
FLEX_PARSER_PROFILE_ENABLE=2
or
FLEX_PARSER_PROFILE_ENABLE=1
+- enable Geneve TLV option flow matching::
+
+ FLEX_PARSER_PROFILE_ENABLE=0
+
- enable GTP flow matching::
FLEX_PARSER_PROFILE_ENABLE=3
-Prerequisites
--------------
+- enable eCPRI flow matching::
+
+ FLEX_PARSER_PROFILE_ENABLE=4
+ PROG_PARSE_GRAPH=1
+
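+For example, assuming ``<mst device>`` stands for the device's MST identifier,
+the eCPRI profile above could be applied with::
+
+   mlxconfig -d <mst device> set FLEX_PARSER_PROFILE_ENABLE=4 PROG_PARSE_GRAPH=1
+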
+Linux Prerequisites
+-------------------
This driver relies on external libraries and kernel drivers for resources
allocations and initialization. The following dependencies are not part of
- **libibverbs**
- User space Verbs framework used by librte_pmd_mlx5. This library provides
+ User space Verbs framework used by librte_net_mlx5. This library provides
a generic interface between the kernel and low-level user space drivers
such as libmlx5.
.. _`Linux installation documentation`: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable.git/plain/Documentation/admin-guide/README.rst
.. _`RDMA Core installation documentation`: https://raw.githubusercontent.com/linux-rdma/rdma-core/master/README.md
-If rdma-core libraries are built but not installed, DPDK makefile can link them,
-thanks to these environment variables:
-
- - ``EXTRA_CFLAGS=-I/path/to/rdma-core/build/include``
- - ``EXTRA_LDFLAGS=-L/path/to/rdma-core/build/lib``
- - ``PKG_CONFIG_PATH=/path/to/rdma-core/build/lib/pkgconfig``
Mellanox OFED/EN
^^^^^^^^^^^^^^^^
Several versions of Mellanox OFED/EN are available. Installing the version
this DPDK release was developed and tested against is strongly
- recommended. Please check the `prerequisites`_.
+ recommended. Please check the `linux prerequisites`_.
+
+Windows Prerequisites
+---------------------
+
+This driver relies on external libraries and kernel drivers for resources
+allocations and initialization. The dependencies in the following sub-sections
+are not part of DPDK, and must be installed separately.
+
+Compilation Prerequisites
+~~~~~~~~~~~~~~~~~~~~~~~~~
+
+DevX SDK installation
+^^^^^^^^^^^^^^^^^^^^^
+
+The DevX SDK must be installed on the machine building the Windows PMD.
+Additional information can be found at
+`How to Integrate Windows DevX in Your Development Environment
+<https://docs.mellanox.com/display/winof2v250/RShim+Drivers+and+Usage#RShimDriversandUsage-DevXInterface>`__.
+
+Runtime Prerequisites
+~~~~~~~~~~~~~~~~~~~~~
+
+WinOF2 version 2.60 or higher must be installed on the machine.
+
+WinOF2 installation
+^^^^^^^^^^^^^^^^^^^
+
+The driver can be downloaded from the following site:
+`WINOF2
+<https://www.mellanox.com/products/adapter-software/ethernet/windows/winof-2>`__
+
+DevX Enablement
+^^^^^^^^^^^^^^^
+
+DevX for Windows must be enabled in the Windows registry.
+The keys ``DevxEnabled`` and ``DevxFsRules`` must be set.
+Additional information can be found in the WinOF2 user manual.
Supported NICs
--------------
- ConnectX-5 Ex
- ConnectX-6
- ConnectX-6 Dx
+ - ConnectX-6 Lx
- BlueField
+ - BlueField-2
Below are detailed device names:
* Mellanox\ |reg| ConnectX\ |reg|-6 200G MCX654106A-HCAT (2x200G)
* Mellanox\ |reg| ConnectX\ |reg|-6 Dx EN 100G MCX623106AN-CDAT (2x100G)
* Mellanox\ |reg| ConnectX\ |reg|-6 Dx EN 200G MCX623105AN-VDAT (1x200G)
+* Mellanox\ |reg| ConnectX\ |reg|-6 Lx EN 25G MCX631102AN-ADAT (2x25G)
Quick Start Guide on OFED/EN
----------------------------
-1. Download latest Mellanox OFED/EN. For more info check the `prerequisites`_.
+1. Download latest Mellanox OFED/EN. For more info check the `linux prerequisites`_.
2. Install the required libraries and kernel modules either by installing
echo [num_vfs] > /sys/class/infiniband/mlx5_0/device/sriov_numvfs
-6. Compile DPDK and you are ready to go. See instructions on
- :ref:`Development Kit Build System <Development_Kit_Build_System>`
+6. Install DPDK and you are ready to go.
+ See :doc:`compilation instructions <../linux_gsg/build_dpdk>`.
Enable switchdev mode
---------------------
-Switchdev mode is a mode in E-Switch, that binds between representor and VF.
-Representor is a port in DPDK that is connected to a VF in such a way
-that assuming there are no offload flows, each packet that is sent from the VF
-will be received by the corresponding representor. While each packet that is
-sent to a representor will be received by the VF.
+Switchdev mode is a mode in E-Switch, that binds between representor and VF or SF.
+Representor is a port in DPDK that is connected to a VF or SF in such a way
+that assuming there are no offload flows, each packet that is sent from the VF or SF
will be received by the corresponding representor. While each packet that is
+sent to a representor will be received by the VF or SF.
This is very useful in case of SRIOV mode, where the first packet that is sent
-by the VF will be received by the DPDK application which will decide if this
+by the VF or SF will be received by the DPDK application which will decide if this
flow should be offloaded to the E-Switch. After offloading the flow, packets
-that the VF that are matching the flow will not be received any more by
+from the VF or SF that match the flow will not be received any more by
the DPDK application.
1. Enable SRIOV mode::
echo -n "<device pci address" > /sys/bus/pci/drivers/mlx5_core/unbind
-5. Enbale switchdev mode::
+5. Enable switchdev mode::
echo switchdev > /sys/class/net/<net device>/compat/devlink/mode
+Sub-Function support
+--------------------
+
+Sub-Function is a portion of the PCI device; a SF netdev has its own
+dedicated queues (txq, rxq).
+A SF shares PCI level resources with other SFs and/or with its parent PCI function.
+
+0. Requirement::
+
+ OFED version >= 5.4-0.3.3.0
+
+1. Configure SF feature::
+
+ # Run mlxconfig on both PFs on host and ECPFs on BlueField.
+ mlxconfig -d <mst device> set PER_PF_NUM_SF=1 PF_TOTAL_SF=252 PF_SF_BAR_SIZE=12
+
+2. Enable switchdev mode::
+
+ mlxdevm dev eswitch set pci/<DBDF> mode switchdev
+
+3. Add SF port::
+
+ mlxdevm port add pci/<DBDF> flavour pcisf pfnum 0 sfnum <sfnum>
+
+ Get SFID from output: pci/<DBDF>/<SFID>
+
+4. Modify MAC address::
+
+ mlxdevm port function set pci/<DBDF>/<SFID> hw_addr <MAC>
+
+5. Activate SF port::
+
+ mlxdevm port function set pci/<DBDF>/<ID> state active
+
+6. Devargs to probe SF device::
+
+ auxiliary:mlx5_core.sf.<num>,dv_flow_en=1
+
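+Putting step 6 together with a testpmd launch (the SF number is a
+placeholder), an SF auxiliary device could be probed as follows::
+
+   dpdk-testpmd -a auxiliary:mlx5_core.sf.<num>,dv_flow_en=1 -- -i
+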
+Sub-Function representor support
+--------------------------------
+
+A SF netdev supports E-Switch representation offload
+similar to PF and VF representors.
+Use <sfnum> to probe SF representor::
+
+ testpmd> port attach <PCI_BDF>,representor=sf<sfnum>,dv_flow_en=1
+
Performance tuning
------------------
for better performance. For VMs, verify that the right CPU
and NUMA node are pinned according to the above. Run::
- lstopo-no-graphics
+ lstopo-no-graphics --merge
to identify the NUMA node to which the PCIe adapter is connected.
- Configure per-lcore cache when creating Mempools for packet buffer.
- Refrain from dynamically allocating/freeing memory in run-time.
+Rx burst functions
+------------------
+
+There are multiple Rx burst functions with different advantages and limitations.
+
+.. table:: Rx burst functions
+
+ +-------------------+------------------------+---------+-----------------+------+-------+
+ || Function Name || Enabler || Scatter|| Error Recovery || CQE || Large|
+ | | | | || comp|| MTU |
+ +===================+========================+=========+=================+======+=======+
+ | rx_burst | rx_vec_en=0 | Yes | Yes | Yes | Yes |
+ +-------------------+------------------------+---------+-----------------+------+-------+
+ | rx_burst_vec | rx_vec_en=1 (default) | No | if CQE comp off | Yes | No |
+ +-------------------+------------------------+---------+-----------------+------+-------+
+ | rx_burst_mprq || mprq_en=1 | No | Yes | Yes | Yes |
+ | || RxQs >= rxqs_min_mprq | | | | |
+ +-------------------+------------------------+---------+-----------------+------+-------+
+ | rx_burst_mprq_vec || rx_vec_en=1 (default) | No | if CQE comp off | Yes | Yes |
+ | || mprq_en=1 | | | | |
+ | || RxQs >= rxqs_min_mprq | | | | |
+ +-------------------+------------------------+---------+-----------------+------+-------+
+
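+For instance, to select ``rx_burst_mprq`` per the table above, MPRQ can be
+enabled while keeping the Rx queue count at or above ``rxqs_min_mprq``
+(the device address and queue counts below are placeholders)::
+
+   dpdk-testpmd -a <PCI_BDF>,mprq_en=1,rxqs_min_mprq=1 -- --rxq=2 --txq=2 -i
+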
.. _mlx5_offloads_support:
Supported hardware offloads
.. table:: Minimal SW/HW versions for queue offloads
- ============== ===== ===== ========= ===== ========== ==========
+ ============== ===== ===== ========= ===== ========== =============
Offload DPDK Linux rdma-core OFED firmware hardware
- ============== ===== ===== ========= ===== ========== ==========
+ ============== ===== ===== ========= ===== ========== =============
common base 17.11 4.14 16 4.2-1 12.21.1000 ConnectX-4
checksums 17.11 4.14 16 4.2-1 12.21.1000 ConnectX-4
Rx timestamp 17.11 4.14 16 4.2-1 12.21.1000 ConnectX-4
TSO 17.11 4.14 16 4.2-1 12.21.1000 ConnectX-4
LRO 19.08 N/A N/A 4.6-4 16.25.6406 ConnectX-5
- ============== ===== ===== ========= ===== ========== ==========
+ Tx scheduling 20.08 N/A N/A 5.1-2 22.28.2006 ConnectX-6 Dx
+ Buffer Split 20.11 N/A N/A 5.1-2 16.28.2006 ConnectX-5
+ ============== ===== ===== ========= ===== ========== =============
.. table:: Minimal SW/HW versions for rte_flow offloads
| | | | | rdma-core 23 |
| | | | | ConnectX-4 |
+-----------------------+-----------------+-----------------+
+ | Shared action | | | | |
+ | | | :numref:`sact`| | :numref:`sact`|
+ | | | | | |
+ | | | | | |
+ +-----------------------+-----------------+-----------------+
+ | | VLAN | | DPDK 19.11 | | DPDK 19.11 |
+ | | (of_pop_vlan / | | OFED 4.7-1 | | OFED 4.7-1 |
+ | | of_push_vlan / | | ConnectX-5 | | ConnectX-5 |
+ | | of_set_vlan_pcp / | | | | |
+ | | of_set_vlan_vid) | | | | |
+ +-----------------------+-----------------+-----------------+
+ | | VLAN | | DPDK 21.05 | | |
+ | | ingress and / | | OFED 5.3 | | N/A |
+ | | of_push_vlan / | | ConnectX-6 Dx | | |
+ +-----------------------+-----------------+-----------------+
+ | | VLAN | | DPDK 21.05 | | |
+ | | egress and / | | OFED 5.3 | | N/A |
+ | | of_pop_vlan / | | ConnectX-6 Dx | | |
+ +-----------------------+-----------------+-----------------+
| Encapsulation | | DPDK 19.05 | | DPDK 19.02 |
| (VXLAN / NVGRE / RAW) | | OFED 4.7-1 | | OFED 4.6 |
| | | rdma-core 24 | | rdma-core 23 |
| | | rdma-core 27 | | rdma-core 27 |
| | | ConnectX-5 | | ConnectX-5 |
+-----------------------+-----------------+-----------------+
+ | Tunnel Offload | | DPDK 20.11 | | DPDK 20.11 |
+ | | | OFED 5.1-2 | | OFED 5.1-2 |
+ | | | rdma-core 32 | | N/A |
+ | | | ConnectX-5 | | ConnectX-5 |
+ +-----------------------+-----------------+-----------------+
| | Header rewrite | | DPDK 19.05 | | DPDK 19.02 |
| | (set_ipv4_src / | | OFED 4.7-1 | | OFED 4.7-1 |
| | set_ipv4_dst / | | rdma-core 24 | | rdma-core 24 |
| | | rdma-core 24 | | rdma-core 23 |
| | | ConnectX-5 | | ConnectX-4 |
+-----------------------+-----------------+-----------------+
+ | Meta data | | DPDK 19.11 | | DPDK 19.11 |
+ | | | OFED 4.7-3 | | OFED 4.7-3 |
+ | | | rdma-core 26 | | rdma-core 26 |
+ | | | ConnectX-5 | | ConnectX-5 |
+ +-----------------------+-----------------+-----------------+
| Port ID | | DPDK 19.05 | | N/A |
| | | OFED 4.7-1 | | N/A |
| | | rdma-core 24 | | N/A |
| | | ConnectX-5 | | N/A |
+-----------------------+-----------------+-----------------+
- | | VLAN | | DPDK 19.11 | | DPDK 19.11 |
- | | (of_pop_vlan / | | OFED 4.7-1 | | OFED 4.7-1 |
- | | of_push_vlan / | | ConnectX-5 | | ConnectX-5 |
- | | of_set_vlan_pcp / | | | | |
- | | of_set_vlan_vid) | | | | |
- +-----------------------+-----------------+-----------------+
| Hairpin | | | | DPDK 19.11 |
| | | N/A | | OFED 4.7-3 |
| | | | | rdma-core 26 |
| | | | | ConnectX-5 |
+-----------------------+-----------------+-----------------+
- | Meta data | | DPDK 19.11 | | DPDK 19.11 |
- | | | OFED 4.7-3 | | OFED 4.7-3 |
- | | | rdma-core 26 | | rdma-core 26 |
- | | | ConnectX-5 | | ConnectX-5 |
+ | 2-port Hairpin | | | | DPDK 20.11 |
+ | | | N/A | | OFED 5.1-2 |
+ | | | | | N/A |
+ | | | | | ConnectX-5 |
+-----------------------+-----------------+-----------------+
| Metering | | DPDK 19.11 | | DPDK 19.11 |
| | | OFED 4.7-3 | | OFED 4.7-3 |
| | | rdma-core 26 | | rdma-core 26 |
| | | ConnectX-5 | | ConnectX-5 |
+-----------------------+-----------------+-----------------+
+ | ASO Metering | | DPDK 21.05 | | DPDK 21.05 |
+ | | | OFED 5.3 | | OFED 5.3 |
+ | | | rdma-core 33 | | rdma-core 33 |
+ | | | ConnectX-6 Dx | | ConnectX-6 Dx |
+ +-----------------------+-----------------+-----------------+
+ | Metering Hierarchy | | DPDK 21.08 | | DPDK 21.08 |
+ | | | OFED 5.3 | | OFED 5.3 |
+ | | | N/A | | N/A |
+ | | | ConnectX-6 Dx | | ConnectX-6 Dx |
+ +-----------------------+-----------------+-----------------+
+ | Sampling | | DPDK 20.11 | | DPDK 20.11 |
+ | | | OFED 5.1-2 | | OFED 5.1-2 |
+ | | | rdma-core 32 | | N/A |
+ | | | ConnectX-5 | | ConnectX-5 |
+ +-----------------------+-----------------+-----------------+
+ | Encapsulation | | DPDK 21.02 | | DPDK 21.02 |
+ | GTP PSC | | OFED 5.2 | | OFED 5.2 |
+ | | | rdma-core 35 | | rdma-core 35 |
+ | | | ConnectX-6 Dx | | ConnectX-6 Dx |
+ +-----------------------+-----------------+-----------------+
+ | Encapsulation | | DPDK 21.02 | | DPDK 21.02 |
+ | GENEVE TLV option | | OFED 5.2 | | OFED 5.2 |
+ | | | rdma-core 34 | | rdma-core 34 |
+ | | | ConnectX-6 Dx | | ConnectX-6 Dx |
+ +-----------------------+-----------------+-----------------+
+ | Modify Field | | DPDK 21.02 | | DPDK 21.02 |
+ | | | OFED 5.2 | | OFED 5.2 |
+ | | | rdma-core 35 | | rdma-core 35 |
+ | | | ConnectX-5 | | ConnectX-5 |
+ +-----------------------+-----------------+-----------------+
+ | Connection tracking | | | | DPDK 21.05 |
+ | | | N/A | | OFED 5.3 |
+ | | | | | rdma-core 35 |
+ | | | | | ConnectX-6 Dx |
+ +-----------------------+-----------------+-----------------+
+
+.. table:: Minimal SW/HW versions for shared action offload
+ :name: sact
+
+ +-----------------------+-----------------+-----------------+
+ | Shared Action | with E-Switch | with NIC |
+ +=======================+=================+=================+
+ | RSS | | | | DPDK 20.11 |
+ | | | N/A | | OFED 5.2 |
+ | | | | | rdma-core 33 |
+ | | | | | ConnectX-5 |
+ +-----------------------+-----------------+-----------------+
+ | Age | | DPDK 20.11 | | DPDK 20.11 |
+ | | | OFED 5.2 | | OFED 5.2 |
+ | | | rdma-core 32 | | rdma-core 32 |
+ | | | ConnectX-6 Dx | | ConnectX-6 Dx |
+ +-----------------------+-----------------+-----------------+
+ | Count | | DPDK 21.05 | | DPDK 21.05 |
+ | | | OFED 4.6 | | OFED 4.6 |
+ | | | rdma-core 24 | | rdma-core 23 |
+ | | | ConnectX-5 | | ConnectX-5 |
+ +-----------------------+-----------------+-----------------+
+
+Notes for metadata
+------------------
+
+MARK and META items are interrelated with the datapath - they may move
+from/to the applications in mbuf fields. Hence, the zero value for these
+items has a special meaning - "no metadata is provided"; non-zero values
+are treated by applications and the PMD as valid ones.
+
+Moreover, in the flow engine domain the value zero is acceptable to match
+and to set, so zero values should be allowed as rte_flow parameters for the
+META and MARK items and actions. At the same time, a zero mask has no
+meaning and should be rejected at the validation stage.
+
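+For example, a non-zero MARK value can be set from testpmd (the value
+0x1234 and the queue index are arbitrary) and is then delivered to the
+application in the mbuf::
+
+   testpmd> flow create 0 ingress pattern eth / end actions mark id 0x1234 / queue index 0 / end
+
+Conversely, a rule that tries to match META or MARK with a zero mask is
+expected to fail validation.
+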
+Notes for rte_flow
+------------------
+
+Flows are not cached in the driver.
+When stopping a device port, all the flows created on this port from the
+application will be flushed automatically in the background.
+After stopping the device port, all flows on this port become invalid and
+not represented in the system.
+All references to these flows held by the application should simply be
+discarded; the flows must not be destroyed or flushed.
+
+The application should re-create the flows as required after the port restart.
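+
+In testpmd terms, a flow created before a port restart has to be created
+again afterwards (the port number and the pattern are illustrative)::
+
+   testpmd> flow create 0 ingress pattern eth / end actions queue index 0 / end
+   testpmd> port stop 0
+   testpmd> port start 0
+   testpmd> flow create 0 ingress pattern eth / end actions queue index 0 / end
+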
Notes for testpmd
-----------------
-Compared to librte_pmd_mlx4 that implements a single RSS configuration per
-port, librte_pmd_mlx5 supports per-protocol RSS configuration.
+Compared to librte_net_mlx4 that implements a single RSS configuration per
+port, librte_net_mlx5 supports per-protocol RSS configuration.
Since ``testpmd`` defaults to IP RSS mode and there is currently no
command-line parameter to enable additional protocols (UDP and TCP as well
as IP), the following commands must be entered from its CLI to get the same
-behavior as librte_pmd_mlx4::
+behavior as librte_net_mlx4::
> port stop all
> port config all rss all
-------------
This section demonstrates how to launch **testpmd** with Mellanox
-ConnectX-4/ConnectX-5/ConnectX-6/BlueField devices managed by librte_pmd_mlx5.
+ConnectX-4/ConnectX-5/ConnectX-6/BlueField devices managed by librte_net_mlx5.
#. Load the kernel modules::
eth32
eth33
-#. Optionally, retrieve their PCI bus addresses for whitelisting::
+#. Optionally, retrieve their PCI bus addresses to be used with the allow list::
{
for intf in eth2 eth3 eth4 eth5;
(cd "/sys/class/net/${intf}/device/" && pwd -P);
done;
} |
- sed -n 's,.*/\(.*\),-w \1,p'
+ sed -n 's,.*/\(.*\),-a \1,p'
Example output::
- -w 0000:05:00.1
- -w 0000:06:00.0
- -w 0000:06:00.1
- -w 0000:05:00.0
+ -a 0000:05:00.1
+ -a 0000:06:00.0
+ -a 0000:06:00.1
+ -a 0000:05:00.0
#. Request huge pages::
- echo 1024 > /sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages/nr_hugepages
+ dpdk-hugepages.py --setup 2G
#. Start testpmd with basic parameters::
- testpmd -l 8-15 -n 4 -w 05:00.0 -w 05:00.1 -w 06:00.0 -w 06:00.1 -- --rxq=2 --txq=2 -i
+ dpdk-testpmd -l 8-15 -n 4 -a 05:00.0 -a 05:00.1 -a 06:00.0 -a 06:00.1 -- --rxq=2 --txq=2 -i
Example output::
[...]
EAL: PCI device 0000:05:00.0 on NUMA socket 0
- EAL: probe driver: 15b3:1013 librte_pmd_mlx5
- PMD: librte_pmd_mlx5: PCI information matches, using device "mlx5_0" (VF: false)
- PMD: librte_pmd_mlx5: 1 port(s) detected
- PMD: librte_pmd_mlx5: port 1 MAC address is e4:1d:2d:e7:0c:fe
+ EAL: probe driver: 15b3:1013 librte_net_mlx5
+ PMD: librte_net_mlx5: PCI information matches, using device "mlx5_0" (VF: false)
+ PMD: librte_net_mlx5: 1 port(s) detected
+ PMD: librte_net_mlx5: port 1 MAC address is e4:1d:2d:e7:0c:fe
EAL: PCI device 0000:05:00.1 on NUMA socket 0
- EAL: probe driver: 15b3:1013 librte_pmd_mlx5
- PMD: librte_pmd_mlx5: PCI information matches, using device "mlx5_1" (VF: false)
- PMD: librte_pmd_mlx5: 1 port(s) detected
- PMD: librte_pmd_mlx5: port 1 MAC address is e4:1d:2d:e7:0c:ff
+ EAL: probe driver: 15b3:1013 librte_net_mlx5
+ PMD: librte_net_mlx5: PCI information matches, using device "mlx5_1" (VF: false)
+ PMD: librte_net_mlx5: 1 port(s) detected
+ PMD: librte_net_mlx5: port 1 MAC address is e4:1d:2d:e7:0c:ff
EAL: PCI device 0000:06:00.0 on NUMA socket 0
- EAL: probe driver: 15b3:1013 librte_pmd_mlx5
- PMD: librte_pmd_mlx5: PCI information matches, using device "mlx5_2" (VF: false)
- PMD: librte_pmd_mlx5: 1 port(s) detected
- PMD: librte_pmd_mlx5: port 1 MAC address is e4:1d:2d:e7:0c:fa
+ EAL: probe driver: 15b3:1013 librte_net_mlx5
+ PMD: librte_net_mlx5: PCI information matches, using device "mlx5_2" (VF: false)
+ PMD: librte_net_mlx5: 1 port(s) detected
+ PMD: librte_net_mlx5: port 1 MAC address is e4:1d:2d:e7:0c:fa
EAL: PCI device 0000:06:00.1 on NUMA socket 0
- EAL: probe driver: 15b3:1013 librte_pmd_mlx5
- PMD: librte_pmd_mlx5: PCI information matches, using device "mlx5_3" (VF: false)
- PMD: librte_pmd_mlx5: 1 port(s) detected
- PMD: librte_pmd_mlx5: port 1 MAC address is e4:1d:2d:e7:0c:fb
+ EAL: probe driver: 15b3:1013 librte_net_mlx5
+ PMD: librte_net_mlx5: PCI information matches, using device "mlx5_3" (VF: false)
+ PMD: librte_net_mlx5: 1 port(s) detected
+ PMD: librte_net_mlx5: port 1 MAC address is e4:1d:2d:e7:0c:fb
Interactive-mode selected
Configuring Port 0 (socket 0)
- PMD: librte_pmd_mlx5: 0x8cba80: TX queues number update: 0 -> 2
- PMD: librte_pmd_mlx5: 0x8cba80: RX queues number update: 0 -> 2
+ PMD: librte_net_mlx5: 0x8cba80: TX queues number update: 0 -> 2
+ PMD: librte_net_mlx5: 0x8cba80: RX queues number update: 0 -> 2
Port 0: E4:1D:2D:E7:0C:FE
Configuring Port 1 (socket 0)
- PMD: librte_pmd_mlx5: 0x8ccac8: TX queues number update: 0 -> 2
- PMD: librte_pmd_mlx5: 0x8ccac8: RX queues number update: 0 -> 2
+ PMD: librte_net_mlx5: 0x8ccac8: TX queues number update: 0 -> 2
+ PMD: librte_net_mlx5: 0x8ccac8: RX queues number update: 0 -> 2
Port 1: E4:1D:2D:E7:0C:FF
Configuring Port 2 (socket 0)
- PMD: librte_pmd_mlx5: 0x8cdb10: TX queues number update: 0 -> 2
- PMD: librte_pmd_mlx5: 0x8cdb10: RX queues number update: 0 -> 2
+ PMD: librte_net_mlx5: 0x8cdb10: TX queues number update: 0 -> 2
+ PMD: librte_net_mlx5: 0x8cdb10: RX queues number update: 0 -> 2
Port 2: E4:1D:2D:E7:0C:FA
Configuring Port 3 (socket 0)
- PMD: librte_pmd_mlx5: 0x8ceb58: TX queues number update: 0 -> 2
- PMD: librte_pmd_mlx5: 0x8ceb58: RX queues number update: 0 -> 2
+ PMD: librte_net_mlx5: 0x8ceb58: TX queues number update: 0 -> 2
+ PMD: librte_net_mlx5: 0x8ceb58: RX queues number update: 0 -> 2
Port 3: E4:1D:2D:E7:0C:FB
Checking link statuses...
Port 0 Link Up - speed 40000 Mbps - full-duplex
.. code-block:: console
- testpmd> flow dump <port> <output_file>
+ To dump all flows:
+ testpmd> flow dump <port> all <output_file>
+ and dump one flow:
+ testpmd> flow dump <port> rule <rule_id> <output_file>
- call rte_flow_dev_dump api:
.. code-block:: console
- rte_flow_dev_dump(port, file, NULL);
+ rte_flow_dev_dump(port, flow, file, NULL);
#. Dump human-readable flows from raw file:
.. code-block:: console
- mlx_steering_dump.py -f <output_file>
+ mlx_steering_dump.py -f <output_file> -flowptr <flow_ptr>
+
+How to share a meter between ports in the same switch domain
+------------------------------------------------------------
+
+This section demonstrates how to use a shared meter. A meter M can be created
+on port X and shared with a port Y on the same switch domain as follows:
+
+.. code-block:: console
+
+ flow create X ingress transfer pattern eth / port_id id is Y / end actions meter mtr_id M / end
+
+How to use meter hierarchy
+--------------------------
+
+This section demonstrates how to create and use a meter hierarchy.
+A termination meter M can be the policy green action of another termination meter N.
+The two meters are thus chained together. Using meter N in a flow will apply
+both meters of the hierarchy on that flow.
+
+.. code-block:: console
+
+ add port meter policy 0 1 g_actions queue index 0 / end y_actions end r_actions drop / end
+ create port meter 0 M 1 1 yes 0xffff 1 0
+ add port meter policy 0 2 g_actions meter mtr_id M / end y_actions end r_actions drop / end
+ create port meter 0 N 2 2 yes 0xffff 1 0
+ flow create 0 ingress group 1 pattern eth / end actions meter mtr_id N / end