MLX5 poll mode driver
=====================
-The MLX5 poll mode driver library (**librte_pmd_mlx5**) provides support
+The MLX5 poll mode driver library (**librte_net_mlx5**) provides support
for **Mellanox ConnectX-4**, **Mellanox ConnectX-4 Lx**, **Mellanox
ConnectX-5**, **Mellanox ConnectX-6**, **Mellanox ConnectX-6 Dx** and
**Mellanox BlueField** families of 10/25/40/50/100/200 Gb/s adapters
There is also a `section dedicated to this poll mode driver
<http://www.mellanox.com/page/products_dyn?product_family=209&mtag=pmd_for_dpdk>`__.
-.. note::
-
- Due to external dependencies, this driver is disabled in default configuration
- of the "make" build. It can be enabled with ``CONFIG_RTE_LIBRTE_MLX5_PMD=y``
- or by using "meson" build system which will detect dependencies.
Design
------
Besides its dependency on libibverbs (that implies libmlx5 and associated
-kernel support), librte_pmd_mlx5 relies heavily on system calls for control
+kernel support), librte_net_mlx5 relies heavily on system calls for control
operations such as querying/updating the MTU and flow control parameters.
For security reasons and robustness, this driver only deals with virtual
- DevX allows to access firmware objects
- Direct Rules manages flow steering at low-level hardware layer
-Enabling librte_pmd_mlx5 causes DPDK applications to be linked against
+Enabling librte_net_mlx5 causes DPDK applications to be linked against
libibverbs.
Features
Will match any ipv4 packet (VLAN included).
+- When using DV flow engine (``dv_flow_en`` = 1), flow pattern without VLAN item
+ will match untagged packets only.
+ The flow rule::
+
+ flow create 0 ingress pattern eth / ipv4 / end ...
+
+ Will match untagged packets only.
+ The flow rule::
+
+ flow create 0 ingress pattern eth / vlan / ipv4 / end ...
+
+ Will match tagged packets only, with any VLAN ID value.
+ The flow rule::
+
+ flow create 0 ingress pattern eth / vlan vid is 3 / ipv4 / end ...
+
+ Will only match tagged packets with VLAN ID 3.
+
- VLAN pop offload command:
- Flow rules having a VLAN pop offload command as one of their actions and
- Flows with a VXLAN Network Identifier equal (or ends to be equal)
to 0 are not supported.
-- VXLAN TSO and checksum offloads are not supported on VM.
-
- L3 VXLAN and VXLAN-GPE tunnels cannot be supported together with MPLSoGRE and MPLSoUDP.
- Match on Geneve header supports the following fields only:
- Match on GTP tunnel header item supports the following fields only:
+ - v_pt_rsv_flags: E flag, S flag, PN flag
- msg_type
- teid
reduce the requested Tx size or adjust data inline settings with
``txq_inline_max`` and ``txq_inline_mpw`` devargs keys.
+- To schedule packet sending on mbuf timestamps, the ``tx_pp``
+ parameter must be specified.
+ When the PMD sees the RTE_MBUF_DYNFLAG_TX_TIMESTAMP_NAME flag set on a packet
+ being sent, it tries to synchronize the time the packet appears on
+ the wire with the specified packet timestamp. If the specified timestamp
+ is in the past, it should be ignored; if it is in the distant future,
+ it should be capped at some reasonable value (in the range of seconds).
+ These specific cases ("too late" and "distant future") can optionally be
+ reported via device xstats to help applications detect
+ time-related problems.
+
+ The timestamp upper "too-distant-future" limit
+ at the moment of invoking the Tx burst routine
+ can be estimated as ``tx_pp`` option (in nanoseconds) multiplied by 2^23.
+ Please note, for the testpmd txonly mode,
+ the limit is deduced from the expression::
+
+ (n_tx_descriptors / burst_size + 1) * inter_burst_gap
+
+ The PMD is not expected to reorder packets according to timestamps,
+ neither within a packet burst nor between packets; it is entirely the
+ application's responsibility to generate packets and their timestamps
+ in the desired order. The timestamp can be put only in the first packet
+ of a burst, in which case it schedules the entire burst.
+
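As a rough worked example of the two ``tx_pp`` estimates above, the arithmetic can be sketched in Python. The function names and the sample values (500 ns granularity, 1024 descriptors, bursts of 32, a gap of 10000) are illustrative assumptions and not part of the PMD or testpmd. Using integer division in the txonly expression is also an assumption.

```python
# Illustrative arithmetic only; these helpers are hypothetical and do not
# exist in DPDK or testpmd.

def tx_pp_future_limit_ns(tx_pp_ns: int) -> int:
    """Estimate the "too-distant-future" limit as tx_pp (ns) * 2^23."""
    if tx_pp_ns <= 0:
        raise ValueError("expected a positive granularity in nanoseconds")
    return tx_pp_ns * (1 << 23)


def txonly_limit(n_tx_descriptors: int, burst_size: int, inter_burst_gap: int) -> int:
    """Evaluate (n_tx_descriptors / burst_size + 1) * inter_burst_gap.

    Integer division is assumed here; the result is in the same unit
    as inter_burst_gap.
    """
    return (n_tx_descriptors // burst_size + 1) * inter_burst_gap


if __name__ == "__main__":
    limit_ns = tx_pp_future_limit_ns(500)
    print(f"tx_pp=500 -> {limit_ns} ns (~{limit_ns / 1e9:.2f} s)")
    print("txonly limit:", txonly_limit(1024, 32, 10000))
```

With a granularity of 500 ns the estimated limit is 500 * 2^23 = 4194304000 ns, roughly 4.2 seconds.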
- E-Switch decapsulation Flow:
- can be applied to PF port only.
- The input buffer, providing the removal size, is not validated.
- The buffer size must match the length of the headers to be removed.
-- ICMP/ICMP6 code/type matching, IP-in-IP and MPLS flow matching are all
+- ICMP(code/type/identifier/sequence number) / ICMP6(code/type) matching, IP-in-IP and MPLS flow matching are all
mutually exclusive features which cannot be supported together
(see :ref:`mlx5_firmware_config`).
TCP header (122B).
- Rx queue with LRO offload enabled, receiving a non-LRO packet, can forward
it with size limited to max LRO size, not to max RX packet length.
+ - LRO can be used with outer header of TCP packets of the standard format:
+ eth (with or without vlan) / ipv4 or ipv6 / tcp / payload
+
+ Other TCP packets (e.g. with MPLS label) received on an Rx queue with LRO enabled will be received with a bad checksum.
+ - LRO packet aggregation is performed by HW only for packet size larger than
+ ``lro_min_mss_size``. This value is reported on device start, when debug
+ mode is enabled.
+
+- CRC:
+
+ - ``DEV_RX_OFFLOAD_KEEP_CRC`` cannot be supported with decapsulation
+ for some NICs (such as ConnectX-6 Dx and BlueField 2).
+ The capability bit ``scatter_fcs_w_decap_disable`` shows NIC support.
+
+- Sample flow:
+
+ - Supports ``RTE_FLOW_ACTION_TYPE_SAMPLE`` action only within NIC Rx and E-Switch steering domain.
+ - The E-Switch Sample flow must have the eswitch_manager VPORT destination (PF or ECPF) and no additional actions.
+ - For ConnectX-5, ``RTE_FLOW_ACTION_TYPE_SAMPLE`` is typically used as the first action in the E-Switch egress flow when combined with header modify or encapsulation actions.
+
+- IPv6 header item 'proto' field, indicating the next header protocol, should
+ not be set as extension header.
+ In case the next header is an extension header, it should not be specified in
+ IPv6 header item 'proto' field.
+ The last extension header item 'next header' field can specify the following
+ header protocol type.
Statistics
----------
Compilation options
~~~~~~~~~~~~~~~~~~~
-These options can be modified in the ``.config`` file.
-
-- ``CONFIG_RTE_LIBRTE_MLX5_PMD`` (default **n**)
-
- Toggle compilation of librte_pmd_mlx5 itself.
-
-- ``CONFIG_RTE_IBVERBS_LINK_DLOPEN`` (default **n**)
+The ibverbs libraries can be linked with this PMD in a number of ways,
+configured by the ``ibverbs_link`` build option:
- Build PMD with additional code to make it loadable without hard
- dependencies on **libibverbs** nor **libmlx5**, which may not be installed
- on the target system.
+- ``shared`` (default): the PMD depends on some .so files.
- In this mode, their presence is still required for it to run properly,
- however their absence won't prevent a DPDK application from starting (with
- ``CONFIG_RTE_BUILD_SHARED_LIB`` disabled) and they won't show up as
- missing with ``ldd(1)``.
+- ``dlopen``: Split the dependencies glue in a separate library
+ loaded when needed by dlopen.
+ It makes dependencies on libibverbs and libmlx5 optional,
+ and has no performance impact.
- It works by moving these dependencies to a purpose-built rdma-core "glue"
- plug-in which must either be installed in a directory whose name is based
- on ``CONFIG_RTE_EAL_PMD_PATH`` suffixed with ``-glue`` if set, or in a
- standard location for the dynamic linker (e.g. ``/lib``) if left to the
- default empty string (``""``).
-
- This option has no performance impact.
-
-- ``CONFIG_RTE_IBVERBS_LINK_STATIC`` (default **n**)
-
- Embed static flavor of the dependencies **libibverbs** and **libmlx5**
+- ``static``: Embed static flavor of the dependencies libibverbs and libmlx5
in the PMD shared library or the executable static binary.
-- ``CONFIG_RTE_LIBRTE_MLX5_DEBUG`` (default **n**)
-
- Toggle debugging code and stricter compilation flags. Enabling this option
- adds additional run-time checks and debugging messages at the cost of
- lower performance.
-
-.. note::
-
- For BlueField, target should be set to ``arm64-bluefield-linux-gcc``. This
- will enable ``CONFIG_RTE_LIBRTE_MLX5_PMD`` and set ``RTE_CACHE_LINE_SIZE`` to
- 64. Default armv8a configuration of make build and meson build set it to 128
- then brings performance degradation.
-
-This option is available in meson:
-
-- ``ibverbs_link`` can be ``static``, ``shared``, or ``dlopen``.
-
Environment variables
~~~~~~~~~~~~~~~~~~~~~
A list of directories in which to search for the rdma-core "glue" plug-in,
separated by colons or semi-colons.
- Only matters when compiled with ``CONFIG_RTE_IBVERBS_LINK_DLOPEN``
- enabled and most useful when ``CONFIG_RTE_EAL_PMD_PATH`` is also set,
- since ``LD_LIBRARY_PATH`` has no effect in this case.
-
- ``MLX5_SHUT_UP_BF``
Configures HW Tx doorbell register as IO-mapped.
Run-time configuration
~~~~~~~~~~~~~~~~~~~~~~
-- librte_pmd_mlx5 brings kernel network interfaces up during initialization
+- librte_net_mlx5 brings kernel network interfaces up during initialization
because it is affected by their state. Forcing them down prevents packets
reception.
- **ethtool** operations on related kernel interfaces also affect the PMD.
+Run as non-root
+^^^^^^^^^^^^^^^
+
+In order to run as a non-root user,
+some capabilities must be granted to the application::
+
+ setcap cap_sys_admin,cap_net_admin,cap_net_raw,cap_ipc_lock+ep <dpdk-app>
+
+Each capability is needed for the following reason:
+
+``cap_sys_admin``
+ When using physical addresses (PA mode), with Linux >= 4.0,
+ for access to ``/proc/self/pagemap``.
+
+``cap_net_admin``
+ For device configuration.
+
+``cap_net_raw``
+ For raw ethernet queue allocation through kernel driver.
+
+``cap_ipc_lock``
+ For DMA memory pinning.
+
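The capability list above can be assembled into the ``setcap`` argument programmatically, for example in a deployment script. The sketch below is a minimal illustration: the helper name, the ``pa_mode`` switch and the ``./dpdk-testpmd`` path are hypothetical. Dropping ``cap_sys_admin`` outside PA mode is an assumption drawn from the description above, not a documented guarantee.

```python
# Hypothetical helper; it only builds the command string and does not run it.

CAPABILITIES = {
    "cap_sys_admin": "access to /proc/self/pagemap in PA mode (Linux >= 4.0)",
    "cap_net_admin": "device configuration",
    "cap_net_raw": "raw ethernet queue allocation through kernel driver",
    "cap_ipc_lock": "DMA memory pinning",
}


def setcap_command(app_path: str, pa_mode: bool = True) -> str:
    """Return the setcap command line for the given application path."""
    caps = [
        name
        for name in CAPABILITIES
        # Assumption: cap_sys_admin is only required when PA mode is used.
        if pa_mode or name != "cap_sys_admin"
    ]
    return f"setcap {','.join(caps)}+ep {app_path}"


if __name__ == "__main__":
    print(setcap_command("./dpdk-testpmd"))
```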
+Driver options
+^^^^^^^^^^^^^^
+
- ``rxq_cqe_comp_en`` parameter [int]
A nonzero value enables the compression of CQE on RX side. This feature
value is not in the range of device capability, the default value will be set
with a warning message. The default value is 11 which is 2048 bytes per a
stride, valid only if ``mprq_en`` is set. With ``mprq_log_stride_size`` set
- it is possible for a pcaket to span across multiple strides. This mode allows
+ it is possible for a packet to span across multiple strides. This mode allows
support of jumbo frames (9K) with MPRQ. The memcopy of some packets (or part
of a packet if Rx scatter is configured) may be required in case there is no
space left for a head room at the end of a stride which incurs some
variable "MLX5_SHUT_UP_BF" value is used. If there is no "MLX5_SHUT_UP_BF",
the default ``tx_db_nc`` value is zero for ARM64 hosts and one for others.
+- ``tx_pp`` parameter [int]
+
+ If a nonzero value is specified, the driver creates all necessary internal
+ objects to provide accurate packet send scheduling on mbuf timestamps.
+ A positive value specifies the scheduling granularity in nanoseconds;
+ packet sending will be accurate up to the specified granularity. The allowed
+ range is from 500 to 1 million nanoseconds. A negative value specifies the
+ absolute value of the granularity and engages the special test mode to check
+ the schedule rate.
+ By default (if ``tx_pp`` is not specified), the send scheduling on timestamps
+ feature is disabled.
+
+- ``tx_skew`` parameter [int]
+
+ The parameter adjusts the send packet scheduling on timestamps and represents
+ the average delay between beginning of the transmitting descriptor processing
+ by the hardware and appearance of actual packet data on the wire. The value
+ should be provided in nanoseconds and is valid only if ``tx_pp`` parameter is
+ specified. The default value is zero.
+
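The ``tx_pp`` value semantics described above can be summarized with a small Python sketch. The function and constant names are hypothetical and not part of the PMD. Two points are assumptions: that a value of zero behaves like an unspecified ``tx_pp``, and that the 500 to 1000000 ns range applies to the magnitude of a negative value as well.

```python
# Hypothetical classifier for tx_pp values; not an API of the driver.

TX_PP_MIN_NS = 500
TX_PP_MAX_NS = 1_000_000


def describe_tx_pp(value: int) -> str:
    """Describe what a given tx_pp devarg value would request."""
    if value == 0:
        # Assumption: zero is treated the same as not specifying tx_pp.
        return "disabled"
    granularity = abs(value)
    if not TX_PP_MIN_NS <= granularity <= TX_PP_MAX_NS:
        raise ValueError(
            f"granularity {granularity} ns is outside "
            f"{TX_PP_MIN_NS}..{TX_PP_MAX_NS} ns"
        )
    # A negative value selects the special test mode.
    mode = "test mode" if value < 0 else "scheduling"
    return f"{mode}, granularity {granularity} ns"


if __name__ == "__main__":
    for candidate in (0, 500, -1000):
        print(candidate, "->", describe_tx_pp(candidate))
```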
- ``tx_vec_en`` parameter [int]
A nonzero value enables Tx vector on ConnectX-5, ConnectX-6, ConnectX-6 Dx
Enabled by default if supported.
+- ``lacp_by_user`` parameter [int]
+
+ A nonzero value enables the control of LACP traffic by the user application.
+ When a bond exists in the driver, by default it should be managed by the
+ kernel and therefore LACP traffic should be steered to the kernel.
+ If this devarg is set to 1 it will allow the user to manage the bond by
+ itself and not steer LACP traffic to the kernel.
+
+ Disabled by default (set to 0).
+
- ``mr_ext_memseg_en`` parameter [int]
A nonzero value enables extending memseg when registering DMA memory. If
By default, the PMD will set this value to 16, which means that 9KB jumbo
frames will be supported.
+- ``reclaim_mem_mode`` parameter [int]
+
+ Caching some resources on flow destroy makes flow recreation more efficient.
+ However, some systems may require that all resources be reclaimed after
+ a flow is destroyed.
+ The ``reclaim_mem_mode`` parameter lets the user configure
+ whether the resource cache is needed.
+
+ There are three options to choose from:
+
+ - 0. The flow resources will be cached as usual, which helps
+ the flow insertion rate.
+
+ - 1. Only the DPDK PMD level resource reclaim will be enabled.
+
+ - 2. Both the DPDK PMD level and the rdma-core low level will be configured
+ in reclaim mode.
+
+ By default, the PMD will set this value to 0.
+
+- ``sys_mem_en`` parameter [int]
+
+ A nonzero value makes the PMD memory management allocate memory
+ from the system by default, that is, when no explicit rte memory flag is given.
+
+ By default, the PMD will set this value to 0.
+
+- ``decap_en`` parameter [int]
+
+ Some devices do not support FCS (frame checksum) scattering for
+ tunnel-decapsulated packets.
+ If set to 0, this option forces the FCS feature and rejects tunnel
+ decapsulation in the flow engine for such devices.
+
+ By default, the PMD will set this value to 1.
+
.. _mlx5_firmware_config:
Firmware configuration
FLEX_PARSER_PROFILE_ENABLE=1
-- enable ICMP/ICMP6 code/type fields matching::
+- enable ICMP(code/type/identifier/sequence number) / ICMP6(code/type) fields matching::
FLEX_PARSER_PROFILE_ENABLE=2
FLEX_PARSER_PROFILE_ENABLE=3
+- enable eCPRI flow matching::
+
+ FLEX_PARSER_PROFILE_ENABLE=4
+ PROG_PARSE_GRAPH=1
+
Prerequisites
-------------
- **libibverbs**
- User space Verbs framework used by librte_pmd_mlx5. This library provides
+ User space Verbs framework used by librte_net_mlx5. This library provides
a generic interface between the kernel and low-level user space drivers
such as libmlx5.
.. _`Linux installation documentation`: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable.git/plain/Documentation/admin-guide/README.rst
.. _`RDMA Core installation documentation`: https://raw.githubusercontent.com/linux-rdma/rdma-core/master/README.md
-If rdma-core libraries are built but not installed, DPDK makefile can link them,
-thanks to these environment variables:
-
- - ``EXTRA_CFLAGS=-I/path/to/rdma-core/build/include``
- - ``EXTRA_LDFLAGS=-L/path/to/rdma-core/build/lib``
- - ``PKG_CONFIG_PATH=/path/to/rdma-core/build/lib/pkgconfig``
Mellanox OFED/EN
^^^^^^^^^^^^^^^^
echo [num_vfs] > /sys/class/infiniband/mlx5_0/device/sriov_numvfs
-6. Compile DPDK and you are ready to go. See instructions on
- :ref:`Development Kit Build System <Development_Kit_Build_System>`
+6. Install DPDK and you are ready to go.
+ See :doc:`compilation instructions <../linux_gsg/build_dpdk>`.
Enable switchdev mode
---------------------
| | | rdma-core 26 | | rdma-core 26 |
| | | ConnectX-5 | | ConnectX-5 |
+-----------------------+-----------------+-----------------+
+ | Sampling | | DPDK 20.11 | | DPDK 20.11 |
+ | | | OFED 5.2 | | OFED 5.2 |
+ | | | rdma-core 32 | | rdma-core 32 |
+ | | | ConnectX-5 | | ConnectX-5 |
+ +-----------------------+-----------------+-----------------+
Notes for metadata
------------------
Notes for testpmd
-----------------
-Compared to librte_pmd_mlx4 that implements a single RSS configuration per
-port, librte_pmd_mlx5 supports per-protocol RSS configuration.
+Compared to librte_net_mlx4 that implements a single RSS configuration per
+port, librte_net_mlx5 supports per-protocol RSS configuration.
Since ``testpmd`` defaults to IP RSS mode and there is currently no
command-line parameter to enable additional protocols (UDP and TCP as well
as IP), the following commands must be entered from its CLI to get the same
-behavior as librte_pmd_mlx4::
+behavior as librte_net_mlx4::
> port stop all
> port config all rss all
-------------
This section demonstrates how to launch **testpmd** with Mellanox
-ConnectX-4/ConnectX-5/ConnectX-6/BlueField devices managed by librte_pmd_mlx5.
+ConnectX-4/ConnectX-5/ConnectX-6/BlueField devices managed by librte_net_mlx5.
#. Load the kernel modules::
[...]
EAL: PCI device 0000:05:00.0 on NUMA socket 0
- EAL: probe driver: 15b3:1013 librte_pmd_mlx5
- PMD: librte_pmd_mlx5: PCI information matches, using device "mlx5_0" (VF: false)
- PMD: librte_pmd_mlx5: 1 port(s) detected
- PMD: librte_pmd_mlx5: port 1 MAC address is e4:1d:2d:e7:0c:fe
+ EAL: probe driver: 15b3:1013 librte_net_mlx5
+ PMD: librte_net_mlx5: PCI information matches, using device "mlx5_0" (VF: false)
+ PMD: librte_net_mlx5: 1 port(s) detected
+ PMD: librte_net_mlx5: port 1 MAC address is e4:1d:2d:e7:0c:fe
EAL: PCI device 0000:05:00.1 on NUMA socket 0
- EAL: probe driver: 15b3:1013 librte_pmd_mlx5
- PMD: librte_pmd_mlx5: PCI information matches, using device "mlx5_1" (VF: false)
- PMD: librte_pmd_mlx5: 1 port(s) detected
- PMD: librte_pmd_mlx5: port 1 MAC address is e4:1d:2d:e7:0c:ff
+ EAL: probe driver: 15b3:1013 librte_net_mlx5
+ PMD: librte_net_mlx5: PCI information matches, using device "mlx5_1" (VF: false)
+ PMD: librte_net_mlx5: 1 port(s) detected
+ PMD: librte_net_mlx5: port 1 MAC address is e4:1d:2d:e7:0c:ff
EAL: PCI device 0000:06:00.0 on NUMA socket 0
- EAL: probe driver: 15b3:1013 librte_pmd_mlx5
- PMD: librte_pmd_mlx5: PCI information matches, using device "mlx5_2" (VF: false)
- PMD: librte_pmd_mlx5: 1 port(s) detected
- PMD: librte_pmd_mlx5: port 1 MAC address is e4:1d:2d:e7:0c:fa
+ EAL: probe driver: 15b3:1013 librte_net_mlx5
+ PMD: librte_net_mlx5: PCI information matches, using device "mlx5_2" (VF: false)
+ PMD: librte_net_mlx5: 1 port(s) detected
+ PMD: librte_net_mlx5: port 1 MAC address is e4:1d:2d:e7:0c:fa
EAL: PCI device 0000:06:00.1 on NUMA socket 0
- EAL: probe driver: 15b3:1013 librte_pmd_mlx5
- PMD: librte_pmd_mlx5: PCI information matches, using device "mlx5_3" (VF: false)
- PMD: librte_pmd_mlx5: 1 port(s) detected
- PMD: librte_pmd_mlx5: port 1 MAC address is e4:1d:2d:e7:0c:fb
+ EAL: probe driver: 15b3:1013 librte_net_mlx5
+ PMD: librte_net_mlx5: PCI information matches, using device "mlx5_3" (VF: false)
+ PMD: librte_net_mlx5: 1 port(s) detected
+ PMD: librte_net_mlx5: port 1 MAC address is e4:1d:2d:e7:0c:fb
Interactive-mode selected
Configuring Port 0 (socket 0)
- PMD: librte_pmd_mlx5: 0x8cba80: TX queues number update: 0 -> 2
- PMD: librte_pmd_mlx5: 0x8cba80: RX queues number update: 0 -> 2
+ PMD: librte_net_mlx5: 0x8cba80: TX queues number update: 0 -> 2
+ PMD: librte_net_mlx5: 0x8cba80: RX queues number update: 0 -> 2
Port 0: E4:1D:2D:E7:0C:FE
Configuring Port 1 (socket 0)
- PMD: librte_pmd_mlx5: 0x8ccac8: TX queues number update: 0 -> 2
- PMD: librte_pmd_mlx5: 0x8ccac8: RX queues number update: 0 -> 2
+ PMD: librte_net_mlx5: 0x8ccac8: TX queues number update: 0 -> 2
+ PMD: librte_net_mlx5: 0x8ccac8: RX queues number update: 0 -> 2
Port 1: E4:1D:2D:E7:0C:FF
Configuring Port 2 (socket 0)
- PMD: librte_pmd_mlx5: 0x8cdb10: TX queues number update: 0 -> 2
- PMD: librte_pmd_mlx5: 0x8cdb10: RX queues number update: 0 -> 2
+ PMD: librte_net_mlx5: 0x8cdb10: TX queues number update: 0 -> 2
+ PMD: librte_net_mlx5: 0x8cdb10: RX queues number update: 0 -> 2
Port 2: E4:1D:2D:E7:0C:FA
Configuring Port 3 (socket 0)
- PMD: librte_pmd_mlx5: 0x8ceb58: TX queues number update: 0 -> 2
- PMD: librte_pmd_mlx5: 0x8ceb58: RX queues number update: 0 -> 2
+ PMD: librte_net_mlx5: 0x8ceb58: TX queues number update: 0 -> 2
+ PMD: librte_net_mlx5: 0x8ceb58: RX queues number update: 0 -> 2
Port 3: E4:1D:2D:E7:0C:FB
Checking link statuses...
Port 0 Link Up - speed 40000 Mbps - full-duplex