X-Git-Url: http://git.droids-corp.org/?a=blobdiff_plain;f=doc%2Fguides%2Fnics%2Fmlx5.rst;h=69bb4fca0f743237783866648cc7dc109f5b72d8;hb=0e484278c85f17ebf1d6a03b4ff93f1511245b9e;hp=07f5a3bccde233067d5f6a45bd85e2f18bbb4c6b;hpb=92818d839e8eb0ce479db826f00aa6d62384fc92;p=dpdk.git diff --git a/doc/guides/nics/mlx5.rst b/doc/guides/nics/mlx5.rst index 07f5a3bccd..69bb4fca0f 100644 --- a/doc/guides/nics/mlx5.rst +++ b/doc/guides/nics/mlx5.rst @@ -7,7 +7,7 @@ MLX5 poll mode driver ===================== -The MLX5 poll mode driver library (**librte_pmd_mlx5**) provides support +The MLX5 poll mode driver library (**librte_net_mlx5**) provides support for **Mellanox ConnectX-4**, **Mellanox ConnectX-4 Lx** , **Mellanox ConnectX-5**, **Mellanox ConnectX-6**, **Mellanox ConnectX-6 Dx** and **Mellanox BlueField** families of 10/25/40/50/100/200 Gb/s adapters @@ -20,17 +20,12 @@ Information and documentation about these adapters can be found on the There is also a `section dedicated to this poll mode driver `__. -.. note:: - - Due to external dependencies, this driver is disabled in default configuration - of the "make" build. It can be enabled with ``CONFIG_RTE_LIBRTE_MLX5_PMD=y`` - or by using "meson" build system which will detect dependencies. Design ------ Besides its dependency on libibverbs (that implies libmlx5 and associated -kernel support), librte_pmd_mlx5 relies heavily on system calls for control +kernel support), librte_net_mlx5 relies heavily on system calls for control operations such as querying/updating the MTU and flow control parameters. For security reasons and robustness, this driver only deals with virtual @@ -56,7 +51,7 @@ to get the best performances: - DevX allows to access firmware objects - Direct Rules manages flow steering at low-level hardware layer -Enabling librte_pmd_mlx5 causes DPDK applications to be linked against +Enabling librte_net_mlx5 causes DPDK applications to be linked against libibverbs. 
Features @@ -163,8 +158,6 @@ Limitations - Flows with a VXLAN Network Identifier equal (or ends to be equal) to 0 are not supported. -- VXLAN TSO and checksum offloads are not supported on VM. - - L3 VXLAN and VXLAN-GPE tunnels cannot be supported together with MPLSoGRE and MPLSoUDP. - Match on Geneve header supports the following fields only: @@ -180,6 +173,7 @@ Limitations - Match on GTP tunnel header item supports the following fields only: + - v_pt_rsv_flags: E flag, S flag, PN flag - msg_type - teid @@ -242,6 +236,31 @@ Limitations reduce the requested Tx size or adjust data inline settings with ``txq_inline_max`` and ``txq_inline_mpw`` devargs keys. +- To schedule packet sending on mbuf timestamps, the ``tx_pp`` + parameter must be specified. + When the PMD sees the RTE_MBUF_DYNFLAG_TX_TIMESTAMP_NAME flag set on a packet + being sent, it tries to synchronize the time the packet appears on + the wire with the specified packet timestamp. If the specified timestamp + is in the past, it should be ignored; if it is in the distant future, + it should be capped at some reasonable value (in the range of seconds). + These specific cases ("too late" and "distant future") can optionally be + reported via device xstats to help applications detect + time-related problems. + + The timestamp upper "too-distant-future" limit + at the moment of invoking the Tx burst routine + can be estimated as the ``tx_pp`` option (in nanoseconds) multiplied by 2^23. + Please note, for the testpmd txonly mode, + the limit is deduced from the expression:: + + (n_tx_descriptors / burst_size + 1) * inter_burst_gap + + No packet reordering according to timestamps is performed, + neither within a packet burst nor between packets; it is entirely + the application's responsibility to generate packets and their timestamps + in the desired order. A timestamp may be set on only the first packet + of a burst, which schedules the entire burst. 
+ - E-Switch decapsulation Flow: - can be applied to PF port only. @@ -263,7 +282,7 @@ Limitations - The input buffer, providing the removal size, is not validated. - The buffer size must match the length of the headers to be removed. -- ICMP/ICMP6 code/type matching, IP-in-IP and MPLS flow matching are all +- ICMP(code/type/identifier/sequence number) / ICMP6(code/type) matching, IP-in-IP and MPLS flow matching are all mutually exclusive features which cannot be supported together (see :ref:`mlx5_firmware_config`). @@ -279,6 +298,28 @@ Limitations eth (with or without vlan) / ipv4 or ipv6 / tcp / payload Other TCP packets (e.g. with MPLS label) received on Rx queue with LRO enabled, will be received with bad checksum. + - LRO packet aggregation is performed by HW only for packet sizes larger than + ``lro_min_mss_size``. This value is reported on device start, when debug + mode is enabled. + +- CRC: + + - ``DEV_RX_OFFLOAD_KEEP_CRC`` cannot be supported with decapsulation + for some NICs (such as ConnectX-6 Dx and BlueField 2). + The capability bit ``scatter_fcs_w_decap_disable`` shows NIC support. + +- Sample flow: + + - Supports the ``RTE_FLOW_ACTION_TYPE_SAMPLE`` action only within the NIC Rx and E-Switch steering domains. + - The E-Switch Sample flow must have the eswitch_manager VPORT destination (PF or ECPF) and no additional actions. + - For ConnectX-5, ``RTE_FLOW_ACTION_TYPE_SAMPLE`` is typically used as the first action in the E-Switch egress flow when combined with header modify or encapsulation actions. + +- The IPv6 header item 'proto' field, indicating the next header protocol, should + not be set to an extension header. + If the next header is an extension header, it should not be specified in + the IPv6 header item 'proto' field. + The 'next header' field of the last extension header item can specify the following + header protocol type. 
Statistics ---------- @@ -297,53 +338,19 @@ Configuration Compilation options ~~~~~~~~~~~~~~~~~~~ -These options can be modified in the ``.config`` file. - -- ``CONFIG_RTE_LIBRTE_MLX5_PMD`` (default **n**) - - Toggle compilation of librte_pmd_mlx5 itself. - -- ``CONFIG_RTE_IBVERBS_LINK_DLOPEN`` (default **n**) - - Build PMD with additional code to make it loadable without hard - dependencies on **libibverbs** nor **libmlx5**, which may not be installed - on the target system. - - In this mode, their presence is still required for it to run properly, - however their absence won't prevent a DPDK application from starting (with - ``CONFIG_RTE_BUILD_SHARED_LIB`` disabled) and they won't show up as - missing with ``ldd(1)``. - - It works by moving these dependencies to a purpose-built rdma-core "glue" - plug-in which must either be installed in a directory whose name is based - on ``CONFIG_RTE_EAL_PMD_PATH`` suffixed with ``-glue`` if set, or in a - standard location for the dynamic linker (e.g. ``/lib``) if left to the - default empty string (``""``). +The ibverbs libraries can be linked with this PMD in a number of ways, +configured by the ``ibverbs_link`` build option: - This option has no performance impact. +- ``shared`` (default): the PMD depends on some .so files. -- ``CONFIG_RTE_IBVERBS_LINK_STATIC`` (default **n**) +- ``dlopen``: Split the dependencies glue in a separate library + loaded when needed by dlopen. + It makes dependencies on libibverbs and libmlx5 optional, + and has no performance impact. - Embed static flavor of the dependencies **libibverbs** and **libmlx5** +- ``static``: Embed static flavor of the dependencies libibverbs and libmlx5 in the PMD shared library or the executable static binary. -- ``CONFIG_RTE_LIBRTE_MLX5_DEBUG`` (default **n**) - - Toggle debugging code and stricter compilation flags. Enabling this option - adds additional run-time checks and debugging messages at the cost of - lower performance. - -.. 
note:: - - For BlueField, target should be set to ``arm64-bluefield-linux-gcc``. This - will enable ``CONFIG_RTE_LIBRTE_MLX5_PMD`` and set ``RTE_CACHE_LINE_SIZE`` to - 64. Default armv8a configuration of make build and meson build set it to 128 - then brings performance degradation. - -This option is available in meson: - -- ``ibverbs_link`` can be ``static``, ``shared``, or ``dlopen``. - Environment variables ~~~~~~~~~~~~~~~~~~~~~ @@ -352,10 +359,6 @@ Environment variables A list of directories in which to search for the rdma-core "glue" plug-in, separated by colons or semi-colons. - Only matters when compiled with ``CONFIG_RTE_IBVERBS_LINK_DLOPEN`` - enabled and most useful when ``CONFIG_RTE_EAL_PMD_PATH`` is also set, - since ``LD_LIBRARY_PATH`` has no effect in this case. - - ``MLX5_SHUT_UP_BF`` Configures HW Tx doorbell register as IO-mapped. @@ -376,12 +379,38 @@ Environment variables Run-time configuration ~~~~~~~~~~~~~~~~~~~~~~ -- librte_pmd_mlx5 brings kernel network interfaces up during initialization +- librte_net_mlx5 brings kernel network interfaces up during initialization because it is affected by their state. Forcing them down prevents packets reception. - **ethtool** operations on related kernel interfaces also affect the PMD. +Run as non-root +^^^^^^^^^^^^^^^ + +In order to run as a non-root user, +some capabilities must be granted to the application:: + + setcap cap_sys_admin,cap_net_admin,cap_net_raw,cap_ipc_lock+ep + +Each capability is needed for the following reason: + +``cap_sys_admin`` + When using physical addresses (PA mode), with Linux >= 4.0, + for access to ``/proc/self/pagemap``. + +``cap_net_admin`` + For device configuration. + +``cap_net_raw`` + For raw ethernet queue allocation through the kernel driver. + +``cap_ipc_lock`` + For DMA memory pinning. + +Driver options +^^^^^^^^^^^^^^ + - ``rxq_cqe_comp_en`` parameter [int] A nonzero value enables the compression of CQE on RX side. 
This feature @@ -464,7 +493,7 @@ Run-time configuration value is not in the range of device capability, the default value will be set with a warning message. The default value is 11 which is 2048 bytes per a stride, valid only if ``mprq_en`` is set. With ``mprq_log_stride_size`` set - it is possible for a pcaket to span across multiple strides. This mode allows + it is possible for a packet to span across multiple strides. This mode allows support of jumbo frames (9K) with MPRQ. The memcopy of some packets (or part of a packet if Rx scatter is configured) may be required in case there is no space left for a head room at the end of a stride which incurs some @@ -675,6 +704,25 @@ Run-time configuration variable "MLX5_SHUT_UP_BF" value is used. If there is no "MLX5_SHUT_UP_BF", the default ``tx_db_nc`` value is zero for ARM64 hosts and one for others. +- ``tx_pp`` parameter [int] + + If a nonzero value is specified, the driver creates all necessary internal + objects to provide accurate packet send scheduling on mbuf timestamps. + A positive value specifies the scheduling granularity in nanoseconds; + packet sending will be accurate to within that granularity. The allowed range is + from 500 to 1 million nanoseconds. A negative value specifies the absolute value + of the granularity and engages a special test mode to check the schedule rate. + By default (if ``tx_pp`` is not specified), the send scheduling on timestamps + feature is disabled. + +- ``tx_skew`` parameter [int] + + The parameter adjusts the send packet scheduling on timestamps and represents + the average delay between the start of transmit descriptor processing + by the hardware and the appearance of actual packet data on the wire. The value + should be provided in nanoseconds and is valid only if the ``tx_pp`` parameter is + specified. The default value is zero. 
+ - ``tx_vec_en`` parameter [int] A nonzero value enables Tx vector on ConnectX-5, ConnectX-6, ConnectX-6 Dx @@ -771,6 +819,16 @@ Run-time configuration Enabled by default if supported. +- ``lacp_by_user`` parameter [int] + + A nonzero value enables the control of LACP traffic by the user application. + When a bond exists in the driver, by default it should be managed by the + kernel and therefore LACP traffic should be steered to the kernel. + If this devarg is set to 1, the user application manages the bond + itself and LACP traffic is not steered to the kernel. + + Disabled by default (set to 0). + - ``mr_ext_memseg_en`` parameter [int] A nonzero value enables extending memseg when registering DMA memory. If @@ -824,6 +882,42 @@ Run-time configuration By default, the PMD will set this value to 16, which means that 9KB jumbo frames will be supported. +- ``reclaim_mem_mode`` parameter [int] + + Caching some resources on flow destroy makes flow re-creation more efficient. + However, some systems may require that all resources be reclaimed after + a flow is destroyed. + The ``reclaim_mem_mode`` parameter lets the user configure + whether the resource cache is needed or not. + + There are three options to choose from: + + - 0. The flow resources are cached as usual, + which helps the flow insertion rate. + + - 1. Only the DPDK PMD level resource reclaim is enabled. + + - 2. Both the DPDK PMD level and the rdma-core low level are configured in + reclaim mode. + + By default, the PMD will set this value to 0. + +- ``sys_mem_en`` parameter [int] + + A non-zero value makes the PMD memory management allocate memory + from the system by default, without an explicit rte memory flag. + + By default, the PMD will set this value to 0. + +- ``decap_en`` parameter [int] + + Some devices do not support FCS (frame checksum) scattering for + tunnel-decapsulated packets. 
+ If set to 0, this option forces the FCS feature and rejects tunnel + decapsulation in the flow engine for such devices. + + By default, the PMD will set this value to 1. + .. _mlx5_firmware_config: Firmware configuration @@ -887,7 +981,7 @@ Below are some firmware configurations listed. FLEX_PARSER_PROFILE_ENABLE=1 -- enable ICMP/ICMP6 code/type fields matching:: +- enable ICMP(code/type/identifier/sequence number) / ICMP6(code/type) fields matching:: FLEX_PARSER_PROFILE_ENABLE=2 @@ -901,6 +995,11 @@ Below are some firmware configurations listed. FLEX_PARSER_PROFILE_ENABLE=3 +- enable eCPRI flow matching:: + + FLEX_PARSER_PROFILE_ENABLE=4 + PROG_PARSE_GRAPH=1 + Prerequisites ------------- @@ -910,7 +1009,7 @@ DPDK and must be installed separately: - **libibverbs** - User space Verbs framework used by librte_pmd_mlx5. This library provides + User space Verbs framework used by librte_net_mlx5. This library provides a generic interface between the kernel and low-level user space drivers such as libmlx5. @@ -982,12 +1081,6 @@ RDMA Core with Linux Kernel .. _`Linux installation documentation`: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable.git/plain/Documentation/admin-guide/README.rst .. _`RDMA Core installation documentation`: https://raw.githubusercontent.com/linux-rdma/rdma-core/master/README.md -If rdma-core libraries are built but not installed, DPDK makefile can link them, -thanks to these environment variables: - - - ``EXTRA_CFLAGS=-I/path/to/rdma-core/build/include`` - - ``EXTRA_LDFLAGS=-L/path/to/rdma-core/build/lib`` - - ``PKG_CONFIG_PATH=/path/to/rdma-core/build/lib/pkgconfig`` Mellanox OFED/EN ^^^^^^^^^^^^^^^^ @@ -1120,8 +1213,8 @@ Quick Start Guide on OFED/EN echo [num_vfs] > /sys/class/infiniband/mlx5_0/device/sriov_numvfs -6. Compile DPDK and you are ready to go. See instructions on - :ref:`Development Kit Build System ` +6. Install DPDK and you are ready to go. + See :doc:`compilation instructions <../linux_gsg/build_dpdk>`. 
Enable switchdev mode --------------------- @@ -1325,6 +1418,11 @@ Supported hardware offloads | | | rdma-core 26 | | rdma-core 26 | | | | ConnectX-5 | | ConnectX-5 | +-----------------------+-----------------+-----------------+ + | Sampling | | DPDK 20.11 | | DPDK 20.11 | + | | | OFED 5.2 | | OFED 5.2 | + | | | rdma-core 32 | | rdma-core 32 | + | | | ConnectX-5 | | ConnectX-5 | + +-----------------------+-----------------+-----------------+ Notes for metadata ------------------ @@ -1355,13 +1453,13 @@ The application should re-create the flows as required after the port restart. Notes for testpmd ----------------- -Compared to librte_pmd_mlx4 that implements a single RSS configuration per -port, librte_pmd_mlx5 supports per-protocol RSS configuration. +Compared to librte_net_mlx4 that implements a single RSS configuration per +port, librte_net_mlx5 supports per-protocol RSS configuration. Since ``testpmd`` defaults to IP RSS mode and there is currently no command-line parameter to enable additional protocols (UDP and TCP as well as IP), the following commands must be entered from its CLI to get the same -behavior as librte_pmd_mlx4:: +behavior as librte_net_mlx4:: > port stop all > port config all rss all @@ -1371,7 +1469,7 @@ Usage example ------------- This section demonstrates how to launch **testpmd** with Mellanox -ConnectX-4/ConnectX-5/ConnectX-6/BlueField devices managed by librte_pmd_mlx5. +ConnectX-4/ConnectX-5/ConnectX-6/BlueField devices managed by librte_net_mlx5. #. Load the kernel modules:: @@ -1428,41 +1526,41 @@ ConnectX-4/ConnectX-5/ConnectX-6/BlueField devices managed by librte_pmd_mlx5. [...] 
EAL: PCI device 0000:05:00.0 on NUMA socket 0 - EAL: probe driver: 15b3:1013 librte_pmd_mlx5 - PMD: librte_pmd_mlx5: PCI information matches, using device "mlx5_0" (VF: false) - PMD: librte_pmd_mlx5: 1 port(s) detected - PMD: librte_pmd_mlx5: port 1 MAC address is e4:1d:2d:e7:0c:fe + EAL: probe driver: 15b3:1013 librte_net_mlx5 + PMD: librte_net_mlx5: PCI information matches, using device "mlx5_0" (VF: false) + PMD: librte_net_mlx5: 1 port(s) detected + PMD: librte_net_mlx5: port 1 MAC address is e4:1d:2d:e7:0c:fe EAL: PCI device 0000:05:00.1 on NUMA socket 0 - EAL: probe driver: 15b3:1013 librte_pmd_mlx5 - PMD: librte_pmd_mlx5: PCI information matches, using device "mlx5_1" (VF: false) - PMD: librte_pmd_mlx5: 1 port(s) detected - PMD: librte_pmd_mlx5: port 1 MAC address is e4:1d:2d:e7:0c:ff + EAL: probe driver: 15b3:1013 librte_net_mlx5 + PMD: librte_net_mlx5: PCI information matches, using device "mlx5_1" (VF: false) + PMD: librte_net_mlx5: 1 port(s) detected + PMD: librte_net_mlx5: port 1 MAC address is e4:1d:2d:e7:0c:ff EAL: PCI device 0000:06:00.0 on NUMA socket 0 - EAL: probe driver: 15b3:1013 librte_pmd_mlx5 - PMD: librte_pmd_mlx5: PCI information matches, using device "mlx5_2" (VF: false) - PMD: librte_pmd_mlx5: 1 port(s) detected - PMD: librte_pmd_mlx5: port 1 MAC address is e4:1d:2d:e7:0c:fa + EAL: probe driver: 15b3:1013 librte_net_mlx5 + PMD: librte_net_mlx5: PCI information matches, using device "mlx5_2" (VF: false) + PMD: librte_net_mlx5: 1 port(s) detected + PMD: librte_net_mlx5: port 1 MAC address is e4:1d:2d:e7:0c:fa EAL: PCI device 0000:06:00.1 on NUMA socket 0 - EAL: probe driver: 15b3:1013 librte_pmd_mlx5 - PMD: librte_pmd_mlx5: PCI information matches, using device "mlx5_3" (VF: false) - PMD: librte_pmd_mlx5: 1 port(s) detected - PMD: librte_pmd_mlx5: port 1 MAC address is e4:1d:2d:e7:0c:fb + EAL: probe driver: 15b3:1013 librte_net_mlx5 + PMD: librte_net_mlx5: PCI information matches, using device "mlx5_3" (VF: false) + PMD: librte_net_mlx5: 1 
port(s) detected + PMD: librte_net_mlx5: port 1 MAC address is e4:1d:2d:e7:0c:fb Interactive-mode selected Configuring Port 0 (socket 0) - PMD: librte_pmd_mlx5: 0x8cba80: TX queues number update: 0 -> 2 - PMD: librte_pmd_mlx5: 0x8cba80: RX queues number update: 0 -> 2 + PMD: librte_net_mlx5: 0x8cba80: TX queues number update: 0 -> 2 + PMD: librte_net_mlx5: 0x8cba80: RX queues number update: 0 -> 2 Port 0: E4:1D:2D:E7:0C:FE Configuring Port 1 (socket 0) - PMD: librte_pmd_mlx5: 0x8ccac8: TX queues number update: 0 -> 2 - PMD: librte_pmd_mlx5: 0x8ccac8: RX queues number update: 0 -> 2 + PMD: librte_net_mlx5: 0x8ccac8: TX queues number update: 0 -> 2 + PMD: librte_net_mlx5: 0x8ccac8: RX queues number update: 0 -> 2 Port 1: E4:1D:2D:E7:0C:FF Configuring Port 2 (socket 0) - PMD: librte_pmd_mlx5: 0x8cdb10: TX queues number update: 0 -> 2 - PMD: librte_pmd_mlx5: 0x8cdb10: RX queues number update: 0 -> 2 + PMD: librte_net_mlx5: 0x8cdb10: TX queues number update: 0 -> 2 + PMD: librte_net_mlx5: 0x8cdb10: RX queues number update: 0 -> 2 Port 2: E4:1D:2D:E7:0C:FA Configuring Port 3 (socket 0) - PMD: librte_pmd_mlx5: 0x8ceb58: TX queues number update: 0 -> 2 - PMD: librte_pmd_mlx5: 0x8ceb58: RX queues number update: 0 -> 2 + PMD: librte_net_mlx5: 0x8ceb58: TX queues number update: 0 -> 2 + PMD: librte_net_mlx5: 0x8ceb58: RX queues number update: 0 -> 2 Port 3: E4:1D:2D:E7:0C:FB Checking link statuses... Port 0 Link Up - speed 40000 Mbps - full-duplex