1 .. SPDX-License-Identifier: BSD-3-Clause
2 Copyright 2022 6WIND S.A.
3 Copyright (c) 2022 NVIDIA Corporation & Affiliates
5 .. include:: <isonum.txt>
10 The mlx5 common driver library (**librte_common_mlx5**) provides support for
11 **Mellanox ConnectX-4**, **Mellanox ConnectX-4 Lx**, **Mellanox ConnectX-5**,
12 **Mellanox ConnectX-6**, **Mellanox ConnectX-6 Dx**, **Mellanox ConnectX-6 Lx**,
13 **Mellanox BlueField** and **Mellanox BlueField-2** families of
14 10/25/40/50/100/200 Gb/s adapters.
16 Information and documentation for these adapters can be found on the
17 `NVIDIA website <https://www.nvidia.com/en-us/networking/>`_.
18 Help is also provided by the
19 `Mellanox community <http://community.mellanox.com/welcome>`_.
20 In addition, there is a `web section dedicated to the Poll Mode Driver
21 <https://developer.nvidia.com/networking/dpdk>`_.
27 For security reasons and to enhance robustness,
28 this driver only handles virtual memory addresses.
29 The way resource allocations are handled by the kernel,
30 combined with hardware specifications that allow handling virtual memory addresses directly,
31 ensures that DPDK applications cannot access random physical memory
32 (or memory that does not belong to the current process).
34 There are different levels of objects and bypassing abilities
35 which are used to get the best performance:
37 - **Verbs** is a complete high-level generic API
38 - **Direct Verbs** is a device-specific API
39 - **DevX** allows accessing firmware objects
40 - **Direct Rules** manages flow steering at the low-level hardware layer
42 On Linux, the above interfaces are provided by linking with ``libibverbs`` and ``libmlx5``.
43 See :ref:`mlx5_linux_prerequisites` for installation.
45 On Windows, DevX is the only requirement from the above list.
46 See :ref:`mlx5_windows_prerequisites` for DevX SDK package installation.
54 One mlx5 device can be probed by a number of different PMDs.
55 To select a specific PMD, its name should be specified as a device parameter
56 (e.g. ``0000:08:00.1,class=eth``).
58 In order to allow probing by multiple PMDs,
59 several classes may be listed separated by a colon.
60 For example: ``class=crypto:regex`` will probe both Crypto and RegEx PMDs.
66 - ``class=compress`` for :doc:`../../compressdevs/mlx5`.
67 - ``class=crypto`` for :doc:`../../cryptodevs/mlx5`.
68 - ``class=eth`` for :doc:`../../nics/mlx5`.
69 - ``class=regex`` for :doc:`../../regexdevs/mlx5`.
70 - ``class=vdpa`` for :doc:`../../vdpadevs/mlx5`.
72 By default, the mlx5 device will be probed by the ``eth`` PMD.
78 - ``eth`` and ``vdpa`` PMDs cannot be probed at the same time.
79 All other combinations are possible.
81 - On Windows, only ``eth`` and ``crypto`` are supported.
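The class rules above can be checked before launching an application. The following is a minimal sketch, not part of DPDK: the function name ``check_mlx5_classes`` is hypothetical, and it only encodes the class names and the single ``eth``/``vdpa`` restriction listed in this section.

```shell
# Hypothetical helper (not a DPDK tool): validate a colon-separated
# class list such as "crypto:regex" against the rules listed above.
# Returns 0 if valid, 1 if eth and vdpa are combined, 2 on unknown class.
check_mlx5_classes() {
    classes="$1"
    has_eth=0
    has_vdpa=0
    old_ifs=$IFS
    IFS=:
    for c in $classes; do
        case "$c" in
            eth) has_eth=1 ;;
            vdpa) has_vdpa=1 ;;
            compress|crypto|regex) ;;
            *)
                IFS=$old_ifs
                echo "unknown class: $c" >&2
                return 2
                ;;
        esac
    done
    IFS=$old_ifs
    if [ "$has_eth" -eq 1 ] && [ "$has_vdpa" -eq 1 ]; then
        echo "eth and vdpa cannot be probed at the same time" >&2
        return 1
    fi
    return 0
}
```

For example, ``check_mlx5_classes crypto:regex`` succeeds, while ``check_mlx5_classes eth:vdpa`` returns a nonzero status.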
84 .. _mlx5_common_compilation:
86 Compilation Prerequisites
87 -------------------------
89 .. _mlx5_linux_prerequisites:
94 This driver relies on external libraries and kernel drivers for resource
95 allocation and initialization.
96 The following dependencies are not part of DPDK and must be installed separately:
100 User space Verbs framework used by ``librte_common_mlx5``.
101 This library provides a generic interface between the kernel
102 and low-level user space drivers such as ``libmlx5``.
104 It allows slow and privileged operations (context initialization,
105 hardware resource allocation) to be managed by the kernel
106 and fast operations to never leave user space.
110 Low-level user space driver library for Mellanox devices;
111 it is automatically loaded by ``libibverbs``.
113 This library basically implements send/receive calls to the hardware queues.
117 They provide the kernel-side Verbs API and low-level device drivers
118 that manage actual hardware initialization
119 and resource sharing with user-space processes.
121 Unlike most other PMDs, these modules must remain loaded and bound to
124 - ``mlx5_core``: hardware driver managing Mellanox devices
125 and related Ethernet kernel network devices.
126 - ``mlx5_ib``: InfiniBand device driver.
127 - ``ib_uverbs``: user space driver for Verbs (entry point for ``libibverbs``).
129 - **Firmware update**
131 Mellanox OFED/EN releases include firmware updates.
133 Because each release provides new features, these updates must be applied to
134 match the kernel modules and libraries they come with.
136 Libraries and kernel modules can be provided either by the Linux distribution,
137 or by installing Mellanox OFED/EN which provides compatibility with older kernels.
140 Upstream Dependencies
141 ^^^^^^^^^^^^^^^^^^^^^
143 The mlx5 kernel modules are part of upstream Linux.
144 The minimal supported kernel version is 4.14.
145 For 32-bit, version 4.14.41 or above is required.
147 The libraries ``libibverbs`` and ``libmlx5`` are part of ``rdma-core``.
148 It is packaged by most Linux distributions.
149 The minimal supported rdma-core version is 16.
150 For 32-bit, version 18 or above is required.
152 The rdma-core sources can be downloaded at
153 https://github.com/linux-rdma/rdma-core
155 It is possible to build rdma-core as static libraries starting with version 21::
158 CFLAGS=-fPIC cmake -DIN_PLACE=1 -DENABLE_STATIC=1 -GNinja ..
165 The kernel modules and libraries are packaged with other tools
166 in Mellanox OFED or Mellanox EN.
167 The minimal supported versions are:
169 - Mellanox OFED version: **4.5** and above.
170 - Mellanox EN version: **4.5** and above.
173 - ConnectX-4: **12.21.1000** and above.
174 - ConnectX-4 Lx: **14.21.1000** and above.
175 - ConnectX-5: **16.21.1000** and above.
176 - ConnectX-5 Ex: **16.21.1000** and above.
177 - ConnectX-6: **20.27.0090** and above.
178 - ConnectX-6 Dx: **22.27.0090** and above.
179 - BlueField: **18.25.1010** and above.
180 - BlueField-2: **24.28.1002** and above.
182 The firmware, the libraries libibverbs, libmlx5, and mlnx-ofed-kernel modules
183 are packaged in `Mellanox OFED
184 <https://network.nvidia.com/products/infiniband-drivers/linux/mlnx_ofed/>`_.
185 After downloading, it can be installed with this command::
187 ./mlnxofedinstall --dpdk
190 <https://network.nvidia.com/products/ethernet-drivers/linux/mlnx_en/>`_
191 is a smaller package including what is needed for DPDK.
192 After downloading, it can be installed with this command::
196 After installing, the firmware version can be checked::
202 Several versions of Mellanox OFED/EN are available. Installing the version
203 this DPDK release was developed and tested against is strongly recommended.
204 Please check the "Tested Platforms" section in the :doc:`../../rel_notes/index`.
207 .. _mlx5_windows_prerequisites:
209 Windows Prerequisites
210 ~~~~~~~~~~~~~~~~~~~~~
212 The mlx5 PMDs rely on external libraries and kernel drivers
213 for resource allocation and initialization.
216 DevX SDK Installation
217 ^^^^^^^^^^^^^^^^^^^^^
219 The DevX SDK must be installed on the machine building the Windows PMD.
220 Additional information can be found at
221 `How to Integrate Windows DevX in Your Development Environment
222 <https://docs.nvidia.com/networking/display/winof2v260/RShim+Drivers+and+Usage#RShimDriversandUsage-DevXInterface>`_.
223 The minimal supported WinOF2 version is 2.60.
232 The ibverbs libraries can be linked with this PMD in a number of ways,
233 configured by the ``ibverbs_link`` build option:
236 The PMD depends on some .so files.
239 Split the dependencies glue in a separate library
240 loaded when needed by dlopen (see ``MLX5_GLUE_PATH``).
241 It makes dependencies on libibverbs and libmlx5 optional,
242 and has no performance impact.
245 Embed static flavor of the dependencies libibverbs and libmlx5
246 in the PMD shared library or the executable static binary.
249 Compilation on Windows
250 ~~~~~~~~~~~~~~~~~~~~~~
252 The DevX SDK location must be set through two environment variables:
255 path to the DevX lib file.
258 path to the DevX header files.
263 Environment Configuration
264 -------------------------
269 The kernel network interfaces are brought up during initialization.
270 Forcing them down prevents packet reception.
272 The ethtool operations on the kernel interfaces may also affect the PMD.
274 Some runtime behaviours may be configured through environment variables.
277 If built with ``ibverbs_link=dlopen``,
278 list of directories in which to search for the rdma-core "glue" plug-in,
279 separated by colons or semi-colons.
282 If Verbs is used (DevX disabled),
283 HW queue doorbell register mapping.
284 The value 0 means non-cached IO mapping,
285 while 1 is a regular memory mapping.
287 With regular memory mapping, the register is flushed to HW
288 usually when the write-combining buffer becomes full,
289 but it depends on CPU design.
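As an illustration of the ``MLX5_GLUE_PATH`` format described above, the sketch below splits such a value into one directory per line. The function name ``list_glue_dirs`` is hypothetical, and dropping empty entries is an assumption of this sketch, not a statement about how the PMD treats them.

```shell
# Hypothetical helper: print each directory of a MLX5_GLUE_PATH-style
# value on its own line. Both ':' and ';' are accepted as separators.
# Empty entries are dropped (an assumption made for this sketch).
list_glue_dirs() {
    printf '%s\n' "$1" | tr ';:' '\n\n' | sed '/^$/d'
}
```

Calling ``list_glue_dirs "$MLX5_GLUE_PATH"`` before starting the application is one way to review the directories that were configured.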
292 Port Link with OFED/EN
293 ^^^^^^^^^^^^^^^^^^^^^^
295 Port links must be set to Ethernet::
297 mlxconfig -d <mst device> query | grep LINK_TYPE
301 mlxconfig -d <mst device> set LINK_TYPE_P1/2=1/2/3
303 Link type values are:
307 * ``3`` VPI (auto-sense)
309 If link type was changed, firmware must be reset as well::
311 mlxfwreset -d <mst device> reset
316 SR-IOV Virtual Function with OFED/EN
317 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
319 SR-IOV must be enabled on the NIC.
320 It can be checked with the following command::
322 mlxconfig -d <mst device> query | grep SRIOV_EN
325 If needed, configure SR-IOV::
327 mlxconfig -d <mst device> set SRIOV_EN=1 NUM_OF_VFS=16
328 mlxfwreset -d <mst device> reset
330 After doing the change, restart the driver::
332 /etc/init.d/openibd restart
336 service openibd restart
338 Then the virtual functions can be instantiated::
340 echo [num_vfs] > /sys/class/infiniband/mlx5_0/device/sriov_numvfs
343 .. _mlx5_sub_function:
345 Sub-Function with OFED/EN
346 ^^^^^^^^^^^^^^^^^^^^^^^^^
348 A Sub-Function (SF) is a portion of the PCI device
349 with its own dedicated queues.
350 An SF shares PCI-level resources with other SFs and/or with its parent PCI function.
354 OFED version >= 5.4-0.3.3.0
356 1. Configure SF feature::
358 # Run mlxconfig on both PFs on host and ECPFs on BlueField.
359 mlxconfig -d <mst device> set PER_PF_NUM_SF=1 PF_TOTAL_SF=252 PF_SF_BAR_SIZE=12
361 2. Enable switchdev mode::
363 mlxdevm dev eswitch set pci/<DBDF> mode switchdev
367 mlxdevm port add pci/<DBDF> flavour pcisf pfnum 0 sfnum <sfnum>
369 Get SFID from output: pci/<DBDF>/<SFID>
371 4. Modify MAC address::
373 mlxdevm port function set pci/<DBDF>/<SFID> hw_addr <MAC>
375 5. Activate SF port::
377 mlxdevm port function set pci/<DBDF>/<SFID> state active
379 6. Devargs to probe SF device::
381 auxiliary:mlx5_core.sf.<num>,class=eth:regex
384 Enable Switchdev Mode
385 ^^^^^^^^^^^^^^^^^^^^^
387 Switchdev mode is an E-Switch mode that binds a representor to a VF or SF.
388 A representor is a DPDK port connected to a VF or SF in such a way
389 that, assuming there are no offload flows, each packet sent from the VF or SF
390 is received by the corresponding representor,
391 and each packet sent to a representor is received by the VF or SF.
393 After :ref:`configuring VF <mlx5_vf>`, the device must be unbound::
395 printf "<device pci address>" > /sys/bus/pci/drivers/mlx5_core/unbind
397 Then switchdev mode is enabled::
399 echo switchdev > /sys/class/net/<net device>/compat/devlink/mode
401 The device can be bound again at this point.
407 In order to run as a non-root user,
408 some capabilities must be granted to the application::
410 setcap cap_sys_admin,cap_net_admin,cap_net_raw,cap_ipc_lock+ep <dpdk-app>
412 Each capability is needed for the following reason:
415 When using physical addresses (PA mode), with Linux >= 4.0,
416 for access to ``/proc/self/pagemap``.
419 For device configuration.
422 For raw ethernet queue allocation through kernel driver.
425 For DMA memory pinning.
431 WinOF2 version 2.60 or higher must be installed on the machine.
437 The driver can be downloaded from the following site: `WINOF2
438 <https://network.nvidia.com/products/adapter-software/ethernet/windows/winof-2/>`_.
444 DevX for Windows must be enabled in the Windows registry.
445 The keys ``DevxEnabled`` and ``DevxFsRules`` must be set.
446 Additional information can be found in the WinOF2 user manual.
449 .. _mlx5_firmware_config:
451 Firmware Configuration
452 ~~~~~~~~~~~~~~~~~~~~~~
454 Firmware features can be configured as key/value pairs.
456 The command to set a value is::
458 mlxconfig -d <device> set <key>=<value>
460 The command to query a value is::
462 mlxconfig -d <device> query <key>
464 The device name for the command ``mlxconfig`` can be either the PCI address,
465 or the mst device name found with::
469 Some firmware configurations are listed below.
475 value: 1=Infiniband 2=Ethernet 3=VPI(auto-sense)
481 - the maximum number of SR-IOV virtual functions::
485 - enable DevX (required by Direct Rules and other features)::
489 - aggressive CQE zipping::
493 - L3 VXLAN and VXLAN-GPE destination UDP port::
496 IP_OVER_VXLAN_PORT=<udp dport>
498 - enable VXLAN-GPE tunnel flow matching::
500 FLEX_PARSER_PROFILE_ENABLE=0
502 FLEX_PARSER_PROFILE_ENABLE=2
504 - enable IP-in-IP tunnel flow matching::
506 FLEX_PARSER_PROFILE_ENABLE=0
508 - enable MPLS flow matching::
510 FLEX_PARSER_PROFILE_ENABLE=1
512 - enable ICMP(code/type/identifier/sequence number) / ICMP6(code/type) fields matching::
514 FLEX_PARSER_PROFILE_ENABLE=2
516 - enable Geneve flow matching::
518 FLEX_PARSER_PROFILE_ENABLE=0
520 FLEX_PARSER_PROFILE_ENABLE=1
522 - enable Geneve TLV option flow matching::
524 FLEX_PARSER_PROFILE_ENABLE=0
526 - enable GTP flow matching::
528 FLEX_PARSER_PROFILE_ENABLE=3
530 - enable eCPRI flow matching::
532 FLEX_PARSER_PROFILE_ENABLE=4
535 - enable dynamic flex parser for flex item::
537 FLEX_PARSER_PROFILE_ENABLE=4
540 - enable realtime timestamp format::
542 REAL_TIME_CLOCK_ENABLE=1
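Several of the features above depend on ``FLEX_PARSER_PROFILE_ENABLE``, and only one value can be set at a time. The sketch below is a lookup table that copies the values shown in this list and reports which profile values, if any, are listed for every requested feature. The function names and feature keywords are hypothetical, and the table does not cover any other firmware key a feature may require.

```shell
# Hypothetical lookup: FLEX_PARSER_PROFILE_ENABLE values listed above
# for a single feature keyword (keywords are made up for this sketch).
flex_profiles_for() {
    case "$1" in
        vxlan-gpe)  echo "0 2" ;;
        ip-in-ip)   echo "0" ;;
        mpls)       echo "1" ;;
        icmp)       echo "2" ;;
        geneve)     echo "0 1" ;;
        geneve-tlv) echo "0" ;;
        gtp)        echo "3" ;;
        ecpri)      echo "4" ;;
        flex-item)  echo "4" ;;
        *) echo "unknown feature: $1" >&2; return 1 ;;
    esac
}

# Print every profile value that is listed for all given features.
# Prints nothing when no single value covers them all.
common_flex_profile() {
    for p in 0 1 2 3 4; do
        ok=1
        for f in "$@"; do
            case " $(flex_profiles_for "$f") " in
                *" $p "*) ;;
                *) ok=0 ;;
            esac
        done
        if [ "$ok" -eq 1 ]; then
            echo "$p"
        fi
    done
    return 0
}
```

For instance, ``common_flex_profile vxlan-gpe geneve`` prints ``0``, whereas ``common_flex_profile gtp ecpri`` prints nothing because the list gives no shared value for these two features.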
545 .. _mlx5_common_driver_options:
550 The driver can be configured per device.
551 A single argument list can be used for a device managed by multiple PMDs.
552 The parameters must be passed through the EAL option ``-a``,
557 -a 0000:03:00.2,class=eth:regex,mr_mempool_reg_en=0
561 -a auxiliary:mlx5_core.sf.2,class=compress,mr_ext_memseg_en=0
563 Each device class PMD has its own list of specific arguments,
564 and below are the arguments supported by the common mlx5 layer.
566 - ``class`` parameter [string]
568 Select the classes of the drivers that should probe the device.
569 See :ref:`mlx5_classes` for more explanation and details.
571 The default value is ``eth``.
573 - ``mr_ext_memseg_en`` parameter [int]
575 A nonzero value enables extending memseg when registering DMA memory. If
576 enabled, the number of entries in MR (Memory Region) lookup table on datapath
577 is minimized and it benefits performance. On the other hand, it worsens memory
578 utilization because registered memory is pinned by kernel driver. Even if a
579 page in the extended chunk is freed, it does not become reusable until the
580 entire memory is freed.
584 - ``mr_mempool_reg_en`` parameter [int]
586 A nonzero value enables implicit registration of DMA memory of all mempools
587 except those having ``RTE_MEMPOOL_F_NON_IO``. This flag is set automatically
588 for mempools populated with non-contiguous objects or those without IOVA.
589 The effect is that when a packet from a mempool is transmitted,
590 its memory is already registered for DMA in the PMD and no registration
591 will happen on the data path. The tradeoff is extra work on the creation
592 of each mempool and increased HW resource use if some mempools
593 are not used with MLX5 devices.
597 - ``sys_mem_en`` parameter [int]
599 A nonzero value makes the PMD memory management allocate memory
600 from the system by default, without an explicit rte memory flag.
602 By default, the PMD will set this value to 0.
604 - ``sq_db_nc`` parameter [int]
606 The rdma-core library can map the doorbell register in two ways,
607 depending on the environment variable "MLX5_SHUT_UP_BF":
609 - As regular cached memory (usually with write combining attribute),
610 if the variable is either missing or set to zero.
611 - As non-cached memory, if the variable is present and set to a value other than "0".
613 The same doorbell mapping approach is implemented directly by PMD
614 in UAR generation for queues created with DevX.
616 The type of mapping may slightly affect the send queue performance.
617 The optimal choice strongly depends on the host architecture
618 and should be determined empirically.
620 If ``sq_db_nc`` is set to zero, the doorbell is forced to be mapped to
621 regular memory (with write combining), and the PMD performs an extra write
622 memory barrier after writing to the doorbell. This might increase the CPU
623 clocks needed per packet sent, but latency might improve.
625 If ``sq_db_nc`` is set to one, the doorbell is forced to be mapped to
626 non-cached memory, and the PMD does not perform the extra write memory barrier
627 after writing to the doorbell. On some architectures this might improve performance.
629 If ``sq_db_nc`` is set to two, the doorbell is forced to be mapped to
630 regular memory, and the PMD uses heuristics to decide whether a write memory
631 barrier should be performed. For bursts whose size is a multiple of the recommended one
632 (64 packets), the PMD assumes that another burst is coming and skips the
633 extra memory barrier (it is expected to be issued in the next burst,
634 at least after descriptor writing). This might increase latency (on some hosts
635 until the next packets are transmitted) and should be used with care.
636 The PMD uses heuristics only for Tx queues; for other send queues the doorbell
637 is forced to be mapped to regular memory, as if ``sq_db_nc`` were set to 0.
639 If ``sq_db_nc`` is omitted, the preset (if any) environment variable
640 "MLX5_SHUT_UP_BF" value is used. If there is no "MLX5_SHUT_UP_BF", the
641 default ``sq_db_nc`` value is zero for ARM64 hosts and one for others.
643 - ``cmd_fd`` parameter [int]
645 File descriptor of ``ibv_context`` created outside the PMD.
646 PMD will use this FD to import remote CTX. The ``cmd_fd`` is obtained from
647 the ``ibv_context->cmd_fd`` member, which must be dup'd before being passed.
648 This parameter is valid only if ``pd_handle`` parameter is specified.
650 By default, the PMD will create a new ``ibv_context``.
654 When the FD comes from another process, it is the user's responsibility to
655 share the FD between the processes (e.g. by SCM_RIGHTS).
657 - ``pd_handle`` parameter [int]
659 Protection domain handle of ``ibv_pd`` created outside the PMD.
660 PMD will use this handle to import remote PD. The ``pd_handle`` can be
661 obtained from the original PD by reading its ``ibv_pd->handle`` member value.
662 This parameter is valid only if ``cmd_fd`` parameter is specified,
663 and its value must be a valid kernel handle for a PD object
664 in the context represented by given ``cmd_fd``.
666 By default, the PMD will allocate a new PD.
670 The ``ibv_pd->handle`` member is different from the ``mlx5dv_pd->pdn`` member.
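Returning to the ``sq_db_nc`` parameter, the fallback order described for it (explicit devarg, then ``MLX5_SHUT_UP_BF``, then an architecture default) can be sketched as follows. This is an illustration only, not the PMD code: the function name ``resolve_sq_db_nc`` is hypothetical, and treating ``aarch64`` and ``arm64`` as the machine names of ARM64 hosts is an assumption.

```shell
# Hypothetical helper: print the effective sq_db_nc value.
#   $1: value given in devargs, or empty if the devarg is omitted
#   $2: machine name, e.g. the output of `uname -m` (assumed naming)
resolve_sq_db_nc() {
    devarg="$1"
    arch="$2"
    # 1. An explicit devarg always wins.
    if [ -n "$devarg" ]; then
        echo "$devarg"
        return 0
    fi
    # 2. Otherwise use MLX5_SHUT_UP_BF if it is present:
    #    "0" means regular memory (0), anything else non-cached (1).
    if [ -n "${MLX5_SHUT_UP_BF+set}" ]; then
        if [ "$MLX5_SHUT_UP_BF" = "0" ]; then
            echo 0
        else
            echo 1
        fi
        return 0
    fi
    # 3. Otherwise fall back to the per-architecture default.
    case "$arch" in
        aarch64|arm64) echo 0 ;;
        *) echo 1 ;;
    esac
}
```

A typical call would be ``resolve_sq_db_nc "" "$(uname -m)"`` to see which mapping applies when the devarg is left out.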