1 .. SPDX-License-Identifier: BSD-3-Clause
2 Copyright 2015 6WIND S.A.
3 Copyright 2015 Mellanox Technologies, Ltd
5 .. include:: <isonum.txt>
MLX5 poll mode driver
=====================

The MLX5 poll mode driver library (**librte_net_mlx5**) provides support
for **Mellanox ConnectX-4**, **Mellanox ConnectX-4 Lx**, **Mellanox
ConnectX-5**, **Mellanox ConnectX-6**, **Mellanox ConnectX-6 Dx**, **Mellanox
ConnectX-6 Lx**, **Mellanox BlueField** and **Mellanox BlueField-2** families
of 10/25/40/50/100/200 Gb/s adapters as well as their virtual functions (VF)
in SR-IOV context.
17 Information and documentation about these adapters can be found on the
18 `Mellanox website <http://www.mellanox.com>`__. Help is also provided by the
19 `Mellanox community <http://community.mellanox.com/welcome>`__.
21 There is also a `section dedicated to this poll mode driver
22 <http://www.mellanox.com/page/products_dyn?product_family=209&mtag=pmd_for_dpdk>`__.
Design
------
28 Besides its dependency on libibverbs (that implies libmlx5 and associated
29 kernel support), librte_net_mlx5 relies heavily on system calls for control
30 operations such as querying/updating the MTU and flow control parameters.
For security reasons and robustness, this driver only deals with virtual
memory addresses. The way resource allocations are handled by the kernel,
combined with hardware specifications that allow handling virtual memory
addresses directly, ensures that DPDK applications cannot access random
physical memory (or memory that does not belong to the current process).
38 This capability allows the PMD to coexist with kernel network interfaces
39 which remain functional, although they stop receiving unicast packets as
40 long as they share the same MAC address.
This means legacy Linux control tools (for example: ethtool, ifconfig and
more) can operate on the same network interfaces as those owned by the DPDK
application.
45 The PMD can use libibverbs and libmlx5 to access the device firmware
46 or directly the hardware components.
There are different levels of objects and bypassing abilities
to get the best performance:

- Verbs is a complete high-level generic API
- Direct Verbs is a device-specific API
- DevX allows access to firmware objects
- Direct Rules manages flow steering at the low-level hardware layer
Enabling librte_net_mlx5 causes DPDK applications to be linked against
libibverbs.

Features
--------
- Multi-arch support: x86_64, POWER8, ARMv8, i686.
62 - Multiple TX and RX queues.
63 - Support for scattered TX frames.
64 - Advanced support for scattered Rx frames with tunable buffer attributes.
65 - IPv4, IPv6, TCPv4, TCPv6, UDPv4 and UDPv6 RSS on any number of queues.
66 - RSS using different combinations of fields: L3 only, L4 only or both,
67 and source only, destination only or both.
68 - Several RSS hash keys, one for each flow type.
69 - Default RSS operation with no hash key specification.
70 - Configurable RETA table.
71 - Link flow control (pause frame).
72 - Support for multiple MAC addresses.
76 - RX CRC stripping configuration.
77 - TX mbuf fast free offload.
78 - Promiscuous mode on PF and VF.
79 - Multicast promiscuous mode on PF and VF.
80 - Hardware checksum offloads.
81 - Flow director (RTE_FDIR_MODE_PERFECT, RTE_FDIR_MODE_PERFECT_MAC_VLAN and
83 - Flow API, including :ref:`flow_isolated_mode`.
85 - KVM and VMware ESX SR-IOV modes are supported.
86 - RSS hash result is supported.
87 - Hardware TSO for generic IP or UDP tunnel, including VXLAN and GRE.
88 - Hardware checksum Tx offload for generic IP or UDP tunnel, including VXLAN and GRE.
90 - Statistics query including Basic, Extended and per queue.
92 - Tunnel types: VXLAN, L3 VXLAN, VXLAN-GPE, GRE, MPLSoGRE, MPLSoUDP, IP-in-IP, Geneve, GTP.
93 - Tunnel HW offloads: packet type, inner/outer RSS, IP and UDP checksum verification.
94 - NIC HW offloads: encapsulation (vxlan, gre, mplsoudp, mplsogre), NAT, routing, TTL
95 increment/decrement, count, drop, mark. For details please see :ref:`mlx5_offloads_support`.
- Flow insertion rate of more than one million flows per second, when using Direct Rules.
97 - Support for multiple rte_flow groups.
98 - Per packet no-inline hint flag to disable packet data copying into Tx descriptors.
101 - Multiple-thread flow insertion.
102 - Matching on IPv4 Internet Header Length (IHL).
103 - Matching on GTP extension header with raw encap/decap action.
104 - Matching on Geneve TLV option header with raw encap/decap action.
105 - RSS support in sample action.
106 - E-Switch mirroring and jump.
107 - E-Switch mirroring and modify.
- 21844 flow priorities for ingress or egress flow groups greater than 0 and for any transfer flow group.
110 - Flow metering, including meter policy API.
111 - Flow meter hierarchy.
112 - Flow integrity offload API.
113 - Connection tracking.
114 - Sub-Function representors.
Limitations
-----------
123 On Windows, the features are limited:
125 - Promiscuous mode is not supported
126 - The following rules are supported:
128 - IPv4/UDP with CVLAN filtering
129 - Unicast MAC filtering
131 - Additional rules are supported from WinOF2 version 2.70:
133 - IPv4/TCP with CVLAN filtering
134 - L4 steering rules for port RSS of UDP, TCP and IP
136 - For secondary process:
138 - Forked secondary process not supported.
- External memory unregistered in the EAL memseg list cannot be used for DMA
  unless such memory has been registered by ``mlx5_mr_update_ext_mp()`` in the
  primary process and remapped to the same virtual address in the secondary
  process. If the external memory is registered by the primary process but has a
  different virtual address in the secondary process, an unexpected error may happen.
145 - When using Verbs flow engine (``dv_flow_en`` = 0), flow pattern without any
146 specific VLAN will match for VLAN packets as well:
When VLAN spec is not specified in the pattern, the matching rule will be created with VLAN as a wild card.
Meaning, the flow rule::

    flow create 0 ingress pattern eth / vlan vid is 3 / ipv4 / end ...

will only match VLAN packets with vid=3, and the flow rule::

    flow create 0 ingress pattern eth / ipv4 / end ...

will match any ipv4 packet (VLAN included).
- When using Verbs flow engine (``dv_flow_en`` = 0), multi-tagged (QinQ) match is not supported.
161 - When using DV flow engine (``dv_flow_en`` = 1), flow pattern with any VLAN specification will match only single-tagged packets unless the ETH item ``type`` field is 0x88A8 or the VLAN item ``has_more_vlan`` field is 1.
For example, the flow rule::

    flow create 0 ingress pattern eth / ipv4 / end ...

will match any ipv4 packet.
The flow rules::

    flow create 0 ingress pattern eth / vlan / end ...
    flow create 0 ingress pattern eth has_vlan is 1 / end ...
    flow create 0 ingress pattern eth type is 0x8100 / end ...

will match single-tagged packets only, with any VLAN ID value.
The flow rules::

    flow create 0 ingress pattern eth type is 0x88A8 / end ...
    flow create 0 ingress pattern eth / vlan has_more_vlan is 1 / end ...

will match multi-tagged packets only, with any VLAN ID value.
181 - A flow pattern with 2 sequential VLAN items is not supported.
183 - VLAN pop offload command:
- Flow rules that have a VLAN pop offload command as one of their actions and
  lack a match on VLAN as one of their items are not supported.
187 - The command is not supported on egress traffic in NIC mode.
189 - VLAN push offload is not supported on ingress traffic in NIC mode.
191 - VLAN set PCP offload is not supported on existing headers.
- A multi-segment packet must have no more segments than reported by ``dev_infos_get()``
  in the ``tx_desc_lim.nb_seg_max`` field. This value depends on the maximal supported Tx
  descriptor size and ``txq_inline_min`` settings and may range from 2 (worst case forced
  by maximal inline settings) to 58.
- Match on VXLAN supports the following fields only:

  - VNI
  - Last reserved 8-bits

  Last reserved 8-bits matching is only supported when using DV flow
  engine (``dv_flow_en`` = 1).
  Group zero's behavior may differ, depending on FW.
  Matching value equals 0 (value & mask) is not supported.
208 - L3 VXLAN and VXLAN-GPE tunnels cannot be supported together with MPLSoGRE and MPLSoUDP.
- Match on Geneve header supports the following fields only:

  - VNI
  - OAM
  - protocol type
  - options length
- Match on Geneve TLV option is supported on the following fields:

  - Class
  - Type
  - Length
  - Data

  Only one Class/Type/Length Geneve TLV option is supported per shared device.
  Class/Type/Length fields must be specified as well as masks.
  Class/Type/Length specified masks must be full.
  Matching Geneve TLV option without specifying data is not supported.
  Matching Geneve TLV option with ``data & mask == 0`` is not supported.
230 - VF: flow rules created on VF devices can only match traffic targeted at the
231 configured MAC addresses (see ``rte_eth_dev_mac_addr_add()``).
- Match on GTP tunnel header item supports the following fields only:

  - v_pt_rsv_flags: E flag, S flag, PN flag
  - msg_type
  - teid
239 - Match on GTP extension header only for GTP PDU session container (next
240 extension header type = 0x85).
241 - Match on GTP extension header is not supported in group 0.
- No Tx metadata goes to the E-Switch steering domain for Flow group 0.
  Flows within group 0 combined with the set metadata action are rejected by hardware.
248 MAC addresses not already present in the bridge table of the associated
249 kernel network device will be added and cleaned up by the PMD when closing
250 the device. In case of ungraceful program termination, some entries may
251 remain present and should be removed manually by other means.
253 - Buffer split offload is supported with regular Rx burst routine only,
254 no MPRQ feature or vectorized code can be engaged.
- When Multi-Packet Rx queue is configured (``mprq_en``), an Rx packet can be
  externally attached to a user-provided mbuf with EXT_ATTACHED_MBUF set in
  ol_flags. As the mempool for the external buffer is managed by the PMD, all the
  Rx mbufs must be freed before the device is closed. Otherwise, the mempool of
  the external buffers will be freed by the PMD and the application which still
  holds the external buffers may be corrupted.
263 - If Multi-Packet Rx queue is configured (``mprq_en``) and Rx CQE compression is
264 enabled (``rxq_cqe_comp_en``) at the same time, RSS hash result is not fully
265 supported. Some Rx packets may not have PKT_RX_RSS_HASH.
- IPv6 multicast messages are not supported on VM while promiscuous mode
  and allmulticast mode are both set to off.
  To receive IPv6 multicast messages on a VM, explicitly set the relevant
  MAC address using the ``rte_eth_dev_mac_addr_add()`` API.
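For instance, under testpmd the same effect can be achieved interactively; the port number and the multicast-mapped MAC address below are illustrative only::

    testpmd> mcast_addr add 0 33:33:00:00:00:01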
272 - To support a mixed traffic pattern (some buffers from local host memory, some
273 buffers from other devices) with high bandwidth, a mbuf flag is used.
The application hints the PMD whether or not it should try to inline the
given mbuf data buffer; the PMD makes a best effort to act upon this request.
278 The hint flag ``RTE_PMD_MLX5_FINE_GRANULARITY_INLINE`` is dynamic,
279 registered by application with rte_mbuf_dynflag_register(). This flag is
280 purely driver-specific and declared in PMD specific header ``rte_pmd_mlx5.h``,
281 which is intended to be used by the application.
283 To query the supported specific flags in runtime,
284 the function ``rte_pmd_mlx5_get_dyn_flag_names`` returns the array of
285 currently (over present hardware and configuration) supported specific flags.
The "not inline hint" feature operates as follows:
289 - probe the devices, ports are created
290 - query the port capabilities
291 - if port supporting the feature is found
292 - register dynamic flag ``RTE_PMD_MLX5_FINE_GRANULARITY_INLINE``
293 - application starts the ports
294 - on ``dev_start()`` PMD checks whether the feature flag is registered and
295 enables the feature support in datapath
- the application may set the registered flag bit in the ``ol_flags`` field
  of an mbuf being sent and the PMD will handle it appropriately.
- The number of descriptors in a Tx queue may be limited by data inline settings.
  Inline data require more descriptor building blocks and the overall block
  amount may exceed the hardware supported limits. The application should
  reduce the requested Tx size or adjust the data inline settings with the
  ``txq_inline_max`` and ``txq_inline_mpw`` devargs keys.
305 - To provide the packet send scheduling on mbuf timestamps the ``tx_pp``
306 parameter should be specified.
When the PMD sees the RTE_MBUF_DYNFLAG_TX_TIMESTAMP_NAME flag set on a packet
being sent, it tries to synchronize the time of the packet's appearance on
the wire with the specified packet timestamp. If the specified timestamp
is in the past, it is ignored; if it is in the distant future, it is
capped to some reasonable value (in the range of seconds).
These specific cases ("too late" and "distant future") can be optionally
reported via device xstats to assist applications in detecting
time-related problems.
316 The timestamp upper "too-distant-future" limit
317 at the moment of invoking the Tx burst routine
318 can be estimated as ``tx_pp`` option (in nanoseconds) multiplied by 2^23.
319 Please note, for the testpmd txonly mode,
320 the limit is deduced from the expression::
322 (n_tx_descriptors / burst_size + 1) * inter_burst_gap
No packet reordering according to timestamps is performed, neither within
a packet burst nor between packets; it is entirely the application's
responsibility to generate packets and their timestamps
in the desired order. The timestamp can be put only in the first packet
of a burst, providing scheduling for the entire burst.
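As an illustrative sketch, the feature is enabled by passing ``tx_pp`` through devargs when launching an application; the PCI address and the 500 ns granularity below are example values only::

    dpdk-testpmd -a 82:00.0,tx_pp=500 -- --forward-mode=txonly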
330 - E-Switch decapsulation Flow:
332 - can be applied to PF port only.
333 - must specify VF port action (packet redirection from PF to VF).
334 - optionally may specify tunnel inner source and destination MAC addresses.
336 - E-Switch encapsulation Flow:
338 - can be applied to VF ports only.
339 - must specify PF port action (packet redirection from VF to PF).
- Raw encapsulation:

  - The input buffer, used as outer header, is not validated.

- Raw decapsulation:

  - The decapsulation is always done up to the outermost tunnel detected by the HW.
  - The input buffer, providing the removal size, is not validated.
  - The buffer size must match the length of the headers to be removed.
351 - ICMP(code/type/identifier/sequence number) / ICMP6(code/type) matching, IP-in-IP and MPLS flow matching are all
352 mutually exclusive features which cannot be supported together
353 (see :ref:`mlx5_firmware_config`).
- LRO:

  - Requires DevX and DV flow to be enabled.
358 - KEEP_CRC offload cannot be supported with LRO.
- The first mbuf length, without head-room, must be big enough to include the
  TCP header (122B).
361 - Rx queue with LRO offload enabled, receiving a non-LRO packet, can forward
362 it with size limited to max LRO size, not to max RX packet length.
363 - LRO can be used with outer header of TCP packets of the standard format:
364 eth (with or without vlan) / ipv4 or ipv6 / tcp / payload
366 Other TCP packets (e.g. with MPLS label) received on Rx queue with LRO enabled, will be received with bad checksum.
- LRO packet aggregation is performed by HW only for packet sizes larger than
  ``lro_min_mss_size``. This value is reported on device start, when debug
  mode is enabled.
373 - ``DEV_RX_OFFLOAD_KEEP_CRC`` cannot be supported with decapsulation
374 for some NICs (such as ConnectX-6 Dx, ConnectX-6 Lx, and BlueField-2).
375 The capability bit ``scatter_fcs_w_decap_disable`` shows NIC support.
- fast free offload assumes that all mbufs being sent originate from the
  same memory pool and there are no extra references to the mbufs (the
  reference counter of each mbuf equals 1 on the tx_burst call). The latter
  means there should be no externally attached buffers in the mbufs. It is
  the application's responsibility to provide correct mbufs if the fast
  free offload is engaged. The mlx5 PMD implicitly produces mbufs with
  externally attached buffers if the MPRQ option is enabled, hence the fast
  free offload is neither supported nor advertised if MPRQ is enabled.
- Sample flow:

  - Supports ``RTE_FLOW_ACTION_TYPE_SAMPLE`` action only within NIC Rx and
    E-Switch steering domain.
392 - For E-Switch Sampling flow with sample ratio > 1, additional actions are not
393 supported in the sample actions list.
- For ConnectX-5, the ``RTE_FLOW_ACTION_TYPE_SAMPLE`` is typically used as the
  first action in the E-Switch egress flow when combined with header modify or
  encapsulation actions.
- For NIC Rx flow, supports ``MARK``, ``COUNT``, ``QUEUE``, ``RSS`` in the
  sample actions list.
399 - For E-Switch mirroring flow, supports ``RAW ENCAP``, ``Port ID``,
400 ``VXLAN ENCAP``, ``NVGRE ENCAP`` in the sample actions list.
- Modify Field flow:

  - Supports the 'set' operation only for ``RTE_FLOW_ACTION_TYPE_MODIFY_FIELD`` action.
- Modification of an arbitrary place in a packet via the special ``RTE_FLOW_FIELD_START`` Field ID is not supported.
- Modification of the 802.1Q Tag, VXLAN Network or GENEVE Network IDs is not supported.
- Encapsulation levels are not supported; only outermost header fields can be modified.
- Offsets must be 32-bit aligned and cannot skip past the boundary of a field.
- The IPv6 header item 'proto' field, indicating the next header protocol, should
  not be set to an extension header type.
  If the next header is an extension header, it should not be specified in
  the IPv6 header item 'proto' field.
  The last extension header item's 'next header' field can specify the following
  header protocol type.
- Hairpin:

  - Hairpin between two ports supports only manual binding and explicit Tx flow mode.
    For single-port hairpin, all combinations of auto/manual binding and
    explicit/implicit Tx flow mode are supported.
  - Hairpin in switchdev SR-IOV mode is not supported so far.
- Meter:

  - All the meter colors with drop action will be counted only by the global drop statistics.
  - Yellow detection is only supported with ASO metering.
  - Red color must be with drop action.
  - Meter statistics are supported only for drop case.
  - A meter action created with a pre-defined policy must be the last action in the flow,
    except for the single case where the policy actions are:

    - green: NULL or END.
    - yellow: NULL or END.

  - The only supported meter policy actions:

    - green: QUEUE, RSS, PORT_ID, JUMP, DROP, MARK and SET_TAG.
    - yellow: QUEUE, RSS, PORT_ID, JUMP, DROP, MARK and SET_TAG.

  - Policy actions of RSS for green and yellow should have the same configuration except queues.
  - Meter profile packet mode is supported.
  - Meter profiles of RFC2697, RFC2698 and RFC4115 are supported.
- Integrity:

  - Integrity offload is enabled for the **ConnectX-6** family.
  - Verification bits provided by the hardware are ``l3_ok``, ``ipv4_csum_ok``, ``l4_ok``, ``l4_csum_ok``.
  - ``level`` value 0 references outer headers.
  - Multiple integrity items are not supported in a single flow rule.
  - Flow rule items supplied by the application must explicitly specify the network headers
    referred to by the integrity item.
    For example, if the integrity item mask sets the ``l4_ok`` or ``l4_csum_ok`` bits,
    a reference to the L4 network header, TCP or UDP, must be in the rule pattern as well::

      flow create 0 ingress pattern integrity level is 0 value mask l3_ok value spec l3_ok / eth / ipv6 / end …

      flow create 0 ingress pattern integrity level is 0 value mask l4_ok value spec 0 / eth / ipv4 proto is udp / end …
454 - Connection tracking:
456 - Cannot co-exist with ASO meter, ASO age action in a single flow rule.
457 - Flow rules insertion rate and memory consumption need more optimization.
459 - 4M connections maximum.
461 - Multi-thread flow insertion:
- In order to achieve the best insertion rate, the application should manage the flows per lcore.
- It is better to disable memory reclaim by setting ``reclaim_mem_mode`` to 0 to accelerate the flow object allocation and release with cache.
Statistics
----------
469 MLX5 supports various methods to report statistics:
Port statistics can be queried using ``rte_eth_stats_get()``. The received and sent statistics are counted in software only, reflecting the number of packets received or sent successfully by the PMD. The imissed counter is the number of packets that could not be delivered to SW because a queue was full. Packets not received due to congestion in the bus or on the NIC can be queried via the rx_discards_phy xstats counter.
Extended statistics can be queried using ``rte_eth_xstats_get()``. The extended statistics expose a wider set of counters counted by the device. The extended port statistics count the number of packets received or sent successfully by the port. As Mellanox NICs are using the :ref:`Bifurcated Linux Driver <linux_gsg_linux_drivers>`, those counters also count packets received or sent by the Linux kernel. The counters with ``_phy`` suffix count the total events on the physical port and are therefore not valid for VF.
Finally, per-flow statistics can be queried using ``rte_flow_query`` when attaching a count action to a specific flow. The flow counter counts the number of packets received successfully by the port that match the specific flow.
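Under testpmd, the three levels of statistics described above can be inspected interactively as follows; port 0 and flow rule 0 are illustrative::

    testpmd> show port stats 0
    testpmd> show port xstats 0
    testpmd> flow query 0 0 count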
Configuration
-------------

Compilation options
~~~~~~~~~~~~~~~~~~~

The ibverbs libraries can be linked with this PMD in a number of ways,
configured by the ``ibverbs_link`` build option:
486 - ``shared`` (default): the PMD depends on some .so files.
- ``dlopen``: Split the dependencies glue in a separate library
  loaded when needed by dlopen.
  It makes dependencies on libibverbs and libmlx5 optional,
  and has no performance impact.
- ``static``: Embed the static flavor of the dependencies libibverbs and libmlx5
  in the PMD shared library or the executable static binary.
Environment variables
~~~~~~~~~~~~~~~~~~~~~
- ``MLX5_GLUE_PATH``

  A list of directories in which to search for the rdma-core "glue" plug-in,
  separated by colons or semi-colons.
504 - ``MLX5_SHUT_UP_BF``
506 Configures HW Tx doorbell register as IO-mapped.
508 By default, the HW Tx doorbell is configured as a write-combining register.
509 The register would be flushed to HW usually when the write-combining buffer
510 becomes full, but it depends on CPU design.
Except for vectorized Tx burst routines, a write memory barrier is enforced
after updating the register so that the update can be immediately visible to
the device.
When vectorized Tx burst is called, the barrier is set only if the burst size
is not aligned to MLX5_VPMD_TX_MAX_BURST. However, setting this environment
variable will bring better latency even though the maximum throughput can
slightly decline.
Run-time configuration
~~~~~~~~~~~~~~~~~~~~~~
- librte_net_mlx5 brings kernel network interfaces up during initialization
  because it is affected by their state. Forcing them down prevents packet
  processing.
528 - **ethtool** operations on related kernel interfaces also affect the PMD.
533 In order to run as a non-root user,
534 some capabilities must be granted to the application::
536 setcap cap_sys_admin,cap_net_admin,cap_net_raw,cap_ipc_lock+ep <dpdk-app>
Below are the reasons for needing each capability:

- ``cap_sys_admin``

  When using physical addresses (PA mode), with Linux >= 4.0,
  for access to ``/proc/self/pagemap``.

- ``cap_net_admin``

  For device configuration.

- ``cap_net_raw``

  For raw ethernet queue allocation through kernel driver.

- ``cap_ipc_lock``

  For DMA memory pinning.
Driver options
~~~~~~~~~~~~~~
556 - ``rxq_cqe_comp_en`` parameter [int]
A nonzero value enables the compression of CQE on RX side. This feature
allows saving PCI bandwidth and improves performance. Enabled by default.
560 Different compression formats are supported in order to achieve the best
561 performance for different traffic patterns. Default format depends on
562 Multi-Packet Rx queue configuration: Hash RSS format is used in case
563 MPRQ is disabled, Checksum format is used in case MPRQ is enabled.
565 Specifying 2 as a ``rxq_cqe_comp_en`` value selects Flow Tag format for
566 better compression rate in case of RTE Flow Mark traffic.
567 Specifying 3 as a ``rxq_cqe_comp_en`` value selects Checksum format.
568 Specifying 4 as a ``rxq_cqe_comp_en`` value selects L3/L4 Header format for
569 better compression rate in case of mixed TCP/UDP and IPv4/IPv6 traffic.
570 CQE compression format selection requires DevX to be enabled. If there is
571 no DevX enabled/supported the value is reset to 1 by default.
Supported on:

- x86_64 with ConnectX-4, ConnectX-4 Lx, ConnectX-5, ConnectX-6, ConnectX-6 Dx,
  ConnectX-6 Lx, BlueField and BlueField-2.
577 - POWER9 and ARMv8 with ConnectX-4 Lx, ConnectX-5, ConnectX-6, ConnectX-6 Dx,
578 ConnectX-6 Lx, BlueField and BlueField-2.
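For example, to select the L3/L4 Header compression format, the parameter can be passed through devargs; the PCI address below is illustrative::

    dpdk-testpmd -a 82:00.0,rxq_cqe_comp_en=4 -- -i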
580 - ``rxq_pkt_pad_en`` parameter [int]
A nonzero value enables padding Rx packets to the size of a cacheline on PCI
transactions. This feature would waste PCI bandwidth but could improve
performance by avoiding partial cacheline writes which may cause costly
read-modify-write memory transactions on some architectures. Disabled by
default.
Supported on:

- x86_64 with ConnectX-4, ConnectX-4 Lx, ConnectX-5, ConnectX-6, ConnectX-6 Dx,
  ConnectX-6 Lx, BlueField and BlueField-2.
592 - POWER8 and ARMv8 with ConnectX-4 Lx, ConnectX-5, ConnectX-6, ConnectX-6 Dx,
593 ConnectX-6 Lx, BlueField and BlueField-2.
595 - ``mprq_en`` parameter [int]
597 A nonzero value enables configuring Multi-Packet Rx queues. Rx queue is
598 configured as Multi-Packet RQ if the total number of Rx queues is
599 ``rxqs_min_mprq`` or more. Disabled by default.
Multi-Packet Rx Queue (MPRQ a.k.a Striding RQ) can further save PCIe bandwidth
by posting a single large buffer for multiple packets. Instead of posting a
buffer per packet, one large buffer is posted in order to receive multiple
packets in it. An MPRQ buffer consists of multiple fixed-size strides
and each stride receives one packet. MPRQ can improve throughput for
small-packet traffic.
When MPRQ is enabled, max_rx_pkt_len can be larger than the size of the
user-provided mbuf even if DEV_RX_OFFLOAD_SCATTER isn't enabled. The PMD will
configure a stride size large enough to accommodate max_rx_pkt_len as long as the
device allows. Note that this can waste system memory compared to enabling Rx
scatter and multi-segment packets.
614 - ``mprq_log_stride_num`` parameter [int]
Log 2 of the number of strides for Multi-Packet Rx queue. Configuring more
strides can reduce PCIe traffic further. If the configured value is not in the
range of device capability, the default value will be set with a warning
message. The default value is 4, which is 16 strides per buffer, valid only
if ``mprq_en`` is set.
622 The size of Rx queue should be bigger than the number of strides.
624 - ``mprq_log_stride_size`` parameter [int]
Log 2 of the size of a stride for Multi-Packet Rx queue. Configuring a smaller
stride size can save some memory and reduce the probability of a depletion of all
available strides due to unreleased packets by an application. If the configured
value is not in the range of device capability, the default value will be set
with a warning message. The default value is 11, which is 2048 bytes per
stride, valid only if ``mprq_en`` is set. With ``mprq_log_stride_size`` set
it is possible for a packet to span across multiple strides. This mode allows
support of jumbo frames (9K) with MPRQ. The memcopy of some packets (or part
of a packet if Rx scatter is configured) may be required in case there is no
space left for a head room at the end of a stride, which incurs some
software overhead.
638 - ``mprq_max_memcpy_len`` parameter [int]
640 The maximum length of packet to memcpy in case of Multi-Packet Rx queue. Rx
641 packet is mem-copied to a user-provided mbuf if the size of Rx packet is less
642 than or equal to this parameter. Otherwise, PMD will attach the Rx packet to
643 the mbuf by external buffer attachment - ``rte_pktmbuf_attach_extbuf()``.
644 A mempool for external buffers will be allocated and managed by PMD. If Rx
645 packet is externally attached, ol_flags field of the mbuf will have
646 EXT_ATTACHED_MBUF and this flag must be preserved. ``RTE_MBUF_HAS_EXTBUF()``
647 checks the flag. The default value is 128, valid only if ``mprq_en`` is set.
649 - ``rxqs_min_mprq`` parameter [int]
Configure Rx queues as Multi-Packet RQ if the total number of Rx queues is
greater or equal to this value. The default value is 12, valid only if
``mprq_en`` is set.
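A hypothetical combination of the MPRQ parameters above could be passed through devargs as follows; the PCI address and values are illustrative only::

    dpdk-testpmd -a 82:00.0,mprq_en=1,rxqs_min_mprq=8,mprq_log_stride_num=6,mprq_max_memcpy_len=128 -- -i --rxq=8 --txq=8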
655 - ``txq_inline`` parameter [int]
657 Amount of data to be inlined during TX operations. This parameter is
658 deprecated and converted to the new parameter ``txq_inline_max`` providing
659 partial compatibility.
661 - ``txqs_min_inline`` parameter [int]
Enable inline data send only when the number of TX queues is greater or equal
to this value.

This option should be used in combination with ``txq_inline_max`` and
``txq_inline_mpw`` below and does not affect ``txq_inline_min`` settings above.

If this option is not specified, the default value 16 is used for BlueField
and 8 for other platforms.
Data inlining consumes CPU cycles, so this option is intended to enable
inline data automatically if there are enough Tx queues, which means there
are enough CPU cores while PCI bandwidth is becoming critical and the CPU
is no longer expected to be the bottleneck.
Copying data into the WQE improves latency and can improve PPS performance
when PCI back pressure is detected, and may be useful for scenarios involving
heavy traffic on many queues.
681 Because additional software logic is necessary to handle this mode, this
682 option should be used with care, as it may lower performance when back
683 pressure is not expected.
If inline data are enabled, the maximal size of the Tx queue in descriptors
may be affected, because inline data increase the descriptor size and the
queue size limits supported by hardware may be exceeded.
689 - ``txq_inline_min`` parameter [int]
691 Minimal amount of data to be inlined into WQE during Tx operations. NICs
692 may require this minimal data amount to operate correctly. The exact value
693 may depend on NIC operation mode, requested offloads, etc. It is strongly
694 recommended to omit this parameter and use the default values. Anyway,
695 applications using this parameter should take into consideration that
696 specifying an inconsistent value may prevent the NIC from sending packets.
If the ``txq_inline_min`` key is present, the specified value (possibly aligned
by the driver in order not to exceed the limits and to provide better descriptor
space utilization) will be used by the driver, and it is guaranteed that the
requested amount of data bytes is inlined into the WQE besides other inline
settings. This key may also update the ``txq_inline_max`` value (default
or explicitly specified in devargs) to reserve space for the inline data.
If the ``txq_inline_min`` key is not present, the value may be queried by the
driver from the NIC via DevX if this feature is available. If there is no DevX
enabled/supported, the value 18 (supposing L2 header including VLAN) is set
for ConnectX-4 and ConnectX-4 Lx, and 0 is set by default for ConnectX-5
and newer NICs. If a packet is shorter than the ``txq_inline_min`` value,
the entire packet is inlined.

For ConnectX-4 NICs, the driver does not allow specifying a value below 18
(minimal L2 header, including VLAN); an error will be raised.
For ConnectX-4 Lx NICs, it is allowed to specify values below 18, but
it is not recommended and may prevent the NIC from sending packets over
some configurations.
  For ConnectX-4 and ConnectX-4 Lx NICs, the automatically configured value
  may be insufficient for some traffic, because these NICs require at least
  all L2 headers to be inlined. For example, Q-in-Q adds 4 bytes to the
  default 18 bytes of Ethernet and VLAN, thus ``txq_inline_min`` must be set
  to 22. MPLS would add 4 bytes per label. The final value must account for
  all possible L2 encapsulation headers used in the particular environment.

  Please note that this minimal data inlining disengages the eMPW feature
  (Enhanced Multi-Packet Write), because the latter does not support partial
  packet inlining. This is not very critical, since minimal data inlining is
  mostly required by ConnectX-4 and ConnectX-4 Lx, and these NICs do not
  support the eMPW feature.

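
  As a sketch (the PCI address is a placeholder), the Q-in-Q case described
  above would be covered by::

    <PCI_BDF>,txq_inline_min=22
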
- ``txq_inline_max`` parameter [int]

  Specifies the maximal packet length to be completely inlined into the WQE
  Ethernet Segment for the ordinary SEND method. If a packet is larger than
  the specified value, the packet data is not copied by the driver at all,
  and the data buffer is referenced by a pointer. If the packet length is
  less than or equal, all packet data is copied into the WQE. This may
  significantly improve PCI bandwidth utilization for short packets but
  requires extra CPU cycles.

  The data inline feature is controlled by the number of Tx queues: if the
  number of Tx queues is larger than the ``txqs_min_inline`` key parameter,
  the inline feature is engaged; if there are not enough Tx queues (which
  means not enough CPU cores and CPU resources are scarce), data inlining is
  not performed by the driver. Setting ``txqs_min_inline`` to zero always
  enables data inlining.

  The default ``txq_inline_max`` value is 290. The specified value may be
  adjusted by the driver in order not to exceed the limit (930 bytes) and to
  provide better WQE space filling without gaps; the adjustment is reflected
  in the debug log. Also, the default value (290) may be decreased at run
  time if a large transmit queue size is requested and the hardware does not
  support a sufficient descriptor amount; in this case a warning is emitted.
  If the ``txq_inline_max`` key is specified and the requested inline
  settings cannot be satisfied, then an error will be raised.

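
  For illustration (placeholder address, assumed value), inlining packets of
  up to 256 bytes could be requested with::

    <PCI_BDF>,txq_inline_max=256
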
- ``txq_inline_mpw`` parameter [int]

  Specifies the maximal packet length to be completely inlined into the WQE
  for the Enhanced MPW method. If a packet is larger than the specified
  value, the packet data is not copied, and the data buffer is referenced by
  a pointer. If the packet length is less than or equal, all packet data is
  copied into the WQE. This may significantly improve PCI bandwidth
  utilization for short packets but requires extra CPU cycles.

  The data inline feature is controlled by the number of Tx queues: if the
  number of Tx queues is larger than the ``txqs_min_inline`` key parameter,
  the inline feature is engaged; if there are not enough Tx queues (which
  means not enough CPU cores and CPU resources are scarce), data inlining is
  not performed by the driver. Setting ``txqs_min_inline`` to zero always
  enables data inlining.

  The default ``txq_inline_mpw`` value is 268. The specified value may be
  adjusted by the driver in order not to exceed the limit (930 bytes) and to
  provide better WQE space filling without gaps; the adjustment is reflected
  in the debug log. Since multiple packets may be included in the same WQE
  with the Enhanced Multi-Packet Write method and the overall WQE size is
  limited, it is not recommended to specify large values for
  ``txq_inline_mpw``. Also, the default value (268) may be decreased at run
  time if a large transmit queue size is requested and the hardware does not
  support a sufficient descriptor amount; in this case a warning is emitted.
  If the ``txq_inline_mpw`` key is specified and the requested inline
  settings cannot be satisfied, then an error will be raised.

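
  For example (placeholder address, assumed value), limiting eMPW inlining
  to 128 bytes could look like::

    <PCI_BDF>,txq_mpw_en=1,txq_inline_mpw=128
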
- ``txqs_max_vec`` parameter [int]

  Enable vectorized Tx only when the number of Tx queues is less than or
  equal to this value. This parameter is deprecated and ignored, kept
  for compatibility reasons to not prevent the driver from probing.

- ``txq_mpw_hdr_dseg_en`` parameter [int]

  A nonzero value enables including two pointers in the first block of the Tx
  descriptor. This parameter is deprecated and ignored, kept for
  compatibility reasons.

- ``txq_max_inline_len`` parameter [int]

  Maximum size of a packet to be inlined. If the size of a packet is larger
  than the configured value, the packet is not inlined even though there is
  enough space remaining in the descriptor; instead, the packet is referenced
  by a pointer. This parameter is deprecated and converted directly to
  ``txq_inline_mpw``, providing full compatibility. Valid only if the eMPW
  feature is engaged.

- ``txq_mpw_en`` parameter [int]

  A nonzero value enables Enhanced Multi-Packet Write (eMPW) for ConnectX-5,
  ConnectX-6, ConnectX-6 Dx, ConnectX-6 Lx, BlueField and BlueField-2.
  eMPW allows the Tx burst function to pack multiple packets
  in a single descriptor session in order to save PCI bandwidth
  and improve performance at the cost of a slightly higher CPU usage.
  When ``txq_inline_mpw`` is set along with ``txq_mpw_en``,
  the Tx burst function copies the entire packet data into the Tx descriptor
  instead of referencing the packet by a pointer.

  The Enhanced Multi-Packet Write feature is enabled by default if the NIC
  supports it; it can be disabled by explicitly specifying the value 0 for
  the ``txq_mpw_en`` option. Also, if minimal data inlining is requested by a
  non-zero ``txq_inline_min`` option or reported by the NIC, the eMPW feature
  is disengaged.

- ``tx_db_nc`` parameter [int]

  The rdma core library can map the doorbell register in two ways, depending
  on the environment variable "MLX5_SHUT_UP_BF":

  - As regular cached memory (usually with the write combining attribute), if
    the variable is either missing or set to zero.
  - As non-cached memory, if the variable is present and set to a non-"0"
    value.

  The type of mapping may slightly affect the Tx performance; the optimal
  choice strongly depends on the host architecture and should be determined
  empirically.

  If ``tx_db_nc`` is set to zero, the doorbell is forced to be mapped to
  regular memory (with write combining), and the PMD performs an extra write
  memory barrier after writing to the doorbell. This might increase the CPU
  clocks needed per packet sent, but latency might be improved.

  If ``tx_db_nc`` is set to one, the doorbell is forced to be mapped to
  non-cached memory, and the PMD does not perform the extra write memory
  barrier after writing to the doorbell; on some architectures this might
  improve the performance.

  If ``tx_db_nc`` is set to two, the doorbell is forced to be mapped to
  regular memory, and the PMD uses heuristics to decide whether a write
  memory barrier should be performed. For bursts whose size is a multiple of
  the recommended one (64 packets), it is assumed the next burst is coming
  and there is no need to issue the extra memory barrier (it is supposed to
  be issued in the next coming burst, at least after the descriptor writing).
  This might increase latency (on some hosts until the next packets are
  transmitted) and should be used with care.

  If ``tx_db_nc`` is omitted or set to zero, the value of the preset (if any)
  environment variable "MLX5_SHUT_UP_BF" is used. If there is no
  "MLX5_SHUT_UP_BF", the default ``tx_db_nc`` value is zero for ARM64 hosts
  and one for others.

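
  As a sketch (placeholder address), forcing the heuristic mapping mode on a
  host where it proved beneficial could be requested with::

    <PCI_BDF>,tx_db_nc=2
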
- ``tx_pp`` parameter [int]

  If a nonzero value is specified, the driver creates all the necessary
  internal objects to provide accurate packet send scheduling on mbuf
  timestamps. A positive value specifies the scheduling granularity in
  nanoseconds; the packet send will be accurate up to the specified digits.
  The allowed range is from 500 to 1 million nanoseconds. A negative value
  specifies the modulo of the granularity and engages the special test mode
  to check the scheduling rate. By default (if ``tx_pp`` is not specified),
  send scheduling on timestamps is disabled.

- ``tx_skew`` parameter [int]

  The parameter adjusts the packet send scheduling on timestamps and
  represents the average delay between the beginning of the transmit
  descriptor processing by the hardware and the appearance of the actual
  packet data on the wire. The value should be provided in nanoseconds and is
  valid only if the ``tx_pp`` parameter is specified. The default value is
  zero.

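
  For instance (placeholder address, illustrative values), scheduling with
  500 ns granularity while compensating for an assumed 8 ns wire delay::

    <PCI_BDF>,tx_pp=500,tx_skew=8
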
- ``tx_vec_en`` parameter [int]

  A nonzero value enables the Tx vector on ConnectX-5, ConnectX-6,
  ConnectX-6 Dx, ConnectX-6 Lx, BlueField and BlueField-2 NICs
  if the number of global Tx queues on the port is less than
  ``txqs_max_vec``. The parameter is deprecated and ignored.

- ``rx_vec_en`` parameter [int]

  A nonzero value enables the Rx vector if the port is not configured in
  multi-segment mode; otherwise this parameter is ignored.

- ``vf_nl_en`` parameter [int]

  A nonzero value enables Netlink requests from the VF to add/remove MAC
  addresses and/or enable/disable promiscuous/all-multicast mode on the
  netdevice. Otherwise the relevant configuration must be run with Linux
  iproute2 tools. This is a prerequisite to receive this kind of traffic.

  Enabled by default; valid only on VF devices and ignored otherwise.

- ``l3_vxlan_en`` parameter [int]

  A nonzero value allows L3 VXLAN and VXLAN-GPE flow creation. To enable
  L3 VXLAN or VXLAN-GPE, users have to configure the firmware and enable this
  parameter. This is a prerequisite to receive this kind of traffic.

- ``dv_xmeta_en`` parameter [int]

  A nonzero value enables extensive flow metadata support if the device is
  capable and the driver supports it. This can enable extensive support of
  the ``MARK`` and ``META`` items of ``rte_flow``. The newly introduced
  ``SET_TAG`` and ``SET_META`` actions do not depend on ``dv_xmeta_en``.

  There are several possible configurations, depending on the parameter
  value:

  - 0, the default value, defines the legacy mode: the ``MARK`` and
    ``META`` related actions and items operate only within the NIC Tx and
    NIC Rx steering domains, and no ``MARK`` and ``META`` information crosses
    the domain boundaries. The ``MARK`` item is 24 bits wide, the ``META``
    item is 32 bits wide and match is supported on egress only.

  - 1, this engages the extensive metadata mode: the ``MARK`` and ``META``
    related actions and items operate within all supported steering domains,
    including FDB, and ``MARK`` and ``META`` information may cross the domain
    boundaries. The ``MARK`` item is 24 bits wide, while the ``META`` item
    width depends on the kernel and firmware configurations and might be 0,
    16 or 32 bits. Within the NIC Tx domain the ``META`` data width is
    32 bits for compatibility, while the actual width of data transferred to
    the FDB domain depends on the kernel configuration and may vary. The
    actual supported width can be retrieved at runtime by a series of
    rte_flow_validate() trials.

  - 2, this engages the extensive metadata mode: the ``MARK`` and ``META``
    related actions and items operate within all supported steering domains,
    including FDB, and ``MARK`` and ``META`` information may cross the domain
    boundaries. The ``META`` item is 32 bits wide, while the ``MARK`` item
    width depends on the kernel and firmware configurations and might be 0,
    16 or 24 bits. The actual supported width can be retrieved at runtime by
    a series of rte_flow_validate() trials.

  - 3, this engages the tunnel offload mode. In E-Switch configuration, this
    mode implicitly activates ``dv_xmeta_en=1``.

  +------+-----------+-----------+-------------+-------------+
  | Mode | ``MARK``  | ``META``  | ``META`` Tx | FDB/Through |
  +======+===========+===========+=============+=============+
  | 0    | 24 bits   | 32 bits   | 32 bits     | no          |
  +------+-----------+-----------+-------------+-------------+
  | 1    | 24 bits   | vary 0-32 | 32 bits     | yes         |
  +------+-----------+-----------+-------------+-------------+
  | 2    | vary 0-24 | 32 bits   | 32 bits     | yes         |
  +------+-----------+-----------+-------------+-------------+

  If there is no E-Switch configuration, the ``dv_xmeta_en`` parameter is
  ignored and the device is configured to operate in legacy mode (0).

  Disabled by default (set to 0).

  The Direct Verbs/Rules (engaged with ``dv_flow_en`` = 1) supports all
  of the extensive metadata features. The legacy Verbs supports the FLAG and
  MARK metadata actions over the NIC Rx steering domain only.

  Setting the META value to zero in a flow action means there is no item
  provided, and the receiving datapath will not report in the mbufs that
  metadata is present. Setting the MARK value to zero in a flow action means
  the zero FDIR ID value will be reported on packet receiving.

  For the MARK action, the last 16 values in the full range are reserved for
  internal PMD purposes (to emulate the FLAG action). The valid range for the
  MARK action values is 0-0xFFEF for the 16-bit mode and 0-0xFFFFEF
  for the 24-bit mode; flows with a MARK action value outside
  the specified range will be rejected.

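
  For example (placeholder address), enabling the extensive metadata mode
  with a 24-bit ``MARK`` could be requested with::

    <PCI_BDF>,dv_xmeta_en=1
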
- ``dv_flow_en`` parameter [int]

  A nonzero value enables the DV flow steering assuming it is supported
  by the driver (RDMA Core library version is rdma-core-24.0 or higher).

  Enabled by default if supported.

- ``dv_esw_en`` parameter [int]

  A nonzero value enables E-Switch using Direct Rules.

  Enabled by default if supported.

- ``lacp_by_user`` parameter [int]

  A nonzero value enables the control of LACP traffic by the user
  application. When a bond exists in the driver, by default it should be
  managed by the kernel and therefore LACP traffic should be steered to the
  kernel. If this devarg is set to 1, the user is allowed to manage the bond
  and LACP traffic is not steered to the kernel.

  Disabled by default (set to 0).

- ``mr_ext_memseg_en`` parameter [int]

  A nonzero value enables extending the memseg when registering DMA memory.
  If enabled, the number of entries in the MR (Memory Region) lookup table on
  the datapath is minimized and it benefits performance. On the other hand,
  it worsens memory utilization because the registered memory is pinned by
  the kernel driver. Even if a page in the extended chunk is freed, it does
  not become reusable until the entire memory is freed.

- ``representor`` parameter [list]

  This parameter can be used to instantiate DPDK Ethernet devices from
  existing port (PF, VF or SF) representors configured on the device.

  It is a standard parameter whose format is described in
  :ref:`ethernet_device_standard_device_arguments`.

  For instance, to probe VF port representors 0 through 2::

    <PCI_BDF>,representor=vf[0-2]

  To probe SF port representors 0 through 2::

    <PCI_BDF>,representor=sf[0-2]

  To probe VF port representors 0 through 2 on both PFs of a bonding
  device::

    <Primary_PCI_BDF>,representor=pf[0,1]vf[0-2]

- ``max_dump_files_num`` parameter [int]

  The maximum number of files per PMD entity that may be created for debug
  information. The files will be created in the /var/log directory or in the
  current directory.

  Set to 128 by default.

- ``lro_timeout_usec`` parameter [int]

  The maximum allowed duration of an LRO session, in microseconds.
  The PMD will set the nearest value supported by the HW, which is not bigger
  than the input ``lro_timeout_usec`` value.
  If this parameter is not specified, by default the PMD will set
  the smallest value supported by the HW.

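
  As an illustrative sketch (placeholder address, assumed value), capping LRO
  sessions at roughly 32 microseconds could be requested with::

    <PCI_BDF>,lro_timeout_usec=32
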
- ``hp_buf_log_sz`` parameter [int]

  The total data buffer size of a hairpin queue (logarithmic form), in bytes.
  The PMD will set the data buffer size to 2 ** ``hp_buf_log_sz``, both for
  Rx and Tx. The allowed range of the value is specified by the firmware, and
  initialization will fail if the value is out of range.
  The range of the value is from 11 to 19 right now, and the supported frame
  size of a single packet for hairpin is from 512B to 128KB. It might change
  if a different firmware release is being used. Using a small value could
  reduce memory consumption but will not work with a large frame. If the
  value is too large, the memory consumption will be high and some potential
  performance degradation will be introduced.
  By default, the PMD will set this value to 16, which means that 9KB jumbo
  frames will be supported.

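
  For example (placeholder address), halving the default buffer to
  2 ** 15 = 32KB per hairpin queue::

    <PCI_BDF>,hp_buf_log_sz=15
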
- ``reclaim_mem_mode`` parameter [int]

  Caching some resources on flow destroy helps make flow re-creation more
  efficient, while some systems may require that all the resources can be
  reclaimed after the flows are destroyed.
  The parameter ``reclaim_mem_mode`` provides an option for the user to
  configure whether the resource cache is needed or not.

  There are three options to choose from:

  - 0. The flow resources will be cached as usual. The cached resources are
    helpful for the flow insertion rate.

  - 1. Only the DPDK PMD level resources reclaim is enabled.

  - 2. Both the DPDK PMD level and the rdma-core low level will be configured
    as reclaimed mode.

  By default, the PMD will set this value to 0.

- ``sys_mem_en`` parameter [int]

  A non-zero value enables the PMD memory management, allocating memory
  from the system by default, without the explicit rte memory flag.

  By default, the PMD will set this value to 0.

- ``decap_en`` parameter [int]

  Some devices do not support FCS (frame checksum) scattering for
  tunnel-decapsulated packets.
  If set to 0, this option forces the FCS feature and rejects tunnel
  decapsulation in the flow engine for such devices.

  By default, the PMD will set this value to 1.

- ``allow_duplicate_pattern`` parameter [int]

  There are two options to choose from:

  - 0. Prevent insertion of rules with the same pattern items on a non-root
    table. In this case, only the first rule is inserted and the following
    rules are rejected and the error code EEXIST is returned.

  - 1. Allow insertion of rules with the same pattern items.
    In this case, all rules are inserted but only the first rule takes
    effect, and the next rule takes effect only after the previous rules are
    deleted.

  By default, the PMD will set this value to 1.

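
As a closing sketch (the PCI address and chosen values are placeholders),
several of the run-time parameters above can be combined in a single device
argument string::

   <PCI_BDF>,txq_inline_min=22,txq_mpw_en=1,decap_en=0
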
.. _mlx5_firmware_config:

Firmware configuration
~~~~~~~~~~~~~~~~~~~~~~

Firmware features can be configured as key/value pairs.

The command to set a value is::

   mlxconfig -d <device> set <key>=<value>

The command to query a value is::

   mlxconfig -d <device> query | grep <key>

The device name for the command ``mlxconfig`` can be either the PCI address,
or the mst device name found with::

   mst status

Below are some firmware configurations listed.

   value: 1=Infiniband 2=Ethernet 3=VPI(auto-sense)

- maximum number of SR-IOV virtual functions::

   NUM_OF_VFS=<max>

- enable DevX (required by Direct Rules and other features)::

   UCTX_EN=1

- aggressive CQE zipping::

   CQE_COMPRESSION=1

- L3 VXLAN and VXLAN-GPE destination UDP port::

   IP_OVER_VXLAN_EN=1
   IP_OVER_VXLAN_PORT=<udp dport>

- enable VXLAN-GPE tunnel flow matching::

   FLEX_PARSER_PROFILE_ENABLE=0
   or
   FLEX_PARSER_PROFILE_ENABLE=2

- enable IP-in-IP tunnel flow matching::

   FLEX_PARSER_PROFILE_ENABLE=0

- enable MPLS flow matching::

   FLEX_PARSER_PROFILE_ENABLE=1

- enable ICMP (code/type/identifier/sequence number) / ICMP6 (code/type)
  fields matching::

   FLEX_PARSER_PROFILE_ENABLE=2

- enable Geneve flow matching::

   FLEX_PARSER_PROFILE_ENABLE=0
   or
   FLEX_PARSER_PROFILE_ENABLE=1

- enable Geneve TLV option flow matching::

   FLEX_PARSER_PROFILE_ENABLE=0

- enable GTP flow matching::

   FLEX_PARSER_PROFILE_ENABLE=3

- enable eCPRI flow matching::

   FLEX_PARSER_PROFILE_ENABLE=4

This driver relies on external libraries and kernel drivers for resources
allocation and initialization. The following dependencies are not part of
DPDK and must be installed separately:

- **libibverbs**

  User space Verbs framework used by librte_net_mlx5. This library provides
  a generic interface between the kernel and low-level user space drivers
  such as libmlx5.

  It allows slow and privileged operations (context initialization, hardware
  resources allocation) to be managed by the kernel and fast operations to
  never leave user space.

- **libmlx5**

  Low-level user space driver library for Mellanox
  ConnectX-4/ConnectX-5/ConnectX-6/BlueField devices, it is automatically
  loaded by libibverbs.

  This library basically implements send/receive calls to the hardware
  queues.

- **Kernel modules**

  They provide the kernel-side Verbs API and low level device drivers that
  manage actual hardware initialization and resources sharing with user
  space processes.

  Unlike most other PMDs, these modules must remain loaded and bound to
  their devices:

  - mlx5_core: hardware driver managing Mellanox
    ConnectX-4/ConnectX-5/ConnectX-6/BlueField devices and related Ethernet
    kernel network devices.
  - mlx5_ib: InfiniBand device driver.
  - ib_uverbs: user space driver for Verbs (entry point for libibverbs).

- **Firmware update**

  Mellanox OFED/EN releases include firmware updates for
  ConnectX-4/ConnectX-5/ConnectX-6/BlueField adapters.

  Because each release provides new features, these updates must be applied
  to match the kernel modules and libraries they come with.

Both libraries are BSD and GPL licensed. Linux kernel modules are GPL
licensed.

Either the RDMA Core library with a recent enough Linux kernel release
(recommended) or Mellanox OFED/EN, which provides compatibility with older
releases.

RDMA Core with Linux Kernel
^^^^^^^^^^^^^^^^^^^^^^^^^^^

- Minimal kernel version : v4.14 or the most recent 4.14-rc (see `Linux installation documentation`_)
- Minimal rdma-core version: v15+ commit 0c5f5765213a ("Merge pull request #227 from yishaih/tm")
  (see `RDMA Core installation documentation`_)
- When building for i686 use:

  - rdma-core version 18.0 or above built with 32bit support.
  - Kernel version 4.14.41 or above.

- Starting with rdma-core v21, static libraries can be built::

   cd build
   CFLAGS=-fPIC cmake -DIN_PLACE=1 -DENABLE_STATIC=1 -GNinja ..
   ninja

.. _`Linux installation documentation`: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable.git/plain/Documentation/admin-guide/README.rst
.. _`RDMA Core installation documentation`: https://raw.githubusercontent.com/linux-rdma/rdma-core/master/README.md

- Mellanox OFED version: **4.5** and above /
  Mellanox EN version: **4.5** and above
- firmware version:

  - ConnectX-4: **12.21.1000** and above.
  - ConnectX-4 Lx: **14.21.1000** and above.
  - ConnectX-5: **16.21.1000** and above.
  - ConnectX-5 Ex: **16.21.1000** and above.
  - ConnectX-6: **20.27.0090** and above.
  - ConnectX-6 Dx: **22.27.0090** and above.
  - BlueField: **18.25.1010** and above.

While these libraries and kernel modules are available on OpenFabrics
Alliance's `website <https://www.openfabrics.org/>`__ and provided by package
managers on most distributions, this PMD requires Ethernet extensions that
may not be supported at the moment (this is a work in progress).

`Mellanox OFED
<http://www.mellanox.com/page/products_dyn?product_family=26&mtag=linux>`__ and
`Mellanox EN
<http://www.mellanox.com/page/products_dyn?product_family=27&mtag=linux>`__
include the necessary support and should be used in the meantime. For DPDK,
only libibverbs, libmlx5, mlnx-ofed-kernel packages and firmware updates are
required from that distribution.

Several versions of Mellanox OFED/EN are available. Installing the version
this DPDK release was developed and tested against is strongly
recommended. Please check the `linux prerequisites`_.

Windows Prerequisites
---------------------

This driver relies on external libraries and kernel drivers for resources
allocation and initialization. The dependencies in the following sub-sections
are not part of DPDK, and must be installed separately.

Compilation Prerequisites
~~~~~~~~~~~~~~~~~~~~~~~~~

DevX SDK installation
^^^^^^^^^^^^^^^^^^^^^

The DevX SDK must be installed on the machine building the Windows PMD.
Additional information can be found at
`How to Integrate Windows DevX in Your Development Environment
<https://docs.mellanox.com/display/winof2v250/RShim+Drivers+and+Usage#RShimDriversandUsage-DevXInterface>`__.

Runtime Prerequisites
~~~~~~~~~~~~~~~~~~~~~

WinOF2 version 2.60 or higher must be installed on the machine.

The driver can be downloaded from the `following site
<https://www.mellanox.com/products/adapter-software/ethernet/windows/winof-2>`__.

DevX for Windows must be enabled in the Windows registry.
The keys ``DevxEnabled`` and ``DevxFsRules`` must be set.
Additional information can be found in the WinOF2 user manual.

The following Mellanox device families are supported by the same mlx5
driver:

Below are detailed device names:

* Mellanox\ |reg| ConnectX\ |reg|-4 10G MCX4111A-XCAT (1x10G)
* Mellanox\ |reg| ConnectX\ |reg|-4 10G MCX412A-XCAT (2x10G)
* Mellanox\ |reg| ConnectX\ |reg|-4 25G MCX4111A-ACAT (1x25G)
* Mellanox\ |reg| ConnectX\ |reg|-4 25G MCX412A-ACAT (2x25G)
* Mellanox\ |reg| ConnectX\ |reg|-4 40G MCX413A-BCAT (1x40G)
* Mellanox\ |reg| ConnectX\ |reg|-4 40G MCX4131A-BCAT (1x40G)
* Mellanox\ |reg| ConnectX\ |reg|-4 40G MCX415A-BCAT (1x40G)
* Mellanox\ |reg| ConnectX\ |reg|-4 50G MCX413A-GCAT (1x50G)
* Mellanox\ |reg| ConnectX\ |reg|-4 50G MCX4131A-GCAT (1x50G)
* Mellanox\ |reg| ConnectX\ |reg|-4 50G MCX414A-BCAT (2x50G)
* Mellanox\ |reg| ConnectX\ |reg|-4 50G MCX415A-GCAT (1x50G)
* Mellanox\ |reg| ConnectX\ |reg|-4 50G MCX416A-BCAT (2x50G)
* Mellanox\ |reg| ConnectX\ |reg|-4 50G MCX416A-GCAT (2x50G)
* Mellanox\ |reg| ConnectX\ |reg|-4 100G MCX415A-CCAT (1x100G)
* Mellanox\ |reg| ConnectX\ |reg|-4 100G MCX416A-CCAT (2x100G)
* Mellanox\ |reg| ConnectX\ |reg|-4 Lx 10G MCX4111A-XCAT (1x10G)
* Mellanox\ |reg| ConnectX\ |reg|-4 Lx 10G MCX4121A-XCAT (2x10G)
* Mellanox\ |reg| ConnectX\ |reg|-4 Lx 25G MCX4111A-ACAT (1x25G)
* Mellanox\ |reg| ConnectX\ |reg|-4 Lx 25G MCX4121A-ACAT (2x25G)
* Mellanox\ |reg| ConnectX\ |reg|-4 Lx 40G MCX4131A-BCAT (1x40G)
* Mellanox\ |reg| ConnectX\ |reg|-5 100G MCX556A-ECAT (2x100G)
* Mellanox\ |reg| ConnectX\ |reg|-5 Ex EN 100G MCX516A-CDAT (2x100G)
* Mellanox\ |reg| ConnectX\ |reg|-6 200G MCX654106A-HCAT (2x200G)
* Mellanox\ |reg| ConnectX\ |reg|-6 Dx EN 100G MCX623106AN-CDAT (2x100G)
* Mellanox\ |reg| ConnectX\ |reg|-6 Dx EN 200G MCX623105AN-VDAT (1x200G)
* Mellanox\ |reg| ConnectX\ |reg|-6 Lx EN 25G MCX631102AN-ADAT (2x25G)

Quick Start Guide on OFED/EN
----------------------------

1. Download latest Mellanox OFED/EN. For more info check the `linux prerequisites`_.

2. Install the required libraries and kernel modules either by installing
   only the required set, or by installing the entire Mellanox OFED/EN::

     ./mlnxofedinstall --upstream-libs --dpdk

3. Verify the firmware is the correct one::

     ibv_devinfo

4. Verify all ports links are set to Ethernet::

     mlxconfig -d <mst device> query | grep LINK_TYPE

   Link types may have to be configured to Ethernet::

     mlxconfig -d <mst device> set LINK_TYPE_P1/2=1/2/3

   * LINK_TYPE_P1=<1|2|3> , 1=Infiniband 2=Ethernet 3=VPI(auto-sense)

   For hypervisors, verify SR-IOV is enabled on the NIC::

     mlxconfig -d <mst device> query | grep SRIOV_EN

   If needed, configure SR-IOV::

     mlxconfig -d <mst device> set SRIOV_EN=1 NUM_OF_VFS=16
     mlxfwreset -d <mst device> reset

5. Restart the driver::

     /etc/init.d/openibd restart

   or::

     service openibd restart

   If the link type was changed, the firmware must be reset as well::

     mlxfwreset -d <mst device> reset

   For hypervisors, after the reset, write the sysfs number of virtual
   functions needed for the PF.

   To dynamically instantiate a given number of virtual functions (VFs)::

     echo [num_vfs] > /sys/class/infiniband/mlx5_0/device/sriov_numvfs

6. Install DPDK and you are ready to go.
   See :doc:`compilation instructions <../linux_gsg/build_dpdk>`.

Enable switchdev mode
---------------------

Switchdev mode is a mode in E-Switch that binds between a representor and a
VF or SF. A representor is a port in DPDK that is connected to a VF or SF in
such a way that, assuming there are no offload flows, each packet that is
sent from the VF or SF will be received by the corresponding representor,
while each packet that is sent to a representor will be received by the VF
or SF.
This is very useful in case of SR-IOV mode, where the first packet that is
sent by the VF or SF will be received by the DPDK application, which will
decide if this flow should be offloaded to the E-Switch. After offloading
the flow, packets matching the flow will not be received any more by the
DPDK application.

1. Enable SR-IOV mode::

     mlxconfig -d <mst device> set SRIOV_EN=true

2. Configure the max number of VFs::

     mlxconfig -d <mst device> set NUM_OF_VFS=<num of vfs>

3. Reset the firmware::

     mlxfwreset -d <mst device> reset

4. Configure the actual number of VFs::

     echo <num of vfs> > /sys/class/net/<net device>/device/sriov_numvfs

5. Unbind the device (can be rebound after the switchdev mode)::

     echo -n "<device pci address>" > /sys/bus/pci/drivers/mlx5_core/unbind

6. Enable switchdev mode::

     echo switchdev > /sys/class/net/<net device>/compat/devlink/mode

Sub-Function support
--------------------

Sub-Function is a portion of the PCI device, a SF netdev has its own
dedicated queues (txq, rxq).
A SF shares PCI-level resources with other SFs and/or with its parent PCI
function.

Requirement::

   OFED version >= 5.4-0.3.3.0

1. Configure SF feature::

   # Run mlxconfig on both PFs on host and ECPFs on BlueField.
   mlxconfig -d <mst device> set PER_PF_NUM_SF=1 PF_TOTAL_SF=252 PF_SF_BAR_SIZE=12

2. Enable switchdev mode::

   mlxdevm dev eswitch set pci/<DBDF> mode switchdev

3. Add SF port::

   mlxdevm port add pci/<DBDF> flavour pcisf pfnum 0 sfnum <sfnum>

   Get SFID from output: pci/<DBDF>/<SFID>

4. Modify MAC address::

   mlxdevm port function set pci/<DBDF>/<SFID> hw_addr <MAC>

5. Activate SF port::

   mlxdevm port function set pci/<DBDF>/<ID> state active

6. Devargs to probe SF device::

   auxiliary:mlx5_core.sf.<num>,dv_flow_en=1

Sub-Function representor support
--------------------------------

A SF netdev supports E-Switch representation offload
similar to PF and VF representors.
Use <sfnum> to probe SF representor::

   testpmd> port attach <PCI_BDF>,representor=sf<sfnum>,dv_flow_en=1

Performance tuning
------------------

1. Configure aggressive CQE Zipping for maximum performance::

     mlxconfig -d <mst device> s CQE_COMPRESSION=1

   To set it back to the default CQE Zipping mode use::

     mlxconfig -d <mst device> s CQE_COMPRESSION=0

2. In case of virtualization:

   - Make sure that the hypervisor kernel is 3.16 or newer.
   - Configure boot with ``iommu=pt``.
   - Use 1G huge pages.
   - Make sure to allocate a VM on huge pages.
   - Make sure to set CPU pinning.

3. Use the CPU near the local NUMA node to which the PCIe adapter is
   connected, for better performance. For VMs, verify that the right CPU
   and NUMA node are pinned according to the above. Run::

     lstopo-no-graphics --merge

   to identify the NUMA node to which the PCIe adapter is connected.

4. If more than one adapter is used, and root complex capabilities allow
   putting both adapters on the same NUMA node without PCI bandwidth
   degradation, it is recommended to locate both adapters on the same NUMA
   node. This is in order to forward packets from one to the other without
   a NUMA performance penalty.

1563 5. Disable pause frames::
1565 ethtool -A <netdev> rx off tx off
1567 6. Verify IO non-posted prefetch is disabled by default. This can be checked
1568 via the BIOS configuration. Please contact you server provider for more
1569 information about the settings.
.. note::

        On some machines, depending on the machine integrator, it is beneficial
        to set the PCI max read request parameter to 1K. This can be
        done in the following way:

        To query the read request size use::

                setpci -s <NIC PCI address> 68.w

        If the output is different than 3XXX, set it by::

                setpci -s <NIC PCI address> 68.w=3XXX

        The XXX can be different on different systems. Make sure to configure
        according to the setpci output.
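The ``68.w`` word is the PCIe Device Control register, whose bits 14:12 encode
the maximum read request size as ``128 << value`` bytes. The small helper below
decodes a register value so the ``setpci`` output can be sanity-checked; it is
an illustrative sketch, not part of any DPDK or Mellanox tool.

```shell
# Decode the PCIe max read request size (bytes) from the Device Control
# register word printed by "setpci -s <addr> 68.w" (hex, no 0x prefix).
mrrs_bytes() {
    reg=$((0x$1))                 # parse the hex word from setpci
    code=$(( (reg >> 12) & 7 ))   # bits 14:12 select the size
    echo $(( 128 << code ))       # code 3 -> 1024 bytes (the "3XXX" case)
}

# Example: mrrs_bytes 3936   -> 1024
```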
7. To minimize overhead of searching Memory Regions:

   - ``--socket-mem`` is recommended to reserve a predictable amount of memory
     per socket.
   - Configure a per-lcore cache when creating Mempools for packet buffers.
   - Refrain from dynamically allocating/freeing memory at run-time.
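As an illustration of the items above, a testpmd invocation could reserve
memory per socket up front and enlarge the per-lcore mbuf cache. The PCI
address, core list, and sizes below are examples only and must be adapted to
the target system.

```shell
# Reserve 2 GB on socket 0 at startup (predictable Memory Region layout) and
# use a 512-entry per-lcore mbuf cache with a fixed mbuf pool size.
dpdk-testpmd -l 8-15 -n 4 --socket-mem 2048,0 -a 05:00.0 \
    -- --mbcache=512 --total-num-mbufs=262144 -i
```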
Rx burst functions
------------------

There are multiple Rx burst functions with different advantages and limitations.

.. table:: Rx burst functions

   +-------------------+------------------------+---------+-----------------+------+-------+
   || Function Name    || Enabler               || Scatter|| Error Recovery || CQE || Large|
   |                   |                        |         |                 || comp|| MTU  |
   +===================+========================+=========+=================+======+=======+
   | rx_burst          | rx_vec_en=0            | Yes     | Yes             | Yes  | Yes   |
   +-------------------+------------------------+---------+-----------------+------+-------+
   | rx_burst_vec      | rx_vec_en=1 (default)  | No      | if CQE comp off | Yes  | No    |
   +-------------------+------------------------+---------+-----------------+------+-------+
   | rx_burst_mprq     || mprq_en=1             | No      | Yes             | Yes  | Yes   |
   |                   || RxQs >= rxqs_min_mprq |         |                 |      |       |
   +-------------------+------------------------+---------+-----------------+------+-------+
   | rx_burst_mprq_vec || rx_vec_en=1 (default) | No      | if CQE comp off | Yes  | Yes   |
   |                   || mprq_en=1             |         |                 |      |       |
   |                   || RxQs >= rxqs_min_mprq |         |                 |      |       |
   +-------------------+------------------------+---------+-----------------+------+-------+
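For example, to steer Rx towards ``rx_burst_mprq``, the MPRQ device arguments
can be passed at probe time. The PCI address and queue counts below are
illustrative; ``rxqs_min_mprq`` is lowered so that MPRQ engages even with few
queues.

```shell
# Enable Multi-Packet Rx queues; rx_burst_mprq is selected once the number of
# configured Rx queues reaches rxqs_min_mprq (lowered to 1 for illustration).
dpdk-testpmd -l 8-15 -n 4 -a 05:00.0,mprq_en=1,rxqs_min_mprq=1 -- --rxq=2 --txq=2 -i
```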
.. _mlx5_offloads_support:

Supported hardware offloads
---------------------------

.. table:: Minimal SW/HW versions for queue offloads

   ============== ===== ===== ========= ===== ========== =============
   Offload        DPDK  Linux rdma-core OFED  firmware   hardware
   ============== ===== ===== ========= ===== ========== =============
   common base    17.11 4.14  16        4.2-1 12.21.1000 ConnectX-4
   checksums      17.11 4.14  16        4.2-1 12.21.1000 ConnectX-4
   Rx timestamp   17.11 4.14  16        4.2-1 12.21.1000 ConnectX-4
   TSO            17.11 4.14  16        4.2-1 12.21.1000 ConnectX-4
   LRO            19.08 N/A   N/A       4.6-4 16.25.6406 ConnectX-5
   Tx scheduling  20.08 N/A   N/A       5.1-2 22.28.2006 ConnectX-6 Dx
   Buffer Split   20.11 N/A   N/A       5.1-2 16.28.2006 ConnectX-5
   ============== ===== ===== ========= ===== ========== =============
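To check a deployed NIC against the table above, the running firmware version
can be queried through the kernel netdev with ethtool. The interface name is an
example only.

```shell
# Print driver and firmware versions for the kernel netdev backing the port;
# compare "firmware-version" with the minimal firmware column above.
ethtool -i eth2 | grep -E '^(driver|firmware-version):'
```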
.. table:: Minimal SW/HW versions for rte_flow offloads

   +-----------------------+-----------------+-----------------+
   | Offload               | with E-Switch   | with NIC        |
   +=======================+=================+=================+
   | Count                 | | DPDK 19.05    | | DPDK 19.02    |
   |                       | | OFED 4.6      | | OFED 4.6      |
   |                       | | rdma-core 24  | | rdma-core 23  |
   |                       | | ConnectX-5    | | ConnectX-5    |
   +-----------------------+-----------------+-----------------+
   | Drop                  | | DPDK 19.05    | | DPDK 18.11    |
   |                       | | OFED 4.6      | | OFED 4.5      |
   |                       | | rdma-core 24  | | rdma-core 23  |
   |                       | | ConnectX-5    | | ConnectX-4    |
   +-----------------------+-----------------+-----------------+
   | Queue / RSS           | |               | | DPDK 18.11    |
   |                       | | N/A           | | OFED 4.5      |
   |                       | |               | | rdma-core 23  |
   |                       | |               | | ConnectX-4    |
   +-----------------------+-----------------+-----------------+
   | Shared action         | |               | |               |
   |                       | | :numref:`sact`| | :numref:`sact`|
   +-----------------------+-----------------+-----------------+
   | | VLAN                | | DPDK 19.11    | | DPDK 19.11    |
   | | (of_pop_vlan /      | | OFED 4.7-1    | | OFED 4.7-1    |
   | | of_push_vlan /      | | ConnectX-5    | | ConnectX-5    |
   | | of_set_vlan_pcp /   | |               | |               |
   | | of_set_vlan_vid)    | |               | |               |
   +-----------------------+-----------------+-----------------+
   | | VLAN                | | DPDK 21.05    | |               |
   | | ingress and /       | | OFED 5.3      | | N/A           |
   | | of_push_vlan /      | | ConnectX-6 Dx | |               |
   +-----------------------+-----------------+-----------------+
   | | VLAN                | | DPDK 21.05    | |               |
   | | egress and /        | | OFED 5.3      | | N/A           |
   | | of_pop_vlan /       | | ConnectX-6 Dx | |               |
   +-----------------------+-----------------+-----------------+
   | Encapsulation         | | DPDK 19.05    | | DPDK 19.02    |
   | (VXLAN / NVGRE / RAW) | | OFED 4.7-1    | | OFED 4.6      |
   |                       | | rdma-core 24  | | rdma-core 23  |
   |                       | | ConnectX-5    | | ConnectX-5    |
   +-----------------------+-----------------+-----------------+
   | Encapsulation         | | DPDK 19.11    | | DPDK 19.11    |
   | GENEVE                | | OFED 4.7-3    | | OFED 4.7-3    |
   |                       | | rdma-core 27  | | rdma-core 27  |
   |                       | | ConnectX-5    | | ConnectX-5    |
   +-----------------------+-----------------+-----------------+
   | Tunnel Offload        | | DPDK 20.11    | | DPDK 20.11    |
   |                       | | OFED 5.1-2    | | OFED 5.1-2    |
   |                       | | rdma-core 32  | | N/A           |
   |                       | | ConnectX-5    | | ConnectX-5    |
   +-----------------------+-----------------+-----------------+
   | | Header rewrite      | | DPDK 19.05    | | DPDK 19.02    |
   | | (set_ipv4_src /     | | OFED 4.7-1    | | OFED 4.7-1    |
   | | set_ipv4_dst /      | | rdma-core 24  | | rdma-core 24  |
   | | set_ipv6_src /      | | ConnectX-5    | | ConnectX-5    |
   | | set_ipv6_dst /      | |               | |               |
   | | set_tp_src /        | |               | |               |
   | | set_tp_dst /        | |               | |               |
   | | dec_ttl /           | |               | |               |
   | | set_ttl /           | |               | |               |
   | | set_mac_src /       | |               | |               |
   | | set_mac_dst)        | |               | |               |
   +-----------------------+-----------------+-----------------+
   | | Header rewrite      | | DPDK 20.02    | | DPDK 20.02    |
   | | (set_dscp)          | | OFED 5.0      | | OFED 5.0      |
   | |                     | | rdma-core 24  | | rdma-core 24  |
   | |                     | | ConnectX-5    | | ConnectX-5    |
   +-----------------------+-----------------+-----------------+
   | Jump                  | | DPDK 19.05    | | DPDK 19.02    |
   |                       | | OFED 4.7-1    | | OFED 4.7-1    |
   |                       | | rdma-core 24  | | N/A           |
   |                       | | ConnectX-5    | | ConnectX-5    |
   +-----------------------+-----------------+-----------------+
   | Mark / Flag           | | DPDK 19.05    | | DPDK 18.11    |
   |                       | | OFED 4.6      | | OFED 4.5      |
   |                       | | rdma-core 24  | | rdma-core 23  |
   |                       | | ConnectX-5    | | ConnectX-4    |
   +-----------------------+-----------------+-----------------+
   | Meta data             | | DPDK 19.11    | | DPDK 19.11    |
   |                       | | OFED 4.7-3    | | OFED 4.7-3    |
   |                       | | rdma-core 26  | | rdma-core 26  |
   |                       | | ConnectX-5    | | ConnectX-5    |
   +-----------------------+-----------------+-----------------+
   | Port ID               | | DPDK 19.05    | | N/A           |
   |                       | | OFED 4.7-1    | | N/A           |
   |                       | | rdma-core 24  | | N/A           |
   |                       | | ConnectX-5    | | N/A           |
   +-----------------------+-----------------+-----------------+
   | Hairpin               | |               | | DPDK 19.11    |
   |                       | | N/A           | | OFED 4.7-3    |
   |                       | |               | | rdma-core 26  |
   |                       | |               | | ConnectX-5    |
   +-----------------------+-----------------+-----------------+
   | 2-port Hairpin        | |               | | DPDK 20.11    |
   |                       | | N/A           | | OFED 5.1-2    |
   |                       | |               | | ConnectX-5    |
   +-----------------------+-----------------+-----------------+
   | Metering              | | DPDK 19.11    | | DPDK 19.11    |
   |                       | | OFED 4.7-3    | | OFED 4.7-3    |
   |                       | | rdma-core 26  | | rdma-core 26  |
   |                       | | ConnectX-5    | | ConnectX-5    |
   +-----------------------+-----------------+-----------------+
   | Sampling              | | DPDK 20.11    | | DPDK 20.11    |
   |                       | | OFED 5.1-2    | | OFED 5.1-2    |
   |                       | | rdma-core 32  | | N/A           |
   |                       | | ConnectX-5    | | ConnectX-5    |
   +-----------------------+-----------------+-----------------+
   | Encapsulation         | | DPDK 21.02    | | DPDK 21.02    |
   | GTP PSC               | | OFED 5.2      | | OFED 5.2      |
   |                       | | rdma-core 35  | | rdma-core 35  |
   |                       | | ConnectX-6 Dx | | ConnectX-6 Dx |
   +-----------------------+-----------------+-----------------+
   | Encapsulation         | | DPDK 21.02    | | DPDK 21.02    |
   | GENEVE TLV option     | | OFED 5.2      | | OFED 5.2      |
   |                       | | rdma-core 34  | | rdma-core 34  |
   |                       | | ConnectX-6 Dx | | ConnectX-6 Dx |
   +-----------------------+-----------------+-----------------+
   | Modify Field          | | DPDK 21.02    | | DPDK 21.02    |
   |                       | | OFED 5.2      | | OFED 5.2      |
   |                       | | rdma-core 35  | | rdma-core 35  |
   |                       | | ConnectX-5    | | ConnectX-5    |
   +-----------------------+-----------------+-----------------+
   | Connection tracking   | |               | | DPDK 21.05    |
   |                       | | N/A           | | OFED 5.3      |
   |                       | |               | | rdma-core 35  |
   |                       | |               | | ConnectX-6 Dx |
   +-----------------------+-----------------+-----------------+
.. table:: Minimal SW/HW versions for shared action offload
   :name: sact

   +-----------------------+-----------------+-----------------+
   | Shared Action         | with E-Switch   | with NIC        |
   +=======================+=================+=================+
   | RSS                   | |               | | DPDK 20.11    |
   |                       | | N/A           | | OFED 5.2      |
   |                       | |               | | rdma-core 33  |
   |                       | |               | | ConnectX-5    |
   +-----------------------+-----------------+-----------------+
   | Age                   | | DPDK 20.11    | | DPDK 20.11    |
   |                       | | OFED 5.2      | | OFED 5.2      |
   |                       | | rdma-core 32  | | rdma-core 32  |
   |                       | | ConnectX-6 Dx | | ConnectX-6 Dx |
   +-----------------------+-----------------+-----------------+
   | Count                 | | DPDK 21.05    | | DPDK 21.05    |
   |                       | | OFED 4.6      | | OFED 4.6      |
   |                       | | rdma-core 24  | | rdma-core 23  |
   |                       | | ConnectX-5    | | ConnectX-5    |
   +-----------------------+-----------------+-----------------+
Notes for metadata
------------------

MARK and META items are interrelated with the datapath - they might move from/to
the applications in mbuf fields. Hence, a zero value for these items has a
special meaning - "no metadata are provided" - while non-zero values are
treated by applications and the PMD as valid ones.

Moreover, in the flow engine domain the value zero is acceptable to match and
set, so zero values should be allowed as rte_flow parameters for the
META and MARK items and actions. At the same time, a zero mask has no meaning
and should be rejected at the validation stage.
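For instance, the testpmd rule below tags matching packets with a non-zero MARK
value that the application can read back from the mbuf; the IDs and port number
are arbitrary examples, and a zero mask in a MARK item would be rejected at
validation as described above.

```shell
# Tag matching packets with a non-zero mark (zero would mean "no metadata"),
# then enable verbose mode so received packets show the matched mark ID.
testpmd> flow create 0 ingress pattern eth / end actions mark id 0x1234 / queue index 0 / end
testpmd> set verbose 1
```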
Notes for rte_flow
------------------

Flows are not cached in the driver.
When stopping a device port, all the flows created on this port from the
application will be flushed automatically in the background.
After stopping the device port, all flows on this port become invalid and
are no longer represented in the system.
All references to these flows held by the application should be discarded
directly but neither destroyed nor flushed.

The application should re-create the flows as required after the port restart.
Notes for testpmd
-----------------

Compared to librte_net_mlx4 that implements a single RSS configuration per
port, librte_net_mlx5 supports per-protocol RSS configuration.

Since ``testpmd`` defaults to IP RSS mode and there is currently no
command-line parameter to enable additional protocols (UDP and TCP as well
as IP), the following commands must be entered from its CLI to get the same
behavior as librte_net_mlx4::

   > port stop all
   > port config all rss all
   > port start all
Usage example
-------------

This section demonstrates how to launch **testpmd** with Mellanox
ConnectX-4/ConnectX-5/ConnectX-6/BlueField devices managed by librte_net_mlx5.

#. Load the kernel modules::

      modprobe -a ib_uverbs mlx5_core mlx5_ib

   Alternatively if MLNX_OFED/MLNX_EN is fully installed, the following script
   can be run::

      /etc/init.d/openibd restart

   .. note::

      User space I/O kernel modules (uio and igb_uio) are not used and do
      not have to be loaded.

#. Make sure Ethernet interfaces are in working order and linked to kernel
   verbs. Related sysfs entries should be present::

      ls -d /sys/class/net/*/device/infiniband_verbs/uverbs* | cut -d / -f 5

#. Optionally, retrieve their PCI bus addresses to be used with the allow list::

      {
          for intf in eth2 eth3 eth4 eth5;
          do
              (cd "/sys/class/net/${intf}/device/" && pwd -P);
          done;
      } |
      sed -n 's,.*/\(.*\),-a \1,p'
#. Request huge pages::

      dpdk-hugepages.py --setup 2G

#. Start testpmd with basic parameters::

      dpdk-testpmd -l 8-15 -n 4 -a 05:00.0 -a 05:00.1 -a 06:00.0 -a 06:00.1 -- --rxq=2 --txq=2 -i

   Example output::

      EAL: PCI device 0000:05:00.0 on NUMA socket 0
      EAL: probe driver: 15b3:1013 librte_net_mlx5
      PMD: librte_net_mlx5: PCI information matches, using device "mlx5_0" (VF: false)
      PMD: librte_net_mlx5: 1 port(s) detected
      PMD: librte_net_mlx5: port 1 MAC address is e4:1d:2d:e7:0c:fe
      EAL: PCI device 0000:05:00.1 on NUMA socket 0
      EAL: probe driver: 15b3:1013 librte_net_mlx5
      PMD: librte_net_mlx5: PCI information matches, using device "mlx5_1" (VF: false)
      PMD: librte_net_mlx5: 1 port(s) detected
      PMD: librte_net_mlx5: port 1 MAC address is e4:1d:2d:e7:0c:ff
      EAL: PCI device 0000:06:00.0 on NUMA socket 0
      EAL: probe driver: 15b3:1013 librte_net_mlx5
      PMD: librte_net_mlx5: PCI information matches, using device "mlx5_2" (VF: false)
      PMD: librte_net_mlx5: 1 port(s) detected
      PMD: librte_net_mlx5: port 1 MAC address is e4:1d:2d:e7:0c:fa
      EAL: PCI device 0000:06:00.1 on NUMA socket 0
      EAL: probe driver: 15b3:1013 librte_net_mlx5
      PMD: librte_net_mlx5: PCI information matches, using device "mlx5_3" (VF: false)
      PMD: librte_net_mlx5: 1 port(s) detected
      PMD: librte_net_mlx5: port 1 MAC address is e4:1d:2d:e7:0c:fb
      Interactive-mode selected
      Configuring Port 0 (socket 0)
      PMD: librte_net_mlx5: 0x8cba80: TX queues number update: 0 -> 2
      PMD: librte_net_mlx5: 0x8cba80: RX queues number update: 0 -> 2
      Port 0: E4:1D:2D:E7:0C:FE
      Configuring Port 1 (socket 0)
      PMD: librte_net_mlx5: 0x8ccac8: TX queues number update: 0 -> 2
      PMD: librte_net_mlx5: 0x8ccac8: RX queues number update: 0 -> 2
      Port 1: E4:1D:2D:E7:0C:FF
      Configuring Port 2 (socket 0)
      PMD: librte_net_mlx5: 0x8cdb10: TX queues number update: 0 -> 2
      PMD: librte_net_mlx5: 0x8cdb10: RX queues number update: 0 -> 2
      Port 2: E4:1D:2D:E7:0C:FA
      Configuring Port 3 (socket 0)
      PMD: librte_net_mlx5: 0x8ceb58: TX queues number update: 0 -> 2
      PMD: librte_net_mlx5: 0x8ceb58: RX queues number update: 0 -> 2
      Port 3: E4:1D:2D:E7:0C:FB
      Checking link statuses...
      Port 0 Link Up - speed 40000 Mbps - full-duplex
      Port 1 Link Up - speed 40000 Mbps - full-duplex
      Port 2 Link Up - speed 10000 Mbps - full-duplex
      Port 3 Link Up - speed 10000 Mbps - full-duplex
How to dump flows
-----------------

This section demonstrates how to dump flows. Currently, it is possible to dump
all flows with the assistance of external tools.

#. There are two ways to get the flow raw file:

   - Using testpmd CLI:

     .. code-block:: console

        To dump all flows:
        testpmd> flow dump <port> all <output_file>
        and dump one flow:
        testpmd> flow dump <port> rule <rule_id> <output_file>

   - Calling the rte_flow_dev_dump API:

     .. code-block:: console

        rte_flow_dev_dump(port, flow, file, NULL);
#. Dump human-readable flows from the raw file:

   Get the flow parsing tool from: https://github.com/Mellanox/mlx_steering_dump

   .. code-block:: console

      mlx_steering_dump.py -f <output_file> -flowptr <flow_ptr>
How to share a meter between ports in the same switch domain
------------------------------------------------------------

This section demonstrates how to use a shared meter. A meter M can be created
on port X and shared with a port Y in the same switch domain as follows:

.. code-block:: console

   flow create X ingress transfer pattern eth / port_id id is Y / end actions meter mtr_id M / end
How to use meter hierarchy
--------------------------

This section demonstrates how to create and use a meter hierarchy.
A termination meter M can be the policy green action of another termination meter N.
The two meters are chained into a hierarchy; using meter N in a flow will apply
both meters to that flow.

.. code-block:: console

   add port meter policy 0 1 g_actions queue index 0 / end y_actions end r_actions drop / end
   create port meter 0 M 1 1 yes 0xffff 1 0
   add port meter policy 0 2 g_actions meter mtr_id M / end y_actions end r_actions drop / end
   create port meter 0 N 2 2 yes 0xffff 1 0
   flow create 0 ingress group 1 pattern eth / end actions meter mtr_id N / end