Bruce Richardson [Thu, 23 Jun 2022 13:49:33 +0000 (14:49 +0100)]
dma/idxd: fix non-AVX builds with old compilers
When building without AVX2 support using an older compiler e.g. gcc 4.8
on Centos/RHEL 7, we get build errors due to the use of AVX2 intrinsics.
This is because the compiler does not support
"__attribute__((target(AVX2)))" function attribute. Disable build of
this driver such edge cases.
Generic builds using recent compilers, and all builds with a minimum
baseline of AVX2 are unaffected by this change.
Fixes: aa802b10237c ("dma/idxd: fix AVX2 in non-datapath functions") Cc: stable@dpdk.org Signed-off-by: Bruce Richardson <bruce.richardson@intel.com> Tested-by: Yu Jiang <yux.jiang@intel.com>
Bruce Richardson [Thu, 23 Jun 2022 13:49:32 +0000 (14:49 +0100)]
raw/ioat: fix build when ioat dmadev enabled
The build of the raw/ioat driver only occurs when the equivalent dmadev
drivers are disabled. Complications occur when the ioat dmadev is being
built but not the idxd. In this case, only the idxd part of raw/ioat
gets built, but the definition of the logtype is in the ioat part,
causing build errors.
.../raw_ioat_idxd_bus.c.o: In function `idxd_vdev_mmap_wq':
idxd_bus.c:(.text+0x116): undefined reference to `ioat_pmd_logtype'
Fix this by moving the logtype definition to the common C file, and
renaming it to avoid conflicts with a similarly named value in the
dma/ioat driver.
Bruce Richardson [Thu, 23 Jun 2022 13:49:31 +0000 (14:49 +0100)]
raw/ioat: fix build missing errno include
The inline functions in rte_idxd_rawdev_fns.h make use of rte_errno, but
the header with its definition is not included by that file leading to
build errors.
Fixes: f82c87eb14a4 ("raw/ioat: move idxd functions to separate file") Cc: stable@dpdk.org Signed-off-by: Bruce Richardson <bruce.richardson@intel.com>
Make sure all functions which use the convention that XXX_free(NULL)
is a nop are all documented.
The wording is chosen to match the documentation of free(3).
"If ptr is NULL, no operation is performed."
Signed-off-by: Stephen Hemminger <stephen@networkplumber.org> Acked-by: Chengwen Feng <fengchengwen@huawei.com>
[David: squashed with other series updates, unified wording]
Yunjian Wang [Fri, 24 Dec 2021 03:06:19 +0000 (11:06 +0800)]
net/mlx5: fix stack buffer overflow in drop action
The mlx5_drop_action_create function use mlx5_malloc for allocating
'hrxq', but don't allocate for 'rss_key'. This is wrong and it can
cause buffer overflow.
Detected with address sanitizer:
0 (/usr/lib64/libasan.so.4+0x7b8e2)
1 in mlx5_devx_tir_attr_set ../drivers/net/mlx5/mlx5_devx.c:765
2 in mlx5_devx_hrxq_new ../drivers/net/mlx5/mlx5_devx.c:800
3 in mlx5_devx_drop_action_create ../drivers/net/mlx5/mlx5_devx.c:1051
4 in mlx5_drop_action_create ../drivers/net/mlx5/mlx5_rxq.c:2846
5 in mlx5_dev_spawn ../drivers/net/mlx5/linux/mlx5_os.c:1743
6 in mlx5_os_pci_probe_pf ../drivers/net/mlx5/linux/mlx5_os.c:2501
7 in mlx5_os_pci_probe ../drivers/net/mlx5/linux/mlx5_os.c:2647
8 in mlx5_os_net_probe ../drivers/net/mlx5/linux/mlx5_os.c:2722
9 in drivers_probe ../drivers/common/mlx5/mlx5_common.c:657
10 in mlx5_common_dev_probe ../drivers/common/mlx5/mlx5_common.c:711
11 in mlx5_common_pci_probe ../drivers/common/mlx5/mlx5_common_pci.c:150
12 in rte_pci_probe_one_driver ../drivers/bus/pci/pci_common.c:269
13 in pci_probe_all_drivers ../drivers/bus/pci/pci_common.c:353
14 in pci_probe ../drivers/bus/pci/pci_common.c:380
15 in rte_bus_probe ../lib/eal/common/eal_common_bus.c:72
16 in rte_eal_init ../lib/eal/linux/eal.c:1286
17 in main ../app/test-pmd/testpmd.c:4112
Fixes: 0c762e81da9b ("net/mlx5: share Rx queue drop action code") Cc: stable@dpdk.org Signed-off-by: Yunjian Wang <wangyunjian@huawei.com> Acked-by: Viacheslav Ovsiienko <viacheslavo@nvidia.com>
Shun Hao [Sun, 19 Jun 2022 03:21:27 +0000 (06:21 +0300)]
net/mlx5: add limitation for E-Switch Manager match
For BF with old FW which doesn't expose the E-Switch Manager vport ID,
E-Switch Manager port matching works correctly only when BF is in
embedded CPU mode.
Spike Du [Thu, 16 Jun 2022 08:41:54 +0000 (11:41 +0300)]
app/testpmd: add host shaper command
Add command line options to support host shaper configure.
- Command syntax:
mlx5 set port <port_id> host_shaper avail_thresh_triggered <0|1> rate
<rate_num>
- Example commands:
To enable avail_thresh_triggered on port 1 and disable current host
shaper:
testpmd> mlx5 set port 1 host_shaper avail_thresh_triggered 1 rate 0
To disable avail_thresh_triggered and current host shaper on port 1:
testpmd> mlx5 set port 1 host_shaper avail_thresh_triggered 0 rate 0
The rate unit is 100Mbps.
To disable avail_thresh_triggered and configure a shaper of 5Gbps on
port 1:
testpmd> mlx5 set port 1 host_shaper avail_thresh_triggered 0 rate 50
Add sample code to handle rxq available descriptor threshold event, it
delays a while so that rxq empties, then disables host shaper and
rearms available descriptor threshold event.
Signed-off-by: Spike Du <spiked@nvidia.com> Acked-by: Matan Azrad <matan@nvidia.com>
Spike Du [Thu, 16 Jun 2022 08:41:53 +0000 (11:41 +0300)]
net/mlx5: add API to configure host port shaper
Host port shaper can be configured with QSHR (QoS Shaper Host Register).
Add check in build files to enable this function or not.
The host shaper configuration affects all the ethdev ports belonging to the
same host port.
Host shaper can configure shaper rate and lwm-triggered for a host port.
The shaper limits the rate of traffic from host port to wire port.
If lwm-triggered is enabled, a 100Mbps shaper is enabled automatically
when one of the host port's Rx queues receives available descriptor
threshold event.
Signed-off-by: Spike Du <spiked@nvidia.com> Acked-by: Matan Azrad <matan@nvidia.com>
Spike Du [Thu, 16 Jun 2022 08:41:52 +0000 (11:41 +0300)]
net/mlx5: support Rx descriptor threshold event
Add mlx5 specific available descriptor threshold configuration
and query handler.
In mlx5 PMD, available descriptor threshold is also called
LWM (limit watermark).
While the Rx queue fullness reaches the LWM limit, the driver catches
an HW event and invokes the user callback.
The query handler finds the next Rx queue with pending LWM event
if any, starting from the given Rx queue index.
Signed-off-by: Spike Du <spiked@nvidia.com> Acked-by: Matan Azrad <matan@nvidia.com>
Spike Du [Thu, 16 Jun 2022 08:41:51 +0000 (11:41 +0300)]
net/mlx5: handle Rx descriptor LWM event
When LWM meets RQ WQE, the kernel driver raises an event to SW.
Use devx event_channel to catch this and to notify the user.
Allocate this channel per shared device.
The channel has a cookie that informs the specific event port and queue.
Signed-off-by: Spike Du <spiked@nvidia.com> Acked-by: Matan Azrad <matan@nvidia.com>
Spike Du [Thu, 16 Jun 2022 08:41:50 +0000 (11:41 +0300)]
common/mlx5: share interrupt management
There are many duplicate code of creating and initializing rte_intr_handle.
Add a new mlx5_os API to do this, replace all PMD related code with this
API.
Signed-off-by: Spike Du <spiked@nvidia.com> Acked-by: Matan Azrad <matan@nvidia.com>
Spike Du [Thu, 16 Jun 2022 08:41:49 +0000 (11:41 +0300)]
net/mlx5: support descriptor LWM for Rx queue
Add LWM (Limit WaterMark) field to Rxq object which indicates the percentage
of Rx queue size used by HW to raise descriptor event to the user.
Allow LWM setting in modify_rq command.
Allow the LWM configuration dynamically by adding RDY2RDY state change.
Signed-off-by: Spike Du <spiked@nvidia.com> Acked-by: Matan Azrad <matan@nvidia.com>
Ali Alnubani [Wed, 11 May 2022 16:41:09 +0000 (19:41 +0300)]
net/mlx5: fix build with clang 14
Use fgets instead of fscanf to resolve the following warning
reported by clang 14.0.0 in Fedora 37 (Rawhide):
drivers/net/mlx5/linux/mlx5_ethdev_os.c:1137:52: error:
'fscanf' may overflow; destination buffer in argument 3 has size 16,
but the corresponding specifier may require size 17
[-Werror,-Wfortify-source]
ret = fscanf(file, "%" RTE_STR(IF_NAMESIZE) "s", port_name);
Fixes: 63d1db710fbc ("net/mlx5: fix unlimited parsing of switch info") Cc: stable@dpdk.org Signed-off-by: Ali Alnubani <alialnu@nvidia.com> Acked-by: Thomas Monjalon <thomas@monjalon.net> Acked-by: Viacheslav Ovsiienko <viacheslavo@nvidia.com>
Gregory Etelson [Wed, 8 Jun 2022 11:58:26 +0000 (14:58 +0300)]
common/mlx5: update log for DevX object creation failure
Application can fetch syndrome value after FW operation failure
starting from Mellanox OFED-5.6.
The patch updates log data after devx_obj_create error.
Gregory Etelson [Wed, 8 Jun 2022 11:58:25 +0000 (14:58 +0300)]
common/mlx5: update log for DevX general command failure
Application can fetch syndrome value after FW operation failure
starting from Mellanox OFED-5.6.
The patch updates log data issued after devx_general_cmd error.
Sean Zhang [Tue, 7 Jun 2022 11:17:32 +0000 (14:17 +0300)]
net/mlx5: support represented port item in flow rules
Add support for represented_port item in pattern. And if the spec and mask
both are NULL, translate function will not add source vport to matcher.
For example, testpmd starts with PF, VF-rep0 and VF-rep1, below command
will redirect packets from VF0 and VF1 to wire:
testpmd> flow create 0 ingress transfer group 0 pattern eth /
represented_port / end actions represented_port ethdev_id is 0 / end
Signed-off-by: Sean Zhang <xiazhang@nvidia.com> Acked-by: Viacheslav Ovsiienko <viacheslavo@nvidia.com>
Don Wallwork [Thu, 23 Jun 2022 11:21:27 +0000 (07:21 -0400)]
eal/linux: allocate worker lcore stacks in hugepages
Add support for using hugepages for worker lcore stack memory. The
intent is to improve performance by reducing stack memory related TLB
misses and also by using memory local to the NUMA node of each lcore.
EAL option '--huge-worker-stack[=stack-size-in-kbytes]' is added to allow
the feature to be enabled at runtime. If the size is not specified,
the system pthread stack size will be used.
Huichao Cai [Sat, 18 Jun 2022 14:09:40 +0000 (22:09 +0800)]
ip_frag: fix build with GCC 12
GCC 12 raises warnings on usage of rte_memcpy with IPv4 options handling
in fragments for both the ip_frag library and unit tests.
For example in the library:
In function ‘_mm256_storeu_si256’,
inlined from ‘rte_mov32’ at
../lib/eal/x86/include/rte_memcpy.h:347:2,
inlined from ‘rte_mov128’ at
../lib/eal/x86/include/rte_memcpy.h:369:2,
inlined from ‘rte_memcpy_generic’
at ../lib/eal/x86/include/rte_memcpy.h:445:4,
inlined from ‘rte_memcpy’
at ../lib/eal/x86/include/rte_memcpy.h:851:10,
inlined from ‘__create_ipopt_frag_hdr’
at ../lib/ip_frag/rte_ipv4_fragmentation.c:68:4,
inlined from ‘rte_ipv4_fragment_packet’
at ../lib/ip_frag/rte_ipv4_fragmentation.c:242:16:
/usr/lib/gcc/x86_64-redhat-linux/12/include/avxintrin.h:935:8: error:
array subscript ‘__m256i_u[1]’ is partly outside array bounds of
‘uint8_t[60]’ {aka ‘unsigned char[60]’} [-Werror=array-bounds]
935 | *__P = __A;
| ~~~~~^~~~~
../lib/ip_frag/rte_ipv4_fragmentation.c: In function
‘rte_ipv4_fragment_packet’:
../lib/ip_frag/rte_ipv4_fragmentation.c:122:17: note: at offset [52, 60]
into object ‘ipopt_frag_hdr’ of size 60
122 | uint8_t ipopt_frag_hdr[IPV4_HDR_MAX_LEN];
| ^~~~~~~~~~~~~~
To resolve the compilation warning, replace the rte_memcpy with memcpy.
The x86 version of rte_memcpy can cause warnings. The driver does
not need to use rte_memcpy for everything. Standard memcpy is
just as fast and safer; the compiler and static analysis tools
treat memcpy specially.
Cc: stable@dpdk.org Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
Wenxuan Wu [Thu, 23 Jun 2022 09:01:05 +0000 (17:01 +0800)]
net/ice/base: fix build with GCC 12
GCC 12 with -O2 flag would raise the following warning:
../drivers/net/ice/base/ice_switch.c:7220:61: error: writing 1 byte into a
region of size 0 [-Werror=stringop-overflow=]
7220 | buf[recps].content.lkup_indx[i + 1] = entry->fv_idx[i];
| ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~
This patch changed the type of fv_idx in struct ice_recp_grp_entry to
align with its callers which are also u8 type.
Kathleen Capella [Fri, 17 Jun 2022 18:21:34 +0000 (18:21 +0000)]
net/iavf: add basic NEON Rx
This patch adds the basic NEON Rx path to the iavf driver. It does not
include scatter or flex varieties.
Tested on N1SDP platform with Intel XL710 NIC and 40G connection.
Tested with a single core and testpmd rxonly mode. Saw no significant
performance difference between scalar and Arm vPMD paths using this test
in iavf and saw the same results when comparing scalar and Arm vPMD
path in i40e.
Robin Zhang [Fri, 10 Jun 2022 16:29:44 +0000 (16:29 +0000)]
net/i40e: add outer VLAN processing
Outer VLAN processing is supported after firmware v8.4, kernel driver
also change the default behavior to support this feature. To align with
kernel driver, add support for outer VLAN processing in DPDK.
But it is forbidden for firmware to change the Inner/Outer VLAN
configuration while there are MAC/VLAN filters in the switch table.
Therefore, we need to clear the MAC table before setting config,
and then restore the MAC table after setting.
This will not impact on an old firmware.
Signed-off-by: Robin Zhang <robinx.zhang@intel.com> Signed-off-by: Kevin Liu <kevinx.liu@intel.com> Acked-by: Yuying Zhang <yuying.zhang@intel.com>
Parameters:
<port_id> the PF Port ID
<config_path> dumped runtime configure file, if not a absolute path,
it will be dumped to testpmd running directory.
For example:
testpmd> ddp dump 0 current.pkg
If you want to dump ice VF DDP runtime configure, you need bind other
unused PF port of the NIC first, and then dump the PF's runtime configure
as target output.
Signed-off-by: Steve Yang <stevex.yang@intel.com> Acked-by: Qi Zhang <qi.z.zhang@intel.com>
Simei Su [Wed, 8 Jun 2022 02:46:01 +0000 (10:46 +0800)]
net/ice: fix race condition in Rx timestamp
In multi-cores cases for Rx timestamp offload, to avoid phc time being
frequently overwritten, move related variables from ice_adapter to
ice_rx_queue structure, and each queue will handle timestamp calculation
by itself.
Fixes: 953e74e6b73a ("net/ice: enable Rx timestamp on flex descriptor") Fixes: 5543827fc6df ("net/ice: improve performance of Rx timestamp offload") Cc: stable@dpdk.org Signed-off-by: Simei Su <simei.su@intel.com> Acked-by: Qi Zhang <qi.z.zhang@intel.com>
Ferruh Yigit [Thu, 16 Jun 2022 17:02:09 +0000 (18:02 +0100)]
net/qede: fix build with GCC 13
Reproduced with "gcc (GCC) 13.0.0 20220616 (experimental)"
Build error:
In file included from ../drivers/net/qede/qede_debug.c:9:
../drivers/net/qede/qede_debug.c: In function ‘qed_grc_dump_addr_range’:
../drivers/net/qede/base/ecore.h:95:17:
warning: overflow in conversion from ‘int’ to ‘u8’
{aka ‘unsigned char’} changes value from ‘(int)vf_id << 8 | 128’
to ‘128’ [-Woverflow]
95 | ((_value & _name##_MASK) << _name##_SHIFT)
| ^
../drivers/net/qede/qede_debug.c:1907:31:
note: in expansion of macro ‘FIELD_VALUE’
1907 | fid = FIELD_VALUE(PXP_PRETEND_CONCRETE_FID_VFVALID, 1)
| ^~~~~~~~~~~
To prevent overflow converting 'fib' to uint16_t,
while updating it also updated 'vf_id' to 16 bit too.
Fixes: ec55c118792b ("net/qede: add infrastructure for debug data collection") Cc: stable@dpdk.org Signed-off-by: Ferruh Yigit <ferruh.yigit@xilinx.com> Acked-by: Devendra Singh Rawat <dsinghrawat@marvell.com>
Harman Kalra [Thu, 16 Jun 2022 09:24:12 +0000 (14:54 +0530)]
common/cnxk: support same TC value across multiple queues
User may want to configure same TC value across multiple queues, but
for that all queues should have a common TL3 where this TC value will
get configured.
Changed the pfc_tc_cq_map/pfc_tc_sq_map array indexing to qid and store
TC values in the array. As multiple queues may have same TC value.
Bruce Richardson [Wed, 15 Jun 2022 17:10:13 +0000 (18:10 +0100)]
common/cnxk: add include for macro definition
The header file "roc_io.h" uses the "__plt_always_inline" macro but
don't include "roc_platform.h" to get the definition of it. This
inclusion is not necessary for compilation, but the lack of it can
confuse some indexers - such as those in eclipse, which reports the
lines:
"static __plt_always_inline uint64_t"
as possible definitions of a variable called "uint64_t". This confusion
leads to uint64_t being flagged as an unknown type in all other parts of
the project being indexed, e.g. across all of DPDK code.
Adding in the include of roc_platform.h makes it clear to the indexer
that those lines are part of a function definition, and that allows
eclipse to correctly recognise uint64_t as a type from stdint.h
Signed-off-by: Bruce Richardson <bruce.richardson@intel.com> Acked-by: Jerin Jacob <jerinj@marvell.com>
When counting the batch allocated pointers in cnxk mempool driver,
currently it always waits for in-flight batch operations to finish.
Add a provision to make this waiting optional.
Signed-off-by: Ashwin Sekhar T K <asekhar@marvell.com>
Satheesh Paul [Mon, 2 May 2022 08:47:30 +0000 (14:17 +0530)]
common/cnxk: fix channel number setting in MCAM entries
Adding changes to accommodate the following requirements
while masking the channel number.
1. For CN10K device, channel number should not be masked
for first pass rules with RTE_FLOW_ACTION_TYPE_SECURITY
action. And channel number should be masked for all
other flow rules.
2. For CN9K device channel number should not be masked.
Fixes: 4968b362b639 ("common/cnxk: support CPT second pass flow rules") Cc: stable@dpdk.org Signed-off-by: Satheesh Paul <psatheesh@marvell.com> Reviewed-by: Kiran Kumar K <kirankumark@marvell.com>
Harman Kalra [Tue, 24 May 2022 08:42:26 +0000 (14:12 +0530)]
net/octeontx: fix port close
Segmentation fault has been observed while closing the ethernet
port. Reason for the segfault is, eth port close also shuts down
event device while other ethernet port is still using the event
device.
The crossbuild-essential-<arch> packages contain all necessary
dependencies to cross-compile binaries for a given architecture
including C and C++ compilers. Therefore use those instead of listing
packages directly. This way C++ compiler is also installed and C++
include checks will be checked in CI for ARM and PowerPC.
Cc: stable@dpdk.org Signed-off-by: Stanislaw Kardach <kda@semihalf.com> Reviewed-by: David Marchand <david.marchand@redhat.com>
Through some mixup all cross-files for ARM and PowerPC platforms were
using C Preprocessor (cpp) instead of GCC (g++).
This caused meson to fail detecting the C++ compiler presence and
therefore disabling some targets (i.e. C++ include file checks).
Fixes: e53a5299d219 ("build: support vendor specific ARM cross builds") Cc: stable@dpdk.org Signed-off-by: Stanislaw Kardach <kda@semihalf.com> Acked-by: Bruce Richardson <bruce.richardson@intel.com>
rte_dump_stack() needs to be usable in situations when a bug is
encountered and from signal handlers (such as SEGV).
Glibc backtrace_symbols() calls malloc which makes it
dangerous in a signal handler that is handling errors that maybe
due to memory corruption. Additionally, rte_log() is unsafe because
syslog() is not signal safe; printf() is also documented as
not being safe.
This version formats message and uses writev for each line in a manner
similar to what glibc version of backtrace_symbols_fd() does. The
FreeBSD version of backtrace_symbols_fd() is not signal safe.
David Marchand [Wed, 22 Jun 2022 15:30:20 +0000 (17:30 +0200)]
vhost/crypto: fix descriptor processing
copy_data was returning a pointer to an increased (off by one) descriptor.
Subsequent calls to copy_data in the library were then failing.
Fix this by incrementing the descriptor only if there is some left data
to copy.
Fixes: 4414bb67010d ("vhost/crypto: fix build with GCC 12") Cc: stable@dpdk.org Reported-by: Jakub Poczatek <jakub.poczatek@intel.com> Signed-off-by: David Marchand <david.marchand@redhat.com> Reviewed-by: Maxime Coquelin <maxime.coquelin@redhat.com> Tested-by: Jakub Poczatek <jakub.poczatek@intel.com> Acked-by: Fan Zhang <roy.fan.zhang@intel.com>
Yuan Wang [Mon, 6 Jun 2022 15:55:43 +0000 (23:55 +0800)]
net/virtio: unmap PCI device in secondary process
In multi-process, the secondary process will remap PCI during
initialization, but the mapping is not removed in the uninit path,
the device is not closed, and the device busy error will be reported
when the device is hotplugged.
This patch unmaps PCI device at secondary process uninitialization
based on virtio_rempa_pci.
Li Zhang [Sat, 18 Jun 2022 09:02:58 +0000 (12:02 +0300)]
vdpa/mlx5: prepare virtqueue resource creation
Split the virtqs virt-queue resource between
the configuration threads.
Also need pre-created virt-queue resource
after virtq destruction.
This accelerates the LM process and reduces its time by 30%.
Signed-off-by: Li Zhang <lizh@nvidia.com> Acked-by: Matan Azrad <matan@nvidia.com> Reviewed-by: Maxime Coquelin <maxime.coquelin@redhat.com>
Li Zhang [Sat, 18 Jun 2022 09:02:57 +0000 (12:02 +0300)]
vdpa/mlx5: add virtq sub-resources creation
pre-created virt-queue sub-resource in device probe stage
and then modify virtqueue in device config stage.
Steer table also need to support dummy virt-queue.
This accelerates the LM process and reduces its time by 40%.
Signed-off-by: Li Zhang <lizh@nvidia.com> Signed-off-by: Yajun Wu <yajunw@nvidia.com> Acked-by: Matan Azrad <matan@nvidia.com> Reviewed-by: Maxime Coquelin <maxime.coquelin@redhat.com>
Li Zhang [Sat, 18 Jun 2022 09:02:56 +0000 (12:02 +0300)]
vdpa/mlx5: add device close task
Split the virtqs device close tasks after
stopping virt-queue between the configuration threads.
This accelerates the LM process and
reduces its time by 50%.
Signed-off-by: Li Zhang <lizh@nvidia.com> Acked-by: Matan Azrad <matan@nvidia.com> Reviewed-by: Maxime Coquelin <maxime.coquelin@redhat.com>
Li Zhang [Sat, 18 Jun 2022 09:02:54 +0000 (12:02 +0300)]
vdpa/mlx5: add virtq creation task
The virtq object and all its sub-resources use a lot of
FW commands and can be accelerated by the MT management.
Split the virtqs creation between the configuration threads.
This accelerates the LM process and reduces its time by 20%.
Signed-off-by: Li Zhang <lizh@nvidia.com> Acked-by: Matan Azrad <matan@nvidia.com> Reviewed-by: Maxime Coquelin <maxime.coquelin@redhat.com>
Li Zhang [Sat, 18 Jun 2022 09:02:53 +0000 (12:02 +0300)]
vdpa/mlx5: add VM memory registration task
The driver creates a direct MR object of
the HW for each VM memory region,
which maps the VM physical address to
the actual physical address.
Later, after all the MRs are ready,
the driver creates an indirect MR to group all the direct MRs
into one virtual space from the HW perspective.
Create direct MRs in parallel using the MT mechanism.
After completion, the primary thread creates the indirect MR
needed for the following virtqs configurations.
This optimization accelerrate the LM process and
reduce its time by 5%.
Signed-off-by: Li Zhang <lizh@nvidia.com> Acked-by: Matan Azrad <matan@nvidia.com> Reviewed-by: Maxime Coquelin <maxime.coquelin@redhat.com>
Li Zhang [Sat, 18 Jun 2022 09:02:52 +0000 (12:02 +0300)]
vdpa/mlx5: add task ring for multi-thread management
The configuration threads tasks need a container to
support multiple tasks assigned to a thread in parallel.
Use rte_ring container per thread to manage
the thread tasks without locks.
The caller thread from the user context opens a task to
a thread and enqueue it to the thread ring.
The thread polls its ring and dequeue tasks.
That’s why the ring should be in multi-producer
and single consumer mode.
Anatomic counter manages the tasks completion notification.
The threads report errors to the caller by
a dedicated error counter per task.
Signed-off-by: Li Zhang <lizh@nvidia.com> Acked-by: Matan Azrad <matan@nvidia.com> Reviewed-by: Maxime Coquelin <maxime.coquelin@redhat.com>
Li Zhang [Sat, 18 Jun 2022 09:02:51 +0000 (12:02 +0300)]
vdpa/mlx5: add multi-thread management for configuration
The LM process includes a lot of objects creations and
destructions in the source and the destination servers.
As much as LM time increases, the packet drop of the VM increases.
To improve LM time need to parallel the configurations for mlx5 FW.
Add internal multi-thread management in the driver for it.
A new devarg defines the number of threads and their CPU.
The management is shared between all the devices of the driver.
Since the event_core also affects the datapath events thread,
reduce the priority of the datapath event thread to
allow fast configuration of the devices doing the LM.
Signed-off-by: Li Zhang <lizh@nvidia.com> Acked-by: Matan Azrad <matan@nvidia.com> Reviewed-by: Maxime Coquelin <maxime.coquelin@redhat.com>
The driver used a single global lock for any synchronization
needed for the datapath and control path.
It is better to group the critical sections with
the other ones that should be synchronized.
Replace the global lock with the following locks:
1.virtq locks(per virtq) synchronize datapath polling and
parallel configurations on the same virtq.
2.A doorbell lock synchronizes doorbell update,
which is shared for all the virtqs in the device.
3.A steering lock for the shared steering objects updates.
Signed-off-by: Li Zhang <lizh@nvidia.com> Acked-by: Matan Azrad <matan@nvidia.com> Reviewed-by: Maxime Coquelin <maxime.coquelin@redhat.com>
Li Zhang [Sat, 18 Jun 2022 09:02:48 +0000 (12:02 +0300)]
common/mlx5: extend virtq modifiable fields
A virtq configuration can be modified after the virtq creation.
Added the following modifiable fields:
1.address fields: desc_addr/used_addr/available_addr
2.hw_available_index
3.hw_used_index
4.virtio_q_type
5.version type
6.queue mkey
7.feature bit mask: tso_ipv4/tso_ipv6/tx_csum/rx_csum
8.event mode: event_mode/event_qpn_or_msix
Signed-off-by: Li Zhang <lizh@nvidia.com> Acked-by: Matan Azrad <matan@nvidia.com> Reviewed-by: Maxime Coquelin <maxime.coquelin@redhat.com>
Yajun Wu [Sat, 18 Jun 2022 09:02:47 +0000 (12:02 +0300)]
vdpa/mlx5: reuse event queues
To speed up queue creation time, event QP and CQ will create only once.
Each virtq creation will reuse same event QP and CQ.
Because FW will set event QP to error state during virtq destroy,
need modify event QP to RESET state, then modify QP to RTS state as
usual. This can save about 1.5ms for each virtq creation.
After SW QP reset, QP pi/ci all become 0 while CQ pi/ci keep as
previous. Add new variable qp_ci to save SW QP ci. Move QP pi
independently with CQ ci.
Add new function mlx5_vdpa_drain_cq to drain CQ CQE after virtq
release.
Yajun Wu [Sat, 18 Jun 2022 09:02:45 +0000 (12:02 +0300)]
vdpa/mlx5: support pre-creation of virtq resource
The motivation of this change is to reduce vDPA device queue creation
time by creating some queue resource in vDPA device probe stage.
In VM live migration scenario, this can reduce 0.8ms for each queue
creation, thus reduce LM network downtime.
To create queue resource(umem/counter) in advance, we need to know
virtio queue depth and max number of queue VM will use.
Introduce two new devargs: queues(max queue pair number) and queue_size
(queue depth). Two args must be both provided, if only one argument
provided, the argument will be ignored and no pre-creation.
The queues and queue_size must also be identical to vhost configuration
driver later receive. Otherwise either the pre-create resource is wasted
or missing or the resource need destroy and recreate(in case queue_size
mismatch).
Pre-create umem/counter will keep alive until vDPA device removal.
David Marchand [Fri, 17 Jun 2022 05:40:03 +0000 (07:40 +0200)]
vhost: fix log message for async dequeue
Since the commit 02798b073520 ("vhost: improve virtio-net layer logs"),
vhost logs contain the socket path as a prefix.
Async dequeue path was copied from the sync dequeue path but a log
was incorrect.
Fixes: 84d5204310d7 ("vhost: support async dequeue for split ring") Signed-off-by: David Marchand <david.marchand@redhat.com> Reviewed-by: Maxime Coquelin <maxime.coquelin@redhat.com>
Yajun Wu [Wed, 15 Jun 2022 10:02:27 +0000 (13:02 +0300)]
vdpa/mlx5: workaround VAR offset within page
vDPA driver first uses kernel driver to allocate doorbell (VAR) area for
each device. Then uses var->mmap_off and var->length to mmap uverbs device
file as doorbell userspace virtual address.
Current kernel driver provides var->mmap_off equal to page start of VAR.
It's fine with x86 4K page server, because VAR physical address is only 4K
aligned thus locate in 4K page start.
But with aarch64 64K page server, the actual VAR physical address has
offset within page (not located in 64K page start).
So the vDPA driver needs to add this within page offset
(caps.doorbell_bar_offset) to get the right VAR virtual address.
Yuan Wang [Thu, 9 Jun 2022 17:34:04 +0000 (01:34 +0800)]
examples/vhost: support clear in-flight for async dequeue
This patch allows vring_state_changed() to clear in-flight
dequeue packets. It also clears the in-flight packets in
a thread-safe way in destroy_device().
Signed-off-by: Yuan Wang <yuanx.wang@intel.com> Reviewed-by: Maxime Coquelin <maxime.coquelin@redhat.com> Reviewed-by: Jiayu Hu <jiayu.hu@intel.com>
Yuan Wang [Thu, 9 Jun 2022 17:34:03 +0000 (01:34 +0800)]
vhost: support clear in-flight packets for async dequeue
rte_vhost_clear_queue_thread_unsafe() supports to clear
in-flight packets for async enqueue only. But after
supporting async dequeue, this API should support async dequeue too.
This patch also adds the thread-safe version of this API,
the difference between the two API is that thread safety uses lock.
These APIs maybe used to clean up packets in the async channel
to prevent packet loss when the device state changes or
when the device is destroyed.
Signed-off-by: Yuan Wang <yuanx.wang@intel.com> Reviewed-by: Maxime Coquelin <maxime.coquelin@redhat.com> Reviewed-by: Jiayu Hu <jiayu.hu@intel.com>
Maxime Coquelin [Wed, 8 Jun 2022 12:49:46 +0000 (14:49 +0200)]
net/vhost: perform SW checksum in Tx path
Virtio specification supports guest checksum offloading
for L4, which is enabled with VIRTIO_NET_F_GUEST_CSUM
feature negotiation. However, the Vhost PMD does not
advertise Tx checksum offload capabilities.
Advertising these offload capabilities at the ethdev level
is not enough, because we could still end-up with the
application enabling these offloads while the guest not
negotiating it.
This patch advertises the Tx checksum offload capabilities,
and introduces a compatibility layer to cover the case
VIRTIO_NET_F_GUEST_CSUM has not been negotiated but the
application does configure the Tx checksum offloads. This
function performs the L4 Tx checksum in SW for UDP and TCP.
Compared to Rx SW checksum, the Tx SW checksum function
needs to compute the pseudo-header checksum, as we cannot
know whether it was done before.
This patch does not advertise SCTP checksum offloading
capability for now, but it could be handled later if the
need arises.
Reported-by: Jason Wang <jasowang@redhat.com> Signed-off-by: Maxime Coquelin <maxime.coquelin@redhat.com> Reviewed-by: Chenbo Xia <chenbo.xia@intel.com> Reviewed-by: Cheng Jiang <cheng1.jiang@intel.com>