Ivan Malov [Tue, 1 Feb 2022 08:49:56 +0000 (11:49 +0300)]
common/sfc_efx/base: query RSS queue span limit on Riverhead
On Riverhead boards, clients can query the limit on how many
queues an RSS context may address. Put the capability to use.
Signed-off-by: Ivan Malov <ivan.malov@oktetlabs.ru>
Reviewed-by: Andrew Rybchenko <andrew.rybchenko@oktetlabs.ru>
Reviewed-by: Andy Moreton <amoreton@xilinx.com>
Ivan Malov [Tue, 1 Feb 2022 08:49:55 +0000 (11:49 +0300)]
net/sfc: rework flow action RSS support
Currently, the driver always allocates a dedicated NIC RSS context
for every separate flow rule with action RSS, which is not optimal.
First of all, multiple rules which have the same RSS specification
can share a context since filters in the hardware operate this way.
Secondly, entries in a context's indirection table are not precise
queue IDs but offsets with regard to the base queue ID of a filter.
This way, for example, queue arrays "0, 1, 2" and "3, 4, 5" in two
otherwise identical RSS specifications allow the driver to use the
same context since they both yield the same table of queue offsets.
Rework flow action RSS support in order to use these optimisations.
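As an illustration of the offset idea (a hedged sketch, not the driver's code; helper names are hypothetical):

    #include <stdbool.h>
    #include <stdint.h>
    #include <string.h>

    /*
     * Sketch: reduce absolute queue IDs to offsets from the smallest
     * queue ID, so that "0, 1, 2" and "3, 4, 5" yield the same table.
     */
    static void
    rss_queues_to_offsets(const uint16_t *queues, unsigned int nb_queues,
                          uint16_t *offsets, uint16_t *base)
    {
            unsigned int i;

            *base = queues[0];
            for (i = 1; i < nb_queues; i++)
                    if (queues[i] < *base)
                            *base = queues[i];

            for (i = 0; i < nb_queues; i++)
                    offsets[i] = queues[i] - *base;
    }

    /* Two RSS specs may share a NIC context if their offset tables match. */
    static bool
    rss_offsets_match(const uint16_t *a, const uint16_t *b, unsigned int n)
    {
            return memcmp(a, b, n * sizeof(*a)) == 0;
    }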
Signed-off-by: Ivan Malov <ivan.malov@oktetlabs.ru>
Reviewed-by: Andrew Rybchenko <andrew.rybchenko@oktetlabs.ru>
Reviewed-by: Andy Moreton <amoreton@xilinx.com>
When a tap device is hotplugged to the primary process, which in turn
adds the device to all secondary processes, the secondary process
calls tap_mp_attach_queues(). However, the fds are not populated in
the primary during probe; they are populated during queue setup.
Add a fix to sync the queues during rte_eth_dev_start().
Ciara Loftus [Fri, 28 Jan 2022 09:50:29 +0000 (09:50 +0000)]
net/af_xdp: use libxdp if available
AF_XDP support is deprecated in libbpf since v0.7.0 [1]. The libxdp library
now provides the functionality which once was in libbpf and which the
AF_XDP PMD relies on. This commit updates the AF_XDP meson build to use the
libxdp library if a version >= v1.2.2 is available. If it is not available,
only versions of libbpf prior to v0.7.0 are allowed, as they still contain
the required AF_XDP functionality.
libbpf still remains a dependency even if libxdp is present, as we use
libbpf APIs for program loading.
The minimum required kernel version for libxdp for use with AF_XDP is v5.3.
For the library to be fully-featured, a kernel v5.10 or newer is
recommended. The full compatibility information can be found in the libxdp
README.
v1.2.2 of libxdp includes an important fix required for linking with DPDK
which is why this version or greater is required. Meson uses pkg-config to
verify the version of libxdp on the system, so it is necessary that the
library is discoverable using pkg-config in order for the PMD to use it. To
verify this, you can run: pkg-config --modversion libxdp
The reason for the bug is that the rte timer is not cancelled on quit.
That is, in 'bond_ethdev_start', resources are allocated according to the
bonding mode, and in 'bond_ethdev_stop' the resources are freed according
to the corresponding mode.
For example, 'bond_ethdev_start' starts the bond_mode_8023ad_ext_periodic_cb
timer for bonding mode 4, and 'bond_ethdev_stop' cancels the timer only
when the current bonding mode is 4. If the bonding mode is changed and the
process then quits directly, the timer is still running, the freed memory
is accessed, and a segmentation fault follows.
A change of bonding mode means the resources change; resources should be
reallocated for the new mode, that is, the device should be restarted.
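A minimal sketch of the idea behind such a fix, assuming the mode-4 periodic callback is scheduled through the EAL alarm API (the helper below is illustrative, not the actual patch):

    #include <rte_alarm.h>

    /* Declared in the bonding PMD; repeated here only for the sketch. */
    void bond_mode_8023ad_ext_periodic_cb(void *arg);

    /*
     * Cancel the periodic callback unconditionally on stop, regardless of
     * the currently configured bonding mode, so that a mode change followed
     * by process exit cannot leave the timer armed on freed memory.
     */
    static void
    bond_cancel_mode4_timer(void *cb_arg)
    {
            rte_eal_alarm_cancel(bond_mode_8023ad_ext_periodic_cb, cb_arg);
    }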
Fixes: 2950a769315e ("bond: testpmd support")
Cc: stable@dpdk.org
Signed-off-by: Min Hu (Connor) <humin29@huawei.com>
Tested-by: Ferruh Yigit <ferruh.yigit@intel.com>
Min Hu (Connor) [Fri, 28 Jan 2022 02:25:33 +0000 (10:25 +0800)]
net/bonding: fix reference count on mbufs
In bonding Tx broadcast mode, packets should be sent by every slave, but
only one mbuf exists. The solution is to increment the reference count on
the mbuf, but the existing code ignored multi-segment mbufs.
This patch fixes that by taking a reference on every segment in the
multi-segment Tx scenario.
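Conceptually (a hedged sketch, not the patch itself), the reference count has to be bumped on every segment of the chain; rte_pktmbuf_refcnt_update() walks all segments:

    #include <rte_mbuf.h>

    /*
     * Sketch: before handing the same packet to num_slaves slaves in
     * broadcast mode, take num_slaves - 1 extra references on every
     * segment so each slave's Tx path can free its copy independently.
     */
    static void
    broadcast_refcnt_bump(struct rte_mbuf *pkt, uint16_t num_slaves)
    {
            if (num_slaves > 1)
                    rte_pktmbuf_refcnt_update(pkt, (int16_t)(num_slaves - 1));
    }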
Fixes: 2efb58cbab6e ("bond: new link bonding library")
Cc: stable@dpdk.org
Signed-off-by: Min Hu (Connor) <humin29@huawei.com>
Min Hu (Connor) [Fri, 28 Jan 2022 02:25:32 +0000 (10:25 +0800)]
net/bonding: fix promiscuous and allmulticast state
Currently, the promiscuous or allmulticast state of the bonding port is not
passed to the new primary slave on an active/standby switch-over, which
causes bugs in some scenarios.
For example, suppose the promiscuous state of the bonding port is off, the
primary slave (called A) has it off, and the secondary slave (called B) has
it on. After the switch-over, the promiscuous state of the bonding port is
still off, but the new primary slave is now B and its promiscuous state is
still on, which is inconsistent with the bonding port. This patch fixes that.
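A rough sketch of the intended behaviour (illustrative only, not the bonding PMD code): re-apply the bonded port's own state to the new primary slave after the switch-over.

    #include <rte_ethdev.h>

    static void
    bond_sync_primary_state(uint16_t bonded_port_id, uint16_t new_primary_id)
    {
            /* Propagate promiscuous state of the bonding port. */
            if (rte_eth_promiscuous_get(bonded_port_id) == 1)
                    rte_eth_promiscuous_enable(new_primary_id);
            else
                    rte_eth_promiscuous_disable(new_primary_id);

            /* Propagate allmulticast state of the bonding port. */
            if (rte_eth_allmulticast_get(bonded_port_id) == 1)
                    rte_eth_allmulticast_enable(new_primary_id);
            else
                    rte_eth_allmulticast_disable(new_primary_id);
    }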
Fixes: 2efb58cbab6e ("bond: new link bonding library")
Fixes: 68218b87c184 ("net/bonding: prefer allmulti to promiscuous for LACP")
Cc: stable@dpdk.org
Signed-off-by: Min Hu (Connor) <humin29@huawei.com>
Yunjian Wang [Fri, 24 Dec 2021 11:26:38 +0000 (19:26 +0800)]
net/ixgbe: check filter init failure
The functions ixgbe_fdir_filter_init() and ixgbe_l2_tn_filter_init()
can return errors; their return values need to be checked and propagated.
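Schematically (a sketch only, assuming both helpers take the ethdev pointer and return 0 or a negative errno):

    struct rte_eth_dev;

    /* PMD-internal helpers; declared here only to keep the sketch self-contained. */
    int ixgbe_fdir_filter_init(struct rte_eth_dev *eth_dev);
    int ixgbe_l2_tn_filter_init(struct rte_eth_dev *eth_dev);

    /* Sketch: return the first failure instead of silently dropping it. */
    static int
    ixgbe_filters_init(struct rte_eth_dev *eth_dev)
    {
            int ret = ixgbe_fdir_filter_init(eth_dev);

            if (ret != 0)
                    return ret;

            return ixgbe_l2_tn_filter_init(eth_dev);
    }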
Fixes: 080e3c0ee989 ("net/ixgbe: store flow director filter")
Fixes: d0c0c416ef1f ("net/ixgbe: store L2 tunnel filter")
Cc: stable@dpdk.org
Signed-off-by: Yunjian Wang <wangyunjian@huawei.com>
Acked-by: Haiyue Wang <haiyue.wang@intel.com>
Chengwen Feng [Fri, 28 Jan 2022 02:07:08 +0000 (10:07 +0800)]
net/hns3: delete duplicated RSS type
hns3_set_rss_types holds two IPV4_TCP items; this patch deletes the
duplicate item.
Fixes: 806f1d5ab0e3 ("net/hns3: set RSS hash type input configuration")
Cc: stable@dpdk.org
Signed-off-by: Chengwen Feng <fengchengwen@huawei.com>
Signed-off-by: Min Hu (Connor) <humin29@huawei.com>
Huisong Li [Fri, 28 Jan 2022 02:07:07 +0000 (10:07 +0800)]
net/hns3: fix operating queue when TCAM table is invalid
Resetting queues queries the TCAM table, and the table is cleared after a
global or IMP reset. Currently, the PF driver first resets the Rx/Tx queues
and then restores the table during the reset recovery process, so the table
query fails and triggers a RAS error.
Fixes: fa29fe45a7b4 ("net/hns3: support queue start and stop")
Cc: stable@dpdk.org
Signed-off-by: Huisong Li <lihuisong@huawei.com>
Signed-off-by: Min Hu (Connor) <humin29@huawei.com>
Huisong Li [Fri, 28 Jan 2022 02:07:06 +0000 (10:07 +0800)]
net/hns3: fix double decrement of secondary count
The "secondary_cnt" indicates the number of secondary processes on an
Ethernet device. But the variable is double subtracted when detach the
device in secondary processes.
Fixes: ff6dc76e40b8 ("net/hns3: refactor multi-process initialization")
Cc: stable@dpdk.org
Signed-off-by: Huisong Li <lihuisong@huawei.com>
Signed-off-by: Min Hu (Connor) <humin29@huawei.com>
Huisong Li [Fri, 28 Jan 2022 02:07:05 +0000 (10:07 +0800)]
net/hns3: fix insecure way to query MAC statistics
The HNS3 PF driver queries MAC statistics in the following way:
1) get the number of MAC statistics registers and calculate the descriptor
number.
2) use that descriptor number to send a command to the firmware to query all
MAC statistics and copy them into the hns3_mac_stats struct in the driver.
This approach does not verify the validity of the obtained register number,
which may cause out-of-bounds memory access.
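Conceptually, the missing check can be sketched like this (illustrative, not the driver code): validate the firmware-reported register count against the size of the driver's statistics structure before copying.

    #include <errno.h>
    #include <stddef.h>
    #include <stdint.h>

    /* Sketch: reject a register count that would overflow the local struct. */
    static int
    mac_stats_check_reg_num(uint32_t fw_reg_num, size_t stats_struct_size)
    {
            size_t max_regs = stats_struct_size / sizeof(uint64_t);

            if (fw_reg_num == 0 || fw_reg_num > max_regs)
                    return -EINVAL;

            return 0;
    }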
Fixes: 8839c5e202f3 ("net/hns3: support device stats")
Cc: stable@dpdk.org
Signed-off-by: Huisong Li <lihuisong@huawei.com>
Signed-off-by: Min Hu (Connor) <humin29@huawei.com>
Lijun Ou [Fri, 28 Jan 2022 02:07:04 +0000 (10:07 +0800)]
net/hns3: fix RSS key with null
Since patch '1848b117' initializes the 'key' variable in
'struct rte_flow_action_rss' to 'NULL', the PMD uses the default RSS key
when the first RSS rule is created with a NULL RSS key. Then, if a repeated
RSS rule is created with the same specification, the PMD does not identify
the duplicate rule and returns an error message.
To solve this problem, determine whether the current RSS keys are the same
based on whether the key_len of the rss configuration is 0.
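The comparison rule can be sketched as follows (illustrative, not the exact PMD code):

    #include <stdbool.h>
    #include <stdint.h>
    #include <string.h>

    /*
     * Sketch: two RSS configurations use "the same key" if both rely on
     * the default key (key_len == 0) or if their explicit keys match.
     */
    static bool
    rss_key_equal(const uint8_t *key_a, uint32_t len_a,
                  const uint8_t *key_b, uint32_t len_b)
    {
            if (len_a == 0 && len_b == 0)
                    return true;    /* both use the default key */

            if (len_a != len_b)
                    return false;

            return memcmp(key_a, key_b, len_a) == 0;
    }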
Fixes: 1848b117cca1 ("app/testpmd: fix RSS key for flow API RSS rule")
Cc: stable@dpdk.org
Signed-off-by: Lijun Ou <oulijun@huawei.com>
Huisong Li [Fri, 28 Jan 2022 02:07:03 +0000 (10:07 +0800)]
net/hns3: fix max packet size rollback in PF
The HNS3 PF driver uses hns->pf.mps to restore the MTU when a reset occurs.
If the user fails to configure the MTU, the MPS of the PF may not be
restored to its original value.
Fixes: 25fb790f7868 ("net/hns3: fix HW buffer size on MTU update")
Fixes: 1f5ca0b460cd ("net/hns3: support some device operations")
Fixes: d51867db65c1 ("net/hns3: add initialization")
Cc: stable@dpdk.org
Signed-off-by: Huisong Li <lihuisong@huawei.com>
Signed-off-by: Min Hu (Connor) <humin29@huawei.com>
Weiguo Li [Tue, 25 Jan 2022 12:00:49 +0000 (20:00 +0800)]
net/enic: fix dereference before null check
Move the memcpy to 'ah->key' after the 'ah' NULL check.
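In outline (a simplified sketch; the structure and field names below are placeholders, not the enic code):

    #include <stdlib.h>
    #include <string.h>

    struct action_handle {          /* placeholder type for the sketch */
            unsigned char key[64];
    };

    /* Dereference 'ah' only after the NULL check has passed. */
    static struct action_handle *
    action_handle_create(const void *key, size_t key_len)
    {
            struct action_handle *ah = calloc(1, sizeof(*ah));

            if (ah == NULL)
                    return NULL;

            if (key_len > sizeof(ah->key))
                    key_len = sizeof(ah->key);
            memcpy(ah->key, key, key_len);

            return ah;
    }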
Fixes: bb66d562aefc ("net/enic: share flow actions with same signature")
Cc: stable@dpdk.org
Signed-off-by: Weiguo Li <liwg06@foxmail.com>
Reviewed-by: John Daley <johndale@cisco.com>
support systemd service convention for runtime directory
Systemd.exec supports configuring the runtime directory of a service
via RuntimeDirectory=. This creates the directory with the necessary
permissions, which the actual service may not have if it is running in a
container.
The change to DPDK is to look for the RUNTIME_DIRECTORY environment
variable first and use it in preference to the fallback alternatives.
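A minimal sketch of the lookup order (illustrative; the real EAL code also keeps its historical fallbacks):

    #include <stdlib.h>

    /* Prefer the systemd-provided RUNTIME_DIRECTORY, then fall back. */
    static const char *
    runtime_dir_base(void)
    {
            const char *dir = getenv("RUNTIME_DIRECTORY");

            if (dir == NULL)
                    dir = getenv("XDG_RUNTIME_DIR");
            if (dir == NULL)
                    dir = "/var/run";

            return dir;
    }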
Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
Acked-by: Bruce Richardson <bruce.richardson@intel.com>
Reviewed-by: Morten Brørup <mb@smartsharesystems.com>
The size argument to eal_set_runtime_dir is useless and was being used
incorrectly in strlcpy. It worked only because all callers passed PATH_MAX,
which is the same as the size of the destination runtime_dir.
Note: this is an internal API, so there is no user-visible change.
Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
Reviewed-by: Morten Brørup <mb@smartsharesystems.com>
Acked-by: Bruce Richardson <bruce.richardson@intel.com>
Added an internal helper to get the OS-specific EAL mapping base address.
This helper can be used by drivers to program offload/accelerator devices,
where the base address serves as a reference address for the accelerator to
access host memory.
An address can also be represented as an offset relative to the base
address using smaller data types.
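For illustration (a sketch under the assumption that the mapped region fits in 4 GiB; the helper names are hypothetical):

    #include <stdint.h>

    /* Convert a host pointer to a 32-bit offset from the mapping base. */
    static inline uint32_t
    addr_to_offset(const void *addr, uintptr_t base)
    {
            return (uint32_t)((uintptr_t)addr - base);
    }

    /* Convert an offset back to a host pointer. */
    static inline void *
    offset_to_addr(uint32_t off, uintptr_t base)
    {
            return (void *)(base + off);
    }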
Dmitry Kozlyuk [Thu, 3 Feb 2022 18:13:36 +0000 (20:13 +0200)]
eal: extend --huge-unlink for hugepage file reuse
Expose the Linux EAL ability to reuse existing hugepage files
via the --huge-unlink=never switch.
The default behavior is unchanged; it can also be specified explicitly
using --huge-unlink=existing for consistency.
The old --huge-unlink switch is kept as an alias for --huge-unlink=always.
Add a test case for the --huge-unlink=never mode.
Dmitry Kozlyuk [Thu, 3 Feb 2022 18:13:35 +0000 (20:13 +0200)]
eal/linux: allow hugepage file reuse
Linux EAL ensured that mapped hugepages are clean
by always mapping from newly created files:
existing hugepage backing files were always removed.
In this case, the kernel clears the page to prevent data leaks,
because the mapped memory may contain leftover data
from the previous process that was using this memory.
Clearing takes the bulk of the time spent in mmap(2),
increasing EAL initialization time.
Introduce a mode to keep existing files and reuse them
in order to speed up initial memory allocation in EAL.
Hugepages mapped from such files may contain data
left by the previous process that used this memory,
so RTE_MEMSEG_FLAG_DIRTY is set for their segments.
If multiple hugepages are mapped from the same file:
1. When fallocate(2) is used, all memory mapped from this file
is considered dirty, because it is unknown
which parts of the file are holes.
2. When ftruncate(3) is used, memory mapped from this file
is considered dirty unless the file is extended
to create a new mapping, which implies clean memory.
Dmitry Kozlyuk [Thu, 3 Feb 2022 18:13:34 +0000 (20:13 +0200)]
eal: refactor --huge-unlink storage
In preparation to extend --huge-unlink option semantics
refactor how it is stored in the internal configuration.
It makes future changes more isolated.
Dmitry Kozlyuk [Thu, 3 Feb 2022 18:13:33 +0000 (20:13 +0200)]
mem: add dirty malloc element support
The EAL malloc layer assumed that the content of all free elements is
filled with zeros ("clean"), as opposed to uninitialized ("dirty").
This assumption was ensured in two ways:
1. The EAL memalloc layer always returned clean memory.
2. Freed memory was cleared before returning into the heap.
Clearing the memory can be as slow as around 14 GiB/s.
To avoid this cost, the memalloc layer is now allowed to return dirty
memory; such segments are marked with RTE_MEMSEG_FLAG_DIRTY.
The allocator tracks elements that contain dirty memory
using the new flag in the element header.
When clean memory is requested via rte_zmalloc*()
and the suitable element is dirty, it is cleared on allocation.
When memory is deallocated, the freed element is joined
with adjacent free elements, and the dirty flag is updated:
a) If the joint element contains dirty parts, it is dirty:
       dirty + freed + dirty = dirty  =>  no need to clean
               freed + dirty = dirty      the freed memory
Dirty parts may be large (e.g. initial allocation),
so clearing them could create unpredictable slowdown.
b) If the only dirty part of the joint element is the freed memory,
the joint element can be made clean: the freed memory is cleared if it
was dirty, so that clean + freed + clean = clean.
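A simplified sketch of the flag handling (not the actual malloc_elem code; types and names are illustrative):

    #include <stdbool.h>
    #include <stddef.h>
    #include <string.h>

    struct elem {
            void *addr;
            size_t len;
            bool dirty;
    };

    /* Joining two free elements: the result is dirty if either part was. */
    static void
    elem_join(struct elem *dst, const struct elem *src)
    {
            dst->len += src->len;
            dst->dirty = dst->dirty || src->dirty;
    }

    /* rte_zmalloc-style allocation: clear a dirty element on demand. */
    static void
    elem_alloc_zeroed(struct elem *e)
    {
            if (e->dirty) {
                    memset(e->addr, 0, e->len);
                    e->dirty = false;
            }
    }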
Dmitry Kozlyuk [Thu, 3 Feb 2022 18:13:32 +0000 (20:13 +0200)]
app/test: add allocator performance benchmark
Memory allocator performance is crucial to applications that deal
with large amount of memory or allocate frequently. DPDK allocator
performance is affected by EAL options, API used and, at least,
allocation size. New autotest is intended to be run with different
EAL options. It measures performance with a range of sizes
for different APIs: rte_malloc, rte_zmalloc, and rte_memzone_reserve.
Work distribution between allocation and deallocation depends on EAL
options. The test prints both times and total time to ease comparison.
Memory can be filled with zeroes at different points of allocation path,
but it always takes considerable fraction of overall timing. This is why
the test measures filling speed and prints how long clearing takes
for each size as a reference (for rte_memzone_reserve estimations
are printed).
Dmitry Kozlyuk [Thu, 3 Feb 2022 18:13:31 +0000 (20:13 +0200)]
doc: add hugepage mapping details
Hugepage mapping is a layer that EAL malloc builds upon.
There were implicit references to its details,
like mentions of segment file descriptors,
but no explicit description of its modes and operation.
Add an overview of the mechanics used on each supported OS.
Convert memory management subsections from list items
to level 4 headers: they are big and important enough.
Jie Zhou [Wed, 26 Jan 2022 05:10:44 +0000 (21:10 -0800)]
test: enable subset of tests on Windows
Enable a subset of unit tests for Windows CI
- For driver tests, driver owners should enable the corresponding tests when
enabling their driver for Windows.
- For dump tests, the tests currently hang on Windows, which requires
further investigation.
- For telemetry tests, the code has POSIX-socket-specific parts that require
replacement for Windows. This will be investigated and addressed in a
separate patch.
Signed-off-by: Jie Zhou <jizh@linux.microsoft.com>
Acked-by: Bruce Richardson <bruce.richardson@intel.com>
Jie Zhou [Wed, 26 Jan 2022 05:10:43 +0000 (21:10 -0800)]
test: replace shell script with Python
- Add python script to check if system supports hugepages
- Remove corresponding .sh script
- Replace calling of .sh with corresponding .py in meson.build
Signed-off-by: Jie Zhou <jizh@linux.microsoft.com>
Acked-by: Dmitry Kozlyuk <dmitry.kozliuk@gmail.com>
Jie Zhou [Wed, 26 Jan 2022 05:10:42 +0000 (21:10 -0800)]
test: skip unsupported tests on Windows
Skip tests which are not yet supported for Windows:
- The libraries that tests depend on are not enabled on Windows yet
- The tests can compile but have issues still under investigation
* test_func_reentrancy:
Windows EAL has no protection against repeated calls.
* test_lcores:
Execution enters an infinite loop; requires investigation.
* test_rcu_qsbr_perf:
Execution hangs on Windows, requires investigation.
Jie Zhou [Wed, 26 Jan 2022 05:10:39 +0000 (21:10 -0800)]
eal: differentiate strerror message on Windows
On Windows, strerror returns just "Unknown error" for errnum greater
than MAX_ERRNO, while Linux and FreeBSD return "Unknown error <num>",
which is what errno_autotest currently expects. Differentiate
the error string on Windows to remove a "duplicate error code" failure.
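Schematically (a sketch of the idea, not the exact EAL code):

    #include <stdio.h>
    #include <string.h>

    /*
     * On Windows, strerror() yields a bare "Unknown error" for out-of-range
     * values; append the number so every unknown errno maps to a distinct
     * string, as Linux and FreeBSD already do.
     */
    static const char *
    strerror_with_code(int errnum, char *buf, size_t len)
    {
            const char *msg = strerror(errnum);

            if (strcmp(msg, "Unknown error") == 0) {
                    snprintf(buf, len, "Unknown error %d", errnum);
                    return buf;
            }
            return msg;
    }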
Signed-off-by: Jie Zhou <jizh@linux.microsoft.com>
Acked-by: Dmitry Kozlyuk <dmitry.kozliuk@gmail.com>
Jie Zhou [Wed, 26 Jan 2022 05:10:38 +0000 (21:10 -0800)]
test/log: skip regex on Windows
DPDK logs_autotest on Windows failed at the "dynamic log types" tests.
The failures are in 2 test cases for the rte_log_set_level_regexp API,
because regular expressions are not supported on Windows in DPDK yet
and regcomp/regexec are just stubs on Windows (in regex.h).
In app/test/test_logs.c, compile out these two test cases on Windows, and
for the rte_log_set_level_pattern validation case that follows them,
differentiate the expected log level passed into the CHECK_LEVELS macro.
Now logs_autotest completes for all dynamic and static log types.
Signed-off-by: Jie Zhou <jizh@linux.microsoft.com>
Acked-by: Dmitry Kozlyuk <dmitry.kozliuk@gmail.com>
Jie Zhou [Wed, 26 Jan 2022 05:10:37 +0000 (21:10 -0800)]
test/interrupts: skip on Windows
Even though test_interrupts.c can compile on Windows, skip the interrupt
tests for now since the majority of eal_interrupt on Windows consists of
stubs. The skip will be removed once interrupts are fully enabled on Windows.
Signed-off-by: Jie Zhou <jizh@linux.microsoft.com>
Acked-by: Dmitry Kozlyuk <dmitry.kozliuk@gmail.com>
Jie Zhou [Wed, 26 Jan 2022 05:10:35 +0000 (21:10 -0800)]
test: remove POSIX-specific code
- Replace POSIX-specific code with DPDK equivalents or
conditionally disable it on Windows
- Use NUL on Windows as /dev/null for Unix
- Exclude tests not supported on Windows yet
* multi-process
* PMD performance statistics display on signal
Jie Zhou [Wed, 26 Jan 2022 05:10:34 +0000 (21:10 -0800)]
eal/windows: fix error code for not supported API
The memory_autotest UT on Windows has 2 failing cases for the EAL APIs
eal_memalloc_get_seg_fd and eal_memalloc_get_seg_fd_offset. These 2
APIs are not supported on Windows yet and should return ENOTSUP, so that
in test_memory.c these 2 ENOTSUP cases are not marked as failures,
the same as other ENOTSUP cases.
Zhihong Wang [Tue, 14 Dec 2021 03:30:16 +0000 (11:30 +0800)]
ring: fix overflow in memory size calculation
Parameters count and esize are both unsigned int, and their product can
legally exceed unsigned int and lead to a runtime access violation.
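The overflow can be avoided by widening the computation before multiplying, for example (a sketch, not the exact patch):

    #include <limits.h>
    #include <stdint.h>
    #include <sys/types.h>

    /* Compute the storage size in 64 bits so count * esize cannot wrap. */
    static ssize_t
    ring_mem_size(unsigned int count, unsigned int esize)
    {
            uint64_t sz = (uint64_t)count * esize;

            if (sz > (uint64_t)SSIZE_MAX)
                    return -1;      /* too large to represent */

            return (ssize_t)sz;
    }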
Fixes: cc4b218790f6 ("ring: support configurable element size")
Cc: stable@dpdk.org
Signed-off-by: Zhihong Wang <wangzhihong.wzh@bytedance.com>
Reviewed-by: Liang Ma <liangma@liangbit.com>
Reviewed-by: Morten Brørup <mb@smartsharesystems.com>
Acked-by: Konstantin Ananyev <konstantin.ananyev@intel.com>
Yunjian Wang [Mon, 10 Jan 2022 09:23:03 +0000 (17:23 +0800)]
ring: fix error code when creating ring
The error value returned by rte_ring_create_elem() should be a positive
integer. However, if the rte_ring_get_memsize_elem() function fails,
a negative number is returned and directly used as the return value.
As a result, external callers that check the return value fail
(for example, when called by rte_mempool_create()).
Fixes: a182620042aa ("ring: get size in memory")
Cc: stable@dpdk.org
Reported-by: Nan Zhou <zhounan14@huawei.com>
Signed-off-by: Yunjian Wang <wangyunjian@huawei.com>
Acked-by: Konstantin Ananyev <konstantin.ananyev@intel.com>
When enqueueing/dequeueing to/from the ring we try to optimize by manual
loop unrolling. The check for this optimization looks like:
if (likely(idx + n < size)) {
where 'idx' points to the first usable element (empty slot for enqueue,
data for dequeue). The correct comparison here should be '<=' instead
of '<'.
This is not a functional error since we fall back to the loop with
correct checks on indexes. Just a minor suboptimal behaviour for the
case when we want to enqueue/dequeue exactly the number of elements that
we have in the ring before wrapping to its beginning.
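In other words, the fast-path test would become (sketched on the pattern quoted above):

    if (likely(idx + n <= size)) {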
Fixes: cc4b218790f6 ("ring: support configurable element size")
Fixes: 286bd05bf70d ("ring: optimisations")
Signed-off-by: Andrzej Ostruszka <amo@semihalf.com>
Reviewed-by: Olivier Matz <olivier.matz@6wind.com>
Acked-by: Konstantin Ananyev <konstantin.ananyev@intel.com>
Reviewed-by: Morten Brørup <mb@smartsharesystems.com>
Morten Brørup [Mon, 24 Jan 2022 14:59:53 +0000 (15:59 +0100)]
mempool: test performance with constant n
"What gets measured gets done."
This patch adds mempool performance tests where the number of objects to
put and get is constant at compile time, which may significantly improve
the performance of these functions. [*]
Also, it is ensured that the array holding the object used for testing
is cache line aligned, for maximum performance.
And finally, the following entries are added to the list of tests:
- Number of kept objects: 512
- Number of objects to get and to put: The number of pointers fitting
into a cache line, i.e. 8 or 16
[*] Some example performance test (with cache) results:
Markus Theil [Fri, 3 Dec 2021 07:19:07 +0000 (08:19 +0100)]
kni: fix ioctl signature
Fix kni's ioctl signature to correctly match the kernel's
structs. This shaves off the (void*) casts and uses struct file*
instead of struct inode*. With the correct signature, control flow
integrity checkers are no longer confused at this point.
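For reference, a sketch of an ioctl handler matching the kernel's unlocked_ioctl prototype (not the actual KNI code):

    #include <linux/fs.h>
    #include <linux/module.h>

    /* Handler takes struct file *, so no (void *) casts are needed. */
    static long
    kni_demo_ioctl(struct file *file, unsigned int ioctl_num,
                   unsigned long ioctl_param)
    {
            (void)file;
            (void)ioctl_num;
            (void)ioctl_param;
            return 0;
    }

    static const struct file_operations kni_demo_fops = {
            .owner = THIS_MODULE,
            .unlocked_ioctl = kni_demo_ioctl,
    };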
Signed-off-by: Markus Theil <markus.theil@secunet.com>
Tested-by: Michael Pfeiffer <michael.pfeiffer@tu-ilmenau.de>
Acked-by: Stephen Hemminger <stephen@networkplumber.org>
Tudor Cornea [Thu, 20 Jan 2022 12:41:34 +0000 (14:41 +0200)]
kni: allow configuring thread granularity
The Kni kthreads seem to be re-scheduled at a granularity of roughly
1 millisecond right now, which seems to be insufficient for performing
tests involving a lot of control plane traffic.
Even if KNI_KTHREAD_RESCHEDULE_INTERVAL is set to 5 microseconds, it
seems that the existing code cannot reschedule at the desired granularity,
due to precision constraints of schedule_timeout_interruptible().
In our use case, we leverage the Linux Kernel for control plane, and
it is not uncommon to have 60K - 100K pps for some signaling protocols.
Since we are not in atomic context, the usleep_range() function seems to be
more appropriate for being able to introduce smaller controlled delays,
in the range of 5-10 microseconds. Upon reading the existing code, it would
seem that this was the original intent. Adding sub-millisecond delays
seems unfeasible with a call to schedule_timeout_interruptible().
KNI_KTHREAD_RESCHEDULE_INTERVAL 5 /* us */
schedule_timeout_interruptible(
usecs_to_jiffies(KNI_KTHREAD_RESCHEDULE_INTERVAL));
Below, we attempted a brief comparison between the existing implementation,
which uses schedule_timeout_interruptible() and usleep_range().
We attempt to measure the CPU usage, and RTT between two Kni interfaces,
which are created on top of vmxnet3 adapters, connected by a vSwitch.
insmod rte_kni.ko kthread_mode=single carrier=on
schedule_timeout_interruptible(usecs_to_jiffies(5))
kni_single CPU Usage: 2-4 %
[root@localhost ~]# ping 1.1.1.2 -I eth1
PING 1.1.1.2 (1.1.1.2) from 1.1.1.1 eth1: 56(84) bytes of data.
64 bytes from 1.1.1.2: icmp_seq=1 ttl=64 time=2.70 ms
64 bytes from 1.1.1.2: icmp_seq=2 ttl=64 time=1.00 ms
64 bytes from 1.1.1.2: icmp_seq=3 ttl=64 time=1.99 ms
64 bytes from 1.1.1.2: icmp_seq=4 ttl=64 time=0.985 ms
64 bytes from 1.1.1.2: icmp_seq=5 ttl=64 time=1.00 ms
usleep_range(5, 10)
kni_single CPU usage: 50%
64 bytes from 1.1.1.2: icmp_seq=1 ttl=64 time=0.338 ms
64 bytes from 1.1.1.2: icmp_seq=2 ttl=64 time=0.150 ms
64 bytes from 1.1.1.2: icmp_seq=3 ttl=64 time=0.123 ms
64 bytes from 1.1.1.2: icmp_seq=4 ttl=64 time=0.139 ms
64 bytes from 1.1.1.2: icmp_seq=5 ttl=64 time=0.159 ms
usleep_range(20, 50)
kni_single CPU usage: 24%
64 bytes from 1.1.1.2: icmp_seq=1 ttl=64 time=0.202 ms
64 bytes from 1.1.1.2: icmp_seq=2 ttl=64 time=0.170 ms
64 bytes from 1.1.1.2: icmp_seq=3 ttl=64 time=0.171 ms
64 bytes from 1.1.1.2: icmp_seq=4 ttl=64 time=0.248 ms
64 bytes from 1.1.1.2: icmp_seq=5 ttl=64 time=0.185 ms
usleep_range(50, 100)
kni_single CPU usage: 13%
64 bytes from 1.1.1.2: icmp_seq=1 ttl=64 time=0.537 ms
64 bytes from 1.1.1.2: icmp_seq=2 ttl=64 time=0.257 ms
64 bytes from 1.1.1.2: icmp_seq=3 ttl=64 time=0.231 ms
64 bytes from 1.1.1.2: icmp_seq=4 ttl=64 time=0.143 ms
64 bytes from 1.1.1.2: icmp_seq=5 ttl=64 time=0.200 ms
usleep_range(100, 200)
kni_single CPU usage: 7%
64 bytes from 1.1.1.2: icmp_seq=1 ttl=64 time=0.716 ms
64 bytes from 1.1.1.2: icmp_seq=2 ttl=64 time=0.167 ms
64 bytes from 1.1.1.2: icmp_seq=3 ttl=64 time=0.459 ms
64 bytes from 1.1.1.2: icmp_seq=4 ttl=64 time=0.455 ms
64 bytes from 1.1.1.2: icmp_seq=5 ttl=64 time=0.252 ms
usleep_range(1000, 1100)
kni_single CPU usage: 2%
64 bytes from 1.1.1.2: icmp_seq=1 ttl=64 time=2.22 ms
64 bytes from 1.1.1.2: icmp_seq=2 ttl=64 time=1.17 ms
64 bytes from 1.1.1.2: icmp_seq=3 ttl=64 time=1.17 ms
64 bytes from 1.1.1.2: icmp_seq=4 ttl=64 time=1.17 ms
64 bytes from 1.1.1.2: icmp_seq=5 ttl=64 time=1.15 ms
Upon testing, usleep_range(1000, 1100) seems roughly equivalent in
latency and cpu usage to the variant with schedule_timeout_interruptible(),
while usleep_range(100, 200) seems to give a decent tradeoff between
latency and cpu usage, while allowing users to tweak the limits for
improved precision if they have such use cases.
Disabling RTE_KNI_PREEMPT_DEFAULT interestingly seems to lead to a
softlockup on my kernel.
Kernel panic - not syncing: softlockup: hung tasks
CPU: 0 PID: 1226 Comm: kni_single Tainted: G W O 3.10 #1
<IRQ> [<ffffffff814f84de>] dump_stack+0x19/0x1b
[<ffffffff814f7891>] panic+0xcd/0x1e0
[<ffffffff810993b0>] watchdog_timer_fn+0x160/0x160
[<ffffffff810644b2>] __run_hrtimer.isra.4+0x42/0xd0
[<ffffffff81064b57>] hrtimer_interrupt+0xe7/0x1f0
[<ffffffff8102cd57>] smp_apic_timer_interrupt+0x67/0xa0
[<ffffffff8150321d>] apic_timer_interrupt+0x6d/0x80
Bruce Richardson [Mon, 24 Jan 2022 17:49:59 +0000 (17:49 +0000)]
build: remove deprecated Meson functions
Starting in meson 0.56, the functions meson.source_root() and
meson.build_root() are deprecated and to be replaced by the [more
descriptive] functions: project_source_root()/global_source_root() and
project_build_root()/global_build_root(). Unfortunately, these new
replacement functions were only added in 0.56 release too, so to use
them we would need version checks for old/new functions to remove the
deprecation warnings.
However, the functions "current_build_dir()" and "current_source_dir()"
remain unaffected by all this, so we can bypass the versioning problem
by saving off these values to "dpdk_source_root" and "dpdk_build_root"
in the top-level meson.build file.
Bugzilla ID: 926
Cc: stable@dpdk.org
Signed-off-by: Bruce Richardson <bruce.richardson@intel.com>
Tested-by: Jerin Jacob <jerinj@marvell.com>
Bruce Richardson [Fri, 21 Jan 2022 16:12:30 +0000 (16:12 +0000)]
build: fix warning about using -Wextra flag
Each build, meson would issue a warning reporting that the
"warning_level" setting should be used in place of adding -Wextra
directly to our build commands. Testing with meson 0.61 shows that the
only difference for gcc and clang builds between warning levels 1 and
2 is the addition of -Wextra, so we can remove the warning by deleting
our explicit setting of -Wextra and changing the build default to
warning_level 2.
Fixes: 524a0d5d66b9 ("build: enable extra warnings with meson")
Signed-off-by: Bruce Richardson <bruce.richardson@intel.com>
Acked-by: Luca Boccassi <bluca@debian.org>
Bruce Richardson [Thu, 20 Jan 2022 18:06:39 +0000 (18:06 +0000)]
build: fix warnings when running external commands
Meson 0.61.1 is giving warnings that the calls to run_command do not
always explicitly specify if the result is to be checked or not, i.e.
there is a missing "check" parameter. This is because the default
behaviour without the parameter is due to change in the future.
We can fix these warnings by explicitly adding into each call whether
the result should be checked by meson or not. This patch therefore
adds in "check: false" to each run_command call where the result is
being checked by the DPDK meson.build code afterwards, and adds in
"check: true" to any calls where the result is currently unchecked.
Bugzilla ID: 921
Cc: stable@dpdk.org
Reported-by: Jerin Jacob <jerinj@marvell.com>
Signed-off-by: Bruce Richardson <bruce.richardson@intel.com>
Tested-by: Jerin Jacob <jerinj@marvell.com>
Feifei Wang [Thu, 27 Jan 2022 07:40:01 +0000 (15:40 +0800)]
net/i40e: remove redundant reset operation
For free buffer operation in i40e vector path, it is unnecessary to
store 'NULL' into txep.mbuf. This is because when putting mbuf into Tx
queue, tx_tail is the sentinel, and when doing tx_free, tx_next_dd is
the sentinel. In none of these paths is mbuf==NULL used as a check
condition. Thus resetting mbuf is unnecessary and can be omitted.
Signed-off-by: Feifei Wang <feifei.wang2@arm.com>
Reviewed-by: Ruifeng Wang <ruifeng.wang@arm.com>
Acked-by: Qi Zhang <qi.z.zhang@intel.com>
Xiaoyu Min [Tue, 18 Jan 2022 11:38:50 +0000 (19:38 +0800)]
net/mlx5: reject jump to root table
Currently, a root table as destination is not supported.
A jump action that would ultimately be translated to the underlying root
table in rdma-core should be rejected.
Fixes: f78f747f41d0 ("net/mlx5: allow jump to group lower than current")
Cc: stable@dpdk.org
Signed-off-by: Xiaoyu Min <jackmin@nvidia.com>
Acked-by: Viacheslav Ovsiienko <viacheslavo@nvidia.com>
Bing Zhao [Mon, 17 Jan 2022 17:49:14 +0000 (19:49 +0200)]
common/mlx5: fix probing failure code
While probing a device with an unsupported class, the probe should fail
because no appropriate driver was found. After traversing all the drivers,
an error value should be returned in this case.
In the previous implementation, zero, indicating probing success,
was wrongly returned.
Raja Zidane [Sun, 16 Jan 2022 15:23:47 +0000 (15:23 +0000)]
net/mlx5: fix mark enabling for Rx
To optimize the datapath, the mlx5 PMD checked for the mark action on flow
creation, flagged possible destination rxqs (through queue/RSS actions),
and then enabled the mark action logic only for the flagged rxqs.
The mark action did not work if no queue/RSS action was in the same flow,
even when the user used multi-group logic to manage the flows.
So, if the mark action is performed in group X and the packet is moved to
group Y > X before being forwarded to the Rx queues, the software did not
get the mark ID into the mbuf.
Flag the Rx datapath to report the mark action for any queue when the
driver detects the first mark action after the dev_start operation.
Fixes: 8e61555657b2 ("net/mlx5: fix shared RSS and mark actions combination")
Cc: stable@dpdk.org
Signed-off-by: Raja Zidane <rzidane@nvidia.com>
Acked-by: Matan Azrad <matan@nvidia.com>
Dmitry Kozlyuk [Fri, 14 Jan 2022 10:52:17 +0000 (12:52 +0200)]
common/mlx5: fix MR lookup for non-contiguous mempool
Memory region (MR) lookup by address inside mempool MRs
was not accounting for the upper bound of an MR.
For mempools covered by multiple MRs this could return
a wrong MR LKey, typically resulting in an unrecoverable
TxQ failure:
mlx5_net: Cannot change Tx QP state to INIT Invalid argument
Corresponding message from /var/log/dpdk_mlx5_port_X_txq_Y_index_Z*:
This is likely to happen with --legacy-mem and IOVA-as-PA,
because EAL intentionally maps pages at non-adjacent PA
to non-adjacent VA in this mode, and MLX5 PMD works with VA.
Maxime Coquelin [Wed, 26 Jan 2022 09:55:07 +0000 (10:55 +0100)]
vhost: improve virtio-net layer logs
This patch standardizes logging done in Virtio-net, so that
the Vhost-user socket path is always prepended to the logs.
It will ease log analysis when multiple Vhost-user ports
are in use.
Signed-off-by: Maxime Coquelin <maxime.coquelin@redhat.com>
Reviewed-by: Chenbo Xia <chenbo.xia@intel.com>
Reviewed-by: David Marchand <david.marchand@redhat.com>
Maxime Coquelin [Wed, 26 Jan 2022 09:55:06 +0000 (10:55 +0100)]
vhost: improve socket layer logs
This patch adds the Vhost socket path whenever possible in
order to make debugging possible when multiple Vhost
devices are in use. Some vhost-user layer functions are
modified to pass the device path down to the socket layer.
Signed-off-by: Maxime Coquelin <maxime.coquelin@redhat.com>
Reviewed-by: Chenbo Xia <chenbo.xia@intel.com>
Reviewed-by: David Marchand <david.marchand@redhat.com>
Harold Huang [Thu, 23 Dec 2021 04:42:37 +0000 (12:42 +0800)]
net/virtio-user: fix resource leak on probing failure
When eth_virtio_dev_init fails, the registered virtio-user memory event
callback is not released and the tap device created by the backend is not
destroyed. This leaves residual tap devices on the host, and creating a
new vdev can fail because the new virtio_user_dev may use the same address
pointer, and registering a memory event callback for the same address is
not allowed.
Fixes: ca8326a94365 ("net/virtio_user: fix error management during init")
Cc: stable@dpdk.org
Signed-off-by: Harold Huang <baymaxhuang@gmail.com>
Reviewed-by: Chenbo Xia <chenbo.xia@intel.com>
Matan Azrad [Mon, 22 Nov 2021 13:12:35 +0000 (15:12 +0200)]
vdpa/mlx5: workaround queue stop with traffic
When the event thread polls traffic while a virtq is stopping, the FW
loses synchronization of the virtq indexes.
This causes a live migration (LM) failure when synchronizing the HOST
indexes with the GUEST indexes.
Unset the event thread before the queue stop in the LM process.
Fixes: 31b9c29c86af ("vdpa/mlx5: support close and config operations")
Cc: stable@dpdk.org
Signed-off-by: Matan Azrad <matan@nvidia.com>
Acked-by: Xueming Li <xuemingl@nvidia.com>
Reviewed-by: Maxime Coquelin <maxime.coquelin@redhat.com>
Selwin Sebastian [Tue, 25 Jan 2022 12:17:47 +0000 (17:47 +0530)]
net/axgbe: alter port speed bit range
Newer generation hardware uses slightly different port speed bit widths,
so alter the existing port speed bit range to extend support to the newer
generation hardware while maintaining backward compatibility with the
older generation hardware.
The previously reserved bits are now used, which requires adjusting the
BIT values, e.g.:
Selwin Sebastian [Tue, 25 Jan 2022 12:17:45 +0000 (17:47 +0530)]
net/axgbe: reset PHY Rx when mailbox command timeout
Sometimes mailbox commands time out when the RX data path becomes
unresponsive. This prevents the submission of new mailbox commands to
DXIO. This patch identifies the timeout and resets the RX data path so
that the next message can be submitted properly.
Signed-off-by: Selwin Sebastian <selwin.sebastian@amd.com>
Acked-by: Chandubabu Namburu <chandu@amd.com>
Selwin Sebastian [Tue, 25 Jan 2022 12:17:44 +0000 (17:47 +0530)]
net/axgbe: simplify rate change mailbox interface
Simplify and centralize the mailbox command rate change interface by
having a single function perform the writes to the mailbox registers
to issue the request.
Signed-off-by: Selwin Sebastian <selwin.sebastian@amd.com>
Acked-by: Chandubabu Namburu <chandu@amd.com>
Selwin Sebastian [Tue, 25 Jan 2022 12:17:43 +0000 (17:47 +0530)]
net/axgbe: toggle PLL settings during rate change
For each rate change command submission, the FW has to do a phy
power off sequence internally. For this to happen correctly, the
PLL re-initialization control setting has to be turned off before
sending mailbox commands and re-enabled once the command submission
is complete. Without the PLL control setting, the link up takes
longer time in a fixed phy configuration.
Signed-off-by: Selwin Sebastian <selwin.sebastian@amd.com>
Acked-by: Chandubabu Namburu <chandu@amd.com>
Selwin Sebastian [Tue, 25 Jan 2022 12:17:42 +0000 (17:47 +0530)]
net/axgbe: attempt always link training in KR mode
Link training is always attempted when in KR mode, but the code is
structured to check if link training has been enabled before attempting
to perform it. Since that check will always be true, simplify the code
to always enable and start link training during KR auto-negotiation.
Signed-off-by: Selwin Sebastian <selwin.sebastian@amd.com>
Acked-by: Chandubabu Namburu <chandu@amd.com>