git.droids-corp.org - dpdk.git/log

vhost: introduce API to start a specific driver

We used to use rte_vhost_driver_session_start() to trigger the vhost-user
session. It takes no argument, thus it's a global trigger. And it could
be problematic.

The issue is, currently, rte_vhost_driver_register(path, flags) actually
tries to put it into the session loop (by fdset_add). However, it needs
a set of APIs to set a vhost-user driver properly:
  * rte_vhost_driver_register(path, flags);
  * rte_vhost_driver_set_features(path, features);
  * rte_vhost_driver_callback_register(path, vhost_device_ops);

If a new vhost-user driver is registered after the trigger (think OVS-DPDK
that could add a port dynamically from cmdline), the current code will
effectively starts the session for the new driver just after the first
API rte_vhost_driver_register() is invoked, leaving later calls taking
no effect at all.

To handle the case properly, this patch introduce a new API,
rte_vhost_driver_start(path), to trigger a specific vhost-user driver.
To do that, the rte_vhost_driver_register(path, flags) is simplified
to create the socket only and let rte_vhost_driver_start(path) to
actually put it into the session loop.

Meanwhile, the rte_vhost_driver_session_start is removed: we could hide
the session thread internally (create the thread if it has not been
created). This would also simplify the application.

NOTE: the API order in prog guide is slightly adjusted for showing the
correct invoke order.

Signed-off-by: Yuanhan Liu <yuanhan.liu@linux.intel.com>
Reviewed-by: Maxime Coquelin <maxime.coquelin@redhat.com>

vhost: export APIs for live migration support

Export few APIs for the vhost-user driver to log the guest memory writes,
which is a must for live migration support.

This patch basically moves vhost_log_write() and vhost_log_used_vring()
into vhost.h and then add an wrapper (the public API) to them.

Signed-off-by: Yuanhan Liu <yuanhan.liu@linux.intel.com>
Reviewed-by: Maxime Coquelin <maxime.coquelin@redhat.com>

vhost: add features changed callback

Features could be changed after the feature negotiation. For example,
VHOST_F_LOG_ALL will be set/cleared at the start/end of live migration,
respecitively. Thus, we need a new callback to inform the application
on such change.

Signed-off-by: Yuanhan Liu <yuanhan.liu@linux.intel.com>
Reviewed-by: Maxime Coquelin <maxime.coquelin@redhat.com>

vhost: rename virtio-net to vhost

Rename "virtio-net" to "vhost" in the API comments and vhost prog guide.

Signed-off-by: Yuanhan Liu <yuanhan.liu@linux.intel.com>
Reviewed-by: Maxime Coquelin <maxime.coquelin@redhat.com>

vhost: rename device ops struct

rename "virtio_net_device_ops" to "vhost_device_ops", to not let it
be virtio-net specific.

Signed-off-by: Yuanhan Liu <yuanhan.liu@linux.intel.com>
Reviewed-by: Maxime Coquelin <maxime.coquelin@redhat.com>

vhost: do not include net specific headers

Include it internally, at vhost.h.

Signed-off-by: Yuanhan Liu <yuanhan.liu@linux.intel.com>
Reviewed-by: Maxime Coquelin <maxime.coquelin@redhat.com>

vhost: drop the Rx and Tx queue macro

They are virtio-net specific and should be defined inside the virtio-net
driver.

Signed-off-by: Yuanhan Liu <yuanhan.liu@linux.intel.com>
Reviewed-by: Maxime Coquelin <maxime.coquelin@redhat.com>

vhost: move the device ready check at proper place

Currently, we check vq->desc, vq->kickfd and vq->callfd to know whether
a virtio device is ready or not. However, we only do it when handling
SET_VRING_KICK message, which could be wrong if a vhost-user frontend
send SET_VRING_KICK first and SET_VRING_CALL later.

To work for all possible vhost-user frontend implementations, we could
move the ready check at the end of vhost-user message handler.

Meanwhile, since we do the check more often than before, the "virtio
not ready" message is dropped, to not flood the screen.

Signed-off-by: Yuanhan Liu <yuanhan.liu@linux.intel.com>
Reviewed-by: Maxime Coquelin <maxime.coquelin@redhat.com>

vhost: export the number of vrings

We used to use rte_vhost_get_queue_num() for telling how many vrings.
However, the return value is the number of "queue pairs", which is
very virtio-net specific. To make it generic, we should return the
number of vrings instead, and let the driver do the proper translation.
Say, virtio-net driver could turn it to the number of queue pairs by
dividing 2.

Meanwhile, mark rte_vhost_get_queue_num as deprecated.

Signed-off-by: Yuanhan Liu <yuanhan.liu@linux.intel.com>
Reviewed-by: Maxime Coquelin <maxime.coquelin@redhat.com>

vhost: turn queue pair to vring

The queue pair is very virtio-net specific, other devices don't have
such concept. To make it generic, we should log the number of vrings
instead of the number of queue pairs.

This patch just does a simple convert, a later patch would export the
number of vrings to applications.

Signed-off-by: Yuanhan Liu <yuanhan.liu@linux.intel.com>
Reviewed-by: Maxime Coquelin <maxime.coquelin@redhat.com>

vhost: export API to translate gpa to vva

Signed-off-by: Yuanhan Liu <yuanhan.liu@linux.intel.com>
Reviewed-by: Maxime Coquelin <maxime.coquelin@redhat.com>

vhost: export vhost vring info

Signed-off-by: Yuanhan Liu <yuanhan.liu@linux.intel.com>
Reviewed-by: Maxime Coquelin <maxime.coquelin@redhat.com>

vhost: introduce API to fetch negotiated features

Signed-off-by: Yuanhan Liu <yuanhan.liu@linux.intel.com>
Reviewed-by: Maxime Coquelin <maxime.coquelin@redhat.com>

vhost: export guest memory regions

Some vhost-user driver may need this info to setup its own page tables
for GPA (guest physical addr) to HPA (host physical addr) translation.
SPDK (Storage Performance Development Kit) is one example.

Besides, by exporting this memory info, we could also export the
gpa_to_vva() as an inline function, which helps for performance.
Otherwise, it has to be referenced indirectly by a "vid".

Signed-off-by: Yuanhan Liu <yuanhan.liu@linux.intel.com>
Reviewed-by: Maxime Coquelin <maxime.coquelin@redhat.com>

vhost: make notify ops per vhost driver

Assume there is an application both support vhost-user net and
vhost-user scsi, the callback should be different. Making notify
ops per vhost driver allow application define different set of
callbacks for different driver.

Signed-off-by: Yuanhan Liu <yuanhan.liu@linux.intel.com>
Reviewed-by: Maxime Coquelin <maxime.coquelin@redhat.com>

vhost: use new APIs to handle features

Signed-off-by: Yuanhan Liu <yuanhan.liu@linux.intel.com>
Reviewed-by: Maxime Coquelin <maxime.coquelin@redhat.com>

net/vhost: remove feature related APIs

The rte_eth_vhost_feature_disable/enable/get APIs are just a wrapper of
rte_vhost_feature_disable/enable/get. However, the later are going to
be refactored; it's going to take an extra parameter (socket_file path),
to let it be per-device.

Instead of changing those vhost-pmd APIs to adapt to the new vhost APIs,
we could simply remove them, and let vdev to serve this purpose. After
all, vdev options is better for disabling/enabling some features.

Signed-off-by: Yuanhan Liu <yuanhan.liu@linux.intel.com>
Acked-by: Maxime Coquelin <maxime.coquelin@redhat.com>

vhost: introduce driver features related APIs

Introduce few APIs to set/get/enable/disable driver features.

Signed-off-by: Yuanhan Liu <yuanhan.liu@linux.intel.com>
Reviewed-by: Maxime Coquelin <maxime.coquelin@redhat.com>

vhost: fix fd leaks for vhost-user server mode

A vhost-user server socket could have many connections, thus many connfd.
However, we currently just use one single int var to store it. Meaning,
it will get overwritten every time a new connection is created.

While this will not create fatal issue as it sounds (since the correct
connfd is closured to the event loop thread by fdset_add), it may cause
fd leaks if a user invokes rte_vhost_driver_unregister before shutting
down all connections: it just closes the recent connfd.

A simple example that should be able to reproduce this leaks issues is,
del the ovs vhost-user port while the connected VMs are still alive. (Note
that it's suggested to use one socket for one VM, which makes the issue
not that fatal as it sounds again).

Since we already use a struct "vhost_user_connection" to track all info
about one connection, it's obvious that we should put the connfd there.
Then we could build a connection list inside the vhost_user_socket struct,
to represent all connections belong that socket file.

Fixes: 164fd396788d ("vhost: fix unregistering in client mode")
Cc: stable@dpdk.org
Cc: Ilya Maximets <i.maximets@samsung.com>
Signed-off-by: Yuanhan Liu <yuanhan.liu@linux.intel.com>

net/virtio-user: support LSC

So far, virtio-user with vhost-user as the backend can only support
client mode. So when vhost user backend is down, i.e., unix socket
connection is broken, the connection cannot be re-connected. We will
forcely set the link state to be down.

Note: virtio-user with vhost-kernel as the backend still cannot
support lsc now as we fail to find a way to monitor the backend, tap
device, up/down events.

Signed-off-by: Jianfeng Tan <jianfeng.tan@intel.com>
Acked-by: Yuanhan Liu <yuanhan.liu@linux.intel.com>

net/virtio-user: support to report net status

Originally, we did not report support of VIRTIO_NET_F_STATUS.
This feature is not reported by vhost backend, instead, it
is added/removed by QEMU in virtio PCI case.

We report the support of this feature so that following patch
will depend on this feature to enable LSC interrupt.

Signed-off-by: Jianfeng Tan <jianfeng.tan@intel.com>
Acked-by: Yuanhan Liu <yuanhan.liu@linux.intel.com>

net/virtio-user: support Rx interrupt

For rxq interrupt, the device (backend driver) will notify driver
through callfd. Each virtqueue has a callfd. To keep compatible
with the existing framework, we will give these callfds to
interrupt thread for listening for interrupts.

Before that, we need to allocate intr_handle, and fill callfds
into it so that driver can use it to set up rxq interrupt mode.

Signed-off-by: Jianfeng Tan <jianfeng.tan@intel.com>

net/virtio-user: move eventfd open/close into init/uninit

Originally, eventfd is opened when initializing each vq; and gets closded
in virtio_user_stop_device().

To make it possible to initialize intr_handle struct in init() in following
patch, we put the open() of all eventfds into init(); and put the close()
into uninit().

Suggested-by: Yuanhan Liu <yuanhan.liu@linux.intel.com>
Signed-off-by: Jianfeng Tan <jianfeng.tan@intel.com>
Acked-by: Yuanhan Liu <yuanhan.liu@linux.intel.com>

eal/linux: add interrupt type for vdev

A new interrupt type, RTE_INTR_HANDLE_VDEV, is added to support lsc and rxq
interrupt for vdev.

For lsc interrupt, except from original EPOLLIN events, we also listen for
socket peer closed connection event (EPOLLRDHUP and EPOLLHUP).

For rxq interrupt, add a precondition to avoid invoking any vfio and uio
code.

For intr_handle initialization, let each vdev driver to do that.

Signed-off-by: Jianfeng Tan <jianfeng.tan@intel.com>

net/virtio-user: support changing tap interface name

This patch adds a new option 'iface' to change the interface name of
tap device with vhost-kernel as backend.

Signed-off-by: Wenfeng Liu <liuwf@arraynetworks.com.cn>
Reviewed-by: Jianfeng Tan <jianfeng.tan@intel.com>
Acked-by: Yuanhan Liu <yuanhan.liu@linux.intel.com>

vhost: fix false sharing

The broadcast_rarp field in the virtio_net struct is checked in the
dequeue datapath regardless of whether descriptors are available or not.

As it is checked with cmpset leading to a write, false sharing on the
virtio_net struct can happen between enqueue and dequeue datapaths
regardless of whether a RARP is requested. In OVS, the issue can cause
a uni-directional performance drop of up to 15%.

Fix that by only performing the cmpset if a read of broadcast_rarp
indicates that the cmpset is likely to succeed.

Fixes: a66bcad32240 ("vhost: arrange struct fields for better cache sharing")
Cc: stable@dpdk.org
Signed-off-by: Kevin Traynor <ktraynor@redhat.com>
Reviewed-by: Maxime Coquelin <maxime.coquelin@redhat.com>
Acked-by: Yuanhan Liu <yuanhan.liu@linux.intel.com>

doc: fix parameter of virtio-user for container

Update the "Virtio_user for Container Networking" doc, add the
"--file-prefix" option to testpmd in host and container to avoid
hugepage config file conflict.

Fixes: 50665deebda0 ("doc: add guide to use virtio-user for container networking")
Cc: stable@dpdk.org
Signed-off-by: Yong Wang <wang.yong19@zte.com.cn>
Acked-by: John McNamara <john.mcnamara@intel.com>
Acked-by: Yuanhan Liu <yuanhan.liu@linux.intel.com>

app/testpmd: print MTU in port info

This patch adds MTU display to "show port info" command.

Signed-off-by: Maxime Coquelin <maxime.coquelin@redhat.com>
Acked-by: Yuanhan Liu <yuanhan.liu@linux.intel.com>

net/virtio: support MTU feature

This patch implements support for the Virtio MTU feature.
When negotiated, the host shares its maximum supported MTU,
which is used as initial MTU and as maximum MTU the application
can set.

Signed-off-by: Maxime Coquelin <maxime.coquelin@redhat.com>
Acked-by: Yuanhan Liu <yuanhan.liu@linux.intel.com>

net/vhost: set MTU

This patch adds a call to rte_vhost_mtu_get() at device creation
time to fill device's MTU property when available.

This makes the MTU value defined in QEMU cmdline accessible to the
application by calling rte_eth_dev_get_mtu().

Signed-off-by: Maxime Coquelin <maxime.coquelin@redhat.com>
Acked-by: Yuanhan Liu <yuanhan.liu@linux.intel.com>

vhost: add API to get MTU value

This patch implements the function for the application to
get the MTU value.

rte_vhost_get_mtu() fills the mtu parameter with the MTU value
set in QEMU if VIRTIO_NET_F_MTU has been negotiated and returns 0,
-ENOTSUP otherwise.

The function returns -EAGAIN if Virtio feature negotiation
didn't happened yet.

Signed-off-by: Maxime Coquelin <maxime.coquelin@redhat.com>
Acked-by: Yuanhan Liu <yuanhan.liu@linux.intel.com>

vhost: add new ready status flag

This patch adds a new status flag indicating the Virtio device
is ready to operate.

This is required to be able to call rte_vhost_mtu_get() in the
.new_device() callback, as rte_vhost_mtu_get needs that the
negotiation is done, but it is too early to rely on running status
flag, which is set just after .new_device() returns.

Signed-off-by: Maxime Coquelin <maxime.coquelin@redhat.com>
Acked-by: Yuanhan Liu <yuanhan.liu@linux.intel.com>

vhost: support MTU protocol feature

This patch implements the vhost-user MTU protocol feature support.
When VIRTIO_NET_F_MTU is negotiated, QEMU notifies the vhost-user
backend with the configured MTU if dedicated protocol feature is
supported.

The value can be used by the application to ensure consistency with
value set by the user.

Signed-off-by: Maxime Coquelin <maxime.coquelin@redhat.com>
Acked-by: Yuanhan Liu <yuanhan.liu@linux.intel.com>

vhost: enable virtio MTU feature

This patch enables the new VIRTIO_NET_F_MTU feature,
which makes possible for the host to advise the guest
with its maximum supported MTU.

MTU value is set via QEMU parameters, either via Libvirt XML, or
directly in virtio-net device command line arguments.

Signed-off-by: Maxime Coquelin <maxime.coquelin@redhat.com>
Acked-by: Yuanhan Liu <yuanhan.liu@linux.intel.com>

vhost: remove a hack on queue allocation

We used to allocate queues based on the index from SET_VRING_CALL
request: if corresponding queue hasn't been allocated, allocate it.

Though it's pratically right (it's the first per-vring request we
will get from QEMU for vhost-user negotiation), but it's not technically
right: it's not documented in the vhost-user spec that it will always
be the first per-vring request. For example, SET_VRING_ADDR could also
be the first per-vring request.

Thus, we should not depend the SET_VRING_CALL on queue allocation.
Instead, we could catch all the per-vring messages at the entrance of
request handler, and allocate one if it hasn't been allocated before.

By that, we could remove a hack.

Signed-off-by: Yuanhan Liu <yuanhan.liu@linux.intel.com>

vhost: fix max queues

0x8000 is the max virito-net queue pairs the virtio 1.0 spec claims to
support. While for vhost-user, it's a different story: the max vring
index could be passed by the vhost-user spec is 0xff, masked by the
VHOST_USER_VRING_IDX_MASK.

That said, the max queue pairs could vhost-user could supported is 0x80.
If user are asking more, I think the vhost-user need be extended.

Fixes: b09b198bfb5c ("vhost-user: announce queue number in message")
Cc: stable@dpdk.org
Signed-off-by: Yuanhan Liu <yuanhan.liu@linux.intel.com>

vhost: fix multiple queue not enabled for old kernels

Some macros (say VIRTIO_NET_F_MQ) are needed for enabling multiple queue,
however they are introduced since kernel v3.8, meaning build error happens
if we build DPDK vhost on those platforms.

71dfdbe66a66 ("vhost: fix build with kernel < 3.8") meant to fix it, but
in a wrong way: it completely disables the MQ features for those kernels.
However, the MQ feature doesn't depend on the kernel at all (except the
macros dependency stated above), that we could still enable the MQ feature
even the host kernel has no such support.

The right fix is to define the macro if it's not defined.

Fixes: 71dfdbe66a66 ("vhost: fix build with kernel < 3.8")
Cc: stable@dpdk.org
Signed-off-by: Yuanhan Liu <yuanhan.liu@linux.intel.com>
Reviewed-by: Maxime Coquelin <maxime.coquelin@redhat.com>

net/virtio: disable LSC interrupt if MSIX not enabled

The link state change interrupt can only be configured if the virtio device
supports MSIX. Prior to this change the writing of the vector to the PCI
config space was causing it to overwrite the initial part of the MAC
address since the MSIX vector is not in the config space and is occupied by
the MAC address.

This has been reproduced in Virtual Box (v5.0.30.r112061) in Windows 7.

Fixes: 954ea11540b6 ("virtio: do not report link state feature unless available")
Cc: stable@dpdk.org
Signed-off-by: Matt Peters <matt.peters@windriver.com>
Signed-off-by: Allain Legacy <allain.legacy@windriver.com>

net/virtio-user: fix overflow

virtio-user limits the qeueue number to 8 but provides no limit
check against the queue number input from user. If a bigger queue
number (> 8) is given, there is an overflow issue. Doing a sanity
check could avoid it.

Fixes: 37a7eb2ae816 ("net/virtio-user: add device emulation layer")
Cc: stable@dpdk.org
Signed-off-by: Wenfeng Liu <liuwf@arraynetworks.com.cn>
Acked-by: Yuanhan Liu <yuanhan.liu@linux.intel.com>

net/virtio-user: fix tapfds close

The valid tap file descriptor range should be equal or greater
than zero instead of non-zero

Fixes: e3b434818bbb ("net/virtio-user: support kernel vhost")
Cc: stable@dpdk.org
Signed-off-by: Wenfeng Liu <liuwf@arraynetworks.com.cn>
Acked-by: Yuanhan Liu <yuanhan.liu@linux.intel.com>

vhost: change log levels in client mode

Inability to connect to socket is a normal situation
in client mode because, in common case, server isn't
started yet. RTE_LOG_WARNING should be suitable for
the case of some unusual errors.
Message about reconnection is not an error at all.

Fixes: e623e0c6d8a5 ("vhost: add reconnect ability")
Cc: stable@dpdk.org
Signed-off-by: Ilya Maximets <i.maximets@samsung.com>
Reviewed-by: Maxime Coquelin <maxime.coquelin@redhat.com>
Acked-by: Yuanhan Liu <yuanhan.liu@linux.intel.com>

net/vhost: remove include of numaif.h

This patch revmoves include of the numaif.h header from rte_eth_vhost.c.
Commit 586e39001317 ("vhost: export numa node") moved the invocation of
get_mempolicy() from rte_eth_vhost.c to librte_vhost. So there is no need
to include the numaif.h header anymore in rte_eth_vhost.c.

Signed-off-by: Rami Rosen <rami.rosen@intel.com>
Acked-by: Yuanhan Liu <yuanhan.liu@linux.intel.com>

net/virtio: remove the redundant computing

The minor change aims to remove the redundant computing and make
it easier to understand the code.

Signed-off-by: Zhiyong Yang <zhiyong.yang@intel.com>

vhost: try to shrink pfdset when fdset_add fails

fdset_add increments pfdset->num, but fdset_del doesn't decrement
pfdset->num, so if we call fdset_add then fdset_del in a loop without
calling fdset_shrink, we can easily exceed MAX_FDS with only a few
number of fds used.

So my solution is simply to call fdset_shrink in fdset_add when it
exceeds MAX_FDS.

Because fdset_shrink and fdset_add locks pfdset->fd_mutex we can't
call fdset_shrink inside fdset_add because that would cause a dead
lock, so this patch split fdset_shrink in two, fdset_shrink and
fdset_shrink_nolock.

Fixes: 59317cef249c ("vhost: allow many vhost-user ports")
Cc: stable@dpdk.org
Signed-off-by: Matthias Gatto <matthias.gatto@outscale.com>

net/sfc/base: fix out of bounds read in VIs allocation

Coverity issue: 1349662
Fixes: e7cd430c864f ("net/sfc/base: import SFN7xxx family support")
Cc: stable@dpdk.org
Signed-off-by: Andy Moreton <amoreton@solarflare.com>
Signed-off-by: Andrew Rybchenko <arybchenko@solarflare.com>

net/sfc/base: fix potential buffer overflow in Tx queue init

Improve error checking to avoid a caller overflowing the MCDI
request buffer if the requested TXQ size was excessively large.

Coverity issue: 1305527
Fixes: e7cd430c864f ("net/sfc/base: import SFN7xxx family support")
CC: stable@dpdk.org
Signed-off-by: Andy Moreton <amoreton@solarflare.com>
Signed-off-by: Andrew Rybchenko <arybchenko@solarflare.com>

net/sfc/base: fix failure path in EF10 Tx queue PIO enable

Coverity issue: 1387551
Fixes: e7cd430c864f ("net/sfc/base: import SFN7xxx family support")
Cc: stable@dpdk.org
Signed-off-by: Andy Moreton <amoreton@solarflare.com>
Signed-off-by: Andrew Rybchenko <arybchenko@solarflare.com>

app/testpmd: add CLI to set TC min bandwidth

Add a CLI in testpmd to test the TC min bandwidth
setting.

Signed-off-by: Bernard Iremonger <bernard.iremonger@intel.com>
Signed-off-by: Wenzhuo Lu <wenzhuo.lu@intel.com>

net/ixgbe: allocate TC bandwidth

Ixgbe supports to set the relative bandwidth for the TCs.
It's a global setting for the PF and all the VFs of a
physical port.
This feature provide the API to set the bandwidth.

Signed-off-by: Bernard Iremonger <bernard.iremonger@intel.com>
Signed-off-by: Wenzhuo Lu <wenzhuo.lu@intel.com>

net/thunderx: wait to complete during link update

Some DPDK applications/examples check link status on their
start. NICVF does not wait for the link, so those apps fail.

Wait up to 9 seconds for the link as other PMDs do in order
to fix those apps/examples.

Signed-off-by: Andriy Berestovskyy <andriy.berestovskyy@caviumnetworks.com>
Acked-by: Jerin Jacob <jerin.jacob@caviumnetworks.com>

net/i40e: fix VLAN promisc setting

After adding VLAN filter, the VLAN promiscuous mode is
disabled. But there's no chance to enable it.
So add the check after deleting VLAN filter. If there's
no VLAN filter left, enable the VLAN promiscuous mode.

Fixes: 9f0645cd147c ("net/i40e: fix VLAN filter")

Signed-off-by: Wenzhuo Lu <wenzhuo.lu@intel.com>

net/tap: fix redirection rule after MAC change

This is necessary to ensure packets with the new MAC address as
destination get redirected to the tap device.

Also change the MAC address only if the current one is different from
the requested one.

Fixes: 2bc06869cd94 ("net/tap: add remote netdevice traffic capture")

Signed-off-by: Pascal Mazon <pascal.mazon@6wind.com>

net/tap: fix null MAC address at init

Immediately after init (probing), the device MAC address is all zeroes.
It should be possible to get a correct MAC address as soon as that,
without need for a dev_configure().

With this patch, a MAC address is set in eth_dev_tap_create()
explicitly. It either comes from the remote if any was configured, or is
randomly generated. In any case, the device MAC address is guaranteed to
be the correct one when the tap netdevice actually gets created in
tun_alloc().

Fixes: f76d46b4ff08 ("net/tap: add MAC address management")
Fixes: 2bc06869cd94 ("net/tap: add remote netdevice traffic capture")

Signed-off-by: Pascal Mazon <pascal.mazon@6wind.com>

net/tap: update netlink error code management

Some errors received from the kernel are acceptable, such as a -ENOENT
for a rule deletion (the rule was already no longer existing in the
kernel). Make sure we consider return codes properly. For that,
nl_recv() has been simplified.

qdisc_exists() function is no longer needed as we can check whether the
kernel returned -EEXIST when requiring the qdisc creation. It's simpler
and faster.

Add a few messages for clarity when a netlink error occurs.

Signed-off-by: Pascal Mazon <pascal.mazon@6wind.com>

net/mlx4: use a single drop queue for all drop flows

Signed-off-by: Vasily Philipov <vasilyf@mellanox.com>
Acked-by: Nelio Laranjeiro <nelio.laranjeiro@6wind.com>

net/mlx5: fix RSS flow rule with non existing queues

RSS flow rule validation accepts any queue even non existing ones which
causes a segmentation fault at creation time.

Fixes: 3d821d6fea40 ("net/mlx5: support RSS action flow rule")

Signed-off-by: Nelio Laranjeiro <nelio.laranjeiro@6wind.com>
Acked-by: Adrien Mazarguil <adrien.mazarguil@6wind.com>
Acked-by: Yongseok Koh <yskoh@mellanox.com>

net/mlx5: remove duplicated process in flow API

Flow validation already processes the final actions to verify if a rule can
be applied or not and the same is done during creation. As the creation
function relies on validation to generate and apply a rule, this job can be
fully handled by the validation function.

Signed-off-by: Nelio Laranjeiro <nelio.laranjeiro@6wind.com>
Acked-by: Adrien Mazarguil <adrien.mazarguil@6wind.com>
Acked-by: Yongseok Koh <yskoh@mellanox.com>

net/tap: remove minimum packet size in Rx

With support for segmented packets, it is now possible to easily receive
packets of many sizes, given an adequate number of descriptors.

Remove limitation on the minimum size of mbuf: on reception, if a packet
won't fit in the queue's mbufs, it will be detected in the packet info
and the packet will be discarded.

Fixes: 0781f5762cfe ("net/tap: support segmented mbufs")

Signed-off-by: Pascal Mazon <pascal.mazon@6wind.com>
Acked-by: Keith Wiles <keith.wiles@intel.com>

net/tap: remove unsupported UDP/TCP port mask in flow

Only full mask (0xffff) is accepted, there is no way to specify a mask
for layer 4 ports to the kernel using TC rules.

Fixes: de96fe68ae95 ("net/tap: add basic flow API patterns and actions")

Signed-off-by: Pascal Mazon <pascal.mazon@6wind.com>
Acked-by: Keith Wiles <keith.wiles@intel.com>

net/i40e/base: add README

Add README file in base to track the base code modification.

Signed-off-by: Jingjing Wu <jingjing.wu@intel.com>

doc: fix typo in i40e guide

This patch fixes a trivial typo in i40e documentation;

The comment about the behavior upon DPDK application quit was wrongly
titled as "after DPDK application exist" instead of "after DPDK
application exit", and this trivial patch fixes it.

Fixes: edf1b618313a ("doc: add limitations for i40e PMD")
Cc: stable@dpdk.org
Signed-off-by: Rami Rosen <rami.rosen@intel.com>
Acked-by: John McNamara <john.mcnamara@intel.com>

net/mlx5: remove unnecessary Verbs library calls

Remove unnecessary interface queries and the Resource Domain when
creating the Queue Pair. Since mlx5 PMD data path is on top of native
APIs, such Verbs library calls are no longer needed.

Signed-off-by: Shahaf Shuler <shahafs@mellanox.com>
Acked-by: Nelio Laranjeiro <nelio.laranjeiro@6wind.com>
Acked-by: Yongseok Koh <yskoh@mellanox.com>

net/mlx5: rebuild flows on updating RETA

Currently mlx5_dev_rss_reta_update() just updates tables in the host,
therefore it isn't immediately effective until restarting the device by
calling mlx5_dev_stop()/mlx5_dev_start() to update the changes in the
device side. This patch adds rebuilding the device-specific data
structure and applying it to the device right away.

Signed-off-by: Yongseok Koh <yskoh@mellanox.com>

net/mlx5: use correct RETA table size

When querying and updating RSS RETA table, it always uses the max size of
the device instead of configured value. This patch fixes it and removed the
related comments in the code.

Signed-off-by: Yongseok Koh <yskoh@mellanox.com>

ethdev: remove requirement of aligned RETA size

In rte_eth_check_reta_mask(), it is required to align the size of the RETA
table to RTE_RETA_GROUP_SIZE but as the size can be less than the limit,
this should be removed. The change is also applied to a command of testpmd.

Signed-off-by: Yongseok Koh <yskoh@mellanox.com>

net/fm10k: fix secondary process crash

If the primary process has initialized all the queues to vector
pmd mode, the secondary process should not use scalar code path,
because the per queue data structures haven't been prepared for
that, e.g. txq->ops is for vector Tx rather than scalar Tx.

Fixes: a6ce64a97520 ("fm10k: introduce vector driver")
Cc: stable@dpdk.org
Signed-off-by: Xiao Wang <xiao.w.wang@intel.com>
Acked-by: Jing Chen <jing.d.chen@intel.com>

net/i40e: update tunnel filter restore function

The QinQ filter uses big buffers, set the big_buffer flag
when restoring a QinQ filter.

Signed-off-by: Bernard Iremonger <bernard.iremonger@intel.com>
Acked-by: Wenzhuo Lu <wenzhuo.lu@intel.com>

net/i40e: update destroy tunnel filter function

The QinQ filter uses big buffers, set the big_buffer flag
when removing a QinQ filter.

Signed-off-by: Bernard Iremonger <bernard.iremonger@intel.com>
Acked-by: Wenzhuo Lu <wenzhuo.lu@intel.com>

net/i40e: parse QinQ pattern

add QinQ pattern.
add i40e_flow_parse_qinq_pattern function.
add i40e_flow_parse_qinq_filter function.

Signed-off-by: Bernard Iremonger <bernard.iremonger@intel.com>
Acked-by: Wenzhuo Lu <wenzhuo.lu@intel.com>

net/i40e: add QinQ filter create function

Add i40e_cloud_filter_qinq_create function, and call it
from i40e_dev_consistent_tunnel_filter_set function.
Replace the Outer IP filter with the QinQ filter.

QinQ allows multiple VLAN tags to be inserted into a single Ethernet
frame. A QinQ frame is a frame that has two VLAN 802.1Q headers.
802.1Q tunneling (QinQ) is a technique often used by Metro Ethernet
providers as a layer 2 VPN for customers.

Signed-off-by: Laura Stroe <laura.stroe@intel.com>
Signed-off-by: Bernard Iremonger <bernard.iremonger@intel.com>
Acked-by: Beilei Xing <beilei.xing@intel.com>

net/i40e: initialise L3 MAP register

The L3 MAP register is initialised to support QinQ
cloud filters.

Signed-off-by: Bernard Iremonger <bernard.iremonger@intel.com>
Acked-by: Wenzhuo Lu <wenzhuo.lu@intel.com>

net/mlx5: fix mark id retrieval

Mark ID in the completion queue entry is 24 bits, the remaining 8 bits are
reserved and may be nonzero. Do not take them into account when looking for
marked packets.

Fixes: ea3bc3b1df94 ("net/mlx5: support mark flow action")
Cc: stable@dpdk.org
Signed-off-by: Nelio Laranjeiro <nelio.laranjeiro@6wind.com>
Acked-by: Adrien Mazarguil <adrien.mazarguil@6wind.com>

net/sfc: fix device reconfigure

Device reconfigure should be done without close which releases
all transmit and receive queue. ethdev API assumes that previously
setup queues (minimum from configured before and now) are kept
across device reconfigure.

Fixes: aaa3f5f0f79d ("net/sfc: add configure and close stubs")
Cc: stable@dpdk.org
Signed-off-by: Andrew Rybchenko <arybchenko@solarflare.com>
Reviewed-by: Andy Moreton <amoreton@solarflare.com>

net/sfc: support changing the number of transmit queues

Fixes: a8ad8cf83f01 ("net/sfc: provide basic stubs for Tx subsystem")
Cc: stable@dpdk.org
Signed-off-by: Andrew Rybchenko <arybchenko@solarflare.com>
Reviewed-by: Andy Moreton <amoreton@solarflare.com>

net/sfc: support changing the number of receive queues

Fixes: a8e64c6b455f ("net/sfc: implement Rx subsystem stubs")
Cc: stable@dpdk.org
Signed-off-by: Andrew Rybchenko <arybchenko@solarflare.com>
Reviewed-by: Andy Moreton <amoreton@solarflare.com>

net/sfc: clarify Tx subsystem configure/close function names

Prepare to fix device reconfigure. Make it clear that corresponding
functions should be called on device configure and close operations.
No functional change.

Fixes: a8ad8cf83f01 ("net/sfc: provide basic stubs for Tx subsystem")
Cc: stable@dpdk.org
Signed-off-by: Andrew Rybchenko <arybchenko@solarflare.com>
Reviewed-by: Andy Moreton <amoreton@solarflare.com>

net/sfc: clarify Rx subsystem configure/close function names

Prepare to fix device reconfigure. Make it clear that corresponding
functions should be called on device configure and close operations.
No functional change.

Fixes: a8e64c6b455f ("net/sfc: implement Rx subsystem stubs")
Cc: stable@dpdk.org
Signed-off-by: Andrew Rybchenko <arybchenko@solarflare.com>
Reviewed-by: Andy Moreton <amoreton@solarflare.com>

net/sfc: initialize port data on attach

Port configuration should be initialized on attach to avoid reset to
defaults on device reconfigure.

Fixes: 03ed21195d9e ("net/sfc: minimum port control sufficient to receive traffic")
Cc: stable@dpdk.org
Signed-off-by: Andrew Rybchenko <arybchenko@solarflare.com>
Reviewed-by: Andy Moreton <amoreton@solarflare.com>

net/sfc: move event support init to attach stage

Prepare to fix device reconfigure.
Device arguments should be parsed once on attach.
Management event queue should be initialized once on attach.

Fixes: 58294ee65afb ("net/sfc: support event queue")
Fixes: c22d3c508e0c ("net/sfc: support parameter to choose performance profile")
Cc: stable@dpdk.org
Signed-off-by: Andrew Rybchenko <arybchenko@solarflare.com>
Reviewed-by: Andy Moreton <amoreton@solarflare.com>

net/sfc: remove EvQ info array to simplify reconfigure

EvQ info array keeps information about EvQ centralized, however
EvQ pointers are available in TxQ and RxQ structures. Single
array for all EvQs complicates device reconfigure handling, so
simply git rid of it.

It removes notion of EvQ software index since there is no EvQ
array in software any more.

Fixes: 58294ee65afb ("net/sfc: support event queue")
Fixes: 9a75f75cb1f2 ("net/sfc: maintain management event queue")
Fixes: ce35b05c635e ("net/sfc: implement Rx queue setup release operations")
Fixes: 28944ac098aa ("net/sfc: implement Rx queue start and stop operations")
Fixes: b1b7ad933b39 ("net/sfc: set up and release Tx queues")
Fixes: fed9aeb46c19 ("net/sfc: implement transmit path start / stop")
Fixes: 3b809c27b1fe ("net/sfc: support link status change interrupt")
Cc: stable@dpdk.org
Signed-off-by: Andrew Rybchenko <arybchenko@solarflare.com>
Reviewed-by: Andy Moreton <amoreton@solarflare.com>

net/sfc: remove flags from EvQ info

Next step to get rid of EvQ info at all.

Fixes: c22d3c508e0c ("net/sfc: support parameter to choose performance profile")
Fixes: 3b809c27b1fe ("net/sfc: support link status change interrupt")
Cc: stable@dpdk.org
Signed-off-by: Andrew Rybchenko <arybchenko@solarflare.com>
Reviewed-by: Andy Moreton <amoreton@solarflare.com>

net/sfc: move EvQ entries to the EvQ control structure

EvQ info array is a problem on device reconfigure when number of
Rx and Tx queues may change. It is a step to get rid of it.

Fixes: 58294ee65afb ("net/sfc: support event queue")
Cc: stable@dpdk.org
Signed-off-by: Andrew Rybchenko <arybchenko@solarflare.com>
Reviewed-by: Andy Moreton <amoreton@solarflare.com>

net/sfc: remove unused max entries from EvQ info

Fixes: 58294ee65afb ("net/sfc: support event queue")
Cc: stable@dpdk.org
Signed-off-by: Andrew Rybchenko <arybchenko@solarflare.com>
Reviewed-by: Andy Moreton <amoreton@solarflare.com>

net/sfc: bind EvQ DMA space to EvQ type and type-local index

The index of an event queue index is computed from the index of the
corresponding transmit or receive queue, and depends on the total
number of receive queues. As a consequence, the index of an event
queue bound to a transmit queue changes if the total number of
receive queues is changed.

Fixes: 58294ee65afb ("net/sfc: support event queue")
Cc: stable@dpdk.org
Signed-off-by: Andrew Rybchenko <arybchenko@solarflare.com>
Reviewed-by: Andy Moreton <amoreton@solarflare.com>

net/sfc: clarify interrupts support function names

Make it clear that corresponding functions should be called on device
configure and close operations. No functional change.

Fixes: 06bc197796e2 ("net/sfc: interrupts support sufficient for event queue init")
Cc: stable@dpdk.org
Signed-off-by: Andrew Rybchenko <arybchenko@solarflare.com>

net/sfc: do not drop TSO on device configure

If Tx datapath does not support TSO, TSO was dropped on device configure.
It is incorrect to change advertised offloads.

Fixes: 7a4d44a639c9 ("net/sfc: make TSO a datapath-dependent feature")

Signed-off-by: Andrew Rybchenko <arybchenko@solarflare.com>

net/sfc: choose datapaths after probe and before attach

Datapath choice requires NIC capabilities knowledge and, therefore,
should be done after probe. Whereas NIC resources estimation needs
to know chosen datapath (e.g. if Tx datapath is going to use TSO).

Fixes: df1bfde4ff0d ("net/sfc: factor out libefx-based Rx datapath")

Signed-off-by: Andrew Rybchenko <arybchenko@solarflare.com>

net/sfc: use correct function to free scattered packet on Rx

Put to mempool does not free chained segments.

Fixes: e0b063941e03 ("net/sfc: support scattered Rx DMA")
Cc: stable@dpdk.org
Signed-off-by: Andrew Rybchenko <arybchenko@solarflare.com>

net/ixgbe: ping VF when PF status changes

Signed-off-by: Alex Zelezniak <alexz@att.com>
Acked-by: Wei Dai <wei.dai@intel.com>
Acked-by: Wenzhuo Lu <wenzhuo.lu@intel.com>

net/i40e: enable tunnel filter for MPLS

MPLSoUDP & MPLSoGRE is not supported by tunnel
filter due to limited resource of HW, this patch
enables MPLS tunnel filter by replacing inner_mac
filter.

This configuration will be set when adding MPLSoUDP
and MPLSoGRE filter rules, and it will be invalid
only by NIC core reset.

Signed-off-by: Beilei Xing <beilei.xing@intel.com>
Acked-by: Jingjing Wu <jingjing.wu@intel.com>

net/i40e: add flow API MPLS parsing function

This patch add MPLS parsing function to support MPLS filtering.

Signed-off-by: Beilei Xing <beilei.xing@intel.com>
Acked-by: Jingjing Wu <jingjing.wu@intel.com>

app/testpmd: add MPLS and GRE fields to flow command

This patch exposes the following item fields through the flow command:

- MPLS label
- GRE protocol

Signed-off-by: Beilei Xing <beilei.xing@intel.com>
Acked-by: Adrien Mazarguil <adrien.mazarguil@6wind.com>

ethdev: add MPLS and GRE flow API items

This patch adds MPLS and GRE items to generic rte flow.

Signed-off-by: Beilei Xing <beilei.xing@intel.com>
Acked-by: Adrien Mazarguil <adrien.mazarguil@6wind.com>

net/sfc: fix leak if EvQ DMA space allocation fails

Fixes: 58294ee65afb ("net/sfc: support event queue")
Cc: stable@dpdk.org
Signed-off-by: Andrew Rybchenko <arybchenko@solarflare.com>

net/sfc: destroy event queue when Tx queue is released

Fixes: b1b7ad933b39 ("net/sfc: set up and release Tx queues")
Cc: stable@dpdk.org
Signed-off-by: Andrew Rybchenko <arybchenko@solarflare.com>

net/sfc: destroy event queue when Rx queue is released

Fixes: ce35b05c635e ("net/sfc: implement Rx queue setup release operations")
Cc: stable@dpdk.org
Signed-off-by: Andrew Rybchenko <arybchenko@solarflare.com>

net/mlx5: fix Tx when first segment size is too short

First segment size must be at least 18 bytes, packets not respecting this
are silently not sent by the NIC but counted as sent by the PMD. The only
way to figure out is compiling the PMD in debug mode.

Fixes: 6579c27c11a5 ("net/mlx5: remove gather loop on segments")
Cc: stable@dpdk.org
Signed-off-by: Nelio Laranjeiro <nelio.laranjeiro@6wind.com>
Acked-by: Adrien Mazarguil <adrien.mazarguil@6wind.com>
Acked-by: Yongseok Koh <yskoh@mellanox.com>

net/qede: prevent crash while changing MTU dynamically

The driver can handle dynamic MTU change without needing the port to be
stopped explicitly by the application. However, there is currently no
check to prevent I/Os from happening on a different thread while the
port is going thru' reset internally. This patch fixes this issue by
assigning RX/TX burst functions to a dummy function and also reconfigure
RX bufsize for each rx queue based on the new MTU value.

Fixes: 200645ac7909 ("net/qede: set MTU")
Cc: stable@dpdk.org
Signed-off-by: Harish Patil <harish.patil@qlogic.com>

net/qede: fix VF RSS configuration

The newer SR-IOV PF drivers expects RX/TX queues to be created before
applying RSS configuration. This patch addresses this requirement by
deferring RSS configuration till the queues are created. Even though
this issue is only seen in SR-IOV context, the changes will be made
applicable to PF also to keep the behavior consistent between VF/PF.

Fixes: 7ab35bf6b97b ("net/qede: fix RSS")
Cc: stable@dpdk.org
Signed-off-by: Harish Patil <harish.patil@qlogic.com>

net/qede: fix missing UDP protocol in RSS offload types

Both UDP and TCP based RSS offload types are supported by the device.
This patch adds UDP protocol which got missed out in the original patch.

Fixes: 4c98f2768eef ("net/qede: support RSS hash configuration")
Cc: stable@dpdk.org
Signed-off-by: Harish Patil <harish.patil@qlogic.com>