Rasesh Mody [Sat, 7 Oct 2017 06:30:58 +0000 (23:30 -0700)]
net/qede/base: introduce HW/SW channel
Introduce two new API functions, one for the VF and one for the PF
(per VF), which select whether the HW or SW channel is used for
PF<->VF communication (a per-VF configuration). A Hyper-V host might
run VMs with different requirements.
Rasesh Mody [Sat, 7 Oct 2017 06:30:56 +0000 (23:30 -0700)]
net/qede/base: add xcvr type and DON FW defines
Add support to firmware for:
- New SFP type 1000BaseT
- DON (Diag Over Network). This feature implements the server side for
processing data access commands over Ethernet.
Jerin Jacob [Sun, 8 Oct 2017 12:44:15 +0000 (18:14 +0530)]
net/octeontx: add net device probe and remove
An octeontx ethdev device consists of multiple PKO VF devices, a PKI
VF device and multiple SSOVF devices, which are shared with eventdev.
This patch adds a vdev-based device called "eth_octeontx", which
creates multiple ethernet ports based on "nr_port" or on the maximum
number of physical ports available in the system.
Signed-off-by: Jerin Jacob <jerin.jacob@caviumnetworks.com> Signed-off-by: Santosh Shukla <santosh.shukla@caviumnetworks.com>
Jerin Jacob [Sun, 8 Oct 2017 12:44:13 +0000 (18:14 +0530)]
net/octeontx/base: add base PKO operations
PKO is the packet output processing unit, which receives packets
from the core and sends them to the BGX interface. This patch adds the
basic PKO operations such as open, close, start and stop. These operations
are implemented through mailbox messages, with the kernel PF driver acting
as the server that processes each message using the logical port identifier.
Signed-off-by: Jerin Jacob <jerin.jacob@caviumnetworks.com> Signed-off-by: Santosh Shukla <santosh.shukla@caviumnetworks.com>
Jerin Jacob [Sun, 8 Oct 2017 12:44:11 +0000 (18:14 +0530)]
net/octeontx/base: add base PKI operations
PKI is the packet input unit, which receives packets from the
BGX interface. This patch adds the basic PKI operations such as
open, close, start and stop. These operations are implemented through
mailbox messages, with the kernel PF driver acting as the server that
processes each message using the logical port identifier.
Signed-off-by: Jerin Jacob <jerin.jacob@caviumnetworks.com> Signed-off-by: Santosh Shukla <santosh.shukla@caviumnetworks.com>
Jerin Jacob [Sun, 8 Oct 2017 12:44:10 +0000 (18:14 +0530)]
net/octeontx/base: probe PKI and PKO PCIe VF devices
An octeontx ethdev device consists of multiple PKO VF devices and a PKI
VF device. On Octeontx HW, each Rx queue is enumerated as an SSOVF device,
which is exposed as an event_octeontx device; Tx queues are enumerated as
PKOVF devices, and ingress packet configuration is accomplished through
the PKIVF device.
In order to expose these as a single ethdev instance, on PCIe VF probe
the driver stores the information associated with the PCIe VF device and
later, using the vdev infrastructure, creates the ethdev device from the
earlier probed PCIe VF devices.
Signed-off-by: Jerin Jacob <jerin.jacob@caviumnetworks.com> Signed-off-by: Santosh Shukla <santosh.shukla@caviumnetworks.com>
Jerin Jacob [Sun, 8 Oct 2017 12:44:08 +0000 (18:14 +0530)]
net/octeontx/base: add base BGX operations
BGX is an HW MAC interface. This patch adds the basic BGX operations such as
open, close, start and stop. These operations are implemented through
mailbox messages, with the kernel PF driver acting as the server that
processes each message using the physical port identifier.
Signed-off-by: Jerin Jacob <jerin.jacob@caviumnetworks.com> Signed-off-by: Santosh Shukla <santosh.shukla@caviumnetworks.com>
Some internal toolchain versions generate an unaligned
memory access fault when memcpy copies from a 17-31B buffer.
Subsequent patches in this series will use 17-31B mbox messages.
Since the mailbox message copy is on the slow path, change memcpy to a
byte-by-byte copy to work around the issue.
Signed-off-by: Jerin Jacob <jerin.jacob@caviumnetworks.com> Signed-off-by: Santosh Shukla <santosh.shukla@caviumnetworks.com>
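For illustration only, a minimal sketch of such a byte-by-byte copy. The
helper name mbox_copy_bytes() and the use of volatile are assumptions made
for this example; the driver code itself is not shown in this log and may
differ.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper: copy a mailbox message one byte at a time.
 * The volatile qualifiers force each byte access to be performed as
 * written, so the compiler cannot turn the loop back into wider
 * (possibly unaligned) accesses or a memcpy() call.
 */
static void
mbox_copy_bytes(volatile uint8_t *dst, const volatile uint8_t *src, size_t len)
{
	size_t i;

	for (i = 0; i < len; i++)
		dst[i] = src[i];
}
```

This trades speed for predictable access width, which is acceptable only
because the mailbox copy is on the slow path.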
Yuanhan Liu [Fri, 7 Jul 2017 06:02:12 +0000 (14:02 +0800)]
app/testpmd: allow to query any RETA size
Currently, testpmd allows querying the RETA info only when the
requested size equals the configured RETA size.
This patch allows querying any RETA size <= the configured size. This
helps when the RETA size is big (say 512) and I just want to peek at a
few RETA entries.
Signed-off-by: Yuanhan Liu <yliu@fridaylinux.org> Acked-by: Jingjing Wu <jingjing.wu@intel.com>
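A rough sketch of the relaxed check, assuming RETA entries are selected
through 64-bit masks with one mask per group of 64 entries. The function
name reta_query_masks() and the GROUP_SIZE constant are hypothetical and
are not testpmd code.

```c
#include <stddef.h>
#include <stdint.h>

#define GROUP_SIZE 64	/* entries covered by one 64-bit mask (assumed) */

/*
 * Accept any requested size in 1..configured and build masks selecting
 * the first `requested` entries.
 * Returns the number of mask groups filled, or -1 on an invalid request.
 */
static int
reta_query_masks(uint16_t requested, uint16_t configured,
		 uint64_t *masks, size_t nb_masks)
{
	size_t groups, i;

	if (requested == 0 || requested > configured)
		return -1;
	groups = (requested + GROUP_SIZE - 1) / GROUP_SIZE;
	if (groups > nb_masks)
		return -1;
	for (i = 0; i < groups; i++) {
		uint16_t left = requested - i * GROUP_SIZE;

		masks[i] = (left >= GROUP_SIZE) ?
			UINT64_MAX : ((UINT64_C(1) << left) - 1);
	}
	return (int)groups;
}
```

With a configured size of 512, a request for 8 entries fills a single
group whose mask selects only the low 8 entries.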
Mark Kavanagh [Sat, 7 Oct 2017 14:56:44 +0000 (22:56 +0800)]
doc: add GSO programmer's guide
Add programmer's guide doc to explain the design and use of the
GSO library.
Signed-off-by: Mark Kavanagh <mark.b.kavanagh@intel.com> Signed-off-by: Jiayu Hu <jiayu.hu@intel.com> Acked-by: John McNamara <john.mcnamara@intel.com> Acked-by: Konstantin Ananyev <konstantin.ananyev@intel.com>
Jiayu Hu [Sat, 7 Oct 2017 14:56:43 +0000 (22:56 +0800)]
app/testpmd: enable TCP/IPv4 VxLAN and GRE GSO
This patch adds GSO support to the csum forwarding engine. Oversized
packets transmitted over a GSO-enabled port will undergo segmentation
(with the exception of packet-types unsupported by the GSO library).
GSO support is disabled by default.
GSO support may be toggled on a per-port basis, using the command:
"set port <port_id> gso on|off"
The maximum packet length (including the packet header and payload) for
GSO segments may be set with the command:
"set gso segsz <length>"
Show GSO configuration for a given port with the command:
"show port <port_id> gso"
Signed-off-by: Jiayu Hu <jiayu.hu@intel.com> Signed-off-by: Mark Kavanagh <mark.b.kavanagh@intel.com> Acked-by: Konstantin Ananyev <konstantin.ananyev@intel.com>
Mark Kavanagh [Sat, 7 Oct 2017 14:56:42 +0000 (22:56 +0800)]
gso: support GRE GSO
This patch adds GSO support for GRE-tunneled packets. Supported GRE
packets must contain an outer IPv4 header, and inner TCP/IPv4 headers.
They may also contain a single VLAN tag. GRE GSO doesn't check if all
input packets have correct checksums and doesn't update checksums for
output packets. Additionally, it doesn't process IP fragmented packets.
As with VxLAN GSO, GRE GSO uses a two-segment MBUF to organize each
output packet, which requires multi-segment mbuf support in the TX
functions of the NIC driver. Also, if a packet is GSOed, GRE GSO reduces
its MBUF refcnt by 1. As a result, when all of its GSOed segments are
freed, the packet is freed automatically.
Signed-off-by: Mark Kavanagh <mark.b.kavanagh@intel.com> Signed-off-by: Jiayu Hu <jiayu.hu@intel.com> Acked-by: Konstantin Ananyev <konstantin.ananyev@intel.com>
Mark Kavanagh [Sat, 7 Oct 2017 14:56:41 +0000 (22:56 +0800)]
gso: support VxLAN GSO
This patch adds a framework that allows GSO on tunneled packets.
Furthermore, it leverages that framework to provide GSO support for
VxLAN-encapsulated packets.
Supported VxLAN packets must have an outer IPv4 header (prepended by an
optional VLAN tag), and contain an inner TCP/IPv4 packet (with an optional
inner VLAN tag).
VxLAN GSO doesn't check if input packets have correct checksums and
doesn't update checksums for output packets. Additionally, it doesn't
process IP fragmented packets.
As with TCP/IPv4 GSO, VxLAN GSO uses a two-segment MBUF to organize each
output packet, which mandates support for multi-segment mbufs in the TX
functions of the NIC driver. Also, if a packet is GSOed, VxLAN GSO
reduces its MBUF refcnt by 1. As a result, when all of its GSO'd segments
are freed, the packet is freed automatically.
Signed-off-by: Mark Kavanagh <mark.b.kavanagh@intel.com> Signed-off-by: Jiayu Hu <jiayu.hu@intel.com> Acked-by: Konstantin Ananyev <konstantin.ananyev@intel.com>
Jiayu Hu [Sat, 7 Oct 2017 14:56:40 +0000 (22:56 +0800)]
gso: support TCP/IPv4 GSO
This patch adds GSO support for TCP/IPv4 packets. Supported packets
may include a single VLAN tag. TCP/IPv4 GSO doesn't check if input
packets have correct checksums, and doesn't update checksums for
output packets (the responsibility for this lies with the application).
Additionally, TCP/IPv4 GSO doesn't process IP fragmented packets.
TCP/IPv4 GSO uses two chained MBUFs, one direct MBUF and one indirect
MBUF, to organize an output packet. Note that we refer to these two
chained MBUFs as a two-segment MBUF. The direct MBUF stores the packet
header, while the indirect mbuf simply points to a location within the
original packet's payload. Consequently, use of the GSO library requires
multi-segment MBUF support in the TX functions of the NIC driver.
If a packet is GSO'd, TCP/IPv4 GSO reduces its MBUF refcnt by 1. As a
result, when all of its GSOed segments are freed, the packet is freed
automatically.
Signed-off-by: Jiayu Hu <jiayu.hu@intel.com> Signed-off-by: Mark Kavanagh <mark.b.kavanagh@intel.com> Acked-by: Konstantin Ananyev <konstantin.ananyev@intel.com> Tested-by: Lei Yao <lei.a.yao@intel.com>
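As a sketch of the layout described above, the following plans which slice
of the original payload each output segment references, so that header plus
payload fits in the segment size. The struct seg_desc type and gso_plan()
are hypothetical names for this example and are not part of the GSO
library.

```c
#include <stddef.h>
#include <stdint.h>

struct seg_desc {		/* hypothetical: one output GSO segment */
	uint32_t payload_off;	/* offset into the original payload */
	uint32_t payload_len;	/* bytes referenced by the indirect part */
};

/*
 * Plan segments for a packet with hdr_len header bytes and payload_len
 * payload bytes so that hdr_len plus each payload slice fits in segsz.
 * Returns the number of segments, or -1 on bad input or if `out` is
 * too small.
 */
static int
gso_plan(uint32_t hdr_len, uint32_t payload_len, uint32_t segsz,
	 struct seg_desc *out, size_t max_out)
{
	uint32_t per_seg, off = 0;
	size_t n = 0;

	if (segsz <= hdr_len || payload_len == 0)
		return -1;
	per_seg = segsz - hdr_len;
	while (off < payload_len) {
		uint32_t left = payload_len - off;

		if (n == max_out)
			return -1;
		out[n].payload_off = off;
		out[n].payload_len = left < per_seg ? left : per_seg;
		off += out[n].payload_len;
		n++;
	}
	return (int)n;
}
```

Each planned slice corresponds to one indirect MBUF; the header copy held
by the direct MBUF is the same size for every segment.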
Jiayu Hu [Sat, 7 Oct 2017 14:56:39 +0000 (22:56 +0800)]
gso: add Generic Segmentation Offload API framework
Generic Segmentation Offload (GSO) is a SW technique to split large
packets into small ones. Akin to TSO, GSO enables applications to
operate on large packets, thus reducing per-packet processing overhead.
To enable more flexibility to applications, DPDK GSO is implemented
as a standalone library. Applications explicitly use the GSO library
to segment packets. To segment a packet requires two steps. The first
is to set proper flags to mbuf->ol_flags, where the flags are the same
as that of TSO. The second is to call the segmentation API,
rte_gso_segment(). This patch introduces the GSO API framework to DPDK.
rte_gso_segment() splits an input packet into small ones in each
invocation. The GSO library refers to these small packets generated
by rte_gso_segment() as GSO segments. Each of the newly-created GSO
segments is organized as a two-segment MBUF, where the first segment is a
standard MBUF, which stores a copy of packet header, and the second is an
indirect MBUF which points to a section of data in the input packet.
rte_gso_segment() reduces the refcnt of the input packet by 1. Therefore,
when all GSO segments are freed, the input packet is freed automatically.
Additionally, since each GSO segment has multiple MBUFs (i.e. 2 MBUFs),
the driver of the interface to which the GSO segments are sent must
support transmitting multi-segment packets.
The GSO framework clears the PKT_TX_TCP_SEG flag for both the input
packet, and all produced GSO segments in the event of success, since
segmentation in hardware is no longer required at that point.
Signed-off-by: Jiayu Hu <jiayu.hu@intel.com> Signed-off-by: Mark Kavanagh <mark.b.kavanagh@intel.com> Acked-by: Konstantin Ananyev <konstantin.ananyev@intel.com>
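The reference counting described above can be modelled with a toy example.
The toy_pkt and toy_seg types and the toy_* functions are stand-ins
invented for this sketch; they are not the rte_mbuf API.

```c
#include <stddef.h>

struct toy_pkt {		/* stand-in for the input packet */
	int refcnt;
	int freed;
};

struct toy_seg {		/* stand-in for the indirect part of a segment */
	struct toy_pkt *src;
};

static void
toy_pkt_release(struct toy_pkt *p)
{
	if (--p->refcnt == 0)
		p->freed = 1;
}

/* Attach nb segments to the input packet, then drop one reference. */
static void
toy_gso_segment(struct toy_pkt *p, struct toy_seg *segs, size_t nb)
{
	size_t i;

	for (i = 0; i < nb; i++) {
		segs[i].src = p;
		p->refcnt++;	/* each indirect part holds a reference */
	}
	toy_pkt_release(p);	/* the input packet's refcnt is reduced by 1 */
}

static void
toy_seg_free(struct toy_seg *s)
{
	toy_pkt_release(s->src);
	s->src = NULL;
}
```

In this model the input packet is released exactly when the last segment
referencing it is freed, without the application freeing it explicitly.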
Jiayu Hu [Sat, 7 Oct 2017 07:45:57 +0000 (15:45 +0800)]
app/testpmd: enable the heavyweight mode TCP/IPv4 GRO
The GRO library provides two modes to reassemble packets. Currently, the
csum forwarding engine supports the lightweight mode for reassembling
TCP/IPv4 packets. This patch introduces the heavyweight mode
for TCP/IPv4 GRO in the csum forwarding engine.
With the command "set port <port_id> gro on|off", users can enable
TCP/IPv4 GRO for a given port. With the command "set gro flush <cycles>",
users can determine when the GROed TCP/IPv4 packets are flushed from
reassembly tables. With the command "show port <port_id> gro", users can
display GRO configuration.
The GRO library doesn't re-calculate checksums for merged packets. If
users want the merged packets to have correct IP and TCP checksums,
please select HW IP checksum calculation and HW TCP checksum calculation
for the port which the merged packets are transmitted to.
Signed-off-by: Jiayu Hu <jiayu.hu@intel.com> Reviewed-by: Ferruh Yigit <ferruh.yigit@intel.com> Tested-by: Lei Yao <lei.a.yao@intel.com>
Declan Doherty [Fri, 6 Oct 2017 09:21:12 +0000 (10:21 +0100)]
net/bonding: fix LACP slave deactivate behavioral
During a link down event of a port participating in a LACP 802.3ad
bond the current behavior can cause all ports to be deselected
and temporarily stop all traffic on the bond, causing unexpected
traffic loss across all ports and not just the port which was
affected by the link down event.
The compilation with gcc-6.3.0 and EXTRA_CFLAGS=-Og gives the following
error:
CC virtio_rxtx.o
virtio_rxtx.c: In function ‘virtio_rx_offload’:
virtio_rxtx.c:680:10: error: ‘csum’ may be used uninitialized in
this function [-Werror=maybe-uninitialized]
csum = ~csum;
~~~~~^~~~~~~
The function rte_raw_cksum_mbuf() may indeed return an error, and
in this case, csum won't be initialized. Fix it by initializing csum
to 0.
Fixes: 96cb6711939e ("net/virtio: support Rx checksum offload") Cc: stable@dpdk.org Signed-off-by: Olivier Matz <olivier.matz@6wind.com> Reviewed-by: Maxime Coquelin <maxime.coquelin@redhat.com>
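A self-contained sketch of the pattern behind this warning. toy_raw_cksum()
is a hypothetical stand-in that, like the helper named above, can fail
without writing its output; the byte sum it computes is a placeholder, not
a real Internet checksum.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: fails without touching *cksum on a bad range. */
static int
toy_raw_cksum(const uint8_t *buf, size_t len, size_t off, size_t n,
	      uint16_t *cksum)
{
	uint32_t sum = 0;
	size_t i;

	if (off + n > len)
		return -1;	/* *cksum is left unwritten */
	for (i = 0; i < n; i++)
		sum += buf[off + i];
	*cksum = (uint16_t)sum;
	return 0;
}

static uint16_t
toy_inverted_cksum(const uint8_t *buf, size_t len, size_t off, size_t n)
{
	uint16_t csum = 0;	/* the fix: defined even if the call fails */

	toy_raw_cksum(buf, len, off, n, &csum);
	return (uint16_t)~csum;
}
```

Without the initializer, the failure path would read an indeterminate
value, which is what the compiler reports at -Og.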
Xueming Li [Fri, 6 Oct 2017 15:45:50 +0000 (23:45 +0800)]
net/mlx5: allocate verbs object into shared memory
The PMD uses Verbs objects which were not available in shared memory.
This patch modifies the location where Verbs objects are allocated (from
process memory address space to shared memory address space) and thus
allows a secondary process to use those objects by mapping this shared
memory space into its own memory space.
Signed-off-by: Xueming Li <xuemingl@mellanox.com> Acked-by: Nelio Laranjeiro <nelio.laranjeiro@6wind.com>
Xueming Li [Fri, 6 Oct 2017 15:45:49 +0000 (23:45 +0800)]
net/mlx5: install a socket to exchange a file descriptor
Use a unix socket to get back the communication channel with the kernel
driver from the primary process. This is necessary to remap those pages
in the secondary process memory space and thus use the same Tx queues.
This is only supported from rdma-core (v15).
Signed-off-by: Nelio Laranjeiro <nelio.laranjeiro@6wind.com> Signed-off-by: Xueming Li <xuemingl@mellanox.com>
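Passing a file descriptor over a unix socket is commonly done with
SCM_RIGHTS ancillary data. The sketch below shows that general POSIX
technique; send_fd() and recv_fd() are hypothetical helper names, and the
log does not show whether the PMD is structured this way.

```c
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/uio.h>

/* Control buffer sized and aligned for one file descriptor. */
union fd_ctrl {
	struct cmsghdr hdr;
	char buf[CMSG_SPACE(sizeof(int))];
};

/* Send fd over the connected unix socket sock; returns 0 on success. */
static int
send_fd(int sock, int fd)
{
	char data = 'F';	/* one payload byte must accompany the fd */
	struct iovec iov = { .iov_base = &data, .iov_len = 1 };
	union fd_ctrl ctrl;
	struct msghdr msg;
	struct cmsghdr *cmsg;

	memset(&ctrl, 0, sizeof(ctrl));
	memset(&msg, 0, sizeof(msg));
	msg.msg_iov = &iov;
	msg.msg_iovlen = 1;
	msg.msg_control = ctrl.buf;
	msg.msg_controllen = sizeof(ctrl.buf);
	cmsg = CMSG_FIRSTHDR(&msg);
	cmsg->cmsg_level = SOL_SOCKET;
	cmsg->cmsg_type = SCM_RIGHTS;
	cmsg->cmsg_len = CMSG_LEN(sizeof(int));
	memcpy(CMSG_DATA(cmsg), &fd, sizeof(int));
	return sendmsg(sock, &msg, 0) == 1 ? 0 : -1;
}

/* Receive a descriptor sent by send_fd(); returns it, or -1 on error. */
static int
recv_fd(int sock)
{
	char data;
	struct iovec iov = { .iov_base = &data, .iov_len = 1 };
	union fd_ctrl ctrl;
	struct msghdr msg;
	struct cmsghdr *cmsg;
	int fd = -1;

	memset(&ctrl, 0, sizeof(ctrl));
	memset(&msg, 0, sizeof(msg));
	msg.msg_iov = &iov;
	msg.msg_iovlen = 1;
	msg.msg_control = ctrl.buf;
	msg.msg_controllen = sizeof(ctrl.buf);
	if (recvmsg(sock, &msg, 0) != 1)
		return -1;
	cmsg = CMSG_FIRSTHDR(&msg);
	if (cmsg == NULL || cmsg->cmsg_level != SOL_SOCKET ||
	    cmsg->cmsg_type != SCM_RIGHTS)
		return -1;
	memcpy(&fd, CMSG_DATA(cmsg), sizeof(int));
	return fd;
}
```

The receiver gets a new descriptor number that refers to the same open
file as the sender's, which is what allows a second process to map the
same resources.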
Xueming Li [Fri, 6 Oct 2017 15:45:48 +0000 (23:45 +0800)]
net/mlx5: change eth device reference for secondary process
The rte_eth_dev created by the primary process was not available in the
secondary process, so it was not possible to use the primary process's
local memory object from a secondary process.
This patch changes the reference from the primary rte_eth_dev object to
the secondary process's local rte_eth_dev instead.
Signed-off-by: Xueming Li <xuemingl@mellanox.com> Acked-by: Nelio Laranjeiro <nelio.laranjeiro@6wind.com>
David Hunt [Wed, 11 Oct 2017 16:18:55 +0000 (17:18 +0100)]
examples/vm_power_mgr: set MAC address of VF
We need to set vf mac from the host, so that they will be in sync on the
guest and the host. Otherwise, we'll have a random mac on the guest, and
a 00:00:00:00:00:00 mac on the host.
Signed-off-by: David Hunt <david.hunt@intel.com> Reviewed-by: Santosh Shukla <santosh.shukla@caviumnetworks.com> Acked-by: Konstantin Ananyev <konstantin.ananyev@intel.com>
David Hunt [Wed, 11 Oct 2017 16:18:53 +0000 (17:18 +0100)]
power: add send channel msg function to map file
Adding new wrapper function to existing private (but unused 'till now)
function with an rte_power_ prefix.
The plan is to clean up all the header files in the next release so
that only the intended public functions are in the map file and only
the relevant headers have the rte_ prefix so that only they are
included in the documentation.
Signed-off-by: David Hunt <david.hunt@intel.com> Reviewed-by: Santosh Shukla <santosh.shukla@caviumnetworks.com> Acked-by: Konstantin Ananyev <konstantin.ananyev@intel.com>
Rory Sexton [Wed, 11 Oct 2017 16:18:47 +0000 (17:18 +0100)]
net/i40e: support converting VF MAC to VF id
Need a way to convert a VF id to a PF id on the host so as to query the
PF for relevant statistics which are used for the frequency changes in
the vm_power_manager app.
Used when profiles are passed down from the guest to the host, allowing
the host to map the VFs to PFs.
Signed-off-by: Nemanja Marjanovic <nemanja.marjanovic@intel.com> Signed-off-by: Rory Sexton <rory.sexton@intel.com> Signed-off-by: David Hunt <david.hunt@intel.com> Reviewed-by: Santosh Shukla <santosh.shukla@caviumnetworks.com> Acked-by: Konstantin Ananyev <konstantin.ananyev@intel.com>
Shreyansh Jain [Sat, 12 Aug 2017 10:22:20 +0000 (15:52 +0530)]
bus: ignore scan and probe failures
Bus scan is responsible for finding devices over *all* buses.
Some of these buses might not be able to scan, but that should
not prevent other buses from being scanned.
Same is the case for probing. It is possible that some devices which
were scanned didn't have a specific driver. That should not prevent
other buses from being probed.
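A minimal sketch of the "log and continue" behavior described above. The
toy_bus type, the sample callbacks and the failure-count return value are
assumptions for this example and do not reproduce the EAL bus code.

```c
#include <stddef.h>
#include <stdio.h>

struct toy_bus {		/* hypothetical, not the EAL bus type */
	const char *name;
	int (*scan)(void);
};

static int scanned;		/* counts successful scans in this example */

static int scan_ok(void)  { scanned++; return 0; }
static int scan_bad(void) { return -1; }

/* Scan every bus; report failures but keep going. Returns failure count. */
static int
scan_all(const struct toy_bus *buses, size_t nb)
{
	size_t i;
	int failed = 0;

	for (i = 0; i < nb; i++) {
		if (buses[i].scan() != 0) {
			fprintf(stderr, "scan of bus %s failed\n",
				buses[i].name);
			failed++;	/* do not return early */
		}
	}
	return failed;
}
```

The same loop shape applies to probing: one bus failing to probe does not
stop the remaining buses from being visited.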
Bruce Richardson [Wed, 11 Oct 2017 11:28:17 +0000 (12:28 +0100)]
vhost: fix false-positive warning from clang 5
When compiling with clang extra warning flags, such as used by default with
meson, a warning is given in iotlb.c:
lib/librte_vhost/iotlb.c:318:6: warning:
variable 'socket' is used uninitialized whenever
'if' condition is false [-Wsometimes-uninitialized]
This is a false positive, as the socket value will be initialized by the
call to get_mempolicy in the case where the NUMA build-time flag is set,
and in cases where it is not set, "if (ret)" will always be true as ret is
initialized to -1 and never changed.
However, this is not immediately obvious, and is perhaps a little fragile,
as it will break if other code using ret is subsequently added above the
call to get_mempolicy by someone unaware of this subtle dependency.
Therefore, we can fix the warning and make the code more robust by
explicitly initializing socket to zero, and moving the extra condition
check on the return from get_mempolicy() into the #ifdef.
Nikhil Rao [Tue, 10 Oct 2017 22:21:36 +0000 (03:51 +0530)]
eventdev: add eth Rx adapter implementation
The adapter implementation uses eventdev PMDs to configure the packet
transfer if HW support is available and if not, it uses an EAL service
function that reads packets from ethernet Rx queues and injects these
as events into the event device.
Nikhil Rao [Tue, 10 Oct 2017 22:21:35 +0000 (03:51 +0530)]
eventdev: add event type for eth Rx adapter
Add RTE_EVENT_TYPE_ETH_RX_ADAPTER event type. Certain platforms (e.g.,
octeontx), in the event dequeue function, need to identify events
injected from ethernet hardware into eventdev so that DPDK mbuf can be
populated from the HW descriptor.
Events injected from ethernet hardware would use an event type of
RTE_EVENT_TYPE_ETHDEV and events injected from the rx adapter service
function would use an event type of RTE_EVENT_TYPE_ETH_RX_ADAPTER to
help the event dequeue function differentiate between these two event
sources.
Signed-off-by: Nikhil Rao <nikhil.rao@intel.com> Acked-by: Jerin Jacob <jerin.jacob@caviumnetworks.com>
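To illustrate how a dequeue path might use the two event types, here is a
sketch with made-up TOY_* enums that mirror the names above. The values and
the classification function are assumptions, not the eventdev definitions.

```c
enum toy_event_type {		/* hypothetical mirror of the event types */
	TOY_EVENT_TYPE_ETHDEV,		/* injected by ethernet hardware */
	TOY_EVENT_TYPE_ETH_RX_ADAPTER,	/* injected by the SW service */
	TOY_EVENT_TYPE_CPU
};

enum toy_action {
	TOY_BUILD_MBUF_FROM_HW_DESC,	/* mbuf must be populated first */
	TOY_USE_MBUF_AS_IS,		/* mbuf was filled by the Rx burst */
	TOY_NOT_A_PACKET
};

struct toy_event {
	enum toy_event_type type;
};

/* Decide how the dequeue function should treat an event. */
static enum toy_action
toy_classify_event(const struct toy_event *ev)
{
	switch (ev->type) {
	case TOY_EVENT_TYPE_ETHDEV:
		return TOY_BUILD_MBUF_FROM_HW_DESC;
	case TOY_EVENT_TYPE_ETH_RX_ADAPTER:
		return TOY_USE_MBUF_AS_IS;
	default:
		return TOY_NOT_A_PACKET;
	}
}
```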
Nikhil Rao [Tue, 10 Oct 2017 22:21:34 +0000 (03:51 +0530)]
eventdev: add eth Rx adapter API
Add common APIs for configuring packet transfer from ethernet Rx
queues to event devices across HW & SW packet transfer mechanisms.
A detailed description of the adapter is contained in the header's
comments.
Signed-off-by: Nikhil Rao <nikhil.rao@intel.com> Acked-by: Jerin Jacob <jerin.jacob@caviumnetworks.com>
Nikhil Rao [Tue, 10 Oct 2017 22:21:32 +0000 (03:51 +0530)]
eventdev: add PMD callbacks for eth Rx adapter
The PMD callbacks are used by the rte_event_eth_rx_xxx() APIs to
configure and control the ethernet receive adapter if packet transfer
from the ethdev to the eventdev is implemented in hardware.
Signed-off-by: Nikhil Rao <nikhil.rao@intel.com> Acked-by: Jerin Jacob <jerin.jacob@caviumnetworks.com>
Nikhil Rao [Tue, 10 Oct 2017 22:21:31 +0000 (03:51 +0530)]
eventdev: add capabilities API
The caps API allows an application to retrieve the capability information
needed to configure the ethernet Rx adapter for the eventdev and
ethdev pair.
For example, the ethdev, eventdev pairing may be such that all of the
ethdev Rx queues can only be connected to a single event queue. In
this case the application is required to pass in -1 as the queue id
when adding a receive queue to the adapter.
Signed-off-by: Nikhil Rao <nikhil.rao@intel.com> Acked-by: Jerin Jacob <jerin.jacob@caviumnetworks.com>
Harry van Haaren [Wed, 20 Sep 2017 13:36:03 +0000 (14:36 +0100)]
eventdev: bump library version
This commit bumps the library version to reflect the ABI change
caused by removing the individual rte_event_port_count, queue_count,
and other get functions. These functions are superseded by the
get-attribute style API, which allows fetching values without API/ABI
changes.
Signed-off-by: Harry van Haaren <harry.van.haaren@intel.com>
Harry van Haaren [Wed, 20 Sep 2017 13:36:02 +0000 (14:36 +0100)]
eventdev: add device started attribute
This commit adds an attribute to the eventdev, allowing applications
to retrieve if the eventdev is running or stopped. Note that no API
or ABI changes were required in adding the statistic, and code changes
are minimal.
Signed-off-by: Harry van Haaren <harry.van.haaren@intel.com> Acked-by: Jerin Jacob <jerin.jacob@caviumnetworks.com>
Harry van Haaren [Wed, 20 Sep 2017 13:36:01 +0000 (14:36 +0100)]
eventdev: add queue attribute function
This commit adds a generic queue attribute function. It also removes
the previous rte_event_queue_priority() and priority() functions, and
updates the map files and unit tests to use the new attr functions.
Signed-off-by: Harry van Haaren <harry.van.haaren@intel.com>
Harry van Haaren [Wed, 20 Sep 2017 13:36:00 +0000 (14:36 +0100)]
eventdev: add dev attribute get function
This commit adds a device attribute function, allowing flexible
fetching of device attributes, like port count or queue count.
The unit tests and .map file are updated to the new function.
Signed-off-by: Harry van Haaren <harry.van.haaren@intel.com> Acked-by: Jerin Jacob <jerin.jacob@caviumnetworks.com>
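A sketch of the get-attribute style, assuming a hypothetical toy_dev type
and TOY_DEV_ATTR_* identifiers. The real function and attribute names are
not given in this log, so everything below is illustrative.

```c
#include <stddef.h>
#include <stdint.h>

enum toy_dev_attr {		/* hypothetical attribute identifiers */
	TOY_DEV_ATTR_PORT_COUNT,
	TOY_DEV_ATTR_QUEUE_COUNT,
	TOY_DEV_ATTR_STARTED
};

struct toy_dev {
	uint8_t nb_ports;
	uint8_t nb_queues;
	uint8_t started;
};

/*
 * Single entry point for all attributes. A new attribute adds an enum
 * value and a case here; the function signature stays unchanged.
 */
static int
toy_dev_attr_get(const struct toy_dev *dev, uint32_t attr_id,
		 uint32_t *value)
{
	if (dev == NULL || value == NULL)
		return -1;
	switch (attr_id) {
	case TOY_DEV_ATTR_PORT_COUNT:
		*value = dev->nb_ports;
		break;
	case TOY_DEV_ATTR_QUEUE_COUNT:
		*value = dev->nb_queues;
		break;
	case TOY_DEV_ATTR_STARTED:
		*value = dev->started;
		break;
	default:
		return -1;	/* unknown attribute */
	}
	return 0;
}
```

Because callers pass an identifier rather than calling one function per
value, attributes can be added later without changing exported symbols.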
Harry van Haaren [Wed, 20 Sep 2017 13:35:59 +0000 (14:35 +0100)]
eventdev: add port attribute function
This commit reworks the port functions to retrieve information
about the port, like the enq or deq depths. Note that "port count"
is a device attribute, and is added in a later patch for dev attributes.
Signed-off-by: Harry van Haaren <harry.van.haaren@intel.com>
Tim McDaniel [Wed, 6 Sep 2017 15:42:07 +0000 (10:42 -0500)]
eventdev: clarify usage of forward and release ops
Update doxygen to make it clear that RTE_EVENT_OP_FORWARD and
RTE_EVENT_OP_RELEASE must only be enqueued to the same port that the
original event was dequeued from.
Signed-off-by: Tim McDaniel <timothy.mcdaniel@intel.com> Acked-by: Jerin Jacob <jerin.jacob@caviumnetworks.com>
Events sent through single-link queues are naturally in-order and
atomic, without reordering or atomic scheduling. Logically the
nb_atomic_flows and nb_atomic_order_sequences arguments don't apply to a
single link queue, but applications must set these (depending on the queue
config type) to bypass the is_valid_{ordered, atomic}_queue_conf() checks
in the eventdev layer.
This commit updates those is_valid_* functions to ignore queues with the
SINGLE_LINK flag, to simplify their configuration.
Signed-off-by: Gage Eads <gage.eads@intel.com> Acked-by: Jerin Jacob <jerin.jacob@caviumnetworks.com>
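A sketch of the relaxed check, using hypothetical flag values and a
simplified toy_queue_conf. Only the idea of skipping single-link queues
comes from the description above; the rest is assumed for illustration.

```c
#include <stddef.h>
#include <stdint.h>

#define TOY_QUEUE_CFG_ATOMIC      (1u << 0)	/* hypothetical flag values */
#define TOY_QUEUE_CFG_SINGLE_LINK (1u << 1)

struct toy_queue_conf {
	uint32_t flags;
	uint32_t nb_atomic_flows;
};

/* Nonzero when nb_atomic_flows has to be validated for this queue. */
static int
toy_is_valid_atomic_queue_conf(const struct toy_queue_conf *conf)
{
	if (conf == NULL || (conf->flags & TOY_QUEUE_CFG_SINGLE_LINK))
		return 0;	/* single-link queues are skipped */
	return (conf->flags & TOY_QUEUE_CFG_ATOMIC) != 0;
}

/* Reject an atomic queue whose flow count is zero or above the limit. */
static int
toy_queue_setup(const struct toy_queue_conf *conf, uint32_t max_flows)
{
	if (toy_is_valid_atomic_queue_conf(conf) &&
	    (conf->nb_atomic_flows == 0 ||
	     conf->nb_atomic_flows > max_flows))
		return -1;
	return 0;
}
```

With this shape, a single-link queue can leave nb_atomic_flows at zero and
still pass setup.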
Maxime Coquelin [Tue, 10 Oct 2017 12:47:54 +0000 (14:47 +0200)]
vhost: distinguish master and slave requests
This patch adds a union in VhostUserMsg to distinguish between
master and slave initiated requests, instead of casting slave
requests as master requests.
Signed-off-by: Maxime Coquelin <maxime.coquelin@redhat.com> Acked-by: Yuanhan Liu <yliu@fridaylinux.org>
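The idea can be sketched with a simplified message type. The toy_* enums
and fields below are hypothetical and much smaller than the real
VhostUserMsg layout.

```c
#include <stdint.h>

enum toy_master_req { TOY_MASTER_NONE = 0, TOY_MASTER_SET_VRING_ADDR = 1 };
enum toy_slave_req  { TOY_SLAVE_NONE = 0,  TOY_SLAVE_IOTLB_MSG = 1 };

struct toy_msg {
	union {			/* same storage, two typed views */
		enum toy_master_req master;
		enum toy_slave_req slave;
	} request;
	uint32_t flags;
	uint32_t size;
};

/* Build a slave-initiated message without casting to the master enum. */
static struct toy_msg
toy_slave_msg(enum toy_slave_req req, uint32_t size)
{
	struct toy_msg msg = { .flags = 0, .size = size };

	msg.request.slave = req;
	return msg;
}
```

The union keeps a single request field in the message while letting each
direction use its own enum type.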
Added new callbacks to notify about socket connection status.
As destroy_device is used for virtqueue processing *pause* as well as
connection close, the user cannot distinguish between the two.
Consider the following scenario:
rte_vhost: received SET_VRING_BASE message,
calling destroy_device() as usual
user: end-user asks to remove the device (together with socket file),
OK, device is not *in use* - that's NOT the behavior we want
calling rte_vhost_driver_unregister() etc.
Instead of changing new_device/destroy_device callbacks and breaking
the ABI, a set of new functions new_connection/destroy_connection
has been added.
Maxime Coquelin [Thu, 5 Oct 2017 08:36:25 +0000 (10:36 +0200)]
vhost: postpone device creation until rings are mapped
Translating the start addresses of the rings is not enough; we need to
be sure the whole ring is made available by the guest.
This depends on the size of the rings, which is not known on SET_VRING_ADDR
reception. Furthermore, we need to be safe against vring page
invalidations.
This patch introduces a new access_ok flag per virtqueue, which is set
when all the rings are mapped, and cleared as soon as a page used by a
ring is invalidated. The invalidation part is implemented in a following
patch.
Signed-off-by: Maxime Coquelin <maxime.coquelin@redhat.com> Acked-by: Yuanhan Liu <yliu@fridaylinux.org>
Maxime Coquelin [Thu, 5 Oct 2017 08:36:24 +0000 (10:36 +0200)]
vhost: translate ring addresses when IOMMU enabled
When IOMMU is enabled, the ring addresses set by the
VHOST_USER_SET_VRING_ADDR requests are guest's IO virtual addresses,
whereas they are QEMU virtual addresses when IOMMU is disabled.
When enabled and the required translation is not in the IOTLB cache,
an IOTLB miss request is sent, but being called by the vhost-user
socket handling thread, the function does not wait for the requested
IOTLB update.
The function will be called again on the next IOTLB update message
reception if matching the vring addresses.
Signed-off-by: Maxime Coquelin <maxime.coquelin@redhat.com> Acked-by: Yuanhan Liu <yliu@fridaylinux.org>
Maxime Coquelin [Thu, 5 Oct 2017 08:36:23 +0000 (10:36 +0200)]
vhost: postpone rings addresses translation
This patch postpones ring address translations and checks, as
addresses sent by the master should not be interpreted as long as
the ring is not started and enabled[0].
When protocol features aren't negotiated, the ring is started in
enabled state, so the addresses translations are postponed to
vhost_user_set_vring_kick().
Otherwise, it is postponed to when ring is enabled, in
vhost_user_set_vring_enable().
Maxime Coquelin [Thu, 5 Oct 2017 08:36:22 +0000 (10:36 +0200)]
vhost: fix dereferencing invalid pointer after realloc
numa_realloc() reallocates the virtio_net device structure and
updates the vhost_devices[] table with the new pointer if the rings
are allocated on a different NUMA node.
The problem is that vhost_user_msg_handler() still dereferences the old
pointer afterward.
This patch prevents this by fetching again the dev pointer in
vhost_devices[] after messages have been handled.
Fixes: af295ad4698c ("vhost: realloc device and queues to same numa node as vring desc") Cc: stable@dpdk.org Signed-off-by: Maxime Coquelin <maxime.coquelin@redhat.com> Acked-by: Yuanhan Liu <yliu@fridaylinux.org>
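A toy model of the stale-pointer hazard and of the re-fetch that avoids it.
The toy_dev type, the toy_devices[] table and both functions are invented
for this sketch and do not reproduce the vhost code.

```c
#include <stdlib.h>

struct toy_dev {
	int vid;
	int numa_node;
};

static struct toy_dev *toy_devices[4];	/* stand-in for the device table */

/* May move the device: allocates a copy, updates the table, frees old. */
static int
toy_numa_realloc(int vid, int node)
{
	struct toy_dev *old = toy_devices[vid];
	struct toy_dev *new_dev;

	if (old->numa_node == node)
		return 0;
	new_dev = malloc(sizeof(*new_dev));
	if (new_dev == NULL)
		return -1;
	*new_dev = *old;
	new_dev->numa_node = node;
	toy_devices[vid] = new_dev;
	free(old);
	return 0;
}

/* Returns the device's NUMA node after handling, or -1 on error. */
static int
toy_msg_handler(int vid, int ring_node)
{
	struct toy_dev *dev = toy_devices[vid];

	if (dev == NULL)
		return -1;
	if (toy_numa_realloc(vid, ring_node) != 0)
		return -1;
	dev = toy_devices[vid];	/* re-fetch: the old pointer may be freed */
	return dev->numa_node;
}
```

Without the re-fetch, the final dereference would read freed memory
whenever the reallocation actually moved the device.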
Maxime Coquelin [Thu, 5 Oct 2017 08:36:21 +0000 (10:36 +0200)]
vhost: enable rings at the right time
When VHOST_USER_F_PROTOCOL_FEATURES is negotiated, the ring is not
enabled when started, but enabled through dedicated
VHOST_USER_SET_VRING_ENABLE request.
When not negotiated, the ring is started in enabled state, at
VHOST_USER_SET_VRING_KICK request time.
Signed-off-by: Maxime Coquelin <maxime.coquelin@redhat.com> Acked-by: Yuanhan Liu <yliu@fridaylinux.org>
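The two cases can be summarized as a small state sketch. The toy_ring type
and handler names are hypothetical; only the enable timing follows the
description above.

```c
#include <stdbool.h>

struct toy_ring {
	bool started;
	bool enabled;
};

/* Models the kick request: always starts the ring. */
static void
toy_set_vring_kick(struct toy_ring *r, bool protocol_features)
{
	r->started = true;
	if (!protocol_features)
		r->enabled = true;	/* legacy: started in enabled state */
}

/* Models the dedicated enable request used with protocol features. */
static void
toy_set_vring_enable(struct toy_ring *r, bool enable)
{
	r->enabled = enable;
}
```

With protocol features negotiated, a ring stays disabled after the kick
until the dedicated enable request arrives.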
Maxime Coquelin [Thu, 5 Oct 2017 08:36:19 +0000 (10:36 +0200)]
vhost: introduce guest IOVA to backend VA helper
This patch introduces vhost_iova_to_vva() function to translate
guest's IO virtual addresses to backend's virtual addresses.
When IOMMU is enabled, the IOTLB cache is queried to get the
translation. If missing from the IOTLB cache, an IOTLB_MISS request
is sent to Qemu, and IOTLB cache is queried again on IOTLB event
notification.
When IOMMU is disabled, the passed address is a guest's physical
address, so the legacy rte_vhost_gpa_to_vva() API is used.
Signed-off-by: Maxime Coquelin <maxime.coquelin@redhat.com> Acked-by: Yuanhan Liu <yliu@fridaylinux.org>
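A simplified sketch of the translation flow, assuming a flat array as the
IOTLB cache and a counter in place of the IOTLB_MISS request. All toy_*
names, the fixed offset in toy_gpa_to_vva() and the "0 means miss"
convention are assumptions for this example.

```c
#include <stddef.h>
#include <stdint.h>

struct toy_iotlb_entry {
	uint64_t iova;	/* start of the guest IO virtual range */
	uint64_t uaddr;	/* backend virtual address it maps to */
	uint64_t size;
};

struct toy_vq {
	const struct toy_iotlb_entry *cache;
	size_t nb_entries;
	int iommu_enabled;
	unsigned int miss_requests;	/* misses that would be reported */
};

/* Arbitrary stand-in for the legacy guest-physical translation. */
static uint64_t
toy_gpa_to_vva(uint64_t gpa)
{
	return gpa + 0x100000;
}

/* Returns the backend virtual address, or 0 on an IOTLB miss. */
static uint64_t
toy_iova_to_vva(struct toy_vq *vq, uint64_t iova)
{
	size_t i;

	if (!vq->iommu_enabled)
		return toy_gpa_to_vva(iova);
	for (i = 0; i < vq->nb_entries; i++) {
		const struct toy_iotlb_entry *e = &vq->cache[i];

		if (iova >= e->iova && iova - e->iova < e->size)
			return e->uaddr + (iova - e->iova);
	}
	vq->miss_requests++;	/* real code would send a miss request */
	return 0;
}
```

On a miss the caller gets no address and is expected to retry once the
cache has been updated.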
Maxime Coquelin [Thu, 5 Oct 2017 08:36:18 +0000 (10:36 +0200)]
vhost: handle IOTLB update and invalidate requests
Vhost-user device IOTLB protocol extension introduces
VHOST_USER_IOTLB message type. The associated payload is the
vhost_iotlb_msg struct defined in Kernel, which in this case can
be either an IOTLB update or invalidate message.
On IOTLB update, the virtqueues get notified of a new entry.
Signed-off-by: Maxime Coquelin <maxime.coquelin@redhat.com> Acked-by: Yuanhan Liu <yliu@fridaylinux.org>
Maxime Coquelin [Thu, 5 Oct 2017 08:36:17 +0000 (10:36 +0200)]
vhost: initialize vrings IOTLB caches
The per-virtqueue IOTLB cache init is done at virtqueue
init time. init_vring_queue() now takes vring id as parameter,
so that the IOTLB cache mempool name can be generated.
Signed-off-by: Maxime Coquelin <maxime.coquelin@redhat.com> Acked-by: Yuanhan Liu <yliu@fridaylinux.org>
Maxime Coquelin [Thu, 5 Oct 2017 08:36:15 +0000 (10:36 +0200)]
vhost: add pending IOTLB miss request list and helpers
In order to be able to handle other ports or queues while waiting
for an IOTLB miss reply, a pending list is created so that the waiter
can return and restart later on by sending a miss request again.
Signed-off-by: Maxime Coquelin <maxime.coquelin@redhat.com> Acked-by: Yuanhan Liu <yliu@fridaylinux.org>
Maxime Coquelin [Thu, 5 Oct 2017 08:36:12 +0000 (10:36 +0200)]
vhost: support slave requests channel
Currently, only QEMU sends requests, the backend sends
replies. In some cases, the backend may need to send
requests to QEMU, like IOTLB miss events when IOMMU is
supported.
This patch introduces a new channel for such requests.
QEMU sends a file descriptor of a new socket using
VHOST_USER_SET_SLAVE_REQ_FD.
Signed-off-by: Maxime Coquelin <maxime.coquelin@redhat.com> Acked-by: Yuanhan Liu <yliu@fridaylinux.org>