dpdk.git
Qi Zhang [Thu, 12 Jul 2018 14:01:42 +0000 (22:01 +0800)]
vfio: fix PCI address comparison

When memcmp is used to compare two PCI addresses, the whole
struct rte_pci_addr is compared. The struct is 4-byte aligned, so its
size is 8 bytes, while only 7 bytes hold valid data. Comparing the 8th
(padding) byte gives unexpected results, which happens when a device is
repeatedly attached and detached.

Fixes: 94c0776b1bad ("vfio: support hotplug")
Cc: stable@dpdk.org
Signed-off-by: Qi Zhang <qi.z.zhang@intel.com>
Acked-by: Gaetan Rivet <gaetan.rivet@6wind.com>
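
The padding problem described in this commit can be illustrated with a
minimal sketch. The struct below is a hypothetical stand-in with the same
7-bytes-of-data layout, and pci_addr_cmp() is an illustrative name, not
necessarily the helper the patch uses. Comparing field by field never reads
the padding byte, so two addresses with equal fields always compare equal.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for struct rte_pci_addr: 7 bytes of fields,
 * typically padded to 8 bytes by the 4-byte alignment of `domain`. */
struct pci_addr {
	uint32_t domain;
	uint8_t bus;
	uint8_t devid;
	uint8_t function;
};

/* Compare field by field so the padding byte is never inspected. */
static int
pci_addr_cmp(const struct pci_addr *a, const struct pci_addr *b)
{
	assert(a != NULL && b != NULL);
	if (a->domain != b->domain)
		return a->domain < b->domain ? -1 : 1;
	if (a->bus != b->bus)
		return a->bus < b->bus ? -1 : 1;
	if (a->devid != b->devid)
		return a->devid < b->devid ? -1 : 1;
	if (a->function != b->function)
		return a->function < b->function ? -1 : 1;
	return 0;
}
```

A memcmp() over sizeof(struct pci_addr) would also look at the trailing
padding byte, whose content is not guaranteed to match.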
Qi Zhang [Thu, 12 Jul 2018 14:01:41 +0000 (22:01 +0800)]
eal: fix hotplug add and remove

Hotplug-adding an already plugged PCI device corrupts
rte_pci_device->device.name because of an unexpected
rte_devargs_remove. Likewise, hotplug-removing an already
unplugged device causes a segmentation fault because of an unexpected
bus->unplug on an rte_device whose driver is NULL.
This patch fixes both issues.

Fixes: 7e8b26650146 ("eal: fix hotplug add / remove")
Cc: stable@dpdk.org
Signed-off-by: Qi Zhang <qi.z.zhang@intel.com>
Acked-by: Gaetan Rivet <gaetan.rivet@6wind.com>
Thomas Monjalon [Wed, 18 Jul 2018 21:26:58 +0000 (23:26 +0200)]
devtools: fix symbol check for filename with space

If the patch filename or the temporary file path has a space
in its name, the script check-symbol-change.sh does not work.
The variables for the filenames must be enclosed in quotes
in order to preserve spaces.

Fixes: 4bec48184e33 ("devtools: add checks for ABI symbol addition")

Signed-off-by: Thomas Monjalon <thomas@monjalon.net>
Acked-by: Neil Horman <nhorman@tuxdriver.com>
Anatoly Burakov [Tue, 17 Jul 2018 15:41:45 +0000 (16:41 +0100)]
mem: add logic check for static analyzer

Technically, single file segments codepath will never get
triggered when using in-memory mode, because EAL prohibits
mixing these two options at initialization time. However,
code analyzers do not know that, and some will complain
about either using uninitialized variables, or trying to
do operations on an already closed descriptor.

Fix this by assuring the compiler or code analyzer that
in-memory mode code never gets triggered when using
single-file segments mode.

Coverity issue: 302847
Fixes: 72b49ff623c4 ("mem: support --in-memory mode")

Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
Anatoly Burakov [Thu, 19 Jul 2018 09:42:46 +0000 (10:42 +0100)]
malloc: do not skip pad on free

Previously, we were skipping erasing pad because we were
expecting it to be freed when we were merging adjacent
segments. However, if there were no adjacent segments to
merge, we would've skipped erasing the pad, leaving non-zero
memory in our free space.

Fix this by including pad in the erasing unconditionally.

Fixes: e43a9f52b7ff ("malloc: fix pad erasing")
Cc: stable@dpdk.org
Reported-by: Andrew Rybchenko <arybchenko@solarflare.com>
Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
Tested-by: Andrew Rybchenko <arybchenko@solarflare.com>
Andrew Rybchenko [Wed, 18 Jul 2018 07:23:30 +0000 (08:23 +0100)]
devargs: fix parsing truncation when using format

Space for the string-terminating NUL character must be provided to
snprintf() to avoid truncating the last character.

Fixes: a23bc2c4e01b ("devargs: add non-variadic parsing function")

Reported-by: Ivan Malov <ivan.malov@oktetlabs.ru>
Signed-off-by: Andrew Rybchenko <arybchenko@solarflare.com>
Acked-by: Gaetan Rivet <gaetan.rivet@6wind.com>
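
As a minimal sketch of the truncation described in this commit: the size
passed to snprintf() counts the terminating NUL, so a buffer of strlen()
bytes loses the last character. copy_devstr() is a hypothetical helper
written for illustration only.

```c
#include <assert.h>
#include <stdio.h>

/* Copy `src` into `dst`; `dst_size` must include room for the NUL.
 * Returns 0 on success, -1 if the output had to be truncated. */
static int
copy_devstr(char *dst, size_t dst_size, const char *src)
{
	int len;

	assert(dst != NULL && src != NULL);
	len = snprintf(dst, dst_size, "%s", src);
	if (len < 0 || (size_t)len >= dst_size)
		return -1;
	return 0;
}
```

With an 8-character string, a 9-byte buffer succeeds, while an 8-byte
buffer keeps only the first 7 characters and reports truncation.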
Anatoly Burakov [Wed, 18 Jul 2018 10:53:42 +0000 (11:53 +0100)]
eal: fix dependency in multi-process detection

Currently, we need runtime dir to put all of our runtime info in,
including the DPDK shared config. However, we use the shared
config to determine our proc type, and this happens earlier than
we actually create the config dir and thus can know where to
place the config file.

Fix this by moving runtime dir creation right after the EAL
arguments parsing, but before proc type autodetection. Also,
previously we were creating the config file unconditionally,
even if we specified no_shconf - fix it by only creating
the config file if no_shconf is not set.

Fixes: adf1d867361c ("eal: move runtime config file to new location")

Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
Tested-by: Lei Yao <lei.a.yao@intel.com>
Anatoly Burakov [Mon, 16 Jul 2018 14:57:19 +0000 (15:57 +0100)]
mem: fix alignment of requested virtual areas

The original code did not align any addresses that were requested as
page-aligned, but were different because addr_is_hint was set.

The fix below by Dariusz introduced an issue where all unaligned addresses
were left unaligned.

This patch is a partial revert of
commit 7fa7216ed48d ("mem: fix alignment of requested virtual areas")

and implements a proper fix for this issue, by asking for alignment in all
but the following two cases:

1) page size is equal to system page size, or
2) we got an aligned requested address, and will not accept a different one

This ensures that alignment is performed in all cases, except for those we
can guarantee that the address will not need alignment.

Fixes: b7cc54187ea4 ("mem: move virtual area function in common directory")
Fixes: 7fa7216ed48d ("mem: fix alignment of requested virtual areas")
Cc: stable@dpdk.org
Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
Tested-by: Lei Yao <lei.a.yao@intel.com>
Acked-by: Dariusz Stojaczyk <dariuszx.stojaczyk@intel.com>
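
The two exceptions listed in this commit can be sketched as a small
predicate. The function and parameter names are hypothetical and chosen for
illustration; only addr_is_hint appears in the commit message.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Decide whether the mapped area must be aligned by hand.
 * Alignment is skipped only in the two cases named in the commit. */
static bool
need_manual_align(size_t page_sz, size_t sys_page_sz,
		uintptr_t requested, bool addr_is_hint)
{
	assert(page_sz != 0 && sys_page_sz != 0);
	/* Case 1: page size equals the system page size. */
	if (page_sz == sys_page_sz)
		return false;
	/* Case 2: an aligned address was requested and no other
	 * address will be accepted. */
	if (requested != 0 && requested % page_sz == 0 && !addr_is_hint)
		return false;
	return true;
}
```

Every other combination, including an aligned address that is only a hint,
asks for alignment.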
Pablo de Lara [Mon, 16 Jul 2018 06:26:27 +0000 (07:26 +0100)]
devargs: fix build with gcc 4.7

Fixed possible out-of-bounds issue:

lib/librte_eal/common/eal_common_devargs.c:
In function ‘rte_devargs_layers_parse’:
lib/librte_eal/common/eal_common_devargs.c:121:7:
error: array subscript is above array bounds

Bugzilla ID: 71
Fixes: 338327d731e6 ("devargs: add function to parse device layers")

Signed-off-by: Pablo de Lara <pablo.de.lara.guarch@intel.com>
Acked-by: Gaetan Rivet <gaetan.rivet@6wind.com>
Thomas Monjalon [Sun, 15 Jul 2018 23:17:18 +0000 (01:17 +0200)]
version: 18.08-rc1

Signed-off-by: Thomas Monjalon <thomas@monjalon.net>
Neil Horman [Wed, 27 Jun 2018 18:01:01 +0000 (14:01 -0400)]
devtools: add checks for ABI symbol addition

Recently, some additional patches were added to allow for programmatic
marking of C symbols as experimental.  The addition of these markers is
dependent on the manual addition of exported symbols to the EXPERIMENTAL
section of the corresponding library's version map file.  The consensus
on review is that, in addition to mandating the addition of symbols to
the EXPERIMENTAL version in the map, we need a mechanism to enforce our
documented process of mandating that addition when they are introduced.
To that end, I am proposing this change.  It is an addition to the
checkpatches script, which scans incoming patches for additions and
removals of symbols to the map file, and warns the user appropriately.

Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
Ferruh Yigit [Thu, 5 Jul 2018 16:58:00 +0000 (17:58 +0100)]
app/testpmd: fix typo in setting Tx offload command

udp_cksum is duplicated; the second one should be tcp_cksum.

Fixes: c73a9071877a ("app/testpmd: add commands to test new offload API")
Cc: stable@dpdk.org
Signed-off-by: Ferruh Yigit <ferruh.yigit@intel.com>
Acked-by: Bernard Iremonger <bernard.iremonger@intel.com>
Ferruh Yigit [Tue, 3 Jul 2018 18:44:52 +0000 (19:44 +0100)]
app/testpmd: set keep CRC offload flag

If the "--disable-crc-strip" testpmd parameter is issued, it removes the
DEV_RX_OFFLOAD_CRC_STRIP flag.
With the introduction of the new DEV_RX_OFFLOAD_KEEP_CRC offload flag, this
flag should also be set when this parameter is issued.

Fixes: 70815c9ecadd ("ethdev: add new offload flag to keep CRC")

Signed-off-by: Ferruh Yigit <ferruh.yigit@intel.com>
Acked-by: Bernard Iremonger <bernard.iremonger@intel.com>
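
The flag handling described in this commit can be sketched as follows. The
macro names and bit positions here are illustrative assumptions; the real
DEV_RX_OFFLOAD_* values are defined in rte_ethdev.h.

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative bit positions only, not the real DPDK values. */
#define RX_OFFLOAD_CRC_STRIP (UINT64_C(1) << 0)
#define RX_OFFLOAD_KEEP_CRC  (UINT64_C(1) << 1)

static_assert((RX_OFFLOAD_CRC_STRIP & RX_OFFLOAD_KEEP_CRC) == 0,
	"offload flags must not overlap");

/* Effect of "--disable-crc-strip": clear CRC_STRIP and set KEEP_CRC,
 * leaving every other offload bit untouched. */
static uint64_t
disable_crc_strip(uint64_t rx_offloads)
{
	rx_offloads &= ~RX_OFFLOAD_CRC_STRIP;
	rx_offloads |= RX_OFFLOAD_KEEP_CRC;
	return rx_offloads;
}
```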
Gaetan Rivet [Wed, 11 Jul 2018 21:45:02 +0000 (23:45 +0200)]
kvargs: add generic string matching callback

This function can be used as a callback to
rte_kvargs_process.

This should reduce code duplication.

Signed-off-by: Gaetan Rivet <gaetan.rivet@6wind.com>
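
A string-matching callback of this kind might look like the sketch below.
It follows the (key, value, opaque) handler shape that rte_kvargs_process
expects; match_string() itself is a hypothetical name and its return
convention is an assumption, not the function added by this commit.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Return 0 when `value` equals the string passed through `opaque`,
 * -1 otherwise. The key is not needed for a plain comparison. */
static int
match_string(const char *key, const char *value, void *opaque)
{
	const char *expected = opaque;

	(void)key;
	assert(expected != NULL);
	if (value == NULL)
		return -1;
	return strcmp(expected, value) == 0 ? 0 : -1;
}
```

Sharing one such handler avoids each caller writing its own comparison
wrapper.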
Gaetan Rivet [Wed, 11 Jul 2018 21:45:01 +0000 (23:45 +0200)]
eal: implement device iteration

Use the iteration hooks in the abstraction layers to perform the
requested filtering on the internal device lists.

Signed-off-by: Gaetan Rivet <gaetan.rivet@6wind.com>
Gaetan Rivet [Wed, 11 Jul 2018 21:45:00 +0000 (23:45 +0200)]
eal: implement device iteration initialization

Parse a device description.
Split this description into the relevant part for each layer.
No dynamic allocation is performed.

Signed-off-by: Gaetan Rivet <gaetan.rivet@6wind.com>
Gaetan Rivet [Wed, 11 Jul 2018 21:44:59 +0000 (23:44 +0200)]
eal: add device iterator interface

A device iterator allows iterating over a set of devices.
This set is defined by the two descriptions offered,

  * rte_bus
  * rte_class

Only one description can be provided, or both. It is not allowed to
provide no description at all.

Each layer of abstraction then performs a filter based on the
description provided. This filtering allows iterating on their internal
set of devices, stopping when a match is valid and returning the current
iteration context.

This context allows starting the next iteration from the same point and
going forward.

Signed-off-by: Gaetan Rivet <gaetan.rivet@6wind.com>
Gaetan Rivet [Wed, 11 Jul 2018 21:44:58 +0000 (23:44 +0200)]
devargs: add function to parse device layers

This function is private to the EAL.
It is used to parse each layer in a device description string,
and store the result in an rte_devargs structure.

Signed-off-by: Gaetan Rivet <gaetan.rivet@6wind.com>
Acked-by: Shreyansh Jain <shreyansh.jain@nxp.com>
Gaetan Rivet [Wed, 11 Jul 2018 21:44:57 +0000 (23:44 +0200)]
eal: introduce device class abstraction

This abstraction has existed since the infancy of DPDK.
However, it needs to be fleshed out to allow a generic
description of device properties and capabilities.

A device class is the northbound interface of the device, intended
for applications to know what it can be used for.

It is conceptually just above buses.

Signed-off-by: Gaetan Rivet <gaetan.rivet@6wind.com>
Gaetan Rivet [Wed, 11 Jul 2018 21:44:56 +0000 (23:44 +0200)]
eal: introduce destructor macros

This macro adds symbols to the .fini section using the global
RTE priorities, to ensure consistency.

Signed-off-by: Gaetan Rivet <gaetan.rivet@6wind.com>
Acked-by: Shreyansh Jain <shreyansh.jain@nxp.com>
Gaetan Rivet [Wed, 11 Jul 2018 21:44:55 +0000 (23:44 +0200)]
kvargs: introduce a more flexible parsing function

This function permits defining additional terminating characters,
ending the parsing at arbitrary delimiters.

Signed-off-by: Gaetan Rivet <gaetan.rivet@6wind.com>
Acked-by: Thomas Monjalon <thomas@monjalon.net>
Acked-by: Shreyansh Jain <shreyansh.jain@nxp.com>
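
The idea of stopping at caller-supplied terminating characters can be
sketched with strcspn(). copy_until_delim() is a hypothetical helper that
shows only the delimiter handling, not the key/value parsing itself.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Copy the part of `args` that precedes any character in `ends` into
 * `out`. Returns the number of bytes consumed, or -1 if `out` is too
 * small to hold them plus the terminating NUL. */
static int
copy_until_delim(const char *args, const char *ends,
		char *out, size_t out_size)
{
	size_t len;

	assert(args != NULL && ends != NULL && out != NULL);
	len = strcspn(args, ends);
	if (len >= out_size)
		return -1;
	memcpy(out, args, len);
	out[len] = '\0';
	return (int)len;
}
```

The extracted prefix can then be handed to an ordinary key/value parser,
while the caller continues after the delimiter.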
Gaetan Rivet [Wed, 11 Jul 2018 21:44:54 +0000 (23:44 +0200)]
kvargs: build before EAL

This library will be used by the EAL to parse parameters.

Signed-off-by: Gaetan Rivet <gaetan.rivet@6wind.com>
Gaetan Rivet [Wed, 11 Jul 2018 21:44:53 +0000 (23:44 +0200)]
kvargs: remove error logs

Error logs in kvargs parsing should be better handled in components
calling the library.

This library must be as lean as possible.

Signed-off-by: Gaetan Rivet <gaetan.rivet@6wind.com>
Acked-by: Thomas Monjalon <thomas@monjalon.net>
Gaetan Rivet [Wed, 11 Jul 2018 21:44:52 +0000 (23:44 +0200)]
devargs: add non-variadic parsing function

rte_devargs_parse becomes non-variadic,
rte_devargs_parsef becomes the variadic version, to be used to compose
device strings.

Signed-off-by: Gaetan Rivet <gaetan.rivet@6wind.com>
Acked-by: Shreyansh Jain <shreyansh.jain@nxp.com>
Acked-by: Thomas Monjalon <thomas@monjalon.net>
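
The split between a non-variadic parser and a variadic front end can be
sketched as below. dev_parse() and dev_parsef() are hypothetical names, and
the "parser" only stores the string; the point is that the variadic version
composes the device string and then delegates.

```c
#include <assert.h>
#include <stdarg.h>
#include <stdio.h>
#include <string.h>

#define DEVSTR_MAX 64

/* Non-variadic entry point: takes a finished device string.
 * `out` must hold at least DEVSTR_MAX bytes. */
static int
dev_parse(char *out, const char *devstr)
{
	assert(out != NULL && devstr != NULL);
	if (strlen(devstr) >= DEVSTR_MAX)
		return -1;
	strcpy(out, devstr);
	return 0;
}

/* Variadic wrapper: build the string from a format, then delegate. */
static int
dev_parsef(char *out, const char *fmt, ...)
{
	char buf[DEVSTR_MAX];
	va_list ap;
	int len;

	va_start(ap, fmt);
	len = vsnprintf(buf, sizeof(buf), fmt, ap);
	va_end(ap);
	if (len < 0 || (size_t)len >= sizeof(buf))
		return -1;
	return dev_parse(out, buf);
}
```

Callers that already hold a complete string use the plain version and avoid
accidental format-string interpretation.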
Gaetan Rivet [Wed, 11 Jul 2018 21:44:51 +0000 (23:44 +0200)]
devargs: use log functions

Use the standard EAL logging functions in rte_devargs.

Signed-off-by: Gaetan Rivet <gaetan.rivet@6wind.com>
Thomas Monjalon [Sun, 15 Jul 2018 20:32:49 +0000 (22:32 +0200)]
bus/vmbus: fix build without libuuid

The dependency on libuuid is useless because the required code
is embedded in EAL, see commit 6bc67c497a51 ("eal: add uuid API").

Fixes: 831dba47bd36 ("bus/vmbus: add Hyper-V virtual bus support")

Signed-off-by: Thomas Monjalon <thomas@monjalon.net>
Kiran Kumar [Wed, 11 Jul 2018 08:41:59 +0000 (14:11 +0530)]
ethdev: check queue stats mapping input arguments

The current implementation does not check the queue_id and stat_idx
ranges in the stats mapping function. This patch adds checks for
both ranges.

Fixes: 5de201df892 ("ethdev: add stats per queue")

Signed-off-by: Kiran Kumar <kkokkilagadda@caviumnetworks.com>
Acked-by: Andrew Rybchenko <arybchenko@solarflare.com>
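
The range checks described in this commit can be sketched as a standalone
validator. check_stats_mapping() and the QUEUE_STAT_CNTRS limit are
illustrative assumptions; in DPDK the counter limit comes from the build
configuration.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* Illustrative limit on the number of per-queue stat counters. */
#define QUEUE_STAT_CNTRS 16

static_assert(QUEUE_STAT_CNTRS <= UINT8_MAX,
	"stat_idx is carried in a uint8_t");

/* Reject a mapping whose queue or counter index is out of range. */
static int
check_stats_mapping(uint16_t queue_id, uint16_t nb_queues, uint8_t stat_idx)
{
	if (queue_id >= nb_queues)
		return -EINVAL;
	if (stat_idx >= QUEUE_STAT_CNTRS)
		return -EINVAL;
	return 0;
}
```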
Stephen Hemminger [Fri, 13 Jul 2018 17:06:44 +0000 (10:06 -0700)]
net/netvsc: add documentation

Matching documentation for new netvsc device.
Includes a brief note about the restart issue.

Signed-off-by: Stephen Hemminger <sthemmin@microsoft.com>
Stephen Hemminger [Fri, 13 Jul 2018 17:06:43 +0000 (10:06 -0700)]
net/netvsc: add Hyper-V network device

The driver supports Hyper-V networking directly like
virtio for KVM or vmxnet3 for VMware.

This code is based off of the FreeBSD driver. The file and variable
names are kept the same to help with understanding (with most of the
BSD style warts removed).

This version supports the latest NetVSP 6.1 version and
older versions.

Signed-off-by: Haiyang Zhang <haiyangz@microsoft.com>
Signed-off-by: Stephen Hemminger <sthemmin@microsoft.com>
Stephen Hemminger [Fri, 13 Jul 2018 17:06:42 +0000 (10:06 -0700)]
bus/vmbus: add Hyper-V virtual bus support

This patch adds support for an additional bus type Virtual Machine BUS
(VMBUS) on Microsoft Hyper-V in Windows 10, Windows Server 2016
and Azure. Most of this code was extracted from FreeBSD and some of
this is from earlier code donated by Brocade.

Only Linux is supported at present, but the code is split
to allow future FreeBSD and Windows support.

The bus support relies on the uio_hv_generic driver from Linux
kernel 4.16. Multiple queue support requires additional sysfs
interfaces which are in kernel 5.0 (a.k.a. 4.17).

Signed-off-by: Stephen Hemminger <sthemmin@microsoft.com>
Stephen Hemminger [Fri, 13 Jul 2018 17:06:41 +0000 (10:06 -0700)]
eal: add uuid API

Since uuid functions may not be available everywhere, implement
uuid functions in DPDK. These are based on the BSD-licensed
libuuid in util-linux.

Signed-off-by: Stephen Hemminger <sthemmin@microsoft.com>
Dan Gora [Mon, 18 Jun 2018 23:37:33 +0000 (16:37 -0700)]
vhost/crypto: use function to access mbuf private area

Use rte_mbuf_to_priv() to access the private data area in the mbuf.

Signed-off-by: Dan Gora <dg@adax.com>
Dan Gora [Mon, 18 Jun 2018 23:36:18 +0000 (16:36 -0700)]
examples/ipsec-secgw: use function to access mbuf private

Update get_priv() to use rte_mbuf_to_priv() to access the private
area in the mbuf.

In inbound_sa_check(), use the application's get_priv() function to
access the private area in the mbuf.

Signed-off-by: Dan Gora <dg@adax.com>
Dan Gora [Mon, 18 Jun 2018 23:35:34 +0000 (16:35 -0700)]
mbuf: add accessor function for private data area

Add an inline accessor function to return the starting address of
the private data area in the supplied mbuf.

This allows applications to easily access the private data area between
the struct rte_mbuf and the data buffer in the specified mbuf without
creating private macros or accessor functions.

No checks are made to ensure that a private data area actually exists
in the buffer.

Signed-off-by: Dan Gora <dg@adax.com>
Reviewed-by: Andrew Rybchenko <arybchenko@solarflare.com>
Acked-by: Olivier Matz <olivier.matz@6wind.com>
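
The layout this accessor relies on can be sketched with a reduced,
hypothetical header struct: the private area starts immediately after the
header, so the accessor is plain pointer arithmetic. struct mini_mbuf and
mini_mbuf_to_priv() are illustrative stand-ins, not the DPDK definitions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Heavily reduced stand-in for the mbuf header. */
struct mini_mbuf {
	void *buf_addr;
	uint16_t priv_size;
};

/* Return the address right after the header, where the private data
 * area begins. As in the commit, nothing checks that the area exists. */
static inline void *
mini_mbuf_to_priv(struct mini_mbuf *m)
{
	assert(m != NULL);
	return (char *)m + sizeof(*m);
}
```

Because no size check is made, callers remain responsible for knowing that
the pool was created with a non-zero private size.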
Moti Haimovsky [Thu, 12 Jul 2018 12:01:31 +0000 (15:01 +0300)]
net/mlx5: support 32-bit systems

This patch adds support for building and running mlx5 PMD on
32bit systems such as i686.

The main issue to tackle was handling the 32bit access to the UAR
as quoted from the mlx5 PRM:
QP and CQ DoorBells require 64-bit writes. For best performance, it
is recommended to execute the QP/CQ DoorBell as a single 64-bit write
operation. For platforms that do not support 64 bit writes, it is
possible to issue the 64 bits DoorBells through two consecutive writes,
each write 32 bits, as described below:
* The order of writing each of the Dwords is from lower to upper
  addresses.
* No other DoorBell can be rung (or even start ringing) in the midst
  of an on-going write of a DoorBell over a given UAR page.

The last rule implies that in a multi-threaded environment, the access
to a UAR page (which can be accessible by all threads in the process)
must be synchronized (for example, using a semaphore) unless an atomic
write of 64 bits in a single bus operation is guaranteed. Such
synchronization is not required when ringing DoorBells on different
UAR pages.

Signed-off-by: Moti Haimovsky <motih@mellanox.com>
Acked-by: Yongseok Koh <yskoh@mellanox.com>
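
The two rules quoted from the PRM can be sketched as follows: split the
64-bit value into two 32-bit stores, write the lower address first, and
serialise writers to the same UAR page. The lock, the function name and the
use of a C11 atomic_flag are assumptions for illustration; the driver may
use a different primitive and needs a real write barrier for MMIO.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical lock guarding one UAR page. */
static atomic_flag uar_lock = ATOMIC_FLAG_INIT;

/* Ring a 64-bit doorbell using two 32-bit stores. */
static void
ring_doorbell_32(volatile uint32_t *db, uint64_t val)
{
	uint32_t half[2];

	assert(db != NULL);
	/* Split the value while keeping its in-memory byte order. */
	memcpy(half, &val, sizeof(half));
	while (atomic_flag_test_and_set_explicit(&uar_lock,
			memory_order_acquire))
		; /* spin until no other doorbell is in flight */
	db[0] = half[0]; /* lower address first */
	/* A real driver would place a write barrier here. */
	db[1] = half[1];
	atomic_flag_clear_explicit(&uar_lock, memory_order_release);
}
```

On platforms with atomic 64-bit writes the lock is unnecessary, which is
why the extra synchronization only matters for 32-bit builds.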
Shahaf Shuler [Thu, 12 Jul 2018 06:40:32 +0000 (09:40 +0300)]
net/mlx5: fix build with rdma-core v19

The flow counter support introduced by
commit 9a761de8ea14 ("net/mlx5: flow counter support") was intended to
work only with MLNX_OFED_4.3, as the upstream rdma-core
libraries lacked such support.

In rdma-core v19, support for the flow counters was added but with
different user APIs, causing compilation issues in the PMD.

This patch fixes the compilation errors by forcing the flow counters
to be enabled only with MLNX_OFED APIs.
Once the MLNX_OFED and rdma-core APIs are aligned, a proper patch to
support the new API will be submitted.

Fixes: 9a761de8ea14 ("net/mlx5: flow counter support")
Cc: stable@dpdk.org
Reported-by: Stephen Hemminger <stephen@networkplumber.org>
Reported-by: Ferruh Yigit <ferruh.yigit@intel.com>
Signed-off-by: Shahaf Shuler <shahafs@mellanox.com>
Acked-by: Ori Kam <orika@mellanox.com>
Nelio Laranjeiro [Thu, 12 Jul 2018 09:31:07 +0000 (11:31 +0200)]
net/mlx5: add count flow action

This is only supported by Mellanox OFED.

Signed-off-by: Nelio Laranjeiro <nelio.laranjeiro@6wind.com>
Acked-by: Yongseok Koh <yskoh@mellanox.com>
Nelio Laranjeiro [Thu, 12 Jul 2018 09:31:06 +0000 (11:31 +0200)]
net/mlx5: add flow MPLS item

Signed-off-by: Nelio Laranjeiro <nelio.laranjeiro@6wind.com>
Acked-by: Yongseok Koh <yskoh@mellanox.com>
Nelio Laranjeiro [Thu, 12 Jul 2018 09:31:05 +0000 (11:31 +0200)]
net/mlx5: add flow GRE item

Signed-off-by: Nelio Laranjeiro <nelio.laranjeiro@6wind.com>
Acked-by: Yongseok Koh <yskoh@mellanox.com>
Nelio Laranjeiro [Thu, 12 Jul 2018 09:31:04 +0000 (11:31 +0200)]
net/mlx5: add flow VXLAN-GPE item

Signed-off-by: Nelio Laranjeiro <nelio.laranjeiro@6wind.com>
Acked-by: Yongseok Koh <yskoh@mellanox.com>
Nelio Laranjeiro [Thu, 12 Jul 2018 09:31:03 +0000 (11:31 +0200)]
net/mlx5: add flow VXLAN item

Signed-off-by: Nelio Laranjeiro <nelio.laranjeiro@6wind.com>
Acked-by: Yongseok Koh <yskoh@mellanox.com>
Nelio Laranjeiro [Thu, 12 Jul 2018 09:31:02 +0000 (11:31 +0200)]
net/mlx5: support inner RSS computation

Signed-off-by: Nelio Laranjeiro <nelio.laranjeiro@6wind.com>
Acked-by: Yongseok Koh <yskoh@mellanox.com>
Nelio Laranjeiro [Thu, 12 Jul 2018 09:31:01 +0000 (11:31 +0200)]
net/mlx5: remove useless arguments in hrxq API

The RSS level only needs to add a bit in hash_fields, which is already
provided to this API. For tunnels, the queue must be requested to
compute the checksum on the innermost headers; this should
always be activated.

Signed-off-by: Nelio Laranjeiro <nelio.laranjeiro@6wind.com>
Acked-by: Yongseok Koh <yskoh@mellanox.com>
Nelio Laranjeiro [Thu, 12 Jul 2018 09:31:00 +0000 (11:31 +0200)]
net/mlx5: add RSS flow action

Signed-off-by: Nelio Laranjeiro <nelio.laranjeiro@6wind.com>
Acked-by: Yongseok Koh <yskoh@mellanox.com>
Nelio Laranjeiro [Thu, 12 Jul 2018 09:30:59 +0000 (11:30 +0200)]
net/mlx5: use a macro for the RSS key size

ConnectX 4-5 support only 40 bytes of RSS key, so using a compiled size
hash key is not necessary.

Signed-off-by: Nelio Laranjeiro <nelio.laranjeiro@6wind.com>
Acked-by: Yongseok Koh <yskoh@mellanox.com>
Nelio Laranjeiro [Thu, 12 Jul 2018 09:30:58 +0000 (11:30 +0200)]
net/mlx5: add mark/flag flow action

Signed-off-by: Nelio Laranjeiro <nelio.laranjeiro@6wind.com>
Acked-by: Yongseok Koh <yskoh@mellanox.com>
Nelio Laranjeiro [Thu, 12 Jul 2018 09:30:57 +0000 (11:30 +0200)]
net/mlx5: add flow TCP item

Signed-off-by: Nelio Laranjeiro <nelio.laranjeiro@6wind.com>
Acked-by: Yongseok Koh <yskoh@mellanox.com>
Nelio Laranjeiro [Thu, 12 Jul 2018 09:30:56 +0000 (11:30 +0200)]
net/mlx5: add flow UDP item

Signed-off-by: Nelio Laranjeiro <nelio.laranjeiro@6wind.com>
Acked-by: Yongseok Koh <yskoh@mellanox.com>
Nelio Laranjeiro [Thu, 12 Jul 2018 09:30:55 +0000 (11:30 +0200)]
net/mlx5: add flow IPv6 item

Signed-off-by: Nelio Laranjeiro <nelio.laranjeiro@6wind.com>
Acked-by: Yongseok Koh <yskoh@mellanox.com>
Nelio Laranjeiro [Thu, 12 Jul 2018 09:30:54 +0000 (11:30 +0200)]
net/mlx5: add flow IPv4 item

Signed-off-by: Nelio Laranjeiro <nelio.laranjeiro@6wind.com>
Acked-by: Yongseok Koh <yskoh@mellanox.com>
Nelio Laranjeiro [Thu, 12 Jul 2018 09:30:53 +0000 (11:30 +0200)]
net/mlx5: add flow VLAN item

Signed-off-by: Nelio Laranjeiro <nelio.laranjeiro@6wind.com>
Acked-by: Yongseok Koh <yskoh@mellanox.com>
Nelio Laranjeiro [Thu, 12 Jul 2018 09:30:52 +0000 (11:30 +0200)]
net/mlx5: add flow stop/start

Signed-off-by: Nelio Laranjeiro <nelio.laranjeiro@6wind.com>
Acked-by: Yongseok Koh <yskoh@mellanox.com>
Nelio Laranjeiro [Thu, 12 Jul 2018 09:30:51 +0000 (11:30 +0200)]
net/mlx5: add flow queue action

Signed-off-by: Nelio Laranjeiro <nelio.laranjeiro@6wind.com>
Acked-by: Yongseok Koh <yskoh@mellanox.com>
Nelio Laranjeiro [Thu, 12 Jul 2018 09:30:50 +0000 (11:30 +0200)]
net/mlx5: support flow Ethernet item along with drop action

Signed-off-by: Nelio Laranjeiro <nelio.laranjeiro@6wind.com>
Acked-by: Yongseok Koh <yskoh@mellanox.com>
Nelio Laranjeiro [Thu, 12 Jul 2018 09:30:49 +0000 (11:30 +0200)]
net/mlx5: replace verbs priorities by flow

Previous work introduced Verbs priorities, whereas the PMD translates
flow priorities into Verbs priorities.  Rename this to better reflect
what the PMD has to translate.

Signed-off-by: Nelio Laranjeiro <nelio.laranjeiro@6wind.com>
Acked-by: Yongseok Koh <yskoh@mellanox.com>
Nelio Laranjeiro [Thu, 12 Jul 2018 09:30:48 +0000 (11:30 +0200)]
net/mlx5: handle drop queues as regular queues

Drop queues are essentially used in flows because of the Verbs API;
whether the fate of the flow is a drop is already recorded
in the flow.  Because of this, drop queues can be fully mapped onto
regular queues.

Signed-off-by: Nelio Laranjeiro <nelio.laranjeiro@6wind.com>
Acked-by: Yongseok Koh <yskoh@mellanox.com>
Nelio Laranjeiro [Thu, 12 Jul 2018 09:30:47 +0000 (11:30 +0200)]
net/mlx5: remove flow support

This starts a series reworking the flow engine in mlx5 to easily support
flow conversion to Verbs or TC.  This is necessary to handle both regular
flows and representor flows.

As the full file needs to be cleaned up to rewrite all item/action
processing, this patch starts by disabling the regular code and only lets
the PMD start in isolated mode.

After this patch, the flow API will not be usable.

Signed-off-by: Nelio Laranjeiro <nelio.laranjeiro@6wind.com>
Acked-by: Yongseok Koh <yskoh@mellanox.com>
Adrien Mazarguil [Tue, 10 Jul 2018 16:04:58 +0000 (18:04 +0200)]
net/mlx5: add parameter for port representors

Prior to this patch, all port representors detected on a given device were
probed and Ethernet devices instantiated for each of them.

This patch adds support for the standard "representor" parameter, which
implies that port representors are not probed by default anymore, except
for the list provided through device arguments.

(Patch based on prior work from Yuanhan Liu)

Signed-off-by: Adrien Mazarguil <adrien.mazarguil@6wind.com>
Reviewed-by: Xueming Li <xuemingl@mellanox.com>
Adrien Mazarguil [Tue, 10 Jul 2018 16:04:56 +0000 (18:04 +0200)]
net/mlx5: probe port representors in natural order

Port representors are probed in whatever unspecified order
ibv_get_device_list() returns them.

This is counterintuitive to users since DPDK port IDs assignment almost
never follows the same sequence as representor IDs. Additionally, the
master device does not necessarily inherit the lowest DPDK port ID.

Signed-off-by: Adrien Mazarguil <adrien.mazarguil@6wind.com>
Adrien Mazarguil [Tue, 10 Jul 2018 16:04:54 +0000 (18:04 +0200)]
net/mlx5: probe all port representors

Probe existing port representors in addition to their master device and
associate them automatically.

To avoid collision between Ethernet devices, they are named as follows:

- "{DBDF}" for master/switch devices.
- "{DBDF}_representor_{rep}" with "rep" starting from 0 for port
  representors.

(Patch based on prior work from Yuanhan Liu)

Signed-off-by: Adrien Mazarguil <adrien.mazarguil@6wind.com>
Signed-off-by: Nelio Laranjeiro <nelio.laranjeiro@6wind.com>
Reviewed-by: Xueming Li <xuemingl@mellanox.com>
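
The naming scheme listed in this commit can be sketched with snprintf().
make_port_name() is a hypothetical helper, and passing a negative `rep` to
mean "master device" is an assumption made for this illustration.

```c
#include <assert.h>
#include <stdio.h>

/* Build "{DBDF}" for the master device (rep < 0) or
 * "{DBDF}_representor_{rep}" for a port representor.
 * Returns 0 on success, -1 if the name does not fit. */
static int
make_port_name(char *buf, size_t len, const char *dbdf, int rep)
{
	int ret;

	assert(buf != NULL && dbdf != NULL);
	if (rep < 0)
		ret = snprintf(buf, len, "%s", dbdf);
	else
		ret = snprintf(buf, len, "%s_representor_%d", dbdf, rep);
	return (ret < 0 || (size_t)ret >= len) ? -1 : 0;
}
```

Keeping the PCI address as the common prefix makes it easy to see which
representors belong to which master device.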
Adrien Mazarguil [Tue, 10 Jul 2018 16:04:52 +0000 (18:04 +0200)]
net/mlx5: add port representor awareness

The current PCI probing method is not aware of Verbs port representors,
which appear as standard Verbs devices bound to the same PCI address and
cannot be distinguished.

The problem is that, more often than not, the wrong Verbs device is used,
resulting in unexpected traffic.

This patch makes the driver discard representors to only use the master
device. If unable to identify it (e.g. kernel drivers not recent enough),
either:

- There is only one matching device which isn't identified as a
  representor, in that case use it.
- Otherwise log an error and do not probe the device.

(Patch based on prior work from Yuanhan Liu)

Signed-off-by: Adrien Mazarguil <adrien.mazarguil@6wind.com>
Reviewed-by: Xueming Li <xuemingl@mellanox.com>
Adrien Mazarguil [Tue, 10 Jul 2018 16:04:50 +0000 (18:04 +0200)]
net/mlx5: re-indent generic probing function

Since commit "net/mlx5: drop useless support for several Verbs ports"
removed an inner loop, mlx5_dev_spawn() is left with an unnecessary indent
level.

This patch eliminates a block, moves its local variables to function scope,
and re-indents its contents (diff best viewed with --ignore-all-space).

No functional impact.

Signed-off-by: Adrien Mazarguil <adrien.mazarguil@6wind.com>
Reviewed-by: Xueming Li <xuemingl@mellanox.com>
Adrien Mazarguil [Tue, 10 Jul 2018 16:04:48 +0000 (18:04 +0200)]
net/mlx5: split PCI from generic probing

All the generic probing code needs is an IB device. While this device is
currently supplied by a PCI lookup, other methods will be added soon.

This patch divides the original function, which has become huge over time,
as follows:

1. PCI-specific (mlx5_pci_probe()).
2. Verbs device (mlx5_dev_spawn()).

(Patch based on prior work from Yuanhan Liu)

Signed-off-by: Adrien Mazarguil <adrien.mazarguil@6wind.com>
Reviewed-by: Xueming Li <xuemingl@mellanox.com>
Adrien Mazarguil [Tue, 10 Jul 2018 16:04:46 +0000 (18:04 +0200)]
net/mlx5: drop useless support for several Verbs ports

Unlike mlx4 from which this capability was inherited, mlx5 devices expose
exactly one Verbs port per PCI bus address. Each physical port gets
assigned its own bus address with a single Verbs port.

While harmless, this code requires an extra loop that would get in the way
of subsequent refactoring.

No functional impact.

Signed-off-by: Adrien Mazarguil <adrien.mazarguil@6wind.com>
Adrien Mazarguil [Tue, 10 Jul 2018 16:04:44 +0000 (18:04 +0200)]
net/mlx5: remove redundant objects in probe function

This patch gets rid of redundant calls to open the device and query its
attributes in order to simplify the code.

Signed-off-by: Adrien Mazarguil <adrien.mazarguil@6wind.com>
Reviewed-by: Xueming Li <xuemingl@mellanox.com>
Adrien Mazarguil [Tue, 10 Jul 2018 16:04:42 +0000 (18:04 +0200)]
net/mlx5: rename confusing object in probe function

There are several attribute objects in this function:

- IB device attributes (struct ibv_device_attr_ex device_attr).
- Direct Verbs attributes (struct mlx5dv_context attrs_out).
- Port attributes (struct ibv_port_attr).
- IB device attributes again (struct ibv_device_attr_ex device_attr_ex).

"attrs_out" is both odd and initialized using a nonstandard syntax. Rename
it "dv_attr" for consistency.

Signed-off-by: Adrien Mazarguil <adrien.mazarguil@6wind.com>
Reviewed-by: Xueming Li <xuemingl@mellanox.com>
Moti Haimovsky [Tue, 10 Jul 2018 10:45:54 +0000 (13:45 +0300)]
net/mlx4: support hardware TSO

Implement support for hardware TSO.

Signed-off-by: Moti Haimovsky <motih@mellanox.com>
Acked-by: Matan Azrad <matan@mellanox.com>
6 years ago test/power: fix 32-bit build
Pablo de Lara [Fri, 13 Jul 2018 04:51:03 +0000 (05:51 +0100)]
test/power: fix 32-bit build

Compilation issue:

test/test/test_power_acpi_cpufreq.c:556:31:
error: format ‘%lx’ expects argument of type ‘long unsigned int’,
but argument 2 has type ‘uint64_t {aka long long unsigned int}’

  printf("ACPI: Capabilities %lx\n", caps.capabilities);
                             ~~^     ~~~~~~~~~~~~~~~~~
                             %llx
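
A minimal sketch of one portable way to print a uint64_t, assuming the
PRIx64 macro from <inttypes.h> is acceptable here; the commit message does
not show the actual change, and format_caps() is a hypothetical helper:

```c
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

/*
 * PRIx64 expands to the correct length modifier for uint64_t on the
 * target ("lx" or "llx"), so one format string builds cleanly on both
 * 32-bit and 64-bit targets.
 */
static int
format_caps(char *buf, size_t len, uint64_t capabilities)
{
	return snprintf(buf, len, "ACPI: Capabilities %" PRIx64 "\n",
			capabilities);
}
```

Hard-coding %llx instead, as the compiler hint suggests, would fix the
32-bit build but may trigger the mirror-image warning on targets where
uint64_t is unsigned long.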

Fixes: 39e38d583075 ("test/power: add unit test for get capabilities API")

Signed-off-by: Pablo de Lara <pablo.de.lara.guarch@intel.com>
Acked-by: Radu Nicolau <radu.nicolau@intel.com>
6 years ago ethdev: fix missing function in map file
Nelio Laranjeiro [Fri, 13 Jul 2018 09:11:30 +0000 (11:11 +0200)]
ethdev: fix missing function in map file

Add rte_flow_expand_rss in map file and tag it as experimental.

Fixes: 4ed05fcd441b ("ethdev: add flow API to expand RSS flows")

Signed-off-by: Nelio Laranjeiro <nelio.laranjeiro@6wind.com>
Acked-by: Adrien Mazarguil <adrien.mazarguil@6wind.com>
6 years ago doc: fix lists in release notes
Thomas Monjalon [Fri, 13 Jul 2018 13:36:51 +0000 (15:36 +0200)]
doc: fix lists in release notes

Some blank lines and hyphens are missing, so lists were badly
interpreted and rendered.

Signed-off-by: Thomas Monjalon <thomas@monjalon.net>
6 years ago mem: support --in-memory mode
Anatoly Burakov [Fri, 13 Jul 2018 12:48:04 +0000 (13:48 +0100)]
mem: support --in-memory mode

Implement the final piece of the in-memory mode puzzle - enable running
DPDK entirely in memory, without creating any files.

To do it, use mmap with MAP_HUGETLB and size flags to enable DPDK to work
without hugetlbfs mountpoints. In order to enable this, a few things needed
to be changed.

First of all, we need to allow empty hugetlbfs mountpoints in
hugepage_info, and handle them correctly (by not trying to create any
files or lock any directories).

Next, we need to reorder the mapping sequence, because the page is not
really allocated until the page fault, and we cannot get its IOVA
address before we trigger the page fault.

Finally, decide at compile time whether we are going to be supporting
anonymous hugepages or not, because we cannot check for it at runtime.
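
A minimal, Linux-only sketch of the anonymous hugepage mapping described
above. map_anon_prefer_huge() is a hypothetical helper, not the EAL code;
the fallback to regular pages and the MAP_HUGE_SHIFT fallback value (26,
the Linux UAPI value) are assumptions made so the sketch runs on systems
without reserved hugepages:

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <stddef.h>
#include <sys/mman.h>

#ifndef MAP_HUGE_SHIFT
#define MAP_HUGE_SHIFT 26 /* assumed Linux UAPI value if headers lack it */
#endif

/*
 * Map len bytes of anonymous memory, preferring hugepages of size
 * 2^page_shift (21 for 2 MB pages), then touch the mapping so the page
 * is really allocated. No file or hugetlbfs mountpoint is involved.
 */
static void *
map_anon_prefer_huge(size_t len, unsigned int page_shift, int *got_huge)
{
	int flags = MAP_PRIVATE | MAP_ANONYMOUS;
	void *va;

	*got_huge = 1;
	va = mmap(NULL, len, PROT_READ | PROT_WRITE,
		  flags | MAP_HUGETLB | ((int)page_shift << MAP_HUGE_SHIFT),
		  -1, 0);
	if (va == MAP_FAILED) {
		/* Sketch-only fallback: use regular pages instead. */
		*got_huge = 0;
		va = mmap(NULL, len, PROT_READ | PROT_WRITE, flags, -1, 0);
		if (va == MAP_FAILED)
			return NULL;
	}
	/* Trigger the page fault; only after this could an IOVA be queried. */
	*(volatile char *)va = 0;
	return va;
}
```

The write before returning mirrors the reordering mentioned above: the
address lookup has to come after the first access, not before it.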

Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
6 years ago eal: add --in-memory option
Anatoly Burakov [Fri, 13 Jul 2018 12:48:03 +0000 (13:48 +0100)]
eal: add --in-memory option

This command-line option will cause DPDK to operate entirely in
memory and not create any shared files at runtime, including any
shared configuration or hugetlbfs files. This is useful for debug
purposes, as well as for certain use cases like containers or
automatic memory cleanup.

Currently, this option acts as a strict superset of the --no-shconf and
--huge-unlink options.

Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
6 years ago mem: support --huge-unlink mode
Anatoly Burakov [Fri, 13 Jul 2018 12:48:02 +0000 (13:48 +0100)]
mem: support --huge-unlink mode

Unlink hugepages after creating them, to honor the hugepage-unlink mode.
We cannot resize non-existing files, so make single file segments
explicitly unsupported.
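
A minimal sketch of the map-then-unlink pattern, using a regular temporary
file as a stand-in for a hugetlbfs file; map_and_unlink() and the /tmp path
template are hypothetical:

```c
#include <stddef.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Create a file, map it, then remove its name right away. The mapping
 * keeps the pages alive, but nothing is left behind in the filesystem.
 */
static void *
map_and_unlink(size_t len)
{
	char path[] = "/tmp/unlink_demo_XXXXXX"; /* hypothetical location */
	void *va;
	int fd = mkstemp(path);

	if (fd < 0)
		return NULL;
	if (ftruncate(fd, (off_t)len) < 0) {
		close(fd);
		unlink(path);
		return NULL;
	}
	va = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	unlink(path); /* the name is gone, the mapping stays valid */
	close(fd);
	return va == MAP_FAILED ? NULL : va;
}
```

Once the name and descriptor are gone, the file can no longer be reopened
and resized, which is consistent with single file segments being
unsupported in this mode.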

Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
6 years ago eal: do not create runtime dir in --no-shconf mode
Anatoly Burakov [Fri, 13 Jul 2018 12:48:01 +0000 (13:48 +0100)]
eal: do not create runtime dir in --no-shconf mode

Now that the rest of the EAL is adjusted to not create any shared
files, prevent the runtime directory from ever being created.

Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
6 years ago eal: support --no-shconf in hugepage data file
Anatoly Burakov [Fri, 13 Jul 2018 12:48:00 +0000 (13:48 +0100)]
eal: support --no-shconf in hugepage data file

Do not create a shared hugepage data file if we were asked to
not create any shared files.

Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
6 years ago eal: support --no-shconf for hugepage info
Anatoly Burakov [Fri, 13 Jul 2018 12:47:59 +0000 (13:47 +0100)]
eal: support --no-shconf for hugepage info

Do not create any shared hugepage size info files if we were
asked to not create any shared files.

Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
6 years ago ipc: support --no-shconf mode
Anatoly Burakov [Fri, 13 Jul 2018 12:47:58 +0000 (13:47 +0100)]
ipc: support --no-shconf mode

IPC is an inter-process communication mechanism. Since no secondaries
can ever be expected to run in no-shconf mode, IPC will be useless, so
do not enable it in the first place. In the interests of API usage
convenience, we will still allow registering callbacks, but obviously
they won't ever be triggered.

Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
6 years ago fbarray: support --no-shconf mode
Anatoly Burakov [Fri, 13 Jul 2018 12:47:57 +0000 (13:47 +0100)]
fbarray: support --no-shconf mode

When using --no-shconf option, the expectation is that no multiprocess
will be supported as no shared files are created. However, fbarray still
creates some shared files that prevent multiple processes with the same
prefix from starting.

Fix this by not creating shared files whenever the --no-shconf option is
specified. Since virtual areas we get from eal_get_virtual_area() are
read-only, remap them as writable.

Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
6 years ago eal: move runtime config file to new location
Anatoly Burakov [Fri, 13 Jul 2018 10:44:48 +0000 (11:44 +0100)]
eal: move runtime config file to new location

As per deprecation notice [1], move DPDK runtime config to default
DPDK runtime data location. Also, remove the deprecation notice and
update release notes to indicate the changes.

[1] http://dpdk.org/patch/40418

Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
6 years ago doc: add IPC callback limitations
Anatoly Burakov [Tue, 26 Jun 2018 10:53:18 +0000 (11:53 +0100)]
doc: add IPC callback limitations

For asynchronous requests, user callback may be triggered either from
IPC thread or from interrupt thread. Because of this, delivery of
other interrupt-based events such as alarms may not be possible inside
the asynchronous IPC request callback handler. Document this
limitation.

Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
6 years ago ipc: remove thread for async requests
Anatoly Burakov [Tue, 26 Jun 2018 10:53:17 +0000 (11:53 +0100)]
ipc: remove thread for async requests

Previously, we were using two IPC threads - one to handle messages
and synchronous requests, and another to handle asynchronous requests.
To handle replies for an async request, rte_mp_handle used a pthread_cond
variable to wake up the rte_mp_handle_async thread, which processed them.

Change it to handle asynchronous messages within the main IPC thread.
To handle timeout events, for each async request which is sent,
we set an alarm for it. If its reply is received before timeout,
we will cancel the alarm when we handle the reply; otherwise, the
alarm will invoke async_reply_handle() as its callback.
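
A minimal sketch of the alarm-per-request idea, using a simulated clock
instead of the real alarm API. struct async_req, req_send(), req_reply()
and alarms_run() are hypothetical names, not the IPC implementation:

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_REQ 8

/* Hypothetical pending-request record. */
struct async_req {
	bool in_use;       /* request is pending and its alarm is armed */
	uint64_t deadline; /* absolute time at which the alarm fires */
	bool timed_out;    /* set when the alarm callback ran */
};

static struct async_req reqs[MAX_REQ];

/* Send a request and arm its alarm. Returns a request id or -1. */
static int
req_send(uint64_t now, uint64_t timeout)
{
	for (int i = 0; i < MAX_REQ; i++) {
		if (!reqs[i].in_use) {
			reqs[i] = (struct async_req){ true, now + timeout, false };
			return i;
		}
	}
	return -1;
}

/* A reply arrived: cancel the alarm. Returns true if it was in time. */
static bool
req_reply(int id, uint64_t now)
{
	if (id < 0 || id >= MAX_REQ || !reqs[id].in_use ||
	    now >= reqs[id].deadline)
		return false;
	reqs[id].in_use = false; /* alarm cancelled */
	return true;
}

/* Run expired alarms; returns how many timeout callbacks fired. */
static int
alarms_run(uint64_t now)
{
	int fired = 0;

	for (int i = 0; i < MAX_REQ; i++) {
		if (reqs[i].in_use && now >= reqs[i].deadline) {
			reqs[i].timed_out = true; /* timeout callback */
			reqs[i].in_use = false;
			fired++;
		}
	}
	return fired;
}
```

Because each request carries its own deadline, no dedicated thread has to
sleep on a condition variable waiting for the earliest timeout.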

Signed-off-by: Jianfeng Tan <jianfeng.tan@intel.com>
Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
Suggested-by: Thomas Monjalon <thomas@monjalon.net>
6 years ago eal: bring forward init of interrupt handling
Jianfeng Tan [Tue, 26 Jun 2018 10:53:16 +0000 (11:53 +0100)]
eal: bring forward init of interrupt handling

Next commit will make asynchronous IPC requests rely on alarm API,
which in turn relies on interrupts to work. Therefore, move the EAL
interrupt initialization before IPC initialization to avoid breaking
IPC in the next commit.

Signed-off-by: Jianfeng Tan <jianfeng.tan@intel.com>
Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
6 years ago eal/bsd: support alarm API
Anatoly Burakov [Tue, 26 Jun 2018 10:53:15 +0000 (11:53 +0100)]
eal/bsd: support alarm API

Implement EAL alarm API support for FreeBSD. The implementation
is largely identical to that of the Linux version, with one key
difference.

The alarm API is a little Linux-centric in that it expects to
manage alarm timeouts without involvement of the interrupt
thread. This works on Linux because there is a timerfd API
which allows waiting for timer events on an fd.

On FreeBSD, however, there are no timerfd's, and timer events are
set up directly in kevent. There is no way to pass information from
the alarm API to the interrupt thread, so we also add a little
back-channel magic to get soonest alarm timeout from the alarm API.

Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
6 years ago eal/bsd: add interrupt thread
Anatoly Burakov [Tue, 26 Jun 2018 10:53:14 +0000 (11:53 +0100)]
eal/bsd: add interrupt thread

Add interrupt thread to FreeBSD. It is largely a copy-paste from
Linuxapp interrupt thread, except for a few key differences:

* Use kevent instead of epoll
* Do not recreate the event queue on adding/removing interrupt
  sources, add/remove them to/from the queue on the fly instead
* No support for UIO/VFIO handles

Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
6 years ago eal/linux: use libc malloc in interrupt handling
Jianfeng Tan [Tue, 26 Jun 2018 10:53:13 +0000 (11:53 +0100)]
eal/linux: use libc malloc in interrupt handling

IPC uses the interrupts API internally, and the memory subsystem uses IPC.
Therefore, the interrupts API should not use rte_malloc, to avoid a
circular dependency. Switch to using regular glibc malloc in interrupts API.

Signed-off-by: Jianfeng Tan <jianfeng.tan@intel.com>
Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
6 years ago eal/linux: use libc malloc in alarm
Jianfeng Tan [Tue, 26 Jun 2018 10:53:12 +0000 (11:53 +0100)]
eal/linux: use libc malloc in alarm

Alarm API is going to be used by IPC internally. However, because
memory subsystem depends on IPC, alarm API cannot use rte_malloc as
it creates a circular dependency.

To avoid this chicken-and-egg problem, switch to glibc malloc
in the alarm API.

Signed-off-by: Jianfeng Tan <jianfeng.tan@intel.com>
Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
6 years ago vfio: fix uninitialized variable
Anatoly Burakov [Fri, 1 Jun 2018 09:08:12 +0000 (10:08 +0100)]
vfio: fix uninitialized variable

Some static analyzers complain about it, even though the
value is never used if not initialized. To avoid additional
false positives about potential null-pointer dereferences,
also add a null-check.

Bugzilla ID: 58
Fixes: ea2dc1066870 ("vfio: add multi container support")
Cc: stable@dpdk.org
Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
6 years ago eal/linux: fix uninitialized value
Anatoly Burakov [Fri, 1 Jun 2018 09:08:11 +0000 (10:08 +0100)]
eal/linux: fix uninitialized value

The value is not used, but some static analyzers may give out a
warning. Fix it by assigning default value of zero.

Bugzilla ID: 58
Fixes: cdc242f260e7 ("eal/linux: support running as unprivileged user")
Cc: stable@dpdk.org
Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
6 years ago eal/linux: fix invalid syntax in interrupts
Anatoly Burakov [Fri, 1 Jun 2018 09:08:10 +0000 (10:08 +0100)]
eal/linux: fix invalid syntax in interrupts

Parentheses were missing. It worked because macro is enclosed in
parentheses, so syntax was valid after macro expansion.
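
A minimal sketch of how such code can compile by accident. FLAG_IS_SET()
and both functions are hypothetical; the commit message does not name the
macro involved:

```c
/* Hypothetical macro whose whole expansion is wrapped in parentheses. */
#define FLAG_IS_SET(flags, bit) (((flags) & (bit)) != 0)

/*
 * The "if" has no parentheses of its own. It only compiles because the
 * macro expands to a parenthesized expression, giving "if (...)".
 */
static int
check_fragile(unsigned int flags)
{
	if FLAG_IS_SET(flags, 0x1u)
		return 1;
	return 0;
}

/* Same logic with explicit parentheses around the condition. */
static int
check_fixed(unsigned int flags)
{
	if (FLAG_IS_SET(flags, 0x1u))
		return 1;
	return 0;
}
```

Both functions behave the same, but the first would stop compiling if the
macro were ever turned into a function or lost its outer parentheses.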

Bugzilla ID: 58
Fixes: 0a45657a6794 ("pci: rework interrupt handling")
Cc: stable@dpdk.org
Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
6 years ago eal: add option to limit memory allocation on sockets
Anatoly Burakov [Thu, 31 May 2018 17:35:33 +0000 (18:35 +0100)]
eal: add option to limit memory allocation on sockets

Previously, it was possible to limit maximum amount of memory
allowed for allocation by creating validator callbacks. Although a
powerful tool, it's a bit of a hassle and requires modifying the
application for it to work with DPDK example applications.

Fix this by adding a new parameter "--socket-limit", with syntax
similar to "--socket-mem", which would set per-socket memory
allocation limits, and set up a default validator callback to deny
all allocations above the limit.

This option is incompatible with legacy mode, as validator callbacks
are not supported there.
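
A minimal sketch of a per-socket limit with a default validator. It assumes
the argument is a comma-separated list of megabytes per socket, as with
"--socket-mem", and that 0 means "no limit"; parse_socket_limit() and
limit_validator() are hypothetical and do not use the real callback
signature:

```c
#include <stdint.h>
#include <stdlib.h>

#define MAX_SOCKETS 4

/* Per-socket limits in bytes; 0 is assumed to mean "no limit". */
static uint64_t socket_limit[MAX_SOCKETS];

/* Parse e.g. "1024,0". Returns the number of sockets parsed or -1. */
static int
parse_socket_limit(const char *arg)
{
	int n = 0;
	char *end;

	while (n < MAX_SOCKETS) {
		unsigned long long mb = strtoull(arg, &end, 10);

		if (end == arg)
			return -1; /* no digits where a value was expected */
		socket_limit[n++] = (uint64_t)mb << 20;
		if (*end == '\0')
			return n;
		if (*end != ',')
			return -1;
		arg = end + 1;
	}
	return -1; /* more values than sockets */
}

/* Allow (0) or deny (-1) growing a socket's total by new_len bytes. */
static int
limit_validator(int socket, uint64_t cur_total, uint64_t new_len)
{
	uint64_t limit;

	if (socket < 0 || socket >= MAX_SOCKETS)
		return -1;
	limit = socket_limit[socket];
	if (limit != 0 && cur_total + new_len > limit)
		return -1;
	return 0;
}
```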

Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
6 years ago memzone: improve zero-length reserve
Anatoly Burakov [Thu, 31 May 2018 09:51:01 +0000 (10:51 +0100)]
memzone: improve zero-length reserve

Currently, reserving zero-length memzones is done by looking at
malloc statistics, and reserving biggest sized element found in those
statistics. This has two issues.

First, there is a race condition. The heap is unlocked between the
time we check stats, and the time we reserve malloc element for memzone.
This may lead to inability to reserve the memzone we wanted to reserve,
because another allocation might have taken place and biggest sized
element may no longer be available.

Second, the size returned by malloc statistics does not include any
alignment information, which is worked around by being conservative and
subtracting alignment length from the final result. This leads to
fragmentation and reserving memzones that could have been bigger but
aren't.

Fix all of this by using earlier-introduced operation to reserve
biggest possible malloc element. This, however, comes with a trade-off,
because we can only lock one heap at a time. So, if we check the first
available heap and find *any* element at all, that element will be
considered "the biggest", even though other heaps might have bigger
elements. We cannot know what other heaps have before we try and
allocate it, and it is not a good idea to lock all of the heaps at
the same time, so, we will just document this limitation and
encourage users to reserve memzones with socket id properly set.
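
A minimal sketch of the find-and-reserve step done under a single lock, and
of the one-heap-at-a-time trade-off. struct heap and both functions are
hypothetical simplifications, not the malloc heap code:

```c
#include <pthread.h>
#include <stddef.h>

#define N_ELEMS 4

/* Hypothetical heap: a table of free element sizes (0 = taken). */
struct heap {
	pthread_mutex_t lock;
	size_t free_len[N_ELEMS];
};

/*
 * Find and take the biggest free element while holding the heap lock,
 * so no other allocation can slip in between the lookup and the
 * reservation. Returns the reserved size, or 0 if the heap is full.
 */
static size_t
heap_reserve_biggest(struct heap *h)
{
	size_t best = 0;
	int best_idx = -1;

	pthread_mutex_lock(&h->lock);
	for (int i = 0; i < N_ELEMS; i++) {
		if (h->free_len[i] > best) {
			best = h->free_len[i];
			best_idx = i;
		}
	}
	if (best_idx >= 0)
		h->free_len[best_idx] = 0;
	pthread_mutex_unlock(&h->lock);
	return best;
}

/*
 * Zero-length reserve across heaps: only one heap is locked at a time,
 * so the first heap with any free element wins, even if a later heap
 * holds a bigger one.
 */
static size_t
reserve_zero_len(struct heap *heaps, int n_heaps)
{
	for (int i = 0; i < n_heaps; i++) {
		size_t len = heap_reserve_biggest(&heaps[i]);

		if (len != 0)
			return len;
	}
	return 0;
}
```

In the test data used for this sketch, the first heap yields 256 bytes even
though the second heap holds a 4096-byte element, which is the limitation
being documented.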

Also, fix up the unit tests to account for the new behavior.

Fixes: fafcc11985a2 ("mem: rework memzone to be allocated by malloc")

Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
6 years ago malloc: allow reserving biggest element
Anatoly Burakov [Thu, 31 May 2018 09:51:00 +0000 (10:51 +0100)]
malloc: allow reserving biggest element

Add an internal-only function to allocate biggest element from
the heap. Nominally, it supports SOCKET_ID_ANY as its socket
argument, but it's essentially useless because other sockets
will only be allocated from if the entire heap on current or
specified socket is busy.

Still, asking to reserve the biggest element will allow fixing a
race condition in memzone reserve that has been there for a
long time.

Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
Acked-by: Remy Horton <remy.horton@intel.com>
6 years ago malloc: add finding biggest free IOVA-contiguous element
Anatoly Burakov [Thu, 31 May 2018 09:50:59 +0000 (10:50 +0100)]
malloc: add finding biggest free IOVA-contiguous element

Add an internal-only function to find the biggest free IOVA-contiguous
malloc element. It is not exposed through the external API.

Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
Acked-by: Remy Horton <remy.horton@intel.com>
Acked-by: Shreyansh Jain <shreyansh.jain@nxp.com>
6 years ago malloc: fix pad erasing
Anatoly Burakov [Thu, 31 May 2018 17:05:40 +0000 (18:05 +0100)]
malloc: fix pad erasing

Previously, when joining adjacent free elements, we were erasing
trailer and header, but did not erase the padding. Fix this by
accounting for padding on erase, and do not erase padding twice
by adjusting data pointer and data len to not include padding.

Fixes: bb372060dad4 ("malloc: make heap a doubly-linked list")
Cc: stable@dpdk.org
Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
6 years ago mem: provide thread-unsafe memseg list walk variant
Anatoly Burakov [Tue, 12 Jun 2018 09:46:16 +0000 (10:46 +0100)]
mem: provide thread-unsafe memseg list walk variant

Sometimes, user code needs to walk memseg list while being inside
a memory-related callback. Rather than making everyone copy around
the same iteration code and depending on DPDK internals, provide an
official way to do memseg_list_walk() inside callbacks.

Also, remove existing reimplementation from memalloc code and use
the new API instead.
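
A minimal sketch of the locked and thread-unsafe walk pair, assuming a
plain mutex in place of the real memory lock. All names here are
hypothetical and do not match the EAL API:

```c
#include <pthread.h>
#include <stddef.h>

#define N_LISTS 3

typedef int (*walk_cb_t)(int list_idx, void *arg);

static pthread_mutex_t mem_lock = PTHREAD_MUTEX_INITIALIZER;

/* Walk without locking: only for callers that already hold mem_lock. */
static int
list_walk_thread_unsafe(walk_cb_t cb, void *arg)
{
	for (int i = 0; i < N_LISTS; i++) {
		int ret = cb(i, arg);

		if (ret != 0)
			return ret; /* a nonzero return stops the walk */
	}
	return 0;
}

/* Regular variant: take the lock, then reuse the unsafe walk. */
static int
list_walk(walk_cb_t cb, void *arg)
{
	int ret;

	pthread_mutex_lock(&mem_lock);
	ret = list_walk_thread_unsafe(cb, arg);
	pthread_mutex_unlock(&mem_lock);
	return ret;
}

/* Example walk callback: count the lists visited. */
static int
count_cb(int list_idx, void *arg)
{
	(void)list_idx;
	++*(int *)arg;
	return 0;
}

/*
 * A memory-related callback runs with mem_lock already held, so calling
 * list_walk() from it would deadlock on this non-recursive mutex; it
 * uses the thread-unsafe variant instead.
 */
static int
mem_event(void)
{
	int count = 0;

	pthread_mutex_lock(&mem_lock); /* taken by the memory subsystem */
	list_walk_thread_unsafe(count_cb, &count);
	pthread_mutex_unlock(&mem_lock);
	return count;
}
```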

Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
6 years ago mem: provide thread-unsafe memseg walk variant
Anatoly Burakov [Tue, 12 Jun 2018 09:46:15 +0000 (10:46 +0100)]
mem: provide thread-unsafe memseg walk variant

Sometimes, user code needs to walk memseg list while being inside
a memory-related callback. Rather than making everyone copy around
the same iteration code and depending on DPDK internals, provide an
official way to do memseg_walk() inside callbacks.

Also, remove existing reimplementation from sPAPR VFIO code and use
the new API instead.

Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
6 years ago mem: provide thread-unsafe contig walk variant
Anatoly Burakov [Tue, 12 Jun 2018 09:46:14 +0000 (10:46 +0100)]
mem: provide thread-unsafe contig walk variant

Sometimes, user code needs to walk memseg list while being inside
a memory-related callback. Rather than making everyone copy around
the same iteration code and depending on DPDK internals, provide an
official way to do memseg_contig_walk() inside callbacks.

Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
6 years ago mem: mark pages as freeable on exit
Anatoly Burakov [Thu, 31 May 2018 16:11:47 +0000 (17:11 +0100)]
mem: mark pages as freeable on exit

When rte_eal_cleanup() is called, it is expected that DPDK will be able to
release all of its memory back to the system. However, if pages are marked
as unfreeable, the pages will not be released back. Fix this by marking all
pages as freeable when rte_eal_cleanup() is called, but only in the primary
process, as secondaries can come and go.

Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
6 years ago mem: allocate in reverse to reduce fragmentation
Anatoly Burakov [Mon, 11 Jun 2018 20:55:42 +0000 (21:55 +0100)]
mem: allocate in reverse to reduce fragmentation

Currently, all hugepages are allocated from lower VA address to
higher VA address, while malloc heap allocates from higher VA
address to lower VA address. This results in heap fragmentation
over time due to multiple reserves leaving small space below the
allocated elements.

Fix this by allocating VA memory from the top, thereby reducing
fragmentation and lowering overall memory usage.
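
A minimal sketch of searching for free space from the top of an address
range, using a plain array of used flags. find_free_run_from_top() is a
hypothetical helper, not the allocator code:

```c
#include <stdbool.h>

/*
 * Scan from the highest slot downwards and return the start index of
 * the first run of n free slots found, i.e. the highest-placed run.
 * Returns -1 if no such run exists.
 */
static int
find_free_run_from_top(const bool *used, int len, int n)
{
	int run = 0;

	if (n <= 0)
		return -1;
	for (int i = len - 1; i >= 0; i--) {
		run = used[i] ? 0 : run + 1;
		if (run == n)
			return i;
	}
	return -1;
}
```

Placing new pages next to the top-down allocations of the heap keeps the
free space in one block at the bottom rather than in small gaps.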

Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
6 years ago test/fbarray: add autotests
Anatoly Burakov [Mon, 11 Jun 2018 20:55:41 +0000 (21:55 +0100)]
test/fbarray: add autotests

Introduce a suite of autotests to cover functionality of fbarray.
This will check for invalid parameters, check API return values and
errno codes, and will also do some basic functionality checks on the
indexing code.

Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>