net/mlx5: handle Tx completion with error
When WQEs are posted to the HW to send packets, the PMD may get a
completion report with error from the HW, aka error CQE which is
associated to a bad WQE.
The error reason may be bad address, wrong lkey, bad sizes, etc.
that can wrongly be configured by the PMD or by the user.
Checking all the optional mistakes to prevent error CQEs doesn't make
sense due to performance impacts and huge complexity.
The error CQEs change the SQ state to error state what causes all the
next posted WQEs to be completed with CQE flush error forever.
Currently, the PMD doesn't handle Tx error CQEs and even may crashed
when one of them appears.
Extend the Tx data-path to detect these error CQEs, to report them by
the statistics error counters, to recover the SQ by moving the state
to ready again and adjusting the management variables appropriately.
Sometimes the error CQE root cause is very hard to debug and even may
be related to some corner cases which are not reproducible easily, hence
a dump file with debug information will be created for the first number
of error CQEs, this number can be configured by the PMD probe
parameters.
Cc: stable@dpdk.org
Signed-off-by: Matan Azrad <matan@mellanox.com>
Acked-by: Shahaf Shuler <shahafs@mellanox.com>