examples/l3fwd: improve grouping by destination port

Recent changes introduced a small performance degradation in the corner case
where each input packet is destined for a different port.
For the test case where 1 core manages 4 ports and the packet stream looks like:
IPV4_DSTPORT0, IPV4_DSTPORT1, IPV4_DSTPORT3, IPV4_DSTPORT4, IPV4_DSTPORT0, ...
the non-optimised code path outperforms the optimised one by 2-3%.
This patch is intended to close that gap.
In my testing, the optimised code path now produces the same numbers
as the non-optimised one for the case described above.
For other test cases the numbers remain about the same.

Signed-off-by: Konstantin Ananyev <konstantin.ananyev@intel.com>