eal/x86: optimize memcpy for SSE and AVX
Main code changes:
1. Differentiate architectural features based on CPU flags
a. Implement separate move functions for SSE/AVX/AVX2 to make full use of cache bandwidth (see sketch 1 below)
b. Implement a separate copy flow specifically optimized for each target architecture
2. Rewrite the memcpy function "rte_memcpy"
a. Add store alignment (see sketch 2 below)
b. Add load alignment based on architectural features
c. Put the block copy loop into inline move functions for better control of instruction order
d. Eliminate unnecessary MOVs
3. Rewrite the inline move functions
a. Add move functions for unaligned load cases
b. Change instruction order in copy loops for better pipeline utilization (see sketch 3 below)
c. Use intrinsics instead of assembly code
4. Remove the slow glibc call for constant-size copies (see sketch 4 below)
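
Sketch 1 (illustrative only, not the code in this patch): separate move
helpers selected at build time from the compiler's CPU feature macros, so
each build uses the widest register the target supports. The helper name
mov32_sketch is an assumption for illustration.

#include <stdint.h>
#include <immintrin.h>

#ifdef __AVX2__
/* AVX2 build: copy 32 bytes with a single 256-bit load/store pair */
static inline void
mov32_sketch(uint8_t *dst, const uint8_t *src)
{
        __m256i ymm0 = _mm256_loadu_si256((const __m256i *)src);
        _mm256_storeu_si256((__m256i *)dst, ymm0);
}
#else
/* SSE build: copy the same 32 bytes as two 128-bit load/store pairs */
static inline void
mov32_sketch(uint8_t *dst, const uint8_t *src)
{
        __m128i xmm0 = _mm_loadu_si128((const __m128i *)(src + 0));
        __m128i xmm1 = _mm_loadu_si128((const __m128i *)(src + 16));
        _mm_storeu_si128((__m128i *)(dst + 0), xmm0);
        _mm_storeu_si128((__m128i *)(dst + 16), xmm1);
}
#endif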
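
Sketch 2 (illustrative only): the store-aligning idea behind 2.a/2.b. An
unaligned head copy brings the destination to a 16-byte boundary so the
bulk loop can use aligned stores with possibly unaligned loads; an
overlapping tail copy finishes the remainder. Name and structure are
assumptions, not the exact rte_memcpy flow.

#include <stddef.h>
#include <stdint.h>
#include <immintrin.h>

static inline void *
memcpy_store_aligned_sketch(void *dst, const void *src, size_t n)
{
        uint8_t *d = dst;
        const uint8_t *s = src;

        if (n < 16) {
                /* tiny copies: a byte loop keeps the sketch simple;
                 * the real code has dedicated small-size paths */
                while (n--)
                        *d++ = *s++;
                return dst;
        }

        /* unaligned 16-byte head, then advance to the next 16-byte
         * boundary of dst so the bulk stores below are aligned */
        _mm_storeu_si128((__m128i *)d, _mm_loadu_si128((const __m128i *)s));
        size_t head = 16 - ((uintptr_t)d & 15);
        d += head; s += head; n -= head;

        while (n >= 16) {
                /* aligned store, possibly unaligned load */
                _mm_store_si128((__m128i *)d,
                                _mm_loadu_si128((const __m128i *)s));
                d += 16; s += 16; n -= 16;
        }

        if (n) /* overlapping tail re-writes a few already-copied bytes */
                _mm_storeu_si128((__m128i *)(d + n - 16),
                                 _mm_loadu_si128((const __m128i *)(s + n - 16)));
        return dst;
}

The overlapping tail avoids a scalar cleanup loop at the cost of
re-copying up to 15 bytes, which is safe because memcpy semantics already
forbid overlapping source and destination buffers.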
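
Sketch 3 (illustrative only): an intrinsics-based inline move helper for
3.b/3.c. All loads are issued before the stores so the loads can overlap
in the pipeline instead of serializing into load/store pairs; the name
mov64_sketch is an assumption.

#include <stdint.h>
#include <immintrin.h>

static inline void
mov64_sketch(uint8_t *dst, const uint8_t *src)
{
        /* group the four loads first ... */
        __m128i xmm0 = _mm_loadu_si128((const __m128i *)(src + 0 * 16));
        __m128i xmm1 = _mm_loadu_si128((const __m128i *)(src + 1 * 16));
        __m128i xmm2 = _mm_loadu_si128((const __m128i *)(src + 2 * 16));
        __m128i xmm3 = _mm_loadu_si128((const __m128i *)(src + 3 * 16));
        /* ... then issue the four stores */
        _mm_storeu_si128((__m128i *)(dst + 0 * 16), xmm0);
        _mm_storeu_si128((__m128i *)(dst + 1 * 16), xmm1);
        _mm_storeu_si128((__m128i *)(dst + 2 * 16), xmm2);
        _mm_storeu_si128((__m128i *)(dst + 3 * 16), xmm3);
}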
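
Sketch 4 (illustrative only): the idea behind item 4. Before, sizes known
at compile time were routed to glibc memcpy; the rewrite sends every size
through the inline copy flow. All names here are placeholders, not the
DPDK macros.

#include <stddef.h>
#include <string.h>

/* placeholder for the intrinsics-based flow sketched above */
static inline void *
inline_copy_sketch(void *dst, const void *src, size_t n)
{
        unsigned char *d = dst;
        const unsigned char *s = src;
        while (n--)
                *d++ = *s++;
        return dst;
}

/* before: compile-time-constant sizes bypassed the optimized flow */
#define copy_old(dst, src, n)                                      \
        (__builtin_constant_p(n) ? memcpy((dst), (src), (n))       \
                                 : inline_copy_sketch((dst), (src), (n)))

/* after: one inline path regardless of whether n is constant */
#define copy_new(dst, src, n) inline_copy_sketch((dst), (src), (n))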
Test report: http://dpdk.org/ml/archives/dev/2015-January/011848.html
Signed-off-by: Zhihong Wang <zhihong.wang@intel.com>
Tested-by: Jingguo Fu <jingguox.fu@intel.com>
Reviewed-by: Konstantin Ananyev <konstantin.ananyev@intel.com>
Acked-by: Cunming Liang <cunming.liang@intel.com>
Acked-by: Pablo de Lara <pablo.de.lara.guarch@intel.com>