X-Git-Url: http://git.droids-corp.org/?a=blobdiff_plain;f=doc%2Fguides%2Fprog_guide%2Ftoeplitz_hash_lib.rst;h=f916857efca5a33e14b3c4797308226e338c84db;hb=b54403fd08374dae243bf0a0e90488f51485154e;hp=2480c551d070e89940d7464499ef50dfdf22c223;hpb=534fe5f33999fd92f27cbcf731accc91b8dc8b78;p=dpdk.git diff --git a/doc/guides/prog_guide/toeplitz_hash_lib.rst b/doc/guides/prog_guide/toeplitz_hash_lib.rst index 2480c551d0..f916857efc 100644 --- a/doc/guides/prog_guide/toeplitz_hash_lib.rst +++ b/doc/guides/prog_guide/toeplitz_hash_lib.rst @@ -36,3 +36,250 @@ The ``rte_softrss()`` function expects the ``rss_key`` to be exactly the same as the one installed on the NIC. The ``rte_softrss_be`` function is a faster implementation, but it expects ``rss_key`` to be converted to the host byte order. + + +Predictable RSS +--------------- + +In some use cases it is useful to have a way to find partial collisions of the +Toeplitz hash function. In figure :numref:`figure_rss_queue_assign` only a few +of the least significant bits (LSB) of the hash value are used to indicate an +entry in the RSS Redirection Table (ReTa) and thus the index of the queue. So, +in this case it would be useful to find another tuple whose hash has the same +LSB's as the hash from the original tuple. + +For example: + +- In the case of SNAT (Source Network Address Translation) it is possible to + find a special source port number on translation so that the hash of + returning packets, of the given connection, will have desired LSB's. +- In the case of MPLS (Multiprotocol Label Switching), if the MPLS tag is used + in the hash calculation, the Label Switching router can allocate a special + MPLS tag to bind an LSP (Label Switching Path) to a given queue. This method + can be used with the allocation of IPSec SPI, VXLan VNI, etc., to bind the + tunnel to the desired queue. +- In the case of a TCP stack, a special source port could be chosen for + outgoing connections, such that the response packets will be assigned to the + desired queue. + +This functionality is provided by the API shown below. +The API consists of 3 parts: + +* Create the thash context. + +* Create the thash helper, associated with a context. + +* Use the helper run time to calculate the adjustable bits of the tuple to + ensure a collision. + + +Thash context +~~~~~~~~~~~~~ + +The function ``rte_thash_init_ctx()`` initializes the context struct +associated with a particular NIC or a set of NICs. It expects: + +* The log2 value of the size of the RSS redirection table for the + corresponding NIC. It reflects the number of least significant bits of the + hash value to produce a collision for. + +* A predefined RSS hash key. This is optional, if ``NULL`` then a random key + will be initialized. + +* The length of the RSS hash key. This value is usually hardware/driver + specific and can be found in the NIC datasheet. + +* Optional flags, as shown below. + +Supported flags: + +* ``RTE_THASH_IGNORE_PERIOD_OVERFLOW`` - By default, and for security reasons, + the library prohibits generating a repeatable sequence in the hash key. This + flag disables such checking. The flag is mainly used for testing in the lab + to generate an RSS hash key with a uniform hash distribution, if the input + traffic also has a uniform distribution. + +* ``RTE_THASH_MINIMAL_SEQ`` - By default, the library generates a special bit + sequence in the hash key for all the bits of the subtuple. However, the + collision generation task requires only the ``log2(RETA_SZ)`` bits in the + subtuple. This flag forces the minimum bit sequence in the hash key to be + generated for the required ``log2(RETA_SZ)`` least significant bits of the + subtuple. The flag can be used in the case of a relatively large number of + helpers that may overlap with their corresponding bit sequences of RSS hash + keys. + + +Thash helper +~~~~~~~~~~~~ + +The function ``rte_thash_add_helper()`` initializes the helper struct +associated with a given context and a part of a target tuple of interest which +could be altered to produce a hash collision. On success it writes a specially +calculated bit sequence into the RSS hash key which is stored in the context +and calculates a table with values to be XORed with a subtuple. + +It expects: + +* A pointer to the Thash context to be associated with. + +* A length of the subtuple to be modified. The length is counted in bits. + +* An offset of the subtuple to be modified from the beginning of the tuple. It + is also counted in bits. + +.. note:: + + Adding a helper changes the key stored in the corresponding context. So the + updated RSS hash key must be uploaded into the NIC after creating all the + required helpers. + + +Calculation of the complementary bits to adjust the subtuple +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +The ``rte_thash_get_complement()`` function returns a special bit sequence +with length ``N = log2(rss_reta_sz)`` (for the ``rss_reta_sz`` provided at +context initialization) to be xored with N least significant bits of the +subtuple. + +It expects: + +* A corresponding helper created for a given subtuple of the tuple. + +* A hash value of the tuple we want to alter. + +* The desired LSB's of the hash value the user expects to have. + +After the returned bit sequence has been XORed with the subtuple, the resulted +LSB's of the new hash value, calculated from the altered tuple, will be the +same as in ``desired_hash``. + + +Adjust tuple API +~~~~~~~~~~~~~~~~~ + +The ``rte_thash_get_complement()`` function is a user-friendly wrapper around +a number of other functions. It alters a passed tuple to meet the above +mentioned requirements around the desired hash LSB's. + +It expects: + +* A Thash context and helper. + +* A pointer to the tuple to be changed. + +* The length of the tuple. + +* A callback function and its userdata to check the tuple after it has been + changed. + +* The number of attempts to change the tuple. Basically, it makes sense if + there is a callback and a limit on the number of attempts to change the + tuple, if the callback function returns an error. + + +Use case example +---------------- + +There could be a number of different use cases, such as NAT, TCP stack, MPLS +tag allocation, etc. In the following we will consider a SNAT application. + +Packets of a single bidirectional flow belonging to different directions can +end up being assigned to different queues and thus processed by different +lcores, as shown in :numref:`figure_predictable_snat_1`: + +.. _figure_predictable_snat_1: + +.. figure:: img/predictable_snat_1.* + + Bidirectional flow packets distribution in general + +That leads to a situation where the same packet flow can be shared between two +cores. Such a situation is not ideal from a performance perspective and +requires extra synchronization efforts that might lead to various performance +penalties, for example: + +* The connections table is global so locking/RCU on the flow insertion/removal + is required. + +* Connection metadata must be protected to avoid race conditions. + +* More cache pressure if a single connection metadata is kept in different + L1/L2 caches of a different CPU core. + +* Cache pressure/less cache locality on packet handover to the different cores. + +We can avoid all these penalties if it can be guaranteed that packets +belonging to one bidirectional flow will be assigned to the same queue, as +shown in :numref:`figure_predictable_snat_2`: + +.. _figure_predictable_snat_2: + +.. figure:: img/predictable_snat_2.* + + Bidirectional flow packets distribution with predictable RSS + + +To achieve this in a SNAT scenario it is possible to choose a source port not +randomly, but using the predictable RSS library to produce a partial hash +collision. This is shown in the code below. + +.. code-block:: c + + int key_len = 40; /* The default Niantic RSS key length. */ + + /** The default Niantic RSS reta size = 2^7 entries, LSBs of hash value are + * used as an indexes in RSS ReTa. */ + int reta_sz = 7; + int ret; + struct rte_thash_ctx *ctx; + + uint8_t initial_key[key_len] = {0}; /* Default empty key. */ + + /* Create and initialize a new thash context. */ + ctx = rte_thash_init_ctx("SNAT", key_len, reta_sz, initial_key, 0); + + /** Add a helper and specify the variable tuple part and its length. In the + * SNAT case we want to choose a new source port on SNAT translation in a + * way that the reverse tuple will have the same LSBs as the original + * direction tuple so that the selected source port will be the + * destination port on reply. + */ + ret = rte_thash_add_helper(ctx, "snat", sizeof(uint16_t) * 8, + offsetof(union rte_thash_tuple, v4.dport) * 8); + + if (ret != 0) + return ret; + + /* Get handler of the required helper. */ + struct rte_thash_subtuple_helper *h = rte_thash_get_helper(ctx, "snat"); + + /** After calling rte_thash_add_helper() the initial_key passed on ctx + * creation has been changed so we get the new one. + */ + uint8_t *new_key = rte_thash_get_key(ctx); + + union rte_thash_tuple tuple, rev_tuple; + + /* A complete tuple from the packet. */ + complete_tuple(mbuf, &tuple); + + /* Calculate the RSS hash or get it from mbuf->hash.rss. */ + uint32_t orig_hash = rte_softrss((uint32_t *)&tuple, RTE_THASH_V4_L4_LEN, new_key); + + /** Complete the reverse tuple by translating the SRC address and swapping + * src and dst addresses and ports. + */ + get_rev_tuple(&rev_tuple, &tuple, new_ip); + + /* Calculate the expected rss hash for the reverse tuple. */ + uint32_t rev_hash = rte_softrss((uint32_t *)&rev_tuple, RTE_THASH_V4_L4_LEN, new_key); + + /* Get the adjustment bits for the src port to get a new port. */ + uint32_t adj = rte_thash_get_compliment(h, rev_hash, orig_hash); + + /* Adjust the source port bits. */ + uint16_t new_sport = tuple.v4.sport ^ adj; + + /* Make an actual packet translation. */ + do_snat(mbuf, new_ip, new_sport);