From: Jerin Jacob
Date: Sat, 11 Apr 2020 14:14:27 +0000 (+0530)
Subject: doc: add graph library guide
X-Git-Url: http://git.droids-corp.org/?a=commitdiff_plain;h=4dc6d8e63c168f70c875e4a804c1249bd15b6f46;p=dpdk.git

doc: add graph library guide

Add programmer's guide for the Graph library and the inbuilt nodes.
This patch also updates the release notes for the new libraries.

Signed-off-by: Jerin Jacob
Signed-off-by: Kiran Kumar K
Signed-off-by: Nithin Dabilpuram
Signed-off-by: Pavan Nikhilesh
---

diff --git a/MAINTAINERS b/MAINTAINERS
index 3e7d22c8db..1ca7a2a3c5 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -1476,6 +1476,7 @@ Graph - EXPERIMENTAL
 M: Jerin Jacob
 M: Kiran Kumar K
 F: lib/librte_graph/
+F: doc/guides/prog_guide/graph_lib.rst
 F: app/test/test_graph*
 M: Nithin Dabilpuram
 F: examples/l3fwd-graph/
diff --git a/doc/guides/prog_guide/graph_lib.rst b/doc/guides/prog_guide/graph_lib.rst
new file mode 100644
index 0000000000..669d77c740
--- /dev/null
+++ b/doc/guides/prog_guide/graph_lib.rst
@@ -0,0 +1,397 @@
+..  SPDX-License-Identifier: BSD-3-Clause
+    Copyright(C) 2020 Marvell International Ltd.
+
+Graph Library and Inbuilt Nodes
+===============================
+
+Graph architecture abstracts the data processing functions as a ``node`` and
+``links`` them together to create a complex ``graph`` to enable reusable/modular
+data processing functions.
+
+The graph library provides an API for graph framework operations such as
+create, lookup, dump, and destroy on graphs, and node operations such as clone,
+edge update, and edge shrink. The API also allows creating a stats
+cluster to monitor per-graph and per-node stats.
+
+Features
+--------
+
+Features of the Graph library are:
+
+- Nodes as plugins.
+- Support for out-of-tree nodes.
+- Inbuilt nodes for packet processing.
+- Multi-process support.
+- Low overhead graph walk and node enqueue.
+- Low overhead statistics collection infrastructure.
+- Support to export the graph as a Graphviz dot file. See ``rte_graph_export()``.
+- Allow having another graph walk implementation in the future by segregating
+  the fast path (``rte_graph_worker.h``) and slow path code.
+
+Advantages of Graph architecture
+--------------------------------
+
+- Memory latency is the enemy of high-speed packet processing; moving
+  similar packet processing code into a node reduces I-cache and D-cache
+  misses.
+- Exploits the probability that most packets will follow the same nodes in the
+  graph.
+- Allows SIMD instructions to be used for packet processing in the node.
+- The modular scheme allows having reusable nodes for the consumers.
+- The modular scheme allows us to abstract the vendor HW specific
+  optimizations as a node.
+
+Performance tuning parameters
+-----------------------------
+
+- Test with various burst size values (256, 128, 64, 32) using the
+  ``CONFIG_RTE_GRAPH_BURST_SIZE`` config option.
+  Testing shows that on x86 and arm64 servers the sweet spot is a burst
+  size of 256, while on arm64 embedded SoCs it is either 64 or 128.
+- Disable node statistics (using ``CONFIG_RTE_LIBRTE_GRAPH_STATS`` config option)
+  if not needed.
+- Use arm64 optimized memory copy for arm64 architecture by
+  selecting ``CONFIG_RTE_ARCH_ARM64_MEMCPY``.
+
+Programming model
+-----------------
+
+Anatomy of Node:
+~~~~~~~~~~~~~~~~
+
+.. _figure_anatomy_of_a_node:
+
+.. figure:: img/anatomy_of_a_node.*
+
+The :numref:`figure_anatomy_of_a_node` diagram depicts the anatomy of a node.
+
+The node is the basic building block of the graph framework.
+
+A node consists of:
+
+process():
+^^^^^^^^^^
+
+This callback is invoked by the worker thread through the
+``rte_graph_walk()`` function when there is data to be processed by the node.
+A graph node processes the data in ``process()`` and enqueues it to the next
+downstream node using the ``rte_node_enqueue*()`` functions.
+
+Context memory:
+^^^^^^^^^^^^^^^
+
+It is memory allocated by the library to store the node-specific context
+information. This memory is used by the ``process()``, ``init()``, and ``fini()`` callbacks.
+
+init():
+^^^^^^^
+
+This callback is invoked by ``rte_graph_create()`` when
+a node gets attached to a graph.
+
+fini():
+^^^^^^^
+
+This callback is invoked by ``rte_graph_destroy()`` when a
+node gets detached from a graph.
+
+Node name:
+^^^^^^^^^^
+
+It is the name of the node. When a node registers with the graph library, the library
+assigns it an ID of type ``rte_node_t``. Either the ID or the name can be used to look up the
+node. ``rte_node_from_name()`` and ``rte_node_id_to_name()`` are the node
+lookup functions.
+
+nb_edges:
+^^^^^^^^^
+
+The number of downstream nodes connected to this node. The ``next_nodes[]`` array
+stores the downstream node objects. The ``rte_node_edge_update()`` and
+``rte_node_edge_shrink()`` functions are used to update the ``next_nodes[]``
+objects. Consumers of the node APIs are free to update the ``next_nodes[]``
+objects until ``rte_graph_create()`` is invoked.
+
+next_nodes[]:
+^^^^^^^^^^^^^
+
+The dynamic array that stores the downstream nodes connected to this node. A downstream
+node must not be the current node itself or a source node.
+
+Source node:
+^^^^^^^^^^^^
+
+Source nodes are static nodes created using ``RTE_NODE_REGISTER`` by passing
+``flags`` as ``RTE_NODE_SOURCE_F``.
+While performing the graph walk, the ``process()`` function of all the source
+nodes is called first, so that these nodes can be used as input nodes for a graph.
+
+Node creation and registration
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+* Node implementer creates the node by implementing ops and attributes of
+  ``struct rte_node_register``.
+
+* The library registers the node by invoking ``RTE_NODE_REGISTER`` on library load
+  using the constructor scheme. The constructor scheme is used here to support multi-process.
+
+Link the Nodes to create the graph topology
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+.. _figure_link_the_nodes:
+
+.. figure:: img/link_the_nodes.*
+
+The :numref:`figure_link_the_nodes` diagram shows a graph topology after
+linking the N nodes.
+
+Once nodes are available to the program, the application or node public API
+functions can link them together to create a complex packet processing graph.
+
+There are multiple strategies to link the nodes.
+
+Method (a):
+^^^^^^^^^^^
+Provide the ``next_nodes[]`` at the node registration time. See ``struct rte_node_register::nb_edges``.
+This is a use case to address the static node scheme where one knows upfront the
+``next_nodes[]`` of the node.
+
+Method (b):
+^^^^^^^^^^^
+Use ``rte_node_edge_get()``, ``rte_node_edge_update()``, ``rte_node_edge_shrink()``
+to update the ``next_nodes[]`` links for the node at runtime, but before graph creation.
+
+Method (c):
+^^^^^^^^^^^
+Use ``rte_node_clone()`` to clone an already existing node created using ``RTE_NODE_REGISTER``.
+When ``rte_node_clone()`` is invoked, the library clones all the attributes
+of the node and creates a new one. The name of the cloned node is
+``"parent_node_name-user_provided_name"``.
+
+This method enables the use case of Rx and Tx nodes where multiple of those nodes
+need to be cloned based on the number of CPUs available in the system.
+The cloned nodes will be identical, except for the ``"context memory"``.
+Context memory holds the port and queue pair information in the case of Rx and Tx
+ethdev nodes.
+
+Create the graph object
+~~~~~~~~~~~~~~~~~~~~~~~
+Now that the nodes are linked, it is time to create a graph by including
+the required nodes. The application can provide a set of node patterns to
+form a graph object. The ``fnmatch()`` API is used underneath for the pattern
+matching to include the required nodes. After the graph is created, changes to
+the nodes or the graph are not allowed.
+
+The ``rte_graph_create()`` API shall be used to create the graph.
+
+Example of a graph object creation:
+
+.. code-block:: console
+
+   {"ethdev_rx-0-0", "ip4*", "ethdev_tx-*"}
+
+In the above example, a graph object will be created with the ethdev Rx
+node of port 0 and queue 0, all ip4* nodes in the system,
+and the ethdev Tx nodes of all ports.
+
+Multicore graph processing
+~~~~~~~~~~~~~~~~~~~~~~~~~~
+In the current graph library implementation, specifically,
+``rte_graph_walk()`` and ``rte_node_enqueue*`` fast path API functions
+are designed to work on a single core for better performance.
+The fast path API works on a graph object, so the multi-core graph
+processing strategy is to create one graph object per worker.
+
+In fast path
+~~~~~~~~~~~~
+Typical fast-path code looks like below, where the application
+gets the fast-path graph object using ``rte_graph_lookup()``
+on the worker thread and runs ``rte_graph_walk()`` in a tight loop.
+
+.. code-block:: c
+
+    struct rte_graph *graph = rte_graph_lookup("worker0");
+
+    while (!done) {
+        rte_graph_walk(graph);
+    }
+
+Context update when graph walk in action
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+The fast-path object for the node is ``struct rte_node``.
+
+In the slow path, or after the graph walk is in action,
+the user may need to update the context of the node and hence needs access to the
+``struct rte_node *`` memory.
+
+The ``rte_graph_foreach_node()``, ``rte_graph_node_get()``, and
+``rte_graph_node_get_by_name()`` APIs can be used to get the
+``struct rte_node *``. The ``rte_graph_foreach_node()`` iterator works on the
+``struct rte_graph *`` fast-path graph object, while the others work on the graph ID or name.
+
+Get the node statistics using graph cluster
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+The user may need to know the aggregate stats of the node across
+multiple graph objects, especially in the situation where each graph object is bound
+to a worker thread.
+
+A graph cluster object is provided for statistics.
+The ``rte_graph_cluster_stats_create()`` API shall be used for creating a
+graph cluster with multiple graph objects and ``rte_graph_cluster_stats_get()``
+to get the aggregate node statistics.
+
+An example statistics output from ``rte_graph_cluster_stats_get()``:
+
+.. code-block:: diff
+
+    +---------+-----------+-------------+---------------+-----------+---------------+-----------+
+    |Node     |calls      |objs         |realloc_count  |objs/call  |objs/sec(10E6) |cycles/call|
+    +---------+-----------+-------------+---------------+-----------+---------------+-----------+
+    |node0    |12977424   |3322220544   |5              |256.000    |3047.151872    |20.0000    |
+    |node1    |12977653   |3322279168   |0              |256.000    |3047.210496    |17.0000    |
+    |node2    |12977696   |3322290176   |0              |256.000    |3047.221504    |17.0000    |
+    |node3    |12977734   |3322299904   |0              |256.000    |3047.231232    |17.0000    |
+    |node4    |12977784   |3322312704   |1              |256.000    |3047.243776    |17.0000    |
+    |node5    |12977825   |3322323200   |0              |256.000    |3047.254528    |17.0000    |
+    +---------+-----------+-------------+---------------+-----------+---------------+-----------+
+
+Node writing guidelines
+~~~~~~~~~~~~~~~~~~~~~~~
+
+The ``process()`` function of a node is the fast-path function and needs
+to be written carefully to achieve maximum performance.
+
+Broadly speaking, there are two different types of nodes.
+
+Static nodes
+~~~~~~~~~~~~
+The first kind of nodes are those that have a fixed ``next_nodes[]`` for the
+complete burst (like ethdev_rx, ethdev_tx); these are simple to write.
+The ``process()`` function can move the obj burst to the next node either using
+``rte_node_next_stream_move()`` or using ``rte_node_next_stream_get()`` and
+``rte_node_next_stream_put()``.
+
+Intermediate nodes
+~~~~~~~~~~~~~~~~~~
+The second kind is ``intermediate nodes``, which decide which of the
+``next_nodes[]`` to send to on a per-packet basis. In these nodes,
+
+* Firstly, there has to be the best possible packet processing logic.
+
+* Secondly, each packet needs to be queued to its next node.
+
+This can be done using the ``rte_node_enqueue_[x1|x2|x4]()`` APIs if the packets
+go to a single next node, or ``rte_node_enqueue_next()``, which takes an array of next nodes.
+
+In a scenario where multiple intermediate nodes are present but most of the time
+each node uses the same next node for all its packets, the cost of moving every
+pointer from the current node's stream to the next node's stream can be avoided.
+This is called a home run, and ``rte_node_next_stream_move()`` can be used to
+move the stream from the current node to the next node with the least number of cycles.
+Since the copy can be avoided only when all the packets are destined
+to the same next node, the node implementation must also handle the worst case
+where every packet could be going to a different next node.
+
+Example of intermediate node implementation with home run:
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+1. Start with the speculation that next_node = node->ctx.
+   This could be the next_node the application used in the previous function call of this node.
+
+2. Get the next_node stream array with the required space using
+   ``rte_node_next_stream_get(next_node, space)``.
+
+3. While n_left_from > 0 (i.e. packets are left to be sent), prefetch the next pkt_set
+   and process the current pkt_set to find their next nodes.
+
+4. If all the next nodes of the current pkt_set match the speculated next node,
+   just count them as successfully speculated (``last_spec``) till now and
+   continue the loop without actually moving them to the next node. Otherwise, if there is
+   a mismatch, copy all the pkt_set pointers that were ``last_spec`` and move the
+   current pkt_set to their respective next nodes using ``rte_node_enqueue_x1()``.
+   Also, one of the next_node can be updated as the speculated next_node if it is more
+   probable. Finally, reset ``last_spec`` to zero.
+
+5. If n_left_from != 0, go to step 3 to process the remaining packets.
+
+6. If last_spec == nb_objs, all the objects passed were successfully speculated
+   to a single next node. So, the current stream can be moved to the next node using
+   ``rte_node_next_stream_move(node, next_node)``.
+   This is the ``home run``, where the memcpy of buffer pointers to the next node is avoided.
+
+7. Update ``node->ctx`` with the more probable next node.
+
+Graph object memory layout
+--------------------------
+.. _figure_graph_mem_layout:
+
+.. figure:: img/graph_mem_layout.*
+
+The :numref:`figure_graph_mem_layout` diagram shows ``rte_graph`` object memory
+layout. Understanding the memory layout helps to debug the graph library and
+improve the performance if needed.
+
+Graph object consists of a header, circular buffer to store the pending
+stream when walking over the graph, and variable-length memory to store
+the ``rte_node`` objects.
+
+``graph_nodes_mem_create()`` creates and populates this memory. Functions
+such as ``rte_graph_walk()`` and ``rte_node_enqueue_*`` use this memory
+to enable fastpath services.
+
+Inbuilt Nodes
+-------------
+
+DPDK provides a set of nodes for data processing. The following sections
+document them.
+
+ethdev_rx
+~~~~~~~~~
+This node does ``rte_eth_rx_burst()`` into the stream buffer passed to it
+(src node stream) and does ``rte_node_next_stream_move()`` only when
+there are packets received. Each ``rte_node`` works only on one Rx port and
+queue that it gets from ``node->ctx``. For each (port X, rx_queue Y),
+a rte_node is cloned from ethdev_rx_base_node as ``ethdev_rx-X-Y`` in
+``rte_node_eth_config()`` along with updating ``node->ctx``.
+Each graph needs to be associated with a unique rte_node for a (port, rx_queue).
+
+ethdev_tx
+~~~~~~~~~
+This node does ``rte_eth_tx_burst()`` for a burst of objs received by it.
+It sends the burst to a fixed Tx port and queue obtained from
+``node->ctx``. For each (port X), this ``rte_node`` is cloned from
+ethdev_tx_node_base as "ethdev_tx-X" in ``rte_node_eth_config()``
+along with updating ``node->ctx``.
+
+Since each graph doesn't need more than one Txq per port, a Txq is assigned
+based on graph id to each rte_node instance. Each graph needs to be associated
+with a rte_node for each (port).
+
+pkt_drop
+~~~~~~~~
+This node frees all the objects passed to it, treating them as
+``rte_mbufs`` that need to be freed.
+
+ip4_lookup
+~~~~~~~~~~
+This node is an intermediate node that does an LPM lookup for the received
+ipv4 packets, and the result determines each packet's next node.
+
+On successful LPM lookup, the result contains the ``next_node`` id and
+``next-hop`` id with which the packet needs to be further processed.
+
+On LPM lookup failure, objects are redirected to the pkt_drop node.
+``rte_node_ip4_route_add()`` is the control path API to add ipv4 routes.
+To achieve a home run, the node uses ``rte_node_next_stream_move()`` as mentioned in the above
+sections.
+
+ip4_rewrite
+~~~~~~~~~~~
+This node gets packets from the ``ip4_lookup`` node, with the next-hop id for each
+packet embedded in ``node_mbuf_priv1(mbuf)->nh``. This id is used
+to determine the L2 header to be written to the packet before sending
+the packet out to a particular ethdev_tx node.
+``rte_node_ip4_rewrite_add()`` is the control path API to add next-hop info.
+
+null
+~~~~
+This node ignores the set of objects passed to it and reports that all are
+processed.
+
diff --git a/doc/guides/prog_guide/img/anatomy_of_a_node.svg b/doc/guides/prog_guide/img/anatomy_of_a_node.svg
new file mode 100644
index 0000000000..fa4b5b2d5a
--- /dev/null
+++ b/doc/guides/prog_guide/img/anatomy_of_a_node.svg
@@ -0,0 +1,1078 @@
[SVG image content omitted]
diff --git a/doc/guides/prog_guide/img/graph_mem_layout.svg b/doc/guides/prog_guide/img/graph_mem_layout.svg
new file mode 100644
index 0000000000..1d41729c9c
--- /dev/null
+++ b/doc/guides/prog_guide/img/graph_mem_layout.svg
@@ -0,0 +1,702 @@
[SVG image content omitted]
diff --git a/doc/guides/prog_guide/img/link_the_nodes.svg b/doc/guides/prog_guide/img/link_the_nodes.svg
new file mode 100644
index 0000000000..4a127e67c5
--- /dev/null
+++ b/doc/guides/prog_guide/img/link_the_nodes.svg
@@ -0,0 +1,3330 @@
[SVG image content omitted]
diff --git a/doc/guides/prog_guide/index.rst b/doc/guides/prog_guide/index.rst
index 1d0cd49cd7..f0ae3c1e7e 100644
--- a/doc/guides/prog_guide/index.rst
+++ b/doc/guides/prog_guide/index.rst
@@ -58,6 +58,7 @@ Programmer's Guide
     metrics_lib
     bpf_lib
     ipsec_lib
+    graph_lib
     source_org
     dev_kit_build_system
     dev_kit_root_make_help
diff --git a/doc/guides/rel_notes/release_20_05.rst b/doc/guides/rel_notes/release_20_05.rst
index ab20a7d021..3d53c8c99b 100644
--- a/doc/guides/rel_notes/release_20_05.rst
+++ b/doc/guides/rel_notes/release_20_05.rst
@@ -212,6 +212,30 @@ New Features
   * Added IPsec inbound load-distribution support for ipsec-secgw application
     using NIC load distribution feature(Flow Director).
 
+* **Added rte_graph library.**
+
+  Graph architecture abstracts the data processing functions as a ``node`` and
+  ``links`` them together to create a complex ``graph`` to enable reusable/modular
+  data processing functions. The graph library provides an API for graph
+  framework operations such as create, lookup, dump, and destroy on graphs, and node
+  operations such as clone, edge update, and edge shrink.
+  The API also allows creating a stats cluster to monitor per-graph and per-node stats.
+
+* **Added rte_node library which consists of a set of packet processing nodes.**
+
+  The rte_node library consists of nodes used by the rte_graph library. Each
+  node performs a specific packet processing function based on application
+  configuration. The following nodes are added:
+
+  * Null node: Skeleton node that defines the general structure of a node.
+  * Ethernet device node: Consists of ethernet Rx/Tx nodes as well as ethernet
+    control APIs.
+  * IPv4 lookup node: Consists of ipv4 extract and lpm lookup node. Routes can
+    be configured by the application through ``rte_node_ip4_route_add`` function.
+  * IPv4 rewrite node: Consists of ipv4 and ethernet header rewrite functionality
+    that can be configured through ``rte_node_ip4_rewrite_add`` function.
+  * Packet drop node: Frees the packets received to their respective mempool.
+
 Removed Items
 -------------