doc/guides/prog_guide/traffic_management.rst

..  BSD LICENSE
    Copyright(c) 2017 Intel Corporation. All rights reserved.
    All rights reserved.

    Redistribution and use in source and binary forms, with or without
    modification, are permitted provided that the following conditions
    are met:

    * Redistributions of source code must retain the above copyright
    notice, this list of conditions and the following disclaimer.
    * Redistributions in binary form must reproduce the above copyright
    notice, this list of conditions and the following disclaimer in
    the documentation and/or other materials provided with the
    distribution.
    * Neither the name of Intel Corporation nor the names of its
    contributors may be used to endorse or promote products derived
    from this software without specific prior written permission.

    THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
    "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
    LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
    A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
    OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
    SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
    LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
    DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
    THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
    (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
    OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.


Traffic Management API
======================


Overview
--------

This is the generic API for the Quality of Service (QoS) Traffic Management of
Ethernet devices, which includes the following main features: hierarchical
scheduling, traffic shaping, congestion management and packet marking. This API
is agnostic of the underlying HW, SW or mixed HW-SW implementation.

Main features:

* Part of DPDK rte_ethdev API
* Capability query API per port, per hierarchy level and per hierarchy node
* Scheduling algorithms: Strict Priority (SP), Weighted Fair Queuing (WFQ)
* Traffic shaping: single/dual rate, private (per node) and
  shared (by multiple nodes) shapers
* Congestion management for hierarchy leaf nodes: algorithms of tail drop, head
  drop, WRED, private (per node) and shared (by multiple nodes) WRED contexts
* Packet marking: IEEE 802.1Q (VLAN DEI), IETF RFC 3168 (IPv4/IPv6 ECN for TCP
  and SCTP), IETF RFC 2597 (IPv4 / IPv6 DSCP)


Capability API
--------------

The aim of these APIs is to advertise the capability information (i.e. critical
parameter values) that the TM implementation (HW/SW) is able to support for the
application. The APIs support information disclosure at the TM level, at any
hierarchical level of the TM and at any node level of the specific
hierarchical level. Such information helps to determine quickly whether a
specific implementation meets the needs of the user application.

At the TM level, users can get a high-level idea with the help of various
parameters such as maximum number of nodes, maximum number of hierarchical
levels, maximum number of shapers, maximum number of private shapers, type of
scheduling algorithm (Strict Priority, Weighted Fair Queuing, etc.), etc.,
supported by the implementation.

Likewise, users can query the capability of the TM at the hierarchical level to
have more granular knowledge about the specific level. Various parameters
such as maximum number of nodes at the level, maximum number of leaf/non-leaf
nodes at the level, type of shaper (dual rate, single rate) supported at the
level if the node is of non-leaf type, etc., are exposed as a result of the
hierarchical level capability query.

Finally, the node level capability API offers knowledge about the capability
supported by the node at any specific level. Information such as whether
support is available for a private shaper or a dual rate shaper, and the
maximum and minimum shaper rate, is exposed by the node level capability API.
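
As an illustration of how an application might use the port-level figures, the
sketch below compares a set of requirements against advertised limits. The
structures and the ``tm_caps_satisfy()`` helper are hypothetical stand-ins
written for this guide, not the definitions from ``rte_tm.h``; in a real
application the limits would be filled in by the capability query functions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical subset of port-level TM capabilities (illustrative only). */
struct tm_caps {
    uint32_t n_nodes_max;          /* maximum number of nodes */
    uint32_t n_levels_max;         /* maximum number of hierarchy levels */
    uint32_t shaper_n_max;         /* maximum number of shapers */
    uint32_t shaper_private_n_max; /* maximum number of private shapers */
    bool dual_rate_supported;      /* dual rate shapers are available */
};

/* Hypothetical description of what the application needs. */
struct tm_needs {
    uint32_t n_nodes;
    uint32_t n_levels;
    uint32_t n_private_shapers;
    bool dual_rate;
};

/* Return true when the advertised capabilities cover the requirements. */
bool tm_caps_satisfy(const struct tm_caps *cap, const struct tm_needs *need)
{
    assert(cap != NULL && need != NULL);
    return need->n_nodes <= cap->n_nodes_max &&
           need->n_levels <= cap->n_levels_max &&
           need->n_private_shapers <= cap->shaper_private_n_max &&
           need->n_private_shapers <= cap->shaper_n_max &&
           (!need->dual_rate || cap->dual_rate_supported);
}
```

The same pattern can be repeated per level and per node once the port-level
check has passed.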


Scheduling Algorithms
---------------------

The fundamental scheduling algorithms that are supported are Strict Priority
(SP) and Weighted Fair Queuing (WFQ). The SP and WFQ algorithms are supported
at the level of each node of the scheduling hierarchy, regardless of the node
level/position in the tree. The SP algorithm is used to schedule between
sibling nodes with different priority, while WFQ is used to schedule between
groups of siblings that have the same priority.

Algorithms such as Weighted Round Robin (WRR), byte-level WRR, Deficit WRR
(DWRR), etc. are considered approximations of the ideal WFQ and are therefore
assimilated to WFQ, although an associated implementation-dependent accuracy,
performance and resource usage trade-off might exist.
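
To make the notion of a WFQ approximation concrete, here is a minimal sketch
of one Deficit WRR round over a group of same-priority siblings. All names
(``dwrr_queue``, ``dwrr_round``, ``QUEUE_CAP``) and the fixed-size packet ring
are assumptions made for the example; it does not describe how any particular
driver implements scheduling.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define QUEUE_CAP 8 /* illustrative fixed queue depth */

struct dwrr_queue {
    uint32_t weight;             /* relative weight, at least 1 */
    uint32_t deficit;            /* byte credit carried between rounds */
    uint32_t pkt_len[QUEUE_CAP]; /* lengths of queued packets, FIFO order */
    size_t head;                 /* index of the oldest packet */
    size_t count;                /* number of queued packets */
};

/*
 * Run one DWRR round over n queues. Each backlogged queue earns
 * quantum * weight bytes of credit and sends packets while the head
 * packet fits in its credit. Returns the total number of bytes sent and
 * adds the per-queue byte counts to bytes_sent[].
 */
uint32_t dwrr_round(struct dwrr_queue *q, size_t n, uint32_t quantum,
                    uint32_t *bytes_sent)
{
    uint32_t total = 0;

    assert(q != NULL && bytes_sent != NULL && quantum > 0);
    for (size_t i = 0; i < n; i++) {
        if (q[i].count == 0) {
            q[i].deficit = 0; /* idle queues do not accumulate credit */
            continue;
        }
        q[i].deficit += quantum * q[i].weight;
        while (q[i].count > 0 && q[i].pkt_len[q[i].head] <= q[i].deficit) {
            uint32_t len = q[i].pkt_len[q[i].head];

            q[i].deficit -= len;
            q[i].head = (q[i].head + 1) % QUEUE_CAP;
            q[i].count--;
            bytes_sent[i] += len;
            total += len;
        }
        if (q[i].count == 0)
            q[i].deficit = 0;
    }
    return total;
}
```

Over many rounds the bytes sent by each backlogged queue tend towards the
ratio of the weights, which is the behavior that ideal WFQ provides
continuously.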


Traffic Shaping
---------------

The TM API provides support for single rate and dual rate shapers (rate
limiters) for the hierarchy nodes, subject to the specific implementation
support being available.

Each hierarchy node has zero or one private shaper (only one node using it)
and/or zero, one or several shared shapers (multiple nodes use the same shaper
instance). A private shaper is used to perform traffic shaping for a single
node, while a shared shaper is used to perform traffic shaping for a group of
nodes.

The configuration of private and shared shapers is done through the definition
of shaper profiles. Any shaper profile (single rate or dual rate shaper) can be
used by one or several shaper instances (either private or shared).

Single rate shapers use a single token bucket. Therefore, a single rate shaper
is configured by setting the rate of the committed bucket to zero, which
effectively disables this bucket. The peak bucket is used to limit the rate
and the burst size for the single rate shaper. Dual rate shapers use both the
committed and the peak token buckets. The rate of the peak bucket has to be
greater than zero, as well as greater than or equal to the rate of the
committed bucket.
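
The rules above can be sketched as a small profile check plus a token bucket
conformance test. The structure layout, the function names and the
nanosecond-based refill are assumptions made for illustration (rates are taken
to be in bytes per second, bucket sizes in bytes); they are not the
``rte_tm.h`` definitions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

struct token_bucket {
    uint64_t rate; /* assumed unit: bytes per second */
    uint64_t size; /* assumed unit: bytes (maximum burst) */
};

struct shaper_profile {
    struct token_bucket committed; /* rate 0 disables this bucket */
    struct token_bucket peak;
};

/* Peak rate must be non-zero and not lower than the committed rate. */
bool shaper_profile_is_valid(const struct shaper_profile *p)
{
    assert(p != NULL);
    return p->peak.rate > 0 && p->peak.rate >= p->committed.rate;
}

/* A zero committed rate means only the peak bucket is in use. */
bool shaper_profile_is_single_rate(const struct shaper_profile *p)
{
    assert(p != NULL);
    return p->committed.rate == 0;
}

/*
 * Refill one bucket for elapsed_ns nanoseconds, then try to send pkt_len
 * bytes. Returns true and consumes tokens when the packet conforms.
 * The multiplication assumes rate * elapsed_ns fits in 64 bits.
 */
bool bucket_conform(const struct token_bucket *tb, uint64_t *tokens,
                    uint64_t elapsed_ns, uint32_t pkt_len)
{
    assert(tb != NULL && tokens != NULL);
    uint64_t refill = tb->rate * elapsed_ns / 1000000000ULL;

    *tokens += refill;
    if (*tokens > tb->size)
        *tokens = tb->size; /* burst is capped by the bucket size */
    if (*tokens < pkt_len)
        return false;
    *tokens -= pkt_len;
    return true;
}
```

A dual rate shaper would apply ``bucket_conform()`` to both buckets of the
profile, while a single rate shaper would apply it to the peak bucket only.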


Congestion Management
---------------------

Congestion management is used to control the admission of packets into a packet
queue or group of packet queues on congestion. The congestion management
algorithms that are supported are: Tail Drop, Head Drop and Weighted Random
Early Detection (WRED). They are made available for every leaf node in the
hierarchy, subject to the specific implementation supporting them.

When a new packet is to be written into a queue that is full, the Tail Drop
algorithm drops the new packet and leaves the queue unmodified. The Head Drop
algorithm instead drops the packet at the head of the queue (the oldest packet
waiting in the queue) and admits the new packet at the tail of the queue.
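
The difference between the two policies can be shown with a small ring
buffer. The queue type, its fixed depth and the ``pkt_enqueue()`` function are
hypothetical and exist only for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define QUEUE_CAP 4 /* illustrative queue depth */

enum drop_policy { TAIL_DROP, HEAD_DROP };

struct pkt_queue {
    uint32_t pkt_id[QUEUE_CAP]; /* packet identifiers in FIFO order */
    size_t head;                /* index of the oldest packet */
    size_t count;               /* number of queued packets */
};

/*
 * Enqueue new_id. Returns -1 when nothing was dropped, otherwise the
 * identifier of the packet that was dropped.
 */
int64_t pkt_enqueue(struct pkt_queue *q, enum drop_policy policy,
                    uint32_t new_id)
{
    assert(q != NULL);
    if (q->count == QUEUE_CAP) {
        if (policy == TAIL_DROP)
            return new_id; /* queue is left unmodified */

        /* HEAD_DROP: evict the oldest packet, then admit the new one. */
        uint32_t victim = q->pkt_id[q->head];

        q->head = (q->head + 1) % QUEUE_CAP;
        q->count--;
        q->pkt_id[(q->head + q->count) % QUEUE_CAP] = new_id;
        q->count++;
        return victim;
    }
    q->pkt_id[(q->head + q->count) % QUEUE_CAP] = new_id;
    q->count++;
    return -1;
}
```

With a full queue holding packets 1 to 4, Tail Drop rejects packet 5 while
Head Drop discards packet 1 and keeps the newcomer.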

The Random Early Detection (RED) algorithm works by proactively dropping more
and more input packets as the queue occupancy builds up. When the queue is full
or almost full, RED effectively works as Tail Drop. The Weighted RED (WRED)
algorithm uses a separate set of RED thresholds for each packet color.
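
A minimal sketch of per-color thresholds follows. It assumes the classic RED
formulation, in which the drop probability rises linearly between a minimum
and a maximum threshold, and it uses the instantaneous queue length for
brevity where a real implementation would typically use a filtered average.
The type and field names are illustrative.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum pkt_color { COLOR_GREEN, COLOR_YELLOW, COLOR_RED, COLOR_COUNT };

/* One set of RED thresholds (illustrative field names). */
struct red_params {
    uint32_t min_th;   /* below this length nothing is dropped */
    uint32_t max_th;   /* at or above this length everything is dropped */
    uint32_t maxp_inv; /* inverse of the drop probability reached at max_th */
};

/* WRED: one independent set of RED thresholds per packet color. */
struct wred_profile {
    struct red_params color[COLOR_COUNT];
};

/* Drop probability in [0, 1] for a packet of color c. */
double wred_drop_probability(const struct wred_profile *p, enum pkt_color c,
                             uint32_t queue_len)
{
    assert(p != NULL && c < COLOR_COUNT);
    const struct red_params *r = &p->color[c];

    assert(r->min_th < r->max_th && r->maxp_inv > 0);
    if (queue_len < r->min_th)
        return 0.0;
    if (queue_len >= r->max_th)
        return 1.0; /* behaves like Tail Drop */
    return (double)(queue_len - r->min_th) /
           (double)(r->max_th - r->min_th) / (double)r->maxp_inv;
}
```

Giving red packets lower thresholds than green packets makes them the first
to be dropped as the queue fills.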

Each hierarchy leaf node with WRED enabled as its congestion management mode
has zero or one private WRED context (only one leaf node using it) and/or zero,
one or several shared WRED contexts (multiple leaf nodes use the same WRED
context). A private WRED context is used to perform congestion management for
a single leaf node, while a shared WRED context is used to perform congestion
management for a group of leaf nodes.

The configuration of WRED private and shared contexts is done through the
definition of WRED profiles. Any WRED profile can be used by one or several
WRED contexts (either private or shared).


Packet Marking
--------------

The TM APIs support various types of packet marking, such as VLAN DEI packet
marking (IEEE 802.1Q), IPv4/IPv6 ECN marking of TCP and SCTP packets (IETF RFC
3168) and IPv4/IPv6 DSCP packet marking (IETF RFC 2597).

All VLAN frames of a given color get their DEI bit set if marking is enabled
for this color. When marking for a given color is not enabled, the DEI bit is
left as is (either set or not).

All IPv4/IPv6 packets of a given color with ECN set to 2'b01 or 2'b10 carrying
TCP or SCTP have their ECN set to 2'b11 if the marking feature is enabled for
the current color, otherwise the ECN field is left as is.
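
The ECN rule can be written as a short function operating on the IPv4 TOS or
IPv6 Traffic Class byte, whose two least significant bits carry the ECN field.
The function name and its parameters are hypothetical; the protocol numbers are
the IANA values for TCP and SCTP.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define L4_PROTO_TCP  6   /* IANA protocol number for TCP */
#define L4_PROTO_SCTP 132 /* IANA protocol number for SCTP */

/*
 * Apply ECN marking to the TOS / Traffic Class byte in place.
 * Returns true when the byte was rewritten to carry ECN 2'b11.
 */
bool ecn_mark(uint8_t *tos, uint8_t l4_proto, bool marking_enabled)
{
    assert(tos != NULL);
    uint8_t ecn = *tos & 0x3;
    bool ecn_capable = (ecn == 0x1 || ecn == 0x2);
    bool l4_eligible = (l4_proto == L4_PROTO_TCP ||
                        l4_proto == L4_PROTO_SCTP);

    if (!marking_enabled || !ecn_capable || !l4_eligible)
        return false; /* ECN field is left as is */
    *tos |= 0x3;      /* the six upper (DSCP) bits are untouched */
    return true;
}
```

Here ``marking_enabled`` stands for the per-color setting, so the caller is
expected to look it up from the color of the packet.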

All IPv4/IPv6 packets have their color marked into DSCP bits 3 and 4 as
follows: green mapped to Low Drop Precedence (2'b01), yellow to Medium (2'b10)
and red to High (2'b11). Marking needs to be explicitly enabled for each color;
when not enabled for a given color, the DSCP field of all packets with that
color is left as is.
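
The mapping can be sketched as follows. The example assumes that the two
drop-precedence bits sit where the RFC 2597 Assured Forwarding codepoints place
them, which corresponds to bits 4:3 of the TOS / Traffic Class byte when
counting from the least significant bit. The names are illustrative.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

enum tm_color { TM_GREEN, TM_YELLOW, TM_RED, TM_COLORS };

#define DROP_PREC_SHIFT 3
#define DROP_PREC_MASK  (0x3u << DROP_PREC_SHIFT) /* TOS bits 4:3 */

/*
 * Return the TOS / Traffic Class byte with the drop precedence rewritten
 * for the given color, or unchanged when marking is disabled for it.
 */
uint8_t dscp_mark(uint8_t tos, enum tm_color color,
                  const bool mark_enabled[TM_COLORS])
{
    /* green: low (01), yellow: medium (10), red: high (11) */
    static const uint8_t drop_prec[TM_COLORS] = { 0x1, 0x2, 0x3 };

    assert(mark_enabled != NULL && color < TM_COLORS);
    if (!mark_enabled[color])
        return tos;
    return (uint8_t)((tos & ~DROP_PREC_MASK) |
                     ((uint32_t)drop_prec[color] << DROP_PREC_SHIFT));
}
```

For instance, a packet carrying AF11 (TOS byte 0x28) that is colored yellow
leaves with AF12 (TOS byte 0x30); the class bits and the ECN bits are
preserved.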


Steps to Set Up the Hierarchy
-----------------------------

The TM hierarchical tree consists of leaf nodes and non-leaf nodes. Each leaf
node sits on top of a scheduling queue of the current Ethernet port. Therefore,
the leaf nodes have predefined IDs in the range of 0 to (N-1), where N is the
number of scheduling queues of the current Ethernet port. The non-leaf nodes
have their IDs generated by the application outside of the above range, which
is reserved for leaf nodes.
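
One possible way for an application to honor this numbering is shown below.
The helper names are hypothetical, and placing non-leaf IDs immediately after
the leaf range is only one of many valid choices. The sketch also keeps the
value ``UINT32_MAX`` free, on the assumption that it is used as the "no node"
marker.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Leaf node IDs are exactly the scheduling queue indices 0 .. N-1. */
bool node_is_leaf(uint32_t node_id, uint32_t n_queues)
{
    assert(n_queues > 0);
    return node_id < n_queues;
}

/*
 * Application-chosen ID for the index-th non-leaf node: the first ID after
 * the leaf range, plus the index. UINT32_MAX is never returned.
 */
uint32_t nonleaf_node_id(uint32_t n_queues, uint32_t index)
{
    assert(index < UINT32_MAX - n_queues);
    return n_queues + index;
}
```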

Each non-leaf node has multiple inputs (its children nodes) and a single output
(which is input to its parent node). It arbitrates its inputs using Strict
Priority (SP) and Weighted Fair Queuing (WFQ) algorithms to schedule input
packets to its output while observing its shaping (rate limiting) constraints.

The children nodes with different priorities are scheduled using the SP
algorithm based on their priority, with 0 as the highest priority. Children
with the same priority are scheduled using the WFQ algorithm according to their
weights. The WFQ weight of a given child node is relative to the sum of the
weights of all its sibling nodes that have the same priority, with 1 as the
lowest weight. For each SP priority, the WFQ weight mode can be set as either
byte-based or packet-based.
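
The interaction of SP and WFQ at one parent can be expressed as the share of
the parent output that each child receives. The sketch below adds one
assumption that the text does not state: only children that currently have
packets waiting (``backlogged``) take part in the arbitration. The structure
and function names are illustrative.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

struct tm_child {
    uint32_t priority; /* 0 is the highest priority */
    uint32_t weight;   /* WFQ weight, 1 is the lowest */
    bool backlogged;   /* assumption: true when packets are waiting */
};

/*
 * Fraction of the parent output given to child idx: zero unless the child
 * belongs to the highest-priority backlogged group, otherwise its weight
 * divided by the sum of the weights in that group.
 */
double child_share(const struct tm_child *c, size_t n, size_t idx)
{
    uint32_t best = UINT32_MAX;
    uint64_t weight_sum = 0;

    assert(c != NULL && idx < n);
    for (size_t i = 0; i < n; i++)
        if (c[i].backlogged && c[i].priority < best)
            best = c[i].priority; /* SP: lowest number wins */
    if (!c[idx].backlogged || c[idx].priority != best)
        return 0.0;
    for (size_t i = 0; i < n; i++)
        if (c[i].backlogged && c[i].priority == best)
            weight_sum += c[i].weight; /* WFQ group of equal priority */
    return (double)c[idx].weight / (double)weight_sum;
}
```

With one child at priority 0 and two children at priority 1 with weights 1 and
3, the priority 0 child takes the whole output while it is backlogged; once it
goes idle, the other two split the output 25% to 75%.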


Initial Hierarchy Specification
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The hierarchy is specified by incrementally adding nodes to build up the
scheduling tree. The first node that is added to the hierarchy becomes the root
node and all the nodes that are subsequently added have to be added as
descendants of the root node. The parent of the root node has to be specified
as RTE_TM_NODE_ID_NULL and there can only be one node with this parent ID
(i.e. the root node). The unique ID that is assigned to each node when the node
is created is further used to update the node configuration or to connect
children nodes to it.

During this phase, some limited checks on the hierarchy specification can be
conducted, usually limited in scope to the current node, its parent node and
its sibling nodes. At this time, since the hierarchy is not fully defined,
there is typically no real action performed by the underlying implementation.
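
The kind of local checks described above can be modeled with a toy in-memory
hierarchy. Everything in this sketch is hypothetical: ``NODE_ID_NULL`` is a
stand-in for ``RTE_TM_NODE_ID_NULL``, and ``hierarchy_node_add()`` only mimics
the bookkeeping an implementation might do; it is not the DPDK node add
function.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define NODE_ID_NULL UINT32_MAX /* stand-in for RTE_TM_NODE_ID_NULL */
#define MAX_NODES 16            /* illustrative limit */

struct tm_node {
    uint32_t id;
    uint32_t parent_id;
};

struct tm_hierarchy {
    struct tm_node nodes[MAX_NODES];
    size_t n_nodes;
};

/* Return the index of the node with the given ID, or -1 if absent. */
int hierarchy_find(const struct tm_hierarchy *h, uint32_t id)
{
    assert(h != NULL);
    for (size_t i = 0; i < h->n_nodes; i++)
        if (h->nodes[i].id == id)
            return (int)i;
    return -1;
}

/* Add a node after checks limited to the node itself and its parent. */
int hierarchy_node_add(struct tm_hierarchy *h, uint32_t id,
                       uint32_t parent_id)
{
    assert(h != NULL);
    if (h->n_nodes == MAX_NODES || id == NODE_ID_NULL)
        return -1;
    if (hierarchy_find(h, id) >= 0)
        return -1; /* node IDs must be unique */
    if (h->n_nodes == 0) {
        if (parent_id != NODE_ID_NULL)
            return -1; /* the first node has to be the root */
    } else if (parent_id == NODE_ID_NULL ||
               hierarchy_find(h, parent_id) < 0) {
        return -1; /* only one root; the parent must already exist */
    }
    h->nodes[h->n_nodes++] = (struct tm_node){ id, parent_id };
    return 0;
}
```

Checks that need the complete tree, such as per-level node counts, are left
to the commit step.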


Hierarchy Commit
~~~~~~~~~~~~~~~~

The hierarchy commit API is called during the port initialization phase (before
the Ethernet port is started) to freeze the start-up hierarchy. This function
typically performs the following steps:

* It validates the start-up hierarchy that was previously defined for the
  current port through successive node add API invocations.
* Assuming successful validation, it performs all the necessary implementation
  specific operations to install the specified hierarchy on the current port,
  with immediate effect once the port is started.

This function fails when the currently configured hierarchy is not supported by
the Ethernet port, in which case the user can abort or try out another
hierarchy configuration (e.g. a hierarchy with fewer leaf nodes), which can be
built from scratch or by modifying the existing hierarchy configuration. Note
that this function can still fail due to other causes (e.g. not enough memory
available in the system, etc.), even though the specified hierarchy is
supported in principle by the current port.
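
The "abort or try another configuration" decision can be sketched as a retry
loop. Both functions below are hypothetical models: ``model_commit()`` reduces
validation to a single leaf-count limit, and halving the leaf count on failure
is just one possible fallback policy. In a real application the commit step
would be the hierarchy commit function of the TM API, called after rebuilding
the hierarchy.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical limit standing in for what the port can support. */
struct port_limits {
    uint32_t n_leaf_max;
};

/* Toy model of a commit: 0 on success, -1 when validation fails. */
int model_commit(uint32_t n_leaves, const struct port_limits *lim)
{
    assert(lim != NULL);
    return (n_leaves >= 1 && n_leaves <= lim->n_leaf_max) ? 0 : -1;
}

/*
 * Try the requested leaf count first and halve it after each failed
 * commit. Returns the committed leaf count, or 0 when giving up.
 */
uint32_t commit_with_fallback(uint32_t n_leaves,
                              const struct port_limits *lim)
{
    assert(lim != NULL);
    while (n_leaves > 0) {
        if (model_commit(n_leaves, lim) == 0)
            return n_leaves;
        n_leaves /= 2; /* retry with a smaller hierarchy */
    }
    return 0;
}
```

A return value of zero corresponds to the abort case, where no acceptable
hierarchy was found.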


Run-Time Hierarchy Updates
~~~~~~~~~~~~~~~~~~~~~~~~~~

The TM API provides support for on-the-fly changes to the scheduling hierarchy,
so operations such as node add/delete, node suspend/resume, parent node
update, etc., can be invoked after the Ethernet port has been started, subject
to the specific implementation supporting them. The set of dynamic updates
supported by the implementation is advertised through the port capability set.
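
Before issuing a run-time update, an application would typically test the
advertised set. The sketch below models that set as a bit mask; the flag names
and the ``tm_port_caps`` structure are hypothetical and chosen only to mirror
the operations listed above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical flags, one per kind of run-time update. */
enum tm_update {
    TM_UPDATE_NODE_ADD_DELETE     = 1u << 0,
    TM_UPDATE_NODE_SUSPEND_RESUME = 1u << 1,
    TM_UPDATE_NODE_PARENT         = 1u << 2,
};

/* Hypothetical slice of the port capability set. */
struct tm_port_caps {
    uint64_t dynamic_update_mask; /* OR of supported tm_update flags */
};

/* True when every update type in wanted is advertised by the port. */
bool tm_update_supported(const struct tm_port_caps *cap, uint64_t wanted)
{
    assert(cap != NULL);
    return (cap->dynamic_update_mask & wanted) == wanted;
}
```

An application that needs an unsupported update can fall back to stopping the
port and rebuilding the hierarchy before the next commit.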