1 ..  BSD LICENSE
2     Copyright(c) 2010-2014 Intel Corporation. All rights reserved.
3     All rights reserved.
4
5     Redistribution and use in source and binary forms, with or without
6     modification, are permitted provided that the following conditions
7     are met:
8
9     * Redistributions of source code must retain the above copyright
10     notice, this list of conditions and the following disclaimer.
11     * Redistributions in binary form must reproduce the above copyright
12     notice, this list of conditions and the following disclaimer in
13     the documentation and/or other materials provided with the
14     distribution.
15     * Neither the name of Intel Corporation nor the names of its
16     contributors may be used to endorse or promote products derived
17     from this software without specific prior written permission.
18
19     THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
20     "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
21     LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
22     A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
23     OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
24     SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
25     LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
26     DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
27     THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
28     (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
29     OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
30
31 Quality of Service (QoS) Framework
32 ==================================
33
34 This chapter describes the DPDK Quality of Service (QoS) framework.
35
36 Packet Pipeline with QoS Support
37 --------------------------------
38
39 An example of a complex packet processing pipeline with QoS support is shown in the following figure.
40
41 .. _pg_figure_21:
42
43 **Figure 21. Complex Packet Processing Pipeline with QoS Support**
44
45 .. image47_png has been renamed
46
47 |pkt_proc_pipeline_qos|
48
49 This pipeline can be built using reusable DPDK software libraries.
50 The main blocks implementing QoS in this pipeline are: the policer, the dropper and the scheduler.
51 A functional description of each block is provided in the following table.
52
53 .. _pg_table_1:
54
55 **Table 1. Packet Processing Pipeline Implementing QoS**
56
57 +---+------------------------+--------------------------------------------------------------------------------+
58 | # | Block                  | Functional Description                                                         |
59 |   |                        |                                                                                |
60 +===+========================+================================================================================+
| 1 | Packet I/O RX & TX     | Packet reception/transmission from/to multiple NIC ports. Poll mode drivers    |
62 |   |                        | (PMDs) for Intel 1 GbE/10 GbE NICs.                                            |
63 |   |                        |                                                                                |
64 +---+------------------------+--------------------------------------------------------------------------------+
65 | 2 | Packet parser          | Identify the protocol stack of the input packet. Check the integrity of the    |
66 |   |                        | packet headers.                                                                |
67 |   |                        |                                                                                |
68 +---+------------------------+--------------------------------------------------------------------------------+
69 | 3 | Flow classification    | Map the input packet to one of the known traffic flows. Exact match table      |
70 |   |                        | lookup using configurable hash function (jhash, CRC and so on) and bucket      |
71 |   |                        | logic to handle collisions.                                                    |
72 |   |                        |                                                                                |
73 +---+------------------------+--------------------------------------------------------------------------------+
| 4 | Policer                | Packet metering using srTCM (RFC 2697) or trTCM (RFC 2698) algorithms.         |
75 |   |                        |                                                                                |
76 +---+------------------------+--------------------------------------------------------------------------------+
77 | 5 | Load Balancer          | Distribute the input packets to the application workers. Provide uniform load  |
78 |   |                        | to each worker. Preserve the affinity of traffic flows to workers and the      |
79 |   |                        | packet order within each flow.                                                 |
80 |   |                        |                                                                                |
81 +---+------------------------+--------------------------------------------------------------------------------+
82 | 6 | Worker threads         | Placeholders for the customer specific application workload (for example, IP   |
83 |   |                        | stack and so on).                                                              |
84 |   |                        |                                                                                |
85 +---+------------------------+--------------------------------------------------------------------------------+
86 | 7 | Dropper                | Congestion management using the Random Early Detection (RED) algorithm         |
87 |   |                        | (specified by the Sally Floyd - Van Jacobson paper) or Weighted RED (WRED).    |
88 |   |                        | Drop packets based on the current scheduler queue load level and packet        |
89 |   |                        | priority. When congestion is experienced, lower priority packets are dropped   |
90 |   |                        | first.                                                                         |
91 |   |                        |                                                                                |
92 +---+------------------------+--------------------------------------------------------------------------------+
93 | 8 | Hierarchical Scheduler | 5-level hierarchical scheduler (levels are: output port, subport, pipe,        |
94 |   |                        | traffic class and queue) with thousands (typically 64K) leaf nodes (queues).   |
95 |   |                        | Implements traffic shaping (for subport and pipe levels), strict priority      |
96 |   |                        | (for traffic class level) and Weighted Round Robin (WRR) (for queues within    |
97 |   |                        | each pipe traffic class).                                                      |
98 |   |                        |                                                                                |
99 +---+------------------------+--------------------------------------------------------------------------------+
100
101 The infrastructure blocks used throughout the packet processing pipeline are listed in the following table.
102
103 .. _pg_table_2:
104
105 **Table 2. Infrastructure Blocks Used by the Packet Processing Pipeline**
106
107 +---+-----------------------+-----------------------------------------------------------------------+
108 | # | Block                 | Functional Description                                                |
109 |   |                       |                                                                       |
110 +===+=======================+=======================================================================+
111 | 1 | Buffer manager        | Support for global buffer pools and private per-thread buffer caches. |
112 |   |                       |                                                                       |
113 +---+-----------------------+-----------------------------------------------------------------------+
114 | 2 | Queue manager         | Support for message passing between pipeline blocks.                  |
115 |   |                       |                                                                       |
116 +---+-----------------------+-----------------------------------------------------------------------+
117 | 3 | Power saving          | Support for power saving during low activity periods.                 |
118 |   |                       |                                                                       |
119 +---+-----------------------+-----------------------------------------------------------------------+
120
121 The mapping of pipeline blocks to CPU cores is configurable based on the performance level required by each specific application
122 and the set of features enabled for each block.
123 Some blocks might consume more than one CPU core (with each CPU core running a different instance of the same block on different input packets),
124 while several other blocks could be mapped to the same CPU core.
125
126 Hierarchical Scheduler
127 ----------------------
128
129 The hierarchical scheduler block, when present, usually sits on the TX side just before the transmission stage.
130 Its purpose is to prioritize the transmission of packets from different users and different traffic classes
131 according to the policy specified by the Service Level Agreements (SLAs) of each network node.
132
133 Overview
134 ~~~~~~~~
135
136 The hierarchical scheduler block is similar to the traffic manager block used by network processors
137 that typically implement per flow (or per group of flows) packet queuing and scheduling.
It typically acts like a buffer that is able to temporarily store a large number of packets just before their transmission (enqueue operation);
as the NIC TX requests more packets for transmission,
these packets are later removed and handed over to the NIC TX, with the packet selection logic observing the predefined SLAs (dequeue operation).
141
142 .. _pg_figure_22:
143
144 **Figure 22. Hierarchical Scheduler Block Internal Diagram**
145
146 .. image48_png has been renamed
147
148 |hier_sched_blk|
149
150 The hierarchical scheduler is optimized for a large number of packet queues.
151 When only a small number of queues are needed, message passing queues should be used instead of this block.
152 See Section 26.2.5 "Worst Case Scenarios for Performance" for a more detailed discussion.
153
154 Scheduling Hierarchy
155 ~~~~~~~~~~~~~~~~~~~~
156
157 The scheduling hierarchy is shown in Figure 23.
158 The first level of the hierarchy is the Ethernet TX port 1/10/40 GbE,
159 with subsequent hierarchy levels defined as subport, pipe, traffic class and queue.
160
161 Typically, each subport represents a predefined group of users, while each pipe represents an individual user/subscriber.
162 Each traffic class is the representation of a different traffic type with specific loss rate,
163 delay and jitter requirements, such as voice, video or data transfers.
164 Each queue hosts packets from one or multiple connections of the same type belonging to the same user.
165
166 .. _pg_figure_23:
167
168 **Figure 23. Scheduling Hierarchy per Port**
169
170 .. image49_png has been renamed
171
172 |sched_hier_per_port|
173
174 The functionality of each hierarchical level is detailed in the following table.
175
176 .. _pg_table_3:
177
178 **Table 3. Port Scheduling Hierarchy**
179
180 +---+--------------------+----------------------------+---------------------------------------------------------------+
181 | # | Level              | Siblings per Parent        | Functional Description                                        |
182 |   |                    |                            |                                                               |
183 +===+====================+============================+===============================================================+
184 | 1 | Port               | -                          | #.  Output Ethernet port 1/10/40 GbE.                         |
185 |   |                    |                            |                                                               |
186 |   |                    |                            | #.  Multiple ports are scheduled in round robin order with    |
187 |   |                    |                            |     all ports having equal priority.                          |
188 |   |                    |                            |                                                               |
189 +---+--------------------+----------------------------+---------------------------------------------------------------+
190 | 2 | Subport            | Configurable (default: 8)  | #.  Traffic shaping using token bucket algorithm (one token   |
191 |   |                    |                            |     bucket per subport).                                      |
192 |   |                    |                            |                                                               |
193 |   |                    |                            | #.  Upper limit enforced per Traffic Class (TC) at the        |
194 |   |                    |                            |     subport level.                                            |
195 |   |                    |                            |                                                               |
196 |   |                    |                            | #.  Lower priority TCs able to reuse subport bandwidth        |
197 |   |                    |                            |     currently unused by higher priority TCs.                  |
198 |   |                    |                            |                                                               |
199 +---+--------------------+----------------------------+---------------------------------------------------------------+
200 | 3 | Pipe               | Configurable (default: 4K) | #.  Traffic shaping using the token bucket algorithm (one     |
|   |                    |                            |     token bucket per pipe).                                   |
202 |   |                    |                            |                                                               |
203 +---+--------------------+----------------------------+---------------------------------------------------------------+
204 | 4 | Traffic Class (TC) | 4                          | #.  TCs of the same pipe handled in strict priority order.    |
205 |   |                    |                            |                                                               |
206 |   |                    |                            | #.  Upper limit enforced per TC at the pipe level.            |
207 |   |                    |                            |                                                               |
208 |   |                    |                            | #.  Lower priority TCs able to reuse pipe bandwidth currently |
209 |   |                    |                            |     unused by higher priority TCs.                            |
210 |   |                    |                            |                                                               |
211 |   |                    |                            | #.  When subport TC is oversubscribed (configuration time     |
212 |   |                    |                            |     event), pipe TC upper limit is capped to a dynamically    |
213 |   |                    |                            |     adjusted value that is shared by all the subport pipes.   |
214 |   |                    |                            |                                                               |
215 +---+--------------------+----------------------------+---------------------------------------------------------------+
216 | 5 | Queue              | 4                          | #.  Queues of the same TC are serviced using Weighted Round   |
217 |   |                    |                            |     Robin (WRR) according to predefined weights.              |
218 |   |                    |                            |                                                               |
219 +---+--------------------+----------------------------+---------------------------------------------------------------+
220
221 Application Programming Interface (API)
222 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
223
224 Port Scheduler Configuration API
225 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
226
227 The rte_sched.h file contains configuration functions for port, subport and pipe.
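
The configuration sequence creates the port first and then configures its subports and pipes, as sketched below.
This is a minimal sketch rather than a reference: the structure fields shown are an abridged subset of those defined in rte_sched.h,
the numeric values are placeholders, and *pipe_profiles* and *subport_params* are assumed to be defined elsewhere.

.. code-block:: c

    /* Illustrative sketch only: abridged fields, placeholder values.
     * pipe_profiles[] and subport_params are assumed to be defined elsewhere. */
    static struct rte_sched_port_params port_params = {
        .name = "port_scheduler_0",
        .socket = 0,                      /* NUMA socket of the scheduler memory */
        .rate = 1250000000,               /* 10 GbE line rate, in bytes per second */
        .mtu = 1522,
        .frame_overhead = 24,             /* preamble + SFD + FCS + IFG */
        .n_subports_per_port = 1,
        .n_pipes_per_subport = 4096,
        .qsize = {64, 64, 64, 64},        /* queue size per TC, in packets */
        .pipe_profiles = pipe_profiles,
        .n_pipe_profiles = 1,
    };

    struct rte_sched_port *port;
    uint32_t subport, pipe;

    port = rte_sched_port_config(&port_params);
    if (port == NULL)
        rte_exit(EXIT_FAILURE, "Port scheduler configuration failed\n");

    for (subport = 0; subport < port_params.n_subports_per_port; subport++) {
        if (rte_sched_subport_config(port, subport, &subport_params) != 0)
            rte_exit(EXIT_FAILURE, "Subport %u configuration failed\n", subport);

        for (pipe = 0; pipe < port_params.n_pipes_per_subport; pipe++)
            if (rte_sched_pipe_config(port, subport, pipe, 0) != 0)
                rte_exit(EXIT_FAILURE, "Pipe %u configuration failed\n", pipe);
    }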
228
229 Port Scheduler Enqueue API
230 ^^^^^^^^^^^^^^^^^^^^^^^^^^
231
232 The port scheduler enqueue API is very similar to the API of the DPDK PMD TX function.
233
234 .. code-block:: c
235
236     int rte_sched_port_enqueue(struct rte_sched_port *port, struct rte_mbuf **pkts, uint32_t n_pkts);
237
238 Port Scheduler Dequeue API
239 ^^^^^^^^^^^^^^^^^^^^^^^^^^
240
241 The port scheduler dequeue API is very similar to the API of the DPDK PMD RX function.
242
243 .. code-block:: c
244
245     int rte_sched_port_dequeue(struct rte_sched_port *port, struct rte_mbuf **pkts, uint32_t n_pkts);
246
247 Usage Example
248 ^^^^^^^^^^^^^
249
250 .. code-block:: c
251
252     /* File "application.c" */
253
254     #define N_PKTS_RX   64
255     #define N_PKTS_TX   48
256     #define NIC_RX_PORT 0
257     #define NIC_RX_QUEUE 0
258     #define NIC_TX_PORT 1
259     #define NIC_TX_QUEUE 0
260
261     struct rte_sched_port *port = NULL;
262     struct rte_mbuf *pkts_rx[N_PKTS_RX], *pkts_tx[N_PKTS_TX];
263     uint32_t n_pkts_rx, n_pkts_tx;
264
265     /* Initialization */
266
267     <initialization code>
268
269     /* Runtime */
270     while (1) {
271         /* Read packets from NIC RX queue */
272
273         n_pkts_rx = rte_eth_rx_burst(NIC_RX_PORT, NIC_RX_QUEUE, pkts_rx, N_PKTS_RX);
274
275         /* Hierarchical scheduler enqueue */
276
277         rte_sched_port_enqueue(port, pkts_rx, n_pkts_rx);
278
279         /* Hierarchical scheduler dequeue */
280
281         n_pkts_tx = rte_sched_port_dequeue(port, pkts_tx, N_PKTS_TX);
282
283         /* Write packets to NIC TX queue */
284
285         rte_eth_tx_burst(NIC_TX_PORT, NIC_TX_QUEUE, pkts_tx, n_pkts_tx);
286     }
287
288 Implementation
289 ~~~~~~~~~~~~~~
290
291 Internal Data Structures per Port
292 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
293
A schematic of the internal data structures is shown in Figure 24, with details in Table 4.
295
296 .. _pg_figure_24:
297
298 **Figure 24. Internal Data Structures per Port**
299
300 .. image50_png has been renamed
301
302 |data_struct_per_port|
303
304 .. _pg_table_4:
305
306 **Table 4. Scheduler Internal Data Structures per Port**
307
308 +---+----------------------+-------------------------+---------------------+------------------------------+---------------------------------------------------+
309 | # | Data structure       | Size (bytes)            | # per port          | Access type                  | Description                                       |
310 |   |                      |                         |                     |                              |                                                   |
311 |   |                      |                         |                     +-------------+----------------+---------------------------------------------------+
312 |   |                      |                         |                     | Enq         | Deq            |                                                   |
313 |   |                      |                         |                     |             |                |                                                   |
314 +===+======================+=========================+=====================+=============+================+===================================================+
315 | 1 | Subport table entry  | 64                      | # subports per port | -           | Rd, Wr         | Persistent subport data (credits, etc).           |
316 |   |                      |                         |                     |             |                |                                                   |
317 +---+----------------------+-------------------------+---------------------+-------------+----------------+---------------------------------------------------+
318 | 2 | Pipe table entry     | 64                      | # pipes per port    | -           | Rd, Wr         | Persistent data for pipe, its TCs and its queues  |
319 |   |                      |                         |                     |             |                | (credits, etc) that is updated during run-time.   |
320 |   |                      |                         |                     |             |                |                                                   |
321 |   |                      |                         |                     |             |                | The pipe configuration parameters do not change   |
322 |   |                      |                         |                     |             |                | during run-time. The same pipe configuration      |
323 |   |                      |                         |                     |             |                | parameters are shared by multiple pipes,          |
324 |   |                      |                         |                     |             |                | therefore they are not part of pipe table entry.  |
325 |   |                      |                         |                     |             |                |                                                   |
326 +---+----------------------+-------------------------+---------------------+-------------+----------------+---------------------------------------------------+
327 | 3 | Queue table entry    | 4                       | #queues per port    | Rd, Wr      | Rd, Wr         | Persistent queue data (read and write pointers).  |
328 |   |                      |                         |                     |             |                | The queue size is the same per TC for all queues, |
329 |   |                      |                         |                     |             |                | allowing the queue base address to be computed    |
330 |   |                      |                         |                     |             |                | using a fast formula, so these two parameters are |
331 |   |                      |                         |                     |             |                | not part of queue table entry.                    |
332 |   |                      |                         |                     |             |                |                                                   |
333 |   |                      |                         |                     |             |                | The queue table entries for any given pipe are    |
334 |   |                      |                         |                     |             |                | stored in the same cache line.                    |
335 |   |                      |                         |                     |             |                |                                                   |
336 +---+----------------------+-------------------------+---------------------+-------------+----------------+---------------------------------------------------+
| 4 | Queue storage area   | Config (default: 64 x 8)| # queues per port   | Wr          | Rd             | Array of elements per queue; each element is 8    |
|   |                      |                         |                     |             |                | bytes in size (mbuf pointer).                     |
339 |   |                      |                         |                     |             |                |                                                   |
340 +---+----------------------+-------------------------+---------------------+-------------+----------------+---------------------------------------------------+
341 | 5 | Active queues bitmap | 1 bit per queue         | 1                   | Wr (Set)    | Rd, Wr (Clear) | The bitmap maintains one status bit per queue:    |
342 |   |                      |                         |                     |             |                | queue not active (queue is empty) or queue active |
343 |   |                      |                         |                     |             |                | (queue is not empty).                             |
344 |   |                      |                         |                     |             |                |                                                   |
345 |   |                      |                         |                     |             |                | Queue bit is set by the scheduler enqueue and     |
346 |   |                      |                         |                     |             |                | cleared by the scheduler dequeue when queue       |
347 |   |                      |                         |                     |             |                | becomes empty.                                    |
348 |   |                      |                         |                     |             |                |                                                   |
349 |   |                      |                         |                     |             |                | Bitmap scan operation returns the next non-empty  |
|   |                      |                         |                     |             |                | pipe and its status (16-bit mask of active queues |
351 |   |                      |                         |                     |             |                | in the pipe).                                     |
352 |   |                      |                         |                     |             |                |                                                   |
353 +---+----------------------+-------------------------+---------------------+-------------+----------------+---------------------------------------------------+
354 | 6 | Grinder              | ~128                    | Config (default: 8) | -           | Rd, Wr         | Short list of active pipes currently under        |
355 |   |                      |                         |                     |             |                | processing. The grinder contains temporary data   |
356 |   |                      |                         |                     |             |                | during pipe processing.                           |
357 |   |                      |                         |                     |             |                |                                                   |
358 |   |                      |                         |                     |             |                | Once the current pipe exhausts packets or         |
359 |   |                      |                         |                     |             |                | credits, it is replaced with another active pipe  |
360 |   |                      |                         |                     |             |                | from the bitmap.                                  |
361 |   |                      |                         |                     |             |                |                                                   |
362 +---+----------------------+-------------------------+---------------------+-------------+----------------+---------------------------------------------------+
363
364 Multicore Scaling Strategy
365 ^^^^^^^^^^^^^^^^^^^^^^^^^^
366
367 The multicore scaling strategy is:
368
369 #.  Running different physical ports on different threads. The enqueue and dequeue of the same port are run by the same thread.
370
371 #.  Splitting the same physical port to different threads by running different sets of subports of the same physical port (virtual ports) on different threads.
372     Similarly, a subport can be split into multiple subports that are each run by a different thread.
373     The enqueue and dequeue of the same port are run by the same thread.
374     This is only required if, for performance reasons, it is not possible to handle a full port with a single core.
375
376 Enqueue and Dequeue for the Same Output Port
377 """"""""""""""""""""""""""""""""""""""""""""
378
Running enqueue and dequeue operations for the same output port from different cores is likely to have a significant negative impact on the scheduler's performance
and is therefore not recommended.
381
382 The port enqueue and dequeue operations share access to the following data structures:
383
384 #.  Packet descriptors
385
386 #.  Queue table
387
388 #.  Queue storage area
389
390 #.  Bitmap of active queues
391
392 The expected drop in performance is due to:
393
#.  Need to make the queue and bitmap operations thread safe,
    which requires either using locking primitives for access serialization (for example, spinlocks/semaphores) or
    using atomic primitives for lockless access (for example, Test and Set, Compare and Swap, and so on).
    The impact is much higher in the former case.
398
399 #.  Ping-pong of cache lines storing the shared data structures between the cache hierarchies of the two cores
400     (done transparently by the MESI protocol cache coherency CPU hardware).
401
402 Therefore, the scheduler enqueue and dequeue operations have to be run from the same thread,
403 which allows the queues and the bitmap operations to be non-thread safe and
404 keeps the scheduler data structures internal to the same core.
405
406 Performance Scaling
407 """""""""""""""""""
408
409 Scaling up the number of NIC ports simply requires a proportional increase in the number of CPU cores to be used for traffic scheduling.
410
411 Enqueue Pipeline
412 ^^^^^^^^^^^^^^^^
413
414 The sequence of steps per packet:
415
416 #.  *Access* the mbuf to read the data fields required to identify the destination queue for the packet.
417     These fields are: port, subport, traffic class and queue within traffic class, and are typically set by the classification stage.
418
419 #.  *Access* the queue structure to identify the write location in the queue array.
420     If the queue is full, then the packet is discarded.
421
422 #.  *Access* the queue array location to store the packet (i.e. write the mbuf pointer).
423
Note the strong data dependency between these steps, as steps 2 and 3 cannot start before the result from steps 1 and 2 becomes available,
which prevents the processor's out-of-order execution engine from providing any significant performance benefit.
426
Given the high rate of input packets and the large number of queues,
it is expected that the data structures accessed to enqueue the current packet are not present
in the L1 or L2 data cache of the current core, so the above three memory accesses would (on average) result in L1 and L2 data cache misses.
Three L1/L2 cache misses per packet is not acceptable for performance reasons.
431
432 The workaround is to prefetch the required data structures in advance. The prefetch operation has an execution latency during which
433 the processor should not attempt to access the data structure currently under prefetch, so the processor should execute other work.
434 The only other work available is to execute different stages of the enqueue sequence of operations on other input packets,
435 thus resulting in a pipelined implementation for the enqueue operation.
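
In simplified pseudo-C form, the prefetch pipeline can be sketched as below.
This is an illustration of the mechanism rather than the actual rte_sched code:
the helper functions are hypothetical and, for clarity, each stage handles one packet instead of two.

.. code-block:: c

    /* Illustrative sketch of the 4-stage enqueue prefetch pipeline.
     * The helper functions are hypothetical placeholders. */
    for (i = 0; i < n_pkts; i++) {
        /* Stage 0: prefetch the mbuf of packet i */
        rte_prefetch0(pkts[i]);

        /* Stage 1: read the scheduler fields of packet i-1 (now in cache)
         * and prefetch its queue table entry */
        if (i >= 1)
            prefetch_queue(port, pkts[i - 1]);

        /* Stage 2: read the queue table entry of packet i-2 and prefetch the
         * queue array location where its mbuf pointer will be written */
        if (i >= 2)
            prefetch_queue_array_slot(port, pkts[i - 2]);

        /* Stage 3: write the mbuf pointer of packet i-3 into its queue,
         * or drop the packet if the queue is full */
        if (i >= 3)
            write_packet_or_drop(port, pkts[i - 3]);
    }

    /* Drain: run the remaining stages for the last three packets (not shown). */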
436
437 Figure 25 illustrates a pipelined implementation for the enqueue operation with 4 pipeline stages and each stage executing 2 different input packets.
438 No input packet can be part of more than one pipeline stage at a given time.
439
440 .. _pg_figure_25:
441
442 **Figure 25. Prefetch Pipeline for the Hierarchical Scheduler Enqueue Operation**
443
444 .. image51 has been renamed
445
446 |prefetch_pipeline|
447
448 The congestion management scheme implemented by the enqueue pipeline described above is very basic:
449 packets are enqueued until a specific queue becomes full,
450 then all the packets destined to the same queue are dropped until packets are consumed (by the dequeue operation).
451 This can be improved by enabling RED/WRED as part of the enqueue pipeline which looks at the queue occupancy and
452 packet priority in order to yield the enqueue/drop decision for a specific packet
453 (as opposed to enqueuing all packets / dropping all packets indiscriminately).
454
455 Dequeue State Machine
456 ^^^^^^^^^^^^^^^^^^^^^
457
458 The sequence of steps to schedule the next packet from the current pipe is:
459
460 #.  Identify the next active pipe using the bitmap scan operation, *prefetch* pipe.
461
462 #.  *Read* pipe data structure. Update the credits for the current pipe and its subport.
463     Identify the first active traffic class within the current pipe, select the next queue using WRR,
464     *prefetch* queue pointers for all the 16 queues of the current pipe.
465
466 #.  *Read* next element from the current WRR queue and *prefetch* its packet descriptor.
467
468 #.  *Read* the packet length from the packet descriptor (mbuf structure).
469     Based on the packet length and the available credits (of current pipe, pipe traffic class, subport and subport traffic class),
470     take the go/no go scheduling decision for the current packet.
471
To avoid cache misses, the above data structures (pipe, queue, queue array, mbufs) are prefetched in advance of being accessed.
The strategy for hiding the latency of the prefetch operations is to switch from the current pipe (in grinder A) to another pipe
(in grinder B) immediately after a prefetch is issued for the current pipe.
This gives the prefetch operation enough time to complete before execution switches back to this pipe (in grinder A).
476
The dequeue pipe state machine exploits the presence of data in the processor cache;
therefore, it tries to send as many packets from the same pipe TC and pipe as possible (up to the available packets and credits) before
moving to the next active TC from the same pipe (if any) or to another active pipe.
480
481 .. _pg_figure_26:
482
483 **Figure 26. Pipe Prefetch State Machine for the Hierarchical Scheduler Dequeue Operation**
484
485 .. image52 has been renamed
486
487 |pipe_prefetch_sm|
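
The states of this machine can be summarized as in the sketch below; the state names mirror the four steps listed above,
while the actual implementation in rte_sched.c is more elaborate.

.. code-block:: c

    /* Illustrative summary of the per-grinder dequeue states. */
    enum grinder_state {
        GRINDER_PREFETCH_PIPE,            /* step 1: bitmap scan, prefetch pipe data structure */
        GRINDER_PREFETCH_TC_QUEUE_ARRAYS, /* step 2: update credits, WRR, prefetch queue pointers */
        GRINDER_PREFETCH_MBUF,            /* step 3: read queue element, prefetch packet descriptor */
        GRINDER_READ_MBUF                 /* step 4: read packet length, take go/no go decision */
    };

    /* After issuing a prefetch in one state, the scheduler switches to another
     * grinder (another pipe) and returns to this grinder later, by which time
     * the prefetched data is expected to be in the cache. */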
488
489 Timing and Synchronization
490 ^^^^^^^^^^^^^^^^^^^^^^^^^^
491
492 The output port is modeled as a conveyor belt of byte slots that need to be filled by the scheduler with data for transmission.
493 For 10 GbE, there are 1.25 billion byte slots that need to be filled by the port scheduler every second.
494 If the scheduler is not fast enough to fill the slots, provided that enough packets and credits exist,
495 then some slots will be left unused and bandwidth will be wasted.
496
497 In principle, the hierarchical scheduler dequeue operation should be triggered by NIC TX.
498 Usually, once the occupancy of the NIC TX input queue drops below a predefined threshold,
499 the port scheduler is woken up (interrupt based or polling based,
500 by continuously monitoring the queue occupancy) to push more packets into the queue.
501
502 Internal Time Reference
503 """""""""""""""""""""""
504
505 The scheduler needs to keep track of time advancement for the credit logic,
506 which requires credit updates based on time (for example, subport and pipe traffic shaping, traffic class upper limit enforcement, and so on).
507
508 Every time the scheduler decides to send a packet out to the NIC TX for transmission, the scheduler will increment its internal time reference accordingly.
509 Therefore, it is convenient to keep the internal time reference in units of bytes,
510 where a byte signifies the time duration required by the physical interface to send out a byte on the transmission medium.
511 This way, as a packet is scheduled for transmission, the time is incremented with (n + h),
512 where n is the packet length in bytes and h is the number of framing overhead bytes per packet.
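
For example, with *h = 24* bytes of Ethernet framing overhead, scheduling a 1500-byte packet advances the internal time reference by 1524 byte times.
A minimal sketch of this update (variable names are illustrative):

.. code-block:: c

    /* Advance the byte-denominated internal time reference when a packet of
     * pkt_len bytes is scheduled; frame_overhead is h (e.g. 24 for Ethernet). */
    port_time += pkt_len + frame_overhead;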
513
514 Internal Time Reference Re-synchronization
515 """"""""""""""""""""""""""""""""""""""""""
516
517 The scheduler needs to align its internal time reference to the pace of the port conveyor belt.
518 The reason is to make sure that the scheduler does not feed the NIC TX with more bytes than the line rate of the physical medium in order to prevent packet drop
519 (by the scheduler, due to the NIC TX input queue being full, or later on, internally by the NIC TX).
520
521 The scheduler reads the current time on every dequeue invocation.
522 The CPU time stamp can be obtained by reading either the Time Stamp Counter (TSC) register or the High Precision Event Timer (HPET) register.
The current CPU time stamp is converted from number of CPU clocks to number of bytes:
*time_bytes = time_cycles / cycles_per_byte*, where *cycles_per_byte*
is the amount of CPU cycles that is equivalent to the transmission time for one byte on the wire
(e.g. for a CPU frequency of 2 GHz and a 10 GbE port, *cycles_per_byte = 1.6*).
527
528 The scheduler maintains an internal time reference of the NIC time.
529 Whenever a packet is scheduled, the NIC time is incremented with the packet length (including framing overhead).
530 On every dequeue invocation, the scheduler checks its internal reference of the NIC time against the current time:
531
532 #. If NIC time is in the future (NIC time >= current time), no adjustment of NIC time is needed.
533    This means that scheduler is able to schedule NIC packets before the NIC actually needs those packets, so the NIC TX is well supplied with packets;
534
535 #. If NIC time is in the past (NIC time < current time), then NIC time should be adjusted by setting it to the current time.
536    This means that the scheduler is not able to keep up with the speed of the NIC byte conveyor belt,
537    so NIC bandwidth is wasted due to poor packet supply to the NIC TX.
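
A sketch of this resynchronization logic, as performed on each dequeue invocation (variable names are illustrative):

.. code-block:: c

    /* Convert the CPU time stamp to byte time units and re-align the
     * internal NIC time reference if the scheduler has fallen behind. */
    uint64_t time_bytes = rte_get_tsc_cycles() / cycles_per_byte;

    if (port_time < time_bytes)
        port_time = time_bytes;   /* scheduler fell behind: jump to current time */
    /* else: the scheduler is ahead of the NIC, no adjustment needed */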
538
539 Scheduler Accuracy and Granularity
540 """"""""""""""""""""""""""""""""""
541
542 The scheduler round trip delay (SRTD) is the time (number of CPU cycles) between two consecutive examinations of the same pipe by the scheduler.
543
544 To keep up with the output port (that is, avoid bandwidth loss),
545 the scheduler should be able to schedule n packets faster than the same n packets are transmitted by NIC TX.
546
The scheduler needs to keep up with the rate of each individual pipe,
as configured for the pipe token bucket, assuming that no port oversubscription is taking place.
This means that the size of the pipe token bucket should be set high enough to prevent it from overflowing due to a big SRTD,
as this would result in credit loss (and therefore bandwidth loss) for the pipe;
in other words, the pipe *tb_size* should be at least as large as the number of credits accumulated at the pipe rate over the worst-case SRTD.
551
552 Credit Logic
553 ^^^^^^^^^^^^
554
555 Scheduling Decision
556 """""""""""""""""""
557
558 The scheduling decision to send next packet from (subport S, pipe P, traffic class TC, queue Q) is favorable (packet is sent)
559 when all the conditions below are met:
560
561 *   Pipe P of subport S is currently selected by one of the port grinders;
562
563 *   Traffic class TC is the highest priority active traffic class of pipe P;
564
565 *   Queue Q is the next queue selected by WRR within traffic class TC of pipe P;
566
567 *   Subport S has enough credits to send the packet;
568
569 *   Subport S has enough credits for traffic class TC to send the packet;
570
571 *   Pipe P has enough credits to send the packet;
572
573 *   Pipe P has enough credits for traffic class TC to send the packet.
574
575 If all the above conditions are met,
576 then the packet is selected for transmission and the necessary credits are subtracted from subport S,
577 subport S traffic class TC, pipe P, pipe P traffic class TC.
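
In code form, this check and the subsequent credit consumption can be sketched as follows (variable names are illustrative, not the actual rte_sched implementation):

.. code-block:: c

    /* Credits needed to send a packet of pkt_len bytes, including framing overhead */
    uint32_t pkt_credits = pkt_len + frame_overhead;

    int enough_credits =
        (pkt_credits <= subport_tb_credits) &&       /* subport token bucket */
        (pkt_credits <= subport_tc_credits[tc]) &&   /* subport TC upper limit */
        (pkt_credits <= pipe_tb_credits) &&          /* pipe token bucket */
        (pkt_credits <= pipe_tc_credits[tc]);        /* pipe TC upper limit */

    if (enough_credits) {
        /* Favorable decision: send the packet and consume the credits */
        subport_tb_credits -= pkt_credits;
        subport_tc_credits[tc] -= pkt_credits;
        pipe_tb_credits -= pkt_credits;
        pipe_tc_credits[tc] -= pkt_credits;
    }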
578
579 Framing Overhead
580 """"""""""""""""
581
582 As the greatest common divisor for all packet lengths is one byte, the unit of credit is selected as one byte.
583 The number of credits required for the transmission of a packet of n bytes is equal to (n+h),
584 where h is equal to the number of framing overhead bytes per packet.
585
586 .. _pg_table_5:
587
588 **Table 5. Ethernet Frame Overhead Fields**
589
590 +---+--------------------------------+----------------+---------------------------------------------------------------------------+
591 | # | Packet field                   | Length (bytes) | Comments                                                                  |
592 |   |                                |                |                                                                           |
593 +===+================================+================+===========================================================================+
594 | 1 | Preamble                       | 7              |                                                                           |
595 |   |                                |                |                                                                           |
596 +---+--------------------------------+----------------+---------------------------------------------------------------------------+
597 | 2 | Start of Frame Delimiter (SFD) | 1              |                                                                           |
598 |   |                                |                |                                                                           |
599 +---+--------------------------------+----------------+---------------------------------------------------------------------------+
600 | 3 | Frame Check Sequence (FCS)     | 4              | Considered overhead only if not included in the mbuf packet length field. |
601 |   |                                |                |                                                                           |
602 +---+--------------------------------+----------------+---------------------------------------------------------------------------+
603 | 4 | Inter Frame Gap (IFG)          | 12             |                                                                           |
604 |   |                                |                |                                                                           |
605 +---+--------------------------------+----------------+---------------------------------------------------------------------------+
606 | 5 | Total                          | 24             |                                                                           |
607 |   |                                |                |                                                                           |
608 +---+--------------------------------+----------------+---------------------------------------------------------------------------+
609
610 Traffic Shaping
611 """""""""""""""
612
613 The traffic shaping for subport and pipe is implemented using a token bucket per subport/per pipe.
614 Each token bucket is implemented using one saturated counter that keeps track of the number of available credits.
615
616 The token bucket generic parameters and operations are presented in Table 6 and Table 7.
617
618 .. _pg_table_6:
619
**Table 6. Token Bucket Generic Parameters**
621
622 +---+------------------------+--------------------+---------------------------------------------------------+
623 | # | Token Bucket Parameter | Unit               | Description                                             |
624 |   |                        |                    |                                                         |
625 +===+========================+====================+=========================================================+
626 | 1 | bucket_rate            | Credits per second | Rate of adding credits to the bucket.                   |
627 |   |                        |                    |                                                         |
628 +---+------------------------+--------------------+---------------------------------------------------------+
629 | 2 | bucket_size            | Credits            | Max number of credits that can be stored in the bucket. |
630 |   |                        |                    |                                                         |
631 +---+------------------------+--------------------+---------------------------------------------------------+
632
633 .. _pg_table_7:
634
**Table 7. Token Bucket Generic Operations**
636
637 +---+------------------------+------------------------------------------------------------------------------+
638 | # | Token Bucket Operation | Description                                                                  |
639 |   |                        |                                                                              |
640 +===+========================+==============================================================================+
641 | 1 | Initialization         | Bucket set to a predefined value, e.g. zero or half of the bucket size.      |
642 |   |                        |                                                                              |
643 +---+------------------------+------------------------------------------------------------------------------+
644 | 2 | Credit update          | Credits are added to the bucket on top of existing ones, either periodically |
645 |   |                        | or on demand, based on the bucket_rate. Credits cannot exceed the upper      |
646 |   |                        | limit defined by the bucket_size, so any credits to be added to the bucket   |
647 |   |                        | while the bucket is full are dropped.                                        |
648 |   |                        |                                                                              |
649 +---+------------------------+------------------------------------------------------------------------------+
650 | 3 | Credit consumption     | As result of packet scheduling, the necessary number of credits is removed   |
651 |   |                        | from the bucket. The packet can only be sent if enough credits are in the    |
652 |   |                        | bucket to send the full packet (packet bytes and framing overhead for the    |
653 |   |                        | packet).                                                                     |
654 |   |                        |                                                                              |
655 +---+------------------------+------------------------------------------------------------------------------+
656
To implement the token bucket generic operations described above,
the current design uses the persistent data structure presented in Table 8,
while the implementation of the token bucket operations is described in Table 9.
660
661 .. _pg_table_8:
662
663 **Table 8. Token Bucket Persistent Data Structure**
664
665 +---+------------------------+-------+----------------------------------------------------------------------+
666 | # | Token bucket field     | Unit  | Description                                                          |
667 |   |                        |       |                                                                      |
668 +===+========================+=======+======================================================================+
669 | 1 | tb_time                | Bytes | Time of the last credit update. Measured in bytes instead of seconds |
670 |   |                        |       | or CPU cycles for ease of credit consumption operation               |
671 |   |                        |       | (as the current time is also maintained in bytes).                   |
672 |   |                        |       |                                                                      |
673 |   |                        |       | See  Section 26.2.4.5.1 "Internal Time Reference" for an             |
674 |   |                        |       | explanation of why the time is maintained in byte units.             |
675 |   |                        |       |                                                                      |
676 +---+------------------------+-------+----------------------------------------------------------------------+
677 | 2 | tb_period              | Bytes | Time period that should elapse since the last credit update in order |
|   |                        |       | for the bucket to be awarded tb_credits_per_period worth of credits. |
679 |   |                        |       |                                                                      |
680 +---+------------------------+-------+----------------------------------------------------------------------+
681 | 3 | tb_credits_per_period  | Bytes | Credit allowance per tb_period.                                      |
682 |   |                        |       |                                                                      |
683 +---+------------------------+-------+----------------------------------------------------------------------+
684 | 4 | tb_size                | Bytes | Bucket size, i.e. upper limit for the tb_credits.                    |
685 |   |                        |       |                                                                      |
686 +---+------------------------+-------+----------------------------------------------------------------------+
687 | 5 | tb_credits             | Bytes | Number of credits currently in the bucket.                           |
688 |   |                        |       |                                                                      |
689 +---+------------------------+-------+----------------------------------------------------------------------+
690
691 The bucket rate (in bytes per second) can be computed with the following formula:
692
693 *bucket_rate = (tb_credits_per_period / tb_period) * r*
694
695 where, r = port line rate (in bytes per second).
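
For example, on a 10 GbE port (r = 1.25 GBytes/s), *tb_credits_per_period = 1* and *tb_period = 10* result in a bucket rate of 125 MBytes/s, i.e. 1 Gbps.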
696
697 .. _pg_table_9:
698
699 **Table 9. Token Bucket Operations**
700
701 +---+-------------------------+-----------------------------------------------------------------------------+
702 | # | Token bucket operation  | Description                                                                 |
703 |   |                         |                                                                             |
704 +===+=========================+=============================================================================+
705 | 1 | Initialization          | *tb_credits = 0; or tb_credits = tb_size / 2;*                              |
706 |   |                         |                                                                             |
707 +---+-------------------------+-----------------------------------------------------------------------------+
708 | 2 | Credit update           | Credit update options:                                                      |
709 |   |                         |                                                                             |
710 |   |                         | *   Every time a packet is sent for a port, update the credits of all the   |
|   |                         |     subports and pipes of that port. Not feasible.                          |
712 |   |                         |                                                                             |
713 |   |                         | *   Every time a packet is sent, update the credits for the pipe and        |
714 |   |                         |     subport. Very accurate, but not needed (a lot of calculations).         |
715 |   |                         |                                                                             |
716 |   |                         | *   Every time a pipe is selected (that is, picked by one                   |
717 |   |                         |     of the grinders), update the credits for the pipe and its subport.      |
718 |   |                         |                                                                             |
719 |   |                         | The current implementation is using option 3.  According to Section         |
720 |   |                         | 26.2.4.4 "Dequeue State Machine", the pipe and subport credits are          |
721 |   |                         | updated every time a pipe is selected by the dequeue process before the     |
722 |   |                         | pipe and subport credits are actually used.                                 |
723 |   |                         |                                                                             |
724 |   |                         | The implementation uses a tradeoff between accuracy and speed by updating   |
725 |   |                         | the bucket credits only when at least a full *tb_period*  has elapsed since |
726 |   |                         | the last update.                                                            |
727 |   |                         |                                                                             |
728 |   |                         | *   Full accuracy can be achieved by selecting the value for *tb_period*    |
729 |   |                         |     for which  *tb_credits_per_period = 1*.                                 |
730 |   |                         |                                                                             |
731 |   |                         | *   When full accuracy is not required, better performance is achieved by   |
|   |                         |     setting *tb_credits_per_period* to a larger value.                      |
733 |   |                         |                                                                             |
734 |   |                         | Update operations:                                                          |
735 |   |                         |                                                                             |
736 |   |                         | *   n_periods = (time - tb_time) / tb_period;                               |
737 |   |                         |                                                                             |
738 |   |                         | *   tb_credits += n_periods * tb_credits_per_period;                        |
739 |   |                         |                                                                             |
740 |   |                         | *   tb_credits = min(tb_credits, tb_size);                                  |
741 |   |                         |                                                                             |
742 |   |                         | *   tb_time += n_periods * tb_period;                                       |
743 |   |                         |                                                                             |
744 +---+-------------------------+-----------------------------------------------------------------------------+
745 | 3 | Credit consumption      | As result of packet scheduling, the necessary number of credits is removed  |
746 |   |  (on packet scheduling) | from the bucket. The packet can only be sent if enough credits are in the   |
747 |   |                         | bucket to send the full packet (packet bytes and framing overhead for the   |
748 |   |                         | packet).                                                                    |
749 |   |                         |                                                                             |
750 |   |                         | Scheduling operations:                                                      |
751 |   |                         |                                                                             |
752 |   |                         | pkt_credits = pkt_len + frame_overhead;                                     |
753 |   |                         | if (tb_credits >= pkt_credits){tb_credits -= pkt_credits;}                  |
754 |   |                         |                                                                             |
755 +---+-------------------------+-----------------------------------------------------------------------------+
756
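The token bucket operations above can be condensed into a short C sketch.
This is an illustrative model only (the field names follow the tables above;
it is not the librte_sched implementation):

.. code-block:: c

    #include <stdint.h>

    /* Token bucket state, mirroring the persistent fields described above. */
    struct tb {
        uint64_t tb_time;               /* time of the last credit update */
        uint64_t tb_period;             /* credit update period */
        uint64_t tb_credits_per_period; /* credits added per period */
        uint64_t tb_size;               /* bucket size (cap for tb_credits) */
        uint64_t tb_credits;            /* credits currently in the bucket */
    };

    /* Credit update, run when the pipe is selected by the dequeue process. */
    static void
    tb_credit_update(struct tb *tb, uint64_t time)
    {
        uint64_t n_periods = (time - tb->tb_time) / tb->tb_period;

        tb->tb_credits += n_periods * tb->tb_credits_per_period;
        if (tb->tb_credits > tb->tb_size)
            tb->tb_credits = tb->tb_size;
        tb->tb_time += n_periods * tb->tb_period;
    }

    /* Credit consumption: send the packet only if its full cost fits in the bucket. */
    static int
    tb_credit_consume(struct tb *tb, uint32_t pkt_len, uint32_t frame_overhead)
    {
        uint64_t pkt_credits = pkt_len + frame_overhead;

        if (tb->tb_credits < pkt_credits)
            return 0; /* not enough credits, the packet is not sent */
        tb->tb_credits -= pkt_credits;
        return 1;
    }
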
757 Traffic Classes
758 """""""""""""""
759
760 Implementation of Strict Priority Scheduling
761 ''''''''''''''''''''''''''''''''''''''''''''
762
763 Strict priority scheduling of traffic classes within the same pipe is implemented by the pipe dequeue state machine,
764 which selects the queues in ascending order.
765 Therefore, queues 0..3 (associated with TC 0, highest priority TC) are handled before
766 queues 4..7 (TC 1, lower priority than TC 0),
767 which are handled before queues 8..11 (TC 2),
768 which are handled before queues 12..15 (TC 3, lowest priority TC).
769
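The queue index to traffic class mapping implied by this ordering can be made explicit
with two trivial helpers, shown below (illustrative only; these helpers are not part of
the library API):

.. code-block:: c

    #include <stdint.h>

    /* Within a pipe, queue index q (0..15) belongs to traffic class q / 4 and
     * occupies WRR position q % 4 inside that class, so scanning the queues in
     * ascending index order serves TC 0 completely before TC 1, and so on. */
    static inline uint32_t pipe_queue_tc(uint32_t qindex)  { return qindex / 4; }
    static inline uint32_t pipe_queue_wrr(uint32_t qindex) { return qindex % 4; }
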
770 Upper Limit Enforcement
771 '''''''''''''''''''''''
772
773 The traffic classes at the pipe and subport levels are not traffic shaped,
774 so there is no token bucket maintained in this context.
775 The upper limit for the traffic classes at the subport and
776 pipe levels is enforced by periodically refilling the subport / pipe traffic class credit counter,
777 out of which credits are consumed every time a packet is scheduled for that subport / pipe,
778 as described in Table 10 and Table 11.
779
780 .. _pg_table_10:
781
782 **Table 10. Subport/Pipe Traffic Class Upper Limit Enforcement Persistent Data Structure**
783
784 +---+-----------------------+-------+-----------------------------------------------------------------------+
785 | # | Subport or pipe field | Unit  | Description                                                           |
786 |   |                       |       |                                                                       |
787 +===+=======================+=======+=======================================================================+
788 | 1 | tc_time               | Bytes | Time of the next update (upper limit refill) for the 4 TCs of the     |
789 |   |                       |       | current subport / pipe.                                               |
790 |   |                       |       |                                                                       |
791 |   |                       |       | See  Section 26.2.4.5.1, "Internal Time Reference" for the            |
792 |   |                       |       | explanation of why the time is maintained in byte units.              |
793 |   |                       |       |                                                                       |
794 +---+-----------------------+-------+-----------------------------------------------------------------------+
795 | 2 | tc_period             | Bytes | Time between two consecutive updates for the 4 TCs of the current     |
796 |   |                       |       | subport / pipe. This is expected to be many times bigger than the     |
797 |   |                       |       | typical value of the token bucket tb_period.                          |
798 |   |                       |       |                                                                       |
799 +---+-----------------------+-------+-----------------------------------------------------------------------+
800 | 3 | tc_credits_per_period | Bytes | Upper limit for the number of credits allowed to be consumed by the   |
801 |   |                       |       | current TC during each enforcement period tc_period.                  |
802 |   |                       |       |                                                                       |
803 +---+-----------------------+-------+-----------------------------------------------------------------------+
804 | 4 | tc_credits            | Bytes | Current upper limit for the number of credits that can be consumed by |
805 |   |                       |       | the current traffic class for the remainder of the current            |
806 |   |                       |       | enforcement period.                                                   |
807 |   |                       |       |                                                                       |
808 +---+-----------------------+-------+-----------------------------------------------------------------------+
809
810 .. _pg_table_11:
811
812 **Table 11. Subport/Pipe Traffic Class Upper Limit Enforcement Operations**
813
814 +---+--------------------------+----------------------------------------------------------------------------+
815 | # | Traffic Class Operation  | Description                                                                |
816 |   |                          |                                                                            |
817 +===+==========================+============================================================================+
818 | 1 | Initialization           | tc_credits = tc_credits_per_period;                                        |
819 |   |                          |                                                                            |
820 |   |                          | tc_time = tc_period;                                                       |
821 |   |                          |                                                                            |
822 +---+--------------------------+----------------------------------------------------------------------------+
823 | 2 | Credit update            | Update operations:                                                         |
824 |   |                          |                                                                            |
825 |   |                          | if (time >= tc_time) {                                                     |
826 |   |                          |                                                                            |
827 |   |                          | tc_credits = tc_credits_per_period;                                        |
828 |   |                          |                                                                            |
829 |   |                          | tc_time = time + tc_period;                                                |
830 |   |                          |                                                                            |
831 |   |                          | }                                                                          |
832 |   |                          |                                                                            |
833 +---+--------------------------+----------------------------------------------------------------------------+
834 | 3 | Credit consumption       | As a result of packet scheduling, the TC limit is decreased by the         |
835 |   | (on packet scheduling)   | necessary number of credits. The packet can only be sent if enough credits |
836 |   |                          | are currently available in the TC limit to send the full packet            |
837 |   |                          | (packet bytes and framing overhead for the packet).                        |
838 |   |                          |                                                                            |
839 |   |                          | Scheduling operations:                                                     |
840 |   |                          |                                                                            |
841 |   |                          | pkt_credits = pkt_len + frame_overhead;                                    |
842 |   |                          |                                                                            |
843 |   |                          | if (tc_credits >= pkt_credits) {tc_credits -= pkt_credits;}                |
844 |   |                          |                                                                            |
845 +---+--------------------------+----------------------------------------------------------------------------+
846
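The operations in Table 11 translate into the following C sketch, using the fields from
Table 10 (illustrative only; time is in byte units, as explained earlier):

.. code-block:: c

    #include <stdint.h>

    /* Subport/pipe traffic class upper limit state (per Table 10). */
    struct tc_limit {
        uint64_t tc_time;               /* time of the next refill */
        uint64_t tc_period;             /* enforcement period */
        uint64_t tc_credits_per_period; /* credits granted per period */
        uint64_t tc_credits;            /* credits left in the current period */
    };

    /* Credit update: refill the counter once per enforcement period. */
    static void
    tc_credit_update(struct tc_limit *tc, uint64_t time)
    {
        if (time >= tc->tc_time) {
            tc->tc_credits = tc->tc_credits_per_period;
            tc->tc_time = time + tc->tc_period;
        }
    }

    /* Credit consumption: the packet is sent only if its full cost still fits. */
    static int
    tc_credit_consume(struct tc_limit *tc, uint32_t pkt_len, uint32_t frame_overhead)
    {
        uint64_t pkt_credits = pkt_len + frame_overhead;

        if (tc->tc_credits < pkt_credits)
            return 0;
        tc->tc_credits -= pkt_credits;
        return 1;
    }
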
847 Weighted Round Robin (WRR)
848 """"""""""""""""""""""""""
849
850 The evolution of the WRR design solution from simple to complex is shown in Table 12.
851
852 .. _pg_table_12:
853
854 **Table 12. Weighted Round Robin (WRR)**
855
856 +---+------------+-----------------+-------------+----------------------------------------------------------+
857 | # | All Queues | Equal Weights   | All Packets | Strategy                                                 |
858 |   | Active?    | for All Queues? | Equal?      |                                                          |
859 +===+============+=================+=============+==========================================================+
860 | 1 | Yes        | Yes             | Yes         | **Byte level round robin**                               |
861 |   |            |                 |             |                                                          |
862 |   |            |                 |             | *Next queue*  queue #i, i =  *(i + 1) % n*               |
863 |   |            |                 |             |                                                          |
864 +---+------------+-----------------+-------------+----------------------------------------------------------+
865 | 2 | Yes        | Yes             | No          | **Packet level round robin**                             |
866 |   |            |                 |             |                                                          |
867 |   |            |                 |             | Consuming one byte from queue #i requires consuming      |
868 |   |            |                 |             | exactly one token for queue #i.                          |
869 |   |            |                 |             |                                                          |
870 |   |            |                 |             | T(i) = Accumulated number of tokens previously consumed  |
871 |   |            |                 |             | from queue #i. Every time a packet is consumed from      |
872 |   |            |                 |             | queue #i, T(i) is updated as: T(i) += *pkt_len*.         |
873 |   |            |                 |             |                                                          |
874 |   |            |                 |             | *Next queue* : queue with the smallest T.                |
875 |   |            |                 |             |                                                          |
876 |   |            |                 |             |                                                          |
877 +---+------------+-----------------+-------------+----------------------------------------------------------+
878 | 3 | Yes        | No              | No          | **Packet level weighted round robin**                    |
879 |   |            |                 |             |                                                          |
880 |   |            |                 |             | This case can be reduced to the previous case by         |
881 |   |            |                 |             | introducing a cost per byte that is different for each   |
882 |   |            |                 |             | queue. Queues with lower weights have a higher cost per  |
883 |   |            |                 |             | byte. This way, it is still meaningful to compare the    |
884 |   |            |                 |             | consumption amongst different queues in order to select  |
885 |   |            |                 |             | the next queue.                                          |
886 |   |            |                 |             |                                                          |
887 |   |            |                 |             | w(i) = Weight of queue #i                                |
888 |   |            |                 |             |                                                          |
889 |   |            |                 |             | t(i) = Tokens per byte for queue #i, defined as the      |
890 |   |            |                 |             | inverse weight of queue #i.                              |
891 |   |            |                 |             | For example, if w[0..3] = [1:2:4:8],                     |
892 |   |            |                 |             | then t[0..3] = [8:4:2:1]; if w[0..3] = [1:4:15:20],      |
893 |   |            |                 |             | then t[0..3] = [60:15:4:3].                              |
894 |   |            |                 |             | Consuming one byte from queue #i requires consuming t(i) |
895 |   |            |                 |             | tokens for queue #i.                                     |
896 |   |            |                 |             |                                                          |
897 |   |            |                 |             | T(i) = Accumulated number of tokens previously consumed  |
898 |   |            |                 |             | from queue #i. Every time a packet is consumed from      |
899 |   |            |                 |             | queue #i, T(i) is updated as:  *T(i) += pkt_len * t(i)*. |
900 |   |            |                 |             | *Next queue* : queue with the smallest T.                |
901 |   |            |                 |             |                                                          |
902 +---+------------+-----------------+-------------+----------------------------------------------------------+
903 | 4 | No         | No              | No          | **Packet level weighted round robin with variable queue  |
904 |   |            |                 |             | status**                                                 |
905 |   |            |                 |             |                                                          |
906 |   |            |                 |             | Reduce this case to the previous case by setting the     |
907 |   |            |                 |             | consumption of inactive queues to a high number, so that |
908 |   |            |                 |             | the inactive queues will never be selected by the        |
909 |   |            |                 |             | smallest T logic.                                        |
910 |   |            |                 |             |                                                          |
911 |   |            |                 |             | To prevent T from overflowing as a result of successive  |
912 |   |            |                 |             | accumulations, T(i) is truncated after each packet       |
913 |   |            |                 |             | consumption for all queues.                              |
914 |   |            |                 |             | For example, T[0..3] = [1000, 1100, 1200, 1300]          |
915 |   |            |                 |             | is truncated to T[0..3] = [0, 100, 200, 300]             |
916 |   |            |                 |             | by subtracting the min T from T(i), i = 0..n.            |
917 |   |            |                 |             |                                                          |
918 |   |            |                 |             | This requires having at least one active queue in the    |
919 |   |            |                 |             | set of input queues, which is guaranteed by the dequeue  |
920 |   |            |                 |             | state machine never selecting an inactive traffic class. |
921 |   |            |                 |             |                                                          |
922 |   |            |                 |             | *mask(i) = Saturation mask for queue #i, defined as:*    |
923 |   |            |                 |             |                                                          |
924 |   |            |                 |             | mask(i) = (queue #i is active)? 0 : 0xFFFFFFFF;          |
925 |   |            |                 |             |                                                          |
926 |   |            |                 |             | w(i) = Weight of queue #i                                |
927 |   |            |                 |             |                                                          |
928 |   |            |                 |             | t(i) = Tokens per byte for queue #i, defined as the      |
929 |   |            |                 |             | inverse weight of queue #i.                              |
930 |   |            |                 |             |                                                          |
931 |   |            |                 |             | T(i) = Accumulated number of tokens previously consumed  |
932 |   |            |                 |             | from queue #i.                                           |
933 |   |            |                 |             |                                                          |
934 |   |            |                 |             | *Next queue*  : queue with smallest T.                   |
935 |   |            |                 |             |                                                          |
936 |   |            |                 |             | Before packet consumption from queue #i:                 |
937 |   |            |                 |             |                                                          |
938 |   |            |                 |             | *T(i) |= mask(i)*                                        |
939 |   |            |                 |             |                                                          |
940 |   |            |                 |             | After packet consumption from queue #i:                  |
941 |   |            |                 |             |                                                          |
942 |   |            |                 |             | T(j) -= T(i), j != i                                     |
943 |   |            |                 |             |                                                          |
944 |   |            |                 |             | T(i) = pkt_len * t(i)                                    |
945 |   |            |                 |             |                                                          |
946 |   |            |                 |             | Note: T(j) uses the T(i) value before T(i) is updated.   |
947 |   |            |                 |             |                                                          |
948 +---+------------+-----------------+-------------+----------------------------------------------------------+
949
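The fourth (most general) case in Table 12 can be sketched in C as follows. This is an
illustrative scalar model of the mask and token logic only, assuming the four queues of
one traffic class; it is not the actual librte_sched code:

.. code-block:: c

    #include <stdint.h>

    #define WRR_QUEUES 4

    struct wrr_state {
        uint32_t t[WRR_QUEUES];               /* T(i): accumulated tokens */
        uint32_t tokens_per_byte[WRR_QUEUES]; /* t(i): inverse queue weight */
    };

    /* Pick the next queue and account for one packet of pkt_len bytes taken
     * from it. At least one queue must be active. */
    static uint32_t
    wrr_schedule(struct wrr_state *w, const int active[WRR_QUEUES], uint32_t pkt_len)
    {
        uint32_t i, winner = 0, t_winner;

        /* T(i) |= mask(i): saturate inactive queues so they can never win. */
        for (i = 0; i < WRR_QUEUES; i++)
            if (!active[i])
                w->t[i] = UINT32_MAX;

        /* Next queue: the one with the smallest T. */
        for (i = 1; i < WRR_QUEUES; i++)
            if (w->t[i] < w->t[winner])
                winner = i;

        /* Truncate T by the winner's old value, then charge the packet. */
        t_winner = w->t[winner];
        for (i = 0; i < WRR_QUEUES; i++)
            if (i != winner)
                w->t[i] -= t_winner;
        w->t[winner] = pkt_len * w->tokens_per_byte[winner];

        return winner;
    }
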
950 Subport Traffic Class Oversubscription
951 """"""""""""""""""""""""""""""""""""""
952
953 Problem Statement
954 '''''''''''''''''
955
956 Oversubscription for subport traffic class X is a configuration-time event that occurs when
957 more bandwidth is allocated for traffic class X at the level of subport member pipes than
958 allocated for the same traffic class at the parent subport level.
959
960 Oversubscription for a specific subport and traffic class is solely the
961 result of pipe-level and subport-level configuration, as opposed to arising
962 dynamically from the evolution of the traffic load at run-time
963 (as congestion does).
964
965 When the overall demand for traffic class X for the current subport is low,
966 the existence of the oversubscription condition does not represent a problem,
967 as demand for traffic class X is completely satisfied for all member pipes.
968 However, this can no longer be achieved when the aggregated demand for traffic class X
969 for all subport member pipes exceeds the limit configured at the subport level.
970
971 Solution Space
972 ''''''''''''''
973
974 Table 13 summarizes some of the possible approaches for handling this problem,
975 with the third approach selected for implementation.
976
977 .. _pg_table_13:
978
979 **Table 13. Subport Traffic Class Oversubscription**
980
981 +-----+---------------------------+-------------------------------------------------------------------------+
982 | No. | Approach                  | Description                                                             |
983 |     |                           |                                                                         |
984 +=====+===========================+=========================================================================+
985 | 1   | Don't care                | First come, first served.                                               |
986 |     |                           |                                                                         |
987 |     |                           | This approach is not fair amongst subport member pipes, as pipes that   |
988 |     |                           | are served first will use up as much bandwidth for TC X as they need,   |
989 |     |                           | while pipes that are served later will receive poor service due to      |
990 |     |                           | bandwidth for TC X at the subport level being scarce.                   |
991 |     |                           |                                                                         |
992 +-----+---------------------------+-------------------------------------------------------------------------+
993 | 2   | Scale down all pipes      | All pipes within the subport have their bandwidth limit for TC X scaled |
994 |     |                           | down by the same factor.                                                |
995 |     |                           |                                                                         |
996 |     |                           | This approach is not fair among subport member pipes, as the low end    |
997 |     |                           | pipes (that is, pipes configured with low bandwidth) can potentially    |
998 |     |                           | experience severe service degradation that might render their service   |
999 |     |                           | unusable (if available bandwidth for these pipes drops below the        |
1000 |     |                           | minimum requirements for a workable service), while the service         |
1001 |     |                           | degradation for high end pipes might not be noticeable at all.          |
1002 |     |                           |                                                                         |
1003 +-----+---------------------------+-------------------------------------------------------------------------+
1004 | 3   | Cap the high demand pipes | Each subport member pipe receives an equal share of the bandwidth       |
1005 |     |                           | available at run-time for TC X at the subport level. Any bandwidth left |
1006 |     |                           | unused by the low-demand pipes is redistributed in equal portions to    |
1007 |     |                           | the high-demand pipes. This way, the high-demand pipes are truncated    |
1008 |     |                           | while the low-demand pipes are not impacted.                            |
1009 |     |                           |                                                                         |
1010 +-----+---------------------------+-------------------------------------------------------------------------+
1011
1012 Typically, the subport TC oversubscription feature is enabled only for the lowest priority traffic class (TC 3),
1013 which is commonly used for best effort traffic,
1014 with the management plane preventing this condition from occurring for the other (higher priority) traffic classes.
1015
1016 To ease implementation, it is also assumed that the upper limit for subport TC 3 is set to 100% of the subport rate,
1017 and that the upper limit for pipe TC 3 is set to 100% of pipe rate for all subport member pipes.
1018
1019 Implementation Overview
1020 '''''''''''''''''''''''
1021
1022 The algorithm computes a watermark, whose purpose is to limit the amount of traffic that each pipe is allowed to send for TC 3.
1023 The watermark is periodically updated based on the current demand experienced by the subport member pipes.
1024 The watermark is computed at the subport level at the beginning of each traffic class upper limit enforcement period and
1025 the same value is used by all the subport member pipes throughout the current enforcement period.
1026 Table 14 illustrates how the watermark computed at the subport level at the beginning of each period is propagated to all subport member pipes.
1027
1028 At the beginning of the current enforcement period (which coincides with the end of the previous enforcement period),
1029 the value of the watermark is adjusted based on how much of the TC 3 bandwidth allocated at the beginning of the previous period
1030 was actually consumed by the subport member pipes by the end of that period.
1031
1032 If there was subport TC 3 bandwidth left unused,
1033 the value of the watermark for the current period is increased to encourage the subport member pipes to consume more bandwidth.
1034 Otherwise, the value of the watermark is decreased to enforce equality of bandwidth consumption among subport member pipes for TC 3.
1035
1036 The increase or decrease in the watermark value is done in small increments,
1037 so several enforcement periods might be required to reach the equilibrium state.
1038 This state can change at any moment due to variations in the demand experienced by the subport member pipes for TC 3, for example,
1039 as a result of demand increase (when the watermark needs to be lowered) or demand decrease (when the watermark needs to be increased).
1040
1041 When demand is low, the watermark is set high to prevent it from impeding the subport member pipes from consuming more bandwidth.
1042 The highest value for the watermark is picked as the highest rate configured for a subport member pipe.
1043 Table 15 illustrates the watermark operation.
1044
1045 .. _pg_table_14:
1046
1047 **Table 14. Watermark Propagation from Subport Level to Member Pipes at the Beginning of Each Traffic Class Upper Limit Enforcement Period**
1048
1049 +-----+---------------------------------+----------------------------------------------------+
1050 | No. | Subport Traffic Class Operation | Description                                        |
1051 |     |                                 |                                                    |
1052 +=====+=================================+====================================================+
1053 | 1   | Initialization                  | **Subport level**: subport_period_id= 0            |
1054 |     |                                 |                                                    |
1055 |     |                                 | **Pipe level**: pipe_period_id = 0                 |
1056 |     |                                 |                                                    |
1057 +-----+---------------------------------+----------------------------------------------------+
1058 | 2   | Credit update                   | **Subport Level**:                                 |
1059 |     |                                 |                                                    |
1060 |     |                                 | if (time>=subport_tc_time)                         |
1061 |     |                                 |                                                    |
1062 |     |                                 | {                                                  |
1063 |     |                                 |     subport_wm = water_mark_update();              |
1064 |     |                                 |                                                    |
1065 |     |                                 |     subport_tc_time = time + subport_tc_period;    |
1066 |     |                                 |                                                    |
1067 |     |                                 |     subport_period_id++;                           |
1068 |     |                                 |                                                    |
1069 |     |                                 | }                                                  |
1070 |     |                                 |                                                    |
1071 |     |                                 | **Pipe level:**                                    |
1072 |     |                                 |                                                    |
1073 |     |                                 | if(pipe_period_id != subport_period_id)            |
1074 |     |                                 |                                                    |
1075 |     |                                 | {                                                  |
1076 |     |                                 |                                                    |
1077 |     |                                 |     pipe_ov_credits = subport_wm \* pipe_weight;   |
1078 |     |                                 |                                                    |
1079 |     |                                 |     pipe_period_id = subport_period_id;            |
1080 |     |                                 |                                                    |
1081 |     |                                 | }                                                  |
1082 |     |                                 |                                                    |
1083 +-----+---------------------------------+----------------------------------------------------+
1084 | 3   | Credit consumption              | **Pipe level:**                                    |
1085 |     | (on packet scheduling)          |                                                    |
1086 |     |                                 | pkt_credits = pkt_len + frame_overhead;            |
1087 |     |                                 |                                                    |
1088 |     |                                 | if(pipe_ov_credits >= pkt_credits) {               |
1089 |     |                                 |                                                    |
1090 |     |                                 |    pipe_ov_credits -= pkt_credits;                 |
1091 |     |                                 |                                                    |
1092 |     |                                 | }                                                  |
1093 |     |                                 |                                                    |
1094 +-----+---------------------------------+----------------------------------------------------+
1095
1096 .. _pg_table_15:
1097
1098 **Table 15. Watermark Calculation**
1099
1100 +-----+------------------+----------------------------------------------------------------------------------+
1101 | No. | Subport Traffic  | Description                                                                      |
1102 |     | Class Operation  |                                                                                  |
1103 +=====+==================+==================================================================================+
1104 | 1   | Initialization   | **Subport level:**                                                               |
1105 |     |                  |                                                                                  |
1106 |     |                  | wm = WM_MAX                                                                      |
1107 |     |                  |                                                                                  |
1108 +-----+------------------+----------------------------------------------------------------------------------+
1109 | 2   | Credit update    | **Subport level (water_mark_update):**                                           |
1110 |     |                  |                                                                                  |
1111 |     |                  | tc0_cons = subport_tc0_credits_per_period - subport_tc0_credits;                 |
1112 |     |                  |                                                                                  |
1113 |     |                  | tc1_cons = subport_tc1_credits_per_period - subport_tc1_credits;                 |
1114 |     |                  |                                                                                  |
1115 |     |                  | tc2_cons = subport_tc2_credits_per_period - subport_tc2_credits;                 |
1116 |     |                  |                                                                                  |
1117 |     |                  | tc3_cons = subport_tc3_credits_per_period - subport_tc3_credits;                 |
1118 |     |                  |                                                                                  |
1119 |     |                  | tc3_cons_max = subport_tc3_credits_per_period - (tc0_cons + tc1_cons +           |
1120 |     |                  | tc2_cons);                                                                       |
1121 |     |                  |                                                                                  |
1122 |     |                  | if(tc3_cons > (tc3_cons_max - MTU)) {                                            |
1123 |     |                  |                                                                                  |
1124 |     |                  |     wm -= wm >> 7;                                                               |
1125 |     |                  |                                                                                  |
1126 |     |                  |     if(wm < WM_MIN) wm =  WM_MIN;                                                |
1127 |     |                  |                                                                                  |
1128 |     |                  | } else {                                                                         |
1129 |     |                  |                                                                                  |
1130 |     |                  |    wm += (wm >> 7) + 1;                                                          |
1131 |     |                  |                                                                                  |
1132 |     |                  |    if(wm > WM_MAX) wm = WM_MAX;                                                  |
1133 |     |                  |                                                                                  |
1134 |     |                  | }                                                                                |
1135 |     |                  |                                                                                  |
1136 +-----+------------------+----------------------------------------------------------------------------------+
1137
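The credit update step in Table 15 can be written as a small C function. This is a
simplified sketch using the names from the table; wm_min, wm_max and mtu are passed in
as parameters here, and it is not the actual librte_sched code:

.. code-block:: c

    #include <stdint.h>

    /* Simplified watermark update (per Table 15). */
    static uint32_t
    water_mark_update(uint32_t wm, uint32_t wm_min, uint32_t wm_max, uint32_t mtu,
            const uint32_t tc_credits_per_period[4], const uint32_t tc_credits[4])
    {
        uint32_t tc0_cons = tc_credits_per_period[0] - tc_credits[0];
        uint32_t tc1_cons = tc_credits_per_period[1] - tc_credits[1];
        uint32_t tc2_cons = tc_credits_per_period[2] - tc_credits[2];
        uint32_t tc3_cons = tc_credits_per_period[3] - tc_credits[3];
        uint32_t tc3_cons_max = tc_credits_per_period[3] -
                (tc0_cons + tc1_cons + tc2_cons);

        if (tc3_cons > tc3_cons_max - mtu) {
            /* TC 3 consumed (almost) all it was allowed: tighten the watermark. */
            wm -= wm >> 7;
            if (wm < wm_min)
                wm = wm_min;
        } else {
            /* TC 3 bandwidth was left unused: relax the watermark. */
            wm += (wm >> 7) + 1;
            if (wm > wm_max)
                wm = wm_max;
        }
        return wm;
    }
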
1138 Worst Case Scenarios for Performance
1139 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
1140
1141 Lots of Active Queues with Not Enough Credits
1142 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
1143
1144 The more queues the scheduler has to examine for packets and credits in order to select one packet,
1145 the lower the performance of the scheduler is.
1146
1147 The scheduler maintains a bitmap of active queues, which allows the inactive queues to be skipped,
1148 but in order to detect whether a specific pipe has enough credits,
1149 the pipe has to be drilled down using the pipe dequeue state machine,
1150 which consumes cycles regardless of the scheduling outcome
1151 (whether no packet or at least one packet is produced).
1152
1153 This scenario stresses the importance of the policer for scheduler performance:
1154 if a pipe does not have enough credits,
1155 its packets should be dropped as soon as possible (before they reach the hierarchical scheduler),
1156 thus rendering the pipe queues inactive.
1157 This allows the dequeue side to skip the pipe without spending any cycles investigating its credits,
1158 only to find out that there are not enough of them.
1159
1160 Single Queue with 100% Line Rate
1161 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
1162
1163 The port scheduler performance is optimized for a large number of queues.
1164 If the number of queues is small,
1165 then the performance of the port scheduler for the same level of active traffic is expected to be worse than
1166 the performance of a small set of message passing queues.
1167
1168 .. _Dropper:
1169
1170 Dropper
1171 -------
1172
1173 The purpose of the DPDK dropper is to drop packets arriving at a packet scheduler to avoid congestion.
1174 The dropper supports the Random Early Detection (RED),
1175 Weighted Random Early Detection (WRED) and tail drop algorithms.
1176 Figure 27 illustrates how the dropper integrates with the scheduler.
1177 The DPDK currently does not support congestion management,
1178 so the dropper provides the only method for congestion avoidance.
1179
1180 .. _pg_figure_27:
1181
1182 **Figure 27. High-level Block Diagram of the DPDK Dropper**
1183
1184 .. image53_png has been renamed
1185
1186 |blk_diag_dropper|
1187
1188 The dropper uses the Random Early Detection (RED) congestion avoidance algorithm as documented in the reference publication.
1189 The purpose of the RED algorithm is to monitor a packet queue,
1190 determine the current congestion level in the queue and decide whether an arriving packet should be enqueued or dropped.
1191 The RED algorithm uses an Exponential Weighted Moving Average (EWMA) filter to compute average queue size which
1192 gives an indication of the current congestion level in the queue.
1193
1194 For each enqueue operation, the RED algorithm compares the average queue size to minimum and maximum thresholds.
1195 Depending on whether the average queue size is below, above or in between these thresholds,
1196 the RED algorithm calculates the probability that an arriving packet should be dropped and
1197 makes a random decision based on this probability.
1198
1199 The dropper also supports Weighted Random Early Detection (WRED) by allowing the scheduler to select
1200 different RED configurations for the same packet queue at run-time.
1201 In the case of severe congestion, the dropper resorts to tail drop.
1202 This occurs when a packet queue has reached maximum capacity and cannot store any more packets.
1203 In this situation, all arriving packets are dropped.
1204
1205 The flow through the dropper is illustrated in Figure 28.
1206 The RED/WRED algorithm is exercised first and tail drop second.
1207
1208 .. _pg_figure_28:
1209
1210 **Figure 28. Flow Through the Dropper**
1211
1212 ..  image54_png has been renamed
1213
1214 |flow_tru_droppper|
1215
1216 The use cases supported by the dropper are:
1217
1218 *   Initialize configuration data
1219
1220 *   Initialize run-time data
1221
1222 *   Enqueue (make a decision to enqueue or drop an arriving packet)
1223
1224 *   Mark empty (record the time at which a packet queue becomes empty)
1225
1226 The configuration use case is explained in :ref:`Section 2.23.3.1 <Configuration>`,
1227 the enqueue operation is explained in :ref:`Section 2.23.3.2 <Enqueue_Operation>`
1228 and the mark empty operation is explained in :ref:`Section 2.23.3.3 <Queue_Empty_Operation>`.
1229
1230 .. _Configuration:
1231
1232 Configuration
1233 ~~~~~~~~~~~~~
1234
1235 A RED configuration contains the parameters given in Table 16.
1236
1237 .. _pg_table_16:
1238
1239 **Table 16. RED Configuration Parameters**
1240
1241 +--------------------------+---------+---------+------------------+
1242 | Parameter                | Minimum | Maximum | Typical          |
1243 |                          |         |         |                  |
1244 +==========================+=========+=========+==================+
1245 | Minimum Threshold        | 0       | 1022    | 1/4 x queue size |
1246 |                          |         |         |                  |
1247 +--------------------------+---------+---------+------------------+
1248 | Maximum Threshold        | 1       | 1023    | 1/2 x queue size |
1249 |                          |         |         |                  |
1250 +--------------------------+---------+---------+------------------+
1251 | Inverse Mark Probability | 1       | 255     | 10               |
1252 |                          |         |         |                  |
1253 +--------------------------+---------+---------+------------------+
1254 | EWMA Filter Weight       | 1       | 12      | 9                |
1255 |                          |         |         |                  |
1256 +--------------------------+---------+---------+------------------+
1257
1258 The meaning of these parameters is explained in more detail in the following sections.
1259 The format of these parameters as specified to the dropper module API
1260 corresponds to the format used by Cisco* in their RED implementation.
1261 The minimum and maximum threshold parameters are specified to the dropper module in terms of number of packets.
1262 The mark probability parameter is specified as an inverse value, for example,
1263 an inverse mark probability parameter value of 10 corresponds
1264 to a mark probability of 1/10 (that is, 1 in 10 packets will be dropped).
1265 The EWMA filter weight parameter is specified as an inverse log value,
1266 for example, a filter weight parameter value of 9 corresponds to a filter weight of 1/2^9.
1267
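As a quick numeric illustration of these conventions (plain C with local variable names
only, not the dropper API), consider a queue of 64 packets with the typical values from
Table 16:

.. code-block:: c

    #include <stdio.h>

    int main(void)
    {
        unsigned queue_size = 64;               /* packets */
        unsigned min_th     = queue_size / 4;   /* typical minimum threshold: 16 */
        unsigned max_th     = queue_size / 2;   /* typical maximum threshold: 32 */
        unsigned maxp_inv   = 10;               /* inverse mark probability */
        unsigned wq_log2    = 9;                /* EWMA filter weight parameter */

        double mark_probability = 1.0 / maxp_inv;        /* 0.1, i.e. 1 in 10 packets */
        double filter_weight    = 1.0 / (1u << wq_log2); /* wq = 1/2^9 */

        printf("min_th=%u max_th=%u maxp=%.2f wq=%f\n",
               min_th, max_th, mark_probability, filter_weight);
        return 0;
    }
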
1268 .. _Enqueue_Operation:
1269
1270 Enqueue Operation
1271 ~~~~~~~~~~~~~~~~~
1272
1273 In the example shown in Figure 29, q (actual queue size) is the input value,
1274 avg (average queue size) and count (number of packets since the last drop) are run-time values,
1275 decision is the output value and the remaining values are configuration parameters.
1276
1277 .. _pg_figure_29:
1278
1279 **Figure 29. Example Data Flow Through Dropper**
1280
1281 .. image55_png has been renamed
1282
1283 |ex_data_flow_tru_dropper|
1284
1285 EWMA Filter Microblock
1286 ^^^^^^^^^^^^^^^^^^^^^^
1287
1288 The purpose of the EWMA Filter microblock is to filter queue size values to smooth out transient changes
1289 that result from "bursty" traffic.
1290 The output value is the average queue size which gives a more stable view of the current congestion level in the queue.
1291
1292 The EWMA filter has one configuration parameter, filter weight, which determines how quickly
1293 or slowly the average queue size output responds to changes in the actual queue size input.
1294 Higher values of filter weight mean that the average queue size responds more quickly to changes in actual queue size.
1295
1296 Average Queue Size Calculation when the Queue is not Empty
1297 """"""""""""""""""""""""""""""""""""""""""""""""""""""""""
1298
1299 The definition of the EWMA filter is given in the following equation.
1300
1301 **Equation 1.**
1302
1303 .. image56_png has been renamed
1304
1305 |ewma_filter_eq_1|
1306
1307 Where:
1308
1309 *   *avg*  = average queue size
1310
1311 *   *wq*   = filter weight
1312
1313 *   *q*    = actual queue size
1314
1315 .. note::
1316
1317     The filter weight, wq = 1/2^n, where n is the filter weight parameter value passed to the dropper module
1318     on configuration (see :ref:`Section 2.23.3.1 <Configuration>`).
1319
1320 Average Queue Size Calculation when the Queue is Empty
1321 """"""""""""""""""""""""""""""""""""""""""""""""""""""
1322
1323 The EWMA filter does not read time stamps and instead assumes that enqueue operations will happen quite regularly.
1324 Special handling is required when the queue becomes empty as the queue could be empty for a short time or a long time.
1325 When the queue becomes empty, average queue size should decay gradually to zero instead of dropping suddenly to zero
1326 or remaining stagnant at the last computed value.
1327 When a packet is enqueued on an empty queue, the average queue size is computed using the following formula:
1328
1329 **Equation 2.**
1330
1331 .. image57_png has been renamed
1332
1333 |ewma_filter_eq_2|
1334
1335 Where:
1336
1337 *   *m*   = the number of enqueue operations that could have occurred on this queue while the queue was empty
1338
1339 In the dropper module, *m* is defined as:
1340
1341 .. image58_png has been renamed
1342
1343 |m_definition|
1344
1345 Where:
1346
1347 *   *time*  = current time
1348
1349 *   *qtime* = time the queue became empty
1350
1351 *   *s* = typical time between successive enqueue operations on this queue
1352
1353 The time reference is in units of bytes,
1354 where a byte signifies the time duration required by the physical interface to send out a byte on the transmission medium
1355 (see Section 26.2.4.5.1 "Internal Time Reference").
1356 The parameter s is defined in the dropper module as a constant with the value: s=2^22.
1357 This corresponds to the time required by every leaf node in a hierarchy with 64K leaf nodes
1358 to transmit one 64-byte packet onto the wire and represents the worst case scenario.
1359 For much smaller scheduler hierarchies,
1360 it may be necessary to reduce the parameter s, which is defined in the red header source file (rte_red.h) as:
1361
1362 .. code-block:: c
1363
1364     #define RTE_RED_S
1365
1366 Since the time reference is in bytes, the port speed is implied in the expression: *time-qtime*.
1367 The dropper does not have to be configured with the actual port speed.
1368 It adjusts automatically to low speed and high speed links.
1369
1370 Implementation
1371 """"""""""""""
1372
1373 A numerical method is used to compute the factor (1-wq)^m that appears in Equation 2.
1374
1375 This method is based on the following identity:
1376
1377 .. image59_png has been renamed
1378
1379 |eq2_factor|
1380
1381 This allows us to express the following:
1382
1383 .. image60_png has been renamed
1384
1385 |eq2_expression|
1386
1387 In the dropper module, a look-up table is used to compute log2(1-wq) for each value of wq supported by the dropper module.
1388 The factor (1-wq)^m can then be obtained by multiplying the table value by *m* and applying shift operations.
1389 To avoid overflow in the multiplication, the value, *m*, and the look-up table values are limited to 16 bits.
1390 The total size of the look-up table is 56 bytes.
1391 Once the factor (1-wq)^m is obtained using this method, the average queue size can be calculated from Equation 2.
1392
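A simplified sketch of this numerical method is shown below. It is illustrative only:
the fixed-point scaling and table generation here are examples and do not reproduce the
exact constants used in rte_red.c, and the fractional part of the resulting shift is
simply dropped, which loses some accuracy.

.. code-block:: c

    #include <stdint.h>
    #include <math.h>

    /* Illustrative fixed-point evaluation of (1 - wq)^m via the identity
     * (1 - wq)^m = 2^(m * log2(1 - wq)). The scaling below is an example only,
     * chosen so that the table entry fits in 16 bits for all supported wq. */
    #define FRAC_BITS 15

    /* Computed once per supported wq = 1/2^n (conceptually, the look-up table). */
    static uint16_t
    neg_log2_one_minus_wq(unsigned wq_log2)
    {
        double wq = 1.0 / (double)(1u << wq_log2);

        /* store -log2(1 - wq) as an unsigned fixed-point value */
        return (uint16_t)(-log2(1.0 - wq) * (double)(1u << FRAC_BITS));
    }

    /* Run-time: avg *= (1 - wq)^m becomes a 16 x 16 multiply plus a shift. */
    static uint32_t
    decay_avg(uint32_t avg, uint16_t table_entry, uint16_t m)
    {
        uint32_t shift_fp = (uint32_t)m * table_entry; /* 16 x 16 -> 32 bits */
        uint32_t shift = shift_fp >> FRAC_BITS;        /* integer part of m*|log2(1-wq)| */

        return (shift >= 32) ? 0 : (avg >> shift);
    }
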
1393 Alternative Approaches
1394 """"""""""""""""""""""
1395
1396 Other methods for calculating the factor (1-wq)^m in the expression for computing average queue size
1397 when the queue is empty (Equation 2) were considered.
1398 These approaches include:
1399
1400 *   Floating-point evaluation
1401
1402 *   Fixed-point evaluation using a small look-up table (512B) and up to 16 multiplications
1403     (this is the approach used in the FreeBSD* ALTQ RED implementation)
1404
1405 *   Fixed-point evaluation using a small look-up table (512B) and 16 SSE multiplications
1406     (SSE optimized version of the approach used in the FreeBSD* ALTQ RED implementation)
1407
1408 *   Large look-up table (76 KB)
1409
1410 The method that was finally selected (described above in Section 26.3.2.2.1) outperforms all of these approaches
1411 in terms of run-time performance and memory requirements and
1412 also achieves accuracy comparable to floating-point evaluation.
1413 Table 17 lists the performance of each of these alternative approaches relative to the method that is used in the dropper.
1414 As can be seen, the floating-point implementation achieved the worst performance.
1415
1416 .. _pg_table_17:
1417
1418 **Table 17. Relative Performance of Alternative Approaches**
1419
1420 +------------------------------------------------------------------------------------+----------------------+
1421 | Method                                                                             | Relative Performance |
1422 |                                                                                    |                      |
1423 +====================================================================================+======================+
1424 | Current dropper method (see :ref:`Section 23.3.2.1.3 <Dropper>`)                   | 100%                 |
1425 |                                                                                    |                      |
1426 +------------------------------------------------------------------------------------+----------------------+
1427 | Fixed-point method with small (512B) look-up table                                 | 148%                 |
1428 |                                                                                    |                      |
1429 +------------------------------------------------------------------------------------+----------------------+
1430 | SSE method with small (512B) look-up table                                         | 114%                 |
1431 |                                                                                    |                      |
1432 +------------------------------------------------------------------------------------+----------------------+
1433 | Large (76KB) look-up table                                                         | 118%                 |
1434 |                                                                                    |                      |
1435 +------------------------------------------------------------------------------------+----------------------+
1436 | Floating-point                                                                     | 595%                 |
1437 |                                                                                    |                      |
1438 +------------------------------------------------------------------------------------+----------------------+
1439 | **Note**: In this case, since performance is expressed as time spent executing the operation in a         |
1440 | specific condition, any relative performance value above 100% runs slower than the reference method.      |
1441 |                                                                                                           |
1442 +-----------------------------------------------------------------------------------------------------------+
1443
1444 Drop Decision Block
1445 ^^^^^^^^^^^^^^^^^^^
1446
1447 The Drop Decision block:
1448
1449 *   Compares the average queue size with the minimum and maximum thresholds
1450
1451 *   Calculates a packet drop probability
1452
1453 *   Makes a random decision to enqueue or drop an arriving packet
1454
1455 The calculation of the drop probability occurs in two stages.
1456 An initial drop probability is calculated based on the average queue size,
1457 the minimum and maximum thresholds and the mark probability.
1458 An actual drop probability is then computed from the initial drop probability.
1459 The actual drop probability takes the count run-time value into consideration
1460 so that the actual drop probability increases as more packets arrive to the packet queue
1461 since the last packet was dropped.
1462
1463 Initial Packet Drop Probability
1464 """""""""""""""""""""""""""""""
1465
1466 The initial drop probability is calculated using the following equation.
1467
1468 **Equation 3.**
1469
1470 .. image61_png has been renamed
1471
1472 |drop_probability_eq3|
1473
1474 Where:
1475
1476 *   *maxp*  = mark probability
1477
1478 *   *avg*  = average queue size
1479
1480 *   *minth*  = minimum threshold
1481
1482 *   *maxth*  = maximum threshold
1483
1484 The calculation of the packet drop probability using Equation 3 is illustrated in Figure 30.
1485 If the average queue size is below the minimum threshold, an arriving packet is enqueued.
1486 If the average queue size is at or above the maximum threshold, an arriving packet is dropped.
1487 If the average queue size is between the minimum and maximum thresholds,
1488 a drop probability is calculated to determine if the packet should be enqueued or dropped.
1489
1490 .. _pg_figure_30:
1491
1492 **Figure 30. Packet Drop Probability for a Given RED Configuration**
1493
1494 .. image62_png has been renamed
1495
1496 |pkt_drop_probability|
1497
1498 Actual Drop Probability
1499 """""""""""""""""""""""
1500
1501 If the average queue size is between the minimum and maximum thresholds,
1502 then the actual drop probability is calculated from the following equation.
1503
1504 **Equation 4.**
1505
1506 .. image63_png has been renamed
1507
1508 |drop_probability_eq4|
1509
1510 Where:
1511
1512 *   *Pb*  = initial drop probability (from Equation 3)
1513
1514 *   *count* = number of packets that have arrived since the last drop
1515
1516 The constant 2 in Equation 4 is the only deviation from the drop probability formula
1517 given in the reference document, where a value of 1 is used instead.
1518 It should be noted that the value of pa computed from Equation 4 can be negative or greater than 1.
1519 If this is the case, then a value of 1 should be used instead.
1520
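For reference, the two-stage calculation described by Equations 3 and 4 can be sketched
in C as shown below (floating point for clarity, with the mark probability passed as its
inverse; the dropper itself uses integer arithmetic):

.. code-block:: c

    /* Two-stage RED drop probability (Equations 3 and 4), floating point for clarity. */
    static double
    red_drop_probability(double avg, double min_th, double max_th,
            double maxp_inv, unsigned count)
    {
        double pb, pa;

        if (avg < min_th)
            return 0.0;   /* below the minimum threshold: always enqueue */
        if (avg >= max_th)
            return 1.0;   /* at or above the maximum threshold: always drop */

        /* Equation 3: initial drop probability. */
        pb = (1.0 / maxp_inv) * (avg - min_th) / (max_th - min_th);

        /* Equation 4: actual drop probability, with the factor 2 deviation. */
        pa = pb / (2.0 - count * pb);
        if (pa < 0.0 || pa > 1.0)
            pa = 1.0;     /* clamp, as noted above */

        return pa;
    }
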
1521 The initial and actual drop probabilities are shown in Figure 31.
1522 The actual drop probability is shown for the case where
1523 the formula given in the reference document is used (blue curve)
1524 and also for the case where the formula implemented in the dropper module
1525 is used (red curve).
1526 The formula in the reference document results in a significantly higher drop rate
1527 compared to the mark probability configuration parameter specified by the user.
1528 The choice to deviate from the reference document is simply a design decision and
1529 one that has been taken by other RED implementations, for example, FreeBSD* ALTQ RED.
1530
1531 .. _pg_figure_31:
1532
1533 **Figure 31. Initial Drop Probability (pb), Actual Drop probability (pa) Computed Using a Factor 1 (Blue Curve) and a Factor 2 (Red Curve)**
1534
1535 .. image64_png has been renamed
1536
1537 |drop_probability_graph|
1538
1539 .. _Queue_Empty_Operation:
1540
1541 Queue Empty Operation
1542 ~~~~~~~~~~~~~~~~~~~~~
1543
1544 The time at which a packet queue becomes empty must be recorded and saved with the RED run-time data
1545 so that the EWMA filter block can calculate the average queue size on the next enqueue operation.
1546 It is the responsibility of the calling application to inform the dropper module
1547 through the API that a queue has become empty.
1548
1549 Source Files Location
1550 ~~~~~~~~~~~~~~~~~~~~~
1551
1552 The source files for the DPDK dropper are located at:
1553
1554 *   DPDK/lib/librte_sched/rte_red.h
1555
1556 *   DPDK/lib/librte_sched/rte_red.c
1557
1558 Integration with the DPDK QoS Scheduler
1559 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
1560
1561 RED functionality in the DPDK QoS scheduler is disabled by default.
1562 To enable it, use the DPDK configuration parameter:
1563
1564 ::
1565
1566     CONFIG_RTE_SCHED_RED=y
1567
1568 This parameter must be set to y.
1569 The parameter is found in the build configuration files in the DPDK/config directory,
1570 for example, DPDK/config/common_linuxapp.
1571 RED configuration parameters are specified in the rte_red_params structure within the rte_sched_port_params structure
1572 that is passed to the scheduler on initialization.
1573 RED parameters are specified separately for four traffic classes and three packet colors (green, yellow and red)
1574 allowing the scheduler to implement Weighted Random Early Detection (WRED).
1575
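A configuration sketch is shown below. The red_params member name, the array dimensions
and the field names are assumptions that should be checked against the rte_sched.h and
rte_red.h headers of the DPDK version in use, and the member is only present when
RTE_SCHED_RED is enabled:

.. code-block:: c

    #include <stdint.h>
    #include <rte_sched.h>

    /* Sketch only: member, field and bound names are assumptions, see the
     * rte_sched.h / rte_red.h headers of the DPDK version in use. */
    static void
    set_wred_params(struct rte_sched_port_params *port_params)
    {
        uint32_t tc, color;

        for (tc = 0; tc < 4; tc++)                 /* four traffic classes */
            for (color = 0; color < 3; color++) {  /* green, yellow, red   */
                port_params->red_params[tc][color].min_th   = 28 - 6 * color;
                port_params->red_params[tc][color].max_th   = 32;
                port_params->red_params[tc][color].maxp_inv = 10;
                port_params->red_params[tc][color].wq_log2  = 9;
            }
    }
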
1576 Integration with the DPDK QoS Scheduler Sample Application
1577 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
1578
1579 The DPDK QoS Scheduler Application reads a configuration file on start-up.
1580 The configuration file includes a section containing RED parameters.
1581 The format of these parameters is described in :ref:`Section 2.23.3.1 <Configuration>`.
1582 A sample RED configuration is shown below. In this example, the queue size is 64 packets.
1583
1584 .. note::
1585
1586     For correct operation, the same EWMA filter weight parameter (wred weight) should be used
1587     for each packet color (green, yellow, red) in the same traffic class (tc).
1588
1589 ::
1590
   ; RED params per traffic class and color (Green / Yellow / Red)
1592
1593    [red]
1594    tc 0 wred min = 28 22 16
1595    tc 0 wred max = 32 32 32
1596    tc 0 wred inv prob = 10 10 10
1597    tc 0 wred weight = 9 9 9
1598
1599    tc 1 wred min = 28 22 16
1600    tc 1 wred max = 32 32 32
1601    tc 1 wred inv prob = 10 10 10
1602    tc 1 wred weight = 9 9 9
1603
1604    tc 2 wred min = 28 22 16
1605    tc 2 wred max = 32 32 32
1606    tc 2 wred inv prob = 10 10 10
1607    tc 2 wred weight = 9 9 9
1608
1609    tc 3 wred min = 28 22 16
1610    tc 3 wred max = 32 32 32
1611    tc 3 wred inv prob = 10 10 10
1612    tc 3 wred weight = 9 9 9
1613
1614 With this configuration file, the RED configuration that applies to green,
1615 yellow and red packets in traffic class 0 is shown in Table 18.
1616
1617 .. _pg_table_18:
1618
1619 **Table 18. RED Configuration Corresponding to RED Configuration File**
1620
1621 +--------------------+--------------------+-------+--------+-----+
1622 | RED Parameter      | Configuration Name | Green | Yellow | Red |
1623 |                    |                    |       |        |     |
1624 +====================+====================+=======+========+=====+
1625 | Minimum Threshold  | tc 0 wred min      | 28    | 22     | 16  |
1626 |                    |                    |       |        |     |
1627 +--------------------+--------------------+-------+--------+-----+
1628 | Maximum Threshold  | tc 0 wred max      | 32    | 32     | 32  |
1629 |                    |                    |       |        |     |
1630 +--------------------+--------------------+-------+--------+-----+
1631 | Mark Probability   | tc 0 wred inv prob | 10    | 10     | 10  |
1632 |                    |                    |       |        |     |
1633 +--------------------+--------------------+-------+--------+-----+
1634 | EWMA Filter Weight | tc 0 wred weight   | 9     | 9      | 9   |
1635 |                    |                    |       |        |     |
1636 +--------------------+--------------------+-------+--------+-----+
1637
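As a short worked example, a configured "wred weight" of 9 is assumed (as in the dropper configuration) to be
the negated log2 of the EWMA filter weight, giving wq = 1/2^9 = 1/512. A minimal sketch of this conversion:

.. code-block:: c

    #include <stdint.h>

    /* Illustrative only: convert the configured "wred weight" (assumed to
     * be the negated log2 of the EWMA filter weight) into the weight wq.
     * A weight of 9 gives wq = 1/2^9 = 1/512. */
    static inline double
    wred_weight_to_wq(uint32_t wred_weight)
    {
        return 1.0 / (double)(1u << wred_weight);
    }
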
1638 Application Programming Interface (API)
1639 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
1640
1641 Enqueue API
1642 ^^^^^^^^^^^
1643
1644 The syntax of the enqueue API is as follows:
1645
1646 .. code-block:: c
1647
1648    int rte_red_enqueue(const struct rte_red_config *red_cfg, struct rte_red *red, const unsigned q, const uint64_t time)
1649
1650
1651 The arguments passed to the enqueue API are configuration data, run-time data,
1652 the current size of the packet queue (in packets) and a value representing the current time.
1653 The time reference is in units of bytes,
1654 where a byte signifies the time duration required by the physical interface to send out a byte on the transmission medium
(see Section 26.2.4.5.1 "Internal Time Reference").
1656 The dropper reuses the scheduler time stamps for performance reasons.
1657
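As a hedged sketch of how an application might drive this API for a single queue, the example below sets up
the configuration and run-time structures and uses the return value of rte_red_enqueue() to make the drop
decision. The helper functions rte_red_config_init() and rte_red_rt_data_init() and the convention that a
non-zero return value means "drop the packet" are assumptions to be checked against rte_red.h.

.. code-block:: c

    #include <rte_red.h>

    static struct rte_red_config red_cfg;
    static struct rte_red red_rt;

    /* Sketch: configure one RED context with the green values from the
     * sample configuration above (wq_log2 = 9, min_th = 28, max_th = 32,
     * maxp_inv = 10). */
    static int
    red_setup(void)
    {
        if (rte_red_config_init(&red_cfg, 9, 28, 32, 10) != 0)
            return -1;
        return rte_red_rt_data_init(&red_rt);
    }

    /* Sketch: at enqueue time, a non-zero return value is assumed to mean
     * that the packet should be dropped rather than enqueued. */
    static int
    red_should_drop(unsigned int queue_size, uint64_t time_bytes)
    {
        return rte_red_enqueue(&red_cfg, &red_rt, queue_size, time_bytes) != 0;
    }
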
1658 Empty API
1659 ^^^^^^^^^
1660
1661 The syntax of the empty API is as follows:
1662
1663 .. code-block:: c
1664
1665     void rte_red_mark_queue_empty(struct rte_red *red, const uint64_t time)
1666
1667 The arguments passed to the empty API are run-time data and the current time in bytes.
1668
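A minimal sketch of where this call might sit in an application's dequeue path is shown below; the
surrounding queue bookkeeping is purely illustrative.

.. code-block:: c

    #include <rte_red.h>

    /* Sketch: after a dequeue operation, tell the dropper that the queue
     * has just become empty so that the EWMA filter can account for the
     * idle period on the next enqueue. */
    static void
    queue_dequeue_done(struct rte_red *red_rt, unsigned int queue_size,
        uint64_t time_bytes)
    {
        if (queue_size == 0)
            rte_red_mark_queue_empty(red_rt, time_bytes);
    }
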
1669 Traffic Metering
1670 ----------------
1671
1672 The traffic metering component implements the Single Rate Three Color Marker (srTCM) and
Two Rate Three Color Marker (trTCM) algorithms, as defined by IETF RFC 2697 and RFC 2698, respectively.
These algorithms meter the stream of incoming packets based on the allowance defined in advance for each traffic flow.
As a result, each incoming packet is tagged as green,
yellow or red, based on the monitored consumption of the flow the packet belongs to.
1677
1678 Functional Overview
1679 ~~~~~~~~~~~~~~~~~~~
1680
1681 The srTCM algorithm defines two token buckets for each traffic flow,
1682 with the two buckets sharing the same token update rate:
1683
1684 *   Committed (C) bucket: fed with tokens at the rate defined by the Committed Information Rate (CIR) parameter
1685     (measured in IP packet bytes per second).
1686     The size of the C bucket is defined by the Committed Burst Size (CBS) parameter (measured in bytes);
1687
1688 *   Excess (E) bucket: fed with tokens at the same rate as the C bucket.
1689     The size of the E bucket is defined by the Excess Burst Size (EBS) parameter (measured in bytes).
1690
1691 The trTCM algorithm defines two token buckets for each traffic flow,
1692 with the two buckets being updated with tokens at independent rates:
1693
1694 *   Committed (C) bucket: fed with tokens at the rate defined by the Committed Information Rate (CIR) parameter
    (measured in IP packet bytes per second).
1696     The size of the C bucket is defined by the Committed Burst Size (CBS) parameter (measured in bytes);
1697
1698 *   Peak (P) bucket: fed with tokens at the rate defined by the Peak Information Rate (PIR) parameter
1699     (measured in IP packet bytes per second).
1700     The size of the P bucket is defined by the Peak Burst Size (PBS) parameter (measured in bytes).
1701
1702 Please refer to RFC 2697 (for srTCM) and RFC 2698 (for trTCM) for details on how tokens are consumed
1703 from the buckets and how the packet color is determined.
1704
1705 Color Blind and Color Aware Modes
1706 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
1707
For both algorithms, the color blind mode is functionally equivalent to the color aware mode with the input color set to green.
1709 For color aware mode, a packet with red input color can only get the red output color,
1710 while a packet with yellow input color can only get the yellow or red output colors.
1711
The reason the color blind mode is still implemented separately from the color aware mode is
that the color blind mode can be implemented with fewer operations than the color aware mode.
1714
1715 Implementation Overview
1716 ~~~~~~~~~~~~~~~~~~~~~~~
1717
1718 For each input packet, the steps for the srTCM / trTCM algorithms are:
1719
1720 *   Update the C and E / P token buckets. This is done by reading the current time (from the CPU timestamp counter),
1721     identifying the amount of time since the last bucket update and computing the associated number of tokens
1722     (according to the pre-configured bucket rate).
1723     The number of tokens in the bucket is limited by the pre-configured bucket size;
1724
*   Identify the output color for the current packet based on the size of the IP packet
    and the number of tokens currently available in the C and E / P buckets; for color aware mode only,
    the input color of the packet is also considered.
    When the output color is not red, a number of tokens equal to the length of the IP packet are
    subtracted from the C or E / P or both buckets, depending on the algorithm and the output color of the packet.
1730
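Under stated assumptions, the sketch below shows how an application might run these two steps per packet
with the librte_meter API in color blind mode. The structure and function names (rte_meter_srtcm,
rte_meter_srtcm_config(), rte_meter_srtcm_color_blind_check()) and the use of the CPU timestamp counter as
the time source reflect the librte_meter header of this DPDK release and should be verified against the
installed version.

.. code-block:: c

    #include <rte_cycles.h>
    #include <rte_meter.h>

    /* Sketch only: srTCM metering of one traffic flow in color blind mode.
     * CIR is in IP packet bytes per second, CBS and EBS are in bytes; the
     * values below are examples. */
    static struct rte_meter_srtcm flow_meter;

    static int
    flow_meter_setup(void)
    {
        struct rte_meter_srtcm_params params = {
            .cir = 1000000,    /* committed rate: 1 MByte/s */
            .cbs = 2048,       /* committed burst size */
            .ebs = 2048,       /* excess burst size */
        };

        return rte_meter_srtcm_config(&flow_meter, &params);
    }

    static enum rte_meter_color
    flow_meter_packet(uint32_t ip_pkt_len)
    {
        /* The check function performs both steps: it updates the token
         * buckets from the current time and returns the output color for
         * this packet length. */
        return rte_meter_srtcm_color_blind_check(&flow_meter, rte_rdtsc(),
            ip_pkt_len);
    }

In color aware mode, the equivalent check function additionally takes the input color of the packet as an
argument (rte_meter_srtcm_color_aware_check() in the same header), which is assumed to follow the same
calling pattern.
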
1731 .. |flow_tru_droppper| image:: img/flow_tru_droppper.png
1732
1733 .. |drop_probability_graph| image:: img/drop_probability_graph.png
1734
1735 .. |drop_probability_eq3| image:: img/drop_probability_eq3.png
1736
1737 .. |eq2_expression| image:: img/eq2_expression.png
1738
1739 .. |drop_probability_eq4| image:: img/drop_probability_eq4.png
1740
1741 .. |pkt_drop_probability| image:: img/pkt_drop_probability.png
1742
1743 .. |pkt_proc_pipeline_qos| image:: img/pkt_proc_pipeline_qos.png
1744
1745 .. |ex_data_flow_tru_dropper| image:: img/ex_data_flow_tru_dropper.png
1746
1747 .. |ewma_filter_eq_1| image:: img/ewma_filter_eq_1.png
1748
1749 .. |ewma_filter_eq_2| image:: img/ewma_filter_eq_2.png
1750
1751 .. |data_struct_per_port| image:: img/data_struct_per_port.png
1752
1753 .. |prefetch_pipeline| image:: img/prefetch_pipeline.png
1754
1755 .. |pipe_prefetch_sm| image:: img/pipe_prefetch_sm.png
1756
1757 .. |blk_diag_dropper| image:: img/blk_diag_dropper.png
1758
1759 .. |m_definition| image:: img/m_definition.png
1760
1761 .. |eq2_factor| image:: img/eq2_factor.png
1762
1763 .. |sched_hier_per_port| image:: img/sched_hier_per_port.png
1764
1765 .. |hier_sched_blk| image:: img/hier_sched_blk.png