Fair Queuing

(British spelling: queueing)
 
Suppose we have several competing flows at a router:
 
    flow1--\
            \
    flow2----[R]----output
            /
    flow3--/
 
We can ask how fairly R divides the output bandwidth among the flows. By "flow" we mean any recognized bundle of data; it could be a single TCP connection or it could be all traffic from a given host or subnet.
 
A typical router (random-drop, or drop-tail without serious phase effects) allocates output bandwidth in proportion to the input bandwidth. That is, if the three input flows above send 12, 4, and 2 packets per second respectively, for a total of 18, but the output can only handle 9 packets per second, then the flows will successfully transmit 6, 2, and 1 packet per second respectively.
 
This can, of course, be seen as "fair": each flow gets bandwidth in proportion to its demand. However, it can also be seen as favoring a "greedy" approach: flows that cause the most congestion get the most bandwidth.
 
"Fair Queuing" is an attempt to give the flows above equal shares, at least within the limits of actual demand. We could easily cap each flow at 3 packets/second, but since flow3 isn't actually sending 3/sec, R is then actually processing 3+3+2 = 8 packets/sec, and there is idle capacity. It is important for a queuing strategy to be work-conserving; that is, for it to schedule no idle output time unless all inputs are idle. To this end, R would allow flow3 to send its 2 packets/sec, and divide the remaining 7 packets/sec of output flow equally between flow1 and flow2.  (Weighted fair queuing gives each flow a designated fraction of the total, but the principle is the same.)
 
The simplest algorithm for achieving this is round-robin queue service, with all packets of equal size; this is sometimes called Nagle Fair Queuing. This means we keep a separate input queue for each flow, and offer service to the nonempty queues in round-robin (cyclic) fashion. Empty queues do not tie up resources. Over time, each nonempty queue gets to send an equal share of packets, and as we have assumed all packets have the same size, this means that each queue gets equal bandwidth. (Time-division multiplexing, on the other hand, is like round-robin but idle connections do tie up resources.)
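
A minimal sketch of such round-robin service in Python (the queue layout and the send() callback are hypothetical):

    from collections import deque

    def round_robin(queues, send):
        # queues: dict mapping flow id -> deque of equal-sized packets.
        # Cycle over the queues, serving one packet from each nonempty
        # queue per pass; empty queues are skipped and tie up nothing.
        while any(queues.values()):
            for q in queues.values():
                if q:
                    send(q.popleft())

    # round_robin({'flow1': deque(['p1', 'p2']),
    #              'flow2': deque(['p3']),
    #              'flow3': deque()}, send=print)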
         
Nagle fair queuing allows other flows to use more than their equal share, if some flows are underutilizing. This is a Good Thing. Shares are divided equally among the active flows. As soon as a flow becomes active (that is, its queue becomes nonempty) it gets to start sharing in the bandwidth allocation; it does not have to wait for other flows to work through their backlogs.
 
Round-robin is a successful implementation of fair queuing as long as all packets have the same size! If packets have different sizes, then flows all get their fair share of packets per second, but this may not relate to bytes per second, which is really what we want.
 

What happens when packets aren't all of the same size?

Sometimes it doesn't matter that much, as long as we can, say, relegate oversized packets to a lower-priority queue altogether. For example, we might want fair queuing and higher priority to apply to all VOIP packets (identified by port or destination). We can then relegate all oversized VOIP packets to the low-priority stream (pretty good incentive for senders to be compliant with our requirement for packet size), and not worry about undersized VOIP packets.
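
A sketch of such a size-based classifier in Python; the port set, size limit, and packet attributes here are all made-up placeholders:

    MAX_VOIP_SIZE = 300        # hypothetical size limit, in bytes
    VOIP_PORTS = {5060, 5004}  # hypothetical identifying ports

    def classify(packet):
        # Compliant VOIP packets get the fair-queued, higher-priority
        # path; oversized ones are relegated to the low-priority stream.
        if packet.port in VOIP_PORTS and packet.size <= MAX_VOIP_SIZE:
            return 'voip_fair_queue'
        return 'low_priority'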

However, often we would like fair queuing to work in settings with fundamentally variable packet sizes. A first stab is "bit-by-bit" round-robin (bbrr). Conceptually, we handle individual bits from each flow on a round-robin basis. If three queues are active, we send their head packets' bits intermingled 1,2,3,1,2,3, etc; it takes three bit-times to send one bit from each packet. While bbrr would achieve the goals of fair queuing, it is unfortunately not implementable; packets must be sent undivided.
 
So what we will do instead is "simulate" bbrr. More precisely, for each arriving packet we compute a "virtual finishing time": the time at which the packet would finish, if true bbrr were being used. We then send the packets, undivided and without preemption, in order of increasing virtual finishing time, resolving ties by any plausible tie-breaker. We'll call this "virtual-finish" bbrr, or vfbbrr, and fill in some important details below. Note two things now, however:
  1. Once we begin transmission of a packet, it is quite possible that a shorter packet arrives on another queue that would have a smaller (sooner) finishing time under bbrr. However, fair queuing is non-preemptive; we continue with the packet we have started.
  2. If a packet is sitting in the output queue with its finishing time computed, it is quite possible for another packet to come along and have a smaller finishing time, and so the other packet will thus be sent first (as in a priority queue).
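
In outline, vfbbrr might look like the following Python sketch; the key piece, finishing_time(), is exactly the detail to be filled in below, so it is left as a stub here.

    import heapq

    outq = []      # packets awaiting transmission, ordered by finishing time
    seqno = 0      # tie-breaker: equal finishing times go in arrival order

    def finishing_time(pkt):
        # placeholder: the virtual bbrr finishing time, worked out below
        raise NotImplementedError

    def on_arrival(pkt):
        # compute the timestamp once, on arrival (note 2: a later arrival
        # with a smaller finishing time will be sent first)
        global seqno
        seqno += 1
        heapq.heappush(outq, (finishing_time(pkt), seqno, pkt))

    def on_line_free(send):
        # Non-preemptive (note 1): once send() begins a packet, we finish
        # it, even if a sooner-finishing packet arrives in the meantime.
        if outq:
            _, _, pkt = heapq.heappop(outq)
            send(pkt)
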
We can do an even finer-grained model than bbrr! The "fluid" model states that, when there are N active queues, each queue is emptied continuously at rate 1/N × output_rate. This allows the application of continuous mathematics (eg differential equations), but I don't have any examples of how this helps.

As an alternative to vfbbrr, what about an algorithm where we provide, for each queue, a bit-position marker that advances by one bit for each bit we have available to send from that queue? If three queues are active, then in the time it takes us to send three bits, each marker would advance by one bit. This can be seen as a more explicit simulation of bbrr. Packets would then be transmitted (or at least placed in the output queue) when all their bits have been ticked through. With N active queues, we tick through N input bits in the time it takes us to send N output bits, so the output line should not develop any significant backlog.
 
This isn't, unfortunately, quite what we want: it is not work-conserving. If all queues are empty and a packet arrives, it still has to wait to be transmitted while the router ticks through its bits. However, we will call this "simulated bit-by-bit round robin", or sbbrr.
 
Note, by contrast, that vfbbrr doesn't really simulate bbrr; it instead uses the bbrr finishing times as "priorities".
 
We could try to fix sbbrr by just sending a newly arriving packet immediately if all other queues are empty. But what if, when we finish sending that packet, we now have several other nonempty queues, but none has a packet whose bits have all been ticked through and is thus ready to send?  
 
Here is another attempt, a work-conserving version of sbbrr: For each input queue, maintain a bit position in the head packet of that queue. If N queues are active, then for every N output bits we could have sent, advance each bit position by 1. If we advance to the end of the packet, put it into the output queue. If at any point the output queue is empty, take the input packet with the fewest bits left to pass over and move it immediately to the output queue.
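
Here is a minimal Python sketch of this variant; representing packets as plain bit-counts and queues as dicts is my own framing:

    from collections import deque

    # each queue: {'pkts': deque of packet sizes in bits,
    #              'bits_done': bits ticked through in the head packet}

    def tick_round(queues, outq):
        # One round = N output bit-times, N = number of active queues;
        # in that time, advance each active queue's bit marker by one bit.
        for q in queues:
            if not q['pkts']:
                continue
            q['bits_done'] += 1
            if q['bits_done'] >= q['pkts'][0]:    # all bits ticked through
                outq.append(q['pkts'].popleft())
                q['bits_done'] = 0
        if not outq:
            # Work-conserving rule: if nothing is ready, promote the head
            # packet with the fewest bits left to tick through.
            candidates = [q for q in queues if q['pkts']]
            if candidates:
                q = min(candidates,
                        key=lambda q: q['pkts'][0] - q['bits_done'])
                outq.append(q['pkts'].popleft())
                q['bits_done'] = 0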
         
Alas, we still have a problem. Suppose a very large packet, Pbig, arrives on queue1, and just after it begins transmission, a tiny packet P1 arrives on queue1 at the same time as a medium-sized packet P2 on queue2. Under sbbrr, once we've started transmitting Pbig we have "forgotten" how big it is. Under the variation above, we would send P1 next, and then P2, because P1 is smaller and we'd thus tick through its bits sooner. However, under true bbrr, as soon as P2 arrived we would alternate bits from P2 and Pbig (P1 must wait its turn behind Pbig on queue1); when P2 was done we would finish Pbig and then go on to P1. That is, P2 would end up completely transmitted first, followed by the rest of Pbig, followed by P1.
 
In other words, we really need to compute those finishing times.
 
Towards this end, suppose we have just one input queue, and assume the output queue handles one bit per second. If the ith packet arrives at time Ai, of size Pi, then consider the time when it finishes transmitting, Fi. If the queue was empty on arrival, Fi = Ai + Pi; if the queue was nonempty then we have to wait for the previous packet to finish before we can start and so Fi = Fi-1 + Pi. Combining,
         Fi = Max(Fi-1, Ai) + Pi
This gives us the effective finishing time Fi that we want.
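
Transcribed directly into Python (a sketch, with sizes in bits and times in bit-times):

    def finishing_times(arrivals):
        # arrivals: list of (Ai, Pi) pairs for one queue, in arrival
        # order, with the output line sending one bit per bit-time.
        F = 0
        result = []
        for A, P in arrivals:
            F = max(F, A) + P        # Fi = Max(Fi-1, Ai) + Pi
            result.append(F)
        return result

    # Two packets on one queue: 1000 bits at T=0, 500 bits at T=100.
    # finishing_times([(0, 1000), (100, 500)])  ->  [1000, 1500]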

There is just one catch to extending this calculation to multiple queues:  it can't be done.
 
At least, not if we use "real time", or "wallclock time". We know when a packet arrives, of course, that is, Ai. And we know the size, Pi. The calculation of Fi is done independently for each queue; i refers to the ith packet of that queue. If we are computing Fi for a given queue, packets arriving in other queues do not affect the value of any Fi.
 
But even in the simple case when all queues are empty when a packet arrives, and so we begin transmitting it immediately (because, no matter what its finishing time Fi, it is the only and thus the smallest finishing time we have at that point), we cannot calculate the true bbrr finishing time. The problem is that another packet may arrive on another queue while the first packet is in progress, at which point true bbrr would start alternating the transmitted bits between the two packets, and this would draw out the finishing time of the first packet. For example, suppose a 1000-bit packet, P1, arrives at T=0, and a 500-bit packet P2 arrives at T=100, with time measured in bit-times.  At the time of P1's arrival, we guess that it will finish at T=1000, ie F1=1000. At T=100, we have sent 100 bits, but now true bbrr would start alternating with P2's bits.  For the next 1000 ticks, we would send 500 bits each from P1 and P2;  time is now 1100 and we have 400 bits left of P1. We actually finish P1 at T=1500, assuming we get no more interruptions.
 
In light of this, we might attempt either of the following approximations to the calculation of the finishing time Fi:
  1. Multiply the size of an arriving packet by the total number of input queues.
  2. Multiply the size of an arriving packet by the number of nonempty input queues, at the time of arrival.
Unfortunately, neither of these strategies leads to a reasonable approximation of fairness. In the next two examples, we assume that all packets are the same size, and each takes 1 unit to transmit.

Example 1 (failure): attempting to scale Pi by the total number of queues
We have three queues. At T=0, queue1 receives a bunch of packets. If only one queue were present they would finish at times 1,2,3,...; because we are scaling each Pi by the number of queues, 3, they are actually assigned finishing times of 3,6,9,12,.... At T=2, the first two of these packets have been sent (as the other queues are empty), and now two packets arrive on queue2. These latter two are assigned finishing times of 2+3=5 and 5+3=8. They therefore leapfrog the remaining packets in queue1, which have finishing times 9,12,....
 
 
Example 2 (failure): attempting to scale Pi by the number of nonempty queues
Again we have three queues. At T=0, five packets arrive on queue1. As queue1 is the only active queue, we assign them respective finishing times of 1,2,3,4,5. Now, at T=1, with four packets left in queue1, four packets arrive on queue2. These get assigned finishing times of 3,5,7,9, adding 2 each time because there are now two nonempty queues. The result is that queue1 gets twice as much service!
 
Calculating finishing times seems difficult. The trick is, instead of scaling Pi up, to scale Ai down. We create a virtual clock, called the FQ-clock; at any moment when N>0 queues are nonempty, the FQ-clock runs at 1/N of the normal wallclock speed. As the number of nonempty queues rises, the FQ-clock runs slower and slower. Using the FQ-clock to measure the arrival time Ai, we do indeed get an accurate estimate of the finishing time Fi, as also measured on the FQ-clock. The point is that the FQ-clock always runs at a rate such that any one queue gets to send one bit for each tick of the FQ-clock, no matter how many queues are active. For example, if there are three nonempty queues, the FQ-clock ticks once for every three ticks of real time; meanwhile, under true bbrr, each of the three queues sends one bit in those three bit-times, or, if we equate ticks with bit-times, one bit per FQ-tick.
 
After a packet is received, the number of active queues may fluctuate up and down. However, the FQ-clock runs slower or faster in exact lockstep, so that after we start sending a packet of size P, it takes exactly P many FQ-bit-ticks before the packet is finished. Our formula,
         Fi = Max(Fi-1, Ai) + Pi
still works.
 
    Fi = virtual finishing time of the ith packet
    Ai = arrival time of the ith packet, as measured on the FQ-clock
    Si = virtual start time = max(Fi-1, Ai)
    Pi = size of the ith packet, in bits
    
For each arriving packet, we compute Fi as a timestamp when it arrives, and then send packets in order of increasing Fi. What Fi thus represents, exactly (ignoring one-bit rounding), is the FQ-clock time at which the ith packet would finish, if we were using true bbrr.
 
Here is the scenario of the above two examples using the FQ-clock.
 
Example 3 (success): measuring time with the FQ clock
At T=0, queue1 receives a bunch of packets and FQ-clock = 0. The packets are assigned finishing times of 1,2,3,4,....  At real time T=2.0, the first two of these have been sent, and now two packets arrive on queue2. Since only one queue has been active, the FQ-clock has advanced by 2.0 in the intervening time, and so the two packets now on queue2 are assigned finishing times of 3 and 4. These compete fairly with those on queue1, and the packets from the two queues are interleaved, one for one. At real time T=4.0, the FQ-clock has advanced to 3.0, and one packet on queue1 and one packet on queue2, each with a virtual finishing time of 3.0, have been sent.
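
Putting the pieces together, here is a minimal Python sketch of the whole scheme: a lazily updated FQ-clock plus per-flow virtual finishing times. It ignores some subtleties, such as whether a flow still counts as active while its last packet is being transmitted.

    import heapq

    class FairQueue:
        def __init__(self):
            self.vclock = 0.0     # the FQ-clock
            self.last_real = 0.0  # real time of the last clock update
            self.active = 0       # number of nonempty queues
            self.flow_F = {}      # per-flow last virtual finishing time
            self.backlog = {}     # per-flow count of queued packets
            self.heap = []        # (Fi, seqno, flow, size), smallest Fi first
            self.seq = 0

        def _advance(self, now):
            # FQ-clock runs at 1/N wallclock speed when N queues are
            # active; its rate changes only at enqueue/dequeue events,
            # so updating it lazily at those events suffices.
            if self.active > 0:
                self.vclock += (now - self.last_real) / self.active
            self.last_real = now

        def enqueue(self, flow, size, now):
            self._advance(now)
            self.backlog[flow] = self.backlog.get(flow, 0) + 1
            if self.backlog[flow] == 1:
                self.active += 1
            # Fi = Max(Fi-1, Ai) + Pi, all measured on the FQ-clock
            F = max(self.flow_F.get(flow, 0.0), self.vclock) + size
            self.flow_F[flow] = F
            self.seq += 1
            heapq.heappush(self.heap, (F, self.seq, flow, size))

        def dequeue(self, now):
            # Called when the output line is free: send the packet with
            # the smallest virtual finishing time (non-preemptively).
            self._advance(now)
            if not self.heap:
                return None
            F, _, flow, size = heapq.heappop(self.heap)
            self.backlog[flow] -= 1
            if self.backlog[flow] == 0:
                self.active -= 1
            return flow, size

    # Replaying Example 3 with unit-sized packets:
    fq = FairQueue()
    for _ in range(4):
        fq.enqueue('queue1', 1, now=0.0)   # Fi = 1, 2, 3, 4
    fq.dequeue(now=0.0)                    # line takes the Fi=1 packet
    fq.dequeue(now=1.0)                    # then the Fi=2 packet
    fq.enqueue('queue2', 1, now=2.0)       # FQ-clock = 2.0, so Fi = 3
    fq.enqueue('queue2', 1, now=2.0)       # Fi = 4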

Clock arithmetic

It has often been noted that needing floating-point arithmetic to maintain your clock is a burden. One simplifying approximation, which works well when the clock resolution is high compared to the time needed to send a packet, is to advance the FQ clock by 1 tick whenever N queues are active and N realtime ticks have elapsed. The discrepancy here is that the number of active queues might change from N to N-1 during the N realtime ticks, but this discrepancy is small for a high-resolution clock. Note that if a queue becomes active, so that the number of queues rises to N+1, we can simply agree to record the new packet's arrival as of when the N-tick block finishes, slightly after its true arrival time, and accommodate the discrepancy that way.
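
A sketch of this integer-arithmetic approximation (the tick-block bookkeeping is my own framing):

    fq_clock = 0      # integer FQ-clock
    block_left = 0    # realtime ticks remaining in the current block

    def on_realtime_tick(n_active):
        # At the start of each block, freeze N = n_active; one FQ tick
        # per block of N realtime ticks. A queue turning active mid-block
        # is simply recorded as arriving when the block ends.
        global fq_clock, block_left
        if n_active == 0 and block_left == 0:
            return                      # idle: the FQ-clock stands still
        if block_left == 0:
            block_left = n_active       # N is fixed for this block
        block_left -= 1
        if block_left == 0:
            fq_clock += 1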

Also, note that the FQ-clock rate only changes when packets first arrive on an empty queue, or packets are dequeued leaving an empty queue. It suffices to update the FQ-clock only at these times.

Linux SFQ

(SFQ = Stochastic Fair Queuing)
Flows here are individual TCP connections, not hosts or subnets. Flows are hashed by <srcaddr, destaddr, port>. Flows that hash to the same bucket are considered the same queue, so bucket collisions result in unfairness; to keep any such unfairness from persisting, the hash function is altered at regular intervals. Also, note that one can get twice the bandwidth simply by creating a second connection!
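
A sketch of the bucket-hashing idea (an illustration only, not the actual Linux code); re-choosing the perturbation value is what "alters the hash function" at intervals:

    import random

    N_BUCKETS = 1024
    perturbation = random.getrandbits(32)

    def sfq_bucket(srcaddr, destaddr, port):
        # Flows that collide here share a queue, and thus a bandwidth
        # share; changing 'perturbation' reshuffles the collisions.
        return hash((srcaddr, destaddr, port, perturbation)) % N_BUCKETS

    def reperturb():
        # call at regular intervals
        global perturbation
        perturbation = random.getrandbits(32)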

What we really want is a way to define flow_id's so they can be created dynamically, with connections assigned to a flow_id by connection, host, subnet, etc.