Fair Queuing (British spelling: queueing)


Suppose we have several competing flows at a router:

    flow1--\
            \
    flow2----[R]----output
            /
    flow3--/

We can ask how fairly R divides the output bandwidth among the flows. By "flow" we mean any recognized bundle of data; it could be a single TCP connection or it could be all traffic from a given host or subnet.

A typical router (random-drop, or drop-tail without serious phase effects) allocates output bandwidth in proportion to the input bandwidth. That is, if the three input flows above send 12, 4, and 2 packets per second, for a total of 18, but the output can only handle 9 packets per second, then the flows will successfully transmit 6, 2, and 1 packet per second respectively.

This can, of course, be seen as "fair": each flow gets bandwidth in proportion to its demand. However, it can also be seen as favoring a "greedy" approach: flows that cause the most congestion get the most bandwidth.

"Fair Queuing" is an attempt to give the flows above equal shares, at least within the limits of actual demand. We could easily cap each flow at 3 packets/second, but since flow3 isn't actually sending 3/sec, R is then actually processing 3+3+2 = 8 packets/sec, and there is idle capacity. It is important for a queuing strategy to be *work-conserving*; that is, for it to schedule no idle output time unless all inputs are idle. To this end, R would allow flow3 to send its 2 packets/sec, and divide the remaining 7 packets/sec of output flow equally between flow1 and flow2.

(Weighted fair queuing gives each flow a designated fraction of the total.)

The simplest algorithm for fair queuing is round-robin queue service, with all packets of equal size; this is sometimes called Nagle Fair Queuing. This means we keep a separate input queue for each flow, and offer service to the nonempty queues in round-robin (cyclic) fashion. Empty queues do not tie up resources. Over time, each nonempty queue gets to send an equal share of packets, and as we have assumed all packets have the same size, this means that each queue gets equal bandwidth. (Time-division multiplexing, on the other hand, is like round-robin, but idle connections do tie up resources.)
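A minimal sketch of this strategy (Python; static in that it drains a fixed batch of queues, whereas a real router would also fold newly arriving packets into the rotation):

    from collections import deque

    def round_robin(queues):
        # Cycle through the per-flow queues, sending one packet from each
        # nonempty queue per pass; empty queues are simply skipped.
        while any(queues):
            for q in queues:
                if q:
                    yield q.popleft()

    flows = [deque(['a1', 'a2', 'a3']), deque(['b1']), deque()]
    print(list(round_robin(flows)))    # ['a1', 'b1', 'a2', 'a3']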
        
Nagle fair queuing allows other flows to use more than their equal share, if some flows are underutilizing. This is a Good Thing. Shares are divided equally among the active flows. As soon as a flow becomes active (that is, its queue becomes nonempty) it gets to start sharing in the bandwidth allocation; it does not have to wait for other flows to work through their backlogs.

Round-robin works as fair queuing as long as all packets have the same size! If packets have different sizes, then flows all get their fair share of packets per second, but this may not relate to bytes per second, which is really what we want.

What happens when packets aren't all of the same size?

Simple models

There are two simplified models for this: fluid-flow networks and bit-by-bit round-robin.

For fluid-flow networks, we assume that with two (or more) packets to send at one time, we can just send them both, each at half the usual rate. (Strictly speaking, we should now also consider packets arriving at the fair-queuing router as part of a fractional stream, but we'll ignore that.) At any time when N input queues are active, each front packet is sent at rate 1/N. If a new queue becomes active due to packet arrival, the rate goes down to 1/(N+1). If we empty a queue, by transmitting its last packet, the rate for the others goes up to 1/(N-1).
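The fluid model is simple enough to simulate exactly with a small event loop. Below is a sketch (Python; the function and its interface are invented for illustration) in which the head packet of each of the N active queues drains at rate 1/N:

    def fluid_finish_times(packets):
        # packets: list of (arrival_time, queue_id, size_in_bits).
        # Returns {queue_id: [finish_times]} under the fluid model.
        arrivals = sorted(packets)
        queues = {}                        # queue_id -> list of remaining sizes
        finished = {}
        t = 0.0
        i = 0
        while i < len(arrivals) or any(queues.values()):
            active = [q for q, pkts in queues.items() if pkts]
            n = len(active)
            # a head packet with r bits left, draining at rate 1/n, needs r*n time
            dt_done = min((queues[q][0] * n for q in active), default=float('inf'))
            dt_arr = arrivals[i][0] - t if i < len(arrivals) else float('inf')
            if dt_arr < dt_done:                       # next event: an arrival
                t += dt_arr
                _, q, size = arrivals[i]
                i += 1
                queues.setdefault(q, []).append(float(size))
            else:                                      # next event: a completion
                t += dt_done
                for q in active:
                    queues[q][0] -= dt_done / n
                for q in active:
                    if queues[q][0] <= 1e-9:
                        queues[q].pop(0)
                        finished.setdefault(q, []).append(t)
        return finished

    # Two queues: 1000 bits at T=0, 500 bits at T=100 (the example used later);
    # the 500-bit packet finishes at T=1100 and the 1000-bit packet at T=1500.
    print(fluid_finish_times([(0, 'q1', 1000), (100, 'q2', 500)]))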

Bit-by-bit round-robin is an attempt to simulate the fluid-flow model at a very fine granularity. We handle individual bits from each flow on a round-robin basis. If there are three packets (on three separate queues), we send their bits intermingled: 1,2,3,1,2,3, etc.; it takes three bit-times to send one bit from each packet. Clearly this (like fluid flow) is not implementable; packets must be sent undivided.

So what we do instead is "simulate" bit-by-bit round-robin ("bbrr"), and calculate the finishing times under that simulation. More precisely, we compute a "virtual finishing time" at which the packet would finish, if bbrr were being used. We then send the packets, undivided and without preemption, in order of increasing finishing time. Ties are resolved by any plausible tie-breaker. Note two things:
  1.   Once we begin transmission of a packet, it is quite possible that a shorter packet arrives on another queue that would have a smaller (sooner) finishing time under bbrr. However, fair queuing is non-preemptive; we continue with the packet we've started. 
  2.   If a packet is sitting in the output queue with its finishing time computed, it is quite possible for another packet to come along with a smaller finishing time; that packet will then be sent first.

Extensions to real packets

Consider for the moment a "bit-ticking" algorithm where we provide, for each queue, a bit-position marker that advances by one bit for each bit we have available to send from that queue. If three queues are active, then in the time it takes us to send three bits, each marker advances by one bit. This is a more explicit simulation of bbrr. Packets would then be transmitted when all their bits have been ticked through, or, if the output line is busy, in the order in which their bits are ticked through. With N active queues, we tick through N input bits in the time it takes us to send N output bits, so the output line should not develop any significant backlog.

This isn't, unfortunately, quite what we want: bit-ticking is not work-conserving. If all queues are empty and a packet arrives, it still has to wait to be transmitted while the router ticks through its bits. We will call this approach "simulated bit-by-bit round robin", or sbbrr. Note that fair queuing doesn't really simulate bbrr; it instead uses the bbrr finishing times as "priorities".

We could try to fix bit-ticking by just sending a newly arriving packet immediately if all other queues are empty. But what if we finish sending a packet and several other queues are nonempty, but none has a packet whose bits have all been ticked through and is thus ready to send?

Here is a final, work-conserving version of bit-ticking, which we'll call wc-ticking:

For each input queue, maintain a bit position in the head packet of that queue. If N>0 queues are active, then for every N output bits we send or could have sent, advance each bit position by 1. If we advance to the end of the packet, put it into the output queue. If at any point the output queue is empty, take the input packet with the fewest bits left to pass over and move it immediately to the output queue.
        
Alas, we still have a problem, though wc-ticking is at least a reasonable approximation. Suppose a very large packet, p1, arrives on queue1, and just after it begins transmission, a tiny packet p2 arrives on queue1 at the same time as a medium-sized packet p3 on queue2. Under bit-ticking, once we've started transmitting p1 we have "forgotten" how big it is. We would send p2 next, and then p3, because p2 is smaller and we'd thus tick through its bits sooner. However, under bbrr p3 would be sent before p2, since the bits of p3 would be sent alternating with p1. In fact, p3 would end up completely transmitted first, followed by p1, followed by p2: as soon as p3 arrived, we would alternate bits from p3 and p1; when p3 was done we would finish p1, and then go on to p2.
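This bbrr ordering is easy to check with a literal (and deliberately inefficient) bit-level simulation. The sketch below is ours, with made-up sizes: p1 = 1000 bits, p2 = 50, p3 = 400:

    from collections import deque

    def bbrr_finish_order(packets):
        # packets: list of (arrival_bit_time, queue_id, size_in_bits).
        # One bit is sent per tick, cycling among the nonempty queues; returns
        # the packet completion order as (finish_time, queue_id, size) tuples.
        pending = sorted(packets)
        queues = {}                        # queue_id -> deque of [bits_left, size]
        rr = []                            # queue ids, in round-robin order
        order = []
        t = 0
        while len(order) < len(packets):
            while pending and pending[0][0] <= t:
                _, q, size = pending.pop(0)
                queues.setdefault(q, deque()).append([size, size])
                if q not in rr:
                    rr.append(q)
            for _ in range(len(rr)):
                q = rr.pop(0)
                rr.append(q)               # rotate to the next queue
                if queues[q]:
                    queues[q][0][0] -= 1   # send one bit of the head packet
                    if queues[q][0][0] == 0:
                        _, size = queues[q].popleft()
                        order.append((t + 1, q, size))
                    break                  # only one bit fits in one tick
            t += 1
        return order

    # p1 = 1000 bits at T=0; p2 = 50 bits (queue1) and p3 = 400 bits (queue2) at T=1
    print(bbrr_finish_order([(0, 'queue1', 1000), (1, 'queue1', 50), (1, 'queue2', 400)]))
    # p3 finishes first, then p1, then p2, as claimed above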

In other words, we really should compute those finishing times.

Suppose we have just one input queue, and assume the output line transmits one bit per tick. If the ith packet arrives at time A[i], of size P[i], then consider the time when it finishes transmitting, F[i]. If the queue was empty on arrival, F[i] = A[i] + P[i]; if the queue was nonempty then we have to wait for the previous packet to finish before we can start and so F[i] = F[i-1] + P[i]. Combining,
         F[i] = Max(F[i-1], A[i]) + P[i]
This allows us to compute the finishing time. Packets, of course, would be sent in order of increasing F[i].
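In code, the recurrence is just a few lines (a Python sketch, with times in bit-ticks):

    def finish_times(arrivals, sizes):
        # F[i] = max(F[i-1], A[i]) + P[i]: wait for the previous packet
        # if the queue is busy, otherwise start at the arrival time.
        F, prev = [], 0
        for a, p in zip(arrivals, sizes):
            prev = max(prev, a) + p
            F.append(prev)
        return F

    print(finish_times([0, 100, 2000], [1000, 500, 300]))   # [1000, 1500, 2300]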

There is just one catch to extending this calculation to multiple queues: it can't be done.

At least, not if we use "real time", or "wallclock time". We know when a packet arrives, of course, that is, A[i]. And we know the size, P[i]. The calculation of F[i] is done independently for each queue; i refers to the ith packet of that queue. If we are computing F[i] for a given queue, packets arriving in other queues do not affect the value of any F[i].
 
But even in the simple case when all queues are empty when a packet arrives, and so we begin transmitting it immediately, we can't calculate the real bbrr finishing time. The problem is that another packet may arrive on another queue, at which point we would start alternating the transmitted bits between the two packets, and this would draw out the finishing time of the first packet. For example, suppose a 1000-bit packet, p1, arrives at T=0, and a 500-bit packet p2 arrives at T=100, with time measured in bit-times. At the time of its arrival, we guess that p1 will finish at T=1000, i.e., F[1]=1000. At T=100, we have sent 100 bits, but now we start alternating with p2's bits (as we're simulating bit-by-bit round-robin). For the next 1000 ticks, we send 500 bits each from p1 and p2; time is now 1100 and we have 400 bits left of p1. We finish p1 at T=1500, assuming we get no more interruptions. The bottom line is that, for any one queue, we just don't know the real rate of bbrr transmission.

In light of this, we might attempt either of the following approximations to the calculation of the finishing time F[i]:
  1. Multiply the size of an arriving packet by the total number of input queues.
  2. Multiply the size of an arriving packet by the number of nonempty input queues, at the time of arrival.
Unfortunately, neither of these strategies leads to a reasonable approximation of fairness. In the next two examples, we assume that all packets are the same size, and each takes 1 unit to transmit.

Example 1: scaling P[i] by the total number of queues. Assume there are 3 queues in all. At T=0, queue1 receives a bunch of packets. They are assigned finishing times of 3,6,9,12,.... At T=2, the first two of these packets have been sent (as the other queues are empty), and now two packets arrive on queue2. They are assigned finishing times of 2+3=5, 2+6=8. They therefore leapfrog the remaining packets in queue1, which have finishing times 9,12,....

Example 2: Scaling P[i] by the number of nonempty queues: At T=0, five packets arrive on queue1. Assign them respective finishing times of 1,2,3,4,5. Now, at T=1, with four packets left in queue1, four packets arrive at queue2. They get assigned finishing times of 3,5,7,9, adding 2 each time because there are now 2 nonempty queues. Queue1 gets twice as much service!
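Both failures are easy to reproduce numerically. A tiny sketch (Python; scaled_tags is our own throwaway helper, with unit-size packets as in the examples):

    def scaled_tags(arrivals, scale):
        # Per-queue finishing tags with unit packets scaled up by `scale`:
        # F[i] = max(F[i-1], A[i]) + scale
        F, prev = [], 0
        for a in arrivals:
            prev = max(prev, a) + scale
            F.append(prev)
        return F

    # Example 1: scale = 3 (total number of queues)
    print(scaled_tags([0] * 4, 3))     # [3, 6, 9, 12]
    print(scaled_tags([2] * 2, 3))     # [5, 8]: leapfrogs the remaining 9 and 12

    # Example 2: scale = number of nonempty queues at arrival
    print(scaled_tags([0] * 5, 1))     # [1, 2, 3, 4, 5]
    print(scaled_tags([1] * 4, 2))     # [3, 5, 7, 9]: queue1 gets twice the service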

The trick is, instead of scaling P[i] up, we scale A[i] down. We create a virtual clock, called the FQ-clock; at any moment when N>0 queues are nonempty, the FQ-clock runs at 1/N of the normal wallclock speed. (In calculus terms, dFQ/dt = 1/N.) As the number of nonempty queues rises, the FQ-clock runs slower and slower. Using the FQ-clock to measure the arrival time A[i], we do indeed get an accurate estimate of the finishing time F[i], as also measured on the FQ-clock. The FQ-clock always runs at a rate such that any one queue gets to send one bit for each tick of the FQ-clock. For example, if there are three nonempty queues, then the FQ-clock ticks once for each three ticks of real time. Meanwhile, under bbrr, each of the three queues sends one bit in three bit-times, or, if we equate ticks with bit-times, one bit per FQ-tick.

It is straightforward to show that, for the fluid model, this approach also calculates the exact finishing time for each packet.

After a packet is received, the number of active queues may fluctuate up and down. However, the FQ-clock runs slower or faster in exact lockstep, so that after we start sending a packet of size P, it takes exactly P FQ-bit-ticks before the packet is finished. Our formula,
         F[i] = Max(F[i-1], A[i]) + P[i]
still works, with A[i] and F[i] now measured on the FQ-clock:

    F[i] = virtual finishing time of ith packet
    A[i] = arrival time
    S[i] = start time = max(F[i-1], A[i])
    P[i] = size of packet
    
For each arriving packet, we compute F[i] as a timestamp, and send packets in order of increasing F[i]. What F[i] thus represents, *exactly* (ignoring one-bit rounding), is the FQ-clock time at which the ith packet would finish, if we were using true bbrr.

Here is the scenario of the examples above, redone using the FQ-clock.

Example: At T=0, queue1 receives a bunch of packets and FQ-clock = 0. The packets are assigned finishing times of 1,2,3,4,.... At T=2, the first two of these have been sent, and now two packets arrive on queue2. Since only one queue has been active, the FQ-clock has advanced by 2.0 in the intervening time, and so the packets now on queue2 are assigned finishing times of 3 and 4. These compete fairly with those on queue1, and the packets from the two queues are interleaved, one for one.

Implementation

How hard is this to implement? Let FQsofar be the value of the FQ-clock at time Tsofar, and let N be the number of active queues. If N hasn't changed, then
    FQ = FQsofar + (T-Tsofar)/N   (floating-point division)
We need to update both of these every time N changes; that is, whenever we finish sending a packet and empty a queue, and whenever a packet arrives on an empty queue.
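Putting the pieces together, here is a compact sketch of the whole scheduler (Python; an offline version that takes all arrivals up front, with invented names throughout; a real router would do the same bookkeeping incrementally). It maintains the FQ-clock exactly as above, re-anchoring (FQsofar, Tsofar) whenever N changes, stamps each arriving packet with its virtual finishing time, and transmits in order of increasing stamp:

    import heapq

    def fair_queue(packets):
        # packets: list of (arrival_time, queue_id, size), times in bit-ticks.
        # Returns the transmission schedule as (finish_time, queue_id, size).
        arrivals = sorted(packets)
        heap, order, last_F, unsent = [], [], {}, {}
        fq_ref = t_ref = 0.0           # FQ-clock reads fq_ref at wallclock t_ref
        n_active = 0                   # N = number of queues with unsent packets
        i = 0

        def fq_at(t):                  # FQ = FQsofar + (T - Tsofar)/N
            return fq_ref + (t - t_ref) / n_active if n_active else fq_ref

        def accept_up_to(t):           # process arrivals, in time order, through t
            nonlocal i, fq_ref, t_ref, n_active
            while i < len(arrivals) and arrivals[i][0] <= t:
                a, q, size = arrivals[i]
                i += 1
                if unsent.get(q, 0) == 0:          # empty queue becomes active:
                    fq_ref, t_ref = fq_at(a), a    # re-anchor the clock, bump N
                    n_active += 1
                unsent[q] = unsent.get(q, 0) + 1
                F = max(last_F.get(q, 0.0), fq_at(a)) + size
                last_F[q] = F
                heapq.heappush(heap, (F, q, size))

        t = 0.0
        while i < len(arrivals) or heap:
            if not heap:
                t = max(t, arrivals[i][0])         # output line goes idle
            accept_up_to(t)
            F, q, size = heapq.heappop(heap)       # smallest virtual finish wins
            t += size                              # non-preemptive transmission
            accept_up_to(t)                        # arrivals during transmission
            order.append((t, q, size))
            unsent[q] -= 1
            if unsent[q] == 0:                     # queue drained: re-anchor, drop N
                fq_ref, t_ref = fq_at(t), t
                n_active -= 1
        return order

    # The FQ-clock example above: packets on queue1 at T=0, two on queue2 at T=2
    pkts = [(0, 'queue1', 1)] * 4 + [(2, 'queue2', 1)] * 2
    print(fair_queue(pkts))    # two from queue1, then the two queues alternate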