Fair Queuing (British spelling: queueing)
Suppose we have several competing flows at a router:
flow1--\
        \
flow2----[R]----output
        /
flow3--/
We can ask how fairly R divides the output bandwidth among the flows.
By "flow" we mean any recognized bundle of data; it could be a
single TCP connection or it could be all traffic from a given host or
subnet.
A typical router (random-drop, or drop-tail without serious phase
effects) allocates output bandwidth in proportion to the input
bandwidth. That is, if the three input flows above send 12, 4, and 2
packets per second, for a total of 18, but the output can only handle 9
packets per second, then the flows will successfully transmit 6, 2, and
1 packet per second respectively.
This can, of course, be seen as "fair": each flow gets bandwidth in
proportion to its demand. However, it can also be seen as favoring a
"greedy" approach: flows that cause the most congestion get the most
bandwidth.
"Fair Queuing" is an attempt to give the flows above equal
shares, at least within the limits of actual demand. We could easily
cap each flow at 3 packets/second, but since flow3 isn't
actually sending 3/sec, R would then be processing only 3+3+2 = 8
packets/sec, leaving idle capacity. It is important for a queuing
strategy to be *work-conserving*; that is, for it to schedule no idle
output time unless all inputs are idle. To this end, R would allow
flow3 to send its 2 packets/sec, and divide the remaining 7 packets/sec
of output flow equally between flow1 and flow2.
(Weighted fair queuing gives each flow a designated fraction of the total.)
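Here is a minimal Python sketch of this kind of demand-limited equal division; the function name fair_shares and the iterative refinement are our own, not part of any standard:

    def fair_shares(demands, capacity):
        # Give each flow at most what it asks for; split any leftover
        # capacity equally among the flows that still want more.
        shares = [0.0] * len(demands)
        active = list(range(len(demands)))
        remaining = float(capacity)
        while active and remaining > 0:
            equal = remaining / len(active)     # tentative equal share
            satisfied = [i for i in active if demands[i] - shares[i] <= equal]
            if not satisfied:                   # everyone wants more: split equally
                for i in active:
                    shares[i] += equal
                break
            for i in satisfied:                 # cap these flows at their demand
                remaining -= demands[i] - shares[i]
                shares[i] = demands[i]
                active.remove(i)
        return shares

    print(fair_shares([12, 4, 2], 9))   # [3.5, 3.5, 2.0], as in the text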
The simplest algorithm for fair queuing is round-robin queue service,
with all packets of equal size; this is sometimes called Nagle Fair
Queuing. This means we keep a separate input queue for each flow, and
offer service to the nonempty
queues in round-robin (cyclic) fashion. Empty queues do not tie up
resources. Over time, each nonempty queue gets to send an equal share
of packets, and as we have assumed all packets have the same size, this
means that each queue gets equal bandwidth. (Time-division
multiplexing, on the other hand, is like round-robin but idle
connections do tie up resources.)
Nagle fair queuing allows other flows to use more than their equal
share, if some flows are underutilizing. This is a Good Thing. Shares
are divided equally among the active
flows. As soon as a flow becomes active (that is, its queue becomes
nonempty) it gets to start sharing in the bandwidth allocation; it does
not have to wait for other flows to work through their backlogs.
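As a concrete sketch of Nagle fair queuing, assuming equal-size packets (all class and method names here are ours):

    from collections import deque

    class NagleFairQueue:
        def __init__(self):
            self.queues = {}         # flow id -> deque of packets
            self.order = deque()     # cycle of currently active (nonempty) flows

        def enqueue(self, flow, packet):
            q = self.queues.setdefault(flow, deque())
            if not q:                # queue was empty: flow joins the cycle at once
                self.order.append(flow)
            q.append(packet)

        def dequeue(self):
            # Next packet to send, or None if every queue is idle.
            if not self.order:
                return None
            flow = self.order.popleft()
            packet = self.queues[flow].popleft()
            if self.queues[flow]:    # still backlogged: rejoin the cycle
                self.order.append(flow)
            return packet

Idle flows appear nowhere in the cycle, so they cost nothing; a flow that becomes active joins the cycle immediately.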
Round-robin works as fair queuing as long as all packets have the same
size! If packets have different sizes, then flows all get their fair
share of packets per second, but this may not relate to bytes per second, which is really what we want.
What happens when packets aren't all of the same size?
Simple models
There are two simplified models for this: fluid-flow networks and bit-by-bit round-robin.
For fluid-flow networks, we assume that with two (or more) packets to
send at one time, we can just send them both, at half the usual rate.
(Strictly speaking, we should now also consider packets arriving
at the fair-queuing router as part of a fractional stream, but we'll
ignore that.) At any time, if there are N input queues active, each front
packet is sent at rate 1/N. If a new queue becomes active due to packet
arrival, the rate goes down to 1/(N+1). If we empty a queue, by
transmitting its last packet, the rate for the others goes up to
1/(N-1).
Bit-by-bit round-robin is an attempt to simulate fluid-flow at a very
fine granularity. We handle individual bits from each flow on a
round-robin basis. If there are three packets, we send their bits
intermingled 1,2,3,1,2,3, etc.; it takes three bit-times to send one bit
from each packet. Clearly this (like fluid flow) is not implementable;
packets must be sent undivided.
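While bbrr cannot run on a real link, it is easy to simulate offline. Here is a toy simulation, assuming one packet per queue and all packets present at T=0 (names ours):

    def bbrr_finish_times(sizes):
        # One packet per queue, all present at T=0; sizes in bits.
        # Returns {queue: finishing time in bit-times} under bbrr.
        remaining = dict(enumerate(sizes))
        finish = {}
        t = 0
        while remaining:
            for q in list(remaining):    # one bit from each active queue
                t += 1
                remaining[q] -= 1
                if remaining[q] == 0:
                    finish[q] = t
                    del remaining[q]
        return finish

    print(bbrr_finish_times([1000, 500]))   # {1: 1000, 0: 1500}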
So what we do instead is "simulate" bit-by-bit round-robin ("bbrr"),
and calculate the finishing times under that simulation. More
precisely, we compute a "virtual finishing time" at which the
packet would finish, if
bbrr were being used. We then send the packets, undivided and without
preemption, in order of increasing finishing time. Ties are resolved by
any plausible tie-breaker. Note two things:
- Once we begin transmission of a packet, it is quite
possible that a shorter packet arrives on another queue that would have
a smaller (sooner) finishing time under bbrr. However, fair queuing
is non-preemptive; we continue with the packet we've started.
- If a packet is sitting in the output queue with its
finishing time computed, it is quite possible for another packet
to come along with a smaller finishing time; that packet will then
be sent first.
Extensions to real packets
Consider for the moment a "bit-ticking" algorithm where we provide, for
each queue, a bit-position marker that advances by one bit for each bit
we have available to send from that queue. If three queues are active,
then in the time it takes us to send three bits, each marker advances
by one bit. This is a
more explicit simulation of bbrr. Packets would then be transmitted
when all their bits have been ticked through, or, if the output line is
busy, they would be transmitted in the order in which their bits are
ticked through. With N active queues, we tick through N input bits in
the time it takes us to send N output bits, so the output line should
not develop any significant backlog.
This isn't, unfortunately, quite what we want: bit-ticking is not
work-conserving. If all queues are empty and a packet arrives, it still
has to wait to be transmitted while the router ticks through its bits.
Nonetheless, we will call this "simulated bit-by-bit round robin", or
sbbrr. Note that fair queuing doesn't really simulate bbrr; it instead uses the bbrr finishing times as "priorities".
We could try to fix bit-ticking by just sending a newly arriving packet
immediately if all other queues are empty. But what if we finish
sending a packet and several other queues are nonempty, but none has a
packet whose bits have all been ticked through and is thus ready to
send?
Here is a final, work-conserving version of bit-ticking, which we'll call wc-ticking:
For each input queue, maintain a bit
position in the head packet of that queue. If N>0 queues are active,
then for every N output bits we send or could have sent, advance each
bit position by 1. If we advance to the end of the packet, put it into
the output queue. If at any point the output queue is empty, take the
input packet with the fewest bits left to pass over and move it
immediately to the output queue.
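Here is one rough discrete-time rendering of wc-ticking, one call per output bit-time; the data layout ([bits_left, packet] pairs) and the fix-up step are our own reading of the description above:

    from collections import deque

    def wc_tick(queues, outq):
        # One output-bit-time of wc-ticking. Each queue is a deque of
        # [bits_left, packet] pairs; outq is the FIFO output queue.
        active = [q for q in queues if q]
        n = len(active)
        for q in active:
            q[0][0] -= 1.0 / n       # each marker advances 1 bit per N output bits
            if q[0][0] <= 0:         # ticked all the way through: ready to send
                outq.append(q.popleft()[1])
        if not outq:
            # work-conserving fix-up: promote the head packet with the
            # fewest bits left to tick through, if any
            nonempty = [q for q in queues if q]
            if nonempty:
                best = min(nonempty, key=lambda q: q[0][0])
                outq.append(best.popleft()[1])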
Alas, we still have a problem, though wc-ticking is at least a
reasonable approximation. Suppose a very large packet, p1, arrives on
queue1, and just after it begins transmission, a tiny packet p2
arrives on queue1 at the same time as a medium-sized packet p3 on
queue2. Under bit-ticking, once we've started transmitting p1 we have
"forgotten" how big it is. We would send p2 next, and then p3, because
p2 is smaller and we'd thus tick through its bits sooner. However,
under bbrr p3 would be sent before p2, since the bits of p3 would be
sent alternating with p1.
In fact, p3 would end up completely transmitted first, followed by p1,
followed by p2. As soon as p3 arrived, we would alternate bits from p3
and p1; when p3 was done we would finish p1, and then go on to p2.
In other words, we really should compute those finishing times.
Suppose we have just one
input queue, and assume the output line transmits one bit per tick. If
the ith packet arrives at time A[i], of size P[i], then consider the
time when it finishes transmitting, F[i]. If the queue was empty on
arrival, F[i] = A[i] + P[i]; if the queue was nonempty then we have to
wait for the previous packet to finish before we can start and so F[i]
= F[i-1] + P[i]. Combining,
F[i] = max(F[i-1], A[i]) + P[i]
This allows us to compute the finishing time. Packets, of course, would be sent in order of increasing F[i].
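In code, the single-queue calculation might look like this (a sketch, with sizes in bits and times in bit-ticks):

    def finish_times(arrivals, sizes):
        # Single queue: F[i] = max(F[i-1], A[i]) + P[i], one bit per tick.
        F, prev = [], 0
        for a, p in zip(arrivals, sizes):
            prev = max(prev, a) + p
            F.append(prev)
        return F

    print(finish_times([0, 100, 2000], [1000, 500, 300]))   # [1000, 1500, 2300]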
There is just one catch to extending this calculation to multiple queues: it can't be done.
At least, not if we use "real time", or "wallclock time". We know when
a packet arrives, of course, that is, A[i]. And we know the size, P[i].
The calculation of F[i] is done independently for each queue; i refers
to the ith packet of that queue. If we are computing F[i] for a given
queue, packets arriving in other queues do not affect the value of any F[i].
But even in the simple case when all queues are empty when a packet
arrives, and so we begin transmitting it immediately, we can't
calculate the real bbrr finishing time. The problem is that another
packet may arrive on another queue, at which point we would start
alternating the transmitted bits between the two packets, and this
would draw out the finishing time of the first packet. For example,
suppose a 1000-bit packet, p1, arrives at T=0, and a 500-bit packet p2
arrives at T=100, with time measured in bit-times. At the time of
its arrival, we guess that p1 will finish at T=1000, i.e., F[1]=1000. At
T=100, we have sent 100 bits, but now we start alternating with p2's
bits (as we're simulating bit-by-bit round-robin). For the next 1000
ticks, we send 500 bits each from p1 and p2; time is now 1100 and we
have 400 bits left of p1. We finish p1 at T=1500, assuming we get no
more interruptions. The bottom line is that, for any one queue, we just
don't know the real rate of bbrr transmission.
In light of this, we might attempt either of the following approximations to the calculation of the finishing time F[i]:
- Multiply the size of an arriving packet by the total number of input queues.
- Multiply the size of an arriving packet by the number of nonempty input queues, at the time of arrival.
Unfortunately, neither of these works very well; neither of these
strategies leads to a reasonable approximation of fairness. In the next
two examples, we assume that all packets are the same size, and each
takes 1 unit to transmit.
Example 1: scaling P[i] by the
total number of queues. Assume there are 3 queues in all. At T=0,
queue1 receives a bunch of packets. They are assigned finishing times
of 3,6,9,12,.... At T=2, the first two of these packets have
been sent (as the other queues are empty), and now two packets
arrive on queue2. They are assigned finishing times of 2+3=5, 2+6=8.
They therefore leapfrog the remaining packets in queue1,
which have finishing times 9,12,....
Example 2: Scaling P[i] by the number of nonempty
queues: At T=0, five packets arrive on queue1. Assign them respective
finishing times of 1,2,3,4,5. Now, at T=1, with four packets left in
queue1, four packets arrive at queue2. They get assigned finishing
times of 3,5,7,9, adding 2 each time because there are now 2
nonempty queues. Queue1 gets twice as much service!
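A few lines of Python reproduce Example 2's numbers (unit-size packets; the scale factor is the number of nonempty queues at arrival):

    F1, F2 = [0], [1]                    # seed each list with the arrival time
    for _ in range(5):
        F1.append(max(F1[-1], 0) + 1)    # queue1: F = max(F_prev, A) + 1*1
    for _ in range(4):
        F2.append(max(F2[-1], 1) + 2)    # queue2: F = max(F_prev, A) + 2*1
    print(F1[1:])   # [1, 2, 3, 4, 5]
    print(F2[1:])   # [3, 5, 7, 9]
    # merged by F, queue1's backlog drains at twice queue2's rate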
The trick is, instead of scaling P[i] up, we scale A[i] down.
We create a virtual clock, called the FQ-clock; at any moment when
N>0 queues are nonempty, the FQ-clock runs at 1/N of the normal
wallclock speed. (In calculus terms, dFQ/dt = 1/N.) As the number of
nonempty queues rises, the FQ-clock runs slower and slower. Using the
FQ-clock to measure the arrival time A[i], we do indeed get an accurate
estimate of the finishing time F[i], as also measured on the FQ-clock.
The FQ-clock always runs at a rate such that any one queue gets to send
one bit for each tick of the FQ-clock. For example, if there are three
nonempty queues, then the FQ-clock ticks once for each three ticks of
real time. However, under bbrr, each of the three queues sends one bit
in three bit-times, or, if we equate ticks with bit-times, each queue
sends one bit per FQ-tick.
It is straightforward to show that, for the fluid model, this approach also calculates the exact finishing time for each packet.
After a packet is received, the number of active queues may fluctuate
up and down. However, the FQ-clock runs slower or faster in exact
lockstep, so that after we start sending a packet of size P, it takes
exactly P FQ-bit-ticks before the packet is finished. Our formula,
F[i] = max(F[i-1], A[i]) + P[i]
still works.
F[i] = virtual finishing time of ith packet
A[i] = arrival time
S[i] = start time = max(F[i-1], A[i])
P[i] = size of packet
For each arriving packet, we compute F[i] as a timestamp, and send
packets in order of increasing F[i]. What the F[i] thus represents,
*exactly* (ignoring one-bit rounding) is the FQ-clock time at which the
ith packet would finish, if we were using true bbrr.
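Putting this together, here is a minimal sketch of the scheduling side, assuming FQ-clock readings are supplied from outside (see the Implementation section below); heapq orders packets by virtual finishing time, and all names are ours:

    import heapq

    class FairQueuer:
        # Send packets in increasing order of virtual finishing time F[i],
        # where F[i] = max(F[i-1], A[i]) + P[i], measured on the FQ-clock.
        def __init__(self):
            self.heap = []           # entries: (F, seq, flow, packet)
            self.last_F = {}         # flow -> F of its most recent packet
            self.seq = 0             # arrival order, used as tie-breaker

        def enqueue(self, flow, packet, size, fq_now):
            # fq_now: the FQ-clock reading at the packet's arrival.
            F = max(self.last_F.get(flow, 0.0), fq_now) + size
            self.last_F[flow] = F
            heapq.heappush(self.heap, (F, self.seq, flow, packet))
            self.seq += 1

        def dequeue(self):
            # Next packet to transmit (non-preemptively), or None.
            return heapq.heappop(self.heap)[3] if self.heap else None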
Here is the scenario of Example 1 again, this time using the FQ-clock.
Example: At T=0, queue1
receives a bunch of packets and FQ-clock = 0. The packets are assigned
finishing times of 1,2,3,4,.... At T=2, the first two of these have
been sent, and now two packets arrive on queue2. Since only one queue
has been active, FQ-clock has advanced by 2.0 in the intervening time,
and so the packets now on queue2 are assigned finishing times of
3 and 4. These compete fairly with those on queue1, and the packets
from the two queues are interleaved, one for one.
Implementation
How hard is this to implement? Let FQsofar be the value of the FQ-clock
at time Tsofar, and let N be the number of active queues. If N hasn't
changed, then
FQ = FQsofar + (T - Tsofar)/N    (floating-point division)
We need to update both of these every time N changes; that is, whenever
we finish sending a packet and empty a queue, and whenever a packet
arrives on an empty queue.
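That bookkeeping might be sketched as follows (names ours); update() is intended to be called at exactly those moments:

    class FQClock:
        # Piecewise-linear virtual clock: runs at rate 1/N while N > 0
        # queues are active (here frozen while idle; resetting also works).
        def __init__(self):
            self.fq_sofar = 0.0      # FQ-clock value at t_sofar
            self.t_sofar = 0.0       # wallclock time of the last update
            self.n = 0               # number of active queues

        def now(self, t):
            # FQ-clock reading at wallclock t, valid while N is unchanged.
            if self.n == 0:
                return self.fq_sofar
            return self.fq_sofar + (t - self.t_sofar) / self.n

        def update(self, t, n):
            # Call whenever a queue empties or a packet lands on an empty queue.
            self.fq_sofar = self.now(t)
            self.t_sofar = t
            self.n = n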