Fair Queuing
(British spelling: queueing)
Suppose we have several competing flows at a router:
flow1 --\
         \
flow2 ----[R]---- output
         /
flow3 --/
We can ask how fairly R divides the output bandwidth among the flows.
By "flow" we mean any recognized bundle of data; it could be a single
TCP connection or it could be all traffic from a given host or subnet.
A typical router (random-drop, or drop-tail without serious phase
effects) allocates output bandwidth in proportion to the input
bandwidth. That is, if the three input flows above send 12, 4, and 2
packets per second respectively, for a total of 18, but the output can only handle 9
packets per second, then the flows will successfully transmit 6, 2, and
1 packet per second respectively.
This can, of course, be seen as "fair": each flow gets bandwidth in
proportion to its demand. However, it can also be seen as favoring a
"greedy" approach: flows that cause the most congestion get the most
bandwidth.
"Fair Queuing" is an attempt to give the flows above equal shares, at
least within the limits of actual demand. We could easily cap each flow
at 3 packets/second, but since flow3 isn't actually sending 3/sec, R
is then actually processing 3+3+2 = 8 packets/sec, and there is idle
capacity. It is important for a queuing strategy to be work-conserving; that is, for it to schedule no idle output time
unless all inputs are idle. To this end, R would allow flow3 to send
its 2 packets/sec, and divide the remaining 7 packets/sec of output
flow equally between flow1 and flow2. (Weighted fair queuing gives each flow a designated fraction of the total, but the principle is the same.)
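The allocation just described (equal shares, capped by actual demand) can be computed directly if the per-flow demands are known. Here is a minimal Python sketch; the function name and structure are invented for illustration, and a real router of course achieves this implicitly, packet by packet, rather than from known demand figures.

    def max_min_shares(demands, capacity):
        """Divide capacity among flows: no flow gets more than it asks for,
        and leftover capacity is split equally among the still-hungry flows."""
        shares = [0.0] * len(demands)
        active = list(range(len(demands)))      # flows not yet satisfied
        remaining = float(capacity)
        while active and remaining > 0:
            fair = remaining / len(active)      # equal split of what is left
            satisfied = [i for i in active if demands[i] - shares[i] <= fair]
            if not satisfied:
                for i in active:
                    shares[i] += fair           # everyone still wants more
                return shares
            for i in satisfied:                 # cap these flows at their demand
                remaining -= demands[i] - shares[i]
                shares[i] = demands[i]
                active.remove(i)
        return shares

    # flows demanding 12, 4, and 2 packets/sec, output capacity 9:
    print(max_min_shares([12, 4, 2], 9))        # [3.5, 3.5, 2.0]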
The simplest algorithm for achieving this is round-robin queue service, with all
packets of equal size; this is sometimes called Nagle Fair Queuing.
This means we keep a separate input queue for each flow, and offer
service to the nonempty queues in round-robin (cyclic) fashion. Empty
queues do not tie up resources. Over time, each nonempty queue gets to
send an equal share of packets, and as we have assumed all packets have
the same size, this means that each queue gets equal bandwidth.
(Time-division multiplexing, on the other hand, is like round-robin but
idle connections do tie up resources.)
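Here is a minimal Python sketch of round-robin queue service; the class and method names are invented for illustration. With equal-sized packets, each nonempty queue gets one packet per cycle, and hence an equal share of bandwidth.

    from collections import deque

    class NagleFairQueue:
        """Round-robin service of per-flow queues; fair if packets are equal-sized."""
        def __init__(self, nflows):
            self.queues = [deque() for _ in range(nflows)]
            self.next_flow = 0                  # round-robin pointer

        def enqueue(self, flow, packet):
            self.queues[flow].append(packet)

        def dequeue(self):
            """Return the next packet to send, skipping empty queues."""
            for _ in range(len(self.queues)):
                q = self.queues[self.next_flow]
                self.next_flow = (self.next_flow + 1) % len(self.queues)
                if q:
                    return q.popleft()
            return None                         # all queues empty

Note that empty queues cost nothing beyond the check: the round-robin pointer simply skips over them.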
Nagle fair queuing allows other flows to use more than their equal
share, if some flows are underutilizing. This is a Good Thing. Shares
are divided equally among the active flows. As soon as a flow becomes
active (that is, its queue becomes nonempty) it gets to start sharing
in the bandwidth allocation; it does not have to wait for other flows
to work through their backlogs.
Round-robin is a successful implementation of fair queuing as long as all packets have the same
size! If packets have different sizes, then flows all get their fair
share of packets per second, but this may not relate to bytes per
second, which is really what we want.
What happens when packets aren't all of the same size?
Sometimes
it doesn't matter that much, as long as we can, say, relegate oversized
packets to a lower-priority queue altogether. For example, we might
want fair queuing, together with a higher priority, to apply to all VoIP
packets (identified by port or destination). We can then relegate all
oversized VoIP packets to the low-priority stream (a pretty good
incentive for senders to comply with our packet-size requirement), and
not worry about undersized VoIP packets.
However, often we would like fair queuing to work in settings with
fundamentally variable packet sizes. A first stab is "bit-by-bit"
round-robin (bbrr). Conceptually, we handle individual bits
from each flow on a round-robin basis. If there are three packets, we
send their bits intermingled 1,2,3,1,2,3, etc.; it takes three bit-times
to send one bit from each packet. While bbrr would achieve the goals of
fair queuing, it is unfortunately not implementable;
packets must be sent undivided.
So what we will do instead is "simulate" bbrr
as follows:
- We calculate the packet finishing times under that simulation.
- We send the packets undivided, in order of increasing finishing time.
More
precisely, we compute a "virtual finishing time" at which the packet would
finish, if true bbrr were being used. We then send the packets,
undivided and without preemption, in order of increasing finishing
time. Ties are resolved by any plausible tie-breaker. We'll call this
"virtual-finish" bbrr, or vfbbrr, and fill in some important details
below. Note two things now, however:
- Once we begin transmission of a packet, it is quite possible that a
shorter packet arrives on another queue that would have a smaller
(sooner) finishing time under bbrr. However, fair queuing is non-preemptive; we continue with the packet we have started.
- If a packet is sitting in the output queue with its finishing time
computed, it is quite possible for another packet to come along and
have a smaller finishing time, and so the other packet will thus be
sent first (as in a priority queue).
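The transmission side of vfbbrr thus amounts to a priority queue keyed on finishing time. Here is a minimal Python sketch; finish_fn stands in for the finishing-time computation we haven't yet worked out, and seqno is the tie-breaker mentioned above.

    import heapq

    class VFBBRR:
        def __init__(self, finish_fn):
            self.finish_fn = finish_fn   # finishing-time computation; worked out below
            self.heap = []               # (finish_time, seqno, packet)
            self.seq = 0                 # tie-breaker for equal finishing times

        def arrive(self, packet):
            heapq.heappush(self.heap, (self.finish_fn(packet), self.seq, packet))
            self.seq += 1

        def send_next(self):
            # Non-preemptive: once we pop a packet and begin transmission,
            # later arrivals with smaller finishing times must wait their turn.
            return heapq.heappop(self.heap)[2] if self.heap else None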
We can construct an even finer-grained model than bbrr: the "fluid"
model, in which, when there are N active queues, each queue is drained
continuously at rate 1/N × output_rate. This allows the application of
continuous mathematics (e.g. differential equations), but I don't have
any examples of how this helps.
As an alternative to vfbbrr, what about an algorithm where we provide,
for each queue, a
bit-position marker that advances by one bit for each bit we have
available to send from that queue? If three queues are active, then in
the time it takes us to send three bits, each marker would advance by
one
bit. This can be seen as a more explicit simulation of bbrr. Packets
would then be
transmitted (or at least placed in the output queue) when all their
bits have been ticked through. With N active queues, we tick through N
input bits in the time it takes us to send N output bits, so the output
line should not develop any significant backlog.
This isn't, unfortunately, quite what we want: it is not
work-conserving. If all queues are empty and a packet arrives, it still
has to wait to be transmitted while the router ticks through its bits.
Nonetheless, we will call this "simulated bit-by-bit round robin", or
sbbrr.
Note, by contrast, that vfbbrr doesn't really simulate bbrr; it instead uses the bbrr finishing times as "priorities".
We could try to fix sbbrr by just sending a newly arriving packet
immediately if all other queues are empty. But what if, when we finish
sending that packet, we now have several other nonempty queues, but none has a
packet whose bits have all been ticked through and is thus ready to
send?
Here is another attempt, a work-conserving version of sbbrr: For each input queue,
maintain a bit position in the head packet of that queue. If N queues
are active, then for every N output bits we could have sent, advance
each bit position by 1. If we advance to the end of the packet, put it
into the output queue. If at any point the output queue is empty, take
the head packet with the fewest bits left to tick through and move it
immediately to the output queue.
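Here is a Python sketch of this variant, to make the bookkeeping concrete; the data layout (each queue a deque of [bits_left, packet] pairs) is invented for illustration. As the next paragraph shows, the algorithm itself is still not faithful to bbrr.

    from collections import deque

    def tick(queues, output):
        # One round: each active queue's head packet has one bit ticked through.
        active = [q for q in queues if q]
        for q in active:
            q[0][0] -= 1                       # q[0] is [bits_left, packet]
            if q[0][0] <= 0:
                output.append(q.popleft()[1])  # all bits ticked: ready to send
        if not output:
            # Work-conserving fix-up: nothing is ready, so promote the head
            # packet with the fewest bits left to tick through.
            still = [q for q in queues if q]
            if still:
                q = min(still, key=lambda q: q[0][0])
                output.append(q.popleft()[1])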
Alas, we still have a problem. Suppose a very large packet, Pbig,
arrives
on queue1, and just after it begins transmission, a tiny packet P1
arrives on queue1 at the same time as a medium-sized packet P2 on
queue2. Under sbbrr, once we've started transmitting Pbig we have
"forgotten" how big it is. Under the variation above, we would send P1
next, and then P2, because P1 is smaller and we'd thus tick through its
bits sooner. However,
under true bbrr P2 would be sent before P1, since the bits of P2 would
be
sent alternating with Pbig. In fact, P2 would end up completely
transmitted first, followed by the rest of Pbig, followed by P1. As
soon as P2
arrived, we would alternate bits from P2 and Pbig; when P2 was done we
would finish Pbig and then go on to P1.
In other words, we really need to compute those finishing times.
Towards this end, suppose we have just one input queue, and assume the
output line handles one bit per second. If the ith packet arrives at
time Ai, with size Pi, then consider the time Fi when it finishes
transmitting. If the queue was empty on arrival, Fi = Ai + Pi; if the
queue was nonempty, then we have to wait for the previous packet to
finish before we can start, and so Fi = Fi-1 + Pi. Combining these cases,
Fi = max(Fi-1, Ai) + Pi
This gives us the effective finishing time Fi that we want.
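In Python, for a single queue (a sketch; sizes in bits, times in bit-times):

    def finishing_times(arrivals, sizes):
        # Fi = max(Fi-1, Ai) + Pi, with the output line sending 1 bit/sec
        F, prev = [], 0
        for a, p in zip(arrivals, sizes):
            prev = max(prev, a) + p
            F.append(prev)
        return F

    # a 300-bit packet at T=0, then a 500-bit packet at T=400:
    print(finishing_times([0, 400], [300, 500]))   # [300, 900]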
There is just one catch to extending this calculation to multiple queues: it can't be done.
At least, not if we use "real time", or "wallclock time". We know when
a packet arrives, of course, that is, Ai. And we know the size, Pi.
The calculation of Fi is done independently for each queue; i refers
to the ith packet of that queue. If we are computing Fi for a given
queue, packets arriving in other queues do not affect the value of
any Fi.
But even in the simple case when all queues are empty when a packet
arrives, and so we begin transmitting it immediately (because, no matter what its finishing time Fi,
it is the only and thus the smallest finishing time we have at that
point), we cannot
calculate the true bbrr finishing time. The problem is that another
packet may arrive on another queue while the first packet is in
progress, at which point true bbrr would start
alternating the transmitted bits between the two packets, and this
would draw out the finishing time of the first packet. For example,
suppose a 1000-bit packet, P1, arrives at T=0, and a 500-bit packet P2
arrives at T=100, with time measured in bit-times. At the time of
P1's arrival, we guess that it will finish at T=1000, i.e. F1=1000. At
T=100, we have sent 100 bits, but now true bbrr would start alternating with P2's
bits. For the next
1000 ticks, we would send 500 bits each from P1 and P2; time is now
1100 and we have 400 bits left of P1. We actually finish P1 at T=1500, assuming
we get no more interruptions.
In light of this, we might attempt either of the following approximations to the calculation of the finishing time Fi:
- Multiply the size of an arriving packet by the total number of input queues.
- Multiply the size of an arriving packet by the number of nonempty input queues, at the time of arrival.
Unfortunately, neither of these works very well; neither of these
strategies leads to a reasonable approximation of fairness. In the next
two examples, we assume that all packets are the same size, and each
takes 1 unit to transmit.
Example 1 (failure): attempting to scale Pi by the total number of queues
We have three queues. At T=0, queue1 receives a bunch of packets. If
only one queue were present, they would finish at times 1,2,3,...;
because we are scaling each Pi by 3 (the number of queues), they are
actually assigned finishing times of 3,6,9,12,.... At T=2, the first two
of these packets have been sent (as the other queues are empty), and now
two packets arrive on queue2. These latter two are assigned finishing
times of 2+3=5 and 2+6=8. They therefore leapfrog the remaining packets
in queue1, which have finishing times 9,12,....
Example 2 (failure): attempting to scale Pi by the number of nonempty queues
Again we have three queues. At T=0, five packets arrive on queue1. As
queue1 is the only active queue, we assign them respective finishing
times of 1,2,3,4,5. Now, at T=1, with four packets left in queue1, four
packets arrive at queue2. They are assigned finishing times of 3,5,7,9,
adding 2 each time because there are now two nonempty queues. Queue1
thus gets twice as much service!
Calculating finishing times seems difficult. The trick is, instead of scaling Pi up, we scale Ai down.
We create a virtual clock, called the FQ-clock; at any moment when
N>0 queues are nonempty, the FQ-clock runs at 1/N of the normal
wallclock speed. As the number of nonempty queues rises, the
FQ-clock runs slower and slower. Using the FQ-clock to measure the
arrival time Ai, we do indeed get an accurate estimate of the
finishing time Fi, as also measured on the FQ-clock. The FQ-clock
always runs at a rate such that any one queue gets to send one bit for
each tick of the FQ-clock, no matter how many queues are active. For
example, if there are three nonempty queues, then the FQ-clock ticks
once for every three ticks of real time. Meanwhile, under true bbrr,
each of the three queues sends one bit in three bit-times; that is, if
we equate ticks with bit-times, each queue sends one bit per FQ-tick.
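In code, the clock-advance rule is essentially one line; here is a sketch (names invented), with the convention that the clock simply freezes when no queues are active:

    def advance_fq_clock(fq_time, real_elapsed, num_active):
        # The FQ-clock runs at 1/N wallclock speed when N > 0 queues are
        # nonempty; what to do when N == 0 is a matter of convention.
        return fq_time + (real_elapsed / num_active if num_active > 0 else 0)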
After a packet is received, the number of active queues may fluctuate
up and down. However, the FQ-clock runs slower or faster in exact
lockstep, so that after we start sending a packet of size P, it takes
exactly P many FQ-bit-ticks before the packet is finished. Our formula,
Fi = Max(Fi-1, Ai) + Pi
still works.
Fi = virtual finishing time of the ith packet
Ai = arrival time of the ith packet, measured by the FQ-clock
Si = start time = max(Fi-1, Ai)
Pi = size of the ith packet, in bits
For each arriving packet, we compute Fi as a timestamp when it arrives, and then send
packets in order of increasing Fi. What Fi thus represents, exactly
(ignoring one-bit rounding), is the FQ-clock time at which the ith
packet would finish if we were using true bbrr.
Here is the scenario of the two failed examples above, redone using the FQ-clock.
Example 3 (success): measuring time with the FQ clock
At T=0, queue1 receives a bunch of packets and FQ-clock = 0. The
packets are assigned finishing times of 1,2,3,4,.... At real time
T=2.0,
the first two of these have been sent, and now two packets arrive
on queue2. Since only one queue has been active, the FQ-clock has
advanced by 2.0 in the intervening time, and so the two packets now on
queue2 are assigned finishing times of 3 and 4. These compete fairly
with those on queue1, and the packets from the two queues are
interleaved, one for one. At real time T=4.0, the FQ-clock has advanced
to 3.0, and one packet from queue1 and one from queue2, each with a
virtual finishing time of 3.0, have been sent.
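Putting the pieces together, here is a minimal event-driven sketch of the whole algorithm in Python. Everything here is illustrative (the names, the data layout, the floating-point clock); in particular, the packet being transmitted is treated as leaving its queue instantly at dequeue time, a simplification a real implementation would have to be more careful about.

    import heapq

    class FairQueue:
        def __init__(self):
            self.heap = []          # (Fi, seqno, flow, size)
            self.seq = 0            # tie-breaker for equal Fi
            self.last_F = {}        # previous virtual finishing time, per flow
            self.backlog = {}       # number of queued packets, per flow
            self.fq_time = 0.0      # the FQ-clock
            self.last_real = 0.0    # real time of the last clock update

        def _update_clock(self, now):
            # FQ-clock runs at 1/N wallclock speed when N queues are nonempty
            n = sum(1 for v in self.backlog.values() if v > 0)
            if n > 0:
                self.fq_time += (now - self.last_real) / n
            self.last_real = now

        def arrive(self, now, flow, size):
            self._update_clock(now)
            # Fi = max(Fi-1, Ai) + Pi, all measured on the FQ-clock
            F = max(self.last_F.get(flow, 0.0), self.fq_time) + size
            self.last_F[flow] = F
            self.backlog[flow] = self.backlog.get(flow, 0) + 1
            heapq.heappush(self.heap, (F, self.seq, flow, size))
            self.seq += 1

        def send_next(self, now):
            # Send (non-preemptively) the packet with the smallest Fi
            self._update_clock(now)
            if not self.heap:
                return None
            F, _, flow, size = heapq.heappop(self.heap)
            self.backlog[flow] -= 1
            return (flow, size)

    fq = FairQueue()
    fq.arrive(0, "flow1", 1000)
    fq.arrive(0, "flow2", 500)
    print(fq.send_next(0))          # ('flow2', 500): the smaller finishing time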
Clock arithmetic
It has often been noted that needing floating-point arithmetic to
maintain your clock is a burden. One simplifying approximation, that
works well when the clock resolution is high compared to the time
needed to send a packet, is to advance the FQ clock by 1 tick when N
queues are active and N realtime ticks have elapsed. The discrepancy
here is that the number of active queues might change from N to N-1
during the N realtime ticks, but this discrepancy is small for a
high-resolution clock. Note that if a queue becomes active, so that the
number of active queues rises to N+1, we can simply agree to record the
new packet's arrival as of when the current N-tick block finishes,
slightly after its true arrival time, and accommodate the discrepancy
that way.
Also, note that the FQ-clock rate only changes when packets first
arrive on an empty queue, or packets are dequeued leaving an empty
queue. It suffices to update the FQ-clock only at these times.
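Here is a sketch of the integer version of the update, under the approximation just described; the leftover realtime ticks are carried into the next block, which is also where a newly active queue's arrival gets recorded slightly late:

    def advance_fq_clock_int(fq_ticks, real_ticks, n_active):
        # One FQ tick per block of N realtime ticks when N queues are active.
        # Returns the new FQ tick count plus the leftover realtime ticks to
        # carry into the next block.
        if n_active == 0:
            return fq_ticks, 0
        whole, leftover = divmod(real_ticks, n_active)
        return fq_ticks + whole, leftover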
Linux SFQ
(SFQ = Stochastic Fair Queuing)
Flows here are individual TCP connections, not hosts or subnets. Flows
are hashed by <srcaddr, destaddr, port>. Flows with identical hash
values are considered the same queue, so bucket collisions result in
unfairness; for this reason, the hash function is altered at regular
intervals. Note also that, because flows are connections, a sender can
get twice the bandwidth simply by creating a second connection!
What we really want is a way to define flow_id's so they can be created
dynamically and connections can be assigned to a flow_id by connection,
host, subnet, etc.
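To make the hashing concrete, here is a sketch of the SFQ idea in Python; the real Linux implementation differs in detail (different hash function, in-kernel), but the perturbation mechanism is the point:

    import zlib

    N_BUCKETS = 1024
    perturbation = 42     # re-randomized at regular intervals, so that two
                          # flows that collide now won't collide forever

    def sfq_bucket(srcaddr, dstaddr, port):
        key = f"{srcaddr}|{dstaddr}|{port}|{perturbation}".encode()
        return zlib.crc32(key) % N_BUCKETS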