Computer Networks Week 11 Apr 15 Corboy Law 522
CIDR and Geographical Addresses
minor problems:
- inefficient paths between close sites.
- large sites
Real issue with geographical routing: who carries the traffic? Provider-based: business model jibes with routing model!!
New routing picture: destinations are networks, still, but some are
organizations and some are major providers, with intermediate nets in
between. Sometimes we might CHOOSE whether to view a large net as one
unit, or to view it as separate medium-sized subunits (for the sake of
visualization, assume the subunits have some geographical nature, or
other attribute, so that we can treat them as separate destinations).
Tradeoff:
- consolidation => more compact routing table
- individual subentries => more optimal route selection
2-step routing: when does it NOT find optimal routes?
Here is a picture of address allocation as of a few years ago: http://xkcd.com/195
Chapter 6: congestion
Basics of flows
taxonomy: 6.1.2
- router v. host
- reservation v. feedback
- window v. rate
digression on window size: beyond the transit capacity, window size measures our queue use
Power curves: throughput/delay (they tend to rise in proportion)
6.2 Queuing: [FIFO versus Fair Queuing; tail-drop v random-drop]
Fair Queuing
(British spelling: queueing)
Suppose we have several competing flows at a router:
flow1--\
\
flow2----[R]----output
/
flow3--/
We can ask how fairly R divides the output bandwidth among the flows.
By "flow" we mean any recognized bundle of data; it could be a single
TCP connection or it could be all traffic from a given host or subnet.
A typical router (random-drop, or drop-tail without serious phase
effects) allocates output bandwidth in proportion to the input
bandwidth. That is, if the three input flows above send 12, 4, and 2
packets per second respectively, for a total of 18, but the output can
only handle 9
packets per second, then the flows will successfully transmit 6, 2, and
1 packet per second respectively.
This can, of course, be seen as "fair": each flow gets bandwidth in
proportion to its demand. However, it can also be seen as favoring a
"greedy" approach: flows that cause the most congestion get the most
bandwidth.
"Fair Queuing" is an attempt to give the flows above equal shares, at
least within the limits of actual demand. We could easily cap each flow
at 3 packets/second, but since flow3 isn't actually sending 3/sec, R
is then actually processing 3+3+2 = 8 packets/sec, and there is idle
capacity. It is important for a queuing strategy to be work-conserving; that is, for it to
schedule no idle output time
unless all inputs are idle. To this end, R would allow flow3 to send
its 2 packets/sec, and divide the remaining 7 packets/sec of output
flow equally between flow1 and flow2. (Weighted fair queuing gives each flow a designated
fraction of the total, but the principle is the same.)
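Here is a minimal sketch of this work-conserving division in Python (the function name and setup are invented for illustration): flows demanding less than an equal share keep their full demand, and the leftover is split among the rest.

    def fair_shares(demands, capacity):
        # Work-conserving max-min division of capacity among flows.
        shares = {}
        remaining = dict(demands)              # flows still competing
        while remaining:
            equal = capacity / len(remaining)  # equal split of what's left
            # Flows whose demand fits under the equal share get it in full.
            small = {f: d for f, d in remaining.items() if d <= equal}
            if not small:
                # Everyone wants more than the equal share: split evenly.
                for f in remaining:
                    shares[f] = equal
                return shares
            for f, d in small.items():
                shares[f] = d
                capacity -= d
                del remaining[f]
        return shares

    # The example above: demands of 12, 4, 2 packets/sec; output capacity 9.
    # Result: flow3 gets its full 2; flow1 and flow2 get 3.5 each.
    print(fair_shares({'flow1': 12, 'flow2': 4, 'flow3': 2}, 9))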
The simplest algorithm for achieving this is round-robin queue service,
with all
packets of equal size; this is sometimes called Nagle Fair Queuing.
This means we keep a separate input queue for each flow, and offer
service to the nonempty queues
in round-robin (cyclic) fashion. Empty
queues do not tie up resources. Over time, each nonempty queue gets to
send an equal share of packets, and as we have assumed all packets have
the same size, this means that each queue gets equal bandwidth.
(Time-division multiplexing, on the other hand, is like round-robin but
idle connections do tie up
resources.)
Nagle fair queuing allows other flows to use more than their equal
share, if some flows are underutilizing. This is a Good Thing. Shares
are divided equally among the active
flows. As soon as a flow becomes
active (that is, its queue becomes nonempty) it gets to start sharing
in the bandwidth allocation; it does not have to wait for other flows
to work through their backlogs.
Round-robin is a successful implementation of fair queuing as long as
all packets have the same size! If packets have different sizes, then
flows all get their fair share of packets per second, but this may not
correspond to bytes per second, which is really what we want.
If packets are of different sizes, a simple if not quite exact strategy is the quantum
approach: we pick a quantum value, larger than any single packet. We
also service the queues round-robin. Each sender, when its turn comes
up, can send up to but not over quantum bytes, which is always at least
one packet. If the sender had
(at least) one more packet to send, but that packet straddled the
quantum limit (ie the packet was size A+B bytes, with A bytes remaining
in the quantum and B over), then we do not send the packet, but we do
add A to that sender's next quantum.
When that sender's next quantum rolls around, it will again get to send
as many packets as it can, up to the limit quantum+A. There may again
be a "straddling" packet that we can't send, of size A2 + B2, where A2
is the number of bytes we had remaining to send. Note that it will
always be the case that A2 < max_packet_size < quantum.
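Here is a sketch of the quantum idea (essentially deficit round robin) in Python; the queue contents and quantum value below are invented for illustration.

    from collections import deque

    def drr_round(queues, deficits, quantum):
        # One round-robin pass. queues[i] is a deque of packet sizes (bytes);
        # deficits[i] carries the leftover allowance ("A" above) between turns.
        sent = []
        for i, q in enumerate(queues):
            if not q:
                deficits[i] = 0            # empty queues bank no credit
                continue
            deficits[i] += quantum         # this turn's allowance
            while q and q[0] <= deficits[i]:
                pkt = q.popleft()          # head packet fits: send it
                deficits[i] -= pkt
                sent.append((i, pkt))
            # If q is nonempty here, its head packet straddled the limit;
            # the unused A bytes stay in deficits[i] for the next turn.
        return sent

    queues = [deque([600, 600, 600]), deque([1000]), deque([200, 200])]
    deficits = [0, 0, 0]
    print(drr_round(queues, deficits, quantum=1000))   # round 1
    print(drr_round(queues, deficits, quantum=1000))   # round 2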
6.3: TCP Congestion avoidance
How did TCP get this job? Part of the goal is good, STABLE performance
for our own connection, but part is helping everyone else.
rate-based v window-based congestion management
self-clocking: the sliding-windows mechanism itself keeps the number of outstanding packets constant
RTT = Round Trip Time, SWS = Sender Window Size
RTTnoload = travel time with no queuing
(RTT-RTTnoload) = time spent in queues
(SWS/RTT) × (RTT-RTTnoload) = number of packets in queues, usually
all at one router (the "bottleneck" router, right before the slowest
link). Note that the sender can calculate this (assuming we can
estimate RTTnoload).
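As a worked example (numbers invented), suppose SWS = 10 packets, RTT = 5 ms and RTTnoload = 4 ms. Then throughput is SWS/RTT = 2 packets/ms, and the queue holds 2 packets/ms × (5-4) ms = 2 packets; the other 8 packets are in transit.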
Note that TCP's self-clocking (ie that new transmissions are paced by
returning ACKs) is what guarantees that the queue will build only at
the bottleneck router. Self-clocking means that the rate of packet
transmissions is equal to the available bandwidth of the bottleneck
link. All the other links have higher bandwidth, and can therefore
handle the current connection's packets as fast as they are
transmitted. There are some spikes when a burst of packets is sent (eg
when the sender increases its window size), but in the steady state
self-clocking means that packets accumulate only at the bottleneck.
The ideal window size for a TCP connection would be bandwidth × RTTnoload. With this window size, we have exactly filled the transit capacity along the path to our destination, and we have used none of the queue capacity.
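For example (again with invented numbers), if the bottleneck bandwidth is 1 packet/ms and RTTnoload = 40 ms, the ideal window size would be 1 × 40 = 40 packets: exactly enough to fill the pipe, with nothing left over in any queue.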
Actually, TCP does not do this.
Instead, TCP
- guesses at a reasonable initial window size
- slowly increases the window size if no losses occur, but rapidly decreases it otherwise
The idea is that there is a time-varying "magic ceiling" of packets the
network can accept. We try to stay near but just below this level.
Occasionally we will overshoot, but this just teaches us a little more
about what the "magic ceiling" is.
Actually, as we'll see, this model isn't quite right, but it's worked well enough.
Also, it's time to admit that there are multiple versions of TCP here,
each incorporating different congestion-management algorithms. The two
we will start with are TCP Tahoe (1988) and TCP Reno (1990); the names
Tahoe and Reno were originally the codenames of the Berkeley Unix
distributions that included the respective TCP implementations. TCP
Tahoe came from a 1988 paper by Van Jacobson entitled Congestion Avoidance and Control; TCP Reno then refined this a couple years later.
Originally, the SWS for a TCP connection came from the value suggested by the receiver,
essentially representing how many packet buffers it could allocate.
This value may be so large that it contributes to network congestion,
however, so it is usually reduced. When the SWS is adjusted out of concern
for congestion, it is generally thought of as the CongestionWindow, or cwnd (a variable name in a BSD implementation). Strictly speaking, SWS = min(cwnd, AdvertisedWindow).
Congestion Avoidance: Additive Increase / Multiplicative Decrease
The name "congestion avoidance phase" is given to the stage where TCP
has established a reasonable guess for cwnd, and wants to engage in
some fine-tuning. The central observation is that when a packet is
lost, cwnd should decrease rapidly, but otherwise should increase
"slowly". The strategy employed is called additive increase, multiplicative decrease,
because when a windowful of packets has been sent with no loss we set
cwnd = cwnd+1, but if a windowful of packets involves losses we set
cwnd = cwnd/2. Note that a windowful is, of course, cwnd many packets;
with no losses, we might send successive windowfuls of, say, 20, 21,
22, 23, 24, .... This amounts to conservative "probing" of the network,
trying larger cwnd values because the absence of loss means the current
cwnd is below the "magic ceiling".
If a loss occurs our goal is to cut the window size in half
immediately. (As we will see, Tahoe actually handles this in a somewhat
roundabout way.) Informally, the idea is that we need to respond
aggressively to congestion. More precisely, lost packets mean we have filled
the queue of the bottleneck router, and we need to dial back to a level
that will allow the queue to clear. If we assume that the transit
capacity is roughly equal to the queue capacity (say each is equal to
N), then we overflow the queue and drop packets when cwnd = 2N, and so
cwnd = cwnd/2 leaves us with cwnd = N, which just fills the transit
capacity and leaves the queue empty.
Of course, assuming any relationship between transit capacity and queue
capacity is highly speculative. On a 5,000 km fiber-optic link with a
bandwidth of 1 Gbps, the round-trip transit capacity would be about 6
MB. That is much larger than any router queue is likely to be.
The congestion-avoidance algorithm leads to the classic "TCP sawtooth"
graph, where the peaks are at the points where the slowly rising cwnd
crosses above the "magic ceiling".
What might the "magic ceiling" be? It represents the largest cwnd that
does not lead to packet loss, ie the cwnd that at that particular
moment completely fills but does not overflow the bottleneck queue. The
transit capacity of the path (and the queue capacity of the bottleneck
router) is unvarying; however, that capacity is shared with other
connections, which may come and go over time. This is why the ceiling
varies in practice. If two other connections share
the path with capacity 60 packets, they each might get about 20 packets
as their share. If one of those connections terminates, the two others
might each rise to 30 packets. And if instead a fourth connection joins
the mix, then after equilibrium is reached each connection might expect
a share of 15 packets.
Speaking of sharing, it is straightforward to show that the
additive-increase/multiplicative-decrease algorithm leads to equal
bandwidth sharing when two connections share a bottleneck link,
provided both have the same RTT. Assume that during any given RTT
either both connections or neither connection experiences packet loss,
and consider cwnd1 - cwnd2. If there is no loss, cwnd1-cwnd2 stays the
same as both cwnds increment by 1. If there is a loss, then both are
cut in half, and so cwnd1-cwnd2 is cut in half. Thus, over time,
cwnd1-cwnd2 is repeatedly cut in half, until it dwindles to
inconsequentiality.
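This argument is easy to check with a toy simulation (the ceiling and starting windows below are invented; synchronized loss is assumed whenever the combined windows exceed the ceiling):

    def aimd(cwnd1, cwnd2, ceiling, rtts):
        # Two AIMD connections with the same RTT sharing one bottleneck.
        for _ in range(rtts):
            cwnd1 += 1                 # additive increase, once per RTT
            cwnd2 += 1
            if cwnd1 + cwnd2 > ceiling:
                cwnd1 /= 2             # synchronized loss: both halve,
                cwnd2 /= 2             # so cwnd1 - cwnd2 halves as well
        return cwnd1, cwnd2

    # Starting far apart (100 vs 1), the windows end up nearly equal.
    print(aimd(100.0, 1.0, ceiling=120, rtts=500))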
Slow Start
How do we make that initial guess as to the network capacity? And even
if we have such a guess, how do we avoid flooding the network sending
an initial burst of packets?
The answer is slow start. If you are trying to guess a number in a
fixed range, you are likely to use binary search. Not knowing the range
for the "magic ceiling", a good strategy is to guess cwnd=1 at first
and keep doubling until you've gone too far. Then revert to the
previous guess (which worked). That ensures that you are now within 50%
of the true capacity.
This is kind of an oversimplification. What we actually do is to
increment cwnd by 1 for each ACK received. This seems linear, not
exponential, but that is misleading:
after we send a windowful of packets (cwnd many), we have incremented
cwnd-many times, and so have set cwnd to (cwnd+cwnd) = 2*cwnd. In other
words, cwnd=cwnd*2 after each windowful is the same as cwnd+=1 after each packet.
Similarly, during congestion-avoidance, we set cwnd = cwnd+1 after each windowful. Since the window size is cwnd, this amounts to cwnd = cwnd + 1/cwnd after each packet
(this is an approximation, because cwnd keeps changing, but it works in
practice if your TCP driver is willing to engage in floating-point
arithmetic). An exact formula is cwnd = cwnd + 1/cwnd0, where cwnd0
is the value of cwnd at the start of that particular windowful.
Another, simpler, approach is to use cwnd += 1/cwnd, and to keep the
fractional part recorded, but to use floor(cwnd) (the integer part of
cwnd) when actually sending packets.
Assuming packets travel together, this means cwnd doubles each RTT. Eventually the network gets "full", and drops a packet.
Let us suppose this is after N RTTs, so cwnd = 2^N. Then during the previous RTT, cwnd = 2^(N-1) worked successfully, so we go back to that previous value by setting cwnd = cwnd/2.
Sometimes we will use Slow Start even when we know the working network
capacity. After a packet loss, we halve the previous cwnd and this
gives us a pretty good idea of what to expect. If cwnd had been 100, we
halve it to 50. However, after a packet loss, there are no returning ACKs to self-clock our transmission, and we do not
want to dump 50 packets on the network all at once. So we use what
might be called "threshold" slow-start: we use slow-start, but stop
when cwnd reaches the target.
The simplified algorithm is thus to use slow-start until the first
packet loss, and then halve cwnd and switch to the congestion-avoidance
phase.
Actually, on every packet loss (including the original slow-start
loss), we use "threshold" slow-start to ramp up again, stopping when
cwnd reaches half its previous value. More precisely, when a packet
loss occurs, we set the slow-start threshold, ssthresh,
equal to half of the value of cwnd at the time of the loss; this is our
target new cwnd. We then set cwnd=1, and begin slow-start mode, up
until cwnd==ssthresh. At that point, we revert to congestion-avoidance.
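Here is the cwnd logic so far as a sketch (illustrative Python, not a real TCP implementation; loss detection is assumed to happen elsewhere and be reported via on_loss):

    class TahoeCwnd:
        # Simplified TCP Tahoe congestion window management.
        def __init__(self):
            self.cwnd = 1.0
            self.ssthresh = float('inf')    # no threshold until first loss

        def on_ack(self):
            if self.cwnd < self.ssthresh:
                self.cwnd += 1              # slow start: doubles per windowful
            else:
                self.cwnd += 1 / self.cwnd  # cong avoid: +1 per windowful

        def on_loss(self):
            self.ssthresh = self.cwnd / 2   # target for threshold slow start
            self.cwnd = 1.0                 # restart in slow-start mode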
Review of TCP so far
- slow start + congestion avoidance
- We need threshold slow-start (slow-start with ssthresh) after each loss
- Slow-start and congestion-avoidance have to work together
- self-clocking
Note that everything is expressed here in terms of manipulating cwnd.
Summary:

    phase      | cwnd change, loss | cwnd change, no loss
               | per window        | per window  | per ACK
    -----------+-------------------+-------------+----------------
    slow start | cwnd/2            | cwnd *= 2   | cwnd += 1
    cong avoid | cwnd/2            | cwnd += 1   | cwnd += 1/cwnd
TCP idealized sawtooth v. approximation
real situation: the sender realizes a packet has been lost only after
protracted continued sending, at which point the queue will take quite
a while to drain.
fast retransmit: TCP Tahoe.
If we send packets 1,2,3,4,5,6 and get back ACK1, ACK2, ACK2, ACK2, ACK2 we can infer a couple things:
- data 3 got lost, which is why we're stuck on ACK2
- data 4,5,6 did make it through, and triggered the three duplicate ACK2s (the three ACK2s following the first ACK2).
Fast Retransmit is the name
given to this idea as incorporated in TCP. On the third dupACK, we
retransmit the lost packet. We also set ssthresh = cwnd/2, and cwnd=1;
we resume transmitting when the ACK of the lost packet arrives.
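A sketch of the dupACK-counting rule (the state handling is invented for illustration):

    def on_ack(acknum, state):
        # state: {'last_ack': highest cumulative ACK seen, 'dupcount': int}.
        # Returns True exactly on the third duplicate ACK.
        if acknum == state['last_ack']:
            state['dupcount'] += 1
            return state['dupcount'] == 3
        state['last_ack'] = acknum      # new data ACKed: reset the counter
        state['dupcount'] = 0
        return False

    state = {'last_ack': None, 'dupcount': 0}
    # The example above: ACK1, then ACK2 followed by three duplicate ACK2s.
    for ack in [1, 2, 2, 2, 2]:
        if on_ack(ack, state):
            print('fast retransmit of data', ack + 1)   # data3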
TCP and one connection
nam demo
Note interaction between queue size and pipe size
Single sender situation
example A-----R----slow---B, with R having queue size of 4
bottleneck_queue >= bandwidth×RTTnoload
Fast Recovery / TCP Reno
Fast Retransmit requires us to go to cwnd=1 because there are no
arriving packets to pace transmission. Fast Recovery is a workaround:
we use the arriving dupACKs to pace retransmission. The idea is to set
cwnd=cwnd/2, and then to figure out how many dupACKs we have to wait
for. Let cwnd = N, and suppose we've sent packets 1-N and packet 1 is
lost. We will then get N-1 dupACKs for packets 2-N.
During the recovery process, we will ignore SWS and instead use the concept of Estimated FlightSize,
or EFS, which is the sender's best guess at the number of outstanding
packets. Under normal circumstances, EFS = cwnd (except for that tiny
interval between when an ACK arrives and we send out the next packet).
At the point of the third dupACK, the sender calculates as follows: EFS
had been N. However, one of the packets has been lost, making it N-1.
Three dupACKs have arrived, representing three later packets no longer
in flight, so EFS is now N-4. Fast Retransmit has us retransmit the
packet that we inferred was lost, so EFS increments by 1, to N-3.
Our target new cwnd is N/2. So, we wait for N/2 - 3 more dupACKs to
arrive, at which point EFS is N-3-(N/2-3) = N/2. We now send one new packet for each subsequently arriving dupACK,
up until we receive the ACK for the lost packet (it will actually be a
cumulative ACK for all the later received packets as well). At this
point we declare cwnd = N/2, and keep going. As EFS was already N/2,
the transition back to normal sliding windows is seamless.
fast recovery detailed diagram: SWS = 10, packet 10 lost, packets 11-19
each triggering a dupACK9. The third dupACK9 is "really" from data13;
at that point we retransmit data10. EFS = 5 once we get two more
dupACK9's (from data14 and data15). The next dupACK9 (really from
data16) has us transmit data20:

    arriving ACK        packet sent
    dupACK9 (data16)    data20
    dupACK9 (data17)    data21
    dupACK9 (data18)    data22
    dupACK9 (data19)    data23
    ACK19               data24   (ACK19 is sent when the receiver gets the retransmitted data10)
    ACK20               data25   (cwnd = 5 now)
EFS: Estimated FlightSize: sender's estimate of how many packets are in transit, one way or the other. This replaces SWS.
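The EFS bookkeeping above can be condensed into a few lines (a sketch; the function name is invented):

    def dupacks_to_wait(N):
        # cwnd = N, first packet of the window lost, per the analysis above.
        efs = N - 1        # the lost packet is no longer in flight
        efs -= 3           # three dupACKs: three later packets have arrived
        efs += 1           # fast retransmit of the inferred-lost packet
        target = N // 2    # new cwnd after multiplicative decrease
        return efs - target    # further dupACKs to absorb before new data

    print(dupacks_to_wait(10))   # 2, matching the diagram above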
cwnd inflation/deflation
The original algorithm had a more complex strategy; note that strictly
speaking, Reno DOES break the sliding-windows rules during fast
recovery.
NewReno tweak: better handling of the case when two packets were lost
If packets 1 and 4 are lost in a window 0,1,2,3,4,5,6,7
then initially we will get dupACK0's.
When packet 1 is successfully retransmitted,
we start getting dupACK3's.
This particular window is much too small to wait for 3 dupACKs.
NewReno is essentially the "natural" approach to continuing: when the partial ACK (the ACK3 here) arrives, acknowledging the first retransmission but not the whole outstanding window, we infer that the next unacknowledged packet (4) was also lost, and retransmit it immediately rather than waiting for three dupACKs.