Computer Networks Week 11 Apr 15 Corboy Law 522
CIDR and Geographical Addresses
minor problems:
- inefficient paths between close sites.
- large sites
Real issue with geographical routing: who carries the traffic? Provider-based: business model jibes with routing model!!
New routing picture: destinations are networks, still, but some are
organizations and some are major providers, with intermediate nets in
between. Sometimes we might CHOOSE whether to view a large net as one
unit, or to view it as separate medium-sized subunits (for the sake of
visualization, assume the subunits have some geographical nature, or
other attribute, so that we can treat them as separate destinations).
Tradeoff:
- consolidation => more compact routing table
- individual subentries => more optimal route selection
2-step routing: when does it NOT find optimal routes?
Here is a picture of address allocation as of a few years ago: http://xkcd.com/195
Chapter 6: congestion
Basics of flows
taxonomy: 6.1.2
- router v. host
- reservation v. feedback
- window v. rate
digression on window size: beyond the transit capacity, window size measures our queue use
Power curves: throughput/delay (they tend to rise in proportion)
6.2 Queuing: [FIFO versus Fair Queuing; tail-drop v random-drop]
Fair Queuing
(British spelling: queueing)
Suppose we have several competing flows at a router:
flow1--\
\
flow2----[R]----output
/
flow3--/
We can ask how fairly R divides the output bandwidth among the flows.
By "flow" we mean any recognized bundle of data; it could be a single
TCP connection or it could be all traffic from a given host or subnet.
A typical router (random-drop, or drop-tail without serious phase
effects) allocates output bandwidth in proportion to the input
bandwidth. That is, if the three input flows above send 12, 4, and 2
packets per second respectively, for a total of 18, but the output can
only handle 9
packets per second, then the flows will successfully transmit 6, 2, and
1 packet per second respectively.
This can, of course, be seen as "fair": each flow gets bandwidth in
proportion to its demand. However, it can also be seen as favoring a
"greedy" approach: flows that cause the most congestion get the most
bandwidth.
"Fair Queuing" is an attempt to give the flows above equal shares, at
least within the limits of actual demand. We could easily cap each flow
at 3 packets/second, but since flow3 isn't actually sending 3/sec, R
is then actually processing 3+3+2 = 8 packets/sec, and there is idle
capacity. It is important for a queuing strategy to be work-conserving; that is, for it to
schedule no idle output time
unless all inputs are idle. To this end, R would allow flow3 to send
its 2 packets/sec, and divide the remaining 7 packets/sec of output
flow equally between flow1 and flow2. (Weighted fair queuing gives each flow a designated
fraction of the total, but the principle is the same.)
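Here is a minimal sketch of this work-conserving division in Python (the function name and setup are invented for illustration): flows demanding less than an equal share keep their full demand, and the leftover is split among the rest.

    def fair_shares(demands, capacity):
        # Work-conserving max-min division of capacity among flows.
        shares = {}
        remaining = dict(demands)              # flows still competing
        while remaining:
            equal = capacity / len(remaining)  # equal split of what's left
            # Flows whose demand fits under the equal share get it in full.
            small = {f: d for f, d in remaining.items() if d <= equal}
            if not small:
                # Everyone wants more than the equal share: split evenly.
                for f in remaining:
                    shares[f] = equal
                return shares
            for f, d in small.items():
                shares[f] = d
                capacity -= d
                del remaining[f]
        return shares

    # The example above: demands of 12, 4, 2 packets/sec; output capacity 9.
    # Result: flow3 gets its full 2; flow1 and flow2 get 3.5 each.
    print(fair_shares({'flow1': 12, 'flow2': 4, 'flow3': 2}, 9))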
The simplest algorithm for achieving this is round-robin queue service,
with all
packets of equal size; this is sometimes called Nagle Fair Queuing.
This means we keep a separate input queue for each flow, and offer
service to the nonempty queues
in round-robin (cyclic) fashion. Empty
queues do not tie up resources. Over time, each nonempty queue gets to
send an equal share of packets, and as we have assumed all packets have
the same size, this means that each queue gets equal bandwidth.
(Time-division multiplexing, on the other hand, is like round-robin but
idle connections do tie up
resources.)
Nagle fair queuing allows other flows to use more than their equal
share, if some flows are underutilizing. This is a Good Thing. Shares
are divided equally among the active
flows. As soon as a flow becomes
active (that is, its queue becomes nonempty) it gets to start sharing
in the bandwidth allocation; it does not have to wait for other flows
to work through their backlogs.
Round-robin is a successful implementation of fair queuing as long as
all packets have the same size! If packets have different sizes, then
flows all get their fair share of packets per second, but this may not
correspond to bytes per second, which is really what we want.
If packets are of different sizes, a simple if not quite exact strategy is the quantum
approach: we pick a quantum value, larger than any single packet. We
also service the queues round-robin. Each sender, when its turn comes
up, can send up to but not over quantum bytes, which is always at least
one packet. If the sender had
(at least) one more packet to send, but that packet straddled the
quantum limit (ie the packet was size A+B bytes, with A bytes remaining
in the quantum and B over), then we do not send the packet, but we do
add A to that sender's next quantum.
When that sender's next quantum rolls around, it will again get to send
as many packets as it can, up to the limit quantum+A. There may again
be a "straddling" packet that we can't send, of size A2 + B2, where A2
is the number of bytes we had remaining to send. Note that it will
always be the case that A2 < max_packet_size < quantum.
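Here is a sketch of the quantum idea (essentially deficit round robin) in Python; the queue contents and quantum value below are invented for illustration.

    from collections import deque

    def drr_round(queues, deficits, quantum):
        # One round-robin pass. queues[i] is a deque of packet sizes (bytes);
        # deficits[i] carries the leftover allowance ("A" above) between turns.
        sent = []
        for i, q in enumerate(queues):
            if not q:
                deficits[i] = 0            # empty queues bank no credit
                continue
            deficits[i] += quantum         # this turn's allowance
            while q and q[0] <= deficits[i]:
                pkt = q.popleft()          # head packet fits: send it
                deficits[i] -= pkt
                sent.append((i, pkt))
            # If q is nonempty here, its head packet straddled the limit;
            # the unused A bytes stay in deficits[i] for the next turn.
        return sent

    queues = [deque([600, 600, 600]), deque([1000]), deque([200, 200])]
    deficits = [0, 0, 0]
    print(drr_round(queues, deficits, quantum=1000))   # round 1
    print(drr_round(queues, deficits, quantum=1000))   # round 2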
6.3: TCP Congestion avoidance
How did TCP get this job? Part of the goal is good, STABLE performance
for our own connection, but part is helping everyone else.
rate-based v window-based congestion management
self-clocking: the sliding-windows mechanism itself keeps the number of outstanding packets constant
RTT = Round Trip Time, SWS = Sender Window Size
RTTnoload = travel time with no queuing
(RTT-RTTnoload) = time spent in queues
(SWS/RTT) × (RTT-RTTnoload) = number of packets in queues, usually
all at one router (the "bottleneck" router, right before the slowest
link). Note that the sender can calculate this (assuming we can
estimate RTTnoload).
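As a worked example (numbers invented), suppose SWS = 10 packets, RTT = 5 ms and RTTnoload = 4 ms. Then throughput is SWS/RTT = 2 packets/ms, and the queue holds 2 packets/ms × (5-4) ms = 2 packets; the other 8 packets are in transit.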
Note that TCP's self-clocking (ie that new transmissions are paced by
returning ACKs) is what guarantees that the queue will build only at
the bottleneck router. Self-clocking means that the rate of packet
transmissions is equal to the available bandwidth of the bottleneck
link. All the other links have higher bandwidth, and can therefore
handle the current connection's packets as fast as they are
transmitted. There are some spikes when a burst of packets is sent (eg
when the sender increases its window size), but in the steady state
self-clocking means that packets accumulate only at the bottleneck.
The ideal window size for a TCP connection would be bandwidth × RTTnoload. With this window size, we have exactly filled the transit capacity along the path to our destination, and we have used none of the queue capacity.
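For example (again with invented numbers), if the bottleneck bandwidth is 1 packet/ms and RTTnoload = 40 ms, the ideal window size would be 1 × 40 = 40 packets: exactly enough to fill the pipe, with nothing left over in any queue.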
Actually, TCP does not do this.
Instead, TCP
- guesses at a reasonable initial window size
- slowly increases the window size if no losses occur, but rapidly decreases it otherwise
The idea is that there is a time-varying "magic ceiling" of packets the
network can accept. We try to stay near but just below this level.
Occasionally we will overshoot, but this just teaches us a little more
about what the "magic ceiling" is.
Actually, as we'll see, this model isn't quite right, but it's worked well enough.
Also, it's time to admit that there are multiple versions of TCP here,
each incorporating different congestion-management algorithms. The two
we will start with are TCP Tahoe (1988) and TCP Reno (1990); the names
Tahoe and Reno were originally the codenames of the Berkeley Unix
distributions that included the respective TCP implementations. TCP
Tahoe came from a 1988 paper by Van Jacobson entitled Congestion Avoidance and Control; TCP Reno then refined this a couple years later.
Originally, the SWS for a TCP connection came from the value suggested by the receiver,
essentially representing how many packet buffers it could allocate.
This value may be so large that it contributes to network congestion,
however, so it is usually reduced. When the SWS is adjusted out of concern
for congestion, it is generally thought of as the CongestionWindow, or cwnd (a variable name in a BSD implementation). Strictly speaking, SWS = min(cwnd, AdvertisedWindow).
Congestion Avoidance: Additive Increase / Multiplicative Decrease
The name "congestion avoidance phase" is given to the stage where TCP
has established a reasonable guess for cwnd, and wants to engage in
some fine-tuning. The central observation is that when a packet is
lost, cwnd should decrease rapidly, but otherwise should increase
"slowly". The strategy employed is called additive increase, multiplicative decrease,
because when a windowful of packets has been sent with no loss we set
cwnd = cwnd+1, but if a windowful of packets involves losses we set
cwnd = cwnd/2. Note that a windowful is, of course, cwnd many packets;
with no losses, we might send successive windowfuls of, say, 20, 21,
22, 23, 24, .... This amounts to conservative "probing" of the network,
trying larger cwnd values because the absence of loss means the current
cwnd is below the "magic ceiling".
If a loss occurs our goal is to cut the window size in half
immediately. (As we will see, Tahoe actually handles this in a somewhat
roundabout way.) Informally, the idea is that we need to respond
aggressively to congestion. More precisely, lost packets mean we have filled
the queue of the bottleneck router, and we need to dial back to a level
that will allow the queue to clear. If we assume that the transit
capacity is roughly equal to the queue capacity (say each is equal to
N), then we overflow the queue and drop packets when cwnd = 2N, and so
cwnd = cwnd/2 leaves us with cwnd = N, which just fills the transit
capacity and leaves the queue empty.
Of course, assuming any relationship between transit capacity and queue
capacity is highly speculative. On a 5,000 km fiber-optic link with a
bandwidth of 1 Gbps, the round-trip transit capacity would be about 6
MB. That is much larger than any router queue is likely to be.
The congestion-avoidance algorithm leads to the classic "TCP sawtooth"
graph, where the peaks are at the points where the slowly rising cwnd
crosses above the "magic ceiling".
What might the "magic ceiling" be? It represents the largest cwnd that
does not lead to packet loss, ie the cwnd that at that particular
moment completely fills but does not overflow the bottleneck queue. The
transit capacity of the path (and the queue capacity of the bottleneck
router) is unvarying; however, that capacity is shared with other
connections, which may come and go over time. This is why the ceiling
varies in practice. If two other connections share
the path with capacity 60 packets, they each might get about 20 packets
as their share. If one of those connections terminates, the two others
might each rise to 30 packets. And if instead a fourth connection joins
the mix, then after equilibrium is reached each connection might expect
a share of 15 packets.
Speaking of sharing, it is straightforward to show that the
additive-increase/multiplicative-decrease algorithm leads to equal
bandwidth sharing when two connections share a bottleneck link,
provided both have the same RTT. Assume that during any given RTT
either both connections or neither connection experiences packet loss,
and consider cwnd1 - cwnd2. If there is no loss, cwnd1-cwnd2 stays the
same as both cwnds increment by 1. If there is a loss, then both are
cut in half, and so cwnd1-cwnd2 is cut in half. Thus, over time,
cwnd1-cwnd2 is repeatedly cut in half, until it dwindles to
inconsequentiality.
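This argument is easy to check with a toy simulation (the ceiling and starting windows below are invented; synchronized loss is assumed whenever the combined windows exceed the ceiling):

    def aimd(cwnd1, cwnd2, ceiling, rtts):
        # Two AIMD connections with the same RTT sharing one bottleneck.
        for _ in range(rtts):
            cwnd1 += 1                 # additive increase, once per RTT
            cwnd2 += 1
            if cwnd1 + cwnd2 > ceiling:
                cwnd1 /= 2             # synchronized loss: both halve,
                cwnd2 /= 2             # so cwnd1 - cwnd2 halves as well
        return cwnd1, cwnd2

    # Starting far apart (100 vs 1), the windows end up nearly equal.
    print(aimd(100.0, 1.0, ceiling=120, rtts=500))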
Slow Start
How do we make that initial guess as to the network capacity? And even
if we have such a guess, how do we avoid flooding the network sending
an initial burst of packets?
The answer is slow start. If you are trying to guess a number in a
fixed range, you are likely to use binary search. Not knowing the range
for the "magic ceiling", a good strategy is to guess cwnd=1 at first
and keep doubling until you've gone too far. Then revert to the
previous guess (which worked). That ensures that you are now within 50%
of the true capacity.
This is kind of an oversimplification. What we actually do is to
increment cwnd by 1 for each ACK received. This seems linear, not
exponential, but that is misleading:
after we send a windowful of packets (cwnd many), we have incremented
cwnd-many times, and so have set cwnd to (cwnd+cwnd) = 2*cwnd. In other
words, cwnd=cwnd*2 after each windowful is the same as cwnd+=1 after each packet.
Similarly, during congestion-avoidance, we set cwnd = cwnd+1 after each windowful. Since the window size is cwnd, this amounts to cwnd = cwnd + 1/cwnd after each packet
(this is an approximation, because cwnd keeps changing, but it works in
practice if your TCP driver is willing to engage in floating-point
arithmetic). An exact formula is cwnd = cwnd + 1/cwnd0, where cwnd0
is the value of cwnd at the start of that particular windowful.
Another, simpler, approach is to use cwnd += 1/cwnd, and to keep the
fractional part recorded, but to use floor(cwnd) (the integer part of
cwnd) when actually sending packets.
Assuming packets travel together, this means cwnd doubles each RTT. Eventually the network gets "full", and drops a packet.
Let us suppose this is after N RTTs, so cwnd = 2^N. Then during the previous RTT, cwnd = 2^(N-1) worked successfully, so we go back to that previous value by setting cwnd = cwnd/2.
Sometimes we will use Slow Start even when we know the working network
capacity. After a packet loss, we halve the previous cwnd and this
gives us a pretty good idea of what to expect. If cwnd had been 100, we
halve it to 50. However, after a packet loss, there are no returning ACKs to self-clock our transmission, and we do not
want to dump 50 packets on the network all at once. So we use what
might be called "threshold" slow-start: we use slow-start, but stop
when cwnd reaches the target.
The simplified algorithm is thus to use slow-start until the first
packet loss, and then halve cwnd and switch to the congestion-avoidance
phase.
Actually, on every packet loss (including the original slow-start
loss), we use "threshold" slow-start to ramp up again, stopping when
cwnd reaches half its previous value. More precisely, when a packet
loss occurs, we set the slow-start threshold, ssthresh,
equal to half of the value of cwnd at the time of the loss; this is our
target new cwnd. We then set cwnd=1, and begin slow-start mode, up
until cwnd==ssthresh. At that point, we revert to congestion-avoidance.
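Here is the cwnd logic so far as a sketch (illustrative Python, not a real TCP implementation; loss detection is assumed to happen elsewhere and be reported via on_loss):

    class TahoeCwnd:
        # Simplified TCP Tahoe congestion window management.
        def __init__(self):
            self.cwnd = 1.0
            self.ssthresh = float('inf')    # no threshold until first loss

        def on_ack(self):
            if self.cwnd < self.ssthresh:
                self.cwnd += 1              # slow start: doubles per windowful
            else:
                self.cwnd += 1 / self.cwnd  # cong avoid: +1 per windowful

        def on_loss(self):
            self.ssthresh = self.cwnd / 2   # target for threshold slow start
            self.cwnd = 1.0                 # restart in slow-start mode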
Review of TCP so far
- slow start + congestion avoidance
- We need threshold slow-start (slow-start with ssthresh) after each loss
- Slow-start and congestion-avoidance have to work together
- self-clocking
Note that everything is expressed here in terms of manipulating cwnd.
Summary:

    phase      | cwnd change, loss | cwnd change, no loss
               | per window        | per window  | per ACK
    -----------+-------------------+-------------+----------------
    slow start | cwnd/2            | cwnd *= 2   | cwnd += 1
    cong avoid | cwnd/2            | cwnd += 1   | cwnd += 1/cwnd
TCP idealized sawtooth v. approximation
real situation: the sender realizes a packet has been lost only after
protracted continued sending, at which point the queue will take quite
a while to drain.
fast retransmit: TCP Tahoe.
If we send packets 1,2,3,4,5,6 and get back ACK1, ACK2, ACK2, ACK2, ACK2 we can infer a couple things:
- data 3 got lost, which is why we're stuck on ACK2
- data 4,5,6 did make it through, and triggered the three duplicate ACK2s (the three ACK2s following the first ACK2).
Fast Retransmit is the name
given to this idea as incorporated in TCP. On the third dupACK, we
retransmit the lost packet. We also set ssthresh = cwnd/2, and cwnd=1;
we resume transmitting when the ACK of the lost packet arrives.
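A sketch of the dupACK-counting rule (the state handling is invented for illustration):

    def on_ack(acknum, state):
        # state: {'last_ack': highest cumulative ACK seen, 'dupcount': int}.
        # Returns True exactly on the third duplicate ACK.
        if acknum == state['last_ack']:
            state['dupcount'] += 1
            return state['dupcount'] == 3
        state['last_ack'] = acknum      # new data ACKed: reset the counter
        state['dupcount'] = 0
        return False

    state = {'last_ack': None, 'dupcount': 0}
    # The example above: ACK1, then ACK2 followed by three duplicate ACK2s.
    for ack in [1, 2, 2, 2, 2]:
        if on_ack(ack, state):
            print('fast retransmit of data', ack + 1)   # data3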
TCP and one connection
nam demo
Note interaction between queue size and pipe size
Single sender situation
example A-----R----slow---B, with R having queue size of 4
bottleneck_queue >= bandwidth×RTTnoload
Fast Recovery / TCP Reno
Fast Retransmit requires us to go to cwnd=1 because there are no
arriving packets to pace transmission. Fast Recovery is a workaround:
we use the arriving dupACKs to pace retransmission. The idea is to set
cwnd=cwnd/2, and then to figure out how many dupACKs we have to wait
for. Let cwnd = N, and suppose we've sent packets 1-N and packet 1 is
lost. We will then get N-1 dupACKs for packets 2-N.
During the recovery process, we will ignore SWS and instead use the concept of Estimated FlightSize,
or EFS, which is the sender's best guess at the number of outstanding
packets. Under normal circumstances, EFS = cwnd (except for that tiny
interval between when an ACK arrives and we send out the next packet).
At the point of the third dupACK, the sender calculates as follows: EFS
had been N. However, one of the packets has been lost, making it N-1.
Three dupACKs have arrived, representing three later packets no longer
in flight, so EFS is now N-4. Fast Retransmit has us retransmit the
packet that we inferred was lost, so EFS increments by 1, to N-3.
Our target new cwnd is N/2. So, we wait for N/2 - 3 more dupACKs to
arrive, at which point EFS is N-3-(N/2-3) = N/2. We now send one new packet for each subsequently arriving dupACK,
up until we receive the ACK for the lost packet (it will actually be a
cumulative ACK for all the later received packets as well). At this
point we declare cwnd = N/2, and keep going. As EFS was already N/2,
the transition back to normal sliding windows is seamless.
fast recovery detailed diagram: SWS = 10, packet 10 lost, packets 11-19
each triggering a dupACK9. The third dupACK9 is "really" from data13;
at that point we retransmit data10. EFS = 5 once we get two more
dupACK9's (from data14 and data15). The next dupACK9 (really from
data16) has us transmit data20:

    arriving ACK        packet sent
    dupACK9 (data16)    data20
    dupACK9 (data17)    data21
    dupACK9 (data18)    data22
    dupACK9 (data19)    data23
    ACK19               data24   (ACK19 is sent when the receiver gets the retransmitted data10)
    ACK20               data25   (cwnd = 5 now)
EFS: Estimated FlightSize: sender's estimate of how many packets are in transit, one way or the other. This replaces SWS.
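The EFS bookkeeping above can be condensed into a few lines (a sketch; the function name is invented):

    def dupacks_to_wait(N):
        # cwnd = N, first packet of the window lost, per the analysis above.
        efs = N - 1        # the lost packet is no longer in flight
        efs -= 3           # three dupACKs: three later packets have arrived
        efs += 1           # fast retransmit of the inferred-lost packet
        target = N // 2    # new cwnd after multiplicative decrease
        return efs - target    # further dupACKs to absorb before new data

    print(dupacks_to_wait(10))   # 2, matching the diagram above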
cwnd inflation/deflation
The original algorithm had a more complex strategy; note that strictly
speaking, Reno DOES break the sliding-windows rules during fast
recovery.
NewReno tweak: better handling of the case when two packets were lost
If packets 1 and 4 are lost in a window 0,1,2,3,4,5,6,7
then initially we will get dupACK0's.
When packet 1 is successfully retransmitted,
we start getting dupACK3's.
This particular window is much too small to wait for 3 dupACKs.
NewReno is essentially the "natural" approach to continuing: when the partial ACK (the ACK3 here) arrives, acknowledging the first retransmission but not the whole outstanding window, we infer that the next unacknowledged packet (4) was also lost, and retransmit it immediately rather than waiting for three dupACKs.