Comp 343/443    

Fall 2011, LT 412, Tuesday 4:15-6:45
Week 11, Nov 22

Read:

    Ch 3, sections 1, 2, 3
    Ch 5, sections 1 (UDP) and 2 (TCP)



Demos of portscan

Demo of WUMP


Subnet example


Suppose the Loyola CS dept (147.126.65.0/24) and the uchicago CS dept (128.135.???; we'll say 128.135.11.0/24) install a private link.
How does this affect routing?

Each department router would add an entry for the other subnet, routing along the private link. Traffic addressed to the other subnet would take the private link; all other traffic would go to the default router. Note that the entries cover only the two listed subnets: traffic from the uchicago department to a different Loyola subnet, say 147.126.64.0/24, would still take the long route, as would Loyola traffic to 128.135.12.0/24.
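Here is a sketch of what the Loyola-side router's forwarding table might look like after the link is added (the nexthop labels are illustrative):

    147.126.65.0/24   -> local subnet
    128.135.11.0/24   -> private link       (the new entry)
    0.0.0.0/0         -> default router     (everything else)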

How would nearby subnets at either endpoint decide whether to use the private link? Classical link-state or DV theory requires that they be able to compare the private-link route with the going-around-the-long-way route. But they can't!

The real issue here is that luc.edu and uchicago.edu are in different routing domains, and have no way to compare each other's routing metrics!
 


An introduction to CIDR

See P&D 5th edition p 225, "Classless Addressing", or 4th edition §4.3.2. CIDR stands for Classless Inter-Domain Routing.

Subnetting moves the network/host division line further rightwards. The revised division line is visible only within the organization that owns the IP network address; subnetting is not visible outside.

What do we have to do to move the network/host division line further to the left, and why might we want to? The motivation relates to the routing-table explosion: see the consolidation example below.
As for what we have to do, note that now the backbone routing infrastructure will have to understand what is going on, particularly if any routing-table size reduction is hoped for.

Note that awarding multiple Class C's in lieu of a single Class B helped with the address-space allocation issue, but made the routing-table size issue worse.

basic strategy: to consolidate multiple networks going to the same destination into a single entry.
Suppose a router has four class C's all to the same destination:
    200.0.0.0/24   -> foo
    200.0.1.0/24   -> foo
    200.0.2.0/24   -> foo
    200.0.3.0/24   -> foo
That router can replace all these with the single entry
    200.0.0.0/22   -> foo

Do the bit arithmetic to make sure you understand why it's /22 here.

Implementations actually use a mask, FF.FF.FC.00, rather than /22; note FC = 1111 1100 with 6 1-bits, so FF.FF.FC.00 has 8+8+6=22 1-bits.
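Here is a minimal sketch of that bit arithmetic (Python; not part of the original notes), re-checking the consolidation above:

    # FF.FF.FC.00 as a 32-bit integer, and a check that all four class C's
    # above reduce to 200.0.0.0 under the /22 mask:
    M = 0xFFFFFC00                       # 8+8+6 = 22 one-bits
    D = 200 << 24                        # 200.0.0.0
    for third in (0, 1, 2, 3):           # 200.0.0.0 through 200.0.3.0
        A = (200 << 24) | (third << 8)
        assert A & M == D                # each one matches the single /22 entry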

new problem: a packet with an address A arrives, but masks only exist in the forwarding table; packets don't carry masks with them.
How does lookup work?
 
Answer: 
Theoretical algorithm: given a dest A and table entries
    ⟨D[i],M[i]⟩, search for i such that A & M[i] = D[i]    ("&" is the bitwise-AND operator)
Or, in terms of # of bits, where D[i] has N[i] network bits:

        A agrees with D[i] in the first N[i] bits
     
Problem: it is possible to have multiple matches, and guaranteeing that this never happens would require far too much coordination among address allocators to be feasible.

longest-match rule to the rescue!
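Here is a minimal sketch of how longest-match lookup can work (Python; the table entries and nexthop names are invented for illustration):

    # Entry format: (D, M, nexthop); address A matches an entry when A & M == D.
    def to_int(q):                            # dotted-quad string -> 32-bit int
        a, b, c, d = (int(x) for x in q.split("."))
        return (a << 24) | (b << 16) | (c << 8) | d

    def mask(nbits):                          # /nbits -> 32-bit mask
        return (0xFFFFFFFF << (32 - nbits)) & 0xFFFFFFFF

    TABLE = [(to_int("200.0.0.0"), mask(22), "foo"),
             (to_int("200.0.1.0"), mask(24), "bar")]   # overlapping entries

    def lookup(addr):
        A = to_int(addr)
        hits = [(M, hop) for (D, M, hop) in TABLE if A & M == D]
        # longest-match rule: among all matching entries, take the longest mask
        return max(hits, key=lambda h: h[0])[1] if hits else "default"

    # lookup("200.0.1.5") -> "bar"; lookup("200.0.2.5") -> "foo"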
 
Ways CIDR might be used:
policy v mechanism: CIDR is an address-block-allocation mechanism
how provider-based routing might work
 
review longest-match mechanism
What policies do we want to implement with it?

NSFnet-model: NSFnet was the backbone; providers formed a tree below it. But IP addresses were still handed out by IANA directly to organizations. (IANA = Internet Assigned Numbers Authority)
 
Application 1: CIDR allows IANA to allocate multiple blocks of Class C to a single customer
 
Application 2: CIDR allows huge provider blocks, with suballocation by the provider

Literally understood, the second strategy works only if all other providers route to the first provider's customers via the same path. This is mostly, but not exactly, true in practice. It would certainly be true if the provider had a single point of entry to reach all its customers (or at least all those in the CIDR block). Unfortunately, large providers usually want to have multiple external connections to other providers, and sometimes customers have their own multiple connections to other providers.

Two-stage routing

However, we can make provider-based CIDR addressing work 100% by quietly replacing our usual routing algorithm with a two-stage routing version: first route to the appropriate provider, and then route within that provider to the appropriate customer. This is similar to subnets: first route to a customer site, and then within that site route to the appropriate subnet. It always works, but the route taken may no longer be quite optimal; we've replaced the problem of finding an optimal route from A to B with the two problems of finding an optimal route from A to B's provider, and then from B's provider entry point to B.
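A sketch of the two-stage lookup, reusing to_int() and mask() from the sketch above (the provider and customer prefixes anticipate the example below; the labels are invented):

    # Stage 1 (backbone): the rest of the internet routes on provider prefixes.
    BACKBONE  = [(to_int("200.0.0.0"), mask(8),  "toward P0"),
                 (to_int("201.0.0.0"), mask(8),  "toward P1")]
    # Stage 2 (inside P0): P0's own routers know the longer customer prefixes.
    INSIDE_P0 = [(to_int("200.0.0.0"), mask(16), "customer A"),
                 (to_int("200.1.0.0"), mask(20), "customer B")]

    def lookup_in(table, addr):
        A = to_int(addr)
        hits = [(M, hop) for (D, M, hop) in table if A & M == D]
        return max(hits, key=lambda h: h[0])[1] if hits else "default"

    # lookup_in(BACKBONE,  "200.1.3.7") -> "toward P0"    (stage 1)
    # lookup_in(INSIDE_P0, "200.1.3.7") -> "customer B"   (stage 2)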
 
Providers P0(A,B,C), P1(D,E), P2(F,G), each with customers shown in parentheses: how provider-based address allocation helps.
Routing tables, assuming each customer gets an address block from within its provider's block:

    P0: 200.0.0.0/8
    P1: 201.0.0.0/8
    P2: 202.0.0.0/8

    A: 200.0.0.0/16
    B: 200.1.0.0/20
    C: 200.1.16.0/20      (16 = 0x10; the /20 mask is FF.FF.F0.00)

    D: 201.0.0.0/16
    E: 201.1.0.0/16
   
    F: 202.0.0.0/16
    G: 202.1.0.0/16
 
Routing model: route to provider, then to customer
This CHANGES things, subtly; we're no longer looking for the optimum path (at least once the NSFnet routing model broke down).

CIDR and staying out of jail

Longest-match rule and changing providers
    If A moves from P0 to P1, what changes do P2 and the other providers have to make?
   
(If we design an allocation strategy that does not allow change of provider, we may be guilty of antitrust violations.)

Longest-match allows customers to move without renumbering
hidden cost of such moves

New case:
 Providers P0(B,C), P1(A,D,E), P2(F,G) each with customers listed
 (A has moved from P0 to P1)
 but now A's address is unrelated to its new provider's block, and so A needs its own entry in every table!
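Continuing the lookup sketch above (route labels invented), here is P2's table after the move, with longest match doing the work:

    # P2's table after A (200.0.0.0/16) moves from P0 to P1:
    P2_TABLE = [(to_int("200.0.0.0"), mask(8),  "toward P0"),   # P0's block
                (to_int("201.0.0.0"), mask(8),  "toward P1"),   # P1's block
                (to_int("200.0.0.0"), mask(16), "toward P1")]   # A's extra entry

    # lookup_in(P2_TABLE, "200.0.5.1") -> "toward P1"   (A: the /16 wins)
    # lookup_in(P2_TABLE, "200.1.5.1") -> "toward P0"   (B: only the /8 matches)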

                             
Consider
P0---P1---P2
versus
  /------\
P0___P1___P2
(that is, P0 and P2 connecting indirectly through P1, versus having a direct connection)

router pseudo-hierarchy v. address-allocation true hierarchy
These don't have to agree, but there is a cost for disagreement

What if B adds a link to P1, in addition to its existing link to P0?

How CIDR allows provider-based and geography-based routing
                              
provider-based addresses
Problems:
               
                                         A
                                         |
                    P1:  r1--------r2----+---r3
                         |         |         |
                         |         |         |
                         |         |         |
                    P2:  s1--+-----s2--------s3
                             |
                             B
                        
BGP "MED" value (not discussed yet) allows server providers to carry the server's outbound traffic!
               
renumbering: threat or menace? [DHCP, NAT]
           Locators v. EIDs (endpoint identifiers)
           changing IP addrs midstream

geographical addresses

minor problems:
Real issue with geographical routing: who carries the traffic? Provider-based: business model jibes with routing model!!

New routing picture: destinations are networks, still, but some are organizations and some are major providers, with intermediate nets in between. Sometimes we might CHOOSE whether to view a large net as one unit, or to view it as separate medium-sized subunits (for the sake of visualization, assume the subunits have some geographical nature, or some other attribute, so that we can treat them as separate destinations).

Tradeoff:
2-step routing: when does it NOT find optimal routes?

Here is a picture of address allocation as of a few years ago: http://xkcd.com/195



Chapter 6: congestion

Basics of flows
taxonomy: 6.1.2
digression on window size: beyond the transit capacity, window size measures our queue use
Power curves: power = throughput/delay (as load rises, throughput and delay tend to rise together)
                
6.2 Queuing: [FIFO versus Fair Queuing; tail-drop v random-drop]

       

6.3: TCP Congestion avoidance


How did TCP get this job? Part of the goal is good, STABLE performance for our own connection, but part is helping everyone else.

rate-based v window-based congestion management
 
self-clocking: the sliding-windows mechanism itself keeps the number of outstanding packets constant

RTT = Round Trip Time, SWS = Sender Window Size
        
RTTnoload = travel time with no queuing
(RTT-RTTnoload) = time spent in queues
SWS × (RTT-RTTnoload)/RTT = number of packets in queues, usually all at one router (the "bottleneck" router, right before the slowest link); SWS/RTT here is the connection's throughput. Note that the sender can calculate this (assuming we can estimate RTTnoload).
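A small sketch of that sender-side calculation (the numbers are invented for illustration):

    # Estimate queue usage from the window size and two RTT measurements.
    SWS        = 20       # sender window size, in packets
    RTT        = 0.100    # measured round-trip time, seconds
    RTT_noload = 0.080    # estimated no-load RTT, seconds

    throughput = SWS / RTT                        # packets/sec
    in_queue   = throughput * (RTT - RTT_noload)  # about 4 packets in queues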

Note that TCP's self-clocking (ie that new transmissions are paced by returning ACKs) is what guarantees that the queue will build only at the bottleneck router. Self-clocking means that the rate of packet transmissions is equal to the available bandwidth of the bottleneck link. All the other links have higher bandwidth, and can therefore handle the current connection's packets as fast as they are transmitted. There are some spikes when a burst of packets is sent (eg when the sender increases its window size), but in the steady state self-clocking means that packets accumulate only at the bottleneck.

The ideal window size for a TCP connection would be bandwidth × RTTnoload. With this window size, we have exactly filled the transit capacity along the path to our destination, and we have used none of the queue capacity.
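Continuing the invented numbers above (a bottleneck bandwidth of 200 packets/sec is consistent with them):

    # ideal window = bandwidth × RTTnoload: fills the pipe, leaves queues empty.
    bandwidth    = 200                        # packets/sec at the bottleneck
    ideal_window = bandwidth * RTT_noload     # about 16 packets
    # With SWS = 20 above, the 4 extra packets were sitting in the queue.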
            
Actually, TCP does not do this.

Instead, TCP adjusts its window size dynamically. The idea is that there is a time-varying "magic ceiling" of packets the network can accept. We try to stay near but just below this level. Occasionally we will overshoot, but this just teaches us a little more about what the "magic ceiling" is.

Actually, as we'll see, this model isn't quite right, but it's worked well enough.

Also, it's time to admit that there are multiple versions of TCP here, each incorporating different congestion-management algorithms. The two we will start with are TCP Tahoe (1988) and TCP Reno (1990); the names Tahoe and Reno were originally the codenames of the Berkeley Unix distributions that included the respective TCP implementations. TCP Tahoe came from a 1988 paper by Van Jacobson entitled Congestion Avoidance and Control; TCP Reno then refined this a couple years later.

Originally, the SWS for a TCP connection came from the value suggested by the receiver, essentially representing how many packet buffers it could allocate. This value may be so large that it contributes to network congestion, however, and so it is usually reduced. When the SWS is adjusted out of concern for congestion, it is generally thought of as the CongestionWindow, or cwnd (a variable name in a BSD implementation). Strictly speaking, SWS = min(cwnd, AdvertisedWindow).


Congestion Avoidance: Additive Increase / Multiplicative Decrease

The name "congestion avoidance phase" is given to the stage where TCP has established a reasonable guess for cwnd, and wants to engage in some fine-tuning. The central observation is that when a packet is lost, cwnd should decrease rapidly, but otherwise should increase "slowly". The strategy employed is called additive increase, multiplicative decrease, because when a windowful of packets have been sent with no loss we set cwnd = cwnd+1, but if a windowful of packets involves losses we set cwnd = cwnd/2. Note that a windowful is, of course, cwnd many packets; with no losses, we might send successive windowfuls of, say, 20, 21, 22, 23, 24, .... This amounts to conservative "probing" of the network, trying larger cwnd values because the absence of loss means the current cwnd is below the "magic ceiling".

If a loss occurs our goal is to cut the window size in half immediately. (As we will see, Tahoe actually handles this in a somewhat roundabout way.) Informally, the idea is that we need to respond aggressively to congestion. More precisely, lost packets mean we have filled the queue of the bottleneck router, and we need to dial back to a level that will allow the queue to clear. If we assume that the transit capacity is roughly equal to the queue capacity (say each is equal to N), then we overflow the queue and drop packets when cwnd = 2N, and so cwnd = cwnd/2 leaves us with cwnd = N, which just fills the transit capacity and leaves the queue empty.

Of course, assuming any relationship between transit capacity and queue capacity is highly speculative. On a 5,000 km fiber-optic link with a bandwidth of 1 Gbps, the round-trip transit capacity would be about 6 MB (the RTT is about 50 ms, at roughly 200 km/ms for light in fiber, and 1 Gbps × 50 ms = 50 Mbit ≈ 6 MB). That is much larger than any router queue is likely to be.

The congestion-avoidance algorithm leads to the classic "TCP sawtooth" graph, where the peaks are at the points where the slowly rising cwnd crossed above the "magic ceiling".

What might the "magic ceiling" be? It represents the largest cwnd that does not lead to packet loss, ie the cwnd that at that particular moment completely fills but does not overflow the bottleneck queue. The transit capacity of the path (and the queue capacity of the bottleneck router) is unvarying; however, that capacity is also shared with other connections and other connections may come and go with time. This is why the ceiling does vary in real terms. If two other connections share the path with capacity 60 packets, they each might get about 20 packets as their share. If one of those connections terminates, the two others might each rise to 30 packets. And if instead a fourth connection joins the mix, then after equilibrium is reached each connection might expect a share of 15 packets.

Speaking of sharing, it is straightforward to show that the additive-increase/multiplicative-decrease algorithm leads to equal bandwidth sharing when two connections share a bottleneck link, provided both have the same RTT. Assume that during any given RTT either both connections or neither connection experiences packet loss, and consider cwnd1 - cwnd2. If there is no loss, cwnd1-cwnd2 stays the same as both cwnds increment by 1. If there is a loss, then both are cut in half, and so cwnd1-cwnd2 is cut in half. Thus, over time, cwnd1-cwnd2 is repeatedly cut in half, until it dwindles to inconsequentiality.
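A quick simulation of that argument (synchronized losses assumed, as above; the capacity number is invented):

    # Two AIMD flows sharing a bottleneck that overflows past 60 packets.
    def simulate(cwnd1, cwnd2, capacity=60, rounds=200):
        for _ in range(rounds):
            if cwnd1 + cwnd2 > capacity:            # overflow: both see a loss
                cwnd1, cwnd2 = cwnd1 / 2, cwnd2 / 2
            else:                                   # no loss: both increment
                cwnd1, cwnd2 = cwnd1 + 1, cwnd2 + 1
        return cwnd1, cwnd2

    # simulate(40, 10) -> the two cwnds end up nearly equal; the difference
    # is halved at every loss and never grows in between.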