Comp 343/443
Fall 2011, LT 412, Tuesday 4:15-6:45
Week 12, Nov 29
Read:
Ch 3, sections 1, 2, 3
Ch 5, sections 1 (UDP) and 2 (TCP)
Demo of portscan
This works fine from hopper.cs.luc.edu, and almost anywhere else (even at home, from behind firewalls).
However, from the instructor machine in LT412, it ran strangely: I got read timeouts on every readable port, yet the program was correctly determining which ports were open! It turned out (after I ran wireshark to do a packet trace) that the data packets were all arriving 5000 ms after the connection was finished. Raising the read timeout to 5500 ms fixed it. ???
Demo of WUMP
Most systems now have individual firewalls that block new inbound
TCP connections except to certain specified ports; for example, suppose
our host A allows connections only to ports 22/ssh, 80/http and
993/imaps. Users are happiest, though, if outbound TCP connections from A are allowed from any port; however, an outbound connection still involves inbound packets. In particular, the individual inbound TCP packets that are part of a connection initiated
by A are generally allowed, even if (as is nearly inevitable) they
don't come from the listed special ports. That is, a user on machine A
would be allowed to connect to port 8080 on host B, and B's response
packets to A from port 8080 would be allowed past the firewall even
though 8080 is not on the list [22, 80, 993]. One simple implementation
of this is to block all incoming SYN-only packets (that is, packets
initiating a connection) except to ports 22, 80 and 993, but to allow any other TCP packet (eg any TCP packet with the ACK flag set), from any port.
Firewalls like to do this with UDP also, but it's harder; there is no
"connection setup" as such. For server processes running on A, it is
straightforward to block inbound packets except to a list of designated
ports; this is fine for services that run on those ports that need to
be contacted by outsiders.
This works for server processes, but what about UDP-based clients? What if a client process on A starts up on port 2000, not listed in the port firewall, and sends a request to server B? Will B's answer to port 2000 get through?
Generally, most UDP firewalls try to simulate the TCP allow-outbound-connections feature as follows: if A sends a packet to B from port 2000 and to port 1000 (say), then B is allowed to send a packet back to A to port 2000, as long as it comes from port 1000. The permission may be allowed for a limited period of time, or else for as long as A keeps open the socket involved. (Can you think of an experiment to determine which strategy is used?)
In the WUMP case, the REQ packet goes to port 4715 on ulam2; the reply
then comes from a different port, say 60,000. The standard firewall on
the client machine, however, usually blocks that arriving packet, as it sees it coming from a port with which it has no previous contact.
To get around this, I've done two things.
The first is to create a version of the WUMP server that responds from
the same port, 4715. The problem is that multiple clients may be
connected simultaneously, and the server can no longer keep these
connections separate based on the port to which the traffic is sent.
This means that the server had to be completely rewritten, to maintain
a "dictionary" of all extant connections, and a "connection block" of
all data for a given connection (eg timeout time, blocknum, last packet
sent, sockaddr of remote end, file handle of file from which that
connection is reading, etc). Here's the "main event loop" of my new
server, slightly simplified, and unfortunately in C rather than Java:
// sock (the UDP socket) and connectionList are defined elsewhere
while (TRUE) {
    struct connblock * cbp;              // connection-block pointer
    struct connblock * ncb;              // for newly created connections
    struct reqpkt * reqp;                // REQ-packet overlay
    struct ackpkt * ackp;                // ACK-packet overlay
    char * fname;                        // requested filename
    long currtime;                       // current time
    struct wumppkt pbuf;                 // struct buffer for actual packet
    struct sockaddr_in clientname;       // struct to hold src IP addr / port of arriving packet
    socklen_t cnamelen = sizeof clientname;
    int pbuflen = recvfrom(sock, (void *) &pbuf, sizeof pbuf, 0,
                           (struct sockaddr *) &clientname, &cnamelen);

    if (pbuflen < 0) {                   // this means a hard timeout occurred
        // check connectionList for expired timeouts
        cbp = connectionList;
        currtime = time(0);              // get current time
        while (cbp != NULL) {
            if (cbp->timeout_time <= currtime) {   // has this connection's timeout_time passed?
                sendbuf(cbp);            // retransmit
            }
            cbp = cbp->next;
        }
        continue;
    }

    // bad size, other basic sanity checks here (not shown)

    cbp = cl_search(clientname);         // search connectionList for existing connection

    switch (pbuf.opcode) {
    case REQ:
        if (cbp != NULL) continue;       // ignore REQ for which we already have a connection
        reqp = (struct reqpkt *) &pbuf;
        fname = reqp->filename;
        ncb = new_connection(clientname, fname);   // new connblock object
        ncb->next = connectionList;      // add it to connectionList
        connectionList = ncb;
        fillbuf(ncb);                    // read first block into send buffer (within connblock object)
        sendbuf(ncb);                    // actually send it
        break;

    case ACK: {
        if (cbp == NULL) continue;       // ignore ACKs for nonexistent connections (should send error?)
        ackp = (struct ackpkt *) &pbuf;
        int arriving_blocknum = ntohl(ackp->blocknum);
        int lastsent_blocknum = cbp->blocknum;
        if (arriving_blocknum == lastsent_blocknum) {
            // case 1: got ACK for packet we last sent; go on to next!
            if (cbp->eof_reached) {      // final ACK received
                cl_delete(cbp);
                continue;
            }
            fillbuf(cbp);                // read next block into send buffer
            sendbuf(cbp);                // actually send it
        } else if (arriving_blocknum == lastsent_blocknum - 1) {
            // case 2: got ACK for previous packet; RETRANSMIT
            sendbuf(cbp);                // just retransmit; this also updates cbp->timeout_time
        } else {
            // older ACK; ignore
        }
        break;
    }
    case ERROR:                          // client sent us a dup-port error?
        if (cbp != NULL) {
            cl_delete(cbp);
        }
        break;
    }
}
The second thing I did
(which involved considerable trial-and-error with the linux "ufw"
firewall) was to allow Loyola student hosts hopper.cs.luc.edu and
turing.cs.luc.edu to accept UDP packets from ulam2, period. Actually, I
rewrote the standard server to answer always from a port in a specified
range (eg 60,000 to 61,000), and hopper/turing may allow connections
only from those ports. Note that in order to avoid the
old-late-duplicates problem, it is essential that ulam2 choose a new
port as often as possible; it does this by using ports in round-robin
fashion (60000, 60001, 60002, ... , 60998, 60999, 61000, 60000, 60001, ...).
The ufw command was
ufw allow proto udp from 147.126.65.47 port 60000:61000
NAT firewalls
Note that all this is entirely separate from the problem introduced by a NAT firewall
(NAT = Network Address Translation). Here, host A would be behind a
router R, and have a "private" address such as 10.38.2.42. The router
might have a public address, eg 147.126.65.43, and each packet from A
to the outside world has the IP source address rewritten from
10.38.2.42 to 147.126.65.43; the router R then handles the reverse
rewriting for packets from the outside in, looking up the address/port
combinations contained in the packet in a connection table and
rerouting the packet to A.
The problem is that the port number is often rewritten too (in order to
handle the case that both A and R are simultaneously connecting to an
outside host using port 3000). This means that, in the scenario
above, when A sends something to B's port 4715 and B answers from
60,000, R has no entry in its table for what to do with packets from
⟨B,60000⟩ and so it fails to forward the packet. Because R is a third
party, there is generally no way to tell it that packets from an
unknown port should go to A (and that's not something you would want to implement!).
Chapter 6: congestion
6.2 Queuing: [FIFO versus Fair Queuing; tail-drop v random-drop]
6.3: TCP Congestion avoidance
How did TCP get this job? Part of the goal is good, STABLE performance
for our own connection, but part is helping everyone else.
rate-based v window-based congestion management
self-clocking: sliding windows itself keeps the number of outstanding packets constant
RTT = Round Trip Time, SWS = Sender Window Size
RTTnoload = travel time with no queuing
(RTT-RTTnoload) = time spent in queues
(SWS/RTT) × (RTT-RTTnoload) = number of packets in queues, usually
all at one router (the "bottleneck" router, right before the slowest
link); SWS/RTT here is the sending rate. Note that the sender can calculate this (assuming we can
estimate RTTnoload).
Note that TCP's self-clocking (ie that new transmissions are paced by
returning ACKs) is what guarantees that the queue will build only at
the bottleneck router. Self-clocking means that the rate of packet
transmissions is equal to the available bandwidth of the bottleneck
link. All the other links have higher bandwidth, and can therefore
handle the current connection's packets as fast as they are
transmitted. There are some spikes when a burst of packets is sent (eg
when the sender increases its window size), but in the steady state
self-clocking means that packets accumulate only at the bottleneck.
The ideal window size for a TCP connection would be bandwidth × RTTnoload. With this window size, we have exactly filled the transit capacity along the path to our destination, and we have used none of the queue capacity.
Actually, TCP does not do this.
Instead, TCP
- guesses at a reasonable initial window size
- slowly increases the window size if no losses occur, but rapidly decreases it otherwise
The idea is that there is a time-varying "magic ceiling" of packets the
network can accept. We try to stay near but just below this level.
Occasionally we will overshoot, but this just teaches us a little more
about what the "magic ceiling" is.
Actually, as we'll see, this model isn't quite right, but it's worked well enough.
Also, it's time to admit that there are multiple versions of TCP here,
each incorporating different congestion-management algorithms. The two
we will start with are TCP Tahoe (1988) and TCP Reno (1990); the names
Tahoe and Reno were originally the codenames of the Berkeley Unix
distributions that included the respective TCP implementations. TCP
Tahoe came from a 1988 paper by Van Jacobson entitled Congestion Avoidance and Control; TCP Reno then refined this a couple years later.
Originally, the SWS for a TCP connection came from the value suggested by the receiver,
essentially representing how many packet buffers it could allocate.
This value may be so large that it contributes to network congestion,
however, and so it is usually reduced. When the SWS is adjusted out of concern
for congestion, it is generally thought of as the CongestionWindow, or cwnd (a variable name in a BSD implementation). Strictly speaking, SWS = min(cwnd, AdvertisedWindow).
Congestion Avoidance: Additive Increase / Multiplicative Decrease
The name "congestion avoidance phase" is given to the stage where TCP
has established a reasonable guess for cwnd, and wants to engage in
some fine-tuning. The central observation is that when a packet is
lost, cwnd should decrease rapidly, but otherwise should increase
"slowly". The strategy employed is called additive increase, multiplicative decrease,
because when a windowful of packets have been sent with no loss we set
cwnd = cwnd+1, but if a windowful of packets involves losses we set
cwnd = cwnd/2. Note that a windowful is, of course, cwnd many packets;
with no losses, we might send successive windowfuls of, say, 20, 21,
22, 23, 24, .... This amounts to conservative "probing" of the network,
trying larger cwnd values because the absence of loss means the current
cwnd is below the "magic ceiling".
If a loss occurs our goal is to cut the window size in half
immediately. (As we will see, Tahoe actually handles this in a somewhat
roundabout way.) Informally, the idea is that we need to respond
aggressively to congestion. More precisely, lost packets mean we have filled
the queue of the bottleneck router, and we need to dial back to a level
that will allow the queue to clear. If we assume that the transit
capacity is roughly equal to the queue capacity (say each is equal to
N), then we overflow the queue and drop packets when cwnd = 2N, and so
cwnd = cwnd/2 leaves us with cwnd = N, which just fills the transit
capacity and leaves the queue empty.
Of course, assuming any relationship between transit capacity and queue
capacity is highly speculative. On a 5,000 km fiber-optic link with a
bandwidth of 1 Gbps, the round-trip transit capacity would be about 6
MB. That is much larger than any router queue is likely to be.
The congestion-avoidance algorithm leads to the classic "TCP sawtooth"
graph, where the peaks are at the points where the slowly rising cwnd
crossed above the "magic ceiling".
What might the "magic ceiling" be? It represents the largest cwnd that
does not lead to packet loss, ie the cwnd that at that particular
moment completely fills but does not overflow the bottleneck queue. The
transit capacity of the path (and the queue capacity of the bottleneck
router) is unvarying; however, that capacity is also shared with other
connections and other connections may come and go with time. This is
why the ceiling does vary in real terms. If two other connections share
the path with capacity 60 packets, they each might get about 20 packets
as their share. If one of those connections terminates, the two others
might each rise to 30 packets. And if instead a fourth connection joins
the mix, then after equilibrium is reached each connection might expect
a share of 15 packets.
Speaking of sharing, it is straightforward to show that the
additive-increase/multiplicative-decrease algorithm leads to equal
bandwidth sharing when two connections share a bottleneck link,
provided both have the same RTT. Assume that during any given RTT
either both connections or neither connection experiences packet loss,
and consider cwnd1 - cwnd2. If there is no loss, cwnd1-cwnd2 stays the
same as both cwnds increment by 1. If there is a loss, then both are
cut in half, and so cwnd1-cwnd2 is cut in half. Thus, over time,
cwnd1-cwnd2 is repeatedly cut in half, until it dwindles to
inconsequentiality.
Slow Start
How do we make that initial guess as to the network capacity? And even
if we have such a guess, how do we avoid flooding the network sending
an initial burst of packets?
The answer is slow start. If you are trying to guess a number in a
fixed range, you are likely to use binary search. Not knowing the range
for the "magic ceiling", a good strategy is to guess cwnd=1 at first
and keep doubling until you've gone too far. Then revert to the
previous guess (which worked). That ensures that you are now within 50%
of the true capacity.
This is kind of an oversimplification. What we actually do is to
increment cwnd by 1 for each ACK received. This seems linear, not
exponential, but that is misleading:
after we send a windowful of packets (cwnd many), we have incremented
cwnd-many times, and so have set cwnd to (cwnd+cwnd) = 2*cwnd. In other
words, cwnd=cwnd*2 after each windowful is the same as cwnd+=1 after each packet.
Similarly, during congestion-avoidance, we set cwnd = cwnd+1 after each windowful. Since the window size is cwnd, this amounts to cwnd = cwnd + 1/cwnd after each packet
(this is an approximation, because cwnd keeps changing, but it works in
practice if your TCP driver is willing to engage in floating-point
arithmetic). An exact formula is cwnd = cwnd + 1/cwnd0, where cwnd0
is the value of cwnd at the start of that particular windowful.
Another, simpler, approach is to use cwnd += 1/cwnd, and to keep the
fractional part recorded, but to use floor(cwnd) (the integer part of
cwnd) when actually sending packets.
Assuming packets travel together, this means cwnd doubles each RTT. Eventually the network gets "full", and drops a packet.
Let us suppose this is after N RTTs, so cwnd = 2^N. Then during the previous RTT, cwnd = 2^(N-1) worked successfully, so we go back to that previous value by setting cwnd = cwnd/2.
Sometimes we will use Slow Start even when we know the working network
capacity. After a packet loss, we halve the previous cwnd and this
gives us a pretty good idea of what to expect. If cwnd had been 100, we
halve it to 50. However, after a packet loss, there are no returning ACKs to self-clock our transmission, and we do not
want to dump 50 packets on the network all at once. So we use what
might be called "threshold" slow-start: we use slow-start, but stop
when cwnd reaches the target.
The simplified algorithm is thus to use slow-start until the first
packet loss, and then halve cwnd and switch to the congestion-avoidance
phase.
Actually, on every packet loss (including the original slow-start
loss), we use "threshold" slow-start to ramp up again, stopping when
cwnd reaches half its previous value. More precisely, when a packet
loss occurs, we set the slow-start threshold, ssthresh,
equal to half of the value of cwnd at the time of the loss; this is our
target new cwnd. We then set cwnd=1, and begin slow-start mode, up
until cwnd==ssthresh. At that point, we revert to congestion-avoidance.
Review of TCP so far
- slow start + congestion avoidance
- We need threshold slow-start (slow-start with ssthresh) after each loss
- Slow-start and congestion-avoidance have to work together
- self-clocking
Note that everything is expressed here in terms of manipulating cwnd.
Summary:
phase      | cwnd change, loss | cwnd change, no loss
           | (per window)      | (per window) | (per ACK)
slow start | cwnd/2            | cwnd *= 2    | cwnd += 1
cong avoid | cwnd/2            | cwnd += 1    | cwnd += 1/cwnd
TCP idealized sawtooth v approximation
real situation: the sender realizes a lost packet is lost only after
protracted continued sending, at which point the queue will take quite
a while to drain.
fast retransmit: TCP Tahoe.
If we send packets 1,2,3,4,5,6 and get back ACK1, ACK2, ACK2, ACK2, ACK2 we can infer a couple things:
- data 3 got lost, which is why we're stuck on ACK2
- data 4,5,6 did make it through, and triggered the three duplicate ACK2s (the three ACK2s following the first ACK2).
Fast Retransmit is the name
given to this idea as incorporated in TCP. On the third dupACK, we
retransmit the lost packet. We also set ssthresh = cwnd/2, and cwnd=1;
we resume transmitting when the ACK of the lost packet arrives.
TCP and one connection
nam demo
Note interaction between queue size and pipe size
Single sender situation
example A-----R----slow---B, with R having queue size of 4
bottleneck_queue >= bandwidth×RTTnoload
Fast Recovery / TCP Reno
Fast Retransmit requires us to go to cwnd=1 because there are no
arriving packets to pace transmission. Fast Recovery is a workaround:
we use the arriving dupACKs to pace retransmission. The idea is to set
cwnd=cwnd/2, and then to figure out how many dupACKs we have to wait
for. Let cwnd = N, and suppose we've sent packets 1-N and packet 1 is
lost. We will then get N-1 dupACKs for packets 2-N.
During the recovery process, we will ignore SWS and instead use the concept of Estimated FlightSize,
or EFS, which is the sender's best guess at the number of outstanding
packets. Under normal circumstances, EFS = cwnd (except for that tiny
interval between when an ACK arrives and we send out the next packet).
At the point of the third dupACK, the sender calculates as follows: EFS
had been N. However, one of the packets has been lost, making it N-1.
Three dupACKs have arrived, representing three later packets no longer
in flight, so EFS is now N-4. Fast Retransmit has us retransmit the
packet that we inferred was lost, so EFS increments by 1, to N-3.
Our target new cwnd is N/2. So, we wait for N/2 - 3 more dupACKs to
arrive, at which point EFS is N-3-(N/2-3) = N/2. We now send one new packet for each arriving dupACK following,
up until we receive the ACK for the lost packet (it will actually be a
cumulative ack for all the later received packets as well). At this
point we declare cwnd = N/2, and keep going. As EFS was already N/2, sliding windows resumes smoothly, with no sudden burst of packets.
Fast-recovery detailed diagram: SWS = 10, packet 10
lost, packets 11-19 resulting in dupACK9's. The third dupACK9 is "really" from
data13; at that point we retransmit data10. EFS = 5 when we get two
more dupACK9's (from data14 and data15). The next dupACK9 (really from
data16) has us transmit data20:

dupACK9/16    data20
dupACK9/17    data21
dupACK9/18    data22
dupACK9/19    data23
ACK19         data24    (ACK19 is sent when the receiver gets the retransmitted data10)
ACK20         data25    (cwnd = 5 now)
EFS: Estimated FlightSize: sender's estimate of how many packets are in transit, one way or the other. This replaces SWS.
cwnd inflation/deflation
The original algorithm had a more complex strategy; note that strictly
speaking, Reno DOES break the sliding-windows rules during fast
recovery.
NewReno tweak: better handling of the case when two packets were lost.
If packets 1 and 4 are lost in a window 0,1,2,3,4,5,6,7, then initially
we will get dupACK0's. When packet 1 is successfully retransmitted, we
start getting dupACK3's. This particular window is much too small to
wait for 3 dupACKs. NewReno is essentially the "natural" approach of
continuing fast recovery through the second loss: the "partial" ACK3
for the retransmitted packet 1 is taken as evidence that packet 4 was
lost too, and it is retransmitted immediately.
ns/nam demos
- basic1: one-way-propagation = 10, run-time = 8 sec, note
unbounded slow-start, bounded slow-start, congestion-avoidance. Loss at
~3.5 sec with complex recovery
- basic1 variations: run-time = 100 sec, bandwidth = 100 pkts/sec (so the
theoretical max is 10,000 packets); queue size = 5 gives 8490 packets:
queue size | throughput
         3 |       7836
         5 |       8490
        10 |       9306
        20 |       9512
        50 |       9490
       100 |       9667
- basic2: two connections sharing a link and competing for bandwidth. Demo. Note short-term
lack of obvious fairness. Note also that when we increase the delay for
the first link, while we expect a slow decline in the share of that
link, in fact we get widely ranging values:
# delay bw1 bw2
# 10 1216 1339
# 12 1339 1216
# 14 1396 1151
# 16 1150 1396
# 18 1142 1396
# 20 1903 737
# 22 1147 1396
# 24 1156 1348
# 26 1002 1472
# 28 1337 1112
# 30 2106 608
# 40 948 1566
# 50 1033 1515
# 90 453 2066
#100 412 2127
#110 730 1877
#120 432 2121
#130 515 2038
#140 289 2472
#150 441 2065
#160 567 1820
#170 1265 1286
#180 530 1914
#190 490 1849
#200 446 1991
TCP Fairness
Example 1: SAME rtt; both connections get approximately equal allocations, as we saw above.
Example 2: different RTT
Classic version:
Connection 2 has twice the RTT of connection 1.
As before, we assume both connections lose a packet when cwin1+cwin2 > 10; otherwise neither loses.
Both start at 1. Connection 2's RTT is double, so each of its cwnd
values spans two of connection 1's RTTs; * marks a loss:

connection 1:  1  2  3  4  5  6  7*   3  4  5  6  7  8*   4  5  6  7  8  9*   4  5  6  7  8  9*  ...
connection 2:  1     2     3     4*      2     3     4*      2     3     4*      2     3     4*
Connection 2 averages half the window size. Since the time it takes it to send a window is doubled,
its throughput is down by a factor of FOUR.
What is really going on here? Is there something that can be "fixed"?
Early thinking (through the 90's?) was that there was.
Current thinking is to DEFINE what TCP does as "fair", and leave it at that.