Comp 343/443
Fall 2011, LT 412, Tuesday 4:15-6:45
Week 12, Nov 29
Read:
Ch 3, sections 1, 2, 3
Ch 5, sections 1 (UDP) and 2 (TCP)
Demo of portscan
This works fine from hopper.cs.luc.edu, and almost anywhere else (even at home, from behind firewalls).
However, from the instructor machine in LT412, it ran strangely: I got read timeouts on every readable port, yet the program was correctly determining which ports were open! It turned out (after I ran wireshark to do a packet trace) that the data packets were all arriving 5000 ms after the connection was finished. Raising the read timeout to 5500 ms fixed it. ???
Demo of WUMP
Most systems now have individual firewalls that block new inbound
TCP connections except to certain specified ports; for example, suppose
our host A allows connections only to ports 22/ssh, 80/http and
993/imaps. Users are happiest, though, if outbound TCP connections from A are allowed from any port; however, an outbound connection still involves inbound packets. In particular, the individual inbound TCP packets that are part of a connection initiated
by A are generally allowed, even if (as is nearly inevitable) they
don't come from the listed special ports. That is, a user on machine A
would be allowed to connect to port 8080 on host B, and B's response
packets to A from port 8080 would be allowed past the firewall even
though 8080 is not on the list [22, 80, 993]. One simple implementation
of this is to block all incoming SYN-only packets (that is, packets
initiating a connection) except to ports 22, 80 and 993, but to allow any other TCP packet (eg any TCP packet with the ACK flag set), from any port.
Firewalls like to do this with UDP also, but it's harder; there is no
"connection setup" as such. For server processes running on A, it is
straightforward to block inbound packets except to a list of designated
ports; this is fine for services that run on those ports that need to
be contacted by outsiders.
This works for server processes, but what about UDP-based clients? What if a client process on A starts up on port 2000, not listed in the port firewall, and sends a request to server B? Will B's answer to port 2000 get through?
Generally, most UDP firewalls try to simulate the TCP allow-outbound-connections feature as follows: if A sends a packet to B from port 2000 and to port 1000 (say), then B is allowed to send a packet back to A to port 2000, as long as it comes from port 1000. The permission may be allowed for a limited period of time, or else for as long as A keeps open the socket involved. (Can you think of an experiment to determine which strategy is used?)
In the WUMP case, the REQ packet goes to port 4715 on ulam2; the reply
then comes from a different port, say 60,000. The standard firewall on
the client machine, however, usually blocks that arriving packet, as it sees it coming from a port with which it has no previous contact.
To get around this, I've done two things.
The first is to create a version of the WUMP server that responds from
the same port, 4715. The problem is that multiple clients may be
connected simultaneously, and the server can no longer keep these
connections separate based on the port to which the traffic is sent.
This means that the server had to be completely rewritten, to maintain
a "dictionary" of all extant connections, and a "connection block" of
all data for a given connection (eg timeout time, blocknum, last packet
sent, sockaddr of remote end, file handle of file from which that
connection is reading, etc). Here's the "main event loop" of my new
server, slightly simplified, and unfortunately in C rather than Java:
// sock (the UDP socket) and connectionList are defined elsewhere
while (TRUE) {
    struct connblock * cbp;              // connection-block pointer
    struct connblock * ncb;              // for newly created connections
    struct reqpkt * reqp;                // REQ-packet overlay
    struct ackpkt * ackp;                // ACK-packet overlay
    char * fname;                        // requested filename
    long currtime;                       // current time
    struct wumppkt pbuf;                 // struct buffer for actual packet
    struct sockaddr_in clientname;       // struct to hold src IP addr / port of arriving packet
    socklen_t cnamelen = sizeof clientname;
    int pbuflen = recvfrom(sock, (void *) &pbuf, sizeof pbuf, 0,
                           (struct sockaddr *) &clientname, &cnamelen);

    if (pbuflen < 0) {                   // this means a hard timeout occurred
        // check connectionList for expired timeouts
        cbp = connectionList;
        currtime = time(0);              // get current time
        while (cbp != NULL) {
            if (cbp->timeout_time <= currtime) {   // has this connection's timeout_time passed?
                sendbuf(cbp);            // retransmit
            }
            cbp = cbp->next;
        }
        continue;
    }

    // bad size, other basic sanity checks here (not shown)

    cbp = cl_search(clientname);         // search connectionList for existing connection

    switch (pbuf.opcode) {
    case REQ:
        if (cbp != NULL) continue;       // ignore REQ for which we already have a connection
        reqp = (struct reqpkt *) &pbuf;
        fname = reqp->filename;
        ncb = new_connection(clientname, fname);   // new connblock object
        ncb->next = connectionList;      // add it to connectionList
        connectionList = ncb;
        fillbuf(ncb);                    // read first block into send buffer (within connblock object)
        sendbuf(ncb);                    // actually send it
        break;

    case ACK: {
        if (cbp == NULL) continue;       // ignore ACKs for nonexistent connections (should send error?)
        ackp = (struct ackpkt *) &pbuf;
        int arriving_blocknum = ntohl(ackp->blocknum);
        int lastsent_blocknum = cbp->blocknum;
        if (arriving_blocknum == lastsent_blocknum) {
            // case 1: got ACK for packet we last sent; go on to next!
            if (cbp->eof_reached) {      // final ACK received
                cl_delete(cbp);
                continue;
            }
            fillbuf(cbp);                // read next block into send buffer
            sendbuf(cbp);                // actually send it
        } else if (arriving_blocknum == lastsent_blocknum - 1) {
            // case 2: got ACK for previous packet; RETRANSMIT
            sendbuf(cbp);                // just retransmit; this also updates cbp->timeout_time
        } else {
            // older ACK; ignore
        }
        break;
    }
    case ERROR:                          // client sent us a dup-port error?
        if (cbp != NULL) {
            cl_delete(cbp);
        }
        break;
    }
}
The second thing I did
(which involved considerable trial-and-error with the linux "ufw"
firewall) was to allow Loyola student hosts hopper.cs.luc.edu and
turing.cs.luc.edu to accept UDP packets from ulam2, period. Actually, I
rewrote the standard server to answer always from a port in a specified
range (eg 60,000 to 61,000), and hopper/turing may allow connections
only from those ports. Note that in order to avoid the
old-late-duplicates problem, it is essential that ulam2 choose a new
port as often as possible; it does this by using ports in round-robin
fashion (60000, 60001, 60002, ... , 60998, 60999, 61000, 60000, 60001, ...).
The ufw command was
ufw allow proto udp from 147.126.65.47 port 60000:61000
NAT firewalls
Note that all this is entirely separate from the problem introduced by a NAT firewall
(NAT = Network Address Translation). Here, host A would be behind a
router R, and have a "private" address such as 10.38.2.42. The router
might have a public address, eg 147.126.65.43, and each packet from A
to the outside world has the IP source address rewritten from
10.38.2.42 to 147.126.65.43; the router R then handles the reverse
rewriting for packets from the outside in, looking up the address/port
combinations contained in the packet in a connection table and
rerouting the packet to A.
The problem is that the port number is often rewritten too (in order to
handle the case that both A and R are simultaneously connecting to an
outside host using port 3000). This means that, in the scenario
above, when A sends something to B's port 4715 and B answers from
60,000, R has no entry in its table for what to do with packets from
⟨B,60000⟩ and so it fails to forward the packet. Because R is a third
party, there is generally no way to tell it that packets from an
unknown port should go to A (and that's not something you would want to implement!).
Chapter 6: congestion
6.2 Queuing: [FIFO versus Fair Queuing; tail-drop v random-drop]
6.3: TCP Congestion avoidance
How did TCP get this job? Part of the goal is good, STABLE performance
for our own connection, but part is helping everyone else.
rate-based v window-based congestion management
self-clocking: sliding windows itself keeps the number of outstanding packets constant
RTT = Round Trip Time, SWS = Sender Window Size
RTTnoload = travel time with no queuing
(RTT-RTTnoload) = time spent in queues
(SWS/RTT) × (RTT-RTTnoload) = number of packets in queues, usually
all at one router (the "bottleneck" router, right before the slowest
link); SWS/RTT here is the sending rate. Note that the sender can calculate this (assuming we can
estimate RTTnoload).
Note that TCP's self-clocking (ie that new transmissions are paced by
returning ACKs) is what guarantees that the queue will build only at
the bottleneck router. Self-clocking means that the rate of packet
transmissions is equal to the available bandwidth of the bottleneck
link. All the other links have higher bandwidth, and can therefore
handle the current connection's packets as fast as they are
transmitted. There are some spikes when a burst of packets is sent (eg
when the sender increases its window size), but in the steady state
self-clocking means that packets accumulate only at the bottleneck.
The ideal window size for a TCP connection would be bandwidth × RTTnoload. With this window size, we have exactly filled the transit capacity along the path to our destination, and we have used none of the queue capacity.
Actually, TCP does not do this.
Instead, TCP
- guesses at a reasonable initial window size
- slowly increases the window size if no losses occur, but rapidly decreases it otherwise
The idea is that there is a time-varying "magic ceiling" of packets the
network can accept. We try to stay near but just below this level.
Occasionally we will overshoot, but this just teaches us a little more
about what the "magic ceiling" is.
Actually, as we'll see, this model isn't quite right, but it's worked well enough.
Also, it's time to admit that there are multiple versions of TCP here,
each incorporating different congestion-management algorithms. The two
we will start with are TCP Tahoe (1988) and TCP Reno (1990); the names
Tahoe and Reno were originally the codenames of the Berkeley Unix
distributions that included the respective TCP implementations. TCP
Tahoe came from a 1988 paper by Van Jacobson entitled Congestion Avoidance and Control; TCP Reno then refined this a couple years later.
Originally, the SWS for a TCP connection came from the value suggested by the receiver,
essentially representing how many packet buffers it could allocate.
This value may be so large that it contributes to network congestion,
however, and so it is usually reduced. When the SWS is adjusted out of concern
for congestion, it is generally thought of as the CongestionWindow, or cwnd (a variable name in a BSD implementation). Strictly speaking, SWS = min(cwnd, AdvertisedWindow).
Congestion Avoidance: Additive Increase / Multiplicative Decrease
The name "congestion avoidance phase" is given to the stage where TCP
has established a reasonable guess for cwnd, and wants to engage in
some fine-tuning. The central observation is that when a packet is
lost, cwnd should decrease rapidly, but otherwise should increase
"slowly". The strategy employed is called additive increase, multiplicative decrease,
because when a windowful of packets have been sent with no loss we set
cwnd = cwnd+1, but if a windowful of packets involves losses we set
cwnd = cwnd/2. Note that a windowful is, of course, cwnd many packets;
with no losses, we might send successive windowfuls of, say, 20, 21,
22, 23, 24, .... This amounts to conservative "probing" of the network,
trying larger cwnd values because the absence of loss means the current
cwnd is below the "magic ceiling".
If a loss occurs our goal is to cut the window size in half
immediately. (As we will see, Tahoe actually handles this in a somewhat
roundabout way.) Informally, the idea is that we need to respond
aggressively to congestion. More precisely, lost packets mean we have filled
the queue of the bottleneck router, and we need to dial back to a level
that will allow the queue to clear. If we assume that the transit
capacity is roughly equal to the queue capacity (say each is equal to
N), then we overflow the queue and drop packets when cwnd = 2N, and so
cwnd = cwnd/2 leaves us with cwnd = N, which just fills the transit
capacity and leaves the queue empty.
Of course, assuming any relationship between transit capacity and queue
capacity is highly speculative. On a 5,000 km fiber-optic link with a
bandwidth of 1 Gbps, the round-trip transit capacity would be about 6
MB. That is much larger than any router queue is likely to be.
The congestion-avoidance algorithm leads to the classic "TCP sawtooth"
graph, where the peaks are at the points where the slowly rising cwnd
crossed above the "magic ceiling".
What might the "magic ceiling" be? It represents the largest cwnd that
does not lead to packet loss, ie the cwnd that at that particular
moment completely fills but does not overflow the bottleneck queue. The
transit capacity of the path (and the queue capacity of the bottleneck
router) is unvarying; however, that capacity is also shared with other
connections and other connections may come and go with time. This is
why the ceiling does vary in real terms. If two other connections share
the path with capacity 60 packets, they each might get about 20 packets
as their share. If one of those connections terminates, the two others
might each rise to 30 packets. And if instead a fourth connection joins
the mix, then after equilibrium is reached each connection might expect
a share of 15 packets.
Speaking of sharing, it is straightforward to show that the
additive-increase/multiplicative-decrease algorithm leads to equal
bandwidth sharing when two connections share a bottleneck link,
provided both have the same RTT. Assume that during any given RTT
either both connections or neither connection experiences packet loss,
and consider cwnd1 - cwnd2. If there is no loss, cwnd1-cwnd2 stays the
same as both cwnds increment by 1. If there is a loss, then both are
cut in half, and so cwnd1-cwnd2 is cut in half. Thus, over time,
cwnd1-cwnd2 is repeatedly cut in half, until it dwindles to
inconsequentiality.
Slow Start
How do we make that initial guess as to the network capacity? And even
if we have such a guess, how do we avoid flooding the network sending
an initial burst of packets?
The answer is slow start. If you are trying to guess a number in a
fixed range, you are likely to use binary search. Not knowing the range
for the "magic ceiling", a good strategy is to guess cwnd=1 at first
and keep doubling until you've gone too far. Then revert to the
previous guess (which worked). That ensures that you are now within 50%
of the true capacity.
This is kind of an oversimplification. What we actually do is to
increment cwnd by 1 for each ACK received. This seems linear, not
exponential, but that is misleading:
after we send a windowful of packets (cwnd many), we have incremented
cwnd-many times, and so have set cwnd to (cwnd+cwnd) = 2*cwnd. In other
words, cwnd=cwnd*2 after each windowful is the same as cwnd+=1 after each packet.
Similarly, during congestion-avoidance, we set cwnd = cwnd+1 after each windowful. Since the window size is cwnd, this amounts to cwnd = cwnd + 1/cwnd after each packet
(this is an approximation, because cwnd keeps changing, but it works in
practice if your TCP driver is willing to engage in floating-point
arithmetic). An exact formula is cwnd = cwnd + 1/cwnd0, where cwnd0
is the value of cwnd at the start of that particular windowful.
Another, simpler, approach is to use cwnd += 1/cwnd, and to keep the
fractional part recorded, but to use floor(cwnd) (the integer part of
cwnd) when actually sending packets.
Assuming packets travel together, this means cwnd doubles each RTT. Eventually the network gets "full", and drops a packet.
Let us suppose this is after N RTTs, so cwnd = 2^N. Then during the previous RTT, cwnd = 2^(N-1) worked successfully, so we go back to that previous value by setting cwnd = cwnd/2.
Sometimes we will use Slow Start even when we know the working network
capacity. After a packet loss, we halve the previous cwnd and this
gives us a pretty good idea of what to expect. If cwnd had been 100, we
halve it to 50. However, after a packet loss, there are no returning ACKs to self-clock our transmission, and we do not
want to dump 50 packets on the network all at once. So we use what
might be called "threshold" slow-start: we use slow-start, but stop
when cwnd reaches the target.
The simplified algorithm is thus to use slow-start until the first
packet loss, and then halve cwnd and switch to the congestion-avoidance
phase.
Actually, on every packet loss (including the original slow-start
loss), we use "threshold" slow-start to ramp up again, stopping when
cwnd reaches half its previous value. More precisely, when a packet
loss occurs, we set the slow-start threshold, ssthresh,
equal to half of the value of cwnd at the time of the loss; this is our
target new cwnd. We then set cwnd=1, and begin slow-start mode, up
until cwnd==ssthresh. At that point, we revert to congestion-avoidance.
Review of TCP so far
- slow start + congestion avoidance
- We need threshold slow-start (slow-start with ssthresh) after each loss
- Slow-start and congestion-avoidance have to work together
- self-clocking
Note that everything is expressed here in terms of manipulating cwnd.
Summary:
phase      | cwnd change, loss | cwnd change, no loss
           | (per window)      | (per window) | (per ACK)
slow start | cwnd/2            | cwnd *= 2    | cwnd += 1
cong avoid | cwnd/2            | cwnd += 1    | cwnd += 1/cwnd
TCP idealized sawtooth v approximation
real situation: the sender realizes a lost packet is lost only after
protracted continued sending, at which point the queue will take quite
a while to drain.
fast retransmit: TCP Tahoe.
If we send packets 1,2,3,4,5,6 and get back ACK1, ACK2, ACK2, ACK2, ACK2 we can infer a couple things:
- data 3 got lost, which is why we're stuck on ACK2
- data 4,5,6 did make it through, and triggered the three duplicate ACK2s (the three ACK2s following the first ACK2).
Fast Retransmit is the name
given to this idea as incorporated in TCP. On the third dupACK, we
retransmit the lost packet. We also set ssthresh = cwnd/2, and cwnd=1;
we resume transmitting when the ACK of the lost packet arrives.
TCP and one connection
nam demo
Note interaction between queue size and pipe size
Single sender situation
example A-----R----slow---B, with R having queue size of 4
bottleneck_queue >= bandwidth×RTTnoload
Fast Recovery / TCP Reno
Fast Retransmit requires us to go to cwnd=1 because there are no
arriving packets to pace transmission. Fast Recovery is a workaround:
we use the arriving dupACKs to pace retransmission. The idea is to set
cwnd=cwnd/2, and then to figure out how many dupACKs we have to wait
for. Let cwnd = N, and suppose we've sent packets 1-N and packet 1 is
lost. We will then get N-1 dupACKs for packets 2-N.
During the recovery process, we will ignore SWS and instead use the concept of Estimated FlightSize,
or EFS, which is the sender's best guess at the number of outstanding
packets. Under normal circumstances, EFS = cwnd (except for that tiny
interval between when an ACK arrives and we send out the next packet).
At the point of the third dupACK, the sender calculates as follows: EFS
had been N. However, one of the packets has been lost, making it N-1.
Three dupACKs have arrived, representing three later packets no longer
in flight, so EFS is now N-4. Fast Retransmit has us retransmit the
packet that we inferred was lost, so EFS increments by 1, to N-3.
Our target new cwnd is N/2. So, we wait for N/2 - 3 more dupACKs to
arrive, at which point EFS is N-3-(N/2-3) = N/2. We now send one new packet for each arriving dupACK following,
up until we receive the ACK for the lost packet (it will actually be a
cumulative ack for all the later received packets as well). At this
point we declare cwnd = N/2, and keep going. As EFS was already N/2, sliding windows resumes smoothly, with no sudden burst of packets.
Fast-recovery detailed diagram: SWS = 10, packet 10
lost, packets 11-19 resulting in dupACK9's. The third dupACK9 is "really" from
data13; at that point we retransmit data10. EFS = 5 when we get two
more dupACK9's (from data14 and data15). The next dupACK9 (really from
data16) has us transmit data20:

dupACK9/16    data20
dupACK9/17    data21
dupACK9/18    data22
dupACK9/19    data23
ACK19         data24    (ACK19 is sent when the receiver gets the retransmitted data10)
ACK20         data25    (cwnd = 5 now)
EFS: Estimated FlightSize: sender's estimate of how many packets are in transit, one way or the other. This replaces SWS.
cwnd inflation/deflation
The original algorithm had a more complex strategy; note that strictly
speaking, Reno DOES break the sliding-windows rules during fast
recovery.
NewReno tweak: better handling of the case when two packets were lost.
If packets 1 and 4 are lost in a window 0,1,2,3,4,5,6,7, then initially
we will get dupACK0's. When packet 1 is successfully retransmitted, we
start getting dupACK3's. This particular window is much too small to
wait for 3 dupACKs. NewReno is essentially the "natural" approach of
continuing fast recovery through the second loss: the "partial" ACK3
for the retransmitted packet 1 is taken as evidence that packet 4 was
lost too, and it is retransmitted immediately.
ns/nam demos
- basic1: one-way-propagation = 10, run-time = 8 sec, note
unbounded slow-start, bounded slow-start, congestion-avoidance. Loss at
~3.5 sec with complex recovery
- basic1 variations: run-time = 100 sec, bandwidth = 100 pkts/sec (so the
theoretical max is 10,000 packets); queue size = 5 gives 8490 packets:
queue size | throughput
         3 |       7836
         5 |       8490
        10 |       9306
        20 |       9512
        50 |       9490
       100 |       9667
- basic2: two connections sharing a link and competing for bandwidth. Demo. Note short-term
lack of obvious fairness. Note also that when we increase the delay for
the first link, while we expect a slow decline in the share of that
link, in fact we get widely ranging values:
# delay bw1 bw2
# 10 1216 1339
# 12 1339 1216
# 14 1396 1151
# 16 1150 1396
# 18 1142 1396
# 20 1903 737
# 22 1147 1396
# 24 1156 1348
# 26 1002 1472
# 28 1337 1112
# 30 2106 608
# 40 948 1566
# 50 1033 1515
# 90 453 2066
#100 412 2127
#110 730 1877
#120 432 2121
#130 515 2038
#140 289 2472
#150 441 2065
#160 567 1820
#170 1265 1286
#180 530 1914
#190 490 1849
#200 446 1991
TCP Fairness
Example 1: SAME rtt; both connections get approximately equal allocations, as we saw above.
Example 2: different RTT
Classic version:
Connection 2 has twice the RTT of connection 1.
As before, we assume both connections lose a packet when cwin1+cwin2 > 10; otherwise neither loses.
Both start at 1. Connection 2's RTT is double, so each of its cwnd
values spans two of connection 1's RTTs; * marks a loss:

connection 1:  1  2  3  4  5  6  7*   3  4  5  6  7  8*   4  5  6  7  8  9*   4  5  6  7  8  9*  ...
connection 2:  1     2     3     4*      2     3     4*      2     3     4*      2     3     4*
Connection 2 averages half the window size. Since the time it takes it to send a window is doubled,
its throughput is down by a factor of FOUR.
What is really going on here? Is there something that can be "fixed"?
Early thinking (through the 90's?) was that there was.
Current thinking is to DEFINE what TCP does as "fair", and leave it at that.