Comp 343/443
Fall 2011, LT 412, Tuesday 4:15-6:45
Week 13, Dec 6
Read:
Ch 3, sections 1, 2, 3
Ch 5, sections 1 (UDP) and 2 (TCP)
Ch 6, sections
Previously:
- TCP Tahoe & Fast Retransmit, SlowStart, CongestionAvoidance, ssthresh
- TCP Reno & Fast Recovery
- TCP fairness
TCP loss rate, p, versus cwnd
Let p be the loss rate, in packets, eg p = 0.001 means one packet
in 1000 is lost. Then, for some constant k (in the range 1.2 to 1.5),
we have
cwnd = k/sqrt(p)
and so
bandwidth = cwnd/RTT = k/(RTT×sqrt(p))
This relationship comes from the fact that cwnd reaches an equilibrium
based on how often we have cwnd=cwnd/2 events counterbalancing cwnd+=1
events, and the former rate is the probability of loss in a windowful
which is directly related to the packet-loss rate p.
Explanation in terms of TCP sawtooth
Let us assume we lose a packet at regular intervals, eg after every N
windowfuls, and that cwnd varies from M at the start of each window to
2*M at the end (thus reverting to M when we set cwnd = cwnd/2). Then,
in N windowfuls, cwnd was incremented N times, and so M+N = 2M and so
M=N.
This means we sent N + (N+1) + (N+2) + ... + 2N packets before a loss. This
number is about (3/2)N^2. The loss rate is thus p = 1/((3/2)N^2), and solving
for N we get N = (2/3)^0.5 × 1/sqrt(p). The average cwnd is (3/2)N, so
cwnd_average = (3/2)×(2/3)^0.5 / sqrt(p) ≃ 1.225 / sqrt(p). More commonly in
the literature we are interested in the maximum cwnd; applying the same
technique gives cwnd_max = 2×(2/3)^0.5 / sqrt(p) ≃ 1.633 / sqrt(p).
High-bandwidth TCP
consequence for high bandwidth: the cwnd needed implies a very small p; unrealistically small!
Random losses (not due to congestion) keep window significantly smaller than it should be.
TCP Throughput (Mbps) | RTTs between losses | cwnd  | P (loss probability)
          1           |        5.5          |   8.3 | 0.02
         10           |         55          |    83 | 0.0002
        100           |        555          |   833 | 0.000002
       1000           |       5555          |  8333 | 0.00000002
     10,000           |      55555          | 83333 | 0.0000000002

Table 1: RTTs Between Congestion Events for Standard TCP, for
1500-Byte Packets and a Round-Trip Time of 0.1 Seconds.
Packet Drop Rate P | cwnd    | RTTs between losses
10^-2              |      12 |      8
10^-3              |      38 |     25
10^-4              |     120 |     80
10^-5              |     379 |    252
10^-6              |    1200 |    800
10^-7              |    3795 |   2530
10^-8              |   12000 |   8000
10^-9              |   37948 |  25298
10^-10             | 120,000 |  80000

Table 2: TCP window size in terms of drop rate
The above two tables indicate that large window sizes require extremely small drop rates. This is the highspeed-TCP problem: how do we maintain a large window? The issue is that non-congestive (random) packet losses bring the window size down, far below where it could be.
One proposed fix: HighSpeed-TCP: for each no-loss RTT, allow an inflation of cwnd by more than 1, at least for large cwnd. If the increment is N = N(cwnd), this
is equivalent to having N parallel TCP connections.
Congestion Window W | Number N(W) of Parallel TCPs
      1             |  1.0
     10             |  1.0
    100             |  1.4
  1,000             |  3.6
 10,000             |  9.2
100,000             | 23.0

Table 3: Number N(cwnd) of parallel TCP connections roughly emulated by the HighSpeed TCP response function.
The formula for N(cwnd) is largely empirical.
N(cwnd) = max(1.0, 0.23 × cwnd^0.4)
Increased window size is not "smooth"
Note the second term in the max() above begins to dominate when cwnd = 38 or so
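The response function is simple to sketch; this reproduces the Table 3 values:

```python
def n_parallel(cwnd):
    """HighSpeed-TCP per-RTT increment N(cwnd): the rough number of
    parallel standard TCP connections being emulated."""
    return max(1.0, 0.23 * cwnd ** 0.4)

# N(10) = 1.0 (the max() floor), N(1000) ≃ 3.6, N(100000) ≃ 23.0,
# matching Table 3; the 0.23×cwnd^0.4 term first exceeds 1.0 near cwnd = 38.
```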
TCP Friendliness
Suppose we are sending audio data in a congested environment. Because
of the real-time nature of the data, we can't wait for lost-packet
recovery, and so must use UDP rather than TCP. (Actually, we could use TCP unless the data is interactive;
that is, we can perfectly well use TCP to receive streaming audio
broadcasts. And if it's interactive, it's likely 8KB/sec voice, where
rate adjustment is impractical. Maybe video would be a better example?)
We suppose we can adjust the transmission rate as needed, but would
like to keep it relatively high.
How are we to manage congestion? How are we to maximize bandwidth without treating other connections unfairly?
A further problem with TCP is the sawtooth variation in cwnd (leading to at least some sawtooth variation in throughput). We don't want that.
TFRC
TFRC, or TCP-Friendly Rate Control, uses the loss rate experienced, p, and the formulas above to calculate
a sending rate. It then allows sending at that rate; that is, TFRC is rate-based rather than window-based. As the loss rate
increases, the sending rate is adjusted downwards, and so on. However,
adjustments are done more smoothly than with TCP.
From RFC 5348:
TFRC is designed to be reasonably fair
when competing for bandwidth with TCP flows, where we call a flow
"reasonably fair" if its sending rate is generally within a factor of two
of the sending rate of a TCP flow under the same conditions. [emphasis
added; a factor of two might not be considered "close enough" in some
cases.]
The penalty of having smoother throughput than TCP while competing
fairly for bandwidth is that TFRC responds slower than TCP to changes
in available bandwidth.
TFRC senders include in each packet a sequence number, a timestamp, and an estimated RTT.
The TFRC receiver is charged with sending back feedback packets,
which serve as (partial) acknowledgements, and also include a
receiver-calculated value for the loss rate, over the previous RTT. The
response packets also include information on the current actual RTT,
which the sender can use to update its estimated RTT. The
TFRC receiver might send back only one such packet per RTT.
The actual response protocol has several parts, but if the loss rate increases, then the primary
feedback mechanism is to compute a new (lower) sending rate, and shift
to that. The rate would be cut in half only if the loss rate P
quadrupled.
Newer versions of TFRC have various features for responding more
promptly to an unusually sudden problem, but in normal use the
calculated sending rate is used most of the time.
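A sketch of the rate calculation, using the cwnd = k/sqrt(p) relationship from above (this is a simplified stand-in for the full RFC 5348 throughput equation, which adds a retransmission-timeout term):

```python
import math

def tfrc_rate(mss_bytes, rtt_sec, p, k=1.225):
    """Simplified TFRC-style sending rate in bytes/sec:
    rate = cwnd/RTT with cwnd = k/sqrt(p) (in packets)."""
    return k * mss_bytes / (rtt_sec * math.sqrt(p))

r1 = tfrc_rate(1500, 0.1, 0.0001)
r2 = tfrc_rate(1500, 0.1, 0.0004)   # loss rate quadrupled
# r2 == r1/2: quadrupling p halves the calculated sending rate
```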
AIMD
Another approach to TCP friendliness is to retain TCP's
additive-increase, multiplicative-decrease strategy, but to change the
numbers. Suppose we denote by AIMD(α,β) the strategy of incrementing
the window size by α after a window of no losses, and multiplying the
window size by (1-β)<1 on loss. TCP is thus AIMD(1,0.5). As β gets
closer to 0, the protocol can remain TCP-friendly by appropriately
reducing α; eg AIMD(0.2, 1/8). Generally, given a β, the corresponding
α is about 3β/(2-β), or about 1.5β for small β.
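The α-from-β rule can be written out directly; note it recovers both standard TCP and the AIMD(0.2, 1/8) example:

```python
def tcp_friendly_alpha(beta):
    """Additive increment alpha making AIMD(alpha, beta) roughly
    TCP-fair, using the alpha = 3*beta/(2-beta) rule."""
    return 3 * beta / (2 - beta)

# tcp_friendly_alpha(0.5)   -> 1.0, i.e. AIMD(1, 0.5) = standard TCP
# tcp_friendly_alpha(0.125) -> 0.2, the AIMD(0.2, 1/8) example
```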
Having a small β means that a connection doesn't have sudden bandwidth
drops when losses occur; this can be important for applications that
rely on a regular rate of data transfer (such as voice). Such applications are sometimes said to be slowly responsive, in contrast to TCP's cwnd = cwnd/2 fast-response.
RTP
RTP is sometimes (though not always) coupled with TFRC: TCP-Friendly Rate Control
(RTP is discussed in P&D 3rd edition in section 9.3.1, but in the 4th edition it's been moved to section 5.4)
- establish RATE of sending packets
- periodic ACKs return summaries of loss rates
- suitable for MULTICAST use: greatly limits feasible ACK rates
- Adjust sending up/down based on loss rate and TCP cwnd=1.5/sqrt(P) rule
- usually some sort of "stability" rule
- on loss, reduce by less than half
Satellite Internet: web acceleration
Here the problem is that RTT is sooo large.
SACK TCP
What about cumulative ACKs? Are they part of the problem?
SACK TCP is TCP with Selective ACKs, so we don't just guess from
dupACKs what has gotten through. We can receive an ACK that says:
- All packets up through 1000 have been received
- All packets up through 1050 have been received except for 1001, 1022, and 1035.
In practice, not nearly as useful as one might imagine.
Reno does pretty well, in low-loss environments.
Actually, SACK TCP includes the following in its acknowledgements:
- The latest cumulative ACK
- The three most recent blocks of consecutive packets received.
Thus, if we've lost 1001, 1022, 1035, and now 1051, and the highest packet received is 1060, the ACK might say:
- All packets up through 1000 have been received
- Packets 1052-1060 have been received
- Packets 1036-1050 have been received
- Packets 1023-1034 have been received
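Building such a report from the set of received packets is straightforward; a sketch (in packet numbers for clarity, though real SACK options carry byte ranges and at most three blocks):

```python
def sack_report(received, latest=3):
    """Return (cumulative ACK, recent blocks of consecutively
    received packets above the cumulative ACK)."""
    received = sorted(received)
    cum = 0                        # highest n with 1..n all received
    for n in received:
        if n == cum + 1:
            cum = n
        else:
            break
    blocks = []                    # runs of consecutive packets above cum
    for n in received:
        if n <= cum:
            continue
        if blocks and n == blocks[-1][1] + 1:
            blocks[-1] = (blocks[-1][0], n)   # extend current run
        else:
            blocks.append((n, n))             # start a new run
    return cum, blocks[-latest:]
```

For example, with packets 1-10, 12-14, and 17-18 received, the report is cumulative ACK 10 plus blocks (12,14) and (17,18).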
ECN TCP (below) may be just as effective
Active Queue Management
= routers doing some active signaling to manage their queues
What is congestion?
Definition 1: queue size reaches maximum; packet losses occur
Definition 2: queue size is nonzero; delays occur
Dealing with timing:
PacketPairs
Send two packets in rapid succession;
do this multiple times; measure the minimum difference in the arrival times.
Assume at the minimum the two packets were transmitted
consecutively by the bottleneck router; then bandwidth = size of 1st packet / time gap
This gives a glimpse at the basic network capacity of a link; it doesn't shed much light on average share
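The measurement itself is just a minimum over observed gaps; a sketch:

```python
def packet_pair_bandwidth(first_packet_size, gap_times):
    """Packet-pair estimate of bottleneck capacity: over many
    back-to-back pairs, take the minimum inter-arrival gap and assume
    the bottleneck router transmitted that pair consecutively."""
    min_gap = min(gap_times)               # seconds
    return first_packet_size / min_gap     # bytes/sec

# eg 1500-byte packets, smallest observed gap 1.2 ms -> about 1.25 MB/s
bw = packet_pair_bandwidth(1500, [0.0041, 0.0012, 0.0025])
```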
DECbit
This is like ECN (below)
Goal: early detection of congestion
DECbit: use AIMD (+1, *.875) to shoot for 50% "congested"
(where congested means the average queue size is > 0)
Note that DECbit is shooting for limited queue utilization; this is "congestion" in the second sense.
RED gateways
Basic idea: improve TCP performance by dropping a few packets when the queue is, say, half full.
TCP Reno: behaves badly with multiple losses/window;
RED tends to minimize that.
AvgLen: weighted average queue length
dropped packets are spaced more-or-less uniformly
Explicit Congestion Notification (ECN)
routers set bit when we might otherwise drop the packet
(possibly when queue is half full, or in lieu of RED drop)
receivers: cwnd = cwnd/2
Biggest advantage of ECN: the receiver discovers the congestion
immediately, rather than waiting for the existing queue to be
transmitted. Dropped packets are dropped upon arrival at the bottleneck router, but loss is not discovered until the
queue is transmitted and three subsequent packets are sent.
RED gateway can set the ECN flag instead of dropping
To enable ECN, two bits are involved. One is set by the sender to
indicate that it is ECN-aware; the other is set by the router and
"echoed" by the receiver in the ACK packet to indicate that congestion
was experienced.
TCP/Vegas
Goal: as a sender, try to minimize congestion in the second sense above; try to scale back cwnd as soon as "backups" occur.
To do this, we introduce the notion of "extra packets"
We can measure bandwidth as ack_rate × packet_size
Queue_use = bandwidth × (RTT - RTTno_load)
We will estimate RTTno_load by the minimum RTT (RTTmin)
bandwidth is easy to estimate; call it BWE; it is simply cwnd/RTT.
"Ideal" cwnd: BWE×RTT_min
Goal: adjust cwnd so BWE×RTTmin + α ≤ cwnd ≤ BWE×RTTmin + β
Typically α = 2-3 packets, β = 4-6
Add 1 to cwnd if we dip down to α,
subtract 1 from cwnd if we rise to β (do NOT divide in half!!)
Book: Diff = ExpectedRate - ActualRate;
ie express in terms of rate rather than bytes
(This is the original, "old-fashioned" TCP Vegas exposition. Larry Peterson was one of the developers of TCP Vegas.)
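A per-RTT Vegas adjustment along the lines above, as a sketch (packet units; α=3, β=6 from the typical ranges given):

```python
def vegas_update(cwnd, rtt, rtt_min, alpha=3, beta=6):
    """One TCP Vegas-style per-RTT cwnd adjustment.  BWE = cwnd/RTT;
    the 'extra' packets queued are cwnd - BWE*RTTmin.  Nudge cwnd by
    1 in either direction; never divide in half."""
    bwe = cwnd / rtt
    queued = cwnd - bwe * rtt_min   # estimated packets sitting in queues
    if queued < alpha:
        return cwnd + 1             # too little queue pressure: speed up
    if queued > beta:
        return cwnd - 1             # queues building: back off gently
    return cwnd

# eg cwnd=40, RTT=0.12s, RTTmin=0.1s: about 6.7 extra packets queued,
# which exceeds beta, so cwnd drops to 39 (not to 20).
```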
Why it doesn't compete well with TCP Tahoe/Reno
TCP/Westwood
TCP/Westwood represents an attempt to use the RTT-monitoring strategies
of TCP/Vegas to address the high-bandwidth problem. Recall that the
issue is to distinguish between congestive losses and random losses.
The sender keeps a continuous estimate of bandwidth, BWE (= ack rate * packet size); BWE * RTTmin = minimum window size to keep bottleneck link busy (as in TCP/Vegas, RTTmin represents our best guess at RTTnoload.)
Here is the TCP/Westwood innovation: on loss, reduce cwnd to max(cwnd/2, BWE*RTTmin). That is, we never drop below the "transit capacity" for the path.
Classic sawtooth, TCP Reno:
- cwin alternates between cwin_min and cwin_max = 2*cwin_min.
- cwin_max = transit_capacity + queue_capacity (to first approximation)
If transit_capacity < cwin_min, then Reno does a pretty good job keeping the bottleneck link saturated.
But if transit_capacity > cwin_min, then when Reno drops to
cwin_min, the bottleneck link is not saturated until cwin climbs to
transit_capacity. Westwood, on the other hand, would in that situation reduce cwin to transit_capacity, a
smaller reduction.
What about random losses?
Note that if there is a loss when cwin < transit_capacity, then
Westwood does not reduce the window size at all! So random losses with
cwin<transit_capacity have no effect. When cwin >
transit_capacity, losses reduce us only to transit_capacity, and thus
the link stays saturated.
Reno: on random loss, cwin = cwin/2
Westwood: On random loss, drop back to transit_capacity; if cwin < transit_capacity, don't drop at all!
In
Westwood, we use BWE×RTTmin as a "floor" for reducing cwnd. In Vegas, we are
shooting to have the actual cwnd be just a few packets above this.
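The Westwood loss response is a one-liner; the min() cap encodes the point above that when cwnd is already below the transit capacity, a random loss causes no reduction at all:

```python
def westwood_on_loss(cwnd, bwe, rtt_min):
    """TCP Westwood's response to a loss event: reduce cwnd to
    max(cwnd/2, BWE*RTTmin), ie never below the estimated transit
    capacity, and never above the current cwnd."""
    transit_capacity = bwe * rtt_min
    return min(cwnd, max(cwnd / 2, transit_capacity))

# With transit capacity 80 packets: a loss at cwnd=100 drops only to 80
# (Reno would drop to 50); a loss at cwnd=60 causes no reduction at all.
```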
BGP (Border Gateway Protocol)
Internet routing is large and complex, with different sites having
different and sometimes even conflicting goals.
Why we need external routing:
we can't compare internal metrics with someone else's. It just does not work.
Metrics may be based on:
- hopcount
- RTT
- bandwidth
- cost
- congestion
One provider's metric may even use larger numbers for better routes.
An Autonomous System is a domain in which one consistent metric is used; typically administered by a single organization.
Between AS's we can't use cost info. Lots of problems come up as a result.
BGP basics: how AS's actually talk to each other.
Autonomous Systems
Routing reduced to finding an AS-path!
EGP (predecessor to BGP) and tree structure
configurable for preferences
For each destination:
1. Receive lots of routes from neighbors; filter INPUT
2. Choose route we will use:
- eliminate AS_PATH loops
- apply local preference
- apply MED
- break ties by choosing routes through fewer ASs, etc
3. Decide whether we will advertise that route: filter OUTPUT
Rule: we can only advertise routes we actually use!
- local traffic v transit traffic
- configurable for supporting transit routing or not
- ASpath info, and loop avoidance
- instability
- MED values ("multi exit discriminator")
BGP: important part of network management at ISP level
BGP relationships:
customer-provider: provider agrees to handle transit for customer. Customer advertises its own routes only!
siblings: often provide mutual backup; not "normal" transit
peers:
large providers exchanging all customer traffic with
each other; advertise all routes to each other
- Export ONLY customer routes to other peers
- Export all peer routes to its customers
In a general sort of way, we advertise routes UPWARDS.
Well, provider routes as "big blocks" do get advertised downwards.
Every AS exports its OWN routes and OWN customers' routes
Internally, we might also rank customer routes over peer routes (illustrate)
customers DO NOT export provider/peer routes to providers
Providers DO export provider/peer routes to customers (often aggregated)
Peers DO NOT export provider/peer routes to each other
(Peers (usually) DO NOT provide transit services to third parties.)
What if small ISP A connects to providers P1 and P2?
A negotiates rules as to what traffic it will send to P1 & what to P2
Then A uses BGP to implement route advertisements (& route learning)
A might advertise its customers to both P1 and P2.
If A "learns" of a route from P1 only, then A will use P1 for routing,
even if P2 advertises a route too. This illustrates INPUT FILTERING.
siblings DO export provider/peer routes to one another
Tier1: about 15 ASs with no providers: AT&T, Sprint, UUNET, ...
Transit Core: tier-1s and ASs that peer with those AND EACH OTHER
Regional ISPs: ~2000 of them (Rexford notes)
Stub AS: no peers, no customers
BGP options regarding a hypothetical private link between Loyola and Northwestern:
----ISP1---nwu
|
|link1
|
----ISP2---luc
- nwu,luc don't export link1: no transit at all
- Export but have ISP1, ISP2 rank at low preference: used for backup only; ISP1 prefers route to luc through ISP2
- Have luc have a path to ISP1 via link1; that won't be used unless
luc starts to route to ISP1 via link1, eg if ISP2 reports ISP1 is
unreachable...
No-valley theorem: at most one peer-peer link; LHS are cust->prov or sib->sib links
General ideas about routing
- We need aggregated routing for table-size efficiency (desperately!)
- There is often a "natural" routing hierarchy, eg provider-based
- CIDR allows us to allocate addresses consistent with the routing hierarchy
- Routing "hierarchy" is often just an approximation; there are
lots of exception cases that are dealt with via extra table entries.
- Longest-match is to allow moving in the hierarchy without renumbering, and multi-homing (multiple attachments) to the hierarchy.
Quality of Service
QoS issues:
playback buffer
fine-grained (per flow) v coarse-grained (per category)
Token Bucket flowspecs
Token bucket flow specification: token rate r bytes/sec, bucket depth B.
Bucket fills at rate specified, does not get fuller than B
When a packet of size S needs to be sent, S tokens are taken from B
(B = B-S)
B represents a "burst capacity".
B = size of queue needed, if outbound link rate is r
Used for input control:
if a packet arrives and the bucket is empty, it is discarded,
or marked "noncompliant"
Used for shaping:
Packets wait until there is sufficient capacity.
This is what happens if the outbound link rate is r, and B (thus) represents the queue capacity.
Simple bandwidth summation; bucket depth represents queue capacity needed
for bursts
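The policing use of the bucket can be sketched directly from the spec above (rate r bytes/sec, depth B bytes):

```python
class TokenBucket:
    """Token-bucket policer sketch: tokens arrive at rate r bytes/sec;
    the bucket holds at most B bytes (the burst capacity)."""

    def __init__(self, rate, depth):
        self.rate = rate        # r, bytes/sec
        self.depth = depth      # B, bytes
        self.tokens = depth     # start with a full bucket
        self.last = 0.0

    def compliant(self, size, now):
        """True if a size-byte packet arriving at time `now` conforms;
        a non-conforming packet is dropped or marked noncompliant."""
        self.tokens = min(self.depth,
                          self.tokens + self.rate * (now - self.last))
        self.last = now
        if size <= self.tokens:
            self.tokens -= size     # S tokens are consumed
            return True
        return False
```

With r = 1000 bytes/sec and B = 2000, a 1500-byte packet at t=0 conforms, a second 1500-byte packet at t=0.5 does not (only 1000 tokens have accumulated), and a 900-byte packet at t=1.0 conforms again.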
Admission control:
- calculation for when a flow spec can be satisfied
- noncompliant (with bucket filter) packets can have lower priority
Integrated Services / RSVP
Each flow can make a reservation with the routers.
Note that this is a major retreat from the datagram-routing stateless-router model. Were virtual circuits the better routing model all along?
Routers maintain SOFT STATE about a connection, not hard state!
Can be refreshed if lost (though with some small probability of failure)
RESV packets: move backwards in very special way (NOT sent from receiver to sender)
PATH message contains Tspec and goes from sender to receiver.
Each router figures out reverse path.
RESV packet is sent along this reverse path by receiver.
Compatible w. multicast
Problem: too many reservations
And how do we decide who gets to reserve what?
Two models:
1. Charge $ for reservations
2. Anyone can ask for a reservation, but the answer may be "no". Maybe there would be a cap on size
Differentiated Services
basically just two service classes: high and low (now three levels)
Rules on which packets can be "premium": max rate from border router?
Goal: set some rules on admitting premium packets, and hope that
their total numbers to any given destination is small enough that we
can meet service targets (not exactly guarantees)
Packets are marked at ingress. This simplifies things.
Example: VOIP
The ISP (not
the user!) would mark VOIP packets as they enter, subject to some
ceiling. These are routed internally (within the ISP) with premium
service. The ISP negotiates with its ISP for a total bulk delivery of premium packets.
One possible arrangement is that the leaf ISPs do use RSVP, but the Internet core runs DS
Packets are DS-marked as they enter the core, based on their RSVP status
DS field:
6 bits; 3+3 class+drop_precedence
Two basic strategies: "Expedited Forwarding" (EF) and "Assured Forwarding" (AF).
101 110 "EF", or "Expedited Forwarding": best service
Assured Forwarding: 3 bits of Class, 3 bits of Drop Precedence
Class:
100 class 4: best
011 class 3
010 class 2
001 class 1
Drop Precedence:
010: don't drop
100: medium
110: high
Main thing: The classes each get PRIORITY service, over best-effort.
DS uses IP4 TOS field, widely ignored in the past.
Routers SHOULD implement priority queues for service categories
Basic idea: get your traffic marked for the appropriate class.
Then what?
000 000: current best-effort status
xxx 000: traditional IPv4 precedence
PHBs (Per-Hop Behaviors): implemented by all routers
Only "boundary" routers do traffic policing/shaping/classifying/re-marking
to manage categories (re-marking is really part of shaping/policing)
EF: Expedited Forwarding
basically just higher-priority. Packets should experience low queuing delay.
Maybe not exactly; we may give bulk traffic some guaranteed share
Functionality depends on ensuring that there is not too much EF traffic.
Basically, we control at the boundary the total volume of EF traffic
(eg to a level that cannot saturate the slowest link), so that we have
plenty of capacity for EF traffic. Then we just handle it at a higher
priority.
This is the best service.
EF provides a minimum-rate guarantee.
This can be tricky: if we accept input traffic from many sources,
and have four traffic outlets R1, R2, R3, R4, then we should only accept enough EF traffic that any one Ri can handle it.
But we might go for a more statistical model, if in practice
1/4 of the traffic goes to each Ri.
========================
AF: Assured Forwarding
Simpler than EF, but no guarantee. Traffic totals can be higher.
There is an easy way to send more traffic: it is just marked with a higher drop precedence.
In-out marking: each packet is marked "in" or "out" by the policer.
Actually, we now have three precedence levels to use for marking.
The policer can be in the
end-user network (though "re-policing" within the ISP, to be sure the
original markings were within spec, is appropriate). But the point is
that the end-user gets to choose which packets get precedence, subject to some total ceiling.
From RFC2597:
The drop precedence level of a packet could be assigned, for example,
by using a leaky bucket traffic policer, which has as its parameters
a rate and a size, which is the sum of two burst values: a committed
burst size and an excess burst size. A packet is assigned low drop
precedence if the number of tokens in the bucket is greater than the
excess burst size [ie bucket is full], medium drop precedence if
the number of tokens in the bucket is greater than zero, but at most
the excess burst size, and high drop precedence if the bucket is empty.
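The RFC 2597 marking rule quoted above reduces to a three-way test on the policer's bucket state:

```python
def drop_precedence(tokens, excess_burst_size):
    """Assign an AF drop-precedence level from the leaky-bucket
    policer's token count, per the RFC 2597 passage above."""
    if tokens > excess_burst_size:
        return "low"       # bucket (nearly) full: in-profile traffic
    if tokens > 0:
        return "medium"    # between empty and the excess burst size
    return "high"          # bucket empty: most likely to be dropped
```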
Packet mangling to mark DS bits, plus a goodly number of priority bands
for the drop precedences
(not sure how to handle the different classes; they might get classful
TBF service)
Fits nicely with RIO routers: RED with In and Out (or In, Middle,
and Out): each traffic "level" is subject to a different drop threshold