Comp 343/443
Fall 2011, LT 412, Tuesday 4:15-6:45
Week 13, Dec 6
Read:
Ch 3, sections 1, 2, 3
Ch 5, sections 1 (UDP) and 2 (TCP)
Ch 6, sections
Previously:
- TCP Tahoe & Fast Retransmit, SlowStart, CongestionAvoidance, ssthresh
- TCP Reno & Fast Recovery
- TCP fairness
TCP loss rate, p, versus cwnd
Let p be the loss rate, in packets, eg p = 0.001 means one packet
in 1000 is lost. Then, for some constant k (in the range 1.2 to 1.5),
we have
cwnd = k/sqrt(p)
and so
bandwidth = cwnd/RTT = k/(RTT×sqrt(p))
This relationship comes from the fact that cwnd reaches an equilibrium
based on how often we have cwnd=cwnd/2 events counterbalancing cwnd+=1
events, and the former rate is the probability of loss in a windowful
which is directly related to the packet-loss rate p.
Explanation in terms of TCP sawtooth
Let us assume we lose a packet at regular intervals, eg after every N
windowfuls, and that cwnd varies from M at the start of each window to
2*M at the end (thus reverting to M when we set cwnd = cwnd/2). Then,
in N windowfuls, cwnd was incremented N times, and so M+N = 2M and so
M=N.
This means we sent N + (N+1) + (N+2) + ... + 2N packets before a loss. This
number is about (3/2)N^2. The loss rate is thus p = 1/((3/2)N^2), and solving
for N we get N = (2/3)^0.5 × 1/sqrt(p). The average cwnd is (3/2)N, so
cwnd_average = (3/2)×(2/3)^0.5 / sqrt(p) ≃ 1.225 / sqrt(p). More commonly in
the literature we are interested in the maximum cwnd; applying the same
technique gives cwnd_max = 2×(2/3)^0.5 / sqrt(p) ≃ 1.633 / sqrt(p).
High-bandwidth TCP
consequence for high bandwidth: the cwnd needed implies a very small p; unrealistically small!
Random losses (not due to congestion) keep window significantly smaller than it should be.
TCP Throughput (Mbps) | RTTs between losses | cwnd  | P (loss probability)
          1           |        5.5          |   8.3 | 0.02
         10           |         55          |    83 | 0.0002
        100           |        555          |   833 | 0.000002
       1000           |       5555          |  8333 | 0.00000002
     10,000           |      55555          | 83333 | 0.0000000002

Table 1: RTTs Between Congestion Events for Standard TCP, for
1500-Byte Packets and a Round-Trip Time of 0.1 Seconds.
Packet Drop Rate P | cwnd    | RTTs between losses
10^-2              |      12 |      8
10^-3              |      38 |     25
10^-4              |     120 |     80
10^-5              |     379 |    252
10^-6              |    1200 |    800
10^-7              |    3795 |   2530
10^-8              |   12000 |   8000
10^-9              |   37948 |  25298
10^-10             | 120,000 |  80000

Table 2: TCP window size in terms of drop rate
The above two tables indicate that large window sizes require extremely small drop rates. This is the highspeed-TCP problem: how do we maintain a large window? The issue is that non-congestive (random) packet losses bring the window size down, far below where it could be.
One proposed fix: HighSpeed-TCP: for each no-loss RTT, allow an inflation of cwnd by more than 1, at least for large cwnd. If the increment is N = N(cwnd), this
is equivalent to having N parallel TCP connections.
Congestion Window W | Number N(W) of Parallel TCPs
      1             |  1.0
     10             |  1.0
    100             |  1.4
  1,000             |  3.6
 10,000             |  9.2
100,000             | 23.0

Table 3: Number N(cwnd) of parallel TCP connections roughly emulated by the HighSpeed TCP response function.
The formula for N(cwnd) is largely empirical.
N(cwnd) = max(1.0, 0.23 × cwnd^0.4)
Increased window size is not "smooth"
Note the second term in the max() above begins to dominate when cwnd = 38 or so
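The response function is simple to sketch; this reproduces the Table 3 values:

```python
def n_parallel(cwnd):
    """HighSpeed-TCP per-RTT increment N(cwnd): the rough number of
    parallel standard TCP connections being emulated."""
    return max(1.0, 0.23 * cwnd ** 0.4)

# N(10) = 1.0 (the max() floor), N(1000) ≃ 3.6, N(100000) ≃ 23.0,
# matching Table 3; the 0.23×cwnd^0.4 term first exceeds 1.0 near cwnd = 38.
```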
TCP Friendliness
Suppose we are sending audio data in a congested environment. Because
of the real-time nature of the data, we can't wait for lost-packet
recovery, and so must use UDP rather than TCP. (Actually, we could use TCP unless the data is interactive;
that is, we can perfectly well use TCP to receive streaming audio
broadcasts. And if it's interactive, it's likely 8KB/sec voice, where
rate adjustment is impractical. Maybe video would be a better example?)
We suppose we can adjust the transmission rate as needed, but would
like to keep it relatively high.
How are we to manage congestion? How are we to maximize bandwidth without treating other connections unfairly?
A further problem with TCP is the sawtooth variation in cwnd (leading to at least some sawtooth variation in throughput). We don't want that.
TFRC
TFRC, or TCP-Friendly Rate Control, uses the loss rate experienced, p, and the formulas above to calculate
a sending rate. It then allows sending at that rate; that is, TFRC is rate-based rather than window-based. As the loss rate
increases, the sending rate is adjusted downwards, and so on. However,
adjustments are done more smoothly than with TCP.
From RFC 5348:
TFRC is designed to be reasonably fair
when competing for bandwidth with TCP flows, where we call a flow
"reasonably fair" if its sending rate is generally within a factor of two
of the sending rate of a TCP flow under the same conditions. [emphasis
added; a factor of two might not be considered "close enough" in some
cases.]
The penalty of having smoother throughput than TCP while competing
fairly for bandwidth is that TFRC responds slower than TCP to changes
in available bandwidth.
TFRC senders include in each packet a sequence number, a timestamp, and an estimated RTT.
The TFRC receiver is charged with sending back feedback packets,
which serve as (partial) acknowledgements, and also include a
receiver-calculated value for the loss rate, over the previous RTT. The
response packets also include information on the current actual RTT,
which the sender can use to update its estimated RTT. The
TFRC receiver might send back only one such packet per RTT.
The actual response protocol has several parts, but if the loss rate increases, then the primary
feedback mechanism is to compute a new (lower) sending rate, and shift
to that. The rate would be cut in half only if the loss rate P
quadrupled.
Newer versions of TFRC have various features for responding more
promptly to an unusually sudden problem, but in normal use the
calculated sending rate is used most of the time.
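A sketch of the rate calculation, using the cwnd = k/sqrt(p) relationship from above (this is a simplified stand-in for the full RFC 5348 throughput equation, which adds a retransmission-timeout term):

```python
import math

def tfrc_rate(mss_bytes, rtt_sec, p, k=1.225):
    """Simplified TFRC-style sending rate in bytes/sec:
    rate = cwnd/RTT with cwnd = k/sqrt(p) (in packets)."""
    return k * mss_bytes / (rtt_sec * math.sqrt(p))

r1 = tfrc_rate(1500, 0.1, 0.0001)
r2 = tfrc_rate(1500, 0.1, 0.0004)   # loss rate quadrupled
# r2 == r1/2: quadrupling p halves the calculated sending rate
```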
AIMD
Another approach to TCP friendliness is to retain TCP's
additive-increase, multiplicative-decrease strategy, but to change the
numbers. Suppose we denote by AIMD(α,β) the strategy of incrementing
the window size by α after a window of no losses, and multiplying the
window size by (1-β)<1 on loss. TCP is thus AIMD(1,0.5). As β gets
closer to 0, the protocol can remain TCP-friendly by appropriately
reducing α; eg AIMD(0.2, 1/8). Generally, given a β, the corresponding
α is about 3β/(2-β), or about 1.5β for small β.
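The α-from-β rule can be written out directly; note it recovers both standard TCP and the AIMD(0.2, 1/8) example:

```python
def tcp_friendly_alpha(beta):
    """Additive increment alpha making AIMD(alpha, beta) roughly
    TCP-fair, using the alpha = 3*beta/(2-beta) rule."""
    return 3 * beta / (2 - beta)

# tcp_friendly_alpha(0.5)   -> 1.0, i.e. AIMD(1, 0.5) = standard TCP
# tcp_friendly_alpha(0.125) -> 0.2, the AIMD(0.2, 1/8) example
```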
Having a small β means that a connection doesn't have sudden bandwidth
drops when losses occur; this can be important for applications that
rely on a regular rate of data transfer (such as voice). Such applications are sometimes said to be slowly responsive, in contrast to TCP's cwnd = cwnd/2 fast-response.
RTP
RTP is sometimes (though not always) coupled with TFRC: TCP-Friendly Rate Control
(RTP is discussed in P&D 3rd edition in section 9.3.1, but in the 4th edition it's been moved to section 5.4)
- establish RATE of sending packets
- periodic ACKs return summaries of loss rates
- suitable for MULTICAST use: greatly limits feasible ACK rates
- Adjust sending up/down based on loss rate and TCP cwnd=1.5/sqrt(P) rule
- usually some sort of "stability" rule
- on loss, reduce by less than half
Satellite Internet: web acceleration
Here the problem is that RTT is sooo large.
SACK TCP
What about cumulative ACKs? Are they part of the problem?
SACK TCP is TCP with Selective ACKs, so we don't just guess from
dupACKs what has gotten through. We can receive an ACK that says:
- All packets up through 1000 have been received
- All packets up through 1050 have been received except for 1001, 1022, and 1035.
In practice, not nearly as useful as one might imagine.
Reno does pretty well, in low-loss environments.
Actually, SACK TCP includes the following in its acknowledgements:
- The latest cumulative ACK
- The three most recent blocks of consecutive packets received.
Thus, if we've lost 1001, 1022, 1035, and now 1051, and the highest packet received is 1060, the ACK might say:
- All packets up through 1000 have been received
- Packets 1052-1060 have been received
- Packets 1036-1050 have been received
- Packets 1023-1034 have been received
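Building such a report from the set of received packets is straightforward; a sketch (in packet numbers for clarity, though real SACK options carry byte ranges and at most three blocks):

```python
def sack_report(received, latest=3):
    """Return (cumulative ACK, recent blocks of consecutively
    received packets above the cumulative ACK)."""
    received = sorted(received)
    cum = 0                        # highest n with 1..n all received
    for n in received:
        if n == cum + 1:
            cum = n
        else:
            break
    blocks = []                    # runs of consecutive packets above cum
    for n in received:
        if n <= cum:
            continue
        if blocks and n == blocks[-1][1] + 1:
            blocks[-1] = (blocks[-1][0], n)   # extend current run
        else:
            blocks.append((n, n))             # start a new run
    return cum, blocks[-latest:]
```

For example, with packets 1-10, 12-14, and 17-18 received, the report is cumulative ACK 10 plus blocks (12,14) and (17,18).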
ECN TCP (below) may be just as effective
Active Queue Management
= routers doing some active signaling to manage their queues
What is congestion?
Definition 1: queue size reaches maximum; packet losses occur
Definition 2: queue size is nonzero; delays occur
Dealing with timing:
PacketPairs
Send two packets in rapid succession;
do this multiple times; measure the minimum difference in the arrival times.
Assume at the minimum the two packets were transmitted
consecutively by the bottleneck router; then bandwidth = size of 1st packet / time gap
This gives a glimpse at the basic network capacity of a link; it doesn't shed much light on average share
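The measurement itself is just a minimum over observed gaps; a sketch:

```python
def packet_pair_bandwidth(first_packet_size, gap_times):
    """Packet-pair estimate of bottleneck capacity: over many
    back-to-back pairs, take the minimum inter-arrival gap and assume
    the bottleneck router transmitted that pair consecutively."""
    min_gap = min(gap_times)               # seconds
    return first_packet_size / min_gap     # bytes/sec

# eg 1500-byte packets, smallest observed gap 1.2 ms -> about 1.25 MB/s
bw = packet_pair_bandwidth(1500, [0.0041, 0.0012, 0.0025])
```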
DECbit
This is like ECN (below)
Goal: early detection of congestion
DECbit: use AIMD (+1, *.875) to shoot for 50% "congested"
(where congested means the average queue size is > 0)
Note that DECbit is shooting for limited queue utilization; this is "congestion" in the second sense.
RED gateways
Basic idea: improve TCP performance by dropping a few packets when the queue is, say, half full.
TCP Reno: behaves badly with multiple losses/window;
RED tends to minimize that.
AvgLen: weighted average queue length
dropped packets are spaced more-or-less uniformly
Explicit Congestion Notification (ECN)
routers set bit when we might otherwise drop the packet
(possibly when queue is half full, or in lieu of RED drop)
receivers: cwnd = cwnd/2
Biggest advantage of ECN: the receiver discovers the congestion
immediately, rather than waiting for the existing queue to be
transmitted. Dropped packets are dropped upon arrival at the bottleneck router, but loss is not discovered until the
queue is transmitted and three subsequent packets are sent.
RED gateway can set the ECN flag instead of dropping
To enable ECN, two bits are involved. One is set by the sender to
indicate that it is ECN-aware; the other is set by the router and
"echoed" by the receiver in the ACK packet to indicate that congestion
was experienced.
TCP/Vegas
Goal: as a sender, try to minimize congestion in the second sense above; try to scale back cwnd as soon as "backups" occur.
To do this, we introduce the notion of "extra packets"
We can measure bandwidth as ack_rate × packet_size
Queue_use = bandwidth × (RTT - RTTno_load)
We will estimate RTTno_load by the minimum RTT (RTTmin)
bandwidth is easy to estimate; call it BWE; it is simply cwnd/RTT.
"Ideal" cwnd: BWE×RTT_min
Goal: adjust cwnd so BWE×RTTmin + α ≤ cwnd ≤ BWE×RTTmin + β
Typically α = 2-3 packets, β = 4-6
Add 1 to cwnd if we dip down to α,
subtract 1 from cwnd if we rise to β (do NOT divide in half!!)
Book: Diff = ExpectedRate - ActualRate;
ie express in terms of rate rather than bytes
(This is the original, "old-fashioned" TCP Vegas exposition. Larry Peterson was one of the developers of TCP Vegas.)
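A per-RTT Vegas adjustment along the lines above, as a sketch (packet units; α=3, β=6 from the typical ranges given):

```python
def vegas_update(cwnd, rtt, rtt_min, alpha=3, beta=6):
    """One TCP Vegas-style per-RTT cwnd adjustment.  BWE = cwnd/RTT;
    the 'extra' packets queued are cwnd - BWE*RTTmin.  Nudge cwnd by
    1 in either direction; never divide in half."""
    bwe = cwnd / rtt
    queued = cwnd - bwe * rtt_min   # estimated packets sitting in queues
    if queued < alpha:
        return cwnd + 1             # too little queue pressure: speed up
    if queued > beta:
        return cwnd - 1             # queues building: back off gently
    return cwnd

# eg cwnd=40, RTT=0.12s, RTTmin=0.1s: about 6.7 extra packets queued,
# which exceeds beta, so cwnd drops to 39 (not to 20).
```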
Why it doesn't compete well with TCP Tahoe/Reno
TCP/Westwood
TCP/Westwood represents an attempt to use the RTT-monitoring strategies
of TCP/Vegas to address the high-bandwidth problem. Recall that the
issue is to distinguish between congestive losses and random losses.
The sender keeps a continuous estimate of bandwidth, BWE (= ack rate * packet size); BWE * RTTmin = minimum window size to keep bottleneck link busy (as in TCP/Vegas, RTTmin represents our best guess at RTTnoload.)
Here is the TCP/Westwood innovation: on loss, reduce cwnd to max(cwnd/2, BWE*RTTmin). That is, we never drop below the "transit capacity" for the path.
Classic sawtooth, TCP Reno:
- cwin alternates between cwin_min and cwin_max = 2*cwin_min.
- cwin_max = transit_capacity + queue_capacity (to first approximation)
If transit_capacity < cwin_min, then Reno does a pretty good job keeping the bottleneck link saturated.
But if transit_capacity > cwin_min, then when Reno drops to
cwin_min, the bottleneck link is not saturated until cwin climbs to
transit_capacity. Westwood, on the other hand, would in that situation reduce cwin to transit_capacity, a
smaller reduction.
What about random losses?
Note that if there is a loss when cwin < transit_capacity, then
Westwood does not reduce the window size at all! So random losses with
cwin<transit_capacity have no effect. When cwin >
transit_capacity, losses reduce us only to transit_capacity, and thus
the link stays saturated.
Reno: on random loss, cwin = cwin/2
Westwood: On random loss, drop back to transit_capacity; if cwin < transit_capacity, don't drop at all!
In
Westwood, we use BWE×RTTmin as a "floor" for reducing cwnd. In Vegas, we are
shooting to have the actual cwnd be just a few packets above this.
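The Westwood loss response is a one-liner; the min() cap encodes the point above that when cwnd is already below the transit capacity, a random loss causes no reduction at all:

```python
def westwood_on_loss(cwnd, bwe, rtt_min):
    """TCP Westwood's response to a loss event: reduce cwnd to
    max(cwnd/2, BWE*RTTmin), ie never below the estimated transit
    capacity, and never above the current cwnd."""
    transit_capacity = bwe * rtt_min
    return min(cwnd, max(cwnd / 2, transit_capacity))

# With transit capacity 80 packets: a loss at cwnd=100 drops only to 80
# (Reno would drop to 50); a loss at cwnd=60 causes no reduction at all.
```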
BGP (Border Gateway Protocol)
Internet routing is large and complex, with different sites having
different and sometimes even conflicting goals.
Why we need external routing:
we can't compare internal metrics with someone else's. It just does not work.
Metrics may be based on:
- hopcount
- RTT
- bandwidth
- cost
- congestion
One provider's metric may even use larger numbers for better routes.
An Autonomous System is a domain in which one consistent metric is used; typically administered by a single organization.
Between AS's we can't use cost info. Lots of problems come up as a result.
BGP basics: how AS's actually talk to each other.
Autonomous Systems
Routing reduced to finding an AS-path!
EGP (predecessor to BGP) and tree structure
configurable for preferences
For each destination:
1. Receive lots of routes from neighbors; filter INPUT
2. Choose route we will use:
- eliminate AS_PATH loops
- apply local preference
- apply MED
- break ties by choosing routes through fewer ASs, etc
3. Decide whether we will advertise that route: filter OUTPUT
Rule: we can only advertise routes we actually use!
- local traffic v transit traffic
- configurable for supporting transit routing or not
- ASpath info, and loop avoidance
- instability
- MED values ("multi exit discriminator")
BGP: important part of network management at ISP level
BGP relationships:
customer-provider: provider agrees to handle transit for customer. Customer advertises its own routes only!
siblings: often provide mutual backup; not "normal" transit
peers:
large providers exchanging all customer traffic with
each other; advertise all routes to each other
- Export ONLY customer routes to other peers
- Export all peer routes to its customers
In a general sort of way, we advertise routes UPWARDS.
Well, provider routes as "big blocks" do get advertised downwards.
Every AS exports its OWN routes and OWN customers' routes
Internally, we might also rank customer routes over peer routes (illustrate)
customers DO NOT export provider/peer routes to providers
Providers DO export provider/peer routes to customers (often aggregated)
Peers DO NOT export provider/peer routes to each other
(Peers (usually) DO NOT provide transit services to third parties.)
What if small ISP A connects to providers P1 and P2?
A negotiates rules as to what traffic it will send to P1 & what to P2
Then A uses BGP to implement route advertisements (& route learning)
A might advertise its customers to both P1 and P2.
If A "learns" of a route from P1 only, then A will use P1 for routing,
even if P2 advertises a route too. This illustrates INPUT FILTERING.
siblings DO export provider/peer routes to one another
Tier1: about 15 ASs with no providers: AT&T, Sprint, UUNET, ...
Transit Core: tier-1s and ASs that peer with those AND EACH OTHER
Regional ISPs: ~2000 of them (Rexford notes)
Stub AS: no peers, no customers
BGP options regarding a hypothetical private link between Loyola and Northwestern:
----ISP1---nwu
|
|link1
|
----ISP2---luc
- nwu,luc don't export link1: no transit at all
- Export but have ISP1, ISP2 rank at low preference: used for backup only; ISP1 prefers route to luc through ISP2
- Have luc have a path to ISP1 via link1; that won't be used unless
luc starts to route to ISP1 via link1, eg if ISP2 reports ISP1 is
unreachable...
No-valley theorem: at most one peer-peer link; LHS are cust->prov or sib->sib links
General ideas about routing
- We need aggregated routing for table-size efficiency (desperately!)
- There is often a "natural" routing hierarchy, eg provider-based
- CIDR allows us to allocate addresses consistent with the routing hierarchy
- Routing "hierarchy" is often just an approximation; there are
lots of exception cases that are dealt with via extra table entries.
- Longest-match is to allow moving in the hierarchy without renumbering, and multi-homing (multiple attachments) to the hierarchy.
Quality of Service
QoS issues:
playback buffer
fine-grained (per flow) v coarse-grained (per category)
Token Bucket flowspecs
Token bucket flow specification: token rate r bytes/sec, bucket depth B.
Bucket fills at rate specified, does not get fuller than B
When a packet of size S needs to be sent, S tokens are taken from B
(B = B-S)
B represents a "burst capacity".
B = size of queue needed, if outbound link rate is r
Used for input control:
if a packet arrives and the bucket is empty, it is discarded,
or marked "noncompliant"
Used for shaping:
Packets wait until there is sufficient capacity.
This is what happens if the outbound link rate is r, and B (thus) represents the queue capacity.
Simple bandwidth summation; bucket depth represents queue capacity needed
for bursts
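The policing use of the bucket can be sketched directly from the spec above (rate r bytes/sec, depth B bytes):

```python
class TokenBucket:
    """Token-bucket policer sketch: tokens arrive at rate r bytes/sec;
    the bucket holds at most B bytes (the burst capacity)."""

    def __init__(self, rate, depth):
        self.rate = rate        # r, bytes/sec
        self.depth = depth      # B, bytes
        self.tokens = depth     # start with a full bucket
        self.last = 0.0

    def compliant(self, size, now):
        """True if a size-byte packet arriving at time `now` conforms;
        a non-conforming packet is dropped or marked noncompliant."""
        self.tokens = min(self.depth,
                          self.tokens + self.rate * (now - self.last))
        self.last = now
        if size <= self.tokens:
            self.tokens -= size     # S tokens are consumed
            return True
        return False
```

With r = 1000 bytes/sec and B = 2000, a 1500-byte packet at t=0 conforms, a second 1500-byte packet at t=0.5 does not (only 1000 tokens have accumulated), and a 900-byte packet at t=1.0 conforms again.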
Admission control:
- calculation for when a flow spec can be satisfied
- noncompliant (with bucket filter) packets can have lower priority
Integrated Services / RSVP
Each flow can make a reservation with the routers.
Note that this is a major retreat from the datagram-routing stateless-router model. Were virtual circuits the better routing model all along?
Routers maintain SOFT STATE about a connection, not hard state!
Can be refreshed if lost (though with some small probability of failure)
RESV packets: move backwards in very special way (NOT sent from receiver to sender)
PATH message contains Tspec and goes from sender to receiver.
Each router figures out reverse path.
RESV packet is sent along this reverse path by receiver.
Compatible w. multicast
Problem: too many reservations
And how do we decide who gets to reserve what?
Two models:
1. Charge $ for reservations
2. Anyone can ask for a reservation, but the answer may be "no". Maybe there would be a cap on size
Differentiated Services
basically just two service classes: high and low (now three levels)
Rules on which packets can be "premium": max rate from border router?
Goal: set some rules on admitting premium packets, and hope that
their total numbers to any given destination is small enough that we
can meet service targets (not exactly guarantees)
Packets are marked at ingress. This simplifies things.
Example: VOIP
The ISP (not
the user!) would mark VOIP packets as they enter, subject to some
ceiling. These are routed internally (within the ISP) with premium
service. The ISP negotiates with its ISP for a total bulk delivery of premium packets.
One possible arrangement is that the leaf ISPs do use RSVP, but the Internet core runs DS
Packets are DS-marked as they enter the core, based on their RSVP status
DS field:
6 bits; 3+3 class+drop_precedence
Two basic strategies: "Expedited Forwarding" (EF) and "Assured Forwarding" (AF).
101 110 "EF", or "Expedited Forwarding": best service
Assured Forwarding: 3 bits of Class, 3 bits of Drop Precedence
Class:
100 class 4: best
011 class 3
010 class 2
001 class 1
Drop Precedence:
010: don't drop
100: medium
110: high
Main thing: The classes each get PRIORITY service, over best-effort.
DS uses IP4 TOS field, widely ignored in the past.
Routers SHOULD implement priority queues for service categories
Basic idea: get your traffic marked for the appropriate class.
Then what?
000 000: current best-effort status
xxx 000: traditional IPv4 precedence
PHBs (Per-Hop Behaviors): implemented by all routers
Only "boundary" routers do traffic policing/shaping/classifying/re-marking
to manage categories (re-marking is really part of shaping/policing)
EF: Expedited Forwarding
basically just higher-priority. Packets should experience low queuing delay.
Maybe not exactly; we may give bulk traffic some guaranteed share
Functionality depends on ensuring that there is not too much EF traffic.
Basically, we control at the boundary the total volume of EF traffic
(eg to a level that cannot saturate the slowest link), so that we have
plenty of capacity for EF traffic. Then we just handle it at a higher
priority.
This is the best service.
EF provides a minimum-rate guarantee.
This can be tricky: if we accept input traffic from many sources,
and have four traffic outlets R1, R2, R3, R4, then we should only accept enough EF traffic that any one Ri can handle it.
But we might go for a more statistical model, if in practice
1/4 of the traffic goes to each Ri.
========================
AF: Assured Forwarding
Simpler than EF, but no guarantee. Traffic totals can be higher.
There is an easy way to send more traffic: it is just marked with a higher drop precedence.
In-out marking: each packet is marked "in" or "out" by the policer.
Actually, we now have three precedence levels to use for marking.
The policer can be in the
end-user network (though "re-policing" within the ISP, to be sure the
original markings were within spec, is appropriate). But the point is
that the end-user gets to choose which packets get precedence, subject to some total ceiling.
From RFC2597:
The drop precedence level of a packet could be assigned, for example,
by using a leaky bucket traffic policer, which has as its parameters
a rate and a size, which is the sum of two burst values: a committed
burst size and an excess burst size. A packet is assigned low drop
precedence if the number of tokens in the bucket is greater than the
excess burst size [ie bucket is full], medium drop precedence if
the number of tokens in the bucket is greater than zero, but at most
the excess burst size, and high drop precedence if the bucket is empty.
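The RFC 2597 marking rule quoted above reduces to a three-way test on the policer's bucket state:

```python
def drop_precedence(tokens, excess_burst_size):
    """Assign an AF drop-precedence level from the leaky-bucket
    policer's token count, per the RFC 2597 passage above."""
    if tokens > excess_burst_size:
        return "low"       # bucket (nearly) full: in-profile traffic
    if tokens > 0:
        return "medium"    # between empty and the excess burst size
    return "high"          # bucket empty: most likely to be dropped
```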
Packet mangling to mark DS bits, plus a goodly number of priority bands
for the drop precedences
(not sure how to handle the different classes; they might get classful
TBF service)
Fits nicely with RIO routers: RED with In and Out (or In, Middle,
and Out): each traffic "level" is subject to a different drop threshold