Comp 343/443 Week 12: Nov 12
Finish BGP
TCP congestion
Linkstate
RPC

=============

BGP example: configuring a private link

    ----ISP1---nwu
                |
                | link1
                |
    ----ISP2---luc

Note that the issue is the use of the ISP1--nwu link by luc, and the
ISP2--luc link by nwu; link1 use *might* be an issue but let's assume
that it is not the bottleneck.

Three common options: no-transit, backup, load-balancing

1. nwu, luc don't export link1: no transit at all
2. Export, but have ISP1, ISP2 rank the route at low preference:
   this means that the link will be used for inbound transit for
   backup only; ISP1 prefers the route to luc through ISP2, &
   vice-versa (you can't necessarily specify ISP1's rank in your
   advertisement)
3. Have luc's DEFAULT path be via ISP2, but ISP1-via-nwu if ISP2
   becomes unreachable. This is the outbound side of backup.
4. How could we achieve inbound LOAD-BALANCING? (outbound
   load-balancing is sort of "up to us") There's no easy fix here, in
   that if ISP1 and ISP2 both have routes to luc, we have lost all
   control over how other sites will prefer one to the other. We *may*
   be able to artificially make one path appear more expensive.

=============

No-valley theorem: at most one peer-peer link; LHS are cust->prov or
sib->sib links, RHS are sib->sib and prov->cust links

General ideas about routing
* we need aggregated routing for table-size efficiency (desperately!)
* there is often a "natural" routing hierarchy, eg provider-based
* cidr allows us to allocate addresses consistent with the routing
  hierarchy
* routing "hierarchy" is often just an approximation; there are lots
  of exception cases that are dealt with via extra table entries.
* longest-match is to allow moving in the hierarchy without
  renumbering, and multi-homing (multiple attachments) to the
  hierarchy.

===============================================================================

Chapter 6: congestion

Basics of flows

taxonomy: 6.1.2
    router v. host
    reservation v. feedback
    window v.
    rate

digression on window size

Power curves: power = throughput/delay (throughput and delay tend to
rise in proportion as load increases)

Fairness; fairness index

6.2 Queuing: [FIFO versus Fair Queuing; tail-drop v. random-drop]

================

6.3: TCP Congestion avoidance

How did TCP get this job? Part of the goal is good, STABLE performance
for our own connection, but part is helping everyone else.

rate-based v. window-based congestion management

self-clocking: sliding windows itself keeps # of packets constant

RTTnoload = travel time with no queuing
(RTT - RTTnoload) = time spent in queues
(sws/RTT)*(RTT - RTTnoload) = number of packets in queues, usually at
one router (the "bottleneck" router, right before the slowest link)

CongestionWindow: limits amount of data in transit
window = # packets in transit + # packets in queues

additive increase, multiplicative decrease
timeouts as a sign of congestion

Reaching equilibrium: slow start

last time: double cwnd each RTT. This is kind of an oversimplification.

slow start: for each ACK received, increment cwnd by 1;
assuming packets travel together, this means cwnd doubles each RTT

Eventually the network gets "full", and drops a packet. Say this is
after N RTTs, so cwnd=2^N. Then during the previous RTT, cwnd=2^(N-1)
worked fine, so go back to that previous value: set cwnd = cwnd/2.

need for further polling for changes in capacity

Slow increase *after* equilibrium
slow start threshold, ssthresh

congestion avoidance *only*: on congestion, cwnd = curr_window/2
drop back by 1/2; rationale

congestion avoidance phase + slow start phase

review of TCP so far
    slow start + congestion avoidance
    need for SS after each loss (just hinted at end of class Week 12)
    how to *combine* SS and CA
    self-clocking
    slow-start on initial startup *and* after timeout

Note that everything is expressed here in terms of manipulating cwnd.
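The slow-start and congestion-avoidance rules above can be sketched in a toy
per-RTT simulation. This is my illustration, not code from the notes: the
fixed "capacity" of 16 packets per RTT and the assumption that a loss is
detected within one RTT are both simplifications.

```python
# Toy per-RTT simulation of cwnd: slow start (double each RTT) until
# the network gets "full" and drops a packet, then cwnd = cwnd/2 and
# slow additive increase (+1 per RTT) afterward, as described above.
# The capacity value and instant loss detection are made-up simplifications.

def simulate(rtts, capacity=16):
    """Return the cwnd value used in each of the first `rtts` RTTs."""
    cwnd = 1
    in_slow_start = True
    history = []
    for _ in range(rtts):
        history.append(cwnd)
        if cwnd > capacity:        # network "full": a packet is dropped
            cwnd = cwnd // 2       # multiplicative decrease: back toward what worked
            in_slow_start = False  # after equilibrium, probe slowly instead
        elif in_slow_start:
            cwnd *= 2              # slow start: double each RTT
        else:
            cwnd += 1              # congestion avoidance: +1 each RTT
    return history

print(simulate(8))   # shows the initial doubling, the drop, then +1 probing
```

With capacity 16, cwnd doubles 1, 2, 4, 8, 16, overshoots to 32, is cut back
to 16, and thereafter creeps up by 1 per RTT, giving the sawtooth pattern.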
Summary:

    phase        cwnd change on loss    cwnd change, no loss
                                        per window      per ACK
    slow-start   cwnd/2                 cwnd *= 2       cwnd += 1
    congAvoid    cwnd/2                 cwnd += 1       cwnd += 1.0/cwnd

real situation: the sender realizes a packet is lost only after
continuing to send for a protracted period

fast retransmit: = TCP Tahoe

Single-sender situation
example A-----R----slow---B, with R having queue size of 4
bottleneck_queue >= bandwidth*no_load_delay

Review of Tahoe
fast recovery / TCP Reno

TCP and one connection
nam demo
Note the interaction between queue size and pipe size

Fairness, same RTT

TCP Fairness, different RTT
Classic example: Connection 2 has twice the RTT of connection 1.
Again we assume both lose when cwin1+cwin2 > 10; else neither.
Both start at 1.

    con1: 1 2 3 4 5 6 7* 3 4 5 6 7 8* 4 5 6 7 8 9* 4 5 ...
    con2: 1 2 3 4* 2 3 4* 2 3 4* 2 ...

con2 averages half the window size. As the time it takes to send a
window is doubled, the throughput is down by a factor of FOUR.

===============================================================================

RPC

Remote Procedure Calls (RPC)
goals for RPC: lookup, grid computing, Sun network file sharing (NFS)

can we just use TCP? YES, but you'll need code like

    send(message m):
        if (TCP connection does not exist)
            reconnect it
        send m on tcp connection

Actually, you'll need to check for failure of the connection *after*
trying to send, too: the server-reboot problem

Nature of request-reply semantics
At-least-once semantics, idempotency, and statelessness (SunRPC)
client reboot v.
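The idempotency point can be made concrete: under at-least-once semantics, a
request may be re-executed if the reply is lost and the client retransmits.
The sketch below is mine; the two toy operations are invented illustrations,
not actual SunRPC/NFS calls.

```python
# At-least-once semantics: a lost REPLY causes the client to retransmit,
# and the server may execute the request twice. Idempotent operations
# survive re-execution; non-idempotent ones do not. Both operations here
# are invented examples, not real SunRPC/NFS procedures.

store = {}

def handle_set(key, value):
    """Idempotent: executing the same request twice leaves the same state."""
    store[key] = value

def handle_append(key, value):
    """NOT idempotent: each re-execution changes the state again."""
    store[key] = store.get(key, "") + value

# Simulate a duplicated request (retransmission after a lost reply):
handle_set("x", "hello")
handle_set("x", "hello")         # duplicate is harmless
print(store["x"])                # still "hello"

handle_append("y", "hi")
handle_append("y", "hi")         # duplicate corrupts the state
print(store["y"])                # "hihi", not "hi"
```

This is why stateless, idempotent operations make at-least-once delivery
(rather than exactly-once) acceptable for a server like NFS.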
server reboot

Timeouts

XDR (eXternal Data Representation) (omitted)

5.3: BLAST, CHAN, SELECT
Think in terms of grid computing

Why BLAST has selective ACKs

How CHAN implements ACKs; serialization of CHAN
How CHAN deals with reboots, lost data:
limitations of having REQ[N+1] implicitly acknowledge REPLY[N]

CID: channel ID: at most one request outstanding per channel;
     consequences if processing is slow
MID: message ID: messages are numbered serially;
     used as the ack field, more or less
BID: boot ID: incremented each time the system is booted
     client reboot
     server reboot

Retransmit timer value

T/TCP: (a TCP alternative to RPC)
Implications of the final ACK
TIMEWAIT issues: old segments, lost final ACK.
On close, the connection goes into TIMEWAIT for 8*RTO. Why this time?
T/TCP: add new CCOUNT fields
Allow SYN+DATA when CCOUNT is new; etc.
The connection may be reopened by the client *within* this time, if a
new CCOUNT is used

Serialization issues in RPC (CHAN is *synchronous*) (omit)

SunRPC
NFS; implications of statelessness
NFS stateful operations: probably omit
    rm, mkdir and the server's duplicate request cache
    file locking - the server maintains locks, queries clients if it
        crashes/recovers, keeps a list of clients in a file
    NFS v. Unix semantics for deleting open files
    client-side fix of the open-file-deletion problem

===============================================================================

LinkState

Alternative to distance-vector:
    dv: keep a MINIMUM of network topology
    linkstate: a maximum!

4.2.3: Link-state routing and SPF
Flooding, SPF

Flooding protocol; LSPs
lollipop sequence-numbering

SPF algorithm (forward search)

             B
           / | \
Example:  A  |  D
           \ | /
             C

A-B: 5, B-C: 3, C-D: 2, A-C: 10, B-D: 11

Build routes from A to D: (P&D do the example from D to A)

At each step,
(a) take ALL nodes reachable in one hop from the newest member of
    Confirmed, and see if they improve existing routes; if so, add
    them to Tentative.
(b) Then take the SHORTEST path in Tentative, & move it to Confirmed

    Step  Confirmed                 Tentative
    0     (A,0,-)
    1a    (A,0,-)                   (B,5,B)**, (C,10,C)
    1b    (A,0,-),(B,5,B)           (C,10,C)
    2a    (A,0,-),(B,5,B)           (C,8,B)** (better), (D,16,B) (new)
    2b    ...(B,5,B),(C,8,B)        (D,16,B)
    3a    ...(B,5,B),(C,8,B)        (D,10,B) (better: via C, 8+2;
                                    next hop from A is still B)
    3b    ...(C,8,B),(D,10,B)       (empty) -- done

Another example:

    A---3---B
    |       |
    12      2
    |       |
    D---4---C

Allows precise or TOS-based metrics (TOS = Type of Service)
Allows multiple paths

time to compute routes: O(N log N) for SPF, O(N^2) for DV

link-state still requires precise universal link-cost measurements!
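The forward-search steps (a) and (b) above can be sketched in Python on the
A/B/C/D example. The dict-of-dicts graph representation and the
(cost, next_hop) tuples are my choices, not from the notes.

```python
# Forward search (SPF/Dijkstra) on the A/B/C/D example above.
# Confirmed and Tentative map node -> (cost, next_hop from the start).

GRAPH = {
    "A": {"B": 5, "C": 10},
    "B": {"A": 5, "C": 3, "D": 11},
    "C": {"A": 10, "B": 3, "D": 2},
    "D": {"B": 11, "C": 2},
}

def forward_search(graph, start):
    confirmed = {start: (0, None)}
    tentative = {}
    newest = start
    while True:
        # (a) look at all one-hop neighbors of the newest Confirmed node;
        #     add them to Tentative if they improve (or create) a route
        base_cost, base_hop = confirmed[newest]
        for nbr, link_cost in graph[newest].items():
            if nbr in confirmed:
                continue
            cost = base_cost + link_cost
            hop = base_hop if base_hop is not None else nbr
            if nbr not in tentative or cost < tentative[nbr][0]:
                tentative[nbr] = (cost, hop)
        if not tentative:
            return confirmed
        # (b) move the shortest Tentative entry to Confirmed
        newest = min(tentative, key=lambda n: tentative[n][0])
        confirmed[newest] = tentative.pop(newest)

routes = forward_search(GRAPH, "A")
print(routes)   # reproduces the table: B at cost 5, C at 8, D at 10, all via B
```

Running this reproduces the Confirmed column of the table: (B,5,B), (C,8,B),
(D,10,B).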