Comp 343/443 Week 12: Nov 12
Finish BGP
TCP congestion
Linkstate
RPC

=============

BGP example: configuring a private link

    ----ISP1---nwu
                |
                | link1
                |
    ----ISP2---luc

Note that the issue is the use of the ISP1--nwu link by luc, and the
ISP2--luc link by nwu; link1 use *might* be an issue but let's assume
that it is not the bottleneck.

Three common options: no-transit, backup, load-balancing

1. nwu, luc don't export link1: no transit at all
2. Export, but have ISP1, ISP2 rank the route at low preference:
   this means that the link will be used for inbound transit for
   backup only; ISP1 prefers the route to luc through ISP2, &
   vice-versa (you can't necessarily specify ISP1's rank in your
   advertisement)
3. Have luc's DEFAULT path be via ISP2, but ISP1-via-nwu if ISP2
   becomes unreachable. This is the outbound side of backup.
4. How could we achieve inbound LOAD-BALANCING? (outbound
   load-balancing is sort of "up to us") There's no easy fix here, in
   that if ISP1 and ISP2 both have routes to luc, we have lost all
   control over how other sites will prefer one to the other. We *may*
   be able to artificially make one path appear more expensive.

=============

No-valley theorem: at most one peer-peer link; LHS are cust->prov or
sib->sib links, RHS are sib->sib and prov->cust links

General ideas about routing
* we need aggregated routing for table-size efficiency (desperately!)
* there is often a "natural" routing hierarchy, eg provider-based
* cidr allows us to allocate addresses consistent with the routing
  hierarchy
* routing "hierarchy" is often just an approximation; there are lots
  of exception cases that are dealt with via extra table entries.
* longest-match is to allow moving in the hierarchy without
  renumbering, and multi-homing (multiple attachments) to the
  hierarchy.

===============================================================================

Chapter 6: congestion

Basics of flows

taxonomy: 6.1.2
    router v. host
    reservation v. feedback
    window v.
    rate

digression on window size

Power curves: power = throughput/delay (throughput and delay tend to
rise in proportion as load increases)

Fairness; fairness index

6.2 Queuing: [FIFO versus Fair Queuing; tail-drop v. random-drop]

================

6.3: TCP Congestion avoidance

How did TCP get this job? Part of the goal is good, STABLE performance
for our own connection, but part is helping everyone else.

rate-based v. window-based congestion management

self-clocking: sliding windows itself keeps # of packets constant

RTTnoload = travel time with no queuing
(RTT - RTTnoload) = time spent in queues
(sws/RTT)*(RTT - RTTnoload) = number of packets in queues, usually at
one router (the "bottleneck" router, right before the slowest link)

CongestionWindow: limits amount of data in transit
window = # packets in transit + # packets in queues

additive increase, multiplicative decrease
timeouts as a sign of congestion

Reaching equilibrium: slow start

last time: double cwnd each RTT. This is kind of an oversimplification.

slow start: for each ACK received, increment cwnd by 1;
assuming packets travel together, this means cwnd doubles each RTT

Eventually the network gets "full", and drops a packet. Say this is
after N RTTs, so cwnd=2^N. Then during the previous RTT, cwnd=2^(N-1)
worked fine, so go back to that previous value: set cwnd = cwnd/2.

need for further polling for changes in capacity

Slow increase *after* equilibrium
slow start threshold, ssthresh

congestion avoidance *only*: on congestion, cwnd = curr_window/2
drop back by 1/2; rationale

congestion avoidance phase + slow start phase

review of TCP so far
    slow start + congestion avoidance
    need for SS after each loss (just hinted at end of class Week 12)
    how to *combine* SS and CA
    self-clocking
    slow-start on initial startup *and* after timeout

Note that everything is expressed here in terms of manipulating cwnd.
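The slow-start and congestion-avoidance rules above can be sketched in a toy
per-RTT simulation. This is my illustration, not code from the notes: the
fixed "capacity" of 16 packets per RTT and the assumption that a loss is
detected within one RTT are both simplifications.

```python
# Toy per-RTT simulation of cwnd: slow start (double each RTT) until
# the network gets "full" and drops a packet, then cwnd = cwnd/2 and
# slow additive increase (+1 per RTT) afterward, as described above.
# The capacity value and instant loss detection are made-up simplifications.

def simulate(rtts, capacity=16):
    """Return the cwnd value used in each of the first `rtts` RTTs."""
    cwnd = 1
    in_slow_start = True
    history = []
    for _ in range(rtts):
        history.append(cwnd)
        if cwnd > capacity:        # network "full": a packet is dropped
            cwnd = cwnd // 2       # multiplicative decrease: back toward what worked
            in_slow_start = False  # after equilibrium, probe slowly instead
        elif in_slow_start:
            cwnd *= 2              # slow start: double each RTT
        else:
            cwnd += 1              # congestion avoidance: +1 each RTT
    return history

print(simulate(8))   # shows the initial doubling, the drop, then +1 probing
```

With capacity 16, cwnd doubles 1, 2, 4, 8, 16, overshoots to 32, is cut back
to 16, and thereafter creeps up by 1 per RTT, giving the sawtooth pattern.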
Summary:

    phase        cwnd change on loss    cwnd change, no loss
                                        per window      per ACK
    slow-start   cwnd/2                 cwnd *= 2       cwnd += 1
    congAvoid    cwnd/2                 cwnd += 1       cwnd += 1.0/cwnd

real situation: the sender realizes a packet is lost only after
continuing to send for a protracted period

fast retransmit: = TCP Tahoe

Single-sender situation
example A-----R----slow---B, with R having queue size of 4
bottleneck_queue >= bandwidth*no_load_delay

Review of Tahoe
fast recovery / TCP Reno

TCP and one connection
nam demo
Note the interaction between queue size and pipe size

Fairness, same RTT

TCP Fairness, different RTT
Classic example: Connection 2 has twice the RTT of connection 1.
Again we assume both lose when cwin1+cwin2 > 10; else neither.
Both start at 1.

    con1: 1 2 3 4 5 6 7* 3 4 5 6 7 8* 4 5 6 7 8 9* 4 5 ...
    con2: 1 2 3 4* 2 3 4* 2 3 4* 2 ...

con2 averages half the window size. As the time it takes to send a
window is doubled, the throughput is down by a factor of FOUR.

===============================================================================

RPC

Remote Procedure Calls (RPC)
goals for RPC: lookup, grid computing, Sun network file sharing (NFS)

can we just use TCP? YES, but you'll need code like

    send(message m):
        if (TCP connection does not exist)
            reconnect it
        send m on tcp connection

Actually, you'll need to check for failure of the connection *after*
trying to send, too: the server-reboot problem

Nature of request-reply semantics
At-least-once semantics, idempotency, and statelessness (SunRPC)
client reboot v.
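The idempotency point can be made concrete: under at-least-once semantics, a
request may be re-executed if the reply is lost and the client retransmits.
The sketch below is mine; the two toy operations are invented illustrations,
not actual SunRPC/NFS calls.

```python
# At-least-once semantics: a lost REPLY causes the client to retransmit,
# and the server may execute the request twice. Idempotent operations
# survive re-execution; non-idempotent ones do not. Both operations here
# are invented examples, not real SunRPC/NFS procedures.

store = {}

def handle_set(key, value):
    """Idempotent: executing the same request twice leaves the same state."""
    store[key] = value

def handle_append(key, value):
    """NOT idempotent: each re-execution changes the state again."""
    store[key] = store.get(key, "") + value

# Simulate a duplicated request (retransmission after a lost reply):
handle_set("x", "hello")
handle_set("x", "hello")         # duplicate is harmless
print(store["x"])                # still "hello"

handle_append("y", "hi")
handle_append("y", "hi")         # duplicate corrupts the state
print(store["y"])                # "hihi", not "hi"
```

This is why stateless, idempotent operations make at-least-once delivery
(rather than exactly-once) acceptable for a server like NFS.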
server reboot

Timeouts

XDR (eXternal Data Representation) (omitted)

5.3: BLAST, CHAN, SELECT
Think in terms of grid computing

Why BLAST has selective ACKs

How CHAN implements ACKs; serialization of CHAN
How CHAN deals with reboots, lost data:
limitations of having REQ[N+1] implicitly acknowledge REPLY[N]

CID: channel ID: at most one request outstanding per channel;
     consequences if processing is slow
MID: message ID: messages are numbered serially;
     used as the ack field, more or less
BID: boot ID: incremented each time the system is booted
     client reboot
     server reboot

Retransmit timer value

T/TCP: (a TCP alternative to RPC)
Implications of the final ACK
TIMEWAIT issues: old segments, lost final ACK.
On close, the connection goes into TIMEWAIT for 8*RTO. Why this time?
T/TCP: add new CCOUNT fields
Allow SYN+DATA when CCOUNT is new; etc.
The connection may be reopened by the client *within* this time, if a
new CCOUNT is used

Serialization issues in RPC (CHAN is *synchronous*) (omit)

SunRPC
NFS; implications of statelessness
NFS stateful operations: probably omit
    rm, mkdir and the server's duplicate request cache
    file locking - the server maintains locks, queries clients if it
        crashes/recovers, keeps a list of clients in a file
    NFS v. Unix semantics for deleting open files
    client-side fix of the open-file-deletion problem

===============================================================================

LinkState

Alternative to distance-vector:
    dv: keep a MINIMUM of network topology
    linkstate: a maximum!

4.2.3: Link-state routing and SPF
Flooding, SPF

Flooding protocol; LSPs
lollipop sequence-numbering

SPF algorithm (forward search)

             B
           / | \
Example:  A  |  D
           \ | /
             C

A-B: 5, B-C: 3, C-D: 2, A-C: 10, B-D: 11

Build routes from A to D: (P&D do the example from D to A)

At each step,
(a) take ALL nodes reachable in one hop from the newest member of
    Confirmed, and see if they improve existing routes; if so, add
    them to Tentative.
(b) Then take the SHORTEST path in Tentative, & move it to Confirmed

    Step  Confirmed                 Tentative
    0     (A,0,-)
    1a    (A,0,-)                   (B,5,B)**, (C,10,C)
    1b    (A,0,-),(B,5,B)           (C,10,C)
    2a    (A,0,-),(B,5,B)           (C,8,B)** (better), (D,16,B) (new)
    2b    ...(B,5,B),(C,8,B)        (D,16,B)
    3a    ...(B,5,B),(C,8,B)        (D,10,B) (better: via C, 8+2;
                                    next hop from A is still B)
    3b    ...(C,8,B),(D,10,B)       (empty) -- done

Another example:

    A---3---B
    |       |
    12      2
    |       |
    D---4---C

Allows precise or TOS-based metrics (TOS = Type of Service)
Allows multiple paths

time to compute routes: O(N log N) for SPF, O(N^2) for DV

link-state still requires precise universal link-cost measurements!
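The forward-search steps (a) and (b) above can be sketched in Python on the
A/B/C/D example. The dict-of-dicts graph representation and the
(cost, next_hop) tuples are my choices, not from the notes.

```python
# Forward search (SPF/Dijkstra) on the A/B/C/D example above.
# Confirmed and Tentative map node -> (cost, next_hop from the start).

GRAPH = {
    "A": {"B": 5, "C": 10},
    "B": {"A": 5, "C": 3, "D": 11},
    "C": {"A": 10, "B": 3, "D": 2},
    "D": {"B": 11, "C": 2},
}

def forward_search(graph, start):
    confirmed = {start: (0, None)}
    tentative = {}
    newest = start
    while True:
        # (a) look at all one-hop neighbors of the newest Confirmed node;
        #     add them to Tentative if they improve (or create) a route
        base_cost, base_hop = confirmed[newest]
        for nbr, link_cost in graph[newest].items():
            if nbr in confirmed:
                continue
            cost = base_cost + link_cost
            hop = base_hop if base_hop is not None else nbr
            if nbr not in tentative or cost < tentative[nbr][0]:
                tentative[nbr] = (cost, hop)
        if not tentative:
            return confirmed
        # (b) move the shortest Tentative entry to Confirmed
        newest = min(tentative, key=lambda n: tentative[n][0])
        confirmed[newest] = tentative.pop(newest)

routes = forward_search(GRAPH, "A")
print(routes)   # reproduces the table: B at cost 5, C at 8, D at 10, all via B
```

Running this reproduces the Confirmed column of the table: (B,5,B), (C,8,B),
(D,10,B).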