Network Management
Week 13, Apr 25
LT-412, Mon 4:15-6:45 pm
tunnels and routing
tc, tbf, htb
BGP
RSVP, differentiated services
linux1 and linux2 startup
What do we have to do to get these to work?
laptop ----NAT----linux1----net2-----linux2
Step 2: Do these:
linux1: ifconfig eth1 10.0.37.1 netmask 255.255.255.0
linux1 eth0 should come up as 10.2.5.17 by dhcp
linux2: ifconfig eth1 10.0.37.2 netmask 255.255.255.0
Step 4: enable routing on linux1:
linux2: ip route add to default via 10.0.37.1
linux1: echo 1 > /proc/sys/net/ipv4/ip_forward
Step 5e:
valhal: ip route add to 10.0.37.0/24 via 10.2.5.17 (allows valhal to reach linux2)
linux1: ip route add to default via 10.2.5.1 (allows linux1 to ping)
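For reference, the same configuration can be written with the newer ip command and sysctl (a sketch; same interfaces and addresses as above):
linux1: ip addr add 10.0.37.1/24 dev eth1; ip link set eth1 up
linux1: sysctl -w net.ipv4.ip_forward=1
linux2: ip addr add 10.0.37.2/24 dev eth1; ip link set eth1 up
linux2: ip route add default via 10.0.37.1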
Tunnels
I can create a tunnel from home to ulam2 using pppd (point-to-point protocol daemon). The exact details are not important. I end up with an interface tun0
on each machine, to which I can assign IP addresses 192.168.4.1 (ulam2)
and 192.168.4.2 (home). Packets are transmitted along this "virtual"
link by being encapsulated and sent along the "default" path.
What do I have to do to route traffic to Loyola through this tunnel? What do I have to do to secure it?
Host home connects to a Linksys
box with inside IP address 192.168.0.2, through interface eth2. It
connects to a household subnet through interface eth0, with IP address
10.0.0.1.
Here is home's routing table, with mystery entries removed
147.126.65.47 via 192.168.0.2 dev eth2                        # host-specific route
10.0.0.0/24 dev eth0 proto kernel scope link src 10.0.0.1     # local subnet
10.38.0.0/16 via 192.168.4.1 dev tun0                         # two routes to two Loyola subnets
147.126.0.0/16 via 192.168.4.1 dev tun0                       # note this one conflicts with #1 above
default via 192.168.0.2 dev eth2
Ulam2 has NAT enabled, with eth0 being the outside interface. Here are the routing entries:
10.0.0.0/24 via 192.168.4.2 dev ppp0
default via 147.126.65.1 dev eth0 metric 100
Securing:
If the point is to secure the home network against attacks from the
outside world, it is best to do that at the home end. Here is what to
do to allow connections to originate within the home subnet only:
home# iptables --append FORWARD -m state --state ESTABLISHED,RELATED --jump ACCEPT
home# iptables --append FORWARD -m state --state NEW -i eth0 --jump ACCEPT
home# iptables --append FORWARD --jump DROP
We would also want to append these entries to the INPUT chain to block connections to machine home (rather than to other machines on the subnet).
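A sketch of the corresponding INPUT rules (assuming, as above, that eth0 is the inside interface; the lo rule keeps local services working):
home# iptables --append INPUT -i lo --jump ACCEPT
home# iptables --append INPUT -m state --state ESTABLISHED,RELATED --jump ACCEPT
home# iptables --append INPUT -m state --state NEW -i eth0 --jump ACCEPT
home# iptables --append INPUT --jump DROP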
Another security risk is that such a private link allows for
"steppingstone" attacks from bad -> home -> loyola. One approach
is to make the tunnel itself the default route, meaning all my traffic would be routed through Loyola, except for the host-specific route to ulam2.
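A sketch of that approach on host home, using the addresses above (the host-specific route to ulam2 has to stay outside the tunnel, so the encapsulated tunnel packets themselves can still get out):
home# ip route add 147.126.65.47 via 192.168.0.2 dev eth2
home# ip route replace default via 192.168.4.1 dev tun0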
Traffic shaping and traffic control
Generally there's not much point in doing shaping from the bottleneck
link into a faster link. The bottleneck link has done all the shaping
already!
fair queuing
Restricts bandwidth when there is competition, but allows full use when the network is idle. Caps a sender's share only when the link is busy!
token bucket
Restricts bandwidth to a fixed rate, period (but also allows for burstiness as per the bucket, which can be kept small).
tc command:
shaping: output rate limiting (delay or drop)
scheduling: prioritizing. Classic application: sending voip ahead of bulk traffic
policing: input rate regulation
dropping: what gets done to nonconforming traffic
Two scenarios to restrict user/host Joe:
1. Reduce the absolute bandwidth (inbound? outbound?) available to Joe. Even if the link is otherwise idle, Joe is still capped.
2. Guarantee non-Joe users a minimum share; ie cap Joe's bandwidth only when the link is busy.
qdisc: queuing discipline
You can think of this as the TYPE of queue. Examples: fifo, fifo+taildrop, fifo+randomdrop, fair_queuing, RED, tbf
queuing disciplines are applied to INTERFACES, using the tc command.
Queuing disciplines can be "classless" or "classful" (hierarchical)
Queuing Disciplines (qdisc): does scheduling. Some also support shaping/policing
how packets are enqueued, dequeued
- fifo + taildrop
- fifo + random drop
- RED: introduces random drops not for policing, but to encourage good behavior by senders. Used in core networks, not leaf networks
- stochastic fair queuing
(each TCP connection is a flow): gives each flow a guaranteed
fraction of bandwidth, when needed. Other flavors: flows are subnets,
etc. However, if we're doing scheduling of inbound traffic, it doesn't
do much good to do sfq based on destination (unless we can do it at the
upstream router at our ISP). Some reasons for an application to open
multiple TCP connections:
- cheating SFQ limits
- much improved high-bandwidth performance
- the data naturally divides into multiple connections
- pfifo_fast (or, generically, pfifo): priority fifo. tc's pfifo_fast has three priority bands built-in.
enqueuing: figure out which band the packet goes into
dequeuing: take from band 0 if nonempty, else 1 if nonempty, else 2
Basic "classless" qdiscs
pfifo_fast
(see man pfifo_fast): three-band FIFO queue
Consider the following iptables command, to set the TOS bits on outbound ssh traffic to "Minimize-Delay":
# iptables -A OUTPUT -t mangle -p tcp --dport 22 -j TOS --set-tos Minimize-Delay
This works with pfifo_fast, which provides three bands. Band selection by
default is done using TOS bits of packet header (which you probably
have to mangle to set). See Hubert, §9.2.1.1, for a table of the TOS-to-band map.
Dequeuing algorithm (typically invoked whenever the hardware is ready for a packet, or whenever the qdisc reports to the hardware that it has a packet):
if there are any packets in band 0, dequeue the first one and send it
else if there are any packets in band 1, dequeue the first one and send it
else if there are any packets in band 2, dequeue the first one and send it
else report no packets available
Note that in a very direct sense pfifo_fast does support three
"classes" of traffic. However, it is not considered to be classful,
since
- we cannot control how traffic is classified
- we cannot attach subqueues to the individual classes
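You can check which qdisc an interface is currently using, and its packet/byte counters, without configuring anything:
# tc -s qdisc show dev eth0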
Example: queue flooding on upload
In practice, it is very important to set interactive traffic to have a
higher priority than normal traffic (eg web browsing, file downloads).
However, you don't have much control of the downlink traffic, and if
the uplink queue is on the ISP's hardware (eg their cablemodem), then
you won't have much control of the upload side.
me------<fast>------[broadband gizmo]-----------<slow>-----...internet
In the scenario above, suppose the broadband gizmo, BG, has a queue
capacity of 30K (30 packets?). A bulk UPLOAD (though not download) will
fill BG's queue, no matter what you do within your subnet with pfifo_fast.
(We are presuming you do not have the option of enabling pfifo_fast on BG itself.) This means that every interactive packet will now wait behind a 30KB
queue to get up into the internet. As the upload packets are sent, they
will be ACKed and then your machine will replenish BG's queue.
One approach is to reduce BG's queue. But this may not be possible.
Here's another approach:
me---<fast>---[tq_router]---<slow2>---[broadband gizmo]---<slow>--internet
Make sure slow2 ~ slow. Then upload will fill tq_router's queue, but fast traffic can still bypass.
Logically, can have "me" == "tq_router"
_______________________
Token bucket filter (tbf)
See man tbf or man tc-tbf
This can be used to restrict flow to a set average rate, while allowing bursts. The tbf qdisc slows down excessive bursts to meet filter limits. This is shape-only, no scheduling.
tbf (or htb) is probably the preferred way of implementing bandwidth caps.
tokens are put into a bucket at a set rate. If a packet arrives:
- tokens available: send immediately and decrement the bucket
- no token: drop (or wait, below)
Over the long term, your transmission rate will equal the token rate.
Over the short term, you can transmit bursts of up to token size.
Parameters:
- bucketsize (burst): how large the bucket can be
- rate: rate at which tokens are put in
- limit: number of bytes that can wait for tokens. All packets wait, in essence; if you set this to zero then the throughput is zero. While theoretically it makes sense to set this to zero, in practice that appears to trigger serious clock-granularity problems.
- latency: express limit in time units
- mpu: minimum packet unit; size of an "empty" packet (headers only)
Granularity issue: buckets are typically updated every 10 ms. During 10 ms, on a 100Mbps link, 1 Mbit (~125 KB) can accumulate, or ~100 large packets! Generally, tbf introduces no burstiness itself up to 1mbit (a 10 kbit packet takes 10 ms on a 1mbit link). Beyond that, a steady stream of packets may "bunch up" due to the every-10-ms bucket refill.
Use of tbf for bulk traffic into a modem (rate is set a bit below the actual 250kbit bandwidth; burst 1540 is one packet):
# tc qdisc add dev ppp0 root tbf rate 220kbit latency 50ms burst 1540
You want packets queued at your end, NOT within your modem!
Otherwise you will have no way to use pfifo_fast to have interactive traffic leapfrog the queue.
What you REALLY want is for TBF to apply only to the bulk bands of PFIFO_FAST.
Can't do this with TBF; that would be classFUL traffic control (though we can do that with PRIO, the classful analogue of pfifo_fast)
Can't be used to limit a subset of the traffic!
TBF is not WORK-CONSERVING:
the link can be idle and someone can have a packet ready to send, and yet it still waits.
Demo: linux1, linux2 and tc
start tcp_writer in ~/networks/java/tcp_reader. This accepts
connections to port 5430, and then sends them data as fast as it can.
tbf1:
linux1: tc qdisc add dev eth1 root tbf rate 100kbps burst 1mb limit 1mb
also try:
tc qdisc change dev eth1 root tbf rate newrate burst newburst limit newlimit
This might cause a spike in kbit/sec numbers:
479
479
159
4294
564
564
...
clear with tc qdisc del dev eth1 root
tc_stats
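A rough stand-in for tc_stats is simply to poll the qdisc counters:
# watch -n 1 tc -s qdisc show dev eth1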
demo: enable rate 1000 kbit/sec (1 kbit/ms), burst 20kb, limit 100kb. Then try limit = 5kb v 6kb.
At the given rate, 1kb takes 8ms. The bucket is replenished every 10ms
(hardware clock limitation), so a burst should not be consumed much
during a 10ms interval.
At 10mbit, 10ms is 12kb, and we won't get that rate unless burst is set to about that.
___________________________
Fair Queuing
Note
that the FQ clock (slow clock) can be reset to zero whenever all queues are empty,
and can in fact just stay at zero until something arrives.
Linux "sfq":
- Flows are individual tcp connections. NOT hosts, or subnets, etc! Can't get this?!
- Each flow is hashed by srcaddr,destaddr,port.
- Each bucket is considered a separate input "pseudoqueue"
- Collisions result in unfairness, so the hash function is altered at regular intervals to minimize this.
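Attaching sfq is a one-liner; perturb 10 rehashes every 10 seconds (the periodic hash change just mentioned):
# tc qdisc add dev eth0 root sfq perturb 10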
What we really want is a way to define flow_ids dynamically (eg per host or per subnet), and a way to assign connections to them.
sfq is schedule-only, no shaping
What you probably want is to apply this to DOWNLOADs.
tc doesn't do that if your linux box is tied directly to your broadband gizmo. Applied to downloads at a router,
joe,mary,alice---<---1--------[router R]------<---2----internet
it would mean that each of joe,mary,alice's connections would get
1/3 the bw, BUT that would be 1/3 of the internal link, link 1.
Regulating shares of link 2 would have to be done upstream.
If we know that link 2 has a bandwidth of 3 Mbps, we can use CBQ
(below) to restrict each of joe,mary,alice to 1 Mbps, by controlling
the outbound queue at R into link 1.
Further sharing considerations:
If we divide by 1 flow = 1 tcp connection, joe can double throughput by adding a second tcp connection.
If we divide by 1 flow = 1 host, we do a little better at achieving
per-user fairness, assuming 1 user = 1 machine. Linux sfq does not
support this.
Linux sfq creates multiple virtual queues. It's important to realize
that there is only one physical queue; if one sender dominates that
queue by keeping it full, to the point that the other connections get
less bandwidth than their FQ share, then the later division into
LOGICAL queues won't help the underdogs much.
sfq IS work-conserving
The slow-clock algorithm is moderately tricky. Is there a simpler way?
How about round-robin quantums?
This means that you establish a quantum size, q, and send packets from
each queue until the total number of bytes is > q. You then service
the other active queues, in round-robin style. Suppose the total number
of bytes sent from one queue in one burst is q+r; next time, you give
that queue a reduced quantum of q-r.
Over the long term, this is reasonably close to fair queuing, provided
each queue has q bytes available to send at any one time.
Otherwise, that queue will be emptied and there will be no carryover
opportunity.
Here are two scenarios where that fails, both involving q = 5k:
- queue1 typically contains > 5k of data (eg ten 1k-packets (implying a sliding-window size > 10)); queue2 however is doing stop-and-wait
and at any one time has only a single 1k packet. In this case, 5k of
queue1 will be sent for every 1k of queue2, and queue2 will have no
opportunity to catch up
- queue1 typically contains > 5k of data, and queue2 is sending a 0.1k voip packet every 10 ms.
However, note that if queue1's quantum can be sent in less than 10 ms, queue2 is still getting "immediate" service. And if queue1's quantum takes, say, 20 ms, then queue2 is sending two packets at a time every 20 ms, which probably does not make a huge difference to voip. However, that is 20 ms extra, and if there are multiple data queues, having queue2 get higher priority would make sense.
One solution is to make the quantum small, just a little above the max packet size (the so-called MTU, Maximum Transmission Unit).
RED
Generally intended for internet routers
We drop packets at random (at a very low rate) when the queue reaches a preset fraction of its capacity (eg 50% of max), to signal tcp senders to slow down.
These gizmos are added to interfaces, basically. If you want to slow a
particular sender down, create a virtual interface for them or use
classful qdiscs.
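A sketch of attaching RED to an interface (values are illustrative only; min/max/limit are byte thresholds on the average queue size, avpkt the assumed average packet size):
# tc qdisc add dev eth0 root red limit 400000 min 30000 max 90000 \
    avpkt 1000 burst 55 probability 0.02 bandwidth 10mbit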
classful qdiscs
CBQ, HTB, PRIO
Disneyland example: what is the purpose of having one queue feed into another queue?
However, under tc we can have these classful qdiscs form a tree, possibly of considerable depth.
LARTC 9.3: (http://lartc.org/howto/lartc.qdisc.advice.html)
- To purely slow down outgoing traffic, use the Token Bucket Filter. Works up to huge bandwidths, if you scale the bucket.
- If your link is truly full and you want to make sure that no single session can dominate your outgoing bandwidth, use Stochastic Fairness Queueing.
- If you have a big backbone and know what you are doing, consider Random Early Drop (see Advanced chapter).
- To 'shape' incoming traffic which you are not forwarding, use the Ingress Policer. Incoming shaping is called 'policing', by the way, not 'shaping'.
- If you are forwarding
it, use a TBF on the interface you are forwarding the data to. Unless
you want to shape traffic that may go out over several interfaces, in
which case the only common factor is the incoming interface. In that
case use the Ingress Policer (see the sketch after this list).
- If you don't want to shape, but only want to see if your
interface is so loaded that it has to queue, use the pfifo queue (not
pfifo_fast). It lacks internal bands but does account the size of its
backlog.
- Finally - you can also do "social shaping". You may not always be
able to use technology to achieve what you want. Users experience
technical constraints as hostile. A kind word may also help with
getting your bandwidth to be divided right!
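Here is a sketch of the Ingress Policer mentioned above: attach the special ingress qdisc, then police everything arriving on the interface (the rate is illustrative):
# tc qdisc add dev eth0 handle ffff: ingress
# tc filter add dev eth0 parent ffff: protocol ip prio 50 u32 \
    match ip src 0.0.0.0/0 police rate 1mbit burst 10k drop flowid :1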
Basic terminology for classful qdiscs
classes form a tree
each leaf node has a class
At each interior node, there is a CLASSIFIER algorithm (a filter) and a set of child class nodes.
Linux: at router input(ingress), we can apply POLICING to drop
packets; at egress, we apply SHAPING to put packets in the right queue.
Terminology derives from the fact that there is no ingress queue in
linux (or most systems).
Classful Queuing Disciplines
CBQ, an acronym for 'class-based queuing', is the best known.
It is not, however, the only classful queuing discipline. And it is rather baroque.
PRIO: divides into classes :1, :2, :3 (user-configurable this time)
dequeuing: take packet from :1 if avail; if not then go on to :2, etc
by default, packets go into the band they would go into in PFIFO_FAST, using TOS bits. But you can attach tc filters to change this.
Hierarchy of PRIO queues is equivalent to the "flattened" PRIO queue.
However, a hierarchy of PRIO queues with SFQ/TBF offspring is not "flattenable".
TBF: classic rate-based shaper; packets wait for their token
(policing version: you get dropped if you arrive before your token)
For a classless qdisc, we're done once we create it. (Its parent might be nonroot, though).
For a classful qdisc, we add CLASSES to it. Each class will then in turn have a qdisc added to it.
parent of a class is either a qdisc or a class of same type
class major numbers must match parent.
qdisc major numbers are new
each class needs to have something below it, although every class gets a fifo qdisc by default.
We then attach a sub-qdisc to each subclass.
LARTC example:
Hubert example in 9.5.3.2 has a prio qdisc at the root.
The subclasses 1:1, 1:2, and 1:3 are automatic, as is the filtering.
          1:                     root qdisc
        /  |  \
      /    |    \
   1:1    1:2    1:3             classes, added automatically
    |      |      |
   10:    20:    30:             qdiscs
   sfq    tbf    sfq
  band 0  band 1  band 2
Bulk traffic will go to 30:, interactive traffic to 20: or 10:.
Command lines for adding a class-based prio queue (class version of pfifo-fast)
# tc qdisc add dev eth0 root handle 1: prio
This automatically creates classes 1:1, 1:2, 1:3. We could say
tc qdisc add dev eth0 root handle 2: prio bands 5
to get bands 2:1, 2:2, 2:3, 2:4, and 2:5. Then zap with
tc qdisc del dev eth0 root
But suppose we stick with the three bands, and add:
# tc qdisc add dev eth0 parent 1:1 handle 10: sfq       (prob should be tbf too)
# tc qdisc add dev eth0 parent 1:2 handle 20: tbf rate 20kbit buffer 1600 limit 3000
# tc qdisc add dev eth0 parent 1:3 handle 30: sfq
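If you don't want to rely on the TOS bits for band selection, a u32 filter at the root can steer traffic explicitly; here is a sketch that forces outbound ssh (dport 22) into the high-priority class 1:1:
# tc filter add dev eth0 parent 1:0 protocol ip prio 1 u32 \
    match ip dport 22 0xffff flowid 1:1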
We now get a somewhat more complex example.
Hierarchical Fair Queuing
1. Why it's not flattenable. Consider this example:
          root
         /    \
       50      50
      /  \    /  \
    25    25 25    25
    A     B  C     D
ABC active, D idle:
Hierarchical divides 25/25/50
Flat divides 33/33/33
2. How to define using fluid flows (as we did with flat fair queuing)
3. The slow-clock algorithm (or any finishing-time algorithm) implies that the finishing order of two packets cannot depend on future arrivals. However, as the example below shows, hierarchical fair queuing does need to take future arrivals into account, and so we can no longer use that strategy!
Example: from Bennett & Zhang, Hierarchical packet fair queueing algorithms, IEEE/ACM Transactions on Networking, Oct 1997, §2.2
          root
         /    \
       80      20
      /  \      |
    75    5     |
    A1    A2    B
All packets have size 1; link speed is 1 (ie 1 packet/unit_time)
T=0:
A1's queue is idle; A2's and B's are very full. A2 gets 80%, B gets
20%. Finishing time calculations are such that A2 sends 4, then B sends
1, then A2 sends 4, then B sends 1....
But now let a packet arrive on A1. All of a sudden, A2 should get 5%, or 1/4 the rate of B.
But the finishing-time model can't change, because those calculations don't allow it!
Example 3: from Bennett & Zhang, 3.1
11 users. User 1 has a guarantee of 50%, the others all have 5%. WFQ
sends 10 packets for user 1, then one for each of the other
users (10 in all). So-called Generalized Processor Sharing model: 5 of user 1 / 10
of others / 10 of user 1 / ...
difference between WFQ (our algorithm, non-hierarchical) and fluid model
There is active research on algorithms that work for packets, have
bounded delay with respect to the fluid model, and are fast. The round-robin
quantum algorithm is fast, but doesn't quite meet the bounded-delay
rule. However, it is easy to implement, and works well if all the
queues are "busy".
CBQ
The acronym stands for "Class Based Queuing", though there are several
other forms of classful queuing disciplines. CBQ is an attempt to
combine some elements of Fair Queuing and Token Bucket in a single
mechanism.
Goal: classful shaping. But the shaping (rate-limiting) doesn't work
the way you'd like, because the rate is ESTIMATED somewhat artificially
by measuring average idle time at the device interface.
Example from LARTC [Hubert] 9.5.4.4
Goal:
- webserver traffic limited to 5 mbit (class 1:3, qdisc 30:)
- smtp traffic limited to 3 mbit (class 1:4, qdisc 40:)
- combination limited to 6 mbit
          1:                root qdisc
          |
         1:1                child class
        /   \
      1:3   1:4             leaf classes
       |     |
      30:   40:             qdiscs
     (sfq) (sfq)
Create root:
# tc qdisc add dev eth0 root handle 1:0 cbq bandwidth 100Mbit avpkt 1000 cell 8
create CLASS below root qdisc node to limit to 6 mbit
"bounded" (at end) means this class can't borrow from other idle classes.
This caps the rate at 6Mbit
# tc class add dev eth0 parent 1:0 classid 1:1 cbq bandwidth 100Mbit \
rate 6Mbit weight 0.6Mbit prio 8 allot 1514 cell 8 maxburst 20 avpkt 1000 bounded
Now create the two leaf classes, with classids 1:3 and 1:4
These are not bounded (which means they can borrow), and also not
isolated, which means they can lend. Classid n:m is our choice, but
must have n=1 to match parent definition above.
# tc class add dev eth0 parent 1:1 classid 1:3 cbq bandwidth 100Mbit \
rate 5Mbit weight 0.5Mbit prio 5 allot 1514 cell 8 maxburst 20 avpkt 1000
# tc class add dev eth0 parent 1:1 classid 1:4 cbq bandwidth 100Mbit \
rate 3Mbit weight 0.3Mbit prio 5 allot 1514 cell 8 maxburst 20 avpkt 1000
Both leaf classes have fifo qdisc by default. We could leave it that way, but here's how to replace it with sfq
# tc qdisc add dev eth0 parent 1:3 handle 30: sfq
# tc qdisc add dev eth0 parent 1:4 handle 40: sfq
Now we attach filtering rules, to the root node. Note that flowid in a filter spec matches a classid.
sport 80 (srcport 80) means web traffic.
# tc filter add dev eth0 parent 1:0 protocol ip prio 1 u32 match ip sport 80 0xffff flowid 1:3
# tc filter add dev eth0 parent 1:0 protocol ip prio 1 u32 match ip sport 25 0xffff flowid 1:4
Note that we use 'tc class add' to CREATE classes within a qdisc, but
that we use 'tc qdisc add' to actually add qdiscs to these classes.
Traffic that is not classified by either of these two rules will then be processed within 1:0 (the parent qdisc), and be unlimited.
If SMTP+web together try to exceed the set limit of 6mbit/s, bandwidth
will be divided according to the weight parameter, giving 5/8 of
traffic to the webserver and 3/8 to the mail server.
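To verify that the division is happening as intended, watch the per-class and per-qdisc counters:
# tc -s class show dev eth0
# tc -s qdisc show dev eth0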
How do we have one class represent a subnet, and another class represent "everything else"? Use a default class.
Note that the "default" option is specific to htb; cbq doesn't have this.
Boxman, 8.2.1
tc qdisc add dev eth0 root handle 1: htb default 90
Hierarchical token bucket (htb)
This is a classful version of tbf. Note that because we may now have sibling classes, we have an important sharing feature: each sibling class is allocated the minimum of what it requests and its assigned share; that is, each class is guaranteed a minimum share (as with fair queuing). However, when some classes are idle, the fairness of the division of the leftover bandwidth is limited by the granularity of the quantum round-robin.
pld home example
Commands to create an htb qdisc with three child classes:
1. Hosts 10.0.0.1 and 10.0.0.2 go to class :10
2. subnet 10.0.0.0/29 (10.0.0.0 - 10.0.0.7 except the above two!) goes to class :29
3. All other traffic goes to class :100
The qdisc is placed on the interior interface of the router, to regulate inbound traffic.
We suppose that this is to control flow over a link that has a
sustained bandwidth limit of BW, a bucket size of BURST, and a peak
bandwidth of PBW. (PBW is not used here.)
BW=56 #kbps
BURST=350 #mb
tc qdisc add dev eth0 root handle 1: htb default 100
tc class add dev eth0 parent 1: classid 1:1 htb rate ${BW}kbps burst ${BURST}mb
# class 10 is limited by parent only
tc class add dev eth0 parent 1:1 classid 1:10 htb rate ${BW}kbps burst ${BURST}mb
# class 29 has same rate, but half the burst
HBURST=$(expr $BURST / 2)
tc class add dev eth0 parent 1:1 classid 1:29 htb rate ${BW}kbps burst ${HBURST}mb
# class 100 has 3/4 the refill rate, too
BW100=$(expr 3 \* $BW / 4)
tc class add dev eth0 parent 1:1 classid 1:100 htb rate ${BW100}kbps burst ${HBURST}mb
tc filter add dev eth0 parent 1:0 protocol ip u32 \
match ip dst 10.0.0.1/32 flowid 1:10
tc filter add dev eth0 parent 1:0 protocol ip u32 \
match ip dst 10.0.0.2/32 flowid 1:10
tc filter add dev eth0 parent 1:0 protocol ip u32 \
match ip dst 10.0.0.0/29 classid 1:29
# no rule for flowid 1:100; taken care of by default rule
Actually, I can't use this, because tc does not handle bucket sizes so
large. So what I do instead is have a packet-sniffer-based usage
counter on my (linux) router, which issues tc class change commands
to reduce any class that is over-quota. I then use more moderate values
for tc itself: I create the classes (I'm up to five now), and each one
gets a rate of ~1-10Mbit and a bucket of ~10mbyte. However, the rate is
reduced drastically when a class reaches its quota.
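A sketch of what such an over-quota adjustment might look like (values hypothetical; assumes class 1:3 is a direct child of the root htb qdisc 1:):
tc class change dev $DEV parent 1: classid 1:3 htb rate 32kbit burst 10kb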
$RATE = 1-10 mbit
$BURST = 10 mbyte
# band 1
tc filter add dev $DEV parent 1:0 protocol ip u32 match ip dst 10.0.0.0/30 flowid 1:1
tc filter add dev $DEV parent 1:0 protocol ip u32 match ip dst 10.0.0.2/32 flowid 1:1
# band 2
tc filter add dev $DEV parent 1:0 protocol ip u32 match ip dst 10.0.0.4/30 flowid 1:2
# band 3 : gram
tc filter add dev $DEV parent 1:0 protocol ip u32 match ip dst 10.0.0.10/32 flowid 1:3
tc qdisc add dev $DEV parent 1:3 tbf rate $RATE burst $BURST limit $LIMIT
# band 4
tc filter add dev $DEV parent 1:0 protocol ip u32 match ip dst 10.0.0.11/32 flowid 1:4
tc qdisc add dev $DEV parent 1:4 tbf rate $RATE burst $BURST limit $LIMIT
# band 5:
tc filter add dev $DEV parent 1:0 protocol ip u32 match ip dst 10.0.0.0/24 flowid 1:5
tc qdisc add dev $DEV parent 1:5 tbf rate 32kbit burst 10kb limit 10kb
Note that while iproute2 has some problems fitting into the Internet model of a single routing table per router, tc applies just to egress interfaces and so is completely separate from routing.
Both of the htb examples above use the u32 classifier.
Another HTB example
Token bucket is a simple rate-limiting strategy, but the basic version limits everyone in one pool.
Why does hierarchical matter? Strict rate-limiting would not! (why?)
However, HTB is not "flattenable" because you share the parent bucket with your siblings. Consider
            root
           /    \
       A: b=20   b=30
                /    \
           B: b=20   C: b=20
If B and C are both active, their bucket drops to about 15, as each gets half the parent's bucket.
htb lets you apply flow-specific rate limits (eg to specific users/machines)
Rate-limiting can be used to limit inbound bandwidth to set values
The example above, redone using htb: LARTC 9.5.5.1
Functionally almost identical to the CBQ sample configuration above:
# tc qdisc add dev eth0 root handle 1: htb default 30
(30 is a reference to the 1:30 class, to be added)
# tc class add dev eth0 parent 1: classid 1:1 htb rate 6mbit burst 15k
Now here are the classes: note "ceil" in the second two. The ceil
parameter allows "borrowing": use of idle bandwidth. Parent of classes
here must be the class above. Only one class can have root as its
parent!
Sort of weird.
          1:
          |
         1:1
        /  |  \
    1:10  1:20  1:30
     web  smtp  other
# tc class add dev eth0 parent 1:1 classid 1:10 htb rate 5mbit burst 15k
# tc class add dev eth0 parent 1:1 classid 1:20 htb rate 3mbit ceil 6mbit burst 15k
# tc class add dev eth0 parent 1:1 classid 1:30 htb rate 1kbit ceil 6mbit burst 15k
The author then recommends SFQ beneath these classes (to replace whatever default leaf qdisc is there):
# tc qdisc add dev eth0 parent 1:10 handle 10: sfq perturb 10
# tc qdisc add dev eth0 parent 1:20 handle 20: sfq perturb 10
# tc qdisc add dev eth0 parent 1:30 handle 30: sfq perturb 10
Add the filters which direct traffic to the right classes: here, we divide by web/email/other
# tc filter add dev eth0 protocol ip parent 1:0 prio 1 u32 \
match ip dport 80 0xffff flowid 1:10
# tc filter add dev eth0 protocol ip parent 1:0 prio 1 u32 \
match ip sport 25 0xffff flowid 1:20
===================================================================
2.5: classifying/filtering:
fwmark:
# iptables -A PREROUTING -i eth0 -t mangle -p tcp --dport 25 -j MARK --set-mark 1
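The mark by itself does nothing; a matching fw filter is needed to steer marked packets (a sketch, assuming a classful qdisc 1: with a class 1:20 already defined):
# tc filter add dev eth0 parent 1:0 protocol ip handle 1 fw flowid 1:20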
mangle table & CLASSIFY
iptables -t mangle -A POSTROUTING -o eth2 -p tcp --sport 80 -j CLASSIFY --set-class 1:10
u32:
allows matching on bits of packet headers. u32 is
completely stateless (that is, it doesn't remember past connection
state; it is strictly a per-packet matcher). The underlying matches are
all numeric, but there are preset symbolic names for some fields, to
help. See u32 examples above
(repeated here:)
# tc filter add dev eth0 parent 1:0 protocol ip prio 1 u32 match ip sport 80 0xffff flowid 1:3
# tc filter add dev eth0 parent 1:0 protocol ip prio 1 u32 match ip sport 25 0xffff flowid 1:4
3. Example for restricting bandwidth used by a single host (or subnet): From www.roback.cc/howtos/bandwidth.php
uses cbq, which does some approximate calculations to limit the flow of subclasses.
Also uses mangling/fwmark to do classification
DNLD = download bandwidth
UPLD = upload bandwidth
DWEIGHT/UWEIGHT: weighting factors, more or less 1/10 of DNLD/UPLD
tc qdisc add dev eth0 root handle 11: cbq bandwidth 100Mbit avpkt 1000 mpu 64
tc class add dev eth0 parent 11:0 classid 11:1 cbq rate $DNLD weight $DWEIGHT \
allot 1514 prio 1 avpkt 1000 bounded
tc filter add dev eth0 parent 11:0 protocol ip handle 4 fw flowid 11:1
tc qdisc add dev eth1 root handle 10: cbq bandwidth 10Mbit avpkt 1000 mpu 64
tc class add dev eth1 parent 10:0 classid 10:1 cbq rate $UPLD weight $UWEIGHT \
allot 1514 prio 1 avpkt 1000 bounded
tc filter add dev eth1 parent 10:0 protocol ip handle 3 fw flowid 10:1
Now MARK the incoming packets, from the designated subnet:
# Mark packets to route
# Upload marking
$IPTABLES -t mangle -A FORWARD -s 192.168.0.128/29 -j MARK --set-mark 3
$IPTABLES -t mangle -A FORWARD -s 192.168.0.6 -j MARK --set-mark 3
# Download marking
$IPTABLES -t mangle -A FORWARD -s ! 192.168.0.0/24 -d 192.168.0.128/29 -j MARK --set-mark 4
$IPTABLES -t mangle -A FORWARD -s ! 192.168.0.0/24 -d 192.168.0.6 -j MARK --set-mark 4
Basic idea behind reservation/priority systems:
You don't have to segregate the physical traffic; all you have to do is
(a) mark or describe the traffic
(b) use queuing disciplines to give it priority
RSVP; aka Integrated Services
RSVP: reservations compatible with multicast: receiver-initiated!!
[ignore details of mechanism for network management]
- transparent to senders
- suitable for multicast receivers!!!
- mechanisms for CHANGES IN ROUTE
- different receivers may have different needs; these can sometimes be accommodated by routers, without bothering the sender.
how RSVP routers recognize flows; IP6 flowID
typical reservation timeout: 30 sec
RSVP does include teardown messages (both directions), but RSVP won't fall apart if they are not sent.
RSVP is transparent to nonparticipating ISPs
Receivers send periodic (30 sec) RESV messages to refresh routers; routers forward up the path.
- works well if receiver crashes
- works well if router crashes
- how does it figure the path? Because routers send PATH mesg.
Each RSVP sender host transmits RSVP "Path" messages downstream
along the uni-/multicast routes provided by the routing protocol(s),
following the paths of the data. These Path messages store "path
state" in each node along the way. This path state includes at
least the unicast IP address of the previous hop node, which is used to
route the Resv messages hop-by-hop in the reverse direction.
Sometimes RESV messages can be merged!
Sender sends its TSpec (Traffic Spec, expressed as a token bucket) in PATH messages; note that some receivers may ask for a different TSpec.
        R1----R2
       /        \
A--R0            R5--B
       \        /
        R3----R4
What if A->R0->R1->R2->R5->B, but the reverse path is B->R5->R4->R3->R0->A?
A is sender.
B sends RESV message R5->R2->R1->R0->A by virtue of saved state at routers
But the actual R5->R2, R2->R1 might actually travel R4->R3->R0!
one-pass upstream request means receiver might not even find out what reservation was actually accepted! Though the sender should know....
Initiated by requests sent by receiver to sender; sender sends PATH
packets back down to receiver; these lay groundwork for return-path
computation. The PATH message also contains sender's traffic
specification, or Tspec.
RFC2210: standard Tspec has
- token bucket rate
- token bucket size
- peak data rate
- minimum policed unit m
- max packet size M
There is NO MaxDelay!!! But see notes in rfc 2212:
applications can use the Tspec values
above to estimate their queuing delay, which when added to propagation
delay gives an accurate total.
How we might do RSVP over the backbone: notes in rfc2208
basic idea: do something else there!!
What do routers do with reserved traffic?
Lots of scope for priority queues and per-reservation SFQ.
If reserved traffic has a guarantee of 20 Mbps, and you have
1% of that, you have 200Kbps. Period.
Assuming the infrastructure doesn't introduce excessive delay.
How VOIP may facilitate adoption of RSVP
Note that for some applications QoS can change dynamically (this is relevant for video). Also RSVP supports multicast.
Admission control:
how exercised???
Two questions:
Can we grant the requested reservation?
Should we?
The latter is harder. For the former, it is basically a question of whether we have that much capacity.
One approach is to have a separate FQ queue for each reservation. That's a lot of queues.
Or should we just do TBF policing?
Differentiated Services
Problem: Integrated Services may not scale well. Few ISPs have adopted it. Can we come up with a simpler mechanism?
Differentiated Services is, in its simplest form, a two-tier model: priority traffic, kept very limited in bandwidth, and everything else. A simple priority queue mechanism to handle the service classes is enough.
Priority traffic is marked
on ingress; this potentially expensive marking operation only occurs in
one place. However, routing within the network just looks at the bits.
Marking uses the DiffServ bits (DS bits), now the second byte of the IP header (formerly the TOS bits).
scaling issues of RSVP; possibility of using DiffServ in the core and RSVP/IntServ at the edges
DS
may start with a Service Level Agreement (SLA) with your ISP, that
defines just what traffic (eg VOIP) will get what service. No need to
change applications.
However, what if the upstream ISP does not acknowledge DS?
For that matter, what if it does?
Border router of upstream ISP must re-evaluate all DS-marked packets
that come in. There may be more DS traffic than expected! It is possible that traffic from some ISPs would be remarked into a lower DS-class.
Potential problem: "ganging up". DS only provides a "preferred" class of traffic; there are no guarantees.
DS octet will be set by ISP (mangle table?)
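If you set the DS field yourself rather than relying on the ISP, the mangle table can do it; a sketch using the iptables DSCP target (matching SIP on udp port 5060 is just a hypothetical way to catch voip signaling):
# iptables -t mangle -A POSTROUTING -p udp --dport 5060 -j DSCP --set-dscp-class EF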
DS field:
6 bits; 3+3 class+drop_precedence
Two basic strategies (per-hop behaviors, or PHBs): EF (Expedited Forwarding) and AF (Assured Forwarding).
101 110 "EF", or "Expedited Forwarding": best service. This is supposed to be for "realtime" services like voice.
Assured Forwarding: 3 bits of Class, 3 bits of Drop Precedence
Class:
100 class 4: best
011 class 3
010 class 2
001 class 1
Drop Precedence:
010 low (dropped last)
100 medium
110 high (dropped first)
Main thing: The classes each get PRIORITY service, over best-effort.
What happens if you send more than you've contracted for?
Uses the IPv4 TOS field, widely ignored in the past.
Routers SHOULD implement priority queues for service categories
Basic idea: get your traffic marked for the appropriate class.
Then what?
000 000: current best-effort status
xxx 000: traditional IPv4 precedence
PHBs (Per-Hop Behaviors): implemented by all routers
Only "boundary" routers do traffic policing/shaping/classifying/re-marking
to manage categories (re-marking is really part of shaping/policing)
EF: Expedited Forwarding
basically just higher-priority. Packets should experience low queuing delay.
Maybe not exactly; we may give bulk traffic some guaranteed share
Functionality depends on ensuring that there is not too much EF traffic.
Basically, we control at the boundary the total volume of EF
traffic (eg to a level that cannot saturate the slowest link), so that
we have plenty of capacity for EF traffic. Then we just handle it at a
higher priority.
This is the best service.
EF provides a minimum-rate guarantee.
This can be tricky: if we accept input traffic from many sources, and
have four traffic outlets R1, R2, R3, R4, then we should only accept
enough EF traffic that any one Ri can handle it. But we might go for a more statistical model, if in practice 1/4 of the traffic goes to each Ri.
AF: Assured Forwarding
Simpler than EF, but no guarantee. Traffic totals can be higher.
There is an easy way to send more traffic: it is just marked as "out".
In-out marking: each packet is marked "in" or "out" by the policer.
Actually, we have three precedence levels to use for marking.
From RFC2597:
The drop precedence level of a packet
could be assigned, for example, by using a leaky bucket traffic
policer, which has as its parameters a rate and a size, which is the
sum of two burst values: a committed burst size and an excess burst
size. A packet is assigned low drop precedence if the number of tokens in the bucket is greater than the excess burst size [ie bucket is full], medium drop precedence if the number of tokens in the bucket is greater than zero, but at most the excess burst size, and high drop precedence if the bucket is empty.
Packet mangling can be used to set the DS bits, plus
a goodly number of priority bands for the drop precedences (I'm not
sure how to handle the different subclasses within precedence groups;
they might get classful TBF service)
RIO: RED with In/Out: RED = Random Early Detection
Routers do not reorder!
See Stallings Fig 19.13(b) and # of classes, drop precedence
4 classes x 3 drop priorities
Or else find the Differentiated Services RFC
BGP: Border Gateway Protocol
See notes at bgp.html