Network Management
Week 13, Apr 25
LT-412, Mon 4:15-6:45 pm
tunnels and routing
tc, tbf, htb
BGP
RSVP, differentiated services
linux1 and linux2 startup
What do we have to do to get these to work?
laptop ----NAT----linux1----net2-----linux2
Step 2: Do these:
linux1: ifconfig eth1 10.0.37.1 netmask 255.255.255.0
linux1 eth0 should come up as 10.2.5.17 by dhcp
linux2: ifconfig eth1 10.0.37.2 netmask 255.255.255.0
Step 4: enable routing on linux1:
linux2: ip route add to default via 10.0.37.1
linux1: echo 1 > /proc/sys/net/ipv4/ip_forward
Step 5e:
valhal: ip route add to 10.0.37.0/24 via 10.2.5.17 (allows valhal to reach linux2)
linux1: ip route add to default via 10.2.5.1 (allows linux1 to ping)
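For reference, the same configuration can be written with the newer ip command and sysctl (a sketch; same interfaces and addresses as above):
linux1: ip addr add 10.0.37.1/24 dev eth1; ip link set eth1 up
linux1: sysctl -w net.ipv4.ip_forward=1
linux2: ip addr add 10.0.37.2/24 dev eth1; ip link set eth1 up
linux2: ip route add default via 10.0.37.1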
Tunnels
I can create a tunnel from home to ulam2 using pppd (point-to-point protocol daemon). The exact details are not important. I end up with an interface tun0
on each machine, to which I can assign IP addresses 192.168.4.1 (ulam2)
and 192.168.4.2 (home). Packets are transmitted along this "virtual"
link by being encapsulated and sent along the "default" path.
What do I have to do to route traffic to Loyola through this tunnel? What do I have to do to secure it?
Host home connects to a Linksys
box with inside IP address 192.168.0.2, through interface eth2. It
connects to a household subnet through interface eth0, with IP address
10.0.0.1.
Here is home's routing table, with mystery entries removed
147.126.65.47 via 192.168.0.2 dev eth2                        # host-specific route
10.0.0.0/24 dev eth0 proto kernel scope link src 10.0.0.1     # local subnet
10.38.0.0/16 via 192.168.4.1 dev tun0                         # two routes to two Loyola subnets
147.126.0.0/16 via 192.168.4.1 dev tun0                       # note this one conflicts with #1 above
default via 192.168.0.2 dev eth2
Ulam2 has NAT enabled, with eth0 being the outside interface. Here are the routing entries:
10.0.0.0/24 via 192.168.4.2 dev ppp0
default via 147.126.65.1 dev eth0 metric 100
Securing:
If the point is to secure the home network against attacks from the
outside world, it is best to do that at the home end. Here is what to
do to allow connections to originate within the home subnet only:
home# iptables --append FORWARD -m state --state ESTABLISHED,RELATED --jump ACCEPT
home# iptables --append FORWARD -m state --state NEW -i eth0 --jump ACCEPT
home# iptables --append FORWARD --jump DROP
We would also want to append these entries to the INPUT chain to block connections to machine home (rather than to other machines on the subnet).
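A sketch of the corresponding INPUT rules (assuming, as above, that eth0 is the inside interface; the lo rule keeps local services working):
home# iptables --append INPUT -i lo --jump ACCEPT
home# iptables --append INPUT -m state --state ESTABLISHED,RELATED --jump ACCEPT
home# iptables --append INPUT -m state --state NEW -i eth0 --jump ACCEPT
home# iptables --append INPUT --jump DROP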
Another security risk is that such a private link allows for
"steppingstone" attacks from bad -> home -> loyola. One approach
is to make the tunnel itself the default route, meaning all my traffic would be routed through Loyola, except for the host-specific route to ulam2.
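A sketch of that approach on host home, using the addresses above (the host-specific route to ulam2 has to stay outside the tunnel, so the encapsulated tunnel packets themselves can still get out):
home# ip route add 147.126.65.47 via 192.168.0.2 dev eth2
home# ip route replace default via 192.168.4.1 dev tun0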
Traffic shaping and traffic control
Generally there's not much point in doing shaping from the bottleneck
link into a faster link. The bottleneck link has done all the shaping
already!
fair queuing
Restricts bandwidth when there is competition, but allows full use when the network is idle. Caps a sender's share only when the link is busy!
token bucket
Restricts bandwidth to a fixed rate, period (but also allows for burstiness as per the bucket, which can be kept small).
tc command:
shaping: output rate limiting (delay or drop)
scheduling: prioritizing. Classic application: sending voip ahead of bulk traffic
policing: input rate regulation
dropping: what gets done to nonconforming traffic
Two scenarios to restrict user/host Joe:
1. Reduce the absolute bandwidth (inbound? outbound?) available to Joe. Even if the link is otherwise idle, Joe is still capped.
2. Guarantee non-Joe users a minimum share; ie cap Joe's bandwidth only when the link is busy.
qdisc: queuing discipline
You can think of this as the TYPE of queue. Examples: fifo, fifo+taildrop, fifo+randomdrop, fair_queuing, RED, tbf
queuing disciplines are applied to INTERFACES, using the tc command.
Queuing disciplines can be "classless" or "classful" (hierarchical)
Queuing Disciplines (qdisc): does scheduling. Some also support shaping/policing
how packets are enqueued, dequeued
- fifo + taildrop
- fifo + random drop
- RED: introduces random drops not for policing, but to encourage good behavior by senders. Used in core networks, not leaf networks
- stochastic fair queuing
(each TCP connection is a flow): gives each flow a guaranteed
fraction of bandwidth, when needed. Other flavors: flows are subnets,
etc. However, if we're doing scheduling of inbound traffic, it doesn't
do much good to do sfq based on destination (unless we can do it at the
upstream router at our ISP). Some reasons for an application to open
multiple TCP connections:
- cheating SFQ limits
- much improved high-bandwidth performance
- the data naturally divides into multiple connections
- pfifo_fast (or, generically, pfifo): priority fifo. tc's pfifo_fast has three priority bands built-in.
enqueuing: figure out which band the packet goes into
dequeuing: take from band 0 if nonempty, else 1 if nonempty, else 2
Basic "classless" qdiscs
pfifo_fast
(see man pfifo_fast): three-band FIFO queue
Consider the following iptables command, to set the TOS bits on outbound ssh traffic to "Minimize-Delay":
# iptables -A OUTPUT -t mangle -p tcp --dport 22 -j TOS --set-tos Minimize-Delay
This works with pfifo_fast, which provides three bands. Band selection by
default is done using TOS bits of packet header (which you probably
have to mangle to set). See Hubert, §9.2.1.1, for a table of the TOS-to-band map.
Dequeuing algorithm (typically invoked whenever the hardware is ready for a packet, or whenever the qdisc reports to the hardware that it has a packet):
if there are any packets in band 0, dequeue the first one and send it
else if there are any packets in band 1, dequeue the first one and send it
else if there are any packets in band 2, dequeue the first one and send it
else report no packets available
Note that in a very direct sense pfifo_fast does support three
"classes" of traffic. However, it is not considered to be classful,
since
- we cannot control how traffic is classified
- we cannot attach subqueues to the individual classes
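You can check which qdisc an interface is currently using, and its packet/byte counters, without configuring anything:
# tc -s qdisc show dev eth0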
Example: queue flooding on upload
In practice, it is very important to set interactive traffic to have a
higher priority than normal traffic (eg web browsing, file downloads).
However, you don't have much control of the downlink traffic, and if
the uplink queue is on the ISP's hardware (eg their cablemodem), then
you won't have much control of the upload side.
me------<fast>------[broadband gizmo]-----------<slow>-----...internet
In the scenario above, suppose the broadband gizmo, BG, has a queue
capacity of 30K (30 packets?). A bulk UPLOAD (though not download) will
fill BG's queue, no matter what you do within your subnet with pfifo_fast.
(We are presuming you do not have the option of enabling pfifo_fast on BG itself.) This means that every interactive packet will now wait behind a 30KB
queue to get up into the internet. As the upload packets are sent, they
will be ACKed and then your machine will replenish BG's queue.
One approach is to reduce BG's queue. But this may not be possible.
Here's another approach:
me---<fast>---[tq_router]---<slow2>---[broadband gizmo]---<slow>--internet
Make sure slow2 ~ slow. Then upload will fill tq_router's queue, but fast traffic can still bypass.
Logically, can have "me" == "tq_router"
_______________________
Token bucket filter (tbf)
See man tbf or man tc-tbf
This can be used to restrict flow to a set average rate, while allowing bursts. The tbf qdisc slows down excessive bursts to meet filter limits. This is shape-only, no scheduling.
tbf (or htb) is probably the preferred way of implementing bandwidth caps.
tokens are put into a bucket at a set rate. If a packet arrives:
- tokens available: send immediately and decrement the bucket
- no token: drop (or wait, below)
Over the long term, your transmission rate will equal the token rate.
Over the short term, you can transmit bursts of up to token size.
Parameters:
- bucketsize (burst): how large the bucket can be
- rate: rate at which tokens are put in
- limit: number of bytes that can wait for tokens. All packets wait, in essence; if you set this to zero then the throughput is zero. While theoretically it makes sense to set this to zero, in practice that appears to trigger serious clock-granularity problems.
- latency: express limit in time units
- mpu: minimum packet unit; size of an "empty" packet (headers only)
Granularity issue: buckets are typically updated every 10 ms. During 10 ms, on a 100Mbps link, 1 Mbit (~125 KB) can accumulate, or ~100 large packets! Generally, tbf introduces no burstiness itself up to 1mbit (a 10 kbit packet takes 10 ms on a 1mbit link). Beyond that, a steady stream of packets may "bunch up" due to the every-10-ms bucket refill.
Use of tbf for bulk traffic into a modem (rate is set a bit below the actual 250kbit bandwidth; burst 1540 is one packet):
# tc qdisc add dev ppp0 root tbf rate 220kbit latency 50ms burst 1540
You want packets queued at your end, NOT within your modem!
Otherwise you will have no way to use pfifo_fast to have interactive traffic leapfrog the queue.
What you REALLY want is for TBF to apply only to the bulk bands of PFIFO_FAST.
Can't do this with TBF; that would be classFUL traffic control (though we can do that with PRIO, the classful analogue of pfifo_fast)
Can't be used to limit a subset of the traffic!
TBF is not WORK-CONSERVING:
the link can be idle and someone can have a packet ready to send, and yet it still waits.
Demo: linux1, linux2 and tc
start tcp_writer in ~/networks/java/tcp_reader. This accepts
connections to port 5430, and then sends them data as fast as it can.
tbf1:
linux1: tc qdisc add dev eth1 root tbf rate 100kbps burst 1mb limit 1mb
also try:
tc qdisc change dev eth1 root tbf rate newrate burst newburst limit newlimit
This might cause a spike in kbit/sec numbers:
479
479
159
4294
564
564
...
clear with tc qdisc del dev eth1 root
tc_stats
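A rough stand-in for tc_stats is simply to poll the qdisc counters:
# watch -n 1 tc -s qdisc show dev eth1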
demo: enable rate 1000 kbit/sec (1 kbit/ms), burst 20kb, limit 100kb. Then try limit = 5kb v 6kb.
At the given rate, 1kb takes 8ms. The bucket is replenished every 10ms
(hardware clock limitation), so a burst should not be consumed much
during a 10ms interval.
At 10mbit, 10ms is 12kb, and we won't get that rate unless burst is set to about that.
___________________________
Fair Queuing
Note
that the FQ clock (slow clock) can be reset to zero whenever all queues are empty,
and can in fact just stay at zero until something arrives.
Linux "sfq":
- Flows are individual tcp connections. NOT hosts, or subnets, etc! Can't get this?!
- Each flow is hashed by srcaddr,destaddr,port.
- Each bucket is considered a separate input "pseudoqueue"
- Collisions result in unfairness, so the hash function is altered at regular intervals to minimize this.
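Attaching sfq is a one-liner; perturb 10 rehashes every 10 seconds (the periodic hash change just mentioned):
# tc qdisc add dev eth0 root sfq perturb 10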
What we really want is a way to define flow_ids dynamically (eg per host or per subnet), and a way to assign connections to them.
sfq is schedule-only, no shaping
What you probably want is to apply this to DOWNLOADs.
tc doesn't do that if your linux box is tied directly to your broadband gizmo. Applied to downloads at a router,
joe,mary,alice---<---1--------[router R]------<---2----internet
it would mean that each of joe,mary,alice's connections would get
1/3 the bw, BUT that would be 1/3 of the internal link, link 1.
Regulating shares of link 2 would have to be done upstream.
If we know that link 2 has a bandwidth of 3 Mbps, we can use CBQ
(below) to restrict each of joe,mary,alice to 1 Mbps, by controlling
the outbound queue at R into link 1.
Further sharing considerations:
If we divide by 1 flow = 1 tcp connection, joe can double throughput by adding a second tcp connection.
If we divide by 1 flow = 1 host, we do a little better at achieving
per-user fairness, assuming 1 user = 1 machine. Linux sfq does not
support this.
Linux sfq creates multiple virtual queues. It's important to realize
that there is only one physical queue; if one sender dominates that
queue by keeping it full, to the point that the other connections get
less bandwidth than their FQ share, then the later division into
LOGICAL queues won't help the underdogs much.
sfq IS work-conserving
The slow-clock algorithm is moderately tricky. Is there a simpler way?
How about round-robin quantums?
This means that you establish a quantum size, q, and send packets from
each queue until the total number of bytes is > q. You then service
the other active queues, in round-robin style. Suppose the total number
of bytes sent from one queue in one burst is q+r; next time, you give
that queue a reduced quantum of q-r.
Over the long term, this is reasonably close to fair queuing, provided
each queue has q bytes available to send at any one time.
Otherwise, that queue will be emptied and there will be no carryover
opportunity.
Here are two scenarios where that fails, both involving q = 5k:
- queue1 typically contains > 5k of data (eg ten 1k-packets (implying a sliding-window size > 10)); queue2 however is doing stop-and-wait
and at any one time has only a single 1k packet. In this case, 5k of
queue1 will be sent for every 1k of queue2, and queue2 will have no
opportunity to catch up
- queue1 typically contains > 5k of data, and queue2 is sending a 0.1k voip packet every 10 ms.
However, note that if queue1's quantum can be sent in less than 10 ms, queue2 is still getting "immediate" service. And if queue1's quantum takes, say, 20 ms, then queue2 is sending two packets at a time every 20 ms, which probably does not make a huge difference to voip. However, that is 20 ms extra, and if there are multiple data queues, having queue2 get higher priority would make sense.
One solution is to make the quantum small, just a little above the max packet size (the so-called MTU, Maximum Transmission Unit).
RED
Generally intended for internet routers
We drop packets at random (at a very low rate) when the queue reaches a preset fraction of its capacity (eg 50% of max), to signal tcp senders to slow down.
These gizmos are added to interfaces, basically. If you want to slow a
particular sender down, create a virtual interface for them or use
classful qdiscs.
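A sketch of attaching RED to an interface (values are illustrative only; min/max/limit are byte thresholds on the average queue size, avpkt the assumed average packet size):
# tc qdisc add dev eth0 root red limit 400000 min 30000 max 90000 \
    avpkt 1000 burst 55 probability 0.02 bandwidth 10mbit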
classful qdiscs
CBQ, HTB, PRIO
Disneyland example: what is the purpose of having one queue feed into another queue?
However, under tc we can have these classful qdiscs form a tree, possibly of considerable depth.
LARTC 9.3: (http://lartc.org/howto/lartc.qdisc.advice.html)
- To purely slow down outgoing traffic, use the Token Bucket Filter. Works up to huge bandwidths, if you scale the bucket.
- If your link is truly full and you want to make sure that no single session can dominate your outgoing bandwidth, use Stochastic Fairness Queueing.
- If you have a big backbone and know what you are doing, consider Random Early Drop (see Advanced chapter).
- To 'shape' incoming traffic which you are not forwarding, use the Ingress Policer. Incoming shaping is called 'policing', by the way, not 'shaping'.
- If you are forwarding
it, use a TBF on the interface you are forwarding the data to. Unless
you want to shape traffic that may go out over several interfaces, in
which case the only common factor is the incoming interface. In that
case use the Ingress Policer (see the sketch after this list).
- If you don't want to shape, but only want to see if your
interface is so loaded that it has to queue, use the pfifo queue (not
pfifo_fast). It lacks internal bands but does account the size of its
backlog.
- Finally - you can also do "social shaping". You may not always be
able to use technology to achieve what you want. Users experience
technical constraints as hostile. A kind word may also help with
getting your bandwidth to be divided right!
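Here is a sketch of the Ingress Policer mentioned above: attach the special ingress qdisc, then police everything arriving on the interface (the rate is illustrative):
# tc qdisc add dev eth0 handle ffff: ingress
# tc filter add dev eth0 parent ffff: protocol ip prio 50 u32 \
    match ip src 0.0.0.0/0 police rate 1mbit burst 10k drop flowid :1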
Basic terminology for classful qdiscs
classes form a tree
each leaf node has a class
At each interior node, there is a CLASSIFIER algorithm (a filter) and a set of child class nodes.
Linux: at router input(ingress), we can apply POLICING to drop
packets; at egress, we apply SHAPING to put packets in the right queue.
Terminology derives from the fact that there is no ingress queue in
linux (or most systems).
Classful Queuing Disciplines
CBQ, an acronym for 'class-based queuing', is the best known.
It is not, however, the only classful queuing discipline. And it is rather baroque.
PRIO: divides into classes :1, :2, :3 (user-configurable this time)
dequeuing: take packet from :1 if avail; if not then go on to :2, etc
by default, packets go into the band they would go into in PFIFO_FAST, using TOS bits. But you can attach tc filters to change this.
Hierarchy of PRIO queues is equivalent to the "flattened" PRIO queue.
However, a hierarchy of PRIO queues with SFQ/TBF offspring is not "flattenable".
TBF: classic rate-based shaper; packets wait for their token
(policing version: you get dropped if you arrive before your token)
For a classless qdisc, we're done once we create it. (Its parent might be nonroot, though).
For a classful qdisc, we add CLASSES to it. Each class will then in turn have a qdisc added to it.
parent of a class is either a qdisc or a class of same type
class major numbers must match parent.
qdisc major numbers are new
each class needs to have something below it, although every class gets a fifo qdisc by default.
We then attach a sub-qdisc to each subclass.
LARTC example:
Hubert example in 9.5.3.2 has a prio qdisc at the root.
The subclasses 1:1, 1:2, and 1:3 are automatic, as is the filtering.
          1:                     root qdisc
        /  |  \
      /    |    \
   1:1    1:2    1:3             classes, added automatically
    |      |      |
   10:    20:    30:             qdiscs
   sfq    tbf    sfq
  band 0  band 1  band 2
Bulk traffic will go to 30:, interactive traffic to 20: or 10:.
Command lines for adding a class-based prio queue (class version of pfifo-fast)
# tc qdisc add dev eth0 root handle 1: prio
This automatically creates classes 1:1, 1:2, 1:3. We could say
tc qdisc add dev eth0 root handle 2: prio bands 5
to get bands 2:1, 2:2, 2:3, 2:4, and 2:5. Then zap with
tc qdisc del dev eth0 root
But suppose we stick with the three bands, and add:
# tc qdisc add dev eth0 parent 1:1 handle 10: sfq       (prob should be tbf too)
# tc qdisc add dev eth0 parent 1:2 handle 20: tbf rate 20kbit buffer 1600 limit 3000
# tc qdisc add dev eth0 parent 1:3 handle 30: sfq
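If you don't want to rely on the TOS bits for band selection, a u32 filter at the root can steer traffic explicitly; here is a sketch that forces outbound ssh (dport 22) into the high-priority class 1:1:
# tc filter add dev eth0 parent 1:0 protocol ip prio 1 u32 \
    match ip dport 22 0xffff flowid 1:1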
We now get a somewhat more complex example.
Hierarchical Fair Queuing
1. Why it's not flattenable. Consider this example:
          root
         /    \
       50      50
      /  \    /  \
    25    25 25    25
    A     B  C     D
ABC active, D idle:
Hierarchical divides 25/25/50
Flat divides 33/33/33
2. How to define using fluid flows (as we did with flat fair queuing)
3. The slow-clock algorithm (or any finishing-time algorithm) implies that the finishing order of two packets cannot depend on future arrivals. However, as the example below shows, hierarchical fair queuing does need to take future arrivals into account, and so we can no longer use that strategy!
Example: from Bennett & Zhang, Hierarchical packet fair queueing algorithms, IEEE/ACM Transactions on Networking, Oct 1997, §2.2
          root
         /    \
       80      20
      /  \      |
    75    5     |
    A1    A2    B
All packets have size 1; link speed is 1 (ie 1 packet/unit_time)
T=0:
A1's queue is idle; A2's and B's are very full. A2 gets 80%, B gets
20%. Finishing time calculations are such that A2 sends 4, then B sends
1, then A2 sends 4, then B sends 1....
But now let a packet arrive on A1. All of a sudden, A2 should get 5%, or 1/4 the rate of B.
But the finishing-time model can't change, because those calculations don't allow it!
Example 3: from Bennett & Zhang, 3.1
11 users. User 1 has a guarantee of 50%, the others all have 5%. WFQ
sends 10 packets for user 1, then one for each of the other
users (10 in all). So-called Generalized Processor Sharing model: 5 of user 1 / 10
of others / 10 of user 1 / ...
difference between WFQ (our algorithm, non-hierarchical) and fluid model
There is active research on algorithms that work for packets, have
bounded delay with respect to the fluid model, and are fast. The round-robin
quantum algorithm is fast, but doesn't quite meet the bounded-delay
rule. However, it is easy to implement, and works well if all the
queues are "busy".
CBQ
The acronym stands for "Class Based Queuing", though there are several
other forms of classful queuing disciplines. CBQ is an attempt to
combine some elements of Fair Queuing and Token Bucket in a single
mechanism.
Goal: classful shaping. But the shaping (rate-limiting) doesn't work
the way you'd like, because the rate is ESTIMATED somewhat artificially
by measuring average idle time at the device interface.
Example from LARTC [Hubert] 9.5.4.4
Goal:
- webserver traffic limited to 5 mbit (class 1:3, qdisc 30:)
- smtp traffic limited to 3 mbit (class 1:4, qdisc 40:)
- combination limited to 6 mbit
          1:                root qdisc
          |
         1:1                child class
        /   \
      1:3   1:4             leaf classes
       |     |
      30:   40:             qdiscs
     (sfq) (sfq)
Create root:
# tc qdisc add dev eth0 root handle 1:0 cbq bandwidth 100Mbit avpkt 1000 cell 8
create CLASS below root qdisc node to limit to 6 mbit
"bounded" (at end) means this class can't borrow from other idle classes.
This caps the rate at 6Mbit
# tc class add dev eth0 parent 1:0 classid 1:1 cbq bandwidth 100Mbit \
rate 6Mbit weight 0.6Mbit prio 8 allot 1514 cell 8 maxburst 20 avpkt 1000 bounded
Now create the two leaf classes, with classids 1:3 and 1:4
These are not bounded (which means they can borrow), and also not
isolated, which means they can lend. Classid n:m is our choice, but
must have n=1 to match parent definition above.
# tc class add dev eth0 parent 1:1 classid 1:3 cbq bandwidth 100Mbit \
rate 5Mbit weight 0.5Mbit prio 5 allot 1514 cell 8 maxburst 20 avpkt 1000
# tc class add dev eth0 parent 1:1 classid 1:4 cbq bandwidth 100Mbit \
rate 3Mbit weight 0.3Mbit prio 5 allot 1514 cell 8 maxburst 20 avpkt 1000
Both leaf classes have fifo qdisc by default. We could leave it that way, but here's how to replace it with sfq
# tc qdisc add dev eth0 parent 1:3 handle 30: sfq
# tc qdisc add dev eth0 parent 1:4 handle 40: sfq
Now we attach filtering rules, to the root node. Note that flowid in a filter spec matches a classid.
sport 80 (srcport 80) means web traffic.
# tc filter add dev eth0 parent 1:0 protocol ip prio 1 u32 match ip sport 80 0xffff flowid 1:3
# tc filter add dev eth0 parent 1:0 protocol ip prio 1 u32 match ip sport 25 0xffff flowid 1:4
Note that we use 'tc class add' to CREATE classes within a qdisc, but
that we use 'tc qdisc add' to actually add qdiscs to these classes.
Traffic that is not classified by either of these two rules will then be processed within 1:0 (the parent qdisc), and be unlimited.
If SMTP+web together try to exceed the set limit of 6mbit/s, bandwidth
will be divided according to the weight parameter, giving 5/8 of
traffic to the webserver and 3/8 to the mail server.
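To verify that the division is happening as intended, watch the per-class and per-qdisc counters:
# tc -s class show dev eth0
# tc -s qdisc show dev eth0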
How do we have one class represent a subnet, and another class represent "everything else"? Use a default class.
Note that the "default" option is specific to htb; cbq doesn't have this.
Boxman, 8.2.1
tc qdisc add dev eth0 root handle 1: htb default 90
Hierarchical token bucket (htb)
This is a classful version of tbf. Note that because we may now have sibling classes, we have an important sharing feature: each sibling class is allocated the minimum of what it requests and its assigned share; that is, each class is guaranteed a minimum share (as with fair queuing). However, when some classes are idle, the fairness of the division of the leftover bandwidth is limited by the granularity of the quantum round-robin.
pld home example
Commands to create an htb qdisc with three child classes:
1. Hosts 10.0.0.1 and 10.0.0.2 go to class :10
2. subnet 10.0.0.0/29 (10.0.0.0 - 10.0.0.7 except the above two!) goes to class :29
3. All other traffic goes to class :100
The qdisc is placed on the interior interface of the router, to regulate inbound traffic.
We suppose that this is to control flow over a link that has a
sustained bandwidth limit of BW, a bucket size of BURST, and a peak
bandwidth of PBW. (PBW is not used here.)
BW=56 #kbps
BURST=350 #mb
tc qdisc add dev eth0 root handle 1: htb default 100
tc class add dev eth0 parent 1: classid 1:1 htb rate ${BW}kbps burst ${BURST}mb
# class 10 is limited by parent only
tc class add dev eth0 parent 1:1 classid 1:10 htb rate ${BW}kbps burst ${BURST}mb
# class 29 has same rate, but half the burst
HBURST=$(expr $BURST / 2)
tc class add dev eth0 parent 1:1 classid 1:29 htb rate ${BW}kbps burst ${HBURST}mb
# class 100 has 3/4 the refill rate, too
BW100=$(expr 3 \* $BW / 4)
tc class add dev eth0 parent 1:1 classid 1:100 htb rate ${BW100}kbps burst ${HBURST}mb
tc filter add dev eth0 parent 1:0 protocol ip u32 \
match ip dst 10.0.0.1/32 flowid 1:10
tc filter add dev eth0 parent 1:0 protocol ip u32 \
match ip dst 10.0.0.2/32 flowid 1:10
tc filter add dev eth0 parent 1:0 protocol ip u32 \
match ip dst 10.0.0.0/29 classid 1:29
# no rule for flowid 1:100; taken care of by default rule
Actually, I can't use this, because tc does not handle bucket sizes so
large. So what I do instead is have a packet-sniffer-based usage
counter on my (linux) router, which issues tc class change commands
to reduce any class that is over-quota. I then use more moderate values
for tc itself: I create the classes (I'm up to five now), and each one
gets a rate of ~1-10Mbit and a bucket of ~10mbyte. However, the rate is
reduced drastically when a class reaches its quota.
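A sketch of what such an over-quota adjustment might look like (values hypothetical; assumes class 1:3 is a direct child of the root htb qdisc 1:):
tc class change dev $DEV parent 1: classid 1:3 htb rate 32kbit burst 10kb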
$RATE = 1-10 mbit
$BURST = 10 mbyte
# band 1
tc filter add dev $DEV parent 1:0 protocol ip u32 match ip dst 10.0.0.0/30 flowid 1:1
tc filter add dev $DEV parent 1:0 protocol ip u32 match ip dst 10.0.0.2/32 flowid 1:1
# band 2
tc filter add dev $DEV parent 1:0 protocol ip u32 match ip dst 10.0.0.4/30 flowid 1:2
# band 3 : gram
tc filter add dev $DEV parent 1:0 protocol ip u32 match ip dst 10.0.0.10/32 flowid 1:3
tc qdisc add dev $DEV parent 1:3 tbf rate $RATE burst $BURST limit $LIMIT
# band 4
tc filter add dev $DEV parent 1:0 protocol ip u32 match ip dst 10.0.0.11/32 flowid 1:4
tc qdisc add dev $DEV parent 1:4 tbf rate $RATE burst $BURST limit $LIMIT
# band 5:
tc filter add dev $DEV parent 1:0 protocol ip u32 match ip dst 10.0.0.0/24 flowid 1:5
tc qdisc add dev $DEV parent 1:5 tbf rate 32kbit burst 10kb limit 10kb
Note that while iproute2 has some problems fitting into the Internet model of a single routing table per router, tc applies just to egress interfaces and so is completely separate from routing.
Both of the htb examples above use the u32 classifier.
Another HTB example
Token bucket is a simple rate-limiting strategy, but the basic version limits everyone in one pool.
Why does hierarchical matter? Strict rate-limiting would not! (why?)
However, HTB is not "flattenable" because you share the parent bucket with your siblings. Consider
            root
           /    \
       A: b=20   b=30
                /    \
           B: b=20   C: b=20
If B and C are both active, their bucket drops to about 15, as each gets half the parent's bucket.
htb lets you apply flow-specific rate limits (eg to specific users/machines)
Rate-limiting can be used to limit inbound bandwidth to set values
The example above, redone using htb: LARTC 9.5.5.1
Functionally almost identical to the CBQ sample configuration above:
# tc qdisc add dev eth0 root handle 1: htb default 30
(30 is a reference to the 1:30 class, to be added)
# tc class add dev eth0 parent 1: classid 1:1 htb rate 6mbit burst 15k
Now here are the classes: note "ceil" in the second two. The ceil
parameter allows "borrowing": use of idle bandwidth. Parent of classes
here must be the class above. Only one class can have root as its
parent!
Sort of weird.
          1:
          |
         1:1
        /  |  \
    1:10  1:20  1:30
     web  smtp  other
# tc class add dev eth0 parent 1:1 classid 1:10 htb rate 5mbit burst 15k
# tc class add dev eth0 parent 1:1 classid 1:20 htb rate 3mbit ceil 6mbit burst 15k
# tc class add dev eth0 parent 1:1 classid 1:30 htb rate 1kbit ceil 6mbit burst 15k
The author then recommends SFQ beneath these classes (to replace whatever default leaf qdisc is there):
# tc qdisc add dev eth0 parent 1:10 handle 10: sfq perturb 10
# tc qdisc add dev eth0 parent 1:20 handle 20: sfq perturb 10
# tc qdisc add dev eth0 parent 1:30 handle 30: sfq perturb 10
Add the filters which direct traffic to the right classes: here, we divide by web/email/other
# tc filter add dev eth0 protocol ip parent 1:0 prio 1 u32 \
match ip dport 80 0xffff flowid 1:10
# tc filter add dev eth0 protocol ip parent 1:0 prio 1 u32 \
match ip sport 25 0xffff flowid 1:20
===================================================================
2.5: classifying/filtering:
fwmark:
# iptables -A PREROUTING -i eth0 -t mangle -p tcp --dport 25 -j MARK --set-mark 1
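The mark by itself does nothing; a matching fw filter is needed to steer marked packets (a sketch, assuming a classful qdisc 1: with a class 1:20 already defined):
# tc filter add dev eth0 parent 1:0 protocol ip handle 1 fw flowid 1:20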
mangle table & CLASSIFY
iptables -t mangle -A POSTROUTING -o eth2 -p tcp --sport 80 -j CLASSIFY --set-class 1:10
u32:
allows matching on bits of packet headers. u32 is
completely stateless (that is, it doesn't remember past connection
state; it is strictly a per-packet matcher). The underlying matches are
all numeric, but there are preset symbolic names for some fields, to
help. See u32 examples above
(repeated here:)
# tc filter add dev eth0 parent 1:0 protocol ip prio 1 u32 match ip sport 80 0xffff flowid 1:3
# tc filter add dev eth0 parent 1:0 protocol ip prio 1 u32 match ip sport 25 0xffff flowid 1:4
3. Example for restricting bandwidth used by a single host (or subnet): From www.roback.cc/howtos/bandwidth.php
uses cbq, which does some approximate calculations to limit the flow of subclasses.
Also uses mangling/fwmark to do classification
DNLD = download bandwidth
UPLD = upload bandwidth
DWEIGHT/UWEIGHT: weighting factors, more or less 1/10 of DNLD/UPLD
tc qdisc add dev eth0 root handle 11: cbq bandwidth 100Mbit avpkt 1000 mpu 64
tc class add dev eth0 parent 11:0 classid 11:1 cbq rate $DNLD weight $DWEIGHT \
allot 1514 prio 1 avpkt 1000 bounded
tc filter add dev eth0 parent 11:0 protocol ip handle 4 fw flowid 11:1
tc qdisc add dev eth1 root handle 10: cbq bandwidth 10Mbit avpkt 1000 mpu 64
tc class add dev eth1 parent 10:0 classid 10:1 cbq rate $UPLD weight $UWEIGHT \
allot 1514 prio 1 avpkt 1000 bounded
tc filter add dev eth1 parent 10:0 protocol ip handle 3 fw flowid 10:1
Now MARK the incoming packets, from the designated subnet:
# Mark packets to route
# Upload marking
$IPTABLES -t mangle -A FORWARD -s 192.168.0.128/29 -j MARK --set-mark 3
$IPTABLES -t mangle -A FORWARD -s 192.168.0.6 -j MARK --set-mark 3
# Download marking
$IPTABLES -t mangle -A FORWARD -s ! 192.168.0.0/24 -d 192.168.0.128/29 -j MARK --set-mark 4
$IPTABLES -t mangle -A FORWARD -s ! 192.168.0.0/24 -d 192.168.0.6 -j MARK --set-mark 4
Basic idea behind reservation/priority systems:
You don't have to segregate the physical traffic; all you have to do is
(a) mark or describe the traffic
(b) use queuing disciplines to give it priority
RSVP; aka Integrated Services
RSVP: reservations compatible with multicast: receiver-initiated!!
[ignore details of mechanism for network management]
- transparent to senders
- suitable for multicast receivers!!!
- mechanisms for CHANGES IN ROUTE
- different receivers may have different needs; these can sometimes be accommodated by routers, without bothering the sender.
how RSVP routers recognize flows; IP6 flowID
typical reservation timeout: 30 sec
RSVP does include teardown messages (both directions), but RSVP won't fall apart if they are not sent.
RSVP is transparent to nonparticipating ISPs
Receivers send periodic (30 sec) RESV messages to refresh routers; routers forward up the path.
- works well if receiver crashes
- works well if router crashes
- how does it figure the path? Because routers send PATH mesg.
Each RSVP sender host transmits RSVP "Path" messages downstream
along the uni-/multicast routes provided by the routing protocol(s),
following the paths of the data. These Path messages store "path
state" in each node along the way. This path state includes at
least the unicast IP address of the previous hop node, which is used to
route the Resv messages hop-by-hop in the reverse direction.
Sometimes RESV messages can be merged!
Sender sends its TSpec (Traffic Spec, expressed as a token bucket) in PATH messages; note that some receivers may ask for a different TSpec.
        R1----R2
       /        \
A--R0            R5--B
       \        /
        R3----R4
What if A->R0->R1->R2->R5->B, but the reverse path is B->R5->R4->R3->R0->A?
A is sender.
B sends RESV message R5->R2->R1->R0->A by virtue of saved state at routers
But the actual R5->R2, R2->R1 might actually travel R4->R3->R0!
one-pass upstream request means receiver might not even find out what reservation was actually accepted! Though the sender should know....
Initiated by requests sent by receiver to sender; sender sends PATH
packets back down to receiver; these lay groundwork for return-path
computation. The PATH message also contains sender's traffic
specification, or Tspec.
RFC2210: standard Tspec has
- token bucket rate
- token bucket size
- peak data rate
- minimum policed unit m
- max packet size M
There is NO MaxDelay!!! But see notes in rfc 2212:
applications can use the Tspec values
above to estimate their queuing delay, which when added to propagation
delay gives an accurate total.
How we might do RSVP over the backbone: notes in rfc2208
basic idea: do something else there!!
What do routers do with reserved traffic?
Lots of scope for priority queues and per-reservation SFQ.
If reserved traffic has a guarantee of 20 Mbps, and you have
1% of that, you have 200Kbps. Period.
Assuming the infrastructure doesn't introduce excessive delay.
How VOIP may facilitate adoption of RSVP
Note that for some applications QoS can change dynamically (this is relevant for video). Also RSVP supports multicast.
Admission control:
how exercised???
Two questions:
Can we grant the requested reservation?
Should we?
The latter is harder. For the former, it is basically a question of whether we have that much capacity.
One approach is to have a separate FQ queue for each reservation. That's a lot of queues.
Or should we just do TBF policing?
Differentiated Services
Problem: Integrated Services may not scale well. Few ISPs have adopted it. Can we come up with a simpler mechanism?
Differentiated Services is, in its simplest form, a two-tier model: priority traffic, kept very limited in bandwidth, and everything else. A simple priority queue mechanism to handle the service classes is enough.
Priority traffic is marked
on ingress; this potentially expensive marking operation only occurs in
one place. However, routing within the network just looks at the bits.
Marking uses the DiffServ bits (DS bits), now the second byte of the IP header (formerly the TOS bits).
scaling issues of RSVP; possibility of using DiffServ in the core and RSVP/IntServ at the edges
DS
may start with a Service Level Agreement (SLA) with your ISP, that
defines just what traffic (eg VOIP) will get what service. No need to
change applications.
However, what if the upstream ISP does not acknowledge DS?
For that matter, what if it does?
Border router of upstream ISP must re-evaluate all DS-marked packets
that come in. There may be more DS traffic than expected! It is possible that traffic from some ISPs would be remarked into a lower DS-class.
Potential problem: "ganging up". DS only provides a "preferred" class of traffic; there are no guarantees.
DS octet will be set by ISP (mangle table?)
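If you set the DS field yourself rather than relying on the ISP, the mangle table can do it; a sketch using the iptables DSCP target (matching SIP on udp port 5060 is just a hypothetical way to catch voip signaling):
# iptables -t mangle -A POSTROUTING -p udp --dport 5060 -j DSCP --set-dscp-class EF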
DS field:
6 bits; 3+3 class+drop_precedence
Two basic strategies (per-hop behaviors, or PHBs): EF (Expedited Forwarding) and AF (Assured Forwarding).
101 110 "EF", or "Expedited Forwarding": best service. This is supposed to be for "realtime" services like voice.
Assured Forwarding: 3 bits of Class, 3 bits of Drop Precedence
Class:
100 class 4: best
011 class 3
010 class 2
001 class 1
Drop Precedence:
010 low (dropped last)
100 medium
110 high (dropped first)
Main thing: The classes each get PRIORITY service, over best-effort.
What happens if you send more than you've contracted for?
Uses the IPv4 TOS field, widely ignored in the past.
Routers SHOULD implement priority queues for service categories
Basic idea: get your traffic marked for the appropriate class.
Then what?
000 000: current best-effort status
xxx 000: traditional IPv4 precedence
PHBs (Per-Hop Behaviors): implemented by all routers
Only "boundary" routers do traffic policing/shaping/classifying/re-marking
to manage categories (re-marking is really part of shaping/policing)
EF: Expedited Forwarding
basically just higher-priority. Packets should experience low queuing delay.
Maybe not exactly; we may give bulk traffic some guaranteed share
Functionality depends on ensuring that there is not too much EF traffic.
Basically, we control at the boundary the total volume of EF
traffic (eg to a level that cannot saturate the slowest link), so that
we have plenty of capacity for EF traffic. Then we just handle it at a
higher priority.
This is the best service.
EF provides a minimum-rate guarantee.
This can be tricky: if we accept input traffic from many sources, and
have four traffic outlets R1, R2, R3, R4, then we should only accept
enough EF traffic that any one Ri can handle it. But we might go for a more statistical model, if in practice 1/4 of the traffic goes to each Ri.
AF: Assured Forwarding
Simpler than EF, but no guarantee. Traffic totals can be higher.
There is an easy way to send more traffic: it is just marked as "out".
In-out marking: each packet is marked "in" or "out" by the policer.
Actually, we have three precedence levels to use for marking.
From RFC2597:
The drop precedence level of a packet
could be assigned, for example, by using a leaky bucket traffic
policer, which has as its parameters a rate and a size, which is the
sum of two burst values: a committed burst size and an excess burst
size. A packet is assigned low drop precedence if the number of tokens in the bucket is greater than the excess burst size [ie bucket is full], medium drop precedence if the number of tokens in the bucket is greater than zero, but at most the excess burst size, and high drop precedence if the bucket is empty.
Packet mangling can be used to set the DS bits, plus
a goodly number of priority bands for the drop precedences (I'm not
sure how to handle the different subclasses within precedence groups;
they might get classful TBF service)
RIO: RED with In/Out: RED = Random Early Detection
Routers do not reorder!
See Stallings Fig 19.13(b) and # of classes, drop precedence
4 classes x 3 drop priorities
Or else find the Differentiated Services RFC
BGP: Border Gateway Protocol
See notes at bgp.html