Linux iptables


Linux includes support for queuing disciplines, policing, traffic control, reservations, and prioritizing.

Mostly these are handled through the tools documented in LARTC: Linux Advanced Routing & Traffic Control

leaf-node zones: here we can regulate who has what share of bandwidth
Core problem: we likely can't regulate inbound traffic directly, as it's already been sent!

Notes on using tc to implement traffic control

Goal: introduce some notion of "state" to stateless routing

LARTC HOWTO        -- Bert Hubert, et al
    http://lartc.org/howto
   
A Practical Guide to Linux Traffic Control    -- Jason Boxman
borg.uu3.net/traffic_shaping.
    good diagrams
   
Traffic Control HOWTO, v 1.0.2     -- Martin Brown: local copy in pdf format

Policy Routing with Linux, Matthew Marsh (PRwL)

Good sites:

http://linux-ip.net/articles/Traffic-Control-HOWTO/classful-qdiscs.html


http://tldp.org/HOWTO/Adv-Routing-HOWTO/lartc.adv-filter.u32.html:
good article on u32 classifier

Good stuff on "real-world scenarios":
http://www.trekweb.com/~jasonb/articles/traffic_shaping/scenarios.html



The linux packages we'll be looking at include:

iptables: for basic firewall management, including marking packets. Dates from 1998

iproute2: for actually routing packets based on more than their destination address. Also provides the ip command for maintaining all the system network state. Dates from ~2001?

tc: traffic control, for creating various qdiscs and bandwidth limits
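
To see what each of these tools currently has configured, a few inspection commands (eth0 below is just an example interface):

iptables -L -n -v              # filter-table rules, with packet/byte counts
iptables -t mangle -L -n       # mangle-table rules
ip addr show                   # interfaces and their addresses
ip route show                  # the main routing table
tc qdisc show dev eth0         # qdiscs currently attached to eth0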



Queuing Disciplines (qdisc): a qdisc does scheduling, determining how packets are enqueued and dequeued. Some also support shaping/policing.

Some reasons for an application to open multiple TCP connections:
  1. cheating SFQ limits
  2. much improved high-bandwidth performance
  3. the nature of the data naturally divides into multiple connections
       
tc's pfifo_fast qdisc has three priority bands built-in: 0, 1, and 2.
        enqueuing: figure out which band the packet goes into (based on any packet info; eg is it VOIP?)
        dequeuing: take from band 0 if nonempty, else 1 if nonempty, else 2



ipTables

iptables review: can filter traffic, mark/edit headers, and implement NAT. Fundamentally, iptables is a firewall tool.
Its ability to direct traffic (at least without destaddr rewriting) is somewhat limited.

Iptables has 5 builtin chains, representing specific points in packet processing. Chains are lists of rules. The basic predefined chains are PREROUTING, INPUT, FORWARD, OUTPUT, and POSTROUTING (see the diagram below).
You can define your own chains, but they are pretty esoteric unless you're using them as "chain subroutines", called by one of the builtin chains.

Rules contain packet-matching patterns and actions. Typically, a packet traverses a chain until a rule matches; sometimes the corresponding action causes a jump to another chain or a continuation along the same chain, but the most common case is that we're then done with the packet.

Tables are probably better thought of as parts of chains, rather than the other way around; they are in a sense rule targets.

Specifically, in iptables the tables are FILTER, NAT, MANGLE, and RAW.

Targets: ACCEPT, DROP, REJECT, MASQUERADE, REDIRECT, RETURN

The FILTER table is where we would do packet filtering. The MANGLE table is where we would do packet-header rewriting. The MANGLE table has targets TOS, TTL, and MARK.

Obvious application: blocking certain categories of traffic
Not-so-obvious: differential routing, and actually tweaking traffic (with MANGLE; can be done before and after routing)

Here is a diagram from http://ornellas.apanela.com/dokuwiki/pub:firewall_and_adv_routing indicating the relationship of the chains to one another and to routing.

Note that the Local Machine is a sink for all packets entering, and a source for other packets. Packets do not flow through it. The second Routing Decision is for packets created on the local machine which are sent outwards (or possibly back to the local machine).

iptables packet flow

       Incoming
        Traffic
           |
           V
    +------------+
    | PREROUTING |
    |    raw     |
    |   mangle   |
    |    nat     |
    +------------+
           |
        Routing
      +- Decision -+
      |            |
      V            V
    Local        Remote
 Destination  Destination
      |            |
      V            V
  +--------+  +---------+
  | INPUT  |  | FORWARD |
  | mangle |  | mangle  |
  | filter |  | filter  |
  +--------+  +---------+
      |            |
      V            |
    Local          |
   Machine         |
      |            |
      V            |
   Routing         |
   Decision        |
      |            |
      V            |
  +--------+       |
  | OUTPUT |       |
  |  raw   |       |
  | mangle |       |
  |  nat   |       |
  | filter |       |
  +--------+       |
      |            |
      +-----+------+
            |
            V
     +-------------+
     | POSTROUTING |
     |   mangle    |
     |     nat     |
     +-------------+
            |
            V
        Outgoing
        Traffic

(Traffic leaving POSTROUTING that is addressed back to the local machine ("Local Traffic" in the original diagram) re-enters the flow at PREROUTING, via the loopback interface.)

iptables: netfilter.org
iproute2: policyrouting.org

tables in iptables:


From the iptables man page:

filter This is the default table. It contains the built-in chains INPUT (for packets coming into the box itself), FORWARD (for packets being routed through the box), and OUTPUT (for locally-generated packets).
     
pld: generally, users add things to the forward chain. If the box is acting as a router, that's the only one that makes sense.

nat    This table is consulted when a packet that creates a new connection is encountered. It consists of three built-ins: PREROUTING (for altering packets as soon as they come in), OUTPUT (for altering locally-generated packets before routing), and POSTROUTING (for altering packets as they are about to go out).

pld: The NAT table is very specific: it's there for implementing network address translation. Note that the kernel must keep track of the TCP state of every connection it has seen, and also at least something about UDP state. For UDP, the kernel pretty much has to guess when the connection is ended. Even for TCP, if the connection was between hosts A and B, and host A was turned off, and host B eventually just timed out and deleted the connection (as most servers do, though it isn't really in the TCP spec), then the NAT router won't know this.

Part of NAT is to reuse the same port, if it is available; port translation is only done when another host inside NAT-world is already using that port.
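
The connection-tracking table itself can be inspected; which of the following is available depends on the kernel version and on whether the conntrack-tools package is installed:

cat /proc/net/nf_conntrack     # newer kernels
cat /proc/net/ip_conntrack     # older kernels
conntrack -L                   # conntrack-tools utility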
     
     
mangle This table is used for specialized packet alteration. It has two built-in chains: PREROUTING (for altering incoming packets before routing) and OUTPUT (for altering locally-generated packets before routing).
     
pld: classic uses include tweaking the Type-Of-Service (TOS) bits. Note that it's actually kind of hard to tell if an ssh connection is interactive or bulk; see the example from Boxman below.
     
A second application is to set the fw_mark value based on fields the iproute2 RPDB (Routing Policy DataBase) cannot otherwise see. (RPDB can see the fw_mark). This is often used as an alternative to "tc filter".
     
An extension of this is the CLASSIFY option:
        iptables -t mangle -A POSTROUTING -o eth2 -p tcp --sport 80 -j CLASSIFY  --set-class 1:10

The CLASSIFY option is used with the tc (advanced queuing) package; it allows us to place packets in a given tc queue by using iptables.
  

Examples

These examples are from http://netfilter.org/documentation/HOWTO//packet-filtering-HOWTO.html.

Here is an example of how to block responses to pings to this host:
 iptables -A INPUT  -p icmp -j DROP
A more specific version is

    iptables -A INPUT -p icmp --icmp-type echo-request -j DROP

To remove:  iptables --delete INPUT 1 (where 1 is the rule number), or just change -A to -D above and leave everything else the same.

The icmp-type options can be obtained with the command iptables -p icmp --help.

Demo on linux1. The idea is that we are zapping all icmp packets as they arrive in the INPUT chain.

We are appending (-A) to the INPUT chain; the protocol (-p) is icmp, and matching packets jump (-j) to the DROP target. (Since the inbound echo requests are dropped, this host never generates responses.)

The above rule is in the INPUT chain because we are blocking pings to this host. If we want to block pings through this host, we add the rule to the FORWARD chain:

    iptables -A FORWARD -p icmp --icmp-type echo-request -j DROP
    iptables -D FORWARD -p icmp --icmp-type echo-request -j DROP




Here is an example of how to block all traffic to port 80 on this host:

iptables -A INPUT -p tcp --dport 80 -j DROP

The option --sport is also available, as is --tcp-flags. Also, -p (protocol) works for icmp, udp, all

Here is how to allow inbound TCP traffic only to ports 80 and 22 (plus 31337, for the demo):

iptables -A INPUT -p tcp --dport 80 -j ACCEPT
iptables -A INPUT -p tcp --dport 22 -j ACCEPT
iptables -A INPUT -p tcp --dport 31337 -j ACCEPT
iptables -A INPUT -p tcp -j DROP

You can also set the following, so the last line above doesn't have to be there (and, in particular, doesn't have to be last). This makes it easier to add new or temporary inbound exceptions.

iptables --policy INPUT DROP

On my home router, I have the command blockhost, that does the following:

iptables --table filter --append FORWARD --source $HOST --protocol tcp --destination-port 80 --jump DROP
iptables --table filter --append FORWARD --destination $HOST --protocol tcp --source-port 80 --jump DROP

Note that the router lies between the host and the outside world; that is, I must use the FORWARD chain. Also, I block outbound traffic to port 80, and also inbound traffic from port 80. (I also have a command to block all traffic, when necessary.)



Here is a set of commands to block all inbound tcp connections on interface ppp0.

First, we create a new chain named block, so that we can put rules on one place to apply to both INPUT and FORWARD chains.
## Create chain which blocks new connections, except if coming from inside.
# iptables -N block

## add rules to the block chain
# iptables -A block -m state --state ESTABLISHED,RELATED -j ACCEPT
# iptables -A block -m state --state NEW -i ! ppp0 -j ACCEPT
# iptables -A block -j DROP

## Jump to that chain from INPUT and FORWARD chains.
# iptables -A INPUT -j block
# iptables -A FORWARD -j block
The -j option means to jump to the block chain, but then return if nothing matches. However, as the last rule always matches, this doesn't actually happen.

The interface is specified with -i; the second block entry states that the interface is anything not ppp0.

The -m means to load the specified "matching" module; -m state --state NEW means that we are loading the connection-state matcher, and that we want to match packets starting a NEW connection.



What if we want to block traffic of a particular user? We have the iptables owner module. It applies only to locally generated packets. If we're throttling on the same machine that the user is using, we can use this module directly.

If the blocking (or throttling) needs to be done on a router different from the user machine, then we need a two-step approach. First, we can use this module to mangle the packets in some way (eg set some otherwise-unused header bits, or forward the packet down a tunnel). Then, at the router, we restore the packets and send them into the appropriate queue.
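
A minimal sketch of both cases (the user name joe and the mark/DSCP values are made up):

# same machine: classify joe's locally generated packets by fwmark
iptables -t mangle -A OUTPUT -m owner --uid-owner joe -j MARK --set-mark 2

# separate router: the fwmark is not transmitted, so tag something that
# survives transmission, eg the DSCP field, and classify on that at the router
iptables -t mangle -A OUTPUT -m owner --uid-owner joe -j DSCP --set-dscp 10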

Iptables can also base decisions on the TCP connection state, using the state module and the --state state option, where state is a comma-separated list of the connection states to match. Possible states are:
    INVALID: the packet could not be identified for some reason, which includes running out of memory and ICMP errors which don't correspond to any known connection
    ESTABLISHED: the packet is associated with a connection which has seen packets in both directions
    NEW: the packet has started a new connection, or is otherwise associated with a connection which has not seen packets in both directions
    RELATED: the packet is starting a new connection, but is associated with an existing connection, such as an FTP data transfer, or an ICMP error
[from man iptables]

building firewalls w iptables/iproute2:
The ip command can also build a simple packet-filter firewall: for example, only allow packets to server ports 25 and 80.

Potential inconsistencies:
    traffic to port 21 gets "ICMP message 1"
    traffic to port 53 gets "ICMP message 2"
    traffic to port 70 gets blackholed
But in practice this is not such an issue.

Here are ulam3's iptables entries for enabling NAT. Ethernet interface eth0 is the "internal" interface; eth1 connects to the outside world. In the NAT setting, the internal tables are in charge of keeping track of connection mapping; each outbound connection from the inside (any host/port) is assigned a unique port on the iptables host.

iptables --table nat --append POSTROUTING --out-interface eth1 -j MASQUERADE

# the next entry is for the "PRIVATE NETWORK" interface
iptables --append FORWARD --in-interface eth0 -j ACCEPT

echo 1 > /proc/sys/net/ipv4/ip_forward

mangling

IpTables is partly a firewall tool, but NAT is really something quite different from a simple firewall. IpTables has one other important non-firewall feature: packet "mangling". Packets can be marked or re-written in several ways. Here is one simple example:

iptables -t mangle -A PREROUTING -i eth2 -j MARK --set-mark 1

Internally, packets have an associated "firewall mark" or fwmark; the command above sets this mark for packets arriving via interface eth2. The fwmark is not physically part of the packet, and is never transmitted, but it is associated with the packet for its lifetime within the system.

Demo on linux1 and linux2

Step 1: start tcp_writer on the host system, tcp_reader on linux2

Step 2: block the traffic with
    iptables --table filter --append FORWARD --destination linux2 --protocol tcp --source-port 5431 --jump DROP

And then unblock with

iptables --table filter --delete FORWARD --destination linux2 --protocol tcp --source-port 5431 --jump DROP
or
iptables --table filter --delete FORWARD 1

Note that putting this rule in the INPUT or OUTPUT chains has no effect. We could use the POSTROUTING chain, but it doesn't support the FILTER table.


 

iproute2

ipRoute2 has some functional overlap with iptables, but it is fundamentally for general routing, not firewalls (though note that firewalls are a special-purpose form of routing).

iproute2 features support for policy routing: routing based on more than just the destination address.
A typical iproute2 triad involves:

    Address: a CIDRed IP address (eg 147.126.2.0/23)
    Next Hop: defines how to get to the address
    Rule: defines when the above applies


Classically, routing involves only the first two elements: given a destination, we route to a given NextHop. (Sometimes the Type_of_Service bits are included in the routing destination too.) iproute2 allows us to introduce rules that take into account other fields as well: the source address, the incoming interface, the TOS bits, and the fwmark (see the RPDB discussion below).
The central idea of iproute2 is to have multiple classic ⟨dest,next_hop⟩ tables, with the desired table selected by the Rule portion.
  
The ip command
The old approach was to have interfaces with one "primary" ip addr and then several "coloned" subinterfaces:
        eth0    192.1.2.3
        eth0:1    10.3.4.5
        eth0:2    200.9.10.11
All address assignments are on equal footing now.
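
For comparison, the same three addresses might be assigned with the ip command like this (the /24 prefix lengths are assumptions):

ip addr add 192.1.2.3/24 dev eth0
ip addr add 10.3.4.5/24 dev eth0
ip addr add 200.9.10.11/24 dev eth0
ip addr show dev eth0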

iproute2 also has multiple tables, but these are rather different from the tables of iptables: the tables of iproute2 are actual routing tables.

Basics of the ip command:
ip link show|list    list interfaces
ip address show|list    show ip addresses
ip route show|list
ip rule show|list     not on ulam2, but does work on ulam3
0:    from all lookup local
32766:    from all lookup main
32767:    from all lookup default

The above is a list of rules determining which table is used. The tables themselves are numbered from 0 to 255; local =255, main=254, and default=253.

Policy routing introduces additional routing tables, one per rule. Local, main, and default are the three standard tables.
The default rules here are to use each table for all lookups.

Classical routing uses local and main. The local table is for high-priority routes for localhost/loopback, and also broadcast and multicast addresses; it holds the broadcast and loopback routes. (Examine ulam3's copy with ip route list table local.)

main:  classical table
ulam3:
10.213.119.0/24 dev eth0  proto kernel  scope link  src 10.213.119.254
10.38.2.0/24 dev eth1  proto kernel  scope link  src 10.38.2.42
default via 10.38.2.1 dev eth1

default: normally empty. Despite the name, this has nothing to do with the traditional concept of a default route.

rules: linearly ordered. Rules can access the source address, the destination address, the incoming interface, the TOS bits, and the fwmark.
The match rule for each of the tables above is to "match everything". Thus, the local table is consulted first. For nonlocal traffic that table will not yield a route, and so the main table is consulted. Usually the main table has a "default" route that will always match.

Typically a match identifies a (conventional) routing table, which then uses the destaddr to make the final selection. This is the "unicast" action; the use of the table is because destaddr is the packet field used most intimately. Also, within a table, CIDR longest-match is used.

Other actions can include prohibit, unreachable, blackhole, and throw (see the list of route types under Example 1 below).

Note that the local rule always fires, but in MOST cases the local table has no matching route, so the result is "continue along the rules chain".
       


From http://linux-ip.net/html/routing-tables.html, emphasis added:

The multiple routing table system provides a flexible infrastructure on top of which to implement policy routing. By allowing multiple traditional routing tables (keyed primarily to destination address) to be combined with the routing policy database (RPDB) (keyed primarily to source address), the kernel supports a well-known and well-understood interface while simultaneously expanding and extending its routing capabilities. Each routing table still operates in the traditional and expected fashion. Linux simply allows you to choose from a number of routing tables, and to traverse routing tables in a user-definable sequence until a matching route is found.


Here is an example combining iptables and iproute2 that will  allow special routing for all packets arriving on interface eth2 (from http://ornellas.apanela.com/dokuwiki/pub:firewall_and_adv_routing):

iptables: this command "marks" all packets arriving on eth2
iptables -t mangle -A PREROUTING -i eth2 -j MARK --set-mark 1
Now we issue this command to create the rule:

    ip rule add from all fwmark 0x1 lookup 33

Here is the modified iproute2 ruleset, where a special table 33 has been created (together with rule 32765). See below, Example 1, for more details on how this table should be created.

# ip rule list
0: from all lookup local
32765: from all fwmark 0x1 lookup 33
32766: from all lookup main
32767: from all lookup default
Note that the rules for iproute2 don't (or at least didn't; this may have changed) allow us to check the arriving interface directly, hence the use of iptables, the fwmark, and packet mangling.

Note also that we haven't created table 33 yet (see below).


 visit /etc/iproute2:
rt_tables, etc

1. Add an entry to rt_tables:
    100   foo
2. ip rule list:
    doesn't show it
3. ip rule add fwmark 1 table foo
Now: ip rule list shows
0:    from all lookup local
32765:    from all fwmark 0x1 lookup foo
32766:    from all lookup main
32767:    from all lookup default
The rule number is assigned automatically.
Cleanup: ip rule del fwmark 1
If there's a way to delete by number, I don't know it.



Example 1: simple source routing (http://lartc.org/howto/lartc.rpdb.html), Hubert

Suppose one of my housemates only visits hotmail and wants to pay less. This is fine with me, but he'll end up using the low-end cable modem.

fast link: local router is 212.64.94.251
slow link: local end is 212.64.78.148;  local_end <--link--> 195.96.98.253
(The 212.64.0.0/16 are the local addresses)
user JOHN has ip addr 10.0.0.10 on the local subnet 10.0.0.0/8


Step 1: Create in /etc/iproute2/rt_tables a line
    200 JOHN_TABLE
This makes JOHN_TABLE a synonym for 200. /etc/iproute2/rt_tables contains a number of <num, tablename> pairs.

Step 2: have John's host use JOHN_TABLE:

# ip rule add from 10.0.0.10 lookup JOHN_TABLE


output of "ip rule list"
0:    from all lookup local
32765:    from 10.0.0.10 lookup JOHN_TABLE
32766:    from all lookup main
32767:    from all lookup default

Step 3: create JOHN_TABLE
main: default outbound route is the fast link
JOHN_TABLE: default = slow link
    ip route add default via 195.96.98.253 dev ppp2 table JOHN_TABLE

This is a standard "unicast" route.
Other options:
    unreachable
    blackhole
    prohibit
    local (route back to this host; cf the "local" policy-routing table)
    broadcast
    throw: terminate table search; go on to next policy-routing table
    nat
    via    (used above)

for table_id in  main local default
do
    ip route show table $table_id
done

Note that what we REALLY want is to limit John's bandwidth, even if we have a single outbound link that John shares with everyone. We'll see how to do this with tc/tbf below.
    


Example 2
from: http://lartc.org/howto/lartc.netfilter.html (Hubert, ch 11)

iptables: allows MARKING packet headers (this is the fwmark).
Marking packets destined for port 25:

# iptables -A PREROUTING -i eth0 -t mangle -p tcp --dport 25  -j MARK --set-mark 1

Let's say that we have multiple connections, one that is fast (and expensive) and one that is slower. We would most certainly like outgoing mail to go via the cheap route.

We've already marked the packets with a '1', we now instruct the routing policy database to act on this:

# echo 201 mail.out >> /etc/iproute2/rt_tables
# ip rule add fwmark 1 table mail.out
# ip rule ls
0:    from all lookup local
32764:    from all fwmark        1 lookup mail.out
32766:    from all lookup main
32767:    from all lookup default

Now we generate a route to the slow but cheap link in the mail.out table:

# ip route add default via 195.96.98.253 dev ppp0 table mail.out




Example 3: special subnet:

Subnet S of site A has its own link to the outside world.
This is easy when this link attachment point (call it RS) is  on S: other machines on S just need to be told that RS is their router.

However, what if the topology is like this:

    S----R1---rest of A ----GR----link1---INTERNET
                              \___link2---INTERNET
                             
How does GR (for Gateway Router) route via link1 for most traffic, but link2 for traffic originating in S?
Use matching on source subnet to route via second link!

We could match on the source subnet directly with an ip rule (as in Example 1), provided S's addresses are visible at GR; otherwise we're probably going to have to set the fwmark. A sketch of the direct approach follows.
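
Here is a sketch of the GR configuration for the direct approach, assuming S = 10.1.1.0/24 and that link2's next hop is 203.0.113.1, reached via eth3 (all three values are made up):

echo 210 link2tbl >> /etc/iproute2/rt_tables
ip rule add from 10.1.1.0/24 lookup link2tbl
ip route add default via 203.0.113.1 dev eth3 table link2tbl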




LARTC section 4.2 on two outbound links.
PRwL chapter 5.2 is probably better here.



Examples from Policy Routing with Linux

Example 5.2.2: Basic Router Filters

                  coreRouter
                       |
      Actg_Router------+-----EngrRouter----192.168.2.0/24
      172.17.0.0/16    |
                       |
                 corp backbone 10.0.0.0/8
     
Now we configure so most 10/8 traffic can't enter the side networks.
Accounting: admin denial except for 10.2.3.32/27 and 10.3.2.0/27.
Engineering test network accessible from 10.10.0.0/14

From accounting network - 172.17.0.0/16
Rules for inbound traffic
    10.2.3.32/27   -   full route
    10.3.2.0/27    -   full route
    10/8           -   prohibit      (block everyone else; note longest-match)
    172.16/16      -   prohibit      (explained in more detail in 5.2.1)

From Engineering test network - 192.168.2.0/24
    10.10/14       -   full route    (special subnet granted access)
    10/8           -   blackhole     (zero access to corporate backbone)
    172.17/16      -   blackhole     (zero access to accounting)
    172.16/16      -   blackhole

Possible configuration for EngrRouter:

ip addr add 10.254.254.253/32 dev eth0 brd 10.255.255.255
ip route add 10.10/14 scope link proto kernel dev eth0 src 10.254.254.253

ip route add blackhole 10/8
ip route add blackhole 172.17/16
ip route add blackhole 172.16/16

 

Possible configuration for Actg_Router

ip route add 10.2.3.32/27 scope link proto kernel dev eth0 src 10.254.254.252
ip route add 10.3.2.0/27 scope link proto kernel dev eth0 src 10.254.254.252
ip route add prohibit 10/8
ip route add prohibit 172.16/16
ip route add prohibit 192.168.2/24


See also http://linux-ip.net/html/routing-tables.html.


Examples from Jason Boxman, A Traffic Control Journey: Real World Scenarios

8.1.1: setting TOS flags for OpenSSH connections. Normal ssh use is interactive, and the TOS settings are thus for Minimize-Delay. However, ssh tunnels are not meant for interactive use, and we probably want to reset their TOS flags to Maximize-Throughput. If we have both tunnels and interactive connections, then without this adjustment our router will probably pretty much bring interactive ssh traffic to a halt while the bulk traffic proceeds.
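
One possible approach (a sketch, not Boxman's exact rules): leave interactive ssh at Minimize-Delay, but reclassify large ssh packets, which are almost certainly bulk, as Maximize-Throughput; the 1024-byte cutoff is an arbitrary choice:

# upload direction (large packets heading toward an ssh server)
iptables -t mangle -A PREROUTING -p tcp --dport 22 -m length --length 1024:65535 \
    -j TOS --set-tos Maximize-Throughput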

This is a good example of the --limit and --limit-burst options. Note the -m limit module-loading option that precedes them. See also hashlimit in man iptables.
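
As a generic illustration of the syntax (again not Boxman's rules), the following would accept new inbound ssh connections at a sustained rate of 5 per minute, with a burst of 10, and drop the excess:

iptables -A INPUT -p tcp --dport 22 -m state --state NEW \
    -m limit --limit 5/minute --limit-burst 10 -j ACCEPT
iptables -A INPUT -p tcp --dport 22 -m state --state NEW -j DROP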

Other examples in this section involve tc, and we'll look at them soon.



"ip route" v "ip rule"
To tweak a table: "ip route ..."
To administer the routing-policy database (RPDB): "ip rule ..."


The linux iproute and tc packages allow us to manage the traffic rather than the devices. Why do we want to do this??


RPDB rules look at <srcaddr, dstaddr, in_interface, tos, fw_mark>
These rules can't look at anything else! BUT: the fw_mark field (a "virtual" field not really in the packet) can be set with other things outside of RPDB (like iptables).

    Marking packets destined for port 25:
    table: mangle; chain: PREROUTING

    # iptables -A PREROUTING -i eth0 -t mangle -p tcp --dport 25 -j MARK --set-mark 1

    # echo 201 mail.out >> /etc/iproute2/rt_tables
    # ip rule add fwmark 1 table mail.out     ;; routes on mark set above!

From ip man page:

   1. Priority: 0, Selector: match anything, Action: lookup routing table local (ID 255). The local table is a special routing table containing high priority control routes for local and broadcast addresses.

      Rule 0 is special. It cannot be deleted or overridden.


   2. Priority: 32766, Selector: match anything, Action: lookup routing table main (ID 254). The main table is the normal routing table containing all non-policy routes. This rule may be deleted and/or overridden with other ones by the administrator.


   3. Priority: 32767, Selector: match anything, Action: lookup routing table default (ID 253). The default table is empty. It is reserved for some post-processing if no previous default rules selected the packet. This rule may also be deleted.

Warning: table main is updated by routing protocols (RIP, EIGRP, etc); other tables are not. If a routing change occurs, other tables (and traffic that uses them) may be out of luck.






Traffic shaping and traffic control

Generally there's not much point in doing shaping from the bottleneck link into a faster link. The bottleneck link has done all the shaping already!

fair queuing

Restricts bandwidth when there is competition,  but allows full use when network is idle. A user's bandwidth share is capped only when the link is busy!

token bucket

Restricts bandwidth to a fixed rate, period (but also allows for burstiness as per the bucket, which can always be small)
        
tc command:

shaping: output rate limiting (delay or drop)
scheduling: prioritizing. Classic application: sending voip ahead of bulk traffic
policing: input rate regulation
dropping: what gets done to nonconforming traffic

Two scenarios to restrict user/host Joe:
    1. Reduce the absolute bandwidth (in/out?) available to Joe. If the link is otherwise idle, Joe is still capped.
    2. Guarantee non-Joe users a min share; ie cap Joe's bandwidth only when the link is busy.

qdisc: queuing discipline
You can think of this as the TYPE of queue. Examples: fifo, fifo+taildrop, fifo+randomdrop, fair_queuing, RED, tbf

queuing disciplines are applied to INTERFACES, using the tc command.

Queuing disciplines can be "classless" or "classful" (hierarchical)

Queuing Disciplines (qdisc): do scheduling; they determine how packets are enqueued and dequeued. Some also support shaping/policing.

Basic "classless" qdiscs

pfifo_fast

(see man pfifo_fast): three-band FIFO queue

Consider the following iptables command, to set the TOS bits on outbound ssh traffic to "Minimize-Delay":

# iptables -A OUTPUT -t mangle -p tcp --dport 22 -j TOS --set-tos Minimize-Delay
 
This works with pfifo_fast, which provides three bands. Band selection by default is done using TOS bits of packet header (which you probably have to mangle to set). See Hubert, §9.2.1.1, for a table of the TOS-to-band map.

Dequeuing algorithm, typically invoked whenever the hardware is ready for a packet (or whenever the qdisc reports to the hardware that it has a packet):

    if there are any packets in band 0, dequeue the first one and send it
    else if there are any packets in band 1, dequeue the first one and send it
    else if there are any packets in band 2, dequeue the first one and send it
    else report no packets available

Note that in a very direct sense pfifo_fast does support three "classes" of traffic. However, it is not considered to be classful, since the bands are not configurable: you cannot attach your own qdiscs or filters to them with tc.

Example: queue flooding on upload

In practice, it is very important to set interactive traffic to have a higher priority than normal traffic (eg web browsing, file downloads). However, you don't have much control of the downlink traffic, and if the uplink queue is on the ISP's hardware (eg their cablemodem), then you won't have much control of the upload side.

    me------<fast>------[broadband gizmo]-----------<slow>-----...internet
   
In the scenario above, suppose the broadband gizmo, BG, has a queue capacity of 30K (30 packets?). A bulk UPLOAD (though not download) will fill BG's queue, no matter what you do at your end with pfifo_fast. This means that every interactive packet will now wait behind a 30KB queue to get up into the internet. As the upload packets are sent, they will be ACKed and then your machine will replenish BG's queue.

One approach is to reduce BG's queue. But this may not be possible.

Here's another approach:

me---<fast>---[tq_router]---<slow2>---[broadband gizmo]---<slow>--internet

Make sure slow2 ~ slow. Then upload will fill tq_router's queue, but fast traffic can still bypass.

Logically, can have "me" == "tq_router"
_______________________

Token bucket filter (tbf)

See man tbf or man tc-tbf

restrict flow to a set average rate, while allowing bursts. The tbf qdisc slows down excessive bursts to meet filter limits. This is shape-only, no scheduling.

tbf (or htb) is probably the preferred way of implementing bandwidth caps.
   
tokens are put into a bucket at a set rate. If a packet arrives:
    tokens available: send immediately and decrement the bucket
    no token: drop (or wait, below)
   
Over the long term, your transmission rate will equal the token rate.
Over the short term, you can transmit bursts of up to the bucket size.

Parameters:
    bucketsize (burst): how large the bucket can be
    rate: rate at which tokens are put in
   
    limit: number of bytes that can wait for tokens. All packets wait, in essence; if you set this to zero then the throughput is zero. While theoretically it makes sense to set this to zero, in practice that appears to trigger serious clock-granularity problems.

    latency: express limit in time units
   
    mpu: minimum packet unit; size of an "empty" packet

Granularity issue: buckets are typically updated every 10 ms.
During 10 ms, on a 100Mbps link, 1 Mbit (about 125KB, or ~100 large packets) can accumulate! Generally, tbf introduces no burstiness itself up to 1mbit (a 10 kbit packet takes 10 ms on a 1mbit link). Beyond that, a steady stream of packets may "bunch up" due to the every-10-ms bucket refill.

Use of tbf for bulk traffic into a modem (rate 220kbit assumes the actual bandwidth is about 250kbit; burst 1540 is one packet):

# tc qdisc add      \
    dev ppp0        \
    root            \
    tbf             \
    rate 220kbit    \
    latency 50ms    \
    burst 1540

You want packets queued at your end, NOT within your modem!
Otherwise you will have no way to use pfifo_fast to have interactive traffic leapfrog the queue.

What you REALLY want is for TBF to apply only to the bulk bands of PFIFO_FAST.
Can't do this with TBF; that would be classFUL traffic control (though we can do that with PRIO, the classful analogue of pfifo_fast)

This can't be used to limit a subset of the traffic!

TBF is not WORK-CONSERVING: the link can be idle and someone can have a packet ready to send, and yet it still waits.



tbf demo: linux1, linux2, and tc

Start tcp_writer in ~/networks/java/tcp_reader (or start it on linux2). This accepts connections to port 5431, and then sends them data as fast as it can. Data flows from the writer to the reader. Linux1 has eth0 facing the host system and eth1 facing linux2; tbf must be applied at the interface the data stream exits via.

The write size is 1024 bytes, but TCP re-packages the data once it starts queuing up, so the actual write size is typically 1448 bytes plus 14+20+32 = 66 bytes header, for 1514 bytes in all. The header adds an additional 4.6%.

tbf1:
tc qdisc add dev eth1 root tbf rate $BW burst $BURST limit $LIMIT
tc qdisc add dev eth0 root tbf rate 1mbps burst 100kb limit 200kb

For mininet, try

tc qdisc add dev r1-eth1 root tbf rate 4000kbit burst 100kb limit 200kb

Note: "mbps" is megaByte/sec. For a bit rate of 1 megabit/sec, use "1mbit"

also try:

    tc qdisc change dev eth1 root tbf rate newrate burst newburst limit newlimit

This might cause a spike in kbit/sec numbers:

479
479
159
4294
564
564
...

clear with tc qdisc del dev eth1 root

tc_stats
tc -s qdisc show dev eth0
tc -s class show dev eth0
tc -d qdisc show dev eth0
tc -d class show dev eth0

demo: enable rate 1000 kbit/sec (1 kbit/ms), burst 20kb, limit 100kb. Then try limit = 5kb v 6kb.
At the given rate, 1kb takes 8ms. The bucket is replenished every 10ms (hardware clock limitation), so a burst should not be consumed much during a 10ms interval.

At 10mbit, 10ms is 12kb, and we won't get that rate unless burst is set to about that.


Fair Queuing

   
Note that the FQ clock can be reset to zero whenever all queues are empty, and can in fact just stay at zero until something arrives.
   
    Linux "sfq":
    Flows are individual tcp connections
        NOT hosts, or subnets, etc! Can't get this?!
    Each flow is hashed by srcaddr,destaddr,port.
    Each bucket is considered a separate input "pseudoqueue"
    Collisions result in unfairness, so the hash function is altered at regular intervals to minimize this.
   
What we really want is a way to define flow_ids, so they can be created dynamically, and connections can be assigned to a flow_id by host, by user, or by whatever criterion we choose.
   
    sfq is schedule-only, no shaping
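
Attaching sfq is a one-liner; perturb 10 rehashes the flows every 10 seconds, to limit the damage from hash collisions:

tc qdisc add dev eth0 root sfq perturb 10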
   
What you probably want is to apply this to DOWNLOADs.
tc doesn't do that if your linux box is tied directly to your broadband gizmo. Applied to downloads at a router,

    joe,mary,alice---<---1--------[router R]------<---2----internet

it would mean that each of joe,mary,alice's connections would get 1/3 the bw, BUT that would be 1/3 of the internal link, link 1.

Regulating shares of link 2 would have to be done upstream.

If we know that link 2 has a bandwidth of 3 Mbps, we can use CBQ (below) to restrict each of joe, mary, and alice to 1 Mbps, by controlling the outbound queue at R into link 1.

Further sharing considerations:

If we divide by 1 flow = 1 tcp connection, joe can double throughput by adding a second tcp connection.

If we divide by 1 flow = 1 host, we do a little better at achieving per-user fairness, assuming 1 user = 1 machine. Linux sfq does not support this.

Linux sfq creates multiple virtual queues. It's important to realize that there is only one physical queue; if one sender dominates that queue by keeping it full, to the point that the other connections get less bandwidth than their FQ share, then the later division into LOGICAL queues won't help the underdogs much.

sfq IS work-conserving


RED


Generally intended for internet routers
We drop packets at random (at a very low rate) when queue capacity reaches a preset limit (eg 50% of max), to signal tcp senders to slow down.
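
A sketch of the tc syntax; the numbers are illustrative only, not tuned (limit/min/max are queue sizes in bytes, avpkt is the assumed average packet size):

tc qdisc add dev eth0 root red limit 400000 min 30000 max 90000 \
    avpkt 1000 burst 55 probability 0.02 bandwidth 10mbit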




classful qdiscs

CBQ, HTB, PRIO

Disneyland example: what is the purpose of having one queue feed into another queue?

However, under tc we can have these classful qdiscs form a tree, possibly of considerable depth.

LARTC 9.3: (http://lartc.org/howto/lartc.qdisc.advice.html)




Basic terminology for classful qdiscs
classes form a tree
each leaf class has a qdisc attached (a fifo by default)
At each interior node, there is a CLASSIFIER algorithm (a filter) and a set of child class nodes.

Linux: at router input (ingress), we can apply POLICING to drop packets; at egress, we apply SHAPING to put packets in the right queue. The terminology derives from the fact that there is no ingress queue in linux (or most systems).
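
A sketch of ingress policing (rate and burst values made up): attach the special ingress qdisc, then a filter whose police action drops everything above the rate:

tc qdisc add dev eth0 handle ffff: ingress
tc filter add dev eth0 parent ffff: protocol ip u32 \
    match u32 0 0 police rate 1mbit burst 100k drop flowid :1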

Classful Queuing Disciplines

CBQ, an acronym for 'class-based queuing', is the best known.
It is not, however, the only classful queuing discipline. And it is rather baroque.

PRIO: divides into classes :1, :2, :3 (user-configurable this time)
dequeuing: take packet from :1 if avail; if not then go on to :2, etc

by default, packets go into the band they would go into in PFIFO_FAST, using TOS bits. But you can attach tc filters to adjust this.

Hierarchy of PRIO queues is equivalent to the "flattened" PRIO queue.
However, a hierarchy of PRIO queues with SFQ/TBF offspring is not "flattenable".



TBF: classic rate-based shaper; packets wait for their token
(policing version: you get dropped if you arrive before your token)


For a classless qdisc, we're done once we create it
(its parent might be nonroot, though).

For a classful qdisc, we add CLASSES to it.
Each class will then in turn have a qdisc added to it.

parent of a class is either a qdisc or a class of same type

class major numbers must match parent.

qdisc major numbers are new

each class needs to have something below it, although every class gets a fifo qdisc by default.

We then attach a sub-qdisc to each subclass.



LARTC example:

Hubert example in 9.5.3.2 has a prio qdisc at the root.
The subclasses 1:1, 1:2, and 1:3 are automatic for prio queuing, as is the filtering.

           1:   root qdisc
         /  |  \
        /   |   \
       /    |    \
     1:1   1:2   1:3    classes, added automatically
      |     |     |
     10:   20:   30:    qdiscs
     sfq   tbf   sfq
band  0     1     2

Bulk traffic will go to 30:, interactive traffic to 20: or 10:.

Command lines for adding prio queue

    # tc qdisc add dev eth0 root handle 1: prio 

This automatically creates classes 1:1, 1:2, 1:3. We could say
    tc qdisc add dev eth0 root handle 2: prio bands 5
to get bands 2:1, 2:2, 2:3, 2:4, and 2:5. Then zap with
    tc qdisc del dev eth0 root
 
But suppose we stick with the three bands, and add:
# tc qdisc add dev eth0 parent 1:1 handle 10: sfq    // prob should be tbf too
# tc qdisc add dev eth0 parent 1:2 handle 20: tbf rate 20kbit buffer 1600 limit 3000
# tc qdisc add dev eth0 parent 1:3 handle 30: sfq                               

We now get a somewhat more complex example.



Hierarchical token bucket (htb)

This is a classful version of tbf. Note that because we may now have sibling classes, we have an important sharing feature: each sibling class is allocated the minimum of what it requests and its assigned share; that is, each class is guaranteed a minimum share (like fair queuing). However, the fairness of the division in the absence of traffic from some nodes is limited by the granularity of quantum round-robin.

pld home example

Commands to create an htb qdisc with three child classes:

   1. Hosts 10.0.0.1  and 10.0.0.2 go to class :10
   2. subnet 10.0.0.0/29 (10.0.0.0 - 10.0.0.7 except the above two!)  goes to class :29
   3. All other traffic goes to class :100

The qdisc is placed on the interior interface of the router, to regulate inbound traffic.

We suppose that this is to control flow over a link that has a sustained bandwidth limit of BW, a bucket size of BURST, and a peak bandwidth of PBW. (PBW is not used here.)
BW=56          #kbps
BURST=350      #mb

tc qdisc add dev eth0 root handle 1: htb default 100
tc class add dev eth0 parent 1: classid 1:1 htb rate ${BW}kbps burst ${BURST}mb

# class 10 is limited by parent only
tc class add dev eth0 parent 1:1 classid 1:10 htb rate ${BW}kbps burst ${BURST}mb

# class 29 has same rate, but half the burst
HBURST=$(expr $BURST / 2)
tc class add dev eth0 parent 1:1 classid 1:29 htb rate ${BW}kbps burst ${HBURST}mb

# class 100 has 3/4 the refill rate, too
BW100=$(expr 3 \* $BW / 4)
tc class add dev eth0 parent 1:1 classid 1:100 htb rate ${BW100}kbps burst ${HBURST}mb

tc filter add dev eth0 parent 1:0 protocol ip u32 \
    match ip dst 10.0.0.1/32 flowid 1:10
tc filter add dev eth0 parent 1:0 protocol ip u32 \
    match ip dst 10.0.0.2/32 flowid 1:10
tc filter add dev eth0 parent 1:0 protocol ip u32 \
    match ip dst 10.0.0.0/29 classid 1:29
# no rule for flowid 1:100; taken care of by the default rule


Actually, I can't use this, because tc does not handle bucket sizes so large. So what I do instead is have a packet-sniffer-based usage counter on my (linux) router, which issues tc class change commands to reduce any class that is over-quota. I then use more moderate values for tc itself: I create the classes (I'm up to five now), and each one gets a rate of ~1-10Mbit and a bucket of ~10mbyte. However, the rate is reduced drastically when a class reaches its quota.

# RATE = 1-10 mbit
# BURST = 10 mbyte
# band 1
tc filter add dev $DEV parent 1:0 protocol ip u32 match ip dst 10.0.0.0/30 flowid 1:1
tc filter add dev $DEV parent 1:0 protocol ip u32 match ip dst 10.0.0.2/32 flowid 1:1

# band 2
tc filter add dev $DEV parent 1:0 protocol ip u32 match ip dst 10.0.0.4/30 flowid 1:2

# band 3 : gram
tc filter add dev $DEV parent 1:0 protocol ip u32 match ip dst 10.0.0.10/32 flowid 1:3
tc qdisc  add dev $DEV parent 1:3 tbf rate $RATE burst $BURST limit $LIMIT

# band 4 : upstairs
tc filter add dev $DEV parent 1:0 protocol ip u32 match ip dst 10.0.0.11/32 flowid 1:4
tc qdisc  add dev $DEV parent 1:4 tbf rate $RATE burst $BURST limit $LIMIT

# band 5:
tc filter add dev $DEV parent 1:0 protocol ip u32 match ip dst 10.0.0.0/24 flowid 1:5
tc qdisc  add dev $DEV parent 1:5 tbf rate 32kbit burst 10kb limit 10kb
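
The tc class change command mentioned above has the same syntax as tc class add; a sketch, reusing the class ids from the home-router configuration shown earlier (the actual quota logic lives in the usage-counting script):

# cut class 1:29 to a trickle once it is over quota
tc class change dev eth0 parent 1:1 classid 1:29 htb rate 8kbps burst 100kb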



Note that while iproute2 has some problems fitting into the Internet model of one router table, tc applies just to the egress interfaces and so is completely separate from routing.




The first two examples here, using htb, use the u32 classifier.




Another HTB example

Token bucket is a simple rate-limiting strategy, but the basic version limits everyone in one pool.

Why does hierarchical matter? Strict rate-limiting alone would not need it! (why?)
However, HTB is not "flattenable", because each class shares the parent's bucket with its siblings. Consider

        root
       /    \
   A:b=20   b=30
           /    \
       B:b=20   C:b=20

If B and C are both active, their respective buckets drop to about 15, as each gets half the parent's bucket.

htb lets you apply flow-specific rate limits (eg to specific users/machines)

Rate-limiting can be used to limit inbound bandwidth to set values


above example using htb: LARTC 9.5.5.1

Functionally almost identical to the CBQ sample configuration below:

# tc qdisc add dev eth0 root handle 1: htb default 30
(30 is a reference to the 1:30 class, to be added)

# tc class add dev eth0 parent 1: classid 1:1 htb rate 6mbit burst 15k

Now here are the classes: note "ceil" in the second two. The ceil parameter allows "borrowing": use of idle bandwidth. Parent of classes here must be the class above. Only one class can have root as its parent!
Sort of weird.
         1:
         |
        1:1
      /  |  \
   1:10 1:20 1:30
   web  smtp other
  
# tc class add dev eth0 parent 1:1 classid 1:10 htb rate 5mbit burst 15k
# tc class add dev eth0 parent 1:1 classid 1:20 htb rate 3mbit ceil 6mbit burst 15k
# tc class add dev eth0 parent 1:1 classid 1:30 htb rate 1kbit ceil 6mbit burst 15k

The author then recommends SFQ for beneath these classes (to replace whatever default leaf qdisc is there)

# tc qdisc add dev eth0 parent 1:10 handle 10: sfq perturb 10
# tc qdisc add dev eth0 parent 1:20 handle 20: sfq perturb 10
# tc qdisc add dev eth0 parent 1:30 handle 30: sfq perturb 10

Add the filters which direct traffic to the right classes: here, we divide by web/email/other

# tc filter add dev eth0 protocol ip parent 1:0 prio 1 u32 \
    match ip dport 80 0xffff flowid 1:10
# tc filter add dev eth0 protocol ip parent 1:0 prio 1 u32 \
    match ip sport 25 0xffff flowid 1:20


htb demo

We can set up the traffic flowing from linux2:tcp_writer to valhal:tcp_reader, again

tcp_reader linux2 5431 2001
tcp_reader linux2 5431 2002
tcp_reader linux2 5431 2003

Then we can regulate the three flows using htb:

tc qdisc add dev eth0 root handle 1: htb default 10
tc class add dev eth0 parent 1: classid 1:1 htb rate 5000kbps ceil 5000kbps burst 100kb

tc class add dev eth0 parent 1:1 classid 1:11 htb rate 1000kbps ceil 5000kbps burst 100kb
tc class add dev eth0 parent 1:1 classid 1:12 htb rate 1000kbps ceil 5000kbps burst 100kb
tc class add dev eth0 parent 1:1 classid 1:13 htb rate 1000kbps ceil 5000kbps burst 100kb

tc filter add dev eth0 parent 1:0 protocol ip u32 match ip dport $PORT1 0xffff flowid 1:11
tc filter add dev eth0 parent 1:0 protocol ip u32 match ip dport $PORT2 0xffff flowid 1:12
tc filter add dev eth0 parent 1:0 protocol ip u32 match ip dport $PORT3 0xffff flowid 1:13

Things to try:

1. Change the root rate
2. Make one flow's rate=ceil, and see how the other flows' traffic divides

When do we need htb, and when is plain tbf enough?



2.5: classifying/filtering:

fwmark:
# iptables -A PREROUTING -i eth0 -t mangle -p tcp --dport 25 -j MARK --set-mark 1

mangle table & CLASSIFY
  iptables -t mangle -A POSTROUTING -o eth2 -p tcp --sport 80 -j CLASSIFY --set-class 1:10

u32:
allows matching on bits of packet headers. u32 is completely stateless (that is, it doesn't remember past connection state; it is strictly a per-packet matcher). The underlying matches are all numeric, but there are preset symbolic names for some fields, to help. See u32 examples above
(repeated here:)

    # tc filter add dev eth0 parent 1:0 protocol ip prio 1 u32 match ip sport 80 0xffff flowid 1:3
    # tc filter add dev eth0 parent 1:0 protocol ip prio 1 u32 match ip sport 25 0xffff flowid 1:4
   


3. Example for restricting bandwidth used by a single host (or subnet): From www.roback.cc/howtos/bandwidth.php

uses cbq, which does some approximate calculations to limit the flow of subclasses.

Also uses mangling/fwmark to do classification


DNLD = download bandwidth
UPLD = upload bandwidth
DWEIGHT/UWEIGHT: weighting factors, more or less 1/10 of DNLD/UPLD

tc qdisc add dev eth0 root handle 11: cbq bandwidth 100Mbit avpkt 1000 mpu 64
tc class add dev eth0 parent 11:0 classid 11:1 cbq rate $DNLD weight $DWEIGHT \
    allot 1514 prio 1 avpkt 1000 bounded
tc filter add dev eth0 parent 11:0 protocol ip handle 4 fw flowid 11:1

tc qdisc add dev eth1 root handle 10: cbq bandwidth 10Mbit avpkt 1000 mpu 64
tc class add dev eth1 parent 10:0 classid 10:1 cbq rate $UPLD weight $UWEIGHT \
    allot 1514 prio 1 avpkt 1000 bounded
tc filter add dev eth1 parent 10:0 protocol ip handle 3 fw flowid 10:1

Now MARK the incoming packets, from the designated subnet:

# Mark packets to route
# Upload marking
$IPTABLES -t mangle -A FORWARD -s 192.168.0.128/29 -j MARK --set-mark 3
$IPTABLES -t mangle -A FORWARD -s 192.168.0.6 -j MARK --set-mark 3

# Download marking

$IPTABLES -t mangle -A FORWARD -s ! 192.168.0.0/24 -d 192.168.0.128/29 -j MARK --set-mark 4
$IPTABLES -t mangle -A FORWARD -s ! 192.168.0.0/24 -d 192.168.0.6 -j MARK --set-mark 4