Network Management

Week 9, Mar 28
LT-412, Mon 4:15-6:45 pm


RMON

RMON proper (RMON 1) contains Ethernet-layer information only. Later, RMON 2 added monitoring at the IP and TCP (port) layers.

Basic idea: remote agents do some limited monitoring of subnets. They become "mini-managers", freeing the manager from probing every host on the subnet. Managers can create rows to specify what monitoring is to be done.

Note that implementing RMON is a significant undertaking for an agent: it will have to put its interfaces into promiscuous mode and analyze every packet. Of course, tcpdump and Wireshark do this too.

See RMON-MIB

History group (discussed earlier)

This group has two tables: control & data

The control entry specifies the interface (historyControlDataSource), the bucket count, and the interval. The data table then creates one bucket for each time interval, containing a summary of the usage during that interval. Actual history stats: packet counts, byte counts, error counts of various types, etc.

We create a control entry, and that directs how the data table will be built. Note how the data table is indexed, and how rows get deleted (recycled?) as new buckets are created.

When the manager creates a control row, it supplies the data source (historyControlDataSource, the interface), the number of buckets requested, and the sampling interval.

Note that we might create a 30-second history table and a 30-minute history table, with possibly different bucket counts for each.

The etherHistorySampleIndex is initialized to 1 and is incremented each interval. Old bucket entries are deleted as the buckets are recycled: from historyControlBucketsGranted (starting line 645):
    When the number of buckets reaches the value of this object and a new
    bucket is to be added to the media-specific table, the oldest bucket
    associated with this historyControlEntry shall be deleted by the agent
    so that the new bucket can be added.

Look at the indexing here. The etherHistoryTable is indexed by etherHistoryIndex and etherHistorySampleIndex. The latter is a bucket serial number; the former is the same as historyControlIndex, the index to the historyControlTable.
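The bucket-recycling mechanism can be sketched in a few lines of Python. This is a toy model (the class and field names are illustrative, not the actual MIB objects): buckets are kept in a fixed-size queue, the sample index keeps climbing, and the oldest bucket is dropped when the granted count is reached.

```python
from collections import deque

class HistoryControl:
    """Toy model of an RMON history control entry driving its data buckets."""
    def __init__(self, buckets_granted):
        self.buckets_granted = buckets_granted
        self.next_sample_index = 1          # etherHistorySampleIndex starts at 1
        self.buckets = deque()              # oldest bucket at the left

    def add_bucket(self, stats):
        # Once bucketsGranted buckets exist, delete the oldest before adding.
        if len(self.buckets) == self.buckets_granted:
            self.buckets.popleft()
        self.buckets.append((self.next_sample_index, stats))
        self.next_sample_index += 1

hc = HistoryControl(buckets_granted=3)
for interval_stats in [{"pkts": 10}, {"pkts": 20}, {"pkts": 30}, {"pkts": 40}]:
    hc.add_bucket(interval_stats)
# Bucket 1 has been recycled; sample indexes never restart.
print([idx for idx, _ in hc.buckets])   # [2, 3, 4]
```

Note that the sample index is never reused, which is what lets a manager tell old buckets from new ones.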




The Hosts group has three tables: hostControlTable, hostTable, and hostTimeTable.
The hostControlTable specifies the interface. Note hostControlTableSize (line 1462), which determines the number of hosts to be recorded.

The hostControlIndex:
DESCRIPTION: "An index that uniquely identifies an entry in the hostControl table. Each such entry defines a function that discovers hosts on a particular interface and places statistics about them in the hostTable and the hostTimeTable on behalf of this hostControlEntry."

The hostTable is indexed by hostIndex and hostAddress; the latter is a particular node's physical address, and contains per-host statistics such as in and out packets, bytes, and errors. hostIndex is the same as hostControlIndex.

Read the description of the hostCreationOrder column.

The hostTimeTable is the same except indexed serially by discovery time. The actual index is { hostTimeIndex, hostTimeCreationOrder }. The hostTimeIndex is again the hostControlIndex. See "The hostTimeTable has two important uses.", line 1355:

The hostTimeTable has two important uses.  The first is the fast download of this potentially large table.  Because the index of this table runs from 1 to the size of the table, inclusive, its values are predictable.  This allows very efficient packing of variables into SNMP PDU's and allows a table transfer to have multiple packets outstanding. These benefits increase transfer rates tremendously.

The second use of the hostTimeTable is the efficient discovery by the management station of new entries added to the table. After the management station has downloaded the entire table, it knows that new entries will be added immediately after the end of the current table.  It can thus detect new entries there and retrieve them easily.

I am not aware of any evidence that having contiguous index values allows for more efficient PDU packing! But it is an intriguing idea.
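The second use (incremental discovery of new hosts) is easy to sketch. Below, a plain dict stands in for SNMP GETs against the hostTimeTable; the point is that because creation orders run 1..N contiguously, the manager only has to probe just past the last entry it knows about.

```python
def fetch_new_entries(table, last_known):
    """table: dict mapping hostTimeCreationOrder -> host stats (a toy
    stand-in for SNMP queries). Returns entries added since last_known,
    relying on creation orders being contiguous from 1 to N."""
    new = []
    order = last_known + 1
    while order in table:
        new.append((order, table[order]))
        order += 1
    return new

table = {1: "host-a", 2: "host-b", 3: "host-c"}
print(fetch_new_entries(table, last_known=2))   # [(3, 'host-c')]
```

(If the agent deletes old hosts to make room, creation orders get renumbered and the manager must re-download; the sketch ignores that case.)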


TopN group

hostTopNControlTable
Indexed by hostTopNControlIndex
The N is supposed to be the hostTopNRequestedSize/hostTopNGrantedSize.
The hostTopNRateBase specifies what variable we are to use for the ranking. You get to pick from a list of seven.

Collection runs for the management-supplied hostTopNTimeRemaining.

The actual data table is the hostTopNTable, with columns (index columns marked with *):
    hostTopNReport*     INTEGER (1..65535)    matches control-table entry
    hostTopNIndex*      INTEGER (1..65535)
    hostTopNAddress     OCTET STRING          host physical address
    hostTopNRate        INTEGER               numeric value of change in the selected variable

In this table, the hostTopNIndex specifies the 1..N rank of a particular host (the hostTopNAddress). This may change during the collection period. Alternatively, no data at all may be available until the collection period has ended.

Note that this is all data extracted by the agent from the hosts table. (Note also that you can't create TopN statistics by collecting information on N hosts; you don't know when you see the N+1-th host whether it will replace one in the existing table or not.) Basically, the main thing this group adds is agent-side sorting of the hosts data.
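The agent-side sort is straightforward once you have the full hosts table: take counter snapshots at the start and end of the interval, rank by the deltas. A sketch (addresses and counter names are made up):

```python
import heapq

def host_top_n(prev, curr, n):
    """Rank hosts by the change in a chosen counter over the interval
    (hostTopNRate is a delta, not an absolute counter value).
    prev/curr: dicts mapping host physical address -> counter value."""
    rates = {h: curr[h] - prev.get(h, 0) for h in curr}
    return heapq.nlargest(n, rates.items(), key=lambda kv: kv[1])

prev = {"aa:01": 100, "aa:02": 500, "aa:03": 40}
curr = {"aa:01": 900, "aa:02": 650, "aa:03": 45}
print(host_top_n(prev, curr, 2))   # [('aa:01', 800), ('aa:02', 150)]
```

This also shows why you need the whole hosts table: the ranking isn't known until every host's delta has been computed.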


Matrix Group

MatrixControlTable
SD and DS tables: why both? Because one is indexed by source and the other by destination.

The matrixSDTable has fields (index columns marked with *):
              matrixSDSourceAddress*      OCTET STRING,
              matrixSDDestAddress*        OCTET STRING,
              matrixSDIndex*              INTEGER (1..65535),
              matrixSDPkts                Counter,
              matrixSDOctets              Counter,
              matrixSDErrors              Counter

The matrixSDIndex corresponds to the control-table entry.
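The SD/DS duplication is just the same conversation data under two sort orders, so a manager can walk either "everything from host X" or "everything to host Y" efficiently. A toy sketch (names illustrative):

```python
from collections import defaultdict

# Toy traffic matrix keyed by (source, dest), as in the matrixSDTable.
matrix = defaultdict(lambda: {"pkts": 0, "octets": 0})

def record(src, dst, octets):
    """Count one observed packet from src to dst."""
    cell = matrix[(src, dst)]
    cell["pkts"] += 1
    cell["octets"] += octets

record("aa:01", "aa:02", 1500)
record("aa:01", "aa:03", 60)
record("aa:02", "aa:01", 40)

# SD view: all conversations from a given source.
sd = sorted(dst for (src, dst) in matrix if src == "aa:01")
print(sd)   # ['aa:02', 'aa:03']
```

A real agent keeps both sort orders materialized rather than filtering like this, which is exactly why there are two tables.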



Switching v Routing


Suppose you have a large network. Should you forward traffic using Ethernet switches (layer 2 devices) or IP routers (layer 3 devices)? Once upon a time this was a controversial subject.

Basically, routers join different IP subnets. Those subnets are switching zones (broadcast domains); every host in a subnet must be able to look up the physical address of every other host (using broadcast ARP) and then send directly to that host's physical address.

On the side of routing, ultimately switching does not scale well. Switches must forward broadcast packets on all ports, meaning that as your network size increases, the amount of stray broadcast traffic increases faster. Switches also do not support a network topology with loops, which are important in providing redundancy. (Switches do allow loops, but use the spanning tree algorithm to disable certain links to prune the topology into a tree. If some nondisabled link fails, a disabled link may be returned to service. However, there is no attempt at optimal paths.) Switches must also keep every individual host heard from in their forwarding tables; routers keep only IP networks in their tables, leading to more compact tables (though switching tables of 100,000 hosts are not unheard of, and work reasonably well). Finally, switches must use broadcast to deliver to previously unencountered destinations; routers are "preconfigured" to reach all destinations.

Switch where you can, route where you must ("dumb things are cheaper than smart things")

The above was a marketing slogan of a switch vendor that is no longer with us. The downside of routers, however, was (and is) that they are slow. Packet forwarding rates were traditionally at least an order of magnitude below switch rates (50 Kpps versus up to 1 Mpps for switches during the last century, though Linux routers broke the 1 Mpps barrier a few years ago). Throughout the 1990s, aggressive switch manufacturers convinced sites that switching was all they needed, and there does appear to be such a thing as "unnecessary" routing.

Routers, however, are slow for a reason: they do more. At a minimum, a router must validate the IP header, decrement the TTL (recomputing the header checksum), look up the destination address in its forwarding table (a longest-prefix match, not the exact match a switch uses), and determine the next hop's physical address.
However, routers are also often called upon to do additional packet inspection. Some might be as simple as involving the quality-of-service (QoS) bits in the routing decision, but more typical is major firewall implementation. This might involve many additional comparisons involving the IP and TCP header fields, or even fragment reassembly. Another task a router (but seldom if ever a switch) may be called upon to perform is queue management, to enforce bandwidth rules on the different parties involved. Some routers track the full TCP state of all connections. (This is essential for routers doing Network Address Translation, or NAT, though arguably NAT is a separate conceptual module from routing itself.)

Note that routers are not slower because they "switch at layer 3 instead of layer 2". Traditionally, some routers were slower because they were poorly designed at the hardware level (eg using an underpowered general-purpose CPU), but the main issue is that routers have more to do.
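The lookup difference alone illustrates some of the extra work: a switch does an exact match on a 48-bit MAC address, while a router must find the most specific matching prefix. A minimal sketch (toy routing table; real routers use tries or TCAMs, but the semantics are the same):

```python
import ipaddress

# Toy longest-prefix-match forwarding table.
routes = [
    (ipaddress.ip_network("10.0.0.0/8"), "port1"),
    (ipaddress.ip_network("10.1.0.0/16"), "port2"),
    (ipaddress.ip_network("0.0.0.0/0"), "uplink"),   # default route
]

def lookup(dst):
    """Return the outgoing port for the longest (most specific) matching prefix."""
    addr = ipaddress.ip_address(dst)
    matches = [(net, port) for net, port in routes if addr in net]
    return max(matches, key=lambda m: m[0].prefixlen)[1]

print(lookup("10.1.2.3"))    # port2  (the /16 wins over the /8)
print(lookup("8.8.8.8"))     # uplink
```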

One classic layout might be to use a switch within each department of size < 255, and to join departments with routers. This may be unnecessarily restrictive; there is often no special reason to divide up a network along departmental lines. This would often be the case only if security rules (to be implemented as firewall rules) were to be put into effect between every pair of departments. But while you might wish to restrict intra-company access to the accounting and HR departments, the case for isolating different R&D groups or even R&D and marketing is less clear.

So another layout is to have a few "secure" departments/workgroups, and also some rather large multi-department (perhaps building-sized, or at least floor-sized) "switching zones", corresponding to subnets of size /22 (10 host bits, or 1024 hosts) to /18 (14 host bits, or ~16,000 hosts). Routing would be used when you actually have some firewall restrictions to enforce between zones (or else the zones grow so large that broadcast traffic has become a quantifiable issue). At some sites, switching zones can comfortably grow to span multiple buildings (ie a campus).

Architecturally, the situation still looks like a router (or multiple routers) with each port connected to a separate "switching zone". The switching zones, however, have "ballooned". The central argument is that routing for the sake of having separate subnets is pointless; routing should be done for a reason. (A non-security reason might be if a department needed to manage their own DHCP server.)

VLANs (Virtual LANs)

Sometimes the separate departments or switching zones tended to overlap geographically. That is, floor 1 might have hosts A1, B2, A3, A4, B5, and floor 2 might have A21, A22, B23, B24, A25, where the hosts beginning with A need to be in a separate subnet from those beginning B. The classic way to achieve this is to have two switches, an A switch and a B switch, on each floor; the A-switches would be interconnected through riser cables and also the two B-switches; the two would come together at a router somewhere.

Typically, department membership might be assigned more or less dynamically, while an employee's cubicle might be much more static. So a given desk might move from "Marketing" to "Sales" to "Research1" over the course of a month.

Virtual LANs were developed to address this. We just have one switch on each floor, configured with an A VLAN ("red") and a B VLAN ("green"). Each host port (that is, port directly connected to a specific host) is assigned to a specific VLAN; for example, on the first floor, ports 1, 3, and 4 might be connected to the A VLAN while 2 and 5 would be connected to the B VLAN. We can think of the ports as having colors (red or green) representing the VLAN to which they belong. Similarly, on the second floor, ports 1, 2, and 5 would be connected to A while 3 and 4 would be connected to B. Assignment would be made through software. A broadcast packet sent from any A-VLAN port would be forwarded only to other A-VLAN ports; the B-VLAN ports would simply not see the packet. Typically, if a host on an A-VLAN port tried to send a packet to the MAC address of a B-VLAN host, the packet would not be delivered (although this is a slightly higher level of isolation than just not forwarding broadcasts, and for some VLAN hardware this isolation is not implemented). The VLAN switches would maintain separate forwarding tables for the two VLANs, and separate spanning trees (for more complex switch arrangements).

The connection between the floor-1 and floor-2 switches, however, has to be handled differently; so-called trunk ports cannot simply be colored "red" or "green". Instead, trunk ports are identified as such in the configuration setup, and are allowed to carry traffic from both VLANs. But the Ethernet packets involved are tagged with a special protocol (802.1Q) that involves inserting 4 bytes into each packet, marking the packet with its home VLAN. This allows sharing trunks with different VLANs, while preserving VLAN isolation.

While VLANs are used routinely for security isolation, be aware that their original justification was not for security but for traffic isolation.

Also note that while it might be possible to define security groups based on membership in VLANs, it is generally considered safer to do this at the IP level. Auditing VLAN configurations for security errors is error-prone; noticing that a host is on the wrong IP subnet is much more obvious.

Typically, VLAN switches do not allow any administrative control over the bandwidth allocation of shared trunks; if you want to prioritize traffic from certain departments over these shared trunks, you need to put a router at each end.

Inter-VLAN connections

As for the non-VLAN architecture above, at some point the "switching zones" (VLANs) must exchange traffic. Generally they do this at IP routers. But note that we connect a router not between two or more VLAN switches, but between two or more different-VLAN ports on the same switch. The router would also have a connection to the internet. Switching within a VLAN is automatic (and thus we have no way of enforcing security policies or packet filtering), but traffic through a router is under administrative control.

Some VLAN switches support simple routing modules that don't do much more than transfer packets from one VLAN to another based on IP address. Such a router might be called a "level 3" or "layer 3" switch; such devices tend to be much faster than full-featured routers but also less capable.


OpenNMS

Documentation is at http://www.opennms.org/wiki/Category:Documentation and http://www.opennms.org/wiki/Official_Documentation.

A question to keep in mind: what good does all this do you?

OpenNMS basics:
    discovery (eg of hosts/interfaces)
    capability detection (eg discovery of services)
    monitoring modules
   
   
OpenNMS
Tarus Balog - chief developer
Why open source? To get people to contribute monitors and pollers!

OpenNMS focuses on NODES and also SERVICES.
traditional SNMP view: just NODES (routers, switches) and whether they are working.

The three main functional areas of OpenNMS are:
  1. service polling, which monitors services on the network and reports on their "service level";
  2. data collection from the remote systems via SNMP in order to measure the performance of the network;
  3. a system for event management, statistics, and notifications.
   
Note that that last category is tricky: the NMS must know what events are "important", and what SNMP data from each device means, and when to sound the alarm. False positives and false negatives are both bad.

Data collection scenario:

To mitigate the problem, OpenNMS was modified to collect 200,000 data points from approximately 24,000 interfaces every five minutes, or 2.4 million data points an hour from a single instance of OpenNMS. The limitation turned out to be the speed at which the disk controller could write the data, not OpenNMS itself.

Example: view ulam2, ulam3: interface graphs, resource graphs, and response-time graphs.



discovery => capabilities detection => poller monitor


Most xml config files are in $OPENNMSHOME/etc

discovery

    ping (individual pings to listed ranges of IP addrs (with exceptions))
    manual configuration
<discovery-configuration threads="1" packets-per-second="1"
        initial-sleep-time="300000" restart-sleep-time="86400000"
        retries="3" timeout="800">

        <include-range retries="2" timeout="3000">
                <begin>192.168.0.1</begin>
                <end>192.168.0.254</end>
        </include-range>

        <include-url>file:/opt/OpenNMS/etc/include</include-url>

</discovery-configuration>

You can do this manually, in the file above, or else using the web interface,
    admin=> Configure Discovery => Include Ranges => Add New

This is host discovery. On each discovered host, we also need to discover services.
Demo of the result in ulam3:/opt/OpenNMS

capabilities detection

see http://www.opennms.org/wiki/Discovery_Configuration_How-To.

core list
    probe for the following:
    # Citrix    central LAN software mgmt
    # DHCP
    # DNS
    # Domino IIOP    Lotus workgroup solution
    # FTP
    # HTTP
    # HTTPS
    # ICMP
    # IMAP        email inbound
    # LDAP
    # Microsoft Exchange
    # Notes HTTP
    # POP3        email inbound
    # SMB
    # SMTP        email server
    # SNMP
    # TCP        port is specifiable; generic service detection

Much newer (and longer) list at http://www.opennms.org/index.php/Discovery_Configuration_How-To;
JDBC, radius, ssh added

A directory of poller documentation is at http://www.opennms.org/wiki/Service_pollers.




Review the network services involved.

SMB: checked, but file-sharing features are NOT continually monitored

SNMP: special case, used for further data collection not polling
    query: system, interfaces, IP->ipAddrTable (other IP addrs for this host)
   
    discovery information: discovery-configuration.xml
   
capabilities detection: capsd-configuration.xml
Part of the discovery phase: figuring out what each node is capable of.
Which nodes are SMTP servers? Web servers?
   
Sample plugin config for ICMP from poller-configuration.xml:

<protocol-plugin protocol="ICMP" class-name="org.opennms.netmgt.capsd.IcmpPlugin" scan="on" user-defined="false">
     <protocol-configuration scan="on" user-defined="false">
          <range begin="192.168.10.0" end="192.168.10.254"/>
          <property key="timeout" value="4000"/>
          <property key="retry" value="3"/>
     </protocol-configuration>
    
     <protocol-configuration scan="off" user-defined="false">
          <range begin="192.168.20.0" end="192.168.20.254"/>
     </protocol-configuration>
    
     <protocol-configuration scan="enable" user-defined="false">
          <specific>192.168.30.1</specific>
     </protocol-configuration>
    
     <property key="timeout" value="2000"/>
     <property key="retry" value="2"/>
</protocol-plugin>

scan: can be turned on/off dynamically
user-defined: did user add this dynamically from console?

POLLING & "COLLECTING" (the latter applies only to SNMP)

Group net devices/services into PACKAGES, with package-specific polling instructions (eg protocol, what to look for, frequency)

Special provision for "30-second outages". These are real outages, but "can be annoying yet hard to correct"

Sample poller service entry from poller-configuration.xml:

           <service name="DNS" interval="300000" user-defined="false" status="on">
                   <parameter key="retry" value="3"/>
                   <parameter key="timeout" value="5000"/>
                   <parameter key="port" value="53"/>
                   <parameter key="lookup" value="localhost"/>
           </service>

Question: how do services GET polled? What constitutes a "response"?
OpenNMS contains "poller monitors": drop-in packages that do these checks on a per-service basis.

OpenNMS uses "adaptive polling": once an outage is discovered through NMS, the service is polled more frequently. Typical generic polling interval: 300 sec; typical polling interval for down service: 30 sec.
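The adaptive schedule is easy to express. The intervals below come from the text; the loop itself is just an illustration of the policy, not OpenNMS code:

```python
NORMAL_INTERVAL = 300   # seconds between polls while the service is up
DOWN_INTERVAL = 30      # poll more often once an outage is detected

def next_interval(service_up):
    """Pick the delay before the next poll based on the last poll's result."""
    return NORMAL_INTERVAL if service_up else DOWN_INTERVAL

# A service that fails on its 3rd and 4th polls, then recovers:
results = [True, True, False, False, True]
intervals = [next_interval(up) for up in results]
print(intervals)   # [300, 300, 30, 30, 300]
```

The point of the fast interval is to timestamp the end of the outage accurately, not just to notice recovery eventually.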

Data collection

this is configured for each SNMP "PACKAGE". The data is put into RRDTool, a round-robin DB.
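The "round-robin" in RRDTool means storage is a fixed-size circular buffer, so the database never grows: new samples overwrite the oldest. A minimal sketch of the idea (RRDTool itself also does time-based consolidation, which this omits):

```python
class RoundRobinArchive:
    """Fixed-size circular buffer of samples; new values overwrite the oldest."""
    def __init__(self, size):
        self.slots = [None] * size
        self.pos = 0

    def update(self, value):
        self.slots[self.pos] = value
        self.pos = (self.pos + 1) % len(self.slots)

rra = RoundRobinArchive(size=3)
for v in [10, 20, 30, 40]:
    rra.update(v)
print(rra.slots)   # [40, 20, 30] -- the 4th sample overwrote the 1st
```

Note the parallel with the RMON history buckets above: bounded storage, oldest data recycled.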


Manual HTTP queries

Begin with "telnet xenon 80", and then type the GET line, the HOST line, and a blank line.

GET /index.html HTTP/1.1
HOST: xenon.cs.luc.edu
(blank line)
            --> returns Apache startup page
    HTTP/1.1 200 OK
    Date: Tue, 18 Mar 2008 19:47:59 GMT
    Server: Apache
    Last-Modified: Wed, 01 Mar 2006 01:04:31 GMT
    ETag: "2f3e2a-5a3-86bbc5c0"
    Accept-Ranges: bytes
    Content-Length: 1443
    Content-Type: text/html; charset=ISO-8859-1

    <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1//EN"
            "http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd">
    <html xmlns="http://www.w3.org/1999/xhtml">
        <head>
            <title>Test Page for Apache Installation</title>
        </head>

        <body>
            <p>If you can see this, it means that the installation of the
            <a href="http://www.apache.org/foundation/preFAQ.html">Apache web server</a>
            software on this system was successful. You may now add content to this
            directory and replace this page.</p>
        ....
       
           
GET /foobar.html HTTP/1.1
HOST: xenon.cs.luc.edu
(blank)
            --> returns 404

   
OPTIONS * HTTP/1.1
HOST: xenon.cs.luc.edu
(blank)
-->    HTTP/1.1 200 OK
    Date: Tue, 18 Mar 2008 19:49:39 GMT
    Server: Apache/2.2.2 (Unix)
    Allow: GET,HEAD,POST,OPTIONS,TRACE
    Content-Length: 0
    Content-Type: text/plain
   
Note codes (200, 404, etc), matching "response", and the body, which we can match with a regular expression.
Question: at what point do we decide the server is working ok?
See http://www.opennms.org/wiki/HTTPMonitor.
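The check the monitor performs on such a response can be sketched without any network I/O: parse the status code out of the "HTTP/..." line, then optionally match the body. (Function name and parameters are illustrative; the canned response string is trimmed from the transcript above.)

```python
import re

def check_http_response(raw, expected_status=200, body_regex=None):
    """Check a raw HTTP response the way a poller monitor might:
    parse the numeric code from the status line, then optionally
    regex-match the body."""
    head, _, body = raw.partition("\r\n\r\n")
    status_line = head.splitlines()[0]          # e.g. "HTTP/1.1 200 OK"
    code = int(status_line.split()[1])
    if code != expected_status:
        return False
    return body_regex is None or re.search(body_regex, body) is not None

raw = ("HTTP/1.1 200 OK\r\nContent-Type: text/html\r\n\r\n"
       "<title>Test Page for Apache Installation</title>")
print(check_http_response(raw, 200, r"Test Page"))   # True
```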

Next go through http://www.opennms.org/index.php/Testing_Filtering_Proxies_With_HTTPMonitor and see how "negative" polling can be implemented.
(We simply expect a "response" of 400-599)

Of course, if you simply blackhole requests for facebook.com, this won't work.

SMTP queries

This time the port is 25.

Send
    EHLO myhostname




Some specific poller-monitors


Find these in the OpenNMS source directory, mostly under opennms-services/src/main/java/org/opennms/netmgt/poller/monitors.


Now look at HttpMonitor.java:

Look at
    int response = ParameterMap.getKeyedInteger(parameters, "response", -1);                //  line 141 in v1.3.2.
    String responseText = ParameterMap.getKeyedString(parameters, "response text", null);

Look at how poller-monitor goes through the response

Look at: if (line.startsWith("HTTP/")) {          // line 215 in v1.3.2
    parse out the response numeric code
   
responsetext:
        int responseIndex = line.indexOf(responseText);       // line 257
        if (responseIndex != -1)
            bResponseTextFound = true;

Conclusion: the source code I'm looking at does NOT do regexp matching!
(It's older than the online docs.)

See the v1.6.5 HttpMonitor.java buildCommand().




DnsMonitor.java

lookup: line 141
build packet: line 158
 line 186: request.verifyResponse(incoming.getData(), incoming.getLength())
no actual verification that the DNS value is correct, but we don't NEED that!
We do presumably verify that the DNS response is for the requested machine.
From the source for DNSAddressRequest.java:
     * This method only goes so far as to decode the flags in the response
     * byte array to verify that a DNS server sent the response.
Verifies request ID (sequence # of request)
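That level of verification can be sketched directly from the DNS header format: the monitor only checks that the response echoes the 16-bit request ID and has the QR (response) bit set in the flags. This loosely mimics DNSAddressRequest.verifyResponse; the canned byte strings stand in for real network traffic.

```python
import struct

def build_query(qid):
    # DNS header: ID, flags (RD set), QDCOUNT=1, AN/NS/ARCOUNT=0
    return struct.pack("!HHHHHH", qid, 0x0100, 1, 0, 0, 0)

def verify_response(query, response):
    """Check only that the response echoes our request ID and has the
    QR (response) bit set -- no check that the answer itself is correct."""
    qid, = struct.unpack("!H", query[:2])
    rid, flags = struct.unpack("!HH", response[:4])
    return rid == qid and bool(flags & 0x8000)

q = build_query(qid=0x1234)
good = struct.pack("!HH", 0x1234, 0x8180) + bytes(8)   # matching ID, QR set
bad  = struct.pack("!HH", 0x9999, 0x8180) + bytes(8)   # wrong ID
print(verify_response(q, good), verify_response(q, bad))   # True False
```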




SmtpMonitor.java

When we connect, other end should send:
    220 ulam2.cs.luc.edu ESMTP Postfix

214:    read banner
218-240    multiline banner handler
247:    check for the 220
251:    respond  HELO myname
    response should be
        250 ulam2.cs.luc.edu
    Note that EHLO myname would produce something like:
        250-ulam2.cs.luc.edu
        250-PIPELINING
        250-SIZE 10240000
        250-VRFY
        250-ETRN
        250-STARTTLS
        250-AUTH PLAIN
        250 8BITMIME
289:    check for 250
290    send QUIT
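The multiline-reply handling (lines 218-240 above) is the only subtle part: continuation lines use "NNN-text" and the final line "NNN text". A sketch of the parsing, with no actual socket (the sample replies are the ones shown above):

```python
def read_smtp_reply(lines):
    """Parse an SMTP reply, handling the multiline form: continuation
    lines are 'NNN-text', the final line is 'NNN text'.
    Returns the 3-digit reply code."""
    for line in lines:
        code, sep = line[:3], line[3:4]
        if sep != "-":           # a space (or end of line) marks the last line
            return int(code)
    raise ValueError("unterminated multiline reply")

banner = ["220 ulam2.cs.luc.edu ESMTP Postfix"]
ehlo = ["250-ulam2.cs.luc.edu", "250-PIPELINING", "250 8BITMIME"]
print(read_smtp_reply(banner), read_smtp_reply(ehlo))   # 220 250
```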




SshMonitor.java

144:    String strBannerMatch = (String) parameters.get("banner");
185:    read a line
195:     check for match with banner line
199-200    send our response line
205:    see if we get any further response, but don't parse it
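The check amounts to a banner grab: read the server's identification line, optionally require it to contain a configured string, send our own identification, and don't parse anything further. A sketch (function name and our identification string are illustrative):

```python
def check_ssh_banner(banner, match=None):
    """Like SshMonitor's banner check: require an SSH identification line,
    optionally require it to contain a configured 'banner' string, and
    return our own identification line to send back (or None on failure)."""
    if not banner.startswith("SSH-"):
        return None
    if match is not None and match not in banner:
        return None
    return "SSH-2.0-OpenNMS-poller\r\n"     # our response line (illustrative)

print(check_ssh_banner("SSH-2.0-OpenSSH_9.6", match="OpenSSH") is not None)  # True
print(check_ssh_banner("HTTP/1.1 400"))                                      # None
```

Note that this establishes only that something speaking the SSH protocol is listening, not that logins actually work.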




SmbMonitor.java:

doesn't do anything!!

We are a LONG way from testing this by verifying that a file copied to and then back from an SMB fileshare is unchanged.




This version doesn't have a rich set of expect/send methods. Everything is done "by hand".