Network Management
Week 9, Mar 28
LT-412, Mon 4:15-6:45 pm
RMON
RMON here means RMON 1, which contains Ethernet-level information only. Later, RMON 2 added monitoring at the IP and TCP (port) layers.
Basic idea: remote agents do some limited monitoring of subnets.
They become "mini-managers", freeing the manager from probing every
host on the subnet. Managers can create rows to specify what monitoring
is to be done.
Note that implementing RMON is a significant
undertaking for an agent: it will have to put its interfaces into
promiscuous mode and analyze every packet. Of course, tcpdump and
Wireshark do this too.
See RMON-MIB
History group (discussed earlier)
This group has two tables: control & data
The control entry specifies the interface (historyControlDataSource),
the bucket count, and the interval. The data table then creates one
bucket for each time interval, containing a summary of the usage during
that interval. Actual history stats: packet counts, byte counts, error
counts of various types, etc.
We create a control entry, and that directs how the data table will be
built. Note how the data table is indexed, and how rows get deleted
(recycled?) as new buckets are created.
When the manager creates a row, it supplies:
- historyControlIndex: an arbitrary index value
- historyControlDataSource: the OID for the ifIndex object from the
mib-2 interfaces table. Note that the full OID is provided here, not
just the ifIndex value. Note also that the ifIndex value is the last
level of this OID.
- historyControlBucketsRequested
- historyControlInterval: in seconds, 1..3600.
Note that we might create a 30-second history table and a 30-minute
history table, with possibly different bucket counts for each.
The etherHistorySampleIndex is initialized to 1 and is incremented each
interval. Old bucket entries are deleted as the buckets are recycled:
from historyControlBucketsGranted (starting line 645):
When the number of buckets reaches the value of
this object and a new bucket is to be added to the
media-specific table, the oldest bucket associated
with this historyControlEntry shall be deleted by
the agent so that the new bucket can be added.
Look at the indexing here. The etherHistoryTable is indexed by etherHistoryIndex and etherHistorySampleIndex.
The latter is a bucket serial number; the former is the same as
historyControlIndex, the index to the historyControlTable.
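As a concrete (made-up) example: if our control row has historyControlIndex = 10 and 50 granted buckets, the packet count for the 17th sample is the instance etherHistoryPkts.10.17. The sample index keeps climbing as buckets are recycled: when sample 51 is created, the row for sample 1 is deleted, so the live instances are always the 50 most recent.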
The Hosts group has three tables: hostControlTable, hostTable, and hostTimeTable.
The hostControlTable specifies the interface. Note hostControlTableSize
(line 1462), which determines the number of hosts to be recorded.
The hostControlIndex:
DESCRIPTION: "An index that uniquely identifies an entry in the
hostControl table. Each such entry defines a function that discovers
hosts on a particular interface and places statistics about them in the
hostTable and the hostTimeTable on behalf of this hostControlEntry."
The hostTable is indexed by { hostIndex, hostAddress }; the latter is a particular node's physical address. The table contains
per-host statistics such as in and out packets, bytes, and errors. hostIndex is the same as hostControlIndex.
Read the description of the hostCreationOrder column.
The hostTimeTable is the same except indexed serially by discovery
time. The actual index is { hostTimeIndex, hostTimeCreationOrder }. The
hostTimeIndex is again the hostControlIndex. See "The hostTimeTable has
two important uses.", line 1355:
The hostTimeTable has two important
uses. The first is the fast download of this potentially large
table. Because the index of this table runs from 1 to the size of
the table, inclusive, its values are predictable. This allows
very efficient packing of variables into SNMP PDU's and allows a table
transfer to have multiple packets outstanding. These benefits increase
transfer rates tremendously.
The second use of the hostTimeTable is the efficient discovery by the
management station of new entries added to the table. After the
management station has downloaded the entire table, it knows that new
entries will be added immediately after the end of the current
table. It can thus detect new entries there and retrieve them
easily.
I am not aware of any evidence that having contiguous index values
allows for more efficient PDU packing! But it is an intriguing idea.
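Here is a sketch of the packing idea (not OpenNMS or RFC code); it reuses the Snmp/CommunityTarget setup from the earlier SNMP4J example, and the hostTimeAddress column OID and control index 1 are my assumptions:

    // Fetch the first 20 hostTimeAddress instances in a single GET PDU,
    // possible only because the creation-order index is predictable.
    static void fetchHostTimeAddresses(Snmp snmp, CommunityTarget target)
            throws Exception {
        String hostTimeAddress = "1.3.6.1.2.1.16.4.3.1.1";  // assumed column OID
        PDU pdu = new PDU();
        pdu.setType(PDU.GET);
        for (int order = 1; order <= 20; order++)
            // instance index is { hostTimeIndex, hostTimeCreationOrder } = { 1, order }
            pdu.add(new VariableBinding(new OID(hostTimeAddress + ".1." + order)));
        PDU resp = snmp.get(pdu, target).getResponse();
        if (resp != null)
            for (VariableBinding vb : resp.getVariableBindings())
                System.out.println(vb);
    }

With GetNext you must wait for each reply to learn the next index; here several such PDUs could be outstanding at once.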
TopN group
hostTopNControlTable
Indexed by hostTopNControlIndex
The N is the management-requested hostTopNRequestedSize, as granted by the agent in hostTopNGrantedSize.
The hostTopNRateBase specifies what variable we are to use for the ranking. You get to pick from a list of seven.
Collection runs for the management-supplied hostTopNTimeRemaining.
The actual data table is the hostTopNTable, with columns:

    hostTopNReport     INTEGER (1..65535)   matches control-table entry
    hostTopNIndex      INTEGER (1..65535)   rank of this entry, 1..N
    hostTopNAddress    OCTET STRING         host physical address
    hostTopNRate       INTEGER              numeric value of change in the selected variable

The index columns are hostTopNReport and hostTopNIndex. In this table,
the hostTopNIndex specifies the 1..N rank of a particular host (the
hostTopNAddress). This may change during the collection period.
Alternatively, no data at all may be available until the collection
period has ended.
Note that this is all data extracted by the agent from the hosts
table. (Note also that you can't create TopN statistics by collecting
information on N hosts; you don't know when you see the N+1-th host
whether it will replace one in the existing table or not.) Basically,
the main thing this group adds is agent-side sorting of the hosts data.
Matrix Group
MatrixControlTable
SD and DS tables: why both? Because one is indexed source-address-first and the other destination-address-first; the data itself is the same.
The matrixSDTable has fields:

    matrixSDSourceAddress   OCTET STRING
    matrixSDDestAddress     OCTET STRING
    matrixSDIndex           INTEGER (1..65535)
    matrixSDPkts            Counter
    matrixSDOctets          Counter
    matrixSDErrors          Counter

The index fields are matrixSDIndex, matrixSDSourceAddress, and
matrixSDDestAddress; the matrixSDIndex corresponds to the control-table entry.
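As a (made-up) example of the indexing: with the usual fixed-length OCTET STRING index encoding, each MAC byte becomes one sub-identifier, so the packet count from 00:11:22:33:44:55 to 66:77:88:99:aa:bb under control row 1 would be the instance matrixSDPkts.1.0.17.34.51.68.85.102.119.136.153.170.187.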
Switching v Routing
Suppose you have a large network. Should you forward traffic using Ethernet switches (layer 2 devices) or IP routers (layer 3 devices)? Once upon a time this was a controversial subject.
Basically, routers join different IP subnets. Those subnets are
switching zones (broadcast domains); every host in a subnet must be
able to look up the physical address of every other host (using
broadcast ARP) and then send directly to that host's physical address.
On the side of routing, ultimately switching does not scale well.
Switches must forward broadcast packets on all ports, meaning that as
your network size increases, the amount of stray broadcast traffic
increases faster. Switches also do not support a network topology with
loops, which are important in providing redundancy. (Switches do allow
loops, but use the spanning tree algorithm
to disable certain links to prune the topology into a tree. If some
nondisabled link fails, a disabled link may be returned to service.
However, there is no attempt at optimal paths.) Switches must also keep
every individual host heard from in their forwarding tables; routers keep only IP networks
in their tables, leading to more compact tables (though switching
tables of 100,000 hosts are not unheard of, and work reasonably well).
Finally, switches must use broadcast to deliver to previously
unencountered destinations; routers are "preconfigured" to reach all
destinations.
Switch where you can, route where you must ("dumb things are cheaper than smart things")
The above was a marketing slogan of a switch vendor that is no longer
with us. The downside of routers, however, was (and is) that they are
slow. Packet forwarding rates were traditionally at least an order of
magnitude below switch rates (50 Kpps versus up to 1 Mpps for switches
during the last century, though Linux routers broke the 1 Mpps barrier
a few years ago). Throughout the 1990s, aggressive switch manufacturers
convinced sites that switching was all they needed, and there does
appear to be such a thing as "unnecessary" routing.
Routers, however, are slow for a reason: they do more. At a minimum, a router:
- Verifies the IP header checksum
- Decrements the TTL value and recomputes the header checksum
- Extracts the destination IP address
- Looks this up in the forwarding table, and finds not just a match but the longest match
- Sends the packet to the appropriate next hop
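A minimal sketch of the longest-match step, using a toy linear scan (real routers use tries or TCAM hardware; all addresses and next-hop names here are invented):

    import java.util.*;

    public class Fib {
        record Route(int prefix, int maskBits, String nextHop) {}

        // Among all routes whose prefix covers dest, pick the one with
        // the most mask bits (the longest match).
        static String lookup(List<Route> table, int dest) {
            Route best = null;
            for (Route r : table) {
                int mask = r.maskBits() == 0 ? 0 : -1 << (32 - r.maskBits());
                if ((dest & mask) == r.prefix()
                        && (best == null || r.maskBits() > best.maskBits()))
                    best = r;
            }
            return best == null ? null : best.nextHop();
        }

        static int ip(int a, int b, int c, int d) {
            return (a << 24) | (b << 16) | (c << 8) | d;
        }

        public static void main(String[] args) {
            List<Route> table = List.of(
                    new Route(ip(10, 0, 0, 0), 8, "routerA"),
                    new Route(ip(10, 1, 0, 0), 16, "routerB"),  // more specific
                    new Route(0, 0, "defaultGW"));
            System.out.println(lookup(table, ip(10, 1, 2, 3)));  // routerB, not routerA
        }
    }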
However, routers are also often called upon to do additional packet
inspection. Some might be as simple as involving the quality-of-service
(QoS) bits in the routing decision, but more typical is major firewall
implementation. This might involve many additional comparisons
involving the IP and TCP header fields, or even fragment reassembly.
Another task a router (but seldom if ever a switch) may be called upon
to perform is queue management, to enforce bandwidth rules on the
different parties involved. Some routers track the full TCP state of
all connections. (This is essential for routers doing Network Address
Translation, or NAT, though arguably NAT is a separate conceptual
module from routing itself.)
Note that routers are not
slower because they "switch at layer 3 instead of layer 2".
Traditionally, some routers were slower because they were poorly
designed at the hardware level (eg using an underpowered
general-purpose CPU), but the main issue is that routers have more to
do.
One classic layout might be to use a switch within each department of
size < 255, and to join departments with routers. This may be
unnecessarily restrictive; there is often no special reason to divide
up a network along departmental lines. This would often be the case
only if security rules (to be implemented as firewall rules) were to be
put into effect between every pair of departments. But while you might
wish to restrict intra-company access to the accounting and HR
departments, the case for isolating different R&D groups or even
R&D and marketing is less clear.
So another layout is to have a few "secure" departments/workgroups, and
also some rather large multi-department (perhaps building-sized, or at
least floor-sized) "switching zones", corresponding to subnets of size
/22 (10 host bits, or 1024 hosts) to /18 (14 host bits, or ~16,000
hosts). Routing would be used when you actually have some firewall
restrictions to enforce between zones (or else the zones grow so large
that broadcast traffic has become a quantifiable issue). At some sites,
switching zones can comfortably grow to span multiple buildings (ie a
campus).
Architecturally, the situation still looks like a router (or multiple
routers) with each port connected to a separate "switching zone". The
switching zones, however, have "ballooned". The central argument is
that routing for the sake of having separate subnets is pointless;
routing should be done for a reason. (A non-security reason might be if a department needed to manage their own DHCP server.)
VLANs (Virtual LANs)
Sometimes the separate departments or switching zones tended to overlap
geographically. That is, floor 1 might have hosts A1, B2, A3, A4, B5,
and floor 2 might have A21, A22, B23, B24, A25, where the hosts
beginning with A need to be in a separate subnet from those beginning
B. The classic way to achieve this is to have two switches, an A switch
and a B switch, on each floor; the A-switches would be interconnected
through riser cables and also the two B-switches; the two would come
together at a router somewhere.
Typically, department membership might be assigned more or less
dynamically, while an employee's cubicle might be much more static. So
a given desk might move from "Marketing" to "Sales" to "Research1" over
the course of a month.
Virtual LANs were developed to address this. We just have one
switch on each floor, configured with an A VLAN ("red") and a B VLAN
("green"). Each host port (that is, port directly connected to a
specific host) is assigned to a specific VLAN; for example, on the
first floor, ports 1, 3, and 4 might be connected to the A VLAN while 2
and 5 would be connected to the B VLAN. We can think of the ports as
having colors (red or green)
representing the VLAN to which they belong. Similarly, on the second
floor, ports 1, 2, and 5 would be connected to A while 3 and 4 would be
connected to B. Assignment would be made through software. A broadcast
packet sent from any A-VLAN port would be forwarded only to other
A-VLAN ports; the B-VLAN ports would simply not see the packet.
Typically, if a host on an A-VLAN port tried to send a packet to the
MAC address of a B-VLAN host, the packet would not be delivered
(although this is a slightly higher level of isolation than just not
forwarding broadcasts, and for some VLAN hardware this isolation is not
implemented). The VLAN switches would maintain separate forwarding
tables for the two VLANs, and separate spanning trees (for more complex
switch arrangements).
The connection between the floor-1 and floor-2 switches, however, has to be handled differently; so-called trunk
ports cannot simply be colored "red" or "green". Instead, trunk ports
are identified as such in the configuration setup, and are allowed to
carry traffic from both VLANs. But the Ethernet packets involved are tagged
with a special protocol (802.1Q) that involves inserting 4 bytes into
each packet, marking the packet with its home VLAN. This allows sharing
trunks with different VLANs, while preserving VLAN isolation.
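The tag layout is standard: a 2-byte TPID of 0x8100 immediately after the two MAC addresses, then a 2-byte TCI whose low 12 bits are the VLAN ID. A small sketch of reading it from a raw frame:

    // Return the 802.1Q VLAN ID of an Ethernet frame, or -1 if untagged.
    // Bytes 0-11 are the destination and source MACs; the 4 inserted tag
    // bytes follow.
    static int vlanId(byte[] frame) {
        int tpid = ((frame[12] & 0xff) << 8) | (frame[13] & 0xff);
        if (tpid != 0x8100)
            return -1;              // untagged: this is the ordinary EtherType
        int tci = ((frame[14] & 0xff) << 8) | (frame[15] & 0xff);
        return tci & 0x0fff;        // low 12 bits: VLAN ID
    }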
While VLANs are used routinely for security isolation, be aware that their original justification was not for security but for traffic isolation.
Also note that while it might be possible to define security groups
based on membership in VLANs, it is generally considered safer to do
this at the IP level. Detecting security errors in VLAN configurations
is difficult and error-prone; noticing that a host is on the wrong IP
subnet is much more straightforward.
Typically, VLAN switches do not allow any administrative control over
the bandwidth allocation of shared trunks; if you want to prioritize
traffic from certain departments over these shared trunks, you need to
put a router at each end.
Inter-VLAN connections
As for the non-VLAN architecture above, at some point the "switching
zones" (VLANs) must exchange traffic. Generally they do this at IP
routers. But note that we connect a router not between two or more VLAN
switches, but between two or more different-VLAN ports
on the same switch. The router would also have a connection to the
internet. Switching within a VLAN is automatic (and thus we have no way
of enforcing security policies or packet filtering), but traffic
through a router is under administrative control.
Some VLAN switches support simple routing modules that don't do much
more than transfer packets from one VLAN to another based on IP
address. Such a router might be called a "level 3" or "layer 3" switch;
such devices tend to be much faster than full-featured routers but also
less capable.
OpenNMS
documentation at http://www.opennms.org/wiki/Category:Documentation and http://www.opennms.org/wiki/Official_Documentation.
See also:
- HP OpenView
- IBM Tivoli
- Computer Associates Unicenter
Or
- Groundwork Open Source
- Nagios
- others in Munro & Schmidt
A question to keep in mind: what good does all this do you?
OpenNMS basics:
discovery (eg of hosts/interfaces)
capability detection (eg discovery of services)
monitoring modules
OpenNMS
Tarus Balog - chief developer
Why open source? To get people to contribute monitors and pollers!
OpenNMS focuses on NODES and also SERVICES.
traditional SNMP view: just NODES (routers, switches) and whether they are working.
The three main functional areas of OpenNMS are:
- service polling, which monitors services on the network and reports on their "service level";
- data collection from the remote systems via SNMP in order to measure the performance of the network;
- A system for event management, statistics, and notifications.
Note that that last category is tricky: the NMS must know what events
are "important", and what SNMP data from each device means, and when to
sound the alarm. False positives and false negatives are both bad.
Data collection scenario:
To mitigate the problem OpenNMS was
modified to collect 200,000 data points from approximately 24,000
interfaces every five minutes, or 2.4 million data points an hour from
a single instance of OpenNMS. The limitation turned out to be the speed
at which the disk controller could write the data, not OpenNMS itself.
Example: view ulam2, ulam3: interface graphs, resource graphs, and response-time graphs.
discovery => capabilities detection => poller monitor
Most xml config files are in $OPENNMSHOME/etc
discovery
ping (individual pings to listed ranges of IP addrs (with exceptions))
manual configuration
<discovery-configuration threads="1" packets-per-sec="1"
initial-sleep-time="300000" restart-sleep-time="86400000"
retries="3" timeout="800">
<include-range retries="2" timeout="3000">
<begin>192.168.0.1</begin>
<end>192.168.0.254</end>
</include-range>
<include-url>file:/opt/OpenNMS/etc/include</include-url>
</discovery-configuration>
You can do this manually, in the file above, or else using the web interface,
admin=> Configure Discovery => Include Ranges => Add New
This is host discovery. On each discovered host, we also need to discover services.
Demo of the result in ulam3:/opt/OpenNMS
capabilities detection
see http://www.opennms.org/wiki/Discovery_Configuration_How-To.
core list
probe for the following:
- Citrix central LAN software mgmt
- DHCP
- DNS
- Domino IIOP Lotus workgroup solution
- FTP
- HTTP
- HTTPS
- ICMP
- IMAP email inbound
- LDAP
- Microsoft Exchange
- Notes HTTP
- POP3 email inbound
- SMB
- SMTP email server
- SNMP
- TCP port is specifiable; generic service detection
Much newer (and longer) list at http://www.opennms.org/index.php/Discovery_Configuration_How-To;
JDBC, radius, ssh added
A directory of poller documentation is at http://www.opennms.org/wiki/Service_pollers.
Review the network services involved.
SMB: checked, but file-sharing features are NOT continually monitored
SNMP: special case, used for further data collection, not polling
query: system, interfaces, IP->ipAddrTable (other IP addrs for this host)
discovery information: discovery-configuration.xml
capabilities detection: capsd-configuration.xml
Part of the discovery phase: figuring out what each node is capable of.
Which nodes are SMTP servers? Web servers?
Sample plugin config for ICMP from capsd-configuration.xml:
<protocol-plugin protocol="ICMP" class-name="org.opennms.netmgt.capsd.IcmpPlugin" scan="on" user-defined="false">
<protocol-configuration scan="on" user-defined="false">
<range begin="192.168.10.0" end="192.168.10.254"/>
<property key="timeout" value="4000"/>
<property key="retry" value="3"/>
</protocol-configuration>
<protocol-configuration scan="off" user-defined="false">
<range begin="192.168.20.0" end="192.168.20.254"/>
</protocol-configuration>
<protocol-configuration scan="enable" user-defined="false">
<specific>192.168.30.1</specific>
</protocol-configuration>
<property key="timeout" value="2000"/>
<property key="retry" value="2"/>
</protocol-plugin>
scan: can be turned on/off dynamically
user-defined: did user add this dynamically from console?
POLLING & "COLLECTING" (the latter applies only to SNMP)
Group net devices/services into PACKAGES, with package-specific polling instructions (eg protocol, what to look for, frequency)
Special provision for "30-second outages". These are real outages, but "can be annoying yet hard to correct".
Sample poller service entry from poller-configuration.xml:
<service name="DNS" interval="300000" user-defined="false"
status="on">
<parameter key="retry" value="3"/>
<parameter key="timeout" value="5000"/>
<parameter key="port" value="53"/>
<parameter key="lookup" value="localhost"/>
</service>
Question: how do services GET polled? What constitutes a "response"?
OpenNMS contains "poller monitors": drop-in packages that do these checks on a per-service basis.
OpenNMS uses "adaptive polling": once an outage is discovered through
NMS, the service is polled more frequently. Typical generic polling
interval: 300 sec; typical polling interval for down service: 30 sec.
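A sketch of the scheduling idea, with the intervals above; the class and method names are invented, not OpenNMS's:

    import java.util.concurrent.*;
    import java.util.function.BooleanSupplier;

    // Reschedule each check with a short interval while the service is
    // down, and the normal interval while it is up.
    class AdaptivePoller {
        static final long UP_MS = 300_000, DOWN_MS = 30_000;  // 300 s / 30 s
        final ScheduledExecutorService exec =
                Executors.newSingleThreadScheduledExecutor();

        void poll(BooleanSupplier check) {
            boolean up = check.getAsBoolean();
            exec.schedule(() -> poll(check), up ? UP_MS : DOWN_MS,
                          TimeUnit.MILLISECONDS);
        }
    }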
Data collection
This is configured for each SNMP "PACKAGE". The data is put into RRDTool, a round-robin DB.
Manual HTTP queries
Begin with "telnet xenon 80", and then type the GET line, the HOST line, and a blank line.
GET /index.html HTTP/1.1
HOST: xenon.cs.luc.edu
(blank line)
--> returns Apache startup page
HTTP/1.1 200 OK
Date: Tue, 18 Mar 2008 19:47:59 GMT
Server: Apache
Last-Modified: Wed, 01 Mar 2006 01:04:31 GMT
ETag: "2f3e2a-5a3-86bbc5c0"
Accept-Ranges: bytes
Content-Length: 1443
Content-Type: text/html; charset=ISO-8859-1
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1//EN"
"http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<title>Test Page for Apache Installation</title>
</head>
<body>
<p>If
you can see this, it means that the installation of the
<a href="http://www.apache.org/foundation/preFAQ.html">Apache web
server</a>
software on this system was successful. You may now add content to this
directory and replace this page.</p>
....
GET /foobar.html HTTP/1.1
HOST: xenon.cs.luc.edu
(blank)
--> returns 404
OPTIONS * HTTP/1.1
HOST: xenon.cs.luc.edu
(blank)
--> HTTP/1.1 200 OK
Date: Tue, 18 Mar 2008 19:49:39 GMT
Server: Apache/2.2.2 (Unix)
Allow: GET,HEAD,POST,OPTIONS,TRACE
Content-Length: 0
Content-Type: text/plain
Note codes (200, 404, etc), matching "response", and the body, which we can match with a regular expression.
Question: at what point do we decide the server is working ok?
See http://www.opennms.org/wiki/HTTPMonitor.
Next go through http://www.opennms.org/index.php/Testing_Filtering_Proxies_With_HTTPMonitor and see how "negative" polling can be implemented.
(We simply expect a "response" of 400-599)
Of course, if you simply blackhole requests for facebook.com, this won't work.
Manual SMTP queries
This time the port is 25.
Send
EHLO myhostname
Some specific poller-monitors
Find these in the OpenNMS source directory, mostly under opennms-services/src/main/java/org/opennms/netmgt/poller/monitors.
Now look at HttpMonitor.java:
Look at
    int response = ParameterMap.getKeyedInteger(parameters, "response", -1);  // line 141 in v1.3.2
    String responseText = ParameterMap.getKeyedString(parameters, "response text", null);
Look at how poller-monitor goes through the response
Look at: if (line.startsWith("HTTP/")) { // line 215 in v1.3.2
parse out the response numeric code
responseText:
    int responseIndex = line.indexOf(responseText);  // line 257
    if (responseIndex != -1)
        bResponseTextFound = true;
Conclusion: the source code I'm looking at does NOT do regexp matching!
(It's older than the online docs.)
See the v1.6.5 HttpMonitor.java buildCommand().
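For contrast, here is a minimal sketch of the substring-style check described above; it is not OpenNMS's HttpMonitor, and the timeout and acceptable-status policy are simplifications:

    import java.io.*;
    import java.net.*;

    public class MiniHttpPoller {
        // Connect, send a GET, check the status line, and optionally look
        // for expected text in the body (like bResponseTextFound above).
        public static boolean poll(String host, int port, String expectText) {
            try (Socket s = new Socket()) {
                s.connect(new InetSocketAddress(host, port), 3000);
                s.setSoTimeout(3000);
                PrintWriter out = new PrintWriter(s.getOutputStream());
                out.print("GET / HTTP/1.1\r\nHost: " + host
                          + "\r\nConnection: close\r\n\r\n");
                out.flush();
                BufferedReader in = new BufferedReader(
                        new InputStreamReader(s.getInputStream()));
                String status = in.readLine();      // e.g. "HTTP/1.1 200 OK"
                if (status == null || !status.startsWith("HTTP/"))
                    return false;
                int code = Integer.parseInt(status.split(" ")[1]);
                if (code < 100 || code > 399)       // simplistic "up" policy
                    return false;
                if (expectText == null)
                    return true;
                String line;
                while ((line = in.readLine()) != null)
                    if (line.contains(expectText))
                        return true;
                return false;
            } catch (IOException | NumberFormatException e) {
                return false;
            }
        }
    }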
DnsMonitor.java
lookup: line 141
build packet: line 158
line 186: request.verifyResponse(incoming.getData(), incoming.getLength())
no actual verification that DNS value is correct, but we don't NEED that!
We do presumably verify that dns response is for the requested machine.
From the source for DNSAddressRequest.java:
* This method only goes so far as to decode the flags in the response
* byte array to verify that a DNS server sent the response.
Verifies request ID (sequence # of request)
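A sketch of the same idea (not the OpenNMS code): build a minimal A query by hand, send it over UDP, and verify only the response ID and QR bit, just as described above; the returned address is not checked:

    import java.io.*;
    import java.net.*;

    public class MiniDnsPoller {
        public static boolean poll(String server, String name) throws IOException {
            ByteArrayOutputStream buf = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(buf);
            int id = (int) (Math.random() * 0xffff);
            out.writeShort(id);       // request ID (checked in the response)
            out.writeShort(0x0100);   // flags: standard query, recursion desired
            out.writeShort(1);        // QDCOUNT = 1
            out.writeShort(0); out.writeShort(0); out.writeShort(0);
            for (String label : name.split("\\.")) {  // QNAME: length-prefixed labels
                out.writeByte(label.length());
                out.writeBytes(label);
            }
            out.writeByte(0);
            out.writeShort(1);        // QTYPE = A
            out.writeShort(1);        // QCLASS = IN
            byte[] req = buf.toByteArray();
            try (DatagramSocket sock = new DatagramSocket()) {
                sock.setSoTimeout(5000);
                sock.send(new DatagramPacket(req, req.length,
                          InetAddress.getByName(server), 53));
                byte[] reply = new byte[512];
                sock.receive(new DatagramPacket(reply, reply.length));
                int rid = ((reply[0] & 0xff) << 8) | (reply[1] & 0xff);
                boolean qr = (reply[2] & 0x80) != 0;  // QR bit: this is a response
                return rid == id && qr;
            }
        }
    }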
SmtpMonitor.java
When we connect, other end should send:
220 ulam2.cs.luc.edu ESMTP Postfix
214: read banner
218-240 multiline banner handler
247: check for the 220
251: respond HELO myname
response should be
250 ulam2.cs.luc.edu
Note that EHLO myname would produce something like:
250-ulam2.cs.luc.edu
250-PIPELINING
250-SIZE 10240000
250-VRFY
250-ETRN
250-STARTTLS
250-AUTH PLAIN
250 8BITMIME
289: check for 250
290 send QUIT
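A sketch of this dialog (not the OpenNMS source; the poller hostname is invented):

    import java.io.*;
    import java.net.*;

    public class MiniSmtpPoller {
        // Read the 220 banner (skipping 220- continuation lines), send
        // HELO, expect 250, then QUIT.
        public static boolean poll(String host) {
            try (Socket s = new Socket(host, 25)) {
                s.setSoTimeout(3000);
                BufferedReader in = new BufferedReader(
                        new InputStreamReader(s.getInputStream()));
                PrintWriter out = new PrintWriter(s.getOutputStream());
                String line = in.readLine();
                if (line == null || !line.startsWith("220"))
                    return false;
                while (line.startsWith("220-")) {   // multiline banner
                    line = in.readLine();
                    if (line == null) return false;
                }
                out.print("HELO poller.example.com\r\n"); out.flush();
                line = in.readLine();
                boolean ok = line != null && line.startsWith("250");
                out.print("QUIT\r\n"); out.flush();
                return ok;
            } catch (IOException e) {
                return false;
            }
        }
    }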
SshMonitor.java
144: String strBannerMatch = (String) parameters.get("banner");
185: read a line
195: check for match with banner line
199-200 send our response line
205: see if we get any further response, but don't parse it
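The same steps as a sketch (not the OpenNMS source):

    import java.io.*;
    import java.net.*;

    public class MiniSshPoller {
        // Read the server's version banner, optionally match it against
        // the configured "banner" parameter, send our own identification,
        // and ignore whatever comes back.
        public static boolean poll(String host, String bannerMatch) {
            try (Socket s = new Socket(host, 22)) {
                s.setSoTimeout(3000);
                BufferedReader in = new BufferedReader(
                        new InputStreamReader(s.getInputStream()));
                String banner = in.readLine();   // e.g. "SSH-2.0-OpenSSH_8.9"
                if (banner == null)
                    return false;
                if (bannerMatch != null && !banner.contains(bannerMatch))
                    return false;
                s.getOutputStream().write("SSH-2.0-poller\r\n".getBytes());
                return true;
            } catch (IOException e) {
                return false;
            }
        }
    }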
SmbMonitor.java:
doesn't do anything!!
We are a LONG way from testing this by verifying that a file copied to and then back from an SMB fileshare is unchanged.
This version doesn't have a rich set of expect/send methods. Everything is done "by hand".