Project 5: Using Python to read pcap files

Due: Friday April 23

For this project we will return to analyzing the pcap file of project 3, project3.pcap, except this time we'll do it with Python. We'll also use my packet.py library for reading packet headers.

You will also have to install the python-libpcap library. See pypi.org/project/python-libpcap. To install it on your lubuntu virtual machine, execute the following as root:

However, you can also install it natively on your machine (see the website for instructions) and do this assignment without your virtual machine.

This library supports real-time capture from interfaces, but we're just going to read from files.

Your task is to answer the following questions:

  1. For all those separate TCP connections, verify that they are all initiated by the client.
  2. Who initiates the close() for all those connections? The client or the server?
  3. For each connection, how many bytes of data are uploaded? How many bytes are downloaded? Don't count headers.
  4. We saw earlier that there were 8721 downloaded packets in the size range 62-79. How many of these had any data at all? (A few do, and this turns out to be related to HTTP-layer "chunk" reassembly; see for example tcp.port == 39456, and HTTP packets of size 76.)
  5. How many separate DNS queries were made? (Count one query as one UDP local port used to send one UDP packet to some host at port 53). You'll need a new dictionary, perhaps UDPPORTDICT, to hold the local ports used by each UDP packet. (You can get the size of the dictionary at the end with len().)
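
For question 5, the dictionary idea can be sketched as follows. This is only a minimal illustration, not part of connection1.py: the function name count_dns_queries and the sample (srcport, dstport) pairs are made up, and in your program the ports would come from the UDP header of each packet sent by the local host.

```python
DNS_PORT = 53

def count_dns_queries(udp_packets):
    """udp_packets: iterable of (srcport, dstport) pairs for UDP packets
    sent by the local host (a hypothetical input format)."""
    udpportdict = {}                      # local port -> number of packets sent from it
    for srcport, dstport in udp_packets:
        if dstport != DNS_PORT:
            continue                      # not a DNS query
        udpportdict[srcport] = udpportdict.get(srcport, 0) + 1
    return len(udpportdict)               # one query per distinct local port

# Made-up sample: two distinct local ports talk to port 53
sample = [(40001, 53), (40002, 53), (40001, 53), (50000, 443)]
print(count_dns_queries(sample))          # prints 2
```

A set would work just as well if you only need the count; a dictionary is handy if you also want to keep something per port.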

As an example of how to get started, here is connection1.py. It goes through the pcap file. For each TCP packet, it extracts the socketpair (localaddr, localport, remoteaddr, remoteport), and uses this, together with the packet's direction, as a key to the dictionary CONNECTIONDICT. The value stored under each key is a list of all packets of that connection in that direction. At the end, the program prints the number of upstream packets for each TCP connection.
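
To make the dictionary layout concrete, here is a small standalone sketch of the same grouping idea. It does not read a pcap file: the packet tuples, the address strings, and the names group_packets and upstream_counts are all made up for illustration (connection1.py uses packed address bytes rather than strings).

```python
LOCALADDR = "10.0.0.5"    # hypothetical local address

def group_packets(packets):
    """Group packets by (socketpair, direction).
    packets: iterable of (localport, remoteaddr, remoteport, upstream, pktbuf)."""
    conndict = {}
    for localport, remoteaddr, remoteport, upstream, pktbuf in packets:
        key = (LOCALADDR, localport, remoteaddr, remoteport, upstream)
        conndict.setdefault(key, []).append(pktbuf)
    return conndict

def upstream_counts(conndict):
    """Map each socketpair to its number of upstream packets."""
    return {key[:4]: len(pkts) for key, pkts in conndict.items() if key[4]}

# Made-up sample: one connection with traffic both ways, one with a single upstream packet
packets = [
    (39456, "93.184.216.34", 80, True, b"syn"),
    (39456, "93.184.216.34", 80, False, b"synack"),
    (39456, "93.184.216.34", 80, True, b"ack"),
    (39458, "93.184.216.34", 443, True, b"syn"),
]
conndict = group_packets(packets)
counts = upstream_counts(conndict)
for socketpair, n in counts.items():
    print(socketpair, n)
```

Note that each connection appears under up to two keys, one per direction, so len(conndict) is not the number of connections.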

Here is the basic scan through the pcap file:

def process_packets(fname):
    # rpcap() yields one (length, time, pktbuf) tuple per packet in the file
    for length, time, pktbuf in rpcap(fname):
        process_one_pkt(length, time, pktbuf, ETHHDRLEN)

Here is the process_one_pkt() function. Note how headers are extracted, and how packets are entered into CONNECTIONDICT.

def process_one_pkt(length, time, pktbuf : bytes, startpos):
    global CONNECTIONDICT
    ethh = ethheader.read(pktbuf, 0)
    if ethh.ethtype != 0x0800: return     # ignore non-ipv4 packets
    iph = ip4header.read(pktbuf, ETHHDRLEN)
    if not iph: return                    # returns None if it doesn't look like an IPv4 packet
    if iph.proto == UDP_PROTO:
        udph = udpheader.read(pktbuf, ETHHDRLEN + iph.iphdrlen)
        # if udph.dstport == 53: print('DNS packet')
        return
    if iph.proto != TCP_PROTO: return            # ignore
    tcph = tcpheader.read(pktbuf, ETHHDRLEN + iph.iphdrlen)    # here we *do* allow for the possibility of header options
    if not tcph: return                    # Again, tcpheader.read() returns None if it doesn't look like a TCP packet
    datalen = iph.length - iph.iphdrlen - tcph.tcphdrlen    # can't use len(pktbuf) because of tcpdump-applied trailers
    # print(socket.inet_ntoa(iph.srcaddrb), tcph.dstport, datalen)
    if iph.srcaddrb == LOCALADDRB:            # source address is local endpoint
        localport   = tcph.srcport
        remoteport  = tcph.dstport
        remoteaddrb = iph.dstaddrb
        upstream    = True
    else:
        localport   = tcph.dstport
        remoteaddrb = iph.srcaddrb
        remoteport  = tcph.srcport
        upstream    = False
    key = (LOCALADDRB, localport, remoteaddrb, remoteport, upstream)
    if key in CONNECTIONDICT:
        CONNECTIONDICT[key].append(pktbuf)
    else:
        CONNECTIONDICT[key] = [pktbuf]       
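
If you are curious what a header reader does under the hood, here is a rough sketch of parsing a TCP header with the standard struct module. This is not the code in packet.py; the names TcpInfo and read_tcp are made up, and only a few fields are extracted. The SYN and FIN flag bits are the ones that mark the opening and closing of a connection, which is relevant to questions 1 and 2.

```python
import struct
from collections import namedtuple

# Hypothetical result type; packet.py's tcpheader has its own fields
TcpInfo = namedtuple("TcpInfo", "srcport dstport tcphdrlen syn ack fin")

def read_tcp(buf, pos):
    """Parse the TCP header starting at buf[pos]; return None if it looks invalid."""
    if len(buf) < pos + 20:               # minimum TCP header is 20 bytes
        return None
    srcport, dstport, _seq, _acknum, offset_byte, flags = struct.unpack_from(
        "!HHIIBB", buf, pos)
    tcphdrlen = (offset_byte >> 4) * 4    # data offset is in 32-bit words
    if tcphdrlen < 20:
        return None
    return TcpInfo(srcport, dstport, tcphdrlen,
                   syn=bool(flags & 0x02),
                   ack=bool(flags & 0x10),
                   fin=bool(flags & 0x01))

# Made-up SYN segment with one 4-byte option (MSS), so the header is 24 bytes
hdr = struct.pack("!HHIIBBHHH", 39456, 80, 1000, 0, 0x60, 0x02, 65535, 0, 0)
hdr += b"\x02\x04\x05\xb4"
info = read_tcp(hdr, 0)
print(info.srcport, info.dstport, info.tcphdrlen, info.syn, info.ack, info.fin)
# prints: 39456 80 24 True False False
```

A segment with SYN set and ACK clear is the first packet of the three-way handshake, so whichever side sent it initiated the connection.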

Note that, because some "trailers" were added by the original packet-capturing software, tcpdump, the length of the packets should be taken from iph.length, rather than from len(pktbuf) or the "length" variable. (The trailer is identified in the .pcap file as "VSS Monitoring Ethernet trailer"; there are also sometimes padding bytes.)
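
Here is a small standalone demonstration of why the trailer matters. It builds a made-up Ethernet/IPv4/TCP frame by hand, with 5 bytes of data and an 8-byte trailer, and compares the two ways of computing the data length. The function tcp_payload_len and the addresses are invented for this sketch; it reads the raw header bytes directly instead of using packet.py.

```python
import struct

ETHHDRLEN = 14

def tcp_payload_len(pktbuf):
    """TCP data length from the IPv4 total-length field (ignores any trailer)."""
    ip_pos = ETHHDRLEN
    iphdrlen = (pktbuf[ip_pos] & 0x0F) * 4                  # IHL, in 32-bit words
    (ip_total_len,) = struct.unpack_from("!H", pktbuf, ip_pos + 2)
    tcphdrlen = (pktbuf[ip_pos + iphdrlen + 12] >> 4) * 4   # TCP data offset
    return ip_total_len - iphdrlen - tcphdrlen

payload = b"hello"
eth = b"\x00" * 12 + b"\x08\x00"                            # dst, src, ethertype IPv4
ip = struct.pack("!BBHHHBBH4s4s", 0x45, 0, 20 + 20 + len(payload), 0, 0,
                 64, 6, 0, bytes([10, 0, 0, 5]), bytes([93, 184, 216, 34]))
tcp = struct.pack("!HHIIBBHHH", 39456, 80, 1, 1, 0x50, 0x18, 65535, 0, 0)
trailer = b"\x00" * 8                                       # stands in for the capture trailer
frame = eth + ip + tcp + payload + trailer

print(tcp_payload_len(frame))                               # prints 5
print(len(frame) - ETHHDRLEN - 20 - 20)                     # prints 13 (trailer counted as data)
```

The second number is what you would get by trusting len(pktbuf), which would make nearly every packet look like it carries data.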

(This program reports 1793 connections. Wireshark reports 1799 (Statistics => Conversations => TCP tab). Why the discrepancy?)

Turn in your Python program or programs, and a brief summary of your answers to 1-5 above. Tell me which program was used to answer which question!