Comp 141 Class 10 notes

August 1

Today is our last regular class. Next week will be a review session, and the online final will be open Thursday after class through Saturday morning.

Shotts:

• Chapter 14: Packages
• Chapter 15: Storage
• Chapter 16: Networking
• Chapter 17: Searching for Files
• Chapter 19: Regular Expressions
• Chapter 25: HTML
• Chapter 35: bash arrays
Homework 7

Regular Expression Efficiency

Let's call the finite-state recognizers finite automata, which is the usual term. So far the finite-state recognizers have all been deterministic: we never have a state with two outgoing edges, going two different directions, that are labeled with the same input. A deterministic finite automaton is abbreviated DFA.

How about the regular expression b (ab)* a? There's a difference here. Now we do have a state with two different edges labeled 'a'. Such an automaton is known as nondeterministic, that is, as an NFA. We can still use an NFA to match inputs, but now what do we do if we're at a vertex and there are multiple edges that match the current input?

There are two primary approaches. The first is to try one of the edges first, and see if that works. If it does not, we backtrack to the vertex in question and at that point try the next edge. This approach does work, but with a poorly chosen regular expression it may be extremely slow. Consider the regular expression (a?)^n a^n (writing ^n for n repetitions). This means up to n optional a's, followed by n a's. Let us match against a^n, meaning none of the optional a's can be used. The usual strategy when matching "a?" is to try the "a" branch first, and only if that fails do we try the empty branch. But that now means that we will have 2^n - 1 false branches before we finally succeed.

Example: (a?)^3 a^3, or (a?)(a?)(a?)aaa. Suppose the input is aaa. Here are the steps:

• Try to match the first (a?) with the first input a.
• Try to match the second (a?) with the second input a.
• Try to match the third (a?) with the third input a.
• Fail, because there is no more input, but the pattern still has aaa.
• Backtrack to the third (a?), and match it to the empty string. Match the remaining input a with the first a of aaa. Fail.
• Backtrack to the second (a?). Match it to the empty string, and try to match the third (a?) to the second input a.
• Get to the pattern aaa, with only the third input a remaining. Fail.
• Try matching the third (a?) to the empty string instead. The pattern aaa now has only two input a's remaining. Fail.
• Backtrack to the first (a?). Try to match it to the empty string.
• Match the second (a?) to the first input a, and the third (a?) to the second input a. Fail.
• Try matching the third (a?) to the empty string. Fail again.
• Match the second (a?) to the empty string. Try to match the third (a?) to the first input a. Fail.
• Match the third (a?) to the empty string. Now the rest of the pattern is aaa and the rest of the input string is aaa. Success!
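The count of dead ends can be checked with a small sketch. This is not a real regular-expression engine: it only mimics the "try a first, then the empty string" choice order described above, for a pattern of n optional a's followed by n mandatory a's, matched against n a's. The function names try and count_fails are invented for this illustration.

```bash
#!/bin/bash
# Sketch: count the failed attempts of a backtracking match of
# n optional a's followed by n mandatory a's, against an input of n a's.
# The names try and count_fails are illustrative only.

fails=0

# try K POS N: the first K optional a's have been decided, and together
# they have consumed POS input characters.
try() {
    local k=$1 pos=$2 n=$3
    if (( k == n )); then
        # The n mandatory a's need exactly n characters of remaining input.
        if (( n - pos == n )); then
            return 0
        fi
        fails=$((fails + 1))
        return 1
    fi
    # First choice: this (a?) consumes one input a.
    if try $((k + 1)) $((pos + 1)) "$n"; then
        return 0
    fi
    # Backtrack: this (a?) matches the empty string instead.
    try $((k + 1)) "$pos" "$n"
}

# count_fails N: print the number of dead ends hit before success.
count_fails() {
    fails=0
    try 0 0 "$1"
    echo "$fails"
}

for n in 3 4 10; do
    echo "n=$n: $(count_fails "$n") failed attempts"
done
```

For n=3 this reports 7 failed attempts, and the count roughly doubles each time n grows by one.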

A much faster approach is to use the NFA with state sets, rather than individual states. That is, when we are in state S1 and the next input can lead either to state S2 or state S3, we record the new state as {S2,S3}. If, for the next input, S2 can go to S4 and S3 can go to either S5 or S6, the next state set is {S4,S5,S6}. This approach might look exponential, but the number of states is fixed.

Example: (a?)^3 a^3.

   /---empty---\   /---empty---\   /---empty---\
(0) ---- a ---- (1) ---- a ---- (2) ---- a ---- (3) --- a --- (4) --- a --- (5) --- a --- (end)

The steps:

• On the first input a, the state set is {1,2,3,4} (why?)
• On the second input a, the state set is {2,3,4,5}
• On the third input a, the state set is {3,4,5,end}. As this contains the end state, we are done.
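The state-set steps can be reproduced with a short bash sketch. It hard-codes the shape of this particular NFA (n optional a's followed by n mandatory a's) instead of reading a general graph, and it numbers the states 0 through 2n, so for n=3 state 6 plays the role of (end). The function name match_state_sets is made up for the illustration.

```bash
#!/bin/bash
# Sketch: state-set simulation of the NFA for n optional a's followed
# by n mandatory a's. States are numbered 0..2n; state 2n is (end).

# match_state_sets N INPUT: print the state set after each input
# character; the exit status is 0 if the final set contains state 2n.
match_state_sets() {
    local n=$1 input=$2
    local end=$((2 * n)) s i line
    local -a cur=() next=()

    # Start in state 0, then follow the empty edges s -> s+1 (s < n).
    for (( s = 0; s <= end; s++ )); do cur[s]=0; done
    cur[0]=1
    for (( s = 0; s < n; s++ )); do
        if (( cur[s] )); then cur[s+1]=1; fi
    done

    for (( i = 0; i < ${#input}; i++ )); do
        for (( s = 0; s <= end; s++ )); do next[s]=0; done
        if [[ ${input:i:1} == a ]]; then
            # Every state s < 2n has an edge labeled a to s+1.
            for (( s = 0; s < end; s++ )); do
                if (( cur[s] )); then next[s+1]=1; fi
            done
        fi
        # Add whatever is reachable through empty edges.
        for (( s = 0; s < n; s++ )); do
            if (( next[s] )); then next[s+1]=1; fi
        done
        cur=("${next[@]}")

        # Print the members of the current state set on one line.
        line=""
        for (( s = 0; s <= end; s++ )); do
            if (( cur[s] )); then line+="$s "; fi
        done
        echo "${line% }"
    done

    (( cur[end] ))
}

if match_state_sets 3 aaa; then
    echo "match"
fi
```

The work per input character is bounded by the number of states, so the total cost grows with the input length times the pattern size instead of doubling with each optional a.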

See also https://swtch.com/~rsc/regexp/regexp1.html, "Regular expression search algorithms", the paragraph beginning "A more efficient ...."

By the way, a much better regular expression for between n and 2n a's in a row is a^n (a?)^n. We parse n a's at the beginning, and the optional a's all follow.

The implementation of an NFA/DFA recognizer does literally use the graph approach: for each current state, and each next-input symbol, we look up what next states are possible with that input symbol. The code to drive the NFA/DFA does not need to be changed for different NFA/DFAs. This is a big win from a software-engineering perspective.

One more example of NFA state-set recognizer: aaa|aab|aac|aad

         +----a->(3)--a->(7)
         |
         | /--a->(4)--b->(8)
(1)--a->(2)
         | \--a->(5)--c->(9)
         |
         +----a->(6)--d->(10)

We can "factor out" the initial "aa" to get aa(a|b|c|d).

How about aaa|aab|aac|add? We can now only factor one 'a' out: a(aa|ab|ac|dd). But from (aa|ab|ac) we can factor another a: a((a(a|b|c))|dd). There are limits to this technique, but sometimes it is useful.
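A quick way to gain confidence in a factoring is to run both forms against a handful of test strings with grep -E. The sketch below does this for aaa|aab|aac|add and the fully factored form a(a(a|b|c)|dd). The helper name agree and the list of test words are made up for the illustration, and agreement on a few samples is a sanity check, not a proof.

```bash
#!/bin/bash
# Sketch: check that a factored regular expression accepts the same
# strings as the original list of alternatives.

original='aaa|aab|aac|add'
factored='a(a(a|b|c)|dd)'

# agree WORD: succeed if both patterns give the same verdict on WORD.
# grep -x requires the whole line to match; -q suppresses output.
agree() {
    local r1=no r2=no
    if grep -Eqx -- "$original" <<<"$1"; then r1=yes; fi
    if grep -Eqx -- "$factored" <<<"$1"; then r2=yes; fi
    echo "$1: original=$r1 factored=$r2"
    [[ $r1 == "$r2" ]]
}

for w in aaa aab aac add aad abd dd a; do
    agree "$w" || echo "MISMATCH on $w"
done
```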

Root access

So far, everything you've done on your VM has been done as user "comp141". We can become the superuser, or root, if we know the root password. But we don't. However, we can also use the sudo command:

sudo bash

Warning: the usual advice these days is to use sudo to run individual commands, not to create a root shell. With the latter, one mistake and it's all over.

Being root lets you look at the log files in /var/log, for example. It also lets you add new users, reconfigure the network and install packages.

If you have a root password, the su command might be a better bet.

it is common to need to

Package Management

[Shotts chapter 14] Yes, you can compile packages. But it's a chore, and when there are compilation problems it's a real chore.

There are many different distributions of Linux: Ubuntu, Debian, Red Hat, Mint, CentOS, Fedora, Gentoo .... One of the biggest differences between them is the style of package management, and to some extent the back-end maintenance of packages.

The high-level package tool on Debian-like distributions, which includes Ubuntu, is apt (or apt-get, an older version).

It's always a good idea to start with apt-get update, which refreshes the local list of packages available from the configured repositories. Some installations:

• apt install emacs
• apt install build-essential
• apt install apache2

dpkg -l lists all installed packages. Note, however, that some packages will likely have been auto-installed in the process of installing some other package.
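As a sketch, the commands above might be collected into a few shell functions. This assumes a Debian-like system with sudo rights. The function names are invented, and nothing is installed until a function is actually called. apt-mark showauto is one way to see which packages were pulled in automatically.

```bash
#!/bin/bash
# Sketch for a Debian-like system; the function names are made up.
# Loading this file runs nothing: call a function to use it.

# Refresh the package index, then install the three example packages.
install_examples() {
    sudo apt update &&
    sudo apt install -y emacs build-essential apache2
}

# Count installed packages: dpkg -l marks them with "ii" in column one.
count_installed() {
    dpkg -l | grep -c '^ii'
}

# List packages that were installed automatically as dependencies.
list_auto_installed() {
    apt-mark showauto
}
```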

The find command

[Shotts Chapter 17] The unix find command is pretty handy, though it doesn't let you search "inside" complicated filetypes like .docx or .pdf.

The syntax is find directory search-options

There may be no search options. For example, find ~ lists every file within your home directory. find ~ |wc -l counts them. find ~ -type d |wc -l counts your directories and subdirectories.

What if you want to find a file in or below the current directory by name, say "foo.text"? find . -iname '*foo*' works, where -iname is case-insensitive search (there is also -name for case-sensitive search). Note that you need the quote marks, if there is a match for *foo* in the current directory. We don't want shell filename expansion to get in the way here; we want the asterisks to be "expanded" by find.
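To experiment safely, the same commands can be run against a small throwaway tree. The sketch below builds one under a temporary directory (the file and directory names are invented) and captures the counts and the -iname matches.

```bash
#!/bin/bash
# Sketch: try the find commands on a small throwaway tree.
tmp=$(mktemp -d)
demo="$tmp/tree"
mkdir -p "$demo/docs/old"
touch "$demo/foo.text" "$demo/docs/Foo.TXT" "$demo/docs/old/bar.text"

# Every entry at or below $demo, including $demo itself.
all_count=$(find "$demo" | wc -l)
# Directories only: $demo, docs, and docs/old.
dir_count=$(find "$demo" -type d | wc -l)
# Case-insensitive name match; the quotes keep bash from expanding *foo*.
matches=$(find "$demo" -iname '*foo*' | sort)

echo "entries: $all_count  directories: $dir_count"
echo "$matches"

rm -r "$tmp"
```

Note that find counts the starting directory itself, so the totals are one higher than the number of things inside it.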

Other useful options are -empty, -mtime, -newer, -perm mode, -type [d|f|...], and -maxdepth levels.
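Here is a brief sketch of three of these options on a throwaway tree. The file names are invented, and -mtime -1 is read here as "modified within the last day".

```bash
#!/bin/bash
# Sketch: -empty, -maxdepth and -mtime on a small temporary tree.
tmp=$(mktemp -d)
demo="$tmp/tree"
mkdir -p "$demo/sub/deep"
: > "$demo/empty.log"                     # zero-length file
echo data > "$demo/sub/notes.txt"
echo data > "$demo/sub/deep/more.txt"

# Regular files of size zero.
empty_files=$(find "$demo" -type f -empty)
# Regular files directly inside $demo (do not descend further).
top_level=$(find "$demo" -maxdepth 1 -type f | wc -l)
# Regular files modified less than one day ago.
recent=$(find "$demo" -type f -mtime -1 | wc -l)

echo "empty: $empty_files"
echo "top-level files: $top_level  recent files: $recent"

rm -r "$tmp"
```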

On page 229, Shotts describes an interesting technique for dealing with files with spaces in them.
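A common technique for this problem is to have find separate the names with NUL characters (-print0) and have xargs read them the same way (-0). Treat the claim that this is the technique on that page as an assumption. The sketch below contrasts it with a naive pipeline on two invented files.

```bash
#!/bin/bash
# Sketch: filenames with spaces, with and without NUL separators.
tmp=$(mktemp -d)
touch "$tmp/plain.txt" "$tmp/two words.txt"

# Naive pipeline: xargs splits at whitespace, so "two words.txt"
# turns into two bogus names.
naive=$(find "$tmp" -type f | xargs -n 1 echo | wc -l)

# NUL-separated pipeline: each filename arrives intact.
safe=$(find "$tmp" -type f -print0 | xargs -0 -n 1 echo | wc -l)

echo "naive sees $naive names, NUL-separated sees $safe names"

rm -r "$tmp"
```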

Then there is the -exec option. This runs a command on every file found. For example,

find . -type f -exec file {} ';'

The {} represents the name of the file in question, and the ';' marks the end of the file command and the return to find. It almost always has to be quoted so bash doesn't interpret it as an end-of-command indication. Next, let's search for the string 'key' in any files in files10/project:

find . -exec grep -i key {} \;

find . -type f -and -exec grep -i -H key {} \;

The -H makes grep always print the file name, and the -type f check prevents checking directories.

There are optimizations to do the inner command just once on the whole lot of files found.
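One such optimization is to end the -exec clause with + instead of ';'. find then passes as many filenames as fit to a single invocation of the command. The sketch below, on three invented files, shows that both forms find the same lines.

```bash
#!/bin/bash
# Sketch: -exec ... {} \; versus -exec ... {} +
tmp=$(mktemp -d)
printf 'the key is here\n' > "$tmp/a.txt"
printf 'nothing to see\n'  > "$tmp/b.txt"
printf 'another KEY\n'     > "$tmp/c.txt"

# One grep process per file found.
per_file=$(find "$tmp" -type f -exec grep -i -H key {} \; | sort)

# One grep process for the whole batch of files.
batched=$(find "$tmp" -type f -exec grep -i -H key {} + | sort)

echo "$batched"

rm -r "$tmp"
```

With thousands of files the batched form starts far fewer processes, which is usually where the time goes.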

Storage

[Shotts chapter 15] If I plug in my usb drive, where does it appear in the filesystem? On Windows, it would get a new drive letter, like G:. On Ubuntu, it's usually in /media/username.

This is achieved through the mount command (automatically in this case). A disk device has a filesystem on it, which is a tree of directories and files. There is one "root" directory. The mount command attaches such a device to a specific directory on the existing filesystem.

The file /etc/fstab lists what gets mounted where, by default.

A typical physical disk usually has multiple partitions. Each partition has a filesystem. You can view these with fdisk, but you can also destroy your disk so be careful. Fdisk takes a parameter representing the device for the entire disk, eg /dev/nvme0n1.

Mount shows a lot of "virtual" mounts. But we can focus on the disk mounts with mount | grep nvme0n1.

As for filesystems, they can be a variety of types. NTFS is the most common Windows filesystem. Linux filesystems include ext2, ext3, ext4, btrfs, xfs, and more. Part of the disk is set aside as an array of inodes; each inode holds one file's metadata, such as its permissions, owner, size, and the location of its data. Directories are tables of pairs <filename, inode>. Linux shows you the inode numbers with ls -i.
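The <filename, inode> idea is easy to see with a hard link, which simply adds a second name for an existing inode. This is a small sketch with invented file names in a temporary directory.

```bash
#!/bin/bash
# Sketch: two directory entries that refer to the same inode.
tmp=$(mktemp -d)
echo "hello" > "$tmp/original.txt"

# A hard link adds a second <filename, inode> pair to the directory.
ln "$tmp/original.txt" "$tmp/alias.txt"

# ls -i prints the inode number first, then the name.
inode_a=$(ls -i "$tmp/original.txt" | awk '{print $1}')
inode_b=$(ls -i "$tmp/alias.txt" | awk '{print $1}')

echo "original.txt -> inode $inode_a"
echo "alias.txt    -> inode $inode_b"

rm -r "$tmp"
```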

Networking

[Shotts chapter 16] Systems have network devices. Most acquire their IP address on startup through the Dynamic Host Configuration Protocol, or DHCP.

Once we're connected, we have these:

• ping
• traceroute
• netstat
• wget / curl

ssh

Shotts p 210. This is the secure shell, a remote-login (and remote-command-execution) utility. Per-user files (like keys) are kept in ~/.ssh.

This is based on public-key authentication. If you, on host A, want to log in to host B, you connect and ask for some authentication data from B. Unless this is your first connection, you already have B's public key. B can send a message signed with B's private key, and A can validate it.

Initial public/private key pairs are created by ssh-keygen.

Previously received public keys are kept in the file known_hosts.

Typically, new hosts are assumed to be trusted. This is called Trust On First Use, or TOFU.

Once B has validated its identity to A (to prevent A from falling prey to sending its passwords to the wrong host), the user on A might log in the usual way, by providing a username and password. The problem here is that random password guessing is still possible. (This is why the Loyola CS dept no longer allows password-based ssh logins, except through the VPN.)

A better approach is to authenticate by pre-arrangement. A will place A's public key in B's file authorized_keys. This means that A is allowed to log in to B without providing a password. A must prove it has the matching private key by encrypting some message selected by B using its private key. A sends it back to B; if B can decrypt the message with A's public key, all is good.
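The pre-arrangement step can be rehearsed locally without touching any real account. In the sketch below, two temporary directories stand in for the ~/.ssh directories on A and B (an assumption made purely for illustration), and the ed25519 key type and the 'demo key' comment are arbitrary choices. The whole thing is skipped if ssh-keygen is not installed.

```bash
#!/bin/bash
# Sketch: both directories are temporary stand-ins, not real ~/.ssh dirs.
host_a=$(mktemp -d)     # plays the role of A's ~/.ssh
host_b=$(mktemp -d)     # plays the role of B's ~/.ssh

if command -v ssh-keygen >/dev/null; then
    # Create a key pair for A: -N '' means no passphrase, -q is quiet.
    ssh-keygen -q -t ed25519 -N '' -C 'demo key' -f "$host_a/id_ed25519"

    # "Pre-arrangement": append A's public key to B's authorized_keys.
    cat "$host_a/id_ed25519.pub" >> "$host_b/authorized_keys"
    chmod 600 "$host_b/authorized_keys"

    echo "A's public key is now listed in B's authorized_keys"
fi
```

Between real hosts, the append step is often done with the ssh-copy-id utility instead of by hand.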

Getting ones public key (typically in

RSA was the original algorithm. Elliptic-curve algorithms are now much more popular, and their keys are shorter.

If any permissions on the .ssh directory or any of its files are not what they should be (for example, if the private key is readable by anyone other than the owning user), then the connection fails.
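As a sketch, here are the permission settings that are conventionally considered safe, applied to a temporary directory standing in for ~/.ssh. The specific modes (700 for the directory, 600 for the private key and authorized_keys, 644 for the public key) are the usual convention, stated here as an assumption and not quoted from the ssh documentation.

```bash
#!/bin/bash
# Sketch: conventional permissions, on a stand-in for ~/.ssh.
sshdir=$(mktemp -d)
touch "$sshdir/id_ed25519" "$sshdir/id_ed25519.pub" "$sshdir/authorized_keys"

chmod 700 "$sshdir"                    # only the owner may enter the directory
chmod 600 "$sshdir/id_ed25519"         # private key: owner read/write only
chmod 600 "$sshdir/authorized_keys"    # owner read/write only
chmod 644 "$sshdir/id_ed25519.pub"     # public key may be world-readable

ls -ld "$sshdir"
ls -l "$sshdir"
```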

bash arrays

[Shotts chapter 35] Arrays in bash are arrays of strings. This is quite practical. In a bash script, $* is the unquoted string of all arguments, and "$*" is the quoted string of all arguments. Neither is quite right. Here is an updated version of echoeach.sh (in files7):

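#!/bin/bash
# echoeach.sh: compare three ways of looping over the script's arguments

# Unquoted $*: every argument is split again at whitespace
echo '$*'
for i in $*
do
    echo "$i"
done

# Quoted "$*": all arguments are joined into one single string
echo '"$*"'
for i in "$*"
do
    echo "$i"
done

# Quoted "$@": each argument stays exactly one item
echo '"$@"'
for i in "$@"
do
    echo "$i"
done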

If we invoke ./echoeach.sh foo 'bar baz' quux,

• the first prints four things: $* is foo bar baz quux
• the second prints one thing: "$*" is "foo bar baz quux"
• the third gets it right, printing three things: foo, bar baz, and quux. $@ is an array, and the spaces in any array element have no effect on how it is parsed.
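The same three behaviors show up with ordinary bash arrays. In this sketch, count_args is a made-up helper that just reports how many arguments it received, and files is an example array.

```bash
#!/bin/bash
# count_args is a made-up helper that prints how many arguments it got.
count_args() { echo "$#"; }

files=(foo 'bar baz' quux)

unquoted=$(count_args ${files[*]})     # split at whitespace: 4 arguments
joined=$(count_args "${files[*]}")     # joined into one string: 1 argument
each=$(count_args "${files[@]}")       # one argument per element: 3

echo "unquoted=$unquoted joined=$joined each=$each"
echo "length: ${#files[@]}, second element: ${files[1]}"
```

The quoted "${files[@]}" form is the array counterpart of "$@": each element is passed on as exactly one word.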

I used to use $* all the time, but it fails miserably for arguments with spaces. Here is my current script word, which starts OpenOffice on a file:

PROG=/usr/bin/soffice
$PROG "$@" 2>/dev/null &

(the 2>/dev/null just makes the stderr messages go away. They are usually useless.)

Running a web server

[Shotts chapter 25] The apache webserver can be installed with "apt install apache2". We can then open a browser on the same machine, and go to "http://localhost". We get the default page.

How about adding another html page (eg foo.html)? It goes in /var/www/html.

#!/bin/bash
echo "<html>