Class 9 Notes

Comp 141 Class 9 notes

July 25

Shotts:

Chapter 19: Regular Expressions

The Unix Pipe Card Game. This is for kids? But there are some cute pipe examples. See the Tasks section lower down the page.

How would you print the most common line (Task 4)? If you sort and then use uniq, it's a start. But uniq -c assigns a frequency count to each line, which solves the problem (almost; you still have to re-sort by numeric frequency).
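The sort / uniq -c / re-sort idea can be sketched as below. The function name and the sample fruit list are made up for illustration; any input on stdin would do.

```shell
# Hypothetical helper: print the most frequent line of stdin, with its count.
most_common_line() {
    # sort groups identical lines together, uniq -c prefixes each distinct
    # line with its count, sort -rn puts the biggest count first.
    sort | uniq -c | sort -rn | head -n 1
}

printf 'pear\napple\npear\nfig\napple\npear\n' | most_common_line
```

The count printed by uniq -c is padded with leading spaces; pipe through awk or sed if you want only the line itself.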

Regular Expressions

Regular expressions are a way to describe a set of strings. We've already seen them. First, the file-name matching is a form of regular expression: * matches any number of characters, ? matches one character, and [m-p] matches the letters m, n, o, or p. But note that this file-name matching uses a simplified form of regular expressions.

In 1997, Jamie Zawinski (then a dev at Netscape) posted the following on Usenet:

Some people, when confronted with a problem, think "I know, I'll use regular expressions." Now they have two problems.

A big part of the trouble with regular expressions is that they are so hard to read. There is no place for extraneous whitespace; blanks are, of course, significant characters. One does get somewhat used to the syntax, but it's certainly confusing to beginners.

There is also no way to embed comments. Nor is there a way to define regular "subexpressions" that you can use within a larger expression, which would be helpful as a way to improve modularity.

The grep command is also all about matching regular expressions, although so far we've just used it to match strings. Grep has a bunch of useful options:

-i    ignore case
-v    list lines not matching the pattern (invert)
-n    include line numbers
-q    don't print anything, just return 0 if found or 1 if not found
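A quick sketch of those four options. The function sample_lines and its four lines are invented here just to have repeatable input.

```shell
# Made-up input: four short lines.
sample_lines() { printf '%s\n' 'Alpha' 'beta' 'ALPHA beta' 'gamma'; }

sample_lines | grep -i 'alpha'    # Alpha and ALPHA beta
sample_lines | grep -v 'beta'     # Alpha and gamma
sample_lines | grep -n 'gamma'    # 4:gamma

# -q prints nothing; only the exit status tells you what happened.
if sample_lines | grep -q 'delta'; then
    echo "found"
else
    echo "not found"
fi
```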

Here are the "metacharacters" that have special meaning in matches:

.     matches any one character
^    anchor for the start of the string or line; negation if the first character within [ ]
$    anchor for the end of the string or line
[ ] for creating character ranges, just like with filename matching
( ) for regular grouping
{ } match a regular subexpression a specific number of times
-    for character ranges
?    match a regular subexpression 0 or 1 times
*    match a regular subexpression 0 or more times
+   match a regular subexpression 1 or more times
|    between two alternative regular expressions, such as in January|Jan in the case example
\    for quoting one of these to use it as a literal character rather than a metacharacter

Basic regular expressions (BRE) use only ^ $ . [ ] *. Here we will be using extended regular expressions (ERE). These are not recognized by plain grep; you must use either egrep or grep -E.

The file-regular-expression symbol "?" corresponds to the BRE ".": both match one character. As we will see below, the file-regular-expression symbol "*" corresponds to the BRE ".*": the "." means any single character, and the BRE/ERE "*" means "the previous regular expression repeated 0 or more times". File-regular-expression and BRE for character ranges is pretty similar.
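Here is a minimal sketch of that correspondence, using bash's [[ ]] test, which does filename-style matching with == and regular-expression matching with =~. The file names and the two function names are made up for illustration. Note that a shell pattern always has to match the whole string, so the regular-expression version needs ^ and $ to be equivalent.

```shell
#!/usr/bin/env bash
# Hypothetical helpers: each returns success (0) if the name matches.
glob_match()  { [[ $1 == r*.tx? ]]; }        # shell pattern: * = any run, ? = one character
regex_match() { [[ $1 =~ ^r.*\.tx.$ ]]; }    # regex: .* = any run, . = one character

for name in report.txt readme.md; do
    if glob_match "$name"; then echo "$name: glob match"; else echo "$name: no glob match"; fi
    if regex_match "$name"; then echo "$name: regex match"; else echo "$name: no regex match"; fi
done
```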

There are also "Perl-compatible regular expressions"; the library is known as pcre. Most ERE syntax carries over, but this is a considerably extended regular-expression notation with many additional features.

As an example of alternation, in its simplest form, suppose we want to search the output of ps -ef for either "chrome" or "firefox". Then we can use

ps -ef | egrep 'chrome|firefox'

We need the quotes because '|' has special meaning to bash (it is the pipe symbol). Also, note that I used egrep instead of plain grep; egrep enables a much larger set of regular expressions. Plain grep doesn't work here: it treats the | as an ordinary character and looks for the literal string chrome|firefox. (grep -E is the same as egrep, and officially is preferred.)
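To try this without depending on what happens to be running, here is a sketch with a stand-in for ps -ef. The function fake_ps and its three lines are invented for the demo.

```shell
# Made-up process list so that the result is repeatable.
fake_ps() { printf '%s\n' 'alice 101 chrome' 'bob 102 vim' 'carol 103 firefox'; }

# ERE alternation: prints the chrome and firefox lines.
fake_ps | grep -E 'chrome|firefox'

# Plain grep takes | as an ordinary character, so nothing matches.
fake_ps | grep 'chrome|firefox' || echo "plain grep: no match"
```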

As an example of character ranges, if we wanted to search for "loyola" or "Loyola" with grep, the pattern "[Ll]oyola" would work.

The ^ character marks the start of the string, in most cases. But as the first character of a range, it means match the characters not in the list. So "[^aeiou]a[^a-s]" matches "cat", but not "eat" or "cab".
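Both range examples can be checked directly. I set LC_ALL=C in this sketch on the assumption that we want plain ASCII ordering for the a-s range; in other locales, ranges can behave differently.

```shell
# Only "cat" has a non-vowel, then a, then a letter outside a-s.
printf '%s\n' cat eat cab | LC_ALL=C grep '[^aeiou]a[^a-s]'

# Either capitalization of the first letter is accepted, but not all-caps.
printf '%s\n' loyola Loyola LOYOLA | grep '[Ll]oyola'
```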

There are also some built-in ranges:

[:alnum:]	The alphanumeric characters. In ASCII, equivalent to: [A-Za-z0-9]
[:word:]	The same as [:alnum:], with the addition of the underscore (_) character. (This one is not among the twelve classes defined by POSIX, so not every tool accepts it.)
[:alpha:]	The alphabetic characters. In ASCII, equivalent to: [A-Za-z]
[:blank:]	Includes the space and tab characters.
[:cntrl:]	The ASCII control codes. Includes the ASCII characters 0 through 31 and 127.
[:digit:]	The numerals 0 through 9.
[:graph:]	The visible characters. In ASCII, it includes characters 33 through 126.
[:lower:]	The lowercase letters.
[:punct:]	The punctuation characters. In ASCII, these are the 32 characters ! " # $ % & ' ( ) * + , - . / : ; < = > ? @ [ \ ] ^ _ ` { | } ~
[:print:]	The printable characters. All the characters in [:graph:] plus the space character.
[:space:]	The whitespace characters including space, tab, carriage return, newline, vertical tab, and form feed. In ASCII, equivalent to: [ \t\r\n\v\f]
[:upper:]	The uppercase characters.
[:xdigit:]	Characters used to express hexadecimal numbers. In ASCII, equivalent to: [0-9A-Fa-f]

Demo: grep '[[:alpha:]]' grepdemo1.text, or grep '[a-z]' grepdemo1.text. (The quotes keep bash from treating the brackets as a filename pattern.)

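Warning: IBM mainframes use EBCDIC encoding, not ASCII, and in EBCDIC the alphabetic letters are not contiguous. So [a-z] fails: the range also takes in the non-letter codes that sit between i and j and between r and s. So if you find yourself in an EBCDIC environment, stop. (You can also use the built-in ranges above, such as [[:lower:]], or [a-ij-rs-z].)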

A more serious character-set issue is ASCII vs UTF-8. grep actually does fairly well with that. Demo: grep ": ." grepdemo2.text.

As with file-matching regular expressions, the bracketed names above still need another set of square brackets around them to make them ranges: write [[:digit:]], not [:digit:].

Grouping is often useful. If we wanted to search for "received" or "receiving", we can use the basic alternation form "received|receiving", or the equivalent but (maybe) simpler form with grouping, "receiv(ed|ing)"
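A small check that the two forms agree. The word list is made up; "receiver" and "receive" are there as near-misses.

```shell
words() { printf '%s\n' received receiving receiver receive; }

# Both commands print the same two lines: received and receiving.
words | grep -E 'received|receiving'
words | grep -E 'receiv(ed|ing)'
```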

Quantifiers

In filename matching, '*' stands for "zero or more characters". In ERE, * stands for "repeat the previous sub-expression zero or more times". So ".*" matches any character zero or more times. "a(b|c)*" matches a, ab, ac, abb, abbcbbcccb, etc.

There is also ?, meaning "match zero or one times (but not more)", and +, which means "match one or more times". So the following matches signed integers:

(\+|-)?[[:digit:]]+

That is, the + or - at the start is optional, and there has to be at least one digit.
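Here is a sketch of the signed-integer pattern run through grep -E. The leading + is escaped so it is taken literally, and the digit class has its outer pair of brackets. The ^ and $ anchors are my addition, on the assumption that we want the whole line to be an integer rather than just to contain one.

```shell
# Made-up test values, one per line.
candidates() { printf '%s\n' 42 -7 +13 3.14 abc 12x; }

# Prints 42, -7 and +13; the rest have something other than digits after the sign.
candidates | grep -E '^(\+|-)?[[:digit:]]+$'
```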

Shotts has this regular expression for matching phone numbers, either in the (nnn) nnn-nnnn form or the nnn nnn-nnnn form:

^\(?[0-9][0-9][0-9]\)? [0-9][0-9][0-9]-[0-9][0-9][0-9][0-9]$

I only use nnn-nnn-nnnn. You could handle this by changing that first space to ( |-), but that would also accept (nnn)-nnn-nnnn. Realistically you would need grouping, with an outer set of parentheses so that the alternation applies only to the area-code part (note that I am using the {n} quantifier from below):

^((\([0-9]{3}\) )|([0-9]{3}( |-)))[0-9]{3}-[0-9]{4}$

Demo: trying these with grepdemo3.text.
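If you don't have grepdemo3.text handy, here is a self-contained sketch. The phone numbers are made up. The pattern is the grouped form with one outer pair of parentheses around the two area-code alternatives, so that the | does not split the whole expression in two.

```shell
# Made-up numbers: the first three should be accepted, the last two rejected.
phones() {
    printf '%s\n' '(555) 123-4567' '555 123-4567' '555-123-4567' \
                  '(555)-123-4567' '555 1234567'
}

phone_re='^((\([0-9]{3}\) )|([0-9]{3}( |-)))[0-9]{3}-[0-9]{4}$'
phones | grep -E "$phone_re"
```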

There is also { }, which lets you match a specific number of times:

{3} match the previous subexpression 3 times

{3,} match the preceding subexpression 3 or more times

{3,6} match the preceding subexpression between 3 and 6 times

{,3} match the preceding subexpression between 0 and 3 times
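A short sketch of the brace quantifiers with grep -E. The anchors are there so the count applies to the whole line; without them, a longer run would still contain a shorter matching one.

```shell
# Lines made of 3 to 6 a's: prints aaa and aaaaaa.
printf '%s\n' a aaa aaaaaa aaaaaaa | grep -E '^a{3,6}$'

# {4} stands in for [0-9][0-9][0-9][0-9]: prints 2024.
printf '%s\n' 2024 99 12345 | grep -E '^[0-9]{4}$'
```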

There is also multiline mode. In that mode, "^" and "$" match the beginning and end of each line, versus the beginning and end of the entire string. We won't cover this.

Regular Expressions in Unix

Regular expressions are used in grep, of course. Where else?

In bash, the [[ ]] test operator has a regular-expression test. You can write things like this:
if [[ $thechar =~ [0-9] ]]

(Note that the regular expression is written without quotation marks; if it is quoted, bash matches it as a literal string.) This works for strings like '1', 'a', etc. But what if we try multi-char strings like '1a'? Surprisingly, that too counts as a match! And so does 'a1'. A match occurs if the regular expression matches some substring of the lefthand string. From the bash manual:

The pattern will match if it matches any part of the string. If you want to force the pattern to match the entire string, anchor the pattern using the ‘^’ and ‘$’ regular expression operators.

If you want to check if the string matches one and only one digit, use the '^' and '$' anchors:
if [[ $thestr =~ ^[0-9]$ ]]

If you want to check if the string contains only digits, but maybe a whole bunch of them, use
if [[ $thestr =~ ^[0-9]+$ ]]

If you left out the ^ and $, what would it match?

To check for not a match, put "!" in front of the "[[".
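The three tests above can be tried side by side with a sketch like the following. The function names are made up for illustration.

```shell
#!/usr/bin/env bash
# Hypothetical helpers wrapping the three tests from the notes.
has_digit()  { [[ $1 =~ [0-9] ]]; }       # unanchored: some substring is a digit
one_digit()  { [[ $1 =~ ^[0-9]$ ]]; }     # exactly one digit and nothing else
all_digits() { [[ $1 =~ ^[0-9]+$ ]]; }    # one or more digits and nothing else

for s in 1 1a a1 123 abc; do
    if has_digit "$s"; then echo "$s: contains a digit"; fi
    if one_digit "$s"; then echo "$s: is a single digit"; fi
    if all_digits "$s"; then echo "$s: is all digits"; fi
    # Negation: the ! goes in front of the [[ .
    if ! [[ $s =~ [0-9] ]]; then echo "$s: no digits at all"; fi
done
```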

The case statement uses "shell patterns", that is, the same kind of simple regular expressions used for filename matching. It does not support full regular expressions, however.
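A small sketch of case with shell patterns. The function classify and its categories are invented for the example. Notice that [0-9]* here means "a digit followed by anything", as in filename matching, not "zero or more digits" as it would in a regular expression.

```shell
# Hypothetical helper: describe a string using shell patterns only.
classify() {
    case $1 in
        [0-9])        echo "single digit" ;;
        [0-9]*)       echo "starts with a digit" ;;
        January|Jan)  echo "month" ;;            # | separates alternative patterns
        *.txt)        echo "text file name" ;;
        *)            echo "something else" ;;
    esac
}

for s in 7 7abc Jan notes.txt hello; do
    echo "$s: $(classify "$s")"
done
```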

Finite-State Machines

Regular expression: * means repeat 0 or more times, ? means either 0 or 1 times

b a* c
b? a* c?
1 (01*0)* 1 (odd binary numbers divisible by 3; every match is one, though not every such number matches, eg 1111)
[a-z][a-z0-9]* (for programming-language identifiers)
[0-9]*(\.[0-9]*)?(e[0-9]*)? (for floating-point numbers, eg 12.345e67)

What strings match these?

What does a finite-state recognizer (finite-state machine) for these look like? A finite-state recognizer is a directed graph. Arcs can be labeled with a single letter. For example, the recognizer for b a* c looks like this:

                a
              /---\
           b  \   /   c
      (1)----> (2) -----> (end)

A recognizer for b? would look like

          /--------\
         /          \
      (1)           (end)
         \     b    /
          \--------/

How about b (aa)* a? There's a difference here: we can't tell which path to take looking only at the next letter.

More examples of regular expressions:

These use slightly extended regexes (The google example does not support + or *)

\d matches any digit 0-9, same as [0-9]
\W matches anything other than a letter, digit or underscore, same as [^a-zA-Z0-9_]
\s matches any whitespace character (space, tab, newline, etc)

Warning: there are quite a few different standards for regular expressions. Always read the documentation.

The two ascii-diagrammed finite-state recognizers above were both deterministic: we never have a state with two outgoing edges, going two different directions, that are labeled with the same input. A deterministic finite recognizer is abbreviated DFA (the A is for "automaton", the usual word).

How about b (ab)* a? There's a difference here.

                      (2)
                     /   \
                    a     b
                     \   /
        (0) --- b --> (1) -- a --> (end)

Now we do have a state -- state (1) -- with two different outbound edges labeled 'a'. Such a recognizer is known as nondeterministic, that is, as an NFA. We can still use an NFA to match inputs, but now what do we do if we're at a vertex and there are multiple edges that match the current input?

There are two primary approaches. The first is to try one of the edges first, and see if that works. If it does not, we backtrack to the vertex in question and at that point try the next edge. This approach does work, but with a poorly chosen regular expression it may be extremely slow. Consider the regular expression (a?)ⁿ aⁿ. This means up to n optional a's, followed by n a's. Let us match against aⁿ, meaning all the optional a's must not be used. The usual strategy when matching "a?" is to try the "a" branch first, and only if that fails do we try the empty branch. But that now means that we will have 2ⁿ - 1 false branches before we finally succeed.

Example: (a?)³ a³.
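To watch the blowup happen, here is a rough bash sketch of a backtracking matcher that is hard-wired to the pattern (a?)ⁿ aⁿ. It is not how any real regex library is written. The state numbering and the names try_from, calls, input and n are my own assumptions. It tries the "a" branch before the empty branch, as described above, and counts how many times it is called.

```shell
#!/usr/bin/env bash
# States 0..n-1 are the optional a's, states n..2n-1 the required a's,
# and state 2n is the end state.
calls=0

try_from() {                      # usage: try_from STATE POS
    local s=$1 p=$2
    calls=$((calls + 1))
    if ((s == 2 * n)); then
        ((p == ${#input}))        # succeed only if all the input was consumed
        return
    fi
    # First choice: consume an 'a' (possible in optional and required states).
    if [[ ${input:p:1} == a ]] && try_from $((s + 1)) $((p + 1)); then
        return 0
    fi
    # Second choice, optional states only: match the empty string instead.
    if ((s < n)) && try_from $((s + 1)) "$p"; then
        return 0
    fi
    return 1
}

for n in 3 6 9; do
    input=$(printf 'a%.0s' $(seq 1 "$n"))    # the string of n a's
    calls=0
    if try_from 0 0; then echo "n=$n: matched after $calls calls"; fi
done
```

The call count climbs much faster than n does, since every combination of used and unused optional a's is tried before the all-empty combination finally works.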

A much faster approach is to use the NFA with state sets, rather than individual states. That is, when we are in state S1 and the next input can lead either to state S2 or state S3, we record the new state as {S2,S3}. If, for the next input, S2 can go to S4 and S3 can go to either S5 or S6, the next state set is {S4,S5,S6}. This approach might look exponential, but the number of states is fixed: a state set can never hold more than the total number of states in the NFA, so the work per input character is bounded.

Example: (a?)³ a³.
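Here is a sketch of the state-set approach for this example, in bash (version 4 or later, for associative arrays). The NFA is stored as a table: delta[state,a] lists where an "a" leads, and delta[state,eps] lists the states reachable without consuming input, which is how I chose to represent the "a?" skip. The names delta, closure and nfa_match, and the state numbering 0 through 2n, are assumptions for this sketch rather than anything standard.

```shell
#!/usr/bin/env bash
# NFA for (a?)^n a^n with n=3: states 0..6, state 6 accepting.
declare -A delta
n=3
for ((i = 0; i < n; i++)); do
    delta[$i,a]=$((i + 1))       # use the optional a
    delta[$i,eps]=$((i + 1))     # or skip it without reading input
done
for ((i = n; i < 2 * n; i++)); do
    delta[$i,a]=$((i + 1))       # required a
done
accept=$((2 * n))

# closure STATE...: print the given states plus everything reachable by eps edges.
closure() {
    local -A seen=()
    local work=("$@") i=0 s t
    while ((i < ${#work[@]})); do
        s=${work[i]}; i=$((i + 1))
        [[ -n ${seen[$s]:-} ]] && continue
        seen[$s]=1
        for t in ${delta[$s,eps]:-}; do work+=("$t"); done
    done
    echo "${!seen[@]}"
}

# nfa_match STRING: succeed if the NFA accepts STRING.
nfa_match() {
    local input=$1 current c s t k
    local -a next
    current=$(closure 0)
    for ((k = 0; k < ${#input}; k++)); do
        c=${input:k:1}
        next=()
        for s in $current; do                 # follow every edge labeled c
            for t in ${delta[$s,$c]:-}; do next+=("$t"); done
        done
        current=$(closure "${next[@]}")       # the new state set
    done
    for s in $current; do
        [[ $s == "$accept" ]] && return 0
    done
    return 1
}

for str in aa aaa aaaaaa aaaaaaa; do
    if nfa_match "$str"; then echo "$str: match"; else echo "$str: no match"; fi
done
```

Each input character is handled once, and the state set never has more than 2n+1 members, so there is nothing here that can blow up exponentially.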

You might wonder which algorithm your favorite programming language uses: the exponential-time backtracking algorithm, or the linear-time Thompson algorithm. Rather oddly, most programming languages have chosen the exponential-time algorithm. Mostly the reason is implementor ignorance! (Though it is true that when you want to extend regular expressions with variables that receive the result of a sub-match, and then match those variables again, the backtracking algorithm is the only one that works. But now you don't really have regular expressions any more.)

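There are optional third-party modules for Python, such as the bindings to Google's RE2 library, that use automaton-based matching in the Thompson style rather than backtracking.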

See also https://swtch.com/~rsc/regexp/regexp1.html, "Regular expression search algorithms", the paragraph beginning "A more efficient ...."

By the way, a much better regular expression for between n and 2n a's in a row is aⁿ (a?)ⁿ. We parse n a's at the beginning, and the optional a's are all following.

The implementation of an NFA/DFA recognizer (either Thompson or backtracking) does literally use the graph approach: for each current state, and each next-input symbol, we look up what next states are possible with that input symbol. The code to drive the NFA/DFA does not need to be changed for different NFA/DFAs. This is a big win from a software-engineering perspective.

One more example of NFA state-set recognizer: aaa|aab|aac|aad

         +----a->(3)--a->(7)
         |
         | /--a->(4)--b->(8)
(1)--a->(2)
         | \--a->(5)--c->(9)
         |
         +----a->(6)--d->(10)
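The diagram can be turned into a lookup table and run with a small state-set driver. This is a sketch in bash 4 or later. The names edges, accepting and run_nfa are made up, and the driver is simplified on the assumption that there are no empty-input edges. It prints the state set after each input character.

```shell
#!/usr/bin/env bash
# The edges of the diagram: edges[state,input] = list of next states.
declare -A edges=(
    [1,a]="2"
    [2,a]="3 4 5 6"
    [3,a]="7" [4,b]="8" [5,c]="9" [6,d]="10"
)
accepting=" 7 8 9 10 "

# run_nfa STRING: trace the state sets, succeed if we end in an accepting state.
run_nfa() {
    local input=$1 current="1" next s c k
    for ((k = 0; k < ${#input}; k++)); do
        c=${input:k:1}
        next=""
        for s in $current; do
            next+=" ${edges[$s,$c]:-}"      # all states reachable from s on c
        done
        current=$next
        echo "after '$c': {$(echo $current)}"
    done
    for s in $current; do
        [[ $accepting == *" $s "* ]] && return 0
    done
    return 1
}

if run_nfa aab; then echo "aab accepted"; else echo "aab rejected"; fi
```

For the input aab the state sets are {2}, then {3 4 5 6}, then {8}. To recognize a different expression, only the edges table and the accepting list would change; the driver stays the same.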