Regular Expressions

We've already seen regular expressions in the context of file-name matching. In this case, * matches any number of characters, ? matches one character, and [m-p] matches the letters m, n, o, or p. But note that this file-name matching uses a simplified form of regular expressions.

In 1997, Jamie Zawinski (then a dev at Netscape) posted the following on Usenet:

Some people, when confronted with a problem, think "I know, I'll use regular expressions." Now they have two problems.

A big part of the trouble with regular expressions is that they are so hard to read. There is no place for extraneous whitespace; blanks are, of course, significant characters. There is no way to embed comments. One does get somewhat used to the syntax.

The grep command (and egrep) is also all about matching regular expressions, although so far we've just used it to match strings. Grep has a bunch of useful options:

-i    ignore case
-v    list lines not matching the pattern (invert)
-n    include line numbers
-q    don't print anything, just return 0 if found or 1 if not found

Here are the "metacharacters" that have special meaning in grep regular expressions:

.      matches any one character
^      anchor for the start of the string or line; negation if the first character within [ ]
$      anchor for the end of the string or line
[ ]    for creating character ranges, just like with filename matching
( )    for regular grouping
{ }    match a regular subexpression a specific number of times
-      for character ranges
?      match a regular subexpression 0 or 1 times
*      match a regular subexpression 0 or more times
+      match a regular subexpression 1 or more times
|      between two alternative regular expressions, such as in January|Jan in the case example
\      for quoting one of these to use it as a literal character rather than a metacharacter

Basic regular expressions (BRE) use only ^ $ . [ ] *. Here we will be using extended regular expressions (ERE). These are not recognized by plain grep; you must either use egrep or grep -E.

The file-regular-expression symbol "?" corresponds to the BRE ".": both match one character. As we will see below, the file-regular-expression symbol "*" corresponds to the BRE ".*": the "." means any single character, and the "*" means "the previous regular expression repeated 0 or more times". The file-regular-expression and BRE notations for character ranges are pretty similar.
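The correspondence just described can be sketched in a few lines of Python. This is only an illustration: glob_to_regex is a hypothetical helper, not part of any standard tool, and Python's re module stands in for grep. It handles ?, * and simple [ ] ranges, and ignores refinements such as [!x] negation.

```python
import re

def glob_to_regex(glob):
    """Translate a simple file-name pattern into a regular expression (hypothetical helper)."""
    parts = []
    in_brackets = False
    for ch in glob:
        if in_brackets:
            # Inside [ ] the characters are copied through unchanged.
            parts.append(ch)
            if ch == "]":
                in_brackets = False
        elif ch == "[":
            in_brackets = True
            parts.append(ch)
        elif ch == "?":
            parts.append(".")       # one character
        elif ch == "*":
            parts.append(".*")      # any number of characters
        else:
            parts.append(re.escape(ch))  # literal character, e.g. "." becomes "\."
    return "".join(parts)

c_or_h = glob_to_regex("file?.[ch]")
print(c_or_h)
print(bool(re.fullmatch(c_or_h, "file1.c")))   # one character, then .c
print(bool(re.fullmatch(c_or_h, "file12.c")))  # two characters where ? allows one
```

Note that fullmatch is used because a file-name pattern must match the whole name, while grep is happy to match anywhere in the line.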

There are also "Perl-compatible regular expressions"; the library is known as pcre. This notation shares its basic operators with ERE, but it adds many constructs of its own and differs in some details.

As an example of alternation, in its simplest form, suppose we want to search the output of ps -ef for either "chrome" or "firefox". Then we can use a command along the lines of

    ps -ef | egrep "chrome|firefox"

We need the quotes because '|' has special meaning to bash (it is the pipe symbol). Also, note that I used egrep instead of plain grep; egrep enables a much larger set of regular expressions. Plain grep doesn't work here at all (although grep -E is the same as egrep, and officially is preferred).
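Here is the same alternation idea as a small Python sketch. The sample lines are made up to resemble ps -ef output, and Python's re module is standing in for egrep, so treat this as an illustration rather than a transcript.

```python
import re

# Hypothetical lines in the style of "ps -ef"; real output will differ.
lines = [
    "alice   1201     1  0 09:00 ?     00:00:05 /usr/bin/firefox",
    "alice   1342     1  0 09:02 ?     00:00:09 /opt/google/chrome/chrome",
    "alice   1400  1399  0 09:05 pts/0 00:00:00 bash",
]

pattern = re.compile(r"chrome|firefox")

# Keep only the lines that contain either alternative, as egrep would.
matches = [line for line in lines if pattern.search(line)]
for line in matches:
    print(line)
```

Inside a Python string the | needs no protection; the quoting issue above is purely about bash.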

As an example of character ranges, if we wanted to search for "loyola" or "Loyola" with grep, the pattern "[Ll]oyola" would work.

The ^ character marks the start of the string, in most cases. But as the first character of a range, it means match the characters not in the list. So "[^aeiou]a[^a-s]" matches "cat", but not "eat" or "cab".
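The two range examples above are easy to check. This sketch uses Python's re module, which treats [ ] ranges and the leading ^ the same way for these patterns; the word lists are just test data.

```python
import re

# [Ll]oyola: either capitalization of the first letter.
loyola = re.compile(r"[Ll]oyola")
loyola_results = {w: bool(loyola.search(w)) for w in ["loyola", "Loyola", "LOYOLA"]}

# [^aeiou]a[^a-s]: a non-vowel, then 'a', then a character outside a-s.
negated = re.compile(r"[^aeiou]a[^a-s]")
negated_results = {w: bool(negated.search(w)) for w in ["cat", "eat", "cab"]}

print(loyola_results)   # {'loyola': True, 'Loyola': True, 'LOYOLA': False}
print(negated_results)  # {'cat': True, 'eat': False, 'cab': False}
```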

[:alnum:]    The alphanumeric characters. In ASCII, equivalent to: [A-Za-z0-9]
[:word:]     The same as [:alnum:], with the addition of the underscore (_) character. This is not one of the standard POSIX classes, so not every tool accepts it.
[:alpha:]     The alphabetic characters. In ASCII, equivalent to: [A-Za-z]
[:blank:]     Includes the space and tab characters.
[:cntrl:]      The ASCII control codes. Includes the ASCII characters 0 through 31 and 127.
[:digit:]      The numerals 0 through 9.
[:graph:]    The visible characters. In ASCII, it includes characters 33 through 126.
[:lower:]     The lowercase letters.
[:punct:]     The punctuation characters. In ASCII, equivalent to: [-!"#$%&'()*+,./:;<=>?@[\\\]^_`{|}~]
[:print:]      The printable characters. All the characters in [:graph:] plus the space character.
[:space:]     The whitespace characters including space, tab, carriage return, newline, vertical tab, and form feed. In ASCII, equivalent to: [ \t\r\n\v\f]
[:upper:]     The uppercase characters.
[:xdigit:]    Characters used to express hexadecimal numbers. In ASCII, equivalent to: [0-9A-Fa-f]

Warning: IBM mainframes use EBCDIC encoding, not ASCII, and in EBCDIC the alphabetic letters are not contiguous, so [a-z] does not mean exactly the lowercase letters. If you find yourself in an EBCDIC environment, stop. (You can also use the classes above, such as [[:lower:]], or [a-ij-rs-z].)

A more serious character-set issue is ASCII vs UTF-8. grep actually does fairly well with that. Try grep ": ." grepdemo2.text.

As with file-matching regular expressions, the square brackets shown above are part of the class name; each class still needs another set of square brackets around it to form a range, as in [[:digit:]] or [[:alpha:]_].

Grouping is often useful. If we wanted to search for "received" or "receiving", we can use the basic alternation form "received|receiving", or the equivalent but (maybe) simpler form with grouping, "receiv(ed|ing)".
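A quick way to convince yourself the two forms agree is to run both over a few words. This is a sketch using Python's re module as a stand-in for egrep; the word list is arbitrary.

```python
import re

plain = re.compile(r"received|receiving")
grouped = re.compile(r"receiv(ed|ing)")

words = ["received", "receiving", "receiver", "receive"]

# For each word, record whether each form finds a match.
plain_results = [bool(plain.search(w)) for w in words]
grouped_results = [bool(grouped.search(w)) for w in words]

print(plain_results)
print(grouped_results)
```

Both lists come out the same: the first two words match and the last two do not.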

Quantifiers

In filename matching, '*' stands for "zero or more characters". In ERE, * stands for "repeat the previous sub-expression zero or more times". So ".*" matches any character zero or more times. "a(b|c)*" matches a, ab, ac, abb, abbcbbcccb, etc.
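The a(b|c)* example can be checked directly. This sketch uses Python's re.fullmatch, so the entire string has to match (grep, by contrast, would accept a match anywhere in the line); the rejected strings are my own additions.

```python
import re

pattern = re.compile(r"a(b|c)*")

accepted = ["a", "ab", "ac", "abb", "abbcbbcccb"]
rejected = ["", "b", "ba", "abd"]

# fullmatch requires the whole string to fit the pattern.
results = {s: bool(pattern.fullmatch(s)) for s in accepted + rejected}
for s, ok in results.items():
    print(repr(s), ok)
```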

There is also ?, meaning "match zero or one times (but not more)", and +, which means "match one or more times". So a pattern such as [+-]?[0-9]+ matches signed integers.

That is, the + or - at the start is optional, and there has to be at least one digit.
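As a sketch of the signed-integer idea, here is one pattern that fits the description (an optional sign, then at least one digit). The exact pattern is my choice for illustration, and Python's re module is standing in for egrep.

```python
import re

# Optional leading + or -, followed by one or more digits.
signed_int = re.compile(r"[+-]?[0-9]+")

good = ["42", "-7", "+123", "0"]
bad = ["", "+", "--3", "4-2", "12a"]

# fullmatch: the whole string must be a signed integer.
int_results = {s: bool(signed_int.fullmatch(s)) for s in good + bad}
for s, ok in int_results.items():
    print(repr(s), ok)
```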

Shotts has a regular expression for matching phone numbers, either in the (nnn) nnn-nnnn form or the nnn nnn-nnnn form.

I only use nnn-nnn-nnnn. You could fix this by changing that first space to ( |-), but that would also accept (nnn)-nnn-nnnn. Realistically you would need grouping, together with the {n} quantifier from below.
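Here is one way the grouping might look, as a sketch only. The pattern below is my own guess at a reasonable form, not a quotation from any book: the area code is either "(nnn) " with parentheses and a space, or "nnn" followed by a space or a hyphen.

```python
import re

# Group the two area-code styles so that parentheses and hyphen cannot be mixed.
phone = re.compile(r"^(\([0-9]{3}\) |[0-9]{3}[ -])[0-9]{3}-[0-9]{4}$")

samples = [
    "(555) 123-4567",   # parenthesized area code
    "555 123-4567",     # space after area code
    "555-123-4567",     # hyphen after area code
    "(555)-123-4567",   # mixed form, should be rejected
    "555-123-456",      # too few digits at the end
]

phone_results = {s: bool(phone.search(s)) for s in samples}
for s, ok in phone_results.items():
    print(s, ok)
```

The first three samples are accepted and the last two are rejected.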

There is also multiline mode. In that mode, "^" and "$" match the beginning and end of each line, versus the beginning and end of the entire string. We won't cover this.

Finite-State Machines

These use slightly extended regexes. (The Google example does not support + or *.)

\d matches any digit 0-9, same as [0-9]
\W matches anything other than a letter, digit or underscore, same as [^a-zA-Z0-9_]
\s matches a whitespace character, such as a space or tab
^ matches the start of the line; $ matches the end of the line
{3,6} means that the single-character item preceding it must match between 3 and 6 times
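These shorthand forms can be tried out in Python, whose re module supports all of them. This is just a sketch with made-up test strings; other regex dialects may differ in the details, as the warning below says.

```python
import re

# \d{3,6}: between 3 and 6 digits; ^ and $ force the whole string to match.
digits = re.compile(r"^\d{3,6}$")
digit_results = {s: bool(digits.search(s)) for s in ["12", "123", "123456", "1234567"]}

# \W: anything other than a letter, digit or underscore.
non_word = re.findall(r"\W", "a_b c!")

# \s: whitespace characters (here a space and a tab).
spaces = re.findall(r"\s", "a b\tc")

print(digit_results)  # {'12': False, '123': True, '123456': True, '1234567': False}
print(non_word)       # [' ', '!']
print(spaces)         # [' ', '\t']
```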

Warning: there are quite a few different standards for regular expressions. Always read the documentation.

Let's call the finite-state recognizers finite automata, which is the usual term. So far the finite-state recognizers have all been deterministic: we never have a state with two outgoing edges, going two different directions, that are labeled with the same input. A deterministic finite automaton is abbreviated DFA.

How about b (ab)* a? There's a difference here. Now we do have a state with two different edges labeled 'a'. Such an automaton is known as nondeterministic, that is, as an NFA. We can still use an NFA to match inputs, but now what do we do if we're at a vertex and there are multiple edges that match the current input?

There are two primary approaches. The first is to try one of the edges first, and see if that works. If it does not, we backtrack to the vertex in question and at that point try the next edge. This approach does work, but with a poorly chosen regular expression it may be extremely slow. Consider the regular expression (a?)ⁿ aⁿ. This means up to n optional a's, followed by n a's. Let us match against aⁿ, meaning that none of the optional a's can actually be used. The usual strategy when matching "a?" is to try the "a" branch first, and only if that fails do we try the empty branch. But that now means that we will explore 2ⁿ - 1 false branches before we finally succeed.
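To see the blowup concretely, here is a toy backtracking matcher for exactly this one family of patterns. It is a sketch of the strategy described above, not how any real regex library is written; count_backtracking_steps is a hypothetical name, and the pattern is represented simply as a list of "optional a" and "required a" items.

```python
def count_backtracking_steps(n):
    """Match (a?)^n a^n against 'a' * n by naive backtracking.

    Returns (matched, calls), where calls counts invocations of the matcher.
    """
    # True marks an optional "a?", False a required "a".
    pattern = [True] * n + [False] * n
    text = "a" * n
    calls = 0

    def match(p, t):
        nonlocal calls
        calls += 1
        if p == len(pattern):
            return t == len(text)
        # Try the "a" branch first, as described above.
        if t < len(text) and text[t] == "a" and match(p + 1, t + 1):
            return True
        # Only an optional item may fall back to the empty branch.
        return pattern[p] and match(p + 1, t)

    return match(0, 0), calls

for n in (5, 10, 15):
    matched, calls = count_backtracking_steps(n)
    print(n, matched, calls)
```

The match always succeeds in the end, but the call count grows by a factor of well over a thousand as n goes from 5 to 15.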

A much faster approach is to use the NFA with state sets, rather than individual states. That is, when we are in state S1 and the next input can lead either to state S2 or state S3, we record the new state as {S2,S3}. If, for the next input, S2 can go to S4 and S3 can go to either S5 or S6, the next state set is {S4,S5,S6}. This approach might look exponential, but the number of states is fixed: a state set can never hold more states than the NFA has, so the work per input character is bounded.
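Here is the state-set idea applied to the same (a?)ⁿ aⁿ pattern, as a sketch. The state numbering is my own assumption: states 0 through 2n in a row, each with an 'a' edge to the next, and the first n states also having an "empty" edge to the next. matches_with_state_sets is a hypothetical name.

```python
def matches_with_state_sets(n, text):
    """Simulate the NFA for (a?)^n a^n on text, tracking a set of states."""
    accept = 2 * n

    def closure(states):
        # States 0..n-1 have an "empty" edge to the next state (a skipped a?),
        # so from any such state we can also reach every state up to n.
        result = set(states)
        for s in states:
            while s < n:
                s += 1
                result.add(s)
        return result

    current = closure({0})
    for ch in text:
        if ch != "a":
            return False
        # Every state below the accept state has an 'a' edge to its successor.
        current = closure({s + 1 for s in current if s < accept})
    return accept in current

print(matches_with_state_sets(3, "aaa"))       # True: n a's
print(matches_with_state_sets(3, "aaaaaa"))    # True: 2n a's
print(matches_with_state_sets(3, "aa"))        # False: too few
print(matches_with_state_sets(3, "aaaaaaa"))   # False: too many
print(matches_with_state_sets(25, "a" * 25))   # True, with no exponential wait
```

The set never holds more than 2n+1 states, so each input character costs at most a pass over that many states.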

See also https://swtch.com/~rsc/regexp/regexp1.html, "Regular expression search algorithms", the paragraph beginning "A more efficient ...."

By the way, a much better regular expression for between n and 2n a's in a row is aⁿ (a?)ⁿ. We parse n a's at the beginning, and the optional a's are all following.

The implementation of an NFA/DFA recognizer does literally use the graph approach: for each current state, and each next-input symbol, we look up what next states are possible with that input symbol. The code to drive the NFA/DFA does not need to be changed for different NFA/DFAs. This is a big win from a software-engineering perspective.
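A minimal sketch of such a table-driven recognizer follows. The driver run_nfa and the state numbering are assumptions made for illustration; the first table is one possible NFA for b (ab)* a, and the second is a DFA for ab*, run through the very same driver.

```python
def run_nfa(transitions, start, accepting, text):
    """Drive any NFA or DFA given as a table: (state, symbol) -> set of next states."""
    current = {start}
    for symbol in text:
        # Follow every edge labeled with this symbol from every current state.
        current = {nxt
                   for state in current
                   for nxt in transitions.get((state, symbol), ())}
    return bool(current & accepting)

# NFA for b (ab)* a: state 1 has two different edges labeled 'a'.
b_ab_star_a = {
    (0, "b"): {1},
    (1, "a"): {2, 3},   # 2: inside another "ab"; 3: the final 'a'
    (2, "b"): {1},
}

# DFA for ab*: at most one next state per (state, symbol) pair.
a_b_star = {
    (0, "a"): {1},
    (1, "b"): {1},
}

for s in ["ba", "baba", "bababa", "bab", "b", "a"]:
    print(s, run_nfa(b_ab_star_a, 0, {3}, s))

for s in ["a", "abbb", "ba", ""]:
    print(repr(s), run_nfa(a_b_star, 0, {1}, s))
```

Only the tables differ between the two automata; the driver code is untouched.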

Regular Expression Efficiency

Example: (a?)³ a³, or (a?)(a?)(a?)aaa. Suppose the input is aaa. Here are the steps:

 /---empty---\ /---empty---\ /---empty---\
(0) --- a --- (1) --- a --- (2) --- a --- (3) --- a --- (4) --- a --- (5) --- a --- (end)

         +----a->(3)--a->(7)
         |
         | /--a->(4)--b->(8)
(1)--a->(2)
         | \--a->(5)--c->(9)
         |
         +----a->(6)--d->(10)

How about aaa|aab|aac|add? We can now only factor one 'a' out of the whole expression: a(aa|ab|ac|dd). But from (aa|ab|ac) we can factor another a, giving a(a(a|b|c)|dd). There are limits to this technique, but sometimes it is useful.
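Factoring is easy to get wrong, so it is worth checking by brute force. Factoring an a out of aa|ab|ac leaves a(a|b|c), which makes the fully factored form a(a(a|b|c)|dd). The sketch below compares that with the original on every string of length 0 through 4 over the letters a, b, c and d, using Python's re module as a stand-in for egrep.

```python
import re
from itertools import product

original = re.compile(r"aaa|aab|aac|add")
factored = re.compile(r"a(a(a|b|c)|dd)")

# Every string of length 0 through 4 over the alphabet {a, b, c, d}.
candidates = ["".join(p) for k in range(5) for p in product("abcd", repeat=k)]

# Strings on which the two expressions disagree (there should be none).
disagreements = [s for s in candidates
                 if bool(original.fullmatch(s)) != bool(factored.fullmatch(s))]

# Strings the factored form accepts in full.
accepted = [s for s in candidates if factored.fullmatch(s)]

print(disagreements)  # []
print(accepted)       # ['aaa', 'aab', 'aac', 'add']
```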