Class 8 Notes

ls and newlines in filenames

Newlines in file names (that are listed as such by ls) make it impossible to write reliable shell scripts that use the output of ls. Even something simple like

would fail: if a directory contains files "a", "b\nc" and "d" (where \n is a newline) then a newline-printing ls would produce

a
b
c
d

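So ls today will, as far as I can tell, never output a literal newline in a filename when it is writing to a terminal; it shows a ? or a quoted escape instead. When its output goes to a pipe or a file, GNU ls still passes the newline through, so scripts should not parse the output of ls.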
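One way to see the problem is to build the three-file directory from the example and count its entries two ways. This is only a sketch: the temporary directory is made up for the demo, and the number of lines ls prints depends on the ls implementation and on whether its output goes to a terminal or a pipe.

```shell
#!/usr/bin/env bash
# Hypothetical scratch directory holding the files "a", "b<newline>c" and "d".
demo=$(mktemp -d)
touch "$demo/a" "$demo/d"
touch "$demo/b
c"                      # the quoted name really does contain a newline

# Counting lines of ls output. With GNU ls writing to a pipe, the raw
# newline is usually passed through, so this tends to overcount.
ls_lines=$(ls "$demo" | wc -l)

# Counting with a glob. The shell keeps each name as a single word,
# no matter what characters it contains.
glob_count=0
for f in "$demo"/*; do
    glob_count=$((glob_count + 1))
done

echo "lines from ls: $ls_lines"
echo "names from glob: $glob_count"
rm -rf "$demo"
```

Looping over a glob, as in the second half, is the usual replacement for looping over the output of ls.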

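Try echo '\n' in bash. It prints the two characters \ and n, not a newline! If you want a newline, use echo $'\n'. What about $"\n"? No: that form is for locale translation and does not process backslash escapes.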
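The three quoting forms can be compared by storing each in a variable and checking its length. This sketch assumes bash (other shells treat $'...' and echo differently); the variable names are made up for the demo.

```shell
#!/usr/bin/env bash
single='\n'        # single quotes: backslash and n stay as two characters
ansi=$'\n'         # ANSI-C quoting: bash converts \n to one newline character
translated=$"\n"   # locale-translation quoting: acts like "\n", two characters

echo "single quotes:  ${#single} character(s)"
echo "dollar-single:  ${#ansi} character(s)"
echo "dollar-double:  ${#translated} character(s)"

# printf interprets the escape itself, so it is a portable way to get newlines.
printf 'line one\nline two\n'
```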

Compiling C and Java

#include <stdio.h>
#include "hello.h"   /* local header; its contents are not shown in these notes */

int main(void) {
    printf("hello, world!\n");
    return 0;
}

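The existence of .h files makes dependencies complicated in C: when a header changes, every source file that includes it must be recompiled. Make solves this problem by tracking which files depend on which and rebuilding only what is out of date.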

    Pretty much all platforms support make.
    It's straightforward to add complicated testing or other alternative builds to Makefiles.
    The command line gives greater execution flexibility than GUIs.

download diction-1.11.tar.gz from ftp.gnu.org/gnu/diction/
tar xzf diction-1.11.tar.gz
cd diction-1.11
./configure
make
now touch one of the source files and make again
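
The second make does work only because the touched file now has a newer modification time than the files built from it. The sketch below imitates that timestamp comparison in plain shell. It is not how make is implemented; the needs_rebuild function and the file names are hypothetical, and it relies on the -nt test, which bash and most other shells provide.

```shell
#!/usr/bin/env bash
# Stand-in files with fixed, old timestamps; the "object file" is newest.
work=$(mktemp -d)
touch -t 202001010000 "$work/hello.c" "$work/hello.h"
touch -t 202001010001 "$work/hello.o"

# needs_rebuild TARGET PREREQ...  succeeds if TARGET is missing or older
# than any prerequisite.
needs_rebuild() {
    target=$1
    shift
    [ -e "$target" ] || return 0
    for dep in "$@"; do
        if [ "$dep" -nt "$target" ]; then
            return 0
        fi
    done
    return 1
}

if needs_rebuild "$work/hello.o" "$work/hello.c" "$work/hello.h"; then
    first=rebuild
else
    first=up-to-date
fi

touch "$work/hello.h"      # simulate editing the header

if needs_rebuild "$work/hello.o" "$work/hello.c" "$work/hello.h"; then
    second=rebuild
else
    second=up-to-date
fi

echo "before touch: $first"
echo "after touch:  $second"
rm -rf "$work"
```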

gpg --import gnu-keyring.gpg     # keyring from ftp.gnu.org/gnu
gpg --verify diction-1.11.tar.gz.sig

The trust rabbit hole: WARNING: This key is not certified with a trusted signature!

git clone https://github.com/ImageMagick/ImageMagick.git
cd ImageMagick
./configure
make

Make and Java: yes, you can do it, but Java files pretty much all compile independently. Still, make is helpful in that you can use it to recompile only the files you changed.

Regular Expressions

Regular expressions are a way to describe a set of strings. We've already seen them. First, the file-name matching is a form of regular expression: * matches any number of characters, ? matches one character, and [m-p] matches the letters m, n, o, or p. But note that this file-name matching uses a simplified form of regular expressions.

In 1997, Jamie Zawinski (then a dev at Netscape) posted the following on Usenet:

Some people, when confronted with a problem, think "I know, I'll use regular expressions." Now they have two problems.

A big part of the trouble with regular expressions is that they are so hard to read. There is no place for extraneous whitespace; blanks are, of course, significant characters. There is no way to embed comments. One does get somewhat used to the syntax.

The grep command is also all about matching regular expressions, although so far we've just used it to match strings. Grep has a bunch of useful options:

-i    ignore case
-v    list lines not matching the pattern (invert)
-n    include line numbers
-q    don't print anything, just return 0 if found or 1 if not found
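
A short sketch of these four options, run against a few made-up lines of text rather than a real file:

```shell
#!/usr/bin/env bash
# Made-up sample text, one entry per line.
sample=$(printf '%s\n' 'Loyola University' 'chicago' 'Lake Shore campus' 'Water Tower campus')

ci=$(printf '%s\n' "$sample" | grep -i 'CAMPUS')      # -i: case is ignored
inv=$(printf '%s\n' "$sample" | grep -v 'campus')    # -v: lines NOT containing "campus"
num=$(printf '%s\n' "$sample" | grep -n 'chicago')   # -n: line number is prefixed

# -q prints nothing; only the exit status says whether a match was found.
if printf '%s\n' "$sample" | grep -q 'Loyola'; then
    found=yes
else
    found=no
fi

printf '%s\n' "-i:" "$ci" "-v:" "$inv" "-n:" "$num" "-q: found=$found"
```

The -q form is the one to use in if statements, since a script usually cares about the exit status and not the matching text.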

Regular expressions use the following metacharacters:

.      matches any one character
^      anchor for the start of the string or line; negation if the first character within [ ]
$      anchor for the end of the string or line
[ ]    for creating character ranges, just like with filename matching
( )    for regular grouping
{ }    match a regular subexpression a specific number of times
-      for character ranges
?      match a regular subexpression 0 or 1 times
*      match a regular subexpression 0 or more times
+      match a regular subexpression 1 or more times
|      between two alternative regular expressions, such as in January|Jan in the case example
\      for quoting one of these to use it as a literal character rather than a metacharacter

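Basic regular expressions (BRE) treat only ^ $ . [ ] * as metacharacters; grouping and repetition counts are available only in backslashed form, as \( \) and \{ \}. We will here be using extended regular expressions (ERE). These are not recognized by plain grep; you must either use egrep or grep -E.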

The file-regular-expression symbol "?" corresponds to the BRE ".": both match one character. As we will see below, the file-regular-expression symbol "*" corresponds to the BRE ".*": the "." means any single character, and the "*" means "the previous regular expression repeated 0 or more times". The file-regular-expression and BRE notations for character ranges are pretty similar.
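
The correspondence can be checked by matching one name against a file-name pattern and against the equivalent BRE. This is a sketch; the file name is hypothetical, and case is used for the file-name pattern because it applies the same matching rules as filename expansion.

```shell
#!/usr/bin/env bash
name='notes8.txt'    # hypothetical file name

# File-name pattern: ? is one character, * is any run of characters.
case $name in
    notes?.*) glob=match ;;
    *)        glob='no match' ;;
esac

# Equivalent BRE: . is one character, .* is any run of characters.
# The literal dot is escaped, and ^...$ is added because a file-name
# pattern must match the whole name while grep matches anywhere in the line.
if printf '%s\n' "$name" | grep -q '^notes.\..*$'; then
    regex=match
else
    regex='no match'
fi

echo "file-name pattern: $glob"
echo "regular expression: $regex"
```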

There are also "Perl-compatible regular expressions"; the library is known as PCRE. This is a different, richer regular-expression notation, though its basic operators overlap heavily with ERE.

As an example of alternation, in its simplest form, suppose we want to search the output of ps -ef for either "chrome" or "firefox". Then we can use

ps -ef | egrep 'chrome|firefox'

We need the quotes because '|' has special meaning to bash (it is the pipe symbol). Also, note that I used egrep instead of plain grep; egrep enables a much larger set of regular expressions. Plain grep doesn't work here at all (although grep -E is the same as egrep, and officially is preferred).
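
The difference between the two modes shows up clearly on a fixed input. The sketch below uses a made-up process listing in place of real ps -ef output, and grep -E in place of egrep.

```shell
#!/usr/bin/env bash
# Made-up lines standing in for the output of ps -ef.
sample_ps=$(printf '%s\n' \
    'alice  1201     1  0 09:00 ?      00:01:10 /usr/lib/firefox/firefox' \
    'alice  1350     1  0 09:05 ?      00:00:42 /opt/google/chrome/chrome' \
    'alice  1422  1400  0 09:07 pts/0  00:00:00 vim notes.txt')

# With -E the bar means "or", so both browser lines are counted.
ere_count=$(printf '%s\n' "$sample_ps" | grep -Ec 'chrome|firefox' || true)

# Without -E the bar is an ordinary character, so grep looks for the
# literal text "chrome|firefox" and finds nothing.
bre_count=$(printf '%s\n' "$sample_ps" | grep -c 'chrome|firefox' || true)

echo "grep -E: $ere_count matching line(s)"
echo "grep:    $bre_count matching line(s)"
```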

As an example of character ranges, if we wanted to search for "loyola" or "Loyola" with grep, the pattern "[Ll]oyola" would work.

The ^ character marks the start of the string, in most cases. But as the first character of a range, it means match the characters not in the list. So "[^aeiou]a[^a-s]" matches "cat", but not "eat" or "cab".
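
The three words from this example can be run through grep directly. This is a small sketch; LC_ALL=C is set so that the range a-s means the plain ASCII letters.

```shell
#!/usr/bin/env bash
results=''
for word in cat eat cab; do
    # [^aeiou] is any character that is not a vowel; [^a-s] is any
    # character outside the range a through s.
    if printf '%s\n' "$word" | LC_ALL=C grep -q '[^aeiou]a[^a-s]'; then
        results="$results $word:match"
    else
        results="$results $word:no"
    fi
done
results=${results# }    # drop the leading space
echo "$results"
```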

Named character classes:

[:alnum:]     The alphanumeric characters. In ASCII, equivalent to: [A-Za-z0-9]
[:word:]      The same as [:alnum:], with the addition of the underscore (_) character. This is not a POSIX class: bash accepts it in file-name patterns, but grep may reject it.
[:alpha:]     The alphabetic characters. In ASCII, equivalent to: [A-Za-z]
[:blank:]     Includes the space and tab characters.
[:cntrl:]     The ASCII control codes. Includes the ASCII characters 0 through 31 and 127.
[:digit:]     The numerals 0 through 9.
[:graph:]     The visible characters. In ASCII, it includes characters 33 through 126.
[:lower:]     The lowercase letters.
[:punct:]     The punctuation characters. In ASCII, equivalent to: [-!"#$%&'()*+,./:;<=>?@[\\\]^_`{|}~]
[:print:]     The printable characters. All the characters in [:graph:] plus the space character.
[:space:]     The whitespace characters including space, tab, carriage return, newline, vertical tab, and form feed. In ASCII, equivalent to: [ \t\r\n\v\f]
[:upper:]     The uppercase characters.
[:xdigit:]    Characters used to express hexadecimal numbers. In ASCII, equivalent to: [0-9A-Fa-f]

Warning: IBM mainframes use EBCDIC encoding, not ASCII, and in EBCDIC the alphabetic letters are not contiguous, so [a-z] also picks up non-letter characters that fall between the groups of letters. If you find yourself in an EBCDIC environment, stop. (You can also use the classes above, such as [[:lower:]], or [a-ij-rs-z].)

A more serious character-set issue is ASCII vs UTF-8. grep actually does fairly well with that. Try grep ": ." grepdemo2.text.

As with file-matching regular expressions, the bracketed class names above still need another set of square brackets around them to make them ranges: write [[:digit:]], not [:digit:].
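
A few of the classes in use, as a sketch. The check helper and the test strings are made up for the demo; the first pattern describes a C-style identifier and the second a hexadecimal number.

```shell
#!/usr/bin/env bash
# check PATTERN STRING prints "match" or "no match" (hypothetical helper).
check() {
    if printf '%s\n' "$2" | LC_ALL=C grep -Eq -e "$1"; then
        echo match
    else
        echo 'no match'
    fi
}

ident='^[[:alpha:]_][[:alnum:]_]*$'   # letter or _, then letters, digits or _
hex='^[[:xdigit:]]+$'                 # one or more hexadecimal digits

id_good=$(check "$ident" 'file_2')
id_bad=$(check "$ident" '2file')
hex_good=$(check "$hex" '1aF')
hex_bad=$(check "$hex" '1aG')

echo "file_2 as identifier: $id_good"
echo "2file as identifier:  $id_bad"
echo "1aF as hex:           $hex_good"
echo "1aG as hex:           $hex_bad"
```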

Grouping is often useful. If we wanted to search for "received" or "receiving", we can use the basic alternation form "received|receiving", or the equivalent but (maybe) simpler form with grouping, "receiv(ed|ing)".
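
To see exactly what the grouped form picks out, the sketch below uses grep's -o option, which prints only the matched text. The sample sentence is made up, and -o is an extension found in GNU and BSD grep, not something POSIX requires.

```shell
#!/usr/bin/env bash
text='We received the form while receiving calls from the receiver.'

# -o prints each match on its own line. "receiver" is not reported,
# because after "receiv" only "ed" or "ing" is allowed.
found=$(printf '%s\n' "$text" | grep -Eo 'receiv(ed|ing)')
printf '%s\n' "$found"
```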

Quantifiers

In filename matching, '*' stands for "zero or more characters". In ERE, * stands for "repeat the previous sub-expression zero or more times". So ".*" matches any character zero or more times. "a(b|c)*" matches a, ab, ac, abb, abbcbbcccb, etc.
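
A sketch of the a(b|c)* example. The check helper is hypothetical. Note that the anchors ^ and $ are added so that the whole line has to fit the pattern; without them grep only needs a match somewhere in the line.

```shell
#!/usr/bin/env bash
# check PATTERN STRING prints "match" or "no match" (hypothetical helper).
check() {
    if printf '%s\n' "$2" | grep -Eq -e "$1"; then
        echo match
    else
        echo 'no match'
    fi
}

anchored='^a(b|c)*$'
r_a=$(check "$anchored" 'a')              # zero repetitions of (b|c)
r_long=$(check "$anchored" 'abbcbbcccb')  # many repetitions
r_abd=$(check "$anchored" 'abd')          # d is neither b nor c

# Unanchored, "abd" matches after all: the leading "a" followed by
# zero or more of b or c is found inside it.
r_loose=$(check 'a(b|c)*' 'abd')

echo "a: $r_a"
echo "abbcbbcccb: $r_long"
echo "abd (anchored): $r_abd"
echo "abd (unanchored): $r_loose"
```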

There is also ?, meaning "match zero or one times (but not more)", and +, which means "match one or more times". So the following matches signed integers:

[+-]?[0-9]+

That is, the + or - at the start is optional, and there has to be at least one digit.
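
A sketch that tries a signed-integer pattern of this kind on a handful of strings. The pattern [+-]?[0-9]+ is wrapped in ^ and $ here, an addition for the demo, so that the entire string must be an integer; the check helper is hypothetical.

```shell
#!/usr/bin/env bash
# check PATTERN STRING prints "match" or "no match" (hypothetical helper).
check() {
    if printf '%s\n' "$2" | grep -Eq -e "$1"; then
        echo match
    else
        echo 'no match'
    fi
}

int='^[+-]?[0-9]+$'    # optional sign, then at least one digit

r_plain=$(check "$int" '42')
r_minus=$(check "$int" '-7')
r_plus=$(check "$int" '+15')
r_sign_only=$(check "$int" '+')      # no digits at all
r_decimal=$(check "$int" '4.2')      # the dot is not a digit
r_double=$(check "$int" '--3')       # only one sign is allowed

for pair in "42:$r_plain" "-7:$r_minus" "+15:$r_plus" \
            "+:$r_sign_only" "4.2:$r_decimal" "--3:$r_double"; do
    echo "$pair"
done
```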

Shotts has this regular expression for matching phone numbers, either in the (nnn) nnn-nnnn form or the nnn nnn-nnnn form:
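
The expression itself did not survive in these notes. The pattern in the sketch below is a reconstruction that fits the description (optional parentheses around the area code, then a space), and it may not be the book's exact text. The check helper and the sample numbers are made up.

```shell
#!/usr/bin/env bash
# check PATTERN STRING prints "match" or "no match" (hypothetical helper).
check() {
    if printf '%s\n' "$2" | grep -Eq -e "$1"; then
        echo match
    else
        echo 'no match'
    fi
}

# \(? and \)? are optional literal parentheses; the space after the
# area code is required.
phone='^\(?[0-9][0-9][0-9]\)? [0-9][0-9][0-9]-[0-9][0-9][0-9][0-9]$'

p_paren=$(check "$phone" '(555) 123-4567')
p_space=$(check "$phone" '555 123-4567')
p_dash=$(check "$phone" '555-123-4567')

echo "(555) 123-4567: $p_paren"
echo "555 123-4567:   $p_space"
echo "555-123-4567:   $p_dash"
```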

I only use nnn-nnn-nnnn. You could fix this by changing that first space to ( |-), but that would also accept (nnn)-nnn-nnnn. Realistically you would need grouping (note that I am using the {n} quantifier from below):
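
The expression is missing here as well. One pattern that behaves the way this paragraph describes, offered as a sketch and not as the original, puts the two area-code styles into a group with alternation: either a parenthesized area code followed by a space, or a bare area code followed by a space or a dash. The check helper and the sample numbers are made up.

```shell
#!/usr/bin/env bash
# check PATTERN STRING prints "match" or "no match" (hypothetical helper).
check() {
    if printf '%s\n' "$2" | grep -Eq -e "$1"; then
        echo match
    else
        echo 'no match'
    fi
}

# The simple tweak: optional parentheses, then a space or a dash.
tweak='^\(?[0-9]{3}\)?( |-)[0-9]{3}-[0-9]{4}$'

# The grouped version: "(nnn) " or "nnn " or "nnn-", then nnn-nnnn.
grouped='^(\([0-9]{3}\) |[0-9]{3}[ -])[0-9]{3}-[0-9]{4}$'

t_mixed=$(check "$tweak" '(555)-123-4567')     # accepted by the tweak
g_mixed=$(check "$grouped" '(555)-123-4567')   # rejected by the grouped form
g_paren=$(check "$grouped" '(555) 123-4567')
g_space=$(check "$grouped" '555 123-4567')
g_dash=$(check "$grouped" '555-123-4567')

echo "tweak,   (555)-123-4567: $t_mixed"
echo "grouped, (555)-123-4567: $g_mixed"
echo "grouped, (555) 123-4567: $g_paren"
echo "grouped, 555 123-4567:   $g_space"
echo "grouped, 555-123-4567:   $g_dash"
```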

There is also multiline mode. In that mode, "^" and "$" match the beginning and end of each line, versus the beginning and end of the entire string. We won't cover this.

Comp 141 Class 8 notes
