Some people, when confronted with a problem, think "I know, I'll use regular expressions." Now they have two problems.
July 18
Shotts:
createfile.py
prefix*badfile
ls vs badls
Newlines in file names (that are listed as such by ls) make it impossible to write reliable shell scripts that use the output of ls. Even something simple like
for i in $(ls)
would fail: if a directory contains files "a", "b\nc" and "d" (where \n is a newline) then a newline-printing ls would produce
a
b
c
d
which is the wrong list!
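So GNU ls today will, as far as I can tell, never output a literal newline in a filename when it is writing to a terminal; it prints a quoted or escaped form instead. When its output goes to a pipe or to $(ls), the raw newline is still written, so the loop above remains unreliable.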
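The failure can be reproduced without ls at all. The following is a minimal sketch, assuming a Unix-like filesystem that allows newlines in file names; the helper names are made up for illustration. It builds the directory from the example, then compares parsing a newline-separated listing with asking the operating system for the names directly.

```python
import os
import tempfile

def names_via_listing_text(directory):
    """Mimic `for i in $(ls)`: build a text listing, then split it on whitespace."""
    listing = "\n".join(sorted(os.listdir(directory)))  # one name per line
    return listing.split()  # word splitting, roughly what the shell does

def names_via_api(directory):
    """Ask the OS for the names directly; no text parsing involved."""
    return sorted(os.listdir(directory))

with tempfile.TemporaryDirectory() as d:
    for name in ("a", "b\nc", "d"):
        open(os.path.join(d, name), "w").close()
    parsed = names_via_listing_text(d)
    actual = names_via_api(d)

print(parsed)  # four entries: the name containing a newline was split in two
print(actual)  # three entries, one of which contains the newline
```

The same reasoning is why shell loops over files are usually written with a glob (for i in *) rather than with the output of ls.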
strings
Try echo '\n'. It does not print a newline! If you want newlines, use echo $'\n'. What about $"\n"? No.
This is a pure bashism.
IFS and ifsdemo
hello.c:
#include <stdio.h>
#include "hello.h"
int main(void) {
    printf("hello, world!\n");
    return 0;   /* report success to the shell */
}
gcc hello.c
gcc -o hello hello.c
make and Makefile
hello: hello.c hello.h
	gcc -o hello hello.c
What make actually does: touch hello.h
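The existence of .h files makes dependencies complicated in C: a .c file must be recompiled whenever any header it includes changes. Make solves this problem by comparing file modification times.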
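The core timestamp rule can be sketched in a few lines. This is only an illustration of the idea, not how make is implemented; the function name needs_rebuild is made up, and real make also handles implicit rules, phony targets and chains of dependencies.

```python
import os
import tempfile

def needs_rebuild(target, prerequisites):
    """True if the target is missing or older than any prerequisite."""
    if not os.path.exists(target):
        return True
    target_mtime = os.path.getmtime(target)
    return any(os.path.getmtime(p) > target_mtime for p in prerequisites)

with tempfile.TemporaryDirectory() as d:
    paths = {n: os.path.join(d, n) for n in ("hello.c", "hello.h", "hello")}
    for i, n in enumerate(("hello.c", "hello.h", "hello")):
        open(paths[n], "w").close()
        os.utime(paths[n], (1000 + i, 1000 + i))  # hello ends up newest
    deps = [paths["hello.c"], paths["hello.h"]]

    before = needs_rebuild(paths["hello"], deps)
    os.utime(paths["hello.h"], (2000, 2000))      # the effect of `touch hello.h`
    after = needs_rebuild(paths["hello"], deps)

print(before, after)
```

Before the touch, hello is newer than both prerequisites and nothing needs to be done; after it, hello.h is newer than hello, so the gcc command would be run again.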
The IDE vs make debate
pretty much all platforms support make
it's straightforward to add complicated testing or other alternative builds to Makefiles
The command line gives greater execution flexibility than GUIs
On the other hand, Makefiles can be confusing
Makefile and leading tab
diction
download diction-1.11.tar.gz from ftp.gnu.org/gnu/diction/
tar xzf diction-1.11.tar.gz
cd diction-1.11
./configure
make
now touch one of the source files and make again
make install
signatures
gpg --import gnu-keyring.gpg    # keyring is from ftp.gnu.org/gnu
gpg --verify diction-1.11.tar.gz.sig
The trust rabbit hole: WARNING: This key is not certified with a trusted signature!
configure options: Search for "Installation directory options"
ImageMagick
git clone https://github.com/ImageMagick/ImageMagick.git
cd ImageMagick
./configure
make
Make and Java: yes, you can do it, but Java files pretty much all compile independently. Still, make is helpful in that you can use it to recompile only the files you changed.
Regular expressions are a way to describe a set of strings. We've already seen them. First, the file-name matching is a form of regular expression: * matches any number of characters, ? matches one character, and [m-p] matches the letters m, n, o, or p. But note that this file-name matching uses a simplified form of regular expressions.
In 1997, Jamie Zawinski (then a dev at Netscape) posted the following on Usenet:
Some people, when confronted with a problem, think "I know, I'll use regular expressions." Now they have two problems.
A big part of the trouble with regular expressions is that they are so hard to read. There is no place for extraneous whitespace; blanks are, of course, significant characters. There is no way to embed comments. One does get somewhat used to the syntax.
The grep command is also all about matching regular expressions, although so far we've just used it to match strings. Grep has a bunch of useful options:
-i    ignore case
-v    list lines not matching the pattern (invert)
-n    include line numbers
-q    don't print anything, just return 0 if found or 1 if not found
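To make the four options concrete, here is a rough sketch of their behavior. The functions grep and grep_quiet and the sample lines are invented for this illustration; real grep has many more options and reads files or standard input.

```python
import re

def grep(pattern, lines, ignore_case=False, invert=False, line_numbers=False):
    """Return the selected lines, roughly like grep with -i, -v and -n."""
    flags = re.IGNORECASE if ignore_case else 0
    regex = re.compile(pattern, flags)
    selected = []
    for number, line in enumerate(lines, start=1):
        matched = regex.search(line) is not None
        if matched != invert:  # -v flips which lines are kept
            selected.append(f"{number}:{line}" if line_numbers else line)
    return selected

def grep_quiet(pattern, lines, **options):
    """Like -q: no output, only an exit status (0 = found, 1 = not found)."""
    return 0 if grep(pattern, lines, **options) else 1

sample = ["Loyola University", "chicago", "LOYOLA"]  # made-up input

print(grep("loyola", sample, ignore_case=True))               # like grep -i
print(grep("loyola", sample, ignore_case=True, invert=True))  # like grep -i -v
print(grep("LOY", sample, line_numbers=True))                 # like grep -n
print(grep_quiet("nowhere", sample))                          # like grep -q
```

The -q form is the one to use in shell if statements, where only the exit status matters.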
Here are the "metacharacters" that have special meaning in matches:
.      matches any one character
^      anchor for the start of the string or line; negation if the first character within [ ]
$      anchor for the end of the string or line
[ ]    for creating character ranges, just like with filename matching
( )    for grouping
{ }    match the preceding subexpression a specific number of times
-      for character ranges, within [ ]
?      match the preceding subexpression 0 or 1 times
*      match the preceding subexpression 0 or more times
+      match the preceding subexpression 1 or more times
|      between two alternative regular expressions, such as in January|Jan in the case example
\      for quoting one of these to use it as a literal character rather than a metacharacter
Basic regular expressions (BRE) use only ^ $ . [ ] *. We will here be using extended regular expressions (ERE). These are not recognized by plain grep; you must either use egrep or grep -E.
The file-regular-expression symbol "?" corresponds to the BRE ".": both match one character. As we will see below, the file-regular-expression symbol "*" corresponds to the BRE ".*": the "." means any single character, and the "*" means "the previous regular expression repeated 0 or more times". The file-regular-expression and BRE notations for character ranges are pretty similar.
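The correspondence can be checked mechanically. The sketch below uses Python's fnmatch module for the file-name patterns and its re module for the regular expressions; the file names are made up. Python's regular-expression syntax is Perl-style rather than BRE, but for these simple patterns it behaves the same way.

```python
import re
from fnmatch import fnmatchcase

names = ["cat", "cart", "at", "bat.txt"]  # made-up file names

# Each pair is (file-name pattern, corresponding regular expression).
pairs = [("?at", ".at"), ("*at", ".*at"), ("[a-c]at*", "[a-c]at.*")]

results = {}
for glob, regex in pairs:
    by_glob = [n for n in names if fnmatchcase(n, glob)]
    # fullmatch, because a file-name pattern must match the whole name
    by_regex = [n for n in names if re.fullmatch(regex, n)]
    results[glob] = (by_glob, by_regex)
    print(glob, regex, by_glob, by_regex)
```

Note the use of fullmatch: file-name patterns always have to match the entire name, while grep is satisfied by a match anywhere in the line.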
There are also "Perl-compatible regular expressions"; the library is known as pcre. This notation is largely a superset of ERE, adding features such as the shorthand classes \d and \w, non-greedy quantifiers, and lookahead.
As an example of alternation, in its simplest form, suppose we want to search the output of ps -ef for either "chrome" or "firefox". Then we can use
ps -ef | egrep 'chrome|firefox'
We need the quotes because '|' has special meaning to bash (it is the pipe symbol). Also, note that I used egrep instead of plain grep; egrep enables a much larger set of regular expressions. Plain grep doesn't work here at all (although grep -E is the same as egrep, and officially is preferred).
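Here is a small sketch of the difference between alternation and a literal search. The lines are invented stand-ins for ps -ef output, and Python's re module stands in for egrep; for this pattern the two agree.

```python
import re

# Made-up lines standing in for `ps -ef` output.
ps_lines = [
    "alice   1201     1  0 09:00 ?  00:00:05 /usr/bin/firefox",
    "alice   1344     1  0 09:02 ?  00:00:09 /opt/google/chrome/chrome",
    "root     812     1  0 08:55 ?  00:00:00 /usr/sbin/sshd",
]

# With alternation, | separates two alternative patterns.
either = re.compile(r"chrome|firefox")
hits = [line for line in ps_lines if either.search(line)]

# Without it, | is an ordinary character, which is how plain grep reads it.
literal_hits = [line for line in ps_lines if "chrome|firefox" in line]

print(len(hits), len(literal_hits))
```

The literal search finds nothing, because no line contains the eight letters "chrome|f..." with an actual vertical bar in the middle.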
As an example of character ranges, if we wanted to search for "loyola" or "Loyola" with grep, the pattern "[Ll]oyola" would work.
The ^ character marks the start of the string, in most cases. But as the first character of a range, it means match the characters not in the list. So "[^aeiou]a[^a-s]" matches "cat", but not "eat" or "cab".
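Both roles of ^ can be tried out directly. This is a minimal sketch using Python's re module, which treats ^ the same way as ERE in these two positions.

```python
import re

# Inside [ ] and first: negation. Not a vowel, then "a", then not a..s.
negated = re.compile(r"[^aeiou]a[^a-s]")
results = {word: bool(negated.search(word)) for word in ["cat", "eat", "cab"]}
print(results)

# Outside [ ]: anchor. "at" occurs in "cat", but not at the start of it.
unanchored = bool(re.search(r"at", "cat"))
anchored = bool(re.search(r"^at", "cat"))
print(unanchored, anchored)
```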
There are also some built-in ranges:
[:alnum:]    The alphanumeric characters. In ASCII, equivalent to: [A-Za-z0-9]
[:word:]     The same as [:alnum:], with the addition of the underscore (_) character. (This is not a POSIX class; as far as I can tell bash file-name patterns accept it but grep does not.)
[:alpha:]    The alphabetic characters. In ASCII, equivalent to: [A-Za-z]
[:blank:]    Includes the space and tab characters.
[:cntrl:]    The ASCII control codes. Includes the ASCII characters 0 through 31 and 127.
[:digit:]    The numerals 0 through 9.
[:graph:]    The visible characters. In ASCII, it includes characters 33 through 126.
[:lower:]    The lowercase letters.
[:punct:]    The punctuation characters. In ASCII, equivalent to: [-!"#$%&'()*+,./:;<=>?@[\\\]^_`{|}~]
[:print:]    The printable characters. All the characters in [:graph:] plus the space character.
[:space:]    The whitespace characters including space, tab, carriage return, newline, vertical tab, and form feed. In ASCII, equivalent to: [ \t\r\n\v\f]
[:upper:]    The uppercase characters.
[:xdigit:]   Characters used to express hexadecimal numbers. In ASCII, equivalent to: [0-9A-Fa-f]
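Several of these classes are defined by ASCII code ranges, so they can be rebuilt from the numbers in the table. The following is a sketch for ASCII only; the helper chars is made up, and the last line compares the result with Python's own string.punctuation constant.

```python
import string

def chars(first, last):
    """All characters with codes first..last, inclusive."""
    return {chr(code) for code in range(first, last + 1)}

alnum = chars(ord("A"), ord("Z")) | chars(ord("a"), ord("z")) | chars(ord("0"), ord("9"))
cntrl = chars(0, 31) | {chr(127)}
graph = chars(33, 126)
printable = graph | {" "}   # [:print:] is [:graph:] plus the space
punct = graph - alnum       # visible characters that are not letters or digits

print(len(cntrl), len(graph), len(printable), len(punct))
print(punct == set(string.punctuation))
```

In ASCII, then, [:punct:] is simply everything visible that is not in [:alnum:].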
Warning: IBM mainframes use EBCDIC encoding, not ASCII, and in EBCDIC the alphabetic letters are not contiguous. So [a-z] fails. So if you find yourself in an EBCDIC environment, stop. (You can also use the named classes above, such as [:lower:], or [a-ij-rs-z].)
A more serious character-set issue is ASCII vs UTF-8. grep actually does fairly well with that. Try grep ": ." grepdemo2.text.
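As with file-matching regular expressions, the square brackets above still need another set of square brackets to make them ranges. For example, a single digit is matched by [[:digit:]], not by [:digit:], and a letter or underscore by [[:alpha:]_].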
Grouping is often useful. If we wanted to search for "received" or "receiving", we can use the basic alternation form "received|receiving", or the equivalent but (maybe) simpler form with grouping, "receiv(ed|ing)"
In filename matching, '*' stands for "zero or more characters". In ERE, * stands for "repeat the previous sub-expression zero or more times". So ".*" matches any character zero or more times. "a(b|c)*" matches a, ab, ac, abb, abbcbbcccb, etc.
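The examples above can be checked with a short sketch. Python's re module is used here in place of grep -E; for this pattern the two behave the same. The candidate strings beyond those listed above are made up.

```python
import re

star = re.compile(r"a(b|c)*")
candidates = ["a", "ab", "ac", "abb", "abbcbbcccb", "abd", "ba"]

# The whole string must be an "a" followed by any mix of b's and c's.
whole = [s for s in candidates if star.fullmatch(s)]

# grep-style search: a match anywhere in the string is enough.
anywhere = [s for s in candidates if star.search(s)]

print(whole)
print(anywhere)
```

Because grep looks for a match anywhere in the line, "abd" and "ba" are also selected: each contains an "a". To require that the whole line match, anchor the pattern as ^a(b|c)*$.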
There is also ?, meaning "match zero or one times (but not more)", and +, which means "match one or more times". So the following matches signed integers:
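(\+|-)?[[:digit:]]+
That is, the + or - at the start is optional, and there has to be at least one digit. The first + is escaped with \ so that it is a literal plus sign rather than the repetition operator, and [:digit:] needs its second set of square brackets.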
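A quick way to test a pattern like this is to run it against strings that should and should not be accepted. This sketch uses Python's re module, which does not support the [:digit:] notation, so [0-9] is substituted; the literal plus sign is written as \+. The test strings are made up.

```python
import re

signed_int = re.compile(r"(\+|-)?[0-9]+")

tests = ["42", "+42", "-7", "+-7", "4.2", "-", ""]
accepted = [t for t in tests if signed_int.fullmatch(t)]
print(accepted)
```

The sign alone is rejected because at least one digit is required, and "+-7" is rejected because ? allows at most one sign.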
Shotts has this regular expression for matching phone numbers, either in the (nnn) nnn-nnnn form or the nnn nnn-nnnn form:
^\(?[0-9][0-9][0-9]\)? [0-9][0-9][0-9]-[0-9][0-9][0-9][0-9]$
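I only use nnn-nnn-nnnn. You could allow this by changing that first space to ( |-), but that would also accept (nnn)-nnn-nnnn. Realistically you would need grouping (note that I am using the {n} quantifier from below). The outer parentheses around the two area-code forms are needed because | has the lowest precedence; without them the pattern splits into two unrelated halves:
^((\([0-9]{3}\) )|([0-9]{3}( |-)))[0-9]{3}-[0-9]{4}$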
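Patterns with alternation are easy to get subtly wrong, so it is worth testing them against both good and bad inputs. The sketch below, using Python's re module and made-up phone numbers, compares a version with an enclosing group around the two area-code forms against a version without it.

```python
import re

# The two area-code alternatives are wrapped in one outer group.
grouped = re.compile(r"^((\([0-9]{3}\) )|([0-9]{3}( |-)))[0-9]{3}-[0-9]{4}$")

# Without the outer group, | splits the whole pattern into two halves:
#   ^(\([0-9]{3}\) )     or     ([0-9]{3}( |-))[0-9]{3}-[0-9]{4}$
ungrouped = re.compile(r"^(\([0-9]{3}\) )|([0-9]{3}( |-))[0-9]{3}-[0-9]{4}$")

samples = ["(312) 555-1234", "312 555-1234", "312-555-1234",
           "(312)-555-1234", "312 555-12345", "(312) junk"]

ok_grouped = [s for s in samples if grouped.search(s)]
ok_ungrouped = [s for s in samples if ungrouped.search(s)]

print(ok_grouped)
print(ok_ungrouped)
```

The ungrouped version accepts "(312) junk", since its first half only asks for a parenthesized area code at the start of the line.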
There is also { }, which lets you match a specific number of times:
{3}      match the preceding subexpression exactly 3 times
{3,}     match the preceding subexpression 3 or more times
{3,6}    match the preceding subexpression between 3 and 6 times
{,3}     match the preceding subexpression between 0 and 3 times
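The four forms can be compared by asking which repetition counts each one accepts. This is a small sketch using Python's re module, whose { } quantifiers follow the same rules; the helper accepted_lengths is made up for the illustration.

```python
import re

def accepted_lengths(quantifier, longest=8):
    """Which runs of 'a' (0 to `longest` long) does a<quantifier> match entirely?"""
    regex = re.compile("a" + quantifier)
    return [n for n in range(longest + 1) if regex.fullmatch("a" * n)]

for q in ("{3}", "{3,}", "{3,6}", "{,3}"):
    print(q, accepted_lengths(q))
```

The open-ended form {3,} accepts every length from 3 up to the limit of the test, and {,3} includes length 0, that is, the empty string.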
There is also multiline mode. In that mode, "^" and "$" match the beginning and end of each line, versus the beginning and end of the entire string. We won't cover this.