Comp 163 Week 13 notes

Week of April 18

Finite-State Machines

("machine" == "automaton", by the way)

Inputs, by the way, can be:

individual letters, as in regular expressions
the "tokens" of a computer language
something more abstract, representing events (as in TCP)

From last week:

Regular expression: * means repeat 0 or more times, ? means either 0 or 1 times

b a* c
b? a* c?
1 (01*0)* 1 (supposedly odd binary numbers divisible by 3)
[a-z][a-z0-9]* (for programming-language identifiers)
[0-9]*(.[0-9]*)?(e[0-9]*)? (for floating-point numbers, e.g. 12.345e67)

What strings match these?

What does a finite-state recognizer for these look like?

More examples of regular expressions:

These use slightly extended regexes (The google example does not support + or *)

\d matches any digit 0-9, same as [0-9]
\W matches anything other than a letter, digit or underscore, same as [^a-zA-Z0-9_]
\s matches a whitespace character (space, tab, newline)
^ matches the start of the line; $ matches the end of the line
{3,6} means that the single-character item preceding it must match between 3 and 6 times
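
A quick way to try the extended notation above is Python's re module. This is only a sketch in one regex dialect (see the warning about differing standards); the test strings are made up for illustration.

```python
import re

# \d with {3,6}: between 3 and 6 digits and nothing else
digits_ok = re.fullmatch(r"\d{3,6}", "12345") is not None
digits_short = re.fullmatch(r"\d{3,6}", "12") is not None

# \W: anything other than a letter, digit or underscore
non_word = re.findall(r"\W", "a_b c!")

# ^ and $ anchor the pattern to the whole line; \s* allows only whitespace
blank_line = re.search(r"^\s*$", " \t ") is not None

print(digits_ok, digits_short, non_word, blank_line)
# prints: True False [' ', '!'] True
```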

What does varname\W*=[^=] match?

Warning: there are quite a few different standards for regular expressions. Always read the documentation.

Let's call the finite-state recognizers finite automata. So far the finite-state recognizers have all been deterministic: we never have a state with two outgoing edges, going two different directions, that are labeled with the same input. A deterministic finite automaton is abbreviated DFA.

How about b (ab)* a? There's a difference here. Now we do have a state with two different edges labeled 'a'. Such an automaton is known as nondeterministic, that is, as an NFA. We can still use an NFA to match inputs, but now what do we do if we're at a vertex and there are multiple edges that match the current input?

There are two primary approaches. The first is to try one of the edges first, and see if that works. If it does not, we backtrack to the vertex in question and at that point try the next edge. This approach does work, but with a poorly chosen regular expression it may be extremely slow. Consider the regular expression (a?)ⁿ aⁿ. This means up to n optional a's, followed by n a's. Let us match against aⁿ, so that none of the optional a's can be used. The usual strategy when matching "a?" is to try the "a" branch first, and only if that fails do we try the empty branch. But that now means that we will have 2ⁿ - 1 false branches before we finally succeed.

Example: (a?)³ a³.
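
The backtracking strategy can be sketched as below, assuming a simplified pattern format invented here: a list of (character, optional) pairs. The function names are hypothetical. It counts recursive calls so the growth with n can be seen directly.

```python
def backtrack_match(pattern, text):
    """pattern is a list of (char, optional) pairs; returns (matched, calls)."""
    calls = 0

    def go(i, j):
        nonlocal calls
        calls += 1
        if i == len(pattern):
            return j == len(text)          # must consume the whole input
        ch, optional = pattern[i]
        # Try the "a" branch first ...
        if j < len(text) and text[j] == ch and go(i + 1, j + 1):
            return True
        # ... and only if that fails, try the empty branch of an optional item.
        return optional and go(i + 1, j)

    return go(0, 0), calls


def demo(n):
    """Match a^n against (a?)^n a^n."""
    pattern = [("a", True)] * n + [("a", False)] * n
    return backtrack_match(pattern, "a" * n)


for n in (3, 6, 9, 12):
    matched, calls = demo(n)
    print(n, matched, calls)
```

The match always succeeds, but the call count roughly doubles each time n grows by one.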

A much faster approach is to use the NFA with state sets, rather than individual states. That is, when we are in state S1 and the next input can lead either to state S2 or state S3, we record the new state as {S2,S3}. If, for the next input, S2 can go to S4 and S3 can go to either S5 or S6, the next state set is {S4,S5,S6}. This approach might look exponential, but the number of NFA states is fixed, so a state set can never hold more than that many states, and the work per input symbol is bounded.

Example: (a?)³ a³.
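
A minimal sketch of the state-set idea for (a?)ⁿ aⁿ, under an assumed state numbering chosen here: states 0..2n in a row, where states 0..n-1 have both an 'a' edge and an empty edge to the next state, and states n..2n-1 have only an 'a' edge. State 2n accepts. The function also reports the largest state set seen.

```python
def stateset_match(n, text):
    """Return (matched, largest_state_set_size) for (a?)^n a^n."""
    accept = 2 * n

    def closure(states):
        # States 0..n-1 may also move forward on the empty string (skipping an a?).
        result = set(states)
        for s in states:
            while s < n:
                s += 1
                result.add(s)
        return result

    current = closure({0})
    largest = len(current)
    for ch in text:
        if ch != "a":
            return False, largest
        # Every non-accepting state has an 'a' edge to the next state.
        current = closure({s + 1 for s in current if s < accept})
        largest = max(largest, len(current))
    return accept in current, largest


print(stateset_match(3, "aaa"))
print(stateset_match(25, "a" * 25))
```

Even for n = 25 the loop does one pass over the input, and the state set never exceeds 2n+1 entries.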

See also https://swtch.com/~rsc/regexp/regexp1.html, "Regular expression search algorithms", the paragraph beginning "A more efficient ...."

By the way, a much better regular expression for between n and 2n a's in a row is aⁿ (a?)ⁿ. We parse n a's at the beginning, and the optional a's are all following.

The implementation of an NFA/DFA recognizer does literally use the graph approach: for each current state, and each next-input symbol, we look up what next states are possible with that input symbol. The code to drive the NFA/DFA does not need to be changed for different NFA/DFAs. This is a big win from a software-engineering perspective.
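
As a sketch of the table-driven idea for the deterministic case: the driver below never changes, only the transition table does. The state numbers and table layout are assumptions made for this example; the two tables are meant to recognize b a* c and b? a* c? from last week.

```python
def run_dfa(table, start, accepting, text):
    """Generic DFA driver: table maps (state, symbol) -> next state."""
    state = start
    for ch in text:
        state = table.get((state, ch))
        if state is None:          # no edge for this input: reject
            return False
    return state in accepting


# b a* c : 0 --b--> 1, 1 --a--> 1, 1 --c--> 2 (accepting)
BAC = {(0, "b"): 1, (1, "a"): 1, (1, "c"): 2}

# b? a* c? : every state is accepting, since each part may be empty
OPT = {(0, "b"): 1, (0, "a"): 1, (0, "c"): 2, (1, "a"): 1, (1, "c"): 2}

print(run_dfa(BAC, 0, {2}, "baac"))        # True
print(run_dfa(BAC, 0, {2}, "ac"))          # False
print(run_dfa(OPT, 0, {0, 1, 2}, "ac"))    # True
```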

TCP state diagram: intronetworks.cs.luc.edu/current2/html/tcpA.html#tcp-state-diagram. Note the additional software-engineering issue of this being a distributed system.

Wednesday

Study guide

Some people, when confronted with a problem, think
“I know, I'll use regular expressions.” Now they have two problems.

-- Jamie Zawinski (regex.info/blog/2006-09-15/247)

Stop Validating Email Addresses with Regex: davidcel.is/posts/stop-validating-email-addresses-with-regex.

How about even more problems? jimbly.github.io/regex-crossword.

TCP kernel implementation, tcp_ipv4.c: tcp_v4_do_rcv(), tcp_seq_next(), tcp_seq_stop(), tcp_v4_err()

tcp_input.c: tcp_rcv_state_process()

Also: the regex option in the gedit search box and the Eclipse search box

One more example of NFA state-set recognizer: aaa|aab|aac|aad

         +----a->(3)--a->(7)
         |
         | /--a->(4)--b->(8)
(1)--a->(2)
         | \--a->(5)--c->(9)
         |
         +----a->(6)--d->(10)
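
The diagram can be traced with state sets as in the sketch below. The edge table is transcribed from the diagram; the dictionary layout and the function name are assumptions for illustration.

```python
# (state, input) -> set of possible next states, read off the diagram
EDGES = {
    (1, "a"): {2},
    (2, "a"): {3, 4, 5, 6},
    (3, "a"): {7},
    (4, "b"): {8},
    (5, "c"): {9},
    (6, "d"): {10},
}
ACCEPT = {7, 8, 9, 10}


def trace(text):
    """Return the list of state sets visited while reading text."""
    current = {1}
    history = [current]
    for ch in text:
        current = set().union(*(EDGES.get((s, ch), set()) for s in current))
        history.append(current)
    return history


steps = trace("aab")
print(steps)                       # [{1}, {2}, {3, 4, 5, 6}, {8}]
print(bool(steps[-1] & ACCEPT))    # True
```

After the second 'a' all four branches are alive at once; the third input picks out the one that survives.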

NFA to DFA

It is also possible to convert any NFA to a DFA. The catch is that if there are n states in the NFA, there might be 2ⁿ states in the DFA.

Subset construction: DFA states are all sets of NFA states. Given such a set, and an input, we form the set of all states reachable on that input from any of the NFA states in the set.
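
A minimal sketch of the subset construction, assuming an NFA without empty edges stored as a (state, input) -> set-of-states table. Only the sets reachable from the start state are built, and the empty set (the dead state) is left implicit. The example NFA is one possible encoding of b (ab)* a, with state numbers chosen here.

```python
from collections import deque


def subset_construction(nfa, start, accepting, alphabet):
    """Build the reachable part of the DFA whose states are sets of NFA states."""
    start_set = frozenset({start})
    dfa = {}
    seen = {start_set}
    queue = deque([start_set])
    while queue:
        S = queue.popleft()
        for ch in alphabet:
            # All NFA states reachable on ch from any state in S
            T = frozenset(t for s in S for t in nfa.get((s, ch), ()))
            if not T:
                continue
            dfa[(S, ch)] = T
            if T not in seen:
                seen.add(T)
                queue.append(T)
    dfa_accepting = {S for S in seen if S & accepting}
    return dfa, start_set, dfa_accepting, seen


# b (ab)* a : 0 --b--> 1, 1 --a--> 2 or 3, 2 --b--> 1; state 3 accepts
NFA = {(0, "b"): {1}, (1, "a"): {2, 3}, (2, "b"): {1}}
dfa, dfa_start, dfa_accepting, dfa_states = subset_construction(NFA, 0, {3}, "ab")


def dfa_accepts(text):
    S = dfa_start
    for ch in text:
        S = dfa.get((S, ch))
        if S is None:
            return False
    return S in dfa_accepting


print(sorted(sorted(S) for S in dfa_states))   # [[0], [1], [2, 3]]
print(dfa_accepts("baba"), dfa_accepts("bab"))  # True False
```

Here the DFA is as small as the NFA; the 2ⁿ blowup is a worst case, not the typical one.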

Elliptic curve cryptography

Graph of y² = x³ + Ax + B (the (short) Weierstrass form)

What does this have to do with an ellipse?

Elliptic product a⊕b: the graphical construction over R

Adding a point at infinity

See Boneh & Shoup p 614 (of version 0.5): "The Addition Law" (toc.cryptobook.us, chapter 14 "Elliptic curve cryptography")

Note that if you have two roots r₁ and r₂ of a cubic ax³ + bx² + cx + d, then the product of all three roots is -d/a, and so r₃ = -d/(a r₁ r₂).
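
A small worked check of the root trick, using exact rational arithmetic. The curve y² = x³ + 17 and the points (-2,3) and (-1,4) are chosen here purely for illustration; the function name is hypothetical. The line through the two points is substituted into the curve, and the third root of the resulting cubic comes from the product of the roots.

```python
from fractions import Fraction


def third_point(P, Q, A, B):
    """Third intersection of the line through P and Q with y^2 = x^3 + A*x + B."""
    (x1, y1), (x2, y2) = P, Q
    m = Fraction(y2 - y1, x2 - x1)   # chord slope (assumes x1 != x2)
    c = y1 - m * x1                  # the line is y = m*x + c
    # Substituting the line into the curve gives
    #   x^3 - m^2 x^2 + (A - 2*m*c) x + (B - c^2) = 0,
    # so the leading coefficient is 1 and the constant term is d = B - c^2.
    d = B - c * c
    x3 = -d / (x1 * x2)              # product of roots is -d (assumes x1*x2 != 0)
    return (x3, m * x3 + c)


R = third_point((-2, 3), (-1, 4), 0, 17)
print(R)                 # third intersection, (4, 9)
print((R[0], -R[1]))     # reflect in the x-axis to get the elliptic product
```

Here the line is y = x + 5, the cubic is x³ - x² - 10x - 8, and the roots are -2, -1 and 4.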

Finite fields: graui.de/code/elliptic2.

Find the finite-field generator g (or base b)

Taking multiples of g: kg = g⊕g⊕...⊕g, k times. Repeated-squaring algorithm (in this additive notation it becomes repeated doubling, or "double-and-add")
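
A sketch of computing kg by repeated doubling over a small finite field. The toy curve y² = x³ + 2x + 3 over F_97 and the point (3, 6) are assumptions picked for this example, and None stands for the point at infinity. It needs Python 3.8 or later for the modular inverse via pow.

```python
P_MOD = 97            # toy prime, far too small for real use
A, B = 2, 3           # curve y^2 = x^3 + A*x + B
G = (3, 6)            # on the curve: 27 + 6 + 3 = 36 = 6^2


def ec_add(P, Q):
    """Elliptic product of two points; None is the point at infinity."""
    if P is None:
        return Q
    if Q is None:
        return P
    x1, y1 = P
    x2, y2 = Q
    if x1 == x2 and (y1 + y2) % P_MOD == 0:
        return None                                        # P and -P
    if P == Q:
        m = (3 * x1 * x1 + A) * pow(2 * y1 % P_MOD, -1, P_MOD)   # tangent slope
    else:
        m = (y2 - y1) * pow((x2 - x1) % P_MOD, -1, P_MOD)        # chord slope
    m %= P_MOD
    x3 = (m * m - x1 - x2) % P_MOD
    y3 = (m * (x1 - x3) - y1) % P_MOD
    return (x3, y3)


def ec_mul(k, P):
    """Compute kP with about log2(k) doublings instead of k-1 additions."""
    result = None
    addend = P
    while k > 0:
        if k & 1:
            result = ec_add(result, addend)
        addend = ec_add(addend, addend)    # P, 2P, 4P, 8P, ...
        k >>= 1
    return result


print(ec_mul(2, G))   # (80, 10)
print(ec_mul(5, G))   # None: this particular point has order 5
```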

Size of E(F_p) solution set: roughly p

Basically, for each x, half the time there are no solutions for y and half the time there are two (+y, -y). On average there is one, so total number of solutions is ~p.
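
The counting argument can be tried by brute force on a small field. This sketch assumes the toy curve y² = x³ + 2x + 3 over F_97 (chosen here, not from the notes) and counts the point at infinity as one solution.

```python
def count_points(p, A, B):
    """Number of solutions of y^2 = x^3 + A*x + B over F_p, plus the point at infinity."""
    # How many y give each square value: 0 has one root, other squares have two.
    roots = {}
    for y in range(p):
        roots[y * y % p] = roots.get(y * y % p, 0) + 1
    total = 1                                   # the point at infinity
    for x in range(p):
        total += roots.get((x ** 3 + A * x + B) % p, 0)
    return total


n = count_points(97, 2, 3)
print(n, "points for p = 97")
```

The printed count should land close to p, as the averaging argument suggests.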

Montgomery form: y² = x³ + Ax² + x

Diffie-Hellman-Merkle for basic elliptic curve

For classic Diffie-Hellman-Merkle, Alice chooses an integer a<p, and Bob chooses b<p. Alice and Bob publish g^a and g^b (mod p) respectively, where g is the chosen generator. If Alice wants to create a key to use for encrypting a message to Bob, she calculates (g^b)^a = g^(ab). Similarly, Bob can calculate (g^a)^b = g^(ab) to decrypt. Nobody else can; you have to know either a or b.
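
A minimal sketch of the classic exchange with Python's built-in modular pow. The prime 2¹²⁷ - 1 and the base g = 3 are assumptions chosen for illustration; no claim is made that 3 is a generator, and these sizes are not suitable for real use.

```python
import secrets

p = 2 ** 127 - 1      # a Mersenne prime, used here only as a small example
g = 3                 # hypothetical base

a = secrets.randbelow(p - 2) + 1     # Alice's private integer
b = secrets.randbelow(p - 2) + 1     # Bob's private integer

alice_public = pow(g, a, p)          # Alice publishes g^a mod p
bob_public = pow(g, b, p)            # Bob publishes g^b mod p

alice_key = pow(bob_public, a, p)    # (g^b)^a
bob_key = pow(alice_public, b, p)    # (g^a)^b

print(alice_key == bob_key)          # True
```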

For elliptic curves, Alice again chooses an integer a<p, and Bob chooses b<p. Alice and Bob publish a*g and b*g, respectively. Again, knowing g and knowing a*g does not give you a reasonable method for finding a. The rest of the mechanism works exactly as with the classic case.
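
The same exchange on an elliptic curve can be sketched as follows. Everything concrete here is an assumption for illustration: the toy curve y² = x³ + 2x + 3 over F_97, the base point (0, 10) (whose order is not checked), and the helper names. None stands for the point at infinity.

```python
import secrets

P_MOD, A, B = 97, 2, 3     # toy curve, far too small for real use
G = (0, 10)                # on the curve: 10^2 = 100 = 3 (mod 97)


def ec_add(P, Q):
    if P is None:
        return Q
    if Q is None:
        return P
    (x1, y1), (x2, y2) = P, Q
    if x1 == x2 and (y1 + y2) % P_MOD == 0:
        return None
    if P == Q:
        m = (3 * x1 * x1 + A) * pow(2 * y1 % P_MOD, -1, P_MOD)
    else:
        m = (y2 - y1) * pow((x2 - x1) % P_MOD, -1, P_MOD)
    x3 = (m * m - x1 - x2) % P_MOD
    return (x3, (m * (x1 - x3) - y1) % P_MOD)


def ec_mul(k, P):
    result = None
    while k > 0:
        if k & 1:
            result = ec_add(result, P)
        P = ec_add(P, P)
        k >>= 1
    return result


a = secrets.randbelow(96) + 1      # Alice's private integer
b = secrets.randbelow(96) + 1      # Bob's private integer
alice_public = ec_mul(a, G)        # Alice publishes a*g
bob_public = ec_mul(b, G)          # Bob publishes b*g

alice_secret = ec_mul(a, bob_public)    # a*(b*g)
bob_secret = ec_mul(b, alice_public)    # b*(a*g)
print(alice_secret == bob_secret)       # True
```

Both sides arrive at (ab)*g, which plays the role of g^(ab) in the classic case.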

Edwards form: x² + y² = 1 + Dx²y². The elliptic product here does not involve cases.

Curve25519

The prime here is p = 2²⁵⁵ - 19, which is easy to find in Python. The curve is y² = x³ + 486662x² + x.

Size of E(F_p) = 8q, where q is prime; q = 2**252 + 27742317777372353535851937790883648493

Basic Encryption

Use Diffie-Hellman-Merkle to choose a common secret, and then use a hash of that secret as a conventional encryption key.

Base point: (9, 14781619447589544791020593568409986887264606134616475288964881837755586237401). This has order q, above, in the group.

How did I get this? RFC8032 page 21
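
The Curve25519 numbers above can be sanity-checked in Python with ordinary integers. This is only a sketch: the Fermat test is a necessary condition for primality, not a proof, and the variable names are chosen here. The constants are copied from the notes.

```python
p = 2 ** 255 - 19
A = 486662
q = 2 ** 252 + 27742317777372353535851937790883648493
BASE = (9, 14781619447589544791020593568409986887264606134616475288964881837755586237401)


def on_curve(x, y):
    """Does (x, y) satisfy y^2 = x^3 + A*x^2 + x modulo p?"""
    return (y * y - (x ** 3 + A * x * x + x)) % p == 0


fermat_ok = pow(2, p - 1, p) == 1          # consistent with p being prime
gap = abs(8 * q - (p + 1))                 # distance of the group size 8q from p + 1
size_ok = gap * gap <= 4 * p               # "roughly p": within 2*sqrt(p)

# The middle value is True if the base point was transcribed correctly.
print(fermat_ok, on_curve(*BASE), size_ok)
```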