Mini-Java Compiler
    As a case study for several data structures, we will look at a simple
    compiler for a subset of the language Java; the language is also essentially
    a subset of C#. It is sufficient to write interesting programs. The code is
    compiled into machine code for a stack machine we will call the smachine;
    it can then be executed via emulation.
    
    The compiler lives in the following files:
    
      - compiler.cs: the parser and most code generation
      - tokenizer.cs: the tokenizer, which divides the input into separate symbols
      - nodes.cs: expressions result in the creation of expression trees, the
        nodes of which are defined here
      - symboltable.cs: the mechanism by which identifiers are looked up
      - codestream.cs: for code generation
      - machine.cs: the definition of the virtual stack machine

    There is also the following alternative for symboltable.cs, used below.
    You should use one or the other, but not both.
    
    
     smachine code
    There are three kinds of memory in the stack machine:
    
      - the stack, holding 32-bit values
      - global memory, an array of 64K 32-bit values
      - program memory

The machine code is stack-based; we can load 32-bit variables and constants
    to the stack, do stack-based (postfix!) arithmetic (pop two operands and
    push the result), and then store the results at a designated address.
    Global-memory locations ("words") are also 32-bit quantities; addresses
    identifying these words are 16 bits.
    
    In processing the statement X = 3*Y+1, we would generate the following
    (simplified) code:
    
        LOAD    3
        LOAD    Y
        MUL
        LOAD    1
        ADD
        STOR   X
    
    In the actual code, we would replace Y and X with their 16-bit addresses
    (either global or on the stack). In between statements, the stack does not
    contain any operands, but it still holds stack
      frames containing local variables, and also information about
    pending function calls. Some opcodes (ADD, etc) have no operands; others
    have a 16-bit operand generally representing an address. 
    
    The above code is simplified in the sense that numeric literals such as 1
    and 3 are in fact loaded with the LOADI (LOAD Immediate) opcode, while local
    variables are loaded with LOADF (LOAD relative to stack Frame) and global
    variables are loaded with LOAD. The opcodes PLOAD and PSTOR, which allow
    specification of an address determined at runtime, are available for
    implementation of arrays, but the language does not use them at this time.
    
    Instructions can be divided into the following categories:

      - LOAD/STOR and variants (e.g. LOAD Immediate, LOAD relative to the
        stack frame, LOAD from main memory)
      - Stack-based arithmetic: ADD, SUB, MUL, DIV, MOD, NEG, AND, OR, LAND,
        LOR, NOT, XOR. These pop their operands (normally two, though a
        couple are unary) and then push the result.
      - Comparisons: CEQ, CGE, CGT (a form of stack-based arithmetic, where
        the operator is ==, >= or >; 0 is false, 1 is true)
      - Jumps: JNZ, JZ, JGE, JGT. These pop a value and jump depending on
        that value; the value is often set by one of the comparison opcodes.
      - Subroutine instructions: JSR, RET, SYSRET
      - System: HALT, NOP, ALLOC, DUP
      - I/O: SPRINT, IPRINT

    The opcodes are defined in the machine.cs class, and code is generated
    using the emit() call (a member of the codestream class).
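    
    As a sketch of how this fits together (hedged: the opcode constant names
    and the exact emit() signature here are assumptions based on the
    description above), the X = 3*Y+1 example might be generated by calls
    like:
    
        cs.emit(LOADI, 3);       // push the constant 3
        cs.emit(LOAD, addrY);    // push the global variable Y
        cs.emit(MUL);            // pop both, push 3*Y
        cs.emit(LOADI, 1);
        cs.emit(ADD);            // pop both, push 3*Y+1
        cs.emit(STOR, addrX);    // pop and store into X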
    
    MiniJava
    
    The language grammar is defined using Extended
      Backus-Naur Form, a relatively standard formal grammar
    specification. The declarations of the grammar are called productions,
    and can all be found in compiler.cs. The "extended" means that optional parts
    can be indicated by enclosing in [ ]; zero-or-more repetition is indicated
    by enclosing in {}. For example, if the "language" is a string of 0 or more
    b's, optionally preceded by an a, the grammar might be:
        lang ::= [ 'a' ] { 'b' }
    We can also write this, without [ ] and { }, as
        lang ::= Apart Bpart
        Apart ::= 'a' | empty
        Bpart ::= 'b' Bpart | empty
    Our parser is that part of the
    program that follows the language rules to determine whether the input, as a
    stream of tokens, follows the
    rules. Intermingled in the compiler with the parsing statements are
    code-generation and semantic statements.
    
    We will use a parsing technique called recursive
      descent with one-symbol lookahead,
    in which we write a set of parsing procedures, one for each EBNF production.
    A crucial feature is that we will always be able to tell what production to
    follow using only the next token, represented in compiler.cs by
    the string variable theToken. (Not
    all grammars can be parsed this way.)
                     
    
 An example of the power of recursive-descent parsing is the following
    grammar:
    
    start ::= [ 'a' A ] { 'b' B } 'c'
    A ::= 'a' 'a' | 'b' | 'c' 'd' 
    B ::= 'a' | 'b'
    
    The corresponding pseudocode follows, where theToken contains the current
    token (a letter 'a', 'b', 'c', 'd'), and accept(x) means "if theToken==x,
    then theToken = the next token, otherwise halt". At each point we are
    making a decision based only on the next letter; if production A had
    instead been A ::= 'a' 'a' | 'a' 'b', this would not be possible (as
    written).
    
    parseA() {
        if (theToken == 'a') {accept('a'); accept('a');}
        else if (theToken == 'b') {accept('b');}
        else if (theToken == 'c') {accept('c'); accept('d');}
        else error();
    }
    
    parseB() {
        if (theToken == 'a') {accept('a');}
        else if (theToken == 'b') {accept('b');}
        else error();
    }
    
    parseStart() {
        if (theToken=='a') {
           accept('a');
           parseA();
        }
        while (theToken == 'b') {
           accept('b');
           parseB();
        }
        accept('c');
    }
    
    Note how [ ] corresponds to the if statement, and { } to
    the while. Also note that the symbol-checking part of
    accept() is unnecessary if we've just checked the token previously, but
    in instances such as the second accept('a') and the accept('d') in
    parseA(), and the final accept('c') in parseStart(), it plays an
    important role.
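    
    A minimal C# sketch of accept(), assuming a tokenizer object tok with a
    next() method (both names are illustrative):
    
        void accept(string expected) {
            if (theToken != expected)
                error("expected " + expected + ", saw " + theToken);  // halt
            theToken = tok.next();      // advance to the next token
        }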
    
    In the compiler.cs file, good examples can be found in:
    
      - compileStatement()
      - compileCompound()
      - compileWhile()
      - compilePrint()          (note alternative parsing idiom)
      - compileIdentStmt()      (note one-symbol-lookahead workaround)
 
     Variables
    There are two kinds of variables: global and local. Local variables are
    allocated on stack frames, and
    globals go in a separate area of memory. The role of the symbol table is to
    figure out exactly where. The LOAD opcode, which takes as operand an integer
    address in the "memory" area of a Machine object, is for global variables.
    The LOADI (LOAD Immediate) opcode is for loading 16-bit numeric constants,
    with LOADH (LOAD High) as an optional followup for loading the upper half of a
    32-bit constant. See compileFactor() for an example of the use of
    emitLOADINT(). 
    
    For local variables, which must live on the stack (in the stack frame) so
    that local variables of different activations of the same recursive function
    don't collide, we need a different approach. We use the LOADF opcode, where
    the argument is the "offset" of the variable in the stack frame, and the
    actual location is stack[FP+arg], where FP is the Frame Pointer. Each
    procedure has one stack frame (regardless of scope, below), and the FP
    always points to the stack frame of the currently executing procedure. 
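    
    Hedged on the exact IdentInfo accessor names (which are illustrative
    here), loading an identifier thus reduces to choosing the right opcode:
    
        IdentInfo info = symtab.lookup(name);
        if (info.getIsGlobal())
            cs.emit(LOAD, info.getLocation());    // absolute address in memory
        else
            cs.emit(LOADF, info.getLocation());   // offset from FP in the frame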
    
    Global variables are given locations in the memory area, beginning with
    address 0 and incrementing by 1. Local variables are given offsets in the
    stack frame, beginning with 0 and incrementing by 1. Whenever you create
    a new local variable, you generate cs.emit(ALLOC, 1) to reserve one more
    word in the stack frame (see the discussion of ALLOC semantics under
    Scope below; reclaiming the space when variables go out of scope is
    trickier under these semantics).
    
    The compileDeclaration() method has a parameter, isGlobal, that tells you at
    runtime whether the declaration you are processing is global or local. You
    need that to properly construct the symbol-table entry, and to decide
    whether to emit an ALLOC instruction.
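    
    As a hedged sketch (allocVar() is described in the SymbolTable section
    below; its signature here is an assumption), the heart of
    compileDeclaration() might look like:
    
        // inside compileDeclaration(bool isGlobal), after parsing "int name;"
        symtab.allocVar(name, isGlobal);     // records the address or offset
        if (!isGlobal)
            cs.emit(ALLOC, 1);               // one more word in the stack frame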
    
    We can also define "constants" using final
      int. Finally, function definitions will also go into the symbol
    table.
    
    Expressions
    I've used a reasonably common, but somewhat compressed, syntax for
    expressions:
    
    expr        ::= simple_expr [ relop simple_expr ]
    simple_expr ::= ['+'|'-'] term { addop term }
    term        ::= factor { mulop factor }
    factor      ::= identifier | number | '(' expr ')'
    
    The four levels here give multiplication a higher precedence than addition;
    factors allow the use of parentheses for grouping. Exprs that are not also
    simple_exprs represent boolean comparisons, though we do not attempt to do
    any boolean type-checking. The && operator is considered to be a
    mulop, and || is an addop; a consequence of this is that we need full
    parentheses around each boolean comparison when we chain them with || and
    &&: 
        if ((x<10) || (y<20))
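    
    A hedged sketch of compileTerm() for this grammar (compileFactor() is
    named in the text; isMulOp() and the exact emission details are
    assumptions):
    
        void compileTerm() {                 // term ::= factor { mulop factor }
            compileFactor();
            while (isMulOp(theToken)) {
                string op = theToken;
                accept(op);
                compileFactor();             // both operands now on the stack
                if (op == "*")      cs.emit(MUL);
                else if (op == "/") cs.emit(DIV);
                else if (op == "%") cs.emit(MOD);
                else                cs.emit(LAND);   // && is treated as a mulop
            }
        }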
    
    Strings can only be used in print
    statements, and the actual string contents are stored in the Machine in a
    separate area from either the stack or the "memory".

    Using the compiler
    I usually compile everything into compiler.exe. Then the following will
    compile and run a program:
    
        mono compiler.exe foo.mjava
    
    Execution begins with the minijava function named main.
    Be sure you have one!
    Scope
    Not only do we have global variables and local variables, but there are
    differing scopes for local variables; a new one begins with every '{'. (This
    is true of C# and Java, too, though C# doesn't let you reuse existing
    identifiers.) In particular, you can do the following:
    
        f() {
           final int N = 3;
           { int N; N=5; print(N);}
           print(N);
        }
    
    This will print 5 and then 3. 
    
    (This behavior depends on the value of Compiler.INNERSCOPES; if it is
    true, then the semantics are as above. You may set it to false, however,
    or simply assume it is false and not include any inner-scope declarations
    like those above.)
    
    Allowing a new scope to start at every '{' means two things: first, whenever
    you encounter '{' (in compileCompound()), you will have to create a new
    Table object, because you don't want a new declaration of N overwriting the
    older declaration. You will thus have a list
    of Table objects, and you will "search" the list in linear order, from most
    recent to oldest.
    
    Under the current semantics for ALLOC N, each time it is called, SP is
    incremented by N (leaving N more units of space between FP and SP). For each
    int variable declared, all you need to do is ALLOC 1.
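    
    For example, under the current semantics:
    
        { int n;      // ALLOC 1  (SP += 1)
          int m;      // ALLOC 1  (SP += 1; the frame now holds two locals)
        }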
    
    This is much simpler than an earlier formulation, where ALLOC N set SP = FP
    + N, and successive ALLOCs were not cumulative. In that older
    model, you would need to keep a counter of how many variables are allocated
    along with each per-scope Table object (theTable.size() might do just fine).
    Then, every time a variable is declared, one would have to emit ALLOC N,
    where N was the current number of variables. 
    
    The catch comes at the end of the scope; it's much harder with the current
    interpretation to reclaim space used by the inner declarations. In
    principle, with the old interpretation, you would emit ALLOC N where N is
    now the number of variables as of the previous,
    just-returned-to, scope. Example:
    
        { int n;      // ALLOC 1
          int m;      // ALLOC 2   (not cumulative: two variables total)
          { // new scope
             int n;   // ALLOC 3
             int x;   // ALLOC 4
          } // end of scope
          // ALLOC 2
        }
    
    I did some experiments with C++, where you can actually print addresses of
    variables.  C++ does not reclaim space used by "inner" allocations
    either, so I'm in good company. See mem.cpp.
    
    
    If the ALLOCs are not right, or if the offsets assigned to variables are not
    right, then two variables may overwrite one another. If this happens, the
    numeric results will be nonsense. Try changing the program to use only
    globals; if this fixes the problem, bad local-variable allocation is the
    likely culprit.
    
     class SymbolTable
    As each variable is declared, the role of the symbol table is to store
      its address. Global variables have addresses in main memory; local
      variables have addresses relative to the stack frame pointer FP. 
    The symbol table should do the following:
    
      - Provide a way of entering newly declared variables and function names
        into the symbol table, together with their information. For
        variables, this is their address, plus a flag indicating whether the
        variable is global or local. Global-variable addresses are references
        to main memory; local-variable addresses are offsets from the top of
        the stack. For functions, the essential information is the function's
        entry point, i.e. the address of its code. For constants, the
        essential information is the constant's type and value. All this is
        stored in class IdentInfo objects.
      - Provide a way of looking up, for each identifier, its current
        IdentInfo information.
      - When the scope of a set of variable declarations ends, all
        declarations that are part of that scope should disappear.
    The first is handled by the SymbolTable methods

      - allocVar()
      - allocConst()
      - allocProc()

    The information about constants and functions is provided via parameters,
    but newly declared variables need to be allocated space. All variables
    have size 1 (whether on the stack or in main memory, both of which count
    in 32-bit words), so this is straightforward.

    The second is handled by lookup().
    
    The IdentInfo object has a getType() method to indicate whether the
      identifier was a VARIABLE, CONSTANT, or FUNCTION, and appropriate other
      accessors to return the necessary information.
    The third is handled by the following pair:

      - newscope()
      - endscope()

    The first creates a new symbol-table scope, and the second terminates it.
      Note that redeclaring a variable in the same scope should be illegal,
      while redeclaring a variable from a past scope is not. 
    Currently, the compiler proper calls newscope() and endscope() only at
      the start and end of new functions, if INNERSCOPES is set to false.
      The language grammar allows declarations within any compound statement.
      Thus, in the code below the inner declaration of n3 is as if it were
      declared at the top, but the inner declaration of n2 would be seen as a
      duplicate:
        int foo() {
            int n1; int n2; int n4;
            n1 = 1; n2 = 10; n4 = 0;
            while (n1 < n2) {
                int n2; int n3;
                n2 = 4;
                n4 = n4 + n1 + n2;
                n1 = n1 + 1;
            }
            print(n4);
        }
    Neither C# nor Java would allow the redeclaration above of n2, though C++
    does. Different languages have different semantic rules for this
    sort of thing.
    The Symbol-Table Hack
    I've introduced three pathetic hacks to get you started without a "real"
      symbol table: 
    
      - Global variables may be named "Gnn" for two digits nn; these each get
        a unique location in memory (dependent on nn).
      - Local variables may be named "Ln"; there can be up to 10 of these.
      - There are four pre-recognized function names: "f1", "f2", "f3" and
        "f4".

    All of these are defined in the SymbolTable class.
    Variables of the form Gnn and Ln do not need to be declared; in fact,
      SymbolTable.allocVar() currently simply returns null. Nonetheless, the
      symbol-table hack will only allocate space for Ln variables if
      there is at least one variable declaration. It does not matter if it is a
      declaration of a "true" variable or an Ln variable.
    Use of the Ln and "true" variables together is very risky.
     Demo of the code
    The following all work using the hack above:
    
    The next two assume a real symbol table:
    
    A first step at a real SymbolTable
    The first step is a linked-list implementation; a C# sketch follows the
    list.

      - The main symbol table is a singly linked list of
        ⟨identifier, IdentInfo⟩ pairs.
      - There is also a stack of list pointers representing scope levels. On
        newScope(), push the current head of the symbol-table list. On
        endScope(), head = pop().
      - allocVar() searches the list only up to the end of the current scope
        before deciding there is no duplicate declaration and creating a new
        entry (at the head).
      - lookup() searches the entire list.
      - size() is likely to include globals too, which we don't want, so we
        need stackFrameSize() = size() - current_number_of_globals.
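    A minimal C# sketch of this design (all names here are illustrative, not
    the actual symboltable.cs code):
    
        using System.Collections.Generic;
    
        class Entry {
            public string Name; public IdentInfo Info; public Entry Next;
        }
    
        class ListSymbolTable {
            Entry head = null;                        // newest entry first
            Stack<Entry> marks = new Stack<Entry>();  // head at each newScope()
    
            public void newScope() { marks.Push(head); }
            public void endScope() { head = marks.Pop(); }
    
            public bool declare(string name, IdentInfo info) {
                Entry stop = marks.Count > 0 ? marks.Peek() : null;
                for (Entry e = head; e != stop; e = e.Next)  // current scope only
                    if (e.Name == name) return false;        // duplicate
                head = new Entry { Name = name, Info = info, Next = head };
                return true;
            }
    
            public IdentInfo lookup(string name) {
                for (Entry e = head; e != null; e = e.Next)  // entire list
                    if (e.Name == name) return e.Info;
                return null;
            }
        }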
A slightly different first step is the following (symboltable1.cs),
      which assumes INNERSCOPES = false. We create a single
      Dictionary<string,IdentInfo> object named Table. At the start of a
      new scope, we make a backup copy of this, in, say,
      TableBack. We now continue to add new local variables to Table. At the end
      of the scope, we restore Table to TableBack.
    We can tell which scope we're in by looking at the value of
      SymbolTable.scopecount. If it is 0, we are at the global level. If it is
      1, we are at the top level of some function declaration. If we set
      INNERSCOPES = true but want things to behave as if INNERSCOPES is false
      (that is, there are only globals and top-level locals), we create the
      backup Table when scopecount changes to 1 (but not if it increments
      further), and restore Table from the backup when scopecount changes to 0.
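    A hedged sketch of the backup idea (the field names follow the
    description above; the details are illustrative):
    
        Dictionary<string, IdentInfo> Table = new Dictionary<string, IdentInfo>();
        Dictionary<string, IdentInfo> TableBack = null;
    
        void newScope() {               // entering a function body
            TableBack = new Dictionary<string, IdentInfo>(Table);  // backup copy
        }
    
        void endScope() {               // leaving the function body
            Table = TableBack;          // locals added since then disappear
        }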
    For a two-level symbol table, we might modify symboltable1.cs so that:

      - Global variables are in a Dictionary named, say, GlobalDict.
      - Each time we enter a local scope (that is, each time we compile a new
        procedure), we create a new dictionary LocalDict for local variables.
      - On lookup, if the current scope is global we look only in GlobalDict.
        If the scope is local, we look first in LocalDict and then, if not
        found, in GlobalDict.
      - On creation of new entries (corresponding to new declarations), if
        the scope is local then we install the new entry in LocalDict. There
        is a duplicate-declaration conflict only if the identifier in
        question is already in LocalDict.
    Multi-level symbol table
    What if we want to support arbitrary nesting? Then we need our symbol
      table to support that. Here is one model (a sketch follows the list):

      - The symbol table is a List of Dictionary<String, IdentInfo>.
      - The global scope is at position 0 in the List.
      - newScope() adds a new Dictionary to the list; endScope() deletes one.
      - allocVar(), etc. search only the last Dictionary in the List.
      - lookup() searches each Dictionary in turn, starting with the last one
        and proceeding down to List[0].
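    A hedged C# sketch of this model (the method names follow the list above;
    everything else is illustrative):
    
        using System.Collections.Generic;
    
        class ScopedSymbolTable {
            List<Dictionary<string, IdentInfo>> scopes =
                new List<Dictionary<string, IdentInfo>> {
                    new Dictionary<string, IdentInfo>()    // [0] = globals
                };
    
            public void newScope() { scopes.Add(new Dictionary<string, IdentInfo>()); }
            public void endScope() { scopes.RemoveAt(scopes.Count - 1); }
    
            public bool declare(string name, IdentInfo info) {
                var current = scopes[scopes.Count - 1];      // last Dictionary only
                if (current.ContainsKey(name)) return false; // duplicate in scope
                current[name] = info;
                return true;
            }
    
            public IdentInfo lookup(string name) {
                for (int i = scopes.Count - 1; i >= 0; i--)  // innermost outward
                    if (scopes[i].TryGetValue(name, out var info)) return info;
                return null;
            }
        }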
     Declarations
    There are three kinds of declarations:

      - variables: attributes are boolean isGlobal and int location (either a
        global location or a local "offset"). We might also include type,
        except everything in minijava is int.
      - constants: the only attribute is the numeric value.
      - procedures: the main attribute is again int location (the code
        address). Although we are unlikely to use it, another might be the
        number of parameters.
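    
    A hedged sketch of IdentInfo covering the three kinds (getType() is from
    the SymbolTable section above; the other members are illustrative):
    
        enum IdType { VARIABLE, CONSTANT, FUNCTION }
    
        class IdentInfo {
            IdType type;
            bool isGlobal;    // variables only
            int location;     // variables: address or offset; functions: entry point
            int value;        // constants only
    
            public IdentInfo(IdType t, bool g, int loc, int val) {
                type = t; isGlobal = g; location = loc; value = val;
            }
            public IdType getType()     { return type; }
            public bool   getIsGlobal() { return isGlobal; }
            public int    getLocation() { return location; }
            public int    getValue()    { return value; }
        }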
    
    How would you implement type signatures?
        
    
     CompileExpr2
    The strategy is to have compileExpr return an Integer object if the expr is
    constant (either a final int or a
    number). 
    
    The problem case is an expression like 1+n, where n is a variable: the
    constant 1 has been deferred, but the code for n must be emitted first.
    One fix is to push the deferred numeric value after the variable, and
    then emit a SWAP instruction to get the operands into the right order
    when the operator is not commutative.
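    
    For example, 1-n might then compile to the following (hedged; SUB is
    assumed to compute second-from-top minus top):
    
        LOAD  n      // the variable, emitted first
        LOADI 1      // the deferred constant, pushed afterwards
        SWAP         // put the operands in the order SUB expects
        SUB          // computes 1-n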
    
    
     CompileExpr3
    Here we build the actual expression tree. The issue above with constants is
    also implemented.
    
    Note that the expression tree is different from the true parse tree. The
    original EBNF grammar is
        E -> T { addop T}
    In "plain" BNF grammar, this is:
        E -> T MT
        MT -> addop MT | empty
    This builds a somewhat weird tree (draw some examples). However, we can do
    better: 
    
    
    Define ExprNode, ConstNode, VarNode, UnopNode, BinopNode. Note that the
    latter four are subclasses of the abstract class ExprNode. We have one
    abstract method in ExprNode:  compile() (the only method at all!)
    
    Then modify compileSExpr, compileTerm so that, if both operands are
    ConstNodes, then we create a new ConstNode to hold the result, instead
      of generating lots of code.
    
    Note how the object hierarchy helps us here, and also how we check node
    types with is.
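    
    A hedged sketch of the hierarchy and the folding step (CodeStream,
    emit(), and emitLOADINT() are from earlier sections; everything else is
    illustrative):
    
        abstract class ExprNode {
            public abstract void compile(CodeStream cs);   // the only method
        }
    
        class ConstNode : ExprNode {
            public readonly int Value;
            public ConstNode(int v) { Value = v; }
            public override void compile(CodeStream cs) { cs.emitLOADINT(Value); }
        }
    
        class BinopNode : ExprNode {      // VarNode and UnopNode are analogous
            public readonly ExprNode Left, Right; public readonly char Op;
            public BinopNode(ExprNode l, char op, ExprNode r) { Left = l; Op = op; Right = r; }
            public override void compile(CodeStream cs) {
                Left.compile(cs); Right.compile(cs);       // postfix order
                cs.emit(Op == '+' ? ADD : Op == '-' ? SUB : Op == '*' ? MUL : DIV);
            }
        }
    
        // In compileSExpr/compileTerm: fold when both operands are ConstNodes.
        static ExprNode makeBinop(ExprNode l, char op, ExprNode r) {
            if (l is ConstNode cl && r is ConstNode cr) {
                int v = op == '+' ? cl.Value + cr.Value
                      : op == '-' ? cl.Value - cr.Value
                      : op == '*' ? cl.Value * cr.Value : cl.Value / cr.Value;
                return new ConstNode(v);
            }
            return new BinopNode(l, op, r);
        }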
    
    
     Common subexpressions
    Suppose we want to optimize away common subexpressions:
    
      - z1 = x+y; z2 = x+y;
      - z = (x+y)*(x+y)
The first example is an optimization among multiple statements; the second
    is an optimization within a single statement.
    
    How do you detect these?
    We construct a map of subexpressions, making sure that we match on
    reference equality (==) rather than value equality (.Equals/.equals);
    this means that we cannot use the default hashed-map behavior (Dictionary
    in C#, HashMap in Java).
    
    We can, however, still use hashing, if we pay careful attention to the
    hash-code method, e.g.

        BinOpNode.hashCode = mixup(left.hashCode(), right.hashCode(), (int) op);
    
    Each subexpression is first looked up in the hashtable; if we find an exact
    match there, we return the hashed entry. Note that we really want to be
    using == on the pointers, all the way, not .equals().
    
    Example: (x+y)*(x+y)

    1. Enter x, y, and BinOpNode(x, PLUS, y) in the table.
    2. Now do it again:
       - second x: find the entry for the previous x, and return a duplicate
         pointer to the same node.
       - second y: find the entry for the previous y, and return a duplicate
         pointer to the same node.
       - Now we want to see if we can find an instance of (x, PLUS, y). We
         search our subexpression table for an exact match, using == on all
         three fields. We find one! Therefore, we do not create a new
         BinOpNode; we just return a reference to the existing node.
    
    
    Technically, we want to do the lookup before
    actually creating the node.
    
    One way to achieve this is to modify the node constructors, but that can be
    massively confusing.
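    
    A hedged C# sketch of this lookup-before-create discipline (the comparer
    hashes and compares children by reference, as required; all names are
    illustrative):
    
        using System.Collections.Generic;
        using System.Runtime.CompilerServices;
    
        class BinopKey {
            public readonly ExprNode Left, Right; public readonly char Op;
            public BinopKey(ExprNode l, char op, ExprNode r) { Left = l; Op = op; Right = r; }
        }
    
        class BinopKeyComparer : IEqualityComparer<BinopKey> {
            public bool Equals(BinopKey a, BinopKey b) =>
                ReferenceEquals(a.Left, b.Left) && a.Op == b.Op
                && ReferenceEquals(a.Right, b.Right);
            public int GetHashCode(BinopKey k) =>          // the "mixup"
                RuntimeHelpers.GetHashCode(k.Left) * 31
                + RuntimeHelpers.GetHashCode(k.Right) * 7 + k.Op;
        }
    
        static Dictionary<BinopKey, ExprNode> seen =
            new Dictionary<BinopKey, ExprNode>(new BinopKeyComparer());
    
        // Look up before creating: reuse the node when the same (==) children
        // and operator have been seen before.
        static ExprNode makeShared(ExprNode l, char op, ExprNode r) {
            var key = new BinopKey(l, op, r);
            if (seen.TryGetValue(key, out var existing)) return existing;
            var node = new BinopNode(l, op, r);
            seen[key] = node;
            return node;
        }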
    
    Now we have to generate code. The first time we come to a common
    subexpression, we evaluate it, and create
      a new storage location to store the result. That is, we ALLOC a new
    variable temp on the stack frame,
    before compiling the expression. (Note that if we try this after
    starting to compile the expression, we already have emitted code that is
    pushing things onto the stack, and we will muddle things up.) That is, the
    code for
    
        z = (x+y)*(x+y)
    
    compiles as if it were
    
        temp = x+y;
        z = temp * temp
    
    Rather than reloading temp, though, we would probably do something like
    the following, say, for (x+y)*3*(x+y) (the intervening 3 makes the
    otherwise-obvious optimization less obvious):
    
        LOAD  x
        LOAD  y
        ADD
        DUP              // now two copies of x+y are on the stack
        STORF temp       // pops one copy
        LOADI 3
        MUL
        LOADF temp
        MUL