Mini-Java Compiler
    As a case study for several data structures, we will look at a simple
    compiler for a subset of the language Java; the language is also essentially
    a subset of C#. It is sufficient to write interesting programs. The code is
    compiled into machine code for a stack machine we will call the smachine;
    it can then be executed via emulation.
    
    The compiler lives in the following files:
    
      - compiler.cs: the parser and most code generation
      - tokenizer.cs: the tokenizer, which divides the input into separate symbols
      - nodes.cs: expressions result in the creation of expression trees, the
        nodes of which are defined here
      - symboltable.cs: the mechanism by which identifiers are looked up
      - codestream.cs: for code generation
      - machine.cs: the definition of the virtual stack machine

    There is also the following alternative for symboltable.cs, used below.
    You should use one or the other, but not both.
    
    
     smachine code
    There are three kinds of memory in the stack machine:
    
      - the stack, holding 32-bit values
      - global memory, an array of 64K 32-bit values
      - program memory

The machine code is stack-based; we can load 32-bit variables and constants
    to the stack, do stack-based (postfix!) arithmetic (pop two operands and
    push the result), and then store the results at a designated address.
    Global-memory locations ("words") are also 32-bit quantities; addresses
    identifying these words are 16 bits.
    
    In processing the statement X = 3*Y+1, we would generate the following
    (simplified) code:
    
        LOAD    3
        LOAD    Y
        MUL
        LOAD    1
        ADD
        STOR   X
    
    In the actual code, we would replace Y and X with their 16-bit addresses
    (either global or on the stack). In between statements, the stack does not
    contain any operands, but it still holds stack
      frames containing local variables, and also information about
    pending function calls. Some opcodes (ADD, etc) have no operands; others
    have a 16-bit operand generally representing an address. 
    
    The above code is simplified in the sense that numeric literals such as 1
    and 3 are in fact loaded with the LOADI (LOAD Immediate) opcode, while local
    variables are loaded with LOADF (LOAD relative to stack Frame) and global
    variables are loaded with LOAD. The opcodes PLOAD and PSTOR, which allow
    specification of an address determined at runtime, are available for
    implementation of arrays, but the language does not use them at this time.
    
    Instructions can be divided into the following categories:

      - LOAD/STOR and variants (e.g. LOAD Immediate, LOAD relative to the
        stack frame, LOAD from main memory)
      - Stack-based arithmetic: ADD, SUB, MUL, DIV, MOD, NEG, AND, OR, LAND,
        LOR, NOT, XOR. These pop their operands (normally two, though a
        couple are unary) and then push the result.
      - Comparisons: CEQ, CGE, CGT (a form of stack-based arithmetic, where
        the operator is ==, >= or >; 0 is false, 1 is true)
      - Jumps: JNZ, JZ, JGE, JGT. These pop a value and jump depending on
        that value; the value is often set by one of the comparison opcodes.
      - Subroutine instructions: JSR, RET, SYSRET
      - System: HALT, NOP, ALLOC, DUP
      - I/O: SPRINT, IPRINT

    The opcodes are defined in the machine.cs class, and code is generated
    using the emit() call (a member of the codestream class).
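    
    As a sketch of how this fits together (hedged: the opcode constant names
    and the exact emit() signature here are assumptions based on the
    description above), the X = 3*Y+1 example might be generated by calls
    like:
    
        cs.emit(LOADI, 3);       // push the constant 3
        cs.emit(LOAD, addrY);    // push the global variable Y
        cs.emit(MUL);            // pop both, push 3*Y
        cs.emit(LOADI, 1);
        cs.emit(ADD);            // pop both, push 3*Y+1
        cs.emit(STOR, addrX);    // pop and store into X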
    
    MiniJava
    
    The language grammar is defined using Extended
      Backus-Naur Form, a relatively standard formal grammar
    specification. The declarations of the grammar are called productions,
    and can all be found in compiler.cs. The "extended" means that optional parts
    can be indicated by enclosing in [ ]; zero-or-more repetition is indicated
    by enclosing in {}. For example, if the "language" is a string of 0 or more
    b's, optionally preceded by an a, the grammar might be:
        lang ::= [ 'a' ] { 'b' }
    We can also write this, without [ ] and { }, as
        lang ::= Apart Bpart
        Apart ::= 'a' | empty
        Bpart ::= 'b' Bpart | empty
    Our parser is that part of the
    program that follows the language rules to determine whether the input, as a
    stream of tokens, follows the
    rules. Intermingled in the compiler with the parsing statements are
    code-generation and semantic statements.
    
    We will use a parsing technique called recursive
      descent with one-symbol lookahead,
    in which we write a set of parsing procedures, one for each EBNF production.
    A crucial feature is that we will always be able to tell what production to
    follow using only the next token, represented in compiler.cs by
    the string variable theToken. (Not
    all grammars can be parsed this way.)
                     
    
 An example of the power of recursive-descent parsing is the following
    grammar:
    
    start ::= [ 'a' A ] { 'b' B } 'c'
    A ::= 'a' 'a' | 'b' | 'c' 'd' 
    B ::= 'a' | 'b'
    
    The corresponding pseudocode follows, where theToken contains the current
    token (a letter 'a', 'b', 'c', 'd'), and accept(x) means "if theToken==x,
    then theToken = the next token, otherwise halt". At each point we are
    making a decision based only on the next letter; if production A had
    instead been A ::= 'a' 'a' | 'a' 'b', this would not be possible (as
    written).
    
    parseA() {
        if (theToken == 'a') {accept('a'); accept('a');}
        else if (theToken == 'b') {accept('b');}
        else if (theToken == 'c') {accept('c'); accept('d');}
        else error();
    }
    
    parseB() {
        if (theToken == 'a') {accept('a');}
        else if (theToken == 'b') {accept('b');}
        else error();
    }
    
    parseStart() {
        if (theToken=='a') {
           accept('a');
           parseA();
        }
        while (theToken == 'b') {
           accept('b');
           parseB();
        }
        accept('c');
    }
    
    Note how [ ] corresponds to the if statement, and { } to
    the while. Also note that the symbol-checking part of
    accept() is unnecessary if we've just checked the token previously, but
    in instances such as the second accept('a') and the accept('d') in
    parseA(), and the final accept('c') in parseStart(), it plays an
    important role.
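    
    A minimal C# sketch of accept(), assuming a tokenizer object tok with a
    next() method (both names are illustrative):
    
        void accept(string expected) {
            if (theToken != expected)
                error("expected " + expected + ", saw " + theToken);  // halt
            theToken = tok.next();      // advance to the next token
        }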
    
    In the compiler.cs file, good examples can be found in:
    
      - compileStatement()
      - compileCompound()
      - compileWhile()
      - compilePrint()          (note alternative parsing idiom)
      - compileIdentStmt()      (note one-symbol-lookahead workaround)
 
     Variables
    There are two kinds of variables: global and local. Local variables are
    allocated on stack frames, and
    globals go in a separate area of memory. The role of the symbol table is to
    figure out exactly where. The LOAD opcode, which takes as operand an integer
    address in the "memory" area of a Machine object, is for global variables.
    The LOADI (LOAD Immediate) opcode is for loading 16-bit numeric constants,
    with LOADH (LOAD High) as an optional followup for loading the upper half of a
    32-bit constant. See compileFactor() for an example of the use of
    emitLOADINT(). 
    
    For local variables, which must live on the stack (in the stack frame) so
    that local variables of different activations of the same recursive function
    don't collide, we need a different approach. We use the LOADF opcode, where
    the argument is the "offset" of the variable in the stack frame, and the
    actual location is stack[FP+arg], where FP is the Frame Pointer. Each
    procedure has one stack frame (regardless of scope, below), and the FP
    always points to the stack frame of the currently executing procedure. 
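    
    Hedged on the exact IdentInfo accessor names (which are illustrative
    here), loading an identifier thus reduces to choosing the right opcode:
    
        IdentInfo info = symtab.lookup(name);
        if (info.getIsGlobal())
            cs.emit(LOAD, info.getLocation());    // absolute address in memory
        else
            cs.emit(LOADF, info.getLocation());   // offset from FP in the frame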
    
    Global variables are given locations in the memory area, beginning with
    address 0 and incrementing by 1. Local variables are given offsets in the
    stack frame, beginning with 0 and incrementing by 1. Whenever you create
    a new local variable, you generate cs.emit(ALLOC, 1) to reserve one more
    word in the stack frame (see the discussion of ALLOC semantics under
    Scope below; reclaiming the space when variables go out of scope is
    trickier under these semantics).
    
    The compileDeclaration() method has a parameter, isGlobal, that tells you at
    runtime whether the declaration you are processing is global or local. You
    need that to properly construct the symbol-table entry, and to decide
    whether to emit an ALLOC instruction.
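    
    As a hedged sketch (allocVar() is described in the SymbolTable section
    below; its signature here is an assumption), the heart of
    compileDeclaration() might look like:
    
        // inside compileDeclaration(bool isGlobal), after parsing "int name;"
        symtab.allocVar(name, isGlobal);     // records the address or offset
        if (!isGlobal)
            cs.emit(ALLOC, 1);               // one more word in the stack frame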
    
    We can also define "constants" using final
      int. Finally, function definitions will also go into the symbol
    table.
    
    Expressions
    I've used a reasonably common, but somewhat compressed, syntax for
    expressions:
    
    expr        ::= simple_expr [ relop simple_expr ]
    simple_expr ::= ['+'|'-'] term { addop term }
    term        ::= factor { mulop factor }
    factor      ::= identifier | number | '(' expr ')'
    
    The four levels here give multiplication a higher precedence than addition;
    factors allow the use of parentheses for grouping. Exprs that are not also
    simple_exprs represent boolean comparisons, though we do not attempt to do
    any boolean type-checking. The && operator is considered to be a
    mulop, and || is an addop; a consequence of this is that we need full
    parentheses around each boolean comparison when we chain them with || and
    &&: 
        if ((x<10) || (y<20))
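    
    A hedged sketch of compileTerm() for this grammar (compileFactor() is
    named in the text; isMulOp() and the exact emission details are
    assumptions):
    
        void compileTerm() {                 // term ::= factor { mulop factor }
            compileFactor();
            while (isMulOp(theToken)) {
                string op = theToken;
                accept(op);
                compileFactor();             // both operands now on the stack
                if (op == "*")      cs.emit(MUL);
                else if (op == "/") cs.emit(DIV);
                else if (op == "%") cs.emit(MOD);
                else                cs.emit(LAND);   // && is treated as a mulop
            }
        }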
    
    Strings can only be used in print
    statements, and the actual string contents are stored in the Machine in a
    separate area from either the stack or the "memory".

    Using the compiler
    I usually compile everything into compiler.exe. Then the following will
    compile and run a program:
    
        mono compiler.exe foo.mjava
    
    Execution begins with the minijava function named main.
    Be sure you have one!
    Scope
    Not only do we have global variables and local variables, but there are
    differing scopes for local variables; a new one begins with every '{'. (This
    is true of C# and Java, too, though C# doesn't let you reuse existing
    identifiers.) In particular, you can do the following:
    
        f() {
           final int N = 3;
           { int N; N=5; print(N);}
           print(N);
        }
    
    This will print 5 and then 3. 
    
    (This behavior depends on the value of Compiler.INNERSCOPES; if it is
    true, then the semantics are as above. You may set it to false, however,
    or simply assume it is false and not include any inner-scope declarations
    like those above.)
    
    Allowing a new scope to start at every '{' means two things: first, whenever
    you encounter '{' (in compileCompound()), you will have to create a new
    Table object, because you don't want a new declaration of N overwriting the
    older declaration. You will thus have a list
    of Table objects, and you will "search" the list in linear order, from most
    recent to oldest.
    
    Under the current semantics for ALLOC N, each time it is called, SP is
    incremented by N (leaving N more units of space between FP and SP). For each
    int variable declared, all you need to do is ALLOC 1.
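    
    For example, under the current semantics:
    
        { int n;      // ALLOC 1  (SP += 1)
          int m;      // ALLOC 1  (SP += 1; the frame now holds two locals)
        }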
    
    This is much simpler than an earlier formulation, where ALLOC N set SP = FP
    + N, and successive ALLOCs were not cumulative. In that older
    model, you would need to keep a counter of how many variables are allocated
    along with each per-scope Table object (theTable.size() might do just fine).
    Then, every time a variable is declared, one would have to emit ALLOC N,
    where N was the current number of variables. 
    
    The catch comes at the end of the scope; it's much harder with the current
    interpretation to reclaim space used by the inner declarations. In
    principle, with the old interpretation, you would emit ALLOC N where N is
    now the number of variables as of the previous,
    just-returned-to, scope. Example:
    
        { int n;      // ALLOC 1
          int m;      // ALLOC 2   (not cumulative: two variables total)
          { // new scope
             int n;   // ALLOC 3
             int x;   // ALLOC 4
          } // end of scope
          // ALLOC 2
        }
    
    I did some experiments with C++, where you can actually print addresses of
    variables.  C++ does not reclaim space used by "inner" allocations
    either, so I'm in good company. See mem.cpp.
    
    
    If the ALLOCs are not right, or if the offsets assigned to variables are not
    right, then two variables may overwrite one another. If this happens, the
    numeric results will be nonsense. Try changing the program to use only
    globals; if this fixes the problem, bad local-variable allocation is the
    likely culprit.
    
     class SymbolTable
    As each variable is declared, the role of the symbol table is to store
      its address. Global variables have addresses in main memory; local
      variables have addresses relative to the stack frame pointer FP. 
    The symbol table should do the following:
    
      - Provide a way of entering newly declared variables and function names
        into the symbol table, together with their information. For
        variables, this is their address, plus a flag indicating whether the
        variable is global or local. Global-variable addresses are references
        to main memory; local-variable addresses are offsets from the top of
        the stack. For functions, the essential information is the function's
        entry point, i.e. the address of its code. For constants, the
        essential information is the constant's type and value. All this is
        stored in class IdentInfo objects.
      - Provide a way of looking up, for each identifier, its current
        IdentInfo information.
      - When the scope of a set of variable declarations ends, all
        declarations that are part of that scope should disappear.
    The first is handled by the SymbolTable methods

      - allocVar()
      - allocConst()
      - allocProc()

    The information about constants and functions is provided via parameters,
    but newly declared variables need to be allocated space. All variables
    have size 1 (whether on the stack or in main memory, both of which count
    in 32-bit words), so this is straightforward.

    The second is handled by lookup().
    
    The IdentInfo object has a getType() method to indicate whether the
      identifier was a VARIABLE, CONSTANT, or FUNCTION, and appropriate other
      accessors to return the necessary information.
    The third is handled by the following pair:

      - newscope()
      - endscope()

    The first creates a new symbol-table scope, and the second terminates it.
      Note that redeclaring a variable in the same scope should be illegal,
      while redeclaring a variable from a past scope is not. 
    Currently, the compiler proper calls newscope() and endscope() only at
      the start and end of new functions, if INNERSCOPES is set to false.
      The language grammar allows declarations within any compound statement.
      Thus, in the code below the inner declaration of n3 is as if it were
      declared at the top, but the inner declaration of n2 would be seen as a
      duplicate:
        int foo() {
            int n1; int n2; int n4;
            n1 = 1; n2 = 10; n4 = 0;
            while (n1 < n2) {
                int n2; int n3;
                n2 = 4;
                n4 = n4 + n1 + n2;
                n1 = n1 + 1;
            }
            print(n4);
        }
    Neither C# nor Java would allow the redeclaration above of n2, though C++
    does. Different languages have different semantic rules for this
    sort of thing.
    The Symbol-Table Hack
    I've introduced three pathetic hacks to get you started without a "real"
      symbol table: 
    
      - Global variables may be named "Gnn" for two digits nn; these each get
        a unique location in memory (dependent on nn).
      - Local variables may be named "Ln"; there can be up to 10 of these.
      - There are four pre-recognized function names: "f1", "f2", "f3" and
        "f4".

    All of these are defined in the SymbolTable class.
    Variables of the form Gnn and Ln do not need to be declared; in fact,
      SymbolTable.allocVar() currently simply returns null. Nonetheless, the
      symbol-table hack will only allocate space for Ln variables if
      there is at least one variable declaration. It does not matter if it is a
      declaration of a "true" variable or an Ln variable.
    Use of the Ln and "true" variables together is very risky.
     Demo of the code
    The following all work using the hack above:
    
    The next two assume a real symbol table:
    
    A first step at a real SymbolTable
    The first step is a linked-list implementation; a C# sketch follows the
    list.

      - The main symbol table is a singly linked list of
        ⟨identifier, IdentInfo⟩ pairs.
      - There is also a stack of list pointers representing scope levels. On
        newScope(), push the current head of the symbol-table list. On
        endScope(), head = pop().
      - allocVar() searches the list only up to the end of the current scope
        before deciding there is no duplicate declaration and creating a new
        entry (at the head).
      - lookup() searches the entire list.
      - size() is likely to include globals too, which we don't want, so we
        need stackFrameSize() = size() - current_number_of_globals.
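    A minimal C# sketch of this design (all names here are illustrative, not
    the actual symboltable.cs code):
    
        using System.Collections.Generic;
    
        class Entry {
            public string Name; public IdentInfo Info; public Entry Next;
        }
    
        class ListSymbolTable {
            Entry head = null;                        // newest entry first
            Stack<Entry> marks = new Stack<Entry>();  // head at each newScope()
    
            public void newScope() { marks.Push(head); }
            public void endScope() { head = marks.Pop(); }
    
            public bool declare(string name, IdentInfo info) {
                Entry stop = marks.Count > 0 ? marks.Peek() : null;
                for (Entry e = head; e != stop; e = e.Next)  // current scope only
                    if (e.Name == name) return false;        // duplicate
                head = new Entry { Name = name, Info = info, Next = head };
                return true;
            }
    
            public IdentInfo lookup(string name) {
                for (Entry e = head; e != null; e = e.Next)  // entire list
                    if (e.Name == name) return e.Info;
                return null;
            }
        }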
A slightly different first step is the following (symboltable1.cs),
      which assumes INNERSCOPES = false. We create a single
      Dictionary<string,IdentInfo> object named Table. At the start of a
      new scope, we make a backup copy of this, in, say,
      TableBack. We now continue to add new local variables to Table. At the end
      of the scope, we restore Table to TableBack.
    We can tell which scope we're in by looking at the value of
      SymbolTable.scopecount. If it is 0, we are at the global level. If it is
      1, we are at the top level of some function declaration. If we set
      INNERSCOPES = true but want things to behave as if INNERSCOPES is false
      (that is, there are only globals and top-level locals), we create the
      backup Table when scopecount changes to 1 (but not if it increments
      further), and restore Table from the backup when scopecount changes to 0.
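    A hedged sketch of the backup idea (the field names follow the
    description above; the details are illustrative):
    
        Dictionary<string, IdentInfo> Table = new Dictionary<string, IdentInfo>();
        Dictionary<string, IdentInfo> TableBack = null;
    
        void newScope() {               // entering a function body
            TableBack = new Dictionary<string, IdentInfo>(Table);  // backup copy
        }
    
        void endScope() {               // leaving the function body
            Table = TableBack;          // locals added since then disappear
        }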
    For a two-level symbol table, we might modify symboltable1.cs so that:

      - Global variables are in a Dictionary named, say, GlobalDict.
      - Each time we enter a local scope (that is, each time we compile a new
        procedure), we create a new dictionary LocalDict for local variables.
      - On lookup, if the current scope is global we look only in GlobalDict.
        If the scope is local, we look first in LocalDict and then, if not
        found, in GlobalDict.
      - On creation of new entries (corresponding to new declarations), if
        the scope is local then we install the new entry in LocalDict. There
        is a duplicate-declaration conflict only if the identifier in
        question is already in LocalDict.
    Multi-level symbol table
    What if we want to support arbitrary nesting? Then we need our symbol
      table to support that. Here is one model (a sketch follows the list):

      - The symbol table is a List of Dictionary<String, IdentInfo>.
      - The global scope is at position 0 in the List.
      - newScope() adds a new Dictionary to the list; endScope() deletes one.
      - allocVar(), etc. search only the last Dictionary in the List.
      - lookup() searches each Dictionary in turn, starting with the last one
        and proceeding down to List[0].
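    A hedged C# sketch of this model (the method names follow the list above;
    everything else is illustrative):
    
        using System.Collections.Generic;
    
        class ScopedSymbolTable {
            List<Dictionary<string, IdentInfo>> scopes =
                new List<Dictionary<string, IdentInfo>> {
                    new Dictionary<string, IdentInfo>()    // [0] = globals
                };
    
            public void newScope() { scopes.Add(new Dictionary<string, IdentInfo>()); }
            public void endScope() { scopes.RemoveAt(scopes.Count - 1); }
    
            public bool declare(string name, IdentInfo info) {
                var current = scopes[scopes.Count - 1];      // last Dictionary only
                if (current.ContainsKey(name)) return false; // duplicate in scope
                current[name] = info;
                return true;
            }
    
            public IdentInfo lookup(string name) {
                for (int i = scopes.Count - 1; i >= 0; i--)  // innermost outward
                    if (scopes[i].TryGetValue(name, out var info)) return info;
                return null;
            }
        }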
     Declarations
    There are three kinds of declarations:

      - variables: attributes are boolean isGlobal and int location (either a
        global location or a local "offset"). We might also include type,
        except everything in minijava is int.
      - constants: the only attribute is the numeric value.
      - procedures: the main attribute is again int location (the code
        address). Although we are unlikely to use it, another might be the
        number of parameters.
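    
    A hedged sketch of IdentInfo covering the three kinds (getType() is from
    the SymbolTable section above; the other members are illustrative):
    
        enum IdType { VARIABLE, CONSTANT, FUNCTION }
    
        class IdentInfo {
            IdType type;
            bool isGlobal;    // variables only
            int location;     // variables: address or offset; functions: entry point
            int value;        // constants only
    
            public IdentInfo(IdType t, bool g, int loc, int val) {
                type = t; isGlobal = g; location = loc; value = val;
            }
            public IdType getType()     { return type; }
            public bool   getIsGlobal() { return isGlobal; }
            public int    getLocation() { return location; }
            public int    getValue()    { return value; }
        }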
    
    How would you implement type signatures?
        
    
     CompileExpr2
    The strategy is to have compileExpr return an Integer object if the expr is
    constant (either a final int or a
    number). 
    
    The problem case is an expression like 1+n, where n is a variable: the
    constant 1 has been deferred, but the code for n must be emitted first.
    One fix is to push the deferred numeric value after the variable, and
    then emit a SWAP instruction to get the operands into the right order
    when the operator is not commutative.
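    
    For example, 1-n might then compile to the following (hedged; SUB is
    assumed to compute second-from-top minus top):
    
        LOAD  n      // the variable, emitted first
        LOADI 1      // the deferred constant, pushed afterwards
        SWAP         // put the operands in the order SUB expects
        SUB          // computes 1-n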
    
    
     CompileExpr3
    Here we build the actual expression tree. The issue above with constants is
    also implemented.
    
    Note that the expression tree is different from the true parse tree. The
    original EBNF grammar is
        E -> T { addop T}
    In "plain" BNF grammar, this is:
        E -> T MT
        MT -> addop MT | empty
    This builds a somewhat weird tree (draw some examples). However, we can do
    better: 
    
    
    Define ExprNode, ConstNode, VarNode, UnopNode, BinopNode. Note that the
    latter four are subclasses of the abstract class ExprNode. We have one
    abstract method in ExprNode:  compile() (the only method at all!)
    
    Then modify compileSExpr, compileTerm so that, if both operands are
    ConstNodes, then we create a new ConstNode to hold the result, instead
      of generating lots of code.
    
    Note how the object hierarchy helps us here, and also how we check node
    types with is.
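    
    A hedged sketch of the hierarchy and the folding step (CodeStream,
    emit(), and emitLOADINT() are from earlier sections; everything else is
    illustrative):
    
        abstract class ExprNode {
            public abstract void compile(CodeStream cs);   // the only method
        }
    
        class ConstNode : ExprNode {
            public readonly int Value;
            public ConstNode(int v) { Value = v; }
            public override void compile(CodeStream cs) { cs.emitLOADINT(Value); }
        }
    
        class BinopNode : ExprNode {      // VarNode and UnopNode are analogous
            public readonly ExprNode Left, Right; public readonly char Op;
            public BinopNode(ExprNode l, char op, ExprNode r) { Left = l; Op = op; Right = r; }
            public override void compile(CodeStream cs) {
                Left.compile(cs); Right.compile(cs);       // postfix order
                cs.emit(Op == '+' ? ADD : Op == '-' ? SUB : Op == '*' ? MUL : DIV);
            }
        }
    
        // In compileSExpr/compileTerm: fold when both operands are ConstNodes.
        static ExprNode makeBinop(ExprNode l, char op, ExprNode r) {
            if (l is ConstNode cl && r is ConstNode cr) {
                int v = op == '+' ? cl.Value + cr.Value
                      : op == '-' ? cl.Value - cr.Value
                      : op == '*' ? cl.Value * cr.Value : cl.Value / cr.Value;
                return new ConstNode(v);
            }
            return new BinopNode(l, op, r);
        }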
    
    
     Common subexpressions
    Suppose we want to optimize away common subexpressions:
    
      - z1 = x+y; z2 = x+y;
      - z = (x+y)*(x+y)
The first example is an optimization among multiple statements; the second
    is an optimization within a single statement.
    
    How do you detect these?
    We construct a map of subexpressions, making sure that we match on
    reference equality (==) rather than value equality (.Equals/.equals);
    this means that we cannot use the default hashed-map behavior (Dictionary
    in C#, HashMap in Java).
    
    We can, however, still use hashing, if we pay careful attention to the
    hash-code method, e.g.

        BinOpNode.hashCode = mixup(left.hashCode(), right.hashCode(), (int) op);
    
    Each subexpression is first looked up in the hashtable; if we find an exact
    match there, we return the hashed entry. Note that we really want to be
    using == on the pointers, all the way, not .equals().
    
    Example: (x+y)*(x+y)

    1. Enter x, y, and BinOpNode(x, PLUS, y) in the table.
    2. Now do it again:
       - second x: find the entry for the previous x, and return a duplicate
         pointer to the same node.
       - second y: find the entry for the previous y, and return a duplicate
         pointer to the same node.
       - Now we want to see if we can find an instance of (x, PLUS, y). We
         search our subexpression table for an exact match, using == on all
         three fields. We find one! Therefore, we do not create a new
         BinOpNode; we just return a reference to the existing node.
    
    
    Technically, we want to do the lookup before
    actually creating the node.
    
    One way to achieve this is to modify the node constructors, but that can be
    massively confusing.
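    
    A hedged C# sketch of this lookup-before-create discipline (the comparer
    hashes and compares children by reference, as required; all names are
    illustrative):
    
        using System.Collections.Generic;
        using System.Runtime.CompilerServices;
    
        class BinopKey {
            public readonly ExprNode Left, Right; public readonly char Op;
            public BinopKey(ExprNode l, char op, ExprNode r) { Left = l; Op = op; Right = r; }
        }
    
        class BinopKeyComparer : IEqualityComparer<BinopKey> {
            public bool Equals(BinopKey a, BinopKey b) =>
                ReferenceEquals(a.Left, b.Left) && a.Op == b.Op
                && ReferenceEquals(a.Right, b.Right);
            public int GetHashCode(BinopKey k) =>          // the "mixup"
                RuntimeHelpers.GetHashCode(k.Left) * 31
                + RuntimeHelpers.GetHashCode(k.Right) * 7 + k.Op;
        }
    
        static Dictionary<BinopKey, ExprNode> seen =
            new Dictionary<BinopKey, ExprNode>(new BinopKeyComparer());
    
        // Look up before creating: reuse the node when the same (==) children
        // and operator have been seen before.
        static ExprNode makeShared(ExprNode l, char op, ExprNode r) {
            var key = new BinopKey(l, op, r);
            if (seen.TryGetValue(key, out var existing)) return existing;
            var node = new BinopNode(l, op, r);
            seen[key] = node;
            return node;
        }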
    
    Now we have to generate code. The first time we come to a common
    subexpression, we evaluate it, and create
      a new storage location to store the result. That is, we ALLOC a new
    variable temp on the stack frame,
    before compiling the expression. (Note that if we try this after
    starting to compile the expression, we already have emitted code that is
    pushing things onto the stack, and we will muddle things up.) That is, the
    code for
    
        z = (x+y)*(x+y)
    
    compiles as if it were
    
        temp = x+y;
        z = temp * temp
    
    Rather than reloading temp, though, we would probably do something like
    the following, say, for (x+y)*3*(x+y) (the intervening 3 makes the
    otherwise-obvious optimization less obvious):
    
        LOAD  x
        LOAD  y
        ADD
        DUP              // now two copies of x+y are on the stack
        STORF temp       // pops one copy
        LOADI 3
        MUL
        LOADF temp
        MUL