Mini-Java Compiler
As a case study for several data structures, we will look at a simple
compiler for a subset of the language Java; the subset is sufficient for
writing interesting programs. The code is compiled into machine code for a
stack machine we will call the smachine; the result can then be executed
via emulation.
The compiler lives in the following files:
- Compiler.java
- Machine.java
- SymbolTable.java
smachine code
There are three kinds of memory in the stack machine:
- the stack, holding 32-bit values
- global memory, an array of 64K 32-bit values
- program memory
The machine code is stack-based; we can load 32-bit variables and constants
onto the stack, do stack-based (postfix!) arithmetic (pop two operands and
push the result), and then store the result at a designated address.
Global-memory locations ("words") are also 32-bit quantities; addresses
identifying these words are 16 bits.
In processing the statement X = 3*Y+1, we would generate the following
(simplified) code:
LOAD 3
LOAD Y
MUL
LOAD 1
ADD
STOR X
In the actual code, we would replace Y and X with their 16-bit addresses
(either global or on the stack). In between statements, the stack does not
contain any operands, but it still holds stack
frames containing local variables, and also information about
pending function calls. Some opcodes (ADD, etc) have no operands; others
have a 16-bit operand generally representing an address.
The above code is simplified in the sense that numeric literals such as 1
and 3 are in fact loaded with the LOADI (LOAD Immediate) opcode, while local
variables are loaded with LOADF (LOAD relative to stack Frame) and global
variables are loaded with LOAD. The opcodes PLOAD and PSTOR, which allow
specification of an address determined at runtime, are available for
implementation of arrays, but the language does not use them at this time.
Instructions can be divided into the following categories:
- LOAD/STOR and variants (eg Load Immediate, Load relative to
stack, Load from main memory)
- Stack-based arithmetic: ADD, SUB, MUL, DIV, MOD, NEG, AND, OR, LAND,
LOR, NOT, XOR. These pop their operands (normally two, though a couple
are unary) and then push the result.
- Comparisons: CEQ, CGE, CGT (these are a form of stack-based
arithmetic, where the operator is ==, >=, or > respectively; 0 is false,
1 is true.)
- Jumps: JNZ, JZ, JGE, JGT. They pop a value and jump depending on that
value; the value is often set by one of the comparison opcodes.
- Subroutine instructions: JSR, RET, SYSRET
- System: HALT, NOP, ALLOC, DUP
- I/O: SPRINT, IPRINT
The machine code is defined in the Machine.java
class, and generated using the emit() call (a member of the codestream
class).
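As a hedged illustration (the CodeStream parameter below stands in for the codestream class, and the addresses chosen for X and Y are made up), the simplified X = 3*Y+1 example above might be emitted like this with the actual opcodes:

void genAssignment(CodeStream cs) {
    cs.emit(Machine.LOADI, 3);   // push the constant 3
    cs.emit(Machine.LOAD, 1);    // push global Y, assumed to live at address 1
    cs.emit(Machine.MUL);        // pop both, push 3*Y
    cs.emit(Machine.LOADI, 1);   // push the constant 1
    cs.emit(Machine.ADD);        // pop both, push 3*Y+1
    cs.emit(Machine.STOR, 0);    // pop the result into global X, assumed at address 0
}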
MiniJava
The language grammar is defined using Extended
Backus-Naur Form, a relatively standard formal grammar
specification. The declarations of the grammar are called productions,
and can all be found in Compiler.java The "extended" means that optional
parts can be indicated by enclosing in [ ]; zero-or-more repetition is
indicated by enclosing in {}. For example, if the "language" is a string of
0 or more b's, optionally preceded by an a, the grammar might be:
lang ::= [ 'a' ] { 'b' }
We can also write this, without [ ] and { }, as
lang ::= Apart Bpart
Apart ::= 'a' | empty
Bpart ::= 'b' Bpart | empty
Our parser is the part of the
program that follows the language rules to determine whether the input, as a
stream of tokens, conforms to those
rules. Intermingled in the compiler with the parsing statements are
code-generation and semantic statements.
We will use a parsing technique called recursive
descent with one-symbol lookahead,
in which we write a set of parsing procedures, one for each EBNF production.
A crucial feature is that we will always be able to tell which production to
follow using only the next token, represented in the Compiler.java class by
the String variable theToken. (Not
all grammars can be parsed this way.)
An example of the power of recursive-descent parsing is the following
grammar:
start ::= [ 'a' A ] { 'b' B } 'c'
A ::= 'a' 'a' | 'b' | 'c' 'd'
B ::= 'a' | 'b'
The corresponding pseudocode follows, where theToken contains the current
token (a letter 'a', 'b', 'c', or 'd'), and accept(x) means "if theToken==x,
then theToken = the next token, otherwise halt". Note that at each point we
are deciding what to do based only on the next letter; if production A had
been A ::= 'a' 'a' | 'a' 'b', this would not have been possible (as
written).
parseA() {
    if (theToken == 'a') { accept('a'); accept('a'); }
    else if (theToken == 'b') { accept('b'); }
    else if (theToken == 'c') { accept('c'); accept('d'); }
    else error();
}
parseB() {
    if (theToken == 'a') { accept('a'); }
    else if (theToken == 'b') { accept('b'); }
    else error();
}
parseStart() {
    if (theToken == 'a') {
        accept('a');
        parseA();
    }
    while (theToken == 'b') {
        accept('b');
        parseB();
    }
    accept('c');
}
Note how [ ] corresponds to the if statement, and { } to
the while. Also note that the symbol-checking part of
accept() is unnecessary if we've just checked the token, but in the
remaining instances (the second accept('a') and the accept('d') in parseA(),
and the final accept('c') in parseStart()) it plays an important role.
In the Compiler.java file, good examples can be found in:
- compileStatement()
- compileCompound()
- compileWhile()
- compilePrint() (note the alternative parsing idiom)
- compileIdentStmt() (note the one-symbol-lookahead workaround)
Variables
There are two kinds of variables: global and local. Local variables are
allocated on stack frames, and
globals go in a separate area of memory. The role of the symbol table is to
figure out exactly where. The LOAD opcode, which takes as operand an integer
address in the "memory" area of a Machine object, is for global variables.
The LOADI (LOAD Immediate) opcode is for loading 16-bit numeric constants,
with LOADH (LOAD High) an optional followup for loading the upper half of a
32-bit constant. See compileFactor() for an example of the use of
emitLOADINT(), sketched below.
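Here is a minimal sketch of what emitLOADINT() might do, assuming LOADI supplies the low 16 bits of the value and LOADH fills in the upper half; the exact encoding in Machine.java may differ:

void emitLOADINT(CodeStream cs, int value) {
    cs.emit(Machine.LOADI, value & 0xFFFF);   // push the low 16 bits
    int high = value >>> 16;
    if (high != 0)                            // upper half needed?
        cs.emit(Machine.LOADH, high);         // set the high 16 bits of the pushed word
}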
For local variables, which must live on the stack (in the stack frame) so
that local variables of different activations of the same recursive function
don't collide, we need a different approach. We use the LOADF opcode, where
the argument is the "offset" of the variable in the stack frame, and the
actual location is stack[FP+arg], where FP is the Frame Pointer. Each
procedure has one stack frame (regardless of scope, below), and the FP
always points to the stack frame of the currently executing procedure.
Global variables are given locations in the memory area, beginning with
address 0, and incrementing by 1. Local variables are given offsets in the
local area, beginning with 0 and incrementing by 1. Whenever you create a
new local variable, you need to generate cs.emit(ALLOC, 1), which reserves
one more word in the current stack frame. (Under the current semantics,
described in the Scope section below, each ALLOC n increments SP by n, so
successive ALLOCs are cumulative.)
The compileDeclaration() method has a parameter, isGlobal, that tells you at
runtime whether the declaration you are processing is global or local. You
need that to properly construct the symbol-table entry, and to decide
whether to emit an ALLOC instruction.
We can also define "constants" using final
int. Finally, function definitions will also go into the symbol
table.
Expressions
I've used a reasonably common, but somewhat compressed, syntax for
expressions:
expr ::= simple_expr [ relop simple_expr ]
simple_expr ::= ['+'|'-'] term { addop term }
term ::= factor { mulop factor }
factor ::= identifier | number | '(' expr ')'
The four levels here give multiplication a higher precedence than addition;
factors allow the use of parentheses for grouping. Exprs that are not also
simple_exprs represent boolean comparisons, though we do not attempt to do
any boolean type-checking. The && operator is considered to be a
mulop, and || is an addop; a consequence of this is that we need full
parentheses around each boolean comparison when we chain them with || and
&&:
if ((x<10) || (y<20))
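To make the parsing pattern concrete, here is a hedged sketch of the recursive-descent code for term ::= factor { mulop factor }; the helpers isMulOp() and opcodeFor() are illustrative assumptions, while compileFactor(), accept(), theToken, and cs follow the notes:

void compileTerm() {
    compileFactor();                // code for the left operand: its value ends up on the stack
    while (isMulOp(theToken)) {     // '*', '/', '%', and '&&' all count as mulops
        String op = theToken;
        accept(op);
        compileFactor();            // right operand now on top of the stack
        cs.emit(opcodeFor(op));     // e.g. MUL: pop two operands, push the product
    }
}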
Strings can only be used in print
statements, and the actual string contents are stored in the Machine in a
separate area from either the stack or the "memory".
Using the compiler
I usually compile everything into compiler.exe. Then the following will
compile and run a program:
mono compiler.exe foo.mjava
Execution begins with the minijava function named main.
Be sure you have one!
Scope
Not only do we have global variables and local variables, but there are
differing scopes for local variables; a new one begins with every '{'. (This
is true of C# and Java, too, though C# doesn't let you reuse existing
identifiers.) In particular, you can do the following:
f() {
    final int N = 3;
    { int N; N=5; print(N); }
    print(N);
}
This will print 5 and then 3.
(This behavior depends on the value of Compiler.INNERSCOPES; if it is true,
then the semantics are as above. You may set it to false,
however, or simply assume it is false and not include any inner-scope
declarations like the one above.)
Allowing a new scope to start at every '{' means two things. First, whenever
you encounter '{' (in compileCompound()), you will have to create a new
Table object, because you don't want a new declaration of N overwriting the
older declaration. You will thus have a list
of Table objects, and you will "search" the list in linear order, from most
recent to oldest. Second, you have to manage stack-frame allocation,
discussed next.
Under the current semantics for ALLOC N, each time it is called, SP is
incremented by N (leaving N more units of space between FP and SP). For each
int variable declared, all you need to do is ALLOC 1.
This is much simpler than an earlier formulation, where ALLOC N set SP = FP
+ N, and successive ALLOCs were not cumulative. In that older
model, you would need to keep a counter of how many variables are allocated
along with each per-scope Table object (theTable.size() might do just fine).
Then, every time a variable is declared, one would have to emit ALLOC N,
where N was the current number of variables.
The catch comes at the end of the scope; it's much harder with the current
interpretation to reclaim space used by the inner declarations. In
principle, with the old interpretation, you would emit ALLOC N where N is
now the number of variables as of the previous,
just-returned-to, scope. Example:
{ int n;     // ALLOC 1
  int m;     // ALLOC 2 -- these aren't cumulative! you have two variables!
  {          // new scope
    int n;   // ALLOC 3
    int x;   // ALLOC 4
  }          // end of scope
             // ALLOC 2
}
I did some experiments with C++, where you can actually print addresses of
variables. C++ does not reclaim space used by "inner" allocations
either, so I'm in good company. See mem.cpp.
If the ALLOCs are not right, or if the offsets assigned to variables are not
right, then two variables may overwrite one another. If this happens, the
numeric results will be nonsense. Try changing the program to use only
globals; if this fixes the problem, bad local-variable allocation is the
likely culprit.
class SymbolTable
As each variable is declared, the role of the symbol table is to store
its address. Global variables have addresses in main memory; local
variables have addresses relative to the stack frame pointer FP.
The symbol table should do the following:
- Provide a way of entering newly declared variables and function names
into the symbol table, together with their information. For variables,
this is their address, and also a flag indicating
whether the variable is global or local. Global-variable addresses are
references to main memory; local-variable addresses are offsets from the
frame pointer FP. For functions, the essential information is the
function's entry point, or address of its code. For
constants, the essential information is the constant's type and value.
All this is stored in class IdentInfo objects.
- Provide a way of looking up, for each identifier, its current IdentInfo
information.
- When the scope of a set of variable declarations
ends, all declarations that are part of that scope should disappear.
The first is handled by the SymbolTable methods
- allocVar()
- allocConst()
- allocProc()
The information about constants and functions is provided via parameters,
but newly declared variables need to be allocated space. All variables
have size 1 (on either the stack or in main memory, both of which count in
32-bit words) so this is straightforward.
The second is handled by lookup().
The IdentInfo object has a getType() method to indicate whether the
identifier was a VARIABLE, CONSTANT, or FUNCTION, and appropriate other
accessors to return the necessary information.
The third is handled by the pair newScope() and endScope():
the first creates a new symbol-table scope, and the second terminates it.
Note that redeclaring a variable in the same scope should be illegal,
while redeclaring a variable from a past scope is not.
Currently, if INNERSCOPES is set to false, the compiler proper calls
newScope() and endScope() only at the start and end of function definitions.
The language grammar allows declarations within any compound statement.
Thus, in the code below the inner declaration of n3 is as if it were
declared at the top, but the inner declaration of n2 would be seen as a
duplicate:
int foo() {
    int n1; int n2; int n4;
    n1 = 1; n2 = 10; n4 = 0;
    while (n1 < n2) {
        int n2; int n3;
        n2 = 4;
        n4 = n4 + n1 + n2;
        n1 = n1 + 1;
    }
    print(n4);
}
Neither C# nor Java would allow the redeclaration above of n2, though C++
does. Different languages have different semantic rules for this
sort of thing.
The Symbol-Table Hack
I've introduced three pathetic hacks to get you started without a "real"
symbol table:
- Global variables may be named "Gnn" for two digits nn; these each get
a unique location in memory (dependent on nn)
- Local variables may be named "Ln"; there can be up to 10 of these.
- There are four pre-recognized function names "f1", "f2", "f3" and
"f4".
All of these are defined in the SymbolTable class.
Variables of the form Gnn and Ln do not need to be declared; in fact,
SymbolTable.allocVar() currently simply returns null. Nonetheless, the
symbol-table hack will only allocate space for Ln variables if
there is at least one variable declaration. It does not matter if it is a
declaration of a "true" variable or an Ln variable.
Use of the Ln and "true" variables together is very risky.
Demo of the code
The following all work using the hack above:
The next two assume a real symbol table:
A first step at a real SymbolTable
The first step is a linked-list implementation, sketched after this list.
- The main symbol table is a singly linked list of
⟨identifier,IdentInfo⟩ pairs
- There is also a stack of list pointers representing
scope levels. On newScope(), push the current head of the symbol-table
list. On endScope(), head = pop().
- allocate() searches the list only up to the end of the current scope
before deciding there is no duplicate declaration, and creating a new
entry (at the head)
- lookup() searches the entire list
- size() is likely to include globals too, which we don't want. So we
need stackFrameSize() = size() - current_number_of_globals
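Here is a minimal sketch of this linked-list table under the assumptions above (IdentInfo is the info class described earlier; the other names are illustrative):

class ListSymbolTable {
    static class Node {
        final String name;
        final IdentInfo info;
        final Node next;
        Node(String name, IdentInfo info, Node next) {
            this.name = name; this.info = info; this.next = next;
        }
    }

    private Node head = null;   // most recent declaration first
    // Stack of scope marks: each entry is the head as of the start of a scope.
    private final java.util.Deque<Node> scopeMarks = new java.util.LinkedList<>();

    void newScope() { scopeMarks.push(head); }    // remember where this scope began
    void endScope() { head = scopeMarks.pop(); }  // drop this scope's entries

    // Search only the current scope for duplicates, then prepend the new entry.
    boolean allocate(String name, IdentInfo info) {
        Node stop = scopeMarks.peek();            // first node of the enclosing scope
        for (Node p = head; p != stop; p = p.next)
            if (p.name.equals(name)) return false;    // duplicate declaration
        head = new Node(name, info, head);
        return true;
    }

    // lookup() searches the entire list, most recent scope first.
    IdentInfo lookup(String name) {
        for (Node p = head; p != null; p = p.next)
            if (p.name.equals(name)) return p.info;
        return null;
    }
}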
A slightly different first step is the following (SymbolTable1.java),
which assumes INNERSCOPES = false. We create a single
Dictionary<string,IdentInfo> object named Table. At the start of a
new scope, we make a backup copy of this, in, say,
TableBack. We now continue to add new local variables to Table. At the end
of the scope, we restore Table from TableBack.
We can tell which scope we're in by looking at the value of
SymbolTable.scopecount. If it is 0, we are at the global level. If it is
1, we are at the top level of some function declaration. If we set
INNERSCOPES = true but want things to behave as if INNERSCOPES is false
(that is, there are only globals and top-level locals), we create the
backup Table when scopecount changes to 1 (but not if it increments
further), and restore Table from the backup when scopecount changes to 0.
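A sketch of this backup-and-restore scheme (method and field names are assumptions, following the description above):

class SymbolTable1 {
    private java.util.Map<String, IdentInfo> table = new java.util.HashMap<>();
    private java.util.Map<String, IdentInfo> tableBack = null;
    private int scopecount = 0;   // 0 = global level, 1 = top level of a function

    void newScope() {
        scopecount++;
        if (scopecount == 1)
            tableBack = new java.util.HashMap<>(table);   // back up the globals
    }
    void endScope() {
        scopecount--;
        if (scopecount == 0)
            table = tableBack;    // discard all of the function's locals at once
    }

    void enter(String name, IdentInfo info) { table.put(name, info); }
    IdentInfo lookup(String name) { return table.get(name); }
}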
For a two-level symbol table, we might modify
SymbolTable1.java so that:
- Global variables are in a Dictionary named, say, GlobalDict
- Each time we enter a local scope (that is, each time we compile a new
procedure), we create a new dictionary LocalDict for local variables.
- On lookup, if the current scope is global we look only in GlobalDict.
If the scope is local, we look first in LocalDict and then, if not
found, in GlobalDict.
- On creation of new entries (corresponding to new declarations), if the
scope is local then we install the new entry in LocalDict. There is a
duplicate-declaration conflict only if the identifier in question is
already in LocalDict.
Multi-level symbol table
What if we want to support arbitrary nesting? Then we need our symbol
table to support that. Here is one model (a sketch follows the list):
- The symbol table is a List of Dictionary<String, IdentInfo>
- The global scope is at position 0 in the List
- newScope() adds a new Dictionary to the list; endScope() deletes one.
- allocVar(), etc searches only the last Dictionary in the List.
- lookup() searches each Dictionary in turn, starting with the last one
and proceeding down to List[0].
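A hedged sketch of that model (names are illustrative assumptions):

class ScopedSymbolTable {
    private final java.util.List<java.util.Map<String, IdentInfo>> scopes =
        new java.util.ArrayList<>();

    ScopedSymbolTable() { scopes.add(new java.util.HashMap<>()); }  // position 0: globals

    void newScope() { scopes.add(new java.util.HashMap<>()); }
    void endScope() { scopes.remove(scopes.size() - 1); }

    // New declarations go in the innermost scope; only it is checked for duplicates.
    boolean allocate(String name, IdentInfo info) {
        java.util.Map<String, IdentInfo> inner = scopes.get(scopes.size() - 1);
        if (inner.containsKey(name)) return false;   // duplicate in this scope
        inner.put(name, info);
        return true;
    }

    // lookup() searches from the innermost scope down to the globals.
    IdentInfo lookup(String name) {
        for (int i = scopes.size() - 1; i >= 0; i--) {
            IdentInfo info = scopes.get(i).get(name);
            if (info != null) return info;
        }
        return null;
    }
}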
Declarations
There are three kinds of declarations:
- variables: attributes are boolean isGlobal, and int location (either a
global location or a local "offset"). We might also include type, except
everything in minijava is int.
- constants: the only attribute is the numeric value.
- procedures: the main attribute is again int location (the code address).
Although we are unlikely to use it, another attribute might be the number
of parameters.
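A hedged sketch of an IdentInfo class holding these attributes (the actual field and accessor names may differ):

class IdentInfo {
    enum Kind { VARIABLE, CONSTANT, FUNCTION }

    private final Kind kind;
    private final boolean isGlobal;   // variables only
    private final int location;       // variables: address or offset; functions: entry point
    private final int value;          // constants only

    IdentInfo(Kind kind, boolean isGlobal, int location, int value) {
        this.kind = kind; this.isGlobal = isGlobal;
        this.location = location; this.value = value;
    }

    Kind getType()        { return kind; }
    boolean getIsGlobal() { return isGlobal; }
    int getLocation()     { return location; }
    int getValue()        { return value; }
}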
How would you implement type signatures?
CompileExpr2
The strategy is to have compileExpr return an Integer object if the expr is
constant (either a final int or a number), so that constant subexpressions
can be folded without emitting any code.
The problem: consider 1+n, where n is a variable. The constant 1 is not
pushed right away (in case the whole expression folds to a constant), so by
the time we know that n is a variable and have emitted the code to load it,
the constant gets pushed second and lands on top of n: the operands are in
the wrong order. One fix is to push the numeric value anyway, and then emit
a SWAP instruction to get the operands into the right order if the operator
is not commutative.
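A sketch of the operand-combining step under this strategy; applyOp(), isCommutative(), and opcodeFor() are assumed helpers, while emitLOADINT(), SWAP, and cs follow the notes. A null argument means that operand's code has already been emitted:

// Returns an Integer if the result folded to a constant, else null
// (meaning the result is left on the stack).
Integer combine(Integer left, Integer right, String op) {
    if (left != null && right != null)       // both constant: fold, emit nothing
        return applyOp(left, right, op);
    if (left != null) {                      // right operand is already on the stack
        emitLOADINT(cs, left);               // the deferred constant lands on top...
        if (!isCommutative(op))              // ...which is the wrong order for
            cs.emit(Machine.SWAP);           // '-', '/', '%': swap the two operands
    } else if (right != null) {
        emitLOADINT(cs, right);              // constant on top: order is correct
    }
    cs.emit(opcodeFor(op));                  // pop two operands, push the result
    return null;
}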
CompileExpr3
Here we build the actual expression tree. The constant-folding strategy
described above is implemented here as well.
Note that the expression tree is different from the true parse tree. The
original EBNF grammar is
E -> T { addop T}
In "plain" BNF grammar, this is:
E -> T MT
MT -> addop MT | empty
This builds a somewhat weird tree (draw some examples). However, we can do
better:
Define ExprNode, ConstNode, VarNode, UnopNode, BinopNode. Note that the
latter four are subclasses of the abstract class ExprNode. We have one
abstract method in ExprNode: compile() (the only method at all!)
Then modify compileSExpr and compileTerm so that, if both operands are
ConstNodes, we create a new ConstNode to hold the result, instead
of generating lots of code.
Note how the object hierarchy helps us here, and also how we check node
types with is (instanceof, in Java). A sketch of the hierarchy follows.
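A hedged sketch of the hierarchy and of the folding step (opcodeFor() is an assumed helper; VarNode and UnopNode would follow the same pattern as the classes shown):

abstract class ExprNode {
    abstract void compile(CodeStream cs);    // the only method!
}

class ConstNode extends ExprNode {
    final int value;
    ConstNode(int value) { this.value = value; }
    void compile(CodeStream cs) { cs.emit(Machine.LOADI, value); }  // LOADH followup omitted
}

class BinopNode extends ExprNode {
    final ExprNode left, right;
    final int op;                            // operator character, e.g. '+'
    BinopNode(ExprNode left, int op, ExprNode right) {
        this.left = left; this.op = op; this.right = right;
    }
    void compile(CodeStream cs) {            // postfix: left, right, then the operator
        left.compile(cs);
        right.compile(cs);
        cs.emit(opcodeFor(op));
    }
}

// In the compiler class, used by compileSExpr()/compileTerm():
// fold two constants instead of building a BinopNode.
ExprNode makeBinop(ExprNode left, int op, ExprNode right) {
    if (left instanceof ConstNode && right instanceof ConstNode)
        return new ConstNode(applyOp(((ConstNode) left).value, op, ((ConstNode) right).value));
    return new BinopNode(left, op, right);
}

int applyOp(int a, int op, int b) {
    switch (op) {
        case '+': return a + b;
        case '-': return a - b;
        case '*': return a * b;
        default:  throw new IllegalArgumentException("unhandled operator");
    }
}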
Common subexpressions
Suppose we want to optimize away common subexpressions:
- z1 = x+y; z2 = x+y;
- z = (x+y)*(x+y)
The first example is an optimization among multiple statements; the second
is an optimization within a single statement.
How do you detect these?
We construct a Map of subexpressions, making sure that we look
for == (pointer-identity) matches instead of .equals matches; this means
that we cannot use a HashMap keyed directly on the nodes.
We can, however, still use hashing, if we pay careful attention to the
hashCode() method, eg
BinOpNode.hashCode = mixup(left.hashCode(),
right.hashCode(), (int) operand);
Each subexpression is first looked up in the hashtable; if we find an exact
match there, we return the hashed entry. Note that we really want to be
using == on the pointers, all the way, not .equals().
Example: (x+y)*(x+y)
1. Enter x, y, and BinOpNode(x, PLUS, y) in the table.
2. Now do it again:
- second x: find the entry for the previous x, and return a duplicate
pointer to the same node.
- second y: find the entry for the previous y, and return a duplicate
pointer to the same node.
- Now we want to see if we can find an instance of (x, PLUS, y). We search
our subexpression table for an exact match, using == on all three fields.
We find one! Therefore, we do not create a new BinOpNode; we just return a
reference to the existing cell.
Technically, we want to do the lookup before
actually creating the node. One way to achieve this is to modify the node
constructors, but that can be massively confusing; routing creation through
a lookup method, as sketched below, is simpler.
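One way to reconcile this with a standard HashMap is to key the table on a small wrapper whose equals() compares the three fields with == and whose hashCode() plays the role of the mixup() above. This is a hedged sketch; the wrapper trick is one possible realization, and the names are assumptions:

class NodeTable {
    private static class Key {
        final ExprNode left, right;
        final int op;
        Key(ExprNode left, int op, ExprNode right) {
            this.left = left; this.op = op; this.right = right;
        }
        @Override public int hashCode() {    // the mixup() of the notes
            return (System.identityHashCode(left) * 31
                    + System.identityHashCode(right)) * 31 + op;
        }
        @Override public boolean equals(Object o) {
            if (!(o instanceof Key)) return false;
            Key k = (Key) o;
            return left == k.left && op == k.op && right == k.right;  // == on all three fields
        }
    }

    private final java.util.Map<Key, ExprNode> table = new java.util.HashMap<>();

    // Do the lookup BEFORE creating the node; reuse the existing node on a hit.
    ExprNode makeBinop(ExprNode left, int op, ExprNode right) {
        Key k = new Key(left, op, right);
        ExprNode node = table.get(k);
        if (node == null) {
            node = new BinopNode(left, op, right);
            table.put(k, node);
        }
        return node;
    }
}

Leaf nodes (the x's and y's) would be interned the same way, keyed by name.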
Now we have to generate code. The first time we come to a common
subexpression, we evaluate it, and create
a new storage location to store the result. That is, we ALLOC a new
variable temp on the stack frame,
before compiling the expression. (Note that if we try this after
starting to compile the expression, we already have emitted code that is
pushing things onto the stack, and we will muddle things up.) That is, the
code for
z = (x+y)*(x+y)
compiles as if it were
temp = x+y;
z = temp * temp
Rather than computing temp and then reloading it twice, though, we would
probably do something like the following for (x+y)*3*(x+y) (the intervening
3 makes a superficially obvious optimization less obvious):
LOAD x
LOAD y
ADD
DUP          // now two copies of x+y are on the stack
STORF temp   // pops one copy
LOADI 3
MUL
LOADF temp
MUL