Comp 271 Week 13
Mini-Java Compiler
Your final project (there will be no more labs per se, though I'll
assign the project in two phases) is to add some features to a simple
compiler for a language I'll call "mini-java". The programs compile to
code for a virtual stack-based machine. The first part of your
assignment is to get the symbol table working.
The language grammar is defined using Extended Backus-Naur Form, a relatively standard formal grammar specification. The declarations of the grammar are called productions. The
"extended" means that optional parts can be indicated by enclosing in [
]; zero-or-more repetition is indicated by enclosing in {}. For
example, if the "language" is a string of 0 or more b's, optionally
preceded by an a, the grammar might be:
lang ::= [ 'a' ] { 'b' }
We can also write this, without [ ] and { }, as
lang ::= Apart Bpart
Apart ::= 'a' | empty
Bpart ::= 'b' Bpart | empty
Our parser is the part of the compiler that checks whether the input, as a stream of tokens, follows the language rules. Intermingled in the compiler with the parsing statements are code-generation and semantic statements.
We will use a parsing technique called recursive descent with one-symbol lookahead,
in which we write a set of parsing procedures, one for each EBNF
production. A crucial feature is that we will always be able to tell
what production to follow using only the next token, represented in the
Compiler.java class by the String variable theToken. (Not all grammars can be parsed this way.)
An example of the power of recursive-descent parsing is the following grammar:
start ::= [ 'a' A ] { 'b' B } 'c'
A ::= 'a' 'a' | 'b' | 'c' 'd'
B ::= 'a' | 'b'
The corresponding pseudocode follows, where theToken contains the
current token (a letter 'a', 'b', 'c', 'd'), and accept(x) means "if
theToken==x, then theToken = the next token; otherwise halt". Note that
at each point we decide what to do based only on
the next letter. If production A had instead been A ::= 'a' 'a' | 'a'
'b', this would not be possible (as written), because both alternatives begin with 'a'.
parseA() {
    if (theToken == 'a') { accept('a'); accept('a'); }
    else if (theToken == 'b') { accept('b'); }
    else if (theToken == 'c') { accept('c'); accept('d'); }
    else error();
}
parseB() {
    if (theToken == 'a') { accept('a'); }
    else if (theToken == 'b') { accept('b'); }
    else error();
}
parseStart() {
    if (theToken == 'a') {
        accept('a');
        parseA();
    }
    while (theToken == 'b') {
        accept('b');
        parseB();
    }
    accept('c');
}
Note how [ ] corresponds to the if statement, and { } to the while.
Also note that the symbol-checking part of accept() is unnecessary if
we've just checked the token, but it plays an important role wherever no
check precedes it: the second accept('a') and the accept('d') in parseA(),
and the final accept('c') in parseStart().
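The pseudocode above can be turned into a small runnable recognizer. This is only a sketch: the class name StartParser, the matches() helper, and the '\0' end-of-input marker are assumptions made for illustration, and tokens are single characters rather than the String theToken used in Compiler.java.

```java
// Sketch of a recursive-descent recognizer for:
//   start ::= [ 'a' A ] { 'b' B } 'c'
//   A     ::= 'a' 'a' | 'b' | 'c' 'd'
//   B     ::= 'a' | 'b'
// Names are illustrative; they are not taken from Compiler.java.
class StartParser {
    private static final char END = '\0';   // hypothetical end-of-input marker
    private final String input;
    private int pos = 0;
    private char theToken;                   // the one-symbol lookahead

    StartParser(String input) {
        this.input = input;
        advance();
    }

    private void advance() {
        theToken = pos < input.length() ? input.charAt(pos++) : END;
    }

    // If the lookahead matches, consume it; otherwise stop with an error.
    private void accept(char expected) {
        if (theToken != expected) {
            throw new IllegalStateException("expected '" + expected + "'");
        }
        advance();
    }

    private void parseA() {
        if (theToken == 'a') { accept('a'); accept('a'); }
        else if (theToken == 'b') { accept('b'); }
        else if (theToken == 'c') { accept('c'); accept('d'); }
        else throw new IllegalStateException("no alternative of A applies");
    }

    private void parseB() {
        if (theToken == 'a') { accept('a'); }
        else if (theToken == 'b') { accept('b'); }
        else throw new IllegalStateException("no alternative of B applies");
    }

    private void parseStart() {
        if (theToken == 'a') { accept('a'); parseA(); }      // [ 'a' A ]
        while (theToken == 'b') { accept('b'); parseB(); }   // { 'b' B }
        accept('c');
        accept(END);                                         // reject trailing input
    }

    // Returns true if the whole string can be derived from start.
    static boolean matches(String s) {
        try {
            new StartParser(s).parseStart();
            return true;
        } catch (IllegalStateException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        for (String s : new String[] {"c", "aaac", "abbac", "acdbbc", "ab", "aac"}) {
            System.out.println(s + " -> " + matches(s));
        }
    }
}
```

The first four sample strings are accepted and the last two are rejected; "aac" fails inside parseA() at the second accept('a'), which is one of the places where accept() does real checking.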
In the Compiler.java file, good examples can be found in:
- compileStatement()
- compileCompound()
- compileWhile()
- compilePrint() (note alternative parsing idiom)
- compileIdentStmt() (note one-symbol-lookahead workaround)
Machine code
Our machine code is stack-based; we can load variables and constants to
the stack, do stack-based (postfix!) arithmetic, and then store the
results. (We're not supporting arrays.) In processing the statement X =
3*Y+1, we would generate the following (simplified) code:
LOAD 3
LOAD Y
MUL
LOAD 1
ADD
STOR X
In between statements, the
stack does not contain any operands, but it still holds stack frames
containing local variables, and also information about pending function
calls. Some opcodes (ADD, etc.) have no operands; others have a 16-bit
operand. Addresses can thus be 16 bits, covering a range of 64K "words".
Code is defined in the Machine class, and generated relative to a CodeStream instance using the emit() call.
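To see why the postfix listing for X = 3*Y+1 works, here is a toy evaluator for it. It is a sketch and not the course's Machine class: instructions are plain strings, variables are found by name rather than by numeric address, and the class name TinyStackMachine and its methods are invented for the example.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;

// Toy evaluator for the simplified listing above. NOT the real Machine class:
// the real machine uses numeric opcodes and 16-bit operands.
class TinyStackMachine {
    final Map<String, Integer> memory = new HashMap<>();   // variable name -> value
    private final Deque<Integer> stack = new ArrayDeque<>();

    // Executes one instruction such as "LOAD 3", "LOAD Y", "MUL" or "STOR X".
    void step(String instruction) {
        String[] parts = instruction.trim().split("\\s+");
        switch (parts[0]) {
            case "LOAD": {
                String arg = parts[1];
                // A numeric operand is a constant; anything else names a variable.
                if (Character.isDigit(arg.charAt(0))) {
                    stack.push(Integer.parseInt(arg));
                } else {
                    stack.push(memory.getOrDefault(arg, 0));
                }
                break;
            }
            case "ADD": {
                int right = stack.pop();
                int left = stack.pop();
                stack.push(left + right);
                break;
            }
            case "MUL": {
                int right = stack.pop();
                int left = stack.pop();
                stack.push(left * right);
                break;
            }
            case "STOR":
                memory.put(parts[1], stack.pop());
                break;
            default:
                throw new IllegalArgumentException("unknown opcode: " + parts[0]);
        }
    }

    void run(String... program) {
        for (String instruction : program) {
            step(instruction);
        }
    }

    boolean stackIsEmpty() {
        return stack.isEmpty();
    }

    public static void main(String[] args) {
        TinyStackMachine m = new TinyStackMachine();
        m.memory.put("Y", 4);
        m.run("LOAD 3", "LOAD Y", "MUL", "LOAD 1", "ADD", "STOR X");
        System.out.println("X = " + m.memory.get("X"));   // prints X = 13
    }
}
```

After STOR X the operand stack is empty again, which matches the remark that no operands remain on the stack between statements.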
Variables
There are two kinds of variables: global and local. Local variables are allocated on stack frames,
and globals go in a separate area of memory. The role of the symbol
table is to figure out exactly where. The LOAD opcode, which takes as
operand an integer address in the "memory" area of a Machine object, is
for global variables. The LOADI (LOAD Immediate) opcode is for loading 16-bit
numeric constants, with LOADH (LOAD High) as an optional follow-up for
loading the upper half of a 32-bit constant. See compileFactor() for an
example of the use of emitLOADINT().
For local variables, which must live on the stack (in the stack frame)
so that local variables of different activations of the same recursive
function don't collide, we need a different approach. We use the LOADF
opcode, where the argument is the "offset" of the variable in the stack
frame, and the actual location is stack[FP+arg], where FP is the Frame
Pointer. Each procedure activation has one stack frame (regardless of how many
scopes it contains; see Scope below), and the FP always points to the stack frame of the currently
executing procedure.
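The point of FP-relative addressing is easiest to see with recursion. The following is a rough sketch only: the frame layout is invented for the demo (the real Machine also keeps return information on the stack), and FrameDemo, loadF(), and storF() are hypothetical names that mimic LOADF and its store counterpart.

```java
// Illustration of frame-relative addressing: a local lives at stack[FP + offset].
// The layout is invented for this demo; Java's own call stack stands in for the
// saved FP and return address that a real machine would keep in the frame.
class FrameDemo {
    private final int[] stack = new int[256];
    private int fp = 0;   // Frame Pointer: base of the current frame
    private int sp = 0;   // first free slot above the current frame

    void storF(int offset, int value) { stack[fp + offset] = value; }

    int loadF(int offset) { return stack[fp + offset]; }

    // Recursive factorial that keeps its argument in local slot 0.
    int fact(int n) {
        int savedFp = fp;
        int savedSp = sp;
        fp = sp;          // the new frame starts at the old top of stack
        sp = fp + 1;      // like ALLOC 1: room for one local variable
        storF(0, n);
        int result = 1;
        if (loadF(0) > 1) {
            result = fact(loadF(0) - 1);   // the callee gets its own frame
            result = loadF(0) * result;    // our slot 0 was not overwritten
        }
        fp = savedFp;     // pop the frame on return
        sp = savedSp;
        return result;
    }

    public static void main(String[] args) {
        System.out.println(new FrameDemo().fact(5));   // prints 120
    }
}
```

Every activation uses offset 0 for its copy of n, yet the copies never collide because each activation has a different FP.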
Global variables are given locations in the memory area, beginning with
address 0, and incrementing by 1. Local variables are given offsets in
the local area, beginning with 0 and incrementing by 1. Whenever you
create a new local variable, you need to generate cs.emit(ALLOC, n)
where n is the total number of
local variables. Note that when variables go out of scope, you need to
do another cs.emit(ALLOC, n) with the now-smaller value of n.
The compileDeclaration() method has a parameter, isGlobal, that tells
you at runtime whether the declaration you are processing is global or
local. You need that to properly construct the symbol-table entry, and
to decide whether to emit an ALLOC instruction.
We can also define "constants" using final int. Finally, function definitions will also go into the symbol table.
Expressions
I've used a common, if slightly squirrelly, syntax for expressions:
expr ::= simple_expr [ relop simple_expr ]
simple_expr ::= ['+'|'-'] term { addop term }
term ::= factor { mulop factor }
factor ::= identifier | number | '(' expr ')'
The four levels here give multiplication a higher precedence than
addition; factors allow the use of parentheses for grouping. Exprs that
are not also simple_exprs represent boolean comparisons, though we do
not attempt to do any boolean type-checking. The && operator is
considered to be a mulop, and || is an addop; a consequence of this is
that we need full parentheses around each boolean comparison when we
chain them with || and &&:
if ((x<10) || (y<20))
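To make the four grammar levels concrete, here is a small recursive-descent translator from this grammar to postfix, with one method per production. It is a sketch under assumptions: the class name ExprToPostfix, the regex tokenizer, and the "neg" marker for a leading minus are all invented for illustration, and the output is a postfix string rather than real machine code.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Translates the expression grammar above to postfix, one method per production.
// Hypothetical helper, not part of Compiler.java; "neg" marks a leading unary minus.
class ExprToPostfix {
    private static final Pattern TOKEN = Pattern.compile(
        "\\s*(\\d+|[A-Za-z_]\\w*|<=|>=|==|!=|&&|\\|\\||[-+*/<>()])");
    private static final List<String> RELOPS = Arrays.asList("<", "<=", ">", ">=", "==", "!=");
    private static final List<String> ADDOPS = Arrays.asList("+", "-", "||");
    private static final List<String> MULOPS = Arrays.asList("*", "/", "&&");

    private final List<String> tokens = new ArrayList<>();
    private final List<String> out = new ArrayList<>();
    private int pos = 0;

    private ExprToPostfix(String src) {
        Matcher m = TOKEN.matcher(src);
        int end = 0;
        // Each token must start exactly where the previous one ended.
        while (end < src.length() && m.find(end) && m.start() == end) {
            tokens.add(m.group(1));
            end = m.end();
        }
        if (!src.substring(end).trim().isEmpty()) {
            throw new IllegalArgumentException("bad character at index " + end);
        }
    }

    private String peek() { return pos < tokens.size() ? tokens.get(pos) : ""; }

    private String next() { return tokens.get(pos++); }

    private void accept(String expected) {
        if (!peek().equals(expected)) {
            throw new IllegalArgumentException("expected " + expected);
        }
        pos++;
    }

    // expr ::= simple_expr [ relop simple_expr ]
    private void expr() {
        simpleExpr();
        if (RELOPS.contains(peek())) { String op = next(); simpleExpr(); out.add(op); }
    }

    // simple_expr ::= ['+'|'-'] term { addop term }
    private void simpleExpr() {
        boolean negate = peek().equals("-");
        if (negate || peek().equals("+")) pos++;
        term();
        if (negate) out.add("neg");
        while (ADDOPS.contains(peek())) { String op = next(); term(); out.add(op); }
    }

    // term ::= factor { mulop factor }
    private void term() {
        factor();
        while (MULOPS.contains(peek())) { String op = next(); factor(); out.add(op); }
    }

    // factor ::= identifier | number | '(' expr ')'
    private void factor() {
        String t = peek();
        if (t.equals("(")) {
            accept("(");
            expr();
            accept(")");
        } else if (!t.isEmpty() && (Character.isLetterOrDigit(t.charAt(0)) || t.charAt(0) == '_')) {
            out.add(next());
        } else {
            throw new IllegalArgumentException("unexpected token '" + t + "'");
        }
    }

    static String toPostfix(String src) {
        ExprToPostfix p = new ExprToPostfix(src);
        p.expr();
        if (p.pos != p.tokens.size()) {
            throw new IllegalArgumentException("unexpected token '" + p.peek() + "'");
        }
        return String.join(" ", p.out);
    }

    public static void main(String[] args) {
        System.out.println(toPostfix("3*Y+1"));              // prints 3 Y * 1 +
        System.out.println(toPostfix("(x<10) || (y<20)"));   // prints x 10 < y 20 < ||
    }
}
```

Without the parentheses, toPostfix("x<10 || y<20") throws: because || is an addop, the parser reads 10 || y as the right-hand simple_expr of the first comparison and is then left with a second relop that the grammar does not allow.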
Using the compiler
Create a Compiler object from the file in question. Then you can run()
the program, or dump() it to machine code (possibly not too readable,
but important for me when debugging).
Execution begins with the function named main. Be sure you have one!
Scope
Not only do we have global variables and local variables, but there are
differing scopes for local variables; a new one begins with every '{'.
(This is true of Java, too.) In particular, you can do the following:
f() {
    final int N = 3;
    { int N; N = 5; print(N); }
    print(N);
}
This will print 5 and then 3.
All this means two things: first, whenever you encounter '{' (in
compileCompound()), you will have to create a new Map object, because
you don't want a new declaration of N overwriting the older
declaration. You will thus have a list of Map objects, and you will "search" the list in linear order, from most recent to oldest.
Second, you will need to keep a counter of how many variables are
allocated along with each per-scope Map object (theMap.size() might do
just fine; I haven't tried). Then, every time a variable is declared,
you'll need to emit that ALLOC N line, where N is the current number of
variables. At the end of a scope, you will need to emit ALLOC N where N
is now the number of variables as of the previous, just-returned-to, scope. Example:
{ int n;       // ALLOC 1
  int m;       // ALLOC 2
               // ALLOC operands are totals, not increments: you now have two variables
  {            // new scope
    int n;     // ALLOC 3
    int x;     // ALLOC 4
  }            // end of scope
               // ALLOC 2
}              // end of function body
At the end of the function body, it does not matter whether you do ALLOC 0 or not; this is taken care of automatically.
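The ALLOC bookkeeping can be sketched with nothing more than a counter and a stack of saved counts. In this hypothetical AllocTracker class, the emitted list stands in for calls to cs.emit(ALLOC, n); none of these names come from Compiler.java.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Tracks only HOW MANY locals are live, to show which ALLOC operands appear.
// The "emitted" list is a stand-in for cs.emit(ALLOC, n) in the real compiler.
class AllocTracker {
    private final Deque<Integer> savedCounts = new ArrayDeque<>();
    private int liveLocals = 0;
    final List<String> emitted = new ArrayList<>();

    // Called on '{': remember how many locals existed before this scope.
    void beginScope() {
        savedCounts.push(liveLocals);
    }

    // Called on each local declaration: the operand is the new TOTAL.
    void declareLocal() {
        liveLocals++;
        emitted.add("ALLOC " + liveLocals);
    }

    // Called on '}': shrink back to the count of the enclosing scope.
    void endScope() {
        liveLocals = savedCounts.pop();
        emitted.add("ALLOC " + liveLocals);
    }

    public static void main(String[] args) {
        AllocTracker t = new AllocTracker();
        t.beginScope();      // {
        t.declareLocal();    //   int n;
        t.declareLocal();    //   int m;
        t.beginScope();      //   {
        t.declareLocal();    //     int n;
        t.declareLocal();    //     int x;
        t.endScope();        //   }
        System.out.println(t.emitted);
        // prints [ALLOC 1, ALLOC 2, ALLOC 3, ALLOC 4, ALLOC 2]
    }
}
```

A real symbol table would store the per-scope maps as well; this sketch isolates the counting so that the sequence of operands is easy to check against the example.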
Hacks
I've introduced three pathetic hacks to get you started without a
symbol table: GHack(), for global variables, LHack(), for local
variables, and FHack(), for functions. Given a variable name ident,
GHack(ident) looks to see if the identifier is of the form "Gnn", for
two digits nn, and if so returns a unique location in the memory area.
LHack(ident) works if ident is of the form "Ln", one digit, and allows
the use of ten local variables. Note the cs.emit(Machine.ALLOC, 10) in
compileFunction to make room for these variables! Finally, FHack(ident)
works if ident is "f1" through "f4", and works for function names.
Once your symbol table is working, you shouldn't need any of these. In
particular, isLHack should be set to false, so memory is not allocated.
While we're at it, this might be a good time to point out the print hack: strings can only be used in print statements, and the actual string contents are stored in the Machine in a separate area from either the stack or the "memory".
Thursday
Demo of the code
hello.mjava
varG.mjava
varGL.mjava
fact.mjava
class SymbolTable
Linked-list implementation
- singly linked list of <identifier,data> pairs
- A stack of list pointers representing scope levels. On newScope(), push the current head. On endScope(), head = pop()
- allocate() searches the list only up to the end of the current
scope before deciding there is no duplicate declaration, and creating a
new entry (at the head)
- lookup() searches the entire list
- size() is likely to include globals too, which we don't want. So we need stackFrameSize() = size() - current_number_of_globals
Having a HashMap for the global scope, and possibly also for the first
local scope, would be a better implementation. Most variables are
defined in one of these two places, so less-efficient lookup
for any "inner scopes" is tolerable.
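Here is one way the linked-list design might look in Java. Treat it as a sketch: the per-entry data is reduced to a single int location, the Node class and the globalCount field are assumptions, and "global" is taken to mean "declared while no local scope is open".

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of the linked-list symbol table. Entry data is reduced to one int
// "location"; Node, scopeMarks and globalCount are illustrative names.
class LinkedSymbolTable {
    private static final class Node {
        final String ident;
        final int location;
        final Node next;

        Node(String ident, int location, Node next) {
            this.ident = ident;
            this.location = location;
            this.next = next;
        }
    }

    private Node head = null;   // most recent declaration
    // Saved heads, one per open local scope. An ArrayList serves as the stack
    // because a saved head may be null, which ArrayDeque does not allow.
    private final List<Node> scopeMarks = new ArrayList<>();
    private int globalCount = 0;

    void newScope() {
        scopeMarks.add(head);
    }

    // Discards every entry declared since the matching newScope().
    void endScope() {
        head = scopeMarks.remove(scopeMarks.size() - 1);
    }

    // Adds ident to the current scope; returns false if it is a duplicate there.
    boolean allocate(String ident, int location) {
        Node scopeEnd = scopeMarks.isEmpty() ? null : scopeMarks.get(scopeMarks.size() - 1);
        for (Node n = head; n != scopeEnd; n = n.next) {
            if (n.ident.equals(ident)) {
                return false;
            }
        }
        head = new Node(ident, location, head);
        if (scopeMarks.isEmpty()) {
            globalCount++;   // no open local scope, so this is a global
        }
        return true;
    }

    // Searches the entire list, newest first; null means "not declared".
    Integer lookup(String ident) {
        for (Node n = head; n != null; n = n.next) {
            if (n.ident.equals(ident)) {
                return n.location;
            }
        }
        return null;
    }

    int size() {
        int count = 0;
        for (Node n = head; n != null; n = n.next) {
            count++;
        }
        return count;
    }

    int stackFrameSize() {
        return size() - globalCount;
    }
}
```

Because new entries go at the head, lookup() finds the innermost declaration first, and endScope() drops a whole scope in one assignment.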
ArrayList of HashMap implementation
- ArrayList of HashMap<String, data>
- The global scope is at position 0 in the arraylist
- newScope() adds a new HashMap to the list; endScope() deletes one.
- allocate() searches only the last HashMap.
- lookup() searches each HashMap in turn, starting with the last one
- stackFrameSize() is the sum of the sizes of each map, except the one at position 0.
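The bullets above translate almost line for line into Java. The following is a minimal sketch, assuming the per-entry data is just an Integer location; the class name ScopedSymbolTable is hypothetical, and a real table would store a richer entry object.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch of the ArrayList-of-HashMap symbol table. The entry data is just an
// Integer location here; class and variable names are illustrative.
class ScopedSymbolTable {
    private final List<Map<String, Integer>> scopes = new ArrayList<>();

    ScopedSymbolTable() {
        scopes.add(new HashMap<>());   // position 0 is the global scope
    }

    void newScope() {
        scopes.add(new HashMap<>());
    }

    void endScope() {
        if (scopes.size() == 1) {
            throw new IllegalStateException("cannot close the global scope");
        }
        scopes.remove(scopes.size() - 1);
    }

    // Declares ident in the innermost scope; false means duplicate declaration.
    boolean allocate(String ident, int location) {
        Map<String, Integer> current = scopes.get(scopes.size() - 1);
        if (current.containsKey(ident)) {
            return false;
        }
        current.put(ident, location);
        return true;
    }

    // Searches from the innermost scope outward; null means "not declared".
    Integer lookup(String ident) {
        for (int i = scopes.size() - 1; i >= 0; i--) {
            Integer location = scopes.get(i).get(ident);
            if (location != null) {
                return location;
            }
        }
        return null;
    }

    // Number of live local variables: every map except the global one.
    int stackFrameSize() {
        int total = 0;
        for (int i = 1; i < scopes.size(); i++) {
            total += scopes.get(i).size();
        }
        return total;
    }

    public static void main(String[] args) {
        ScopedSymbolTable t = new ScopedSymbolTable();
        t.newScope();
        t.allocate("n", t.stackFrameSize());   // offset 0
        t.allocate("m", t.stackFrameSize());   // offset 1
        t.newScope();
        t.allocate("n", t.stackFrameSize());   // offset 2, shadows the outer n
        System.out.println(t.lookup("n"));     // prints 2
        t.endScope();
        System.out.println(t.lookup("n"));     // prints 0
    }
}
```

Using stackFrameSize() as the offset of the next local gives offsets that start at 0 and increase by 1, and the value after each declaration is the operand for ALLOC.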
More
- Booleans in while/if
- labels in while/if
- ??
Three kinds of declarations:
- variables: attributes are boolean isGlobal and int location (either a global location or a local "offset"). We might also include type, except everything in mini-java is int.
- constants: the only attribute is the numeric value
- procedures: the main attribute is again int location (address). Although we are unlikely to use it, another might be the number of parameters.
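One possible shape for the "data" half of a symbol-table entry is sketched below. The class, enum, and factory-method names are hypothetical; only the attributes (isGlobal, location, value, address) and the opcode names come from these notes.

```java
// One possible shape for a symbol-table entry covering the three kinds of
// declaration. All names here are hypothetical.
final class SymbolEntry {
    enum Kind { VARIABLE, CONSTANT, PROCEDURE }

    final Kind kind;
    final boolean isGlobal;   // meaningful for VARIABLE only
    final int value;          // location or offset, constant value, or code address

    private SymbolEntry(Kind kind, boolean isGlobal, int value) {
        this.kind = kind;
        this.isGlobal = isGlobal;
        this.value = value;
    }

    static SymbolEntry variable(boolean isGlobal, int location) {
        return new SymbolEntry(Kind.VARIABLE, isGlobal, location);
    }

    static SymbolEntry constant(int value) {
        return new SymbolEntry(Kind.CONSTANT, false, value);
    }

    static SymbolEntry procedure(int address) {
        return new SymbolEntry(Kind.PROCEDURE, false, address);
    }

    // Name of the opcode that would push this entry's value onto the stack:
    // LOAD for globals, LOADF for locals, LOADI for constants (the notes say
    // LOADH can follow for constants that need more than 16 bits).
    String loadOpcode() {
        switch (kind) {
            case VARIABLE:
                return isGlobal ? "LOAD" : "LOADF";
            case CONSTANT:
                return "LOADI";
            default:
                throw new IllegalStateException("a procedure is called, not loaded");
        }
    }

    public static void main(String[] args) {
        System.out.println(variable(true, 3).loadOpcode());    // prints LOAD
        System.out.println(variable(false, 0).loadOpcode());   // prints LOADF
        System.out.println(constant(42).loadOpcode());         // prints LOADI
    }
}
```

Keeping the kind in the entry lets the code that compiles an identifier choose the right opcode from a single lookup.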