Comp 271 Week 13

Mini-Java Compiler

Your final project (there will be no more labs per se, though I'll assign the project in two phases) is to add some features to a simple compiler for a language I'll call "mini-java". The programs compile to code for a virtual stack-based machine. The first part of your assignment is to get the symbol table working.

The language grammar is defined using Extended Backus-Naur Form, a relatively standard formal grammar specification. The declarations of the grammar are called productions. The "extended" means that optional parts can be indicated by enclosing in [ ]; zero-or-more repetition is indicated by enclosing in {}. For example, if the "language" is a string of 0 or more b's, optionally preceded by an a, the grammar might be:
    lang ::= [ 'a' ] { 'b' }
We can also write this, without [ ] and { }, as
    lang ::= Apart Bpart
    Apart ::= 'a' | empty
    Bpart ::= 'b' Bpart | empty
Our parser is the part of the program that checks whether the input, read as a stream of tokens, follows the language rules. Intermingled in the compiler with the parsing statements are code-generation and semantic statements.

We will use a parsing technique called recursive descent with one-symbol lookahead, in which we write a set of parsing procedures, one for each EBNF production. A crucial feature is that we will always be able to tell which production to follow using only the next token, represented in the Compiler.java class by the String variable theToken. (Not all grammars can be parsed this way.)

An example of the power of recursive-descent parsing is the following grammar:

start ::= [ 'a' A ] { 'b' B } 'c'
A ::= 'a' 'a' | 'b' | 'c' 'd'
B ::= 'a' | 'b'

The corresponding pseudocode follows, where theToken contains the current token (one of the letters 'a', 'b', 'c', 'd'), and accept(x) means "if theToken==x, then theToken = the next token; otherwise halt". Note that at each point we decide what to do based only on the next letter. If production A had instead been A ::= 'a' 'a' | 'a' 'b', this would not be possible (as written), because both alternatives begin with 'a'.

parseA() {
    if (theToken == 'a') {accept('a'); accept('a');}
    else if (theToken == 'b') {accept('b');}
    else if (theToken == 'c') {accept('c'); accept('d');}
    else error();
}

parseB() {
    if (theToken == 'a') {accept('a');}
    else if (theToken == 'b') {accept('b');}
    else error();
}

parseStart() {
    if (theToken=='a') {
       accept('a');
       parseA();
    }
    while (theToken == 'b') {
       accept('b');
       parseB();
    }
    accept('c');
}

Note how [ ] corresponds to the if statement, and { } to the while. Also note that the symbol-checking part of accept() is unnecessary if we've just checked the token, but it plays an important role in the calls not preceded by such a check: the second accept('a') and the accept('d') in parseA(), and the final accept('c') in parseStart().
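To try the pseudocode out, here is a minimal runnable Java sketch of the same three procedures. The class name LetterParser, the matches() helper, and the '$' end-of-input marker are assumptions made for illustration and are not part of Compiler.java; tokens here are single characters taken from a String.

```java
// Sketch of a recursive-descent recognizer for
//   start ::= [ 'a' A ] { 'b' B } 'c'
//   A ::= 'a' 'a' | 'b' | 'c' 'd'
//   B ::= 'a' | 'b'
// Hypothetical helper class; '$' stands for end of input.
class LetterParser {
    private final String input;
    private int pos = 0;
    private char theToken;

    LetterParser(String input) {
        this.input = input;
        advance();
    }

    // Move to the next token, or '$' once the input is used up.
    private void advance() {
        theToken = pos < input.length() ? input.charAt(pos++) : '$';
    }

    // "if theToken==x, then theToken = the next token; otherwise halt"
    private void accept(char expected) {
        if (theToken != expected) {
            throw new IllegalStateException(
                "expected '" + expected + "' but saw '" + theToken + "'");
        }
        advance();
    }

    private void parseA() {
        if (theToken == 'a') { accept('a'); accept('a'); }
        else if (theToken == 'b') { accept('b'); }
        else if (theToken == 'c') { accept('c'); accept('d'); }
        else throw new IllegalStateException("unexpected '" + theToken + "' in A");
    }

    private void parseB() {
        if (theToken == 'a') { accept('a'); }
        else if (theToken == 'b') { accept('b'); }
        else throw new IllegalStateException("unexpected '" + theToken + "' in B");
    }

    private void parseStart() {
        if (theToken == 'a') {          // [ 'a' A ]
            accept('a');
            parseA();
        }
        while (theToken == 'b') {       // { 'b' B }
            accept('b');
            parseB();
        }
        accept('c');
    }

    // True if the whole string is a sentence of the grammar.
    static boolean matches(String s) {
        try {
            LetterParser p = new LetterParser(s);
            p.parseStart();
            p.accept('$');              // nothing may follow the final 'c'
            return true;
        } catch (IllegalStateException e) {
            return false;
        }
    }
}
```

With this sketch, matches("abbac") and matches("acdc") succeed, while matches("aac") fails inside parseA(): the second accept('a') sees 'c', which is exactly the kind of call where accept()'s check matters.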

In the Compiler.java file, good examples can be found in:

Machine code

Our machine code is stack-based; we can load variables and constants onto the stack, do stack-based (postfix!) arithmetic, and then store the results. (We're not supporting arrays.) In processing the statement X = 3*Y+1, we would generate the following (simplified) code:
    LOAD    3
    LOAD    Y
    MUL
    LOAD    1
    ADD
    STOR    X
In between statements, the stack does not contain any operands, but it still holds stack frames containing local variables, and also information about pending function calls. Some opcodes (ADD, etc) have no operands; others have a 16-bit operand. Addresses can thus be 16 bits, covering a range of 64K "words".
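To see why the postfix order works, here is a toy evaluator for the simplified listing above. It is only a sketch and is not the course's Machine class: ToyStackMachine and run() are hypothetical names, instructions are plain strings, and a numeric operand to LOAD is treated as a constant (as in the simplified listing).

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;

// Hypothetical toy interpreter for the simplified listing; NOT the real Machine.
class ToyStackMachine {
    // Runs the program against a copy of vars and returns the resulting variables.
    static Map<String, Integer> run(String[] program, Map<String, Integer> vars) {
        Map<String, Integer> memory = new HashMap<>(vars);
        Deque<Integer> stack = new ArrayDeque<>();
        for (String line : program) {
            String[] parts = line.trim().split("\\s+");
            switch (parts[0]) {
                case "LOAD": {
                    String arg = parts[1];
                    boolean isNumber = arg.matches("-?\\d+");
                    // Constants are pushed directly; names are looked up (0 if unset).
                    stack.push(isNumber ? Integer.parseInt(arg)
                                        : memory.getOrDefault(arg, 0));
                    break;
                }
                case "MUL": {
                    int b = stack.pop(), a = stack.pop();
                    stack.push(a * b);
                    break;
                }
                case "ADD": {
                    int b = stack.pop(), a = stack.pop();
                    stack.push(a + b);
                    break;
                }
                case "STOR":
                    memory.put(parts[1], stack.pop());
                    break;
                default:
                    throw new IllegalArgumentException("unknown opcode " + parts[0]);
            }
        }
        return memory;
    }

    public static void main(String[] args) {
        Map<String, Integer> vars = new HashMap<>();
        vars.put("Y", 4);
        String[] program = {"LOAD 3", "LOAD Y", "MUL", "LOAD 1", "ADD", "STOR X"};
        System.out.println(run(program, vars).get("X")); // 3*4+1
    }
}
```

Note that the operand stack is empty again once STOR has run, which matches the remark that the stack holds no operands between statements.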

Code is defined in the Machine class, and generated relative to a CodeStream instance using the emit() call.

Variables

There are two kinds of variables: global and local. Local variables are allocated on stack frames, and globals go in a separate area of memory. The role of the symbol table is to figure out exactly where. The LOAD opcode, which takes as operand an integer address in the "memory" area of a Machine object, is for global variables. The LOADI (LOAD Immediate) opcode is for loading 16-bit numeric constants, with LOADH (LOAD High) as an optional follow-up for loading the upper half of a 32-bit constant. See compileFactor() for an example of the use of emitLOADINT().
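The arithmetic behind splitting a 32-bit constant into two 16-bit operands can be sketched as follows. This is only an illustration of the idea: ConstantHalves and its methods are hypothetical names, and the sketch assumes the low half is zero-extended and the high half is shifted up by 16 bits; the actual emitLOADINT() may handle the halves differently.

```java
// Hypothetical helper showing how a 32-bit int splits into two 16-bit operands.
// Assumption: low half is zero-extended, high half fills the upper 16 bits.
class ConstantHalves {
    static int low16(int value)  { return value & 0xFFFF; }
    static int high16(int value) { return (value >>> 16) & 0xFFFF; }

    // Rebuilds the original constant from its two halves.
    static int combine(int high, int low) { return (high << 16) | low; }

    // Under the assumption above, the high half is only needed when it is nonzero.
    static boolean needsHighHalf(int value) { return high16(value) != 0; }

    public static void main(String[] args) {
        int value = 70000;                       // does not fit in 16 bits
        System.out.println(low16(value));        // low-half operand
        System.out.println(high16(value));       // high-half operand
        System.out.println(combine(high16(value), low16(value)) == value);
    }
}
```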

For local variables, which must live on the stack (in the stack frame) so that local variables of different activations of the same recursive function don't collide, we need a different approach. We use the LOADF opcode, where the argument is the "offset" of the variable in the stack frame, and the actual location is stack[FP+arg], where FP is the Frame Pointer. Each procedure has one stack frame (regardless of scope, below), and the FP always points to the stack frame of the currently executing procedure.
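The following sketch illustrates why frame-relative addressing keeps recursive activations apart. It is not the Machine's actual calling convention: FrameDemo, loadF(), storeF() and the way fp is advanced are assumptions for illustration only. Each activation of fact() keeps its own n at stack[fp + 0].

```java
// Hypothetical illustration of stack[FP + offset] addressing; not the real Machine.
class FrameDemo {
    static final int[] stack = new int[64];
    static int fp = 0;                        // frame pointer: base of the current frame

    static int loadF(int offset)              { return stack[fp + offset]; }
    static void storeF(int offset, int value) { stack[fp + offset] = value; }

    // Recursive factorial whose single local variable (offset 0) holds n.
    static int fact(int n, int frameSize) {
        storeF(0, n);
        if (loadF(0) <= 1) return 1;
        int savedFp = fp;
        fp += frameSize;                      // the callee's frame sits above the caller's
        int sub = fact(n - 1, frameSize);
        fp = savedFp;                         // back to the caller's frame
        return loadF(0) * sub;                // the caller's n was not overwritten
    }

    public static void main(String[] args) {
        System.out.println(fact(5, 1));
    }
}
```

If both activations had stored n at the same fixed address, as a global would, the inner call would have overwritten the outer call's value before the multiplication.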

Global variables are given locations in the memory area, beginning with address 0, and incrementing by 1. Local variables are given offsets in the local area, beginning with 0 and incrementing by 1. Whenever you create a new local variable, you need to generate cs.emit(ALLOC, n) where n is the total number of local variables. Note that when variables go out of scope, you need to do another cs.emit(ALLOC, n) with the now-smaller value of n.

The compileDeclaration() method has a parameter, isGlobal, that tells you (while the compiler is running) whether the declaration you are processing is global or local. You need that to properly construct the symbol-table entry, and to decide whether to emit an ALLOC instruction.

We can also define "constants" using final int. Finally, function definitions will also go into the symbol table.

Expressions

I've used a common, if slightly squirrelly, syntax for expressions:
expr ::= simple_expr [ relop simple_expr ]
simple_expr ::= ['+'|'-'] term { addop term }
term ::= factor { mulop factor }
factor ::= identifier | number | '(' expr ')'
The four levels here give multiplication a higher precedence than addition; factors allow the use of parentheses for grouping. Exprs that are not also simple_exprs represent boolean comparisons, though we do not attempt any boolean type-checking. The && operator is considered to be a mulop, and || an addop; a consequence is that we need full parentheses around each boolean comparison when we chain them with || and &&:
    if ((x<10) || (y<20))
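The four productions translate directly into four parsing methods. The sketch below is not the code in Compiler.java: ExprSketch, its regex tokenizer, and the "$end" marker are assumptions, and apart from LOAD, LOADI, ADD and MUL the opcode names (SUB, DIV, AND, OR, NEG, CMP followed by the operator) are hypothetical placeholders. All identifiers are treated as globals.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch: parse an expression and emit postfix stack code as strings.
class ExprSketch {
    private final List<String> tokens = new ArrayList<>();
    private final List<String> code = new ArrayList<>();
    private int pos = 0;

    private ExprSketch(String source) {
        // Crude tokenizer: characters it does not recognize are silently skipped.
        Matcher m = Pattern.compile(
            "\\d+|[A-Za-z_]\\w*|&&|\\|\\||[<>=!]=|[-+*/()<>]").matcher(source);
        while (m.find()) tokens.add(m.group());
        tokens.add("$end");
    }

    private String theToken() { return tokens.get(pos); }

    private void accept(String expected) {
        if (!theToken().equals(expected))
            throw new IllegalStateException("expected " + expected + ", saw " + theToken());
        pos++;
    }

    private static String opcode(String op) {
        switch (op) {
            case "+":  return "ADD";
            case "*":  return "MUL";
            case "-":  return "SUB";   // hypothetical name
            case "/":  return "DIV";   // hypothetical name
            case "&&": return "AND";   // hypothetical name
            case "||": return "OR";    // hypothetical name
            default:   return "CMP" + op;  // hypothetical, e.g. CMP<
        }
    }

    private boolean isAddop() { return theToken().matches("[-+]|\\|\\|"); }
    private boolean isMulop() { return theToken().matches("[*/]|&&"); }
    private boolean isRelop() { return theToken().matches("[<>]=?|[=!]="); }

    // expr ::= simple_expr [ relop simple_expr ]
    private void expr() {
        simpleExpr();
        if (isRelop()) {
            String op = theToken(); pos++;
            simpleExpr();
            code.add(opcode(op));
        }
    }

    // simple_expr ::= ['+'|'-'] term { addop term }
    private void simpleExpr() {
        boolean negate = theToken().equals("-");
        if (negate || theToken().equals("+")) pos++;
        term();
        if (negate) code.add("NEG");   // hypothetical name
        while (isAddop()) {
            String op = theToken(); pos++;
            term();
            code.add(opcode(op));      // operator comes after both operands
        }
    }

    // term ::= factor { mulop factor }
    private void term() {
        factor();
        while (isMulop()) {
            String op = theToken(); pos++;
            factor();
            code.add(opcode(op));
        }
    }

    // factor ::= identifier | number | '(' expr ')'
    private void factor() {
        String t = theToken();
        if (t.equals("(")) { accept("("); expr(); accept(")"); }
        else if (t.matches("\\d+")) { code.add("LOADI " + t); pos++; }
        else if (t.matches("[A-Za-z_]\\w*")) { code.add("LOAD " + t); pos++; }
        else throw new IllegalStateException("unexpected token " + t);
    }

    static List<String> compile(String source) {
        ExprSketch p = new ExprSketch(source);
        p.expr();
        if (!p.theToken().equals("$end"))
            throw new IllegalStateException("trailing input at " + p.theToken());
        return p.code;
    }
}
```

In this sketch, compile("3*Y+1") yields LOADI 3, LOAD Y, MUL, LOADI 1, ADD. Compiling x<10 || y<20 without the parentheses fails: after the first relop, the || is absorbed into the right-hand simple_expr as an addop (10 || y), and the second < is left over as trailing input.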

Using the compiler

Create a Compiler object from the file in question. Then you can run() the program, or dump() it to machine code (possibly not too readable, but important for me when debugging).

Execution begins with the function named main. Be sure you have one!

Scope

Not only do we have global variables and local variables, but there are differing scopes for local variables; a new one begins with every '{'. (This is true of Java, too.) In particular, you can do the following:
    f() {
       final int N = 3;
       { int N; N=5; print(N);}
       print(N);
    }
This will print 5 and then 3.
All this means two things: first, whenever you encounter '{' (in compileCompound()), you will have to create a new Map object, because you don't want a new declaration of N overwriting the older declaration. You will thus have a list of Map objects, and you will "search" the list in linear order, from most recent to oldest.

Second, you will need to keep a counter of how many variables are allocated along with each per-scope Map object (theMap.size() might do just fine; I haven't tried). Then, every time a variable is declared, you'll need to emit that ALLOC N line, where N is the current number of variables. At the end of a scope, you will need to emit ALLOC N where N is now the number of variables as of the previous, just-returned-to, scope. Example:

{ int n;      // ALLOC 1
  int m;      // ALLOC 2       the operand is the total, not an increment: you have two variables!
    { // new scope
       int n;  // ALLOC 3
       int x;  // ALLOC 4
    } // end of scope
    // ALLOC 2

At the end of the function body, it does not matter whether you do ALLOC 0 or not; this is taken care of automatically.
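One possible shape for the per-scope bookkeeping is sketched below. It is an assumption-laden illustration, not the required design: ScopeSketch and its method names are hypothetical, only local int variables are handled (no constants or functions), and the ALLOC instructions are recorded as strings rather than sent through cs.emit().

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of nested local scopes: one Map per '{', searched innermost first.
class ScopeSketch {
    private final Deque<Map<String, Integer>> scopes = new ArrayDeque<>();
    private final List<String> emitted = new ArrayList<>(); // stands in for cs.emit(ALLOC, n)
    private int localCount = 0;                              // locals currently allocated

    // Called on '{'.
    void enterScope() { scopes.push(new HashMap<>()); }

    // Called on a local declaration: the next free offset is the current count.
    void declare(String name) {
        scopes.peek().put(name, localCount);
        localCount++;
        emitted.add("ALLOC " + localCount);
    }

    // Called on '}': drop this scope's variables and shrink the allocation.
    void exitScope() {
        localCount -= scopes.pop().size();
        emitted.add("ALLOC " + localCount);
    }

    // Returns the frame offset of name, or null if it is not a visible local.
    Integer lookup(String name) {
        for (Map<String, Integer> scope : scopes) {  // most recently pushed scope first
            Integer offset = scope.get(name);
            if (offset != null) return offset;
        }
        return null;
    }

    List<String> emitted() { return emitted; }
}
```

Replaying the example above (enterScope, declare n, declare m, enterScope, declare n, declare x, exitScope) records ALLOC 1, ALLOC 2, ALLOC 3, ALLOC 4, ALLOC 2; inside the inner scope n resolves to offset 2, and afterwards to offset 0 again. Here the per-scope count comes from the Map's size(), which works in this sketch because each declaration adds exactly one entry.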

Hacks

I've introduced three pathetic hacks to get you started without a symbol table: GHack(), for global variables; LHack(), for local variables; and FHack(), for functions. Given a variable name ident, GHack(ident) looks to see if the identifier is of the form "Gnn", for two digits nn, and if so returns a unique location in the memory area. LHack(ident) works if ident is of the form "Ln", for one digit n, and allows the use of ten local variables. Note the cs.emit(Machine.ALLOC, 10) in compileFunction to make room for these variables! Finally, FHack(ident) works for function names, if ident is "f1" through "f4".

Once your symbol table is working, you shouldn't need any of these. In particular, isLHack should be set to false, so memory is not allocated.

While we're at it, this might be a good time to point out the print hack: strings can only be used in print statements, and the actual string contents are stored in the Machine in a separate area from either the stack or the "memory".

Thursday

Demo of the code

hello.mjava
varG.mjava
varGL.mjava
fact.mjava

class SymbolTable

linked-list implementation:

Having a HashMap for the global scope, and possibly also for the first local scope, would be a better implementation. Since most variables are defined in one of these two places, less-efficient lookup for any "inner scopes" is acceptable.

ArrayList of HashMap implementation


More
Three kinds of declarations:
variables: attributes are boolean isGlobal, and int location (either a global location or a local "offset"). We might also include type, except everything in mini-java is int.
constants: the only attribute is numeric value
procedures: the main attribute is again int location (address). Although we are unlikely to use it, another might be the number of parameters.
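
One way to represent these three kinds of entries is sketched below. The names EntrySketch, Kind, Entry and loadInstruction() are hypothetical, and this is just one possible layout, not the required one. The helper shows how the entry kind would pick between LOAD, LOADF and LOADI when an identifier appears in an expression (it ignores constants too large for a single LOADI).

```java
// Hypothetical symbol-table entry covering variables, constants and procedures.
class EntrySketch {
    enum Kind { VARIABLE, CONSTANT, PROCEDURE }

    static final class Entry {
        final Kind kind;
        final boolean isGlobal;  // meaningful for variables only
        final int value;         // location/offset, constant value, or code address

        private Entry(Kind kind, boolean isGlobal, int value) {
            this.kind = kind;
            this.isGlobal = isGlobal;
            this.value = value;
        }

        static Entry variable(boolean isGlobal, int location) {
            return new Entry(Kind.VARIABLE, isGlobal, location);
        }
        static Entry constant(int value) {
            return new Entry(Kind.CONSTANT, false, value);
        }
        static Entry procedure(int address) {
            return new Entry(Kind.PROCEDURE, false, address);
        }
    }

    // Chooses the load instruction for an identifier used as a factor.
    static String loadInstruction(Entry e) {
        switch (e.kind) {
            case CONSTANT:
                return "LOADI " + e.value;                       // value is known at compile time
            case VARIABLE:
                return (e.isGlobal ? "LOAD " : "LOADF ") + e.value;
            default:
                throw new IllegalArgumentException("a procedure is not a loadable value");
        }
    }
}
```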