Elements of Programming Languages Dordal, rev 8/98

Concepts: expressions, types, variables, evaluation, store, environment, scope, side effects, statements,

What is a programming language? At the most abstract level, it is anything that lets you define a computation; that is, anything that lets you define a mathematical function that takes an input and generates an output from it. However, we would like to be more specific. Here's a sample program:

	main() {
	   int x, sum, i=1;
	   cin >> x;
	   while (i<=x) {
	      sum += i*i;
	      i += 1;
	   }
	   cout << sum;
	}

This program reads in the value x and then calculates 1*1 + 2*2 +...+ x*x. The program has some variables, namely x, sum, and i. These variables are of type int. The body of the program is a sequence of statements; these statements involve the evaluation of various expressions which are then assigned to variables or passed as parameters to procedures (eg to cout.operator<<()).

The abstract definition of a type is a set of values (eg the set of all values of 32-bit words) plus a set of operations on those values (eg integer +, -, *, ==, etc). For example, in C++, an abstract type is commonly viewed as a class. The values of the type are the set of legal combinations of member values (note that not all member value combinations may be legal), and the operations are just the member (and friend) functions of the class. The idea of abstract types, however, applies to concrete types as well. For instance, C arrays are values that support the [], etc operations. Ints are, say, 32-bit values that support not only + - * / but also the bitwise operations & | ^ << >>. In Pascal, on the other hand, the bit operations do not apply to integers; there is a separate "set of ..." type which permits them (and only them). In general, higher-level languages provide more different kinds of types, and more specific types (that is, each type has fewer operations) than lower-level languages. Even assembly language has types, although the types tend to be things like "machine word" and have very large sets of legal operations.

In short, all programming languages provide one or more basic types, and usually some way of defining new types. The basic operations provided by the language determine the exact nature of these basic types. Each variable, and each expression, can have a type assigned to it. A type system is a set of rules for associating each expression with a type; it consists of rules (generally derived from declarations) for the type of each variable, plus rules for the result type of operators, given the types of the operands.

Expressions are anything that has a value. An expression can be a variable, a constant value, or the result of an operation/ function applied to subexpressions. Expressions can be represented by trees, with the leaf nodes being variables and constants, and the internal nodes representing operations being applied to the child nodes. For example, a+3*b+5*d can be represented by the tree

         +
/ \
+ *
/ \ / \
a * 5 d
/ \
3 b

assuming the usual precedence and grouping rules for + and *. It is important to realize that expressions always have a unique tree structure (assuming the programming language grammar is unambiguous!) even if the precedence and grouping rules are implicit rather than explicit. a+3*b+5*d is, in effect, (a+(3*b))+(5*d), or, with operators written in functional style, +(+(a, *(3,b)), *(5,d)), or in Lisp, (+ (+ a (* 3 b)) (* 5 d)).

Even with the tree there is considerable scope for variation in how subexpressions are to be evaluated, particularly when the operators are user-defined functions and the subexpressions are thus function parameters. Here are a few common parameter-passing conventions: call-by-value, call-by-result, call-by-value-result, call-by-reference, call-by-constantref erence, call-by-name. There is also the distinction between full evaluation (evaluate all operands at the beginning) and short-circuit or lazy evaluation (evaluate operands only if needed, particulaly relevant for the Boolean operators and and or). We will return to these later. For the moment, note that the folowing loops would fail

while (i<=max) and (A[i] <> x) do i:= i+1;

while (p!=0 && p->data != x) p = p->next;

if the Boolean evaluation were not short-circuited, and the second clause were always evaluated even when the first is false. Evaluation of the second clause here when the first is false typically causes a run-time error.

Named values may be variables or constants. Variables represent locations in storage; the set of all such locations constitutes the store. One can think of the store as the collection of all variables, although it is more precise to view it as the collection of all variable locations. The environment is a mapping between (all) program names and the things they represent. Conceptually, if a variable represents a location, then given a variable X one would first look in the environment to find the current location LX for X, and then in the store to find the current contents of LX. If a name represents a value that is not a variable -- that is, it represents a constant value -- then that value would be found (conceptually) by direct lookup in the environment. In actual implementations the division into environment and store is rarely explicit, although many environment accesses are in effect done at compile time, and store accesses correspond most clearly to memory fetches.

Given an instance of a name, say X, programming languages usually have scope rules for deciding to which prior declaration of X the instance refers. For example, if X has been declared both globally and locally, the "most-closely-nested" rule to which most languages adhere implies that the local declaration of X is the one that is applicable:

const int X = 3;

foo (float X) {float y; y = X; /* this X refers to the local declaration, as a float */}

bar () {float y; y = X; /* this X refers to the global X, as const int 3 */ }

If the scope of a declaration can be inferred at compile time the language is said to have static scope. Scope rules are, however, sometimes dynamic; for example, an instance of a variable might be bound to the most recent declaration encountered at run-time. With dynamic scope, if the function bar() is called from the top level then its X refers to the global X; however, if bar() were called by foo() then the most recent binding of X would have been the one as float.

Expressions are said to have side effects if they modify the store. The term "side effect" suggests that the primary purpose of the expression is to represent a value; this is sometimes true only in a theoretical sense. For example, assignment statements in C++ are expressions, but certainly the store-modification side effect is the programmer's primary purpose in making the assignment! More traditional examples of expressions with a side effect in C++ occur in the following:

while ((c=cin.get()) != EOF) a[i++] = c;

This contains three subexpressions with side effects. First is the i++ which increments i before returning its value. A second is the assignment (c=cin.get()) within the Boolean control expression. A third is the call to cin.get() itself, which has as side effect a change in the state of the input file cin. Subexpressions with side effects are sometimes Considered Dangerous, although they certainly have their uses. A classic pair of examples are the following:

if (x=0) y = 0; else y = 1/x;

if (p=0) cout < '\n'; else printbuf(p); /* or printf("%s", p), if you prefer */

In both cases the intended Boolean condition was an equality test: x==0 and p==0 respectively. Assignment was used by mistake, but it cannot be caught by the compiler because assignments are legitimate expressions. In both cases the "else" clause will be executed (because the value of the Boolean test expression is 0 (because the value assigned is 0), which represents "false" in C++), but the variable will have just been assigned the value 0 by the if "condition", with presumably disasterous results.

A statement, or command, is then any designated program fragment that (potentially) modifies the store. In Pascal, statements are either assignments of the form variable := expression, or are procedure calls P(expr1, ..., exprN). In C++, any expression becomes a statement when it is terminated by a semicolon. The only expressions which it is sensible to use as commands, though, are assignments and procedure (function) calls as the others lack side effects; executing the command "2+2;" in C++ would have absolutely no effect.

C++ and ANSI C have the void type. An expression of type void (typically a function call) cannot be used as a part of any larger expression, because to do so would violate the type rules. Thus, such an expression can only be used as a statement. In this sense, then, C++ and ANSI C now have true procedures: "function" calls that do not return values. In most other languages, the distinction between procedures and functions is made via syntax, not via the type system.

A traditional program is built up of sequences of statements, connected together via control structures like while and if. The statements modify the store in careful sequence. In the program at the beginning of this document, for example, the store consists of x, i, sum. The body of the loop repeatedly modifies i and sum until a particular condition (i>x) is reached. To understand such a program, the programmer must keep track of all the modifications to the store, and understand how they interact.

The Von Neumann Architecture

The standard machine architecture, on which all the programming languages we will discuss must be implemented, involves sequential execution of machine instructions. Memory is organized as a sequentially numbered array of bytes (RAM); all higher-level data structures must be "mapped" onto this. The store, in particular, is an abstraction of this idea of a series of memory locations. A classic example of a data structure that does not map very conveniently onto this model is the dynamically-sized array; arrays need fixed allocations of memory. In order to implement variable-sized arrays, then, the implementor must be willing to move arrays when they are resized.

Many languages have been designed with a straightforward implementation in mind. In Pascal and C, for example, statements may involve several machine instructions but the idea of sequential execution is preserved almost unchanged. One of our topics will be the idea that, while perhaps all data structures can be mapped onto the RAM memory model, the sequential execution of statements is not the only programming paradigm.

Functional Programming

The essential feature of what is called "functional programming" is summarized as follows [Sethi]:

The value of an expression depends only on the values of its subexpressions, if any

Or, more simply, "side effects considered harmful". This includes, to at least some degree, assignment (although some functional languages allow some forms of assignment). Without assignment, one does not have variables (although one can have constants and parameters). Variables, and hence the store, are a very basic part of programming. It sometimes seems impossible to do without them. However, they are not in fact essential. Without a store, though, one cannot have loops: loops require that each time through the store must be changed in some way, to represent progress through the loop. (We will see in ML an interesting way around this loop obstacle.) Without a store, the sequence of execution also cannot matter: if expressions A and B do not have any side effects (which must be the case without a store), then executing B before A must be the same as executing A;B.

Without a store, one must program entirely with expressions. Subexpressions replace the notion of statements in sequence; the top call will be to evaluate a single expression but evaluating that expression will entail orderly progress through various subexpressions. Recursion would be used in lieu of loops. Here is a recursive version of the loop above:

	int squaresum (int x) {
	   if (x==0) return 0;
	   else return x*x + squaresum(x-1);
        }

Note that recursion entails naming a program fragment that would otherwise have been "anonymous". To finish off the conversion, suppose the function cin.readint() reads an integer value and returns it. Then the program could be written as:

cout << squaresum(cin.readint());

The variable x in squaresum warrants mention. It is still a variable, so the store still is there in theory. However, we are not using it as such: we never assign to it. In practice, x is a constant, with a value bound to it at function entry. Each level of the recursion, however, binds x to a different value (sequentially decreasing until we reach x==0). In any one frame of reference, x has a constant value.

Note that the main program now consists of a single expression. To evaluate the cout, we must first evaluate the squaresum(). To evaluate the squaresum, we must first evaluate the read(). In this sense, then (and this sense only), the notion of sequential execution still exists.