Semantic Analysis

Role of Semantic Analysis

Why do we need semantic analysis?

Context-free grammars give us a formal way of describing the syntax of a language. They, however, do not tell us anything about the meaning of a program.

Semantic analysis analyzes a program to ensure it has a valid meaning. This might include, for example, checking that all variables are properly declared and initialized, all functions are called with the correct number and type of arguments, and so on.

Some language constructs are also not context-free. For example, type inference in a language like Haskell is context-sensitive.

In the COOL language, we will use semantic analysis to check:

  • that all variables are declared before they are used
  • types
  • inheritance relationships
  • that classes are not redefined
  • that methods are not redefined
  • that reserved identifiers are not used as variable names
  • and more…

To give an example, we want to make sure the inheritance graph is a tree, i.e., that there are no cycles.
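
For instance, the checker can walk the parent links of every class and report an error if it ever revisits a class. Below is a minimal sketch in Python (the `parent` map and the error handling are hypothetical, not the PA3 API):

def check_inheritance_is_tree(parent):
    """parent maps each class name to the name of its parent;
    the root class (Object) maps to None."""
    for cls in parent:
        seen = set()
        cur = cls
        while cur is not None:
            if cur in seen:
                raise Exception("inheritance cycle involving class " + cur)
            seen.add(cur)
            cur = parent.get(cur)

# A cycle: A inherits B and B inherits A.
# check_inheritance_is_tree({"Object": None, "A": "B", "B": "A"})  # raises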

Scope

In a program, the same variable name may be used in different places. We need to match each use of a variable with its declaration.

The scope of a variable is the region of the program where it is accessible.

Static vs. Dynamic Scopes

With static scoping, the scope of a variable is determined by the program text. With dynamic scoping, it is determined by the program's execution: a use of a variable refers to the most recently established binding that is still active.

Most languages we use today use static scoping, but some languages, like Emacs Lisp and LaTeX, do use dynamic scoping.
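
To make the difference concrete, here is a small Python example (Python itself is statically scoped; what dynamic scoping would do is noted in the comments):

x = "global"

def show():
    print(x)             # static scoping: this x is resolved from the program text, so it is the global one

def caller():
    x = "caller-local"   # under dynamic scoping, show() would see this binding instead
    show()

caller()   # prints "global"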

A note on the scoping rules in COOL

Not all identifiers follow the same scoping rules. For example, class names are always in the global scope, while method names are in the scope of the class they are defined in. This means that class definitions in COOL cannot be nested and are accessible from anywhere in the program. It also means that a class name can be used before it is defined:


class Foo {
    ... let y: Bar in ...
};

class Bar {
    ...
};

One result of this is that we have to do at least two passes over the AST: one to collect all class definitions, and another to check that all uses of classes are valid. In practice, we will do more passes, as we will see later.
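
A sketch of the two-pass idea in Python, assuming a hypothetical AST in which `program.classes` is a list of class nodes with a `name` attribute and a `referenced_class_names()` helper (this is not the PA3 API):

def collect_classes(program):
    classes = {}
    for cls in program.classes:                     # pass 1: record every class definition
        if cls.name in classes:
            raise Exception("class " + cls.name + " is redefined")
        classes[cls.name] = cls
    return classes

def check_class_uses(program, classes):
    for cls in program.classes:                     # pass 2: every use can now be resolved,
        for name in cls.referenced_class_names():   # even if it appears before the definition
            if name not in classes:
                raise Exception("class " + name + " is used but never defined")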

In PA3, feel free to do 10+ passes over the AST if you want to.

In COOL, attribute names are global within the class they are defined in. For example,


class Foo {
    f(): Int { a };
    a: Int <- 0;
};

is valid, because the a in the body of f refers to the attribute a in the class Foo. As with classes, attribute names can be used before they are defined.

Implementing the most-closely-nested rule

A common pattern in compilers is to traverse some data structures, such as an AST, and apply different actions depending on the type of node we are visiting. For example, we might want to do something different when we visit a Let node than when we visit a Method node. This is called dispatching and the pattern is called the visitor pattern.

At every node, we can do something before visiting the children, and something after visiting the children. For example, in the Let node, we can push a new scope onto the stack before visiting the children, and pop the scope after visiting the children.

This is a very common pattern in compilers, and we will use it to implement things like type checking.
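
Here is a sketch of the pattern in Python, assuming hypothetical node classes (Let, Method, ...) that expose a children() method, and a symbol table with enter_scope()/exit_scope() operations as described in the next section:

class ScopeVisitor:
    def __init__(self, symtab):
        self.symtab = symtab

    def visit(self, node):
        # dispatch on the node's class name, falling back to generic_visit
        method = getattr(self, "visit_" + type(node).__name__, self.generic_visit)
        return method(node)

    def generic_visit(self, node):
        for child in node.children():
            self.visit(child)

    def visit_Let(self, node):
        self.symtab.enter_scope()         # before the children: open a new scope
        self.symtab.add(node.var_name)    # bind the let variable in that scope
        self.generic_visit(node)          # visit the children
        self.symtab.exit_scope()          # after the children: close the scope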

Implementation: Symbol Tables

A symbol table is a data structure we use to keep track of information about symbols (e.g. variables, functions, methods, classes, etc.) in a program.

It may store information about the symbol such as:

  • its name
  • its type
  • its scope
  • its location in the source code
  • additional metadata associated with the symbol

A simple symbol table

A simple symbol table implements the following operations:

  • enter_scope(): Enters a new scope. This is usually called when we enter a new block, such as a Let node.
  • find(name): Finds the symbol with the given name, searching from the innermost scope outward (the most-closely-nested rule). It returns null if the symbol is not found in any scope.
  • add(name): Adds a new symbol with the given name to the current scope.
  • check(name): Checks whether a symbol with the given name is defined in the current (innermost) scope, which is useful for detecting duplicate definitions.
  • exit_scope(): Exits the current scope. This is usually called when we exit a block, such as a Let node.

This could be implemented as a stack of hash maps, where each hash map is a scope. The enter_scope() operation pushes a new hash map onto the stack, and the exit_scope() operation pops the top hash map off the stack.
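
A sketch of such a stack-of-hash-maps symbol table in Python (only to illustrate the idea; the project provides its own implementation):

class SymbolTable:
    def __init__(self):
        self.scopes = [{}]                 # the outermost (global) scope

    def enter_scope(self):
        self.scopes.append({})             # push a new, empty scope

    def exit_scope(self):
        self.scopes.pop()                  # discard the innermost scope

    def add(self, name, info=None):
        self.scopes[-1][name] = info       # bind the name in the current scope

    def check(self, name):
        return name in self.scopes[-1]     # defined in the current scope only?

    def find(self, name):
        # most-closely-nested rule: search from the innermost scope outward
        for scope in reversed(self.scopes):
            if name in scope:
                return scope[name]
        return None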

Note for the project: The symbol table implementation is given.

Type Checking and Inference

Type checking is the process of verifying that the types are consistent in a program. Type inference is the process of inferring the types of expressions in a program.

Type concepts in COOL

In COOL, there are only two kinds of types: class names and SELF_TYPE. Every class is a type, and every type other than SELF_TYPE is a class. The user declares a type for each identifier, and the compiler infers a type for every expression.

Notation for type rules — Logical rules of inference

So far, we have seen two formal notations for specifying parts of a compiler — regular expressions and context-free grammars. In this section, we will see a third notation, called logical rules of inference, which we use to specify type rules.

Inference rules have the following form: If hypothesis is true, then conclusion is true. For example, the following rule says that if x is an integer, then x + 1 is also an integer:


x: Int => x + 1: Int

If $E_1$ and $E_2$ have certain types, then $E_3$ has a certain type. For example, we can use


e1: Int, e2: Int => e1 + e2: Int

to denote that if e1 and e2 are integers, then e1 + e2 is also an integer.

Modern inference rules are usually written in a more compact form: the hypotheses are listed above a horizontal line and the conclusion below it, and the turnstile symbol (|-) is read "it is provable that". Rules take the form:

$$\begin{array}{c} \vdash \text{hypothesis} \quad \ldots \quad \vdash \text{hypothesis} \\ \hline \vdash \text{conclusion} \end{array}$$

For example, the previous rule can be written as:

$$\begin{array}{c} \vdash e_1: \text{Int} \quad \vdash e_2: \text{Int} \\ \hline \vdash e_1 + e_2: \text{Int} \end{array}$$

Soundness

A type system is sound if whenever $\vdash e: T$ is true, $e$ evaluates to a value of type $T$. We also want sound rules to be precise. For example, the rule $i \text{ is an integer literal} \vdash i: \text{Object}$ is sound, but not precise. We want to be able to infer the type of an integer literal to be Int, not Object.

COOL type rules

Here are some of the type rules for COOL:


|- false: Bool
|- true: Bool
i is an integer literal  =>  |- i: Int
s is a string literal    =>  |- s: String
|- e: Bool  =>  |- not e: Bool
|- e1: Bool, |- e2: T  =>  |- while e1 loop e2 pool: Object

Type environments

The problem

What is the type of a variable reference?


x is a variable |- x: ?

The local, structural rule does not carry enough information to give x a type.

A solution: Type environments

The idea is to put more information in the rules. A type environment gives types to free variables; it is a function from variable names to types. A variable is free in an expression if it is not bound (declared) within that expression.

For example, in the expression x + y, both x and y are free variables. In the expression let x: Int in x + y, x is bound, and y is free.

Let O be a function from ObjectIdentifier to Type. The sentence O |- e:T means that under the assumption that the free variables in e have the types given by O, (it is provable that) the expression e has type T.

This also means we have to modify the rules to include type environments. For example:


O |- e1: Int
O |- e2: Int
-----
O |- e1 + e2: Int
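
As code, a rule like this becomes one case of a recursive function that takes the environment as a parameter. Here is a Python sketch with hypothetical node classes (this is not the PA3 AST):

from dataclasses import dataclass

@dataclass
class IntLit:
    value: int

@dataclass
class Var:
    name: str

@dataclass
class Add:
    left: object
    right: object

def type_of(O, e):
    if isinstance(e, IntLit):
        return "Int"                       # i is an integer literal => |- i: Int
    if isinstance(e, Var):
        return O[e.name]                   # O |- x: O(x)
    if isinstance(e, Add):
        t1 = type_of(O, e.left)            # O |- e1: Int
        t2 = type_of(O, e.right)           # O |- e2: Int
        if t1 == "Int" and t2 == "Int":
            return "Int"                   # O |- e1 + e2: Int
        raise TypeError("operands of + must both be Int")

# type_of({"x": "Int"}, Add(Var("x"), IntLit(1))) returns "Int"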

Let expressions type rules

Let with no initialization


[Let-No-Init]
O[T0 / x] |- e1: T1
-----
O |- let x: T0 in e1: T1

O[T0 / x] is the type environment O with the type T0 associated with the variable x.

Another way to read O[T0 / x] is: the environment O, but with x now mapped to T0.

Let with initialization


[Let-Init]
O |- e0: T0
O[T0 / x] |- e1: T1
-----
O |- let x: T0 <- e0 in e1: T1

The difference between Let-No-Init and Let-Init is that in Let-Init, we have to check the type of the initialization expression e0.

Note that this rule is weak, because it does not allow the type of e0 to be some subtype of T0.

Subtyping

We can define a relation <= on types. The relation T1 <= T2 means that T1 is a subtype of T2. For example, Int <= Int and Int <= Object are both true.

This allows us to write a more precise rule for Let-Init:


[Let-Init]
O |- e0: T0
O[T / x] |- e1: T1
T0 <= T
-----
O |- let x: T <- e0 in e1: T1
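
Continuing the type_of sketch above, the refined [Let-Init] rule might look like this (the Let node shape and the `parent` map are hypothetical):

def is_subtype(parent, t1, t2):
    # t1 <= t2: walk up the inheritance chain from t1 looking for t2
    while t1 is not None:
        if t1 == t2:
            return True
        t1 = parent.get(t1)
    return False

def type_of_let(O, parent, node):
    t0 = type_of(O, node.init)                       # O |- e0: T0
    if not is_subtype(parent, t0, node.decl_type):   # T0 <= T
        raise TypeError("let initializer does not conform to the declared type")
    O2 = dict(O)
    O2[node.var_name] = node.decl_type               # O[T / x]
    return type_of(O2, node.body)                    # O[T / x] |- e1: T1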

Assignment expressions type rules


[Assign]
O(x) = T0
O |- e1: T1
T1 <= T0
-----
O |- x <- e1: T1

Initialized attributes

Let C be a class and let x: T be an attribute of C. Define O_C to be the type environment for class C, which gives the types of all of C's attributes; in particular, O_C(x) = T.

Attribute initialization is similar to assignment, except for the scope of names:


[Attr-Init]
O_C(x) = T0
O_C |- e1: T1
T1 <= T0
-----
O_C |- x <- e1: T1

Least upper bounds

Consider the expression if e0 then e1 else e2 fi. The type of the expression is the least upper bound of the types of e1 and e2. The least upper bound of T1 and T2 is the smallest type T such that T1 <= T and T2 <= T.

lub(X, Y) = Z if X <= Z and Y <= Z and for all W where X <= W and Y <= W, Z <= W.

In COOL, the least upper bound of two types is their least common ancestor in the inheritance tree.
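
A Python sketch of lub on ordinary class names, again using a hypothetical `parent` map from each class to its parent:

def ancestors(parent, t):
    chain = []
    while t is not None:
        chain.append(t)            # t, t's parent, ..., Object
        t = parent.get(t)
    return chain

def lub(parent, t1, t2):
    anc1 = set(ancestors(parent, t1))
    for a in ancestors(parent, t2):
        if a in anc1:
            return a               # the lowest ancestor of t2 that is also an ancestor of t1

# With parent = {"Object": None, "Int": "Object", "Bool": "Object"},
# lub(parent, "Int", "Bool") returns "Object".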

If-then-else expressions type rules

Now, we can write the type rule for if-then-else:


[If-Then-Else]
O |- e0: Bool
O |- e1: T1
O |- e2: T2
-----
O |- if e0 then e1 else e2 fi: lub(T1, T2)
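
In the type_of sketch, this rule reuses the lub helper above (the If node shape is hypothetical):

def type_of_if(O, parent, node):
    if type_of(O, node.cond) != "Bool":              # O |- e0: Bool
        raise TypeError("the condition of an if must be Bool")
    t1 = type_of(O, node.then_branch)                # O |- e1: T1
    t2 = type_of(O, node.else_branch)                # O |- e2: T2
    return lub(parent, t1, t2)                       # result type is lub(T1, T2)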

Case expressions type rules


[Case]
O |- e_0: T_0
O[T_1 / x_1] |- e_1: T_1'
...
O[T_n / x_n] |- e_n: T_n'
-----
O |- case e_0 of x_1: T_1 => e_1; ...; x_n: T_n => e_n; esac: lub(T_1', ..., T_n')

Method dispatch

There is a problem with the type rules for method dispatch. Consider the following expression:


e.f(e_1, ..., e_n)

The type of e is some class T, and T has a method f whose formal parameters have types T_1', ..., T_n' and whose return type is T_r. The arguments e_1, ..., e_n have types T_1, ..., T_n, which must conform to the formals, and the type of the whole expression is T_r.

The problem is that the environment O says nothing about the formal parameter types and the return type of the method f, so the rule needs more information.

To do this, we add a new concept: method environments. A method environment is a function from method names to method types. A method type is a pair of a list of formal parameter types and a return type.

In COOL, method and object identifiers live in different namespaces. This means that a method name can be the same as an object identifier.

In the type rules, we thus use a separate mapping M for method signatures.

M(C, f) = (T_1', ..., T_n', T_r) means that the class C has a method f with formal parameters of types T_1', ..., T_n' and a return type of T_r.
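
One plausible representation of M is a dictionary keyed by (class name, method name); the concrete entries below are only illustrative:

# Each signature is (tuple of formal parameter types, return type).
M = {
    ("Count", "incr"):    ((), "Count"),
    ("IO", "out_string"): (("String",), "SELF_TYPE"),
}

def lookup_method(M, parent, cls, name):
    # walk up the inheritance chain so that inherited methods are found too
    while cls is not None:
        if (cls, name) in M:
            return M[(cls, name)]
        cls = parent.get(cls)
    return None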

Method dispatch type rule


[Method-Dispatch]
O, M |- e: T
O, M |- e_1: T_1
...
O, M |- e_n: T_n
M(T, f) = (T_1', ..., T_n', T_r)
T_i <= T_i' for 1 <= i <= n
-----
O, M |- e.f(e_1, ..., e_n): T_r

Static dispatch

The difference here is that the method is looked up in an explicitly named class T rather than in the static type of e, so we use M(T, f), and we additionally require that e's type conforms to T. In COOL, static dispatch is written e@T.f(e_1, ..., e_n).


[Static-Dispatch]
O, M |- e: T_0
O, M |- e_1: T_1
...
O, M |- e_n: T_n
T_0 <= T
M(T, f) = (T_1', ..., T_n', T_r)
T_i <= T_i' for 1 <= i <= n
-----
O, M |- e@T.f(e_1, ..., e_n): T_r

Expressiveness of static type systems

Static type systems can detect many common errors, but a type checker that is not expressive enough may reject some correct programs. More expressive type systems, however, are also more complex to specify and implement.

Type checking with SELF_TYPE in COOL

A problem

Consider the following COOL program:


class Count {
    i: Int <- 0;
    incr(): Count {
        {
            i <- i + 1;
            self;
        }
    };
};

class Stock inherits Count {
    name: String;
};

Now, the following code is going to raise a type error:


class Main {
    s: Stock <- (new Stock).incr();
    ... s.name ...
};

because incr is declared to return a Count, not a Stock: the static type of (new Stock).incr() is Count, which does not conform to Stock. The type checker we have so far cannot see that the object incr returns is actually a Stock and not just a Count.

SELF_TYPE to the rescue

We can solve this problem by introducing a new type, SELF_TYPE, which stands for the type of the current object, i.e. the object on which the method was invoked. If incr is declared with return type SELF_TYPE, then for the call (new Stock).incr() SELF_TYPE is Stock.

The type checker can now prove (writing C for the class in which the expression appears):


O, M, C |- (new Count).incr(): Count
O, M, C |- (new Stock).incr(): Stock

Operations on SELF_TYPE

Recall that we have a few operations on types:

  • A <= B means that A is a subtype of B
  • lub(A, B) is the least upper bound of A and B

We need to extend these operations to SELF_TYPE. We can do this by defining the following rules:

Let A be any type other than SELF_TYPE. Then to extend <= to SELF_TYPE, we define:

  • SELF_TYPE_C <= SELF_TYPE_C for all classes C
  • SELF_TYPE_C <= A if C <= A
  • A <= SELF_TYPE_C is always false

To extend lub to SELF_TYPE, we define:

  • lub(SELF_TYPE_C, SELF_TYPE_C) = SELF_TYPE_C for all classes C
  • lub(SELF_TYPE_C, A) = lub(C, A) 1
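
Putting both extensions together, here is a sketch that builds on the earlier ancestors/lub helpers; SELF_TYPE is represented simply as the string "SELF_TYPE", and C is the class in which the check takes place:

def conforms(parent, C, t1, t2):
    """t1 <= t2, checked inside class C."""
    if t1 == "SELF_TYPE" and t2 == "SELF_TYPE":
        return True                           # SELF_TYPE_C <= SELF_TYPE_C
    if t1 == "SELF_TYPE":
        return conforms(parent, C, C, t2)     # SELF_TYPE_C <= A  iff  C <= A
    if t2 == "SELF_TYPE":
        return False                          # A <= SELF_TYPE_C never holds
    return t2 in ancestors(parent, t1)        # ordinary class names

def lub_self(parent, C, t1, t2):
    if t1 == "SELF_TYPE" and t2 == "SELF_TYPE":
        return "SELF_TYPE"                    # lub(SELF_TYPE_C, SELF_TYPE_C) = SELF_TYPE_C
    if t1 == "SELF_TYPE":
        return lub(parent, C, t2)             # lub(SELF_TYPE_C, A) = lub(C, A)
    if t2 == "SELF_TYPE":
        return lub(parent, t1, C)             # symmetric case
    return lub(parent, t1, t2)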

Error recovery

Footnotes

  1. This is the best we can do because SELF_TYPE_C <= C.