Role of Semantic Analysis
Why do we need semantic analysis?
Context-free grammars give us a formal way of describing the syntax of a language. They, however, do not tell us anything about the meaning of a program.
Semantic analysis analyzes a program to ensure it has a valid meaning. This might include, for example, checking that all variables are properly declared and initialized, all functions are called with the correct number and type of arguments, and so on.
Some language constructs are also not context-free. For example, type inference in a language like Haskell is context-sensitive.
In the COOL
language, we will use semantic analysis to check:
- that all variables are declared before they are used
- types
- inheritance relationships
- that classes are not redefined
- that methods are not redefined
- that reserved identifiers are not used as variable names
- and more…
To give an example, we want to make sure the inheritance graph is a tree, i.e., that there are no cycles.
Scope
In a program, the same variable name may be used in different places. We need to match each use of a variable with its declaration.
The scope of a variable is the region of the program where it is accessible.
Static vs. Dynamic Scopes
In a static scope, the scope of a variable is determined by the program text. In a dynamic scope, the scope of a variable is determined by the program execution.
Most languages we use today use static scoping, but some languages, like Emacs Lisp and LaTeX, do use dynamic scoping.
A note on the scoping rules in COOL
Not all identifiers follow the same scoping rules. For example, class names are always in the global scope, while method names are in the scope of the class they are defined in. This means that class definitions in COOL
cannot be nested and are accessible from anywhere in the program. It also means that a class name can be used before it is defined:
Class Foo { ... let y: Bar in ...}Class Bar { ...}
One result of this is that we have to do at least two passes over the AST; one to collect all class definitions, and another to check that all uses of classes are valid. In practice, we will do more passes, as we will see later.
In PA3, feel free to do 10+ passes over the AST if you want to.
In COOL, attribute names are global within the class they are defined in. For example,
Class Foo { f(): Int { a }; a: Int <- 0;}
is valid, because the a
in the body of f
refers to the attribute a
in the class Foo
. As with classes, attribute names can be used before they are defined.
Implementing the most-closely-nested rule
A common pattern in compilers is to traverse some data structures, such as an AST, and apply different actions depending on the type of node we are visiting. For example, we might want to do something different when we visit a Let
node than when we visit a Method
node. This is called dispatching and the pattern is called the visitor pattern.
At every node, we can do something before visiting the children, and something after visiting the children. For example, in the Let
node, we can push a new scope onto the stack before visiting the children, and pop the scope after visiting the children.
This is a very common pattern in compilers, and we will use it to implement things like type checking.
Implementation: Symbol Tables
A symbol table is a data structure we use to keep track of information about symbols (e.g. variables, functions, methods, classes, etc.) in a program.
It may store information about the symbol such as:
- its name
- its type
- its scope
- its location in the source code
- additional metadata associate with the symbol
A simple symbol table
A simple symbol table implements the following operations:
enter_scope()
: Enters a new scope. This is usually called when we enter a new block, such as aLet
node.find(name)
: Finds the symbol with the given name in the current scope. It returnsnull
if the symbol is not found.add(name)
: Adds a new symbol with the given name to the current scope.check(name)
: Checks if the symbol with the given name exists in the current scope.exit_scope()
: Exits the current scope. This is usually called when we exit a block, such as aLet
node.
This could be implemented as a stack of hash maps, where each hash map is a scope. The enter_scope()
operation pushes a new hash map onto the stack, and the exit_scope()
operation pops the top hash map off the stack.
Note for the project: The symbol table implementation is given.
Type Checking and Inference
Type checking is the process of verifying that the types are consistent in a program. Type inference is the process of inferring the types of expressions in a program.
Type concepts in COOL
In COOL, there are only two types: class names and SELF_TYPE
. Types are instantiated as classes. i.e. all types are classes, and all classes are types. The user declares the type for each identifier, and the compiler infers a type for every expression.
Notation for type rules — Logical rules of inference
So far, we have seen two formal notations for specifying parts of a compiler — regular expressions and context-free grammars. In this section, we will see a third notation, called logical rules of inference, which we use to specify type rules.
Inference rules have the following form: If hypothesis is true, then conclusion is true. For example, the following rule says that if x
is an integer, then x + 1
is also an integer:
x: Int => x + 1: Int
If and have certain types, then has a certain type. For example, we can use
e1: Int, e2: Int => e1 + e2: Int
to denote that if e1
and e2
are integers, then e1 + e2
is also an integer.
Modern inference rules are usually written in a more compact form, using a turnstile symbol (|-
) to separate the hypotheses from the conclusion. They take the form:
For example, the previous rule can be written as:
Soundness
A type system is sound if whenever is true, evaluates to a value of type . We also want sound rules to be precise. For example, the rule is sound, but not precise. We want to be able to infer the type of an integer literal to be Int
, not Object
.
COOL type rules
Here are some of the type rules for COOL:
|- false: Bool|- true: Bool|- i is an integer literal |- i: Int|- s is a string literal |- s: String|- e: Bool |- !e: Bool|- e1: Bool |- e2: T |- while e1 loop e2 pool: Object
Type environments
The problem
What is the type of a variable reference?
x is a variable |- x: ?
The local, structural rule does not carry enough information to give x
a type.
A solution: Type environments
The idea is to put more information in the rules. A type environment gives types to free variables, and is a function from variable names to types. A variable is free in an expression if it is not defined in the expression.
For example, in the expression x + y
, both x
and y
are free variables. In the expression let x: Int in x + y
, x
is bound, and y
is free.
Let O
be a function from ObjectIdentifier
to Type
. The sentence O |- e:T
means that under the assumption that the free variables in e
have the types given by O
, (it is provable that) the expression e
has type T
.
This also means we have to modify the rules to include type environments. For example:
O |- e1: Int O |- e2: Int-----O |- e1 + e2: Int
Let expressions type rules
Let with no initialization
[Let-No-Init]O[T0 / x] |- e1: T1-----O |- let x: T0 in e1: T1
O[T0 / x]
is the type environment O
with the type T0
associated with the variable x
.
Another way to read O[T0 / x]
is: in O
, T0
substitutes x
.
Let with initialization
[Let-Init]O |- e0: T0O[T0 / x] |- e1: T1-----O |- let x: T0 <- e0 in e1: T1
The difference between Let-No-Init
and Let-Init
is that in Let-Init
, we have to check the type of the initialization expression e0
.
Note that this rule is weak, because it does not allow the type of e0
to be some subtype of T0
.
Subtyping
We can define a relation <=
on types. The relation T1 <= T2
means that T1
is a subtype of T2
. For example, Int <= Int
and Int <= Object
are both true.
This allows us to write a more precise rule for Let-Init
:
[Let-Init]O |- e0: T0O[T / x] |- e1: T1T0 <= T-----O |- let x: T <- e0 in e1: T1
Assignment expressions type rules
[Assign]O(x) = T0O |- e1: T1T1 <= T0-----O |- x <- e1: T1
Initialized attributes
Let C
be a class, x: T
be an attribute of C
, and function O_C(x) = T
.
Attribute initialization is similar to assignment, except for the scope of names:
[Attr-Init]O_C(x) = T0O_C |- e1: T1T1 <= T0-----O_C |- x <- e1: T1
Least upper bounds
Consider the expression if e0 then e1 else e2 fi
. The type of the expression is the least upper bound of the types of e1
and e2
. The least upper bound of T1
and T2
is the smallest type T
such that T1 <= T
and T2 <= T
.
lub(X, Y) = Z
if X <= Z
and Y <= Z
and for all W
where X <= W
and Y <= W
, Z <= W
.
In COOL, the least upper bound of two types is their least common ancestor in the inheritance tree.
If-then-else expressions type rules
Now, we can write the type rule for if-then-else
:
[If-Then-Else]O |- e0: BoolO |- e1: T1O |- e2: T2-----O |- if e0 then e1 else e2 fi: lub(T1, T2)
Case expressions type rules
[Case]O |- e_0: T_0O[T_0 / x_0] |- e_1: T_1...O[T_0 / x_n] |- e_n: T_n-----O |- case e_0 of x_0: T_0 -> e_1 | ... | x_n: T_n -> e_n; esac: lub(T_1, ..., T_n)
Method dispatch
There is a problem with the type rules for method dispatch. Consider the following expression:
e.f(e_1, ..., e_n)
The type of e
is T
, and T
has a method f
with type T_0 -> ... -> T_n -> T_r
. The types of e_1
, …, e_n
are T_1
, …, T_n
. The type of the expression is T_r
.
We need more information about the formal parameters and the return type of the method f
.
To do this, we add a new concept: method environments. A method environment is a function from method names to method types. A method type is a pair of a list of formal parameter types and a return type.
In COOL, method and object identifiers live in different namespaces. This means that a method name can be the same as an object identifier.
In the type rules, we thus use a separate mapping M
for method signatures.
M(C, f) = (T_0, ..., T_n, T_r)
means that the class C
has a method f
with formal parameters of type T_0
, …, T_n
, and a return type of T_r
.
Method dispatch type rule
[Method-Dispatch]O, M |- e: TO, M |- e_1: T_1...O, M |- e_n: T_nM(T, f) = (T_0', ..., T_n', T_r)T_i <= T_i' for all i-----O, M |- e.f(e_1, ..., e_n): T_r
Static dispatch
The only difference here is that we use M(C, f)
instead of M(T, f)
and we call T.f
instead of e.f
.
[Static-Dispatch]O, M |- e: TO, M |- e_1: T_1...O, M |- e_n: T_nM(T, f) = (T_0', ..., T_n', T_r)T_i <= T_i' for all i-----O, M |- T.f(e_1, ..., e_n): T_r
Expressiveness of static type systems
Static type systems can detect common errors, but some correct programs may be rejected by a type checker that is not expressive enough. But more expressive type systems are more complex.
Type checking with SELF_TYPE in COOL
A problem
Consider the following COOL program:
class Count { i: Int <- 0; incr(): Count { { i <- i + 1; self; } };};class Stock inherits Count { name: String;};
Now, the following code is going to raise a type error:
class Main { Stock s <- (new Stock).inc(); ... s.name ...};
because inc
returns a Count
, not a Stock
. The compiler we have so far cannot see that s
is actually a Stock
and not just a Count
.
SELF_TYPE to the rescue
We can solve this problem by introducing a new type SELF_TYPE
. SELF_TYPE
is a special type that represents the type of the current object. In the example above, SELF_TYPE
would be Stock
.
The type checker can now prove:
M, C |- (new Count).inc(): CountM, C |- (new Stock).inc(): Stock
Operations on SELF_TYPE
Recall that we have a few operations on types:
A <= B
means thatA
is a subtype ofB
lub(A, B)
is the least upper bound ofA
andB
We need to extend these operations to SELF_TYPE
. We can do this by defining the following rules:
Let A
and B
be any types but SELF_TYPE
. Then to extend <=
to SELF_TYPE
, we define:
SELF_TYPE_C <= SELF_TYPE_C
for all classesC
SELF_TYPE_C <= A
ifC <= A
A <= SELF_TYPE_C
is always false
To extend lub
to SELF_TYPE
, we define:
lub(SELF_TYPE_C, SELF_TYPE_C) = SELF_TYPE_C
for all classesC
lub(SELF_TYPE_C, A) = lub(C, A)
1
Error recovery
Footnotes
-
This is the best we can do because
SELF_TYPE_C <= C
. ↩