Parsing

Top-down Parsing

Bottom-up Parsing

Bottom-up parsers work by gradually reducing the input program to the start symbol SS.

LR(0) Parsing

Handles

A handle is a string that can be reduced (though a sequence of steps) to the start symbol.

The intuition behind choosing between shift and reduce is that we only want to reduce if the result can be reduced to the start symbol eventually.

For every rightmost derivation SαXωαβωS \rightarrow^* \alpha X \omega \rightarrow \alpha \beta \omega, XβX\rightarrow \beta in the position after α\alpha is a handle of αβω\alpha \beta \omega.

Recognizing handles

There are no known efficient algorithms for recognizing handles. We can only use heuristics to guess whether a stack is a handle.

For some CFGs, the heuristics always guess correctly. For example, if the CFG is LL(1)1, then the heuristics always guess correctly.

Viable Prefixes

α\alpha is a viable prefix if there is an ω\omega such that αω\alpha \mid \omega is a legal state of a shift-reduce parser.

Items

An item is a production with a .\texttt{.} on the right-hand side, denoting a focus point.

The focus point tells us where we are in the production.

For example, the items for T \to (E) are:

  • T \to .(E)
  • T \to (.E)
  • T \to (E.)
  • T \to (E).
Remarks
  • The only item for XϵX \to \epsilon is X.X \rightarrow \texttt{.}.
  • Items are also known as LR(0) items.
Example

Consider the input (int).

  • (E|) is a state in the shift-reduce parse.
  • (E is a prefix of the RHS of T \to (E)
  • Item T \to (E.) says: so far, we’ve seen (E, and we’re hoping to see ).

Generalization

The stack may have many prefixes of RHS’s of productions:

Prefix1 Prefix2  Prefixn\textnormal{Prefix}_1 \ \textnormal{Prefix}_2 \ \ldots \ \textnormal{Prefix}_n

Since Prefixi\textnormal{Prefix}_i is a prefix of the RHS of XiαiX_i \to \alpha_i, we know that Prefixi\textnormal{Prefix}_i will eventually be reduced to XiX_i. This means that the missing part of Prefixi1\textnormal{Prefix}_{i-1} of the RHS of Xi1αi1X_{i-1} \to \alpha_{i-1} starts with XiX_i (i.e. β\exists \beta such that Xi1Prefixi1XiβX_{i-1} \to \textnormal{Prefix}_{i-1}X_i\beta).

Recursively, we know that Prefixk+1  Prefixn\textnormal{Prefix}_{k+1}\ \ldots\ \textnormal{Prefix}_{n} eventually reduces to the missing part of αk\alpha_k.

Consider the input (int * int).

Recall that our grammar is:


E -> T + E | T
T -> int * T | int | (E)

  • (int * | int) is a state in the shift-reduce parse.
  • From top of the stack:
    • int * is a prefix of the RHS of T \to int * T
    • ϵ\epsilon is a prefix of the RHS of E \to T
    • ( is a prefix of the RHS of T \to (E)

Our stack of items:

  • T \to int * .T says that we’ve seen int *, and we’re hoping to see T.
  • E \to .T says that we’ve seen ϵ\epsilon, and we’re hoping to see T.
  • T \to (.E) says that we’ve seen (, and we’re hoping to see E).

Recognizing Viable Prefixes

To recognize viable prefixes, we must recognize a sequence of partial RHS’s of productions, each of which can eventually reduce to part of the missing suffix of its predecessor.

An NFA for Recognizing Viable Prefixes
  1. Add a new start production SSS' \to S.
  2. The NFA states are the items of GG (including the new start production).
  3. For item Eα.XβE \to \alpha \texttt{.} X \beta, add the transition2
    Eα.XβXEαX.βE \to \alpha \texttt{.} X \beta \longrightarrow^X E \to \alpha X \texttt{.}\beta
  4. For item Eα.XβE \to \alpha \texttt{.} X \beta and production XγX \to \gamma, add the transition3
    Eα.XβϵX.γE \to \alpha \texttt{.} X \beta \longrightarrow^{\epsilon} X \to \texttt{.} \gamma
  5. Every state is an accepting state.
  6. Start state is S.SS' \to \texttt{.} S.
Example

Recall our running example grammar:


E -> T + E | T
T -> int * T | int | (E)

  1. Create S.ES' \to \texttt{.} E.
  2. Add transition S.EESE. S' \to \texttt{.} E \longrightarrow^E S' \to E \texttt{.}
  3. Add transitions 4
    S.EϵE.TS' \to \texttt{.} E \longrightarrow^{\epsilon} E \to \texttt{.} T
    S.EϵE.T+ES' \to \texttt{.} E \longrightarrow^{\epsilon} E \to \texttt{.} T + E
  4. For the state E.TE \to \texttt{.} T, add transitions
    TT.int * T\longrightarrow^T T \to \texttt{.int * } T
    TT.int\longrightarrow^T T \to \texttt{.int}
    TT.(E)\longrightarrow^T T \to \texttt{.}(E)
  5. Repeat the process until we have all the transitions to get the NFA.

Once we have the NFA, we can convert it to a DFA as we have seen before by applying the epsilon closures and the subset construction.

Remarks:
  • We call the states of the DFA “canonical collections of LR(0) items”.
  • The dragon book also shows another way of constructing the LR(0) items.

Valid Items

We say that an item Xβ.γX \to \beta \texttt{.} \gamma is a valid item for a viable prefix αβ\alpha \beta if

SαXωαβγωS' \to^* \alpha X \omega \to \alpha \beta \gamma \omega

by a right-most derivation.

The γ\gamma’s of the valid items tell us what we hope to see next from the input.

An item is often valid for many prefixes. For example, T(.E)T \to (\texttt{.}E) is valid for prefixes with any number of starting parentheses.

LR(0) - When to Shift / Reduce?

Let’s say:

  • We are at a state αt\alpha \mid t in the shift-reduce parse, i.e. the stack contains α\alpha and the next input starts with tt.
  • DFA on input α\alpha leads us to the state ss.

The idea is to:

  • Reduce by XβX \to \beta if ss contains Xβ.X \to \beta \texttt{.}.
  • Shift if ss contains Xβ.tωX \to \beta \texttt{.} t \omega (i.e. ss has a transition on tt).

LR(0) Conflicts

  • Shift-reduce conflict: ss contains Xβ.tωX \to \beta \texttt{.} t \omega and XβX \to \beta.
  • Reduce-reduce conflict: ss contains XβX \to \beta and YγY \to \gamma.

SLR Parsing

SLR stands for “Simple Left-to-Right”. It improves on LR(0) with one change (shown below in bold):

  • Reduce by XβX \to \beta if ss contains Xβ.X \to \beta \texttt{.} and tFollow(X){\bf t \in \textbf{Follow}(X)}
  • Shift if ss contains Xβ.tωX \to \beta \texttt{.} t \omega (i.e. ss has a transition on tt).

This ensures we don’t do the reduction if we know we’d get after the reduction, which is XtXt, cannot be further reduced.

If there are still conflicts under these rules, we say the grammar is not SLR.

Remark
  • Even though the SLR algorithm does a sort of “one-step lookahead”, we still call it an LR(0) algorithm because the DFA states still only contain LR(0) items.
  • Many grammars are not SLR.
  • In practice, we can use precedence declarations and rewrite our grammar into a form that does not have conflicts (i.e. is SLR).

SLR Parsing Algorithm

  1. Let MM be the DFA for viable prefixes of GG.
  2. Let x1xn\mid x_1 \ldots x_n be the initial configuration.
  3. Repeat until configuration is SS \mid:
    • Let αw\alpha \mid w be the current configuration.
    • Run MM on current stack α\alpha.
    • If MM rejects α\alpha, report parsing error.
      • (Because this means α\alpha is not a viable prefix.)
    • Else let tt be the next input symbol.
      • Use SLR heuristics to decide whether to shift or reduce.

Note: We can improve the performance by caching the results of the DFA runs.

Action Table

The SLR parsing algorithm uses an action table to decide whether to shift or reduce.

The action table is a 2D table with rows for states and columns for input symbols.

For each DFA state ss and input symbol tt, we have an entry in the table action[s, t] that is either:

  • shift s' (where ss' is the state ss leads to on input tt)
  • reduce by XαX \to \alpha
  • accept (if ss is the accepting state and tt is the end-of-input marker)
  • error otherwise

LR(1) Parsing

The LR(1) parsing algorithm is similar to the SLR parsing algorithm, but its DFA states contain LR(1) items instead of LR(0) items.

The idea is to build lookahead into the items. An LR(1) item is a pair of an LR(0) item and a lookahead symbol.

For example, [T -> . int * T, $] means: after seeing T -> int * T, reduce by T -> int * T if the next input is $.

The LR(1) algorithm is not covered in detail in this class.

Remarks

Bottom-up parsers are more efficient than top-down parsers, but top-down parsers are easier to implement. This is why the latter are becoming more common, especially for big languages like C++. Tools like Bison and Flex that generate bottom-up parsers are still useful for small languages, like the one we are using in this class.

Footnotes

  1. A CFG is LL(1) if for every nonterminal XX and every terminal aa, there is at most one production XαX \rightarrow \alpha such that aFIRST(α)a \in \operatorname{FIRST}(\alpha).

  2. The transition takes care of the situation where we have seen α\alpha and we’re waiting to see XX.

  3. This takes care of the same situation where we are hoping to see XX, but it’s also fine to get parts of a production that will eventually reduce to XX.

  4. These two transitions take care of the situation from the start state where it’s fine to not directly get a EE, but get things that can reduce to EE.