Learning Compilers -- Intro, Lexical Analysis and Parsing


This is a summary of the Stanford Compilers course on Coursera.


Intro

The basic steps of compilation are:

  • lexical analysis
  • parsing
  • semantic analysis
  • optimization
  • code generation
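
To make the phases concrete, here is a toy end-to-end pipeline for arithmetic expressions such as “1+2*3”. This is purely an illustrative sketch: the grammar, token set, and stack-machine output are invented for this example, and the semantic-analysis phase is omitted because the toy language has no names or types to check.

    import re

    def lex(src):                         # lexical analysis: string -> tokens
        return re.findall(r"\d+|[+*]", src)

    def parse(tokens):                    # parsing: tokens -> parse tree
        def expr(i):                      # expr -> term ('+' term)*
            node, i = term(i)
            while i < len(tokens) and tokens[i] == "+":
                rhs, i = term(i + 1)
                node = ("+", node, rhs)
            return node, i
        def term(i):                      # term -> INT ('*' INT)*
            node, i = ("int", int(tokens[i])), i + 1
            while i < len(tokens) and tokens[i] == "*":
                node = ("*", node, ("int", int(tokens[i + 1])))
                i += 2
            return node, i
        return expr(0)[0]

    def optimize(tree):                   # optimization: fold constant subtrees
        if tree[0] == "int":
            return tree
        op, a, b = tree[0], optimize(tree[1]), optimize(tree[2])
        if a[0] == b[0] == "int":
            return ("int", a[1] + b[1] if op == "+" else a[1] * b[1])
        return (op, a, b)

    def generate(tree):                   # code generation: emit stack-machine code
        if tree[0] == "int":
            return [f"push {tree[1]}"]
        return generate(tree[1]) + generate(tree[2]) + [tree[0]]

    print(generate(parse(lex("1+2*3"))))            # ['push 1', 'push 2', 'push 3', '*', '+']
    print(generate(optimize(parse(lex("1+2*3")))))  # ['push 7']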

Lexical Analysis

The implementation of lexical analysis does two things:

First, partition the input string into lexemes.

Second, identify the token class of each lexeme.

For example, “if x1 then 42” partitions into the lexemes “if”, “x1”, “then”, and “42”, whose token classes are keyword, identifier, keyword, and integer. Note that the input is scanned left to right, and lookahead is sometimes required to decide where a lexeme ends.

  1. Token Classes (in COOL)

    • whitespace: a non-empty sequence of blanks, newlines, and tabs
    • keyword: “if” or “else” or “then” or …
    • identifier: strings of letters or digits, starting with a letter
    • integer: a non-empty string of digits
  2. Regular Language

    The regular expressions over an alphabet Σ are the smallest set of expressions including:

    • ε (epsilon: matches the empty string)
    • ‘c’ (single character)
    • A + B (union)
    • AB (concatenation)
    • A* (iteration: zero or more repetitions of A)
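
    As a sketch, the five constructs can be written down directly as a small AST; the class names below are assumptions for illustration, not anything from the course.

      from dataclasses import dataclass
      from functools import reduce

      class Regex: pass

      @dataclass
      class Epsilon(Regex): pass      # ε: matches the empty string

      @dataclass
      class Char(Regex):              # ‘c’: matches one single character
          c: str

      @dataclass
      class Union(Regex):             # A + B: a match of A or a match of B
          a: Regex
          b: Regex

      @dataclass
      class Concat(Regex):            # AB: a match of A followed by a match of B
          a: Regex
          b: Regex

      @dataclass
      class Star(Regex):              # A*: zero or more matches of A
          a: Regex

      # Example: digit = ‘0’ + ‘1’ + … + ‘9’, and integer = digit digit*
      digit = reduce(Union, [Char(c) for c in "0123456789"])
      integer = Concat(digit, Star(digit))   # a non-empty string of digits
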
  3. Formal Language

    Let Σ be a set of characters (an alphabet). A language over Σ is a set of strings of characters drawn from Σ. The meaning function is written L(e) = M, where e is a regular expression (the syntax) and M is a set of strings (the semantics). One lesson here is that the meaning function is many-to-one: several different regular-expression syntaxes can denote the same semantic meaning, which is the formal language. For example, ‘a’ + ‘b’ and ‘b’ + ‘a’ are distinct expressions denoting the same language, as the sketch below checks.
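
    To see this concretely, here is a sketch of the meaning function over the AST classes from the previous sketch. Languages are infinite in general, so this assumed helper only enumerates the strings of length at most n:

      # Reuses Epsilon/Char/Union/Concat/Star from the sketch above.
      def L(e, n=3):
          if isinstance(e, Epsilon):
              return {""}
          if isinstance(e, Char):
              return {e.c} if n >= 1 else set()
          if isinstance(e, Union):
              return L(e.a, n) | L(e.b, n)
          if isinstance(e, Concat):
              return {u + v for u in L(e.a, n)
                            for v in L(e.b, n) if len(u + v) <= n}
          if isinstance(e, Star):
              out, frontier = {""}, {""}
              while frontier:       # grow by one more repetition of e.a
                  frontier = {u + v for u in frontier
                              for v in L(e.a, n) if len(u + v) <= n} - out
                  out |= frontier
              return out

      # Two different syntaxes, one semantics:
      print(L(Union(Char("a"), Char("b"))) == L(Union(Char("b"), Char("a"))))  # True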

  4. Lexical Specification

    • Procedure:
      1. Write a regular expression for the lexemes of each token class
      2. Construct R, the union of those expressions, which matches all lexemes for all token classes (R = R1 + R2 + …)
      3. Let the input be x1…xn. For 1 ≤ i ≤ n, check whether x1…xi ∈ L(R)
      4. If so, then we know x1…xi ∈ L(Rj) for some j, which names the token class
      5. Remove x1…xi from the input and go to (3)
      6. If no prefix x1…xi is in L(R), return an ERROR token for input not in the lexical specification
    • Notice (the sketch after this list implements all three rules):
      • A regular expression should match as long a string as possible (maximal munch).
      • When several token classes match, priority is determined by the order in which the expressions are listed.
      • The error match should always have the lowest priority.
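
    A sketch of this procedure using Python’s re module is below. The token classes and keyword set are assumptions for illustration, not the full COOL lexer. Each alternative is greedy, so it takes the longest match for its own class; keywords are reclassified after matching so that they outrank identifiers; and the catch-all ERROR pattern comes last, giving it the lowest priority.

      import re

      TOKEN_SPEC = [
          ("WHITESPACE", r"[ \t\n]+"),             # non-empty blanks/newlines/tabs
          ("INTEGER",    r"[0-9]+"),               # non-empty string of digits
          ("IDENTIFIER", r"[A-Za-z][A-Za-z0-9]*"), # letters/digits, starts with a letter
          ("ERROR",      r"."),                    # lowest priority: any other character
      ]
      KEYWORDS = {"if", "then", "else"}            # reclassified after matching

      MASTER = re.compile("|".join(f"(?P<{name}>{rx})" for name, rx in TOKEN_SPEC))

      def tokenize(text):
          # finditer resumes where the previous match ended, which removes the
          # matched lexeme from the input, as in step (5) of the procedure.
          for m in MASTER.finditer(text):
              kind, lexeme = m.lastgroup, m.group()
              if kind == "IDENTIFIER" and lexeme in KEYWORDS:
                  kind = "KEYWORD"                 # keywords outrank identifiers
              if kind != "WHITESPACE":
                  yield (kind, lexeme)

      print(list(tokenize("if x1 then 42 else y")))
      # [('KEYWORD', 'if'), ('IDENTIFIER', 'x1'), ('KEYWORD', 'then'),
      #  ('INTEGER', '42'), ('KEYWORD', 'else'), ('IDENTIFIER', 'y')]
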
  5. Finite Automata

    • A finite automaton consists of
      • An input alphabet Σ
      • A set of states S
      • A start state n
      • A set of accepting states F ⊆ S
      • A set of transitions s1 (on input a) -> s2
    • Deterministic Finite Automata (DFA)
      • One transition per input per state
      • No ε-moves
      • Faster to execute
    • Nondeterministic Finite Automata (NFA)
      • Can have multiple transitions for one input in a given state (so the same input may lead to different outcomes)
      • Can have ε-moves
      • In general, smaller than the equivalent DFA
  6. Implementing Finite Automata

    • Procedure: Lexical Specification -> Regular Expressions -> NFA -> DFA -> Table-driven Implementation of DFA

    • NFA -> DFA
      • Epsilon closure: starting from a state, the set of all states reachable by following only ε-moves.
      • The subset construction builds the DFA from these closures: for each input, follow every possible NFA transition and take the ε-closure of the resulting states; each such set of NFA states becomes one DFA state (see the first sketch below).
    • A DFA can be implemented by a 2D table T; see the second sketch below. (An NFA can also be driven by a table, but the table would be huge, since each entry holds a set of states.)
      • One dimension is states
      • The other dimension is input symbols
      • For every transition Si (on input a) -> Sk, define T[i,a] = k
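
    A sketch of the NFA -> DFA step, under an assumed NFA encoding (a dict mapping state -> symbol -> set of successor states, with "" standing for ε):

      from collections import deque

      def epsilon_closure(nfa, states):
          # All states reachable from `states` by following only ε-moves.
          stack, closure = list(states), set(states)
          while stack:
              s = stack.pop()
              for t in nfa.get(s, {}).get("", set()):
                  if t not in closure:
                      closure.add(t)
                      stack.append(t)
          return frozenset(closure)

      def nfa_to_dfa(nfa, start, alphabet):
          # Each DFA state is the ε-closure of a set of NFA states.
          start_state = epsilon_closure(nfa, {start})
          dfa, work = {}, deque([start_state])
          while work:
              S = work.popleft()
              if S in dfa:
                  continue
              dfa[S] = {}
              for a in alphabet:
                  moved = set()
                  for s in S:               # follow every possible a-transition
                      moved |= nfa.get(s, {}).get(a, set())
                  T = epsilon_closure(nfa, moved)
                  dfa[S][a] = T
                  if T not in dfa:
                      work.append(T)
          return start_state, dfa

      # Example: an NFA for a*b, where state 1 is the accepting state.
      nfa = {0: {"a": {0}, "b": {1}}}
      start, table = nfa_to_dfa(nfa, 0, "ab")
      accepting = {S for S in table if 1 in S}   # DFA states containing NFA state 1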

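    The table-driven implementation for the resulting a*b DFA (the state numbering and the explicit dead state are assumptions of this example):

      # Rows are states, columns are input symbols; T[i][a] = k encodes
      # the transition Si (on input a) -> Sk.
      T = {
          0: {"a": 0, "b": 1},   # start: loop on 'a', step to state 1 on 'b'
          1: {"a": 2, "b": 2},   # accepting state for a*b
          2: {"a": 2, "b": 2},   # dead state: no way back to acceptance
      }
      ACCEPTING = {1}

      def run(table, accepting, text, start=0):
          state = start
          for ch in text:
              state = table[state][ch]   # exactly one transition per (state, input)
          return state in accepting

      print(run(T, ACCEPTING, "aaab"))   # True:  "aaab" matches a*b
      print(run(T, ACCEPTING, "aba"))    # False: 'a' after 'b' hits the dead state
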
Parsing

A finite automaton can really only express things where you can count modulo k, where k is the number of states; it cannot, for example, recognize arbitrarily deeply nested structure such as balanced parentheses.

So we need parsing, which transforms the sequence of tokens from the lexer into a parse tree of the program.