Lexical Analysis
. Recognize tokens and ignore white spaces, comments
. Generate token stream
. Error reporting
. Model using regular expressions
. Recognize using Finite State Automata
The first phase of the compiler is lexical analysis. The lexical analyzer breaks a sentence into a sequence of words, or tokens, ignoring white spaces and comments, and generates a stream of tokens from the input. Tokens are modeled using regular expressions, and their structure is recognized by finite state automata. If a token is not valid, i.e., does not fall into any of the identifiable groups, the lexical analyzer reports an error. Lexical analysis thus involves recognizing the tokens in the source program and reporting errors, if any. We will study these processes in more detail in the subsequent slides.
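As a minimal sketch of this whole process in Python (the token names, patterns, and Pascal-style {...} comment syntax below are illustrative assumptions, not taken from the lecture), the analyzer recognizes tokens with regular expressions, discards white spaces and comments, and reports an error when no pattern matches:

import re

# Illustrative token classes; each (name, pattern) pair is a regular
# expression that a finite state automaton could recognize.
TOKEN_SPEC = [
    ("num",  r"\d+(?:\.\d+)?"),          # integer or real constant
    ("id",   r"[A-Za-z][A-Za-z0-9]*"),   # letter(letter|digit)*
    ("op",   r"[=+\-*/]"),               # operators
    ("skip", r"[ \t\n]+|\{[^}]*\}"),     # blanks, tabs, newlines, {...} comments
]
MASTER = re.compile("|".join("(?P<%s>%s)" % pair for pair in TOKEN_SPEC))

def tokenize(text):
    pos = 0
    while pos < len(text):
        m = MASTER.match(text, pos)
        if m is None:                    # no pattern matches: report an error
            raise SyntaxError("illegal character %r at position %d" % (text[pos], pos))
        if m.lastgroup != "skip":        # discard white spaces and comments
            yield (m.lastgroup, m.group())
        pos = m.end()

print(list(tokenize("counter = counter + 1")))
# [('id', 'counter'), ('op', '='), ('id', 'counter'), ('op', '+'), ('num', '1')]

The single master regular expression plays the role of the combined finite state automaton: at each position it tries every token pattern and yields the category of the one that matches.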
. Sentences consist of strings of tokens (a syntactic category), for example, number, identifier, keyword, string
. A sequence of characters in a token is a lexeme, for example, 100.01, counter, const, “How are you?”
. A rule of description is a pattern, for example, letter(letter|digit)*
. Discard whatever does not contribute to parsing, like white spaces (blanks, tabs, newlines) and comments
. Construct constants: convert numbers to token num and pass the number as its attribute, for example, integer 31 becomes <num, 31>
. Recognize keywords and identifiers, for example, counter = counter + increment becomes id = id + id /*check if id is a keyword*/ (see the sketch after this list)
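As a sketch of the last two bullets (the translation of letter(letter|digit)* into Python's regex notation and the keyword table are assumptions for illustration), a pattern classifies each lexeme, and a table lookup decides whether a matched identifier is actually a keyword:

import re

# letter(letter|digit)* rendered in Python's regex notation
IDENTIFIER = re.compile(r"[A-Za-z][A-Za-z0-9]*")
KEYWORDS = {"const", "if", "while"}      # illustrative keyword table

def classify(lexeme):
    if lexeme.isdigit():
        return ("num", int(lexeme))      # constant: pass value as attribute
    if IDENTIFIER.fullmatch(lexeme):     # check if id is a keyword
        return ("keyword", lexeme) if lexeme in KEYWORDS else ("id", lexeme)
    return ("unknown", lexeme)

for lx in ["counter", "const", "31", "9lives"]:
    print(lx, "->", classify(lx))
# counter -> ('id', 'counter'); const -> ('keyword', 'const');
# 31 -> ('num', 31); 9lives -> ('unknown', '9lives')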
We often use the terms “token”, “pattern” and “lexeme” while studying lexical analysis. Let's see what each term stands for.
Token: A token is a syntactic category. Sentences consist of strings of tokens. For example, number, identifier, keyword and string are tokens.
Lexeme: A sequence of characters in a token is a lexeme. For example, 100.01, counter, const and “How are you?” are lexemes.
Pattern: A rule of description is a pattern. For example, letter (letter | digit)* is a pattern symbolizing the set of strings that consist of a letter followed by zero or more letters or digits. In general, there is a set of strings in the input for which the same token is produced as output. This set of strings is described by a rule called a pattern associated with the token. The pattern is said to match each string in the set. A lexeme is a sequence of characters in the source program that is matched by the pattern for a token. Patterns are specified using regular expressions. For example, in the Pascal statement
const pi = 3.1416;
The substring pi is a lexeme for the token “identifier”. We discard whatever does not contribute to parsing, like white spaces (blanks, tabs, new lines) and comments. When more than one pattern matches a lexeme, the lexical analyzer must provide additional information about the particular lexeme that matched to the subsequent phases of the compiler. For example, the pattern num matches both 1 and 0, but it is essential for the code generator to know which string was actually matched. The lexical analyzer collects such information about tokens into their associated attributes. For example, the integer 31 becomes <num, 31>. So, constants are constructed by converting numbers to the token ‘num’ and passing the number as its attribute. Similarly, we recognize keywords and identifiers. For example, count = count + inc becomes id = id + id.
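To close the loop, here is a hedged sketch of this attribute collection in Python (the pair representation and the keyword set are our own illustrative assumptions): numbers become the token num carrying their value, identifier lexemes are checked against a keyword table, and everything else is passed through as an operator:

KEYWORDS = {"const", "begin", "end"}     # assumed keyword set, for illustration

def attach_attributes(lexemes):
    # Turn raw lexemes into <token, attribute> pairs.
    for lx in lexemes:
        if lx.isdigit():
            yield ("num", int(lx))       # integer 31 becomes <num, 31>
        elif lx in KEYWORDS:
            yield ("keyword", lx)        # identifier pattern matched a keyword
        elif lx[0].isalpha():
            yield ("id", lx)             # ordinary identifier
        else:
            yield ("op", lx)

print([tok for tok, attr in attach_attributes(["count", "=", "count", "+", "inc"])])
# ['id', 'op', 'id', 'op', 'id']   i.e., count = count + inc becomes id = id + id

Later phases see only the token stream id = id + id, while the attributes preserve which lexeme each id actually stood for.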