3.0 Module 3: Syntactic Analysis – The Structure of Language
3.1 Introduction to Syntactic Analysis
Having explored how to analyze individual words, we now move to syntactic analysis, also known as parsing. This is the process of analyzing a string of symbols (in our case, the words of a sentence) to determine whether its order and structure conform to the rules of a formal grammar. Syntactic analysis serves a crucial dual purpose in NLP. First, it acts as a gatekeeper, checking for grammatical well-formedness. For example, a parser would reject a string with illegal word order such as “Boy the school to goes.” (A sentence like “The school goes to the boy,” by contrast, is syntactically well-formed even though it is semantically odd; filtering it out is the job of semantic analysis.) Second, and more importantly, it creates a structural representation, typically a parse tree, that reveals the hierarchical relationships between words. This is the moment in our pipeline when we move from a simple sequence of words to a structured representation of “who did what to whom,” providing the structural backbone necessary for understanding meaning.
3.2 The Role of the Parser
A parser is the software component specifically designed to perform syntactic analysis. It takes text as input and, after checking for correct syntax according to a formal grammar, produces a structural representation of that input. The main roles of a parser include:
- Reporting any syntax errors found in the input text.
- Recovering from common errors to continue processing the remainder of the input.
- Creating a parse tree or another hierarchical data structure (like an abstract syntax tree) that represents the sentence’s grammatical structure.
- Creating a symbol table that stores information about the symbols (e.g., words) encountered.
- Producing intermediate representations that can be used by subsequent phases of the NLP pipeline, such as semantic analysis.
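The roles above can be seen in miniature with a toy example. The sketch below assumes the NLTK library is installed; the grammar and vocabulary are invented for illustration and are far smaller than anything a real system would use:

```python
import nltk

# A toy context-free grammar (illustrative only).
grammar = nltk.CFG.fromstring("""
    S -> NP VP
    NP -> Det N
    VP -> V NP | V
    Det -> 'the'
    N -> 'boy' | 'school'
    V -> 'ran' | 'saw'
""")

parser = nltk.ChartParser(grammar)

# Well-formed input: the parser yields one or more parse trees.
trees = list(parser.parse(['the', 'boy', 'ran']))
print(trees[0])   # (S (NP (Det the) (N boy)) (VP (V ran)))

# Ill-formed word order: no tree is produced, which signals a syntax error.
bad = list(parser.parse(['boy', 'the', 'ran']))
print(len(bad))   # 0
```

The absence of any tree for the second input is how this parser "reports" a syntax error; production-grade parsers add richer diagnostics and error recovery on top of this basic behaviour.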
3.3 Core Parsing Strategies
There are two primary strategies for constructing a parse tree from an input sentence and a grammar.
3.3.1 Top-Down Parsing
A top-down parser begins with the grammar’s start symbol (e.g., S for Sentence) and attempts to generate the input string by applying grammar rules in a forward direction. It effectively tries to “predict” the structure of the sentence from the top of the grammar down to the words. This approach often uses recursive procedures. However, a significant disadvantage of simple top-down methods is the potential need for backtracking. If a chosen grammar rule leads to a dead end (a sequence that cannot match the input), the parser must backtrack and try an alternative rule.
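NLTK's recursive-descent parser is a textbook top-down implementation and makes the prediction-and-backtracking behaviour concrete. The grammar below is invented for illustration; note in particular that a left-recursive rule such as NP -> NP PP would send this style of parser into an infinite loop, a classic limitation of naive top-down methods:

```python
import nltk

# Toy grammar for a top-down (recursive-descent) parse.
grammar = nltk.CFG.fromstring("""
    S -> NP VP
    NP -> Det N
    VP -> V NP | V
    Det -> 'the'
    N -> 'boy' | 'dog'
    V -> 'saw'
""")

parser = nltk.RecursiveDescentParser(grammar)

# The parser starts at S, predicts NP VP, and expands downward,
# backtracking whenever a predicted rule fails to match the input.
trees = list(parser.parse(['the', 'boy', 'saw', 'the', 'dog']))
print(trees[0])
# (S (NP (Det the) (N boy)) (VP (V saw) (NP (Det the) (N dog))))
```

Passing trace=2 to RecursiveDescentParser prints each prediction and backtrack step, which is useful for seeing the dead ends the parser abandons.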
3.3.2 Bottom-Up Parsing
In contrast, a bottom-up parser starts with the input symbols (the words of the sentence) and tries to construct the parse tree from the leaves up to the root (the start symbol). It works by successively replacing sequences of symbols that match the right-hand side of a grammar rule with the non-terminal on the left-hand side, effectively “reducing” the input string until only the start symbol remains.
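The classic bottom-up strategy is shift-reduce parsing: the parser shifts words onto a stack and reduces any stack top that matches a rule's right-hand side. A minimal sketch using NLTK's shift-reduce parser, with an invented toy grammar (note that this simple parser does not backtrack, so it can fail on grammars where a greedy reduction is wrong):

```python
import nltk

grammar = nltk.CFG.fromstring("""
    S -> NP VP
    NP -> Det N
    VP -> V
    Det -> 'the'
    N -> 'boy'
    V -> 'ran'
""")

# Shift 'the', reduce to Det; shift 'boy', reduce to N; reduce Det N to NP;
# shift 'ran', reduce to V, then VP; finally reduce NP VP to S.
parser = nltk.ShiftReduceParser(grammar)
trees = list(parser.parse(['the', 'boy', 'ran']))
print(trees[0])   # (S (NP (Det the) (N boy)) (VP (V ran)))
```

Passing trace=2 to ShiftReduceParser prints each shift and reduce operation as it happens.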
3.4 Derivations and Parse Trees
3.4.1 Derivations
A derivation is the sequence of production rules applied to generate an input string from the grammar’s start symbol. The specific order in which non-terminals are replaced during this process defines the type of derivation.
- Left-most Derivation: At each step, the left-most non-terminal in the current sentential form is the one replaced by applying a production rule. Left-most derivations correspond naturally to top-down parsers, which consume the input from left to right.
- Right-most Derivation: At each step, the right-most non-terminal in the current sentential form is the one replaced. Bottom-up (shift-reduce) parsers effectively trace a right-most derivation in reverse.
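A left-most derivation can be simulated directly in a few lines of plain Python. The toy grammar below is invented for illustration and, for simplicity, gives each non-terminal exactly one production, so no rule choice is needed:

```python
# Toy grammar: S -> NP VP, NP -> Det N, VP -> V,
# Det -> 'the', N -> 'boy', V -> 'ran'.
# Upper-case strings are non-terminals; lower-case strings are terminals.
rules = {
    "S": ["NP", "VP"], "NP": ["Det", "N"], "VP": ["V"],
    "Det": ["the"], "N": ["boy"], "V": ["ran"],
}

def leftmost_derivation(start="S"):
    """Return the sequence of sentential forms from start symbol to string."""
    form, steps = [start], [[start]]
    while any(sym in rules for sym in form):
        # Always rewrite the left-most non-terminal in the sentential form.
        i = next(i for i, sym in enumerate(form) if sym in rules)
        form = form[:i] + rules[form[i]] + form[i + 1:]
        steps.append(list(form))
    return steps

for step in leftmost_derivation():
    print(" ".join(step))
# S / NP VP / Det N VP / the N VP / the boy VP / the boy V / the boy ran
```

Each printed line is one sentential form; the sequence as a whole is the derivation of "the boy ran" from S.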
3.4.2 Parse Trees
A parse tree is the graphical depiction of a derivation, providing a clear visualization of a sentence’s syntactic structure.
- The root of the tree is the grammar’s start symbol.
- The leaf nodes of the tree are the terminal symbols (the words of the original sentence).
- The interior nodes are the non-terminal symbols (the phrasal categories like Noun Phrase or Verb Phrase).
A key property of a valid parse tree is that reading its leaf nodes from left to right reproduces the original input string.
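These properties are easy to check programmatically. The sketch below assumes NLTK is available and uses its bracketed-tree notation for the toy sentence "the boy ran":

```python
import nltk

# A parse tree written in bracketed notation.
tree = nltk.Tree.fromstring("(S (NP (Det the) (N boy)) (VP (V ran)))")

print(tree.label())   # root is the start symbol: S
print(tree.leaves())  # terminals: ['the', 'boy', 'ran']

# Reading the leaves left to right reproduces the input sentence.
assert " ".join(tree.leaves()) == "the boy ran"
```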
3.5 Formal Grammars
To describe the syntactic structure of a language algorithmically, we need a formal grammar. In 1956, Noam Chomsky introduced a mathematical model of grammar that has been foundational to both linguistics and computer science.
3.5.1 Defining a Grammar
A formal grammar G can be defined as a 4-tuple: G = (N, T, S, P), where:
- N (or VN): A finite set of non-terminal symbols. These are variables that represent phrasal categories (e.g., NP, VP, S).
- T (or Σ): A finite set of terminal symbols. These are the actual words or tokens in the language (e.g., the, boy, ran).
- S: The designated start symbol, which must be a member of N (S ∈ N).
- P: A finite set of production rules of the form α → β, where α and β are strings over N ∪ T and α contains at least one non-terminal. These rules define how symbols can be rewritten.
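The 4-tuple can be written out directly as plain Python data, which makes the definition concrete. The grammar fragment below is invented for illustration:

```python
# G = (N, T, S, P) for a tiny English fragment.
N = {"S", "NP", "VP", "Det", "N", "V"}    # non-terminal symbols
T = {"the", "boy", "ran"}                 # terminal symbols
S = "S"                                   # start symbol, S in N
P = [                                     # productions alpha -> beta
    ("S",   ("NP", "VP")),
    ("NP",  ("Det", "N")),
    ("VP",  ("V",)),
    ("Det", ("the",)),
    ("N",   ("boy",)),
    ("V",   ("ran",)),
]

assert S in N
# Here every left-hand side is a single non-terminal, which is
# exactly the context-free restriction defined in the next section.
assert all(lhs in N for lhs, _ in P)
```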
3.5.2 Context-Free Grammar (CFG)
A Context-Free Grammar (CFG) is a type of formal grammar that is strictly more expressive than a regular grammar (every regular language is context-free, but not the reverse) and is widely used for parsing natural language. It is defined by the same four components:
- Non-terminals (V): Syntactic variables that define sets of strings.
- Terminals (Σ): The basic symbols or tokens from which strings are formed.
- Productions (P): Rules that define how non-terminals and terminals can be combined. In a CFG, the left side of every rule must consist of a single non-terminal.
- Start Symbol (S): The non-terminal from which all derivations begin.
3.6 Major Grammatical Frameworks
Within syntactic analysis, there are two contrasting frameworks for conceptualizing the structure of a sentence, leading to different kinds of parse trees.
3.6.1 Phrase Structure (Constituency) Grammar
Introduced by Noam Chomsky, this framework is based on the constituency relation. The core idea is that sentences are not flat strings of words but are composed of hierarchical groups of words called phrases or constituents. This model is derived from the classic subject-predicate division of Greek and Latin grammar. The basic clause structure is typically analyzed as being composed of a Noun Phrase (NP) and a Verb Phrase (VP). For the sentence “This tree is illustrating the constituency relation”, a constituency-based parse tree would group words into nested phrasal nodes, such as [NP This tree] and [VP is illustrating [NP the constituency relation]].
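One plausible constituency tree for this sentence can be written in bracketed notation and inspected with NLTK (assumed installed); the phrase labels here are illustrative, and real treebanks may analyze the noun phrases differently:

```python
import nltk

# A constituency analysis of the example sentence.
tree = nltk.Tree.fromstring("""
(S
  (NP (Det This) (N tree))
  (VP (Aux is) (V illustrating)
      (NP (Det the) (N constituency) (N relation))))
""")

print(tree.leaves())  # the original words, left to right

# The interior phrasal nodes group words into constituents:
for subtree in tree.subtrees(lambda t: t.label() == "NP"):
    print(subtree.leaves())
# ['This', 'tree']
# ['the', 'constituency', 'relation']
```

The nested NP and VP nodes are exactly what distinguishes a constituency tree from a dependency tree, which has no phrasal nodes at all.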
3.6.2 Dependency Grammar
Introduced by Lucien Tesnière, this framework is based on the dependency relation and is fundamentally different from constituency grammar because it lacks phrasal nodes. The core ideas are:
- Words are connected to each other by directed links, called dependencies.
- The verb is the structural center of the clause.
- All other syntactic units are either directly or indirectly dependent on the verb.
For the same sentence, “This tree is illustrating the dependency relation”, a dependency-based parse tree would show illustrating as the root of the structure, with words like tree and relation connected to it as dependents (e.g., subject and object, respectively), and other words like This and the dependent on their respective nouns. This results in a flatter structure that directly models the functional relationships between words.
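A dependency analysis needs no special library; it is just a set of directed head-dependent links, which can be written as a plain Python mapping. The relation labels below loosely follow Universal Dependencies style and are illustrative:

```python
# Each word maps to (head, relation); the root verb has head None.
deps = {
    "This":         ("tree",         "det"),
    "tree":         ("illustrating", "nsubj"),
    "is":           ("illustrating", "aux"),
    "illustrating": (None,           "root"),
    "the":          ("relation",     "det"),
    "dependency":   ("relation",     "compound"),
    "relation":     ("illustrating", "obj"),
}

def path_to_root(word):
    """Follow head links from a word up to the root verb."""
    chain = [word]
    while deps[word][0] is not None:
        word = deps[word][0]
        chain.append(word)
    return chain

# The verb is the structural centre: every word reaches it
# directly or through a chain of heads.
print(path_to_root("This"))   # ['This', 'tree', 'illustrating']
```

Because there are no phrasal nodes, the whole analysis is one link per word, which is the flatter structure described above.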
Choosing between these frameworks is a foundational decision in NLP system design; constituency grammars are often better for analyzing the formal structure of a sentence, while dependency grammars excel at capturing the functional relationships needed for tasks like question answering.
This module has established how syntactic analysis provides the structural framework for language. With this structure in place, we can now proceed to the next module, which focuses on how meaning is formally represented and extracted from these grammatical structures.