2.0 Module 2: Linguistic Resources and Word-Level Analysis
2.1 Introduction to Linguistic Resources and Analysis
For any Natural Language Processing system to function effectively, it requires two fundamental components: access to structured linguistic data and a set of methods for analyzing text at its most granular level. This module introduces these core building blocks. We will first explore the concept of corpora—large, structured collections of text that serve as the empirical foundation for modern NLP. We will then delve into word-level analysis, examining the techniques used to process and understand individual words, which constitutes the first critical step in the NLP pipeline.
2.2 The Role of Corpora in NLP
A corpus (plural: corpora) is a large, structured collection of machine-readable texts produced in a natural communicative setting. Corpora are not just random collections of text; they are carefully designed to be representative of a specific language or language variety, enabling researchers to make generalizable claims about the language they study.
2.2.1 Representativeness and Balance
The quality and utility of a corpus are determined by its design, with representativeness and balance being two of the most critical factors.
- Representativeness: This concept is central to corpus design. As defined by linguists, representativeness ensures that the findings derived from analyzing the corpus can be reliably generalized to the broader language variety it is intended to represent.
- According to Leech (1991), “A corpus is thought to be representative of the language variety it is supposed to represent if the findings based on its contents can be generalized to the said language variety.”
- Biber (1993) adds that representativeness “refers to the extent to which a sample includes the full range of variability in a population.”
- Balance: A corpus is considered balanced if it includes a wide and proportional range of text genres (e.g., news articles, fiction, academic papers, spoken conversations). While there is no single scientific measure for perfect balance, designers use their best judgment and the intended use of the corpus to ensure it covers the necessary variety of text categories.
2.2.2 Sampling
Because language is infinite, any corpus must be a finite sample of it. The process of sampling is therefore inescapable in corpus construction and is directly linked to achieving representativeness and balance. Key considerations in sampling include:
- Population: The complete collection of all possible texts of the type being studied (e.g., all English-language newspapers published in a specific year).
- Sampling Frame: The list of all the individual units from which a sample can be drawn (e.g., a list of all newspaper titles).
- Sampling Unit: The specific unit being sampled from the frame (e.g., a single newspaper issue or a specific book).
Conscious and careful sampling decisions are required at every stage, from selecting the kinds of texts to include to determining the length of text samples taken from within each selected document.
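The distinction between a sampling frame and a sampling unit can be sketched in a few lines of Python. The frame below is a synthetic list of newspaper titles, purely for illustration; the draw uses simple random sampling without replacement:

```python
import random

# Hypothetical sampling frame: a list of newspaper titles (illustrative names only).
sampling_frame = [f"newspaper_{i}" for i in range(1, 101)]

random.seed(42)  # fixed seed so the draw is reproducible

# Draw 10 sampling units (individual titles) from the frame,
# without replacement, so no title is selected twice.
sample = random.sample(sampling_frame, k=10)

print(len(sample))       # 10 units drawn
print(len(set(sample)))  # 10 distinct units (no replacement)
```

Real corpus projects typically go further, e.g. stratifying the frame by genre before sampling so that the resulting corpus stays balanced.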
2.2.3 Corpus Size
The ideal size of a corpus depends on its intended purpose and the types of queries and methodologies that will be applied to it. Over the past several decades, advances in technology and data storage have enabled a dramatic increase in corpus size, which in turn has powered more sophisticated and accurate NLP models.
| Year | Name of the Corpus | Size (in words) |
| --- | --- | --- |
| 1960s–70s | Brown and LOB | 1 Million |
| 1980s | The Birmingham corpora | 20 Million |
| 1990s | The British National Corpus | 100 Million |
| Early 21st century | The Bank of English corpus | 650 Million |
2.3 Specialized Linguistic Corpora
Beyond general-purpose corpora, many NLP tasks rely on specialized corpora that have been enriched with linguistic annotations.
2.3.1 TreeBank Corpus
A TreeBank is a text corpus that has been annotated with its syntactic or semantic sentence structure, typically represented in a tree format. TreeBanks are usually created by first tagging a corpus with part-of-speech information and then adding deeper structural analysis.
- Semantic Treebanks: These corpora use a formal representation to annotate a sentence’s semantic structure. Examples include the Robot Commands Treebank and the Groningen Meaning Bank.
- Syntactic Treebanks: These corpora annotate the grammatical structure of sentences. Many such Treebanks exist for various languages, including the Penn Arabic Treebank (Arabic), the Sinica Treebank (Chinese), and the BLLIP WSJ corpus (English).
TreeBanks are invaluable resources with wide-ranging applications in Computational Linguistics (for training parsers and other NLP systems), Corpus Linguistics (for studying syntactic phenomena), and Theoretical Linguistics (for providing empirical evidence for grammatical theories).
2.3.2 PropBank Corpus
The Proposition Bank (PropBank) is a corpus annotated with verbal propositions and their arguments. It is a verb-oriented resource, meaning it focuses on capturing the “who did what to whom” information for each verb in a sentence. Its primary application in NLP is to provide training data for semantic role labeling, a task that aims to identify the semantic roles of arguments in a predicate-argument structure.
2.3.3 VerbNet
VerbNet (VN) is a large, hierarchical, and domain-independent lexical resource for English verbs. It organizes verbs into classes based on shared syntactic and semantic behavior. Each VerbNet class contains two key components:
- A set of syntactic frames: These descriptions detail the possible surface realizations of a verb’s arguments, covering constructions like transitive, intransitive, and various diathesis alternations.
- A set of semantic descriptions: These descriptions constrain the types of thematic roles (e.g., animate, human) that can be associated with a verb’s arguments.
This combination provides rich, structured information about verb behavior, making it a powerful resource for parsing and semantic analysis.
2.3.4 WordNet
WordNet is a large lexical database for the English language. Unlike a traditional dictionary, WordNet organizes words based on their meanings. Nouns, verbs, adjectives, and adverbs are grouped into sets of cognitive synonyms called Synsets, each representing a distinct concept. These synsets are then linked by conceptual-semantic and lexical relations (such as hyponymy and antonymy). This rich, network-like structure makes WordNet extremely useful for various NLP applications, including word-sense disambiguation (determining which meaning of a word is used in a specific context) and improving information retrieval systems.
2.4 Word-Level Analysis: Regular Expressions and Finite State Automata
Word-level analysis begins with tools for recognizing and manipulating patterns in strings of text. Regular expressions and finite state automata are the formal foundations for these tasks.
2.4.1 Regular Expressions (RE)
A Regular Expression (RE) is a compact language for specifying text search patterns. It is an algebraic notation for characterizing a set of strings. A regular expression is defined mathematically as follows:
- ε is an RE that denotes the language containing only the empty string.
- φ is an RE that denotes the empty language (the language containing no strings at all).
- If X and Y are REs, then:
- X.Y (concatenation) is an RE.
- X+Y (union, often written as X|Y) is an RE.
- X* (Kleene closure, meaning zero or more occurrences) is an RE.
| Regular Expression | Regular Set | Explanation |
| --- | --- | --- |
| (0 + 10*) | {0, 1, 10, 100, 1000, …} | Matches either the single digit ‘0’ OR the digit ‘1’ followed by zero or more ‘0’s (Kleene closure on ‘0’). |
| (0*10*) | {1, 01, 10, 010, 0010, …} | The string consists of a single ‘1’ surrounded on either side by zero or more ‘0’s. |
| (a+b)* | {ε, a, b, aa, ab, bb, ba, …} | Matches any string of any length composed of a’s and b’s, including the empty string (ε). |
| (a+b)*abb | {abb, aabb, babb, …} | Matches any string composed of a’s and b’s that ends with the specific sequence abb. |
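As a practical aside, the last pattern in the table can be tested with Python's re module, which writes the union (a+b) as the character class [ab] and keeps * for Kleene closure:

```python
import re

# (a+b)*abb from the table above, in Python regex syntax:
# [ab] is the union of 'a' and 'b'; * is the Kleene closure.
pattern = re.compile(r"[ab]*abb")

def matches(s: str) -> bool:
    # fullmatch requires the ENTIRE string to belong to the language,
    # unlike search, which would accept any string containing a match.
    return pattern.fullmatch(s) is not None

print(matches("abb"))   # True  : in {abb, aabb, babb, ...}
print(matches("babb"))  # True
print(matches("abab"))  # False : does not end with abb
```

The choice of fullmatch rather than search is what makes the test correspond to language membership as defined above.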
2.4.2 Finite State Automata (FSA)
An automaton is an abstract self-propelled computing device that follows a predetermined sequence of operations. A Finite State Automaton (FSA) or Finite Automaton (FA) is an automaton with a finite number of states. It is a mathematical model of computation that can be used to recognize regular languages.
An FSA is formally defined as a 5-tuple (Q, Σ, δ, q0, F) where:
- Q: A finite set of states.
- Σ: A finite set of input symbols (the alphabet).
- δ: The transition function, which maps a state and an input symbol to a resulting state.
- q0: The initial state (where processing begins).
- F: A set of final (or accepting) states.
2.4.3 Relationship between RE, FSA, and Regular Grammars
Finite State Automata, Regular Expressions, and Regular Grammars are three equivalent formalisms. They are different ways of describing the same class of languages, known as regular languages. Any language that can be described by a regular expression can be recognized by an FSA, and vice versa.
2.4.4 Types of FSA
FSAs can be represented graphically as state diagrams and are categorized into two main types.
Deterministic Finite Automata (DFA)

In a DFA, for each state and each input symbol, there is exactly one transition to a next state. The machine’s path is uniquely determined by the input string.
- Mathematical Definition: The transition function δ maps a state and symbol to a single state: δ: Q × Σ → Q.
- Graphical Representation: States are vertices, transitions are labeled arcs, the initial state has an incoming arc, and final states are marked with a double circle.
Example of a DFA:
- Q = {a, b, c}
- Σ = {0, 1}
- q0 = a
- F = {c}
- Transition Table:
| Current State | Next State for Input 0 | Next State for Input 1 |
| --- | --- | --- |
| a | a | b |
| b | c | a |
| c | c | c |
In the graphical representation of this DFA, state ‘a’ is the initial state. There is an arc from ‘a’ back to itself labeled ‘0’, and an arc from ‘a’ to ‘b’ labeled ‘1’. From state ‘b’, an arc labeled ‘0’ goes to ‘c’, and an arc labeled ‘1’ goes back to ‘a’. From state ‘c’, arcs labeled ‘0’ and ‘1’ both point back to ‘c’. State ‘c’ is drawn with a double circle, indicating that it is the final state. Tracing the transitions shows that this DFA reaches the accepting state ‘c’ the first time a ‘0’ is read after an odd number of ‘1’s: for example, it accepts 10 and 1110 but rejects 110.
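A minimal sketch of how this DFA could be simulated in Python; the dictionary encoding of the transition function δ is one of several reasonable choices:

```python
# DFA from the example: Q = {a, b, c}, Σ = {0, 1}, q0 = a, F = {c}.
# δ is encoded as a dict mapping (state, symbol) to the unique next state.
delta = {
    ("a", "0"): "a", ("a", "1"): "b",
    ("b", "0"): "c", ("b", "1"): "a",
    ("c", "0"): "c", ("c", "1"): "c",
}

def accepts(string: str) -> bool:
    state = "a"                         # begin in the initial state q0
    for symbol in string:
        state = delta[(state, symbol)]  # exactly one transition per (state, symbol)
    return state == "c"                 # accept iff processing ends in a final state

print(accepts("10"))   # True  : a -1-> b -0-> c
print(accepts("110"))  # False : a -1-> b -1-> a -0-> a
```

Because δ is a total function on Q × Σ, the loop never has to choose between alternatives, which is exactly what makes the machine deterministic.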
Non-deterministic Finite Automata (NDFA)

In an NDFA, for a given state and input symbol, the machine can transition to a set of states (containing zero, one, or multiple states).
- Mathematical Definition: The transition function δ maps a state and symbol to a set of possible next states: δ: Q × Σ → 2^Q (the power set of Q).
- Graphical Representation: The components are the same as for a DFA, but a single state can have multiple outgoing arcs for the same input symbol.
Example of an NDFA:
- Q = {a, b, c}
- Σ = {0, 1}
- q0 = a
- F = {c}
- Transition Table:
| Current State | Next State for Input 0 | Next State for Input 1 |
| --- | --- | --- |
| a | {a, b} | {b} |
| b | {c} | {a, c} |
| c | {b, c} | {c} |
The key difference from a DFA is this element of choice; from a single state, given a single input, the NDFA can follow multiple paths simultaneously. A string is accepted if any of these possible paths ends in a final state. For example, in the NDFA above, from state ‘a’ with input ‘0’, the machine can transition to either ‘a’ or ‘b’.
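This "any path" acceptance condition can be sketched by tracking the set of states the machine could currently be in, updating that set on every input symbol:

```python
# NDFA from the example: δ maps (state, symbol) to a SET of next states.
delta = {
    ("a", "0"): {"a", "b"}, ("a", "1"): {"b"},
    ("b", "0"): {"c"},      ("b", "1"): {"a", "c"},
    ("c", "0"): {"b", "c"}, ("c", "1"): {"c"},
}
final_states = {"c"}

def accepts(string: str) -> bool:
    current = {"a"}                 # all states reachable so far; start at q0
    for symbol in string:
        nxt = set()
        for q in current:
            nxt |= delta[(q, symbol)]   # follow every possible transition
        current = nxt
    # accept if ANY of the possible paths ends in a final state
    return bool(current & final_states)

print(accepts("00"))  # True  : {a} -0-> {a, b} -0-> {a, b, c}, which contains c
print(accepts("1"))   # False : {a} -1-> {b}, which contains no final state
```

Tracking sets of states like this is the same idea behind the subset construction, which converts any NDFA into an equivalent DFA.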
2.5 Morphological Analysis
Morphology is the study of the internal structure of words and the principles of word formation. It is the first step in the NLP pipeline, concerned with breaking words down into their component parts.
2.5.1 Morphemes: The Building Blocks
A morpheme is the smallest meaning-bearing unit in a language. Words are composed of one or more morphemes.
- Stems: The core, meaningful unit of a word. For example, in foxes, the stem is fox.
- Affixes: Morphemes that are added to a stem to modify its meaning or grammatical function.
- Prefixes: Precede the stem (e.g., un- in unbuckle).
- Suffixes: Follow the stem (e.g., -s in cats).
- Infixes: Are inserted inside the stem (e.g., pluralizing cupful to cupsful).
- Circumfixes: Precede and follow the stem simultaneously (less common in English).
2.5.2 Morphological Parsing
Morphological parsing is the process of analyzing a word to break it down into its constituent morphemes. For instance, parsing foxes would yield the stem fox and the plural affix -es. Building a morphological parser requires three key components:
- Lexicon: A dictionary of stems and affixes, containing information about their meaning and grammatical properties (e.g., whether a stem is a Noun or a Verb).
- Morphotactics: A model of morpheme ordering that specifies which classes of morphemes can follow others. For example, in English, the plural morpheme follows the noun stem; it does not precede it.
- Orthographic Rules: Spelling rules that model the changes that occur when morphemes are combined. For example, the rule that changes y to ie when adding the plural suffix to a word like city to form cities.
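The three components can be combined into a toy parser. This is only an illustrative sketch with a three-word lexicon and hand-coded suffix rules, not a realistic morphological analyzer:

```python
# Lexicon: a tiny dictionary of stems with their grammatical category.
lexicon = {"fox": "Noun", "cat": "Noun", "city": "Noun"}

def parse(word: str):
    """Return (stem, affix) for a word, or None if the toy model cannot parse it."""
    if word in lexicon:                # bare stem, no affix attached
        return (word, None)
    if word.endswith("ies"):           # orthographic rule: y -> ie before plural -s
        stem = word[:-3] + "y"         # cities -> city
        if stem in lexicon:
            return (stem, "-s (plural)")
    if word.endswith("es"):            # foxes -> fox + -es
        stem = word[:-2]
        if stem in lexicon:
            return (stem, "-es (plural)")
    if word.endswith("s"):             # cats -> cat + -s  (morphotactics: suffix
        stem = word[:-1]               # follows the noun stem, never precedes it)
        if stem in lexicon:
            return (stem, "-s (plural)")
    return None

print(parse("foxes"))   # ('fox', '-es (plural)')
print(parse("cities"))  # ('city', '-s (plural)')
```

Production systems typically implement the lexicon, morphotactics, and orthographic rules as finite-state transducers rather than hand-written conditionals, which connects this section back to the automata above.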
This module has detailed the essential linguistic resources and the foundational techniques of word-level analysis. With this groundwork, we can now proceed to the next module, which explores how these individual words are assembled into grammatical structures.