6.0 Module 6: Core NLP Tasks and Methodologies
6.1 Introduction to Core NLP Tasks
The foundational concepts of linguistics, ambiguity resolution, and discourse structure are applied to solve specific, practical problems in Natural Language Processing. This module shifts our focus to these concrete applications. We will explore two fundamental tasks that are essential components of many higher-level NLP systems: Part-of-Speech (PoS) Tagging, the process of labeling words with their grammatical category, and Information Retrieval (IR), the science of finding relevant documents from a large collection.
6.2 Part-of-Speech (PoS) Tagging
Tagging is the general process of automatically assigning a descriptive label (a tag) to a token. Part-of-Speech (PoS) tagging is the specific task of assigning a part-of-speech tag—such as noun, verb, adjective, or adverb—to each word in a sentence. This process is a crucial first step for many NLP applications as it resolves a significant amount of lexical ambiguity and provides basic syntactic information.
6.2.1 Rule-Based PoS Tagging
This is one of the earliest techniques for PoS tagging. It operates using a two-stage architecture:
- Dictionary Lookup: First, a dictionary is used to assign each word a list of all its potential PoS tags.
- Disambiguation Rules: Second, a large set of hand-written rules is applied to resolve ambiguity. These rules use contextual information to select the correct tag. For example, a rule might state: “If the preceding word is an article (like ‘the’), then the current word must be a noun.”
These taggers are knowledge-driven, with all rules and linguistic information built manually by experts.
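The two-stage architecture can be sketched as follows; the tiny lexicon and the single disambiguation rule are invented for illustration, not a real rule set:

```python
# Two-stage rule-based tagger sketch: dictionary lookup, then hand-written rules.
LEXICON = {
    "the": ["DT"],
    "book": ["NN", "VB"],     # ambiguous: noun or verb
    "flies": ["NNS", "VBZ"],  # ambiguous: plural noun or verb
}

def rule_based_tag(words):
    tags = []
    for i, word in enumerate(words):
        # Stage 1: dictionary lookup yields all potential tags.
        candidates = LEXICON.get(word.lower(), ["NN"])  # naive default for unknowns
        # Stage 2: a hand-written rule resolves ambiguity, e.g.
        # "after an article, prefer the noun reading".
        if len(candidates) > 1 and i > 0 and tags[i - 1] == "DT":
            noun_readings = [t for t in candidates if t.startswith("NN")]
            if noun_readings:
                candidates = noun_readings
        tags.append(candidates[0])
    return tags

print(rule_based_tag(["the", "book"]))  # 'book' resolved to NN after 'the'
```

A real rule-based tagger would use hundreds of such rules, all built manually by experts.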
6.2.2 Stochastic PoS Tagging
A stochastic model is one that involves frequency or probability. Stochastic taggers use a training corpus to learn the probability of a given word having a particular tag. The two simplest approaches are:
- The Word Frequency Approach: For an ambiguous word, the tagger assigns the tag that occurred most frequently with that word in the training corpus. Because this choice ignores the surrounding context, the method can produce grammatically invalid tag sequences.
- Tag Sequence Probabilities (The n-gram approach): This more sophisticated method considers the context. The probability of a tag for a given word is calculated based on the previous n tags. For example, a bigram (2-gram) model would calculate the probability of a word being a verb given that the previous word was tagged as a noun.
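The word frequency approach amounts to a lookup over counts gathered from a tagged corpus. A minimal sketch, with an invented toy corpus of (word, tag) pairs:

```python
from collections import Counter, defaultdict

# Word-frequency tagger: pick the tag most often seen with each word in training.
training = [
    ("the", "DT"), ("dog", "NN"), ("runs", "VBZ"),
    ("the", "DT"), ("run", "NN"),                    # 'run' seen once as a noun...
    ("dogs", "NNS"), ("run", "VB"), ("run", "VB"),   # ...and twice as a verb
]

counts = defaultdict(Counter)
for word, tag in training:
    counts[word][tag] += 1

def most_frequent_tag(word):
    if word in counts:
        return counts[word].most_common(1)[0][0]
    return "NN"  # naive fallback for unseen words

print(most_frequent_tag("run"))  # VB: the majority tag wins, regardless of context
```

Note that `most_frequent_tag` returns VB for “run” even in contexts where the noun reading is correct, which is exactly the weakness noted above.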
6.2.3 Transformation-Based Tagging (Brill Tagging)
This is a hybrid approach that combines rule-based methods with machine learning. Also known as Brill Tagging, it is an instance of Transformation-Based Learning (TBL). The process works as follows:
- An initial, simple solution is applied (e.g., assign every word its most frequent tag).
- The system then iteratively learns a set of simple, ordered transformation rules that correct errors. In each cycle, it finds and applies the single rule that provides the most improvement to the tagging accuracy.
- This continues until no further improvement can be made.
- Advantages: The resulting rules are simple, easy to understand, and can be debugged.
- Disadvantages: The training process can be very slow on large corpora, and the model does not produce tag probabilities.
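The TBL loop above can be sketched with a single rule template, “change tag X to Y when the previous tag is Z”; the sentence, tags, and initial solution are invented for illustration:

```python
# Transformation-based (Brill) learning sketch with one rule template.
words = ["the", "run", "was", "fun"]
gold = ["DT", "NN", "VBD", "JJ"]
# Step 1: initial solution -- most frequent tag per word (here 'run' -> VB).
current = ["DT", "VB", "VBD", "JJ"]

def apply_rule(tags, rule):
    old, new, prev = rule
    out = list(tags)
    for i in range(1, len(out)):
        if out[i] == old and out[i - 1] == prev:
            out[i] = new
    return out

def errors(tags):
    return sum(t != g for t, g in zip(tags, gold))

learned = []
while True:
    # Step 2: propose a rule for each current error, then keep the single
    # rule that yields the biggest improvement.
    candidates = [(c, g, current[i - 1])
                  for i, (c, g) in enumerate(zip(current, gold))
                  if i > 0 and c != g]
    best = min(candidates, key=lambda r: errors(apply_rule(current, r)), default=None)
    if best is None or errors(apply_rule(current, best)) >= errors(current):
        break  # Step 3: stop when no rule improves accuracy.
    learned.append(best)
    current = apply_rule(current, best)

print(learned)  # [('VB', 'NN', 'DT')]: "retag VB as NN after DT"
```

The learned rules are ordered and human-readable, which is the advantage noted above; a real Brill tagger searches many more rule templates over a large corpus, which is what makes training slow.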
6.2.4 Hidden Markov Model (HMM) PoS Tagging
Defining HMMs
A Hidden Markov Model (HMM) is a statistical model in which the system being modeled is assumed to be a Markov process with unobserved (hidden) states. An HMM is a doubly-embedded stochastic model: a hidden stochastic process (the sequence of states) can only be observed through another stochastic process that produces a sequence of observations.
Imagine a coin-tossing experiment where the specific coin being used at each toss is hidden from you. You only see the sequence of heads and tails. The HMM would model the (hidden) states of which coin is being used and the (observable) outputs of H/T. An HMM is characterized by five elements:
- N: The number of hidden states.
- M: The number of distinct observable symbols.
- A: The state transition probability matrix.
- P: The probability distribution of observable symbols in each state (the emission probabilities).
- I: The initial state distribution.
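The five elements can be written out concretely for the coin-tossing example above; all probability values here are invented for illustration:

```python
# The coin-tossing HMM as its five elements.
states = ["fair", "biased"]          # N = 2 hidden states (which coin is in use)
symbols = ["H", "T"]                 # M = 2 observable symbols
A = {                                # state transition probability matrix
    "fair":   {"fair": 0.9, "biased": 0.1},
    "biased": {"fair": 0.2, "biased": 0.8},
}
P = {                                # symbol (emission) probabilities per state
    "fair":   {"H": 0.5, "T": 0.5},
    "biased": {"H": 0.8, "T": 0.2},
}
I = {"fair": 0.6, "biased": 0.4}     # initial state distribution

# Probability that one hidden path produced one observation sequence:
def path_prob(path, obs):
    p = I[path[0]] * P[path[0]][obs[0]]
    for i in range(1, len(path)):
        p *= A[path[i - 1]][path[i]] * P[path[i]][obs[i]]
    return p

prob = path_prob(["fair", "biased"], ["H", "H"])  # 0.6*0.5 * 0.1*0.8 = 0.024
```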
Applying HMMs to PoS Tagging
PoS tagging maps naturally onto an HMM. The PoS tags are the hidden states, and the words in the sentence are the observable outputs. The goal is to find the most probable sequence of hidden states (tags) that could have generated the observed sequence of words. We want to find the tag sequence C = (C<sub>1</sub>, C<sub>2</sub>, …, C<sub>T</sub>) that maximizes P(C|W) for a given word sequence W = (W<sub>1</sub>, W<sub>2</sub>, …, W<sub>T</sub>).
Mathematical Formulation
Using Bayes’ rule, we can transform the problem: P(C|W) = [P(W|C) * P(C)] / P(W)
Since P(W) is constant for a given sentence, we only need to maximize: P(W|C) * P(C)
To make this computationally tractable, we make two key simplifying assumptions:
- First Assumption (n-gram model): The probability of a tag depends only on the previous n-1 tags. For a bigram model, this simplifies to: P(C) ≈ Π P(C<sub>i</sub> | C<sub>i-1</sub>)
- Second Assumption (Word Independence): The probability of a word appearing depends only on its own tag and is independent of other words and tags. P(W|C) ≈ Π P(W<sub>i</sub> | C<sub>i</sub>)
Combining these, the goal is to find the tag sequence C that maximizes: Π<sub>i=1…T</sub> [ P(C<sub>i</sub> | C<sub>i-1</sub>) * P(W<sub>i</sub> | C<sub>i</sub>) ]
To make this tangible, consider the sentence “The cat sat.” An HMM would calculate the probability of the entire tag sequence (e.g., DT NN VBD) by multiplying the probabilities of each step in the sequence. This involves both the probability of one tag following another (the transition probability) and the probability of a word being generated by a tag (the emission probability). For our example, the calculation would look something like this: P(DT|START) * P(The|DT) * P(NN|DT) * P(cat|NN) * P(VBD|NN) * P(sat|VBD). The tag sequence with the highest total probability is chosen as the answer.
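The calculation for “The cat sat.” can be reproduced directly; the probability values below are invented for illustration:

```python
# Scoring one tag sequence for "The cat sat." with assumed probability tables.
transition = {("START", "DT"): 0.5, ("DT", "NN"): 0.7, ("NN", "VBD"): 0.3}
emission = {("The", "DT"): 0.6, ("cat", "NN"): 0.01, ("sat", "VBD"): 0.05}

def sequence_prob(words, tags):
    p = 1.0
    prev = "START"
    for word, tag in zip(words, tags):
        # One transition probability times one emission probability per step.
        p *= transition[(prev, tag)] * emission[(word, tag)]
        prev = tag
    return p

score = sequence_prob(["The", "cat", "sat"], ["DT", "NN", "VBD"])
# 0.5*0.6 * 0.7*0.01 * 0.3*0.05 = 3.15e-05
```

In practice the tagger does not enumerate every tag sequence; the Viterbi algorithm finds the highest-scoring sequence efficiently by dynamic programming.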
The required probabilities can be estimated directly from a large tagged corpus:
- Transition Probability: P(C<sub>i</sub> | C<sub>i-1</sub>) = Count(C<sub>i-1</sub>, C<sub>i</sub>) / Count(C<sub>i-1</sub>) (e.g., the number of times a Verb follows a Noun / the total number of Nouns)
- Emission Probability: P(W<sub>i</sub> | C<sub>i</sub>) = Count(W<sub>i</sub>, C<sub>i</sub>) / Count(C<sub>i</sub>) (e.g., the number of times the word ‘run’ is a Verb / the total number of Verbs)
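Estimating both tables from counts can be sketched over a tiny tagged corpus (the sentences are invented):

```python
from collections import Counter

# Estimate transition and emission probabilities from a tagged corpus.
corpus = [[("the", "DT"), ("cat", "NN"), ("sat", "VBD")],
          [("the", "DT"), ("dog", "NN"), ("ran", "VBD")],
          [("a", "DT"), ("cat", "NN"), ("slept", "VBD")]]

tag_counts, bigram_counts, word_tag_counts = Counter(), Counter(), Counter()
for sentence in corpus:
    prev = "START"
    for word, tag in sentence:
        tag_counts[tag] += 1
        bigram_counts[(prev, tag)] += 1
        word_tag_counts[(word, tag)] += 1
        prev = tag

def transition(prev, tag):
    # Count(prev, tag) / Count(prev); START occurs once per sentence.
    denom = tag_counts[prev] if prev != "START" else len(corpus)
    return bigram_counts[(prev, tag)] / denom

def emission(word, tag):
    # Count(word, tag) / Count(tag)
    return word_tag_counts[(word, tag)] / tag_counts[tag]

print(emission("cat", "NN"))  # 'cat' is 2 of the 3 NN tokens
```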
6.3 Information Retrieval (IR)
Information Retrieval (IR) is the field concerned with the organization, storage, retrieval, and evaluation of information from document repositories. The classic challenge in IR is the ad-hoc retrieval problem: a user enters a query in natural language, and the system must return a ranked list of documents that are relevant to that query.
6.3.1 Core Design Features of IR Systems
Most modern IR systems are built around a few core design principles:
- Inverted Index: This is the primary data structure. An inverted index is a mapping from content, such as words or numbers, to their locations in a set of documents. For every word in the collection, it lists all the documents that contain it, making keyword searches extremely efficient.
- Stop Word Elimination: Stop words are high-frequency words (e.g., ‘a’, ‘the’, ‘in’, ‘of’) that are considered to have little semantic value for searching. These words are placed on a stop list and removed from documents and queries before processing. This reduces the size of the inverted index but carries a small risk of removing a meaningful term (e.g., removing ‘A’ from a search for ‘Vitamin A’).
- Stemming: This is a heuristic process of reducing words to their root or base form by chopping off endings. For example, the words laughing, laughs, and laughed would all be reduced to the stem laugh. This allows a search for “laugh” to match documents containing any of its variants.
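The three design features can be combined into one small indexing pipeline; the stop list, the crude suffix-stripping stemmer, and the documents are all invented stand-ins (a real system would use something like the Porter stemmer):

```python
# Minimal indexing pipeline: stop-word elimination, crude stemming, and an
# inverted index mapping each term to the set of documents containing it.
STOP_WORDS = {"a", "the", "in", "of"}

def stem(word):
    # Crude stand-in for a real stemmer: chop a few common endings.
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def build_index(docs):
    index = {}
    for doc_id, text in docs.items():
        for token in text.lower().split():
            if token in STOP_WORDS:
                continue  # stop words never reach the index
            index.setdefault(stem(token), set()).add(doc_id)
    return index

index = build_index({
    1: "the cat laughs",
    2: "a dog laughed in the park",
})
# 'laughs' and 'laughed' both collapse to the stem 'laugh',
# so a search for "laugh" matches documents 1 and 2.
```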
6.3.2 The Boolean Model
This is the oldest IR model, based on set theory and Boolean algebra.
- Components: Documents (D) are treated as sets of terms, queries (Q) are Boolean expressions (using AND, OR, NOT), and the framework (F) is Boolean algebra.
- Relevance: A document is considered relevant if and only if it satisfies the Boolean query expression.
- Operators: social AND economic retrieves documents containing both terms (set intersection). social OR economic retrieves documents containing either term (set union).
- Advantages:
- Simple, easy to implement, and predictable.
- Gives the user a high degree of control.
- Disadvantages:
- Retrieval is all-or-nothing; there are no partial matches or ranking.
- Query formulation is complex and can be unforgiving.
- Does not account for term frequency or importance.
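Because the Boolean model is pure set theory, its operators reduce to Python set operations; the postings lists below are invented:

```python
# Boolean retrieval over an inverted index: AND is intersection, OR is union,
# NOT is difference against the full collection.
postings = {
    "social":   {1, 2, 4},
    "economic": {2, 3},
}
all_docs = {1, 2, 3, 4}

and_result = postings["social"] & postings["economic"]  # social AND economic
or_result = postings["social"] | postings["economic"]   # social OR economic
not_result = all_docs - postings["economic"]            # NOT economic
```

Note that `and_result` is an unordered set: every document either satisfies the query or does not, which is exactly the all-or-nothing behavior listed among the disadvantages.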
6.3.3 The Vector Space Model
This model was developed to overcome the limitations of the Boolean model.
- Core Concept: Documents and queries are represented as vectors in a high-dimensional space, where each dimension corresponds to a unique term in the collection.
- Similarity: The relevance of a document to a query is determined by the similarity between their vectors, typically calculated as the cosine of the angle between them. A smaller angle (cosine closer to 1) indicates higher similarity.
Term Weighting
In the vector space model, not all terms are equally important. Term weights are used to reflect this.
- Term Frequency (tf): The number of times a term appears in a document. Higher tf suggests the term is a good descriptor of that document’s content.
- Document Frequency (df): The number of documents in the collection that contain a term.
- Collection Frequency (cf): The total number of times a term appears in the entire collection.
Inverse Document Frequency (idf)
The importance of a term is often inversely proportional to its frequency across the collection. A term that is rare across all documents is more informative than a common one. This is captured by the idf measure.
- Formula: A common formulation is idf<sub>t</sub> = log(1 + (N / n<sub>t</sub>)), where N is the total number of documents and n<sub>t</sub> is the number of documents containing term t.
The final weight for a term in a document is often a combination of its tf and idf (e.g., tf-idf). The tf-idf weight combines these two ideas, giving the highest scores to terms that are frequent in a specific document (high tf) but rare across the entire collection (high idf), as these are the terms that are most likely to be uniquely descriptive of that document’s topic.
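The whole vector space pipeline, tf-idf weighting followed by cosine ranking, can be sketched over a toy collection (documents and query invented, idf as formulated above):

```python
import math

# tf-idf weighting and cosine similarity over a toy collection.
docs = {
    1: "economic growth and social policy".split(),
    2: "social media and social networks".split(),
    3: "economic policy review".split(),
}
N = len(docs)
vocab = sorted({t for words in docs.values() for t in words})

def idf(term):
    n_t = sum(term in words for words in docs.values())
    return math.log(1 + N / n_t) if n_t else 0.0  # idf_t = log(1 + N/n_t)

def tfidf_vector(words):
    # One dimension per vocabulary term; weight = tf * idf.
    return [words.count(t) * idf(t) for t in vocab]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

query_vec = tfidf_vector("economic policy".split())
ranking = sorted(docs, key=lambda d: cosine(query_vec, tfidf_vector(docs[d])),
                 reverse=True)
print(ranking)  # [3, 1, 2]: document 3 matches both query terms most tightly
```

Document 2 shares no terms with the query, so its cosine is 0 and it ranks last; unlike the Boolean model, documents 1 and 3 receive graded, partial-match scores.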
6.3.4 User Query Improvement via Relevance Feedback
Relevance feedback is an interactive process for improving query results. The system’s initial output is shown to the user, who then provides feedback on which documents are relevant. This feedback is used to automatically reformulate the query.
- Explicit Feedback: The user explicitly marks documents as relevant (binary) or rates their relevance on a scale (graded).
- Implicit Feedback: The system infers relevance from user behavior, such as which documents are clicked on, how long a user spends viewing a page (dwell time), or scrolling actions.
- Pseudo Feedback (Blind Feedback): This is an automated method. The system assumes the top-k documents returned by the initial query are relevant and uses them to reformulate the query without any user interaction.
- Assume the top n (e.g., 10-50) documents from the initial search are relevant.
- Select the most important terms (e.g., using tf-idf) from this assumed-relevant set.
- Add these terms to the original query and re-run the search.
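The three pseudo-feedback steps can be sketched as follows; the ranked documents are invented, and plain term frequency stands in for the tf-idf term selection a real system would use:

```python
from collections import Counter

# Pseudo (blind) relevance feedback: expand the query from the top-n results.
def expand_query(query_terms, ranked_docs, top_n=2, extra_terms=2):
    # Step 1: assume the top_n documents are relevant.
    assumed_relevant = ranked_docs[:top_n]
    # Step 2: select the most important non-query terms from that set
    # (frequency here; a real system would typically weight by tf-idf).
    pool = Counter(t for doc in assumed_relevant for t in doc
                   if t not in query_terms)
    # Step 3: add them to the original query for a second retrieval round.
    return list(query_terms) + [t for t, _ in pool.most_common(extra_terms)]

ranked = [["solar", "panel", "energy", "energy"],
          ["solar", "energy", "grid"],
          ["unrelated", "document"]]
expanded = expand_query(["solar"], ranked)
print(expanded)  # original query plus terms drawn from the top two documents
```

The risk, of course, is query drift: if the top-n documents were not actually relevant, the added terms pull the second search further away from the user's intent.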
This module has covered key methodologies for PoS tagging and information retrieval. We will now turn to the underlying linguistic theory that informs these tasks and explore a broader range of real-world NLP applications.