3.0 Stochastic POS Tagging
Stochastic POS tagging represents a significant paradigm shift from manual rule creation to data-driven, probabilistic methods. This methodology leverages statistical information—specifically tag frequencies and probabilities—derived from a pre-annotated training corpus. Its emergence marked the beginning of the empirical revolution in NLP, demonstrating that robust linguistic behavior could be learned automatically from text.
Stochastic tagging can be implemented using two distinct approaches that vary in complexity and effectiveness.
- Word Frequency Approach: This simple method disambiguates a word by assigning it the POS tag with which it occurred most frequently in the training corpus. For example, if the word “book” appeared 100 times as a noun and 10 times as a verb in the training data, this approach would always tag it as a noun. The key weakness of this method is its failure to consider context, which can result in inadmissible sequences of tags (e.g., an article followed by a verb).
- Tag Sequence Probabilities (n-gram approach): A more advanced and context-aware method, this approach determines the best tag for a word by calculating the probability of its occurrence in a sequence with the preceding n-1 tags. This fundamentally models P(tag|previous n-1 tags), thereby capturing local contextual dependencies that the word-frequency approach ignores. A bigram model (n=2), for instance, would calculate the probability of a tag based on the single most recent tag, allowing the model to prefer more grammatically sound tag sequences.
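The contrast between the two approaches can be sketched with a toy example. The corpus, tag names, and scoring function below are illustrative assumptions, not a standard implementation: the bigram tagger here greedily scores each candidate tag by P(tag|word) × P(tag|previous tag), whereas a full stochastic tagger would search over whole tag sequences.

```python
from collections import Counter, defaultdict

# Hypothetical toy annotated corpus of (word, tag) sentences; a stand-in
# for a large training corpus such as the Penn Treebank.
corpus = [
    [("the", "DET"), ("book", "NOUN"), ("sells", "VERB")],
    [("i", "PRON"), ("book", "VERB"), ("a", "DET"), ("flight", "NOUN")],
    [("the", "DET"), ("book", "NOUN"), ("is", "VERB"), ("good", "ADJ")],
]

word_tag = defaultdict(Counter)    # how often each word carries each tag
tag_bigram = defaultdict(Counter)  # how often each tag follows another
for sent in corpus:
    prev = "<s>"                   # sentence-start pseudo-tag
    for word, tag in sent:
        word_tag[word][tag] += 1
        tag_bigram[prev][tag] += 1
        prev = tag

def unigram_tag(word):
    """Word-frequency approach: always pick the word's most frequent tag."""
    return word_tag[word].most_common(1)[0][0]

def bigram_tag(word, prev_tag):
    """Bigram approach: score candidates by P(tag|word) * P(tag|prev_tag)."""
    best, best_p = None, 0.0
    n_word = sum(word_tag[word].values())
    n_prev = sum(tag_bigram[prev_tag].values())
    for tag, count in word_tag[word].items():
        p = (count / n_word) * (tag_bigram[prev_tag][tag] / n_prev)
        if p > best_p:
            best, best_p = tag, p
    return best

# "book" occurs twice as NOUN and once as VERB, so the word-frequency
# tagger always answers NOUN, even right after a pronoun, where the
# bigram tagger correctly prefers VERB.
print(unigram_tag("book"))         # NOUN
print(bigram_tag("book", "PRON"))  # VERB
```

Note the design trade-off this exposes: the unigram tagger needs only per-word counts, while the bigram tagger additionally needs tag-transition counts, which is exactly the contextual knowledge the word-frequency approach lacks.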
The defining properties of stochastic taggers are rooted in their statistical nature.
- Relies on the probability of a tag’s occurrence, both individually and in sequence.
- Requires a large, annotated training corpus as its primary source of knowledge.
- Chooses the most frequent or probable tag for a word based on patterns observed in the training data.
- Suffers from the “unknown word” or “out-of-vocabulary (OOV)” problem, as it has no probability data for words not present in the corpus.
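The OOV problem in the last point can be illustrated with a minimal sketch. The lexicon, tag names, and suffix heuristic below are hypothetical; real taggers typically back off to morphological cues or an open-class default when no statistics exist for a word.

```python
# Hypothetical lexicon learned from training data: statistics exist only
# for words seen in the corpus.
seen = {"book": "NOUN", "the": "DET", "sells": "VERB"}

def tag_with_fallback(word, default="NOUN"):
    """Tag a word, falling back on heuristics when it is out-of-vocabulary."""
    if word in seen:
        return seen[word]         # training statistics available
    if word.endswith("ing"):      # crude morphological heuristic
        return "VERB"
    return default                # open-class guess for unknown words

print(tag_with_fallback("book"))     # NOUN (seen in training)
print(tag_with_fallback("running"))  # VERB (suffix heuristic)
print(tag_with_fallback("xyzzy"))    # NOUN (default fallback)
```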
The next methodology we will examine, Transformation-Based Tagging, offers a hybrid model inspired by both the rule-based logic and the data-driven learning of stochastic systems.