3.0 Stochastic POS Tagging
Stochastic POS tagging represents a significant paradigm shift from manual rule creation to data-driven, probabilistic methods. This methodology leverages statistical information—specifically tag frequencies and probabilities—derived from a pre-annotated training corpus. Its emergence marked the beginning of the empirical revolution in NLP, demonstrating that robust linguistic behavior could be learned automatically from text.
Stochastic tagging can be implemented using two distinct approaches that vary in complexity and effectiveness.
- Word Frequency Approach: This simple method disambiguates a word by assigning it the POS tag with which it occurred most frequently in the training corpus. For example, if the word “book” appeared 100 times as a noun and 10 times as a verb in the training data, this approach would always tag it as a noun. The key weakness of this method is its failure to consider context, which can result in inadmissible sequences of tags (e.g., an article followed by a verb).
- Tag Sequence Probabilities (n-gram approach): A more advanced and context-aware method, this approach determines the best tag for a word by calculating the probability of its occurrence in a sequence with the preceding n-1 tags. This fundamentally models P(tag|previous n-1 tags), thereby capturing local contextual dependencies that the word-frequency approach ignores. A bigram model (n=2), for instance, would calculate the probability of a tag based on the single most recent tag, allowing the model to prefer more grammatically sound tag sequences.
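The contrast between the two approaches can be sketched with a toy example. The corpus, tag names, and scoring function below are illustrative assumptions, not a standard implementation: the bigram tagger here greedily scores each candidate tag by P(tag|word) × P(tag|previous tag), whereas a full stochastic tagger would search over whole tag sequences.

```python
from collections import Counter, defaultdict

# Hypothetical toy annotated corpus of (word, tag) sentences; a stand-in
# for a large training corpus such as the Penn Treebank.
corpus = [
    [("the", "DET"), ("book", "NOUN"), ("sells", "VERB")],
    [("i", "PRON"), ("book", "VERB"), ("a", "DET"), ("flight", "NOUN")],
    [("the", "DET"), ("book", "NOUN"), ("is", "VERB"), ("good", "ADJ")],
]

word_tag = defaultdict(Counter)    # how often each word carries each tag
tag_bigram = defaultdict(Counter)  # how often each tag follows another
for sent in corpus:
    prev = "<s>"                   # sentence-start pseudo-tag
    for word, tag in sent:
        word_tag[word][tag] += 1
        tag_bigram[prev][tag] += 1
        prev = tag

def unigram_tag(word):
    """Word-frequency approach: always pick the word's most frequent tag."""
    return word_tag[word].most_common(1)[0][0]

def bigram_tag(word, prev_tag):
    """Bigram approach: score candidates by P(tag|word) * P(tag|prev_tag)."""
    best, best_p = None, 0.0
    n_word = sum(word_tag[word].values())
    n_prev = sum(tag_bigram[prev_tag].values())
    for tag, count in word_tag[word].items():
        p = (count / n_word) * (tag_bigram[prev_tag][tag] / n_prev)
        if p > best_p:
            best, best_p = tag, p
    return best

# "book" occurs twice as NOUN and once as VERB, so the word-frequency
# tagger always answers NOUN, even right after a pronoun, where the
# bigram tagger correctly prefers VERB.
print(unigram_tag("book"))         # NOUN
print(bigram_tag("book", "PRON"))  # VERB
```

Note the design trade-off this exposes: the unigram tagger needs only per-word counts, while the bigram tagger additionally needs tag-transition counts, which is exactly the contextual knowledge the word-frequency approach lacks.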
The defining properties of stochastic taggers are rooted in their statistical nature.
- Relies on the probability of a tag’s occurrence, both individually and in sequence.
- Requires a large, annotated training corpus as its primary source of knowledge.
- Chooses the most frequent or probable tag for a word based on patterns observed in the training data.
- Suffers from the “unknown word” or “out-of-vocabulary (OOV)” problem, as it has no probability data for words not present in the corpus.
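The OOV problem in the last point can be illustrated with a minimal sketch. The lexicon, tag names, and suffix heuristic below are hypothetical; real taggers typically back off to morphological cues or an open-class default when no statistics exist for a word.

```python
# Hypothetical lexicon learned from training data: statistics exist only
# for words seen in the corpus.
seen = {"book": "NOUN", "the": "DET", "sells": "VERB"}

def tag_with_fallback(word, default="NOUN"):
    """Tag a word, falling back on heuristics when it is out-of-vocabulary."""
    if word in seen:
        return seen[word]         # training statistics available
    if word.endswith("ing"):      # crude morphological heuristic
        return "VERB"
    return default                # open-class guess for unknown words

print(tag_with_fallback("book"))     # NOUN (seen in training)
print(tag_with_fallback("running"))  # VERB (suffix heuristic)
print(tag_with_fallback("xyzzy"))    # NOUN (default fallback)
```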
The next methodology we will examine, Transformation-Based Tagging, offers a hybrid model inspired by both the rule-based logic and the data-driven learning of stochastic systems.