5.0 Hidden Markov Model (HMM) for POS Tagging
Hidden Markov Model (HMM) based POS tagging is a purely stochastic method that treats language as a sequence-generation process. The source describes it as a doubly-embedded stochastic model, highlighting its two probabilistic layers: a hidden process of state transitions and an observable process of symbol emissions. Its strategic importance lies in modeling language as a sequence of hidden states (the POS tags) that generate observable outputs (the words). This framework became a workhorse for many NLP tasks, providing a principled way to determine the most probable tag sequence for an entire sentence.
The core components of an HMM, when applied to POS tagging, are:
- Hidden States: The Part-of-Speech tags (e.g., NOUN, VERB, ADJECTIVE) are treated as the hidden states of the model, as they are not directly observed in raw text.
- Observable Outputs: The words in the sentence are the observable outputs, which are considered to have been “produced” or “emitted” by the hidden tag states.
The central objective of an HMM tagger is to find the sequence of tags (C) that is most likely to have generated a given sequence of words (W). This is framed as a task of maximizing the conditional probability P(C|W). By Bayes' rule, P(C|W) = P(W|C) P(C) / P(W); since P(W) is fixed for a given sentence, this is equivalent to maximizing the product P(W|C) P(C), whose two factors the assumptions below make tractable.
To make this calculation computationally feasible, the HMM approach relies on two critical simplifying assumptions:
- The Markov Assumption (Tag Sequence Probability): This assumption states that the probability of any given tag depends only on the immediately preceding tag (in a bigram model). This simplifies the complex probability of an entire tag sequence into a product of individual transition probabilities between adjacent tags: PROB(Ci|Ci-1).
- The Independence Assumption (Word Emission Probability): This assumption states that the probability of a word appearing depends only on its own POS tag, independent of any surrounding words or other tags. This simplifies the relationship between words and tags into a set of emission probabilities: PROB(Wi|Ci).
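Under these two assumptions, the joint probability of a word sequence and a tag sequence factors into a product of transition and emission terms. The following is a minimal sketch of that scoring computation; the tag set, the start marker `<s>`, and all probability values in the tables are made-up illustrations, not figures from any real corpus.

```python
# Transition probabilities PROB(Ci | Ci-1); "<s>" marks the sentence start.
# All values below are invented for illustration.
trans = {
    ("<s>", "DET"): 0.6, ("<s>", "NOUN"): 0.4,
    ("DET", "NOUN"): 0.9, ("DET", "VERB"): 0.1,
    ("NOUN", "VERB"): 0.7, ("NOUN", "NOUN"): 0.3,
}

# Emission probabilities PROB(Wi | Ci), also invented.
emit = {
    ("DET", "the"): 0.5,
    ("NOUN", "dog"): 0.1, ("NOUN", "barks"): 0.01,
    ("VERB", "barks"): 0.2,
}

def sequence_probability(words, tags):
    """P(W, C) = product over i of PROB(Ci | Ci-1) * PROB(Wi | Ci)."""
    prob = 1.0
    prev = "<s>"
    for word, tag in zip(words, tags):
        # Unseen transitions/emissions default to probability 0.
        prob *= trans.get((prev, tag), 0.0) * emit.get((tag, word), 0.0)
        prev = tag
    return prob

words = ["the", "dog", "barks"]
print(sequence_probability(words, ["DET", "NOUN", "VERB"]))  # ≈ 0.00378
print(sequence_probability(words, ["DET", "NOUN", "NOUN"]))  # much smaller
```

A full tagger would compare such scores across all candidate tag sequences and return the highest-scoring one, which is what makes the factorization computationally valuable.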
These probabilities are estimated from a large tagged corpus using maximum likelihood estimation. The transition probability is the count of a specific tag bigram divided by the count of the preceding tag, and the emission probability is the count of a word-tag pair divided by the count of the tag.
- The transition probability PROB(Ci|Ci-1) is calculated as: Count(Ci-1, Ci) / Count(Ci-1)
- The emission probability PROB(Wi|Ci) is calculated as: Count(Wi, Ci) / Count(Ci)
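These count-based estimates can be sketched directly in code. The tiny three-sentence corpus below is a made-up illustration; a real tagger would use a large annotated corpus and would also need smoothing for unseen events, which this sketch omits.

```python
from collections import Counter

# A toy tagged corpus: each sentence is a list of (word, tag) pairs.
corpus = [
    [("the", "DET"), ("dog", "NOUN"), ("barks", "VERB")],
    [("the", "DET"), ("dog", "NOUN"), ("runs", "VERB")],
    [("dogs", "NOUN"), ("bark", "VERB")],
]

tag_counts = Counter()       # Count(Ci), plus the start marker "<s>"
bigram_counts = Counter()    # Count(Ci-1, Ci)
word_tag_counts = Counter()  # Count(Wi, Ci)

for sentence in corpus:
    prev = "<s>"
    tag_counts["<s>"] += 1
    for word, tag in sentence:
        tag_counts[tag] += 1
        bigram_counts[(prev, tag)] += 1
        word_tag_counts[(word, tag)] += 1
        prev = tag

def transition(prev_tag, tag):
    """PROB(Ci | Ci-1) = Count(Ci-1, Ci) / Count(Ci-1)."""
    return bigram_counts[(prev_tag, tag)] / tag_counts[prev_tag]

def emission(word, tag):
    """PROB(Wi | Ci) = Count(Wi, Ci) / Count(Ci)."""
    return word_tag_counts[(word, tag)] / tag_counts[tag]

print(transition("DET", "NOUN"))  # 2/2 = 1.0: DET is always followed by NOUN here
print(emission("dog", "NOUN"))    # 2/3 ≈ 0.667: "dog" is 2 of the 3 NOUN tokens
```

The start marker `<s>` lets the same bigram machinery estimate sentence-initial tag probabilities, e.g. PROB(DET | &lt;s&gt;) = 2/3 in this toy corpus.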
With these components defined, we can now proceed to a direct, side-by-side comparison of all four methodologies.