2.0 Rule-Based POS Tagging
Rule-based POS tagging is one of the earliest approaches to the task and is knowledge-driven: formal linguistic expertise is manually encoded as a set of explicit rules. The approach is characteristic of “Good Old-Fashioned AI” (GOFAI), the paradigm in which intelligence was believed to emerge from the symbolic manipulation of expert knowledge rather than from statistical learning.
The architecture of a typical rule-based tagger operates in two distinct stages:
- First Stage: This initial stage uses a dictionary or lexicon to retrieve a list of all potential POS tags for each word in a sentence. At this point, words that can belong to multiple grammatical categories (e.g., “book” as a noun or a verb) are marked as ambiguous.
- Second Stage: Following the initial lookup, the system applies a large set of hand-written disambiguation rules to resolve ambiguities. These rules analyze the surrounding context, such as the preceding or following words, to select a single tag for each word. A simple rule, for example, might state that if a word that could be either a noun or a verb is immediately preceded by an article, the verb reading is discarded and the word is tagged as a noun.
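The two stages above can be sketched in a few lines of Python. This is a minimal illustration, not a production tagger: the lexicon, the tag names (DET, NOUN, VERB, PRON, UNK), and both rules are hypothetical and chosen only to show the lookup-then-disambiguate flow.

```python
# Hypothetical toy lexicon: each word maps to every tag it could take.
LEXICON = {
    "the": {"DET"},
    "a": {"DET"},
    "i": {"PRON"},
    "book": {"NOUN", "VERB"},  # ambiguous entry
    "flight": {"NOUN"},
    "read": {"VERB"},
}


def lookup(tokens):
    """Stage 1: retrieve the set of candidate tags for each token."""
    return [set(LEXICON.get(tok.lower(), {"UNK"})) for tok in tokens]


def drop_verb_after_determiner(candidates, i):
    """Illustrative rule: a noun/verb-ambiguous word right after a determiner is not a verb."""
    if i > 0 and candidates[i - 1] == {"DET"} and candidates[i] == {"NOUN", "VERB"}:
        candidates[i].discard("VERB")


def drop_noun_before_determiner(candidates, i):
    """Illustrative rule: a noun/verb-ambiguous word right before a determiner is not a noun."""
    if (i + 1 < len(candidates) and candidates[i + 1] == {"DET"}
            and candidates[i] == {"NOUN", "VERB"}):
        candidates[i].discard("NOUN")


# Rules are applied in this order at every position.
RULES = [drop_verb_after_determiner, drop_noun_before_determiner]


def tag(tokens):
    """Stage 2: apply each hand-written rule at each position to prune candidates."""
    candidates = lookup(tokens)
    for i in range(len(tokens)):
        for rule in RULES:
            rule(candidates, i)
    # Words that remain ambiguous keep all readings, joined with "|".
    return [(tok, "|".join(sorted(c))) for tok, c in zip(tokens, candidates)]


print(tag(["I", "read", "the", "book"]))
print(tag(["Book", "a", "flight"]))
```

In this sketch “book” comes out as a noun in the first sentence and as a verb in the second, because a different context rule fires in each case. A real system would need many more rules, and their ordering and interactions are part of what makes maintenance costly.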
Three core properties characterize this methodology, all stemming from its reliance on human expertise and explicit instruction:
- Knowledge-Driven: These taggers are built upon a foundation of manually compiled linguistic knowledge rather than statistical patterns learned from data.
- Manual Rule Creation: The disambiguation rules, which can number on the order of 1,000 for a robust system, are written and maintained by hand by linguists or system developers.
- Explicit Modeling: All linguistic generalizations, such as how to handle unknown words or prefer certain tag sequences, are explicitly coded into the rules rather than learned implicitly from data.
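As an illustration of explicit modeling, unknown-word handling in a rule-based system might be written as an ordered list of surface-pattern rules. The sketch below is an assumption made for illustration: the patterns, the tag names, and the NOUN fallback are hypothetical and not taken from any particular tagger.

```python
import re

# Hypothetical ordered rules for words missing from the lexicon; the first match wins.
UNKNOWN_WORD_RULES = [
    (re.compile(r"^\d+([.,]\d+)*$"), "NUM"),           # 42, 3.14, 1,000
    (re.compile(r"^.+ing$"), "VERB"),                  # e.g. a gerund or participle
    (re.compile(r"^.+ly$"), "ADV"),                    # typical adverb suffix
    (re.compile(r"^.+(ness|ment|tion)$"), "NOUN"),     # typical noun suffixes
    (re.compile(r"^[A-Z]"), "PROPN"),                  # capitalized word
]

# Assumed fallback when no rule fires.
DEFAULT_TAG = "NOUN"


def guess_tag(word):
    """Return the tag of the first matching rule, or the default tag."""
    for pattern, tag in UNKNOWN_WORD_RULES:
        if pattern.match(word):
            return tag
    return DEFAULT_TAG


for w in ["42", "blorping", "quickishly", "Zorblax", "glorp"]:
    print(w, guess_tag(w))
```

Every generalization here, including the priority of one suffix over another, is a decision a developer wrote down by hand; nothing is estimated from a corpus.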
The primary advantage of this approach is its ability to leverage deep, nuanced linguistic insight, which can yield high precision when the rule set is comprehensive. Its main disadvantage is the significant and continuous manual effort required to create, refine, and maintain the large volume of rules. Because the rules are often domain-specific, the system is brittle: it is difficult to scale and to adapt to new domains (e.g., from financial news to medical records) without rewriting or extending the rule set by hand.
We now turn from this knowledge-driven paradigm to the data-driven principles that underpin Stochastic POS Tagging.