Apache OpenNLP

7.0 Module 7: Practical Implementation – Part-of-Speech (POS) Tagging

7.1 Introduction: Uncovering the Grammatical Structure of Text

Part-of-Speech (POS) tagging is the process of assigning a grammatical category—such as noun, verb, adjective, or adverb—to each word (token) in a sentence. This task moves beyond simple word identification to uncover the underlying grammatical structure of the text. POS tagging is a crucial prerequisite for more advanced syntactic analysis, including full parsing and chunking, as it provides the fundamental labels upon which these higher-level structures are built. Understanding the POS tags of words is essential for disambiguating their meaning and function within a sentence.

7.2 A Glossary of Common POS Tags in OpenNLP

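The English POS model used in this module labels words with abbreviations from the Penn Treebank tag set. The tag set is a property of the data a model was trained on, so models trained on other corpora may emit different tags. The table below lists some of the most common tags you will encounter.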

Tag	Meaning
NN	Noun, singular or mass (e.g., “dog”, “music”)
DT	Determiner (e.g., “the”, “a”, “some”)
VB	Verb, base form (e.g., “run”, “eat”)
VBD	Verb, past tense (e.g., “ran”, “ate”)
VBZ	Verb, third person singular present (e.g., “runs”, “eats”)
IN	Preposition or subordinating conjunction (e.g., “in”, “on”, “if”)
NNP	Proper noun, singular (e.g., “John”, “London”)
TO	“to” as a preposition or infinitive marker (e.g., “to the store”, “to run”)
JJ	Adjective (e.g., “big”, “red”)
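
For quick lookups while reading tagger output, the table above can be captured in a small map. This is only a minimal sketch: the class name TagGlossary and its describe method are hypothetical helpers, not part of OpenNLP, and the map covers only the tags listed in this table.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper (not part of OpenNLP): maps the tags from the table above to their meanings.
class TagGlossary {

    private static final Map<String, String> MEANINGS = new LinkedHashMap<>();

    static {
        MEANINGS.put("NN", "Noun, singular or mass");
        MEANINGS.put("DT", "Determiner");
        MEANINGS.put("VB", "Verb, base form");
        MEANINGS.put("VBD", "Verb, past tense");
        MEANINGS.put("VBZ", "Verb, third person singular present");
        MEANINGS.put("IN", "Preposition or subordinating conjunction");
        MEANINGS.put("NNP", "Proper noun, singular");
        MEANINGS.put("TO", "\"to\" as a preposition or infinitive marker");
        MEANINGS.put("JJ", "Adjective");
    }

    // Returns the meaning of a tag, or a placeholder for tags outside this short glossary.
    static String describe(String tag) {
        return MEANINGS.getOrDefault(tag, "(not in this glossary)");
    }

    public static void main(String[] args) {
        for (String tag : new String[] {"NNP", "JJ", "TO", "VB", "RB"}) {
            System.out.println(tag + " -> " + describe(tag));
        }
    }
}
```

The full Penn Treebank set contains more tags than this table, so the placeholder branch will be reached on real output.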

7.3 The OpenNLP POS Tagging Workflow

Like other core tasks in OpenNLP, POS tagging is a model-driven process. It relies on the en-pos-maxent.bin model, which has been trained to predict the most likely part of speech for a word based on its context within a sentence.

The end-to-end workflow for POS tagging involves several steps:

Load the POSModel from the en-pos-maxent.bin file.
Instantiate the POSTaggerME class using the loaded model.
Tokenize the input sentence into an array of String tokens.
Pass the token array to the tag() method of the POSTaggerME instance to generate an array of POS tags.
(Optional) Use the POSSample class to conveniently display the tokens and their corresponding tags together.

7.4 Practical Implementation: A Step-by-Step Code Walkthrough

The following Java program demonstrates the complete POS tagging workflow.

import java.io.FileInputStream;
import java.io.InputStream;

import opennlp.tools.postag.POSModel;
import opennlp.tools.postag.POSSample;
import opennlp.tools.postag.POSTaggerME;
import opennlp.tools.tokenize.WhitespaceTokenizer;

public class PosTaggerExample {

    public static void main(String[] args) throws Exception {

        // Load the maxent POS model; try-with-resources closes the stream afterwards
        POSModel model;
        try (InputStream inputStream = new FileInputStream("C:/OpenNLP_models/en-pos-maxent.bin")) {
            model = new POSModel(inputStream);
        }

        // Instantiate the POSTaggerME class
        POSTaggerME tagger = new POSTaggerME(model);

        String sentence = "Hi welcome to Tutorialspoint";

        // Tokenize the sentence using the WhitespaceTokenizer class
        WhitespaceTokenizer whitespaceTokenizer = WhitespaceTokenizer.INSTANCE;
        String[] tokens = whitespaceTokenizer.tokenize(sentence);

        // Generate one tag per token
        String[] tags = tagger.tag(tokens);

        // Combine tokens and tags for display
        POSSample sample = new POSSample(tokens, tags);
        System.out.println(sample.toString());
    }
}

Code Deconstruction:

Model Loading: The en-pos-maxent.bin model is loaded into a POSModel object.
Tagger Instantiation: An instance of POSTaggerME is created from the loaded model.
Tokenization: The input sentence is tokenized using the WhitespaceTokenizer. The tag() method requires a token array as input.
Tag Generation: The tagger.tag(tokens) method is called, which processes the tokens array and returns a tags array of the same length, with each element containing the POS tag for the corresponding token.
Displaying Results: A POSSample object is created by combining the tokens and tags arrays. Its toString() method produces a standard output format where each token is followed by its tag, separated by an underscore.
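
The tokenization step matters more than it may seem, because the tagger only sees the tokens it is given. The sketch below approximates whitespace tokenization in plain Java to show how punctuation stays attached to words. It is an illustration under that assumption, not OpenNLP's implementation, and the class and method names are hypothetical.

```java
import java.util.Arrays;

// Plain-Java approximation of whitespace tokenization (illustrative only; not OpenNLP's code).
class WhitespaceSplitSketch {

    // Splits on runs of whitespace; returns an empty array for blank input.
    static String[] splitOnWhitespace(String sentence) {
        String trimmed = sentence.trim();
        if (trimmed.isEmpty()) {
            return new String[0];
        }
        return trimmed.split("\\s+");
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(splitOnWhitespace("Hi welcome to Tutorialspoint")));
        // Punctuation is not separated: "Hello," and "world." each remain a single token.
        System.out.println(Arrays.toString(splitOnWhitespace("Hello, world.")));
    }
}
```

For text with punctuation, a tokenizer that separates punctuation marks, such as OpenNLP's SimpleTokenizer or a trained TokenizerME, is likely to give the tagger cleaner input.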

Output and Analysis:

Hi_NNP welcome_JJ to_TO Tutorialspoint_VB

The to_TO tag is clearly correct, but the rest of the output deserves scrutiny. “Hi” is an interjection, for which the Penn Treebank tag is UH, yet the model labels it a proper noun (NNP). The tagging of “Tutorialspoint” as a base-form verb (VB) is more peculiar still, since it is a proper name. This is an excellent example of the probabilistic nature of statistical models. Given the limited context of this short sentence and the specific data the model was trained on, it has made unexpected, and likely incorrect, predictions. This serves as a valuable reminder that these tools are not infallible and that their output requires critical evaluation, especially on domain-specific or out-of-vocabulary terms.
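
To inspect tagged output programmatically, the token_tag format shown above can be split back into pairs. The following is a minimal sketch in plain Java; the class TaggedOutputParser is a hypothetical helper, and splitting on the last underscore is a design choice made here so that tokens which themselves contain underscores are kept whole.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper (not part of OpenNLP): parses "token_tag token_tag ..." lines.
class TaggedOutputParser {

    // Returns one {token, tag} pair per item, splitting each item on its LAST underscore.
    static List<String[]> parse(String taggedSentence) {
        List<String[]> pairs = new ArrayList<>();
        for (String item : taggedSentence.trim().split("\\s+")) {
            int cut = item.lastIndexOf('_');
            if (cut <= 0) {
                continue; // skip items that do not have the token_tag shape
            }
            pairs.add(new String[] {item.substring(0, cut), item.substring(cut + 1)});
        }
        return pairs;
    }

    public static void main(String[] args) {
        for (String[] pair : parse("Hi_NNP welcome_JJ to_TO Tutorialspoint_VB")) {
            System.out.println(pair[0] + "\t" + pair[1]);
        }
    }
}
```

When the tokens and tags arrays are still available in the program, reading them directly is simpler; parsing is mainly useful for output that has already been written to a file or log.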

7.5 Advanced Topics: Performance Monitoring and Tag Probabilities

OpenNLP provides utilities for measuring the performance of its components and for examining the confidence of its statistical predictions.

Performance Monitoring

The PerformanceMonitor class can be used to benchmark the speed of an operation. It is started before the operation, its counter is incremented for each item processed, and then it is stopped to print the final results.

import java.io.FileInputStream;
import java.io.InputStream;

import opennlp.tools.cmdline.PerformanceMonitor;
import opennlp.tools.postag.POSModel;
import opennlp.tools.postag.POSSample;
import opennlp.tools.postag.POSTaggerME;
import opennlp.tools.tokenize.WhitespaceTokenizer;

public class PosTagger_Performance {

    public static void main(String[] args) throws Exception {

        // Load the maxent POS model; try-with-resources closes the stream afterwards
        POSModel model;
        try (InputStream inputStream = new FileInputStream("C:/OpenNLP_models/en-pos-maxent.bin")) {
            model = new POSModel(inputStream);
        }

        WhitespaceTokenizer whitespaceTokenizer = WhitespaceTokenizer.INSTANCE;
        String sentence = "Hi welcome to Tutorialspoint";
        String[] tokens = whitespaceTokenizer.tokenize(sentence);

        POSTaggerME tagger = new POSTaggerME(model);

        // Start the monitor before the operation being measured
        PerformanceMonitor perfMon = new PerformanceMonitor(System.err, "sent");
        perfMon.start();

        String[] tags = tagger.tag(tokens);

        // One sentence has been processed
        perfMon.incrementCounter();

        // Stop the monitor and print the results to System.err
        perfMon.stopAndPrintFinalResult();

        POSSample sample = new POSSample(tokens, tags);
        System.out.println(sample.toString());
    }
}

This code will output performance metrics such as processing speed (items per second) and total runtime.
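
The start, increment, stop pattern described above can be illustrated without the library. The sketch below is a hypothetical stand-in written in plain Java, not OpenNLP's PerformanceMonitor, and its output format is an assumption. It shows why the monitor has to be started before the work and incremented once per processed item.

```java
import java.util.Locale;

// Hypothetical stand-in for the start / increment / stop pattern (not OpenNLP's PerformanceMonitor).
class ThroughputSketch {

    private long startNanos = -1;
    private long count = 0;

    void start() {
        startNanos = System.nanoTime();
    }

    void incrementCounter() {
        count++;
    }

    long count() {
        return count;
    }

    // Items per second for a given item count and elapsed time in nanoseconds.
    static double itemsPerSecond(long items, long elapsedNanos) {
        if (elapsedNanos <= 0) {
            return 0.0;
        }
        return items / (elapsedNanos / 1_000_000_000.0);
    }

    // Stops timing and returns a one-line summary such as "2 sent in 0.001 s (... sent/s)".
    String stopAndSummarize(String unit) {
        if (startNanos < 0) {
            throw new IllegalStateException("start() was not called");
        }
        long elapsed = System.nanoTime() - startNanos;
        return String.format(Locale.ROOT, "%d %s in %.3f s (%.1f %s/s)",
                count, unit, elapsed / 1e9, itemsPerSecond(count, elapsed), unit);
    }

    public static void main(String[] args) {
        ThroughputSketch monitor = new ThroughputSketch();
        monitor.start(); // begin timing BEFORE the work
        for (String sentence : new String[] {"Hi welcome to Tutorialspoint", "The dog runs"}) {
            sentence.split("\\s+"); // placeholder for the real work, e.g. tagger.tag(tokens)
            monitor.incrementCounter(); // one item per processed sentence
        }
        System.out.println(monitor.stopAndSummarize("sent"));
    }
}
```

A single sentence is too little work for a meaningful measurement; throughput figures become informative only when the loop covers a large batch of sentences.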

Tag Probabilities

The probs() method of the POSTaggerME class returns the confidence scores for the tags of the most recently tagged sentence, so it must be called after tag().

import java.io.FileInputStream;
import java.io.InputStream;

import opennlp.tools.postag.POSModel;
import opennlp.tools.postag.POSSample;
import opennlp.tools.postag.POSTaggerME;
import opennlp.tools.tokenize.WhitespaceTokenizer;

public class PosTaggerProbs {

    public static void main(String[] args) throws Exception {

        // Load the maxent POS model; try-with-resources closes the stream afterwards
        POSModel model;
        try (InputStream inputStream = new FileInputStream("C:/OpenNLP_models/en-pos-maxent.bin")) {
            model = new POSModel(inputStream);
        }

        // Tokenize the sentence using the WhitespaceTokenizer class
        WhitespaceTokenizer whitespaceTokenizer = WhitespaceTokenizer.INSTANCE;
        String sentence = "Hi welcome to Tutorialspoint";
        String[] tokens = whitespaceTokenizer.tokenize(sentence);

        // Instantiate the POSTaggerME class
        POSTaggerME tagger = new POSTaggerME(model);

        // Generate one tag per token
        String[] tags = tagger.tag(tokens);

        // Combine tokens and tags for display
        POSSample sample = new POSSample(tokens, tags);
        System.out.println(sample.toString());

        // Probabilities for each tag of the last tagged sentence
        double[] probs = tagger.probs();
        System.out.println();

        // Print one probability per token, in token order
        for (int i = 0; i < probs.length; i++) {
            System.out.println(probs[i]);
        }
    }
}

Output Probabilities:

0.6416834779738033
0.42983612874819177
0.8584513635863117
0.4394784478206072

Each value in the output corresponds to the probability of the tag assigned to the token at the same index. For example, the model was about 85.8% confident in assigning the TO tag to “to”, but only about 43.9% confident in its VB tag for “Tutorialspoint” and about 43.0% confident in its JJ tag for “welcome”. The low score for VB supports the earlier observation that this prediction is doubtful.
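
Because the tokens, tags, and probabilities share the same indices, low-confidence assignments can be flagged for review with a simple loop. The sketch below uses the values printed above. The class name LowConfidenceTags and the 0.5 threshold are assumptions chosen for illustration, not recommendations from OpenNLP.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper (not part of OpenNLP): lists token_tag pairs whose probability is below a threshold.
class LowConfidenceTags {

    static List<String> flag(String[] tokens, String[] tags, double[] probs, double threshold) {
        if (tokens.length != tags.length || tags.length != probs.length) {
            throw new IllegalArgumentException("tokens, tags and probs must have the same length");
        }
        List<String> flagged = new ArrayList<>();
        for (int i = 0; i < probs.length; i++) {
            if (probs[i] < threshold) {
                flagged.add(tokens[i] + "_" + tags[i]);
            }
        }
        return flagged;
    }

    public static void main(String[] args) {
        // Values copied from the example output in this section.
        String[] tokens = {"Hi", "welcome", "to", "Tutorialspoint"};
        String[] tags = {"NNP", "JJ", "TO", "VB"};
        double[] probs = {0.6416834779738033, 0.42983612874819177, 0.8584513635863117, 0.4394784478206072};

        // The 0.5 threshold is an arbitrary choice for illustration.
        System.out.println(flag(tokens, tags, probs, 0.5)); // prints [welcome_JJ, Tutorialspoint_VB]
    }
}
```

In a real program, the three arrays would come from the tokenizer, tagger.tag(tokens), and tagger.probs() respectively, and a suitable threshold would depend on the model and the application.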

Now that we have successfully identified tokens and assigned their parts of speech, we are equipped to move to a higher level of syntactic analysis: full sentence parsing.
