9.0 Module 9: Practical Implementation – Chunking
9.1 Introduction: Identifying Syntactic Groups
Chunking, also known as shallow parsing, is an NLP task that aims to identify and label non-recursive phrases in a sentence. Instead of building a full, hierarchical parse tree, chunking segments a sentence into its primary constituent groups, such as Noun Phrases (NP), Verb Phrases (VP), and Prepositional Phrases (PP). The key distinction from full parsing is that chunking does not capture the internal structure of these phrases or the grammatical relationships between them. It is a simpler, faster alternative that is often sufficient for information extraction tasks where identifying key phrases is more important than understanding their complex interdependencies.
9.2 The OpenNLP Chunking Workflow: A Multi-Step Process
A crucial point to understand about chunking in OpenNLP is that it is a composite task. The chunker does not operate on raw text directly; instead, it requires the output of previous NLP steps. Specifically, the input to the chunker is a sentence that has already been tokenized and Part-of-Speech tagged.
The full end-to-end process for chunking a sentence is as follows:
- Tokenize the input sentence into an array of String tokens.
- Generate POS tags for the token array, resulting in a corresponding array of String tags.
- Load the ChunkerModel from the en-chunker.bin file.
- Instantiate the ChunkerME class with the loaded model.
- Call the chunk() method, passing both the token array and the tag array as arguments to generate the final chunk tags.
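Because chunk() consumes two parallel arrays, a common failure mode is passing arrays of different lengths (for example, after re-tokenizing with a different tokenizer). The tiny guard below is a hypothetical helper, not part of the OpenNLP API, that makes the one-tag-per-token contract explicit before the chunker is called:

```java
public class ChunkInputCheck {
    /**
     * Hypothetical guard: ChunkerME.chunk(tokens, tags) expects one POS tag
     * per token, so verify the two parallel arrays line up before calling it.
     */
    static void requireParallel(String[] tokens, String[] tags) {
        if (tokens.length != tags.length) {
            throw new IllegalArgumentException(
                "Expected one POS tag per token: " + tokens.length
                + " tokens vs " + tags.length + " tags");
        }
    }

    public static void main(String[] args) {
        String[] tokens = {"Hi", "welcome", "to", "Tutorialspoint"};
        String[] tags = {"NNP", "JJ", "TO", "VB"};
        requireParallel(tokens, tags); // passes: both arrays have 4 entries
        System.out.println("Inputs aligned: " + tokens.length + " tokens");
    }
}
```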
9.3 Practical Implementation: Generating Chunks from a Sentence
The following Java program demonstrates the complete workflow for chunking a sentence.
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import opennlp.tools.chunker.ChunkerME;
import opennlp.tools.chunker.ChunkerModel;
import opennlp.tools.cmdline.postag.POSModelLoader;
import opennlp.tools.postag.POSModel;
import opennlp.tools.postag.POSTaggerME;
import opennlp.tools.tokenize.WhitespaceTokenizer;
public class ChunkerExample {

    public static void main(String[] args) throws IOException {
        // Tokenizing the sentence
        String sentence = "Hi welcome to Tutorialspoint";
        WhitespaceTokenizer whitespaceTokenizer = WhitespaceTokenizer.INSTANCE;
        String[] tokens = whitespaceTokenizer.tokenize(sentence);

        // Generating the POS tags
        File file = new File("C:/OpenNLP_models/en-pos-maxent.bin");
        POSModel model = new POSModelLoader().load(file);
        POSTaggerME tagger = new POSTaggerME(model);
        String[] tags = tagger.tag(tokens);

        // Loading the chunker model
        InputStream inputStream = new FileInputStream("C:/OpenNLP_models/en-chunker.bin");
        ChunkerModel chunkerModel = new ChunkerModel(inputStream);

        // Instantiating the ChunkerME class
        ChunkerME chunkerME = new ChunkerME(chunkerModel);

        // Generating the chunks
        String[] result = chunkerME.chunk(tokens, tags);
        for (String s : result)
            System.out.println(s);
    }
}
Professor’s Note: You may notice that the POS model in this example is loaded using new POSModelLoader().load(file), which differs from the new POSModel(inputStream) constructor we used in the POS tagging module. The POSModelLoader is a utility class often seen in the command-line tools. While both methods achieve the same goal of loading the model into memory, directly using the InputStream constructor is generally more common in application development.
Code Deconstruction and Output Analysis:
- The code first performs the prerequisite steps: the sentence is tokenized, and then the POS tags for those tokens are generated. The POS tagger produces: Hi_NNP welcome_JJ to_TO Tutorialspoint_VB.
- Next, the en-chunker.bin model is loaded, and the chunkerME.chunk(tokens, tags) method is called.
- The output uses a standard notation known as B-I-O (Beginning, Inside, Outside). B-NP marks a token that is at the beginning of a Noun Phrase. I-NP marks a token that is inside the same Noun Phrase. An O tag (not present in this example) would mark a token that is outside of any chunk.
Output:
B-NP
I-NP
B-VP
I-VP
This output should be read in parallel with the input tokens and their POS tags. "Hi" (NNP) begins a Noun Phrase (B-NP), and "welcome" (JJ) is inside that same phrase (I-NP). Then, "to" (TO) begins a Verb Phrase (B-VP), and "Tutorialspoint" (VB) is inside it (I-VP). This clearly connects the chunking decision to the results of the POS tagging stage.
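Reading three parallel arrays by eye is error-prone. The small formatting loop below (plain Java, no OpenNLP calls; the align helper is illustrative, not part of the library) lines up token, POS tag, and chunk tag per position, using the values from the example above:

```java
import java.util.ArrayList;
import java.util.List;

public class ChunkAlignment {
    /** Joins each token with its POS tag and chunk tag, one entry per token. */
    static List<String> align(String[] tokens, String[] tags, String[] chunks) {
        List<String> rows = new ArrayList<>();
        for (int i = 0; i < tokens.length; i++) {
            rows.add(tokens[i] + "/" + tags[i] + "/" + chunks[i]);
        }
        return rows;
    }

    public static void main(String[] args) {
        // Values taken from the example above.
        String[] tokens = {"Hi", "welcome", "to", "Tutorialspoint"};
        String[] tags = {"NNP", "JJ", "TO", "VB"};
        String[] chunks = {"B-NP", "I-NP", "B-VP", "I-VP"};
        for (String row : align(tokens, tags, chunks)) {
            System.out.println(row); // e.g. Hi/NNP/B-NP
        }
    }
}
```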
9.4 Advanced Analysis: Chunk Spans and Confidence Scores
As with other OpenNLP components, the chunker provides methods for retrieving positional data and probability scores.
Retrieving Chunk Spans
The chunkAsSpans() method offers an alternative to the B-I-O tags. It returns an array of Span objects, where each span directly represents a complete chunk, including its type (e.g., NP, VP). This is often a more convenient format to work with programmatically.
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import opennlp.tools.chunker.ChunkerME;
import opennlp.tools.chunker.ChunkerModel;
import opennlp.tools.cmdline.postag.POSModelLoader;
import opennlp.tools.postag.POSModel;
import opennlp.tools.postag.POSTaggerME;
import opennlp.tools.tokenize.WhitespaceTokenizer;
import opennlp.tools.util.Span;
public class ChunkerSpansExample {

    public static void main(String[] args) throws IOException {
        String sentence = "Hi welcome to Tutorialspoint";
        WhitespaceTokenizer whitespaceTokenizer = WhitespaceTokenizer.INSTANCE;
        String[] tokens = whitespaceTokenizer.tokenize(sentence);

        File file = new File("C:/OpenNLP_models/en-pos-maxent.bin");
        POSModel model = new POSModelLoader().load(file);
        POSTaggerME tagger = new POSTaggerME(model);
        String[] tags = tagger.tag(tokens);

        InputStream inputStream = new FileInputStream("C:/OpenNLP_models/en-chunker.bin");
        ChunkerModel chunkerModel = new ChunkerModel(inputStream);
        ChunkerME chunkerME = new ChunkerME(chunkerModel);

        // Generating the tagged chunk spans
        Span[] spans = chunkerME.chunkAsSpans(tokens, tags);
        for (Span s : spans)
            System.out.println(s.toString());
    }
}
Output and Analysis:
[0..2) NP
[2..4) VP
This output clearly identifies two chunks. The first is a Noun Phrase (NP) spanning from token index 0 to 2 (exclusive), covering “Hi” and “welcome”. The second is a Verb Phrase (VP) spanning from token index 2 to 4 (exclusive), covering “to” and “Tutorialspoint”.
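A Span's start and end indices can be mapped back to the covered tokens to recover the chunk text. The helper below uses plain ints rather than the Span class so it stands alone; in real code the indices would come from Span's getStart() and getEnd() accessors:

```java
public class SpanText {
    /** Joins the tokens covered by a [start..end) span into a phrase. */
    static String covered(String[] tokens, int start, int end) {
        StringBuilder sb = new StringBuilder();
        for (int i = start; i < end; i++) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(tokens[i]);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String[] tokens = {"Hi", "welcome", "to", "Tutorialspoint"};
        // Spans reported above: [0..2) NP and [2..4) VP
        System.out.println("NP: " + covered(tokens, 0, 2)); // Hi welcome
        System.out.println("VP: " + covered(tokens, 2, 4)); // to Tutorialspoint
    }
}
```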
Chunking Probabilities
The probs() method of the ChunkerME class can be called after a chunk() operation to get the confidence scores for each assigned chunk tag.
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import opennlp.tools.chunker.ChunkerME;
import opennlp.tools.chunker.ChunkerModel;
import opennlp.tools.cmdline.postag.POSModelLoader;
import opennlp.tools.postag.POSModel;
import opennlp.tools.postag.POSTaggerME;
import opennlp.tools.tokenize.WhitespaceTokenizer;
public class ChunkerProbsExample {

    public static void main(String[] args) throws IOException {
        String sentence = "Hi welcome to Tutorialspoint";
        WhitespaceTokenizer whitespaceTokenizer = WhitespaceTokenizer.INSTANCE;
        String[] tokens = whitespaceTokenizer.tokenize(sentence);

        File file = new File("C:/OpenNLP_models/en-pos-maxent.bin");
        POSModel model = new POSModelLoader().load(file);
        POSTaggerME tagger = new POSTaggerME(model);
        String[] tags = tagger.tag(tokens);

        InputStream inputStream = new FileInputStream("C:/OpenNLP_models/en-chunker.bin");
        ChunkerModel cModel = new ChunkerModel(inputStream);
        ChunkerME chunkerME = new ChunkerME(cModel);

        // Generating the chunk tags
        chunkerME.chunk(tokens, tags);

        // Getting the probabilities of the last decoded sequence
        double[] probs = chunkerME.probs();
        for (int i = 0; i < probs.length; i++)
            System.out.println(probs[i]);
    }
}
Output:
0.9592746040797778
0.6883933131241501
0.8830563473996004
0.8951150529746051
These values represent the model’s confidence for each of the B-I-O tags assigned to the tokens “Hi”, “welcome”, “to”, and “Tutorialspoint”, respectively.
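One practical use of these scores is flagging low-confidence decisions for manual review. The sketch below is a hypothetical post-processing step, not part of the OpenNLP API, using rounded values from the output above:

```java
import java.util.ArrayList;
import java.util.List;

public class LowConfidenceChunks {
    /** Returns "token/chunkTag" entries whose score falls below the threshold. */
    static List<String> below(String[] tokens, String[] chunks,
                              double[] probs, double threshold) {
        List<String> flagged = new ArrayList<>();
        for (int i = 0; i < probs.length; i++) {
            if (probs[i] < threshold) {
                flagged.add(tokens[i] + "/" + chunks[i]);
            }
        }
        return flagged;
    }

    public static void main(String[] args) {
        // Scores and tags taken (rounded) from the example output above.
        String[] tokens = {"Hi", "welcome", "to", "Tutorialspoint"};
        String[] chunks = {"B-NP", "I-NP", "B-VP", "I-VP"};
        double[] probs = {0.959, 0.688, 0.883, 0.895};
        System.out.println(below(tokens, chunks, probs, 0.8)); // [welcome/I-NP]
    }
}
```

Here only the I-NP decision for "welcome" falls under the 0.8 threshold, matching the noticeably lower second score in the output above.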
Having covered the programmatic use of the OpenNLP API, we will now turn our attention to an alternative method of interaction: the powerful Command Line Interface.