Apache OpenNLP

5.0 Module 5: Practical Implementation – Tokenization

5.1 Introduction: From Sentences to Tokens

Tokenization is the process of breaking down a sequence of text, typically a sentence, into smaller units called tokens. These tokens are usually words, numbers, or punctuation marks. This task is a fundamental prerequisite for almost all subsequent NLP analysis. Before a machine can understand the grammatical structure or meaning of a sentence, it must first be able to identify its individual components. Tasks like Part-of-Speech tagging and Named Entity Recognition operate on a sequence of tokens, not a raw string of text, making tokenization a critical preprocessing step.

5.2 Rule-Based Tokenization Techniques in OpenNLP

OpenNLP provides simple, rule-based tokenizers that are fast and do not require loading a statistical model. They are suitable for many basic tokenization needs.

The SimpleTokenizer works by separating tokens based on character classes. For example, it will typically split letters from punctuation marks. The process involves getting a singleton instance of the class via SimpleTokenizer.INSTANCE and then calling its tokenize() method.

import opennlp.tools.tokenize.SimpleTokenizer;

public class SimpleTokenizerExample {
    public static void main(String[] args) {
        String sentence = "Hi. How are you? Welcome to Tutorialspoint. "
            + "We provide free tutorials on various technologies";

        // Obtain the shared SimpleTokenizer instance
        SimpleTokenizer simpleTokenizer = SimpleTokenizer.INSTANCE;

        // Tokenize the sentence
        String[] tokens = simpleTokenizer.tokenize(sentence);

        // Print one token per line
        for (String token : tokens) {
            System.out.println(token);
        }
    }
}

Analysis of Output: The output below shows that punctuation marks like . and ? have been separated into their own tokens, which is often the desired behavior for downstream analysis.

Hi
.
How
are
you
?
Welcome
to
Tutorialspoint
.
We
provide
free
tutorials
on
various
technologies
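To make the character-class idea concrete, the following is a minimal plain-Java sketch. It is not OpenNLP's implementation: the class name CharClassTokenizerSketch, the four coarse classes, and the rule "start a new token whenever the class changes" are all assumptions chosen for illustration, and the real SimpleTokenizer may treat some edge cases (such as runs of different punctuation marks) differently.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch only: a hand-rolled character-class tokenizer,
// NOT the code behind opennlp.tools.tokenize.SimpleTokenizer.
final class CharClassTokenizerSketch {

    // Coarse character classes (an assumption made for this sketch).
    private enum CharClass { LETTER, DIGIT, WHITESPACE, OTHER }

    private static CharClass classify(char c) {
        if (Character.isWhitespace(c)) return CharClass.WHITESPACE;
        if (Character.isLetter(c)) return CharClass.LETTER;
        if (Character.isDigit(c)) return CharClass.DIGIT;
        return CharClass.OTHER;
    }

    // Splits text wherever the character class changes; whitespace is dropped.
    static String[] tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        int start = -1;           // start of the token being built, -1 if none
        CharClass current = null; // class of the token being built
        for (int i = 0; i < text.length(); i++) {
            CharClass cc = classify(text.charAt(i));
            if (cc == CharClass.WHITESPACE) {
                if (start >= 0) {
                    tokens.add(text.substring(start, i));
                    start = -1;
                }
            } else if (start < 0) {
                start = i;
                current = cc;
            } else if (cc != current) {
                tokens.add(text.substring(start, i));
                start = i;
                current = cc;
            }
        }
        if (start >= 0) {
            tokens.add(text.substring(start));
        }
        return tokens.toArray(new String[0]);
    }

    public static void main(String[] args) {
        for (String token : tokenize("Hi. How are you?")) {
            System.out.println(token);
        }
    }
}
```

Because the split depends only on the class of each character, no model file is needed, which is why tokenizers of this kind are cheap to run.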

The WhitespaceTokenizer offers an even simpler approach: it splits a string into tokens using only whitespace characters (spaces, tabs, newlines) as delimiters.

import opennlp.tools.tokenize.WhitespaceTokenizer;

public class WhitespaceTokenizerExample {
    public static void main(String[] args) {
        String sentence = "Hi. How are you? Welcome to Tutorialspoint. "
            + "We provide free tutorials on various technologies";

        // Obtain the shared WhitespaceTokenizer instance
        WhitespaceTokenizer whitespaceTokenizer = WhitespaceTokenizer.INSTANCE;

        // Tokenize the sentence
        String[] tokens = whitespaceTokenizer.tokenize(sentence);

        // Print one token per line
        for (String token : tokens) {
            System.out.println(token);
        }
    }
}

Analysis of Output: In contrast to the SimpleTokenizer, this method keeps punctuation attached to the preceding word, as seen with “Hi.” and “you?”. This can be useful in some contexts but is generally less versatile for detailed grammatical analysis.

Hi.
How
are
you?
Welcome
to
Tutorialspoint.
We
provide
free
tutorials
on
various
technologies
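The same whitespace-only behavior can be approximated with the JDK alone. The sketch below is an illustration under assumptions, not OpenNLP code: WhitespaceSplitSketch is a hypothetical name, and it relies on String.split with the regular expression \s+ rather than on anything in the OpenNLP library.

```java
// Illustrative sketch only: whitespace splitting with the JDK,
// NOT the code behind opennlp.tools.tokenize.WhitespaceTokenizer.
final class WhitespaceSplitSketch {

    // Splits on runs of whitespace; punctuation stays attached to its word.
    static String[] tokenize(String text) {
        String trimmed = text.trim();
        if (trimmed.isEmpty()) {
            return new String[0]; // avoid the single empty token split() would return
        }
        return trimmed.split("\\s+");
    }

    public static void main(String[] args) {
        for (String token : tokenize("Hi. How are you?")) {
            System.out.println(token);
        }
    }
}
```

Since punctuation never forms its own token here, a later step that expects "you" and "?" as separate items would need additional handling.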

5.3 Model-Driven Tokenization with TokenizerME

For more sophisticated and accurate tokenization, OpenNLP provides the TokenizerME class. This approach uses a pre-trained statistical model (en-token.bin) based on the Maximum Entropy algorithm to make more intelligent decisions about where to split tokens. While rule-based tokenizers like SimpleTokenizer are extremely fast and require no model loading, the statistical approach of TokenizerME provides greater accuracy and is better equipped to handle ambiguous cases like contractions or complex hyphenation.

The process is analogous to model-driven sentence detection:

1. Load the TokenizerModel from the en-token.bin file.
2. Instantiate the TokenizerME class with the model.
3. Call the tokenize() method.

import java.io.FileInputStream;
import java.io.InputStream;
import opennlp.tools.tokenize.TokenizerME;
import opennlp.tools.tokenize.TokenizerModel;

public class TokenizerMEExample {
    public static void main(String[] args) throws Exception {
        String sentence = "Hi. How are you? Welcome to Tutorialspoint. "
            + "We provide free tutorials on various technologies";

        // Load the tokenizer model; the stream is closed once the model is read
        TokenizerModel tokenModel;
        try (InputStream inputStream = new FileInputStream("C:/OpenNLP_models/en-token.bin")) {
            tokenModel = new TokenizerModel(inputStream);
        }

        // Instantiate the TokenizerME class with the model
        TokenizerME tokenizer = new TokenizerME(tokenModel);

        // Tokenize the raw text
        String[] tokens = tokenizer.tokenize(sentence);

        // Print one token per line
        for (String token : tokens) {
            System.out.println(token);
        }
    }
}

Analysis of Output: The output is similar to that of the SimpleTokenizer, with punctuation separated into its own tokens. The model-driven approach, however, is generally more robust when dealing with ambiguous cases. Note that the original listing for this example showed the final token as “technologie”. A tokenizer only chooses split points and does not change characters, so this is more plausibly a transcription slip or an extra split before the final “s” than a misspelling introduced by the model. The output block below shows the full word.

Hi
.
How
are
you
?
Welcome
to
Tutorialspoint
.
We
provide
free
tutorials
on
various
technologies

5.4 Advanced Analysis: Retrieving Token Spans and Probabilities

Just as with sentence detection, it is often useful to know the exact position of each token in the original string. The tokenizePos() method, available in all tokenizer classes, provides this functionality by returning an array of Span objects. Recall that the [start..end) notation denotes a character range that is inclusive of the start index and exclusive of the end index.
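The half-open [start..end) convention can be illustrated without OpenNLP. The class below is a hypothetical stand-in written for this module, not the library's Span class; it only mirrors the idea that the start index is included, the end index is excluded, and the length is therefore end minus start.

```java
// Illustrative sketch only: a tiny half-open range type,
// NOT opennlp.tools.util.Span.
final class HalfOpenSpanSketch {
    final int start; // inclusive
    final int end;   // exclusive

    HalfOpenSpanSketch(int start, int end) {
        if (start < 0 || end < start) {
            throw new IllegalArgumentException("invalid range: " + start + ".." + end);
        }
        this.start = start;
        this.end = end;
    }

    // Number of characters covered; no "+ 1" is needed for half-open ranges.
    int length() {
        return end - start;
    }

    // String.substring uses the same inclusive-start, exclusive-end convention.
    String coveredText(String text) {
        return text.substring(start, end);
    }

    @Override
    public String toString() {
        return "[" + start + ".." + end + ")";
    }

    public static void main(String[] args) {
        String text = "Hi. How";
        HalfOpenSpanSketch first = new HalfOpenSpanSketch(0, 2);
        HalfOpenSpanSketch second = new HalfOpenSpanSketch(2, 3);
        System.out.println(first + " " + first.coveredText(text));
        System.out.println(second + " " + second.coveredText(text));
    }
}
```

One convenient property of this convention is that adjacent tokens share a boundary value: the end of one range can equal the start of the next without the two overlapping.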

Retrieving Spans with SimpleTokenizer

import opennlp.tools.tokenize.SimpleTokenizer;
import opennlp.tools.util.Span;

public class SimpleTokenizerSpans {
    public static void main(String[] args) {
        String sent = "Hi. How are you? Welcome to Tutorialspoint. "
            + "We provide free tutorials on various technologies";

        SimpleTokenizer simpleTokenizer = SimpleTokenizer.INSTANCE;

        // Retrieve the character range of each token
        Span[] tokens = simpleTokenizer.tokenizePos(sent);

        // Print each span followed by the text it covers
        for (Span token : tokens) {
            System.out.println(token + " " + sent.substring(token.getStart(), token.getEnd()));
        }
    }
}

Output:

[0..2) Hi

[2..3) .

[4..7) How

…

Retrieving Spans with WhitespaceTokenizer

import opennlp.tools.tokenize.WhitespaceTokenizer;
import opennlp.tools.util.Span;

public class WhitespaceTokenizerSpans {
    public static void main(String[] args) {
        String sent = "Hi. How are you? Welcome to Tutorialspoint. "
            + "We provide free tutorials on various technologies";

        WhitespaceTokenizer whitespaceTokenizer = WhitespaceTokenizer.INSTANCE;

        // Retrieve the character range of each token
        Span[] tokens = whitespaceTokenizer.tokenizePos(sent);

        // Print each span followed by the text it covers
        for (Span token : tokens) {
            System.out.println(token + " " + sent.substring(token.getStart(), token.getEnd()));
        }
    }
}

Output:

[0..3) Hi.

[4..7) How

…

Retrieving Spans with TokenizerME

import java.io.FileInputStream;
import java.io.InputStream;
import opennlp.tools.tokenize.TokenizerME;
import opennlp.tools.tokenize.TokenizerModel;
import opennlp.tools.util.Span;

public class TokenizerMESpans {
    public static void main(String[] args) throws Exception {
        String sent = "Hello John how are you welcome to Tutorialspoint";

        // Load the tokenizer model; the stream is closed once the model is read
        TokenizerModel tokenModel;
        try (InputStream inputStream = new FileInputStream("C:/OpenNLP_models/en-token.bin")) {
            tokenModel = new TokenizerModel(inputStream);
        }
        TokenizerME tokenizer = new TokenizerME(tokenModel);

        // Retrieve the character range of each token
        Span[] tokens = tokenizer.tokenizePos(sent);

        // Print each span followed by the text it covers
        for (Span token : tokens) {
            System.out.println(token + " " + sent.substring(token.getStart(), token.getEnd()));
        }
    }
}

Output:

[0..5) Hello

[6..10) John

…

In each case, combining the Span object with the substring() method allows for a clear display of both the token and its precise location within the source string.
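To show where such offsets come from, the sketch below derives half-open spans for whitespace-separated tokens in a single pass over the string. It is an illustration in plain Java under assumptions: WhitespaceSpanSketch and spansOf are hypothetical names, each span is a two-element int array {start, end}, and nothing here is taken from the OpenNLP source.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch only: computes [start..end) offsets of
// whitespace-separated tokens without using OpenNLP.
final class WhitespaceSpanSketch {

    // Returns one {start, end} pair per token; end is exclusive.
    static List<int[]> spansOf(String text) {
        List<int[]> spans = new ArrayList<>();
        int start = -1; // start of the token being scanned, -1 if none
        for (int i = 0; i < text.length(); i++) {
            boolean whitespace = Character.isWhitespace(text.charAt(i));
            if (!whitespace && start < 0) {
                start = i;                          // a token begins here
            } else if (whitespace && start >= 0) {
                spans.add(new int[] {start, i});    // token ended just before i
                start = -1;
            }
        }
        if (start >= 0) {
            spans.add(new int[] {start, text.length()}); // token runs to the end
        }
        return spans;
    }

    public static void main(String[] args) {
        String sent = "Hi. How are you?";
        for (int[] span : spansOf(sent)) {
            System.out.println("[" + span[0] + ".." + span[1] + ") "
                + sent.substring(span[0], span[1]));
        }
    }
}
```

Keeping offsets rather than copied strings lets later stages map a result back to the exact characters in the original text, for example to highlight it.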

Tokenization Probabilities

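The TokenizerME class also provides the getTokenProbabilities() method, which returns one probability per token found by the most recent call to tokenize() or tokenizePos(). Each value reflects the model's confidence in the corresponding tokenization decision.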

import java.io.FileInputStream;
import java.io.InputStream;
import opennlp.tools.tokenize.TokenizerME;
import opennlp.tools.tokenize.TokenizerModel;
import opennlp.tools.util.Span;

public class TokenizerMEProbs {
    public static void main(String[] args) throws Exception {
        String sent = "Hello John how are you welcome to Tutorialspoint";

        // Load the tokenizer model; the stream is closed once the model is read
        TokenizerModel tokenModel;
        try (InputStream inputStream = new FileInputStream("C:/OpenNLP_models/en-token.bin")) {
            tokenModel = new TokenizerModel(inputStream);
        }
        TokenizerME tokenizer = new TokenizerME(tokenModel);

        Span[] tokens = tokenizer.tokenizePos(sent);

        // Probabilities for the tokens found by the call above
        double[] probs = tokenizer.getTokenProbabilities();

        // Printing spans and tokens…

        System.out.println(" ");
        for (int i = 0; i < probs.length; i++) {
            System.out.println(probs[i]);
        }
    }
}

Output Probabilities:

1.0

…

The probability scores are extremely high (1.0), indicating that the model is very confident about the token boundaries in this straightforward sentence.
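One possible use of these scores is to flag tokens whose probability falls below a chosen cut-off so they can be inspected. The sketch below works on plain arrays and does not call OpenNLP: LowConfidenceSketch, the below method, the 0.9 threshold, and the sample probabilities in main are all hypothetical values invented for illustration, not output from en-token.bin.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch only: pairs tokens with per-token probabilities
// and reports the ones under a threshold. No OpenNLP calls are made.
final class LowConfidenceSketch {

    // Returns the tokens whose probability is strictly below the threshold.
    static List<String> below(String[] tokens, double[] probs, double threshold) {
        if (tokens.length != probs.length) {
            throw new IllegalArgumentException("expected one probability per token");
        }
        List<String> flagged = new ArrayList<>();
        for (int i = 0; i < tokens.length; i++) {
            if (probs[i] < threshold) {
                flagged.add(tokens[i]);
            }
        }
        return flagged;
    }

    public static void main(String[] args) {
        // Hypothetical values, made up for this example.
        String[] tokens = {"Hello", "John", "can", "'t"};
        double[] probs = {1.0, 1.0, 0.72, 0.72};
        System.out.println(below(tokens, probs, 0.9));
    }
}
```

In a real program the two arrays would come from the same tokenizer call, since the probabilities describe only the most recent tokenization.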

Having explored the different tokenization strategies, we can now proceed to analyze these token sequences to identify real-world named entities.
