5.0 Module 5: Practical Implementation – Tokenization
5.1 Introduction: From Sentences to Tokens
Tokenization is the process of breaking down a sequence of text, typically a sentence, into smaller units called tokens. These tokens are usually words, numbers, or punctuation marks. This task is a fundamental prerequisite for almost all subsequent NLP analysis. Before a machine can understand the grammatical structure or meaning of a sentence, it must first be able to identify its individual components. Tasks like Part-of-Speech tagging and Named Entity Recognition operate on a sequence of tokens, not a raw string of text, making tokenization a critical preprocessing step.
5.2 Rule-Based Tokenization Techniques in OpenNLP
OpenNLP provides simple, rule-based tokenizers that are fast and do not require loading a statistical model. They are suitable for many basic tokenization needs.
The SimpleTokenizer works by separating tokens based on character classes. For example, it will typically split letters from punctuation marks. The process involves getting a singleton instance of the class via SimpleTokenizer.INSTANCE and then calling its tokenize() method.
import opennlp.tools.tokenize.SimpleTokenizer;
public class SimpleTokenizerExample {
public static void main(String args[]){
String sentence = “Hi. How are you? Welcome to Tutorialspoint. ” +
“We provide free tutorials on various technologies”;
//Instantiating SimpleTokenizer class
SimpleTokenizer simpleTokenizer = SimpleTokenizer.INSTANCE;
//Tokenizing the given sentence
String tokens[] = simpleTokenizer.tokenize(sentence);
//Printing the tokens
for(String token : tokens) {
System.out.println(token);
}
}
}
Analysis of Output: The output below shows that punctuation marks like . and ? have been separated into their own tokens, which is often the desired behavior for downstream analysis.
Hi
.
How
are
you
?
Welcome
to
Tutorialspoint
.
We
provide
free
tutorials
on
various
technologies
The WhitespaceTokenizer offers an even simpler approach: it splits a string into tokens using only whitespaces (spaces, tabs, newlines) as delimiters.
import opennlp.tools.tokenize.WhitespaceTokenizer;
public class WhitespaceTokenizerExample {
public static void main(String args[]){
String sentence = “Hi. How are you? Welcome to Tutorialspoint. ” +
“We provide free tutorials on various technologies”;
//Instantiating whitespaceTokenizer class
WhitespaceTokenizer whitespaceTokenizer = WhitespaceTokenizer.INSTANCE;
//Tokenizing the given paragraph
String tokens[] = whitespaceTokenizer.tokenize(sentence);
//Printing the tokens
for(String token : tokens)
System.out.println(token);
}
}
Analysis of Output: In contrast to the SimpleTokenizer, this method keeps punctuation attached to the preceding word, as seen with “Hi.” and “you?”. This can be useful in some contexts but is generally less versatile for detailed grammatical analysis.
Hi.
How
are
you?
Welcome
to
Tutorialspoint.
We
provide
free
tutorials
on
various
technologies
5.3 Model-Driven Tokenization with
For more sophisticated and accurate tokenization, OpenNLP provides the TokenizerME class. This approach uses a pre-trained statistical model (en-token.bin) based on the Maximum Entropy algorithm to make more intelligent decisions about where to split tokens. While rule-based tokenizers like SimpleTokenizer are extremely fast and require no model loading, the statistical approach of TokenizerME provides greater accuracy and is better equipped to handle ambiguous cases like contractions or complex hyphenation.
The process is analogous to model-driven sentence detection:
- Load the TokenizerModel from the en-token.bin file.
- Instantiate the TokenizerME class with the model.
- Call the tokenize() method.
import java.io.FileInputStream;
import java.io.InputStream;
import opennlp.tools.tokenize.TokenizerME;
import opennlp.tools.tokenize.TokenizerModel;
public class TokenizerMEExample {
public static void main(String args[]) throws Exception{
String sentence = “Hi. How are you? Welcome to Tutorialspoint. ” +
“We provide free tutorials on various technologies”;
//Loading the Tokenizer model
InputStream inputStream = new FileInputStream(“C:/OpenNLP_models/en-token.bin”);
TokenizerModel tokenModel = new TokenizerModel(inputStream);
//Instantiating the TokenizerME class
TokenizerME tokenizer = new TokenizerME(tokenModel);
//Tokenizing the given raw text
String tokens[] = tokenizer.tokenize(sentence);
//Printing the tokens
for (String a : tokens)
System.out.println(a);
}
}
Analysis of Output: The output is similar to that of the SimpleTokenizer, correctly separating punctuation. The model-driven approach, however, is generally more robust when dealing with ambiguous cases. As an interesting artifact, note that the model’s output on the source text results in a misspelling (“technologie”). This can occasionally happen with statistical models when encountering words that are rare or near the edge of their training vocabulary. For clarity, we have corrected it in the output block below.
Hi
.
How
are
you
?
Welcome
to
Tutorialspoint
.
We
provide
free
tutorials
on
various
technologies
5.4 Advanced Analysis: Retrieving Token Spans and Probabilities
Just as with sentence detection, it is often useful to know the exact position of each token in the original string. The tokenizePos() method, available in all tokenizer classes, provides this functionality by returning an array of Span objects. Recall that the [start..end) notation denotes a character range that is inclusive of the start index and exclusive of the end index.
Retrieving Spans with SimpleTokenizer
import opennlp.tools.tokenize.SimpleTokenizer;
import opennlp.tools.util.Span;
public class SimpleTokenizerSpans {
public static void main(String args[]){
String sent = “Hi. How are you? Welcome to Tutorialspoint. ” +
“We provide free tutorials on various technologies”;
SimpleTokenizer simpleTokenizer = SimpleTokenizer.INSTANCE;
Span[] tokens = simpleTokenizer.tokenizePos(sent);
for( Span token : tokens)
System.out.println(token +” “+sent.substring(token.getStart(), token.getEnd()));
}
}
Output:
[0..2) Hi
[2..3) .
[4..7) How
…
Retrieving Spans with WhitespaceTokenizer
import opennlp.tools.tokenize.WhitespaceTokenizer;
import opennlp.tools.util.Span;
public class WhitespaceTokenizerSpans {
public static void main(String args[]){
String sent = “Hi. How are you? Welcome to Tutorialspoint. ” +
“We provide free tutorials on various technologies”;
WhitespaceTokenizer whitespaceTokenizer = WhitespaceTokenizer.INSTANCE;
Span[] tokens = whitespaceTokenizer.tokenizePos(sent);
for( Span token : tokens)
System.out.println(token +” “+sent.substring(token.getStart(), token.getEnd()));
}
}
Output:
[0..3) Hi.
[4..7) How
…
Retrieving Spans with TokenizerME
import java.io.FileInputStream;
import java.io.InputStream;
import opennlp.tools.tokenize.TokenizerME;
import opennlp.tools.tokenize.TokenizerModel;
import opennlp.tools.util.Span;
public class TokenizerMESpans {
public static void main(String args[]) throws Exception{
String sent = “Hello John how are you welcome to Tutorialspoint”;
InputStream inputStream = new FileInputStream(“C:/OpenNLP_models/en-token.bin”);
TokenizerModel tokenModel = new TokenizerModel(inputStream);
TokenizerME tokenizer = new TokenizerME(tokenModel);
Span tokens[] = tokenizer.tokenizePos(sent);
for(Span token : tokens)
System.out.println(token +” “+sent.substring(token.getStart(), token.getEnd()));
}
}
Output:
[0..5) Hello
[6..10) John
…
In each case, combining the Span object with the substring() method allows for a clear display of both the token and its precise location within the source string.
Tokenization Probabilities
The TokenizerME class also provides the getTokenProbabilities() method, which returns a confidence score for each tokenization decision.
import java.io.FileInputStream;
import java.io.InputStream;
import opennlp.tools.tokenize.TokenizerME;
import opennlp.tools.tokenize.TokenizerModel;
import opennlp.tools.util.Span;
public class TokenizerMEProbs {
public static void main(String args[]) throws Exception{
String sent = “Hello John how are you welcome to Tutorialspoint”;
InputStream inputStream = new FileInputStream(“C:/OpenNLP_models/en-token.bin”);
TokenizerModel tokenModel = new TokenizerModel(inputStream);
TokenizerME tokenizer = new TokenizerME(tokenModel);
Span tokens[] = tokenizer.tokenizePos(sent);
double[] probs = tokenizer.getTokenProbabilities();
// Printing spans and tokens…
System.out.println(” “);
for(int i = 0; i<probs.length; i++)
System.out.println(probs[i]);
}
}
Output Probabilities:
1.0
1.0
1.0
…
The probability scores are extremely high (1.0), indicating that the model is very confident about the token boundaries in this straightforward sentence.
Having explored the different tokenization strategies, we can now proceed to analyze these token sequences to identify real-world named entities.