4.0 Module 4: Practical Implementation – Sentence Segmentation
4.1 Introduction: The Challenge of Sentence Boundary Disambiguation
Sentence Boundary Disambiguation (SBD), also known as sentence segmentation, is the task of identifying the beginning and end of sentences in a block of text. While this may seem trivial at first glance, it is a complex problem in NLP. Punctuation marks like periods are ambiguous; a period can signify the end of a sentence, or it can be part of an abbreviation (e.g., “Mr.”), a decimal number, or an ellipsis. Accurately segmenting text into sentences is a critical first step in nearly all NLP pipelines, as most subsequent analysis, such as POS tagging and parsing, operates on a sentence-by-sentence basis.
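The ambiguity of the period can be made concrete with a small sketch. The class name PeriodAmbiguityDemo, the four-entry abbreviation list, and the labelling heuristics are all assumptions made for illustration; they are not part of OpenNLP and would be far too crude for real text.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class PeriodAmbiguityDemo {
    // Tiny illustrative list (an assumption); real text needs far broader coverage.
    private static final List<String> ABBREVIATIONS = Arrays.asList("Mr", "Mrs", "Dr", "Prof");

    /** Labels every period in the text with a rough guess at its role. */
    static List<String> classifyPeriods(String text) {
        List<String> labels = new ArrayList<>();
        for (int i = 0; i < text.length(); i++) {
            if (text.charAt(i) != '.') continue;
            char prev = i > 0 ? text.charAt(i - 1) : ' ';
            char next = i + 1 < text.length() ? text.charAt(i + 1) : ' ';
            // Walk back to the start of the word that the period is attached to.
            int wordStart = i;
            while (wordStart > 0 && !Character.isWhitespace(text.charAt(wordStart - 1))) {
                wordStart--;
            }
            String word = text.substring(wordStart, i);

            if (prev == '.' || next == '.') labels.add("ellipsis");
            else if (Character.isDigit(prev) && Character.isDigit(next)) labels.add("decimal");
            else if (ABBREVIATIONS.contains(word)) labels.add("abbreviation");
            else labels.add("boundary");
        }
        return labels;
    }

    public static void main(String[] args) {
        // Six periods, only the last of which ends a sentence.
        System.out.println(classifyPeriods("Mr. Smith paid 3.50 dollars... He left."));
    }
}
```

Of the six periods in the sample, only the final one is labelled as a boundary, which is why a splitter that treats every period alike cannot work.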
4.2 A Baseline Comparison: Sentence Detection with Java Regular Expressions
A simple approach to sentence detection is to use regular expressions to split a text string based on common end-of-sentence punctuation. This method serves as a useful baseline to appreciate the sophistication of more advanced techniques.
The following Java code demonstrates this approach. The input sentence string is split using a simple regex pattern “[.?!]” that matches a period, question mark, or exclamation mark.
public class SentenceDetection_RE {
    public static void main(String[] args) {
        String sentence = " Hi. How are you? Welcome to Tutorialspoint. " +
            "We provide free tutorials on various technologies";
        // Character class matching any single end-of-sentence mark
        String simple = "[.?!]";
        String[] splitString = sentence.split(simple);
        for (String string : splitString)
            // trim() drops the whitespace left at the edges of each fragment
            System.out.println(string.trim());
    }
}
Output and Analysis:
Hi
How are you
Welcome to Tutorialspoint
We provide free tutorials on various technologies
The output shows that the String.split() method broke the text into sentence-like fragments, but two limitations are immediately apparent. First, the punctuation used for splitting is consumed and removed from the output, so a fragment no longer shows whether it was a statement or a question. Second, the pattern treats every period as a boundary, so it would wrongly split after abbreviations such as “Mr.” and inside decimal numbers such as 3.14.
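The first limitation can be worked around within the regex approach itself. The sketch below, whose class name LookbehindSplitDemo is an assumption for illustration, splits on whitespace that follows an end-of-sentence mark by using a zero-width lookbehind, so the mark stays attached to its sentence. It assumes sentences are separated by whitespace, and it does nothing about the second limitation.

```java
import java.util.Arrays;

public class LookbehindSplitDemo {
    /** Splits after '.', '?' or '!' when followed by whitespace, keeping the mark. */
    static String[] split(String text) {
        // (?<=[.?!]) is a zero-width lookbehind, so only the whitespace is consumed.
        return text.trim().split("(?<=[.?!])\\s+");
    }

    public static void main(String[] args) {
        String text = "Hi. How are you? Welcome to Tutorialspoint. "
            + "We provide free tutorials on various technologies";
        System.out.println(Arrays.toString(split(text)));

        // Abbreviations still break the heuristic: "Mr." is cut off from "Smith".
        System.out.println(Arrays.toString(split("Mr. Smith paid 3.50 dollars. He left.")));
    }
}
```

The decimal 3.50 survives here only because no whitespace follows its period; the abbreviation is still split, which is the kind of case a context-aware model is meant to handle.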
4.3 The OpenNLP Approach: Model-Driven Sentence Detection
OpenNLP approaches sentence detection not as a simple rule-based splitting task, but as a statistical classification problem. It uses a pre-trained model, en-sent.bin, which has learned the contextual patterns of sentence boundaries from a large corpus of text. Because this Maximum Entropy model weighs the context around each candidate boundary rather than applying one fixed rule, it is typically more accurate and robust than simple regular expressions.
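To give a feel for what "classification" means here, the following sketch scores a single candidate boundary the way a two-outcome Maximum Entropy model would: it sums the weights of the features that fire and converts the total into a probability. The feature names and weights are invented for illustration; they are not the features or parameters stored in en-sent.bin, and the class MaxEntBoundarySketch is hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class MaxEntBoundarySketch {
    // Hypothetical feature weights, invented for illustration only.
    private static final Map<String, Double> WEIGHTS = new LinkedHashMap<>();
    static {
        WEIGHTS.put("nextTokenCapitalized", 1.5);
        WEIGHTS.put("followedByWhitespace", 1.0);
        WEIGHTS.put("prevTokenIsKnownAbbreviation", -3.0);
        WEIGHTS.put("prevTokenIsDigit", -1.0);
    }

    /** Probability that a candidate mark is a sentence boundary, given its active features. */
    static double boundaryProbability(String... activeFeatures) {
        double score = 0.0;
        for (String feature : activeFeatures) {
            score += WEIGHTS.getOrDefault(feature, 0.0);
        }
        // With two outcomes and the "not a boundary" score fixed at 0,
        // exp(score) / (exp(score) + exp(0)) simplifies to the logistic function.
        return 1.0 / (1.0 + Math.exp(-score));
    }

    public static void main(String[] args) {
        // A period followed by whitespace and a capitalized word
        System.out.println(boundaryProbability("nextTokenCapitalized", "followedByWhitespace"));
        // The same context, but the preceding token is a known abbreviation
        System.out.println(boundaryProbability("nextTokenCapitalized", "followedByWhitespace",
            "prevTokenIsKnownAbbreviation"));
    }
}
```

In a trained model the weights are learned from annotated text rather than written by hand, which is what lets the same evidence count differently after an abbreviation than after an ordinary word.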
The process for using OpenNLP for sentence detection involves three core steps:
- Load the SentenceModel from the en-sent.bin file.
- Instantiate the SentenceDetectorME class with the loaded model.
- Use the sentDetect() method of the detector instance to process the text.
4.4 Practical Implementation: A Step-by-Step Code Walkthrough
The following code demonstrates the OpenNLP approach to sentence detection.
import java.io.FileInputStream;
import java.io.InputStream;
import opennlp.tools.sentdetect.SentenceDetectorME;
import opennlp.tools.sentdetect.SentenceModel;

public class SentenceDetectionME {
    public static void main(String[] args) throws Exception {
        String sentence = "Hi. How are you? Welcome to Tutorialspoint. " +
            "We provide free tutorials on various technologies";

        // Loading the sentence detector model; try-with-resources closes the stream
        SentenceModel model;
        try (InputStream inputStream = new FileInputStream("C:/OpenNLP_models/en-sent.bin")) {
            model = new SentenceModel(inputStream);
        }

        // Instantiating the SentenceDetectorME class
        SentenceDetectorME detector = new SentenceDetectorME(model);

        // Detecting the sentences
        String[] sentences = detector.sentDetect(sentence);

        // Printing the sentences
        for (String sent : sentences)
            System.out.println(sent);
    }
}
Code Walkthrough:
- Loading the Model: An InputStream is created using a FileInputStream that points to the en-sent.bin model file on disk. This stream is then passed to the constructor of the SentenceModel class, which loads the model into memory.
- Instantiating the Detector: An instance of SentenceDetectorME is created, passing the model object to its constructor. This prepares the detector for use.
- Detecting Sentences: The detector.sentDetect(sentence) method is called. It processes the input sentence string and returns an array of strings, sentences, where each element is a detected sentence.
- Printing the Output: A simple for loop iterates through the sentences array and prints each one to the console.
Output and Analysis:
Hi. How are you?
Welcome to Tutorialspoint.
We provide free tutorials on various technologies
This output is a clear improvement over the regex approach: the end-of-sentence punctuation is preserved, and each segment is ready for further processing. Note that the model returns three segments, keeping the short greeting “Hi. How are you?” together; this matches the three spans and three probability scores reported in Section 4.5.
4.5 Advanced Analysis: Positional Data and Confidence Scoring
Beyond simply extracting sentence text, OpenNLP can provide metadata about the detection process, such as the position of each sentence and the model’s confidence in its decision.
Retrieving Sentence Positions (Spans)
The sentPosDetect() method returns the character offsets of each sentence as an array of Span objects. A Span object contains the start (inclusive) and end (exclusive) indices of the detected segment.
import java.io.FileInputStream;
import java.io.InputStream;
import opennlp.tools.sentdetect.SentenceDetectorME;
import opennlp.tools.sentdetect.SentenceModel;
import opennlp.tools.util.Span;

public class SentencePosDetection {
    public static void main(String[] args) throws Exception {
        String paragraph = "Hi. How are you? Welcome to Tutorialspoint. " +
            "We provide free tutorials on various technologies";

        // Load the model; try-with-resources closes the stream afterwards
        SentenceModel model;
        try (InputStream inputStream = new FileInputStream("C:/OpenNLP_models/en-sent.bin")) {
            model = new SentenceModel(inputStream);
        }
        SentenceDetectorME detector = new SentenceDetectorME(model);

        // Each Span holds the character offsets of one detected sentence
        Span[] spans = detector.sentPosDetect(paragraph);
        for (Span span : spans)
            System.out.println(span);
    }
}
Output:
[0..16)
[17..43)
[44..93)
The [start..end) notation indicates a character range, where start is the inclusive starting index of the sentence and end is the exclusive ending index.
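The half-open convention can be checked without loading the model. The sketch below copies the three offset pairs from the output above into plain int arrays (the class name SpanOffsetsDemo and the helper covered() are assumptions for illustration) and shows that the length of a span is simply end minus start.

```java
public class SpanOffsetsDemo {
    /** Returns the text covered by the half-open range [start, end). */
    static String covered(String text, int start, int end) {
        // String.substring() uses the same inclusive-start, exclusive-end convention.
        return text.substring(start, end);
    }

    public static void main(String[] args) {
        String paragraph = "Hi. How are you? Welcome to Tutorialspoint. "
            + "We provide free tutorials on various technologies";
        // Offsets copied from the output above.
        int[][] offsets = { {0, 16}, {17, 43}, {44, 93} };
        for (int[] o : offsets) {
            int length = o[1] - o[0]; // an exclusive end makes length a plain subtraction
            System.out.println(length + " chars: \"" + covered(paragraph, o[0], o[1]) + "\"");
        }
        // Indices 16 and 43 belong to no span: they hold the separating spaces.
        System.out.println(paragraph.charAt(16) == ' ' && paragraph.charAt(43) == ' ');
    }
}
```

Because the whitespace between sentences falls outside every span, consecutive spans are not contiguous: one ends at 16 and the next starts at 17.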
By combining sentPosDetect() with Java’s substring() method, we can display both the sentence and its position.
import java.io.FileInputStream;
import java.io.InputStream;
import opennlp.tools.sentdetect.SentenceDetectorME;
import opennlp.tools.sentdetect.SentenceModel;
import opennlp.tools.util.Span;

public class SentencesAndPosDetection {
    public static void main(String[] args) throws Exception {
        String sen = "Hi. How are you? Welcome to Tutorialspoint." +
            " We provide free tutorials on various technologies";

        // Load the model; try-with-resources closes the stream afterwards
        SentenceModel model;
        try (InputStream inputStream = new FileInputStream("C:/OpenNLP_models/en-sent.bin")) {
            model = new SentenceModel(inputStream);
        }
        SentenceDetectorME detector = new SentenceDetectorME(model);

        // Print each sentence's text followed by its [start..end) span
        Span[] spans = detector.sentPosDetect(sen);
        for (Span span : spans)
            System.out.println(sen.substring(span.getStart(), span.getEnd()) + " " + span);
    }
}
Output:
Hi. How are you? [0..16)
Welcome to Tutorialspoint. [17..43)
We provide free tutorials on various technologies [44..93)
Confidence Scoring
The getSentenceProbabilities() method provides insight into the model’s confidence. It reports on the most recent detection, so it must be called after sentDetect(); it returns an array of double values, one probability score per detected sentence, in the same order as the sentences.
import java.io.FileInputStream;
import java.io.InputStream;
import opennlp.tools.sentdetect.SentenceDetectorME;
import opennlp.tools.sentdetect.SentenceModel;

public class SentenceDetectionMEProbs {
    public static void main(String[] args) throws Exception {
        String sentence = "Hi. How are you? Welcome to Tutorialspoint. " +
            "We provide free tutorials on various technologies";

        // Load the model; try-with-resources closes the stream afterwards
        SentenceModel model;
        try (InputStream inputStream = new FileInputStream("C:/OpenNLP_models/en-sent.bin")) {
            model = new SentenceModel(inputStream);
        }
        SentenceDetectorME detector = new SentenceDetectorME(model);

        // Detect and print the sentences first
        String[] sentences = detector.sentDetect(sentence);
        for (String sent : sentences)
            System.out.println(sent);

        // Then read the probabilities for that same detection run
        double[] probs = detector.getSentenceProbabilities();
        System.out.println(" ");
        for (int i = 0; i < probs.length; i++)
            System.out.println(probs[i]);
    }
}
Output Probabilities:
0.9240246995179983
0.9957680129995953
1.0
These scores, all above 0.92, indicate the model’s high confidence in its segmentation decisions for this particular text.
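One possible use of these scores, sketched here as an assumption rather than an OpenNLP feature, is to flag sentences whose probability falls below a chosen threshold so they can be reviewed or handled separately. The class LowConfidenceFilter and the 0.95 threshold are hypothetical; the scores are copied from the output above.

```java
import java.util.ArrayList;
import java.util.List;

public class LowConfidenceFilter {
    /** Returns the indices of sentences whose probability is below the threshold. */
    static List<Integer> belowThreshold(double[] probs, double threshold) {
        List<Integer> flagged = new ArrayList<>();
        for (int i = 0; i < probs.length; i++) {
            if (probs[i] < threshold) {
                flagged.add(i);
            }
        }
        return flagged;
    }

    public static void main(String[] args) {
        // Scores copied from the output above; 0.95 is an arbitrary example threshold.
        double[] probs = {0.9240246995179983, 0.9957680129995953, 1.0};
        System.out.println(belowThreshold(probs, 0.95)); // prints [0]
    }
}
```

Because the probabilities are returned in sentence order, each flagged index can be used directly to look up the corresponding entry in the array returned by sentDetect().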
Having successfully segmented our text into sentences, the next logical step in our NLP pipeline is to break these sentences down into their constituent words, a process known as tokenization.