Glossary
Apache OpenNLP An open-source Java library used to process natural language text.
ChunkerME class A class in the opennlp.tools.chunker package used to divide a sentence into smaller chunks, such as noun phrases and verb phrases. It uses a maximum-entropy-based model.
ChunkerModel class A class representing the predefined model (en-chunker.bin) used to divide a sentence into smaller chunks.
Chunking The process of dividing a sentence into syntactically related groups of words, such as noun groups and verb groups.
Command Line Interface (CLI) An interface provided by OpenNLP to carry out different NLP operations, as well as train and evaluate models, through the command line.
find() method A method of the NameFinderME class used to detect named entities in an array of tokens. It returns an array of Span objects.
Named Entity Recognition (NER) The process of finding named entities, such as people, places, and organizations, in a given text.
NameFinderME class A class in the opennlp.tools.namefind package that uses a maximum entropy model to find named entities in tokenized text.
NLP (Natural Language Processing) A field of computing, and the set of tools it provides, for deriving meaningful and useful information from natural language sources such as web pages and text documents.
ParserTool class A class in the opennlp.tools.cmdline.parser package used to parse sentences. Its static parseLine() method parses a line of raw text and returns an array of Parse objects.
Parsing The process of analyzing a string of symbols, whether in natural language, computer languages, or data structures, according to the rules of a formal grammar. In OpenNLP, it involves breaking down a sentence into its constituent parts and showing their syntactic relations.
Parts of Speech (POS) Tagging The process of assigning a grammatical category (part of speech) to each word in a text for further analysis. OpenNLP's English models use Penn Treebank tags, including the following:
| Tag | Meaning |
| --- | --- |
| NN | Noun, singular or mass |
| DT | Determiner |
| VB | Verb, base form |
| VBD | Verb, past tense |
| VBZ | Verb, third person singular present |
| IN | Preposition or subordinating conjunction |
| NNP | Proper noun, singular |
| TO | to |
| JJ | Adjective |
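
As a small illustration of how the table above might be used when reading tagger output, the sketch below stores the same nine tags in a lookup map. The class `PosTagGlossary` and its `describe` method are hypothetical helpers written for this glossary, not part of OpenNLP, and the tagged tokens in `main` are hand-written rather than produced by a model.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper (not an OpenNLP class): maps the tags listed in the
// table above to their meanings.
class PosTagGlossary {
    static final Map<String, String> MEANINGS = new LinkedHashMap<>();
    static {
        MEANINGS.put("NN", "Noun, singular or mass");
        MEANINGS.put("DT", "Determiner");
        MEANINGS.put("VB", "Verb, base form");
        MEANINGS.put("VBD", "Verb, past tense");
        MEANINGS.put("VBZ", "Verb, third person singular present");
        MEANINGS.put("IN", "Preposition or subordinating conjunction");
        MEANINGS.put("NNP", "Proper noun, singular");
        MEANINGS.put("TO", "to");
        MEANINGS.put("JJ", "Adjective");
    }

    // Returns the meaning of a tag, or "unknown tag" if the table has no entry for it.
    static String describe(String tag) {
        return MEANINGS.getOrDefault(tag, "unknown tag");
    }

    public static void main(String[] args) {
        // Hand-written example; these tags were not produced by a tagger.
        String[] tokens = {"The", "cat", "sat"};
        String[] tags = {"DT", "NN", "VBD"};
        for (int i = 0; i < tokens.length; i++) {
            System.out.println(tokens[i] + "/" + tags[i] + " = " + describe(tags[i]));
        }
    }
}
```

The map covers only the tags shown in this glossary, so any other tag falls through to the "unknown tag" default.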
POSModel class A class representing the predefined model (en-pos-maxent.bin) used to tag the parts of speech of a given sentence.
POSTaggerME class A class in the opennlp.tools.postag package used to predict the part of speech of each token in a given sentence using a maximum entropy model.
sentDetect() method A method of the SentenceDetectorME class used to detect sentences in a string of raw text, returning them as a string array.
sentPosDetect() method A method of the SentenceDetectorME class used to detect the positions of sentences in a given text, returning them as an array of Span objects.
Sentence Boundary Disambiguation (SBD) The process of deciding the beginning and end of sentences in natural language text, also known as sentence breaking.
SentenceDetectorME class A class in the opennlp.tools.sentdetect package that uses a maximum entropy model to evaluate end-of-sentence characters and split raw text into sentences.
SentenceModel class A class representing the predefined model (en-sent.bin) used to detect sentences in raw text.
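SimpleTokenizer class A tokenizer class that tokenizes raw text using character classes: a new token starts wherever the class of character changes, for example between letters, digits, and punctuation.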
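
The idea of tokenizing by character class can be sketched in plain Java as follows. This is a minimal illustration under assumed rules, not OpenNLP's implementation: the class `CharClassTokenizeSketch`, its four coarse character classes, and its handling of punctuation are all assumptions made for the example, and the real SimpleTokenizer may split some inputs differently.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch only; not the OpenNLP SimpleTokenizer.
class CharClassTokenizeSketch {
    // Coarse character classes assumed for this sketch.
    enum CharClass { WHITESPACE, LETTER, DIGIT, OTHER }

    static CharClass classify(char c) {
        if (Character.isWhitespace(c)) return CharClass.WHITESPACE;
        if (Character.isLetter(c)) return CharClass.LETTER;
        if (Character.isDigit(c)) return CharClass.DIGIT;
        return CharClass.OTHER;
    }

    // Emits a token each time the character class changes; whitespace is dropped.
    static String[] tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        int start = -1; // start offset of the current token, -1 when between tokens
        CharClass previous = CharClass.WHITESPACE;
        for (int i = 0; i < text.length(); i++) {
            CharClass current = classify(text.charAt(i));
            if (current != previous) {
                if (start != -1) {
                    tokens.add(text.substring(start, i));
                }
                start = (current == CharClass.WHITESPACE) ? -1 : i;
            }
            previous = current;
        }
        if (start != -1) {
            tokens.add(text.substring(start)); // flush the final token
        }
        return tokens.toArray(new String[0]);
    }

    public static void main(String[] args) {
        for (String token : tokenize("Hi, it costs 42USD.")) {
            System.out.println(token);
        }
    }
}
```

Splitting purely on whitespace would keep "Hi," and "42USD." as single tokens; switching on character class separates the punctuation and splits "42" from "USD".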
Span class A class in the opennlp.tools.util package used to store a pair of integer offsets, start and end, representing the position of a token, sentence, or other textual element.
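
To make the offset idea concrete, the sketch below defines a small stand-in class and uses it to recover two sentences from a string. `OffsetSpan` and its methods are hypothetical names chosen for this example and are not the OpenNLP Span class. The sketch assumes the end offset is exclusive (one past the last character), which is also how OpenNLP documents the end of a Span.

```java
// Hypothetical stand-in for illustration; not OpenNLP's Span class.
final class OffsetSpan {
    private final int start; // offset of the first character (inclusive)
    private final int end;   // offset one past the last character (exclusive)

    OffsetSpan(int start, int end) {
        if (start < 0 || end < start) {
            throw new IllegalArgumentException("invalid span: " + start + ".." + end);
        }
        this.start = start;
        this.end = end;
    }

    int length() {
        return end - start;
    }

    // Returns the part of the text that this span covers.
    String coveredText(String text) {
        return text.substring(start, end);
    }
}

class OffsetSpanDemo {
    public static void main(String[] args) {
        String text = "Hello world. How are you?";
        // Offsets written by hand for this example.
        OffsetSpan first = new OffsetSpan(0, 12);
        OffsetSpan second = new OffsetSpan(13, 25);
        System.out.println(first.coveredText(text));
        System.out.println(second.coveredText(text));
    }
}
```

Storing offsets rather than copied substrings keeps each detected element tied to its position in the original text.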
tag() method A method of the POSTaggerME class used to assign POS tags to an array of tokens.
TokenNameFinderModel class A class representing a predefined model used to find named entities, such as en-ner-person.bin.
TokenizerME class A tokenizer class that converts raw text into separate tokens using a maximum entropy model. It requires a model file (en-token.bin).
TokenizerModel class A class representing the predefined model (en-token.bin) used to tokenize a given sentence.
Tokenization The process of chopping a given sentence into smaller parts, known as tokens.
tokenize() method A method present in tokenizer classes (SimpleTokenizer, WhitespaceTokenizer, TokenizerME) used to tokenize raw text into an array of string tokens.
WhitespaceTokenizer class A tokenizer class that splits the given text into tokens at whitespace.