Glossary of Key Terms
| Term | Definition |
| --- | --- |
| Ad-hoc retrieval | A problem in IR where a user enters a query in natural language, and the system returns documents related to the required information. |
| Agreement (Concord) | When a word changes form depending on the other words to which it relates; making the value of a grammatical category agree between different words. |
| Ambiguity | The property of being open to more than one interpretation. |
| Anaphoric Ambiguity | Ambiguity that arises from the use of anaphoric expressions in discourse, where a reference like “it” could point to multiple antecedents. |
| Antonymy | The relation between two lexical items having symmetry between their semantic components relative to an axis (e.g., rich/poor). |
| Aspect | A grammatical category related to the verb that defines the view taken of an event, such as perfective (complete) or imperfective (ongoing). |
| Automaton | An abstract self-propelled computing device that follows a predetermined sequence of operations automatically. |
| Bottom-up Parsing | A parsing strategy where the parser starts with the input symbols and tries to construct the parse tree up to the start symbol. |
| Chunking | The process of grouping POS-tagged tokens into short phrases (such as noun phrases) by labeling tokens, giving a shallow view of the structure of a sentence. |
| Coherence | The property of a text where utterances stick together to form a meaningful whole, characterized by relations between utterances and entities. |
| Context-Free Grammar (CFG) | A notation for describing languages, defined by a set of non-terminals, a set of terminals, a set of productions, and a start symbol. Context-free grammars are a superset of regular grammars. |
| Corpus | A large and structured set of machine-readable texts produced in a natural communicative setting. |
| Coreference Resolution | The task of finding referring expressions in a text that refer to the same entity. |
| Dependency Grammar | A grammar formalism where words are connected by directed links (dependencies), with the verb as the center of the clause structure. It lacks phrasal nodes. |
| Derivation | The sequence of production-rule applications that generates the input string from the start symbol during parsing. |
| Deterministic Finite Automaton (DFA) | A type of finite automaton where, for every state and input symbol, the state to which the machine will move is uniquely determined. |
| Discourse Processing | The subfield of NLP concerned with building theories and models of how utterances stick together to form coherent discourse. |
| Finite State Automaton (FSA) | An automaton having a finite number of states. |
| Grammar | In linguistics, the rules or principles with the help of which language works. |
| Hidden Markov Model (HMM) | A doubly-embedded stochastic model in which an underlying hidden stochastic process can only be observed through another set of stochastic processes that produce a sequence of observations. |
| Homonymy | The relation between words that have the same spelling or form but different and unrelated meanings (e.g., bat/bat). |
| Hyponymy | The relationship between a generic term (hypernym) and instances of that term (hyponyms), such as “color” (hypernym) and “blue” (hyponym). |
| Information Retrieval (IR) | The field, and the software systems built within it, that deals with the organization, storage, retrieval, and evaluation of information from document repositories, particularly textual information. |
| Inverted Index | A data structure that lists, for every word, all documents that contain it and the frequency of its occurrences in the document. |
| Lemmatization | The process of extracting the base form (lemma) of a word by removing inflectional endings using vocabulary and morphological analysis. |
| Lexical Ambiguity | The ambiguity of a single word that can be treated as different parts of speech or have different meanings (e.g., “silver” as a noun, adjective, or verb). |
| Lexical Semantics | The study of the meaning of individual lexical items (words, sub-words, affixes, phrases). |
| Machine Translation (MT) | The automatic translation of text from a source language into a target language. |
| Meaning Representation | A formal structure built from building blocks like entities, concepts, relations, and predicates to describe a situation and enable reasoning. |
| Morphemes | The smallest meaning-bearing units of a word, such as stems and affixes. |
| Morphological Parsing | The problem of recognizing that a word breaks down into smaller meaningful units called morphemes and producing a linguistic structure for it. |
| Morphology | The study of the structure, classification, and formation of words in a language. |
| Natural Language Processing (NLP) | The sub-field of Computer Science and AI concerned with enabling computers to understand and process human language. |
| Non-deterministic Finite Automaton (NDFA) | A type of finite automaton where, for a given state and input symbol, the machine can move to any combination of states. |
| Part-of-Speech (POS) Tagging | The process of assigning one of the parts of speech (e.g., noun, verb, adjective) to a given word in a sentence. |
| Parse Tree | A graphical depiction of a derivation, where the root is the start symbol, interior nodes are non-terminals, and leaf nodes are terminals. |
| Parsing | The process of analyzing strings of symbols in a natural language to check for conformity to the rules of a formal grammar and to build a structural representation. |
| Phrase Structure Grammar | Also known as Constituency Grammar, it is based on the constituency relation, where a sentence structure is viewed in terms of noun phrases (NP) and verb phrases (VP). |
| Polysemy | A word or phrase with different but related senses (e.g., “bank” as a financial institution or the building housing it). |
| Pragmatic Ambiguity | Ambiguity where the context of a phrase gives it multiple interpretations because the statement is not specific. |
| Pragmatic Analysis | The NLP phase that fits the actual objects/events in a given context with object references obtained during semantic analysis to resolve ambiguity. |
| Pragmatics | The study of the functions of language and its use in context. |
| PropBank | A corpus annotated with verbal propositions and their arguments, used for semantic role labeling. |
| Reference Resolution | The task of determining what entities are referred to by which linguistic expressions in a text. |
| Regular Expression (RE) | A language for specifying text search strings using a specialized syntax held in a pattern. |
| Relevance Feedback | A process in IR that uses the initial output of a query to gather user information about relevance to perform a new, improved query. |
| Semantic Ambiguity | Ambiguity that occurs when the meaning of words themselves can be misinterpreted, leading to different interpretations of a sentence. |
| Semantic Analysis | The NLP phase that draws the exact dictionary meaning from text and checks it for meaningfulness. |
| Semantics | The study of how meaning is conveyed in language. |
| Stemming | A heuristic process of extracting the base form of words by chopping off their ends. |
| Stochastic POS Tagging | A tagging technique that uses frequency or probability to disambiguate words, often based on word frequency or tag sequence probabilities (n-grams). |
| Synonymy | The relation between two lexical items having different forms but expressing the same or a close meaning (e.g., author/writer). |
| Synsets | In WordNet, sets of cognitive synonyms into which nouns, verbs, adjectives, and adverbs are grouped. |
| Syntactic Ambiguity | Ambiguity that occurs when a sentence can be parsed in different ways. |
| Syntax | The study of the order and arrangement of words into larger units like clauses and phrases. |
| Syntax Analysis | The NLP phase that checks if a sentence is well-formed and breaks it into a structure showing syntactic relationships between words. Also known as parsing. |
| Tense | A grammatical category related to the verb that indicates the time of an action (present, past, future). |
| Tokenization | The process of breaking given text into smaller units called tokens, which can be words, numbers, or punctuation marks. |
| Top-down Parsing | A parsing strategy where the parser starts constructing the parse tree from the start symbol and tries to transform it to match the input. |
| Transformation-based Tagging | Also called Brill tagging, a rule-based algorithm for automatic POS tagging in which the rules are automatically induced from data and each rule transforms one tagging state into another. |
| TreeBank | A linguistically parsed text corpus that annotates the syntactic or semantic structure of sentences. |
| Vector Space Model | An IR model where documents and queries are represented as vectors in a high-dimensional space, and similarity is often calculated using the cosine of the angle between them. |
| VerbNet (VN) | A large, hierarchical, domain-independent lexical resource for English verbs that incorporates both semantic and syntactic information. |
| Word Sense Disambiguation (WSD) | The ability to determine which meaning of a word is activated by its use in a particular context. |
| WordNet | A lexical database for English where nouns, verbs, adjectives, and adverbs are grouped into sets of cognitive synonyms (Synsets). |
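
The Inverted Index and Vector Space Model entries above can be illustrated with a minimal Python sketch. The whitespace tokenizer, the three toy documents, and the function names (`tokenize`, `build_inverted_index`, `cosine`, `rank`) are assumptions made for illustration. The sketch uses raw term frequencies, whereas a practical IR system would normally add weighting such as TF-IDF.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    """Very rough tokenizer (assumption): lowercase and split on whitespace."""
    return text.lower().split()


def build_inverted_index(docs):
    """Map each term to {doc_id: frequency of the term in that document}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for term, freq in Counter(tokenize(text)).items():
            index[term][doc_id] = freq
    return index


def cosine(a, b):
    """Cosine of the angle between two sparse term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def rank(query, docs):
    """Return doc ids ordered by cosine similarity to the query, best first."""
    q = Counter(tokenize(query))
    scores = {d: cosine(q, Counter(tokenize(t))) for d, t in docs.items()}
    return sorted(scores, key=scores.get, reverse=True)


# Toy collection (hypothetical data).
docs = {
    "d1": "the cat sat on the mat",
    "d2": "the dog chased the cat",
    "d3": "stock markets fell sharply",
}
index = build_inverted_index(docs)
```

Looking up `index["cat"]` lists every document containing the term together with its frequency there, and `rank("stock markets", docs)` orders the documents by cosine similarity to the query vector.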