Glossary of Key Terms
| Term | Definition |
| --- | --- |
| Adjusted R-squared | A measure of fit in multiple regression that accounts for the number of independent variables in the model, penalizing for the loss of degrees of freedom as more variables are added. |
| Autocorrelation | Also known as serial correlation, it occurs when the error term in one period is related to the error term in another period. This violates a CLRM assumption. |
| Autoregressive Model | A type of dynamic model where the dependent variable is assumed to be influenced by the contemporaneous value of an independent variable and the lagged (previous) value of the dependent variable itself. |
| BLUE (Best Linear Unbiased Estimator) | A property that the Gauss-Markov theorem establishes for OLS estimators when the CLRM assumptions hold. It signifies that the estimators have the smallest variance among all linear unbiased estimators (Best), are Linear functions of the observed values of the dependent variable, and are Unbiased (equal to their true values on average). |
| Censored Dependent Variable | A limited dependent variable where information is lost because some actual values are suppressed and reported at a minimum or maximum threshold, though the observation remains in the sample. |
| Central Limit Theorem (CLT) | A statistical concept stating that if random samples of size n are drawn from a population, then when n is large, the distribution of the sample mean is approximately normal, regardless of the population’s original distribution. |
| Chow Test | A statistical test used to check for the structural stability of a model, determining if the parameters of the model change between different groups or time periods. |
| Classical Linear Regression Model (CLRM) Assumptions | A set of assumptions required for OLS estimators to be BLUE. Key assumptions include linearity in parameters, random sampling, no perfect multicollinearity, zero conditional mean of the error, homoskedasticity, and no autocorrelation. |
| Coefficient of Determination (R-squared) | A measure of goodness of fit that represents the proportion of variation in the dependent variable that is explained by the variation in the independent variables. |
| Confidence Interval | A range of values calculated from sample data using a procedure that, across repeated samples, produces intervals containing the true population parameter a stated percentage of the time (e.g., 95%). |
| Consistent Estimator | An estimator that converges in probability to the true parameter value as the sample size grows without bound. |
| Correlation Coefficient | A measure used to determine the strength and direction of the linear relationship between two variables, with a value between -1 and +1. |
| Covariance | A measure of how two random variables vary with one another, calculated as the expected product of each variable's deviation from its own mean. Its sign indicates a positive, negative, or no linear relationship. |
| Cross-Sectional Data | Data consisting of measurements for individual observations (persons, firms, countries, etc.) at a given point in time. |
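| Cumulative Distribution Function (CDF) | A function of a random variable that gives the accumulated probability up to a certain value, that is, the probability that the variable is less than or equal to that value. Some texts call it the cumulative density function. |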
| Data Mining | The practice of searching data for relationships between variables without a supporting theoretical structure, which is generally viewed unfavorably in econometrics. |
| Dependent Variable (Y) | The outcome of interest in an econometric model, which is hypothesized to be affected by the independent variables. |
| Difference-in-Difference (D-in-D) | An estimation technique used with pooled cross-sections in a natural experiment to identify a policy or treatment effect by “differencing out” preexisting group differences and time-period effects. |
| Dummy Variable | A dichotomous variable that takes on a value of 1 if a particular qualitative characteristic is present and 0 otherwise, used to quantify non-numeric information. |
| Durbin-Watson (DW) Test | A statistical test used to detect the presence of first-order autocorrelation, AR(1), in the residuals of a regression model. |
| Econometrics | A field based on a “theoretical-quantitative and empirical-quantitative approach to economic problems,” which uses statistical methods to quantify economic relationships, test theories, and make predictions. |
| Efficient Estimator | An estimator that achieves the smallest variance among all unbiased estimators in its class (for example, among all linear unbiased estimators). |
| Expected Value (Mean) | A measure of central tendency for a random variable; the average value of a random variable, calculated as a weighted average of all possible values with the probabilities of those values as weights. |
| Feasible Generalized Least Squares (FGLS) | An estimation procedure used to correct for heteroskedasticity or autocorrelation when the exact form of the problem is unknown and must be estimated from the data. |
| Fixed Effects (FE) Estimator | A panel data estimation method that eliminates unobserved, time-invariant individual characteristics (fixed effects) by time-demeaning the data. |
| Functional Form | The mathematical form of the relationship between the dependent and independent variables in a model (e.g., linear, quadratic, log-log). |
| Gauss-Markov Theorem | A theorem stating that OLS estimators are the Best Linear Unbiased Estimators (BLUE) given that the CLRM assumptions hold. |
| Heckman Selection Model | An econometric model used to correct for selection bias that arises from a truncated dependent variable, particularly when self-selection determines inclusion in the sample. |
| Heteroskedasticity | A violation of a CLRM assumption where the variance of the error term changes in response to a change in the value(s) of the independent variable(s). |
| Homoskedasticity | A CLRM assumption stating that the error term has a constant variance regardless of the value(s) taken by the independent variable(s). |
| Independent Variable (X) | A factor or explanatory variable in an econometric model that is believed to cause changes in the dependent variable. |
| Interaction Term | A variable created as the product of two or more independent variables, used to allow the effect of one variable on the dependent variable to depend on the value of another variable. |
| Linear Probability Model (LPM) | A model estimated using OLS where the dependent variable is a dichotomous dummy variable. The predicted values are interpreted as conditional probabilities. |
| Logit Model | A nonlinear model for a qualitative dependent variable based on the logistic cumulative distribution function, estimated using Maximum Likelihood. |
| Maximum Likelihood (ML) Estimation | An estimation technique, used for models like probit and logit, that chooses parameter values that maximize the probability (likelihood) of observing the sample data. |
| Multicollinearity | A situation where a linear relationship exists between two or more independent variables. High multicollinearity can lead to large standard errors and unreliable coefficient estimates. |
| Normal Distribution | A continuous, symmetrical, bell-shaped probability distribution that is central to many statistical inference procedures in econometrics. |
| Normality Assumption | The CLRM assumption that the error term follows a normal distribution. It is not required for OLS to be BLUE but is necessary for exact hypothesis tests (such as t and F tests) in small samples; in large samples the tests remain approximately valid without it. |
| Omitted Variable Bias | Bias in OLS coefficient estimates that occurs when a relevant variable that affects the dependent variable and is correlated with an included independent variable is excluded from the model. |
| Ordinary Least Squares (OLS) | The most common econometric method for estimating a sample regression function. It works by choosing coefficient values that minimize the sum of the squared residuals. |
| p-value | The lowest level of significance at which a null hypothesis could be rejected given a calculated test statistic. A small p-value provides evidence against the null hypothesis. |
| Panel (Longitudinal) Data | Data that contains a time series for each cross-sectional unit in the sample; the same individual units are observed over a period of time. |
| Population Regression Function (PRF) | The theoretical function that defines the true relationship between the dependent and independent variables for the entire population. It is what the sample regression function attempts to estimate. |
| Probit Model | A nonlinear model for a qualitative dependent variable based on the standard normal cumulative distribution function, estimated using Maximum Likelihood. |
| Random Effects (RE) Model | A panel data model that assumes unobserved individual effects are random and uncorrelated with the independent variables, allowing them to be treated as part of the composite error term. |
| Robust Standard Errors | Also known as White-corrected standard errors, they are an adjustment to OLS standard errors to make them valid for inference in the presence of heteroskedasticity of an unknown form. |
| Sample Regression Function (SRF) | The estimated version of the Population Regression Function, calculated using sample data. |
| Spurious Correlation | A situation where two variables coincidentally have a statistical relationship but one doesn’t cause the other, often because both are trending over time due to a third, unobserved factor. |
| Stata | A popular command-driven statistical software program used extensively for performing econometric analysis. The name is often written in capitals as STATA. |
| Time-Series Data | Data consisting of measurements on one or more variables over time in a given space (e.g., a country’s GDP from 1950-2020). |
| Tobit Model | An econometric model used for censored dependent variables, which uses Maximum Likelihood estimation to account for the threshold values. |
| Truncated Dependent Variable | A limited dependent variable where observations are entirely missing from the sample if the value of the variable is above or below a certain threshold. |
| Unbiased Estimator | An estimator whose expected value (the mean of its sampling distribution) is equal to the true population parameter value. |
| Variance Inflation Factor (VIF) | A measure used to detect the severity of multicollinearity by calculating how much the variance of an estimated regression coefficient is increased due to collinearity. |
| Weighted Least Squares (WLS) | A method that transforms an original model with heteroskedasticity into one that is homoskedastic by weighting each observation based on information about the error variance. |
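
Several of the terms above (Ordinary Least Squares, R-squared, Adjusted R-squared, and the Variance Inflation Factor) are defined by short formulas, so a small numerical sketch may help tie them together. The example below is illustrative only: it uses NumPy on simulated data, and the helper names `ols_fit` and `vif`, the sample size, and the true coefficients (1, 2, -1) are all assumptions made for the illustration rather than anything taken from the glossary.

```python
import numpy as np


def ols_fit(X, y):
    """Estimate y = b0 + X b by OLS (minimizing the sum of squared residuals).

    Returns the coefficient vector (intercept first), R-squared,
    and adjusted R-squared.
    """
    n, k = X.shape                                  # k = number of independent variables
    Xc = np.column_stack([np.ones(n), X])           # add an intercept column
    beta, *_ = np.linalg.lstsq(Xc, y, rcond=None)   # least-squares solution
    resid = y - Xc @ beta
    ssr = float(resid @ resid)                      # sum of squared residuals
    sst = float(((y - y.mean()) ** 2).sum())        # total variation in y
    r2 = 1.0 - ssr / sst
    # Adjusted R-squared penalizes the degrees of freedom lost to k regressors.
    adj_r2 = 1.0 - (1.0 - r2) * (n - 1) / (n - k - 1)
    return beta, r2, adj_r2


def vif(X, j):
    """VIF for column j: 1 / (1 - R_j^2), where R_j^2 comes from
    regressing column j on the remaining independent variables."""
    others = np.delete(X, j, axis=1)
    _, r2_j, _ = ols_fit(others, X[:, j])
    return 1.0 / (1.0 - r2_j)


# Simulated data (hypothetical): two correlated regressors and a known PRF.
rng = np.random.default_rng(0)
n = 200
x1 = rng.normal(size=n)
x2 = 0.8 * x1 + 0.6 * rng.normal(size=n)            # correlated with x1
y = 1.0 + 2.0 * x1 - 1.0 * x2 + rng.normal(size=n)
X = np.column_stack([x1, x2])

beta, r2, adj_r2 = ols_fit(X, y)
print("coefficients:", beta)
print("R-squared:", r2, "adjusted R-squared:", adj_r2)
print("VIF for x1:", vif(X, 0))
```

Because `x2` is built to be correlated with `x1`, the VIF comes out above 1, and the adjusted R-squared sits slightly below the R-squared, as the definitions in the table suggest it should.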