Key econometric ideas needed for QMM

(A review)

Andrew Pua

2025-10-27

Where we are coming from

  1. We want you to be able to use the QMM.
  2. We want you to be able to criticize its components – and there are many moving parts.
  3. We want you to know which questions can be answered by the QMM.
  4. We want you to know where the points of improvement are, so that this encourages further research.

Day 1

  1. Opening remarks

  2. What are regressions?

  3. Core ideas behind regressions lead to a lot of advanced topics

    • Base technology for QMM
    • Further exploration for you
  4. How would things change in the time series case?

  5. Revisiting the old QMM

Pearson and Lee heights dataset

Try a stem-and-leaf plot of the strip.


  The decimal point is at the |

  64 | 5
  66 | 479
  68 | 001334467712237
  70 | 00023479901244555888
  72 | 011236688902668
  74 | 06
  76 | 4
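A minimal R sketch of producing such a display with base R's stem() function; the vector strip_heights and its values are purely illustrative stand-ins for the heights in one strip of the Pearson and Lee data.

# Illustrative heights (in inches) for one vertical strip of the scatterplot
strip_heights <- c(64.5, 66.4, 66.7, 66.9, 68.0, 68.1, 68.3, 69.2,
                   70.0, 70.2, 70.4, 71.5, 71.8, 72.1, 72.6, 74.0)

# stem() prints a text stem-and-leaf display like the one above
stem(strip_heights)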

Graphing averages for small strips

District income and test scores

The beginning

  • When can things go wrong?
  • We have seen different types of regressions – linear, polynomial, nonparametric
  • There are still more. Which should you choose?
  • What if we want to make predictions outside the range available from the data?
  • Do the slopes and intercepts of the lines we have seen so far have any meaning aside from being summaries of the data?

Prediction problems

  • Think of \(Y\) as a characteristic of interest, and suppose you have other information in the form of another random variable \(X_{1}\).

  • Suppose we have two random variables \(X_{1}\) and \(Y\), which follow a joint distribution \(f_{X_{1},Y}\left(x_{1},y\right)\).

  • Suppose you draw a unit at random from the sub-population of units with \(X_{1}=x_{1}\). Your task is to predict this unit’s \(Y\).

  • How do we accomplish this task optimally?

Prediction under squared error loss

  • Consider a prediction rule of the form \(\beta_{0}+\beta_{1}X_{1}\).
  • The task is now to find the unique solution to the following optimization problem: \[\min_{\beta_{0},\beta_{1}}\mathbb{E}\left[\left(Y-\beta_{0}-\beta_{1}X_{1}\right)^{2}\right].\]

Prediction under squared error loss

  • Provided that \(\mathsf{Var}\left(X_{1}\right)>0\), the optimal solution \(\left(\beta_{0}^{*},\beta_{1}^{*}\right)\) solves the following first-order conditions: \[\begin{eqnarray} \mathbb{E}\left(Y\right)-\beta_{0}^{*}-\beta_{1}^{*}\mathbb{E}\left(X_{1}\right) &=& 0,\\ \mathbb{E}\left(X_{1}Y\right)-\beta_{0}^{*}\mathbb{E}\left(X_{1}\right)-\beta_{1}^{*}\mathbb{E}\left(X_{1}^{2}\right) &=& 0. \end{eqnarray}\]
  • As a result, we have \[\beta_{0}^{*} = \mathbb{E}\left(Y\right)-\beta_{1}^{*}\mathbb{E}\left(X_{1}\right),\qquad\beta_{1}^{*}=\dfrac{\mathsf{Cov}\left(X_{1},Y\right)}{\mathsf{Var}\left(X_{1}\right)}.\]
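A minimal R sketch, on simulated data with hypothetical parameter values, confirming that the OLS coefficients from lm() coincide with the sample analogues of these formulas.

set.seed(1)
n  <- 10000
x1 <- rnorm(n, mean = 2, sd = 1.5)
y  <- 1 + 0.7 * x1 + rnorm(n)          # any joint distribution with Var(X1) > 0 would do

# Sample analogues of beta1* = Cov(X1, Y)/Var(X1) and beta0* = E(Y) - beta1* E(X1)
b1_hat <- cov(x1, y) / var(x1)
b0_hat <- mean(y) - b1_hat * mean(x1)

c(b0_hat, b1_hat)
coef(lm(y ~ x1))                        # identical coefficients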

The linear regression model

  • We were able to obtain a predictive relationship between \(X_{1}\) and \(Y\).
  • We can now write this relationship as \[Y=\beta_{0}^{*}+\beta_{1}^{*}X_{1}+u,\] where \(u\) satisfies \(\mathbb{E}\left(u\right) =0\) and \(\mathsf{Cov}\left(X_{1},u\right)=0\) by design or by construction.
  • This predictive relationship is called the linear regression model.

Adding more flexibility

  • Linear prediction rules of the form \(\beta_0+\beta_1 X_1\) can be restrictive to some extent.
    • Can we do better?
    • We can add polynomial terms \(X_1^2, X_1^3, \ldots, X_1^p, \ldots\).
  • The optimization problem becomes \[\min_{\beta_{0},\beta_{1}, \ldots, \beta_p}\mathbb{E}\left[\left(Y-\beta_{0}-\beta_{1}X_{1}-\beta_2X_1^2-\cdots-\beta_p X_1^p\right)^{2}\right].\]
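A minimal R sketch of fitting such a polynomial prediction rule by least squares; the data-generating process and the choice p = 3 are assumed for illustration.

set.seed(2)
x1 <- runif(500, -2, 2)
y  <- sin(2 * x1) + rnorm(500, sd = 0.3)       # a nonlinear relationship

fit_linear <- lm(y ~ x1)                       # prediction rule beta0 + beta1 x1
fit_cubic  <- lm(y ~ poly(x1, 3, raw = TRUE))  # adds x1^2 and x1^3

# The more flexible rule attains a smaller in-sample mean squared error
mean(residuals(fit_linear)^2)
mean(residuals(fit_cubic)^2)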

Adding more flexibility

  • As \(p\to\infty\), the best linear predictor converges in mean square to an object called the conditional expectation \(\mathbb{E}\left(Y|X_1\right)\), i.e. \[\lim_{p\to\infty} \mathbb{E}\left[\left(\mathbb{E}\left(Y|X_1\right)-\left(\beta_0^*+\beta_1^*X_1+\cdots+\beta_p^*X_1^p\right)\right)^2\right]=0.\]

  • It can be shown that \(\mathbb{E}\left(Y|X_1\right)\) solves the following optimization problem: \[\min_{g}\mathbb{E}\left[\left(Y-g\left(X_1\right)\right)^{2}\right].\]
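A sketch of this convergence on simulated data with an assumed conditional mean: as the polynomial order p grows, the mean squared gap between the fitted predictor and \(\mathbb{E}\left(Y|X_1\right)\) shrinks.

set.seed(3)
x1 <- runif(5000, -1, 1)
ce <- exp(x1)                        # E(Y | X1) in this simulated example
y  <- ce + rnorm(5000, sd = 0.5)

# Mean squared distance between the fitted polynomial and the true conditional mean
msq_gap <- sapply(1:6, function(p) {
  fit <- lm(y ~ poly(x1, p, raw = TRUE))
  mean((fitted(fit) - ce)^2)
})
round(msq_gap, 4)                    # decreases as p increases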

Conditional expectations vs linear regressions

  • The best predictor of \(Y\) given \(X_1\) is \(\mathbb{E}\left(Y|X_1\right)\).

  • The best linear predictor of \(Y\) given \(X_1\) is \(\beta_0^*+\beta_1^*X_1\).

  • There is no guarantee that these two predictors are the same.

  • In practice, we either:

    • Assume they are the same.
    • Accept they are not the same.

Conditional expectations vs linear regressions

  • Let \(Y=10X_1^2+W\), where \(W\) and \(X_1\) are independent and \(\mathbb{E}\left(W\right)=0\).
  • It can be shown that \(\mathbb{E}\left(Y|X_1\right)=10X^{2}_1\).
  • But the best linear predictor varies depending on where you can “find” values of \(X_1\); the sketch below and the plots that follow illustrate this.
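A minimal R sketch of this example; the first two uniform designs match the plots on the next slides, while the sample size is an arbitrary choice.

set.seed(4)
n <- 100000
w <- rnorm(n)                        # E(W) = 0, independent of X1

x_pos <- runif(n, 0, 1)              # X1 ~ U(0, 1)
x_sym <- runif(n, -1, 1)             # X1 ~ U(-1, 1)
y_pos <- 10 * x_pos^2 + w
y_sym <- 10 * x_sym^2 + w

coef(lm(y_pos ~ x_pos))              # slope close to 10, intercept close to -5/3
coef(lm(y_sym ~ x_sym))              # slope close to 0, intercept close to 10/3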

\(X_1\sim U\left(0,1\right)\)

\(X_1\sim U\left(-1,1\right)\)

\(X_1\sim f_{X_1}\)

What should we do?

  • It is very simple to take the best linear predictor to the data: apply OLS.

  • Extrapolation could be tricky. Suppose we cannot observe \(X_1=x_1\).

  • We could not compute \(\mathbb{E}\left(Y|X_1=x_1\right)\) even if we had access to the entire population. This is a situation where \(\mathbb{E}\left(Y|X_1=x_1\right)\) cannot be identified.

  • But we could still compute \(\beta_{0}^{*}+\beta_{1}^{*}x_1\).

What should we do?

  • Getting \(\mathbb{E}\left(Y|X_1\right)\) right is a priority for prediction.

  • Estimation and inference could also be tricky.

    • Curse of dimensionality: What happens when you have other random variables in addition to \(X_1\)?
    • Tuning parameters: The difference between the discrete and continuous case matters.

In typical practice

  • What people usually do, then, is aspire to correct specification.

  • Correct specification means that the best linear predictor is the same as the best predictor.

Are these relevant for QMM?

  • Aren’t we supposed to be working with time series?

  • You have to think about the conditioning set.

    • Must think about correct specification.
    • But the conditioning set now involves dynamics, meaning past and present information.
  • You have to determine if OLS will still work.

    • Presence of dynamics can complicate the use of OLS.
    • Scarcity of data can affect OLS performance.

The conditioning set

  1. Correct dynamic specification is \[\begin{eqnarray*}&&\mathbb{E}\left(Y_{t}|Y_{t-1},Y_{t-2},\ldots,X_{t-1},X_{t-2},\ldots\right) \\ &=&\beta_{0}+\beta_{1}Y_{t-1}+\beta_{2}Y_{t-2}+\cdots+\beta_{p}Y_{t-p}\\ &&+\delta_{1}X_{t-1}+\delta_{2}X_{t-2}+\cdots+\delta_{q}X_{t-q}.\end{eqnarray*}\]

  2. Leads to autoregressive distributed lag model ARDL(p,q): \[\begin{eqnarray*}Y_{t} &=& \beta_{0}+\beta_{1}Y_{t-1}+\beta_{2}Y_{t-2}+\cdots+\beta_{p}Y_{t-p}\\ &&+\delta_{1}X_{t-1}+\delta_{2}X_{t-2}+\cdots+\delta_{q}X_{t-q}+u_{t}\end{eqnarray*}\]
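A minimal sketch of estimating an ARDL(1,1) by OLS in R; the simulated series and the coefficient values are hypothetical, and the lags are built by hand.

set.seed(5)
T <- 120
x <- as.numeric(arima.sim(list(ar = 0.5), n = T))   # hypothetical regressor series
y <- numeric(T)
for (t in 2:T) y[t] <- 0.2 + 0.6 * y[t - 1] + 0.3 * x[t - 1] + rnorm(1)

# Align Y_t with Y_{t-1} and X_{t-1}, then apply OLS
d <- data.frame(y = y[-1], y_lag1 = y[-T], x_lag1 = x[-T])
ardl_fit <- lm(y ~ y_lag1 + x_lag1, data = d)
summary(ardl_fit)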

Implications of correct dynamic specification

  1. The error term satisfies \(\mathbb{E}\left(u_{t}|Y_{t-1},Y_{t-2},\ldots,X_{t-1},X_{t-2},\ldots\right)=0\).

  2. An implication is that \(u_t\) is uncorrelated with its own past.

  3. Therefore, this implication can be used to provide evidence against correct dynamic specification.

  4. All other lags \(Y_{t-k}\) with \(k>p\) and \(X_{t-k}\) with \(k>q\) have no additional predictive value for \(Y_t\).
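A sketch of turning points 2 and 3 above into a specification check: look at the serial correlation of the OLS residuals. It continues the hypothetical ARDL(1,1) fit from the previous sketch; Box.test() is base R, and lmtest::bgtest() is a common alternative.

u_hat <- residuals(ardl_fit)

acf(u_hat, main = "Residual ACF")               # large spikes suggest misspecification
Box.test(u_hat, lag = 8, type = "Ljung-Box")    # a small p-value is evidence against
                                                # correct dynamic specification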

Implications of correct dynamic specification

  1. Practically implementing ARDL(p,q) is different from usual OLS because

    • Must figure out \(p\) and \(q\) in advance.
    • \(u_t\) must have no serial correlation.
    • How many \(X\)’s should we include?
    • Real time series in developing countries are extremely short.

But …

  1. There is another version of an ARDL(p,q): \[\begin{eqnarray*}Y_{t} &=& \beta_{0}+\beta_{1}Y_{t-1}+\beta_{2}Y_{t-2}+\cdots+\beta_{p}Y_{t-p}\\ &&{\color{red}{+\delta_0 X_{t}}}+\delta_{1}X_{t-1}+\delta_{2}X_{t-2}+\cdots+\delta_{q}X_{t-q}+u_{t}\end{eqnarray*}\]

  2. Where does this come from?

  3. When you move to vector autoregressive models, this class of models appears if you focus on the conditional version of one of the variables in your vector.

Revisiting the bivariate normal case

  • You can show that when \[\left(\begin{array}{c} X_{1}\\ Y \end{array}\right)\sim N\left(\left(\begin{array}{c} \mu_{1}\\ \mu_{Y} \end{array}\right),\left(\begin{array}{cc} \sigma_{1}^{2} & \rho\sigma_{1}\sigma_{Y}\\ \rho\sigma_{1}\sigma_{Y} & \sigma_{Y}^{2} \end{array}\right)\right),\] we must have \[\mathbb{E}\left(Y|X_1\right)=\mu_Y+\rho\frac{\sigma_Y}{\sigma_1}\left(X_1-\mu_1\right)=\underbrace{\mu_Y-\rho\frac{\sigma_Y}{\sigma_1}\mu_1}_{\beta_0}+\underbrace{\rho\frac{\sigma_Y}{\sigma_1}}_{\beta_1}X_1.\]
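A minimal R sketch checking the slope \(\rho\sigma_Y/\sigma_1\) on simulated bivariate normal data; mvrnorm() from the MASS package and the parameter values are assumptions made for illustration.

library(MASS)
set.seed(6)
rho <- 0.6; s1 <- 2; sY <- 3
Sigma <- matrix(c(s1^2, rho * s1 * sY,
                  rho * s1 * sY, sY^2), nrow = 2)

z  <- mvrnorm(n = 100000, mu = c(1, 5), Sigma = Sigma)
x1 <- z[, 1]; y <- z[, 2]

coef(lm(y ~ x1))        # slope close to rho * sY / s1 = 0.9
rho * sY / s1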

Revisiting the bivariate normal case

  • In addition, the conditional variance is actually a constant under bivariate normality: \[\mathsf{Var}\left(Y|X_1\right)=\left(1-\rho^2\right)\sigma_Y^2.\]

  • Finally, the conditional distribution is also normally distributed, so that: \[Y|X_1=x_1 \sim N\left(\beta_0+\beta_1x_1, \left(1-\rho^2\right)\sigma_Y^2\right).\]

Revisiting the bivariate normal case

  • We can already write \[Y=\beta_0+\beta_1 X_1+u\] with \[\mathbb{E}\left(u|X_1\right)=0\] because we have correct specification of the conditional mean.

Revisiting the bivariate normal case

  • In the next slides, you will see some pictures of the bivariate normal for the case where \(\mu_1=\mu_Y=0\) and \(\sigma^2_1=\sigma^2_Y=1\). Therefore, the only thing that changes in the pictures is \(\rho\).

    • \(\rho\) takes values \(\{-0.9,-0.6,0,0.6,0.9,0.99\}\).
    • \(X_1\) is on the horizontal axis and \(Y\) is on the vertical axis.
    • The dashed lines represent \(\mathbb{E}\left(Y|X_1\right)\) and the dotted lines represent \(\mathbb{E}\left(X_1|Y\right)\).

Contour plots for the bivariate normal

Will OLS work?

  • Suppose you are interested in estimating the parameters of a first-order autoregression or AR(1) process \(Y_{t}=\beta_0+\beta_1 Y_{t-1}+u_{t}\).

  • To give you a sense of what the data on \(\left\{ Y_{t}\right\}_{t=1}^{T}\) would look like, I generate artificial data.

Will OLS work?

  • Here are some pictures where \(\beta_0=0\) and \(\beta_1\) can be 0, 0.5, 0.95, and 1.

  • I assume that \(u_{t}\sim N\left(0,1\right)\) and \(Y_{0}\sim N\left(0,1\right)\).

  • You will see two plots side-by-side. One is a time-series plot where \(Y_{t}\) is plotted against \(t\) and the other is a scatterplot where \(Y_{t}\) is plotted against \(Y_{t-1}\).

  • To enhance comparability, I fix the set of randomly drawn \(u_{t}\)’s and \(Y_{0}\)’s across cases.
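A minimal R sketch of generating one such series; the sample size is arbitrary, and the same draws of \(u_t\) and \(Y_0\) are reused for every \(\beta_1\), as described above.

set.seed(7)
T  <- 200
u  <- rnorm(T)                      # shocks, reused across beta1 values
y0 <- rnorm(1)                      # initial condition, also reused

simulate_ar1 <- function(beta1, beta0 = 0) {
  y <- numeric(T)
  y[1] <- beta0 + beta1 * y0 + u[1]
  for (t in 2:T) y[t] <- beta0 + beta1 * y[t - 1] + u[t]
  y
}

# Time-series plot and scatterplot of Y_t against Y_{t-1}, as in the next slides
y <- simulate_ar1(0.95)
par(mfrow = c(1, 2))
plot(y, type = "l", xlab = "t", ylab = expression(Y[t]))
plot(y[-T], y[-1], xlab = expression(Y[t - 1]), ylab = expression(Y[t]))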

\(Y_t=u_t\)

\(Y_t=0.5Y_{t-1}+u_t\)

\(Y_t=0.95Y_{t-1}+u_t\)

\(Y_t=Y_{t-1}+u_t\)

Behavior of OLS

            beta1=0 beta1=0.5 beta1=0.95 beta1=1
mean.ols    -0.0268     0.433     0.8314  0.8740
mean.reg.se  0.1626     0.146     0.0868  0.0746
sd.ols       0.1595     0.150     0.1115  0.0992
p.vals       0.0650     0.070     0.1880  0.2920
             beta1=0 beta1=0.5 beta1=0.95 beta1=1
mean.ols    -0.00368    0.4882     0.9247  0.9677
mean.reg.se  0.07957    0.0694     0.0297  0.0188
sd.ols       0.07570    0.0663     0.0334  0.0274
p.vals       0.04300    0.0350     0.0870  0.3010
             beta1=0 beta1=0.5 beta1=0.95 beta1=1
mean.ols    -0.00228    0.4964     0.9445 0.99160
mean.reg.se  0.03959    0.0344     0.0129 0.00483
sd.ols       0.03922    0.0342     0.0133 0.00679
p.vals       0.04900    0.0460     0.0490 0.29300
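A hedged sketch of the kind of Monte Carlo that produces rows like mean.ols, mean.reg.se, sd.ols, and p.vals; the sample size, the number of replications, and the 5% nominal level below are placeholders, not the settings used for the tables above.

set.seed(8)
mc_ar1 <- function(beta1, T = 100, R = 1000) {
  est <- se <- reject <- numeric(R)
  for (r in 1:R) {
    u <- rnorm(T); y <- numeric(T)
    y[1] <- beta1 * rnorm(1) + u[1]                 # Y_0 ~ N(0, 1)
    for (t in 2:T) y[t] <- beta1 * y[t - 1] + u[t]
    fit <- summary(lm(y[-1] ~ y[-T]))
    est[r] <- fit$coefficients[2, 1]                # OLS slope estimate
    se[r]  <- fit$coefficients[2, 2]                # its reported standard error
    reject[r] <- abs((est[r] - beta1) / se[r]) > qnorm(0.975)   # t-test of the true value
  }
  c(mean.ols = mean(est), mean.reg.se = mean(se),
    sd.ols = sd(est), p.vals = mean(reject))
}

sapply(c(0, 0.5, 0.95, 1), mc_ar1)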

A curiosity

  • The plot you just saw is for the case where \(T=640\) and \(\beta_1=1\).
  • The blue curve is the standard normal.
  • In this case, you reject the null \(H_{0}:\;\beta_1=1\) more often than you should.
  • Therefore you need new critical values to decide whether there is evidence for or against the null.

Sampling distribution of \(t\)-statistic under the null of a unit root

  • Dickey and Fuller (1979) have shown that when testing the null of a unit root, the asymptotic distribution of the test statistic under the null is nonstandard.

  • The asymptotic distribution of the test statistic under the null also changes depending on the presence or absence of deterministic variables in the autoregression and the nature of the null being tested.
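A sketch of how this is handled in practice; ur.df() from the urca package (assumed to be installed) reports the Dickey-Fuller critical values that match the chosen deterministic terms.

library(urca)
set.seed(9)
y <- cumsum(rnorm(250))                 # a simulated unit-root process

# 'type' selects the deterministic terms: "none", "drift", or "trend";
# summary() reports the corresponding nonstandard critical values
summary(ur.df(y, type = "drift", lags = 1))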

OLS results, first case


Call:
lm(formula = dyn(y1 ~ x1))

Residuals:
   Min     1Q Median     3Q    Max 
-3.588 -0.650  0.004  0.699  3.188 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)   
(Intercept)   0.0958     0.0313    3.06   0.0023 **
x1            0.0282     0.0320    0.88   0.3797   
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.988 on 998 degrees of freedom
Multiple R-squared:  0.000773,  Adjusted R-squared:  -0.000228 
F-statistic: 0.772 on 1 and 998 DF,  p-value: 0.38

OLS results, second case


Call:
lm(formula = dyn(y2 ~ x2))

Residuals:
   Min     1Q Median     3Q    Max 
-33.04 -11.37   0.24  10.73  26.51 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  30.4113     0.6112    49.8   <2e-16 ***
x2            0.8107     0.0147    55.1   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 13.1 on 998 degrees of freedom
Multiple R-squared:  0.752, Adjusted R-squared:  0.752 
F-statistic: 3.03e+03 on 1 and 998 DF,  p-value: <2e-16

What is cointegration?

Compare with non-cointegration
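A hedged sketch of the contrast behind these slides: two independent random walks (no cointegration) versus a pair that shares one stochastic trend (cointegration). The series and parameter values are illustrative and are not the y1, x1, y2, x2 used earlier.

set.seed(10)
T <- 1000

# No cointegration: two independent random walks
x_indep <- cumsum(rnorm(T))
y_indep <- cumsum(rnorm(T))

# Cointegration: y_coint and x_coint share a common stochastic trend,
# so y_coint - 0.8 * x_coint is stationary
x_coint <- cumsum(rnorm(T))
y_coint <- 30 + 0.8 * x_coint + rnorm(T, sd = 5)

par(mfrow = c(1, 2))
plot(residuals(lm(y_indep ~ x_indep)), type = "l",
     main = "No cointegration: residuals wander", ylab = "residual")
plot(residuals(lm(y_coint ~ x_coint)), type = "l",
     main = "Cointegration: residuals revert", ylab = "residual")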