(A review)
2025-10-27
Opening remarks
What are regressions?
The core ideas behind regressions lead to a lot of advanced topics.
How would things change in the time series case?
Revisiting the old QMM
Try a stem-and-leaf plot of the strip.
The decimal point is at the |
64 | 5
66 | 479
68 | 001334467712237
70 | 00023479901244555888
72 | 011236688902668
74 | 06
76 | 4
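A stem-and-leaf display like the one above can be produced with base R's stem(). This is a minimal sketch only: the vector strip below holds toy values standing in for whatever variable was actually plotted, not the course data.

# Minimal sketch (toy values, not the actual course data): base R's stem()
# draws a stem-and-leaf display of any numeric vector.
strip <- c(64.5, 66.4, 67.9, 68.0, 68.3, 69.7, 70.0, 70.4, 71.5, 72.1, 73.6, 74.0, 76.4)
stem(strip)   # the scale argument controls how finely the stems are split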
Think of \(Y\) as a characteristic of interest, and suppose you have other information in the form of another random variable \(X_{1}\).
Suppose we have two random variables \(X_{1}\) and \(Y\), which follow a joint distribution \(f_{X_{1},Y}\left(x_{1},y\right)\).
Suppose you draw a unit at random from the sub-population of units with \(X_{1}=x_{1}\). Your task is to predict this unit’s \(Y\).
How do we accomplish this task optimally?
As \(p\to\infty\), the best linear predictor of \(Y\) based on \(1, X_1, \ldots, X_1^p\) converges in mean square to an object called the conditional expectation \(\mathbb{E}\left(Y|X_1\right)\), i.e. \[\lim_{p\to\infty} \mathbb{E}\left(\mathbb{E}\left(Y|X_1\right)-\left(\beta_0^*+\beta_1^*X_1+\ldots+\beta_p^*X_1^p\right)\right)^2=0.\]
It can be shown that \(\mathbb{E}\left(Y|X_1\right)\) solves the following optimization problem: \[\min_{g}\mathbb{E}\left[\left(Y-g\left(X_1\right)\right)^{2}\right].\]
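A quick sketch of why: add and subtract \(\mathbb{E}\left(Y|X_1\right)\) inside the square; by the law of iterated expectations the cross term vanishes, so \[\begin{eqnarray*}\mathbb{E}\left[\left(Y-g\left(X_1\right)\right)^{2}\right] &=& \mathbb{E}\left[\left(Y-\mathbb{E}\left(Y|X_1\right)\right)^{2}\right]+\mathbb{E}\left[\left(\mathbb{E}\left(Y|X_1\right)-g\left(X_1\right)\right)^{2}\right]\\ &\geq& \mathbb{E}\left[\left(Y-\mathbb{E}\left(Y|X_1\right)\right)^{2}\right],\end{eqnarray*}\] with equality exactly when \(g\left(X_1\right)=\mathbb{E}\left(Y|X_1\right)\) almost surely.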
The best predictor of \(Y\) given \(X_1\) is \(\mathbb{E}\left(Y|X_1\right)\).
The best linear predictor of \(Y\) given \(X_1\) is \(\beta_0^*+\beta_1^*X_1\).
There is no guarantee that these two predictors are the same. For example, if \(X_1\sim N\left(0,1\right)\) and \(Y=X_1^2\), then \(\mathbb{E}\left(Y|X_1\right)=X_1^2\), while the best linear predictor is the constant \(1\).
In practice, we either:
It is very simple to take the best linear predictor to the data: apply OLS (a short sketch follows below).
Extrapolation could be tricky. Suppose we cannot observe \(X_1=x_1\).
We could not compute \(\mathbb{E}\left(Y|X_1=x_1\right)\) even if we had access to the entire population. This is a situation where \(\mathbb{E}\left(Y|X_1=x_1\right)\) cannot be identified.
But we could still compute \(\beta_{0}^{*}+\beta_{1}^{*}x_1\).
Getting \(\mathbb{E}\left(Y|X_1\right)\) right is a priority for prediction.
Estimation and inference could also be tricky.
What people usually do, then, is aspire to correct specification.
Correct specification means that the best linear predictor is the same as the best predictor.
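As promised above, a minimal sketch of taking the best linear predictor to the data by OLS; y and x1 are hypothetical numeric vectors, not objects from these slides.

# Minimal sketch: OLS estimate of the best linear predictor of y given x1.
# `y` and `x1` are hypothetical numeric vectors already in the workspace.
fit <- lm(y ~ x1)
summary(fit)                                  # estimates of beta0, beta1 and their standard errors
predict(fit, newdata = data.frame(x1 = 2))    # fitted linear prediction at x1 = 2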
Aren’t we supposed to be working with time series?
You have to think about the conditioning set.
You have to determine if OLS will still work.
Correct dynamic specification is \[\begin{eqnarray*}&&\mathbb{E}\left(Y_{t}|Y_{t-1},Y_{t-2},\ldots,X_{t-1},X_{t-2},\ldots\right) \\ &=&\beta_{0}+\beta_{1}Y_{t-1}+\beta_{2}Y_{t-2}+\cdots+\beta_{p}Y_{t-p}\\ &&+\delta_{1}X_{t-1}+\delta_{2}X_{t-2}+\cdots+\delta_{q}X_{t-q}.\end{eqnarray*}\]
This leads to the autoregressive distributed lag model ARDL(p,q): \[\begin{eqnarray*}Y_{t} &=& \beta_{0}+\beta_{1}Y_{t-1}+\beta_{2}Y_{t-2}+\cdots+\beta_{p}Y_{t-p}\\ &&+\delta_{1}X_{t-1}+\delta_{2}X_{t-2}+\cdots+\delta_{q}X_{t-q}+u_{t}\end{eqnarray*}\]
The error term satisfies \(\mathbb{E}\left(u_{t}|Y_{t-1},Y_{t-2},\ldots,X_{t-1},X_{t-2},\ldots\right)=0\).
An implication is that \(u_t\) is uncorrelated with its own past.
This implication can therefore be used to provide evidence against correct dynamic specification (see the sketch below).
All other lags \(Y_{t-k}\) with \(k>p\) and \(X_{t-j}\) with \(j>q\) have no additional predictive value for \(Y_t\).
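One common way to look for such evidence is a Ljung-Box test on the ARDL residuals. A minimal sketch, assuming fit is an ARDL(p,q) regression already estimated by OLS:

# Minimal sketch: check whether ARDL residuals are serially correlated.
# `fit` is a hypothetical ARDL(p,q) regression object fitted by OLS.
u.hat <- residuals(fit)
Box.test(u.hat, lag = 10, type = "Ljung-Box")  # a small p-value is evidence
                                               # against correct dynamic specification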
Practically implementing ARDL(p,q) is different from usual OLS because
There is another version of an ARDL(p,q): \[\begin{eqnarray*}Y_{t} &=& \beta_{0}+\beta_{1}Y_{t-1}+\beta_{2}Y_{t-2}+\cdots+\beta_{p}Y_{t-p}\\ &&{\color{red}{+\delta_0 X_{t}}}+\delta_{1}X_{t-1}+\delta_{2}X_{t-2}+\cdots+\delta_{q}X_{t-q}+u_{t}\end{eqnarray*}\]
Where does this come from?
When you move to vector autoregressive models, this kind of model class appears if you focus on the conditional version of one of the variables in your vector.
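A minimal sketch of estimating both ARDL(1,1) versions with the dyn package (the same package that appears in the regression output later in these slides); y and x are assumed to be zoo time series, not objects from the slides.

# Minimal sketch: ARDL(1,1) by OLS with the dyn package.
# `y` and `x` are hypothetical zoo series; lag(., -1) is the first lag.
library(zoo)
library(dyn)
fit.ardl  <- dyn$lm(y ~ lag(y, -1) + lag(x, -1))       # version without X_t
fit.ardl0 <- dyn$lm(y ~ lag(y, -1) + x + lag(x, -1))   # version including X_t
summary(fit.ardl)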
In addition, the conditional variance is actually a constant under bivariate normality: \[\mathsf{Var}\left(Y|X_1\right)=\left(1-\rho^2\right)\sigma_Y^2.\]
Finally, the conditional distribution is also normally distributed, so that: \[Y|X_1=x_1 \sim N\left(\beta_0+\beta_1x_1, \left(1-\rho^2\right)\sigma_Y^2\right).\]
In the next slides, you will see some pictures of the bivariate normal for the case where \(\mu_1=\mu_Y=0\) and \(\sigma^2_1=\sigma^2_Y=1\). Therefore, the only thing that changes in the pictures is \(\rho\).
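As a rough sketch of how such pictures could be generated (not the exact code behind the slides), one can simulate from a standardized bivariate normal with MASS::mvrnorm and vary only \(\rho\):

# Minimal sketch: draws from a standardized bivariate normal, so that only
# the correlation rho changes across the pictures.
library(MASS)
set.seed(1)
for (rho in c(0, 0.5, 0.9)) {
  Sigma <- matrix(c(1, rho, rho, 1), nrow = 2)
  xy <- mvrnorm(n = 1000, mu = c(0, 0), Sigma = Sigma)
  plot(xy[, 1], xy[, 2], xlab = expression(X[1]), ylab = "Y",
       main = bquote(rho == .(rho)))
}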
Suppose you are interested in estimating the parameters of a first-order autoregression or AR(1) process \(Y_{t}=\beta_0+\beta_1 Y_{t-1}+u_{t}\).
To give you a sense of what the data on \(\left\{ Y_{t}\right\}_{t=1}^{T}\) would look like, I generate artificial data.
Here are some pictures where \(\beta_0=0\) and \(\beta_1\) can be 0, 0.5, 0.95, and 1.
I assume that \(u_{t}\sim N\left(0,1\right)\) and \(Y_{0}\sim N\left(0,1\right)\).
You will see two plots side-by-side. One is a time-series plot where \(Y_{t}\) is plotted against \(t\) and the other is a scatterplot where \(Y_{t}\) is plotted against \(Y_{t-1}\).
To enhance comparability, I use the same set of randomly drawn \(u_{t}\)’s and \(Y_{0}\)’s across the four values of \(\beta_1\).
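A minimal sketch of the kind of code that could produce these pictures (a sketch under the stated assumptions, not the exact code behind the slides):

# Minimal sketch: simulate AR(1) paths that reuse the same shocks and the
# same starting value, then plot Y_t against t and against Y_{t-1}.
set.seed(123)
T  <- 200
u  <- rnorm(T)   # common shocks across all values of beta1
y0 <- rnorm(1)   # common starting value
for (beta1 in c(0, 0.5, 0.95, 1)) {
  y <- numeric(T)
  y[1] <- beta1 * y0 + u[1]                    # beta0 = 0
  for (t in 2:T) y[t] <- beta1 * y[t - 1] + u[t]
  par(mfrow = c(1, 2))
  plot(y, type = "l", xlab = "t", ylab = expression(Y[t]),
       main = paste("beta1 =", beta1))
  plot(y[-T], y[-1], xlab = expression(Y[t - 1]), ylab = expression(Y[t]),
       main = paste("beta1 =", beta1))
}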
beta1=0 beta1=0.5 beta1=0.95 beta1=1
mean.ols -0.0268 0.433 0.8314 0.8740
mean.reg.se 0.1626 0.146 0.0868 0.0746
sd.ols 0.1595 0.150 0.1115 0.0992
p.vals 0.0650 0.070 0.1880 0.2920
beta1=0 beta1=0.5 beta1=0.95 beta1=1
mean.ols -0.00368 0.4882 0.9247 0.9677
mean.reg.se 0.07957 0.0694 0.0297 0.0188
sd.ols 0.07570 0.0663 0.0334 0.0274
p.vals 0.04300 0.0350 0.0870 0.3010
beta1=0 beta1=0.5 beta1=0.95 beta1=1
mean.ols -0.00228 0.4964 0.9445 0.99160
mean.reg.se 0.03959 0.0344 0.0129 0.00483
sd.ols 0.03922 0.0342 0.0133 0.00679
p.vals 0.04900 0.0460 0.0490 0.29300
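The rows of the tables above could come from a Monte Carlo exercise along the following lines. This is only a sketch: the sample size, the number of replications, and the exact definition of p.vals (taken here to be the rejection rate of a nominal 5% t-test of the true \(\beta_1\)) are assumptions, not taken from the slides.

# Minimal sketch of a Monte Carlo for OLS applied to an AR(1), assuming:
#   mean.ols    = average OLS estimate of beta1 across replications
#   mean.reg.se = average reported OLS standard error
#   sd.ols      = standard deviation of the OLS estimates
#   p.vals      = rejection rate of a nominal 5% t-test of the true beta1
mc.ar1 <- function(beta1, T = 50, nsim = 1000) {
  est <- se <- rej <- numeric(nsim)
  for (s in 1:nsim) {
    u <- rnorm(T); y <- numeric(T); y[1] <- rnorm(1)
    for (t in 2:T) y[t] <- beta1 * y[t - 1] + u[t]
    fit    <- lm(y[-1] ~ y[-T])                       # regress Y_t on Y_{t-1}
    est[s] <- coef(fit)[2]
    se[s]  <- summary(fit)$coefficients[2, 2]
    rej[s] <- abs((est[s] - beta1) / se[s]) > qnorm(0.975)
  }
  c(mean.ols = mean(est), mean.reg.se = mean(se),
    sd.ols = sd(est), p.vals = mean(rej))
}
sapply(c(0, 0.5, 0.95, 1), mc.ar1)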
Dickey and Fuller (1979) have shown that when testing the null of a unit root, the asymptotic distribution of the test statistic under the null is nonstandard.
The asymptotic distribution of the test statistic under the null also changes depending on the presence or absence of deterministic variables in the autoregression and on the nature of the null being tested.
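As a hedged illustration, a unit-root test can be run in R with, for example, urca::ur.df (an assumption about tooling; the slides do not show which package was used). Here y is a hypothetical series suspected of having a unit root:

# Minimal sketch: augmented Dickey-Fuller test with an intercept ("drift").
# The choice of lags and of type ("none", "drift", "trend") affects the
# critical values, as noted above.
library(urca)
df.test <- ur.df(y, type = "drift", lags = 4)
summary(df.test)   # compare the test statistic with the reported DF critical values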
Consider the following two situations:
Call:
lm(formula = dyn(y1 ~ x1))
Residuals:
Min 1Q Median 3Q Max
-3.588 -0.650 0.004 0.699 3.188
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.0958 0.0313 3.06 0.0023 **
x1 0.0282 0.0320 0.88 0.3797
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.988 on 998 degrees of freedom
Multiple R-squared: 0.000773, Adjusted R-squared: -0.000228
F-statistic: 0.772 on 1 and 998 DF, p-value: 0.38
Call:
lm(formula = dyn(y2 ~ x2))
Residuals:
Min 1Q Median 3Q Max
-33.04 -11.37 0.24 10.73 26.51
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 30.4113 0.6112 49.8 <2e-16 ***
x2 0.8107 0.0147 55.1 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 13.1 on 998 degrees of freedom
Multiple R-squared: 0.752, Adjusted R-squared: 0.752
F-statistic: 3.03e+03 on 1 and 998 DF, p-value: <2e-16
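As a hedged guess at how these two situations could be constructed (an assumption about the data-generating process, not the slides' actual code): the first pair behaves like unrelated white-noise series, while the second behaves like unrelated random walks, the classic spurious regression setup.

# Minimal sketch: unrelated white noise versus unrelated random walks.
library(zoo)
library(dyn)
set.seed(42)
n  <- 1000
x1 <- zoo(rnorm(n));         y1 <- zoo(rnorm(n))          # stationary and unrelated
x2 <- zoo(cumsum(rnorm(n))); y2 <- zoo(cumsum(rnorm(n)))  # unit-root and unrelated
summary(dyn$lm(y1 ~ x1))   # typically: insignificant slope, R-squared near zero
summary(dyn$lm(y2 ~ x2))   # typically: "significant" slope, large R-squared (spurious)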
There are two broad ways of solving the spurious regression problem:
The first option is the most appropriate course of action here.
Call:
lm(formula = dyn(diff(y2) ~ diff(x2)))
Residuals:
Min 1Q Median 3Q Max
-3.590 -0.650 0.002 0.698 3.187
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.0973 0.0313 3.11 0.002 **
diff(x2) 0.0291 0.0320 0.91 0.364
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.987 on 997 degrees of freedom
Multiple R-squared: 0.000826, Adjusted R-squared: -0.000176
F-statistic: 0.825 on 1 and 997 DF, p-value: 0.364