Seven Examples to Better Illustrate the Method of Moments to Undergraduates

Andrew Adrian Pua

2024-11-07

Really, Seven? Maybe four?

Follow along

Or visit https://bit.ly/pes-7.

Takeaway messages

  • It is getting harder to pile on prerequisites to get to the point where a student is ready for advanced econometric methods.
  • Emphasize a “shortest path” approach and exploit parallels across different toy models as much as possible.
  • We don’t want students and practitioners to be too reliant on a black box.

The method of moments

  • You have been using the method of moments without knowing it:

    • Sample averages, sample proportions
    • Least squares coefficients, à la the output of a Stata command like reg y x1 x2
    • Average treatment effects: difference of two means when applied to RCTs
    • Instrumental variables

The method of moments, continued

  • So why should you know more about the method?

  • Because it is a powerful black box.

    • Many moving components you may not be aware of, even in “simple” settings
    • Generalized version of the method of moments
    • Affects the way we approach scientific pursuits
    • Affects the way we do consulting in industrial and policy settings

Problem?

Look at Larsen and Marx (a book referenced in CHED’s PSG for economics programmes):

Problem?

Look at their first illustration:

  • Distribution known!
  • Becomes a mechanical exercise.
  • Method of moments shines in semiparametric settings, where maximum likelihood can be difficult to implement.

“Prerequisites”, at least for this talk

  • You have to be exposed to the core idea behind the law of large numbers in the IID case.
  • You have to know about moments such as expected values and covariances.
  • You have to know the distinction between sample and population.

Ingredients for the method

From an algorithmic view:

  • Moment functions: Functions of the data and parameters of interest
  • Moment or orthogonality conditions: Restrictions on these moment functions, typically zero restrictions
  • Number of moment conditions is equal to number of parameters
  • Suggestion: Replace population moments with sample moments and solve for the unknown parameters.
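A minimal instance of this recipe, using the sample average from the earlier list: the moment function is \(Y-\mu\), the moment condition is \(\mathbb{E}\left(Y-\mu\right)=0\), and there is exactly one condition for one parameter. Replacing the population moment with its sample counterpart and solving gives \[\frac{1}{n}\sum_{i=1}^{n}\left(Y_{i}-\widehat{\mu}\right)=0\quad\Rightarrow\quad\widehat{\mu}=\overline{Y}.\]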

Example

Making predictions

  • One of the four words in the sentence “I SEE THE MOUSE” will be selected at random by a person. That person will tell you how many \(E\)’s there are in the selected word.

  • Call \(Y\) the number of letters in the selected word and \(X_1\) the number of \(E\)’s in the selected word.

  • Your task is to predict the number of letters in the selected word (ex ante).

Making predictions, continued

  • You would be “punished” according to the square of the difference between the actual \(Y\) and your prediction.

  • Because you do not know for sure which word will be chosen, you have to allow for contingencies.

  • Therefore, losses depend on which word was chosen.

  • What would be your prediction rule in order to make your expected loss as small as possible?

Making predictions, continued

  • One way to answer this question is to propose a prediction rule and hope for the best.

  • For example, we can say that the rule should have the form \(\beta_0+\beta_1 X_1\).

  • The task is now to find the unique solution to the following optimization problem: \[\min_{\beta_{0},\beta_{1}}\mathbb{E}\left[\left(Y-\beta_{0}-\beta_{1}X_{1}\right)^{2}\right].\]

Making predictions, continued

  • An optimal solution \(\left(\beta_{0}^{*},\beta_{1}^{*}\right)\) solves the following first-order conditions: \[\begin{eqnarray*} \mathbb{E}\left(Y-\beta_{0}^{*}-\beta_{1}^{*}X_{1}\right) &=& 0, \\ \mathbb{E}\left(X_{1}\left(Y-\beta_{0}^{*}-\beta_{1}^{*}X_{1}\right)\right) &=& 0. \end{eqnarray*}\]

  • As a result, we have \[\begin{equation}\beta_{0}^{*} = \mathbb{E}\left(Y\right)-\beta_{1}^{*}\mathbb{E}\left(X_{1}\right),\qquad\beta_{1}^{*}=\dfrac{\mathsf{Cov}\left(X_{1},Y\right)}{\mathsf{Var}\left(X_{1}\right)}.\label{blp-coef}\end{equation}\]
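A short R sketch (the variable names are mine) that plugs the four equally likely words of “I SEE THE MOUSE” into these formulas and reproduces the best linear prediction rule for this population:

```r
# Population: the four equally likely words in "I SEE THE MOUSE"
y  <- c(1, 3, 3, 5)   # Y  = number of letters (I, SEE, THE, MOUSE)
x1 <- c(0, 2, 1, 1)   # X1 = number of E's in each word

# Population moments (each word has probability 1/4, so plain means work)
cov_x1y <- mean((x1 - mean(x1)) * (y - mean(y)))   # Cov(X1, Y)
var_x1  <- mean((x1 - mean(x1))^2)                 # Var(X1)

beta1_star <- cov_x1y / var_x1                     # slope of the best linear predictor
beta0_star <- mean(y) - beta1_star * mean(x1)      # intercept
c(beta0_star, beta1_star)                          # prediction rule: 2 + 1 * X1
```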

Public service announcement

  • But this is none other than the population form of simple OLS.
  • In effect, you have replaced the \(\mathbb{E}\left(\cdot\right)\) with \(\dfrac{1}{n}\displaystyle\sum_{i=1}^n \left(\cdot\right)\).
  • Repeat after me: linear regression is estimating a linear prediction rule!
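To see the replacement in action, here is a hedged R sketch (the simulation settings are mine): draw words at random, solve the two sample moment conditions in closed form, and compare with lm(), the R analogue of reg.

```r
set.seed(1)
words <- data.frame(y = c(1, 3, 3, 5), x1 = c(0, 2, 1, 1))   # I, SEE, THE, MOUSE
n     <- 500
draw  <- words[sample(1:4, n, replace = TRUE), ]             # n random word choices

# Solve the sample analogues of the two first-order conditions
b1_hat <- mean((draw$x1 - mean(draw$x1)) * (draw$y - mean(draw$y))) /
          mean((draw$x1 - mean(draw$x1))^2)
b0_hat <- mean(draw$y) - b1_hat * mean(draw$x1)

c(b0_hat, b1_hat)               # method-of-moments estimates
coef(lm(y ~ x1, data = draw))   # identical: OLS solves the same sample moment conditions
```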

Connections: method of moments

  • You have unwittingly used the method of moments!!
  • To use the method, you need moment conditions which restrict the behavior of some functions of both the data and the unknown parameters.
  • Where do you find these? Go back to first-order conditions.

Connections: models

  • It is definitely possible to think of a generative model to make sense of linear regression.
  • Define the error from best linear prediction to be \(u=Y-\beta_0^*-\beta_1^*X_1\).
  • The generative model now has a signal plus noise form \(Y=\beta_0^*+\beta_1^* X_1 +u\).
  • The moment conditions can be rewritten as: \[\mathbb{E}\left(u\cdot 1\right)=0,\ \ \mathbb{E}\left(u\cdot X_1\right)=0.\]

Example

Instrumental variables

  • What if you are in a situation where \(Y=\beta_0+\beta_1X_1+u\) is not a linear regression (usually called structural)?

  • What this means:

    • \(\beta_0\) and \(\beta_1\) are not necessarily the coefficients of the best linear predictor of \(Y\) using \(X_1\).
    • \(\mathbb{E}\left(u \cdot 1\right) \neq 0\) or \(\mathbb{E}\left(u \cdot X_1\right) \neq 0\) or both
  • But you have to show that \(\beta_1\) means something for your use case: typically, people think \(\beta_1\) is the causal effect of \(X_1\) on \(Y\).

Instrumental variables, continued

  • One approach is to find an external source of variation \(Z\) correlated with \(X_1\) but not correlated with \(u\).
  • The argument is as follows: \[\mathsf{Cov}\left(Y, Z\right)=\beta_1\mathsf{Cov}\left(X_1, Z\right)+\mathsf{Cov}\left(u, Z\right)\] \[\Rightarrow\beta_1=\dfrac{\mathsf{Cov}\left(Y, Z\right)}{\mathsf{Cov}\left(X_1, Z\right)}.\]
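A hedged R sketch with a made-up data-generating process: a common unobserved factor makes \(X_1\) and \(u\) correlated, while \(Z\) shifts \(X_1\) but is unrelated to \(u\). The sample covariance ratio then recovers \(\beta_1\), while the OLS slope does not.

```r
set.seed(42)
n    <- 10000
z    <- rnorm(n)                    # instrument: moves X1, unrelated to u
conf <- rnorm(n)                    # unobserved confounder
x1   <- 0.8 * z + conf + rnorm(n)   # X1 depends on Z and the confounder
u    <- conf + rnorm(n)             # structural error, correlated with X1
y    <- 1 + 2 * x1 + u              # structural equation with beta1 = 2

cov(y, z) / cov(x1, z)              # IV estimate, close to 2
coef(lm(y ~ x1))[2]                 # OLS slope, biased away from 2
```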

Instrumental variables, continued

  • The previous identification argument requires

    • \(\mathsf{Cov}\left(X_1, Z\right)\neq 0\)
    • \(\mathsf{Cov}\left(u, Z\right)=0\)
  • A violation of either of these two conditions already signals a failure of instrumental variables to identify \(\beta_1\).

Instrumental variables, continued

  • Since there are actually two unknown parameters \(\beta_0\) and \(\beta_1\), we can write down two moment conditions to recover both using instrumental variables: \[ \mathbb{E}\left(u \cdot 1 \right)=0,\ \ \mathbb{E}\left(u \cdot Z\right) =0\] \[\Rightarrow \mathbb{E}\left(u \cdot 1 \right)=0,\ \ \mathsf{Cov}\left(u, Z\right) =0\] \[\Rightarrow \begin{eqnarray*} \mathbb{E}\left(\left(Y-\beta_0-\beta_1 X_1\right) \cdot 1 \right)=0 \\ \mathbb{E}\left(\left(Y-\beta_0-\beta_1 X_1\right) \cdot Z\right) =0\end{eqnarray*}\]
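In sample form, these two conditions are linear in \(\left(\beta_0,\beta_1\right)\), so they can be solved as a \(2\times 2\) system. A self-contained R sketch (same kind of made-up design as above):

```r
set.seed(42)
n    <- 10000
z    <- rnorm(n)
conf <- rnorm(n)
x1   <- 0.8 * z + conf + rnorm(n)
y    <- 1 + 2 * x1 + conf + rnorm(n)    # true (beta0, beta1) = (1, 2)

# Sample versions of E(u * 1) = 0 and E(u * Z) = 0, rearranged as
#   mean(y)     = beta0           + beta1 * mean(x1)
#   mean(z * y) = beta0 * mean(z) + beta1 * mean(z * x1)
A <- rbind(c(1,       mean(x1)),
           c(mean(z), mean(z * x1)))
b <- c(mean(y), mean(z * y))
solve(A, b)                             # IV estimates of (beta0, beta1)
```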

Example

A panel data setting

  • Setup:

    • Units indexed by \(i\) (but I will suppress this notation)
    • \(t=1,2,3\) index the time periods
    • \(Y_t\), \(Y_{t-1}\), \(W\): current \(Y\), previous period’s \(Y\), extra information

A panel data setting, continued

  • Task: Predict \(Y_t\) using \(Y_{t-1}\) and \(W\).
  • This is just going to extend the best (linear) prediction example. Nothing special here!
  • So, we are just in a linear regression setting: \[Y_t=\beta_0^*+\beta_1^* Y_{t-1}+\beta_2^* W+u_t\]

A panel data setting, continued

  • More complicated task: Predict \(Y_t\) using \(Y_{t-1}\) and \(W\). But \(W\) is now unobserved.
  • This task is much harder to achieve, but we can settle for predicting changes in \(Y\) instead.
  • Difference the equation to eliminate the unobserved \(W\) (derivation below): \[Y_3-Y_2 = \beta_1^* \left(Y_2-Y_1\right)+\left(u_3-u_2\right)\]
  • New wrinkle: Not a linear regression anymore!
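To see where the differenced equation comes from, write the prediction equation for periods 3 and 2 and subtract; the time-invariant \(W\) cancels: \[\begin{eqnarray*} Y_3 & = & \beta_0^*+\beta_1^* Y_2+\beta_2^* W+u_3\\ Y_2 & = & \beta_0^*+\beta_1^* Y_1+\beta_2^* W+u_2\\ \Rightarrow Y_3-Y_2 & = & \beta_1^*\left(Y_2-Y_1\right)+\left(u_3-u_2\right). \end{eqnarray*}\]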

A panel data setting, continued

  • \(\beta_1^*\) keeps its predictive interpretation, but the differenced equation effectively becomes a structural equation: the error \(\Delta u_3 = u_3-u_2\) contains \(u_2\), which is correlated with the regressor \(\Delta Y_2 = Y_2-Y_1\).
  • Here, we can use IV! Consider \(Z=Y_1\) as a potential instrumental variable.

A panel data setting, continued

  • Mimic an earlier argument: \[\begin{eqnarray*} \mathsf{Cov}\left(\Delta Y_{3},Y_1\right) & = & \mathsf{Cov}\left(\beta_{1}^*\Delta Y_{2}+\Delta u_{3},Y_1\right)\\ \mathsf{Cov}\left(\Delta Y_{3},Y_1\right) & = & \beta_{1}^*\mathsf{Cov}\left(\Delta Y_{2},Y_1\right)+\mathsf{Cov}\left(\Delta u_{3},Y_1\right)\\ \dfrac{\mathsf{Cov}\left(\Delta Y_{3},Y_1\right)}{\mathsf{Cov}\left(\Delta Y_{2},Y_1\right)} & = & \beta_{1}^*. \end{eqnarray*}\]
  • The last step requires \(\mathsf{Cov}\left(\Delta u_{3},Y_1\right)=0\) (the instrument is uncorrelated with the differenced error) and \(\mathsf{Cov}\left(\Delta Y_{2},Y_1\right)\neq 0\) (the instrument is relevant).

A panel data setting, continued

  • We can also write down the moment condition used to identify \(\beta_1^*\): \[ \mathbb{E}\left(\left(\Delta Y_3-\beta_1^* \Delta Y_2\right) \cdot Y_1\right) =0\]
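A hedged R sketch with a made-up dynamic panel design: \(W\) is time-invariant and \(u_2, u_3\) are drawn after \(Y_1\), so the sample analogue of this moment condition recovers \(\beta_1^*\), while least squares on the differenced equation does not.

```r
set.seed(7)
n  <- 50000
w  <- rnorm(n)                              # unobserved, time-invariant W
y1 <- w + rnorm(n)                          # initial condition, correlated with W
y2 <- 0.3 + 0.5 * y1 + 0.5 * w + rnorm(n)   # beta1* = 0.5
y3 <- 0.3 + 0.5 * y2 + 0.5 * w + rnorm(n)

# Sample analogue of E((dY3 - beta1 * dY2) * Y1) = 0
mean((y3 - y2) * y1) / mean((y2 - y1) * y1)   # close to 0.5

coef(lm(I(y3 - y2) ~ 0 + I(y2 - y1)))         # naive OLS on differences: far from 0.5
```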

An additional time period

  • What happens if you also observe \(Y_4\)? Then you will have extra moment conditions (of course, there is a price!) because you can take other differences.

  • Again, you observe \(\left(Y_1,Y_2,Y_3,Y_4\right)\).

  • You have something like: \[ \begin{eqnarray*} \mathbb{E}\left(\left(\Delta Y_3-\beta_1^* \Delta Y_2\right) \cdot Y_1\right) =0\\ \mathbb{E}\left(\left(\Delta Y_4-\beta_1^* \Delta Y_3\right) \cdot Y_1\right) =0 \\ \mathbb{E}\left(\left(\Delta Y_4-\beta_1^* \Delta Y_3\right) \cdot Y_2\right) =0\end{eqnarray*}\]

An additional time period, continued

  • In the end, you will have more moment conditions than the dimension of the parameter \(\beta_1^*\).

  • How do you combine them? This is where the generalized method of moments (GMM) comes in.

Example

Two-sample problems

  • Overidentification: More moment conditions than parameters

  • Suppose you have two independently collected datasets \(\left(Y_{i1}\right)_{i=1}^n\), \(\left(Y_{i2}\right)_{i=1}^n\), or replicates of some random variable \(Y\).

  • Suppose you have \(\mathbb{E}\left(Y_{i1}\right)=\mathbb{E}\left(Y_{i2}\right)=\mu\), but we have \(\mathsf{Var}\left(Y_{i1}\right)=\sigma^2_1\), \(\mathsf{Var}\left(Y_{i2}\right)=\sigma^2_2\), and \(\sigma^2_1\neq \sigma^2_2\).

Two-sample problems, continued

  • The parameter is \(\mu\). There are two moment conditions, each of which could identify \(\mu\) on its own.

  • The sample moment conditions are

\[\begin{eqnarray*} \frac{1}{n}\sum_{i=1}^n \left(Y_{i1} - \widehat{\mu}\right) = \overline{Y}_1 -\widehat{\mu}=0\\ \frac{1}{n}\sum_{i=1}^n \left(Y_{i2} - \widehat{\mu}\right) = \overline{Y}_2 -\widehat{\mu}=0 \end{eqnarray*}\]

  • You see a problem here?
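A short R illustration of the problem (the numbers are made up): the two sample means will differ, so no single \(\widehat{\mu}\) can set both sample moment conditions to zero at once.

```r
set.seed(3)
y1 <- rnorm(100, mean = 5, sd = 1)   # first sample
y2 <- rnorm(100, mean = 5, sd = 3)   # second sample: same mean, larger variance
c(mean(y1), mean(y2))                # Ybar1 != Ybar2, so the two conditions conflict
```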

Two-sample problems, continued

  • Why not choose \(\widehat{\mu}\) in order to minimize \[\begin{equation*} L = w_1 \left(\overline{Y}_1 -\widehat{\mu}\right)^2+ w_2 \left(\overline{Y}_2 -\widehat{\mu}\right)^2 +2w_3 \left(\overline{Y}_1 -\widehat{\mu}\right) \left(\overline{Y}_2 -\widehat{\mu}\right)? \label{gmm} \end{equation*}\]

  • In matrix form:

\[\begin{equation*} L=\begin{pmatrix}\overline{Y}_1 -\widehat{\mu}\\ \overline{Y}_2 -\widehat{\mu}\end{pmatrix}^\prime \overbrace{\begin{pmatrix}w_1 & w_3 \\ w_3 & w_2\end{pmatrix}}^W \begin{pmatrix}\overline{Y}_1 -\widehat{\mu}\\ \overline{Y}_2 -\widehat{\mu}\end{pmatrix}. \end{equation*}\]

Two-sample problems, continued

  • If we knew the values of \(w_1\), \(w_2\), and \(w_3\) in advance, then we would be done:

\[\begin{equation} \widehat{\mu}= \underbrace{\frac{w_1+w_3}{w_1+w_2+2w_3}}_{w}\overline{Y}_1+\underbrace{\frac{w_2+w_3}{w_1+w_2+2w_3}}_{1-w}\overline{Y}_2 \label{weighted-mean} \end{equation}\]

  • If you do not know \(w_1, w_2, w_3\), then you have to find them. It is useful to think of this as a portfolio optimization problem!
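Where does the weighted-mean form come from? It is just the first-order condition of \(L\) with respect to \(\widehat{\mu}\): \[-2\left(w_1+w_3\right)\left(\overline{Y}_1-\widehat{\mu}\right)-2\left(w_2+w_3\right)\left(\overline{Y}_2-\widehat{\mu}\right)=0,\] which rearranges to the weighted mean displayed above.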

Two-sample problems, continued

  • It can be shown that

\[\begin{equation*} \mathsf{Var}\left(\widehat{\mu}\right)=w^2 \mathsf{Var}\left(\overline{Y}_1\right)+ (1-w)^2\mathsf{Var}\left( \overline{Y}_2\right)=w^2\frac{\sigma^2_1}{n}+(1-w)^2\frac{\sigma^2_2}{n} \end{equation*}\]

  • The optimal value of \(w\) which minimizes \(\mathsf{Var}\left(\widehat{\mu}\right)\) is \[\begin{equation} w=\frac{w_1+w_3}{w_1+w_2+2w_3}=\frac{\sigma^2_2}{\sigma^2_1+\sigma^2_2}=\frac{1}{\sigma^2_1/\sigma^2_2+1} \label{weight} \end{equation}\]
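A hedged Monte Carlo sketch in R (the sample size and variances are my choices): the optimally weighted estimator has a visibly smaller variance than the naive 50/50 average of the two sample means. In practice \(\sigma_1^2\) and \(\sigma_2^2\) would be replaced by sample variances.

```r
set.seed(123)
n <- 100; reps <- 5000
sigma1 <- 1; sigma2 <- 3
w_opt  <- sigma2^2 / (sigma1^2 + sigma2^2)     # optimal weight on Ybar1

mu_opt <- mu_naive <- numeric(reps)
for (r in 1:reps) {
  ybar1 <- mean(rnorm(n, mean = 5, sd = sigma1))
  ybar2 <- mean(rnorm(n, mean = 5, sd = sigma2))
  mu_opt[r]   <- w_opt * ybar1 + (1 - w_opt) * ybar2   # optimally weighted combination
  mu_naive[r] <- 0.5 * ybar1 + 0.5 * ybar2             # equal weights
}
c(var(mu_opt), var(mu_naive))   # roughly 0.009 vs 0.025: optimal weighting wins
```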

Two-sample problems, continued

  • Connection to distribution theory for GMM: Choose weights on moment conditions to minimize asymptotic variance
  • An intuitive form for the estimator
  • No matrix algebra, unless you really want it
  • No distribution theory, unless you really want it

Other examples

  1. Poisson and exponential
  2. Should I control for a variable or not?
  3. Dealing with measurement error

All of these can be made accessible. Of course, implementation is another thing!

Challenges

  1. We have to teach general-purpose estimation procedures: every time you change a moment condition, you have to repeat the procedure.

  2. Do you teach distribution theory? Depends.

  3. How do you get standard errors and conduct standard inference? Use a pre-existing GMM framework available in packages:

    • gmm in Stata
    • momentfit in R

Final advertisements

  • Slides
  • My webpage

Questions, proposals, collaboration?

Email me at andrew.pua@dlsu.edu.ph or approach me and I'll give you my card.