Seven Examples to Better Illustrate the Method of Moments to Undergraduates

Andrew Adrian Pua

2024-11-07

Really, Seven? Maybe four?

Follow along

Or visit https://bit.ly/pes-7.

Takeaway messages

  • It is getting harder to pile on prerequisites to get to the point where a student is ready for advanced econometric methods.
  • Emphasize a “shortest path” approach and exploit parallels across different toy models as much as possible.
  • We don’t want students and practitioners to be too reliant on a black box.

The method of moments

  • You have been using the method of moments without knowing it:

    • Sample averages, sample proportions
    • Least squares coefficients, à la the output of a Stata command like reg y x1 x2
    • Average treatment effects: difference of two means when applied to RCTs
    • Instrumental variables

The method of moments, continued

  • So why should you know more about the method?

  • Because it is a powerful black box.

    • Many moving components you may not be aware of, even in “simple” settings
    • Generalized version of the method of moments
    • Affects the way we approach scientific pursuits
    • Affects the way we do consulting in industrial and policy settings

Problem?

Look at Larsen and Marx (a book referenced in CHED’s PSG for economics programmes):

Problem?

Look at their first illustration:

  • Distribution known!
  • Becomes a mechanical exercise.
  • Method of moments shines in semiparametric settings, where maximum likelihood can be difficult to implement.

“Prerequisites”, at least for this talk

  • You have to be exposed to the core idea behind the law of large numbers in the IID case.
  • You have to know about moments such as expected values and covariances.
  • You have to know the distinction between sample and population.

Ingredients for the method

From an algorithmic view:

  • Moment functions: Functions of the data and parameters of interest
  • Moment or orthogonality conditions: Restrictions on these moment functions, typically zero restrictions
  • Number of moment conditions is equal to number of parameters
  • Suggestion: Replace population moments with sample moments and solve for the unknown parameters.
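A minimal instance of this recipe, using the sample average from the earlier list: the moment function is \(Y-\mu\), the moment condition is \(\mathbb{E}\left(Y-\mu\right)=0\), and there is exactly one condition for one parameter. Replacing the population moment with its sample counterpart and solving gives \[\frac{1}{n}\sum_{i=1}^{n}\left(Y_{i}-\widehat{\mu}\right)=0\quad\Rightarrow\quad\widehat{\mu}=\overline{Y}.\]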

Example

Making predictions

  • One of the four words in the sentence “I SEE THE MOUSE” will be selected at random by a person. That person will tell you how many \(E\)’s there are in the selected word.

  • Call \(Y\) the number of letters in the selected word and \(X_1\) the number of \(E\)’s in the selected word.

  • Your task is to predict the number of letters in the selected word (ex ante).

Making predictions, continued

  • You would be “punished” according to the square of the difference between the actual \(Y\) and your prediction.

  • Because you do not know for sure which word will be chosen, you have to allow for contingencies.

  • Therefore, losses depend on which word was chosen.

  • What would be your prediction rule in order to make your expected loss as small as possible?

Making predictions, continued

  • One way to answer this question is to propose a prediction rule and hope for the best.

  • For example, we can say that the rule should have the form \(\beta_0+\beta_1 X_1\).

  • The task is now to find the unique solution to the following optimization problem: \[\min_{\beta_{0},\beta_{1}}\mathbb{E}\left[\left(Y-\beta_{0}-\beta_{1}X_{1}\right)^{2}\right].\]

Making predictions, continued

  • An optimal solution \(\left(\beta_{0}^{*},\beta_{1}^{*}\right)\) solves the following first-order conditions: \[\begin{eqnarray*} \mathbb{E}\left(Y-\beta_{0}^{*}-\beta_{1}^{*}X_{1}\right) &=& 0, \\ \mathbb{E}\left(X_{1}\left(Y-\beta_{0}^{*}-\beta_{1}^{*}X_{1}\right)\right) &=& 0. \end{eqnarray*}\]

  • As a result, we have \[\begin{equation}\beta_{0}^{*} = \mathbb{E}\left(Y\right)-\beta_{1}^{*}\mathbb{E}\left(X_{1}\right),\qquad\beta_{1}^{*}=\dfrac{\mathsf{Cov}\left(X_{1},Y\right)}{\mathsf{Var}\left(X_{1}\right)}.\label{blp-coef}\end{equation}\]
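A short R sketch (the variable names are mine) that plugs the four equally likely words of “I SEE THE MOUSE” into these formulas and reproduces the best linear prediction rule for this population:

```r
# Population: the four equally likely words in "I SEE THE MOUSE"
y  <- c(1, 3, 3, 5)   # Y  = number of letters (I, SEE, THE, MOUSE)
x1 <- c(0, 2, 1, 1)   # X1 = number of E's in each word

# Population moments (each word has probability 1/4, so plain means work)
cov_x1y <- mean((x1 - mean(x1)) * (y - mean(y)))   # Cov(X1, Y)
var_x1  <- mean((x1 - mean(x1))^2)                 # Var(X1)

beta1_star <- cov_x1y / var_x1                     # slope of the best linear predictor
beta0_star <- mean(y) - beta1_star * mean(x1)      # intercept
c(beta0_star, beta1_star)                          # prediction rule: 2 + 1 * X1
```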

Public service announcement

  • But this is none other than the population form of simple OLS.
  • In effect, you have replaced the \(\mathbb{E}\left(\cdot\right)\) with \(\dfrac{1}{n}\displaystyle\sum_{i=1}^n \left(\cdot\right)\).
  • Repeat after me: linear regression is estimating a linear prediction rule!
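To see the replacement in action, here is a hedged R sketch (the simulation settings are mine): draw words at random, solve the two sample moment conditions in closed form, and compare with lm(), the R analogue of reg.

```r
set.seed(1)
words <- data.frame(y = c(1, 3, 3, 5), x1 = c(0, 2, 1, 1))   # I, SEE, THE, MOUSE
n     <- 500
draw  <- words[sample(1:4, n, replace = TRUE), ]             # n random word choices

# Solve the sample analogues of the two first-order conditions
b1_hat <- mean((draw$x1 - mean(draw$x1)) * (draw$y - mean(draw$y))) /
          mean((draw$x1 - mean(draw$x1))^2)
b0_hat <- mean(draw$y) - b1_hat * mean(draw$x1)

c(b0_hat, b1_hat)               # method-of-moments estimates
coef(lm(y ~ x1, data = draw))   # identical: OLS solves the same sample moment conditions
```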

Connections: method of moments

  • You have unwittingly used the method of moments!!
  • To use the method, you need moment conditions which restrict the behavior of some functions of both the data and the unknown parameters.
  • Where do you find these? Go back to first-order conditions.

Connections: models

  • It is definitely possible to think of a generative model to make sense of linear regression.
  • Define the error from best linear prediction to be \(u=Y-\beta_0^*-\beta_1^*X_1\).
  • The generative model now has a signal plus noise form \(Y=\beta_0^*+\beta_1^* X_1 +u\).
  • The moment conditions can be rewritten as: \[\mathbb{E}\left(u\cdot 1\right)=0,\ \ \mathbb{E}\left(u\cdot X_1\right)=0.\]

Example

Instrumental variables

  • What if you are in a situation where \(Y=\beta_0+\beta_1X_1+u\) is not a linear regression (usually called structural)?

  • What this means:

    • \(\beta_0\) and \(\beta_1\) are not necessarily the coefficients of the best linear predictor of \(Y\) using \(X_1\).
    • \(\mathbb{E}\left(u \cdot 1\right) \neq 0\) or \(\mathbb{E}\left(u \cdot X_1\right) \neq 0\) or both
  • But you have to show that \(\beta_1\) means something for your use case: typically, people think \(\beta_1\) is the causal effect of \(X_1\) on \(Y\).

Instrumental variables, continued

  • One approach is to find an external source of variation \(Z\) correlated with \(X_1\) but not correlated with \(u\).
  • The argument is as follows: \[\mathsf{Cov}\left(Y, Z\right)=\beta_1\mathsf{Cov}\left(X_1, Z\right)+\mathsf{Cov}\left(u, Z\right)\] \[\Rightarrow\beta_1=\dfrac{\mathsf{Cov}\left(Y, Z\right)}{\mathsf{Cov}\left(X_1, Z\right)}.\]
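A hedged R sketch with a made-up data-generating process: a common unobserved factor makes \(X_1\) and \(u\) correlated, while \(Z\) shifts \(X_1\) but is unrelated to \(u\). The sample covariance ratio then recovers \(\beta_1\), while the OLS slope does not.

```r
set.seed(42)
n    <- 10000
z    <- rnorm(n)                    # instrument: moves X1, unrelated to u
conf <- rnorm(n)                    # unobserved confounder
x1   <- 0.8 * z + conf + rnorm(n)   # X1 depends on Z and the confounder
u    <- conf + rnorm(n)             # structural error, correlated with X1
y    <- 1 + 2 * x1 + u              # structural equation with beta1 = 2

cov(y, z) / cov(x1, z)              # IV estimate, close to 2
coef(lm(y ~ x1))[2]                 # OLS slope, biased away from 2
```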

Instrumental variables, continued

  • The previous identification argument requires

    • \(\mathsf{Cov}\left(X_1, Z\right)\neq 0\)
    • \(\mathsf{Cov}\left(u, Z\right)=0\)
  • A violation of either of these two conditions already signals a failure of instrumental variables to identify \(\beta_1\).

Instrumental variables, continued

  • Since there are actually two unknown parameters \(\beta_0\) and \(\beta_1\), we can write down two moment conditions to recover both using instrumental variables: \[ \mathbb{E}\left(u \cdot 1 \right)=0,\ \ \mathbb{E}\left(u \cdot Z\right) =0\] \[\Rightarrow \mathbb{E}\left(u \cdot 1 \right)=0,\ \ \mathsf{Cov}\left(u, Z\right) =0\] \[\Rightarrow \begin{eqnarray*} \mathbb{E}\left(\left(Y-\beta_0-\beta_1 X_1\right) \cdot 1 \right)=0 \\ \mathbb{E}\left(\left(Y-\beta_0-\beta_1 X_1\right) \cdot Z\right) =0\end{eqnarray*}\]
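In sample form, these two conditions are linear in \(\left(\beta_0,\beta_1\right)\), so they can be solved as a \(2\times 2\) system. A self-contained R sketch (same kind of made-up design as above):

```r
set.seed(42)
n    <- 10000
z    <- rnorm(n)
conf <- rnorm(n)
x1   <- 0.8 * z + conf + rnorm(n)
y    <- 1 + 2 * x1 + conf + rnorm(n)    # true (beta0, beta1) = (1, 2)

# Sample versions of E(u * 1) = 0 and E(u * Z) = 0, rearranged as
#   mean(y)     = beta0           + beta1 * mean(x1)
#   mean(z * y) = beta0 * mean(z) + beta1 * mean(z * x1)
A <- rbind(c(1,       mean(x1)),
           c(mean(z), mean(z * x1)))
b <- c(mean(y), mean(z * y))
solve(A, b)                             # IV estimates of (beta0, beta1)
```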

Example

A panel data setting

  • Setup:

    • Units indexed by \(i\) (but I will suppress this notation)
    • \(t=1,2,3\) index the time periods
    • \(Y_t\), \(Y_{t-1}\), \(W\): current \(Y\), previous period’s \(Y\), extra information

A panel data setting, continued

  • Task: Predict \(Y_t\) using \(Y_{t-1}\) and \(W\).
  • This is just going to extend the best (linear) prediction example. Nothing special here!
  • So, we are just in a linear regression setting: \[Y_t=\beta_0^*+\beta_1^* Y_{t-1}+\beta_2^* W+u_t\]

A panel data setting, continued

  • More complicated task: Predict \(Y_t\) using \(Y_{t-1}\) and \(W\). But \(W\) is now unobserved.
  • This task is much harder to achieve, but we can settle for predicting changes in \(Y\) instead.
  • Difference the equation to eliminate the unobserved \(W\) (derivation below): \[Y_3-Y_2 = \beta_1^* \left(Y_2-Y_1\right)+\left(u_3-u_2\right)\]
  • New wrinkle: Not a linear regression anymore!
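To see where the differenced equation comes from, write the prediction equation for periods 3 and 2 and subtract; the time-invariant \(W\) cancels: \[\begin{eqnarray*} Y_3 & = & \beta_0^*+\beta_1^* Y_2+\beta_2^* W+u_3\\ Y_2 & = & \beta_0^*+\beta_1^* Y_1+\beta_2^* W+u_2\\ \Rightarrow Y_3-Y_2 & = & \beta_1^*\left(Y_2-Y_1\right)+\left(u_3-u_2\right). \end{eqnarray*}\]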

A panel data setting, continued

  • \(\beta_1^*\) keeps its predictive interpretation, but the differenced equation effectively becomes a structural equation: the error \(\Delta u_3 = u_3-u_2\) contains \(u_2\), which is correlated with the regressor \(\Delta Y_2 = Y_2-Y_1\).
  • Here, we can use IV! Consider \(Z=Y_1\) as a potential instrumental variable.

A panel data setting, continued

  • Mimic an earlier argument: \[\begin{eqnarray*} \mathsf{Cov}\left(\Delta Y_{3},Y_1\right) & = & \mathsf{Cov}\left(\beta_{1}^*\Delta Y_{2}+\Delta u_{3},Y_1\right)\\ \mathsf{Cov}\left(\Delta Y_{3},Y_1\right) & = & \beta_{1}^*\mathsf{Cov}\left(\Delta Y_{2},Y_1\right)+\mathsf{Cov}\left(\Delta u_{3},Y_1\right)\\ \dfrac{\mathsf{Cov}\left(\Delta Y_{3},Y_1\right)}{\mathsf{Cov}\left(\Delta Y_{2},Y_1\right)} & = & \beta_{1}^*. \end{eqnarray*}\]
  • The last step requires \(\mathsf{Cov}\left(\Delta u_{3},Y_1\right)=0\) (the instrument is uncorrelated with the differenced error) and \(\mathsf{Cov}\left(\Delta Y_{2},Y_1\right)\neq 0\) (the instrument is relevant).

A panel data setting, continued

  • We can also write down the moment condition used to identify \(\beta_1^*\): \[ \mathbb{E}\left(\left(\Delta Y_3-\beta_1^* \Delta Y_2\right) \cdot Y_1\right) =0\]
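A hedged R sketch with a made-up dynamic panel design: \(W\) is time-invariant and \(u_2, u_3\) are drawn after \(Y_1\), so the sample analogue of this moment condition recovers \(\beta_1^*\), while least squares on the differenced equation does not.

```r
set.seed(7)
n  <- 50000
w  <- rnorm(n)                              # unobserved, time-invariant W
y1 <- w + rnorm(n)                          # initial condition, correlated with W
y2 <- 0.3 + 0.5 * y1 + 0.5 * w + rnorm(n)   # beta1* = 0.5
y3 <- 0.3 + 0.5 * y2 + 0.5 * w + rnorm(n)

# Sample analogue of E((dY3 - beta1 * dY2) * Y1) = 0
mean((y3 - y2) * y1) / mean((y2 - y1) * y1)   # close to 0.5

coef(lm(I(y3 - y2) ~ 0 + I(y2 - y1)))         # naive OLS on differences: far from 0.5
```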

An additional time period

  • What happens if you also observe \(Y_4\)? Then you will have extra moment conditions (of course, there is a price!) because you can take other differences.

  • Again, you observe \(\left(Y_1,Y_2,Y_3,Y_4\right)\).

  • You have something like: \[ \begin{eqnarray*} \mathbb{E}\left(\left(\Delta Y_3-\beta_1^* \Delta Y_2\right) \cdot Y_1\right) =0\\ \mathbb{E}\left(\left(\Delta Y_4-\beta_1^* \Delta Y_3\right) \cdot Y_1\right) =0 \\ \mathbb{E}\left(\left(\Delta Y_4-\beta_1^* \Delta Y_3\right) \cdot Y_2\right) =0\end{eqnarray*}\]

An additional time period, continued

  • In the end, you will have more moment conditions than the dimension of the parameter \(\beta_1^*\).

  • How do you combine them? This is where the generalized method of moments (GMM) comes in.

Example

Two-sample problems

  • Overidentification: More moment conditions than parameters

  • Suppose you have two independently collected datasets \(\left(Y_{i1}\right)_{i=1}^n\), \(\left(Y_{i2}\right)_{i=1}^n\), or replicates of some random variable \(Y\).

  • Suppose you have \(\mathbb{E}\left(Y_{i1}\right)=\mathbb{E}\left(Y_{i2}\right)=\mu\), but we have \(\mathsf{Var}\left(Y_{i1}\right)=\sigma^2_1\), \(\mathsf{Var}\left(Y_{i2}\right)=\sigma^2_2\), and \(\sigma^2_1\neq \sigma^2_2\).

Two-sample problems, continued

  • The parameter is \(\mu\). There are two moment conditions, each of which could identify \(\mu\) on its own.

  • The sample moment conditions are

\[\begin{eqnarray*} \frac{1}{n}\sum_{i=1}^n \left(Y_{i1} - \widehat{\mu}\right) = \overline{Y}_1 -\widehat{\mu}=0\\ \frac{1}{n}\sum_{i=1}^n \left(Y_{i2} - \widehat{\mu}\right) = \overline{Y}_2 -\widehat{\mu}=0 \end{eqnarray*}\]

  • You see a problem here?
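A short R illustration of the problem (the numbers are made up): the two sample means will differ, so no single \(\widehat{\mu}\) can set both sample moment conditions to zero at once.

```r
set.seed(3)
y1 <- rnorm(100, mean = 5, sd = 1)   # first sample
y2 <- rnorm(100, mean = 5, sd = 3)   # second sample: same mean, larger variance
c(mean(y1), mean(y2))                # Ybar1 != Ybar2, so the two conditions conflict
```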

Two-sample problems, continued

  • Why not choose \(\widehat{\mu}\) in order to minimize \[\begin{equation*} L = w_1 \left(\overline{Y}_1 -\widehat{\mu}\right)^2+ w_2 \left(\overline{Y}_2 -\widehat{\mu}\right)^2 +2w_3 \left(\overline{Y}_1 -\widehat{\mu}\right) \left(\overline{Y}_2 -\widehat{\mu}\right)? \label{gmm} \end{equation*}\]

  • In matrix form:

\[\begin{equation*} L=\begin{pmatrix}\overline{Y}_1 -\widehat{\mu}\\ \overline{Y}_2 -\widehat{\mu}\end{pmatrix}^\prime \overbrace{\begin{pmatrix}w_1 & w_3 \\ w_3 & w_2\end{pmatrix}}^W \begin{pmatrix}\overline{Y}_1 -\widehat{\mu}\\ \overline{Y}_2 -\widehat{\mu}\end{pmatrix}. \end{equation*}\]

Two-sample problems, continued

  • If we knew the values of \(w_1\), \(w_2\), and \(w_3\) in advance, then we would be done:

\[\begin{equation} \widehat{\mu}= \underbrace{\frac{w_1+w_3}{w_1+w_2+2w_3}}_{w}\overline{Y}_1+\underbrace{\frac{w_2+w_3}{w_1+w_2+2w_3}}_{1-w}\overline{Y}_2 \label{weighted-mean} \end{equation}\]

  • If you do not know \(w_1, w_2, w_3\), then you have to find them. It is useful to think of this as a portfolio optimization problem!
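Where does the weighted-mean form come from? It is just the first-order condition of \(L\) with respect to \(\widehat{\mu}\): \[-2\left(w_1+w_3\right)\left(\overline{Y}_1-\widehat{\mu}\right)-2\left(w_2+w_3\right)\left(\overline{Y}_2-\widehat{\mu}\right)=0,\] which rearranges to the weighted mean displayed above.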

Two-sample problems, continued

  • It can be shown that

\[\begin{equation*} \mathsf{Var}\left(\widehat{\mu}\right)=w^2 \mathsf{Var}\left(\overline{Y}_1\right)+ (1-w)^2\mathsf{Var}\left( \overline{Y}_2\right)=w^2\frac{\sigma^2_1}{n}+(1-w)^2\frac{\sigma^2_2}{n} \end{equation*}\]

  • The optimal value of \(w\) which minimizes \(\mathsf{Var}\left(\widehat{\mu}\right)\) is \[\begin{equation} w=\frac{w_1+w_3}{w_1+w_2+2w_3}=\frac{\sigma^2_2}{\sigma^2_1+\sigma^2_2}=\frac{1}{\sigma^2_1/\sigma^2_2+1} \label{weight} \end{equation}\]
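A hedged Monte Carlo sketch in R (the sample size and variances are my choices): the optimally weighted estimator has a visibly smaller variance than the naive 50/50 average of the two sample means. In practice \(\sigma_1^2\) and \(\sigma_2^2\) would be replaced by sample variances.

```r
set.seed(123)
n <- 100; reps <- 5000
sigma1 <- 1; sigma2 <- 3
w_opt  <- sigma2^2 / (sigma1^2 + sigma2^2)     # optimal weight on Ybar1

mu_opt <- mu_naive <- numeric(reps)
for (r in 1:reps) {
  ybar1 <- mean(rnorm(n, mean = 5, sd = sigma1))
  ybar2 <- mean(rnorm(n, mean = 5, sd = sigma2))
  mu_opt[r]   <- w_opt * ybar1 + (1 - w_opt) * ybar2   # optimally weighted combination
  mu_naive[r] <- 0.5 * ybar1 + 0.5 * ybar2             # equal weights
}
c(var(mu_opt), var(mu_naive))   # roughly 0.009 vs 0.025: optimal weighting wins
```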

Two-sample problems, continued

  • Connection to distribution theory for GMM: Choose weights on moment conditions to minimize asymptotic variance
  • An intuitive form for the estimator
  • No matrix algebra, unless you really want it
  • No distribution theory, unless you really want it

Other examples

  1. Poisson and exponential
  2. Should I control for a variable or not?
  3. Dealing with measurement error

All of these can be made accessible. Of course, implementation is another thing!

Challenges

  1. We have to teach general-purpose estimation procedures: every time you change a moment condition, you have to repeat the procedure.

  2. Do you teach distribution theory? Depends.

  3. How do you get standard errors and conduct standard inference? Use a pre-existing GMM framework available in packages:

    • gmm in Stata
    • momentfit in R

Final advertisements

  • Slides
  • My webpage

Questions, proposals, collaboration?

Email me at andrew.pua@dlsu.edu.ph or approach me and I'll give you my card.