2024-11-07
Or visit https://bit.ly/pes-7.
You have been using the method of moments without knowing it:
reg y x1 x2
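To see the connection, here is a minimal R sketch (simulated data and variable names are assumptions for illustration, not from the talk): the coefficients reported by reg y x1 x2 solve the sample moment conditions \(\frac{1}{n}\sum_{i=1}^n x_i\left(y_i-x_i^\prime b\right)=0\), which is a method-of-moments calculation.

```r
# A minimal sketch (simulated data): OLS as a method-of-moments estimator.
# Solving the sample moment conditions X'(y - Xb)/n = 0 gives b = (X'X)^{-1} X'y,
# which matches lm() in R and reg y x1 x2 in Stata.
set.seed(42)
n  <- 200
x1 <- rnorm(n); x2 <- rnorm(n)
y  <- 1 + 2 * x1 - 0.5 * x2 + rnorm(n)

X    <- cbind(1, x1, x2)                      # design matrix with a constant
b_mm <- solve(crossprod(X), crossprod(X, y))  # solves the sample moment conditions
cbind(b_mm, coef(lm(y ~ x1 + x2)))            # the two columns agree up to rounding
```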
So why should you know more about the method?
Because it is a powerful black box.
Look at Larsen and Marx (a book referenced in CHED’s PSG for economics programmes):
Look at their first illustration:
From an algorithmic view:
One of the four words in the sentence “I SEE THE MOUSE” will be selected at random by a person. That person will tell you how many \(E\)’s there are in the selected word.
Call \(Y\) the number of letters in the selected word and \(X_1\) the number of \(E\)’s in the selected word.
Your task is to predict the number of letters in the selected word (ex ante).
You would be “punished” according to the square of the difference between the actual \(Y\) and your prediction.
Because you do not know for sure which word will be chosen, you have to allow for contingencies.
Therefore, losses depend on which word was chosen.
What would be your prediction rule in order to make your expected loss as small as possible?
One way to answer this question is to propose a prediction rule and hope for the best.
For example, we can say that the rule should have the form \(\beta_0+\beta_1 X_1\).
The task is now to find the unique solution to the following optimization problem: \[\min_{\beta_{0},\beta_{1}}\mathbb{E}\left[\left(Y-\beta_{0}-\beta_{1}X_{1}\right)^{2}\right].\]
An optimal solution \(\left(\beta_{0}^{*},\beta_{1}^{*}\right)\) solves the following first-order conditions: \[\begin{eqnarray*} \mathbb{E}\left(Y-\beta_{0}^{*}-\beta_{1}^{*}X_{1}\right) &=& 0, \\ \mathbb{E}\left(X_{1}\left(Y-\beta_{0}^{*}-\beta_{1}^{*}X_{1}\right)\right) &=& 0. \end{eqnarray*}\]
As a result, we have \[\begin{equation}\beta_{0}^{*} = \mathbb{E}\left(Y\right)-\beta_{1}^{*}\mathbb{E}\left(X_{1}\right),\qquad\beta_{1}^{*}=\dfrac{\mathsf{Cov}\left(X_{1},Y\right)}{\mathsf{Var}\left(X_{1}\right)}.\label{blp-coef}\end{equation}\]
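To make this concrete, here is a small R sketch (not from the slides) that evaluates these formulas for the four equally likely words in “I SEE THE MOUSE”.

```r
# A small sketch: best-linear-predictor coefficients for "I SEE THE MOUSE",
# with each word selected with probability 1/4.
y  <- c(1, 3, 3, 5)   # number of letters in I, SEE, THE, MOUSE
x1 <- c(0, 2, 1, 1)   # number of E's in each word
p  <- rep(1/4, 4)     # selection probabilities

EY <- sum(p * y); EX <- sum(p * x1)
cov_xy <- sum(p * x1 * y) - EX * EY
var_x  <- sum(p * x1^2)   - EX^2

beta1 <- cov_xy / var_x          # = 1
beta0 <- EY - beta1 * EX         # = 2
c(beta0 = beta0, beta1 = beta1)  # the optimal rule is 2 + 1 * X1
```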
What if you are in a situation where \(Y=\beta_0+\beta_1X_1+u\) is not a linear regression but what is usually called a structural equation?
What this means: \(\beta_0\) and \(\beta_1\) are no longer defined by the best-linear-prediction problem above, and \(X_1\) may be correlated with \(u\).
But you have to show that \(\beta_1\) means something for your use case: typically, people think \(\beta_1\) is the causal effect of \(X_1\) on \(Y\).
The previous identification argument requires two conditions on the instrument: it must be uncorrelated with the error \(u\) (exogeneity) and correlated with \(X_1\) (relevance).
A violation of either of these two conditions would already signal a failure of instrumental variables to identify \(\beta_1\).
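As a rough illustration (simulated data; the instrument \(Z\), the data-generating process, and all parameter values are assumptions for this sketch), combining \(\mathbb{E}\left(Y-\beta_0-\beta_1X_1\right)=0\) with \(\mathbb{E}\left(Z\left(Y-\beta_0-\beta_1X_1\right)\right)=0\) gives \(\beta_1=\mathsf{Cov}\left(Z,Y\right)/\mathsf{Cov}\left(Z,X_1\right)\):

```r
# A minimal sketch (simulated data): identifying beta_1 with an instrument Z
# that is uncorrelated with u (exogeneity) and correlated with X1 (relevance).
set.seed(7)
n <- 10000
z  <- rnorm(n)                 # instrument
v  <- rnorm(n)                 # common shock creating endogeneity
x1 <- 0.8 * z + v + rnorm(n)   # X1 is correlated with both Z and u
u  <- v + rnorm(n)             # structural error: correlated with X1, not with Z
y  <- 1 + 0.5 * x1 + u         # true beta_1 = 0.5

cov(z, y) / cov(z, x1)   # sample analogue of the IV moment condition, close to 0.5
cov(x1, y) / var(x1)     # OLS slope, biased upward because Cov(X1, u) > 0
```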
Setup: \(Y_t=\beta_1^* Y_{t-1}+\alpha+u_t\), where \(\alpha\) is an unobserved individual effect and the errors are sequentially uncorrelated. You observe \(\left(Y_1,Y_2,Y_3\right)\). First-differencing removes \(\alpha\), and \(Y_1\) is uncorrelated with \(\Delta u_3\), so \(\mathbb{E}\left(\left(\Delta Y_3-\beta_1^* \Delta Y_2\right)\cdot Y_1\right)=0\) identifies \(\beta_1^*\).
What happens if we observe \(Y_4\)? Then you will have extra moment conditions (of course, there is a price!) because you can take other differences.
Again, you observe \(\left(Y_1,Y_2,Y_3,Y_4\right)\).
You have something like: \[ \begin{eqnarray*} \mathbb{E}\left(\left(\Delta Y_3-\beta_1^* \Delta Y_2\right) \cdot Y_1\right) =0\\ \mathbb{E}\left(\left(\Delta Y_4-\beta_1^* \Delta Y_3\right) \cdot Y_1\right) =0 \\ \mathbb{E}\left(\left(\Delta Y_4-\beta_1^* \Delta Y_3\right) \cdot Y_2\right) =0\end{eqnarray*}\]
In the end, you will have more moment conditions than the dimension of the parameter \(\beta_1^*\).
How do you combine them? This is where the generalized method of moments (GMM) comes in.
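One way to see the combination at work is a minimal R sketch (simulated data; the data-generating process and the identity weighting matrix are assumptions for illustration) that stacks the three sample moments and minimizes the GMM quadratic form:

```r
# A minimal sketch (simulated data): one-step GMM with an identity weighting matrix
# for the three moment conditions above.
set.seed(123)
n     <- 5000
beta  <- 0.5                 # true beta_1*
alpha <- rnorm(n)            # unobserved individual effect
y1 <- alpha + rnorm(n)
y2 <- beta * y1 + alpha + rnorm(n)
y3 <- beta * y2 + alpha + rnorm(n)
y4 <- beta * y3 + alpha + rnorm(n)

# Stacked sample moment conditions g_n(b)
gn <- function(b) c(mean(((y3 - y2) - b * (y2 - y1)) * y1),
                    mean(((y4 - y3) - b * (y3 - y2)) * y1),
                    mean(((y4 - y3) - b * (y3 - y2)) * y2))

# GMM objective: quadratic form in the moments, here with W = identity
Q <- function(b) sum(gn(b)^2)
optimize(Q, interval = c(-1, 1))$minimum   # should be close to 0.5
```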
Overidentification: More moment conditions than parameters
Suppose you have two independently collected datasets \(\left(Y_{i1}\right)_{i=1}^n\) and \(\left(Y_{i2}\right)_{i=1}^n\), i.e., replicates of some random variable \(Y\).
Suppose \(\mathbb{E}\left(Y_{i1}\right)=\mathbb{E}\left(Y_{i2}\right)=\mu\), but \(\mathsf{Var}\left(Y_{i1}\right)=\sigma^2_1\), \(\mathsf{Var}\left(Y_{i2}\right)=\sigma^2_2\), with \(\sigma^2_1\neq \sigma^2_2\).
The parameter is \(\mu\). There are two moment conditions, and either one alone could identify \(\mu\).
The sample moment conditions are
\[\begin{eqnarray*} \frac{1}{n}\sum_{i=1}^n \left(Y_{i1} - \widehat{\mu}\right) = \overline{Y}_1 -\widehat{\mu}=0\\ \frac{1}{n}\sum_{i=1}^n \left(Y_{i2} - \widehat{\mu}\right) = \overline{Y}_2 -\widehat{\mu}=0\\ \end{eqnarray*}\]
Why not choose \(\widehat{\mu}\) to minimize the following? \[\begin{equation*} L = w_1 \left(\overline{Y}_1 -\widehat{\mu}\right)^2+ w_2 \left(\overline{Y}_2 -\widehat{\mu}\right)^2 +2w_3 \left(\overline{Y}_1 -\widehat{\mu}\right) \left(\overline{Y}_2 -\widehat{\mu}\right) \label{gmm} \end{equation*}\]
In matrix form:
\[\begin{equation*} L=\begin{pmatrix}\overline{Y}_1 -\widehat{\mu}\\ \overline{Y}_2 -\widehat{\mu}\end{pmatrix}^\prime \overbrace{\begin{pmatrix}w_1 & w_3 \\ w_3 & w_2\end{pmatrix}}^W \begin{pmatrix}\overline{Y}_1 -\widehat{\mu}\\ \overline{Y}_2 -\widehat{\mu}\end{pmatrix}. \end{equation*}\]
\[\begin{equation} \widehat{\mu}= \underbrace{\frac{w_1+w_3}{w_1+w_2+2w_3}}_{w}\overline{Y}_1+\underbrace{\frac{w_2+w_3}{w_1+w_2+2w_3}}_{1-w}\overline{Y}_2 \label{weighted-mean} \end{equation}\]
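As a quick numerical check (the sample means and weights below are hypothetical values, not from the slides), minimizing \(L\) over \(\widehat{\mu}\) for an arbitrary positive-definite \(W\) reproduces the weighted-mean formula:

```r
# A quick check: numerical minimization of L matches the closed-form weighted mean.
ybar1 <- 1.8; ybar2 <- 2.3        # hypothetical sample means
w1 <- 2; w2 <- 1; w3 <- 0.5       # arbitrary weights with W positive definite

L <- function(m) {
  g <- c(ybar1 - m, ybar2 - m)
  W <- matrix(c(w1, w3, w3, w2), 2, 2)
  drop(t(g) %*% W %*% g)
}
optimize(L, interval = c(0, 4))$minimum
((w1 + w3) * ybar1 + (w2 + w3) * ybar2) / (w1 + w2 + 2 * w3)   # same value
```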
\[\begin{equation*} \mathsf{Var}\left(\widehat{\mu}\right)=w^2 \mathsf{Var}\left(\overline{Y}_1\right)+ (1-w)^2\mathsf{Var}\left( \overline{Y}_2\right)=w^2\frac{\sigma^2_1}{n}+(1-w)^2\frac{\sigma^2_2}{n} \end{equation*}\]
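This variance is minimized at \(w=\sigma^2_2/\left(\sigma^2_1+\sigma^2_2\right)\), i.e., inverse-variance weighting, which corresponds to choosing \(W\) proportional to the inverse of the variance matrix of the two sample means. A small R sketch (simulated data; the true \(\mu\), \(\sigma_1\), and \(\sigma_2\) are assumptions for illustration):

```r
# A small sketch (simulated data): combining two sample means with the efficient weight
# w = sigma2^2 / (sigma1^2 + sigma2^2), compared with naive equal weighting.
set.seed(1)
n <- 1000; mu <- 2; s1 <- 1; s2 <- 3
y1 <- rnorm(n, mean = mu, sd = s1)
y2 <- rnorm(n, mean = mu, sd = s2)

w_eff    <- s2^2 / (s1^2 + s2^2)                       # efficient weight on ybar_1
mu_hat   <- w_eff * mean(y1) + (1 - w_eff) * mean(y2)  # estimate with efficient weights
mu_naive <- 0.5 * (mean(y1) + mean(y2))                # equal weights for comparison
c(mu_hat = mu_hat, mu_naive = mu_naive)

# Var(mu_hat) = s1^2 * s2^2 / (n * (s1^2 + s2^2)), smaller than (s1^2 + s2^2) / (4 * n)
```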
All of these can be made accessible. Of course, implementation is another thing!
You have to teach general-purpose estimation procedures: otherwise, every time you change a moment condition, you have to repeat the whole procedure.
Do you teach distribution theory? Depends.
How do you get standard errors and conduct standard inference? Use a pre-existing GMM framework available in packages: gmm in Stata, momentfit in R.
Slides | My webpage
Questions, proposals, collaboration?
Email me at andrew.pua@dlsu.edu.ph, or approach me and I’ll give you my card.