

10.1 The Recipe


$$ u_{t+1}(b) \equiv m_{t+1}(b)x_{t+1} - p_t $$
$$ g_T(b) \equiv E_T[u_t(b)] $$
$$ S \equiv \sum_{j=-\infty}^{\infty} E[u_t(b)u_{t-j}(b)'] $$

GMM estimate:
$$ \hat{b}_2 = \arg\min_b \; g_T(b)'S^{-1}g_T(b). $$

Standard errors:
$$ \mathrm{var}(\hat{b}_2) = \frac{1}{T}(d'S^{-1}d)^{-1}; \qquad d \equiv \frac{\partial g_T(b)}{\partial b}. $$

Test of the model ("overidentifying restrictions"):
$$ TJ_T = T \min_b\left[ g_T(b)'S^{-1}g_T(b) \right] \sim \chi^2(\#\text{moments} - \#\text{parameters}). $$

It's easiest to start our discussion of GMM in the context of an explicit discount factor model, such as the consumption-based model. I treat the special structure of linear factor models later. I start with the basic classic recipe as given by Hansen and Singleton (1982).
Discount factor models involve some unknown parameters as well as data, so I write $m_{t+1}(b)$ when it's important to remind ourselves of this dependence. For example, if $m_{t+1} = \beta(c_{t+1}/c_t)^{-\gamma}$, then $b \equiv [\beta \; \gamma]'$. I write $\hat{b}$ to denote an estimate when it is important to distinguish estimated from other values.
Any asset pricing model implies
$$ E(p_t) = E[m_{t+1}(b)x_{t+1}]. $$
It's easiest to write this equation in the form $E(\cdot) = 0$:
$$ E[m_{t+1}(b)x_{t+1} - p_t] = 0. $$ (10.137)
$x$ and $p$ are typically vectors; we typically check whether a model for $m$ can price a number of assets simultaneously. Equations (10.137) are often called the moment conditions.
It's convenient to define the errors $u_t(b)$ as the object whose mean should be zero,
$$ u_{t+1}(b) = m_{t+1}(b)x_{t+1} - p_t. $$
Given values for the parameters $b$, we could construct a time series on $u_t$ and look at its mean.
Define $g_T(b)$ as the sample mean of the $u_t$ errors, when the parameter vector is $b$, in a sample of size $T$:
$$ g_T(b) \equiv \frac{1}{T}\sum_{t=1}^{T} u_t(b) = E_T[u_t(b)] = E_T[m_{t+1}(b)x_{t+1} - p_t]. $$

The second equality introduces the handy notation $E_T$ for sample means,
$$ E_T(\cdot) = \frac{1}{T}\sum_{t=1}^{T}(\cdot). $$

(It might make more sense to denote these estimates $\hat{E}$ and $\hat{g}$. However, Hansen's $T$-subscript notation is so widespread that doing so would cause more confusion than it solves.)


The first-stage estimate of $b$ minimizes a quadratic form of the sample mean of the errors,
$$ \hat{b}_1 = \arg\min_b \; g_T(b)'Wg_T(b) $$
for some arbitrary matrix $W$ (often, $W = I$). This estimate is consistent and asymptotically normal. You can and often should stop here, as I explain below.
Using $\hat{b}_1$, form an estimate $\hat{S}$ of
$$ S \equiv \sum_{j=-\infty}^{\infty} E[u_t(b)u_{t-j}(b)']. $$ (138)
(Below I discuss various interpretations of and ways to construct this estimate.) Form a second-stage estimate $\hat{b}_2$ using the matrix $\hat{S}$ in the quadratic form,
$$ \hat{b}_2 = \arg\min_b \; g_T(b)'\hat{S}^{-1}g_T(b). $$

$\hat{b}_2$ is a consistent, asymptotically normal, and asymptotically efficient estimate of the parameter vector $b$. "Efficient" means that it has the smallest variance-covariance matrix among all estimators that set different linear combinations of $g_T(b)$ to zero.
The variance-covariance matrix of $\hat{b}_2$ is
$$ \mathrm{var}(\hat{b}_2) = \frac{1}{T}(d'S^{-1}d)^{-1}, \qquad d \equiv \frac{\partial g_T(b)}{\partial b}, $$
or, more explicitly,
$$ d = E_T\left[ \frac{\partial}{\partial b}\left( m_{t+1}(b)x_{t+1} - p_t \right) \right]\bigg|_{b=\hat{b}}. $$
(More precisely, $d$ should be written as the object to which $\partial g_T/\partial b$ converges, and then $\partial g_T/\partial b$ is an estimate of that object used to form a consistent estimate of the asymptotic variance-covariance matrix.)
This variance-covariance matrix can be used to test whether a parameter or group of parameters are equal to zero, via
$$ \frac{\hat{b}_i}{\sqrt{\mathrm{var}(\hat{b})_{ii}}} \sim N(0,1) $$
and
$$ \hat{b}_j'\left[ \mathrm{var}(\hat{b})_{jj} \right]^{-1}\hat{b}_j \sim \chi^2(\#\text{included } b\text{'s}) $$
where $b_j$ = subvector, $\mathrm{var}(b)_{jj}$ = submatrix.
Finally, the test of overidentifying restrictions is a test of the overall fit of the model. It states that $T$ times the minimized value of the second-stage objective is distributed $\chi^2$ with degrees of freedom equal to the number of moments less the number of estimated parameters:
$$ TJ_T = T \min_b\left[ g_T(b)'S^{-1}g_T(b) \right] \sim \chi^2(\#\text{moments} - \#\text{parameters}). $$
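The whole recipe can be sketched numerically. The following is a minimal illustration, not part of the text: it simulates data from the consumption-based model $m_{t+1} = \beta(c_{t+1}/c_t)^{-\gamma}$ with three assets constructed so that $E[mR] = 1$ holds at $\beta = 0.95$, $\gamma = 2$, then runs the first stage, the second stage, the standard errors, and the $J_T$ statistic. The data-generating process, sample size, and starting values are all arbitrary choices for the sketch.

```python
import numpy as np
from scipy.optimize import minimize

# Simulated data satisfying E[m R] = 1 at beta = 0.95, gamma = 2 by construction.
rng = np.random.default_rng(0)
T, beta0, gamma0, mu, sig = 5000, 0.95, 2.0, 0.02, 0.1
lng = mu + sig * rng.standard_normal(T)          # log consumption growth
g = np.exp(lng)                                  # c_{t+1}/c_t
loads = np.array([0.0, 4.0, 2.0])                # three assets with different g-exposures
k = 1.0 / (beta0 * np.exp((loads - gamma0) * mu + (loads - gamma0) ** 2 * sig**2 / 2))
R = k * g[:, None] ** loads * (1 + 0.05 * rng.standard_normal((T, 3)))

def u(b):
    """Errors u_{t+1}(b) = m_{t+1}(b) R_{t+1} - 1, one column per asset."""
    beta, gamma = b
    return (beta * g ** (-gamma))[:, None] * R - 1.0

def gT(b):
    return u(b).mean(axis=0)                     # sample moments g_T(b)

opts = {"xatol": 1e-9, "fatol": 1e-12, "maxiter": 20000}

# First stage: W = I
b1 = minimize(lambda b: gT(b) @ gT(b), [0.9, 1.0],
              method="Nelder-Mead", options=opts).x

# Estimate S from first-stage errors; the errors here are serially
# uncorrelated by construction, so only the j = 0 term of the sum is kept.
u1 = u(b1)
S = u1.T @ u1 / T
Sinv = np.linalg.inv(S)

# Second stage: W = S^{-1}
b2 = minimize(lambda b: gT(b) @ Sinv @ gT(b), b1,
              method="Nelder-Mead", options=opts).x

# Standard errors: var(b2) = (1/T)(d' S^{-1} d)^{-1}, d = dgT/db by differences
h = 1e-5
d = np.column_stack([(gT(b2 + h * e) - gT(b2 - h * e)) / (2 * h)
                     for e in np.eye(2)])
var_b = np.linalg.inv(d.T @ Sinv @ d) / T
se = np.sqrt(np.diag(var_b))

# JT test of overidentifying restrictions: 3 moments - 2 parameters = 1 df
TJ = T * gT(b2) @ Sinv @ gT(b2)
```

With simulated data that satisfy the model, the point estimates land near the true $(\beta, \gamma)$ and $TJ_T$ is a small draw from a $\chi^2(1)$.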

10.2 Interpreting the GMM procedure

$g_T(b)$ is a pricing error. It is proportional to $\alpha$.
GMM picks parameters to minimize a weighted sum of squared pricing errors.
The second stage picks the linear combination of pricing errors that are best measured, by having smallest sampling variation. First and second stage are like OLS and GLS regressions.
The standard error formula is a simple application of the delta method.
The $J_T$ test evaluates the model by looking at the sum of squared pricing errors.

Pricing errors

The moment conditions are
$$ g_T(b) = E_T[m_{t+1}(b)x_{t+1}] - E_T[p_t]. $$
Thus, each moment is the difference between actual ($E_T(p)$) and predicted ($E_T(mx)$) price, or pricing error. What could be more natural than to pick parameters so that the model's predicted prices are as close as possible to the actual prices, and then to evaluate the model by how large these pricing errors are?
In the language of expected returns, the moments $g_T(b)$ are proportional to the difference between actual and predicted returns: Jensen's alphas, or the vertical distance between the points and the line in Figure 5. To see this fact, recall that $0 = E(mR^e)$ can be translated to a predicted expected return,
$$ E(R^e) = -\frac{\mathrm{cov}(m,R^e)}{E(m)}. $$
Therefore, we can write the pricing error as
$$ g(b) = E(mR^e) = E(m)\left[ E(R^e) - \left( -\frac{\mathrm{cov}(m,R^e)}{E(m)} \right) \right] $$
$$ g(b) = E(m) \times (\text{actual mean return} - \text{predicted mean return}). $$
If we express the model in expected return-beta language,
$$ E(R^{ei}) = \alpha_i + \beta_i'\lambda, $$
then the GMM objective is proportional to the Jensen's alpha measure of mispricing,
$$ g(b) = E(m)\,\alpha_i. $$

First-stage estimates

If we could, we'd pick $b$ to make every element of $g_T(b) = 0$, i.e., to have the model price assets perfectly in sample. However, there are usually more moment conditions (returns times instruments) than there are parameters. There should be, because theories with as many free parameters as facts (moments) are vacuous. Thus, we choose $b$ to make $g_T(b)$ as small as possible, by minimizing a quadratic form,
$$ \min_b \; g_T(b)'Wg_T(b). $$ (139)
$W$ is a weighting matrix that tells us how much attention to pay to each moment, or how to trade off doing well in pricing one asset or linear combination of assets vs. doing well in pricing another. In the common case $W = I$, GMM treats all assets symmetrically, and the objective is to minimize the sum of squared pricing errors.
The sample pricing error $g_T(b)$ may be a nonlinear function of $b$. Thus, you may have to use a numerical search to find the value of $b$ that minimizes the objective in (10.139). However, since the objective is locally quadratic, the search is usually straightforward.

Second-stage estimates: Why $S^{-1}$?

What weighting matrix should you use? The weighting matrix directs GMM to emphasize some moments or linear combinations of moments at the expense of others. You might start with $W = I$, i.e., try to price all assets equally well. A $W$ that is not the identity matrix can be used to offset differences in units between the moments. You might also start with different elements on the diagonal of $W$ if you think some assets are more interesting, more informative, or better measured than others.
The second-stage estimate picks a weighting matrix based on statistical considerations.


Some asset returns may have much more variance than others. For those assets, the sample mean $g_T = E_T(m_tR_t - 1)$ will be a much less accurate measurement of the population mean $E(mR - 1)$, since the sample mean will vary more from sample to sample. Hence, it seems like a good idea to pay less attention to pricing errors from assets with high variance of $m_tR_t - 1$. One could implement this idea by using a $W$ matrix composed of inverse variances of $E_T(m_tR_t - 1)$ on the diagonal. More generally, since asset returns are correlated, one might think of using the inverse of the covariance matrix of $E_T(m_tR_t - 1)$. This weighting matrix pays most attention to linear combinations of moments about which the data set at hand has the most information. This idea is exactly the same as the heteroskedasticity and cross-correlation corrections that lead you from OLS to GLS in linear regressions.
The covariance matrix of $g_T = E_T(u_{t+1})$ is the variance of a sample mean. Exploiting the assumption that $E(u_t) = 0$, and that $u_t$ is stationary so that $E(u_1u_2')=E(u_tu_{t+1}')$ depends only on the time interval between the two $u$'s, we have
$$ \mathrm{var}(g_T) = \mathrm{var}\left( \frac{1}{T}\sum_{t=1}^{T} u_t \right) $$
$$ = \frac{1}{T^2}\left[ T\,E(u_tu_t') + (T-1)\left( E(u_tu_{t-1}') + E(u_tu_{t+1}') \right) + \cdots \right]. $$
As $T \to \infty$, $(T-j)/T \to 1$, so
$$ \mathrm{var}(g_T) \to \frac{1}{T}\sum_{j=-\infty}^{\infty} E(u_tu_{t-j}') = \frac{1}{T}S. $$
The last equality denotes $S$, known for other reasons as the spectral density matrix at frequency zero of $u_t$. (Precisely, $S$ so defined is the variance-covariance matrix of the $g_T$ for fixed $b$. The actual variance-covariance matrix of $g_T$ must take into account the fact that we chose $b$ to set a linear combination of the $g_T$ to zero in each sample. I give that formula below. The point here is heuristic.)
This fact suggests that a good weighting matrix might be the inverse of $S$. In fact, Hansen (1982) shows formally that the choice
$$ W = S^{-1}, \qquad S \equiv \sum_{j=-\infty}^{\infty} E(u_tu_{t-j}') $$
is the statistically optimal weighting matrix, meaning that it produces estimates with lowest asymptotic variance.
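In practice the infinite sum defining $S$ must be truncated at a finite number of lags. One popular construction, sketched below and not the only one, is the Newey-West estimator, which downweights lag $j$ by $1 - j/(q+1)$ so that the resulting $\hat{S}$ is positive semidefinite; the lag length $q$ and the test data are illustrative choices of mine.

```python
import numpy as np

def newey_west_S(u, q):
    """Newey-West estimate of the spectral density at frequency zero:
    S_hat = sum_{j=-q}^{q} (1 - |j|/(q+1)) (1/T) sum_t u_t u_{t-j}'.
    u is a (T x k) array of GMM errors; q is the lag truncation."""
    T, k = u.shape
    S = u.T @ u / T                      # j = 0 term
    for j in range(1, q + 1):
        Gamma_j = u[j:].T @ u[:-j] / T   # estimate of E[u_t u_{t-j}']
        w = 1.0 - j / (q + 1.0)          # Bartlett kernel weight
        S += w * (Gamma_j + Gamma_j.T)   # add the j and -j terms together
    return S

# With q = 0 this collapses to the i.i.d. formula E[u u']:
rng = np.random.default_rng(1)
u = rng.standard_normal((1000, 2))
S0 = newey_west_S(u, 0)
```

For serially uncorrelated errors, $q = 0$ suffices; for overlapping or persistent errors, $q$ should grow with the sample.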

You may be more used to the formula $\sigma(u)/\sqrt{T}$ for the standard deviation of a sample mean. This formula is a special case that holds when the $u$'s are uncorrelated over time. If $E(u_tu_{t-j}') = 0$ for $j \neq 0$, then the previous equation reduces to
$$ \mathrm{var}\left( \frac{1}{T}\sum_{t=1}^{T} u_t \right) = \frac{1}{T}E(uu') = \frac{\mathrm{var}(u)}{T}. $$
This is probably the first statistical formula you ever saw: the variance of the sample mean. In GMM, it is the last statistical formula you'll ever see as well. GMM amounts to just generalizing the simple ideas behind the distribution of the sample mean to parameter estimation and general statistical contexts.
The first and second stage estimates should remind you of standard linear regression models. You start with an OLS regression. If the errors are not i.i.d., the OLS estimates are consistent, but not efficient. If you want efficient estimates, you can use the OLS estimates to obtain a series of residuals, estimate a variance-covariance matrix of residuals, and then do GLS. GLS is also consistent and more efficient, meaning that the sampling variation in the estimated parameters is lower.

Standard errors

The formula for the standard error of the estimate,
$$ \mathrm{var}(\hat{b}_2) = \frac{1}{T}(d'S^{-1}d)^{-1}, $$ (140)
can be understood most simply as an instance of the "delta method": the asymptotic variance of $f(x)$ is $f'(x)^2\mathrm{var}(x)$. Suppose there is only one parameter and one moment. $S/T$ is the variance of the moment $g_T$, and $d^{-1} = [\partial g_T/\partial b]^{-1} = \partial b/\partial g_T$. Then the delta method formula gives
$$ \mathrm{var}(\hat{b}_2) = \frac{\partial b}{\partial g_T}\,\mathrm{var}(g_T)\,\frac{\partial b}{\partial g_T} = \frac{1}{T}\frac{\partial b}{\partial g_T}\,S\,\frac{\partial b}{\partial g_T}. $$
The actual formula (10.140) just generalizes this idea to vectors.
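The one-parameter delta-method logic is easy to check numerically. In this toy sketch (the function $\phi$ and the data distribution are arbitrary choices of mine), the delta-method standard error of $\phi(\bar{x}) = e^{\bar{x}}$ is compared to the spread of the estimate across many simulated samples.

```python
import numpy as np

rng = np.random.default_rng(2)
T, mu, sig = 2000, 1.0, 0.5

# Delta method: var(phi(xbar)) ~ phi'(mu)^2 var(xbar), with phi(m) = exp(m)
var_xbar = sig**2 / T
se_delta = np.exp(mu) * np.sqrt(var_xbar)

# Monte Carlo spread of phi(xbar) across 2000 independent samples
draws = rng.normal(mu, sig, size=(2000, T))
phi_hats = np.exp(draws.mean(axis=1))
se_mc = phi_hats.std()
```

The two standard errors agree to within Monte Carlo noise, which is the content of the delta method.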

10.2.1 JT Test

Once you've estimated the parameters that make a model "fit best," the natural question is, how well does it fit? It's natural to look at the pricing errors and see if they are "big." The $J_T$ test asks whether they are "big" by statistical standards: if the model is true, how often should we see a (weighted) sum of squared pricing errors this big? If not often, the model is "rejected." The test is
$$ TJ_T = T\,g_T(\hat{b})'S^{-1}g_T(\hat{b}) \sim \chi^2(\#\text{moments} - \#\text{parameters}). $$
Since $S$ is the variance-covariance matrix of $g_T$, this statistic is the minimized pricing errors divided by their variance-covariance matrix. Sample means converge to a normal distribution, so sample means squared divided by variance converges to the square of a normal, or $\chi^2$.
The reduction in degrees of freedom corrects for the fact that $S$ is really the covariance matrix of $g_T$ for fixed $b$. We set a linear combination of the $g_T$ to zero in each sample, so the actual covariance matrix of $g_T$ is singular, with rank #moments $-$ #parameters. More details below.

10.3 Applying GMM

Forecast errors and instruments.
Stationarity and choice of units.

Notation; instruments and returns

Most of the effort involved with GMM is simply mapping a given problem into the very general notation. The equation
$$ E[m_{t+1}(b)x_{t+1} - p_t] = 0 $$
can capture a lot. We often test asset pricing models using returns, in which case the moment conditions are
$$ E[m_{t+1}(b)R_{t+1} - 1] = 0. $$
It is common to add instruments as well. Mechanically, you can multiply both sides of
$$ 1 = E_t[m_{t+1}(b)R_{t+1}] $$
by any variable $z_t$ observed at time $t$ before taking unconditional expectations, resulting in
$$ E(z_t) = E[m_{t+1}(b)R_{t+1}z_t]. $$
Expressing the result in $E(\cdot) = 0$ form,
$$ 0 = E\{[m_{t+1}(b)R_{t+1} - 1]z_t\}. $$

We can do this for a whole vector of returns and instruments, multiplying each return by each instrument. For example, if we start with two returns $R = [R^a \; R^b]'$ and one instrument $z$, equation (10.141) looks like
$$ E\left\{ \begin{bmatrix} m_{t+1}(b)R^a_{t+1} \\ m_{t+1}(b)R^b_{t+1} \\ m_{t+1}(b)R^a_{t+1}z_t \\ m_{t+1}(b)R^b_{t+1}z_t \end{bmatrix} - \begin{bmatrix} 1 \\ 1 \\ z_t \\ z_t \end{bmatrix} \right\} = \begin{bmatrix} 0 \\ 0 \\ 0 \\ 0 \end{bmatrix}. $$
Using the Kronecker product $\otimes$, meaning "multiply every element by every other element," we can denote the same relation compactly by
$$ E\{[m_{t+1}(b)R_{t+1} - 1] \otimes z_t\} = 0, $$
or, emphasizing the managed-portfolio interpretation and $p = E(mx)$ notation,
$$ E[m_{t+1}(b)(R_{t+1} \otimes z_t) - (1 \otimes z_t)] = 0. $$
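Building these managed-portfolio moments is mechanical. A sketch, assuming an illustrative discount factor series, two simulated returns, and one instrument (all of the numbers are placeholders, not data from the text):

```python
import numpy as np

rng = np.random.default_rng(3)
T = 500
m = 0.99 + 0.01 * rng.standard_normal(T)        # discount factor series m_{t+1}
R = 1.02 + 0.05 * rng.standard_normal((T, 2))   # two returns R_{t+1}
z = 1.0 + 0.1 * rng.standard_normal(T)          # one instrument z_t

# Instrument vector (1, z_t)': the constant keeps the original moments.
Z = np.column_stack([np.ones(T), z])            # (T x 2)
pe = m[:, None] * R - 1.0                       # (T x 2) pricing errors m R - 1

# Row t is the Kronecker product [m R - 1] (x) (1, z_t)',
# ordered [u^a, u^a z_t, u^b, u^b z_t].
u = np.einsum("ti,tj->tij", pe, Z).reshape(T, -1)   # (T x 4)

gT = u.mean(axis=0)      # sample moments, one per managed portfolio
```

Each column of `u` is the error of one managed portfolio; stacking returns and instruments this way turns conditional moment conditions into a longer vector of unconditional ones.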

Forecast errors and instruments

The asset pricing model says that, although expected returns can vary across time and assets, expected discounted returns should always be the same, 1. The error $u_{t+1} = m_{t+1}R_{t+1} - 1$ is the ex-post discounted return; it represents a forecast error. Like any forecast error, $u_{t+1}$ should be conditionally and unconditionally mean zero.
In an econometric context, $z$ is an instrument because it is uncorrelated with the error $u_{t+1}$. $E(z_tu_{t+1})$ is the numerator of a regression coefficient of $u_{t+1}$ on $z_t$; thus adding instruments basically checks that the ex-post discounted return is unforecastable by linear regressions.
If an asset's return is higher than predicted when $z_t$ is unusually high, but not on average, scaling by $z_t$ will pick up this feature of the data. Then, the moment condition checks that the discount rate is unusually low at such times, or that the conditional covariance of the discount rate and asset return moves sufficiently to justify the high conditionally expected return. As I explained in Section 8.1, the addition of instruments is equivalent to adding the returns of managed portfolios to the analysis, and is in principle able to capture all of the model's predictions.

Stationarity and distributions

The GMM distribution theory does require some statistical assumptions. Hansen (1982) and Ogaki (1993) cover them in depth. The most important assumption is that $m$, $p$, and $x$ must be stationary random variables. ("Stationary" is often misused to mean constant, or i.i.d. The statistical definition of stationarity is that the joint distribution of $x_t, x_{t-j}$ depends only on $j$ and not on $t$.) Sample averages must converge to population means as the sample size grows, and stationarity implies this result.
Assuring stationarity usually amounts to a choice of sensible units. For example, though we could express the pricing of a stock as
$$ p_t = E_t[m_{t+1}(d_{t+1} + p_{t+1})], $$
it would not be wise to do so. For stocks, $p$ and $d$ rise over time and so are typically not stationary; their unconditional means are not defined. It is better to divide by $p_t$ and express the model as
$$ 1 = E_t\left[ m_{t+1}\frac{d_{t+1} + p_{t+1}}{p_t} \right] = E_t(m_{t+1}R_{t+1}). $$
The stock return is plausibly stationary.
Dividing by dividends is an alternative and, I think, underutilized way to achieve stationarity (at least for portfolios, since many individual stocks do not pay regular dividends):
$$ \frac{p_t}{d_t} = E_t\left[ m_{t+1}\left( 1 + \frac{p_{t+1}}{d_{t+1}} \right)\frac{d_{t+1}}{d_t} \right]. $$
Now we map $\left( 1 + \frac{p_{t+1}}{d_{t+1}} \right)\frac{d_{t+1}}{d_t}$ into $x_{t+1}$ and $p_t/d_t$ into $p_t$. This formulation allows us to focus on prices rather than one-period returns.
Bonds are a claim to a dollar, so bond prices and yields do not grow over time. Hence, it might be all right to examine
$$ p^b_t = E_t(m_{t+1} \cdot 1) $$
with no transformations.
Stationarity is not always a clear-cut question in practice. As variables become "less stationary," as they experience longer swings in a sample, the asymptotic distribution can become a less reliable guide to a finite-sample distribution. For example, the level of nominal interest rates is surely a stationary variable in a fundamental sense: it was 6% in ancient Babylon, about 6% in 14th century Italy, and about 6% again today. Yet it takes very long swings away from this unconditional mean, moving slowly up or down for even 20 years at a time. Therefore, in an estimate and test that uses the level of interest rates, the asymptotic distribution theory might be a bad approximation to the correct finite-sample distribution theory. This is true even if the number of data points is large. 10,000 data points measured every minute are a "smaller" data set than 100 data points measured every year. In such a case, it is particularly important to develop a finite-sample distribution by simulation or bootstrap, which is easy to do given today's computing power.
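As a sketch of that bootstrap idea: for a persistent series, resampling contiguous blocks preserves the serial dependence that the naive i.i.d. formula ignores. The AR(1) process, block length, and sample size here are illustrative choices of mine.

```python
import numpy as np

rng = np.random.default_rng(4)
T, rho = 400, 0.95

# A persistent, "slow-swinging" series: AR(1) with rho = 0.95
x = np.zeros(T)
for t in range(1, T):
    x[t] = rho * x[t - 1] + rng.standard_normal()

def block_bootstrap_means(x, block, n_boot, rng):
    """Resample contiguous blocks with replacement, preserving serial
    dependence, and return the bootstrap distribution of the sample mean."""
    T = len(x)
    n_blocks = int(np.ceil(T / block))
    means = np.empty(n_boot)
    for b in range(n_boot):
        starts = rng.integers(0, T - block + 1, size=n_blocks)
        xb = np.concatenate([x[s:s + block] for s in starts])[:T]
        means[b] = xb.mean()
    return means

boot = block_bootstrap_means(x, block=40, n_boot=500, rng=rng)
se_boot = boot.std()               # bootstrap standard error of the mean
se_iid = x.std() / np.sqrt(T)      # naive i.i.d. formula, ignores persistence
```

For a series this persistent, the bootstrap standard error is several times the naive one, which is exactly the finite-sample caution in the text.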
It is also important to choose test assets in a way that is stationary. For example, individual stocks change character over time, increasing or decreasing size, exposure to risk factors, leverage, and even the nature of their business. For this reason, it is common to sort stocks into portfolios based on characteristics such as betas, size, book/market ratios, industry and so forth. The statistical characteristics of the portfolio returns may be much more constant than the characteristics of individual securities, which float in and out of the various portfolios. (One can alternatively include the characteristics as instruments.)
Many econometric techniques require assumptions about distributions. As you can see, the variance formulas used in GMM do not include the usual assumptions that variables are i.i.d., normally distributed, homoskedastic, etc. You can put such assumptions in if you want to (we'll see how below), and adding such assumptions simplifies the formulas and can improve the small-sample performance when the assumptions are justified, but you don't have to add these assumptions.

Chapter 11. GMM: general formulas
and applications
Lots of calculations beyond formal parameter estimation and overall model testing are useful in the process of evaluating a model and comparing it to other models. But you want to understand sampling variation in such calculations, and mapping the questions into the GMM framework allows you to do this easily. In addition, alternative estimation and evaluation procedures may be more intuitive or robust to model misspecification than the two (or multi) stage procedure described above.
In this chapter I lay out the general GMM framework, and I discuss five applications and variations on the basic GMM method. 1) I show how to derive standard errors of nonlinear functions of sample moments, such as correlation coefficients. 2) I apply GMM to OLS regressions, easily deriving standard error formulas that correct for autocorrelation and conditional heteroskedasticity. 3) I show how to use prespecified weighting matrices $W$ in asset pricing tests in order to overcome the tendency of efficient GMM to focus on spuriously low-variance portfolios. 4) As a good parable for prespecified linear combinations of moments $a$, I show how to mimic "calibration" and "evaluation" phases of real business cycle models. 5) I show how to use the distribution theory for the $g_T$ beyond just forming the $J_T$ test, in order to evaluate the importance of individual pricing errors. The next chapter continues, and collects GMM variations useful for evaluating linear factor models and related mean-variance frontier questions.
Many of these calculations amount to creative choices of the $a_T$ matrix that selects which linear combinations of moments are set to zero, and reading off the resulting formulas for the variance-covariance matrix of the estimated coefficients, equation (11.146), and the variance-covariance matrix of the moments $g_T$, equation (11.147).

11.1 General GMM formulas

The general GMM estimate:
$$ a_Tg_T(\hat{b}) = 0. $$

Distribution of $\hat{b}$:
$$ T\,\mathrm{cov}(\hat{b}) = (ad)^{-1}aSa'(ad)^{-1\prime}. $$

Distribution of $g_T(\hat{b})$:
$$ T\,\mathrm{cov}\left[ g_T(\hat{b}) \right] = \left( I - d(ad)^{-1}a \right)S\left( I - d(ad)^{-1}a \right)'. $$

The "optimal" estimate uses $a = d'S^{-1}$. In this case,
$$ T\,\mathrm{cov}(\hat{b}) = (d'S^{-1}d)^{-1}, $$
$$ T\,\mathrm{cov}\left[ g_T(\hat{b}) \right] = S - d(d'S^{-1}d)^{-1}d', $$
$$ TJ_T = T\,g_T(\hat{b})'S^{-1}g_T(\hat{b}) \to \chi^2(\#\text{moments} - \#\text{parameters}). $$

An analogue to the likelihood ratio test,
$$ TJ_T(\text{restricted}) - TJ_T(\text{unrestricted}) \sim \chi^2(\#\text{restrictions}). $$

GMM procedures can be used to implement a host of estimation and testing exercises. Just about anything you might want to estimate can be written as a special case of GMM. To do so, you just have to remember (or look up) a few very general formulas, and then map them into your case.
Express a model as
$$ E[f(x_t, b)] = 0. $$
Everything is a vector: $f$ can represent a vector of $L$ sample moments, $x_t$ can be $M$ data series, $b$ can be $N$ parameters. $f(x_t, b)$ is a slightly more explicit statement of the errors $u_t(b)$ in the last chapter.
Definition of the GMM estimate.
We estimate parameters $\hat{b}$ to set some linear combination of sample means of $f$ to zero,
$$ \hat{b}: \; \text{set} \;\; a_Tg_T(\hat{b}) = 0, $$ (143)
where
$$ g_T(b) \equiv \frac{1}{T}\sum_{t=1}^{T} f(x_t, b) $$
and $a_T$ is a matrix that defines which linear combination of $g_T(b)$ will be set to zero. This defines the GMM estimate.
If there are as many moments as parameters, you will set each moment to zero; when there are fewer parameters than moments, (11.143) just captures the natural idea that you will set some moments, or some linear combinations of moments, to zero in order to estimate the parameters. The minimization of the last chapter is a special case. If you estimate $b$ by $\min_b g_T(b)'Wg_T(b)$, the first-order conditions are
$$ \frac{\partial g_T(b)'}{\partial b}Wg_T(b) = 0, $$
which is of the form (11.143) with $a_T = \partial g_T'/\partial b\,W$. The general GMM procedure allows you to pick arbitrary linear combinations of the moments to set to zero in parameter estimation.
Standard errors of the estimate.
Hansen (1982), Theorem 3.1 tells us that the asymptotic distribution of the GMM estimate is
$$ \sqrt{T}(\hat{b} - b) \to N\left[ 0, (ad)^{-1}aSa'(ad)^{-1\prime} \right], $$ (144)
where
$$ d \equiv E\left[ \frac{\partial f}{\partial b'}(x_t, b) \right] = \frac{\partial g_T(b)}{\partial b'} $$
(i.e., $d$ is defined as the population moment in the first equality, which we estimate in sample by the second equality), where
$$ a \equiv \mathrm{plim}\,a_T, $$
and where
$$ S \equiv \sum_{j=-\infty}^{\infty} E[f(x_t, b)f(x_{t-j}, b)']. $$ (145)
Don't forget the $T$ in (11.144)! In practical terms, this means to use
$$ \mathrm{var}(\hat{b}) = \frac{1}{T}(ad)^{-1}aSa'(ad)^{-1\prime} $$ (146)
as the covariance matrix for standard errors and tests. As in the last chapter, you can understand this formula as an application of the delta method.
Distribution of the moments.
Hansen's Lemma 4.1 gives the sampling distribution of the moments $g_T(\hat{b})$:
$$ \sqrt{T}g_T(\hat{b}) \to N\left[ 0, \left( I - d(ad)^{-1}a \right)S\left( I - d(ad)^{-1}a \right)' \right]. $$ (147)
As we have seen, $S$ would be the asymptotic variance-covariance matrix of sample means, if we did not estimate any parameters, which sets some linear combinations of the $g_T$ to zero. The $I - d(ad)^{-1}a$ terms account for the fact that in each sample some linear combinations of $g_T$ are set to zero. Thus, this variance-covariance matrix is singular.


$\chi^2$ tests.
A sum of squared standard normals is distributed $\chi^2$. Therefore, it is natural to use the distribution theory for $g_T$ to see if the $g_T$ are jointly "too big." Equation (11.147) suggests that we form the statistic
$$ T\,g_T(\hat{b})'\left[ \left( I - d(ad)^{-1}a \right)S\left( I - d(ad)^{-1}a \right)' \right]^{-1}g_T(\hat{b}) $$ (148)
and that it should have a $\chi^2$ distribution. It does, but with a hitch: the variance-covariance matrix is singular, so you have to pseudo-invert it. For example, you can perform an eigenvalue decomposition $\Sigma = Q\Lambda Q'$ and then invert only the non-zero eigenvalues. Also, the $\chi^2$ distribution has degrees of freedom given by the number of non-zero linear combinations of $g_T$, the number of moments less the number of estimated parameters. You can similarly use (11.147) to construct tests of individual moments ("are the small stocks mispriced?") or groups of moments.
Efficient estimates
The theory so far allows us to estimate parameters by setting any linear combination of moments to zero. Hansen shows that one particular choice is statistically optimal,
$$ a = d'S^{-1}. $$ (149)
This choice is the first-order condition to $\min_b g_T(b)'S^{-1}g_T(b)$ that we studied in the last chapter. With this weighting matrix, the standard error formula (11.146) reduces to
$$ \sqrt{T}(\hat{b} - b) \to N\left[ 0, (d'S^{-1}d)^{-1} \right]. $$ (150)
This is Hansen's Theorem 3.2. The sense in which (11.149) is "efficient" is that the sampling variation of the parameters for an arbitrary $a$ matrix, (11.146), equals the sampling variation of the "efficient" estimate in (11.150) plus a positive semidefinite matrix.
With the optimal weights (11.149), the variance of the moments (11.147) simplifies to
$$ \mathrm{cov}(g_T) = \frac{1}{T}\left( S - d(d'S^{-1}d)^{-1}d' \right). $$ (151)
We can use this matrix in a test of the form (11.148). However, Hansen's Lemma 4.2 tells us that there is an equivalent and simpler way to construct this test,
$$ T\,g_T(\hat{b})'S^{-1}g_T(\hat{b}) \to \chi^2(\#\text{moments} - \#\text{parameters}). $$ (152)
This result is nice since we get to use the already-calculated and non-singular $S^{-1}$.
To derive (11.152) from (11.147), factor $S = CC'$ and then find the asymptotic covariance matrix of $C^{-1}g_T(\hat{b})$ using (11.147). The result is
$$ \mathrm{var}\left[ \sqrt{T}C^{-1}g_T(\hat{b}) \right] = I - C^{-1}d(d'S^{-1}d)^{-1}d'C^{-1\prime}. $$
This is an idempotent matrix of rank #moments $-$ #parameters, so (11.152) follows.
Alternatively, note that $S^{-1}$ is a pseudo-inverse of the second-stage $\mathrm{cov}(g_T)$. (A pseudo-inverse times $\mathrm{cov}(g_T)$ should result in an idempotent matrix of the same rank as $\mathrm{cov}(g_T)$.)
$$ S^{-1}\mathrm{cov}(g_T) = S^{-1}\left( S - d(d'S^{-1}d)^{-1}d' \right) = I - S^{-1}d(d'S^{-1}d)^{-1}d'. $$
Then, check that the result is idempotent:
$$ \left( I - S^{-1}d(d'S^{-1}d)^{-1}d' \right)\left( I - S^{-1}d(d'S^{-1}d)^{-1}d' \right) = I - S^{-1}d(d'S^{-1}d)^{-1}d'. $$
This derivation not only verifies that $J_T$ has the same distribution as $g_T'\mathrm{cov}(g_T)^{-1}g_T$, but that they are numerically the same in every sample.
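This algebra is easy to verify numerically with arbitrary matrices; the random $d$ and positive definite $S$ below are illustrative inputs, not anything from the text.

```python
import numpy as np

rng = np.random.default_rng(6)
k, n = 5, 2                               # moments, parameters
d = rng.standard_normal((k, n))
C = rng.standard_normal((k, k))
S = C @ C.T + k * np.eye(k)               # a positive definite S
Sinv = np.linalg.inv(S)

# M = S^{-1} cov(gT) = I - S^{-1} d (d'S^{-1}d)^{-1} d'
M = np.eye(k) - Sinv @ d @ np.linalg.inv(d.T @ Sinv @ d) @ d.T

idempotent = np.allclose(M @ M, M)        # M^2 = M
rank = int(round(np.trace(M)))            # rank of an idempotent matrix = trace
```

The trace works out to $k - n$, the number of moments less the number of parameters, which is the degrees-of-freedom count in (11.152).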
I emphasize that (11.150) and (11.152) only apply to the "optimal" choice of weights, (11.149). If you use another set of weights, as in a first-stage estimate, you must use the general formulas (11.146) and (11.147).
Model comparisons
You often want to compare one model to another. If one model can be expressed as a special or "restricted" case of the other or "unrestricted" model, we can perform a statistical comparison that looks very much like a likelihood ratio test. If we use the same $S$ matrix (usually that of the unrestricted model), the restricted $J_T$ must rise. But if the restricted model is really true, it shouldn't rise "much." How much?
$$ TJ_T(\text{restricted}) - TJ_T(\text{unrestricted}) \sim \chi^2(\#\text{of restrictions}). $$
This is a "$\chi^2$ difference" test, due to Newey and West (1987a), who call it the "D-test."

11.2 Testing moments

How to test one or a group of pricing errors: 1) use the formula for $\mathrm{var}(g_T)$; 2) use a $\chi^2$ difference test.

You may want to see how well a model does on particular moments or particular pricing errors. For example, the celebrated "small firm effect" states that an unconditional CAPM ($m = a + bR^W$, no scaled factors) does badly in pricing the returns on a portfolio that always holds the smallest 1/10th or 1/20th of firms in the NYSE. You might want to see whether a new model prices the small-firm returns well. The standard error of pricing errors also allows you to add error bars to a plot of predicted vs. actual mean returns such as Figure 5, or to other diagnostics based on pricing errors.
We have already seen that individual elements of $g_T$ measure the pricing errors or expected return errors. Thus, the sampling variation of $g_T$ given by (11.147) provides exactly the standard error we are looking for. You can use the sampling distribution of $g_T$ to evaluate the significance of individual pricing errors, to construct a t-test (for a single $g_T$, such as small firms) or $\chi^2$ test (for groups of $g_T$, such as small firms $\times$ instruments). As usual, this is the Wald test.
Alternatively, you can use the $\chi^2$ difference approach. Start with a general model that includes all the moments, and form an estimate of the spectral density matrix $S$. Now set to zero the moments you want to test, and denote $g_{sT}(b)$ the vector of moments, including the zeros ($s$ for "smaller"). Choose $b_s$ to minimize $g_{sT}(b_s)'S^{-1}g_{sT}(b_s)$ using the same weighting matrix $S$. The criterion will be lower than the original criterion $g_T(b)'S^{-1}g_T(b)$, since there are the same number of parameters and fewer moments. But if the moments we want to test truly are zero, the criterion shouldn't be that much lower. The $\chi^2$ difference test applies,
$$ T\,g_T(\hat{b})'S^{-1}g_T(\hat{b}) - T\,g_{sT}(\hat{b}_s)'S^{-1}g_{sT}(\hat{b}_s) \sim \chi^2(\#\text{eliminated moments}). $$

Of course, don't fall into the obvious trap of picking the largest of 10 pricing errors and noting that it's more than two standard deviations from zero. The distribution of the largest of 10 pricing errors is much wider than the distribution of a single one. To use this distribution, you have to pick which pricing error you're going to test before you look at the data.

11.3 Standard errors of anything by delta method

One quick application illustrates the usefulness of the GMM formulas. Often, we want to estimate a quantity that is a nonlinear function of sample means,
$$ b = \phi[E(x_t)] = \phi(\mu). $$
In this case, the formula (11.144) reduces to
$$ \mathrm{var}(b_T) = \frac{1}{T}\left[ \frac{d\phi}{d\mu} \right]'\left[ \sum_{j=-\infty}^{\infty} \mathrm{cov}(x_t, x_{t-j}') \right]\left[ \frac{d\phi}{d\mu} \right]. $$ (153)
The formula is very intuitive. The variance of the sample mean is the covariance term inside. The derivatives just linearize the function $\phi$ near the true $b$.
For example, a correlation coefficient can be written as a function of sample means as
$$ \mathrm{corr}(x_t, y_t) = \frac{E(x_ty_t) - E(x_t)E(y_t)}{\sqrt{E(x_t^2) - E(x_t)^2}\sqrt{E(y_t^2) - E(y_t)^2}}. $$
Thus, take
$$ \mu = \left[ E(x_t) \;\; E(x_t^2) \;\; E(y_t) \;\; E(y_t^2) \;\; E(x_ty_t) \right]'. $$
A problem at the end of the chapter asks you to take derivatives and derive the standard error of the correlation coefficient. One can derive standard errors for impulse-response functions, variance decompositions, and many other statistics in this way.
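A numerical sketch of the correlation calculation, using a finite-difference gradient in place of the analytic derivatives the end-of-chapter problem asks for, and assuming i.i.d. data so that only the $j = 0$ covariance term of (11.153) is kept (the simulated data are illustrative):

```python
import numpy as np

rng = np.random.default_rng(7)
T = 5000
x = rng.standard_normal(T)
y = 0.6 * x + 0.8 * rng.standard_normal(T)   # corr(x, y) = 0.6 in population

# h_t stacks the five moments whose means enter the correlation formula
h = np.column_stack([x, x**2, y, y**2, x * y])
mu = h.mean(axis=0)

def phi(m):
    """corr as a function of the means [E x, E x^2, E y, E y^2, E xy]."""
    sx = np.sqrt(m[1] - m[0]**2)
    sy = np.sqrt(m[3] - m[2]**2)
    return (m[4] - m[0] * m[2]) / (sx * sy)

# Finite-difference gradient d(phi)/d(mu)
eps = 1e-6
grad = np.array([(phi(mu + eps * np.eye(5)[i]) - phi(mu - eps * np.eye(5)[i]))
                 / (2 * eps) for i in range(5)])

Sigma = np.cov(h.T)          # j = 0 term of the long-run covariance of h_t
se_corr = np.sqrt(grad @ Sigma @ grad / T)
rho_hat = phi(mu)
```

The same pattern, stack the underlying moments, differentiate the function of their means, sandwich with the covariance, works for any statistic built from sample means.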

11.4 Using GMM for regressions

By mapping OLS regressions into the GMM framework, we derive formulas for OLS standard errors that correct for autocorrelation and conditional heteroskedasticity of the errors. The general formula is
$$ \mathrm{var}(\hat{\beta}) = \frac{1}{T}E(x_tx_t')^{-1}\left[ \sum_{j=-\infty}^{\infty} E(e_tx_tx_{t-j}'e_{t-j}) \right]E(x_tx_t')^{-1}, $$
and it simplifies in special cases.

Mapping any statistical procedure into GMM makes it easy to develop an asymptotic
distribution that corrects for statistical problems such as non-normality, serial correlation and
conditional heteroskedasticity. To illustrate, as well as to develop the very useful formulas, I
map OLS regressions into GMM.
Correcting OLS standard errors for econometric problems is not the same thing as GLS.
When errors do not obey the OLS assumptions, OLS is consistent, and often more robust
than GLS, but its standard errors need to be corrected.
OLS picks parameters $\beta$ to minimize the variance of the residual:
$$ \min_\beta \; E_T\left[ (y_t - \beta'x_t)^2 \right]. $$
We find $\hat{\beta}$ from the first-order condition, which states that the residual is orthogonal to the right-hand variable:
$$ g_T(\hat{\beta}) = E_T\left[ x_t(y_t - x_t'\hat{\beta}) \right] = 0. $$ (154)
This condition is exactly identified: the number of moments equals the number of parameters. Thus, we set the sample moments exactly to zero and there is no weighting matrix ($a = I$). We can solve for the estimate analytically,
$$ \hat{\beta} = \left[ E_T(x_tx_t') \right]^{-1}E_T(x_ty_t). $$

This is the familiar OLS formula. The rest of the ingredients to equation (11.144) are
$$ d = E(x_tx_t'), $$
$$ f(x_t, \beta) = x_t(y_t - x_t'\beta) = x_te_t, $$
where $e_t$ is the regression residual. Equation (11.144) gives a formula for OLS standard errors,
$$ \mathrm{var}(\hat{\beta}) = \frac{1}{T}E(x_tx_t')^{-1}\left[ \sum_{j=-\infty}^{\infty} E(e_tx_tx_{t-j}'e_{t-j}) \right]E(x_tx_t')^{-1}. $$ (155)
This formula reduces to some interesting special cases.
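Formula (11.155) can be sketched directly in code, truncating the infinite sum at a finite lag; the helper name, lag choice, and simulated heteroskedastic data are all illustrative. Setting `lags=0` keeps only the $j = 0$ term, the heteroskedasticity-only case discussed below.

```python
import numpy as np

def ols_hac_se(y, X, lags):
    """OLS with GMM standard errors: var(beta) =
    (1/T) E(xx')^{-1} [ sum_j E(e_t x_t x_{t-j}' e_{t-j}) ] E(xx')^{-1},
    with the sum truncated at |j| <= lags. No kernel weights are applied
    here; a Bartlett weight would guarantee a positive semidefinite S."""
    T, k = X.shape
    XX = X.T @ X / T
    beta = np.linalg.solve(X.T @ X, X.T @ y)
    e = y - X @ beta
    xe = X * e[:, None]                  # rows are x_t e_t
    S = xe.T @ xe / T                    # j = 0 term
    for j in range(1, lags + 1):
        G = xe[j:].T @ xe[:-j] / T
        S += G + G.T
    XXinv = np.linalg.inv(XX)
    V = XXinv @ S @ XXinv / T
    return beta, np.sqrt(np.diag(V))

# Illustrative data: y = 1 + 2x + e with heteroskedastic errors
rng = np.random.default_rng(8)
T = 4000
x = rng.standard_normal(T)
e = (1 + 0.5 * np.abs(x)) * rng.standard_normal(T)
X = np.column_stack([np.ones(T), x])
y = 1.0 + 2.0 * x + e
beta, se = ols_hac_se(y, X, lags=0)      # heteroskedasticity-robust errors
```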

Serially uncorrelated, homoskedastic errors

These are the usual OLS assumptions, and it's good that the usual formulas emerge. Formally, the OLS assumptions are
$$ E(e_t \mid x_t, x_{t-1}, \ldots, e_{t-1}, e_{t-2}, \ldots) = 0 $$ (156)
$$ E(e_t^2 \mid x_t, x_{t-1}, \ldots, e_{t-1}, e_{t-2}, \ldots) = \text{constant} = \sigma_e^2. $$ (157)
To use these assumptions, I use the fact that
$$ E(ab) = E\left[ E(a \mid b)\,b \right]. $$
The first assumption means that only the $j = 0$ term enters the sum:
$$ \sum_{j=-\infty}^{\infty} E(e_tx_tx_{t-j}'e_{t-j}) = E(e_t^2x_tx_t'). $$
The second assumption means that
$$ E(e_t^2x_tx_t') = E(e_t^2)E(x_tx_t') = \sigma_e^2E(x_tx_t'). $$

Hence equation (11.155) reduces to our old friend,
$$ \mathrm{var}(\hat{\beta}) = \frac{1}{T}\sigma_e^2E(x_tx_t')^{-1} = \sigma_e^2(X'X)^{-1}. $$
The last notation is typical of econometrics texts, in which $X = [x_1 \; x_2 \; \ldots \; x_T]'$ represents the data matrix.

Heteroskedastic errors

If we delete the conditional homoskedasticity assumption (11.157), we can't pull the e_t² out of the expectation, so the standard errors are

$$
\mathrm{var}(\hat\beta) = \frac{1}{T}\, E(x_t x_t')^{-1}\, E(e_t^2\, x_t x_t')\, E(x_t x_t')^{-1}.
$$

These are known as "heteroskedasticity-consistent standard errors" or "White standard errors" after White (1980).
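The White formula translates directly into code. This is a rough sketch (numpy; the heteroskedastic data-generating process is made up for illustration), not a production implementation; libraries such as statsmodels provide the same estimator.

```python
import numpy as np

rng = np.random.default_rng(1)
T = 2000
x = np.column_stack([np.ones(T), rng.normal(size=T)])
# error variance depends on the regressor: heteroskedastic, but E(e|x) = 0
e = rng.normal(size=T) * (0.5 + np.abs(x[:, 1]))
y = x @ np.array([1.0, 2.0]) + e

beta_hat = np.linalg.solve(x.T @ x, x.T @ y)
resid = y - x @ beta_hat

Exx_inv = np.linalg.inv(x.T @ x / T)
S = x.T @ (x * (resid**2)[:, None]) / T       # E(e_t^2 x_t x_t')
var_white = Exx_inv @ S @ Exx_inv / T         # sandwich formula with the 1/T
se_white = np.sqrt(np.diag(var_white))

# naive OLS formula sigma^2 (X'X)^{-1}, for comparison
se_ols = np.sqrt(np.diag(resid.var() * np.linalg.inv(x.T @ x)))
```

With variance rising in |x|, the White slope standard error exceeds the naive OLS one, which understates the uncertainty.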

Hansen-Hodrick errors

Hansen and Hodrick (1982) run forecasting regressions of (say) six month returns, using monthly data. We can write this situation in regression notation as

$$
y_{t+k} = \beta' x_t + \varepsilon_{t+k}, \quad t = 1, 2, \ldots, T.
$$

Fama and French (1988) also use regressions of overlapping long horizon returns on variables such as the dividend/price ratio and the term premium. Such regressions are an important part of the evidence for predictability in asset returns.

Under the null that one-period returns are unforecastable, we will still see correlation in the ε_t due to the overlapping data. Unforecastable returns imply

$$
E(\varepsilon_t \varepsilon_{t-j}) = 0 \text{ for } |j| \geq k,
$$

but not for |j| < k. Therefore, we can only rule out the terms in S at lags |j| ≥ k; the terms with |j| < k remain. Since we might as well correct for potential heteroskedasticity while we're at it, the standard errors are

$$
\mathrm{var}(b_T) = \frac{1}{T}\, E(x_t x_t')^{-1} \left[\sum_{j=-k+1}^{k-1} E(\varepsilon_t x_t\, x_{t-j}' \varepsilon_{t-j})\right] E(x_t x_t')^{-1}.
$$
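A sketch of the implied estimator, assuming numpy; the function name `hansen_hodrick_var` and the simulated overlapping-return data are my own constructions. It sums the autocovariance terms only for |j| < k, with no downweighting.

```python
import numpy as np

def hansen_hodrick_var(x, e, k):
    """Sandwich variance with the truncated (Hansen-Hodrick) kernel:
    include E(e_t x_t x_{t-j}' e_{t-j}) only for |j| < k."""
    T = x.shape[0]
    h = x * e[:, None]               # moment contributions x_t e_t
    S = h.T @ h / T                  # j = 0 term
    for j in range(1, k):
        G = h[j:].T @ h[:-j] / T     # lag-j term
        S += G + G.T                 # and its transpose for lag -j
    Exx_inv = np.linalg.inv(x.T @ x / T)
    return Exx_inv @ S @ Exx_inv / T # sandwich formula, with the 1/T

# overlapping 3-period sums of serially uncorrelated one-period returns
rng = np.random.default_rng(2)
T, k = 400, 3
r = rng.normal(size=T + k)
y = np.array([r[t + 1:t + 1 + k].sum() for t in range(T)])  # k-period return
x = np.column_stack([np.ones(T), rng.normal(size=T)])       # irrelevant predictor
beta_hat = np.linalg.solve(x.T @ x, x.T @ y)
V = hansen_hodrick_var(x, y - x @ beta_hat, k)
se = np.sqrt(np.diag(V))
```

With k = 1 the sum collapses to the j = 0 term, so the estimator reduces to the White formula of the previous subsection.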

11.5 Prespecified weighting matrices and moment conditions

Prespecified rather than "optimal" weighting matrices can emphasize economically interesting results, they can avoid the trap of blowing up standard errors rather than improving pricing errors, and they can lead to estimates that are more robust to small model misspecifications. This is analogous to the fact that OLS is often preferable to GLS in a regression context. The GMM formulas for a fixed weighting matrix W are

$$
\mathrm{var}(\hat b) = \frac{1}{T}\,(d'Wd)^{-1} d'WSWd\,(d'Wd)^{-1},
$$

$$
\mathrm{var}(g_T) = \frac{1}{T}\left(I - d(d'Wd)^{-1}d'W\right) S \left(I - Wd(d'Wd)^{-1}d'\right).
$$


In the basic approach outlined in Chapter 10, our final estimates were based on the "efficient" S⁻¹ weighting matrix. This objective maximizes the asymptotic statistical information in the sample about a model, given the choice of moments g_T. However, you may want to use a prespecified weighting matrix W ≠ S⁻¹ instead, or at least as a diagnostic accompanying more formal statistical tests. A prespecified weighting matrix lets you, rather than the S matrix, specify which moments or linear combinations of moments GMM will value in the minimization min_{b} g_T(b)'W g_T(b). A higher value of W_ii forces GMM to pay more attention to getting the ith moment right in the parameter estimation. For example, you might feel that some assets suffer from measurement error, or are small and illiquid and hence should be deemphasized, or you may want to keep GMM from looking at portfolios with strong long and short positions. I give some additional motivations below.
You can also go one step further and impose which linear combinations a_T of moment conditions will be set to zero in estimation, rather than use the choice resulting from a minimization, a_T = d'S⁻¹ or a_T = d'W. The fixed-W estimate still trades off the accuracy of individual moments according to the sensitivity of each moment with respect to the parameter. For example, if g_T = [g_T^1 \; g_T^2]' and W = I, but ∂g_T/∂b = [1 \; 10]', so that the second moment is 10 times more sensitive to the parameter value than the first moment, then GMM with a fixed weighting matrix sets

$$
1 \times g_T^1 + 10 \times g_T^2 = 0.
$$

The second moment condition will be 10 times closer to zero than the first. If you really want GMM to pay equal attention to the two moments, then you can fix the a_T matrix directly, for example a_T = [1 \; 1] or a_T = [1 \; {-1}].
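The example can be traced through numerically. In this sketch (numpy, with invented numbers) the moments are linear in a single parameter b, g_T(b) = g_0 + d·b:

```python
import numpy as np

# two linear moments in one parameter b, with sensitivities d = [1, 10]'
g0 = np.array([2.0, 1.0])
d = np.array([1.0, 10.0])

# W = I first-order condition d' gT(b) = 0, i.e. 1*g1 + 10*g2 = 0
b_W = -(d @ g0) / (d @ d)
g_W = g0 + d * b_W

# prespecified a_T = [1, 1]: set g1 + g2 = 0, equal attention to both moments
a = np.array([1.0, 1.0])
b_a = -(a @ g0) / (a @ d)
g_a = g0 + d * b_a
```

With W = I the second moment ends up exactly 10 times closer to zero than the first; with a_T = [1 1] the two moments come out equal and opposite.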
Using a prespecified weighting matrix or a prespecified set of moments is not the same thing as ignoring correlation of the errors u_t in the distribution theory. The S matrix will still show up in all the standard errors and test statistics.

11.5.1 How to use prespecified weighting matrices

Once you have decided to use a prespecified weighting matrix W or a prespecified set of moments a_T g_T(b) = 0, the general distribution theory outlined in section 11.1 quickly gives standard errors of the estimates and moments, and therefore a χ² statistic that can be used to test whether all the moments are jointly zero. Section 11.1 gives the formulas for the case that a_T is prespecified. If we use weighting matrix W, the first order conditions to min_{b} g_T(b)'W g_T(b) are

$$
\frac{\partial g_T(b)'}{\partial b}\, W g_T(b) = d'W g_T(b) = 0,
$$

so we map into the general case with a_T = d'W. Plugging this value into (11.146), the variance-covariance matrix of the estimated coefficients is

$$
\mathrm{var}(\hat b) = \frac{1}{T}\,(d'Wd)^{-1} d'WSWd\,(d'Wd)^{-1}. \tag{11.158}
$$

(You can check that this formula reduces to \frac{1}{T}(d'S^{-1}d)^{-1} with W = S^{-1}.)
Plugging a_T = d'W into equation (11.147), we find the variance-covariance matrix of the moments g_T,

$$
\mathrm{var}(g_T) = \frac{1}{T}\left(I - d(d'Wd)^{-1}d'W\right) S \left(I - Wd(d'Wd)^{-1}d'\right). \tag{11.159}
$$

As in the general formula, the terms to the left and right of S account for the fact that some linear combinations of moments are set to zero in each sample.

Equation (11.159) can be the basis of χ² tests for the overidentifying restrictions. If we interpret ()⁻¹ to be a generalized inverse, then

$$
g_T'\, \mathrm{var}(g_T)^{-1}\, g_T \sim \chi^2(\#\text{moments} - \#\text{parameters}).
$$

As in the general case, you have to pseudo-invert the singular var(g_T), for example by inverting only the non-zero eigenvalues.
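The formulas (11.158) and (11.159) are easy to implement directly. This sketch (numpy; d, S, and the dimensions are made up for illustration) also verifies the special case W = S⁻¹ noted above.

```python
import numpy as np

def fixed_W_inference(d, S, W, T):
    """Sandwich formulas (11.158)-(11.159) for GMM with prespecified W;
    d = dgT/db (n_moments x n_params), S = long-run covariance of moments."""
    A = np.linalg.inv(d.T @ W @ d)
    var_b = A @ d.T @ W @ S @ W @ d @ A / T      # eq. (11.158)
    P = np.eye(S.shape[0]) - d @ A @ d.T @ W     # I - d(d'Wd)^{-1} d'W
    var_g = P @ S @ P.T / T                      # eq. (11.159)
    return var_b, var_g

# sanity check on made-up inputs: with W = S^{-1} the coefficient variance
# collapses to the efficient-GMM formula (1/T)(d' S^{-1} d)^{-1}
rng = np.random.default_rng(3)
n_m, n_p, T = 5, 2, 1000
d = rng.normal(size=(n_m, n_p))
Q = rng.normal(size=(n_m, n_m))
S = Q @ Q.T + n_m * np.eye(n_m)                  # a positive definite S
var_b, var_g = fixed_W_inference(d, S, np.linalg.inv(S), T)
```

The χ² statistic then uses a pseudo-inverse of var(g_T), e.g. `np.linalg.pinv(var_g)`, since var(g_T) has rank #moments − #parameters.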
The major danger in using prespecified weighting matrices or moments a_T is that the choice of moments, units, and (of course) the prespecified a_T or W must be made carefully. For example, if you multiply the second moment by 10 times its original value, the S matrix will undo this transformation and weight the moments in their original proportions. The identity weighting matrix will not undo such transformations, so the units should be picked right.

11.5.2 Motivations for prespecified weighting matrices

Robustness, as with OLS vs. GLS.

When errors are autocorrelated or heteroskedastic, every econometrics textbook shows you how to "improve" on OLS by making appropriate GLS corrections. If you correctly model the error covariance matrix and if the regression is perfectly specified, the GLS procedure can improve efficiency, i.e. give estimates with lower asymptotic standard errors. However, GLS is less robust. If you model the error covariance matrix incorrectly, the GLS estimates can be much worse than OLS. Also, the GLS transformations can zero in on slightly misspecified areas of the model, producing garbage. GLS is "best," but OLS is "pretty darn good." One often has enough data that wringing every last ounce of statistical precision (low standard errors) from the data is less important than producing estimates that do not depend on questionable statistical assumptions, and that transparently focus on the interesting features of the data. In these cases, it is often a good idea to use OLS estimates. The OLS standard error formulas are wrong, though, so you must correct the standard errors of the OLS estimates for these features of the error covariance matrices, using the formulas we developed in section 11.4.

GMM works the same way. First-stage or otherwise fixed weighting matrix estimates may give up something in asymptotic efficiency, but they are still consistent, and they can be more robust to statistical and economic problems. You still want to use the S matrix in computing standard errors, though, just as you want to correct OLS standard errors, and the GMM formulas show you how to do this.

Even if in the end you want to produce "efficient" estimates and tests, it is a good idea to calculate standard errors and model fit tests for the first-stage estimates. Ideally, the parameter estimates should not change by much, and the second stage standard errors should be tighter. If the "efficient" parameter estimates do change a great deal, it is a good idea to diagnose why this is so. It must come down to the "efficient" estimation strongly weighting moments or linear combinations of moments that were not important in the first stage, and to the former linear combination of moments disagreeing strongly with the latter about which parameters fit well. Then, you can decide whether the difference in results is truly due to efficiency gain, or whether it signals a model misspecification.

Chapter 16 argues more at length for judicious use of "inefficient" methods such as OLS to guard against inevitable model misspecifications.
Near-singular S.

The spectral density matrix is often nearly singular, since asset returns are highly correlated with each other, and since we often include many assets relative to the number of data points. As a result, second stage GMM (and, as we will see below, maximum likelihood or any other efficient technique) tries to minimize differences and differences of differences of asset returns in order to extract statistically orthogonal components with lowest variance. One may feel that this feature leads GMM to place a lot of weight on poorly estimated, economically uninteresting, or otherwise non-robust aspects of the data. In particular, portfolios of the form 100R^1 − 99R^2 assume that investors can in fact purchase such heavily leveraged portfolios. Short-sale costs often rule out such portfolios or significantly alter their returns, so one may not want to emphasize pricing them correctly in the estimation and evaluation.

For example, suppose that S is given by

$$
S = \begin{bmatrix} 1 & \rho \\ \rho & 1 \end{bmatrix}, \qquad
S^{-1} = \frac{1}{1-\rho^2}\begin{bmatrix} 1 & -\rho \\ -\rho & 1 \end{bmatrix}.
$$

We can factor S⁻¹ into a "square root" by the Choleski decomposition. This produces a triangular matrix C such that C'C = S⁻¹. You can check that the matrix

$$
C = \begin{bmatrix} \dfrac{1}{\sqrt{1-\rho^2}} & \dfrac{-\rho}{\sqrt{1-\rho^2}} \\ 0 & 1 \end{bmatrix} \tag{11.160}
$$

works. Then, the GMM criterion

$$
\min_b\; g_T' S^{-1} g_T
$$

is equivalent to

$$
\min_b\; (g_T' C')(C g_T).
$$

C g_T gives the linear combination of moments that efficient GMM is trying to minimize. Looking at (11.160), as ρ → 1, the (2,2) element stays at 1, but the (1,1) and (1,2) elements get very large and of opposite signs. For example, if ρ = 0.95, then

$$
C = \begin{bmatrix} 3.20 & -3.04 \\ 0 & 1 \end{bmatrix}.
$$

In this example, GMM pays a little attention to the second moment, but places three times as much weight on the difference between the first and second moments. Larger matrices produce even more extreme weights. At a minimum, it is a good idea to look at S⁻¹ and its Choleski decomposition to see what moments GMM is prizing.
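You can reproduce the ρ = 0.95 calculation with a few lines. Note that numpy's `cholesky` returns a lower-triangular factor L with LL' = S⁻¹, so its transpose is the upper-triangular C with C'C = S⁻¹:

```python
import numpy as np

rho = 0.95
S = np.array([[1.0, rho], [rho, 1.0]])
S_inv = np.linalg.inv(S)

# upper-triangular "square root" of S^{-1}: C'C = S^{-1}
C = np.linalg.cholesky(S_inv).T

# the quadratic form gT' S^{-1} gT equals (C gT)'(C gT), so the rows of C
# are the moment combinations efficient GMM actually minimizes
g = np.array([0.3, 0.2])
q1 = g @ S_inv @ g
q2 = (C @ g) @ (C @ g)
```

Here the first row of C is roughly 3.2 g_T¹ − 3.0 g_T², the heavily weighted difference discussed in the text.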
The same point has a classic interpretation, and is a well-known danger with classic regression-based tests. Efficient GMM wants to focus on well-measured moments. In asset pricing applications, the errors are typically close to uncorrelated over time, so GMM is looking for portfolios with small values of var(m_{t+1}R^e_{t+1}). Roughly speaking, those will be assets with small return variance. Thus, GMM will pay most attention to correctly pricing the sample minimum-variance portfolio, and GMM's evaluation of the model by the J_T test will focus on its ability to price this portfolio.
Now, consider what happens in a sample, as illustrated in Figure 24. The sample mean-variance frontier is typically a good deal wider than the true, or ex-ante, mean-variance frontier. In particular, the sample minimum-variance portfolio may have little to do with the true minimum-variance portfolio. Like any portfolio on the sample frontier, its composition largely reflects luck; that's why we have asset pricing models in the first place rather than just price assets with portfolios on the sample frontier. The sample minimum-variance return is also likely to be composed of strong long-short positions.

In sum, you may want to force GMM not to pay quite so much attention to correctly pricing the sample minimum-variance portfolio, and you may want to give less importance to a statistical measure of model evaluation that almost entirely prizes GMM's ability to price that portfolio.
Economically interesting moments.


Figure 24. True or ex-ante and sample or ex-post mean-variance frontier. The sample often shows a spurious minimum-variance portfolio. (Figure labels: E(R) axis; sample, ex-post frontier; true, ex-ante frontier; sample minimum-variance portfolio.)


The optimal weighting matrix makes GMM pay close attention to linear combinations of moments with small sampling error in both estimation and evaluation. One may want to force the estimation and evaluation to pay attention to economically interesting moments instead. The initial portfolios are usually formed on an economically interesting characteristic such as size, beta, book/market, or industry. One typically wants in the end to see how well the model prices these initial portfolios, not how well the model prices potentially strange portfolios of those portfolios. If a model fails, one may want to characterize that failure as "the model doesn't price small stocks," not "the model doesn't price a portfolio of 900 × small firm returns − 600 × large firm returns − 299 × medium firm returns."
Level playing field.

The S matrix changes as the model and its parameters change. (See the definition, (10.138) or (11.145).) As the S matrix changes, which assets the GMM estimate tries hard to price well changes as well. For example, the S matrix from one model may value pricing the T-bill well, while that of another model may value pricing a stock excess return well. Comparing the results of such estimations is like comparing apples and oranges. By fixing the weighting matrix, you can force GMM to pay attention to the various assets in the same proportion while you vary the model.

The fact that S matrices change with the model leads to another subtle trap. One model may "improve" a J_T = g_T'S^{-1}g_T statistic because it blows up the estimates of S, rather than making any progress on lowering the pricing errors g_T. No one would formally use a comparison of J_T tests across models to compare them, of course. But it has proved nearly irresistible for authors to claim success for a new model over previous ones by noting improved J_T statistics, despite different weighting matrices, different moments, and sometimes much larger pricing errors. For example, if you take a model m_t and create a new model by simply adding noise unrelated to asset returns (in sample), m_t' = m_t + ε_t, then the moment condition g_T = E_T(m_t'R^e_t) = E_T((m_t + ε_t)R^e_t) is unchanged. However, the spectral density matrix

$$
S = E\left[(m_t + \varepsilon_t)^2\, R^e_t R^{e\prime}_t\right]
$$

can rise dramatically. This can reduce the J_T, leading to a false sense of "improvement."
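A small simulation can illustrate the trap. All numbers here are invented: the noise is orthogonalized to returns in sample, so the pricing errors g_T are literally unchanged, while S blows up and the criterion falls.

```python
import numpy as np

rng = np.random.default_rng(5)
T, n = 1000, 3
Re = 0.005 + 0.05 * rng.normal(size=(T, n))            # excess test returns
m = 1.0 - 2.0 * (Re[:, 0] - Re[:, 0].mean()) \
        + 0.05 * rng.normal(size=T)                    # candidate SDF

# noise made exactly orthogonal to returns in sample, then scaled up
eps = rng.normal(size=T)
eps -= Re @ np.linalg.lstsq(Re, eps, rcond=None)[0]
m2 = m + 5.0 * eps

def J_stat(m, Re):
    u = m[:, None] * Re               # moments u_t = m_t R^e_t
    g = u.mean(axis=0)                # pricing errors gT
    S = u.T @ u / T                   # spectral density estimate, no lags
    return g @ np.linalg.solve(S, g)  # gT' S^{-1} gT

J_original, J_noisy = J_stat(m, Re), J_stat(m2, Re)
```

With identical pricing errors, the noisy model produces a much smaller J: the false sense of "improvement."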
Conversely, if the sample contains a nearly riskfree portfolio of the test assets, or a portfolio with apparently small variance of m_{t+1}R^e_{t+1}, then the J_T test essentially evaluates the model by how well it can price this one portfolio. This can lead to a false rejection; even a very small g_T will produce a large g_T'S^{-1}g_T if there is an eigenvalue of S that is (spuriously) too small.
If you use a common weighting matrix W for all models, and evaluate the models by g_T'Wg_T, then you can avoid this trap. Beware that the individual χ² statistics are based on g_T'var(g_T)⁻¹g_T, and var(g_T) contains S, even with a prespecified weighting matrix W.

You should look at the pricing errors, or at some statistic such as the sum of absolute or squared pricing errors, to see if they are bigger or smaller, leaving the distribution aside. The question "are the pricing errors small?" is as interesting as the question "if we drew artificial data over and over again from a null statistical model, how often would we estimate a ratio of pricing errors to their estimated variance g_T'S^{-1}g_T this big or larger?"

11.5.3 Some prespecified weighting matrices

Two examples of economically interesting weighting matrices are the second-moment matrix of returns, advocated by Hansen and Jagannathan (1997), and the simple identity matrix, which is used implicitly in much empirical asset pricing.

Second moment matrix.

Hansen and Jagannathan (1997) advocate the use of the second moment matrix of payoffs W = E(xx')⁻¹ in place of S. They motivate this weighting matrix as an interesting distance measure between a model for m, say y, and the space of true m's. Precisely, the minimum distance (second moment) between a candidate discount factor y and the space of true discount factors is the same as the minimum value of the GMM criterion with W = E(xx')⁻¹ as weighting matrix.


Figure 25. Distance between y and nearest m = distance between proj(y|X) and x*. (Figure labels: m; proj(y|X); nearest m.)

To see why this is true, refer to Figure 25. The distance between y and the nearest valid m is the same as the distance between proj(y | X) and x*. As usual, consider the case that X is generated from a vector of payoffs x with price p. From the OLS formula,

$$
\mathrm{proj}(y \mid X) = E(yx')\, E(xx')^{-1} x.
$$

x* is the portfolio of x that prices x by construction,

$$
x^* = p'\, E(xx')^{-1} x.
$$

Then, the distance between y and the nearest valid m is

$$
\|y - \text{nearest } m\| = \|\mathrm{proj}(y|X) - x^*\|
$$
$$
= \left\| E(yx')E(xx')^{-1}x - p'E(xx')^{-1}x \right\|
$$
$$
= \left\| \big(E(yx') - p'\big)\, E(xx')^{-1} x \right\|
$$
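As a numerical check of this identity, the sketch below (numpy; the payoffs and the candidate y are invented) computes the HJ distance two ways: as the square root of the GMM criterion with W = E(xx')⁻¹, and as the second moment of the projection-space residual (E(yx') − p')E(xx')⁻¹x.

```python
import numpy as np

rng = np.random.default_rng(4)
T, n = 5000, 3
x = 1.0 + 0.10 * rng.normal(size=(T, n))    # payoffs (gross returns)
p = np.ones(n)                              # their prices
y = 1.0 - 0.5 * (x[:, 0] - 1.0)             # candidate discount factor

g = (y[:, None] * x).mean(axis=0) - p       # pricing errors E_T(yx) - p
W = np.linalg.inv(x.T @ x / T)              # second-moment weighting matrix
hj_dist = np.sqrt(g @ W @ g)                # Hansen-Jagannathan distance

# the same number, computed as the norm of the projection residual
h = x @ (W @ g)                             # (E_T(yx') - p') E_T(xx')^{-1} x_t
alt = np.sqrt((h ** 2).mean())
```

In sample the two computations agree exactly, since E_T(h²) = g'W E_T(xx') W g = g'Wg.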

