CHAPTER 10 GMM IN EXPLICIT DISCOUNT FACTOR MODELS

Definitions:
$$u_{t+1}(b) \equiv m_{t+1}(b)x_{t+1} - p_t$$
$$g_T(b) \equiv E_T[u_t(b)]$$
$$S \equiv \sum_{j=-\infty}^{\infty} E\left[u_t(b)u_{t-j}(b)'\right]$$

GMM estimate:
$$\hat{b}_2 = \arg\min_b \; g_T(b)'S^{-1}g_T(b).$$

Standard errors:
$$\mathrm{var}(\hat{b}_2) = \frac{1}{T}(d'S^{-1}d)^{-1}; \quad d \equiv \frac{\partial g_T(b)}{\partial b}$$

Test of the model ("overidentifying restrictions"):
$$T J_T = T \min\left[g_T(b)'S^{-1}g_T(b)\right] \sim \chi^2(\#\text{moments} - \#\text{parameters}).$$

It's easiest to start our discussion of GMM in the context of an explicit discount factor model, such as the consumption-based model. I treat the special structure of linear factor models later. I start with the basic classic recipe as given by Hansen and Singleton (1982).

Discount factor models involve some unknown parameters as well as data, so I write $m_{t+1}(b)$ when it's important to remind ourselves of this dependence. For example, if $m_{t+1} = \beta(c_{t+1}/c_t)^{-\gamma}$, then $b \equiv [\beta\ \gamma]'$. I write $\hat{b}$ to denote an estimate when it is important to distinguish estimated from other values.

Any asset pricing model implies
$$E(p_t) = E\left[m_{t+1}(b)x_{t+1}\right]. \tag{136}$$
It's easiest to write this equation in the form $E(\cdot) = 0$,
$$E\left[m_{t+1}(b)x_{t+1} - p_t\right] = 0. \tag{137}$$
$x$ and $p$ are typically vectors; we typically check whether a model for $m$ can price a number of assets simultaneously. Equations (10.137) are often called the moment conditions.

It's convenient to define the errors $u_t(b)$ as the object whose mean should be zero,
$$u_{t+1}(b) = m_{t+1}(b)x_{t+1} - p_t.$$
Given values for the parameters $b$, we could construct a time series on $u_t$ and look at its mean.

Define $g_T(b)$ as the sample mean of the $u_t$ errors, when the parameter vector is $b$ in a sample of size $T$:
$$g_T(b) \equiv \frac{1}{T}\sum_{t=1}^{T} u_t(b) = E_T[u_t(b)] = E_T\left[m_{t+1}(b)x_{t+1} - p_t\right].$$
The second equality introduces the handy notation $E_T$ for sample means,
$$E_T(\cdot) = \frac{1}{T}\sum_{t=1}^{T}(\cdot).$$

(It might make more sense to denote these estimates $\hat{E}$ and $\hat{g}$. However, Hansen's $T$ subscript notation is so widespread that doing so would cause more confusion than it solves.)


SECTION 10.1 THE RECIPE

The first stage estimate of $b$ minimizes a quadratic form of the sample mean of the errors,
$$\hat{b}_1 = \arg\min_{\{b\}} \; g_T(b)'Wg_T(b)$$
for some arbitrary matrix $W$ (often, $W = I$). This estimate is consistent and asymptotically normal. You can and often should stop here, as I explain below.

Using $\hat{b}_1$, form an estimate $\hat{S}$ of
$$S \equiv \sum_{j=-\infty}^{\infty} E\left[u_t(b)u_{t-j}(b)'\right]. \tag{138}$$

(Below I discuss various interpretations of and ways to construct this estimate.) Form a second stage estimate $\hat{b}_2$ using the matrix $\hat{S}$ in the quadratic form,
$$\hat{b}_2 = \arg\min_{b} \; g_T(b)'\hat{S}^{-1}g_T(b).$$
$\hat{b}_2$ is a consistent, asymptotically normal, and asymptotically efficient estimate of the parameter vector $b$. "Efficient" means that it has the smallest variance-covariance matrix among all estimators that set different linear combinations of $g_T(b)$ to zero.

The variance-covariance matrix of $\hat{b}_2$ is
$$\mathrm{var}(\hat{b}_2) = \frac{1}{T}(d'S^{-1}d)^{-1}$$
where
$$d \equiv \frac{\partial g_T(b)}{\partial b}$$
or, more explicitly,
$$d = E_T\left[\frac{\partial}{\partial b}\left(m_{t+1}(b)x_{t+1} - p_t\right)\right]\bigg|_{b=\hat{b}}.$$
(More precisely, $d$ should be written as the object to which $\partial g_T/\partial b$ converges, and then $\partial g_T/\partial b$ is an estimate of that object used to form a consistent estimate of the asymptotic variance-covariance matrix.)

This variance-covariance matrix can be used to test whether a parameter or group of parameters are equal to zero, via
$$\frac{\hat{b}_i}{\sqrt{\mathrm{var}(\hat{b})_{ii}}} \sim \mathcal{N}(0, 1)$$
and
$$\hat{b}_j'\left[\mathrm{var}(\hat{b})_{jj}\right]^{-1}\hat{b}_j \sim \chi^2(\#\text{included }b\text{'s})$$
where $b_j$ = subvector, $\mathrm{var}(b)_{jj}$ = submatrix.

Finally, the test of overidentifying restrictions is a test of the overall fit of the model. It states that $T$ times the minimized value of the second-stage objective is distributed $\chi^2$ with degrees of freedom equal to the number of moments less the number of estimated parameters:
$$T J_T = T \min_{\{b\}}\left[g_T(b)'S^{-1}g_T(b)\right] \sim \chi^2(\#\text{moments} - \#\text{parameters}).$$
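The two-stage recipe above can be sketched numerically. The following is a minimal illustration, not the text's procedure applied to real data: it assumes a toy one-parameter SDF $m_{t+1}(b) = b$, two assets with price 1, synthetic returns, and serially uncorrelated $u_t$, so that the estimate of $S$ keeps only the $j = 0$ term.

```python
import numpy as np
from scipy.optimize import minimize_scalar
from scipy.stats import chi2

rng = np.random.default_rng(0)
T = 2000
b0 = 0.95
# Hypothetical data: gross returns on two assets with mean near 1/b0
R = 1.0 / b0 + 0.1 * rng.standard_normal((T, 2))

def u(b):
    """Errors u_t(b) = m_t(b) R_t - 1, with the toy SDF m_t(b) = b."""
    return b * R - 1.0

def gT(b):
    """Sample mean of the errors."""
    return u(b).mean(axis=0)

# First stage: W = I
b1 = minimize_scalar(lambda b: gT(b) @ gT(b),
                     bounds=(0.5, 1.5), method="bounded").x

# Estimate S (only the j = 0 term, assuming no serial correlation)
U = u(b1)
S = (U.T @ U) / T
Sinv = np.linalg.inv(S)

# Second stage: weight by S^{-1}
obj2 = lambda b: gT(b) @ Sinv @ gT(b)
b2 = minimize_scalar(obj2, bounds=(0.5, 1.5), method="bounded").x

# Standard errors: d = dgT/db, var(b2) = (1/T)(d' S^{-1} d)^{-1}
d = R.mean(axis=0)
se_b2 = np.sqrt(1.0 / (T * d @ Sinv @ d))

# JT test: 2 moments - 1 parameter = 1 degree of freedom
TJ = T * obj2(b2)
p_value = 1 - chi2.cdf(TJ, df=2 - 1)
print(b2, se_b2, p_value)
```

Since the toy model is true in this simulation, the $J_T$ test should fail to reject in most samples.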

10.2 Interpreting the GMM procedure

$g_T(b)$ is a pricing error. It is proportional to $\alpha$.

GMM picks parameters to minimize a weighted sum of squared pricing errors.

The second stage picks the linear combination of pricing errors that are best measured, by having smallest sampling variation. First and second stage are like OLS and GLS regressions.

The standard error formula is a simple application of the delta method.

The $J_T$ test evaluates the model by looking at the sum of squared pricing errors.

Pricing errors

The moment conditions are
$$g_T(b) = E_T\left[m_{t+1}(b)x_{t+1}\right] - E_T[p_t].$$
Thus, each moment is the difference between actual ($E_T(p)$) and predicted ($E_T(mx)$) price, or pricing error. What could be more natural than to pick parameters so that the model's predicted prices are as close as possible to the actual prices, and then to evaluate the model by how large these pricing errors are?

In the language of expected returns, the moments $g_T(b)$ are proportional to the difference between actual and predicted returns: Jensen's alphas, or the vertical distance between the points and the line in Figure 5. To see this fact, recall that $0 = E(mR^e)$ can be translated to a predicted expected return,
$$E(R^e) = -\frac{\mathrm{cov}(m, R^e)}{E(m)}.$$
Therefore, we can write the pricing error as
$$g(b) = E(mR^e) = E(m)\left(E(R^e) - \left(-\frac{\mathrm{cov}(m, R^e)}{E(m)}\right)\right)$$
$$g(b) = \frac{1}{R^f}\left(\text{actual mean return} - \text{predicted mean return}\right).$$
If we express the model in expected return-beta language,
$$E(R^{ei}) = \alpha_i + \beta_i'\lambda,$$
then the GMM objective is proportional to the Jensen's alpha measure of mispricing,
$$g(b) = \frac{1}{R^f}\alpha_i.$$

First-stage estimates

If we could, we'd pick $b$ to make every element of $g_T(b) = 0$, i.e., to have the model price assets perfectly in sample. However, there are usually more moment conditions (returns times instruments) than there are parameters. There should be, because theories with as many free parameters as facts (moments) are vacuous. Thus, we choose $b$ to make $g_T(b)$ as small as possible, by minimizing a quadratic form,
$$\min_{\{b\}} \; g_T(b)'Wg_T(b). \tag{139}$$
$W$ is a weighting matrix that tells us how much attention to pay to each moment, or how to trade off doing well in pricing one asset or linear combination of assets vs. doing well in pricing another. In the common case $W = I$, GMM treats all assets symmetrically, and the objective is to minimize the sum of squared pricing errors.

The sample pricing error $g_T(b)$ may be a nonlinear function of $b$. Thus, you may have to use a numerical search to find the value of $b$ that minimizes the objective in (10.139). However, since the objective is locally quadratic, the search is usually straightforward.

Second-stage estimates: Why $S^{-1}$?

What weighting matrix should you use? The weighting matrix directs GMM to emphasize some moments or linear combinations of moments at the expense of others. You might start with $W = I$, i.e., try to price all assets equally well. A $W$ that is not the identity matrix can be used to offset differences in units between the moments. You might also start with different elements on the diagonal of $W$ if you think some assets are more interesting, more informative, or better measured than others.

The second-stage estimate picks a weighting matrix based on statistical considerations.


Some asset returns may have much more variance than other assets. For those assets, the sample mean $g_T = E_T(m_tR_t - 1)$ will be a much less accurate measurement of the population mean $E(mR - 1)$, since the sample mean will vary more from sample to sample. Hence, it seems like a good idea to pay less attention to pricing errors from assets with high variance of $m_tR_t - 1$. One could implement this idea by using a $W$ matrix composed of inverse variances of $E_T(m_tR_t - 1)$ on the diagonal. More generally, since asset returns are correlated, one might think of using the inverse of the covariance matrix of $E_T(m_tR_t - 1)$. This weighting matrix pays most attention to linear combinations of moments about which the data set at hand has the most information. This idea is exactly the same as the heteroskedasticity and cross-correlation corrections that lead you from OLS to GLS in linear regressions.

The covariance matrix of $g_T = E_T(u_{t+1})$ is the variance of a sample mean. Exploiting the assumption that $E(u_t) = 0$, and that $u_t$ is stationary so $E(u_1u_2') = E(u_tu_{t+1}')$ depends only on the time interval between the two $u$'s, we have
$$\mathrm{var}(g_T) = \mathrm{var}\left(\frac{1}{T}\sum_{t=1}^{T}u_{t+1}\right)$$
$$= \frac{1}{T^2}\left[T\,E(u_tu_t') + (T-1)\left(E(u_tu_{t-1}') + E(u_tu_{t+1}')\right) + \ldots\right].$$
As $T \to \infty$, $(T-j)/T \to 1$, so
$$\mathrm{var}(g_T) \to \frac{1}{T}\sum_{j=-\infty}^{\infty} E(u_tu_{t-j}') = \frac{1}{T}S.$$
The last equality denotes $S$, known for other reasons as the spectral density matrix at frequency zero of $u_t$. (Precisely, $S$ so defined is the variance-covariance matrix of the $g_T$ for fixed $b$. The actual variance-covariance matrix of $g_T$ must take into account the fact that we chose $b$ to set a linear combination of the $g_T$ to zero in each sample. I give that formula below. The point here is heuristic.)

This fact suggests that a good weighting matrix might be the inverse of $S$. In fact, Hansen (1982) shows formally that the choice
$$W = S^{-1}, \quad S \equiv \sum_{j=-\infty}^{\infty} E(u_tu_{t-j}')$$
is the statistically optimal weighting matrix, meaning that it produces estimates with lowest asymptotic variance.
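In practice the infinite sum must be truncated. One common hedge, not prescribed in the text, is the Newey-West estimator: truncate at a lag $L$ and apply Bartlett weights so the estimate stays positive semidefinite. A sketch, where the lag length and the MA(1) example data are illustrative assumptions:

```python
import numpy as np

def newey_west_S(u, L):
    """Estimate S = sum_j E(u_t u_{t-j}') with Bartlett weights, lags |j| <= L.

    u : (T, k) array of moment errors u_t(b), mean zero in population.
    """
    T, k = u.shape
    u = u - u.mean(axis=0)          # center in sample
    S = (u.T @ u) / T               # j = 0 term
    for j in range(1, L + 1):
        w = 1.0 - j / (L + 1)       # Bartlett weight: keeps S positive semidefinite
        Gamma = (u[j:].T @ u[:-j]) / T
        S += w * (Gamma + Gamma.T)  # add the +j and -j terms together
    return S

# Example: MA(1) errors, whose long-run variance exceeds their variance
rng = np.random.default_rng(1)
e = rng.standard_normal((5000, 2))
u = e[1:] + 0.5 * e[:-1]
S = newey_west_S(u, L=5)
```

The Bartlett weights shrink the higher-order autocovariance terms, trading a little bias for a guarantee that the estimated $S$ can be inverted in the second-stage objective.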

You may be more used to the formula $\sigma(u)/\sqrt{T}$ for the standard deviation of a sample mean. This formula is a special case that holds when the $u_t$'s are uncorrelated over time. If $E(u_tu_{t-j}') = 0$, $j \neq 0$, then the previous equation reduces to
$$\mathrm{var}\left(\frac{1}{T}\sum_{t=1}^{T}u_{t+1}\right) = \frac{1}{T}E(uu') = \frac{\mathrm{var}(u)}{T}.$$
This is probably the first statistical formula you ever saw, the variance of the sample mean. In GMM, it is the last statistical formula you'll ever see as well. GMM amounts to just generalizing the simple ideas behind the distribution of the sample mean to parameter estimation and general statistical contexts.

The first and second stage estimates should remind you of standard linear regression models. You start with an OLS regression. If the errors are not i.i.d., the OLS estimates are consistent, but not efficient. If you want efficient estimates, you can use the OLS estimates to obtain a series of residuals, estimate a variance-covariance matrix of residuals, and then do GLS. GLS is also consistent and more efficient, meaning that the sampling variation in the estimated parameters is lower.

Standard errors

The formula for the standard error of the estimate,
$$\mathrm{var}(\hat{b}_2) = \frac{1}{T}(d'S^{-1}d)^{-1} \tag{140}$$
can be understood most simply as an instance of the "delta method": the asymptotic variance of $f(x)$ is $f'(x)^2\mathrm{var}(x)$. Suppose there is only one parameter and one moment. $S/T$ is the variance matrix of the moment $g_T$, and $d^{-1}$ is $[\partial g_T/\partial b]^{-1} = \partial b/\partial g_T$. Then the delta method formula gives
$$\mathrm{var}(\hat{b}_2) = \frac{1}{T}\frac{\partial b}{\partial g_T}\mathrm{var}(g_T)\frac{\partial b}{\partial g_T}.$$
The actual formula (10.140) just generalizes this idea to vectors.

10.2.1 $J_T$ Test

Once you've estimated the parameters that make a model "fit best," the natural question is, how well does it fit? It's natural to look at the pricing errors and see if they are "big." The $J_T$ test asks whether they are "big" by statistical standards: if the model is true, how often should we see a (weighted) sum of squared pricing errors this big? If not often, the model is "rejected." The test is
$$T J_T = T\left[g_T(\hat{b})'S^{-1}g_T(\hat{b})\right] \sim \chi^2(\#\text{moments} - \#\text{parameters}).$$
Since $S$ is the variance-covariance matrix of $g_T$, this statistic is the minimized pricing errors divided by their variance-covariance matrix. Sample means converge to a normal distribution, so sample means squared divided by variance converges to the square of a normal, or $\chi^2$.

The reduction in degrees of freedom corrects for the fact that $S$ is really the covariance matrix of $g_T$ for fixed $b$. We set a linear combination of the $g_T$ to zero in each sample, so the actual covariance matrix of $g_T$ is singular, with rank #moments − #parameters. More details below.

10.3 Applying GMM

Notation.

Forecast errors and instruments.

Stationarity and choice of units.

Notation; instruments and returns

Most of the effort involved with GMM is simply mapping a given problem into the very general notation. The equation
$$E\left[m_{t+1}(b)x_{t+1} - p_t\right] = 0$$
can capture a lot. We often test asset pricing models using returns, in which case the moment conditions are
$$E\left[m_{t+1}(b)R_{t+1} - 1\right] = 0.$$

It is common to add instruments as well. Mechanically, you can multiply both sides of
$$1 = E_t\left[m_{t+1}(b)R_{t+1}\right]$$
by any variable $z_t$ observed at time $t$ before taking unconditional expectations, resulting in
$$E(z_t) = E\left[m_{t+1}(b)R_{t+1}z_t\right].$$
Expressing the result in $E(\cdot) = 0$ form,
$$0 = E\left\{\left[m_{t+1}(b)R_{t+1} - 1\right]z_t\right\}. \tag{141}$$

We can do this for a whole vector of returns and instruments, multiplying each return by each instrument. For example, if we start with two returns $R = [R^a\ R^b]'$ and one instrument $z$, equation (10.141) looks like
$$E\left\{\begin{bmatrix} m_{t+1}(b)R^a_{t+1} \\ m_{t+1}(b)R^b_{t+1} \\ m_{t+1}(b)R^a_{t+1}z_t \\ m_{t+1}(b)R^b_{t+1}z_t \end{bmatrix} - \begin{bmatrix} 1 \\ 1 \\ z_t \\ z_t \end{bmatrix}\right\} = \begin{bmatrix} 0 \\ 0 \\ 0 \\ 0 \end{bmatrix}.$$
Using the Kronecker product $\otimes$, meaning "multiply every element by every other element," we can denote the same relation compactly by
$$E\left\{\left[m_{t+1}(b)R_{t+1} - 1\right] \otimes z_t\right\} = 0, \tag{142}$$
or, emphasizing the managed-portfolio interpretation and $p = E(mx)$ notation,
$$E\left[m_{t+1}(b)(R_{t+1}\otimes z_t) - (1\otimes z_t)\right] = 0.$$
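The Kronecker-product construction can be sketched directly. The data series and the constant discount factor below are illustrative assumptions, timing subscripts are collapsed into a single index for brevity, and numpy's `kron` orders the products return-first:

```python
import numpy as np

rng = np.random.default_rng(2)
T = 6
R = 1.0 + 0.05 * rng.standard_normal((T, 2))  # returns [R^a, R^b]
z = rng.standard_normal(T)                    # one instrument, observed at t
m = 0.99 * np.ones(T)                         # placeholder discount factor series

# Instrument vector including a constant: z_t = [1, z_t]'
Z = np.column_stack([np.ones(T), z])

# Row t holds [m R - 1] scaled by each instrument;
# np.kron gives the ordering [u^a, u^a z, u^b, u^b z]
u = np.vstack([np.kron(m[t] * R[t] - 1.0, Z[t]) for t in range(T)])
```

Each column of `u` is the pricing error of one managed portfolio: two returns times two instruments gives four moment conditions.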

Forecast errors and instruments

The asset pricing model says that, although expected returns can vary across time and assets, expected discounted returns should always be the same, 1. The error $u_{t+1} = m_{t+1}R_{t+1} - 1$ is the ex-post discounted return; it represents a forecast error. Like any forecast error, $u_{t+1}$ should be conditionally and unconditionally mean zero.

In an econometric context, $z$ is an instrument because it is uncorrelated with the error $u_{t+1}$. $E(z_tu_{t+1})$ is the numerator of a regression coefficient of $u_{t+1}$ on $z_t$; thus adding instruments basically checks that the ex-post discounted return is unforecastable by linear regressions.

If an asset's return is higher than predicted when $z_t$ is unusually high, but not on average, scaling by $z_t$ will pick up this feature of the data. Then, the moment condition checks that the discount rate is unusually low at such times, or that the conditional covariance of the discount rate and asset return moves sufficiently to justify the high conditionally expected return. As I explained in Section 8.1, the addition of instruments is equivalent to adding the returns of managed portfolios to the analysis, and is in principle able to capture all of the model's predictions.

Stationarity and distributions

The GMM distribution theory does require some statistical assumptions. Hansen (1982) and Ogaki (1993) cover them in depth. The most important assumption is that $m$, $p$, and $x$ must be stationary random variables. ("Stationary" is often misused to mean constant, or i.i.d. The statistical definition of stationarity is that the joint distribution of $x_t, x_{t-j}$ depends only on $j$ and not on $t$.) Sample averages must converge to population means as the sample size grows, and stationarity implies this result.

Assuring stationarity usually amounts to a choice of sensible units. For example, though we could express the pricing of a stock as
$$p_t = E_t\left[m_{t+1}(d_{t+1} + p_{t+1})\right]$$
it would not be wise to do so. For stocks, $p$ and $d$ rise over time and so are typically not stationary; their unconditional means are not defined. It is better to divide by $p_t$ and express the model as
$$1 = E_t\left[m_{t+1}\frac{d_{t+1} + p_{t+1}}{p_t}\right] = E_t(m_{t+1}R_{t+1}).$$
The stock return is plausibly stationary.

Dividing by dividends is an alternative and I think underutilized way to achieve stationarity (at least for portfolios, since many individual stocks do not pay regular dividends):
$$\frac{p_t}{d_t} = E_t\left[m_{t+1}\left(1 + \frac{p_{t+1}}{d_{t+1}}\right)\frac{d_{t+1}}{d_t}\right].$$
Now we map $\left(1 + \frac{p_{t+1}}{d_{t+1}}\right)\frac{d_{t+1}}{d_t}$ into $x_{t+1}$ and $\frac{p_t}{d_t}$ into $p_t$. This formulation allows us to focus on prices rather than one-period returns.

Bonds are a claim to a dollar, so bond prices and yields do not grow over time. Hence, it might be all right to examine
$$p^b_t = E(m_{t+1}\,1)$$
with no transformations.

Stationarity is not always a clear-cut question in practice. As variables become "less stationary," as they experience longer swings in a sample, the asymptotic distribution can become a less reliable guide to a finite-sample distribution. For example, the level of nominal interest rates is surely a stationary variable in a fundamental sense: it was 6% in ancient Babylon, about 6% in 14th century Italy, and about 6% again today. Yet it takes very long swings away from this unconditional mean, moving slowly up or down for even 20 years at a time. Therefore, in an estimate and test that uses the level of interest rates, the asymptotic distribution theory might be a bad approximation to the correct finite-sample distribution theory. This is true even if the number of data points is large. 10,000 data points measured every minute are a "smaller" data set than 100 data points measured every year. In such a case, it is particularly important to develop a finite-sample distribution by simulation or bootstrap, which is easy to do given today's computing power.

It is also important to choose test assets in a way that is stationary. For example, individual stocks change character over time, increasing or decreasing size, exposure to risk factors, leverage, and even nature of the business. For this reason, it is common to sort stocks into portfolios based on characteristics such as betas, size, book/market ratios, industry and so forth. The statistical characteristics of the portfolio returns may be much more constant than the characteristics of individual securities, which float in and out of the various portfolios. (One can alternatively include the characteristics as instruments.)

Many econometric techniques require assumptions about distributions. As you can see, the variance formulas used in GMM do not include the usual assumptions that variables are i.i.d., normally distributed, homoskedastic, etc. You can put such assumptions in if you want to (we'll see how below, and adding such assumptions simplifies the formulas and can improve the small-sample performance when the assumptions are justified) but you don't have to add these assumptions.


Chapter 11. GMM: general formulas and applications

Lots of calculations beyond formal parameter estimation and overall model testing are useful in the process of evaluating a model and comparing it to other models. But you want to understand sampling variation in such calculations, and mapping the questions into the GMM framework allows you to do this easily. In addition, alternative estimation and evaluation procedures may be more intuitive or robust to model misspecification than the two (or multi) stage procedure described above.

In this chapter I lay out the general GMM framework, and I discuss five applications and variations on the basic GMM method. 1) I show how to derive standard errors of nonlinear functions of sample moments, such as correlation coefficients. 2) I apply GMM to OLS regressions, easily deriving standard error formulas that correct for autocorrelation and conditional heteroskedasticity. 3) I show how to use prespecified weighting matrices $W$ in asset pricing tests in order to overcome the tendency of efficient GMM to focus on spuriously low-variance portfolios. 4) As a good parable for prespecified linear combinations of moments $a$, I show how to mimic "calibration" and "evaluation" phases of real business cycle models. 5) I show how to use the distribution theory for the $g_T$ beyond just forming the $J_T$ test in order to evaluate the importance of individual pricing errors. The next chapter continues, and collects GMM variations useful for evaluating linear factor models and related mean-variance frontier questions.

Many of these calculations amount to creative choices of the $a_T$ matrix that selects which linear combination of moments are set to zero, and reading off the resulting formulas for the variance-covariance matrix of the estimated coefficients, equation (11.146), and the variance-covariance matrix of the moments $g_T$, equation (11.147).

11.1 General GMM formulas

The general GMM estimate:
$$a_Tg_T(\hat{b}) = 0$$

Distribution of $\hat{b}$:
$$T\,\mathrm{cov}(\hat{b}) = (ad)^{-1}aSa'(ad)^{-1\prime}$$

Distribution of $g_T(\hat{b})$:
$$T\,\mathrm{cov}\left[g_T(\hat{b})\right] = \left(I - d(ad)^{-1}a\right)S\left(I - d(ad)^{-1}a\right)'$$

The "optimal" estimate uses $a = d'S^{-1}$. In this case,
$$T\,\mathrm{cov}(\hat{b}) = (d'S^{-1}d)^{-1}$$
$$T\,\mathrm{cov}\left[g_T(\hat{b})\right] = S - d(d'S^{-1}d)^{-1}d'$$
and
$$T J_T = T\,g_T(\hat{b})'S^{-1}g_T(\hat{b}) \to \chi^2(\#\text{moments} - \#\text{parameters}).$$

An analogue to the likelihood ratio test,
$$T J_T(\text{restricted}) - T J_T(\text{unrestricted}) \sim \chi^2(\#\text{of restrictions})$$

GMM procedures can be used to implement a host of estimation and testing exercises. Just about anything you might want to estimate can be written as a special case of GMM. To do so, you just have to remember (or look up) a few very general formulas, and then map them into your case.

Express a model as
$$E[f(x_t, b)] = 0.$$
Everything is a vector: $f$ can represent a vector of $L$ sample moments, $x_t$ can be $M$ data series, $b$ can be $N$ parameters. $f(x_t, b)$ is a slightly more explicit statement of the errors $u_t(b)$ in the last chapter.

Definition of the GMM estimate.

We estimate parameters $\hat{b}$ to set some linear combination of sample means of $f$ to zero,
$$\hat{b}: \text{ set } a_Tg_T(\hat{b}) = 0 \tag{143}$$
where
$$g_T(b) \equiv \frac{1}{T}\sum_{t=1}^{T}f(x_t, b)$$
and $a_T$ is a matrix that defines which linear combination of $g_T(b)$ will be set to zero. This defines the GMM estimate.

If there are as many moments as parameters, you will set each moment to zero; when there are fewer parameters than moments, (11.143) just captures the natural idea that you will set some moments, or some linear combination of moments, to zero in order to estimate the parameters. The minimization of the last chapter is a special case. If you estimate $b$ by $\min_{\{b\}} g_T(b)'Wg_T(b)$, the first order conditions are
$$\frac{\partial g_T'}{\partial b}Wg_T(b) = 0,$$
which is of the form (11.143) with $a_T = \frac{\partial g_T'}{\partial b}W$. The general GMM procedure allows you to pick arbitrary linear combinations of the moments to set to zero in parameter estimation.

Standard errors of the estimate.

Hansen (1982), Theorem 3.1 tells us that the asymptotic distribution of the GMM estimate is
$$\sqrt{T}(\hat{b} - b) \to \mathcal{N}\left[0, (ad)^{-1}aSa'(ad)^{-1\prime}\right] \tag{144}$$
where
$$d \equiv E\left[\frac{\partial f}{\partial b'}(x_t, b)\right] = \frac{\partial g_T(b)}{\partial b'}$$
(i.e., $d$ is defined as the population moment in the first equality, which we estimate in sample by the second equality), where
$$a \equiv \mathrm{plim}\; a_T,$$
and where
$$S \equiv \sum_{j=-\infty}^{\infty} E\left[f(x_t, b)f(x_{t-j}, b)'\right]. \tag{145}$$
Don't forget the $\sqrt{T}$ in (11.144)! In practical terms, this means to use
$$\mathrm{var}(\hat{b}) = \frac{1}{T}(ad)^{-1}aSa'(ad)^{-1\prime} \tag{146}$$
as the covariance matrix for standard errors and tests. As in the last chapter, you can understand this formula as an application of the delta method.
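Formula (11.146) is mechanical to compute once $a$, $d$, and $S$ are in hand. A sketch with made-up $d$ and $S$ matrices (pure assumptions: three moments, two parameters), which also checks that the "optimal" choice $a = d'S^{-1}$ from the summary box reduces the sandwich to $(d'S^{-1}d)^{-1}/T$, and that another $a$ does no better:

```python
import numpy as np

def gmm_cov_b(a, d, S, T):
    """Equation (11.146): var(b-hat) = (1/T) (ad)^{-1} a S a' [(ad)^{-1}]'."""
    ad_inv = np.linalg.inv(a @ d)
    return ad_inv @ a @ S @ a.T @ ad_inv.T / T

# Made-up ingredients: 3 moments, 2 parameters
d = np.array([[1.0, 0.2], [0.3, 1.0], [0.5, 0.5]])
S = np.array([[1.0, 0.2, 0.1], [0.2, 1.5, 0.3], [0.1, 0.3, 2.0]])
T = 1000

a_opt = d.T @ np.linalg.inv(S)                 # the "optimal" a = d'S^{-1}
V_opt = gmm_cov_b(a_opt, d, S, T)
V_formula = np.linalg.inv(d.T @ np.linalg.inv(S) @ d) / T

a_arb = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])  # just use the first two moments
V_arb = gmm_cov_b(a_arb, d, S, T)              # V_arb - V_opt is pos. semidefinite
```

The last line illustrates the efficiency statement: the difference between the arbitrary-weights covariance and the optimal one is a positive semidefinite matrix.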

Distribution of the moments.

Hansen's Lemma 4.1 gives the sampling distribution of the moments $g_T(b)$:
$$\sqrt{T}\,g_T(\hat{b}) \to \mathcal{N}\left[0, \left(I - d(ad)^{-1}a\right)S\left(I - d(ad)^{-1}a\right)'\right]. \tag{147}$$
As we have seen, $S$ would be the asymptotic variance-covariance matrix of sample means, if we did not estimate any parameters, which sets some linear combinations of the $g_T$ to zero. The $I - d(ad)^{-1}a$ terms account for the fact that in each sample some linear combinations of $g_T$ are set to zero. Thus, this variance-covariance matrix is singular.


$\chi^2$ tests.

A sum of squared standard normals is distributed $\chi^2$. Therefore, it is natural to use the distribution theory for $g_T$ to see if the $g_T$ are jointly "too big." Equation (11.147) suggests that we form the statistic
$$T\,g_T(\hat{b})'\left[\left(I - d(ad)^{-1}a\right)S\left(I - d(ad)^{-1}a\right)'\right]^{-1}g_T(\hat{b}) \tag{148}$$
and that it should have a $\chi^2$ distribution. It does, but with a hitch: The variance-covariance matrix is singular, so you have to pseudo-invert it. For example, you can perform an eigenvalue decomposition $\Sigma = Q\Lambda Q'$ and then invert only the non-zero eigenvalues. Also, the $\chi^2$ distribution has degrees of freedom given by the number of non-zero linear combinations of $g_T$, the number of moments less the number of estimated parameters. You can similarly use (11.147) to construct tests of individual moments ("are the small stocks mispriced?") or groups of moments.
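The pseudo-inversion step can be sketched directly. Here a made-up singular covariance matrix (rank 1 in a two-moment problem) stands in for the bracketed matrix in (11.148):

```python
import numpy as np

def chi2_stat_pinv(gT, cov_gT, T, tol=1e-10):
    """Statistic T * gT' pinv(cov_gT) gT when cov_gT is singular.

    cov_gT is the asymptotic covariance of sqrt(T) gT. Inverts only
    eigenvalues above tol; returns the statistic and its degrees of
    freedom (the rank of cov_gT).
    """
    lam, Q = np.linalg.eigh(cov_gT)          # cov_gT = Q diag(lam) Q'
    keep = lam > tol
    pinv = (Q[:, keep] / lam[keep]) @ Q[:, keep].T
    stat = T * gT @ pinv @ gT
    return stat, int(keep.sum())

# Illustrative singular covariance: rank 1, so one degree of freedom
cov = np.array([[2.0, 2.0], [2.0, 2.0]])
g = np.array([0.1, 0.1])
stat, dof = chi2_stat_pinv(g, cov, T=100)
```

Only the non-zero eigenvalue contributes to the statistic, matching the degrees-of-freedom count of #moments less #parameters.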

Efficient estimates

The theory so far allows us to estimate parameters by setting any linear combination of moments to zero. Hansen shows that one particular choice is statistically optimal,
$$a = d'S^{-1}. \tag{149}$$
This choice is the first order condition to $\min_{\{b\}} g_T(b)'S^{-1}g_T(b)$ that we studied in the last chapter. With this weighting matrix, the standard error formula (11.146) reduces to
$$\sqrt{T}(\hat{b} - b) \to \mathcal{N}\left[0, (d'S^{-1}d)^{-1}\right]. \tag{150}$$
This is Hansen's Theorem 3.2. The sense in which (11.149) is "efficient" is that the sampling variation of the parameters for arbitrary $a$ matrix, (11.146), equals the sampling variation of the "efficient" estimate in (11.150) plus a positive semidefinite matrix.

With the optimal weights (11.149), the variance of the moments (11.147) simplifies to
$$\mathrm{cov}(g_T) = \frac{1}{T}\left(S - d(d'S^{-1}d)^{-1}d'\right). \tag{151}$$
We can use this matrix in a test of the form (11.148). However, Hansen's Lemma 4.2 tells us that there is an equivalent and simpler way to construct this test,
$$T\,g_T(\hat{b})'S^{-1}g_T(\hat{b}) \to \chi^2(\#\text{moments} - \#\text{parameters}). \tag{152}$$
This result is nice since we get to use the already-calculated and non-singular $S^{-1}$.

To derive (11.152) from (11.147), factor $S = CC'$ and then find the asymptotic covariance matrix of $C^{-1}g_T(\hat{b})$ using (11.147). The result is
$$\mathrm{var}\left[\sqrt{T}\,C^{-1}g_T(\hat{b})\right] = I - C^{-1}d(d'S^{-1}d)^{-1}d'C^{-1\prime}.$$
This is an idempotent matrix of rank #moments − #parameters, so (11.152) follows.

Alternatively, note that $S^{-1}$ is a pseudo-inverse of the second stage $\mathrm{cov}(g_T)$. (A pseudo-inverse times $\mathrm{cov}(g_T)$ should result in an idempotent matrix of the same rank as $\mathrm{cov}(g_T)$.)
$$S^{-1}\mathrm{cov}(g_T) = S^{-1}\left(S - d(d'S^{-1}d)^{-1}d'\right) = I - S^{-1}d(d'S^{-1}d)^{-1}d'$$
Then, check that the result is idempotent.
$$\left(I - S^{-1}d(d'S^{-1}d)^{-1}d'\right)\left(I - S^{-1}d(d'S^{-1}d)^{-1}d'\right) = I - S^{-1}d(d'S^{-1}d)^{-1}d'.$$
This derivation not only verifies that $J_T$ has the same distribution as $g_T'\mathrm{cov}(g_T)^{-1}g_T$, but that they are numerically the same in every sample.

I emphasize that (11.150) and (11.152) only apply to the "optimal" choice of weights, (11.149). If you use another set of weights, as in a first-stage estimate, you must use the general formulas (11.146) and (11.147).

Model comparisons

You often want to compare one model to another. If one model can be expressed as a special or "restricted" case of the other, "unrestricted," model, we can perform a statistical comparison that looks very much like a likelihood ratio test. If we use the same $S$ matrix (usually that of the unrestricted model) the restricted $J_T$ must rise. But if the restricted model is really true, it shouldn't rise "much." How much?
$$T J_T(\text{restricted}) - T J_T(\text{unrestricted}) \sim \chi^2(\#\text{of restrictions})$$
This is a "$\chi^2$ difference" test, due to Newey and West (1987a), who call it the "D-test."
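Mechanically, the test is just the difference of the two $J_T$ statistics compared against a $\chi^2$ with degrees of freedom equal to the number of restrictions. The numbers below are illustrative assumptions, not output from any real estimation:

```python
from scipy.stats import chi2

# Hypothetical minimized objectives (times T), computed with the same S matrix
TJ_restricted = 12.3
TJ_unrestricted = 8.1
n_restrictions = 2

stat = TJ_restricted - TJ_unrestricted        # chi^2 difference ("D-test")
p_value = 1 - chi2.cdf(stat, df=n_restrictions)
```

A small p-value says the restrictions push the objective up by more than sampling variation can explain, rejecting the restricted model.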

11.2 Testing moments

How to test one or a group of pricing errors. 1) Use the formula for $\mathrm{var}(g_T)$. 2) A $\chi^2$ difference test.

You may want to see how well a model does on particular moments or particular pricing errors. For example, the celebrated "small firm effect" states that an unconditional CAPM ($m = a + bR^W$, no scaled factors) does badly in pricing the returns on a portfolio that always holds the smallest 1/10th or 1/20th of firms in the NYSE. You might want to see whether a new model prices the small firm returns well. The standard error of pricing errors also allows you to add error bars to a plot of predicted vs. actual mean returns such as Figure 5 or other diagnostics based on pricing errors.

We have already seen that individual elements of $g_T$ measure the pricing errors or expected return errors. Thus, the sampling variation of $g_T$ given by (11.147) provides exactly the standard error we are looking for. You can use the sampling distribution of $g_T$ to evaluate the significance of individual pricing errors, to construct a t-test (for a single $g_T$, such as small firms) or $\chi^2$ test (for groups of $g_T$, such as small firms $\otimes$ instruments). As usual, this is the Wald test.

Alternatively, you can use the $\chi^2$ difference approach. Start with a general model that includes all the moments, and form an estimate of the spectral density matrix $S$. Now set to zero the moments you want to test, and denote $g_{sT}(b)$ the vector of moments, including the zeros ($s$ for "smaller"). Choose $b_s$ to minimize $g_{sT}(b_s)'S^{-1}g_{sT}(b_s)$ using the same weighting matrix $S$. The criterion will be lower than the original criterion $g_T(b)'S^{-1}g_T(b)$, since there are the same number of parameters and fewer moments. But, if the moments we want to test truly are zero, the criterion shouldn't be that much lower. The $\chi^2$ difference test applies,
$$T\,g_T(\hat{b})'S^{-1}g_T(\hat{b}) - T\,g_{sT}(\hat{b}_s)'S^{-1}g_{sT}(\hat{b}_s) \sim \chi^2(\#\text{eliminated moments}).$$

Of course, don't fall into the obvious trap of picking the largest of 10 pricing errors and noting it's more than two standard deviations from zero. The distribution of the largest of 10 pricing errors is much wider than the distribution of a single one. To use this distribution, you have to pick which pricing error you're going to test before you look at the data.

11.3 Standard errors of anything by delta method

One quick application illustrates the usefulness of the GMM formulas. Often, we want to estimate a quantity that is a nonlinear function of sample means,
$$b = \phi\left[E(x_t)\right] = \phi(\mu).$$
In this case, the formula (11.144) reduces to
$$\mathrm{var}(b_T) = \frac{1}{T}\left[\frac{d\phi}{d\mu}\right]'\sum_{j=-\infty}^{\infty}\mathrm{cov}(x_t, x_{t-j}')\left[\frac{d\phi}{d\mu}\right]. \tag{153}$$
The formula is very intuitive. The variance of the sample mean is the covariance term inside. The derivatives just linearize the function $\phi$ near the true $b$.

For example, a correlation coefficient can be written as a function of sample means as
$$\mathrm{corr}(x_t, y_t) = \frac{E(x_ty_t) - E(x_t)E(y_t)}{\sqrt{E(x_t^2) - E(x_t)^2}\sqrt{E(y_t^2) - E(y_t)^2}}.$$
Thus, take
$$\mu = \left[E(x_t)\ \ E(x_t^2)\ \ E(y_t)\ \ E(y_t^2)\ \ E(x_ty_t)\right]'.$$
A problem at the end of the chapter asks you to take derivatives and derive the standard error of the correlation coefficient. One can derive standard errors for impulse-response functions, variance decompositions, and many other statistics in this way.
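A sketch of (11.153) for the correlation example, taking the gradient numerically and assuming serially uncorrelated data so only the $j = 0$ covariance term enters (the helper function and simulated data are illustrative, not part of the text):

```python
import numpy as np

def delta_method_se(phi, x, eps=1e-6):
    """SE of b = phi(mu), mu = column means of x, via (11.153) with j = 0 only."""
    T, k = x.shape
    mu = x.mean(axis=0)
    dphi = np.zeros(k)
    for i in range(k):                     # numerical gradient of phi at mu
        h = np.zeros(k)
        h[i] = eps
        dphi[i] = (phi(mu + h) - phi(mu - h)) / (2 * eps)
    S = np.cov(x.T, bias=True)             # j = 0 term of the covariance sum
    return np.sqrt(dphi @ S @ dphi / T)

# Correlation as a function of the five means in the text:
# mu = [E(x), E(x^2), E(y), E(y^2), E(xy)]
def corr_of_mu(mu):
    ex, ex2, ey, ey2, exy = mu
    return (exy - ex * ey) / np.sqrt((ex2 - ex**2) * (ey2 - ey**2))

rng = np.random.default_rng(3)
x, y = rng.standard_normal((2, 10000))
data = np.column_stack([x, x**2, y, y**2, x * y])
se = delta_method_se(corr_of_mu, data)
```

For independent standard normals the asymptotic standard error of the sample correlation is about $1/\sqrt{T}$, which the numerical answer should roughly reproduce.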

11.4 Using GMM for regressions

By mapping OLS regressions into the GMM framework, we derive formulas for OLS standard errors that correct for autocorrelation and conditional heteroskedasticity of the errors. The general formula is
$$\mathrm{var}(\hat{\beta}) = \frac{1}{T}E(x_tx_t')^{-1}\left[\sum_{j=-\infty}^{\infty}E(u_tx_tx_{t-j}'u_{t-j})\right]E(x_tx_t')^{-1}$$
and it simplifies in special cases.

Mapping any statistical procedure into GMM makes it easy to develop an asymptotic

distribution that corrects for statistical problems such as non-normality, serial correlation and

conditional heteroskedasticity. To illustrate, as well as to develop the very useful formulas, I

map OLS regressions into GMM.

Correcting OLS standard errors for econometric problems is not the same thing as GLS.

When errors do not obey the OLS assumptions, OLS is consistent, and often more robust

than GLS, but its standard errors need to be corrected.

OLS picks parameters \beta to minimize the variance of the residual:

\min_{\{\beta\}} \; E_T\left[(y_t - \beta' x_t)^2\right].

We find \hat{\beta} from the first order condition, which states that the residual is orthogonal to the right-hand variable:

g_T(\hat{\beta}) = E_T\left[x_t (y_t - x_t'\hat{\beta})\right] = 0. \qquad (154)

This condition is exactly identified: the number of moments equals the number of parameters.

Thus, we set the sample moments exactly to zero and there is no weighting matrix (a = I).

We can solve for the estimate analytically,

\hat{\beta} = \left[E_T(x_t x_t')\right]^{-1} E_T(x_t y_t).

This is the familiar OLS formula. The rest of the ingredients to equation (11.144) are

d = E(x_t x_t')

f(x_t, \beta) = x_t(y_t - x_t'\beta) = x_t e_t

where e_t is the regression residual. Equation (11.144) gives a formula for OLS standard errors,

\mathrm{var}(\hat{\beta}) = \frac{1}{T}\, E(x_t x_t')^{-1} \left[ \sum_{j=-\infty}^{\infty} E(u_t\, x_t x_{t-j}'\, u_{t-j}) \right] E(x_t x_t')^{-1}. \qquad (155)

This formula reduces to some interesting special cases.

Serially uncorrelated, homoskedastic errors

These are the usual OLS assumptions, and it's good that the usual formulas emerge. Formally, the OLS assumptions are

E(e_t \mid x_t, x_{t-1}, \ldots, e_{t-1}, e_{t-2}, \ldots) = 0 \qquad (156)

E(e_t^2 \mid x_t, x_{t-1}, \ldots, e_{t-1}, e_{t-2}, \ldots) = \text{constant} = \sigma_e^2. \qquad (157)

To use these assumptions, I use the fact that

E(ab) = E(E(a|b)b).

The first assumption means that only the j = 0 term enters the sum

\sum_{j=-\infty}^{\infty} E(e_t\, x_t x_{t-j}'\, e_{t-j}) = E(e_t^2\, x_t x_t').

The second assumption means that

E(e_t^2\, x_t x_t') = E(e_t^2)\, E(x_t x_t') = \sigma_e^2\, E(x_t x_t').

Hence equation (11.155) reduces to our old friend,

\mathrm{var}(\hat{\beta}) = \frac{1}{T}\,\sigma_e^2\, E(x_t x_t')^{-1} = \sigma_e^2\, (X'X)^{-1}.

The last notation is typical of econometrics texts, in which X = \left[\, x_1 \quad x_2 \quad \ldots \quad x_T \,\right]' represents the data matrix.

Heteroskedastic errors

If we delete the conditional homoskedasticity assumption (11.157), we can't pull the u out of the expectation, so the standard errors are

\mathrm{var}(\hat{\beta}) = \frac{1}{T}\, E(x_t x_t')^{-1}\, E(u_t^2\, x_t x_t')\, E(x_t x_t')^{-1}.

These are known as "heteroskedasticity-consistent standard errors" or "White standard errors," after White (1980).
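A minimal numpy sketch of the White calculation, with simulated heteroskedastic data; the design (error variance rising with |x|) is illustrative, not from the text:

```python
import numpy as np

def ols_white_se(y, X):
    """OLS with heteroskedasticity-consistent (White) standard errors.

    Sample-moment version of
    var(beta) = (1/T) E(xx')^{-1} E(u^2 xx') E(xx')^{-1}.
    """
    T = len(y)
    Exx = X.T @ X / T
    beta = np.linalg.solve(Exx, X.T @ y / T)
    u = y - X @ beta
    middle = (X * u[:, None] ** 2).T @ X / T   # E(u_t^2 x_t x_t')
    Exx_inv = np.linalg.inv(Exx)
    V = Exx_inv @ middle @ Exx_inv / T
    return beta, np.sqrt(np.diag(V))

rng = np.random.default_rng(1)
T = 4000
X = np.column_stack([np.ones(T), rng.standard_normal(T)])
# heteroskedastic errors: variance rises with |x|
u = rng.standard_normal(T) * (1 + np.abs(X[:, 1]))
y = X @ np.array([1.0, 2.0]) + u
beta, se = ols_white_se(y, X)
```

With this design the White slope standard error exceeds the classic σ²(X'X)⁻¹ one, since large-|x| observations carry the large errors.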

Hansen-Hodrick errors

Hansen and Hodrick (1982) run forecasting regressions of (say) six month returns, using

monthly data. We can write this situation in regression notation as

y_{t+k} = \beta' x_t + \varepsilon_{t+k}, \quad t = 1, 2, \ldots, T.

Fama and French (1988) also use regressions of overlapping long horizon returns on variables

such as dividend/price ratio and term premium. Such regressions are an important part of the

evidence for predictability in asset returns.

Under the null that one-period returns are unforecastable, we will still see correlation in

the \varepsilon_t due to overlapping data. Unforecastable returns imply

E(\varepsilon_t \varepsilon_{t-j}) = 0 \text{ for } |j| \geq k,

but not for |j| < k. Therefore, we can only rule out the terms in S at lags |j| ≥ k. Since we might as well correct for potential heteroskedasticity while we're at it, the standard errors are

\mathrm{var}(b_T) = \frac{1}{T}\, E(x_t x_t')^{-1} \left[ \sum_{j=-k}^{k} E(u_t\, x_t x_{t-j}'\, u_{t-j}) \right] E(x_t x_t')^{-1}.
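A sketch of the Hansen-Hodrick computation under the unforecastable null. The simulated overlapping k-period sums and the persistent predictor are assumptions for the demonstration, not the authors' data:

```python
import numpy as np

def hansen_hodrick_se(y, X, k):
    """OLS point estimates with Hansen-Hodrick standard errors.

    Truncates the sum over E(u_t x_t x_{t-j}' u_{t-j}) at |j| < k,
    since overlap of k-period errors leaves only those terms nonzero.
    """
    T = len(y)
    Exx = X.T @ X / T
    beta = np.linalg.solve(Exx, X.T @ y / T)
    u = y - X @ beta
    h = X * u[:, None]                  # moment data u_t * x_t
    S = h.T @ h / T                     # j = 0 term
    for j in range(1, k):               # lags 1..k-1 plus their transposes
        G = h[j:].T @ h[:-j] / T
        S += G + G.T
    Exx_inv = np.linalg.inv(Exx)
    V = Exx_inv @ S @ Exx_inv / T
    return beta, np.sqrt(np.diag(V))

# overlapping k-period sums of iid one-period returns, persistent predictor
rng = np.random.default_rng(0)
T, k = 2000, 6
r = rng.standard_normal(T + k)
y = np.array([r[t + 1 : t + 1 + k].sum() for t in range(T)])
z = np.empty(T)
z[0] = rng.standard_normal()
for t in range(1, T):
    z[t] = 0.98 * z[t - 1] + 0.2 * rng.standard_normal()
X = np.column_stack([np.ones(T), z])
beta, se_hh = hansen_hodrick_se(y, X, k)
_, se_0 = hansen_hodrick_se(y, X, 1)   # j = 0 term only, for comparison
```

The overlap-corrected standard errors come out substantially larger than the j = 0-only ones, roughly by a factor of √k here.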

11.5 Prespecified weighting matrices and moment conditions

Prespecified rather than "optimal" weighting matrices can emphasize economically interesting results, avoid the trap of blowing up standard errors rather than improving pricing errors, and lead to estimates that are more robust to small model misspecifications. This is analogous to the fact that OLS is often preferable to GLS in a regression context. The GMM formulas for a fixed weighting matrix W are

\mathrm{var}(\hat{b}) = \frac{1}{T} (d'Wd)^{-1} d'WSWd\,(d'Wd)^{-1}

\mathrm{var}(g_T) = \frac{1}{T} \left(I - d(d'Wd)^{-1}d'W\right) S \left(I - Wd(d'Wd)^{-1}d'\right).


In the basic approach outlined in Chapter 10, our final estimates were based on the "efficient" S^{-1} weighting matrix. This objective maximizes the asymptotic statistical information in the sample about a model, given the choice of moments g_T. However, you may want to use a prespecified weighting matrix W ≠ S^{-1} instead, or at least as a diagnostic accompanying more formal statistical tests. A prespecified weighting matrix lets you, rather than the S matrix, specify which moments or linear combinations of moments GMM will value in the minimization \min_{\{b\}} g_T(b)' W g_T(b). A higher value of W_{ii} forces GMM to pay more attention to getting the ith moment right in the parameter estimation. For example, you might feel that some assets suffer from measurement error, or are small and illiquid and hence should be deemphasized, or you may want to keep GMM from looking at portfolios with strong long and short positions. I give some additional motivations below.

You can also go one step further and impose which linear combinations a_T of moment conditions will be set to zero in estimation, rather than use the choice resulting from a minimization, a_T = d'S^{-1} or a_T = d'W. The fixed-W estimate still trades off the accuracy of individual moments according to the sensitivity of each moment with respect to the parameter. For example, if g_T = [g_T^1 \;\; g_T^2]' and W = I, but \partial g_T / \partial b = [1 \;\; 10]', so that the second moment is 10 times more sensitive to the parameter value than the first moment, then GMM with the fixed weighting matrix sets

1 \times g_T^1 + 10 \times g_T^2 = 0.

The second moment condition will be 10 times closer to zero than the first. If you really want GMM to pay equal attention to the two moments, then you can fix the a_T matrix directly, for example a_T = [1 \;\; 1] or a_T = [1 \;\; {-1}].
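To see the sensitivity trade-off concretely, here is a small numeric sketch of the [1 10] example. The moment function is taken to be linear, g_T(b) = g0 + d·b, and the value of g0 is hypothetical:

```python
import numpy as np

# Stylized two-moment, one-parameter example from the text:
# g_T(b) = g0 + d * b, with sensitivity d = dg_T/db = [1, 10]'.
g0 = np.array([2.0, 3.0])          # hypothetical sample moments at b = 0
d = np.array([1.0, 10.0])

def solve_fixed_aT(aT):
    """Solve a_T' g_T(b) = 0 for b in the linear case."""
    b = -(aT @ g0) / (aT @ d)
    return b, g0 + d * b

# W = I: the first-order condition is a_T = d'W = [1, 10]
b_I, g_I = solve_fixed_aT(d)
# "equal attention": fix a_T = [1, 1] directly
b_eq, g_eq = solve_fixed_aT(np.array([1.0, 1.0]))
```

At the W = I solution the second moment is exactly 10 times closer to zero than the first; with a_T = [1 1] the two moments end up equal in magnitude and opposite in sign.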

Using a prespecified weighting matrix or using a prespecified set of moments is not the same thing as ignoring correlation of the errors u_t in the distribution theory. The S matrix will still show up in all the standard errors and test statistics.

11.5.1 How to use prespecified weighting matrices

Once you have decided to use a prespecified weighting matrix W or a prespecified set of moments a_T g_T(b) = 0, the general distribution theory outlined in section 11.1 quickly gives standard errors of the estimates and moments, and therefore a χ² statistic that can be used to test whether all the moments are jointly zero. Section 11.1 gives the formulas for the case that a_T is prespecified. If we use weighting matrix W, the first order conditions to \min_{\{b\}} g_T(b)' W g_T(b) are

\frac{\partial g_T(b)'}{\partial b}\, W g_T(b) = d'W g_T(b) = 0,

so we map into the general case with a_T = d'W. Plugging this value into (11.146), the variance-covariance matrix of the estimated coefficients is

\mathrm{var}(\hat{b}) = \frac{1}{T} (d'Wd)^{-1} d'WSWd\,(d'Wd)^{-1}. \qquad (158)

(You can check that this formula reduces to \frac{1}{T}(d'S^{-1}d)^{-1} with W = S^{-1}.)

Plugging a_T = d'W into equation (11.147), we find the variance-covariance matrix of the moments g_T,

\mathrm{var}(g_T) = \frac{1}{T} \left(I - d(d'Wd)^{-1}d'W\right) S \left(I - Wd(d'Wd)^{-1}d'\right). \qquad (159)

As in the general formula, the terms to the left and right of S account for the fact that some

linear combinations of moments are set to zero in each sample.

Equation (11.159) can be the basis of χ² tests for the overidentifying restrictions. If we interpret ( )^{-1} to be a generalized inverse, then

g_T'\, \mathrm{var}(g_T)^{-1} g_T \sim \chi^2(\#\text{moments} - \#\text{parameters}).

As in the general case, you have to pseudo-invert the singular var(g_T), for example by inverting only the non-zero eigenvalues.
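A sketch of these fixed-W calculations in numpy, with hypothetical d, S, W, and sample moments (none taken from the text's data). The pseudo-inverse handles the singular var(g_T):

```python
import numpy as np

# Hypothetical ingredients: 3 moments, 1 parameter.
d = np.array([[1.0], [0.5], [2.0]])            # dg_T/db
S = np.array([[1.0, 0.3, 0.2],
              [0.3, 1.5, 0.1],
              [0.2, 0.1, 2.0]])                # long-run moment covariance
W = np.eye(3)                                  # prespecified weighting matrix
T = 500

A = np.linalg.inv(d.T @ W @ d)                 # (d'Wd)^{-1}
var_b = A @ d.T @ W @ S @ W @ d @ A / T        # equation (158)
M = np.eye(3) - d @ A @ d.T @ W                # I - d(d'Wd)^{-1}d'W
var_g = M @ S @ M.T / T                        # equation (159)

# chi^2 test of the overidentifying restrictions, via a pseudo-inverse
gT = np.array([0.05, -0.02, 0.01])             # hypothetical sample moments
stat = gT @ np.linalg.pinv(var_g) @ gT         # ~ chi^2(#moments - #params)
```

Two sanity checks are built into the algebra: M annihilates d (the combination set to zero in sample), so var(g_T) has rank #moments − #parameters, and with W = S⁻¹ equation (158) collapses to (d'S⁻¹d)⁻¹/T.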

The major danger in using prespecified weighting matrices or moments a_T is that the choice of moments, units, and (of course) the prespecified a_T or W must be made carefully. For example, if you multiply the second moment by 10 times its original value, the S matrix will undo this transformation and weight the moments in their original proportions. The identity weighting matrix will not undo such transformations, so the units should be picked right initially.
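A quick numeric check of the units point: rescaling the second moment by 10 leaves the S⁻¹-weighted criterion unchanged, but changes the identity-weighted criterion drastically. The g and S values are hypothetical:

```python
import numpy as np

# Rescale the second moment by 10: g -> Dg, S -> D S D, D = diag(1, 10).
g = np.array([0.1, 0.2])
S = np.array([[1.0, 0.4],
              [0.4, 2.0]])
D = np.diag([1.0, 10.0])
g2, S2 = D @ g, D @ S @ D

J_S = g @ np.linalg.inv(S) @ g         # efficient criterion, original units
J_S2 = g2 @ np.linalg.inv(S2) @ g2     # after rescaling: identical
J_I = g @ g                            # identity-weighted criterion
J_I2 = g2 @ g2                         # after rescaling: very different
```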

11.5.2 Motivations for prespecified weighting matrices

Robustness, as with OLS vs. GLS.

When errors are autocorrelated or heteroskedastic, every econometrics textbook shows you how to "improve" on OLS by making appropriate GLS corrections. If you correctly model the error covariance matrix and if the regression is perfectly specified, the GLS procedure can improve efficiency, i.e., give estimates with lower asymptotic standard errors. However, GLS is less robust. If you model the error covariance matrix incorrectly, the GLS estimates can be much worse than OLS. Also, the GLS transformations can zero in on slightly misspecified areas of the model, producing garbage. GLS is "best," but OLS is "pretty darn good." One often has enough data that wringing every last ounce of statistical precision (low standard errors) from the data is less important than producing estimates that do not depend on questionable statistical assumptions, and that transparently focus on the interesting features of the data. In these cases, it is often a good idea to use OLS estimates. The OLS standard error formulas are wrong, though, so you must correct the standard errors of the OLS estimates for these features of the error covariance matrices, using the formulas we developed in section 11.4.

GMM works the same way. First-stage or otherwise fixed-weighting-matrix estimates may give up something in asymptotic efficiency, but they are still consistent, and they can be more robust to statistical and economic problems. You still want to use the S matrix in computing standard errors, though, just as you want to correct OLS standard errors, and the GMM formulas show you how to do this.

Even if in the end you want to produce "efficient" estimates and tests, it is a good idea to calculate standard errors and model fit tests for the first-stage estimates. Ideally, the parameter estimates should not change by much, and the second-stage standard errors should be tighter. If the "efficient" parameter estimates do change a great deal, it is a good idea to diagnose why this is so. It must come down to the "efficient" estimation strongly weighting moments or linear combinations of moments that were not important in the first stage, and the former linear combination of moments disagreeing strongly with the latter about which parameters fit well. Then, you can decide whether the difference in results is truly due to an efficiency gain, or whether it signals a model misspecification.

Chapter 16 argues more at length for judicious use of "inefficient" methods such as OLS to guard against inevitable model misspecifications.

Near-singular S.

The spectral density matrix is often nearly singular, since asset returns are highly correlated with each other, and since we often include many assets relative to the number of data points. As a result, second-stage GMM (and, as we will see below, maximum likelihood or any other efficient technique) tries to minimize differences and differences of differences of asset returns in order to extract statistically orthogonal components with lowest variance. One may feel that this feature leads GMM to place a lot of weight on poorly estimated, economically uninteresting, or otherwise non-robust aspects of the data. In particular, portfolios of the form 100R^1 - 99R^2 assume that investors can in fact purchase such heavily leveraged portfolios. Short-sale costs often rule out such portfolios or significantly alter their returns, so one may not want to emphasize pricing them correctly in the estimation and evaluation.

For example, suppose that S is given by

S = \begin{bmatrix} 1 & \rho \\ \rho & 1 \end{bmatrix},

so

S^{-1} = \frac{1}{1-\rho^2} \begin{bmatrix} 1 & -\rho \\ -\rho & 1 \end{bmatrix}.

We can factor S^{-1} into a "square root" by the Choleski decomposition. This produces a triangular matrix C such that C'C = S^{-1}. You can check that the matrix

C = \begin{bmatrix} \frac{1}{\sqrt{1-\rho^2}} & \frac{-\rho}{\sqrt{1-\rho^2}} \\ 0 & 1 \end{bmatrix} \qquad (160)

works. Then, the GMM criterion

\min\; g_T' S^{-1} g_T

is equivalent to

\min\; (g_T' C')(C g_T).

C g_T gives the linear combinations of moments that efficient GMM is trying to minimize. Looking at (11.160), as ρ → 1, the (2,2) element stays at 1, but the (1,1) and (1,2) elements get very large and of opposite signs. For example, if ρ = 0.95, then

C = \begin{bmatrix} 3.20 & -3.04 \\ 0 & 1 \end{bmatrix}.

In this example, GMM pays a little attention to the second moment, but places three times as much weight on the difference between the first and second moments. Larger matrices produce even more extreme weights. At a minimum, it is a good idea to look at S^{-1} and its Choleski decomposition to see what moments GMM is prizing.
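Equation (160) is easy to verify numerically. A minimal sketch for ρ = 0.95, confirming C'C = S⁻¹, the weights quoted above, and the equivalence of the two criteria:

```python
import numpy as np

rho = 0.95
S = np.array([[1.0, rho],
              [rho, 1.0]])
S_inv = np.linalg.inv(S)

# Triangular "square root" C with C'C = S^{-1}, as in equation (160)
s = np.sqrt(1 - rho**2)
C = np.array([[1 / s, -rho / s],
              [0.0,    1.0]])

# C g_T is the combination of moments that efficient GMM minimizes:
# for any g, g' S^{-1} g equals (Cg)'(Cg)
g = np.array([0.3, -0.1])
crit_S = g @ S_inv @ g
crit_C = (C @ g) @ (C @ g)
```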

The same point has a classic interpretation, and is a well-known danger with classic regression-based tests. Efficient GMM wants to focus on well-measured moments. In asset pricing applications, the errors are typically close to uncorrelated over time, so GMM is looking for portfolios with small values of var(m_{t+1} R^e_{t+1}). Roughly speaking, those will be assets with small return variance. Thus, GMM will pay most attention to correctly pricing the sample minimum-variance portfolio, and GMM's evaluation of the model by the J_T test will focus on its ability to price this portfolio.

Now, consider what happens in a sample, as illustrated in Figure 24. The sample mean-variance frontier is typically a good deal wider than the true, or ex-ante, mean-variance frontier. In particular, the sample minimum-variance portfolio may have little to do with the true minimum-variance portfolio. Like any portfolio on the sample frontier, its composition largely reflects luck; that's why we have asset pricing models in the first place, rather than just price assets with portfolios on the sample frontier. The sample minimum-variance return is also likely to be composed of strong long-short positions.

In sum, you may want to force GMM not to pay quite so much attention to correctly

pricing the sample minimum variance portfolio, and you may want to give less importance to

a statistical measure of model evaluation that almost entirely prizes GMM's ability to price

that portfolio.

Economically interesting moments.


Sample minimum-variance portfolio

E(R) Sample, ex-post frontier

True, ex-ante frontier

Ļ(R)

Figure 24. True or ex ante and sample or ex-post mean-variance frontier. The sample often

shows a spurious minimum-variance portfolio.


The optimal weighting matrix makes GMM pay close attention to linear combinations of moments with small sampling error in both estimation and evaluation. One may want to force the estimation and evaluation to pay attention to economically interesting moments instead. The initial portfolios are usually formed on an economically interesting characteristic such as size, beta, book/market, or industry. One typically wants in the end to see how well the model prices these initial portfolios, not how well the model prices potentially strange portfolios of those portfolios. If a model fails, one may want to characterize that failure as "the model doesn't price small stocks," not "the model doesn't price a portfolio of 900× small firm returns −600× large firm returns −299× medium firm returns."

Level playing field.

The S matrix changes as the model and its parameters change. (See the definition, (10.138) or (11.145).) As the S matrix changes, which assets the GMM estimate tries hard to price well changes as well. For example, the S matrix from one model may strongly value pricing the T-bill well, while that of another model may value pricing a stock excess return well. Comparing the results of such estimations is like comparing apples and oranges. By fixing the weighting matrix, you can force GMM to pay attention to the various assets in the same proportion while you vary the model.

The fact that S matrices change with the model leads to another subtle trap. One model may "improve" a J_T = g_T' S^{-1} g_T statistic because it blows up the estimates of S, rather than making any progress on lowering the pricing errors g_T. No one would formally use a comparison of J_T tests across models to compare them, of course. But it has proved nearly irresistible for authors to claim success for a new model over previous ones by noting improved J_T statistics, despite different weighting matrices, different moments, and sometimes much larger pricing errors. For example, if you take a model m_t and create a new model by simply adding noise, unrelated to asset returns (in sample), m_t' = m_t + \varepsilon_t, then the moment condition g_T = E_T(m_t R_t^e) = E_T((m_t + \varepsilon_t) R_t^e) is unchanged. However, the spectral density matrix S = E\left[(m_t + \varepsilon_t)^2 R_t^e R_t^{e\prime}\right] can rise dramatically. This can reduce the J_T, leading to a false sense of "improvement."

Conversely, if the sample contains a nearly riskfree portfolio of the test assets, or a portfolio with apparently small variance of m_{t+1} R^e_{t+1}, then the J_T test essentially evaluates the model by how well it can price this one portfolio. This can lead to a false rejection: even a very small g_T will produce a large g_T' S^{-1} g_T if there is an eigenvalue of S that is (spuriously) too small.

If you use a common weighting matrix W for all models, and evaluate the models by g_T' W g_T, then you can avoid this trap. Beware that the individual χ² statistics are based on g_T' var(g_T)^{-1} g_T, and var(g_T) contains S, even with a prespecified weighting matrix W.

You should look at the pricing errors, or at some statistic such as the sum of absolute or squared pricing errors, to see if they are bigger or smaller, leaving the distribution aside. The question "are the pricing errors small?" is as interesting as the question "if we drew artificial data over and over again from a null statistical model, how often would we estimate a ratio of pricing errors to their estimated variance g_T' S^{-1} g_T this big or larger?"

11.5.3 Some prespecified weighting matrices

Two examples of economically interesting weighting matrices are the second-moment matrix

of returns, advocated by Hansen and Jagannathan (1997) and the simple identity matrix,

which is used implicitly in much empirical asset pricing.

Second moment matrix.

Hansen and Jagannathan (1997) advocate the use of the second moment matrix of payoffs W = E(xx')^{-1} in place of S. They motivate this weighting matrix as an interesting distance measure between a model for m, say y, and the space of true m's. Precisely, the minimum distance (second moment) between a candidate discount factor y and the space of true discount factors is the same as the minimum value of the GMM criterion with W = E(xx')^{-1} as weighting matrix.

[Figure 25: the payoff space X, a candidate discount factor y, its projection proj(y|X), the mimicking payoff x*, and the nearest true discount factor m.]

Figure 25. Distance between y and nearest m = distance between proj(y|X) and x*.

To see why this is true, refer to Figure 25. The distance between y and the nearest valid m is the same as the distance between proj(y | X) and x^*. As usual, consider the case that X is generated from a vector of payoffs x with price p. From the OLS formula,

\mathrm{proj}(y \mid X) = E(yx')\, E(xx')^{-1} x.


x^* is the portfolio of x that prices x by construction,

x^* = p'\, E(xx')^{-1} x.

Then, the distance between y and the nearest valid m is

\left\| y - \text{nearest } m \right\| = \left\| \mathrm{proj}(y|X) - x^* \right\|

= \left\| E(yx')\, E(xx')^{-1} x - p'\, E(xx')^{-1} x \right\|

= \left\| \left( E(yx') - p' \right) E(xx')^{-1} x \right\|
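This chain of equalities can be checked numerically in a small discrete-state example. Everything here (states, payoffs, prices, candidate y) is hypothetical; the check is that the second-moment norm of proj(y|X) − x* equals the square root of the GMM criterion g' E(xx')⁻¹ g with g = E(yx) − p:

```python
import numpy as np

states = 5
pi = np.full(states, 1.0 / states)              # state probabilities
x = np.array([[1.0, 1.0, 1.0, 1.0, 1.0],        # payoffs: a bond and a stock
              [0.5, 0.8, 1.0, 1.3, 1.6]])
p = np.array([0.95, 1.0])                       # prices
y = np.array([1.2, 1.1, 0.9, 0.8, 0.7])         # candidate discount factor

E_xx = (x * pi) @ x.T                           # E(xx')
E_yx = (x * pi) @ y                             # E(yx)
g = E_yx - p                                    # pricing errors of y

# x* = p' E(xx')^{-1} x  and  proj(y|X) = E(yx') E(xx')^{-1} x
w_star = np.linalg.solve(E_xx, p)
w_proj = np.linalg.solve(E_xx, E_yx)
diff = (w_proj - w_star) @ x                    # proj(y|X) - x*, state by state
dist_direct = np.sqrt(pi @ diff**2)             # second-moment norm
dist_gmm = np.sqrt(g @ np.linalg.solve(E_xx, g))
```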
