Again, we already derived this χ2 test in (12.168), and its finite sample F counterpart, the GRS F test (12.169).

The other test of the restrictions is the likelihood ratio test (14.197). Quite generally, likelihood ratio tests are asymptotically equivalent to Wald tests, and so give the same result. Showing it in this case is not worth the algebra.

14.4 When factors are not excess returns, ML prescribes a cross-sectional regression

If the factors are not returns, we don't have a choice between time-series and cross-sectional regression, since the intercepts are not zero. As you might suspect, ML prescribes a cross-sectional regression in this case.

The factor model, expressed in expected return-beta form, is

E(Re_i) = α_i + β_i′λ;  i = 1, 2, ...N   (14.206)

The betas are defined from time-series regressions

Re_it = a_i + β_i′f_t + ε_it.   (14.207)

The intercepts a_i in the time-series regressions need not be zero, since the model does not apply to the factors. They are not unrestricted, however. Taking expectations of the time-series regression (14.207) and comparing it to (14.206) (as we did to derive the restriction α = 0 for the time-series regression), the restriction α = 0 implies

a_i = β_i′(λ − E(f_t)).   (14.208)

Plugging into (14.207), the time-series regressions must be of the restricted form

Re_it = β_i′λ + β_i′[f_t − E(f_t)] + ε_it.   (14.209)

In this form, you can see that β_i′λ determines the mean return. Since there are fewer factors than returns, this is a restriction on the regression (14.209).

Stack assets i = 1, 2, ...N to a vector, and introduce the auxiliary statistical model that the errors and factors are i.i.d. normal and uncorrelated with each other. Then, the restricted model is

Re_t = Bλ + B[f_t − E(f_t)] + ε_t
f_t = E(f) + u_t

[ε_t; u_t] ~ N( 0, [Σ, 0; 0, V] )

where B denotes an N × K matrix of regression coefficients of the N assets on the K factors.


CHAPTER 14 MAXIMUM LIKELIHOOD

The likelihood function is

L = (const.) − (1/2) sum_{t=1}^{T} ε_t′ Σ⁻¹ ε_t − (1/2) sum_{t=1}^{T} u_t′ V⁻¹ u_t

ε_t = Re_t − B[λ + f_t − E(f)];  u_t = f_t − E(f).

Maximizing the likelihood function,

∂L/∂E(f):  0 = B′Σ⁻¹ sum_{t=1}^{T} (Re_t − B[λ + f_t − E(f)]) + V⁻¹ sum_{t=1}^{T} (f_t − E(f))

∂L/∂λ:  0 = B′Σ⁻¹ sum_{t=1}^{T} (Re_t − B[λ + f_t − E(f)])

The solution to this pair of equations is

Ê(f) = E_T(f_t)   (14.210)

λ̂ = (B′Σ⁻¹B)⁻¹ B′Σ⁻¹ E_T(Re_t).   (14.211)

The maximum likelihood estimate of the factor risk premium is a GLS cross-sectional regression of average returns on betas.

The maximum likelihood estimates of the regression coefficients B are again not the same as the standard OLS formulas. Again, ML imposes the null to improve efficiency.

∂L/∂B:  0 = sum_{t=1}^{T} Σ⁻¹ (Re_t − B[λ + f_t − E(f)]) [λ + f_t − E(f)]′   (14.212)

B̂ = [ sum_{t=1}^{T} Re_t [λ + f_t − E(f)]′ ] [ sum_{t=1}^{T} [λ + f_t − E(f)] [λ + f_t − E(f)]′ ]⁻¹

This is true, even though the B are defined in the theory as population regression coefficients. (The matrix notation hides a lot here! If you want to rederive these formulas, it's helpful to start with scalar parameters, e.g. B_ij, and to think of it as ∂L/∂θ = sum_{t=1}^{T} (∂L/∂ε_t)′ ∂ε_t/∂θ.) Therefore, to really implement ML, you have to solve (14.211) and (14.212) simultaneously for λ̂ and B̂, along with Σ̂, whose ML estimate is the usual second moment matrix of the residuals. This can usually be done iteratively: start with the OLS estimate B̂, run an OLS cross-sectional regression for λ̂, form Σ̂, and iterate.
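To make the iteration concrete, here is a small sketch in Python. The data are simulated with made-up parameter values (the text's derivation assumes the i.i.d. normal auxiliary model); this is an illustration of the fixed-point iteration on (14.211) and (14.212), not a definitive implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
T, N, K = 600, 10, 1

# Simulate data that satisfy E(Re) = B lam in sample (illustrative values)
B_true = 0.5 + rng.random((N, K))
lam_true = np.array([0.6])
f = 0.4 + 0.5 * rng.standard_normal((T, K))      # factor mean need not equal lam
eps = 0.1 * rng.standard_normal((T, N))
R = (B_true @ lam_true) + (f - f.mean(0)) @ B_true.T + eps

# (14.210): E(f) is estimated by the sample mean of the factor
Ef = f.mean(0)

# Starting values: OLS time-series betas, then an OLS cross-sectional regression
X = np.column_stack([np.ones(T), f])
B = np.linalg.lstsq(X, R, rcond=None)[0][1:].T        # N x K betas
lam = np.linalg.lstsq(B, R.mean(0), rcond=None)[0]    # OLS cross-section start

for _ in range(200):
    x = lam + (f - Ef)                   # regressor in the restricted model (14.209)
    e = R - x @ B.T                      # restricted residuals
    Sig = e.T @ e / T                    # ML Sigma: second moment of residuals
    Si = np.linalg.inv(Sig)
    lam_new = np.linalg.solve(B.T @ Si @ B, B.T @ Si @ R.mean(0))   # (14.211), GLS
    B = (R.T @ x) @ np.linalg.inv(x.T @ x)                          # (14.212)
    if np.max(np.abs(lam_new - lam)) < 1e-12:
        lam = lam_new
        break
    lam = lam_new
```

With well-behaved data the iteration settles after a handful of passes, and λ̂ recovers the premium used in the simulation.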

14.5 Problems



1. Why do we use restricted ML when the factor is a return, but unrestricted ML when the factor is not a return? To see why, try to formulate an ML estimator based on an unrestricted regression when factors are not returns, equation (12.166). Add pricing errors α_i to the regression as we did for the unrestricted regression in the case that factors are returns, and then find ML estimators for B, λ, α, E(f). (Treat V and Σ as known to make the problem easier.)

2. Instead of writing a regression, build up the ML for the CAPM a little more formally. Write the statistical model as just the assumption that individual returns and the market return are jointly normal,

[Re; Rem] ~ N( [E(Re); E(Rem)], [Σ, cov(Re, Rem); cov(Rem, Re′), σ²_m] )

The model's restriction is

E(Re) = γ cov(Rem, Re).

Estimate γ and show that this is the same time-series estimator as we derived by presupposing a regression.


Chapter 15. Time series, cross-section, and GMM/DF tests of linear factor models

The GMM/DF, time-series and cross-sectional regression procedures and distribution theory are similar, but not identical. Cross-sectional regressions on betas are not the same thing as cross-sectional regressions on second moments. Cross-sectional regressions weighted by the residual covariance matrix are not the same thing as cross-sectional regressions weighted by the spectral density matrix.

GLS cross-sectional regressions and second-stage GMM have a theoretical efficiency advantage over OLS cross-sectional regressions and first-stage GMM, but how important is this advantage, and is it outweighed by worse finite-sample performance?

The time-series regression, as ML estimate, has a potential gain in efficiency when returns are factors and the residuals are i.i.d. normal. Why does ML prescribe a time-series

regression when the return is a factor and a cross-sectional regression when the return is not

a factor? The time-series regression seems to ignore pricing errors and estimate the model by

entirely different moments. How does adding one test asset make such a seemingly dramatic

difference to the procedure?

Finally, and perhaps most importantly, the GMM/discount factor approach is still a “new”

procedure. Many authors still do not trust it. It is important to verify that it produces similar

results and well-behaved test statistics in the setups of the classic regression tests.

To address these questions, I first apply the various methods to a classic empirical question. How do time-series regression, cross-sectional regression and GMM/stochastic discount factor compare when applied to a test of the CAPM on CRSP size portfolios? I find that the three methods produce almost exactly the same results for this classic exercise. They produce almost exactly the same estimates, standard errors, t-statistics and χ2 statistics that the pricing errors are jointly zero.

Then I conduct a Monte Carlo and Bootstrap evaluation. Again, I find little difference between the methods. The estimates, standard errors, and size and power of tests are almost identical across methods.

The Bootstrap does reveal that the traditional i.i.d. assumption generates χ2 statistics with about 1/2 the correct size; they reject half as often as they should under the null. Simple GMM corrections to the distribution theory repair this size defect. Also, you can ruin any estimate and test with a bad spectral density matrix estimate. I try an estimate with 24 lags and no Newey-West weights. It is singular in the data sample and in many Monte Carlo replications. Interestingly, this singularity has minor effects on standard errors, but causes disasters when you use the spectral density matrix to weight a second-stage GMM.

I also find that second-stage "efficient" GMM is only very slightly more efficient than first-stage GMM, but is somewhat less robust; it is more sensitive to the poor spectral density matrix and its asymptotic standard errors can be slightly misleading. As OLS is often better



than GLS, despite the theoretical efficiency advantage of GLS, first-stage GMM may be better than second-stage GMM in many applications.

This section should give comfort that the apparently "new" GMM/discount factor formulation is almost exactly the same as traditional methods in the traditional setup. There is a widespread impression that GMM has difficulty in small samples. The literature on the small-sample properties of GMM (for example, Ferson and Foerster, 1994; Fuhrer, Moore, and Schuh, 1995) naturally tries hard setups, with highly nonlinear models, highly persistent and heteroskedastic errors, conditioning information, potentially weak instruments and so forth. Nobody would write a paper trying GMM in a simple situation such as this one, correctly foreseeing that the answer would not be very interesting. Unfortunately, many readers take from this literature a mistaken impression that GMM always has difficulty in finite samples, even in very standard setups. This is not the case.

The point of the GMM/discount factor method, of course, is not a new way to handle the simple i.i.d. normal CAPM problems, which are already handled efficiently by regression techniques. The point of the GMM/discount factor method is its ability to transparently handle situations that are very hard with expected return-beta models and ML techniques, including the incorporation of conditioning information and nonlinear models. With the reassurance of this section, we can proceed to those more exciting applications.

Cochrane (2000) presents a more in-depth analysis, including estimation and Monte Carlo evaluation of individual pricing error estimates and tests. Jagannathan and Wang (2000) compare the GMM/discount factor approach to classic regression tests analytically. They show that the parameter estimates, standard errors and χ2 statistics are asymptotically identical to those of an expected return-beta cross-sectional regression when the factor is not a return.

15.1 Three approaches to the CAPM in size portfolios

The time-series approach sends the expected return-beta line through the market return, ignoring other assets. The OLS cross-sectional regression minimizes the sum of squared pricing errors, so allows some market pricing error to fit other assets better. The GLS cross-sectional regression weights pricing errors by the residual covariance matrix, so reduces to the time-series regression when the factor is a return and is included in the test assets.

The GMM/discount factor estimates, standard errors and χ2 statistics are very close to time-series and cross-sectional regression estimates in this classic setup.

Time series and cross section

Figures 28 and 29 illustrate the difference between time-series and cross-sectional regressions, in an evaluation of the CAPM on monthly size portfolios.



Figure 28 presents the time-series regression. The time-series regression estimates the factor risk premium from the average of the factor, ignoring any information in the other assets: λ̂ = E_T(Rem). Thus, a time-series regression draws the expected return-beta line across assets by making it fit precisely on two points, the market return and the riskfree rate. The market and riskfree rate have zero estimated pricing error in every sample. (The far right portfolios are the smallest-firm portfolios, and their positive pricing errors are the small-firm anomaly; this data set is the first serious failure of the CAPM. I come back to the substantive issue in Chapter 20.)

The time-series regression is the ML estimator in this case, since the factor is a return. Why does ML ignore all the information in the test asset average returns, and estimate the factor premium from the average factor return only? The answer lies in the structure that we told ML to assume when looking at the data. When we write Re_t = a + βf_t + ε_t with ε independent of f, we tell ML that a sample of returns already includes the same sample of the factor, plus extra noise. Thus, the sample of test asset returns cannot possibly tell ML anything more than the sample of the factor alone about the mean of the factor. Second, we tell ML that the factor risk premium equals the mean of the factor, so it may not consider the possibility that the two are different in trying to match the data.
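A small simulation (with made-up stand-in numbers, roughly calibrated to percent per month, not the CRSP data) illustrates the two estimates of the factor risk premium: the sample mean of the factor versus an OLS cross-sectional regression of average returns on estimated betas.

```python
import numpy as np

rng = np.random.default_rng(1)
T, N = 876, 10

# Simulated stand-ins for the market factor and size-portfolio excess returns
f = 0.66 + 5.47 * rng.standard_normal(T)
beta_true = np.linspace(0.8, 1.4, N)
alpha_true = np.linspace(0.0, 0.3, N)            # pricing errors on some portfolios
R = alpha_true + np.outer(f, beta_true) + 2.0 * rng.standard_normal((T, N))

# Time-series approach: the factor risk premium is the sample mean of the factor
lam_ts = f.mean()

# Cross-sectional approach: estimate betas, then regress average returns on them
fc = f - f.mean()
beta_hat = (fc @ (R - R.mean(0))) / (fc @ fc)              # time-series OLS slopes
lam_ols = (beta_hat @ R.mean(0)) / (beta_hat @ beta_hat)   # OLS CSR, no intercept
```

With nonzero pricing errors the cross-sectional estimate moves away from the factor mean in order to fit the other assets, which is exactly the difference between Figures 28 and 29.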

Figure 28. Average excess returns vs. betas on CRSP size portfolios, 1926-1998. The line gives the predicted average return from the time-series regression, E(Re) = βE(Rem).

Figure 29. Average excess returns vs. betas of CRSP size portfolios, 1926-1998, and the fit of cross-sectional regressions.

The OLS cross-sectional regression in Figure 29 draws the expected return-beta line by

minimizing the squared pricing error across all assets. Therefore, it allows some pricing error

for the market return, if by doing so the pricing errors on other assets can be reduced. Thus,

the OLS cross-sectional regression gives some pricing error to the market return in order to

lower the pricing errors of the other portfolios.

When the factor is not also a return, ML prescribes a cross-sectional regression. ML still ignores anything but the factor data in estimating the mean of the factor: Ê(f) = E_T(f_t). However, ML is now allowed to use a different parameter for the factor risk premium that fits average returns to betas, which it does by cross-sectional regression. However, the ML cross-sectional regression is a GLS regression, not an OLS cross-sectional regression. The GLS cross-sectional regression in Figure 29 is almost exactly identical to the time-series regression result; it passes right through the origin and the market return, ignoring all the other pricing errors.

The GLS cross-sectional regression

λ̂ = (β′Σ⁻¹β)⁻¹ β′Σ⁻¹ E_T(Re)

weights the various portfolios by the inverse of the residual covariance matrix Σ. If we include the market return as a test asset, it obviously has no residual variance (Rem_t = 0 + 1 × Rem_t + 0), so the GLS estimate pays exclusive attention to it in fitting the market line. The same thing



happens if the test assets span the factors: if a linear combination of the test assets is equal to the factor and hence has no residual variance. The size portfolios nearly span the market return, so the GLS cross-sectional regression is visually indistinguishable from the time-series regression in this case.

This observation wraps up one mystery, why ML seemed so different when the factor is and is not a return. As the residual variance of a portfolio goes to zero, the GLS regression pays more and more attention to that portfolio, until you have achieved the same result as a time-series regression.
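A tiny numerical sketch of this point, using hypothetical means and betas and a diagonal Σ for simplicity (the actual Σ is a full covariance matrix): as the market's residual variance goes to zero, the GLS estimate converges to the market's own mean return.

```python
import numpy as np

# Hypothetical mean excess returns and betas for 9 assets plus the market (last entry)
ER = np.array([0.9, 1.0, 1.1, 1.2, 1.3, 1.2, 1.1, 1.0, 0.95, 0.66])
beta = np.array([1.3, 1.25, 1.2, 1.15, 1.1, 1.05, 1.0, 0.95, 0.9, 1.0])
resid_var = np.full(10, 4.0)
resid_var[-1] = 1e-10        # the market: beta 1, (almost) no residual variance

def gls_lambda(ER, beta, resid_var):
    """lambda = (beta' Sigma^-1 beta)^-1 beta' Sigma^-1 E_T(Re), diagonal Sigma."""
    w = 1.0 / resid_var
    return (beta * w) @ ER / ((beta * w) @ beta)

lam = gls_lambda(ER, beta, resid_var)   # essentially the market's own mean, 0.66
```

The huge weight 1/resid_var on the zero-residual asset makes every other portfolio's pricing error irrelevant to the fit, reproducing the time-series result.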

If we allow a free constant in the OLS cross-sectional regression, thus allowing a pricing error for the risk-free rate, you can see from Figure 29 that the OLS cross-sectional regression line will fit the size portfolios even better, though allowing a pricing error in the risk-free rate as well as the market return. However, a free intercept in an OLS regression on excess returns puts no weight at all on the intercept pricing error. It is a better idea to include the riskfree rate as a test asset, either directly by doing the whole thing in levels of returns rather than excess returns, or by adding E(Re) = 0, β = 0 to the cross-sectional regression. The GLS cross-sectional regression will notice that the T-bill rate has no residual variance and so will send the line right through the origin, as it does for the market return.

GMM/discount factor first and second stage

Figure 30 illustrates the GMM/discount factor estimate with the same data. The horizontal axis is the second moment of returns and factors rather than beta, but you would not know it from the placement of the dots. The first-stage estimate is an OLS cross-sectional regression of average returns on second moments. It minimizes the sum of squared pricing errors, and so produces pricing errors almost exactly equal to those of the OLS cross-sectional regression of returns on betas. The second-stage estimate minimizes pricing errors weighted by the spectral density matrix. The spectral density matrix is not the same as the residual covariance matrix, so the second-stage GMM does not go through the market portfolio as does the GLS cross-sectional regression. In fact, the slope of the line is slightly higher for the second-stage estimate.

(The spectral density matrix of the discount factor formulation does not reduce to the residual covariance matrix even if we assume the regression model, the asset pricing model is true, and factors and residuals are i.i.d. normal. In particular, when the market is a test asset, the GLS cross-sectional regression focuses all attention on the market portfolio, but the second-stage GMM/DF does not do so. The parameter b is related to λ by b = λ/E(Rem²). The other assets are still useful in determining the parameter b, even though, given the market return and the regression model Re_it = β_i Rem_t + ε_it, seeing the other assets does not help to determine the mean of the market return.)
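A sketch of the two GMM/DF stages on simulated data (made-up stand-in parameters; a zero-lag spectral density estimate for the second stage, as in Figure 30). The moment conditions are g_T(b) = E_T(Re) − E_T(Re f)b, so the first stage is an OLS cross-sectional regression of average returns on second moments, and the second stage reweights by the inverse spectral density.

```python
import numpy as np

rng = np.random.default_rng(2)
T, N = 876, 10

# Simulated stand-ins calibrated loosely to the monthly size-portfolio setting
f = 0.66 + 5.47 * rng.standard_normal(T)
beta = np.linspace(0.8, 1.4, N)
R = 0.66 * beta + np.outer(f - 0.66, beta) + 2.0 * rng.standard_normal((T, N))

# Moment conditions: g_T(b) = E_T(Re) - E_T(Re f) b
ER = R.mean(0)                    # average returns
d = (R * f[:, None]).mean(0)      # second moments E_T(Re f), the x-axis of Figure 30

# First stage: minimize g'g, an OLS cross-sectional regression of ER on d
b1 = (d @ ER) / (d @ d)

# Second stage: weight by the inverse spectral density of u_t = Re_t (1 - f_t b1),
# estimated here with zero lags (heteroskedasticity only)
u = R * (1 - b1 * f[:, None])
u = u - u.mean(0)
S = u.T @ u / T
Si = np.linalg.inv(S)
b2 = (d @ Si @ ER) / (d @ Si @ d)
```

Because S is not the residual covariance matrix, b2 need not coincide with the GLS cross-sectional result even in this clean setting; with these units b is on the order of λ/E(Rem²) ≈ 0.02.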

Overall, the figures do not suggest any strong reason to prefer first- and second-stage GMM/discount factor, time-series, OLS or GLS cross-sectional regression in this standard model and data set. The results are affected by the choice of method. In particular, the size of the small-firm anomaly is substantially affected by how one draws the market line. But the graphs and analysis do not strongly suggest that any method is better than any other for purposes other than fishing for the answer one wants.

Figure 30. Average excess return vs. predicted value of 10 CRSP size portfolios, 1926-1998, based on the GMM/SDF estimate. The model predicts E(Re) = bE(Re Rem). The second-stage estimate of b uses a spectral density estimate with zero lags.

Parameter estimates, standard errors, and tests

Table c1 presents the parameter estimates and standard errors from the time-series, cross-section, and GMM/discount factor approaches in the CAPM size-portfolio test illustrated by Figures 28 and 29. The main parameter to be estimated is the slope of the lines in the above figures: the market price of risk λ in the expected return-beta model, and the relation between mean returns and second moments b in the stochastic discount factor model. The big point of Table c1 is that the GMM/discount factor estimate and standard errors behave very similarly to the traditional estimates and standard errors.

The rows compare results with various methods of calculating the spectral density matrix. i.i.d. imposes no serial correlation and regression errors independent of right-hand variables, and is identical to the Maximum Likelihood based formulas. The 0 lag estimate allows conditional heteroskedasticity, but no correlation of residuals. The 3 lag, Newey-West estimate is a sensible correction for short-order autocorrelation. I include the 24 lag spectral density matrix to show how things can go wrong if you use a ridiculous spectral density matrix.
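The spectral density estimates named in these rows can be sketched as follows. This is an illustrative implementation for a T × m matrix of moment conditions or regression errors u, not the only way to compute it.

```python
import numpy as np

def spectral_density(u, lags=0, newey_west=True):
    """S = sum_j w_j E(u_t u_{t-j}'), the long-run covariance of the moments u (T x m).

    lags=0 allows conditional heteroskedasticity only. The Newey-West weights
    w_j = 1 - j/(lags+1) guarantee a positive semi-definite estimate; unweighted
    long-lag estimates (w_j = 1) need not be positive definite.
    """
    u = u - u.mean(0)
    T = u.shape[0]
    S = u.T @ u / T
    for j in range(1, lags + 1):
        w = 1 - j / (lags + 1) if newey_west else 1.0
        G = u[j:].T @ u[:-j] / T          # j-th sample autocovariance
        S += w * (G + G.T)
    return S
```

Calling this with lags=24 and newey_west=False reproduces the kind of estimate that turns out singular or indefinite in the experiments below.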



            Beta model λ                                GMM/DF b
            Time-        Cross section                  1st          2nd stage
            series       OLS          GLS               stage        Est.  Std. err.
Estimate    0.66         0.71         0.66              2.35
i.i.d.      0.18 (3.67)  0.20 (3.55)  0.18 (3.67)
0 lags      0.18 (3.67)  0.19 (3.74)  0.18 (3.67)       0.63 (3.73)  2.46  0.61 (4.03)
3 lags, NW  0.20 (3.30)  0.21 (3.38)  0.20 (3.30)       0.69 (3.41)  2.39  0.64 (3.73)
24 lags     0.16 (4.13)  0.16 (4.44)  0.16 (4.13)       1.00 (2.35)  2.15  0.69 (3.12)

Table c1. Parameter estimates and standard errors. Estimates appear in the "Estimate" row and the 2nd-stage "Est." column; the remaining entries are standard errors, with t-statistics in parentheses. The time-series estimate is the mean market return in percent per month. The cross-sectional estimate is the slope coefficient λ in E(Re) = βλ. The GMM estimate is the parameter b in E(Re) = E(Re f)b. CRSP monthly data 1926-1998. "Lags" gives the number of lags in the spectral density matrix. "NW" uses Newey-West weighting in the spectral density matrix.

The OLS cross-sectional estimate 0.71 is a little higher than the mean market return 0.66, in order to better fit all of the assets, as seen in Figure 29. The GLS cross-sectional estimate is almost exactly the same as the mean market return, and the GLS standard errors are almost exactly the same as the time-series standard errors. The Shanken correction for generated regressors is very important to the standard errors of the cross-sectional regressions. Without the Σ_f term in the standard deviation of λ (12.184), i.e. treating the β as fixed right-hand variables, the standard errors come out to 0.07 for OLS and 0.00 for GLS, far less than the correct 0.20 and 0.18 shown in the table, and far less than σ/√T.

The b estimates are not directly comparable to the risk premium estimates, but it is easy to translate their units. Applying the discount factor model with normalization a = 1 to the market return itself,

b = E(Rem) / E(Rem²).

With E(Rem) = 0.66% and σ(Rem) = 5.47%, we have 100 × b = 100 × 0.66/(0.66² + 5.47²) = 2.17. The entries in Table c1 are close to this magnitude. Most are slightly larger, as is the OLS cross-sectional regression, in order to better fit the other portfolios. The t-statistics are quite close across methods, which is another way to correct the units.
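The unit translation can be checked in one line, using the values quoted in the text:

```python
# E(Rem^2) = E(Rem)^2 + var(Rem), so b = E(Rem) / (E(Rem)^2 + sd(Rem)^2)
E_Rem, sd_Rem = 0.66, 5.47                         # percent per month
b_times_100 = 100 * E_Rem / (E_Rem**2 + sd_Rem**2)  # roughly 2.17
```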

The second-stage GMM/DF estimates (as well as standard errors) depend on which spectral density estimate is used as a weighting matrix. The results are quite similar for all the sensible spectral density estimates. The 24 lag spectral density matrix starts to produce unusual estimates. This spectral density estimate will cause lots of problems below.

Table c2 presents the χ2 and F statistics that test whether the pricing errors are jointly significant. The OLS and GLS cross-sectional regressions, and the first- and second-stage GMM/discount factor tests give exactly the same χ2 statistic, though the individual pricing errors and covariance matrix are not the same, so I do not present them separately. The big point of Table c2 is that the GMM/discount factor method gives almost exactly the same result as the cross-sectional regression.

              Time series       Cross section     GMM/DF
              χ2(10)   % p      χ2(9)    % p      χ2(9)    % p
i.i.d.           8.5    58         8.5    49
GRS F            0.8    59
0 lags          10.5    40        10.6    31       10.5    31
3 lags, NW      11.0    36        11.1    27       11.1    27
24 lags         -432   100         7.6    57        7.7    57

Table c2. χ2 tests that all pricing errors are jointly equal to zero.

For the time-series regression, the GRS F test gives almost exactly the same rejection probability as does the asymptotic χ2 test. Apparently, the advantage of a statistic that is valid in finite samples is not that important in this data set. The χ2 tests for the time-series case without the i.i.d. assumption are a bit more conservative, with 30-40% p-values rather than almost 60%. However, this difference is not large. The one exception is the χ2 test using 24 lags and no weights in the spectral density matrix. That matrix turns out not to be positive definite in this sample, with disastrous results for the χ2 statistic.

(Somewhat surprisingly, the CAPM is not rejected. This is because the small-firm effect vanishes in the latter part of the sample. I discuss this fact further in Chapter 20. See in particular Figure 28.)

Looking across the rows, the χ2 statistic is almost exactly the same for each method. The cross-sectional regression and GMM/DF estimates have one lower degree of freedom (the market premium is estimated from the cross-section rather than from the market return), and so show slightly greater rejection probabilities. For a given spectral density estimation technique, the cross-sectional regression and the GMM/DF approach give almost exactly the same χ2 values and rejection probabilities. The 24 lag spectral density matrix is a disaster as usual. In this case, it is a greater disaster for the time-series test than for the cross-section or GMM/discount factor test. It turns out not to be positive definite, so the sample pricing errors produce a nonsensical negative value of α̂′ cov(α̂)⁻¹ α̂.
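A two-asset numerical sketch (made-up numbers) shows how an indefinite "covariance" matrix produces a negative quadratic form, which is how a "χ2 statistic" of -432 can arise:

```python
import numpy as np

alpha = np.array([0.3, -0.1])                    # hypothetical pricing errors
W_good = np.array([[2.0, 0.5], [0.5, 1.0]])      # positive definite covariance
W_bad = np.array([[1.0, 2.0], [2.0, 1.0]])       # indefinite: eigenvalues -1 and 3

stat_good = alpha @ np.linalg.inv(W_good) @ alpha   # positive, a sensible chi^2
stat_bad = alpha @ np.linalg.inv(W_bad) @ alpha     # negative: nonsense as a chi^2
```

A genuine covariance matrix is positive semi-definite, so the quadratic form is always nonnegative; only a bad estimate of it can produce a negative "statistic."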

15.2 Monte Carlo and Bootstrap

The parameter distribution for the time-series regression estimate is quite similar to that from the GMM/discount factor estimate.

The size and power of χ2 test statistics are nearly identical for the time-series regression test and the GMM/discount factor test.

A bad spectral density matrix can ruin either time-series or GMM/discount factor estimates and tests.

There is enough serial correlation and heteroskedasticity in the data that conventional i.i.d. formulas produce test statistics with about 1/2 the correct size. If you want to do classic regression tests, you should correct the distribution theory rather than use the ML i.i.d. distributions.

Econometrics is not just about sensible point estimates; it is about the sampling variability of those estimates, and whether standard error formulas correctly capture that sampling variability. How well do the various standard error and test statistic formulas capture the true sampling distribution of the estimates? To answer this question I conduct two Monte Carlos and two bootstraps, one of each under the null that the CAPM is correct, to study size, and one of each under the alternative that the CAPM is false, to study power.

The Monte Carlo experiments follow the standard ML assumption that returns and factors are i.i.d. normally distributed, and the factors and residuals are independent as well as uncorrelated. I generate artificial samples of the market return from an i.i.d. normal, using the sample mean and variance of the value-weighted return. I then generate artificial size-decile returns under the null by Re_it = 0 + β_i Rem_t + ε_it, using the sample residual covariance matrix Σ to draw i.i.d. normal residuals ε_it and the sample regression coefficients β_i. To generate data under the alternative, I add the sample α_i. I draw 5000 artificial samples. I try a long sample of 876 months, matching the CRSP sample analyzed above. I also draw a short sample of 240 months or 20 years, which is about as short as one should dare try to test a factor model.
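One replication of this Monte Carlo design can be sketched as follows. The parameter values here are hypothetical stand-ins for the sample moments (the text uses the CRSP sample mean, variance, betas, residual covariance, and alphas), and the residual covariance is diagonal only for brevity.

```python
import numpy as np

rng = np.random.default_rng(3)
T, N = 876, 10

# Hypothetical stand-ins for the estimated sample moments
Ef, sdf = 0.66, 5.47                  # market mean and std. dev., percent per month
beta = np.linspace(0.8, 1.4, N)       # "sample" regression coefficients
Sigma = 4.0 * np.eye(N)               # "sample" residual covariance, diagonal here
alpha = np.linspace(0.0, 0.3, N)      # "sample" pricing errors, alternative only

def draw_sample(under_null=True):
    """One artificial sample: i.i.d. normal market, returns from the regression."""
    f = Ef + sdf * rng.standard_normal(T)
    eps = rng.multivariate_normal(np.zeros(N), Sigma, size=T)
    a = np.zeros(N) if under_null else alpha
    return f, a + np.outer(f, beta) + eps

f, R = draw_sample(under_null=True)
```

Repeating the draw 5000 times and applying each estimator and test to every artificial sample produces the size and power entries of the tables below.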

The bootstraps check whether non-normalities, autocorrelation, heteroskedasticity, and non-independence of factors and residuals matter to the sampling distribution in this data set. I do a block-bootstrap, resampling the data in groups of three months with replacement, to preserve the short-order autocorrelation and persistent heteroskedasticity in the data. To impose the CAPM, I draw the market return and residuals in the time-series regression, and then compute artificial data on decile portfolio returns by Re_it = 0 + β_i Rem_t + ε_it. To study the alternative, I simply redraw all the data in groups of three. Of course, the actual data may display conditioning information not captured by this bootstrap, such as predictability and conditional heteroskedasticity based on additional variables such as the dividend/price ratio, lagged squared returns, or implied volatilities.
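The block resampling can be sketched as follows (an illustrative helper, not from the text):

```python
import numpy as np

def block_bootstrap_indices(T, block=3, rng=None):
    """Time indices resampled in contiguous blocks of `block` observations with
    replacement, preserving short-order dependence within each block."""
    rng = np.random.default_rng() if rng is None else rng
    n_blocks = -(-T // block)                      # ceil(T / block)
    starts = rng.integers(0, T - block + 1, size=n_blocks)
    idx = (starts[:, None] + np.arange(block)).ravel()
    return idx[:T]

idx = block_bootstrap_indices(876, block=3, rng=np.random.default_rng(4))
# one bootstrap replication is then data[idx], for data with 876 rows
```

Applying the same index vector to returns and factors together preserves their contemporaneous relation in every redrawn sample.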

The first-stage GMM/discount factor and OLS cross-sectional regression are nearly identical in every artificial sample, as the GLS cross-sectional regression is nearly identical to the time-series regression in every sample. Therefore, the important question is to compare the time-series regression (which is ML with i.i.d. normal returns and factors) to the first- and second-stage GMM/DF procedure. For this reason and to save space, I do not include the cross-sectional regressions in the Monte Carlo and bootstrap.



χ2 tests

Table 6c presents the χ2 tests of the hypothesis that all pricing errors are zero under the null that the CAPM is true, and Table 7c presents the χ2 tests under the null that the CAPM is false. Each table presents the percentage of the 5000 artificial data sets in which the χ2 tests rejected the null at the indicated level. The central point of these tables is that the GMM/discount factor test performs almost exactly the same way as the time-series test. Compare each GMM/DF entry to its corresponding Time series entry; they are all nearly identical. Neither the small efficiency advantage of time series vs. cross section, nor the difference between betas and second moments, seems to make any difference to the sampling distribution.

                      Monte Carlo                          Block-Bootstrap
               Time series        GMM/DF             Time series        GMM/DF
Sample size:   240   876   876    240   876   876    240   876   876    240   876   876
Level (%):       5     5     1      5     5     1      5     5     1      5     5     1
i.i.d.         7.5   6.0   1.1                       6.0   2.8   0.6
0 lags         7.7   6.1   1.1    7.5   6.3   1.0    7.7   4.3   1.0    6.6   3.7   0.9
3 lags, NW    10.7   6.5   1.4    9.7   6.6   1.3   10.5   5.4   1.3    9.5   5.3   1.3
24 lags         25    39    32     25    41    31     23    38    31     24    41    32

Table 6c. Size. Probability of rejection for χ2 statistics under the null that all pricing errors are zero.

                      Monte Carlo                          Block-Bootstrap
               Time series        GMM/DF             Time series        GMM/DF
Sample size:   240   876   876    240   876   876    240   876   876    240   876   876
Level (%):       5     5     1      5     5     1      5     5     1      5     5     1
i.i.d.          17    48    26                        11    40    18
0 lags          17    48    26     17    50    27     15    54    28     14    55    29
3 lags, NW      22    49    27     21    51    29     18    57    31     17    59    33
24 lags         29    60    53     29    66    57     27    63    56     29    68    60

Table 7c. Power. Probability of rejection for χ2 statistics under the null that the CAPM is false, and the true means of the decile portfolio returns are equal to their sample means.

Start with the Monte Carlo evaluation of the time-series test in Table 6c. The i.i.d. and 0 lag distributions produce nearly exact rejection probabilities in the long sample and slightly too many (7.5%) rejections in the short sample. Moving down, GMM distributions here correct for things that aren't there. This has a small but noticeable effect on the sensible 3 lag test, which rejects slightly too often under this null. Naturally, this is worse for the short sample, but looking across the rows, the time-series and discount factor tests are nearly identical in every case. The variation across techniques is almost zero, given the spectral



density estimate. The 24 lag unweighted spectral density is the usual disaster, rejecting far too often. It is singular in many samples. In the long sample, the 1% tail of this distribution occurs at a χ2 value of 440 rather than the 23.2 of the χ2(10) distribution!

The long-sample block-bootstrap in the right half of the tables shows, even in this simple setup, how i.i.d. normal assumptions can be misleading. The traditional i.i.d. χ2 test has almost half the correct size: it rejects a 10% test 6% of the time, a 5% test 2.8% of the time, and a 1% test 0.6% of the time. Removing the assumption that returns and factors are independent, going from i.i.d. to 0 lags, brings about half of the size distortion back, while adding one of the sensible autocorrelation corrections does the rest. In each row, the time-series and GMM/DF methods produce almost exactly the same results again. The 24 lag spectral density matrices are a disaster as usual.

Table 7c shows the rejection probabilities under the alternative. The most striking feature

of the table is that the GMM/discount factor test gives almost exactly the same rejection

probability as the time-series test, for each choice of spectral density estimation technique.

When there is a difference, the GMM/discount factor test rejects slightly more often. The

24 lag tests reject most often, but this is not surprising given that they reject almost as often

under the null.

Parameter estimates and standard errors

Table c5 presents the sampling variation of the λ and b estimates. The rows and columns marked σ(λ̂), σ(b̂) give the variation of the estimated λ or b across the 5000 artificial samples. The remaining rows and columns give the average across samples of the standard errors. The presence of pricing errors has little effect on the estimated b or λ and their standard errors, so I only present results under the null that the CAPM is true. The parameters are not directly comparable; the b parameter includes the variance as well as the mean of the factor, and E_T(Rem) is the natural GMM estimate of the mean market return just as it is the time-series estimate of the factor risk premium. Still, it is interesting to know and to compare how well the two methods do at estimating their central parameter.


                     Monte Carlo                        Block-Bootstrap
             Time    GMM/DF     GMM/DF 2nd stage   Time    GMM/DF     GMM/DF 2nd stage
             series  1st stage  σ(b̂)    E (s.e.)   series  1st stage  σ(b̂)    E (s.e.)
T=876:
σ(λ̂), σ(b̂)   0.19    0.64                          0.20    0.69
i.i.d.       0.18                                  0.18
0 lags       0.18    0.65       0.61    0.60       0.18    0.63       0.67    0.60
3 lags NW    0.18    0.65       0.62    0.59       0.19    0.67       0.67    0.62
24 lags      0.18    0.62       130     0.27       0.19    0.66       1724    0.24
T=240:
σ(λ̂), σ(b̂)   0.35    1.25                          0.37    1.40
i.i.d.       0.35                                  0.35
0 lags       0.35    1.23       1.24    1.14       0.35    1.24       1.45    1.15
3 lags NW    0.35    1.22       1.26    1.11       0.36    1.31       1.48    1.14
24 lags      0.29    1.04       191     0.69       0.31    1.15       893     0.75

Table 5. Monte Carlo and block-bootstrap evaluation of the sampling variability of parameter estimates b and λ. The Monte Carlo redraws 5000 artificial data sets of length T=876 from a random normal, assuming that the CAPM is true. The block-bootstrap redraws the data in groups of 3 with replacement. The rows and columns marked σ(λ̂) and σ(b̂), in italic font, give the variation across samples of the estimated λ and b. The remaining entries of the “Time series,” “1st stage,” and “E (s.e.)” columns, in roman font, give the average value of the computed standard error of the parameter estimate, where the average is taken over the 5000 samples.

The central message of this table is that the GMM/DF estimates behave almost exactly like the time-series estimates, and the asymptotic standard error formulas almost exactly capture the sampling variation of the estimates. The second stage GMM/DF estimate is a little more efficient, at the cost of slightly misleading standard errors.

Start with the long sample and the first column. All of the standard error formulas give essentially identical and correct results for the time-series estimate. Estimating the sample mean is not rocket science. The first stage GMM/DF estimator in the second column behaves the same way, except for the usually troublesome 24 lag unweighted estimate.

The second stage GMM/DF estimate in the third and fourth columns uses the inverse spectral density matrix to weight, and so the estimator depends on the choice of spectral density estimate. The sensible spectral density estimates (not 24 lags) produce second-stage estimates that vary less than the first-stage estimates, 0.61 to 0.62 rather than 0.64. Second stage GMM is more efficient, meaning that it produces estimates with smaller sampling variation. However, the table shows that the efficiency gain is quite small, so not much is lost if one prefers first stage OLS estimates. The sensible spectral density estimates produce second-stage standard errors that again almost exactly capture the sampling variation of the estimated parameters.
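The “3 lags NW” rows refer to a Newey-West spectral density estimate: Bartlett-weighted sample autocovariances of the moment series. As a reference point, here is a minimal sketch of that estimator; this is my own illustration of the standard formula, not code from the text:

```python
import numpy as np

def newey_west(u, lags):
    """Newey-West estimate of S = sum_j w_j E[u_t u_{t-j}'] from a T x k series."""
    u = u - u.mean(axis=0)
    T = u.shape[0]
    S = u.T @ u / T                      # j = 0 term
    for j in range(1, lags + 1):
        w = 1 - j / (lags + 1)           # Bartlett weights keep S positive semi-definite
        G = u[j:].T @ u[:-j] / T         # j-th sample autocovariance
        S += w * (G + G.T)
    return S

# Demo on white noise: the estimate should be close to the identity covariance.
rng = np.random.default_rng(0)
S = newey_west(rng.normal(size=(2000, 2)), lags=3)
print(S)
```

The unweighted 24-lag estimate discussed in the text drops the Bartlett weights and uses many lags, which is what destroys positive definiteness in finite samples.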

The 24 lag unweighted estimate produces hugely variable estimates and artificially small standard errors. Using bad or even singular spectral density estimates seems to have only a secondary effect on standard error calculations, but using their inverse as a weighting matrix can have a dramatic effect on estimation.

With the block-bootstrap in the right hand side of Table 5, the time-series estimate is slightly more volatile as a result of the slight autocorrelation in the market return. The i.i.d. and zero lag formulas do not capture this effect, but the GMM standard errors that allow autocorrelation do pick it up. However, this is a very minor effect, as there is very little autocorrelation in the market return. The effect is more pronounced in the first stage GMM/DF estimate, since the smaller firm portfolios depart more from the normal i.i.d. assumption. The true variation is 0.69, but standard errors that ignore autocorrelation only produce 0.63. The standard errors that correct for autocorrelation are nearly exact. In the second-stage GMM/DF, the sensible spectral density estimates again produce slightly more efficient estimates than the first stage, with variation of 0.67 rather than 0.69. This comes at a cost, though: the asymptotic standard errors are a bit less reliable.

In the shorter sample, we see that standard errors for the mean market return in the Time series column are all quite accurate, except in the usual 24 lag case. In the GMM/DF case, we see that the actual sampling variability of the b estimate is no longer smaller for the second stage. The second stage estimate is not more efficient in this “small” sample. Furthermore, while the first stage standard errors are still decently accurate, the second stage standard errors substantially understate the true sampling variability of the parameter estimate. They represent a hoped-for efficiency that is not present in the small sample. Even in this simple setup, first-stage GMM is clearly a better choice for estimating the central parameter, and hence for examining individual pricing errors and their pattern across assets.


Chapter 16. Which method?

Of course, the point of GMM/discount factor methods is not a gain in efficiency or simplicity in a traditional setup (linear factor model, i.i.d. normally distributed returns and factors, etc.). It’s hard to beat the efficiency or simplicity of regression methods in those setups. The point of the GMM/discount factor approach is that it allows a simple technique for evaluating nonlinear or otherwise complex models, for including conditioning information while not requiring the econometrician to see everything that the agent sees, and for allowing the researcher to circumvent inevitable model misspecifications or simplifications and data problems by keeping the econometrics focused on interesting issues.

The alternative is usually some form of maximum likelihood. This is much harder in most circumstances, since you have to write down a complete statistical model for the conditional distribution of your data. Just evaluating, let alone maximizing, the likelihood function is often challenging. Whole series of papers are written just on the econometric issues of particular cases, for example how to maximize the likelihood functions of specific classes of univariate continuous-time models for the short interest rate.

Of course, there is no necessary pairing of GMM with the discount factor expression of a model, and ML with the expected return-beta formulation. Many studies pair discount factor expressions of the model with ML, and many others evaluate expected return-beta models by GMM, as we have done in adjusting regression standard errors for non-i.i.d. residuals.

Advanced empirical asset pricing faces an enduring tension between these two philosophies. The choice essentially involves tradeoffs between statistical efficiency, the effects of misspecification of both the economic and statistical models, and the clarity and economic interpretability of the results. There are situations in which it’s better to trade some small efficiency gains for the robustness of simpler procedures or more easily interpretable moments; OLS can be better than GLS. The central reason is specification errors: the fact that our statistical and economic models are at best quantitative parables. There are other situations in which one may really need to squeeze every last drop out of the data, intuitive moments are statistically very inefficient, and more intensive maximum-likelihood approaches are more appropriate. Unfortunately, the environments are complex, and differ from case to case. We don’t have universal theorems from statistical theory or generally applicable Monte Carlo evidence. Specification errors by their nature resist quantitative modeling: if you knew how to model them, they wouldn’t be there. We can only think about the lessons of past experiences. In my experience, in the limited range of applications I have worked with, a GMM approach based on simple, easily interpretable moments has proved far more fruitful than formal maximum likelihood. In addition, I have found first stage GMM (OLS cross-sectional regressions) to be more trustworthy than second-stage GMM, in any case where there was a substantial difference between the two approaches.

The rest of this chapter collects some thoughts on the choice between formal ML and

less formal GMM, focusing on economically interesting rather than statistically informative

moments.


“ML” vs. “GMM”

The debate is often stated as a choice between “maximum likelihood” and “GMM.” This is a bad way to put the issue. ML is a special case of GMM: it suggests a particular choice of moments that are statistically optimal in a well-defined sense. Given the set of moments, the distribution theories are identical. Also, there is no such thing as “the” GMM estimate. GMM is a flexible tool; you can use any aT matrix and gT moments that you want to use. For example, we saw how to use GMM to derive the asymptotic distribution of the standard time-series regression estimator with autocorrelated returns. The moments in this case were not the pricing errors. It’s all GMM; the issue is the choice of moments. Both ML and GMM are tools that a thoughtful researcher can use in learning what the data says about a given asset pricing model, rather than as stone tablets giving precise directions that lead to truth if followed literally. If followed literally and thoughtlessly, both ML and GMM can lead to horrendous results.

The choice is between moments selected by an auxiliary statistical model, even if completely economically uninterpretable, and moments selected for their economic or data-summary interpretation, even if not statistically efficient.

ML is often ignored

As we have seen, ML plus the assumption of normal i.i.d. disturbances leads to easily interpretable time-series or cross-sectional regressions, empirical procedures that are close to the economic content of the model. However, asset returns are not normally distributed or i.i.d. They have fatter tails than a normal, they are heteroskedastic (times of high and times of low volatility), they are autocorrelated, and they are predictable from a variety of variables. If one were to take seriously the ML philosophy and its quest for efficiency, one should model these features of returns. The result would be a different likelihood function, and its scores would prescribe different moment conditions than the familiar and intuitive time-series or cross-sectional regressions.

Interestingly, few empirical workers do this. (The exceptions tend to be papers whose primary point is illustration of econometric technique rather than empirical findings.) ML seems to be fine when it suggests easily interpretable regressions; when it suggests something else, people use the regressions anyway.

For example, ML prescribes that one estimate βs without a constant. βs are almost universally estimated with a constant. Researchers often run cross-sectional regressions rather than time-series regressions, even when the factors are returns. ML specifies a GLS cross-sectional regression, but many empirical workers use OLS cross-sectional regressions instead, distrusting the GLS weighting matrix. Time-series regressions are almost universally run with a constant, though ML prescribes a regression with no constant. The true ML formulas for GLS regressions require one to iterate between non-OLS formulas for betas, the covariance matrix estimate, and the cross-sectional regression estimate. Empirical applications usually use the unconstrained estimates of all these quantities. And of course, the regression tests continue to be run at all, with ML justifications, despite the fact that returns are not i.i.d. The regressions came first, and the maximum likelihood formalization came later. If we had to assume that returns had a gamma distribution to justify the regressions, it’s a sure bet that we would make that “assumption” behind ML instead of the normal i.i.d. assumption!

The reason must be that researchers feel that by omitting some of the information in the null hypothesis, the estimation and test is more robust, though some efficiency is lost if the null economic and statistical models are exactly correct. Researchers must not really believe that their null hypotheses, statistical and economic, are exactly correct. They want to produce estimates and tests that are robust to reasonable model mis-specifications. They also want to produce estimates and tests that are easily interpretable, that capture intuitively clear stylized facts in the data, and that relate directly to the economic concepts of the model. Such estimates are persuasive in large part because the reader can see that they are robust. (And following this train of thought, one might want to pursue estimation strategies that are even more robust than OLS, since OLS places a lot of weight on outliers. For example, Chen and Ready 1997 claim that Fama and French’s 1993 size and value effects depend crucially on a few outliers.)

ML does not necessarily produce robust or easily interpretable estimates. It wasn’t designed to do so. The point and advertisement of ML is that it provides efficient estimates; it uses every scrap of information in the statistical and economic model in the quest for efficiency. It does the “right” efficient thing if the model is true. It does not necessarily do the “reasonable” thing for “approximate” models.

OLS vs. GLS cross-sectional regressions

One place in which this argument crystallizes is in the choice between OLS and GLS cross-sectional regressions, or equivalently between first and second stage GMM.

The last chapter can lead to a mistaken impression that the choice doesn’t matter that much. This is true to some extent in that simple environment, but not in more complex environments. For example, Fama and French (1997) report important correlations between betas and pricing errors in a time-series test of a three-factor model on industry portfolios. This correlation cannot happen with an OLS cross-sectional estimate, as the cross-sectional estimate sets the cross-sectional correlation between right hand variables (betas) and error terms (pricing errors) to zero by construction. First stage estimates seem to work better in factor pricing models based on macroeconomic data. For example, Figure 5 presents the first stage estimate of the consumption-based model. The second-stage estimate produced much larger individual pricing errors, because by so doing it could lower the pricing errors of portfolios with strong long-short positions required by the spectral density matrix. The same thing happened in the investment-based factor pricing model of Cochrane (1996), and in the scaled consumption-based model of Lettau and Ludvigson (2000). Authors as far back as Fama and MacBeth (1973) have preferred OLS cross-sectional regressions, distrusting the GLS weights.

GLS and second-stage GMM gain their asymptotic efficiency when the covariance and spectral density matrices have converged to their population values. GLS and second stage GMM use these matrices to find well-measured portfolios: portfolios with small residual variance for GLS, and small variance of discounted return for GMM. The danger is that these quantities are poorly estimated in a finite sample, so that sample minimum-variance portfolios bear little relation to population minimum-variance portfolios. This by itself should not create too much of a problem for a perfect model, one that prices all portfolios. But an imperfect model that does a very good job of pricing a basic set of portfolios may do a poor job of pricing strange linear combinations of those portfolios, especially combinations that involve strong long and short positions, positions that really are outside the payoff space given transactions, margin, and short sales constraints. Thus, the danger is the interaction between spurious sample minimum-variance portfolios and the specification errors of the model.

Interestingly, Kandel and Stambaugh (1995) and Roll and Ross (1995) argue for GLS cross-sectional regressions also as a result of model misspecification. They start by observing that so long as there is any misspecification at all (so long as the pricing errors are not exactly zero; so long as the market proxy is not exactly on the mean-variance frontier) then there are portfolios that produce arbitrarily good and arbitrarily bad fits in plots of expected returns vs. betas. Since even a perfect model leaves pricing errors in sample, this is always true in samples.

It’s easy to see the basic argument. Take a portfolio long the positive alpha securities and short the negative alpha securities; it will have a really big alpha! More precisely, if the original securities follow

E(Re) = α + λβ,

then consider portfolios of the original securities formed from a non-singular matrix A. They follow

E(ARe) = Aα + λAβ.

You can make all these portfolios have the same β with Aβ = constant, and then they will have a spread in alphas. You will see a plot in which all the portfolios have the same beta but the average returns are spread up and down. Conversely, you can pick A to make the expected return-beta plot look as good as you want.

GLS has an important feature in this situation: the GLS cross-sectional regression is independent of such repackaging of portfolios. If you transform a set of returns Re to ARe, then the OLS cross-sectional regression is transformed from

λ̂ = (β′β)⁻¹ β′E(Re)

to

λ̂ = (β′A′Aβ)⁻¹ β′A′A E(Re).

This does depend on the repackaging A. However, the residual covariance matrix of ARe is AΣA′, so the GLS regression

λ̂ = (β′Σ⁻¹β)⁻¹ β′Σ⁻¹E(Re)

is not affected, so long as A is full rank and therefore does not throw away information:

λ̂ = (β′A′(AΣA′)⁻¹Aβ)⁻¹ β′A′(AΣA′)⁻¹A E(Re) = (β′Σ⁻¹β)⁻¹ β′Σ⁻¹E(Re).

(The spectral density matrix and second stage estimate share this property in GMM estimates. These are not the only weighting matrix choices that are invariant to repackaging of portfolios. For example, Hansen and Jagannathan’s 1997 suggestion of the return second moment matrix has the same property.)
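The invariance is easy to verify numerically. Here is a small sketch, with made-up betas, premia, and pricing errors chosen only for illustration: repackage the assets with a random full-rank A, and the GLS estimate is unchanged while the OLS estimate moves.

```python
import numpy as np

rng = np.random.default_rng(0)
N, K = 8, 2
beta = rng.normal(size=(N, K))                      # betas of N assets on K factors
Sigma = np.diag(rng.uniform(0.5, 2.0, size=N))      # residual covariance
ER = beta @ np.array([0.5, 0.3]) + 0.05 * rng.normal(size=N)  # E(Re), with pricing errors

def ols_lambda(beta, ER):
    return np.linalg.solve(beta.T @ beta, beta.T @ ER)

def gls_lambda(beta, ER, Sigma):
    Si = np.linalg.inv(Sigma)
    return np.linalg.solve(beta.T @ Si @ beta, beta.T @ Si @ ER)

# Repackage with full-rank A: betas become A beta, means A E(Re), covariance A Sigma A'.
A = rng.normal(size=(N, N))
lam_ols, lam_ols_A = ols_lambda(beta, ER), ols_lambda(A @ beta, A @ ER)
lam_gls = gls_lambda(beta, ER, Sigma)
lam_gls_A = gls_lambda(A @ beta, A @ ER, A @ Sigma @ A.T)

print(lam_ols, lam_ols_A)   # OLS estimates change with the repackaging
print(lam_gls, lam_gls_A)   # GLS estimates are identical
```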

This is a fact, but it does not show that OLS chooses a particularly good or bad set of portfolios. Perhaps you don’t think that GLS’ choice of portfolios is particularly informative. In this case, you use OLS precisely to focus attention on a particular set of economically interesting portfolios.

The choice depends subtly on what you want your test to accomplish. If you want to prove the model wrong, then GLS helps you to focus on the most informative portfolios for proving the model wrong. That is exactly what an efficient test is supposed to do. But many models are wrong, yet still pretty darn good. It is a shame to throw out the information that the model does a good job of pricing an interesting set of portfolios. The sensible compromise would seem to be to report the OLS estimate on “interesting” portfolios, and also to report the GLS test statistic that shows the model to be rejected. That is, in fact, the typical collection of facts.

Additional examples of trading off efficiency for robustness

Here are some additional examples of situations in which it has turned out to be wise to trade off some apparent efficiency for robustness to model misspecifications.

Low frequency time-series models. In estimating time-series models such as the AR(1), maximum likelihood minimizes the one-step ahead forecast error variance, Σt εt². But any time-series model is only an approximation, and the researcher’s objective may not be one-step ahead forecasting. For example, in making sense of the yield on long term bonds, one is interested in the long-run behavior of the short rate of interest. In estimating the magnitude of long-horizon univariate mean reversion in stock returns, we want to know only the sum of autocorrelations or moving average coefficients. (We will study this application in section 20.335.) The approximate model that generates the smallest one-step ahead forecast error variance may be quite different from the model that best matches long-run autocorrelations. ML can pick the wrong model and make very bad predictions for long-run responses. (Cochrane 1986 contains a more detailed analysis of this point in the context of long-horizon GDP forecasting.)

Lucas’ money demand estimate. Lucas (1988) is a gem of an example. Lucas was interested in estimating the income elasticity of money demand. Money and income trend upwards over time and over business cycles, but also have some high-frequency movement that looks like noise. If you run a regression in log-levels,

mt = a + byt + εt,

you get a sensible coefficient of about b = 1, but you find that the error term is strongly serially correlated. Following standard advice, most researchers run GLS, which amounts pretty much to first-differencing the data,

mt − mt−1 = b(yt − yt−1) + ηt.

This error term passes its Durbin-Watson statistic, but the b estimate is much lower, which doesn’t make much economic sense, and, worse, is unstable, depending a lot on time period and data definitions. Lucas realized that the regression in differences threw out all of the information in the data, which was in the trend, and focused on the high-frequency noise. Therefore, the regression in levels, with standard errors corrected for correlation of the error term, is the right one to look at. Of course, GLS and ML didn’t know there was any “noise” in the data, which is why they threw out the baby and kept the bathwater. Again, ML ruthlessly exploits the null for efficiency, and has no way of knowing what is “reasonable” or “intuitive.”
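A stylized simulation makes Lucas’ point concrete. This is my own illustration, not Lucas’ data or setup: log income follows a trend, both observed series carry i.i.d. high-frequency noise, and the true elasticity is 1. The levels regression recovers b ≈ 1 because the trend dominates; the differenced regression is dominated by the noise and collapses toward zero.

```python
import numpy as np

rng = np.random.default_rng(0)
T = 400
y_true = 0.01 * np.arange(T) + np.cumsum(0.01 * rng.normal(size=T))  # trending log income
y = y_true + 0.02 * rng.normal(size=T)   # observed income: trend plus high-frequency noise
m = y_true + 0.02 * rng.normal(size=T)   # log money: unit elasticity plus noise

def slope(x, z):
    x, z = x - x.mean(), z - z.mean()
    return x @ z / (x @ x)

b_levels = slope(y, m)                    # near 1: the trend carries the signal
b_diff = slope(np.diff(y), np.diff(m))    # much smaller: differencing keeps mostly noise
print(b_levels, b_diff)
```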

Stochastic singularities and calibration. Models of the term structure of interest rates (we will study these models in section 19) and real business cycle models in macroeconomics give even more stark examples. These models are stochastically singular. They generate predictions for many time series from a few shocks, so the models predict that there are combinations of the time series that leave no error term. Even though the models have rich and interesting implications, ML will seize on this economically uninteresting singularity, refuse to estimate parameters, and reject any model of this form.

The simplest example of the situation is the linear-quadratic permanent income model paired with an AR(1) specification for income. The model is

y_t = ρ y_{t−1} + ε_t

c_t − c_{t−1} = (E_t − E_{t−1}) (1 − β) Σ_{j=0..∞} β^j y_{t+j} = [(1 − β)/(1 − βρ)] ε_t.

This model generates all sorts of important and economically interesting predictions for the joint process of consumption and income (and asset prices). Consumption should be roughly a random walk, and should respond only to permanent income changes; investment should be more volatile than income, and income more volatile than consumption. Since there is only one shock and two series, however, the model taken literally predicts a deterministic relation between consumption and income; with β = 1/(1 + r), so that rβ = 1 − β, it predicts

c_t − c_{t−1} = [rβ/(1 − βρ)] (y_t − ρ y_{t−1}).


ML will notice that this is the statistically most informative prediction of the model. There is no error term! In any real data set there is no configuration of the parameters r, β, ρ that makes this restriction hold, data point for data point. The probability of observing a data set {ct, yt} is exactly zero, and the log likelihood function is −∞ for any set of parameters. ML says to throw the model out.
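A quick simulation, with arbitrary parameter values of my own choosing, makes the singularity concrete: generate income from the AR(1), construct consumption changes from the model, and the deterministic relation holds to machine precision. Any real data, which contain a residual, therefore receive zero likelihood.

```python
import numpy as np

rng = np.random.default_rng(0)
r, rho = 0.05, 0.9
beta = 1 / (1 + r)
T = 200

eps = rng.normal(size=T)
y = np.zeros(T)
for t in range(1, T):
    y[t] = rho * y[t - 1] + eps[t]           # AR(1) income

dc = (r * beta / (1 - beta * rho)) * eps     # model-implied consumption changes

# The model's singular prediction: dc_t is an exact function of y_t - rho*y_{t-1}.
resid = dc[1:] - (r * beta / (1 - beta * rho)) * (y[1:] - rho * y[:-1])
print(np.abs(resid).max())   # zero to machine precision: no error term
```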

The popular affine models of the term structure of interest rates act the same way. They specify that all yields at any moment in time are deterministic functions of a few state variables. Such models can capture much of the important qualitative behavior of the term structure, including rising, falling, and humped shapes, and the information in the term structure for future movements in yields and the volatility of yields. They are very useful for derivative pricing. But it is never the case in actual yield data that yields of all maturities are exact functions of K yields. Actual data on N yields always require N shocks. Again, an ML approach reports a −∞ log likelihood function for any set of parameters.

Addressing model mis-specification

The ML philosophy offers an answer to model mis-specification: specify the right model, and then do ML. If regression errors are correlated, model and estimate the covariance matrix and do GLS. If you are worried about proxy errors in the pricing factor, short sales costs or other transactions costs (so that model predictions for extreme long-short positions should not be relied on), time-aggregation or mismeasurement of consumption data, non-normal or non-i.i.d. returns, time-varying betas and factor risk premia, additional pricing factors and so on: don’t chat about them; write them down, and then do ML.

Following this lead, researchers have added “measurement errors” to real business cycle models (Sargent 1989 is a classic example) and affine yield models in order to break the stochastic singularity (I discuss this case a bit more in section 19.6). The trouble is, of course, that the assumed structure of the measurement errors now drives what moments ML pays attention to. And seriously modeling and estimating the measurement errors takes us further away from the economically interesting parts of the model. (Measurement error augmented models will often wind up specifying sensible moments, but by assuming ad-hoc processes for measurement error, such as i.i.d. errors. Why not just specify the sensible moments in the first place?)

More generally, authors tend not to follow this advice, in part because it is ultimately infeasible. Economics necessarily studies quantitative parables rather than completely specified models. It would be nice if we could write down completely specified models, if we could quantitatively describe all the possible economic and statistical model and specification errors, but we can’t.

The GMM framework, used judiciously, allows us to evaluate misspecified models. It allows us to direct the statistical effort to focus on the “interesting” predictions while ignoring the fact that the world does not match the “uninteresting” simplifications. For example, ML only gives us a choice of OLS, whose standard errors are wrong, or GLS, which we may not trust in small samples or which may focus on uninteresting parts of the data. GMM allows us to keep an OLS estimate, but correct the standard errors for non-i.i.d. distributions. More generally, GMM allows one to specify an economically interesting set of moments, or a set of moments that one feels will be robust to misspecifications of the economic or statistical model, without having to spell out exactly what is the source of model mis-specification that makes those moments “optimal” or even “interesting” and “robust.” It allows one to accept the lower “efficiency” of the estimates under some sets of statistical assumptions, in return for such robustness.

At the same time, the GMM framework allows us to flexibly incorporate statistical model misspecifications in the distribution theory. For example, knowing that returns are not i.i.d. normal, one may want to use the time-series regression technique to estimate betas anyway. This estimate is not inconsistent, but the standard errors that ML formulas pump out under this assumption are inconsistent. GMM gives a flexible way to derive at least an asymptotic set of corrections for statistical model misspecifications of the time-series regression coefficient. Similarly, a pooled time-series cross-sectional OLS regression is not inconsistent, but standard errors that ignore cross-correlation of error terms are far too small.
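The pooled-regression point is easy to see in a simulation. This sketch is my own illustration (all parameter values invented): a panel in which a common time shock cross-correlates both the regressor and the errors. Naive OLS standard errors treat all N×T observations as independent and come out far too small; standard errors clustered by time period do not.

```python
import numpy as np

rng = np.random.default_rng(1)
N, T, b = 50, 100, 1.0
x = rng.normal(size=(T, 1)) + rng.normal(size=(T, N))        # regressor with a common time component
e = rng.normal(size=(T, 1)) + 0.5 * rng.normal(size=(T, N))  # errors share a time shock too
y = b * x + e

X, Y = x.ravel(), y.ravel()
bhat = X @ Y / (X @ X)
u = y - bhat * x                          # T x N residual matrix

# Naive OLS standard error: treats all N*T observations as independent
se_naive = np.sqrt((u.ravel() @ u.ravel()) / (N * T - 1) / (X @ X))

# Clustered by time: sum the scores x_it * u_it within each date before squaring
h = (x * u).sum(axis=1)
se_cluster = np.sqrt(h @ h) / (X @ X)

print(se_naive, se_cluster)   # the clustered s.e. is several times larger
```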

The “calibration” of real business cycle models is often really nothing more than a GMM parameter estimate, using economically sensible moments such as average output growth, consumption/output ratios, etc. to avoid the stochastic singularity that would doom an ML approach. (Kydland and Prescott’s 1982 idea that empirical microeconomics would provide accurate parameter estimates for macroeconomic and financial models has pretty much vanished.) Calibration exercises usually do not compute standard errors, nor do they report any distribution theory associated with the “evaluation” stage, when one compares the model’s predicted second moments with those in the data. Following Burnside, Eichenbaum, and Rebelo (1993), however, it’s easy enough to calculate such a distribution theory (to evaluate whether the difference between predicted second moments and actual moments is large compared to sampling variation, including the variation induced by parameter estimation in the same sample) by listing the first and second moments together in the gT vector.

“Used judiciously” is an important qualification. Many GMM estimations and tests suffer from lack of thought in the choice of moments, test assets, and instruments. For example, early GMM papers tended to pick assets and especially instruments pretty much at random. Industry portfolios have almost no variation in average returns to explain. Authors often included many lags of returns and consumption growth as instruments to test a consumption-based model. However, the 7th lag of returns really doesn’t predict much about future returns given lags 1-12, and the first-order serial correlation in seasonally adjusted, ex-post revised consumption growth may be economically uninteresting. More recent work tends to emphasize a few well-chosen assets and instruments that capture important and economically interesting features of the data.

Auxiliary model

ML requires an auxiliary statistical model. For example, in the classic ML formalization of regression tests, we had to stop to assume that returns and factors are jointly i.i.d. normal. As the auxiliary statistical model becomes more and more complex, and hence more realistic, more and more effort is devoted to estimating the auxiliary statistical model. ML has no way of knowing that some parameters (a, b; β, λ; risk aversion γ) are more “important” than others (Σ, and parameters describing time-varying conditional moments of returns).

A very convenient feature of GMM is that it does not require such an auxiliary statistical model. For example, in studying GMM we went straight from p = E(mx) to moment conditions, estimates, and distribution theory. This is an important saving of the researcher’s and the reader’s time, effort, and attention.

Finite sample distributions

Many authors say they prefer regression tests, and the GRS statistic in particular, because it has a finite sample distribution theory, and they distrust the finite-sample performance of the GMM asymptotic distribution theory.

This argument does not have much force. The finite sample distribution only holds if returns really are normal and i.i.d., and if the factor is perfectly measured. Since these assumptions do not hold, it is not obvious that a finite-sample distribution that ignores non-i.i.d. returns will be a better approximation than an asymptotic distribution that corrects for them.

All approaches give essentially the same answers in the classic setup of i.i.d. returns. The issue is how the various techniques perform in more complex setups, especially with conditioning information, and here there are no analytic finite-sample distributions.

In addition, once you have picked the estimation method -- how you will generate a number from the data, or which moments you will use -- finding its finite-sample distribution, given an auxiliary statistical model, is simple: just run a Monte Carlo or bootstrap. Thus, picking an estimation method because it delivers analytic formulas for a finite-sample distribution (under false assumptions) should be a thing of the past. Analytic formulas for finite-sample distributions are useful for comparing estimation methods and arguing about statistical properties of estimators, but they are not necessary for the empiricists' main task.
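The bootstrap route is easy to make concrete. The sketch below is my own illustration, not from the text: it bootstraps the finite-sample distribution of a mean-return t-statistic from simulated monthly returns; the function name and all the numbers are assumptions.

```python
import numpy as np

def bootstrap_distribution(data, statistic, n_boot=1000, seed=0):
    """Finite-sample distribution of `statistic` by resampling observations of `data`."""
    rng = np.random.default_rng(seed)
    T = len(data)
    draws = np.empty(n_boot)
    for b in range(n_boot):
        idx = rng.integers(0, T, size=T)   # resample time periods with replacement
        draws[b] = statistic(data[idx])
    return draws

# Example: finite-sample distribution of a mean-return t-statistic
rng = np.random.default_rng(1)
returns = rng.normal(0.005, 0.05, size=240)     # 20 years of simulated monthly returns
t_stat = lambda r: np.sqrt(len(r)) * r.mean() / r.std(ddof=1)
draws = bootstrap_distribution(returns, t_stat)
lo, hi = np.percentile(draws, [2.5, 97.5])      # finite-sample 95% band for the statistic
```

With serially correlated data one would resample blocks rather than single observations, but the logic is the same.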

Finite sample quality of asymptotic distributions, and "nonparametric" estimates

Several investigations (Ferson and Foerster 1994; Hansen, Heaton, and Yaron 1996) have found cases in which the GMM asymptotic distribution theory is a poor approximation to a finite-sample distribution theory. This is especially true when one asks the "nonparametric" corrections for autocorrelation or heteroskedasticity to provide large corrections, when the number of moments is large compared to the sample size, or if the moments one uses for GMM turn out to be very inefficient (Fuhrer, Moore, and Schuh 1995), which can happen if you put in a lot of instruments with low forecast power.

The ML distribution is the same as GMM, conditional on the choice of moments, but typical implementations of ML also use the parametric time-series model to simplify estimates of the terms in the distribution theory, as well as to derive the likelihood function.

CHAPTER 16 WHICH METHOD?

If this is the case -- if the "nonparametric" estimates of the GMM distribution theory perform poorly in a finite sample, while the "parametric" ML distribution works well -- there is no reason not to use a parametric time-series model to estimate the terms in the GMM distribution as well. For example, rather than calculate $\sum_{j=-\infty}^{\infty} E(u_t u_{t-j})$ from a large sum of autocorrelations, you can model $u_t = \rho u_{t-1} + \varepsilon_t$, estimate $\rho$, and then calculate $\sigma^2(u) \sum_{j=-\infty}^{\infty} \rho^{|j|} = \sigma^2(u)\,\frac{1+\rho}{1-\rho}$. Section 11.7 discussed this idea in more detail.
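As a sketch of this calculation (illustrative numbers, not from the text), the following simulates an AR(1) error and computes the long-run variance both ways: the parametric shortcut $\sigma^2(u)(1+\hat\rho)/(1-\hat\rho)$, and a truncated sum of sample autocovariances.

```python
import numpy as np

rng = np.random.default_rng(0)
rho_true, T = 0.6, 2000
eps = rng.normal(size=T)
u = np.empty(T)
u[0] = eps[0] / np.sqrt(1 - rho_true**2)        # start from the stationary distribution
for t in range(1, T):
    u[t] = rho_true * u[t-1] + eps[t]           # AR(1): u_t = rho u_{t-1} + eps_t

# Parametric: estimate rho by OLS, then long-run variance = var(u) * (1+rho)/(1-rho)
rho_hat = (u[1:] @ u[:-1]) / (u[:-1] @ u[:-1])
lrv_parametric = u.var() * (1 + rho_hat) / (1 - rho_hat)

# "Nonparametric": truncated sum of sample autocovariances, sum_{j=-L}^{L} E(u_t u_{t-j})
L = 40
gamma = lambda j: np.mean(u[j:] * u[:T - j])    # sample autocovariance at lag j
lrv_sum = gamma(0) + 2 * sum(gamma(j) for j in range(1, L + 1))
```

Both estimators target the same quantity (here the true value is 1/(1−ρ)² = 6.25); the parametric version uses one estimated number, ρ̂, rather than 40 noisy autocovariances.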

The case for ML

In the classic setup, the efficiency gain of ML over GMM on the pricing errors is tiny. However, several studies have found cases in which the statistically motivated choice of moments suggested by ML has important efficiency advantages.

For example, Jacquier, Polson and Rossi (1994) study the estimation of a time-series model with stochastic volatility. This is a model of the form

$$dS_t/S_t = \mu\,dt + V_t\,dZ_{1t}$$
$$dV_t = \mu_V(V_t)\,dt + \sigma(V_t)\,dZ_{2t}, \qquad (16.213)$$

and S is observed but V is not. The obvious and easily interpretable moments include the autocorrelation of squared returns, or the autocorrelation of the absolute value of returns. However, Jacquier, Polson and Rossi find that the resulting estimates are far less efficient than those resulting from the ML scores.
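For concreteness, here is one way such "interpretable moments" can be computed, using a simple discrete-time stochastic-volatility specification (an AR(1) in log variance, an assumed stand-in for the continuous-time model, with illustrative parameter values):

```python
import numpy as np

rng = np.random.default_rng(0)
T = 5000
phi, sigma_v = 0.95, 0.2              # assumed persistence and volatility-of-volatility
h = np.zeros(T)                       # log variance, AR(1): a common discrete SV spec
for t in range(1, T):
    h[t] = phi * h[t-1] + sigma_v * rng.normal()
r = np.exp(h / 2) * rng.normal(size=T)        # returns with stochastic volatility

def autocorr(x, lag):
    x = x - x.mean()
    return (x[lag:] @ x[:-lag]) / (x @ x)

# Interpretable moments: autocorrelations of squared and absolute returns
ac_sq  = [autocorr(r**2, j) for j in (1, 5, 10)]
ac_abs = [autocorr(abs(r), j) for j in (1, 5, 10)]
```

Returns themselves are nearly uncorrelated, while squared and absolute returns inherit the persistence of the volatility process; these are the moments one would feed to GMM in place of the ML scores.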

Of course, this study presumes that the model (16.213) really is exactly true. Whether the uninterpretable scores or the interpretable moments really perform better to give an approximate model of the form (16.213), given some other data-generating mechanism, is open to discussion.

Even in the canonical OLS vs. GLS case, a wildly heteroskedastic error covariance matrix can mean that OLS spends all its effort fitting unimportant data points. A "judicious" application of GMM (OLS) in this case would require at least some transformation of units so that OLS is not wildly inefficient.
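A small simulation (assumed numbers) makes the point: with a few wildly noisy observations, plain OLS is erratic, while dividing each observation by its error standard deviation (GLS/WLS, i.e. a change of units) pins down the slope.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 200
x = rng.normal(size=N)
sigma = np.where(np.arange(N) < 20, 50.0, 0.5)   # 20 wildly noisy observations
y = 2.0 * x + sigma * rng.normal(size=N)         # true slope = 2

# OLS: every observation weighted equally, so the noisy points dominate
b_ols = (x @ y) / (x @ x)

# GLS/WLS: divide each observation by its error std -- a transformation of units
xw, yw = x / sigma, y / sigma
b_gls = (xw @ yw) / (xw @ xw)
```

The weighted estimate sits close to the true slope of 2, while the OLS estimate can land far away depending on a handful of noisy draws.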

Statistical philosophy

The history of empirical work that has been persuasive -- that has changed people's understanding of the facts in the data and of which economic models account for those facts -- looks a lot different than the statistical theory preached in econometrics textbooks.

The CAPM was taught and believed in and used for years despite formal statistical rejections. It only fell by the wayside when other, coherent views of the world were offered in the multifactor models. And the multifactor models are also rejected! It seems that "it takes a model to beat a model," not a rejection.

Even when evaluating a specific model, most of the interesting calculations come from examining specific alternatives rather than overall pricing-error tests. The original CAPM tests focused on whether the intercept in a cross-sectional regression was higher or lower than the risk-free rate, and whether individual variance entered into cross-sectional regressions. The CAPM fell when it was found that characteristics such as size and book/market do enter cross-sectional regressions, not when generic pricing-error tests rejected.


Influential empirical work tells a story. The most efficient procedure does not seem to convince people if they cannot transparently see what stylized facts in the data drive the result. A test of a model that focuses on its ability to account for the cross section of average returns of interesting portfolios will in the end be much more persuasive than one that (say) focuses on the model's ability to explain the fifth moment of the second portfolio, even if ML finds the latter moment much more statistically informative.

Most recently, Fama and French (1988b) and (1993) are good examples of empirical work that changed many people's minds, in this case that long-horizon returns really are predictable, and that we need a multifactor model rather than the CAPM to understand the cross-section of average returns. These papers are not stunning statistically: long-horizon predictability is on the edge of statistical significance, and the multifactor model is rejected by the GRS test. But these papers made clear what stylized and robust facts in the data drive the results, and why those facts are economically sensible. For example, the 1993 paper focused on tables of average returns and betas. Those tables showed strong variation in average returns that was not matched by variation in market betas, yet was matched by variation in betas on new factors. There is no place in statistical theory for such a table, but it is much more persuasive than a table of χ² values for pricing-error tests. On the other hand, I can think of no case in which the application of a clever statistical model to wring the last ounce of efficiency out of a dataset, changing t-statistics from 1.5 to 2.5, substantially changed the way people think about an issue.

Statistical testing is one of many questions we ask in evaluating theories, and usually not the most important one. This is not a philosophical or normative statement; it is a positive or empirical description of the process by which the profession has moved from theory to theory. Think of the kind of questions people ask when presented with a theory and accompanying empirical work. They usually start by thinking hard about the theory itself. What is the central part of the economic model or explanation? Is it internally consistent? Do the assumptions make sense? Then, when we get to the empirical work: how were the numbers produced? Are the data definitions sensible? Are the concepts in the data decent proxies for the concepts in the model? (There's not much room in statistical theory for that question!) Are the model predictions robust to the inevitable simplifications? Does the result hinge on power utility vs. another functional form? What happens if you add a little measurement error, or if agents have an information advantage, etc.? What are the identification assumptions, and do they make any sense -- why is y on the left and x on the right rather than the other way around?

Finally, someone in the back of the room might raise his hand and ask, "if the data were generated by a draw of i.i.d. normal random variables over and over again, how often would you come up with a number this big or bigger?" That's an interesting and important check on the overall believability of the results. But it is not necessarily the first check, and certainly not the last and decisive check. Many models are kept that have economically interesting but statistically rejectable results, and many more models are quickly forgotten that have strong statistics but just do not tell as clean a story.

The classical theory of hypothesis testing, its Bayesian alternative, and the underlying hypothesis-testing view of the philosophy of science are miserable descriptions of the way science in general and economics in particular proceed from theory to theory. And this is probably a good thing too. Given the non-experimental nature of our data, the inevitable fishing biases of many researchers examining the same data, and the unavoidable fact that our theories are really quantitative parables more than literal descriptions of the way the data are generated, the way the profession settles on new theories makes a good deal of sense. Classical statistics requires that nobody ever looked at the data before specifying the model. Yet more regressions have been run than there are data points in the CRSP database. Bayesian econometrics can in principle incorporate the information of previous researchers, yet it is never applied in this way; each study starts anew with an "uninformative" prior. Statistical theory draws a sharp distinction between the model, which we know is right (utility is exactly power), and the parameters, which we estimate. But this distinction isn't true: we are just as uncertain about functional forms as we are about parameters. A distribution theory at bottom tries to ask an unknowable question: if we turned the clock back to 1947 and reran the postwar period 1000 times, in how many of those alternative histories would (say) the average S&P 500 return be greater than 9%? It's pretty amazing, in fact, that a statistician can purport to give any answer at all to such a question, having observed only one history.

These paragraphs do not contain original ideas, and they mirror changes in the philosophy of science more broadly. Fifty years ago, the reigning philosophy of science focused on the idea that scientists provide rejectable hypotheses. This idea runs through philosophical writings exemplified by Popper (1959) and classical statistical decision theory, and is mirrored in economics by Friedman (1953). However, this methodology contains an important inconsistency. Though researchers are supposed to let the data decide, writers on methodology do not look at how actual theories evolved. It was, as in Friedman's title, a "Methodology of positive economics," not a "positive methodology of economics." Why should methodology be normative, a result of philosophical speculation, and not an empirical discipline like everything else? In a very famous book, Kuhn (1970) looked at the actual history of scientific revolutions, and found that the actual process had very little to do with the formal methodology. McCloskey (1983, 1998) has gone even further, examining the "rhetoric" of economics: the kinds of arguments that persuaded people to change their minds about economic theories. Needless to say, the largest t-statistic did not win!

Kuhn's and especially McCloskey's ideas are not popular in the finance and economics professions. More precisely, they are not popular in how people talk about their work, though they describe well how people actually do their work. Most people in the fields cling to the normative, rejectable-hypothesis view of methodology. But we need not suppose that these ideas would be popular. The ideas of economics and finance are not popular among the agents in the models either. How many stock market investors even know what a random walk or the CAPM is, let alone believe those models have even a grain of truth? Why should the agents in the models of how scientific ideas evolve have an intuitive understanding of the models? "As if" rationality can apply to us as well!

Philosophical debates aside, a researcher who wants his ideas to be convincing, as well as right, would do well to study how ideas have in the past convinced people, rather than just study a statistical decision theorist's ideas about how ideas should convince people. Kuhn and, in economics, McCloskey have done that, and their histories are worth reading. In the end, statistical properties may be a poor way to choose statistical methods.

Summary

The bottom line is simple: it's ok to do a first-stage or simple GMM estimate rather than an explicit maximum likelihood estimate and test. Many people (and, unfortunately, many journal referees) seem to think that nothing less than a full maximum likelihood estimate and test is acceptable. This section is long in order to counter that impression; to argue that, at least in many cases of practical importance, a simple first-stage GMM approach, focusing on economically interpretable moments, can be adequately efficient, robust to model misspecifications, and ultimately more persuasive.


PART III

Bonds and options


The term structure of interest rates and derivative pricing use closely related techniques. As you might have expected, I present both issues in a discount factor context. All models come down to a specification of the discount factor. The discount factor specifications in term structure and option pricing models are quite simple.

So far, we have focused on returns, which reduces the pricing problem to a one-period or instantaneous problem. Pricing bonds and options forces us to start thinking about chaining together the one-period or instantaneous representations to get a prediction for prices of long-lived securities. Taking this step is very important, and I forecast that we will see much more multiperiod analysis in stocks as well, studying price and stream of payoffs rather than returns. This step, rather than the discount factor, accounts for the mathematical complexity of some term structure and option pricing models.

There are two standard ways to go from instantaneous or return representations to prices. First, we can chain the discount factors together, finding from a one-period discount factor $m_{t,t+1}$ a long-term discount factor $m_{t,t+j} = m_{t,t+1}\, m_{t+1,t+2} \cdots m_{t+j-1,t+j}$ that can price a $j$-period payoff. In continuous time, we will find that the discount factor increments $d\Lambda$ satisfy the instantaneous pricing equation $0 = E_t[d(\Lambda P)]$, and then solve its stochastic differential equation to find its level $\Lambda_{t+j}$ in order to price a $j$-period payoff as $P_t = E_t[(\Lambda_{t+j}/\Lambda_t)\, x_{t+j}]$.
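The first method can be sketched numerically. Assuming i.i.d. lognormal one-period discount factors with E(m) = 1/R^f (an illustrative specification, not from the text), chaining them prices a j-period riskfree payoff at about (1/R^f)^j:

```python
import numpy as np

rng = np.random.default_rng(0)
n_paths, j = 100_000, 5
rf = 1.02                               # assumed gross one-period riskfree rate
# One-period discount factors, i.i.d. lognormal, calibrated so E(m) = 1/rf
m = np.exp(rng.normal(-np.log(rf) - 0.005, 0.1, size=(n_paths, j)))
# Chain: m_{t,t+j} = m_{t,t+1} * m_{t+1,t+2} * ... * m_{t+j-1,t+j}
m_long = m.prod(axis=1)

# Price of a j-period payoff of 1: E[m_{t,t+j} * 1], should be about (1/rf)^j
price = m_long.mean()
```

With a risky payoff x_{t+j} one would simply average m_long * x across paths instead.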

Second, we can chain the prices together. Conceptually, this is the same as chaining returns $R_{t,t+j} = R_{t,t+1}\, R_{t+1,t+2} \cdots R_{t+j-1,t+j}$ instead of chaining together the discount factors. From $0 = E_t[d(\Lambda P)]$, we find a differential equation for the prices, and solve that back. We'll use both methods to solve interest rate and option pricing models.


Chapter 17. Option pricing

Options are a very interesting and useful set of instruments, as you will see in the background section. In thinking about their value, we will adopt an extremely relative pricing approach. Our objective will be to find a value for the option, taking as given the values of other securities, and in particular the price of the stock on which the option is written and an interest rate.

17.1 Background

17.1.1 Definitions and payoffs

A call option gives you the right to buy a stock for a specified strike price on a specified expiration date.

The call option payoff is $C_T = \max(S_T - X, 0)$.

Portfolios of options are called strategies. A straddle -- a put and a call at the same strike price -- is a bet on volatility.

Options allow you to buy and sell pieces of the return distribution.

Before studying option prices, we need to start by understanding option payoffs.

A call option gives you the right, but not the obligation, to buy a stock (or other "underlying" asset) for a specified strike price (X) on (or before) the expiration date (T). European options can only be exercised on the expiration date. American options can be exercised anytime before as well as on the expiration date. I will only treat European options. A put option gives the right to sell a stock at a specified strike price on (or before) the expiration date. I'll use the standard notation,

C = C_t = call price today
C_T = call payoff = value at expiration (T)
S = S_t = stock price today
S_T = stock price at expiration
X = strike price

Our objective is to find the price C. The general framework is (of course) C = E(mx), where x denotes the option's payoff. The option's payoff is the same thing as its value at expiration. If the stock has risen above the strike price, then the option is worth the difference between stock and strike. If the stock has fallen below the strike price, it expires worthless.

Thus, the option payoff is

$$\text{Call payoff} = \begin{cases} S_T - X & \text{if } S_T \geq X \\ 0 & \text{if } S_T \leq X \end{cases}$$

$$C_T = \max(S_T - X, 0).$$

A put works the opposite way: it gains value as the stock falls below the strike price, since the right to sell it at a high price is more and more valuable.

$$\text{Put payoff} = P_T = \max(X - S_T, 0).$$
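The payoff formulas translate directly into code; the straddle line uses the definition given above.

```python
def call_payoff(S_T, X):
    """Value of a call at expiration: stock minus strike, floored at zero."""
    return max(S_T - X, 0.0)

def put_payoff(S_T, X):
    """Value of a put at expiration: strike minus stock, floored at zero."""
    return max(X - S_T, 0.0)

def straddle_payoff(S_T, X):
    """A straddle -- a put and a call at the same strike -- pays off for big moves either way."""
    return call_payoff(S_T, X) + put_payoff(S_T, X)

# call_payoff(110, 100) -> 10.0;  put_payoff(90, 100) -> 10.0;  straddle_payoff(90, 100) -> 10.0
```

Note also the expiration-date identity C_T − P_T = S_T − X, which holds for any S_T and underlies put-call parity.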

It's easiest to keep track of options by a graph of their value as a function of stock price. Figure 31 graphs the payoffs from buying calls and puts, and the corresponding short positions, which are called writing call and put options. One of the easiest mistakes to make is to confuse the payoff with the profit, which is the value at expiration less the cost of buying the option. I drew in profit lines, payoff − cost, to emphasize this difference.

[Figure 31 here: four panels -- Call, Put, Write Call, Write Put -- plotting payoff and profit against S_T.]

Figure 31. Payoff diagrams for simple option strategies.

Right away, you can see some of the interesting features of options. A call option gives you a huge positive beta. Typical at-the-money options (strike price = current stock price) give a beta of about 10, meaning that the option is equivalent to borrowing $10 to invest $11 in the stock. However, your losses are limited to the cost of the option, which is paid upfront.
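To see where a beta of about 10 comes from, one can compute the elasticity Δ·S/C of an at-the-money call using the Black-Scholes formula (the formula itself is standard; the numerical inputs are assumptions for illustration):

```python
from math import log, sqrt, exp, erf

def norm_cdf(x):
    """Standard normal CDF via the error function (stdlib only)."""
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

def bs_call(S, X, r, sigma, T):
    """Black-Scholes call price and delta."""
    d1 = (log(S / X) + (r + sigma**2 / 2) * T) / (sigma * sqrt(T))
    d2 = d1 - sigma * sqrt(T)
    price = S * norm_cdf(d1) - X * exp(-r * T) * norm_cdf(d2)
    return price, norm_cdf(d1)

# Illustrative at-the-money call: 3 months to expiration, 20% vol, 1% rate
C, delta = bs_call(S=100, X=100, r=0.01, sigma=0.20, T=0.25)
elasticity = delta * 100 / C     # percent option return per percent stock return
```

With these inputs the elasticity comes out on the order of ten, so a 1% stock move produces roughly a 10% option return, which is the "huge beta" in the text.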

Options are obviously very useful for trading. Imagine how difficult it would be to buy stock on such huge margin, and how difficult it would be to make sure people paid if the bet went
