

Again, we already derived this χ² test in (12.168), and its finite-sample F counterpart, the GRS F test (12.169).
The other test of the restrictions is the likelihood ratio test (14.197). Quite generally, likelihood ratio tests are asymptotically equivalent to Wald tests, and so give the same result. Showing it in this case is not worth the algebra.

14.4 When factors are not excess returns, ML prescribes a
cross-sectional regression

If the factors are not returns, we don't have a choice between time-series and cross-sectional regression, since the intercepts are not zero. As you might suspect, ML prescribes a cross-sectional regression in this case.
The factor model, expressed in expected return-beta form, is
$$E(R^{ei}) = \alpha_i + \beta_i'\lambda; \quad i = 1, 2, \ldots, N \qquad (14.206)$$

The betas are defined from time-series regressions
$$R^{ei}_t = a_i + \beta_i' f_t + \varepsilon^i_t \qquad (14.207)$$

The intercepts $a_i$ in the time-series regressions need not be zero, since the model does not apply to the factors. They are not unrestricted, however. Taking expectations of the time-series regression (14.207) and comparing it to (14.206) (as we did to derive the restriction α = 0 for the time-series regression), the restriction α = 0 implies
$$a_i = \beta_i'\left(\lambda - E(f_t)\right) \qquad (14.208)$$

Plugging into (14.207), the time-series regressions must be of the restricted form
$$R^{ei}_t = \beta_i'\lambda + \beta_i'\left[f_t - E(f_t)\right] + \varepsilon^i_t. \qquad (14.209)$$

In this form, you can see that $\beta_i'\lambda$ determines the mean return. Since there are fewer factors than returns, this is a restriction on the regression (14.209).
Stack assets $i = 1, 2, \ldots, N$ to a vector, and introduce the auxiliary statistical model that the errors and factors are i.i.d. normal and uncorrelated with each other. Then, the restricted model is
$$R^e_t = B\lambda + B\left[f_t - E(f_t)\right] + \varepsilon_t$$
$$f_t = E(f) + u_t$$
$$\begin{bmatrix} \varepsilon_t \\ u_t \end{bmatrix} \sim N\left(0, \begin{bmatrix} \Sigma & 0 \\ 0 & V \end{bmatrix}\right)$$

where B denotes an N × K matrix of regression coefficients of the N assets on the K factors.


The likelihood function is
$$\mathcal{L} = (\text{const.}) - \frac{1}{2}\sum_{t=1}^{T} \varepsilon_t'\Sigma^{-1}\varepsilon_t - \frac{1}{2}\sum_{t=1}^{T} u_t'V^{-1}u_t$$
$$\varepsilon_t = R^e_t - B\left[\lambda + f_t - E(f)\right]; \quad u_t = f_t - E(f).$$

Maximizing the likelihood function,
$$\frac{\partial \mathcal{L}}{\partial E(f)}: \ 0 = B'\Sigma^{-1}\sum_{t=1}^{T}\left(R^e_t - B\left[\lambda + f_t - E(f)\right]\right) + \sum_{t=1}^{T} V^{-1}\left(f_t - E(f)\right)$$
$$\frac{\partial \mathcal{L}}{\partial \lambda}: \ 0 = B'\Sigma^{-1}\sum_{t=1}^{T}\left(R^e_t - B\left[\lambda + f_t - E(f)\right]\right)$$

The solution to this pair of equations is
$$\widehat{E(f)} = E_T(f_t) \qquad (14.210)$$
$$\hat{\lambda} = \left(B'\Sigma^{-1}B\right)^{-1} B'\Sigma^{-1} E_T(R^e). \qquad (14.211)$$

The maximum likelihood estimate of the factor risk premium is a GLS cross-sectional regression of average returns on betas.
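Equation (14.211) is a standard GLS formula and is easy to compute directly. A minimal numpy sketch, with hypothetical betas B, residual covariance Σ, and average excess returns (all numbers made up for illustration):

```python
import numpy as np

# Hypothetical inputs: N = 4 assets, K = 2 factors (illustrative numbers only)
B = np.array([[1.0, 0.2],
              [0.8, 0.5],
              [1.2, 0.1],
              [0.9, 0.9]])                     # N x K time-series betas
Sigma = np.diag([0.04, 0.05, 0.03, 0.06])      # residual covariance matrix
ERe = np.array([0.6, 0.7, 0.5, 0.9])           # E_T(Re): average excess returns

# GLS cross-sectional regression of average returns on betas, eq. (14.211):
# lambda_hat = (B' Sigma^-1 B)^-1 B' Sigma^-1 E_T(Re)
Si = np.linalg.inv(Sigma)
lam_hat = np.linalg.solve(B.T @ Si @ B, B.T @ Si @ ERe)
```

`lam_hat` holds the K factor risk premia; the first-order condition $B'\Sigma^{-1}(E_T(R^e) - B\hat{\lambda}) = 0$ provides a quick check on the solution.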
The maximum likelihood estimates of the regression coefficients B are again not the same as the standard OLS formulas. Again, ML imposes the null to improve efficiency.
$$\frac{\partial \mathcal{L}}{\partial B}: \ \sum_{t=1}^{T}\Sigma^{-1}\left(R^e_t - B\left[\lambda + f_t - E(f)\right]\right)\left[\lambda + f_t - E(f)\right]' = 0 \qquad (14.212)$$
$$\hat{B} = \left(\sum_{t=1}^{T} R^e_t\left[f_t + \lambda - E(f)\right]'\right)\left(\sum_{t=1}^{T}\left[f_t + \lambda - E(f)\right]\left[f_t + \lambda - E(f)\right]'\right)^{-1}$$

This is true, even though the B are defined in the theory as population regression coefficients. (The matrix notation hides a lot here! If you want to rederive these formulas, it's helpful to start with scalar parameters, e.g. $B_{ij}$, and to think of the chain rule as $\partial\mathcal{L}/\partial\theta = \sum_{t=1}^{T}\left(\partial\mathcal{L}/\partial\varepsilon_t\right)'\partial\varepsilon_t/\partial\theta$.) Therefore, to really implement ML, you have to solve (14.211) and (14.212) simultaneously for $\hat{\lambda}$ and $\hat{B}$, along with $\hat{\Sigma}$, whose ML estimate is the usual second-moment matrix of the residuals. This can usually be done iteratively: start with OLS $\hat{B}$, run an OLS cross-sectional regression for $\hat{\lambda}$, form $\hat{\Sigma}$, and iterate.
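The iteration can be sketched in a few lines of numpy. The data below are simulated from the restricted model itself, and all calibration numbers are made up for illustration; the loop alternates between (14.211), (14.212), and the residual second-moment matrix, with E(f) fixed at its ML value (14.210):

```python
import numpy as np

rng = np.random.default_rng(0)
T, N, K = 1000, 5, 1                                # hypothetical sample sizes
f = rng.normal(0.5, 2.0, (T, K))                    # factor draws (not a return)
B_true = rng.uniform(0.5, 1.5, (N, K))
lam_true = np.array([0.6])                          # hypothetical factor risk premium
Ef = f.mean(axis=0)                                 # (14.210): E(f) = E_T(f_t)
R = (lam_true + f - Ef) @ B_true.T + rng.normal(0.0, 1.0, (T, N))  # eq. (14.209)

# Start with unrestricted OLS time-series betas
X = np.column_stack([np.ones(T), f])
B = np.linalg.lstsq(X, R, rcond=None)[0][1:].T      # N x K

lam = np.zeros(K)
for _ in range(100):
    eps = R - (lam + f - Ef) @ B.T                  # residuals at current parameters
    Si = np.linalg.inv(eps.T @ eps / T)             # inverse of Sigma-hat
    lam_new = np.linalg.solve(B.T @ Si @ B, B.T @ Si @ R.mean(axis=0))  # (14.211)
    x = lam_new + f - Ef                            # T x K restricted regressors
    B = (R.T @ x) @ np.linalg.inv(x.T @ x)          # (14.212)
    if np.max(np.abs(lam_new - lam)) < 1e-12:
        lam = lam_new
        break
    lam = lam_new
```

On simulated data obeying the model, the iteration settles quickly and `lam` recovers the premium used to generate the returns, up to sampling error.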

14.5 Problems


1. Why do we use restricted ML when the factor is a return, but unrestricted ML when the factor is not a return? To see why, try to formulate a ML estimator based on an unrestricted regression when factors are not returns, equation (12.166). Add pricing errors $\alpha_i$ to the regression as we did for the unrestricted regression in the case that factors are returns, and then find ML estimators for B, λ, α, E(f). (Treat V and Σ as known to make the problem easier.)
2. Instead of writing a regression, build up the ML for the CAPM a little more formally. Write the statistical model as just the assumption that individual returns and the market return are jointly normal,
$$\begin{bmatrix} R^e \\ R^{em} \end{bmatrix} \sim N\left(\begin{bmatrix} E(R^e) \\ E(R^{em}) \end{bmatrix}, \begin{bmatrix} \mathrm{cov}(R^e, R^{e\prime}) & \mathrm{cov}(R^e, R^{em}) \\ \mathrm{cov}(R^{em}, R^{e\prime}) & \sigma^2(R^{em}) \end{bmatrix}\right).$$
The model's restriction is
$$E(R^e) = \gamma\,\mathrm{cov}(R^{em}, R^e).$$
Estimate γ and show that this is the same time-series estimator as we derived by
presupposing a regression.

Chapter 15. Time series, cross-section,
and GMM/DF tests of linear factor models
The GMM/DF, time-series and cross-sectional regression procedures and distribution theory are similar, but not identical. Cross-sectional regressions on betas are not the same thing as cross-sectional regressions on second moments. Cross-sectional regressions weighted by the residual covariance matrix are not the same thing as cross-sectional regressions weighted by the spectral density matrix.
GLS cross-sectional regressions and second-stage GMM have a theoretical efficiency advantage over OLS cross-sectional regressions and first-stage GMM, but how important is this advantage, and is it outweighed by worse finite-sample performance?
The time-series regression, as ML estimate, has a potential gain in efficiency when returns are factors and the residuals are i.i.d. normal. Why does ML prescribe a time-series regression when the return is a factor and a cross-sectional regression when the return is not a factor? The time-series regression seems to ignore pricing errors and estimate the model by entirely different moments. How does adding one test asset make such a seemingly dramatic difference to the procedure?
Finally, and perhaps most importantly, the GMM/discount factor approach is still a "new" procedure. Many authors still do not trust it. It is important to verify that it produces similar results and well-behaved test statistics in the setups of the classic regression tests.
To address these questions, I first apply the various methods to a classic empirical question: how do time-series regression, cross-sectional regression and GMM/stochastic discount factor compare when applied to a test of the CAPM on CRSP size portfolios? I find that the three methods produce almost exactly the same results for this classic exercise. They produce almost exactly the same estimates, standard errors, t-statistics and χ² statistics that the pricing errors are jointly zero.
Then I conduct a Monte Carlo and bootstrap evaluation. Again, I find little difference between the methods. The estimates, standard errors, and size and power of tests are almost identical across methods.
The bootstrap does reveal that the traditional i.i.d. assumption generates χ² statistics with about 1/2 the correct size; they reject half as often as they should under the null. Simple GMM corrections to the distribution theory repair this size defect. Also, you can ruin any estimate and test with a bad spectral density matrix estimate. I try an estimate with 24 lags and no Newey-West weights. It is singular in the data sample and in many Monte Carlo replications. Interestingly, this singularity has minor effects on standard errors, but causes disasters when you use the spectral density matrix to weight a second-stage GMM.
I also find that second-stage "efficient" GMM is only very slightly more efficient than first-stage GMM, but is somewhat less robust; it is more sensitive to the poor spectral density matrix, and its asymptotic standard errors can be slightly misleading. As OLS is often better than GLS, despite the theoretical efficiency advantage of GLS, first-stage GMM may be better than second-stage GMM in many applications.
This section should give comfort that the apparently "new" GMM/discount factor formulation is almost exactly the same as traditional methods in the traditional setup. There is a widespread impression that GMM has difficulty in small samples. The literature on the small-sample properties of GMM (for example, Ferson and Foerster, 1994; Fuhrer, Moore, and Schuh, 1995) naturally tries hard setups, with highly nonlinear models, highly persistent and heteroskedastic errors, conditioning information, potentially weak instruments and so forth. Nobody would write a paper trying GMM in a simple situation such as this one, correctly foreseeing that the answer would not be very interesting. Unfortunately, many readers take from this literature a mistaken impression that GMM always has difficulty in finite samples, even in very standard setups. This is not the case.
The point of the GMM/discount factor method, of course, is not a new way to handle the simple i.i.d. normal CAPM problems, which are already handled efficiently by regression techniques. The point of the GMM/discount factor method is its ability to transparently handle situations that are very hard with expected return-beta models and ML techniques, including the incorporation of conditioning information and nonlinear models. With the reassurance of this section, we can proceed to those more exciting applications.
Cochrane (2000) presents a more in-depth analysis, including estimation and Monte Carlo evaluation of individual pricing error estimates and tests. Jagannathan and Wang (2000) compare the GMM/discount factor approach to classic regression tests analytically. They show that the parameter estimates, standard errors and χ² statistics are asymptotically identical to those of an expected return-beta cross-sectional regression when the factor is not a return.

15.1 Three approaches to the CAPM in size portfolios

The time-series approach sends the expected return-beta line through the market return, ignoring other assets. The OLS cross-sectional regression minimizes the sum of squared pricing errors, so it allows some market pricing error to fit other assets better. The GLS cross-sectional regression weights pricing errors by the residual covariance matrix, so it reduces to the time-series regression when the factor is a return and is included in the test assets.
The GMM/discount factor estimates, standard errors and χ² statistics are very close to time-series and cross-sectional regression estimates in this classic setup.

Time series and cross section
Figures 28 and 29 illustrate the difference between time-series and cross-sectional regressions, in an evaluation of the CAPM on monthly size portfolios.


Figure 28 presents the time-series regression. The time-series regression estimates the factor risk premium from the average of the factor, ignoring any information in the other assets: $\lambda = E_T(R^{em})$. Thus, a time-series regression draws the expected return-beta line across assets by making it fit precisely on two points, the market return and the risk-free rate; the market and risk-free rate have zero estimated pricing error in every sample. (The far right portfolios are the smallest-firm portfolios, and their positive pricing errors are the small-firm anomaly; this data set is the first serious failure of the CAPM. I come back to the substantive issue in Chapter 20.)
The time-series regression is the ML estimator in this case, since the factor is a return. Why does ML ignore all the information in the test-asset average returns, and estimate the factor premium from the average factor return only? The answer lies in the structure that we told ML to assume when looking at the data. When we write $R^e_t = a + \beta f_t + \varepsilon_t$ with ε independent of f, we tell ML that a sample of returns already includes the same sample of the factor, plus extra noise. Thus, the sample of test-asset returns cannot possibly tell ML anything more than the sample of the factor alone about the mean of the factor. Second, we tell ML that the factor risk premium equals the mean of the factor, so it may not consider the possibility that the two are different in trying to match the data.

Figure 28. Average excess returns vs. betas on CRSP size portfolios, 1926-1998. The line gives the predicted average return from the time-series regression, $E(R^e) = \beta E(R^{em})$.

The OLS cross-sectional regression in Figure 29 draws the expected return-beta line by minimizing the squared pricing error across all assets. Therefore, it allows some pricing error for the market return, if by doing so the pricing errors on other assets can be reduced. Thus, the OLS cross-sectional regression gives some pricing error to the market return in order to lower the pricing errors of the other portfolios.

Figure 29. Average excess returns vs. betas of CRSP size portfolios, 1926-1998, and the fit of cross-sectional regressions.
When the factor is not also a return, ML prescribes a cross-sectional regression. ML still ignores anything but the factor data in estimating the mean of the factor, $E(f) = E_T(f_t)$. However, ML is now allowed to use a different parameter for the factor risk premium that fits average returns to betas, which it does by cross-sectional regression. However, that ML estimate is a GLS cross-sectional regression, not an OLS cross-sectional regression. The GLS cross-sectional regression in Figure 29 is almost exactly identical to the time-series regression result; it passes right through the origin and the market return, ignoring all the other pricing errors.
The GLS cross-sectional regression
$$\hat{\lambda} = \left(\beta'\Sigma^{-1}\beta\right)^{-1}\beta'\Sigma^{-1}E_T(R^e)$$
weights the various portfolios by the inverse of the residual covariance matrix Σ. If we include the market return as a test asset, it obviously has no residual variance, $R^{em}_t = 0 + 1 \times R^{em}_t + 0$, so the GLS estimate pays exclusive attention to it in fitting the market line. The same thing


happens if the test assets span the factors, i.e., if a linear combination of the test assets is equal to the factor and hence has no residual variance. The size portfolios nearly span the market return, so the GLS cross-sectional regression is visually indistinguishable from the time-series regression in this case.
This observation wraps up one mystery: why ML seemed so different when the factor is and is not a return. As the residual variance of a portfolio goes to zero, the GLS regression pays more and more attention to that portfolio, until you have achieved the same result as a time-series regression.
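This limit is easy to verify numerically: shrink one asset's residual variance toward zero in a single-factor GLS cross-sectional regression (hypothetical numbers throughout), and λ̂ converges to that asset's average return divided by its beta, which is the time-series answer:

```python
import numpy as np

beta = np.array([1.0, 0.8, 1.3])       # hypothetical betas; asset 0 plays the market's role
ERe = np.array([0.66, 0.75, 0.95])     # hypothetical average excess returns

for resid_var0 in [1.0, 1e-4, 1e-10]:
    Sigma = np.diag([resid_var0, 0.05, 0.04])   # shrink asset 0's residual variance
    Si = np.linalg.inv(Sigma)
    lam = (beta @ Si @ ERe) / (beta @ Si @ beta)   # single-factor GLS estimate
    print(resid_var0, lam)
# As resid_var0 -> 0, lam -> ERe[0] / beta[0] = 0.66:
# the GLS regression line is forced through asset 0 exactly.
```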
If we allow a free constant in the OLS cross-sectional regression, thus allowing a pricing error for the risk-free rate, you can see from Figure 29 that the OLS cross-sectional regression line will fit the size portfolios even better, though allowing a pricing error in the risk-free rate as well as the market return. However, a free intercept in an OLS regression on excess returns puts no weight at all on the intercept pricing error. It is a better idea to include the risk-free rate as a test asset, either directly by doing the whole thing in levels of returns rather than excess returns, or by adding $E(R^e) = 0$, $\beta = 0$ to the cross-sectional regression. The GLS cross-sectional regression will notice that the T-bill rate has no residual variance and so will send the line right through the origin, as it does for the market return.
GMM/discount factor first and second stage
Figure 30 illustrates the GMM/discount factor estimate with the same data. The horizontal axis is the second moment of returns and factors rather than beta, but you would not know it from the placement of the dots. The first-stage estimate is an OLS cross-sectional regression of average returns on second moments. It minimizes the sum of squared pricing errors, and so produces pricing errors almost exactly equal to those of the OLS cross-sectional regression of returns on betas. The second-stage estimate minimizes pricing errors weighted by the spectral density matrix. The spectral density matrix is not the same as the residual covariance matrix, so the second-stage GMM does not go through the market portfolio as does the GLS cross-sectional regression. In fact, the slope of the line is slightly higher for the second stage.
(The spectral density matrix of the discount factor formulation does not reduce to the residual covariance matrix even if we assume the regression model, the asset pricing model is true, and factors and residuals are i.i.d. normal. In particular, when the market is a test asset, the GLS cross-sectional regression focuses all attention on the market portfolio, but the second-stage GMM/DF does not do so. The parameter b is related to λ by $b = \lambda/E(R^{em2})$. The other assets are still useful in determining the parameter b, even though, given the market return and the regression model $R^{ei}_t = \beta_i R^{em}_t + \varepsilon^i_t$, seeing the other assets does not help to determine the mean of the market return.)
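To make "first stage" concrete: with moments $g_T(b) = E_T(R^e) - E_T(R^e R^{em})\,b$ and an identity weighting matrix, the estimate is an OLS cross-sectional regression of average returns on second moments. A sketch on simulated data; all calibration numbers here are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(1)
T, N = 876, 10
Rem = rng.normal(0.0066, 0.0547, T)            # hypothetical market excess return draws
beta = np.linspace(0.6, 1.5, N)
Re = np.outer(Rem, beta) + rng.normal(0, 0.02, (T, N))   # CAPM holds in population

d = (Re * Rem[:, None]).mean(axis=0)           # E_T(Re * Rem): N-vector of second moments
ERe = Re.mean(axis=0)                          # E_T(Re): average excess returns

# First-stage GMM/DF: OLS cross-sectional regression of mean returns on
# second moments, b_hat = (d'd)^-1 d' E_T(Re), minimizing sum of squared pricing errors
b_hat = (d @ ERe) / (d @ d)

# Unit check against applying the model to the market itself: b = E(Rem)/E(Rem^2)
b_market = Rem.mean() / (Rem**2).mean()
```

In simulated data where the CAPM holds, `b_hat` and `b_market` are close, differing only through the residual pricing-error noise.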
Overall, the figures do not suggest any strong reason to prefer first- or second-stage GMM/discount factor, time-series, OLS or GLS cross-sectional regression in this standard model and data set. The results are affected by the choice of method. In particular, the size of the small-firm anomaly is substantially affected by how one draws the market line. But the graphs and analysis do not strongly suggest that any method is better than any other for purposes other than fishing for the answer one wants.

Figure 30. Average excess return vs. predicted value of 10 CRSP size portfolios, 1926-1998, based on the GMM/SDF estimate. The model predicts $E(R^e) = bE(R^e R^{em})$. The second-stage estimate of b uses a spectral density estimate with zero lags.
Parameter estimates, standard errors, and tests
Table c1 presents the parameter estimates and standard errors from the time-series, cross-sectional, and GMM/discount factor approaches in the CAPM size-portfolio test illustrated by Figures 28 and 29. The main parameter to be estimated is the slope of the lines in the above figures: the market price of risk λ in the expected return-beta model, and the relation b between mean returns and second moments in the stochastic discount factor model. The big point of Table c1 is that the GMM/discount factor estimates and standard errors behave very similarly to the traditional estimates and standard errors.
The rows compare results with various methods of calculating the spectral density matrix. "i.i.d." imposes no serial correlation and regression errors independent of right-hand variables, and is identical to the maximum-likelihood-based formulas. The 0-lag estimate allows conditional heteroskedasticity, but no correlation of residuals. The 3-lag, Newey-West estimate is a sensible correction for short-order autocorrelation. I include the 24-lag spectral density matrix to show how things can go wrong if you use a ridiculous spectral density matrix.
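The competing estimates can be written in a few lines; here `u` is a hypothetical T × k matrix of moment conditions, and the Bartlett weights $w_j = 1 - j/(\text{lags}+1)$ are the Newey-West correction:

```python
import numpy as np

def spectral_density(u, lags, newey_west=True):
    """S = sum_{j=-lags}^{lags} w_j E(u_t u_{t-j}'), with Bartlett weights
    w_j = 1 - |j|/(lags+1) under Newey-West, and w_j = 1 unweighted."""
    T = u.shape[0]
    u = u - u.mean(axis=0)
    S = u.T @ u / T                       # j = 0 term
    for j in range(1, lags + 1):
        w = 1.0 - j / (lags + 1) if newey_west else 1.0
        Gj = u[j:].T @ u[:-j] / T         # j-th autocovariance matrix
        S += w * (Gj + Gj.T)              # add the +j and -j terms
    return S

rng = np.random.default_rng(2)
u = rng.normal(size=(876, 3))                         # hypothetical moment series
S0 = spectral_density(u, lags=0)                      # "0 lags"
S3 = spectral_density(u, lags=3, newey_west=True)     # "3 lags, NW": psd by construction
S24 = spectral_density(u, lags=24, newey_west=False)  # unweighted: can fail to be psd
```

The Newey-West weighting guarantees a positive semidefinite estimate; the unweighted long-lag version carries no such guarantee, which is the source of the disasters described below.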


                          Beta model λ                           GMM/DF b
             Time-series    Cross section                 1st stage    2nd stage
                            OLS          GLS
Estimate        0.66        0.71         0.66               2.35
i.i.d.       0.18 (3.67)  0.20 (3.55)  0.18 (3.67)
0 lags       0.18 (3.67)  0.19 (3.74)  0.18 (3.67)     0.63 (3.73)  0.61 (4.03)
3 lags, NW   0.20 (3.30)  0.21 (3.38)  0.20 (3.30)     0.69 (3.41)  0.64 (3.73)
24 lags      0.16 (4.13)  0.16 (4.44)  0.16 (4.13)     1.00 (2.35)  0.69 (3.12)

Table c1. Parameter estimates and standard errors. Estimates are shown in italic, standard errors in regular type, and t-statistics in parentheses. The time-series estimate is the mean market return in percent per month. The cross-sectional estimate is the slope coefficient λ in $E(R^e) = \beta\lambda$. The GMM estimate is the parameter b in $E(R^e) = E(R^e f)b$. CRSP monthly data 1926-1998. "Lags" gives the number of lags in the spectral density matrix. "NW" uses Newey-West weighting in the spectral density matrix.

The OLS cross-sectional estimate 0.71 is a little higher than the mean market return 0.66, in order to better fit all of the assets, as seen in Figure 29. The GLS cross-sectional estimate is almost exactly the same as the mean market return, and the GLS standard errors are almost exactly the same as the time-series standard errors. The Shanken correction for generated regressors is very important to the standard errors of the cross-sectional regressions. Without the $\Sigma_f$ term in the standard deviation of λ (12.184), i.e. treating the β as fixed right-hand variables, the standard errors come out to 0.07 for OLS and a similarly small value for GLS: far less than the correct 0.20 and 0.18 shown in the table, and far less than $\sigma/\sqrt{T}$.
The b estimates are not directly comparable to the risk premium estimates, but it is easy to translate their units. Applying the discount factor model with normalization a = 1 to the market return itself,
$$b = \frac{E(R^{em})}{E(R^{em2})}.$$
With $E(R^{em}) = 0.66\%$ and $\sigma(R^{em}) = 5.47\%$, we have $100 \times b = 100\,(0.66)/(0.66^2 + 5.47^2) = 2.17$. The entries in Table c1 are close to this magnitude. Most are slightly larger, as is the OLS cross-sectional regression estimate, in order to better fit the other portfolios. The t-statistics are quite close across methods, which is another way to correct the units.
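As a quick check of the unit translation, using the sample moments quoted above:

```python
# Translate b into the units of Table c1, using the sample moments from the text:
# E(Rem) = 0.66% and sigma(Rem) = 5.47% per month, so E(Rem^2) = 0.66^2 + 5.47^2.
E_Rem, sd_Rem = 0.66, 5.47
b_times_100 = 100 * E_Rem / (E_Rem**2 + sd_Rem**2)   # about 2.17, as in the text
```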
The second-stage GMM/DF estimates (as well as their standard errors) depend on which spectral density estimate is used as a weighting matrix. The results are quite similar for all the sensible spectral density estimates. The 24-lag spectral density matrix starts to produce unusual estimates. This spectral density estimate will cause lots of problems below.
Table c2 presents the χ² and F statistics that test whether the pricing errors are jointly significant. The OLS and GLS cross-sectional regressions, and the first- and second-stage GMM/discount factor tests, give exactly the same χ² statistic, though the individual pricing errors and covariance matrices are not the same, so I do not present them separately. The big point of Table c2 is that the GMM/discount factor method gives almost exactly the same result as the cross-sectional regression.

              Time series         Cross section        GMM/DF
              χ²(10)     %p       χ²(9)      %p        χ²(9)      %p
i.i.d.           8.5     58          8.5     49
GRS F            0.8     59
0 lags          10.5     40         10.6     31          10.5     31
3 lags, NW      11.0     36         11.1     27          11.1     27
24 lags         -432    100          7.6     57           7.7     57

Table c2. χ² tests that all pricing errors are jointly equal to zero. "%p" gives the p-value in percent.

For the time-series regression, the GRS F test gives almost exactly the same rejection probability as does the asymptotic χ² test. Apparently, the advantage of a statistic that is valid in finite samples is not that important in this data set. The χ² tests for the time-series case without the i.i.d. assumption are a bit more conservative, with 30-40% p-values rather than almost 60%. However, this difference is not large. The one exception is the χ² test using 24 lags and no weights in the spectral density matrix. That matrix turns out not to be positive definite in this sample, with disastrous results for the χ² statistic.
(Somewhat surprisingly, the CAPM is not rejected. This is because the small-firm effect vanishes in the latter part of the sample. I discuss this fact further in Chapter 20. See in particular Figure 28.)
Looking across the rows, the χ² statistic is almost exactly the same for each method. The cross-sectional regression and GMM/DF estimates have one lower degree of freedom (the market premium is estimated from the cross section rather than from the market return), and so show slightly greater rejection probabilities. For a given spectral density estimation technique, the cross-sectional regression and the GMM/DF approach give almost exactly the same χ² values and rejection probabilities. The 24-lag spectral density matrix is a disaster as usual. In this case, it is a greater disaster for the time-series test than for the cross-section or GMM/discount factor test. It turns out not to be positive definite, so the sample pricing errors produce a nonsensical negative value of $\hat{\alpha}'\mathrm{cov}(\hat{\alpha})^{-1}\hat{\alpha}$.

15.2 Monte Carlo and Bootstrap

The parameter distribution for the time-series regression estimate is quite similar to that from the GMM/discount factor estimate.
The size and power of the χ² test statistics are nearly identical for the time-series regression test and the GMM/discount factor test.
A bad spectral density matrix can ruin either time-series or GMM/discount factor estimates and tests.
There is enough serial correlation and heteroskedasticity in the data that conventional i.i.d. formulas produce test statistics with about 1/2 the correct size. If you want to do classic regression tests, you should correct the distribution theory rather than use the ML i.i.d. distributions.

Econometrics is not just about sensible point estimates; it is about the sampling variability of those estimates, and whether standard error formulas correctly capture that sampling variability. How well do the various standard error and test statistic formulas capture the true sampling distribution of the estimates? To answer this question I conduct two Monte Carlos and two bootstraps. I conduct one each under the null that the CAPM is correct, to study size, and one each under the alternative that the CAPM is false, to study power.
The Monte Carlo experiments follow the standard ML assumption that returns and factors are i.i.d. normally distributed, and the factors and residuals are independent as well as uncorrelated. I generate artificial samples of the market return from an i.i.d. normal, using the sample mean and variance of the value-weighted return. I then generate artificial size-decile returns under the null by $R^{ei}_t = 0 + \beta_i R^{em}_t + \varepsilon_{it}$, using the sample residual covariance matrix Σ to draw i.i.d. normal residuals $\varepsilon_{it}$ and the sample regression coefficients $\beta_i$. To generate data under the alternative, I add the sample $\alpha_i$. I draw 5000 artificial samples. I try a long sample of 876 months, matching the CRSP sample analyzed above. I also draw a short sample of 240 months, or 20 years, which is about as short as one should dare try to test a factor model.
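The Monte Carlo design can be sketched as follows. All calibration numbers (betas, residual covariance) are hypothetical stand-ins, not the actual CRSP estimates:

```python
import numpy as np

def draw_sample(T, beta, alpha, Sigma_chol, mu_m, sd_m, rng):
    """One artificial sample: i.i.d. normal market returns and residuals,
    Re_it = alpha_i + beta_i * Rem_t + eps_it. Under the null alpha = 0;
    under the alternative, alpha holds the sample pricing errors."""
    Rem = rng.normal(mu_m, sd_m, T)
    eps = rng.normal(size=(T, len(beta))) @ Sigma_chol.T   # residual covariance Sigma
    return alpha + np.outer(Rem, beta) + eps, Rem

rng = np.random.default_rng(0)
N = 10
beta = np.linspace(0.8, 1.4, N)                            # hypothetical decile betas
Sigma = 0.02**2 * (0.9 * np.eye(N) + 0.1 * np.ones((N, N)))  # hypothetical Sigma
Sigma_chol = np.linalg.cholesky(Sigma)

# Null draw, long sample: T = 876 months, market moments roughly 0.66% / 5.47%
Re, Rem = draw_sample(876, beta, np.zeros(N), Sigma_chol, 0.0066, 0.0547, rng)
```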
The bootstraps check whether non-normalities, autocorrelation, heteroskedasticity, and non-independence of factors and residuals matter to the sampling distribution in this data set. I do a block bootstrap, resampling the data in groups of three months with replacement, to preserve the short-order autocorrelation and persistent heteroskedasticity in the data. To impose the CAPM, I draw the market return and residuals in the time-series regression, and then compute artificial data on decile portfolio returns by $R^{ei}_t = 0 + \beta_i R^{em}_t + \varepsilon_{it}$. To study the alternative, I simply redraw all the data in groups of three. Of course, the actual data may display conditioning information not displayed by this bootstrap, such as predictability and conditional heteroskedasticity based on additional variables such as the dividend/price ratio, lagged squared returns, or implied volatilities.
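The block draw itself is simple; a sketch, with a hypothetical data matrix standing in for the market return and residuals:

```python
import numpy as np

def block_bootstrap(data, block=3, rng=None):
    """Resample rows of data (T x k) in contiguous blocks of `block` months,
    drawn with replacement, preserving short-order autocorrelation and
    persistent heteroskedasticity within each block."""
    rng = rng if rng is not None else np.random.default_rng()
    T = data.shape[0]
    n_blocks = -(-T // block)                      # ceil(T / block)
    starts = rng.integers(0, T - block + 1, size=n_blocks)
    idx = np.concatenate([np.arange(s, s + block) for s in starts])[:T]
    return data[idx]

# Hypothetical stand-in for the [market return, residuals] data matrix
rng = np.random.default_rng(0)
data = rng.normal(size=(876, 11))
sample = block_bootstrap(data, block=3, rng=rng)
```

Each bootstrap sample has the same length as the original data, built from randomly chosen three-month blocks.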
The first-stage GMM/discount factor and OLS cross-sectional regression estimates are nearly identical in every artificial sample, just as the GLS cross-sectional regression is nearly identical to the time-series regression in every sample. Therefore, the important question is to compare the time-series regression, which is ML with i.i.d. normal returns and factors, to the first- and second-stage GMM/DF procedures. For this reason and to save space, I do not include the cross-sectional regressions in the Monte Carlo and bootstrap.


χ² tests
Table 6c presents the χ² tests of the hypothesis that all pricing errors are zero under the null that the CAPM is true, and Table 7c presents the χ² tests under the null that the CAPM is false. Each table presents the percentage of the 5000 artificial data sets in which the χ² tests rejected the null at the indicated level. The central point of these tables is that the GMM/discount factor test performs almost exactly the same way as the time-series test. Compare each GMM/DF entry to its corresponding time-series entry; they are all nearly identical. Neither the small efficiency advantage of time series vs. cross section, nor the difference between betas and second moments, seems to make any difference to the sampling distribution.

                      Monte Carlo                       Block-Bootstrap
                Time series       GMM/DF          Time series       GMM/DF
Sample size:   240   876  876   240   876  876   240   876  876   240   876  876
level (%):       5     5    1     5     5    1     5     5    1     5     5    1
i.i.d.         7.5   6.0  1.1                    6.0   2.8  0.6
0 lags         7.7   6.1  1.1   7.5   6.3  1.0   7.7   4.3  1.0   6.6   3.7  0.9
3 lags, NW    10.7   6.5  1.4   9.7   6.6  1.3  10.5   5.4  1.3   9.5   5.3  1.3
24 lags         25    39   32    25    41   31    23    38   31    24    41   32

Table 6c. Size. Probability of rejection for χ² statistics under the null that all pricing errors are zero.

                      Monte Carlo                       Block-Bootstrap
                Time series       GMM/DF          Time series       GMM/DF
Sample size:   240   876  876   240   876  876   240   876  876   240   876  876
level (%):       5     5    1     5     5    1     5     5    1     5     5    1
i.i.d.          17    48   26                     11    40   18
0 lags          17    48   26    17    50   27    15    54   28    14    55   29
3 lags, NW      22    49   27    21    51   29    18    57   31    17    59   33
24 lags         29    60   53    29    66   57    27    63   56    29    68   60

Table 7c. Power. Probability of rejection for χ² statistics under the null that the CAPM is false, and the true means of the decile portfolio returns are equal to their sample means.

Start with the Monte Carlo evaluation of the time-series test in Table 6c. The i.i.d. and 0-lag distributions produce nearly exact rejection probabilities in the long sample and slightly too many (7.5%) rejections in the short sample. Moving down, the GMM distributions here correct for things that aren't there. This has a small but noticeable effect on the sensible 3-lag test, which rejects slightly too often under this null. Naturally, this is worse for the short sample. But looking across the rows, the time-series and discount factor tests are nearly identical in every case; the variation across techniques is almost zero, given the spectral density estimate. The 24-lag unweighted spectral density is the usual disaster, rejecting far too often. It is singular in many samples. In the long sample, the 1% tail of this distribution occurs at a χ² value of 440 rather than the 23.2 of the χ²(10) distribution!
The long-sample block-bootstrap in the right half of the tables shows, even in this simple setup, how i.i.d. normal assumptions can be misleading. The traditional i.i.d. χ² test has almost half the correct size: it rejects a 10% test 6% of the time, a 5% test 2.8% of the time, and a 1% test 0.6% of the time. Removing the assumption that returns and factors are independent, going from i.i.d. to 0 lags, brings about half of the size distortion back, while adding one of the sensible autocorrelation corrections does the rest. In each row, the time-series and GMM/DF methods again produce almost exactly the same results. The 24-lag spectral density matrices are a disaster as usual.
Table 7c shows the rejection probabilities under the alternative. The most striking feature
of the table is that the GMM/discount factor test gives almost exactly the same rejection
probability as the time-series test, for each choice of spectral density estimation technique.
When there is a difference, the GMM/discount factor test rejects slightly more often. The
24 lag tests reject most often, but this is not surprising given that they reject almost as often
under the null.
Parameter estimates and standard errors
Table c5 presents the sampling variation of the λ and b estimates. The rows and columns marked $\sigma(\hat{\lambda})$, $\sigma(\hat{b})$, in italic font, give the variation of the estimated λ or b across the 5000 artificial samples. The remaining rows and columns give the average across samples of the standard errors. The presence of pricing errors has little effect on the estimated b or λ and their standard errors, so I only present results under the null that the CAPM is true. The parameters are not directly comparable: the b parameter includes the variance as well as the mean of the factor, and $E_T(R^{em})$ is the natural GMM estimate of the mean market return, just as it is the time-series estimate of the factor risk premium. Still, it is interesting to know and to compare how well the two methods do at estimating their central parameter.


                            Monte Carlo                          Block-Bootstrap
                  Time    1st      2nd stage            Time    1st      2nd stage
                  series  stage  σ(b̂)    E(s.e.)        series  stage  σ(b̂)    E(s.e.)
Long sample, T = 876
σ(λ̂), σ(b̂)       0.19    0.64                           0.20    0.69
i.i.d.            0.18                                   0.18
0 lags            0.18    0.65   0.61     0.60           0.18    0.63   0.67     0.60
3 lags, NW        0.18    0.65   0.62     0.59           0.19    0.67   0.67     0.62
24 lags           0.18    0.62   130      0.27           0.19    0.66   1724     0.24
Short sample, T = 240
σ(λ̂), σ(b̂)       0.35    1.25                           0.37    1.40
i.i.d.            0.35                                   0.35
0 lags            0.35    1.23   1.24     1.14           0.35    1.24   1.45     1.15
3 lags, NW        0.35    1.22   1.26     1.11           0.36    1.31   1.48     1.14
24 lags           0.29    1.04   191      0.69           0.31    1.15   893      0.75

Table c5. Monte Carlo and block-bootstrap evaluation of the sampling variability of the parameter estimates b and λ. The Monte Carlo redraws 5000 artificial data sets of length T = 876 and T = 240 from a random normal, assuming that the CAPM is true. The block-bootstrap redraws the data in groups of 3 with replacement. The rows and columns marked $\sigma(\hat{\lambda})$ and $\sigma(\hat{b})$, in italic font, give the variation across samples of the estimated λ and b. The remaining entries of the "Time series," "1st stage," and "E(s.e.)" columns, in roman font, give the average value of the computed standard error of the parameter estimate, where the average is taken over the 5000 samples.

The central message of this table is that the GMM/DF estimates behave almost exactly like
the time-series estimates, and the asymptotic standard error formulas almost exactly capture
the sampling variation of the estimates. The second stage GMM/DF estimate is a little bit
more efficient, at the cost of slightly misleading standard errors.
Start with the long sample and the first column. All of the standard error formulas give
essentially identical and correct results for the time-series estimate. Estimating the sample
mean is not rocket science. The first stage GMM/DF estimator in the second column behaves
the same way, except for the usually troublesome 24 lag unweighted estimate.
The second stage GMM/DF estimate in the third and fourth columns uses the inverse
spectral density matrix to weight, and so the estimator depends on the choice of spectral
density estimate. The sensible spectral density estimates (not 24 lags) produce second-stage
estimates that vary less than the first-stage estimates, 0.61-0.62 rather than 0.64. Second
stage GMM is more efficient, meaning that it produces estimates with smaller sampling
variation. However, the table shows that the efficiency gain is quite small, so not much is
lost if one prefers first stage OLS estimates. The sensible spectral density estimates produce
second-stage standard errors that again almost exactly capture the sampling variation of the
estimated parameters.
The 24 lag unweighted estimate produces hugely variable estimates and artificially small
standard errors. Using bad or even singular spectral density estimates seems to have a
secondary effect on standard error calculations, but using such an estimate's inverse as a
weighting matrix can have a dramatic effect on estimation.
With the block-bootstrap in the right-hand side of Table 5, the time-series estimate is
slightly more volatile as a result of the slight autocorrelation in the market return. The i.i.d.
and zero-lag formulas do not capture this effect, but the GMM standard errors that allow
autocorrelation do pick it up. However, this is a very minor effect, as there is very little auto-
correlation in the market return. The effect is more pronounced in the first stage GMM/DF
estimate, since the smaller firm portfolios depart more from the normal i.i.d. assumption.
The true variation is 0.69, but standard errors that ignore autocorrelation only produce 0.63.
The standard errors that correct for autocorrelation are nearly exact. In the second-stage
GMM/DF, the sensible spectral density estimates again produce slightly more efficient es-
timates than the first stage, with variation of 0.67 rather than 0.69. This comes at a cost,
though: the asymptotic standard errors are a bit less reliable.
In the shorter sample, we see that standard errors for the mean market return in the Time-
series column are all quite accurate, except in the usual 24 lag case. In the GMM/DF case, we
see that the actual sampling variability of the b estimate is no longer smaller for the second
stage. The second stage estimate is not more efficient in this “small” sample. Furthermore,
while the first stage standard errors are still decently accurate, the second stage standard
errors substantially understate the true sampling variability of the parameter estimate. They
represent a hoped-for efficiency that is not present in the small sample. Even in this simple
setup, first-stage GMM is clearly a better choice for estimating the central parameter, and
hence for examining individual pricing errors and their pattern across assets.

Chapter 16. Which method?
Of course, the point of GMM/discount factor methods is not a gain in efficiency or simplic-
ity in a traditional setup - linear factor model, i.i.d. normally distributed returns and factors,
etc. It's hard to beat the efficiency or simplicity of regression methods in those setups. The
point of the GMM/discount factor approach is that it allows a simple technique for evalu-
ating nonlinear or otherwise complex models, for including conditioning information while
not requiring the econometrician to see everything that the agent sees, and for allowing the
researcher to circumvent inevitable model misspecifications or simplifications and data prob-
lems by keeping the econometrics focused on interesting issues.
The alternative is usually some form of maximum likelihood. This is much harder in most
circumstances, since you have to write down a complete statistical model for the conditional
distribution of your data. Just evaluating, let alone maximizing, the likelihood function is
often challenging. Whole series of papers are written just on the econometric issues of par-
ticular cases, for example how to maximize the likelihood functions of specific classes of
univariate continuous-time models for the short interest rate.
Of course, there is no necessary pairing of GMM with the discount factor expression of a
model, and ML with the expected return-beta formulation. Many studies pair discount factor
expressions of the model with ML, and many others evaluate expected return-beta models by
GMM, as we have done in adjusting regression standard errors for non-i.i.d. residuals.
Advanced empirical asset pricing faces an enduring tension between these two philoso-
phies. The choice essentially involves tradeoffs between statistical efficiency, the effects of
misspecification of both the economic and statistical models, and the clarity and economic
interpretability of the results. There are situations in which it's better to trade some small ef-
ficiency gains for the robustness of simpler procedures or more easily interpretable moments;
OLS can be better than GLS. The central reason is specification errors: the fact that our sta-
tistical and economic models are at best quantitative parables. There are other situations in
which one may really need to squeeze every last drop out of the data, intuitive moments are
statistically very inefficient, and more intensive maximum-likelihood approaches are more
appropriate. Unfortunately, the environments are complex, and differ from case to case. We
don't have universal theorems from statistical theory or generally applicable Monte Carlo ev-
idence. Specification errors by their nature resist quantitative modeling - if you knew how
to model them, they wouldn't be there. We can only think about the lessons of past experi-
ences. In my experience, in the limited range of applications I have worked with, a GMM
approach based on simple, easily interpretable moments has proved far more fruitful than for-
mal maximum likelihood. In addition, I have found first stage GMM - OLS cross-sectional
regressions - to be more trustworthy than second-stage GMM, in any case where there was a
substantial difference between the two approaches.
The rest of this chapter collects some thoughts on the choice between formal ML and
less formal GMM, focusing on economically interesting rather than statistically informative moments.


“ML” vs. “GMM”

The debate is often stated as a choice between “maximum likelihood” and “GMM.” This
is a bad way to put the issue. ML is a special case of GMM: it suggests a particular choice
of moments that are statistically optimal in a well-defined sense. Given the set of moments,
the distribution theories are identical. Also, there is no such thing as “the” GMM estimate.
GMM is a flexible tool; you can use any aT matrix and gT moments that you want to use.
For example, we saw how to use GMM to derive the asymptotic distribution of the standard
time-series regression estimator with autocorrelated returns. The moments in this case were
not the pricing errors. It's all GMM; the issue is the choice of moments. Both ML and GMM
are tools that a thoughtful researcher can use in learning what the data say about a given
asset pricing model, rather than stone tablets giving precise directions that lead to truth
if followed literally. If followed literally and thoughtlessly, both ML and GMM can lead to
horrendous results.
The choice is between moments selected by an auxiliary statistical model, even if com-
pletely economically uninterpretable, and moments selected for their economic or data-sum-
mary interpretation, even if not statistically efficient.
ML is often ignored
As we have seen, ML plus the assumption of normal i.i.d. disturbances leads to easily
interpretable time-series or cross-sectional regressions, empirical procedures that are close to
the economic content of the model. However, asset returns are not normally distributed or
i.i.d. They have fatter tails than a normal, they are heteroskedastic (times of high and times
of low volatility), they are autocorrelated, and they are predictable from a variety of variables. If
one were to take seriously the ML philosophy and its quest for efficiency, one should model
these features of returns. The result would be a different likelihood function, and its scores
would prescribe different moment conditions than the familiar and intuitive time-series or
cross-sectional regressions.
Interestingly, few empirical workers do this. (The exceptions tend to be papers whose
primary point is illustration of econometric technique rather than empirical findings.) ML
seems to be fine when it suggests easily interpretable regressions; when it suggests something
else, people use the regressions anyway.
For example, ML prescribes that one estimate βs without a constant, yet βs are almost univer-
sally estimated with a constant. Researchers often run cross-sectional regressions rather than
time-series regressions, even when the factors are returns. ML specifies a GLS cross-sectional
regression, but many empirical workers use OLS cross-sectional regressions instead, distrust-
ing the GLS weighting matrix. Time-series regressions are almost universally run with a con-
stant, though ML prescribes a regression with no constant. The true ML formulas for GLS
regressions require one to iterate between non-OLS formulas for the betas, the covariance matrix
estimate, and the cross-sectional regression estimate; empirical applications usually use the
unconstrained estimates of all these quantities. And of course, the regression tests
continue to be run, with ML justifications, despite the fact that returns are not i.i.d. The
regressions came first, and the maximum likelihood formalization came later. If we had to
assume that returns had a gamma distribution to justify the regressions, it's a sure bet that we
would make that “assumption” behind ML instead of the normal i.i.d. assumption!
The reason must be that researchers feel that, by omitting some of the information in the null
hypothesis, the estimation and test are more robust, though some efficiency is lost if the null
economic and statistical models are exactly correct. Researchers must not really believe that
their null hypotheses, statistical and economic, are exactly correct. They want to produce es-
timates and tests that are robust to reasonable model mis-specifications. They also want to
produce estimates and tests that are easily interpretable, that capture intuitively clear styl-
ized facts in the data, and that relate directly to the economic concepts of the model. Such
estimates are persuasive in large part because the reader can see that they are robust. (And
following this train of thought, one might want to pursue estimation strategies that are even
more robust than OLS, since OLS places a lot of weight on outliers. For example, Chen and
Ready (1997) claim that Fama and French's (1993) size and value effects depend crucially on a
few outliers.)
ML does not necessarily produce robust or easily interpretable estimates. It wasn't de-
signed to do so. The point and advertisement of ML is that it provides efficient estimates;
it uses every scrap of information in the statistical and economic model in the quest for ef-
ficiency. It does the “right” efficient thing if the model is true. It does not necessarily do the
“reasonable” thing for “approximate” models.
OLS vs. GLS cross-sectional regressions
One place in which this argument crystallizes is in the choice between OLS and GLS
cross-sectional regressions, or equivalently between first and second stage GMM.
The last chapter can lead to a mistaken impression that the choice doesn't matter that much. This
is true to some extent in that simple environment, but not in more complex environments. For
example, Fama and French (1997) report important correlations between betas and pricing
errors in a time-series test of a three-factor model on industry portfolios. This correlation
cannot happen with an OLS cross-sectional estimate, as the cross-sectional estimate sets
the cross-sectional correlation between right hand variables (betas) and error terms (pricing
errors) to zero by construction. First stage estimates seem to work better in factor pricing
models based on macroeconomic data. For example, Figure 5 presents the first stage estimate
of the consumption-based model. The second-stage estimate produced much larger individual
pricing errors, because by so doing it could lower the pricing errors of portfolios with strong
long-short positions required by the spectral density matrix. The same thing happened in the
investment-based factor pricing model of Cochrane (1996), and the scaled consumption-based
model of Lettau and Ludvigson (2000). Authors as far back as Fama and MacBeth (1973)
have preferred OLS cross-sectional regressions, distrusting the GLS weights.
GLS and second-stage GMM gain their asymptotic efficiency when the covariance and
spectral density matrices have converged to their population values. GLS and second stage
GMM use these matrices to find well-measured portfolios: portfolios with small residual
variance for GLS, and small variance of discounted return for GMM. The danger is that
these quantities are poorly estimated in a finite sample, so that sample minimum-variance port-
folios bear little relation to population minimum-variance portfolios. This by itself should not
create too much of a problem for a perfect model, one that prices all portfolios. But an im-
perfect model that does a very good job of pricing a basic set of portfolios may do a poor job
of pricing strange linear combinations of those portfolios, especially combinations that in-
volve strong long and short positions - positions that really are outside the payoff space given
transactions, margin, and short sales constraints. Thus, the danger is the interaction between
spurious sample minimum-variance portfolios and the specification errors of the model.
Interestingly, Kandel and Stambaugh (1995) and Roll and Ross (1995) argue for GLS
cross-sectional regressions also as a result of model misspecification. They show that so long
as there is any misspecification at all - so long as the pricing errors
are not exactly zero; so long as the market proxy is not exactly on the mean-variance frontier
- there are portfolios that produce arbitrarily good and arbitrarily bad fits in plots of
expected returns vs. betas. Since even a perfect model leaves pricing errors in sample, this is
always true in samples.
It's easy to see the basic argument. Take a portfolio long the positive alpha securities
and short the negative alpha securities; it will have a really big alpha! More precisely, if the
original securities follow

E(Re) = α + λβ,

then consider portfolios of the original securities formed from a non-singular matrix A. They follow

E(ARe) = Aα + λAβ.

You can make all these portfolios have the same β with Aβ = constant, and then they will
have a spread in alphas. You will see a plot in which all the portfolios have the same beta
but the average returns are spread up and down. Conversely, you can pick A to make the
expected return-beta plot look as good as you want.
GLS has an important feature in this situation: the GLS cross-sectional regression is
independent of such repackaging of portfolios. If you transform a set of returns Re to ARe,
then the OLS cross-sectional regression is transformed from

λ = (β'β)^-1 β'E(Re)

to

λ = (β'A'Aβ)^-1 β'A'A E(Re).

This does depend on the repackaging A. However, the residual covariance matrix of ARe is
AΣA', so the GLS regression

λ = (β'Σ^-1 β)^-1 β'Σ^-1 E(Re)

is not affected, so long as A is full rank and therefore does not throw away information:

λ = (β'A'(AΣA')^-1 Aβ)^-1 β'A'(AΣA')^-1 A E(Re) = (β'Σ^-1 β)^-1 β'Σ^-1 E(Re).

(The spectral density matrix and the second stage estimate share this property in GMM es-
timates. These are not the only weighting matrix choices that are invariant to repackaging of
portfolios. For example, Hansen and Jagannathan's (1997) suggestion of the return second
moment matrix has the same property.)
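This invariance is easy to check numerically. The sketch below uses hypothetical, randomly generated betas, mean returns, residual covariance Σ, and repackaging matrix A (none from the text): the OLS cross-sectional estimate changes when returns are repackaged as ARe, while the GLS estimate does not.

```python
import numpy as np

rng = np.random.default_rng(0)
N, K = 8, 2
beta = rng.standard_normal((N, K))       # hypothetical betas
ERe = rng.standard_normal(N)             # hypothetical mean excess returns
S = rng.standard_normal((N, N))
Sigma = S @ S.T                          # residual covariance (positive definite)
A = rng.standard_normal((N, N))          # full-rank repackaging matrix

def ols_lam(b, mu):
    # lambda = (b'b)^-1 b' mu
    return np.linalg.solve(b.T @ b, b.T @ mu)

def gls_lam(b, mu, Sig):
    # lambda = (b' Sig^-1 b)^-1 b' Sig^-1 mu
    Si = np.linalg.inv(Sig)
    return np.linalg.solve(b.T @ Si @ b, b.T @ Si @ mu)

lam_ols  = ols_lam(beta, ERe)
lam_olsA = ols_lam(A @ beta, A @ ERe)              # depends on A
lam_gls  = gls_lam(beta, ERe, Sigma)
lam_glsA = gls_lam(A @ beta, A @ ERe, A @ Sigma @ A.T)  # invariant to A
```

With these random draws, `lam_gls` and `lam_glsA` agree to machine precision while the two OLS estimates differ, exactly as the algebra above predicts.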
This is a fact, but it does not show that OLS chooses a particularly good or bad set of
portfolios. Perhaps you don't think that GLS' choice of portfolios is particularly informative.
In this case, you use OLS precisely to focus attention on a particular set of economically
interesting portfolios.
The choice depends subtly on what you want your test to accomplish. If you want to prove
the model wrong, then GLS helps you to focus on the most informative portfolios for proving
the model wrong. That is exactly what an efficient test is supposed to do. But many models
are wrong, yet still pretty darn good. It is a shame to throw out the information that the model
does a good job of pricing an interesting set of portfolios. The sensible compromise would
seem to be to report the OLS estimate on “interesting” portfolios, and also to report the GLS
test statistic that shows the model to be rejected. That is, in fact, the typical collection of
results.
Additional examples of trading off efficiency for robustness
Here are some additional examples of situations in which it has turned out to be wise to
trade off some apparent efficiency for robustness to model misspecifications.
Low-frequency time-series models. In estimating time-series models such as the AR(1),
maximum likelihood minimizes the one-step-ahead forecast error variance, E(εt²). But any time-
series model is only an approximation, and the researcher's objective may not be one-step-
ahead forecasting. For example, in making sense of the yield on long-term bonds, one is in-
terested in the long-run behavior of the short rate of interest. In estimating the magnitude
of long-horizon univariate mean reversion in stock returns, we want to know only the sum
of autocorrelations or moving average coefficients. (We will study this application in sec-
tion 20.) The approximate model that generates the smallest one-step-ahead forecast
error variance may be quite different from the model that best matches long-run autocorrela-
tions. ML can pick the wrong model and make very bad predictions for long-run responses.
(Cochrane 1986 contains a more detailed analysis of this point in the context of long-horizon
GDP forecasting.)
Lucas' money demand estimate. Lucas (1988) is a gem of an example. Lucas was in-
terested in estimating the income elasticity of money demand. Money and income trend
upwards over time and over business cycles, but also have some high-frequency movement
that looks like noise. If you run a regression in log-levels,

mt = a + byt + εt,

you get a sensible coefficient of about b = 1, but you find that the error term is strongly
serially correlated. Following standard advice, most researchers run GLS, which amounts
pretty much to first-differencing the data,

mt - mt-1 = b(yt - yt-1) + ηt.

This error term passes its Durbin-Watson statistic, but the b estimate is much lower, which
doesn't make much economic sense, and, worse, is unstable, depending a lot on time period
and data definitions. Lucas realized that the regression in differences threw out all of the
information in the data, which was in the trend, and focused on the high-frequency noise.
Therefore, the regression in levels, with standard errors corrected for correlation of the error
term, is the right one to look at. Of course, GLS and ML didn't know there was any “noise” in
the data, which is why they threw out the baby and kept the bathwater. Again, ML ruthlessly
exploits the null for efficiency, and has no way of knowing what is “reasonable” or “intuitive.”
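A small simulation conveys the point. The setup below is hypothetical (not Lucas's data): measured income is a smooth trend plus i.i.d. high-frequency noise, and money demand has unit elasticity with respect to the trend component. The levels regression recovers b near 1 because the trend dominates; first-differencing removes the trend and leaves mostly noise, driving the estimate toward zero.

```python
import numpy as np

rng = np.random.default_rng(0)
T = 400
y_true = 0.01 * np.arange(T)                       # smooth "permanent" income trend
y = y_true + 0.05 * rng.standard_normal(T)         # measured income: trend + noise
m = 1.0 * y_true + 0.02 * rng.standard_normal(T)   # money demand, true elasticity 1

def slope(x, z):
    """OLS slope of z on x, with a constant."""
    x, z = x - x.mean(), z - z.mean()
    return (x @ z) / (x @ x)

b_levels = slope(y, m)                  # trend variance dominates: close to 1
b_diff = slope(np.diff(y), np.diff(m))  # differencing keeps mostly noise: near 0
```

With these parameter choices the levels estimate is within a few percent of 1, while the difference estimate is close to zero and swings with the seed, echoing the instability Lucas documented.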
Stochastic singularities and calibration. Models of the term structure of interest rates (we
will study these models in section 19) and real business cycle models in macroeconomics
give even more stark examples. These models are stochastically singular. They generate
predictions for many time series from a few shocks, so the models predict that there are
combinations of the time series that leave no error term. Even though the models have rich
and interesting implications, ML will seize on this economically uninteresting singularity,
refuse to estimate parameters, and reject any model of this form.
The simplest example of the situation is the linear-quadratic permanent income model
paired with an AR(1) specification for income. The model is

y_t = ρ y_{t-1} + ε_t

c_t - c_{t-1} = (E_t - E_{t-1}) [1/(1-β)] Σ_{j=0}^{∞} β^j y_{t+j} = [1/((1-βρ)(1-β))] ε_t.

This model generates all sorts of important and economically interesting predictions for the
joint process of consumption and income (and asset prices). Consumption should be roughly
a random walk, and should respond only to permanent income changes; investment should be
more volatile than income, and income more volatile than consumption. Since there is only
one shock and two series, however, the model taken literally predicts a deterministic relation
between consumption and income; it predicts

c_t - c_{t-1} = [1/((1-βρ)(1-β))] (y_t - ρ y_{t-1}).

ML will notice that this is the statistically most informative prediction of the model. There
is no error term! In any real data set there is no configuration of the parameters r, β, ρ that
makes this restriction hold, data point for data point. The probability of observing a data set
{ct, yt} is exactly zero, and the log likelihood function is -∞ for any set of parameters. ML
says to throw the model out.
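The singularity is easy to see in a short simulation with hypothetical parameter values. The constant kappa below multiplies the income innovation; its exact form does not matter for the point, which is that model-generated consumption growth satisfies the restriction with literally zero error, so any real data set has zero likelihood.

```python
import numpy as np

rng = np.random.default_rng(1)
beta, rho, T = 0.95, 0.8, 200   # hypothetical parameter values
eps = rng.standard_normal(T)

# income follows the AR(1)
y = np.zeros(T)
for t in range(1, T):
    y[t] = rho * y[t - 1] + eps[t]

# model-implied consumption change: a constant times the income innovation
kappa = 1.0 / ((1.0 - beta * rho) * (1.0 - beta))
dc = kappa * eps

# the singular restriction dc_t = kappa * (y_t - rho * y_{t-1}) holds exactly,
# up to floating-point rounding
resid = dc[1:] - kappa * (y[1:] - rho * y[:-1])
```

Any measurement noise added to either series makes `resid` nonzero at every date, which is exactly why ML assigns the singular model a log likelihood of -∞ on real data.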
The popular affine models of the term structure of interest rates act the same way. They
specify that all yields at any moment in time are deterministic functions of a few state vari-
ables. Such models can capture much of the important qualitative behavior of the term struc-
ture, including rising, falling, and humped shapes, and the information in the term structure
for future movements in yields and the volatility of yields. They are very useful for derivative
pricing. But it is never the case in actual yield data that yields of all maturities are exact func-
tions of K yields. Actual data on N yields always require N shocks. Again, an ML approach
reports a -∞ log likelihood function for any set of parameters.
Addressing model mis-specification
The ML philosophy offers an answer to model mis-specification: specify the right model,
and then do ML. If regression errors are correlated, model and estimate the covariance matrix
and do GLS. If you are worried about proxy errors in the pricing factor, short sales costs or
other transactions costs (so that model predictions for extreme long-short positions should
not be relied on), time-aggregation or mismeasurement of consumption data, non-normal or
non-i.i.d. returns, time-varying betas and factor risk premia, additional pricing factors, and so
on - don't chat about them; write them down, and then do ML.
Following this lead, researchers have added “measurement errors” to real business cycle
models (Sargent 1989 is a classic example) and affine yield models in order to break the
stochastic singularity (I discuss this case a bit more in section 19.6). The trouble is, of course,
that the assumed structure of the measurement errors now drives what moments ML pays
attention to. And seriously modeling and estimating the measurement errors takes us further
away from the economically interesting parts of the model. (Measurement-error-augmented
models will often wind up specifying sensible moments, but by assuming ad-hoc processes
for measurement error, such as i.i.d. errors. Why not just specify the sensible moments in the
first place?)
More generally, authors tend not to follow this advice, in part because it is ultimately
infeasible. Economics necessarily studies quantitative parables rather than completely spec-
ified models. It would be nice if we could write down completely specified models, if we
could quantitatively describe all the possible economic and statistical models and specifica-
tion errors, but we can't.
The GMM framework, used judiciously, allows us to evaluate misspecified models. It al-
lows us to direct the statistical effort to focus on the “interesting” predictions while ignoring
the fact that the world does not match the “uninteresting” simplifications. For example, ML
only gives us a choice of OLS, whose standard errors are wrong, or GLS, which we may not
trust in small samples or which may focus on uninteresting parts of the data. GMM allows
us to keep an OLS estimate, but correct the standard errors for non-i.i.d. distributions. More
generally, GMM allows one to specify an economically interesting set of moments, or a set
of moments that one feels will be robust to misspecifications of the economic or statistical
model, without having to spell out exactly what is the source of model mis-specification that
makes those moments “optimal” or even “interesting” and “robust.” It allows one to accept
the lower “efficiency” of the estimates under some sets of statistical assumptions, in return
for such robustness.
At the same time, the GMM framework allows us to flexibly incorporate statistical model
misspecifications in the distribution theory. For example, knowing that returns are not i.i.d.
normal, one may want to use the time-series regression technique to estimate betas anyway.
This estimate is not inconsistent, but the standard errors that the ML formulas pump out under
this assumption are inconsistent. GMM gives a flexible way to derive at least an asymptotic
set of corrections for statistical model misspecifications of the time-series regression coeffi-
cient. Similarly, a pooled time-series cross-sectional OLS regression is not inconsistent, but
standard errors that ignore cross-correlation of error terms are far too small.
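As an illustration of this kind of correction (a generic sketch, not a formula from the text), the function below keeps the OLS point estimate but computes a GMM “sandwich” standard error, estimating the spectral density of the moments x_t·u_t with Newey-West (Bartlett-kernel) weights so that autocorrelated, heteroskedastic errors are allowed.

```python
import numpy as np

def ols_nw(x, y, lags):
    """OLS of y on [1, x], with Newey-West (Bartlett) GMM standard errors."""
    X = np.column_stack([np.ones_like(x), x])
    b = np.linalg.lstsq(X, y, rcond=None)[0]   # OLS point estimate, kept as-is
    u = y - X @ b
    T = len(y)
    h = X * u[:, None]                         # moment contributions g_t = x_t * u_t
    S = h.T @ h / T                            # j = 0 autocovariance term
    for j in range(1, lags + 1):               # Bartlett-weighted lag terms
        w = 1.0 - j / (lags + 1.0)
        G = h[j:].T @ h[:-j] / T
        S += w * (G + G.T)
    d_inv = np.linalg.inv(X.T @ X / T)
    V = d_inv @ S @ d_inv / T                  # sandwich: d^-1 S d^-1 / T
    return b, np.sqrt(np.diag(V))
```

Calling `ols_nw(x, y, lags=0)` gives heteroskedasticity-only standard errors; with persistent regressors and persistent errors, the lagged terms inflate the standard error substantially while the slope estimate itself is unchanged.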
The “calibration” of real business cycle models is often really nothing more than a GMM
parameter estimate, using economically sensible moments such as average output growth,
consumption/output ratios, etc. to avoid the stochastic singularity that would doom an ML ap-
proach. (Kydland and Prescott's 1982 idea that empirical microeconomics would provide
accurate parameter estimates for macroeconomic and financial models has pretty much van-
ished.) Calibration exercises usually do not compute standard errors, nor do they report any
distribution theory associated with the “evaluation” stage when one compares the model's
predicted second moments with those in the data. Following Burnside, Eichenbaum and Re-
belo (1993), however, it's easy enough to calculate such a distribution theory - to evaluate
whether the difference between predicted second moments and actual moments is large
compared to sampling variation, including the variation induced by parameter estimation in
the same sample - by listing the first and second moments together in the gT vector.
“Used judiciously” is an important qualification. Many GMM estimations and tests suf-
fer from lack of thought in the choice of moments, test assets, and instruments. For example,
early GMM papers tended to pick assets and especially instruments pretty much at random.
Industry portfolios have almost no variation in average returns to explain. Authors often
included many lags of returns and consumption growth as instruments to test a consumption-
based model. However, the 7th lag of returns really doesn't predict much about future returns
given lags 1-12, and the first-order serial correlation in seasonally adjusted, ex-post revised
consumption growth may be economically uninteresting. More recent work tends to em-
phasize a few well-chosen assets and instruments that capture important and economically
interesting features of the data.
Auxiliary model
ML requires an auxiliary statistical model. For example, in the classic ML formalization
of regression tests, we had to stop and assume that returns and factors are jointly i.i.d. nor-
mal. As the auxiliary statistical model becomes more and more complex, and hence more realistic,
more and more effort is devoted to estimating the auxiliary statistical model. ML has no way
of knowing that some parameters - a, b; β, λ; risk aversion γ - are more “important” than
others - Σ, and parameters describing time-varying conditional moments of returns.
A very convenient feature of GMM is that it does not require such an auxiliary statistical
model. For example, in studying GMM we went straight from p = E(mx) to moment
conditions, estimates, and distribution theory. This is an important saving of the researcher's
and the reader's time, effort, and attention.
Finite sample distributions
Many authors say they prefer regression tests, and the GRS statistic in particular, because
they have a finite-sample distribution theory, and they distrust the finite-sample performance of
the GMM asymptotic distribution theory.
This argument does not have much force. The finite-sample distribution only holds if
returns really are normal and i.i.d., and if the factor is perfectly measured. Since these as-
sumptions do not hold, it is not obvious that a finite-sample distribution that ignores non-i.i.d.
returns will be a better approximation than an asymptotic distribution that corrects for them.
All approaches give essentially the same answers in the classic setup of i.i.d. returns.
The issue is how the various techniques perform in more complex setups, especially with
conditioning information, and here there are no analytic finite-sample distributions.
In addition, once you have picked the estimation method - how you will generate a num-
ber from the data, or which moments you will use - finding its finite-sample distribution,
given an auxiliary statistical model, is simple: just run a Monte Carlo or bootstrap. Thus,
picking an estimation method because it delivers analytic formulas for a finite-sample distri-
bution (under false assumptions) should be a thing of the past. Analytic formulas for finite-
sample distributions are useful for comparing estimation methods and arguing about statisti-
cal properties of estimators, but they are not necessary for the empiricist's main task.
Finite sample quality of asymptotic distributions, and “nonparametric” estimates
Several investigations (Ferson and Foerster 1994; Hansen, Heaton, and Yaron 1996) have
found cases in which the GMM asymptotic distribution theory is a poor approximation to
the finite-sample distribution. This is especially true when one asks “non-parametric”
corrections for autocorrelation or heteroskedasticity to provide large corrections, when
the number of moments is large compared to the sample size, or when the moments one uses for
GMM turn out to be very inefficient (Fuhrer, Moore, and Schuh 1995), which can happen if
you put in a lot of instruments with low forecast power.
The ML distribution is the same as GMM, conditional on the choice of moments, but typ-
ical implementations of ML also use the parametric time-series model to simplify estimates
of the terms in the distribution theory as well as to derive the likelihood function.
If this is the case - if the "nonparametric" estimates of the GMM distribution theory perform poorly in a finite sample, while the "parametric" ML distribution works well - there is no reason not to use a parametric time-series model to estimate the terms in the GMM distribution as well. For example, rather than calculate Σ_{j=−∞}^{∞} E(u_t u_{t−j}) from a large sum of sample autocorrelations, you can model u_t = ρ u_{t−1} + ε_t, estimate ρ, and then calculate

Σ_{j=−∞}^{∞} ρ^|j| σ²(u) = σ²(u) (1 + ρ)/(1 − ρ).

Section 11.7 discussed this idea in more detail.
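A small simulation illustrates this parametric correction; the AR(1) specification and all parameter values are illustrative:

```python
import numpy as np

# Sketch: two estimates of the long-run variance S = sum_j E(u_t u_{t-j}),
# assuming the errors really do follow an AR(1): u_t = rho*u_{t-1} + eps_t.
rng = np.random.default_rng(0)
rho_true, T = 0.5, 100_000
eps = rng.standard_normal(T)
u = np.zeros(T)
for t in range(1, T):
    u[t] = rho_true * u[t - 1] + eps[t]

# Parametric: estimate rho by OLS, then apply S = sigma^2(u) (1+rho)/(1-rho).
rho_hat = (u[1:] @ u[:-1]) / (u[:-1] @ u[:-1])
S_param = u.var() * (1 + rho_hat) / (1 - rho_hat)

# "Nonparametric": truncated sum of sample autocovariances.
def gamma(j):
    return np.mean((u[j:] - u.mean()) * (u[: len(u) - j] - u.mean()))

S_sum = gamma(0) + 2 * sum(gamma(j) for j in range(1, 50))

# True value: sigma^2(eps)/(1-rho)^2 = 1/(1-0.5)^2 = 4.
print(S_param, S_sum)
```

When the AR(1) is right, the parametric estimate uses one parameter where the sum of autocovariances uses fifty, which is the source of its better finite-sample behavior.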

The case for ML
In the classic setup, the efficiency gain of ML over GMM on the pricing errors is tiny. However, several studies have found cases in which the statistically motivated choice of moments suggested by ML has important efficiency advantages.
For example, Jacquier, Polson and Rossi (1994) study the estimation of a time-series
model with stochastic volatility. This is a model of the form
dSt/St = µ dt + √Vt dZ1t
dVt = µV(Vt) dt + σ(Vt) dZ2t,
and S is observed but V is not. The obvious and easily interpretable moments include the
autocorrelation of squared returns, or the autocorrelation of the absolute value of returns.
However, Jacquier, Polson and Rossi find that the resulting estimates are far less efficient than those resulting from the ML scores.
Of course, this study presumes that the model (16.213) really is exactly true. Whether the uninterpretable scores or the interpretable moments really perform better when (16.213) is only an approximate model of some other data-generating mechanism is open to discussion.
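To see why these are the natural "interpretable" moments, a discrete-time stand-in for the stochastic-volatility model (all parameters illustrative) generates returns that are serially uncorrelated while their squares are autocorrelated:

```python
import numpy as np

# Sketch: discrete-time stochastic volatility.  Log variance follows an AR(1);
# returns are uncorrelated, but squared returns inherit the volatility
# persistence.  Parameters are illustrative, not calibrated.
rng = np.random.default_rng(1)
T, phi = 200_000, 0.95
logv = np.zeros(T)
shocks = 0.3 * rng.standard_normal(T)
for t in range(1, T):
    logv[t] = phi * logv[t - 1] + shocks[t]
r = np.exp(logv / 2) * rng.standard_normal(T)  # return = sqrt(V) * z

def autocorr(x):
    x = x - x.mean()
    return (x[1:] @ x[:-1]) / (x @ x)

print(autocorr(r), autocorr(r ** 2))  # near zero; clearly positive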
Even in the canonical OLS vs. GLS case, a wildly heteroskedastic error covariance matrix can mean that OLS spends all its effort fitting unimportant data points. A "judicious" application of GMM (OLS) in this case would require at least some transformation of units so that OLS is not wildly inefficient.
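A small simulation illustrates the point; the error variances, slope, and sample size below are made up:

```python
import numpy as np

# Sketch: with wildly heteroskedastic errors, OLS "spends its effort" on the
# noisiest observations; GLS (here, weighted least squares with known
# variances) downweights them.  All numbers are illustrative.
rng = np.random.default_rng(2)
T = 2_000
x = rng.standard_normal(T)
sigma = np.where(np.arange(T) < T // 2, 0.1, 10.0)  # half quiet, half wild
y = 1.5 * x + sigma * rng.standard_normal(T)

# OLS slope (no intercept): beta = sum(x*y) / sum(x^2)
beta_ols = (x @ y) / (x @ x)

# GLS/WLS slope: weight each observation by its inverse error variance.
w = 1.0 / sigma**2
beta_gls = ((w * x) @ y) / ((w * x) @ x)

print(beta_ols, beta_gls)  # both near 1.5; GLS is far more precise
```

GLS effectively ignores the noisy half of the sample, which is what a sensible transformation of units would accomplish by hand.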
Statistical philosophy
The history of empirical work that has been persuasive - that has changed people's understanding of the facts in the data and which economic models understand those facts - looks a lot different than the statistical theory preached in econometrics textbooks.
The CAPM was taught and believed in and used for years despite formal statistical rejections. It only fell by the wayside when other, coherent views of the world were offered in the multifactor models. And the multifactor models are also rejected! It seems that "it takes a model to beat a model," not a rejection.
Even when evaluating a specific model, most of the interesting calculations come from examining specific alternatives rather than overall pricing-error tests. The original CAPM tests focused on whether the intercept in a cross-sectional regression was higher or lower than the risk-free rate, and whether individual variance entered into cross-sectional regressions. The CAPM fell when it was found that characteristics such as size and book/market do enter cross-sectional regressions, not when generic pricing-error tests rejected.

Influential empirical work tells a story. The most efficient procedure does not seem to convince people if they cannot transparently see what stylized facts in the data drive the result. A test of a model that focuses on its ability to account for the cross section of average returns of interesting portfolios will in the end be much more persuasive than one that (say) focuses on the model's ability to explain the fifth moment of the second portfolio, even if ML finds the latter moment much more statistically informative.
Most recently, Fama and French (1988b) and (1993) are good examples of empirical work that changed many people's minds, in this case that long-horizon returns really are predictable, and that we need a multifactor model rather than the CAPM to understand the cross-section of average returns. These papers are not stunning statistically: long-horizon predictability is on the edge of statistical significance, and the multifactor model is rejected by the GRS test. But these papers made clear what stylized and robust facts in the data drive the results, and why those facts are economically sensible. For example, the 1993 paper focused on tables of average returns and betas. Those tables showed strong variation in average returns that was not matched by variation in market betas, yet was matched by variation in betas on new factors. There is no place in statistical theory for such a table, but it is much more persuasive than a table of χ² values for pricing-error tests. On the other hand, I can think of no case in which the application of a clever statistical model to wring the last ounce of efficiency out of a dataset, changing t-statistics from 1.5 to 2.5, substantially changed the way people think about an issue.
Statistical testing is one of many questions we ask in evaluating theories, and usually not
the most important one. This is not a philosophical or normative statement; it is a positive or
empirical description of the process by which the profession has moved from theory to theory.
Think of the kind of questions people ask when presented with a theory and accompanying empirical work. They usually start by thinking hard about the theory itself. What is the central part of the economic model or explanation? Is it internally consistent? Do the assumptions make sense? Then, when we get to the empirical work: how were the numbers produced? Are the data definitions sensible? Are the concepts in the data decent proxies for the concepts in the model? (There's not much room in statistical theory for that question!) Are the model predictions robust to the inevitable simplifications? Does the result hinge on power utility vs. another functional form? What happens if you add a little measurement error, or if agents have an information advantage, etc.? What are the identification assumptions, and do they make any sense - why is y on the left and x on the right rather than the other way around?
Finally, someone in the back of the room might raise his hand and ask, "if the data were generated by a draw of i.i.d. normal random variables over and over again, how often would you come up with a number this big or bigger?" That's an interesting and important check on the overall believability of the results. But it is not necessarily the first check, and certainly not the last and decisive check. Many models are kept that have economically interesting but statistically rejectable results, and many more models are quickly forgotten that have strong statistics but just do not tell as clean a story.
The classical theory of hypothesis testing, its Bayesian alternative, and the underlying hypothesis-testing view of the philosophy of science are miserable descriptions of the way science in general and economics in particular proceed from theory to theory. And this is probably a good thing too. Given the non-experimental nature of our data, the inevitable fishing biases of many researchers examining the same data, and the unavoidable fact that our theories are really quantitative parables more than literal descriptions of the way the data are generated, the way the profession settles on new theories makes a good deal of sense. Classical statistics requires that nobody ever looked at the data before specifying the model. Yet more regressions have been run than there are data points in the CRSP database. Bayesian econometrics can in principle incorporate the information of previous researchers, yet it is never applied in this way - each study starts anew with an "uninformative" prior. Statistical theory draws a sharp distinction between the model - which we know is right; utility is exactly power - and the parameters, which we estimate. But this distinction isn't true; we are just as uncertain about functional forms as we are about parameters. A distribution theory at bottom tries to ask an unknowable question: if we turned the clock back to 1947 and reran the postwar period 1000 times, in how many of those alternative histories would (say) the average S&P 500 return be greater than 9%? It's pretty amazing, in fact, that a statistician can purport to give any answer at all to such a question, having observed only one history.
These paragraphs do not contain original ideas, and they mirror changes in the philosophy of science more broadly. Fifty years ago, the reigning philosophy of science focused on the idea that scientists provide rejectable hypotheses. This idea runs through philosophical writings exemplified by Popper (1959) and classical statistical decision theory, and is mirrored in economics by Friedman (1953). However, this methodology contains an important inconsistency. Though researchers are supposed to let the data decide, writers on methodology do not look at how actual theories evolved. It was, as in Friedman's title, a "Methodology of positive economics," not a "positive methodology of economics." Why should methodology be normative, a result of philosophical speculation, and not an empirical discipline like everything else? In a very famous book, Kuhn (1970) looked at the actual history of scientific revolutions, and found that the actual process had very little to do with the formal methodology. McCloskey (1983, 1998) has gone even further, examining the "rhetoric" of economics - the kinds of arguments that persuaded people to change their minds about economic theories. Needless to say, the largest t-statistic did not win!
Kuhn's and especially McCloskey's ideas are not popular in the finance and economics professions. More precisely, they are not popular in how people talk about their work, though they describe well how people actually do their work. Most people in these fields cling to the normative, rejectable-hypothesis view of methodology. But we should not expect these ideas to be popular. The ideas of economics and finance are not popular among the agents in the models either. How many stock market investors even know what a random walk or the CAPM is, let alone believe those models have even a grain of truth? Why should the agents in the models of how scientific ideas evolve have an intuitive understanding of those models? "As if" rationality can apply to us as well!
Philosophical debates aside, a researcher who wants his ideas to be convincing, as well as right, would do well to study how ideas have convinced people in the past, rather than just study a statistical decision theorist's ideas about how ideas should convince people. Kuhn, and, in economics, McCloskey have done that, and their histories are worth reading. In the end, statistical properties may be a poor way to choose statistical methods.
The bottom line is simple: it's OK to do a first-stage or simple GMM estimate rather than an explicit maximum likelihood estimate and test. Many people (and, unfortunately, many journal referees) seem to think that nothing less than a full maximum likelihood estimate and test is acceptable. This section is long in order to counter that impression: at least in many cases of practical importance, a simple first-stage GMM approach, focusing on economically interpretable moments, can be adequately efficient, robust to model misspecifications, and ultimately more persuasive.

Bonds and options

The term structure of interest rates and derivative pricing use closely related techniques. As you might have expected, I present both issues in a discount factor context. All models come down to a specification of the discount factor. The discount factor specifications in term structure and option pricing models are quite simple.
So far, we have focused on returns, which reduces the pricing problem to a one-period or instantaneous problem. Pricing bonds and options forces us to start thinking about chaining together the one-period or instantaneous representations to get a prediction for prices of long-lived securities. Taking this step is very important, and I forecast that we will see much more multiperiod analysis in stocks as well, studying prices and streams of payoffs rather than returns. This step, rather than the discount factor, accounts for the mathematical complexity of some term structure and option pricing models.
There are two standard ways to go from instantaneous or return representations to prices. First, we can chain the discount factors together, finding from a one-period discount factor mt,t+1 a long-term discount factor mt,t+j = mt,t+1 mt+1,t+2 ... mt+j−1,t+j that can price a j-period payoff. In continuous time, we will find that the discount factor increments dΛ satisfy the instantaneous pricing equation 0 = Et[d(ΛP)], and then solve this stochastic differential equation to find the level Λt+j in order to price a j-period payoff as Pt = Et[(Λt+j/Λt) xt+j]. Second, we can chain the prices together. Conceptually, this is the same as chaining returns Rt,t+j = Rt,t+1 Rt+1,t+2 ... Rt+j−1,t+j instead of chaining together the discount factors. From 0 = Et[d(ΛP)], we find a differential equation for the prices, and solve it back. We'll use both methods to solve interest rate and option pricing models.
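The first method can be sketched in a quick Monte Carlo: with an i.i.d. lognormal one-period discount factor (parameters illustrative), the j-period discount factor is the product of the increments, and for i.i.d. m the chained price of a j-period riskless payoff equals the one-period bond price raised to the power j.

```python
import numpy as np

# Sketch: chaining one-period discount factors to price a j-period payoff,
# P_t = E_t[m_{t,t+j} x_{t+j}], with m_{t,t+j} = m_{t,t+1} m_{t+1,t+2} ... .
# Here m is i.i.d. lognormal and the payoff is 1 (a j-period zero-coupon bond).
rng = np.random.default_rng(3)
j, n = 3, 1_000_000
log_m = -0.01 + 0.02 * rng.standard_normal((n, j))  # one-period log discount factors
m_long = np.exp(log_m.sum(axis=1))                  # m_{t,t+j}: product of increments

price_chained = m_long.mean()                       # E[m_{t,t+j} * 1]

# With i.i.d. m we can equivalently chain prices: a one-period bond costs E(m),
# so the j-period bond price is E(m)**j.
price_iterated = np.exp(log_m[:, 0]).mean() ** j

print(price_chained, price_iterated)  # both near exp(j*(-0.01 + 0.02**2/2))
```

With serially correlated discount factors the two calculations no longer collapse to a power of E(m), which is exactly why term structure models need the machinery of the next chapters.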

Chapter 17. Option pricing
Options are a very interesting and useful set of instruments, as you will see in the background section. In thinking about their value, we will adopt an extremely relative pricing approach. Our objective will be to find the value of the option, taking as given the values of other securities, and in particular the price of the stock on which the option is written and an interest rate.

17.1 Background

17.1.1 Definitions and payoffs

A call option gives you the right to buy a stock for a specified strike price on a specified expiration date.
The call option payoff is CT = max(ST − X, 0).
Portfolios of options are called strategies. A straddle - a put and a call at the same strike price - is a bet on volatility.
Options allow you to buy and sell pieces of the return distribution.

Before studying option prices, we need to start by understanding option payoffs.
A call option gives you the right, but not the obligation, to buy a stock (or other "underlying" asset) for a specified strike price (X) on (or before) the expiration date (T). European options can only be exercised on the expiration date. American options can be exercised anytime before as well as on the expiration date. I will only treat European options. A put option gives the right to sell a stock at a specified strike price on (or before) the expiration date. I'll use the standard notation:

C = Ct = call price today
CT = call payoff = value at expiration (T)
S = St = stock price today
ST = stock price at expiration
X = strike price
Our objective is to find the price C. The general framework is (of course) C = E(mx), where x denotes the option's payoff. The option's payoff is the same thing as its value at expiration. If the stock has risen above the strike price, the option is worth the difference between stock and strike. If the stock has fallen below the strike price, it expires worthless.


Thus, the option payoff is

Call payoff = ST − X if ST ≥ X
            = 0      if ST ≤ X,

that is, CT = max(ST − X, 0).

A put works the opposite way: It gains value as the stock falls below the strike price, since
the right to sell it at a high price is more and more valuable.

Put payoff = PT = max(X − ST, 0).

It's easiest to keep track of options with a graph of their value as a function of stock price. Figure 31 graphs the payoffs from buying calls and puts, and the corresponding short positions, which are called writing call and put options. One of the easiest mistakes to make is to confuse the payoff with the profit, which is the value at expiration less the cost of buying the option. I drew in profit lines, payoff less cost, to emphasize this difference.

[Figure 31. Payoff diagrams for simple option strategies. Panels: Call, Put, Write Call, Write Put.]
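The payoff formulas can be sketched in a few lines; the option costs used for the profit calculation below are made-up numbers:

```python
import numpy as np

# Sketch of the payoff/profit distinction: payoff is the value at expiration,
# profit is payoff less the price paid today.  Prices here are illustrative.
def call_payoff(S_T, X):
    return np.maximum(S_T - X, 0.0)

def put_payoff(S_T, X):
    return np.maximum(X - S_T, 0.0)

S_T = np.array([80.0, 100.0, 120.0])  # possible stock prices at expiration
X, call_cost, put_cost = 100.0, 5.0, 4.0

print(call_payoff(S_T, X))              # [ 0.  0. 20.]
print(call_payoff(S_T, X) - call_cost)  # profit line: payoff less cost
# A straddle (call + put at the same strike) pays off for big moves either way.
print(call_payoff(S_T, X) + put_payoff(S_T, X))  # [20.  0. 20.]
```

The straddle line makes the "bet on volatility" concrete: it loses only when the stock ends near the strike.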

Right away, you can see some of the interesting features of options. A call option allows you a huge positive beta. Typical at-the-money options (strike price = current stock price) give a beta of about 10, meaning that the option is equivalent to borrowing $10 to invest $11 in the stock. However, your losses are limited to the cost of the option, which is paid upfront. Options are obviously very useful for trading. Imagine how difficult it would be to buy stock on such huge margin, and how difficult it would be to make sure people paid if the bet went


