$$\hat\lambda = \frac{1}{T}\sum_{t=1}^{T}\hat\lambda_t; \qquad \hat\alpha_i = \frac{1}{T}\sum_{t=1}^{T}\hat\alpha_{it}.$$

Most importantly, they suggest that we use the standard deviations of the cross-sectional
regression estimates to generate the sampling errors for these estimates,

$$\sigma^2(\hat\lambda) = \frac{1}{T^2}\sum_{t=1}^{T}\left(\hat\lambda_t - \hat\lambda\right)^2; \qquad \sigma^2(\hat\alpha_i) = \frac{1}{T^2}\sum_{t=1}^{T}\left(\hat\alpha_{it} - \hat\alpha_i\right)^2.$$

It's $1/T^2$ because we're finding standard errors of sample means, $\sigma^2/T$.
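In code, the procedure is just the mean of the $\hat\lambda_t$ series and the standard error of that mean. A minimal Python sketch on a made-up series of cross-sectional estimates (all numbers are purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
T = 5000

# Hypothetical series of cross-sectional slope estimates lambda_t with a
# true mean of 0.5 (all numbers are made up for illustration).
lam_t = 0.5 + 0.2 * rng.standard_normal(T)

# Fama-MacBeth point estimate: the time-series average of the lambda_t
lam_hat = lam_t.mean()

# Fama-MacBeth sampling error:
# sigma^2(lam_hat) = (1/T^2) * sum_t (lam_t - lam_hat)^2
se_lam = np.sqrt(((lam_t - lam_hat) ** 2).sum() / T**2)

print(lam_hat, se_lam)
```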
This is an intuitively appealing procedure once you stop to think about it. Sampling error is, after all, about how a statistic would vary from one sample to the next if we repeated the observations. We can't do that with only one sample, but why not cut the sample in half, and deduce how a statistic would vary from one full sample to the next from how it varies from the first half of the sample to the next half? Proceeding, why not cut the sample in fourths, eighths and so on? The Fama-MacBeth procedure carries this idea to its logical conclusion, using the variation in the statistic $\hat\lambda_t$ over time to deduce its sampling variation.
We are used to deducing the sampling variance of the sample mean of a series $x_t$ by looking at the variation of $x_t$ through time in the sample, using $\sigma^2(\bar{x}) = \sigma^2(x)/T = \frac{1}{T^2}\sum_t (x_t - \bar{x})^2$. The Fama-MacBeth technique just applies this idea to the slope and pricing error estimates. The formula assumes that the time series is not autocorrelated, but one could easily extend the idea to estimates $\hat\lambda_t$ that are correlated over time by using a long-run variance matrix, i.e., estimate

$$\mathrm{cov}(\hat\lambda) = \frac{1}{T}\sum_{j=-\infty}^{\infty} \mathrm{cov}_T\left(\hat\lambda_t, \hat\lambda_{t-j}'\right).$$

One should of course use some sort of weighting matrix or a parametric description of the autocorrelations of $\lambda$, as explained in section 11.7. Asset return data are usually not highly correlated over time, but accounting for such correlation could have a big effect on the application of the Fama-MacBeth technique to corporate finance data or other regressions in which the cross-sectional estimates are highly correlated over time.
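As a sketch of the long-run variance idea, the code below compares the i.i.d. standard error $\sigma(\hat\lambda_t)/\sqrt{T}$ with a Bartlett-weighted (Newey-West style) long-run standard error on a simulated AR(1) series of estimates. The AR(1) process and all parameter values are hypothetical, chosen only to make the serial correlation visible:

```python
import numpy as np

rng = np.random.default_rng(1)
T = 4000

# Hypothetical AR(1) series of cross-sectional estimates (illustration only)
rho, sig = 0.5, 0.1
lam_t = np.empty(T)
lam_t[0] = 0.5
for t in range(1, T):
    lam_t[t] = 0.5 * (1 - rho) + rho * lam_t[t - 1] + sig * rng.standard_normal()

lam_hat = lam_t.mean()

def long_run_se(x, L):
    """Bartlett-weighted long-run standard error of the sample mean of x."""
    T = len(x)
    xd = x - x.mean()
    s = xd @ xd / T                      # j = 0 autocovariance
    for j in range(1, L + 1):
        w = 1 - j / (L + 1)              # Bartlett weight
        gamma = xd[j:] @ xd[:-j] / T     # j-th sample autocovariance
        s += 2 * w * gamma
    return np.sqrt(s / T)

se_iid = lam_t.std(ddof=0) / np.sqrt(T)  # formula that ignores autocorrelation
se_nw = long_run_se(lam_t, L=20)

print(se_iid, se_nw)
```

With positive serial correlation the long-run standard error is substantially larger than the i.i.d. formula, which is the point of the correction.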
It is natural to use this sampling theory to test whether all the pricing errors are jointly
zero as we have before. Denote by $\alpha$ the vector of pricing errors across assets. We could estimate the covariance matrix of the sample pricing errors by

$$\mathrm{cov}(\hat\alpha) = \frac{1}{T^2}\sum_{t=1}^{T}\left(\hat\alpha_t - \hat\alpha\right)\left(\hat\alpha_t - \hat\alpha\right)'$$

(or a general version that accounts for correlation over time) and then use the test

$$\hat\alpha'\,\mathrm{cov}(\hat\alpha)^{-1}\,\hat\alpha \sim \chi^2_{N-1}.$$
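A toy illustration of the quadratic-form test: here the $\hat\alpha_t$ are drawn directly rather than produced by a cross-sectional regression, so no $\lambda$ is estimated and the statistic is approximately $\chi^2$ with $N$ (rather than $N-1$) degrees of freedom in this simplified setup. All numbers are made up:

```python
import numpy as np

rng = np.random.default_rng(2)
N, T = 5, 2000

# Toy example: pricing-error estimates alpha_t with true alphas of zero and
# cross-sectionally correlated errors (covariance matrix is illustrative).
C = 0.5 * np.ones((N, N)) + 0.5 * np.eye(N)
alpha_t = rng.multivariate_normal(np.zeros(N), C, size=T)

alpha_bar = alpha_t.mean(axis=0)

# Covariance of the *sample mean*: (1/T^2) sum_t (a_t - abar)(a_t - abar)'
dev = alpha_t - alpha_bar
cov_alpha = dev.T @ dev / T**2

stat = alpha_bar @ np.linalg.inv(cov_alpha) @ alpha_bar
print(stat)   # approximately chi^2 with N degrees of freedom under the null
```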


12.3.1 Fama-MacBeth in depth

The GRS procedure and the formulas given above for a single cross-sectional regression are familiar from any course in regression. We will see them justified by maximum likelihood below. The Fama-MacBeth procedure seems unlike anything you've seen in any econometrics course, and it is obviously a useful and simple technique that can be widely used in panel data in economics and corporate finance as well as asset pricing. Is it truly different? Is there something different about asset pricing data that requires a fundamentally new technique not taught in standard regression courses? Or is it similar to standard techniques? To answer these questions it is worth looking in a little more detail at what it accomplishes and why.
It's easier to do this in a more standard setup, with left-hand variable $y$ and right-hand variable $x$. Consider a regression

$$y_{it} = \beta' x_{it} + \varepsilon_{it}\qquad i = 1, 2, \ldots N;\ t = 1, 2, \ldots T.$$

The data in this regression have a cross-sectional element as well as a time-series element. In corporate finance, for example, one might be interested in the relationship between investment and financial variables, and the data set has many firms ($N$) as well as time-series observations for each firm ($T$). In an expected return-beta asset pricing model, the $x_{it}$ stand for the $\beta_i$ and $\beta$ stands for $\lambda$.
An obvious thing to do in this context is simply to stack the $i$ and $t$ observations together and estimate $\beta$ by OLS. I will call this the pooled time-series cross-section estimate. However, the error terms are not likely to be uncorrelated with each other. In particular, the error terms are likely to be cross-sectionally correlated at a given time. If one stock's return is unusually high this month, another stock's return is also likely to be high; if one firm invests an unusually great amount this year, another firm is also likely to do so. When errors are correlated, OLS is still consistent, but the OLS distribution theory is wrong, and typically suggests standard errors that are much too small. In the extreme case that the $N$ errors are perfectly correlated at each time period, there is really only one observation for each time period, so one really has $T$ rather than $NT$ observations. Therefore, a real pooled time-series cross-section estimate must include corrected standard errors. People often ignore this fact and report OLS standard errors.
Another thing we could do is first take time-series averages and then run a pure cross-sectional regression of

$$E_T(y_{it}) = \beta' E_T(x_{it}) + u_i\qquad i = 1, 2, \ldots N.$$

This procedure would lose any information due to variation of the $x_{it}$ over time, but at least it might be easier to figure out a variance-covariance matrix for $u_i$ and correct the standard errors for residual correlation. (You could also average cross-sectionally and then run a single time-series regression. We'll get to that option later.)
In either case, the standard error corrections are just applications of the standard formula for OLS regressions with correlated error terms.
Finally, we could run the Fama-MacBeth procedure: run a cross-sectional regression at each point in time; average the cross-sectional $\hat\beta_t$ estimates to get an estimate $\hat\beta$, and use the time-series standard deviation of $\hat\beta_t$ to estimate the standard error of $\hat\beta$.

It turns out that the Fama-MacBeth procedure is just another way of calculating the standard errors, corrected for cross-sectional correlation:
Proposition: If the $x_{it}$ variables do not vary over time, and if the errors are cross-sectionally correlated but not correlated over time, then the Fama-MacBeth estimate, the pure cross-sectional OLS estimate and the pooled time-series cross-sectional OLS estimates are identical. Also, the Fama-MacBeth standard errors are identical to the cross-sectional regression or stacked OLS standard errors, corrected for residual correlation. None of these relations hold if the $x$ vary through time.
Since they are identical procedures, whether one calculates estimates and standard errors
in one way or the other is a matter of taste.
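The proposition is easy to verify numerically. The sketch below simulates a panel with time-invariant $x$ and cross-sectionally correlated, serially uncorrelated errors, and checks that the pooled OLS, pure cross-sectional, and Fama-MacBeth estimates coincide, as do the Fama-MacBeth and correlation-corrected OLS covariance matrices. All simulation parameters are made up:

```python
import numpy as np

rng = np.random.default_rng(3)
N, T, K = 8, 50, 2

# Time-invariant regressors x (N x K), cross-correlated errors (illustrative)
x = rng.standard_normal((N, K))
beta = np.array([1.0, -0.5])
C = 0.6 * np.ones((N, N)) + 0.4 * np.eye(N)            # error covariance across i
eps = rng.multivariate_normal(np.zeros(N), C, size=T)  # T x N, i.i.d. over time
y = x @ beta + eps                                     # y[t, i], via broadcasting

ybar = y.mean(axis=0)
A = np.linalg.inv(x.T @ x) @ x.T                       # (x'x)^{-1} x'

# 1) pooled OLS on the stacked system
X_stack = np.tile(x, (T, 1))
b_pooled = np.linalg.lstsq(X_stack, y.reshape(-1), rcond=None)[0]

# 2) pure cross-section on time averages
b_xs = A @ ybar

# 3) Fama-MacBeth: average of period-by-period cross-sectional estimates
b_t = y @ A.T                                          # T x K, rows are beta_t
b_fm = b_t.mean(axis=0)

# Fama-MacBeth covariance: time-series covariance of beta_t, divided by T
cov_fm = np.cov(b_t, rowvar=False, bias=True) / T

# Corrected pooled/XS covariance: (1/T) A Sigma_hat A'
resid = y - x @ b_pooled
Sigma_hat = resid.T @ resid / T
cov_ols = A @ Sigma_hat @ A.T / T

assert np.allclose(b_pooled, b_xs) and np.allclose(b_xs, b_fm)
assert np.allclose(cov_fm, cov_ols)
print("estimates and corrected covariance matrices coincide")
```

The key step in the algebra, reproduced by the code, is that $(x'x)^{-1}x'$ applied to the residual mean is exactly zero, so demeaning by $\bar{y}$ or by $x\hat\beta$ gives the same sandwiched covariance.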
I emphasize one procedure that is incorrect: pooled time series and cross section OLS with no correction of the standard errors. The errors are so highly cross-sectionally correlated in most finance applications that the standard errors so computed are often off by a factor of 10.
The assumption that the errors are not correlated over time is probably not so bad for asset pricing applications, since returns are close to independent. However, when pooled time-series cross-section regressions are used in corporate finance applications, errors are likely to be as severely correlated over time as across firms, if not more so. The "other factors" ($\varepsilon$) that cause, say, company $i$ to invest more at time $t$ than predicted by a set of right-hand variables are surely correlated with the other factors that cause company $j$ to invest more. But such factors are especially likely to cause company $i$ to invest more tomorrow as well. In this case, any standard errors must also correct for serial correlation in the errors; the GMM-based formulas in section 11.4 can do this easily.
The Fama-MacBeth standard errors also do not correct for the fact that β are generated
regressors. If one is going to use them, it is a good idea to at least calculate the Shanken
correction factors outlined above, and check that the corrections are not large.
Proof: We just have to write out the three approaches and compare them. Having assumed that the $x$ variables do not vary over time, the regression is

$$y_{it} = x_i'\beta + \varepsilon_{it}.$$

We can stack up the cross-sections $i = 1 \ldots N$ and write the regression as

$$y_t = x\beta + \varepsilon_t.$$

$x$ is now a matrix with the $x_i'$ as rows. The error assumptions mean $E(\varepsilon_t\varepsilon_t') = \Sigma$.


Pooled OLS: To run pooled OLS, we stack the time series and cross sections by writing

$$Y = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_T \end{bmatrix}; \quad X = \begin{bmatrix} x \\ x \\ \vdots \\ x \end{bmatrix}; \quad \varepsilon = \begin{bmatrix} \varepsilon_1 \\ \varepsilon_2 \\ \vdots \\ \varepsilon_T \end{bmatrix}$$

and then

$$Y = X\beta + \varepsilon,$$

$$E(\varepsilon\varepsilon') = \Omega = \begin{bmatrix} \Sigma & & \\ & \ddots & \\ & & \Sigma \end{bmatrix}.$$

The estimate and its standard error are then

$$\hat\beta_{OLS} = (X'X)^{-1}X'Y,$$
$$\mathrm{cov}(\hat\beta_{OLS}) = (X'X)^{-1}X'\Omega X(X'X)^{-1}.$$

Writing this out from the definitions of the stacked matrices, with $X'X = T\,x'x$,

$$\hat\beta_{OLS} = (x'x)^{-1}x'E_T(y_t),$$
$$\mathrm{cov}(\hat\beta_{OLS}) = \frac{1}{T}(x'x)^{-1}(x'\Sigma x)(x'x)^{-1}.$$

We can estimate this sampling variance with

$$\hat\Sigma = E_T\left(\hat\varepsilon_t\hat\varepsilon_t'\right); \qquad \hat\varepsilon_t \equiv y_t - x\hat\beta_{OLS}.$$

Pure cross-section: The pure cross-sectional estimator runs one cross-sectional regression of the time-series averages. So, take those averages,

$$E_T(y_t) = x\beta + E_T(\varepsilon_t)$$

where $x = E_T(x)$ since $x$ is constant. Having assumed i.i.d. errors over time, the error covariance matrix is

$$E\left(E_T(\varepsilon_t)\,E_T(\varepsilon_t')\right) = \frac{1}{T}\Sigma.$$

The cross-sectional estimate and corrected standard errors are then

$$\hat\beta_{XS} = (x'x)^{-1}x'E_T(y_t),$$
$$\sigma^2(\hat\beta_{XS}) = \frac{1}{T}(x'x)^{-1}x'\Sigma x(x'x)^{-1}.$$

Thus, the cross-sectional and pooled OLS estimates and standard errors are exactly the same, in each sample.
Fama-MacBeth: The Fama-MacBeth estimator is formed by first running the cross-sectional regression at each moment in time,

$$\hat\beta_t = (x'x)^{-1}x'y_t.$$

Then the estimate is the average of the cross-sectional regression estimates,

$$\hat\beta_{FM} = E_T\left(\hat\beta_t\right) = (x'x)^{-1}x'E_T(y_t).$$

Thus, the Fama-MacBeth estimator is also the same as the OLS estimator, in each sample. The Fama-MacBeth standard error is based on the time-series standard deviation of the $\hat\beta_t$. Using $\mathrm{cov}_T$ to denote sample covariance,

$$\mathrm{cov}\left(\hat\beta_{FM}\right) = \frac{1}{T}\mathrm{cov}_T\left(\hat\beta_t\right) = \frac{1}{T}(x'x)^{-1}x'\,\mathrm{cov}_T(y_t)\,x(x'x)^{-1}.$$

Denoting the sample residuals by

$$y_t = x\hat\beta_{FM} + \hat\varepsilon_t,$$

we have

$$\mathrm{cov}_T(y_t) = E_T\left(\hat\varepsilon_t\hat\varepsilon_t'\right) = \hat\Sigma,$$

and finally

$$\mathrm{cov}\left(\hat\beta_{FM}\right) = \frac{1}{T}(x'x)^{-1}x'\hat\Sigma x(x'x)^{-1}.$$

Thus, the FM estimator of the standard error is also numerically equivalent to the OLS corrected standard error.
Varying x: If the $x_{it}$ vary through time, none of the three procedures are equal anymore, since the cross-sectional regressions ignore time-series variation in the $x_{it}$. As an extreme example, suppose a scalar $x_t$ varies over time but not cross-sectionally,

$$y_{it} = \alpha + x_t\beta + \varepsilon_{it};\qquad i = 1, 2, \ldots N;\ t = 1, 2, \ldots T.$$

The grand OLS regression is

$$\hat\beta = \frac{\sum_{it}\tilde{x}_t\, y_{it}}{\sum_{it}\tilde{x}_t^2} = \frac{\sum_t \tilde{x}_t\, \frac{1}{N}\sum_i y_{it}}{\sum_t \tilde{x}_t^2}$$

where $\tilde{x} = x - E_T(x)$ denotes the demeaned variables. The estimate is driven by the covariance over time of $x_t$ with the cross-sectional average of the $y_{it}$, which is sensible because all of the information in the sample lies in time variation. It is identical to a regression over time of cross-sectional averages. However, you can't even run a cross-sectional estimate, since the right-hand variable is constant across $i$. As a practical example, you might be interested in a CAPM specification in which the betas vary over time ($\beta_t$) but not across test assets. This sample still contains information about the CAPM: the time-variation in betas should be matched by time variation in expected returns. But any method based on cross-sectional regressions will completely miss it.
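A small simulation makes the extreme example concrete: with a scalar $x_t$ that varies only over time, the grand pooled OLS slope is numerically identical to a time-series regression of the cross-sectional averages $\bar{y}_t$ on $x_t$. Parameter values are illustrative:

```python
import numpy as np

rng = np.random.default_rng(4)
N, T = 10, 200

# Scalar regressor that varies over time but not across i (illustration only)
x_t = rng.standard_normal(T)
alpha, beta = 0.1, 0.7
eps = rng.standard_normal((T, N))
y = alpha + beta * x_t[:, None] + eps          # y[t, i]

xt_dm = x_t - x_t.mean()                       # demeaned regressor

# Grand pooled OLS slope (with a constant), written with demeaned x:
# sum_{it} xtilde_t y_it / sum_{it} xtilde_t^2
b_grand = (xt_dm[:, None] * y).sum() / (N * (xt_dm ** 2).sum())

# Time-series regression of cross-sectional averages ybar_t on x_t
ybar = y.mean(axis=1)
b_ts = (xt_dm * ybar).sum() / (xt_dm ** 2).sum()

assert np.isclose(b_grand, b_ts)
print(b_grand)
```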
In historical context, the Fama-MacBeth procedure was also important because it allowed changing betas, which a single cross-sectional regression or a time-series regression test cannot easily handle.

12.4 Problems

1. When we express the CAPM in excess return form, can the test assets be differences between risky assets, $R^i - R^j$? Can the market excess return also use a risky asset, or must it be relative to a risk-free rate? (Hint: start with $E(R^i) - R^f = \beta_{i,m}\left[E(R^m) - R^f\right]$ and see if you can get to the other forms. Betas must be regression coefficients.)
2. Can you run the GRS test on a model that uses industrial production growth as a factor, $E(R^i) - R^f = \beta_{i,\Delta ip}\,\lambda_{ip}$?
3. Fama and French (1997b) report that pricing errors are correlated with betas in a test of a
factor pricing model on industry portfolios. How is this possible?
4. We saw that a GLS cross-sectional regression of the CAPM passes through the market
and riskfree rate by construction. Show that if the market return is an equally weighted
portfolio of the test assets, then an OLS cross-sectional regression with an estimated
intercept passes through the market return by construction. Does it also pass through the
riskfree rate or origin?

Chapter 13. GMM for linear factor
models in discount factor form

13.1 GMM on the pricing errors gives a cross-sectional regression

The first-stage estimate is an OLS cross-sectional regression, and the second stage is a GLS regression,

$$\text{First stage: } \hat{b}_1 = (d'd)^{-1}d'E_T(p)$$
$$\text{Second stage: } \hat{b}_2 = (d'S^{-1}d)^{-1}d'S^{-1}E_T(p).$$

Standard errors are the corresponding regression formulas, and the variance of the pricing errors is the standard regression formula for the variance of a residual.

Treating the constant $a \times 1$ as a constant factor, the model is

$$m = b'f$$
$$E(p) = E(mx),$$

or simply

$$E(p) = E(xf')b.$$

Keep in mind that $p$ and $x$ are $N \times 1$ vectors of asset prices and payoffs respectively; $f$ is a $K \times 1$ vector of factors, and $b$ is a $K \times 1$ vector of parameters. I suppress the time indices $m_{t+1}, f_{t+1}, x_{t+1}, p_t$. The payoffs are typically returns or excess returns, including returns scaled by instruments. The prices are typically one (returns), zero (excess returns), or the instrument (scaled returns).
To implement GMM, we need to choose a set of moments. The obvious set of moments to use are the pricing errors,

$$g_T(b) = E_T(xf'b - p).$$

This choice is natural but not necessary. You don't have to use $p = E(mx)$ with GMM, and you don't have to use GMM with $p = E(mx)$. You can (we will) use GMM on expected return-beta models, and you can use maximum likelihood on $p = E(mx)$. It is a choice, and the results will depend on this choice of moments as well as the specification of the model.


The GMM estimate is formed from

$$\min_b\ g_T(b)'W g_T(b)$$

with first-order condition

$$d'W g_T(b) = d'W E_T(xf'b - p) = 0$$

where

$$d' = \frac{\partial g_T(b)'}{\partial b} = E_T(fx').$$

This is the second-moment matrix of payoffs and factors. The first stage has $W = I$, the second stage has $W = S^{-1}$. Since this is a linear model, we can solve analytically for the GMM estimate, and it is

$$\text{First stage: } \hat{b}_1 = (d'd)^{-1}d'E_T(p)$$
$$\text{Second stage: } \hat{b}_2 = (d'S^{-1}d)^{-1}d'S^{-1}E_T(p).$$

The first-stage estimate is an OLS cross-sectional regression of average prices on the second moments of payoffs with factors, and the second-stage estimate is a GLS cross-sectional regression. What could be more sensible? The model (13.189) says that average prices should be a linear function of the second moment of payoffs with factors, so the estimate runs a linear regression. These are cross-sectional regressions since they operate across assets on sample averages. The "data points" in the regression are sample average prices ($y$) and second moments of payoffs with factors ($x$) across test assets. We are picking the parameter $b$ to make the model explain the cross-section of asset prices as well as possible.
We find the distribution theory from the standard GMM standard error formulas (11.144) and (11.150). In the first stage, $a = d'$.

$$\text{First stage: } \mathrm{cov}(\hat{b}_1) = \frac{1}{T}(d'd)^{-1}d'Sd(d'd)^{-1} \qquad (13.190)$$
$$\text{Second stage: } \mathrm{cov}(\hat{b}_2) = \frac{1}{T}(d'S^{-1}d)^{-1}.$$
Unsurprisingly, these are exactly the formulas for OLS and GLS regression errors with error covariance $S$. The pricing errors are correlated across assets, since the payoffs are correlated. Therefore the OLS cross-sectional regression standard errors need to be corrected for correlation, as they are in (13.190), and one can pursue an efficient estimate as in GLS. The analogy to GLS is close, since $S$ is the covariance matrix of $E(p) - E(xf')b$; $S$ is the covariance matrix of the "errors" in the cross-sectional regression.


The covariance matrix of the pricing errors is, from (11.147), (11.151) and (11.152),

$$\text{First stage: } T\,\mathrm{cov}\left[g_T(\hat{b})\right] = \left(I - d(d'd)^{-1}d'\right)S\left(I - d(d'd)^{-1}d'\right)' \qquad (13.191)$$
$$\text{Second stage: } T\,\mathrm{cov}\left[g_T(\hat{b})\right] = S - d(d'S^{-1}d)^{-1}d'.$$

These are obvious analogues to the standard regression formulas for the covariance matrix of regression residuals.
The model test is

$$g_T(b)'\,\mathrm{cov}(g_T)^{-1}\,g_T(b) \sim \chi^2(\#\text{moments} - \#\text{parameters}),$$

which specializes for the second-stage estimate as

$$T\,g_T(\hat{b})'S^{-1}g_T(\hat{b}) \sim \chi^2(\#\text{moments} - \#\text{parameters}).$$

There is not much point in writing these out, other than to point out that the test is a quadratic form in the vector of pricing errors. It turns out that the $\chi^2$ test has the same value for first and second stage for this model, even though the parameter estimates, pricing errors and covariance matrix are not the same.

13.2 The case of excess returns

When $m_{t+1} = a - b'f_{t+1}$ and the test assets are excess returns, the GMM estimate is a GLS cross-sectional regression of average returns on the second moments of returns with factors,

$$\text{First stage: } \hat{b}_1 = (d'd)^{-1}d'E_T(R^e)$$
$$\text{Second stage: } \hat{b}_2 = (d'S^{-1}d)^{-1}d'S^{-1}E_T(R^e),$$

where $d$ is the covariance matrix between returns and factors. The other formulas are the same as before.

The analysis of the last section requires that at least one asset has a nonzero price. If all assets are excess returns then $\hat{b}_1 = (d'd)^{-1}d'E_T(p) = 0$. Linear factor models are most often applied to excess returns, so this case is important. The trouble is that in this case the mean discount factor is not identified. If $E(mR^e) = 0$ then $E((2 \times m)R^e) = 0$. Analogously, in expected return-beta models, if all test assets are excess returns, then we have no information on the level of the zero-beta rate.
Writing out the model as $m = a - b'f$, we cannot separately identify $a$ and $b$, so we have to choose some normalization. The choice is entirely one of convenience; lack of identification means precisely that the pricing errors do not depend on the choice of normalization. The easiest choice is $a = 1$. Then

$$g_T(b) = E_T(mR^e) = E_T(R^e) - E_T(R^e f')b.$$

We have

$$d \equiv E_T(R^e f'),$$

the second-moment matrix of returns and factors. The first-order condition to $\min_b g_T'Wg_T$ is

$$d'W\left[E_T(R^e) - d\,b\right] = 0.$$

Then, the GMM estimates of $b$ are

$$\text{First stage: } \hat{b}_1 = (d'd)^{-1}d'E_T(R^e)$$
$$\text{Second stage: } \hat{b}_2 = (d'S^{-1}d)^{-1}d'S^{-1}E_T(R^e).$$

The GMM estimate is a cross-sectional regression of mean excess returns on the second
moments of returns with factors. From here on in, the distribution theory is unchanged from
the last section.
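A minimal sketch of the two-stage procedure for excess returns, on simulated data in which $m = 1 - b f$ prices the assets exactly in population. The data-generating process and all parameter values are made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(5)
N, T = 5, 50000
b_true, mu_f, sig_f = 2.0, 0.05, 0.1

# Simulate a one-factor world where m = 1 - b_true*f prices the excess
# returns exactly in population (all numbers are hypothetical).
beta = np.linspace(0.5, 1.5, N)
mu_R = b_true * beta * sig_f**2 / (1 - b_true * mu_f)  # implied mean returns
f = mu_f + sig_f * rng.standard_normal(T)
e = 0.1 * rng.standard_normal((T, N))
Re = mu_R + np.outer(f - mu_f, beta) + e               # T x N excess returns

# Moments: g_T(b) = E_T(Re) - E_T(Re f') b, with d = E_T(Re f')
Ebar = Re.mean(axis=0)
d = (Re * f[:, None]).mean(axis=0)[:, None]            # N x 1

# First stage (W = I): OLS cross-sectional regression of mean returns on d
b1 = np.linalg.solve(d.T @ d, d.T @ Ebar)

# Spectral density matrix S from the moments u_t = Re_t * m_t at b1 (no lags)
u = Re * (1 - b1[0] * f)[:, None]
S = u.T @ u / T

# Second stage (W = S^{-1}): GLS cross-sectional regression
Sinv = np.linalg.inv(S)
b2 = np.linalg.solve(d.T @ Sinv @ d, d.T @ Sinv @ Ebar)

# Second-stage J test: T g_T' S^{-1} g_T, approximately chi^2(N - 1)
g = Ebar - d @ b2
J = T * g @ Sinv @ g
print(b1[0], b2[0], J)
```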
Mean returns on covariances
We can obtain a cross-sectional regression of mean excess returns on covariances, which are just a heartbeat away from betas, by choosing the normalization $a = 1 + b'E(f)$ rather than $a = 1$. Then the model is $m = 1 - b'(f - E(f))$ with mean $E(m) = 1$. The pricing errors are

$$g_T(b) = E_T(mR^e) = E_T(R^e) - E_T(R^e\tilde{f}')b$$

where I denote $\tilde{f} \equiv f - E(f)$. We have

$$d \equiv E_T(R^e\tilde{f}'),$$

which now denotes the covariance matrix of returns and factors. The first-order condition to $\min_b g_T'Wg_T$ is now

$$d'W\left[E_T(R^e) - d\,b\right] = 0.$$

Then, the GMM estimates of $b$ are

$$\text{First stage: } \hat{b}_1 = (d'd)^{-1}d'E_T(R^e)$$
$$\text{Second stage: } \hat{b}_2 = (d'S^{-1}d)^{-1}d'S^{-1}E_T(R^e).$$


The GMM estimate is a cross-sectional regression of expected excess returns on the covariance between returns and factors. Naturally, the model says that expected excess returns should be proportional to the covariance between returns and factors, and the estimate fits that relation by a linear regression. The standard errors and variance of the pricing errors are the same as in (13.190) and (13.191), with $d$ now representing the covariance matrix. The formulas are almost exactly identical to those of the cross-sectional regressions in section 12.2. The $p = E(mx)$ formulation of the model for excess returns is equivalent to $E(R^e) = \mathrm{Cov}(R^e, f')b$; thus covariances enter in place of betas $\beta$.
There is one fly in the ointment; the mean of the factor $E(f)$ is estimated, and the distribution theory should recognize the sampling variation induced by this fact, as we did for the fact that betas are generated regressors in the cross-sectional regressions of chapter 12. The distribution theory is straightforward, and a problem at the end of the chapter guides you through it. However, I think it is better to avoid the complication and just use the second-moment approach, or some other non-sample-dependent normalization for $a$. The pricing errors are identical; the whole point is that the normalization of $a$ does not matter to the pricing errors. Therefore, the $\chi^2$ statistics are also identical. As you change the normalization for $a$, you change the estimate of $b$. Therefore, the only effect is to add a term in the sampling variance of the estimated parameter $b$.

13.3 Horse Races

How to test whether one set of factors drives out another. Test $b_2 = 0$ in $m = b_1'f_1 + b_2'f_2$ using the standard error of $\hat{b}_2$, or the $\chi^2$ difference test.

It's often interesting to test whether one set of factors drives out another. For example, Chen, Roll and Ross (1986) test whether their five macroeconomic factors price assets so well that one can ignore even the market return. Given the large number of factors that have been proposed, a statistical procedure for testing which factors survive in the presence of the others is desirable.
In this framework, such a test is very easy. Start by estimating a general model

$$m = b_1'f_1 + b_2'f_2. \qquad (13.192)$$

We want to know: given factors $f_1$, do we need the $f_2$ to price assets, i.e., is $b_2 = 0$? There are two ways to do this. First and most obviously, we have an asymptotic covariance matrix for $[b_1\ b_2]$, so we can form a $t$ test (if $b_2$ is scalar) or $\chi^2$ test for $b_2 = 0$ by forming the statistic

$$\hat{b}_2'\,\mathrm{var}(\hat{b}_2)^{-1}\,\hat{b}_2 \sim \chi^2_{\#b_2}$$

where $\#b_2$ is the number of elements in the $b_2$ vector. This is a Wald test.
Second, we can estimate a restricted system $m = b_1'f_1$. Since there are fewer free parameters and the same number of moments as in (13.192), we expect the criterion $J_T$ to rise. If we use the same weighting matrix (usually the one estimated from the unrestricted model (13.192)), then the $J_T$ cannot in fact decline. But if $b_2$ really is zero, it shouldn't rise "much." How much? The $\chi^2$ difference test answers that question:

$$T J_T(\text{restricted}) - T J_T(\text{unrestricted}) \sim \chi^2(\#\text{of restrictions}).$$

This is very much like a likelihood ratio test.
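The $\chi^2$ difference test can be sketched as follows: simulate a world in which only $f_1$ enters the true discount factor, estimate restricted and unrestricted models with the same weighting matrix, and compare the $T J_T$ values. The setup and all parameter values are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(6)
N, T = 6, 20000
b_true, mu1, mu2, s1, s2, rho = 2.0, 0.05, 0.02, 0.1, 0.08, 0.7

# Two correlated factors; the true discount factor m = 1 - b_true*f1 uses
# only f1, so b2 = 0 (all parameter values are made up for illustration).
beta = np.linspace(0.5, 1.5, N)
gamma = np.linspace(1.0, 0.2, N)
z1 = rng.standard_normal(T)
z2 = rho * z1 + np.sqrt(1 - rho**2) * rng.standard_normal(T)
f1, f2 = mu1 + s1 * z1, mu2 + s2 * z2
s12 = rho * s1 * s2
mu_R = b_true * (beta * s1**2 + gamma * s12) / (1 - b_true * mu1)
Re = (mu_R + np.outer(f1 - mu1, beta) + np.outer(f2 - mu2, gamma)
      + 0.1 * rng.standard_normal((T, N)))

Ebar = Re.mean(axis=0)
F = np.column_stack([f1, f2])
D = Re.T @ F / T                            # N x 2 matrix d = E_T(Re f')

def gmm(D, W, Ebar):
    """Linear GMM estimate and criterion J = g'Wg for g = Ebar - D b."""
    b = np.linalg.solve(D.T @ W @ D, D.T @ W @ Ebar)
    g = Ebar - D @ b
    return b, g @ W @ g

# One weighting matrix (from a first-stage fit of the unrestricted model),
# used for both the restricted and the unrestricted estimates.
b0, _ = gmm(D, np.eye(N), Ebar)
u = Re * (1 - F @ b0)[:, None]
S = u.T @ u / T
W = np.linalg.inv(S)

b_u, J_u = gmm(D, W, Ebar)                  # unrestricted: both factors
b_r, J_r = gmm(D[:, :1], W, Ebar)           # restricted: b2 = 0

chi2_diff = T * (J_r - J_u)                 # ~ chi^2(1) under the null b2 = 0
print(b_u, chi2_diff)
```

Because the same $W$ is used, $J_r \geq J_u$ holds exactly, and the scaled difference is small when the restriction is true.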

13.4 Testing for characteristics

How to check whether an asset pricing model drives out a characteristic such as size, book/market or volatility. Run cross-sectional regressions of pricing errors on characteristics; use the formulas for the covariance matrix of the pricing errors to create standard errors.

It's often interesting to characterize a model by checking whether the model drives out a characteristic. For example, portfolios organized by size or market capitalization show a wide dispersion in average returns (at least up to 1979). Small stocks gave higher average returns than large stocks. The size of the portfolio is a characteristic. A good asset pricing model should account for average returns by betas. It's ok if a characteristic is associated with average returns, but in the end betas should drive out the characteristic: the alphas or pricing errors should not be associated with the characteristic. The original tests of the CAPM similarly checked whether the variance of the individual portfolio had anything to do with average returns once betas were included.
Denote the characteristic of portfolio $i$ by $y_i$. An obvious idea is to include both betas and the characteristic in a multiple cross-sectional regression,

$$E(R^{ei}) = (\alpha_0) + \beta_i'\lambda + \gamma y_i + \varepsilon_i;\qquad i = 1, 2, \ldots N.$$

Alternatively, subtract $\beta\lambda$ from both sides and consider a cross-sectional regression of alphas on the characteristic,

$$\alpha_i = (\alpha_0) + \gamma y_i + \varepsilon_i;\qquad i = 1, 2, \ldots N.$$

(The difference is whether you allow the presence of the size characteristic to affect the $\lambda$ estimate or not.)
We can always run such a regression, but we don't want to use the OLS formulas for the sampling error of the estimates, since the errors $\varepsilon_i$ are correlated across assets. Under the null that $\gamma = 0$, $\varepsilon = \alpha$, so we can simply use the covariance matrix of the alphas to generate standard errors of the $\hat\gamma$. Let $X$ denote the vector of characteristics; then the estimate is

$$\hat\gamma = (X'X)^{-1}X'\hat\alpha$$

with standard error

$$\sigma^2(\hat\gamma) = (X'X)^{-1}X'\,\mathrm{cov}(\hat\alpha)\,X(X'X)^{-1}.$$

At this point, simply use the formula for $\mathrm{cov}(\hat\alpha)$ or $\mathrm{cov}(g_T)$ as appropriate for the model that you tested.
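A minimal sketch of the characteristic regression with corrected standard errors, using a simulated $\hat\alpha$ and its covariance matrix. The characteristic, the covariance structure, and all numbers are made up:

```python
import numpy as np

rng = np.random.default_rng(7)
N, T = 10, 1000

# Toy example: true alphas are zero, estimation error in alpha-hat is
# cross-sectionally correlated, and the characteristic y is unrelated to it.
C = (0.8 * np.ones((N, N)) + 0.2 * np.eye(N)) * 0.04   # per-period covariance
alpha_t = rng.multivariate_normal(np.zeros(N), C, size=T)
alpha_hat = alpha_t.mean(axis=0)
cov_alpha = np.cov(alpha_t, rowvar=False, bias=True) / T

y = np.linspace(-1, 1, N)                   # the characteristic (e.g. size rank)
X = np.column_stack([np.ones(N), y])        # include a constant alpha_0

# gamma-hat = (X'X)^{-1} X' alpha-hat, with sandwich covariance
# (X'X)^{-1} X' cov(alpha-hat) X (X'X)^{-1}
XtXi = np.linalg.inv(X.T @ X)
coef = XtXi @ X.T @ alpha_hat
cov_coef = XtXi @ X.T @ cov_alpha @ X @ XtXi

gamma_hat, se_gamma = coef[1], np.sqrt(cov_coef[1, 1])
t_stat = gamma_hat / se_gamma
print(gamma_hat, se_gamma, t_stat)
```

Under the null of $\gamma = 0$ the $t$ statistic is approximately standard normal; the correction matters because naive OLS standard errors ignore the cross-asset correlation of the alphas.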
Sometimes, the characteristic is also estimated rather than being a fixed number such as the size rank of a size portfolio, and you'd like to include the sampling uncertainty of its estimation in the standard errors of $\hat\gamma$. Let $y^i_t$ denote the time series whose mean $E(y^i_t)$ determines the characteristic. Now, write the moment condition for the $i$th asset as

$$g^i_T = E_T\left(m_{t+1}(b)x^i_{t+1} - p^i_t - \gamma y^i_t\right).$$

The estimate of $\gamma$ tells you how the characteristic $E(y^i)$ is associated with model pricing errors $E(m_{t+1}(b)x_{t+1} - p_t)$. The GMM estimate of $\gamma$ solves the first-order condition

$$E_T(y)'W\left[E_T(mx - p) - \gamma E_T(y)\right] = 0,$$

so

$$\hat\gamma = \left[E_T(y)'W E_T(y)\right]^{-1}E_T(y)'W\, E_T(mx - p),$$

an OLS or GLS regression of the pricing errors on the estimated characteristics. The standard GMM formulas for the standard deviation of $\hat\gamma$ or the $\chi^2$ difference test for $\gamma = 0$ tell you whether the $\gamma$ estimate is statistically significant, including the fact that $E(y)$ must be estimated.

13.5 Testing for priced factors: lambdas or b's?

$b_j$ asks whether factor $j$ helps to price assets given the other factors. $b_j$ gives the multiple regression coefficient of $m$ on $f_j$ given the other factors.

$\lambda_j$ asks whether factor $j$ is priced, or whether its factor-mimicking portfolio carries a positive risk premium. $\lambda_j$ gives the single regression coefficient of $m$ on $f_j$.

Therefore, when factors are correlated, one should test $b_j = 0$ to see whether to include factor $j$ given the other factors, rather than test $\lambda_j = 0$.

Expected return-beta models defined with single regression betas give rise to $\lambda$ with a multiple regression interpretation that one can use to test factor pricing.


In the context of expected return-beta models, it has been more traditional to evaluate the relative strengths of models by testing the factor risk premia $\lambda$ of additional factors, rather than test whether their $b$ is zero. (The $b$'s are not the same as the $\beta$'s. $b$ are the regression coefficients of $m$ on $f$; $\beta$ are the regression coefficients of $R^i$ on $f$.) To keep the equations simple, I'll use mean-zero factors, excess returns, and normalize to $E(m) = 1$, since the mean of $m$ is not identified with excess returns.
The parameters $b$ and $\lambda$ are related by

$$\lambda = E(ff')b.$$

See section 6.3. Briefly,

$$0 = E(mR^e) = E\left[R^e(1 - f'b)\right]$$
$$E(R^e) = \mathrm{cov}(R^e, f')b = \mathrm{cov}(R^e, f')E(ff')^{-1}E(ff')b = \beta'\lambda.$$
Thus, when the factors are orthogonal, $E(ff')$ is diagonal, and each $\lambda_j = 0$ if and only if the corresponding $b_j = 0$. The distinction between $b$ and $\lambda$ only matters when the factors are correlated. Factors are often correlated, however.

$\lambda_j$ captures whether factor $f_j$ is priced. We can write $\lambda = E[f(f'b)] = -E(mf)$ to see that $\lambda$ is (the negative of) the price that the discount factor $m$ assigns to $f$. $b$ captures whether factor $f_j$ is marginally useful in pricing assets, given the presence of other factors. If $b_j = 0$, we can price assets just as well without factor $f_j$ as with it.
$\lambda_j$ is proportional to the single regression coefficient of $m$ on $f_j$: $\lambda_j = -\mathrm{cov}(m, f_j)$. $\lambda_j = 0$ asks the corresponding single regression question: "is factor $j$ correlated with the true discount factor?"
$b_j$ is the multiple regression coefficient of $m$ on $f_j$ given all the other factors. This just follows from $m = b'f$. (Regressions don't have to have error terms!) A multiple regression coefficient $\beta_j$ in $y = x\beta + \varepsilon$ is the way to answer "does $x_j$ help to explain variation in $y$ given the presence of the other $x$'s?" When you want to ask the question, "should I include factor $j$ given the other factors?" you want to ask the multiple regression question.
For example, suppose the CAPM is true, which is the single-factor model

$$m = a - bR^{em},$$

where $R^{em}$ is the market excess return. Consider any other excess return $R^{ex}$, positively correlated with $R^{em}$ ($x$ for extra). If we try a factor model with the spurious factor $R^{ex}$, the answer is

$$m = a - bR^{em} + 0 \times R^{ex}.$$

$b_x$ is obviously zero, indicating that adding this factor does not help to price assets. However, since the correlation of $R^{ex}$ with $R^{em}$ is positive, the beta of $R^{ex}$ on $R^{em}$ is positive, $R^{ex}$ earns a positive expected excess return, and $\lambda_x = E(R^{ex}) > 0$. In the expected return-beta model

$$E(R^{ei}) = \beta_{im}\lambda_m + \beta_{ix}\lambda_x,$$

$\lambda_m = E(R^{em})$ is unchanged by the addition of the spurious factor. However, since the factors $R^{em}, R^{ex}$ are correlated, the multiple regression betas of $R^{ei}$ on the factors change when we add the extra factor $R^{ex}$. If $\beta_{ix}$ is positive, $\beta_{im}$ will decline from its single-regression value, so the new model explains the same expected return $E(R^{ei})$. The expected return-beta model will indicate a risk premium for $\beta_x$ exposure, and many assets will have $\beta_x$ exposure ($R^{ex}$ for example!) even though factor $R^{ex}$ is spurious. In particular, $R^{ex}$ will of course have multiple regression coefficients $\beta_{x,m} = 0$ and $\beta_{x,x} = 1$, and its expected return will be entirely explained by the new factor $x$.
So, as usual, the answer depends on the question. If you want to know whether factor $i$ is priced, look at $\lambda_i$ (or $E(mf_i)$). If you want to know whether factor $i$ helps to price other assets, look at $b_i$. This is not an issue about sampling error or testing. All moments above are population values.
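The $b$-versus-$\lambda$ distinction is easy to see with population moments. In the sketch below the second-moment matrix and the $b$ vector are made up: factor 2 has $b_2 = 0$ yet $\lambda_2 \neq 0$ because the factors are correlated:

```python
import numpy as np

# Population illustration with two correlated, mean-zero factors and a true
# b = (b1, 0)': factor 2 is useless for pricing (b2 = 0), yet it carries a
# nonzero lambda because it is correlated with factor 1. Numbers are made up.
Eff = np.array([[1.0, 0.7],
                [0.7, 1.0]])    # second-moment matrix E(f f')
b = np.array([2.0, 0.0])        # m = 1 - b'f ; factor 2 drops out of m

lam = Eff @ b                   # lambda = E(f f') b
print(lam)                      # lambda_2 = 1.4 even though b_2 = 0
```

Testing $\lambda_2 = 0$ would wrongly suggest keeping factor 2; testing $b_2 = 0$ asks the right (multiple regression) question.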
Of course, testing $b = 0$ is particularly easy in the GMM, $p = E(mx)$ setup. But you can always test the same ideas in any expression of the model. In an expected return-beta model, estimate $b$ by $\hat{b} = E(ff')^{-1}\hat\lambda$ and test the elements of that vector rather than $\lambda$ itself. You can write an asset pricing model as $E(R^e) = \beta'\lambda$ and use the $\lambda$ to test whether each factor can be dropped in the presence of the others, if you use single regression betas rather than multiple regression betas. In this case each $\lambda$ is proportional to the corresponding $b$. Problem 2 at the end of this chapter helps you to work out this case.

13.5.1 Mean-variance frontier and performance evaluation

A GMM, p = E(mx) approach to testing whether a return expands the mean-variance
frontier. Just test whether m = a + bR prices all returns. If there is no risk free rate, use two
values of a.

We often summarize asset return data by mean-variance frontiers. For example, a large literature has examined the desirability of international diversification in a mean-variance context. Stock returns from many countries are not perfectly correlated, so it looks like one can reduce portfolio variance a great deal for the same mean return by holding an internationally diversified portfolio. But is this real or just sampling error? Even if the value-weighted portfolio were ex-ante mean-variance efficient, an ex-post mean-variance frontier constructed from historical returns on NYSE stocks would leave the value-weighted portfolio well inside the ex-post frontier. So is "I should have bought Japanese stocks in 1960" (and sold them in 1990!) a signal that broad-based international diversification is a good idea now, or is it simply 20/20 hindsight regret like "I should have bought Microsoft in 1982"? Similarly, when evaluating fund managers, we want to know whether the manager is truly able to form a portfolio that beats mean-variance efficient passive portfolios, or whether better performance in sample is just due to luck.

Figure 27. Mean-variance frontiers might intersect rather than coincide.
Since a factor model is true if and only if a linear combination of the factors (or factor-mimicking portfolios if the factors are not returns) is mean-variance efficient, one can interpret a test of any factor pricing model as a test of whether a given return is on the mean-variance frontier. Section 12.1 showed how the Gibbons, Ross and Shanken pricing error statistic can be interpreted as a test of whether a given portfolio is on the mean-variance frontier, when returns and factors are i.i.d., and the GMM distribution theory of that test statistic allows us to extend the test to non-i.i.d. errors. A GMM, $p = E(mx)$, $m = a - bR^p$ test analogously tests whether $R^p$ is on the mean-variance frontier of the test assets.
We may want to go one step further, and not just test whether a combination of a set of
assets R^d (say, domestic assets) is on the mean-variance frontier, but whether the R^d assets
span the mean-variance frontier of R^d and R^i (say, foreign or international) assets. The
trouble is that if there is no riskfree rate, the frontier generated by R^d might just intersect the
frontier generated by R^d and R^i together, rather than span or coincide with the latter frontier,
as shown in Figure 27. Testing that m = a − b′R^d prices both R^d and R^i only checks for
intersection.


DeSantis (1992) and Chen and Knez (1992, 1993) show how to test for spanning as opposed
to intersection. For intersection, m = a − b′R^d will price both R^d and R^i only for
one value of a, or equivalently one value of E(m) or choice of the intercept, as shown. If the frontiers coincide
or span, then m = a + b′R^d prices both R^d and R^i for any value of a. Thus, we can
test for coincident frontiers by testing whether m = a + b′R^d prices both R^d and R^i for
two prespecified values of a simultaneously.
To see how this works, start by noting that there must be at least two assets in R^d. If not,
there is no mean-variance frontier of R^d assets; it is simply a point. If there are two assets in
R^d, R^{d1} and R^{d2}, then the mean-variance frontier of domestic assets connects them; they are
each on the frontier. If they are both on the frontier, then there must be discount factors

$$ m^1 = a^1 - \tilde{b}^1 R^{d1} $$

$$ m^2 = a^2 - \tilde{b}^2 R^{d2} $$

and, of course, any linear combination,

$$ m = \left[\lambda a^1 + (1-\lambda)a^2\right] - \left[\lambda \tilde{b}^1 R^{d1} + (1-\lambda)\tilde{b}^2 R^{d2}\right]. $$

Equivalently, for any value of a, there is a discount factor of the form

$$ m = a - \left(b_1 R^{d1} + b_2 R^{d2}\right). $$

Thus, you can test for spanning with a J_T test on the moments

$$ E\left[(a^1 - b^{1\prime} R^d)R^d\right] = 0 $$
$$ E\left[(a^1 - b^{1\prime} R^d)R^i\right] = 0 $$
$$ E\left[(a^2 - b^{2\prime} R^d)R^d\right] = 0 $$
$$ E\left[(a^2 - b^{2\prime} R^d)R^i\right] = 0 $$

for any two fixed values of a^1, a^2.
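To make the idea concrete, here is a minimal numerical sketch in Python (numpy assumed). To keep the sketch linear, I estimate each b by setting the R^d moments to zero exactly and then examine the leftover R^i moment; a full J_T test would instead stack all the moments with a single weighting matrix. The simulated excess returns and the choices a = 0.9, 1.1 are illustrative assumptions, not from the text:

```python
import numpy as np

rng = np.random.default_rng(0)
T = 2000
# Two "domestic" excess returns and a nearly-spanned "foreign" excess return (illustrative)
Rd = 0.06 + 0.15 * rng.standard_normal((T, 2))
Ri = 0.5 * Rd[:, 0] + 0.5 * Rd[:, 1] + 0.001 * rng.standard_normal(T)

def jt_for_a(a, Rd, Ri):
    """Pick b so that m = a - b'R^d satisfies E[(a - b'R^d) R^d] = 0 exactly,
    then test the leftover moment E[(a - b'R^d) R^i] = 0."""
    T = len(Rd)
    b = np.linalg.solve(Rd.T @ Rd / T, a * Rd.mean(axis=0))
    h = (a - Rd @ b) * Ri          # per-period pricing error on R^i
    g = h.mean()
    S = h.var()                    # i.i.d. sketch: no lead/lag terms in S
    return T * g**2 / S            # one-moment J_T statistic, ~ chi^2(1) under the null

jt1, jt2 = jt_for_a(0.9, Rd, Ri), jt_for_a(1.1, Rd, Ri)
```

Spanning requires that the pricing errors be small for both values of a simultaneously; intersection only requires it for one.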

13.6 Problems

1. Work out the GMM distribution theory for the model m = 1 − b′(f − E(f)) when the
test assets are excess returns. The distribution should recognize the fact that E(f) is
estimated in sample. To do this, set up

$$ g_T = \begin{bmatrix} E_T\left(R^e - R^e (f' - Ef')\,b\right) \\ E_T(f - Ef) \end{bmatrix} $$

$$ a_T = \begin{bmatrix} E_T\left(\tilde{f} R^{e\prime}\right) & 0 \\ 0 & I_K \end{bmatrix}. $$

The estimated parameters are b, E(f). You should end up with a formula for the standard
error of b that resembles the Shanken correction (12.184), and an unchanged J_T test.
2. Show that if you use single regression betas, then the corresponding λ can be used to test
for the marginal importance of factors. However, the λ are no longer the expected returns
of factor-mimicking portfolios.

Chapter 14. Maximum likelihood
Maximum likelihood is, like GMM, a general organizing principle that is a useful place to
start when thinking about how to choose parameters and evaluate a model. It comes with
an asymptotic distribution theory, which, like GMM, is a good place to start when you are
unsure about how to treat various problems, such as the fact that betas must be estimated in a
cross-sectional regression.
As we will see, maximum likelihood is a special case of GMM. Given a statistical description
of the data, it prescribes which moments are statistically most informative. Given those
moments, ML and GMM are the same. Thus, ML can be used to defend why one picks a certain
set of moments, or for advice on which moments to pick if one is unsure. In this sense,
maximum likelihood (paired with carefully chosen statistical models) justifies the regression
tests above, as it justifies standard regressions. On the other hand, ML does not easily allow
you to use other non-"efficient" moments, if you suspect that ML's choices are not robust to
misspecifications of the economic or statistical model. For example, ML will tell you how
to do GLS, but it will not tell you how to adjust OLS standard errors for non-standard error
terms.
Hamilton (1994) p. 142-148 and the appendix in Campbell, Lo and MacKinlay (1997) give
nice summaries of maximum likelihood theory. Campbell, Lo and MacKinlay's Chapters 5
and 6 treat many more variations of regression-based tests and maximum likelihood.

14.1 Maximum likelihood

The maximum likelihood principle says to pick the parameters that make the observed
data most likely. Maximum likelihood estimates are asymptotically efficient. The information
matrix gives the asymptotic standard errors of ML estimates.

The maximum likelihood principle says to pick that set of parameters that makes the
observed data most likely. This is not "the set of parameters that are most likely given the
data"; in classical (as opposed to Bayesian) statistics, parameters are numbers, not random
variables.
To implement this idea, you first have to figure out what the probability of seeing a data
set {x_t} is, given the free parameters θ of a model. This probability distribution is called the
likelihood function f({x_t}; θ). Then, the maximum likelihood principle says to pick

$$ \hat{\theta} = \arg\max_{\theta} f(\{x_t\}; \theta). $$

For reasons that will soon be obvious, it's much easier to work with the log of this probability,

$$ L(\{x_t\}; \theta) = \ln f(\{x_t\}; \theta). $$

Maximizing the log likelihood is the same thing as maximizing the likelihood.
Finding the likelihood function isn't always easy. In a time-series context, the best way
to do it is often to first find the log conditional likelihood function f(x_t | x_{t−1}, x_{t−2}, ..., x_0; θ),
the chance of seeing x_t given x_{t−1}, x_{t−2}, ..., and given values for the parameters θ. Since a joint
probability is the product of conditional probabilities, the log likelihood function is just the
sum of the conditional log likelihood functions,

$$ L(\{x_t\}; \theta) = \sum_{t=1}^{T} \ln f(x_t | x_{t-1}, x_{t-2}, \ldots, x_0; \theta). \tag{193} $$

More concretely, we usually assume normal errors, so the likelihood function is

$$ L = -\frac{T}{2}\ln\left(2\pi\,|\Sigma|\right) - \frac{1}{2}\sum_{t=1}^{T} \varepsilon_t' \Sigma^{-1} \varepsilon_t \tag{194} $$

where ε_t denotes a vector of shocks; ε_t = x_t − E(x_t | x_{t−1}, x_{t−2}, ..., x_0; θ).
This expression gives a simple recipe for constructing a likelihood function. You usually
start with a model that generates x_t from errors, e.g. x_t = ρx_{t−1} + ε_t. Invert that model to
express the errors ε_t in terms of the data {x_t} and plug in to (14.194).
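This recipe can be sketched directly in code for an AR(1) with normal errors: invert the model to recover the errors, then sum their log densities. A minimal sketch in Python (numpy assumed; the simulated series and parameter values are illustrative):

```python
import numpy as np

def ar1_conditional_loglik(x, rho, sigma2):
    """Conditional log likelihood of an AR(1), treating x[0] as fixed.
    Invert x_t = rho*x_{t-1} + eps_t to recover eps_t, then sum normal log densities."""
    eps = x[1:] - rho * x[:-1]
    n = len(eps)
    return -0.5 * n * np.log(2 * np.pi * sigma2) - 0.5 * np.sum(eps**2) / sigma2

# Simulate an AR(1) with rho = 0.5, sigma^2 = 1 (illustrative values)
rng = np.random.default_rng(1)
x = np.zeros(1000)
for t in range(1, 1000):
    x[t] = 0.5 * x[t - 1] + rng.standard_normal()

# The likelihood should be higher near the true rho than at a wrong value
better = ar1_conditional_loglik(x, 0.5, 1.0) > ar1_conditional_loglik(x, 0.9, 1.0)
```

In practice one would maximize this function over (ρ, σ²) numerically or, as shown later in the chapter, analytically via the first order conditions.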
There is a small issue about how to start off a model such as (14.193). Ideally, the first
observation should use the unconditional density, i.e.

$$ L(\{x_t\}; \theta) = \ln f(x_1; \theta) + \ln f(x_2 | x_1; \theta) + \ln f(x_3 | x_2, x_1; \theta) + \ldots $$

However, it is usually hard to evaluate the unconditional density or the first terms with only
a few lagged x's. Therefore, if as usual the conditional density can be expressed in terms
of a finite number k of lags of x_t, one often maximizes the conditional likelihood function
(conditional on the first k observations), treating the first k observations as fixed rather than
random variables,

$$ L(\{x_t\}; \theta) = \ln f(x_{k+1} | x_k, x_{k-1}, \ldots, x_1; \theta) + \ln f(x_{k+2} | x_{k+1}, x_k, \ldots, x_2; \theta) + \ldots $$

Alternatively, one can treat k pre-sample values {x_0, x_{−1}, ..., x_{−k+1}} as additional parameters
over which to maximize the likelihood function.
Maximum likelihood estimators come with a useful asymptotic (i.e. approximate) distribution
theory. First, the distribution of the estimates is

$$ \hat{\theta} \sim N\left(\theta,\; \left[-\frac{\partial^2 L}{\partial\theta\,\partial\theta'}\right]^{-1}\right). \tag{195} $$

If the likelihood L has a sharp peak at θ̂, then we know a lot about the parameters, while if
the peak is flat, other parameters are just as plausible. The maximum likelihood estimator
is asymptotically efficient, meaning that no other estimator can produce a smaller covariance
matrix.
The second derivative in (14.195) is known as the information matrix,

$$ I = -\frac{1}{T}\frac{\partial^2 L}{\partial\theta\,\partial\theta'} = -\frac{1}{T}\sum_{t=1}^{T}\frac{\partial^2 \ln f(x_{t+1}|x_t, x_{t-1}, \ldots, x_0; \theta)}{\partial\theta\,\partial\theta'}. \tag{196} $$

(More precisely, the information matrix is defined as the expected value of the second partial,
which is estimated with its sample value.) The information matrix can also be estimated as
a product of first derivatives. The expression

$$ I = \frac{1}{T}\sum_{t=1}^{T}\left(\frac{\partial \ln f(x_{t+1}|x_t, x_{t-1}, \ldots, x_0; \theta)}{\partial\theta}\right)\left(\frac{\partial \ln f(x_{t+1}|x_t, x_{t-1}, \ldots, x_0; \theta)}{\partial\theta}\right)' $$

converges to the same value as (14.196). (Hamilton 1994, p. 429 gives a proof.)
If we estimate a model restricting the parameters, the maximum value of the likelihood
function will necessarily be lower. However, if the restriction is true, it shouldn't be that
much lower. This intuition is captured in the likelihood ratio test

$$ 2\left(L_{\text{unrestricted}} - L_{\text{restricted}}\right) \sim \chi^2_{\text{number of restrictions}}. $$

The form and idea of this test is much like the χ² difference test for GMM objectives that we
met in section 11.1.
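The likelihood ratio test can be sketched numerically for an AR(1) with known unit error variance, testing the restriction ρ = 0 (numpy assumed; the simulated data are illustrative, and 3.84 is the 5% critical value of a χ² with one degree of freedom):

```python
import numpy as np

def loglik(x, rho):
    # AR(1) conditional log likelihood with known sigma^2 = 1, up to a constant
    eps = x[1:] - rho * x[:-1]
    return -0.5 * np.sum(eps**2)

rng = np.random.default_rng(2)
x = np.zeros(500)
for t in range(1, 500):
    x[t] = 0.5 * x[t - 1] + rng.standard_normal()   # true rho = 0.5

rho_hat = (x[1:] @ x[:-1]) / (x[:-1] @ x[:-1])      # unrestricted ML estimate
LR = 2 * (loglik(x, rho_hat) - loglik(x, 0.0))      # restricted model: rho = 0

# LR ~ chi^2(1) under the null; here the null is false, so LR should be large
reject = LR > 3.84
```

Because the unrestricted estimate maximizes the likelihood, LR is nonnegative by construction; the test asks whether it is larger than sampling variation under the restriction can explain.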

14.2 ML is GMM on the scores

ML is a special case of GMM. ML uses the information in the auxiliary statistical model to
derive the statistically most informative moment conditions. To see this fact, start with the first
order conditions for maximizing a likelihood function,

$$ \frac{\partial L(\{x_t\}; \theta)}{\partial\theta} = \sum_{t=1}^{T}\frac{\partial \ln f(x_t|x_{t-1}, x_{t-2}, \ldots; \theta)}{\partial\theta} = 0. $$

This is a GMM estimate. It is the sample counterpart to a population moment condition

$$ g(\theta) = E\left(\frac{\partial \ln f(x_t|x_{t-1}, x_{t-2}, \ldots; \theta)}{\partial\theta}\right) = 0. \tag{198} $$

The term ∂ln f(x_t|x_{t−1}, x_{t−2}, ...; θ)/∂θ is known as the "score." It is a random variable,
formed as a combination of current and past data (x_t, x_{t−1}, ...). Thus, maximum likelihood is
a special case of GMM, a special choice of which moments to examine.
For example, suppose that x follows an AR(1) with known variance,

$$ x_t = \rho x_{t-1} + \varepsilon_t, $$

and suppose the error terms are i.i.d. normal random variables. Then,

$$ \ln f(x_t|x_{t-1}, x_{t-2}, \ldots; \rho) = \text{const.} - \frac{(x_t - \rho x_{t-1})^2}{2\sigma^2} $$

and the score is

$$ \frac{\partial \ln f(x_t|x_{t-1}, x_{t-2}, \ldots; \rho)}{\partial\rho} = \frac{(x_t - \rho x_{t-1})\, x_{t-1}}{\sigma^2}. $$

The first order condition for maximizing the likelihood is

$$ \frac{1}{T}\sum_{t=1}^{T}(x_t - \rho x_{t-1})\, x_{t-1} = 0. $$

This expression is a moment condition, and you'll recognize it as the OLS estimator of ρ,
which we have already regarded as a case of GMM.
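The equivalence between the score moment condition and OLS is easy to verify numerically (numpy assumed; the simulated data are illustrative):

```python
import numpy as np

rng = np.random.default_rng(3)
x = np.zeros(800)
for t in range(1, 800):
    x[t] = 0.7 * x[t - 1] + rng.standard_normal()   # AR(1), true rho = 0.7

# Set the average score (x_t - rho*x_{t-1}) x_{t-1} / sigma^2 to zero and solve for rho
rho_score = (x[1:] @ x[:-1]) / (x[:-1] @ x[:-1])

# OLS regression of x_t on x_{t-1} with no constant
rho_ols = np.linalg.lstsq(x[:-1, None], x[1:], rcond=None)[0][0]
```

The two estimates are identical up to floating point error, since both solve the same linear moment condition.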
The example shows another property of scores: the scores should be unforecastable. In
the example,

$$ E_{t-1}\left[\frac{(x_t - \rho x_{t-1})\, x_{t-1}}{\sigma^2}\right] = E_{t-1}\left[\frac{\varepsilon_t\, x_{t-1}}{\sigma^2}\right] = 0. $$

Intuitively, if we used a combination of the x variables E(h(x_t, x_{t−1}, ...)) = 0 that was
predictable, we could form another moment, an instrument, that described the predictability
of the h variable and use that moment to get more information about the parameters. To prove
this property more generally, start with the fact that f(x_t|x_{t−1}, x_{t−2}, ...; θ) is a conditional
density and therefore must integrate to one. Differentiating with respect to θ,

$$ 1 = \int f(x_t|x_{t-1}, x_{t-2}, \ldots; \theta)\, dx_t $$

$$ 0 = \int \frac{\partial f(x_t|x_{t-1}, x_{t-2}, \ldots; \theta)}{\partial\theta}\, dx_t $$

$$ 0 = \int \frac{\partial \ln f(x_t|x_{t-1}, x_{t-2}, \ldots; \theta)}{\partial\theta}\, f(x_t|x_{t-1}, x_{t-2}, \ldots; \theta)\, dx_t $$

$$ 0 = E_{t-1}\left[\frac{\partial \ln f(x_t|x_{t-1}, x_{t-2}, \ldots; \theta)}{\partial\theta}\right]. $$

Furthermore, as you might expect, the GMM distribution theory formulas give the same
result as the ML distribution theory, i.e., the inverse information matrix is the asymptotic variance-covariance
matrix. To show this fact, apply the GMM distribution theory (11.144) to (14.198). The
derivative matrix is

$$ d = \frac{\partial g_T(\theta)}{\partial\theta'} = \frac{1}{T}\sum_{t=1}^{T}\frac{\partial^2 \ln f(x_t|x_{t-1}, x_{t-2}, \ldots; \theta)}{\partial\theta\,\partial\theta'} = -I. $$

This is the second derivative expression of the information matrix. The S matrix is

$$ S = E\left[\frac{\partial \ln f(x_t|x_{t-1}, x_{t-2}, \ldots; \theta)}{\partial\theta}\; \frac{\partial \ln f(x_t|x_{t-1}, x_{t-2}, \ldots; \theta)}{\partial\theta}'\right] = I. $$

The lead and lag terms in S are all zero since we showed above that scores should be unforecastable.
This is the outer product definition of the information matrix. There is no a
matrix, since the moments themselves are set to zero. The GMM asymptotic distribution of
θ̂ is therefore

$$ \sqrt{T}\left(\hat{\theta} - \theta\right) \rightarrow N\left[0,\; d^{-1}S\,d^{-1\prime}\right] = N\left[0,\; I^{-1}\right]. $$

We recover the inverse information matrix, as specified by the ML asymptotic distribution
theory.
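One can also check numerically that the second-derivative and outer-product estimates of the information matrix converge to the same value. For the AR(1) score above with σ² = 1, the negative second derivative is x²_{t−1} and the squared score is ε²_t x²_{t−1}; both sample averages estimate E(x²) = 1/(1 − ρ²). A sketch (numpy assumed; the simulated data are illustrative):

```python
import numpy as np

rng = np.random.default_rng(4)
T, rho = 20000, 0.5
x = np.zeros(T)
for t in range(1, T):
    x[t] = rho * x[t - 1] + rng.standard_normal()

eps = x[1:] - rho * x[:-1]                 # errors at the true parameter
info_hessian = np.mean(x[:-1] ** 2)        # -(1/T) x sum of second derivatives
info_outer = np.mean((eps * x[:-1]) ** 2)  # (1/T) x sum of squared scores

# Both estimates should be close to each other (and to 1/(1 - rho^2)) in a long sample
close = abs(info_hessian / info_outer - 1) < 0.2
```

The agreement is only asymptotic; in short samples the two estimates can differ noticeably, which is one practical reason to compute both.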

14.3 When factors are returns, ML prescribes a time-series regression

I add to the economic model E(R^e) = βE(f) a statistical assumption that the regression
errors are independent over time and independent of the factors. ML then prescribes a time-series
regression with no constant. To prescribe a time-series regression with a constant, we
drop the model prediction α = 0. I show how the information matrix gives the same result
as the OLS standard errors.


Given a linear factor model whose factors are also returns, as with the CAPM, ML prescribes
a time-series regression test. To keep the notation simple, I again treat a single factor f.
The economic model is

$$ E(R^e) = \beta E(f). \tag{201} $$

R^e is an N × 1 vector of test assets, and β is an N × 1 vector of regression coefficients of
these assets on the factor (the market return R^{em} in the case of the CAPM).
To apply maximum likelihood, we need to add an explicit statistical model that fully
describes the joint distribution of the data. I assume that the market return and regression
errors are i.i.d. normal, i.e.

$$ R^e_t = \alpha + \beta f_t + \varepsilon_t $$
$$ f_t = E(f) + u_t $$
$$ \begin{bmatrix} \varepsilon_t \\ u_t \end{bmatrix} \sim N\left(\begin{bmatrix} 0 \\ 0 \end{bmatrix},\; \begin{bmatrix} \Sigma & 0 \\ 0 & \sigma_u^2 \end{bmatrix}\right). \tag{202} $$

(We can get by with non-normal factors, but it is easier not to present the general case.)
Equation (14.202) has no content other than normality. The zero correlation between u_t and
ε_t identifies β as a regression coefficient. You can just write R^e, R^{em} as a general bivariate
normal, and you will get the same results.
The economic model (14.201) implies restrictions on this statistical model. Taking expectations
of (14.202), the CAPM implies that the intercepts α should all be zero. Again, this
is also the only restriction that the CAPM places on the statistical model (14.202).
The most principled way to apply maximum likelihood is to impose the null hypothesis
throughout. Thus, we write the likelihood function imposing α = 0. To construct the likelihood
function, we reduce the statistical model to independent error terms, and then add their
log probability densities to get the likelihood function.
$$ L = \text{const.} - \frac{1}{2}\sum_{t=1}^{T}(R^e_t - \beta f_t)'\Sigma^{-1}(R^e_t - \beta f_t) - \frac{1}{2\sigma_u^2}\sum_{t=1}^{T}\left(f_t - E(f)\right)^2 $$

The estimates follow from the first order conditions,

$$ \frac{\partial L}{\partial\beta} = \Sigma^{-1}\sum_{t=1}^{T}(R^e_t - \beta f_t)\, f_t = 0 \;\Rightarrow\; \hat{\beta} = \left(\sum_{t=1}^{T} f_t^2\right)^{-1}\sum_{t=1}^{T} R^e_t f_t $$

$$ \frac{\partial L}{\partial E(f)} = \frac{1}{\sigma_u^2}\sum_{t=1}^{T}\left(f_t - E(f)\right) = 0 \;\Rightarrow\; \widehat{E(f)} = \hat{\lambda} = \frac{1}{T}\sum_{t=1}^{T} f_t. $$

(∂L/∂Σ and ∂L/∂σ_u² also produce ML estimates of the covariance matrices, which turn out
to be the standard averages of squared residuals.)
The ML estimate of β is the OLS regression without a constant. The null hypothesis says
to leave out the constant, and the ML estimator uses that fact to avoid estimating a constant.
Since the factor risk premium is equal to the mean market return, it's not too surprising that the λ
estimate is the same as that of the average market return.
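A quick sketch of the restricted estimates in code (numpy assumed; a single simulated test asset with α = 0 imposed in the data-generating process, all parameter values illustrative):

```python
import numpy as np

rng = np.random.default_rng(5)
T = 5000
f = 0.08 + 0.16 * rng.standard_normal(T)        # factor (market excess) return
Re = 1.2 * f + 0.05 * rng.standard_normal(T)    # test asset, alpha = 0 true

# Restricted ML estimate: time-series regression with no constant
beta_ml = (Re @ f) / (f @ f)
beta_ols_nc = np.linalg.lstsq(f[:, None], Re, rcond=None)[0][0]

# ML factor risk premium estimate: the sample mean of the factor
lam_hat = f.mean()
```

The no-constant OLS regression and the ML first order condition for β give identical answers, and λ̂ is just the average factor return, as in the text.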
We know that the ML distribution theory must give the same result as the GMM distribution
theory, which we already derived in section 12.1, but it's worth seeing it explicitly. The
asymptotic standard errors follow from either estimate of the information matrix; for example,

$$ \frac{\partial^2 L}{\partial\beta\,\partial\beta'} = -\Sigma^{-1}\sum_{t=1}^{T} f_t^2 $$

$$ \text{cov}(\hat{\beta}) = \frac{1}{T}\,\frac{1}{E(f^2)}\,\Sigma = \frac{1}{T}\,\frac{1}{E(f)^2 + \sigma^2(f)}\,\Sigma. \tag{203} $$

This is the standard OLS formula.
We also want pricing error measurements, standard errors and tests. We can apply maximum
likelihood to estimate an unconstrained model, containing intercepts, and then use Wald
tests (estimate/standard error) to test the restriction that the intercepts are zero. We can also
use the unconstrained model to run the likelihood ratio test. The unconstrained likelihood
function is

$$ L = \text{const.} - \frac{1}{2}\sum_{t=1}^{T}(R^e_t - \alpha - \beta f_t)'\Sigma^{-1}(R^e_t - \alpha - \beta f_t) + \ldots $$

(I ignore the term in the factor, since it will again just tell us to use the sample mean to
estimate the factor risk premium.)
The estimates are now

$$ \frac{\partial L}{\partial\alpha} = \Sigma^{-1}\sum_{t=1}^{T}(R^e_t - \alpha - \beta f_t) = 0 \;\Rightarrow\; \hat{\alpha} = E_T(R^e_t) - \hat{\beta}\, E_T(f_t) $$

$$ \frac{\partial L}{\partial\beta} = \Sigma^{-1}\sum_{t=1}^{T}(R^e_t - \alpha - \beta f_t)\, f_t = 0 \;\Rightarrow\; \hat{\beta} = \frac{\text{cov}_T(R^e_t, f_t)}{\sigma^2_T(f_t)}. $$

Unsurprisingly, the maximum likelihood estimates of α and β are the OLS estimates, with a
constant.
The inverse of the information matrix gives the asymptotic distribution of these estimates.
Since they are just OLS estimates, we're going to get the OLS standard errors, but it's worth
seeing it come out of ML.

$$ -\left[\frac{\partial^2 L}{\partial[\alpha\;\beta]\,\partial[\alpha\;\beta]'}\right]^{-1} = \frac{1}{T}\begin{bmatrix} \Sigma^{-1} & \Sigma^{-1}E(f) \\ \Sigma^{-1}E(f) & \Sigma^{-1}E(f^2) \end{bmatrix}^{-1} = \frac{1}{T}\,\frac{1}{\sigma^2(f)}\begin{bmatrix} E(f^2) & -E(f) \\ -E(f) & 1 \end{bmatrix} \otimes \Sigma $$

The covariance matrices of α̂ and β̂ are thus

$$ \text{cov}(\hat{\alpha}) = \frac{1}{T}\left[1 + \left(\frac{E(f)}{\sigma(f)}\right)^2\right]\Sigma $$
$$ \text{cov}(\hat{\beta}) = \frac{1}{T}\,\frac{1}{\sigma^2(f)}\,\Sigma. \tag{204} $$

These are just the usual OLS standard errors, which we derived in section 12.1 as a special
case of GMM standard errors for the OLS time-series regressions when errors are uncorrelated
over time and independent of the factors, or by specializing σ²(X′X)⁻¹.
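For a single test asset, the equivalence with σ²(X′X)⁻¹ can be verified exactly: with X = [1, f] and sample moments in place of population moments, the diagonal elements of σ²(X′X)⁻¹ reproduce the α and β variances above. A numerical check (numpy assumed; the data and the residual variance are illustrative, and the residual variance is taken as known for simplicity):

```python
import numpy as np

rng = np.random.default_rng(6)
T = 4000
f = 0.08 + 0.16 * rng.standard_normal(T)   # factor return
s2 = 0.05 ** 2                             # residual variance, taken as known here

X = np.column_stack([np.ones(T), f])
cov_ols = s2 * np.linalg.inv(X.T @ X)      # sigma^2 (X'X)^{-1}

Ef, varf = f.mean(), f.var()               # sample moments (ddof = 0)
cov_alpha = (1 + Ef**2 / varf) * s2 / T    # alpha variance formula
cov_beta = s2 / (T * varf)                 # beta variance formula
```

The match is exact rather than approximate, because both expressions are algebraic rearrangements of the same sample second-moment matrix.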
You cannot just invert ∂²L/∂α∂α′ to find the covariance matrix of α̂. That attempt would give
(1/T)Σ as the covariance matrix of α̂, which would be wrong. You have to invert the entire
information matrix to get the standard error of any parameter. Otherwise, you are ignoring
the effect that estimating β has on the distribution of α. In fact, what I presented is really
wrong, since we also must estimate Σ. However, it turns out that Σ̂ is independent of α̂ and
β̂ (the information matrix is block-diagonal), so the top left two elements of the true inverse
information matrix are the same as those written here.
The variance of β̂ in (14.204) is larger than it was in (14.203), when we imposed the null
of no constant. ML uses all the information it can to produce efficient estimates, estimates
with the smallest possible covariance matrix. The ratio of the two formulas is equal to the
familiar term 1 + E(f)²/σ²(f). In annual data for the CAPM, σ(R^{em}) = 16% and E(R^{em}) =
8%, which means that the unrestricted estimate (14.204) has a variance 25% larger than the restricted
estimate (14.203), so the gain in efficiency can be important. In monthly data, however, the
gain is smaller, since the variance and mean both scale with the horizon.
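The 25% figure is just the arithmetic of the ratio 1 + E(f)²/σ²(f) with the annual values quoted above:

```python
# Variance ratio of the unrestricted to the restricted beta estimate
E_f, sigma_f = 0.08, 0.16          # E(R^em) = 8%, sigma(R^em) = 16% (annual)
ratio = 1 + (E_f / sigma_f) ** 2   # = 1.25, i.e. 25% larger variance
```

With monthly values (roughly E(f) and σ²(f) both scaled down by 12), the same formula gives a ratio much closer to one.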
We can also view this fact as a warning: ML will ruthlessly exploit the null hypothesis and
do things like running regressions without a constant in order to get any small improvement
in efficiency.
We can use these covariance matrices to construct a Wald (estimate/standard error) test of
the restriction of the model that the alphas are all zero,

$$ T\left[1 + \left(\frac{E(f)}{\sigma(f)}\right)^2\right]^{-1}\hat{\alpha}'\hat{\Sigma}^{-1}\hat{\alpha} \sim \chi^2_N. \tag{205} $$
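A sketch of computing this statistic on simulated data where the null α = 0 is true (numpy assumed; N = 3 test assets and all parameter values are illustrative):

```python
import numpy as np

rng = np.random.default_rng(7)
T, N = 2000, 3
f = 0.08 + 0.16 * rng.standard_normal(T)
beta = np.array([0.8, 1.0, 1.2])
Re = np.outer(f, beta) + 0.05 * rng.standard_normal((T, N))   # alpha = 0 true

# OLS time-series regressions with a constant, all assets at once
X = np.column_stack([np.ones(T), f])
coef, *_ = np.linalg.lstsq(X, Re, rcond=None)
alpha_hat = coef[0]                                           # N x 1 intercepts
resid = Re - X @ coef
Sigma_hat = resid.T @ resid / T                               # residual covariance

# Wald statistic, approximately chi^2 with N degrees of freedom under the null
wald = T / (1 + f.mean() ** 2 / f.var()) * alpha_hat @ np.linalg.inv(Sigma_hat) @ alpha_hat
```

Since the null holds in this simulation, the statistic should be an unremarkable draw from a χ² distribution with N degrees of freedom, rather than a large rejection.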


