

[E(yx) − p]' E(xx')^{-1} [E(yx) − p] = g_T' E(xx')^{-1} g_T.

You might want to choose parameters of the model to minimize this "economic" measure
of model fit, or this economically motivated linear combination of pricing errors, rather than
the statistical measure of fit S^{-1}. You might also use the minimized value of this criterion to
compare two models. In that way, you are sure the better model is better because it improves
on the pricing errors rather than just blowing up the weighting matrix.
Identity matrix.
Using the identity matrix weights the initial choice of assets or portfolios equally in esti-
mation and evaluation. This choice has a particular advantage with large systems in which S
is nearly singular, as it avoids most of the problems associated with inverting a near-singular
S matrix. Many empirical asset pricing studies use OLS cross-sectional regressions, which
are the same thing as a first stage GMM estimate with an identity weighting matrix.
Comparing the second moment and identity matrices.
The second moment matrix gives an objective that is invariant to the initial choice of
assets or portfolios. If we form a portfolio Ax of the initial payoffs x, with nonsingular A
(i.e. a transformation that doesn't throw away information), then

[E(yAx) − Ap]' E(Axx'A')^{-1} [E(yAx) − Ap] = [E(yx) − p]' E(xx')^{-1} [E(yx) − p].

The optimal weighting matrix S shares this property. It is not true of the identity or other
fixed matrices. In those cases, the results will depend on the initial choice of portfolios.
Kandel and Stambaugh (1995) have suggested that the results of several important asset
pricing model tests are highly sensitive to the choice of portfolio; i.e., that authors inadver-
tently selected a set of portfolios on which the CAPM does unusually badly in a particular
sample. Insisting that weighting matrices have this kind of invariance to portfolio selection
might be a good device to guard against this problem.
On the other hand, if you want to focus on the model's predictions for economically
interesting portfolios, then it wouldn't make much sense for the weighting matrix to undo
the specification of economically interesting portfolios! For example, many studies want
to focus on the ability of a model to describe expected returns that seem to depend on a
characteristic such as size, book/market, industry, momentum, etc. Also, the second moment
matrix is often even more nearly singular than the spectral density matrix, since E(xx') =
cov(x) + E(x)E(x)'. Therefore, it often emphasizes portfolios with even more extreme short
and long positions, and is no help on overcoming the near singularity of the S matrix.

11.6 Estimating on one group of moments, testing on another.

You may want to force the system to use one set of moments for estimation and another for
testing. The real business cycle literature in macroeconomics does this extensively, typically
using "first moments" for estimation ("calibration") and "second moments" (i.e. first mo-
ments of squares) for evaluation. A statistically minded macroeconomist might like to know
whether the departures of model from data "second moments" are large compared to sam-
pling variation, and would like to include sampling uncertainty about the parameter estimates
in this evaluation.
You might want to choose parameters using one set of asset returns (stocks; domestic as-
sets; size portfolios; first 9 size deciles; well-measured assets) and then see how the model
does "out of sample" on another set of assets (bonds; foreign assets; book/market portfo-
lios; small firm portfolio; questionably measured assets; mutual funds). However, you want
the distribution theory for evaluation on the second set of moments to incorporate sampling
uncertainty about the parameters in their estimation on the first set of moments.
You can do all this very simply by using an appropriate weighting matrix or a prespec-
ified moment matrix a_T. For example, if the first N moments will be used to estimate N
parameters, and the remaining M moments will be used to test the model "out of sample,"
use a_T = [I_N  0_{N×M}]. If there are more moments N than parameters in the "estimation"
block, you can construct a weighting matrix W which is an identity matrix in the N × N es-
timation block and zero elsewhere. Then a_T = (∂g_T/∂b)' W will simply contain the first N
columns of (∂g_T/∂b)' followed by zeros. The test moments will not be used in estimation. You
could even use the inverse of the upper N × N block of S (not the upper block of the inverse
of S!) to make the estimation a bit more efficient.
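A minimal sketch of this device in code (function names are hypothetical, not from the text): build the selection matrix a_T = [I_N  0_{N×M}] so that only the first N moments determine the parameter estimates, or the block-identity weighting matrix W for the over-identified case.

```python
import numpy as np

# a_T = [I_N  0_{N x M}]: only the first N "estimation" moments set the
# parameters; the remaining M moments are left free for an out-of-sample test.
def estimation_selection(N, M):
    return np.hstack([np.eye(N), np.zeros((N, M))])

# With more estimation moments than parameters, use instead a weighting
# matrix that is the identity on the estimation block and zero elsewhere;
# a_T = (dg_T/db)' W then zeroes out the test moments automatically.
def block_weighting(N_est, M_test):
    W = np.zeros((N_est + M_test, N_est + M_test))
    W[:N_est, :N_est] = np.eye(N_est)
    return W

a_T = estimation_selection(2, 3)
g_T = np.array([0.0, 0.0, 0.4, -0.2, 0.1])  # nonzero only in the test block
print(a_T @ g_T)  # the test moments do not enter the estimation conditions
```

Whatever is left in the test block of g_T after estimation is then judged by its own sampling distribution, which carries the estimation uncertainty through the usual GMM formulas.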

11.7 Estimating the spectral density matrix

Hints on estimating the spectral density or long run covariance matrix: 1) Use a sensi-
ble first stage estimate. 2) Remove means. 3) Downweight higher order correlations. 4) Con-
sider parametric structures for autocorrelation and heteroskedasticity. 5) Use the null to limit
the number of correlations or to impose other structure on S? 6) Size problems; consider
a factor or other parametric cross-sectional structure for S. 7) Iteration and simultaneous
b, S estimation.


The optimal weighting matrix S depends on population moments, and depends on the
parameters b. Work back through the definitions,

S = Σ_{j=−∞}^{∞} E(u_t u'_{t−j});  u_t ≡ m_t(b) x_t − p_{t−1}.

How do we estimate this matrix? The big picture is simple: following the usual philoso-
phy, estimate population moments by their sample counterparts. Thus, use the first stage b
estimates and the data to construct sample versions of the definition of S. This produces a
consistent estimate of the true spectral density matrix, which is all the asymptotic distribution
theory requires.
The details are important, however, and this section gives some hints. Also, you may want
a different, and less restrictive, estimate of S for use in standard errors than you do when you
are estimating S for use as a weighting matrix.
1) Use a sensible first stage W, or transform the data.
In the asymptotic theory, you can use consistent first stage b estimates formed by any
nontrivial weighting matrix. In practice, of course, you should use a sensible weighting
matrix so that the first stage estimates are not ridiculously inefficient. W = I is often a good
choice.
Sometimes, some moments will have different units than other moments. For example,
the dividend/price ratio is a number like 0.04. Therefore, the moment formed by R_{t+1} × d/p_t
will be about 0.04 times as large as the moment formed by R_{t+1} × 1. If you use W = I,
GMM will pay much less attention to the R_{t+1} × d/p_t moment. It is wise, then, to either
use an initial weighting matrix that overweights the R_{t+1} × d/p_t moment, or to transform
the data so the two moments are about the same mean and variance. For example, you could
use R_{t+1} × (1 + d/p_t). It is also useful to start with moments that are not horrendously
correlated with each other, or to remove such correlation with a clever W. For example, you
might consider R^a and R^b − R^a rather than R^a and R^b. You can accomplish this directly, or
by starting with

W = [1 −1; 0 1] [1 0; −1 1] = [2 −1; −1 1].
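The decorrelating W above is just A'A for the transformation A that maps the moments (g_a, g_b) into (g_a, g_b − g_a), so the quadratic form g' W g equals the sum of squares of the transformed moments. A quick numerical check of this identity:

```python
import numpy as np

# A maps the moment vector (g_a, g_b) into (g_a, g_b - g_a); then
# g' (A'A) g is the quadratic form in the transformed moments.
A = np.array([[1.0, 0.0],
              [-1.0, 1.0]])
W = A.T @ A  # = [[2, -1], [-1, 1]], as in the text

g = np.array([0.3, 0.5])
direct = g @ W @ g
transformed = np.sum((A @ g) ** 2)
assert np.isclose(direct, transformed)
```

Any nonsingular A works the same way, so you can fold an arbitrary change of portfolio basis into the first stage weighting matrix rather than transforming the data by hand.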

2) Remove means.
Under the null, E(u_t) = 0, so it does not matter to the asymptotic distribution theory
whether you estimate the covariance matrix by removing means, using

(1/T) Σ_{t=1}^{T} [(u_t − ū)(u_t − ū)'];  ū ≡ (1/T) Σ_{t=1}^{T} u_t,

or whether you estimate the second moment matrix by not removing means. However,
Hansen and Singleton (1982) advocate removing the means in sample, and this is gener-
ally a good idea.
It is already a major obstacle to second-stage estimation that estimated S matrices (and
even simple variance-covariance matrices) are often nearly singular, providing an unreliable
weighting matrix when inverted. Since second moment matrices E(uu') = cov(u, u') +
E(u)E(u') add a singular matrix E(u)E(u'), they are often even worse.
3) Downweight higher order correlations.
You obviously cannot use a direct sample counterpart to the spectral density matrix. In
a sample of size 100, there is no way to estimate E(u_t u'_{t+101}). Your estimate of E(u_t u'_{t+99})
is based on one data point, u_1 u'_{100}. Hence, it will be a pretty unreliable estimate. For this
reason, the estimator using all possible autocorrelations in a given sample is inconsistent.
(Consistency means that as the sample grows, the probability distribution of the estimator
converges to the true value. Inconsistent estimates typically have very large sample variation.)
Furthermore, even S estimates that use few autocorrelations are not always positive def-
inite in sample. This is embarrassing when one tries to invert the estimated spectral density
matrix, which you have to do if you use it as a weighting matrix. Therefore, it is a good idea
to construct consistent estimates that are automatically positive definite in every sample. One
such estimate is the Bartlett estimate, used in this application by Newey and West (1987b). It
is

Ŝ = Σ_{j=−k}^{k} ((k − |j|)/k) (1/T) Σ_{t=1}^{T} (u_t u'_{t−j}). (161)

As you can see, only autocorrelations up to kth (k < T) order are included, and higher order
autocorrelations are downweighted. (It's important to use 1/T, not 1/(T − k); this is a further
downweighting.) The Newey-West estimator is basically the variance of kth sums, which is
why it is positive definite in sample:

var(Σ_{j=1}^{k} u_{t−j}) = k E(u_t u'_t) + (k − 1)[E(u_t u'_{t−1}) + E(u_{t−1} u'_t)] + ···
+ [E(u_t u'_{t−k+1}) + E(u_{t−k+1} u'_t)] = k Σ_{j=−k}^{k} ((k − |j|)/k) E(u_t u'_{t−j}).

Andrews (1991) gives some additional weighting schemes for spectral density estimates.
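A minimal sketch of the Bartlett estimate (11.161) in code (function name hypothetical): demeaned errors u_t in, an estimate Ŝ out that is positive semidefinite in sample by construction.

```python
import numpy as np

def newey_west(u, k):
    """Bartlett / Newey-West estimate of the spectral density matrix S.

    u : (T, N) array of demeaned GMM errors u_t
    k : lag truncation; weights (k - |j|)/k, divisor 1/T throughout
    """
    T, N = u.shape
    S = u.T @ u / T                    # j = 0 term
    for j in range(1, k):              # weight is zero at |j| = k
        w = (k - j) / k                # Bartlett weight
        gamma = u[j:].T @ u[:-j] / T   # sample E(u_t u_{t-j}')
        S += w * (gamma + gamma.T)     # lag j and lead j together
    return S
```

Note the divisor is 1/T for every lag, never 1/(T − j), matching the downweighting remark in the text; with k = 1 the estimator collapses to the no-lag second moment matrix u'u/T.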


This calculation also gives some intuition for the S matrix. We're looking for the variance
across samples of the sample mean, var((1/T) Σ_{t=1}^{T} u_t). We only have one sample mean to look
at, so we estimate the variance of the sample mean by looking at the variance in a single
sample of shorter sums, var(Σ_{j=1}^{k} u_{t−j}). The S matrix is sometimes called the long-run
covariance matrix for this reason. In fact, one could estimate S directly as a variance of kth
sums and obtain almost the same estimator, which would also be positive definite in any sample:

v_t = Σ_{j=1}^{k} u_{t−j};  v̄ = (1/(T − k)) Σ_{t=k+1}^{T} v_t

Ŝ = (1/k) (1/(T − k)) Σ_{t=k+1}^{T} (v_t − v̄)(v_t − v̄)'.

This estimator has been used when measurement of S is directly interesting (Cochrane 1998,
Lo and MacKinlay 1988). A variety of other weighting schemes have been advocated.
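The k-sums version of the estimator is short enough to write out directly (a sketch; the function name is hypothetical):

```python
import numpy as np

def s_from_k_sums(u, k):
    """Estimate S as 1/k times the sample variance of rolling k-term
    sums v_t = sum_{j=1}^{k} u_{t-j}; positive semidefinite in any sample.

    u : (T, N) array of GMM errors; returns an (N, N) estimate.
    """
    T, N = u.shape
    # v_t for t = k+1, ..., T: sum of the previous k errors
    v = np.array([u[t - k:t].sum(axis=0) for t in range(k, T)])
    v_bar = v.mean(axis=0)
    vc = v - v_bar
    return (vc.T @ vc) / (k * len(v))
```

With k = 1 this is simply the demeaned sample covariance of the errors; larger k picks up low-frequency movement in u at the cost of fewer effective observations.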
What value of k, or how wide a window if of another shape, should you use? Here
again, you have to use some judgment. Too short a value of k, together with a u_t that is
significantly autocorrelated, and you don't correct for correlation that might be there in the
errors. Too long a value of k, together with a series that does not have much autocorrelation,
and the performance of the estimate and test deteriorates. If k = T/2, for example, you are
really using only two data points to estimate the variance of the mean. The optimum value
then depends on how much persistence or low-frequency movement there is in a particular
application, vs. accuracy of the estimate.
There is an extensive statistical literature about optimal window width, or size of k. Alas,
this literature mostly characterizes the rate at which k should increase with sample size. You
must promise to increase k as sample size increases, but not as quickly as the sample size
increases (lim_{T→∞} k = ∞, lim_{T→∞} k/T = 0) in order to obtain consistent estimates. In
practice, promises about what you'd do with more data are pretty meaningless, and usually
broken once more data arrive.
4) Consider parametric structures for autocorrelation and heteroskedasticity.
"Nonparametric" corrections such as (11.161) often don't perform very well in typical
samples. The problem is that "nonparametric" techniques are really very highly parametric;
you have to estimate many correlations in the data. Therefore, the nonparametric estimate
varies a good deal from sample to sample, while the asymptotic distribution theory ignores
sampling variation in covariance matrix estimates. The asymptotic distribution can therefore
be a poor approximation to the finite-sample distribution of statistics like the J_T. The S^{-1}
weighting matrix will also be unreliable.
One answer is to use a Monte Carlo or bootstrap to estimate the finite-sample distribution
of parameters and test statistics rather than to rely on asymptotic theory.
Alternatively, you can impose a parametric structure on the S matrix. Just because the
formulas are expressed in terms of a sum of covariances does not mean you have to estimate
them that way; GMM is not inherently tied to "nonparametric" covariance matrix estimates.
For example, if you model a scalar u as an AR(1) with parameter ρ, then you can estimate
two numbers ρ and σ²_u rather than a whole list of autocorrelations, and calculate

S = Σ_{j=−∞}^{∞} E(u_t u_{t−j}) = σ²_u Σ_{j=−∞}^{∞} ρ^{|j|} = σ²_u (1 + ρ)/(1 − ρ).

If this structure is not a bad approximation, imposing it can result in more reliable estimates
and test statistics since one has to estimate many fewer coefficients. You could also transform the
data in such a way that there is less correlation to correct for in the first place.
(This is a very useful formula, by the way. You are probably used to calculating the
standard error of the mean as

σ(x̄) = σ(x)/√T.

This formula assumes that the x are uncorrelated over time. If an AR(1) is not a bad model
for their correlation, you can quickly adjust for correlation by using

σ(x̄) = (σ(x)/√T) √((1 + ρ)/(1 − ρ)).)
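The AR(1) adjustment to the standard error of the mean takes only a few lines (a sketch; the function name is hypothetical and ρ is the sample first autocorrelation):

```python
import numpy as np

def mean_se_ar1(x):
    """Standard error of the sample mean, adjusted for AR(1) serial
    correlation: sigma(x)/sqrt(T) * sqrt((1 + rho)/(1 - rho))."""
    x = np.asarray(x, dtype=float)
    T = len(x)
    xc = x - x.mean()
    rho = (xc[1:] @ xc[:-1]) / (xc @ xc)  # sample first autocorrelation
    se_iid = x.std(ddof=1) / np.sqrt(T)
    return se_iid * np.sqrt((1 + rho) / (1 - rho))
```

For a positively autocorrelated series, the adjusted standard error is wider than the usual i.i.d. formula; with ρ near zero the two coincide.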
This sort of parametric correction is very familiar from OLS regression analysis. The
textbooks commonly advocate the AR(1) model for serial correlation as well as parametric
models for heteroskedasticity corrections. There is no reason not to follow a similar approach
for GMM statistics.
5) Use the null to limit correlations?
In the typical asset pricing setup, the null hypothesis specifies that E_t(u_{t+1}) = E_t(m_{t+1} R_{t+1} −
1) = 0, as well as E(u_{t+1}) = 0. This implies that all the autocorrelation terms of S drop
out; E(u_t u'_{t−j}) = 0 for j ≠ 0. The lagged u could be an instrument z; the discounted return
should be unforecastable, using past discounted returns as well as any other variable. In this
situation, one could exploit the null to include only one term, and estimate

Ŝ = (1/T) Σ_{t=1}^{T} u_t u'_t.

Similarly, if one runs a regression forecasting returns from some variable z_t,

R_{t+1} = a + b z_t + ε_{t+1},

the null hypothesis that returns are not forecastable by any variable at time t means that the
errors should not be autocorrelated. One can then simplify the standard errors in the OLS
regression formulas given in section 11.4, eliminating all the leads and lags.
In other situations, the null hypothesis can suggest a functional form for E(u_t u'_{t−j}), or that
some but not all terms are zero. For example, as we saw in section 11.4, regressions of long horizon
returns on overlapping data lead to a correlated error term, even under the null hypothesis
of no return forecastability. We can impose this null, ruling out terms past the overlap, as
suggested by Hansen and Hodrick,

var(b_T) = (1/T) E(x_t x'_t)^{-1} [Σ_{j=−k}^{k} E(e_t x_t x'_{t−j} e_{t−j})] E(x_t x'_t)^{-1}. (162)

However, the null might not be correct, and the errors might be correlated. If so, you might
make a mistake by leaving them out. If the null is correct, the extra terms will converge to zero
and you will only have lost a few (finite-sample) degrees of freedom needlessly estimating
them. If the null is not correct, you have an inconsistent estimate. With this in mind, you
might want to include at least a few extra autocorrelations, even when the null says they don't
belong.
Furthermore, there is no guarantee that the unweighted sum in (11.162) is positive definite
in sample. If the sum in the middle is not positive definite, you could add a weighting to
the sum, possibly increasing the number of lags so that the lags near k are not unusually
underweighted. Again, estimating extra lags that should be zero under the null only loses a
little bit of power.
Monte Carlo evidence (Hodrick 1992) suggests that imposing the null hypothesis to sim-
plify the spectral density matrix helps to get the finite-sample size of test statistics right (the
probability of rejection given that the null is true). One should not be surprised that if the null is
true, imposing as much of it as possible makes estimates and tests work better. On the other
hand, adding extra correlations can help with the power of test statistics (the probability of
rejection given that an alternative is true), since they converge to the correct spectral density
matrix under the alternative.
This trade-off requires some thought. For measurement rather than pure testing, using
a spectral density matrix that can accommodate alternatives may be the right choice. For
example, in the return forecasting regressions, one is really focused on measuring return
forecastability rather than just formally testing the hypothesis that it is zero. On the other
hand, the small-sample performance of the nonparametric estimators with many lags is not
very good.
If you are testing an asset pricing model that predicts u should not be autocorrelated, and
there is a lot of correlation (if this issue makes a big difference), then this is an indication that
something is wrong with the model; including u as one of your instruments z would result
in a rejection or at least substantially change the results. If the u are close to uncorrelated,
then it really doesn't matter if you add a few extra terms or not.


6) Size problems; consider a factor or other parametric cross-sectional structure.
If you try to estimate a covariance matrix that is larger than the number of data points (say,
2000 NYSE stocks and 800 monthly observations), the estimate of S, like any other covari-
ance matrix, is singular by construction. This fact leads to obvious problems when you try to
invert S! More generally, when the number of moments is more than around 1/10 the number
of data points, S estimates tend to become unstable and near-singular. Used as a weighting
matrix, such an S matrix tells you to pay lots of attention to strange and probably spurious
linear combinations of the moments, as I emphasized in section 11.5. For this reason, most
second-stage GMM estimations are limited to a few assets and a few instruments.
A good, but as yet untried, alternative might be to impose a factor structure or other well-
behaved structure on the covariance matrix. The near-universal practice of grouping assets
into portfolios before analysis already implies an assumption that the true S of the underlying
assets has a factor structure. Grouping in portfolios means that the individual assets have no
information not contained in the portfolio, so that a weighting matrix S^{-1} would treat all
assets in the portfolio identically. It might be better to estimate an S imposing a factor
structure on all the primitive assets.
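Since the text describes this as an untried alternative, the following is only an illustrative sketch of the idea (names hypothetical; no lag terms): regress each error on a few factors, and replace the free covariance matrix with loadings times factor covariance plus a diagonal of idiosyncratic variances.

```python
import numpy as np

def factor_structured_cov(u, F):
    """Impose a factor structure on the error covariance:
    Sigma = B Cov(F) B' + diag(idiosyncratic variances),
    where B holds regression loadings of each error on the factors F.

    u : (T, N) errors; F : (T, K) factors; returns (N, N), K << N.
    """
    T = len(u)
    Fc = F - F.mean(axis=0)
    uc = u - u.mean(axis=0)
    B = np.linalg.lstsq(Fc, uc, rcond=None)[0].T  # (N, K) loadings
    resid = uc - Fc @ B.T
    Sf = Fc.T @ Fc / T                            # factor covariance
    D = np.diag(resid.var(axis=0))                # idiosyncratic variances
    return B @ Sf @ B.T + D
```

The payoff is that only N × K loadings and N idiosyncratic variances are estimated instead of N(N+1)/2 free covariances, and the result is positive definite and invertible even when N is close to T.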
Another response to the difficulty of estimating S is to stop at first stage estimates, and
only use S for standard errors. One might also use a highly structured estimate of S as the
weighting matrix, while using a less constrained estimate for the standard errors.
This problem is of course not unique to GMM. Any estimation technique requires us to
calculate a covariance matrix. Many traditional estimates simply assume that the u_t errors are
cross-sectionally independent. This false assumption leads to understatements of the standard
errors far worse than the small sample performance of any GMM estimate.
Our econometric techniques are all designed for large time series and small cross-sections.
Our data have a large cross-section and short time series. A large unsolved problem in finance
is the development of appropriate large-N small-T tools for evaluating asset pricing models.
7) Alternatives to the two-stage procedure: iteration and one-step.
Hansen and Singleton (1982) describe the above two-step procedure, and it has become
popular for that reason. Two alternative procedures may perform better in practice, i.e. may
result in asymptotically equivalent estimates with better small-sample properties. They can
also be simpler to implement, and require less manual adjustment or care in specifying the
setup (moments, weighting matrices), which is often just as important.
a) Iterate. The second stage estimate b̂₂ will not imply the same spectral density as the
first stage. It might seem appropriate that the estimate of b and of the spectral density should
be consistent, i.e. to find a fixed point of b̂ = min_{b} [g_T(b)' S(b̂)^{-1} g_T(b)]. One way to search
for such a fixed point is to iterate: find b̂₂ from

b̂₂ = min_{b} g_T(b)' S^{-1}(b̂₁) g_T(b) (163)

where b̂₁ is a first stage estimate, held fixed in the minimization over b. Then use b̂₂ to find
S(b̂₂), find

b̂₃ = min_{b} [g_T(b)' S(b̂₂)^{-1} g_T(b)],

and so on. There is no fixed point theorem guaranteeing that such iterations will converge, but they often
do, especially with a little massaging. (I once used S[(b̂_j + b̂_{j−1})/2] in the beginning part of
an iteration to keep it from oscillating between two values of b.) Ferson and Foerster (1994)
find that iteration gives better small sample performance than two-stage GMM in Monte
Carlo experiments. This procedure is also likely to produce estimates that do not depend on
the initial weighting matrix.
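The iteration loop can be sketched in a few lines (all names hypothetical): alternate between forming S at the current b and re-minimizing the GMM objective, until b stops changing.

```python
import numpy as np

def iterated_gmm(S_of_b, argmin_given_W, b0, tol=1e-10, max_iter=100):
    """Iterate b and S to a fixed point: given b_j, set W = S(b_j)^{-1},
    then b_{j+1} = argmin_b g_T(b)' W g_T(b). S_of_b and argmin_given_W
    are user-supplied callables for the model at hand."""
    b = np.asarray(b0, dtype=float)
    for _ in range(max_iter):
        W = np.linalg.inv(S_of_b(b))
        b_new = argmin_given_W(W)
        if np.max(np.abs(b_new - b)) < tol:
            return b_new
        b = b_new
    return b

# Linear illustration: g_T(b) = G b - m, so the inner argmin has the
# GLS closed form; with a constant S the loop converges in one pass.
G = np.array([[1.0, 0.0], [0.0, 2.0], [1.0, 1.0]])
m = np.array([1.0, 2.0, 2.0])
S = np.diag([1.0, 2.0, 3.0])
b_hat = iterated_gmm(lambda b: S,
                     lambda W: np.linalg.solve(G.T @ W @ G, G.T @ W @ m),
                     np.zeros(2))
```

In a real application S_of_b would rebuild the spectral density estimate (e.g. Newey-West) from the errors u_t(b), and argmin_given_W would be a numerical minimizer for nonlinear models.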
b) Pick b and S simultaneously. It is not true that S must be held fixed as one searches
for b. Instead, one can use a new S(b) for each value of b. Explicitly, one can estimate b by

min_{b} [g_T(b)' S^{-1}(b) g_T(b)]. (164)

The estimates produced by this simultaneous search will not be numerically the same in
a finite sample as the two-step or iterated estimates. The first order conditions to (11.163) are

(∂g_T(b)/∂b)' S^{-1}(b̂₁) g_T(b) = 0, (165)

while the first order conditions in (11.164) add a term involving the derivatives of S(b) with
respect to b. However, the latter terms vanish asymptotically, so the asymptotic distribution
theory is not affected. Hansen, Heaton and Yaron (1996) conduct some Monte Carlo experi-
ments and find that this estimate may have small-sample advantages in certain problems. A
problem is that the one-step minimization may find regions of the parameter space that blow
up the spectral density matrix S(b) rather than lower the pricing errors g_T.
Often, one choice will be much more convenient than another. For linear models, one
can find the minimizing value of b from the first order conditions (11.165) analytically. This
fact eliminates the need to search, so even an iterated estimate is much faster. For nonlinear
models, each step involves a numerical search over g_T(b)' S^{-1} g_T(b). Rather than perform this
search many times, it may be much quicker to minimize once over g_T(b)' S^{-1}(b) g_T(b). On
the other hand, the latter is not a locally quadratic form, so the search may run into greater
numerical difficulties.
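The continuous-updating objective (11.164) is easy to write down; a sketch (names hypothetical, S estimated here with no lags and demeaned errors):

```python
import numpy as np

def cue_objective(b, u_of_b):
    """Continuous-updating GMM objective g_T(b)' S(b)^{-1} g_T(b):
    S is re-estimated at every candidate b, rather than held fixed.

    u_of_b(b) returns the (T, N) matrix of errors u_t(b)."""
    u = u_of_b(b)
    g = u.mean(axis=0)                # sample moments g_T(b)
    uc = u - g                        # demean before forming S
    S = uc.T @ uc / len(u)            # no-lag spectral density estimate
    return g @ np.linalg.solve(S, g)
```

Passing this function to any numerical minimizer gives the one-step estimate; note that because S(b) sits inside the objective, a minimizer can lower the criterion by inflating S(b) instead of shrinking g_T, which is exactly the pathology the text warns about.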

11.8 Problems

1. Use the delta method version of the GMM formulas to derive the sampling variance of
an autocorrelation coefficient.
2. Write a formula for the standard error of OLS regression coefficients that corrects for
autocorrelation but not heteroskedasticity.
3. Write a formula for the standard error of OLS regression coefficients if E(e_t e_{t−j}) =
ρ^j σ².
4. If the GMM errors come from an asset pricing model, u_t = m_t R_t − 1, can you ignore
lags in the spectral density matrix? What if you know that returns are predictable? What
if the error is formed from an instrument/managed portfolio u_t z_{t−1}?

Chapter 12. Regression-based tests of
linear factor models
This and the next three chapters study the question, how should we estimate and evaluate
linear factor models: models of the form p = E(mx), m = b'f, or equivalently E(R^e) =
βλ? These models are by far the most common in empirical asset pricing, and there is a large
literature on econometric techniques to estimate and evaluate them. Each technique focuses
on the same questions: how to estimate parameters, how to calculate standard errors of the
estimated parameters, how to calculate standard errors of the pricing errors, and how to test
the model, usually with a test statistic of the form α̂' V^{-1} α̂.
I start with simple and longstanding time-series and cross-sectional regression tests. Then,
I pursue the GMM approach to the model expressed in p = E(mx), m = b'f form. The follow-
ing chapter summarizes the principle of maximum likelihood estimation and derives maxi-
mum likelihood estimates and tests. Finally, a chapter compares the different approaches.
As always, the theme is the underlying unity. All of the techniques come down to one of
two basic ideas: time-series regression or cross-sectional regression. The GMM, p = E(mx)
approach turns out to be almost identical to cross-sectional regressions. Maximum likelihood
(with appropriate statistical assumptions) justifies the time-series and cross-sectional regres-
sion approaches. The formulas for parameter estimates, standard errors, and test statistics are
all strikingly similar.

12.1 Time-series regressions

When the factor is also a return, we can evaluate the model

E(R^{ei}) = β_i E(f)

by running OLS time series regressions

R^{ei}_t = α_i + β_i f_t + ε^i_t;  t = 1, 2, ...T

for each asset. The OLS distribution formulas (with corrected standard errors) provide stan-
dard errors of α and β.
With errors that are i.i.d. over time, homoskedastic and independent of the factors, the
asymptotic joint distribution of the intercepts gives the model test statistic,

T [1 + (E_T(f)/σ̂(f))²]^{-1} α̂' Σ̂^{-1} α̂ ∼ χ²_N.

The Gibbons-Ross-Shanken test is a multivariate, finite sample counterpart to this statistic,
when the errors are also normally distributed,

((T − N − K)/N) [1 + E_T(f)' Ω̂^{-1} E_T(f)]^{-1} α̂' Σ̂^{-1} α̂ ∼ F_{N,T−N−K}.

I show how to construct the same test statistics with heteroskedastic and autocorrelated errors
via GMM.

I start with the simplest case. We have a factor pricing model with a single factor. The
factor is an excess return (for example, the CAPM, with R^{em} = R^m − R^f), and the test
assets are all excess returns. We express the model in expected return - beta form. The betas
are defined by regression coefficients

R^{ei}_t = α_i + β_i f_t + ε^i_t (166)

and the model states that expected returns are linear in the betas:

E(R^{ei}) = β_i E(f). (167)

Since the factor is also an excess return, the model applies to the factor as well, so E(f) =
1 × λ.
Comparing the model (12.167) and the expectation of the time series regression (12.166),
we see that the model has one and only one implication for the data: all the regression
intercepts α_i should be zero. The regression intercepts are equal to the pricing errors.
Given this fact, Black, Jensen and Scholes (1972) suggested a natural strategy for estima-
tion and evaluation: Run time-series regressions (12.166) for each test asset. The estimate of
the factor risk premium is just the sample mean of the factor,

λ̂ = E_T(f).

Then, use standard OLS formulas for a distribution theory of the parameters. In particular,
you can use t-tests to check whether the pricing errors α are in fact zero. These distributions
are usually presented for the case that the regression errors in (12.166) are uncorrelated and
homoskedastic, but the formulas in section 11.4 show easily how to calculate standard errors
for arbitrary error covariance structures.
We also want to know whether all the pricing errors are jointly equal to zero. This re-
quires us to go beyond standard formulas for the regression (12.166) taken alone, as we want
to know the joint distribution of α estimates from separate regressions running side by side
but with errors correlated across assets (E(ε^i_t ε^j_t) ≠ 0). (We can think of (12.166) as a panel
regression, and then it's a test whether the firm dummies are jointly zero.) The classic form
of these tests assumes no autocorrelation or heteroskedasticity, but allows the errors to be cor-
related across assets. Dividing the α̂ regression coefficients by their variance-covariance
matrix leads to a χ² test,

T [1 + (E_T(f)/σ̂(f))²]^{-1} α̂' Σ̂^{-1} α̂ ∼ χ²_N (168)

where E_T(f) denotes the sample mean, σ̂²(f) denotes the sample variance, α̂ is a vector of the
estimated intercepts,

α̂ = [α̂_1 α̂_2 ... α̂_N]'

and Σ̂ is the residual covariance matrix, i.e. the sample estimate of E(ε_t ε'_t) = Σ, where

ε_t = [ε^1_t ε^2_t ··· ε^N_t]'.

As usual when testing hypotheses about regression coefficients, this test is valid asymp-
totically. The asymptotic distribution theory assumes that σ̂²(f) (i.e. X'X) and Σ̂ have
converged to their probability limits; therefore it is asymptotically valid even though the fac-
tor is stochastic and Σ̂ is estimated, but it ignores those sources of variation in a finite sample.
It does not require that the errors are normal, relying on the central limit theorem so that α̂ is
normal. I derive (12.168) below.
Also as usual in a regression context, we can derive a finite-sample F distribution for the
hypothesis that a set of parameters are jointly zero, for fixed values of the right hand variable
f_t,

((T − N − 1)/N) [1 + (E_T(f)/σ̂(f))²]^{-1} α̂' Σ̂^{-1} α̂ ∼ F_{N,T−N−1}. (169)

This is the Gibbons, Ross and Shanken (1989) or "GRS" test statistic. The F distribution
recognizes sampling variation in Σ̂, which is not included in (12.168). This distribution
requires that the errors ε are normal as well as uncorrelated and homoskedastic. With normal
errors, the α̂ are normal and Σ̂ is an independent Wishart (the multivariate version of a χ²),
so the ratio is F. This distribution is exact in a finite sample.
Tests (12.168) and (12.169) have a very intuitive form. The basic part of the test is a
quadratic form in the pricing errors, α̂' Σ̂^{-1} α̂. If there were no βf in the model, then the α̂
would simply be the sample mean of the regression errors ε_t. Assuming i.i.d. ε_t, the variance
of their sample mean is just (1/T)Σ. Thus, if we knew Σ, then T α̂' Σ^{-1} α̂ would be a sum
of squared sample means divided by their variance-covariance matrix, which would have an
asymptotic χ² distribution, or a finite sample χ² distribution if the ε_t are normal. But we
have to estimate Σ, which is why the finite-sample distribution is F rather than χ². We also
estimate the β, and the second term in (12.168) and (12.169) accounts for that fact.
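The single-factor GRS statistic (12.169) can be computed in a few lines (a sketch; the function name is hypothetical, and sample moments are used for the factor mean and volatility):

```python
import numpy as np

def grs_stat(Re, f):
    """GRS F-statistic for one excess-return factor:
    (T-N-1)/N * [1 + (E_T(f)/sigma(f))^2]^{-1} alpha' Sigma^{-1} alpha,
    distributed F(N, T-N-1) under normal i.i.d. errors.

    Re : (T, N) test-asset excess returns; f : (T,) factor excess return.
    """
    T, N = Re.shape
    X = np.column_stack([np.ones(T), f])
    coef, *_ = np.linalg.lstsq(X, Re, rcond=None)  # rows: alpha, beta
    alpha = coef[0]
    resid = Re - X @ coef
    Sigma = resid.T @ resid / T                    # residual covariance
    f_bar, f_sig = f.mean(), f.std()               # sample factor moments
    scale = (T - N - 1) / N / (1 + (f_bar / f_sig) ** 2)
    return scale * (alpha @ np.linalg.solve(Sigma, alpha))
```

The statistic grows with the quadratic form in the estimated alphas, so data generated with large pricing errors produce a larger value than data generated with alphas of zero.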
Recall that a single beta representation exists if and only if the reference return is on
the mean-variance frontier. Thus, the test can also be interpreted as a test whether f is ex-
ante mean-variance efficient (whether it is on the mean-variance frontier using population
moments) after accounting for sampling error. Even if f is on the true or ex-ante mean-
variance frontier, other returns will outperform it in sample due to luck, so the return f will
usually be inside the ex-post mean-variance frontier, i.e. the frontier drawn using sample
moments. Still, it should not be too far inside the sample frontier. Gibbons, Ross and Shanken
show that the test statistic can be expressed in terms of how far inside the ex-post frontier the
return f is,

((T − N − 1)/N) [(μ_q/σ_q)² − (E_T(f)/σ̂(f))²] / [1 + (E_T(f)/σ̂(f))²],

where (μ_q/σ_q)² is the Sharpe ratio of the ex-post tangency portfolio (maximum ex-post Sharpe ratio)
formed from the test assets plus the factor f.
If there are many factors that are excess returns, the same ideas work, with some cost of
algebraic complexity. The regression equation is

R^{ei}_t = α_i + β'_i f_t + ε^i_t.

The asset pricing model

E(R^{ei}) = β'_i E(f)

again predicts that the intercepts should be zero. We can estimate α and β with OLS time-
series regressions. Assuming normal i.i.d. errors, the quadratic form α̂' Σ̂^{-1} α̂ has the distri-
bution

((T − N − K)/N) [1 + E_T(f)' Ω̂^{-1} E_T(f)]^{-1} α̂' Σ̂^{-1} α̂ ∼ F_{N,T−N−K} (171)

where

N = number of assets
K = number of factors
Ω̂ = (1/T) Σ_{t=1}^{T} [f_t − E_T(f)][f_t − E_T(f)]'.

The main difference is that the Sharpe ratio of the single factor is replaced by the natural
generalization E_T(f)' Ω̂^{-1} E_T(f).

12.1.1 Derivation of the χ² statistic and distributions with general errors.

I derive (12.168) as an instance of GMM. This approach allows us to generate straightfor-
wardly the required corrections for autocorrelated and heteroskedastic disturbances. (MacKin-
lay and Richardson (1991) advocate GMM approaches to regression tests in this way.) It also
serves to remind us that GMM and p = E(mx) are not necessarily paired; one can do a
GMM estimate of an expected return - beta model too. The mechanics are only slightly dif-
ferent than what we did to generate distributions for OLS regression coefficients in section
11.4, since we keep track of N OLS regressions simultaneously.
Write the equations for all N assets together in vector form,

$$
R^e_t=\alpha+\beta f_t+\varepsilon_t.
$$

We use the usual OLS moments to estimate the coefficients,

$$
g_T(b)=\begin{bmatrix} E_T(R^e_t-\alpha-\beta f_t) \\ E_T\left[(R^e_t-\alpha-\beta f_t)f_t\right] \end{bmatrix}
= E_T\left(\begin{bmatrix}\varepsilon_t \\ f_t\varepsilon_t\end{bmatrix}\right)=0.
$$

These moments exactly identify the parameters ±, β, so the a matrix in agT (ˆ = 0 is the
identity matrix. Solving, the GMM estimates are of course the OLS estimates,
± = ET (Rt ) ’ βET (ft )
ET [(Rt ’ ET (Re )) ft ]
e e
covT (Rt , ft )
ˆ t
β= = .
ET [(ft ’ ET (ft )) ft ] varT (ft )

The d matrix in the general GMM formula is

$$
d\equiv\frac{\partial g_T(b)}{\partial b'}
=-\begin{bmatrix} I_N & I_N E(f_t) \\ I_N E(f_t) & I_N E(f_t^2)\end{bmatrix}
=-\begin{bmatrix} 1 & E(f_t) \\ E(f_t) & E(f_t^2)\end{bmatrix}\otimes I_N
$$

where $I_N$ is an N × N identity matrix. The S matrix is

$$
S=\sum_{j=-\infty}^{\infty}\begin{bmatrix}
E(\varepsilon_t\varepsilon_{t-j}') & E(\varepsilon_t\varepsilon_{t-j}' f_{t-j}) \\
E(f_t\varepsilon_t\varepsilon_{t-j}') & E(f_t\varepsilon_t\varepsilon_{t-j}' f_{t-j})
\end{bmatrix}.
$$

Using the GMM variance formula (11.146) with a = I we have

$$
\mathrm{var}\begin{bmatrix}\hat{\alpha}\\ \hat{\beta}\end{bmatrix}
=\frac{1}{T}\,d^{-1}S\,d^{-1\prime}.
\tag{12.172}
$$

At this point, we're done. The upper left hand corner of $\mathrm{var}(\hat{\alpha},\hat{\beta})$ gives us $\mathrm{var}(\hat{\alpha})$, and the
test we're looking for is $\hat{\alpha}'\,\mathrm{var}(\hat{\alpha})^{-1}\hat{\alpha}\sim\chi^2_N$.

The standard formulas make this expression prettier by assuming that the errors are uncorrelated
over time and not heteroskedastic to simplify the S matrix, as we derived the standard
OLS formulas in section 11.4. If we assume that f and ε are independent as well as orthogonal,
$E(f\varepsilon\varepsilon')=E(f)E(\varepsilon\varepsilon')$ and $E(f^2\varepsilon\varepsilon')=E(f^2)E(\varepsilon\varepsilon')$. If we assume that the errors
are independent over time as well, we lose all the lead and lag terms. Then, the S matrix


simplifies to

$$
S=\begin{bmatrix} E(\varepsilon_t\varepsilon_t') & E(\varepsilon_t\varepsilon_t')E(f_t) \\ E(f_t)E(\varepsilon_t\varepsilon_t') & E(\varepsilon_t\varepsilon_t')E(f_t^2)\end{bmatrix}
=\begin{bmatrix} 1 & E(f_t) \\ E(f_t) & E(f_t^2)\end{bmatrix}\otimes\Sigma.
$$

Now we can plug into (12.172). Using $(A\otimes B)^{-1}=A^{-1}\otimes B^{-1}$ and $(A\otimes B)(C\otimes D)=AC\otimes BD$, we obtain

$$
\mathrm{var}\begin{bmatrix}\hat{\alpha}\\ \hat{\beta}\end{bmatrix}
=\frac{1}{T}\left(\begin{bmatrix} 1 & E(f_t) \\ E(f_t) & E(f_t^2)\end{bmatrix}^{-1}\otimes\Sigma\right).
$$

Evaluating the inverse,

$$
\mathrm{var}\begin{bmatrix}\hat{\alpha}\\ \hat{\beta}\end{bmatrix}
=\frac{1}{T}\frac{1}{\mathrm{var}(f)}\begin{bmatrix} E(f_t^2) & -E(f_t) \\ -E(f_t) & 1\end{bmatrix}\otimes\Sigma.
$$

We're interested in the top left corner. Using $E(f^2)=E(f)^2+\mathrm{var}(f)$,

$$
\mathrm{var}(\hat{\alpha})=\frac{1}{T}\left(1+\frac{E(f)^2}{\mathrm{var}(f)}\right)\Sigma.
$$

This is the traditional formula (12.168), but there is now no real reason to assume that the
errors are i.i.d. or independent of the factors. By simply calculating (12.172), we can easily
construct standard errors and test statistics that do not require these assumptions.
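For instance, a heteroskedasticity-robust version of (12.172) can be computed directly. The sketch below handles the single-factor case on hypothetical input arrays; the S matrix here has no lead or lag terms, so serially correlated errors would still require adding Newey-West terms to S.

```python
import numpy as np

def robust_alpha_cov(excess_returns, factor):
    """Heteroskedasticity-robust covariance of the N intercepts via
    var(alpha, beta) = d^{-1} S d^{-1}' / T, single-factor case.
    S is estimated with no lead/lag terms."""
    T, N = excess_returns.shape
    f = np.asarray(factor).ravel()
    X = np.column_stack([np.ones(T), f])
    coef, *_ = np.linalg.lstsq(X, excess_returns, rcond=None)
    eps = excess_returns - X @ coef                      # T x N residuals
    Ef, Ef2 = f.mean(), (f ** 2).mean()
    d = -np.kron(np.array([[1.0, Ef], [Ef, Ef2]]), np.eye(N))
    u = np.hstack([eps, eps * f[:, None]])               # moment series, T x 2N
    S = u.T @ u / T
    dinv = np.linalg.inv(d)
    V = dinv @ S @ dinv.T / T                            # cov of (alpha, beta)
    return V[:N, :N]                                     # top-left: cov(alpha-hat)
```

Under i.i.d. errors independent of the factor, this converges to the classic $(1+E(f)^2/\mathrm{var}(f))\,\Sigma/T$ above.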

12.2 Cross-sectional regressions

We can fit

$$
E(R^{ei})=\beta_i'\lambda+\alpha_i
$$

by running a cross-sectional regression of average returns on the betas. This technique can
be used whether the factor is a return or not.
I discuss OLS and GLS cross-sectional regressions, find formulas for the standard errors
of λ, and derive a χ² test of whether the α are jointly zero. I derive the distributions as an instance of
GMM, and I show how to implement the same approach for autocorrelated and heteroskedastic
errors. I show that the GLS cross-sectional regression is the same as the time-series regression
when the factor is also an excess return and is included in the set of test assets.


Figure 26. Cross-sectional regression. [The figure plots average returns $E(R^{ei})$ for assets i against their betas, with a fitted line of slope λ.]

Start again with the K factor model, written as

$$
E(R^{ei})=\beta_i'\lambda;\quad i=1,2,\ldots,N.
$$

The central economic question is why average returns vary across assets; expected returns of
an asset should be high if that asset has high betas or risk exposure to factors that carry high
risk premia.
Figure 26 graphs the case of a single factor such as the CAPM. Each dot represents one
asset i. The model says that average returns should be proportional to betas, so plot the
sample average returns against the betas. Even if the model is true, this plot will not work out
perfectly in each sample, so there will be some spread as shown.

Given these facts, a natural idea is to run a cross-sectional regression to fit a line through
the scatterplot of Figure 26. First find estimates of the betas from a time series regression,

$$
R^{ei}_t=a_i+\beta_i' f_t+\varepsilon^i_t,\quad t=1,2,\ldots,T\ \text{for each}\ i.
\tag{12.174}
$$

Then estimate the factor risk premia λ from a regression across assets of average returns on
the betas,

$$
E_T(R^{ei})=\beta_i'\lambda+\alpha_i,\quad i=1,2,\ldots,N.
\tag{12.175}
$$


As in the figure, β are the right hand variables, λ are the regression coefficients, and the
cross-sectional regression residuals $\alpha_i$ are the pricing errors. This is also known as a two-pass
regression estimate, because one estimates first time-series and then cross-sectional regressions.

You can run the cross-sectional regression with or without a constant. The theory says
that the constant or zero-beta excess return should be zero. You can impose this restriction or
estimate a constant and see if it turns out to be small. The usual tradeoff between efficiency
(impose the null as much as possible to get efficient estimates) and robustness applies.

12.2.1 OLS cross-sectional regression

It will simplify notation to consider a single factor; the case of multiple factors looks the
same with vectors in place of scalars. I denote vectors from 1 to N with missing sub- or
superscripts, i.e. $\varepsilon_t=\begin{bmatrix}\varepsilon^1_t & \varepsilon^2_t & \cdots & \varepsilon^N_t\end{bmatrix}'$, $\beta=\begin{bmatrix}\beta_1 & \beta_2 & \cdots & \beta_N\end{bmatrix}'$, and similarly
for $R^e$ and α. For simplicity take the case of no intercept in the cross-sectional regression.
With this notation the OLS cross-sectional estimates are

$$
\hat{\lambda}=(\beta'\beta)^{-1}\beta' E_T(R^e)
\tag{12.176}
$$
$$
\hat{\alpha}=E_T(R^e)-\hat{\lambda}\beta.
$$

Next, we need a distribution theory for the estimated parameters. The most natural place
to start is with the standard OLS distribution formulas. I start with the traditional assumption
that the true errors are i.i.d. over time, and independent of the factors. This will give us some
easily interpretable formulas, and we will see most of these terms remain when we do the
distribution theory right later on.
In an OLS regression $Y=X\beta+u$ with $E(uu')=\Omega$, the covariance matrix of the estimate $\hat{\beta}$
is $(X'X)^{-1}X'\Omega X(X'X)^{-1}$, and the residual covariance matrix is $(I-X(X'X)^{-1}X')\Omega(I-X(X'X)^{-1}X')'$.

Denote $\Sigma=E(\varepsilon_t\varepsilon_t')$. Since the $\alpha_i$ are just time series averages of the true $\varepsilon^i_t$ shocks
(the average of the sample residuals is always zero), the errors in the cross-sectional regression
have covariance matrix $E(\alpha\alpha')=\frac{1}{T}\Sigma$. Thus the conventional OLS formulas for the

covariance matrix of OLS estimates and residuals with correlated errors give

$$
\sigma^2(\hat{\lambda})=\frac{1}{T}(\beta'\beta)^{-1}\beta'\Sigma\beta(\beta'\beta)^{-1}
\tag{12.177}
$$
$$
\mathrm{cov}(\hat{\alpha})=\frac{1}{T}\left(I-\beta(\beta'\beta)^{-1}\beta'\right)\Sigma\left(I-\beta(\beta'\beta)^{-1}\beta'\right)'.
\tag{12.178}
$$

We could test whether all pricing errors are zero with the statistic

$$
\hat{\alpha}'\,\mathrm{cov}(\hat{\alpha})^{-1}\hat{\alpha}\sim\chi^2_{N-1}.
\tag{12.179}
$$

The distribution is $\chi^2_{N-1}$, not $\chi^2_N$, because the covariance matrix is singular. The singularity
and the extra terms in (12.178) result from the fact that the λ coefficient was estimated along
the way, and mean that we have to use a generalized inverse. (If there are K factors, we
obviously end up with $\chi^2_{N-K}$.)
A test of the residuals is unusual in OLS regressions. We do not usually test whether the
residuals are "too large," since we have no information other than the residuals themselves
about how large they should be. In this case, however, the first stage time-series regression
gives us some independent information about the size of $E(\alpha\alpha')$, information that we could
not get from looking at the cross-sectional residuals α themselves.
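The two-pass OLS estimate (12.176) and the test (12.179) can be sketched as follows. The input arrays are hypothetical; the covariance matrix is the i.i.d.-error formula (12.178), without the correction for estimated betas derived later, and the singular matrix is handled with a Moore-Penrose pseudo-inverse.

```python
import numpy as np

def ols_cross_section(excess_returns, factor):
    """Two-pass estimate: time-series betas, then an OLS cross-sectional
    regression of mean returns on betas (no intercept). Returns lambda-hat,
    the pricing errors alpha-hat, and the chi^2_{N-1} statistic
    alpha' cov(alpha)^+ alpha using a generalized inverse."""
    T, N = excess_returns.shape
    f = np.asarray(factor).ravel()
    X = np.column_stack([np.ones(T), f])
    coef, *_ = np.linalg.lstsq(X, excess_returns, rcond=None)
    b = coef[1]                                   # N estimated betas
    eps = excess_returns - X @ coef
    Sigma = eps.T @ eps / T
    mean_r = excess_returns.mean(axis=0)
    lam = (b @ mean_r) / (b @ b)                  # (beta'beta)^{-1} beta' E_T(R^e)
    alpha = mean_r - b * lam
    M = np.eye(N) - np.outer(b, b) / (b @ b)      # I - beta(beta'beta)^{-1}beta'
    cov_alpha = M @ Sigma @ M.T / T               # i.i.d.-error formula (12.178)
    stat = alpha @ np.linalg.pinv(cov_alpha) @ alpha
    return lam, alpha, stat
```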

12.2.2 GLS cross-sectional regression

Since the residuals in the cross-sectional regression (12.175) are correlated with each other,
standard textbook advice is to run a GLS cross-sectional regression rather than OLS, using
$E(\alpha\alpha')=\frac{1}{T}\Sigma$ as the error covariance matrix:

$$
\hat{\lambda}=(\beta'\Sigma^{-1}\beta)^{-1}\beta'\Sigma^{-1}E_T(R^e)
\tag{12.180}
$$
$$
\hat{\alpha}=E_T(R^e)-\hat{\lambda}\beta.
$$

The standard regression formulas give the variance of these estimates as

$$
\sigma^2(\hat{\lambda})=\frac{1}{T}(\beta'\Sigma^{-1}\beta)^{-1}
\tag{12.181}
$$
$$
\mathrm{cov}(\hat{\alpha})=\frac{1}{T}\left(\Sigma-\beta(\beta'\Sigma^{-1}\beta)^{-1}\beta'\right).
\tag{12.182}
$$

The comments of section 11.5, warning that OLS is sometimes much more robust than
GLS, apply in this case. The GLS regression should improve efficiency, i.e. give more precise
estimates. However, Σ may be hard to estimate and to invert, especially if the cross-section
N is large. One may well choose the robustness of OLS over the asymptotic statistical advantages
of GLS.

A GLS regression can be understood as a transformation of the space of returns, to focus
attention on the statistically most informative portfolios. Finding (say, by Choleski decomposition)
a matrix C such that $CC'=\Sigma^{-1}$, the GLS regression is the same as an OLS regression
of $CE_T(R^e)$ on $C\beta$, i.e. of testing the model on the portfolios $CR^e$. The statistically most
informative portfolios are those with the lowest residual variance Σ. But this asymptotic statistical
theory assumes that the covariance matrix has converged to its true value. In most
samples, the ex-post or sample mean-variance frontier still seems to indicate lots of luck, and
this is especially true if the cross section is large, anything more than 1/10 of the time series.
The portfolios $CR^e$ are likely to contain many extreme long-short positions.
Again, we could test the hypothesis that all the α are equal to zero with (12.179). Though
the appearance of the statistic is the same, the covariance matrix is smaller, reflecting the
greater power of the GLS test. As with the $J_T$ test (11.152), we can develop an equivalent
test that does not require a generalized inverse:

$$
T\,\hat{\alpha}'\Sigma^{-1}\hat{\alpha}\sim\chi^2_{N-1}.
\tag{12.183}
$$

To derive (12.183), I proceed exactly as in the derivation of the $J_T$ test (11.152). Define, say
by Choleski decomposition, a matrix C such that $CC'=\Sigma^{-1}$. Now, find the covariance
matrix of $\sqrt{T}\,C'\hat{\alpha}$:

$$
\mathrm{cov}(\sqrt{T}\,C'\hat{\alpha})
=C'\left(\Sigma-\beta(\beta'\Sigma^{-1}\beta)^{-1}\beta'\right)C
=I-\delta(\delta'\delta)^{-1}\delta'
$$
$$
\delta=C'\beta.
$$

In sum, $\hat{\alpha}$ is asymptotically normal so $\sqrt{T}\,C'\hat{\alpha}$ is asymptotically normal, $\mathrm{cov}(\sqrt{T}\,C'\hat{\alpha})$ is an
idempotent matrix with rank N − 1; therefore $T\,\hat{\alpha}'CC'\hat{\alpha}=T\,\hat{\alpha}'\Sigma^{-1}\hat{\alpha}$ is $\chi^2_{N-1}$.
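A sketch of the GLS estimate (12.180) and the statistic (12.183), again on hypothetical input arrays; it differs from the OLS version only in the weighting by Σ⁻¹, and (12.183) needs no generalized inverse.

```python
import numpy as np

def gls_cross_section(excess_returns, factor):
    """GLS cross-sectional regression (12.180) and the test statistic
    T * alpha' Sigma^{-1} alpha of (12.183)."""
    T, N = excess_returns.shape
    f = np.asarray(factor).ravel()
    X = np.column_stack([np.ones(T), f])
    coef, *_ = np.linalg.lstsq(X, excess_returns, rcond=None)
    b = coef[1]
    eps = excess_returns - X @ coef
    Sigma = eps.T @ eps / T
    Sinv_b = np.linalg.solve(Sigma, b)                    # Sigma^{-1} beta
    mean_r = excess_returns.mean(axis=0)
    lam = (Sinv_b @ mean_r) / (b @ Sinv_b)
    alpha = mean_r - b * lam
    stat = T * alpha @ np.linalg.solve(Sigma, alpha)      # ~ chi^2_{N-1}
    return lam, alpha, stat
```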

12.2.3 Correction for the fact that β are estimated, and GMM formulas that
don't need i.i.d. errors.

In applying standard OLS formulas to a cross-sectional regression, we assume that the right
hand variables β are fixed. The β in the cross-sectional regression are not fixed, of course,
but are estimated in the time series regression. This turns out to matter, even as T → ∞.
In this section, I derive the correct asymptotic standard errors. With the simplifying assumption
that the errors ε are i.i.d. over time and independent of the factors, the result is

$$
\sigma^2(\hat{\lambda}_{OLS})=\frac{1}{T}\left[(\beta'\beta)^{-1}\beta'\Sigma\beta(\beta'\beta)^{-1}\left(1+\lambda'\Sigma_f^{-1}\lambda\right)+\Sigma_f\right]
\tag{12.184}
$$
$$
\sigma^2(\hat{\lambda}_{GLS})=\frac{1}{T}\left[(\beta'\Sigma^{-1}\beta)^{-1}\left(1+\lambda'\Sigma_f^{-1}\lambda\right)+\Sigma_f\right]
$$

where $\Sigma_f$ is the variance-covariance matrix of the factors. This correction is due to Shanken
(1992). Comparing these standard errors to (12.177) and (12.181), we see that there is a
multiplicative correction $\left(1+\lambda'\Sigma_f^{-1}\lambda\right)$ and an additive correction $\Sigma_f$.
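In the single-factor case, the corrections in (12.184) amount to a few scalar operations given the first-pass estimates. The sketch below takes β, Σ, λ, and σ²(f) as given; all inputs are hypothetical.

```python
import numpy as np

def shanken_se(beta, Sigma, lam, sigma2_f, T):
    """Shanken-corrected sampling variances of lambda-hat, (12.184),
    single-factor case: multiplicative (1 + lam^2/sigma2_f) and additive
    sigma2_f corrections. beta: length-N array; Sigma: N x N residual cov."""
    c = 1.0 + lam ** 2 / sigma2_f                 # multiplicative correction
    bb = beta @ beta
    v_ols = (beta @ Sigma @ beta / bb ** 2 * c + sigma2_f) / T
    v_gls = (c / (beta @ np.linalg.solve(Sigma, beta)) + sigma2_f) / T
    return v_ols, v_gls
```

When Σ is a scalar multiple of the identity and all betas are equal, the OLS and GLS variances coincide, as the formulas suggest.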

The asymptotic variance-covariance matrix of the pricing errors is

$$
\mathrm{cov}(\hat{\alpha}_{OLS})=\frac{1}{T}\left(I_N-\beta(\beta'\beta)^{-1}\beta'\right)\Sigma\left(I_N-\beta(\beta'\beta)^{-1}\beta'\right)'\left(1+\lambda'\Sigma_f^{-1}\lambda\right)
\tag{12.185}
$$
$$
\mathrm{cov}(\hat{\alpha}_{GLS})=\frac{1}{T}\left(\Sigma-\beta(\beta'\Sigma^{-1}\beta)^{-1}\beta'\right)\left(1+\lambda'\Sigma_f^{-1}\lambda\right).
\tag{12.186}
$$

Comparing these results to (12.178) and (12.182), we see that the same multiplicative correction applies.

We can form the asymptotic χ² test of the pricing errors by dividing pricing errors by their
variance-covariance matrix, $\hat{\alpha}'\,\mathrm{cov}(\hat{\alpha})^{-1}\hat{\alpha}$. Following (12.183), we can simplify this result for
the GLS pricing errors, resulting in

$$
T\left(1+\lambda'\Sigma_f^{-1}\lambda\right)^{-1}\hat{\alpha}_{GLS}'\,\Sigma^{-1}\hat{\alpha}_{GLS}\sim\chi^2_{N-K}.
\tag{12.187}
$$

Are the corrections important relative to the simple OLS formulas given above? In the
CAPM, $\lambda=E(R^{em})$, so $\lambda^2/\sigma^2(R^{em})\approx(0.08/0.16)^2=0.25$ in annual data. In annual data,
then, the multiplicative term is too large to ignore. However, the mean and variance both
scale with horizon, so the Sharpe ratio scales with the square root of horizon. Therefore,
for a monthly interval $\lambda^2/\sigma^2(R^{em})\approx 0.25/12\approx 0.02$, which is quite small, and ignoring the
multiplicative term makes little difference.
The additive term in the standard error of λ can be very important. Consider a one factor
model, suppose all the β are 1.0, all the residuals are uncorrelated so Σ is diagonal, suppose
all assets have the same residual variance $\sigma^2(\varepsilon)$, and ignore the multiplicative term. Now
we can write either covariance matrix in (12.184) as

$$
\sigma^2(\hat{\lambda})=\frac{1}{T}\left[\frac{1}{N}\sigma^2(\varepsilon)+\sigma^2(f)\right].
$$

Even with N = 1, most factor models have fairly high R², so $\sigma^2(\varepsilon)<\sigma^2(f)$. Typical
CAPM values of $R^2=1-\sigma^2(\varepsilon)/\sigma^2(f)$ for large portfolios are 0.6-0.7, and multifactor
models such as the Fama-French 3 factor model often have R² over 0.9. Typical numbers of
assets, N = 10 to 50, make the first term vanish compared to the second term.
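To see the orders of magnitude, plug in some illustrative numbers: an annual market volatility of 16%, an R² of 0.65, and N = 25 portfolios. These inputs are assumptions for the sketch, not numbers from the text's data.

```python
# Illustrative magnitudes only; none of these numbers come from data.
sigma2_f = 0.16 ** 2              # factor (market) variance, annual
sigma2_eps = 0.35 * sigma2_f      # residual variance implied by R^2 = 0.65
N, T = 25, 50                     # assets, years

residual_term = sigma2_eps / N    # shrinks with the number of assets
factor_term = sigma2_f            # does not shrink with N
var_lam = (residual_term + factor_term) / T
```

Here the residual term is 0.35/25 ≈ 1.4% of the factor term, so the Σ_f piece dominates σ²(λ̂), as claimed.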
More generally, suppose the factor were in fact a return. Then the factor risk premium is
$\lambda=E(f)$, and we'd use $\Sigma_f/T$ as the variance of $\hat{\lambda}$. This is the "correction" term in
(12.184), so we expect it to be, in fact, the most important term.

Note that $\Sigma_f/T$ is the sampling variance of the mean of f. Thus, in the case that the factor
is a return, so $E(f)=\lambda$, this is the only term you would use.

This example suggests that $\Sigma_f$ is not just an important correction; it is likely to be the
dominant consideration in the sampling error of $\hat{\lambda}$.
Comparing (12.187) to the GRS tests for a time-series regression, (12.168), (12.169), and
(12.171), we see the same statistic. The only difference is that by estimating λ from the
cross-section rather than imposing λ = E(f), the cross-sectional regression loses degrees of
freedom equal to the number of factors.
Though these formulas are standard classics, I emphasize that we don't have to make the
severe assumptions on the error terms that are used to derive them. As with the time-series
case, I derive a general formula for the distribution of λ and α, and only at the last moment
make classic error term assumptions to make the spectral density matrix pretty.

Derivation and formulas that don't require i.i.d. errors.


The easy and elegant way to account for the effects of "generated regressors" such as
the β in the cross-sectional regression is to map the whole thing into GMM. Then, we treat
the moments that generate the regressors β at the same time as the moments that generate
the cross-sectional regression coefficient λ, and the covariance matrix S between the two
sets of moments captures the effects of generating the regressors on the standard error of the
cross-sectional regression coefficients. Comparing this straightforward derivation with the
difficulty of Shanken's (1992) paper that originally derived the corrections for λ, and noting
that Shanken did not go on to find the formulas (12.185) that allow a test of the pricing errors,
is a nice argument for the simplicity and power of the GMM framework.
To keep the algebra manageable, I treat the case of a single factor. The moments are

$$
g_T(b)=\begin{bmatrix} E(R^e_t-a-\beta f_t) \\ E\left[(R^e_t-a-\beta f_t)f_t\right] \\ E(R^e-\beta\lambda)\end{bmatrix}
=\begin{bmatrix}0\\0\\0\end{bmatrix}
\tag{12.188}
$$

The top two moment conditions exactly identify a and β as the time-series OLS estimates.
(Note a, not α. The time-series intercept is not necessarily equal to the pricing error in a
cross-sectional regression.) The bottom moment condition is the asset pricing model. It is in
general overidentified in a sample, since there is only one extra parameter (λ) and N extra
moment conditions. If we use a weighting vector $\beta'$ on this condition, we obtain the OLS
cross-sectional estimate of λ. If we use a weighting vector $\beta'\Sigma^{-1}$, we obtain the GLS cross-sectional
estimate of λ. To accommodate both cases, use a weighting vector $\gamma'$, and then
substitute $\gamma'=\beta'$, $\gamma'=\beta'\Sigma^{-1}$, etc. at the end.
The standard errors for λ come straight from the general GMM standard error formula
(11.146). The α are not parameters, but are the last N moments. Their covariance matrix is
thus given by the GMM formula (11.147) for the sample variation of the $g_T$.

All we have to do is map the problem into the GMM notation. The parameter vector is

$$
b'=\begin{bmatrix} a' & \beta' & \lambda \end{bmatrix}.
$$

The a matrix chooses which moment conditions are set to zero in estimation,

$$
a=\begin{bmatrix} I_{2N} & 0 \\ 0 & \gamma' \end{bmatrix}.
$$

The d matrix is the sensitivity of the moment conditions to the parameters,

$$
d=\begin{bmatrix} -I_N & -I_N E(f) & 0 \\ -I_N E(f) & -I_N E(f^2) & 0 \\ 0 & -\lambda I_N & -\beta \end{bmatrix}.
$$


The S matrix is the long-run covariance matrix of the moments,

$$
S=\sum_{j=-\infty}^{\infty}E\left(\begin{bmatrix} R^e_t-a-\beta f_t \\ (R^e_t-a-\beta f_t)f_t \\ R^e_t-\beta\lambda \end{bmatrix}\begin{bmatrix} R^e_{t-j}-a-\beta f_{t-j} \\ (R^e_{t-j}-a-\beta f_{t-j})f_{t-j} \\ R^e_{t-j}-\beta\lambda \end{bmatrix}'\right)
$$
$$
=\sum_{j=-\infty}^{\infty}E\left(\begin{bmatrix} \varepsilon_t \\ \varepsilon_t f_t \\ \beta(f_t-Ef)+\varepsilon_t \end{bmatrix}\begin{bmatrix} \varepsilon_{t-j} \\ \varepsilon_{t-j} f_{t-j} \\ \beta(f_{t-j}-Ef)+\varepsilon_{t-j} \end{bmatrix}'\right).
$$

In the second expression, I have used the regression model and the restriction under the null
that $E(R^e_t)=\beta\lambda$. In calculations, of course, you could simply estimate the first expression.

We are done. We have the ingredients to calculate the GMM standard error formula
(11.146) and the formula for the covariance of the moments (11.147).

We can recover the classic formulas (12.184), (12.185), (12.186) by adding the assumption
that the errors are i.i.d. and independent of the factors, and that the factors are uncorrelated
over time as well. The assumption that the errors and factors are uncorrelated over time
means we can ignore the lead and lag terms. Thus, the top left corner is $E(\varepsilon_t\varepsilon_t')=\Sigma$. The
assumption that the errors are independent of the factors $f_t$ simplifies the terms in which
$\varepsilon_t$ and $f_t$ are multiplied: $E(\varepsilon_t(\varepsilon_t' f_t))=E(f)\Sigma$, for example. The result is

$$
S=\begin{bmatrix} \Sigma & E(f)\Sigma & \Sigma \\ E(f)\Sigma & E(f^2)\Sigma & E(f)\Sigma \\ \Sigma & E(f)\Sigma & \beta\beta'\sigma^2(f)+\Sigma \end{bmatrix}.
$$

Multiplying a, d, and S together as specified by the GMM formula for the covariance matrix
of parameters (11.146), we obtain the covariance matrix of all the parameters, and its (3,3)
element gives the variance of $\hat{\lambda}$. Multiplying the terms together as specified by (11.147), we
obtain the sampling distribution of the α, (12.185). The formulas (12.184) reported above
are derived the same way with a vector of factors $f_t$ rather than a scalar; the second moment
condition in (12.188) then reads $E\left[(R_t-a-\beta f_t)\otimes f_t\right]$. The matrix multiplication is not
particularly enlightening.
Once again, there is really no need to make the assumption that the errors are i.i.d. and
especially that they are conditionally homoskedastic, i.e. that the factor f and errors ε are independent.
It is quite easy to estimate an S matrix that does not impose these conditions
and calculate standard errors. They will not have the pretty analytic form given above, but
they will more closely report the true sampling uncertainty of the estimate. Furthermore, if
one is really interested in efficiency, the GLS cross-sectional estimate should use the spectral
density matrix as weighting matrix rather than $\Sigma^{-1}$.
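The mapping can be sketched numerically. The function below (single factor, OLS weighting γ′ = β′, hypothetical input arrays) builds a, d, and an S with no lag terms, then applies the GMM covariance formula (11.146) to get a standard error for λ̂ that accounts for the estimated betas without i.i.d. assumptions.

```python
import numpy as np

def gmm_cross_section(excess_returns, factor):
    """GMM standard error for lambda-hat in the two-pass regression,
    treating the betas as estimated (single factor, OLS weighting
    gamma' = beta'). S uses no lag terms; add Newey-West terms for
    autocorrelated errors."""
    T, N = excess_returns.shape
    f = np.asarray(factor).ravel()
    X = np.column_stack([np.ones(T), f])
    coef, *_ = np.linalg.lstsq(X, excess_returns, rcond=None)
    b = coef[1]
    eps = excess_returns - X @ coef
    lam = (b @ excess_returns.mean(axis=0)) / (b @ b)    # OLS cross-section
    # moment series u_t = [eps_t; f_t eps_t; R^e_t - beta*lam], T x 3N
    u = np.hstack([eps, eps * f[:, None], excess_returns - b * lam])
    S = u.T @ u / T
    Ef, Ef2 = f.mean(), (f ** 2).mean()
    I, Z = np.eye(N), np.zeros((N, N))
    col0 = np.zeros((N, 1))
    d = -np.block([[I, Ef * I, col0],
                   [Ef * I, Ef2 * I, col0],
                   [Z, lam * I, b[:, None]]])
    A = np.block([[np.eye(2 * N), np.zeros((2 * N, N))],
                  [np.zeros((1, 2 * N)), b[None, :]]])   # gamma' = beta'
    adinv = np.linalg.inv(A @ d)
    V = adinv @ (A @ S @ A.T) @ adinv.T / T              # var of (a, beta, lambda)
    return lam, np.sqrt(V[-1, -1])
```

Substituting the row `b[None, :]` in A with `(np.linalg.solve(Sigma, b))[None, :]` would give the GLS weighting instead.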


12.2.4 Time series vs. cross-section

How are the time-series and cross-sectional approaches different?
Most importantly, you can run the cross-sectional regression when the factor is not a
return. The time-series test requires factors that are also returns, so that you can estimate
factor risk premia by $\lambda=E_T(f)$. The asset pricing model does predict a restriction on the
intercepts in the time-series regression. Why not just test these? If you impose the restriction
$E(R^{ei})=\beta_i'\lambda$, you can write the time-series regression (12.174) as

$$
R^{ei}_t=\beta_i'\lambda+\beta_i'\left(f_t-E(f)\right)+\varepsilon^i_t,\quad t=1,2,\ldots,T\ \text{for each}\ i.
$$

Comparing this with (12.174), you see that the intercept restriction is

$$
a_i=\beta_i'\left(\lambda-E(f)\right).
$$

This restriction makes sense. The model says that mean returns should be proportional to
betas, and the intercept in the time-series regression controls the mean return. You can also
see how λ = E(f) results in a zero intercept. Finally, however, you see that without an
estimate of λ, you can't check this intercept restriction. If the factor is not a return, you will
be forced to do something like a cross-sectional regression.
When the factor is a return, so that we can compare the two methods, they are not necessarily
the same. The time-series regression estimates the factor risk premium as the sample
mean of the factor. Hence, the factor receives a zero pricing error. Also, the predicted zero-beta
excess return is zero. Thus, the time-series regression describes the cross-section of
expected returns by drawing a line as in Figure 26 that runs through the origin and through
the factor, ignoring all of the other points. The OLS cross-sectional regression picks the slope,
and intercept if you include one, to best fit all the points, i.e. to minimize the sum of squares of
all the pricing errors.
If the factor is a return, the GLS cross-sectional regression, including the factor as a test
asset, is identical to the time-series regression. The time-series regression for the factor is, of
course,

$$
f_t=0+1\cdot f_t+0,
$$

so it has a zero intercept, beta equal to one, and zero residual in every sample. The residual
variance-covariance matrix of the returns, including the factor, is

$$
E\left(\begin{bmatrix} R^e-a-\beta f \\ f-0-1\cdot f \end{bmatrix}\left[\cdot\right]'\right)
=\begin{bmatrix} \Sigma & 0 \\ 0 & 0 \end{bmatrix}.
$$

Since the factor has zero residual variance, a GLS regression puts all its weight on that asset.
Therefore, $\hat{\lambda}=E_T(f)$ just as for the time-series regression. The pricing errors are the same,
as is their distribution and the χ² test. (You gain a degree of freedom by adding the factor to
the cross-sectional regression, so the test is a $\chi^2_N$.)


Why does the "efficient" technique ignore the pricing errors of all of the other assets in
estimating the factor risk premium, and focus only on the mean return of the factor? The answer is simple,
though subtle. In the regression model

$$
R^e_t=a+\beta f_t+\varepsilon_t,
$$

the average return of each asset in a sample is equal to beta times the average return of the
factor in the sample, plus the average residual in the sample. An average return carries no
additional information about the mean of the factor. A signal plus noise carries no additional
information beyond that in the signal itself. Thus, an "efficient" cross-sectional regression
wisely ignores all the information in the other asset returns and uses only the information in
the factor return to estimate the factor risk premium.

12.3 Fama-MacBeth Procedure

I introduce the Fama-MacBeth procedure for running cross-sectional regressions and calculating
standard errors that correct for cross-sectional correlation in a panel. I show that, when
the right hand variables do not vary over time, Fama-MacBeth is numerically equivalent to
pooled time-series, cross-section OLS with standard errors corrected for cross-sectional correlation,
and also to a single cross-sectional regression on time-series averages with standard
errors corrected for cross-sectional correlation. Fama-MacBeth standard errors do not include
corrections for the fact that the betas are also estimated.

Fama and MacBeth (1973) suggest an alternative procedure for running cross-sectional
regressions, and for producing standard errors and test statistics. This is a historically important
procedure, it is computationally simple to implement, and it is still widely used, so it is
important to understand it and relate it to other procedures.

First, you find beta estimates with a time-series regression. Fama and MacBeth use rolling
5 year regressions, but one can also use the technique with full-sample betas, and I will
consider that simpler case. Second, instead of estimating a single cross-sectional regression
with the sample averages, we now run a cross-sectional regression at each time period, i.e.

$$
R^{ei}_t=\beta_i'\lambda_t+\alpha_{it},\quad i=1,2,\ldots,N\ \text{for each}\ t.
$$

I write the case of a single factor for simplicity, but it's easy to extend the model to multiple
factors. Then, Fama and MacBeth suggest that we estimate λ and $\alpha_i$ as the averages of the
cross-sectional regression estimates,

