$$= g_T' \, E(xx')^{-1} \, g_T$$

You might want to choose parameters of the model to minimize this "economic" measure of model fit, or this economically motivated linear combination of pricing errors, rather than the statistical measure of fit $S^{-1}$. You might also use the minimized value of this criterion to compare two models. In that way, you are sure the better model is better because it improves on the pricing errors rather than just blowing up the weighting matrix.

Identity matrix.

Using the identity matrix weights the initial choice of assets or portfolios equally in estimation and evaluation. This choice has a particular advantage with large systems in which $S$ is nearly singular, as it avoids most of the problems associated with inverting a near-singular $S$ matrix. Many empirical asset pricing studies use OLS cross-sectional regressions, which are the same thing as a first-stage GMM estimate with an identity weighting matrix.

Comparing the second moment and identity matrices.

The second moment matrix gives an objective that is invariant to the initial choice of assets or portfolios. If we form a portfolio $Ax$ of the initial payoffs $x$, with nonsingular $A$ (i.e., a transformation that doesn't throw away information), then

$$[E(yAx) - Ap]' \, E(Axx'A')^{-1} \, [E(yAx) - Ap] = [E(yx) - p]' \, E(xx')^{-1} \, [E(yx) - p].$$
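This invariance is easy to verify numerically. The following sketch uses simulated payoffs and arbitrary prices (all numbers and names here are illustrative, not from the text) and checks that the quadratic form is unchanged by a nonsingular repackaging $A$:

```python
import numpy as np

rng = np.random.default_rng(0)
T, n = 500, 4
x = rng.normal(size=(T, n))          # payoffs (rows are dates)
y = 1.0 + 0.1 * rng.normal(size=T)   # discount factor
p = np.ones(n)                       # prices (arbitrary here)
A = rng.normal(size=(n, n))          # nonsingular with probability 1

def objective(payoffs, prices):
    """Pricing-error quadratic form g' E(xx')^{-1} g, with sample moments."""
    g = payoffs.T @ y / T - prices           # E(y x) - p
    Exx = payoffs.T @ payoffs / T            # E(x x')
    return g @ np.linalg.solve(Exx, g)

J_original = objective(x, p)
J_portfolio = objective(x @ A.T, A @ p)      # payoffs Ax, prices Ap
assert np.isclose(J_original, J_portfolio)
```

The identity holds exactly because $g \mapsto Ag$ and $E(xx') \mapsto A E(xx') A'$, so the $A$'s cancel inside the quadratic form.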

The optimal weighting matrix $S$ shares this property. It is not true of the identity or other fixed matrices. In those cases, the results will depend on the initial choice of portfolios.

Kandel and Stambaugh (1995) have suggested that the results of several important asset pricing model tests are highly sensitive to the choice of portfolio; i.e., that authors inadvertently selected a set of portfolios on which the CAPM does unusually badly in a particular sample. Insisting that weighting matrices have this kind of invariance to portfolio selection might be a good device to ward against this problem.

On the other hand, if you want to focus on the model's predictions for economically interesting portfolios, then it wouldn't make much sense for the weighting matrix to undo the specification of economically interesting portfolios! For example, many studies want to focus on the ability of a model to describe expected returns that seem to depend on a characteristic such as size, book/market, industry, momentum, etc. Also, the second moment matrix is often even more nearly singular than the spectral density matrix, since $E(xx') = \mathrm{cov}(x) + E(x)E(x)'$. Therefore, it often emphasizes portfolios with even more extreme short and long positions, and is no help in overcoming the near-singularity of the $S$ matrix.

11.6 Estimating on one group of moments, testing on another.

You may want to force the system to use one set of moments for estimation and another for testing. The real business cycle literature in macroeconomics does this extensively, typically using "first moments" for estimation ("calibration") and "second moments" (i.e., first moments of squares) for evaluation. A statistically minded macroeconomist might like to know whether the departures of model from data "second moments" are large compared to sampling variation, and would like to include sampling uncertainty about the parameter estimates in this evaluation.

You might want to choose parameters using one set of asset returns (stocks; domestic assets; size portfolios; first 9 size deciles; well-measured assets) and then see how the model does "out of sample" on another set of assets (bonds; foreign assets; book/market portfolios; small-firm portfolio; questionably measured assets; mutual funds). However, you want the distribution theory for evaluation on the second set of moments to incorporate sampling uncertainty about the parameters in their estimation on the first set of moments.

You can do all this very simply by using an appropriate weighting matrix or a prespecified moment matrix $a_T$. For example, if the first $N$ moments will be used to estimate $N$ parameters, and the remaining $M$ moments will be used to test the model "out of sample," use $a_T = [I_N \;\; 0_{N \times M}]$. If there are more moments $N$ than parameters in the "estimation" block, you can construct a weighting matrix $W$ which is an identity matrix in the $N \times N$ estimation block and zero elsewhere. Then $a_T = \partial g_T'/\partial b \, W$ will simply contain the first $N$ columns of $\partial g_T'/\partial b$ followed by zeros. The test moments will not be used in estimation. You could even use the inverse of the upper $N \times N$ block of $S$ (not the upper block of the inverse of $S$!) to make the estimation a bit more efficient.
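As a sketch of this construction, suppose a linear model with $g_T(b) = m - Db$, $N$ parameters, $N$ estimation moments and $M$ test moments (the setup and numbers below are invented for illustration). Choosing $a_T = [I_N \;\; 0_{N\times M}]$ sets the estimation moments to zero and leaves the test moments free:

```python
import numpy as np

rng = np.random.default_rng(1)
N, M = 3, 4
D = rng.normal(size=(N + M, N))   # derivative matrix, so g_T(b) = m - D b
m = rng.normal(size=N + M)        # sample moment intercepts (made up)

# a_T = [I_N  0_{N x M}]: estimate b from the first N moments only
a_T = np.hstack([np.eye(N), np.zeros((N, M))])
b_hat = np.linalg.solve(a_T @ D, a_T @ m)   # solves a_T g_T(b) = 0

g_T = m - D @ b_hat
# The N estimation moments are driven to zero; the M test moments are
# left free, to be examined "out of sample."
assert np.allclose(g_T[:N], 0.0)
```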

11.7 Estimating the spectral density matrix

Hints on estimating the spectral density or long-run covariance matrix: 1) Use a sensible first-stage estimate. 2) Remove means. 3) Downweight higher-order correlations. 4) Consider parametric structures for autocorrelation and heteroskedasticity. 5) Use the null to limit the number of correlations or to impose other structure on S. 6) Size problems; consider a factor or other parametric cross-sectional structure for S. 7) Iteration and simultaneous b, S estimation.


The optimal weighting matrix $S$ depends on population moments, and depends on the parameters $b$. Work back through the definitions,

$$S = \sum_{j=-\infty}^{\infty} E(u_t u_{t-j}'), \qquad u_t \equiv m_t(b)\, x_t - p_{t-1}.$$

How do we estimate this matrix? The big picture is simple: following the usual philosophy, estimate population moments by their sample counterparts. Thus, use the first-stage $b$ estimates and the data to construct sample versions of the definition of $S$. This produces a consistent estimate of the true spectral density matrix, which is all the asymptotic distribution theory requires.

The details are important, however, and this section gives some hints. Also, you may want a different, and less restrictive, estimate of $S$ for use in standard errors than you do when you are estimating $S$ for use in a weighting matrix.

1) Use a sensible first-stage W, or transform the data.

In the asymptotic theory, you can use consistent first-stage $b$ estimates formed by any nontrivial weighting matrix. In practice, of course, you should use a sensible weighting matrix so that the first-stage estimates are not ridiculously inefficient. $W = I$ is often a good choice.

Sometimes, some moments will have different units than other moments. For example, the dividend/price ratio is a number like 0.04. Therefore, the moment formed by $R_{t+1} \times d/p_t$ will be about 0.04 times as big as the moment formed by $R_{t+1} \times 1$. If you use $W = I$, GMM will pay much less attention to the $R_{t+1} \times d/p_t$ moment. It is wise, then, to either use an initial weighting matrix that overweights the $R_{t+1} \times d/p_t$ moment, or to transform the data so the two moments have about the same mean and variance. For example, you could use $R_{t+1} \times (1 + d/p_t)$. It is also useful to start with moments that are not horrendously correlated with each other, or to remove such correlation with a clever $W$. For example, you might consider $R^a$ and $R^b - R^a$ rather than $R^a$ and $R^b$. You can accomplish this directly, or by starting with

$$W = \begin{bmatrix} 1 & -1 \\ 0 & 1 \end{bmatrix} \begin{bmatrix} 1 & 0 \\ -1 & 1 \end{bmatrix} = \begin{bmatrix} 2 & -1 \\ -1 & 1 \end{bmatrix}.$$
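A quick check of this choice of $W$: using the transformed moments $(g_a,\, g_b - g_a)$ with an identity weight is the same as using the original moments with $W$ (toy numbers below, invented for illustration):

```python
import numpy as np

g = np.array([0.7, 0.5])                 # sample moments for R^a, R^b (made up)
A = np.array([[1.0, 0.0], [-1.0, 1.0]])  # maps g to (g_a, g_b - g_a)
W = A.T @ A                              # = [[2, -1], [-1, 1]]

h = A @ g                                # transformed moments
# h'h (identity weight on transformed moments) equals g'Wg
assert np.isclose(h @ h, g @ W @ g)
```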

2) Remove means.

Under the null, $E(u_t) = 0$, so it does not matter to the asymptotic distribution theory whether you estimate the covariance matrix by removing means, using

$$\frac{1}{T}\sum_{t=1}^{T}\left[(u_t - \bar{u})(u_t - \bar{u})'\right]; \qquad \bar{u} \equiv \frac{1}{T}\sum_{t=1}^{T} u_t,$$

or whether you estimate the second moment matrix by not removing means. However, Hansen and Singleton (1982) advocate removing the means in sample, and this is generally a good idea.

It is already a major obstacle to second-stage estimation that estimated $S$ matrices (and even simple variance-covariance matrices) are often nearly singular, providing an unreliable weighting matrix when inverted. Since second moment matrices $E(uu') = \mathrm{cov}(u, u') + E(u)E(u)'$ add a singular matrix $E(u)E(u)'$, they are often even worse.

3) Downweight higher order correlations.

You obviously cannot use a direct sample counterpart to the spectral density matrix. In a sample of size 100, there is no way to estimate $E(u_t u_{t+101}')$. Your estimate of $E(u_t u_{t+99}')$ is based on one data point, $u_1 u_{100}'$. Hence, it will be a pretty unreliable estimate. For this reason, the estimator using all possible autocorrelations in a given sample is inconsistent. (Consistency means that as the sample grows, the probability distribution of the estimator converges to the true value. Inconsistent estimates typically have very large sample variation.)

Furthermore, even $S$ estimates that use few autocorrelations are not always positive definite in sample. This is embarrassing when one tries to invert the estimated spectral density matrix, which you have to do if you use it as a weighting matrix. Therefore, it is a good idea to construct consistent estimates that are automatically positive definite in every sample. One such estimate is the Bartlett estimate, used in this application by Newey and West (1987b). It is

$$\hat{S} = \sum_{j=-k}^{k} \left(\frac{k - |j|}{k}\right) \frac{1}{T} \sum_{t=1}^{T} (u_t u_{t-j}'). \tag{11.161}$$

As you can see, only autocorrelations up to $k$th order ($k < T$) are included, and higher-order autocorrelations are downweighted. (It's important to use $1/T$, not $1/(T-k)$; this is a further downweighting.) The Newey-West estimator is basically the variance of $k$th sums, which is why it is positive definite in sample:

$$\mathrm{Var}\left(\sum_{j=1}^{k} u_{t-j}\right) = k\,E(u_t u_t') + (k-1)\left[E(u_t u_{t-1}') + E(u_{t-1} u_t')\right] + \cdots + \left[E(u_t u_{t-k+1}') + E(u_{t-k+1} u_t')\right] = k \sum_{j=-k}^{k} \frac{k - |j|}{k}\, E(u_t u_{t-j}').$$

Andrews (1991) gives some additional weighting schemes for spectral density estimates.
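A minimal sketch of the estimator (11.161), with invented data; the final check verifies the positive-semidefiniteness claimed in the text:

```python
import numpy as np

def newey_west(u, k):
    """S-hat = sum_{j=-k}^{k} ((k-|j|)/k) (1/T) sum_t u_t u_{t-j}'."""
    T, n = u.shape
    S = u.T @ u / T                       # j = 0 term
    for j in range(1, k):
        w = (k - j) / k                   # Bartlett weight (zero at j = k)
        C = u[j:].T @ u[:-j] / T          # (1/T) sum_t u_t u_{t-j}'
        S += w * (C + C.T)                # the +j and -j terms
    return S

rng = np.random.default_rng(2)
u = rng.normal(size=(400, 3))
u[1:] += 0.5 * u[:-1]                     # induce some serial correlation
S_hat = newey_west(u, k=8)

# Positive semidefinite in every sample, by construction
assert np.linalg.eigvalsh(S_hat).min() > -1e-8
```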


This calculation also gives some intuition for the $S$ matrix. We're looking for the variance across samples of the sample mean, $\mathrm{var}\left(\frac{1}{T}\sum_{t=1}^{T} u_t\right)$. We only have one sample mean to look at, so we estimate the variance of the sample mean by looking at the variance in a single sample of shorter sums, $\mathrm{var}\left(\frac{1}{k}\sum_{j=1}^{k} u_j\right)$. The $S$ matrix is sometimes called the long-run covariance matrix for this reason. In fact, one could estimate $S$ directly as a variance of $k$th sums and obtain almost the same estimator, which would also be positive definite in any sample,

$$v_t = \sum_{j=1}^{k} u_{t-j}; \qquad \bar{v} = \frac{1}{T-k}\sum_{t=k+1}^{T} v_t,$$

$$\hat{S} = \frac{1}{k}\,\frac{1}{T-k}\sum_{t=k+1}^{T} (v_t - \bar{v})(v_t - \bar{v})'.$$

This estimator has been used when measurement of S is directly interesting (Cochrane 1998,

Lo and MacKinlay 1988). A variety of other weighting schemes have been advocated.
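The variance-of-$k$th-sums estimator can be sketched the same way (again with invented data; the assertion checks positive-semidefiniteness, which holds in any sample since the estimate is a scaled Gram matrix):

```python
import numpy as np

def long_run_cov_sums(u, k):
    """S-hat = (1/k)(1/(T-k)) sum_t (v_t - vbar)(v_t - vbar)',
    where v_t = u_{t-1} + ... + u_{t-k}."""
    T = u.shape[0]
    v = np.array([u[t - k:t].sum(axis=0) for t in range(k, T)])
    v = v - v.mean(axis=0)
    return (v.T @ v) / (len(v) * k)

rng = np.random.default_rng(3)
u = rng.normal(size=(600, 2))
S_hat = long_run_cov_sums(u, k=10)
assert np.linalg.eigvalsh(S_hat).min() > -1e-10
```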

What value of $k$, or how wide a window if of another shape, should you use? Here again, you have to use some judgment. Too short a value of $k$, together with a $u_t$ that is significantly autocorrelated, and you don't correct for correlation that might be there in the errors. Too long a value of $k$, together with a series that does not have much autocorrelation, and the performance of the estimate and test deteriorates. If $k = T/2$, for example, you are really using only two data points to estimate the variance of the mean. The optimum value then depends on how much persistence or low-frequency movement there is in a particular application, vs. the accuracy of the estimate.

There is an extensive statistical literature about optimal window width, or size of $k$. Alas, this literature mostly characterizes the rate at which $k$ should increase with sample size. You must promise to increase $k$ as sample size increases, but not as quickly as the sample size increases ($\lim_{T \to \infty} k = \infty$, $\lim_{T \to \infty} k/T = 0$) in order to obtain consistent estimates. In practice, promises about what you'd do with more data are pretty meaningless, and usually broken once more data arrives.

4) Consider parametric structures for autocorrelation and heteroskedasticity.

"Nonparametric" corrections such as (11.161) often don't perform very well in typical samples. The problem is that "nonparametric" techniques are really very highly parametric; you have to estimate many correlations in the data. Therefore, the nonparametric estimate varies a good deal from sample to sample, while the asymptotic distribution theory ignores sampling variation in covariance matrix estimates. The asymptotic distribution can therefore be a poor approximation to the finite-sample distribution of statistics like the $J_T$. The $S^{-1}$ weighting matrix will also be unreliable.

One answer is to use a Monte Carlo or bootstrap to estimate the finite-sample distribution of parameters and test statistics rather than to rely on asymptotic theory.

Alternatively, you can impose a parametric structure on the $S$ matrix. Just because the formulas are expressed in terms of a sum of covariances does not mean you have to estimate them that way; GMM is not inherently tied to "nonparametric" covariance matrix estimates. For example, if you model a scalar $u$ as an AR(1) with parameter $\rho$, then you can estimate two numbers $\rho$ and $\sigma_u^2$ rather than a whole list of autocorrelations, and calculate

$$S = \sum_{j=-\infty}^{\infty} E(u_t u_{t-j}) = \sigma_u^2 \sum_{j=-\infty}^{\infty} \rho^{|j|} = \sigma_u^2\, \frac{1+\rho}{1-\rho}.$$

If this structure is not a bad approximation, imposing it can result in more reliable estimates and test statistics, since one has to estimate many fewer coefficients. You could transform the data in such a way that there is less correlation to correct for in the first place.

(This is a very useful formula, by the way. You are probably used to calculating the standard error of the mean as

$$\sigma(\bar{x}) = \frac{\sigma(x)}{\sqrt{T}}.$$

This formula assumes that the $x$ are uncorrelated over time. If an AR(1) is not a bad model for their correlation, you can quickly adjust for correlation by using

$$\sigma(\bar{x}) = \frac{\sigma(x)}{\sqrt{T}} \sqrt{\frac{1+\rho}{1-\rho}}$$

instead.)
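A short sketch of this adjustment on simulated AR(1) data (the series and numbers are invented; with positive $\rho$ the corrected standard error exceeds the naive one):

```python
import numpy as np

rng = np.random.default_rng(4)
T, rho = 2000, 0.6
x = np.empty(T)
x[0] = rng.normal()
for t in range(1, T):                     # simulate an AR(1)
    x[t] = rho * x[t - 1] + rng.normal()

rho_hat = np.corrcoef(x[1:], x[:-1])[0, 1]      # first autocorrelation
se_naive = x.std(ddof=1) / np.sqrt(T)            # sigma(x)/sqrt(T)
se_ar1 = se_naive * np.sqrt((1 + rho_hat) / (1 - rho_hat))

# With rho > 0 the corrected standard error is larger than the naive one
assert se_ar1 > se_naive
```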

This sort of parametric correction is very familiar from OLS regression analysis. The

textbooks commonly advocate the AR(1) model for serial correlation as well as parametric

models for heteroskedasticity corrections. There is no reason not to follow a similar approach

for GMM statistics.

5) Use the null to limit correlations?

In the typical asset pricing setup, the null hypothesis specifies that $E_t(u_{t+1}) = E_t(m_{t+1}R_{t+1} - 1) = 0$, as well as $E(u_{t+1}) = 0$. This implies that all the autocorrelation terms of $S$ drop out; $E(u_t u_{t-j}') = 0$ for $j \neq 0$. The lagged $u$ could be an instrument $z$: the discounted return should be unforecastable, using past discounted returns as well as any other variable. In this situation, one could exploit the null to include only one term, and estimate

$$\hat{S} = \frac{1}{T}\sum_{t=1}^{T} u_t u_t'.$$

Similarly, if one runs a regression forecasting returns from some variable $z_t$,

$$R_{t+1} = a + b z_t + \varepsilon_{t+1},$$

the null hypothesis that returns are not forecastable by any variable at time $t$ means that the errors should not be autocorrelated. One can then simplify the standard errors in the OLS regression formulas given in section 11.4, eliminating all the leads and lags.

In other situations, the null hypothesis can suggest a functional form for $E(u_t u_{t-j}')$, or that some but not all are zero. For example, as we saw in section 11.4, regressions of long-horizon returns on overlapping data lead to a correlated error term, even under the null hypothesis of no return forecastability. We can impose this null, ruling out terms past the overlap, as suggested by Hansen and Hodrick,

$$\mathrm{var}(b_T) = \frac{1}{T}\, E(x_t x_t')^{-1} \left[\sum_{j=-k}^{k} E(e_t\, x_t x_{t-j}'\, e_{t-j})\right] E(x_t x_t')^{-1}. \tag{11.162}$$

However, the null might not be correct, and the errors might be correlated. If so, you might make a mistake by leaving them out. If the null is correct, the extra terms will converge to zero and you will only have lost a few (finite-sample) degrees of freedom needlessly estimating them. If the null is not correct, you have an inconsistent estimate. With this in mind, you might want to include at least a few extra autocorrelations, even when the null says they don't belong.

Furthermore, there is no guarantee that the unweighted sum in (11.162) is positive definite in sample. If the sum in the middle is not positive definite, you could add a weighting to the sum, possibly increasing the number of lags so that the lags near $k$ are not unusually underweighted. Again, estimating extra lags that should be zero under the null only loses a little bit of power.

Monte Carlo evidence (Hodrick 1992) suggests that imposing the null hypothesis to simplify the spectral density matrix helps to get the finite-sample size of test statistics right (the probability of rejection given that the null is true). One should not be surprised that if the null is true, imposing as much of it as possible makes estimates and tests work better. On the other hand, adding extra correlations can help with the power of test statistics (the probability of rejection given that an alternative is true), since they converge to the correct spectral density matrix.

This trade-off requires some thought. For measurement rather than pure testing, using

a spectral density matrix that can accommodate alternatives may be the right choice. For

example, in the return forecasting regressions, one is really focused on measuring return

forecastability rather than just formally testing the hypothesis that it is zero. On the other

hand, the small-sample performance of the nonparametric estimators with many lags is not

very good.

If you are testing an asset pricing model that predicts $u$ should not be autocorrelated, and there is a lot of correlation (if this issue makes a big difference), then this is an indication that something is wrong with the model: including $u$ as one of your instruments $z$ would result in a rejection, or at least substantially change the results. If the $u$ are close to uncorrelated, then it really doesn't matter if you add a few extra terms or not.


6) Size problems; consider a factor or other parametric cross-sectional structure.

If you try to estimate a covariance matrix that is larger than the number of data points (say, 2000 NYSE stocks and 800 monthly observations), the estimate of $S$, like any other covariance matrix, is singular by construction. This fact leads to obvious problems when you try to invert $S$! More generally, when the number of moments is more than around 1/10 the number of data points, $S$ estimates tend to become unstable and near-singular. Used as a weighting matrix, such an $S$ matrix tells you to pay lots of attention to strange and probably spurious linear combinations of the moments, as I emphasized in section 11.5. For this reason, most second-stage GMM estimations are limited to a few assets and a few instruments.

A good, but as yet untried, alternative might be to impose a factor structure or other well-behaved structure on the covariance matrix. The near-universal practice of grouping assets into portfolios before analysis already implies an assumption that the true $S$ of the underlying assets has a factor structure. Grouping in portfolios means that the individual assets have no information not contained in the portfolio, so that a weighting matrix $S^{-1}$ would treat all assets in the portfolio identically. It might be better to estimate an $S$ imposing a factor structure on all the primitive assets.

Another response to the difficulty of estimating $S$ is to stop at first-stage estimates, and only use $S$ for standard errors. One might also use a highly structured estimate of $S$ as the weighting matrix, while using a less constrained estimate for the standard errors.

This problem is of course not unique to GMM. Any estimation technique requires us to calculate a covariance matrix. Many traditional estimates simply assume that the $u_t$ errors are cross-sectionally independent. This false assumption leads to understatements of the standard errors far worse than the small-sample performance of any GMM estimate.

Our econometric techniques are all designed for large time series and small cross-sections. Our data have a large cross-section and a short time series. A large unsolved problem in finance is the development of appropriate large-$N$ small-$T$ tools for evaluating asset pricing models.

7) Alternatives to the two-stage procedure: iteration and one-step.

Hansen and Singleton (1982) describe the above two-step procedure, and it has become popular for that reason. Two alternative procedures may perform better in practice, i.e., may result in asymptotically equivalent estimates with better small-sample properties. They can also be simpler to implement, and they require less manual adjustment or care in specifying the setup (moments, weighting matrices), which is often just as important.

a) Iterate. The second-stage estimate $\hat{b}_2$ will not imply the same spectral density as the first stage. It might seem appropriate that the estimate of $b$ and of the spectral density should be consistent, i.e., to find a fixed point of $\hat{b} = \min_{\{b\}}[g_T(b)' S(\hat{b})^{-1} g_T(b)]$. One way to search for such a fixed point is to iterate: find $b_2$ from

$$\hat{b}_2 = \min_{\{b\}} g_T(b)' S^{-1}(b_1)\, g_T(b) \tag{11.163}$$

where $b_1$ is a first-stage estimate, held fixed in the minimization over $b_2$. Then use $\hat{b}_2$ to find $S(\hat{b}_2)$, find

$$\hat{b}_3 = \min_{\{b\}}\left[g_T(b)' S(\hat{b}_2)^{-1} g_T(b)\right],$$

and so on. There is no fixed-point theorem guaranteeing that such iterations will converge, but they often do, especially with a little massaging. (I once used $S[(b_j + b_{j-1})/2]$ in the beginning part of an iteration to keep it from oscillating between two values of $b$.) Ferson and Foerster (1994) find that iteration gives better small-sample performance than two-stage GMM in Monte Carlo experiments. This procedure is also likely to produce estimates that do not depend on the initial weighting matrix.
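The iteration can be sketched for a linear instrumental-variables model, where each step has a closed form (this toy setup, with i.i.d. errors and no lags in $S$, is ours, not from the text):

```python
import numpy as np

rng = np.random.default_rng(5)
T = 1000
z = rng.normal(size=(T, 2))                     # two instruments
x = z @ np.array([1.0, 0.5]) + rng.normal(size=T)
y = 2.0 * x + rng.normal(size=T)                # true b = 2

m = z.T @ y / T                                 # so g_T(b) = m - d b
d = z.T @ x / T

def S_of(b):
    """S(b) = (1/T) sum u_t u_t' with u_t = z_t (y_t - x_t b); no lags."""
    u = z * (y - x * b)[:, None]
    return u.T @ u / T

b = 0.0                                          # crude first-stage estimate
for _ in range(20):                              # iterate b -> S(b) -> b -> ...
    Sinv = np.linalg.inv(S_of(b))
    b_new = (d @ Sinv @ m) / (d @ Sinv @ d)      # closed-form minimizer
    if abs(b_new - b) < 1e-12:
        break
    b = b_new

assert abs(b - 2.0) < 0.3                        # settles near the true value
```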

b) Pick $b$ and $S$ simultaneously. It is not true that $S$ must be held fixed as one searches for $b$. Instead, one can use a new $S(b)$ for each value of $b$. Explicitly, one can estimate $b$ by

$$\min_{\{b\}}\left[g_T(b)' S^{-1}(b)\, g_T(b)\right]. \tag{11.164}$$

The estimates produced by this simultaneous search will not be numerically the same in a finite sample as the two-step or iterated estimates. The first-order conditions to (11.163) are

$$\left(\frac{\partial g_T(b)}{\partial b}\right)' S^{-1}(b_1)\, g_T(b) = 0, \tag{11.165}$$

while the first-order conditions in (11.164) add a term involving the derivatives of $S(b)$ with respect to $b$. However, the latter terms vanish asymptotically, so the asymptotic distribution theory is not affected. Hansen, Heaton and Yaron (1996) conduct some Monte Carlo experiments and find that this estimate may have small-sample advantages in certain problems. A problem is that the one-step minimization may find regions of the parameter space that blow up the spectral density matrix $S(b)$ rather than lower the pricing errors $g_T$.

Often, one choice will be much more convenient than another. For linear models, one can find the minimizing value of $b$ from the first-order conditions (11.165) analytically. This fact eliminates the need to search, so even an iterated estimate is much faster. For nonlinear models, each step involves a numerical search over $g_T(b)' S\, g_T(b)$. Rather than perform this search many times, it may be much quicker to minimize once over $g_T(b)' S(b)\, g_T(b)$. On the other hand, the latter is not a locally quadratic form, so the search may run into greater numerical difficulties.

11.8 Problems

1. Use the delta method version of the GMM formulas to derive the sampling variance of an autocorrelation coefficient.

2. Write a formula for the standard error of OLS regression coefficients that corrects for autocorrelation but not heteroskedasticity.

3. Write a formula for the standard error of OLS regression coefficients if $E(e_t e_{t-j}) = \rho^j \sigma^2$.

4. If the GMM errors come from an asset pricing model, $u_t = m_t R_t - 1$, can you ignore lags in the spectral density matrix? What if you know that returns are predictable? What if the error is formed from an instrument/managed portfolio, $u_t z_{t-1}$?


Chapter 12. Regression-based tests of linear factor models

This and the next three chapters study the question: how should we estimate and evaluate linear factor models, models of the form $p = E(mx)$, $m = b'f$, or equivalently $E(R^e) = \beta\lambda$? These models are by far the most common in empirical asset pricing, and there is a large literature on econometric techniques to estimate and evaluate them. Each technique focuses on the same questions: how to estimate parameters, how to calculate standard errors of the estimated parameters, how to calculate standard errors of the pricing errors, and how to test the model, usually with a test statistic of the form $\hat{\alpha}' V^{-1} \hat{\alpha}$.

I start with simple and longstanding time-series and cross-sectional regression tests. Then I pursue the GMM approach to the model expressed in $p = E(mx)$, $m = b'f$ form. The following chapter summarizes the principle of maximum likelihood estimation and derives maximum likelihood estimates and tests. Finally, a chapter compares the different approaches.

As always, the theme is the underlying unity. All of the techniques come down to one of two basic ideas: time-series regression or cross-sectional regression. The GMM, $p = E(mx)$ approach turns out to be almost identical to cross-sectional regressions. Maximum likelihood (with appropriate statistical assumptions) justifies the time-series and cross-sectional regression approaches. The formulas for parameter estimates, standard errors, and test statistics are all strikingly similar.

12.1 Time-series regressions

When the factor is also a return, we can evaluate the model

$$E(R^{ei}) = \beta_i E(f)$$

by running OLS time-series regressions

$$R^{ei}_t = \alpha_i + \beta_i f_t + \varepsilon^i_t, \quad t = 1, 2, \ldots, T$$

for each asset. The OLS distribution formulas (with corrected standard errors) provide standard errors of $\alpha$ and $\beta$.

With errors that are i.i.d. over time, homoskedastic and independent of the factors, the asymptotic joint distribution of the intercepts gives the model test statistic,

$$T\left[1 + \left(\frac{E_T(f)}{\hat{\sigma}(f)}\right)^2\right]^{-1} \hat{\alpha}' \hat{\Sigma}^{-1} \hat{\alpha} \sim \chi^2_N.$$


The Gibbons-Ross-Shanken test is a multivariate, finite-sample counterpart to this statistic, applicable when the errors are also normally distributed,

$$\frac{T-N-K}{N}\left[1 + E_T(f)'\hat{\Omega}^{-1}E_T(f)\right]^{-1} \hat{\alpha}' \hat{\Sigma}^{-1} \hat{\alpha} \sim F_{N,\,T-N-K}.$$

I show how to construct the same test statistics with heteroskedastic and autocorrelated errors via GMM.

I start with the simplest case. We have a factor pricing model with a single factor. The factor is an excess return (for example, the CAPM, with $R^{em} = R^m - R^f$), and the test assets are all excess returns. We express the model in expected return-beta form. The betas are defined by regression coefficients

$$R^{ei}_t = \alpha_i + \beta_i f_t + \varepsilon^i_t \tag{12.166}$$

and the model states that expected returns are linear in the betas:

$$E(R^{ei}) = \beta_i E(f). \tag{12.167}$$

Since the factor is also an excess return, the model applies to the factor as well, so $E(f) = 1 \times \lambda$.

Comparing the model (12.167) and the expectation of the time-series regression (12.166), we see that the model has one and only one implication for the data: all the regression intercepts $\alpha_i$ should be zero. The regression intercepts are equal to the pricing errors.

Given this fact, Black, Jensen and Scholes (1972) suggested a natural strategy for estimation and evaluation: run time-series regressions (12.166) for each test asset. The estimate of the factor risk premium is just the sample mean of the factor,

$$\hat{\lambda} = E_T(f).$$

Then, use standard OLS formulas for a distribution theory of the parameters. In particular, you can use t-tests to check whether the pricing errors $\alpha$ are in fact zero. These distributions are usually presented for the case that the regression errors in (12.166) are uncorrelated and homoskedastic, but the formulas in section 11.4 show easily how to calculate standard errors for arbitrary error covariance structures.

We also want to know whether all the pricing errors are jointly equal to zero. This requires us to go beyond standard formulas for the regression (12.166) taken alone, as we want to know the joint distribution of $\alpha$ estimates from separate regressions running side by side but with errors correlated across assets ($E(\varepsilon^i_t \varepsilon^j_t) \neq 0$). (We can think of (12.166) as a panel regression, and then it is a test of whether the firm dummies are jointly zero.) The classic forms of these tests assume no autocorrelation or heteroskedasticity, but allow the errors to be correlated across assets. Dividing the $\hat{\alpha}$ regression coefficients by their variance-covariance matrix leads to a $\chi^2$ test,

$$T\left[1 + \left(\frac{E_T(f)}{\hat{\sigma}(f)}\right)^2\right]^{-1} \hat{\alpha}' \hat{\Sigma}^{-1} \hat{\alpha} \sim \chi^2_N \tag{12.168}$$

where $E_T(f)$ denotes the sample mean, $\hat{\sigma}^2(f)$ denotes the sample variance, $\hat{\alpha}$ is a vector of the estimated intercepts,

$$\hat{\alpha} = \begin{bmatrix} \hat{\alpha}_1 & \hat{\alpha}_2 & \cdots & \hat{\alpha}_N \end{bmatrix}',$$

and $\hat{\Sigma}$ is the residual covariance matrix, i.e., the sample estimate of $E(\varepsilon_t \varepsilon_t') = \Sigma$, where

$$\varepsilon_t = \begin{bmatrix} \varepsilon^1_t & \varepsilon^2_t & \cdots & \varepsilon^N_t \end{bmatrix}'.$$

As usual when testing hypotheses about regression coefficients, this test is valid asymptotically. The asymptotic distribution theory assumes that $\hat{\sigma}^2(f)$ (i.e., $X'X$) and $\hat{\Sigma}$ have converged to their probability limits; therefore it is asymptotically valid even though the factor is stochastic and $\hat{\Sigma}$ is estimated, but it ignores those sources of variation in a finite sample. It does not require that the errors are normal, relying instead on the central limit theorem so that $\hat{\alpha}$ is normal. I derive (12.168) below.

Also as usual in a regression context, we can derive a finite-sample F distribution for the hypothesis that a set of parameters are jointly zero, for fixed values of the right-hand variable $f_t$,

$$\frac{T-N-1}{N}\left[1 + \left(\frac{E_T(f)}{\hat{\sigma}(f)}\right)^2\right]^{-1} \hat{\alpha}' \hat{\Sigma}^{-1} \hat{\alpha} \sim F_{N,\,T-N-1}. \tag{12.169}$$

This is the Gibbons, Ross and Shanken (1989) or "GRS" test statistic. The F distribution recognizes sampling variation in $\hat{\Sigma}$, which is not included in (12.168). This distribution requires that the errors $\varepsilon$ are normal as well as uncorrelated and homoskedastic. With normal errors, the $\hat{\alpha}$ are normal and $\hat{\Sigma}$ is an independent Wishart (the multivariate version of a $\chi^2$), so the ratio is F. This distribution is exact in a finite sample.
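A sketch of the GRS statistic (12.169) on data simulated under the null (all names and numbers below are invented; under the null the statistic is distributed $F_{N,\,T-N-1}$):

```python
import numpy as np

rng = np.random.default_rng(6)
T, N = 600, 5
f = 0.5 + rng.normal(size=T)                   # factor excess return
beta = rng.uniform(0.5, 1.5, size=N)
eps = rng.normal(size=(T, N))
Re = f[:, None] * beta + eps                   # alphas are zero by construction

# Time-series regression of each asset on a constant and f
X = np.column_stack([np.ones(T), f])
coef, *_ = np.linalg.lstsq(X, Re, rcond=None)
alpha = coef[0]                                # estimated intercepts
resid = Re - X @ coef
Sigma = resid.T @ resid / T                    # residual covariance

sharpe2 = (f.mean() / f.std(ddof=0)) ** 2
grs = (T - N - 1) / N / (1 + sharpe2) * alpha @ np.linalg.solve(Sigma, alpha)
# Under the null, grs should be a modest draw from F(N, T-N-1)
assert grs >= 0
```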

Tests (12.168) and (12.169) have a very intuitive form. The basic part of the test is a quadratic form in the pricing errors, $\hat{\alpha}' \hat{\Sigma}^{-1} \hat{\alpha}$. If there were no $\beta f$ in the model, then the $\hat{\alpha}$ would simply be the sample mean of the regression errors $\varepsilon_t$. Assuming i.i.d. $\varepsilon_t$, the variance of their sample mean is just $(1/T)\Sigma$. Thus, if we knew $\Sigma$, then $T \hat{\alpha}' \Sigma^{-1} \hat{\alpha}$ would be a sum of squared sample means divided by their variance-covariance matrix, which would have an asymptotic $\chi^2_N$ distribution, or a finite-sample $\chi^2_N$ distribution if the $\varepsilon_t$ are normal. But we have to estimate $\Sigma$, which is why the finite-sample distribution is F rather than $\chi^2$. We also estimate the $\beta$, and the second term in (12.168) and (12.169) accounts for that fact.

Recall that a single beta representation exists if and only if the reference return is on the mean-variance frontier. Thus, the test can also be interpreted as a test of whether f is ex-ante mean-variance efficient (whether it is on the mean-variance frontier using population moments) after accounting for sampling error. Even if f is on the true or ex-ante mean-variance frontier, other returns will outperform it in sample due to luck, so the return f will usually be inside the ex-post mean-variance frontier, i.e. the frontier drawn using sample moments. Still, it should not be too far inside the sample frontier. Gibbons, Ross and Shanken show that the test statistic can be expressed in terms of how far inside the ex-post frontier the return f is,

$$\frac{T-N-1}{N}\;\frac{\left(\dfrac{\mu_q}{\sigma_q}\right)^{2}-\left(\dfrac{E_T(f)}{\hat{\sigma}(f)}\right)^{2}}{1+\left(\dfrac{E_T(f)}{\hat{\sigma}(f)}\right)^{2}}, \tag{12.170}$$

where μ_q/σ_q is the Sharpe ratio of the ex-post tangency portfolio (the maximum ex-post Sharpe ratio) formed from the test assets plus the factor f.

If there are many factors that are excess returns, the same ideas work, with some cost of algebraic complexity. The regression equation is

$$R^{ei}_t=\alpha_i+\beta_i'f_t+\varepsilon^i_t.$$

The asset pricing model

$$E(R^{ei})=\beta_i'E(f)$$

again predicts that the intercepts should be zero. We can estimate α and β with OLS time-series regressions. Assuming normal i.i.d. errors, the quadratic form α̂′Σ̂⁻¹α̂ has the distribution

$$\frac{T-N-K}{N}\left[1+E_T(f)'\hat{\Omega}^{-1}E_T(f)\right]^{-1}\hat{\alpha}'\hat{\Sigma}^{-1}\hat{\alpha}\sim F_{N,\,T-N-K}, \tag{12.171}$$

where

$$N=\text{number of assets},\qquad K=\text{number of factors},\qquad \hat{\Omega}=\frac{1}{T}\sum_{t=1}^{T}\left[f_t-E_T(f)\right]\left[f_t-E_T(f)\right]'.$$

The main difference is that the squared Sharpe ratio of the single factor is replaced by the natural generalization E_T(f)′Ω̂⁻¹E_T(f).
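As a concrete sketch, the statistic (12.171) is a few lines of numpy; with K = 1 it reduces to the single-factor GRS statistic (12.169). Function and variable names here are my own, not from the text.

```python
import numpy as np

def grs_stat(Re, f):
    """GRS F-statistic (12.171) for H0: all time-series intercepts are zero.

    Re: T x N matrix of test-asset excess returns.
    f:  T x K matrix of factor excess returns.
    Returns the F statistic and its degrees of freedom (N, T - N - K).
    """
    T, N = Re.shape
    K = f.shape[1]
    X = np.column_stack([np.ones(T), f])          # regressors [1, f_t']
    coef = np.linalg.lstsq(X, Re, rcond=None)[0]  # (1+K) x N coefficients
    alpha = coef[0]                               # intercepts, one per asset
    resid = Re - X @ coef
    Sigma = resid.T @ resid / T                   # residual covariance (MLE)
    fbar = f.mean(axis=0)
    fc = f - fbar
    Omega = fc.T @ fc / T                         # factor covariance, as in (12.171)
    quad = fbar @ np.linalg.solve(Omega, fbar)    # E_T(f)' Omega^-1 E_T(f)
    stat = (T - N - K) / N * (alpha @ np.linalg.solve(Sigma, alpha)) / (1.0 + quad)
    return stat, (N, T - N - K)
```

Under the null, with normal i.i.d. errors, the statistic is F with the returned degrees of freedom; one could compare it to an F quantile for a p-value.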

12.1.1 Derivation of the χ² statistic and distributions with general errors.

I derive (12.168) as an instance of GMM. This approach allows us to generate straightforwardly the required corrections for autocorrelated and heteroskedastic disturbances. (MacKinlay and Richardson (1991) advocate GMM approaches to regression tests in this way.) It also serves to remind us that GMM and p = E(mx) are not necessarily paired; one can do a GMM estimate of an expected return-beta model too. The mechanics are only slightly different from what we did to generate distributions for OLS regression coefficients in section 11.4, since we keep track of N OLS regressions simultaneously.

Write the equations for all N assets together in vector form,

$$R^e_t=\alpha+\beta f_t+\varepsilon_t.$$

We use the usual OLS moments to estimate the coefficients,

$$g_T(b)=\begin{bmatrix}E_T\left(R^e_t-\alpha-\beta f_t\right)\\[2pt] E_T\left[\left(R^e_t-\alpha-\beta f_t\right)f_t\right]\end{bmatrix}=E_T\left(\begin{bmatrix}\varepsilon_t\\ f_t\varepsilon_t\end{bmatrix}\right)=0.$$

These moments exactly identify the parameters α and β, so the a matrix in $a\,g_T(\hat{b})=0$ is the identity matrix. Solving, the GMM estimates are of course the OLS estimates,

$$\hat{\alpha}=E_T(R^e_t)-\hat{\beta}E_T(f_t)$$
$$\hat{\beta}=\frac{E_T\left[\left(R^e_t-E_T(R^e_t)\right)f_t\right]}{E_T\left[\left(f_t-E_T(f_t)\right)f_t\right]}=\frac{\mathrm{cov}_T(R^e_t,f_t)}{\mathrm{var}_T(f_t)}.$$

The d matrix in the general GMM formula is

$$d\equiv\frac{\partial g_T(b)}{\partial b'}=-\begin{bmatrix}I_N & I_N E(f_t)\\ I_N E(f_t) & I_N E(f_t^2)\end{bmatrix}=-\begin{bmatrix}1 & E(f_t)\\ E(f_t) & E(f_t^2)\end{bmatrix}\otimes I_N,$$

where I_N is an N × N identity matrix. The S matrix is

$$S=\sum_{j=-\infty}^{\infty}\begin{bmatrix}E(\varepsilon_t\varepsilon_{t-j}') & E(\varepsilon_t\varepsilon_{t-j}'f_{t-j})\\ E(f_t\varepsilon_t\varepsilon_{t-j}') & E(f_t\varepsilon_t\varepsilon_{t-j}'f_{t-j})\end{bmatrix}.$$

Using the GMM variance formula (11.146) with a = I, we have

$$\mathrm{var}\begin{pmatrix}\hat{\alpha}\\ \hat{\beta}\end{pmatrix}=\frac{1}{T}\,d^{-1}S\,d^{-1\prime}. \tag{12.172}$$

At this point, we're done. The upper left-hand corner of var(α̂, β̂) gives us var(α̂), and the test we're looking for is α̂′var(α̂)⁻¹α̂ ∼ χ²_N.

The standard formulas make this expression prettier by assuming that the errors are uncorrelated over time and not heteroskedastic, to simplify the S matrix, just as we derived the standard OLS formulas in section 11.4. If we assume that f and ε are independent as well as orthogonal, E(fεε′) = E(f)E(εε′) and E(f²εε′) = E(f²)E(εε′). If we assume that the errors are independent over time as well, we lose all the lead and lag terms. Then the S matrix simplifies to

$$S=\begin{bmatrix}E(\varepsilon_t\varepsilon_t') & E(\varepsilon_t\varepsilon_t')E(f_t)\\ E(f_t)E(\varepsilon_t\varepsilon_t') & E(\varepsilon_t\varepsilon_t')E(f_t^2)\end{bmatrix}=\begin{bmatrix}1 & E(f_t)\\ E(f_t) & E(f_t^2)\end{bmatrix}\otimes\Sigma. \tag{12.173}$$

Now we can plug into (12.172). Using (A ⊗ B)⁻¹ = A⁻¹ ⊗ B⁻¹ and (A ⊗ B)(C ⊗ D) = AC ⊗ BD, we obtain

$$\mathrm{var}\begin{pmatrix}\hat{\alpha}\\ \hat{\beta}\end{pmatrix}=\frac{1}{T}\left(\begin{bmatrix}1 & E(f_t)\\ E(f_t) & E(f_t^2)\end{bmatrix}^{-1}\otimes\Sigma\right).$$

Evaluating the inverse,

$$\mathrm{var}\begin{pmatrix}\hat{\alpha}\\ \hat{\beta}\end{pmatrix}=\frac{1}{T}\frac{1}{\mathrm{var}(f_t)}\begin{bmatrix}E(f_t^2) & -E(f_t)\\ -E(f_t) & 1\end{bmatrix}\otimes\Sigma.$$

We're interested in the top left corner. Using E(f)² + var(f) = E(f²),

$$\mathrm{var}(\hat{\alpha})=\frac{1}{T}\left[1+\frac{E(f)^{2}}{\mathrm{var}(f)}\right]\Sigma.$$

This is the traditional formula (12.168), but there is now no real reason to assume that the errors are i.i.d. or independent of the factors. By simply calculating (12.172), we can easily construct standard errors and test statistics that do not require these assumptions.
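As an illustration, here is a minimal numpy sketch of (12.172) for a single factor, using the lag-zero S matrix (robust to heteroskedasticity and error-factor dependence, but assuming no autocorrelation); Newey-West lag terms could be added to S for autocorrelated errors. All names are my own.

```python
import numpy as np

def robust_alpha_var(Re, f):
    """var(alpha_hat) from (12.172): var = (1/T) d^-1 S d^-1', single factor.

    Uses the lag-zero spectral density S = E(u_t u_t') with stacked moments
    u_t = [eps_t ; f_t eps_t], so the errors may be heteroskedastic and
    dependent on the factor, but are assumed uncorrelated over time.
    """
    T, N = Re.shape
    f = np.asarray(f).reshape(-1)
    X = np.column_stack([np.ones(T), f])
    coef = np.linalg.lstsq(X, Re, rcond=None)[0]
    eps = Re - X @ coef                           # T x N residuals
    u = np.hstack([eps, f[:, None] * eps])        # stacked moments, T x 2N
    S = u.T @ u / T
    M = np.array([[1.0, f.mean()], [f.mean(), (f ** 2).mean()]])
    d = -np.kron(M, np.eye(N))                    # d matrix from the text
    dinv = np.linalg.inv(d)
    V = dinv @ S @ dinv.T / T                     # var of [alpha_hat; beta_hat]
    return V[:N, :N]                              # upper-left block: var(alpha_hat)
```

The test α̂′var(α̂)⁻¹α̂ ∼ χ²_N can then be formed directly from the returned block.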

12.2 Cross-sectional regressions

We can fit

$$E(R^{ei})=\beta_i'\lambda+\alpha_i$$

by running a cross-sectional regression of average returns on the betas. This technique can be used whether the factor is a return or not.

I discuss OLS and GLS cross-sectional regressions, find formulas for the standard errors of λ̂, and give a χ² test of whether the α are jointly zero. I derive the distributions as an instance of GMM, and I show how to implement the same approach for autocorrelated and heteroskedastic errors. I show that the GLS cross-sectional regression is the same as the time-series regression when the factor is also an excess return and is included in the set of test assets.

[Figure 26. Cross-sectional regression. Average returns E(R^{ei}) are plotted against betas β_i, one dot per asset i; the fitted line has slope λ, and the vertical distances α_i from the line are the pricing errors.]

Start again with the K-factor model, written as

$$E(R^{ei})=\beta_i'\lambda;\quad i=1,2,\dots,N.$$

The central economic question is why average returns vary across assets; the expected return of an asset should be high if that asset has high betas or risk exposure to factors that carry high risk premia.

Figure 26 graphs the case of a single factor such as the CAPM. Each dot represents one

asset i. The model says that average returns should be proportional to betas, so plot the

sample average returns against the betas. Even if the model is true, this plot will not work out

perfectly in each sample, so there will be some spread as shown.

Given these facts, a natural idea is to run a cross-sectional regression to fit a line through the scatterplot of Figure 26. First, find estimates of the betas from a time-series regression,

$$R^{ei}_t=a_i+\beta_i'f_t+\varepsilon^i_t,\quad t=1,2,\dots,T\ \text{for each}\ i. \tag{12.174}$$

Then estimate the factor risk premia λ from a regression across assets of average returns on the betas,

$$E_T(R^{ei})=\beta_i'\lambda+\alpha_i,\quad i=1,2,\dots,N. \tag{12.175}$$

As in the figure, the β are the right-hand variables, the λ are the regression coefficients, and the cross-sectional regression residuals α_i are the pricing errors. This is also known as a two-pass regression estimate, because one estimates first time-series and then cross-sectional regressions.

You can run the cross-sectional regression with or without a constant. The theory says that the constant, or zero-beta excess return, should be zero. You can impose this restriction or estimate a constant and see if it turns out to be small. The usual tradeoff between efficiency (impose the null as much as possible to get efficient estimates) and robustness applies.

12.2.1 OLS cross-sectional regression

It will simplify notation to consider a single factor; the case of multiple factors looks the same with vectors in place of scalars. I denote vectors from 1 to N with missing sub- or superscripts, i.e. $\varepsilon_t=\begin{bmatrix}\varepsilon^1_t & \varepsilon^2_t & \cdots & \varepsilon^N_t\end{bmatrix}'$, $\beta=\begin{bmatrix}\beta_1 & \beta_2 & \cdots & \beta_N\end{bmatrix}'$, and similarly for R^e_t and α. For simplicity, take the case of no intercept in the cross-sectional regression.

With this notation, the OLS cross-sectional estimates are

$$\hat{\lambda}=\left(\beta'\beta\right)^{-1}\beta'E_T(R^e) \tag{12.176}$$
$$\hat{\alpha}=E_T(R^e)-\hat{\lambda}\beta.$$

Next, we need a distribution theory for the estimated parameters. The most natural place to start is with the standard OLS distribution formulas. I start with the traditional assumption that the true errors are i.i.d. over time and independent of the factors. This gives us some easily interpretable formulas, and we will see that most of these terms remain when we do the distribution theory right later on.

In an OLS regression Y = Xβ + u with E(uu′) = Ω, the covariance matrix of the β estimate is (X′X)⁻¹X′ΩX(X′X)⁻¹, and the residual covariance matrix is (I − X(X′X)⁻¹X′)Ω(I − X(X′X)⁻¹X′)′.

Denote Σ = E(ε_tε_t′). Since the α_i are just time-series averages of the true ε^i_t shocks (the average of the sample residuals is always zero), the errors in the cross-sectional regression have covariance matrix E(αα′) = (1/T)Σ. Thus the conventional OLS formulas for the covariance matrix of OLS estimates and residuals with correlated errors give

$$\sigma^{2}(\hat{\lambda})=\frac{1}{T}\left(\beta'\beta\right)^{-1}\beta'\Sigma\beta\left(\beta'\beta\right)^{-1} \tag{12.177}$$

$$\mathrm{cov}(\hat{\alpha})=\frac{1}{T}\left(I-\beta\left(\beta'\beta\right)^{-1}\beta'\right)\Sigma\left(I-\beta\left(\beta'\beta\right)^{-1}\beta'\right) \tag{12.178}$$

We could test whether all pricing errors are zero with the statistic

$$\hat{\alpha}'\mathrm{cov}(\hat{\alpha})^{-1}\hat{\alpha}\sim\chi^{2}_{N-1}. \tag{12.179}$$

The distribution is χ²_{N−1}, not χ²_N, because the covariance matrix is singular. The singularity and the extra terms in (12.178) result from the fact that the λ coefficient was estimated along the way, and mean that we have to use a generalized inverse. (If there are K factors, we obviously end up with χ²_{N−K}.)

A test of the residuals is unusual in OLS regressions. We do not usually test whether the residuals are "too large," since we have no information other than the residuals themselves about how large they should be. In this case, however, the first-stage time-series regression gives us some independent information about the size of cov(αα′), information that we could not get from looking at the cross-sectional residuals α themselves.
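The two-pass estimate (12.176) and the test (12.179) can be sketched in a few lines of numpy (illustrative names; the i.i.d.-error formulas, with a pseudo-inverse for the singular covariance matrix):

```python
import numpy as np

def two_pass_ols(Re, f):
    """Two-pass OLS: time-series betas, then ET(Re) on beta, no constant.

    Returns lambda_hat and alpha_hat from (12.176), and the chi-square
    statistic (12.179) built from cov(alpha_hat) in (12.178); the
    pseudo-inverse handles the singularity discussed in the text.
    """
    T, N = Re.shape
    f2 = np.atleast_2d(np.asarray(f).T).T         # T x K, even for a 1-d factor
    X = np.column_stack([np.ones(T), f2])
    coef = np.linalg.lstsq(X, Re, rcond=None)[0]
    beta = coef[1:].T                             # N x K time-series betas
    eps = Re - X @ coef
    Sigma = eps.T @ eps / T                       # residual covariance
    Rbar = Re.mean(axis=0)                        # ET(Re)
    lam = np.linalg.solve(beta.T @ beta, beta.T @ Rbar)       # (12.176)
    alpha = Rbar - beta @ lam
    P = np.eye(N) - beta @ np.linalg.solve(beta.T @ beta, beta.T)
    cov_alpha = P @ Sigma @ P.T / T               # (12.178)
    stat = alpha @ np.linalg.pinv(cov_alpha) @ alpha          # ~ chi2_{N-K}
    return lam, alpha, stat
```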

12.2.2 GLS cross-sectional regression

Since the residuals in the cross-sectional regression (12.175) are correlated with each other, standard textbook advice is to run a GLS cross-sectional regression rather than OLS, using E(αα′) = (1/T)Σ as the error covariance matrix:

$$\hat{\lambda}=\left(\beta'\Sigma^{-1}\beta\right)^{-1}\beta'\Sigma^{-1}E_T(R^e) \tag{12.180}$$
$$\hat{\alpha}=E_T(R^e)-\hat{\lambda}\beta.$$

The standard regression formulas give the variance of these estimates as

$$\sigma^{2}(\hat{\lambda})=\frac{1}{T}\left(\beta'\Sigma^{-1}\beta\right)^{-1} \tag{12.181}$$
$$\mathrm{cov}(\hat{\alpha})=\frac{1}{T}\left(\Sigma-\beta\left(\beta'\Sigma^{-1}\beta\right)^{-1}\beta'\right) \tag{12.182}$$

The comments of section 11.5, warning that OLS is sometimes much more robust than GLS, apply in this case. The GLS regression should improve efficiency, i.e. give more precise estimates. However, Σ may be hard to estimate and to invert, especially if the cross-section N is large. One may well choose the robustness of OLS over the asymptotic statistical advantages of GLS.

A GLS regression can be understood as a transformation of the space of returns, to focus attention on the statistically most informative portfolios. Finding (say, by Choleski decomposition) a matrix C such that CC′ = Σ⁻¹, the GLS regression is the same as an OLS regression of CE_T(R^e) on Cβ, i.e. of testing the model on the portfolios CR^e. The statistically most informative portfolios are those with the lowest residual variance Σ. But this asymptotic statistical theory assumes that the covariance matrix has converged to its true value. In most samples, the ex-post or sample mean-variance frontier still seems to indicate lots of luck, and this is especially true if the cross-section is large, anything more than 1/10 of the time series. The portfolios CR^e are likely to contain many extreme long-short positions.

Again, we could test the hypothesis that all the α are equal to zero with (12.179). Though the appearance of the statistic is the same, the covariance matrix is smaller, reflecting the greater power of the GLS test. As with the J_T test (11.152), we can develop an equivalent test that does not require a generalized inverse:

$$T\hat{\alpha}'\Sigma^{-1}\hat{\alpha}\sim\chi^{2}_{N-1}. \tag{12.183}$$

To derive (12.183), I proceed exactly as in the derivation of the J_T test (11.152). Define, say by Choleski decomposition, a matrix C such that CC′ = Σ⁻¹. Now, find the covariance matrix of √T C′α̂:

$$\mathrm{cov}\left(\sqrt{T}\,C'\hat{\alpha}\right)=C'\left(\left(CC'\right)^{-1}-\beta\left(\beta'\Sigma^{-1}\beta\right)^{-1}\beta'\right)C=I-\delta\left(\delta'\delta\right)^{-1}\delta'$$

where δ = C′β. In sum, α̂ is asymptotically normal, so √T C′α̂ is asymptotically normal; cov(√T C′α̂) is an idempotent matrix with rank N − 1; therefore Tα̂′CC′α̂ = Tα̂′Σ⁻¹α̂ is χ²_{N−1}.
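The GLS estimate (12.180) and the test (12.183) can be sketched directly in numpy (names are my own; inputs would come from a first-pass time-series regression):

```python
import numpy as np

def cross_sectional_gls(Rbar, beta, Sigma, T):
    """GLS cross-sectional regression, formulas (12.180)-(12.183).

    Rbar:  N-vector of sample mean excess returns, ET(Re).
    beta:  N x K matrix of time-series betas.
    Sigma: N x N residual covariance from the time-series regressions.
    Returns lambda_hat, alpha_hat, var(lambda_hat), and the statistic
    T alpha' Sigma^-1 alpha, chi-square with N-K degrees of freedom
    (N-1 for a single factor).
    """
    Sib = np.linalg.solve(Sigma, beta)            # Sigma^-1 beta
    A = beta.T @ Sib                              # beta' Sigma^-1 beta
    lam = np.linalg.solve(A, Sib.T @ Rbar)        # (12.180)
    alpha = Rbar - beta @ lam
    var_lam = np.linalg.inv(A) / T                # (12.181)
    stat = T * alpha @ np.linalg.solve(Sigma, alpha)          # (12.183)
    return lam, alpha, var_lam, stat
```

A useful sanity check: the GLS pricing errors satisfy β′Σ⁻¹α̂ = 0, the first-order condition of the weighted regression.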

12.2.3 Correction for the fact that β are estimated, and GMM formulas that don't need i.i.d. errors.

In applying standard OLS formulas to a cross-sectional regression, we assume that the right-hand variables β are fixed. The β in the cross-sectional regression are not fixed, of course, but are estimated in the time-series regression. This turns out to matter, even as T → ∞. In this section, I derive the correct asymptotic standard errors. With the simplifying assumption that the errors ε are i.i.d. over time and independent of the factors, the result is

$$\sigma^{2}(\hat{\lambda}_{OLS})=\frac{1}{T}\left[\left(\beta'\beta\right)^{-1}\beta'\Sigma\beta\left(\beta'\beta\right)^{-1}\left(1+\lambda'\Sigma_f^{-1}\lambda\right)+\Sigma_f\right] \tag{12.184}$$
$$\sigma^{2}(\hat{\lambda}_{GLS})=\frac{1}{T}\left[\left(\beta'\Sigma^{-1}\beta\right)^{-1}\left(1+\lambda'\Sigma_f^{-1}\lambda\right)+\Sigma_f\right]$$

where Σ_f is the variance-covariance matrix of the factors. This correction is due to Shanken (1992). Comparing these standard errors to (12.177) and (12.181), we see that there is a multiplicative correction (1 + λ′Σ_f⁻¹λ) and an additive correction Σ_f.

The asymptotic variance-covariance matrix of the pricing errors is

$$\mathrm{cov}(\hat{\alpha}_{OLS})=\frac{1}{T}\left(I_N-\beta\left(\beta'\beta\right)^{-1}\beta'\right)\Sigma\left(I_N-\beta\left(\beta'\beta\right)^{-1}\beta'\right)\left(1+\lambda'\Sigma_f^{-1}\lambda\right) \tag{12.185}$$
$$\mathrm{cov}(\hat{\alpha}_{GLS})=\frac{1}{T}\left(\Sigma-\beta\left(\beta'\Sigma^{-1}\beta\right)^{-1}\beta'\right)\left(1+\lambda'\Sigma_f^{-1}\lambda\right) \tag{12.186}$$

Comparing these results to (12.178) and (12.182), we see that the same multiplicative correction applies.
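In code, the corrections amount to a one-line adjustment of the uncorrected covariances; this sketch takes those as inputs (my own arrangement of (12.184) through (12.186), with illustrative names):

```python
import numpy as np

def shanken_correct(var_lam_0, cov_alpha_0, lam, Sigma_f, T):
    """Apply the Shanken (1992) corrections of (12.184)-(12.186).

    var_lam_0, cov_alpha_0: uncorrected covariances from (12.177)/(12.178)
    or (12.181)/(12.182).  lam: K-vector of risk premia.  Sigma_f: K x K
    factor covariance matrix.  Returns the corrected covariances.
    """
    c = 1.0 + lam @ np.linalg.solve(Sigma_f, lam)   # multiplicative term
    var_lam = var_lam_0 * c + Sigma_f / T           # additive Sigma_f / T term
    cov_alpha = cov_alpha_0 * c                     # pricing errors: scale only
    return var_lam, cov_alpha
```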

We can form the asymptotic χ² test of the pricing errors by dividing the pricing errors by their variance-covariance matrix, α̂′cov(α̂)⁻¹α̂. Following (12.183), we can simplify this result for the GLS pricing errors, resulting in

$$T\left(1+\lambda'\Sigma_f^{-1}\lambda\right)^{-1}\hat{\alpha}_{GLS}'\Sigma^{-1}\hat{\alpha}_{GLS}\sim\chi^{2}_{N-K}. \tag{12.187}$$

Are the corrections important relative to the simple OLS formulas given above? In the CAPM, λ = E(R^{em}), so λ²/σ²(R^{em}) ≈ (0.08/0.16)² = 0.25 in annual data. In annual data, then, the multiplicative term is too large to ignore. However, the mean and variance both scale with horizon, so the Sharpe ratio scales with the square root of the horizon. Therefore, at a monthly interval λ²/σ²(R^{em}) ≈ 0.25/12 ≈ 0.02, which is quite small, and ignoring the multiplicative term makes little difference.

The additive term in the standard error of λ̂ can be very important. Consider a one-factor model, suppose all the β are 1.0, suppose all the residuals are uncorrelated so Σ is diagonal, suppose all assets have the same residual variance σ²(ε), and ignore the multiplicative term. Now we can write either covariance matrix in (12.184) as

$$\sigma^{2}(\hat{\lambda})=\frac{1}{T}\left[\frac{1}{N}\sigma^{2}(\varepsilon)+\sigma^{2}(f)\right].$$

Even with N = 1, most factor models have fairly high R², so σ²(ε) < σ²(f). Typical CAPM values of R² = 1 − σ²(ε)/σ²(f) for large portfolios are 0.6-0.7, and multifactor models such as the Fama-French three-factor model often have R² over 0.9. Typical numbers of assets, N = 10 to 50, make the first term vanish compared to the second term.

More generally, suppose the factor were in fact a return. Then the factor risk premium is λ = E(f), and we would use Σ_f/T as the sampling variance of λ̂. This is the "correction" term in (12.184), so we expect it to be, in fact, the most important term. Note that Σ_f/T is the sampling variance of the mean of f; thus, in the case that the return is a factor, so E(f) = λ, this is the only term you would use. This example suggests that Σ_f is not just an important correction; it is likely to be the dominant consideration in the sampling error of λ̂.

Comparing (12.187) to the GRS tests for a time-series regression, (12.168), (12.169), and (12.171), we see the same statistic. The only difference is that, by estimating λ from the cross-section rather than imposing λ = E(f), the cross-sectional regression loses degrees of freedom equal to the number of factors.

Though these formulas are standard classics, I emphasize that we don't have to make the severe assumptions on the error terms that are used to derive them. As in the time-series case, I derive a general formula for the distribution of λ̂ and α̂, and only at the last moment make the classic error-term assumptions to make the spectral density matrix pretty.

Derivation and formulas that don't require i.i.d. errors.

The easy and elegant way to account for the effects of "generated regressors" such as the β in the cross-sectional regression is to map the whole thing into GMM. Then we treat the moments that generate the regressors β at the same time as the moments that generate the cross-sectional regression coefficient λ, and the covariance matrix S between the two sets of moments captures the effects of generating the regressors on the standard error of the cross-sectional regression coefficients. Comparing this straightforward derivation with the difficulty of Shanken's (1992) paper that originally derived the corrections for λ̂, and noting that Shanken did not go on to find the formulas (12.185) that allow a test of the pricing errors, is a nice argument for the simplicity and power of the GMM framework.

To keep the algebra manageable, I treat the case of a single factor. The moments are

$$g_T(b)=\begin{bmatrix}E\left(R^e_t-a-\beta f_t\right)\\ E\left[\left(R^e_t-a-\beta f_t\right)f_t\right]\\ E\left(R^e-\beta\lambda\right)\end{bmatrix}=\begin{bmatrix}0\\0\\0\end{bmatrix} \tag{12.188}$$

The top two moment conditions exactly identify a and β as the time-series OLS estimates. (Note a, not α: the time-series intercept is not necessarily equal to the pricing error in a cross-sectional regression.) The bottom moment condition is the asset pricing model. It is in general overidentified in a sample, since there is only one extra parameter (λ) and N extra moment conditions. If we use a weighting vector β′ on this condition, we obtain the OLS cross-sectional estimate of λ. If we use a weighting vector β′Σ⁻¹, we obtain the GLS cross-sectional estimate of λ. To accommodate both cases, use a weighting vector γ′, and then substitute γ′ = β′, γ′ = β′Σ⁻¹, etc. at the end.

The standard errors for λ̂ come straight from the general GMM standard error formula (11.146). The α̂ are not parameters, but are the last N moments. Their covariance matrix is thus given by the GMM formula (11.147) for the sample variation of the g_T.

All we have to do is map the problem into the GMM notation. The parameter vector is

$$b'=\begin{bmatrix}a' & \beta' & \lambda\end{bmatrix}.$$

The a matrix chooses which moment conditions are set to zero in estimation,

$$a=\begin{bmatrix}I_{2N} & 0\\ 0 & \gamma'\end{bmatrix}.$$

The d matrix is the sensitivity of the moment conditions to the parameters,

$$d=\frac{\partial g_T}{\partial b'}=\begin{bmatrix}-I_N & -I_N E(f) & 0\\ -I_N E(f) & -I_N E(f^2) & 0\\ 0 & -\lambda I_N & -\beta\end{bmatrix}.$$

The S matrix is the long-run covariance matrix of the moments,

$$S=\sum_{j=-\infty}^{\infty}E\left(\begin{bmatrix}R^e_t-a-\beta f_t\\ \left(R^e_t-a-\beta f_t\right)f_t\\ R^e_t-\beta\lambda\end{bmatrix}\begin{bmatrix}R^e_{t-j}-a-\beta f_{t-j}\\ \left(R^e_{t-j}-a-\beta f_{t-j}\right)f_{t-j}\\ R^e_{t-j}-\beta\lambda\end{bmatrix}'\right)$$

$$=\sum_{j=-\infty}^{\infty}E\left(\begin{bmatrix}\varepsilon_t\\ \varepsilon_t f_t\\ \beta\left(f_t-Ef\right)+\varepsilon_t\end{bmatrix}\begin{bmatrix}\varepsilon_{t-j}\\ \varepsilon_{t-j}f_{t-j}\\ \beta\left(f_{t-j}-Ef\right)+\varepsilon_{t-j}\end{bmatrix}'\right).$$

In the second expression, I have used the regression model and the restriction under the null that E(R^e_t) = βλ. In calculations, of course, you could simply estimate the first expression.

We are done. We have the ingredients to calculate the GMM standard error formula (11.146) and the formula for the covariance of the moments (11.147).

We can recover the classic formulas (12.184), (12.185), and (12.186) by adding the assumption that the errors are i.i.d. and independent of the factors, and that the factors are uncorrelated over time as well. The assumption that the errors and factors are uncorrelated over time means we can ignore the lead and lag terms. Thus, the top left corner is E(ε_tε_t′) = Σ. The assumption that the errors are independent of the factors f_t simplifies the terms in which ε_t and f_t are multiplied: E(ε_t(ε_t′f_t)) = E(f)Σ, for example. The result is

$$S=\begin{bmatrix}\Sigma & E(f)\Sigma & \Sigma\\ E(f)\Sigma & E(f^2)\Sigma & E(f)\Sigma\\ \Sigma & E(f)\Sigma & \beta\beta'\sigma^{2}(f)+\Sigma\end{bmatrix}.$$

Multiplying a, d, and S together as specified by the GMM formula for the covariance matrix of the parameters (11.146), we obtain the covariance matrix of all the parameters, and its (3,3) element gives the variance of λ̂. Multiplying the terms together as specified by (11.147), we obtain the sampling distribution of the α̂, (12.185). The formulas (12.184) reported above are derived in the same way with a vector of factors f_t rather than a scalar; the second moment condition in (12.188) then reads $E\left[\left(R^e_t-a-\beta f_t\right)\otimes f_t\right]$. The matrix multiplication is not particularly enlightening.

Once again, there is really no need to assume that the errors are i.i.d., and especially that they are conditionally homoskedastic, i.e. that the factor f and the errors ε are independent. It is quite easy to estimate an S matrix that does not impose these conditions and calculate standard errors. They will not have the pretty analytic form given above, but they will more closely report the true sampling uncertainty of the estimate. Furthermore, if one is really interested in efficiency, the GLS cross-sectional estimate should use the spectral density matrix as the weighting matrix rather than Σ⁻¹.
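For example, a sample version of the S matrix built from the stacked moments of (12.188) can be estimated directly from the data without imposing these conditions; the sketch below (names are my own) computes the lag-zero version, with optional Bartlett-weighted lags, a simple Newey-West scheme, for autocorrelation:

```python
import numpy as np

def sample_S(Re, f, a, beta, lam, lags=0):
    """Sample long-run covariance of the stacked moments in (12.188).

    u_t = [eps_t ; eps_t f_t ; Re_t - beta*lam], eps_t = Re_t - a - beta f_t.
    lags=0 allows heteroskedasticity and error-factor dependence; lags>0
    adds Bartlett-weighted autocovariance terms.
    """
    f = np.asarray(f).reshape(-1)
    eps = Re - a - np.outer(f, beta)              # time-series residuals
    u = np.hstack([eps, f[:, None] * eps, Re - beta * lam])   # T x 3N
    T = u.shape[0]
    S = u.T @ u / T                               # lag-zero term
    for j in range(1, lags + 1):
        w = 1.0 - j / (lags + 1)                  # Bartlett weight
        Gj = u[j:].T @ u[:-j] / T                 # j-th autocovariance
        S += w * (Gj + Gj.T)
    return S
```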


12.2.4 Time series vs. cross-section

How are the time-series and cross-sectional approaches different?

Most importantly, you can run the cross-sectional regression when the factor is not a return. The time-series test requires factors that are also returns, so that you can estimate factor risk premia by λ̂ = E_T(f). The asset pricing model does predict a restriction on the intercepts in the time-series regression. Why not just test those? If you impose the restriction E(R^{ei}) = β_i′λ, you can write the time-series regression (12.174) as

$$R^{ei}_t=\beta_i'\lambda+\beta_i'\left(f_t-E(f)\right)+\varepsilon^i_t,\quad t=1,2,\dots,T\ \text{for each}\ i.$$

Comparing this with (12.174), you see that the intercept restriction is

$$a_i=\beta_i'\left(\lambda-E(f)\right).$$

This restriction makes sense. The model says that mean returns should be proportional to betas, and the intercept in the time-series regression controls the mean return. You can also see how λ = E(f) results in a zero intercept. Finally, however, you see that without an estimate of λ, you cannot check this intercept restriction. If the factor is not a return, you will be forced to do something like a cross-sectional regression.

When the factor is a return, so that we can compare the two methods, they are not necessarily the same. The time-series regression estimates the factor risk premium as the sample mean of the factor. Hence, the factor receives a zero pricing error. The predicted zero-beta excess return is also zero. Thus, the time-series regression describes the cross-section of expected returns by drawing a line as in Figure 26 that runs through the origin and through the factor, ignoring all of the other points. The OLS cross-sectional regression picks the slope, and the intercept if you include one, to best fit all the points, i.e. to minimize the sum of squares of all the pricing errors.

If the factor is a return, the GLS cross-sectional regression, including the factor as a test asset, is identical to the time-series regression. The time-series regression for the factor is, of course,

$$f_t=0+1\cdot f_t+0,$$

so it has a zero intercept, a beta equal to one, and a zero residual in every sample. The residual variance-covariance matrix of the returns, including the factor, is

$$E\left(\begin{bmatrix}R^e-a-\beta f\\ f-0-1\cdot f\end{bmatrix}\left[\,\cdot\,\right]'\right)=\begin{bmatrix}\Sigma & 0\\ 0 & 0\end{bmatrix}.$$

Since the factor has zero residual variance, a GLS regression puts all its weight on that asset. Therefore, λ̂ = E_T(f), just as for the time-series regression. The pricing errors are the same, as is their distribution and the χ² test. (You gain a degree of freedom by adding the factor to the cross-sectional regression, so the test is a χ²_N.)


Why does the "efficient" technique ignore the pricing errors of all of the other assets in estimating the factor risk premium, and focus only on the mean return? The answer is simple, though subtle. In the regression model

$$R^e_t=a+\beta f_t+\varepsilon_t,$$

the average return of each asset in a sample is equal to beta times the average return of the factor in the sample, plus the average residual in the sample. An average return carries no additional information about the mean of the factor: a signal plus noise carries no information beyond that in the signal itself. Thus, an "efficient" cross-sectional regression wisely ignores all the information in the other asset returns and uses only the information in the factor return to estimate the factor risk premium.
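A small simulation makes the comparison concrete (entirely illustrative numbers): when the factor is itself an excess return, the time-series approach sets λ̂ = E_T(f), while the OLS cross-sectional regression fits a line through all the points.

```python
import numpy as np

# Simulated one-factor data in which the factor is an excess return.
rng = np.random.default_rng(5)
T, N, lam_true = 600, 8, 0.5
beta = rng.uniform(0.5, 1.5, size=N)
f = lam_true + rng.normal(size=T)                 # factor return with E(f) = lambda
Re = np.outer(f, beta) + 0.5 * rng.normal(size=(T, N))

# Time-series approach: the risk premium is the factor's sample mean.
lam_ts = f.mean()

# OLS cross-sectional approach: regress average returns on estimated betas.
X = np.column_stack([np.ones(T), f])
b = np.linalg.lstsq(X, Re, rcond=None)[0][1]      # time-series betas
lam_cs = (b @ Re.mean(axis=0)) / (b @ b)          # slope through the origin

# The two estimates agree up to sampling error but are generally not equal.
```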

12.3 Fama-MacBeth Procedure

I introduce the Fama-MacBeth procedure for running cross-sectional regressions and for calculating standard errors that correct for cross-sectional correlation in a panel. I show that, when the right-hand variables do not vary over time, Fama-MacBeth is numerically equivalent to pooled time-series, cross-section OLS with standard errors corrected for cross-sectional correlation, and also to a single cross-sectional regression on time-series averages with standard errors corrected for cross-sectional correlation. Fama-MacBeth standard errors do not include corrections for the fact that the betas are also estimated.

Fama and MacBeth (1973) suggest an alternative procedure for running cross-sectional regressions, and for producing standard errors and test statistics. This is a historically important procedure, it is computationally simple to implement, and it is still widely used, so it is important to understand it and to relate it to other procedures.

First, you find beta estimates with a time-series regression. Fama and MacBeth use rolling five-year regressions, but one can also use the technique with full-sample betas, and I will consider that simpler case. Second, instead of estimating a single cross-sectional regression with the sample averages, we now run a cross-sectional regression at each time period, i.e.

$$R^{ei}_t=\beta_i'\lambda_t+\alpha_{it},\quad i=1,2,\dots,N\ \text{for each}\ t.$$

I write the case of a single factor for simplicity, but it is easy to extend the model to multiple factors. Then, Fama and MacBeth suggest that we estimate λ and α_i as the averages of the cross-sectional regression estimates,