\[
\hat{\lambda} = \frac{1}{T}\sum_{t=1}^{T}\hat{\lambda}_t; \qquad \hat{\alpha}_i = \frac{1}{T}\sum_{t=1}^{T}\hat{\alpha}_{it}.
\]


SECTION 12.3 FAMA-MACBETH PROCEDURE

Most importantly, they suggest that we use the standard deviations of the cross-sectional

regression estimates to generate the sampling errors for these estimates,

\[
\sigma^2(\hat{\lambda}) = \frac{1}{T^2}\sum_{t=1}^{T}\left(\hat{\lambda}_t - \hat{\lambda}\right)^2; \qquad \sigma^2(\hat{\alpha}_i) = \frac{1}{T^2}\sum_{t=1}^{T}\left(\hat{\alpha}_{it} - \hat{\alpha}_i\right)^2.
\]

It's 1/T² because we're finding standard errors of sample means, σ²/T.

This is an intuitively appealing procedure once you stop to think about it. Sampling error is, after all, about how a statistic would vary from one sample to the next if we repeated the observations. We can't do that with only one sample, but why not cut the sample in half, and deduce how a statistic would vary from one full sample to the next from how it varies from the first half of the sample to the next half? Proceeding, why not cut the sample in fourths, eighths, and so on? The Fama-MacBeth procedure carries this idea to its logical conclusion, using the variation in the statistic λ̂_t over time to deduce its sampling variation.
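As a concrete illustration, here is a minimal sketch of the procedure on simulated data. Everything below (the simulated panel, the parameter values, the variable names) is my own illustrative construction, not part of the text, and the cross-sectional regressions are run without an intercept for simplicity.

```python
import numpy as np

# Simulated panel: N assets with fixed betas, T months, true premium lam_true.
rng = np.random.default_rng(0)
N, T, lam_true = 25, 600, 0.5
beta = rng.uniform(0.5, 1.5, N)
R = beta * lam_true + rng.normal(0.0, 2.0, (T, N))   # returns, T x N

# Step 1: a cross-sectional regression of returns on betas at every date t.
X = beta[:, None]                                     # N x 1 regressor matrix
lam_t = np.array([np.linalg.lstsq(X, R[t], rcond=None)[0][0] for t in range(T)])

# Step 2: the estimate is the time average; the standard error comes from the
# time-series variation of the estimates: sigma^2 = (1/T^2) sum (lam_t - lam)^2.
lam_hat = lam_t.mean()
se_lam = lam_t.std(ddof=0) / np.sqrt(T)
```

With these numbers the time average recovers a premium close to 0.5, and the standard error is simply the standard deviation of the date-by-date estimates divided by √T.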

We are used to deducing the sampling variance of the sample mean of a series x_t by looking at the variation of x_t through time in the sample, using
\[
\sigma^2(\bar{x}) = \frac{\sigma^2(x)}{T} = \frac{1}{T^2}\sum_t \left(x_t - \bar{x}\right)^2.
\]
The Fama-MacBeth technique just applies this idea to the slope and pricing error estimates. The formula assumes that the time series is not autocorrelated, but one could easily extend the idea to estimates λ̂_t that are correlated over time by using a long-run variance matrix, i.e., estimate

\[
\sigma^2(\hat{\lambda}) = \frac{1}{T}\sum_{j=-\infty}^{\infty} \operatorname{cov}_T\!\left(\hat{\lambda}_t, \hat{\lambda}_{t-j}'\right).
\]

One should of course use some sort of weighting matrix or a parametric description of the autocorrelations of λ̂, as explained in section 11.7. Asset return data are usually not highly correlated, but accounting for such correlation could have a big effect on the application of the Fama-MacBeth technique to corporate finance data or other regressions in which the cross-sectional estimates are highly correlated over time.
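A truncated, kernel-weighted version of this infinite sum is the usual way to estimate such a long-run variance in practice. The sketch below uses Bartlett weights on a scalar series of cross-sectional estimates; the function name, the lag choice, and the simulated data are my own illustrative assumptions.

```python
import numpy as np

def long_run_se(lam_t, lags=4):
    """Standard error of mean(lam_t) from a Bartlett-weighted long-run variance:
    sigma^2(lam_hat) = (1/T) sum_j w_j cov_T(lam_t, lam_{t-j}), truncated at `lags`."""
    lam_t = np.asarray(lam_t, dtype=float)
    T = lam_t.size
    d = lam_t - lam_t.mean()
    lrv = d @ d / T                            # lag-0 autocovariance
    for j in range(1, lags + 1):
        w = 1.0 - j / (lags + 1)               # Bartlett weight
        lrv += 2.0 * w * (d[j:] @ d[:-j] / T)  # lags +j and -j contribute equally
    return float(np.sqrt(lrv / T))

# Sanity check: with serially uncorrelated data the corrected standard error
# should be close to the plain sigma/sqrt(T) formula.
rng = np.random.default_rng(1)
x = rng.normal(size=2000)
se_nw = long_run_se(x)
se_iid = x.std(ddof=0) / np.sqrt(x.size)
```

For an uncorrelated series the two standard errors roughly agree; the correction only matters when the λ̂_t are persistent.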

It is natural to use this sampling theory to test whether all the pricing errors are jointly zero, as we have before. Denote by α the vector of pricing errors across assets. We could estimate the covariance matrix of the sample pricing errors by

\[
\hat{\alpha} = \frac{1}{T}\sum_{t=1}^{T}\hat{\alpha}_t
\]
\[
\operatorname{cov}(\hat{\alpha}) = \frac{1}{T^2}\sum_{t=1}^{T}\left(\hat{\alpha}_t - \hat{\alpha}\right)\left(\hat{\alpha}_t - \hat{\alpha}\right)'
\]

(or a general version that accounts for correlation over time) and then use the test
\[
\hat{\alpha}'\operatorname{cov}(\hat{\alpha})^{-1}\hat{\alpha} \sim \chi^2_{N-1}.
\]
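A sketch of this test on simulated pricing-error series follows. Note one deliberate simplification: in this simulation nothing is estimated in the cross section, so the statistic has roughly N degrees of freedom rather than the N − 1 of the text; all names and numbers are illustrative.

```python
import numpy as np

def alpha_chi2(alpha_t):
    """Joint test statistic alpha' cov(alpha)^{-1} alpha from a T x N series
    of date-by-date pricing error estimates."""
    T = alpha_t.shape[0]
    a_bar = alpha_t.mean(axis=0)                 # sample mean pricing errors
    d = alpha_t - a_bar
    cov_a = d.T @ d / T**2                       # (1/T^2) sum (a_t - a)(a_t - a)'
    return a_bar @ np.linalg.solve(cov_a, a_bar)

# Under the null of zero true pricing errors, the statistic is approximately
# chi-squared; its average over many simulated panels should be near N.
rng = np.random.default_rng(2)
N, T = 10, 1000
stats_sim = [alpha_chi2(rng.normal(size=(T, N))) for _ in range(200)]
mean_stat = float(np.mean(stats_sim))
```

The average statistic across simulations sits near the degrees of freedom, as a χ² variable should.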


CHAPTER 12 REGRESSION-BASED TESTS OF LINEAR FACTOR MODELS

12.3.1 Fama-MacBeth in depth

The GRS procedure and the formulas given above for a single cross-sectional regression are familiar from any course in regression. We will see them justified by maximum likelihood below. The Fama-MacBeth procedure seems unlike anything you've seen in any econometrics course, and it is obviously a useful and simple technique that can be widely used in panel data in economics and corporate finance as well as asset pricing. Is it truly different? Is there something different about asset pricing data that requires a fundamentally new technique not taught in standard regression courses? Or is it similar to standard techniques? To answer these questions it is worth looking in a little more detail at what it accomplishes and why.

It's easier to do this in a more standard setup, with left-hand variable y and right-hand variable x. Consider a regression
\[
y_{it} = \beta' x_{it} + \varepsilon_{it}, \quad i = 1, 2, \dots, N;\ t = 1, 2, \dots, T.
\]

The data in this regression have a cross-sectional element as well as a time-series element. In corporate finance, for example, one might be interested in the relationship between investment and financial variables, and the data set has many firms (N) as well as time-series observations for each firm (T). In an expected return-beta asset pricing model, the x_{it} stand in for the β_i, and β stands for λ.

An obvious thing to do in this context is simply to stack the i and t observations together and estimate β by OLS. I will call this the pooled time-series cross-section estimate. However, the error terms are not likely to be uncorrelated with each other. In particular, the error terms are likely to be cross-sectionally correlated at a given time. If one stock's return is unusually high this month, another stock's return is also likely to be high; if one firm invests an unusually great amount this year, another firm is also likely to do so. When errors are correlated, OLS is still consistent, but the OLS distribution theory is wrong, and typically suggests standard errors that are much too small. In the extreme case that the N errors are perfectly correlated at each time period, there is really only one observation for each time period, so one really has T rather than NT observations. Therefore, a real pooled time-series cross-section estimate must include corrected standard errors. People often ignore this fact and report OLS standard errors.

Another thing we could do is first take time-series averages and then run a pure cross-sectional regression of
\[
E_T(y_{it}) = \beta' E_T(x_{it}) + u_i, \quad i = 1, 2, \dots, N.
\]
This procedure would lose any information due to variation of the x_{it} over time, but at least it might be easier to figure out a variance-covariance matrix for u_i and correct the standard errors for residual correlation. (You could also average cross-sectionally and then run a single time-series regression. We'll get to that option later.)

In either case, the standard error corrections are just applications of the standard formula



for OLS regressions with correlated error terms.

Finally, we could run the Fama-MacBeth procedure: run a cross-sectional regression at each point in time; average the cross-sectional β̂_t estimates to get an estimate β̂, and use the time-series standard deviation of β̂_t to estimate the standard error of β̂.

It turns out that the Fama-MacBeth procedure is just another way of calculating the standard errors, corrected for cross-sectional correlation:

Proposition: If the x_{it} variables do not vary over time, and if the errors are cross-sectionally correlated but not correlated over time, then the Fama-MacBeth estimate, the pure cross-sectional OLS estimate, and the pooled time-series cross-sectional OLS estimate are identical. Also, the Fama-MacBeth standard errors are identical to the cross-sectional regression or stacked OLS standard errors, corrected for residual correlation. None of these relations hold if the x vary through time.

Since they are identical procedures, whether one calculates estimates and standard errors

in one way or the other is a matter of taste.

I emphasize one procedure that is incorrect: pooled time-series and cross-section OLS with no correction of the standard errors. The errors are so highly cross-sectionally correlated in most finance applications that the standard errors so computed are often off by a factor of 10.

The assumption that the errors are not correlated over time is probably not so bad for asset pricing applications, since returns are close to independent. However, when pooled time-series cross-section regressions are used in corporate finance applications, errors are likely to be as severely correlated over time as across firms, if not more so. The "other factors" (ε) that cause, say, company i to invest more at time t than predicted by a set of right-hand variables are surely correlated with the other factors that cause company j to invest more. But such factors are especially likely to cause company i to invest more tomorrow as well. In this case, any standard errors must also correct for serial correlation in the errors; the GMM-based formulas in section 11.4 can do this easily.

The Fama-MacBeth standard errors also do not correct for the fact that the β̂ are generated regressors. If one is going to use them, it is a good idea to at least calculate the Shanken correction factors outlined above, and check that the corrections are not large.

Proof: We just have to write out the three approaches and compare them. Having assumed that the x variables do not vary over time, the regression is
\[
y_{it} = x_i'\beta + \varepsilon_{it}.
\]

We can stack up the cross-sections i = 1...N and write the regression as
\[
y_t = x\beta + \varepsilon_t.
\]
x is now a matrix with the x_i' as rows. The error assumptions mean E(ε_t ε_t') = Σ.



Pooled OLS: To run pooled OLS, we stack the time series and cross sections by writing
\[
Y = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_T \end{bmatrix}; \quad
X = \begin{bmatrix} x \\ x \\ \vdots \\ x \end{bmatrix}; \quad
\epsilon = \begin{bmatrix} \varepsilon_1 \\ \varepsilon_2 \\ \vdots \\ \varepsilon_T \end{bmatrix}
\]
and then
\[
Y = X\beta + \epsilon
\]
with
\[
E(\epsilon\epsilon') = \Omega = \begin{bmatrix} \Sigma & & \\ & \ddots & \\ & & \Sigma \end{bmatrix}.
\]

The estimate and its standard error are then
\[
\hat{\beta}_{OLS} = (X'X)^{-1}X'Y
\]
\[
\operatorname{cov}(\hat{\beta}_{OLS}) = (X'X)^{-1}X'\Omega X(X'X)^{-1}.
\]
Writing this out from the definitions of the stacked matrices, with X'X = T x'x,
\[
\hat{\beta}_{OLS} = (x'x)^{-1}x'E_T(y_t)
\]
\[
\operatorname{cov}(\hat{\beta}_{OLS}) = \frac{1}{T}(x'x)^{-1}(x'\Sigma x)(x'x)^{-1}.
\]

We can estimate this sampling variance with
\[
\hat{\Sigma} = E_T\left(\hat{\varepsilon}_t\hat{\varepsilon}_t'\right); \qquad \hat{\varepsilon}_t \equiv y_t - x\hat{\beta}_{OLS}.
\]

Pure cross-section: The pure cross-sectional estimator runs one cross-sectional regression of the time-series averages. So, take those averages,
\[
E_T(y_t) = x\beta + E_T(\varepsilon_t)
\]
where x = E_T(x) since x is constant. Having assumed i.i.d. errors over time, the error covariance matrix is
\[
E\left(E_T(\varepsilon_t)\,E_T(\varepsilon_t')\right) = \frac{1}{T}\Sigma.
\]

The cross-sectional estimate and corrected standard errors are then
\[
\hat{\beta}_{XS} = (x'x)^{-1}x'E_T(y_t)
\]
\[
\sigma^2(\hat{\beta}_{XS}) = \frac{1}{T}(x'x)^{-1}x'\Sigma x(x'x)^{-1}.
\]



Thus, the cross-sectional and pooled OLS estimates and standard errors are exactly the same,

in each sample.

Fama-MacBeth: The Fama-MacBeth estimator is formed by first running the cross-sectional regression at each moment in time,
\[
\hat{\beta}_t = (x'x)^{-1}x'y_t.
\]

Then the estimate is the average of the cross-sectional regression estimates,
\[
\hat{\beta}_{FM} = E_T\left(\hat{\beta}_t\right) = (x'x)^{-1}x'E_T(y_t).
\]
Thus, the Fama-MacBeth estimator is also the same as the OLS estimator, in each sample.

The Fama-MacBeth standard error is based on the time-series standard deviation of the β̂_t. Using cov_T to denote the sample covariance,
\[
\operatorname{cov}\left(\hat{\beta}_{FM}\right) = \frac{1}{T}\operatorname{cov}_T\left(\hat{\beta}_t\right) = \frac{1}{T}(x'x)^{-1}x'\operatorname{cov}_T(y_t)\,x(x'x)^{-1}.
\]

With
\[
y_t = x\hat{\beta}_{FM} + \hat{\varepsilon}_t
\]
we have
\[
\operatorname{cov}_T(y_t) = E_T\left(\hat{\varepsilon}_t\hat{\varepsilon}_t'\right) = \hat{\Sigma}
\]

and finally
\[
\operatorname{cov}\left(\hat{\beta}_{FM}\right) = \frac{1}{T}(x'x)^{-1}x'\hat{\Sigma}x(x'x)^{-1}.
\]

Thus, the FM estimator of the standard error is also numerically equivalent to the OLS corrected standard error.
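The proposition is easy to verify numerically. The following sketch simulates a panel with regressors held fixed over time and checks that the three point estimates coincide to machine precision; the data-generating numbers are arbitrary illustrations of my own.

```python
import numpy as np

rng = np.random.default_rng(3)
N, T = 20, 50
x = rng.normal(size=(N, 2))                          # regressors fixed over time
y = x @ np.array([1.0, -0.5]) + rng.normal(0.0, 1.0, (T, N))  # y[t]: date-t cross-section

# Pooled time-series cross-section OLS: stack all NT observations.
X_stack = np.tile(x, (T, 1))
b_pooled = np.linalg.lstsq(X_stack, y.reshape(-1), rcond=None)[0]

# Pure cross-section: one regression of time-averaged y on x.
b_xs = np.linalg.lstsq(x, y.mean(axis=0), rcond=None)[0]

# Fama-MacBeth: average of the date-by-date cross-sectional estimates.
b_fm = np.mean([np.linalg.lstsq(x, y[t], rcond=None)[0] for t in range(T)], axis=0)
```

All three vectors agree, which is exactly the algebra of the proof: each one reduces to (x'x)^{-1} x' E_T(y_t).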

Varying x: If the x_{it} vary through time, none of the three procedures are equal anymore, since the cross-sectional regressions ignore time-series variation in the x_{it}. As an extreme example, suppose a scalar x_t varies over time but not cross-sectionally,
\[
y_{it} = \alpha + x_t\beta + \varepsilon_{it}; \quad i = 1, 2, \dots, N;\ t = 1, 2, \dots, T.
\]

The grand OLS regression is
\[
\hat{\beta}_{OLS} = \frac{\sum_{it}\tilde{x}_t y_{it}}{\sum_{it}\tilde{x}_t^2} = \frac{\sum_t \tilde{x}_t\left(\frac{1}{N}\sum_i y_{it}\right)}{\sum_t \tilde{x}_t^2}
\]
where x̃ = x − E_T(x) denotes the demeaned variables. The estimate is driven by the covariance over time of x_t with the cross-sectional average of the y_{it}, which is sensible because all of the information in the sample lies in time variation. It is identical to a regression over time



of cross-sectional averages. However, you can't even run a cross-sectional estimate, since the right-hand variable is constant across i. As a practical example, you might be interested in a CAPM specification in which the betas vary over time (β_t) but not across test assets. This sample still contains information about the CAPM: the time-variation in betas should be matched by time variation in expected returns. But any method based on cross-sectional regressions will completely miss it. ∎

In historical context, the Fama-MacBeth procedure was also important because it allowed changing betas, which a single cross-sectional regression or a time-series regression test cannot easily handle.

12.4 Problems

1. When we express the CAPM in excess return form, can the test assets be differences between risky assets, R^i − R^j? Can the market excess return also use a risky asset, or must it be relative to a risk-free rate? (Hint: start with E(R^i) − R^f = β_{i,m}(E(R^m) − R^f) and see if you can get to the other forms. Betas must be regression coefficients.)

2. Can you run the GRS test on a model that uses industrial production growth as a factor, E(R^i) − R^f = β_{i,Δip} λ_{ip}?

3. Fama and French (1997b) report that pricing errors are correlated with betas in a test of a

factor pricing model on industry portfolios. How is this possible?

4. We saw that a GLS cross-sectional regression of the CAPM passes through the market

and riskfree rate by construction. Show that if the market return is an equally weighted

portfolio of the test assets, then an OLS cross-sectional regression with an estimated

intercept passes through the market return by construction. Does it also pass through the

riskfree rate or origin?


Chapter 13. GMM for linear factor

models in discount factor form

13.1 GMM on the pricing errors gives a cross-sectional regression

The first-stage estimate is an OLS cross-sectional regression, and the second stage is a GLS regression,
\[
\text{First stage: } \hat{b}_1 = (d'd)^{-1}d'E_T(p)
\]
\[
\text{Second stage: } \hat{b}_2 = (d'S^{-1}d)^{-1}d'S^{-1}E_T(p).
\]

Standard errors are the corresponding regression formulas, and the variance of the pricing errors is given by the standard regression formula for the variance of a residual.

Treating the constant a × 1 as a constant factor, the model is
\[
m = b'f
\]
\[
E(p) = E(mx),
\]
or simply
\[
E(p) = E(xf')b. \tag{189}
\]

Keep in mind that p and x are N × 1 vectors of asset prices and payoffs respectively; f is a K × 1 vector of factors, and b is a K × 1 vector of parameters. I suppress the time indices m_{t+1}, f_{t+1}, x_{t+1}, p_t. The payoffs are typically returns or excess returns, including returns scaled by instruments. The prices are typically one (returns), zero (excess returns), or instruments.

To implement GMM, we need to choose a set of moments. The obvious set of moments to use are the pricing errors,
\[
g_T(b) = E_T(xf'b - p).
\]

This choice is natural but not necessary. You don't have to use p = E(mx) with GMM, and you don't have to use GMM with p = E(mx). You can (we will) use GMM on expected return-beta models, and you can use maximum likelihood on p = E(mx). It is a choice, and the results will depend on this choice of moments as well as the specification of the model.


CHAPTER 13 GMM FOR LINEAR FACTOR MODELS IN DISCOUNT FACTOR FORM

The GMM estimate is formed from
\[
\min_b\ g_T(b)'\,W\,g_T(b)
\]
with first-order condition
\[
d'Wg_T(b) = d'W E_T(xf'b - p) = 0
\]
where
\[
d' = \frac{\partial g_T(b)'}{\partial b} = E_T(fx').
\]

This is the second moment matrix of payoffs and factors. The first stage has W = I, the second stage has W = S^{-1}. Since this is a linear model, we can solve analytically for the GMM estimate, and it is
\[
\text{First stage: } \hat{b}_1 = (d'd)^{-1}d'E_T(p)
\]
\[
\text{Second stage: } \hat{b}_2 = (d'S^{-1}d)^{-1}d'S^{-1}E_T(p).
\]

The first-stage estimate is an OLS cross-sectional regression of average prices on the second moments of payoffs with factors, and the second-stage estimate is a GLS cross-sectional regression. What could be more sensible? The model (13.189) says that average prices should be a linear function of the second moments of payoffs with factors, so the estimate runs a linear regression. These are cross-sectional regressions since they operate across assets on sample averages. The "data points" in the regression are sample average prices (y) and second moments of payoffs with factors (x) across test assets. We are picking the parameter b to make the model fit the cross-section of asset prices as well as possible.

We find the distribution theory from the standard GMM standard error formulas (11.144) and (11.150). In the first stage, a = d'.
\[
\text{First stage: } \operatorname{cov}(\hat{b}_1) = \frac{1}{T}(d'd)^{-1}d'Sd(d'd)^{-1} \tag{13.190}
\]
\[
\text{Second stage: } \operatorname{cov}(\hat{b}_2) = \frac{1}{T}(d'S^{-1}d)^{-1}.
\]

Unsurprisingly, these are exactly the formulas for OLS and GLS regression errors with error covariance S. The pricing errors are correlated across assets, since the payoffs are correlated. Therefore the OLS cross-sectional regression standard errors need to be corrected for correlation, as they are in (13.190), and one can pursue an efficient estimate as in GLS. The analogy to GLS is close, since S is the covariance matrix of E(p) − E(xf')b; S is the covariance matrix of the "errors" in the cross-sectional regression.



The covariance matrix of the pricing errors is, from (11.147), (11.151) and (11.152),
\[
\text{First stage: } T\operatorname{cov}\left[g_T(\hat{b})\right] = \left(I - d(d'd)^{-1}d'\right)S\left(I - d(d'd)^{-1}d'\right)' \tag{13.191}
\]
\[
\text{Second stage: } T\operatorname{cov}\left[g_T(\hat{b})\right] = S - d(d'S^{-1}d)^{-1}d'.
\]

These are obvious analogues to the standard regression formulas for the covariance matrix of

regression residuals.

The model test is
\[
g_T(b)'\operatorname{cov}(g_T)^{-1}g_T(b) \sim \chi^2(\#\text{moments} - \#\text{parameters}),
\]
which specializes for the second-stage estimate as
\[
T\,g_T(\hat{b})'S^{-1}g_T(\hat{b}) \sim \chi^2(\#\text{moments} - \#\text{parameters}).
\]
There is not much point in writing these out, other than to point out that the test is a quadratic form in the vector of pricing errors. It turns out that the χ² test has the same value for first and second stage for this model, even though the parameter estimates, pricing errors, and covariance matrix are not the same.
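The two stages are only a few matrix operations. In the sketch below I simulate payoffs and factors, construct average prices so that the model holds exactly in sample, and verify that both stages recover b. The spectral-density matrix S is replaced here by a plain sample covariance of the moment contributions, an assumption standing in for the chapter 11 estimators; every name and number is illustrative.

```python
import numpy as np

rng = np.random.default_rng(4)
T, N, K = 1000, 8, 2
f = rng.normal(size=(T, K))                      # factors
x = f @ rng.normal(size=(K, N)) + rng.normal(size=(T, N))  # payoffs
b_true = np.array([0.4, -0.2])

d = x.T @ f / T                                  # d = E_T(x f'), N x K
p_bar = d @ b_true                               # average prices that fit exactly

# First stage: W = I, an OLS cross-sectional regression of prices on d.
b1 = np.linalg.solve(d.T @ d, d.T @ p_bar)

# Second stage: W = S^{-1}, a GLS cross-sectional regression. S here is a
# simple covariance of the per-period moment contributions u_t = x_t(f_t'b) - p.
u = x * (f @ b1)[:, None] - p_bar
S = np.cov(u.T)
Sinv_d = np.linalg.solve(S, d)
b2 = np.linalg.solve(d.T @ Sinv_d, Sinv_d.T @ p_bar)
```

Because the prices were constructed to make the moments fit exactly, both stages return b_true regardless of the weighting matrix, which is the sense in which pricing errors do not depend on W when the model holds.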

13.2 The case of excess returns

When m_{t+1} = a − b'f_{t+1} and the test assets are excess returns, the GMM estimate is a GLS cross-sectional regression of average returns on the second moments of returns with factors,
\[
\text{First stage: } \hat{b}_1 = (d'd)^{-1}d'E_T(R^e)
\]
\[
\text{Second stage: } \hat{b}_2 = (d'S^{-1}d)^{-1}d'S^{-1}E_T(R^e),
\]
where d is the covariance matrix between returns and factors. The other formulas are the same.

The analysis of the last section requires that at least one asset has a nonzero price. If all assets are excess returns then b̂_1 = (d'd)^{-1}d'E_T(p) = 0. Linear factor models are most often applied to excess returns, so this case is important. The trouble is that in this case the mean discount factor is not identified. If E(mR^e) = 0 then E((2 × m)R^e) = 0. Analogously, in expected return-beta models, if all test assets are excess returns, then we have no information on the level of the zero-beta rate.

Writing out the model as m = a − b'f, we cannot separately identify a and b, so we have to choose some normalization. The choice is entirely one of convenience; lack of identification



means precisely that the pricing errors do not depend on the choice of normalization.

The easiest choice is a = 1. Then

\[
g_T(b) = E_T(mR^e) = E_T(R^e) - E_T(R^e f')b.
\]

We have
\[
d = E_T(R^e f'),
\]
the second moment matrix of returns and factors, so that the moments are g_T(b) = E_T(R^e) − d b. The first-order condition to min_b g_T'Wg_T is
\[
d'W\left[E_T(R^e) - d\,b\right] = 0.
\]
Then, the GMM estimates of b are
\[
\text{First stage: } \hat{b}_1 = (d'd)^{-1}d'E_T(R^e)
\]
\[
\text{Second stage: } \hat{b}_2 = (d'S^{-1}d)^{-1}d'S^{-1}E_T(R^e).
\]

The GMM estimate is a cross-sectional regression of mean excess returns on the second

moments of returns with factors. From here on in, the distribution theory is unchanged from

the last section.

Mean returns on covariances

We can obtain a cross-sectional regression of mean excess returns on covariances, which are just a heartbeat away from betas, by choosing the normalization a = 1 + b'E(f) rather than a = 1. Then, the model is m = 1 − b'(f − E(f)) with mean E(m) = 1. The pricing errors are
\[
g_T(b) = E_T(mR^e) = E_T(R^e) - E_T(R^e\tilde{f}')b,
\]
where I denote f̃ ≡ f − E(f). We now have
\[
d = E_T(R^e\tilde{f}'),
\]
which denotes the covariance matrix of returns and factors, and the first-order condition to min_b g_T'Wg_T is now
\[
d'W\left[E_T(R^e) - d\,b\right] = 0.
\]

Then, the GMM estimates of b are
\[
\text{First stage: } \hat{b}_1 = (d'd)^{-1}d'E_T(R^e)
\]
\[
\text{Second stage: } \hat{b}_2 = (d'S^{-1}d)^{-1}d'S^{-1}E_T(R^e).
\]



The GMM estimate is a cross-sectional regression of expected excess returns on the covariance between returns and factors. Naturally, the model says that expected excess returns should be proportional to the covariance between returns and factors, and the procedure estimates that relation by a linear regression. The standard errors and variance of the pricing errors are the same as in (13.190) and (13.191), with d now representing the covariance matrix. The formulas are almost exactly identical to those of the cross-sectional regressions in section 12.2. The p = E(mx) formulation of the model for excess returns is equivalent to E(R^e) = Cov(R^e, f')b; thus covariances enter in place of betas β.

There is one fly in the ointment; the mean of the factor E(f) is estimated, and the distribution theory should recognize sampling variation induced by this fact, as we did for the fact that betas are generated regressors in the cross-sectional regressions of section 12.3. The distribution theory is straightforward, and a problem at the end of the chapter guides you through it. However, I think it is better to avoid the complication and just use the second-moment approach, or some other non-sample-dependent normalization for a. The pricing errors are identical; the whole point is that the normalization of a does not matter to the pricing errors. Therefore, the χ² statistics are also identical. As you change the normalization for a, you change the estimate of b. Therefore, the only effect is to add a term in the sampling variance of the estimated parameter b.

13.3 Horse Races

How to test whether one set of factors drives out another. Test b_2 = 0 in m = b_1'f_1 + b_2'f_2 using the standard error of b̂_2, or the χ² difference test.

It's often interesting to test whether one set of factors drives out another. For example, Chen, Roll and Ross (1986) test whether their five macroeconomic factors price assets so well that one can ignore even the market return. Given the large number of factors that have been proposed, a statistical procedure for testing which factors survive in the presence of the others is desirable.

In this framework, such a test is very easy. Start by estimating a general model
\[
m = b_1'f_1 + b_2'f_2. \tag{192}
\]
We want to know: given factors f_1, do we need the f_2 to price assets, i.e., is b_2 = 0? There are two ways to do this.

First and most obviously, we have an asymptotic covariance matrix for [b_1 b_2], so we can form a t-test (if b_2 is scalar) or χ² test for b_2 = 0 by forming the statistic
\[
\hat{b}_2'\operatorname{var}(\hat{b}_2)^{-1}\hat{b}_2 \sim \chi^2_{\#b_2}
\]



where #b_2 is the number of elements in the b_2 vector. This is a Wald test.

Second, we can estimate a restricted system m = b_1'f_1. Since there are fewer free parameters and the same number of moments as in (13.192), we expect the criterion J_T to rise. If we use the same weighting matrix (usually the one estimated from the unrestricted model (13.192)), then the J_T cannot in fact decline. But if b_2 really is zero, it shouldn't rise "much." How much? The χ² difference test answers that question:
\[
T J_T(\text{restricted}) - T J_T(\text{unrestricted}) \sim \chi^2(\#\text{of restrictions}).
\]
This is very much like a likelihood ratio test.
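A sketch of the mechanics follows. With the same weighting matrix W for both systems, the restricted criterion can never be smaller, so the difference below is nonnegative by construction; the choice W = cov(R^e)^{-1} is a crude stand-in for the S^{-1} of the text, and the simulated data and names are my own.

```python
import numpy as np

rng = np.random.default_rng(5)
T, N = 2000, 10
f1 = rng.normal(size=(T, 1))                   # factor we keep
f2 = rng.normal(size=(T, 1))                   # candidate extra factor
Re = 0.5 * f1 + rng.normal(0.0, 1.0, (T, N))   # excess returns driven by f1 only

def TJ(d, W, ERe, T):
    """T times the minimized criterion g' W g for moments g = ERe - d b."""
    b = np.linalg.solve(d.T @ W @ d, d.T @ W @ ERe)
    g = ERe - d @ b
    return T * g @ W @ g

ERe = Re.mean(axis=0)
d_u = Re.T @ np.column_stack([f1, f2]) / T     # unrestricted: both factors
d_r = d_u[:, :1]                               # restricted: f1 only
W = np.linalg.inv(np.cov(Re.T))                # fixed weighting matrix (assumption)
TJ_diff = TJ(d_r, W, ERe, T) - TJ(d_u, W, ERe, T)   # ~ chi^2(1) if b2 = 0
```

If b_2 really is zero, TJ_diff should behave like a χ² variable with one degree of freedom; a large value is evidence that f_2 is needed.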

13.4 Testing for characteristics

How to check whether an asset pricing model drives out a characteristic such as size, book/market, or volatility. Run cross-sectional regressions of pricing errors on characteristics; use the formulas for the covariance matrix of the pricing errors to create standard errors.

It's often interesting to characterize a model by checking whether the model drives out a characteristic. For example, portfolios organized by size or market capitalization show a wide dispersion in average returns (at least up to 1979). Small stocks gave higher average returns than large stocks. The size of the portfolio is a characteristic. A good asset pricing model should account for average returns by betas. It's ok if a characteristic is associated with average returns, but in the end betas should drive out the characteristic; the alphas or pricing errors should not be associated with the characteristic. The original tests of the CAPM similarly checked whether the variance of the individual portfolio had anything to do with average returns once betas were included.

Denote the characteristic of portfolio i by y_i. An obvious idea is to include both betas and the characteristic in a multiple cross-sectional regression,
\[
E(R^{ei}) = (\alpha_0) + \beta_i'\lambda + \gamma y_i + \varepsilon_i; \quad i = 1, 2, \dots, N.
\]

Alternatively, subtract β_i'λ from both sides and consider a cross-sectional regression of alphas on the characteristic,
\[
\alpha_i = (\alpha_0) + \gamma y_i + \varepsilon_i; \quad i = 1, 2, \dots, N.
\]
(The difference is whether you allow the presence of the size characteristic to affect the λ estimate or not.)

We can always run such a regression, but we don't want to use the OLS formulas for the sampling error of the estimates, since the errors ε_i are correlated across assets. Under the null that γ = 0, ε = α, so we can simply use the covariance matrix of the alphas to generate standard errors of the γ. Let X denote the vector of characteristics; then the estimate is
\[
\hat{\gamma} = (X'X)^{-1}X'\hat{\alpha}
\]
with sampling variance
\[
\sigma^2(\hat{\gamma}) = (X'X)^{-1}X'\operatorname{cov}(\hat{\alpha})X(X'X)^{-1}.
\]
At this point, simply use the formula for cov(α̂) or cov(g_T) as appropriate for the model that you tested.

Sometimes, the characteristic is also estimated rather than being a fixed number such as the size rank of a size portfolio, and you'd like to include the sampling uncertainty of its estimation in the standard errors of γ̂. Let y_t^i denote the time series whose mean E(y_t^i) determines the characteristic. Now, write the moment condition for the ith asset as
\[
g_T^i = E_T\left(m_{t+1}(b)x_{t+1} - p_t - \gamma y_t^i\right).
\]

The estimate of γ tells you how the characteristic E(y^i) is associated with model pricing errors E(m_{t+1}(b)x_{t+1} − p_t). The GMM estimate of γ sets the linear combination E_T(y)'W of the moments to zero,
\[
E_T(y)'W\left(E_T(mx) - p - \gamma y\right) = 0,
\]
so
\[
\hat{\gamma} = \left(E_T(y)'W E_T(y)\right)^{-1}E_T(y)'W g_T,
\]
an OLS or GLS regression of the pricing errors on the estimated characteristics. The standard GMM formulas for the standard deviation of γ̂ or the χ² difference test for γ = 0 tell you whether the γ estimate is statistically significant, including the fact that E(y) must be estimated.
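A sketch of the corrected cross-sectional regression of alphas on a characteristic, under the null γ = 0, is below. The assumed covariance matrix of the alphas, the characteristic, and all numbers are hypothetical; in practice cov(α̂) comes from the formulas referenced above.

```python
import numpy as np

rng = np.random.default_rng(6)
N = 25
y = rng.uniform(0.0, 1.0, N)                   # hypothetical characteristic
cov_alpha = 0.01 * (np.eye(N) + 0.5)           # assumed cov(alpha): correlated alphas
alpha = rng.multivariate_normal(np.zeros(N), cov_alpha)  # alphas under gamma = 0

X = np.column_stack([np.ones(N), y])           # intercept plus characteristic
XtXi = np.linalg.inv(X.T @ X)
gamma_hat = XtXi @ X.T @ alpha                 # gamma = (X'X)^{-1} X' alpha
V = XtXi @ X.T @ cov_alpha @ X @ XtXi          # corrected sampling variance
t_stat = gamma_hat[1] / np.sqrt(V[1, 1])       # t-statistic for the characteristic
```

The point of the sandwich formula V is that it uses the cross-asset correlation of the alphas rather than the misleading i.i.d. OLS variance.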

13.5 Testing for priced factors: lambdas or b's?

b_j asks whether factor j helps to price assets given the other factors. b_j gives the multiple regression coefficient of m on f_j given the other factors.

λ_j asks whether factor j is priced, or whether its factor-mimicking portfolio carries a positive risk premium. λ_j gives the single regression coefficient of m on f_j.

Therefore, when factors are correlated, one should test b_j = 0 to see whether to include factor j given the other factors, rather than test λ_j = 0.

Expected return-beta models defined with single regression betas give rise to λ with a multiple regression interpretation that one can use to test factor pricing.



In the context of expected return-beta models, it has been more traditional to evaluate the relative strengths of models by testing the factor risk premia λ of additional factors, rather than to test whether their b is zero. (The b's are not the same as the β's. b are the regression coefficients of m on f; β are the regression coefficients of R^i on f.)

To keep the equations simple, I'll use mean-zero factors, excess returns, and normalize to E(m) = 1, since the mean of m is not identified with excess returns.

The parameters b and λ are related by
\[
\lambda = E(ff')b.
\]
See section 6.3. Briefly,
\[
0 = E(mR^e) = E\left[R^e(1 - f'b)\right]
\]
\[
E(R^e) = \operatorname{cov}(R^e, f')b = \operatorname{cov}(R^e, f')E(ff')^{-1}E(ff')b = \beta'\lambda.
\]

Thus, when the factors are orthogonal, E(ff') is diagonal, and each λ_j = 0 if and only if the corresponding b_j = 0. The distinction between b and λ only matters when the factors are correlated. Factors are often correlated, however.

λ_j captures whether factor f_j is priced. We can write λ = E[f(f'b)] = −E(mf) to see that λ is (the negative of) the price that the discount factor m assigns to f. b captures whether factor f_j is marginally useful in pricing assets, given the presence of other factors. If b_j = 0, we can price assets just as well without factor f_j as with it.

λ_j is proportional to the single regression coefficient of m on f_j: with the sign convention above, λ_j = −cov(m, f_j). λ_j = 0 asks the corresponding single regression question: "is factor j correlated with the true discount factor?"

b_j is the multiple regression coefficient of m on f_j given all the other factors. This just follows from m = b'f. (Regressions don't have to have error terms!) A multiple regression coefficient β_j in y = xβ + ε is the way to answer "does x_j help to explain variation in y given the presence of the other x's?" When you want to ask the question, "should I include factor j given the other factors?" you want to ask the multiple regression question.

For example, suppose the CAPM is true, which is the single-factor model
\[
m = a - bR^{em}
\]
where R^{em} is the market excess return. Consider any other excess return R^{ex}, positively correlated with R^{em} (x for extra). If we try a factor model with the spurious factor R^{ex}, the answer is
\[
m = a - bR^{em} + 0 \times R^{ex}.
\]
b_x is obviously zero, indicating that adding this factor does not help to price assets.

However, since the correlation of R^{ex} with R^{em} is positive, the beta of R^{ex} on R^{em} is positive, R^{ex} earns a positive expected excess return, and λ_x = E(R^{ex}) > 0. In the expected return-beta model
\[
E(R^{ei}) = \beta_{im}\lambda_m + \beta_{ix}\lambda_x,
\]
λ_m = E(R^{em}) is unchanged by the addition of the spurious factor. However, since the factors R^{em}, R^{ex} are correlated, the multiple regression betas of R^{ei} on the factors change when we add the extra factor R^{ex}. If β_{ix} is positive, β_{im} will decline from its single-regression value, so the new model explains the same expected return E(R^{ei}). The expected return-beta model will indicate a risk premium for β_x exposure, and many assets will have β_x exposure (R^{ex} for example!) even though factor R^{ex} is spurious. In particular, R^{ex} will of course have multiple regression coefficients β_{x,m} = 0 and β_{x,x} = 1, and its expected return will be entirely explained by the new factor x.

So, as usual, the answer depends on the question. If you want to know whether factor i is priced, look at λ (or E(mf^i)). If you want to know whether factor i helps to price other assets, look at b_i. This is not an issue about sampling error or testing. All moments above are population values.
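The point can be verified with population moments alone, using the relation λ = E(ff')b derived above; the second-moment matrix and b vector below are illustrative numbers of my own.

```python
import numpy as np

# Population moments (illustrative numbers): two demeaned factors with
# correlation 0.6, and a true model that loads only on the first (the market).
Eff = np.array([[1.0, 0.6],
                [0.6, 1.0]])        # E(f f')
b = np.array([2.0, 0.0])            # b_x = 0: the extra factor is useless for pricing

lam = Eff @ b                       # lambda = E(f f') b
# lam[1] = 0.6 * 2.0 > 0: the spurious factor is "priced" even though b_x = 0.
b_back = np.linalg.solve(Eff, lam)  # recovering b = E(f f')^{-1} lambda
```

The computation shows λ_x > 0 purely because the factors are correlated, while inverting E(ff') recovers b_x = 0, which is the multiple-regression answer.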

Of course, testing b = 0 is particularly easy in the GMM, p = E(mx) setup. But you can always test the same ideas in any expression of the model. In an expected return-beta model, estimate b by E(ff')^{-1}λ and test the elements of that vector rather than λ itself.

You can write an asset pricing model as E(R^e) = β'λ and use the λ to test whether each factor can be dropped in the presence of the others, if you use single regression betas rather than multiple regression betas. In this case each λ is proportional to the corresponding b. Problem 2 at the end of this chapter helps you to work out this case.

13.5.1 Mean-variance frontier and performance evaluation

A GMM, p = E(mx) approach to testing whether a return expands the mean-variance

frontier. Just test whether m = a + bR prices all returns. If there is no risk free rate, use two

values of a.

We often summarize asset return data by mean-variance frontiers. For example, a large literature has examined the desirability of international diversification in a mean-variance context. Stock returns from many countries are not perfectly correlated, so it looks like one can reduce portfolio variance a great deal for the same mean return by holding an internationally diversified portfolio. But is this real or just sampling error? Even if the value-weighted portfolio were ex-ante mean-variance efficient, an ex-post mean-variance frontier constructed from historical returns on the NYSE stocks would leave the value-weighted portfolio well inside the ex-post frontier. So is "I should have bought Japanese stocks in 1960" (and sold them in 1990!) a signal that broad-based international diversification is a good idea now, or is it simply 20/20 hindsight regret, like "I should have bought Microsoft in 1982"? Similarly, when evaluating fund managers, we want to know whether the manager is truly able to form a portfolio that beats mean-variance efficient passive portfolios, or whether better performance in sample is just due to luck.

Figure 27. Mean-variance frontiers might intersect rather than coincide. (The figure plots E(R) against σ(R), with the level 1/E(m) marked on the E(R) axis.)

Since a factor model is true if and only if a linear combination of the factors (or factor-mimicking portfolios if the factors are not returns) is mean-variance efficient, one can interpret a test of any factor pricing model as a test of whether a given return is on the mean-variance frontier. Section 12.1 showed how the Gibbons, Ross, and Shanken pricing error statistic can be interpreted as a test of whether a given portfolio is on the mean-variance frontier when returns and factors are i.i.d., and the GMM distribution theory of that test statistic allows us to extend the test to non-i.i.d. errors. A GMM, $p = E(mx)$, $m = a - bR^p$ test analogously tests whether $R^p$ is on the mean-variance frontier of the test assets.

We may want to go one step further, and not just test whether a combination of a set of assets $R^d$ (say, domestic assets) is on the mean-variance frontier, but whether the $R^d$ assets span the mean-variance frontier of $R^d$ and $R^i$ (say, foreign or international) assets. The trouble is that if there is no riskfree rate, the frontier generated by $R^d$ might just intersect the frontier generated by $R^d$ and $R^i$ together, rather than span or coincide with the latter frontier, as shown in Figure 27. Testing that $m = a - b'R^d$ prices both $R^d$ and $R^i$ only checks for intersection.


DeSantis (1992) and Chen and Knez (1992, 1993) show how to test for spanning as opposed to intersection. For intersection, $m = a - b_d'R^d$ will price both $R^d$ and $R^i$ only for one value of $a$, or equivalently one value of $E(m)$ or choice of the intercept, as shown. If the frontiers coincide or span, then $m = a - b_d'R^d$ prices both $R^d$ and $R^i$ for any value of $a$. Thus, we can test for coincident frontiers by testing whether $m = a - b_d'R^d$ prices both $R^d$ and $R^i$ for two prespecified values of $a$ simultaneously.

To see how this works, start by noting that there must be at least two assets in $R^d$. If not, there is no mean-variance frontier of $R^d$ assets; it is simply a point. If there are two assets in $R^d$, $R^{d1}$ and $R^{d2}$, then the mean-variance frontier of domestic assets connects them; they are each on the frontier. If they are both on the frontier, then there must be discount factors

$$m_1 = a_1 - \tilde{b}_1 R^{d1}$$

and

$$m_2 = a_2 - \tilde{b}_2 R^{d2}$$

and, of course, any linear combination,

$$m = \left[\lambda a_1 + (1-\lambda)a_2\right] - \left[\lambda \tilde{b}_1 R^{d1} + (1-\lambda)\tilde{b}_2 R^{d2}\right].$$

Equivalently, for any value of $a$, there is a discount factor of the form

$$m = a - \left(b_1 R^{d1} + b_2 R^{d2}\right).$$

Thus, you can test for spanning with a $J_T$ test on the moments

$$E\left[(a_1 - b_1'R^d)\,R^d\right] = 0$$
$$E\left[(a_1 - b_1'R^d)\,R^i\right] = 0$$
$$E\left[(a_2 - b_2'R^d)\,R^d\right] = 0$$
$$E\left[(a_2 - b_2'R^d)\,R^i\right] = 0$$

for any two fixed values of $a_1$, $a_2$.
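The mechanics of these moments can be sketched numerically. In the sketch below (simulated returns and the values of $a_1$, $a_2$ are my own illustrative assumptions), each $b_k$ is chosen to set the exactly identified $R^d$ moments to zero, and the $R^i$ moments become the overidentifying restrictions that a $J_T$ test would weight by $S^{-1}$:

```python
import numpy as np

rng = np.random.default_rng(0)
T, Nd, Ni = 5000, 2, 3

# Simulated gross returns (hypothetical data, for illustration only)
Rd = 1.0 + 0.05 * rng.standard_normal((T, Nd))   # "domestic" assets
Ri = 1.0 + 0.05 * rng.standard_normal((T, Ni))   # "foreign" assets

a1, a2 = 0.9, 1.1   # two prespecified values of the constant a

def fit_b(a, Rd):
    """Solve the exactly identified moments E_T[(a - b'Rd) Rd] = 0 for b,
    i.e. E_T[Rd Rd'] b = a E_T[Rd]."""
    return np.linalg.solve(Rd.T @ Rd / T, a * Rd.mean(axis=0))

b1, b2 = fit_b(a1, Rd), fit_b(a2, Rd)

def pricing_errors(a, b, R):
    """Sample moments E_T[(a - b'Rd) R] for a set of test returns R."""
    m = a - Rd @ b
    return (m[:, None] * R).mean(axis=0)

# Rd moments are zero by construction; the Ri moments are the overidentifying
# restrictions the J_T test compares to a chi^2 distribution.
print(pricing_errors(a1, b1, Rd), pricing_errors(a2, b2, Rd))  # ~0
print(pricing_errors(a1, b1, Ri), pricing_errors(a2, b2, Ri))
```

The full test stacks the four moment vectors, estimates the spectral density matrix $S$, and forms $J_T = T\,g_T'S^{-1}g_T$ as in the GMM chapters.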

13.6 Problems

1. Work out the GMM distribution theory for the model $m = 1 - b'(f - E(f))$ when the test assets are excess returns. The distribution should recognize the fact that $E(f)$ is estimated in sample. To do this, set up

$$g_T = \begin{bmatrix} E_T\left[R^e - R^e\,(f' - E(f)')\,b\right] \\ E_T\left[f - E(f)\right] \end{bmatrix}$$

$$a_T = \begin{bmatrix} E_T\left(\tilde{f}R^{e\prime}\right) & 0 \\ 0 & I_K \end{bmatrix}.$$

The estimated parameters are $b$, $E(f)$. You should end up with a formula for the standard error of $b$ that resembles the Shanken correction (12.184), and an unchanged $J_T$ test.

2. Show that if you use single regression betas, then the corresponding $\lambda$ can be used to test for the marginal importance of factors. However, the $\lambda$ are no longer the expected returns of factor-mimicking portfolios.


Chapter 14. Maximum likelihood

Maximum likelihood is, like GMM, a general organizing principle that is a useful place to

start when thinking about how to choose parameters and evaluate a model. It comes with

an asymptotic distribution theory, which, like GMM, is a good place to start when you are

unsure about how to treat various problems such as the fact that betas must be estimated in a

cross-sectional regression.

As we will see, maximum likelihood is a special case of GMM. Given a statistical description of the data, it prescribes which moments are statistically most informative. Given those moments, ML and GMM are the same. Thus, ML can be used to defend why one picks a certain set of moments, or for advice on which moments to pick if one is unsure. In this sense, maximum likelihood (paired with carefully chosen statistical models) justifies the regression tests above, as it justifies standard regressions. On the other hand, ML does not easily allow you to use other non-"efficient" moments, if you suspect that ML's choices are not robust to misspecifications of the economic or statistical model. For example, ML will tell you how to do GLS, but it will not tell you how to adjust OLS standard errors for non-standard error terms.

Hamilton (1994), pp. 142-148, and the appendix in Campbell, Lo, and MacKinlay (1997) give nice summaries of maximum likelihood theory. Campbell, Lo, and MacKinlay's Chapters 5 and 6 treat many more variations of regression-based tests and maximum likelihood.

14.1 Maximum likelihood

The maximum likelihood principle says to pick the parameters that make the observed data most likely. Maximum likelihood estimates are asymptotically efficient. The information matrix gives the asymptotic standard errors of ML estimates.

The maximum likelihood principle says to pick the set of parameters that makes the observed data most likely. This is not "the set of parameters that are most likely given the data"; in classical (as opposed to Bayesian) statistics, parameters are numbers, not random variables.

To implement this idea, you first have to figure out what the probability of seeing a data set $\{x_t\}$ is, given the free parameters $\theta$ of a model. This probability distribution is called the likelihood function $f(\{x_t\};\theta)$. Then, the maximum likelihood principle says to pick

$$\hat{\theta} = \arg\max_{\theta}\, f(\{x_t\};\theta).$$

For reasons that will soon be obvious, it's much easier to work with the log of this probability distribution,

$$L(\{x_t\};\theta) = \ln f(\{x_t\};\theta).$$

Maximizing the log likelihood is the same thing as maximizing the likelihood.

Finding the likelihood function isn't always easy. In a time-series context, the best way to do it is often to first find the conditional likelihood function $f(x_t|x_{t-1}, x_{t-2}, ...x_0;\theta)$, the chance of seeing $x_t$ given $x_{t-1}, x_{t-2}, ...$ and given values for the parameters $\theta$. Since joint probability is the product of conditional probabilities, the log likelihood function is just the sum of the conditional log likelihood functions,

$$L(\{x_t\};\theta) = \sum_{t=1}^{T}\ln f(x_t|x_{t-1}, x_{t-2}, ...x_0;\theta). \tag{14.193}$$

More concretely, we usually assume normal errors, so the likelihood function is

$$L = -\frac{T}{2}\ln\left(2\pi|\Sigma|\right) - \frac{1}{2}\sum_{t=1}^{T}\varepsilon_t'\Sigma^{-1}\varepsilon_t \tag{14.194}$$

where $\varepsilon_t$ denotes a vector of shocks, $\varepsilon_t = x_t - E(x_t|x_{t-1}, x_{t-2}, ...x_0;\theta)$.

This expression gives a simple recipe for constructing a likelihood function. You usually start with a model that generates $x_t$ from errors, e.g. $x_t = \rho x_{t-1} + \varepsilon_t$. Invert that model to express the errors $\varepsilon_t$ in terms of the data $\{x_t\}$ and plug in to (14.194).

There is a small issue about how to start off a model such as (14.193). Ideally, the first observation should be the unconditional density, i.e.

$$L(\{x_t\};\theta) = \ln f(x_1;\theta) + \ln f(x_2|x_1;\theta) + \ln f(x_3|x_2, x_1;\theta) + ...$$

However, it is usually hard to evaluate the unconditional density or the first terms with only a few lagged $x$'s. Therefore, if as usual the conditional density can be expressed in terms of a finite number $k$ of lags of $x_t$, one often maximizes the conditional likelihood function (conditional on the first $k$ observations), treating the first $k$ observations as fixed rather than as random variables,

$$L(\{x_t\};\theta) = \ln f(x_{k+1}|x_k, x_{k-1}, ...x_1;\theta) + \ln f(x_{k+2}|x_{k+1}, x_k, ...x_2;\theta) + ...$$

Alternatively, one can treat $k$ pre-sample values $\{x_0, x_{-1}, ...x_{-k+1}\}$ as additional parameters over which to maximize the likelihood function.
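The conditional-likelihood recipe can be sketched for the AR(1) example used later in this chapter. The sketch below (simulated data; $\rho = 0.6$ and a grid search are my own illustrative choices) treats the first observation as fixed, inverts the model for the errors, and sums the Gaussian conditional log densities as in (14.193)-(14.194):

```python
import numpy as np

def ar1_cond_loglik(x, rho, sigma2):
    """Conditional log likelihood of an AR(1), x_t = rho*x_{t-1} + eps_t,
    treating the first observation as fixed (k = 1 lag)."""
    eps = x[1:] - rho * x[:-1]          # invert the model for the errors
    T = len(eps)
    # sum of Gaussian conditional log densities
    return -0.5 * T * np.log(2 * np.pi * sigma2) - 0.5 * np.sum(eps**2) / sigma2

rng = np.random.default_rng(1)
x = np.zeros(500)
for t in range(1, 500):
    x[t] = 0.6 * x[t - 1] + rng.standard_normal()

# The likelihood should peak near the true rho = 0.6
grid = np.linspace(-0.9, 0.9, 181)
best = grid[np.argmax([ar1_cond_loglik(x, r, 1.0) for r in grid])]
print(best)
```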

Maximum likelihood estimators come with a useful asymptotic (i.e., approximate) distribution theory. First, the distribution of the estimates is

$$\hat{\theta} \sim N\left(\theta,\ \left[-\frac{\partial^2 L}{\partial\theta\,\partial\theta'}\right]^{-1}\right). \tag{14.195}$$

If the likelihood $L$ has a sharp peak at $\hat{\theta}$, then we know a lot about the parameters, while if the peak is flat, other parameters are just as plausible. The maximum likelihood estimator is asymptotically efficient, meaning that no other estimator can produce a smaller covariance matrix.

The second derivative in (14.195) is known as the information matrix,

$$I = -\frac{1}{T}\frac{\partial^2 L}{\partial\theta\,\partial\theta'} = -\frac{1}{T}\sum_{t=1}^{T}\frac{\partial^2 \ln f(x_t|x_{t-1}, x_{t-2}, ...x_0;\theta)}{\partial\theta\,\partial\theta'}. \tag{14.196}$$

(More precisely, the information matrix is defined as the expected value of the second partial derivative, which is estimated with the sample value.) The information matrix can also be estimated as a product of first derivatives. The expression

$$I = \frac{1}{T}\sum_{t=1}^{T}\left(\frac{\partial \ln f(x_t|x_{t-1}, x_{t-2}, ...x_0;\theta)}{\partial\theta}\right)\left(\frac{\partial \ln f(x_t|x_{t-1}, x_{t-2}, ...x_0;\theta)}{\partial\theta}\right)'$$

converges to the same value as (14.196). (Hamilton 1994, p. 429, gives a proof.)
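The two estimates of the information matrix can be compared numerically on the AR(1) example. This sketch (simulated data with $\rho = 0.5$, $\sigma^2 = 1$, my own illustrative values) computes both the second-derivative and outer-product estimates at the true parameter; both should be near $E(x^2)/\sigma^2 = 1/(1-\rho^2)$:

```python
import numpy as np

rng = np.random.default_rng(2)
T, rho, sigma2 = 20000, 0.5, 1.0
x = np.zeros(T)
for t in range(1, T):
    x[t] = rho * x[t - 1] + rng.standard_normal()
eps = x[1:] - rho * x[:-1]
xlag = x[:-1]

# Second-derivative (Hessian) estimate: -d^2 ln f / d rho^2 = x_{t-1}^2 / sigma^2
I_hess = np.mean(xlag**2) / sigma2
# Outer-product estimate: the average squared score
score = eps * xlag / sigma2
I_op = np.mean(score**2)
print(I_hess, I_op)   # both converge to 1/(1 - rho^2) = 4/3
```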

If we estimate a model restricting the parameters, the maximum value of the likelihood function will necessarily be lower. However, if the restriction is true, it shouldn't be that much lower. This intuition is captured in the likelihood ratio test

$$2\left(L_{\text{unrestricted}} - L_{\text{restricted}}\right) \sim \chi^2_{\text{number of restrictions}}. \tag{14.197}$$

The form and idea of this test is much like the $\chi^2$ difference test for GMM objectives that we met in section 11.1.
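A likelihood ratio test can be sketched on the AR(1) example (simulated data; the true $\rho = 0.3$ and the restriction $\rho = 0$ are my own illustrative choices). The unrestricted maximum has a closed form, and twice the gap in log likelihoods is compared to a $\chi^2(1)$:

```python
import numpy as np

def ar1_cond_loglik(x, rho, sigma2):
    # conditional Gaussian log likelihood, first observation treated as fixed
    eps = x[1:] - rho * x[:-1]
    T = len(eps)
    return -0.5 * T * np.log(2 * np.pi * sigma2) - 0.5 * np.sum(eps**2) / sigma2

rng = np.random.default_rng(3)
x = np.zeros(1000)
for t in range(1, 1000):
    x[t] = 0.3 * x[t - 1] + rng.standard_normal()

# Unrestricted: maximize over rho (closed form, the OLS estimate)
rho_hat = np.sum(x[1:] * x[:-1]) / np.sum(x[:-1]**2)
L_u = ar1_cond_loglik(x, rho_hat, 1.0)
# Restricted: impose rho = 0
L_r = ar1_cond_loglik(x, 0.0, 1.0)

LR = 2 * (L_u - L_r)   # asymptotically chi^2 with 1 degree of freedom
print(LR)              # large here: the chi^2(1) 5% critical value is about 3.84
```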

14.2 ML is GMM on the scores

ML is a special case of GMM. ML uses the information in the auxiliary statistical model to derive the statistically most informative moment conditions. To see this fact, start with the first order conditions for maximizing a likelihood function

$$\frac{\partial L(\{x_t\};\theta)}{\partial\theta} = \sum_{t=1}^{T}\frac{\partial \ln f(x_t|x_{t-1}, x_{t-2}, ...;\theta)}{\partial\theta} = 0. \tag{14.198}$$

This is a GMM estimate. It is the sample counterpart to a population moment condition

$$g(\theta) = E\left(\frac{\partial \ln f(x_t|x_{t-1}, x_{t-2}, ...;\theta)}{\partial\theta}\right) = 0. \tag{14.199}$$

The term $\partial \ln f(x_t|x_{t-1}, x_{t-2}, ...;\theta)/\partial\theta$ is known as the "score." It is a random variable, formed as a combination of current and past data $(x_t, x_{t-1}, ...)$. Thus, maximum likelihood is a special case of GMM, a special choice of which moments to examine.

For example, suppose that $x$ follows an AR(1) with known variance,

$$x_t = \rho x_{t-1} + \varepsilon_t,$$

and suppose the error terms are i.i.d. normal random variables. Then,

$$\ln f(x_t|x_{t-1}, x_{t-2}, ...;\rho) = \text{const.} - \frac{\varepsilon_t^2}{2\sigma^2} = \text{const.} - \frac{(x_t - \rho x_{t-1})^2}{2\sigma^2}$$

and the score is

$$\frac{\partial \ln f(x_t|x_{t-1}, x_{t-2}, ...;\rho)}{\partial\rho} = \frac{(x_t - \rho x_{t-1})\,x_{t-1}}{\sigma^2}.$$

The first order condition for maximizing the likelihood is

$$\frac{1}{T}\sum_{t=1}^{T}(x_t - \rho x_{t-1})\,x_{t-1} = 0.$$

This expression is a moment condition, and you'll recognize it as the OLS estimator of $\rho$, which we have already regarded as a case of GMM.
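The equivalence between the ML first order condition and OLS without a constant is easy to verify numerically. In the sketch below (simulated data; $\rho = 0.7$ is my own illustrative choice), the closed-form solution of the moment condition matches a least-squares regression of $x_t$ on $x_{t-1}$:

```python
import numpy as np

rng = np.random.default_rng(4)
x = np.zeros(800)
for t in range(1, 800):
    x[t] = 0.7 * x[t - 1] + rng.standard_normal()

# Solve the ML first order condition (1/T) sum (x_t - rho x_{t-1}) x_{t-1} = 0
rho_ml = np.sum(x[1:] * x[:-1]) / np.sum(x[:-1]**2)

# OLS of x_t on x_{t-1} with no constant gives the identical number
rho_ols = np.linalg.lstsq(x[:-1, None], x[1:], rcond=None)[0][0]
print(rho_ml, rho_ols)
```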

The example shows another property of scores: the scores should be unforecastable. In the example,

$$E_{t-1}\left[\frac{(x_t - \rho x_{t-1})\,x_{t-1}}{\sigma^2}\right] = E_{t-1}\left[\frac{\varepsilon_t x_{t-1}}{\sigma^2}\right] = 0. \tag{14.200}$$

Intuitively, if we used a combination of the $x$ variables $E[h(x_t, x_{t-1}, ...)] = 0$ that was predictable, we could form another moment (an instrument) that described the predictability of the $h$ variable and use that moment to get more information about the parameters. To prove this property more generally, start with the fact that $f(x_t|x_{t-1}, x_{t-2}, ...;\theta)$ is a conditional density and therefore must integrate to one,

$$1 = \int f(x_t|x_{t-1}, x_{t-2}, ...;\theta)\,dx_t$$
$$0 = \int \frac{\partial f(x_t|x_{t-1}, x_{t-2}, ...;\theta)}{\partial\theta}\,dx_t$$
$$0 = \int \frac{\partial \ln f(x_t|x_{t-1}, x_{t-2}, ...;\theta)}{\partial\theta}\,f(x_t|x_{t-1}, x_{t-2}, ...;\theta)\,dx_t$$
$$0 = E_{t-1}\left[\frac{\partial \ln f(x_t|x_{t-1}, x_{t-2}, ...;\theta)}{\partial\theta}\right].$$

Furthermore, as you might expect, the GMM distribution theory formulas give the same result as the ML distribution theory, i.e., the information matrix is the asymptotic variance-covariance matrix. To show this fact, apply the GMM distribution theory (11.144) to (14.198). The derivative matrix is

$$d = \frac{\partial g_T(\theta)}{\partial\theta'} = \frac{1}{T}\sum_{t=1}^{T}\frac{\partial^2 \ln f(x_t|x_{t-1}, x_{t-2}, ...;\theta)}{\partial\theta\,\partial\theta'} = -I.$$

This is the second derivative expression of the information matrix. The $S$ matrix is

$$S = E\left[\frac{\partial \ln f(x_t|x_{t-1}, x_{t-2}, ...;\theta)}{\partial\theta}\,\frac{\partial \ln f(x_t|x_{t-1}, x_{t-2}, ...;\theta)}{\partial\theta}'\right] = I.$$

The lead and lag terms in $S$ are all zero, since we showed above that scores should be unforecastable. This is the outer product definition of the information matrix. There is no $a$ matrix, since the moments themselves are set to zero. The GMM asymptotic distribution of $\hat{\theta}$ is therefore

$$\sqrt{T}\left(\hat{\theta} - \theta\right) \to N\left[0,\ d^{-1}S\,d^{-1\prime}\right] = N\left[0,\ I^{-1}\right].$$

We recover the inverse information matrix, as specified by the ML asymptotic distribution theory.

14.3 When factors are returns, ML prescribes a time-series regression

I add to the economic model $E(R^e) = \beta E(f)$ a statistical assumption that the regression errors are independent over time and independent of the factors. ML then prescribes a time-series regression with no constant. To prescribe a time-series regression with a constant, we drop the model prediction $\alpha = 0$. I show how the information matrix gives the same result as the OLS standard errors.

Given a linear factor model whose factors are also returns, as with the CAPM, ML prescribes a time-series regression test. To keep notation simple, I again treat a single factor $f$. The economic model is

$$E(R^e) = \beta E(f). \tag{14.201}$$

$R^e$ is an $N \times 1$ vector of test assets, and $\beta$ is an $N \times 1$ vector of regression coefficients of these assets on the factor (the market return $R^{em}$ in the case of the CAPM).

To apply maximum likelihood, we need to add an explicit statistical model that fully describes the joint distribution of the data. I assume that the market return and regression errors are i.i.d. normal, i.e.

$$R^e_t = \alpha + \beta f_t + \varepsilon_t$$
$$f_t = E(f) + u_t \tag{14.202}$$
$$\begin{bmatrix} \varepsilon_t \\ u_t \end{bmatrix} \sim N\left(\begin{bmatrix} 0 \\ 0 \end{bmatrix},\ \begin{bmatrix} \Sigma & 0 \\ 0 & \sigma_u^2 \end{bmatrix}\right)$$

(We can get by with non-normal factors, but it is easier not to present the general case.) Equation (14.202) has no content other than normality. The zero correlation between $u_t$ and $\varepsilon_t$ identifies $\beta$ as a regression coefficient. You can just write $R^e$, $R^{em}$ as a general bivariate normal, and you will get the same results.

The economic model (14.201) implies restrictions on this statistical model. Taking expectations of (14.202), the CAPM implies that the intercepts $\alpha$ should all be zero. Again, this is also the only restriction that the CAPM places on the statistical model (14.202).

The most principled way to apply maximum likelihood is to impose the null hypothesis throughout. Thus, we write the likelihood function imposing $\alpha = 0$. To construct the likelihood function, we reduce the statistical model to independent error terms, and then add their log probability densities to get the likelihood function,

$$L = \text{const.} - \frac{1}{2}\sum_{t=1}^{T}(R^e_t - \beta f_t)'\Sigma^{-1}(R^e_t - \beta f_t) - \frac{1}{2\sigma_u^2}\sum_{t=1}^{T}(f_t - E(f))^2.$$

The estimates follow from the first order conditions,

$$\frac{\partial L}{\partial\beta} = \Sigma^{-1}\sum_{t=1}^{T}(R^e_t - \beta f_t)\,f_t = 0 \;\Rightarrow\; \hat{\beta} = \left(\sum_{t=1}^{T}f_t^2\right)^{-1}\sum_{t=1}^{T}R^e_t f_t$$

$$\frac{\partial L}{\partial E(f)} = \frac{1}{\sigma_u^2}\sum_{t=1}^{T}(f_t - E(f)) = 0 \;\Rightarrow\; \widehat{E(f)} = \hat{\lambda} = \frac{1}{T}\sum_{t=1}^{T}f_t.$$

($\partial L/\partial\Sigma$ and $\partial L/\partial\sigma_u^2$ also produce ML estimates of the covariance matrices, which turn out to be the standard averages of squared residuals.)

The ML estimate of $\beta$ is the OLS regression without a constant. The null hypothesis says to leave out the constant, and the ML estimator uses that fact to avoid estimating one. Since the factor risk premium equals the expected market return, it's not too surprising that the $\lambda$ estimate is the sample average market return.
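The restricted estimates can be sketched numerically. Below (simulated data; the betas, factor moments, and residual volatility are my own illustrative assumptions), $\hat{\beta}$ from the first order condition coincides with OLS with no constant, and $\hat{\lambda}$ is the sample mean of the factor:

```python
import numpy as np

rng = np.random.default_rng(5)
T, N = 2000, 3
beta_true = np.array([0.8, 1.0, 1.2])
f = 0.08 + 0.16 * rng.standard_normal(T)              # factor (an excess return)
Re = f[:, None] * beta_true + 0.1 * rng.standard_normal((T, N))

# ML under the null alpha = 0: beta_hat = (sum f_t^2)^{-1} sum Re_t f_t
beta_hat = (Re * f[:, None]).sum(axis=0) / np.sum(f**2)
lam_hat = f.mean()                                    # risk premium = mean factor

# Identical to OLS with no constant, asset by asset
beta_ols = np.linalg.lstsq(f[:, None], Re, rcond=None)[0][0]
print(beta_hat, beta_ols, lam_hat)
```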

We know that the ML distribution theory must give the same result as the GMM distribution theory which we already derived in section 12.1, but it's worth seeing it explicitly. The asymptotic standard errors follow from either estimate of the information matrix, for example

$$\frac{\partial^2 L}{\partial\beta\,\partial\beta'} = -\Sigma^{-1}\sum_{t=1}^{T}f_t^2.$$

Thus,

$$\text{cov}(\hat{\beta}) = \frac{1}{T}\frac{1}{E(f^2)}\Sigma = \frac{1}{T}\frac{1}{E(f)^2 + \sigma^2(f)}\Sigma. \tag{14.203}$$

This is the standard OLS formula.

We also want pricing error measurements, standard errors, and tests. We can apply maximum likelihood to estimate an unconstrained model, containing intercepts, and then use Wald tests (estimate/standard error) to test the restriction that the intercepts are zero. We can also use the unconstrained model to run the likelihood ratio test. The unconstrained likelihood function is

$$L = \text{const.} - \frac{1}{2}\sum_{t=1}^{T}(R^e_t - \alpha - \beta f_t)'\Sigma^{-1}(R^e_t - \alpha - \beta f_t) + ...$$

(I ignore the term in the factor, since it will again just tell us to use the sample mean to estimate the factor risk premium.)

The estimates are now

$$\frac{\partial L}{\partial\alpha} = \Sigma^{-1}\sum_{t=1}^{T}(R^e_t - \alpha - \beta f_t) = 0 \;\Rightarrow\; \hat{\alpha} = E_T(R^e_t) - \hat{\beta}E_T(f_t)$$

$$\frac{\partial L}{\partial\beta} = \Sigma^{-1}\sum_{t=1}^{T}(R^e_t - \alpha - \beta f_t)\,f_t = 0 \;\Rightarrow\; \hat{\beta} = \frac{\text{cov}_T(R^e_t, f_t)}{\sigma^2_T(f_t)}.$$

Unsurprisingly, the maximum likelihood estimates of $\alpha$ and $\beta$ are the OLS estimates, with a constant.
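These first order conditions can be verified directly (one simulated asset with an illustrative $\alpha = 0.02$; all numbers are my own assumptions): the ML formulas reproduce an ordinary regression with an intercept.

```python
import numpy as np

rng = np.random.default_rng(6)
T = 2000
f = 0.08 + 0.16 * rng.standard_normal(T)
Re = 0.02 + 1.1 * f + 0.1 * rng.standard_normal(T)   # one asset, alpha = 0.02

# ML first order conditions give the usual OLS formulas with a constant
beta_hat = np.cov(Re, f, ddof=0)[0, 1] / np.var(f)
alpha_hat = Re.mean() - beta_hat * f.mean()

# Cross-check against a regression with an intercept
X = np.column_stack([np.ones(T), f])
coef = np.linalg.lstsq(X, Re, rcond=None)[0]
print(alpha_hat, beta_hat, coef)
```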

The inverse of the information matrix gives the asymptotic distribution of these estimates. Since they are just OLS estimates, we're going to get the OLS standard errors, but it's worth seeing it come out of ML,

$$-\left[\frac{\partial^2 L}{\partial\begin{bmatrix}\alpha\\\beta\end{bmatrix}\partial\begin{bmatrix}\alpha\\\beta\end{bmatrix}'}\right]^{-1} = \begin{bmatrix} \Sigma^{-1} & \Sigma^{-1}E(f) \\ \Sigma^{-1}E(f) & \Sigma^{-1}E(f^2) \end{bmatrix}^{-1} = \frac{1}{\sigma^2(f)}\begin{bmatrix} E(f^2) & -E(f) \\ -E(f) & 1 \end{bmatrix}\otimes\Sigma.$$

The covariance matrices of $\hat{\alpha}$ and $\hat{\beta}$ are thus

$$\text{cov}(\hat{\alpha}) = \frac{1}{T}\left[1 + \left(\frac{E(f)}{\sigma(f)}\right)^2\right]\Sigma$$
$$\text{cov}(\hat{\beta}) = \frac{1}{T}\frac{1}{\sigma^2(f)}\Sigma. \tag{14.204}$$

These are just the usual OLS standard errors, which we derived in section 12.1 as a special case of GMM standard errors for the OLS time-series regressions when errors are uncorrelated over time and independent of the factors, or by specializing $\sigma^2(X'X)^{-1}$.

You cannot just invert $\partial^2 L/\partial\alpha\,\partial\alpha'$ to find the covariance of $\hat{\alpha}$. That attempt would give just $\Sigma$ as the covariance matrix of $\hat{\alpha}$, which would be wrong. You have to invert the entire information matrix to get the standard error of any parameter. Otherwise, you are ignoring the effect that estimating $\beta$ has on the distribution of $\hat{\alpha}$. In fact, what I presented is really wrong, since we also must estimate $\Sigma$. However, it turns out that $\hat{\Sigma}$ is independent of $\hat{\alpha}$ and $\hat{\beta}$ (the information matrix is block-diagonal), so the top left two elements of the true inverse information matrix are the same as I have written here.

The variance of $\hat{\beta}$ in (14.204) is larger than it was in (14.203), where we imposed the null of no constant. ML uses all the information it can to produce efficient estimates, i.e., estimates with the smallest possible covariance matrix. The ratio of the two formulas is equal to the familiar term $1 + E(f)^2/\sigma^2(f)$. In annual data for the CAPM, $\sigma(R^{em}) = 16\%$ and $E(R^{em}) = 8\%$, which means that the unrestricted estimate (14.204) has a variance 25% larger than the restricted estimate (14.203), so the gain in efficiency can be important. In monthly data, however, the gain is smaller, since the variance and mean both scale with the horizon.
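The horizon effect is quick to verify with the text's numbers (the monthly conversion below assumes i.i.d. returns, so the mean scales with the horizon and the variance with the horizon as well):

```python
# Ratio of unrestricted to restricted beta variance: 1 + E(f)^2 / sigma^2(f)
annual = 1 + (0.08 / 0.16) ** 2                        # = 1.25, i.e. 25% larger
# Monthly: mean scales with h, variance with h, so E(f)/sigma(f) shrinks by sqrt(h)
monthly = 1 + ((0.08 / 12) / (0.16 / 12 ** 0.5)) ** 2  # about 1.02: the gain nearly vanishes
print(annual, monthly)
```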

We can also view this fact as a warning: ML will ruthlessly exploit the null hypothesis and

do things like running regressions without a constant in order to get any small improvement

in efficiency.

We can use these covariance matrices to construct a Wald (estimate/standard error) test of the restriction of the model that the alphas are all zero,

$$T\left[1 + \left(\frac{E(f)}{\sigma(f)}\right)^2\right]^{-1}\hat{\alpha}'\Sigma^{-1}\hat{\alpha} \sim \chi^2_N. \tag{14.205}$$
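The statistic in (14.205) is straightforward to compute from time-series regressions. The sketch below (simulated data under the null $\alpha = 0$; all parameter values are my own illustrative assumptions) estimates $\hat{\alpha}$ and $\hat{\Sigma}$ asset by asset and forms the quadratic form:

```python
import numpy as np

rng = np.random.default_rng(7)
T, N = 1000, 5
f = 0.08 + 0.16 * rng.standard_normal(T)
beta = np.linspace(0.8, 1.2, N)
eps = 0.1 * rng.standard_normal((T, N))
Re = f[:, None] * beta + eps                  # null is true: alpha = 0

# OLS with a constant for all assets at once
X = np.column_stack([np.ones(T), f])
coef = np.linalg.lstsq(X, Re, rcond=None)[0]  # rows: alphas, betas
alpha_hat = coef[0]
resid = Re - X @ coef
Sigma_hat = resid.T @ resid / T               # ML estimate of Sigma

stat = T / (1 + (f.mean() / f.std()) ** 2) * alpha_hat @ np.linalg.solve(Sigma_hat, alpha_hat)
print(stat)   # compare to a chi^2(N); the 5% critical value for N = 5 is about 11.07
```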


SECTION 14.4 WHEN FACTORS ARE NOT EXCESS RETURNS, ML PRESCRIBES A CROSS-SECTIONAL REGRESSION