

The investor's total risky portfolio is y'R. Hence, Σy gives the covariance of each return
with y'R, and also with the investor's overall portfolio y^f R^f + y'R. If all investors are
identical, then the market portfolio is the same as the individual's portfolio, so Σy also gives
the covariance of each return with R^m = y^f R^f + y'R. (If investors differ in risk aversion,
the same thing goes through, but with an aggregate risk aversion coefficient.)
Thus, we have the CAPM. This version is especially interesting because it ties the market
price of risk to the risk aversion coefficient. Applying (9.119) to the market return itself, we get

\frac{E(R^m) - R^f}{\sigma^2(R^m)} = \gamma.
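As a back-of-the-envelope illustration of this relation (the 8% equity premium and 16% market volatility below are assumed round numbers, not figures from the text), the implied aggregate risk aversion coefficient is easy to compute:

```python
# Illustrative check of E(R^m) - R^f = gamma * sigma^2(R^m).
# The premium and volatility are invented round numbers.
equity_premium = 0.08          # assumed E(R^m) - R^f
sigma_m = 0.16                 # assumed sigma(R^m)

gamma = equity_premium / sigma_m**2   # implied risk aversion, about 3.1
assert abs(gamma - 3.125) < 1e-3
```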

9.1.3 Quadratic value function, dynamic programming.

We can let investors live forever in the quadratic utility CAPM so long as we assume that
the environment is independent over time. Then the value function is quadratic, taking the
place of the quadratic second-period utility function. This case is a nice first introduction to
dynamic programming.

The two-period structure given above is unpalatable, since (most) investors do in fact live
longer than two periods. It is natural to try to make the same basic ideas work with less
restrictive and more palatable assumptions.
We can derive the CAPM in a multi-period context by replacing the second-period quadratic
utility function with a quadratic value function. However, the quadratic value function re-
quires the additional assumption that returns are i.i.d. (no "shifts in the investment oppor-
tunity set"). This observation, due to Fama (1970), is also a nice introduction to dynamic
programming, which is a powerful way to handle multiperiod problems by expressing them
as two-period problems. Finally, I think this derivation makes the CAPM more realistic, trans-
parent and intuitively compelling. Buying stocks amounts to taking bets over wealth; really
the fundamental assumption driving the CAPM is that marginal utility of wealth is linear in
wealth and does not depend on other state variables.
Let's start in a simple ad-hoc manner by just writing down a "utility function" defined
over this period's consumption and next period's wealth,

U = u(c_t) + \beta E_t V(W_{t+1}).

This is a reasonable objective for an investor, and does not require us to make the very ar-
tificial assumption that he will die tomorrow. If an investor with this "utility function" can
buy an asset at price p_t with payoff x_{t+1}, his first-order condition (buy a little more; the
payoff x_{t+1} contributes to wealth next period) is

p_t u'(c_t) = \beta E_t\left[V'(W_{t+1})\, x_{t+1}\right].

Thus, the discount factor uses next period's marginal value of wealth in place of the more
familiar marginal utility of consumption:

m_{t+1} = \beta \frac{V'(W_{t+1})}{u'(c_t)}.

(The envelope condition states that, at the optimum, a penny saved has the same value as a
penny consumed, u'(c_t) = V'(W_t). We could use this condition to express the denominator
in terms of wealth also.)
Now, suppose the value function were quadratic,

V(W_{t+1}) = -\frac{\eta}{2}(W_{t+1} - W^*)^2.

Then, we would have

m_{t+1} = -\beta\eta\, \frac{W_{t+1} - W^*}{u'(c_t)} = -\beta\eta\, \frac{R^W_{t+1}(W_t - c_t) - W^*}{u'(c_t)}

= \left[\frac{\beta\eta W^*}{u'(c_t)}\right] + \left[-\frac{\beta\eta(W_t - c_t)}{u'(c_t)}\right] R^W_{t+1},

or, once again,

m_{t+1} = a_t + b_t R^W_{t+1},

the CAPM!
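The linearity of m in the wealth return is easy to verify numerically. The sketch below writes the quadratic value function as V(W) = −(η/2)(W − W*)² and computes the discount factor directly from V′(W_{t+1}); all parameter values are invented for illustration, and u′(c_t) is simply taken as a given number:

```python
# Sketch: with a quadratic value function V(W) = -(eta/2)*(W - Wstar)**2,
# the discount factor m = beta*V'(W_{t+1})/u'(c_t) is exactly linear in
# the wealth return R^W.  All parameter values below are made up.
beta, eta, Wstar = 0.95, 2.0, 10.0
W_t, c_t = 8.0, 1.0
u_prime_c = 0.5                         # u'(c_t), taken as given

a_t = beta * eta * Wstar / u_prime_c
b_t = -beta * eta * (W_t - c_t) / u_prime_c

def m(RW):
    """Discount factor computed directly from V'(W_{t+1})."""
    W_next = RW * (W_t - c_t)
    V_prime = -eta * (W_next - Wstar)
    return beta * V_prime / u_prime_c

# m(R^W) coincides with a_t + b_t * R^W for any return
for RW in (0.9, 1.0, 1.1):
    assert abs(m(RW) - (a_t + b_t * RW)) < 1e-12
```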
Let's be clear about the assumptions and what they do.
1) The value function only depends on wealth. If other variables entered the value func-
tion, then ∂V/∂W would depend on those other variables, and so would m. This assumption
bought us the first objective of any derivation: the identity of the factors. The ICAPM, be-
low, allows other variables in the value function, and obtains more factors. (Actually, other
variables could enter so long as they don't affect the marginal value of wealth. The weather
is an example: you, like me, might be happier on sunny days, but you do not value additional
wealth more on sunny than on rainy days. Hence, covariance with weather does not affect
how you value stocks.)
2) The value function is quadratic. We wanted the marginal value function V'(W) to be
linear, to buy us the second objective: showing that m is linear in the factor. Quadratic utility
and value functions deliver a globally linear marginal value function V'(W). By the usual
Taylor-series logic, linearity of V'(W) is probably not a bad assumption for small perturbations,
but not a good one for large perturbations.

Why is the value function quadratic?
You might think we are done. But economists are unhappy about a utility function that
has wealth in it. Few of us are like Disney's Uncle Scrooge, who got pure enjoyment out
of a daily swim in the coins in his vault. Wealth is valuable because it gives us access to
more consumption. Utility functions should always be written over consumption. One of the
few real rules in economics that keep our theories from being vacuous is that ad-hoc "utility
functions" over other objects like wealth (or means and variances of portfolio returns, or
"status" or "political power") should be defended as arising from a more fundamental desire
for consumption.
More practically, being careful about the derivation makes clear that the superficially
plausible assumption that the value function is only a function of wealth derives from the
much less plausible, in fact certainly false, assumptions that interest rates are constant, the
distribution of returns is i.i.d., and that the investor has no risky labor income. So, let us see
what it takes to defend the quadratic value function in terms of some utility function.
Suppose investors last forever, and have the standard sort of utility function

U = E_t \sum_{j=0}^{\infty} \beta^j u(c_{t+j}).

Again, investors start with wealth W_0, which earns a random return R^W, and they have no
other source of income. In addition, suppose that interest rates are constant, and stock returns
are i.i.d. over time.
Define the value function as the maximized value of the utility function in this environ-
ment. Thus, define V(W_t) as⁷

V(W_t) \equiv \max_{\{c_t, c_{t+1}, \ldots;\, \alpha_t, \alpha_{t+1}, \ldots\}} E_t \sum_{j=0}^{\infty} \beta^j u(c_{t+j})  (9.120)

\text{s.t.} \quad W_{t+1} = R^W_{t+1}(W_t - c_t); \quad R^W_{t+1} = \alpha_t' R_{t+1}; \quad \alpha_t' 1 = 1.

(I used vector notation to simplify the statement of the portfolio problem; R \equiv [R^1\ R^2\ \ldots\ R^N]',
etc.) The value function is the total level of utility the investor can achieve, given how much
wealth he has, and any other variables constraining him. This is where the assumptions of
no labor income, a constant interest rate and i.i.d. returns come in. Without these assump-
tions, the value function as defined above might depend on these other characteristics of the
investor's environment. For example, if there were some variable, say, "D/P," that indicated
returns would be high or low for a while, then the investor would be happier, and have a
high value, when D/P is high, for a given level of wealth. Thus, we would have to write
V(W_t, D/P_t).
Value functions allow you to express an infinite-period problem as a two-period problem.
Break up the maximization into the first period and all the remaining periods, as follows:

V(W_t) = \max_{\{c_t, \alpha_t\}} \left\{ u(c_t) + \beta E_t \left[ \max_{\{c_{t+1}, c_{t+2}, \ldots;\, \alpha_{t+1}, \alpha_{t+2}, \ldots\}} E_{t+1} \sum_{j=0}^{\infty} \beta^j u(c_{t+1+j}) \right] \right\} \quad \text{s.t.} \ldots

or

V(W_t) = \max_{\{c_t, \alpha_t\}} \left\{ u(c_t) + \beta E_t V(W_{t+1}) \right\} \quad \text{s.t.} \ldots  (9.121)

Thus, we have defended the existence of a value function. Writing down a two-period
"utility function" over this period's consumption and next period's wealth is not as crazy as
it might seem.
The value function is also an attractive view of how people actually make decisions. You
don't think "If I buy a sandwich today, I won't be able to go out to dinner one night 20 years
from now," trading off goods directly as expressed by the utility function. You think "I can't
afford a new car," meaning that the decline in the value of wealth is not worth the increase
in the marginal utility of consumption. Thus, the maximization in (9.121) describes your
psychological approach to utility maximization.

⁷ There is also a transversality condition or a lower limit on wealth in the budget constraints. This keeps the
consumer from consuming a bit more and rolling over more and more debt, and it means we can write the budget
constraint in present-value form.


The remaining question is, can the value function be quadratic? What utility function
assumption leads to a quadratic value function? Here is the fun fact: a quadratic utility
function leads to a quadratic value function in this environment. This is not a law of nature;
it is not true that for any u(c), V(W) has the same functional form. But it is true here and in
a few other special cases. The "in this environment" clause is not innocuous. The value
function, the achieved level of expected utility, is a result of the utility function and the
environment.
How could we show this fact? One way would be to try to calculate the value function
by brute force from its definition, equation (9.120). This approach is not fun, and it does
not exploit the beauty of dynamic programming, which is the reduction of an infinite-period
problem to a two-period problem.
Instead, solve (9.121) as a functional equation. Guess that the value function V(W_{t+1})
is quadratic, with some unknown parameters. Then use the recursive definition of V(W_t) in
(9.121), and solve a two-period problem: find the optimal consumption choice, plug it into
(9.121) and calculate the value function V(W_t). If the guess was right, you obtain a quadratic
function for V(W_t), and determine any free parameters.
Let's do it. Specify

u(c_t) = -\frac{1}{2}(c_t - c^*)^2
V(W_{t+1}) = -\frac{\gamma}{2}(W_{t+1} - W^*)^2

with γ and W^* parameters to be determined later. Then the problem (9.121) is (I don't write
the portfolio choice α part for simplicity; it doesn't change anything)

V(W_t) = \max_{\{c_t\}} \left\{ -\frac{1}{2}(c_t - c^*)^2 - \beta\frac{\gamma}{2} E_t\left(W_{t+1} - W^*\right)^2 \right\} \quad \text{s.t.} \quad W_{t+1} = R^W_{t+1}(W_t - c_t).

(E_t is now E since I assumed i.i.d.) Substituting the constraint into the objective,

V(W_t) = \max_{\{c_t\}} \left\{ -\frac{1}{2}(c_t - c^*)^2 - \beta\frac{\gamma}{2} E\left[R^W_{t+1}(W_t - c_t) - W^*\right]^2 \right\}.  (9.122)

The first-order condition with respect to c_t, using ĉ_t to denote the optimal value, is

\hat{c}_t - c^* = \beta\gamma E\left\{\left[R^W_{t+1}(W_t - \hat{c}_t) - W^*\right] R^W_{t+1}\right\}.

Solving for ĉ_t,

\hat{c}_t = c^* + \beta\gamma E\left\{R^{W2}_{t+1} W_t - \hat{c}_t R^{W2}_{t+1} - W^* R^W_{t+1}\right\}

\hat{c}_t \left[1 + \beta\gamma E(R^{W2}_{t+1})\right] = c^* + \beta\gamma E(R^{W2}_{t+1}) W_t - \beta\gamma W^* E(R^W_{t+1})

\hat{c}_t = \frac{c^* - \beta\gamma E(R^W_{t+1}) W^* + \beta\gamma E(R^{W2}_{t+1}) W_t}{1 + \beta\gamma E(R^{W2}_{t+1})}.  (9.123)

This is a linear function of W_t. Writing (9.122) in terms of the optimal value of ĉ_t, we get

V(W_t) = -\frac{1}{2}(\hat{c}_t - c^*)^2 - \beta\frac{\gamma}{2} E\left[R^W_{t+1}(W_t - \hat{c}_t) - W^*\right]^2.  (9.124)

This is a quadratic function of W_t and ĉ_t. A quadratic function of a linear function is a
quadratic function, so the value function is a quadratic function of W_t. If you want to spend
a pleasant few hours doing algebra, plug (9.123) into (9.124), check that the result really is
quadratic in W_t, and determine the coefficients γ, W^* in terms of fundamental parameters
β, c^*, E(R^W), E(R^{W2}) (or σ²(R^W)). The expressions for γ, W^* do not give much insight,
so I don't do the algebra here.
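The "pleasant few hours of algebra" can also be delegated to a computer. The sketch below (all parameter values and return moments are invented for illustration) plugs the optimal consumption rule (9.123) into (9.124) numerically, fits a quadratic through three wealth points, and checks that the fit matches the value function everywhere else, which it must if V(W_t) is truly quadratic:

```python
import numpy as np

# Numerical check that plugging (9.123) into (9.124) gives a function
# that is exactly quadratic in W_t.  All parameters and moments are
# invented for illustration.
beta, gamma, cstar, Wstar = 0.9, 2.0, 1.0, 10.0
ERW, ERW2 = 1.05, 1.05**2 + 0.1**2        # assumed E(R^W) and E(R^W2)

def c_hat(W):
    # (9.123): optimal consumption, linear in wealth
    return (cstar - beta * gamma * ERW * Wstar
            + beta * gamma * ERW2 * W) / (1 + beta * gamma * ERW2)

def V(W):
    # (9.124), using E[R(W-c) - W*]^2
    #   = E(R^W2)(W-c)^2 - 2 W* E(R^W)(W-c) + W*^2
    s = W - c_hat(W)
    Esq = ERW2 * s**2 - 2 * ERW * Wstar * s + Wstar**2
    return -0.5 * (c_hat(W) - cstar)**2 - 0.5 * beta * gamma * Esq

# fit a quadratic through three points; if V is truly quadratic,
# the fit must then match V at every other point
W_fit = np.array([2.0, 5.0, 8.0])
coeffs = np.polyfit(W_fit, V(W_fit), 2)
for W in np.linspace(0.0, 12.0, 25):
    assert abs(np.polyval(coeffs, W) - V(W)) < 1e-9
```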

9.1.4 Log utility

Log utility rather than quadratic utility also implies a CAPM. Log utility implies that
consumption is proportional to wealth, allowing us to substitute the wealth return for con-
sumption data.

The point of the CAPM is to avoid the use of consumption data, and so to use wealth
or the rate of return on wealth instead. Log utility is another special case that allows this
substitution. Log utility is much more plausible than quadratic utility.
Suppose that the investor has log utility

u(c) = ln(c).

Define the wealth portfolio as a claim to all future consumption. Then, with log utility, the
price of the wealth portfolio is proportional to consumption itself,

p^W_t = E_t \sum_{j=1}^{\infty} \beta^j \frac{u'(c_{t+j})}{u'(c_t)} c_{t+j} = E_t \sum_{j=1}^{\infty} \beta^j \frac{c_t}{c_{t+j}} c_{t+j} = \frac{\beta}{1-\beta}\, c_t.

The return on the wealth portfolio is proportional to consumption growth,

R^W_{t+1} = \frac{p^W_{t+1} + c_{t+1}}{p^W_t} = \frac{\left(\frac{\beta}{1-\beta} + 1\right) c_{t+1}}{\frac{\beta}{1-\beta}\, c_t} = \frac{1}{\beta} \frac{c_{t+1}}{c_t} = \frac{1}{\beta} \frac{u'(c_t)}{u'(c_{t+1})}.

Thus, the log utility discount factor equals the inverse of the wealth portfolio return,

m_{t+1} = \frac{1}{R^W_{t+1}}.  (9.125)
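This algebra is easy to check numerically. The sketch below (an arbitrary consumption path and a made-up β) verifies that with p^W_t = [β/(1−β)]c_t, the wealth return is (1/β)c_{t+1}/c_t, so the log-utility discount factor m = βc_t/c_{t+1} is exactly the inverse of R^W:

```python
# Check the log-utility algebra on an arbitrary consumption path:
# with p^W_t = beta/(1-beta) * c_t, the wealth return is
# (1/beta) * c_{t+1}/c_t, so m = beta * c_t/c_{t+1} = 1/R^W.
# The path and beta below are made-up numbers.
beta = 0.95
c = [1.0, 1.02, 0.99, 1.05]
for t in range(len(c) - 1):
    p_now  = beta / (1 - beta) * c[t]        # price of the consumption claim
    p_next = beta / (1 - beta) * c[t + 1]
    RW = (p_next + c[t + 1]) / p_now         # wealth portfolio return
    m  = beta * c[t] / c[t + 1]              # log-utility discount factor
    assert abs(RW - c[t + 1] / (beta * c[t])) < 1e-12
    assert abs(m * RW - 1.0) < 1e-12         # m = 1/R^W
```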

Equation (9.125) could be used by itself: it attains the goal of replacing consumption data
with some other variable. (Brown and Gibbons 1982 test a CAPM in this form.) Note that log
utility is the only assumption so far. We do not assume constant interest rates, i.i.d. returns,
or the absence of labor income.
Log utility has a special property that "income effects offset substitution effects," or, in
an asset pricing context, that "discount rate effects offset cash flow effects." News of higher
consumption = dividend should make the claim to consumption more valuable. However,
through u'(c) it also raises the discount rate, lowering the value of the claim to consumption.
For log utility, these two effects exactly offset.

9.1.5 Linearizing any model: Taylor approximations and normal distributions.

Any nonlinear model m = f(z) can be turned into a linear model m = a + bz in discrete
time by assuming normal returns.

It is traditional in the CAPM literature to try to derive a linear relation between m and
the wealth portfolio return. We could always do this by a Taylor approximation,

m_{t+1} \approx a_t + b_t R^W_{t+1}.

We can make this approximation exact in a special case: that the factors and all asset returns
are normally distributed. (We can also take the continuous-time limit, which is really the
same thing. However, this discrete-time trick is common and useful.) First, I quote without
proof the central mathematical trick as a lemma.

Lemma 1 (Stein's lemma): If f, R are bivariate normal, g(f) is differentiable, and
E|g'(f)| < ∞, then

cov\left[g(f), R\right] = E[g'(f)]\, cov(f, R).  (9.126)
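The lemma is easy to check by Monte Carlo. The sketch below (arbitrary bivariate-normal moments, and the nonlinear choice g(f) = f³, both invented for illustration) compares the two sides of (9.126) in a large simulated sample:

```python
import numpy as np

# Monte Carlo check of Stein's lemma: with (f, R) bivariate normal and
# g(f) = f**3, the lemma says cov[g(f), R] = E[3 f**2] * cov(f, R).
# Means and covariances below are arbitrary.
rng = np.random.default_rng(42)
n = 2_000_000
mean = [0.1, 0.05]
cov = [[0.04, 0.012], [0.012, 0.09]]
f, R = rng.multivariate_normal(mean, cov, size=n).T

lhs = np.cov(f**3, R)[0, 1]                  # cov[g(f), R]
rhs = np.mean(3 * f**2) * np.cov(f, R)[0, 1] # E[g'(f)] * cov(f, R)
assert abs(lhs - rhs) < 1e-3                 # equal up to simulation error
```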

Now we can use the lemma to state the theorem.

Theorem 2: If m = g(f), if f and a set of payoffs priced by m are jointly normally
distributed, and if |E[g'(f)]| < ∞, then there is a linear model m = a + bf that prices the
normally distributed returns.


Proof: First, the definition of covariance means that the pricing equation can be
rewritten as a restriction between mean returns and the covariance of returns with
the discount factor:

1 = E(mR) \iff 1 = E(m)E(R) + cov(m, R).  (9.127)

Now, given m = g(f), with f and R jointly normal, apply Stein's lemma (9.126):

1 = E[g(f)]E(R) + E[g'(f)]\, cov(f, R)

1 = E[g(f)]E(R) + cov\left(E[g'(f)] f, R\right).

Exploiting the \Leftarrow part of (9.127), we know that an m with mean E[g(f)] and that
depends on f via E[g'(f)] f will price assets,

m = E[g(f)] + E[g'(f)]\left[f - E(f)\right].


Using this trick, and recalling that we have not assumed i.i.d., so all these moments are
conditional, the log utility CAPM implies the linear model

m_{t+1} = E_t\left(\frac{1}{R^W_{t+1}}\right) - E_t\left[\left(\frac{1}{R^W_{t+1}}\right)^2\right] \left[R^W_{t+1} - E_t(R^W_{t+1})\right]  (9.128)

if R^W and all asset returns to be priced are normally distributed. From here it is a short
step to an expected return-beta representation using the wealth portfolio return as the factor.
In the same way, we can trade the quadratic utility function for normal distributions in the
dynamic programming derivation of the CAPM. Starting from

m_{t+1} = \beta \frac{V'(W_{t+1})}{u'(c_t)} = \beta \frac{V'\left[R^W_{t+1}(W_t - c_t)\right]}{u'(c_t)}

we can derive an expression that links m linearly to R^W by assuming normality.
Using the same trick, the consumption-based model can be written in linear fashion, i.e.,
expected returns can be expressed as a linear function of betas on consumption growth rather
than betas on consumption growth raised to a power. However, for large risk aversion co-
efficients (more than about 10 in postwar consumption data) or other transformations, the
inaccuracies due to the normal or lognormal approximation can be very significant in dis-
crete data.
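The size of these inaccuracies is easy to illustrate. The sketch below (assumed lognormal consumption-growth moments, not estimates from the text) compares the risk-free rate implied by the power-utility discount factor m = βg^{−γ}, with g = c_{t+1}/c_t, against the rate implied by its first-order linearization; the gap grows rapidly with γ:

```python
import numpy as np

# Linearization error for power utility: m = beta * g**(-gamma) versus
# the first-order approximation m_lin = beta * (1 - gamma*(g - 1)).
# Log consumption-growth moments below are assumed round numbers.
beta, mu, sigma = 0.95, 0.02, 0.01
rng = np.random.default_rng(0)
g = np.exp(mu + sigma * rng.standard_normal(1_000_000))

errors = {}
for gamma in (2, 10, 20):
    m_exact = beta * g ** (-gamma)           # power-utility discount factor
    m_lin = beta * (1 - gamma * (g - 1))     # first-order Taylor approximation
    errors[gamma] = abs(1 / m_exact.mean() - 1 / m_lin.mean())

# the implied risk-free-rate error grows quickly with risk aversion
assert errors[2] < errors[10] < errors[20]
```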
The normal distribution assumption seems rather restrictive, and it is. However, the most
popular class of continuous-time models specifies instantaneously normal distributions even
for things like options that have very non-normal distributions over discrete time intervals.
Therefore, one can think of the Stein's lemma tricks as a way to get to continuous-time
approximations without doing it in continuous time. I demonstrate the explicit continuous-
time approach with the ICAPM, in the next section.

9.1.6 Portfolio intuition

The classic derivation of the CAPM contains some useful intuition. The classic derivation
starts with a mean-variance objective for portfolio wealth, max Eu(W). Beta drives average
returns because beta measures how much adding a bit of the asset to a diversified portfolio
increases the volatility of the portfolio.
The central insight that started it all is that investors care about portfolio returns, not about
the behavior of specific assets. Once the characteristics of portfolios replaced demand curves
for individual stocks, modern finance was born.

9.2 Intertemporal Capital Asset Pricing Model (ICAPM)

Any “state variable” zt can be a factor. The ICAPM is a linear factor model with wealth
and state variables that forecast changes in the distribution of future returns or income.

The ICAPM generates linear discount factor models

mt+1 = a + b0 ft+1

in which the factors are "state variables" for the investor's consumption-portfolio decision.
The "state variables" are the variables that determine how well the investor can do in
his maximization. Current wealth is obviously a state variable. Additional state variables
describe the conditional distribution of income and asset returns the agent will face in the
future, or "shifts in the investment opportunity set." In multiple-good or international models,
relative price changes are also state variables.
Optimal consumption is a function of the state variables, c_t = g(z_t). We can use this fact
once again to substitute out consumption, and write

m_{t+1} = \beta\, \frac{u'[g(z_{t+1})]}{u'[g(z_t)]}.
From here, it is a simple linearization to deduce that the state variables zt+1 will be factors.
Alternatively, the value function depends on the state variables

V (Wt+1 , zt+1 ),


so we can write

m_{t+1} = \beta\, \frac{V_W(W_{t+1}, z_{t+1})}{V_W(W_t, z_t)}.

(The marginal value of a dollar must be the same in any use, so I made the denominator pretty
by writing u'(c_t) = V_W(W_t, z_t). This fact is known as the envelope condition.)
This completes the first step, naming the proxies. To obtain a linear relation, we can take
a Taylor approximation, assume normality and use Stein's lemma, or, most conveniently,
move to continuous time (which is really just a more convenient way of making the normal
approximation). We saw above that we can write the basic pricing equation in continuous
time as

E\left(\frac{dp}{p}\right) - r\, dt = -E\left(\frac{d\Lambda}{\Lambda} \frac{dp}{p}\right)

(for simplicity of the formulas, I'm folding any dividends into the price process). The dis-
count factor is marginal utility, which is the same as the marginal value of wealth,

\frac{d\Lambda_t}{\Lambda_t} = \frac{du'(c_t)}{u'(c_t)} = \frac{dV_W(W_t, z_t)}{V_W}.
Our objective is to express the model in terms of factors z rather than marginal utility or
value, and Ito's lemma makes this easy:

\frac{dV_W}{V_W} = \frac{V_{WW}}{V_W}\, dW + \frac{V_{Wz}}{V_W}\, dz + \text{(second derivative terms)}.

(We don't have to grind out the second derivative terms if we are going to take r^f dt =
-E_t(d\Lambda/\Lambda), though this approach removes a potentially interesting and testable implication
of the model.) The elasticity of marginal value with respect to wealth is often called the
coefficient of relative risk aversion,

rra \equiv -\frac{W V_{WW}}{V_W}.
Substituting, we obtain the ICAPM, which relates expected returns to the covariance of re-
turns with wealth, and also with the other state variables,

E\left(\frac{dp}{p}\right) - r\, dt = rra\, E\left(\frac{dW}{W} \frac{dp}{p}\right) - \frac{V_{Wz}}{V_W} E\left(\frac{dp}{p}\, dz\right).

From here, it is fairly straightforward to express the ICAPM in terms of betas rather than
covariances, or as a linear discount factor model. Most empirical work occurs in discrete
time; we often simply approximate the continuous-time result as

E(R) - R^f \approx rra\, cov(R, \Delta W) + \lambda_z\, cov(R, \Delta z).


One often substitutes covariance with the wealth portfolio for covariance with wealth, and
one uses factor-mimicking portfolios for the other factors dz as well. The factor-mimicking
portfolios are interesting for portfolio advice as well, as they give the purest way of hedging
against or profiting from state-variable risk exposure.
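The discrete-time approximation has the structure of a cross-sectional regression of mean excess returns on the two covariances. The sketch below (all numbers invented) generates expected excess returns from known risk prices and recovers rra and λ_z by least squares:

```python
import numpy as np

# Sketch of the discrete-time ICAPM cross-section: generate expected
# excess returns from E(R) - Rf = rra*cov(R, dW) + lam_z*cov(R, dz)
# with known risk prices, then recover them by regression.
# All numbers are invented.
rng = np.random.default_rng(1)
rra_true, lam_z_true = 4.0, -1.5
n_assets = 25
cov_W = rng.uniform(0.001, 0.01, n_assets)    # cov(R^i, dW) for each asset
cov_z = rng.uniform(-0.005, 0.005, n_assets)  # cov(R^i, dz) for each asset
excess = rra_true * cov_W + lam_z_true * cov_z

# cross-sectional least squares on the two covariances
X = np.column_stack([cov_W, cov_z])
rra_hat, lam_z_hat = np.linalg.lstsq(X, excess, rcond=None)[0]
assert abs(rra_hat - rra_true) < 1e-10
assert abs(lam_z_hat - lam_z_true) < 1e-10
```

With real data the covariances are estimated and the fit is not exact, but the mapping from covariances to risk prices is the same.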
This short derivation does not do justice to the beauty of Merton's portfolio theory and
ICAPM. What remains is to actually state the consumer's problem and prove that the value
function depends on W and z, the state variables for future investment opportunities, and that
the optimal portfolio holds the market and hedge portfolios for the investment-opportunity
state variables.

9.3 Comments on the CAPM and ICAPM

Conditional vs. unconditional models.
Do they price options?
Why bother linearizing?
The wealth portfolio.
Ex-post returns.
The implicit consumption-based model.
What are the ICAPM state variables?
CAPM and ICAPM as general equilibrium models

Is the CAPM conditional or unconditional?
Is the CAPM a conditional or an unconditional factor model? I.e., are the parameters a
and b in m = a − bR^W constants, or do they change at each time period, as conditioning in-
formation changes? We saw above that a conditional CAPM does not imply an unconditional
CAPM, so additional steps must be taken to say anything about observed average returns.
The two-period quadratic utility-based derivation results in a conditional CAPM, since the
parameters a_t and b_t depend on consumption, which changes over time. Also, we know that a
and b must vary over time if the conditional moments of R^W, R^f vary over time. This two-
period investor chooses a portfolio on the conditional mean-variance frontier, which is not on
the unconditional frontier. The multiperiod quadratic utility CAPM only holds if returns are
i.i.d., so it only holds if there is no difference between conditional and unconditional models.
The log utility CAPM expressed with the inverse market return is a beautiful model, since
it holds both conditionally and unconditionally. There are no free parameters that can change
with conditioning information:

1 = E_t\left(\frac{1}{R^W_{t+1}}\, R_{t+1}\right) \iff 1 = E\left(\frac{1}{R^W_{t+1}}\, R_{t+1}\right).

In fact, there are no free parameters at all! Furthermore, the model makes no distributional as-
sumptions, so it can apply to any asset, including options. Finally, it requires no specification
of the investment opportunity set, or (in macro language) no specification of technology.
Linearizing the log utility CAPM comes at enormous price. The expectations in the lin-
earized log utility CAPM (9.128) are conditional. Thus, the apparent simplification of linear-
ity destroys the nice unconditional feature of the log utility CAPM.
Should the CAPM price options?
As I have derived them, the quadratic utility CAPM and the nonlinear log utility CAPM
should apply to all payoffs: stocks, bonds, options, contingent claims, etc. However, if we as-
sume normal return distributions to obtain a linear CAPM from log utility, we can no longer
hope to price options, since option returns are non-normally distributed (that's the point of
options!). Even the normal distribution for regular returns is a questionable assumption. You
may hear the statement "the CAPM is not designed to price derivative securities"; the state-
ment refers to the log utility plus normal-distribution derivation of the linear CAPM.
Why linearize?
Why bother linearizing a model? Why take the log utility model m = 1/R^W, which
should price any asset, and turn it into m_{t+1} = a_t + b_t R^W_{t+1}, which loses the clean
conditioning-down property and cannot price non-normally distributed payoffs? These tricks
were developed before the p = E(mx) expression of asset pricing models, when (linear) expected
return-beta models were the only thing around. You need a linear model of m to get an ex-
pected return-beta model. More importantly, the tricks were developed when it was hard to
estimate nonlinear models. It's clear how to estimate a β and a λ by regressions, but estimat-
ing nonlinear models used to be a big headache. Now, GMM has made it easy to estimate and
evaluate nonlinear models. Thus, in my opinion, linearization is mostly intellectual baggage.
The desire for linear representations and this normality trick is one of the central reasons
why many asset pricing models are written in continuous time. In most continuous-time
models, everything is locally normal. Unfortunately for empiricists, this approach adds time-
aggregation and another layer of unobservable conditioning information into the predictions
of the model. For this reason, most empirical work is still based on discrete-time models.
However, the local normal distributions in continuous time, even for option returns, are a good
reminder that normal approximations probably aren't that bad, so long as the time interval is
kept reasonably short.
What about the wealth portfolio?
The log utility derivation makes clear just how expansive is the concept of the wealth
portfolio. To own a (share of) the consumption stream, you have to own not only all stocks,
but all bonds, real estate, privately held capital, publicly held capital (roads, parks, etc.), and
human capital, a nice word for "people." Clearly, the CAPM is a poor defense of common
proxies such as the value-weighted NYSE portfolio. And keep in mind that since it is easy to
find ex-post mean-variance efficient portfolios of any subset of assets (like stocks) out there,
taking the theory seriously is our only guard against fishing.
Implicit consumption-based models
Many users of alternative models clearly are motivated by a belief that the consumption-
based model doesn't work, no matter how well measured consumption might be. This view is
not totally unreasonable; as above, perhaps transactions costs de-link consumption and asset
returns at high frequencies, and some diagnostic evidence suggests that the consumption
behavior necessary to save the consumption model is too wild to be believed.
However, the derivations make clear that the CAPM and ICAPM are not alternatives to
the consumption-based model; they are special cases of that model. In each case m_{t+1} =
βu'(c_{t+1})/u'(c_t) still operates. We just added assumptions that allowed us to substitute other
variables in place of c_t. One cannot adopt the CAPM on the belief that the consumption-
based model is wrong. If you think the consumption-based model is wrong, the economic
justification for the alternative factor models evaporates.
The only plausible excuse for factor models is a belief that consumption data are un-
satisfactory. However, while asset return data are well measured, it is not obvious that the
S&P500 or other portfolio returns are terrific measures of the return to total wealth. "Macro
factors" used by Chen, Roll and Ross (1986) and others are distant proxies for the quanti-
ties they want to measure, and macro factors based on other NIPA aggregates (investment,
output, etc.) suffer from the same measurement problems as aggregate consumption.
In large part, the "better performance" of the CAPM and ICAPM relative to consumption-
based models comes from throwing away content. Again, m_{t+1} = δu'(c_{t+1})/u'(c_t) is there
in any CAPM or ICAPM. The CAPM and ICAPM make predictions concerning consump-
tion data that are wildly implausible, not only for admittedly poorly measured aggregate con-
sumption data but for any imaginable perfectly measured individual consumption data as well.
For example, equation (9.129) says that the standard deviation of the wealth portfolio return
equals the standard deviation of consumption growth. The latter is about 1% per year. All the
miserable failures of the log-utility consumption-based model apply equally to the log util-
ity CAPM. Finally, most models take the market price of risk as a free parameter. Of course
it isn't; it is related to risk aversion and consumption volatility, and is very hard to justify as
a free parameter.
Ex-post returns
The log utility model also allows us for the first time to look at what moves returns ex post
as well as ex ante. Recall that, in the log utility model, we have

R^W_{t+1} = \frac{1}{\beta} \frac{c_{t+1}}{c_t}.  (9.129)

Thus, the wealth portfolio return is high, ex post, when consumption is high. This holds at
every frequency: if stocks go up between 12:00 and 1:00, it must be because (on average) we
all decided to have a big lunch. This seems silly. Aggregate consumption and asset returns are
likely to be de-linked at high frequencies, but how high (quarterly?) and by what mechanism
are important questions to be answered. In any case, this is another implication of the log
utility CAPM that is just thrown out.
In sum, the poor performance of the consumption-based model is an important nut to
chew on, not just a blind alley or failed attempt that we can safely disregard and go on about
our business.
Identity of state variables
The ICAPM does not tell us the identity of the state variables z_t, and many authors use
the ICAPM as an obligatory citation to theory on the way to using factors composed of
ad-hoc portfolios, leading Fama (1991) to characterize the ICAPM as a "fishing license."
The ICAPM really isn't quite such an expansive license. One could do a lot to insist that the
factor-mimicking portfolios actually are the projections of some identifiable state variables
onto the space of returns, and one could do a lot to make sure the candidate state variables really
are plausible state variables for an explicitly stated optimization problem. For example, one
could check that investment-opportunity-set state variables actually do forecast something.
The fishing license comes as much from habits of applying the theory as from the theory
itself.
General equilibrium models
The CAPM and other models are really general equilibrium models. Looking at the
derivation through general-equilibrium glasses, we have specified a set of linear technologies
with returns R^i that do not depend on the amount invested. Some derivations make further
assumptions, such as an initial capital stock, and no labor or labor income.
The CAPM is obviously very artificial. Its central place really comes from its long string
of empirical successes rather than its theoretical purity. The theory was extended and multiple
factors anticipated long before they became empirically popular.
Portfolio intuition
I have derived all the models as instances of the consumption-based model. The more tra-
ditional portfolio intuition for multifactor models is also useful. The intuition (and historical
development) comes from looking past consumption to its determinants in sources of income
or news.
The CAPM simplifies matters by assuming that the average investor only cares about the per-
formance of his investment portfolio. Most of us have jobs, so events like recessions hurt the
majority of investors. People with jobs will prefer stocks that don't fall in recessions, even if
their market betas, mean returns, and standard deviations are the same as stocks that do fall
in recessions. Demanding such stocks, they drive down the corresponding expected returns.
Thus, we expect expected returns to depend on additional betas that capture labor market
conditions.


The traditional ICAPM intuition works the same way. Even jobless investors have long
horizons. Thus, they will prefer stocks that do well when news comes that future returns are
lower. Demanding more of such stocks, they depress expected returns. Thus, expected re-
turns come to depend on covariation with news of future returns, not just covariation with the
current market return. The ICAPM remained on the theoretical shelf for 20 years mostly be-
cause it took that long to accumulate empirical evidence that returns are, in fact, predictable.
It is vitally important that the extra factors affect the average investor. If an event makes
investor A worse off and investor B better off, then investor A buys assets that do well when
the event happens, and investor B sells them. They transfer the risk of the event, but the
price or expected return of the asset is unaffected. For a factor to affect prices or expected
returns, the average investor must be affected by it, so investors collectively bid up or down
the price and expected return of assets that covary with the event rather than just transfer the
risk without affecting equilibrium prices.
As you can see, this traditional intuition is encompassed by consumption. Bad labor
market outcomes or bad news about future returns are bad news that raise the marginal utility
of wealth, which equals the marginal utility of consumption.

9.4 Arbitrage Pricing Theory (APT)

The APT: If a set of asset returns are generated by a linear factor model

    R^i = E(R^i) + Σ_j β_ij f_j + µ^i,    E(µ^i) = E(µ^i f_j) = 0,

then (with additional assumptions) there is a discount factor m linear in the factors, m = a + b'f, that prices the returns.

The APT starts from a statistical characterization. There is a big common component
to stock returns: when the market goes up, most individual stocks also go up. Beyond the
market, groups of stocks move together such as computer stocks, utilities, small stocks, value
stocks and so forth. Finally, each stock’s return has some completely idiosyncratic movement.
This is a characterization of realized returns, outcomes or payoffs. The point of the APT is to
start with this statistical characterization of outcomes, and derive something about expected
returns or prices.
The intuition behind the APT is that the completely idiosyncratic movements in asset returns should not carry any risk prices, since investors can diversify them away by holding portfolios. Therefore, risk prices or expected returns on a security should be related to the security’s covariance with the common components or “factors” only.
The job of this section is then 1) to describe a mathematical model of the tendency for stocks to move together, and thus to define the “factors” and residual idiosyncratic components, and 2) to think carefully about what it takes for the idiosyncratic components to have zero (or small) risk prices, so that only the common components matter to asset pricing.
There are two lines of attack for the second item. 1) If there were no residual, then we could price securities from the factors by arbitrage (really, by the law of one price, but the current distinction between law of one price and arbitrage came after the APT was named). Perhaps we can extend this logic and show that if the residuals are small, they must have small risk prices. 2) If investors all hold well-diversified portfolios, then only variations in the factors drive consumption and hence marginal utility.
Much of the original appeal and marketing of the APT came from the first line of attack, the idea that we could derive pricing implications without the economic structure required of the CAPM, ICAPM, or any other model derived as a specialization of the consumption-based model. In this section, I will first try to see how far we can in fact get with purely law of one price arguments. I will conclude that the answer is “not very far,” and that the most satisfactory argument for the APT is in fact just another specialization of the consumption-based model.

9.4.1 Factor structure in covariance matrices

I define and examine the factor decomposition

    x^i = α_i + β_i'f + µ^i;    E(µ^i) = 0, E(f µ^i) = 0.

The factor decomposition is equivalent to a restriction on the payoff covariance matrix.

The APT models the tendency of asset payoffs (returns) to move together via a statistical factor decomposition

    x^i = α_i + Σ_j β_ij f_j + µ^i = α_i + β_i'f + µ^i.                (9.130)

The f_j are the factors, the β_ij are the betas or factor loadings, and the µ^i are residuals. As usual, I use the same letter without subscripts to denote a vector, for example f = [f_1 f_2 ... f_K]'. A discount factor m, pricing factors f in m = b'f, and this factor decomposition (or factor structure) for returns are totally unrelated uses of the word “factor.”


I didn’t invent the terminology! The APT is conventionally written with x^i = returns, but it ends up being much less confusing to use prices and payoffs.
It is a convenient and conventional simplification to fold the factor means into the first, constant, factor and write the factor decomposition with zero-mean factors f̃ ≡ f − E(f):

    x^i = E(x^i) + Σ_j β_ij f̃_j + µ^i.                (9.131)

Remember that E(x^i) is still just a statistical characterization, not a prediction of a model. We can construct the factor decomposition as a regression equation. Define the β_ij as regression coefficients, and then the µ^i are uncorrelated with the factors by construction, E(µ^i f̃_j) = 0.

The content, the assumption that keeps (9.131) from describing any arbitrary set of returns, is the assumption that the µ^i are uncorrelated with each other:

    E(µ^i µ^j) = 0.

(More general versions of the model allow some limited correlation across the residuals but the basic story is the same.)
The factor structure is thus a restriction on the covariance matrix of payoffs. For example, if there is only one factor, then

    cov(x^i, x^j) = E[(β_i f̃ + µ^i)(β_j f̃ + µ^j)] = β_i β_j σ²(f) + { σ²(µ^i) if i = j;  0 if i ≠ j }.

Thus, with N = number of securities, the N(N − 1)/2 elements of a variance-covariance matrix are described by N betas, and N + 1 variances. A vector version of the same thing is

    cov(x, x') = ββ'σ²(f) + diag(σ₁², σ₂², ..., σ_N²).

With multiple (orthogonalized) factors, we obtain

    cov(x, x') = β₁β₁'σ²(f₁) + β₂β₂'σ²(f₂) + ... + (diagonal matrix).

In all these cases, we describe the covariance matrix by a singular matrix ββ' (or a sum of a few such singular matrices) plus a diagonal matrix.
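As a quick numerical check of this covariance restriction, the sketch below simulates a hypothetical one-factor world (all parameter values are made up for illustration) and compares the sample covariance matrix to ββ'σ²(f) plus a diagonal matrix:

```python
import numpy as np

rng = np.random.default_rng(0)
N, T = 5, 200_000                          # 5 assets, many draws so sample moments are tight
beta = rng.uniform(0.5, 1.5, N)            # hypothetical factor loadings
sigma_f = 0.2                              # assumed factor volatility
sigma_u = rng.uniform(0.05, 0.15, N)       # assumed idiosyncratic volatilities

f = sigma_f * rng.standard_normal(T)       # common factor draws
u = sigma_u * rng.standard_normal((T, N))  # residuals, uncorrelated across assets
x = np.outer(f, beta) + u                  # x^i = beta_i f + mu^i (means folded out)

sample_cov = np.cov(x, rowvar=False)
model_cov = np.outer(beta, beta) * sigma_f**2 + np.diag(sigma_u**2)
err = np.max(np.abs(sample_cov - model_cov))   # pure sampling error
```

With 200,000 draws the two matrices agree to a few decimal places: one beta per asset plus a diagonal reproduces all N(N − 1)/2 covariances.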
If we know the factors we want to use ahead of time, say the market (value-weighted portfolio) and industry portfolios, or size and book-to-market portfolios, we can estimate a factor structure by running regressions. Often, however, we don’t know the identities of the factor portfolios ahead of time. In this case we have to use one of several statistical techniques under the broad heading of factor analysis (that’s where the word “factor” came from in this context) to estimate the factor model. One can estimate a factor structure quickly by simply taking an eigenvalue decomposition of the covariance matrix, and then setting small eigenvalues to zero.
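The eigenvalue shortcut can be sketched as follows. In a simulated one-factor world (hypothetical parameters), keeping only the largest eigenvalue of the covariance matrix recovers the common component ββ'σ²(f):

```python
import numpy as np

rng = np.random.default_rng(1)
N, T = 10, 100_000
beta = rng.uniform(0.5, 1.5, N)            # hypothetical loadings
sigma_f, sigma_u = 0.2, 0.05               # assumed factor and residual volatilities
f = sigma_f * rng.standard_normal(T)
x = np.outer(f, beta) + sigma_u * rng.standard_normal((T, N))

S = np.cov(x, rowvar=False)
w, V = np.linalg.eigh(S)                   # eigenvalues in ascending order
rank1 = w[-1] * np.outer(V[:, -1], V[:, -1])   # keep only the largest eigenvalue
gap = np.max(np.abs(rank1 - np.outer(beta, beta) * sigma_f**2))
```

The remaining small eigenvalues are set to zero; the top eigenvector lines up with β, so the rank-one piece approximates the common component up to the small idiosyncratic variance.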

9.4.2 Exact factor pricing

With no error term,

    x^i = E(x^i)1 + β_i'f̃.

Then the law of one price implies

    p(x^i) = E(x^i)p(1) + β_i'p(f̃),

and thus

    m = a + b'f;   p(x^i) = E(m x^i);   E(R^i) = R^f + β_i'λ,

using only the law of one price.

Suppose that there are no idiosyncratic terms µ^i. This is called an exact factor model. Now look again at the factor decomposition,

    x^i = E(x^i)1 + β_i'f̃.                (9.132)

It started as a statistical decomposition. But it also says that the payoff x^i can be synthesized as a portfolio of the factors and a constant (risk-free payoff). Thus, the price of x^i can only depend on the prices of the factors f̃,

    p(x^i) = E(x^i)p(1) + β_i'p(f̃).       (9.133)

The law of one price assumption lets you take prices of both the right and left sides.
If the factors are returns, their prices are 1. If the factors are not returns, their prices are free parameters which can be picked to make the model fit as well as possible. Since there are fewer factors than payoffs, this procedure is not vacuous. (Recall that the prices of the factors are related to the λ in expected return-beta representations. λ is determined by the expected return of a return factor, and is a free parameter for non-return factor models.)
We are really done, but the APT is usually stated as “there is a discount factor linear in f that prices returns R^i,” or “there is an expected return-beta representation with f as factors.” Therefore, we should take a minute to show that the rather obvious relationship (9.133) between prices is equivalent to discount factor and expected return statements.
Assuming only the law of one price, we know there is a discount factor m linear in the factors that prices the factors. We usually call it x*, but call it f* here to remind us that it prices the factors. Denote f̂ = [1 f']' the factors including the constant. As with x*,

    f* = p(f̂)'E(f̂f̂')⁻¹f̂ = a + b'f

satisfies p(f̂) = E(f*f̂) and p(1) = E(f*). If the discount factor prices the factors, it must price any portfolio of the factors; hence f* prices all payoffs x^i that follow the factor structure (9.132).
We could now go from m linear in the factors to an expected return-beta model using the above theorems that connect the two representations. But there is a more direct and elegant connection. Start with (9.133), specialized to returns x^i = R^i and of course p(R^i) = 1. Use p(1) = 1/R^f and solve for expected return as

    E(R^i) = R^f + β_i'[−R^f p(f̃)] = R^f + β_i'λ.

The last equality defines λ. Expected returns are linear in the betas, and the constants (λ) are related to the prices of the factors. In fact, this is the same definition of λ that we arrived at above connecting m = b'f to expected return-beta models.
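Here is a small numeric sketch of this logic in a hypothetical finite-state setup (all numbers are made up): a discount factor linear in the factor prices every exactly-spanned return, and E(R^i) = R^f + β_i λ with λ = −R^f p(f̃) holds to machine precision.

```python
import numpy as np

rng = np.random.default_rng(2)
S = 6
prob = np.full(S, 1.0 / S)                # equally likely states (assumed)
f = 0.3 * rng.standard_normal(S)          # one factor payoff across states (made up)
m = 1.0 - 0.5 * (f - prob @ f)            # a hypothetical discount factor linear in f

Rf = 1.0 / (prob @ m)                     # p(1) = E(m) = 1/Rf
ftilde = f - prob @ f                     # de-meaned factor
lam = -Rf * (prob @ (m * ftilde))         # lambda = -Rf * p(ftilde)

errs = []
for b in (0.0, 0.7, -1.3):
    x = 1.0 + b * ftilde                  # payoff exactly spanned by {1, ftilde}
    p = prob @ (m * x)                    # its price, by the law of one price
    R, beta_R = x / p, b / p              # return and its loading on ftilde
    errs.append((prob @ R) - (Rf + beta_R * lam))   # E(R) - [Rf + beta*lambda]
max_err = max(abs(e) for e in errs)
```

The pricing errors are zero up to floating-point roundoff, for any loading b: that is the exact factor pricing result.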

9.4.3 Approximate APT using the law of one price

Attempts to extend the exact factor model to an approximate factor pricing model when
errors are “small,” or markets are “large,” still only using law of one price.
For fixed m, the APT gets better and better as R² or the number of assets increases. However, for any fixed R² or size of market, the APT can be arbitrarily bad.
These observations mean that we must go beyond the law of one price to derive factor
pricing models.

Actual returns do not display an exact factor structure. There is some idiosyncratic or residual risk; we cannot exactly replicate the return of a given stock with a portfolio of a few large factor portfolios. However, the idiosyncratic risks are often small. For example, factor model regressions of the form (9.130) often have very high R², especially when portfolios rather than individual securities are on the left-hand side. And the residual risks are still idiosyncratic: even if they are a large part of an individual security’s variance, they should be a small contributor to the variance of well-diversified portfolios. Thus, there is reason to hope that the APT holds approximately, especially for reasonably large portfolios. Surely, if the residuals are “small” and/or “idiosyncratic,” the price of an asset can’t be “too different” from the price predicted from its factor content?


To think about these issues, start again from a factor structure, but this time put in a residual,

    x^i = E(x^i)1 + β_i'f̃ + µ^i.

Again take prices of both sides,

    p(x^i) = E(x^i)p(1) + β_i'p(f̃) + E(mµ^i).

Now, what can we say about the price of the residual, p(µ^i) = E(mµ^i)?
Figure 23 illustrates the situation. Portfolios of the factors span a payoff space, the line from the origin through β_i'f̃ in the figure. The payoff we want to price, x^i, is not in that space, since the residual µ^i is not zero. A discount factor f* that is in the f payoff space prices the factors. The set of all discount factors that price the factors is the line m perpendicular to f*. The residual µ^i is orthogonal to the factor space, since it is a regression residual, and to f* in particular, E(f*µ^i) = 0. This means that f* assigns zero price to the residual. But the other discount factors on the m line are not orthogonal to µ^i, so they generate non-zero prices for the residual µ^i. As we sweep along the line of discount factors m that price the f, in fact, we generate every price from −∞ to ∞ for the residual. Thus, the law of one price does not nail down the price of the residual µ^i, and hence the price or expected return of x^i.

Figure 23. Approximate arbitrage pricing. (The figure shows the set of all discount factors m that price the factors, and the subset with σ²(m) < A.)


Limiting arguments

We would like to show that the price of x^i has to be “close to” the price of β_i'f̃. One notion of “close to” is that in some appropriate limit the price of x^i converges to the price of β_i'f̃. “Limit” means, of course, that you can get arbitrarily good accuracy by going far enough in the direction of the limit (for every ε > 0 there is a δ...). Thus, establishing a limit result is a way to argue for an approximation.
Here is one theorem that seems to imply that the APT should be a good approximation for portfolios that have high R² on the factors. I state the argument for the case that there is a constant factor, so the constant is in the f space and E(µ^i) = 0. The same ideas work in the less usual case that there is no constant factor, using second moments in place of variances.
Theorem: Fix a discount factor m that prices the factors. Then, as var(µ^i) → 0, p(x^i) → p(β_i'f̃).

This is easiest to see by just looking at the graph. E(µ^i) = 0, so var(µ^i) = E[(µ^i)²] = ||µ^i||². Thus, as the size of the µ^i vector in Figure 23 gets smaller, x^i gets closer and closer to β_i'f̃. For any fixed m, the induced pricing function (lines perpendicular to the chosen m) is continuous. Thus, as x^i gets closer and closer to β_i'f̃, its price gets closer and closer to p(β_i'f̃).
The factor model is defined as a regression, so

    var(x^i) = var(β_i'f̃) + var(µ^i).

Thus, the variance of the residual is related to the regression R²:

    var(µ^i)/var(x^i) = 1 − R².
The theorem says that as R² → 1, the price of the residual goes to zero.
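The theorem can be illustrated directly. In a hypothetical finite-state market (all numbers made up), fix one discount factor, shrink the residual, and the pricing error E(mµ^i) goes to zero linearly with the size of µ^i:

```python
import numpy as np

rng = np.random.default_rng(8)
S = 8
prob = np.full(S, 1.0 / S)                          # equally likely states (assumed)
f = np.vstack([np.ones(S), 0.3 * rng.standard_normal(S)])  # constant + one factor (made up)
m = 0.95 + 0.1 * rng.standard_normal(S)             # one fixed, made-up discount factor

E_ff = f @ np.diag(prob) @ f.T                      # build a residual orthogonal to the
u = rng.standard_normal(S)                          # factors, as a regression residual is
u -= np.linalg.solve(E_ff, f @ (prob * u)) @ f

spanned = 1.0 + 0.5 * f[1]                          # the beta'f part of the payoff
p_spanned = prob @ (m * spanned)

gaps = []
for scale in (1.0, 0.1, 0.01):                      # shrink var(mu) by 100x each step
    x = spanned + scale * u
    gaps.append(prob @ (m * x) - p_spanned)         # pricing error = scale * E(m u)
```

For this fixed m the pricing error shrinks by exactly the factor by which the residual shrinks, as the continuity argument above says it must.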
We were hoping for some connection between the fact that the risks are idiosyncratic and factor pricing. Even if the idiosyncratic risks are a large part of the payoff at hand, they are a small part of a well-diversified portfolio. The next theorem shows that portfolios with high R² don’t have to happen by chance; well-diversified portfolios will always have this characteristic.

Theorem: As the number of primitive assets increases, the R² of well-diversified portfolios increases to 1.

Proof: Start with an equally weighted portfolio

    x^p = (1/N) Σ_{i=1}^N x^i.

Going back to the factor decomposition (9.130) for each individual asset x^i, the factor decomposition of x^p is

    x^p = (1/N) Σ_i (α_i + β_i'f + µ^i) = (1/N) Σ_i α_i + [(1/N) Σ_i β_i]'f + (1/N) Σ_i µ^i = α_p + β_p'f + µ^p.

The last equality defines notation α_p, β_p, µ^p. But

    var(µ^p) = var((1/N) Σ_i µ^i).

So long as the variances of the µ^i are bounded, and given the factor assumption E(µ^i µ^j) = 0,

    lim_{N→∞} var(µ^p) = 0.

Obviously, the same idea goes through so long as the portfolio spreads some weight on all the new assets, i.e. so long as it is “well-diversified.”

These two theorems can be interpreted to say that the APT holds approximately (in the usual limiting sense) for portfolios that either naturally have high R², or for well-diversified portfolios in large enough markets. We have only used the law of one price.
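The diversification result is easy to see in simulation. In a made-up one-factor market, the R² of an equally weighted portfolio on the factor climbs toward 1 as the number of assets grows, even though each individual asset keeps substantial idiosyncratic variance:

```python
import numpy as np

rng = np.random.default_rng(3)
T = 20_000
f = 0.2 * rng.standard_normal(T)          # single common factor (assumed volatility)

r2 = []
for N in (10, 100, 1000):
    beta = rng.uniform(0.5, 1.5, N)       # hypothetical loadings
    u = 0.3 * rng.standard_normal((T, N)) # idiosyncratic risk, uncorrelated across assets
    x = np.outer(f, beta) + u
    xp = x.mean(axis=1)                   # equally weighted portfolio
    resid = xp - beta.mean() * f          # the portfolio residual mu^p
    r2.append(1.0 - resid.var() / xp.var())
```

Each asset’s own R² stays well below one, but var(µ^p) = (1/N²) Σ var(µ^i) shrinks, so the portfolio R² approaches one as the theorem says.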

Law of one price arguments fail

Now, let me pour some cold water on these results. I fixed m and then let other things take limits. The flip side is that for any nonzero residual µ^i, no matter how small, we can pick a discount factor m that prices the factors and assigns any price to x^i! As often in mathematics, the order of “for all” and “there exists” matters a lot.

Theorem: For any nonzero residual µ^i there is a discount factor that prices the factors f (consistent with the law of one price) and that assigns any desired price in (−∞, ∞) to the payoff x^i.

So long as ||µ^i|| > 0, as we sweep the choice of m along the dashed line, the inner product of m with µ^i and hence with x^i varies from −∞ to ∞. Thus, for a given size R² < 1, or a given finite market, the law of one price says absolutely nothing about the prices of payoffs that do not exactly follow the factor structure. The law of one price says that two ways of constructing the same portfolio must give the same price. If the residual is not exactly zero, there is no way of replicating the payoff x^i from the factors and no way to infer anything about the price of x^i from the price of the factors.
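The theorem is easy to verify numerically. In a hypothetical eight-state market (all numbers made up), every discount factor m = f* + wµ prices the factors identically, yet assigns a different price to a payoff with a nonzero residual; sweeping w sweeps that price over the whole real line:

```python
import numpy as np

rng = np.random.default_rng(4)
S = 8
prob = np.full(S, 1.0 / S)                       # equally likely states (assumed)
f2 = np.array([-0.3, -0.2, -0.1, 0.0, 0.1, 0.2, 0.3, 0.0])  # made-up factor payoff
f = np.vstack([np.ones(S), f2])                  # constant + one factor
pf = np.array([0.95, 0.02])                      # made-up factor prices

E_ff = f @ np.diag(prob) @ f.T
fstar = pf @ np.linalg.solve(E_ff, f)            # the discount factor in the factor space

u = rng.standard_normal(S)                       # residual mu: project orthogonal to factors
u -= np.linalg.solve(E_ff, f @ (prob * u)) @ f
x = 1.0 + 0.5 * f2 + u                           # payoff that does not follow the factor structure

prices = []
for w in (-10.0, 0.0, 10.0):
    m = fstar + w * u                            # every such m prices the factors exactly...
    assert np.allclose(f @ (prob * m), pf)
    prices.append(prob @ (m * x))                # ...but assigns x a different price
```

The price of x moves one-for-one with w·E(µ²), so the law of one price alone leaves it completely undetermined.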
I think the contrast between this theorem and those of the last subsection accounts for most of the huge theoretical controversy over the APT. If you fix m and take limits of N or µ, the APT gets arbitrarily good. But if you fix N or µ, as one does in any application, the APT can get arbitrarily bad as you search over possible m.
The lesson I learn is that the effort to extend prices from an original set of securities (f in
this case) to new payoffs that are not exactly spanned by the original set of securities, using
only the law of one price, is fundamentally doomed. To extend a pricing function, you need
to add some restrictions beyond the law of one price.

9.4.4 Beyond the law of one price: arbitrage and Sharpe ratios

We can find a well-behaved approximate APT if we impose the law of one price and a restriction on the volatility of discount factors, or, equivalently, a bound on the Sharpe ratio achievable by portfolios of the factors and test assets.

The approximate APT based on the law of one price fell apart because we could always choose a discount factor sufficiently “far out” to generate an arbitrarily large price for an arbitrarily small residual. But those discount factors are surely “unreasonable.” Surely, we can rule them out, reestablishing an approximate APT, without jumping all the way to fully specified discount factor models such as the CAPM or consumption-based model.
A natural first idea is to impose the no-arbitrage restriction that m must be positive. Graphically, we are now restricted to the solid m line in Figure 23. Since that line only extends a finite amount, restricting us to strictly positive m’s gives rise to finite upper and lower arbitrage bounds on the price of µ^i and hence x^i. (The term “arbitrage bounds” comes from option pricing, and we will see these ideas again in that context. If this idea worked, it would restore the APT to “arbitrage pricing” rather than “law of one-pricing.”)
Alas, in applications of the APT (as often in option pricing), the arbitrage bounds are too wide to be of much use. The positive discount factor restriction is equivalent to saying “if portfolio A gives a higher payoff than portfolio B in every state of nature, then the price of A must be higher than the price of B.” Since stock returns and factors are continuously distributed, not two-state distributions as I have graphed for Figure 23, there typically are no strictly dominating portfolios, so adding m > 0 does not help.
A second restriction does let us derive an approximate APT that is useful in finite markets with R² < 1. We can restrict the variance and hence the size (||m||² = E(m²) = σ²(m) + E(m)² = σ²(m) + 1/(R^f)²) of the discount factor. Figure 23 includes a plot of the discount factors with limited variance, size, or length in the geometry of that figure. The restricted range of discount factors gives us upper and lower bounds for the price of x^i in terms of the factor prices. Precisely, the upper and lower bounds solve the problem

    min (or max) over m:  p(x^i) = E(mx^i)  s.t.  E(mf) = p(f),  m ≥ 0,  σ²(m) ≤ A.

Limiting the variance of the discount factor is of course the same as limiting the maximum Sharpe ratio (mean / standard deviation of excess return) available from portfolios of the factors and x^i. Recall that

    E(R^e)/σ(R^e) ≤ σ(m)/E(m).
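A brute-force sketch of these good-deal bounds in a hypothetical eight-state market: sweep over discount factors m = f* + wµ that price the factors, discard those violating m ≥ 0 or σ²(m) ≤ A, and record the range of prices for x^i. The bound A and all other numbers are made up:

```python
import numpy as np

rng = np.random.default_rng(4)
S = 8
prob = np.full(S, 1.0 / S)                       # equally likely states (assumed)
f2 = np.array([-0.3, -0.2, -0.1, 0.0, 0.1, 0.2, 0.3, 0.0])  # made-up factor payoff
f = np.vstack([np.ones(S), f2])                  # constant + one factor
pf = np.array([0.95, 0.02])                      # made-up factor prices

E_ff = f @ np.diag(prob) @ f.T
fstar = pf @ np.linalg.solve(E_ff, f)            # discount factor in the factor space
u = rng.standard_normal(S)
u -= np.linalg.solve(E_ff, f @ (prob * u)) @ f   # residual orthogonal to the factors
x = 1.0 + 0.5 * f2 + u                           # payoff with a nonzero residual

A = 0.5                                          # assumed variance bound sigma^2(m) <= A
ok_prices = []
for w in np.linspace(-20, 20, 40_001):
    m = fstar + w * u
    if np.all(m >= 0) and (prob @ m**2 - (prob @ m) ** 2) <= A:
        ok_prices.append(prob @ (m * x))
lo, hi = min(ok_prices), max(ok_prices)          # finite upper and lower good-deal bounds
```

Unlike the law-of-one-price case, where the price of x^i could be anything, the volatility bound restricts w to a finite interval and so delivers finite price bounds.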

Though a bound on Sharpe ratios or discount factor volatility is not a totally preference-free concept, it clearly imposes a great deal less structure than the CAPM or ICAPM, which are essentially full general equilibrium models. Ross (1976) included this suggestion in his original APT paper, though it seems to have disappeared from the literature since then in the failed effort to derive an APT from the law of one price alone. Ross pointed out that deviations from factor pricing could provide very high Sharpe ratio opportunities, which seem implausible though they are not violations of the law of one price. Saá-Requejo and I (2000) dub this idea “good-deal” pricing, as an extension of “arbitrage pricing.” Limiting σ(m) rules out “good deals” as well as pure arbitrage opportunities.
Having imposed a limit A on discount factor volatility or the Sharpe ratio, the APT limit does work, and does not depend on the order of “for all” and “there exists.”

Theorem: As µ^i → 0 and R² → 1, the price p(x^i) assigned by any discount factor m that satisfies E(mf) = p(f), m ≥ 0, σ²(m) ≤ A approaches p(β_i'f̃).

9.5 APT vs. ICAPM

A factor structure in the covariance of returns or high R2 in regressions of returns on
factors can imply factor pricing (APT) but factors can price returns without describing their
covariance matrix (ICAPM).
Differing inspiration for factors.
The disappearance of absolute pricing.

The APT and ICAPM stories are often confused. Factor structure can imply factor pric-
ing (APT), but factor pricing does not require a factor structure. In the ICAPM there is no
presumption that factors f in a pricing model m = b0 f describe the covariance matrix of
returns. The factors don’t have to be orthogonal or i.i.d. either. High R² in time-series regressions of the returns on the factors may imply factor pricing (APT), but again is not necessary (ICAPM). The regressions of returns on factors can have low R² in the ICAPM. Factors such as industry may describe large parts of returns’ variances but not contribute to the explanation of average returns.
The biggest difference between the APT and ICAPM for empirical work is in the inspiration for factors. The APT suggests that one start with a statistical analysis of the covariance matrix of returns and find portfolios that characterize common movement. The ICAPM suggests that one start by thinking about state variables that describe the conditional distribution of future asset returns and non-asset income. More generally, the idea of proxying for marginal utility growth suggests macroeconomic indicators, and indicators of shocks to non-asset income in particular.
The difference between the derivations of factor pricing models, and in particular an approximate law-of-one-price basis vs. a proxy-for-marginal-utility basis, seems not to have had much impact on practice. In practice, we just test models m = b'f and rarely worry about derivations. The best evidence for this view is the introductions of famous papers. Chen, Roll and Ross (1986) describe one of the earliest popular multifactor models, using industrial production and inflation as some of the main factors. They do not even present a factor decomposition of test asset returns, or the time-series regressions. A reader might well categorize the paper as a macroeconomic factor model or perhaps an ICAPM. Fama and French (1993) describe the currently most popular multifactor model, and their introduction describes it as an ICAPM in which the factors are state variables. But the factors are sorted on size and book/market just like the test assets, the time-series R² are all above 90%, and much of the explanation involves “common movement” in test assets captured by the factors. A reader might well categorize the model as much closer to an APT.
In the first chapter, I made a distinction between relative pricing and absolute pricing. In the former, we price one security given the prices of others, while in the latter, we price each security by reference to fundamental sources of risk. The factor pricing stories are interesting in that they start with a nice absolute pricing model, the consumption-based model, and throw out enough information to end up with relative models. The CAPM prices R^i given the market, but throws out the consumption-based model’s description of where the market return came from.

9.6 Problems

1. Suppose the investor only has a one-period horizon. He invests wealth W at date zero, and only consumes with expected utility Eu(c) = Eu(W) in period one. Derive the quadratic utility CAPM in this case. (This is an even simpler derivation. The Lagrange multiplier on initial wealth W now becomes the denominator of m in place of u'(c₀).)
2. Express the log utility CAPM in continuous time to derive a discount factor linear in
3. Figure 23 suggests that m > 0 is enough to establish a well-behaved approximate APT. The text claims this is not true. Which is right?
4. Can you use any excess return for the market factor in the CAPM, or must it be the
market less the riskfree rate?

Estimating and evaluating asset
pricing models


Our first task in bringing an asset pricing model to data is to estimate the free parameters: the β and γ in m = β(c_{t+1}/c_t)^{−γ}, or the b in m = b'f. Then we want to evaluate the model. Is it a good model or not? Is another model better?
Statistical analysis helps us to evaluate a model by providing a distribution theory for numbers such as parameter estimates that we create from the data. A distribution theory pursues the following idea: suppose that we generate artificial data over and over again from a statistical model. For example, we could specify that the market return is an i.i.d. normal random variable, and that a set of stock returns is generated by R^i_t = α_i + β_i R^{em}_t + ε^i_t. After picking values for the mean and variance of the market return and the α_i, β_i, σ²(ε^i), we could ask a computer to simulate many artificial data sets. We can repeat our statistical procedure in each of these artificial data sets, and graph the distribution of any statistic which we have estimated from the real data, i.e. the frequency that it takes on any particular value in our artificial data sets.
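The simulation idea in this paragraph can be sketched in a few lines. Here an artificial CAPM-style world with a true intercept of zero generates the sampling distribution of the estimated intercept (all parameter values invented):

```python
import numpy as np

rng = np.random.default_rng(6)
T, trials = 600, 2000
alpha, beta, sig_e = 0.0, 1.2, 0.15             # true (made-up) parameters

alpha_hats = []
for _ in range(trials):
    rem = 0.05 + 0.2 * rng.standard_normal(T)   # artificial market excess returns
    r = alpha + beta * rem + sig_e * rng.standard_normal(T)
    X = np.column_stack([np.ones(T), rem])      # regress r on a constant and the market
    a_hat, b_hat = np.linalg.lstsq(X, r, rcond=None)[0]
    alpha_hats.append(a_hat)

mean_a, sd_a = np.mean(alpha_hats), np.std(alpha_hats)  # sampling distribution of alpha-hat
```

The histogram of `alpha_hats` is exactly the kind of object a distribution theory describes analytically: centered at the true value, with a spread that shrinks as T grows.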
In particular, we are interested in a distribution theory for the estimated parameters, to give
us some sense of how much the data really has to say about their values; and for the pricing
errors, which helps us to judge whether pricing errors are just bad luck of one particular
historical accident or if they indicate a failure of the model. We also will want to generate
distributions for statistics that compare one model to another, or provide other interesting
evidence, to judge how much sample luck affects those calculations.
All of the statistical methods I discuss in this part achieve these ends. They give methods for estimating free parameters; they provide a distribution theory for those parameters; and they provide distributions for statistics that we can use to evaluate models, most often a quadratic form of pricing errors, α̂'V⁻¹α̂.
I start by focusing on the GMM approach. The GMM approach is a natural fit for a discount factor formulation of asset pricing theories, since we just use sample moments in the place of population moments. As you will see, there is no singular “GMM estimate and test.” GMM is a large canvas and a big set of paints and brushes; a flexible tool for doing all kinds of sensible (and, unless you’re careful, not-so-sensible) things to the data. Then I consider traditional regression tests (naturally paired with expected return-beta statements of factor models) and their maximum likelihood formalization. I emphasize the fundamental similarities between these three methods, as I emphasized the similarity between p = E(mx), expected return-beta models, and mean-variance frontiers. A concluding chapter highlights some of the differences between the methods, as I contrasted p = E(mx) and beta or mean-variance representations of the models.

Chapter 10. GMM in explicit discount
factor models
The basic idea in the GMM approach is very straightforward. The asset pricing model predicts

    E(p_t) = E[m(data_{t+1}, parameters) x_{t+1}].                 (134)

The most natural way to check this prediction is to examine sample averages, i.e. to calculate

    (1/T) Σ_{t=1}^T p_t   and   (1/T) Σ_{t=1}^T [m(data_{t+1}, parameters) x_{t+1}].    (135)

GMM estimates the parameters by making the sample averages as close to each other as
possible. It seems natural, before evaluating a model, to pick parameters that give it its best
chance. GMM then works out a distribution theory for the estimates. This distribution theory
is a generalization of the simplest exercise in statistics: the distribution of the sample mean.
Then, it suggests that we evaluate the model by looking at how close the sample averages
of price and discounted payoff are to each other, or equivalently by looking at the pricing
errors. It gives a statistical test of the hypothesis that the underlying population means are in
fact zero.
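A minimal sketch of this idea, with made-up data: one parameter b in m_t = 1 − b(f_t − f̄) and one pricing moment for an excess return (price zero), so the sample moment can be driven exactly to zero:

```python
import numpy as np

rng = np.random.default_rng(7)
T = 50_000
f = 0.02 + 0.1 * rng.standard_normal(T)         # a made-up pricing factor
re = 1.5 * (f - 0.02) + 0.03 + 0.05 * rng.standard_normal(T)  # an excess return

fbar = f.mean()
def g_T(b):
    """Sample pricing error (1/T) sum of m_t(b) * re_t, which the model says is zero."""
    return np.mean((1.0 - b * (f - fbar)) * re)

# exactly identified (one moment, one parameter): solve g_T(b) = 0 in closed form
b_hat = np.mean(re) / np.mean((f - fbar) * re)
moment_at_bhat = g_T(b_hat)                     # sample average price = sample average E(m*re)
```

With as many moments as parameters the pricing errors are set exactly to zero in sample; with more moments than parameters, GMM instead minimizes a quadratic form of the g_T’s, and the leftover pricing errors become the basis of a test.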

