Estimation and Calibration

7.1 Introduction
Virtually all models in finance have parameters that must be specified in order
to completely describe them. Statistical estimation can be used for some or all
of these parameters under two important conditions:

1. We believe that the parameter values do not change quickly in time.

2. We have access to sufficient (recent) historical data on the underlying
process to provide reliable estimates of these parameter values.

Statistical estimation involves the distribution of the data and uses methods
such as maximum likelihood estimation or least squares.

There are also some instances in which estimation of parameters is undesirable.
For example, we have seen that a risk-neutral distribution Q, used in
the pricing of options, is not necessarily identical to the historical distribution
P. Estimating parameter values for the distribution P based on historical data
may have little or nothing to do with their values under Q. However, market
information, for example the prices of options on a given underlying stock, carries
information about the expected value of certain functions of the stock under
Q, and the process of adjusting parameter values so that the model prescribes
option prices as close as possible to those observed is called calibration of the
model.
Although there is a substantial philosophical difference between calibration
and statistical estimation, there are many similarities in the methodology, since
both often require finding the root of one or more functions or maximizing or
minimizing a function, and so these are the problems with which we begin this
section.
There are many numerical problems that become much more difficult in the
presence of noise, from finding the roots of a given function to solving a system
of differential equations. Problems of this sort have given rise to substantial
research in a variety of areas, including stochastic approximation and Markov
chain Monte Carlo. In statistics, for example, we typically estimate parameters
using least squares or maximum likelihood estimation. However, how do we


carry out maximum likelihood estimation in cases where no simple form for the
likelihood function is available? Suppose we are able to simulate data from a
model with a given set of parameter values, say, but unable to express the
likelihood function in a simple form amenable to numerical maximization.
For some concreteness, suppose we are interested in calibrating a model for
the price of options on the SPX, the S&P500 index. The call option prices in
Table 7.1 were observed on May 6, 2004, when the SPX was quoted at 1116.84.
All options expired on September 18, corresponding to T = 0.3699 years. Each
option price is calculated as the midpoint of the bid and ask; K is the exercise
price and PO(K) the corresponding option price.
K       950     1005    1050    1100    1125    1150    1175    1200    1225    1250
PO(K)   173.1   124.8   88.5    53.7    39.3    27.3    17.85   10.95   6.2     3.25

Table 7.1: Price of SPX call options

Suppose we use the Black-Scholes formula to determine the implied volatility
for these options. The implied volatility is the value of σ solving

BS(1116.84, K, r, 0.3699, σ) = PO(K)        (7.1)

where r is the spot interest rate. If exact prices were known (rather than just the
bid and ask prices), we could determine r from the put-call parity relation, since
there are also put options listed with the same exercise price. However, here we
used the rate r = 0.01 of a short-term treasury bill, since it is difficult to find nearly
simultaneous trades in put and call options with the same strike price. The
MATLAB function BLSIMPV(S_0, K, r, T, PO(K)) returns the implied volatility,
i.e. provides the solution of (7.1). In Figure 7.1 we give what is often referred to
as the volatility smile, although smirks and frowns are probably more common
than smiles these days (surely a sign of the times). This is a graph of the
exercise price of the option against the implied volatility of the option. The fact
that this implied volatility is not constant is further evidence that the Black-
Scholes model (at least with constant parameters) does not fit the risk-neutral
distribution very well.
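The computation behind BLSIMPV is easy to sketch: since the Black-Scholes call price is strictly increasing in σ, a simple bisection solves (7.1). The following Python sketch is an independent illustration, not MATLAB's implementation; the bracket [10^{-4}, 5] and the tolerance are arbitrary choices.

```python
from math import log, sqrt, exp, erf

def norm_cdf(x):
    # standard normal CDF via the error function
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

def bs_call(S, K, r, T, sigma):
    # Black-Scholes price of a European call option
    d1 = (log(S / K) + (r + 0.5 * sigma ** 2) * T) / (sigma * sqrt(T))
    d2 = d1 - sigma * sqrt(T)
    return S * norm_cdf(d1) - K * exp(-r * T) * norm_cdf(d2)

def implied_vol(price, S, K, r, T, lo=1e-4, hi=5.0, tol=1e-8):
    # bisection on sigma: the call price is strictly increasing in volatility,
    # so the root of BS(...) - price is unique and bracketed by [lo, hi]
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if bs_call(S, K, r, T, mid) > price:
            hi = mid
        else:
            lo = mid
    return 0.5 * (lo + hi)
```

For instance, implied_vol(39.3, 1116.84, 1125, 0.01, 0.3699) gives the implied volatility of the K = 1125 option in Table 7.1.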
We saw in Chapter 3 that distributions such as the normal inverse Gaussian
distribution can do a better job of fitting historical distributions of stocks and
indices than does the normal, and so we might hope that it can be applied
successfully to the fit of a risk-neutral distribution. The main problem in applying
distributions such as the NIG is that of calibrating the parameters of the model
to market data. In models such as this, if option prices are obtained from a
simulation, how can we hope to obtain parameter values for the model that are
consistent with a set of market prices for the options? More specifically, suppose
we assume that the price of a stock at time T is given by

S_T = S_0 e^{rT} exp(X)

where X has the NIG(α, β, δ, µ) distribution of Lemma 29.

Figure 7.1: The volatility "smile" for SPX call options

In order that the discounted future price forms a martingale, we require that
E(exp(X)) = 1 and, from (3.23) with s = 1, this implies that

µ = δ{ √(α² − (1 + β)²) − √(α² − β²) } + (1/4) ln( (α² − β²) / (α² − (1 + β)²) )

and

|β + 1| < α.
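As a numerical check, µ can be computed directly from this condition; a small sketch, assuming the formula above, with the α and β values used later in this section (the δ value is an arbitrary placeholder):

```python
from math import log, sqrt

def nig_martingale_mu(alpha, beta, delta):
    # drift mu forced by the martingale condition E[exp(X)] = 1;
    # requires |beta + 1| < alpha so that both square roots are real
    assert abs(beta + 1.0) < alpha
    a2b2 = alpha ** 2 - beta ** 2            # alpha^2 - beta^2
    a2b12 = alpha ** 2 - (1.0 + beta) ** 2   # alpha^2 - (1 + beta)^2
    return delta * (sqrt(a2b12) - sqrt(a2b2)) + 0.25 * log(a2b2 / a2b12)

mu = nig_martingale_mu(95.23, -4.72, 0.01)   # delta = 0.01 is a placeholder
```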
One of the four parameters, µ, is determined by the martingale condition, so the
other three can be used to fit to market option data if we wish. The price of a
call option with exercise price K is now

E(S_0 exp(X) − e^{−rT} K)^+

and this expectation is either a rather complicated numerical exercise or one
that can be estimated by simulation. Whether or not we choose to price this
option using simulation, the calibration problem reduces to selecting values of
the parameters so that the theoretical option price agrees with the market price,
i.e.

E(S_0 exp(X) − e^{−rT} K)^+ = PO(K).        (7.2)

If we have 10 options as in Table 7.1 and only three parameters to vary, there is
no hope for exact equality in (7.2) (10 equations in 3 unknowns is not promising).
We could either fix two of the parameters, e.g. α, β, and select the third,
δ, to agree with the price of a specific option, or we could calibrate all three
parameters to three or more options with similar strike prices and minimize the
sum of the squared differences between observed and theoretical option prices.
We will return to this problem later, but for the present notice that in order
to solve it, we need to be able to find either roots or minima of functions when
evaluations of these functions are noisy and have measurement or simulation
error superimposed. We begin a discussion of this problem in the next section.

7.2 Finding a Root

Let us begin with the simple problem of identifying the root of a non-decreasing
function, i.e. solving for the value x_0 satisfying

f(x_0) = 0.

The standard algorithm for this is the Newton-Raphson algorithm, which begins
with an arbitrary value x_1 and then iterates

x_{n+1} = x_n − f(x_n)/f′(x_n).        (7.3)

Suppose, however, that our observations on the function f(x) are of the form

Y(x) = f(x) + ε_x

where the errors ε_x at the point x are small, with mean 0 and variance σ_x². Even
if we know the gradient D = f′(x_0) of the function at or near the root (whether
we use the gradient at the root or at the value x_n approaching the root does not
materially affect our argument), the attempt at a Newton-Raphson algorithm

x_{n+1} = x_n − Y_{x_n}/D = x_n − f(x_n)/D − ε_{x_n}/D        (7.4)
has a serious problem. If there is no error whatsoever in the evaluation of the
function, the iteration

x_{n+1} = x_n − f(x_n)/D

typically converges to the root x_0. However in (7.4), on each iteration we add
to the "correct" Newton-Raphson iterant x_n − f(x_n)/D an error term ε_{x_n}/D which
is often independent of the previous errors, having variance D^{−2} σ_{x_n}². Therefore,
after a total of N iterations of the process, we have accumulated an error which
has variance

Σ_{i=1}^N D^{−2} σ_{x_i}².

The fact that this error typically increases without bound means that continuing
the iteration of (7.4) does not provide a consistent estimator of the root but
one whose error eventually grows. Newton's method is inappropriate for any
model in which errors in the evaluation of the function, when accumulated until
convergence is declared, are comparable in size to the tolerance of the algorithm.
In general, as long as the standard deviation σ_x of the function evaluations is of
the same order of magnitude as

tolerance × D/√N

where N is the number of iterations before convergence was declared, the
accumulation of the error in (7.4) clouds any possible conclusion concerning the
location of the root. Suppose for example you declare a tolerance of 10^{−4} and
it takes about 100 iterations for your program to claim convergence. Assume
for the sake of argument that D = 0.01, N ≈ 100. Then an error in the function
evaluation of order σ_x ≈ 10^{−5} will invalidate conclusions from the Newton-
Raphson iteration.
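The failure is easy to see in a toy simulation, here a sketch with an assumed linear function f(x) = x − 2, D = 1 and Gaussian evaluation noise: the iterate (7.4) never settles down, and its spread stays at the noise level instead of shrinking.

```python
import random

random.seed(1)

def noisy_newton(x, n_iter=500, sigma=0.1, D=1.0):
    # iteration (7.4) applied to f(x) = x - 2 with N(0, sigma^2) evaluation noise
    path = []
    for _ in range(n_iter):
        y = (x - 2.0) + random.gauss(0.0, sigma)   # noisy evaluation Y(x)
        x = x - y / D
        path.append(x)
    return path

path = noisy_newton(5.0)
tail = path[100:]                       # discard the (instant) transient
mean = sum(tail) / len(tail)
sd = (sum((v - mean) ** 2 for v in tail) / len(tail)) ** 0.5
```

Here the iterate hovers around the root x_0 = 2 but its standard deviation stays near σ = 0.1, however long we iterate.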
There is a relatively simple fix to the usual Newton-Raphson method which
at least corrects the lack of convergence, unfortunately at the price of greater
insensitivity when we are farther from the root. Suppose we multiply the constant
D^{−1} by a sequence of constants a_n that depend on n, so the iteration is of
the form

x_{n+1} = x_n − a_n D^{−1} Y_{x_n}.        (7.5)

It is easy to see that as long as σ_x is bounded, for example, and

Σ a_n² < ∞        (7.6)

then, under mild conditions on the function f(x), x_n converges. This is intuitively
clear because the variance of the error beyond the Nth term is roughly
Σ_{n=N}^∞ a_n² σ_x², and this approaches zero as N → ∞. Furthermore, provided that

Σ a_n = ∞,        (7.7)

convergence is to a root of the function f(x). This is the result of Robbins and
Monro (1951). Having the correct or asymptotically correct gradient D in (7.5)
really only matters asymptotically, since it controls the rate of convergence once
we are in a small neighbourhood around the root. What is more important, and
sometimes difficult to achieve, is arranging that the sequence a_n is large when
we are quite far from the root, but small (something like 1/n) when we are
very close.
The ideal sequence of constants a_n in (7.5) is approximately 1 (this is the
unmodified Newton-Raphson) until we are in the vicinity of the root and then
decreases slowly, like 1/n. Provided we have some easy device for determining
whether we are near the root, the Robbins-Monro method gives an adequate
solution to the problem of finding roots of functions when only one root exists.
However, for functions with many roots, other methods are required.
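A minimal sketch of the Robbins-Monro iteration (7.5), again on the assumed toy problem f(x) = x − 2 with D = 1; the gains a_n = 1/n satisfy both (7.6) and (7.7), so the accumulated noise variance stays bounded and the iterate converges:

```python
import random

random.seed(2)

def robbins_monro(x, n_iter=20000, sigma=0.5, D=1.0):
    # iteration (7.5) with a_n = 1/n: sum a_n = infinity (7.7),
    # sum a_n^2 < infinity (7.6), so the noise is damped out
    for n in range(1, n_iter + 1):
        y = (x - 2.0) + random.gauss(0.0, sigma)   # noisy evaluation Y(x)
        x = x - (1.0 / n) * y / D
    return x

root = robbins_monro(5.0)
```

Unlike the raw Newton-Raphson iteration above, the error here shrinks like σ/√n rather than persisting at the noise level.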
While searching for a root of a function, we are inevitably required to estimate
the slope of the function, at least in the vicinity of the current iterant
x_n, because without knowledge of the slope we do not know in what direction
or how far to travel. The independent values of Y_{x_n} in the Robbins-Monro
algorithm (7.5) do not serve this purpose very well. We saw in Chapter 4 that,
when estimating a difference, common random numbers provide considerable
variance reduction. If the function f(x) is evaluated by simulation with a single
uniform input U to the simulation, we can express f in the form

f(x) = ∫_0^1 H(x, u) du

for some function H. Alternatively,

f(x) = E{ (1/n) Σ_{i=1}^n H(x, U_i) }        (7.8)

where the U_i are the pseudo-random inputs to the ith simulation, which again
we assume to be uniform[0,1]. There are many ways of incorporating variance
reduction in the Monte Carlo evaluation of (7.8). For a probability density
function g(z) other than the U[0,1] probability density function, we may wish
to re-express (7.8) using importance sampling as

f(x) = E{ (1/n) Σ_{i=1}^n H(x, Z_i)/g(Z_i) }        (7.9)

where the inputs Z_i are now distributed according to the p.d.f. g(z) on [0,1].
Ideally the importance density g(z) is approximately proportional to the function
H(x_0, z), where x_0 is the root, and so we may wish to adjust g as our
iteration proceeds and we get closer to the root and the shape of the function
H(x, z) changes. Since the expectation is unknown, we could attempt to find
the roots of an approximation to it, namely

f̂_n(x) = (1/n) Σ_{i=1}^n H(x, Z_i)/g(Z_i).        (7.10)

Then f̂_n(x) → f(x) as n → ∞. Since the error in this approximation approaches
zero as n → ∞, it seems reasonable to apply Newton's method to f̂_n(x):

x_{n+1} = x_n − f̂_n(x_n)/f̂′_n(x_n)
        = x_n − Σ_{i=1}^n w_i H(x_n, Z_i) / Σ_{i=1}^n w_i (∂/∂x) H(x_n, Z_i)        (7.11)

with weights

w_i = 1/g(Z_i).
Notice that even if the function H(x, Z_i) is only known up to a multiplicative
constant, since this constant disappears from the expression (7.11), we can
still carry out the iteration. Similarly, the weights w_i need only be known up
to a multiplicative constant. There are two primary advantages of this over
the simplest version of the Robbins-Monro method. The gradient has been estimated
using common random numbers Z_i, and we can therefore expect it to be a
better estimator. The ability to incorporate an importance sampling density
g(z) also provides some potential improvement in efficiency. There is an obvious
disadvantage to any method, such as this or the Newton-Raphson, which requires
the derivatives (∂/∂x) H(x_n, Z_i). If the function f(x) is expressed as
an integral against some density other than the uniform, for example if

f(x) = ∫_{−∞}^∞ H(x, w) h(w) dw

for a density h(w), then (7.11) requires only a minimal change:

x_{n+1} = x_n − Σ_{i=1}^n w_i H(x_n, Z_i) / Σ_{i=1}^n w_i (∂/∂x) H(x_n, Z_i),   where now   w_i = h(Z_i)/g(Z_i).
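A sketch of the iteration (7.11) on an assumed integrand H(x, z) = (x² − 1)z with uniform inputs, so that f(x) = (x² − 1)/2 with root x_0 = 1; the same draws Z_i appear in the numerator and the gradient, which is exactly the common-random-numbers idea:

```python
import random

random.seed(3)

# common random numbers: one fixed sample reused on every iteration
Z = [random.random() for _ in range(1000)]

def H(x, z):
    return (x * x - 1.0) * z          # assumed integrand; f(x) = (x^2 - 1)/2

def dH_dx(x, z):
    return 2.0 * x * z                # derivative of H with respect to x

x = 3.0
for _ in range(30):
    w = [1.0 for _ in Z]              # g is the uniform density, so w_i = 1
    num = sum(wi * H(x, zi) for wi, zi in zip(w, Z))
    den = sum(wi * dH_dx(x, zi) for wi, zi in zip(w, Z))
    x = x - num / den                 # iteration (7.11)
```

Because H is linear in z here, the factor Σ Z_i cancels exactly between numerator and denominator, so the common draws remove the simulation noise entirely; for a general H the cancellation is only partial.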

Let us now return to the calibration of the normal inverse Gaussian model
to the option prices in Table 7.1. We will attempt a calibration of all the
parameters in Chapter 8, but for the moment, for simplicity, we reuse the
parameter values we estimated in Chapter 1 for the S&P500 index, α = 95.23,
β = −4.72. Since µ is determined by the martingale condition, this leaves only
one parameter, δ, the analogue of variance, to calibrate to the option price.
We wish to solve for δ, with S_0 = 1116.84, r = 0.01, T = 0.3699, α = 95.23,
β = −4.72 and µ determined by the martingale condition:

∫_{−∞}^∞ (S_0 exp(x) − e^{−rT} K)^+ nig(x; α, β, δ, µ) dx = PO(K)

where nig is the normal inverse Gaussian probability density function.

Since we are seeking the root over the value of δ, the derivative of the density
nig(x; α, β, δ, µ) is required for the gradient, and this is cumbersome (again
involving Bessel functions), so we replaced it by fitting a cubic spline to the
observations in a neighbourhood around δ and then using the derivative of this
cubic spline. In Figure 7.2 we plot the values of √δ against the corresponding
strike price K. We use the square root of δ since this is the closest analogue of
the standard deviation of returns. The figure is similar to the volatility smile
using the Black-Scholes model in Figure 7.1 and, at least in this case, shows
no more tendency for constancy of parameters than does the normal model. If
the NIG model is preferable for valuing options on the S&P500 index, it doesn't
show in the calibration of the parameter δ to observed option prices.
Similar methods for finding roots can be used when f is a function of d
arguments, a function from R^d to R^d, so that both x_n and Y_{x_n} are
d-dimensional; but be forewarned, the problem is considerably more difficult in
d dimensions when d is large. The analogous procedure to the Robbins-Monro
method is

x_{n+1} = x_n − a_n D_n^{−1} Y_{x_n}

where the d × d matrix D_n is an approximation to the gradient of the function
f at the root x_0. Once again, since we do not have precise measurements of
Figure 7.2: "Volatility" smile for NIG fit to call option prices on the S&P500.

the function f, finding an approximation to the gradient D_n is non-trivial but
even more important than in the one-dimensional case, since it determines the
direction in which we search for the root. For noisy data, there are few simple
procedures that give reliable estimates of the gradient. One possibility that
seems to work satisfactorily is to fit a smooth function or spline to a set of values
(x, Y_x) which are in some small neighbourhood of the current iterant x_n and
use the gradient of this function in place of D_n. This set of values (x, Y_x) can
include previously sampled values (x_j, Y_{x_j}), j < n, and/or newly sampled ones.

7.3 Maximization of Functions
We can use the results of the previous section to maximize a unimodal function
f(x), provided that on each iteration we are able to approximate the gradient of
the function with a noisy observation

Y_x ≈ f′(x)

by simply using (7.5) to solve the equation f′(x) = 0.
For example, suppose we can express the function f(x) as

f(x) = ∫_0^1 H(x, u) du

for some function H, so that

f(x) = E{ (1/n) Σ_{i=1}^n H(x, U_i) }.        (7.12)
7.3. MAXIMIZATION OF FUNCTIONS 347

Again, the U_i are the pseudo-random inputs to the ith simulation, which we
can assume to be uniform[0,1]. For a probability density function g(z) we can
rewrite (7.12) using importance sampling as

f(x) = E{ (1/n) Σ_{i=1}^n H(x, Z_i)/g(Z_i) }        (7.13)

where the inputs Z_i are distributed according to the p.d.f. g(z) on [0,1]. If the
function is smooth, we have many alternatives for finding its maximum. For
example, we may use the Robbins-Monro method discussed in the previous section
to find a root of f̂′(x) = 0, where

f̂′(x) = (1/n) Σ_{i=1}^n (∂/∂x) H(x, Z_i) / g(Z_i).

Again there is a lack of consistency unless we permit the number of simulations
n to grow to infinity, and our choice of ideal importance distribution will change
as we iterate toward convergence. With this in mind, suppose we assume that
the importance distribution comes from a one-parameter family g(z; θ) and that we
update the value of this parameter from time to time with information obtained
about the function H(x, z). In other words, assume that θ_n is a function of
(Z_1, ..., Z_n) and that Z_{n+1}|θ_n is drawn from the probability density function

g(z; θ_n).

Then the sequence of approximations

f̂_n(x) = (1/n) Σ_{i=1}^n H(x, Z_i)/g(Z_i; θ_{i−1})

all have expected value f(x), and under mild conditions on the conditional
variances var( H(x, Z_i)/g(Z_i; θ_{i−1}) | Z_1, ..., Z_{i−1} ), f̂_n(x) → f(x) as n → ∞. Since the error
in this approximation approaches zero as n → ∞, it seems reasonable to apply
Newton's method to such a system:

x_{n+1} = x_n − f̂′_n(x_n)/f̂″_n(x_n)        (7.14)
        = x_n − Σ_{i=1}^n w_i (∂/∂x) H(x_n, Z_i) / Σ_{i=1}^n w_i (∂²/∂x²) H(x_n, Z_i)        (7.15)

with weights

w_i = 1/g(Z_i; θ_{i−1}).
Notice that even if the function H(x, Z_i) is only known up to a multiplicative
constant, since this constant disappears from the expression (7.15), we can still
carry out the iteration. Similarly, the weights w_i need only be known up to a
multiplicative constant.
Evidently there are at least three alternatives for maximization in the presence
of noise:

1. Update the sample Z_i on each iteration and use a Robbins-Monro procedure,
iterating

   x_{n+1} = x_n − a_n (∂/∂x) H(x_n, Z_n).

2. Fix the number of simulations n and the importance distribution and
maximize the approximation, as in Geyer (1995):

   x_n = arg max { (1/n) Σ_{i=1}^n H(x, Z_i)/g(Z_i) }.

3. Update the importance distribution in some fashion and iterate (7.15).

The last two methods both require an importance distribution. The success
of the third alternative depends largely on how well the importance
distribution g(Z_i) adapts to the shape of the function H(x, Z_i) for x near the
maximizing value.
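Alternative 1 can be sketched in a few lines, on an assumed toy objective f(x) = −(x − 1)² whose gradient is observed with noise; the step is written here as an ascent, i.e. the sign of a_n is chosen so that the iterate moves uphill toward the root of f′:

```python
import random

random.seed(4)

def noisy_grad(x):
    # noisy observation of f'(x) for the assumed objective f(x) = -(x - 1)^2
    return -2.0 * (x - 1.0) + random.gauss(0.0, 1.0)

x = 4.0
for n in range(1, 50001):
    # Robbins-Monro step with a_n = 1/n, driving the iterate toward f'(x) = 0
    x = x + (1.0 / n) * noisy_grad(x)
```

Despite gradient noise with standard deviation 1, the shrinking gains average the noise away and the iterate settles near the maximizer x = 1.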

7.3.1 Example: estimation with missing data

This example is a special case of a maximization problem involving "missing
data" or latent variables. Although this methodology arose largely in biostatistics,
where various types of censorship of data are common, it has application as
well in many areas including finance. For example, periods such as weekends,
holidays or evenings, or periods between trades for thinly traded stocks or bonds,
could be considered "missing data" in the sense that if trades had occurred in
these periods, it would often simplify the analysis.
For a simple example, suppose we wish to analyze the daily data for a given stock
index such as the Toronto Stock Exchange 300 index, TSE300, but on a given
day, July 1 for example, the exchange is closed. Should we simply assume that
this day does not exist on the market and use the return over a two-day period
around this time as if it were the return over a single day? Since other indices
such as the S&P500 are available on this day, it is preferable to analyze the
data as a bivariate or multivariate series and essentially impute the "missing"
values for the TSE300 using its relationship to a correlated index. The data are
in Table 7.2.
Date (2003)    June 30   July 1   July 2   July 3   July 4   July 7    July 8
TSE300 close   6983.1    *        6990.3   6999.8   7001.9   *         7089.6
S&P500 close   974.5     982.3    993.8    985.7    *        1004.42   1007.84

Table 7.2: Closing values for the S&P500 and TSE300, July 2003. "*" = missing value
7.3. MAXIMIZATION OF FUNCTIONS 349

We will give a more complete discussion of dealing with missing data later,
when we introduce the EM algorithm, but for the present we give a brief discussion
of how "imputed values" (a statistician's euphemism for educated guesses)
can help maximize the likelihood in problems such as this one. So for the moment,
let X be the "complete" data set (in the above example this is
the data with the missing values "*" filled in) and Z the data we actually
observe. Then, since the probability density for X factors into the marginal
density for Z and the conditional density for X given Z,

f_X(x) = f_Z(z) f_{X|Z}(x|z),

we have, upon taking logarithms and then the conditional expectation given
Z = z, the relation

ln(f_Z(z)) = E[ln(f_X(X)) | Z = z] − ∫ ln(f(x|z)) f(x|z) dx        (7.16)

where the second term is the entropy of the conditional distribution f(x|z). Now
suppose that we were somehow able to randomly generate the missing observations
from their correct distribution, perhaps many times, while leaving the
observed data alone. Averaging the log-likelihood ln(f_X(X)) for the completed
data sets, while holding the observed data Z constant at z, is equivalent to
approximating E[ln(f_X(X)) | Z = z].
The maximum likelihood estimator of the parameters based on the incomplete
data Z is obtained by maximizing the incomplete-data likelihood ln(f_Z(z))
over the value(s) of the parameters, and a simple attempt at this would be to
maximize the first term on the right side of (7.16) instead. Of course there
is no guarantee that this is equivalent to maximizing both terms, and we return
to this question later, but for the moment suppose that our objective is
maximization of

E[ln(f_X(X)) | Z = z]

over the parameter values. In the above example, we can assume that the daily
returns for the two indices are correlated normally distributed random variables,
so that in order to model these indices we need the 2 mean values and the 3
parameters in the covariance matrix of a bivariate normal distribution. Our
strategy will be to "fill in" or impute the missing values, evaluate the likelihood
ln(f_X(X)) as if these imputed observations were real data, and regard this as a
"noisy" evaluation of the function E[ln(f_X(X)) | Z = z] which we would like to
maximize.
In order to put this data on the more familiar ground of the bivariate normal
distribution, we fit the log-normal distribution so that (starting with t = 0 on
June 30, 2003) the close of the TSE at time t is given by

6983.1 exp{ Σ_{i=1}^t Y_i }

and of the S&P500,

974.5 exp{ Σ_{i=1}^t X_i }

where the daily returns (X_i, Y_i) are correlated normal random variables with
mean (µ_x, µ_y) and covariance matrix

Λ = [ σ_y²      ρσ_xσ_y ]
    [ ρσ_xσ_y   σ_x²    ].

Then the conditional log-likelihood (given the values on June 30) for the
data, assuming all of the returns are observed, is

l_C(µ, Λ) = −(n/2) ln(|Λ|) − (1/2) Σ_{i=1}^n ( (Y_i, X_i)′ − (µ_y, µ_x)′ )′ Λ^{−1} ( (Y_i, X_i)′ − (µ_y, µ_x)′ ).        (7.17)
We would like to evaluate this log-likelihood when the missing data are replaced
by imputed data and then maximize over the parameters. To this end we need
to know how to impute missing values, which requires their conditional distribution
given what is observed. So let us return to a small portion of the data, as shown
in Table 7.3. Notice that we are essentially given the total return S = 0.001
for the TSE over the two-day period July 1-2 but not the individual returns
Y_1, Y_2, which we need to impute.

Date (2003)           July 1   July 2   Total return
TSE300 return (Y_i)   Y_1      Y_2      S = 0.001
S&P500 return (X_i)   0.008    0.0116   0.0196

Table 7.3: Returns for TSE300 and S&P500

In order to carry out the imputation for Y_1, Y_2, recall that bivariate normal
data are often expressed in terms of a linear regression relationship describing the
conditional distribution of Y given X, of the form

Y_i = µ_y + β_{y|x}(X_i − µ_x) + ε_i

where the error term ε_i is a Normal(0, (1 − ρ²)σ_y²) random variable independent
of X_i, and the regression parameter β_{y|x} is related to the covariance matrix by

β_{y|x} = cov(Y, X)/var(X) = ρ σ_y/σ_x.

It follows from Problem 1 at the end of the chapter that, given the values X_1, X_2
and the sum S = Y_1 + Y_2, the distribution of Y_1 is normal,

N( S/2 + (β_{y|x}/2)(X_1 − X_2), (1/2)(1 − ρ²)σ_y² ).        (7.18)

If we are given tentative values of the parameters β_{y|x}, ρ, σ_y², this distribution
can be used to generate an imputed value of Y_1, with Y_2 = S − Y_1. Of course a
similar strategy allows us to impute the missing values of X_i, X_{i+1} when Y_i, Y_{i+1}
and X_i + X_{i+1} are observed. In this case the value of X_i should be generated
from a

N( (X_i + X_{i+1})/2 + (β_{x|y}/2)(Y_i − Y_{i+1}), (1/2)(1 − ρ²)σ_x² )        (7.19)

distribution, where

β_{x|y} = ρ σ_x/σ_y.

Our objective, then, is maximization of

E[l_C(µ, Λ) | observed data]

where l_C(µ, Λ) is defined at (7.17). The maximum likelihood estimators of µ and Λ
from complete data are

Λ̂ = (1/n) Σ_{i=1}^n ( (Y_i, X_i)′ − (µ̂_y, µ̂_x)′ ) ( (Y_i, X_i)′ − (µ̂_y, µ̂_x)′ )′        (7.20)

( µ̂_y, µ̂_x )′ = (1/n) Σ_{i=1}^n ( Y_i, X_i )′.        (7.21)

The imputed values can be treated as actual observations for the missing data
provided that we impute them many times and average the results. In other
words, the following algorithm can be used to estimate the parameters. We
begin with some arbitrary scheme for imputing the missing values; for example,
we could replace the individual returns Y_1, Y_2 in Table 7.3 by their average
S/2 = 0.0005. Then iterate the following:

1. Treating the imputed values as if they were observed, use the complete
data (observed and imputed) to obtain estimators Λ̂ and µ̂_y, µ̂_x.

2. Use the estimated parameters to estimate β_{y|x} and β_{x|y}, and again
impute the missing X_i using the distribution (7.19) and the missing Y_i using the distribution
(7.18).

3. Again use the complete data (observed and imputed) to obtain estimators
Λ̂ and µ̂_y, µ̂_x. Use as our current estimate of these parameters the average
of the values of Λ̂ and µ̂_y, µ̂_x obtained on each iteration of steps 1 or 3.

4. Repeat steps 2 and 3 until the averages used as parameter estimators of
µ and Λ appear to converge.
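The conditional draw in step 2 is simple to implement; a sketch of the imputation (7.18) for the July 1-2 pair of Table 7.3, where the parameter values passed in are arbitrary trial values, not the fitted ones:

```python
import random

random.seed(5)

def impute_pair(S, x1, x2, beta_yx, rho, sigma_y):
    # draw Y1 from (7.18) given the total S = Y1 + Y2 and the observed X1, X2,
    # then set Y2 = S - Y1 so the observed two-day total is preserved exactly
    mean = S / 2.0 + (beta_yx / 2.0) * (x1 - x2)
    sd = ((1.0 - rho ** 2) / 2.0) ** 0.5 * sigma_y
    y1 = random.gauss(mean, sd)
    return y1, S - y1

# the July 1-2 pair from Table 7.3, with assumed trial parameter values
draws = [impute_pair(0.001, 0.008, 0.0116, 0.5, 0.4, 0.005) for _ in range(20000)]
```

Every imputed pair sums to the observed total S = 0.001, while the individual values vary from draw to draw with the conditional variance in (7.18).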

We ran through the imputation iteration above a total of 100,000 times (in
about 70 seconds of cpu time on a laptop) and obtained estimates of the mean
and covariance matrix by averaging these 100,000 estimates, resulting in

Λ̂_av = [  1.034   −0.115 ] × 10^{−5}        (7.22)
        [ −0.115    5.329 ]

( µ̂_y, µ̂_x )′ = ( 0.0025, 0.0056 )′.        (7.23)

These can be compared with a naive estimator obtained by imputing the missing
returns as the mean return over the two-day period, based on the returns in Table
7.4, in which the imputed returns appear as the repeated consecutive values:

Date (2003)                July 1   July 2   July 3    July 4   July 7   July 8
TSE300 (expected) return   0.0005   0.0005   0.0014    0.003    0.0062   0.0062
S&P500 (expected) return   0.008    0.0116   −0.0082   0.0094   0.0094   0.0034

Table 7.4: Returns and imputed returns for TSE300 and S&P500

These naive estimates are the mean and covariance matrix of the data in
Table 7.4,

Λ̂ = [ 1.5   1.8 ] × 10^{−5},
    [ 1.8   7.6 ]

( µ̂_y, µ̂_x )′ = ( 0.0025, 0.0056 )′.

Notice that the two estimates of the mean are identical. This is not accidental:
essentially this is a numerical or Monte Carlo version of the EM
algorithm discussed in the next section, and, at least for estimating the mean,
replacing each unobserved value by its conditional expectation given the observations
results in the correct maximum likelihood estimator. The covariance estimates,
on the other hand, are quite different. Often, a covariance matrix obtained by
naive imputation, replacing observations with an average as we did with the
data in Table 7.4, underestimates variances, because the variation of
the missing observation is understated by an average. Similarly, estimators
of covariance or correlation may be biased for the same reason. The estimator
(7.22), though it has random noise in it, does not suffer from bias to the same
extent, because it does a better job of replicating the characteristics of the
random observations.

7.4 Maximum Likelihood Estimation
7.4.1 Missing and Censored Data
It is common in finance, and generally in science, for data to be biased by under-
reporting and various forms of censorship. Patients who react badly to a treatment
may not return for a follow-up appointment, and the data that would have
been obtained on this follow-up appointment is "missing" or unavailable;
the event that such data is unobserved may depend on the value that it would
have taken. Similarly, data on the time to failure of companies can be "censored"
or only partially observed if there is a merger or take-over during the
study, since in this case we know only that the failure time of the original firm
exceeds some value. Other forms of data modification which introduce bias are
also common. There is reporting bias when mutual funds that have had a poor
quarter or year are more likely to give prominence to their 3- or 5-year returns
in advertising, or neglect to reply to a question on performance over a particular
period. Managers, stocks, and companies with particularly poor returns disappear
from databases, exchanges, or companies for obvious reasons. In general,
successful treatment of "missing" or "censored" data requires
some insight into the mechanism by which the data goes missing or is censored.
In order to give a simple example of survivorship or reporting bias, suppose
that the losses xi , i = 1, ..., n on portfolios managed by n managers over a period
of time are assumed normally distributed and for simplicity we assume that all
have the same mean Вµ and standard deviation Пѓ. For various reasons we expect
that the probability that a manager is no longer with us at the end of this period
of time, (and so the loss for the whole period is вЂњmissingвЂќ) is a function of the
loss that would have occurred had they persisted so that the probability that a
given data value xi is observed is ПЂ(xi ), a non-increasing function of the value
xi . The probability that an observation xi is missing is therefore 1 в€’ ПЂ(xi ).
Then the likelihood for this data is
Z
nв€’1
Y
{ПЂ(xi )n(xi ; Вµ, Пѓ )} { (1 в€’ ПЂ(z))n(z; Вµ, Пѓ 2 )dz}1в€’в€†i
2 в€†i
(7.24)
L(Вµ, Пѓ) =
i=1

where Δ_i = 1 if the i'th value is observed and Δ_i = 0 if it is missing, and n(x_i; µ, σ²) denotes the Normal(µ, σ²) probability density function at the point x_i. The first term

    π(x_i) n(x_i; µ, σ²)

is the likelihood corresponding to the event "the i'th loss is x_i and it is observed" and the last term

    ∫ (1 − π(z)) n(z; µ, σ²) dz = E[1 − π(X_i)]

is the probability that an observation is missing. Suppose n = 7 and all of the x_i, i = 1, ..., n − 1 are observed but the last value x_7 is missing. In this case we know the values of x_i for all but the last observation, which could be replaced by a randomly generated observation if we knew the parameters µ, σ and the function π(x). If we wish to maximize the likelihood (7.24), then one simple approach is to complete the data by generating a missing or latent observation X_7 = Z from the Normal(µ, σ²) distribution and then accepting the observation (as missing) with probability 1 − π(Z). The likelihood is proportional to

    E{H(µ, σ, Z)}

where the expectation is over the distribution of Z,

    H(µ, σ, Z) = H(µ, σ, Z, θ) = (1/σ^7) exp{−[Σ_{i=1}^{6} (x_i − µ)² + (Z − µ)²]/(2σ²)} (1 − π(Z))/g(Z; θ)

and we assume Z has probability density function g(z; θ).

Consider maximizing this function over µ in the special case that σ = 1, π(z) = e^{−z}, z > 0, and g(x; θ) is an exponential probability density function with mean θ. We might fix the number of simulations m and the importance distribution and maximize the approximation to the integral:

    µ_n = arg max_µ { (1/m) Σ_{i=1}^{m} H(µ, Z_i) }

This method, recommended by Geyer (1996), has the advantage that it essentially estimates the function unbiasedly, so that one can rely on standard maximization programs such as fminsearch in Matlab. It requires, however, that we fix the importance distribution g(z; θ) and the number of simulations. Alternatives are Robbins-Monro methods, which have the ability to update the importance distribution parameter θ to reduce the variance of the estimator.
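As a concrete illustration of Geyer's fixed-importance-sample approach, the following sketch (in Python rather than Matlab) draws the Z_i once and then maximizes the resulting smooth approximation to E{H(µ, Z)}; the six observed losses, the seed, and θ = 1 for the exponential importance density are hypothetical choices, not the book's data.

```python
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(0)

# Hypothetical observed losses x_1,...,x_6; the seventh loss is missing.
x = np.array([0.3, 1.1, 0.7, 0.2, 0.9, 0.5])
sigma = 1.0    # sigma fixed at 1, as in the special case in the text
theta = 1.0    # mean of the exponential importance density g(z; theta)
m = 20_000     # number of simulations, fixed in advance

# Draw the Z_i once and hold them fixed while mu varies, so the
# approximate objective (1/m) sum_i H(mu, Z_i) is a smooth function of mu.
Z = rng.exponential(theta, size=m)

def H(mu, z):
    # H(mu, sigma, Z) with pi(z) = exp(-z) and importance density g(z; theta)
    ss = np.sum((x - mu) ** 2) + (z - mu) ** 2
    g = np.exp(-z / theta) / theta
    return np.exp(-ss / (2 * sigma**2)) * (1 - np.exp(-z)) / (sigma**7 * g)

def neg_objective(mu):
    return -np.mean(H(mu, Z))

mu_hat = minimize_scalar(neg_objective, bounds=(-2.0, 2.0), method="bounded").x
```

Because the same draws Z_i are reused for every trial value of µ, the approximated objective is one fixed smooth function, and any standard deterministic optimizer can be applied to it.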
Various stochastic volatility models and "regime switching" models can also be viewed as examples of latent variables or missing data problems. If the market has two volatility states, "high" and "low", these cannot be directly observed but only inferred from the movement in the market over a period of time. In this case, given the state that the market is in, the parameters are often easily estimated, but this state is a latent variable that needs to be estimated or imputed from the observations.

7.4.2 Example: Mixture Models
One common problem which can also be viewed as a missing data problem is that of estimation of parameters in a mixture model. To take a simple example, consider a vector of observations of stock returns (x_1, ..., x_n) and assume that x_i has a probability density function that takes one of two possible forms, f_1(x|θ_1) or f_2(x|θ_2), where the two parameters θ_1, θ_2 are not specified. We assume that a given observation comes from the population distribution f_1(x|θ_1) with some probability p_1 and otherwise, with probability p_2 = 1 − p_1, it comes from the population with density f_2(x|θ_2). This is an example of a mixture of two distributions: the overall population distribution is

    f(x) = p_1 f_1(x|θ_1) + p_2 f_2(x|θ_2).

If each data point came with an identifier z_i taking possible values 1, 2 that specified the parent distribution, estimation of the parameters p_1, θ_1, θ_2 would be easy. We would estimate p_1 using the proportion of observations from population 1 and then use these to estimate the parameter θ_1, and similarly θ_2.
However, this identifier is not usually available. For example, it is a common model in finance to assume that there are two categories of observations, or states of the market, corresponding to high volatility (θ_2 large) and low volatility (θ_1 small), but of course we can only infer which state we are currently in from the magnitude of the last few changes in price. Alternatively, we might assume that the vector of identifiers z_1, z_2, ... mentioned above forms a simple 2-state Markov chain, in which case the model for stock returns is commonly referred to as a regime-switching model.

Once again, for the sake of a concrete problem, we use the returns from the S&P500 index between January 1, 2003 and May 11, 2004 to estimate a mixture model. In particular, suppose that the density function of a return is assumed to be a mixture of normal distributions with mean µ and variances θ_i, i = 1, 2, i.e.

    p_1 n(x; µ, θ_1) + (1 − p_1) n(x; µ, θ_2) = p_1 f_1(x|θ_1) + (1 − p_1) f_2(x|θ_2).
For the sake of identifiability of the parameters, we assume θ_1 < θ_2. Consider first the simple case in which we assume that the identifiers z_i are independent and observed. Then the likelihood function is

    L(θ) = ∏_{i=1}^{n} f_1(x_i|θ_1)^{Δ_i} f_2(x_i|θ_2)^{1−Δ_i}

where Δ_i = I(z_i = 1). Since Δ_i is not observed, the estimating function in the missing data case is obtained by conditioning the "complete data" estimating function on the observed information, i.e.

    Σ_{i=1}^{n} E[Δ_i|x_1 ... x_n] (∂/∂θ_1) ln f_1(x_i|θ_1) = 0

    Σ_{i=1}^{n} E[1 − Δ_i|x_1 ... x_n] (∂/∂θ_2) ln f_2(x_i|θ_2) = 0.

In the case of independent z_i, the conditional probability that Δ_i = 1 is given by

    w_i = p_1 f_1(x_i|θ_1) / (p_1 f_1(x_i|θ_1) + p_2 f_2(x_i|θ_2))

and this results in the estimating functions to be solved for θ,

    Σ_{i=1}^{n} w_i (∂/∂θ_1) ln f_1(x_i|θ_1) = 0    (7.25)

    Σ_{i=1}^{n} (1 − w_i) (∂/∂θ_2) ln f_2(x_i|θ_2) = 0.    (7.26)

Of course, typically p_1, p_2 are not known either, and they also need to be estimated from the data concurrently with our estimation of θ. The estimating function can be derived in exactly the same manner; if the Δ_i were observed, then we would estimate the parameters p_1, p_2 using

    p_1 = (1/n) Σ_{i=1}^{n} Δ_i

which after conditioning is of the form

    p_1 = (1/n) Σ_{i=1}^{n} E[Δ_i|x_i] = (1/n) Σ_{i=1}^{n} w_i, and    (7.27)

    p_2 = 1 − p_1.

Typically, a simple routine which begins with initial values of the parameters and then iteratively steps towards a solution of (7.25), (7.26) and (7.27) provides at least slow convergence to a solution. There is no absolute guarantee that the solution corresponds to a local maximum of the likelihood function, but for the S&P500 data mentioned the likelihood appears to increase monotonically. Convergence is to the model

    0.811 N(µ, 0.0085²) + 0.189 N(µ, 0.015²),

which corresponds to a √252 × 0.0085 ≈ 14% annual volatility on 81% of the days and 24% volatility on the other 19% of the days. In general we need to be concerned about whether the initial values of the parameters affect the result, but in this case varying the initial values of p_1, θ_1, θ_2 did not appear to have any effect (except on the time to convergence). This simple algorithm, when applied to mixture problems, can more generally be quite sensitive to the starting values.
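A minimal sketch of this iteration in Python, using synthetic mean-adjusted returns standing in for the S&P500 data (the simulated data, seed, starting values and iteration count are assumptions, not the book's):

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic daily returns from a two-component normal variance mixture,
# standing in for the S&P500 returns used in the text (hypothetical data).
n = 2000
low = rng.random(n) < 0.8
x = np.where(low, rng.normal(0.0, 0.0085, n), rng.normal(0.0, 0.015, n))

def npdf(x, var):
    # normal density with mean 0 (the returns are taken as mean-adjusted)
    return np.exp(-x**2 / (2 * var)) / np.sqrt(2 * np.pi * var)

# initial parameter values (an arbitrary choice)
p1, th1, th2 = 0.5, 0.5 * np.var(x), 2.0 * np.var(x)

for _ in range(200):
    # w_i: conditional probability that observation i came from component 1
    f1, f2 = npdf(x, th1), npdf(x, th2)
    w = p1 * f1 / (p1 * f1 + (1 - p1) * f2)
    # solving (7.25)-(7.27) for normal components: weighted variance
    # estimates and the averaged weights for the mixing proportion
    th1 = np.sum(w * x**2) / np.sum(w)
    th2 = np.sum((1 - w) * x**2) / np.sum(1 - w)
    p1 = np.mean(w)

# enforce the identifiability convention theta_1 < theta_2
if th1 > th2:
    th1, th2, p1 = th2, th1, 1.0 - p1
```

Each pass recomputes the weights w_i from the current parameters and then solves the weighted estimating equations in closed form, so the loop is exactly the iterative stepping toward a solution of (7.25)-(7.27) described above.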
When we estimate parameters as we did in the mixture problem above, whether we use the EM algorithm or one of its many relatives, missing components of the score function are replaced by their conditional expectation given the observations. For estimation of the mean in a normal family, we replace missing observations by their conditional expectation. On the other hand, these imputed values typically have less variability than the actual data we were unable to observe, and so if our interest is in the variance, substituting a conditional expectation for non-observed data tends to reduce the sample variance. If we have many parameters that we wish to estimate, we can essentially replace the missing data by repeated draws from the appropriate conditional distribution and use these repeatedly to estimate the parameter values and their variances. If the missing data is randomly generated, it should have the same characteristics as the data it is replacing, including the higher moments. The next method is a Bayesian implementation of this idea, one of the few excursions into Bayesian methodology in this text.
The Data Augmentation algorithm (see Tanner and Wong, 1987) is used in Bayesian problems when there is latent or missing data, as in the above example. A Bayesian begins with a prior distribution π(θ) assigned to the parameters being estimated. This distribution reflects our prior knowledge of the parameter values and their variability, and there are many possible choices for it, but ideally it should not have too much influence on the conclusions, since we always wish the data to do more talking than the Bayesian's preconceived ideas of where the parameters lie. We explain the algorithm in the context of the example above, so x is the vector of observations and the missing data is denoted by Δ. The posterior joint distribution of the parameter θ and the missing data Δ is the conditional distribution of θ and Δ given the data x and can be written

    p(θ, Δ|x) = p(θ|x) p(Δ|θ, x).

The idea behind the data augmentation algorithm is to iteratively draw (θ, Δ) from approximations to the joint distribution of θ, Δ given the observed data. Typically p(θ|x) is difficult to compute, but the conditional densities p(θ|x, Δ) and p(Δ|θ, x) are much easier to deal with. The algorithm iterates these two steps:

1. Draw Δ* from the density p(Δ|θ*, x).

2. Draw θ* from the density p(θ|x, Δ*).

Iteration continues until the joint distribution of the pair (θ*, Δ*) appears to stabilize, in which case they are regarded as a draw from the joint distribution p(θ, Δ|x). The method is, in fact, a special case of the "Gibbs sampler" to be discussed later, and its convergence to the correct conditional distribution is derived from the balance equations we give there. At this point, to provide just a little credibility, we outline an intuitive argument for a Gibbs sampler in this simple case.
Consider the following problem. We wish to generate random variables (θ, Δ) having some joint distribution p(θ, Δ|x). To save a little notation let us abbreviate p(·|x) to p_x(·). Unfortunately, the joint conditional distribution p_x(θ, Δ) is difficult to deal with, but conditional distributions such as

    p_x(θ|Δ) = p_x(θ, Δ)/p_x(Δ)   and   p_x(Δ|θ) = p_x(θ, Δ)/p_x(θ)

are easy to sample from. We proceed as follows. Start with an arbitrary value for θ_1 and then generate Δ_1 from the density p_x(Δ|θ_1). Now we generate θ_2 from the probability density p_x(θ|Δ_1), then Δ_2 from the density p_x(Δ|θ_2), and so on. It is easy to see that the pairs of observations (θ_1, Δ_1), (θ_2, Δ_2), ... constitute a two-dimensional Markov chain with transition kernel K(θ_{t+1}, Δ_{t+1}|θ_t, Δ_t) = p_x(θ_{t+1}|Δ_t) p_x(Δ_{t+1}|θ_{t+1}), and it is our hope that the joint density p_x(θ, Δ) is an invariant distribution for this chain. To verify this, note that

    ∫∫ p_x(θ_{t+1}|Δ) p_x(Δ_{t+1}|θ_{t+1}) p_x(θ, Δ) dθ dΔ
    = p_x(Δ_{t+1}|θ_{t+1}) ∫ p_x(θ_{t+1}|Δ) p_x(Δ) dΔ
    = p_x(Δ_{t+1}|θ_{t+1}) p_x(θ_{t+1}) = p_x(θ_{t+1}, Δ_{t+1}).

Since this is the invariant distribution, if the chain is ergodic, it will converge in distribution to p_x(θ, Δ). There is no certainty in this or any other Markov chain that we have reached equilibrium after a finite number of draws, although there are many techniques for trying to monitor whether the distribution continues to change. There are other, more elaborate, methods for estimating the parameters in mixture models. See for example Robert (1996).
The algorithm is clearer if we apply it to the normal mixture model for the S&P500 data discussed above. In this case we start with initial guesses Δ*_i at the values of Δ_i. For example, we could assign the smallest returns (in absolute value) to population one, so Δ*_i = 1, and the rest to population 2, Δ*_i = 0, with about equal numbers in the two camps. Then θ* is drawn with density p(θ|x, Δ*). This is easy to do in this example because the vector Δ* assigns each observation to one of the two populations. For simplicity suppose that we have subtracted the mean from the returns so that we can assume µ = 0, and let the parameters θ_1, θ_2 denote the variances of the two normal distributions. For example, suppose that the prior distribution of θ_1 is proportional to θ_1^{−1},

    π(θ_1) ∝ θ_1^{−1},  θ_1 > 0.

There is no such probability density function, but there is a measure with this density, and it is often convenient to permit such so-called improper prior distributions. The prior π(θ_1) corresponds to assuming an improper uniform prior distribution for log(θ_1), because if U is uniform on some set then θ_1 = e^U will have density function

    ∝ |dU/dθ_1| = θ_1^{−1}  on the corresponding set.

Then the posterior density of θ_1 is

    p(θ_1|x, Δ*) ∝ ∏_i [n(x_i; 0, θ_1)]^{Δ*_i} θ_1^{−1}

    ∝ θ_1^{−ν_1/2−1} exp{−Σ_i Δ*_i x_i² / (2θ_1)}

where ν_1 = Σ_i Δ*_i is the degrees of freedom for this sample. The posterior density of the reciprocal of the variance ψ = θ_1^{−1} is

    p(ψ|x, Δ*) ∝ ψ^{ν_1/2+1} exp{−ψ Σ_i Δ*_i x_i² / 2} |dθ_1/dψ|

    ∝ ψ^{ν_1/2−1} exp{−ψ Σ_i Δ*_i x_i² / 2}.

We identify this posterior distribution of θ_1^{−1} as the Gamma(ν_1/2, 2/ss_1²) distribution with

    ss_1² = Σ_{i=1}^{n} Δ*_i x_i²

and this means that

    ss_1² θ_1^{−1}

has a chi-squared distribution with ν_1 degrees of freedom.
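This Gamma/chi-squared correspondence is easy to check by simulation; the values of ν_1 and ss_1² below are arbitrary hypothetical choices. Scaling a Gamma(ν_1/2, 2/ss_1²) draw by ss_1² should produce a variable whose first two moments match those of a chi-squared variable with ν_1 degrees of freedom:

```python
import numpy as np

rng = np.random.default_rng(4)
nu1, ss1_sq = 8, 0.004   # hypothetical degrees of freedom and sum of squares

# draw 1/theta_1 ~ Gamma(nu1/2, scale = 2/ss1^2); then w = ss1^2 / theta_1
inv_theta1 = rng.gamma(nu1 / 2, 2 / ss1_sq, size=200_000)
w = ss1_sq * inv_theta1

# a chi-squared variable with nu1 degrees of freedom has mean nu1 and
# variance 2*nu1, so the sample moments of w should be close to these
mean_w, var_w = w.mean(), w.var()
```

Note that NumPy's `gamma` is parameterized by shape and scale, so the scale argument is 2/ss_1², matching the Gamma(ν_1/2, 2/ss_1²) notation used here.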
Similarly, with two parameters θ_1, θ_2, we may assume the prior density of (θ_1, θ_2) is proportional to

    θ_1^{−1} θ_2^{−1},

resulting in independent Gamma(ν_1/2, 2/ss_1²) and Gamma(ν_2/2, 2/ss_2²) posterior distributions for θ_1^{−1} and θ_2^{−1}, where ν_2 = Σ_i (1 − Δ*_i) and

    ss_2² = Σ_{i=1}^{n} (1 − Δ*_i) x_i².

The parameter p_1 could also have a prior distribution assigned and a posterior calculated in a similar fashion (e.g. assume a (non-informative) prior distribution for p_1 with density function proportional to [p(1 − p)]^{−1/2}, so that the posterior distribution given the value of Δ is a Beta(ν_1 + 1/2, ν_2 + 1/2) distribution (see Box and Tiao, 1973, Section 1.3)). However, for comparison with the mixture model above, we chose p_1 equal to its estimated value there, 0.81.
The algorithm is therefore as follows, beginning with an initial guess at Δ*:

1. Calculate

    ν_1 = Σ_i Δ*_i,   ss_1² = Σ_{i=1}^{n} Δ*_i x_i²
    ν_2 = Σ_i (1 − Δ*_i),   ss_2² = Σ_{i=1}^{n} (1 − Δ*_i) x_i²

2. Generate θ_1^{−1} ∼ Gamma(ν_1/2, 2/ss_1²) and θ_2^{−1} ∼ Gamma(ν_2/2, 2/ss_2²) (and p_1 ∼ Beta(ν_1 + 1/2, ν_2 + 1/2) if we wish to allow p_1 to vary).

3. Generate independent

    Δ*_i ∼ Bernoulli( p_1 f(x_i|θ_1) / (p_1 f(x_i|θ_1) + p_2 f(x_i|θ_2)) )
Figure 7.3: The posterior distribution of the annualized volatility parameters for the two components of a mixture model, fit to the S&P500 index, June-July 2003.

4. Return to step 1 until the distribution of (Δ*_i, θ_1^{−1}, θ_2^{−1}) has ceased to change. Then iterate steps 1-4 m times, considering these m subsequently generated values of (Δ*_i, θ_1^{−1}, θ_2^{−1}) as draws from the joint distribution of (Δ, θ) given (x_1, ..., x_n).
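Steps 1-3, iterated as in step 4, can be sketched as follows in Python; the synthetic returns, seed, burn-in length and number of sweeps are assumptions for illustration rather than the book's actual data and settings.

```python
import numpy as np

rng = np.random.default_rng(2)

# Synthetic mean-adjusted returns standing in for the S&P500 data
# (hypothetical); p1 is held fixed at 0.81 as in the text.
n = 350
low = rng.random(n) < 0.81
x = np.where(low, rng.normal(0.0, 0.0085, n), rng.normal(0.0, 0.015, n))
p1 = 0.81

def f(xi, theta):
    # normal density with mean 0 and variance theta
    return np.exp(-xi**2 / (2 * theta)) / np.sqrt(2 * np.pi * theta)

# initial guess: assign the smallest returns in absolute value to population 1
delta = (np.abs(x) < np.median(np.abs(x))).astype(float)

draws = []
for sweep in range(3000):
    # Step 1: degrees of freedom and sums of squares for each population
    nu1, ss1 = delta.sum(), np.sum(delta * x**2)
    nu2, ss2 = (1 - delta).sum(), np.sum((1 - delta) * x**2)
    # Step 2: draw 1/theta_j ~ Gamma(nu_j/2, scale = 2/ss_j^2)
    th1 = 1.0 / rng.gamma(nu1 / 2, 2 / ss1)
    th2 = 1.0 / rng.gamma(nu2 / 2, 2 / ss2)
    # Step 3: redraw the population indicators Delta*_i
    w = p1 * f(x, th1) / (p1 * f(x, th1) + (1 - p1) * f(x, th2))
    delta = (rng.random(n) < w).astype(float)
    if sweep >= 1000:            # discard burn-in sweeps
        draws.append((th1, th2))

th1_post, th2_post = np.mean(draws, axis=0)
vol1 = np.sqrt(252 * th1_post)   # annualized volatility, component 1
vol2 = np.sqrt(252 * th2_post)   # annualized volatility, component 2
```

Averaging or histogramming the retained draws (as in Figure 7.3) then summarizes the posterior of the two volatility parameters for fixed p_1 = 0.81.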

The posterior distribution of the annualized volatilities determined by the data augmentation algorithm is given in Figure 7.3, and this gives a better idea of the error attached to the variances of the two normally distributed components for fixed p_1.

The next section provides a simple general technique for estimating parameters in the presence of missing or censored data.

7.4.3 The EM Algorithm
In principle, maximum likelihood estimation is simple. We write down the likelihood or the probability of the observed values and then choose the parameter value which maximizes this expression. Difficulties occur primarily when the likelihood does not have a simple closed-form or easily evaluated expression. In the last section we saw an example in which we observed data Z, but if complete data X were available (including "latent" or missing variables), then the complete-data likelihood takes a simple form as a function of the unknown parameter,

    L_C(θ; x),  (C for "complete").

However, each missing observation in a data set gives rise to a conditional expectation. If only a portion z of the complete data x is observed, then the likelihood function for the observed data, L_O(θ, z) say ("O" is for "observed"), is related to the complete data likelihood. As before, since

    f_θ(x) = f_θ(z) f_θ(x|z),

we have, for an arbitrary value of the parameter θ_0,

    ln(f_θ(x)) = ln(f_θ(z)) + ln(f_θ(x|z))
    E_{θ_0}[ln(f_θ(x))|z] = E_{θ_0}[ln(f_θ(z))|z] + E_{θ_0}[ln(f_θ(x|z))|z]
    = ln(f_θ(z)) + ∫ ln{f_θ(x|z)} f_{θ_0}(x|z) dx.

Denote the term on the left side by

    Q_{θ_0}(θ) = E_{θ_0}[ln(f_θ(x))|z].    (7.28)

Then the log-likelihood of the observed data ln{L_O(θ, z)} is

    ln(f_θ(z)) = Q_{θ_0}(θ) − ∫ ln{f_θ(x|z)} f_{θ_0}(x|z) dx.

From this we obtain

    ln{L_O(θ, z)} − ln{L_O(θ_0, z)} = Q_{θ_0}(θ) − Q_{θ_0}(θ_0) + H(θ_0, θ)    (7.30)

where

    H(θ_0, θ) = E_{θ_0}[ln( f_{θ_0}(x|z) / f_θ(x|z) )|z]

is the cross entropy between the conditional density of x|z at the two values of the parameter.
The EM algorithm proceeds as follows. We begin with what we think is a good estimator θ_0 of the parameter and then maximize over θ the function Q_{θ_0}(θ). The maximizing value of θ, say θ_1, is the next estimator. Replacing θ_0 by θ_1 we repeat, now maximizing Q_{θ_1}(θ), and so on, obtaining a sequence of estimators θ_n, n = 1, 2, .... It is not hard to show that under fairly simple conditions this sequence converges to the maximum likelihood estimator, i.e. the value θ̂ satisfying L_O(θ̂; z) = max_θ {L_O(θ; z)}. In other words, the algorithm switches back and forth between the two steps, starting with an initial guess at the parameter value θ_0 and n = 0 and stopping when the sequence θ_n appears to converge:

E Step Obtain the conditional expected value

    Q_{θ_n}(θ) = E_{θ_n}[ln L_C(θ; x)|z]

M Step Maximize this function Q_{θ_n}(θ) over θ, letting

    θ_{n+1} = arg max_θ Q_{θ_n}(θ)

be the next approximation to the parameter value. Increment n.
This algorithm works because of the identity (7.30). Notice that because of the M-step above, Q_{θ_n}(θ_{n+1}) − Q_{θ_n}(θ_n) ≥ 0, and the term H(θ_n, θ_{n+1}), because it is the cross-entropy between two distributions, is always non-negative. This shows that

    ln{L_O(θ_{n+1}, z)} − ln{L_O(θ_n, z)} ≥ 0

and that the observed data likelihood is a non-decreasing sequence. It must therefore converge.
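A small worked instance of the E and M steps: estimating the mean of a normal sample with known σ = 1 when observations above a censoring point c are only known to exceed c. The data, seed, and censoring point below are hypothetical choices for illustration.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(3)

c = 1.0                                  # censoring point (hypothetical)
full = rng.normal(0.5, 1.0, 5000)        # complete data; sigma known to be 1
obs = full[full <= c]                    # fully observed values
m = int((full > c).sum())                # censored values: we only know X > c

mu = 0.0                                 # initial guess theta_0
for _ in range(100):
    # E-step: the normal score is linear in X, so each censored observation
    # is replaced by its conditional mean E_mu[X | X > c]
    #   = mu + phi(c - mu) / (1 - Phi(c - mu))   (sigma = 1)
    a = c - mu
    imputed = mu + norm.pdf(a) / (1 - norm.cdf(a))
    # M-step: Q is maximized by the average of observed and imputed values
    mu = (obs.sum() + m * imputed) / len(full)
```

At the fixed point mu satisfies the observed-data score equation for the censored sample, so the iteration converges to the maximum likelihood estimator of the mean.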
The score function for the complete data,

    S_C(θ; x) = (∂/∂θ) ln L_C(θ; x),

is related to the observed data score S_O(θ; z) = (∂/∂θ) ln L_O(θ; z) even more simply:

    S_O(θ; z) = E_θ[S_C(θ; x)|z].    (7.31)

Therefore on the E-step of the EM algorithm, we may solve the equation for θ_{n+1} (checking as always that the solution corresponds to a maximum)

    S_O(θ_{n+1}|θ_n) = 0.

Thus, the EM algorithm can be more simply expressed as a simple resubstitution algorithm for solving the score equation (7.31) set equal to zero:

E Step For an initial guess at the parameter value θ_0, obtain the conditional expected value

    S_O(θ|θ_n) = E_{θ_n}[S_C(θ; x)|z].    (7.32)

Solve for θ_{n+1} the equation

    S_O(θ_{n+1}|θ_n) = 0.

It is clear from this formulation that if the algorithm converges, it must converge to a point θ satisfying

    S_O(θ|θ) = 0

which, in view of (7.31), is the score equation for the observed data.
If there are m missing observations, notice that both (7.28) and (7.31) are m-fold integrals and may be difficult to compute. Nevertheless, (7.31) leads to an extremely useful observation that applies to missing or "censored" data within the normal family of distributions and, more generally, exponential families. Equation (7.32) indicates that any term in the score function that is unobserved should be replaced by its conditional expectation given the observed data. If the score function contained terms like X² or sin(X), then the conditional expectation of such terms would need to be computed. This is, of course, NOT the same as replacing X by its conditional expectation and then using X² or sin(X). However, one of the marvelous features of the normal distributions is that, for estimating the mean, the score function is linear in X. For example, differentiating (7.17) with respect to µ_X, µ_Y we obtain the score function corresponding to estimation of µ for complete data,

    Σ_{i=1}^{n} Λ^{−1} ( (Y_i, X_i)′ − (µ_y, µ_x)′ ).

Notice that taking the expected value given the observed data is equivalent to replacing each unknown component of X_i, Y_i by its conditional expectation given the observed. Moreover, within the multivariate normal family of distributions, taking conditional expectation is accomplished essentially by linear regression.
For a simple example of its application, let us return to the problem with "missing" values in the S&P500 index and the TSE300 index discussed above. Suppose, for example, we wish to fill in the value for the TSE300 on July 1 using conditional expectation. Which observations are directly relevant? Table 7.3 gives the daily returns for the two indices, leaving out values that are independent of the unknowns Y_1, Y_2.

Recall from (7.18) that the conditional distribution of Y_i given X_i, X_{i+1} and S = Y_i + Y_{i+1} is

    N( S/2 + (β_{y|x}/2)(X_i − X_{i+1}), (1/2)(1 − ρ²)σ_y² )    (7.33)

and so the conditional expectation is

    E[Y_i|X_i, X_{i+1}, S] = S/2 + (β_{y|x}/2)(X_i − X_{i+1})

with β_{y|x} = ρσ_y/σ_x. Rather than naively replacing the unknown returns Y_i, Y_{i+1} by their average S/2, the term (1/2)β_{y|x}(X_i − X_{i+1}) provides an adjustment determined from the values of X. The regression coefficient β_{y|x} can either be found by regressing Y on X, preferably with data around the same period of time (since these coefficients tend to be somewhat time-dependent), or by using the current values of the maximum likelihood estimator of the covariance matrix Λ. Using the observed data for June and July 2003, we arrived at an estimated regression coefficient

    β_{y|x} ≈ 0.432

and this allows us to fill in the missing values in the table with their conditional expected values:

Date (2003)      July 1    July 2
TSE300 Return    −0.0003   0.0013
S&P500 Return    0.008     0.0116
TABLE 7.4
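The Table 7.4 entries can be reproduced from this conditional expectation; here we take the observed two-day TSE300 return over July 1-2 to be S = Y_1 + Y_2 = 0.001 (inferred from the sum of the imputed entries in the table):

```python
beta_yx = 0.432             # regression coefficient estimated in the text
S = 0.001                   # assumed observed two-day TSE300 return Y1 + Y2
X1, X2 = 0.008, 0.0116      # observed S&P500 returns on July 1 and July 2

# E[Y1 | X1, X2, S] = S/2 + (beta_{y|x}/2)(X1 - X2), the conditional mean above
Y1 = S / 2 + 0.5 * beta_yx * (X1 - X2)
Y2 = S - Y1                 # the imputed values add back to the observed sum
```

Rounding to four decimals gives −0.0003 and 0.0013, the two imputed TSE300 entries in Table 7.4.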

Filling in the expected returns for all of the missing values for July 4 in the same way, but using the regression coefficient for X on Y of β_{x|y} ≈ 1.045, we obtain the following for the unknowns in Table 7.5.
Date (2003)                July 1    July 2    July 3    July 4    July 7    July 8
TSE300 (Expected) Return   −0.0003   0.0013    0.0014    0.003     Y_7       Y_8
S&P500 (Expected) Return   0.008     0.0116    −0.0082   X_4       X_7       0.0034
TABLE 7.5

    X_4 = (1/2)(0.0188 + 1.045(0.003 − Y_7))
    Y_7 = (1/2)(0.0124 + 0.432(X_7 − 0.0034))
    X_4 + X_7 = 0.0188
    Y_7 + Y_8 = 0.0124

and these four equations can be solved for the four unknowns, giving

    Y_7 = −0.0634,  Y_8 = 0.0758,
    X_4 = 0.0441,  X_7 = −0.0253.

Assuming that this pair of returns follows a correlated geometric Brownian motion, then in order to estimate cov(X, Y) in the presence of complete data we would solve the following equations:

    µ̂_x = (1/n) Σ_{i=1}^{n} x_i    (7.34)