. 1
( 3)



>>

Estimation and Calibration

7.1 Introduction
Virtually all models in ¬nance have parameters that must be speci¬ed in order
to completely describe them. Statistical estimation can be used for some or all
of these parameters under two important conditions:

1. We believe that the parameter values do not change quickly in time.

2. We have access to su¬cient (recent) historical data on the underlying
process to provide reliable estimates of these parameter values.

Statistical estimation involves the distribution of the data and uses methods
such as maximum likelihood estimation or least squares.

There are also some instances in which estimation of parameters is unde-
sirable. For example we have seen that a risk-neutral distribution Q, used in
the pricing of options, is not necessarily identical to the historical distribution
P. Estimating parameter values for the distribution P based on historical data
may have little or nothing to do with their values under Q. However, market in-
formation, for example the prices of options with a given underlying stock carry
information about the expected value of certain functions of the stock under
Q and the process of arranging parameter values so that the model prescribes
options prices as close as possible to those observed is called calibration of the
model.
Although there is a substantial philosophical di¬erence between calibration
and statistical estimation, there are many similarities in the methodology, since
both often require obtaining the root of one or more functions or maximizing or
minimizing a function and so these are the problems that we begin this section
with.
There are many numerical problems that become much more di¬cult in the
presence of noise, from ¬nding the roots of a given function to solving a system
of di¬erential equations. Problems if this sort have given rise to substantial
research in a variety of areas including stochastic approximation and Markov
Chain Monte Carlo. In statistics, for example, we typically estimate parameters
using least squares or maximum likelihood estimation. However, how do we

339
340 ESTIMATION AND CALIBRATION

carry out maximum likelihood estimation in cases where no simple form for the
likelihood function is available? Suppose we are able to simulate data from a
model with a given set of parameter values, say, but unable to express the
likelihood function in a simple form amenable to numerical maximization.
For some concreteness, suppose we are interested in calibrating a model for
the price of options on the SPX, the S&P500 index. The call option prices in
Table 7.1 were observed on May 6, 2004 when the SPX was quoted at 1116.84.
All options expired on September 18, corresponding to T = .3699 years. The
option price is calculated as midpoint of the bid and ask, and K is the exercise
price, PO (K), the corresponding option price.
950 1005 1050 1100 1125 1150 1175 1200 1225 1250
K
PO (K) 173.1 124.8 88.5 53.7 39.3 27.3 17.85 10.95 6.2 3.25

Table 7.1. Price of SPX Call Options

Suppose we use the Black-Scholes formula to determine the implied volatility
for these options. The implied volatility is the value of σ solving

(7.1)
BS(1116.84, K, r, 0.3699, σ) = PO (K)

where r is the spot interest rate. If exact prices were known (rather than just the
bid and ask prices), we could determine r from the put call parity relation since
there are also put options listed with the same exercise price. However, here we
used the rate r = .01 of a short-term treasury bill since it is di¬cult to ¬nd nearly
simultaneous trades in put and call options with the same strike price.. The
MATLAB function BLSIMPV(S0 ,K,r,T ,PO (K)) returns the implied volatility
i.e. provides the solution of (7.1). In Figure 7.1 we give what is often referred to
as the volatility smile, although smirks and frowns are probably more common
than smiles these days (surely a sign of the times). This is a graph of the
exercise price of the option against the implied volatility of the option. The fact
that this implied volatility is not constant is further evidence that the Black-
Scholes model (at least with constant parameters) does not ¬t the risk-neutral
distribution very well.
We saw in Chapter 3 that distributions such as the normal inverse Gaussian
distribution can do a better job of ¬tting historical distributions of stocks and
indices than does the normal and so we might hope that it can be applied suc-
cessfully to the ¬t of a risk-neutral distribution. The main problem in applying
distributions such as the NIG is that of calibrating the parameters of the model
to market data. In models such as this, if option prices are obtained from a
simulation, how can we hope to obtain parameter values for the model that are
consistent with a set of market prices for the options? More speci¬cally suppose
we assume that the price of a stock at time T is given by

ST = S0 erT exp(X)

where X has the NIG(±, β, δ, µ) distribution of Lemma 29. In order that the
discounted future price forms a martingale, we require that E(exp(X)) = 1 and
7.1. INTRODUCTION 341




Figure 7.1: The volatility “smile” for SPX call options


from (3.23) with s = 1 this implies that
µ ¶
p
p ±2 ’ β 2
1
2 ’ (1 + β)2 ’ ±2 ’ β 2 } + ln
µ = δ{ ±
±2 ’ (1 + β)2
4
and
|β + 1| < ±.
One of the four parameters, µ, is determined by the martingale condition so the
other three can be used to ¬t to market option data if we wish. The price of a
call option with exercise price K is now
E(S0 exp(X) ’ e’rT K)+
and this expectation is either a rather complicated numerical exercise or one
that can be estimated by simulation. Whether or not we choose to price this
option using simulation, the calibration problem reduces to selecting values of
the parameters so that the theoretical option price agrees with the market price,
i.e.

E(S0 exp(X) ’ e’rT K)+ = P0 (K). (7.2)
If we have 10 options as in Table 7.1 and only three parameters to vary, there is
no hope for exact equality in 7.2 (10 equations in 3 unknowns is not promising).
We could either ¬x two of the parameters, e.g. ±, β and select the third,
δ to agree with the price of a speci¬c option or we could calibrate all three
parameters to three or more options with similar strike prices and minimize the
sum of the squared di¬erences between observed and theoretical option prices.
We will return to this problem later, but for the present notice that in order
to solve it, we need to be able to either ¬nd roots or minima of functions when
evaluations of these functions are noisy and have measurement or simulation
error superimposed. We begin a discussion of this problem in the next section.
342 ESTIMATION AND CALIBRATION

7.2 Finding a Root

Let us begin with the simple problem of identifying the root of a non-decreasing
function, i.e. solving for the value x0 satisfying

f (x0 ) = 0.

The standard algorithm for this is the Newton-Raphson algorithm which begins
with an arbitrary value of x1 and then iterates

f (xn )
xn+1 = xn ’ (7.3)
.
f 0 (xn )

Suppose, however, that our observations on the function f (x) are of the form

Y (x) = f (x) + µx
2
where the errors µx at the point x are small, with mean 0 and variance σx . Even
if we know the gradient D = f 0 (x0 ) of the function at or near the root (whether
we use the gradient at the root or at the value xn approaching the root does not
materially e¬ect our argument), the attempt at a Newton-Raphson algorithm

Yxn f (xn ) µxn
xn+1 = xn ’ = xn ’ ’ (7.4)
D D D
has a serious problem. If there is no error whatsoever in the evaluation of the
function, the iteration
f (xn )
xn+1 = xn ’
D
typically converges to the root x0 . However in (7.4), on each iteration we add
µx
to the “correct” Newton-Raphson iterant xn ’ f (xn ) an error term Dn which
D
is often independent of the previous errors having variance D’2 σxi .Therefore
2

after a total of N iterations of the process, we have accumulated an error which
has variance
N
X
’2 2
D σxi .
i=1

The fact that this error typically increases without bound means that continuing
the iteration of (7.4) does not provide a consistent estimator of the root but
one whose error eventually grows. Newton™s method is inappropriate for any
model in which errors in the evaluation of the function, when accumulated until
convergence is declared, are comparable in size to the tolerance of the algorithm.
In general as long as the standard deviation of the function evaluations σx is of
the same order of magnitude as

tolerance — D/ N
7.2. FINDING A ROOT 343


where N is the number of iterations before convergence was declared, then the
accumulation of the error in (7.4) clouds any possible conclusion concerning the
location of the root. Suppose for example you declare a tolerance of 10’4 and
it takes about 100 iterations for your program to claim convergence. Assume
for the sake of argument that D = 0.01, N ' 100. Then an error in the function
evaluation of order σx ' 10’5 will invalidate conclusions from the Newton-
Raphson iteration.
There is a relatively simple ¬x to the usual Newton-Raphson method which
at least corrects the lack of convergence, unfortunately at the price of greater
insensitivity when we are further from the root. Suppose we multiply the con-
stant D’1 by a sequence of constants an that depend on n so the iteration is of
the form
xn+1 = xn ’ an D’1 Yxn . (7.5)
It is easy to see that as long as σx is bounded, for example and
X
a2 < ∞ (7.6)
n

then, under mild conditions on the function f (x),then xn converges. This is in-
tuitively clear because the variance of the error beyond the N 0 th term is roughly
P∞ 22
n=N an σx and this approaches zero as N ’ ∞. Furthermore, provided that
X
an = ∞, (7.7)

convergence is to a root of the function f (x). This is the result of Robbins-
Monro (1951). Having the correct or asymptotically correct gradient D in (7.5)
really only matters asymptotically since it controls the rate of convergence once
we are in a small neighbourhood around the root. What is more important and
sometimes di¬cult to achieve is arranging that the sequence an is large when
we are quite far from the root, but small (something like 1/n) when we are
very close.
The ideal sequence of constants an in (7.5) is approximately 1 (this is the
unmodi¬ed Newton-Raphson) until we are in the vicinity of the root and then
slowly decreases to 1/n. Provided we had some easy device for determining
whether we are near the root, the Robbins-Monro method gives an adequate
solution to the problem of ¬nding roots to functions when only one root exists.
However, for functions with many roots, other methods are required.
While searching for a root of a function, we are inevitably required to es-
timate the slope of the function at least in the vicinity of the current iterant
xn , because without knowledge of the slope, we do not know in what direction
or how far to travel. The independent values of Yxn in the Robbins-Monro
algorithm (7.5) do not serve this purpose very well. We saw in Chapter 4 that
when estimating a di¬erence, common random numbers provide considerable
variance reduction. If the function f (x) is evaluated by simulation with a single
uniform input U to the simulation, we can express f in the form
Z1
f (x) = H(x, u)du
0
344 ESTIMATION AND CALIBRATION

for some function H. Alternatively
n
1X
(7.8)
f (x) = E{ H(x, Ui )}
n i=1

where Ui are the pseudo-random inputs to the i0 th simulation, which again
we assume to be uniform[0,1]. There are many ways of incorporating variance
reduction in the Monte Carlo evaluation of (7.8). For a probability density
function g(z) other than the U [0, 1] probability density function, we may wish
to re-express (7.8) using importance sampling as
n
1 X H(x, Zi )
} (7.9)
f (x) = E{
n i=1 g(Zi )

where the inputs Zi are now distributed according to the p.d.f. g(z) on [0,1].
Ideally the importance density g(z) is approximately proportional to the func-
tion H(x0 , z) where x0 is the root and so we may wish to adjust g as our
iteration proceeds and we get closer to the root and the shape of the function
H(x, z) changes. Since the expectation is unknown we could attempt to ¬nd
the roots of an approximation to this, namely,
n
1 X H(x, Zi )
c (7.10)
fn (x) = .
n i=1 g(Zi )

c
Then fn (x) ’ f (x) as n ’ ∞. Since the error in this approximation approaches
c
zero as n ’ ∞, it seems reasonable to apply Newton™s method to fn (x),

b
fn (xn )
xn+1 = xn ’
b0
fn (xn )
Pn
wi H(xn , Zi )
= xn ’ Pn i=1 ‚ (7.11)
i=1 wi ‚x H(xn , Zi )

with weights
1
wi = .
g(Zi )
Notice that even if the function H(x, Zi ) is only known up to a multiplicative
constant, since this constant disappears from the expression (7.11), we can
still carry out the iteration. Similarly, the weights wi need only be known up
to a multiplicative constant. There are two primary advantages to this over
the simplest version of the Robbins-Monro. The gradient has been estimated
using common random numbers Zi and we can therefore expect it to be a
better estimator. The ability to incorporate an importance sampling density
g(z) also provides some potential improvement in e¬ciency. There is an obvious
disadvantage to any method such as this or the Newton-Raphson which require
7.2. FINDING A ROOT 345


the derivatives of the function ‚x H(xn , Zi ). If the function f (x) is expressed as
an integral against some density other than the uniform, for example if
Z∞
f (x) = H(x, w)h(w)dw
’∞

for density h(w) then (7.11) requires a minimal change to
Pn
wi H(xn , Zi )
xn+1 = xn ’ Pn i=1 ‚ where
wi ‚x H(xn , Zi )
i=1
h(Zi )
wi = .
g(Zi )

Let us now return to the calibration of the Normal Inverse Gaussian model
to the option prices in Table 7.1. We will attempt a calibration of all the
parameters in Chapter 8 but for the moment, for simplicity we will reuse the
parameter values we estimated in Chapter 1 for the S7P500 index, ± = 95.23,
β = ’4.72. Since µ is determined by the martingale condition this leaves only
one parameter δ, the analogue of variance, to calibrate to the option price.
We wish to solve for δ, with S0 = 1116.84, r = 0.01, T = .3699, ± = 95.23,
β = ’4.72 and µ determined by the martingale condition
Z∞
(S0 exp(x) ’ e’rT K)+ nig(x; ±, β, δ, µ)dx = P0 (K)
’∞

where nig is the normal inverse Gaussian probability density function.

Since we are seeking the root over the value of δ, the derivative of the density
nig(x; ±, β, δ, µ) is required for the gradient and this is cumbersome (again
involving Bessel functions) so we replaced it by ¬tting a cubic spline to the
observations in a neighbourhood around δ and then using the derivative of this

cubic spline. In Figure 7.2 we plot the values of δ against the corresponding
strike price K. We use the square root of δ since this is the closest analogue of
the standard deviation of returns. The ¬gure is similar to the volatility smile
using the Black Scholes model in Figure 7.1 and, at least in this case, shows
no more tendency for constancy of parameters than does the Normal model. If
the NIG model if preferable for valuing options on the S&P500 index, it doesn™t
show in the calibration of the parameter δ to observed option prices.
Similar methods for ¬nding roots can be used when the function f is a
function of d arguments, a function from Rd to Rd and so both xn and Yxn are
d’dimensional but be forewarned, the problem is considerably more di¬cult in
d dimensions when d is large. The analogous procedure to the Robbins-Monro
method is
’1
xn+1 = xn ’ an Dn Yxn
where the d — d matrix Dn is an approximation to the gradient of the function
f at the root x0 . Once again, since we do not have precise measurements of
346 ESTIMATION AND CALIBRATION




Figure 7.2: “Volatility” Smile for NIG ¬t to call option prices on the S&P500.



the function f , ¬nding an approximation to the gradient Dn is non-trivial but
even more important than in the one-dimensional case, since it determines the
direction in which we search for the root. For noisy data, there are few simple
procedures that give reliable estimates of the gradient. One possibility that
seems to work satisfactorily is to ¬t a smooth function or spline to a set of values
of (x, Yx ) which are in some small neighbourhood of the current iterant xn and
use the gradient of this function in place of Dn . This set of values (x, Yx ) can
include previously sampled values (xj , Yxj ), j < n and/or new values sampled
expressly to estimate the gradient.


7.3 Maximization of Functions
We can use the results of the previous section to maximize a unimodal function
f (x) provided at on each iteration we are able to approximate the gradient of
the function with a noisy observation

Yx ' f 0 (x)

by simply using (7.5) to solve the equation f 0 (x) = 0.
For example suppose we can express the function f (x) as
Z 1
f (x) = H(x, u)du
0

for some function H so that
n
1X
(7.12)
f (x) = E{ H(x, Ui )}
n i=1
7.3. MAXIMIZATION OF FUNCTIONS 347

Again, the Ui are the pseudo-random inputs to the i0 th simulation, which we
can assume to be uniform[0,1]. For a probability density function g(z) we can
rewrite (7.12) using importance sampling as
n
1 X H(x, Zi )
} (7.13)
f (x) = E{
n i=1 g(Zi )

where the inputs Zi are distributed according to the p.d.f. g(z) on [0,1]. We
have many alternatives to ¬nding its maximum if the function is smooth. For ex-
ample we may use the Robbins-Monro method discussed in the previous section
b
to ¬nd a root of f 0 (x) = 0 where
n
1X ‚
‚x H(x, Zi )
b
f 0 (x) = .
n i=1 g(Zi )

Again there is a lack of consistency unless we permit the number of simulations
n to grow to in¬nity, and our choice of ideal importance distribution will change
as we iterate toward convergence. With this in mind, suppose we assume that
the importance distribution comes from a one-parameter family g(z; θ) and we
update the value of this parameter from time to time with information obtained
about the function H(x, z). In other words, assume that θn is a function of
(Z1 , ..., Zn ) and Zn+1 |θn is drawn from the probability density function

g(z; θn ).

Then the sequence of approximations
n
1 X H(x, Zi )
c
fn (x) =
n i=1 g(Zi ; θi’1 )

all have expected value f (x) and under mild conditions on the variance of the
c
H(x,Zi )
terms var( g(Zi ;θi’1 ) |Z1 , ..., Zi’1 ), then fn (x) ’ f (x) as n ’ ∞. Since the error
in this approximation approaches zero as n ’ ∞, it seems reasonable to apply
Newton™s method to such a system:

b0
fn (xn )
= xn ’ (7.14)
xn+1
b00
fn (xn )
Pn ‚
i=1 wi ‚x H(xn , Zi )
= xn ’ Pn (7.15)
‚2
i=1 wi ‚x2 H(xn , Zi )

with weights
1
wi = .
g(Zi ; θi’1 )
Notice that even if the function H(x, Zi ) is only known up to a multiplicative
constant, since this constant disappears from the expression (7.15), we can still
348 ESTIMATION AND CALIBRATION

carry out the iteration. Similarly, the weights wi need only be known up to a
multiplicative constant.
Evidently there are at least three alternatives for maximization in the pres-
ence of noise:

1. Update the sample Zi on each iteration and use a Robbins-Monro proce-
dure, iterating

xn+1 = xn ’ an H(xn , Zn )
‚x
2. Fix the number of simulations n and the importance distribution and
maximize the approximation as in (Geyer, 1995)
n
1 X H(x, Zi )
}
xn = arg max{
n i=1 g(Zi )

3. Update the importance distribution in some fashion and iterate (7.15).

The last two methods both require an importance distribution. The suc-
cess of the third alternative really depends largely on how well the importance
distribution g(Zi ) adapts to the shape of the function H(x, Zi ) for x near the
maximizing value.

7.3.1 Example: estimation with missing data.
This example is a special case of a maximization problem involving “Missing
Data” or latent variables. Although this methodology arose largely in biostatis-
tics where various types of censorship of data are common, it has application as
well in many areas including Finance. For example, periods such as week-ends
holidays or evenings or periods between trades for thinly traded stocks or bonds
could be considered “missing data” in the sense that if trades had occurred in
these periods, it would often simply the analysis.
For a simple example suppose we wish to the daily data for a given stock
index such as the Toronto Stock Exchange 300 index, TSE300, but on a given
day, July 1 for example, the exchange is closed. Should we simply assume that
this day does not exist on the market and use the return over a two day period
around this time as if it was the return over a single day? Since other indices
such as the S&P500 are available on this day, it is preferable to analyze the
data as a bivariate or multivariate series and essentially impute the “missing”
values for the TSE300 using its relationship to a correlated index. The data is
in Table 7.2.
Date (2003) June 30 July 1 July 2 July 3 July 4 July 7 July 8
TSE300 Close 6983.1 * 6990.3 6999.8 7001.9 * 7089.6
S&P500 Close 974.5 982.3 993.8 985.7 * 1004.42 1007.84

Table 7.2 Closing values for the S&P500 and TSE300, July 2003.
“*”= missing value
7.3. MAXIMIZATION OF FUNCTIONS 349

We will give a more complete discussion of dealing with missing data later
when we introduce the EM algorithm but for the present we give a brief discus-
sion of how “imputed values”, (a statisticians euphemism for educated guesses)
can help maximize the likelihood in problems such as this one. So for the mo-
ment, suppose that X is the “complete” data set (in the above example this is
the data with the missing values “*” ¬lled in) and Z be the data we actually
observe. Then since the probability density for X factors into the marginal
density for Z and the conditional density for X given Z,

fx (x) = fz (z)fx|z (x|z)

we have upon taking logarithms and then the conditional expectation given
Z = z the relation
Z
ln(fz (z)) = E[ln(fx (X))|Z = z] ’ ln(f (x|z))f (x|z)dx (7.16)

where the second term is the entropy in the conditional distribution f (x|z). Now
suppose that we were somehow able to randomly generate the missing obser-
vations from their correct distribution, perhaps many times, while leaving the
observed data alone. Averaging the log likelihood ln(fx (X)) for the completed
data sets while holding the observed data Z constant at z is equivalent to
approximating E[ln(fx (X))|Z = z].
The maximum likelihood estimator of the parameters based on the incom-
plete data Z is obtained by maximizing the incomplete data likelihood ln(fz (z))
over the value(s) of the parameters, and a simple attempt at this would be to
maximize the ¬rst term on the right side of (7.16) instead. Of course there
is no guarantee that this is equivalent to maximizing both terms, and we re-
turn to this question later, but for the moment suppose that our objective is
maximization of
E[ln(fx (X))|Z = z]

over the parameter values. In the above example, we can assume that the daily
returns for the two indices are correlated normally distributed random variables
so that in order to model these indices we need the 2 mean values and the 3
parameters in the covariance matrix for a bivariate normal distribution. Our
strategy will be to “¬ll in” or impute the missing values, evaluate the likelihood
ln(fx (X)) as if these imputed observations were real data, and regard this as a
“noisy” evaluation of the function E[ln(fx (X))|Z = z] which we would like to
maximize.
In order to put this data on the more familiar ground of the bivariate normal
distribution, we ¬t the log-normal distribution so that (starting with t = 0 on
June 30, 2003), the close of the TSE at time t is given by

t
X
Yi }
6983.1 exp{
i=1
350 ESTIMATION AND CALIBRATION

and of the S&P500,
t
X
Xi }
974.5 exp{
i=1

where the daily returns (Xi , Yi ) are correlated normal random variables with
mean (µx , µy ) and with covariance matrix
µ ¶
2
σy ρσx σy
Λ= .
2
ρσx σy σx

Then the conditional log-likelihood (given the values on June 30) for the
data, assuming all of the returns are observed, is
µµ ¶ µ ¶¶0 µµ ¶ µ ¶¶
n 1 Yi µy Yi µy
lC (µ, Λ) = ’ ln(|Λ|)’ Σn Λ’1
’ ’ .
2 i=1 Xi µx Xi µx
2
(7.17)
We would like to evaluate this log-likelihood when the missing data is replaced
by imputed data and then maximize over the parameters. To this end we need
to know how to impute missing values, requiring their conditional distribution
given what is observed. So let us return to a small portion of the data as shown
in Table 7.3. Notice that we are essentially given the total return S = 0.001
for the TSE over the two-day period July 1&2 but not the individual returns
Y1 , Y2 which we need to impute.

Date (2003) July 1 July 2 Total Return
TSE300 return (Yi ) Y1 Y2 S = 0.001
S&P500 return (Xi ) 0.008 0.0116 0.0196
TABLE 7.3 Returns for TSE300 and S&P500


In order to carry out the imputation for Y1 , Y2 , recall that bivariate normal
data is often expressed in terms of a linear regression relationship describing the
conditional distribution of Y given X of the form

Yi = µy + βy|x (Xi ’ µx ) + µi

where the error term µi is a Normal(0, (1 ’ ρ2 )σy ) random variable independent
2

of Xi and the regression parameter βy|x is related to the covariance matrix by

cov(Y, X) σy
βy|x = =ρ .
var(X) σx

It follows from Problem 1 at the end of the Chapter that given the values X1 , X2
and the sum S = Y1 + Y2 , the distribution of Y1 is normal

βy|x
S 1
(X1 ’ X2 ), (1 ’ ρ2 )σy ).
2
(7.18)
N( +
2 2 2
7.3. MAXIMIZATION OF FUNCTIONS 351

2
If we are given tentative values of the parameters βy|x , ρ, σy , this distribution
can be used to generate an imputed value of Y1 with Y2 = S ’ Y1 . Of course a
similar strategy allows us to impute the missing values of Xi , Xi+1 when Yi , Yi+1
and Xi + Xi+1 are observed. In this case the value of Xi should be generated
from a
βx|y
Xi + Xi+1 1
(Yi ’ Yi+1 ), (1 ’ ρ2 )σx )
2
(7.19)
N( +
2 2 2
distribution where
σx
βx|y = ρ .
σy
We are now ready to return to the maximization of the log-likelihood function

E[lC (µ, Λ)|observed data]

where lC (µ, Λ) is de¬ned at (7.17). The maximum likelihood estimator of µ, Λ
from complete data is
µµ ¶µ ¶¶ µµ ¶µ ¶¶0
1n c c
Yi µy Yi µy
b ’ ’ (7.20)
Λ = Σi=1
c c
Xi µx Xi µx
n
µ ¶ µ ¶
1
c
µy Yi
= Σn (7.21)
.
i=1
c
µx Xi
n

The imputed values can be treated as actual observations for the missing data
provided that we impute them many times and average the results. In other
words, the following algorithm can be used to estimate the parameters. We
begin with some arbitrary scheme for imputing the missing values. For example
we could replace the individual returns Y1 , Y2 in Table 7.3 by their average
S/2 = 0.0005. Then iterate the following

1. Treating the imputed values as if they were observed, use the complete
b cc
data (observed and imputed) to obtain estimators Λ and µy , µx .

2. Use the estimated parameters to estimate βy|x and βx|y and to again im-
pute the missing Xi using (7.19) and the missing Yi using the distribution
(7.18).

3. Again use the complete data (observed and imputed) to obtain estimators
b cc
Λ and µy , µx . Use as our current estimate of these parameters the average
b cc
of the values of Λ and µy , µx obtained on each iteration of steps 1 or 3.

4. Repeat steps 3 and 4 until the averages used as parameter estimators of
µ and Λ appear to converge.

We ran through steps three and four above a total of 100, 000 times (in
about 70 seconds cpu time on a laptop) and obtained estimates of the mean
352 ESTIMATION AND CALIBRATION

and covariance matrix by averaging these 100000 estimates, resulting in
µ ¶
1.034 ’0.115
b — 10’5 (7.22)
Λav =
’0.115 5.329
µ ¶µ ¶
c
µy 0.0025
(7.23)
= .
c
µx 0.0056

These can be compared with a naive estimator obtained by imputing the missing
returns as the mean return over a two-day period, based on the returns in Table
7.4 in which the imputed returns are in bold letters:


Date (2003) July 1 July 2 July 3 July 4 July 7 July 8
TSE300 (Expected) Return 0.0005 0.0005 0.0014 0.003 0.0062 0.0062
S&P500 (Expected) Return 0.008 0.0116 -0.0082 0.0094 0.0094 0.0034
Table 7.4: Returns and imputed returns for TSE300 and S&P500

These naive estimates are the mean and covariance matrix of the data in
Table 7.4,
µ ¶
1.5 1.8
b — 10’5 ,
Λ=
1.8 7.6
µ ¶µ ¶
c
µy 0.0025
= .
c
µx 0.0056

Notice that the two estimates of the mean are identical. This is not acciden-
tal, because essentially this is a numerical or Monte Carlo version of the EM
algorithm discussed in the next section, and at least for estimating the mean,
replacing each unobserved by its conditional expectation given the observations
results in the correct maximum likelihood estimator. The covariance estimates,
on the other hand, are quite di¬erent. Often, a covariance matrix obtained by
naïve imputation, replacing observations with an average as we did with the
data in Table 7.4, results in underestimating variances because the variation of
to the missing observation is understated by an average. Similarly estimators
of covariance or correlation may be biased for the same reason. The estimator
(7.22), though it has random noise in it, does not su¬er from bias to the same
extent, because it has does a better job of replicating the characteristics of the
random observations.


7.4 Maximum Likelihood Estimation
7.4.1 Missing and Censored Data
It is common in ¬nance and generally in science for data to be biased by under-
reporting, and various forms of censorship. Patients who react badly to a treat-
ment may not return for a follow-up appointment and the data that would have
been obtained on this follow-up appointment is “missing” or unavailable, and
7.4. MAXIMUM LIKELIHOOD ESTIMATION 353

the event that such data is unobserved may depend on the value that it would
have taken. Similarly, data on the time to failure of companies can be “cen-
sored” or only partially observed if there is a merger or take-over during the
study, since, in this case, we know only that the failure time of the original ¬rm
exceeds some value. Other forms of data-modi¬cation which introduce bias are
also common. There is reporting bias when mutual funds that have had a poor
quarter or year are more likely to give prominence to their 3 or 5-year returns
in advertising or neglect to reply to a question on performance over a particular
period. Managers, stocks, and companies with particularly poor returns disap-
pear from data-bases, exchanges, or companies for obvious reasons. We have
already seen some techniques for adjusting for the latter “survivorship bias”
but in general, successful treatment of “missing” or “censored” data requires
some insight into the mechanism by which the data goes missing or is censored.
In order to give a simple example of survivorship or reporting bias, suppose
that the losses xi , i = 1, ..., n on portfolios managed by n managers over a period
of time are assumed normally distributed and for simplicity we assume that all
have the same mean µ and standard deviation σ. For various reasons we expect
that the probability that a manager is no longer with us at the end of this period
of time, (and so the loss for the whole period is “missing”) is a function of the
loss that would have occurred had they persisted so that the probability that a
given data value xi is observed is π(xi ), a non-increasing function of the value
xi . The probability that an observation xi is missing is therefore 1 ’ π(xi ).
Then the likelihood for this data is
Z
n’1
Y
{π(xi )n(xi ; µ, σ )} { (1 ’ π(z))n(z; µ, σ 2 )dz}1’∆i
2 ∆i
(7.24)
L(µ, σ) =
i=1

where ∆i = 1 if the i0 th value is observed and otherwise ∆i = 0 if it is missing,
and n(xi ; µ, σ 2 ) denotes the normal(µ, σ 2 ) probability density function at the
point xi . The ¬rst term
π(xi )n(xi ; µ, σ 2 )
is the likelihood corresponding to an event “the i0 th loss is xi and it is observed”
and the last term
Z
(1 ’ π(z))n(z; µ, σ 2 )dz = E[1 ’ π(Xi )]

is the probability that an observation is missing. Suppose n = 7 and all of the
xi , i = 1, ..., n ’ 1 are observed but the last value x7 is missing. In this case we
know the values of xi for all but the last observation, which could be replaced
by a randomly generated observation if we knew the parameters µ, σ and the
function π(x). If we wish to maximize the likelihood (7.24) then one simple
approach is to complete the data by generating a missing or latent observations
X7 = Z with the normal(µ, σ 2 ) distribution and then accepting the observation
(as missing) with probability 1 ’ π(Z). The likelihood is proportional to

E{H(µ, σ, Z)}
354 ESTIMATION AND CALIBRATION

where the expectation is over the distribution of Z,
P6
(xi ’ µ)2 + (Z ’ µ)2 (1 ’ π(Z))
1
H(µ, σ, Z) = H(µ, σ, Z, θ) = 7 exp{’ i=1 }
2σ 2
σ g(Z; θ)

and we assume Z has probability density function

g(z; θ).

Consider maximizing this function over µ in the special case that σ = 1 and
π(z) = e’z , z > 0 and g(x; θ) is an exponential probability density function
with mean θ. We might ¬x the number of simulations m and the importance
distribution and maximize the approximation to the integral:
m
1X
µn = arg max{ H(µ, Zi )}
m i=1

This method recommended by Geyer(1996) and has the advantage that it
essentially estimates the function unbiasedly and then one can rely on standard
maximization programs such as fminsearch in Matlab. It requires however that
we ¬x the importance distribution g(z, θ) and the number of simulations. Al-
ternatives are Robbins-Monro methods which have the ability to update the
importance distribution parameter θ to reduce the variance of the estimator.
Various stochastic volatility models and “Regime switching” models can also
be viewed as examples of latent variables or missing data problems. If the
market has two volatility states, “high” and “low” these can not be directly
observed but only inferred from the movement in the market over a period of
time. In this case, given the state that the market is in, the parameters are often
easily estimated, but this state is a latent variable that needs to be estimated
or imputed from the observations.

7.4.2 Example: Mixture Models
One common problem which can also be viewed as a missing data problem is
that of estimation of parameters in a mixture model. To take a simple exam-
ple, consider a vector of observations of stock returns (x1 , ..., xn ) and assume
that xi has a probability density function that takes one of two possible forms
f1 (x|θ1 ) or f2 (x|θ2 ) where the two parameters θ1 , θ2 are not speci¬ed. We as-
sume that a given observation comes from population distribution f1 (x|θ1 ) with
some probability p1 and otherwise, with probability p2 = 1 ’ p1 it comes from
the population with density f2 (x|θ2 ). This is an example of a mixture of two
distributions: the overall population distribution is

f (x) = p1 f1 (x|θ1 ) + p2 f2 (x|θ2 ).

If each data point came with an identi¬er zi taking possible values 1, 2 that
speci¬ed the parent distribution, estimation of the parameters p1 , θ1 , θ2 would
7.4. MAXIMUM LIKELIHOOD ESTIMATION 355

be easy. We would estimate p1 using the proportion of observations from pop-
ulation 1 and then use these to estimate the parameter θ1 , and similarly θ2 .
However, this identi¬er is not usually available. For example it is a common
model in ¬nance to assume that there are two categories of observations, or
states of the market, corresponding to high volatility (θ2 large) and low volatil-
ity (θ1 small) but of course we can only infer which state we are currently in by
the magnitude of the last few changes in price. Alternatively we might assume
that the vector of identi¬ers z1 , z2 , .... mentioned above forms a simple 2’state
Markov chain, in which case the model for stock returns is commonly referred
to as a regime-switching model.
Once again for the sake of a concrete problem, we use the returns from the
S&P500 index between January 1, 2003 and May 11, 2004 to estimate a mixture
model. In particular suppose that the density function or a return is assumed
to be a mixture of normal distributions with mean µ and variances θi , i = 1, 2,
i.e.
p1 n(x; µ, θ1 ) + (1 ’ p1 )n(x; µ, θ2 )
= p1 f1 (x|θ1 ) + (1 ’ p1 )f2 (x|θ2 ).
For the sake of identi¬ability of the parameters, we assume θ1 < θ2 . Consider
¬rst the simple case in which we assume that the identi¬ers zi are independent
and observed. Then the likelihood function is
n
Y
f1 (xi |θ1 )∆i f2 (xi |θ2 )1’∆i
L(θ) =
i=1

where ∆i = I(zi = 1). Since ∆i is not observed, the estimating function in the
missing data case is obtained by conditioning the “complete data” estimating
function on the observed information, i.e.
n
X ‚
E[∆i |x1 ...xn ] ln f1 (xi |θ1 ) = 0
‚θ1
i=1
n
X ‚
E[1 ’ ∆i |x1 ...xn ] ln f2 (xi |θ2 ) = 0.
‚θ2
i=1

In the case of independent zi the conditional probability that ∆i = 1 is given
by
p1 f1 (xi |θ1 )
wi =
p1 f1 (xi |θ1 ) + p2 f2 (xi |θ2 )
and this results in the estimating functions to be solved for θ,
n
X ‚
ln f1 (xi |θ1 ) = 0 (7.25)
wi
‚θ1
i=1
n
X ‚
(1 ’ wi ) ln f2 (xi |θ2 ) = 0. (7.26)
‚θ2
i=1
356 ESTIMATION AND CALIBRATION

Of course typically p1 , p2 are not known either and they also need to be estimated
from the data concurrently with our estimation of θ. The estimating function
can be derived in exactly the same manner; if the ∆i were observed, then we
would estimate the parameters p1 , p2 using
n
1X
p1 = ∆i
n i=1

which after conditioning is of the form
n n
1X 1X
E[∆i |xi ] = wi and, (7.27)
p1 =
n i=1 n i=1
p2 = 1 ’ p1 .

Typically a simple routine which begins with initial value of the parameters and
then iteratively steps towards a solution of (7.25) and 7.27) provides at least
slow convergence to a solution. There is no absolute guarantee that the solution
corresponds to a local maximum of the likelihood function, but for the S&P500
data mentioned the likelihood appears to increase monotonically. Convergence
is to the model

0.811N (µ, 0.00852 ) + 0.189N (µ, 0.0152 ),

which corresponds to a 252 — .00852 =14% annual volatility on 81% of the
days and 24% volatility the other 19% of the days. In general we need to be
concerned about whether the initial values of the parameters a¬ect the result
but in this case varying the initial values of p1 , θ1 , θ2 did not appear to have any
e¬ect (except in the time to convergence). This simple algorithm, when applied
to mixture problems, can more generally be quite sensitive to the starting values.
When we estimate parameters as we did in the mixture problem above,
whether we use the EM algorithm or one of its many relatives, missing compo-
nents of the score function are replaced by their conditional expectation given
the observations. For estimation of the mean in a normal family, we replace
missing observations by their conditional expectation. On the other hand, these
imputed values typically have less variability than the actual data we were
unable to observe, and so if our interest is in the variance, the substitution for
non-observed data by a conditional expectation tends to reduce the sample vari-
ance. If we have many parameters that we wish to estimate, we can essentially
replace the missing data by repeated draws from the appropriate conditional dis-
tribution and use these repeatedly to estimate the parameter values and their
variances. If the missing data is randomly generated, it should have the same
characteristics as the data it is replacing, including the higher moments. The
next method is a Bayesian implementation of this idea, one of few excursions
into Bayesian methodology in this text.
The Data Augmentation algorithm (see Tanner and Wong, 1987) is used in
Bayesian problems when there is latent or missing data as in the above example.
7.4. MAXIMUM LIKELIHOOD ESTIMATION 357

A Bayesian begins with a prior distribution π(θ) assigned to the parameters
being estimated. This distribution re¬‚ects our prior knowledge of the parameter
values and their variability and there are many possible choices for it, but ideally
it should not have too much in¬‚uence on the conclusions since we always wish
the data to do more talking than the Bayesian™s preconceived ideas of where the
parameters lie. We explain the algorithm in the context of the example above
so x is the vector of observations, and the missing data is denoted by ∆. The
posterior joint distribution of the parameter θ and the missing data ∆ is the
conditional distribution of θ and ∆ given the data x and can be written

p(θ, ∆|x) = p(θ|x)p(∆|θ, x).

The idea behind the data augmentation algorithm is to iteratively draw (θ, ∆)
from approximations to the joint distribution of θ, ∆ given the observed data.
Typically p(θ|x) is di¬cult to compute but the conditional densities such as
p(θ|x, ∆) and p(θ|x, ∆) are much easier to deal with.
The algorithm iterates these two steps:

1. Draw ∆— from the density p(∆|θ— , x).

2. Draw θ— from the density p(θ|x, ∆— ).

Iteration continues until the joint distribution of the pair (θ— , ∆— ) appears to
stabilize in which case they are regarded as a draw from the joint distribution
p(θ, ∆|x). The method is, in fact, a special case of the “Gibbs sampler” to
be discussed later and its convergence to the correct conditional distribution
derived from the balance equations we give there. At this point, to provide just
a little credibility, we outline an intuitive argument for a Gibbs sampler in this
simple case.
Consider the following problem. We wish to generate random variables (θ, ∆)
having some joint distribution p(θ, ∆|x). To save a little notation let us abbre-
viate p(.|x) to px (.). Unfortunately, the joint conditional distribution px (θ, ∆)is
di¬cult to deal with, but conditional distributions such as

px (θ, ∆)
and
px (θ|∆) =
px (∆)
px (θ, ∆)
px (∆|θ) =
px (θ)

are easy to sample from. We proceed as follows. Start with an arbitrary value
for θ1 and then generate ∆1 from the density px (∆|θ1 ). Now we generate θ2 from
the probability density px (θ|∆1 ) and then ∆2 from the density px (∆|θ2 ) and so
on. It is easy to see that the pairs of observations (θ1 , ∆1 ), (θ2 , ∆2 ), ... constitute
a two dimensional Markov Chain with transition kernel K(θt+1 , ∆t+1 |θt , ∆t ) =
px (θt+1 |∆t )px (∆t+1 |θt+1 ) and it is our hope that the joint density px (θ, ∆) is
358 ESTIMATION AND CALIBRATION

an invariant distribution for this chain. To verify this note that
ZZ
px (θt+1 |∆)px (∆t+1 |θt+1 )px (θ, ∆)dθd∆
ZZ
px (θt+1 |∆)px (∆t+1 |θt+1 )px (θ, ∆)dθd∆
=
Z
= px (∆t+1 |θt+1 ) px (θt+1 |∆)px (∆)d∆

= px (∆t+1 |θt+1 )px (θt+1 ) = px (θt+1 , ∆t+1 ).

Since this is the invariant distribution, if the chain is ergodic, it will converge
in distribution to px (θ, ∆). There is no certainty in this or any other Markov
chain that we have reached equilibrium after a ¬nite number of draws, although
there are many techniques for trying to monitor whether the distribution con-
tinues to change. There are other, more elaborate, methods for estimating the
parameters in mixture models. See for example Robert (1996).
The algorithm is clearer if we apply it to the Normal mixture model for the
S&P500 data discussed above. In this case we start with initial guesses ∆— at i
the values of ∆i . For example we could assign the smallest returns (in absolute
value) to population one so ∆— = 1 and the rest to population 2, ∆— = 0
i i

with about equal numbers in the two camps. Then θ is drawn with density
px (θ|θ, ∆— ). This is easy to do in this example because the vector ∆— assigns
each observation to one of the two populations. For simplicity suppose that we
have subtracted the mean from the returns so that we can assume µ = 0 and
let the parameters θ1 , θ2 denote the variances of the two normal distributions.
’1
For example suppose that the prior distribution of θ1 is proportional to θ1
’1
π(θ1 ) ∝ θ1 , θ1 > 0.

There is no such probability density function, but there is a measure with this
density and it is often convenient to permit such so-called improper prior dis-
tributions. The prior π(θ1 ) corresponds to assuming an improper uniform prior
distribution for log(θ1 ) because if U is uniform on some set then θ1 = eU will
have density function

dU ’1
∝| | = θ1 on the corresponding set.
dθ1

Then the posterior density of θ1 is
Y —
’1
p(θ1 |x, ∆— ) ∝ [n(xi ; 0, θ1 )]∆i θ1
i
P
∆— xi 2
’ν /2’1 i
∝ }
θ1 1 exp{’
2θ1
P
∆— is the degrees of freedom for this sample. The posterior
where ν1 = i
i
7.4. MAXIMUM LIKELIHOOD ESTIMATION 359

’1
density of the reciprocal of the variance ψ = θ1 is
P
∆— xi 2 dθ1
ψ
p(ψ|x, ∆— ) ∝ ψ ν1 /2+1 exp{’ i
}| |
2 dψ
P
ψ ∆— xi 2
ν1 /2’1 i
∝ψ }
exp{’
2
’1
We identify this posterior distribution of θ1 as the Gamma(ν1 /2 ,2/ss2 ) dis-
1
tribution with
n
X
2
∆— xi 2
ss1 = i
i=1

and this means that
’1
ss2 θ1
1

has a chi-squared distribution with ν1 /2 degrees if freedom.
Similarly with two parameters θ1 , θ2 , we may assume the prior density of
(θ1 , θ2 ) is proportional to
’2 ’2
θ1 θ2
resulting in a independent Gamma(ν1 /2 ,2/ss2 ), Gamma(ν2 /2 ,2/ss2 ) posterior
P1 2
’1 ’1
distributions for θ1 and θ2 where ν2 = i (1 ’ ∆— ) and
i

n
X
ss2 (1 ’ ∆— )x2 .
=
2 i i
i=1

The parameter p1 could also have a prior distribution assigned and a posterior
calculated in a similar fashion (e.g. assume a (non-informative) prior distribu-
tion for p1 with density function proportional to [p(1 ’ p)]’1/2 , so the posterior
distribution given the value of ∆ is that of a Beta (ν1 + 1 , ν2 + 1 ) distribu-
2 2
tion (see Box and Tiao, 1973, Section 1.3)). However, for comparison with the
mixture model above we chose p1 equal to its estimated value there, 0.81.
The algorithm therefore is as follows, beginning with k = 0 and an initial
guess at ∆— ,

1. Calculate
P P
ss2 = Pn ∆— x2
ν1 = Pi ∆— , 1
i i=1 i i
n
ν2 = i (1 ’ ∆— ) ss2 = i=1 (1 ’ ∆— )x2
2
i i i

’1 ’1
2. Generate θ1 ∼ Gamma(ν1 /2 ,2/ss2 ) and θ2 ∼Gamma(ν2 /2 ,2/ss2 )
1 2
(and p1 ∼Beta (ν1 + 1 , ν2 + 1 )) if we wish to allow p1 to vary.
2 2

3. Generate independent

p1 f (xi |θ1 )
∆— ∼ Bernoulli( )
i
p1 f (xi |θ1 ) + p2 f (xi |θ2 )
360 ESTIMATION AND CALIBRATION




Figure 7.3: The posterior distribution of the annualized volatility parameters
for the two components of a mixture model, ¬t to the S&P500 index, June-July
2003.


’1 ’1
4. Return to step 1 until the distribution of (∆— , θ1 , θ1 ) has ceased to
i
change. Then iterate steps 1-4 m times, considering these m subsequently
’1 ’1
generated values of (∆— , θ1 , θ1 ) as draws from the joint distribution of
i
(∆, θ) given (x1 , ..., xn ).

The posterior distribution of the annualized volatilities determined by the
data augmentation algorithm is give in Figure 7.3 and this gives a better idea of
the error attached to the variances of the two normally distributed components
for ¬xed p1 .
The next section provides a simple general technique for estimating parame-
ters in the presence of missing or censored data.


7.4.3 The EM Algorithm
In principle maximum likelihood estimation is simple. We write down the like-
lihood or the probability of the observed values and then choose the parameter
value which maximizes this expression. Di¬culties occur primarily when the
likelihood does not have a simple closed-form or easily evaluated expression.
In the last section we saw an example in which we observed data Z but if
complete data X were available (including “latent” or missing variables), then
the complete-data likelihood takes a simple form as a function of the unknown
parameter
LC (θ; x), (C for “complete”).
However each missing observation in a data set gives rise to a conditional ex-
pectation. If only a portion z of the complete data x is observed, then the
likelihood function for the observed data, LO (θ, z) say, (“O” is for “observed”)
is related to the complete data likelihood. As before, since

fθ (x) = fθ (z)fθ (x|z),
7.4. MAXIMUM LIKELIHOOD ESTIMATION 361

we have, for an arbitrary value of the parameter θ0 ,

ln(fθ (x)) = ln(fθ (z)) + ln(fθ (x|z))
Eθ0 [ln(fθ (x))|z] = Eθ0 [ln(fθ (z))|z] + Eθ0 [ln(fθ (x|z))|z]
Z
= ln(fθ (z)) + ln{fθ (x|z)}fθ0 (x|z)dx.

Denote the term on the left side by

(7.28)
Qθ0 (θ) = Eθ0 [ln(fθ (x))|z].

Then the log-likelihood of the observed data ln{LO (θ, z)} is
Z
ln(fθ (z)) = Qθ0 (θ) ’ ln{fθ (x|z)}fθ0 (x|z)dx.

From this we obtain

ln{LO (θ, z)} ’ ln{LO (θ0 , z)} = Qθ0 (θ) ’ Qθ0 (θ0 ) + H(θ0 , θ) (7.30)

where
fθ0 (x|z)
|z]
H(θ0 , θ) = Eθ0 [ln(
fθ (x|z)
is the cross entropy between the conditional density of x|z at the two values of
the parameter.
The EM algorithm proceeds as follows: We begin with what we think is
a good estimator of the parameter θ0 and then maximize over the value of θ
the function Qθ0 (θ). The maximizing value of θ, θ1 , say, is the next estimator.
Replacing θ0 by θ1 we repeat, now maximizing Qθ1 (θ) and so on, obtaining
a sequence of estimators θn , n = 1, 2, ..... It is not hard to show that under
fairly simple conditions, this sequence converges to the maximum likelihood
b b
estimator, i.e. the value θ satisfying LO (θ; z) = max{LO (θ; z)}. In other words,
the algorithm switches back and forth between the two steps starting with an
initial guess at the parameter value θ0 and n = 0 and stopping when the sequence
θn appears to converge:

E Step Obtain the conditional expected value

Qθn (θ) = Eθn [ln LC (θ; x)|z]

M Step Maximize this function Qθn (θ) over θ, letting

θn+1 = arg max Qθn (θ)

be the next approximation to the parameter value. Increment n.
362 ESTIMATION AND CALIBRATION

This algorithm works because of the identity (7.30). Notice that because of
the M-step above, Qθn (θn+1 ) ’ Qθn (θn ) ≥ 0 and the term H(θn+1 , θn ), because
it is the cross-entropy between two distributions, is always non-negative. This
shows that
ln{LO (θn+1 , z)} ’ ln{LO (θn , z)} ≥ 0
and that the observed data likelihood is a non-decreasing sequence. It must
therefore converge.
The score function for the complete data


SC (θ; x) = ln LC (θ; x)
‚θ

is related to the observed data score SO (θ; z) = ln LO (θ; z) even more simply:
‚θ

(7.31)
SO (θ; z) = Eθ [SC (θ; x)|z].

Therefore on the E’step of the EM algorithm, we may solve the equation for
θn+1 (checking as always that the solution corresponds to a maximum)

Sobs (θn+1 |θn ) = 0.

Thus, the EM algorithm can be more simply expressed as a simple resubstitution
algorithm for solving the score equation (7.31) equals zero:

E Step For an initial guess at the parameter value θ0 , obtain the conditional
expected value
(7.32)
SO (θ|θn ) = Eθn [SC (θ; x)|z].

Solve for θn+1 the equation

SO (θn+1 |θn ) = 0.

It is clear from this formulation that if the algorithm converges it must converge
to a point θ satisfying
SO (θ|θ) = 0
which, in view of (7.31) is the score equation for the observed data.
If there are m missing observations, notice that both (7.28) and (7.31) are
m’fold integrals and may be di¬cult to compute. Nevertheless, (7.31) leads
to an extremely useful observation that applies to missing or “censored” data,
within the normal family of distributions and more generally, to exponential
families. Equation (7.32) indicates that any term in the score function that is
unobserved should be replaced by its conditional expectation given the observed
data. If the score function contained terms like X 2 or sin(X) then the condi-
tional expectation of such terms would need to be computed. This is, of course,
NOT the same as replacing X by its conditional expectation and then using X 2
or sin(X). However, one of the marvelous features of the normal distributions
7.4. MAXIMUM LIKELIHOOD ESTIMATION 363

is that, for estimating the mean, the score function is linear in X. For exam-
ple, di¬erentiating (7.17) with respect to µX , µY we obtain the score function
corresponding to estimation of µ for complete data.
µµ ¶µ ¶¶
Yi µy
n ’1

Σi=1 Λ
Xi µx

Notice that taking the expected value given the observed data is equivalent to
replacing each unknown component of Xi , Yi by its conditional expectation given
the observed. Moreover, within the multivariate normal family of distributions,
taking conditional expectation is accomplished essentially by linear regression.
For a simple example of its application let us return to the problem with
“missing” values in the S&P500 index and the TSE300 index discussed above.
Suppose, for example we wish to ¬ll in the value for the TSE300 on July 1 using
conditional expectation. Which observations are directly relevant? Table 7.3
gives the daily returns for the two indices leaving out values that are independent
of the unknowns Y1 , Y2 .
Recall from (7.18) that the conditional distribution of Yi given Xi , Xi+1 and
S = Yi + Yi+1 is
βy|x
S 1
(Xi ’ Xi+1 ), (1 ’ ρ2 )σy ).
2
(7.33)
N( +
2 2 2
and so the conditional expectation
βy|x
S
E(Yi |Xi , Xi+1 , S] = (Xi ’ Xi+1 )
+
2 2
with βy|x = ρσy /σx . Rather than naively replacing the unknown returns Yi , Yi+1
by their average S/2 the term 1 βy|x (Xi ’ Xi+1 ) provides an adjustment deter-
2
mined from the values of X. The regression coe¬cient βy|x can either be found
by regressing Y on X preferably with data around the same period of time
(since these coe¬cients tend to be somewhat time-dependent) or by using the
current values of the maximum likelihood estimator of the covariance matrix
Λ. Using the observed data for June and July 2003, we arrived at an estimated
regression coe¬cient
βy|x ' 0.432
and this allows to to ¬ll in the missing values in the table with their conditional
expected values:

Date (2003) July 1 July 2
TSE300 Return ’0.0003 0.0013
S&P500 Return 0.008 0.0116
TABLE 7.4

Filling in the expected returns for all of the missing value for July 4 in the
same way, but using the regression coe¬cient for X on Y of βx|y ' 1.045, we
observe the following for the unknowns in Table 7.5.
364 ESTIMATION AND CALIBRATION


Date (2003) July 1 July 2 July 3 July 4 July 7 July 8
’0.0003
TSE300 (Expected) Return 0.0014 0.003
0.0013 Y7 Y8
S&P500 (Expected) Return 0.008 0.0116 -0.0082 0.0034
X4 X7
TABLE 7.5
1
(0.0188 + 1.045(0.003 ’ Y7 ))
X4 =
2
1
Y7 = (0.0124 + 0.432(X7 ’ 0.0034))
2
X4 + X7 = 0.0188
Y7 + Y8 = 0.0124

and these four equations in four unknown can be solved in the four unknowns
giving

Y7 = ’0.0634, Y8 = 0.0758,
X7 = ’0.0253.
X4 = 0.0441,

Assuming that this pair of returns follows a correlated geometric Brownian
motion then in order to estimate cov(X, Y ) in the presence of complete data we
would solve the following equations:
n
1X
c (7.34)
µx = xi

. 1
( 3)



>>