7.1 Introduction

Virtually all models in finance have parameters that must be specified in order to describe them completely. Statistical estimation can be used for some or all of these parameters under two important conditions:

1. We believe that the parameter values do not change quickly in time.

2. We have access to sufficient (recent) historical data on the underlying process to provide reliable estimates of these parameter values.

Statistical estimation involves the distribution of the data and uses methods such as maximum likelihood estimation or least squares.

There are also some instances in which estimation of parameters is undesirable. For example, we have seen that a risk-neutral distribution Q, used in the pricing of options, is not necessarily identical to the historical distribution P. Estimating parameter values for the distribution P based on historical data may have little or nothing to do with their values under Q. However, market information, for example the prices of options on a given underlying stock, carries information about the expected value of certain functions of the stock under Q, and the process of arranging parameter values so that the model prescribes option prices as close as possible to those observed is called calibration of the model.

Although there is a substantial philosophical difference between calibration and statistical estimation, there are many similarities in the methodology, since both often require obtaining the root of one or more functions or maximizing or minimizing a function, and so these are the problems with which we begin this section.

There are many numerical problems that become much more difficult in the presence of noise, from finding the roots of a given function to solving a system of differential equations. Problems of this sort have given rise to substantial research in a variety of areas, including stochastic approximation and Markov Chain Monte Carlo. In statistics, for example, we typically estimate parameters using least squares or maximum likelihood estimation. However, how do we carry out maximum likelihood estimation in cases where no simple form for the likelihood function is available? Suppose we are able to simulate data from a model with a given set of parameter values, say, but are unable to express the likelihood function in a simple form amenable to numerical maximization.

For concreteness, suppose we are interested in calibrating a model for the price of options on the SPX, the S&P500 index. The call option prices in Table 7.1 were observed on May 6, 2004, when the SPX was quoted at 1116.84. All options expired on September 18, corresponding to T = 0.3699 years. Each option price is calculated as the midpoint of the bid and ask; K denotes the exercise price and PO(K) the corresponding option price.

K       950    1005   1050   1100   1125   1150   1175   1200   1225   1250
PO(K)   173.1  124.8  88.5   53.7   39.3   27.3   17.85  10.95  6.2    3.25

Table 7.1: Prices of SPX call options

Suppose we use the Black-Scholes formula to determine the implied volatility for these options. The implied volatility is the value of σ solving

BS(1116.84, K, r, 0.3699, \sigma) = P_O(K)   (7.1)

where r is the spot interest rate. If exact prices were known (rather than just the bid and ask prices), we could determine r from the put-call parity relation, since there are also put options listed with the same exercise price. However, here we used the rate r = 0.01 of a short-term treasury bill, since it is difficult to find nearly simultaneous trades in put and call options with the same strike price. The MATLAB function BLSIMPV(S0, K, r, T, PO(K)) returns the implied volatility, i.e. provides the solution of (7.1). In Figure 7.1 we give what is often referred to as the volatility smile, although smirks and frowns are probably more common than smiles these days (surely a sign of the times). This is a graph of the exercise price of the option against the implied volatility of the option. The fact that this implied volatility is not constant is further evidence that the Black-Scholes model (at least with constant parameters) does not fit the risk-neutral distribution very well.
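The implied-volatility computation that BLSIMPV performs can be sketched in a few lines of Python. This is a minimal sketch (the helper names are our own): it assumes the standard no-dividend Black-Scholes formula and recovers σ by bisection, which is reliable here because the call price is strictly increasing in σ.

```python
from math import log, sqrt, exp, erf

def norm_cdf(x):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

def bs_call(S0, K, r, T, sigma):
    """Black-Scholes price of a European call (no dividends)."""
    d1 = (log(S0 / K) + (r + 0.5 * sigma**2) * T) / (sigma * sqrt(T))
    d2 = d1 - sigma * sqrt(T)
    return S0 * norm_cdf(d1) - K * exp(-r * T) * norm_cdf(d2)

def implied_vol(S0, K, r, T, price, lo=1e-4, hi=2.0, tol=1e-8):
    """Solve BS(S0, K, r, T, sigma) = price for sigma by bisection."""
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if bs_call(S0, K, r, T, mid) > price:
            hi = mid
        else:
            lo = mid
        if hi - lo < tol:
            break
    return 0.5 * (lo + hi)

# The data of Table 7.1
S0, r, T = 1116.84, 0.01, 0.3699
strikes = [950, 1005, 1050, 1100, 1125, 1150, 1175, 1200, 1225, 1250]
prices  = [173.1, 124.8, 88.5, 53.7, 39.3, 27.3, 17.85, 10.95, 6.2, 3.25]
smile = {K: implied_vol(S0, K, r, T, P) for K, P in zip(strikes, prices)}
```

Plotting `smile` against the strikes reproduces the shape of Figure 7.1.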

We saw in Chapter 3 that distributions such as the normal inverse Gaussian distribution can do a better job of fitting historical distributions of stocks and indices than does the normal, and so we might hope that they can be applied successfully to fitting a risk-neutral distribution. The main problem in applying distributions such as the NIG is that of calibrating the parameters of the model to market data. In models such as this, if option prices are obtained from a simulation, how can we hope to obtain parameter values for the model that are consistent with a set of market prices for the options? More specifically, suppose we assume that the price of a stock at time T is given by

S_T = S_0 e^{rT} \exp(X)

where X has the NIG(α, β, δ, μ) distribution of Lemma 29. In order that the discounted future price forms a martingale, we require that E(exp(X)) = 1, and


Figure 7.1: The volatility “smile” for SPX call options

from (3.23) with s = 1 this implies that

\mu = \delta\left\{\sqrt{\alpha^{2}-(1+\beta)^{2}}-\sqrt{\alpha^{2}-\beta^{2}}\right\}+\frac{1}{4}\ln\left(\frac{\alpha^{2}-\beta^{2}}{\alpha^{2}-(1+\beta)^{2}}\right)

and

|\beta + 1| < \alpha.

One of the four parameters, μ, is determined by the martingale condition, so the other three can be used to fit market option data if we wish. The price of a call option with exercise price K is now

E(S_0 \exp(X) - e^{-rT}K)^{+}

and this expectation is either a rather complicated numerical exercise or one that can be estimated by simulation. Whether or not we choose to price this option using simulation, the calibration problem reduces to selecting values of the parameters so that the theoretical option price agrees with the market price, i.e.

E(S_0 \exp(X) - e^{-rT}K)^{+} = P_O(K).   (7.2)

If we have 10 options as in Table 7.1 and only three parameters to vary, there is no hope for exact equality in (7.2) (10 equations in 3 unknowns is not promising).

We could either fix two of the parameters, e.g. α, β, and select the third, δ, to agree with the price of a specific option, or we could calibrate all three parameters to three or more options with similar strike prices and minimize the sum of the squared differences between observed and theoretical option prices.
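The least-squares calibration can be sketched in code. Since the NIG pricer is not reproduced in this section, the sketch below uses the one-parameter Black-Scholes price as a stand-in "model", fitting a single σ to all ten prices of Table 7.1; an NIG (or any simulated) pricer would slot into `sse` in exactly the same way. All function names here are our own.

```python
from math import log, sqrt, exp, erf

def norm_cdf(x):
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

def bs_call(S0, K, r, T, sigma):
    """Stand-in model price: Black-Scholes European call."""
    d1 = (log(S0 / K) + (r + 0.5 * sigma**2) * T) / (sigma * sqrt(T))
    return S0 * norm_cdf(d1) - K * exp(-r * T) * norm_cdf(d1 - sigma * sqrt(T))

S0, r, T = 1116.84, 0.01, 0.3699
strikes = [950, 1005, 1050, 1100, 1125, 1150, 1175, 1200, 1225, 1250]
market  = [173.1, 124.8, 88.5, 53.7, 39.3, 27.3, 17.85, 10.95, 6.2, 3.25]

def sse(sigma):
    """Sum of squared pricing errors over all ten options."""
    return sum((bs_call(S0, K, r, T, sigma) - P)**2
               for K, P in zip(strikes, market))

# Crude two-stage grid search over sigma (a library optimizer would also do).
best = min((s / 1000.0 for s in range(50, 501)), key=sse)
best = min((best + d / 100000.0 for d in range(-100, 101)), key=sse)
```

With a simulated pricer the objective `sse` becomes noisy, which is precisely the difficulty taken up in the following sections.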

We will return to this problem later, but for the present notice that in order to solve it, we need to be able to find either roots or minima of functions when evaluations of these functions are noisy, with measurement or simulation error superimposed. We begin a discussion of this problem in the next section.


7.2 Finding a Root

Let us begin with the simple problem of identifying the root of a non-decreasing function, i.e. solving for the value x_0 satisfying

f(x_0) = 0.

The standard algorithm for this is the Newton-Raphson algorithm, which begins with an arbitrary value of x_1 and then iterates

x_{n+1} = x_n - \frac{f(x_n)}{f'(x_n)}.   (7.3)

Suppose, however, that our observations on the function f(x) are of the form

Y(x) = f(x) + \epsilon_x

where the errors \epsilon_x at the point x are small, with mean 0 and variance \sigma_x^2. Even if we know the gradient D = f'(x_0) of the function at or near the root (whether we use the gradient at the root or at the value x_n approaching the root does not materially affect our argument), the attempt at a Newton-Raphson algorithm

x_{n+1} = x_n - \frac{Y_{x_n}}{D} = x_n - \frac{f(x_n)}{D} - \frac{\epsilon_{x_n}}{D}   (7.4)

has a serious problem. If there is no error whatsoever in the evaluation of the

function, the iteration

x_{n+1} = x_n - \frac{f(x_n)}{D}

typically converges to the root x_0. However, in (7.4), on each iteration we add to the "correct" Newton-Raphson iterant x_n - f(x_n)/D an error term \epsilon_{x_n}/D which is often independent of the previous errors, having variance D^{-2}\sigma_{x_n}^2. Therefore after a total of N iterations of the process, we have accumulated an error which has variance

\sum_{i=1}^{N} D^{-2}\sigma_{x_i}^{2}.

The fact that this error typically increases without bound means that continuing the iteration of (7.4) does not provide a consistent estimator of the root but one whose error eventually grows. Newton's method is inappropriate for any model in which errors in the evaluation of the function, when accumulated until convergence is declared, are comparable in size to the tolerance of the algorithm. In general, as long as the standard deviation \sigma_x of the function evaluations is of the same order of magnitude as

\text{tolerance} \times D/\sqrt{N}


where N is the number of iterations before convergence was declared, the accumulation of the error in (7.4) clouds any possible conclusion concerning the location of the root. Suppose, for example, you declare a tolerance of 10^{-4} and it takes about 100 iterations for your program to claim convergence. Assume for the sake of argument that D = 0.01 and N ≈ 100. Then an error in the function evaluation of order \sigma_x ≈ 10^{-5} will invalidate conclusions from the Newton-Raphson iteration.
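A toy simulation with the numbers of this example (D = 0.01, \sigma_x = 10^{-5}) makes the point concrete. Even in the most favourable case of a linear f and the exact gradient D, where the errors do not even accumulate, each iterate already misses the root by an error of order \sigma_x/D = 10^{-3}, ten times the declared tolerance; with an imperfect gradient the situation is worse. (The function and seed are illustrative.)

```python
import random
import statistics

random.seed(1)
D, x0, sigma, tol = 0.01, 0.0, 1e-5, 1e-4   # gradient, root, noise, tolerance

def Y(x):
    """Noisy evaluation of f(x) = D * (x - x0)."""
    return D * (x - x0) + random.gauss(0.0, sigma)

x, errs = 1.0, []
for n in range(200):
    x = x - Y(x) / D        # the "Newton" step (7.4) with the exact gradient D
    errs.append(x - x0)     # signed error of the current iterate

spread = statistics.pstdev(errs)   # hovers near sigma / D = 1e-3
```

The iterates never settle inside the 10^{-4} tolerance band; `spread` is an order of magnitude above it.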

There is a relatively simple fix to the usual Newton-Raphson method which at least corrects the lack of convergence, unfortunately at the price of greater insensitivity when we are far from the root. Suppose we multiply the constant D^{-1} by a sequence of constants a_n depending on n, so the iteration is of the form

x_{n+1} = x_n - a_n D^{-1} Y_{x_n}.   (7.5)

It is easy to see that as long as \sigma_x is bounded, for example, and

\sum_{n} a_n^2 < \infty   (7.6)

then, under mild conditions on the function f(x), x_n converges. This is intuitively clear because the variance of the error beyond the Nth term is roughly \sum_{n=N}^{\infty} a_n^2 \sigma_x^2, and this approaches zero as N \to \infty. Furthermore, provided that

\sum_{n} a_n = \infty,   (7.7)

convergence is to a root of the function f(x). This is the result of Robbins and Monro (1951). Having the correct or asymptotically correct gradient D in (7.5)

really only matters asymptotically, since it controls the rate of convergence once we are in a small neighbourhood around the root. What is more important, and sometimes difficult to achieve, is arranging that the sequence a_n is large when we are quite far from the root, but small (something like 1/n) when we are very close.

The ideal sequence of constants a_n in (7.5) is approximately 1 (this is the unmodified Newton-Raphson) until we are in the vicinity of the root, and then slowly decreases like 1/n. Provided we had some easy device for determining whether we are near the root, the Robbins-Monro method gives an adequate solution to the problem of finding roots of functions when only one root exists. However, for functions with many roots, other methods are required.

While searching for a root of a function, we are inevitably required to estimate the slope of the function, at least in the vicinity of the current iterant x_n, because without knowledge of the slope we do not know in what direction, or how far, to travel. The independent values of Y_{x_n} in the Robbins-Monro algorithm (7.5) do not serve this purpose very well. We saw in Chapter 4 that when estimating a difference, common random numbers provide considerable variance reduction. If the function f(x) is evaluated by simulation with a single uniform input U to the simulation, we can express f in the form

f(x) = \int_{0}^{1} H(x, u)\, du


for some function H. Alternatively,

f(x) = E\left\{\frac{1}{n}\sum_{i=1}^{n} H(x, U_i)\right\}   (7.8)

where the U_i are the pseudo-random inputs to the ith simulation, which again

we assume to be uniform[0,1]. There are many ways of incorporating variance

reduction in the Monte Carlo evaluation of (7.8). For a probability density

function g(z) other than the U[0,1] probability density function, we may wish to re-express (7.8) using importance sampling as

f(x) = E\left\{\frac{1}{n}\sum_{i=1}^{n} \frac{H(x, Z_i)}{g(Z_i)}\right\}   (7.9)

where the inputs Z_i are now distributed according to the p.d.f. g(z) on [0,1].

Ideally the importance density g(z) is approximately proportional to the function H(x_0, z), where x_0 is the root, and so we may wish to adjust g as our iteration proceeds and we get closer to the root and the shape of the function H(x, z) changes. Since the expectation is unknown, we could attempt to find the roots of an approximation to it, namely

\hat{f}_n(x) = \frac{1}{n}\sum_{i=1}^{n} \frac{H(x, Z_i)}{g(Z_i)}.   (7.10)

Then \hat{f}_n(x) \to f(x) as n \to \infty. Since the error in this approximation approaches zero as n \to \infty, it seems reasonable to apply Newton's method to \hat{f}_n(x):

x_{n+1} = x_n - \frac{\hat{f}_n(x_n)}{\hat{f}_n'(x_n)} = x_n - \frac{\sum_{i=1}^{n} w_i H(x_n, Z_i)}{\sum_{i=1}^{n} w_i \frac{\partial}{\partial x} H(x_n, Z_i)}   (7.11)

with weights

w_i = \frac{1}{g(Z_i)}.

Notice that even if the function H(x, Z_i) is only known up to a multiplicative constant, since this constant disappears from the expression (7.11), we can still carry out the iteration. Similarly, the weights w_i need only be known up to a multiplicative constant. There are two primary advantages of this over the simplest version of the Robbins-Monro algorithm. The gradient has been estimated using common random numbers Z_i, and we can therefore expect it to be a better estimator. The ability to incorporate an importance sampling density g(z) also provides some potential improvement in efficiency. There is an obvious disadvantage to any method, such as this or the Newton-Raphson, which requires the derivatives \frac{\partial}{\partial x} H(x_n, Z_i). If the function f(x) is expressed as an integral against some density other than the uniform, for example if

f(x) = \int_{-\infty}^{\infty} H(x, w)\, h(w)\, dw

for a density h(w), then (7.11) requires only a minimal change:

x_{n+1} = x_n - \frac{\sum_{i=1}^{n} w_i H(x_n, Z_i)}{\sum_{i=1}^{n} w_i \frac{\partial}{\partial x} H(x_n, Z_i)}, \quad\text{where}\quad w_i = \frac{h(Z_i)}{g(Z_i)}.

Let us now return to the calibration of the Normal Inverse Gaussian model to the option prices in Table 7.1. We will attempt a calibration of all the parameters in Chapter 8, but for the moment, for simplicity, we reuse the parameter values we estimated in Chapter 1 for the S&P500 index, α = 95.23, β = −4.72. Since μ is determined by the martingale condition, this leaves only one parameter, δ, the analogue of variance, to calibrate to the option price. We wish to solve for δ, with S_0 = 1116.84, r = 0.01, T = 0.3699, α = 95.23, β = −4.72 and μ determined by the martingale condition:

\int_{-\infty}^{\infty} (S_0 \exp(x) - e^{-rT}K)^{+}\, nig(x; \alpha, \beta, \delta, \mu)\, dx = P_O(K)

where nig is the normal inverse Gaussian probability density function.

Since we are seeking the root over the value of δ, the derivative of the density nig(x; α, β, δ, μ) is required for the gradient, and this is cumbersome (again involving Bessel functions), so we replaced it by fitting a cubic spline to the observations in a neighbourhood around δ and then using the derivative of this cubic spline. In Figure 7.2 we plot the values of √δ against the corresponding strike price K. We use the square root of δ since this is the closest analogue of the standard deviation of returns. The figure is similar to the volatility smile using the Black-Scholes model in Figure 7.1 and, at least in this case, shows no more tendency toward constancy of parameters than does the Normal model. If the NIG model is preferable for valuing options on the S&P500 index, it doesn't show in the calibration of the parameter δ to observed option prices.

Similar methods for finding roots can be used when f is a function of d arguments, a function from R^d to R^d, so that both x_n and Y_{x_n} are d-dimensional; but be forewarned, the problem is considerably more difficult in d dimensions when d is large. The analogous procedure to the Robbins-Monro method is

x_{n+1} = x_n - a_n D_n^{-1} Y_{x_n}

where the d × d matrix D_n is an approximation to the gradient of the function f at the root x_0. Once again, since we do not have precise measurements of


Figure 7.2: "Volatility" smile for NIG fit to call option prices on the S&P500.

the function f, finding an approximation to the gradient D_n is non-trivial but even more important than in the one-dimensional case, since it determines the direction in which we search for the root. For noisy data, there are few simple procedures that give reliable estimates of the gradient. One possibility that seems to work satisfactorily is to fit a smooth function or spline to a set of values (x, Y_x) in some small neighbourhood of the current iterant x_n and use the gradient of this fitted function in place of D_n. This set of values (x, Y_x) can include previously sampled values (x_j, Y_{x_j}), j < n, and/or new values sampled expressly to estimate the gradient.
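A one-dimensional sketch of the local-fit idea: estimate the slope by least-squares fitting a line to noisy evaluations in a small neighbourhood of the current iterant. The test function, noise level, and names are illustrative; in d dimensions the same least-squares fit yields a d × d gradient estimate.

```python
import random

random.seed(4)

def Y(x):
    """Noisy evaluation of f(x) = x**3 - 2."""
    return x**3 - 2.0 + random.gauss(0.0, 0.01)

def local_slope(xc, half_width=0.05, m=50):
    """Estimate f'(xc) by least-squares fitting a straight line to m noisy
    evaluations on an equally spaced grid in [xc - h, xc + h]."""
    xs = [xc - half_width + 2 * half_width * i / (m - 1) for i in range(m)]
    ys = [Y(x) for x in xs]
    xbar = sum(xs) / m
    ybar = sum(ys) / m
    num = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
    den = sum((x - xbar)**2 for x in xs)
    return num / den

slope = local_slope(1.3)   # the true slope is 3 * 1.3**2 = 5.07
```

Averaging over many points in the neighbourhood is what tames the noise; a two-point difference quotient at this noise level would be far less reliable.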

7.3 Maximization of Functions

We can use the results of the previous section to maximize a unimodal function f(x), provided that on each iteration we are able to approximate the gradient of the function with a noisy observation

Y_x \approx f'(x)

by simply using (7.5) to solve the equation f'(x) = 0. For example, suppose we can express the function f(x) as

f(x) = \int_{0}^{1} H(x, u)\, du

for some function H, so that

f(x) = E\left\{\frac{1}{n}\sum_{i=1}^{n} H(x, U_i)\right\}.   (7.12)


Again, the U_i are the pseudo-random inputs to the ith simulation, which we can assume to be uniform[0,1]. For a probability density function g(z) we can rewrite (7.12) using importance sampling as

f(x) = E\left\{\frac{1}{n}\sum_{i=1}^{n} \frac{H(x, Z_i)}{g(Z_i)}\right\}   (7.13)

where the inputs Z_i are distributed according to the p.d.f. g(z) on [0,1]. If the function is smooth, we have many alternatives for finding its maximum. For example, we may use the Robbins-Monro method discussed in the previous section to find a root of \hat{f}'(x) = 0, where

\hat{f}'(x) = \frac{1}{n}\sum_{i=1}^{n} \frac{\frac{\partial}{\partial x} H(x, Z_i)}{g(Z_i)}.

Again there is a lack of consistency unless we permit the number of simulations n to grow to infinity, and our choice of ideal importance distribution will change as we iterate toward convergence. With this in mind, suppose we assume that the importance distribution comes from a one-parameter family g(z; θ) and we update the value of this parameter from time to time with information obtained about the function H(x, z). In other words, assume that θ_n is a function of (Z_1, ..., Z_n) and Z_{n+1} | θ_n is drawn from the probability density function g(z; θ_n). Then the sequence of approximations

\hat{f}_n(x) = \frac{1}{n}\sum_{i=1}^{n} \frac{H(x, Z_i)}{g(Z_i; \theta_{i-1})}

all have expected value f(x), and under mild conditions on the variances of the terms, var(H(x, Z_i)/g(Z_i; θ_{i−1}) | Z_1, ..., Z_{i−1}), we have \hat{f}_n(x) \to f(x) as n \to \infty. Since the error in this approximation approaches zero as n \to \infty, it seems reasonable to apply Newton's method to such a system:

x_{n+1} = x_n - \frac{\hat{f}_n'(x_n)}{\hat{f}_n''(x_n)}   (7.14)
        = x_n - \frac{\sum_{i=1}^{n} w_i \frac{\partial}{\partial x} H(x_n, Z_i)}{\sum_{i=1}^{n} w_i \frac{\partial^2}{\partial x^2} H(x_n, Z_i)}   (7.15)

with weights

w_i = \frac{1}{g(Z_i; \theta_{i-1})}.

Notice that even if the function H(x, Z_i) is only known up to a multiplicative constant, since this constant disappears from the expression (7.15), we can still carry out the iteration. Similarly, the weights w_i need only be known up to a multiplicative constant.

Evidently there are at least three alternatives for maximization in the presence of noise:

1. Update the sample Z_i on each iteration and use a Robbins-Monro procedure, iterating

   x_{n+1} = x_n - a_n \frac{\partial}{\partial x} H(x_n, Z_n).

2. Fix the number of simulations n and the importance distribution and maximize the approximation, as in Geyer (1995):

   x_n = \arg\max\left\{\frac{1}{n}\sum_{i=1}^{n} \frac{H(x, Z_i)}{g(Z_i)}\right\}.

3. Update the importance distribution in some fashion and iterate (7.15).

The last two methods both require an importance distribution. The success of the third alternative depends largely on how well the importance distribution g(Z_i) adapts to the shape of the function H(x, Z_i) for x near the maximizing value.
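The second alternative can be sketched concretely: draw the sample once, then hand the fixed-sample average to any deterministic maximizer. The toy H below (our own choice, with g uniform so no weights are needed) has f(x) = E[H(x, U)] maximized at x = E[U] = 0.5.

```python
import random

random.seed(5)

def H(x, u):
    """H(x, u) = -(x - u)**2, so f(x) = E[H(x, U)] peaks at x = E[U] = 0.5."""
    return -(x - u)**2

n = 2000
U = [random.random() for _ in range(n)]   # fixed once, as in alternative 2

def fhat(x):
    """Fixed-sample approximation to f(x); deterministic in x."""
    return sum(H(x, u) for u in U) / n

# Crude grid maximization; any standard optimizer could be used instead,
# precisely because fhat is now an ordinary deterministic function.
x_star = max((i / 1000.0 for i in range(1001)), key=fhat)
```

For this quadratic H the fixed-sample maximizer is simply the sample mean of the U_i, which converges to 0.5 as n grows; the point of the construction is that any off-the-shelf optimizer can be applied once the sample is frozen.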

7.3.1 Example: estimation with missing data

This example is a special case of a maximization problem involving "missing data" or latent variables. Although this methodology arose largely in biostatistics, where various types of censorship of data are common, it has application as well in many areas, including finance. For example, periods such as weekends, holidays, or evenings, or periods between trades for thinly traded stocks or bonds, could be considered "missing data" in the sense that if trades had occurred in these periods, it would often simplify the analysis.

For a simple example, suppose we wish to model the daily data for a given stock index such as the Toronto Stock Exchange 300 index, TSE300, but on a given day, July 1 for example, the exchange is closed. Should we simply assume that this day does not exist on the market and use the return over a two-day period around this time as if it were the return over a single day? Since other indices such as the S&P500 are available on this day, it is preferable to analyze the data as a bivariate or multivariate series and essentially impute the "missing" values for the TSE300 using its relationship to a correlated index. The data is in Table 7.2.

Date (2003)     June 30   July 1   July 2   July 3   July 4    July 7    July 8
TSE300 Close     6983.1        *   6990.3   6999.8   7001.9         *    7089.6
S&P500 Close      974.5    982.3    993.8    985.7        *   1004.42   1007.84

Table 7.2: Closing values for the S&P500 and TSE300, July 2003. "*" = missing value.


We will give a more complete discussion of dealing with missing data later when we introduce the EM algorithm, but for the present we give a brief discussion of how "imputed values" (a statistician's euphemism for educated guesses) can help maximize the likelihood in problems such as this one. So for the moment, suppose that X is the "complete" data set (in the above example this is the data with the missing values "*" filled in) and let Z be the data we actually observe. Then, since the probability density for X factors into the marginal density for Z and the conditional density for X given Z,

f_x(x) = f_z(z)\, f_{x|z}(x|z),

we have, upon taking logarithms and then the conditional expectation given Z = z, the relation

\ln(f_z(z)) = E[\ln(f_x(X))\,|\,Z = z] - \int \ln(f(x|z))\, f(x|z)\, dx   (7.16)

where the second term is the entropy of the conditional distribution f(x|z). Now suppose that we were somehow able to randomly generate the missing observations from their correct distribution, perhaps many times, while leaving the observed data alone. Averaging the log-likelihood ln(f_x(X)) for the completed data sets while holding the observed data Z constant at z is equivalent to approximating E[ln(f_x(X)) | Z = z].

The maximum likelihood estimator of the parameters based on the incomplete data Z is obtained by maximizing the incomplete-data likelihood ln(f_z(z)) over the value(s) of the parameters, and a simple attempt at this would be to maximize the first term on the right side of (7.16) instead. Of course there is no guarantee that this is equivalent to maximizing both terms, and we return to this question later, but for the moment suppose that our objective is maximization of

E[\ln(f_x(X))\,|\,Z = z]

over the parameter values. In the above example, we can assume that the daily returns for the two indices are correlated normally distributed random variables, so that in order to model these indices we need the 2 mean values and the 3 parameters in the covariance matrix of a bivariate normal distribution. Our strategy will be to "fill in" or impute the missing values, evaluate the likelihood ln(f_x(X)) as if these imputed observations were real data, and regard this as a "noisy" evaluation of the function E[ln(f_x(X)) | Z = z] which we would like to maximize.

In order to put this data on the more familiar ground of the bivariate normal distribution, we fit the log-normal distribution so that (starting with t = 0 on June 30, 2003) the close of the TSE at time t is given by

6983.1\, \exp\left\{\sum_{i=1}^{t} Y_i\right\}


and that of the S&P500 by

974.5\, \exp\left\{\sum_{i=1}^{t} X_i\right\}

where the daily returns (X_i, Y_i) are correlated normal random variables with mean (\mu_x, \mu_y) and covariance matrix

\Lambda = \begin{pmatrix} \sigma_y^2 & \rho\sigma_x\sigma_y \\ \rho\sigma_x\sigma_y & \sigma_x^2 \end{pmatrix}.

Then the conditional log-likelihood (given the values on June 30) for the data, assuming all of the returns are observed, is

l_C(\mu, \Lambda) = -\frac{n}{2}\ln(|\Lambda|) - \frac{1}{2}\sum_{i=1}^{n}\left(\begin{pmatrix} Y_i \\ X_i \end{pmatrix}-\begin{pmatrix} \mu_y \\ \mu_x \end{pmatrix}\right)'\Lambda^{-1}\left(\begin{pmatrix} Y_i \\ X_i \end{pmatrix}-\begin{pmatrix} \mu_y \\ \mu_x \end{pmatrix}\right).   (7.17)

We would like to evaluate this log-likelihood when the missing data is replaced

by imputed data and then maximize over the parameters. To this end we need

to know how to impute missing values, requiring their conditional distribution

given what is observed. So let us return to a small portion of the data as shown

in Table 7.3. Notice that we are essentially given the total return S = 0.001 for the TSE over the two-day period July 1-2, but not the individual returns Y_1, Y_2, which we need to impute.

Date (2003)            July 1   July 2   Total Return
TSE300 return (Y_i)       Y_1      Y_2      S = 0.001
S&P500 return (X_i)     0.008   0.0116         0.0196

Table 7.3: Returns for TSE300 and S&P500

In order to carry out the imputation for Y_1, Y_2, recall that bivariate normal data is often expressed in terms of a linear regression relationship describing the conditional distribution of Y given X, of the form

Y_i = \mu_y + \beta_{y|x}(X_i - \mu_x) + \epsilon_i

where the error term \epsilon_i is a Normal(0, (1-\rho^2)\sigma_y^2) random variable independent of X_i, and the regression parameter \beta_{y|x} is related to the covariance matrix by

\beta_{y|x} = \frac{cov(Y, X)}{var(X)} = \rho\frac{\sigma_y}{\sigma_x}.

It follows from Problem 1 at the end of the chapter that, given the values X_1, X_2 and the sum S = Y_1 + Y_2, the distribution of Y_1 is

N\left(\frac{S}{2} + \frac{\beta_{y|x}}{2}(X_1 - X_2),\ \frac{1}{2}(1-\rho^2)\sigma_y^2\right).   (7.18)


If we are given tentative values of the parameters \beta_{y|x}, \rho, \sigma_y^2, this distribution can be used to generate an imputed value of Y_1, with Y_2 = S - Y_1. Of course a similar strategy allows us to impute the missing values of X_i, X_{i+1} when Y_i, Y_{i+1} and X_i + X_{i+1} are observed. In this case the value of X_i should be generated from a

N\left(\frac{X_i + X_{i+1}}{2} + \frac{\beta_{x|y}}{2}(Y_i - Y_{i+1}),\ \frac{1}{2}(1-\rho^2)\sigma_x^2\right)   (7.19)

distribution, where

\beta_{x|y} = \rho\frac{\sigma_x}{\sigma_y}.

We are now ready to return to the maximization of the log-likelihood function

E[l_C(\mu, \Lambda)\,|\,\text{observed data}]

where l_C(\mu, \Lambda) is defined at (7.17). The maximum likelihood estimator of \mu, \Lambda from complete data is

\hat{\Lambda} = \frac{1}{n}\sum_{i=1}^{n}\left(\begin{pmatrix} Y_i \\ X_i \end{pmatrix}-\begin{pmatrix} \hat{\mu}_y \\ \hat{\mu}_x \end{pmatrix}\right)\left(\begin{pmatrix} Y_i \\ X_i \end{pmatrix}-\begin{pmatrix} \hat{\mu}_y \\ \hat{\mu}_x \end{pmatrix}\right)'   (7.20)

\begin{pmatrix} \hat{\mu}_y \\ \hat{\mu}_x \end{pmatrix} = \frac{1}{n}\sum_{i=1}^{n}\begin{pmatrix} Y_i \\ X_i \end{pmatrix}.   (7.21)

The imputed values can be treated as actual observations for the missing data provided that we impute them many times and average the results. In other words, the following algorithm can be used to estimate the parameters. We begin with some arbitrary scheme for imputing the missing values; for example, we could replace the individual returns Y_1, Y_2 in Table 7.3 by their average S/2 = 0.0005. Then iterate the following:

1. Treating the imputed values as if they were observed, use the complete data (observed and imputed) to obtain estimators \hat{\Lambda} and \hat{\mu}_y, \hat{\mu}_x.

2. Use the estimated parameters to estimate \beta_{y|x} and \beta_{x|y}, and to again impute the missing X_i using (7.19) and the missing Y_i using the distribution (7.18).

3. Again use the complete data (observed and imputed) to obtain estimators \hat{\Lambda} and \hat{\mu}_y, \hat{\mu}_x. Use as the current estimate of these parameters the average of the values of \hat{\Lambda} and \hat{\mu}_y, \hat{\mu}_x obtained on each iteration of steps 1 or 3.

4. Repeat steps 2 and 3 until the averages used as parameter estimators of \mu and \Lambda appear to converge.
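The loop can be sketched on synthetic data (the numbers below are illustrative, not the Table 7.2 returns) with a single missing pair Y_1, Y_2 whose sum S is observed:

```python
import random
import math

random.seed(7)

# Synthetic daily returns: the first two Y-values are "missing";
# only their sum S is observed.
X = [0.008, 0.0116, -0.0082, 0.0094, 0.0034, 0.0021, -0.0015, 0.0047]
Y_full = [0.0012, -0.0002, 0.0014, 0.0030, 0.0062, 0.0009, -0.0011, 0.0025]
S = Y_full[0] + Y_full[1]            # the observed two-day total
Y = [S / 2, S / 2] + Y_full[2:]      # naive starting imputation

def mle(X, Y):
    """Complete-data MLE (7.20)-(7.21) for a bivariate normal."""
    n = len(X)
    mx, my = sum(X) / n, sum(Y) / n
    sxx = sum((x - mx)**2 for x in X) / n
    syy = sum((y - my)**2 for y in Y) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(X, Y)) / n
    return mx, my, sxx, syy, sxy

est = mle(X, Y)
sums = [0.0] * 5
n_iter = 2000
for _ in range(n_iter):
    mx, my, sxx, syy, sxy = est
    rho = sxy / math.sqrt(sxx * syy)
    beta_yx = sxy / sxx
    # step 2: re-impute Y1, Y2 from the conditional distribution (7.18)
    mean = S / 2 + (beta_yx / 2) * (X[0] - X[1])
    sd = math.sqrt(max(0.5 * (1 - rho**2) * syy, 0.0))
    Y[0] = random.gauss(mean, sd)
    Y[1] = S - Y[0]
    # step 3: re-estimate from the completed data and accumulate the average
    est = mle(X, Y)
    sums = [s + e for s, e in zip(sums, est)]

avg = [s / n_iter for s in sums]     # averaged parameter estimates
```

Note that the averaged mean estimates coincide exactly with the naive ones, since Y_1 + Y_2 = S on every iteration; this mirrors the observation about the means made in the text below, while the averaged covariance estimate reflects the imputation variability.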

We ran through this imputation-estimation cycle a total of 100,000 times (in about 70 seconds of cpu time on a laptop) and obtained estimates of the mean and covariance matrix by averaging these 100,000 estimates, resulting in

\hat{\Lambda}_{av} = \begin{pmatrix} 1.034 & -0.115 \\ -0.115 & 5.329 \end{pmatrix}\times 10^{-5}   (7.22)

\begin{pmatrix} \hat{\mu}_y \\ \hat{\mu}_x \end{pmatrix} = \begin{pmatrix} 0.0025 \\ 0.0056 \end{pmatrix}.   (7.23)

These can be compared with a naive estimator obtained by imputing the missing returns as the mean return over the corresponding two-day period, based on the returns in Table 7.4, in which the imputed returns are marked with an asterisk:

Date (2003)                 July 1    July 2    July 3    July 4    July 7    July 8
TSE300 (Expected) Return   0.0005*   0.0005*   0.0014     0.003   0.0062*   0.0062*
S&P500 (Expected) Return    0.008    0.0116   -0.0082   0.0094*   0.0094*   0.0034

Table 7.4: Returns and imputed returns for TSE300 and S&P500 ("*" = imputed)

These naive estimates are the mean and covariance matrix of the data in Table 7.4:

\hat{\Lambda} = \begin{pmatrix} 1.5 & 1.8 \\ 1.8 & 7.6 \end{pmatrix}\times 10^{-5}, \qquad \begin{pmatrix} \hat{\mu}_y \\ \hat{\mu}_x \end{pmatrix} = \begin{pmatrix} 0.0025 \\ 0.0056 \end{pmatrix}.

Notice that the two estimates of the mean are identical. This is not accidental: the procedure is essentially a numerical or Monte Carlo version of the EM algorithm discussed in the next section, and at least for estimating the mean, replacing each unobserved value by its conditional expectation given the observations results in the correct maximum likelihood estimator. The covariance estimates, on the other hand, are quite different. Often, a covariance matrix obtained by naive imputation, replacing observations with an average as we did with the data in Table 7.4, results in underestimated variances, because the variation attributable to the missing observation is understated by an average. Similarly, estimators of covariance or correlation may be biased for the same reason. The estimator (7.22), though it has random noise in it, does not suffer from bias to the same extent, because it does a better job of replicating the characteristics of the random observations.

7.4 Maximum Likelihood Estimation

7.4.1 Missing and Censored Data

It is common in finance, and generally in science, for data to be biased by under-reporting and various forms of censorship. Patients who react badly to a treatment may not return for a follow-up appointment, and the data that would have been obtained at this follow-up appointment is "missing" or unavailable; moreover, the event that such data is unobserved may depend on the value it would have taken. Similarly, data on the time to failure of companies can be "censored", or only partially observed, if there is a merger or take-over during the study, since in this case we know only that the failure time of the original firm exceeds some value. Other forms of data modification which introduce bias are also common. There is reporting bias when mutual funds that have had a poor quarter or year are more likely to give prominence to their 3- or 5-year returns in advertising, or neglect to reply to a question on performance over a particular period. Managers, stocks, and companies with particularly poor returns disappear from databases and exchanges for obvious reasons. We have already seen some techniques for adjusting for the latter "survivorship bias", but in general, successful treatment of "missing" or "censored" data requires some insight into the mechanism by which the data goes missing or is censored.

In order to give a simple example of survivorship or reporting bias, suppose that the losses x_i, i = 1, ..., n on portfolios managed by n managers over a period of time are assumed normally distributed, and for simplicity we assume that all have the same mean μ and standard deviation σ. For various reasons we expect that the probability that a manager is no longer with us at the end of this period of time (so that the loss for the whole period is "missing") is a function of the loss that would have occurred had they persisted, so that the probability that a given data value x_i is observed is π(x_i), a non-increasing function of the value x_i. The probability that an observation x_i is missing is therefore 1 − π(x_i). Then the likelihood for this data is

L(\mu, \sigma) = \prod_{i=1}^{n}\{\pi(x_i)\, n(x_i; \mu, \sigma^2)\}^{\Delta_i}\left\{\int (1-\pi(z))\, n(z; \mu, \sigma^2)\, dz\right\}^{1-\Delta_i}   (7.24)

where ∆i = 1 if the i-th value is observed and ∆i = 0 if it is missing, and n(xi; µ, σ²) denotes the normal(µ, σ²) probability density function at the point xi. The first term

π(xi) n(xi; µ, σ²)

is the likelihood corresponding to the event “the i-th loss is xi and it is observed”

and the last term

∫ (1 − π(z)) n(z; µ, σ²) dz = E[1 − π(Xi)]

is the probability that an observation is missing. Suppose n = 7 and all of the xi, i = 1, ..., n − 1 are observed but the last value x7 is missing. In this case we know the values of xi for all but the last observation, which could be replaced by a randomly generated observation if we knew the parameters µ, σ and the function π(x). If we wish to maximize the likelihood (7.24), one simple approach is to complete the data by generating a missing or latent observation X7 = Z from the normal(µ, σ²) distribution and then accepting the observation (as missing) with probability 1 − π(Z). The likelihood is proportional to

E{H(µ, σ, Z)}

354 ESTIMATION AND CALIBRATION

where the expectation is over the distribution of Z,

H(µ, σ, Z) = H(µ, σ, Z, θ) = (1/σ⁷) exp{ −[ Σ_{i=1}^{6} (xi − µ)² + (Z − µ)² ] / (2σ²) } (1 − π(Z)) / g(Z; θ)

and we assume Z has probability density function g(z; θ).

Consider maximizing this function over µ in the special case that σ = 1 and

π(z) = e−z, z > 0, and g(x; θ) is an exponential probability density function with mean θ. We might fix the number of simulations m and the importance distribution and maximize the approximation to the integral:

µ̂ = arg maxµ { (1/m) Σ_{i=1}^{m} H(µ, Zi) }

This method was recommended by Geyer (1996) and has the advantage that it essentially estimates the function unbiasedly, so that one can rely on standard maximization programs such as fminsearch in Matlab. It requires, however, that we fix the importance distribution g(z; θ) and the number of simulations. Alternatives are Robbins-Monro methods, which have the ability to update the importance distribution parameter θ to reduce the variance of the estimator.
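As a concrete sketch of this approach (in Python rather than Matlab; the six observed losses, the importance mean θ = 1, the number of simulations and the grid search are all illustrative assumptions, not taken from the text), we fix m draws Zi from the exponential importance density and maximize the Monte Carlo average of H(µ, Zi) over µ:

```python
import math, random

def mc_likelihood_argmax():
    random.seed(1)
    # hypothetical observed losses x_1..x_6 (the 7th is missing)
    xs = [0.2, -0.5, 1.1, 0.3, -0.1, 0.7]
    theta = 1.0          # importance-density mean (assumed)
    m = 2000             # number of simulations, fixed in advance

    # Z_i ~ exponential with mean theta: g(z; theta) = exp(-z/theta)/theta
    zs = [random.expovariate(1.0 / theta) for _ in range(m)]

    def H(mu, z):
        # sigma = 1, pi(z) = exp(-z):
        # H = exp(-[sum (x_i-mu)^2 + (z-mu)^2]/2) * (1 - pi(z)) / g(z; theta)
        ss = sum((x - mu) ** 2 for x in xs) + (z - mu) ** 2
        g = math.exp(-z / theta) / theta
        return math.exp(-ss / 2.0) * (1.0 - math.exp(-z)) / g

    def L_hat(mu):
        # unbiased Monte Carlo estimate of E H(mu, Z)
        return sum(H(mu, z) for z in zs) / m

    # crude grid search over mu; a real implementation would use an optimizer
    grid = [i / 50.0 for i in range(-50, 101)]
    return max(grid, key=L_hat)

mu_hat = mc_likelihood_argmax()
```

Because the same simulated Zi are reused for every value of µ, the estimated likelihood is a smooth function of µ and any standard optimizer (such as fminsearch) could replace the grid search.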

Various stochastic volatility models and “Regime switching” models can also

be viewed as examples of latent variables or missing data problems. If the

market has two volatility states, “high” and “low”, these cannot be directly

observed but only inferred from the movement in the market over a period of

time. In this case, given the state that the market is in, the parameters are often

easily estimated, but this state is a latent variable that needs to be estimated

or imputed from the observations.

7.4.2 Example: Mixture Models

One common problem which can also be viewed as a missing data problem is

that of estimation of parameters in a mixture model. To take a simple example, consider a vector of observations of stock returns (x1, ..., xn) and assume that xi has a probability density function that takes one of two possible forms f1(x|θ1) or f2(x|θ2) where the two parameters θ1, θ2 are not specified. We assume that a given observation comes from population distribution f1(x|θ1) with some probability p1 and otherwise, with probability p2 = 1 − p1, it comes from

the population with density f2 (x|θ2 ). This is an example of a mixture of two

distributions: the overall population distribution is

f (x) = p1 f1 (x|θ1 ) + p2 f2 (x|θ2 ).

If each data point came with an identifier zi taking possible values 1, 2 that specified the parent distribution, estimation of the parameters p1, θ1, θ2 would


be easy. We would estimate p1 using the proportion of observations from pop-

ulation 1 and then use these to estimate the parameter θ1 , and similarly θ2 .

However, this identifier is not usually available. For example, it is a common model in finance to assume that there are two categories of observations, or states of the market, corresponding to high volatility (θ2 large) and low volatility (θ1 small), but of course we can only infer which state we are currently in from the magnitude of the last few changes in price. Alternatively we might assume that the vector of identifiers z1, z2, ... mentioned above forms a simple 2-state Markov chain, in which case the model for stock returns is commonly referred to as a regime-switching model.

Once again, for the sake of a concrete problem, we use the returns from the S&P500 index between January 1, 2003 and May 11, 2004 to estimate a mixture model. In particular, suppose that the density function of a return is assumed to be a mixture of normal distributions with mean µ and variances θi, i = 1, 2, i.e.

p1 n(x; µ, θ1) + (1 − p1) n(x; µ, θ2) = p1 f1(x|θ1) + (1 − p1) f2(x|θ2).

For the sake of identifiability of the parameters, we assume θ1 < θ2. Consider first the simple case in which we assume that the identifiers zi are independent and observed. Then the likelihood function is

L(θ) = ∏_{i=1}^{n} f1(xi|θ1)^{∆i} f2(xi|θ2)^{1−∆i}

where ∆i = I(zi = 1). Since ∆i is not observed, the estimating function in the

missing data case is obtained by conditioning the “complete data” estimating

function on the observed information, i.e.

Σ_{i=1}^{n} E[∆i|x1, ..., xn] ∂/∂θ1 ln f1(xi|θ1) = 0

Σ_{i=1}^{n} E[1 − ∆i|x1, ..., xn] ∂/∂θ2 ln f2(xi|θ2) = 0.

In the case of independent zi the conditional probability that ∆i = 1 is given

by

wi = p1 f1(xi|θ1) / { p1 f1(xi|θ1) + p2 f2(xi|θ2) }

and this results in the estimating functions to be solved for θ,

Σ_{i=1}^{n} wi ∂/∂θ1 ln f1(xi|θ1) = 0    (7.25)

Σ_{i=1}^{n} (1 − wi) ∂/∂θ2 ln f2(xi|θ2) = 0.    (7.26)


Of course typically p1 , p2 are not known either and they also need to be estimated

from the data concurrently with our estimation of θ. The estimating function

can be derived in exactly the same manner; if the ∆i were observed, then we

would estimate the parameters p1 , p2 using

p1 = (1/n) Σ_{i=1}^{n} ∆i

which after conditioning is of the form

p1 = (1/n) Σ_{i=1}^{n} E[∆i|xi] = (1/n) Σ_{i=1}^{n} wi,  and    (7.27)

p2 = 1 − p1.

Typically a simple routine which begins with initial values of the parameters and then iteratively steps towards a solution of (7.25), (7.26) and (7.27) provides at least slow convergence to a solution. There is no absolute guarantee that the solution corresponds to a local maximum of the likelihood function, but for the S&P500 data mentioned the likelihood appears to increase monotonically. Convergence is to the model

0.811 N(µ, 0.0085²) + 0.189 N(µ, 0.015²),

which corresponds to a √252 × 0.0085 ≈ 14% annual volatility on 81% of the days and a 24% volatility on the other 19% of the days. In general we need to be concerned about whether the initial values of the parameters affect the result, but in this case varying the initial values of p1, θ1, θ2 did not appear to have any effect (except on the time to convergence). This simple algorithm, when applied to mixture problems, can more generally be quite sensitive to the starting values.
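A minimal Python sketch of this iteration, run here on simulated returns rather than the S&P500 series (the sample size, the seed and the starting values are illustrative assumptions), alternates between computing the weights wi and updating θ1, θ2 and p1, with the common mean held at zero:

```python
import math, random

def n_pdf(x, var):
    """Normal(0, var) probability density."""
    return math.exp(-x * x / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)

def fit_two_variance_mixture(xs, iters=300):
    # illustrative starting values; several should be tried in practice
    p1, th1, th2 = 0.5, 1e-4, 4e-4          # variances, with th1 < th2
    for _ in range(iters):
        # E-step: w_i = conditional probability that x_i is from component 1
        w = [p1 * n_pdf(x, th1) /
             (p1 * n_pdf(x, th1) + (1 - p1) * n_pdf(x, th2)) for x in xs]
        # M-step: weighted variance estimates solve (7.25) and (7.26)
        sw = sum(w)
        th1 = sum(wi * x * x for wi, x in zip(w, xs)) / sw
        th2 = sum((1 - wi) * x * x for wi, x in zip(w, xs)) / (len(xs) - sw)
        p1 = sw / len(xs)                    # update (7.27)
    if th1 > th2:                            # enforce identifiability th1 < th2
        th1, th2, p1 = th2, th1, 1 - p1
    return p1, th1, th2

# simulated "returns": 80% of days have volatility 0.0085, 20% have 0.015
random.seed(0)
xs = [random.gauss(0, 0.0085 if random.random() < 0.8 else 0.015)
      for _ in range(4000)]
p1, th1, th2 = fit_two_variance_mixture(xs)
```

On this simulated sample the fitted weights and volatilities land close to the generating values, illustrating the slow but steady convergence described above.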

When we estimate parameters as we did in the mixture problem above, whether we use the EM algorithm or one of its many relatives, missing components of the score function are replaced by their conditional expectation given the observations. For estimation of the mean in a normal family, we replace missing observations by their conditional expectation. On the other hand, these imputed values typically have less variability than the actual data we were unable to observe, and so if our interest is in the variance, the substitution of a conditional expectation for non-observed data tends to reduce the sample variance. If we have many parameters that we wish to estimate, we can essentially replace the missing data by repeated draws from the appropriate conditional distribution and use these repeatedly to estimate the parameter values and their variances. If the missing data is randomly generated, it should have the same characteristics as the data it is replacing, including the higher moments. The next method is a Bayesian implementation of this idea, one of the few excursions into Bayesian methodology in this text.

The Data Augmentation algorithm (see Tanner and Wong, 1987) is used in

Bayesian problems when there is latent or missing data as in the above example.


A Bayesian begins with a prior distribution π(θ) assigned to the parameters being estimated. This distribution reflects our prior knowledge of the parameter values and their variability and there are many possible choices for it, but ideally it should not have too much influence on the conclusions, since we always wish the data to do more talking than the Bayesian's preconceived ideas of where the parameters lie. We explain the algorithm in the context of the example above, so x is the vector of observations and the missing data is denoted by ∆. The posterior joint distribution of the parameter θ and the missing data ∆ is the conditional distribution of θ and ∆ given the data x and can be written

p(θ, ∆|x) = p(θ|x)p(∆|θ, x).

The idea behind the data augmentation algorithm is to iteratively draw (θ, ∆)

from approximations to the joint distribution of θ, ∆ given the observed data.

Typically p(θ|x) is difficult to compute, but the conditional densities p(θ|x, ∆) and p(∆|θ, x) are much easier to deal with.

The algorithm iterates these two steps:

1. Draw ∆∗ from the density p(∆|θ∗, x).

2. Draw θ∗ from the density p(θ|x, ∆∗).

Iteration continues until the joint distribution of the pair (θ∗, ∆∗) appears to stabilize, in which case they are regarded as a draw from the joint distribution p(θ, ∆|x). The method is, in fact, a special case of the “Gibbs sampler” to be discussed later, and its convergence to the correct conditional distribution is derived from the balance equations we give there. At this point, to provide just

a little credibility, we outline an intuitive argument for a Gibbs sampler in this

simple case.

Consider the following problem. We wish to generate random variables (θ, ∆) having some joint distribution p(θ, ∆|x). To save a little notation, let us abbreviate p(·|x) to px(·). Unfortunately, the joint conditional distribution px(θ, ∆) is difficult to deal with, but conditional distributions such as

px(θ|∆) = px(θ, ∆)/px(∆)   and   px(∆|θ) = px(θ, ∆)/px(θ)

are easy to sample from. We proceed as follows. Start with an arbitrary value

for θ1 and then generate ∆1 from the density px (∆|θ1 ). Now we generate θ2 from

the probability density px (θ|∆1 ) and then ∆2 from the density px (∆|θ2 ) and so

on. It is easy to see that the pairs of observations (θ1 , ∆1 ), (θ2 , ∆2 ), ... constitute

a two dimensional Markov Chain with transition kernel K(θt+1 , ∆t+1 |θt , ∆t ) =

px (θt+1 |∆t )px (∆t+1 |θt+1 ) and it is our hope that the joint density px (θ, ∆) is


an invariant distribution for this chain. To verify this, note that

∫∫ px(θt+1|∆) px(∆t+1|θt+1) px(θ, ∆) dθ d∆
= px(∆t+1|θt+1) ∫ px(θt+1|∆) px(∆) d∆
= px(∆t+1|θt+1) px(θt+1) = px(θt+1, ∆t+1).
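This balance computation can be checked numerically on a toy example in which θ and ∆ each take only the values 0 and 1; the joint distribution below is an arbitrary illustrative choice, not taken from the text:

```python
from itertools import product

# a toy joint distribution p_x(theta, delta) on {0,1} x {0,1}
p = {(0, 0): 0.1, (0, 1): 0.3, (1, 0): 0.4, (1, 1): 0.2}

p_theta = {t: p[(t, 0)] + p[(t, 1)] for t in (0, 1)}         # p_x(theta)
p_delta = {d: p[(0, d)] + p[(1, d)] for d in (0, 1)}         # p_x(delta)
cond_theta = {(t, d): p[(t, d)] / p_delta[d] for t, d in p}  # p_x(theta|delta)
cond_delta = {(d, t): p[(t, d)] / p_theta[t] for t, d in p}  # p_x(delta|theta)

def K(t2, d2, t1, d1):
    """Transition kernel K(theta', delta' | theta, delta)
       = p_x(theta' | delta) * p_x(delta' | theta')."""
    return cond_theta[(t2, d1)] * cond_delta[(d2, t2)]

# apply one step of the chain to p; the result should again be p
new_p = {}
for t2, d2 in product((0, 1), repeat=2):
    new_p[(t2, d2)] = sum(K(t2, d2, t1, d1) * p[(t1, d1)]
                          for t1, d1 in product((0, 1), repeat=2))
```

One application of the kernel reproduces the joint distribution exactly, which is the discrete version of the integral identity above.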

Since this is the invariant distribution, if the chain is ergodic, it will converge in distribution to px(θ, ∆). There is no certainty in this or any other Markov chain that we have reached equilibrium after a finite number of draws, although there are many techniques for trying to monitor whether the distribution continues to change. There are other, more elaborate, methods for estimating the parameters in mixture models. See for example Robert (1996).

The algorithm is clearer if we apply it to the normal mixture model for the S&P500 data discussed above. In this case we start with initial guesses ∆∗i at the values of ∆i. For example we could assign the smallest returns (in absolute value) to population one, so ∆∗i = 1, and the rest to population 2, ∆∗i = 0, with about equal numbers in the two camps. Then θ∗ is drawn with density px(θ|x, ∆∗). This is easy to do in this example because the vector ∆∗ assigns each observation to one of the two populations. For simplicity suppose that we have subtracted the mean from the returns so that we can assume µ = 0 and let the parameters θ1, θ2 denote the variances of the two normal distributions.

For example, suppose that the prior distribution of θ1 is proportional to 1/θ1:

π(θ1) ∝ 1/θ1, θ1 > 0.

There is no such probability density function, but there is a measure with this density and it is often convenient to permit such so-called improper prior distributions. The prior π(θ1) corresponds to assuming an improper uniform prior distribution for log(θ1), because if U is uniform on some set then θ1 = e^U will have density function

∝ |dU/dθ1| = 1/θ1 on the corresponding set.

Then the posterior density of θ1 is

p(θ1|x, ∆∗) ∝ { ∏_i [n(xi; 0, θ1)]^{∆∗i} } (1/θ1)
∝ θ1^{−ν1/2−1} exp{ −Σ_i ∆∗i xi² / (2θ1) }

where ν1 = Σ_i ∆∗i is the degrees of freedom for this sample. The posterior


density of the reciprocal of the variance ψ = 1/θ1 is

p(ψ|x, ∆∗) ∝ ψ^{ν1/2+1} exp{ −ψ Σ_i ∆∗i xi² / 2 } |dθ1/dψ|
∝ ψ^{ν1/2−1} exp{ −ψ Σ_i ∆∗i xi² / 2 }.

We identify this posterior distribution of 1/θ1 as the Gamma(ν1/2, 2/ss1²) distribution with

ss1² = Σ_{i=1}^{n} ∆∗i xi²

and this means that ss1²/θ1 has a chi-squared distribution with ν1 degrees of freedom.

Similarly, with two parameters θ1, θ2, we may assume the prior density of (θ1, θ2) is proportional to

1/(θ1 θ2)

resulting in independent Gamma(ν1/2, 2/ss1²) and Gamma(ν2/2, 2/ss2²) posterior distributions for 1/θ1 and 1/θ2, where ν2 = Σ_i (1 − ∆∗i) and

ss2² = Σ_{i=1}^{n} (1 − ∆∗i) xi².

The parameter p1 could also have a prior distribution assigned and a posterior calculated in a similar fashion (e.g. assume a (non-informative) prior distribution for p1 with density function proportional to [p(1 − p)]^{−1/2}, so that the posterior distribution given the value of ∆ is that of a Beta(ν1 + 1/2, ν2 + 1/2) distribution; see Box and Tiao, 1973, Section 1.3). However, for comparison with the mixture model above we chose p1 equal to its estimated value there, 0.81.

The algorithm therefore is as follows, beginning with k = 0 and an initial guess at ∆∗:

1. Calculate

ν1 = Σ_i ∆∗i,   ss1² = Σ_{i=1}^{n} ∆∗i xi²
ν2 = Σ_i (1 − ∆∗i),   ss2² = Σ_{i=1}^{n} (1 − ∆∗i) xi²

2. Generate 1/θ1 ∼ Gamma(ν1/2, 2/ss1²) and 1/θ2 ∼ Gamma(ν2/2, 2/ss2²) (and p1 ∼ Beta(ν1 + 1/2, ν2 + 1/2) if we wish to allow p1 to vary).

3. Generate independent

∆∗i ∼ Bernoulli( p1 f(xi|θ1) / { p1 f(xi|θ1) + p2 f(xi|θ2) } )


Figure 7.3: The posterior distribution of the annualized volatility parameters for the two components of a mixture model, fit to the S&P500 index, June-July 2003.

4. Return to step 1 until the distribution of (∆∗i, 1/θ1, 1/θ2) has ceased to change. Then iterate steps 1-4 m times, considering these m subsequently generated values of (∆∗i, 1/θ1, 1/θ2) as draws from the joint distribution of (∆, θ) given (x1, ..., xn).
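The steps above can be sketched in Python as follows; the simulated returns, the seed and the number of sweeps are illustrative assumptions, and p1 is held fixed at 0.81 as in the text. Note that random.gammavariate takes a shape and a scale argument, so 1/θj ∼ Gamma(νj/2, 2/ssj²) can be drawn directly:

```python
import math, random

def n_pdf(x, var):
    """Normal(0, var) probability density."""
    return math.exp(-x * x / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)

def data_augmentation(xs, p1=0.81, sweeps=400, burn_in=100):
    n = len(xs)
    # initial guess: the smallest returns in absolute value go to population 1
    cutoff = sorted(abs(x) for x in xs)[n // 2]
    delta = [1 if abs(x) <= cutoff else 0 for x in xs]
    draws = []
    for sweep in range(sweeps):
        # step 1: degrees of freedom and sums of squares for each population
        nu1 = sum(delta)
        ss1 = sum(d * x * x for d, x in zip(delta, xs))
        nu2, ss2 = n - nu1, sum((1 - d) * x * x for d, x in zip(delta, xs))
        # step 2: draw 1/theta_j ~ Gamma(shape nu_j/2, scale 2/ss_j^2)
        th1 = 1.0 / random.gammavariate(nu1 / 2.0, 2.0 / ss1)
        th2 = 1.0 / random.gammavariate(nu2 / 2.0, 2.0 / ss2)
        if th1 > th2:                       # enforce theta_1 < theta_2
            th1, th2 = th2, th1
        # step 3: redraw the latent indicators Delta_i
        delta = [1 if random.random() <
                 p1 * n_pdf(x, th1) /
                 (p1 * n_pdf(x, th1) + (1 - p1) * n_pdf(x, th2))
                 else 0 for x in xs]
        if sweep >= burn_in:
            draws.append((th1, th2))
    return draws

# simulated zero-mean "returns" standing in for the S&P500 series
random.seed(3)
xs = [random.gauss(0, 0.0085 if random.random() < 0.81 else 0.015)
      for _ in range(2000)]
draws = data_augmentation(xs)
post_mean_th1 = sum(d[0] for d in draws) / len(draws)
post_mean_th2 = sum(d[1] for d in draws) / len(draws)
```

The retained draws after burn-in approximate the joint posterior of the two variances, from which summaries such as those plotted in Figure 7.3 can be computed.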

The posterior distribution of the annualized volatilities determined by the data augmentation algorithm is given in Figure 7.3, and this gives a better idea of the error attached to the variances of the two normally distributed components for fixed p1.

The next section provides a simple general technique for estimating parameters in the presence of missing or censored data.

7.4.3 The EM Algorithm

In principle maximum likelihood estimation is simple. We write down the likelihood or the probability of the observed values and then choose the parameter value which maximizes this expression. Difficulties occur primarily when the likelihood does not have a simple closed-form or easily evaluated expression.

In the last section we saw an example in which we observed data Z but if

complete data X were available (including “latent” or missing variables), then

the complete-data likelihood takes a simple form as a function of the unknown

parameter

LC (θ; x), (C for “complete”).

However, each missing observation in a data set gives rise to a conditional expectation. If only a portion z of the complete data x is observed, then the likelihood function for the observed data, LO(θ, z) say (“O” is for “observed”), is related to the complete data likelihood. As before, since

fθ (x) = fθ (z)fθ (x|z),


we have, for an arbitrary value of the parameter θ0,

ln(fθ(x)) = ln(fθ(z)) + ln(fθ(x|z))
Eθ0[ln(fθ(x))|z] = Eθ0[ln(fθ(z))|z] + Eθ0[ln(fθ(x|z))|z]
= ln(fθ(z)) + ∫ ln{fθ(x|z)} fθ0(x|z) dx.

Denote the term on the left side by

Qθ0(θ) = Eθ0[ln(fθ(x))|z].    (7.28)

Then the log-likelihood of the observed data ln{LO(θ, z)} is

ln(fθ(z)) = Qθ0(θ) − ∫ ln{fθ(x|z)} fθ0(x|z) dx.

From this we obtain

ln{LO(θ, z)} − ln{LO(θ0, z)} = Qθ0(θ) − Qθ0(θ0) + H(θ0, θ)    (7.30)

where

H(θ0, θ) = Eθ0[ ln( fθ0(x|z) / fθ(x|z) ) | z ]

is the cross entropy between the conditional density of x|z at the two values of the parameter.

The EM algorithm proceeds as follows: We begin with what we think is

a good estimator of the parameter θ0 and then maximize over the value of θ

the function Qθ0 (θ). The maximizing value of θ, θ1 , say, is the next estimator.

Replacing θ0 by θ1 we repeat, now maximizing Qθ1 (θ) and so on, obtaining

a sequence of estimators θn, n = 1, 2, .... It is not hard to show that, under fairly simple conditions, this sequence converges to the maximum likelihood estimator, i.e. the value θ̂ satisfying LO(θ̂; z) = maxθ LO(θ; z). In other words, the algorithm switches back and forth between the two steps, starting with an initial guess at the parameter value θ0 and n = 0 and stopping when the sequence θn appears to converge:

E Step Obtain the conditional expected value

Qθn (θ) = Eθn [ln LC (θ; x)|z]

M Step Maximize this function Qθn (θ) over θ, letting

θn+1 = arg max Qθn (θ)

be the next approximation to the parameter value. Increment n.


This algorithm works because of the identity (7.30). Notice that because of the M-step above, Qθn(θn+1) − Qθn(θn) ≥ 0, and the term H(θn, θn+1), because it is the cross-entropy between two distributions, is always non-negative. This shows that

ln{LO(θn+1, z)} − ln{LO(θn, z)} ≥ 0

and that the observed data likelihood is a non-decreasing sequence. It must therefore converge.

The score function for the complete data

SC(θ; x) = ∂/∂θ ln LC(θ; x)

is related to the observed data score SO(θ; z) = ∂/∂θ ln LO(θ; z) even more simply:

SO(θ; z) = Eθ[SC(θ; x)|z].    (7.31)

Therefore, on the E-step of the EM algorithm, we may solve for θn+1 (checking as always that the solution corresponds to a maximum) the equation

SO(θn+1|θn) = 0.

Thus, the EM algorithm can be expressed more simply as a resubstitution algorithm for setting the score (7.31) equal to zero:

E Step Beginning with an initial guess θ0 at the parameter value, obtain the conditional expected value

SO(θ|θn) = Eθn[SC(θ; x)|z].    (7.32)

Solve for θn+1 the equation

SO(θn+1|θn) = 0.

It is clear from this formulation that if the algorithm converges it must converge

to a point θ satisfying

SO (θ|θ) = 0

which, in view of (7.31), is the score equation for the observed data.
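In the simplest illustrative case, a normal(µ, 1) sample in which m of the n observations are missing completely at random, the complete-data score Σi(xi − µ) is linear, each missing xi is replaced by its conditional expectation µn, and solving SO(θn+1|θn) = 0 gives the explicit update µn+1 = (Σobs xi + m µn)/n. A short sketch with hypothetical data:

```python
def em_normal_mean(observed, m, mu0=0.0, iters=200):
    """EM for the mean of a normal(mu, 1) sample with m values missing
    completely at random.

    E-step: each missing x_i in the score sum(x_i - mu) is replaced by its
    conditional expectation mu_n.
    M-step: solving the filled-in score equation gives
    mu_{n+1} = (sum(observed) + m * mu_n) / n.
    """
    n = len(observed) + m
    mu = mu0
    for _ in range(iters):
        mu = (sum(observed) + m * mu) / n
    return mu

observed = [1.2, 0.8, 1.5, 0.9, 1.1]   # hypothetical observed values
mu_hat = em_normal_mean(observed, m=3)
```

The fixed point of this update is the sample mean of the observed values (1.1 here), confirming that when the data are missing completely at random the resubstitution iteration simply reproduces the observed-data maximum likelihood estimator.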

If there are m missing observations, notice that both (7.28) and (7.31) are m-fold integrals and may be difficult to compute. Nevertheless, (7.31) leads to an extremely useful observation that applies to missing or “censored” data within the normal family of distributions and, more generally, exponential families. Equation (7.32) indicates that any term in the score function that is unobserved should be replaced by its conditional expectation given the observed data. If the score function contained terms like X² or sin(X), then the conditional expectation of such terms would need to be computed. This is, of course, NOT the same as replacing X by its conditional expectation and then using X² or sin(X). However, one of the marvelous features of the normal distributions


is that, for estimating the mean, the score function is linear in X. For example, differentiating (7.17) with respect to µX, µY we obtain the score function corresponding to estimation of µ for complete data:

Σ_{i=1}^{n} Λ⁻¹ ( (Yi, Xi)′ − (µy, µx)′ )

Notice that taking the expected value given the observed data is equivalent to

replacing each unknown component of Xi , Yi by its conditional expectation given

the observed data. Moreover, within the multivariate normal family of distributions,

taking conditional expectation is accomplished essentially by linear regression.

For a simple example of its application, let us return to the problem with “missing” values in the S&P500 index and the TSE300 index discussed above. Suppose, for example, we wish to fill in the value for the TSE300 on July 1 using conditional expectation. Which observations are directly relevant? Table 7.3 gives the daily returns for the two indices, leaving out values that are independent of the unknowns Y1, Y2.

Recall from (7.18) that the conditional distribution of Yi given Xi, Xi+1 and S = Yi + Yi+1 is

N( S/2 + (βy|x/2)(Xi − Xi+1), (1/2)(1 − ρ²)σy² )    (7.33)

and so the conditional expectation is

E[Yi | Xi, Xi+1, S] = S/2 + (βy|x/2)(Xi − Xi+1)

with βy|x = ρσy/σx. Rather than naively replacing the unknown returns Yi, Yi+1 by their average S/2, the term (1/2)βy|x(Xi − Xi+1) provides an adjustment determined from the values of X. The regression coefficient βy|x can either be found by regressing Y on X, preferably with data around the same period of time (since these coefficients tend to be somewhat time-dependent), or by using the current values of the maximum likelihood estimator of the covariance matrix Λ. Using the observed data for June and July 2003, we arrived at an estimated regression coefficient

βy|x ≈ 0.432

and this allows us to fill in the missing values in the table with their conditional expected values:

Date (2003)       July 1    July 2
TSE300 Return     −0.0003   0.0013
S&P500 Return     0.008     0.0116

TABLE 7.4
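The entries of Table 7.4 can be reproduced from the conditional expectation formula (7.33). In the sketch below, x_i, x_next are the observed S&P500 returns for July 1 and July 2, and S is the observed two-day TSE300 return; the value S = 0.0010 is an assumption inferred from the filled-in column, not stated explicitly in the text:

```python
def fill_missing_pair(S, x_i, x_next, beta_yx):
    """Conditional expectations of Y_i and Y_{i+1} given X_i, X_{i+1} and
    S = Y_i + Y_{i+1}.

    From (7.33): E[Y_i | X_i, X_{i+1}, S] = S/2 + (beta_yx/2)(X_i - X_{i+1}),
    and Y_{i+1} takes the complementary share of S.
    """
    y_i = S / 2.0 + (beta_yx / 2.0) * (x_i - x_next)
    return y_i, S - y_i

# S&P500 returns on July 1 and 2, 2003, with regression coefficient 0.432;
# S = 0.0010 is an assumed two-day TSE300 return consistent with Table 7.4
y1, y2 = fill_missing_pair(S=0.0010, x_i=0.008, x_next=0.0116, beta_yx=0.432)
```

Rounded to four decimals, the two conditional expectations match the TSE300 entries −0.0003 and 0.0013 in Table 7.4.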

Filling in the expected returns for all of the missing values for July 4 in the same way, but using the regression coefficient for X on Y of βx|y ≈ 1.045, we observe the following for the unknowns in Table 7.5.


Date (2003)                July 1    July 2    July 3     July 4   July 7   July 8
TSE300 (Expected) Return   −0.0003   0.0013    0.0014     0.003    Y7       Y8
S&P500 (Expected) Return   0.008     0.0116    −0.0082    X4       X7       0.0034

TABLE 7.5

X4 = (1/2)( 0.0188 + 1.045(0.003 − Y7) )
Y7 = (1/2)( 0.0124 + 0.432(X7 − 0.0034) )
X4 + X7 = 0.0188
Y7 + Y8 = 0.0124

and these four equations in four unknowns can be solved, giving

Y7 = −0.0634,  Y8 = 0.0758,  X4 = 0.0441,  X7 = −0.0253.

Assuming that this pair of returns follows a correlated geometric Brownian motion, then in order to estimate cov(X, Y) in the presence of complete data we would solve the following equations:

µ̂x = (1/n) Σ_{i=1}^{n} xi    (7.34)