Let us return to the example of pricing a call option. By some experimen-

tation, which could involve a preliminary crude simulation or simply evaluating

the function at various points, we discovered that the function

g(u) = 6[(u ’ .47)+ ]2 + (u ’ .47)+

provided a reasonable approximation to the function f (u). The two functions

are compared in Figure 4.4. Moreover, the integral 2 — 0.532 + 1 0.533 of the

2

function g(.) is easy to obtain.

It is obvious from the ¬gure that since f (u) ’ g(u) is generally much smaller

and less variable than is f (u), var[f (U ) ’ g(U )] < var(f (U )). The variance of

the crude Monte Carlo estimator is determined by the variability in the func-

tion f (u) over its full range. The variance of the control variate estimator is

determined by the variance of the di¬erence between the two functions, which

in this case is quite small. We used the following matlab functions, the ¬rst to

generate the function g(u) and the second to determine the e¬ciency gain of

the control variate estimator;

VARIANCE REDUCTION FOR ONE-DIMENSIONAL MONTE-CARLO INTEGRATION.227

% this is the function g(u), a control variate for fn(u)

function g=GG(u)

u=max(0,u-.47);

g=6*u.^2+u;

function [est,var1,var2]=control(f,g,intg,n)

% run using a statement like [est,var1,var2]=control(™fn™,™GG™,intg,n)

% runs a simulation on the function f using control variate g (both character

strings) n times.

R1

g(u)du

% intg is the integral of g % intg= 0

% outputs estimator est and variances var1,var2, variances with and without

control variate.

U=unifrnd(0,1,1,n);

% evaluates f (u) for vector u

FN=eval(strcat(f,™(U)™));

% evaluates g(u)

CN=eval(strcat(g,™(U)™));

est=intg+mean(FN-CN);

var1=var(FN);

var2=var(FN-CN);

Then the call [est,var1,var2]=control(™fn™,™GG™,2*(.53)^3+(.53)^2/2,1000000)

yields the estimate 0.4616 and variance=1.46 — 10’8 for an e¬ciency gain over

crude Monte Carlo of around 30.

This elementary form of control variate suggests using the estimator

Z n

1X

[f (Ui ) ’ g(Ui )]

g(u)du +

n i=1

but it may well be that g(U ) is not the best estimator we can imagine for f (U ).

We can often ¬nd a linear function of g(U ) which is better by using regression.

Since elementary regression yields

f (U ) ’ E(f (U )) = β(g(U ) ’ E(g(U ))) + ² (4.19)

228 CHAPTER 4. VARIANCE REDUCTION TECHNIQUES

where

cov(f (U ), g(U ))

(4.20)

β=

var(g(U ))

and the errors ² have expectation 0, it follows that E(f (U )) + ² = f (U ) ’

β[g(U ) ’ E(g(U ))] and sof (U ) ’ β[g(U ) ’ E(g(U ))] is an unbiased estimator

of E(f (U )). For a sample of n uniform random numbers this becomes

n

1X

ˆ [f (Ui ) ’ βg(Ui )]. (4.21)

θcv = βE(g(U )) +

n i=1

Moreover this estimator having smallest variance among all linear combina-

tions of f (U )and g(U ). Note that when β = 1 (4.21) reduces to the simpler

form of the control variate technique (4.17) discussed above. However, the lat-

ter is generally better in terms of maximizing e¬ciency. Of course in practice

it is necessary to estimate the covariance and the variances in the de¬nition of

β from the simulations themselves by evaluating f and g at many di¬erent

uniform random variables Ui , i = 1, 2, ..., n and then estimating β using the

standard least squares estimator

P P P

n n f (Ui )g(Ui ) ’ n f (Ui ) n g(Ui )

b i=1 P i=1 i=1

P

β= .

n 2 (U ) ’ ( n g(U ))2

n i=1 g i i

i=1

b

Although in theory the substitution of an estimator β for the true value β

results in a small bias in the estimator, for large numbers of simulations n our

b

estimator β is so close to the true value that this bias can be disregarded.

Importance Sampling.

A second technique that is similar is that of importance sampling. Again we

depend on having a reasonably simple function g that after muultiplication by

some constant, is similar to f. However, rather than attempt to minimize the

di¬erence f (u) ’ g(u) between the two functions, we try and ¬nd g(u) such

that f (u)/g(u) is nearly a constant. We also require that g is non-negative

VARIANCE REDUCTION FOR ONE-DIMENSIONAL MONTE-CARLO INTEGRATION.229

and can be integrated so that, after rescaling the function, it integrates to one,

i.e. it is a probability density function. Assume we can easily generate random

variables from the probability density function g(z). The distribution whose

probability density function is g(z), z ∈ [0, 1] is the importance distribution.

Note that if we generate a random variable Z having the probability density

function g(z), z ∈ [0, 1] then

Z Z 1

f (z)

f (u)du = g(z)dz

0 g(z)

¸

f (Z)

(4.22)

=E .

g(Z)

This can therefore be estimated by generating independent random variables Zi

with probability density function g(z) and then setting

n

1 X f (Zi )

ˆ (4.23)

θim = .

n i=1 g(Zi )

Once again, according to (4.22), this is an unbiased estimator and the variance

is

1 f (Z1 )

ˆ

var{θim } = }. (4.24)

var{

n g(Z1 )

Returning to our example, we might consider using the same function as

before for g(u). However, it is not easy to generate variates from a density

proportional to this function g by inverse transform since this would require

solving a cubic equation. Instead, let us consider something much simpler, the

density function g(u) = 2(0.53)’2 (u ’ .47)+ having cumulative distribution

function G(u) = (0.53)2 [(u ’ .47)+ ]2 and inverse cumulative distribution func-

√

tion G’1 (u) = 0.47 + 0.53 u. In this case we generate Zi using Zi = G’1 (Ui )

for Ui ∼ U nif orm[0, 1]. The following function simulates an importance sample

estimator:

function [est,v]=importance(f,g,Ginv,u)

230 CHAPTER 4. VARIANCE REDUCTION TECHNIQUES

%runs a simulation on the function ™f” using importance density ”g”(both character

strings) and inverse c.d.f. ”Ginverse”

% outputs all estimators (should be averaged) and variance.

% IM is the inverse cf of the importance distribution c.d.f.

% run e.g.

% [est,v]=importance(™fn™,™2*(IM-.47)/(.53)^2;™,™.47+.53*sqrt(u);™,rand(1,1000));

IM= eval(Ginv); %=.47+.53*sqrt(u);

%IMdens is the density of the importance sampling distribution at IM

IMdens=eval(g); %2*(IM-.47)/(.53)^2;

FN=eval(strcat(f,™(IM)™));

est=FN./IMdens; % mean(est) prrovides the estimator

v=var(FN./IMdens)/length(IM); % this is the variance of the estimator per sim-

ulation

The function was called with [est,v]=importance(™fn™,™2*(IM-.47)/(.53)^2;™,™.47+.53*sqrt(u);™,rand(1,

giving an estimate mean(est) = 0.4616 with variance 1 .28 — 10’8 for an

e¬ciency gain of around 35 over crude Monte Carlo.

Example 36 (Estimating Quantiles using importance sampling.) Suppose we

are able to generate random variables X from a probability density function of

the form

fθ (x)

and we wish to estimate a quantile such as VAR, i.e. estimate xp such that

Pθ0 (X · xp ) = p

for a certain value θ0 of the parameter.

As a very simple example suppose S is the sum of 10 independent random

variables having the exponential distribution with mean θ, and fθ (x1 , ..., x10 ) is

the joint probability density function of these 10 observations. Assume θ0 = 1

VARIANCE REDUCTION FOR ONE-DIMENSIONAL MONTE-CARLO INTEGRATION.231

and p = .999 so that we seek an extreme quantile of the sum, i.e. we want to

determine xp such that Pθ0 (S · xp ) = p. The equation that we wish to solve

for xp is

Eθ0 {I(S · xp )} = p. (4.25)

The crudest estimator of this is obtained by generating a large number of

independent observations of S under the parameter value θ0 = 1 and ¬nding

the p™th quantile, i.e. by de¬ning the empirical c.d.f.. We generate independent

random vectors Xi = (Xi1 , ..., Xi10 ) from the probability density fθ0 (x1 , ..., x10 )

P

and with Si = 10 Xij ,de¬ne

j=1

n

1X

b I(Si · x). (4.26)

F (x) =

n i=1

Invert it (possibly with interpolation) to estimate the quantile

b

xp = F ’1 (p).

c (4.27)

If the true cumulative distribution function is di¬erentiable, the variance of this

quantile estimator is asymptotically related to the variance of our estimator of

the cumulative distribution function,

b

var(F (xp ))

var(c) '

xp ,

(F 0 (xp ))2

so any variance reduction in the estimator of the c.d.f. us re¬‚ected, at least

asymptotically, in a variance reduction in the estimator of the quantile. Using

importance sampling (4.25) is equivalent to the same technique but with

n

X

bI (x) = 1 Wi I(Si · x) where (4.28)

F

n i=1

fθ0 (Xi1 , ..., Xi10 )

Wi =

fθ (Xi1 , ..., Xi10 )

b

Ideally we should choose the value of θ so that the variance of xp or of

Wi I(Si · xp )

232 CHAPTER 4. VARIANCE REDUCTION TECHNIQUES

is as small as possible. This requires a wise guess or experimentation with

various choices of θ. For a given θ we have another choice of empirical cumulative

distribution function

n

X

1

b

FI2 (x) = Pn Wi I(Si · x). (4.29)

i=1 Wi i=1

Both of these provide fairly crude estimates of the sample quantiles when obser-

vations are weighted and, as one does with the sample median, one could easily

interpolate between adjacent values around the value of xp .

The alternative (4.29) is motivated by the fact that the values Wi appear

as weights attached to the observations Si and it therefore seems reasonable to

divide by the sum of the weights. In fact the expected value of the denominator

is

n

X

Eθ { Wi } = n

i=1

so the two denominators are similar. In the example where the Xij are

independent exponential(1) let us examine the weight on Si determined by

Xi = (Xi1 , ..., Xi10 ),

10

fθ0 (Xi1 , ..., Xi10 ) Y exp(’Xij )

= θ10 exp{’Si (1 ’ θ’1 )}.

Wi = = ’1 exp(’X /θ)

fθ (Xi1 , ..., Xi10 ) θ ij

j=1

The renormalized alternative (4.29) might be necessary for estimating extreme

quantiles when the number of simulations is small but only the ¬rst provides

an completely unbiased estimating function. In our case, using (4.28) with

θ = 2.5 we obtained an estimator of F (x0.999 ) with e¬ciency about 180 times

that of a crude Monte Carlo simulation. There is some discussion of various

renormalizations of the importance sampling weights in Hesterberg(1995).

Importance Sampling, the Exponential Tilt and the Saddlepoint Ap-

proximation

When searching for a convenient importance distribution, particularly if we

wish to increase or decrease the frequency of observations in the tails, it is

VARIANCE REDUCTION FOR ONE-DIMENSIONAL MONTE-CARLO INTEGRATION.233

quite common to embed a given density in an exponential family. For example

suppose we wish to estimate an integral

Z

g(x)f (x)dx

where f (x) is a probability density function. Suppose K(s) denotes the cumu-

lant generating function (the logarithm of the moment generating function) of

the density f (x),i.e. if

Z

exs f (x)dx.

exp{K(s)} =

The cumulant generating function is a useful summary of the moments of a

distribution since the mean can be determined as K 0 (0) and the variance as

K 00 (0). From this single probability density function, we can now produce a

whole (exponential) family of densities

fθ (x) = eθx’K(θ) f (x) (4.30)

of which f (x) is a special case corresponding to θ = 0. The density (4.30) is often

referred to as an exponential tilt of the original density function and increases

the weight in the right tail for θ > 0, decreases it for θ < 0.

This family of densities is closely related to the saddlepoint approximation.

If we wish to estimate the value of a probability density function f (x) at a par-

ticular point x, then note that this could be obtained from (4.30) if we knew

the probability density function fθ (x). On the other hand a normal approxi-

mation to a density is often reasonable at or around its mode, particularly if

we are interested in the density of a sum or an average of independent random

variables. The cumulant generating function of the density fθ (x) is easily seen

to be K(θ + s) and the mean is therefore K 0 (θ). If we choose the parameter

θ = θ(x) so that

K 0 (θ) = x (4.31)

then the density fθ has mean x and variance K 00 (θ). How do we know for a given

value of x there exists a solution to (4.31)? From the properties of cumulant

234 CHAPTER 4. VARIANCE REDUCTION TECHNIQUES

generating functions, K(t) is convex, increasing and K(0) = 0. This implies

that as t increases, the slope of the cumulant generating function K 0 (t) is non-

decreasing. It therefore approaches a limit xmax (¬nite or in¬nite) as t ’ ∞ and

as long as we restrict the value of x in (4.31) to the interval x < xmax we can

¬nd a solution. The value of the N (x, K 00 (θ)) at the value x is

s

1

fθ (x) ≈

2πK 00 (θ)

and therefore the approximation to the density f (x) is

s

1

eK(θ)’θx .

f (x) ≈ (4.32)

00 (θ)

2πK

where θ = θ(x) satis¬es K 0 (θ) = x.

This is the saddlepoint approximation, discovered by Daniels (1954, 1980), and

usually applied to the distribution of sums or averages of independent random

variables because then the normal approximation is better motivated. Indeed,

the saddlepoint approximation to the distribution of the sum of n independent

identically distributed random variables is accurate to order O(n’1 ) and if we

renormalize it to integrate to one, accuracy to order O(n’3/2 ) is possible, sub-

stantially better than the order O(n’1/2 ) of of the usual normal approximation.

Consider, for example, the saddlepoint approximation to the Gamma(±, 1)

distribution. Because the moment generating function of the Gamma(±, 1) dis-

tribution is

1

m(t) = , t < 1,

(1 ’ t)±

the cumulant generating function is

K(t) = ln(m(t)) = ’± ln(1 ’ t),

±

K 0 (θ) = x implies θ(x) = 1 ’ and

x

x2

±

00 00

so that K (θ(x)) =

K (θ) = .

(1 ’ θ)2 ±

VARIANCE REDUCTION FOR ONE-DIMENSIONAL MONTE-CARLO INTEGRATION.235

Therefore the saddlepoint approximation to the probability density function is

r

± ±

f (x) ' exp{’± ln(±/x) ’ x(1 ’ )}

2

2πx x

r

1 1 ’± ± ±’1

= ±2 e x exp(’x).

2π

This is exactly the gamma density function with Stirling™s approximation replac-

ing “(±) and after renormalization this is exactly the Gamma density function.

Since it is often computationally expensive to generate random variables

whose distribution is a convolution of known densities, it is interesting to ask

whether (4.32) makes this any easier. In many cases the saddlepoint approxi-

mation can be used to generate a random variable whose distribution is close

to this convolution with high e¬ciency. For example suppose that we wish to

Pn

generate the random variable Sn = i=1 Xi where each random variable Xi

has the non-central chi-squared distribution with cumulant generating function

2»t p

’ ln(1 ’ 2t). (4.33)

K(t) =

1 ’ 2t 2

The parameter » is the non-centrality parameter of the distribution and p is

the degrees of freedom. Notice that the cumulant generating function of the sum

takes the same form but with (», p) replaced by (n», np) so in e¬ect we wish

to generate a random variable with cumulant generating function (4.33) for

large values of the parameters (», p). In stead we generate from the saddlepoint

approximation (4.32) to this distribution and in fact we do this indirectly. If we

change variable in (4.32) to determine the density of the new random variable

˜ which solves the equation

K 0 (˜) = X

then the saddlepoint approximation (4.32) is equivalent to specifying a proba-

bility density for this variable,

dx

f˜ (θ) = f (K 0 (θ))

dθ

p 0

= constant — K 00 (θ)eK(θ)’θK (θ) . (4.34)

236 CHAPTER 4. VARIANCE REDUCTION TECHNIQUES

In general, this probability density function can often be bounded above by

some density over the range of possible values of θ allowing us to generate ˜

by acceptance rejection. Then the value of the random variable is X = K 0 (˜).

In the particular case of the non-central chi-squared example above, swe may

take the dominating density to be the U [0, 1 ] density since (4.34) is bounded.

2

Combining Monte Carlo Estimators.

We have now seen a number of di¬erent variance reduction techniques and there

are many more possible. With many of these methods such as importance and

strati¬ed sampling are associated parameters which may be chosen in di¬erent

ways. The variance formula may be used as a basis of choosing a “best” method

but these variances and e¬ciencies must also estimated from the simulation and

it is rarely clear a priori which sampling procedure and estimator is best. For

example if a function f is monotone on [0, 1] then an antithetic variate can be

introduced with an estimator of the form

1

ˆ

θa1 = [f (U ) + f (1 ’ U )], U ∼ U [0, 1] (4.35)

2

1

but if the function is increasing to a maximum somewhere around and then

2

decreasing thereafter we might prefer

1

ˆ

θa2 = [f (U/2) + f ((1 ’ U )/2) + f ((1 + U )/2) + f (1 ’ U/2)]. (4.36)

4

Notice that any weighted average of these two unbiased estimators of θ would

also provide an unbiased estimator of θ. The large number of potential variance

reduction techniques is an embarrassment of riches. Which variance reduction

methods we should use and how will we know whether it is better than the

competitors? Fortunately, the answer is often to use “all of the methods” (within

reason of course); that choosing a single method is often neither necessary nor

desirable. Rather it is preferable to use a weighted average of the available

estimators with the optimal choice of the weights provided by regression.

VARIANCE REDUCTION FOR ONE-DIMENSIONAL MONTE-CARLO INTEGRATION.237

b

Suppose in general that we have k estimators or statistics θi , i = 1, ..k, all

b

unbiased estimators of the same parameter θ so that E(θi ) = θ for all i . In

b b

vector notation, letting ˜0 = (θ1 , ..., θk ), we write E(˜) = 1θ where 1 is the

k-dimensional column vector of ones so that 10 = (1, 1, ..., 1). Let us suppose for

the moment that we know the variance-covariance matrix V of the vector ˜,

de¬ned by

bb

Vij = cov(θi , θj ).

Theorem 37 (best linear combinations of estimators)

b

The linear combination of the θi which provides an unbiased estimator of θ and

has minimum variance among all linear unbiased estimators is

X

b b (4.37)

θblc = bi θi

i

where the vector b = (b1 , ..., bk )0 is given by

b = (1t V ’1 1)’1 V ’1 1.

The variance of the resulting estimator is

b

var(θblc ) = bt V b = 1/(1t V ’1 1)

Proof. The proof is straightforward. It is easy to see that for any linear

combination (4.37) the variance of the estimator is

bt V b

and we wish to minimize this quadratic form as a function of b subject to the

constraint that the coe¬cients add to one, or that

b0 1 =1.

Introducing the Lagrangian, we wish to set the derivatives with respect to

238 CHAPTER 4. VARIANCE REDUCTION TECHNIQUES

the components bi equal to zero

‚ 0

{bt V b + »(b 1’1)} = 0 or

‚b

2V b + »1= 0

b=constant — V ’1 1

and upon requiring that the coe¬cients add to one, we discover the value of the

constant above is (1t V ’1 1)’1 .

This theorem indicates that the ideal linear combination of estimators has

coe¬cients proportional to the row sums of the inverse covariance matrix. No-

b

tably, the variance of a particular estimator θi is an ingredient in that sum,

but one of many. In practice, of course, we almost never know the variance-

covariance matrix V of a vector of estimators ˜. However, when we do simula-

tion evaluating these estimators using the same uniform input to each, we obtain

independent replicated values of ˜. This permits us to estimate the covariance

matrix V and since we typically conduct many simulations this estimate can be

very accurate. Let us suppose that we have n simulated values of the vectors ˜,

and call these ˜1 , ..., ˜n . As usual we estimate the covariance matrix V using

the sample covariance matrix

n

1X

b (˜i ’ ˜)(˜i ’ ˜)0

V=

n ’ 1 i=1

where

n

1X

˜= ˜i .

n i=1

Let us return to the example and attempt to ¬nd the best combination of

the many estimators we have considered so far. To this end, let

VARIANCE REDUCTION FOR ONE-DIMENSIONAL MONTE-CARLO INTEGRATION.239

0.53

b [f (.47 + .53u) + f (1 ’ .53u)] an antithetic estimator,

θ1 =

2

0.37 0.16

b [f (.47 + .37u) + f (.84 ’ .37u)] + [f (.84 + .16u) + f (1 ’ .16u)],

θ2 =

2 2

b

θ3 = 0.37[f (.47 + .37u)] + 0.16[f (1 ’ .16u)], (strati¬ed-antithetic)

Z

b

θ4 = g(x)dx + [f (u) ’ g(u)], (control variate)

b ˆ

θ5 = θim , the importance sampling estimator (4.23).

b b b

Then θ2 , and θ3 are both strati¬ed-antithetic estimators, θ4 is a control

b

variate estimator and θ5 the importance sampling estimator discussed earlier, all

obtained from a single input uniform random variate U. In order to determine

the optimal linear combination we need to generate simulated values of all 5

estimators using the same uniform random numbers as inputs. We determine

the best linear combination of these estimators using

function [o,v,b,V]=optimal(U)

% generates optimal linear combination of ¬ve estimators and outputs

% average estimator, variance and weights

% input U a row vector of U[0,1] random numbers

T1=(.53/2)*(fn(.47+.53*U)+fn(1-.53*U));

T2=.37*.5*(fn(.47+.37*U)+fn(.84-.37*U))+.16*.5*(fn(.84+.16*U)+fn(1-.16*U));

T3=.37*fn(.47+.37*U)+.16*fn(1-.16*U);

intg=2*(.53)^3+.53^2/2;

T4=intg+fn(U)-GG(U);

T5=importance(™fn™,U);

X=[T1™ T2™ T3™ T4™ T5™]; % matrix whose columns are replications of the same

estimator, a row=5 estimators using same U

mean(X)

V=cov(X); % this estimates the covariance matrix V

240 CHAPTER 4. VARIANCE REDUCTION TECHNIQUES

on=ones(5,1);

V1=inv(V); % the inverse of the covariance matrix

b=V1*on/(on™*V1*on); % vector of coe¬cients of the optimal linear combination

o=mean(X*b); % vector of the optimal linear combinations

v=1/(on™*V1*on); % variance of the optimal linear combination based on

a single U

One run of this estimator, called with [o,v,b,V]= optimal(unifrnd(0,1,1,1000000))

yields

o = 0.4615

b0 = [’0.5499 1.4478 0.1011 0.0491 ’ 0.0481].

The estimate 0.4615 is accurate to at least four decimals which is not surprising

since the variance per uniform random number input is v = 1.13 — 10’5 . In

other words, the variance of the mean based on 1,000,000 uniform input is

1.13— 10’10 , the standard error is around .00001 so we can expect accuracy to at

least 4 decimal places. Note that some of the weights are negative and others are

greater than one. Do these negative weights indicate estimators that are worse

than useless? The e¬ect of some estimators may be, on subtraction, to render the

remaining function more linear and more easily estimated using another method

and negative coe¬cients are quite common in regression generally. The e¬ciency

gain over crude Monte Carlo is an extraordinary 40,000. However since there

are 10 function evaluations for each uniform variate input, the e¬ciency when

we adjust for the number of function evaluations is 4,000. This simulation

using 1,000,000 uniform random numbers and taking a 63 seconds on a Pentium

IV (2.4 GHz) (including the time required to generate all ¬ve estimators) is

equivalent to forty billion simulations by crude Monte Carlo, a major task on a

supercomputer!

If we intended to use this simulation method repeatedly, we might well wish

to see whether some of the estimators can be omitted without too much loss

VARIANCE REDUCTION FOR ONE-DIMENSIONAL MONTE-CARLO INTEGRATION.241

of information. Since the variance of the optimal estimator is 1/(1t V ’1 1), we

might use this to attempt to select one of the estimators for deletion. Notice that

it is not so much the covariance of the estimators V which enters into Theorem

35 but its inverse J = V ’1 which we can consider a type of information matrix

by analogy to maximum likelihood theory. For example we could choose to delete

the i0 th estimator, i.e. delete the i0 th row and column of V where i is chosen

PP

to have the smallest e¬ect on 1/(1t V ’1 1) or its reciprocal 1t J1 = i j Jij .

with the i0 th row and column

In particular, if we let V(i) be the matrix V

’1

deleted and J(i) = V(i) , then we can identify 1t J1 ’ 1t J(i) 1 as the loss of

information when the i0 th estimator is deleted. Since not all estimators have

the same number of function evaluations, we should adjust this information

by F E(i) =number of function evaluations required by the i0 th estimator. In

other words, if an estimator i is to be deleted, it should be the one corresponding

to

1t J1 ’ 1t J(i) 1

min{ }.

F E(i)

i

We should drop this i0 th estimator if the minimum is less than the information

per function evaluation in the combined estimator, because this means we will

increase the information available in our simulation per function evaluation.

In the above example with all ¬ve estimators included, 1t J1 = 88757 (with

10 function evaluations per uniform variate) so the information per function

evaluation is 8, 876.

1t J1’1t J(i) 1

1t J1 ’ 1t J(i) 1

i F E(i) F E(i)

1 88,048 2 44024

2 87,989 4 21,997

3 28,017 2 14,008

4 55,725 1 55,725

5 32,323 1 32,323

In this case, if we were to eliminate one of the estimators, our choice would

242 CHAPTER 4. VARIANCE REDUCTION TECHNIQUES

likely be number 3 since it contributes the least information per function eval-

uation. However, since all contribute more than 8, 876 per function evaluation,

we should likely retain all ¬ve.

Common Random Numbers.

We now discuss another variance reduction technique, closely related to anti-

thetic variates called common random numbers, used for example whenever

we wish to estimate the di¬erence in performance between two systems or any

other variable involving a di¬erence such as a slope of a function.

bb

Example 38 For a simple example suppose we have two estimators θ1 , θ2 of

the “center” of a symmetric distribution. We would like to know which of these

estimators is better in the sense that it has smaller variance when applied to

a sample from a speci¬c distribution symmetric about its median. If both esti-

mators are unbiased estimators of the median, then the ¬rst estimator is better

if

b b

var(θ1 ) < var(θ2 )

and so we are interested in estimating a quantity like

Eh1 (X) ’ Eh2 (X)

where X is a vector representing a sample from the distribution and h1 (X) =

b2 b2

θ1 , h2 (X) = θ2 . There are at least two ways of estimating these di¬erences;

1. Generate samples and hence values of h1 (Xi ), i = 1, ..., n and Eh2 (Xj ), j =

1, 2, ..., m independently and use the estimator

n m

1X 1X

h1 (Xi ) ’ h2 (Xj ).

n i=1 m j=1

2. Generate samples and hence values of h1 (Xi ), h2 (Xi ), i = 1, ..., n inde-

pendently and use the estimator

n

1X

(h1 (Xi ) ’ h2 (Xi )).

n i=1

VARIANCE REDUCTION FOR ONE-DIMENSIONAL MONTE-CARLO INTEGRATION.243

It seems intuitive that the second method is preferable since it removes

the variability due to the particular sample from the comparison. This is a

common type of problem in which we want to estimate the di¬erence between

two expected values. For example we may be considering investing in a new

piece of equipment that will speed up processing at one node of a network and

we wish to estimate the expected improvement in performance between the new

system and the old. In general, suppose that we wish to estimate the di¬erence

between two expectations, say

Eh1 (X) ’ Eh2 (Y ) (4.38)

where the random variable or vector X has cumulative distribution function FX

and Y has cumulative distribution function FY . Notice that the variance of a

Monte Carlo estimator

var[h1 (X) ’ h2 (Y )] = var[h1 (X)] + var[h2 (Y )] ’ 2cov{h1 (X), h2 (Y )} (4.39)

is small if we can induce a high degree of positive correlation between the gen-

erated random variables X and Y . This is precisely the opposite problem that

led to antithetic random numbers, where we wished to induce a high degree

of negative correlation. The following lemma is due to Hoe¬ding (1940) and

provides a useful bound on the joint cumulative distribution function of two

random variables X and Y. Suppose X, Y have cumulative distribution func-

tions FX (x) and FY (y) respectively and joint cumulative distribution function

G(x, y) = P [X · x, Y · y].

Lemma 39 (a) The joint cumulative distribution function G of (X, Y ) always

satis¬es

(FX (x) + FY (y) ’ 1)+ · G(x, y) · min(FX (x), FY (y)) (4.40)

for all x, y .

244 CHAPTER 4. VARIANCE REDUCTION TECHNIQUES

(b) Assume that FX and FY are continuous functions. In the case that

’1 ’1

X = FX (U ) and Y = FY (U ) for U uniform on [0, 1], equality is achieved

’1

on the right G(x, y) = min(FX (x), FY (y)). In the case that X = FX (U ) and

’1

Y = FY (1 ’ U ) there is equality on the left; (FX (x) + FY (y) ’ 1)+ = G(x, y).

Proof. (a) Note that

P [X · x, Y · y] · P [X · x] and similarly

· P [Y · y].

This shows that

G(x, y) · min(FX (x), FY (y)),

verifying the right side of (4.40). Similarly for the left side

P [X · x, Y · y] = P [X · x] ’ P [X · x, Y > y]

≥ P [X · x] ’ P [Y > y]

= FX (x) ’ (1 ’ FY (y))

= (FX (x) + FY (y) ’ 1).

Since it is also non-negative the left side follows.

’1 ’1

For (b) suppose X = FX (U ) and Y = FY (U ), then

’1 ’1

P [X · x, Y · y] = P [FX (U ) · x, FY (U ) · y]

= P [U · FX (x), U · FY (y)]

since P [X = x] = 0 and P [Y = y] = 0.

But

P [U · FX (x), U · FY (y)] = min(FX (x), FY (y))

verifying the equality on the right of (4.40) for common random numbers. By

VARIANCE REDUCTION FOR ONE-DIMENSIONAL MONTE-CARLO INTEGRATION.245

a similar argument,

’1 ’1

P [FX (U ) · x, FY (1 ’ U ) · y] = P [U · FX (x), 1 ’ U · FY (y)]

= P [U · FX (x), U ≥ 1 ’ FY (y)]

= (FX (x) ’ (1 ’ FY (y)))+

verifying the equality on the left.

The following theorem supports the use of common random numbers to

maximize covariance and antithetic random numbers to minimize covariance.

Theorem 40 (maximum/minimum covariance)

Suppose h1 and h2 are both non-decreasing (or both non-increasing) functions.

Subject to the constraint that X, Y have cumulative distribution functions FX , FY

respectively, the covariance

cov[h1 (X), h2 (Y )]

’1 ’1

is maximized when Y = FY (U ) and X = FX (U ) (i.e. for common uniform[0, 1]

’1 ’1

random numbers) and is minimized when Y = FY (U ) and X = FX (1 ’ U )

(i.e. for antithetic random numbers).

Proof. We will sketch a proof of the theorem when the distributions are

all continuous and h1 , h2 are di¬erentiable. De¬ne G(x, y) = P [X · x, Y · y].

The following representation of covariance is useful: de¬ne

H(x, y) = P (X > x, Y > y) ’ P (X > x)P (Y > y) (4.41)

= G(x, y) ’ FX (x)FY (y).

246 CHAPTER 4. VARIANCE REDUCTION TECHNIQUES

Notice that, using integration by parts,

Z∞Z∞

H(x, y)h0 (x)h0 (y)dxdy

1 2

’∞ ’∞

Z∞Z∞

‚

H(x, y)h1 (x)h0 (y)dxdy

=’ 2

’∞ ’∞ ‚x

Z∞Z∞

‚2

= H(x, y)h1 (x)h2 (y)dxdy

’∞ ’∞ ‚x‚y

Z∞Z∞ Z∞ Z ∞

h1 (x)h2 (y)g(x, y)dxdy ’

= h1 (x)fX (x)dx h2 (y)fY (y)dy

’∞ ’∞ ’∞ ’∞

(4.42)

= cov(h1 (X), h2 (Y ))

where g(x, y), fX (x), fY (y) denote the joint probability density function, the

probability density function of X and that of Y respectively. In fact this result

holds in general even without the assumption that the distributions are contin-

uous. The covariance between h1 (X) and h2 (Y ), for h1 and h2 di¬erentiable

functions, is

Z Z

∞ ∞

H(x, y)h0 (x)h0 (y)dxdy.

cov(h1 (X), h2 (Y )) = 1 2

’∞ ’∞

The formula shows that to maximize the covariance, if h1 , h2 are both increasing

or both decreasing functions, it is su¬cient to maximize H(x, y) for each x, y

since h0 (x), h0 (y) are both non-negative. Since we are constraining the mar-

1 2

ginal cumulative distribution functions FX , FY , this is equivalent to maximizing

G(x, y) subject to the constraints

lim G(x, y) = FX (x)

y’∞

lim G(x, y) = FY (y).

x’∞

Lemma 37 shows that the maximum is achieved when common random numbers

are used and the minimum achieved when we use antithetic random numbers.

We can argue intuitively for the use of common random numbers in the case

of a discrete distribution with probability on the points indicated in Figure 4.5.

VARIANCE REDUCTION FOR ONE-DIMENSIONAL MONTE-CARLO INTEGRATION.247

This ¬gure corresponds to a joint distribution with the following probabilities,

say

0 0.25 0.25 0.75 0.75 1

x

0 0.25 0.75 0.25 0.75 1

y

.1 .2 .2 .1 .2 .2

P [X = x, Y = y]

Suppose we wish to maximize P [X > x, Y > y] subject to the constraint that

the probabilities P [X > x] and P [Y > y] are ¬xed. We have indicated

arbitrary ¬xed values of (x, y) in the ¬gure. Note that if there is any weight

attached to the point in the lower right quadrant (labelled “P2 ”), some or all

of this weight can be reassigned to the point P3 in the lower left quadrant

provided there is an equal movement of weight from the upper left P4 to the

upper right P1 . Such a movement of weight will increase the value of G(x, y)

without a¬ecting P [X · x] or P [Y · y]. The weight that we are able to transfer

in this example is 0.1, the minimum of the weights on P4 and P2 . In general,

this continues until there is no weight in one of the o¬-diagonal quadrants for

every choice of (x, y). The resulting distribution in this example is given by

0 0.25 0.25 0.75 0.75 1

x

0 0.25 0.75 0.25 0.75 1

y

.1 .3 0 .1 .3 .2

P [X = x, Y = y]

and it is easy to see that such a joint distribution can be generated from common

’1 ’1

random numbers X = FX (U ), Y = FY (U ).

Conditioning

We now consider a simple but powerful generalization of control variates. Sup-

pose that we can decompose a random variable T into two components T1 , µ

(4.43)

T = T1 + µ

so that T1 , µ are uncorrelated

cov(T1 , µ) = 0.

248 CHAPTER 4. VARIANCE REDUCTION TECHNIQUES

Figure 4.5: Changing weights on points to maximize covariance

Assume as well that E(µ) = 0. Regression is one method for determining such

a decomposition and the error term µ in regression satis¬es these conditions.

Then T1 has the same mean as T and it is easy to see that

var(T ) = var(T1 ) + var(µ)

so T1 as smaller variance than T (unless µ = 0 with probability 1). This means

that if we wish to estimate the common mean of T or T1 , the estimator T1 is

preferable, since it has the same mean with smaller variance.

One special case is variance reduction by conditioning. For the standard

de¬nition and properties of conditional expectation see the appendix. One com-

mon de¬nition of E[X|Y ] is the unique (with probability one) function g(y)

of Y which minimizes E{X ’ g(Y )}2 . This de¬nition only applies to random

variables X which have ¬nite variance and so this de¬nition requires some mod-

i¬cation when E(X 2 ) = ∞, but we will assume here that all random variables,

say X, Y, Z have ¬nite variances. We can de¬ne conditional covariance using

conditional expectation as

cov(X, Y |Z) = E[XY |Z] ’ E[X|Z]E[Y |Z]

VARIANCE REDUCTION FOR ONE-DIMENSIONAL MONTE-CARLO INTEGRATION.249

and conditional variance:

var(X|Z) = E(X 2 |Z) ’ (E[X|Z])2 .

The variance reduction through conditioning is justi¬ed by the following well-

known result:

Theorem 41 (a)E(X) = E{E[X|Y ]}

(b) cov(X, Y ) = E{cov(X, Y |Z)} + cov{E[X|Z], E[Y |Z]}

(c) var(X) = E{var(X|Z)} + var{E[X|Z]}

This theorem is used as follows. Suppose we are considering a candidate

ˆ

estimator θ, an unbiased estimator of θ. We also have an arbitrary random

ˆ

variable Z which is somehow related to θ. Suppose that we have chosen Z

carefully so that we are able to calculate the conditional expectation T1 =

ˆ

E[θ|Z]. Then by part (a) of the above Theorem, T1 is also an unbiased estimator

of θ. De¬ne

ˆ

µ = θ ’ T1 .

By part (c),

ˆ

var(θ) = var(T1 ) + var(µ)

ˆ ˆ

and var(T1 ) = var(θ) ’ var(µ) < var(θ). In other words, for any variable Z,

ˆ ˆ

E[θ|Z] has the same expectation as does θ but smaller variance and the decrease

ˆ

in variance is largest if Z and θ are nearly independent, because in this case

ˆ

E[θ|Z] is close to a constant and its variance close to zero. In general the

search for an appropriate Z so as to reducing the variance of an estimator by

conditioning requires searching for a random variable Z such that:

ˆ

1. the conditional expectation E[θ|Z] with the original estimator is com-

putable

ˆ ˆ

2. var(E[θ|Z]) is substantially smaller than var(θ).

250 CHAPTER 4. VARIANCE REDUCTION TECHNIQUES

Figure 4.6: Example of the Hit and Miss Method

Example 42 (hit or miss)

Suppose we wish to estimate the area under a certain graph f (x) by the hit

and miss method. A crude method would involve determining a multiple c of a

probability density function g(x) which dominates f (x) so that cg(x) ≥ f (x) for

all x. We can generate points (X, Y ) at random and uniformly distributed under

the graph of cg(x) by generating X by inverse transform X = G’1 (U1 ) where

G(x) is the cumulative distribution function corresponding to density g and

then generating Y from the Uniform[0, cg(X)] distribution, say Y = cg(X)U2 .

An example, with g(x) = 2x, 0 < x < 1 and c = 1/4 is given in Figure 4.6.

The hit and miss estimator of the area under the graph of f obtains by

generating such random points (X, Y ) and counting the proportion that fall

under the graph of g, i.e. for which Y · f (X). This proportion estimates the

probability

area under f (x)

P [Y · f (X)] =

area under cg(x)

area under f (x)

=

c

VARIANCE REDUCTION FOR ONE-DIMENSIONAL MONTE-CARLO INTEGRATION.251

since g(x) is a probability density function. Notice that if we de¬ne

§

⎨ c if Y · f (X)

W=

© 0 if Y > f (X)

then

area under f (x)

E(W ) = c —

area under cg(x)

= area under f (x)

so W is an unbiased estimator of the parameter that we wish to estimate. We

might therefore estimate the area under f (x) using a Monte Carlo estimator

Pn

ˆ 1

θHM = n i=1 Wi based on independent values of Wi .This is the “hit-or-miss”

estimator. However, in this case it is easy to ¬nd a random variable Z such

that the conditional expectation E(Z|W ) can be determined in closed form. In

fact we can choose Z = X, we obtain

f (X)

E[W |X] = .

g(X)

This is therefore an unbiased estimator of the same parameter and it has smaller

variance than does W. For a sample of size n we should replace the crude esti-

ˆ

mator θcr by the estimator

n

1 X f (Xi )

ˆ

θCond =

n i=1 g(Xi )

n

1 X f (Xi )

=

n i=1 2Xi

√

with Xi generated from X = G’1 (Ui ) = Ui , i = 1, 2, ..., n and Ui ∼ Uni-

form[0,1]. In this case, the conditional expectation results in a familiar form for

ˆ

the estimator θCond . This is simply an importance sampling estimator with g(x)

the importance distribution. However, this derivation shows that the estimator

ˆ ˆ

θCond has smaller variance than θHM .

252 CHAPTER 4. VARIANCE REDUCTION TECHNIQUES

Problems

1. Use both crude and antithetic random numbers to integrate the function

Z 1

eu ’ 1

du.

e’1

0

(a) What is the e¬ciency gain attributed to the use of antithetic random

numbers?

(b) How large a sample size would I need, using antithetic and crude

Monte Carlo, in order to estimate the above integral, correct to four

decimal places, with probability at least 95%?

2. Under what conditions on f does the use of antithetic random numbers

completely correct for the variability of the Monte-Carlo estimator? i.e.

When is var(f (U ) + f (1 ’ U )) = 0?

3. Suppose that F (x) is the normal(µ, σ 2 ) cumulative distribution function,

Prove that F ’1 (1 ’ U ) = 2µ ’ F ’1 (U ) and therefore, if we use antithetic

random numbers to generate two normal random variables X1 , X2 , having

and variance σ 2 , this is equivalent to setting X2 = 2µ ’ X1 .

mean µ

In other words, if we wish to use antithetic random numbers for normal

variates, it is not necessary to generate the normal random variables using

the inverse transform method.

4. Show that the variance of a weighted average

var(±X + (1 ’ ±)W )

is minimized over ± when

var(W ) ’ cov(X, W )

±=

var(W ) + var(X) ’ 2cov(X, W )

Determine the resulting minimum variance. What if the random variables

X, W are independent?

PROBLEMS 253

5. Use a strati¬ed random sample to integrate the function

Z1 u

e ’1

du.

0 e’1

What do you recommend for intervals (two or three) and sample sizes?

What is the e¬ciency gain?

6. Use a combination of strati¬ed random sampling and an antithetic random

number in the form

1

[f (U/2) + f (1 ’ U/2)]

2

to integrate the function

Z 1

eu ’ 1

du.

e’1

0

What is the e¬ciency gain?

ex ’1

7. In the case f (x) = e’1 , use g(x) = x as a control variate to integrate over

[0,1]. Show that the variance is reduced by a factor of approximately 60.

Is there much additional improvement if we use a more general quadratic

function of x?

8. The second version of the control variate Monte-Carlo estimator

n

1X

b {f (Ui ) ’ β[g(Ui ) ’ E(g(Ui ))]}

θcv =

n i=1

is an improved control variate estimator, is equivalent to the ¬rst version

ex ’1

in the case β = 1. In the case f (x) = e’1 , consider using g(x) = x as a

b

control variate to integrate over [0,1]. Determine how much better θcv is

than the basic control variate (β = 1) by performing simulations. Show

that the variance is reduced by a factor of approximately 60 over crude

Monte Carlo. Is there much additional improvement if we use a more

general quadratic function of x for g(x).

9. It has been suggested that stocks are not log-normally distributed but the

distribution can be well approximated by replacing the normal distribu-

tion by a student t distribution. Suppose that the daily returns Xi are

254 CHAPTER 4. VARIANCE REDUCTION TECHNIQUES

independent with probability density function f (x) = c(1 + (x/b)2 )’2 (the

re-scaled student distribution with 3 degrees of freedom). We wish to es-

P

timate a weekly Value at Risk, V ar.95 , a value ev such that P [ 5 Xi <

i=1

v] = 0.95. If we wish to do this by simulation, suggest an appropriate

method involving importance sampling. Implement and estimate the vari-

ance reduction.

10. Suppose three di¬erent simulation estimators Y1 , Y2 , Y3 have means which

depend on two unknown parameters θ1 , θ2 so that Y1 , Y2 , Y3 , are unbi-

ased estimators of θ1 , θ1 + θ2 , θ2 respectively. Assume that var(Yi ) =

1, cov(Yi , Yj ) = ’1/2 an we want to estimate the parameter θ1 . Should

we use only the estimator Y1 which is the unbiased estimator of θ1 , or

some linear combination of Y1 , Y2 , Y3 ? Compare the number of simula-

tions necessary for a certain degree of accuracy.

ex ’1

11. In the case f (x) = e’1 , use g(x) = x as a control variate to integrate

over [0,1]. Find the optimal linear combination using estimators (4.35) and

(4.36), an importance sampling estimator and the control variate estimator

above. What is the e¬ciency gain over crude Monte-Carlo?

12. Show that the Jacobian of the transformation used in the proof of Theorem

√1

23; (x, m) ’ (x, y) where y = exp(’(2m’x)2 /2) is given by .

2y ’2 ln(y)

Chapter 5

Simulating the Value of

Options

Asian Options

An Asian option, at expiration T, has value determined not by the closing price

of the underlying asset as for a European option, but on an average price of

the asset over an interval. For example a discretely sampled Asian call op-

tion on an asset with price process S(t) pays an amount on maturity equal

Pk

¯ ¯ 1

to max(0, Sk ’ K) where Sk = k i=1 S(iT /k) is the average asset price at k

equally spaced time points in the time interval (0, T ). Here, k depends on the

frequency of sampling (e.g. if T = .25 (years) and sampling is weekly, then

¯

k = 13. If S(t) follows a geometric Brownian motion, then Sk is the sum of

lognormally distributed random variables and the distribution of the sum or av-

erage of lognormal random variables is very di¬cult to express analytically. For

this reason we will resort to pricing the Asian option using simulation. Notice,

however that in contrast to the arithmetic average, the distribution of the geo-

metric average has a distribution which can easily be obtained. The geometric

255

256 CHAPTER 5. SIMULATING THE VALUE OF OPTIONS

Pn

1

mean of n values X1 , ..., Xn is (X1 X2 ...Xn )1/n = exp{ n ln(Xi )} and if the

i=1

random variables Xn were each lognormally distributed then this results adding

the normally distributed random variables ln(Xi ) in the exponent, a much more

P

familiar operation. In fact the sum in the exponent n n ln(Xi ) is normally

1

i=1

distributed so the geometric average will have a lognormal distribution.

Our objective is to determine the value of the Asian option E(V1 ) with

¯

V1 = e’rT max(0, Sk ’ K)

Since we expect geometric means to be close to arithmetic means, a reasonable

˜ ˜

control variate is the random variable V2 = e’rT max(0, Sk ’ K) where Sk =

Qk

{ i=1 S(iT /k)}1/k is the geometric mean. Assume that V1 and V2 obtain from

the same simulation and are therefore possibly correlated. Of course V2 is only

useful as a control variate if its expected value can be determined analytically

or numerically more easily than that of V1 but in view of the fact that V2

has a known lognormal distribution, the prospects of this are excellent. Since

S(t) = S0 eY (t) where Y (t) is a Brownian motion with Y (0) = 0, drift r ’ σ 2 /2

˜

and di¬usion σ, it follows that Sk has the same distribution as does

k

1X

(5.1)

S0 exp{ Y (iT /k)}.

k i=1

The exponent is a weighted average of the independent normal increments of

the process and therefore normally distributed. In particular if we set

k

1X

¯

Y= Y (iT /k)

k i=1

1

[k(Y (T /k)) + (k ’ 1){Y (2T /k) ’ Y (T /k)} + (k ’ 2){Y (3T /k) ’ Y (2T /k)}

=

k

+ ... + {Y (T ) ’ Y ((k ’ 1)T /k)}],

ASIAN OPTIONS 257

¯

then we can ¬nd the mean and variance of Y ,

k

X

2

¯ ) = r ’ σ /2

E(Y iT /k

k i=1

σ2 k + 1

= (r ’ ) T

2 2k

e

= µT, say,