where θ = θ(x) satisfies K′(θ) = x.

This is the saddlepoint approximation, discovered by Daniels (1954, 1980), and usually applied to the distribution of sums or averages of independent random variables because then the normal approximation is better motivated. Indeed, the saddlepoint approximation to the distribution of the sum of n independent identically distributed random variables is accurate to order O(n⁻¹), and if we renormalize it to integrate to one, accuracy to order O(n⁻³ᐟ²) is possible, substantially better than the order O(n⁻¹ᐟ²) of the usual normal approximation.

Consider, for example, the saddlepoint approximation to the Gamma(α,1) distribution. In this case the cumulant generating function is

K(t) = −α ln(1 − t),

K′(θ) = x implies θ(x) = 1 − α/x, and

K″(θ(x)) = x²/α.

Therefore the saddlepoint approximation to the probability density function is

f(x) ≈ √(α/(2πx²)) exp{α ln(x/α) − x(1 − α/x)}
     = (1/√(2π)) α^(1/2−α) e^α x^(α−1) exp(−x).

This is exactly the gamma density function with Stirling's approximation replacing Γ(α); after renormalization it is exactly the Gamma density function.

In many cases the saddlepoint approximation is used to estimate a probability or a moment of a distribution, and it is of interest to estimate the error in this approximation. For example, suppose that we wish, for some function h, to estimate the error in the saddlepoint approximation to E(h(Sₙ)) where

VARIANCE REDUCTION FOR ONE-DIMENSIONAL MONTE-CARLO INTEGRATION.243

Sₙ = Σ_{i=1}^n Xᵢ and each random variable Xᵢ has the non-central chi-squared distribution with cumulant generating function

K(t) = 2λt/(1 − 2t) − (p/2) ln(1 − 2t).

The parameter λ is the non-centrality parameter of the distribution and p is the degrees of freedom. Notice that the cumulant generating function of the sum takes the same form but with (λ, p) replaced by (nλ, np). If f(x) is the actual probability density function of this sum and f_s is the saddlepoint approximation to the same density, then the error in the saddlepoint approximation is

∫₀^∞ h(x)f(x)dx = ∫₀^∞ h(x)f_s(x)dx − ∫₀^∞ h(x)(f_s(x) − f(x))dx.     (4.33)

The right hand side of (4.33) can be approximated with a relatively small Monte Carlo sample since the differences f_s(x) − f(x) appearing in it are typically small.

In this case, we might use importance sampling: since it is easy to generate from the distribution of Sₙ, we could simply average simulated values of

h(Sₙ)(f_s(Sₙ)/f(Sₙ) − 1).
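As a checkable sketch of the correction in (4.33), the Gamma example from earlier in this section is convenient, because there both f and f_s are available in closed form (for the noncentral chi-squared of the text one would substitute its density and saddlepoint approximation instead). The names and the choices a = 4, h(x) = x² below are illustrative assumptions:

```python
import math, random

random.seed(1)
a = 4.0                      # the "sum" S ~ Gamma(a, 1); exact E h(S) = a(a+1) for h(x) = x^2
h = lambda x: x * x

def f(x):                    # exact density of S
    return x ** (a - 1) * math.exp(-x) / math.gamma(a)

def fs(x):                   # saddlepoint approximation from the previous page
    return math.sqrt(a / (2 * math.pi * x * x)) * math.exp(a * math.log(x / a) - x + a)

# Integral of h(x) fs(x) dx: here fs is a constant multiple of f, so it is closed form.
const = math.gamma(a) * a ** (0.5 - a) * math.exp(a) / math.sqrt(2 * math.pi)
sp_integral = const * a * (a + 1)

# Monte Carlo estimate of the correction: average h(S)(fs(S)/f(S) - 1), as in the text.
n = 20000
S = [random.gammavariate(a, 1.0) for _ in range(n)]
correction = sum(h(s) * (fs(s) / f(s) - 1.0) for s in S) / n
estimate = sp_integral - correction
print(estimate)              # close to the exact value a(a+1) = 20
```

Because the ratio f_s/f varies little (here not at all), the correction term has small variance, so a modest sample already pins down E h(Sₙ) accurately.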

Since it is often computationally expensive to generate random variables whose distribution is a convolution of known densities, it is interesting to ask whether (4.32) makes this any easier. In general, K(t) is convex with K(0) = 0, and λ(x) = θ(x) − K(θ(x))/x is an increasing function of x. This is because, upon differentiation, we obtain

λ′(x) = θ′(x)(1 − K′(θ(x))/x) + K(θ(x))/x² = K(θ(x))/x²,

since K′(θ(x)) = x.

Combining Monte Carlo Estimators.

We have now seen a number of different variance reduction techniques and there are many more possible. With many of these methods, such as importance sampling and

244 CHAPTER 4. VARIANCE REDUCTION TECHNIQUES

stratified sampling, are associated parameters which may be chosen in different ways. The variance formula may be used as a basis for choosing a "best" method, but these variances and efficiencies must also be estimated from the simulation, and it is rarely clear a priori which sampling procedure and estimator is best. For example, if a function f is monotone on [0, 1], then an antithetic variate can be introduced with an estimator of the form

θ̂_a1 = (1/2)[f(U) + f(1 − U)],   U ∼ U[0, 1]     (4.34)

but if the function is increasing to a maximum somewhere around 1/2 and then decreasing thereafter we might prefer

θ̂_a2 = (1/4)[f(U/2) + f((1 − U)/2) + f((1 + U)/2) + f(1 − U/2)].     (4.35)

Notice that any weighted average of these two unbiased estimators of θ would also provide an unbiased estimator of θ. The large number of potential variance reduction techniques is an embarrassment of riches. Which variance reduction method should we use, and how will we know whether it is better than the competitors? Fortunately, the answer is often to use "all of the methods" (within reason, of course); choosing a single method is often neither necessary nor desirable. Rather it is preferable to use a weighted average of the available estimators, with the optimal choice of the weights provided by regression.

Suppose in general that we have k estimators or statistics θ̂ᵢ, i = 1, ..., k, all unbiased estimators of the same parameter θ, so that E(θ̂ᵢ) = θ for all i. In vector notation, letting Θ′ = (θ̂₁, ..., θ̂ₖ), we write E(Θ) = 1θ, where 1 is the k-dimensional column vector of ones, so that 1′ = (1, 1, ..., 1). Let us suppose for the moment that we know the variance-covariance matrix V of the vector Θ, defined by

Vᵢⱼ = cov(θ̂ᵢ, θ̂ⱼ).

Theorem 37 (best linear combinations of estimators)

The linear combination of the θ̂ᵢ which provides an unbiased estimator of θ and has minimum variance among all linear unbiased estimators is

θ̂_blc = Σᵢ bᵢ θ̂ᵢ     (4.36)

where the vector b = (b₁, ..., bₖ)′ is given by

b = (1′V⁻¹1)⁻¹ V⁻¹1.

The variance of the resulting estimator is

var(θ̂_blc) = b′Vb = 1/(1′V⁻¹1).

Proof. The proof is straightforward. It is easy to see that for any linear combination (4.36) the variance of the estimator is

b′Vb

and we wish to minimize this quadratic form as a function of b subject to the constraint that the coefficients add to one, or that

b′1 = 1.

Introducing the Lagrangian, we wish to set the derivatives with respect to the components bᵢ equal to zero:

(∂/∂b){b′Vb + λ(b′1 − 1)} = 0, or

2Vb + λ1 = 0,
b = constant × V⁻¹1,

and upon requiring that the coefficients add to one, we discover the value of the constant above is (1′V⁻¹1)⁻¹.
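In matrix form the computation of Theorem 37 is only a few lines. The following sketch (the function name and the 2×2 covariance matrix are illustrative assumptions) forms b = V⁻¹1/(1′V⁻¹1) and the minimum variance 1/(1′V⁻¹1):

```python
import numpy as np

def best_linear_combination(V):
    """Return (b, v): weights b = V^{-1}1 / (1'V^{-1}1) minimizing b'Vb
    subject to b'1 = 1, and the minimum variance v = 1/(1'V^{-1}1)."""
    one = np.ones(V.shape[0])
    w = np.linalg.solve(V, one)     # V^{-1} 1, without forming the inverse explicitly
    s = one @ w                     # 1' V^{-1} 1
    return w / s, 1.0 / s

# Two unbiased estimators with variances 1 and 4 and covariance 0.5
V = np.array([[1.0, 0.5], [0.5, 4.0]])
b, v = best_linear_combination(V)
print(b, v)                         # weights sum to 1; v is below both individual variances
```

Note that the combined variance (0.9375 here) is smaller than the variance of even the better individual estimator, as the theorem guarantees.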

This theorem indicates that the ideal linear combination of estimators has coefficients proportional to the row sums of the inverse covariance matrix. Notably, the variance of a particular estimator θ̂ᵢ is an ingredient in that sum,


but one of many. In practice, of course, we almost never know the variance-covariance matrix V of a vector of estimators Θ. However, when we do simulation, evaluating these estimators using the same uniform input to each, we obtain independent replicated values of Θ. This permits us to estimate the covariance matrix V, and since we typically conduct many simulations this estimate can be very accurate. Let us suppose that we have n simulated values of the vectors Θ, and call these Θ₁, ..., Θₙ. As usual we estimate the covariance matrix V using the sample covariance matrix

V̂ = (1/(n − 1)) Σ_{i=1}^n (Θᵢ − Θ̄)(Θᵢ − Θ̄)′

where

Θ̄ = (1/n) Σ_{i=1}^n Θᵢ.

Let us return to the example and attempt to find the best combination of the many estimators we have considered so far. To this end, let

θ̂₁ = (0.53/2)[f(.47 + .53u) + f(1 − .53u)], an antithetic estimator,

θ̂₂ = (0.37/2)[f(.47 + .37u) + f(.84 − .37u)] + (0.16/2)[f(.84 + .16u) + f(1 − .16u)],

θ̂₃ = 0.37 f(.47 + .37u) + 0.16 f(1 − .16u), (stratified-antithetic)

θ̂₄ = ∫ g(x)dx + [f(u) − g(u)], (control variate)

θ̂₅ = θ̂_im, the importance sampling estimator (4.23).

Then θ̂₂ and θ̂₃ are both stratified-antithetic estimators, θ̂₄ is a control variate estimator and θ̂₅ the importance sampling estimator discussed earlier, all obtained from a single input uniform random variate U. In order to determine the optimal linear combination we need to generate simulated values of all 5 estimators using the same uniform random numbers as inputs. We determine the best linear combination of these estimators using


function [o,v,b,V]=optimal(U)
% generates optimal linear combination of five estimators and outputs
% average estimator, variance and weights
% input U a row vector of U[0,1] random numbers
T1=(.53/2)*(fn(.47+.53*U)+fn(1-.53*U));
T2=.37*.5*(fn(.47+.37*U)+fn(.84-.37*U))+.16*.5*(fn(.84+.16*U)+fn(1-.16*U));
T3=.37*fn(.47+.37*U)+.16*fn(1-.16*U);
intg=2*(.53)^3+.53^2/2;   % integral of the control variate g
T4=intg+fn(U)-GG(U);
T5=importance('fn',U);
X=[T1' T2' T3' T4' T5'];  % columns are replications of one estimator; each row holds the 5 estimators evaluated at the same U
mean(X)                   % display the mean of each individual estimator
V=cov(X);                 % this estimates the covariance matrix V
on=ones(5,1);
V1=inv(V);                % the inverse of the covariance matrix
b=V1*on/(on'*V1*on);      % coefficients of the optimal linear combination
o=mean(X*b);              % the optimal linear combination estimate
v=1/(on'*V1*on);          % variance of the optimal combination based on a single U

One run of this estimator, called with [o,v,b,V] = optimal(unifrnd(0,1,1,1000000)), yields

o = 0.4615
b′ = [−0.5499  1.4478  0.1011  0.0491  −0.0481].

The estimate 0.4615 is accurate to at least four decimals, which is not surprising since the variance per uniform random number input is v = 1.13 × 10⁻⁵. In other words, the variance of the mean based on 1,000,000 uniform inputs is


1.13 × 10⁻¹¹, and the standard error is around 0.000003, so we can expect accuracy to at least 4 decimal places. Note that some of the weights are negative and others are

greater than one. Do these negative weights indicate estimators that are worse than useless? Not necessarily: the effect of some estimators may be, on subtraction, to render the remaining function more linear and more easily estimated using another method, and negative coefficients are quite common in regression generally. The efficiency gain over crude Monte Carlo is an extraordinary 40,000. However, since there are 10 function evaluations for each uniform variate input, the efficiency when we adjust for the number of function evaluations is 4,000. This simulation, using 1,000,000 uniform random numbers and taking 63 seconds on a Pentium IV (2.4 GHz) (including the time required to generate all five estimators), is equivalent to forty billion simulations by crude Monte Carlo, a major task on a supercomputer!

If we intended to use this simulation method repeatedly, we might well wish to see whether some of the estimators can be omitted without too much loss of information. Since the variance of the optimal estimator is 1/(1′V⁻¹1), we might use this to attempt to select one of the estimators for deletion. Notice that it is not so much the covariance matrix V of the estimators which enters into Theorem 37 as its inverse J = V⁻¹, which we can consider a type of information matrix by analogy to maximum likelihood theory. For example, we could choose to delete the i-th estimator, i.e. delete the i-th row and column of V, where i is chosen to have the smallest effect on 1/(1′V⁻¹1) or its reciprocal 1′J1 = Σᵢ Σⱼ Jᵢⱼ.

In particular, if we let V₍ᵢ₎ be the matrix V with the i-th row and column deleted and J₍ᵢ₎ = V₍ᵢ₎⁻¹, then we can identify 1′J1 − 1′J₍ᵢ₎1 as the loss of information when the i-th estimator is deleted. Since not all estimators have the same number of function evaluations, we should adjust this information by FE(i) = the number of function evaluations required by the i-th estimator. In other words, if an estimator i is to be deleted, it should be the one corresponding


to

min_i { (1′J1 − 1′J₍ᵢ₎1) / FE(i) }.

We should drop this i-th estimator if the minimum is less than the information per function evaluation in the combined estimator, because this means we will increase the information available in our simulation per function evaluation. In the above example with all five estimators included, 1′J1 = 88,757 (with 10 function evaluations per uniform variate), so the information per function evaluation is 8,876.

i    1′J1 − 1′J₍ᵢ₎1    FE(i)    (1′J1 − 1′J₍ᵢ₎1)/FE(i)
1    88,048            2        44,024
2    87,989            4        21,997
3    28,017            2        14,008
4    55,725            1        55,725
5    32,323            1        32,323

In this case, if we were to eliminate one of the estimators, our choice would likely be number 3 since it contributes the least information per function evaluation. However, since all contribute more than 8,876 per function evaluation, we should likely retain all five.
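The deletion criterion above is mechanical to compute once an estimated covariance matrix is in hand. A sketch (the covariance matrix and function-evaluation counts below are hypothetical, not the ones from the example):

```python
import numpy as np

def information_loss_per_fe(V, fe):
    """For each estimator i, compute (1'J1 - 1'J_(i)1)/FE(i), where J = V^{-1}
    and J_(i) is the inverse of V with the i-th row and column deleted."""
    k = V.shape[0]
    total = np.ones(k) @ np.linalg.solve(V, np.ones(k))     # 1' J 1
    losses = []
    for i in range(k):
        keep = [j for j in range(k) if j != i]
        Vi = V[np.ix_(keep, keep)]                          # delete row and column i
        sub = np.ones(k - 1) @ np.linalg.solve(Vi, np.ones(k - 1))
        losses.append((total - sub) / fe[i])                # information loss per FE
    return total, losses

# Hypothetical covariance matrix and per-estimator function-evaluation counts
V = np.array([[2.0, 0.3, 0.1],
              [0.3, 1.0, 0.2],
              [0.1, 0.2, 0.5]])
fe = [2, 4, 1]
total, losses = information_loss_per_fe(V, fe)
print(total, losses)   # candidate for deletion: the argmin of losses
```

Each loss is non-negative (removing an estimator cannot increase the total information 1′J1), and the rule of the text compares the smallest loss per function evaluation with the combined estimator's information per function evaluation.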

Common Random Numbers.

We now discuss another variance reduction technique, closely related to antithetic variates, called common random numbers, used for example whenever we wish to estimate the difference in performance between two systems, or any other quantity involving a difference, such as the slope of a function.

Example 38 For a simple example, suppose we have two estimators θ̂₁, θ̂₂ of the "center" of a symmetric distribution. We would like to know which of these estimators is better in the sense that it has smaller variance when applied to a sample from a specific distribution symmetric about its median. If both estimators are unbiased estimators of the median, then the first estimator is better if

var(θ̂₁) < var(θ̂₂)

and so we are interested in estimating a quantity like

E h₁(X) − E h₂(X)

where X is a vector representing a sample from the distribution and h₁(X) = θ̂₁², h₂(X) = θ̂₂². There are at least two ways of estimating these differences:

1. Generate samples and hence values of h₁(Xᵢ), i = 1, ..., n and h₂(Xⱼ), j = 1, 2, ..., m independently and use the estimator

(1/n) Σ_{i=1}^n h₁(Xᵢ) − (1/m) Σ_{j=1}^m h₂(Xⱼ).

2. Generate samples and hence values of h₁(Xᵢ), h₂(Xᵢ), i = 1, ..., n (both functions evaluated on the same samples, independently across i) and use the estimator

(1/n) Σ_{i=1}^n (h₁(Xᵢ) − h₂(Xᵢ)).
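A small sketch of the two methods (the choice of estimators — squared sample mean and squared sample median for a N(0,1) sample of size 9 — is an illustrative assumption, not from the text):

```python
import random, statistics

random.seed(2)

def h1(x):                     # squared sample mean
    m = sum(x) / len(x)
    return m * m

def h2(x):                     # squared sample median
    return statistics.median(x) ** 2

def sample(k):
    return [random.gauss(0, 1) for _ in range(k)]

n, k = 2000, 9
# Method 1: independent samples for the two terms
a = [h1(sample(k)) for _ in range(n)]
b = [h2(sample(k)) for _ in range(n)]
diffs_indep = [x - y for x, y in zip(a, b)]
# Method 2: common samples -- evaluate both estimators on the same data
diffs_common = []
for _ in range(n):
    x = sample(k)
    diffs_common.append(h1(x) - h2(x))

v1 = statistics.variance(diffs_indep)
v2 = statistics.variance(diffs_common)
print(v1, v2)                  # v2 is substantially smaller than v1
```

The mean and median of the same sample are strongly positively correlated, so the covariance term removed by method 2 is large and the variance of each difference drops accordingly.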

It seems intuitive that the second method is preferable since it removes the variability due to the particular sample from the comparison. This is a common type of problem in which we want to estimate the difference between two expected values. For example, we may be considering investing in a new piece of equipment that will speed up processing at one node of a network, and we wish to estimate the expected improvement in performance between the new system and the old. In general, suppose that we wish to estimate the difference between two expectations, say

E h₁(X) − E h₂(Y)     (4.37)


where the random variable or vector X has cumulative distribution function F_X and Y has cumulative distribution function F_Y. Notice that the variance of a Monte Carlo estimator,

var[h₁(X) − h₂(Y)] = var[h₁(X)] + var[h₂(Y)] − 2cov{h₁(X), h₂(Y)},     (4.38)

is small if we can induce a high degree of positive correlation between the generated random variables X and Y. This is precisely the opposite of the problem that led to antithetic random numbers, where we wished to induce a high degree of negative correlation. The following lemma is due to Hoeffding (1940) and provides a useful bound on the joint cumulative distribution function of two random variables X and Y. Suppose X, Y have cumulative distribution functions F_X(x) and F_Y(y) respectively and joint cumulative distribution function G(x, y) = P[X ≤ x, Y ≤ y].

Lemma 39 (a) The joint cumulative distribution function G of (X, Y) always satisfies

(F_X(x) + F_Y(y) − 1)⁺ ≤ G(x, y) ≤ min(F_X(x), F_Y(y))     (4.39)

for all x, y.

(b) Assume that F_X and F_Y are continuous functions. In the case that X = F_X⁻¹(U) and Y = F_Y⁻¹(U) for U uniform on [0, 1], equality is achieved on the right: G(x, y) = min(F_X(x), F_Y(y)). In the case that X = F_X⁻¹(U) and Y = F_Y⁻¹(1 − U) there is equality on the left: (F_X(x) + F_Y(y) − 1)⁺ = G(x, y).

Proof. (a) Note that

P[X ≤ x, Y ≤ y] ≤ P[X ≤ x] and similarly
                ≤ P[Y ≤ y].

This shows that

G(x, y) ≤ min(F_X(x), F_Y(y)),

verifying the right side of (4.39). Similarly for the left side,

P[X ≤ x, Y ≤ y] = P[X ≤ x] − P[X ≤ x, Y > y]
                ≥ P[X ≤ x] − P[Y > y]
                = F_X(x) − (1 − F_Y(y))
                = F_X(x) + F_Y(y) − 1.

Since G(x, y) is also non-negative, the left side follows.

For (b), suppose X = F_X⁻¹(U) and Y = F_Y⁻¹(U); then

P[X ≤ x, Y ≤ y] = P[F_X⁻¹(U) ≤ x, F_Y⁻¹(U) ≤ y]
                = P[U ≤ F_X(x), U ≤ F_Y(y)]

since P[X = x] = 0 and P[Y = y] = 0. But

P[U ≤ F_X(x), U ≤ F_Y(y)] = min(F_X(x), F_Y(y)),

verifying the equality on the right of (4.39) for common random numbers. By a similar argument,

P[F_X⁻¹(U) ≤ x, F_Y⁻¹(1 − U) ≤ y] = P[U ≤ F_X(x), 1 − U ≤ F_Y(y)]
                                  = P[U ≤ F_X(x), U ≥ 1 − F_Y(y)]
                                  = (F_X(x) − (1 − F_Y(y)))⁺,

verifying the equality on the left.
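The two equalities in part (b) are easy to check by simulation. In this sketch both marginals are Exp(1) (an illustrative choice) and G is estimated at one arbitrary point:

```python
import math, random

random.seed(3)
F = lambda x: 1 - math.exp(-x)            # Exp(1) cdf
Finv = lambda u: -math.log(1 - u)         # inverse transform

n = 100000
U = [random.random() for _ in range(n)]
x0, y0 = 0.7, 1.3                         # arbitrary evaluation point

# common random numbers: X = Finv(U), Y = Finv(U)
g_common = sum(1 for u in U if Finv(u) <= x0 and Finv(u) <= y0) / n
# antithetic random numbers: X = Finv(U), Y = Finv(1 - U)
g_anti = sum(1 for u in U if Finv(u) <= x0 and Finv(1 - u) <= y0) / n

upper = min(F(x0), F(y0))                 # right side of (4.39)
lower = max(F(x0) + F(y0) - 1, 0.0)       # left side of (4.39)
print(lower, g_anti, g_common, upper)     # g_anti ~ lower, g_common ~ upper
```

Common uniforms push the estimated joint cdf up to the Fréchet upper bound; antithetic uniforms push it down to the lower bound.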

The following theorem supports the use of common random numbers to maximize covariance and antithetic random numbers to minimize covariance.

Theorem 40 (maximum/minimum covariance)

Suppose h₁ and h₂ are both non-decreasing (or both non-increasing) functions. Subject to the constraint that X, Y have cumulative distribution functions F_X, F_Y respectively, the covariance

cov[h₁(X), h₂(Y)]

is maximized when Y = F_Y⁻¹(U) and X = F_X⁻¹(U) (i.e. for common uniform[0, 1] random numbers) and is minimized when Y = F_Y⁻¹(U) and X = F_X⁻¹(1 − U) (i.e. for antithetic random numbers).

Proof. We will sketch a proof of the theorem when the distributions are all continuous and h₁, h₂ are differentiable. Define G(x, y) = P[X ≤ x, Y ≤ y]. The following representation of covariance is useful: define

H(x, y) = P(X > x, Y > y) − P(X > x)P(Y > y)     (4.40)
        = G(x, y) − F_X(x)F_Y(y).

Notice that, using integration by parts,

∫_{−∞}^∞ ∫_{−∞}^∞ H(x, y) h₁′(x) h₂′(y) dx dy
  = −∫_{−∞}^∞ ∫_{−∞}^∞ (∂/∂x)H(x, y) h₁(x) h₂′(y) dx dy
  = ∫_{−∞}^∞ ∫_{−∞}^∞ (∂²/∂x∂y)H(x, y) h₁(x) h₂(y) dx dy
  = ∫_{−∞}^∞ ∫_{−∞}^∞ h₁(x) h₂(y) g(x, y) dx dy − ∫_{−∞}^∞ h₁(x) f_X(x) dx ∫_{−∞}^∞ h₂(y) f_Y(y) dy     (4.41)
  = cov(h₁(X), h₂(Y))

where g(x, y), f_X(x), f_Y(y) denote the joint probability density function, the probability density function of X and that of Y respectively. In fact this result holds in general even without the assumption that the distributions are continuous. The covariance between h₁(X) and h₂(Y), for h₁ and h₂ differentiable functions, is

cov(h₁(X), h₂(Y)) = ∫_{−∞}^∞ ∫_{−∞}^∞ H(x, y) h₁′(x) h₂′(y) dx dy.

The formula shows that to maximize the covariance, if h₁, h₂ are both increasing or both decreasing functions, it is sufficient to maximize H(x, y) for each x, y, since h₁′(x), h₂′(y) are both non-negative. Since we are constraining the marginal cumulative distribution functions F_X, F_Y, this is equivalent to maximizing


G(x, y) subject to the constraints

lim_{y→∞} G(x, y) = F_X(x)
lim_{x→∞} G(x, y) = F_Y(y).

Lemma 39 shows that the maximum is achieved when common random numbers are used and the minimum achieved when we use antithetic random numbers.

We can argue intuitively for the use of common random numbers in the case of a discrete distribution with probability on the points indicated in Figure 4.5. This figure corresponds to a joint distribution with the following probabilities, say

x               0     0.25   0.25   0.75   0.75   1
y               0     0.25   0.75   0.25   0.75   1
P[X=x, Y=y]     .1    .2     .2     .1     .2     .2

Suppose we wish to maximize P[X > x, Y > y] subject to the constraint that the probabilities P[X > x] and P[Y > y] are fixed. We have indicated arbitrary fixed values of (x, y) in the figure. Note that if there is any weight attached to the point in the lower right quadrant (labelled "P₂"), some or all of this weight can be reassigned to the point P₃ in the lower left quadrant, provided there is an equal movement of weight from the upper left P₄ to the upper right P₁. Such a movement of weight will increase the value of G(x, y) without affecting P[X ≤ x] or P[Y ≤ y]. The weight that we are able to transfer in this example is 0.1, the minimum of the weights on P₄ and P₂. In general, this continues until there is no weight in one of the off-diagonal quadrants for every choice of (x, y). The resulting distribution in this example is given by

x               0     0.25   0.25   0.75   0.75   1
y               0     0.25   0.75   0.25   0.75   1
P[X=x, Y=y]     .1    .3     0      .1     .3     .2


Figure 4.5: Changing weights on points to maximize covariance

and it is easy to see that such a joint distribution can be generated from common random numbers X = F_X⁻¹(U), Y = F_Y⁻¹(U).
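The covariance ordering of Theorem 40 is also easy to see numerically. With identical Exp(1) marginals generated by inverse transform (an illustrative choice, with h₁ = h₂ the identity), common, independent, and antithetic uniforms give the maximal, near-zero, and minimal covariances respectively:

```python
import math, random, statistics

random.seed(4)
Finv = lambda u: -math.log(1 - u)        # Exp(1) by inverse transform

def cov(xs, ys):
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    return statistics.fmean([(x - mx) * (y - my) for x, y in zip(xs, ys)])

n = 40000
U = [random.random() for _ in range(n)]
V = [random.random() for _ in range(n)]

X = [Finv(u) for u in U]
Y_common = [Finv(u) for u in U]          # common random numbers
Y_indep = [Finv(v) for v in V]           # independent draws
Y_anti = [Finv(1 - u) for u in U]        # antithetic random numbers

print(cov(X, Y_anti), cov(X, Y_indep), cov(X, Y_common))
# antithetic < independent (near 0) < common, identical Exp(1) marginals throughout
```

All three constructions leave the marginal distributions untouched; only the joint distribution, and hence the covariance, changes.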

Conditioning

We now consider a simple but powerful generalization of control variates. Suppose that we can decompose a random variable T into two components T₁, ε:

T = T₁ + ε     (4.42)

so that T₁, ε are uncorrelated,

cov(T₁, ε) = 0.

Assume as well that E(ε) = 0. Regression is one method for determining such a decomposition, and the error term ε in regression satisfies these conditions. Then T₁ has the same mean as T and it is easy to see that

var(T) = var(T₁) + var(ε),

so T₁ has smaller variance than T (unless ε = 0 with probability 1). This means that if we wish to estimate the common mean of T and T₁, the estimator T₁ is preferable, since it has the same mean with smaller variance.


One special case is variance reduction by conditioning. For the standard definition and properties of conditional expectation see the appendix. One common definition of E[X|Y] is the unique (with probability one) function g(Y) of Y which minimizes E{X − g(Y)}². This definition only applies to random variables X which have finite variance, and so it requires some modification when E(X²) = ∞, but we will assume here that all random variables, say X, Y, Z, have finite variances. We can define conditional covariance using conditional expectation as

cov(X, Y|Z) = E[XY|Z] − E[X|Z]E[Y|Z]

and conditional variance:

var(X|Z) = E(X²|Z) − (E[X|Z])².

The variance reduction through conditioning is justified by the following well-known result:

Theorem 41 (a) E(X) = E{E[X|Y]}

(b) cov(X, Y) = E{cov(X, Y|Z)} + cov{E[X|Z], E[Y|Z]}

(c) var(X) = E{var(X|Z)} + var{E[X|Z]}

This theorem is used as follows. Suppose we are considering a candidate estimator θ̂, an unbiased estimator of θ. We also have an arbitrary random variable Z which is somehow related to θ̂. Suppose that we have chosen Z carefully so that we are able to calculate the conditional expectation T₁ = E[θ̂|Z]. Then by part (a) of the above theorem, T₁ is also an unbiased estimator of θ. Define

ε = θ̂ − T₁.

By part (c),

var(θ̂) = var(T₁) + var(ε)


and var(T₁) = var(θ̂) − var(ε) < var(θ̂). In other words, for any variable Z, E[θ̂|Z] has the same expectation as θ̂ but smaller variance, and the decrease in variance is largest if Z and θ̂ are nearly independent, because in this case E[θ̂|Z] is close to a constant and its variance close to zero. In general, the search for an appropriate Z so as to reduce the variance of an estimator by conditioning requires finding a random variable Z such that:

1. the conditional expectation E[θ̂|Z] of the original estimator is computable;

2. var(E[θ̂|Z]) is substantially smaller than var(θ̂).

Example 42 (hit or miss)

Suppose we wish to estimate the area under the graph of a function f(x) by the hit-and-miss method. A crude method would involve determining a multiple c of a probability density function g(x) which dominates f(x), so that cg(x) ≥ f(x) for all x. We can generate points (X, Y) uniformly distributed under the graph of cg(x) by generating X by inverse transform, X = G⁻¹(U₁), where G(x) is the cumulative distribution function corresponding to the density g, and then generating Y from the Uniform[0, cg(X)] distribution, say Y = cg(X)U₂. An example, with g(x) = 2x, 0 < x < 1 and c = 1/4, is given in Figure 4.6.

The hit-and-miss estimator of the area under the graph of f obtains by generating such random points (X, Y) and counting the proportion that fall under the graph of f, i.e. for which Y ≤ f(X). This proportion estimates the probability

P[Y ≤ f(X)] = (area under f(x)) / (area under cg(x))
            = (area under f(x)) / c


Figure 4.6: Example of the Hit and Miss Method

since g(x) is a probability density function. Notice that if we define

W = c if Y ≤ f(X),
    0 if Y > f(X),

then

E(W) = c × (area under f(x)) / (area under cg(x))
     = area under f(x),

so W is an unbiased estimator of the parameter that we wish to estimate. We might therefore estimate the area under f(x) using the Monte Carlo estimator θ̂_HM = (1/n) Σ_{i=1}^n Wᵢ based on independent values of Wᵢ. This is the "hit-or-miss" estimator. However, in this case it is easy to find a random variable Z such that the conditional expectation E(W|Z) can be determined in closed form. In fact, if we choose Z = X, we obtain

E[W|X] = f(X)/g(X).

PROBLEMS 259

This is therefore an unbiased estimator of the same parameter and it has smaller variance than does W. For a sample of size n we should replace the hit-or-miss estimator θ̂_HM by the estimator

θ̂_Cond = (1/n) Σ_{i=1}^n f(Xᵢ)/g(Xᵢ)
        = (1/n) Σ_{i=1}^n f(Xᵢ)/(2Xᵢ)

with Xᵢ generated from Xᵢ = G⁻¹(Uᵢ) = √Uᵢ, i = 1, 2, ..., n and Uᵢ ∼ Uniform[0,1]. In this case, the conditional expectation results in a familiar form for the estimator θ̂_Cond: it is simply an importance sampling estimator with g(x) the importance distribution. However, this derivation shows that the estimator θ̂_Cond has smaller variance than θ̂_HM.
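A quick numerical check of this comparison (the integrand is not specified in the text, so as an assumption take f(x) = x² on (0,1), with g(x) = 2x and c = 1/2, so that cg(x) = x ≥ x²; the true area is 1/3):

```python
import math, random, statistics

random.seed(6)
f = lambda x: x * x          # integrand on (0,1); true area = 1/3
g = lambda x: 2 * x          # dominating density g(x) = 2x, with G^{-1}(u) = sqrt(u)
c = 0.5                      # c*g(x) = x >= f(x) = x^2 on (0,1)

n = 20000
hm, cond = [], []
for _ in range(n):
    X = math.sqrt(random.random())        # X = G^{-1}(U1)
    Y = c * g(X) * random.random()        # Y uniform on (0, c*g(X))
    hm.append(c if Y <= f(X) else 0.0)    # hit-or-miss estimator W
    cond.append(f(X) / g(X))              # conditioned estimator E[W|X] = f(X)/g(X)

print(statistics.fmean(hm), statistics.fmean(cond))        # both near 1/3
print(statistics.variance(hm), statistics.variance(cond))  # conditioning reduces the variance
```

With these choices the conditioned (importance sampling) estimator has variance 1/72 against c²p(1 − p) = 1/18 for hit-or-miss, a fourfold reduction for the same number of X draws.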

Problems

1. Use both crude and antithetic random numbers to integrate the function

   ∫₀¹ (eᵘ − 1)/(e − 1) du.

   What is the efficiency gain attributed to the use of antithetic random numbers?

2. How large a sample size would I need, using antithetic and crude Monte

Carlo, in order to estimate the above integral, correct to four decimal

places, with probability at least 95%?

3. Under what conditions on f does the use of antithetic random numbers

completely correct for the variability of the Monte-Carlo estimator? i.e.

When is var(f (U ) + f (1 ’ U )) = 0?

4. Show that if we use antithetic random numbers to generate two normal random variables X₁, X₂, having mean rT − σ²T/2 and variance σ²T, this is equivalent to setting X₂ = 2(rT − σ²T/2) − X₁. In other words, it is not necessary to use the inverse transform method to generate normal random variables in order to permit the use of antithetic random numbers.

5. Show that the variance of a weighted average

   var(αX + (1 − α)W)

   is minimized over α when

   α = (var(W) − cov(X, W)) / (var(W) + var(X) − 2cov(X, W)).

   Determine the resulting minimum variance. What if the random variables X, W are independent?

6. Use a stratified random sample to integrate the function

   ∫₀¹ (eᵘ − 1)/(e − 1) du.

   What do you recommend for intervals (two or three) and sample sizes? What is the efficiency gain?

7. Use a combination of stratified random sampling and an antithetic random number in the form

   (1/2)[f(U/2) + f(1 − U/2)]

   to integrate the function

   ∫₀¹ (eᵘ − 1)/(e − 1) du.

   What is the efficiency gain?

8. In the case f(x) = (eˣ − 1)/(e − 1), use g(x) = x as a control variate to integrate over [0,1]. Show that the variance is reduced by a factor of approximately 60. Is there much additional improvement if we use a more general quadratic function of x?


9. In the case f(x) = (eˣ − 1)/(e − 1), consider using g(x) = x as a control variate to integrate over [0,1]. Note that regression of f(U) on g(U) yields f(U) − E(f(U)) = β[g(U) − E(g(U))] + ε, where the error term ε has mean 0 and is uncorrelated with g(U), and β = cov(f(U), g(U))/var(g(U)). Therefore, taking expectations on both sides and reorganising the terms, E(f(U)) = E{f(U) − β[g(U) − E(g(U))]}. The Monte-Carlo estimator

   (1/n) Σ_{i=1}^n {f(Uᵢ) − β[g(Uᵢ) − E(g(Uᵢ))]}

   is an improved control variate estimator, equivalent to the one discussed above in the case β = 1. Determine by performing simulations how much better this estimator is than the basic control variate case β = 1. Is there much additional improvement if we use a more general quadratic function of x?

10. A call option pays an amount V(S) = 1/(1 + exp(S(T) − k)) at time T for some predetermined price k. Discuss what you would use for a control variate and conduct a simulation to determine how it performs, assuming geometric Brownian motion for the stock price, interest rate 5%, annual volatility 20%, and various initial stock prices, values of k and T.

11. It has been suggested that stocks are not log-normally distributed but the distribution can be well approximated by replacing the normal distribution by a student t distribution. Suppose that the daily returns Xᵢ are independent with probability density function f(x) = c(1 + (x/b)²)⁻² (the re-scaled student distribution with 3 degrees of freedom). We wish to estimate a weekly VaR₀.₉₅, a value v such that P[Σ_{i=1}^5 Xᵢ < v] = 0.95. If we wish to do this by simulation, suggest an appropriate method involving importance sampling. Implement it and estimate the variance reduction.

12. Suppose, for example, I have three different simulation estimators Y₁, Y₂, Y₃ whose means depend on two unknown parameters θ₁, θ₂. In particular, suppose Y₁, Y₂, Y₃ are unbiased estimators of θ₁, θ₁ + θ₂, θ₂ respectively. Let us assume for the moment that var(Yᵢ) = 1 and cov(Yᵢ, Yⱼ) = −1/2 for i ≠ j. I want to estimate the parameter θ₁. Should I use only the estimator Y₁, which is the unbiased estimator of θ₁, or some linear combination of Y₁, Y₂, Y₃? Compare the number of simulations necessary for a given degree of accuracy.

13. Consider the systematic sample estimator based on the trapezoidal rule:

   θ̂ = (1/n) Σ_{i=0}^{n−1} f(V + i/n),   V ∼ U[0, 1/n].

   Discuss the bias and variance of this estimator. In the case f(x) = x², how does it compare with other estimators, such as crude Monte Carlo and antithetic random numbers, requiring n function evaluations? Are there any disadvantages to its use?

14. In the case f(x) = (eˣ − 1)/(e − 1), use g(x) = x as a control variate to integrate over [0,1]. Find the optimal linear combination using estimators (4.34) and (4.35), an importance sampling estimator and the control variate estimator above. What is the efficiency gain over crude Monte-Carlo?

15. The rho of an option is the derivative of the option price with respect

to the interest rate parameter r. What is the value of ρ for a call option

with S0 = 10, strike=10, r = 0.05, T = .25 and σ = .2? Use a simulation

to estimate this slope and determine the variance of your estimator. Try

using (i) independent simulations at two points and (ii) common random

numbers. What can you say about the variances of your estimators?

16. For any random variables X, Y, prove that P(X ≤ x, Y ≤ y) − P(X ≤ x)P(Y ≤ y) = P(X > x, Y > y) − P(X > x)P(Y > y) for all x, y.


17. Show that the Jacobian of the transformation used in the proof of Theorem 23, (x, m) → (x, y) where y = exp(−(2m − x)²/2), is given by 1/(2y√(−2 ln(y))).
