. 2
( 2)

where θ = θ(x) satis¬es K 0 (θ) = x.

This is the saddlepoint approximation, discovered by Daniels (1954, 1980), and
usually applied to the distribution of sums or averages of independent random
variables because then the normal approximation is better motivated. Indeed,
the saddlepoint approximation to the distribution of the sum of n independent
identically distributed random variables is accurate to order O(n’1 ) and if we
renormalize it to integrate to one, accuracy to order O(n’3/2 ) is possible, sub-
stantially better than the order O(n’1/2 ) of of the usual normal approximation.
Consider, for example, the saddlepoint approximation to the Gamma(±,1)
distribution. In this case the cumulant generating function is

K(t) = ’± ln(1 ’ t),
K 0 (θ) = x implies θ(x) = 1 ’ and
K 00 (θ) =

Therefore the saddlepoint approximation to the probability density function is
± ±
f (x) ' exp{± ln(x/±) ’ x(1 ’ )}
2πx x
1 1 ’± ± ±’1
= ±2 e x exp(’x).

This is exactly the gamma density function with Stirling™s approximation replac-
ing “(±) and after renormalization this is exactly the Gamma density function.
In many cases the saddlepoint approximation is used to estimate a proba-
bility or a moment of a distribution and it is of interest to estimate the error
in this approximation. For example suppose that we wish, for some function
h, to estimate the error in the saddlepoint approximation to E(h(Sn )) where

Xi and each random variable Xi has the non-central chi-squared
Sn = i=1

distribution with cumulant generating function

2»t p
’ ln(1 ’ 2t).
K(t) =
1 ’ 2t 2

The parameter » is the non-centrality parameter of the distribution and p is
the degrees of freedom. Notice that the cumulant generating function of the sum
takes the same form but with (», p) replaced by (n», np). If f (x) is the actual
probability density function of this sum and fs is the saddlepoint approximation
to the same density, then the error in the saddlepoint approximation is
∞ ∞ ∞
h(x)fs (x)dx ’ h(x)(fs (x) ’ f (x))dx. (4.33)
h(x)f (x)dx =
0 0 0

The right hand side of (4.33) can be approximated with a relatively small Monte
Carlo sample since the di¬erences appearing in it fs (x)’f (x) are typically small.
In this case, we might use importance sampling and since it is easy to generate
from the distribution of Sn , we could simply average simulated values of

fs (Sn )
’ 1).
h(Sn )(
f (Sn )

Since it is often computationally expensive to generate random variables
whose distribution is a convolution of known densities, it is interesting to ask
whether (4.32) makes this any easier. In general, K(t) is convex with K(0) = 0
and »(x) = θ(x) ’ is an increasing function of x.This is because, upon

di¬erentiation, we obtain

1 K(θ(x))
»0 (x) = ’1+ =
K 00 (θ(x)) x2


Combining Monte Carlo Estimators.

We have now seen a number of di¬erent variance reduction techniques and there
are many more possible. With many of these methods such as importance and

strati¬ed sampling are associated parameters which may be chosen in di¬erent
ways. The variance formula may be used as a basis of choosing a “best” method
but these variances and e¬ciencies must also estimated from the simulation and
it is rarely clear a priori which sampling procedure and estimator is best. For
example if a function f is monotone on [0, 1] then an antithetic variate can be
introduced with an estimator of the form

θa1 = [f (U ) + f (1 ’ U )], U ∼ U [0, 1] (4.34)
but if the function is increasing to a maximum somewhere around and then

decreasing thereafter we might prefer

θa2 = [f (U/2) + f ((1 ’ U )/2) + f ((1 + U )/2) + f (1 ’ U/2)]. (4.35)

Notice that any weighted average of these two unbiased estimators of θ would
also provide an unbiased estimator of θ. The large number of potential variance
reduction techniques is an embarrassment of riches. Which variance reduction
methods we should use and how will we know whether it is better than the
competitors? Fortunately, the answer is often to use “all of the methods” (within
reason of course); that choosing a single method is often neither necessary nor
desirable. Rather it is preferable to use a weighted average of the available
estimators with the optimal choice of the weights provided by regression.
Suppose in general that we have k estimators or statistics θi , i = 1, ..k, all
unbiased estimators of the same parameter θ so that E(θi ) = θ for all i . In
b b
vector notation, letting ˜0 = (θ1 , ..., θk ), we write E(˜) = 1θ where 1 is the
k-dimensional column vector of ones so that 10 = (1, 1, ..., 1). Let us suppose for
the moment that we know the variance-covariance matrix V of the vector ˜,
de¬ned by
Vij = cov(θi , θj ).

Theorem 37 (best linear combinations of estimators)
The linear combination of the θi which provides an unbiased estimator of θ and

has minimum variance among all linear unbiased estimators is

b b (4.36)
θblc = bi θi

where the vector b = (b1 , ..., bk )0 is given by

b = (1t V ’1 1)’1 V ’1 1.

The variance of the resulting estimator is

var(θblc ) = bt V b = 1/(1t V ’1 1)

Proof. The proof is straightforward. It is easy to see that for any linear
combination (4.36) the variance of the estimator is

bt V b

and we wish to minimize this quadratic form as a function of b subject to the
constraint that the coe¬cients add to one, or that

b0 1 =1.

Introducing the Lagrangian, we wish to set the derivatives with respect to
the components bi equal to zero

‚ 0
{bt V b + »(b 1’1)} = 0 or
2V b + »1= 0

b=constant — V ’1 1

and upon requiring that the coe¬cients add to one, we discover the value of the
constant above is (1t V ’1 1)’1 .
This theorem indicates that the ideal linear combination of estimators has
coe¬cients proportional to the row sums of the inverse covariance matrix. No-
tably, the variance of a particular estimator θi is an ingredient in that sum,

but one of many. In practice, of course, we almost never know the variance-
covariance matrix V of a vector of estimators ˜. However, when we do simula-
tion evaluating these estimators using the same uniform input to each, we obtain
independent replicated values of ˜. This permits us to estimate the covariance
matrix V and since we typically conduct many simulations this estimate can be
very accurate. Let us suppose that we have n simulated values of the vectors ˜,
and call these ˜1 , ..., ˜n . As usual we estimate the covariance matrix V using
the sample covariance matrix
b (˜i ’ ˜)(˜i ’ ˜)0
n ’ 1 i=1

˜= ˜i .
n i=1

Let us return to the example and attempt to ¬nd the best combination of
the many estimators we have considered so far. To this end, let

b [f (.47 + .53u) + f (1 ’ .53u)] an antithetic estimator,
θ1 =
0.37 0.16
b [f (.47 + .37u) + f (.84 ’ .37u)] + [f (.84 + .16u) + f (1 ’ .16u)],
θ2 =
2 2
θ3 = 0.37[f (.47 + .37u)] + 0.16[f (1 ’ .16u)], (strati¬ed-antithetic)
θ4 = g(x)dx + [f (u) ’ g(u)], (control variate)

b ˆ
θ5 = θim , the importance sampling estimator (4.23).

b b b
Then θ2 , and θ3 are both strati¬ed-antithetic estimators, θ4 is a control
variate estimator and θ5 the importance sampling estimator discussed earlier, all
obtained from a single input uniform random variate U. In order to determine
the optimal linear combination we need to generate simulated values of all 5
estimators using the same uniform random numbers as inputs. We determine
the best linear combination of these estimators using

function [o,v,b,V]=optimal(U)

% generates optimal linear combination of ¬ve estimators and outputs

% average estimator, variance and weights

% input U a row vector of U[0,1] random numbers







X=[T1™ T2™ T3™ T4™ T5™]; % matrix whose columns are replications of the same

estimator, a row=5 estimators using same U


V=cov(X); % this estimates the covariance matrix V


V1=inv(V); % the inverse of the covariance matrix

b=V1*on/(on™*V1*on); % vector of coe¬cients of the optimal linear combination

o=mean(X*b); % vector of the optimal linear combinations

v=1/(on™*V1*on); % variance of the optimal linear combination based on

a single U

One run of this estimator, called with [o,v,b,V]= optimal(unifrnd(0,1,1,1000000))

o = 0.4615

b0 = [’0.5499 1.4478 0.1011 0.0491 ’ 0.0481].

The estimate 0.4615 is accurate to at least four decimals which is not surprising
since the variance per uniform random number input is v = 1.13 — 10’5 . In
other words, the variance of the mean based on 1,000,000 uniform input is

1.13— 10’10 , the standard error is around .00001 so we can expect accuracy to at
least 4 decimal places. Note that some of the weights are negative and others are
greater than one. Do these negative weights indicate estimators that are worse
than useless? The e¬ect of some estimators may be, on subtraction, to render the
remaining function more linear and more easily estimated using another method
and negative coe¬cients are quite common in regression generally. The e¬ciency
gain over crude Monte Carlo is an extraordinary 40,000. However since there
are 10 function evaluations for each uniform variate input, the e¬ciency when
we adjust for the number of function evaluations is 4,000. This simulation
using 1,000,000 uniform random numbers and taking a 63 seconds on a Pentium
IV (2.4 GHz) (including the time required to generate all ¬ve estimators) is
equivalent to forty billion simulations by crude Monte Carlo, a major task on a

If we intended to use this simulation method repeatedly, we might well wish
to see whether some of the estimators can be omitted without too much loss
of information. Since the variance of the optimal estimator is 1/(1t V ’1 1), we
might use this to attempt to select one of the estimators for deletion. Notice that
it is not so much the covariance of the estimators V which enters into Theorem
35 but its inverse J = V ’1 which we can consider a type of information matrix
by analogy to maximum likelihood theory. For example we could choose to delete
the i0 th estimator, i.e. delete the i0 th row and column of V where i is chosen
to have the smallest e¬ect on 1/(1t V ’1 1) or its reciprocal 1t J1 = i j Jij .
with the i0 th row and column
In particular, if we let V(i) be the matrix V
deleted and J(i) = V(i) , then we can identify 1t J1 ’ 1t J(i) 1 as the loss of
information when the i0 th estimator is deleted. Since not all estimators have
the same number of function evaluations, we should adjust this information
by F E(i) =number of function evaluations required by the i0 th estimator. In
other words, if an estimator i is to be deleted, it should be the one corresponding

1t J1 ’ 1t J(i) 1
min{ }.
F E(i)

We should drop this i0 th estimator if the minimum is less than the information
per function evaluation in the combined estimator, because this means we will
increase the information available in our simulation per function evaluation.
In the above example with all ¬ve estimators included, 1t J1 = 88757 (with
10 function evaluations per uniform variate) so the information per function
evaluation is 8, 876.

1t J1’1t J(i) 1
1t J1 ’ 1t J(i) 1
i F E(i) F E(i)

1 88,048 2 44024
2 87,989 4 21,997
3 28,017 2 14,008
4 55,725 1 55,725
5 32,323 1 32,323

In this case, if we were to eliminate one of the estimators, our choice would
likely be number 3 since it contributes the least information per function eval-
uation. However, since all contribute more than 8, 876 per function evaluation,
we should likely retain all ¬ve.

Common Random Numbers.

We now discuss another variance reduction technique, closely related to anti-
thetic variates called common random numbers, used for example whenever
we wish to estimate the di¬erence in performance between two systems or any
other variable involving a di¬erence such as a slope of a function.

Example 38 For a simple example suppose we have two estimators θ1 , θ2 of
the “center” of a symmetric distribution. We would like to know which of these
estimators is better in the sense that it has smaller variance when applied to

a sample from a speci¬c distribution symmetric about its median. If both esti-
mators are unbiased estimators of the median, then the ¬rst estimator is better
b b
var(θ1 ) < var(θ2 )

and so we are interested in estimating a quantity like

Eh1 (X) ’ Eh2 (X)

where X is a vector representing a sample from the distribution and h1 (X) =
b2 b2
θ1 , h2 (X) = θ2 . There are at least two ways of estimating these di¬erences;

1. Generate samples and hence values of h1 (Xi ), i = 1, ..., n and Eh2 (Xj ), j =
1, 2, ..., m independently and use the estimator
n m
1X 1X
h1 (Xi ) ’ h2 (Xj ).
n i=1 m j=1

2. Generate samples and hence values of h1 (Xi ), h2 (Xi ), i = 1, ..., n inde-
pendently and use the estimator
(h1 (Xi ) ’ h2 (Xi )).
n i=1

It seems intuitive that the second method is preferable since it removes
the variability due to the particular sample from the comparison. This is a
common type of problem in which we want to estimate the di¬erence between
two expected values. For example we may be considering investing in a new
piece of equipment that will speed up processing at one node of a network and
we wish to estimate the expected improvement in performance between the new
system and the old. In general, suppose that we wish to estimate the di¬erence
between two expectations, say

Eh1 (X) ’ Eh2 (Y ) (4.37)

where the random variable or vector X has cumulative distribution function FX
and Y has cumulative distribution function FY . Notice that the variance of a
Monte Carlo estimator

var[h1 (X) ’ h2 (Y )] = var[h1 (X)] + var[h2 (Y )] ’ 2cov{h1 (X), h2 (Y )} (4.38)

is small if we can induce a high degree of positive correlation between the gen-
erated random variables X and Y . This is precisely the opposite problem that
led to antithetic random numbers, where we wished to induce a high degree
of negative correlation. The following lemma is due to Hoe¬ding (1940) and
provides a useful bound on the joint cumulative distribution function of two
random variables X and Y. Suppose X, Y have cumulative distribution func-
tions FX (x) and FY (y) respectively and joint cumulative distribution function
G(x, y) = P [X · x, Y · y].

Lemma 39 (a) The joint cumulative distribution function G of (X, Y ) always

(FX (x) + FY (y) ’ 1)+ · G(x, y) · min(FX (x), FY (y)) (4.39)

for all x, y .
(b) Assume that FX and FY are continuous functions. In the case that
’1 ’1
X = FX (U ) and Y = FY (U ) for U uniform on [0, 1], equality is achieved
on the right G(x, y) = min(FX (x), FY (y)). In the case that X = FX (U ) and
Y = FY (1 ’ U ) there is equality on the left; (FX (x) + FY (y) ’ 1)+ = G(x, y).
Proof. (a) Note that

P [X · x, Y · y] · P [X · x] and similarly

· P [Y · y].

This shows that
G(x, y) · min(FX (x), FY (y)),

verifying the right side of (4.39). Similarly for the left side

P [X · x, Y · y] = P [X · x] ’ P [X · x, Y > y]

≥ P [X · x] ’ P [Y > y]

= FX (x) ’ (1 ’ FY (y))

= (FX (x) + FY (y) ’ 1).

Since it is also non-negative the left side follows.
’1 ’1
For (b) suppose X = FX (U ) and Y = FY (U ), then

’1 ’1
P [X · x, Y · y] = P [FX (U ) · x, FY (U ) · y]

= P [U · FX (x), U · FY (y)]

since P [X = x] = 0 and P [Y = y] = 0.
P [U · FX (x), U · FY (y)] = min(FX (x), FY (y))

verifying the equality on the right of (4.39) for common random numbers. By
a similar argument,

’1 ’1
P [FX (U ) · x, FY (1 ’ U ) · y] = P [U · FX (x), 1 ’ U · FY (y)]

= P [U · FX (x), U ≥ 1 ’ FY (y)]

= (FX (x) ’ (1 ’ FY (y)))+

verifying the equality on the left.

The following theorem supports the use of common random numbers to
maximize covariance and antithetic random numbers to minimize covariance.

Theorem 40 (maximum/minimum covariance)
Suppose h1 and h2 are both non-decreasing (or both non-increasing) functions.
Subject to the constraint that X, Y have cumulative distribution functions FX , FY
respectively, the covariance

cov[h1 (X), h2 (Y )]

’1 ’1
is maximized when Y = FY (U ) and X = FX (U ) (i.e. for common uniform[0, 1]
’1 ’1
random numbers) and is minimized when Y = FY (U ) and X = FX (1 ’ U )
(i.e. for antithetic random numbers).

Proof. We will sketch a proof of the theorem when the distributions are
all continuous and h1 , h2 are di¬erentiable. De¬ne G(x, y) = P [X · x, Y · y].
The following representation of covariance is useful: de¬ne

H(x, y) = P (X > x, Y > y) ’ P (X > x)P (Y > y) (4.40)

= G(x, y) ’ FX (x)FY (y).

Notice that, using integration by parts,
H(x, y)h0 (x)h0 (y)dxdy
1 2
’∞ ’∞

H(x, y)h1 (x)h0 (y)dxdy
=’ 2
’∞ ’∞ ‚x
= H(x, y)h1 (x)h2 (y)dxdy
’∞ ’∞ ‚x‚y
Z∞Z∞ Z∞ Z ∞
h1 (x)h2 (y)g(x, y)dxdy ’
= h1 (x)fX (x)dx h2 (y)fY (y)dy
’∞ ’∞ ’∞ ’∞

= cov(h1 (X), h2 (Y ))

where g(x, y), fX (x), fY (y) denote the joint probability density function, the
probability density function of X and that of Y respectively. In fact this result
holds in general even without the assumption that the distributions are contin-
uous. The covariance between h1 (X) and h2 (Y ), for h1 and h2 di¬erentiable
functions, is
∞ ∞
H(x, y)h0 (x)h0 (y)dxdy.
cov(h1 (X), h2 (Y )) = 1 2
’∞ ’∞

The formula shows that to maximize the covariance, if h1 , h2 are both increasing
or both decreasing functions, it is su¬cient to maximize H(x, y) for each x, y
since h0 (x), h0 (y) are both non-negative. Since we are constraining the mar-
1 2

ginal cumulative distribution functions FX , FY , this is equivalent to maximizing

G(x, y) subject to the constraints

lim G(x, y) = FX (x)

lim G(x, y) = FY (y).

Lemma 37 shows that the maximum is achieved when common random numbers
are used and the minimum achieved when we use antithetic random numbers.

We can argue intuitively for the use of common random numbers in the case
of a discrete distribution with probability on the points indicated in Figure 4.5.
This ¬gure corresponds to a joint distribution with the following probabilities,
0 0.25 0.25 0.75 0.75 1
0 0.25 0.75 0.25 0.75 1
.1 .2 .2 .1 .2 .2
P [X = x, Y = y]

Suppose we wish to maximize P [X > x, Y > y] subject to the constraint that
the probabilities P [X > x] and P [Y > y] are ¬xed. We have indicated
arbitrary ¬xed values of (x, y) in the ¬gure. Note that if there is any weight
attached to the point in the lower right quadrant (labelled “P2 ”), some or all
of this weight can be reassigned to the point P3 in the lower left quadrant
provided there is an equal movement of weight from the upper left P4 to the
upper right P1 . Such a movement of weight will increase the value of G(x, y)
without a¬ecting P [X · x] or P [Y · y]. The weight that we are able to transfer
in this example is 0.1, the minimum of the weights on P4 and P2 . In general,
this continues until there is no weight in one of the o¬-diagonal quadrants for
every choice of (x, y). The resulting distribution in this example is given by

0 0.25 0.25 0.75 0.75 1
0 0.25 0.75 0.25 0.75 1
.1 .3 0 .1 .3 .2
P [X = x, Y = y]

Figure 4.5: Changing weights on points to maximize covariance

and it is easy to see that such a joint distribution can be generated from common
’1 ’1
random numbers X = FX (U ), Y = FY (U ).


We now consider a simple but powerful generalization of control variates. Sup-
pose that we can decompose a random variable T into two components T1 , µ

T = T1 + µ

so that T1 , µ are uncorrelated

cov(T1 , µ) = 0.

Assume as well that E(µ) = 0. Regression is one method for determining such
a decomposition and the error term µ in regression satis¬es these conditions.
Then T1 has the same mean as T and it is easy to see that

var(T ) = var(T1 ) + var(µ)

so T1 as smaller variance than T (unless µ = 0 with probability 1). This means
that if we wish to estimate the common mean of T or T1 , the estimator T1 is
preferable, since it has the same mean with smaller variance.

One special case is variance reduction by conditioning. For the standard
de¬nition and properties of conditional expectation see the appendix. One com-
mon de¬nition of E[X|Y ] is the unique (with probability one) function g(y)
of Y which minimizes E{X ’ g(Y )}2 . This de¬nition only applies to random
variables X which have ¬nite variance and so this de¬nition requires some mod-
i¬cation when E(X 2 ) = ∞, but we will assume here that all random variables,
say X, Y, Z have ¬nite variances. We can de¬ne conditional covariance using
conditional expectation as

cov(X, Y |Z) = E[XY |Z] ’ E[X|Z]E[Y |Z]

and conditional variance:

var(X|Z) = E(X 2 |Z) ’ (E[X|Z])2 .

The variance reduction through conditioning is justi¬ed by the following well-
known result:

Theorem 41 (a)E(X) = E{E[X|Y ]}
(b) cov(X, Y ) = E{cov(X, Y |Z)} + cov{E[X|Z], E[Y |Z]}
(c) var(X) = E{var(X|Z)} + var{E[X|Z]}

This Theorem is used as follows. Suppose we are considering a candidate
estimator θ, an unbiased estimator of θ. We also have an arbitrary random
variable Z which is somehow related to θ. Suppose that we have chosen Z
carefully so that we are able to calculate the conditional expectation T1 =
E[θ|Z]. Then by part (a) of the above Theorem, T1 is also an unbiased estimator
of θ. De¬ne
µ = θ ’ T1 .

By part (c),
var(θ) = var(T1 ) + var(µ)

ˆ ˆ
and var(T1 ) = var(θ) ’ var(µ) < var(θ). In other words, for any variable Z,
ˆ ˆ
E[θ|Z] has the same expectation as does θ but smaller variance and the decrease
in variance is largest if Z and θ are nearly independent, because in this case
E[θ|Z] is close to a constant and its variance close to zero. In general the
search for an appropriate Z so as to reducing the variance of an estimator by
conditioning requires searching for a random variable Z such that:

1. the conditional expectation E[θ|Z] with the original estimator is com-

ˆ ˆ
2. var(E[θ|Z]) is substantially smaller than var(θ).

Example 42 (hit or miss)

Suppose we wish to estimate the area under a certain graph f (x) by the hit
and miss method. A crude method would involve determining a multiple c of a
probability density function g(x) which dominates f (x) so that cg(x) ≥ f (x) for
all x. We can generate points (X, Y ) at random and uniformly distributed under
the graph of cg(x) by generating X by inverse transform X = G’1 (U1 ) where
G(x) is the cumulative distribution function corresponding to density g and
then generating Y from the Uniform[0, cg(X)] distribution, say Y = cg(X)U2 .
An example, with g(x) = 2x, 0 < x < 1 and c = 1/4 is given in Figure 4.6.

The hit and miss estimator of the area under the graph of f obtains by
generating such random points (X, Y ) and counting the proportion that fall
under the graph of g, i.e. for which Y · f (X). This proportion estimates the

area under f (x)
P [Y · f (X)] =
area under cg(x)
area under f (x)

Figure 4.6: Example of the Hit and Miss Method

since g(x) is a probability density function. Notice that if we de¬ne
⎨c Y · f (X)
©0 if Y > f (X)


area under f (x)
E(W ) = c —
area under cg(x)
= area under f (x)

so W is an unbiased estimator of the parameter that we wish to estimate. We
might therefore estimate the area under f (x) using a Monte Carlo estimator
θHM = n n Wi based on independent values of Wi .This is the “hit-or-miss”
ˆ 1

estimator. However, in this case it is easy to ¬nd a random variable Z such
that the conditional expectation E(Z|W ) can be determined in closed form. In
fact we can choose Z = X, we obtain

f (X)
E[W |X] = .

This is therefore an unbiased estimator of the same parameter and it has smaller
variance than does W. For a sample of size n we should replace the crude esti-
mator θcr by the estimator
1 X f (Xi )
θCond =
n i=1 g(Xi )
1 X f (Xi )
n i=1 2Xi

with Xi generated from X = G’1 (Ui ) = Ui , i = 1, 2, ..., n and Ui ∼ Uni-
form[0,1]. In this case, the conditional expectation results in a familiar form for
the estimator θCond . This is simply an importance sampling estimator with g(x)
the importance distribution. However, this derivation shows that the estimator
ˆ ˆ
θCond has smaller variance than θHM .


1. Use both crude and antithetic random numbers to integrate the function
Z 1
eu ’ 1

What is the e¬ciency gain attributed to the use of antithetic random

2. How large a sample size would I need, using antithetic and crude Monte
Carlo, in order to estimate the above integral, correct to four decimal
places, with probability at least 95%?

3. Under what conditions on f does the use of antithetic random numbers
completely correct for the variability of the Monte-Carlo estimator? i.e.
When is var(f (U ) + f (1 ’ U )) = 0?

4. Show that if we use antithetic random numbers to generate two normal
random variables X1 , X2 , having mean rT ’ σ 2 T /2 and variance σ 2 T ,

this is equivalent to setting X2 = 2(rT ’ σ2 T /2) ’ X1 . In other words, it
is not necessary to use the inverse transform method to generate normal
random variables in order to permit the use of antithetic random numbers.

5. Show that the variance of a weighted average

var(±X + (1 ’ ±)W )

is minimized over ± when

var(W ) ’ cov(X, W )
var(W ) + var(X) ’ 2cov(X, W )

Determine the resulting minimum variance. What if the random variables
X, W are independent?

6. Use a strati¬ed random sample to integrate the function
Z 1
eu ’ 1

What do you recommend for intervals (two or three) and sample sizes?
What is the e¬ciency gain?

7. Use a combination of strati¬ed random sampling and an antithetic random
number in the form
[f (U/2) + f (1 ’ U/2)]
to integrate the function
Z 1
eu ’ 1

What is the e¬ciency gain?

ex ’1
8. In the case f (x) = e’1 , use g(x) = x as a control variate to integrate over
[0,1]. Show that the variance is reduced by a factor of approximately 60.
Is there much additional improvement if we use a more general quadratic
function of x?

ex ’1
9. In the case f (x) = e’1 , consider using g(x) = x as a control variate
to integrate over [0,1]. Note that regression of f (U ) on g(U ) yields
f (U ) ’ E(f (U )) = β[g(U ) ’ Eg(U )] + µ where the error term µ has mean
0 and is uncorrelated with g(U ) and β = cov(f (U ), g(U ))/var(g(U ).
Therefore, taking expectations on both sides and reorganising the terms,
E(f (U )) = f (U ) ’ β[g(U ) ’ E(g(U ))]. The Monte-Carlo estimator
{f (Ui ) ’ β[g(Ui ) ’ E(g(Ui ))]}
n i=1

is an improved control variate estimator, equivalent to the one discussed
above in the case β = 1. Determine how much better this estimator
is than the basic control variate case β = 1 by performing simulations.
Show that the variance is reduced by a factor of approximately 60. Is
there much additional improvement if we use a more general quadratic
function of x?

10. A call option pays an amount V (S) = 1/(1 + exp(S(T ) ’ k)) at time T
for some predetermined price k. Discuss what you would use for a control
variate and conduct a simulation to determine how it performs, assuming
geometric Brownian motion for the stock price, interest rate 5%, annual
volatility 20% and various initial stock prices, values of k and T.

11. It has been suggested that stocks are not log-normally distributed but the
distribution can be well approximated by replacing the normal distribu-
tion by a student t distribution. Suppose that the daily returns Xi are
independent with probability density function f (x) = c(1 + (x/b)2 )’2 (the
re-scaled student distribution with 3 degrees of freedom). We wish to esti-
mate a weekly V ar.95 , a value ev such that P [ i=1 Xi < v] = 0.95. If we
wish to do this by simulation, suggest an appropriate method involving
importance sampling. Implement and estimate the variance reduction.

12. Suppose, for example, I have three di¬erent simulation estimators Y1 , Y2 , Y3

whose means depend on two unknown parameters θ1 , θ2 . In particular,
suppose Y1 , Y2 , Y3 , are unbiased estimators of θ1 , θ1 + θ2 , θ2 respectively.
Let us assume for the moment that var(Yi ) = 1, cov(Yi , Yj ) = ’1/2.
I want to estimate the parameter θ1 . Should I use only the estimator
Y1 which is the unbiased estimator of θ1 , or some linear combination of
Y1 , Y2 , Y3 ? Compare the number of simulations necessary for a certain
degree of accuracy.

13. Consider the systematic sample estimator based on the trapezoidal rule:

ˆ= 1 1
f (V + i/n), V ∼ U [0, ]
n i=0 n

Discuss the bias and variance of this estimator. In the case f (x) = x2 ,
how does it compare with other estimators such as crude Monte Carlo
and antithetic random numbers requiring n function evaluations. Are
there any disadvantages to its use?

ex ’1
14. In the case f (x) = e’1 , use g(x) = x as a control variate to integrate
over [0,1]. Find the optimal linear combination using estimators (4.34) and
(4.35), an importance sampling estimator and the control variate estimator
above. What is the e¬ciency gain over crude Monte-Carlo?

15. The rho of an option is the derivative of the option price with respect
to the interest rate parameter r. What is the value of ρ for a call option
with S0 = 10, strike=10, r = 0.05, T = .25 and σ = .2? Use a simulation
to estimate this slope and determine the variance of your estimator. Try
using (i) independent simulations at two points and (ii) common random
numbers. What can you say about the variances of your estimators?

16. For any random variables X, Y, prove that P (X · x, Y · y) ’ P (X ·
x)P (Y · y) = P (X > x, Y > y) ’ P (X > x)P (Y > y) for all x, y.

17. Show that the Jacobian of the transformation used in the proof of Theorem
23; (x, m) ’ (x, y) where y = exp(’(2m’x)2 /2) is given by .
2y ’2 ln(y)


. 2
( 2)