
Figure 4.4: Comparison of the function f (u) and the control variate g(u)



Let us return to the example of pricing a call option. By some experimentation, which could involve a preliminary crude simulation or simply evaluating the function at various points, we discovered that the function

g(u) = 6[(u − 0.47)^+]^2 + (u − 0.47)^+

provided a reasonable approximation to the function f(u). The two functions are compared in Figure 4.4. Moreover, the integral 2 × (0.53)^3 + (1/2)(0.53)^2 of the function g(·) is easy to obtain.
It is obvious from the figure that since f(u) − g(u) is generally much smaller and less variable than is f(u), var[f(U) − g(U)] < var(f(U)). The variance of the crude Monte Carlo estimator is determined by the variability of the function f(u) over its full range. The variance of the control variate estimator is determined by the variance of the difference between the two functions, which in this case is quite small. We used the following Matlab functions, the first to evaluate the function g(u) and the second to determine the efficiency gain of the control variate estimator:

% this is the function g(u), a control variate for fn(u)
function g=GG(u)

u=max(0,u-.47);

g=6*u.^2+u;




function [est,var1,var2]=control(f,g,intg,n)

% run using a statement like [est,var1,var2]=control('fn','GG',intg,n)

% runs a simulation on the function f using control variate g (both character
% strings) n times.

% intg is the integral of g:  intg = integral of g(u) over [0,1]

% outputs estimator est and variances var1, var2, the variances without and
% with the control variate respectively.

U=unifrnd(0,1,1,n);

% evaluates f(u) for the vector U
FN=eval(strcat(f,'(U)'));

% evaluates g(u)
CN=eval(strcat(g,'(U)'));

est=intg+mean(FN-CN);

var1=var(FN);

var2=var(FN-CN);


Then the call [est,var1,var2]=control('fn','GG',2*(.53)^3+(.53)^2/2,1000000) yields the estimate 0.4616 and variance 1.46 × 10^{-8} for an efficiency gain over crude Monte Carlo of around 30.
This elementary form of control variate suggests using the estimator

∫_0^1 g(u)du + (1/n) Σ_{i=1}^n [f(U_i) − g(U_i)],

but it may well be that g(U) is not the best estimator we can imagine for f(U). We can often find a linear function of g(U) which is better by using regression. Since elementary regression yields

f(U) − E(f(U)) = β(g(U) − E(g(U))) + ε                                (4.19)

where

β = cov(f(U), g(U)) / var(g(U))                                        (4.20)

and the errors ε have expectation 0, it follows that E(f(U)) + ε = f(U) − β[g(U) − E(g(U))] and so f(U) − β[g(U) − E(g(U))] is an unbiased estimator of E(f(U)). For a sample of n uniform random numbers this becomes

θ̂_cv = βE(g(U)) + (1/n) Σ_{i=1}^n [f(U_i) − βg(U_i)].                  (4.21)

Moreover, this estimator has the smallest variance among all linear combinations of f(U) and g(U). Note that when β = 1, (4.21) reduces to the simpler form of the control variate technique (4.17) discussed above, but the estimator with the optimal coefficient β is generally better in terms of maximizing efficiency. Of course in practice it is necessary to estimate the covariance and the variance in the definition of β from the simulations themselves by evaluating f and g at many different uniform random variables U_i, i = 1, 2, ..., n and then estimating β using the standard least squares estimator

β̂ = [n Σ_{i=1}^n f(U_i)g(U_i) − Σ_{i=1}^n f(U_i) Σ_{i=1}^n g(U_i)] / [n Σ_{i=1}^n g^2(U_i) − (Σ_{i=1}^n g(U_i))^2].

Although in theory the substitution of an estimator β̂ for the true value β results in a small bias in the estimator, for large numbers of simulations n our estimator β̂ is so close to the true value that this bias can be disregarded.
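A minimal Matlab sketch of (4.21) with β estimated by least squares (this function, here called control2, is not in the text; fn, GG and the value of E(g(U)) follow the earlier example):

function [est,varcv]=control2(f,g,Eg,n)
% regression-adjusted control variate estimator (4.21); a sketch, not from the text.
% f, g are function names (character strings), Eg = E(g(U)), n = number of uniforms.
U=unifrnd(0,1,1,n);
FN=eval(strcat(f,'(U)'));            % f(U_i)
CN=eval(strcat(g,'(U)'));            % g(U_i)
b=cov(FN,CN); b=b(1,2)/var(CN);      % least squares estimate of beta
est=b*Eg+mean(FN-b*CN);              % the estimator (4.21) with beta replaced by b
varcv=var(FN-b*CN);                  % variance per uniform (ignoring the variability of b)

It can be called as [est,varcv]=control2('fn','GG',2*(.53)^3+(.53)^2/2,1000000); since g was chosen to track f closely, the fitted β should be near 1 and the additional gain over the β = 1 version correspondingly modest.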




Importance Sampling.

A second, related technique is importance sampling. Again we depend on having a reasonably simple function g that, after multiplication by some constant, is similar to f. However, rather than attempt to minimize the difference f(u) − g(u) between the two functions, we try to find g(u) such that f(u)/g(u) is nearly constant. We also require that g is non-negative

and can be integrated so that, after rescaling, it integrates to one, i.e. it is a probability density function. Assume we can easily generate random variables from the probability density function g(z). The distribution whose probability density function is g(z), z ∈ [0, 1], is the importance distribution. Note that if we generate a random variable Z having the probability density function g(z), z ∈ [0, 1], then

∫_0^1 f(u)du = ∫_0^1 [f(z)/g(z)] g(z)dz = E[f(Z)/g(Z)].                (4.22)

This can therefore be estimated by generating independent random variables Z_i with probability density function g(z) and then setting

θ̂_im = (1/n) Σ_{i=1}^n f(Z_i)/g(Z_i).                                  (4.23)

Once again, according to (4.22), this is an unbiased estimator and its variance is

var{θ̂_im} = (1/n) var{f(Z_1)/g(Z_1)}.                                  (4.24)

Returning to our example, we might consider using the same function as before for g(u). However, it is not easy to generate variates from a density proportional to this function g by inverse transform since this would require solving a cubic equation. Instead, let us consider something much simpler: the density function g(u) = 2(0.53)^{-2}(u − 0.47)^+, which has cumulative distribution function G(u) = (0.53)^{-2}[(u − 0.47)^+]^2 and inverse cumulative distribution function G^{-1}(u) = 0.47 + 0.53√u. In this case we generate Z_i using Z_i = G^{-1}(U_i) for U_i ∼ Uniform[0, 1]. The following function simulates an importance sampling estimator:




function [est,v]=importance(f,g,Ginv,u)

% runs a simulation on the function 'f' using importance density 'g' (both character
% strings) and inverse c.d.f. 'Ginv'

% outputs all estimators (should be averaged) and the variance of their average

% IM is the inverse c.d.f. of the importance distribution evaluated at u

% run e.g.
% [est,v]=importance('fn','2*(IM-.47)/(.53)^2;','.47+.53*sqrt(u);',rand(1,1000));

IM=eval(Ginv);                  % =.47+.53*sqrt(u);

% IMdens is the density of the importance sampling distribution at IM
IMdens=eval(g);                 % =2*(IM-.47)/(.53)^2;

FN=eval(strcat(f,'(IM)'));

est=FN./IMdens;                 % mean(est) provides the estimator

v=var(FN./IMdens)/length(IM);   % the variance of the estimator mean(est)


The function was called with [est,v]=importance('fn','2*(IM-.47)/(.53)^2;','.47+.53*sqrt(u);',rand(1,1000000)), giving an estimate mean(est) = 0.4616 with variance 1.28 × 10^{-8} for an efficiency gain of around 35 over crude Monte Carlo.

Example 36 (Estimating quantiles using importance sampling) Suppose we are able to generate random variables X from a probability density function of the form f_θ(x) and we wish to estimate a quantile such as the Value at Risk, i.e. estimate x_p such that

P_{θ_0}(X ≤ x_p) = p

for a certain value θ_0 of the parameter.

As a very simple example suppose S is the sum of 10 independent random variables having the exponential distribution with mean θ, and f_θ(x_1, ..., x_10) is the joint probability density function of these 10 observations. Assume θ_0 = 1 and p = .999 so that we seek an extreme quantile of the sum, i.e. we want to determine x_p such that P_{θ_0}(S ≤ x_p) = p. The equation that we wish to solve for x_p is

E_{θ_0}{I(S ≤ x_p)} = p.                                               (4.25)

The crudest estimator of this is obtained by generating a large number of independent observations of S under the parameter value θ_0 = 1 and finding the p-th quantile, i.e. by defining the empirical c.d.f. We generate independent random vectors X_i = (X_{i1}, ..., X_{i10}) from the probability density f_{θ_0}(x_1, ..., x_10) and, with S_i = Σ_{j=1}^{10} X_{ij}, define

F̂(x) = (1/n) Σ_{i=1}^n I(S_i ≤ x).                                     (4.26)

Invert it (possibly with interpolation) to estimate the quantile

x̂_p = F̂^{-1}(p).                                                      (4.27)

If the true cumulative distribution function is differentiable, the variance of this quantile estimator is asymptotically related to the variance of our estimator of the cumulative distribution function,

var(x̂_p) ≈ var(F̂(x_p)) / (F′(x_p))^2,

so any variance reduction in the estimator of the c.d.f. is reflected, at least asymptotically, in a variance reduction in the estimator of the quantile. Using importance sampling, (4.25) is equivalent to the same technique but with

F̂_I(x) = (1/n) Σ_{i=1}^n W_i I(S_i ≤ x),  where                        (4.28)

W_i = f_{θ_0}(X_{i1}, ..., X_{i10}) / f_θ(X_{i1}, ..., X_{i10}).

Ideally we should choose the value of θ so that the variance of x̂_p, or of

W_i I(S_i ≤ x_p),

is as small as possible. This requires a wise guess or experimentation with various choices of θ. For a given θ we have another choice of empirical cumulative distribution function

F̂_{I2}(x) = (1 / Σ_{i=1}^n W_i) Σ_{i=1}^n W_i I(S_i ≤ x).              (4.29)

Both of these provide fairly crude estimates of the sample quantiles when observations are weighted and, as one does with the sample median, one could easily interpolate between adjacent values around the value of x_p.
The alternative (4.29) is motivated by the fact that the values W_i appear as weights attached to the observations S_i and it therefore seems reasonable to divide by the sum of the weights. In fact the expected value of the denominator is

E_θ{ Σ_{i=1}^n W_i } = n,

so the two denominators are similar. In the example where the X_{ij} are independent exponential(1), let us examine the weight on S_i determined by X_i = (X_{i1}, ..., X_{i10}):

W_i = f_{θ_0}(X_{i1}, ..., X_{i10}) / f_θ(X_{i1}, ..., X_{i10}) = ∏_{j=1}^{10} [exp(−X_{ij}) / (θ^{-1} exp(−X_{ij}/θ))] = θ^{10} exp{−S_i(1 − θ^{-1})}.

The renormalized alternative (4.29) might be necessary for estimating extreme quantiles when the number of simulations is small, but only the first provides a completely unbiased estimating function. In our case, using (4.28) with θ = 2.5 we obtained an estimator of F(x_{0.999}) with efficiency about 180 times that of a crude Monte Carlo simulation. There is some discussion of various renormalizations of the importance sampling weights in Hesterberg (1995).
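A minimal Matlab sketch of (4.28) for this example (illustrative only; the target point x is taken from a crude run, and the choice θ = 2.5 and the sample size are assumptions):

% estimate F(x) = P(S <= x) near the .999 quantile, S the sum of 10 exponential(1)
% variables, by importance sampling from exponentials with mean theta; a sketch only.
n=100000; theta=2.5;
Scrude=sort(sum(-log(rand(10,n)))); x=Scrude(round(.999*n));   % rough target point x
X=-theta*log(rand(10,n));                % X_ij from the exponential with mean theta
S=sum(X);                                % the sums S_i
W=theta^10*exp(-S*(1-1/theta));          % the importance weights W_i of (4.28)
FhatI=mean(W.*(S<=x));                   % importance sampling estimate of F(x)
vIS=var(W.*(S<=x))/n;                    % variance of the importance sampling estimate
vCR=.999*.001/n;                         % variance of the crude estimate of F(x)

The ratio vCR/vIS estimates the efficiency gain at that point; to estimate the quantile itself one would invert F̂_I as in (4.27).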


Importance Sampling, the Exponential Tilt and the Saddlepoint Approximation

When searching for a convenient importance distribution, particularly if we
wish to increase or decrease the frequency of observations in the tails, it is
quite common to embed a given density in an exponential family. For example suppose we wish to estimate an integral

∫ g(x)f(x)dx

where f(x) is a probability density function. Suppose K(s) denotes the cumulant generating function (the logarithm of the moment generating function) of the density f(x), i.e.

exp{K(s)} = ∫ e^{xs} f(x)dx.

The cumulant generating function is a useful summary of the moments of a distribution since the mean can be determined as K′(0) and the variance as K″(0). From this single probability density function, we can now produce a whole (exponential) family of densities

f_θ(x) = e^{θx − K(θ)} f(x)                                            (4.30)

of which f(x) is the special case corresponding to θ = 0. The density (4.30) is often referred to as an exponential tilt of the original density function; it increases the weight in the right tail for θ > 0 and decreases it for θ < 0.
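As a quick check of (4.30), not in the text but following directly from the definitions, consider tilting the exponential density. If f(x) = e^{−x}, x > 0, then K(s) = −ln(1 − s) for s < 1 and

f_θ(x) = e^{θx + ln(1−θ)} e^{−x} = (1 − θ) e^{−(1−θ)x},

an exponential density with mean 1/(1 − θ). In particular, the importance density with mean θ used in Example 36 is the exponential tilt of the exponential(1) density with tilt parameter 1 − θ^{-1}.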
This family of densities is closely related to the saddlepoint approximation. If we wish to estimate the value of a probability density function f(x) at a particular point x, then note that this could be obtained from (4.30) if we knew the probability density function f_θ(x). On the other hand, a normal approximation to a density is often reasonable at or around its mode, particularly if we are interested in the density of a sum or an average of independent random variables. The cumulant generating function of the density f_θ(x) is easily seen to be K(θ + s) − K(θ) and its mean is therefore K′(θ). If we choose the parameter θ = θ(x) so that

K′(θ) = x,                                                             (4.31)

then the density f_θ has mean x and variance K″(θ). How do we know that for a given value of x there exists a solution to (4.31)? From the properties of cumulant generating functions, K(t) is convex, increasing and K(0) = 0. This implies that as t increases, the slope K′(t) of the cumulant generating function is non-decreasing. It therefore approaches a limit x_max (finite or infinite) as t → ∞, and as long as we restrict the value of x in (4.31) to the interval x < x_max we can find a solution. The value of the N(x, K″(θ)) density at the point x is

f_θ(x) ≈ 1/√(2πK″(θ)),

and therefore the approximation to the density f(x) is

f(x) ≈ √(1/(2πK″(θ))) e^{K(θ) − θx},                                   (4.32)

where θ = θ(x) satisfies K′(θ) = x.

This is the saddlepoint approximation, discovered by Daniels (1954, 1980), and usually applied to the distribution of sums or averages of independent random variables because then the normal approximation is better motivated. Indeed, the saddlepoint approximation to the distribution of the sum of n independent identically distributed random variables is accurate to order O(n^{-1}) and, if we renormalize it to integrate to one, accuracy of order O(n^{-3/2}) is possible, substantially better than the order O(n^{-1/2}) of the usual normal approximation.
Consider, for example, the saddlepoint approximation to the Gamma(α, 1) distribution. Because the moment generating function of the Gamma(α, 1) distribution is

m(t) = 1/(1 − t)^α,   t < 1,

the cumulant generating function is

K(t) = ln(m(t)) = −α ln(1 − t),

K′(θ) = x implies θ(x) = 1 − α/x, and

K″(θ) = α/(1 − θ)^2  so that  K″(θ(x)) = x^2/α.

Therefore the saddlepoint approximation to the probability density function is
r
± ±
f (x) ' exp{’± ln(±/x) ’ x(1 ’ )}
2
2πx x
r
1 1 ’± ± ±’1
= ±2 e x exp(’x).

This is exactly the gamma density function with Stirling™s approximation replac-
ing “(±) and after renormalization this is exactly the Gamma density function.
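A brief Matlab check of this calculation (not from the text; gampdf is the Statistics Toolbox Gamma density):

alpha=5; x=0.5:0.5:15;
sp=sqrt(alpha./(2*pi*x.^2)).*exp(-alpha*log(alpha./x)-x.*(1-alpha./x)); % saddlepoint (4.32)
exact=gampdf(x,alpha,1);                 % exact Gamma(alpha,1) density
ratio=sp./exact                          % constant in x, near 1 (the Stirling error in Gamma(alpha))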
Since it is often computationally expensive to generate random variables whose distribution is a convolution of known densities, it is interesting to ask whether (4.32) makes this any easier. In many cases the saddlepoint approximation can be used to generate a random variable whose distribution is close to this convolution with high efficiency. For example, suppose that we wish to generate the random variable S_n = Σ_{i=1}^n X_i where each random variable X_i has the non-central chi-squared distribution with cumulant generating function

K(t) = 2λt/(1 − 2t) − (p/2) ln(1 − 2t).                                (4.33)

The parameter λ is the non-centrality parameter of the distribution and p is the degrees of freedom. Notice that the cumulant generating function of the sum takes the same form but with (λ, p) replaced by (nλ, np), so in effect we wish to generate a random variable with cumulant generating function (4.33) for large values of the parameters (λ, p). Instead we generate from the saddlepoint
approximation (4.32) to this distribution, and in fact we do this indirectly. If we change variables in (4.32) to determine the density of the new random variable Θ̃ which solves the equation

K′(Θ̃) = X,

then the saddlepoint approximation (4.32) is equivalent to specifying a probability density for this variable,

f_{Θ̃}(θ) = f(K′(θ)) dx/dθ = constant × √(K″(θ)) e^{K(θ) − θK′(θ)}.     (4.34)

In general, this probability density function can often be bounded above by
some density over the range of possible values of θ allowing us to generate ˜
by acceptance rejection. Then the value of the random variable is X = K 0 (˜).
In the particular case of the non-central chi-squared example above, swe may
take the dominating density to be the U [0, 1 ] density since (4.34) is bounded.
2



Combining Monte Carlo Estimators.

We have now seen a number of different variance reduction techniques, and there are many more possible. Many of these methods, such as importance and stratified sampling, have associated parameters which may be chosen in different ways. The variance formula may be used as a basis for choosing a "best" method, but these variances and efficiencies must also be estimated from the simulation and it is rarely clear a priori which sampling procedure and estimator is best. For example, if a function f is monotone on [0, 1] then an antithetic variate can be introduced with an estimator of the form

θ̂_a1 = (1/2)[f(U) + f(1 − U)],   U ∼ U[0, 1],                          (4.35)

but if the function is increasing to a maximum somewhere around 1/2 and then decreasing thereafter we might prefer

θ̂_a2 = (1/4)[f(U/2) + f((1 − U)/2) + f((1 + U)/2) + f(1 − U/2)].       (4.36)

Notice that any weighted average of these two unbiased estimators of θ would also provide an unbiased estimator of θ. The large number of potential variance reduction techniques is an embarrassment of riches. Which variance reduction method should we use, and how will we know whether it is better than the competitors? Fortunately, the answer is often to use "all of the methods" (within reason of course); choosing a single method is often neither necessary nor desirable. Rather it is preferable to use a weighted average of the available estimators, with the optimal choice of the weights provided by regression.

Suppose in general that we have k estimators or statistics θ̂_i, i = 1, ..., k, all unbiased estimators of the same parameter θ, so that E(θ̂_i) = θ for all i. In vector notation, letting Θ̃′ = (θ̂_1, ..., θ̂_k), we write E(Θ̃) = 1θ where 1 is the k-dimensional column vector of ones, so that 1′ = (1, 1, ..., 1). Let us suppose for the moment that we know the variance-covariance matrix V of the vector Θ̃, defined by

V_{ij} = cov(θ̂_i, θ̂_j).


Theorem 37 (best linear combinations of estimators)
The linear combination of the θ̂_i which provides an unbiased estimator of θ and has minimum variance among all linear unbiased estimators is

θ̂_blc = Σ_i b_i θ̂_i                                                    (4.37)

where the vector b = (b_1, ..., b_k)′ is given by

b = (1′V^{-1}1)^{-1} V^{-1}1.

The variance of the resulting estimator is

var(θ̂_blc) = b′Vb = 1/(1′V^{-1}1).


Proof. The proof is straightforward. It is easy to see that for any linear combination (4.37) the variance of the estimator is

b′Vb

and we wish to minimize this quadratic form as a function of b subject to the constraint that the coefficients add to one, or that

b′1 = 1.

Introducing a Lagrange multiplier, we set the derivatives with respect to the components b_i equal to zero:

∂/∂b {b′Vb + λ(b′1 − 1)} = 0,  or
2Vb + λ1 = 0,
b = constant × V^{-1}1,

and upon requiring that the coefficients add to one, we discover that the value of the constant above is (1′V^{-1}1)^{-1}.

This theorem indicates that the ideal linear combination of estimators has coefficients proportional to the row sums of the inverse covariance matrix. Notably, the variance of a particular estimator θ̂_i is an ingredient in that sum, but only one of many. In practice, of course, we almost never know the variance-covariance matrix V of a vector of estimators Θ̃. However, when we do a simulation evaluating these estimators using the same uniform input to each, we obtain independent replicated values of Θ̃. This permits us to estimate the covariance matrix V, and since we typically conduct many simulations this estimate can be very accurate. Let us suppose that we have n simulated values of the vector Θ̃, and call these Θ̃_1, ..., Θ̃_n. As usual we estimate the covariance matrix V using the sample covariance matrix

V̂ = (1/(n − 1)) Σ_{i=1}^n (Θ̃_i − Θ̄)(Θ̃_i − Θ̄)′

where

Θ̄ = (1/n) Σ_{i=1}^n Θ̃_i.


Let us return to the example and attempt to find the best combination of the many estimators we have considered so far. To this end, let

θ̂_1 = (0.53/2)[f(.47 + .53u) + f(1 − .53u)],  an antithetic estimator,
θ̂_2 = (0.37/2)[f(.47 + .37u) + f(.84 − .37u)] + (0.16/2)[f(.84 + .16u) + f(1 − .16u)],
θ̂_3 = 0.37 f(.47 + .37u) + 0.16 f(1 − .16u),  (stratified-antithetic)
θ̂_4 = ∫_0^1 g(x)dx + [f(u) − g(u)],  (control variate)
θ̂_5 = θ̂_im,  the importance sampling estimator (4.23).

Then θ̂_2 and θ̂_3 are both stratified-antithetic estimators, θ̂_4 is a control variate estimator and θ̂_5 is the importance sampling estimator discussed earlier, all obtained from a single input uniform random variate U. In order to determine the optimal linear combination we need to generate simulated values of all 5 estimators using the same uniform random numbers as inputs. We determine the best linear combination of these estimators using


function [o,v,b,V]=optimal(U)

% generates the optimal linear combination of five estimators and outputs
% the average estimator, its variance and the weights

% input U, a row vector of U[0,1] random numbers

T1=(.53/2)*(fn(.47+.53*U)+fn(1-.53*U));

T2=.37*.5*(fn(.47+.37*U)+fn(.84-.37*U))+.16*.5*(fn(.84+.16*U)+fn(1-.16*U));

T3=.37*fn(.47+.37*U)+.16*fn(1-.16*U);

intg=2*(.53)^3+.53^2/2;

T4=intg+fn(U)-GG(U);

T5=importance('fn','2*(IM-.47)/(.53)^2;','.47+.53*sqrt(u);',U);

X=[T1' T2' T3' T4' T5'];   % matrix whose columns are replications of the same
                           % estimator; each row gives the 5 estimators using the same U

mean(X)                    % displays the means of the five estimators

V=cov(X);                  % this estimates the covariance matrix V

on=ones(5,1);

V1=inv(V);                 % the inverse of the covariance matrix

b=V1*on/(on'*V1*on);       % vector of coefficients of the optimal linear combination

o=mean(X*b);               % the optimal linear combination averaged over the inputs

v=1/(on'*V1*on);           % variance of the optimal linear combination based on a single U


One run of this estimator, called with [o,v,b,V]=optimal(unifrnd(0,1,1,1000000)), yields

o = 0.4615
b′ = [−0.5499  1.4478  0.1011  0.0491  −0.0481].

The estimate 0.4615 is accurate to at least four decimals, which is not surprising since the variance per uniform random number input is v = 1.13 × 10^{-5}. In other words, the variance of the mean based on 1,000,000 uniform inputs is 1.13 × 10^{-10}, the standard error is around .00001, so we can expect accuracy to at least 4 decimal places. Note that some of the weights are negative and others are greater than one. Do these negative weights indicate estimators that are worse than useless? The effect of some estimators may be, on subtraction, to render the remaining function more linear and more easily estimated using another method, and negative coefficients are quite common in regression generally. The efficiency gain over crude Monte Carlo is an extraordinary 40,000. However, since there are 10 function evaluations for each uniform variate input, the efficiency when we adjust for the number of function evaluations is 4,000. This simulation, using 1,000,000 uniform random numbers and taking about 63 seconds on a Pentium IV (2.4 GHz) (including the time required to generate all five estimators), is equivalent to forty billion simulations by crude Monte Carlo, a major task on a supercomputer!
If we intended to use this simulation method repeatedly, we might well wish to see whether some of the estimators can be omitted without too much loss of information. Since the variance of the optimal estimator is 1/(1′V^{-1}1), we might use this to attempt to select one of the estimators for deletion. Notice that it is not so much the covariance matrix V of the estimators which enters into Theorem 37 but its inverse J = V^{-1}, which we can consider a type of information matrix by analogy with maximum likelihood theory. For example we could choose to delete the i-th estimator, i.e. delete the i-th row and column of V, where i is chosen to have the smallest effect on 1/(1′V^{-1}1) or its reciprocal 1′J1 = Σ_i Σ_j J_{ij}. In particular, if we let V_(i) be the matrix V with the i-th row and column deleted and J_(i) = V_(i)^{-1}, then we can identify 1′J1 − 1′J_(i)1 as the loss of information when the i-th estimator is deleted. Since not all estimators have the same number of function evaluations, we should adjust this information by FE(i), the number of function evaluations required by the i-th estimator. In other words, if an estimator i is to be deleted, it should be the one corresponding to

min_i { (1′J1 − 1′J_(i)1) / FE(i) }.

We should drop this i-th estimator if the minimum is less than the information per function evaluation in the combined estimator, because this means we will increase the information available in our simulation per function evaluation. In the above example with all five estimators included, 1′J1 = 88,757 (with 10 function evaluations per uniform variate) so the information per function evaluation is 8,876.

i    1′J1 − 1′J_(i)1    FE(i)    (1′J1 − 1′J_(i)1)/FE(i)
1    88,048             2        44,024
2    87,989             4        21,997
3    28,017             2        14,008
4    55,725             1        55,725
5    32,323             1        32,323

In this case, if we were to eliminate one of the estimators, our choice would likely be number 3 since it contributes the least information per function evaluation. However, since all contribute more than 8,876 per function evaluation, we should likely retain all five.
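A small Matlab sketch of this deletion criterion (an illustration: it assumes the covariance matrix V has already been estimated, e.g. as in the optimal function above, and that the function-evaluation counts are those in the table):

% information lost per function evaluation when each estimator is deleted; a sketch.
% V is the 5x5 estimated covariance matrix, FE the function evaluations per estimator.
FE=[2 4 2 1 1];
J=inv(V); on=ones(5,1);
info=on'*J*on;                                   % 1'J1 for the full set of estimators
loss=zeros(1,5);
for i=1:5
   keep=setdiff(1:5,i);                          % delete the i-th row and column of V
   Ji=inv(V(keep,keep));
   loss(i)=info-sum(sum(Ji));                    % 1'J1 - 1'J_(i)1
end
[loss./FE; ones(1,5)*info/sum(FE)]               % compare with information per function evaluation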


Common Random Numbers.

We now discuss another variance reduction technique, closely related to antithetic variates, called common random numbers, used for example whenever we wish to estimate the difference in performance between two systems or any other quantity involving a difference, such as the slope of a function.

Example 38 For a simple example suppose we have two estimators θ̂_1, θ̂_2 of the "center" of a symmetric distribution. We would like to know which of these estimators is better in the sense that it has smaller variance when applied to a sample from a specific distribution symmetric about its median. If both estimators are unbiased estimators of the median, then the first estimator is better if

var(θ̂_1) < var(θ̂_2)

and so we are interested in estimating a quantity like

E h_1(X) − E h_2(X)

where X is a vector representing a sample from the distribution and h_1(X) = θ̂_1^2, h_2(X) = θ̂_2^2. There are at least two ways of estimating this difference:

1. Generate samples and hence values of h_1(X_i), i = 1, ..., n and h_2(X_j), j = 1, 2, ..., m independently and use the estimator

(1/n) Σ_{i=1}^n h_1(X_i) − (1/m) Σ_{j=1}^m h_2(X_j).

2. Generate samples and hence values of h_1(X_i), h_2(X_i), i = 1, ..., n independently and use the estimator (compare the sketch following this list)

(1/n) Σ_{i=1}^n (h_1(X_i) − h_2(X_i)).
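The following Matlab sketch (illustrative only; the sample size, the normal distribution and the choice of the sample mean and median as the two estimators are assumptions, not taken from the text) compares the two strategies:

% compare independent samples versus common samples for estimating
% E(mean(X)^2) - E(median(X)^2), X a sample of size 20 from a standard normal.
n=10000; m=20;
X=randn(m,n); Y=randn(m,n);                % two independent sets of samples
d1=mean(X).^2-median(Y).^2;                % method 1: independent samples
d2=mean(X).^2-median(X).^2;                % method 2: common samples
[var(d1)/n var(d2)/n]                      % variances of the two estimators of the difference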

It seems intuitive that the second method is preferable since it removes the variability due to the particular sample from the comparison. This is a common type of problem in which we want to estimate the difference between two expected values. For example, we may be considering investing in a new piece of equipment that will speed up processing at one node of a network and we wish to estimate the expected improvement in performance between the new system and the old. In general, suppose that we wish to estimate the difference between two expectations, say

E h_1(X) − E h_2(Y)                                                    (4.38)

where the random variable or vector X has cumulative distribution function F_X and Y has cumulative distribution function F_Y. Notice that the variance of a Monte Carlo estimator

var[h_1(X) − h_2(Y)] = var[h_1(X)] + var[h_2(Y)] − 2cov{h_1(X), h_2(Y)}   (4.39)

is small if we can induce a high degree of positive correlation between the generated random variables X and Y. This is precisely the opposite of the problem that led to antithetic random numbers, where we wished to induce a high degree of negative correlation. The following lemma is due to Hoeffding (1940) and provides a useful bound on the joint cumulative distribution function of two random variables X and Y. Suppose X, Y have cumulative distribution functions F_X(x) and F_Y(y) respectively and joint cumulative distribution function G(x, y) = P[X ≤ x, Y ≤ y].

Lemma 39 (a) The joint cumulative distribution function G of (X, Y) always satisfies

(F_X(x) + F_Y(y) − 1)^+ ≤ G(x, y) ≤ min(F_X(x), F_Y(y))                (4.40)

for all x, y.

(b) Assume that F_X and F_Y are continuous functions. In the case that X = F_X^{-1}(U) and Y = F_Y^{-1}(U) for U uniform on [0, 1], equality is achieved on the right: G(x, y) = min(F_X(x), F_Y(y)). In the case that X = F_X^{-1}(U) and Y = F_Y^{-1}(1 − U) there is equality on the left: (F_X(x) + F_Y(y) − 1)^+ = G(x, y).

Proof. (a) Note that

P[X ≤ x, Y ≤ y] ≤ P[X ≤ x]  and similarly  ≤ P[Y ≤ y].

This shows that

G(x, y) ≤ min(F_X(x), F_Y(y)),

verifying the right side of (4.40). Similarly, for the left side,

P[X ≤ x, Y ≤ y] = P[X ≤ x] − P[X ≤ x, Y > y]
               ≥ P[X ≤ x] − P[Y > y]
               = F_X(x) − (1 − F_Y(y))
               = F_X(x) + F_Y(y) − 1.

Since it is also non-negative, the left side follows.
For (b) suppose X = F_X^{-1}(U) and Y = F_Y^{-1}(U); then

P[X ≤ x, Y ≤ y] = P[F_X^{-1}(U) ≤ x, F_Y^{-1}(U) ≤ y]
               = P[U ≤ F_X(x), U ≤ F_Y(y)]

since P[X = x] = 0 and P[Y = y] = 0. But

P[U ≤ F_X(x), U ≤ F_Y(y)] = min(F_X(x), F_Y(y)),

verifying the equality on the right of (4.40) for common random numbers.

By a similar argument,

P[F_X^{-1}(U) ≤ x, F_Y^{-1}(1 − U) ≤ y] = P[U ≤ F_X(x), 1 − U ≤ F_Y(y)]
                                       = P[U ≤ F_X(x), U ≥ 1 − F_Y(y)]
                                       = (F_X(x) − (1 − F_Y(y)))^+,

verifying the equality on the left.



The following theorem supports the use of common random numbers to maximize covariance and antithetic random numbers to minimize covariance.

Theorem 40 (maximum/minimum covariance)
Suppose h_1 and h_2 are both non-decreasing (or both non-increasing) functions. Subject to the constraint that X, Y have cumulative distribution functions F_X, F_Y respectively, the covariance

cov[h_1(X), h_2(Y)]

is maximized when Y = F_Y^{-1}(U) and X = F_X^{-1}(U) (i.e. for common Uniform[0, 1] random numbers) and is minimized when Y = F_Y^{-1}(U) and X = F_X^{-1}(1 − U) (i.e. for antithetic random numbers).



Proof. We will sketch a proof of the theorem when the distributions are all continuous and h_1, h_2 are differentiable. Define G(x, y) = P[X ≤ x, Y ≤ y]. The following representation of covariance is useful: define

H(x, y) = P(X > x, Y > y) − P(X > x)P(Y > y)                           (4.41)
        = G(x, y) − F_X(x)F_Y(y).

Notice that, using integration by parts,

∫∫ H(x, y) h_1′(x) h_2′(y) dx dy
  = −∫∫ (∂/∂x) H(x, y) h_1(x) h_2′(y) dx dy
  = ∫∫ (∂^2/∂x∂y) H(x, y) h_1(x) h_2(y) dx dy
  = ∫∫ h_1(x) h_2(y) g(x, y) dx dy − ∫ h_1(x) f_X(x) dx ∫ h_2(y) f_Y(y) dy
  = cov(h_1(X), h_2(Y)),                                               (4.42)

where g(x, y), f_X(x), f_Y(y) denote the joint probability density function and the probability density functions of X and of Y respectively, and all integrals are over (−∞, ∞). In fact this result holds in general, even without the assumption that the distributions are continuous. The covariance between h_1(X) and h_2(Y), for differentiable functions h_1 and h_2, is

cov(h_1(X), h_2(Y)) = ∫∫ H(x, y) h_1′(x) h_2′(y) dx dy.

The formula shows that to maximize the covariance, if h_1, h_2 are both increasing or both decreasing functions, it is sufficient to maximize H(x, y) for each x, y, since h_1′(x), h_2′(y) are both non-negative. Since we are constraining the marginal cumulative distribution functions F_X, F_Y, this is equivalent to maximizing G(x, y) subject to the constraints

lim_{y→∞} G(x, y) = F_X(x),
lim_{x→∞} G(x, y) = F_Y(y).

Lemma 39 shows that the maximum is achieved when common random numbers are used and the minimum is achieved when we use antithetic random numbers.
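A quick Matlab check of Theorem 40 (an illustration, not from the text; the exponential marginals and the increasing functions h_1(x) = x, h_2(y) = y^2 are arbitrary choices):

n=1e6; U=rand(1,n);
X =-log(1-U);                    % X = F_X^{-1}(U), exponential with mean 1
Yc=-2*log(1-U);                  % common random numbers: Y = F_Y^{-1}(U), mean 2
Ya=-2*log(U);                    % antithetic: Y = F_Y^{-1}(1-U)
Yi=-2*log(1-rand(1,n));          % independent Y, for reference
cov_common=mean(X.*Yc.^2)-mean(X)*mean(Yc.^2)   % largest covariance
cov_anti  =mean(X.*Ya.^2)-mean(X)*mean(Ya.^2)   % smallest (most negative) covariance
cov_indep =mean(X.*Yi.^2)-mean(X)*mean(Yi.^2)   % near zero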


We can argue intuitively for the use of common random numbers in the case of a discrete distribution with probability on the points indicated in Figure 4.5. This figure corresponds to a joint distribution with the following probabilities, say

x                  0     0.25   0.25   0.75   0.75   1
y                  0     0.25   0.75   0.25   0.75   1
P[X = x, Y = y]    .1    .2     .2     .1     .2     .2

Suppose we wish to maximize P[X > x, Y > y] subject to the constraint that the probabilities P[X > x] and P[Y > y] are fixed. We have indicated arbitrary fixed values of (x, y) in the figure. Note that if there is any weight attached to the point in the lower right quadrant (labelled "P_2"), some or all of this weight can be reassigned to the point P_3 in the lower left quadrant provided there is an equal movement of weight from the upper left P_4 to the upper right P_1. Such a movement of weight will increase the value of G(x, y) without affecting P[X ≤ x] or P[Y ≤ y]. The weight that we are able to transfer in this example is 0.1, the minimum of the weights on P_4 and P_2. In general, this continues until there is no weight in one of the off-diagonal quadrants for every choice of (x, y). The resulting distribution in this example is given by

x                  0     0.25   0.25   0.75   0.75   1
y                  0     0.25   0.75   0.25   0.75   1
P[X = x, Y = y]    .1    .3     0      .1     .3     .2

and it is easy to see that such a joint distribution can be generated from common random numbers X = F_X^{-1}(U), Y = F_Y^{-1}(U).


Conditioning

We now consider a simple but powerful generalization of control variates. Suppose that we can decompose a random variable T into two components T_1, ε:

T = T_1 + ε                                                            (4.43)

so that T_1 and ε are uncorrelated:

cov(T_1, ε) = 0.




Figure 4.5: Changing weights on points to maximize covariance



Assume as well that E(ε) = 0. Regression is one method for determining such a decomposition, and the error term ε in regression satisfies these conditions. Then T_1 has the same mean as T and it is easy to see that

var(T) = var(T_1) + var(ε),

so T_1 has smaller variance than T (unless ε = 0 with probability 1). This means that if we wish to estimate the common mean of T or T_1, the estimator T_1 is preferable, since it has the same mean and smaller variance.

One special case is variance reduction by conditioning. For the standard definition and properties of conditional expectation see the Appendix. One common definition of E[X|Y] is the unique (with probability one) function g(Y) of Y which minimizes E{X − g(Y)}^2. This definition only applies to random variables X which have finite variance and so requires some modification when E(X^2) = ∞, but we will assume here that all random variables, say X, Y, Z, have finite variances. We can define conditional covariance using conditional expectation as

cov(X, Y|Z) = E[XY|Z] − E[X|Z]E[Y|Z]

and conditional variance:

var(X|Z) = E(X^2|Z) − (E[X|Z])^2.

The variance reduction through conditioning is justified by the following well-known result:

Theorem 41 (a) E(X) = E{E[X|Y]}
(b) cov(X, Y) = E{cov(X, Y|Z)} + cov{E[X|Z], E[Y|Z]}
(c) var(X) = E{var(X|Z)} + var{E[X|Z]}

This theorem is used as follows. Suppose we are considering a candidate estimator θ̂, an unbiased estimator of θ. We also have an arbitrary random variable Z which is somehow related to θ̂. Suppose that we have chosen Z carefully so that we are able to calculate the conditional expectation T_1 = E[θ̂|Z]. Then by part (a) of the above theorem, T_1 is also an unbiased estimator of θ. Define

ε = θ̂ − T_1.

By part (c),

var(θ̂) = var(T_1) + var(ε)

and var(T_1) = var(θ̂) − var(ε) < var(θ̂). In other words, for any variable Z, E[θ̂|Z] has the same expectation as θ̂ but smaller variance, and the decrease in variance is largest if Z and θ̂ are nearly independent, because in this case E[θ̂|Z] is close to a constant and its variance close to zero. In general the search for an appropriate Z so as to reduce the variance of an estimator by conditioning requires finding a random variable Z such that:

1. the conditional expectation E[θ̂|Z] of the original estimator is computable;

2. var(E[θ̂|Z]) is substantially smaller than var(θ̂).




Figure 4.6: Example of the Hit and Miss Method



Example 42 (hit or miss)


Suppose we wish to estimate the area under a graph f(x) by the hit-and-miss method. A crude method would involve determining a multiple c of a probability density function g(x) which dominates f(x), so that cg(x) ≥ f(x) for all x. We can generate points (X, Y) at random, uniformly distributed under the graph of cg(x), by generating X by inverse transform, X = G^{-1}(U_1), where G(x) is the cumulative distribution function corresponding to the density g, and then generating Y from the Uniform[0, cg(X)] distribution, say Y = cg(X)U_2. An example, with g(x) = 2x, 0 < x < 1, and c = 1/4, is given in Figure 4.6.
The hit-and-miss estimator of the area under the graph of f is obtained by generating such random points (X, Y) and counting the proportion that fall under the graph of f, i.e. for which Y ≤ f(X). This proportion estimates the probability

P[Y ≤ f(X)] = (area under f(x)) / (area under cg(x)) = (area under f(x)) / c

since g(x) is a probability density function. Notice that if we define

W = c  if Y ≤ f(X),   W = 0  if Y > f(X),

then

E(W) = c × (area under f(x)) / (area under cg(x)) = area under f(x),

so W is an unbiased estimator of the parameter that we wish to estimate. We might therefore estimate the area under f(x) using the Monte Carlo estimator θ̂_HM = (1/n) Σ_{i=1}^n W_i based on independent values of W_i. This is the "hit-or-miss" estimator. However, in this case it is easy to find a random variable Z such that the conditional expectation E(W|Z) can be determined in closed form. In fact, if we choose Z = X we obtain

E[W|X] = f(X)/g(X).

This is therefore an unbiased estimator of the same parameter and it has smaller variance than W. For a sample of size n we should replace the hit-or-miss estimator θ̂_HM by the estimator

θ̂_Cond = (1/n) Σ_{i=1}^n f(X_i)/g(X_i) = (1/n) Σ_{i=1}^n f(X_i)/(2X_i)

with X_i generated from X_i = G^{-1}(U_i) = √U_i, i = 1, 2, ..., n, and U_i ∼ Uniform[0,1]. In this case the conditional expectation results in a familiar form for the estimator θ̂_Cond: it is simply an importance sampling estimator with g(x) the importance distribution. However, this derivation shows that the estimator θ̂_Cond has smaller variance than θ̂_HM.
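A brief Matlab sketch comparing the two estimators in this example (an illustration; it assumes the vectorized integrand fn of the earlier sections and that the constant c = 1/4 indeed satisfies cg(x) ≥ fn(x)):

% compare the hit-or-miss estimator W with its conditioned version f(X)/g(X)
% for an integrand fn on [0,1]; g(x)=2x, and c is assumed to dominate: c*g(x) >= fn(x).
n=100000; c=1/4;
U1=rand(1,n); U2=rand(1,n);
X=sqrt(U1);                        % X = G^{-1}(U1) where G(x) = x^2
Y=c*2*X.*U2;                       % Y uniform on [0, c*g(X)]
W=c*(Y<=fn(X));                    % hit-or-miss variable W
Tcond=fn(X)./(2*X);                % conditional estimator E[W|X] = f(X)/g(X)
[mean(W) mean(Tcond)]              % both estimate the area under fn
[var(W)/n var(Tcond)/n]            % variances of the two estimators of the area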

Problems

1. Use both crude and antithetic random numbers to integrate the function

∫_0^1 (e^u − 1)/(e − 1) du.

(a) What is the efficiency gain attributed to the use of antithetic random numbers?

(b) How large a sample size would I need, using antithetic and crude
Monte Carlo, in order to estimate the above integral, correct to four
decimal places, with probability at least 95%?

2. Under what conditions on f does the use of antithetic random numbers completely correct for the variability of the Monte-Carlo estimator? i.e. when is var(f(U) + f(1 − U)) = 0?

3. Suppose that F(x) is the Normal(µ, σ^2) cumulative distribution function. Prove that F^{-1}(1 − U) = 2µ − F^{-1}(U) and therefore, if we use antithetic random numbers to generate two normal random variables X_1, X_2 having mean µ and variance σ^2, this is equivalent to setting X_2 = 2µ − X_1. In other words, if we wish to use antithetic random numbers for normal variates, it is not necessary to generate the normal random variables using the inverse transform method.

4. Show that the variance of a weighted average

var(αX + (1 − α)W)

is minimized over α when

α = (var(W) − cov(X, W)) / (var(W) + var(X) − 2cov(X, W)).

Determine the resulting minimum variance. What if the random variables X, W are independent?
PROBLEMS 253

5. Use a stratified random sample to integrate the function

∫_0^1 (e^u − 1)/(e − 1) du.

What do you recommend for intervals (two or three) and sample sizes? What is the efficiency gain?

6. Use a combination of stratified random sampling and an antithetic random number in the form

(1/2)[f(U/2) + f(1 − U/2)]

to integrate the function

∫_0^1 (e^u − 1)/(e − 1) du.

What is the efficiency gain?

7. In the case f(x) = (e^x − 1)/(e − 1), use g(x) = x as a control variate to integrate over [0,1]. Show that the variance is reduced by a factor of approximately 60. Is there much additional improvement if we use a more general quadratic function of x?

8. The second version of the control variate Monte-Carlo estimator,

θ̂_cv = (1/n) Σ_{i=1}^n {f(U_i) − β[g(U_i) − E(g(U_i))]},

is an improved control variate estimator which is equivalent to the first version in the case β = 1. In the case f(x) = (e^x − 1)/(e − 1), consider using g(x) = x as a control variate to integrate over [0,1]. Determine by simulation how much better θ̂_cv is than the basic control variate estimator (β = 1). Show that the variance is reduced by a factor of approximately 60 over crude Monte Carlo. Is there much additional improvement if we use a more general quadratic function of x for g(x)?

9. It has been suggested that stocks are not log-normally distributed but that the distribution can be well approximated by replacing the normal distribution by a Student t distribution. Suppose that the daily returns X_i are independent with probability density function f(x) = c(1 + (x/b)^2)^{-2} (the rescaled Student distribution with 3 degrees of freedom). We wish to estimate a weekly Value at Risk VaR_{.95}, a value v such that P[Σ_{i=1}^5 X_i < v] = 0.95. If we wish to do this by simulation, suggest an appropriate method involving importance sampling. Implement it and estimate the variance reduction.

10. Suppose three different simulation estimators Y_1, Y_2, Y_3 have means which depend on two unknown parameters θ_1, θ_2, so that Y_1, Y_2, Y_3 are unbiased estimators of θ_1, θ_1 + θ_2, θ_2 respectively. Assume that var(Y_i) = 1, cov(Y_i, Y_j) = −1/2, and we want to estimate the parameter θ_1. Should we use only the estimator Y_1, which is the unbiased estimator of θ_1, or some linear combination of Y_1, Y_2, Y_3? Compare the number of simulations necessary for a given degree of accuracy.

11. In the case f(x) = (e^x − 1)/(e − 1), use g(x) = x as a control variate to integrate over [0,1]. Find the optimal linear combination using estimators (4.35) and (4.36), an importance sampling estimator and the control variate estimator above. What is the efficiency gain over crude Monte-Carlo?

12. Show that the Jacobian of the transformation used in the proof of Theorem 23, (x, m) → (x, y) where y = exp(−(2m − x)^2/2), is given by 1/(2y√(−2 ln(y))).
Chapter 5

Simulating the Value of Options

Asian Options

An Asian option, at expiration T, has value determined not by the closing price of the underlying asset, as for a European option, but by an average price of the asset over an interval. For example, a discretely sampled Asian call option on an asset with price process S(t) pays an amount at maturity equal to max(0, S̄_k − K), where S̄_k = (1/k) Σ_{i=1}^k S(iT/k) is the average asset price at k equally spaced time points in the time interval (0, T). Here k depends on the frequency of sampling (e.g. if T = .25 (years) and sampling is weekly, then k = 13). If S(t) follows a geometric Brownian motion, then S̄_k is the sum of lognormally distributed random variables, and the distribution of the sum or average of lognormal random variables is very difficult to express analytically. For this reason we will resort to pricing the Asian option using simulation. Notice, however, that in contrast to the arithmetic average, the distribution of the geometric average can easily be obtained.

The geometric mean of n values X_1, ..., X_n is (X_1 X_2 ··· X_n)^{1/n} = exp{(1/n) Σ_{i=1}^n ln(X_i)}, and if the random variables X_i are each lognormally distributed then this results in adding the normally distributed random variables ln(X_i) in the exponent, a much more familiar operation. In fact the sum in the exponent, (1/n) Σ_{i=1}^n ln(X_i), is normally distributed, so the geometric average will have a lognormal distribution.

Our objective is to determine the value of the Asian option E(V_1) with

V_1 = e^{−rT} max(0, S̄_k − K).

Since we expect geometric means to be close to arithmetic means, a reasonable control variate is the random variable V_2 = e^{−rT} max(0, S̃_k − K), where S̃_k = {∏_{i=1}^k S(iT/k)}^{1/k} is the geometric mean. Assume that V_1 and V_2 are obtained from the same simulation and are therefore possibly correlated. Of course V_2 is only useful as a control variate if its expected value can be determined analytically or numerically more easily than that of V_1, but in view of the fact that V_2 has a known lognormal distribution, the prospects of this are excellent. Since S(t) = S_0 e^{Y(t)} where Y(t) is a Brownian motion with Y(0) = 0, drift r − σ^2/2 and diffusion σ, it follows that S̃_k has the same distribution as

S_0 exp{ (1/k) Σ_{i=1}^k Y(iT/k) }.                                    (5.1)


The exponent is a weighted average of the independent normal increments of the process and is therefore normally distributed. In particular, if we set

Ȳ = (1/k) Σ_{i=1}^k Y(iT/k)
  = (1/k)[ kY(T/k) + (k − 1){Y(2T/k) − Y(T/k)} + (k − 2){Y(3T/k) − Y(2T/k)} + ... + {Y(T) − Y((k − 1)T/k)} ],

then we can find the mean and variance of Ȳ:

E(Ȳ) = ((r − σ^2/2)/k) Σ_{i=1}^k iT/k = (r − σ^2/2) (k + 1)T/(2k) = µ̃T, say,

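To see why the geometric average is such an effective control variate, the following Matlab sketch (parameter values are hypothetical, chosen only for illustration) simulates both payoffs from the same paths and reports their correlation; the lognormal distribution of S̃_k derived above would then supply E(V_2) exactly for the control variate adjustment.

% simulate the arithmetic and geometric Asian call payoffs from the same
% geometric Brownian motion paths; parameter values are assumptions for illustration.
S0=100; K=100; r=0.05; sigma=0.2; T=0.25; k=13; n=100000;
dt=T/k;
Z=randn(n,k);
logS=log(S0)+cumsum((r-sigma^2/2)*dt+sigma*sqrt(dt)*Z,2);   % log S(iT/k), i=1..k
S=exp(logS);
V1=exp(-r*T)*max(mean(S,2)-K,0);            % arithmetic-average payoff
V2=exp(-r*T)*max(exp(mean(logS,2))-K,0);    % geometric-average payoff
corrcoef(V1,V2)                             % typically very close to 1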