. 10
( 11)


the expectation conditional (as always) on the value of the open O. The e¬ect
of a knock-out lower barrier at Oe’a is essentially the same but with b replaced
by a, namely
a(a + ln(C/O))
E[ψ(C)(1 ’ exp{’2 })].

In the next section we consider the e¬ect of two barriers, both an upper and a
lower barrier.

One Process, Two barriers.

We have discussed a simple device above for generating jointly the high and the
close or the low and the close of a process given the value of the open. The joint
distribution of H, L, C given the value of O or the distribution of C in the case
of upper and lower barriers is more problematic. Consider a single factor model
and two barriers- an upper and a lower barrier. Note that the high and the
low in any given interval is dependent, but if we simulate a path in relatively
short segments, by ¬rst generating n increments and then generating the highs
and lows within each increment, then there is an extremely low probability
that the high and low of the process will both lie in the same short increment.
For example for a Brownian motion with the time interval partitioned into 5
equal subintervals, the probability that the high and low both occur in the
same increment is less than around 0.011 whatever the drift. If we increase the
number of subintervals to 10, this is around 0.0008. This indicates that provided
we are willing to simulate highs, lows and close in ten subintervals, pretending

that within subintervals the highs and lows are conditionally independent, the
error in our approximation is very small.

An alternative, more computationally intensive, is to di¬erentiate the in¬nite
series expression for the probability P (H · b, L ≥ a, C = u|O = 0] A ¬rst step
in this direction is the the following result, obtained from the re¬‚ection principle
with two barriers.

Theorem 47 For a Brownian motion process

dSt = µdt + dWt , S0 = 0

de¬ned on [0, 1] and for ’a < u < b,

P (L < ’a or H > b|C = u)

[φ{2n(a + b) + u} + φ{2n(a + b) ’ 2a ’ u}
φ(u) n=1

+ φ{’2n(b + a) + u} + φ{2n(b + a) + 2a + u}]

where φ is the N (0, 1) probability density function.

Proof. The proof is a well-known application of the re¬‚ection principal.
It is su¬cient to prove the result in the case µ = 0 since the conditional
distribution of L, H given C does not depend on µ (A statistician would say
that C is a su¬cient statistic for the drift parameter). Denote the following
paths determined by their behaviour on 0 < t < 1. All paths are assumed to
end at C = u.

= H > b (path goes above b)
path goes above b and then falls below ’a
goes above b then falls below ’a then rises above b
L < ’a
path falls below ’a then rises above b
falls below ’a then rises above b then falls below ’a
For an arbitrary event A, denote by P (A|u) probability of the event conditional
on C = u. Then according to the re¬‚ection principal the probability that the
Brownian motion leaves the interval [’a, b] is given from an inclusion-exclusion
argument by

P (A+1 |u) ’ P (A+2 |u) + P (A+3 |u) ’ · · · (5.30)

+P (A’1 |u) ’ P (A’2 |u) + P (A’3 |u) · · ·

This can be veri¬ed by considering the paths in Figure 5.7. (It should be noted
here that, as in our application of the re¬‚ection principle in the one-barrier case,
the re¬‚ection principle allows us to show that the number of paths in two sets is
the same, and this really only translates to probability in the case of a discrete
sample space, for example a simple random walk that jumps up or down by a
¬xed amount in discrete time steps. This result for Brownian motion obtains if
we take a limit over a sequence of simple random walks approaching a Brownian
motion process.)
Note that

φ(2b ’ u)
P (A+1 |u) =
φ{2n(a + b) + u}
P (A+2n |u) =
φ{2n(a + b) ’ 2a ’ u}
P (A+(2n’1) |u) =

Figure 5.7: The Re¬‚ection principle with Two Barriers


φ(’2a ’ u)
P (A’1 |u) =
φ{’2n(b + a) + u}
P (A’2n |u) =
φ{2n(b + a) + 2a + u}
P (A’(2n+1) |u) = .

The result then obtains from substitution in (5.30).

As a consequence of this result we can obtain an expression for P (a < L ·
H < b, u < C < v) (see also Billingsley, (1968), p. 79) for a Brownian motion
on [0, 1] with zero drift:

P (a, b, u, v) = P (a < L · H < b, u < C < v)

¦[v + 2k(b ’ a)] ’ ¦[u + 2k(b ’ a)]

’ ¦[2b ’ u + 2k(b ’ a)] ’ ¦[2b ’ v + 2k(b ’ a)]. (5.31)

where ¦ is the standard normal cumulative distribution function. From (5.31)
we derive the joint density of (L, H, C) by taking the limit P (a, b, u, u + δ)/δ as

δ ’ 0, and taking partial derivatives with respect to a and b:

k 2 φ00 [u + 2k(b ’ a)] ’ k(1 + k)φ00 [2b ’ u + 2k(b ’ a)]
f (a, b, u) = 4

k φ [u + 2k(b ’ a)] ’ k(1 + k)φ00 [2b ’ u + 2k(b ’ a)]
2 00

+ k2 φ00 [u ’ 2k(b ’ a)] + k(1 ’ k)φ00 [2b ’ u ’ 2k(b ’ a)] (5.32)

for a < u < b.
From this it is easy to see that the conditional cumulative distribution func-
tion of L given C = u, H = b is given by on a · u · b (where ’2φ0 (2b ’ u)
is the joint p.d.f. of H, C) by
‚b‚v P (a, b, u, v)| v=u
F (a|b, u) = 1 +
2φ0 (2b ’ u)

{’kφ0 [u + 2k(b ’ a)] + (1 + k)φ0 [2b ’ u + 2k(b ’ a)]
φ (2b ’ u)

+ kφ0 [u ’ 2k(b ’ a)] + (1 ’ k)φ0 [2b ’ u ’ 2k(b ’ a)]}

This allows us to simulate both the high and the low, given the open and the
close by ¬rst simulating the high and the close using ’2φ0 (2b ’ u) as the joint
p.d.f. of (H, C) and then simulating the low by inverse transform from the
cumulative distribution function of the form (5.33).

Survivorship Bias

It is quite common for retrospective studies in ¬nance, medicine and to be
subject to what is often called “survivorship bias”. This is a bias due to the
fact that only those members of a population that remained in a given class
(for example the survivors) remain in the sampling frame for the duration of
the study. In general, if we ignore the “drop-outs” from the study, we do so
at risk of introducing substantial bias in our conclusions, and this bias is the
survivorship bias.

Suppose for example we have hired a stable of portfolio managers for a large
pension plan. These managers have a responsibility for a given portfolio over
a period of time during which their performance is essentially under continuous
review and they are subject to one of several possible decisions. If returns below
a given threshhold, they are deemed unsatisfactory and ¬red or converted to
another line of work. Those with exemplary performance are promoted, usually
to an administrative position with little direct ¬nancial management. And those
between these two “absorbing” barriers are retained. After a period of time,
T, an amibitious graduate of an unnamed Ivey league school working out of
head o¬ce wishes to compare performance of those still employed managing
portfolios. How are should the performance measures re¬‚ect the ¬ltering of
those with unusually good or unusually bad performance? This is an example
of a process with upper and lower absorbing barriers, and it is quite likely
that the actual value of these barriers di¬ers from one employee to another, for
example the son-in-law of the CEO has a substantially di¬erent barriers than the
math graduate fresh out of UW. However, let us ignore this di¬erence, at least
for the present, and concentrate on a di¬erence that is much harder to ignore
in the real world, the di¬erence between the volatility parameters of portfolios,
possibly in di¬erent sectors of the market, controlled by di¬erent managers.
For example suppose two managers were responsible for funds that began and
ended the year at the same level and had approximately the same value for the
lower barrier as in Table 5.2. For each the value of the volatility parameter
σ was estimated using individual historical volatilities and correlations of the
component investments.

Portfolio Open price Close Price Lower Barrier Volatility

56 5
1 40 30 .5

56 1
2 40 30 .2

Table 5.3

Suppose these portfolios (or their managers) have been selected retrospec-
tively from a list of “survivors” which is such that the low of the portfolio value
never crossed a barrier at l = Oe’a (bankruptcy of fund or termination or
demotion of manager, for example) and the high never crossed an upper barrier
at h = Oeb . However, for the moment let us assume that the upper barrier is so
high that its in¬‚uence can be neglected, so that the only absorbtion with any
substantial probability is at the lower barrier. We interested in the estimate of
return from the two portfolios, and a preliminary estimate indicates a continu-
ously compounded rate of return from portfolio 1 of R1 = ln(56.625/40) = 35%
and from portfolio two of R2 = ln(56.25/40) = 34%. Is this di¬erence signi¬cant
and are these returns reasonably accurate in view of the survivorship bias?
We assume a geometric Brownian motion for both portfolios,

dSt = µSt dt + σSt dWt ,

and de¬ne O = S(0), C = S(T ),

H = max S(t), L = min S(t)
0tT 0tT

with parameters µ, σ possibly di¬erent.
In this case it is quite easy to determine the expected return or the value of
any performance measure dependent on C conditional on survival, since this is
essentially the same as a problem already discussed, the valuation of a barrier
option. According to (5.27), the probability that a given Brownian motion
process having open 0 and close c strikes a barrier placed at l < min(0, c) is

σ2 T


zl = l(l ’ c).

Converting this statement to the Geometric Brownian motion (5.34), the prob-
ability that a geometric Brownian motion process with open O and close c

breaches a lower barrier at l is

P [L · l|O, C] = exp{’2 }


zl = ln(O/l) ln(C/l) = a(a + ln(C/O)).

Of course the probability that a particular path with this pair of values (O, C)
is a “survivor” is 1 minus this or

1 ’ exp{’2 }. (5.35)
σ2 T

When we observe the returns or the closing prices C of survivors only, the results
have been ¬ltered with probability (5.35). In other words if the probability
density function of C without any barriers at all is f (c) (in our case this is a
lognormal density with parameters that depend on µ and σ) then the density
function of C of the survivors in the presence of a lower barrier is proportional

ln(O/l) ln(c/l)
f (c)[1 ’ exp{’2 }]
σ2 T
l 2 ln(O/l) 2a
= f (c)(1 ’ ( )» ), with » = = 2 > 0.
c σ σT

It is interesting to note the e¬ect of this adjustment on the moments of C for
various values of the parameters. For example consider the expected value of C
conditional on survival
R∞ l
cf (c)(1 ’ ( c )» )dc
E(C|L ≥ l] = R ∞ l
f (c)(1 ’ ( c )» )dc
E[CI(C ≥ l)] ’ l» E[C 1’» I(C ≥ l)]
P [C ≥ l] ’ l» E[C ’» I(C ≥ l)]

and this is easy to evaluate in the case of interest in which C has a lognormal
distribution. In fact the same kind of calculation is used in the development of
the Black-Scholes formula. In our case C = exp(Z) where Z is N (µT, σ 2 T )

and so for any p and l > 0, we have from (3.11), using the fact that E(C|O) =
O exp{µT + σ2 T /2}, (and assuming O is ¬xed),

E[C p I(C > l)] = Op exp{pµT + p2 σ 2 T /2}¦( √ (a + µT ) + σ T p)

To keep things slightly less combersome, let us assume that we observe the
geometric Brownian motion for a period of T = 1. Then (5.36) results in
2 2 2
1 1
Oeµ+σ /2
¦( σ (a + µ) + σ) ’ Oe’a»+(1’»)µ+(1’») σ /2 ¦( σ (a + µ) + σ(1 ’ »))
1 1
¦( σ (a + µ)) ’ e’»a’»µ+»2 σ2 /2 ¦( σ (a + µ) ’ σ»)

Let there be no bones about it. At ¬rst blush this is still a truly ugly and
opaque formula. We can attempt to beautify it by re-expressing it in terms more
like those in the Black-Scholes formula, putting

1 1
(µ ’ a), and d2 (0) = (a + µ),
d2 (») =
σ σ
d1 (») = d2 (») + σ, d1 (0) = d2 (0) + σ.

These are analogous to the values of d1, d2 in the Black-Scholes formula in the
case » = 0. Then
2 2 2
eµ+σ /2
¦(d1 (0)) ’ e’»a+(1’»)µ+(1’») σ /2 ¦(d1 (»))
E[C|L ≥ l] = O (5.37)
¦(d2 (0)) ’ e’»a’»µ+»2 σ2 /2 ¦(d2 (»))

What is interesting is how this conditional expectation, the expected close for
the survivors, behaves as a function of the volatility parameter σ. Although this
is a rather complicated looking formula, we can get a simpler picture (Figure
5.8) using a graph with the drift parameter µ chosen so that E(C) = 56.25 is
held ¬xed. We assume a = ’ ln(30/40) (consistent with Table 5.2)and vary
the value of σ over a reasonable range from σ = 0.1 (a very stable investment)
through σ = .8 (a highly volatile investment). In Figure 5.8 notice that for small
volatility, e.g. for σ · 0.2, the conditional expectation E[C|L ≥ 30] remains
close to its unconditional value E(C) but for σ ≥ 0.3 it increases almost linearly
in σ to around 100 for σ = 0.8. The intuitive reason for this dramatic increase is
quite simple. For large values of σ the process ¬‚uctuates more, and only those

Figure 5.8: E[C|L ≥ 30] for various values of (µ, σ) chosen such that E(C) =

paths with very large values of C have abeen able to avoid the absorbing barrier
at l = 30. Two comparable portfolios with unconditional return about 40%
will show radically di¬erent apparent returns in the presence of an absorbing
barrier. If σ = 20% then the survivor™s return will still average around 40%,
but if σ = 0.8, the survivor™s returns average close to 150%. The practical
implications are compelling. If there is any form of survivorship bias (as there
usually is), no measure of performance should be applied to the returns from
di¬erent investments, managers, or portfolios without an adjustment for the
risk or volatility.

In the light of this discussion we can return to the comparison of the two
portfolios in Table 5.3. Evidently there is little bias in the estimate of returns
for portfolio 2, since in this case the volatility is small σ = 0.2. However there
is very substantial bias associated with the estimate for portfolio 1, σ = 0.5.
In fact if we repeat the graph of Figure 5.8 assuming that the unconditional
return is around 8% we discover that E[C|L ≥ 30] is very close to 56 5 when

σ = 0.5 indicating that this is a more reasonable estimator of the performance
of portfolio 1.

Figure 5.9: The E¬ect of Surivorship bias for a Brownian Motion

For a Brownian motion process it is easy to demonstrate graphically the
nature of the surivorship bias. In Figure 5.9, the points under the graph of
the probability density which are shaded correspond to those which whose low
fell below the absorbing barrier at l = 30. The points in the unshaded region
correspond to the survivors. The expected value of the return conditional on
survival is the mean return (x-cooredinate of the center of mass) of those points
chosen uniformly under the density but above the lower curve, in the region
labelled “survivors”. Note that if the mean µ of the unconditional density
approaches the barrier (here at 30) , this region approaches a narrow band
along the top of the curve and to the right of 30. Similarly if the unconditional
standard deviation or volatility increases, the unshaded region stretches out to
the right in a narrow band and the conditional mean increases.

We arrive at the following seemingly paradoxical conclusions which make it
imperative to adjust for survivorship bias: the conditional mean, conditional on
survivorship, may increase as the volatility increases even if the unconditional

mean decreases.
Let us return to the problem with both an upper and lower barrier and
consider the distribution of returns conditional on the low never passing a barrier
Oe’a and the high never crossing a barrier at Oeb ( representing a fund buyout,
recruitment of manager by competitor or promotion of fund manager to Vice
President). It is common in process control to have an upper and lower barrier
and to intervene if either is crossed, so we might wish to study those processes
for which no intervention was required. Similarly, in a retrospective study we
may only be able to determine the trajectory of a particle which has not left
a given region and been lost to us. Again as an example, we use the following
data on two portfolio managers, both observed conditional on survival, for a
period of one year.

Portfolio Open price Close Price Lower Barrier Upper Barrier Volatility

56 5
1 40 30 100 .5

56 1
2 40 30 100 .2

If φ denotes the standard normal p.d.f., then the conditional probability
density function of ln(C/O) given that Oe’a < L < H < Oeb is proportional
to where, as before
σ φ( σ )w(u)

2 2 2 2
w(u) = 1 ’ e’2b(b’u)/σ + e’2(a+b)(a+b’u)/σ ’ e’2a(a+u)/σ + e’2(a+b)(a+b+u)/σ ’ E(W ),
’ ln(L)
ln(H) b a
where W = I[f rac1( ], and
)> ] + I[f rac1( )>
a+b a+b a+b a+b
a = ’ln(30/40).
b = ln(100/40),

The expected return conditional on survival when the drift is µ is given by
Z b
E(ln(C/O)|30 < L < H < 100) = uw(u)φ( )du.
σ σ

where w(u) is the weight function above. Therefore a moment estimator of the
drift for the two portfolios is determined by setting this expected return equal
to the observed return, and solving for µi the equation
1b u ’ µi
uw(u)φ( )du = Ri , i = 1, 2.
σi ’a σi
The solution is, for portfolio 1, µ1 = 0 and for portfolio 2, µ2 = 0.3. Thus the
observed values of C are completely consistent with a drift of 30% per annum
for portfolio 2 and a zero drift for portfolio 1. The bias again very strongly
e¬ects the portfolio with the greater volatility and estimators of drift should
account for this substantial bias. Ignoring the survivorship bias has led in the
past to some highly misleading conclusions about persistence of skill among
mutual funds.


1. If the values of dj are equally spaced, i.e. if dj = j∆, j = ..., ’2, ’1, 0, 1, ...and
with S0 = 0, ST = C and M = max(S0 , ST ), show that
P [C > u and C’M is even]

E[H|C = u] = M + ∆ .
P [C = u]

2. Let W (t) be a standard Brownian motion on [0, 1] with W0 = 0. De¬ne
C = W (1) and H = max{W (t); 0 · t · 1}. Show that the joint probabil-
ity density function of (C, H) is given by

f (c, h) = 2φ(c)(2h ’ c)e’2h(h’c) , for h > max(0, c)

where φ(c) is the standard normal probability density function.

3. Use the results of Problem 2 to show that the joint probability density
function of the random variables

Y = exp{’(2H ’ C)2 /2}

and C is a uniform density on the region {(x, y); y < exp(x2 /2)}.

4. Let X(t) be a Brownian motion on [0, 1], i.e. Xt satis¬es

dXt = µdt + σdWt , and X0 = 0.

De¬ne C = X(1) and H = max{X(t); 0 · t · 1}. Find the joint proba-
bility density function of (C, H).
Chapter 6

Quasi- Monte Carlo
Multiple Integration


In some sense, this chapter ¬ts within Chapter 4 on variance reduction; in
some sense it is strati¬cation run wild. Quasi-Monte Carlo methods are purely
deterministic, numerical analytic methods in the sense that they do not even
attempt to emulate the behaviour of independent uniform random variables, but
rather cover the space in d dimensions with fewer gaps than independent random
variables would normally admit. Although these methods are particularly when
evaluating integrals in moderate dimensions, we return brie¬‚y to the problem
of evaluating a one-dimension integral of the form

Z 1
f (x)dx.

The simplest numerical approximation to this integral consists of choosing a
point xj in the interval [ N , j+1 ], j = 0, 1, ..., N ’ 1, perhaps the midpoint of the


interval, and then evaluating the average
N ’1
f (xj ).
N j=0

If the function f has one continuous derivative, such a numerical method with
N equally or approximately equally spaced points will have bias that approaches
0 at the rate 1/N because, putting M = sup{|f 0 (z)|; 0 < z < 1},
Z (j+1)/N
1 1
f (x)dx ’ f (xj ) · 2 M (6.2)

and so summing both sides over j gives
Z N’1
| f (x)dx ’ f (xj )| · M.
N j=0 N

We will refer to the error in the numerical integral in this case
Z N ’1
µN = | f (x)dx ’ f (xj )|
N j=0

as O(N ’1 ) which means that the sequence of errors µN satis¬es

lim sup N ’1 µN < ∞
N ’∞

or intuitively that the errors are bounded by a constant times N ’1 .
If the function f is known to have bounded derivatives of second or third
order, then integrals can be approximated to an even higher degree of precision.
For example various numerical quadrature formulae permit approximating an
integral of the form 0 f (x)w(x)dx with a weighted average of N points
wj f (xj )

in such a way that if f (x) is a polynomial of degree 2N ’ 1 or less, the approx-
imation is exact. Here the function w(x) is typically some density such as the
uniform, exponential or normal density and the optimal placement of the points
xj as well as the weights wj depends on w(x). Of course a smooth function

can be closely approximated with a polynomial of high degree and so numerical
quadrature formulae of the form (6.3) permit approximating a one-dimension
integral arbitrarily closely provided that the function is su¬ciently smooth, i.e.
it has bounded derivatives of su¬ciently high order. We should note that in
this case, the weights wj and the points xj are both deterministic. By contrast,
the Monte Carlo integral
θMC = f (Ui )
N i=1
with N points places these points at random or pseudo-random locations, has
zero bias but the standard deviation of the estimator var(θMC ) is a constant

multiple of 1/ N . The Central Limit theorem assures us that
N 1/2 (θMC ’ f (x)dx)

converges to a normal distribution which means that the error is order (in prob-
ability) N ’1/2 . Note that there is a change in our measure of the size of an
error, since only the variance or standard deviation of a given term in the se-
quence of errors is bounded, not the whole sequence of errors µN . In particular
if a pseudo-random estimator θ satis¬es
b f (x)dx)2 = O(N ’2k )
E(θ ’

then we say that the error is OP (N ’k ) where OP denotes “order in probability”.
This is clearly a weaker notion than O(N ’k ). Even the simplest numerical
integral (6.1) has a faster rate of convergence then that of the Monte Carlo
integral with or without use of the variance reduction techniques of Chapter 4.
This is a large part of the reason numerical integration is usually preferred to
Monte Carlo methods in one dimension, at least for smooth functions, but it
also indicates that for regular integrands, there is room for improvement over
Monte Carlo in higher dimensions as well.
The situation changes in 2 dimensions. Suppose we wish to distribute N
points over a uniform lattice in some region such as the unit square. One

possible placement is to points of the form

i j
( √ , √ ), i, j = 1, 2, ... N

assuming for convenience of notation that N is integer. The distance between

adjacent points is of order 1/ N and by an argument akin to (6.2), the bias in

a numerical integral is order 1/ N . This is the now same order as the standard
deviation of a Monte Carlo integral, indicating that the latter is already, in two
dimensions, competitive. When the dimension s ≥ 3, a similar calculation shows
that the standard deviation of the Monte-Carlo method is strictly smaller order
than the error of a numerical integral with weights at lattice points. Essentially,
the placement of points on a lattice for evaluating a d’dimensional integral is far
from optimal when d ≥ 2. Indeed various deterministic alternatives called quasi-
random samples provide substantially better estimators especially for smooth
functions of several variables. Quasi-random samples are analogous to equally
spaced points in one dimension and are discussed at length by Niederreiter
(1978), where it is shown that for su¬ciently smooth functions, one can achieve
rates of convergence close to the rate 1/N for the one-dimensional case.
We have seen a number of methods designed to reduce the dimensionality
of the problem. Perhaps the most important of these is conditioning, which can
reduce an d’dimensional integral to a one-dimensional one. In the multidimen-
sional case, variance reduction has an increased importance because of the high
variability induced by the dimensionality of crude methods. The other vari-
ance reduction techniques such as regression and strati¬cation carry over to the
multivariable problem with little change, except for the increased complexity of
determining a reasonable strati¬cation in such problems.

Errors in numerical Integration

We consider the problem of numerical integration in d dimensions. For d = 1
classical integration methods, like the trapezoidal rule, are weighted averages of

the value of the function at equally spaced points;

Z m
f (u)du ≈ (6.4)
wn f ( ),
0 n=0

where w0 = wm = 1/(2m), and wn = 1/m for 1 · n · m ’ 1. The trape-
zoidal rule is exact for any function that is linear (or piecewise linear between
grid-points) and so we can assess the error of integration by using a linear ap-
proximation through the points ( m , f ( m )) and ( j+1 , f ( j+1 )). Assume
j j
m m

j j+1
<x< .
m m

If the function has a continuous second derivative, we have by Taylor™s Theorem
that the di¬erence between the function and its linear interpolant is of order
O(x ’ i.e.
m) ,

j j j+1 j j
) ’ f ( )] + O(x ’ )2 .
) + (x ’ )m[f (
f (x) = f (
m m m m m
j j+1
Integrating both sides between and notice that

f ( j+1 ) + f ( m )
j j j+1 j m
{f ( ) + (x ’ )m[f ( ) ’ f ( )]}dx =
m m m m 2m

is the area of the trapezoid and the error in the approximation is
Z (j+1)/m
) ) = O(m’3 ).
(x ’

Adding these errors of approximation over the m trapezoids gives O(m’2 ). Con-
sequently, the error in the trapezoidal rule approximation is O(m’2 ), provided
that f has a continuous second derivative on [0, 1].
We now consider the multidimensional case, d ≥ 2. Suppose we evaluate
the function at all of the (m + 1)d points of the form ( n1 , . . . , ns ) and use this
m m

to approximate the integral. The classical numerical integration methods use a
Cartesian product of one-dimensional integration rules. For example, the d-fold
Cartesian product of the trapezoidal rule is

Z m m
X X n1 ns
f (u)du ≈ ··· wn1 · · · wns f ( (6.5)
, . . . , ),
m m
[0,1]d n1 =0 ns =0

where [0, 1]d is the closed s-dimensional unit cube and the wn are as before.
The total number of nodes is N = (m + 1)s . From the previous error bound it
follows that the error is O(m’2 ), provided that the second partial derivatives of
f are continuous on [0, 1]d . We know that the error cannot be smaller because
when the function depends on only one variable and is constant in the others,
the one-dimensional result is a special case. In terms of the number N of nodes
or function evaluations, since m = O(N 1/d ), the error is O(N ’2/d ), which,
with increasing dimension d, changes dramatically. For example if we required
N = 100 nodes to achieve a required precision in the case d = 1, to achieve the
same precision for a d = 5 dimensional integral using this approach we would
need to evaluate the function at a total of 100d = 1010 = ten billion nodes. As
the dimension increases, the number of function evaluations or computation
required for a ¬xed precision increases exponentially. This phenomena is often
called the “curse of dimensionality”, exorcised in part at least by quasi or regular
Monte Carlo methods.

The ordinary Monte Carlo method based on simple random sampling is free
of the curse of dimensionality. By the central limit theorem, even a crude Monte
Carlo estimate for numerical integration yields a probabilistic error bound of the
form OP (N ’1/2 ) in terms of the number N of nodes (or function evaluations)
and this holds under a very weak regularity condition on the function f . The
remarkable feature here is that this order of magnitude does not depend on the
dimension d. This is true even if the integration domain is complicated. Note
however that the de¬nition of “ O” has changed from one that essentially con-
siders the worst case scenario to OP which measures the average or probabilistic
behaviour of the error.

Some of the oft-cited de¬ciencies of the Monte Carlo method limiting its

usefulness are:

1. There are only probabilistic error bounds (there is no guarantee that the
expected accuracy is achieved in a particular case -an alternative approach
would optimize the “worst-case” behaviour);

2. Regularity of the integrand is not exploited even when it is available. The
probabilistic error bound OP (N ’1/2 ) holds under a very weak regularity
condition but no extra bene¬t is derived from any additional regularity or
smoothness of the integrand. For example the estimator is no more precise
if we know that the function f has several continuous derivatives. In cases
when we do not know whether the integrand is smooth or di¬erentiable,
it may be preferable to use Monte Carlo since it performs reasonably well
without this assumption.

3. Genuine Monte Carlo is not feasible anyway since generating truly in-
dependent random numbers is virtually impossible. In practice we use
pseudo-random numbers to approximate independence.

Theory of Low discrepancy sequences

The quasi-Monte Carlo method places attention on the objective, approximating
an integral, rather than attempting to imitate the behaviour of independent
uniform random variates. Quasi-random sequences of low discrepancy sequences
would fail all of the tests applied to a pseudo-random number generate except
those testing for uniformity of the marginal distribution because the sequence
is, by construction, autocorrelated. Our objective is to approximate an integral
using a average of the function at N points, and we may adjust the points so that
the approximation is more accurate. Ideally we would prefer these sequences to
be self-avoiding, so that as the sequence is generated, holes are ¬lled. As usual

we will approximate the integral with an average;
f (u)du ≈ (6.6)
f (xn ).
N n=1

Quasi Monte-Carlo is able to achieve a deterministic error bound O((logN )d /N )
for suitably chosen sets of nodes and for integrands with a relatively low degree
of regularity, much better than the rate O(N ’1/2 ) achieved by Monte Carlo
methods. Even smaller error bounds can be achieved for su¬ciently regular
integrands. There are several algorithms or quasi-Monte-Carlo sequences which
give rise to this level of accuracy.

Suppose, as with a crude Monte Carlo estimate, we approximate the integral
with (6.6) with x1 , . . . , xN ∈ [0, 1]d . The sequence x1 , . . . , xN ,... is determinis-
tic (as indeed are the pseudo-random sequences we used for Crude Monte-Carlo),
but they are now chosen so as to guarantee a small error. Points are chosen so
as to achieve the maximal degree of uniformity or a low degree of discrepancy
with a uniform distribution. A ¬rst requirement for a low discrepancy sequence
is that we obtain convergence of the sequence of averages so that:
lim f (xn ) = f (u)du,
N’∞ N [0,1]d

and this should hold for a reasonably large class of integrands. This suggests
that the most desirable sequences of nodes x1 , . . . , xN are “evenly distributed”
over [0, 1]d . Various notions of discrepancy have been considered as quantitative
measures for the deviation from the uniform distribution but we will introduce
only one here, the so-called “star-discrepancy”. The star discrepancy is perhaps
the more natural one in statistics, since it measures the maximum di¬erence be-
tween the empirical cumulative distribution function of the points {x1 , . . . , xN }
and the uniform distribution of measure on the unit cube. Suppose we construct

bN (x) = 1 I(xn · x),
N n=1

the empirical cumulative distribution function of the points x1 , . . . , xN , and
compare it with

F (x) =F (x1 , ...xd ) = min(1, x1 x2 ...xd ) if all xi ≥ 0

the theoretical uniform distribution on [0, 1]d . While any measure of the dif-
ference could be used, the star discrepancy is simply the Kolmogorov-Smirnov
distance between these two cumulative distribution functions
# of points in B

DN = sup | FN (x) ’ F (x)| = sup | ’ »(B)|,
x B

where the supremum is taken over all rectangles B of the form [0, x1 ] — [0, x2 ] —
... — [0, xd ] and where »(B) denotes the Lebesgue measure of B in Rd .
It makes intuitive sense that we should choose points {x1 , . . . , xN } such
that the discrepancy is small for each N. This intuition is supported by a
large number of theoretical results, at least in the case of smooth integrands
with smooth partial derivatives. The smoothness is measured using V (f ), a
“total variation” in the sense of Hardy and Krause, intuitively the length of the
monotone segments of f. For a one dimensional function with a continuous ¬rst
derivative it is simply
Z 1
|f 0 (x)|dx.
V (f ) =

In higher dimensions, the Hardy Krause variation may be de¬ned in terms of
the integral of partial derivatives;

De¬nition 48 Hardy and Krause Total Variation
If f is su¬ciently di¬erentiable then the variation of f on [0, 1]d in the sense
of Hardy and Krause is
V (k) (f ; i1 , . . . , ik ), (6.7)
V (f ) =
k=1 1 i1 <···<ik s

¯ ¯
¯ ¯
1 1
¯ ¯
··· dxi1 · · · dxik . (6.8)
V (f ; i1 , . . . , ik ) = ¯ ‚xi · · · ‚xi ¯
0 0 k xj =1,j6=i1 ,...,ik

The precision in our approximation to an integral as an average of function
values is closely related to the discrepancy measure as the following result shows.
Indeed the mean of the function values di¬ers from the integral of the function
by an error which is bounded by the product of the discrepancy of the sequence
and the measure V (f ) of smoothness of the function.

Theorem 49 (Koksma-Hlawka inequality)
If f has bounded variation V (f ) on [0, 1]d in the sense of Hardy and Krause,
then, for any x1 , . . . , xN ∈ [0, 1]d , we have
1X —
| f (xn ) ’ f (u)du| · V (f )DN . (6.9)
N n=1 s

We do not normally use this inequality as it stands since the evaluation of the
error bound on the right hand side requires determining V (f ), typically a very
di¬cult task. However this bound allows a separation between the regularity
properties of the integrand and the degree of uniformity of the sequence. We can
guarantee a reasonable approximation for any function f with bounded total

variation V (f ) by ensuring that the discrepancy of the sequence DN is small.
For this reason, the discrepancy is central to quasi-Monte Carlo integration.
Sequences with small star discrepancy are called low-discrepancy sequences. In
fact since a variety of sequences exist with discrepancy of order

(log N )d

as N ’ ∞, the term “low-discrepancy” is often reserved for these.

Examples of low discrepancy sequences

Van der Corput Sequence.

In the one dimensional case the best rate of convergence is O(N ’1 log N ), N ≥ 2.
It is achieved, for example, by the van der Corput sequence, obtained by

reversing the digits in the representation of some sequence of integers in a given
base. Consider one-dimensional case d = 1 and base b = 2. Take the base b
representation of the sequence of natural numbers;

1, 10, 11, 100, 101, 110, 111, 1000, 1001, 1010, 1011, 1100, 1101, ...

and then map these into the unit interval [0, 1] so that the integer t ak bk
is mapped into the point k=0 ak b’k’1 . These binary digits are mapped into
(0,1) in the following three steps;

1. Write n using its binary expansion. e.g. 13 = 1(8) + 1(4) + 0(2) + 1(1)
becomes 1101.

2. Reverse the order of the digits. e.g. 1101 becomes 1011.

3. Determine the number that this is the binary decimal expansion for. e.g.
1011 = 1( 1 ) + 0( 1 ) + 1( 1 ) + 1( 16 ) =
1 11
16 .
2 4 8

Thus 1 generates 1/2, 10 generates 0( 1 ) + 1( 1 ), 11 generates 1( 1 ) + 1( 1 )
2 4 2 4

and the sequence of positive integers generates the points. The intervals are
recursively split in half in the sequence 1/2, 1/4, 3/4, 1/8, 5/8, 3/8, 7/8, ... and
the points are fairly evenly spaced for any value for the number of nodes N ,
and perfectly spaced if N is of the form 2k ’ 1. The star discrepancy of this
sequence is
log N

DN = O( )
which matches the best that is attained for in¬nite sequences.

The Halton Sequence

This is simply the multivariate extension of the Van der Corput sequence. In
higher dimensions, say in d dimensions, we choose d distinct primes, b1 , b2 , ...bd
(usually the smallest primes) and generate, from the same integer m , the d
components of the vector using the method described for the Van der Corput

sequence. For example, we consider the case d = 3 and use bases b1 = 2,
b2 = 3, b3 = 5 because these are the smallest three prime numbers. The ¬rst
few vectors , ( 1 , 1 , 1 ), ( 1 , 2 , 2 ), ( 3 , 1 , 3 ), ...are generated in the table below.
235 435 495

repres ¬rst repres. second repres third
base 2 component base 3 comp base 5 comp
1 1 1/2 1 1/3 1 1/5
2 10 1/4 2 2/3 2 2/5
3 11 3/4 10 1/9 3 3/5
4 100 1/8 11 4/9 4 4/5
5 101 5/8 12 7/9 10 1/25
6 110 3/8 20 2/9 11 6/25
7 111 7/8 21 5/9 12 11/25
9 1000 1/16 22 8/9 13 16/25
10 1001 9/16 100 1/27 14 21/25

Figure 6.1 provides a plot of the ¬rst 500 points in the above Halton sequence
of dimension 3.
There appears to be greater uniformity than a sequence of random points
would have. Some patterns are discernible on the two dimensional plot of the
¬rst 100 points, for example see Figures 6.2 and 6.3.
These ¬gures can be compared with the plot of 100 pairs of independent
uniform random numbers in Figure 6.4, which seems to show more clustering
and more holes in the point cloud.
These points were generated with the following function for producing the
Halton sequence.
function x=halton(n,s)

%x has dimension n by s and is the ¬rst n terms of the halton sequence of

%dimension s.

p=primes(s*6); p=p(1:s); x=[];






0.8 1
0.6 0.8
0 0

Figure 6.1: 500 points from a Halton seqnece of dimension 3

for i=1:s

x=[x (corput(n,p(i)))™];


function x=corput(n,b)

% converts integers 1:n to from van der corput number with base b


n=1:n; A=[];

for i=0:m

a=rem(n,b); n=(n-a)/b;

A=[A ;a];



The Halton sequence is a genuine low discrepancy sequence in the sense that

(log N )d

DN = O( )

Halton sequence of dimension 3




second coordinate






0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1
first coordinate

Figure 6.2: The ¬rst and second coordinate of 100 points from the Halton
sequence of dimension 3

and the coverage of the unit cube is reasonably uniform for small dimensions.
Unfortunately the notation O() hides a constant multiple, one which, in this
case, depends on the dimension d. Roughly (Niedereiter, 1992), this constant
is asymptotic to dd which grows extremely fast in d. This is one indicator that
for large d, the uniformity of the points degrades rapidly, largely because the
relative sparseness of the primes means that the d0 th prime is very large for d
large. This results in larger holes or gaps in that component of the vector than
we would like. This is evident for example in Figure6.5 where we plot the last
two coordinates of the Halton sequence of dimension 15.

The performance of the Halton sequence is considerably enhanced by per-
muting the coe¬cients ak prior to mapping into the unit interval as is done by
the Faure sequence.

Figure 6.3: The second and third coordinate of 100 points from the Halton
sequence of dimension 3

Faure Sequence

The Faure sequence is similar to the Halton sequence in that each dimension is
a permutation of a van der Corput sequence; however, the same prime is used
as the base b for each of the components of the vector, and is usually chosen to
be the smallest prime greater than or equal to the dimension (Fox, 1996).
In the Van der Corput sequence we wrote the natural numbers in the form
Pt P
ak bk which was then mapped into the point t ak b’k’1 in the unit
k=0 k=0

interval. For the Faure sequence we use the same construction but we use
di¬erent permutations of the coe¬cients ak for each of the coordinates. In
particular in order to generate the i™th coordinate we generate the point
ck b’k’1

Figure 6.4: 100 independent U [0, 1] pairs

X µm¶
(i ’ 1)m’k am mod b
ck =

Notice that only the last t ’ k + 1 values of ai are used to generate ck . For
example consider the case d = 2, b = 2. Then the ¬rst 10 Faure numbers are

0 1/2 1/4 3/4 1/8 5/8 3/8 7/8 1/16 9/16
0 1/2 3/4 1/4 5/8 1/8 3/8 7/8 15/16 7/16

The ¬rst row corresponds to the Van der Corput numbers and the second row
of obtained from the ¬rst by permuting the values with the same denominator.
The Faure sequence has better regularity properties than does the Halton
sequence above particularly in high dimensions. However the di¬erences are by
no means evident from a graph when the dimension is moderate. For example
we plot in Figure 6.6 the 14™th and 15™th coordinates of 1000 points from the
Faure sequence of dimension d = 15 for comparison with Figure 6.5.

Figure 6.5: The 14™th and 15™th coordinates of the ¬rst 1000 of a Halton sequence
d = 15

Other suggestions for permuting the digits in a Halton sequence include
using only every l0 th term in the sequence so as to destroy the cycle.

In practice, in order to determine the e¬ect of using one of these low dis-
crepancy sequences we need only substitute such a sequence for the vector of
independent uniform random numbers used by a simulation. For example if we
wished to simulate a process for 10 time periods, then value a call option and
average the results, we could replace the 10 independent uniform random num-
bers that we used to generate one path by an element of the Halton sequence
with d = 10.

Suppose we return brie¬‚y to the call option example treated in Chapter 3.
The true value of this call option was around 0.4615 according to the Black-
Scholes formula. If however we substitute the Van der Corput sequence for the
sequence of uniform random numbers,

Figure 6.6: The last two coordinates of the ¬rst 1000 Faure points of dimension
d = 15.


we obtain an estimate of 0.4614 very close to the correct value. I cannot
compare these estimators using the notion of e¬ciency that we used there, how-
ever, because these low-discrepancy sequences are not random and do not even
attempt to emulate random numbers. Though unable to compare performance
with the variance of an estimator, we can look at the Mean squared error (see
for example Figure 6.8). which shows a faster rate of convergence for Quasi
Monte Carlo equivalent to variance reduction in excess of 100). Galanti & Jung
(1997), report that the Faure sequence su¬ers from the problem of start-up

and especially in high-dimensions and the Faure numbers can exhibit clustering
about zero. In order to reduce this problem, Faure suggests discarding the ¬rst
b4 ’ 1 points.

Sobol Sequence

The Sobol sequence is generated using a set of so-called direction numbers
= 1, 2, where the mi are odd positive integers less than 2i . The
vi = 2i , i

values of mi are chosen to satisfy a recurrence relation using the coe¬cients of
a primitive polynomial in the Galois Field of order 2. A primitive polynomial is
irreducible (i.e. cannot be factored into polynomials of smaller degree) and does
not divide the polynomial xr + 1 for r < 2p ’ 1. For example the polynomial
x2 + x + 1 has no non-trival factors over the Galois Field of order 2 and it does
divide x3 + 1 but not xr + 1 for r < 3. Corresponding to a primitive polynomial

z p + c1 z p’1 + ...cp’1 z + cp

is the recursion

mi = 2c1 mi’1 + 22 c2 mi’2 + ... + 2p cp mi’p

where the addition is carried out using binary arithmetic. For the Sobol se-
quence, we then replace the binary digit ak by ak vk .
In the case d = 2, the ¬rst 10 Sobol numbers are, using irreducible polyno-
mials x + 1 and x3 + x + 1

0 1/2 1/4 3/4 3/8 7/8 1/8 5/8 5/16 13/16
0 1/2 1/4 3/4 1/8 5/8 3/8 7/8 11/16 3/16

Again we plot the last two coordinates for the ¬rst 1000 points from a Sobol
sequence of dimension d = 15 in Figure 6.7 for comparison with Figures 6.5 and

Figure 6.7: The last two coordinates of the ¬rst 1000 points from a Sobel se-
quence of dimension 15

Although there is a great deal of literature espousing the use of one quasi-
Monte Carlo sequence over another, most results from a particular application


. 10
( 11)