. 6
( 11)


θ θ 1 t 1
t’2 exp(’ ((x ’ µ)2 + θ2 ) ’ (β 2 + 2 ))dt
exp(β(x ’ µ) + 2 )
2πc c 2t 2 c
p K1 (± δ 2 + (x ’ µ)2 )
exp(δ ±2 ’ β 2 + β(x ’ µ)) p
= .
π δ 2 + (x ’ µ)2

The modi¬ed Bessel function of the second kind K» (x) is given in MATLAB
by besselk( ν, x) and in R by besselK(x,ν,expon.scaled=FALSE). The distri-
bution with probability density function given by (3.22) is called the normal
inverse Gaussian distribution with real-valued parameters x, µ, 0 · δ and
± ≥ |β|. The tails of the normal inverse Gaussian density are substantially
heavier than those of the normal distribution. In fact up to a constant

f (x; ±, β, δ, µ) ∼ |x| ’3/2 exp((“± + β)x) as x ’ ±∞.

The moments of this distribution can be obtained from the moment gener-
ating function
±2 ’ (β + s)2
M (s) = eµs exp{δ(±2 ’β 2 )1/2 ’δ(±2 ’(s+β)2 )1/2 } for |β+s| < ±.
±2 ’ β 2

These moments are:

E(X) = µ + δβ(±2 ’ β 2 )’1/2

var(X) = δ±2 (±2 ’ β 2 )’3/2

and the skewness and kurtosis:

skew = 3β±’1 δ ’1/2 (±2 ’ β 2 )’1/4

kurtosis = 3δ ’1 ±’2 (±2 + 4β 2 )(±2 ’ β 2 )’1/2 .

One of the particularly attractive features of this family of distributions,
shared by the normal and the stable family of distributions, is that it is closed
under convolutions. This is apparent from the moment generating function
(3.23) since
M N (s)

gives a moment generating function of the same form but with µ replaced by
µN and δ by δN. In Figure 3.19 we plot the probability density function of a
member of this family of distributions.
Note the similarity to the normal density but with a modest amount of
skewness and increased weight in the tails. We can generate random variables
from this distribution as follows:

Sample T from an inverse Gaussian distribution (3.21)

Return X = µ + βT + N (0, T )
where N (0, T ) is a normal random variable with mean 0 and variance T.

We sample from the inverse Gaussian by using a property of the distribution
that if T has density of the form (3.21) then

(T ’ θ)2
c2 T

Figure 3.19: Normal Inverse Gaussian probability density function with ± =
δ = 1, β = 1 , µ = 0

has a chi-squared distribution with one degree of freedom (easily generated
as the square of a standard normal random variable). The algorithm is (see
Michael, Shucany, Hass (1976));

1. For
c= p , and θ = δc,
±2 ’ β 2
generate G1 from the Gamma( 1 , δ ) distribution. De¬ne
Y1 = 1 + G1 (1 ’ 1 + )}.

2. Generate U2 ∼ U [0, 1]. If U2 · then output T = θY1

3. Otherwise output T = θY1’1 .

The two values θY1 , and θY1’1 are the two roots of the equation obtained by
setting (3.24) equal to a chi-squared variate with one degree of freedom and the
relative values of the probability density function at these two roots are 1+Y1
and 1 ’ 1+Y1 .

Finally to generate from the normal inverse Gaussian distribution (3.22) we
generate an inverse gamma random variable above and then set X = µ + βT +
N (0, T ). Prause (1999) provides a statistical evidence that the Normal Inverse
Gaussian provides a better ¬t than does the normal itself. For example we ¬t the
normal inverse gamma distribution to the S&P500 index returns over the period
Jan 1, 1997-Sept 27, 2002. There were a total of 1442 values over this period.
Figure 3.20 shows a histogram of the daily returns together with the normal
and the NIG ¬t to the data. The mean return over this period is 8 — 10’5 and
the standard deviation of returns 0.013. If we ¬t the normal inverse Gaussian
distribution to these returns we obtain parameter estimates

± = 95.23, β = ’4.72, δ = 0.016, µ = 0.0009

and the Q-Q plots in Figure 3.21 . Both

Figure 3.20: The Normal and the Normal inverse Gaussian ¬t to the S&P500

Figure 3.21: QQ plots showing the Normal Inverse Gaussian and the Normal ¬t
to S&P 500 data, 1997-2002

indicate that the normal approximation fails to properly ¬t the tails of the
distribution but that the NIG distribution is a much better ¬t. This is similar to
the conclusion in Prause using observations on the Dow Jones Industrial Index.

Generating Random Numbers from Discrete Dis-

Many of the methods described above such as inversion and acceptance-rejection
for generating continuous distributions work as well for discrete random vari-
ables. Suppose for example X is a discrete distribution taking values on the
integers with probability function P [X = x] = f (x), for x = 0, 1, 2, ... Suppose
we can ¬nd a continuous random variable Y which has exactly the same value
of its cumulative distribution function at these integers so that FY (j) = FX (j)
for all j = 1, 2, .... Then we may generate the continuous random variable Y,
say by inversion or acceptance-rejection and then set X = bY c the integer part
of Y . Clearly X takes integer values and since P [X · j] = P [Y · j] = FX (j)
for all j = 0, 1, ..., then X has the desired distribution. The continuous ex-
ponential distribution and the geometric distribution are linked in this way. If
X has a geometric(p) distribution and Y has the exponential distribution with
parameter » = ’ ln(1 ’ p), then X has the same distribution as dY e or bY c + 1.

Using the inverse transform method for generating discrete random vari-
ables is usually feasible but for random variables with a wide range of values
of reasonably high probability, it often requires some setup costs to achieve
reasonable e¬ciency. For example if X has cumulative distribution function
F (x), x = 0, 1, ...inversion requires that we output an integer X = F ’1 (U ),an
integer X satisfying F (X ’ 1) < U · F (X). The most obvious technique for
¬nding such a value of X is to search sequentially through the potential values
x = 0, 1, 2, .... Figure 3.22 is the search tree for inversion for the distribution on





3 4

distribution: .11 .30 .25 .21 .13

Figure 3.22: Sequential Search tree for Inverse Transform with root at x = 0

the integers 0, . . . 4 given by
0 1 2 3 4
0.11 0.30 0.25 0.21 0.13
f (x)
We generate an integer by repeatedly comparing a uniform [0,1] variate
U with the value at each node, taking the right branch if it is greater than
this threshold value, the left if it is smaller. If X takes positive integer values
{1, 2, ..., N },the number of values searched will average to E(X) which for many
discrete distributions can be unacceptably large.

An easy alternative is to begin the search at a value m which is near the
median (or mode or mean) of the distribution. For example we choose m = 2
and search to the left or right depending on the value of U in Figure 3.23.

If we assume for example that we root the tree at m then this results in
searching roughly an average of E[|X ’ m + 1|] before obtaining the generated
variable. This is often substantially smaller than E(X) especially when E(X)



2 3

0 1

Figure 3.23: Search tree rooted near the median

is large but still unacceptably large when the distribution has large variance.
An optimal binary search tree for this distribution is graphed in Figure
3.24. This tree has been constructed from the bottom up as follows. We begin
by joining the two smallest probabilities f (4) and f (0) to form a new node
with weight f (0) + f (4) = 0.24. Since we take the left path (towards X =
0 rather than towards X = 4) if U is smaller than the value .11 labelling
the node at the intersection of these two branches. We now regard this pair
of values as a unit and continue to work up the tree from the leaves to the
root. The next smallest pair of probabilities are {0, 1} and {3} which have
probabilities 0.24 and 0.21 respectively so these are the next to be joined hence
working from the leaves to the root of the tree. This optimal binary search
tree provides the minimum expected number of comparisons and is equivalent
to sorting the values in order of largest to smallest probability, in this case
1, 2, 3, 4, 0 , relabelling them or coding them {0, 1, 2, 3, 4} and then applying the
inverse transform method starting at 0.

distribution .11 .30 .25 .21 .13



.11 3


Figure 3.24: Optimal Binary Search Tree

The leaves of the tree are the individual probabilities f (j) and the internal
nodes are sums of the weights or probabilities of the “children”, the values f (j)
for j on paths below this node. Let Di represents the depth of the i0 th leaf so for
example the depth of leaf 0 in Figure 3.24 is D0 = 3. Then the average number
of comparisons to generate a single random variable Xstarting at the root is
i f (i)Di . The procedure for constructing the last tree provides an optimal

algorithm in the sense that this quantity is minimized. It is possible to show that
an optimal binary search tree will reduce the average number of comparisons
from E(X) for ordinary inversion to less than 1 + 4 [log2 (1 + E(X))].
Another general method for producing variates from a discrete distribution
was suggested by Walker (1974, 1977) and is called the alias method. This is
based on the fact that every discrete distribution is a uniform mixture of two-
point distributions. Apart from the time required to set up an initial table of
aliases and aliasing probabilities, the time required to generate values from a dis-
crete distribution with K supporting points is bounded in K, whereas methods

such as inverse transform have computational time which increase proportion-
ally with E(X).

Consider a discrete distribution of the form with probability function f (j)
on K integers j = 1, 2, ...K. We seek a table of values of A(i) and associated
“alias” probabilities q(i) so that the desired discrete random variable can be
generated in two steps, ¬rst generate one of the integers {1, 2, ..., K} at random
and uniformly, then if we generated the value I, say, replace it by an “alias” value
A(I) with alias probability q(I). These values A(I) and q(I) are determined
below. The algorithm is:



An algorithm for producing these values of (A(i), q(i)), i = 1, ..., K} is sug-
gested by Walker(1977) and proceeds by reducing the number of non-zero prob-
abilities one at a time.

1. Put q(i) = Kf (i) for all i = 1, ..., K.

2. LET m be the index so that q(m) = min{q(i); q(i) > 0} and let q(M ) =
max{q(i); q(i) > 0}.

3. SET A(m) = M and ¬x q(m) ( it is no longer is subject to change).

4. Replace q(M ) by q(M ) ’ (1 ’ q(m))

5. Replace (q(1), ...q(K))by (q(1), ..., q(m ’ 1), q(m + 1), .q(M ) (so the com-
ponent with index m is removed).

6. Return to 2 unless all remaining qi = 1 or the vector of qi ™s is empty.

Note that on each iteration of the steps above, we ¬x one of components
q(m) and remove it from the vector and adjust one other, namely q(M ). Since
we always ¬x the smallest q(m) and since the average q(i) is one, we always

obtain a probability, i.e. ¬x a value 0 < q(m) · 1. Figure 3.25 shows the way
in which this algorithm proceeds for the distribution
1 2 3 4
.1 .2 .3 .4
f (x) =
We begin with q(i) = 4 — f (i) = .4, .8, 1.2, 1.6 for i = 1, 2, 3, 4. Then since
m = 1 and M = 4 these are the ¬rst to be adjusted. We assign A(1) = 4 and
q(1) = 0.4. Now since we have reassigned mass 1 ’ q(1) to M = 4 we replace
q(4) by 1.6 ’ (1 ’ 0.4) = 1. We now ¬x and remove q(1) and continue with
q(i) = .8, 1.2, 1.0 for i = 2, 3, 4. The next step results in ¬xing q(2) = 0.8,
A(2) = 3 and changing q(3) to q(3) ’ (1 ’ q(2)) = 1. After this iteration,
the remaining q(3), q(4) are both equal to 1, so according to step 6 we may
terminate the algorithm. Notice that we terminated without assigning a value
to A(3) and A(4). This assignment is unnecessary since the probability the alias
A(i) is used is (1 ’ q(i)) which is zero in these two cases. The algorithm
therefore results in aliases A(i) = 4, 3, i = 1, 2 and q(i) = .4, .8, 1, 1, respectively
for i = 1, 2, 3, 4. Geometrically, this method iteratively adjusts a probability
histogram to form a rectangle with base K as in Figure 3.25.
Suppose I now wish to generate random variables from this discrete distrib-
ution. We simply generate a random variable uniform on the set {1, 2, 3, 4} and
if 1 is selected, we replace it by A(1) = 4 with probability 1 ’ q(1) = 0.6. If 2
is selected it is replaced by A(2) = 3 with probability 1 ’ q(2) = 0.2.

Acceptance-Rejection for Discrete Random Variables

The acceptance-rejection algorithm can be used both for generating discrete and
continuous random variables and the geometric interpretation in both cases is
essentially the same. Suppose for example we wish to generate a discrete random
variable X having probability function f (x) using as a dominating function
a multiple of g(x) the probability density function of a continuous random
variable. Take for example the probability function

distribution aliased: .1 .2 .3 .4






4 3
1 2 3 1 2 4

Figure 3.25: The alias method for generating from the distribution 0.1 0.2 0.3

1 2 3 4
f (x) = .1 .3 .4 .2
using the dominating function 2g(x) = 0.1 + 0.2(x ’ 0.5) for 0.5 < x < 4.5. It
is easy to generate a continuous random variable from the probability density
function g(x) by inverse transform. Suppose we generate the value X. Then if
this value is under the probability histogram graphed in Figure 3.26 we accept
the value (after rounding it to the nearest integer to conform the discreteness
of the output distribution) and otherwise we reject and repeat.

We may also dominate a discrete distribution with another discrete distrib-
ution in which case the algorithm proceeds as in the continuous case but with
the probability density functions replaced by probability functions.









1 2 3 4

Figure 3.26: Acceptance-Rejection with for Discrete Distribution with continu-
ous dominating function.

The Poisson Distribution.

Consider the probability function for a Poisson distribution with parameter »

»x e’»
f (x) = , x = 0, 1, ...

The simplest generator is to use the Poisson process. Recall that a Poisson
process with rate 1 on the real line can be described in two equivalent ways:

1. Points are distributed on the line in such a way that the spacings be-
tween consecutive points are independent exponential(») random vari-
ables. Then the resulting process is a Poisson process with rate ».

2. The number of points in an interval of length h has a Poisson (»h) distri-
bution. Moreover the numbers of points in non-overlapping intervals are
independent random variables.

The simplest generator stems from this equivalence. Suppose we use the
¬rst speci¬cation to construct a Poisson process with rate parameter 1 and
then examine X = the number of points occurring in the interval [0, »]. This
is the number of partial sums of exponential(1) random variables that are less

than or equal to »
X = inf{n; (’lnUi ) > »}

or equivalently
Ui < e’» } (3.26)
X = inf{n;

This generator requires CPU time which grows linearly with » since the
number of exponential random variables generated and summed grows linearly
with » and so an alternative for large » is required. Various possibilities of
acceptance-rejection algorithms have been suggested including dominating the
Poisson probability function with multiples of the logistic probability density
function (Atkinson (1979)), the normal density with exponential right tail (cf.
Devroye, lemma 3.8, page 509). A simple all-purpose dominating function is the
so-called table-mountain function (cf. Stadlober (1989)), essentially a function
with a ¬‚at top and tails that decrease as 1/x2 . Another simple alternative for
generating Poisson variates that is less e¬cient but simpler to implement is to
use the Lorentzian, or truncated Cauchy distribution with probability density
g(x|a, b) = ,x > 0
b2 + (x ’ a)2

where c0 is the normalizing constant. A random variable is generated from this
distribution using the inverse transform method; X = a + b tan(πU ),, where
U ∼ U [0, 1]. Provided that we match the modes of the distribution a = » and

put b = 2», this function may be used to dominate the Poisson distribution
and provide a simple rejection generator. The Matlab Poisson random number
generator is poissrnd (», m, n) which generates an m — n matrix of Poisson(»)
variables. This uses the simple generator (3.26) and is not computationally
e¬cient for large values of ».In R the command rpois(n,») generates a vector
of n Poisson variates.

The Binomial Distribution

For the Binomial distribution, we may use any one of the following alternatives:

I(Ui < p), Ui ∼ independent uniform[0, 1]
(1) X = i=1

Gi > n}, where Gi ∼independent Geometric(p)
(2) X = inf{x; i=1

Px+1 Ei
> ’log(1’p)}, where Ei ∼independent Exponential(1).
(3) X = inf{x; i=1 n’i+1

Method (1) obtains from the de¬nition of the sum of independent Bernoulli
random variables since each of the random variables I(Ui < p) are independent,
have values 0 and 1 with probabilities 1 ’ p and p respectively. The event
(Ui < p) having probability p is typically referred to as a “success”. Obviously
this method will be slow if n is large. For method (2), recall that the number of
trials necessary to obtain the ¬rst success, G1 , say, has a geometric distribution.
Similarly, G2 represents the number of additional trials to obtain the second
success. So if X = j, the number of trials required to obtain j + 1 successes
was greater than n and to obtain j successes, less than or equal to n. In other
words there were exactly j successes in the ¬rst n trials. When n is large but np
fairly small, method (2) is more e¬cient since it is proportional to the number
os successes rather than the total number of trials. Of course for large n and
np su¬ciently small (e.g. <1), we can also replace the Binomial distribution
by its Poisson (» = np) approximation. Method (3) is clearly more e¬cient if
’log(1 ’ p) is not too large so that p is not too close to 1, because in this case
we need to add fewer exponential random variables.
For large mean np and small n(1 ’ p) we can simply reverse the role of
successes and failures and use method (2) or (3) above. But if both np and
n(1 ’ p) are large, a rejection method is required. Again we may use rejection
beginning with a Lorentzian distribution, choosing a = np, and b = 2np(1 ’ p)
in the case p < 1/2. When p > 1/2, we simply reverse the roles of “failures”
and “successes”. Alternatively, a dominating table-mountain function may be

used (Stadlober (1989)). The binomial generator in Matlab is the function
binornd(n,p,j,k) which generates an n — k matrix of binomial(n, p) random
variables. This uses the simplest form (1) of the binomial generator and is not
computationally e¬cient for large n. In R, rbinom(m,n,p) will generate a vector
of length m of Binomial(n, p) variates.

Random Samples Associated with Markov Chains

Consider a ¬nite state Markov Chain, a sequence of (discrete) random variables
X1 , X2 , . . .each of which takes integer values 1, 2, . . . N (called states). The
number of states of a Markov chain may be large or even in¬nite and it is not
always convenient to label them with the positive integers and so it is common
to de¬ne the state space as the set of all possible states of a Markov chain, but
we will give some examples of this later. For the present we restrict attention to
the case of a ¬nite state space. The transition probability matrix is a matrix P
describing the conditional probability of moving between possible states of the
chain, so that

P [Xn+1 = j|Xn = i] = Pij , i = 1, . . . N, j = 1, . . . N.

where Pij ≥ 0 for all i, j and for all i. A limiting distribution
Pij = 1

of a Markov chain is a vector (π say) of long run probabilities of the individual
states with the property that

πi = limt’∞ P [Xt = i].

A stationary distribution of a Markov chain is the column vector (π say) of
probabilities of the individual states such that

π0 P = π0 . (3.28)

π 0 P = π0 . For a Markov chain, every limiting distribution is in fact a station-
ary distribution. For the basic theory of Markov Chains, see the Appendix.
Roughly, a Markov chain which eventually “forgets” the states that were occu-
pied in the distant path, in other words for which the probability of the current
states does not vary much as we condition on di¬erent states in the distant
past, is called ergodic. A Markov chain which simply cycles through three
states 1 ’ 2 ’ 3 ’ 1 ’ ... is an example of a periodic chain, and is not

It is often the case that we wish to simulate from a ¬nite ergodic Markov
chain when it has reached equilibrium or stationarity, which is equivalent to
sampling from the distribution of Xn assuming that the distribution of X0
is given by the stationary distribution π. In a few cases, we can obtain this
stationary distribution directly from (3.28) but when N is large this system of
equations is usually not feasible to solve and we need to ¬nd another way to
sample from the probability vector π. Of course we can always begin the Markov
chain in some arbitrary initial state and run it waiting for Hele to freeze over (it
does happen since Helle is in Devon) until we are quite sure that the chain has
essentially reachedequilibrium, and then use a subsequent portion of this chain,
discarding this initial period, sometimes referred to as the “initial transient”.

Clearly this is often not a very e¬cient method, particularly in cases in
which the chain mixes or forgets its past very slowly for in this case the required
initial transient is long. On the other hand if we shortened it, we run the risk of
introducing bias into our simulations because the distribution generated is too
far from the equilibrium distribution π. There are a number of solutions to this
problem proposed in a burgeoning literature. Here we limit ourselves to a few
of the simpler methods.

Metropolis-Hastings Algorithm

The Metropolis-Hastings Algorithm is a method for generating random variables
from a distribution π that applies even in the case of an in¬nite number of
states or a continuous distribution π. It is assumed that π is known up to some
multiplicative constant. Roughly, the method consists of using a convenient
“proposal” Markov chain with transition matrix Q to generate transitions, but
then only “accept” the move to these new states with probability that depends
on the distribution π. The idea resembles that behind importance sampling.
The basic result on which the Metropolis-Hastings algorithm is pinned is the
following theorem.

Theorem 31 Suppose Qij is the transition matrix of a Markov chain. Assume
that g is a vector of non-negative values such that N gi = G and

| | · K < ∞ for all i, j

for some ¬nite value K. De¬ne

gj Qji
ρij = min(1, )
gi Qij

Then the Markov Chain with transition probability matrix

Pij = Qij ρij , for i 6= j (3.29)

has stationary distribution πi = G.

Proof. The proof consists of showing that the so-called “detailed balance
condition” is satis¬ed, i.e. with πi = that

πi Pij = πj Pji , for all i, j. (3.30)

This condition implies that when the chain is operating in equilibrium,

P [Xn = i, Xn+1 = j] = P [Xn = j, Xn+1 = i]

re¬‚ecting a cavalier attitude to the direction in which time ¬‚ows or reversibility
of the chain. Of course (3.30) is true automatically if i = j and for i 6= j,
gi gj Qji
πi Pij = Qij min(1, )
G gi Qij
= min(gi Qij , gj Qji )
= πj Pji

by the symmetry of the function min(gi Qij , gj Qji ). Now the detailed balance

condition (3.30) implies that π is a stationary distribution for this Markov chain
πi Pij = πj Pji
i=1 i=1
= πj Pji

= πj for each j = 1, ..., N.

Provided that we are able to generate transitions for the Markov Chain with
transition matrix Q, it is easy to generate a chain with transition matrix P in
(3.29). If we are currently in state i, generate the next state with probability
Qij . If j = i then we stay in state i. If j 6= i, then we “accept” the move to state
j with probability ρij , otherwise we stay in state i. Notice that the Markov
Chain with transition matrix P tends to favour moves which increase the value
of π. For example if the proposal chain is as likely to jump from i to j as it
is to jump back so that Qij = Qji , then if πj > πi the move to j is always
accepted whereas if πj < πi the move is only accepted with probability The
πi .

assumption Qij = Qji is a common and natural one, since in applications of
the Metropolis-Hastings algorithm, it is common to choose j “at random” (i.e.
uniformly distributed) from a suitable neighborhood of i.
The above proof only provides that π is a stationary distribution of the
Markov Chain associated with P, not that it is necessarily the limiting distrib-

ution of this Markov chain. For this to follow we need to know that the chain
is ergodic. Various conditions for ergodicity are given in the literature. See for
example Robert and Casella (1999, Chapter 6) for more detail.

Gibbs Sampling

There is one simple special case of the Metropolis-Hastings algorithm that is
particularly simple, common and compelling. To keep the discussion simple,
suppose the possible states of our Markov Chain are points in two-dimensional
space (x, y). We may assume both components are discrete or continuous. Sup-
pose we wish to generate observations from a stationary distribution which is
proportional to g(x, y) so

g(x, y)
π(x, y) = P P (3.31)
y g(x, y)

de¬ned on this space but that the form of the distribution is such that directly
generating from this distribution is di¬cult, perhaps because it is di¬cult to
obtain the denominator of (3.31). However there are many circumstances where
it is much easier to obtain the value of the conditional distributions

π(x, y)
π(x|y) = P and
z π(z, y)
π(x, y)
π(y|x) = P
z π(x, z)

Now consider the following algorithm: begin with an arbitrary value of y0
and generated x1 from the distribution π(x|y0 ) followed by generating y1 from
the distribution π(y|x1 ). It is hard to imagine a universe in which iteratively
generating values xn+1 from the distribution π(x|yn ) and then yn+1 from the
distribution π(y|xn+1 ) does not, at least asymptotically as n ’ ∞, eventually
lead to a draw from the joint distribution π(x, y). Indeed that is the case since
the transition probabilities for this chain are given by

P (xn+1 , yn+1 |xn , yn ) = π(xn+1 |yn )π(yn+1 |xn+1 )

and it is easy to show directly from these transition probabilities that
P (x1 , y1 |x, y)π(x, y)
= π(y1 |x1 ) π(x1 |y) π(x, y)
y x
= π(y1 |x1 ) π(x1 , y)

= π(x1 , y1 ).

Of course the real power of Gibbs Sampling is achieved in problems that are
not two-dimensional such as the example above, but have dimension su¬ciently
high that calculating the sums or integrals in the denominator of expressions
like (3.31) is not computationally feasible.

Coupling From the Past: Sampling from the stationary dis-
tribution of a Markov Chain

All of the above methods assume that we generate from the stationary distri-
bution of a Markov chain by the “until Hele freezes over” method, i.e. wait
until run the chain from an arbitrary starting value and then delete the initial
transient. An alternative elegant method that is feasible at least for some ¬nite
state Markov chains is the method of “coupling from the past” due to Propp
and Wilson (1996).
We assume that we are able to generate transitions in the Markov Chain. In
other words if the chain is presently in state i at time n we are able to generate
a random variable Xn+1 from the distribution proportional to Pij , j = 1, ...K.
Suppose F (x|i) is the cumulative distribution function P (Xn+1 · x|Xn = i)
and let us denote its inverse by F ’1 (y|i). So if we wish to generate a random
variable Xn+1 conditional on Xn , we can use the inverse transform Xn+1 =
F ’1 (Un+1 |Xn ) applied to the Uniform[0,1] random variable Un+1 . Notice that
a starting value say X’100 together with the sequence of uniform[0,1] variables

(U’99 , ..., U0 ) determines the chain completely over the period ’100 · t · 0.
If we wish to generated the value of Xt given Xs , s < t, then we can work this
expression backwards

Xt = F ’1 (Ut’1 |Xt’1 )

= F ’1 (Ut’1 |F ’1 (Ut’2 |Xt’2 ))

= F ’1 (Ut’1 |F ’1 (Ut’2 |...F ’1 (Ut’1 |F ’1 (Us |i))))
= Fs (Xs ), say.

Now imagine an in¬nite sequence {Ut , t = ..., ’3, ’2, ’1} of independent uni-
form[0,1] random variables that was used to generate the state X0 of a chain
at time 0. Let us imagine for the moment that there is a value of M such that
F’M (i) is a constant function of i. This means that for this particular draw
of uniform random numbers, whatever the state i of the system at time ’M,
the same state X0 = F’M (i) is generated to time 0. All chains, possibly with
di¬erent behaviour prior to time ’M are ”coupled” at time ’M and identical
from then on. In this case we say that coalescence has occurred in the interval
[’M, 0]. No matter where we start the chain at time ’M it ends up in the
same state at time 0, so it is quite unnecessary to simulate the chain over the
whole in¬nite time interval ’∞ < t · 0. No matter what state is occupied
at time t = ’M, the chain ends up in the same state at time t = 0. When
coalescence has occurred, we can safely consider the common value of the chain
at time 0 to be generated from the stationary distribution since it is exactly
the same value as if we had run the chain from t = ’∞.
There is sometimes an easy way to check whether coalescence has occurred
in an interval, if the state space of the Markov chain is suitably ordered. For
example suppose the states are numbered 1, 2, ..., N. Then it is sometimes
possible to relabel the states so that the conditional distribution functions F (x|i)
are stochastically ordered, or equivalently that F ’1 (U |i) is monotonic (say
monotonically increasing) in i for each value of U. This is the case for example

provided that the partial sums Pil are increasing functions of i for each
j = 1, 2, ..., N. If follows that the functions F’M (i) are all monotonic functions
of i and so
0 0 0
F’M (1) · F’M (2) · ...F’M (N ).

0 0 0
Therefore, if F’M (1) = F’M (N ), then F’M (i) must be a constant function.
Notice also that if there is any time in an interval [s, t] at which coalescence
occurs so that Fs (i) is a constant function of i, then for any interval [S, T ]
containing it [S, T ] ⊃ [s, t], FS (i) is also a constant function of i.
It is easy to prove that coalescence occurs in the interval [’M, 0] for suf-
¬ciently large M. For an ergodic ¬nite Markov chain, there is some step size
„ such that every transition has positive probability P [Xt+„ = j|Xt = i] > ²
for all i, j. Consider two independent chains, one beginning in state i and the
other in state i0 at time t = 0. Then the probability that they occupy the same
state j at time t = „ is at least ²2 . It is easy to see that if we use inverse
transform to generate the transitions and if they are driven by common random
numbers then this can only increase the probability of being in the same state,
so the probability these two chains are coupled at time „ is at least ²2 . Similarly
for N possible states, the probability of coalescence in an interval of length „
is at least µN > 0. Since there are in¬nitely many intervals disjoint of length
„ in [’∞, 0] and the events that there is a coalescence in each interval are
independent, the probability that coalescence occurs somewhere in [’∞, 0] is
We now detail the Propp Wilson algorithm

1. Set M = 1, XU = N, XL = 1

2. Generate U’M ....U’M/2+1 all independent U nif orm[0, 1].

3. For t = ’M to ’1 repeat

(a) obtain XL = F ’1 (Ut’1 |XL ) and XU = F ’1 (Ut’1 |XU ).

(b) If XL = XU stop and output X(0) = XL

4. Otherwise, set M = 2M and go to step 2.

This algorithm tests for coalescence repeatedly by starting on the intervals

[’1, 0], [’2, ’1], [’4, ’2], [’8, ’4].

We are assured that with probability one, the process will terminate with co-
alescence after a ¬nite number of steps. Moreover, in this algorithm that the
random variable Ut once generated is NOT generated again on a subsequent
pass when M is doubled. The generated Ut is reused at each pass until coales-
cence occurs. If Ut were regenerated on subseuqent passes, this would lead to
bias in the algorithm.
It may well be that this algorithm needs to run for a very long time before
achieving coalescence and an impatient observer who interrupts the algorithm
prior to coalescence and starts over will bias the results. Varous modi¬cations
have been made to speed up the algorithm (e.g. Fill, 1998).

Sampling from the Stationary Distribution of a Di¬usion

A basic Ito process of the form

dXt = a(Xt )dt + σ(Xt )dWt

is perhaps the simplest extension of a Markov chain to continuous time, contin-
uous state-space. It is well-known that under fairly simple conditions, there is
a unique (strong) solution to this equation and that the limiting distribution of
XT as T ’ ∞ has stationary distribution with probability density function
1 a(z)
f (x) = c 2 exp{2 dz}
σ (x) 0 σ (z)

where the constant c is chosen so that the integral of the density is 1. To be
able to do this we need to assume that
∞ x
1 a(z)
dz}dx < ∞. (3.32)
σ 2 (x) σ 2 (z)
’∞ 0

In order to generate from this stationary distribution, we can now start the
process at some arbitrary value X0 and run it for a very long time T , hoping
that this is su¬ciently long that the process is essentially in its stationary
state, or try to generate X0 more directly from (3.32) in which case the process
is beginning (and subsequently running) with its stationary distribution.
For an example, let us return to the CIR process

dXt = k(b ’ Xt )dt + σXt (3.33)
dWt .

In this case

a(x) = k(b ’ x), for x > 0,

σ 2 (x) = σ2 x, for x > 0.

Notice that
Z x
k(b ’ z)
1 1 2kb k
dz} = 2 x’1 exp{ 2 ln(x/µ) ’ 2 (x ’ µ)}
σ2 x 2z
σ σ σ σ

is proportional to
x2kb/σ ’1
exp{’kx/σ 2 }

and the integral of this function, a Gamma function, will fail to converge unless
2kb/σ 2 ’ 1 > ’1 or 2kb > σ 2 . Under this condition the stationary distribution
of the CIR process is Gamma(2kb/σ 2 , σ ). If this condition fails and 2kb < σ 2 ,

then the process Xt is absorbed at 0. If we wished to simulate a CIR process in
equilibrium, we should generate starting values of X0 from the Gamma distrib-
ution. More generally for a CEV process satisfying

dXt = k(b ’ Xt )dt + σXt (3.34)

a similar calculation shows that the stationary density is proportional to

2kb 1 k
x’γ exp{’ ’ 2 xγ }, for γ > 1.
σ 2 xγ’1 (γ ’ 1) σ γ

Simulating Stochastic Partial Di¬erential Equa-

Consider a derivative product whose underlying asset has price Xt which follows
some model. Suppose the derivative pays an amount V0 (XT ) on the maturity
date T. Suppose that the value of the derivative depends only on the current time
t and the current value of the asset S, then its current value is the discounted
future payo¬, an expectation of the form
V (S, t) = E[V0 (XT )exp{’ r(Xv , v)dv}|Xt = S]

where r(Xt , t) is the current spot interest rate at time t. In most cases, this ex-
pectation is impossible to evaluate analytically and so we need to resort to
numerical methods. If the spot interest rate is function of both arguments
(Xv , v) and not just a function of time, then this integral is over the whole
joint distribution of the process Xv , 0 < v < T and simple one-dimensional
methods of numerical integration do not su¬ce. In such cases, we will usu-
ally resort to a Monte-Carlo method. The simplest version requires simulating
a number of sample paths for the process Xv starting at Xt = S, evaluating
V0 (XT )exp{’ t r(Xv , v)dv} and averaging the results over all simulations. We
begin by discussing the simulation of the process Xv required for integrations
such as this.
Many of the stochastic models in ¬nance reduce to simple di¬usion equation
(which may have more than one factor or dimension). Most of the models
in ¬nance are Markovian in the sense that at any point t in time, the future
evolution of the process depends only on the current state Xt and not on the

past behaviour of the process Xs , s < t. Consequently we restrict to a “Markov
di¬usion model” of the form

dXt = a(Xt , t)dt + σ(Xt , t)dWt

with some initial value X0 for Xt at t = 0. Here Wt is a driving standard Brown-
ian motion process. Solving deterministic di¬erential equations can sometimes
provide a solution to a speci¬c problem such as ¬nding the arbitrage-free price of
a derivative. In general, for more complex features of the derivative such as the
distribution of return, important for considerations such as the Value at Risk,
we need to obtain a solution {Xt , 0 < t < T }to an equation of the above form
which is a stochastic process. Typically this can only be done by simulation.
One of the simplest methods of simulating such a process is motivated through
a crude interpretation of the above equation in terms of discrete time steps, that
is that a small increment Xt+h ’ Xt in the process is approximately normally
distributed with mean given by a(Xt , t)hand variance given by σ 2 (Xt , t)h. We
generate these increments sequentially, beginning with an assumed value for
X0 , and then adding to obtain an approximation to the value of the process
at discrete times t = 0, h, 2h, 3h, . . .. Between these discrete points, we can
linearly interpolate the values. Approximating the process by assuming that
the conditional distribution of Xt+h ’ Xt is N (a(Xt , t)h, σ 2 (Xt , t)h) is called
Euler™s method by analogy to a simple method by the same name for solving or-
dinary di¬erential equations. Given simulations of the process satisfying (3.36)
together with some initial conditions, we might average the returns on a given
derivative for many such simulations, (provided the process is expressed with
respect to the risk-neutral distribution), to arrive at an arbitrage-free return for
the derivative.

In this section we will discuss the numerical solution, or simulation of the
solution to stochastic di¬erential equations.

Letting ti = i∆x, Equation (3.36) in integral form implies
ti+1 ti+1
Xti+1 = Xti + a(Xs , s)ds + σ(Xs , s)dWs
ti ti

For the following lemma we need to introduce Op or “order in probability”,
notation common in mathematics and probability. A sequence indexed by ∆t,
say Y∆t = Op (∆t)k means that when we divide this term by (∆t)k and then
let ∆t ’ 0, the resulting sequence is bounded in probability or that for each µ
there exists K < ∞ so that

| > K] < µ
P [|

whenever |∆t| < µ. As an example, if W is a Brownian motion, then ∆Wt =
W (t+∆t)’W (t) has a Normal distribution with mean 0 and standard deviation

∆t and is therefore Op (∆t)1/2 . Similarly Then we have two very common
and useful approximations to a di¬usion given by the following lemma.

Lemma 32 If Xt satis¬es a di¬usion equation of the form (3.36) then

Xti+1 = Xti + a(Xti , ti )∆t + σ(Xti , ti )∆Wt + Op (∆t) (Euler approximation)

σ(Xti , ti ) ‚x σ(Xti , ti )
[(∆Wt )2 ’ ∆t] + Op (∆t)3/2 (Milstei
Xti+1 = Xti + a(Xti , ti )∆t + σ(Xti , ti )∆Wt +

Proof. Ito™s lemma can be written in terms of two operators on functions
f for which the derivatives below exist;

df (Xt , t) = Lo f dt + L1 f dWt where

1 2 ‚2
‚ ‚
+ , and
L =a +σ 2
‚x 2 ‚x ‚t

L1 = σ .

Integrating, this and applying to twice di¬erentiable functions a and σ and
s > ti ,
s s
L1 a(Xu , u)dWu
a(Xs , s) = a(Xti , ti ) + L a(Xu , u)du +
t t
Z is Z is
L0 σ(Xu , u)du + L1 σ(Xu , u)dWu .
σ(Xs , s) = σ(Xti , ti ) +
ti ti

By substituting in each of the integrands in 3.37 using the above identity and
iterating this process we arrive at the Ito-Taylor expansions (e.g. Kloeden and
Platen, 1992). For example,
Z ti+1 Z ti+1 Z Z
s s
L1 a(Xu , u)dWu }ds
{a(Xti , ti ) +
a(Xs , s)ds = L a(Xu , u)du +
ti ti ti ti
Z ti+1 Z Z Z
s ti+1 s
0 1
≈ a(Xti , ti )∆t + L a(Xti , ti ) duds + L a(Xti , ti ) dWu ds
ti ti ti ti

The ¬rst term a(Xti , ti )∆t, is an initial approximation to the desired inte-
gral and the rest is a lower order correction that we may regard as an er-
ror term for the moment. For example it is easy to see that the second term
Rt Rs Rt Rs
L0 a(Xti , ti ) tii+1 ti duds is Op (∆t)2 because the integral tii+1 ti duds = (∆t)2 /2
Rt Rs
and L0 a(Xti , ti ) is bounded in probability. The third term L1 a(Xti , ti ) tii+1 ti dWu ds
Rt Rs Rt
is Op (∆t)3/2 since tii+1 ti dWu ds = tii+1 (ti+1 ’ u)dWu and this is a normal
random variable with mean 0 and variance tii+1 (ti+1 ’ u)2 du = (∆t)3 /3. We
can write such a normal random variable as 3’1/2 (∆t)3/2 Z for Z a standard
normal random variable and so this is obviously Op (∆t)3/2 . Thus the simplest
Euler approximation to the distribution of the increment assumes that ∆X has
conditional mean a(Xti , ti )∆t. Similarly
Z ti+1 Z ti+1 Z Z
s s
L1 σ(Xu , u)dWu }dWs
{σ(Xti , ti ) +
σ(Xs , s)dWs = L σ(Xu , u)du +
ti ti ti ti
ti+1 s ti+1 s
0 1
≈ σ(Xti , ti )∆Wt + L σ(Xti , ti ) dudWs + L σ(Xti , ti ) dWu dWs
ti ti ti ti

σ(Xti , ti ) ‚x σ(Xti , ti )
[(∆Wt )2 ’ ∆t] + Op (∆t)3/2
= σ(Xti , ti )∆Wt +
R ti+1 R s
dWu dWs = 1 [(∆Wt )2 ’ ∆t], L0 σ(Xu , u) = σ(Xti , ti ) +Op (∆t)1/2 ,
since 2
ti ti
Rt Rs

L1 σ(Xu , u) = σ(Xu , u) ‚x σ(Xu , u) and tii+1 ti dudWs = Op (∆t)3/2 . Putting

these terms together, we arrive at an approximation to the increment of the

σ(Xti , ti ) ‚x σ(Xti , ti )
[(∆Wt )2 ’∆t]+Op (∆t)3/2
∆Xt = a(Xti , ti )∆t+σ(Xti , ti )∆Wt +
which allow an explicit representation of the increment in the process X in terms
of the increment of a Brownian motion process ∆Wt ∼ N (0, ∆t).

The approximation (3.38) is called the Milstein approximation, a re¬nement
of the ¬rst, the Euler approximation. It is the second Ito-Taylor approximation
to a di¬usion process. Obviously, the increments of the process are quadratic
functions of a normal random variable and are no longer normal. The error
approaches 0 at the rate Op (∆t)3/2 in probability only. This does not mean
that the trajectory is approximated to this order but that the di¬erence between
the Milstein approximation to a di¬usion and the di¬usion itself is bounded in
probability when divided by (∆t)3/2 and as we let ∆t ’ 0. Higher order Taylor
approximations are also possible, although they grow excessively complicated
very quickly. See the book by Kloeden and Platten(1992) for details.

There remains the question of how much di¬erence it makes which of these
approximations we employ for a particular di¬usion. Certainly there is no dif-
ference at all between the two approximations in the case that the di¬usion
coe¬cient σ(Xt , t) does not depend at all on Xt . In general, the di¬erence is
hard to assess but in particular cases we can at least compare the performance of
the two methods. The approximations turn out to be very close in most simple
cases. For example consider the stock price path in Figure 3.27. The dashed
line corresponds to a Milstein approximation whereas the piecewise continuous
line corresponds to the Euler approximation. In this case the Milstein appears
to be a little better, but if I run a number of simulations and compare the sum
of the squared errors (i.e. squared di¬erences between the approximate value
of Xt and the true value of Xt ) we ¬nd that the improvement is only about

Figure 3.27: Comparison of Milstein and Euler approximation to stock with
∆t = 1/12 year.

two percent of the di¬erence. The same is true even if I change the value of
∆t from 1/12 (i.e. one month) to 1/52 (i.e. one week). Unlike the behaviour
of higher order approximations to deterministic functions, there appears to be
little advantage in using a higher order approximation, at least in the case of
di¬usions with smooth drift and di¬usion coe¬cients.

We can compare using Milstein approximation on the original process and
using Euler™s approximation on a transformation of the process in the case that
the di¬usion term depends only on the state of the process (not time). In other

words, suppose we have an Ito process of the form

dXt = a(Xt , t)dt + σ(Xt )dWt

where Wt is an ordinary Wiener measure. A simple transformation reduces this
to a problem with constant di¬usion term. Suppose σ(x) > 0 for all x and let
Z x
dz, for x ≥ 0
s(x) =
Z 0
s(x) = ’ dz for x < 0

where we assume these integrals are well de¬ned. Let g be the inverse function of
s. This inverse exists since the function is continuous monotonically increasing.
Suppose we apply Ito™s lemma to the transformed process Yt = s(Xt ). We obtain

dYt = {a(Xt , t)s0 (Xt ) + σ2 (Xt )s00 (Xt )}dt + σ(Xt )s0 (Xt )dWt
σ 0 (Xt )
a(Xt , t) 1 2
={ ’ σ (Xt ) 2 }dt + dWt
σ(Xt ) 2 σ (Xt )
= µ(Yt , t)dt + dWt

a(g(Yt ), t) 1 0
’ σ (g(Yt )).
µ(Yt , t) =
σ(g(Yt )) 2

In other words, Y t satis¬es an Ito equation with constant di¬usion term.
Suppose we generate an increment in Yt using Euler™s method and then solve
for the corresponding increment in Xt .Then using the ¬rst two terms in the
Taylor series expansion of g,

∆Xt = g 0 (Yt )∆Yt + g 00 (Yt )(∆Yt )2
= g 0 (Yt )(µ(Yti , ti )∆t + ∆Wt ) + σ 0 (g(Yt ))σ(g(Yt ))(∆Yt )2
1 1
= {a(g(Yt ), t) ’ σ(g(Yt ))σ 0 (g(Yt ))}∆t + σ(g(Yt ))∆Wt + σ 0 (g(Yt ))σ(g(Yt ))(∆Yt )2
2 2


g 0 (Yt ) = = σ(g(Yt )) and
s0 (g(Yt ))
g 00 (Yt ) = σ0 (g(Yt ))σ(g(Yt )).

But since (∆Yt )2 = (∆Wt )2 + o(∆t) it follows that

1 1
∆Xt = {a(Xt , t)’ σ(Xt )σ 0 (Xt )}∆t+σ(Xt )∆Wt + σ 0 (Xt )σ(Xt )(∆Wt )2 +o(∆t)
2 2

and so the approximation to this increment is identical, up to the order con-
sidered, to the Milstein approximation. For most processes, it is preferable
to apply a di¬usion stabilizing transformation as we have here, prior to dis-
cretizing the process. For the geometric Brownian motion process, for example,
the di¬usion-stabilizing transformation is a multiple of the logarithm, and this
transforms to a Brownian motion, for which the Euler approximation gives the
exact distribution.

Example: Down-and-out-Call.

Consider an asset whose price under the risk-neutral measure Q follows a con-
stant elasticity of variance (CEV) process

dSt = rSt dt + σSt dWt

for a standard Brownian motion process Wt . A down-and-out call option with
exercise price K provides the usual payment (ST ’ K)+ of a European call
option on maturity T if the asset never falls below a given out barrier b. The
parameter γ > 0 governs the change in the di¬usion term as the asset price
changes. We wish to use simulation to price such an option with current asset
price S0 , time to maturity T , out barrier b < S0 and constant interest rate r
and compare with the Black-Scholes formula as b ’ 0.
A geometric Brownian motion is most easily simulated by taking logarithms.

For example if St satis¬es the risk-neutral speci¬cation

dSt = rSt dt + σSt dWt

then Yt = log(St ) satis¬es

dYt = (r ’ σ2 /2)dt + σdWt . (3.42)

This is a Brownian motion and is simulated with a normal random walk. In-
dependent normal increments are generated ∆Yt ∼ N ((r ’ σ2 /2)∆t, σ2 ∆t) and
their partial sums used to simulate the process Yt . The return for those options
that are in the money is the average of the values of (eYT ’ E)+ over those
paths for which min{Ys ; t < s < T } ≥ ln(b). Similarly the transformation of
the CEV process which provides a constant di¬usion term is determined by
s(x) = dz
0 σ(z)
Zx ⎨ x1’γ if γ 6= 1
z ’γ dz =
= .
© ln(x) if γ = 1


. 6
( 11)