
Summary (continued)
• The eigenvectors belonging to the largest eigenvalues indicate the “main
  direction” of the data.
• The Jordan decomposition allows one to easily compute the power of a
  symmetric matrix A: A^α = Γ Λ^α Γ⊤.
• The singular value decomposition (SVD) is a generalization of the Jordan
  decomposition to non-quadratic (rectangular) matrices.



2.3 Quadratic Forms
A quadratic form Q(x) is built from a symmetric matrix A(p × p) and a vector x ∈ R^p:

    Q(x) = x⊤A x = Σ_{i=1}^p Σ_{j=1}^p a_{ij} x_i x_j .                          (2.21)
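The following minimal Python/numpy sketch evaluates (2.21) both in matrix form and as a double sum; the matrix and vector are hypothetical examples, not taken from the text.

```python
import numpy as np

# A symmetric 2x2 matrix and a vector x (illustrative choices)
A = np.array([[1.0, 2.0],
              [2.0, 3.0]])
x = np.array([1.0, -1.0])

# Matrix form of (2.21): Q(x) = x'Ax
Q_matrix = x @ A @ x

# Double-sum form of (2.21): sum_i sum_j a_ij x_i x_j
Q_sum = sum(A[i, j] * x[i] * x[j] for i in range(2) for j in range(2))

print(Q_matrix, Q_sum)   # both equal (0.0 for this particular x)
```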



Definiteness of Quadratic Forms and Matrices

    Q(x) > 0 for all x ≠ 0     positive definite
    Q(x) ≥ 0 for all x ≠ 0     positive semidefinite

A matrix A is called positive definite (semidefinite) if the corresponding quadratic form Q(·)
is positive definite (semidefinite). We write A > 0 (≥ 0).
Quadratic forms can always be diagonalized, as the following result shows.

THEOREM 2.3 If A is symmetric and Q(x) = x⊤Ax is the corresponding quadratic form,
then there exists a transformation x → Γ⊤x = y such that

    x⊤A x = Σ_{i=1}^p λ_i y_i² ,

where λ_i are the eigenvalues of A.

Proof:
A = Γ Λ Γ⊤. By Theorem 2.1 and y = Γ⊤x we have that x⊤Ax = x⊤Γ Λ Γ⊤x = y⊤Λ y =
Σ_{i=1}^p λ_i y_i².                                                                □
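A short numerical check of Theorem 2.3 (a Python/numpy sketch with a hypothetical symmetric matrix and test vector):

```python
import numpy as np

A = np.array([[1.0, 2.0],
              [2.0, 3.0]])
x = np.array([0.5, 2.0])

# Spectral (Jordan) decomposition A = Gamma Lambda Gamma'
lam, Gamma = np.linalg.eigh(A)

# Transformation y = Gamma' x from Theorem 2.3
y = Gamma.T @ x

# x'Ax equals the weighted sum of squares sum_i lambda_i y_i^2
print(x @ A @ x, np.sum(lam * y**2))   # identical up to rounding
```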


Positive definiteness of quadratic forms can be deduced from positive eigenvalues.


THEOREM 2.4 A > 0 if and only if all λ_i > 0, i = 1, . . . , p.

Proof:
0 < λ_1 y_1² + · · · + λ_p y_p² = x⊤Ax for all x ≠ 0 by Theorem 2.3.                □


COROLLARY 2.1 If A > 0, then A^{-1} exists and |A| > 0.

EXAMPLE 2.6 The quadratic form Q(x) = x_1² + x_2² corresponds to the matrix A = (1 0; 0 1) with
eigenvalues λ_1 = λ_2 = 1 and is thus positive definite. The quadratic form Q(x) = (x_1 − x_2)²
corresponds to the matrix A = (1 −1; −1 1) with eigenvalues λ_1 = 2, λ_2 = 0 and is positive
semidefinite. The quadratic form Q(x) = x_1² − x_2² with eigenvalues λ_1 = 1, λ_2 = −1 is
indefinite.

In the statistical analysis of multivariate data, we are interested in maximizing quadratic
forms given some constraints.

THEOREM 2.5 If A and B are symmetric and B > 0, then the maximum of x⊤Ax under
the constraint x⊤Bx = 1 is given by the largest eigenvalue of B^{-1}A. More generally,

    max_{x: x⊤Bx=1} x⊤Ax = λ_1 ≥ λ_2 ≥ · · · ≥ λ_p = min_{x: x⊤Bx=1} x⊤Ax,

where λ_1, . . . , λ_p denote the eigenvalues of B^{-1}A. The vector which maximizes (minimizes)
x⊤Ax under the constraint x⊤Bx = 1 is the eigenvector of B^{-1}A which corresponds to the
largest (smallest) eigenvalue of B^{-1}A.

Proof:
By definition, B^{1/2} = Γ_B Λ_B^{1/2} Γ_B⊤. Set y = B^{1/2}x, then

    max_{x: x⊤Bx=1} x⊤Ax = max_{y: y⊤y=1} y⊤B^{-1/2} A B^{-1/2} y.               (2.22)

From Theorem 2.1, let

    B^{-1/2} A B^{-1/2} = Γ Λ Γ⊤

be the spectral decomposition of B^{-1/2} A B^{-1/2}. Set

    z = Γ⊤y  ⇒  z⊤z = y⊤Γ Γ⊤y = y⊤y.

Thus (2.22) is equivalent to

    max_{z: z⊤z=1} z⊤Λ z = max_{z: z⊤z=1} Σ_{i=1}^p λ_i z_i² .

But

    max_{z⊤z=1} Σ λ_i z_i² ≤ λ_1 max_{z⊤z=1} Σ z_i² = λ_1 .

The maximum is thus obtained by z = (1, 0, . . . , 0)⊤, i.e.,

    y = γ_1  ⇒  x = B^{-1/2} γ_1 .

Since B^{-1}A and B^{-1/2} A B^{-1/2} have the same eigenvalues, the proof is complete.      □



EXAMPLE 2.7 Consider the following matrices

    A = (1 2; 2 3)   and   B = (1 0; 0 1).

We calculate

    B^{-1}A = (1 2; 2 3).

The biggest eigenvalue of the matrix B^{-1}A is 2 + √5. This means that the maximum of
x⊤Ax under the constraint x⊤Bx = 1 is 2 + √5.
Notice that the constraint x⊤Bx = 1 corresponds, with our choice of B, to the points which
lie on the unit circle x_1² + x_2² = 1.
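Theorem 2.5 and the numbers of Example 2.7 can be checked numerically; the sketch below compares the largest eigenvalue of B^{-1}A with a brute-force search over the unit circle (possible here because B = I).

```python
import numpy as np

# Matrices from Example 2.7
A = np.array([[1.0, 2.0],
              [2.0, 3.0]])
B = np.eye(2)

# Largest eigenvalue of B^{-1}A gives the maximum of x'Ax under x'Bx = 1
lam_max = np.max(np.linalg.eigvals(np.linalg.inv(B) @ A)).real
print(lam_max, 2 + np.sqrt(5))            # both approx 4.2361

# Cross-check by evaluating x'Ax on a fine grid of the unit circle
theta = np.linspace(0, 2 * np.pi, 100001)
xs = np.vstack([np.cos(theta), np.sin(theta)])   # columns are unit vectors
print(np.max(np.sum(xs * (A @ xs), axis=0)))     # approx 4.2361
```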




Summary
• A quadratic form can be described by a symmetric matrix A.
• Quadratic forms can always be diagonalized.
• Positive definiteness of a quadratic form is equivalent to positiveness of
  the eigenvalues of the matrix A.
• The maximum and minimum of a quadratic form given some constraints
  can be expressed in terms of eigenvalues.


2.4 Derivatives
For later sections of this book, it will be useful to introduce matrix notation for derivatives
of a scalar function of a vector x with respect to x. Consider f : R^p → R and a (p × 1) vector
x. Then ∂f(x)/∂x is the column vector of partial derivatives ∂f(x)/∂x_j, j = 1, . . . , p, and
∂f(x)/∂x⊤ is the row vector of the same derivatives (∂f(x)/∂x is called the gradient of f).

We can also introduce second order derivatives: ∂²f(x)/∂x∂x⊤ is the (p × p) matrix of elements
∂²f(x)/∂x_i ∂x_j, i = 1, . . . , p and j = 1, . . . , p (∂²f(x)/∂x∂x⊤ is called the Hessian of f).

Suppose that a is a (p × 1) vector and that A = A⊤ is a (p × p) matrix. Then

    ∂a⊤x/∂x = ∂x⊤a/∂x = a,                                                       (2.23)

    ∂x⊤Ax/∂x = 2Ax.                                                              (2.24)

The Hessian of the quadratic form Q(x) = x⊤Ax is:

    ∂²x⊤Ax/∂x∂x⊤ = 2A.                                                           (2.25)
EXAMPLE 2.8 Consider the matrix

    A = (1 2; 2 3).

From formulas (2.24) and (2.25) it immediately follows that the gradient of Q(x) = x⊤Ax is

    ∂x⊤Ax/∂x = 2Ax = 2 (1 2; 2 3) x = (2x_1 + 4x_2; 4x_1 + 6x_2)

and the Hessian is

    ∂²x⊤Ax/∂x∂x⊤ = 2A = 2 (1 2; 2 3) = (2 4; 4 6).
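A quick numerical check of (2.24) and (2.25) for Example 2.8, comparing the analytic gradient with a finite-difference approximation (the evaluation point x0 is an arbitrary illustrative choice):

```python
import numpy as np

A = np.array([[1.0, 2.0],
              [2.0, 3.0]])

def Q(x):
    return x @ A @ x

x0 = np.array([1.0, 2.0])
grad_analytic = 2 * A @ x0          # (2x1 + 4x2, 4x1 + 6x2) = (10, 16) at x0
hess_analytic = 2 * A               # [[2, 4], [4, 6]]

# Central finite differences as an independent check of the gradient
eps = 1e-6
grad_num = np.array([(Q(x0 + eps * e) - Q(x0 - eps * e)) / (2 * eps)
                     for e in np.eye(2)])
print(grad_analytic, grad_num)      # both approx (10, 16)
print(hess_analytic)
```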


2.5 Partitioned Matrices
Very often we will have to consider certain groups of rows and columns of a matrix A(n × p).
In the case of two groups, we have

    A = ( A_{11}  A_{12}
          A_{21}  A_{22} )

where A_{ij} is (n_i × p_j), i, j = 1, 2, n_1 + n_2 = n and p_1 + p_2 = p.


If B(n × p) is partitioned accordingly, we have:

    A + B = ( A_{11} + B_{11}   A_{12} + B_{12}
              A_{21} + B_{21}   A_{22} + B_{22} ),

    B⊤ = ( B_{11}⊤  B_{21}⊤
           B_{12}⊤  B_{22}⊤ ),

    A B⊤ = ( A_{11}B_{11}⊤ + A_{12}B_{12}⊤   A_{11}B_{21}⊤ + A_{12}B_{22}⊤
             A_{21}B_{11}⊤ + A_{22}B_{12}⊤   A_{21}B_{21}⊤ + A_{22}B_{22}⊤ ).

An important particular case is the square matrix A(p × p), partitioned such that A_{11} and
A_{22} are both square matrices (i.e., n_j = p_j, j = 1, 2). It can be verified that when A is
non-singular (AA^{-1} = I_p):

    A^{-1} = ( A^{11}  A^{12}
               A^{21}  A^{22} )                                                  (2.26)

where

    A^{11} = (A_{11} − A_{12}A_{22}^{-1}A_{21})^{-1}  =:  (A_{11·2})^{-1}
    A^{12} = −(A_{11·2})^{-1} A_{12} A_{22}^{-1}
    A^{21} = −A_{22}^{-1} A_{21} (A_{11·2})^{-1}
    A^{22} = A_{22}^{-1} + A_{22}^{-1} A_{21} (A_{11·2})^{-1} A_{12} A_{22}^{-1} .

An alternative expression can be obtained by reversing the positions of A11 and A22 in the
original matrix.
The following results will be useful if A_{11} is non-singular:

    |A| = |A_{11}| |A_{22} − A_{21}A_{11}^{-1}A_{12}| = |A_{11}| |A_{22·1}|.     (2.27)

If A_{22} is non-singular, we have that:

    |A| = |A_{22}| |A_{11} − A_{12}A_{22}^{-1}A_{21}| = |A_{22}| |A_{11·2}|.     (2.28)


A useful formula is derived from the alternative expressions for the inverse and the determinant. For instance let

    B = ( 1  b⊤
          a  A )

where a and b are (p × 1) vectors and A is non-singular. We then have:

    |B| = |A − ab⊤| = |A| |1 − b⊤A^{-1}a|                                        (2.29)

and equating the two expressions for B^{22}, we obtain the following:

    (A − ab⊤)^{-1} = A^{-1} + A^{-1}ab⊤A^{-1} / (1 − b⊤A^{-1}a).                 (2.30)


EXAMPLE 2.9 Let us consider the matrix

    A = (1 2; 2 2).

We can use formula (2.26) to calculate the inverse of a partitioned matrix, i.e., A^{11} = −1,
A^{12} = A^{21} = 1, A^{22} = −1/2. The inverse of A is

    A^{-1} = ( −1   1
                1  −0.5 ).

It is also easy to calculate the determinant of A:

    |A| = |1| |2 − 4| = −2.
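The block formulas (2.26)–(2.28) can be verified on Example 2.9 with a few lines of Python/numpy (here every block is a scalar, which keeps the arithmetic transparent):

```python
import numpy as np

# Matrix from Example 2.9, partitioned into scalar blocks
A = np.array([[1.0, 2.0],
              [2.0, 2.0]])
A11, A12, A21, A22 = A[0, 0], A[0, 1], A[1, 0], A[1, 1]

# Blocks of the inverse according to (2.26)
A11_2 = A11 - A12 * A21 / A22              # A_{11.2} = 1 - 4/2 = -1
B11 = 1 / A11_2                            # A^{11} = -1
B12 = -B11 * A12 / A22                     # A^{12} = 1
B21 = -(A21 / A22) * B11                   # A^{21} = 1
B22 = 1 / A22 + (A21 * B11 * A12) / A22**2 # A^{22} = -0.5

print(np.array([[B11, B12], [B21, B22]]))
print(np.linalg.inv(A))                    # same matrix
print(np.linalg.det(A))                    # -2, as in |A| = |1||2 - 4|
```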

Let A(n × p) and B(p × n) be any two matrices and suppose that n ≥ p. From (2.27)
and (2.28) we can conclude that

    | −λI_n   −A |
    |   B     I_p |  =  (−λ)^{n−p} |BA − λI_p| = |AB − λI_n|.                    (2.31)

Since both determinants on the right-hand side of (2.31) are polynomials in λ, we find that
the n eigenvalues of AB yield the p eigenvalues of BA plus the eigenvalue 0, n − p times.
The relationship between the eigenvectors is described in the next theorem.

THEOREM 2.6 For A(n × p) and B(p × n), the non-zero eigenvalues of AB and BA are
the same and have the same multiplicity. If x is an eigenvector of AB for an eigenvalue
λ ≠ 0, then y = Bx is an eigenvector of BA.

COROLLARY 2.2 For A(n × p), B(q × n), a(p × 1), and b(q × 1) we have

    rank(Aab⊤B) ≤ 1.

The non-zero eigenvalue, if it exists, equals b⊤BAa (with eigenvector Aa).

Proof:
Theorem 2.6 asserts that the non-zero eigenvalues of Aab⊤B are the same as those of b⊤BAa.
Note that the matrix b⊤BAa is a scalar and hence it is its own eigenvalue λ_1.
Applying Aab⊤B to Aa yields

    (Aab⊤B)(Aa) = (Aa)(b⊤BAa) = λ_1 Aa.                                            □

 
Figure 2.1. Distance d.

2.6 Geometrical Aspects

Distance

Let x, y ∈ R^p. A distance d is defined as a function d : R^{2p} → R_+ which fulfills

    d(x, y) > 0                    ∀ x ≠ y
    d(x, y) = 0                    if and only if x = y
    d(x, y) ≤ d(x, z) + d(z, y)    ∀ x, y, z .

A Euclidean distance d between two points x and y is defined as

    d²(x, y) = (x − y)⊤A(x − y)                                                  (2.32)

where A is a positive definite matrix (A > 0). A is called a metric.

EXAMPLE 2.10 A particular case is when A = I_p, i.e.,

    d²(x, y) = Σ_{i=1}^p (x_i − y_i)².                                           (2.33)

Figure 2.1 illustrates this definition for p = 2.

Note that the sets E_d = {x ∈ R^p | (x − x_0)⊤(x − x_0) = d²}, i.e., the spheres with radius d
and center x_0, are the Euclidean I_p iso-distance curves from the point x_0 (see Figure 2.2).
The more general distance (2.32) with a positive definite matrix A (A > 0) leads to the
iso-distance curves

    E_d = {x ∈ R^p | (x − x_0)⊤A(x − x_0) = d²},                                 (2.34)

i.e., ellipsoids with center x_0, matrix A and constant d (see Figure 2.3).
Let γ_1, γ_2, . . . , γ_p be the orthonormal eigenvectors of A corresponding to the eigenvalues λ_1 ≥
λ_2 ≥ . . . ≥ λ_p. The resulting observations are given in the next theorem.
Figure 2.2. Iso-distance sphere.

Figure 2.3. Iso-distance ellipsoid.


THEOREM 2.7 (i) The principal axes of E_d are in the direction of γ_i; i = 1, . . . , p.

(ii) The half-lengths of the axes are √(d²/λ_i); i = 1, . . . , p.

(iii) The rectangle surrounding the ellipsoid E_d is defined by the following inequalities:

    x_{0i} − √(d² a^{ii}) ≤ x_i ≤ x_{0i} + √(d² a^{ii}),   i = 1, . . . , p,

where a^{ii} is the (i, i) element of A^{-1}. By the rectangle surrounding the ellipsoid E_d we
mean the rectangle whose sides are parallel to the coordinate axes.

It is easy to find the coordinates of the tangency points between the ellipsoid and its sur-
rounding rectangle parallel to the coordinate axes. Let us find the coordinates of the tangency
point that is in the direction of the j-th coordinate axis (positive direction).
For ease of notation, we suppose the ellipsoid is centered around the origin (x_0 = 0). If not,
the rectangle will be shifted by the value of x_0.
The coordinate of the tangency point is given by the solution to the following problem:

    x = arg max_{x⊤Ax = d²} e_j⊤x                                                (2.35)

where e_j is the j-th column of the identity matrix I_p. The coordinate of the tangency point
in the negative direction would correspond to the solution of the min problem: by symmetry,
it is the opposite value of the former.
The solution is computed via the Lagrangian L = e_j⊤x − λ(x⊤Ax − d²) which by (2.23) leads
to the following system of equations:

    ∂L/∂x = e_j − 2λAx = 0                                                       (2.36)
    ∂L/∂λ = x⊤Ax − d² = 0.                                                       (2.37)

This gives x = (1/(2λ)) A^{-1}e_j, or componentwise

    x_i = (1/(2λ)) a^{ij},   i = 1, . . . , p                                    (2.38)

where a^{ij} denotes the (i, j)-th element of A^{-1}.
Premultiplying (2.36) by x⊤, we have from (2.37):

    x_j = 2λd².

Comparing this to the value obtained by (2.38), for i = j we obtain 2λ = √(a^{jj}/d²). We choose
the positive value of the square root because we are maximizing e_j⊤x. A minimum would
correspond to the negative value. Finally, we have the coordinates of the tangency point
between the ellipsoid and its surrounding rectangle in the positive direction of the j-th axis:

    x_i = √(d²/a^{jj}) a^{ij},   i = 1, . . . , p.                               (2.39)

The particular case where i = j provides statement (iii) in Theorem 2.7.
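A small Python/numpy sketch of Theorem 2.7 and formula (2.39); the metric A, the constant d² and the axis index j are hypothetical choices made for illustration.

```python
import numpy as np

# Hypothetical metric A > 0, center x0 = 0, and constant d^2
A = np.array([[3.0, 1.0],
              [1.0, 2.0]])
d2 = 4.0

lam, gamma = np.linalg.eigh(A)     # eigenvalues ascending, columns gamma_i
Ainv = np.linalg.inv(A)

# (i)-(ii): principal axes gamma_i with half-lengths sqrt(d^2 / lambda_i)
half_lengths = np.sqrt(d2 / lam)

# (2.39): tangency point in the positive direction of the j-th axis
j = 0
x_tangent = np.sqrt(d2 / Ainv[j, j]) * Ainv[:, j]

print(half_lengths)
print(x_tangent, x_tangent @ A @ x_tangent)   # second value approx d^2 = 4.0
```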



Remark: usefulness of Theorem 2.7

Theorem 2.7 will prove to be particularly useful in many subsequent chapters. First, it
provides a helpful tool for graphing an ellipse in two dimensions. Indeed, knowing the slope
of the principal axes of the ellipse, their half-lengths and drawing the rectangle inscribing
the ellipse allows one to quickly draw a rough picture of the shape of the ellipse.
In Chapter 7, it is shown that the confidence region for the vector µ of a multivariate
normal population is given by a particular ellipsoid whose parameters depend on sample
characteristics. The rectangle inscribing the ellipsoid (which is much easier to obtain) will
provide the simultaneous confidence intervals for all of the components in µ.
In addition it will be shown that the contour surfaces of the multivariate normal density
are provided by ellipsoids whose parameters depend on the mean vector and on the covari-
ance matrix. We will see that the tangency points between the contour ellipsoids and the
surrounding rectangle are determined by regressing one component on the (p − 1) other
components. For instance, in the direction of the j-th axis, the tangency points are given
by the intersections of the ellipsoid contours with the regression line of the vector of (p − 1)
variables (all components except the j-th) on the j-th component.



Norm of a Vector

Consider a vector x ∈ R^p. The norm or length of x (with respect to the metric I_p) is defined as

    ||x|| = d(0, x) = √(x⊤x).

If ||x|| = 1, x is called a unit vector. A more general norm can be defined with respect to the
metric A:

    ||x||_A = √(x⊤Ax).


 
Figure 2.4. Angle between vectors.

Angle between two Vectors

Consider two vectors x and y ∈ R^p. The angle θ between x and y is defined by the cosine of θ:

    cos θ = x⊤y / (||x|| ||y||),                                                 (2.40)

see Figure 2.4. Indeed for p = 2, x = (x_1; x_2) and y = (y_1; y_2), we have

    ||x|| cos θ_1 = x_1 ;   ||y|| cos θ_2 = y_1
    ||x|| sin θ_1 = x_2 ;   ||y|| sin θ_2 = y_2 ,                                (2.41)

therefore,

    cos θ = cos θ_1 cos θ_2 + sin θ_1 sin θ_2 = (x_1 y_1 + x_2 y_2) / (||x|| ||y||) = x⊤y / (||x|| ||y||).

REMARK 2.1 If x⊤y = 0, then the angle θ is equal to π/2. From trigonometry, we know that
the cosine of θ equals the length of the base of a triangle (||p_x||) divided by the length of the
hypotenuse (||x||). Hence, we have

    ||p_x|| = ||x|| |cos θ| = |x⊤y| / ||y|| ,                                    (2.42)

Figure 2.5. Projection.

where p_x is the projection of x on y (which is defined below). It is the coordinate of x on the
y vector, see Figure 2.5.
The angle can also be defined with respect to a general metric A:

    cos θ = x⊤Ay / (||x||_A ||y||_A).                                            (2.43)

If cos θ = 0 then x is orthogonal to y with respect to the metric A.

EXAMPLE 2.11 Assume that there are two centered (i.e., zero mean) data vectors. The
cosine of the angle between them is equal to their correlation (defined in (3.8))! Indeed for
x and y with x̄ = ȳ = 0 we have

    r_XY = Σ x_i y_i / √( Σ x_i² Σ y_i² ) = cos θ

according to formula (2.40).
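The identity of Example 2.11 is easy to verify numerically; the sketch below uses simulated data (an arbitrary illustrative sample, not from the text).

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=20)
y = 0.5 * x + rng.normal(size=20)

# Center both vectors so that x_bar = y_bar = 0
xc, yc = x - x.mean(), y - y.mean()

# Cosine of the angle (2.40) between the centered vectors ...
cos_theta = (xc @ yc) / (np.linalg.norm(xc) * np.linalg.norm(yc))

# ... equals the empirical correlation r_XY
r_xy = np.corrcoef(x, y)[0, 1]
print(cos_theta, r_xy)   # identical up to rounding
```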


Rotations

When we consider a point x ∈ Rp , we generally use a p-coordinate system to obtain its geo-
metric representation, like in Figure 2.1 for instance. There will be situations in multivariate
techniques where we will want to rotate this system of coordinates by the angle θ.
Consider for example the point P with coordinates x = (x_1, x_2)⊤ in R² with respect to a
given set of orthogonal axes. Let Γ be a (2 × 2) orthogonal matrix where

    Γ = (  cos θ   sin θ
          −sin θ   cos θ ).                                                      (2.44)

If the axes are rotated about the origin through an angle θ in a clockwise direction, the new
coordinates of P will be given by the vector y

    y = Γ x,                                                                     (2.45)

and a rotation through the same angle in a counterclockwise direction gives the new coordinates as

    y = Γ⊤ x.                                                                    (2.46)

More generally, premultiplying a vector x by an orthogonal matrix Γ geometrically corre-
sponds to a rotation of the system of axes, so that the first new axis is determined by the
first row of Γ. This geometric point of view will be exploited in Chapters 9 and 10.


Column Space and Null Space of a Matrix

Define for X(n × p)

    Im(X) := C(X) = {x ∈ R^n | ∃ a ∈ R^p so that Xa = x},

the space generated by the columns of X, or the column space of X. Note that C(X) ⊆ R^n
and dim{C(X)} = rank(X) = r ≤ min(n, p).

    Ker(X) := N(X) = {y ∈ R^p | Xy = 0}

is the null space of X. Note that N(X) ⊆ R^p and that dim{N(X)} = p − r.

REMARK 2.2 N(X⊤) is the orthogonal complement of C(X) in R^n, i.e., given a vector
b ∈ R^n it will hold that x⊤b = 0 for all x ∈ C(X), if and only if b ∈ N(X⊤).
EXAMPLE 2.12 Let

    X = ( 2 3 5
          4 6 7
          6 8 6
          8 2 4 ).

It is easy to show (e.g., by calculating the determinant of X⊤X, which is non-zero) that
rank(X) = 3. Hence, the column space of X is the three-dimensional subspace of R⁴ spanned
by the columns of X.
The null space of X contains only the zero vector (0, 0, 0)⊤ and its dimension is equal to
p − rank(X) = 3 − 3 = 0.

For

    X = ( 2 3 1
          4 6 2
          6 8 3
          8 2 4 ),

the third column is a multiple of the first one and the matrix X cannot be of full rank.
Noticing that the first two columns of X are independent, we see that rank(X) = 2. In this
case, the dimension of the column space is 2 and the dimension of the null space is 1.
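These ranks and dimensions can be checked in a couple of lines of Python/numpy (a sketch, assuming the two matrices as written above):

```python
import numpy as np

X1 = np.array([[2, 3, 5], [4, 6, 7], [6, 8, 6], [8, 2, 4]])
X2 = np.array([[2, 3, 1], [4, 6, 2], [6, 8, 3], [8, 2, 4]])

for X in (X1, X2):
    r = np.linalg.matrix_rank(X)
    # dim C(X) = rank(X), dim N(X) = p - rank(X)
    print(r, X.shape[1] - r)   # prints "3 0" and then "2 1"
```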


Projection Matrix

A matrix P(n × n) is called an (orthogonal) projection matrix in R^n if and only if P = P⊤ =
P² (P is idempotent). Let b ∈ R^n. Then a = Pb is the projection of b on C(P).


Projection on C(X)

Consider X(n × p) and let

    P = X(X⊤X)^{-1}X⊤                                                            (2.47)

and Q = I_n − P. It is easy to check that P and Q are idempotent and that

    PX = X   and   QX = 0.                                                       (2.48)

Since the columns of X are projected onto themselves, the projection matrix P projects any
vector b ∈ R^n onto C(X). Similarly, the projection matrix Q projects any vector b ∈ R^n
onto the orthogonal complement of C(X).

THEOREM 2.8 Let P be the projection (2.47) and Q its orthogonal complement. Then:

(i) x = Pb  ⇒  x ∈ C(X),

(ii) y = Qb  ⇒  y⊤x = 0  ∀ x ∈ C(X).

Proof:
(i) holds, since x = X(X⊤X)^{-1}X⊤b = Xa, where a = (X⊤X)^{-1}X⊤b ∈ R^p.
(ii) follows from y = b − Pb and x = Xa  ⇒  y⊤x = b⊤Xa − b⊤X(X⊤X)^{-1}X⊤Xa = 0.       □



REMARK 2.3 Let x, y ∈ R^n and consider p_x ∈ R^n, the projection of x on y (see Figure
2.5). With X = y we have from (2.47)

    p_x = y(y⊤y)^{-1}y⊤x = (y⊤x / ||y||²) y                                      (2.49)

and we can easily verify that

    ||p_x|| = √(p_x⊤ p_x) = |y⊤x| / ||y|| .

See again Remark 2.1.
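The properties (2.47)–(2.48) and Theorem 2.8 are easy to verify numerically; the sketch below uses a random design matrix as a hypothetical example.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(10, 3))          # hypothetical data matrix, n = 10, p = 3
b = rng.normal(size=10)

# Projection matrix (2.47) onto C(X) and its orthogonal complement
P = X @ np.linalg.inv(X.T @ X) @ X.T
Q = np.eye(10) - P

print(np.allclose(P, P @ P), np.allclose(P, P.T))    # idempotent and symmetric
print(np.allclose(P @ X, X), np.allclose(Q @ X, 0))  # (2.48)
print(np.allclose(X.T @ (Q @ b), 0))                 # Qb is orthogonal to C(X)
```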



Summary
• A distance between two p-dimensional points x and y is a quadratic form
  (x − y)⊤A(x − y) in the vector of differences (x − y). A distance defines
  the norm of a vector.
• Iso-distance curves of a point x_0 are all those points that have the same
  distance from x_0. Iso-distance curves are ellipsoids whose principal axes
  are determined by the direction of the eigenvectors of A. The half-lengths of
  the principal axes are proportional to the inverse of the square roots of the
  eigenvalues of A.
• The angle between two vectors x and y is given by cos θ = x⊤Ay / (||x||_A ||y||_A)
  w.r.t. the metric A.
• For the Euclidean distance with A = I the correlation between two cen-
  tered data vectors x and y is given by the cosine of the angle between
  them, i.e., cos θ = r_XY.
• The projection P = X(X⊤X)^{-1}X⊤ is the projection onto the column
  space C(X) of X.
• The projection of x ∈ R^n on y ∈ R^n is given by p_x = (y⊤x / ||y||²) y.




2.7 Exercises
EXERCISE 2.1 Compute the determinant for a (3 × 3) matrix.


EXERCISE 2.2 Suppose that |A| = 0. Is it possible that all eigenvalues of A are positive?


EXERCISE 2.3 Suppose that all eigenvalues of some (square) matrix A are different from
zero. Does the inverse A^{-1} of A exist?


EXERCISE 2.4 Write a program that calculates the Jordan decomposition of the matrix

    A = ( 1 2 3
          2 1 2
          3 2 1 ).

Check Theorem 2.1 numerically.
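One possible sketch for this exercise in Python/numpy (only one of many ways to do it; the book itself works with XploRe quantlets):

```python
import numpy as np

A = np.array([[1.0, 2.0, 3.0],
              [2.0, 1.0, 2.0],
              [3.0, 2.0, 1.0]])

# Jordan (spectral) decomposition of a symmetric matrix: A = Gamma Lambda Gamma'
lam, Gamma = np.linalg.eigh(A)
Lambda = np.diag(lam)

# Numerical check of Theorem 2.1
print(np.allclose(Gamma @ Lambda @ Gamma.T, A))   # True
print(np.allclose(Gamma.T @ Gamma, np.eye(3)))    # Gamma is orthogonal
```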


EXERCISE 2.5 Prove (2.23), (2.24) and (2.25).

EXERCISE 2.6 Show that a projection matrix only has eigenvalues in {0, 1}.

EXERCISE 2.7 Draw some iso-distance ellipsoids for the metric A = Σ^{-1} of Example 3.13.

EXERCISE 2.8 Find a formula for |A + aa⊤| and for (A + aa⊤)^{-1}. (Hint: use the inverse
partitioned matrix with B = (1  −a⊤; a  A).)

EXERCISE 2.9 Prove the binomial inverse theorem for two non-singular matrices A(p × p)
and B(p × p): (A + B)^{-1} = A^{-1} − A^{-1}(A^{-1} + B^{-1})^{-1}A^{-1}. (Hint: use (2.26) with
C = (A  I_p; −I_p  B^{-1}).)
3 Moving to Higher Dimensions

We have seen in the previous chapters how very simple graphical devices can help in under-
standing the structure and dependency of data. The graphical tools were based on either
univariate (bivariate) data representations or on “slick” transformations of multivariate infor-
mation perceivable by the human eye. Most of the tools are extremely useful in a modelling
step, but unfortunately, do not give the full picture of the data set. One reason for this is
that the graphical tools presented capture only certain dimensions of the data and do not
necessarily concentrate on those dimensions or subparts of the data under analysis that carry
the maximum structural information. In Part III of this book, powerful tools for reducing
the dimension of a data set will be presented. In this chapter, as a starting point, simple and
basic tools are used to describe dependency. They are constructed from elementary facts of
probability theory and introductory statistics (for example, the covariance and correlation
between two variables).
Sections 3.1 and 3.2 show how to handle these concepts in a multivariate setup and how a
simple test on correlation between two variables can be derived. Since linear relationships
are involved in these measures, Section 3.4 presents the simple linear model for two variables
and recalls the basic t-test for the slope. In Section 3.5, a simple example of one-factorial
analysis of variance introduces the notations for the well known F -test.
Due to the power of matrix notation, all of this can easily be extended to a more general
multivariate setup. Section 3.3 shows how matrix operations can be used to define summary
statistics of a data set and for obtaining the empirical moments of linear transformations of
the data. These results will prove to be very useful in most of the chapters in Part III.
Finally, matrix notation allows us to introduce the flexible multiple linear model, where more
general relationships among variables can be analyzed. In Section 3.6, the least squares
adjustment of the model and the usual test statistics are presented with their geometric
interpretation. Using these notations, the ANOVA model is just a particular case of the
multiple linear model.


3.1 Covariance
Covariance is a measure of dependency between random variables. Given two (random)
variables X and Y the (theoretical) covariance is defined by:

    σ_XY = Cov(X, Y) = E(XY) − (EX)(EY).                                         (3.1)

The precise definition of expected values is given in Chapter 4. If X and Y are independent
of each other, the covariance Cov(X, Y) is necessarily equal to zero, see Theorem 3.1. The
converse is not true. The covariance of X with itself is the variance:

    σ_XX = Var(X) = Cov(X, X).

If the variable X is p-dimensional multivariate, e.g., X = (X_1, . . . , X_p)⊤, then the theoretical
covariances among all the elements are put into matrix form, i.e., the covariance matrix:

    Σ = ( σ_{X_1X_1}  . . .  σ_{X_1X_p}
              ...      ...      ...
          σ_{X_pX_1}  . . .  σ_{X_pX_p} ).

Properties of covariance matrices will be detailed in Chapter 4. Empirical versions of these
quantities are:

    s_XY = (1/n) Σ_{i=1}^n (x_i − x̄)(y_i − ȳ)                                   (3.2)

    s_XX = (1/n) Σ_{i=1}^n (x_i − x̄)².                                          (3.3)

For small n, say n ≤ 20, we should replace the factor 1/n in (3.2) and (3.3) by 1/(n−1) in order
to correct for a small bias. For a p-dimensional random variable, one obtains the empirical
covariance matrix (see Section 3.3 for properties and details)

    S = ( s_{X_1X_1}  . . .  s_{X_1X_p}
              ...      ...      ...
          s_{X_pX_1}  . . .  s_{X_pX_p} ).

For a scatterplot of two variables the covariances measure “how close the scatter is to a
line”. Mathematical details follow but it should already be understood here that in this
sense covariance measures only “linear dependence”.


EXAMPLE 3.1 If X is the entire bank data set, one obtains the covariance matrix S as
indicated below:

    S = (  0.14   0.03   0.02  −0.10  −0.01   0.08
           0.03   0.12   0.10   0.21   0.10  −0.21
           0.02   0.10   0.16   0.28   0.12  −0.24
          −0.10   0.21   0.28   2.07   0.16  −1.03
          −0.01   0.10   0.12   0.16   0.64  −0.54
           0.08  −0.21  −0.24  −1.03  −0.54   1.32 ).                            (3.4)

The empirical covariance between X_4 and X_5, i.e., s_{X_4X_5}, is found in row 4 and column 5.
The value is s_{X_4X_5} = 0.16. Is it obvious that this value is positive? In Exercise 3.1 we will
discuss this question further.
If X_f denotes the counterfeit bank notes, we obtain:

    S_f = (  0.123   0.031   0.023  −0.099   0.019   0.011
             0.031   0.064   0.046  −0.024  −0.012  −0.005
             0.024   0.046   0.088  −0.018   0.000   0.034
            −0.099  −0.024  −0.018   1.268  −0.485   0.236
             0.019  −0.012   0.000  −0.485   0.400  −0.022
             0.011  −0.005   0.034   0.236  −0.022   0.308 ).                    (3.5)
For the genuine, X_g, we have:

    S_g = (  0.149   0.057   0.057   0.056   0.014   0.005
             0.057   0.131   0.085   0.056   0.048  −0.043
             0.057   0.085   0.125   0.058   0.030  −0.024
             0.056   0.056   0.058   0.409  −0.261  −0.000
             0.014   0.049   0.030  −0.261   0.417  −0.074
             0.005  −0.043  −0.024  −0.000  −0.074   0.198 ).                    (3.6)

Note that the covariance between X_4 (distance of the frame to the lower border) and X_5
(distance of the frame to the upper border) is negative in both (3.5) and (3.6)! Why would
this happen? In Exercise 3.2 we will discuss this question in more detail.
At first sight, the matrices S_f and S_g look different, but they create almost the same scatter-
plots (see the discussion in Section 1.4). Similarly, the common principal component analysis
in Chapter 9 suggests a joint analysis of the covariance structure as in Flury and Riedwyl
(1988).
Scatterplots with point clouds that are “upward-sloping”, like the one in the upper left of
Figure 1.14, show variables with positive covariance. Scatterplots with “downward-sloping”
structure have negative covariance. In Figure 3.1 we show the scatterplot of X_4 vs. X_5 of
the entire bank data set. The point cloud is upward-sloping. However, the two sub-clouds
of counterfeit and genuine bank notes are downward-sloping.



Figure 3.1. Scatterplot of variables X_4 vs. X_5 of the entire bank data set.  MVAscabank45.xpl


EXAMPLE 3.2 A textile shop manager is studying the sales of “classic blue” pullovers over
10 different periods. He observes the number of pullovers sold (X_1), variation in price (X_2,
in EUR), the advertisement costs in local newspapers (X_3, in EUR) and the presence of a
sales assistant (X_4, in hours per period). Over the periods, he observes the following data
matrix:

    X = ( 230  125  200  109
          181   99   55  107
          165   97  105   98
          150  115   85   71
           97  120    0   82
          192  100  150  103
          181   80   85  111
          189   90  120   93
          172   95  110   86
          170  125  130   78 ).




Figure 3.2. Scatterplot of variables X_2 vs. X_1 of the pullovers data set.  MVAscapull1.xpl

He is convinced that the price must have a large influence on the number of pullovers sold.
So he makes a scatterplot of X_2 vs. X_1, see Figure 3.2. A rough impression is that the cloud
is somewhat downward-sloping. A computation of the empirical covariance yields

    s_{X_1X_2} = (1/10) Σ_{i=1}^{10} (x_{1i} − x̄_1)(x_{2i} − x̄_2) = −80.02,

a negative value as expected.
Note: The covariance function is scale dependent. Thus, if the prices in this example were
in Japanese Yen (JPY), we would obtain a different answer (see Exercise 3.16). A measure
of (linear) dependence independent of the scale is the correlation, which we introduce in the
next section.
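The value −80.02 can be reproduced from the data matrix of Example 3.2 with a few lines of Python/numpy (a sketch using the factor 1/n of definition (3.2)):

```python
import numpy as np

# Pullover data from Example 3.2: sales (X1) and price (X2)
x1 = np.array([230, 181, 165, 150, 97, 192, 181, 189, 172, 170])
x2 = np.array([125, 99, 97, 115, 120, 100, 80, 90, 95, 125])

n = len(x1)
# Empirical covariance with factor 1/n, as in (3.2)
s_x1x2 = np.sum((x1 - x1.mean()) * (x2 - x2.mean())) / n
print(s_x1x2)                             # -80.02
print(np.cov(x1, x2, bias=True)[0, 1])    # same value (bias=True uses 1/n)
```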



Summary
• The covariance is a measure of dependence.
• Covariance measures only linear dependence.
• Covariance is scale dependent.
• There are nonlinear dependencies that have zero covariance.
• Zero covariance does not imply independence.
• Independence implies zero covariance.
• Negative covariance corresponds to downward-sloping scatterplots.
• Positive covariance corresponds to upward-sloping scatterplots.
• The covariance of a variable with itself is its variance Cov(X, X) = σ_XX = σ_X².
• For small n, we should replace the factor 1/n in the computation of the
  covariance by 1/(n−1).




3.2 Correlation
The correlation between two variables X and Y is defined from the covariance as the following:

    ρ_XY = Cov(X, Y) / √( Var(X) Var(Y) ) .                                      (3.7)

The advantage of the correlation is that it is independent of the scale, i.e., changing the
variables’ scale of measurement does not change the value of the correlation. Therefore, the
correlation is more useful as a measure of association between two random variables than
the covariance. The empirical version of ρ_XY is as follows:

    r_XY = s_XY / √( s_XX s_YY ) .                                               (3.8)

The correlation is in absolute value always less than 1. It is zero if the covariance is zero
and vice-versa. For p-dimensional vectors (X_1, . . . , X_p)⊤ we have the theoretical correlation
matrix

    P = ( ρ_{X_1X_1}  . . .  ρ_{X_1X_p}
              ...      ...      ...
          ρ_{X_pX_1}  . . .  ρ_{X_pX_p} ),


and its empirical version, the empirical correlation matrix which can be calculated from the
observations,

    R = ( r_{X_1X_1}  . . .  r_{X_1X_p}
              ...      ...      ...
          r_{X_pX_1}  . . .  r_{X_pX_p} ).

EXAMPLE 3.3 We obtain the following correlation matrix for the genuine bank notes:

    R_g = (  1.00   0.41   0.41   0.22   0.05   0.03
             0.41   1.00   0.66   0.24   0.20  −0.25
             0.41   0.66   1.00   0.25   0.13  −0.14
             0.22   0.24   0.25   1.00  −0.63  −0.00
             0.05   0.20   0.13  −0.63   1.00  −0.25
             0.03  −0.25  −0.14  −0.00  −0.25   1.00 ),                          (3.9)

and for the counterfeit bank notes:

    R_f = (  1.00   0.35   0.24  −0.25   0.08   0.06
             0.35   1.00   0.61  −0.08  −0.07  −0.03
             0.24   0.61   1.00  −0.05   0.00   0.20
            −0.25  −0.08  −0.05   1.00  −0.68   0.37
             0.08  −0.07   0.00  −0.68   1.00  −0.06
             0.06  −0.03   0.20   0.37  −0.06   1.00 ).                          (3.10)

As noted before for Cov(X_4, X_5), the correlation between X_4 (distance of the frame to the
lower border) and X_5 (distance of the frame to the upper border) is negative. This is natural,
since the covariance and correlation always have the same sign (see also Exercise 3.17).

Why is the correlation an interesting statistic to study? It is related to independence of
random variables, which we shall define more formally later on. For the moment we may
think of independence as the fact that one variable has no influence on another.

THEOREM 3.1 If X and Y are independent, then ρ(X, Y) = Cov(X, Y) = 0.

In general, the converse is not true, as the following example shows.


EXAMPLE 3.4 Consider a standard normally-distributed random variable X and a random
variable Y = X², which is surely not independent of X. Here we have

    Cov(X, Y) = E(XY) − E(X)E(Y) = E(X³) = 0

(because E(X) = 0 and E(X²) = 1). Therefore ρ(X, Y) = 0, as well. This example
also shows that correlations and covariances measure only linear dependence. The quadratic
dependence of Y = X² on X is not reflected by these measures of dependence.


REMARK 3.1 For two normal random variables, the converse of Theorem 3.1 is true: zero
covariance for two normally-distributed random variables implies independence. This will be
shown later in Corollary 5.2.


Theorem 3.1 enables us to check for independence between the components of a bivariate
normal random variable. That is, we can use the correlation and test whether it is zero. The
distribution of r_XY for an arbitrary (X, Y) is unfortunately complicated. The distribution
of r_XY will be more accessible if (X, Y) are jointly normal (see Chapter 5). If we transform
the correlation by Fisher’s Z-transformation,

    W = (1/2) log( (1 + r_XY) / (1 − r_XY) ),                                    (3.11)

we obtain a variable that has a more accessible distribution. Under the hypothesis that
ρ = 0, W has an asymptotic normal distribution. Approximations of the expectation and
variance of W are given by the following:

    E(W) ≈ (1/2) log( (1 + ρ_XY) / (1 − ρ_XY) ),
    Var(W) ≈ 1 / (n − 3) .                                                       (3.12)

The distribution is given in Theorem 3.2.
The distribution is given in Theorem 3.2.


THEOREM 3.2

    Z = ( W − E(W) ) / √Var(W)  →^L  N(0, 1).                                    (3.13)

The symbol “→^L” denotes convergence in distribution, which will be explained in more
detail in Chapter 4.
Theorem 3.2 allows us to test different hypotheses on correlation. We can fix the level of
significance α (the probability of rejecting a true hypothesis) and reject the hypothesis if the
difference between the hypothetical value and the calculated value of Z is greater than the
corresponding critical value of the normal distribution. The following example illustrates
the procedure.


EXAMPLE 3.5 Let us study the correlation between mileage (X_2) and weight (X_8) for the
car data set (B.3) where n = 74. We have r_{X_2X_8} = −0.823. Our conclusions from the
boxplot in Figure 1.3 (“Japanese cars generally have better mileage than the others”) need
to be revised. From Figure 3.3 and r_{X_2X_8}, we can see that mileage is highly correlated with
weight, and that the Japanese cars in the sample are in fact all lighter than the others!


If we want to know whether ρ_{X_2X_8} is significantly different from ρ_0 = 0, we apply Fisher’s
Z-transform (3.11). This gives us

    w = (1/2) log( (1 + r_{X_2X_8}) / (1 − r_{X_2X_8}) ) = −1.166   and
    z = (−1.166 − 0) / √(1/71) = −9.825,

i.e., a highly significant value to reject the hypothesis that ρ = 0 (the 2.5% and 97.5%
quantiles of the normal distribution are −1.96 and 1.96, respectively). If we want to test the
hypothesis that, say, ρ_0 = −0.75, we obtain:

    z = ( −1.166 − (−0.973) ) / √(1/71) = −1.627.

This is a nonsignificant value at the α = 0.05 level for z since it is between the critical values
at the 5% significance level (i.e., −1.96 < z < 1.96).
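The numbers of Example 3.5 can be reproduced with a short Python sketch (only the reported correlation r = −0.823 and n = 74 are taken from the text):

```python
import numpy as np

# Fisher's Z-transform test from Example 3.5 (car data, n = 74)
n, r = 74, -0.823
w = 0.5 * np.log((1 + r) / (1 - r))            # approx -1.166

# Test H0: rho = 0
z0 = (w - 0) / np.sqrt(1 / (n - 3))
# Test H0: rho = -0.75
w0 = 0.5 * np.log((1 - 0.75) / (1 + 0.75))     # approx -0.973
z1 = (w - w0) / np.sqrt(1 / (n - 3))

print(round(w, 3), round(z0, 3), round(z1, 3)) # -1.166, -9.825, -1.627
```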

EXAMPLE 3.6 Let us consider again the pullovers data set from Example 3.2. Consider the
correlation between the presence of the sales assistants (X_4) vs. the number of sold pullovers
(X_1) (see Figure 3.4). Here we compute the correlation as

    r_{X_1X_4} = 0.633.

The Z-transform of this value is

    w = (1/2) log_e( (1 + r_{X_1X_4}) / (1 − r_{X_1X_4}) ) = 0.746.              (3.14)

The sample size is n = 10, so for the hypothesis ρ_{X_1X_4} = 0, the statistic to consider is:

    z = √7 (0.746 − 0) = 1.974,                                                  (3.15)

which is just statistically significant at the 5% level (i.e., 1.974 is just a little larger than
1.96).

REMARK 3.2 The normalizing and variance stabilizing properties of W are asymptotic. In
addition the use of W in small samples (for n ≤ 25) is improved by Hotelling’s transform
(Hotelling, 1953):

    W* = W − (3W + tanh(W)) / (4(n − 1))   with   Var(W*) = 1 / (n − 1).

The transformed variable W* is asymptotically distributed as a normal distribution.
The transformed variable W — is asymptotically distributed as a normal distribution.



Figure 3.3. Mileage (X_2) vs. weight (X_8) of U.S. (star), European (plus signs) and Japanese (circle) cars.  MVAscacar.xpl

EXAMPLE 3.7 From the preceding remark, we obtain w* = 0.6663 and √(10 − 1) · w* = 1.9989
for the preceding Example 3.6. This value is significant at the 5% level.

REMARK 3.3 Note that Fisher’s Z-transform is the inverse of the hyperbolic tangent
function: W = tanh^{-1}(r_XY); equivalently r_XY = tanh(W) = (e^{2W} − 1) / (e^{2W} + 1).


REMARK 3.4 Under the assumptions of normality of X and Y, we may test their indepen-
dence (ρ_XY = 0) using the exact t-distribution of the statistic

    T = r_XY √( (n − 2) / (1 − r_XY²) ),   which under ρ_XY = 0 is distributed as t_{n−2}.

Setting the probability of the first error type to α, we reject the null hypothesis ρ_XY = 0 if
|T| ≥ t_{1−α/2; n−2}.




Figure 3.4. Hours of sales assistants (X_4) vs. sales (X_1) of pullovers.  MVAscapull2.xpl



Summary
• The correlation is a standardized measure of dependence.
• The absolute value of the correlation is always less than one.
• Correlation measures only linear dependence.
• There are nonlinear dependencies that have zero correlation.
• Zero correlation does not imply independence.
• Independence implies zero correlation.
• Negative correlation corresponds to downward-sloping scatterplots.
• Positive correlation corresponds to upward-sloping scatterplots.


Summary (continued)
• Fisher’s Z-transform helps us in testing hypotheses on correlation.
• For small samples, Fisher’s Z-transform can be improved by the transfor-
  mation W* = W − (3W + tanh(W)) / (4(n − 1)).




3.3 Summary Statistics
This section focuses on the representation of basic summary statistics (means, covariances
and correlations) in matrix notation, since we often apply linear transformations to data.
The matrix notation allows us to derive instantaneously the corresponding characteristics of
the transformed variables. The Mahalanobis transformation is a prominent example of such
linear transformations.
Assume that we have observed n realizations of a p-dimensional random variable; we have a
data matrix X(n × p):

    X = ( x_{11}  · · ·  x_{1p}
            ...           ...
          x_{n1}  · · ·  x_{np} ).                                               (3.16)

The rows x_i = (x_{i1}, . . . , x_{ip}) ∈ R^p denote the i-th observation of a p-dimensional random
variable X ∈ R^p.
The statistics that were briefly introduced in Sections 3.1 and 3.2 can be rewritten in matrix
form as follows. The “center of gravity” of the n observations in R^p is given by the vector x̄
of the means x̄_j of the p variables:

    x̄ = ( x̄_1, . . . , x̄_p )⊤ = n^{-1} X⊤1_n .                                   (3.17)

The dispersion of the n observations can be characterized by the covariance matrix of the
p variables. The empirical covariances defined in (3.2) and (3.3) are the elements of the
following matrix:

    S = n^{-1} X⊤X − x̄ x̄⊤ = n^{-1} ( X⊤X − n^{-1} X⊤1_n 1_n⊤X ).                (3.18)

Note that this matrix is equivalently defined by

    S = (1/n) Σ_{i=1}^n (x_i − x̄)(x_i − x̄)⊤ .
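The matrix formulas (3.17) and (3.18) are easy to verify numerically; the following sketch uses simulated data as a hypothetical example.

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 50, 3
X = rng.normal(size=(n, p))                 # hypothetical data matrix

ones = np.ones(n)
# Mean vector (3.17) and empirical covariance matrix (3.18) in matrix notation
xbar = X.T @ ones / n
S = (X.T @ X - np.outer(X.T @ ones, ones @ X) / n) / n

print(np.allclose(xbar, X.mean(axis=0)))
print(np.allclose(S, np.cov(X.T, bias=True)))   # same matrix, factor 1/n
```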


The covariance formula (3.18) can be rewritten as S = n^{-1} X⊤HX with the centering matrix

    H = I_n − n^{-1} 1_n 1_n⊤ .                                                  (3.19)

Note that the centering matrix is symmetric and idempotent. Indeed,

    H² = (I_n − n^{-1} 1_n 1_n⊤)(I_n − n^{-1} 1_n 1_n⊤)
       = I_n − n^{-1} 1_n 1_n⊤ − n^{-1} 1_n 1_n⊤ + (n^{-1} 1_n 1_n⊤)(n^{-1} 1_n 1_n⊤)
       = I_n − n^{-1} 1_n 1_n⊤ = H.

As a consequence S is positive semidefinite, i.e.

    S ≥ 0.                                                                       (3.20)

Indeed for all a ∈ R^p,
