Summary (continued)

→ The eigenvectors belonging to the largest eigenvalues indicate the "main direction" of the data.

→ The Jordan decomposition allows one to easily compute the power of a symmetric matrix A: A^α = Γ Λ^α Γ^T.

→ The singular value decomposition (SVD) is a generalization of the Jordan decomposition to non-quadratic (rectangular) matrices.

2.3 Quadratic Forms

A quadratic form Q(x) is built from a symmetric matrix A(p × p) and a vector x ∈ R^p:

    Q(x) = x^T A x = Σ_{i=1}^p Σ_{j=1}^p a_ij x_i x_j.    (2.21)

Definiteness of Quadratic Forms and Matrices

    Q(x) > 0 for all x ≠ 0    positive definite
    Q(x) ≥ 0 for all x ≠ 0    positive semidefinite

A matrix A is called positive definite (semidefinite) if the corresponding quadratic form Q(·) is positive definite (semidefinite). We write A > 0 (A ≥ 0).

Quadratic forms can always be diagonalized, as the following result shows.

THEOREM 2.3 If A is symmetric and Q(x) = x^T A x is the corresponding quadratic form, then there exists a transformation x → Γ^T x = y such that

    x^T A x = Σ_{i=1}^p λ_i y_i^2,

where λ_i are the eigenvalues of A.

Proof:
A = Γ Λ Γ^T. By Theorem 2.1 and y = Γ^T x we have that x^T A x = x^T Γ Λ Γ^T x = y^T Λ y = Σ_{i=1}^p λ_i y_i^2.  □

Positive definiteness of quadratic forms can be deduced from positive eigenvalues.

66 2 A Short Excursion into Matrix Algebra

THEOREM 2.4 A > 0 if and only if all λ_i > 0, i = 1, . . . , p.

Proof:
0 < λ_1 y_1^2 + · · · + λ_p y_p^2 = x^T A x for all x ≠ 0, by Theorem 2.3.  □

COROLLARY 2.1 If A > 0, then A^{-1} exists and |A| > 0.

EXAMPLE 2.6 The quadratic form Q(x) = x_1^2 + x_2^2 corresponds to the matrix A = (1 0; 0 1) with eigenvalues λ_1 = λ_2 = 1 and is thus positive definite. The quadratic form Q(x) = (x_1 - x_2)^2 corresponds to the matrix A = (1 -1; -1 1) with eigenvalues λ_1 = 2, λ_2 = 0 and is positive semidefinite. The quadratic form Q(x) = x_1^2 - x_2^2 with eigenvalues λ_1 = 1, λ_2 = -1 is indefinite.
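These three cases can be checked numerically from the signs of the eigenvalues, following the criterion of Theorem 2.4 below. A minimal sketch, assuming NumPy; the helper name `definiteness` and the tolerance are illustrative choices:

```python
import numpy as np

def definiteness(A, tol=1e-12):
    """Classify a symmetric matrix via the signs of its eigenvalues."""
    lam = np.linalg.eigvalsh(A)   # eigenvalues of a symmetric matrix, ascending
    if np.all(lam > tol):
        return "positive definite"
    if np.all(lam >= -tol):
        return "positive semidefinite"
    if np.all(lam < -tol):
        return "negative definite"
    return "indefinite"

# The three quadratic forms of Example 2.6:
A1 = np.array([[1.0, 0.0], [0.0, 1.0]])    # Q(x) = x1^2 + x2^2
A2 = np.array([[1.0, -1.0], [-1.0, 1.0]])  # Q(x) = (x1 - x2)^2
A3 = np.array([[1.0, 0.0], [0.0, -1.0]])   # Q(x) = x1^2 - x2^2
print(definiteness(A1), definiteness(A2), definiteness(A3))
```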

In the statistical analysis of multivariate data, we are interested in maximizing quadratic forms given some constraints.

THEOREM 2.5 If A and B are symmetric and B > 0, then the maximum of x^T A x under the constraint x^T B x = 1 is given by the largest eigenvalue of B^{-1} A. More generally,

    max_{x: x^T B x = 1} x^T A x = λ_1 ≥ λ_2 ≥ · · · ≥ λ_p = min_{x: x^T B x = 1} x^T A x,

where λ_1, . . . , λ_p denote the eigenvalues of B^{-1} A. The vector which maximizes (minimizes) x^T A x under the constraint x^T B x = 1 is the eigenvector of B^{-1} A which corresponds to the largest (smallest) eigenvalue of B^{-1} A.

Proof:
By definition, B^{1/2} = Γ_B Λ_B^{1/2} Γ_B^T. Set y = B^{1/2} x; then

    max_{x: x^T B x = 1} x^T A x = max_{y: y^T y = 1} y^T B^{-1/2} A B^{-1/2} y.    (2.22)

From Theorem 2.1, let

    B^{-1/2} A B^{-1/2} = Γ Λ Γ^T

be the spectral decomposition of B^{-1/2} A B^{-1/2}. Set

    z = Γ^T y  ⇒  z^T z = y^T Γ Γ^T y = y^T y.

Thus (2.22) is equivalent to

    max_{z: z^T z = 1} z^T Λ z = max_{z: z^T z = 1} Σ_{i=1}^p λ_i z_i^2.

But

    max_{z^T z = 1} Σ_{i=1}^p λ_i z_i^2 ≤ λ_1 max_{z^T z = 1} Σ_{i=1}^p z_i^2 = λ_1.

The maximum is thus obtained by z = (1, 0, . . . , 0)^T, i.e.,

    y = γ_1  ⇒  x = B^{-1/2} γ_1.

Since B^{-1} A and B^{-1/2} A B^{-1/2} have the same eigenvalues, the proof is complete.  □

EXAMPLE 2.7 Consider the following matrices

    A = (1 2; 2 3)   and   B = (1 0; 0 1).

We calculate

    B^{-1} A = (1 2; 2 3).

The biggest eigenvalue of the matrix B^{-1} A is 2 + √5. This means that the maximum of x^T A x under the constraint x^T B x = 1 is 2 + √5.

Notice that the constraint x^T B x = 1 corresponds, with our choice of B, to the points which lie on the unit circle x_1^2 + x_2^2 = 1.
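The claim of Theorem 2.5 can be verified numerically on this example. A minimal sketch, assuming NumPy; variable names are illustrative:

```python
import numpy as np

A = np.array([[1.0, 2.0], [2.0, 3.0]])
B = np.eye(2)   # constraint metric: x'Bx = 1 is the unit circle

# The extrema of x'Ax under x'Bx = 1 are the eigenvalues of B^{-1} A.
lam, V = np.linalg.eig(np.linalg.solve(B, A))
order = np.argsort(lam)[::-1]
lam, V = lam[order], V[:, order]

# Rescale the leading eigenvector so that it satisfies the constraint x'Bx = 1.
x_max = V[:, 0] / np.sqrt(V[:, 0] @ B @ V[:, 0])
print(lam[0])              # largest eigenvalue: 2 + sqrt(5)
print(x_max @ A @ x_max)   # the constrained maximum, attained at x_max
```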

Summary

→ A quadratic form can be described by a symmetric matrix A.

→ Quadratic forms can always be diagonalized.

→ Positive definiteness of a quadratic form is equivalent to positiveness of the eigenvalues of the matrix A.

→ The maximum and minimum of a quadratic form given some constraints can be expressed in terms of eigenvalues.


2.4 Derivatives

For later sections of this book, it will be useful to introduce matrix notation for derivatives of a scalar function of a vector x with respect to x. Consider f : R^p → R and a (p × 1) vector x. Then ∂f(x)/∂x is the column vector of partial derivatives ∂f(x)/∂x_j, j = 1, . . . , p, and ∂f(x)/∂x^T is the row vector of the same derivatives (∂f(x)/∂x is called the gradient of f).

We can also introduce second-order derivatives: ∂²f(x)/∂x∂x^T is the (p × p) matrix of elements ∂²f(x)/∂x_i∂x_j, i = 1, . . . , p and j = 1, . . . , p (∂²f(x)/∂x∂x^T is called the Hessian of f).

Suppose that a is a (p × 1) vector and that A = A^T is a (p × p) matrix. Then

    ∂a^T x/∂x = ∂x^T a/∂x = a,    (2.23)

    ∂x^T A x/∂x = 2Ax.    (2.24)

The Hessian of the quadratic form Q(x) = x^T A x is:

    ∂²(x^T A x)/∂x∂x^T = 2A.    (2.25)

EXAMPLE 2.8 Consider the matrix

    A = (1 2; 2 3).

From formulas (2.24) and (2.25) it immediately follows that the gradient of Q(x) = x^T A x is

    ∂x^T A x/∂x = 2Ax = 2 (1 2; 2 3) x = (2x_1 + 4x_2; 4x_1 + 6x_2)

and the Hessian is

    ∂²(x^T A x)/∂x∂x^T = 2A = 2 (1 2; 2 3) = (2 4; 4 6).
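The closed forms (2.24) and (2.25) can be checked against finite differences. A sketch, assuming NumPy; the evaluation point and step size are illustrative choices:

```python
import numpy as np

A = np.array([[1.0, 2.0], [2.0, 3.0]])
Q = lambda x: x @ A @ x

x0 = np.array([1.0, -2.0])
h = 1e-4

# Central-difference gradient, compared with the closed form 2*A*x (2.24).
grad_num = np.array([(Q(x0 + h * e) - Q(x0 - h * e)) / (2 * h) for e in np.eye(2)])
print(grad_num, 2 * A @ x0)

# The Hessian of a quadratic form is constant: 2*A (2.25).
hess = np.empty((2, 2))
for i, ei in enumerate(np.eye(2)):
    for j, ej in enumerate(np.eye(2)):
        hess[i, j] = (Q(x0 + h*ei + h*ej) - Q(x0 + h*ei - h*ej)
                      - Q(x0 - h*ei + h*ej) + Q(x0 - h*ei - h*ej)) / (4 * h**2)
print(np.round(hess, 3))
```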

2.5 Partitioned Matrices

Very often we will have to consider certain groups of rows and columns of a matrix A(n × p). In the case of two groups, we have

    A = ( A11  A12 ; A21  A22 )

where A_ij is (n_i × p_j), i, j = 1, 2, n_1 + n_2 = n and p_1 + p_2 = p.


If B(n × p) is partitioned accordingly, we have:

    A + B = ( A11 + B11   A12 + B12 ; A21 + B21   A22 + B22 )

    B^T = ( B11^T  B21^T ; B12^T  B22^T )

    A B^T = ( A11 B11^T + A12 B12^T   A11 B21^T + A12 B22^T ; A21 B11^T + A22 B12^T   A21 B21^T + A22 B22^T ).

An important particular case is the square matrix A(p × p), partitioned such that A11 and A22 are both square matrices (i.e., n_j = p_j, j = 1, 2). It can be verified that when A is non-singular (A A^{-1} = I_p):

    A^{-1} = ( A^{11}  A^{12} ; A^{21}  A^{22} )    (2.26)

where

    A^{11} = (A11 - A12 A22^{-1} A21)^{-1}  def=  (A11·2)^{-1}
    A^{12} = -(A11·2)^{-1} A12 A22^{-1}
    A^{21} = -A22^{-1} A21 (A11·2)^{-1}
    A^{22} = A22^{-1} + A22^{-1} A21 (A11·2)^{-1} A12 A22^{-1}.

An alternative expression can be obtained by reversing the positions of A11 and A22 in the original matrix.

The following results will be useful if A11 is non-singular:

    |A| = |A11| |A22 - A21 A11^{-1} A12| = |A11| |A22·1|.    (2.27)

If A22 is non-singular, we have that:

    |A| = |A22| |A11 - A12 A22^{-1} A21| = |A22| |A11·2|.    (2.28)

22

A useful formula is derived from the alternative expressions for the inverse and the determi-

nant. For instance let

1b

B=

aA

where a and b are (p — 1) vectors and A is non-singular. We then have:

|B| = |A ’ ab | = |A||1 ’ b A’1 a| (2.29)

and equating the two expressions for B 22 , we obtain the following:

A’1 ab A’1

’1 ’1

(A ’ ab ) =A + . (2.30)

1 ’ b A’1 a


EXAMPLE 2.9 Let us consider the matrix

    A = ( 1 2 ; 2 2 ).

We can use formula (2.26) to calculate the inverse of this partitioned matrix, i.e., A^{11} = -1, A^{12} = A^{21} = 1, A^{22} = -1/2. The inverse of A is

    A^{-1} = ( -1  1 ; 1  -0.5 ).

It is also easy to calculate the determinant of A:

    |A| = |1| |2 - 4| = -2.
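Formulas (2.29) and (2.30) lend themselves to a quick numerical check. A sketch, assuming NumPy; the matrix A and the vectors a, b are randomly generated illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(0)
p = 4
A = rng.normal(size=(p, p)) + p * np.eye(p)   # well-conditioned, non-singular
a = rng.normal(size=p)
b = rng.normal(size=p)

Ainv = np.linalg.inv(A)

# Formula (2.30): (A - a b')^{-1} = A^{-1} + A^{-1} a b' A^{-1} / (1 - b' A^{-1} a)
lhs = np.linalg.inv(A - np.outer(a, b))
rhs = Ainv + np.outer(Ainv @ a, b @ Ainv) / (1 - b @ Ainv @ a)
print(np.allclose(lhs, rhs))

# Formula (2.29): |A - a b'| = |A| (1 - b' A^{-1} a)
print(np.isclose(np.linalg.det(A - np.outer(a, b)),
                 np.linalg.det(A) * (1 - b @ Ainv @ a)))
```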

Let A(n × p) and B(p × n) be any two matrices and suppose that n ≥ p. From (2.27) and (2.28) we can conclude that

    | ( -λI_n  -A ; B  I_p ) | = (-λ)^{n-p} |BA - λI_p| = |AB - λI_n|.    (2.31)

Since both determinants on the right-hand side of (2.31) are polynomials in λ, we find that the n eigenvalues of AB yield the p eigenvalues of BA plus the eigenvalue 0, n - p times. The relationship between the eigenvectors is described in the next theorem.

THEOREM 2.6 For A(n × p) and B(p × n), the non-zero eigenvalues of AB and BA are the same and have the same multiplicity. If x is an eigenvector of AB for an eigenvalue λ ≠ 0, then y = Bx is an eigenvector of BA.

COROLLARY 2.2 For A(n × p), B(q × n), a(p × 1), and b(q × 1) we have

    rank(A a b^T B) ≤ 1.

The non-zero eigenvalue, if it exists, equals b^T B A a (with eigenvector Aa).

Proof:
Theorem 2.6 asserts that the eigenvalues of A a b^T B are the same as those of b^T B A a. Note that the matrix b^T B A a is a scalar and hence it is its own eigenvalue λ_1. Applying A a b^T B to Aa yields

    (A a b^T B)(Aa) = (Aa)(b^T B A a) = λ_1 Aa.  □
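Theorem 2.6 and the eigenvalue count below (2.31) can be illustrated numerically. A sketch, assuming NumPy; the sizes n = 5, p = 3 and the random matrices are illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 5, 3
A = rng.normal(size=(n, p))
B = rng.normal(size=(p, n))

eig_AB = np.linalg.eigvals(A @ B)   # n eigenvalues
eig_BA = np.linalg.eigvals(B @ A)   # p eigenvalues

# AB has the p eigenvalues of BA plus the eigenvalue 0 with multiplicity n - p;
# compare the moduli of the non-zero eigenvalues.
nonzero = eig_AB[np.abs(eig_AB) > 1e-8]
print(np.allclose(np.sort(np.abs(nonzero)), np.sort(np.abs(eig_BA))))

# If x is an eigenvector of AB for lambda != 0, then y = Bx is an eigenvector of BA.
lam, X = np.linalg.eig(A @ B)
k = np.argmax(np.abs(lam))
y = B @ X[:, k]
print(np.allclose((B @ A) @ y, lam[k] * y))
```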



Figure 2.1. Distance d.

2.6 Geometrical Aspects

Distance

Let x, y ∈ R^p. A distance d is defined as a function d : R^{2p} → R_+ which fulfills

    d(x, y) > 0                      ∀ x ≠ y
    d(x, y) = 0                      if and only if x = y
    d(x, y) ≤ d(x, z) + d(z, y)      ∀ x, y, z.

A Euclidean distance d between two points x and y is defined as

    d^2(x, y) = (x - y)^T A (x - y)    (2.32)

where A is a positive definite matrix (A > 0). A is called a metric.

EXAMPLE 2.10 A particular case is when A = I_p, i.e.,

    d^2(x, y) = Σ_{i=1}^p (x_i - y_i)^2.    (2.33)

Figure 2.1 illustrates this definition for p = 2.

Note that the sets E_d = {x ∈ R^p | (x - x_0)^T (x - x_0) = d^2}, i.e., the spheres with radius d and center x_0, are the Euclidean I_p iso-distance curves from the point x_0 (see Figure 2.2).

The more general distance (2.32) with a positive definite matrix A (A > 0) leads to the iso-distance curves

    E_d = {x ∈ R^p | (x - x_0)^T A (x - x_0) = d^2},    (2.34)

i.e., ellipsoids with center x_0, matrix A and constant d (see Figure 2.3).

Let γ_1, γ_2, . . . , γ_p be the orthonormal eigenvectors of A corresponding to the eigenvalues λ_1 ≥ λ_2 ≥ . . . ≥ λ_p. The resulting observations are given in the next theorem.

Figure 2.2. Iso-distance sphere.

Figure 2.3. Iso-distance ellipsoid.

THEOREM 2.7 (i) The principal axes of E_d are in the direction of γ_i, i = 1, . . . , p.

(ii) The half-lengths of the axes are √(d^2/λ_i), i = 1, . . . , p.

(iii) The rectangle surrounding the ellipsoid E_d is defined by the following inequalities:

    x_{0i} - √(d^2 a^{ii}) ≤ x_i ≤ x_{0i} + √(d^2 a^{ii}),  i = 1, . . . , p,

where a^{ii} is the (i, i) element of A^{-1}. By the rectangle surrounding the ellipsoid E_d we mean the rectangle whose sides are parallel to the coordinate axes.

It is easy to find the coordinates of the tangency points between the ellipsoid and its surrounding rectangle parallel to the coordinate axes. Let us find the coordinates of the tangency point that is in the direction of the j-th coordinate axis (positive direction).

For ease of notation, we suppose the ellipsoid is centered around the origin (x_0 = 0). If not, the rectangle will be shifted by the value of x_0.

The coordinate of the tangency point is given by the solution to the following problem:

    x = arg max_{x^T A x = d^2} e_j^T x    (2.35)

where e_j is the j-th column of the identity matrix I_p. The coordinate of the tangency point in the negative direction would correspond to the solution of the min problem: by symmetry, it is the opposite value of the former.

The solution is computed via the Lagrangian L = e_j^T x - λ(x^T A x - d^2), which by (2.23) leads to the following system of equations:

    ∂L/∂x = e_j - 2λAx = 0    (2.36)
    ∂L/∂λ = x^T A x - d^2 = 0.    (2.37)

This gives x = (1/(2λ)) A^{-1} e_j, or componentwise

    x_i = (1/(2λ)) a^{ij},  i = 1, . . . , p    (2.38)

where a^{ij} denotes the (i, j)-th element of A^{-1}.

Premultiplying (2.36) by x^T, we have from (2.37):

    x_j = 2λd^2.

Comparing this to the value obtained by (2.38) for i = j, we obtain 2λ = √(a^{jj}/d^2). We choose the positive value of the square root because we are maximizing e_j^T x. A minimum would correspond to the negative value. Finally, we have the coordinates of the tangency point between the ellipsoid and its surrounding rectangle in the positive direction of the j-th axis:

    x_i = √(d^2/a^{jj}) a^{ij},  i = 1, . . . , p.    (2.39)

The particular case where i = j provides statement (iii) in Theorem 2.7.
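Formula (2.39) can be verified directly: the tangency point lies on the ellipsoid and its j-th coordinate touches the rectangle side from Theorem 2.7 (iii). A sketch, assuming NumPy; the metric A and the value of d^2 are illustrative choices:

```python
import numpy as np

A = np.array([[2.0, 1.0], [1.0, 2.0]])   # an illustrative positive definite metric
d2 = 1.5                                  # d^2
Ainv = np.linalg.inv(A)

j = 0                                     # tangency in the direction of the first axis
x_t = np.sqrt(d2 / Ainv[j, j]) * Ainv[:, j]   # formula (2.39), with x0 = 0

print(x_t @ A @ x_t)                      # lies on the ellipsoid: equals d^2
print(x_t[j], np.sqrt(d2 * Ainv[j, j]))   # touches the rectangle side (Theorem 2.7 iii)
```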

Remark: usefulness of Theorem 2.7

Theorem 2.7 will prove to be particularly useful in many subsequent chapters. First, it provides a helpful tool for graphing an ellipse in two dimensions. Indeed, knowing the slope of the principal axes of the ellipse, their half-lengths and drawing the rectangle inscribing the ellipse allows one to quickly draw a rough picture of the shape of the ellipse.

In Chapter 7, it is shown that the confidence region for the vector μ of a multivariate normal population is given by a particular ellipsoid whose parameters depend on sample characteristics. The rectangle inscribing the ellipsoid (which is much easier to obtain) will provide the simultaneous confidence intervals for all of the components in μ.

In addition it will be shown that the contour surfaces of the multivariate normal density are provided by ellipsoids whose parameters depend on the mean vector and on the covariance matrix. We will see that the tangency points between the contour ellipsoids and the surrounding rectangle are determined by regressing one component on the (p - 1) other components. For instance, in the direction of the j-th axis, the tangency points are given by the intersections of the ellipsoid contours with the regression line of the vector of (p - 1) variables (all components except the j-th) on the j-th component.

Norm of a Vector

Consider a vector x ∈ R^p. The norm or length of x (with respect to the metric I_p) is defined as

    ||x|| = d(0, x) = √(x^T x).

If ||x|| = 1, x is called a unit vector. A more general norm can be defined with respect to the metric A:

    ||x||_A = √(x^T A x).


Figure 2.4. Angle between vectors.

Angle between two Vectors

Consider two vectors x, y ∈ R^p. The angle θ between x and y is defined by the cosine of θ:

    cos θ = x^T y / (||x|| ||y||),    (2.40)

see Figure 2.4. Indeed for p = 2, x = (x_1; x_2) and y = (y_1; y_2), we have

    ||x|| cos θ_1 = x_1 ;   ||y|| cos θ_2 = y_1
    ||x|| sin θ_1 = x_2 ;   ||y|| sin θ_2 = y_2,    (2.41)

therefore,

    cos θ = cos θ_1 cos θ_2 + sin θ_1 sin θ_2 = (x_1 y_1 + x_2 y_2)/(||x|| ||y||) = x^T y/(||x|| ||y||).

REMARK 2.1 If x^T y = 0, then the angle θ is equal to π/2. From trigonometry, we know that the cosine of θ equals the length of the base of a triangle (||p_x||) divided by the length of the hypotenuse (||x||). Hence, we have

    ||p_x|| = ||x|| |cos θ| = |x^T y| / ||y||,    (2.42)


Figure 2.5. Projection.

where p_x is the projection of x on y (which is defined below). It is the coordinate of x on the y vector, see Figure 2.5.

The angle can also be defined with respect to a general metric A:

    cos θ = x^T A y / (||x||_A ||y||_A).    (2.43)

If cos θ = 0, then x is orthogonal to y with respect to the metric A.

EXAMPLE 2.11 Assume that there are two centered (i.e., zero mean) data vectors. The cosine of the angle between them is equal to their correlation (defined in (3.8))! Indeed for x and y with x̄ = ȳ = 0 we have

    r_XY = Σ_i x_i y_i / √(Σ_i x_i^2 Σ_i y_i^2) = cos θ

according to formula (2.40).
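This identity is easy to confirm on simulated data; the sample below is an illustrative construction, assuming NumPy:

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.normal(size=50)
y = 0.5 * x + rng.normal(size=50)
x = x - x.mean()   # center both vectors
y = y - y.mean()

cos_theta = (x @ y) / (np.linalg.norm(x) * np.linalg.norm(y))
r_xy = np.corrcoef(x, y)[0, 1]
print(cos_theta, r_xy)   # identical up to rounding
```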

Rotations

When we consider a point x ∈ R^p, we generally use a p-coordinate system to obtain its geometric representation, as in Figure 2.1 for instance. There will be situations in multivariate techniques where we will want to rotate this system of coordinates by the angle θ.

Consider for example the point P with coordinates x = (x_1, x_2)^T in R^2 with respect to a given set of orthogonal axes. Let Γ be a (2 × 2) orthogonal matrix where

    Γ = ( cos θ   sin θ ; -sin θ   cos θ ).    (2.44)

If the axes are rotated about the origin through an angle θ in a clockwise direction, the new coordinates of P will be given by the vector y

    y = Γ x,    (2.45)

and a rotation through the same angle in a counterclockwise direction gives the new coordinates as

    y = Γ^T x.    (2.46)

More generally, premultiplying a vector x by an orthogonal matrix Γ geometrically corresponds to a rotation of the system of axes, so that the first new axis is determined by the first row of Γ. This geometric point of view will be exploited in Chapters 9 and 10.
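A short numerical check of (2.44)-(2.46): Γ is orthogonal, rotations preserve lengths, and Γ^T undoes Γ. The angle and the point are illustrative choices, assuming NumPy:

```python
import numpy as np

theta = np.pi / 6
G = np.array([[np.cos(theta),  np.sin(theta)],
              [-np.sin(theta), np.cos(theta)]])   # Gamma from (2.44)

x = np.array([1.0, 2.0])
y = G @ x   # clockwise rotation of the axes (2.45)

print(np.allclose(G @ G.T, np.eye(2)))                   # orthogonal: Gamma Gamma' = I
print(np.isclose(np.linalg.norm(y), np.linalg.norm(x)))  # lengths are preserved
print(np.allclose(G.T @ y, x))                           # counterclockwise rotation undoes it
```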

Column Space and Null Space of a Matrix

Define for X(n × p)

    Im(X)  def=  C(X) = {x ∈ R^n | ∃ a ∈ R^p so that Xa = x},

the space generated by the columns of X, or the column space of X. Note that C(X) ⊆ R^n and dim{C(X)} = rank(X) = r ≤ min(n, p).

    Ker(X)  def=  N(X) = {y ∈ R^p | Xy = 0}

is the null space of X. Note that N(X) ⊆ R^p and that dim{N(X)} = p - r.

REMARK 2.2 N(X^T) is the orthogonal complement of C(X) in R^n, i.e., given a vector b ∈ R^n, it holds that x^T b = 0 for all x ∈ C(X) if and only if b ∈ N(X^T).

EXAMPLE 2.12 Let

    X = ( 2 3 5
          4 6 7
          6 8 6
          8 2 4 ).

It is easy to show (e.g., by calculating the determinant of X^T X) that rank(X) = 3. Hence, the column space of X is a three-dimensional subspace of R^4. The null space of X contains only the zero vector (0, 0, 0)^T and its dimension is equal to p - rank(X) = 3 - 3 = 0.

For

    X = ( 2 3 1
          4 6 2
          6 8 3
          8 2 4 ),

the third column is a multiple of the first one and the matrix X cannot be of full rank. Noticing that the first two columns of X are independent, we see that rank(X) = 2. In this case, the dimension of the column space is 2 and the dimension of the null space is 1.
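Both rank computations can be reproduced numerically, and a basis of the null space can be read off the singular value decomposition. A sketch, assuming NumPy:

```python
import numpy as np

X1 = np.array([[2, 3, 5],
               [4, 6, 7],
               [6, 8, 6],
               [8, 2, 4]], dtype=float)
print(np.linalg.matrix_rank(X1))   # 3 -> null space contains only the zero vector

X2 = np.array([[2, 3, 1],
               [4, 6, 2],
               [6, 8, 3],
               [8, 2, 4]], dtype=float)
print(np.linalg.matrix_rank(X2))   # 2 -> dim N(X2) = 3 - 2 = 1

# A basis of N(X2): the right singular vector for the (numerically) zero singular value.
_, s, Vt = np.linalg.svd(X2)
null_vec = Vt[-1]
print(np.allclose(X2 @ null_vec, 0))
```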

Projection Matrix

A matrix P(n × n) is called an (orthogonal) projection matrix in R^n if and only if P = P^T = P^2 (P is idempotent). Let b ∈ R^n. Then a = Pb is the projection of b on C(P).


Projection on C(X)

Consider X(n × p) and let

    P = X(X^T X)^{-1} X^T    (2.47)

and Q = I_n - P. It is easy to check that P and Q are idempotent and that

    PX = X  and  QX = 0.    (2.48)

Since the columns of X are projected onto themselves, the projection matrix P projects any vector b ∈ R^n onto C(X). Similarly, the projection matrix Q projects any vector b ∈ R^n onto the orthogonal complement of C(X).

THEOREM 2.8 Let P be the projection (2.47) and Q its orthogonal complement. Then:

(i) x = Pb ⇒ x ∈ C(X),

(ii) y = Qb ⇒ y^T x = 0 ∀ x ∈ C(X).

Proof:
(i) holds, since x = X(X^T X)^{-1} X^T b = Xa, where a = (X^T X)^{-1} X^T b ∈ R^p.

(ii) follows from y = b - Pb and x = Xa ⇒ y^T x = b^T Xa - b^T X(X^T X)^{-1} X^T Xa = 0.  □

REMARK 2.3 Let x, y ∈ R^n and consider p_x ∈ R^n, the projection of x on y (see Figure 2.5). With X = y we have from (2.47)

    p_x = y(y^T y)^{-1} y^T x = (y^T x / ||y||^2) y    (2.49)

and we can easily verify that

    ||p_x|| = √(p_x^T p_x) = |y^T x| / ||y||.

See again Remark 2.1.
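The defining properties (2.47)-(2.48) are straightforward to verify numerically; the matrix X below is a random illustrative choice, assuming NumPy:

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(6, 2))
P = X @ np.linalg.inv(X.T @ X) @ X.T   # projection onto C(X), formula (2.47)
Q = np.eye(6) - P

print(np.allclose(P, P.T), np.allclose(P, P @ P))    # symmetric and idempotent
print(np.allclose(P @ X, X), np.allclose(Q @ X, 0))  # (2.48)

b = rng.normal(size=6)
print(np.allclose((Q @ b) @ (P @ b), 0))             # Pb and Qb are orthogonal
```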


Summary

→ A distance between two p-dimensional points x and y is a quadratic form (x - y)^T A (x - y) in the vector of differences (x - y). A distance defines the norm of a vector.

→ Iso-distance curves of a point x_0 are all those points that have the same distance from x_0. Iso-distance curves are ellipsoids whose principal axes are determined by the direction of the eigenvectors of A. The half-lengths of the principal axes are proportional to the inverse of the square roots of the eigenvalues of A.

→ The angle between two vectors x and y is given by cos θ = x^T A y / (||x||_A ||y||_A) w.r.t. the metric A.

→ For the Euclidean distance with A = I, the correlation between two centered data vectors x and y is given by the cosine of the angle between them, i.e., cos θ = r_XY.

→ The projection P = X(X^T X)^{-1} X^T is the projection onto the column space C(X) of X.

→ The projection of x ∈ R^n on y ∈ R^n is given by p_x = (y^T x / ||y||^2) y.

2.7 Exercises

EXERCISE 2.1 Compute the determinant for a (3 × 3) matrix.

EXERCISE 2.2 Suppose that |A| = 0. Is it possible that all eigenvalues of A are positive?

EXERCISE 2.3 Suppose that all eigenvalues of some (square) matrix A are different from zero. Does the inverse A^{-1} of A exist?

EXERCISE 2.4 Write a program that calculates the Jordan decomposition of the matrix

    A = ( 1 2 3
          2 1 2
          3 2 1 ).

Check Theorem 2.1 numerically.


EXERCISE 2.5 Prove (2.23), (2.24) and (2.25).

EXERCISE 2.6 Show that a projection matrix only has eigenvalues in {0, 1}.

EXERCISE 2.7 Draw some iso-distance ellipsoids for the metric A = Σ^{-1} of Example 3.13.

EXERCISE 2.8 Find a formula for |A + a a^T| and for (A + a a^T)^{-1}. (Hint: use the inverse of the partitioned matrix B = ( 1  -a^T ; a  A ).)

EXERCISE 2.9 Prove the Binomial inverse theorem for two non-singular matrices A(p × p) and B(p × p): (A + B)^{-1} = A^{-1} - A^{-1}(A^{-1} + B^{-1})^{-1} A^{-1}. (Hint: use (2.26) with C = ( A  I_p ; -I_p  B^{-1} ).)

3 Moving to Higher Dimensions

We have seen in the previous chapters how very simple graphical devices can help in understanding the structure and dependency of data. The graphical tools were based on either univariate (bivariate) data representations or on "slick" transformations of multivariate information perceivable by the human eye. Most of the tools are extremely useful in a modelling step, but unfortunately, do not give the full picture of the data set. One reason for this is that the graphical tools presented capture only certain dimensions of the data and do not necessarily concentrate on those dimensions or subparts of the data under analysis that carry the maximum structural information. In Part III of this book, powerful tools for reducing the dimension of a data set will be presented. In this chapter, as a starting point, simple and basic tools are used to describe dependency. They are constructed from elementary facts of probability theory and introductory statistics (for example, the covariance and correlation between two variables).

Sections 3.1 and 3.2 show how to handle these concepts in a multivariate setup and how a simple test on correlation between two variables can be derived. Since linear relationships are involved in these measures, Section 3.4 presents the simple linear model for two variables and recalls the basic t-test for the slope. In Section 3.5, a simple example of one-factorial analysis of variance introduces the notations for the well known F-test.

Due to the power of matrix notation, all of this can easily be extended to a more general multivariate setup. Section 3.3 shows how matrix operations can be used to define summary statistics of a data set and for obtaining the empirical moments of linear transformations of the data. These results will prove to be very useful in most of the chapters in Part III.

Finally, matrix notation allows us to introduce the flexible multiple linear model, where more general relationships among variables can be analyzed. In Section 3.6, the least squares adjustment of the model and the usual test statistics are presented with their geometric interpretation. Using these notations, the ANOVA model is just a particular case of the multiple linear model.


3.1 Covariance

Covariance is a measure of dependency between random variables. Given two (random) variables X and Y, the (theoretical) covariance is defined by:

    σ_XY = Cov(X, Y) = E(XY) - (EX)(EY).    (3.1)

The precise definition of expected values is given in Chapter 4. If X and Y are independent of each other, the covariance Cov(X, Y) is necessarily equal to zero, see Theorem 3.1. The converse is not true. The covariance of X with itself is the variance:

    σ_XX = Var(X) = Cov(X, X).

If the variable X is p-dimensional multivariate, e.g., X = (X_1, . . . , X_p)^T, then the theoretical covariances among all the elements are put into matrix form, i.e., the covariance matrix:

    Σ = ( σ_X1X1  . . .  σ_X1Xp
            ⋮       ⋱       ⋮
          σ_XpX1  . . .  σ_XpXp ).

Properties of covariance matrices will be detailed in Chapter 4. Empirical versions of these quantities are:

    s_XY = (1/n) Σ_{i=1}^n (x_i - x̄)(y_i - ȳ)    (3.2)

    s_XX = (1/n) Σ_{i=1}^n (x_i - x̄)^2.    (3.3)

For small n, say n ≤ 20, we should replace the factor 1/n in (3.2) and (3.3) by 1/(n-1) in order to correct for a small bias. For a p-dimensional random variable, one obtains the empirical covariance matrix (see Section 3.3 for properties and details)

    S = ( s_X1X1  . . .  s_X1Xp
            ⋮       ⋱       ⋮
          s_XpX1  . . .  s_XpXp ).

For a scatterplot of two variables the covariances measure "how close the scatter is to a line". Mathematical details follow, but it should already be understood here that in this sense covariance measures only "linear dependence".


EXAMPLE 3.1 If X is the entire bank data set, one obtains the covariance matrix S as indicated below:

    S = (  0.14   0.03   0.02  -0.10  -0.01   0.08
           0.03   0.12   0.10   0.21   0.10  -0.21
           0.02   0.10   0.16   0.28   0.12  -0.24
          -0.10   0.21   0.28   2.07   0.16  -1.03
          -0.01   0.10   0.12   0.16   0.64  -0.54
           0.08  -0.21  -0.24  -1.03  -0.54   1.32 ).    (3.4)

The empirical covariance between X_4 and X_5, i.e., s_X4X5, is found in row 4 and column 5. The value is s_X4X5 = 0.16. Is it obvious that this value is positive? In Exercise 3.1 we will discuss this question further.

If X_f denotes the counterfeit bank notes, we obtain:

    S_f = (  0.123   0.031   0.023  -0.099   0.019   0.011
             0.031   0.064   0.046  -0.024  -0.012  -0.005
             0.024   0.046   0.088  -0.018   0.000   0.034
            -0.099  -0.024  -0.018   1.268  -0.485   0.236
             0.019  -0.012   0.000  -0.485   0.400  -0.022
             0.011  -0.005   0.034   0.236  -0.022   0.308 ).    (3.5)

For the genuine bank notes, X_g, we have:

    S_g = (  0.149   0.057   0.057   0.056   0.014   0.005
             0.057   0.131   0.085   0.056   0.048  -0.043
             0.057   0.085   0.125   0.058   0.030  -0.024
             0.056   0.056   0.058   0.409  -0.261  -0.000
             0.014   0.049   0.030  -0.261   0.417  -0.074
             0.005  -0.043  -0.024  -0.000  -0.074   0.198 ).    (3.6)

Note that the covariance between X_4 (distance of the frame to the lower border) and X_5 (distance of the frame to the upper border) is negative in both (3.5) and (3.6)! Why would this happen? In Exercise 3.2 we will discuss this question in more detail.

At first sight, the matrices S_f and S_g look different, but they create almost the same scatterplots (see the discussion in Section 1.4). Similarly, the common principal component analysis in Chapter 9 suggests a joint analysis of the covariance structure as in Flury and Riedwyl (1988).

Scatterplots with point clouds that are "upward-sloping", like the one in the upper left of Figure 1.14, show variables with positive covariance. Scatterplots with "downward-sloping" structure have negative covariance. In Figure 3.1 we show the scatterplot of X_4 vs. X_5 of the entire bank data set. The point cloud is upward-sloping. However, the two sub-clouds of counterfeit and genuine bank notes are downward-sloping.


Figure 3.1. Scatterplot of variables X4 vs. X5 of the entire bank data set ("Swiss bank notes").  MVAscabank45.xpl

EXAMPLE 3.2 A textile shop manager is studying the sales of "classic blue" pullovers over 10 different periods. He observes the number of pullovers sold (X_1), variation in price (X_2, in EUR), the advertisement costs in local newspapers (X_3, in EUR) and the presence of a sales assistant (X_4, in hours per period). Over the periods, he observes the following data matrix:

    X = ( 230 125 200 109
          181  99  55 107
          165  97 105  98
          150 115  85  71
           97 120   0  82
          192 100 150 103
          181  80  85 111
          189  90 120  93
          172  95 110  86
          170 125 130  78 ).


Figure 3.2. Scatterplot of variables X2 vs. X1 of the pullovers data set.  MVAscapull1.xpl

He is convinced that the price must have a large influence on the number of pullovers sold. So he makes a scatterplot of X_2 vs. X_1, see Figure 3.2. A rough impression is that the cloud is somewhat downward-sloping. A computation of the empirical covariance yields

    s_X1X2 = (1/10) Σ_{i=1}^{10} (X_{1i} - X̄_1)(X_{2i} - X̄_2) = -80.02,

a negative value as expected.

Note: The covariance function is scale dependent. Thus, if the prices in this example were in Japanese Yen (JPY), we would obtain a different answer (see Exercise 3.16). A measure of (linear) dependence independent of the scale is the correlation, which we introduce in the next section.
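The computation can be reproduced directly from the first two columns of the data matrix above. Note that the factor 1/n (here 1/10) reproduces the printed value -80.02; with the small-sample factor 1/(n-1) recommended in Section 3.1 one would obtain -88.91 instead:

```python
import numpy as np

sales = np.array([230, 181, 165, 150, 97, 192, 181, 189, 172, 170], dtype=float)  # X1
price = np.array([125, 99, 97, 115, 120, 100, 80, 90, 95, 125], dtype=float)      # X2

n = len(sales)
s_x1x2 = ((sales - sales.mean()) * (price - price.mean())).sum() / n
print(round(s_x1x2, 2))   # -80.02
```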


Summary

→ The covariance is a measure of dependence.

→ Covariance measures only linear dependence.

→ Covariance is scale dependent.

→ There are nonlinear dependencies that have zero covariance.

→ Zero covariance does not imply independence.

→ Independence implies zero covariance.

→ Negative covariance corresponds to downward-sloping scatterplots.

→ Positive covariance corresponds to upward-sloping scatterplots.

→ The covariance of a variable with itself is its variance: Cov(X, X) = σ_XX = σ_X^2.

→ For small n, we should replace the factor 1/n in the computation of the covariance by 1/(n-1).

3.2 Correlation

The correlation between two variables X and Y is defined from the covariance as the following:

    ρ_XY = Cov(X, Y) / √(Var(X) Var(Y)).    (3.7)

The advantage of the correlation is that it is independent of the scale, i.e., changing the variables' scale of measurement does not change the value of the correlation. Therefore, the correlation is more useful as a measure of association between two random variables than the covariance. The empirical version of ρ_XY is as follows:

    r_XY = s_XY / √(s_XX s_YY).    (3.8)

The correlation is in absolute value always less than 1. It is zero if the covariance is zero, and vice-versa. For p-dimensional vectors (X_1, . . . , X_p)^T we have the theoretical correlation matrix

    P = ( ρ_X1X1  . . .  ρ_X1Xp
            ⋮       ⋱       ⋮
          ρ_XpX1  . . .  ρ_XpXp ),

and its empirical version, the empirical correlation matrix, which can be calculated from the observations:

    R = ( r_X1X1  . . .  r_X1Xp
            ⋮       ⋱       ⋮
          r_XpX1  . . .  r_XpXp ).

EXAMPLE 3.3 We obtain the following correlation matrix for the genuine bank notes:

    R_g = (  1.00   0.41   0.41   0.22   0.05   0.03
             0.41   1.00   0.66   0.24   0.20  -0.25
             0.41   0.66   1.00   0.25   0.13  -0.14
             0.22   0.24   0.25   1.00  -0.63  -0.00
             0.05   0.20   0.13  -0.63   1.00  -0.25
             0.03  -0.25  -0.14  -0.00  -0.25   1.00 ),    (3.9)

and for the counterfeit bank notes:

    R_f = (  1.00   0.35   0.24  -0.25   0.08   0.06
             0.35   1.00   0.61  -0.08  -0.07  -0.03
             0.24   0.61   1.00  -0.05   0.00   0.20
            -0.25  -0.08  -0.05   1.00  -0.68   0.37
             0.08  -0.07   0.00  -0.68   1.00  -0.06
             0.06  -0.03   0.20   0.37  -0.06   1.00 ).    (3.10)

As noted before for Cov(X_4, X_5), the correlation between X_4 (distance of the frame to the lower border) and X_5 (distance of the frame to the upper border) is negative. This is natural, since the covariance and correlation always have the same sign (see also Exercise 3.17).

Why is the correlation an interesting statistic to study? It is related to independence of

random variables, which we shall de¬ne more formally later on. For the moment we may

think of independence as the fact that one variable has no in¬‚uence on another.

THEOREM 3.1 If X and Y are independent, then ρ(X, Y) = Cov(X, Y) = 0.

In general, the converse is not true, as the following example shows.

EXAMPLE 3.4 Consider a standard normally-distributed random variable X and a random
variable Y = X², which is surely not independent of X. Here we have

    Cov(X, Y) = E(XY) - E(X)E(Y) = E(X³) = 0

(because E(X) = 0 and E(X²) = 1). Therefore ρ(X, Y) = 0 as well. This example
also shows that correlations and covariances measure only linear dependence. The quadratic
dependence of Y = X² on X is not reflected by these measures of dependence.
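A quick simulation (a sketch, not part of the text) illustrates Example 3.4: Y = X² is a deterministic function of X, yet the empirical correlation is close to zero:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(100_000)   # X ~ N(0, 1)
y = x ** 2                         # Y = X^2 depends on X, but only quadratically

r = np.corrcoef(x, y)[0, 1]        # empirical correlation, close to 0
```

By contrast, the correlation between X² and Y is exactly 1, which is the linear dependence that the correlation does pick up.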

88 3 Moving to Higher Dimensions

REMARK 3.1 For two normal random variables, the converse of Theorem 3.1 is true: zero

covariance for two normally-distributed random variables implies independence. This will be

shown later in Corollary 5.2.

Theorem 3.1 enables us to check for independence between the components of a bivariate
normal random variable. That is, we can use the correlation and test whether it is zero. The
distribution of rXY for an arbitrary (X, Y) is unfortunately complicated. The distribution
of rXY will be more accessible if (X, Y) are jointly normal (see Chapter 5). If we transform
the correlation by Fisher's Z-transformation,

    W = (1/2) log( (1 + rXY) / (1 - rXY) ),                   (3.11)

we obtain a variable that has a more accessible distribution. Under the hypothesis that
ρ = 0, W has an asymptotic normal distribution. Approximations of the expectation and
variance of W are given by the following:

    E(W)   ≈ (1/2) log( (1 + ρXY) / (1 - ρXY) ),
                                                              (3.12)
    Var(W) ≈ 1 / (n - 3).

The distribution is given in Theorem 3.2.

THEOREM 3.2

    Z = (W - E(W)) / √(Var(W))  →^L  N(0, 1).                 (3.13)

The symbol "→^L" denotes convergence in distribution, which will be explained in more
detail in Chapter 4.

Theorem 3.2 allows us to test different hypotheses on correlation. We can fix the level of
significance α (the probability of rejecting a true hypothesis) and reject the hypothesis if the
difference between the hypothetical value and the calculated value of Z is greater than the
corresponding critical value of the normal distribution. The following example illustrates
the procedure.

EXAMPLE 3.5 Let's study the correlation between mileage (X2) and weight (X8) for the
car data set (B.3) where n = 74. We have rX2X8 = -0.823. Our conclusions from the
boxplot in Figure 1.3 ("Japanese cars generally have better mileage than the others") need
to be revised. From Figure 3.3 and rX2X8, we can see that mileage is highly correlated with
weight, and that the Japanese cars in the sample are in fact all lighter than the others!


If we want to know whether ρX2X8 is significantly different from ρ0 = 0, we apply Fisher's
Z-transform (3.11). This gives us

    w = (1/2) log( (1 + rX2X8) / (1 - rX2X8) ) = -1.166

and

    z = (-1.166 - 0) / √(1/71) = -9.825,

i.e., a highly significant value to reject the hypothesis that ρ = 0 (the 2.5% and 97.5%
quantiles of the normal distribution are -1.96 and 1.96, respectively). If we want to test the
hypothesis that, say, ρ0 = -0.75, we obtain:

    z = (-1.166 - (-0.973)) / √(1/71) = -1.627.

This is a nonsignificant value at the α = 0.05 level for z since it is between the critical values
at the 5% significance level (i.e., -1.96 < z < 1.96).
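The computations of Example 3.5 can be sketched in a few lines; the helper name is ours, and we use the identity that arctanh(r) equals Fisher's W from (3.11):

```python
import numpy as np

def fisher_z(r, n, rho0=0.0):
    """Z-statistic (3.13) for H0: rho = rho0, using W = arctanh(r)."""
    return (np.arctanh(r) - np.arctanh(rho0)) * np.sqrt(n - 3)

# car data: r(X2, X8) = -0.823, n = 74
z0 = fisher_z(-0.823, 74)          # test rho = 0: about -9.825
z1 = fisher_z(-0.823, 74, -0.75)   # test rho = -0.75: about -1.627
```

Comparing |z| with the 1.96 critical value reproduces the conclusions above: reject ρ = 0, do not reject ρ = −0.75.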

EXAMPLE 3.6 Let us consider again the pullovers data set from Example 3.2. Consider the
correlation between the hours of the sales assistants (X4) and the number of sold pullovers
(X1) (see Figure 3.4). Here we compute the correlation as

    rX1X4 = 0.633.

The Z-transform of this value is

    w = (1/2) log( (1 + rX1X4) / (1 - rX1X4) ) = 0.746.       (3.14)

The sample size is n = 10, so for the hypothesis ρX1X4 = 0, the statistic to consider is

    z = √7 · (0.746 - 0) = 1.974,                             (3.15)

which is just statistically significant at the 5% level (i.e., 1.974 is just a little larger than
1.96).

REMARK 3.2 The normalizing and variance-stabilizing properties of W are asymptotic. In
addition, the use of W in small samples (for n ≤ 25) is improved by Hotelling's transform
(Hotelling, 1953):

    W* = W - (3W + tanh(W)) / (4(n - 1))   with   Var(W*) = 1 / (n - 1).

The transformed variable W* is asymptotically distributed as a normal distribution.


Figure 3.3. Mileage (X2) vs. weight (X8) of U.S. (star), European (plus
signs) and Japanese (circle) cars.  MVAscacar.xpl

EXAMPLE 3.7 From the preceding remark, we obtain w* = 0.6663 and √(10 - 1) · w* = 1.9989
for the preceding Example 3.6. This value is significant at the 5% level.
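Remark 3.2 and Example 3.7 can be reproduced directly (a sketch; the helper name is ours):

```python
import numpy as np

def hotelling_w_star(r, n):
    """Hotelling's small-sample correction of Fisher's W (Remark 3.2)."""
    w = np.arctanh(r)                         # Fisher's W
    return w - (3 * w + np.tanh(w)) / (4 * (n - 1))

# pullovers data (Example 3.6): r = 0.633, n = 10
w_star = hotelling_w_star(0.633, 10)          # about 0.6663
z = w_star * np.sqrt(10 - 1)                  # Var(W*) = 1/(n-1); about 1.9989
```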

REMARK 3.3 Note that Fisher's Z-transform is the inverse of the hyperbolic tangent
function: W = tanh⁻¹(rXY); equivalently, rXY = tanh(W) = (e^(2W) - 1) / (e^(2W) + 1).

REMARK 3.4 Under the assumptions of normality of X and Y, we may test their indepen-
dence (ρXY = 0) using the exact t-distribution of the statistic

    T = rXY · √( (n - 2) / (1 - r²XY) )  ~  t(n-2)   under ρXY = 0.

Setting the probability of the first error type to α, we reject the null hypothesis ρXY = 0 if
|T| ≥ t(1-α/2; n-2).
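Applied to the pullover data of Example 3.6 (r = 0.633, n = 10), the exact t-test of Remark 3.4 can be sketched as follows; the critical value 2.306 for t(0.975; 8) is quoted from standard tables:

```python
import math

def corr_t_stat(r, n):
    """t-statistic of Remark 3.4 for H0: rho = 0 under bivariate normality."""
    return r * math.sqrt((n - 2) / (1 - r ** 2))

t = corr_t_stat(0.633, 10)   # about 2.31, to be compared with t(0.975; 8) ≈ 2.306
```

The result is borderline significant at the 5% level, in line with the Fisher-Z result z = 1.974 in Example 3.6.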


Figure 3.4. Hours of sales assistants (X4) vs. sales (X1) of pullovers.
MVAscapull2.xpl

Summary

• The correlation is a standardized measure of dependence.

• The absolute value of the correlation is always less than one.

• Correlation measures only linear dependence.

• There are nonlinear dependencies that have zero correlation.

• Zero correlation does not imply independence.

• Independence implies zero correlation.

• Negative correlation corresponds to downward-sloping scatterplots.

• Positive correlation corresponds to upward-sloping scatterplots.


Summary (continued)

• Fisher's Z-transform helps us in testing hypotheses on correlation.

• For small samples, Fisher's Z-transform can be improved by the transformation
  W* = W - (3W + tanh(W)) / (4(n - 1)).

3.3 Summary Statistics

This section focuses on the representation of basic summary statistics (means, covariances

and correlations) in matrix notation, since we often apply linear transformations to data.

The matrix notation allows us to derive instantaneously the corresponding characteristics of

the transformed variables. The Mahalanobis transformation is a prominent example of such

linear transformations.

Assume that we have observed n realizations of a p-dimensional random variable; we have a
data matrix X (n × p):

        ( x11  · · ·  x1p )
    X = (  .           .  ).                                  (3.16)
        (  .           .  )
        ( xn1  · · ·  xnp )

The rows xi = (xi1, . . . , xip) ∈ R^p denote the i-th observation of a p-dimensional random
variable X ∈ R^p.

The statistics that were briefly introduced in Sections 3.1 and 3.2 can be rewritten in matrix
form as follows. The "center of gravity" of the n observations in R^p is given by the vector x̄
of the means x̄j of the p variables:

        ( x̄1 )
    x̄ = (  .  ) = n⁻¹ X' 1n.                                  (3.17)
        ( x̄p )

The dispersion of the n observations can be characterized by the covariance matrix of the
p variables. The empirical covariances defined in (3.2) and (3.3) are the elements of the
following matrix:

    S = n⁻¹ X'X - x̄ x̄' = n⁻¹ (X'X - n⁻¹ X'1n 1n'X).          (3.18)

Note that this matrix is equivalently defined by

    S = (1/n) Σᵢ₌₁ⁿ (xi - x̄)(xi - x̄)'.


The covariance formula (3.18) can be rewritten as S = n⁻¹ X'HX with the centering matrix

    H = In - n⁻¹ 1n 1n'.                                      (3.19)

Note that the centering matrix is symmetric and idempotent. Indeed,

    H² = (In - n⁻¹ 1n 1n')(In - n⁻¹ 1n 1n')
       = In - n⁻¹ 1n 1n' - n⁻¹ 1n 1n' + (n⁻¹ 1n 1n')(n⁻¹ 1n 1n')
       = In - n⁻¹ 1n 1n' = H.
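The identities above are easy to check numerically; this sketch uses a small random data matrix:

```python
import numpy as np

n, p = 6, 3
rng = np.random.default_rng(1)
X = rng.standard_normal((n, p))

one = np.ones((n, 1))
xbar = (X.T @ one / n).ravel()       # mean vector, cf. (3.17)
H = np.eye(n) - one @ one.T / n      # centering matrix, cf. (3.19)
S = X.T @ H @ X / n                  # S = n^{-1} X'HX, cf. (3.18)

# H is symmetric and idempotent; S equals the direct deviation formula
assert np.allclose(H, H.T)
assert np.allclose(H @ H, H)
assert np.allclose(S, (X - xbar).T @ (X - xbar) / n)
```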

As a consequence, S is positive semidefinite, i.e.,

    S ≥ 0.                                                    (3.20)

Indeed, for all a ∈ R^p,