[Figure 1.14: scatterplot panels for variables including Var 5 and Var 6 of the bank notes; axis tick labels (8–12, 129–131, 138–142) omitted.]

Figure 1.14. Draftman plot of the bank notes. The pictures in the left column show (X3, X4), (X3, X5) and (X3, X6), in the middle we have (X4, X5) and (X4, X6), and in the lower right is (X5, X6). The upper right half contains the corresponding density contour plots. MVAdrafbank4.xpl

or mouse over the screen. Inside the brush we can highlight or color observations. Suppose

the technique is installed in such a way that as we move the brush in one scatter, the

corresponding observations in the other scatters are also highlighted. By moving the brush,

we can study conditional dependence.

If we brush (i.e., highlight or color the observation with the brush) the X5 vs. X6 plot

and move through the upper point cloud, we see that in other plots (e.g., X3 vs. X4 ), the

corresponding observations are more embedded in the other sub-cloud.
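Linked brushing can be sketched as one boolean mask shared by all scatterplots: the brush rectangle defines the mask in one panel, and the same mask highlights the corresponding observations in every other panel. The following toy example (NumPy; hypothetical data and brush ranges, not the interactive implementation used in the text) illustrates the mechanism:

```python
import numpy as np

def brush_mask(data, xcol, ycol, x_range, y_range):
    """Boolean mask selecting observations inside a rectangular brush
    placed on the scatterplot of columns xcol vs. ycol."""
    x, y = data[:, xcol], data[:, ycol]
    return (x >= x_range[0]) & (x <= x_range[1]) & \
           (y >= y_range[0]) & (y <= y_range[1])

# Toy data: two point clouds, loosely mimicking (X5, X6) of the bank notes.
rng = np.random.default_rng(0)
lower = rng.normal([10.0, 139.5], 0.2, size=(50, 2))
upper = rng.normal([11.5, 141.5], 0.2, size=(50, 2))
data = np.vstack([lower, upper])

# Brush the upper cloud; the same mask would be used to highlight the
# corresponding observations in all other scatterplots.
mask = brush_mask(data, 0, 1, (11.0, 12.0), (141.0, 142.0))
print(mask.sum())  # roughly the 50 points of the upper cloud
```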

34 1 Comparison of Batches

Summary

• Scatterplots in two and three dimensions help in identifying separated points, outliers and sub-clusters.

• Scatterplots help us in judging positive or negative dependencies.

• Draftman scatterplot matrices help detect structures conditioned on values of other variables.

• As the brush of a scatterplot matrix moves through a point cloud, we can study conditional dependence.

1.5 Chernoff-Flury Faces

If we are given data in numerical form, we tend to display it also numerically. This was

done in the preceding sections: an observation x1 = (1, 2) was plotted as the point (1, 2) in a

two-dimensional coordinate system. In multivariate analysis we want to understand data in

low dimensions (e.g., on a 2D computer screen) although the structures are hidden in high

dimensions. The numerical display of data structures using coordinates therefore ends at

dimensions greater than three.

If we are interested in condensing a structure into 2D elements, we have to consider alter-

native graphical techniques. The Chernoff-Flury faces, for example, provide such a condensation of high-dimensional information into a simple “face”. In fact, faces are a simple way to graphically display high-dimensional data. The sizes of the face elements, like pupils, eyes, upper and lower hair line, etc., are assigned to certain variables. The idea of using faces goes back to Chernoff (1973) and has been further developed by Bernhard Flury. We follow the design described in Flury and Riedwyl (1988), which uses the following characteristics:

1 right eye size

2 right pupil size

3 position of right pupil

4 right eye slant

5 horizontal position of right eye

6 vertical position of right eye

7 curvature of right eyebrow

8 density of right eyebrow

9 horizontal position of right eyebrow

10 vertical position of right eyebrow

11 right upper hair line


Figure 1.15. Chernoff-Flury faces for observations 91 to 110 of the bank notes. MVAfacebank10.xpl

12 right lower hair line

13 right face line

14 darkness of right hair

15 right hair slant

16 right nose line

17 right size of mouth

18 right curvature of mouth

19–36 like 1–18, only for the left side.

First, every variable that is to be coded into a characteristic face element is transformed

into a (0, 1) scale, i.e., the minimum of the variable corresponds to 0 and the maximum to

1. The extreme positions of the face elements therefore correspond to a certain “grin” or

“happy” face element. Dark hair might be coded as 1, and blond hair as 0 and so on.
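This (0, 1) rescaling is a simple column-wise min-max transformation; a minimal sketch (NumPy assumed; not the actual face-drawing code of Flury and Riedwyl):

```python
import numpy as np

def to_unit_scale(X):
    """Rescale every column of X to [0, 1]: the column minimum maps to 0,
    the column maximum to 1, as required for the face elements."""
    X = np.asarray(X, dtype=float)
    mins, maxs = X.min(axis=0), X.max(axis=0)
    return (X - mins) / (maxs - mins)

# Toy batch of observations (rows) on two variables (columns).
X = np.array([[138.0, 8.0],
              [140.0, 10.0],
              [142.0, 12.0]])
print(to_unit_scale(X))
# first column: 0.0, 0.5, 1.0 -- the extremes drive the extreme face elements
```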


Figure 1.16. Chernoff-Flury faces for observations 1 to 50 of the bank notes. MVAfacebank50.xpl

As an example, consider the observations 91 to 110 of the bank data. Recall that the bank

data set consists of 200 observations of dimension 6 where, for example, X6 is the diagonal

of the note. If we assign the six variables to the following face elements

X1 = 1, 19 (eye sizes)

X2 = 2, 20 (pupil sizes)

X3 = 4, 22 (eye slants)

X4 = 11, 29 (upper hair lines)

X5 = 12, 30 (lower hair lines)

X6 = 13, 14, 31, 32 (face lines and darkness of hair),

we obtain Figure 1.15. Also recall that observations 1–100 correspond to the genuine notes, and that observations 101–200 correspond to the counterfeit notes. The counterfeit bank notes then correspond to the lower half of Figure 1.15. In fact, the faces for these observations look more grim and less happy. The variable X6 (diagonal) already worked well in the boxplot in Figure 1.4 in distinguishing between the counterfeit and genuine notes. Here, this variable is assigned to the face line and the darkness of the hair. That is why we clearly see a good separation within these 20 observations.

What happens if we include all 100 genuine and all 100 counterfeit bank notes in the Chernoff-Flury face technique? Figures 1.16 and 1.17 show the faces of the genuine bank notes with the


Figure 1.17. Chernoff-Flury faces for observations 51 to 100 of the bank notes. MVAfacebank50.xpl

same assignments as used before and Figures 1.18 and 1.19 show the faces of the counterfeit

bank notes. Comparing Figure 1.16 and Figure 1.18 one clearly sees that the diagonal (face

line) is longer for genuine bank notes. The hair darkness, which codes the diagonal equivalently, is lighter (shorter) for the counterfeit bank notes. One sees that the faces of the

genuine bank notes have a much darker appearance and have broader face lines. The faces

in Figures 1.16–1.17 are obviously different from the ones in Figures 1.18–1.19.

Summary

• Faces can be used to detect subgroups in multivariate data.

• Subgroups are characterized by similar looking faces.

• Outliers are identified by extreme faces, e.g., dark hair, smile or a happy face.

• If one element of X is unusual, the corresponding face element significantly changes in shape.


Figure 1.18. Chernoff-Flury faces for observations 101 to 150 of the bank notes. MVAfacebank50.xpl

Figure 1.19. Chernoff-Flury faces for observations 151 to 200 of the bank notes. MVAfacebank50.xpl


1.6 Andrews' Curves

The basic problem of graphical displays of multivariate data is the dimensionality. Scat-

terplots work well up to three dimensions (if we use interactive displays). More than three

dimensions have to be coded into displayable 2D or 3D structures (e.g., faces). The idea

of coding and representing multivariate data by curves was suggested by Andrews (1972).

Each multivariate observation Xi = (Xi,1, . . . , Xi,p) is transformed into a curve as follows:

    fi(t) = Xi,1/√2 + Xi,2 sin(t) + Xi,3 cos(t) + ... + Xi,p−1 sin((p − 1)t/2) + Xi,p cos((p − 1)t/2)   for p odd,
    fi(t) = Xi,1/√2 + Xi,2 sin(t) + Xi,3 cos(t) + ... + Xi,p sin(pt/2)   for p even,   (1.13)

such that the observation represents the coefficients of a so-called Fourier series (t ∈ [−π, π]).

Suppose that we have three-dimensional observations: X1 = (0, 0, 1), X2 = (1, 0, 0) and

X3 = (0, 1, 0). Here p = 3 and the following representations correspond to the Andrews' curves:

    f1(t) = cos(t),
    f2(t) = 1/√2, and
    f3(t) = sin(t).

These curves are indeed quite distinct, since the observations X1 , X2 , and X3 are the 3D

unit vectors: each observation has mass only in one of the three dimensions. The order of

the variables plays an important role.
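Formula (1.13) and the unit-vector example above can be checked directly; a minimal Python sketch (NumPy assumed, not part of the original text):

```python
import numpy as np

def andrews_curve(x, t):
    """Evaluate the Andrews curve (1.13) of one observation x at points t."""
    x = np.asarray(x, dtype=float)
    t = np.asarray(t, dtype=float)
    f = np.full_like(t, x[0] / np.sqrt(2.0))       # X_{i,1} / sqrt(2)
    for k, xk in enumerate(x[1:], start=1):
        m = (k + 1) // 2                           # frequency of the term
        # positions 2, 4, 6, ... contribute sin terms; 3, 5, 7, ... cos terms
        f += xk * (np.sin(m * t) if k % 2 == 1 else np.cos(m * t))
    return f

t = np.linspace(-np.pi, np.pi, 101)
# The 3D unit vectors reproduce the three curves computed above:
print(np.allclose(andrews_curve([0, 0, 1], t), np.cos(t)))        # True
print(np.allclose(andrews_curve([1, 0, 0], t), 1 / np.sqrt(2)))   # True
print(np.allclose(andrews_curve([0, 1, 0], t), np.sin(t)))        # True
```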

EXAMPLE 1.2 Let us take the 96th observation of the Swiss bank note data set,

X96 = (215.6, 129.9, 129.9, 9.0, 9.5, 141.7).

The Andrews' curve is, by (1.13):

    f96(t) = 215.6/√2 + 129.9 sin(t) + 129.9 cos(t) + 9.0 sin(2t) + 9.5 cos(2t) + 141.7 sin(3t).

Figure 1.20 shows the Andrews' curves for observations 96–105 of the Swiss bank note data set. We already know that the observations 96–100 represent genuine bank notes, and that the observations 101–105 represent counterfeit bank notes. We see that at least four curves differ from the others, but it is hard to tell which curve belongs to which group.

We know from Figure 1.4 that the sixth variable is an important one. Therefore, the Andrews' curves are calculated again using a reversed order of the variables.


[Plot “Andrews curves (Bank data)”: curves f96–f105 over t; axis residue omitted.]

Figure 1.20. Andrews' curves of the observations 96–105 from the Swiss bank note data. The order of the variables is 1,2,3,4,5,6. MVAandcur.xpl

EXAMPLE 1.3 Let us consider again the 96th observation of the Swiss bank note data set,

X96 = (215.6, 129.9, 129.9, 9.0, 9.5, 141.7).

The Andrews' curve is computed using the reversed order of variables:

    f96(t) = 141.7/√2 + 9.5 sin(t) + 9.0 cos(t) + 129.9 sin(2t) + 129.9 cos(2t) + 215.6 sin(3t).

In Figure 1.21 the curves f96–f105 for observations 96–105 are plotted. Instead of a difference in high frequency, now we have a difference in the intercept, which makes it more difficult for us to see the differences in observations.

This shows that the order of the variables plays an important role for the interpretation. If X is high-dimensional, then the last variables will have only a small visible contribution to


[Plot “Andrews curves (Bank data)”: curves f96–f105 over t; axis residue omitted.]

Figure 1.21. Andrews' curves of the observations 96–105 from the Swiss bank note data. The order of the variables is 6,5,4,3,2,1. MVAandcur2.xpl

the curve. They fall into the high frequency part of the curve. To overcome this problem Andrews suggested using an order which is suggested by Principal Component Analysis. This technique will be treated in detail in Chapter 9. In fact, the sixth variable will appear there as the most important variable for discriminating between the two groups. If the number of observations is more than 20, there may be too many curves in one graph. This will result in an overplotting of curves or a bad “signal-to-ink-ratio”, see Tufte (1983). It is therefore advisable to present multivariate observations via Andrews' curves only for a limited number of observations.

Summary

• Outliers appear as single Andrews' curves that look different from the rest.


Summary (continued)

• A subgroup of data is characterized by a set of similar curves.

• The order of the variables plays an important role for interpretation.

• The order of variables may be optimized by Principal Component Analysis.

• For more than 20 observations we may obtain a bad “signal-to-ink-ratio”, i.e., too many curves are overlaid in one picture.

1.7 Parallel Coordinates Plots

Parallel coordinates plots (PCP) constitute a technique that is based on a non-Cartesian

coordinate system and therefore allows one to “see” more than four dimensions.

Figure 1.22. Parallel coordinates plot of observations 96–105. MVAparcoo1.xpl

Figure 1.23. The entire bank data set. Genuine bank notes are displayed as black lines. The counterfeit bank notes are shown as red lines. MVAparcoo2.xpl

The idea

is simple: Instead of plotting observations in an orthogonal coordinate system, one draws

their coordinates in a system of parallel axes. Index j of the coordinate is mapped onto the

horizontal axis, and the value xj is mapped onto the vertical axis. This way of representation

is very useful for high-dimensional data. It is however also sensitive to the order of the

variables, since certain trends in the data can be shown more clearly in one ordering than in

another.

EXAMPLE 1.4 Take once again the observations 96–105 of the Swiss bank notes. These observations are six-dimensional, so we can't show them in a six-dimensional Cartesian coordinate system. Using the parallel coordinates plot technique, however, they can be plotted on parallel axes. This is shown in Figure 1.22.

We have already noted in Example 1.2 that the diagonal X6 plays an important role. This important role is clearly visible from Figure 1.22: the last coordinate X6 shows two different subgroups. The full bank note data set is displayed in Figure 1.23. One sees an overlap of the coordinate values for indices 1–3 and an increased separability for the indices 4–6.
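Conceptually, after a column-wise min-max rescaling each observation i becomes the polyline connecting the points (j, x̃ij), j = 1, . . . , p, on the parallel axes. A small sketch of this construction (NumPy; not the XploRe code behind the figures):

```python
import numpy as np

def pcp_lines(X):
    """Turn each row of X into a parallel-coordinates polyline:
    the j-th axis sits at horizontal position j, and the min-max
    rescaled value of x_ij gives the height on that axis."""
    X = np.asarray(X, dtype=float)
    scaled = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))
    j = np.arange(1, X.shape[1] + 1)
    return [np.column_stack([j, row]) for row in scaled]

# Two toy observations on four variables.
X = np.array([[214.0, 130.1, 9.0, 141.5],
              [215.0, 129.9, 8.1, 139.6]])
lines = pcp_lines(X)
print(lines[0][:, 0])  # axis positions 1..4 of the first polyline
```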


Summary

• Parallel coordinates plots overcome the visualization problem of the Cartesian coordinate system for dimensions greater than 4.

• Outliers are visible as outlying polygon curves.

• The order of variables is still important, for example, for detection of subgroups.

• Subgroups may be screened by selective coloring in an interactive manner.

1.8 Boston Housing

Aim of the analysis

The Boston Housing data set was analyzed by Harrison and Rubinfeld (1978) who wanted to find out whether “clean air” had an influence on house prices. We will use this data set in

this chapter and in most of the following chapters to illustrate the presented methodology.

The data are described in Appendix B.1.

What can be seen from the PCPs

In order to highlight the relations of X14 to the remaining 13 variables we color all of the observations with X14 > median(X14) as red lines in Figure 1.24. Some of the variables seem

to be strongly related. The most obvious relation is the negative dependence between X13

and X14 . It can also be argued that there exists a strong dependence between X12 and X14

since no red lines are drawn in the lower part of X12 . The opposite can be said about X11 :

there are only red lines plotted in the lower part of this variable. Low values of X11 induce

high values of X14 .

For the PCP, the variables have been rescaled over the interval [0, 1] for better graphical

representations. The PCP shows that the variables are not distributed in a symmetric

manner. It can be clearly seen that the values of X1 and X9 are much more concentrated

around 0. Therefore it makes sense to consider transformations of the original data.


Figure 1.24. Parallel coordinates plot for Boston Housing data.

MVApcphousing.xpl

The scatterplot matrix

One characteristic of the PCPs is that many lines are drawn on top of each other. This

problem is reduced by depicting the variables in pairs of scatterplots. Including all 14

variables in one large scatterplot matrix is possible, but makes it hard to see anything from

the plots. Therefore, for illustrative purposes we will analyze only one such matrix from a subset of the variables in Figure 1.25. On the basis of the PCP and the scatterplot matrix we would like to interpret each of the thirteen variables and their possible relation to the 14th variable. Included in the figure are images for X1–X5 and X14, although each variable is discussed in detail below. All references made to scatterplots in the following refer to Figure 1.25.


Figure 1.25. Scatterplot matrix for variables X1 , . . . , X5 and X14 of the

Boston Housing data. MVAdrafthousing.xpl

Per-capita crime rate X1

Taking the logarithm makes the variable's distribution more symmetric. This can be seen in the boxplot of X̃1 in Figure 1.27, which shows that the median and the mean have moved closer to each other than they were for the original X1. Plotting the kernel density estimate (KDE) of X̃1 = log(X1) would reveal that two subgroups might exist with different mean values. However, taking a look at the scatterplots in Figure 1.26 of the logarithms which include X1 does not clearly reveal such groups. Given that the scatterplot of log(X1) vs. log(X14) shows a relatively strong negative relation, it might be the case that the two subgroups of X1 correspond to houses with two different price levels. This is confirmed by the two boxplots shown to the right of the X1 vs. X2 scatterplot (in Figure 1.25): the red boxplot's shape differs a lot from the black one's, having a much higher median and mean.


Figure 1.26. Scatterplot matrix for the transformed variables X̃1, . . . , X̃5 and X̃14 of the Boston Housing data. MVAdrafthousingt.xpl

Proportion of residential area zoned for large lots X2

It strikes the eye in Figure 1.25 that there is a large cluster of observations for which X2 is

equal to 0. It also strikes the eye that, as the scatterplot of X1 vs. X2 shows, there is a strong, though non-linear, negative relation between X1 and X2: Almost all observations for

which X2 is high have an X1 -value close to zero, and vice versa, many observations for which

X2 is zero have quite a high per-capita crime rate X1 . This could be due to the location of

the areas, e.g., downtown districts might have a higher crime rate and at the same time it

is unlikely that any residential land would be zoned in a generous manner.

As far as the house prices are concerned it can be said that there seems to be no clear (linear)

relation between X2 and X14 , but it is obvious that the more expensive houses are situated

in areas where X2 is large (this can be seen from the two boxplots on the second position of

the diagonal, where the red one has a clearly higher mean/median than the black one).

48 1 Comparison of Batches

Proportion of non-retail business acres X3

The PCP (in Figure 1.24), as well as the scatterplot of X3 vs. X14, shows an obvious negative relation between X3 and X14. The relationship between the logarithms of both variables seems to be almost linear. This negative relation might be explained by the fact that non-retail business sometimes causes annoying sounds and other pollution. Therefore, it seems reasonable to use X3 as an explanatory variable for the prediction of X14 in a linear-regression analysis.

As far as the distribution of X3 is concerned it can be said that the kernel density estimate

of X3 clearly has two peaks, which indicates that there are two subgroups. According to the

negative relation between X3 and X14 it could be the case that one subgroup corresponds to

the more expensive houses and the other one to the cheaper houses.

Charles River dummy variable X4

The observation made from the PCP that there are more expensive houses than cheap houses situated on the banks of the Charles River is confirmed by inspecting the scatterplot matrix. Still, we might have some doubt that the proximity to the river influences the house prices. Looking at the original data set, it becomes clear that the observations for which X4 equals one are districts that are close to each other. Apparently, the Charles River does not flow through too many different districts. Thus, it may be pure coincidence that the more expensive districts are close to the Charles River; their high values might be caused by many other factors such as the pupil/teacher ratio or the proportion of non-retail business acres.

Nitric oxides concentration X5

The scatterplot of X5 vs. X14 and the separate boxplots of X5 for more and less expensive

houses reveal a clear negative relation between the two variables. As it was the main aim of the authors of the original study to determine whether pollution had an influence on housing prices, it should be considered very carefully whether X5 can serve as an explanatory variable for the price X14. A possible reason against it being an explanatory variable is that people

might not like to live in areas where the emissions of nitric oxides are high. Nitric oxides are

emitted mainly by automobiles, by factories and from heating private homes. However, as

one can imagine there are many good reasons besides nitric oxides not to live downtown or in

industrial areas! Noise pollution, for example, might be a much better explanatory variable

for the price of housing units. As the emission of nitric oxides is usually accompanied by

noise pollution, using X5 as an explanatory variable for X14 might lead to the false conclusion

that people run away from nitric oxides, whereas in reality it is noise pollution that they are

trying to escape.

1.8 Boston Housing 49

Average number of rooms per dwelling X6

The number of rooms per dwelling is a possible measure for the size of the houses. Thus we expect X6 to be strongly correlated with X14 (the houses' median price). Indeed, apart from some outliers, the scatterplot of X6 vs. X14 shows a point cloud which is clearly upward-sloping and which seems to be a realisation of a linear dependence of X14 on X6. The two boxplots of X6 confirm this notion by showing that the quartiles, the mean and the median are all much higher for the red than for the black boxplot.

Proportion of owner-occupied units built prior to 1940 X7

There is no clear connection visible between X7 and X14 . There could be a weak negative

correlation between the two variables, since the (red) boxplot of X7 for the districts whose

price is above the median price indicates a lower mean and median than the (black) boxplot

for the district whose price is below the median price. The fact that the correlation is not so clear could be explained by two opposing effects. On the one hand, house prices should decrease if the older houses are not in a good shape. On the other hand, prices could increase, because people often like older houses better than newer houses, preferring their atmosphere of space and tradition. Nevertheless, it seems reasonable that the houses' age has an influence on their price X14.

Raising X7 to the power of 2.5 reveals again that the data set might consist of two subgroups.

But in this case it is not obvious that the subgroups correspond to more expensive or cheaper

houses. One can furthermore observe a negative relation between X7 and X8. This could reflect the way the Boston metropolitan area developed over time: the districts with the newer buildings are farther away from employment centres with industrial facilities.

Weighted distance to five Boston employment centres X8

Since most people like to live close to their place of work, we expect a negative relation between the distances to the employment centres and the houses' price. The scatterplot hardly reveals any dependence, but the boxplots of X8 indicate that there might be a slightly positive relation, as the red boxplot's median and mean are higher than the black one's. Again, there might be two effects in opposite directions at work. The first is that living too close to an employment centre might not provide enough shelter from the pollution created there. The second, as mentioned above, is that people do not travel very far to their workplace.


Index of accessibility to radial highways X9

The first obvious thing one can observe in the scatterplots, as well as in the histograms and the kernel density estimates, is that there are two subgroups of districts containing X9 values which are close to the respective group's mean. The scatterplots deliver no hint as to what might explain the occurrence of these two subgroups. The boxplots indicate that for the cheaper and for the more expensive houses the average of X9 is almost the same.

Full-value property tax X10

X10 shows a behavior similar to that of X9: two subgroups exist. A downward-sloping curve seems to underlie the relation of X10 and X14. This is confirmed by the two boxplots drawn for X10: the red one has a lower mean and median than the black one.

Pupil/teacher ratio X11

The red and black boxplots of X11 indicate a negative relation between X11 and X14. This is confirmed by inspection of the scatterplot of X11 vs. X14: the point cloud is downward sloping, i.e., the fewer teachers there are per pupil, the less people pay on median for their dwellings.

Proportion of blacks B, X12 = 1000(B − 0.63)² I(B < 0.63)

Interestingly, X12 is negatively (though not linearly) correlated with X3, X7 and X11, whereas it is positively related with X14. Having a look at the data set reveals that for almost all districts X12 takes on a value around 390. Since B cannot be larger than 0.63, such values can only be caused by B close to zero. Therefore, the higher X12 is, the lower the actual proportion of blacks is! Among observations 405 through 470 there are quite a few that have an X12 that is much lower than 390. This means that in these districts the proportion of blacks is above zero. We can observe two clusters of points in the scatterplots of log(X12): one cluster for which X12 is close to 390 and a second one for which X12 is between 3 and 100. When X12 is positively related with another variable, the actual proportion of blacks is negatively correlated with this variable and vice versa. This means that blacks live in areas where there is a high proportion of non-retail business acres, where there are older houses and where there is a high (i.e., bad) pupil/teacher ratio. It can be observed that districts with housing prices above the median can only be found where the proportion of blacks is virtually zero!
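The coding of X12 given in the heading is easy to verify numerically; a small sketch (function name hypothetical, the threshold and scaling come from the definition above):

```python
def x12_from_b(b):
    """Harrison-Rubinfeld coding: X12 = 1000 * (b - 0.63)^2 when b < 0.63,
    and 0 otherwise (the indicator I(B < 0.63))."""
    return 1000.0 * (b - 0.63) ** 2 if b < 0.63 else 0.0

# b near zero yields values close to 390, the level seen in most districts.
print(round(x12_from_b(0.005), 1))  # 390.6
print(x12_from_b(0.63))             # 0.0
```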


Proportion of lower status of the population X13

Of all the variables, X13 exhibits the clearest negative relation with X14; hardly any outliers show up. Taking the square root of X13 and the logarithm of X14 transforms the relation into a linear one.

Transformations

Since most of the variables exhibit an asymmetry with a higher density on the left side, the

following transformations are proposed:

    X̃1 = log(X1)
    X̃2 = X2/10
    X̃3 = log(X3)
    X̃4 none, since X4 is binary
    X̃5 = log(X5)
    X̃6 = log(X6)
    X̃7 = X7^2.5/10000
    X̃8 = log(X8)
    X̃9 = log(X9)
    X̃10 = log(X10)
    X̃11 = exp(0.4 · X11)/1000
    X̃12 = X12/100
    X̃13 = √X13
    X̃14 = log(X14)

Taking the logarithm or raising the variables to the power of something smaller than one helps

to reduce the asymmetry. This is due to the fact that lower values move further away from

each other, whereas the distance between greater values is reduced by these transformations.
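This asymmetry-reducing effect can be checked with the moment-based skewness coefficient; a sketch on simulated right-skewed data (not the Boston Housing data themselves):

```python
import numpy as np

def skewness(x):
    """Third standardized moment: positive for a long right tail."""
    x = np.asarray(x, dtype=float)
    z = (x - x.mean()) / x.std()
    return (z ** 3).mean()

# A right-skewed toy sample, similar in shape to the crime rate X1.
rng = np.random.default_rng(1)
x = rng.lognormal(mean=0.0, sigma=1.0, size=2000)

print(skewness(x) > skewness(np.log(x)))  # the log transform reduces skewness
```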

Figure 1.27 displays boxplots for the original mean-variance scaled variables as well as for the proposed transformed variables. The transformed variables' boxplots are more symmetric and have fewer outliers than the original variables' boxplots.


[Panels titled “Boston Housing data” and “Transformed Boston Housing data”.]

Figure 1.27. Boxplots for all of the variables from the Boston Housing data before and after the proposed transformations. MVAboxbhd.xpl

1.9 Exercises

EXERCISE 1.1 Is the upper extreme always an outlier?

EXERCISE 1.2 Is it possible for the mean or the median to lie outside of the fourths or

even outside of the outside bars?

EXERCISE 1.3 Assume that the data are normally distributed N (0, 1). What percentage of

the data do you expect to lie outside the outside bars?

EXERCISE 1.4 What percentage of the data do you expect to lie outside the outside bars if we assume that the data are normally distributed N(0, σ²) with unknown variance σ²?


EXERCISE 1.5 How would the five-number summary of the 15 largest U.S. cities differ from that of the 50 largest U.S. cities? How would the five-number summary of 15 observations of N(0, 1)-distributed data differ from that of 50 observations from the same distribution?

EXERCISE 1.6 Is it possible that all five numbers of the five-number summary could be equal? If so, under what conditions?

EXERCISE 1.7 Suppose we have 50 observations of X ∼ N(0, 1) and another 50 observations of Y ∼ N(2, 1). What would the 100 Flury faces look like if you had defined as face elements the face line and the darkness of hair? Do you expect any similar faces? How many faces do you think should look like observations of Y even though they are X observations?

EXERCISE 1.8 Draw a histogram for the mileage variable of the car data (Table B.3). Do

the same for the three groups (U.S., Japan, Europe). Do you obtain a similar conclusion as in the parallel boxplot in Figure 1.3 for these data?

EXERCISE 1.9 Use some bandwidth selection criterion to calculate the optimally chosen

bandwidth h for the diagonal variable of the bank notes. Would it be better to have one

bandwidth for the two groups?

EXERCISE 1.10 In Figure 1.9 the densities overlap in the region of diagonal ≈ 140.4. We partially observed this in the boxplot of Figure 1.4. Our aim is to separate the two groups. Will we be able to do this effectively on the basis of this diagonal variable alone?

EXERCISE 1.11 Draw a parallel coordinates plot for the car data.

EXERCISE 1.12 How would you identify discrete variables (variables with only a limited

number of possible outcomes) on a parallel coordinates plot?

EXERCISE 1.13 True or false: the height of the bars of a histogram are equal to the relative

frequency with which observations fall into the respective bins.

EXERCISE 1.14 True or false: kernel density estimates must always take on a value between

0 and 1. (Hint: Which quantity connected with the density function has to be equal to 1?

Does this property imply that the density function has to always be less than 1?)

EXERCISE 1.15 Let the following data set represent the heights of 13 students taking the

Applied Multivariate Statistical Analysis course:

1.72, 1.83, 1.74, 1.79, 1.94, 1.81, 1.66, 1.60, 1.78, 1.77, 1.85, 1.70, 1.76.


1. Find the corresponding five-number summary.

2. Construct the boxplot.

3. Draw a histogram for this data set.

EXERCISE 1.16 Describe the unemployment data (see Table B.19) that contain unemploy-

ment rates of all German Federal States using various descriptive techniques.

EXERCISE 1.17 Using yearly population data (see B.20), generate

1. a boxplot (choose one of variables)

2. an Andrews' curve (choose ten data points)

3. a scatterplot

4. a histogram (choose one of the variables)

What do these graphs tell you about the data and their structure?

EXERCISE 1.18 Make a draftman plot for the car data with the variables

X1 = price,

X2 = mileage,

X8 = weight,

X9 = length.

Move the brush into the region of heavy cars. What can you say about price, mileage and length? Move the brush onto high fuel economy. Mark the Japanese, European and U.S. American cars. You should find the same condition as in the boxplot of Figure 1.3.

EXERCISE 1.19 What is the form of a scatterplot of two independent random variables X1

and X2 with standard Normal distribution?

EXERCISE 1.20 Rotate a three-dimensional standard normal point cloud in 3D space. Does

it “almost look the same from all sides”? Can you explain why or why not?

Part II

Multivariate Random Variables

2 A Short Excursion into Matrix Algebra

This chapter is a reminder of basic concepts of matrix algebra, which are particularly useful

in multivariate analysis. It also introduces the notations used in this book for vectors and

matrices. Eigenvalues and eigenvectors play an important role in multivariate techniques.

In Sections 2.2 and 2.3, we present the spectral decomposition of matrices and consider the

maximization (minimization) of quadratic forms given some constraints.

In analyzing the multivariate normal distribution, partitioned matrices appear naturally.

Some of the basic algebraic properties are given in Section 2.5. These properties will be

heavily used in Chapters 4 and 5.

The geometry of the multinormal distribution and the geometric interpretation of the multivariate techniques (Part III) make intensive use of the notion of the angle between two vectors, the projection of a point onto a vector, and the distance between two points. These ideas are introduced in Section 2.6.

2.1 Elementary Operations

A matrix A is a system of numbers with n rows and p columns:

\[
A = \begin{pmatrix}
a_{11} & a_{12} & \ldots & a_{1p} \\
a_{21} & a_{22} & \ldots & a_{2p} \\
\vdots & \vdots & \ddots & \vdots \\
a_{n1} & a_{n2} & \ldots & a_{np}
\end{pmatrix}.
\]

We also write $(a_{ij})$ for $A$ and $A(n \times p)$ to indicate the numbers of rows and columns. Vectors are matrices with one column and are denoted as $x$ or $x(p \times 1)$. Special matrices and vectors are defined in Table 2.1. Note that we use small letters for scalars as well as for vectors.


Matrix Operations

Elementary operations are summarized below:

\begin{align*}
A^\top &= (a_{ji}) \\
A + B &= (a_{ij} + b_{ij}) \\
A - B &= (a_{ij} - b_{ij}) \\
c \cdot A &= (c \cdot a_{ij}) \\
A \cdot B &= A(n \times p)\, B(p \times m) = C(n \times m) = \Bigl( \sum_{j=1}^{p} a_{ij} b_{jk} \Bigr).
\end{align*}

Properties of Matrix Operations

\begin{align*}
A + B &= B + A \\
A(B + C) &= AB + AC \\
A(BC) &= (AB)C \\
(A^\top)^\top &= A \\
(AB)^\top &= B^\top A^\top
\end{align*}

Matrix Characteristics

Rank

The rank, $\operatorname{rank}(A)$, of a matrix $A(n \times p)$ is defined as the maximum number of linearly independent rows (columns). A set of $k$ rows $a_j$ of $A(n \times p)$ is said to be linearly independent if $\sum_{j=1}^{k} c_j a_j = 0_p$ implies $c_j = 0$, $\forall j$, where $c_1, \ldots, c_k$ are scalars. In other words, no row in this set can be expressed as a linear combination of the $(k - 1)$ remaining rows.

Trace

The trace of a matrix $A(p \times p)$ is the sum of its diagonal elements
\[
\operatorname{tr}(A) = \sum_{i=1}^{p} a_{ii}.
\]
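Both characteristics are available directly in numpy; a small sketch (the matrix is a made-up example with one linearly dependent row):

```python
import numpy as np

A = np.array([[1.0, 2.0, 3.0],
              [2.0, 4.0, 6.0],   # = 2 * first row, so linearly dependent
              [0.0, 1.0, 1.0]])

r = np.linalg.matrix_rank(A)  # maximum number of linearly independent rows
t = np.trace(A)               # a_11 + a_22 + a_33 = 1 + 4 + 1 = 6
```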


\begin{tabular}{llll}
Name & Definition & Notation & Example \\
\hline
scalar & $p = n = 1$ & $a$ & $3$ \\
column vector & $p = 1$ & $a$ & $\begin{pmatrix} 1 \\ 3 \end{pmatrix}$ \\
row vector & $n = 1$ & $a^\top$ & $(1 \;\; 3)$ \\
vector of ones & $(1, \ldots, 1)^\top$ & $1_n$ & $\begin{pmatrix} 1 \\ 1 \end{pmatrix}$ \\
vector of zeros & $(0, \ldots, 0)^\top$ & $0_n$ & $\begin{pmatrix} 0 \\ 0 \end{pmatrix}$ \\
square matrix & $n = p$ & $A(p \times p)$ & $\begin{pmatrix} 2 & 0 \\ 0 & 2 \end{pmatrix}$ \\
diagonal matrix & $a_{ij} = 0$, $i \neq j$, $n = p$ & $\operatorname{diag}(a_{ii})$ & $\begin{pmatrix} 1 & 0 \\ 0 & 2 \end{pmatrix}$ \\
identity matrix & $\operatorname{diag}(1, \ldots, 1)$ & $I_p$ & $\begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix}$ \\
unit matrix & $a_{ij} \equiv 1$, $n = p$ & $1_n 1_n^\top$ & $\begin{pmatrix} 1 & 1 \\ 1 & 1 \end{pmatrix}$ \\
symmetric matrix & $a_{ij} = a_{ji}$ & & $\begin{pmatrix} 1 & 2 \\ 2 & 3 \end{pmatrix}$ \\
null matrix & $a_{ij} = 0$ & $0$ & $\begin{pmatrix} 0 & 0 \\ 0 & 0 \end{pmatrix}$ \\
upper triangular matrix & $a_{ij} = 0$, $i > j$ & & $\begin{pmatrix} 1 & 2 & 4 \\ 0 & 1 & 3 \\ 0 & 0 & 1 \end{pmatrix}$ \\
idempotent matrix & $AA = A$ & & $\begin{pmatrix} 1 & 0 & 0 \\ 0 & \frac12 & \frac12 \\ 0 & \frac12 & \frac12 \end{pmatrix}$ \\
orthogonal matrix & $A^\top A = I = A A^\top$ & & $\begin{pmatrix} \frac{\sqrt 2}{2} & \frac{\sqrt 2}{2} \\ \frac{\sqrt 2}{2} & -\frac{\sqrt 2}{2} \end{pmatrix}$ \\
\end{tabular}

Table 2.1. Special matrices and vectors.

Determinant

The determinant is an important concept of matrix algebra. For a square matrix $A(p \times p)$, it is defined as
\[
\det(A) = |A| = \sum_{\tau} (-1)^{|\tau|} a_{1\tau(1)} \cdots a_{p\tau(p)},
\]
where the summation is over all permutations $\tau$ of $\{1, 2, \ldots, p\}$, and $|\tau| = 0$ if the permutation can be written as a product of an even number of transpositions and $|\tau| = 1$ otherwise.


EXAMPLE 2.1 In the case of $p = 2$, $A = \begin{pmatrix} a_{11} & a_{12} \\ a_{21} & a_{22} \end{pmatrix}$ and we can permute the digits "1" and "2" once or not at all. So,
\[
|A| = a_{11} a_{22} - a_{12} a_{21}.
\]
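The permutation-sum definition can be implemented directly for small $p$. The sketch below is illustrative only (in practice `np.linalg.det` should be used); the helper name and the test matrix are my own:

```python
import numpy as np
from itertools import permutations

def det_by_permutations(A):
    """Determinant via the permutation-sum definition (only sensible for tiny p)."""
    p = A.shape[0]
    total = 0.0
    for tau in permutations(range(p)):
        # Sign of tau: (-1) raised to the number of inversions.
        inversions = sum(1 for i in range(p) for j in range(i + 1, p)
                         if tau[i] > tau[j])
        prod = 1.0
        for i in range(p):
            prod *= A[i, tau[i]]
        total += (-1) ** inversions * prod
    return total

A = np.array([[1.0, 2.0], [2.0, 3.0]])
# For p = 2 the sum reduces to a11*a22 - a12*a21 = 3 - 4 = -1.
d = det_by_permutations(A)
```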

Transpose

For $A(n \times p)$ and $B(p \times n)$
\[
(A^\top)^\top = A, \quad \text{and} \quad (AB)^\top = B^\top A^\top.
\]

Inverse

If $|A| \neq 0$ and $A(p \times p)$, then the inverse $A^{-1}$ exists:
\[
A\, A^{-1} = A^{-1} A = I_p.
\]
For small matrices, the inverse of $A = (a_{ij})$ can be calculated as
\[
A^{-1} = \frac{C}{|A|},
\]
where $C = (c_{ij})$ is the adjoint matrix of $A$. The elements $c_{ji}$ of $C$ are the co-factors of $A$:
\[
c_{ji} = (-1)^{i+j}
\begin{vmatrix}
a_{11} & \ldots & a_{1(j-1)} & a_{1(j+1)} & \ldots & a_{1p} \\
\vdots & & & & & \vdots \\
a_{(i-1)1} & \ldots & a_{(i-1)(j-1)} & a_{(i-1)(j+1)} & \ldots & a_{(i-1)p} \\
a_{(i+1)1} & \ldots & a_{(i+1)(j-1)} & a_{(i+1)(j+1)} & \ldots & a_{(i+1)p} \\
\vdots & & & & & \vdots \\
a_{p1} & \ldots & a_{p(j-1)} & a_{p(j+1)} & \ldots & a_{pp}
\end{vmatrix}.
\]
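The cofactor construction can be sketched in numpy as follows. This is illustrative only (for real work `np.linalg.inv` is the practical route); the function name and example matrix are my own:

```python
import numpy as np

def cofactor_inverse(A):
    """Inverse via the adjoint matrix C / |A|, following the cofactor formula."""
    p = A.shape[0]
    C = np.zeros_like(A, dtype=float)
    for i in range(p):
        for j in range(p):
            # Minor: delete row i and column j, then take the determinant.
            minor = np.delete(np.delete(A, i, axis=0), j, axis=1)
            C[j, i] = (-1) ** (i + j) * np.linalg.det(minor)  # element c_ji
    return C / np.linalg.det(A)

A = np.array([[1.0, 2.0], [3.0, 4.0]])
Ainv = cofactor_inverse(A)
```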

G-inverse

A more general concept is the G-inverse (generalized inverse) $A^-$, which satisfies the following:
\[
A\, A^-\, A = A.
\]
Later we will see that there may be more than one G-inverse.


EXAMPLE 2.2 The generalized inverse can also be calculated for singular matrices. We have:
\[
\begin{pmatrix} 1 & 0 \\ 0 & 0 \end{pmatrix}
\begin{pmatrix} 1 & 0 \\ 0 & 0 \end{pmatrix}
\begin{pmatrix} 1 & 0 \\ 0 & 0 \end{pmatrix}
=
\begin{pmatrix} 1 & 0 \\ 0 & 0 \end{pmatrix},
\]
which means that the generalized inverse of $A = \begin{pmatrix} 1 & 0 \\ 0 & 0 \end{pmatrix}$ is $A^- = \begin{pmatrix} 1 & 0 \\ 0 & 0 \end{pmatrix}$, even though the inverse matrix of $A$ does not exist in this case.
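A numpy check of this example (my own sketch, not from the book). Note that `np.linalg.pinv` computes the Moore-Penrose pseudoinverse, which is one particular G-inverse:

```python
import numpy as np

# The singular matrix from Example 2.2 and the G-inverse given there.
A = np.array([[1.0, 0.0], [0.0, 0.0]])
Aminus = np.array([[1.0, 0.0], [0.0, 0.0]])

# A G-inverse only has to satisfy A A^- A = A; it need not be unique.
assert np.allclose(A @ Aminus @ A, A)

# The Moore-Penrose pseudoinverse is one particular G-inverse.
Apinv = np.linalg.pinv(A)
assert np.allclose(A @ Apinv @ A, A)
```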

Eigenvalues, Eigenvectors

Consider a $(p \times p)$ matrix $A$. If there exist a scalar $\lambda$ and a vector $\gamma$ such that
\[
A\gamma = \lambda\gamma, \tag{2.1}
\]
then we call

$\lambda$ an eigenvalue,

$\gamma$ an eigenvector.

It can be proven that an eigenvalue $\lambda$ is a root of the $p$-th order polynomial $|A - \lambda I_p| = 0$. Therefore, there are up to $p$ eigenvalues $\lambda_1, \lambda_2, \ldots, \lambda_p$ of $A$. For each eigenvalue $\lambda_j$, there exists a corresponding eigenvector $\gamma_j$ given by equation (2.1). Suppose the matrix $A$ has the eigenvalues $\lambda_1, \ldots, \lambda_p$. Let $\Lambda = \operatorname{diag}(\lambda_1, \ldots, \lambda_p)$.

The determinant $|A|$ and the trace $\operatorname{tr}(A)$ can be rewritten in terms of the eigenvalues:
\[
|A| = |\Lambda| = \prod_{j=1}^{p} \lambda_j \tag{2.2}
\]
\[
\operatorname{tr}(A) = \operatorname{tr}(\Lambda) = \sum_{j=1}^{p} \lambda_j. \tag{2.3}
\]
An idempotent matrix $A$ (see the definition in Table 2.1) can only have eigenvalues in $\{0, 1\}$; therefore $\operatorname{tr}(A) = \operatorname{rank}(A) = $ number of eigenvalues $\neq 0$.

EXAMPLE 2.3 Let us consider the matrix
\[
A = \begin{pmatrix} 1 & 0 & 0 \\ 0 & \frac12 & \frac12 \\ 0 & \frac12 & \frac12 \end{pmatrix}.
\]
It is easy to verify that $AA = A$, which implies that the matrix $A$ is idempotent.

We know that the eigenvalues of an idempotent matrix are equal to 0 or 1. In this case, the eigenvalues of $A$ are $\lambda_1 = 1$, $\lambda_2 = 1$, and $\lambda_3 = 0$, since
\[
\begin{pmatrix} 1 & 0 & 0 \\ 0 & \frac12 & \frac12 \\ 0 & \frac12 & \frac12 \end{pmatrix}
\begin{pmatrix} 1 \\ 0 \\ 0 \end{pmatrix}
= 1 \begin{pmatrix} 1 \\ 0 \\ 0 \end{pmatrix},
\quad
\begin{pmatrix} 1 & 0 & 0 \\ 0 & \frac12 & \frac12 \\ 0 & \frac12 & \frac12 \end{pmatrix}
\begin{pmatrix} 0 \\ \frac{\sqrt 2}{2} \\ \frac{\sqrt 2}{2} \end{pmatrix}
= 1 \begin{pmatrix} 0 \\ \frac{\sqrt 2}{2} \\ \frac{\sqrt 2}{2} \end{pmatrix},
\quad \text{and} \quad
\begin{pmatrix} 1 & 0 & 0 \\ 0 & \frac12 & \frac12 \\ 0 & \frac12 & \frac12 \end{pmatrix}
\begin{pmatrix} 0 \\ \frac{\sqrt 2}{2} \\ -\frac{\sqrt 2}{2} \end{pmatrix}
= 0 \begin{pmatrix} 0 \\ \frac{\sqrt 2}{2} \\ -\frac{\sqrt 2}{2} \end{pmatrix}.
\]


Using formulas (2.2) and (2.3), we can calculate the trace and the determinant of $A$ from the eigenvalues: $\operatorname{tr}(A) = \lambda_1 + \lambda_2 + \lambda_3 = 2$, $|A| = \lambda_1 \lambda_2 \lambda_3 = 0$, and $\operatorname{rank}(A) = 2$.
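The computations of Example 2.3 can be reproduced numerically (my own numpy sketch; `np.linalg.eigvalsh` applies because $A$ is symmetric):

```python
import numpy as np

# The idempotent matrix from Example 2.3.
A = np.array([[1.0, 0.0, 0.0],
              [0.0, 0.5, 0.5],
              [0.0, 0.5, 0.5]])
assert np.allclose(A @ A, A)      # idempotent: AA = A

lam = np.linalg.eigvalsh(A)       # eigenvalues of the symmetric matrix A
tr = np.trace(A)                  # = sum of eigenvalues = 2, by (2.3)
det = np.linalg.det(A)            # = product of eigenvalues = 0, by (2.2)
rank = np.linalg.matrix_rank(A)   # = number of nonzero eigenvalues = 2
```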

Properties of Matrix Characteristics

$A(n \times n)$, $B(n \times n)$, $c \in \mathbb{R}$:
\begin{align}
\operatorname{tr}(A + B) &= \operatorname{tr} A + \operatorname{tr} B \tag{2.4} \\
\operatorname{tr}(cA) &= c \operatorname{tr} A \tag{2.5} \\
|cA| &= c^n |A| \tag{2.6} \\
|AB| &= |BA| = |A|\,|B| \tag{2.7}
\end{align}

$A(n \times p)$, $B(p \times n)$:
\begin{align}
\operatorname{tr}(A \cdot B) &= \operatorname{tr}(B \cdot A) \tag{2.8} \\
\operatorname{rank}(A) &\le \min(n, p) \nonumber \\
\operatorname{rank}(A) &\ge 0 \tag{2.9} \\
\operatorname{rank}(A) &= \operatorname{rank}(A^\top) \tag{2.10} \\
\operatorname{rank}(A^\top A) &= \operatorname{rank}(A) \tag{2.11} \\
\operatorname{rank}(A + B) &\le \operatorname{rank}(A) + \operatorname{rank}(B) \tag{2.12} \\
\operatorname{rank}(AB) &\le \min\{\operatorname{rank}(A), \operatorname{rank}(B)\} \tag{2.13}
\end{align}

$A(n \times p)$, $B(p \times q)$, $C(q \times n)$:
\begin{align}
\operatorname{tr}(ABC) &= \operatorname{tr}(BCA) = \operatorname{tr}(CAB) \tag{2.14} \\
\operatorname{rank}(ABC) &= \operatorname{rank}(B) \quad \text{for nonsingular } A, C \tag{2.15}
\end{align}

$A(p \times p)$:
\begin{align}
|A^{-1}| &= |A|^{-1} \tag{2.16} \\
\operatorname{rank}(A) &= p \quad \text{if and only if } A \text{ is nonsingular.} \tag{2.17}
\end{align}

Summary

- The determinant $|A|$ is the product of the eigenvalues of $A$.

- The inverse of a matrix $A$ exists if $|A| \neq 0$.


Summary (continued)

- The trace $\operatorname{tr}(A)$ is the sum of the eigenvalues of $A$.

- The sum of the traces of two matrices equals the trace of the sum of the two matrices.

- The trace $\operatorname{tr}(AB)$ equals $\operatorname{tr}(BA)$.

- The rank $\operatorname{rank}(A)$ is the maximal number of linearly independent rows (columns) of $A$.

2.2 Spectral Decompositions

The computation of eigenvalues and eigenvectors is an important issue in the analysis of

matrices. The spectral decomposition or Jordan decomposition links the structure of a

matrix to the eigenvalues and the eigenvectors.

THEOREM 2.1 (Jordan Decomposition) Each symmetric matrix $A(p \times p)$ can be written as
\[
A = \Gamma \Lambda \Gamma^\top = \sum_{j=1}^{p} \lambda_j \gamma_j \gamma_j^\top \tag{2.18}
\]
where
\[
\Lambda = \operatorname{diag}(\lambda_1, \ldots, \lambda_p)
\]
and where
\[
\Gamma = (\gamma_1, \gamma_2, \ldots, \gamma_p)
\]
is an orthogonal matrix consisting of the eigenvectors $\gamma_j$ of $A$.

EXAMPLE 2.4 Suppose that $A = \begin{pmatrix} 1 & 2 \\ 2 & 3 \end{pmatrix}$. The eigenvalues are found by solving $|A - \lambda I| = 0$. This is equivalent to
\[
\begin{vmatrix} 1 - \lambda & 2 \\ 2 & 3 - \lambda \end{vmatrix}
= (1 - \lambda)(3 - \lambda) - 4 = 0.
\]
Hence, the eigenvalues are $\lambda_1 = 2 + \sqrt 5$ and $\lambda_2 = 2 - \sqrt 5$. The eigenvectors are $\gamma_1 = (0.5257, 0.8506)^\top$ and $\gamma_2 = (0.8506, -0.5257)^\top$. They are orthogonal since $\gamma_1^\top \gamma_2 = 0$.
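Example 2.4 can be checked with numpy (my own sketch): `np.linalg.eigh` is designed for symmetric matrices and returns the eigenvalues in ascending order together with an orthogonal matrix of eigenvectors, which is exactly the Jordan decomposition:

```python
import numpy as np

# The symmetric matrix from Example 2.4.
A = np.array([[1.0, 2.0], [2.0, 3.0]])

# eigh: eigenvalues in ascending order, orthonormal eigenvectors as columns,
# so A = Gamma @ diag(lam) @ Gamma.T.
lam, Gamma = np.linalg.eigh(A)

assert np.allclose(Gamma @ np.diag(lam) @ Gamma.T, A)  # Jordan decomposition
assert np.allclose(Gamma.T @ Gamma, np.eye(2))         # Gamma is orthogonal
```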

Using spectral decomposition, we can define powers of a matrix $A(p \times p)$. Suppose $A$ is a symmetric matrix. Then by Theorem 2.1
\[
A = \Gamma \Lambda \Gamma^\top,
\]
and we define for some $\alpha \in \mathbb{R}$
\[
A^\alpha = \Gamma \Lambda^\alpha \Gamma^\top, \tag{2.19}
\]
where $\Lambda^\alpha = \operatorname{diag}(\lambda_1^\alpha, \ldots, \lambda_p^\alpha)$. In particular, we can easily calculate the inverse of the matrix $A$. Suppose that the eigenvalues of $A$ are positive. Then with $\alpha = -1$, we obtain the inverse of $A$ from
\[
A^{-1} = \Gamma \Lambda^{-1} \Gamma^\top. \tag{2.20}
\]
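The power construction (2.19) translates directly into code. A minimal numpy sketch, assuming a symmetric matrix with positive eigenvalues (the function name and test matrix are my own):

```python
import numpy as np

def matrix_power_sym(A, alpha):
    """A^alpha = Gamma Lambda^alpha Gamma^T for symmetric A with positive eigenvalues."""
    lam, Gamma = np.linalg.eigh(A)
    return Gamma @ np.diag(lam ** alpha) @ Gamma.T

A = np.array([[2.0, 1.0], [1.0, 2.0]])  # symmetric, eigenvalues 1 and 3 (both > 0)

Ainv = matrix_power_sym(A, -1.0)        # coincides with the usual inverse, as in (2.20)
Ahalf = matrix_power_sym(A, 0.5)        # a symmetric square root of A
```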

Another interesting decomposition, which is used later, is given in the following theorem.

THEOREM 2.2 (Singular Value Decomposition) Each matrix $A(n \times p)$ with rank $r$ can be decomposed as
\[
A = \Gamma \Lambda \Delta^\top,
\]
where $\Gamma(n \times r)$ and $\Delta(p \times r)$. Both $\Gamma$ and $\Delta$ are column orthonormal, i.e., $\Gamma^\top \Gamma = \Delta^\top \Delta = I_r$, and $\Lambda = \operatorname{diag}\bigl(\lambda_1^{1/2}, \ldots, \lambda_r^{1/2}\bigr)$, $\lambda_j > 0$. The values $\lambda_1, \ldots, \lambda_r$ are the non-zero eigenvalues of the matrices $AA^\top$ and $A^\top A$. $\Gamma$ and $\Delta$ consist of the corresponding $r$ eigenvectors of these matrices.

This is obviously a generalization of Theorem 2.1 (Jordan decomposition). With Theorem 2.2, we can find a G-inverse $A^-$ of $A$. Indeed, define $A^- = \Delta \Lambda^{-1} \Gamma^\top$. Then $A\, A^-\, A = \Gamma \Lambda \Delta^\top = A$. Note that the G-inverse is not unique.

EXAMPLE 2.5 In Example 2.2, we showed that the generalized inverse of $A = \begin{pmatrix} 1 & 0 \\ 0 & 0 \end{pmatrix}$ is $A^- = \begin{pmatrix} 1 & 0 \\ 0 & 0 \end{pmatrix}$. The following also holds:
\[
\begin{pmatrix} 1 & 0 \\ 0 & 0 \end{pmatrix}
\begin{pmatrix} 1 & 0 \\ 0 & 8 \end{pmatrix}
\begin{pmatrix} 1 & 0 \\ 0 & 0 \end{pmatrix}
=
\begin{pmatrix} 1 & 0 \\ 0 & 0 \end{pmatrix},
\]
which means that the matrix $\begin{pmatrix} 1 & 0 \\ 0 & 8 \end{pmatrix}$ is also a generalized inverse of $A$.
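The SVD-based construction of a G-inverse can be sketched in numpy (my own illustration on the matrix of Examples 2.2 and 2.5; note `np.linalg.svd` returns singular values rather than eigenvalues, i.e. the $\lambda_j^{1/2}$ of Theorem 2.2):

```python
import numpy as np

A = np.array([[1.0, 0.0], [0.0, 0.0]])

# Full SVD: A = U @ diag(s) @ Vt with singular values s >= 0.
U, s, Vt = np.linalg.svd(A)

# Keep only the r nonzero singular values (here rank r = 1).
r = int(np.sum(s > 1e-12))
Gamma = U[:, :r]       # n x r, column orthonormal
Delta = Vt[:r, :].T    # p x r, column orthonormal
Lam = np.diag(s[:r])   # diag(lambda_1^{1/2}, ..., lambda_r^{1/2})

# G-inverse as in the text: A^- = Delta Lambda^{-1} Gamma^T.
Aminus = Delta @ np.linalg.inv(Lam) @ Gamma.T
assert np.allclose(A @ Aminus @ A, A)
```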

Summary

- The Jordan decomposition gives a representation of a symmetric matrix in terms of eigenvalues and eigenvectors.
