Figure 1.14. Draftman plot of the bank notes. The pictures in the left column
show (X3, X4), (X3, X5) and (X3, X6), in the middle we have (X4, X5) and
(X4, X6), and in the lower right is (X5, X6). The upper right half contains
the corresponding density contour plots. MVAdrafbank4.xpl

or mouse over the screen. Inside the brush we can highlight or color observations. Suppose
the technique is installed in such a way that as we move the brush in one scatter, the
corresponding observations in the other scatters are also highlighted. By moving the brush,
we can study conditional dependence.
If we brush (i.e., highlight or color the observation with the brush) the X5 vs. X6 plot
and move through the upper point cloud, we see that in other plots (e.g., X3 vs. X4 ), the
corresponding observations are more embedded in the other sub-cloud.
34 1 Comparison of Batches

Summary
- Scatterplots in two and three dimensions help in identifying separated
  points, outliers or sub-clusters.
- Scatterplots help us in judging positive or negative dependencies.
- Draftman scatterplot matrices help detect structures conditioned on values
  of other variables.
- As the brush of a scatterplot matrix moves through a point cloud, we can
  study conditional dependence.

1.5 Chernoff-Flury Faces
If we are given data in numerical form, we tend to display it also numerically. This was
done in the preceding sections: an observation x1 = (1, 2) was plotted as the point (1, 2) in a
two-dimensional coordinate system. In multivariate analysis we want to understand data in
low dimensions (e.g., on a 2D computer screen) although the structures are hidden in high
dimensions. The numerical display of data structures using coordinates therefore ends at
dimensions greater than three.
If we are interested in condensing a structure into 2D elements, we have to consider
alternative graphical techniques. The Chernoff-Flury faces, for example, provide such a
condensation of high-dimensional information into a simple “face”. In fact faces are a simple
way to graphically display high-dimensional data. The sizes of face elements like pupils, eyes,
upper and lower hair line, etc., are assigned to certain variables. The idea of using faces goes
back to Chernoff (1973) and has been further developed by Bernhard Flury. We follow the
design described in Flury and Riedwyl (1988), which uses the following characteristics.

1 right eye size
2 right pupil size
3 position of right pupil
4 right eye slant
5 horizontal position of right eye
6 vertical position of right eye
7 curvature of right eyebrow
8 density of right eyebrow
9 horizontal position of right eyebrow
10 vertical position of right eyebrow
11 right upper hair line
12 right lower hair line
13 right face line
14 darkness of right hair
15 right hair slant
16 right nose line
17 right size of mouth
18 right curvature of mouth
19–36 like 1–18, only for the left side.

Figure 1.15. Chernoff-Flury faces for observations 91 to 110 of the bank
notes. MVAfacebank10.xpl

First, every variable that is to be coded into a characteristic face element is transformed
into a (0, 1) scale, i.e., the minimum of the variable corresponds to 0 and the maximum to
1. The extreme positions of the face elements therefore correspond to a certain “grin” or
“happy” face element. Dark hair might be coded as 1, and blond hair as 0 and so on.
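
The rescaling step described above can be sketched in a few lines of Python. This is only an illustration: the function name `min_max_scale` and the sample values are hypothetical, and the handling of a constant variable (mapped to 0.5) is an assumption, since the text does not cover that case.

```python
def min_max_scale(values):
    """Map a list of numbers onto [0, 1]: the minimum goes to 0, the maximum to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Assumption: a constant variable is mapped to the midpoint of the scale.
        return [0.5 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# Hypothetical measurements of one variable (not taken from the bank data).
diagonal = [141.0, 139.5, 142.3, 138.4]
scaled = min_max_scale(diagonal)
```

Each scaled value can then drive one face element, for example the darkness of the hair.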

Figure 1.16. Chernoff-Flury faces for observations 1 to 50 of the bank
notes. MVAfacebank50.xpl

As an example, consider the observations 91 to 110 of the bank data. Recall that the bank
data set consists of 200 observations of dimension 6 where, for example, X6 is the diagonal
of the note. If we assign the six variables to the following face elements
X1 = 1, 19 (eye sizes)
X2 = 2, 20 (pupil sizes)
X3 = 4, 22 (eye slants)
X4 = 11, 29 (upper hair lines)
X5 = 12, 30 (lower hair lines)
X6 = 13, 14, 31, 32 (face lines and darkness of hair),
we obtain Figure 1.15. Also recall that observations 1–100 correspond to the genuine notes,
and that observations 101–200 correspond to the counterfeit notes. The counterfeit bank
notes then correspond to the lower half of Figure 1.15. In fact the faces for these observations
look more grim and less happy. The variable X6 (diagonal) already worked well in the boxplot
in Figure 1.4 in distinguishing between the counterfeit and genuine notes. Here, this variable
is assigned to the face line and the darkness of the hair. That is why we clearly see a good
separation within these 20 observations.
What happens if we include all 100 genuine and all 100 counterfeit bank notes in the
Chernoff-Flury face technique? Figures 1.16 and 1.17 show the faces of the genuine bank notes
with the same assignments as used before and Figures 1.18 and 1.19 show the faces of the
counterfeit bank notes. Comparing Figure 1.16 and Figure 1.18 one clearly sees that the
diagonal (face line) is longer for genuine bank notes. Equivalently coded is the hair darkness
(diagonal) which is lighter (shorter) for the counterfeit bank notes. One sees that the faces of
the genuine bank notes have a much darker appearance and have broader face lines. The faces
in Figures 1.16–1.17 are obviously different from the ones in Figures 1.18–1.19.

Figure 1.17. Chernoff-Flury faces for observations 51 to 100 of the bank
notes. MVAfacebank50.xpl

Summary
- Faces can be used to detect subgroups in multivariate data.
- Subgroups are characterized by similar looking faces.
- Outliers are identified by extreme faces, e.g., dark hair, smile or a happy
  face.
- If one element of X is unusual, the corresponding face element significantly
  changes in shape.

Figure 1.18. Chernoff-Flury faces for observations 101 to 150 of the bank
notes. MVAfacebank50.xpl

Figure 1.19. Chernoff-Flury faces for observations 151 to 200 of the bank
notes. MVAfacebank50.xpl

1.6 Andrews' Curves
The basic problem of graphical displays of multivariate data is the dimensionality.
Scatterplots work well up to three dimensions (if we use interactive displays). More than three
dimensions have to be coded into displayable 2D or 3D structures (e.g., faces). The idea
of coding and representing multivariate data by curves was suggested by Andrews (1972).
Each multivariate observation $X_i = (X_{i,1}, \ldots, X_{i,p})$ is transformed into a curve as follows:

\[
f_i(t) =
\begin{cases}
\dfrac{X_{i,1}}{\sqrt{2}} + X_{i,2}\sin(t) + X_{i,3}\cos(t) + \cdots
  + X_{i,p-1}\sin\left(\frac{p-1}{2}t\right) + X_{i,p}\cos\left(\frac{p-1}{2}t\right)
  & \text{for } p \text{ odd,} \\[1ex]
\dfrac{X_{i,1}}{\sqrt{2}} + X_{i,2}\sin(t) + X_{i,3}\cos(t) + \cdots
  + X_{i,p}\sin\left(\frac{p}{2}t\right)
  & \text{for } p \text{ even,}
\end{cases}
\tag{1.13}
\]

such that the observation represents the coefficients of a so-called Fourier series
($t \in [-\pi, \pi]$).
Suppose that we have three-dimensional observations: X1 = (0, 0, 1), X2 = (1, 0, 0) and
X3 = (0, 1, 0). Here p = 3 and the following representations correspond to the Andrews'
curves:

\[
f_1(t) = \cos(t), \qquad f_2(t) = \frac{1}{\sqrt{2}} \qquad \text{and} \qquad f_3(t) = \sin(t).
\]

These curves are indeed quite distinct, since the observations X1 , X2 , and X3 are the 3D
unit vectors: each observation has mass only in one of the three dimensions. The order of
the variables plays an important role.
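
A minimal Python sketch of the curve definition may help to check the three unit-vector curves above. The function name `andrews_curve` and the evaluation point t = 0.3 are illustrative choices, not part of the original text.

```python
import math


def andrews_curve(x, t):
    """Evaluate f(t) = x1/sqrt(2) + x2 sin(t) + x3 cos(t) + x4 sin(2t) + x5 cos(2t) + ..."""
    value = x[0] / math.sqrt(2)
    for k, coef in enumerate(x[1:], start=1):
        freq = (k + 1) // 2            # frequencies 1, 1, 2, 2, 3, 3, ...
        if k % 2 == 1:                 # x2, x4, x6, ... multiply a sine term
            value += coef * math.sin(freq * t)
        else:                          # x3, x5, x7, ... multiply a cosine term
            value += coef * math.cos(freq * t)
    return value


t = 0.3  # any point in [-pi, pi]
f1 = andrews_curve((0, 0, 1), t)   # should agree with cos(t)
f2 = andrews_curve((1, 0, 0), t)   # should agree with 1/sqrt(2)
f3 = andrews_curve((0, 1, 0), t)   # should agree with sin(t)
```

Because the position of a coordinate in the tuple decides which frequency it multiplies, permuting the coordinates changes the curve, which is why the order of the variables matters.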

EXAMPLE 1.2 Let us take the 96th observation of the Swiss bank note data set,

X96 = (215.6, 129.9, 129.9, 9.0, 9.5, 141.7).

By (1.13), the Andrews' curve is:

\[
f_{96}(t) = \frac{215.6}{\sqrt{2}} + 129.9\sin(t) + 129.9\cos(t) + 9.0\sin(2t) + 9.5\cos(2t) + 141.7\sin(3t).
\]

Figure 1.20 shows the Andrews' curves for observations 96–105 of the Swiss bank note data
set. We already know that the observations 96–100 represent genuine bank notes, and that
the observations 101–105 represent counterfeit bank notes. We see that at least four curves
differ from the others, but it is hard to tell which curve belongs to which group.
We know from Figure 1.4 that the sixth variable is an important one. Therefore, the
Andrews' curves are calculated again using a reversed order of the variables.

Figure 1.20. Andrews' curves of the observations 96–105 from the
Swiss bank note data. The order of the variables is 1,2,3,4,5,6.

EXAMPLE 1.3 Let us consider again the 96th observation of the Swiss bank note data set,

X96 = (215.6, 129.9, 129.9, 9.0, 9.5, 141.7).

The Andrews' curve is computed using the reversed order of variables:

\[
f_{96}(t) = \frac{141.7}{\sqrt{2}} + 9.5\sin(t) + 9.0\cos(t) + 129.9\sin(2t) + 129.9\cos(2t) + 215.6\sin(3t).
\]

In Figure 1.21 the curves f96–f105 for observations 96–105 are plotted. Instead of a difference
in high frequency, now we have a difference in the intercept, which makes it more difficult
for us to see the differences in observations.

This shows that the order of the variables plays an important role for the interpretation. If
X is high-dimensional, then the last variables will have only a small visible contribution to
the curve. They fall into the high frequency part of the curve. To overcome this problem
Andrews suggested using an order obtained from Principal Component Analysis.
This technique will be treated in detail in Chapter 9. In fact, the sixth variable will appear
there as the most important variable for discriminating between the two groups. If the
number of observations is more than 20, there may be too many curves in one graph. This
will result in overplotting of curves or a bad “signal-to-ink-ratio”, see Tufte (1983). It
is therefore advisable to present multivariate observations via Andrews' curves only for a
limited number of observations.

Figure 1.21. Andrews' curves of the observations 96–105 from the
Swiss bank note data. The order of the variables is 6,5,4,3,2,1.

Summary
- Outliers appear as single Andrews' curves that look different from the rest.
- A subgroup of data is characterized by a set of similar curves.
- The order of the variables plays an important role for interpretation.
- The order of variables may be optimized by Principal Component
  Analysis.
- For more than 20 observations we may obtain a bad “signal-to-ink-ratio”,
  i.e., too many curves are overlaid in one picture.

1.7 Parallel Coordinates Plots
Parallel coordinates plots (PCP) constitute a technique that is based on a non-Cartesian
coordinate system and therefore allows one to “see” more than four dimensions. The idea
is simple: instead of plotting observations in an orthogonal coordinate system, one draws
their coordinates in a system of parallel axes. Index j of the coordinate is mapped onto the
horizontal axis, and the value xj is mapped onto the vertical axis. This way of representation
is very useful for high-dimensional data. It is however also sensitive to the order of the
variables, since certain trends in the data can be shown more clearly in one ordering than in
another.

Figure 1.22. Parallel coordinates plot of observations 96–105.

Figure 1.23. The entire bank data set. Genuine bank notes are displayed
as black lines. The counterfeit bank notes are shown as red lines.
EXAMPLE 1.4 Take once again the observations 96–105 of the Swiss bank notes. These
observations are six-dimensional, so we cannot show them in a six-dimensional Cartesian
coordinate system. Using the parallel coordinates plot technique, however, they can be plotted
on parallel axes. This is shown in Figure 1.22.
We have already noted in Example 1.2 that the diagonal X6 plays an important role. This
important role is clearly visible from Figure 1.22. The last coordinate X6 shows two different
subgroups. The full bank note data set is displayed in Figure 1.23. One sees an overlap of
the coordinate values for indices 1–3 and an increased separability for the indices 4–6.
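
The construction of a parallel coordinates plot can be sketched without any plotting library: each observation becomes a polyline whose j-th vertex is (j, x_j). The sketch below is only illustrative. The function name `pcp_polylines` and the sample rows are hypothetical, and rescaling every variable to [0, 1] before drawing is an assumption made here so that all axes share one vertical range.

```python
def pcp_polylines(rows):
    """Return, for each observation, its polyline vertices (j, rescaled x_j)."""
    p = len(rows[0])
    columns = list(zip(*rows))
    lows = [min(col) for col in columns]
    highs = [max(col) for col in columns]
    polylines = []
    for row in rows:
        vertices = []
        for j in range(p):
            span = highs[j] - lows[j]
            # Assumption: rescale each variable to [0, 1]; constant variables go to 0.5.
            y = (row[j] - lows[j]) / span if span > 0 else 0.5
            vertices.append((j + 1, y))
        polylines.append(vertices)
    return polylines


# Hypothetical three-dimensional observations, for illustration only.
rows = [(1.0, 10.0, 5.0), (2.0, 30.0, 5.0), (3.0, 20.0, 5.0)]
lines = pcp_polylines(rows)
```

Joining the vertices of each list with straight segments gives one polygon line per observation. Reordering the columns of `rows` changes which axes are adjacent, and hence which patterns are easy to see.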

Summary
- Parallel coordinates plots overcome the visualization problem of the Cartesian
  coordinate system for dimensions greater than 4.
- Outliers are visible as outlying polygon curves.
- The order of variables is still important, for example, for detection of
  subgroups.
- Subgroups may be screened by selective coloring in an interactive manner.

1.8 Boston Housing

Aim of the analysis

The Boston Housing data set was analyzed by Harrison and Rubinfeld (1978) who wanted
to find out whether “clean air” had an influence on house prices. We will use this data set in
this chapter and in most of the following chapters to illustrate the presented methodology.
The data are described in Appendix B.1.

What can be seen from the PCPs

In order to highlight the relations of X14 to the remaining 13 variables we color all of the
observations with X14 > median(X14) as red lines in Figure 1.24. Some of the variables seem
to be strongly related. The most obvious relation is the negative dependence between X13
and X14. It can also be argued that there exists a strong dependence between X12 and X14
since no red lines are drawn in the lower part of X12. The opposite can be said about X11:
there are only red lines plotted in the lower part of this variable. Low values of X11 go
together with high values of X14.
For the PCP, the variables have been rescaled over the interval [0, 1] for better graphical
representations. The PCP shows that the variables are not distributed in a symmetric
manner. It can be clearly seen that the values of X1 and X9 are much more concentrated
around 0. Therefore it makes sense to consider transformations of the original data.
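
The coloring rule used for Figure 1.24 can be sketched as follows. The function name `color_by_median` and the sample prices are hypothetical, and drawing the remaining observations in black is an assumption made for the sketch.

```python
from statistics import median


def color_by_median(values, above="red", below="black"):
    """Return one color per observation: `above` if the value exceeds the median."""
    m = median(values)
    return [above if v > m else below for v in values]


prices = [21.0, 34.7, 16.5, 24.0, 50.0]   # hypothetical values, not the real X14
colors = color_by_median(prices)
```

Each observation's polygon line in the PCP would then be drawn in the color assigned to it, so that high-price districts can be followed across all axes.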

Figure 1.24. Parallel coordinates plot for Boston Housing data.

The scatterplot matrix

One characteristic of the PCPs is that many lines are drawn on top of each other. This
problem is reduced by depicting the variables in pairs of scatterplots. Including all 14
variables in one large scatterplot matrix is possible, but makes it hard to see anything from
the plots. Therefore, for illustrative purposes we will analyze only one such matrix from a
subset of the variables in Figure 1.25. On the basis of the PCP and the scatterplot matrix
we would like to interpret each of the thirteen variables and their possible relation to the
14th variable. Included in the figure are images for X1–X5 and X14, although each variable
is discussed in detail below. All references made to scatterplots in the following refer to
Figure 1.25.

Figure 1.25. Scatterplot matrix for variables X1 , . . . , X5 and X14 of the
Boston Housing data. MVAdrafthousing.xpl

Per-capita crime rate X1

Taking the logarithm makes the variable's distribution more symmetric. This can be seen
in the boxplot of $\widetilde{X}_1$ in Figure 1.27 which shows that the median and the mean have moved
closer to each other than they were for the original X1. Plotting the kernel density
estimate (KDE) of $\widetilde{X}_1 = \log(X_1)$ would reveal that two subgroups might exist with different
mean values. However, taking a look at the scatterplots in Figure 1.26 of the logarithms
which include X1 does not clearly reveal such groups. Given that the scatterplot of log(X1)
vs. log(X14) shows a relatively strong negative relation, it might be the case that the two
subgroups of X1 correspond to houses with two different price levels. This is confirmed by
the two boxplots shown to the right of the X1 vs. X2 scatterplot (in Figure 1.25): the red
boxplot's shape differs a lot from the black one's, having a much higher median and mean.

Figure 1.26. Scatterplot matrix for variables X1 , . . . , X5 and X14 of the
Boston Housing data. MVAdrafthousingt.xpl

Proportion of residential area zoned for large lots X2

It strikes the eye in Figure 1.25 that there is a large cluster of observations for which X2 is
equal to 0. It also strikes the eye that, as the scatterplot of X1 vs. X2 shows, there is a
strong, though non-linear, negative relation between X1 and X2: almost all observations for
which X2 is high have an X1-value close to zero, and vice versa, many observations for which
X2 is zero have quite a high per-capita crime rate X1. This could be due to the location of
the areas, e.g., downtown districts might have a higher crime rate and at the same time it
is unlikely that any residential land would be zoned in a generous manner.
As far as the house prices are concerned it can be said that there seems to be no clear (linear)
relation between X2 and X14, but it is obvious that the more expensive houses are situated
in areas where X2 is large (this can be seen from the two boxplots on the second position of
the diagonal, where the red one has a clearly higher mean/median than the black one).

Proportion of non-retail business acres X3

The PCP (in Figure 1.24) as well as the scatterplot of X3 vs. X14 shows an obvious negative
relation between X3 and X14. The relationship between the logarithms of both variables
seems to be almost linear. This negative relation might be explained by the fact that
non-retail business sometimes causes annoying sounds and other pollution. Therefore, it seems
reasonable to use X3 as an explanatory variable for the prediction of X14 in a linear-regression
analysis.
As far as the distribution of X3 is concerned it can be said that the kernel density estimate
of X3 clearly has two peaks, which indicates that there are two subgroups. According to the
negative relation between X3 and X14 it could be the case that one subgroup corresponds to
the more expensive houses and the other one to the cheaper houses.

Charles River dummy variable X4

The observation made from the PCP that there are more expensive houses than cheap
houses situated on the banks of the Charles River is confirmed by inspecting the scatterplot
matrix. Still, we might have some doubt that the proximity to the river influences the house
prices. Looking at the original data set, it becomes clear that the observations for which
X4 equals one are districts that are close to each other. Apparently, the Charles River does
not flow through too many different districts. Thus, it may be pure coincidence that the
more expensive districts are close to the Charles River. Their high values might be caused by
many other factors such as the pupil/teacher ratio or the proportion of non-retail business
acres.

Nitric oxides concentration X5

The scatterplot of X5 vs. X14 and the separate boxplots of X5 for more and less expensive
houses reveal a clear negative relation between the two variables. As it was the main aim of
the authors of the original study to determine whether pollution had an influence on housing
prices, it should be considered very carefully whether X5 can serve as an explanatory variable
for the price X14. A possible reason in favor of it being an explanatory variable is that people
might not like to live in areas where the emissions of nitric oxides are high. Nitric oxides are
emitted mainly by automobiles, by factories and from heating private homes. However, as
one can imagine there are many good reasons besides nitric oxides not to live downtown or in
industrial areas! Noise pollution, for example, might be a much better explanatory variable
for the price of housing units. As the emission of nitric oxides is usually accompanied by
noise pollution, using X5 as an explanatory variable for X14 might lead to the false conclusion
that people run away from nitric oxides, whereas in reality it is noise pollution that they are
trying to escape.

Average number of rooms per dwelling X6

The number of rooms per dwelling is a possible measure for the size of the houses. Thus we
expect X6 to be strongly correlated with X14 (the houses' median price). Indeed, apart from
some outliers, the scatterplot of X6 vs. X14 shows a point cloud which is clearly
upward-sloping and which seems to be a realisation of a linear dependence of X14 on X6. The two
boxplots of X6 confirm this notion by showing that the quartiles, the mean and the median
are all much higher for the red than for the black boxplot.

Proportion of owner-occupied units built prior to 1940 X7

There is no clear connection visible between X7 and X14. There could be a weak negative
correlation between the two variables, since the (red) boxplot of X7 for the districts whose
price is above the median price indicates a lower mean and median than the (black) boxplot
for the districts whose price is below the median price. The fact that the correlation is not
so clear could be explained by two opposing effects. On the one hand house prices should
decrease if the older houses are not in good shape. On the other hand prices could increase,
because people often like older houses better than newer houses, preferring their atmosphere
of space and tradition. Nevertheless, it seems reasonable that the houses' age has an influence
on their price X14.
Raising X7 to the power of 2.5 reveals again that the data set might consist of two subgroups.
But in this case it is not obvious that the subgroups correspond to more expensive or cheaper
houses. One can furthermore observe a negative relation between X7 and X8. This could
reflect the way the Boston metropolitan area developed over time: the districts with the
newer buildings are farther away from employment centres with industrial facilities.

Weighted distance to five Boston employment centres X8

Since most people like to live close to their place of work, we expect a negative relation
between the distances to the employment centres and the houses' price. The scatterplot
hardly reveals any dependence, but the boxplots of X8 indicate that there might be a slightly
positive relation as the red boxplot's median and mean are higher than the black one's.
Again, there might be two effects in opposite directions at work. The first is that living
too close to an employment centre might not provide enough shelter from the pollution
created there. The second, as mentioned above, is that people do not travel very far to their
workplace.

Index of accessibility to radial highways X9

The first obvious thing one can observe in the scatterplots, as well as in the histograms and
the kernel density estimates, is that there are two subgroups of districts containing X9 values
which are close to the respective group's mean. The scatterplots deliver no hint as to what
might explain the occurrence of these two subgroups. The boxplots indicate that for the
cheaper and for the more expensive houses the average of X9 is almost the same.

Full-value property tax X10

X10 shows a behavior similar to that of X9: two subgroups exist. A downward-sloping curve
seems to underlie the relation of X10 and X14. This is confirmed by the two boxplots drawn
for X10: the red one has a lower mean and median than the black one.

Pupil/teacher ratio X11

The red and black boxplots of X11 indicate a negative relation between X11 and X14. This
is confirmed by inspection of the scatterplot of X11 vs. X14: the point cloud is downward
sloping, i.e., the fewer teachers there are per pupil, the less people pay on median for their
dwellings.

Proportion of blacks B, $X_{12} = 1000 (B - 0.63)^2 I(B < 0.63)$

Interestingly, X12 is negatively (though not linearly) correlated with X3, X7 and X11,
whereas it is positively related with X14. Having a look at the data set reveals that for
almost all districts X12 takes on a value around 390. Since B cannot be larger than 0.63,
such values can only be caused by B close to zero. Therefore, the higher X12 is, the lower
the actual proportion of blacks is! Among observations 405 through 470 there are quite a
few that have an X12 that is much lower than 390. This means that in these districts the
proportion of blacks is above zero. We can observe two clusters of points in the scatterplots
of log(X12): one cluster for which X12 is close to 390 and a second one for which X12 is
between 3 and 100. When X12 is positively related with another variable, the actual
proportion of blacks is negatively correlated with this variable and vice versa. This means that
blacks live in areas where there is a high proportion of non-retail business acres, where there
are older houses and where there is a high (i.e., bad) pupil/teacher ratio. It can be observed
that districts with housing prices above the median can only be found where the proportion
of blacks is virtually zero!

Proportion of lower status of the population X13

Of all the variables X13 exhibits the clearest negative relation with X14; hardly any outliers
show up. Taking the square root of X13 and the logarithm of X14 transforms the relation
into a linear one.


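Since most of the variables exhibit an asymmetry with a higher density on the left side, the
following transformations are proposed:

\begin{align*}
\widetilde{X}_1 &= \log(X_1) \\
\widetilde{X}_2 &= X_2 / 10 \\
\widetilde{X}_3 &= \log(X_3) \\
\widetilde{X}_4 &\quad \text{none, since } X_4 \text{ is binary} \\
\widetilde{X}_5 &= \log(X_5) \\
\widetilde{X}_6 &= \log(X_6) \\
\widetilde{X}_7 &= X_7^{2.5} / 10000 \\
\widetilde{X}_8 &= \log(X_8) \\
\widetilde{X}_9 &= \log(X_9) \\
\widetilde{X}_{10} &= \log(X_{10}) \\
\widetilde{X}_{11} &= \exp(0.4 \times X_{11}) / 1000 \\
\widetilde{X}_{12} &= X_{12} / 100 \\
\widetilde{X}_{13} &= \sqrt{X_{13}} \\
\widetilde{X}_{14} &= \log(X_{14})
\end{align*}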

Taking the logarithm or raising the variables to a power smaller than one helps
to reduce the asymmetry. This is due to the fact that lower values move further away from
each other, whereas the distance between greater values is reduced by these transformations.
Figure 1.27 displays boxplots for the original mean-variance scaled variables as well as for the
proposed transformed variables. The transformed variables' boxplots are more symmetric
and have fewer outliers than the original variables' boxplots.
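
The effect of the logarithm on a right-skewed variable can be illustrated with a small Python sketch. The sample below is made up for illustration (it is not taken from the Boston Housing data), and the moment-based skewness coefficient is one common measure of asymmetry chosen here as an assumption; the text itself judges symmetry from boxplots.

```python
import math


def skewness(values):
    """Moment-based skewness: third central moment over the cubed standard deviation."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


# Hypothetical right-skewed sample: many small values and a few large ones.
x = [0.1, 0.2, 0.3, 0.5, 0.8, 1.5, 4.0, 12.0, 40.0]
log_x = [math.log(v) for v in x]

skew_before = skewness(x)
skew_after = skewness(log_x)
```

For this sample the skewness after the transformation is much closer to zero than before, because the logarithm spreads out the small values and pulls in the large ones.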

Figure 1.27. Boxplots for all of the variables from the Boston Housing
data before and after the proposed transformations (panels: “Boston Housing data”
and “Transformed Boston Housing data”). MVAboxbhd.xpl

1.9 Exercises

EXERCISE 1.1 Is the upper extreme always an outlier?

EXERCISE 1.2 Is it possible for the mean or the median to lie outside of the fourths or
even outside of the outside bars?

EXERCISE 1.3 Assume that the data are normally distributed N (0, 1). What percentage of
the data do you expect to lie outside the outside bars?

EXERCISE 1.4 What percentage of the data do you expect to lie outside the outside bars if
we assume that the data are normally distributed N(0, σ²) with unknown variance σ²?

EXERCISE 1.5 How would the five-number summary of the 15 largest U.S. cities differ from
that of the 50 largest U.S. cities? How would the five-number summary of 15 observations
of N(0, 1)-distributed data differ from that of 50 observations from the same distribution?

EXERCISE 1.6 Is it possible that all five numbers of the five-number summary could be
equal? If so, under what conditions?

EXERCISE 1.7 Suppose we have 50 observations of X ∼ N(0, 1) and another 50 observations
of Y ∼ N(2, 1). What would the 100 Flury faces look like if you had defined as face
elements the face line and the darkness of hair? Do you expect any similar faces? How many
faces do you think should look like observations of Y even though they are X observations?

EXERCISE 1.8 Draw a histogram for the mileage variable of the car data (Table B.3). Do
the same for the three groups (U.S., Japan, Europe). Do you obtain a similar conclusion as
in the parallel boxplot on Figure 1.3 for these data?

EXERCISE 1.9 Use some bandwidth selection criterion to calculate the optimally chosen
bandwidth h for the diagonal variable of the bank notes. Would it be better to have one
bandwidth for the two groups?

EXERCISE 1.10 In Figure 1.9 the densities overlap in the region of diagonal ≈ 140.4. We
partially observed this in the boxplot of Figure 1.4. Our aim is to separate the two groups.
Will we be able to do this effectively on the basis of this diagonal variable alone?

EXERCISE 1.11 Draw a parallel coordinates plot for the car data.

EXERCISE 1.12 How would you identify discrete variables (variables with only a limited
number of possible outcomes) on a parallel coordinates plot?

EXERCISE 1.13 True or false: the heights of the bars of a histogram are equal to the relative
frequencies with which observations fall into the respective bins.

EXERCISE 1.14 True or false: kernel density estimates must always take on a value between
0 and 1. (Hint: Which quantity connected with the density function has to be equal to 1?
Does this property imply that the density function has to always be less than 1?)

EXERCISE 1.15 Let the following data set represent the heights of 13 students taking the
Applied Multivariate Statistical Analysis course:

1.72, 1.83, 1.74, 1.79, 1.94, 1.81, 1.66, 1.60, 1.78, 1.77, 1.85, 1.70, 1.76.

1. Find the corresponding five-number summary.

2. Construct the boxplot.

3. Draw a histogram for this data set.

EXERCISE 1.16 Describe the unemployment data (see Table B.19) that contain unemploy-
ment rates of all German Federal States using various descriptive techniques.

EXERCISE 1.17 Using yearly population data (see B.20), generate

1. a boxplot (choose one of the variables)

2. an Andrews' curve (choose ten data points)

3. a scatterplot

4. a histogram (choose one of the variables)

What do these graphs tell you about the data and their structure?

EXERCISE 1.18 Make a draftman plot for the car data with the variables

X1 = price,
X2 = mileage,
X8 = weight,
X9 = length.

Move the brush into the region of heavy cars. What can you say about price, mileage and
length? Move the brush onto high fuel economy. Mark the Japanese, European and U.S.
American cars. You should find the same condition as in the boxplot in Figure 1.3.

EXERCISE 1.19 What is the form of a scatterplot of two independent random variables X1
and X2 with standard Normal distribution?

EXERCISE 1.20 Rotate a three-dimensional standard normal point cloud in 3D space. Does
it “almost look the same from all sides”? Can you explain why or why not?
Part II

Multivariate Random Variables
2 A Short Excursion into Matrix Algebra

This chapter is a reminder of basic concepts of matrix algebra, which are particularly useful
in multivariate analysis. It also introduces the notations used in this book for vectors and
matrices. Eigenvalues and eigenvectors play an important role in multivariate techniques.
In Sections 2.2 and 2.3, we present the spectral decomposition of matrices and consider the
maximization (minimization) of quadratic forms given some constraints.
In analyzing the multivariate normal distribution, partitioned matrices appear naturally.
Some of the basic algebraic properties are given in Section 2.5. These properties will be
heavily used in Chapters 4 and 5.
The geometry of the multinormal and the geometric interpretation of the multivariate
techniques (Part III) make intensive use of the notion of angles between two vectors, the
projection of a point on a vector and the distances between two points. These ideas are
introduced in Section 2.6.

2.1 Elementary Operations

A matrix A is a system of numbers with n rows and p columns:

\[
A = \begin{pmatrix}
a_{11} & a_{12} & \dots & \dots & a_{1p} \\
\vdots & a_{22} &        &        & \vdots \\
\vdots & \vdots & \ddots &        & \vdots \\
\vdots & \vdots &        & \ddots & \vdots \\
a_{n1} & a_{n2} & \dots & \dots & a_{np}
\end{pmatrix}.
\]

We also write (a_ij) for A and A(n × p) to indicate the numbers of rows and columns. Vectors
are matrices with one column and are denoted as x or x(p × 1). Special matrices and vectors
are defined in Table 2.1. Note that we use small letters for scalars as well as for vectors.

Matrix Operations

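Elementary operations are summarized below:

\begin{align*}
A^\top &= (a_{ji}) \\
A + B &= (a_{ij} + b_{ij}) \\
A - B &= (a_{ij} - b_{ij}) \\
c \cdot A &= (c \cdot a_{ij}) \\
A \cdot B &= A(n \times p)\, B(p \times m) = C(n \times m) = \Bigl( \sum_{j=1}^{p} a_{ij} b_{jk} \Bigr).
\end{align*}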
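
The elementary operations can be written out in plain Python with matrices stored as lists of rows. This is a sketch for illustration only; the function names are hypothetical and the example matrices are made up.

```python
def transpose(a):
    """Return the transpose: element (j, i) of the result is a[i][j]."""
    return [list(col) for col in zip(*a)]


def add(a, b):
    """Elementwise sum of two matrices of the same shape."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def scale(c, a):
    """Multiply every element of a by the scalar c."""
    return [[c * x for x in row] for row in a]


def matmul(a, b):
    """C = A B with c_ik = sum_j a_ij * b_jk; columns of A must match rows of B."""
    if len(a[0]) != len(b):
        raise ValueError("inner dimensions must agree")
    return [[sum(a[i][j] * b[j][k] for j in range(len(b)))
             for k in range(len(b[0]))]
            for i in range(len(a))]


A = [[1, 2, 3], [4, 5, 6]]      # 2 x 3
B = [[1, 0], [0, 1], [2, 2]]    # 3 x 2
C = matmul(A, B)                # 2 x 2
```

The product is defined only when the number of columns of the first factor equals the number of rows of the second, which is what the dimension check enforces.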

Properties of Matrix Operations

\begin{align*}
A(B + C) &= AB + AC \\
A(BC) &= (AB)C \\
(A^{\top})^{\top} &= A \\
(AB)^{\top} &= B^{\top} A^{\top}
\end{align*}
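The operations and properties above can be checked numerically. The following is only a sketch in Python with NumPy, which is our assumption and not the software accompanying this book; the matrices A, B, C and D are arbitrary illustrative choices.

```python
import numpy as np

# Arbitrary illustrative matrices (not taken from the text).
A = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])   # A(3 x 2)
B = np.array([[1.0, 0.0, 2.0], [0.0, 1.0, 3.0]])     # B(2 x 3)
C = np.array([[2.0, 1.0, 0.0], [1.0, 0.0, 1.0]])     # C(2 x 3)
D = np.array([[1.0, 1.0], [0.0, 2.0], [3.0, 0.0]])   # D(3 x 2)

# Product A B with entries sum_j a_ij * b_jk, once via @ and once by hand.
AB = A @ B
AB_by_hand = np.array([[sum(A[i, j] * B[j, k] for j in range(A.shape[1]))
                        for k in range(B.shape[1])]
                       for i in range(A.shape[0])])

checks = {
    "A(B + C) = AB + AC": np.allclose(A @ (B + C), A @ B + A @ C),
    "A(BD) = (AB)D": np.allclose(A @ (B @ D), (A @ B) @ D),
    "(A^T)^T = A": np.array_equal(A.T.T, A),
    "(AB)^T = B^T A^T": np.allclose((A @ B).T, B.T @ A.T),
}
for name, ok in checks.items():
    print(name, ok)
```

Note that the dimensions must conform: A(3 x 2) can be multiplied by B(2 x 3), but A + B is not defined for these shapes.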

Matrix Characteristics


The rank, $\operatorname{rank}(A)$, of a matrix $A(n \times p)$ is defined as the maximum number of linearly
independent rows (columns). A set of $k$ rows $a_j$ of $A(n \times p)$ is said to be linearly independent
if $\sum_{j=1}^{k} c_j a_j = 0_p$ implies $c_j = 0$, $\forall j$, where $c_1, \ldots, c_k$ are scalars. In other words, no row in
this set can be expressed as a linear combination of the $(k - 1)$ remaining rows.


The trace of a matrix $A(p \times p)$ is the sum of its diagonal elements
\[
\operatorname{tr}(A) = \sum_{i=1}^{p} a_{ii}.
\]
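As a small numerical illustration of rank and trace, consider the sketch below. It assumes Python with NumPy (not the software of this book), and the matrix is a hypothetical example whose third row is the sum of the first two.

```python
import numpy as np

# Hypothetical example: row 3 = row 1 + row 2, so only two rows
# are linearly independent.
A = np.array([[1.0, 2.0, 3.0],
              [0.0, 1.0, 1.0],
              [1.0, 3.0, 4.0]])

rank_A = np.linalg.matrix_rank(A)   # numerical rank, computed from singular values
trace_A = np.trace(A)               # sum of the diagonal elements a_ii

# c = (1, 1, -1) is a non-trivial set of scalars with sum_j c_j a_j = 0_p,
# which shows that the three rows are linearly dependent.
c = np.array([1.0, 1.0, -1.0])
combination = c @ A

print(rank_A, trace_A, combination)
```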

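| Name | Definition | Notation | Example |
|---|---|---|---|
| scalar | $p = n = 1$ | $a$ | $3$ |
| column vector | $p = 1$ | $a$ | |
| row vector | $n = 1$ | $a^{\top}$ | |
| vector of ones | $(1, \ldots, 1)^{\top}$ | $1_n$ | |
| vector of zeros | $(0, \ldots, 0)^{\top}$ | $0_n$ | |
| square matrix | $n = p$ | $A(p \times p)$ | $\begin{pmatrix} 2 & 0 \\ 0 & 2 \end{pmatrix}$ |
| diagonal matrix | $a_{ij} = 0$, $i \neq j$, $n = p$ | $\operatorname{diag}(a_{ii})$ | $\begin{pmatrix} 1 & 0 \\ 0 & 2 \end{pmatrix}$ |
| identity matrix | $\operatorname{diag}(1, \ldots, 1)$ | $I_p$ | $\begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix}$ |
| unit matrix | $a_{ij} \equiv 1$, $n = p$ | $1_n 1_n^{\top}$ | |
| symmetric matrix | $a_{ij} = a_{ji}$ | | |
| null matrix | $a_{ij} = 0$ | $0$ | |
| upper triangular matrix | $a_{ij} = 0$, $i > j$ | | |
| idempotent matrix | $AA = A$ | | $\begin{pmatrix} 1 & 0 & 0 \\ 0 & \frac{1}{2} & \frac{1}{2} \\ 0 & \frac{1}{2} & \frac{1}{2} \end{pmatrix}$ |
| orthogonal matrix | $A^{\top} A = I = A A^{\top}$ | | |
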
Table 2.1. Special matrices and vectors.


The determinant is an important concept of matrix algebra. For a square matrix $A$, it is
defined as:
\[
\det(A) = |A| = \sum_{\tau} (-1)^{|\tau|} a_{1\tau(1)} \cdots a_{p\tau(p)},
\]
where the summation is over all permutations $\tau$ of $\{1, 2, \ldots, p\}$, and $|\tau| = 0$ if the permutation can
be written as a product of an even number of transpositions and $|\tau| = 1$ otherwise.

EXAMPLE 2.1 In the case of $p = 2$, $A = \begin{pmatrix} a_{11} & a_{12} \\ a_{21} & a_{22} \end{pmatrix}$ and we can permute the digits “1”
and “2” once or not at all. So,

\[
|A| = a_{11} a_{22} - a_{12} a_{21}.
\]
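The permutation formula for the determinant translates directly into code. The sketch below is plain Python and is meant only to mirror the definition; the function names `parity` and `det_by_permutations` are our own, and indices run from 0 to p - 1 rather than from 1 to p. Since the sum has p! terms, this approach is practical only for small p.

```python
from itertools import permutations

def parity(tau):
    """Return |tau|: 0 if tau is an even permutation, 1 if it is odd.

    The parity of the number of inversions equals the parity of the
    number of transpositions needed to build tau.
    """
    p = len(tau)
    inversions = sum(1 for i in range(p) for j in range(i + 1, p)
                     if tau[i] > tau[j])
    return inversions % 2

def det_by_permutations(a):
    """Determinant of a square matrix (list of rows) via the permutation sum."""
    p = len(a)
    total = 0
    for tau in permutations(range(p)):
        term = (-1) ** parity(tau)
        for row in range(p):
            term *= a[row][tau[row]]     # factor a_{row, tau(row)}
        total += term
    return total

# For p = 2 this reduces to a11*a22 - a12*a21, as in Example 2.1.
print(det_by_permutations([[1, 2], [3, 4]]))   # 1*4 - 2*3 = -2
```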


For $A(n \times p)$ and $B(p \times n)$

\[
(A^{\top})^{\top} = A, \quad \text{and} \quad (AB)^{\top} = B^{\top} A^{\top}.
\]


If $|A| \neq 0$ and $A(p \times p)$, then the inverse $A^{-1}$ exists:

\[
A A^{-1} = A^{-1} A = I_p.
\]

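For small matrices, the inverse of $A = (a_{ij})$ can be calculated as

\[
A^{-1} = \frac{C}{|A|},
\]

where $C = (c_{ij})$ is the adjoint matrix of $A$. The elements $c_{ji}$ of $C$ are the co-factors of $A$:

\[
c_{ji} = (-1)^{i+j}
\begin{vmatrix}
a_{11} & \cdots & a_{1(j-1)} & a_{1(j+1)} & \cdots & a_{1p} \\
\vdots & & \vdots & \vdots & & \vdots \\
a_{(i-1)1} & \cdots & a_{(i-1)(j-1)} & a_{(i-1)(j+1)} & \cdots & a_{(i-1)p} \\
a_{(i+1)1} & \cdots & a_{(i+1)(j-1)} & a_{(i+1)(j+1)} & \cdots & a_{(i+1)p} \\
\vdots & & \vdots & \vdots & & \vdots \\
a_{p1} & \cdots & a_{p(j-1)} & a_{p(j+1)} & \cdots & a_{pp}
\end{vmatrix}.
\]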
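The adjoint formula for the inverse can be sketched as follows. This is plain Python with exact fractions; the helper names `minor`, `det` and `inverse_by_adjoint` are our own, and the recursive determinant is chosen for brevity, not efficiency. Entry (j, i) of the result is the co-factor obtained by deleting row i and column j of A, divided by |A|.

```python
from fractions import Fraction

def minor(a, i, j):
    """Matrix a with row i and column j removed."""
    return [[a[r][c] for c in range(len(a)) if c != j]
            for r in range(len(a)) if r != i]

def det(a):
    """Determinant by expansion along the first row (fine for small matrices)."""
    if not a:
        return 1                      # determinant of the empty matrix
    return sum((-1) ** j * a[0][j] * det(minor(a, 0, j))
               for j in range(len(a)))

def inverse_by_adjoint(a):
    """Inverse of a square integer matrix as C / |A|, with C the adjoint matrix."""
    p = len(a)
    d = det(a)
    if d == 0:
        raise ValueError("matrix is singular: |A| = 0")
    # Row j, column i holds c_ji = (-1)^(i+j) * det(A without row i, column j).
    return [[Fraction((-1) ** (i + j) * det(minor(a, i, j)), d)
             for i in range(p)]
            for j in range(p)]

A = [[1, 2], [3, 4]]
A_inv = inverse_by_adjoint(A)
print(A_inv)
```

For the hypothetical matrix above, |A| = -2, so the inverse has rows (-2, 1) and (3/2, -1/2).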


A more general concept is the G-inverse (Generalized Inverse) $A^{-}$, which satisfies the following:
\[
A A^{-} A = A.
\]
Later we will see that there may be more than one G-inverse.

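EXAMPLE 2.2 The generalized inverse can also be calculated for singular matrices. We have
\[
\begin{pmatrix} 1 & 0 \\ 0 & 0 \end{pmatrix}
\begin{pmatrix} 1 & 0 \\ 0 & 0 \end{pmatrix}
\begin{pmatrix} 1 & 0 \\ 0 & 0 \end{pmatrix}
=
\begin{pmatrix} 1 & 0 \\ 0 & 0 \end{pmatrix},
\]
which means that the generalized inverse of $A = \begin{pmatrix} 1 & 0 \\ 0 & 0 \end{pmatrix}$ is $A^{-} = \begin{pmatrix} 1 & 0 \\ 0 & 0 \end{pmatrix}$ even though
the inverse matrix of $A$ does not exist in this case.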

Eigenvalues, Eigenvectors

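Consider a $(p \times p)$ matrix $A$. If there exists a scalar $\lambda$ and a vector $\gamma$ such that
\[
A\gamma = \lambda\gamma, \tag{2.1}
\]
then we call
$\lambda$ an eigenvalue
$\gamma$ an eigenvector.
It can be proven that an eigenvalue $\lambda$ is a root of the $p$-th order polynomial $|A - \lambda I_p| = 0$.
Therefore, there are up to $p$ eigenvalues $\lambda_1, \lambda_2, \ldots, \lambda_p$ of $A$. For each eigenvalue $\lambda_j$, there
exists a corresponding eigenvector $\gamma_j$ given by equation (2.1). Suppose the matrix $A$ has
the eigenvalues $\lambda_1, \ldots, \lambda_p$. Let $\Lambda = \operatorname{diag}(\lambda_1, \ldots, \lambda_p)$.
The determinant $|A|$ and the trace $\operatorname{tr}(A)$ can be rewritten in terms of the eigenvalues:
\begin{align}
|A| &= |\Lambda| = \prod_{j=1}^{p} \lambda_j \tag{2.2} \\
\operatorname{tr}(A) &= \operatorname{tr}(\Lambda) = \sum_{j=1}^{p} \lambda_j. \tag{2.3}
\end{align}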

An idempotent matrix $A$ (see the definition in Table 2.1) can only have eigenvalues in $\{0, 1\}$;
therefore $\operatorname{tr}(A) = \operatorname{rank}(A)$ = number of eigenvalues $\neq 0$.
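EXAMPLE 2.3 Let us consider the matrix $A = \begin{pmatrix} 1 & 0 & 0 \\ 0 & \frac{1}{2} & \frac{1}{2} \\ 0 & \frac{1}{2} & \frac{1}{2} \end{pmatrix}$. It is easy to verify that
$AA = A$, which implies that the matrix $A$ is idempotent.
We know that the eigenvalues of an idempotent matrix are equal to 0 or 1. In this case, the
eigenvalues of $A$ are $\lambda_1 = 1$, $\lambda_2 = 1$, and $\lambda_3 = 0$ since
\[
A \begin{pmatrix} 1 \\ 0 \\ 0 \end{pmatrix} = 1 \begin{pmatrix} 1 \\ 0 \\ 0 \end{pmatrix}, \quad
A \begin{pmatrix} 0 \\ \frac{\sqrt{2}}{2} \\ \frac{\sqrt{2}}{2} \end{pmatrix} = 1 \begin{pmatrix} 0 \\ \frac{\sqrt{2}}{2} \\ \frac{\sqrt{2}}{2} \end{pmatrix}, \quad \text{and} \quad
A \begin{pmatrix} 0 \\ \frac{\sqrt{2}}{2} \\ -\frac{\sqrt{2}}{2} \end{pmatrix} = 0 \begin{pmatrix} 0 \\ \frac{\sqrt{2}}{2} \\ -\frac{\sqrt{2}}{2} \end{pmatrix}.
\]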

Using formulas (2.2) and (2.3), we can calculate the trace and the determinant of $A$ from
the eigenvalues: $\operatorname{tr}(A) = \lambda_1 + \lambda_2 + \lambda_3 = 2$, $|A| = \lambda_1 \lambda_2 \lambda_3 = 0$, and $\operatorname{rank}(A) = 2$.
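The statements of Example 2.3 can be verified numerically. The sketch below assumes Python with NumPy (our choice, not the software of this book) and uses the idempotent matrix with diagonal (1, 1/2, 1/2) and off-diagonal entries 1/2 in the lower right block.

```python
import numpy as np

# Idempotent matrix of Example 2.3.
A = np.array([[1.0, 0.0, 0.0],
              [0.0, 0.5, 0.5],
              [0.0, 0.5, 0.5]])

is_idempotent = np.allclose(A @ A, A)

# A is symmetric, so eigvalsh applies; it returns the eigenvalues
# in ascending order.
eigenvalues = np.linalg.eigvalsh(A)

trace_from_eigenvalues = eigenvalues.sum()    # formula (2.3)
det_from_eigenvalues = eigenvalues.prod()     # formula (2.2)
nonzero_count = int(np.sum(~np.isclose(eigenvalues, 0.0)))
rank_A = int(np.linalg.matrix_rank(A))

print(is_idempotent, eigenvalues, trace_from_eigenvalues, rank_A)
```

Up to rounding error, the eigenvalues are 0, 1, 1, and the trace, the rank and the number of non-zero eigenvalues all equal 2.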

Properties of Matrix Characteristics

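$A(n \times n)$, $B(n \times n)$, $c \in \mathbb{R}$

\begin{align}
\operatorname{tr}(A + B) &= \operatorname{tr} A + \operatorname{tr} B \tag{2.4} \\
\operatorname{tr}(cA) &= c \operatorname{tr} A \tag{2.5} \\
|cA| &= c^{n} |A| \tag{2.6} \\
|AB| &= |BA| = |A||B| \tag{2.7}
\end{align}

$A(n \times p)$, $B(p \times n)$

\begin{align}
\operatorname{tr}(A \cdot B) &= \operatorname{tr}(B \cdot A) \tag{2.8} \\
\operatorname{rank}(A) &\le \min(n, p) \notag \\
\operatorname{rank}(A) &\ge 0 \tag{2.9} \\
\operatorname{rank}(A) &= \operatorname{rank}(A^{\top}) \tag{2.10} \\
\operatorname{rank}(A^{\top} A) &= \operatorname{rank}(A) \tag{2.11} \\
\operatorname{rank}(A + B) &\le \operatorname{rank}(A) + \operatorname{rank}(B) \tag{2.12} \\
\operatorname{rank}(AB) &\le \min\{\operatorname{rank}(A), \operatorname{rank}(B)\} \tag{2.13}
\end{align}

$A(n \times p)$, $B(p \times q)$, $C(q \times n)$

\begin{align}
\operatorname{tr}(ABC) &= \operatorname{tr}(BCA) = \operatorname{tr}(CAB) \tag{2.14} \\
\operatorname{rank}(ABC) &= \operatorname{rank}(B) \quad \text{for nonsingular } A, C \tag{2.15}
\end{align}

$A(p \times p)$

\begin{align}
|A^{-1}| &= |A|^{-1} \tag{2.16} \\
\operatorname{rank}(A) &= p \quad \text{if and only if } A \text{ is nonsingular.} \tag{2.17}
\end{align}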
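Several of the properties above can be spot-checked on random matrices. The following sketch assumes Python with NumPy; the dimensions, the constant c and the seed are arbitrary, and the names F, K, G, H are our own labels for the rectangular matrices. A numerical check of this kind illustrates the identities but does not prove them.

```python
import numpy as np

rng = np.random.default_rng(0)      # fixed seed; the matrices are arbitrary
tr, det, rank = np.trace, np.linalg.det, np.linalg.matrix_rank
n, p, q, c = 4, 3, 5, 2.5

A, B = rng.normal(size=(n, n)), rng.normal(size=(n, n))   # square case
F, K = rng.normal(size=(n, p)), rng.normal(size=(p, n))   # F(n x p), K(p x n)
G, H = rng.normal(size=(p, q)), rng.normal(size=(q, n))   # G(p x q), H(q x n)

checks = {
    "2.4": np.isclose(tr(A + B), tr(A) + tr(B)),
    "2.5": np.isclose(tr(c * A), c * tr(A)),
    "2.6": np.isclose(det(c * A), c**n * det(A)),
    "2.7": np.isclose(det(A @ B), det(A) * det(B))
           and np.isclose(det(B @ A), det(A) * det(B)),
    "2.8": np.isclose(tr(F @ K), tr(K @ F)),
    "2.10": rank(F) == rank(F.T),
    "2.11": rank(F.T @ F) == rank(F),
    "2.13": rank(F @ K) <= min(rank(F), rank(K)),
    "2.14": np.isclose(tr(F @ G @ H), tr(G @ H @ F))
            and np.isclose(tr(F @ G @ H), tr(H @ F @ G)),
    "2.16": np.isclose(det(np.linalg.inv(A)), 1.0 / det(A)),
}
for number, ok in checks.items():
    print(f"({number})", bool(ok))
```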

Summary
• The determinant $|A|$ is the product of the eigenvalues of $A$.
• The inverse of a matrix $A$ exists if $|A| \neq 0$.
• The trace $\operatorname{tr}(A)$ is the sum of the eigenvalues of $A$.
• The sum of the traces of two matrices equals the trace of the sum of the
two matrices.
• The trace $\operatorname{tr}(AB)$ equals $\operatorname{tr}(BA)$.
• The $\operatorname{rank}(A)$ is the maximal number of linearly independent rows
(columns) of $A$.

2.2 Spectral Decompositions
The computation of eigenvalues and eigenvectors is an important issue in the analysis of
matrices. The spectral decomposition or Jordan decomposition links the structure of a
matrix to the eigenvalues and the eigenvectors.

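THEOREM 2.1 (Jordan Decomposition) Each symmetric matrix $A(p \times p)$ can be written
as
\[
A = \Gamma \Lambda \Gamma^{\top} = \sum_{j=1}^{p} \lambda_j \gamma_j \gamma_j^{\top} \tag{2.18}
\]
where
\[
\Lambda = \operatorname{diag}(\lambda_1, \ldots, \lambda_p)
\]
and where
\[
\Gamma = (\gamma_1, \gamma_2, \ldots, \gamma_p)
\]
is an orthogonal matrix consisting of the eigenvectors $\gamma_j$ of $A$.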

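EXAMPLE 2.4 Suppose that $A = \begin{pmatrix} 1 & 2 \\ 2 & 3 \end{pmatrix}$. The eigenvalues are found by solving $|A - \lambda I| = 0$.
This is equivalent to
\[
\begin{vmatrix} 1 - \lambda & 2 \\ 2 & 3 - \lambda \end{vmatrix} = (1 - \lambda)(3 - \lambda) - 4 = 0.
\]
Hence, the eigenvalues are $\lambda_1 = 2 + \sqrt{5}$ and $\lambda_2 = 2 - \sqrt{5}$. The eigenvectors are $\gamma_1 =
(0.5257, 0.8506)^{\top}$ and $\gamma_2 = (0.8506, -0.5257)^{\top}$. They are orthogonal since $\gamma_1^{\top} \gamma_2 = 0$.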
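The spectral decomposition of Example 2.4 can be reproduced numerically. The sketch below assumes Python with NumPy (our choice) and takes A to be the symmetric matrix with rows (1, 2) and (2, 3), which is the matrix consistent with the characteristic polynomial (1 - λ)(3 - λ) - 4 of the example. Eigenvectors are determined only up to sign, so the computed columns may differ from the printed vectors by a factor of -1.

```python
import numpy as np

A = np.array([[1.0, 2.0],
              [2.0, 3.0]])

# eigh is intended for symmetric matrices. It returns the eigenvalues in
# ascending order and the orthonormal eigenvectors as the columns of Gamma.
eigenvalues, Gamma = np.linalg.eigh(A)
Lambda = np.diag(eigenvalues)

# Both forms of (2.18): the matrix product and the sum of rank-one terms.
A_from_product = Gamma @ Lambda @ Gamma.T
A_from_sum = sum(lam * np.outer(g, g) for lam, g in zip(eigenvalues, Gamma.T))

print(eigenvalues)          # 2 - sqrt(5) and 2 + sqrt(5)
print(np.abs(Gamma))        # entries 0.5257 and 0.8506 in absolute value
```

Note that `eigh` orders the eigenvalues from smallest to largest, so the eigenvector belonging to 2 + √5 is the second column here.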

Using spectral decomposition, we can define powers of a matrix $A(p \times p)$. Suppose $A$ is a
symmetric matrix. Then by Theorem 2.1

\[
A = \Gamma \Lambda \Gamma^{\top},
\]

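and we define for some $\alpha \in \mathbb{R}$
\[
A^{\alpha} = \Gamma \Lambda^{\alpha} \Gamma^{\top}, \tag{2.19}
\]
where $\Lambda^{\alpha} = \operatorname{diag}(\lambda_1^{\alpha}, \ldots, \lambda_p^{\alpha})$. In particular, we can easily calculate the inverse of the matrix
$A$. Suppose that the eigenvalues of $A$ are positive. Then with $\alpha = -1$, we obtain the inverse
of $A$ from
\[
A^{-1} = \Gamma \Lambda^{-1} \Gamma^{\top}. \tag{2.20}
\]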
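Formulas (2.19) and (2.20) can be sketched in a few lines. The code below assumes Python with NumPy; the function name `symmetric_power` and the example matrix are hypothetical. In line with the assumption in the text, the function insists on positive eigenvalues, so that every real power α is well defined.

```python
import numpy as np

def symmetric_power(A, alpha):
    """Return Gamma Lambda^alpha Gamma^T for a symmetric matrix A, as in (2.19)."""
    eigenvalues, Gamma = np.linalg.eigh(A)
    if np.any(eigenvalues <= 0):
        raise ValueError("this sketch requires positive eigenvalues")
    return Gamma @ np.diag(eigenvalues ** alpha) @ Gamma.T

# Hypothetical symmetric matrix with eigenvalues 1 and 3.
A = np.array([[2.0, 1.0],
              [1.0, 2.0]])

A_inv = symmetric_power(A, -1)      # formula (2.20)
A_sqrt = symmetric_power(A, 0.5)    # a matrix whose square is A

print(A_inv)
print(A_sqrt @ A_sqrt)
```

The same construction with α = 1/2 gives a symmetric square root of A, which illustrates why the definition is useful beyond the inverse.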

Another interesting decomposition which is later used is given in the following theorem.

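THEOREM 2.2 (Singular Value Decomposition) Each matrix $A(n \times p)$ with rank $r$ can
be decomposed as
\[
A = \Gamma \Lambda \Delta^{\top},
\]
where $\Gamma(n \times r)$ and $\Delta(p \times r)$. Both $\Gamma$ and $\Delta$ are column orthonormal, i.e., $\Gamma^{\top} \Gamma = \Delta^{\top} \Delta = I_r$
and $\Lambda = \operatorname{diag}\left(\lambda_1^{1/2}, \ldots, \lambda_r^{1/2}\right)$, $\lambda_j > 0$. The values $\lambda_1, \ldots, \lambda_r$ are the non-zero eigenvalues of
the matrices $AA^{\top}$ and $A^{\top}A$. $\Gamma$ and $\Delta$ consist of the corresponding $r$ eigenvectors of these
matrices.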

This is obviously a generalization of Theorem 2.1 (Jordan decomposition). With Theorem
2.2, we can find a G-inverse $A^{-}$ of $A$. Indeed, define $A^{-} = \Delta \Lambda^{-1} \Gamma^{\top}$. Then $A A^{-} A =
\Gamma \Lambda \Delta^{\top} = A$. Note that the G-inverse is not unique.

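EXAMPLE 2.5 In Example 2.2, we showed that the generalized inverse of $A = \begin{pmatrix} 1 & 0 \\ 0 & 0 \end{pmatrix}$
is $A^{-} = \begin{pmatrix} 1 & 0 \\ 0 & 0 \end{pmatrix}$. The following also holds

\[
\begin{pmatrix} 1 & 0 \\ 0 & 0 \end{pmatrix}
\begin{pmatrix} 1 & 0 \\ 0 & 8 \end{pmatrix}
\begin{pmatrix} 1 & 0 \\ 0 & 0 \end{pmatrix}
=
\begin{pmatrix} 1 & 0 \\ 0 & 0 \end{pmatrix},
\]

which means that the matrix $\begin{pmatrix} 1 & 0 \\ 0 & 8 \end{pmatrix}$ is also a generalized inverse of $A$.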
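The construction of a G-inverse from the singular value decomposition, and the non-uniqueness shown in Example 2.5, can be sketched as follows. The code assumes Python with NumPy; the function name `g_inverse_via_svd`, the tolerance `tol` and the rank-one matrix M are our own illustrative choices. NumPy returns the singular values, which correspond to the square roots of the non-zero eigenvalues in Theorem 2.2.

```python
import numpy as np

def g_inverse_via_svd(A, tol=1e-12):
    """Return Delta Lambda^{-1} Gamma^T built from the r positive singular values."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    r = int(np.sum(s > tol))            # numerical rank
    Gamma = U[:, :r]                    # (n x r), column orthonormal
    Delta = Vt[:r, :].T                 # (p x r), column orthonormal
    Lambda_inv = np.diag(1.0 / s[:r])   # s[j] plays the role of lambda_j^(1/2)
    return Delta @ Lambda_inv @ Gamma.T

def is_g_inverse(A, G):
    """True if A G A = A up to rounding error."""
    return np.allclose(A @ G @ A, A)

# Hypothetical (3 x 2) matrix of rank 1: the second column is twice the first.
M = np.array([[1.0, 2.0], [2.0, 4.0], [3.0, 6.0]])
M_g = g_inverse_via_svd(M)

# Matrix of Examples 2.2 and 2.5 with two different G-inverses.
A = np.array([[1.0, 0.0], [0.0, 0.0]])
G1 = g_inverse_via_svd(A)
G2 = np.array([[1.0, 0.0], [0.0, 8.0]])

print(is_g_inverse(M, M_g), is_g_inverse(A, G1), is_g_inverse(A, G2))
```

Both G1 and G2 satisfy the defining equation although they differ, which is the non-uniqueness noted in the text.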

• The Jordan decomposition gives a representation of a symmetric matrix
in terms of eigenvalues and eigenvectors.

