
Applied Multivariate
Statistical Analysis




Wolfgang Härdle
Léopold Simar




Version: 22nd October 2003
Please note: this is only a sample of the full book. The complete book can be
downloaded from the e-book page of XploRe:
http://www.xplore-stat.de/ebooks/ebooks.html



For further information please contact
MD*Tech at mdtech@mdtech.de
Contents



I Descriptive Techniques 11

1 Comparison of Batches 13
1.1 Boxplots . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
1.2 Histograms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
1.3 Kernel Densities . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
1.4 Scatterplots . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
1.5 Chernoff-Flury Faces . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
1.6 Andrews' Curves . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
1.7 Parallel Coordinates Plots . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
1.8 Boston Housing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
1.9 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52



II Multivariate Random Variables 55

2 A Short Excursion into Matrix Algebra 57
2.1 Elementary Operations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
2.2 Spectral Decompositions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
2.3 Quadratic Forms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
2.4 Derivatives . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68
2.5 Partitioned Matrices . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68
2.6 Geometrical Aspects . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
2.7 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79

3 Moving to Higher Dimensions 81
3.1 Covariance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82
3.2 Correlation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86
3.3 Summary Statistics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92
3.4 Linear Model for Two Variables . . . . . . . . . . . . . . . . . . . . . . . . . 95
3.5 Simple Analysis of Variance . . . . . . . . . . . . . . . . . . . . . . . . . . . 103
3.6 Multiple Linear Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 108
3.7 Boston Housing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 112
3.8 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 115

4 Multivariate Distributions 119
4.1 Distribution and Density Function . . . . . . . . . . . . . . . . . . . . . . . . 120
4.2 Moments and Characteristic Functions . . . . . . . . . . . . . . . . . . . . . 125
4.3 Transformations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 135
4.4 The Multinormal Distribution . . . . . . . . . . . . . . . . . . . . . . . . . . 137
4.5 Sampling Distributions and Limit Theorems . . . . . . . . . . . . . . . . . . 142
4.6 Bootstrap . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 148
4.7 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 152

5 Theory of the Multinormal 155
5.1 Elementary Properties of the Multinormal . . . . . . . . . . . . . . . . . . . 155
5.2 The Wishart Distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . 162
5.3 Hotelling Distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 165
5.4 Spherical and Elliptical Distributions . . . . . . . . . . . . . . . . . . . . . . 167
5.5 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 169

6 Theory of Estimation 173
6.1 The Likelihood Function . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 174
6.2 The Cramer-Rao Lower Bound . . . . . . . . . . . . . . . . . . . . . . . . . 178
6.3 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 181

7 Hypothesis Testing 183
7.1 Likelihood Ratio Test . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 184
7.2 Linear Hypothesis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 192
7.3 Boston Housing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 209
7.4 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 212



III Multivariate Techniques 217

8 Decomposition of Data Matrices by Factors 219
8.1 The Geometric Point of View . . . . . . . . . . . . . . . . . . . . . . . . . . 220
8.2 Fitting the p-dimensional Point Cloud . . . . . . . . . . . . . . . . . . . . . 221
8.3 Fitting the n-dimensional Point Cloud . . . . . . . . . . . . . . . . . . . . . 225
8.4 Relations between Subspaces . . . . . . . . . . . . . . . . . . . . . . . . . . . 227
8.5 Practical Computation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 228
8.6 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 232

9 Principal Components Analysis 233
9.1 Standardized Linear Combinations . . . . . . . . . . . . . . . . . . . . . . . 234
9.2 Principal Components in Practice . . . . . . . . . . . . . . . . . . . . . . . . 238
9.3 Interpretation of the PCs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 241
9.4 Asymptotic Properties of the PCs . . . . . . . . . . . . . . . . . . . . . . . . 246
9.5 Normalized Principal Components Analysis . . . . . . . . . . . . . . . . . . . 249
9.6 Principal Components as a Factorial Method . . . . . . . . . . . . . . . . . . 250
9.7 Common Principal Components . . . . . . . . . . . . . . . . . . . . . . . . . 256
9.8 Boston Housing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 259
9.9 More Examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 261
9.10 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 272

10 Factor Analysis 275
10.1 The Orthogonal Factor Model . . . . . . . . . . . . . . . . . . . . . . . . . . 275
10.2 Estimation of the Factor Model . . . . . . . . . . . . . . . . . . . . . . . . . 282
10.3 Factor Scores and Strategies . . . . . . . . . . . . . . . . . . . . . . . . . . . 291
10.4 Boston Housing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 293
10.5 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 298

11 Cluster Analysis 301
11.1 The Problem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 301
11.2 The Proximity between Objects . . . . . . . . . . . . . . . . . . . . . . . . . 302
11.3 Cluster Algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 308
11.4 Boston Housing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 316
11.5 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 318

12 Discriminant Analysis 323
12.1 Allocation Rules for Known Distributions . . . . . . . . . . . . . . . . . . . . 323
12.2 Discrimination Rules in Practice . . . . . . . . . . . . . . . . . . . . . . . . . 331
12.3 Boston Housing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 337
12.4 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 339

13 Correspondence Analysis 341
13.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 341
13.2 Chi-square Decomposition . . . . . . . . . . . . . . . . . . . . . . . . . . . . 344
13.3 Correspondence Analysis in Practice . . . . . . . . . . . . . . . . . . . . . . 347
13.4 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 358

14 Canonical Correlation Analysis 361
14.1 Most Interesting Linear Combination . . . . . . . . . . . . . . . . . . . . . . 361
14.2 Canonical Correlation in Practice . . . . . . . . . . . . . . . . . . . . . . . . 366
14.3 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 372

15 Multidimensional Scaling 373
15.1 The Problem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 373
15.2 Metric Multidimensional Scaling . . . . . . . . . . . . . . . . . . . . . . . . . 379
15.2.1 The Classical Solution . . . . . . . . . . . . . . . . . . . . . . . . . . 379
15.3 Nonmetric Multidimensional Scaling . . . . . . . . . . . . . . . . . . . . . . 383
15.3.1 Shepard-Kruskal algorithm . . . . . . . . . . . . . . . . . . . . . . . . 384
15.4 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 391

16 Conjoint Measurement Analysis 393
16.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 393
16.2 Design of Data Generation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 395
16.3 Estimation of Preference Orderings . . . . . . . . . . . . . . . . . . . . . . . 398
16.4 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 405

17 Applications in Finance 407
17.1 Portfolio Choice . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 407
17.2 Efficient Portfolio . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 408
17.3 Efficient Portfolios in Practice . . . . . . . . . . . . . . . . . . . . . . . . . . 415
17.4 The Capital Asset Pricing Model (CAPM) . . . . . . . . . . . . . . . . . . . 417
17.5 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 418

18 Highly Interactive, Computationally Intensive Techniques 421
18.1 Simplicial Depth . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 421
18.2 Projection Pursuit . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 425
18.3 Sliced Inverse Regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 431
18.4 Boston Housing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 439
18.5 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 440

A Symbols and Notation 443

B Data 447
B.1 Boston Housing Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 447
B.2 Swiss Bank Notes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 448
B.3 Car Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 452
B.4 Classic Blue Pullovers Data . . . . . . . . . . . . . . . . . . . . . . . . . . . 454
B.5 U.S. Companies Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 455
B.6 French Food Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 457
B.7 Car Marks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 458
B.8 French Baccalauréat Frequencies . . . . . . . . . . . . . . . . . . . . . . . . . 459
B.9 Journaux Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 460
B.10 U.S. Crime Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 461
B.11 Plasma Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 463
B.12 WAIS Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 464
B.13 ANOVA Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 466
B.14 Timebudget Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 467
B.15 Geopol Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 469
B.16 U.S. Health Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 471
B.17 Vocabulary Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 473
B.18 Athletic Records Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 475
B.19 Unemployment Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 477
B.20 Annual Population Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 478

Bibliography 479

Index 483
Preface

Most of the observable phenomena in the empirical sciences are of a multivariate nature.
In financial studies, assets in stock markets are observed simultaneously and their joint
development is analyzed to better understand general tendencies and to track indices. In
medicine, recorded observations of subjects in different locations are the basis of reliable
diagnoses and medication. In quantitative marketing, consumer preferences are collected in
order to construct models of consumer behavior. The underlying theoretical structure of
these and many other quantitative studies of applied sciences is multivariate. This book
on Applied Multivariate Statistical Analysis presents the tools and concepts of multivariate
data analysis with a strong focus on applications.
The aim of the book is to present multivariate data analysis in a way that is understandable
for non-mathematicians and practitioners who are confronted by statistical data analysis.
This is achieved by focusing on the practical relevance and through the e-book character of
this text. All practical examples may be recalculated and modified by the reader using a
standard web browser, without reference to or use of any specific software.
The book is divided into three main parts. The first part is devoted to graphical techniques
describing the distributions of the variables involved. The second part deals with multivariate
random variables and presents from a theoretical point of view distributions, estimators
and tests for various practical situations. The last part is on multivariate techniques and
introduces the reader to the wide selection of tools available for multivariate data analysis.
All data sets are given in the appendix and are downloadable from www.md-stat.com. The
text contains a wide variety of exercises the solutions of which are given in a separate
textbook. In addition, a full set of transparencies is provided on www.md-stat.com, making it
easier for an instructor to present the materials in this book. All transparencies contain
hyperlinks to the statistical web service so that students and instructors alike may recompute all
examples via a standard web browser.
The first section on descriptive techniques is on the construction of the boxplot. Here the
standard data sets on genuine and counterfeit bank notes and on the Boston housing data are
introduced. Flury faces are shown in Section 1.5, followed by the presentation of Andrews
curves and parallel coordinate plots. Histograms, kernel densities and scatterplots complete
the first part of the book. The reader is introduced to the concept of skewness and correlation
from a graphical point of view.
At the beginning of the second part of the book the reader goes on a short excursion into
matrix algebra. Covariances, correlation and the linear model are introduced. This section
is followed by the presentation of the ANOVA technique and its application to the multiple
linear model. In Chapter 4 the multivariate distributions are introduced and thereafter
specialized to the multinormal. The theory of estimation and testing ends the discussion on
multivariate random variables.
The third and last part of this book starts with a geometric decomposition of data matrices.
It is influenced by the French school of analyse de données. This geometric point of view
is linked to principal components analysis in Chapter 9. An important discussion on factor
analysis follows with a variety of examples from psychology and economics. The section on
cluster analysis deals with the various cluster techniques and leads naturally to the problem
of discriminant analysis. The next chapter deals with the detection of correspondence
between factors. The joint structure of data sets is presented in the chapter on canonical
correlation analysis and a practical study on prices and safety features of automobiles is
given. Next the important topic of multidimensional scaling is introduced, followed by the
tool of conjoint measurement analysis. Conjoint measurement analysis is often used
in psychology and marketing in order to measure preference orderings for certain goods.
The applications in finance (Chapter 17) are numerous. We present here the CAPM model
and discuss efficient portfolio allocations. The book closes with a presentation on highly
interactive, computationally intensive techniques.
This book is designed for the advanced bachelor and first year graduate student as well as
for the inexperienced data analyst who would like a tour of the various statistical tools in
a multivariate data analysis workshop. The experienced reader with a sound knowledge of
algebra will certainly skip some sections of the multivariate random variables part but will
hopefully enjoy the various mathematical roots of the multivariate techniques. A graduate
student might think that the first part on descriptive techniques is well known to him from his
training in introductory statistics. The mathematical and the applied parts of the book (II,
III) will certainly introduce him to the rich realm of multivariate statistical data analysis
modules.
The inexperienced computer user of this e-book is slowly introduced to an interdisciplinary
way of statistical thinking and will certainly enjoy the various practical examples. This
e-book is designed as an interactive document with various links to other features. The
complete e-book may be downloaded from www.xplore-stat.de using the license key given
on the last page of this book. Our e-book design offers a complete PDF and HTML file with
links to MD*Tech computing servers.
The reader of this book may therefore use all the presented methods and data via the local
XploRe Quantlet Server (XQS) without downloading or buying additional software. Such
XQ Servers may also be installed in a department or addressed freely on the web (see
www.i-xplore.de for more information).
A book of this kind would not have been possible without the help of many friends, colleagues
and students. For the technical production of the e-book we would like to thank
Jörg Feuerhake, Zdeněk Hlávka, Torsten Kleinow, Sigbert Klinke, Heiko Lehmann and Marlene
Müller. The book has been carefully read by Christian Hafner, Mia Huber, Stefan Sperlich and
Axel Werwatz. We would also like to thank Pavel Čížek, Isabelle De Macq, Holger Gerhardt,
Alena Myšičková and Manh Cuong Vu for the solutions to various statistical problems and
exercises. We thank Clemens Heine from Springer Verlag for continuous support and valuable
suggestions on the style of writing and on the contents covered.
W. Härdle and L. Simar
Berlin and Louvain-la-Neuve, August 2003
Part I

Descriptive Techniques
1 Comparison of Batches

Multivariate statistical analysis is concerned with analyzing and understanding data in high
dimensions. We suppose that we are given a set $\{x_i\}_{i=1}^n$ of $n$ observations of a variable vector
$X$ in $\mathbb{R}^p$. That is, we suppose that each observation $x_i$ has $p$ dimensions:

$$x_i = (x_{i1}, x_{i2}, \ldots, x_{ip}),$$

and that it is an observed value of a variable vector $X \in \mathbb{R}^p$. Therefore, $X$ is composed of $p$
random variables:

$$X = (X_1, X_2, \ldots, X_p)$$

where $X_j$, for $j = 1, \ldots, p$, is a one-dimensional random variable. How do we begin to
analyze this kind of data? Before we investigate questions on what inferences we can reach
from the data, we should think about how to look at the data. This involves descriptive
techniques. Questions that we could answer by descriptive techniques are:

• Are there components of X that are more spread out than others?

• Are there some elements of X that indicate subgroups of the data?

• Are there outliers in the components of X?

• How “normal” is the distribution of the data?

• Are there “low-dimensional” linear combinations of X that show “non-normal” behavior?

One difficulty of descriptive methods for high-dimensional data is the human perceptual
system. Point clouds in two dimensions are easy to understand and to interpret. With
modern interactive computing techniques we can view real-time 3D rotations
and thus also perceive three-dimensional data. A “sliding technique” as described in
Härdle and Scott (1992) may give insight into four-dimensional structures by presenting
dynamic 3D density contours as the fourth variable is changed over its range.
A qualitative jump in presentation difficulties occurs for dimensions greater than or equal to
5, unless the high-dimensional structure can be mapped into lower-dimensional components
(Klinke and Polzehl, 1995). Features like clustered subgroups or outliers, however, can be
detected using a purely graphical analysis.
In this chapter, we investigate the basic descriptive and graphical techniques allowing simple
exploratory data analysis. We begin the exploration of a data set using boxplots. A boxplot
is a simple univariate device that detects outliers component by component and that can
compare distributions of the data among different groups. Next, several multivariate techniques
are introduced (Flury faces, Andrews' curves and parallel coordinate plots) which
provide graphical displays addressing the questions formulated above. The advantages and
the disadvantages of each of these techniques are stressed.
Two basic techniques for estimating densities are also presented: histograms and kernel
densities. A density estimate gives a quick insight into the shape of the distribution of
the data. We show that kernel density estimates overcome some of the drawbacks of the
histograms.
Finally, scatterplots are shown to be very useful for plotting bivariate or trivariate variables
against each other: they help to understand the nature of the relationship among variables
in a data set and allow us to detect groups or clusters of points. Draftsman plots or matrix plots
are the visualization of several bivariate scatterplots on the same display. They help detect
structures in conditional dependences by brushing across the plots.



1.1 Boxplots

EXAMPLE 1.1 The Swiss bank data (see Appendix, Table B.2) consists of 200 measurements
on Swiss bank notes. The first half of these measurements are from genuine bank
notes, the other half are from counterfeit bank notes.
The authorities have measured, as indicated in Figure 1.1,

$X_1$ = length of the bill
$X_2$ = height of the bill (left)
$X_3$ = height of the bill (right)
$X_4$ = distance of the inner frame to the lower border
$X_5$ = distance of the inner frame to the upper border
$X_6$ = length of the diagonal of the central picture.



These data are taken from Flury and Riedwyl (1988). The aim is to study how these measurements
may be used in determining whether a bill is genuine or counterfeit.

Figure 1.1. An old Swiss 1000-franc bank note.

The boxplot is a graphical technique that displays the distribution of variables. It helps us
see the location, skewness, spread, tail length and outlying points.
It is particularly useful in comparing different batches. The boxplot is a graphical representation
of the Five Number Summary. To introduce the Five Number Summary, let us
consider for a moment a smaller, one-dimensional data set: the population of the 15 largest
U.S. cities in 1960 (Table 1.1).
In the Five Number Summary, we calculate the upper quartile $F_U$, the lower quartile $F_L$,
the median and the extremes. Recall that order statistics $\{x_{(1)}, x_{(2)}, \ldots, x_{(n)}\}$ are a set of
ordered values $x_1, x_2, \ldots, x_n$ where $x_{(1)}$ denotes the minimum and $x_{(n)}$ the maximum. The
median $M$ typically cuts the set of observations in two equal parts, and is defined as

$$M = \begin{cases} x_{\left(\frac{n+1}{2}\right)} & n \text{ odd} \\[4pt] \frac{1}{2}\left\{ x_{\left(\frac{n}{2}\right)} + x_{\left(\frac{n}{2}+1\right)} \right\} & n \text{ even.} \end{cases} \tag{1.1}$$


The quartiles cut the set into four equal parts, which are often called fourths (that is why we
use the letter $F$). Using a definition that goes back to Hoaglin, Mosteller and Tukey (1983),
the definition of a median can be generalized to fourths, eighths, etc. Considering the order
statistics, we can define the depth of a data value $x_{(i)}$ as $\min\{i, n - i + 1\}$. If $n$ is odd, the
depth of the median is $\frac{n+1}{2}$. If $n$ is even, $\frac{n+1}{2}$ is a fraction. Thus, the median is determined
to be the average between the two data values belonging to the next larger and smaller order
statistics, i.e., $M = \frac{1}{2}\left\{ x_{\left(\frac{n}{2}\right)} + x_{\left(\frac{n}{2}+1\right)} \right\}$. In our example, we have $n = 15$, hence the median
$M = x_{(8)} = 88$.
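
The median definition (1.1) and the notion of depth can be checked on the city data of Table 1.1 with a few lines of code. The following is a minimal Python sketch, not one of the XploRe quantlets referenced in this book; the function names `median_from_order_stats` and `depth` are chosen here purely for illustration.

```python
def median_from_order_stats(values):
    """Median as in (1.1), written in terms of 1-based order statistics."""
    x = sorted(values)                 # x[k - 1] plays the role of x_(k)
    n = len(x)
    if n % 2 == 1:
        return x[(n + 1) // 2 - 1]     # x_((n+1)/2) for odd n
    # average of x_(n/2) and x_(n/2 + 1) for even n
    return 0.5 * (x[n // 2 - 1] + x[n // 2])


def depth(i, n):
    """Depth of the i-th order statistic among n observations."""
    return min(i, n - i + 1)


# Populations (in 10,000) of the 15 largest U.S. cities in 1960, Table 1.1
cities = [778, 355, 248, 200, 167, 94, 94, 88, 76, 75, 74, 74, 70, 68, 63]

M = median_from_order_stats(cities)
print("median:", M, "at depth", depth(8, len(cities)))
```

For an even number of observations the same function averages the two middle order statistics, as the second case of (1.1) requires.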

City               Pop. (10,000)   Order Statistics
New York                778        $x_{(15)}$
Chicago                 355        $x_{(14)}$
Los Angeles             248        $x_{(13)}$
Philadelphia            200        $x_{(12)}$
Detroit                 167        $x_{(11)}$
Baltimore                94        $x_{(10)}$
Houston                  94        $x_{(9)}$
Cleveland                88        $x_{(8)}$
Washington D.C.          76        $x_{(7)}$
Saint Louis              75        $x_{(6)}$
Milwaukee                74        $x_{(5)}$
San Francisco            74        $x_{(4)}$
Boston                   70        $x_{(3)}$
Dallas                   68        $x_{(2)}$
New Orleans              63        $x_{(1)}$


Table 1.1. The 15 largest U.S. cities in 1960.

We proceed in the same way to get the fourths. Take the depth of the median and calculate

$$\text{depth of fourth} = \frac{[\text{depth of median}] + 1}{2}$$

with $[z]$ denoting the largest integer smaller than or equal to $z$. In our example this gives
4.5 and thus leads to the two fourths

$$F_L = \frac{1}{2}\left\{ x_{(4)} + x_{(5)} \right\}$$
$$F_U = \frac{1}{2}\left\{ x_{(11)} + x_{(12)} \right\}$$

(recalling that a depth which is a fraction corresponds to the average of the two nearest data
values).
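
The depth rule for the fourths can be traced step by step in code. The sketch below is an illustration in Python under the definitions just given; the helper names `value_at_depth` and `fourths` are hypothetical and are not taken from the book's software.

```python
import math


def value_at_depth(sorted_x, d, from_top=False):
    """Value at depth d; a fractional depth averages the two nearest values."""
    n = len(sorted_x)
    lo, hi = math.floor(d), math.ceil(d)
    if from_top:
        # depth d counted from the maximum is order statistic n - d + 1
        lo, hi = n - hi + 1, n - lo + 1
    return 0.5 * (sorted_x[lo - 1] + sorted_x[hi - 1])


def fourths(values):
    """Return (depth of fourth, F_L, F_U) following the depth rule."""
    x = sorted(values)
    n = len(x)
    depth_median = (n + 1) / 2
    depth_fourth = (math.floor(depth_median) + 1) / 2
    f_l = value_at_depth(x, depth_fourth)
    f_u = value_at_depth(x, depth_fourth, from_top=True)
    return depth_fourth, f_l, f_u


# Populations (in 10,000) of the 15 largest U.S. cities in 1960, Table 1.1
cities = [778, 355, 248, 200, 167, 94, 94, 88, 76, 75, 74, 74, 70, 68, 63]
print(fourths(cities))
```

Note that fourths defined by depth need not coincide with the quartiles returned by general-purpose statistics libraries, which often interpolate differently.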
The F-spread, $d_F$, is defined as $d_F = F_U - F_L$. The outside bars

$$F_U + 1.5 d_F \tag{1.2}$$
$$F_L - 1.5 d_F \tag{1.3}$$

are the borders beyond which a point is regarded as an outlier. For the number of points
outside these bars see Exercise 1.3.

   #  15        U.S. Cities
   M   8             88
   F   4.5      74       183.5
       1        63       778

Table 1.2. Five number summary.

For the $n = 15$ data points the fourths are $74 = \frac{1}{2}\left\{ x_{(4)} + x_{(5)} \right\}$ and
$183.5 = \frac{1}{2}\left\{ x_{(11)} + x_{(12)} \right\}$. Therefore the F-spread and the upper and
lower outside bars in the above example are calculated as follows:

$$d_F = F_U - F_L = 183.5 - 74 = 109.5 \tag{1.4}$$
$$F_L - 1.5 d_F = 74 - 1.5 \cdot 109.5 = -90.25 \tag{1.5}$$
$$F_U + 1.5 d_F = 183.5 + 1.5 \cdot 109.5 = 347.75. \tag{1.6}$$

Since New York and Chicago are beyond the outside bars they are considered to be outliers.
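
The arithmetic in (1.4) to (1.6) and the resulting outlier classification can be reproduced directly. This is a small Python sketch for checking the numbers; the dictionary `cities` simply restates Table 1.1 and the variable names are our own.

```python
# Populations (in 10,000) of the 15 largest U.S. cities in 1960, Table 1.1
cities = {
    "New York": 778, "Chicago": 355, "Los Angeles": 248, "Philadelphia": 200,
    "Detroit": 167, "Baltimore": 94, "Houston": 94, "Cleveland": 88,
    "Washington D.C.": 76, "Saint Louis": 75, "Milwaukee": 74,
    "San Francisco": 74, "Boston": 70, "Dallas": 68, "New Orleans": 63,
}

F_L, F_U = 74.0, 183.5             # fourths as computed in the text

d_F = F_U - F_L                    # F-spread, equation (1.4)
lower_bar = F_L - 1.5 * d_F        # lower outside bar, equation (1.5)
upper_bar = F_U + 1.5 * d_F        # upper outside bar, equation (1.6)

# A point beyond either outside bar is regarded as an outlier
outliers = sorted(
    name for name, pop in cities.items()
    if pop < lower_bar or pop > upper_bar
)

print(d_F, lower_bar, upper_bar, outliers)
```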
The minimum and the maximum are called the extremes. The mean is defined as

$$\bar{x} = n^{-1} \sum_{i=1}^n x_i,$$

which is 168.27 in our example. The mean is a measure of location. The median (88), the
fourths (74; 183.5) and the extremes (63; 778) constitute basic information about the data.
The combination of these five numbers leads to the Five Number Summary as displayed in
Table 1.2. The depths of each of the five numbers have been added as an additional column.


Construction of the Boxplot

1. Draw a box with borders (edges) at $F_L$ and $F_U$ (i.e., 50% of the data are in this box).

2. Draw the median as a solid line (|) and the mean as a dotted line.

3. Draw “whiskers” from each end of the box to the most remote point that is NOT an
outlier.

4. Show outliers as either “⋆” or “•” depending on whether they are outside of $F_{UL} \pm 1.5 d_F$
or $F_{UL} \pm 3 d_F$, respectively. Label them if possible.
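
The construction steps above can be sketched in code that computes the drawing elements without producing a plot. This Python example is an illustration only: the function `boxplot_elements` is hypothetical, and it reads step 4 as splitting outliers into those beyond the $1.5 d_F$ bars but within the $3 d_F$ bars, and those beyond the $3 d_F$ bars, which is our interpretation of the rule.

```python
def boxplot_elements(values, f_l, f_u):
    """Collect box, whisker ends and outliers for given fourths f_l, f_u."""
    d_f = f_u - f_l
    inner = (f_l - 1.5 * d_f, f_u + 1.5 * d_f)   # outside bars
    outer = (f_l - 3.0 * d_f, f_u + 3.0 * d_f)   # bars at 3 F-spreads

    inside = [v for v in values if inner[0] <= v <= inner[1]]
    beyond_inner = [v for v in values if v < inner[0] or v > inner[1]]
    mild = [v for v in beyond_inner if outer[0] <= v <= outer[1]]
    extreme = [v for v in beyond_inner if v < outer[0] or v > outer[1]]

    return {
        "box": (f_l, f_u),
        # whiskers reach the most remote points that are not outliers
        "whiskers": (min(inside), max(inside)),
        "mild_outliers": sorted(mild),
        "extreme_outliers": sorted(extreme),
    }


# Populations (in 10,000) of the 15 largest U.S. cities in 1960, Table 1.1
cities = [778, 355, 248, 200, 167, 94, 94, 88, 76, 75, 74, 74, 70, 68, 63]
elements = boxplot_elements(cities, 74.0, 183.5)
print(elements)
```

With the fourths from the text, the whiskers end at 63 (New Orleans) and 248 (Los Angeles), while Chicago and New York fall into the two outlier groups.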

Figure 1.2. Boxplot for U.S. cities. MVAboxcity.xpl


In the U.S. cities example the cutoff points (outside bars) are at $-90.25$ and $347.75$, hence we draw
whiskers to New Orleans and Los Angeles. We can see from Figure 1.2 that the data are
very skewed: the upper half of the data (above the median) is more spread out than the lower
half (below the median). The data contain two outliers marked as a star and a circle. The
more distinct outlier is shown as a star. The mean (as a non-robust measure of location) is
pulled away from the median.
Boxplots are very useful tools in comparing batches. The relative location of the distribution
of different batches tells us a lot about the batches themselves. Before we come back to the
Swiss bank data let us compare the fuel economy of vehicles from different countries, see
Figure 1.3 and Table B.3.
The data are from the second column of Table B.3 and show the mileage (miles per gallon)
of U.S. American, Japanese and European cars. The five-number summaries for these data
sets are {12, 16.8, 18.8, 22, 30}, {18, 22, 25, 30.5, 35}, and {14, 19, 23, 25, 28} for American,
Japanese, and European cars, respectively. This reflects the information shown in Figure 1.3.

Figure 1.3. Boxplot for the mileage of American, Japanese and European
cars (from left to right). MVAboxcar.xpl

The following conclusions can be made:

• Japanese cars achieve higher fuel efficiency than U.S. and European cars.
• There is one outlier, a very fuel-efficient car (VW-Rabbit Diesel).
• The main body of the U.S. car data (the box) lies below the Japanese car data.
• The worst Japanese car is more fuel-efficient than almost 50 percent of the U.S. cars.
• The spreads of the Japanese and the U.S. cars are almost equal.
• The median of the Japanese data is above that of the European data and the U.S.
data.

Figure 1.4. The $X_6$ variable of Swiss bank data (diagonal of bank notes).
MVAboxbank6.xpl

Now let us apply the boxplot technique to the bank data set. In Figure 1.4 we show
the parallel boxplot of the diagonal variable $X_6$. On the left is the value of the genuine
bank notes and on the right the value of the counterfeit bank notes. The two five-number
summaries are {140.65, 141.25, 141.5, 141.8, 142.4} for the genuine bank notes, and
{138.3, 139.2, 139.5, 139.8, 140.65} for the counterfeit ones.
One sees that the diagonals of the genuine bank notes tend to be larger. It is harder to see
a clear distinction when comparing the length of the bank notes X1 , see Figure 1.5. There
are a few outliers in both plots. Almost all the observations of the diagonal of the genuine
notes are above the ones from the counterfeit. There is one observation in Figure 1.4 of the
genuine notes that is almost equal to the median of the counterfeit notes. Can the parallel
boxplot technique help us distinguish between the two types of bank notes?

Figure 1.5. The $X_1$ variable of Swiss bank data (length of bank notes).
MVAboxbank1.xpl



Summary
• The median and mean bars are measures of location.
• The relative location of the median (and the mean) in the box is a measure
of skewness.
• The length of the box and whiskers are a measure of spread.
• The length of the whiskers indicates the tail length of the distribution.
• The outlying points are indicated with a “⋆” or “•” depending on whether they
are outside of $F_{UL} \pm 1.5 d_F$ or $F_{UL} \pm 3 d_F$, respectively.
• The boxplots do not indicate multimodality or clusters.
• If we compare the relative size and location of the boxes, we are comparing
distributions.



1.2 Histograms
Histograms are density estimates. A density estimate gives a good impression of the distribution
of the data. In contrast to boxplots, density estimates show possible multimodality
of the data. The idea is to locally represent the data density by counting the number of
observations in a sequence of consecutive intervals (bins) with origin $x_0$. Let $B_j(x_0, h)$ denote
the bin of length $h$ which is the element of a bin grid starting at $x_0$:

$$B_j(x_0, h) = [x_0 + (j - 1)h,\; x_0 + jh), \quad j \in \mathbb{Z},$$

where $[\cdot, \cdot)$ denotes a left-closed and right-open interval. If $\{x_i\}_{i=1}^n$ is an i.i.d. sample with
density $f$, the histogram is defined as follows:

$$\hat{f}_h(x) = n^{-1} h^{-1} \sum_{j \in \mathbb{Z}} \sum_{i=1}^n I\{x_i \in B_j(x_0, h)\}\, I\{x \in B_j(x_0, h)\}. \tag{1.7}$$

In sum (1.7) the first indicator function $I\{x_i \in B_j(x_0, h)\}$ (see Symbols & Notation in
Appendix A) counts the number of observations falling into bin $B_j(x_0, h)$. The second
indicator function is responsible for “localizing” the counts around $x$. The parameter $h$ is a
smoothing or localizing parameter and controls the width of the histogram bins. An $h$ that
is too large leads to very big blocks and thus to a very unstructured histogram. On the other
hand, an $h$ that is too small gives a very variable estimate with many unimportant peaks.
The effect of $h$ is given in detail in Figure 1.6. It contains the histogram (upper left) for the
diagonal of the counterfeit bank notes for $x_0 = 137.8$ (the minimum of these observations)
and $h = 0.1$. Increasing $h$ to $h = 0.2$ and using the same origin, $x_0 = 137.8$, results in
the histogram shown in the lower left of the figure. This density histogram is somewhat
smoother due to the larger $h$. The binwidth is next set to $h = 0.3$ (upper right). From this
histogram, one has the impression that the distribution of the diagonal is bimodal with peaks
at about 138.5 and 139.9. The detection of modes requires a fine tuning of the binwidth.
Using methods from smoothing methodology (H¨rdle, M¨ller, Sperlich and Werwatz, 2003)
a u
one can ¬nd an “optimal” binwidth h for n observations:
√ 1/3
24 π
hopt = .
n

Unfortunately, the binwidth h is not the only parameter determining the shapes of f .
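As a minimal sketch of definition (1.7), the following Python code counts the observations that share a bin with $x$ and rescales by $nh$. The function names and the toy data are illustrative assumptions and are not taken from the XploRe quantlets cited in the figure captions. The binwidth formula is transcribed from the text; it appears to correspond to a standard normal reference density, so for data on another scale one would presumably multiply by the standard deviation (an assumption that the text does not state).

```python
import math

def bin_index(x, x0, h):
    """Index j of the bin B_j(x0, h) = [x0 + (j-1)h, x0 + jh) that contains x."""
    return math.floor((x - x0) / h) + 1

def histogram_density(x, data, x0, h):
    """Histogram estimate (1.7): count of observations in the bin of x, divided by n*h."""
    j = bin_index(x, x0, h)
    count = sum(1 for xi in data if bin_index(xi, x0, h) == j)
    return count / (len(data) * h)

def optimal_binwidth(n):
    """Binwidth h_opt = (24 * sqrt(pi) / n) ** (1/3) as given in the text."""
    return (24.0 * math.sqrt(math.pi) / n) ** (1.0 / 3.0)

# Hypothetical toy data, not the bank note measurements.
data = [0.05, 0.15, 0.25, 0.35, 0.45, 1.10]

# Same binwidth, two origins: the bin containing x = 0.2 changes,
# and so does the estimate at that point.
print(histogram_density(0.2, data, x0=0.0, h=0.5))    # bin [0, 0.5) holds 5 of 6 points
print(histogram_density(0.2, data, x0=-0.25, h=0.5))  # bin [-0.25, 0.25) holds 2 of 6 points
print(optimal_binwidth(100))
```

The two calls with the same $h$ but different origins already hint at the origin dependence of the histogram.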
[Figure: four histogram panels titled "Swiss bank notes" (variable: diagonal), labelled h=0.1, h=0.2, h=0.3, h=0.4.]

Figure 1.6. Diagonal of counterfeit bank notes. Histograms with $x_0 = 137.8$
and $h = 0.1$ (upper left), $h = 0.2$ (lower left), $h = 0.3$ (upper right),
$h = 0.4$ (lower right). MVAhisbank1.xpl

In Figure 1.7, we show histograms with $x_0 = 137.65$ (upper left), $x_0 = 137.75$ (lower left),
$x_0 = 137.85$ (upper right), and $x_0 = 137.95$ (lower right). All the graphs have been
scaled equally on the y-axis to allow comparison. One sees that, despite the fixed binwidth
$h$, the interpretation is not facilitated. The shift of the origin $x_0$ (to 4 different locations)
created 4 different histograms. This property of histograms strongly contradicts the goal
of presenting data features. Obviously, the same data are represented quite differently by
the 4 histograms. A remedy has been proposed by Scott (1985): “Average the shifted
histograms!”. The result is presented in Figure 1.8. Here all bank note observations (genuine
and counterfeit) have been used. The averaged shifted histogram is no longer dependent on
the origin and shows a clear bimodality of the diagonals of the Swiss bank notes.
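The averaging idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the origins are taken to be equally spaced over one binwidth, $x_0, x_0 + h/M, \ldots, x_0 + (M-1)h/M$ for $M$ shifts, which is a common construction but is not spelled out in the text, and all function names are hypothetical.

```python
import math

def histogram_density(x, data, x0, h):
    """Plain histogram estimate with origin x0 and binwidth h."""
    j = math.floor((x - x0) / h)
    count = sum(1 for xi in data if math.floor((xi - x0) / h) == j)
    return count / (len(data) * h)

def averaged_shifted_histogram(x, data, h, shifts, x0=0.0):
    """Average of `shifts` histograms whose origins are spread evenly over one binwidth."""
    origins = [x0 + l * h / shifts for l in range(shifts)]
    return sum(histogram_density(x, data, o, h) for o in origins) / shifts

# A single observation at 0 makes the effect visible: with 2 shifts the
# estimate is a step function that is highest near the observation.
single = [0.0]
for x in (-0.75, -0.25, 0.25, 0.75):
    print(x, averaged_shifted_histogram(x, single, h=1.0, shifts=2))
```

With more shifts the steps become finer and the weights around each observation approach a triangle, which is one way to see the link between averaged shifted histograms and kernel estimates.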
[Figure: four histogram panels titled "Swiss bank notes" (variable: diagonal), labelled x0=137.65, x0=137.75, x0=137.85, x0=137.95.]

Figure 1.7. Diagonal of counterfeit bank notes. Histogram with $h = 0.4$
and origins $x_0 = 137.65$ (upper left), $x_0 = 137.75$ (lower left), $x_0 = 137.85$
(upper right), $x_0 = 137.95$ (lower right). MVAhisbank2.xpl



Summary
• Modes of the density are detected with a histogram.
• Modes correspond to strong peaks in the histogram.
• Histograms with the same $h$ need not be identical. They also depend on
the origin $x_0$ of the grid.
• The influence of the origin $x_0$ is drastic. Changing $x_0$ creates different
looking histograms.
• The consequence of an $h$ that is too large is an unstructured histogram
that is too flat.
• A binwidth $h$ that is too small results in an unstable histogram.
• There is an “optimal” $h = (24\sqrt{\pi}/n)^{1/3}$.
• It is recommended to use averaged histograms. They are kernel densities.



1.3 Kernel Densities
The major difficulties of histogram estimation may be summarized in four critiques:

• determination of the binwidth $h$, which controls the shape of the histogram,

• choice of the bin origin $x_0$, which also influences to some extent the shape,

• loss of information since observations are replaced by the central point of the interval
in which they fall,

• the underlying density function is often assumed to be smooth, but the histogram is
not smooth.

Rosenblatt (1956), Whittle (1958), and Parzen (1962) developed an approach which avoids
the last three difficulties. First, a smooth kernel function rather than a box is used as the
basic building block. Second, the smooth function is centered directly over each observation.
Let us study this refinement by supposing that $x$ is the center value of a bin. The histogram
can in fact be rewritten as

$$\hat{f}_h(x) = n^{-1} h^{-1} \sum_{i=1}^{n} I\left(|x - x_i| \le \frac{h}{2}\right). \tag{1.8}$$

If we define $K(u) = I(|u| \le \frac{1}{2})$, then (1.8) changes to

$$\hat{f}_h(x) = n^{-1} h^{-1} \sum_{i=1}^{n} K\left(\frac{x - x_i}{h}\right). \tag{1.9}$$

This is the general form of the kernel estimator. Allowing smoother kernel functions like the
quartic kernel,

$$K(u) = \frac{15}{16}(1 - u^2)^2\, I(|u| \le 1),$$

and computing $x$ not only at bin centers gives us the kernel density estimator. Kernel
estimators can also be derived via weighted averaging of rounded points (WARPing) or by
averaging histograms with different origins, see Scott (1985). Table 1.5 introduces some
commonly used kernels.
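The kernel estimator (1.9) translates almost literally into code. The sketch below is a hedged illustration: the function names are hypothetical, and the data are toy values rather than the bank note measurements. It includes the box kernel $K(u) = I(|u| \le \frac{1}{2})$ that reproduces (1.8) and the quartic kernel quoted above.

```python
def box_kernel(u):
    """K(u) = I(|u| <= 1/2), the kernel that turns (1.9) back into (1.8)."""
    return 1.0 if abs(u) <= 0.5 else 0.0

def quartic_kernel(u):
    """K(u) = 15/16 * (1 - u^2)^2 * I(|u| <= 1)."""
    return 15.0 / 16.0 * (1.0 - u * u) ** 2 if abs(u) <= 1.0 else 0.0

def kernel_density(x, data, h, kernel=quartic_kernel):
    """Kernel estimate (1.9): n^-1 h^-1 * sum_i K((x - x_i) / h)."""
    return sum(kernel((x - xi) / h) for xi in data) / (len(data) * h)

# Toy data for illustration only.
data = [0.0, 0.4, 2.0]
print(kernel_density(0.1, data, h=1.0, kernel=box_kernel))  # two of three points within h/2
print(kernel_density(0.1, data, h=1.0))                     # quartic kernel, smooth in x
```

Because the kernel is passed as an argument, any of the kernels in Table 1.5 could be substituted without changing `kernel_density`.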
[Figure: four panels titled "Swiss bank notes" (variable: diagonal), labelled 2 shifts, 4 shifts, 8 shifts, 16 shifts.]

Figure 1.8. Averaged shifted histograms based on all (counterfeit and genuine)
Swiss bank notes: there are 2 shifts (upper left), 4 shifts (lower left),
8 shifts (upper right), and 16 shifts (lower right). MVAashbank.xpl

K(•)                                                                      Kernel
$K(u) = \frac{1}{2}\, I(|u| \le 1)$                                       Uniform
$K(u) = (1 - |u|)\, I(|u| \le 1)$                                         Triangle
$K(u) = \frac{3}{4}(1 - u^2)\, I(|u| \le 1)$                              Epanechnikov
$K(u) = \frac{15}{16}(1 - u^2)^2\, I(|u| \le 1)$                          Quartic (Biweight)
$K(u) = \frac{1}{\sqrt{2\pi}} \exp(-\frac{u^2}{2}) = \varphi(u)$          Gaussian

Table 1.5. Kernel functions.

Different kernels generate different shapes of the estimated density. The most important
parameter is the so-called bandwidth $h$, which can be optimized, for example, by cross-validation;
see Härdle (1991) for details. The cross-validation method minimizes the integrated squared
error. This measure of discrepancy is based on the squared differences $\{\hat{f}_h(x) - f(x)\}^2$.
[Figure: "Swiss bank notes"; vertical axis: density estimates for diagonals; horizontal axis: counterfeit / genuine, 138 to 142.]

Figure 1.9. Densities of the diagonals of genuine and counterfeit bank
notes. Automatic density estimates. MVAdenbank.xpl

Averaging these squared deviations over a grid of points $\{x_l\}_{l=1}^{L}$ leads to

$$L^{-1} \sum_{l=1}^{L} \left\{\hat{f}_h(x_l) - f(x_l)\right\}^2.$$

Asymptotically, if this grid size tends to zero, we obtain the integrated squared error:

$$\int \left\{\hat{f}_h(x) - f(x)\right\}^2 dx.$$

In practice, it turns out that the method consists of selecting a bandwidth that minimizes
the cross-validation function

$$\int \hat{f}_h^2 - \frac{2}{n} \sum_{i=1}^{n} \hat{f}_{h,i}(x_i),$$

where $\hat{f}_{h,i}$ is the density estimate obtained by using all datapoints except for the $i$-th
observation. Both terms in the above function involve double sums. Computation may therefore
be slow. There are many other density bandwidth selection methods. Probably the fastest
way to choose a bandwidth is to refer to some reasonable reference distribution. The idea of using
the Normal distribution as a reference, for example, goes back to Silverman (1986). The
resulting choice of $h$ is called the rule of thumb.

[Figure: contour plot with axes X (9 to 12) and Y (138 to 142).]

Figure 1.10. Contours of the density of $X_4$ and $X_6$ of genuine and counterfeit
bank notes. MVAcontbank2.xpl
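To make the cross-validation idea described above concrete, here is a rough Python sketch for the Gaussian kernel. It rests on several assumptions that go beyond the text: the second term is normalized by $2/n$ (the usual least-squares form), the integral of $\hat{f}_h^2$ is evaluated in closed form using the fact that the convolution of two standard Gaussian densities is the $N(0, 2)$ density, and the bandwidth is searched over a small hypothetical grid. The data are simulated, not the bank note measurements.

```python
import math
import random

def gauss(u):
    """Standard Gaussian kernel."""
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)

def gauss_conv(u):
    """Convolution of the Gaussian kernel with itself, i.e. the N(0, 2) density."""
    return math.exp(-0.25 * u * u) / math.sqrt(4.0 * math.pi)

def cv_score(data, h):
    """Integral of fhat^2 minus (2/n) * sum of leave-one-out estimates at the data points."""
    n = len(data)
    # Closed-form integral of the squared estimate: a double sum over all pairs.
    int_f2 = sum(gauss_conv((xi - xj) / h) for xi in data for xj in data) / (n * n * h)
    # Sum over i of the estimate at x_i computed without the i-th observation.
    loo_sum = sum(gauss((xi - xj) / h)
                  for i, xi in enumerate(data)
                  for j, xj in enumerate(data) if i != j) / ((n - 1) * h)
    return int_f2 - 2.0 * loo_sum / n

def cv_bandwidth(data, candidates):
    """Candidate bandwidth with the smallest cross-validation score."""
    return min(candidates, key=lambda h: cv_score(data, h))

rng = random.Random(0)
sample = [rng.gauss(0.0, 1.0) for _ in range(50)]   # simulated data
grid = [0.1 * k for k in range(1, 16)]              # hypothetical search grid
print(cv_bandwidth(sample, grid))
```

Both double sums run over all pairs of observations, which illustrates why the computation becomes slow for large samples.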
For the Gaussian kernel from Table 1.5 and a Normal reference distribution, the rule of
thumb is to choose

$$h_G = 1.06\, \hat{\sigma}\, n^{-1/5} \tag{1.10}$$

where $\hat{\sigma} = \sqrt{n^{-1} \sum_{i=1}^{n} (x_i - \bar{x})^2}$ denotes the sample standard deviation. This choice of $h_G$
optimizes the integrated squared distance between the estimator and the true density. For
the quartic kernel, we need to transform (1.10). The modified rule of thumb is:

$$h_Q = 2.62 \cdot h_G. \tag{1.11}$$
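The two rules (1.10) and (1.11) are simple enough to compute directly. The following sketch uses hypothetical function names and a toy sample; the standard deviation is computed with the divisor $n$, as in the definition accompanying (1.10).

```python
import math

def rule_of_thumb_gaussian(data):
    """h_G = 1.06 * sigma * n^(-1/5), with sigma the sample standard deviation (divisor n)."""
    n = len(data)
    mean = sum(data) / n
    sigma = math.sqrt(sum((xi - mean) ** 2 for xi in data) / n)
    return 1.06 * sigma * n ** (-1.0 / 5.0)

def rule_of_thumb_quartic(data):
    """h_Q = 2.62 * h_G, the rescaled bandwidth for the quartic kernel."""
    return 2.62 * rule_of_thumb_gaussian(data)

# Toy sample for illustration only.
sample = [1.0, 2.0, 3.0, 4.0, 5.0]
print(rule_of_thumb_gaussian(sample))
print(rule_of_thumb_quartic(sample))
```

Unlike cross-validation, this requires a single pass over the data, which is why it is attractive as a quick default.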

Figure 1.9 shows the automatic density estimates for the diagonals of the counterfeit and
genuine bank notes. The density on the left is the density corresponding to the diagonal
of the counterfeit data. The separation is clearly visible, but there is also an overlap. The
problem of distinguishing between the counterfeit and genuine bank notes is not solved by
just looking at the diagonals of the notes! The question arises whether a better separation
could be achieved using not only the diagonals but one or two more variables of the data
set. The estimation of higher dimensional densities is analogous to that of one-dimensional densities.
We show a two dimensional density estimate for $X_4$ and $X_6$ in Figure 1.10. The contour
lines indicate the height of the density. One sees two separate distributions in this higher
dimensional space, but they still overlap to some extent.




Figure 1.11. Contours of the density of $X_4$, $X_5$, $X_6$ of genuine and counterfeit
bank notes. MVAcontbank3.xpl

We can add one more dimension and give a graphical representation of a three dimensional
density estimate, or more precisely an estimate of the joint distribution of $X_4$, $X_5$ and $X_6$.
Figure 1.11 shows the contour areas at 3 different levels of the density: 0.2 (light grey), 0.4
(grey), and 0.6 (black) of this three dimensional density estimate. One can clearly recognize
two “ellipsoids” (at each level), but as before, they overlap. In Chapter 12 we will learn
how to separate the two ellipsoids and how to develop a discrimination rule to distinguish
between these data points.



Summary
• Kernel densities estimate distribution densities by the kernel method.
• The bandwidth $h$ determines the degree of smoothness of the estimate $\hat{f}$.
• Kernel densities are smooth functions and they can graphically represent
distributions (up to 3 dimensions).
• A simple (but not necessarily correct) way to find a good bandwidth is to
compute the rule of thumb bandwidth $h_G = 1.06\,\hat{\sigma}\, n^{-1/5}$. This bandwidth
is to be used only in combination with a Gaussian kernel $\varphi$.
• Kernel density estimates are a good descriptive tool for seeing modes,
location, skewness, tails, asymmetry, etc.




1.4 Scatterplots

Scatterplots are bivariate or trivariate plots of variables against each other. They help us
understand relationships among the variables of a data set. A downward-sloping scatter
indicates that as we increase the variable on the horizontal axis, the variable on the vertical
axis decreases. An analogous statement can be made for upward-sloping scatters.
Figure 1.12 plots the 5th column (upper inner frame) of the bank data against the 6th
column (diagonal). The scatter is downward-sloping. As we already know from the previous
section on marginal comparison (e.g., Figure 1.9) a good separation between genuine and
counterfeit bank notes is visible for the diagonal variable. The sub-cloud in the upper half
(circles) of Figure 1.12 corresponds to the true bank notes. As noted before, this separation
is not distinct, since the two groups overlap somewhat.
This can be verified in an interactive computing environment by showing the index and
coordinates of certain points in this scatterplot. In Figure 1.12, the 70th observation in
the merged data set is given as a thick circle, and it is from a genuine bank note. This
observation lies well embedded in the cloud of counterfeit bank notes. One straightforward
approach that could be used to tell the counterfeit from the genuine bank notes is to draw
a straight line and define notes above this value as genuine. We would of course misclassify
the 70th observation, but can we do better?
[Figure: "Swiss bank notes"; horizontal axis: upper inner frame (X5); vertical axis: diagonal (X6).]

Figure 1.12. 2D scatterplot for $X_5$ vs. $X_6$ of the bank notes. Genuine
notes are circles, counterfeit notes are stars. MVAscabank56.xpl

If we extend the two-dimensional scatterplot by adding a third variable, e.g., $X_4$ (lower
distance to inner frame), we obtain the scatterplot in three dimensions as shown in Figure 1.13.
It becomes apparent from the location of the point clouds that a better separation
is obtained. We have rotated the three dimensional data until this satisfactory 3D view
was obtained. Later, we will see that rotation is the same as bundling a high-dimensional
observation into one or more linear combinations of the elements of the observation vector.
In other words, the “separation line” parallel to the horizontal coordinate axis in Figure 1.12
is in Figure 1.13 a plane and no longer parallel to one of the axes. The formula for such a
separation plane is a linear combination of the elements of the observation vector:

$$a_1 x_1 + a_2 x_2 + \ldots + a_6 x_6 = \text{const}. \tag{1.12}$$

The algorithm that automatically finds the weights $(a_1, \ldots, a_6)$ will be investigated later on
in Chapter 12.
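A separation rule of the form (1.12) amounts to comparing a weighted sum with a constant. The sketch below only illustrates that mechanism: the weights, the threshold, the orientation (scores above the constant count as genuine) and the two measurement vectors are all made up for the example and are not estimated from the bank data.

```python
def linear_score(x, a):
    """Left-hand side of (1.12): a_1*x_1 + ... + a_p*x_p."""
    return sum(ai * xi for ai, xi in zip(a, x))

def classify(x, a, const):
    # Assumed orientation: scores above the constant are labelled genuine.
    return "genuine" if linear_score(x, a) > const else "counterfeit"

# A horizontal separation line as in Figure 1.12: only the diagonal X6 gets weight.
a_line = [0.0, 0.0, 0.0, 0.0, 0.0, 1.0]
threshold = 140.5   # hypothetical cut-off, chosen only for illustration

# Made-up measurement vectors (X1, ..., X6), not taken from the bank data set.
note_a = [215.0, 130.0, 130.0, 8.5, 10.0, 141.5]
note_b = [214.8, 130.3, 130.2, 10.5, 11.0, 139.5]
print(classify(note_a, a_line, threshold))
print(classify(note_b, a_line, threshold))
```

Replacing `a_line` by a vector with several nonzero weights turns the axis-parallel line into a general plane, which is the situation described for Figure 1.13.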
Let us study yet another technique: the scatterplot matrix. If we want to draw all possible
two-dimensional scatterplots for the variables, we can create a so-called draftman's plot
(named after a draftman who prepares drafts for parliamentary discussions). Similar to a
draftman's plot, the scatterplot matrix helps in creating new ideas and in building knowledge
about dependencies and structure.

[Figure: three-dimensional scatterplot titled "Swiss bank notes".]

Figure 1.13. 3D Scatterplot of the bank notes for $(X_4, X_5, X_6)$. Genuine
notes are circles, counterfeit are stars. MVAscabank456.xpl
Figure 1.14 shows a draftman plot applied to the last four columns of the full bank data
set. For ease of interpretation we have distinguished between the group of counterfeit and
genuine bank notes by a different color. As discussed several times before, the separability of
the two types of notes is different for different scatterplots. Not only is it difficult to perform
this separation on, say, scatterplot $X_3$ vs. $X_4$, in addition the “separation line” is no longer
parallel to one of the axes. The most obvious separation happens in the scatterplot in the
lower right where we show, as in Figure 1.12, $X_5$ vs. $X_6$. The separation line here would be
upward-sloping with an intercept at about $X_6 = 139$. The upper right half of the draftman
plot shows the density contours that we have introduced in Section 1.3.
The power of the draftman plot lies in its ability to show the internal connections of the
scatter diagrams. Define a brush as a re-scalable rectangle that we can move via keyboard
[Figure: scatterplot matrix panels "Var 3" and "Var 4" with axes X and Y.]