Applied Multivariate Statistical Analysis*

Wolfgang Härdle

Léopold Simar

* Version: 22nd October 2003

Please note: this is only a sample of the full book. The complete book can be downloaded from the e-book page of XploRe: http://www.xplore-stat.de/ebooks/ebooks.html

For further information please contact MD*Tech at mdtech@mdtech.de

Contents

I Descriptive Techniques 11

1 Comparison of Batches 13

1.1 Boxplots . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14

1.2 Histograms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22

1.3 Kernel Densities . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25

1.4 Scatterplots . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30

1.5 Chernoff-Flury Faces . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34

1.6 Andrews' Curves . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39

1.7 Parallel Coordinates Plots . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42

1.8 Boston Housing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44

1.9 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52

II Multivariate Random Variables 55

2 A Short Excursion into Matrix Algebra 57

2.1 Elementary Operations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57

2.2 Spectral Decompositions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63

2.3 Quadratic Forms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65

2.4 Derivatives . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68

2.5 Partitioned Matrices . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68

2.6 Geometrical Aspects . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71

2.7 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79

3 Moving to Higher Dimensions 81

3.1 Covariance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82

3.2 Correlation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86

3.3 Summary Statistics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92

3.4 Linear Model for Two Variables . . . . . . . . . . . . . . . . . . . . . . . . . 95

3.5 Simple Analysis of Variance . . . . . . . . . . . . . . . . . . . . . . . . . . . 103

3.6 Multiple Linear Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 108

3.7 Boston Housing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 112

3.8 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 115

4 Multivariate Distributions 119

4.1 Distribution and Density Function . . . . . . . . . . . . . . . . . . . . . . . . 120

4.2 Moments and Characteristic Functions . . . . . . . . . . . . . . . . . . . . . 125

4.3 Transformations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 135

4.4 The Multinormal Distribution . . . . . . . . . . . . . . . . . . . . . . . . . . 137

4.5 Sampling Distributions and Limit Theorems . . . . . . . . . . . . . . . . . . 142

4.6 Bootstrap . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 148

4.7 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 152

5 Theory of the Multinormal 155

5.1 Elementary Properties of the Multinormal . . . . . . . . . . . . . . . . . . . 155

5.2 The Wishart Distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . 162

5.3 Hotelling Distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 165

5.4 Spherical and Elliptical Distributions . . . . . . . . . . . . . . . . . . . . . . 167

5.5 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 169

6 Theory of Estimation 173

6.1 The Likelihood Function . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 174

6.2 The Cramer-Rao Lower Bound . . . . . . . . . . . . . . . . . . . . . . . . . 178

6.3 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 181

7 Hypothesis Testing 183

7.1 Likelihood Ratio Test . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 184

7.2 Linear Hypothesis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 192

7.3 Boston Housing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 209

7.4 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 212

III Multivariate Techniques 217

8 Decomposition of Data Matrices by Factors 219

8.1 The Geometric Point of View . . . . . . . . . . . . . . . . . . . . . . . . . . 220

8.2 Fitting the p-dimensional Point Cloud . . . . . . . . . . . . . . . . . . . . . 221

8.3 Fitting the n-dimensional Point Cloud . . . . . . . . . . . . . . . . . . . . . 225

8.4 Relations between Subspaces . . . . . . . . . . . . . . . . . . . . . . . . . . . 227

8.5 Practical Computation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 228

8.6 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 232

9 Principal Components Analysis 233

9.1 Standardized Linear Combinations . . . . . . . . . . . . . . . . . . . . . . . 234

9.2 Principal Components in Practice . . . . . . . . . . . . . . . . . . . . . . . . 238

9.3 Interpretation of the PCs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 241

9.4 Asymptotic Properties of the PCs . . . . . . . . . . . . . . . . . . . . . . . . 246

9.5 Normalized Principal Components Analysis . . . . . . . . . . . . . . . . . . . 249

9.6 Principal Components as a Factorial Method . . . . . . . . . . . . . . . . . . 250

9.7 Common Principal Components . . . . . . . . . . . . . . . . . . . . . . . . . 256

9.8 Boston Housing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 259

9.9 More Examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 261

9.10 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 272

10 Factor Analysis 275

10.1 The Orthogonal Factor Model . . . . . . . . . . . . . . . . . . . . . . . . . . 275

10.2 Estimation of the Factor Model . . . . . . . . . . . . . . . . . . . . . . . . . 282

10.3 Factor Scores and Strategies . . . . . . . . . . . . . . . . . . . . . . . . . . . 291

10.4 Boston Housing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 293

10.5 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 298

11 Cluster Analysis 301

11.1 The Problem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 301

11.2 The Proximity between Objects . . . . . . . . . . . . . . . . . . . . . . . . . 302

11.3 Cluster Algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 308

11.4 Boston Housing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 316

11.5 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 318

12 Discriminant Analysis 323

12.1 Allocation Rules for Known Distributions . . . . . . . . . . . . . . . . . . . . 323

12.2 Discrimination Rules in Practice . . . . . . . . . . . . . . . . . . . . . . . . . 331

12.3 Boston Housing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 337

12.4 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 339

13 Correspondence Analysis 341

13.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 341

13.2 Chi-square Decomposition . . . . . . . . . . . . . . . . . . . . . . . . . . . . 344

13.3 Correspondence Analysis in Practice . . . . . . . . . . . . . . . . . . . . . . 347

13.4 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 358

14 Canonical Correlation Analysis 361

14.1 Most Interesting Linear Combination . . . . . . . . . . . . . . . . . . . . . . 361

14.2 Canonical Correlation in Practice . . . . . . . . . . . . . . . . . . . . . . . . 366

14.3 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 372

15 Multidimensional Scaling 373

15.1 The Problem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 373

15.2 Metric Multidimensional Scaling . . . . . . . . . . . . . . . . . . . . . . . . . 379

15.2.1 The Classical Solution . . . . . . . . . . . . . . . . . . . . . . . . . . 379

15.3 Nonmetric Multidimensional Scaling . . . . . . . . . . . . . . . . . . . . . . 383

15.3.1 Shepard-Kruskal algorithm . . . . . . . . . . . . . . . . . . . . . . . . 384

15.4 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 391

16 Conjoint Measurement Analysis 393

16.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 393

16.2 Design of Data Generation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 395

16.3 Estimation of Preference Orderings . . . . . . . . . . . . . . . . . . . . . . . 398

16.4 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 405

17 Applications in Finance 407

17.1 Portfolio Choice . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 407

17.2 Efficient Portfolio . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 408

17.3 Efficient Portfolios in Practice . . . . . . . . . . . . . . . . . . . . . . . . . . 415

17.4 The Capital Asset Pricing Model (CAPM) . . . . . . . . . . . . . . . . . . . 417

17.5 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 418

18 Highly Interactive, Computationally Intensive Techniques 421

18.1 Simplicial Depth . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 421

18.2 Projection Pursuit . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 425

18.3 Sliced Inverse Regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 431

18.4 Boston Housing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 439

18.5 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 440

A Symbols and Notation 443

B Data 447

B.1 Boston Housing Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 447

B.2 Swiss Bank Notes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 448

B.3 Car Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 452

B.4 Classic Blue Pullovers Data . . . . . . . . . . . . . . . . . . . . . . . . . . . 454

B.5 U.S. Companies Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 455

B.6 French Food Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 457

B.7 Car Marks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 458

B.8 French Baccalauréat Frequencies . . . . . . . . . . . . . . . . . . . . . . . . . 459

B.9 Journaux Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 460

B.10 U.S. Crime Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 461

B.11 Plasma Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 463

B.12 WAIS Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 464

B.13 ANOVA Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 466

B.14 Timebudget Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 467

B.15 Geopol Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 469

B.16 U.S. Health Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 471

B.17 Vocabulary Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 473

B.18 Athletic Records Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 475

B.19 Unemployment Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 477

B.20 Annual Population Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 478

Bibliography 479

Index 483

Preface

Most of the observable phenomena in the empirical sciences are of a multivariate nature. In financial studies, assets in stock markets are observed simultaneously and their joint development is analyzed to better understand general tendencies and to track indices. In medicine, recorded observations of subjects in different locations are the basis of reliable diagnoses and medication. In quantitative marketing, consumer preferences are collected in order to construct models of consumer behavior. The underlying theoretical structure of these and many other quantitative studies of applied sciences is multivariate. This book on Applied Multivariate Statistical Analysis presents the tools and concepts of multivariate data analysis with a strong focus on applications.

The aim of the book is to present multivariate data analysis in a way that is understandable for non-mathematicians and practitioners who are confronted by statistical data analysis. This is achieved by focusing on the practical relevance and through the e-book character of this text. All practical examples may be recalculated and modified by the reader using a standard web browser, without reference to or application of any specific software.

The book is divided into three main parts. The first part is devoted to graphical techniques describing the distributions of the variables involved. The second part deals with multivariate random variables and presents, from a theoretical point of view, distributions, estimators and tests for various practical situations. The last part is on multivariate techniques and introduces the reader to the wide selection of tools available for multivariate data analysis.

All data sets are given in the appendix and are downloadable from www.md-stat.com. The text contains a wide variety of exercises, the solutions of which are given in a separate textbook. In addition, a full set of transparencies is provided on www.md-stat.com, making it easier for an instructor to present the materials in this book. All transparencies contain hyperlinks to the statistical web service so that students and instructors alike may recompute all examples via a standard web browser.

The first section on descriptive techniques is on the construction of the boxplot. Here the standard data sets on genuine and counterfeit bank notes and on the Boston housing data are introduced. Flury faces are shown in Section 1.5, followed by the presentation of Andrews curves and parallel coordinate plots. Histograms, kernel densities and scatterplots complete the first part of the book. The reader is introduced to the concept of skewness and correlation from a graphical point of view.

At the beginning of the second part of the book the reader goes on a short excursion into

matrix algebra. Covariances, correlation and the linear model are introduced. This section

is followed by the presentation of the ANOVA technique and its application to the multiple

linear model. In Chapter 4 the multivariate distributions are introduced and thereafter

specialized to the multinormal. The theory of estimation and testing ends the discussion on

multivariate random variables.

The third and last part of this book starts with a geometric decomposition of data matrices. It is influenced by the French school of analyse de données. This geometric point of view is linked to principal components analysis in Chapter 9. An important discussion on factor analysis follows with a variety of examples from psychology and economics. The section on cluster analysis deals with the various cluster techniques and leads naturally to the problem of discriminant analysis. The next chapter deals with the detection of correspondence between factors. The joint structure of data sets is presented in the chapter on canonical correlation analysis and a practical study on prices and safety features of automobiles is given. Next the important topic of multidimensional scaling is introduced, followed by the tool of conjoint measurement analysis. Conjoint measurement analysis is often used in psychology and marketing in order to measure preference orderings for certain goods. The applications in finance (Chapter 17) are numerous. We present here the CAPM model and discuss efficient portfolio allocations. The book closes with a presentation on highly interactive, computationally intensive techniques.

This book is designed for the advanced bachelor and first year graduate student as well as for the inexperienced data analyst who would like a tour of the various statistical tools in a multivariate data analysis workshop. The experienced reader with a sound knowledge of algebra will certainly skip some sections of the multivariate random variables part but will hopefully enjoy the various mathematical roots of the multivariate techniques. A graduate student might think that the first part on descriptive techniques is well known to him from his training in introductory statistics. The mathematical and the applied parts of the book (II, III) will certainly introduce him to the rich realm of multivariate statistical data analysis modules.

The inexperienced computer user of this e-book is slowly introduced to an interdisciplinary way of statistical thinking and will certainly enjoy the various practical examples. This e-book is designed as an interactive document with various links to other features. The complete e-book may be downloaded from www.xplore-stat.de using the license key given on the last page of this book. Our e-book design offers a complete PDF and HTML file with links to MD*Tech computing servers.

The reader of this book may therefore use all the presented methods and data via the local XploRe Quantlet Server (XQS) without downloading or buying additional software. Such XQ Servers may also be installed in a department or addressed freely on the web (see www.i-xplore.de for more information).

A book of this kind would not have been possible without the help of many friends, colleagues and students. For the technical production of the e-book we would like to thank Jörg Feuerhake, Zdeněk Hlávka, Torsten Kleinow, Sigbert Klinke, Heiko Lehmann and Marlene Müller. The book has been carefully read by Christian Hafner, Mia Huber, Stefan Sperlich and Axel Werwatz. We would also like to thank Pavel Čížek, Isabelle De Macq, Holger Gerhardt, Alena Myšičková and Manh Cuong Vu for the solutions to various statistical problems and exercises. We thank Clemens Heine from Springer Verlag for continuous support and valuable suggestions on the style of writing and on the contents covered.

W. Härdle and L. Simar

Berlin and Louvain-la-Neuve, August 2003

Part I

Descriptive Techniques

1 Comparison of Batches

Multivariate statistical analysis is concerned with analyzing and understanding data in high dimensions. We suppose that we are given a set $\{x_i\}_{i=1}^n$ of $n$ observations of a variable vector $X$ in $\mathbb{R}^p$. That is, we suppose that each observation $x_i$ has $p$ dimensions:

$$x_i = (x_{i1}, x_{i2}, \ldots, x_{ip}),$$

and that it is an observed value of a variable vector $X \in \mathbb{R}^p$. Therefore, $X$ is composed of $p$ random variables:

$$X = (X_1, X_2, \ldots, X_p)$$

where $X_j$, for $j = 1, \ldots, p$, is a one-dimensional random variable. How do we begin to analyze this kind of data? Before we investigate questions on what inferences we can reach from the data, we should think about how to look at the data. This involves descriptive techniques. Questions that we could answer by descriptive techniques are:

• Are there components of $X$ that are more spread out than others?

• Are there some elements of $X$ that indicate subgroups of the data?

• Are there outliers in the components of $X$?

• How “normal” is the distribution of the data?

• Are there “low-dimensional” linear combinations of $X$ that show “non-normal” behavior?

One difficulty of descriptive methods for high-dimensional data is the human perceptual system. Point clouds in two dimensions are easy to understand and to interpret. With modern interactive computing techniques we can view real-time 3D rotations and thus also perceive three-dimensional data. A “sliding technique” as described in Härdle and Scott (1992) may give insight into four-dimensional structures by presenting dynamic 3D density contours as the fourth variable is changed over its range.

A qualitative jump in presentation difficulties occurs for dimensions greater than or equal to 5, unless the high-dimensional structure can be mapped into lower-dimensional components (Klinke and Polzehl, 1995). Features like clustered subgroups or outliers, however, can be detected using a purely graphical analysis.

In this chapter, we investigate the basic descriptive and graphical techniques allowing simple exploratory data analysis. We begin the exploration of a data set using boxplots. A boxplot is a simple univariate device that detects outliers component by component and that can compare distributions of the data among different groups. Next, several multivariate techniques are introduced (Flury faces, Andrews' curves and parallel coordinate plots) which provide graphical displays addressing the questions formulated above. The advantages and the disadvantages of each of these techniques are stressed.

Two basic techniques for estimating densities are also presented: histograms and kernel

densities. A density estimate gives a quick insight into the shape of the distribution of

the data. We show that kernel density estimates overcome some of the drawbacks of the

histograms.

Finally, scatterplots are shown to be very useful for plotting bivariate or trivariate variables against each other: they help us understand the nature of the relationship among variables in a data set and allow us to detect groups or clusters of points. Draftsman plots or matrix plots are the visualization of several bivariate scatterplots on the same display. They help detect structures in conditional dependences by brushing across the plots.

1.1 Boxplots

EXAMPLE 1.1 The Swiss bank data (see Appendix, Table B.2) consists of 200 measurements on Swiss bank notes. The first half of these measurements are from genuine bank notes, the other half are from counterfeit bank notes.

The authorities have measured, as indicated in Figure 1.1,

$X_1$ = length of the bill

$X_2$ = height of the bill (left)

$X_3$ = height of the bill (right)

$X_4$ = distance of the inner frame to the lower border

$X_5$ = distance of the inner frame to the upper border

$X_6$ = length of the diagonal of the central picture.

These data are taken from Flury and Riedwyl (1988). The aim is to study how these measurements may be used in determining whether a bill is genuine or counterfeit.

Figure 1.1. An old Swiss 1000-franc bank note.

The boxplot is a graphical technique that displays the distribution of variables. It helps us see the location, skewness, spread, tail length and outlying points.

It is particularly useful in comparing different batches. The boxplot is a graphical representation of the Five Number Summary. To introduce the Five Number Summary, let us consider for a moment a smaller, one-dimensional data set: the population of the 15 largest U.S. cities in 1960 (Table 1.1).

In the Five Number Summary, we calculate the upper quartile $F_U$, the lower quartile $F_L$, the median and the extremes. Recall that order statistics $\{x_{(1)}, x_{(2)}, \ldots, x_{(n)}\}$ are a set of ordered values $x_1, x_2, \ldots, x_n$ where $x_{(1)}$ denotes the minimum and $x_{(n)}$ the maximum. The median $M$ typically cuts the set of observations in two equal parts, and is defined as

$$M = \begin{cases} x_{\left(\frac{n+1}{2}\right)} & n \text{ odd} \\[4pt] \frac{1}{2}\left\{x_{\left(\frac{n}{2}\right)} + x_{\left(\frac{n}{2}+1\right)}\right\} & n \text{ even.} \end{cases} \tag{1.1}$$

The quartiles cut the set into four equal parts, which are often called fourths (that is why we use the letter $F$). Using a definition that goes back to Hoaglin, Mosteller and Tukey (1983), the definition of a median can be generalized to fourths, eighths, etc. Considering the order statistics we can define the depth of a data value $x_{(i)}$ as $\min\{i, n - i + 1\}$. If $n$ is odd, the depth of the median is $\frac{n+1}{2}$. If $n$ is even, $\frac{n+1}{2}$ is a fraction. Thus, the median is determined to be the average between the two data values belonging to the next larger and smaller order statistics, i.e., $M = \frac{1}{2}\left\{x_{\left(\frac{n}{2}\right)} + x_{\left(\frac{n}{2}+1\right)}\right\}$. In our example, we have $n = 15$, hence the median $M = x_{(8)} = 88$.

City               Population (in 10,000)   Order statistic
New York           778                      $x_{(15)}$
Chicago            355                      $x_{(14)}$
Los Angeles        248                      $x_{(13)}$
Philadelphia       200                      $x_{(12)}$
Detroit            167                      $x_{(11)}$
Baltimore           94                      $x_{(10)}$
Houston             94                      $x_{(9)}$
Cleveland           88                      $x_{(8)}$
Washington D.C.     76                      $x_{(7)}$
Saint Louis         75                      $x_{(6)}$
Milwaukee           74                      $x_{(5)}$
San Francisco       74                      $x_{(4)}$
Boston              70                      $x_{(3)}$
Dallas              68                      $x_{(2)}$
New Orleans         63                      $x_{(1)}$

Table 1.1. The 15 largest U.S. cities in 1960.

We proceed in the same way to get the fourths. Take the depth of the median and calculate

$$\text{depth of fourth} = \frac{[\text{depth of median}] + 1}{2}$$

with $[z]$ denoting the largest integer smaller than or equal to $z$. In our example this gives 4.5 and thus leads to the two fourths

$$F_L = \frac{1}{2}\left\{x_{(4)} + x_{(5)}\right\}$$

$$F_U = \frac{1}{2}\left\{x_{(11)} + x_{(12)}\right\}$$

(recalling that a depth which is a fraction corresponds to the average of the two nearest data values).
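The depth rule can be checked numerically. The following is a minimal Python sketch, not part of the book's XploRe quantlets; the helper names `value_at_depth` and `five_number_summary` are hypothetical and introduced only for illustration. It reads a fractional depth as the average of the two neighbouring order statistics, as described above, and is applied to the populations listed in Table 1.1.

```python
import math

def value_at_depth(sorted_x, depth, from_top=False):
    """Return the data value at a (possibly fractional) depth.

    A fractional depth such as 4.5 is read as the average of the
    values at depths 4 and 5, counted from the bottom or from the top.
    """
    lo, hi = math.floor(depth), math.ceil(depth)
    n = len(sorted_x)
    if from_top:
        # depth d from the top is the order statistic x_(n - d + 1)
        a, b = sorted_x[n - lo], sorted_x[n - hi]
    else:
        # depth d from the bottom is the order statistic x_(d)
        a, b = sorted_x[lo - 1], sorted_x[hi - 1]
    return (a + b) / 2

def five_number_summary(x):
    """Median, fourths and extremes computed from depths."""
    xs = sorted(x)
    n = len(xs)
    depth_median = (n + 1) / 2
    depth_fourth = (math.floor(depth_median) + 1) / 2
    return {
        "min": xs[0],
        "F_L": value_at_depth(xs, depth_fourth),
        "M": value_at_depth(xs, depth_median),
        "F_U": value_at_depth(xs, depth_fourth, from_top=True),
        "max": xs[-1],
        "depth_median": depth_median,
        "depth_fourth": depth_fourth,
    }

# Populations (in 10,000) of the 15 largest U.S. cities in 1960, Table 1.1
cities = [778, 355, 248, 200, 167, 94, 94, 88, 76, 75, 74, 74, 70, 68, 63]
summary = five_number_summary(cities)
print(summary)
```

For the city data this reproduces the depths 8 and 4.5 and the values $M = 88$, $F_L = 74$ and $F_U = 183.5$ used in the text.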

The $F$-spread, $d_F$, is defined as $d_F = F_U - F_L$. The outside bars

$$F_U + 1.5\,d_F \tag{1.2}$$

$$F_L - 1.5\,d_F \tag{1.3}$$

are the borders beyond which a point is regarded as an outlier. For the number of points outside these bars see Exercise 1.3. For the $n = 15$ data points the fourths are $74 = \frac{1}{2}\left\{x_{(4)} + x_{(5)}\right\}$ and $183.5 = \frac{1}{2}\left\{x_{(11)} + x_{(12)}\right\}$. Therefore the $F$-spread and the upper and lower outside bars in the above example are calculated as follows:

$$d_F = F_U - F_L = 183.5 - 74 = 109.5 \tag{1.4}$$

$$F_L - 1.5\,d_F = 74 - 1.5 \cdot 109.5 = -90.25 \tag{1.5}$$

$$F_U + 1.5\,d_F = 183.5 + 1.5 \cdot 109.5 = 347.75. \tag{1.6}$$

Since New York and Chicago are beyond the outside bars they are considered to be outliers. The minimum and the maximum are called the extremes. The mean is defined as

$$\bar{x} = n^{-1} \sum_{i=1}^{n} x_i,$$

which is 168.27 in our example. The mean is a measure of location. The median (88), the fourths (74; 183.5) and the extremes (63; 778) constitute basic information about the data. The combination of these five numbers leads to the Five Number Summary as displayed in Table 1.2. The depths of each of the five numbers have been added as an additional column.

#   15     U.S. Cities
M   8           88
F   4.5     74      183.5
    1       63      778

Table 1.2. Five number summary.

Construction of the Boxplot

1. Draw a box with borders (edges) at FL and FU (i.e., 50% of the data are in this box).

2. Draw the median as a solid line (|) and the mean as a dotted line.

3. Draw “whiskers” from each end of the box to the most remote point that is NOT an outlier.

4. Show outliers as either “⋆” or “•” depending on whether they are outside of the bars $F_L - 1.5\,d_F$ and $F_U + 1.5\,d_F$ or outside of $F_L - 3\,d_F$ and $F_U + 3\,d_F$, respectively. Label them if possible.
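The construction steps can be mirrored in code without drawing anything. The sketch below is an illustration in Python, not the book's `MVAboxcity.xpl` quantlet; the function names `fourths` and `boxplot_stats` and the labels "mild" and "far" for the two outlier classes are assumptions made here. It computes the box edges, the outside bars at 1.5 and 3 $F$-spreads, the whisker ends and the outliers for the U.S. cities data.

```python
def fourths(xs):
    """Lower and upper fourths of a sorted list, using the depth rule."""
    n = len(xs)
    depth = ((n + 1) // 2 + 1) / 2          # ([depth of median] + 1) / 2
    lo, hi = int(depth), int(depth + 0.5)   # neighbours of a fractional depth
    f_l = (xs[lo - 1] + xs[hi - 1]) / 2
    f_u = (xs[n - lo] + xs[n - hi]) / 2
    return f_l, f_u

def boxplot_stats(x):
    """Box edges, outside bars, whisker ends and outliers of a batch."""
    xs = sorted(x)
    f_l, f_u = fourths(xs)
    d_f = f_u - f_l                                   # F-spread
    inner = (f_l - 1.5 * d_f, f_u + 1.5 * d_f)        # outside bars (1.2), (1.3)
    outer = (f_l - 3.0 * d_f, f_u + 3.0 * d_f)
    inside = [v for v in xs if inner[0] <= v <= inner[1]]
    mild = [v for v in xs
            if outer[0] <= v < inner[0] or inner[1] < v <= outer[1]]
    far = [v for v in xs if v < outer[0] or v > outer[1]]
    return {
        "box": (f_l, f_u),
        "inner": inner,
        "outer": outer,
        "whiskers": (inside[0], inside[-1]),  # most remote non-outlying points
        "mild": mild,
        "far": far,
    }

# Populations (in 10,000) of the 15 largest U.S. cities in 1960, Table 1.1
cities = [778, 355, 248, 200, 167, 94, 94, 88, 76, 75, 74, 74, 70, 68, 63]
stats = boxplot_stats(cities)
print(stats)
```

For the city data the whiskers end at 63 (New Orleans) and 248 (Los Angeles); 355 (Chicago) lies beyond the 1.5 $d_F$ bar and 778 (New York) beyond the 3 $d_F$ bar.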

Figure 1.2. Boxplot for U.S. cities. MVAboxcity.xpl

In the U.S. cities example the cutoff points (outside bars) are at $-90.25$ and $347.75$, hence we draw whiskers to New Orleans and Los Angeles. We can see from Figure 1.2 that the data are strongly skewed: the upper half of the data (above the median) is more spread out than the lower half (below the median). The data contain two outliers marked as a star and a circle. The more distinct outlier is shown as a star. The mean (as a non-robust measure of location) is pulled away from the median.

Boxplots are very useful tools in comparing batches. The relative location of the distribution of different batches tells us a lot about the batches themselves. Before we come back to the Swiss bank data let us compare the fuel economy of vehicles from different countries, see Figure 1.3 and Table B.3.

The data are from the second column of Table B.3 and show the mileage (miles per gallon) of U.S. American, Japanese and European cars. The five-number summaries for these data sets are {12, 16.8, 18.8, 22, 30}, {18, 22, 25, 30.5, 35}, and {14, 19, 23, 25, 28} for American, Japanese, and European cars, respectively. This reflects the information shown in Figure 1.3.

Figure 1.3. Boxplot for the mileage of American, Japanese and European cars (from left to right). MVAboxcar.xpl

The following conclusions can be made:

• Japanese cars achieve higher fuel efficiency than U.S. and European cars.

• There is one outlier, a very fuel-efficient car (VW-Rabbit Diesel).

• The main body of the U.S. car data (the box) lies below the Japanese car data.

• The worst Japanese car is more fuel-efficient than almost 50 percent of the U.S. cars.

• The spreads of the Japanese and the U.S. cars are almost equal.

• The median of the Japanese data is above that of the European data and the U.S. data.

Now let us apply the boxplot technique to the bank data set. In Figure 1.4 we show the parallel boxplot of the diagonal variable $X_6$. On the left is the value of the genuine bank notes and on the right the value of the counterfeit bank notes. The two five-number summaries are {140.65, 141.25, 141.5, 141.8, 142.4} for the genuine bank notes, and {138.3, 139.2, 139.5, 139.8, 140.65} for the counterfeit ones.

Figure 1.4. The $X_6$ variable of Swiss bank data (diagonal of bank notes). MVAboxbank6.xpl

One sees that the diagonals of the genuine bank notes tend to be larger. It is harder to see a clear distinction when comparing the length of the bank notes $X_1$ (see Figure 1.5). There are a few outliers in both plots. Almost all the observations of the diagonal of the genuine notes are above the ones from the counterfeit. There is one observation in Figure 1.4 of the genuine notes that is almost equal to the median of the counterfeit notes. Can the parallel boxplot technique help us distinguish between the two types of bank notes?

Figure 1.5. The $X_1$ variable of Swiss bank data (length of bank notes). MVAboxbank1.xpl

Summary

- The median and mean bars are measures of location.

- The relative location of the median (and the mean) in the box is a measure of skewness.

- The lengths of the box and whiskers are measures of spread.

- The length of the whiskers indicates the tail length of the distribution.

- The outlying points are indicated with a “⋆” or “•” depending on whether they are outside of the bars $F_L - 1.5\,d_F$ and $F_U + 1.5\,d_F$ or outside of $F_L - 3\,d_F$ and $F_U + 3\,d_F$, respectively.

- The boxplots do not indicate multimodality or clusters.

- If we compare the relative size and location of the boxes, we are comparing distributions.

1.2 Histograms

Histograms are density estimates. A density estimate gives a good impression of the distribution of the data. In contrast to boxplots, density estimates show possible multimodality of the data. The idea is to locally represent the data density by counting the number of observations in a sequence of consecutive intervals (bins) with origin $x_0$. Let $B_j(x_0, h)$ denote the bin of length $h$ which is the element of a bin grid starting at $x_0$:

$$B_j(x_0, h) = [x_0 + (j - 1)h,\; x_0 + jh), \quad j \in \mathbb{Z},$$

where $[.,.)$ denotes a left closed and right open interval. If $\{x_i\}_{i=1}^n$ is an i.i.d. sample with density $f$, the histogram is defined as follows:

$$\hat{f}_h(x) = n^{-1} h^{-1} \sum_{j \in \mathbb{Z}} \sum_{i=1}^{n} I\{x_i \in B_j(x_0, h)\}\, I\{x \in B_j(x_0, h)\}. \tag{1.7}$$

In sum (1.7) the first indicator function $I\{x_i \in B_j(x_0, h)\}$ (see Symbols & Notation in Appendix A) counts the number of observations falling into bin $B_j(x_0, h)$. The second indicator function is responsible for "localizing" the counts around $x$. The parameter $h$ is a smoothing or localizing parameter and controls the width of the histogram bins. An $h$ that is too large leads to very big blocks and thus to a very unstructured histogram. On the other hand, an $h$ that is too small gives a very variable estimate with many unimportant peaks.

The effect of $h$ is illustrated in Figure 1.6. It contains the histogram (upper left) for the diagonal of the counterfeit bank notes for $x_0 = 137.8$ (the minimum of these observations) and $h = 0.1$. Increasing $h$ to $h = 0.2$ and using the same origin, $x_0 = 137.8$, results in the histogram shown in the lower left of the figure. This density histogram is somewhat smoother due to the larger $h$. The binwidth is next set to $h = 0.3$ (upper right). From this histogram, one has the impression that the distribution of the diagonal is bimodal with peaks at about 138.5 and 139.9. The detection of modes requires a fine tuning of the binwidth. Using methods from smoothing methodology (Härdle, Müller, Sperlich and Werwatz, 2003) one can find an "optimal" binwidth $h$ for $n$ observations:

$$h_{opt} = \left(\frac{24\sqrt{\pi}}{n}\right)^{1/3}.$$
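The histogram estimator (1.7) and the binwidth formula above can be sketched in a few lines of Python. This is a minimal illustration, not the XploRe quantlet referenced in the figures; the function names and the small sample are made up for the example.

```python
import math


def bin_index(x, x0, h):
    """Index j of the bin B_j(x0, h) = [x0 + (j-1)h, x0 + jh) that contains x."""
    return math.floor((x - x0) / h) + 1


def histogram_density(x, data, x0, h):
    """Evaluate the histogram (1.7) at x: number of points sharing x's bin, over n*h."""
    j = bin_index(x, x0, h)
    count = sum(1 for xi in data if bin_index(xi, x0, h) == j)
    return count / (len(data) * h)


def optimal_binwidth(n):
    """Binwidth h_opt = (24 * sqrt(pi) / n) ** (1/3) as quoted in the text."""
    return (24.0 * math.sqrt(math.pi) / n) ** (1.0 / 3.0)


# Made-up sample, used only to exercise the functions (not the bank data).
sample = [0.1, 0.2, 0.7, 1.5]
# Two of the four points fall into the bin [0, 0.5) that contains x = 0.3.
density_at_03 = histogram_density(0.3, sample, x0=0.0, h=0.5)
```

Counting through `bin_index` for both `x` and the observations keeps the two indicator functions of (1.7) consistent at bin edges.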

Figure 1.6. Diagonal of counterfeit bank notes. Histograms with $x_0 = 137.8$ and $h = 0.1$ (upper left), $h = 0.2$ (lower left), $h = 0.3$ (upper right), $h = 0.4$ (lower right). MVAhisbank1.xpl

Unfortunately, the binwidth $h$ is not the only parameter determining the shape of $\hat f_h$.

In Figure 1.7, we show histograms with $x_0 = 137.65$ (upper left), $x_0 = 137.75$ (lower left), $x_0 = 137.85$ (upper right), and $x_0 = 137.95$ (lower right). All the graphs have been scaled equally on the y-axis to allow comparison. One sees that, despite the fixed binwidth $h$, the interpretation is not facilitated. The shift of the origin $x_0$ (to 4 different locations) created 4 different histograms. This property of histograms strongly contradicts the goal of presenting data features. Obviously, the same data are represented quite differently by the 4 histograms. A remedy has been proposed by Scott (1985): "Average the shifted histograms!". The result is presented in Figure 1.8. Here all bank note observations (genuine and counterfeit) have been used. The averaged shifted histogram is no longer dependent on the origin and shows a clear bimodality of the diagonals of the Swiss bank notes.
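The idea of averaging shifted histograms can be sketched as follows. The text does not spell out how the origins are spaced; the sketch assumes the usual construction in which the origin is moved in equal steps of h divided by the number of shifts. Function names and the sample are illustrative only.

```python
import math


def histogram_density(x, data, x0, h):
    """Histogram estimate at x for a bin grid with origin x0 and binwidth h."""
    j = math.floor((x - x0) / h)
    count = sum(1 for xi in data if math.floor((xi - x0) / h) == j)
    return count / (len(data) * h)


def averaged_shifted_histogram(x, data, x0, h, n_shifts):
    """Average of n_shifts histograms with origins x0 + k*h/n_shifts, k = 0..n_shifts-1.

    The equal spacing h/n_shifts is an assumption of this sketch.
    """
    total = 0.0
    for k in range(n_shifts):
        origin = x0 + k * h / n_shifts
        total += histogram_density(x, data, origin, h)
    return total / n_shifts


# Made-up sample, used only to exercise the functions (not the bank data).
sample = [0.1, 0.2, 0.7, 1.5]
ash_two_shifts = averaged_shifted_histogram(0.3, sample, x0=0.0, h=0.5, n_shifts=2)
```

With a single shift the function reduces to the ordinary histogram; with more shifts the estimate depends less and less on the starting origin.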

Figure 1.7. Diagonal of counterfeit bank notes. Histogram with $h = 0.4$ and origins $x_0 = 137.65$ (upper left), $x_0 = 137.75$ (lower left), $x_0 = 137.85$ (upper right), $x_0 = 137.95$ (lower right). MVAhisbank2.xpl

Summary

→ Modes of the density are detected with a histogram.
→ Modes correspond to strong peaks in the histogram.
→ Histograms with the same $h$ need not be identical. They also depend on the origin $x_0$ of the grid.
→ The influence of the origin $x_0$ is drastic. Changing $x_0$ creates different looking histograms.
→ The consequence of an $h$ that is too large is an unstructured histogram that is too flat.
→ A binwidth $h$ that is too small results in an unstable histogram.
→ There is an "optimal" $h = (24\sqrt{\pi}/n)^{1/3}$.
→ It is recommended to use averaged histograms. They are kernel densities.

1.3 Kernel Densities

The major difficulties of histogram estimation may be summarized in four critiques:

• determination of the binwidth $h$, which controls the shape of the histogram,
• choice of the bin origin $x_0$, which also influences to some extent the shape,
• loss of information since observations are replaced by the central point of the interval in which they fall,
• the underlying density function is often assumed to be smooth, but the histogram is not smooth.

Rosenblatt (1956), Whittle (1958), and Parzen (1962) developed an approach which avoids the last three difficulties. First, a smooth kernel function rather than a box is used as the basic building block. Second, the smooth function is centered directly over each observation. Let us study this refinement by supposing that $x$ is the center value of a bin. The histogram can in fact be rewritten as

$$\hat f_h(x) = n^{-1} h^{-1} \sum_{i=1}^{n} I\left(|x - x_i| \le \frac{h}{2}\right). \tag{1.8}$$

If we define $K(u) = I(|u| \le \frac{1}{2})$, then (1.8) changes to

$$\hat f_h(x) = n^{-1} h^{-1} \sum_{i=1}^{n} K\left(\frac{x - x_i}{h}\right). \tag{1.9}$$

This is the general form of the kernel estimator. Allowing smoother kernel functions like the quartic kernel,

$$K(u) = \frac{15}{16}(1 - u^2)^2\, I(|u| \le 1),$$

and evaluating the estimate at every point $x$, not only at bin centers, gives us the kernel density estimator. Kernel estimators can also be derived via weighted averaging of rounded points (WARPing) or by averaging histograms with different origins, see Scott (1985). Table 1.5 introduces some commonly used kernels.

Figure 1.8. Averaged shifted histograms based on all (counterfeit and genuine) Swiss bank notes: there are 2 shifts (upper left), 4 shifts (lower left), 8 shifts (upper right), and 16 shifts (lower right). MVAashbank.xpl

K(•)                                                                   Kernel
$K(u) = \frac{1}{2}\, I(|u| \le 1)$                                    Uniform
$K(u) = (1 - |u|)\, I(|u| \le 1)$                                      Triangle
$K(u) = \frac{3}{4}(1 - u^2)\, I(|u| \le 1)$                           Epanechnikov
$K(u) = \frac{15}{16}(1 - u^2)^2\, I(|u| \le 1)$                       Quartic (Biweight)
$K(u) = \frac{1}{\sqrt{2\pi}} \exp(-\frac{u^2}{2}) = \varphi(u)$       Gaussian

Table 1.5. Kernel functions.
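The kernels of Table 1.5 and the kernel estimator (1.9) translate directly into code. The following Python sketch is only an illustration; the function names are chosen for the example and the default kernel is an arbitrary choice.

```python
import math


def uniform(u):
    return 0.5 if abs(u) <= 1 else 0.0


def triangle(u):
    return 1.0 - abs(u) if abs(u) <= 1 else 0.0


def epanechnikov(u):
    return 0.75 * (1.0 - u * u) if abs(u) <= 1 else 0.0


def quartic(u):
    return 15.0 / 16.0 * (1.0 - u * u) ** 2 if abs(u) <= 1 else 0.0


def gaussian(u):
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)


def kernel_density(x, data, h, kernel=quartic):
    """Kernel estimator (1.9): n^{-1} h^{-1} times the sum of K((x - x_i) / h)."""
    return sum(kernel((x - xi) / h) for xi in data) / (len(data) * h)


# Each kernel should integrate to one; check the quartic kernel with a Riemann sum.
step = 0.001
quartic_mass = sum(quartic(-1.0 + k * step) * step for k in range(2001))
```

Passing the kernel as a function argument makes it easy to compare the shapes produced by different kernels for the same bandwidth.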

Different kernels generate different shapes of the estimated density. The most important parameter is the so-called bandwidth $h$, which can be optimized, for example, by cross-validation; see Härdle (1991) for details. The cross-validation method minimizes the integrated squared error. This measure of discrepancy is based on the squared differences $\left(\hat f_h(x) - f(x)\right)^2$.

Figure 1.9. Densities of the diagonals of genuine and counterfeit bank notes. Automatic density estimates. MVAdenbank.xpl

Averaging these squared deviations over a grid of points $\{x_l\}_{l=1}^{L}$ leads to

$$L^{-1} \sum_{l=1}^{L} \left(\hat f_h(x_l) - f(x_l)\right)^2.$$

Asymptotically, if this grid size tends to zero, we obtain the integrated squared error:

$$\int \left(\hat f_h(x) - f(x)\right)^2 dx.$$

In practice, it turns out that the method consists of selecting a bandwidth that minimizes the cross-validation function

$$\int \hat f_h^2 - 2 \sum_{i=1}^{n} \hat f_{h,i}(x_i),$$

where $\hat f_{h,i}$ is the density estimate obtained by using all datapoints except for the $i$-th observation. Both terms in the above function involve double sums. Computation may therefore be slow. There are many other density bandwidth selection methods. Probably the fastest way to choose a bandwidth is to refer to some reasonable reference distribution. The idea of using the Normal distribution as a reference, for example, goes back to Silverman (1986). The resulting choice of $h$ is called the rule of thumb.

Figure 1.10. Contours of the density of X4 and X6 of genuine and counterfeit bank notes. MVAcontbank2.xpl
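The cross-validation criterion described above can be sketched for the Gaussian kernel, where the integral of the squared estimate has a closed form as a double sum. Two points are assumptions of this sketch rather than statements from the text: the leave-one-out sum is divided by n (the usual least-squares normalization), and the bandwidth is picked from a user-supplied list of candidates. All names are illustrative.

```python
import math


def gaussian(u):
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)


def integrated_squared_estimate(data, h):
    """Integral of the squared Gaussian kernel estimate, written as a double sum.

    Uses the fact that the convolution of two standard normal densities
    is the N(0, 2) density.
    """
    n = len(data)
    total = 0.0
    for xi in data:
        for xj in data:
            d = (xi - xj) / h
            total += math.exp(-0.25 * d * d) / math.sqrt(4.0 * math.pi)
    return total / (n * n * h)


def leave_one_out_sum(data, h):
    """Sum over i of the estimate at x_i computed without the i-th observation."""
    n = len(data)
    total = 0.0
    for i, xi in enumerate(data):
        others = sum(gaussian((xi - xj) / h) for j, xj in enumerate(data) if j != i)
        total += others / ((n - 1) * h)
    return total


def cv_score(data, h):
    # Assumption of this sketch: the leave-one-out sum is scaled by 2/n.
    return integrated_squared_estimate(data, h) - 2.0 * leave_one_out_sum(data, h) / len(data)


def select_bandwidth(data, candidates):
    """Return the candidate bandwidth with the smallest cross-validation score."""
    return min(candidates, key=lambda h: cv_score(data, h))
```

Both terms are double sums over the observations, which is why the text warns that the computation may be slow for large samples.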

For the Gaussian kernel from Table 1.5 and a Normal reference distribution, the rule of thumb is to choose

$$h_G = 1.06\, \hat\sigma\, n^{-1/5} \tag{1.10}$$

where $\hat\sigma = \sqrt{n^{-1} \sum_{i=1}^{n} (x_i - \bar x)^2}$ denotes the sample standard deviation. This choice of $h_G$ optimizes the integrated squared distance between the estimator and the true density. For the quartic kernel, we need to transform (1.10). The modified rule of thumb is:

$$h_Q = 2.62 \cdot h_G. \tag{1.11}$$
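The two rule-of-thumb bandwidths (1.10) and (1.11) are simple to compute. The sketch below uses the standard deviation with divisor n, as in the definition above; the function names and the small sample are made up for illustration.

```python
import math


def sample_std(data):
    """Standard deviation with divisor n, as in the definition following (1.10)."""
    n = len(data)
    mean = sum(data) / n
    return math.sqrt(sum((xi - mean) ** 2 for xi in data) / n)


def rule_of_thumb_gaussian(data):
    """h_G = 1.06 * sigma_hat * n^(-1/5), equation (1.10)."""
    return 1.06 * sample_std(data) * len(data) ** (-1.0 / 5.0)


def rule_of_thumb_quartic(data):
    """h_Q = 2.62 * h_G, equation (1.11)."""
    return 2.62 * rule_of_thumb_gaussian(data)


# Made-up sample, used only to exercise the functions (not the bank data).
sample = [1.0, 2.0, 3.0, 4.0, 5.0]
h_g = rule_of_thumb_gaussian(sample)
h_q = rule_of_thumb_quartic(sample)
```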

Figure 1.9 shows the automatic density estimates for the diagonals of the counterfeit and genuine bank notes. The density on the left is the density corresponding to the diagonal of the counterfeit data. The separation is clearly visible, but there is also an overlap. The

problem of distinguishing between the counterfeit and genuine bank notes is not solved by

just looking at the diagonals of the notes! The question arises whether a better separation

could be achieved using not only the diagonals but one or two more variables of the data

set. The estimation of higher dimensional densities is analogous to that of one-dimensional.

We show a two dimensional density estimate for X4 and X6 in Figure 1.10. The contour

lines indicate the height of the density. One sees two separate distributions in this higher

dimensional space, but they still overlap to some extent.

Figure 1.11. Contours of the density of X4 , X5 , X6 of genuine and coun-

terfeit bank notes. MVAcontbank3.xpl

We can add one more dimension and give a graphical representation of a three dimensional density estimate, or more precisely an estimate of the joint distribution of X4, X5 and X6. Figure 1.11 shows the contour areas at 3 different levels of the density: 0.2 (light grey), 0.4 (grey), and 0.6 (black) of this three dimensional density estimate. One can clearly recognize two "ellipsoids" (at each level), but as before, they overlap. In Chapter 12 we will learn how to separate the two ellipsoids and how to develop a discrimination rule to distinguish between these data points.

Summary

→ Kernel densities estimate distribution densities by the kernel method.
→ The bandwidth $h$ determines the degree of smoothness of the estimate $\hat f$.
→ Kernel densities are smooth functions and they can graphically represent distributions (up to 3 dimensions).
→ A simple (but not necessarily correct) way to find a good bandwidth is to compute the rule of thumb bandwidth $h_G = 1.06\, \hat\sigma\, n^{-1/5}$. This bandwidth is to be used only in combination with a Gaussian kernel $\varphi$.
→ Kernel density estimates are a good descriptive tool for seeing modes, location, skewness, tails, asymmetry, etc.

1.4 Scatterplots

Scatterplots are bivariate or trivariate plots of variables against each other. They help us

understand relationships among the variables of a data set. A downward-sloping scatter

indicates that as we increase the variable on the horizontal axis, the variable on the vertical

axis decreases. An analogous statement can be made for upward-sloping scatters.

Figure 1.12 plots the 5th column (upper inner frame) of the bank data against the 6th

column (diagonal). The scatter is downward-sloping. As we already know from the previous

section on marginal comparison (e.g., Figure 1.9) a good separation between genuine and

counterfeit bank notes is visible for the diagonal variable. The sub-cloud in the upper half

(circles) of Figure 1.12 corresponds to the true bank notes. As noted before, this separation

is not distinct, since the two groups overlap somewhat.

This can be verified in an interactive computing environment by showing the index and coordinates of certain points in this scatterplot. In Figure 1.12, the 70th observation in the merged data set is given as a thick circle, and it is from a genuine bank note. This observation lies well embedded in the cloud of counterfeit bank notes. One straightforward approach that could be used to tell the counterfeit from the genuine bank notes is to draw a straight line and define notes above this value as genuine. We would of course misclassify the 70th observation, but can we do better?

Figure 1.12. 2D scatterplot for X5 (upper inner frame) vs. X6 (diagonal) of the bank notes. Genuine notes are circles, counterfeit notes are stars. MVAscabank56.xpl

If we extend the two-dimensional scatterplot by adding a third variable, e.g., X4 (lower distance to inner frame), we obtain the scatterplot in three dimensions as shown in Figure 1.13. It becomes apparent from the location of the point clouds that a better separation is obtained. We have rotated the three dimensional data until this satisfactory 3D view was obtained. Later, we will see that rotation is the same as bundling a high-dimensional observation into one or more linear combinations of the elements of the observation vector. In other words, the "separation line" parallel to the horizontal coordinate axis in Figure 1.12 is in Figure 1.13 a plane and no longer parallel to one of the axes. The formula for such a separation plane is a linear combination of the elements of the observation vector:

$$a_1 x_1 + a_2 x_2 + \ldots + a_6 x_6 = \text{const}. \tag{1.12}$$

The algorithm that automatically finds the weights $(a_1, \ldots, a_6)$ will be investigated later on in Chapter 12.
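A rule based on (1.12) amounts to comparing a weighted sum of the six measurements with a constant. The sketch below is purely illustrative: the weights, the threshold, the measurement vector, and the convention that larger scores mean "genuine" are all hypothetical values chosen for the example, not results from the text.

```python
def linear_score(weights, observation):
    """Left-hand side of (1.12): a_1*x_1 + ... + a_6*x_6."""
    return sum(a * x for a, x in zip(weights, observation))


def classify(weights, observation, const):
    """Label a note by the side of the separating plane it falls on.

    Treating scores above const as genuine is an assumption of this sketch.
    """
    return "genuine" if linear_score(weights, observation) > const else "counterfeit"


# Hypothetical weights: only the diagonal X6 counts, which reproduces a
# horizontal separation line as in the two-dimensional scatterplot.
weights = (0.0, 0.0, 0.0, 0.0, 0.0, 1.0)
# Made-up measurement vector (x1, ..., x6) and made-up threshold.
note = (215.0, 130.0, 130.0, 9.0, 10.0, 141.0)
label = classify(weights, note, const=140.5)
```

Putting nonzero weights on several variables tilts the plane, which corresponds to the rotation of the point cloud described above.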

Figure 1.13. 3D Scatterplot of the bank notes for (X4, X5, X6). Genuine notes are circles, counterfeit are stars. MVAscabank456.xpl

Let us study yet another technique: the scatterplot matrix. If we want to draw all possible two-dimensional scatterplots for the variables, we can create a so-called draftman's plot (named after a draftman who prepares drafts for parliamentary discussions). Similar to a draftman's plot the scatterplot matrix helps in creating new ideas and in building knowledge about dependencies and structure.

Figure 1.14 shows a draftman plot applied to the last four columns of the full bank data set. For ease of interpretation we have distinguished between the group of counterfeit and genuine bank notes by a different color. As discussed several times before, the separability of the two types of notes is different for different scatterplots. Not only is it difficult to perform this separation on, say, scatterplot X3 vs. X4, but the "separation line" is also no longer parallel to one of the axes. The most obvious separation happens in the scatterplot in the lower right where we show, as in Figure 1.12, X5 vs. X6. The separation line here would be upward-sloping with an intercept at about X6 = 139. The upper right half of the draftman plot shows the density contours that we have introduced in Section 1.3.

The power of the draftman plot lies in its ability to show the internal connections of the scatter diagrams. Define a brush as a re-scalable rectangle that we can move via keyboard
