APPENDIX 1

BASIC STATISTICS

The problem that we face today is not that we have too little information but too

much. Making sense of large and often contradictory information is part of what we are

called upon to do when analyzing companies. Basic statistics can make this job easier. In

this appendix, we consider the most fundamental tools that we have available in data

analysis.

Summarizing Data

Large amounts of data are often compressed into more easily assimilated summaries,

which provide the user with a sense of the content, without overwhelming him or her

with too many numbers. There are a number of ways in which data can be presented. We will

consider two here: one is to present the data in a distribution and the other is to provide

summary statistics that capture key aspects of the data.

Data Distributions

When presented with thousands of pieces of information, you can break the numbers

down into individual values (or ranges of values) and provide the number of individual

data items that take on each value or range of values. This is called a frequency

distribution. If the data can only take on specific values, as is the case when we record the

number of goals scored in a soccer game, it is called a discrete distribution. When the

data can take on any value within the range, as is the case with income or market

capitalization, it is called a continuous distribution.
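
As a rough illustration of the two kinds of frequency distribution, the sketch below counts distinct values for discrete data and counts observations per equal-width range for continuous data. The function names, the sample data, and the bin settings are all hypothetical choices made for this example.

```python
from collections import Counter


def discrete_distribution(values):
    """Count how many observations take on each distinct value."""
    return dict(sorted(Counter(values).items()))


def binned_distribution(values, low, width, n_bins):
    """Count observations in each of n_bins equal-width ranges starting at
    `low`. Values outside the covered ranges are ignored."""
    counts = [0] * n_bins
    for v in values:
        k = int((v - low) // width)
        if 0 <= k < n_bins:
            counts[k] += 1
    return counts


# Hypothetical data: goals scored per game (discrete) and
# incomes in thousands of dollars (continuous).
goals = [0, 1, 1, 2, 3, 1, 0, 2, 1, 4]
incomes = [18.5, 42.0, 37.2, 95.1, 61.3, 40.0]

goal_counts = discrete_distribution(goals)
income_counts = binned_distribution(incomes, low=0.0, width=25.0, n_bins=4)

print(goal_counts)    # value -> number of games
print(income_counts)  # counts for [0,25), [25,50), [50,75), [75,100)
```

The bin width is a judgment call: wider ranges give a smoother but coarser picture of the data.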

The advantage of presenting the data in a distribution is twofold. One is that

you can summarize even the largest data sets into one distribution and get a measure of

what values occur most frequently and the range of high and low values. The second is

that the distribution can resemble one of the many common distributions about which we

know a great deal in statistics. Consider, for instance, the distribution that we tend to

draw on the most in analysis: the normal distribution, illustrated in figure 1.

Figure 1: Normal Distribution

A normal distribution is symmetric, has a peak centered around the middle of the

distribution and tails that are not fat and stretch to include infinite positive or negative

values. Figure 2 illustrates positively and negatively skewed distributions.

Figure 2: Skewed Distributions (two panels: positively skewed distribution and negatively skewed distribution; horizontal axis: Returns)

Summary Statistics

The simplest way to measure the key characteristics of a data set is to estimate the

summary statistics for the data. For a data series, X1, X2, X3, ..., Xn, where n is the number

of observations in the series, the most widely used summary statistics are as follows:

• The mean (µ), which is the average of all of the observations in the data series

\[ \text{Mean} = \mu_X = \frac{\sum_{j=1}^{n} X_j}{n} \]

• The median, which is the mid-point of the series; half the data in the series is higher

than the median and half is lower

• The variance, which is a measure of the spread in the distribution around the mean,

and is calculated by first summing up the squared deviations from the mean, and then

dividing by either the number of observations (if the data represents the entire

population) or by this number, reduced by one (if the data represents a sample)

\[ \text{Variance} = \sigma_X^2 = \frac{\sum_{j=1}^{n} (X_j - \mu_X)^2}{n - 1} \quad \text{(sample; divide by } n \text{ for a population)} \]

The standard deviation is the square root of the variance.
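
The summary statistics above can be sketched in a few lines of Python. This is a minimal illustration, not a replacement for a statistics package; the function names and the `returns` data are hypothetical, and the `sample` flag switches between dividing by n - 1 (sample) and n (population).

```python
import math


def mean(xs):
    """Average of all observations."""
    return sum(xs) / len(xs)


def median(xs):
    """Mid-point of the sorted series (average of the two middle values
    when the number of observations is even)."""
    s = sorted(xs)
    mid = len(s) // 2
    return s[mid] if len(s) % 2 else (s[mid - 1] + s[mid]) / 2


def variance(xs, sample=True):
    """Sum of squared deviations from the mean, divided by n - 1 for a
    sample or by n for an entire population."""
    mu = mean(xs)
    ss = sum((x - mu) ** 2 for x in xs)
    return ss / (len(xs) - 1 if sample else len(xs))


def std_dev(xs, sample=True):
    """Square root of the variance."""
    return math.sqrt(variance(xs, sample))


# Hypothetical data series.
returns = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]

print(mean(returns), median(returns))
print(variance(returns, sample=False), std_dev(returns, sample=False))
print(variance(returns, sample=True))
```

For this series the squared deviations sum to 32, so the population variance is 32/8 and the sample variance is 32/7.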

The mean and the standard deviation are called the first two moments of any data

distribution. A normal distribution can be entirely described by just these two moments;

in other words, the mean and the standard deviation of a normal distribution suffice to

completely characterize it. If a distribution is not symmetric, it is considered to be skewed

and the skewness is the moment that describes both the direction and the magnitude of

the asymmetry.
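
One common way to put a number on skewness is the moment-based coefficient: the average cubed deviation from the mean, scaled by the standard deviation cubed. The sketch below uses that population version as an assumption; the source does not specify an estimator, and statistical packages often apply small-sample corrections. The data are hypothetical.

```python
def skewness(xs):
    """Moment-based (population) skewness: third central moment divided
    by the second central moment raised to the power 1.5."""
    n = len(xs)
    mu = sum(xs) / n
    m2 = sum((x - mu) ** 2 for x in xs) / n
    m3 = sum((x - mu) ** 3 for x in xs) / n
    return m3 / m2 ** 1.5


symmetric = [1.0, 2.0, 3.0, 4.0, 5.0]
right_tailed = [1.0, 1.0, 1.0, 2.0, 10.0]   # one large positive outlier
left_tailed = [-v for v in right_tailed]    # mirror image

print(skewness(symmetric))     # zero for a symmetric series
print(skewness(right_tailed))  # positive: long right tail
print(skewness(left_tailed))   # negative: long left tail
```

The sign gives the direction of the skew and the size gives its magnitude.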

Looking for Relationships in the Data

When there are two series of data, there are a number of statistical measures that

can be used to capture how the two series move together over time.

Correlations and Covariances

The two most widely used measures of how two variables move together (or do

not) are the correlation and the covariance. For two data series, X (X1, X2, ...) and Y (Y1, Y2, ...),

the covariance provides a non-standardized measure of the degree to which they move

together, and is estimated by taking the product of the deviations from the mean for each

variable in each period and averaging these products across periods (dividing by n - 1 for a sample).

\[ \text{Covariance} = \sigma_{XY} = \frac{\sum_{j=1}^{n} (X_j - \mu_X)(Y_j - \mu_Y)}{n - 1} \]

The sign on the covariance indicates the type of relationship that the two variables have.

A positive sign indicates that they move together and a negative that they move in

opposite directions. While the covariance increases with the strength of the relationship,

it is still relatively difficult to draw judgments on the strength of the relationship between

two variables by looking at the covariance, since it is not standardized.

The correlation is the standardized measure of the relationship between two

variables. It can be computed from the covariance:

\[ \text{Correlation} = \rho_{XY} = \frac{\sigma_{XY}}{\sigma_X \sigma_Y} = \frac{\sum_{j=1}^{n} (X_j - \mu_X)(Y_j - \mu_Y)}{\sqrt{\sum_{j=1}^{n} (X_j - \mu_X)^2} \, \sqrt{\sum_{j=1}^{n} (Y_j - \mu_Y)^2}} \]

The correlation can never be greater than 1 or less than minus 1. A correlation close to

zero indicates that the two variables have little or no linear relationship. A positive correlation indicates that

the two variables move together, and the relationship is stronger the closer the correlation

gets to one. A negative correlation indicates the two variables move in opposite

directions, and that relationship also gets stronger the closer the correlation gets to minus

1. Two variables that are perfectly positively correlated (correlation = 1) essentially move in perfect

proportion in the same direction, while two variables that are perfectly negatively

correlated (correlation = -1) move in perfect proportion in opposite directions.
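
The covariance and correlation can be computed directly from their definitions. The sketch below assumes the sample convention (dividing by n - 1); the function names and the three data series are hypothetical, chosen so that one pair is perfectly positively correlated and the other perfectly negatively correlated.

```python
import math


def mean(xs):
    return sum(xs) / len(xs)


def covariance(xs, ys):
    """Sample covariance: average product of deviations from the means,
    dividing by n - 1."""
    mx, my = mean(xs), mean(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (len(xs) - 1)


def correlation(xs, ys):
    """Covariance standardized by the two standard deviations, so the
    result always lies between -1 and 1."""
    return covariance(xs, ys) / math.sqrt(covariance(xs, xs) * covariance(ys, ys))


x = [1.0, 2.0, 3.0, 4.0, 5.0]
y_up = [2 * v + 1 for v in x]     # moves in perfect proportion with x
y_down = [10 - 3 * v for v in x]  # moves in perfect proportion against x

print(covariance(x, y_up), correlation(x, y_up))
print(covariance(x, y_down), correlation(x, y_down))
```

Note that the two covariances differ in magnitude (5 versus 7.5 in absolute value) even though both relationships are perfect, which is why the standardized correlation is easier to interpret.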

Regressions

A simple regression is an extension of the correlation/covariance concept. It

attempts to explain one variable, which is called the dependent variable, using the other

variable, called the independent variable.

Scatter Plots and Regression Lines

Keeping with statistical tradition, let Y be the dependent variable and X be the

independent variable. If the two variables are plotted against each other, with each pair of

observations representing a point on the graph, you have a scatter plot, with Y on the

vertical axis and X on the horizontal axis.

In a regression, we attempt to fit a line through the points that best fits the data. In its

simplest form, this is accomplished by finding a line that minimizes the sum of the

squared deviations of the points from the line. Consequently, it is called ordinary least

squares (OLS) regression. When such a line is fit, two parameters emerge: one is the

point at which the line cuts through the Y-axis, called the intercept of the regression, and

the other is the slope of the regression line.

Y=a+bX

The slope (b) of the regression measures both the direction and the magnitude of the

relationship between the dependent variable (Y) and the independent variable (X). When

the two variables are positively correlated, the slope will also be positive, whereas when

the two variables are negatively correlated, the slope will be negative. The magnitude of

the slope of the regression can be read as follows: for every unit increase in the

independent variable (X), the dependent variable (Y) will change by b (slope).

Estimating Regression Parameters

While there are statistical packages that allow us to input data and get the

regression parameters as output, it is worth looking at how they are estimated in the first

place. The slope of the regression line is a logical extension of the covariance concept

introduced in the last section. In fact, the close linkage between the slope of the

regression and the correlation/covariance should not be surprising since the slope is

estimated using the covariance:

\[ \text{Slope of the Regression} = b = \frac{\text{Covariance}_{YX}}{\text{Variance of } X} = \frac{\sigma_{YX}}{\sigma_X^2} \]

The intercept (a) of the regression can be read in a number of ways. One interpretation is

that it is the value that Y will have when X is zero. Another is more straightforward, and

is based upon how it is calculated. It is the difference between the average value of Y,

and the slope adjusted value of X.

\[ \text{Intercept of the Regression} = a = \mu_Y - b \, \mu_X \]
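
To make the slope and intercept calculations concrete, here is a minimal sketch that estimates both from the covariance and variance, as described above. The function name `ols_fit` and the five data points are hypothetical.

```python
def ols_fit(xs, ys):
    """Return (intercept a, slope b) of the least squares line Y = a + bX.

    The slope is the covariance of X and Y divided by the variance of X;
    the intercept is the mean of Y minus the slope times the mean of X.
    """
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov_xy = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (n - 1)
    var_x = sum((x - mx) ** 2 for x in xs) / (n - 1)
    b = cov_xy / var_x
    a = my - b * mx
    return a, b


# Hypothetical observations.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

a, b = ols_fit(x, y)
print(a, b)          # intercept and slope
print(a + b * 6.0)   # predicted Y when X = 6
```

Because the same divisor (n - 1) appears in both the covariance and the variance, it cancels in the slope; using n instead would give the same line.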

Regression parameters are always estimated with some error or statistical noise, partly

because the relationship between the variables is not perfect and partly because we

estimate them from samples of data. This noise is captured in a couple of statistics. One is

the R-squared of the regression, which measures the proportion of the variability in the

dependent variable (Y) that is explained by the independent variable (X). It is a direct

function of the correlation between the variables:

\[ \text{R-squared of the Regression} = \text{Correlation}_{YX}^2 = \rho_{YX}^2 = \frac{b^2 \sigma_X^2}{\sigma_Y^2} \]

An R-squared value closer to one indicates a strong relationship between the two

variables, though the relationship may be either positive or negative. Another measure of

noise in a regression is the standard error, which measures the "spread" around each of the

two parameters estimated: the intercept and the slope. Each parameter has an associated

standard error, which is calculated from the data:

\[ \text{Standard Error of Intercept} = SE_a = \sqrt{ \frac{\sum_{j=1}^{n} X_j^2}{n} \cdot \frac{\sum_{j=1}^{n} (Y_j - a - bX_j)^2}{(n - 2) \sum_{j=1}^{n} (X_j - \mu_X)^2} } \]

\[ \text{Standard Error of Slope} = SE_b = \sqrt{ \frac{\sum_{j=1}^{n} (Y_j - a - bX_j)^2}{(n - 2) \sum_{j=1}^{n} (X_j - \mu_X)^2} } \]

If we make the additional assumption that the intercept and slope estimates are normally

distributed, the parameter estimate and the standard error can be combined to get a "t

statistic" that measures whether the relationship is statistically significant.

t statistic for intercept = a/SEa

t statistic for slope = b/SEb

For samples with more than 120 observations, a t statistic greater than 1.66 indicates that

the coefficient is significantly different from zero with 95% confidence in a one-tailed test, while a statistic

greater than 2.36 indicates the same with 99% confidence (the corresponding two-tailed thresholds at 120 degrees of freedom are about 1.98 and 2.62). For smaller samples, the t

statistic has to be larger to have statistical significance.1
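
The noise measures discussed above can be computed together. The sketch below is one way to do it, assuming the standard OLS expressions in which the residuals are measured around the fitted line (Y - a - bX) and the residual variance uses n - 2 degrees of freedom. The function name, dictionary keys, and data are hypothetical.

```python
import math


def regression_diagnostics(xs, ys):
    """Fit Y = a + bX by least squares and return the R-squared,
    standard errors, and t statistics of the two parameters."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))

    b = sxy / sxx
    a = my - b * mx

    # Sum of squared residuals around the fitted line.
    ssr = sum((y - a - b * x) ** 2 for x, y in zip(xs, ys))
    r2 = 1 - ssr / syy

    # Residual variance with n - 2 degrees of freedom (two parameters).
    s2 = ssr / (n - 2)
    se_b = math.sqrt(s2 / sxx)
    se_a = math.sqrt(s2 * sum(x * x for x in xs) / (n * sxx))

    return {"a": a, "b": b, "r2": r2,
            "se_a": se_a, "se_b": se_b,
            "t_a": a / se_a, "t_b": b / se_b}


# Hypothetical observations.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

stats = regression_diagnostics(x, y)
for name, value in stats.items():
    print(f"{name}: {value:.4f}")
```

With only five observations the large-sample thresholds quoted above do not apply directly, but the pattern is typical: the slope has a very large t statistic while the intercept's t statistic is small, so the intercept is not distinguishable from zero.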

Using Regressions

While regressions mirror correlation coefficients and covariances in showing the

strength of the relationship between two variables, they also serve another useful purpose.

The regression equation described in the last section can be used to estimate predicted

values for the dependent variable, based upon assumed or actual values for the

independent variable. In other words, for any given X, we can estimate what Y should be:

Y = a + b (X)

How good are these predictions? That will depend entirely upon the strength of the

relationship measured in the regression.

From Simple to Multiple Regressions

The regression that measures the relationship between two variables becomes a

multiple regression when it is extended to include more than one independent variable

(X1, X2, X3, X4, ...) in trying to explain the dependent variable Y. While the graphical

presentation becomes more difficult, the multiple regression yields output that is an

extension of the simple regression.

Y = a + b X1 + c X2 + dX3 + eX4

The R-squared still measures the strength of the relationship, but an additional R-squared

statistic called the adjusted R squared is computed to counter the bias that will induce the

R-squared to keep increasing as more independent variables are added to the regression.

If there are k independent variables in the regression, the adjusted R squared is computed

as follows:

\[ \text{R squared} = 1 - \frac{\sum_{j=1}^{n} (Y_j - \hat{Y}_j)^2}{\sum_{j=1}^{n} (Y_j - \mu_Y)^2} \]

\[ \text{Adjusted R squared} = 1 - (1 - \text{R squared}) \, \frac{n - 1}{n - k - 1} \]

where \(\hat{Y}_j\) is the value of Y predicted by the regression for observation j.

Multiple regressions are powerful weapons that allow us to examine the determinants of

any variable.

1 The actual values that t statistics need to take on can be found in a table for the t distribution, which is

reproduced at the end of this book as an appendix.
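
A multiple regression can be sketched without any external library by solving the least squares normal equations directly. The code below is an illustrative assumption, not the procedure a statistical package would necessarily use; the helper names, the data, and the small noise terms are hypothetical. It reports the R-squared as one minus the ratio of unexplained to total variation, and the adjusted R-squared with n - k - 1 degrees of freedom for k independent variables.

```python
def solve(A, b):
    """Solve the linear system A x = b by Gaussian elimination
    with partial pivoting."""
    n = len(A)
    M = [row[:] + [rhs] for row, rhs in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - tail) / M[i][i]
    return x


def multiple_regression(rows, ys):
    """rows holds [x1, x2, ...] for each observation.
    Returns (coeffs, r2, adj_r2); coeffs[0] is the intercept."""
    X = [[1.0] + list(r) for r in rows]          # add intercept column
    p = len(X[0])
    XtX = [[sum(row[i] * row[j] for row in X) for j in range(p)]
           for i in range(p)]
    Xty = [sum(row[i] * y for row, y in zip(X, ys)) for i in range(p)]
    coeffs = solve(XtX, Xty)

    fitted = [sum(c * v for c, v in zip(coeffs, row)) for row in X]
    my = sum(ys) / len(ys)
    ssr = sum((y - f) ** 2 for y, f in zip(ys, fitted))   # unexplained
    sst = sum((y - my) ** 2 for y in ys)                  # total
    r2 = 1 - ssr / sst
    n, k = len(ys), p - 1
    adj_r2 = 1 - (1 - r2) * (n - 1) / (n - k - 1)
    return coeffs, r2, adj_r2


# Hypothetical data generated from Y = 1 + 2*X1 - 3*X2, plus small noise.
rows = [[1, 2], [2, 1], [3, 4], [4, 3], [5, 6], [6, 5]]
y_exact = [1 + 2 * x1 - 3 * x2 for x1, x2 in rows]
noise = [0.1, -0.1, 0.2, -0.2, 0.1, -0.1]
y_noisy = [y + e for y, e in zip(y_exact, noise)]

exact_coeffs, exact_r2, _ = multiple_regression(rows, y_exact)
coeffs, r2, adj_r2 = multiple_regression(rows, y_noisy)
print(exact_coeffs)
print(coeffs, r2, adj_r2)
```

The adjusted R-squared is always below the unadjusted value when the fit is imperfect, and the gap widens as more independent variables are added relative to the number of observations.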

Regression Assumptions and Constraints

Both the simple and multiple regressions that we have described in this section

assume linear relationships between the dependent and independent variables. If the

relationship is not linear, we have two choices. One is to transform the variables, by

taking the square, square root or natural log (for example) of the values and hope that the

relationship between the transformed variables is more linear. The other is to run non-linear

regressions that attempt to fit a curve through the data.
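
The effect of transforming variables can be seen in a small sketch. The hypothetical data below follow Y = 3 * X^2 exactly, so a straight line through the raw values fits imperfectly, while a line through the natural logs fits perfectly because ln(Y) = ln(3) + 2 ln(X). The function name and data are assumptions made for illustration.

```python
import math


def fit_line(xs, ys):
    """Least squares fit of Y = a + bX; returns (a, b, r_squared)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = my - b * mx
    r2 = b * b * sxx / syy
    return a, b, r2


# Hypothetical non-linear relationship: Y = 3 * X^2.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [3.0 * v ** 2 for v in x]

# Regression on the raw values.
a_raw, b_raw, r2_raw = fit_line(x, y)

# Regression on the natural logs of both variables.
log_x = [math.log(v) for v in x]
log_y = [math.log(v) for v in y]
a_log, b_log, r2_log = fit_line(log_x, log_y)

print(r2_raw)                        # below 1: a line misses the curvature
print(b_log, math.exp(a_log), r2_log)  # exponent, multiplier, and fit
```

In the log regression the slope recovers the exponent and the exponentiated intercept recovers the multiplier, but this works only when the underlying relationship really is of that multiplicative form.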

There are implicit statistical assumptions behind every multiple regression that we

ignore at our own peril. For the coefficients on the individual independent variables to

make sense, the independent variables need to be uncorrelated with each other, a

condition that is often very difficult to meet. When independent variables are correlated

with each other, the statistical hazard that is created is called multicollinearity. In its

presence, the coefficients on independent variables can take on unexpected signs

(positive instead of negative, for instance) and unpredictable values.

There are simple diagnostic statistics that allow us to measure how far the data

that we are using in a regression may be deviating from our ideal. If these statistics send

out warning signals, we ignore them at our own peril.

Conclusion

In the course of trying to make sense of large amounts of contradictory data, there

are useful statistical tools that we can draw on. While we have looked at only the most

basic ones in this appendix, there are far more sophisticated and powerful tools

available.
