Sunday, August 1, 2010

The Normality Test

Simple and Done in Excel

The normality test is used to determine whether a data set resembles the normal distribution. If the data set can be modeled by the normal distribution, then statistical tests involving the normal distribution and t distribution such as Z tests, t tests, F tests, and Chi-Square tests can be performed on the data set.

There are a number of well-known normality tests such as the Kolmogorov-Smirnov Test, the Shapiro-Wilk Test, and the Anderson-Darling Test. In this article we will describe two normality tests that can be performed with Excel and are much simpler than the above tests.

The Normality Test

The Most Basic Ones

The Histogram - The Simplest Normality Test

Probably the easiest normality test is to plot the data in an Excel histogram and then compare the histogram to a normal curve. This method works much better with larger data sets. It is extremely simple to perform in Excel. Here is an example of how a Histogram is used in Excel as the most basic Normality test:

We are going to evaluate the following data for Normality using a Histogram:

After the input data is arranged as above, we need to determine how we want the data to be grouped when it is broken down into a Histogram. Excel calls the groups "bins." We need to determine the upper and lower range of each bin. When the data is inserted into Excel, we need only to provide the lower boundary of each bin. Here is how I have arbitrarily set up lower boundaries for each bin:

Now we are ready to create a Histogram with Excel. Access the Excel Histogram in Excel 2003 from: Tools / Data Analysis / Histogram. A dialogue box will appear. The following dialogue box is shown completed. Highlight the input data and bin range data by selecting yellow-colored data cells as is shown above. Your dialogue box will look like this one when you are ready to create the Histogram:

Hitting the OK button will give a completed Histogram that will look like this:

Compare this to a Normal curve with the same mean and standard deviation as follows:

In this case, the data does appear to have been drawn from a Normally-distributed population.
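For readers who want to check the same comparison numerically, here is a minimal Python sketch of the idea. It is not part of the Excel procedure: the sample values and bin edges below are hypothetical stand-ins (the article's own data table is not reproduced here), and each bin is assumed to cover values from one edge up to, but not including, the next. The sketch counts the observed points in each bin and compares them with the counts expected from a normal curve having the same mean and standard deviation.

```python
from statistics import NormalDist, mean, stdev

# Hypothetical sample, used only for illustration.
data = [48, 52, 55, 57, 59, 60, 61, 62, 63, 64, 65, 66, 68, 70, 73, 77]

# Hypothetical bin edges; each bin is assumed to cover [edge, next edge).
edges = [40, 50, 60, 70, 80]

# Normal curve with the same mean and standard deviation as the sample.
fitted = NormalDist(mean(data), stdev(data))

observed = []
expected = []
for lo, hi in zip(edges, edges[1:]):
    # Number of sample points that fall inside this bin.
    observed.append(sum(lo <= x < hi for x in data))
    # Count the fitted normal curve predicts for this bin.
    expected.append(len(data) * (fitted.cdf(hi) - fitted.cdf(lo)))

# Text histogram: one '#' per observed point, next to the expected count.
for lo, hi, o, e in zip(edges, edges[1:], observed, expected):
    print(f"[{lo}, {hi}): observed {o:2d}  expected {e:5.1f}  " + "#" * o)
```

If the observed counts rise and fall in roughly the same way as the expected counts, the histogram has the shape the normal curve predicts.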

The Normal Probability Plot - A Simple, Quick Normality Test for Excel

Another normality test that is very easy to implement in Excel is called the Normal Probability Plot. There are 2 ways to create the Normal Probability Plot. They both create the same kind of plot. I use the 1st method because it is accompanied by an explanation of why the method works. I personally have difficulty applying a method that I don't understand. Here are both methods, starting with my preferred choice:

Creating the Normal Probability Plot - Method 1

One characteristic that defines the Normal distribution is that Normally-distributed data will have the same amount of area under the Normal curve between each pair of adjacent points. For example, if there were 7 sampled points total that were perfectly Normally-distributed, the area under the Normal curve between each pair of adjacent points would be 1/7 of the total area under the Normal curve.

The area under the Normal curve between 2 points can be shown graphically as follows:

Calculating the CDF

We can obtain the normal curve area between two sample points (on the x-axis) by using the Cumulative Distribution Function (CDF). The CDF at any point on the x-axis is the total area under the curve to the left of that point. We can obtain the percentage of the normal curve's area in each region by subtracting the CDF at the x value of the region's lower boundary from the CDF at the x value of the region's upper boundary.

The normal distribution that we are trying to fit to the data has only two parameters: the sample's mean and the sample's standard deviation.

The CDF of this normal distribution at any point on the x-Axis can be determined by the following Excel formula:

CDF = NORMDIST ( x Value, Sample Mean, Sample Standard Deviation, TRUE )

Once again, this formula calculates the CDF at that x Value, which is the area under the normal curve to the left of the x Value. That normal curve has as its parameters the sample's mean and standard deviation.
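As a cross-check outside Excel, the same calculation can be sketched in Python with the standard library's NormalDist class. The function names and the mean and standard deviation used here are hypothetical, chosen only for illustration.

```python
from statistics import NormalDist

def normdist_cdf(x, mean, sd):
    """Area under the normal curve to the left of x,
    the same quantity as NORMDIST(x, mean, sd, TRUE) in Excel."""
    return NormalDist(mean, sd).cdf(x)

def area_between(lower, upper, mean, sd):
    """Area of the region between two x values:
    CDF at the upper boundary minus CDF at the lower boundary."""
    return normdist_cdf(upper, mean, sd) - normdist_cdf(lower, mean, sd)

# Hypothetical sample mean and sample standard deviation.
sample_mean, sample_sd = 50.0, 10.0

# Half of the curve lies to the left of the mean.
print(round(normdist_cdf(50.0, sample_mean, sample_sd), 4))        # 0.5
# Area within one standard deviation on either side of the mean.
print(round(area_between(40.0, 60.0, sample_mean, sample_sd), 4))  # 0.6827
```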

Graphical Interpretation of the CDF

CDF (65% of Curve Area From Upper Boundary of Region)

MINUS

CDF (25% of Curve Area From Lower Boundary of Region)

EQUALS

40% of Curve's Total Area Is Inside Region

Given the above, here are the steps for creating a Normal Probability Plot to evaluate the Normality of sampled data.

Here is a set of 7 sampled points that we are going to test for Normality using the Normal Probability Plot:

From these samples, we need to calculate the sample size (count - number of samples), sample mean, and sample standard deviation. Here are those calculations:

Given the above sample size, mean, and standard deviation, if the sample were perfectly Normally-distributed, the sample would have been as follows:

If there are 7 sampled data points that were perfectly Normally distributed, there would be 1/7 of the total Normal curve area between each sampled point.

The Z Score at each sampled point is found with the following Excel formula:

NORMSINV (CDF at each Sample Point)

The Expected Sample Values are found by the following Excel formula:

NORMINV (CDF at Sample Point, Sample Mean, Sample Standard Deviation)

A graph of Expected Sample Values vs. Z Score will be a straight line, as follows:

We now observe the actual data samples compared to the Expected Data Samples for Normally-distributed data having the same mean and standard deviation:

We now wish to see how close the graph of the Actual Sample Values is to the straight line of the Expected Sample Values, as follows:

We can see that the Actual Sample Data (in purple) maps closely to the Expected Sample Values (in dark blue) so we conclude that the data appears to be derived from a Normally-distributed population.

One caution: A larger sample size (at least 50) should be used to obtain valid results. The small sample size (7) was used here for simplicity.
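The steps of Method 1 can also be sketched in Python. This is only an illustration under stated assumptions: the 7 sample values are hypothetical (the article's own table is not reproduced here), and the CDF assigned to the i-th ranked point is assumed to be (i - 0.5)/n, which places exactly 1/n of the curve area between neighboring points. The article does not state which CDF values it used, so that choice is mine.

```python
from statistics import NormalDist, mean, stdev

def expected_normal_values(sample):
    """For each ranked sample point, return a tuple of
    (actual value, assumed CDF, Z Score, Expected Sample Value)."""
    n = len(sample)
    m, s = mean(sample), stdev(sample)
    standard_normal = NormalDist()   # mean 0, standard deviation 1
    fitted = NormalDist(m, s)        # same mean and sd as the sample
    rows = []
    for i, actual in enumerate(sorted(sample), start=1):
        p = (i - 0.5) / n                # ASSUMED CDF at the i-th ranked point
        z = standard_normal.inv_cdf(p)   # like NORMSINV(p)
        expected = fitted.inv_cdf(p)     # like NORMINV(p, mean, sd)
        rows.append((actual, p, z, expected))
    return rows

# Hypothetical sample of 7 points.
sample = [12.1, 9.4, 15.3, 11.0, 13.2, 10.2, 12.6]

for actual, p, z, expected in expected_normal_values(sample):
    print(f"actual {actual:5.1f}  CDF {p:.4f}  Z {z:6.3f}  expected {expected:6.2f}")
```

Plotting the expected values against the Z Scores gives the straight reference line, and plotting the actual values against the same Z Scores shows how closely the sample follows it.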

**************************************************************

Creating the Normal Probability Plot - Method 2

The data set is ranked in order and then plotted on a graph. Each point in the data set represents a y value of a plotted point. The x values of the points are Normal Order Statistic Medians. The closer the graph is to a straight line, the more closely the data set resembles the normal distribution. Correlation analysis can also be performed between the data set (called the Ordered Responses) and the Normal Order Statistic Medians. The closer the correlation coefficient is to 1, the more the data set resembles the normal distribution.

An Example

An example is the best way to illustrate the Normal Probability Plot. Evaluate the following data set of 6 points for normality:

{66, 76, 17, 23, 44, 41}

The rank of each data point is:

5, 6, 1, 2, 4, 3

The data in ranked order is:

{17, 23, 41, 44, 66, 76}

Now we have to calculate the Normal Order Statistic Medians. We have 6 points, so n = 6. The Normal Order Statistic Medians are given by the following formula:

N(i) = G(U(i))

U(i) are the Uniform Order Statistic Medians defined by this formula:

U(i) = 1 - U(n) for i = 1

U(i) = (i - 0.3175)/(n + 0.365) for i = 2, 3, ..., n-1

U(i) = 0.5^(1/n) for i = n

G is called the Percent Point Function of the Normal Distribution. It is the inverse of the cumulative distribution function. In Excel, it is the NORMSINV() function. Given a probability U(i), it returns the x value such that a random sample taken from a standard normal population (µ = 0 and σ = 1) has probability U(i) of being x or less.

Given the above information, here is how the Normal Order Statistic Medians are calculated:

n = 6

First calculate U(i), the Uniform Order Statistic Medians. U(6) is calculated first because U(1) depends on it.

i = 6
U(6) = 0.5^(1/n) = 0.5^(1/6) = 0.8909

i = 1
U(1) = 1 - U(n) = 1 - U(6) = 1 - 0.8909 = 0.1091

i = 2
U(2) = (i - 0.3175)/(n + 0.365) = (2 - 0.3175) / (6 + 0.365) = 0.2643

i = 3
U(3) = (i - 0.3175)/(n + 0.365) = (3 - 0.3175) / (6 + 0.365) = 0.4214

i = 4
U(4) = (i - 0.3175)/(n + 0.365) = (4 - 0.3175) / (6 + 0.365) = 0.5786

i = 5
U(5) = (i - 0.3175)/(n + 0.365) = (5 - 0.3175) / (6 + 0.365) = 0.7357

So,

U(1) = 0.1091
U(2) = 0.2643
U(3) = 0.4214
U(4) = 0.5786
U(5) = 0.7357
U(6) = 0.8909

The Normal Order Statistic Medians are found in Excel by the following formula:

N(i) = G(U(i)) = NORMSINV(U(i))

N(1) = NORMSINV(U(1)) = NORMSINV(0.1091) = -1.231
N(2) = NORMSINV(U(2)) = NORMSINV(0.2643) = -0.630
N(3) = NORMSINV(U(3)) = NORMSINV(0.4214) = -0.198
N(4) = NORMSINV(U(4)) = NORMSINV(0.5786) = 0.198
N(5) = NORMSINV(U(5)) = NORMSINV(0.7357) = 0.630
N(6) = NORMSINV(U(6)) = NORMSINV(0.8909) = 1.231

The above are the X values of the plotted points. The Y values are the corresponding points of the ranked data set. The ranked data set is:

{17, 23, 41, 44, 66, 76}

So, the following points can be plotted:

(-1.231, 17) (-0.630, 23) (-0.198, 41) (0.198, 44) (0.630, 66) (1.231, 76)

The final graph will resemble a chart such as this:

The more closely the plotted points resemble a straight line, the more closely the data set resembles the normal distribution. You can also run correlation analysis between the data set of Ordered Responses and the Normal Order Statistic Medians. The closer the correlation coefficient is to 1, the more closely the data set resembles the normal distribution.

There are other well-known Normality tests such as the Kolmogorov-Smirnov Goodness-of-Fit Test, the Anderson-Darling Goodness-of-Fit Test, the Shapiro-Wilk Test, and the Chi-Square Goodness-of-Fit Test. I will very shortly publish an article or two in this blog which will detail how to do these tests in Excel.
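Method 2 can be sketched in Python as follows. This is a minimal illustration, not a replacement for the Excel steps. It assumes the uniform order statistic medians follow Filliben's estimate, in which the last median is 0.5^(1/n) (0.5 raised to the power 1/n), the first is 1 minus the last, and the ones in between are (i - 0.3175)/(n + 0.365). The function names are hypothetical, and the correlation coefficient is computed by hand so that the sketch needs only the standard library.

```python
from math import sqrt
from statistics import NormalDist

def uniform_order_statistic_medians(n):
    """Uniform order statistic medians for i = 1..n (assumes Filliben's estimate)."""
    if n < 2:
        raise ValueError("need at least 2 data points")
    last = 0.5 ** (1.0 / n)                                     # i = n
    middle = [(i - 0.3175) / (n + 0.365) for i in range(2, n)]  # i = 2..n-1
    return [1.0 - last] + middle + [last]                       # i = 1 first

def normal_probability_plot(data):
    """Return (x, y) points: normal order statistic median vs. ranked data value."""
    ordered = sorted(data)
    standard_normal = NormalDist()   # inv_cdf plays the role of NORMSINV
    medians = [standard_normal.inv_cdf(u)
               for u in uniform_order_statistic_medians(len(ordered))]
    return list(zip(medians, ordered))

def pearson_r(points):
    """Correlation coefficient between the x and y values of the points."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    sxy = sum((x - mx) * (y - my) for x, y in points)
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

# The 6-point data set from the example above.
points = normal_probability_plot([66, 76, 17, 23, 44, 41])
for x, y in points:
    print(f"({x:.3f}, {y})")
print(f"correlation coefficient = {pearson_r(points):.3f}")
```

Because the medians increase with rank, the plotted points run from lower left to upper right, and a correlation coefficient near 1 indicates a nearly straight line.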

If you are going to perform any statistical analysis that uses the normal distribution or t distribution such as Z test, t tests, F tests, and chi-square tests, you should first test your data set for normality. The Normal Probability Plot described in this article is probably the easiest and quickest way to do it in Excel.
