Sunday, August 1, 2010

A Quick Normality Test Easily Done In Excel

The Normality Test

Simple and Done in Excel

The normality test is used to determine whether a data set resembles the normal distribution. If the data set can be modeled by the normal distribution, then statistical tests involving the normal distribution and t distribution such as Z tests, t tests, F tests, and Chi-Square tests can be performed on the data set.

There are a number of well-known normality tests such as the Kolmogorov-Smirnov Test, the Shapiro-Wilk Test, and the Anderson-Darling Test. In this article we will describe two normality tests that can be performed with Excel and are much simpler than the above tests.

The Normality Test

The Most Basic Ones

The Histogram - The Simplest Normality Test

Probably the easiest normality test is to plot the data in an Excel histogram and then compare the histogram to a normal curve. This method works much better with larger data sets. It is extremely simple to perform in Excel. Here is an example of how a Histogram is used in Excel as the most basic Normality test:

We are going to evaluate the following data for Normality using a Histogram:


After the input data is arranged as above, we need to determine how we want the data to be grouped when it is broken down into a Histogram. Excel calls the groups "bins." We need to determine the upper and lower range of each bin. When the data is inserted into Excel, we need only to provide the lower boundary of each bin. Here is how I have arbitrarily set up the lower boundaries for each bin:




Now we are ready to create a Histogram with Excel. Access the Excel Histogram in Excel 2003 from: Tools / Data Analysis / Histogram. A dialogue box will appear. The following dialogue box is shown completed. Highlight the input data and bin range data by selecting yellow-colored data cells as is shown above. Your dialogue box will look like this one when you are ready to create the Histogram:




Hitting the OK button will give a completed Histogram that will look like this:




Compare this to a Normal curve with the same mean and standard deviation as follows:



In this case, the data does appear to have been drawn from a Normally-distributed population.
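For readers who want to try the same comparison outside Excel, here is a minimal Python sketch. The data values and bin edges are made up for illustration (they are not the data used above), the function names are hypothetical, and the half-open bins [lower, upper) are an assumption of this sketch rather than Excel's exact binning rule. It counts the observations in each bin and sets them next to the counts a Normal curve with the same mean and standard deviation would predict.

```python
from statistics import NormalDist, mean, stdev

# Hypothetical data and bin edges, chosen only for illustration.
data = [21, 25, 28, 30, 31, 33, 34, 35, 36, 37, 38, 40, 41, 43, 46, 50]
edges = [20, 25, 30, 35, 40, 45, 50, 55]  # bin k covers [edges[k], edges[k+1])

# Normal curve with the same mean and (sample) standard deviation as the data.
fitted = NormalDist(mean(data), stdev(data))


def observed_counts(values, edges):
    """Count how many values fall in each half-open bin [lo, hi)."""
    return [sum(lo <= v < hi for v in values) for lo, hi in zip(edges, edges[1:])]


def expected_counts(dist, edges, n):
    """Count each bin would hold, on average, if n values came from dist."""
    return [n * (dist.cdf(hi) - dist.cdf(lo)) for lo, hi in zip(edges, edges[1:])]


obs = observed_counts(data, edges)
exp = expected_counts(fitted, edges, len(data))

for lo, hi, o, e in zip(edges, edges[1:], obs, exp):
    print(f"[{lo}, {hi}): observed {o}, expected {e:.1f}")
```

If the observed and expected columns rise and fall together, the histogram has roughly the shape of the Normal curve. This is still an eyeball check, not a formal test.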


The Normal Probability Plot  -
A Simple, Quick Normality Test for Excel

Another normality test that is very easy to implement in Excel is called the Normal Probability Plot. There are 2 ways to create the Normal Probability Plot, and both are meant to produce essentially the same plot. I use the 1st method because it is accompanied by an explanation of why the method works. I personally have difficulty applying a method that I don't understand. Here are both methods, starting with my preferred choice:

Creating the Normal Probability Plot - Method 1

One characteristic of Normally-distributed data is that the sampled points divide the area under the Normal curve evenly: the same amount of area lies between each pair of adjacent points. For example, if there were 7 sampled points in total that were perfectly Normally-distributed, the area under the Normal curve between each pair of adjacent points would be 1/7 of the total area under the Normal curve.

The area under the Normal curve between 2 points can be shown graphically as follows:


Calculating the CDF

We can obtain the normal curve area between two sample points (on the X-axis) by using the Cumulative Distribution Function (CDF). The CDF at any point on the x-axis is the total area under the curve to the left of that point. We can obtain the percentage of area under the normal curve for each region by subtracting the CDF at the x-Value of the region's lower boundary from the CDF at the x-Value of the region's upper boundary.


The normal distribution that we are trying to fit to the data has only two parameters: the sample's mean and standard deviation.


The CDF of this normal distribution at any point on the x-Axis can be determined by the following Excel formula:


CDF = NORMDIST ( x Value, Sample Mean, Sample Standard Deviation, TRUE )


Once again, this formula calculates the CDF at that x Value, which is the area under the normal curve to the left of the x Value. That normal curve has as its parameters the sample's mean and standard deviation.
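As a cross-check outside Excel, the same calculation can be sketched in Python with the standard library's NormalDist class. The function names below are hypothetical, and the mean and standard deviation in the usage lines are illustrative values, not the sample from this article.

```python
from statistics import NormalDist


def normdist_cdf(x, sample_mean, sample_sd):
    """Area under the fitted normal curve to the left of x.

    Plays the role of NORMDIST(x, mean, sd, TRUE) in this sketch.
    """
    return NormalDist(sample_mean, sample_sd).cdf(x)


def area_between(lower, upper, sample_mean, sample_sd):
    """Curve area in a region: CDF at upper boundary minus CDF at lower boundary."""
    return normdist_cdf(upper, sample_mean, sample_sd) - normdist_cdf(
        lower, sample_mean, sample_sd
    )


# Illustrative values only: a curve with mean 50 and standard deviation 10.
print(normdist_cdf(50, 50, 10))      # half the area lies left of the mean
print(area_between(40, 60, 50, 10))  # area within one standard deviation
```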


Graphical Interpretation of the CDF

CDF (65% of Curve Area From Upper Boundary of Region)



MINUS


CDF (25% of Curve Area From Lower Boundary of Region)

EQUALS


25% of Curve's Total Area Is Inside Region



Given the above, here are the Steps to creating a Normal Probability Plot to evaluate the Normality of sampled data.

Here is a set of 7 sampled points that we are going to test for Normality using the Normal Probability Plot:




From these samples, we need to calculate the sample size (count - number of samples), sample mean, and sample standard deviation. Here are those calculations:






Given the above sample size, mean, and standard deviation, if the sample were perfectly Normally-distributed, the sample would have been as follows:






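If there are 7 sampled data points that are perfectly Normally distributed, there will be 1/7 of the total Normal curve area between each pair of adjacent sampled points. Half of that area (1/14) is left over in each tail, so the CDF at the i-th point is (2i - 1)/(2n). For n = 7, the CDF values are 1/14, 3/14, 5/14, ..., 13/14.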


The Z Score at each sampled point is found with the following Excel formula:


NORMSINV (CDF at each Sample Point)


The Expected Sample Values are found by the following Excel formula:


NORMINV (CDF at Sample Point, Sample Mean, Sample Stan. Dev.)
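The two formulas above can be sketched in Python as follows. The 7 sample values are hypothetical (the article's own sample is not reproduced here), the variable names are my own, and the sketch assumes the sample (n - 1) standard deviation. The CDF values 1/14, 3/14, ..., 13/14 follow from placing 1/7 of the curve area between adjacent points.

```python
from statistics import NormalDist, mean, stdev

# Hypothetical 7-point sample, used only for illustration.
sample = [4.1, 5.3, 5.9, 6.4, 7.0, 7.8, 9.2]

n = len(sample)
sample_mean = mean(sample)
sample_sd = stdev(sample)  # assumption: sample (n - 1) standard deviation

# CDF at the i-th ideal point: 1/(2n), 3/(2n), ..., (2n - 1)/(2n).
cdf_values = [(2 * i - 1) / (2 * n) for i in range(1, n + 1)]

# Counterpart of NORMSINV(CDF): Z Score on the standard normal curve.
z_scores = [NormalDist().inv_cdf(p) for p in cdf_values]

# Counterpart of NORMINV(CDF, mean, sd): Expected Sample Value.
fitted = NormalDist(sample_mean, sample_sd)
expected_values = [fitted.inv_cdf(p) for p in cdf_values]

for z, expected, actual in zip(z_scores, expected_values, sorted(sample)):
    print(f"z = {z:+.3f}   expected = {expected:.3f}   actual = {actual:.3f}")
```

Each Expected Sample Value equals the sample mean plus the Z Score times the sample standard deviation, which is why Expected Sample Values plotted against Z Scores fall on a straight line.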


A graph of Expected Sample Values vs. Z Score will be a straight line, as follows:



We now observe the actual data samples compared to the Expected Data Samples for Normally-distributed data having the same mean and standard deviation:






We now wish to see how closely the graph of the Actual Sample Values follows the straight line of the Expected Sample Values, as follows:


We can see that the Actual Sample Data (in purple) maps closely to the Expected Sample Values (in dark blue) so we conclude that the data appears to be derived from a Normally-distributed population.

One caution: A larger sample size (at least 50) should be used to obtain valid results. The small sample size (7) was used here for simplicity.


**************************************************************

Creating the Normal Probability Plot - Method 2

The data set is ranked in order and then plotted on a graph. Each point in the data set provides the y value of a plotted point. The x values of the points are the Normal Order Statistic Medians. The closer the graph is to a straight line, the more closely the data set resembles the normal distribution. Correlation analysis can also be performed between the ranked data set (called the Ordered Responses) and the Normal Order Statistic Medians. The closer the correlation coefficient is to 1, the more the data set resembles the normal distribution.


An Example

An example is the best way to illustrate the Normal Probability Plot. Evaluate the following data set of 6 points for normality:

{66, 76, 17, 23, 44, 41}

The rank of each data point is:

5, 6, 1, 2, 4, 3

The data in ranked order is:

{17, 23, 41, 44, 66, 76}
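The ranking step can be sketched in Python like this. The variable names are my own, and the sketch assumes there are no tied values in the data set.

```python
data = [66, 76, 17, 23, 44, 41]

# The data in ranked (ascending) order.
ordered = sorted(data)

# Rank 1 is the smallest value. This lookup assumes no ties.
rank_of = {value: position for position, value in enumerate(ordered, start=1)}
ranks = [rank_of[value] for value in data]

print(ranks)    # [5, 6, 1, 2, 4, 3]
print(ordered)  # [17, 23, 41, 44, 66, 76]
```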


Now we have to calculate the Normal Order Statistic Medians. We know that we have 6 points so n = 6. The Normal Order Statistic Medians are given by the following formula:


N(i) = G(U(i))


U(i) are the Uniform Order Statistic Medians, written m(i) in the formulas below:

m(i) = 1 - m(n) for i = 1


m(i) = (i - 0.3175)/(n + 0.365) for i = 2, 3, ..., n-1


m(i) = 0.5^(1/n) for i = n


Note that the last formula raises 0.5 to the power 1/n. It is not 0.5 multiplied by 1/n.


G is called the Percent Point Function of the Normal Distribution. It is the inverse of the cumulative distribution function. In Excel, it is the NORMSINV(p) function. Given a probability p, it returns the value x such that a variable on the standard normal curve (µ = 0 and σ = 1) has probability p of being x or less.


Given the above information, here is how the Normal Order Statistic Medians are calculated:

n = 6

Now calculate U(i), the Uniform Order Statistic Medians, using the three formulas above. Because m(1) depends on m(n), it uses the value of m(6) calculated in the last step.


i = 1
m(1) = 1 - m(n) = 1 - m(6) = 1 - 0.8909 = 0.1091

i = 2
m(2) = (i - 0.3175)/(n + 0.365) = (2 - 0.3175) / (6 + 0.365) = 0.2643


i = 3
m(3) = (i - 0.3175)/(n + 0.365) = (3 - 0.3175) / (6 + 0.365) = 0.4214


i = 4
m(4) = (i - 0.3175)/(n + 0.365) = (4 - 0.3175) / (6 + 0.365) = 0.5786


i = 5
m(5) = (i - 0.3175)/(n + 0.365) = (5 - 0.3175) / (6 + 0.365) = 0.7357


i = 6
m(6) = 0.5^(1/n) = 0.5^(1/6) = 0.8909


So,


U(1) = 0.1091
U(2) = 0.2643
U(3) = 0.4214
U(4) = 0.5786
U(5) = 0.7357
U(6) = 0.8909


The Normal Order Statistic Medians are given by the following formula: N(i) = G(U(i)). G is the inverse of the cumulative distribution function. It returns the x value for which a random sample taken from a standardized normally distributed population has probability U(i) of being x or less.


This is found in Excel by the following formula:

N(i) = G(U(i)) = NORMSINV(U(i))

So, the Normal Order Statistic Medians (rounded to three decimal places) are:


N(1) = NORMSINV(U(1)) = NORMSINV(0.1091) = -1.231
N(2) = NORMSINV(U(2)) = NORMSINV(0.2643) = -0.630
N(3) = NORMSINV(U(3)) = NORMSINV(0.4214) = -0.198
N(4) = NORMSINV(U(4)) = NORMSINV(0.5786) = 0.198
N(5) = NORMSINV(U(5)) = NORMSINV(0.7357) = 0.630
N(6) = NORMSINV(U(6)) = NORMSINV(0.8909) = 1.231

The above are the X values of the data points whose Y values are the ranked points in the data set. The ranked data set is:


{17, 23, 41, 44, 66, 76}

So, the following points can be plotted:


(-1.231, 17) (-0.630, 23) (-0.198, 41) (0.198, 44) (0.630, 66) (1.231, 76)


The final graph will resemble a chart such as this:



 

The more closely the plotted points resemble a straight line, the more closely the data set resembles the normal distribution. You can also run correlation analysis between the data set of Ordered Responses and the Normal Order Statistic Medians. The closer the correlation coefficient is to 1, the more closely the data set resembles the normal distribution.
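Method 2, including the correlation check, can be sketched in Python as follows. The function names are hypothetical. For the last Uniform Order Statistic Median the sketch uses 0.5 raised to the power 1/n, which is the form given in the NIST handbook that readers cite in the comments. It assumes n is at least 2.

```python
from statistics import NormalDist


def uniform_order_statistic_medians(n):
    """U(1)..U(n), with m(n) = 0.5 ** (1/n) and m(1) = 1 - m(n)."""
    last = 0.5 ** (1.0 / n)
    middle = [(i - 0.3175) / (n + 0.365) for i in range(2, n)]
    return [1.0 - last] + middle + [last]


def normal_order_statistic_medians(n):
    """N(i) = G(U(i)), where G is the inverse standard normal CDF (NORMSINV)."""
    standard_normal = NormalDist()
    return [standard_normal.inv_cdf(u) for u in uniform_order_statistic_medians(n)]


def pearson_r(xs, ys):
    """Plain Pearson correlation coefficient between two equal-length lists."""
    count = len(xs)
    mean_x, mean_y = sum(xs) / count, sum(ys) / count
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5


data = [66, 76, 17, 23, 44, 41]
ordered = sorted(data)                               # Ordered Responses (y values)
medians = normal_order_statistic_medians(len(data))  # x values
r = pearson_r(medians, ordered)

for x, y in zip(medians, ordered):
    print(f"({x:+.3f}, {y})")
print(f"correlation coefficient: {r:.4f}")
```

A correlation coefficient close to 1 means the plotted points lie close to a straight line. How close is close enough depends on the sample size, so treat the number as a guide rather than a formal test.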


There are other well-known Normality tests such as the Kolmogorov-Smirnov Goodness-of-Fit Test, the Anderson-Darling Goodness-of-Fit Test, the Shapiro-Wilk Test, and the Chi-Square Goodness-of-Fit Test. I will very shortly publish an article or two in this blog which will detail how to do these tests in Excel.


If you are going to perform any statistical analysis that uses the normal distribution or t distribution such as Z test, t tests, F tests, and chi-square tests, you should first test your data set for normality. The Normal Probability Plot described in this article is probably the easiest and quickest way to do it in Excel.


12 comments:

  1. Where are you getting 1/14, 3/14, 5/14 ... for CDF at each sample point in the first Normal Plat example?

  2. The values of 1/14 etc. are the probability intervals. You get them as follows: start with 1/(2*n), so 1/(2*7), then add 2/2n to the previous value. You'll get 1/14, 3/14, 5/14, etc.

    As the author said, there's a 1/7th distance between each probability value, or 1/n in general.

  3. quite helpful to young statisticians and can do it alone

  4. How does m(6) = 0.5(1/6) = 0.8909? By my math, 0.5*(1/6) = 0.0833

  5. Just fixed it. Thanks for catching that.

  6. Excel is also very useful as SPSS.

  7. This was of very timely help. Thanks much

  8. I found the first method of creating the normal probability plot helpful. But when I use the data from the second method for the first, the corresponding correlations do not agree. It looks like it is because of an error in the expression for m(n). You use 0.5(1/n) when it should be 0.5 to the 1/n power. When I make this change the correlations are close.

  9. In your first method you can quantify what you see by using the R² function --> =rsq(actual values;z-scores at each sample point if sample is normal distributed). The more closer to 1 the more the actual values are normal distributed.

  10. A formula for Uniform Order Statistic Medians is wrong. m(i) = 0.5 ^ (1/n) for i = n. That is m(6) = m(i) = 0.5 ^ (1/n) for i = n = m(i) = 0.5 ^ (1/6) = 0.8909. It results in a smoother line for your example. Source: http://www.itl.nist.gov/div898/handbook/eda/section3/normprpl.htm
