Thursday, May 29, 2014

Residual Normality Tests in Excel – Kolmogorov-Smirnov Test, Anderson-Darling Test, and Shapiro-Wilk Test For Simple Linear Regression

This is one of the following seven articles on Simple Linear Regression in Excel

Overview of Simple Linear Regression in Excel 2010 and Excel 2013

Complete Simple Linear Regression Example in 7 Steps in Excel 2010 and Excel 2013

Residual Evaluation For Simple Regression in 8 Steps in Excel 2010 and Excel 2013

Residual Normality Tests in Excel – Kolmogorov-Smirnov Test, Anderson-Darling Test, and Shapiro-Wilk Test For Simple Linear Regression

Evaluation of Simple Regression Output For Excel 2010 and Excel 2013

All Calculations Performed By the Simple Regression Data Analysis Tool in Excel 2010 and Excel 2013

Prediction Interval of Simple Regression in Excel 2010 and Excel 2013

 

Normality Testing of

Residuals in Excel –

Kolmogorov-Smirnov Test,

Anderson-Darling Test,

Shapiro-Wilk Test

An important assumption of linear regression is that the Residuals be normally-distributed. Normality testing must be performed on the Residuals. The following five normality tests will be performed here:

1) An Excel histogram of the Residuals will be created.

2) A normal probability plot of the Residuals will be created in Excel.

3) The Kolmogorov-Smirnov test for normality of Residuals will be performed in Excel.

4) The Anderson-Darling test for normality of Residuals will be performed in Excel.

5) The Shapiro-Wilk test for normality of Residuals will be performed in Excel.

 

Histogram of the Residuals in Excel

An Excel histogram of the Residuals is shown as follows:

regression,excel,excel 2010,excel 2013,statistics,normality,normality testing,kolmogorov,anderson-darling,shapiro-wilk,nonparametric
(Click On Image To See a Larger Version)

The Residuals appear to be distributed according to the bell-shaped normal distribution in this Excel histogram. This histogram was created in Excel by inserting the following information into the Excel histogram dialogue box:

regression,excel,excel 2010,excel 2013,statistics,normality,normality testing,kolmogorov,anderson-darling,shapiro-wilk,nonparametric (Click On Image To See a Larger Version)

This histogram can also be created with formulas and a chart. The advantage of creating a histogram with formulas and a chart instead of using the Histogram tool from the Data Analysis ToolPak is that chart and formulas in Excel update their output automatically when data is changed. All of the tools in the Data Analysis ToolPak must be rerun to update the output when input data has changed. The histogram can be created with charts and formulas as follows:

regression,excel,excel 2010,excel 2013,statistics,normality,normality testing,kolmogorov,anderson-darling,shapiro-wilk,nonparametric (Click On Image To See a Larger Version)

Using this data to create an Excel bar chart produces the following histogram:

regression,excel,excel 2010,excel 2013,statistics,normality,normality testing,kolmogorov,anderson-darling,shapiro-wilk,nonparametric (Click On Image To See a Larger Version)

The advantage of creating the histogram with an Excel chart is that the chart automatically updates itself when the input data is changed.

 

Normal Probability Plot of Residuals

in Excel

A Normal Probability Plot created in Excel of the Residuals is shown as follows:

regression,excel,excel 2010,excel 2013,statistics,normality,normality testing,kolmogorov,anderson-darling,shapiro-wilk,nonparametric (Click On Image To See a Larger Version)

The Normal Probability Plot of the Residuals provides strong evidence that the Residual are normally-distributed. The more closely the graph of the Actual Residual values (in red) resembles a straight line (in blue), the more closely the Residuals are to being normally-distributed. The Actual Residual values are very close to being a straight line (the red graph deviates only slightly from the blue straight line).

 

Kolmogorov-Smirnov Test For

Normality of Residuals in Excel

The Kolmogorov-Smirnov Test is a hypothesis test that is widely used to determine whether a data sample is normally-distributed. The Kolmogorov-Smirnov Test calculates the distance between the Cumulative Distribution Function (CDF) of each data point and what the CDF of that data point would be if the sample were perfectly normally-distributed. The Null Hypothesis of the Kolmogorov-Smirnov Test states that the distribution of actual data points matches the distribution that is being tested. In this case the data sample is being compared to the normal distribution.

The largest distance between the CDF of any data point and its expected CDF is compared to Kolmogorov-Smirnov Critical Value for a specific sample size and Alpha. If this largest distance exceeds the Critical Value, the Null Hypothesis is rejected and the data sample is determined to have a different distribution than the tested distribution. If the largest distance does not exceed the Critical Value, we cannot reject the Null Hypothesis, which states that the sample has the same distribution as the tested distribution.

F(Xk) = CDF(Xk) for normal distribution

F(Xk) = NORM.DIST(Xk, Sample Mean, Sample Stan. Dev., TRUE)

Residual Data

regression,excel,excel 2010,excel 2013,statistics,normality,normality testing,kolmogorov,anderson-darling,shapiro-wilk,nonparametric (Click On Image To See a Larger Version)

0.1480 = Max Difference Between Actual and Expected CDF

20 = n = Number of Data Points

0.05 = α

regression,excel,excel 2010,excel 2013,statistics,normality,normality testing,kolmogorov,anderson-darling,shapiro-wilk,nonparametric (Click On Image To See a Larger Version)

The Null Hypothesis Stating That the Residuals Are Normally-Distributed Cannot Be Rejected

The Null Hypothesis for the Kolmogorov-Smirnov Test for Normality, which states that the sample data are normally-distributed, is rejected only if the maximum difference between the expected and actual CDF of any of the data points exceed the Critical Value for the given n and α. That is not the case here.

The Max Difference Between the Actual and Expected CDF for Variable 1 (0.1480) is significantly less than the Kolmogorov-Smirnov Critical Value for n = 20 (0.29) at α = 0.05 so the Null Hypotheses of the Kolmogorov-Smirnov Test for the Residual data is accepted.

 

Anderson-Darling Test For

Normality of Residuals in Excel

The Anderson-Darling Test is a hypothesis test that is widely used to determine whether a data sample is normally-distributed. The Anderson-Darling Test calculates a test statistic based upon the actual value of each data point and the Cumulative Distribution Function (CDF) of each data point if the sample were perfectly normally-distributed.

The Anderson-Darling Test is considered to be slightly more powerful than the Kolmogorov-Smirnov test for the following two reasons:

The Kolmogorov-Smirnov test is distribution-free. i.e., its critical values are the same for all distributions tested. The Anderson-darling tests requires critical values calculated for each tested distribution and is therefore more sensitive to the specific distribution.

The Anderson-Darling test gives more weight to values in the outer tails than the Kolmogorov-Smirnov test. The K-S test is less sensitive to aberration in outer values than the A-D test.

If the test statistic exceeds the Anderson-Darling Critical Value for a given Alpha, the Null Hypothesis is rejected and the data sample is determined to have a different distribution than the tested distribution. If the test statistic does not exceed the Critical Value, we cannot reject the Null Hypothesis, which states that the sample has the same distribution as the tested distribution.

F(Xk) = CDF(Xk) for normal distribution

F(Xk) = NORM.DIST(Xk, Sample Mean, Sample Stan. Dev., TRUE)

Residual Data

regression,excel,excel 2010,excel 2013,statistics,normality,normality testing,kolmogorov,anderson-darling,shapiro-wilk,nonparametric (Click On Image To See a Larger Version)

Test Statistic A = 1.333

The above test statistic should be adjusted in the general case that both population mean an population variance are unknown. This is often the case and is an assumption that can always be applied.

When population mean and population variance are unknown, make the following adjustment:

Adjusted Test Statistic A* = ( 1 + 0.75/n + 2.25/n2 )*A

However, the population mean of the residuals is known to be 0. The population standard deviation of the residuals is now known.

In this case Test Statistic A should be used and not Adjusted Test Statistic A*.

Reject the Null Hypothesis of the Anderson-Darling Test which states that the data are normally-distributed when the population mean is known but the population standard deviation is not known if any the following are true:

A > 1.760 When Level of Significance (α) = 0.10

A > 2.323 When Level of Significance (α) = 0.05

A > 3.69 When Level of Significance (α) = 0.01

The Null Hypothesis Stating That the Residuals Are Normally-Distributed Cannot Be Rejected

The Null Hypothesis for the Anderson-Darling Test for Normality, which states that the sample data are normally-distributed, is rejected if the Test Statistic (A) exceeds the Critical Value for the given n and α.

The Test Statistic (A) for the Residual data is significantly less than the Anderson-Darling Critical Value for α = 0.05 so the Null Hypotheses of the Anderson-Darling Test for the Residual data is not rejected. The Null Hypothesis states that the residuals are normally-distributed.

 

Shapiro-Wilk Test For Normality

in Excel

The Shapiro-Wilk Test is a hypothesis test that is widely used to determine whether a data sample is normally-distributed. A test statistic W is calculated. If this test statistic is less than a critical value of W for a given level of significance (alpha) and sample size, the Null Hypothesis which states that the sample is normally-distributed is rejected.

The Shapiro-Wilk Test is a robust normality test and is widely-used because of its slightly superior performance against other normality tests, especially with small sample sizes. Superior performance means that it correctly rejects the Null Hypothesis that the data are not normally-distributed a slightly higher percentage of times than most other normality tests, particularly at small sample sizes.

The Shapiro-Wilk normality test is generally regarded as being slightly more powerful than the Anderson-Darling normality test, which in turn is regarded as being slightly more powerful than the Kolmogorov-Smirnov normality test.

Residual Data

regression,excel,excel 2010,excel 2013,statistics,normality,normality testing,kolmogorov,anderson-darling,shapiro-wilk,nonparametric (Click On Image To See a Larger Version)

0.966014 = Test Statistic W

0.905 = W Critical for the following n and Alpha

20 = n = Number of Data Points

0.05 = α

The Null Hypothesis Stating That the Data Are Normally-Distributed Cannot Be Rejected

Test Statistic W (0.966014) is larger than W Critical 0.905. The Null Hypothesis therefore cannot be rejected. There is not enough evidence to state that the data are not normally-distributed with a confidence level of 95 percent.

 

Correctable Reasons Why Normal

Data Can Appear Non-Normal

If a normality test indicates that data are not normally-distributed, it is a good idea to do a quick evaluation of whether any of the following factors have caused normally-distributed data to appear to be non-normally-distributed:

 

1) Outliers

– Too many outliers can easily skew normally-distributed data. An outlier can often be removed if a specific cause of its extreme value can be identified. Some outliers are expected in normally-distributed data.

 

2) Data Has Been Affected by More Than One Process

– Variations to a process such as shift changes or operator changes can change the distribution of data. Multiple modal values in the data are common indicators that this might be occurring. The effects of different inputs must be identified and eliminated from the data.

 

3) Not Enough Data

– Normally-distributed data will often not assume the appearance of normality until at least 25 data points have been sampled.

 

4) Measuring Devices Have Poor Resolution

– Sometimes (but not always) this problem can be solved by using a larger sample size.

 

5) Data Approaching Zero or a Natural Limit

– If a large number of data values approach a limit such as zero, calculations using very small values might skew computations of important values such as the mean. A simple solution might be to raise all the values by a certain amount.

 

6) Only a Subset of a Process’ Output Is Being Analyzed

– If only a subset of data from an entire process is being used, a representative sample in not being collected. Normally-distributed results would not appear normally-distributed if a representative sample of the entire process is not collected.

 

Excel Master Series Blog Directory

Statistical Topics and Articles In Each Topic

 

No comments:

Post a Comment