Thursday, May 29, 2014

Residual Evaluation For Simple Regression in Excel 2010 and Excel 2013

This is one of the following seven articles on Simple Linear Regression in Excel

Overview of Simple Linear Regression in Excel 2010 and Excel 2013

Complete Simple Linear Regression Example in 7 Steps in Excel 2010 and Excel 2013

Residual Evaluation For Simple Regression in 8 Steps in Excel 2010 and Excel 2013

Residual Normality Tests in Excel – Kolmogorov-Smirnov Test, Anderson-Darling Test, and Shapiro-Wilk Test For Simple Linear Regression

Evaluation of Simple Regression Output For Excel 2010 and Excel 2013

All Calculations Performed By the Simple Regression Data Analysis Tool in Excel 2010 and Excel 2013

Prediction Interval of Simple Regression in Excel 2010 and Excel 2013

 

Excel Regression Step 6 – Evaluate the Residuals

The purpose of Residual analysis is to confirm the underlying validity of the regression. Linear regression has a number of required assumptions about the residuals. These assumptions should be evaluated before continuing the analysis of the Excel regression output. If one or more of the required residual assumptions are shown to be invalid, the entire regression analysis might be considered, at best, questionable or, at worst, invalid. The residuals should therefore be analyzed first before analyzing any other part of the Excel regression output.

The Residuals are sometimes called the Error Terms. The Residual is the difference between an observed data value and the value predicted by the regression equation. The formula for the Residual is as follows:

Residual = Yactual – Yestimated
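The formula above can be illustrated outside of Excel. The following is a minimal Python sketch, not part of the Excel output: the function names and the sample data are hypothetical and are used only to show how each Residual is the observed Y value minus the Y value estimated by the fitted line.

```python
def fit_line(xs, ys):
    """Least-squares slope and intercept for a single X variable."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def residuals(xs, ys):
    """Residual = Y actual - Y estimated, for every data point."""
    slope, intercept = fit_line(xs, ys)
    return [y - (intercept + slope * x) for x, y in zip(xs, ys)]


# Hypothetical data, for illustration only (not the article's data set).
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]
res = residuals(xs, ys)
```

For a least-squares line that includes an intercept, the Residuals computed this way sum to zero apart from rounding error.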

 

Evaluating Simple Linear Regression’s Required Residual Assumptions

Linear regression has several required assumptions regarding the residuals. These required residual assumptions are as follows:

1) Outliers have been removed.

2) The residuals must be independent of each other. They must not be correlated with each other.

3) The residuals should have a mean of approximately 0.

4) The residuals must have similar variances throughout all residual values.

5) The residuals must be normally-distributed.

6) The residuals may not be highly correlated with any of the independent (X) variables.

7) There must be enough data points to conduct normality testing of residuals.

Here is how to evaluate each of these assumptions in Excel.

 

Performing the Simple Regression Residual Evaluation Steps

 

1) Locating and Removing Outliers

In many cases a data point is considered to be an outlier if its residual value is more than three standard deviations from the mean of the residuals. Checking the checkbox next to Standardized Residuals in the regression dialogue box calculates the standardized value of each residual, which is the number of standard deviations that the residual is from the residual mean.

Following are the standardized residuals of the current data set. None are larger in absolute value than 1.69 standard deviations from the residual mean.

(Image: Excel regression output listing the residuals and their standardized values.)

As the Excel output above shows, no residual in this data set is more than 1.69 standard deviations from the residual mean, which is well within the three-standard-deviation threshold. On that basis, no data points are considered outliers as a result of having excessively large residuals.

Any outliers that have been removed should be documented and evaluated. Outliers more than 3 standard deviations from the mean are to be expected occasionally for normally-distributed data. If an outlier appears to have been generated by the normal process and is not an aberration of the process, then perhaps it should not be removed. One item to check is whether a data entry error occurred when inputting the data set. Another item to check is whether there was a measurement error when that data point’s values were recorded.

If a data point is removed, the regression analysis has to be performed again on the new data set that does not include that data point.
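The three-standard-deviation rule can be sketched in Python as follows. This is an illustration under a stated assumption: each residual is standardized by subtracting the residual mean and dividing by the sample standard deviation of the residuals, which may differ slightly from the scaling Excel applies. The function names and the threshold parameter are hypothetical.

```python
import statistics


def standardized_residuals(res):
    """Distance of each residual from the residual mean, in standard deviations.

    Assumption: the sample standard deviation (n - 1 denominator) is used.
    """
    mean = statistics.mean(res)
    sd = statistics.stdev(res)
    return [(r - mean) / sd for r in res]


def flag_outliers(res, threshold=3.0):
    """Indexes of residuals more than `threshold` standard deviations from the mean."""
    return [
        i
        for i, z in enumerate(standardized_residuals(res))
        if abs(z) > threshold
    ]
```

Any index returned by flag_outliers marks a data point to investigate for data entry or measurement error, not a point to delete automatically.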

 

2) Determining Whether Residuals Are Independent

This is the most important residual assumption that must be confirmed. If the residuals are not found to be independent, the regression is not considered to be valid.

If the residuals are independent of each other, a graph of the residuals will show no patterns. The residuals should be graphed across all values of the dependent variable. The Excel regression output produced individual graphs of the residuals across all values of each independent variable, but not across all values of the dependent variable. This graph is not part of Excel’s regression output and needs to be generated separately.

An Excel X-Y scatterplot graph of the Residuals plotted against all values of the dependent variable is shown as follows:

(Image: Excel X-Y scatterplot of the Residuals plotted against the values of the dependent variable.)

Residuals that are not independent of each other will show patterns in a Residual graph. No patterns among Residuals are evidenced in this Residual graph so the required regression assumption of Residual independence is validated.

It is important to note that an upward or downward linear trend appearing in the Residuals probably indicates that an independent (X) variable is missing. The first hint that this might be occurring is a Residual mean that is not approximately zero.

 

3) Calculating the Durbin-Watson Statistic in Excel to Determine If Autocorrelation Exists

An important part of evaluating whether the residuals are independent is to calculate the degree of autocorrelation that exists within the residuals. If the residuals are shown to have a high degree of correlation with each other, the residuals are not independent and the regression is not considered valid.

Autocorrelation often occurs with time-series or any other type of longitudinal data. Autocorrelation is evident when data values are influenced by the time interval between them. An example might be a graph of a person’s income. A person’s level of income in one year is likely influenced by that person’s income level in the previous year.

The degree of autocorrelation existing within a variable is measured by the Durbin-Watson statistic, d. The Durbin-Watson statistic can take values from 0 to 4. A value near 2 indicates that little to no autocorrelation exists among the residuals. This, along with no apparent patterns in the Residuals, would confirm the independence of the Residuals.

Values near 0 indicate strong positive autocorrelation. Successive values are similar to each other in this case and appear to follow each other. Values near 4 indicate strong negative autocorrelation. Successive values are opposite of each other in an alternating pattern.

The data used in this example is not time series data but the Durbin-Watson statistic of the Residuals will be calculated in Excel to show how it is done. Before calculating the Durbin-Watson statistic, the data should be sorted chronologically. The Durbin-Watson for the Residuals would be calculated in Excel as follows:

(Image: Excel calculation of the Durbin-Watson statistic for the Residuals.)

SUMXMY2(x_array,y_array) calculates the sum of the square of (X – Y) for the entire array.

SUMSQ(array) squares the values in the array and then sums those squares.

If the Residuals are in cells C1:C50, then the Excel formula to calculate the Durbin-Watson statistic for those Residuals is the following:

=SUMXMY2(C2:C50,C1:C49)/SUMSQ(C1:C50)

The Durbin-Watson statistic of 2.07 calculated here indicates that the Residuals have very little autocorrelation. The Residuals can be considered independent of each other because of the value of the Durbin-Watson statistic and the lack of apparent patterns in the scatterplot of the Residuals.
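The same calculation can be written in Python as a cross-check of the Excel formula. This is a minimal sketch: the function name is hypothetical, and the residuals are assumed to be already sorted in chronological order.

```python
def durbin_watson(res):
    """Durbin-Watson statistic for a chronologically ordered list of residuals.

    Numerator: sum of squared differences between successive residuals
               (the role played by SUMXMY2 in the Excel formula).
    Denominator: sum of squared residuals (the role played by SUMSQ).
    """
    numerator = sum((res[i] - res[i - 1]) ** 2 for i in range(1, len(res)))
    denominator = sum(r * r for r in res)
    return numerator / denominator
```

Residuals that repeat the same value give a statistic of 0, while residuals that alternate in sign push the statistic toward 4. For example, durbin_watson([1, -1, 1, -1]) returns 3.0.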

 

4) Determining if Residual Mean Equals Zero

The mean of the residuals is shown to be zero as follows:

(Image: Excel calculation of the Residual mean.)

 

5) Determining If Residual Variance Is Constant

If the Residuals have similar variances across all residual values, the Residuals are said to be homoscedastic. The property of having similar variance across all sample values or across different sample groups is known as homoscedasticity.

If the Residuals do not have similar variances across all residual values, the Residuals are said to be heteroscedastic. The property of having different variance across all sample values or across different sample groups is known as heteroscedasticity.

Linear regression requires that Residuals be homoscedastic, i.e., have similar variances across all residual values.

The variance of the Residuals is the degree of spread among the Residual values. This can be observed on the Residual scatterplot graph. If the variance of the residuals changes as residual values increase, the spread between the values will visibly change on the Residual scatterplot graph. If Residual variance increases, the Residual values will appear to fan out along the graph. If Residual variance decreases, the Residual values will do the opposite; they will appear to clump together along the graph.

Here is the Residual graph again:

(Image: Residual scatterplot.)

The Residuals appear to fan out slightly as Residual values increase. This indicates a slight increase in Residual variance across the values of the dependent variable. The degree of fanning out is not significant. Slightly unequal variance in the Residuals is not usually a reason to discard an otherwise good model. One way to remove unequal variance in the residuals is to reduce the interval between data points. Shorter intervals will have closer variances.

If the number of data points is too small, the residual spread will sometimes produce a cigar-shaped pattern.
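The visual check for fanning out can be supplemented with a rough numeric comparison. The sketch below is not a method used in this article: it is a simplified comparison in the spirit of the Goldfeld-Quandt approach, and the function name and the choice to split the data into two halves are assumptions made for illustration.

```python
import statistics


def variance_ratio(fitted, res):
    """Ratio of residual variance in the upper half to the lower half.

    Data points are ordered by their fitted (predicted) Y value. A ratio
    near 1 suggests similar spread; a ratio far above 1 suggests the
    residuals fan out as the fitted values increase. If the number of
    points is odd, the middle point is left out of both halves.
    """
    pairs = sorted(zip(fitted, res), key=lambda pair: pair[0])
    half = len(pairs) // 2
    low = [r for _, r in pairs[:half]]
    high = [r for _, r in pairs[-half:]]
    return statistics.variance(high) / statistics.variance(low)
```

Each half needs at least two data points, and small samples will give unstable ratios, so the scatterplot should remain the primary check.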

 

6) Determining if Residuals Are Normally-Distributed

An important assumption of linear regression is that the Residuals be normally-distributed. Normality testing must be performed on the Residuals. The following five normality tests will be performed in a blog article shortly after this article:

1) An Excel histogram of the Residuals will be created.

2) A normal probability plot of the Residuals will be created in Excel.

3) The Kolmogorov-Smirnov test for normality of Residuals will be performed in Excel.

4) The Anderson-Darling test for normality of Residuals will be performed in Excel.

5) The Shapiro-Wilk test for normality of Residuals will be performed in Excel.
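As a preview of the second item in the list above, the points of a normal probability plot can be generated with a short Python sketch. The plotting position (i - 0.5) / n used here is one common convention and is an assumption, as is the function name; the follow-up article may use a different convention.

```python
from statistics import NormalDist


def normal_probability_points(res):
    """Pairs of (theoretical normal quantile, observed residual).

    The residuals are sorted, and the i-th smallest residual is paired
    with the standard normal quantile at probability (i - 0.5) / n.
    """
    n = len(res)
    ordered = sorted(res)
    standard_normal = NormalDist()  # mean 0, standard deviation 1
    return [
        (standard_normal.inv_cdf((i - 0.5) / n), r)
        for i, r in enumerate(ordered, start=1)
    ]
```

If the Residuals are normally-distributed, these points fall close to a straight line when plotted on an X-Y scatterplot.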

 

7) Determining If Any Input Variables Are Too Highly Correlated

To determine whether the Residuals have significant correlation with any other variables, an Excel correlation matrix can be created. An Excel correlation matrix will simultaneously calculate correlations between all variables. The Excel correlation matrix for all variables in this regression is shown as follows:

(Image: Excel correlation matrix for the variables in this regression.)

The correlation matrix shows the correlations between the Residuals and each of the other two variables to be low. Correlation values range from -1 to +1. Correlation values near zero indicate very low correlation. This correlation matrix was created by inserting the following information into the Excel correlation data analysis tool dialogue box:

(Image: Excel correlation data analysis tool dialogue box.)
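A single entry of such a correlation matrix can be reproduced with the Pearson correlation coefficient. The following Python sketch is for illustration only; the function name is hypothetical, and the two lists passed in could be, for example, the Residuals and the values of the X variable.

```python
import math


def pearson(a, b):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    covariance = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    spread_a = sum((x - mean_a) ** 2 for x in a)
    spread_b = sum((y - mean_b) ** 2 for y in b)
    return covariance / math.sqrt(spread_a * spread_b)
```

The result always lies between -1 and +1, and a value near zero indicates very low linear correlation between the two lists.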

 

8) Determining If There Are Enough Data Points

Violations of important assumptions such as normality of Residuals are difficult to detect if too few data points exist. 20 data points are sufficient. 10 data points are probably on the borderline of being too few. All of the normality tests are significantly more powerful (accurate) as data size goes from 15 to 20 data points. Normality of data is very difficult to assess accurately when only 10 data points are present.

All required regression assumptions concerning the Residuals have been met. The next step is to evaluate the remainder of the Excel regression output.

 
