This is one of seven articles on Simple Linear Regression in Excel.

Excel Regression Step 6 – Evaluate the Residuals

The purpose of Residual analysis is to confirm the underlying validity of the regression. Linear regression has a number of required assumptions about the residuals. These assumptions should be evaluated before continuing the analysis of the Excel regression output. If one or more of the required residual assumptions are shown to be invalid, the entire regression analysis might be considered, at best, questionable or, at worst, invalid. The residuals should therefore be analyzed first before analyzing any other part of the Excel regression output.

The Residuals are sometimes loosely called the Error Terms, although strictly speaking they are the sample estimates of the unobservable error terms. The Residual is the difference between an observed data value and the value predicted by the regression equation. The formula for the Residual is as follows:

Residual = Y_actual – Y_estimated
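For readers who want to cross-check this calculation outside Excel, the formula can be sketched in Python. This is only a minimal sketch and not part of the Excel workflow: the function names and the small data set are hypothetical and are not the data used in this article.

```python
def fit_line(x, y):
    """Least-squares slope and intercept for simple linear regression."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def residuals(x, y):
    """Residual = Y_actual - Y_estimated for every data point."""
    slope, intercept = fit_line(x, y)
    return [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]


# Hypothetical data for illustration only (not the article's data set)
x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
for xi, ri in zip(x, residuals(x, y)):
    print(xi, round(ri, 4))
```

The list returned by residuals() plays the same role as the Residuals column in the Excel regression output.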

Evaluating Simple Linear Regression’s Required Residual Assumptions

Linear regression has several required assumptions regarding the residuals. These required residual assumptions are as follows:

1) Outliers have been removed.

2) The residuals must be independent of each other. They must not be correlated with each other.

3) The residuals should have a mean of approximately 0.

4) The residuals must have similar variances throughout all residual values.

5) The residuals must be normally-distributed.

6) The residuals may not be highly correlated with any of the independent (X) variables.

7) There must be enough data points to conduct normality testing of residuals.

Here is how to evaluate each of these assumptions in Excel.

Performing the Simple Regression Residual Evaluation Steps

1) Locating and Removing Outliers

In many cases a data point is considered to be an outlier if its residual value is more than three standard deviations from the mean of the residuals. Checking the checkbox next to Standardized Residuals in the regression dialogue box calculates the standardized value of each residual, which is the number of standard deviations that the residual is from the residual mean.

Following are the standardized residuals of the current data set as shown in the Excel regression output.

(Image: Excel regression output showing the residuals and the standardized residuals)

None of the residuals is more than 1.69 standard deviations from the residual mean. On that basis, no data points are considered outliers as a result of having excessively large residuals.

Any outliers that have been removed should be documented and evaluated. Outliers more than 3 standard deviations from the mean are to be expected occasionally for normally-distributed data. If an outlier appears to have been generated by the normal process and is not an aberration of the process, then perhaps it should not be removed. One item to check is whether a data entry error occurred when inputting the data set. Another item that should be checked is whether there was a measurement error when that data point’s parameters were recorded.

If a data point is removed, the regression analysis has to be performed again on the new data set that does not include that data point.
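The three-standard-deviation rule described above can also be sketched in Python. This is a minimal sketch under stated assumptions: the function names are hypothetical, and dividing by the sample standard deviation (n - 1 in the denominator) is an assumption about how the standardized values are scaled, so small differences from the Excel output are possible.

```python
import statistics


def standardized_residuals(residuals):
    """Number of standard deviations each residual lies from the residual mean."""
    mean = statistics.mean(residuals)
    sd = statistics.stdev(residuals)  # sample standard deviation (assumption)
    return [(r - mean) / sd for r in residuals]


def flag_outliers(residuals, threshold=3.0):
    """Indexes of residuals more than `threshold` standard deviations from the mean."""
    z_values = standardized_residuals(residuals)
    return [i for i, z in enumerate(z_values) if abs(z) > threshold]


# Hypothetical residuals: twenty small values and one very large one
example = [0.1, -0.1] * 10 + [5.0]
print(flag_outliers(example))
```

Any index returned by flag_outliers() should be investigated for data entry or measurement error before the point is removed, as discussed above.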

2) Determining Whether Residuals Are Independent

This is the most important residual assumption that must be confirmed. If the residuals are not found to be independent, the regression is not considered to be valid.

If the residuals are independent of each other, a graph of the residuals will show no patterns. The residuals should be graphed across all values of the dependent variable. The Excel regression output produced individual graphs of the residuals across all values of each independent variable, but not across all values of the dependent variable. This graph is not part of Excel’s regression output and needs to be generated separately.

An Excel X-Y scatterplot graph of the Residuals plotted against all values of the dependent variable is shown as follows:

(Image: Excel X-Y scatterplot of the Residuals plotted against the values of the dependent variable)

Residuals that are not independent of each other will show patterns in a Residual graph. No patterns among Residuals are evidenced in this Residual graph so the required regression assumption of Residual independence is validated.

It is important to note that an upward or downward linear trend appearing in the Residuals probably indicates that an independent (X) variable is missing. The first sign that this might be occurring is a Residual mean that is not approximately equal to zero.

3) Calculating the Durbin-Watson Statistic in Excel to Determine If Autocorrelation Exists

An important part of evaluating whether the residuals are independent is to calculate the degree of autocorrelation that exists within the residuals. If the residuals are shown to have a high degree of correlation with each other, the residuals are not independent and the regression is not considered valid.

Autocorrelation often occurs with time-series or any other type of longitudinal data. Autocorrelation is evident when data values are influenced by the time interval between them. An example might be a graph of a person’s income. A person’s level of income in one year is likely influenced by that person’s income level in the previous year.

The degree of autocorrelation existing within a variable is measured by the Durbin-Watson statistic, d. The Durbin-Watson statistic can take values from 0 to 4. A value near 2 indicates that little to no autocorrelation exists among the residuals. This, along with no apparent patterns in the Residuals, would confirm the independence of the Residuals.

Values near 0 indicate strong positive autocorrelation. Subsequent values are similar to each other in this case and will appear to follow each other. Values near 4 indicate strong negative autocorrelation. Subsequent values are opposite of each other in an alternating pattern.

The data used in this example are not time-series data, but the Durbin-Watson statistic of the Residuals will be calculated in Excel to show how it is done. Before calculating the Durbin-Watson statistic, the data should be sorted chronologically. The Durbin-Watson statistic for the Residuals would be calculated in Excel as follows:

(Image: Excel worksheet showing the Durbin-Watson calculation for the Residuals)

SUMXMY2(x_array,y_array) calculates the sum of the squared differences (X – Y)² between corresponding values of the two arrays.

SUMSQ(array) squares each value in the array and then sums those squares.

If the Residuals are in cells C1:C50, then the Excel formula to calculate the Durbin-Watson statistic for those Residuals is the following:

=SUMXMY2(C2:C50,C1:C49)/SUMSQ(C1:C50)

The Durbin-Watson statistic of 2.07 calculated here indicates that the Residuals have very little autocorrelation. The Residuals can be considered independent of each other because of the value of the Durbin-Watson statistic and the lack of apparent patterns in the scatterplot of the Residuals.
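The same ratio that the Excel formula computes can be sketched in Python as a cross-check. This is a minimal sketch: the function name durbin_watson is hypothetical, the numerator plays the role of SUMXMY2 applied to the residuals and their one-step lag, and the denominator plays the role of SUMSQ. The residuals are assumed to be in chronological order already.

```python
def durbin_watson(residuals):
    """Sum of squared successive differences divided by the sum of squared residuals."""
    numerator = sum(
        (residuals[i] - residuals[i - 1]) ** 2 for i in range(1, len(residuals))
    )
    denominator = sum(r * r for r in residuals)
    return numerator / denominator


# Hypothetical residuals for illustration only
alternating = [1.0, -1.0] * 50   # each value is the opposite of the previous one
constant = [1.0] * 100           # each value repeats the previous one
print(durbin_watson(alternating))
print(durbin_watson(constant))
```

The alternating series yields a value close to 4 and the constant series yields 0, which matches the interpretation of the two extremes given above.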

4) Determining if Residual Mean Equals Zero

The mean of the residuals is shown to be zero as follows:

(Image: Excel calculation showing the mean of the Residuals)

5) Determining If Residual Variance Is Constant

If the Residuals have similar variances across all residual values, the Residuals are said to be homoscedastic. The property of having similar variance across all sample values or across different sample groups is known as homoscedasticity.

If the Residuals do not have similar variances across all residual values, the Residuals are said to be heteroscedastic. The property of having different variance across all sample values or across different sample groups is known as heteroscedasticity.

Linear regression requires that Residuals be homoscedastic, i.e., have similar variances across all residual values.

The variance of the Residuals is the degree of spread among the Residual values. This can be observed on the Residual scatterplot graph. If the variance of the residuals changes as residual values increase, the spread between the values will visibly change on the Residual scatterplot graph. If Residual variance increases, the Residual values will appear to fan out along the graph. If Residual variance decreases, the Residual values will do the opposite; they will appear to clump together along the graph.

Here is the Residual graph again:

(Image: Excel X-Y scatterplot of the Residuals)

The Residuals appear to fan out slightly as Residual values increase. This indicates a slight increase in Residual variance across the values of the dependent variable. The degree of fanning out is not significant. Slightly unequal variance in the Residuals is not usually a reason to discard an otherwise good model. One way to remove unequal variance in the residuals is to reduce the interval between data points. Shorter intervals will have closer variances.

If the number of data points is too small, the residual spread will sometimes produce a cigar-shaped pattern.
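The visual check for fanning out can be supplemented with a rough numeric comparison. The following Python sketch is an informal aid and not part of the Excel procedure described here, nor a formal statistical test: it sorts the Residuals by the values on the horizontal axis of the Residual graph, splits them into a lower and an upper half, and compares the sample variances of the two halves. The function name and the data are hypothetical.

```python
import statistics


def variance_ratio(axis_values, residuals):
    """Variance of residuals in the upper half divided by that in the lower half.

    The halves are formed after sorting by `axis_values`, the variable on the
    horizontal axis of the residual plot. A ratio well above 1 suggests fanning
    out; a ratio well below 1 suggests clumping together.
    """
    pairs = sorted(zip(axis_values, residuals), key=lambda pair: pair[0])
    half = len(pairs) // 2
    lower = [r for _, r in pairs[:half]]
    upper = [r for _, r in pairs[-half:]]
    return statistics.variance(upper) / statistics.variance(lower)


# Hypothetical residuals whose spread grows along the horizontal axis
axis_values = [1, 2, 3, 4, 5, 6, 7, 8]
example = [0.1, -0.1, 0.1, -0.1, 1.0, -1.0, 1.0, -1.0]
print(variance_ratio(axis_values, example))
```

With only a handful of data points the ratio is very noisy, so it should be read alongside the scatterplot rather than in place of it.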

6) Determining if Residuals Are Normally-Distributed

An important assumption of linear regression is that the Residuals be normally-distributed. Normality testing must be performed on the Residuals. The following five normality tests will be performed in a blog article shortly after this article:

1) An Excel histogram of the Residuals will be created.

2) A normal probability plot of the Residuals will be created in Excel.

3) The Kolmogorov-Smirnov test for normality of Residuals will be performed in Excel.

4) The Anderson-Darling test for normality of Residuals will be performed in Excel.

5) The Shapiro-Wilk test for normality of Residuals will be performed in Excel.
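As a preview of the second item in the list above, the points of a normal probability plot can be sketched in Python. This is a minimal sketch under stated assumptions: the function name is hypothetical, and the plotting position (i + 0.5) / n is one common convention among several, so it may differ from the one used in the follow-up article.

```python
from statistics import NormalDist


def normal_probability_points(residuals):
    """Pair each sorted residual with the standard normal quantile for its rank."""
    n = len(residuals)
    ordered = sorted(residuals)
    standard_normal = NormalDist()  # mean 0, standard deviation 1
    quantiles = [standard_normal.inv_cdf((i + 0.5) / n) for i in range(n)]
    return list(zip(quantiles, ordered))


# Hypothetical residuals for illustration only
for quantile, residual in normal_probability_points([0.4, -1.2, 0.1, 0.9, -0.3]):
    print(round(quantile, 3), residual)
```

If the Residuals are approximately normally distributed, these points fall close to a straight line when plotted as an X-Y scatterplot.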

7) Determining If the Residuals Are Too Highly Correlated With Any Input Variables

To determine whether the Residuals have significant correlation with any other variables, an Excel correlation matrix can be created. An Excel correlation matrix will simultaneously calculate correlations between all variables. The Excel correlation matrix for all variables in this regression is shown as follows:

(Image: Excel correlation matrix for all variables in this regression)

The correlation matrix shows the correlation between the Residuals and each of the other two variables to be low. Correlation values range from -1 to +1. Correlation values near zero indicate very low correlation. This correlation matrix was created by inserting the following information into the Excel correlation data analysis tool dialogue box:

(Image: Excel correlation data analysis tool dialogue box)
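A single entry of such a correlation matrix can be cross-checked with a short Python sketch of the Pearson correlation coefficient, the same quantity that Excel's CORREL function returns. The function name pearson and the sample values are hypothetical and are not taken from this article's data set.

```python
import math


def pearson(a, b):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    covariance = sum((ai - mean_a) * (bi - mean_b) for ai, bi in zip(a, b))
    spread_a = sum((ai - mean_a) ** 2 for ai in a)
    spread_b = sum((bi - mean_b) ** 2 for bi in b)
    return covariance / math.sqrt(spread_a * spread_b)


# Hypothetical X values and residuals for illustration only
x_values = [1, 2, 3, 4]
example_residuals = [1.0, -1.0, -1.0, 1.0]
print(pearson(x_values, example_residuals))
```

A result near zero, as in this example, corresponds to the low correlation that this step is looking for.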

8) Determining If There Are Enough Data Points

Violations of important assumptions such as normality of Residuals are difficult to detect if too few data points exist. 20 data points are sufficient. 10 data points are probably on the borderline of being too few. All of the normality tests are significantly more powerful (accurate) as data size goes from 15 to 20 data points. Normality of data is very difficult to assess accurately when only 10 data points are present.

All required regression assumptions concerning the Residuals have been met. The next step is to evaluate the remainder of the Excel regression output.

