Tuesday, May 27, 2014

Multiple Linear Regression’s Required Assumptions

This is one of the following seven articles on Multiple Linear Regression in Excel:

Basics of Multiple Regression in Excel 2010 and Excel 2013

Complete Multiple Linear Regression Example in 6 Steps in Excel 2010 and Excel 2013

Multiple Linear Regression’s Required Residual Assumptions

Normality Testing of Residuals in Excel 2010 and Excel 2013

Evaluating the Excel Output of Multiple Regression

Estimating the Prediction Interval of Multiple Regression in Excel

Regression - How To Do Conjoint Analysis Using Dummy Variable Regression in Excel

Multiple Linear Regression’s Required Residual Assumptions

Linear regression has several required assumptions regarding the residuals. These required residual assumptions are as follows:

1) Outliers have been removed.

2) The residuals must be independent of each other. They must not be correlated with each other.

3) The residuals should have a mean of approximately 0.

4) The residuals must have similar variances throughout all residual values.

5) The residuals must be normally-distributed.

6) The residuals may not be highly correlated with any of the independent (X) variables.

7) There must be enough data points to conduct normality testing of residuals.

Here is how to evaluate each of these assumptions in Excel.

Locating and Removing Outliers

In many cases a data point is considered to be an outlier if its residual value is more than three standard deviations from the mean of the residuals. Checking the checkbox next to Standardized Residuals in the regression dialogue box calculates the standardized value of each residual, which is the number of standard deviations that the residual is from the residual mean. Below once again is the Excel regression output showing the residuals and their distance in standard deviations from the residual mean.

Following are the standardized residuals of the current data set. None is more than 1.755 standard deviations from the residual mean in absolute value.

[Image: Excel regression output listing the residuals and their standardized values]

Because no residual lies more than three standard deviations from the residual mean, no data points are considered outliers as a result of having excessively large residuals.
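For readers who want to cross-check the three-standard-deviation rule outside Excel, here is a minimal Python sketch. It assumes the standardized residual is the residual's distance from the residual mean divided by the sample standard deviation of the residuals. The function names and the residual values are hypothetical and are not the data set used in this article.

```python
from statistics import mean, stdev

def standardized_residuals(residuals):
    """Return each residual's distance from the residual mean, in sample standard deviations."""
    m = mean(residuals)
    s = stdev(residuals)  # sample standard deviation (n - 1 denominator); an assumption
    return [(r - m) / s for r in residuals]

def flag_outliers(residuals, threshold=3.0):
    """Return the 0-based positions whose standardized residual exceeds the threshold in absolute value."""
    z_values = standardized_residuals(residuals)
    return [i for i, z in enumerate(z_values) if abs(z) > threshold]

# Hypothetical residuals for illustration only: the last value is deliberately extreme.
residuals = [0.4, -0.3, 0.1, -0.2, 0.3, -0.1, 0.2, -0.4, 0.0, 0.1,
             -0.2, 0.3, -0.1, 0.2, -0.3, 0.1, 0.0, -0.2, 0.2, 5.0]

print(flag_outliers(residuals))
```

Any position that is flagged should then be investigated before removal, not dropped automatically.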

Any outliers that have been removed should be documented and evaluated. Outliers more than 3 standard deviations from the mean are to be expected occasionally for normally-distributed data. If an outlier appears to have been generated by the normal process rather than by an aberration of that process, then perhaps it should not be removed. One item to check is whether a data entry error occurred when inputting the data set. Another item that should be checked is whether there was a measurement error when that data point’s parameters were recorded.

If a data point is removed, the regression analysis has to be performed again on the new data set that does not include that data point.

Determining Whether Residuals Are Independent

This is the most important residual assumption that must be confirmed. If the residuals are not found to be independent, the regression is not considered to be valid.

If the residuals are independent of each other, a graph of the residuals will show no patterns. The residuals should be graphed across all values of the dependent variable. The Excel regression output produces individual graphs of the residuals across all values of each independent variable, but not across all values of the dependent variable. The graph of the residuals against the dependent variable therefore needs to be generated separately.

An Excel X-Y scatterplot graph of the Residuals plotted against all values of the dependent variable is shown as follows:

[Image: Excel X-Y scatterplot of the Residuals against the dependent variable]

Residuals that are not independent of each other will show patterns in a Residual graph. No patterns among Residuals are evidenced in this Residual graph so the required regression assumption of Residual independence is validated.

It is important to note that an upward or downward linear trend appearing in the Residuals probably indicates that an independent (X) variable is missing. The first tip that this might be occurring is a Residual mean that is not approximately zero.

Calculating Durbin-Watson Statistic To Determine If Autocorrelation Exists

An important part of evaluating whether the residuals are independent is to calculate the degree of autocorrelation that exists within the residuals. If the residuals are shown to have a high degree of correlation with each other, the residuals are not independent and the regression is not considered valid.

Autocorrelation often occurs with time-series or any other type of longitudinal data. Autocorrelation is evident when data values are influenced by the time interval between them. An example might be a graph of a person’s income. A person’s level of income in one year is likely influenced by that person’s income level in the previous year.

The degree of autocorrelation existing within a variable is measured by the Durbin-Watson statistic, d. The Durbin-Watson statistic can take values from 0 to 4. A value near 2 indicates little to no autocorrelation among the Residuals. This, along with no apparent patterns in the Residuals, would confirm the independence of the Residuals.

Values near 0 indicate strong positive autocorrelation. In this case each value tends to be similar to the one before it, so the values appear to follow each other. Values near 4 indicate strong negative autocorrelation. In this case each value tends to be the opposite of the one before it, producing an alternating pattern.

The data used in this example is not time-series data, but the Durbin-Watson statistic of the Residuals will be calculated in Excel to show how it is done. Before calculating the Durbin-Watson statistic, the data should be sorted chronologically. The Durbin-Watson statistic for the Residuals would be calculated in Excel as follows:

[Image: Excel calculation of the Durbin-Watson statistic for the Residuals]

SUMXMY2(x_array,y_array) calculates the sum of the squared differences (X – Y) between corresponding values of the two arrays.

SUMSQ(array) squares the values in the array and then sums those squares.

If the Residuals are in cells DV11:DV30, then the Excel formula to calculate the Durbin-Watson statistic for those Residuals is the following:

=SUMXMY2(DV12:DV30,DV11:DV29)/SUMSQ(DV11:DV30)

The Durbin-Watson statistic of 2.07 calculated here indicates that the Residuals have very little autocorrelation. The Residuals can be considered independent of each other because of the value of the Durbin-Watson statistic and the lack of apparent patterns in the scatterplot of the Residuals.
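The same calculation can be sketched in Python as a cross-check of the Excel formula. The helper functions below are written for this illustration and simply mirror what SUMXMY2 and SUMSQ do. The two short sequences are hypothetical and are only meant to show how the statistic moves toward 4 for an alternating pattern and toward 0 for values that follow each other.

```python
def sumxmy2(x_values, y_values):
    """Sum of squared differences between corresponding values (mirrors Excel's SUMXMY2)."""
    if len(x_values) != len(y_values):
        raise ValueError("both arrays must have the same length")
    return sum((x - y) ** 2 for x, y in zip(x_values, y_values))

def sumsq(values):
    """Sum of the squares of the values (mirrors Excel's SUMSQ)."""
    return sum(v ** 2 for v in values)

def durbin_watson(residuals):
    """d = sum of squared successive differences divided by the sum of squared residuals."""
    # residuals[1:] plays the role of DV12:DV30 and residuals[:-1] the role of DV11:DV29.
    return sumxmy2(residuals[1:], residuals[:-1]) / sumsq(residuals)

# Hypothetical sequences for illustration only, already in chronological order.
alternating = [1, -1, 1, -1, 1, -1, 1, -1]   # each value is the opposite of the previous one
drifting = [-4, -3, -2, -1, 1, 2, 3, 4]      # each value is close to the previous one

print(durbin_watson(alternating))  # well above 2: negative autocorrelation
print(durbin_watson(drifting))     # well below 2: positive autocorrelation
```

As in Excel, the residuals must be in chronological order before the function is called, because the statistic depends entirely on which value follows which.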

Determining if Residual Mean Equals Zero

The mean of the residuals is shown to be zero as follows:

[Image: Excel calculation showing the mean of the Residuals]
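As a small illustration of this check, the Python sketch below fits a least-squares line with an intercept and then averages the residuals. It is deliberately simplified to one X variable, whereas this article's model has several, and the data values and function name are hypothetical. For a least-squares fit that includes an intercept, the residual mean works out to zero apart from rounding error.

```python
from statistics import mean

def fit_simple_ols(x, y):
    """Least-squares intercept and slope for a single X variable."""
    x_bar, y_bar = mean(x), mean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return intercept, slope

# Hypothetical data for illustration only.
x = [1, 2, 3, 4, 5, 6]
y = [2.1, 3.9, 6.2, 7.8, 10.1, 12.3]

intercept, slope = fit_simple_ols(x, y)
residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
residual_mean = mean(residuals)

# Compare against a small tolerance, not exact zero, because of floating-point rounding.
print(abs(residual_mean) < 1e-9)
```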

Determining If Residual Variance Is Constant

If the Residuals have similar variances across all residual values, the Residuals are said to be homoscedastic. The property of having similar variance across all sample values or across different sample groups is known as homoscedasticity.

If the Residuals do not have similar variances across all residual values, the Residuals are said to be heteroscedastic. The property of having different variance across sample values or across different sample groups is known as heteroscedasticity. Linear regression requires that Residuals be homoscedastic, i.e., have similar variances across all residual values.

The variance of the Residuals is the degree of spread among the Residual values. This can be observed on the Residual scatterplot graph. If the variance of the residuals changes as residual values increase, the spread between the values will visibly change on the Residual scatterplot graph. If Residual variance increases, the Residual values will appear to fan out along the graph. If Residual variance decreases, the Residual values will do the opposite; they will appear to clump together along the graph.

Here is the Residual graph again:

[Image: Excel X-Y scatterplot of the Residuals against the dependent variable]

The Residuals’ spread appears to be fairly consistent across all Residual values. This indicates that the Residuals are homoscedastic, i.e., have similar variance across all Residual values. There appears to be no fanning in or fanning out.

Slightly unequal variance in the Residuals is not usually a reason to discard an otherwise good model. One way to remove unequal variance in the residuals is to reduce the interval between data points. Shorter intervals will have closer variances.

If the number of data points is too small, the residual spread will sometimes produce a cigar-shaped pattern.
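The visual check for fanning can be backed up with a rough numeric comparison. The Python sketch below orders the Residuals by the dependent variable, splits them into a lower half and an upper half, and compares the sample variances of the two halves. This is only an informal companion to the scatterplot, not a formal test of homoscedasticity. The function name and both data sets are hypothetical.

```python
from statistics import variance

def spread_ratio(sort_values, residuals):
    """Ratio of the residual sample variance in the upper half to that in the lower half,
    after ordering the residuals by sort_values (e.g. the dependent variable)."""
    paired = sorted(zip(sort_values, residuals), key=lambda pair: pair[0])
    ordered = [r for _, r in paired]
    half = len(ordered) // 2          # the middle point is left out when the count is odd
    lower, upper = ordered[:half], ordered[-half:]
    return variance(upper) / variance(lower)

# Hypothetical data for illustration only.
y = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
even = [0.5, -0.5, 0.5, -0.5, 0.5, -0.5, 0.5, -0.5, 0.5, -0.5]      # constant spread
fanning = [0.1, -0.1, 0.2, -0.2, 0.3, -0.6, 0.8, -1.0, 1.2, -1.5]   # spread grows with y

print(spread_ratio(y, even))     # close to 1: similar variance in both halves
print(spread_ratio(y, fanning))  # far above 1: the residuals fan out
```

A ratio near 1 is consistent with similar variance across the range, while a ratio far above or below 1 matches the fanning out or clumping described above.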

Determining if Residuals Are Normally-Distributed

An important assumption of linear regression is that the Residuals be normally-distributed. Normality testing must be performed on the Residuals. The five normality tests that will be performed in the next blog article are as follows:

1) An Excel histogram of the Residuals will be created.

2) A normal probability plot of the Residuals will be created in Excel.

3) The Kolmogorov-Smirnov test for normality of Residuals will be performed in Excel.

4) The Anderson-Darling test for normality of Residuals will be performed in Excel.

5) The Shapiro-Wilk test for normality of Residuals will be performed in Excel.

Determining If Any Input Variables Are Too Highly Correlated With Residuals

To determine whether the Residuals have significant correlation with any other variables, an Excel correlation matrix can be created. An Excel correlation matrix will simultaneously calculate correlations between all variables. The Excel correlation matrix for all variables in this regression is shown as follows:

[Image: Excel correlation matrix for all variables in this regression]

The correlation matrix shows all of the correlations between the variables to be low. Correlation values range from (-1) to (+1). Correlation values near zero indicate very low correlation. This correlation matrix was created by entering the following information into the Excel Correlation data analysis tool dialogue box:

[Image: Excel Correlation data analysis tool dialogue box with the inputs used]
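The correlation between the Residuals and each X variable can also be computed directly. The Python sketch below calculates the ordinary (Pearson) correlation coefficient, the same quantity Excel's CORREL function returns, for each X variable against the Residuals. The variable names and values are hypothetical and are not taken from this article's data set.

```python
from math import sqrt
from statistics import mean

def pearson(a, b):
    """Pearson correlation coefficient between two equal-length lists of numbers."""
    a_bar, b_bar = mean(a), mean(b)
    cross = sum((ai - a_bar) * (bi - b_bar) for ai, bi in zip(a, b))
    ss_a = sum((ai - a_bar) ** 2 for ai in a)
    ss_b = sum((bi - b_bar) ** 2 for bi in b)
    return cross / sqrt(ss_a * ss_b)

# Hypothetical X variables and residuals for illustration only.
x_variables = {
    "X1": [1, 2, 3, 4, 5, 6],
    "X2": [3, 1, 4, 1, 5, 9],
}
residuals = [0.2, -0.1, -0.3, 0.3, 0.1, -0.2]

# One correlation per X variable; values near zero indicate very low correlation.
correlations = {name: pearson(values, residuals) for name, values in x_variables.items()}
for name, r in correlations.items():
    print(name, round(r, 3))
```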

Determining If There Are Enough Data Points

Violations of important assumptions such as normality of Residuals are difficult to detect if too few data points exist. 20 data points are sufficient. 10 data points are probably on the borderline of being too few. All of the normality tests are significantly more powerful (better able to detect departures from normality) as the data size goes from 15 to 20 data points. Normality of data is very difficult to assess accurately when only 10 data points are present.

All required regression assumptions concerning the Residuals have been met. The next step is to evaluate the remainder of the Excel regression output.


