## Tuesday, May 27, 2014

### Multiple Linear Regression’s Required Assumptions

This is one of the following seven articles on Multiple Linear Regression in Excel

Basics of Multiple Regression in Excel 2010 and Excel 2013

Complete Multiple Linear Regression Example in 6 Steps in Excel 2010 and Excel 2013

Multiple Linear Regression’s Required Residual Assumptions

Normality Testing of Residuals in Excel 2010 and Excel 2013

Evaluating the Excel Output of Multiple Regression

Estimating the Prediction Interval of Multiple Regression in Excel

Regression - How To Do Conjoint Analysis Using Dummy Variable Regression in Excel

# Multiple Linear Regression’s Required Residual Assumptions

Linear regression has several required assumptions regarding the residuals. These required residual assumptions are as follows:

1) Outliers have been removed.

2) The residuals must be independent of each other. They must not be correlated with each other.

3) The residuals should have a mean of approximately 0.

4) The residuals must have similar variances throughout all residual values.

5) The residuals must be normally-distributed.

6) The residuals must not be highly correlated with any of the independent (X) variables.

7) There must be enough data points to conduct normality testing of residuals.

Here is how to evaluate each of these assumptions in Excel.

## Locating and Removing Outliers

In many cases a data point is considered to be an outlier if its residual value is more than three standard deviations from the mean of the residuals. Checking the checkbox next to Standardized Residuals in the regression dialogue box calculates the standardized value of each residual, which is the number of standard deviations that the residual is from the residual mean. Below once again is the Excel regression output showing the residuals and their distance in standard deviations from the residual mean.

Following are the standardized residuals of the current data set. None is more than 1.755 standard deviations from the residual mean in absolute value, which is well inside the three-standard-deviation threshold. On that basis, no data points are considered outliers as a result of having excessively large residuals.

Any outliers that have been removed should be documented and evaluated. Outliers more than 3 standard deviations from the mean are to be expected occasionally for normally-distributed data. If an outlier appears to have been generated by the normal process and not be an aberration of the process, then perhaps it should not be removed. One item to check is whether a data entry error occurred when inputting the data set. Another item that should be checked is whether there was a measurement error when that data point’s parameters were recorded.

If a data point is removed, the regression analysis has to be performed again on the new data set that does not include that data point.
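The three-standard-deviation rule described above can also be sketched outside Excel. The following is a minimal Python illustration, not a reproduction of Excel's output: the function names and the sample residuals are hypothetical, and it assumes the sample standard deviation (n - 1 denominator) is used for scaling.

```python
from statistics import mean, stdev

def standardized_residuals(residuals):
    """Distance of each residual from the residual mean, in standard deviations."""
    m = mean(residuals)
    s = stdev(residuals)  # assumption: sample standard deviation (n - 1 denominator)
    return [(r - m) / s for r in residuals]

def flag_outliers(residuals, threshold=3.0):
    """Indices of residuals lying more than `threshold` standard deviations from the mean."""
    z = standardized_residuals(residuals)
    return [i for i, value in enumerate(z) if abs(value) > threshold]

# Made-up residuals for illustration only; the last value is deliberately extreme.
example = [0.5, -0.3, 0.1, -0.4, 0.2, -0.1, 0.3, -0.2,
           0.0, -0.1, 0.4, -0.5, 0.2, -0.3, 0.1, 6.0]
print(flag_outliers(example))
```

Any index this returns is only a candidate for review; as noted above, a flagged point should be checked for data entry or measurement errors before it is removed.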

## Determining Whether Residuals Are Independent

This is the most important residual assumption that must be confirmed. If the residuals are not found to be independent, the regression is not considered to be valid.

If the residuals are independent of each other, a graph of the residuals will show no patterns. The residuals should be graphed across all values of the dependent variable. The Excel regression output produced individual graphs of the residuals across all values of each independent variable, but not across all values of the dependent variable. This graph is not part of Excel’s regression output and needs to be generated separately.

An Excel X-Y scatterplot graph of the Residuals plotted against all values of the dependent variable is shown as follows:

Residuals that are not independent of each other will show patterns in a Residual graph. No patterns among Residuals are evidenced in this Residual graph so the required regression assumption of Residual independence is validated.

It is important to note that an upward or downward linear trend appearing in the Residuals probably indicates that an independent (X) variable is missing. The first sign that this might be occurring is a Residual mean that is not approximately zero.

## Calculating Durbin-Watson Statistic To Determine If Autocorrelation Exists

An important part of evaluating whether the residuals are independent is to calculate the degree of autocorrelation that exists within the residuals. If the residuals are shown to have a high degree of correlation with each other, the residuals are not independent and the regression is not considered valid.

Autocorrelation often occurs with time-series or any other type of longitudinal data. Autocorrelation is evident when data values are influenced by the time interval between them. An example might be a graph of a person’s income. A person’s level of income in one year is likely influenced by that person’s income level in the previous year.

The degree of autocorrelation existing within a variable is measured by the Durbin-Watson statistic, d. The Durbin-Watson statistic can take values from 0 to 4. Values close to 2 indicate that little to no autocorrelation exists among the residuals. This, along with no apparent patterns in the Residuals, would confirm the independence of the Residuals.

Values near 0 indicate strong positive autocorrelation. Subsequent values are similar to each other in this case, so values will appear to follow each other. Values near 4 indicate strong negative autocorrelation. Subsequent values are opposite of each other in an alternating pattern.
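For reference, the Durbin-Watson statistic is usually written as shown here. The symbols are notation introduced for this note: $e_t$ is the residual of observation $t$ and $n$ is the number of residuals.

$$
d = \frac{\sum_{t=2}^{n} \left(e_t - e_{t-1}\right)^2}{\sum_{t=1}^{n} e_t^2}
$$

For reasonably large samples, $d \approx 2(1 - r)$, where $r$ is the correlation between each residual and the one before it. This approximation explains the range: $r$ near 0 gives $d$ near 2, $r$ near +1 gives $d$ near 0, and $r$ near -1 gives $d$ near 4.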

The data used in this example is not time series data but the Durbin-Watson statistic of the Residuals will be calculated in Excel to show how it is done. Before calculating the Durbin-Watson statistic, the data should be sorted chronologically. The Durbin-Watson for the Residuals would be calculated in Excel as follows:

SUMXMY2(x_array,y_array) calculates the sum of the square of (X – Y) for the entire array.

SUMSQ(array) squares the values in the array and then sums those squares.

If the Residuals are in cells DV11:DV30, then the Excel formula to calculate the Durbin-Watson statistic for those Residuals is the following:

=SUMXMY2(DV12:DV30,DV11:DV29)/SUMSQ(DV11:DV30)

The Durbin-Watson statistic of 2.07 calculated here indicates that the Residuals have very little autocorrelation. The Residuals can be considered independent of each other because of the value of the Durbin-Watson statistic and the lack of apparent patterns in the scatterplot of the Residuals.
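The same calculation as the SUMXMY2/SUMSQ formula can be sketched in Python as a cross-check. This is a minimal illustration with a hypothetical function name, and it assumes the residuals are already in chronological order.

```python
def durbin_watson(residuals):
    """Sum of squared successive differences divided by the sum of squared residuals."""
    # Numerator: same role as SUMXMY2 applied to the residuals shifted by one position.
    numerator = sum((curr - prev) ** 2
                    for prev, curr in zip(residuals, residuals[1:]))
    # Denominator: same role as SUMSQ over all residuals.
    denominator = sum(r ** 2 for r in residuals)
    return numerator / denominator

# Made-up residuals for illustration only.
alternating = [1, -1, 1, -1]   # each value is the opposite of the previous one
constant = [1, 1, 1, 1]        # each value repeats the previous one
print(durbin_watson(alternating), durbin_watson(constant))
```

The alternating series gives 3.0, toward the negative-autocorrelation end of the 0-to-4 range, and the constant series gives 0.0, the positive-autocorrelation end.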

## Determining if Residual Mean Equals Zero

The mean of the residuals is shown to be zero as follows:

## Determining If Residual Variance Is Constant

If the Residuals have similar variances across all residual values, the Residuals are said to be homoscedastic. The property of having similar variance across all sample values or across different sample groups is known as homoscedasticity.

If the Residuals do not have similar variances across all residual values, the Residuals are said to be heteroscedastic. The property of having different variance across all sample values or across different sample groups is known as heteroscedasticity. Linear regression requires that Residuals be homoscedastic, i.e., have similar variances across all residual values.

The variance of the Residuals is the degree of spread among the Residual values. This can be observed on the Residual scatterplot graph. If the variance of the residuals changes as residual values increase, the spread between the values will visibly change on the Residual scatterplot graph. If Residual variance increases, the Residual values will appear to fan out along the graph. If Residual variance decreases, the Residual values will do the opposite; they will appear to clump together along the graph.

Here is the Residual graph again:

The Residuals’ spread appears to be fairly consistent across all Residual values. This indicates that the Residuals are homoscedastic, i.e., have similar variance across all Residual values. There appears to be no fanning in or fanning out.

Slightly unequal variance in Residuals is not usually a reason to discard an otherwise good model. One way to remove unequal variance in the residuals is to reduce the interval between data points. Shorter intervals will have closer variances.

If the number of data points is too small, the residual spread will sometimes produce a cigar-shaped pattern.
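A visual check for fanning can be supplemented with a rough numerical screen. The sketch below orders the residuals by the plotted variable, splits them into a lower and an upper half, and compares the sample variances of the two halves. It is only an informal heuristic under these assumptions, not a formal test of homoscedasticity, and the function name and sample data are hypothetical.

```python
from statistics import variance

def variance_ratio(plot_values, residuals):
    """Residual variance in the upper half divided by that in the lower half,
    after ordering the residuals by the variable they are plotted against."""
    if len(plot_values) != len(residuals):
        raise ValueError("plot_values and residuals must have the same length")
    ordered = [r for _, r in sorted(zip(plot_values, residuals),
                                    key=lambda pair: pair[0])]
    half = len(ordered) // 2
    lower, upper = ordered[:half], ordered[-half:]  # middle point dropped if odd
    return variance(upper) / variance(lower)

# Made-up data for illustration only: the spread doubles in the upper half.
y_values = [1, 2, 3, 4, 5, 6, 7, 8]
resids = [1, -1, 1, -1, 2, -2, 2, -2]
print(variance_ratio(y_values, resids))
```

A ratio close to 1 is consistent with similar spread in both halves. A ratio far above 1 suggests fanning out, and a ratio far below 1 suggests fanning in. With few data points the ratio is noisy, so it should be read alongside the Residual graph.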

## Determining if Residuals Are Normally-Distributed

An important assumption of linear regression is that the Residuals be normally-distributed. Normality testing must be performed on the Residuals. The five normality tests that will be performed in the next blog article are as follows:

1) An Excel histogram of the Residuals will be created.

2) A normal probability plot of the Residuals will be created in Excel.

3) The Kolmogorov-Smirnov test for normality of Residuals will be performed in Excel.

4) The Anderson-Darling test for normality of Residuals will be performed in Excel.

5) The Shapiro-Wilk test for normality of Residuals will be performed in Excel.

## Determining If Any Input Variables Are Too Highly Correlated With Residuals

To determine whether the Residuals have significant correlation with any other variables, an Excel correlation matrix can be created. An Excel correlation matrix will simultaneously calculate correlations between all variables. The Excel correlation matrix for all variables in this regression is shown as follows:

The correlation matrix shows all of the correlations between each of the variables to be low. Correlation values go from (-1) to (+1). Correlation values near zero indicate very low correlation. This correlation matrix was created by inserting the following information into the Excel correlation data analysis tool dialogue box as follows:
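Separately from the Excel tool, the correlation between the Residuals and each input variable can be sketched in Python. This is a minimal illustration of the Pearson correlation coefficient; the function names, column names, and data are hypothetical and are not taken from the regression in this article.

```python
from math import sqrt
from statistics import mean

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    covariance_sum = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mx) ** 2 for x in xs))
    spread_y = sqrt(sum((y - my) ** 2 for y in ys))
    return covariance_sum / (spread_x * spread_y)

def residual_correlations(columns, residuals):
    """Correlation of the residuals with each named input column."""
    return {name: pearson(values, residuals) for name, values in columns.items()}

# Made-up data for illustration only.
inputs = {"x1": [1, 2, 3, 4], "x2": [2, 1, 4, 3]}
resids = [1, -1, -1, 1]
print(residual_correlations(inputs, resids))
```

Each value falls between -1 and +1, and values near zero indicate that the Residuals have very little linear relationship with that input variable.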

## Determining If There Are Enough Data Points

Violations of important assumptions such as normality of Residuals are difficult to detect if too few data points exist. 20 data points is sufficient. 10 data points is probably on the borderline of being too few. All of the normality tests are significantly more powerful (accurate) as data size goes from 15 to 20 data points. Normality of data is very difficult to assess accurately when only 10 data points are present.

All required regression assumptions concerning the Residuals have been met. The next step is to evaluate the remainder of the Excel regression output.
