Tuesday, May 27, 2014

Multiple Linear Regression in 6 Steps in Excel 2010 and Excel 2013

This is one of the following seven articles on Multiple Linear Regression in Excel

Basics of Multiple Regression in Excel 2010 and Excel 2013

Complete Multiple Linear Regression Example in 6 Steps in Excel 2010 and Excel 2013

Multiple Linear Regression’s Required Residual Assumptions

Normality Testing of Residuals in Excel 2010 and Excel 2013

Evaluating the Excel Output of Multiple Regression

Estimating the Prediction Interval of Multiple Regression in Excel

Regression - How To Do Conjoint Analysis Using Dummy Variable Regression in Excel

 

Complete Example of

Multiple  Linear

Regression in 6 Steps in

Excel

A researcher is attempting to create a model that accurately predicts the total annual power consumption of companies within a specific industry. The researcher has collected information from 21 companies that specialize in a single industry. The four pieces of information collected from each of the 21 companies are as follows:

1) The company’s total power consumption last year in kilowatts.

2) The company’s total number of production machines.

3) The company’s number of new employees added in the last five years.

4) The company’s total increase in salary paid over the last five years.

The collected data are as follows:

excel, excel 2010, excel 2013, multiple regression, regression, linear regression, statistics,residuals,residual(Click Image To See a Larger Version)

 

Excel Multiple Linear Regression

Step 1 – Remove Extreme Outliers

Calculation of the mean is one of the fundamental computations when performing linear regression analysis. The mean is unduly affected by outliers. Extremely outliers should be removed before beginning regression analysis. Not all outliers should be removed. An outlier should be removed if it is obviously extremely and inconsistent with the remainder of the data.

 

Sorting the Data To Quickly Spot Extreme Outliers

An easy way to spot extreme outliers is to sort the data. Extremely high or low outlier values will appear at the ends of the sort. A convenient, one-step method to sort a column of data in Excel is shown here.

The formula is cell I4 is the following:

=IF($G4=””,””,LARGE($G$4:$G$24,ROW()-ROW($I$3)))

Copy this formula down as shown to create a descending sort of the data in cells I4 to I24.

Exchanging the word SMALL for LARGE would create an ascending sort instead of the descending sort performed here.

excel, excel 2010, excel 2013, multiple regression, regression, linear regression, statistics,residuals,residual(Click Image To See a Larger Version)

Here is the original data with the outlier data record highlighted.

excel, excel 2010, excel 2013, multiple regression, regression, linear regression, statistics,residuals,residual(Click Image To See a Larger Version)

The lowest Y value, 509090, is obviously an extreme outlier and is very different than the rest of the data. The cause of this extreme outlier value is not known. Perhaps something unusual is happening in the company from which this data was drawn? It is clear that this value should be removed from the analysis because it would severely skew the final result.

Removing this outlier from the data produces this set of 20 data records:

excel, excel 2010, excel 2013, multiple regression, regression, linear regression, statistics,residuals,residual(Click Image To See a Larger Version)

 

Excel Multiple Linear Regression

Step 2 – Create a Correlation Matrix

This step is only necessary when performing multiple regression. The purpose of this step is to identify independent variables that are highly correlated. Different input variables of multiple regression that are highly correlated can cause an error called multicollinearity.

Multicollinearity does not reduce the overall predictive power of the model but it can cause the coefficients of the independent variables in the regression equation to change erratically when small changes are introduced to the regression inputs. Multicollinearity can drastically reduce the validity of the individual predictors without affecting the overall reliability of the regression equation.

When highly correlated pairs of independent variables are found, one of the variables of the pair should be removed from the regression. The variable that should be removed is the one with the lowest correlation with the dependent variable, Y.

An Excel correlation matrix of all independent and dependent variables is shown as follows:

excel, excel 2010, excel 2013, multiple regression, regression, linear regression, statistics,residuals,residual(Click Image To See a Larger Version)

This Excel correlation matrix was created using the following inputs for the Excel correlation dialogue box:

excel, excel 2010, excel 2013, multiple regression, regression, linear regression, statistics,residuals,residual(Click Image To See a Larger Version)

We can see from the correlation matrix that there is a very high correlation between two independent variables. The correlation between Total_Salary_Increases and Number_of_Production_Machines is 0.989.

One of these two independent variables should be removed to prevent multicollinearity. The variable that should be removed is the one that has the lower correlation with the dependent variable, Power_Consumption. The independent variable Total_Salary_Increases has a lower correlation with the dependent variable Power_Consumption (0.967) than Number_of_Production_Machines (0.980) and should be removed from the regression analysis.

Here is the data after the variable Total_Salary_Increases is removed from the analysis:

excel, excel 2010, excel 2013, multiple regression, regression, linear regression, statistics,residuals,residual(Click Image To See a Larger Version)

 

Excel Multiple Linear Regression

Step 3 – Scale Variables If Necessary

All variables should be scaled so that each has a similar number of decimal places beyond zero. This limits rounding error and also insures that the slope of the fitted line will be a convenient size to work with and not too large or too small. Ideally, the coefficients of the independent variables should be between one and ten.

The next step following this one (Step 4) is to view individual scatterplots of each independent variables versus the dependent variable. Rescaling the independent variables is one way to ensure that that data points do not have too extreme of a slope in the scatterplot graphs.

Performing a regression analysis with the current independent variables would produce coefficients for each variable that are over 1,000. This can be corrected by multiplying each of the two independent variables by 1,000. Rescaling the variables in that manner is shown as follows:

excel, excel 2010, excel 2013, multiple regression, regression, linear regression, statistics,residuals,residual(Click Image To See a Larger Version)

 

Excel Multiple Linear Regression

Step 4 – Plot the Data

The purpose of plotting the data is to be able to visually inspect the data for linearity. Each independent variable should be plotted against the dependent variable in a scatterplot graph. Linear regression should only be performed if linear relationships exist between the dependent variable and each of the input variables. Excel X-Y scatterplots of the two independent variables versus the dependent variable are shown as follows. The relationships in both cases appear to be linear.

excel, excel 2010, excel 2013, multiple regression, regression, linear regression, statistics,residuals,residual(Click Image To See a Larger Version)

excel, excel 2010, excel 2013, multiple regression, regression, linear regression, statistics,residuals,residual(Click Image To See a Larger Version)

 

Excel Multiple Linear Regression

Step 5 – Run the Regression Analysis

Below is the Regression dialogue box with all of the necessary information filled in. Many of the required regression assumptions concerning the Residuals have not yet been validated. Calculating and evaluating the Residuals will be done before analyzing any other part of the regression output. All four checkboxes in the Residuals section of the regression dialogue box should be checked. This will be discussed shortly.

excel, excel 2010, excel 2013, multiple regression, regression, linear regression, statistics,residuals,residual(Click Image To See a Larger Version)

Here is a close-up of the completed Excel regression dialogue box;

excel, excel 2010, excel 2013, multiple regression, regression, linear regression, statistics,residuals,residual(Click Image To See a Larger Version)

It should be noted that the Residuals are sometimes referred to as the Error terms. The checkbox next to Residuals should be checked in order to have Excel automatically calculate the residual for each data point. The residual is the difference between the actual data point and its value as predicted by the regression equation. Analysis of the residuals is a very important part of linear regression analysis because a number of required assumptions are based upon the residuals.

The checkbox next to Standardized Residuals should also be checked. If this is checked, Excel will calculate the number of standard deviations that each residual value is from the mean of the residuals. Data points are often considered outliers if their residual values are located more than three standard deviations from the residual mean.

The checkbox next to Residual Plots should also be checked. This will create graphs of the residuals plotted against each of the input (independent) variables. Visual observation of these graphs is an important part of evaluating whether the residuals are independent. If the residuals show patterns in any graph, the residuals are considered to not be independent and the regression should not be considered valid. Independence of the residuals is one of linear regression’s most important required assumptions.

The checkbox next to Line Fit plots should be checked as well. This will produce graphs of the Y Values plotted against each X value in a separate graph. This provides visual analysis of the spread of each input (X) variable and any patterns between any X variable the output Y variable.

The checkbox for the Normal Probability Plot was not checked because that produces a normal probability plot of the Y data (the dependent variable data). A normal probability plot is used to evaluate whether data is normally-distributed. Linear regression does not require the independent or dependent variable data be normally-distributed. Many textbooks incorrectly state that the dependent and/or independent data need to be normally-distributed. This is not the case.

Linear regression does however require that the residuals be normally-distributed. A normal probability plot of the residuals would be very useful to evaluate the normality of the residuals but is not included as a part of Excel’s regression output.

A normal probability plot of the Y data does not provide any useful information and the checkbox that would produce that graph is therefore not checked. It is unclear why Excel includes that functionality with its regression data analysis tool.

Those settings shown in the previous Excel regression dialogue box produce the following Excel output:

excel, excel 2010, excel 2013, multiple regression, regression, linear regression, statistics,residuals,residual(Click Image To See a Larger Version)

The Excel regression output includes the calculation of the Residuals as specified. Linear regression’s required assumptions regarding the Residuals should be evaluated before analyzing any other part of the Excel regression output. The required Residual assumptions must be verified before the regression output is considered valid.

The Residual output includes each Dependent variable’s predicted value, its Residual value (the difference between the predicted value and the actual value), and the Residual’s standardized value (the number of standard deviations that the Residual value is from the mean of the Residual values). This Residual output is shown as follows:

excel, excel 2010, excel 2013, multiple regression, regression, linear regression, statistics,residuals,residual(Click Image To See a Larger Version)

The follow graphs were also generated as part of the Excel regression output:

excel, excel 2010, excel 2013, multiple regression, regression, linear regression, statistics,residuals,residual(Click Image To See a Larger Version)

excel, excel 2010, excel 2013, multiple regression, regression, linear regression, statistics,residuals,residual(Click Image To See a Larger Version)

excel, excel 2010, excel 2013, multiple regression, regression, linear regression, statistics,residuals,residual(Click Image To See a Larger Version)

excel, excel 2010, excel 2013, multiple regression, regression, linear regression, statistics,residuals,residual(Click Image To See a Larger Version)

 

Excel Multiple Linear Regression

Step 6 – Evaluate the Residuals

The purpose of Residual analysis is to confirm the underlying validity of the regression. Linear regression has a number of required assumptions about the residuals. These assumptions should be confirmed before evaluating the remainder of the Excel regression output. If one or more of the required residual assumptions are shown to be invalid, the entire regression analysis might be questionable. The residuals should therefore be analyzed first before the remainder of the Excel regression output.

The Residual is sometimes called the Error Term. The Residual is the difference between an observed data value and the value predicted by the regression equation. The formula for the Residual is as follows:

Residual = Yactual – Yestimated

The following blog articles will demonstrate how to evaluate whether linear regression’s required assumptions have been met and also how to interpret the Excel regression output.

 

Excel Master Series Blog Directory

Statistical Topics and Articles In Each Topic

 

1 comment:

  1. Many videos on this topic are in www.KautilyaClasses.com

    ReplyDelete