This is one of the following seven articles on Multiple Linear Regression in Excel
Basics of Multiple Regression in Excel 2010 and Excel 2013
Complete Multiple Linear Regression Example in 6 Steps in Excel 2010 and Excel 2013
Multiple Linear Regression’s Required Residual Assumptions
Normality Testing of Residuals in Excel 2010 and Excel 2013
Evaluating the Excel Output of Multiple Regression
Estimating the Prediction Interval of Multiple Regression in Excel
Regression - How To Do Conjoint Analysis Using Dummy Variable Regression in Excel
Complete Example of
Multiple Linear
Regression in 6 Steps in
Excel
A researcher is attempting to create a model that accurately predicts the total annual power consumption of companies within a specific industry. The researcher has collected information from 21 companies that specialize in a single industry. The four pieces of information collected from each of the 21 companies are as follows:
1) The company’s total power consumption last year in kilowatts.
2) The company’s total number of production machines.
3) The company’s number of new employees added in the last five years.
4) The company’s total increase in salary paid over the last five years.
The collected data are as follows:
(Click Image To See a Larger Version)
Excel Multiple Linear Regression
Step 1 – Remove Extreme Outliers
Calculation of the mean is one of the fundamental computations when performing linear regression analysis. The mean is unduly affected by outliers. Extremely outliers should be removed before beginning regression analysis. Not all outliers should be removed. An outlier should be removed if it is obviously extremely and inconsistent with the remainder of the data.
Sorting the Data To Quickly Spot Extreme Outliers
An easy way to spot extreme outliers is to sort the data. Extremely high or low outlier values will appear at the ends of the sort. A convenient, one-step method to sort a column of data in Excel is shown here.
The formula is cell I4 is the following:
=IF($G4=””,””,LARGE($G$4:$G$24,ROW()-ROW($I$3)))
Copy this formula down as shown to create a descending sort of the data in cells I4 to I24.
Exchanging the word SMALL for LARGE would create an ascending sort instead of the descending sort performed here.
(Click Image To See a Larger Version)
Here is the original data with the outlier data record highlighted.
(Click Image To See a Larger Version)
The lowest Y value, 509090, is obviously an extreme outlier and is very different than the rest of the data. The cause of this extreme outlier value is not known. Perhaps something unusual is happening in the company from which this data was drawn? It is clear that this value should be removed from the analysis because it would severely skew the final result.
Removing this outlier from the data produces this set of 20 data records:
(Click Image To See a Larger Version)
Excel Multiple Linear Regression
Step 2 – Create a Correlation Matrix
This step is only necessary when performing multiple regression. The purpose of this step is to identify independent variables that are highly correlated. Different input variables of multiple regression that are highly correlated can cause an error called multicollinearity.
Multicollinearity does not reduce the overall predictive power of the model but it can cause the coefficients of the independent variables in the regression equation to change erratically when small changes are introduced to the regression inputs. Multicollinearity can drastically reduce the validity of the individual predictors without affecting the overall reliability of the regression equation.
When highly correlated pairs of independent variables are found, one of the variables of the pair should be removed from the regression. The variable that should be removed is the one with the lowest correlation with the dependent variable, Y.
An Excel correlation matrix of all independent and dependent variables is shown as follows:
(Click Image To See a Larger Version)
This Excel correlation matrix was created using the following inputs for the Excel correlation dialogue box:
(Click Image To See a Larger Version)
We can see from the correlation matrix that there is a very high correlation between two independent variables. The correlation between Total_Salary_Increases and Number_of_Production_Machines is 0.989.
One of these two independent variables should be removed to prevent multicollinearity. The variable that should be removed is the one that has the lower correlation with the dependent variable, Power_Consumption. The independent variable Total_Salary_Increases has a lower correlation with the dependent variable Power_Consumption (0.967) than Number_of_Production_Machines (0.980) and should be removed from the regression analysis.
Here is the data after the variable Total_Salary_Increases is removed from the analysis:
(Click Image To See a Larger Version)
Excel Multiple Linear Regression
Step 3 – Scale Variables If Necessary
All variables should be scaled so that each has a similar number of decimal places beyond zero. This limits rounding error and also insures that the slope of the fitted line will be a convenient size to work with and not too large or too small. Ideally, the coefficients of the independent variables should be between one and ten.
The next step following this one (Step 4) is to view individual scatterplots of each independent variables versus the dependent variable. Rescaling the independent variables is one way to ensure that that data points do not have too extreme of a slope in the scatterplot graphs.
Performing a regression analysis with the current independent variables would produce coefficients for each variable that are over 1,000. This can be corrected by multiplying each of the two independent variables by 1,000. Rescaling the variables in that manner is shown as follows:
(Click Image To See a Larger Version)
Excel Multiple Linear Regression
Step 4 – Plot the Data
The purpose of plotting the data is to be able to visually inspect the data for linearity. Each independent variable should be plotted against the dependent variable in a scatterplot graph. Linear regression should only be performed if linear relationships exist between the dependent variable and each of the input variables. Excel X-Y scatterplots of the two independent variables versus the dependent variable are shown as follows. The relationships in both cases appear to be linear.
(Click Image To See a Larger Version)
(Click Image To See a Larger Version)
Excel Multiple Linear Regression
Step 5 – Run the Regression Analysis
Below is the Regression dialogue box with all of the necessary information filled in. Many of the required regression assumptions concerning the Residuals have not yet been validated. Calculating and evaluating the Residuals will be done before analyzing any other part of the regression output. All four checkboxes in the Residuals section of the regression dialogue box should be checked. This will be discussed shortly.
(Click Image To See a Larger Version)
Here is a close-up of the completed Excel regression dialogue box;
(Click Image To See a Larger Version)
It should be noted that the Residuals are sometimes referred to as the Error terms. The checkbox next to Residuals should be checked in order to have Excel automatically calculate the residual for each data point. The residual is the difference between the actual data point and its value as predicted by the regression equation. Analysis of the residuals is a very important part of linear regression analysis because a number of required assumptions are based upon the residuals.
The checkbox next to Standardized Residuals should also be checked. If this is checked, Excel will calculate the number of standard deviations that each residual value is from the mean of the residuals. Data points are often considered outliers if their residual values are located more than three standard deviations from the residual mean.
The checkbox next to Residual Plots should also be checked. This will create graphs of the residuals plotted against each of the input (independent) variables. Visual observation of these graphs is an important part of evaluating whether the residuals are independent. If the residuals show patterns in any graph, the residuals are considered to not be independent and the regression should not be considered valid. Independence of the residuals is one of linear regression’s most important required assumptions.
The checkbox next to Line Fit plots should be checked as well. This will produce graphs of the Y Values plotted against each X value in a separate graph. This provides visual analysis of the spread of each input (X) variable and any patterns between any X variable the output Y variable.
The checkbox for the Normal Probability Plot was not checked because that produces a normal probability plot of the Y data (the dependent variable data). A normal probability plot is used to evaluate whether data is normally-distributed. Linear regression does not require the independent or dependent variable data be normally-distributed. Many textbooks incorrectly state that the dependent and/or independent data need to be normally-distributed. This is not the case.
Linear regression does however require that the residuals be normally-distributed. A normal probability plot of the residuals would be very useful to evaluate the normality of the residuals but is not included as a part of Excel’s regression output.
A normal probability plot of the Y data does not provide any useful information and the checkbox that would produce that graph is therefore not checked. It is unclear why Excel includes that functionality with its regression data analysis tool.
Those settings shown in the previous Excel regression dialogue box produce the following Excel output:
(Click Image To See a Larger Version)
The Excel regression output includes the calculation of the Residuals as specified. Linear regression’s required assumptions regarding the Residuals should be evaluated before analyzing any other part of the Excel regression output. The required Residual assumptions must be verified before the regression output is considered valid.
The Residual output includes each Dependent variable’s predicted value, its Residual value (the difference between the predicted value and the actual value), and the Residual’s standardized value (the number of standard deviations that the Residual value is from the mean of the Residual values). This Residual output is shown as follows:
(Click Image To See a Larger Version)
The follow graphs were also generated as part of the Excel regression output:
(Click Image To See a Larger Version)
(Click Image To See a Larger Version)
(Click Image To See a Larger Version)
(Click Image To See a Larger Version)
Excel Multiple Linear Regression
Step 6 – Evaluate the Residuals
The purpose of Residual analysis is to confirm the underlying validity of the regression. Linear regression has a number of required assumptions about the residuals. These assumptions should be confirmed before evaluating the remainder of the Excel regression output. If one or more of the required residual assumptions are shown to be invalid, the entire regression analysis might be questionable. The residuals should therefore be analyzed first before the remainder of the Excel regression output.
The Residual is sometimes called the Error Term. The Residual is the difference between an observed data value and the value predicted by the regression equation. The formula for the Residual is as follows:
Residual = Yactual – Yestimated
The following blog articles will demonstrate how to evaluate whether linear regression’s required assumptions have been met and also how to interpret the Excel regression output.
Excel Master Series Blog Directory
Statistical Topics and Articles In Each Topic
- Histograms in Excel
- Bar Chart in Excel
- Combinations & Permutations in Excel
- Normal Distribution in Excel
- Overview of the Normal Distribution
- Normal Distribution’s PDF (Probability Density Function) in Excel 2010 and Excel 2013
- Normal Distribution’s CDF (Cumulative Distribution Function) in Excel 2010 and Excel 2013
- Solving Normal Distribution Problems in Excel 2010 and Excel 2013
- Overview of the Standard Normal Distribution in Excel 2010 and Excel 2013
- An Important Difference Between the t and Normal Distribution Graphs
- The Empirical Rule and Chebyshev’s Theorem in Excel – Calculating How Much Data Is a Certain Distance From the Mean
- Demonstrating the Central Limit Theorem In Excel 2010 and Excel 2013 In An Easy-To-Understand Way
- t-Distribution in Excel
- Binomial Distribution in Excel
- z-Tests in Excel
- Overview of Hypothesis Tests Using the Normal Distribution in Excel 2010 and Excel 2013
- One-Sample z-Test in 4 Steps in Excel 2010 and Excel 2013
- 2-Sample Unpooled z-Test in 4 Steps in Excel 2010 and Excel 2013
- Overview of the Paired (Two-Dependent-Sample) z-Test in 4 Steps in Excel 2010 and Excel 2013
- t-Tests in Excel
- Overview of t-Tests: Hypothesis Tests that Use the t-Distribution
- 1-Sample t-Tests in Excel
- 1-Sample t-Test in 4 Steps in Excel 2010 and Excel 2013
- Excel Normality Testing For the 1-Sample t-Test in Excel 2010 and Excel 2013
- 1-Sample t-Test – Effect Size in Excel 2010 and Excel 2013
- 1-Sample t-Test Power With G*Power Utility
- Wilcoxon Signed-Rank Test in 8 Steps As a 1-Sample t-Test Alternative in Excel 2010 and Excel 2013
- Sign Test As a 1-Sample t-Test Alternative in Excel 2010 and Excel 2013
- 2-Independent-Sample Pooled t-Tests in Excel
- 2-Independent-Sample Pooled t-Test in 4 Steps in Excel 2010 and Excel 2013
- Excel Variance Tests: Levene’s, Brown-Forsythe, and F Test For 2-Sample Pooled t-Test in Excel 2010 and Excel 2013
- Excel Normality Tests Kolmogorov-Smirnov, Anderson-Darling, and Shapiro Wilk Tests For Two-Sample Pooled t-Test
- Two-Independent-Sample Pooled t-Test - All Excel Calculations
- 2- Sample Pooled t-Test – Effect Size in Excel 2010 and Excel 2013
- 2-Sample Pooled t-Test Power With G*Power Utility
- Mann-Whitney U Test in 12 Steps in Excel as 2-Sample Pooled t-Test Nonparametric Alternative in Excel 2010 and Excel 2013
- 2- Sample Pooled t-Test = Single-Factor ANOVA With 2 Sample Groups
- 2-Independent-Sample Unpooled t-Tests in Excel
- 2-Independent-Sample Unpooled t-Test in 4 Steps in Excel 2010 and Excel 2013
- Variance Tests: Levene’s Test, Brown-Forsythe Test, and F-Test in Excel For 2-Sample Unpooled t-Test
- Excel Normality Tests Kolmogorov-Smirnov, Anderson-Darling, and Shapiro-Wilk For 2-Sample Unpooled t-Test
- 2-Sample Unpooled t-Test Excel Calculations, Formulas, and Tools
- Effect Size for a 2-Independent-Sample Unpooled t-Test in Excel 2010 and Excel 2013
- Test Power of a 2-Independent Sample Unpooled t-Test With G-Power Utility
- Paired (2-Sample Dependent) t-Tests in Excel
- Paired t-Test in 4 Steps in Excel 2010 and Excel 2013
- Excel Normality Testing of Paired t-Test Data
- Paired t-Test Excel Calculations, Formulas, and Tools
- Paired t-Test – Effect Size in Excel 2010, and Excel 2013
- Paired t-Test – Test Power With G-Power Utility
- Wilcoxon Signed-Rank Test in 8 Steps As a Paired t-Test Alternative
- Sign Test in Excel As A Paired t-Test Alternative
- Hypothesis Tests of Proportion in Excel
- Hypothesis Tests of Proportion Overview (Hypothesis Testing On Binomial Data)
- 1-Sample Hypothesis Test of Proportion in 4 Steps in Excel 2010 and Excel 2013
- 2-Sample Pooled Hypothesis Test of Proportion in 4 Steps in Excel 2010 and Excel 2013
- How To Build a Much More Useful Split-Tester in Excel Than Google's Website Optimizer
- Chi-Square Independence Tests in Excel
- Chi-Square Goodness-Of-Fit Tests in Excel
- F Tests in Excel
- Correlation in Excel
- Pearson Correlation in Excel
- Spearman Correlation in Excel
- Confidence Intervals in Excel
- z-Based Confidence Intervals of a Population Mean in 2 Steps in Excel 2010 and Excel 2013
- t-Based Confidence Intervals of a Population Mean in 2 Steps in Excel 2010 and Excel 2013
- Minimum Sample Size to Limit the Size of a Confidence interval of a Population Mean
- Confidence Interval of Population Proportion in 2 Steps in Excel 2010 and Excel 2013
- Min Sample Size of Confidence Interval of Proportion in Excel 2010 and Excel 2013
- Simple Linear Regression in Excel
- Overview of Simple Linear Regression in Excel 2010 and Excel 2013
- Complete Simple Linear Regression Example in 7 Steps in Excel 2010 and Excel 2013
- Residual Evaluation For Simple Regression in 8 Steps in Excel 2010 and Excel 2013
- Residual Normality Tests in Excel – Kolmogorov-Smirnov Test, Anderson-Darling Test, and Shapiro-Wilk Test For Simple Linear Regression
- Evaluation of Simple Regression Output For Excel 2010 and Excel 2013
- All Calculations Performed By the Simple Regression Data Analysis Tool in Excel 2010 and Excel 2013
- Prediction Interval of Simple Regression in Excel 2010 and Excel 2013
- Multiple Linear Regression in Excel
- Basics of Multiple Regression in Excel 2010 and Excel 2013
- Complete Multiple Linear Regression Example in 6 Steps in Excel 2010 and Excel 2013
- Multiple Linear Regression’s Required Residual Assumptions
- Normality Testing of Residuals in Excel 2010 and Excel 2013
- Evaluating the Excel Output of Multiple Regression
- Estimating the Prediction Interval of Multiple Regression in Excel
- Regression - How To Do Conjoint Analysis Using Dummy Variable Regression in Excel
- Logistic Regression in Excel
- Logistic Regression Overview
- Logistic Regression in 6 Steps in Excel 2010 and Excel 2013
- R Square For Logistic Regression Overview
- Excel R Square Tests: Nagelkerke, Cox and Snell, and Log-Linear Ratio in Excel 2010 and Excel 2013
- Likelihood Ratio Is Better Than Wald Statistic To Determine if the Variable Coefficients Are Significant For Excel 2010 and Excel 2013
- Excel Classification Table: Logistic Regression’s Percentage Correct of Predicted Results in Excel 2010 and Excel 2013
- Hosmer- Lemeshow Test in Excel – Logistic Regression Goodness-of-Fit Test in Excel 2010 and Excel 2013
- Single-Factor ANOVA in Excel
- Overview of Single-Factor ANOVA
- Single-Factor ANOVA in 5 Steps in Excel 2010 and Excel 2013
- Shapiro-Wilk Normality Test in Excel For Each Single-Factor ANOVA Sample Group
- Kruskal-Wallis Test Alternative For Single Factor ANOVA in 7 Steps in Excel 2010 and Excel 2013
- Levene’s and Brown-Forsythe Tests in Excel For Single-Factor ANOVA Sample Group Variance Comparison
- Single-Factor ANOVA - All Excel Calculations
- Overview of Post-Hoc Testing For Single-Factor ANOVA
- Tukey-Kramer Post-Hoc Test in Excel For Single-Factor ANOVA
- Games-Howell Post-Hoc Test in Excel For Single-Factor ANOVA
- Overview of Effect Size For Single-Factor ANOVA
- ANOVA Effect Size Calculation Eta Squared in Excel 2010 and Excel 2013
- ANOVA Effect Size Calculation Psi – RMSSE – in Excel 2010 and Excel 2013
- ANOVA Effect Size Calculation Omega Squared in Excel 2010 and Excel 2013
- Power of Single-Factor ANOVA Test Using Free Utility G*Power
- Welch’s ANOVA Test in 8 Steps in Excel Substitute For Single-Factor ANOVA When Sample Variances Are Not Similar
- Brown-Forsythe F-Test in 4 Steps in Excel Substitute For Single-Factor ANOVA When Sample Variances Are Not Similar
- Two-Factor ANOVA With Replication in Excel
- Two-Factor ANOVA With Replication in 5 Steps in Excel 2010 and Excel 2013
- Variance Tests: Levene’s and Brown-Forsythe For 2-Factor ANOVA in Excel 2010 and Excel 2013
- Shapiro-Wilk Normality Test in Excel For 2-Factor ANOVA With Replication
- 2-Factor ANOVA With Replication Effect Size in Excel 2010 and Excel 2013
- Excel Post Hoc Tukey’s HSD Test For 2-Factor ANOVA With Replication
- 2-Factor ANOVA With Replication – Test Power With G-Power Utility
- Scheirer-Ray-Hare Test Alternative For 2-Factor ANOVA With Replication
- Two-Factor ANOVA Without Replication in Excel
- Randomized Block Design ANOVA in Excel
- Repeated-Measures ANOVA in Excel
- Single-Factor Repeated-Measures ANOVA in 4 Steps in Excel 2010 and Excel 2013
- Sphericity Testing in 9 Steps For Repeated Measures ANOVA in Excel 2010 and Excel 2013
- Effect Size For Repeated-Measures ANOVA in Excel 2010 and Excel 2013
- Friedman Test in 3 Steps For Repeated-Measures ANOVA in Excel 2010 and Excel 2013
- ANCOVA in Excel
- Normality Testing in Excel
- Creating a Box Plot in 8 Steps in Excel
- Creating a Normal Probability Plot With Adjustable Confidence Interval Bands in 9 Steps in Excel With Formulas and a Bar Chart
- Chi-Square Goodness-of-Fit Test For Normality in 9 Steps in Excel
- Kolmogorov-Smirnov, Anderson-Darling, and Shapiro-Wilk Normality Tests in Excel
- Nonparametric Testing in Excel
- Mann-Whitney U Test in 12 Steps in Excel
- Wilcoxon Signed-Rank Test in 8 Steps in Excel
- Sign Test in Excel
- Friedman Test in 3 Steps in Excel
- Scheirer-Ray-Hope Test in Excel
- Welch's ANOVA Test in 8 Steps Test in Excel
- Brown-Forsythe F Test in 4 Steps Test in Excel
- Levene's Test and Brown-Forsythe Variance Tests in Excel
- Chi-Square Independence Test in 7 Steps in Excel
- Chi-Square Goodness-of-Fit Tests in Excel
- Chi-Square Population Variance Test in Excel
- Post Hoc Testing in Excel
- Creating Interactive Graphs of Statistical Distributions in Excel
- Interactive Statistical Distribution Graph in Excel 2010 and Excel 2013
- Interactive Graph of the Normal Distribution in Excel 2010 and Excel 2013
- Interactive Graph of the Chi-Square Distribution in Excel 2010 and Excel 2013
- Interactive Graph of the t-Distribution in Excel 2010 and Excel 2013
- Interactive Graph of the t-Distribution’s PDF in Excel 2010 and Excel 2013
- Interactive Graph of the t-Distribution’s CDF in Excel 2010 and Excel 2013
- Interactive Graph of the Binomial Distribution in Excel 2010 and Excel 2013
- Interactive Graph of the Exponential Distribution in Excel 2010 and Excel 2013
- Interactive Graph of the Beta Distribution in Excel 2010 and Excel 2013
- Interactive Graph of the Gamma Distribution in Excel 2010 and Excel 2013
- Interactive Graph of the Poisson Distribution in Excel 2010 and Excel 2013
- Solving Problems With Other Distributions in Excel
- Solving Uniform Distribution Problems in Excel 2010 and Excel 2013
- Solving Multinomial Distribution Problems in Excel 2010 and Excel 2013
- Solving Exponential Distribution Problems in Excel 2010 and Excel 2013
- Solving Beta Distribution Problems in Excel 2010 and Excel 2013
- Solving Gamma Distribution Problems in Excel 2010 and Excel 2013
- Solving Poisson Distribution Problems in Excel 2010 and Excel 2013
- Optimization With Excel Solver
- Maximizing Lead Generation With Excel Solver
- Minimizing Cutting Stock Waste With Excel Solver
- Optimal Investment Selection With Excel Solver
- Minimizing the Total Cost of Shipping From Multiple Points To Multiple Points With Excel Solver
- Knapsack Loading Problem in Excel Solver – Optimizing the Loading of a Limited Compartment
- Optimizing a Bond Portfolio With Excel Solver
- Travelling Salesman Problem in Excel Solver – Finding the Shortest Path To Reach All Customers
- Chi-Square Population Variance Test in Excel
- Analyzing Data With Pivot Tables
- SEO Functions in Excel
- Time Series Analysis in Excel
- VLOOKUP
Many videos on this topic are in www.KautilyaClasses.com
ReplyDelete