Thursday, May 29, 2014

This is one of seven articles on Simple Linear Regression in Excel

Complete Example of Simple Linear Regression in 7 Steps in Excel

A company has a large plastic injection molding machine. The company would like to create an equation that will calculate the number of identical plastic parts that would be produced for a specified quantity of input plastic pellets.

The company conducted 21 independent production runs on the machine. In each case a different-sized batch of plastic pellets was input into the machine and the total number of identical plastic parts produced from each batch was recorded.

If the relationship between quantity of input plastic pellets in each batch and the number of parts produced from each batch is linear, calculate the equation that describes that relationship.

It is important to note that all trial runs were performed as identically as possible. The same operator ran the machine on each trial run at approximately the same time during a shift. The machine was calibrated to the same settings and cleaned prior to each trial run, and input plastic pellets from the same batch were used in all 21 trial runs.

The data from the 21 trial runs are as follows:

(Image: table of the data from the 21 trial runs)

Excel Regression Step 1 – Remove Extreme Outliers

Calculation of the mean is one of the fundamental computations when performing linear regression analysis. The mean is unduly affected by outliers. Extreme outliers should be removed before beginning regression analysis. Not all outliers should be removed. An outlier should be removed if it is obviously extreme and inconsistent with the remainder of the data.

At this point in the beginning of the analysis, the objective is to remove outliers that are obviously extreme. After Excel has performed the regression and calculated the residuals, further analysis will be performed to determine if any of the data points can also be considered outliers based upon any unusually large residual terms generated. A data point is often considered to be an outlier if its residual value is more than three standard deviations from the mean of the residuals.

Sorting the Data To Quickly Spot Extreme Outliers

An easy way to quickly spot extreme outliers is to sort the data. Extremely high or low outlier values will appear at the ends of the sort. A convenient, one-step method to sort a column of data in Excel is shown here.

The formula in cell D3 is the following:

=IF($A3="","",LARGE($A$3:$A$23,ROW()-ROW($C$2)))

Copy this formula down as shown to create a descending sort of the data in cells A3 to A23.

Replacing LARGE with SMALL in the formula would create an ascending sort instead of the descending sort performed here.

(Image: the data sorted in descending order with the LARGE formula)
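For readers who want to cross-check the sort outside Excel, here is a minimal Python sketch of the same idea. The function names kth_largest and kth_smallest and the part counts are made up for illustration; they are not the data from the 21 trial runs.

```python
# Hypothetical part counts for six production runs (NOT the article's data).
parts_produced = [212, 198, 9, 240, 225, 187]

def kth_largest(values, k):
    """Mimic Excel's LARGE(range, k): the k-th largest value, with k starting at 1."""
    return sorted(values, reverse=True)[k - 1]

def kth_smallest(values, k):
    """Mimic Excel's SMALL(range, k): the k-th smallest value, with k starting at 1."""
    return sorted(values)[k - 1]

count = len(parts_produced)

# Copying the Excel formula down a column corresponds to stepping k from 1 to n.
descending = [kth_largest(parts_produced, k) for k in range(1, count + 1)]
ascending = [kth_smallest(parts_produced, k) for k in range(1, count + 1)]

print(descending)
print(ascending)
```

An extreme value such as the 9 in this made-up list lands at the very end of the descending sort, which is what makes it easy to spot.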

The lowest Y value, 9, is obviously an extreme outlier and is very different from the rest of the data. The cause of this extreme value is not known; perhaps something unexpected happened during this production run. This value should be removed from the analysis because it would severely skew the final result.

Removing this outlier from the data produces this set of 20 data samples:

(Image: the remaining 20 data samples)

Excel Regression Step 2 – Create a Correlation Matrix

This step is only necessary when performing multiple regression, i.e., linear regression that has more than one independent variable, not the single-variable regression performed here. The purpose of this step is to identify independent variables that are highly correlated with each other. Highly correlated input variables cause a problem called multicollinearity. Single-variable regression has only one independent variable, so there is nothing to check in this example. This step should always be carried out when performing multiple regression. When a highly correlated pair of independent variables is found, one variable of the pair should be removed from the regression.
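Although the step is skipped in this example, a rough Python sketch of what a correlation check might look like for multiple regression is given here. The three predictors x1, x2, and x3 are invented, and the 0.9 cutoff is an assumption chosen only for illustration; the article does not specify a threshold.

```python
from math import sqrt

def pearson(a, b):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / sqrt(var_a * var_b)

# Hypothetical independent variables (NOT from the article).
predictors = {
    "x1": [1.0, 2.0, 3.0, 4.0, 5.0],
    "x2": [2.1, 3.9, 6.2, 8.0, 9.9],  # roughly twice x1, so highly correlated with it
    "x3": [5.0, 1.0, 4.0, 2.0, 3.0],
}
names = list(predictors)

# Correlation matrix stored as a dictionary keyed by (row name, column name).
matrix = {(a, b): pearson(predictors[a], predictors[b]) for a in names for b in names}

THRESHOLD = 0.9  # assumed cutoff for "highly correlated"
flagged = [
    (a, b)
    for i, a in enumerate(names)
    for b in names[i + 1:]
    if abs(matrix[(a, b)]) > THRESHOLD
]

for a in names:
    print(a, [round(matrix[(a, b)], 3) for b in names])
print("Highly correlated pairs:", flagged)
```

In this made-up data only the pair x1 and x2 is flagged, so one of those two variables would be a candidate for removal.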

Excel Regression Step 3 – Scale Variables If Necessary

All variables should be scaled so that their values are of similar magnitude. This limits rounding error and also ensures that the slope of the fitted line will be a convenient size to work with, neither too large nor too small.

The weight of the input pellets is measured in grams. If these weights were specified in kilograms, the two variables would be on much closer scales. Changing the scale of the incoming pellet weight from grams to kilograms provides the following properly scaled data:

(Image: the data with pellet weight rescaled from grams to kilograms)
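The effect of rescaling on the slope can be illustrated with a short Python sketch. The weights and part counts are hypothetical, and least_squares_slope is simply a helper written for this illustration.

```python
# Hypothetical pellet weights in grams and part counts (NOT the article's data).
pellet_weight_grams = [100000.0, 150000.0, 200000.0, 250000.0]
parts_produced = [200.0, 310.0, 395.0, 505.0]

GRAMS_PER_KILOGRAM = 1000.0
pellet_weight_kg = [g / GRAMS_PER_KILOGRAM for g in pellet_weight_grams]

def least_squares_slope(x, y):
    """Slope of the least-squares line of y on x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    return sxy / sxx

slope_per_gram = least_squares_slope(pellet_weight_grams, parts_produced)
slope_per_kg = least_squares_slope(pellet_weight_kg, parts_produced)

print(slope_per_gram)  # parts per gram: a very small number
print(slope_per_kg)    # parts per kilogram: 1000 times larger
```

With this made-up data the slope is about 0.002 parts per gram but 2 parts per kilogram. The fitted relationship is the same; only the units of the slope change.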

Excel Regression Step 4 – Plot the Data

The purpose of this step is to check for linearity. Each independent variable should be plotted against the dependent variable in a scatterplot graph. Linear regression should only be performed if linear relationships exist between the dependent variable and each of the input variables. An Excel X-Y scatterplot of the two variables is shown below. The relationship appears to be a linear one.

(Image: X-Y scatterplot of the data)

Excel Regression Step 5 – Run the Regression Analysis in Excel

Below is the Regression dialogue box with all necessary information filled in. Many of the required regression assumptions concerning the Residuals have not yet been validated. At this point the regression is being run in Excel to calculate the Residuals in order to analyze them. Further analysis of the Excel regression output should take place only after linear regression’s required assumptions concerning the Residuals have been evaluated.

Calculating the Residuals as part of the Excel regression output is specified in the Excel regression dialogue box as follows:

(Image: the Excel Regression dialogue box)

It should be noted that the Residuals are sometimes referred to as the Error terms. The checkbox next to Residuals should be checked in order to have Excel automatically calculate the residual for each data point. The residual is the difference between the actual data point and its value as predicted by the regression equation. Analysis of the residuals is a very important part of linear regression analysis because a number of required assumptions are based upon the residuals.

The checkbox next to Standardized Residuals should also be checked. If this is checked, Excel will calculate the number of standard deviations that each residual value is from the mean of the residuals. Data points are often considered outliers if their residual values are located more than three standard deviations from the residual mean.
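As a rough illustration of the three-standard-deviation rule, the Python sketch below standardizes a list of residuals and flags any that lie more than three standard deviations from the residual mean. The residual values are invented, and dividing by the sample standard deviation (n - 1 in the denominator) is an assumption; Excel's exact calculation of standardized residuals may differ slightly.

```python
from statistics import mean, stdev

# Hypothetical residuals for 20 data points (NOT the article's output).
residuals = [
    -1.5, -2.0, -0.5, -2.5, -1.0, -2.0, -1.5, -3.0, -0.5, -2.0,
    -1.0, -2.5, -1.5, -1.0, -2.0, -1.5, -0.5, -2.0, -1.5, 30.0,
]

def standardize(values):
    """Number of sample standard deviations each value lies from the mean."""
    center = mean(values)
    spread = stdev(values)  # sample standard deviation (assumed divisor n - 1)
    return [(v - center) / spread for v in values]

z_scores = standardize(residuals)

# Zero-based positions of points whose residual is more than 3 standard deviations out.
flagged = [i for i, z in enumerate(z_scores) if abs(z) > 3]

print([round(z, 2) for z in z_scores])
print("Possible outliers at positions:", flagged)
```

Only the last residual in this made-up list is flagged. Note that with very small samples no residual can exceed three standard deviations under this definition, so the rule is only meaningful with enough data points.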

The checkbox next to Residual Plots should also be checked. This will create graphs of the residuals plotted against each of the input (independent) variables. Visual observation of these graphs is an important part of evaluating whether the residuals are independent. If the residuals show patterns in any graph, the residuals are considered to not be independent and the regression should not be considered valid. Independence of the residuals is one of linear regression’s most important required assumptions.

The checkbox next to Line Fit Plots should be checked as well. This will produce a separate graph of the Y values plotted against each X variable. These graphs provide a visual check of the spread of each input (X) variable and of any patterns between each X variable and the output (Y) variable.

The checkbox for the Normal Probability Plot was not checked because that produces a normal probability plot of the Y data (the dependent variable data). A normal probability plot is used to evaluate whether data is normally-distributed. Linear regression does not require the independent or dependent variable data be normally-distributed. Many textbooks incorrectly state that the dependent and/or independent data need to be normally-distributed. This is not the case.

Linear regression does however require that the residuals be normally-distributed. A normal probability plot of the residuals would be very useful to evaluate the normality of the residuals but is not included as a part of Excel’s regression output.

A normal probability plot of the Y data does not provide any useful information and the checkbox that would produce that graph is therefore not checked. It is unclear why Excel includes that functionality with its regression data analysis tool.

Those settings shown in the preceding Excel regression dialogue box produce the following output:

(Image: the Excel regression output)
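To give a sense of where the main numbers in such an output come from, here is a minimal Python sketch of a least-squares line fit. It is not Excel's implementation, and the pellet weights and part counts are hypothetical; fit_line is a helper name chosen for this illustration.

```python
def fit_line(x, y):
    """Least-squares fit of y = intercept + slope * x; also returns R squared."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    ss_total = sum((yi - mean_y) ** 2 for yi in y)
    ss_residual = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    r_squared = 1 - ss_residual / ss_total
    return slope, intercept, r_squared

# Hypothetical data: pellet weight in kilograms and parts produced (NOT the article's data).
pellets_kg = [100.0, 150.0, 200.0, 250.0, 300.0]
parts = [205.0, 298.0, 410.0, 495.0, 602.0]

slope, intercept, r_squared = fit_line(pellets_kg, parts)

print("slope:", round(slope, 4))
print("intercept:", round(intercept, 4))
print("R squared:", round(r_squared, 4))
```

For this made-up data the fitted line is approximately Parts = 5.6 + 1.982 × Kilograms, with an R squared above 0.99.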

The Excel regression output includes the calculation of the Residuals as specified. Linear regression’s required assumptions regarding the Residuals should be evaluated before analyzing any other part of the Excel regression output. The required Residual assumptions must be verified before the regression output is considered valid.

The Residual output includes, for each data point, the predicted value of the Dependent variable, the Residual value (the actual value minus the predicted value), and the standardized Residual value (the number of standard deviations that the Residual value is from the mean of the Residual values). This Residual output is shown as follows:

(Image: the Residual output)

The following graphs were also generated as part of the Excel regression output:

(Image: graphs generated as part of the Excel regression output)

Excel Regression Step 6 – Evaluate the Residuals

The purpose of Residual analysis is to confirm the underlying validity of the regression. Linear regression has a number of required assumptions about the residuals. These assumptions should be evaluated before continuing the analysis of the Excel regression output. If one or more of the required residual assumptions are shown to be invalid, the entire regression analysis might be considered, at best, questionable or, at worst, invalid. The residuals should therefore be analyzed first before analyzing any other part of the Excel regression output.

The Residuals are sometimes called the Error Terms. The Residual is the difference between an observed data value and the value predicted by the regression equation. The formula for the Residual is as follows:

Residual = Y_actual – Y_estimated
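The formula can be applied point by point, as in this small Python sketch. The slope, intercept, and data values are hypothetical and serve only to show the arithmetic.

```python
# Hypothetical fitted line and data (NOT taken from the article's Excel output).
slope = 2.0
intercept = 5.0
x_values = [100.0, 150.0, 200.0]
y_actual = [207.0, 301.0, 409.0]

# Value predicted by the regression equation for each X.
y_estimated = [intercept + slope * x for x in x_values]

# Residual = Y_actual - Y_estimated, computed for each data point.
residuals = [actual - estimated for actual, estimated in zip(y_actual, y_estimated)]

for x, actual, estimated, residual in zip(x_values, y_actual, y_estimated, residuals):
    print(x, actual, estimated, residual)
```

A positive residual means the observed value lies above the fitted line, and a negative residual means it lies below.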

Linear Regression’s Required Residual Assumptions

Linear regression has several required assumptions regarding the residuals. These required residual assumptions are as follows:

1) Outliers have been removed.

2) The residuals must be independent of each other. They must not be correlated with each other.

3) The residuals should have a mean of approximately 0.

4) The residuals must have similar variances throughout all residual values.

5) The residuals must be normally-distributed.

6) The residuals may not be highly correlated with any of the independent (X) variables.

7) There must be enough data points to conduct normality testing of residuals.
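Two of these assumptions, numbers 3 and 6, lend themselves to a quick numerical check. The Python sketch below uses hypothetical X values and residuals; the helper name pearson is chosen for this illustration.

```python
from math import sqrt

def pearson(a, b):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / sqrt(var_a * var_b)

# Hypothetical X values and residuals from a fitted line (NOT the article's output).
x_values = [100.0, 150.0, 200.0, 250.0, 300.0]
residuals = [1.2, -4.9, 8.0, -6.1, 1.8]

# Assumption 3: the residuals should have a mean of approximately 0.
mean_residual = sum(residuals) / len(residuals)

# Assumption 6: the residuals should not be highly correlated with X.
residual_x_correlation = pearson(x_values, residuals)

print("mean of residuals:", round(mean_residual, 6))
print("correlation with X:", round(residual_x_correlation, 6))
```

For a least-squares line that includes an intercept, both quantities are zero by construction apart from rounding, so values far from zero usually point to a data-entry or calculation mistake rather than a property of the data. The remaining assumptions need plots or formal tests.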

Residual Analysis will be performed in a blog article following this one.

Excel Regression Step 7 – Evaluate the Excel Regression Output

The Excel regression output from this example shown below will be evaluated in a blog article shortly following this one.

(Image: the Excel regression output)

