Friday, July 30, 2010

The 22 Most Common

Statistical Errors

1) Not using nonparametric tests when analyzing small samples of unknown distribution

– The t distribution should be used for small-sample analysis only if the population from which the samples were drawn is normally distributed. Nonparametric tests remain valid when the population distribution is unknown, or is known not to be normal. Using the t distribution for small-sample analysis in either of those cases is invalid. I will write a couple of articles in this blog in the future detailing how and when to perform a couple of commonly used nonparametric tests with Excel.

2) Using a one-tailed test instead of a two-tailed test when accuracy is needed

– If accuracy is what you are seeking, it might be better to use the two-tailed test when performing, for example, a hypothesis test. The two-tailed test is more stringent than the one-tailed test because the same significance level is split between both tails, so each outer region (I call them the regions of uncertainty) is half the size of the single outer region in a one-tailed test. The two-tailed test tells you merely that the means are different. The one-tailed test tells you that the means are different in one specific direction.
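To make the difference concrete, here is a minimal sketch that turns one z statistic into both a one-tailed and a two-tailed p-value using only Python's standard library. The function names and the z value of 1.80 are my own illustrative assumptions, not output from any real test.

```python
import math

def normal_cdf(z):
    """Cumulative probability of the standard normal distribution at z."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def one_tailed_p(z):
    """P-value for the directional alternative 'the mean is greater'."""
    return 1.0 - normal_cdf(z)

def two_tailed_p(z):
    """P-value for the non-directional alternative 'the means differ'."""
    return 2.0 * (1.0 - normal_cdf(abs(z)))

z = 1.80  # hypothetical test statistic
print("one-tailed p:", round(one_tailed_p(z), 4))
print("two-tailed p:", round(two_tailed_p(z), 4))
```

With this hypothetical z, the one-tailed p-value falls below 0.05 while the two-tailed p-value does not, which is exactly why the two-tailed test is the more stringent choice.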

3) Using covariance analysis instead of correlation analysis

– The output of covariance analysis depends on the scale used to measure the data. Rescaling the measurements (for example, from meters to centimeters) can produce completely different covariance values from the same data. Correlation analysis is independent of the measurement scale: the same rescaling leaves the correlation unchanged.
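A quick way to convince yourself is to compute both measures on the same data in two different units. The sketch below uses hand-written helper functions and made-up height and weight figures (all of them assumptions for illustration).

```python
import math

def mean(values):
    return sum(values) / len(values)

def covariance(xs, ys):
    """Sample covariance (divides by n - 1)."""
    mx, my = mean(xs), mean(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (len(xs) - 1)

def correlation(xs, ys):
    """Pearson correlation: covariance divided by both standard deviations."""
    return covariance(xs, ys) / math.sqrt(covariance(xs, xs) * covariance(ys, ys))

# Hypothetical data: heights in meters and weights in kilograms.
heights_m = [1.60, 1.65, 1.70, 1.75, 1.80]
weights_kg = [55.0, 61.0, 64.0, 72.0, 78.0]
heights_cm = [h * 100 for h in heights_m]  # same data, different scale

print("covariance: ", covariance(heights_m, weights_kg), covariance(heights_cm, weights_kg))
print("correlation:", correlation(heights_m, weights_kg), correlation(heights_cm, weights_kg))
```

The covariance grows by a factor of 100 when the heights are expressed in centimeters, while the correlation stays the same apart from floating-point rounding.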

4) Not taking steps to ensure that your sample is normally distributed when analyzing with the normal distribution

– One way to ensure that you have a normally distributed sample for analysis is to take a number of large samples (each sample consists of at least 30 objects) and then take the mean from each sample as one sample point. You will then have one final, working sample that consists of the means of all of your previous samples. A statistical theorem called the Central Limit Theorem states that the means of a group of large samples (each sample consists of at least 30 objects) will be approximately normally distributed, no matter how the underlying population is distributed, as long as the population has a finite variance. You can then perform statistical analysis on that final sample using the normal distribution.
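Here is a small simulation of that idea, written as a sketch rather than a recipe. It assumes an exponential population (strongly skewed to the right), 1,000 samples of 30 draws each, and a simple moment-based skewness measure; all of those choices are mine, picked only to make the effect visible.

```python
import random
import statistics

random.seed(42)  # fixed seed so the run is repeatable

def skewness(values):
    """Simple moment-based skewness; close to 0 for a symmetric distribution."""
    m = statistics.fmean(values)
    s = statistics.pstdev(values)
    return sum(((v - m) / s) ** 3 for v in values) / len(values)

# Strongly right-skewed population: exponential with mean 1.
raw = [random.expovariate(1.0) for _ in range(5000)]

# 1,000 samples of 30 draws each; keep only the mean of each sample.
sample_means = [
    statistics.fmean([random.expovariate(1.0) for _ in range(30)])
    for _ in range(1000)
]

print("skewness of raw data:    ", round(skewness(raw), 2))
print("skewness of sample means:", round(skewness(sample_means), 2))
```

The raw data is heavily skewed, while the sample means are much closer to symmetric, which is what the normal distribution requires.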

5) Thinking it is impossible to get a statistically significant sample if your target market is large

– The sample size you need from a large population is probably quite a bit smaller than you think. Nationwide surveys are normally within a percentage point or two of the real answer after only several thousand interviews have been conducted. That of course depends hugely on obtaining a representative sample to interview.
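The usual back-of-the-envelope check is the margin of error for a proportion, z * sqrt(p(1 - p) / n). The sketch below assumes a simple random sample, a 95% confidence level (z = 1.96), and the worst case p = 0.5; the function name is my own.

```python
import math

def margin_of_error(n, p=0.5, z=1.96):
    """Approximate margin of error for a proportion from a simple random sample."""
    return z * math.sqrt(p * (1.0 - p) / n)

for n in (500, 1000, 2000, 4000):
    print(n, "interviews:", round(100 * margin_of_error(n), 1), "percentage points")
```

Notice that the size of the population does not appear anywhere in the formula. For a large population, only the number of interviews matters, provided the sample is representative.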

6) Always requiring 95% certainty

– This could really slow you down. For example, if I’m A/B split-testing keywords or ads in an AdWords campaign, I will typically pick a winner when my split-tester tells me that it is 80% sure that one result is better than the other. Achieving 95% certainty would often take too long.
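I can't show what any particular split-tester does internally, but a common way to arrive at that kind of confidence figure is a pooled two-proportion z-test. The sketch below is one such calculation under that assumption; the function names and the click counts are hypothetical.

```python
import math

def normal_cdf(z):
    """Cumulative probability of the standard normal distribution at z."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def confidence_b_beats_a(clicks_a, views_a, clicks_b, views_b):
    """One-sided confidence (1 minus the p-value) that B's true rate exceeds A's,
    based on a pooled two-proportion z-test."""
    rate_a = clicks_a / views_a
    rate_b = clicks_b / views_b
    pooled = (clicks_a + clicks_b) / (views_a + views_b)
    std_err = math.sqrt(pooled * (1.0 - pooled) * (1.0 / views_a + 1.0 / views_b))
    z = (rate_b - rate_a) / std_err
    return normal_cdf(z)

# Hypothetical results: ad A got 40 clicks in 1,000 views, ad B got 50 in 1,000.
confidence = confidence_b_beats_a(40, 1000, 50, 1000)
print(f"{confidence:.0%} confidence that ad B beats ad A")
```

With these made-up numbers the result lands between 80% and 95%, so an 80% rule would already pick B while a 95% rule would keep the test running.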

7) Not using Excel

– This point may sound a little self-serving, but knowing how to do this stuff in Excel is a real time-saver, particularly if you are in marketing, and especially if you’re an Internet marketer. You’ll never need to pick up another thick, confusing statistics textbook or figure out how to work those confusing statistics tables ever again. I’ve actually thrown out all of my statistics textbooks (well, not quite, I sold them on eBay).

8) Attempting to apply the wrong type of hypothesis test

– There are 4 ways that the data must be classified before the correct hypothesis test can be selected. Another article in this blog discusses this. Also, Chapters 8 and 9 of the Excel Statistical Master provide clear, detailed instructions on how to analyze your data prior to hypothesis test selection. You probably wouldn’t get far into a hypothesis test if you have incorrectly classified the data and selected the wrong hypothesis test.

9) Using the wrong t test

– The t-test to be applied depends upon factors such as whether the samples have the same size and variance. It is important to pick the right t-test before starting.
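To show why the choice matters, the sketch below computes the test statistic and degrees of freedom for two common two-sample variants: the pooled (equal-variance) t-test and Welch's unequal-variance t-test. The function names and the two small data sets are assumptions for illustration only.

```python
import math
import statistics

def pooled_t(a, b):
    """Student's two-sample t statistic and df, assuming equal population variances."""
    na, nb = len(a), len(b)
    va, vb = statistics.variance(a), statistics.variance(b)
    sp2 = ((na - 1) * va + (nb - 1) * vb) / (na + nb - 2)  # pooled variance
    t = (statistics.fmean(a) - statistics.fmean(b)) / math.sqrt(sp2 * (1 / na + 1 / nb))
    return t, na + nb - 2

def welch_t(a, b):
    """Welch's t statistic and df, which do not assume equal variances."""
    na, nb = len(a), len(b)
    va, vb = statistics.variance(a), statistics.variance(b)
    se2 = va / na + vb / nb
    t = (statistics.fmean(a) - statistics.fmean(b)) / math.sqrt(se2)
    df = se2 ** 2 / ((va / na) ** 2 / (na - 1) + (vb / nb) ** 2 / (nb - 1))
    return t, df

# Hypothetical samples with different sizes and very different spreads.
group_a = [10, 11, 12]
group_b = [5, 9, 14, 20, 26, 31]
print("pooled:", pooled_t(group_a, group_b))
print("Welch: ", welch_t(group_a, group_b))
```

When the sample sizes and variances match, the two versions agree. When they differ, as in this example, the statistic and the degrees of freedom change, and so can the conclusion.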

10) Not controlling or taking into account other variables besides the one(s) being tested when using the t test, ANOVA, or hypothesis tests

– Other variables that are not part of the test need to be held as constant as possible during the above tests, or your answer might be invalid without you knowing.

11) Not removing outliers prior to statistical analysis

– A couple of outliers can skew results badly. Once again, eyeball the data and determine what belongs and what doesn’t.
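Eyeballing works well on small data sets. For larger ones, a simple rule can flag candidates for you to inspect. The sketch below uses the common 1.5 x IQR fence, which is my substitution and not a method described in this article; the data is made up.

```python
import statistics

def flag_outliers(values, k=1.5):
    """Return values lying more than k * IQR outside the quartiles."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]

# Hypothetical measurements with one suspicious entry.
data = [12, 14, 15, 15, 16, 17, 18, 19, 20, 95]
print(flag_outliers(data))
```

Treat whatever gets flagged as a prompt to look closer, not as an automatic deletion. You still have to decide what belongs and what doesn't.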

12) Analyzing non-normal data with the normal distribution

– Data should always be eyeballed and analyzed for normality before using the normal distribution. If the data is not normally distributed, you must use data fitting techniques to determine which statistical distribution most closely fits the data.

13) Drawing a conclusion before a statistically significant result has been reached

– This is often caused by choosing a statistical test requiring a lot of samples but depending on a low sample rate. A common occurrence of this would be performing multivariate testing on a web site that does not have sufficient traffic. Such a test is likely to be concluded prematurely. A better solution might be to perform a number of successive A/B split-tests in place of multivariate analysis. You get a lot more testing done a lot faster, and correctly.

14) Drawing a conclusion without applying the proper statistical analysis

– This occurs quite often when people simply eyeball the results instead of performing a hypothesis test to determine if the observed change has at least an 80% chance (or whatever level of certainty you desire) of not being pure chance.

15) Not drawing a representative sample from a population

– This is usually solved by taking a larger sample and using a random sampling technique such as nth-ing (sampling every nth object in the population, also known as systematic sampling).
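Nth-ing takes only a few lines of code. The sketch below picks a random starting point and then takes every nth item; the customer list and the value n = 20 are hypothetical.

```python
import random

def systematic_sample(population, n):
    """Take every n-th item, starting at a random offset within the first n."""
    start = random.randrange(n)
    return population[start::n]

random.seed(7)  # fixed seed so the run is repeatable
customers = list(range(1, 1001))  # hypothetical customer IDs 1..1000
sample = systematic_sample(customers, 20)
print(len(sample), "customers sampled, starting with", sample[:3])
```

One caution: if the list has a repeating pattern whose length matches n (for example, records sorted by day of the week and n = 7), every nth item will not be representative.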

16) Only evaluating r square in a regression equation

– In the output of regression performed in Excel, there are actually four very important components of the output that should be looked at. There is an article in this blog that covers this topic in a lot more detail than could be done in this bullet point.

17) Not examining the residuals in regression

– You should always at least eyeball the residuals. If the residuals show a pattern, your regression equation is not explaining all of the behavior of the data.
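Here is a minimal sketch of what a pattern in the residuals looks like. It fits a straight line to deliberately curved, made-up data (y = x squared) and prints what the line fails to explain; the helper function is my own.

```python
def fit_line(xs, ys):
    """Ordinary least-squares slope and intercept for one input variable."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return slope, my - slope * mx

xs = list(range(1, 11))
ys = [x ** 2 for x in xs]  # hypothetical curved data
slope, intercept = fit_line(xs, ys)
residuals = [y - (slope * x + intercept) for x, y in zip(xs, ys)]
print([round(r, 1) for r in residuals])
```

The residuals come out positive at both ends and negative in the middle. That U shape tells you the straight line is missing a curve in the data, even if R Square looks respectable.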

18) Applying input variables to a regression equation that are outside of the value of the original input variables that were used to create the regression equation

– Here is an example to illustrate why this might produce totally invalid results. Suppose that you created a regression equation that predicted a child’s weight based upon the child’s age, and then you provided an adult age as an input. This regression equation would predict a completely incorrect weight for the adult, because adult data was not used to construct the original regression equation.
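One defensive habit is to store the range of inputs used to build the equation and refuse anything outside it. The sketch below does that for a straight-line equation; the coefficients and the age range of 2 to 12 are invented for illustration and are not real growth data.

```python
def make_guarded_predictor(slope, intercept, x_min, x_max):
    """Wrap a fitted line so it refuses inputs outside the fitted range."""
    def predict(x):
        if not (x_min <= x <= x_max):
            raise ValueError(
                f"input {x} is outside the fitted range [{x_min}, {x_max}]"
            )
        return slope * x + intercept
    return predict

# Hypothetical fit: weight_kg = 2.5 * age_years + 8, built from ages 2 to 12.
predict_weight = make_guarded_predictor(2.5, 8.0, x_min=2, x_max=12)

print(predict_weight(10))  # inside the fitted range
try:
    predict_weight(40)     # an adult age, outside the fitted range
except ValueError as err:
    print("refused:", err)
```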

19) Adding a large number of new input variables into a regression analysis all at once

– One way to ensure that you have a normally distributed sample for analysis is to take a large number of samples (at least 30), with each sample consisting of several random and simultaneously chosen data points, and then take the mean from each sample. Make that mean the sample. Your samples will now be approximately Normally distributed. You can take as few as 2 data points per sample, but the more data points per sample, the fewer data points it will take for your samples (each sample is the mean of the data points collected for one sample) to appear to be Normally distributed. If you are taking only 3 data points per sample, you may have to collect over 100 samples (that would be a total of 300 data points) for your samples to appear to be Normally distributed.

A statistical theorem called the Central Limit Theorem states that the means of samples (at least 30 samples, each having at least 2 data points that are averaged to get a mean, which will be the value of the sample) will be approximately Normally distributed, no matter how the underlying population is distributed. You can then perform statistical analysis on that group using normal distribution-based techniques.

20) Not doing correlation analysis on all variables prior to performing regression

– You’ll save yourself a lot of time if you can remove any input variables that have a low correlation with the dependent (output – Y) variable or that have a high correlation with another input variable (this error of highly correlated input variables is called multicollinearity). In the second case, you would want to remove the input variable from the highly correlated pair of input variables that has the lowest correlation with the output variable.
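The screening step can be sketched in a few lines. The cutoffs below (0.3 for "low" and 0.9 for "high"), the variable names, and the data are all assumptions I picked for illustration; choose thresholds that make sense for your own data.

```python
import math
from itertools import combinations

def correlation(xs, ys):
    """Pearson correlation between two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def screen_inputs(inputs, output, low=0.3, high=0.9):
    """Return (weak, collinear): inputs weakly related to the output,
    and pairs of inputs strongly related to each other."""
    weak = [name for name, col in inputs.items()
            if abs(correlation(col, output)) < low]
    collinear = [(a, b) for a, b in combinations(inputs, 2)
                 if abs(correlation(inputs[a], inputs[b])) > high]
    return weak, collinear

# Hypothetical data set: two nearly identical inputs and one unrelated input.
inputs = {
    "ad_spend_usd": [1, 2, 3, 4, 5, 6],
    "ad_spend_eur": [0.9, 1.8, 2.7, 3.6, 4.5, 5.4],
    "day_of_month": [5, 1, 4, 4, 1, 5],
}
sales = [2.1, 3.9, 6.2, 8.1, 9.8, 12.2]

weak, collinear = screen_inputs(inputs, sales)
print("weakly correlated with output:", weak)
print("highly correlated input pairs:", collinear)
```

Here the two ad spend columns are the same information in different currencies, so one of them should go, and the day-of-month column has almost no relationship with sales.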

21) Not graphing and eyeballing the data prior to performing regression analysis

– Always graph the data before you do regression analysis. You’ll know immediately whether you’re dealing with linear regression, non-linear regression, or completely unrelated data that can’t be regressed.

22) Assuming that correlation equals causation

– This is, of course, not true. However, if you find a correlation, you should look hard for links between the two objects that are correlated. The correlation may be pure chance, but then again, it may not be. A correlation is a reason to look for underlying causes behind the behavior. Correlation is often a symptom of a larger issue, but it is not a guarantee of causality.


Become an Excel Statistical Master

Excel Master Series - MBA-level statistics - Over 1,100+ Pages of Easy-To-Follow Instructions in Excel

It's a Full, Easy-To-Follow MBA Course in Business Statistics, ALL IN EXCEL & MUCH Clearer Than Your Text Book

Download the 1,100+ Page Excel Statistical Master now

More Easy-To-Follow eManuals That You Will Master Quickly

Step-By-Step Optimization With Excel Solver

What's In It?

For anyone who wants to be operating at a high level with the Excel Solver quickly, this is the book for you. Step-By-Step Optimization With Excel Solver is a 200+ page .pdf e-manual of simple yet thorough explanations on how to use the Excel Solver to solve today’s most widely known optimization problems. Loaded with screen shots that are coupled with easy-to-follow instructions, this book will simplify many difficult optimization problems and make you a master of the Excel Solver almost immediately.

Here are just some of the Solver optimization problems that are solved completely with simple-to-understand instructions and screen shots in this e-manual:

• The famous “Traveling Salesman” problem using Solver’s Alldifferent constraint and the Solver’s Evolutionary method to find the shortest path to reach all customers. This also provides an advanced use of the Excel INDEX function.

• The well-known “Knapsack Problem,” which shows how to optimize the use of limited space while satisfying numerous other criteria.

• How to perform nonlinear regression and curve-fitting on the Solver using the Solver’s GRG Nonlinear solving method.

• How to solve the “Cutting Stock Problem” faced by many manufacturing companies who are trying to determine the optimal way to cut sheets of material to minimize waste while satisfying customer orders.

• Portfolio optimization to maximize return or minimize risk.

• Venture capital investment selection using the Solver’s Binary constraint to maximize Net Present Value of selected cash flows at year 0. Clever use of the If-Then-Else statements makes this a simple problem.

• How to use Solver to minimize the total cost of purchasing and shipping goods from multiple suppliers to multiple locations.

• How to optimize the selection of different production machines to minimize cost while fulfilling an order.

• How to optimally allocate a marketing budget to generate the greatest reach and frequency or number of inbound leads at the lowest cost.

Step-By-Step Optimization With Excel Solver has complete instructions and numerous tips on every aspect of operating the Excel Solver. You’ll fully understand the reports and know exactly how to tweak all of the Solver’s settings for total custom use. This e-manual also provides lots of inside advice and guidance on setting up the model in Excel so that it will be as simple and intuitive as possible to work with.

All of the optimization problems in this book are solved step-by-step using a 6-step process that works every time. In addition to detailed screen shots and easy-to-follow explanations on how to solve every optimization problem in the book, a link is provided to download an Excel workbook that has all problems completed exactly as they are in this e-manual.

Step-By-Step Optimization With Excel Solver is exactly the e-manual you need if you want to be optimizing at an advanced level with the Excel Solver quickly.

Comments:

Anonymous, March 5, 2013 at 5:21 AM
All statistics students should read this advice. I definitely agree with point 7 that Excel is really a great tool for statistics. Not a lot of people are aware of the power of Excel in statistics. If you know how to use Excel for statistics, you'll never need to look at a chart again. Good stuff!
Anonymous, March 6, 2013 at 6:56 PM
I totally agree about Excel being an outstanding statistical tool. None of my clients have software packages like SPSS, Minitab, or Systat - but they all have Excel. If I didn't do my statistical analysis in Excel, I wouldn't be able to interact with my clients like I do.
jhon, June 1, 2017 at 3:43 AM
This is a great inspiring article.I am pretty much pleased with your good work.You put really very helpful information. Keep it up. Keep blogging. Looking to reading your next post