Friday, July 30, 2010

The 22 Biggest Statistical Mistakes You Don't Want To Make

The 22 Most Common

Statistical Errors

1) Not using nonparametric tests when analyzing small samples of unknown distribution

– The t Distribution should only be used in small sample analysis if the population from which the samples were drawn was normally distributed. Nonparametric tests are valid when the population distribution is not known, or is known not to be normally distributed. Using the t distribution in either of these cases for small sample analysis is invalid. I will write a couple of articles in this blog in the future detailing how and when to perform a couple of commonly-used nonparametric tests with Excel.

2) Using a one-tailed test instead of a two-tailed test when accuracy is needed

– If accuracy it what you are seeking, it might be better to use the two-tailed when performing, for example, a hypothesis test. The two-tailed test is more stringent than the one-tailed test because the outer regions (I call them the regions of uncertainty) are half the size in a two-tailed test than in a one-tailed test. The two-tailed test tells you merely that the means are different. The one-tailed test tells you that the means are different in one specific direction.

3) Using covariance analysis instead of correlation analysis

– The output of covariance analysis is dependent upon the scale used to measure the data. Different scales of measurement can produce completely different results on the same data if covariance analysis is used. Correlation analysis is completely independent of the scale used to measure the data. Different scales of measurement will produce the same results on a data set using correlation analysis, unlike covariance analysis.

4) Not taking steps to ensure that your sample is normally distributed when analyzing with the normal distribution

– One way to ensure that you have a normally distributed sample for analysis is to take a number of large samples (each sample consists of at least 30 objects) and then tke the mean from each sample as one sample point. You will then have one final, working sample that consists of the means of all of your previous samples. A statistical theory called the Central Limit Theory states that the means of a group of large samples (each sample consists of at least 30 objects) will be normally distributed, no matter how the underlying population is distributed. You can then perform statistical analysis on that final sample using the normal distribution.

5) Thinking it is impossible to get a statistically significant sample if your target market is large

– The sample size you need from a large population is probably quite a bit smaller than you think. Nationwide surveys are normally within a percentage point or two from real answer after only several thousand interviews have been conducted. That of course depends hugely on obtaining a representative sample to interview.

6) Always requiring 95% certainty

– This could really slow you down. For example, if I’m A/B split-testing keywords or ads in an AdWords campaign, I will typically pick a winner when my split-tester tells me that it is 80% sure that one result is better than the other. Achieving 95% certainty would often take too long.

7) Not using Excel

– This point may sound a little self-serving, but knowing how to do this stuff in Excel is a real time-saver, particularly if you are in marketing, and especially if you’re an Internet marketer. You’ll never need to pick up another thick confusing statistics text book or figure how to work those confusing statistics tables ever again. I’ve actually thrown out all of my statistics text books (well, not quite, I sold them on eBay).

8) Attempting to apply the wrong type of hypothesis test

– There are 4 ways that the data must be classified before the correct hypothesis test can be selected. Another article in this blog discusses this. Also, Chapters 8 and 9 of the Excel Statistical Master provide clear, detailed instructions on how to analyze your data prior to hypothesis test selection. You probably wouldn’t get far into a hypothesis test if you have incorrectly classified the data and selected the wrong hypothesis test.

9) Using the wrong t test

– The t-test to be applied depends upon factors such as whether or samples have the same size and variance. It is important to pick the right t-test before starting.

10) Not controlling or taking into account other variables besides the one(s) being testing when using the t test, ANOVA, or hypothesis tests.

Other variables that not part of the test need to be held as constant as possible during the above tests or your answer might be invalid without you knowing.

11) Not removing outliers prior to statistical analysis

– A couple of outliers can skew results badly. Once again, eyeball the data and determine what belongs and what doesn’t.

12) Analyzing non-normal data with the normal distribution

– Data should always be eyeballed and analyzed for normality before using the normal distribution. If the data is not normally distributed, you must use data fitting techniques to determine which statistical distribution most closely fits the data.

13) Drawing a conclusion before a statistically significant result has been reached

– This is often caused by choosing a statistical test requiring a lot of samples but depending on a low sample rate. A common occurrence of this would be performing multivariate testing on a web site that does not have sufficient traffic. Such a test is likely to be concluded prematurely. A better solution might be to perform a number of successive A/B split-tests in place of multivariate analysis. You get a lot more testing done a lot faster, and correctly.

14) Drawing a conclusion without applying the proper statistical analysis

- This occurs quite often when people simply eyeball the results instead of performing a hypothesis test to determine if the observed change has at least an 80% chance (or whatever level of certainty you desire) of not being pure chance.

15) Not drawing a representative sample from a population

– This is usually solved by taking a larger sample and using a random sampling technique such as nth-ing (sampling every nth object in the population).

16) Only evaluating r square in a regression equation

– In the output of regression performed in Excel, there are actually four very important components of the output that should be looked at. There is an article in this blog that covers this topic in a lot more detail than could be done in this bullet point.

17) Not examining the residuals in regression

– You should always at least eyeball the residuals. If the residuals show a pattern, your regression equation is not explaining all of the behavior of the data.

18) Applying input variables to a regression equation that are outside of the value of the original input variables that were used to create the regression equation

– Here is an example to illustrate why this might produce totally invalid results. Suppose that you created a regression equation that predicted a child’s weight based upon the child’s age, and then you provided an adult age as an input. This regression equation would predict a completely incorrect weight for the adult, because adult data was not used to construct the original regression equation.

19) Adding a large number of new input variables into a regression analysis all at once

– One way to ensure that you have a normally distributed sample for analysis is to take a large number of samples (at 30) with each sample consisting of several random and simultaneously-chosen data points and then take the mean from each sample. Make that mean the sample. Your samples will now be Normally distributed. You can take as few as 2 data points per sample, but the more data points per sample, the fewer data points it will take for your samples (each sample is the mean of the data points collected for one sample) to appear to be Normally distributed. If you are taking only 3 data points per sample, you may have to collect over samples (that would be a total of 300 data points) for your samples to appear to be Normally distributed.

A statistical theory called the Central Limit Theory states that the means of samples (at least 30 samples and each sample having at least 2 data points that are averaged to get a mean, which will be the value of the sample) will be Normally distributed, no matter how the underlying population is distributed. You can then perform statistical analysis on that group using the normal distribution-based techniques.

20) Not doing correlation analysis on all variables prior to performing regression
– You’ll save yourself a lot of time if you can remove any input variables that have a low correlation with the dependent (output – Y) variable or that have a high correlation with another input variable (this error of highly correlated input variables is called multicollinearity). In the 2nd case, you would want to remove the input variable from the highly correlated pair of input variables that has the lowest correlation with the output variable.

21) Not graphing and eyeballing the data prior to performing regression analysis

– Always graph the data before you do regression analysis. You’ll know immediately whether you’re dealing with linear regression, non-linear regression, or completely unrelated data that can’t be regressed.

22) Assuming that correlation equals causation

– This is, of course, not true. However, if you find a correlation, you should look hard for links between the two objects that are correlated. The correlation may be pure chance, but then again, it may not be. A correlation is a reason to look for underlying causes behind the behavior. Correlation is often a symptom of a larger issue, but it is not a guarantee of causality.

If you would like to create a link to this blog article, here is the link to copy for your convenience:

Statistical Mistakes You Don't Want To Make

Please post any comments you have on this article. Your opinion is highly valued!

Excel Master Series Blog Directory

Statistical Topics and Articles In Each Topic


  1. All statistics students should read this advice. I definitely agree with point 7 that Excel is really a great tool for statistics. Not a lot of people are aware of the power of Excel in statistics. If you know how to use Excel for statistics, you'll never need to look at a chart again. Good stuff!

  2. I totally agree about Excel being an outstanding statistical tool. None of my clients have software packages like SPSS, Minitab, or Systat - but they all have Excel. If I didn't do my statistical analysis in Excel, I wouldn't be able to interact with my clients like I do.

  3. This is a great inspiring article.I am pretty much pleased with your good work.You put really very helpful information. Keep it up. Keep blogging. Looking to reading your next post
    statistical analysis service