## Tuesday, May 27, 2014

### Demonstrating the Central Limit Theorem In Excel 2010 and Excel 2013 In An Easy-To-Understand Way

This is one of the following eight articles on the normal distribution in Excel:

Overview of the Normal Distribution

Normal Distribution’s PDF (Probability Density Function) in Excel 2010 and Excel 2013

Normal Distribution’s CDF (Cumulative Distribution Function) in Excel 2010 and Excel 2013

Solving Normal Distribution Problems in Excel 2010 and Excel 2013

Overview of the Standard Normal Distribution in Excel 2010 and Excel 2013

An Important Difference Between the t and Normal Distribution Graphs

The Empirical Rule and Chebyshev’s Theorem in Excel – Calculating How Much Data Is a Certain Distance From the Mean

Demonstrating the Central Limit Theorem In Excel 2010 and Excel 2013 In An Easy-To-Understand Way

## Overview of the Central Limit Theorem

The Central Limit Theorem, or CLT, is one of statistics’ most basic principles. The CLT states the following points:

When an entire population is grouped into random samples of size n, the following are true:

1) As sample size n increases, the distribution of the sample means approaches the normal distribution regardless of how the population is distributed. A sample size of 30 is often quoted as being the minimum sample size necessary to ensure that the sample means will be normal-distributed. In the real world, sample means will be normal-distributed when the sample size is much smaller. This will be demonstrated with Excel shortly.

2) The standard deviation of the sample means (often called the standard error of the mean) will equal the population standard deviation divided by the square root of the sample size. This will also be demonstrated in Excel shortly.
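Both points can be checked with a quick simulation outside of Excel as well. The following sketch (Python, standard library only; the exponential population and all names are illustrative choices, not part of the Excel example) draws many sample means from a skewed population and compares their spread to σ / SQRT(n):

```python
import random
import statistics

random.seed(42)

# A right-skewed population (exponential-like, mean about 100).
population = [random.expovariate(1 / 100) for _ in range(5000)]
sigma = statistics.pstdev(population)  # population standard deviation

n = 10  # sample size
# Means of many random samples of size n drawn from the population.
sample_means = [statistics.mean(random.sample(population, n))
                for _ in range(2000)]

observed = statistics.stdev(sample_means)  # spread of the sample means
predicted = sigma / n ** 0.5               # CLT prediction: sigma / sqrt(n)
print(observed, predicted)                 # the two values come out close
```

The same comparison can be repeated for other values of n; the agreement holds regardless of how skewed the population is.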

## Why the Central Limit Theorem Is So Important to Statistics

The Central Limit Theorem is very important to statistical work because it enables convenient normal-distribution-based and t-Distribution-based hypothesis testing and confidence-interval creation to be performed based upon a single large sample of data without having to verify that the population from which the sample was taken is normal-distributed. Samples are considered to be large if the size of each sample, n, is greater than or equal to 30.

Performing normal-distribution-based and t-Distribution-based hypothesis testing and creating confidence intervals based upon a small sample of data requires verification that the population from which the sample was taken is normal-distributed. The normality of the population is confirmed if the data within the data sample is verified to be normal-distributed. This step is not necessary when the sample is large. Samples are considered to be small if the size of each sample, n, is smaller than 30.

Hypothesis testing of a population mean and confidence intervals of a population mean are discussed in much greater detail in their own sections of this book. Hypothesis testing and confidence intervals of a population mean are calculated based upon a single sample of data. The fundamental assumption that underlies normal-distribution-based and t-Distribution-based hypothesis testing and confidence interval creation of a population mean is that the sample mean is normal-distributed.

The mean of a single sample is normal-distributed if the means of other similar samples are also normal-distributed. In other words, if the means of a number of other same-sized samples randomly taken from the same population were normal-distributed, then the mean of the single sample is normal-distributed.

When the single sample taken is large enough (sample size n is greater than or equal to 30), the mean of that sample is assumed to be normal-distributed as per the Central Limit Theorem. When the size of the single sample is less than 30, the sample’s mean cannot be assumed to be normal-distributed unless one of the following is true:

1) The population from which the sample was drawn is known to be normal-distributed.

2) The data within the single sample is verified through normality testing to be normal-distributed.

Quite often it is not possible to confirm the normality of the population. In this case normality testing should be performed on the data within the single sample taken before normal-distribution-based or t-Distribution-based hypothesis testing or the confidence interval creation can be performed based upon that single sample of data.
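Formal normality testing is covered in its own section of this book, but the idea behind one common approach can be sketched in a few lines. The following is a rough Jarque-Bera-style check (an assumption of this illustration, not the procedure the article itself uses) that compares a sample's skewness and excess kurtosis to the zero values a normal distribution would have:

```python
import random
import statistics

def jarque_bera(data):
    """Rough normality check: Jarque-Bera statistic built from sample
    skewness and excess kurtosis. Values near 0 are consistent with
    normality; large values suggest non-normality."""
    n = len(data)
    m = statistics.mean(data)
    s = statistics.pstdev(data)
    skew = sum((x - m) ** 3 for x in data) / (n * s ** 3)
    kurt = sum((x - m) ** 4 for x in data) / (n * s ** 4) - 3
    return n / 6 * (skew ** 2 + kurt ** 2 / 4)

random.seed(1)
normal_sample = [random.gauss(0, 1) for _ in range(200)]
skewed_sample = [random.expovariate(1) for _ in range(200)]
print(jarque_bera(normal_sample))   # typically small for normal data
print(jarque_bera(skewed_sample))   # much larger for skewed data
```

In practice a dedicated test (Shapiro-Wilk, Anderson-Darling, or the chi-square goodness-of-fit approach shown elsewhere on this blog) should be used; this sketch only illustrates what such a test is measuring.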

## A Demonstration of the Central Limit Theorem Using Excel

The Central Limit Theorem is not immediately intuitive but can be more quickly understood when an example of how it works is shown in graphical form. The following example will be performed in Excel as follows:

A population of 5,000 randomly-generated data points whose distribution is highly skewed will be created in Excel. The population of 5,000 will be randomly grouped into samples of sample size n = 2, sample size n = 5, and sample size n = 10.

The distribution of the sample means will be shown in an Excel histogram created for each of the three sample sizes.

The histograms will demonstrate that the distribution of the sample means becomes closer and closer to the normal distribution. This exercise will also show that the standard deviation of the sample means equals the population standard deviation divided by the square root of the sample size. This means that the distribution’s shape becomes tighter as the sample size increases.

Here are the steps in demonstrating how the Central Limit Theorem works using Excel.

The first step is to use Excel’s random-number generator to generate 5,000 random numbers that are highly skewed. In this case, the data will be highly skewed to the right and will follow this model, which is an Excel-generated histogram:

This population of 5,000 skewed random numbers was created by generating the following sets of random numbers in Excel:

● 1,500 random numbers between 0 and 100. Each of these numbers was created with the Excel formula RANDBETWEEN(0,100)

● 1,100 random numbers between 100 and 200. Each of these numbers was created with the Excel formula RANDBETWEEN(100,200)

● 700 random numbers between 200 and 300. Each of these numbers was created with the Excel formula RANDBETWEEN(200,300)

● 500 random numbers between 300 and 400. Each of these numbers was created with the Excel formula RANDBETWEEN(300,400)

● 400 random numbers between 400 and 500. Each of these numbers was created with the Excel formula RANDBETWEEN(400,500)

● 300 random numbers between 500 and 600. Each of these numbers was created with the Excel formula RANDBETWEEN(500,600)

● 200 random numbers between 600 and 700. Each of these numbers was created with the Excel formula RANDBETWEEN(600,700)

● 100 random numbers between 700 and 800. Each of these numbers was created with the Excel formula RANDBETWEEN(700,800)

● 100 random numbers between 800 and 900. Each of these numbers was created with the Excel formula RANDBETWEEN(800,900)

● 100 random numbers between 900 and 1,000. Each of these numbers was created with the Excel formula RANDBETWEEN(900,1000)
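The same stratified population can be built outside of Excel for experimentation. A minimal sketch in Python (the variable names are illustrative; random.randint is inclusive on both ends, like RANDBETWEEN):

```python
import random

random.seed(0)

# (count, low, high) for each stratum, mirroring the RANDBETWEEN calls above.
strata = [
    (1500, 0, 100), (1100, 100, 200), (700, 200, 300), (500, 300, 400),
    (400, 400, 500), (300, 500, 600), (200, 600, 700), (100, 700, 800),
    (100, 800, 900), (100, 900, 1000),
]

# random.randint(low, high), like RANDBETWEEN(low, high), includes both ends.
population = [random.randint(low, high)
              for count, low, high in strata
              for _ in range(count)]

print(len(population))  # 5000
```

Because the strata overlap only at their endpoints and the counts shrink as the values grow, the resulting histogram is strongly skewed to the right, just like the Excel-generated one.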

These 5,000 numbers need to be randomly placed in a table so that samples can be taken. One way of doing that is to initially place all 5,000 numbers in a single Excel column. These numbers need to be randomly scrambled in the column. This is done as follows:

The column of numbers on the right, in column D, contains the randomly-generated numbers from several of the ranges previously created. The column of numbers on the left, in column C, contains randomly-generated numbers from a single range created with the Excel formula RANDBETWEEN(0,1000).

The objective is to randomly scramble the numbers in column D. The numbers in column D are scrambled by performing a single sort of both columns based upon an ascending sort of the random numbers in column C. Prior to the sort, the numbers in the two columns appear as follows: After the sort according to column C, the numbers appear as follows:

The next step is to take the numbers from the column on the right and place them into a table. This is necessary for taking samples and creating histograms in Excel. Converting column data into a table can be done in Excel using the OFFSET() function. As soon as any cell from C2 to E4 has the correct OFFSET() formula typed into it, that cell can be copied to all of the other cells in the table. The result is the following:
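The sort-on-a-random-column trick translates directly into code. A minimal sketch (Python; the stand-in values and the 10 × 500 table dimensions are illustrative choices that hold all 5,000 points):

```python
import random

random.seed(0)
values = list(range(5000))  # stand-in for the 5,000 generated data points

# Column C's role: pair each value with a random key, then sort ascending
# on the key. This scrambles the values just like the two-column sort.
keyed = [(random.random(), v) for v in values]
keyed.sort()
scrambled = [v for _, v in keyed]

# OFFSET()'s role: reshape the single scrambled column into a table.
rows, cols = 10, 500  # illustrative dimensions: 10 x 500 = 5,000 cells
table = [scrambled[r * cols:(r + 1) * cols] for r in range(rows)]
print(len(table), len(table[0]))  # 10 500
```

In Python, random.shuffle(values) would do the scrambling in one call; the explicit random-key sort is shown here because it mirrors the Excel procedure step for step.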

The data in this exercise was arranged into a table of dimensions 10 rows by 500 columns (holding all 5,000 data points) as follows:

### Population (Sample size n = 1)

A histogram was created in Excel of this population of 5,000 data points. Its population parameters are as follows:

Population size = N = 5,000 (Note that capital N is used to describe population size)

Population mean = µ = 260

Population standard deviation = σ = 230

Note that the following Excel histogram shows the population’s distribution to be highly skewed to the right.

### Sample Size n = 2

A table of samples of sample size n = 2 was created by taking the average of every 2 data points going down each column. A table with these samples appears as follows:

Total number of samples = 2,500

Sample size = n = 2

Sample mean = x_bar = 260

Standard deviation of the sample means = 163 = σ / SQRT(n) (with rounding error)

Note that the shape of the distribution of the sample means when the sample size is 2 is still highly skewed to the right. The bell shape of the normal distribution is already starting to form at a sample size of 2 even though the population from which the samples were taken is highly skewed.

Note how much narrower the distribution’s shape has become as the sample size increases from n = 1 to n = 2.

### Sample Size n = 5

A table of samples of sample size n = 5 was created by taking the average of every 5 data points going down each column. A table with these samples appears as follows:

Total number of samples = 1,000

Sample size = n = 5

Sample mean = x_bar = 260

Standard deviation of the sample means = 104 = σ / SQRT(n) (with rounding error)

Note that the shape of the distribution of the sample means when the sample size is 5 already has a strong resemblance to the bell-shaped normal distribution’s PDF curve even though the population from which the samples were taken is highly skewed. A mild skew to the right can still be seen.

Note how much narrower the distribution’s shape has become as the sample size increases from n = 2 to n = 5. The reduction in the standard deviation of the sample means indicates how much thinner the distribution’s shape has become.

### Sample Size n = 10

A table of samples of sample size n = 10 was created by taking the average of every 10 data points going down each column. A table with these samples appears as follows:

Total number of samples = 500

Sample size = n = 10

Sample mean = x_bar = 260

Standard deviation of the sample means = 72 = σ / SQRT(n) (with rounding error)

Note that the shape of the distribution of the sample means when the sample size is 10 now nearly exactly resembles the bell-shaped normal distribution’s PDF curve even though the population from which the samples were taken is highly skewed.

Note how much narrower the distribution’s shape has become as the sample size increases from n = 5 to n = 10. The reduction in the standard deviation of the sample means indicates how much thinner the distribution’s shape has become.
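The three sample-mean tables above can be reproduced in code to confirm both halves of the theorem at once. The sketch below (Python, standard library only; the exponential stand-in population with mean 260 is an illustrative assumption, not the exact RANDBETWEEN mixture used in the article) averages every n consecutive points, mimicking averaging every n cells going down each column, and compares the spread of the means to σ / SQRT(n):

```python
import random
import statistics

random.seed(0)
# Stand-in skewed population of 5,000 points with mean near 260.
population = [random.expovariate(1 / 260) for _ in range(5000)]
sigma = statistics.pstdev(population)

def means_of_groups(data, n):
    """Average every n consecutive points, like averaging every n cells
    going down each column of the Excel table."""
    return [statistics.mean(data[i:i + n]) for i in range(0, len(data), n)]

for n in (2, 5, 10):
    means = means_of_groups(population, n)
    # Number of samples, observed spread of the means, CLT prediction.
    print(n, len(means),
          round(statistics.pstdev(means)), round(sigma / n ** 0.5))
```

For each sample size the count of samples (2,500, 1,000, 500) and the shrinking spread of the means match the pattern shown in the Excel histograms above.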

The most important point of the Central Limit Theorem is that the distribution of the sample means becomes closer and closer to the normal distribution as sample size increases, regardless of the distribution of the population from which the samples were taken.

Most statistics texts state that the sample size must reach at least 30 before the sample means are nearly certain to be normal-distributed. In the real world, the distribution of the sample means converges to normality at significantly smaller sample sizes. The example just provided shows a reasonable resemblance to the normal distribution’s PDF curve when the sample size is only 5 and the population was highly skewed. A sample size of 10 shows an almost exact resemblance to the normal distribution’s PDF curve.

## The Resulting Robustness of the t-Test

The quick convergence of the distribution of sample means to normality as sample size increases means that the t-test is relatively robust on non-normally-distributed data as long as the sample size is not too small and the data is not too skewed.

The example just provided shows that even highly-skewed data will still have near-normal distribution of sample means when the sample size is as low as 10. Data that is not skewed will have sample means achieving near-normality at smaller sample sizes than 10.

The t-test is therefore very robust in the face of non-normal data. The t-test can produce a reasonably accurate result for sample sizes as low as 5 to 10 for data that is not skewed. Many statistics texts state that sample size must be large (n > 30) for a t-test to be reliable but that is not the case. The example provided in this section shows the sample means of highly-skewed data converging to near-normality at a sample size of 10.
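This robustness can be probed with a rough Monte Carlo sketch. The following (Python, standard library only; the exponential population and trial counts are illustrative assumptions) builds t-based 95% confidence intervals from many skewed samples of size 10 and counts how often they cover the true mean. The critical value 2.262 is the standard two-sided 95% t value for 9 degrees of freedom:

```python
import random
import statistics

random.seed(0)
TRIALS = 4000
N = 10
T_CRIT = 2.262      # two-sided 95% t critical value for df = N - 1 = 9

TRUE_MEAN = 100.0   # skewed population: exponential with mean 100
covered = 0
for _ in range(TRIALS):
    sample = [random.expovariate(1 / TRUE_MEAN) for _ in range(N)]
    m = statistics.mean(sample)
    se = statistics.stdev(sample) / N ** 0.5   # estimated standard error
    if m - T_CRIT * se <= TRUE_MEAN <= m + T_CRIT * se:
        covered += 1

print(covered / TRIALS)  # empirical coverage; compare with the nominal 0.95
```

For data this heavily skewed the empirical coverage runs a few points below the nominal 0.95 at n = 10, which is the sense in which the t-test is "relatively" rather than perfectly robust; for less skewed data the coverage sits closer to nominal at the same sample size.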

## Derivation of the Central Limit Theorem

The derivation of the Central Limit Theorem is readily available on the Internet. The CLT’s derivation is not trivial and involves a significant amount of calculus. Understanding this complicated derivation does not significantly add to one’s understanding of when and how to apply the CLT. For that reason the derivation of the CLT is not discussed here.
