## Tuesday, May 27, 2014

### Demonstrating the Central Limit Theorem In Excel 2010 and Excel 2013 In An Easy-To-Understand Way

This is one of the following eight articles on the normal distribution in Excel:

Overview of the Normal Distribution

Normal Distribution’s PDF (Probability Density Function) in Excel 2010 and Excel 2013

Normal Distribution’s CDF (Cumulative Distribution Function) in Excel 2010 and Excel 2013

Solving Normal Distribution Problems in Excel 2010 and Excel 2013

Overview of the Standard Normal Distribution in Excel 2010 and Excel 2013

An Important Difference Between the t and Normal Distribution Graphs

The Empirical Rule and Chebyshev’s Theorem in Excel – Calculating How Much Data Is a Certain Distance From the Mean

Demonstrating the Central Limit Theorem In Excel 2010 and Excel 2013 In An Easy-To-Understand Way

## Overview of the Central Limit Theorem

The Central Limit Theorem, or CLT, is one of statistics’ most basic principles. The CLT states the following points:

When an entire population is grouped into random samples of size n, the following are true:

1) As sample size n increases, the distribution of the sample means approaches the normal distribution regardless of how the population is distributed. A sample size of 30 is often quoted as being the minimum sample size necessary to ensure that the sample means will be normal-distributed. In the real world, sample means will be normal-distributed when the sample size is much smaller. This will be demonstrated with Excel shortly.

2) The standard deviation of the sample means (often called the standard error of the mean) will equal the population standard deviation divided by the square root of the sample size. This will also be demonstrated in Excel shortly.
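Both points can be checked with a quick simulation outside of Excel as well. The following sketch (Python, standard library only; the exponential population and all names are illustrative choices, not part of the Excel example) draws many sample means from a skewed population and compares their spread to σ / SQRT(n):

```python
import random
import statistics

random.seed(42)

# A right-skewed population (exponential-like, mean about 100).
population = [random.expovariate(1 / 100) for _ in range(5000)]
sigma = statistics.pstdev(population)  # population standard deviation

n = 10  # sample size
# Means of many random samples of size n drawn from the population.
sample_means = [statistics.mean(random.sample(population, n))
                for _ in range(2000)]

observed = statistics.stdev(sample_means)  # spread of the sample means
predicted = sigma / n ** 0.5               # CLT prediction: sigma / sqrt(n)
print(observed, predicted)                 # the two values come out close
```

The same comparison can be repeated for other values of n; the agreement holds regardless of how skewed the population is.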

## Why the Central Limit Theorem Is So Important to Statistics

The Central Limit Theorem is very important to statistical work because it enables convenient normal-distribution-based and t-Distribution-based hypothesis testing and confidence-interval creation to be performed based upon a single large sample of data without having to verify that the population from which the sample was taken is normal-distributed. Samples are considered to be large if the size of each sample, n, is greater than or equal to 30.

Performing normal-distribution-based and t-Distribution-based hypothesis testing and creating confidence intervals based upon a small sample of data requires verification that the population from which the sample was taken is normal-distributed. The normality of the population is confirmed if the data within the data sample is verified to be normal-distributed. This step is not necessary when the sample is large. Samples are considered to be small if the size of each sample, n, is smaller than 30.

Hypothesis testing of a population mean and confidence intervals of a population mean are discussed in much greater detail in their own sections of this book. Hypothesis testing and confidence intervals of a population mean are calculated based upon a single sample of data. The fundamental assumption that underlies normal-distribution-based and t-Distribution-based hypothesis testing and confidence interval creation of a population mean is that the sample mean is normal-distributed.

The mean of a single sample is normal-distributed if the means of other similar samples are also normal-distributed. In other words, if the means of a number of other same-sized samples randomly taken from the same population were normal-distributed, then the mean of the single sample is normal-distributed.

When the single sample taken is large enough (sample size n is greater than or equal to 30), the mean of that sample is assumed to be normal-distributed as per the Central Limit Theorem. When the size of the single sample is less than 30, the sample’s mean cannot be assumed to be normal-distributed unless one of the following is true:

1) The population from which the sample was drawn is known to be normal-distributed.

2) The data within the single sample is verified through normality testing to be normal-distributed.

Quite often it is not possible to confirm the normality of the population. In this case normality testing should be performed on the data within the single sample taken before normal-distribution-based or t-Distribution-based hypothesis testing or the confidence interval creation can be performed based upon that single sample of data.
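Formal normality testing is covered in its own section of this book, but the idea behind one common approach can be sketched in a few lines. The following is a rough Jarque-Bera-style check (an assumption of this illustration, not the procedure the article itself uses) that compares a sample's skewness and excess kurtosis to the zero values a normal distribution would have:

```python
import random
import statistics

def jarque_bera(data):
    """Rough normality check: Jarque-Bera statistic built from sample
    skewness and excess kurtosis. Values near 0 are consistent with
    normality; large values suggest non-normality."""
    n = len(data)
    m = statistics.mean(data)
    s = statistics.pstdev(data)
    skew = sum((x - m) ** 3 for x in data) / (n * s ** 3)
    kurt = sum((x - m) ** 4 for x in data) / (n * s ** 4) - 3
    return n / 6 * (skew ** 2 + kurt ** 2 / 4)

random.seed(1)
normal_sample = [random.gauss(0, 1) for _ in range(200)]
skewed_sample = [random.expovariate(1) for _ in range(200)]
print(jarque_bera(normal_sample))   # typically small for normal data
print(jarque_bera(skewed_sample))   # much larger for skewed data
```

In practice a dedicated test (Shapiro-Wilk, Anderson-Darling, or the chi-square goodness-of-fit approach shown elsewhere on this blog) should be used; this sketch only illustrates what such a test is measuring.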

## A Demonstration of the Central Limit Theorem Using Excel

The Central Limit Theorem is not immediately intuitive but can be more quickly understood when an example of how it works is shown in graphical form. The following example will be performed in Excel as follows:

A population of 5,000 randomly-generated data points whose distribution is highly skewed will be created in Excel. The population of 5,000 will be randomly grouped into samples of sample size n = 2, sample size n = 5, and sample size n = 10.

The distribution of the sample means will be shown in an Excel histogram created for each of the three sample sizes.

The histograms will demonstrate that the distribution of the sample means becomes closer and closer to the normal distribution. This exercise will also show that the standard deviation of the sample means equals the population standard deviation divided by the square root of the sample size. This means that the distribution’s shape becomes tighter as the sample size increases.

Here are the steps in demonstrating how the Central Limit Theorem works using Excel.

The first step is to use Excel’s random-number generator to generate 5,000 random numbers that are highly skewed. In this case, the data will be highly skewed to the right and will follow this model, which is an Excel-generated histogram:

This population of 5,000 skewed random numbers was created by generating the following sets of random numbers in Excel:

● 1,500 random numbers between 0 and 100. Each of these numbers was created with the Excel formula RANDBETWEEN(0,100)

● 1,100 random numbers between 100 and 200. Each of these numbers was created with the Excel formula RANDBETWEEN(100,200)

● 700 random numbers between 200 and 300. Each of these numbers was created with the Excel formula RANDBETWEEN(200,300)

● 500 random numbers between 300 and 400. Each of these numbers was created with the Excel formula RANDBETWEEN(300,400)

● 400 random numbers between 400 and 500. Each of these numbers was created with the Excel formula RANDBETWEEN(400,500)

● 300 random numbers between 500 and 600. Each of these numbers was created with the Excel formula RANDBETWEEN(500,600)

● 200 random numbers between 600 and 700. Each of these numbers was created with the Excel formula RANDBETWEEN(600,700)

● 100 random numbers between 700 and 800. Each of these numbers was created with the Excel formula RANDBETWEEN(700,800)

● 100 random numbers between 800 and 900. Each of these numbers was created with the Excel formula RANDBETWEEN(800,900)

● 100 random numbers between 900 and 1,000. Each of these numbers was created with the Excel formula RANDBETWEEN(900,1000)
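The same stratified population can be built outside of Excel for experimentation. A minimal sketch in Python (the variable names are illustrative; random.randint is inclusive on both ends, like RANDBETWEEN):

```python
import random

random.seed(0)

# (count, low, high) for each stratum, mirroring the RANDBETWEEN calls above.
strata = [
    (1500, 0, 100), (1100, 100, 200), (700, 200, 300), (500, 300, 400),
    (400, 400, 500), (300, 500, 600), (200, 600, 700), (100, 700, 800),
    (100, 800, 900), (100, 900, 1000),
]

# random.randint(low, high), like RANDBETWEEN(low, high), includes both ends.
population = [random.randint(low, high)
              for count, low, high in strata
              for _ in range(count)]

print(len(population))  # 5000
```

Because the strata overlap only at their endpoints and the counts shrink as the values grow, the resulting histogram is strongly skewed to the right, just like the Excel-generated one.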

These 5,000 numbers need to be randomly placed in a table so that samples can be taken. One way of doing that is to initially place all 5,000 numbers in a single Excel column. These numbers need to be randomly scrambled in the column. This is done as follows:

The column of numbers on the right, in column D, contains the randomly-generated numbers from several of the ranges previously created. The column of numbers on the left, in column C, contains randomly-generated numbers from a single range created with the Excel formula RANDBETWEEN(0,1000).

The objective is to randomly scramble the numbers in column D. The numbers in column D are scrambled by performing a single sort of both columns based upon an ascending sort of the random numbers in column C. Prior to the sort, the numbers in the two columns appear as follows: After the sort according to column C, the numbers appear as follows:

The next step is to take the numbers from the column on the right and place them into a table. This is necessary for taking samples and creating histograms in Excel. Converting column data into a table can be done in Excel using the OFFSET() function. As soon as any cell from C2 to E4 has the correct OFFSET() formula typed into it, that cell can be copied to all of the other cells in the table. The result is the following:
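The sort-on-a-random-column trick translates directly into code. A minimal sketch (Python; the stand-in values and the 10 × 500 table dimensions are illustrative choices that hold all 5,000 points):

```python
import random

random.seed(0)
values = list(range(5000))  # stand-in for the 5,000 generated data points

# Column C's role: pair each value with a random key, then sort ascending
# on the key. This scrambles the values just like the two-column sort.
keyed = [(random.random(), v) for v in values]
keyed.sort()
scrambled = [v for _, v in keyed]

# OFFSET()'s role: reshape the single scrambled column into a table.
rows, cols = 10, 500  # illustrative dimensions: 10 x 500 = 5,000 cells
table = [scrambled[r * cols:(r + 1) * cols] for r in range(rows)]
print(len(table), len(table[0]))  # 10 500
```

In Python, random.shuffle(values) would do the scrambling in one call; the explicit random-key sort is shown here because it mirrors the Excel procedure step for step.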

The data in this exercise was arranged into a table of dimensions 10 rows by 500 columns (holding all 5,000 data points) as follows:

### Population (Sample size n = 1)

A histogram was created in Excel of this population of 5,000 data points. Its population parameters are as follows:

Population size = N = 5,000 (Note that capital N is used to describe population size)

Population mean = µ = 260

Population standard deviation = σ = 230

Note that the following Excel histogram shows the population’s distribution to be highly skewed to the right.

### Sample Size n = 2

A table of samples of sample size n = 2 was created by taking the average of every 2 data points going down each column. A table with these samples appears as follows:

Total number of samples = 2,500

Sample size = n = 2

Sample mean = x_bar = 260

Standard deviation of the sample means = 163 = σ / SQRT(n) (with rounding error)

Note that the shape of the distribution of the sample means when the sample size is 2 is still highly skewed to the right. The bell shape of the normal distribution is already starting to form at a sample size of 2 even though the population from which the samples were taken is highly skewed.

Note how much narrower the distribution’s shape has become as the sample size increases from n = 1 to n = 2.

### Sample Size n = 5

A table of samples of sample size n = 5 was created by taking the average of every 5 data points going down each column. A table with these samples appears as follows:

Total number of samples = 1,000

Sample size = n = 5

Sample mean = x_bar = 260

Standard deviation of the sample means = 104 = σ / SQRT(n) (with rounding error)

Note that the shape of the distribution of the sample means when the sample size is 5 already has a strong resemblance to the bell-shaped normal distribution’s PDF curve even though the population from which the samples were taken is highly skewed. A mild skew to the right can still be seen.

Note how much narrower the distribution’s shape has become as the sample size increases from n = 2 to n = 5. The reduction in the standard deviation of the sample means indicates how much thinner the distribution’s shape has become.

### Sample Size n = 10

A table of samples of sample size n = 10 was created by taking the average of every 10 data points going down each column. A table with these samples appears as follows:

Total number of samples = 500

Sample size = n = 10

Sample mean = x_bar = 260

Standard deviation of the sample means = 72 = σ / SQRT(n) (with rounding error)

Note that the shape of the distribution of the sample means when the sample size is 10 now nearly exactly resembles the bell-shaped normal distribution’s PDF curve even though the population from which the samples were taken is highly skewed.

Note how much narrower the distribution’s shape has become as the sample size increases from n = 5 to n = 10. The reduction in the standard deviation of the sample means indicates how much thinner the distribution’s shape has become.
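The three sample-mean tables above can be reproduced in code to confirm both halves of the theorem at once. The sketch below (Python, standard library only; the exponential stand-in population with mean 260 is an illustrative assumption, not the exact RANDBETWEEN mixture used in the article) averages every n consecutive points, mimicking averaging every n cells going down each column, and compares the spread of the means to σ / SQRT(n):

```python
import random
import statistics

random.seed(0)
# Stand-in skewed population of 5,000 points with mean near 260.
population = [random.expovariate(1 / 260) for _ in range(5000)]
sigma = statistics.pstdev(population)

def means_of_groups(data, n):
    """Average every n consecutive points, like averaging every n cells
    going down each column of the Excel table."""
    return [statistics.mean(data[i:i + n]) for i in range(0, len(data), n)]

for n in (2, 5, 10):
    means = means_of_groups(population, n)
    # Number of samples, observed spread of the means, CLT prediction.
    print(n, len(means),
          round(statistics.pstdev(means)), round(sigma / n ** 0.5))
```

For each sample size the count of samples (2,500, 1,000, 500) and the shrinking spread of the means match the pattern shown in the Excel histograms above.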

The most important point of the Central Limit Theorem is that the distribution of the sample means becomes closer and closer to the normal distribution as sample size increases, regardless of the distribution of the population from which the samples were taken.

Most statistics texts state that the sample size must reach at least 30 before the sample means are nearly certain to be normal-distributed. In the real world, the distribution of the sample means converges to normality at significantly smaller sample sizes. The example just provided shows a reasonable resemblance to the normal distribution’s PDF curve when the sample size is only 5 and the population was highly skewed. A sample size of 10 shows an almost exact resemblance to the normal distribution’s PDF curve.

## The Resulting Robustness of the t-Test

The quick convergence of the distribution of sample means to normality as sample size increases means that the t-test is relatively robust on non-normally-distributed data as long as the sample size is not too small and the data is not too skewed.

The example just provided shows that even highly-skewed data will still have near-normal distribution of sample means when the sample size is as low as 10. Data that is not skewed will have sample means achieving near-normality at smaller sample sizes than 10.

The t-test is therefore very robust in the face of non-normal data. The t-test can produce a reasonably accurate result for sample sizes as low as 5 to 10 for data that is not skewed. Many statistics texts state that sample size must be large (n > 30) for a t-test to be reliable but that is not the case. The example provided in this section shows the sample means of highly-skewed data converging to near-normality at a sample size of 10.
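This robustness can be probed with a rough Monte Carlo sketch. The following (Python, standard library only; the exponential population and trial counts are illustrative assumptions) builds t-based 95% confidence intervals from many skewed samples of size 10 and counts how often they cover the true mean. The critical value 2.262 is the standard two-sided 95% t value for 9 degrees of freedom:

```python
import random
import statistics

random.seed(0)
TRIALS = 4000
N = 10
T_CRIT = 2.262      # two-sided 95% t critical value for df = N - 1 = 9

TRUE_MEAN = 100.0   # skewed population: exponential with mean 100
covered = 0
for _ in range(TRIALS):
    sample = [random.expovariate(1 / TRUE_MEAN) for _ in range(N)]
    m = statistics.mean(sample)
    se = statistics.stdev(sample) / N ** 0.5   # estimated standard error
    if m - T_CRIT * se <= TRUE_MEAN <= m + T_CRIT * se:
        covered += 1

print(covered / TRIALS)  # empirical coverage; compare with the nominal 0.95
```

For data this heavily skewed the empirical coverage runs a few points below the nominal 0.95 at n = 10, which is the sense in which the t-test is "relatively" rather than perfectly robust; for less skewed data the coverage sits closer to nominal at the same sample size.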

## Derivation of the Central Limit Theorem

The derivation of the Central Limit Theorem is readily available on the Internet. The CLT’s derivation is not trivial and involves a significant amount of calculus. Understanding this complicated derivation does not significantly add to one’s understanding of when and how to apply the CLT. For that reason the derivation of the CLT is not discussed here.
