## Friday, May 30, 2014

### Chi-Square Goodness-Of-Fit-Normality Test in 9 Steps in Excel 2010 and Excel 2013

This is one of the following three articles on Chi-Square Goodness-Of-Fit Tests in Excel

Overview of the Chi-Square Goodness-of-Fit Test

Chi-Square Goodness- of-Fit Test With Pre-Determined Bins Sizes in 7 Steps in Excel 2010 and Excel 2013

Chi-Square Goodness-Of-Fit-Normality Test in 9 Steps in Excel 2010 and Excel 2013

# Steps in Excel

The Chi-Square GOF test can be used to test how well any data sample fits just about any distribution. Quite often the Chi-Square GOF test is used to test whether a sample of data is normally distributed. The Chi-Square GOF test for normality is an alternative to other well-known normality tests such as the Anderson-Darling and Kolmogorov-Smirnov tests.

The Chi-Square GOF test can be used to test whether a data sample can be fitted with any distribution for which the CDF (Cumulative Distribution Function) can be calculated. The Anderson-Darling and Kolmogorov-Smirnov tests can only be used to test whether a data sample can be fitted with a continuous distribution such as the normal distribution. The Chi-Square GOF test with continuous distributions as well as discrete distributions such as the binomial and Poisson distributions.

## Chi-Square GOF Test for Normality Example in Excel

Determine with 95 percent certainty whether the following sample of data is normally distributed. (Click On Image To See Larger Version)

### Step 1 – Sort and Standardize Data

Data that will be evaluated with the Chi-Square GOF Test for Normality usually not provided in pre-determined bins. The bins, or more specifically, the upper and lower boundaries of each of the bins, have not been established. Sorting and standardizing the data greatly facilitates the development of the bin specifications.

Sorting the data makes the data’s range and any significant outliers apparent. Outliers judged to be extreme and therefore non-representative of the data can be removed. Each significant outlier should be evaluated on a case-by-case basis. Outliers that have been removed and the justifications for removal should be noted.

Standardizing a data value converts that value to its z Score. A z Score is equal to the number of sample standard deviations that the value is from the sample mean. Standardizing the data allow bin boundaries to be based upon increments of sample standard deviation which is fairly intuitive and uncomplicated.

More importantly, converting data values to their z Scores makes it possible to use the normal distribution’s CDF (Cumulative Distribution Function) to calculate the percentage of total data points that would be expected to fall into each of the bins. This will be discussed in more detail shortly.

a) Sorting the Data

The raw data can be sorted using the sorting tool available in Excel. This is effective but an even better way to sort the data is use the formula shown in the following diagram. Using this formula has the advantage that the data will be automatically resorted if any of the data are changed. The sorting tool would need to be reapplied each time any of the data have been changed. The formula can be typed into the top cell and then quickly copied down to the bottom as follows: (Click On Image To See Larger Version)

b) Standardizing the Data

Standardizing the data simply involves subtracting the mean from the data value and then dividing by the standard deviation. This calculation converts each data value to its z Score. For population data, the z Score is the number of population standard deviations that the data value is from the population mean. For sample data, the z Score is the number of sample standard deviations that a data value is from the sample mean.

The z Scores in this example are calculated from sample data as follows: (Click On Image To See Larger Version)

### Step 2 – Create Bins

Bin creation involves specifying the upper and lower boundaries of each bin into which the actual and expected values will fall. Sorting and standardizing the data simplify bin creation.

The z Scores of the data range from -1.787 to 2.490. The bins should cover that entire range because there are no significant outliers among the 26 total z Scores.

Chi_Square GOF Test Requirements

The bins need to be large enough so the three required conditions for the Test Statistic to follow the Chi-Square Distribution are met. The distribution of the Chi-Square Statistic, Χ2, can be approximated by the Chi-Square distribution if the following 3 conditions are met:

1) The number of bins (n) is at least 5

2) The minimum expected number of data points in any of the bins is at least 1

3) The average number of expected data points in a bin is at least 5

Criteria For Creating Bins

Establishing dimensions for the bins is an arbitrary process. Three important criteria need to be considered when establishing the upper and lower boundaries of the bins. These are the following:

1) The bins need to be large enough so the Chi-Square GOF Test’s three conditions will be met.

2) The overall range of all of the bins should be large enough to capture all data points that have not removed for being outliers.

2) The bins need to be small enough so that the Chi-Square GOF Test will have sufficient power. The power of a statistical test is equivalent to its sensitivity and is measured as follows:

Power = 1 – β

β is the test’s probability of making a Type II error. A type II error is a false negative or failing to detect a significant difference.

Power is therefore a statistical test’s probability of not failing to detect a significant difference. This would be the sensitivity of a test.

3) The distance between upper and lower boundaries for all bins should be as similar as possible.

Establishing optimal dimensions for the bins is dependent on judgment and statistical skill of the person performing the test. One possible configuration for the bins would be to construct five bins that catch all data points with z Scores ranging from -2.5 up to 2.5. Each bin would have a range equaling the length of one z Score. The boundaries of the five bins configured with those dimensions are shown as follows: (Click On Image To See Larger Version)

It is not yet known whether at least one data point is expected to fall into each of these bins and whether the average number of data points expected to fall into a bin is at least five. The number of data points expected to fall into each of the bins if the data were normally distributed will be calculated in Step 4 of this process.

### Step 3 – Determine Actual Count For Each Bin

Since bins dimensions have been established, it is now possible to determine how many of the actual z scores will fall into each bin. The counts of actual z Scores that have values in the range of each of the bins should be taken.

A histogram would perform this task because a histogram provides counts of the number of data points that have values in specified ranges. Histograms in Excel can be implemented in two ways as follows:

1) Excel’s histogram data analysis tool

2) An Excel formula combined with a bar chart

The formula/bar chart method is the better method because the all of the output is automatically updated when the raw data is changed. The histogram tool must be manually re-run each time that raw data is changed. The two methods are shown in detail as follows:

a) Creating a Histogram With the Excel Histogram Tool

This tool is accessed under the Data tab as follows:

Data tab / Data Analysis / Histogram

A Histogram dialogue box appears. This dialogue box requires the location of the data to be analyzed, the upper boundaries of each bin, and the location of the upper left corner of the output.

The output of this tool consists of a Frequency chart that contains the counts of the number of data points that fell into each bin and a histogram bar chart which graphically illustrates the number of data points that fell into each bin. The input and the output of the histogram tool are shown as follows: (Click On Image To See Larger Version)

An expanded view of the completed dialogue box is shown as follows: (Click On Image To See Larger Version)

b) Creating a Histogram With a Formula and Bar Chart

The Excel histogram tool, like all of the data analysis tools, does not automatically update its output when input data is changed. The tool must be manually rerun to update the output whenever the input data has been changed.

This inconvenience can be eliminated by using an Excel formula combined with a bar chart to create an output equivalent to that of the histogram data analysis tool. Excel formulas and charts automatically update their output when input data is changed.

There are two Excel formulas that can be used to provide a count of data values within a specified range. The following formula works in versions of Excel from 2007 and later:

=COUNTIFS(range,”lower condition”,range,”upper condition”)

The following formula works in current versions of Excel and versions of Excel prior to 2007:

=SUMPRODUCT((range & lower condition)*(range & upper condition))

Both of these formulas are shown in operation calculating the actual bin counts as follows: (Click On Image To See Larger Version)

The histogram bar chart is an Excel bar chart that is based upon the actual bin counts and the bins’ upper boundary z Scores.

The bins counts and the bar chart output are automatically updated if any of the raw data have been changed.

This bar chart is created in Excel as follows:

Insert tab / Column Chart / 2-D Clustered Column Chart/

A blank chart will appear on the worksheet. Right-click on the blank chart.

Select Data / This brings up the Select Data Source dialogue box

On the left side under Legend Entries (Series) select the blank data series / Edit /

In the Series Values box, select Bin Actual Count cells J4 to J8 / OK

On the right side under Horizontal (Category) Axis Labels, select Bin Upper Boundary z Score cells I4 to I8 / OK

Note that the values in cells I4 to I8 need to start with lower values on top in order to have lower values on the right side of the x axis.

### Step 4 – Determine Expected Count For Each Bin

Standardizing a data value converts that value to its z Score. Converting data values to their z Scores makes it possible to use the normal distribution’s CDF (Cumulative Distribution Function) to calculate the percentage of total data points that would be expected to fall into each of the bins.

The normal distribution’s CDF (Cumulative Distribution Function) equals the probability that sampled point from a normally distributed population has a value UP TO X given the population’s mean, µ, and standard deviation, σ.

The normal distribution’s CDF is expressed as F(X,µ,σ).

The normal distribution’s CDF at point X is calculated in Excel as follows:

F(X,µ,σ) = NORM.S.DIST(z Score(X),TRUE)

If data are normally distributed, the percentage of total data points that is expected to lie between XUpper and XLower is equal to the difference in the CDF values at those two X values. This is equal to the following:

Percentage of Data between XUpper and XLower = F(XUpper,µ,σ) - F(XLower,µ,σ)

Percentage of Data between XUpper and XLower =

= NORM.S.DIST(z Score(XUpper),TRUE)) – NORM.S.DIST(z Score(XLower),TRUE)

This is illustrated in the following example.

Calculate the percentage of total data points in a normally distributed population that lie between the values of 25 and 30 if the population mean, µ, equals 27 and the population standard deviation, σ, equals 5.

Standardizing the data values converts them to z Scores as follows:

z Score = (X - µ)/σ

z Score(x = 25) = (25 – 27)/5 = -0.4

z Score(x = 30) = (30 – 27)/5 = 0.6

Percentage of Data between XUpper and XLower =

= NORM.S.DIST(z Score(XUpper),TRUE)) – NORM.S.DIST(z Score(XLower),TRUE)

= NORM.S.DIST(0.6,TRUE) – NORM.S.DIST(-0.4)

= 0.7257 – 0.3446 = 0.3812

= 38.12 %

This is demonstrated in the following Excel-generated diagram which shows that 38.12 percent of the area under the normal distribution PDF (Probability Density Function) curve lies between x = 25 and x = 30 if µ = 27 and σ = 5. (Click On Image To See Larger Version)

The CDF values of the z Scores of the upper and lower bin boundaries are created as follows: (Click On Image To See Larger Version)

The percentage of the total number of data points in each bin is equal to the percentage of the normal curve area assigned to each bin if the data are normally distributed.

The percentage of normal curve area assigned to each bin is equal to the CDF of the bin’s upper z Score minus the CDF of the bin’s lower z Score. This subtraction is performed in the following image.

The expected count of data points in a bin if the data is normally distributed is equal to the total number of actual data points (26) times the percentage of the total normal curve area assigned to the bin. This calculation is also performed in the following image: (Click On Image To See Larger Version)

### Step 5 – Verify Required Assumptions

The distribution of the Chi-Square Statistic, Χ2, can be approximated by the Chi-Square distribution if the following 3 conditions are met:

1) n ≥ 5

2) The minimum expected number of data points in any of the bins is at least 1

3) The average number of expected data points in a bin is at least 5

All of these conditions have been met.

### Step 6 – Create Null and Alternative Hypotheses

The Null Hypothesis states that actual distribution of the data matches the hypothesized distribution. The Null Hypothesis for the Chi-Square GOF is always specified as the following:

H0: Χ2 = 0

The Chi-Square Statistic, Χ2, is distributed according to the Chi-Square distribution if certain conditions are met. The Chi-Square distribution has only one parameter: its degrees of freedom, df. The probability density function of the Chi-Square distribution calculated at x is defined as f(x,df) and can only be defined for positive values of x.

Since the Chi-Square’s PDF value f(x,df) only exists for positive values of x, the alternative hypothesis specifies that that the Chi-Square Independence Test is a one-tailed test in the right tail and is specified as follows:

H1: Χ2 > 0

### Step 7 – Calculate Chi-Square Statistic, Χ2

The Test Statistic, which is the Chi-Square Statistic, Χ2, is calculated for n = r x c unique cells in the contingency table as follows: (Click On Image To See Larger Version)

This can be quickly implemented in a convenient table as follows: (Click On Image To See Larger Version)

### Step 8 – Calculate Critical Chi-Square Value and p Value

The degrees of freedom for the Chi-Square Independence Test is calculated as follows:

df = n – 1 –m = n – 1 – 2 = 2

n = k = number of expected bins

The Critical Chi-Square Value is calculated as follows:

Chi-Square Critical = CHISQ.INV.RT(α,df)

Chi-Square Critical = CHISQ.INV.RT(0.05,2) = 5.99

Prior to Excel 2010, the formula is calculated as follows:

Chi-Square Critical = CHIINV(α,df)

The p Value is calculated as follows:

p Value = CHISQ.DIST.RT(Chi-Square Statistic,df)

p Value = CHISQ.DIST.RT(6.60,2) = 0.0369

Prior to Excel 2010, the formula is calculated as follows:

p Value = CHIDIST(Chi-Square Statistic,df)

### Step 9 – Determine Whether To Reject Null Hypothesis

The Null Hypothesis is rejected if either of the two equivalent conditions are shown to exist:

1) Chi-Square Statistic > Critical Chi-Square Value

2) p Value < α

Both of these equivalent conditions exist as follows:

Chi-Square Statistic = 6.60

Critical Chi-Square Value = 5.99

p Value = 0.0369

α = 0.05

In this case we reject the Null Hypothesis because the Chi-Square Statistic (6.60) is larger than the Critical Value (5.99) or, equivalently, the p Value (0.0369) is smaller than Alpha (0.50).

An Excel-generated graphical representation of this problem is shown as follows: (Click On Image To See Larger Version)

Excel Master Series Blog Directory

Statistical Topics and Articles In Each Topic

• Histograms in Excel
• Bar Chart in Excel
• Combinations & Permutations in Excel
• Normal Distribution in Excel
• t-Distribution in Excel
• Binomial Distribution in Excel
• z-Tests in Excel
• t-Tests in Excel
• Hypothesis Tests of Proportion in Excel
• Chi-Square Independence Tests in Excel
• Chi-Square Goodness-Of-Fit Tests in Excel
• F Tests in Excel
• Correlation in Excel
• Pearson Correlation in Excel
• Spearman Correlation in Excel
• Confidence Intervals in Excel
• Simple Linear Regression in Excel
• Multiple Linear Regression in Excel
• Logistic Regression in Excel
• Single-Factor ANOVA in Excel
• Two-Factor ANOVA With Replication in Excel
• Two-Factor ANOVA Without Replication in Excel
• Randomized Block Design ANOVA in Excel
• Repeated-Measures ANOVA in Excel
• ANCOVA in Excel
• Normality Testing in Excel
• Nonparametric Testing in Excel
• Post Hoc Testing in Excel
• Creating Interactive Graphs of Statistical Distributions in Excel
• Solving Problems With Other Distributions in Excel
• Optimization With Excel Solver
• Chi-Square Population Variance Test in Excel
• Analyzing Data With Pivot Tables and Pivot Charts
• SEO Functions in Excel
• Time Series Analysis in Excel
• VLOOKUP
• Simplifying Useful Excel Functions