Normal Probability Plot
With Adjustable
Confidence Interval Bands
in 9 Steps in Excel
(Click On Image To See a Larger Version)
Overview
The Normal Probability Plot is a graphical tool used to determine whether sample data is approximately normally distributed. Sample data are plotted against a line that the sample data would fall on if the sample was perfectly normally distributed. Sample data points that do not fall on this line represent deviations from the normal distribution.
Confidence interval bands can be added to a Normal Probability Plot to provide a visual representation of how wide a confidence interval would need to be to contain the majority of the sample data points if the data were normally distributed. The wider that the confidence interval must be to contain the data points, the less likely it is that the data are normally distributed.
The Normal Probability Plot is based upon the following unique characteristic of the normal distribution:
There is a linear relationship between data values and Z Scores of normally-distributed data. A graph of data value vs. data Z Score will be a perfectly straight line if the data are perfectly normally distributed. A Normal Probability Plot graphs the sample data values against the line of perfect normal distribution, i.e., the line that the data values would fall on if the data were perfectly normally distributed.
The line of perfect normal distribution for a given data set is fairly straightforward to construct on an Excel scatterplot. Upper and lower confidence interval bands can also be added to that scatterplot. Finally the actual sample data points can be plotted to show their proximity to the line of perfect normally distribution and also to determine whether they fall within the upper and lower bands of the specified confidence interval.
A histogram is probably a better tool to apply to make a quick determination of whether a data set might be normally distributed. A histogram is highly dependent on the choice of bin size and location but will directly show the shape of the data set’s distribution. A Normal Probability Plot does provide visual indication of the length of tail and whether the data is skewed to either side but the histogram is the more intuitive tool to evaluate the shape of a data set’s distribution.
Placing adjustable confidence interval bands on a Normal Probability Plot makes the Normal Probability Plot more useful. The bands can be adjusted to determine how large of a confidence interval must be to contain the data if they are normally distributed.
Here is the 9-step process to construct a Normal Probability Plot with adjustable confidence interval bans in Excel:
Step 1 – Sort Sample Data
The first step in plotting a data sample within a Normal Probability Plot is to sort the data in ascending order. Sorting can be performed automatically by copying the formula in cell D10 down to cell D28 as shown below. Changing the word SMALL to LARGE convert this ascending sort to a descending sort. This method of sorting using formulas is more convenient than using the Excel sorting tool because the formulas will automatically resort the data if any of the raw data changes. Excel’s sorting tool must be re-run manually every time a new sort is needed. The automatic sorting formula in cell D10 to be copied down to cell D28 is the following:
=IF($B10=””,””,SMALL($B$10:$B$28,ROW()-ROW($D$9)))
(Click On Image To See a Larger Version)
(Click On Image To See a Larger Version)
Step 2 – Calculate Expected Z
Scores
The actual Z Scores of the sorted sample data must first be calculated. The expected Z Scores will correspond to the actual Z Score except that each expected Z Score will be evenly spaced from each other as they increase from smallest to largest Z Score. Each expected Z Score will have the same incremental increase from the previous Z Score.
Z Score is calculated by the following formula:
Z Score = (Data Value – Average Data Value) / (Data Sample Standard Deviation)
The lowest and highest Z Scores are used to determine the total range in the Z Score values of the data set. The increment of Z Score between each expected Z Score point is calculated by the following formula:
Z Score increment = (Total Z Score Range) / (n – 1)
n = the number of data points in the data set
These calculations in Excel are shown as follows:
Data sample count, mean, and sample standard deviation are first calculated in Excel.
(Click On Image To See a Larger Version)
The actual Z Score of each sample data point can now be calculated.
(Click On Image To See a Larger Version)
The total range of actual Z Score values can now be calculated. The expected Z Score incremental increase can be calculated as soon as the total range of Z Score values and the number of data points is known.
(Click On Image To See a Larger Version)
The expected Z Scores for the sample data points if they were normally distributed can now be calculated by starting at the lowest actual Z Score value and adding the incremental increase to each consecutive Z Score until the highest actual Z Score value is reached.
It is important to note that the expected Z Scores calculated in this step will be the X-axis values of all points plotted on the Normal Probability Plot. The four sets of data values that will be plotted on the Normal Probability Plot using these expected Z Scores as their x-axis values are the following:
-
The normal distribution line
-
The upper confidence interval band
-
The lower confidence interval band
-
The actual sample data points
Expected Z Scores are calculated in Excel as follows:
(Click On Image To See a Larger Version)
Step 3 – Calculate Expected Data
Values if Data Are Normally
Distributed
Calculating the data values that would occur if the data were normally distributed is implemented with the following two-step process:
-
Calculate the CDF (Cumulative Distribution Function or F(Y)) of each data point. This can be done using the Excel formula NORM.S.DIST(Z Score, TRUE). The Z Scores used are the expected Z Scores if the data were normally distributed.
-
Calculate each expected data point’s value with the following Excel formula:
NORM.INV( F(Y), Sample Mean, Sample Standard Deviation)
The normal distribution has the unique property that there is a linear relationship between Z Scores and data values if the data are normally distributed.
(Click On Image To See a Larger Version)
(Click On Image To See a Larger Version)
Step 4 – Calculate Half Width of
Confidence Interval
The half-width of the confidence interval of the mean Y value at any X value of a fitted line is shown here. This formula is commonly used to construct the confidence interval about a regression line.
It should be noted that the confidence interval bands for the normal probability plot normal distribution line are curved with the narrowest part of the band occurring at the mean X value. Intuitively this occurs because there is less certainty of mean Y values as X values get further and further from the mean X value. The (Xi – X_bar) term in the formula has its lowest value at the mean X value. The confidence interval bands will be at their narrowest point here at the mean X value as a result.
(Click On Image To See a Larger Version)
The individual pieces of this formula will be calculated in this section. The Alpha (The level of significance) is user-adjustable and is currently set at a = 0.05. Reducing the size of a will cause the confidence interval bands to widen. The smaller alpha is, the greater the specified level of confidence is. The wider the confidence interval, the greater level of certainty exists that the confidence interval will contain the actual mean. For the Normal Probability Plot, the wider the confidence bands needs to be to contain the sample data points surrounding the normal distribution line, the less likely it is that the sample data are normally distributed.
The degrees of freedom is set at n – 2 because the values of two parameters, X and Y, must be known before an error term can be calculated.
(Click On Image To See a Larger Version)
The t Value above is calculated using the following Excel formula:
=T.INV(1 – Alpha/2, n-2)
=T.INV(1-T2/2,T4)
(Click On Image To See a Larger Version)
The Standard Error of the Y Values is calculated using the following Excel formula:
SE of Y Values = STEYX(Array Sorted Sample Values, Array Z Scores)
SE of Y Values = STEYX(D10:D28,I10:I28)
(Click On Image To See a Larger Version)
The Sum of Squared X Deviations From the Mean is calculated with the following Excel formula:
=DEVSQ(X Array) = DEVSQ(Array Z Scores) = DEVSQ(I10:I28)
(Click On Image To See a Larger Version)
The above piece of the Confidence Interval formula is the reason that the upper and lower confidence interval bands are curved. The confidence interval about the mean estimated Y value (the normal distribution line) is tightest at the mean X value (mean Z Score). The confidence interval about the mean Y value gets wider the further the X value is from the X mean.
The individual pieces of the formula can now be combined to calculate the half-width of the confidence interval at each X value. The expected Z Scores are the X Values.
(Click On Image To See a Larger Version)
Step 5 – Calculate Upper and Lower
Bands of the Confidence Interval
The location and width of the confidence interval at each X value (Expected Z Score) can now be calculated.
(Click On Image To See a Larger Version)
(Click On Image To See a Larger Version)
The following four data sets of Y values will be graphed to form the Normal Probability Plot:
-
Expected sample values will be plotted to form the normal distribution line.
-
Upper confidence interval values will be plotted to form the upper band of the confidence interval.
-
Lower confidence interval values will be plotted to form the lower band of the confidence interval.
-
Actual sorted sample values will be plotted individually.
Each of the above data sets of Y values will use the same X values, i.e., the Expected Z Scores.
This graphing will be performed as follows:
Step 6 – Graph the Normal
Distribution Line
All four data sets will be graphed using the following Excel scatterplot graph with straight lines and markers.
(Click On Image To See a Larger Version)
The first data series to be added the chart is the Normal Distribution Line which consists of Expected Z Scores as the X values and Expected Sample Values as they values. This data series is added as follows:
(Click On Image To See a Larger Version)
This data will appear on the graph as follows. Note that the data graphs as a straight line because of the unique property of the normal distribution that a linear relationship exists between the Z Scores and data values of normally-distributed data.
(Click On Image To See a Larger Version)
By default Excel configures the vertical (Y) axis to cross the horizontal (X) axis at the point X = 0. The vertical axis needs to be moved out of the way to the left side of the graph. This is accomplished by double-clicking the horizontal axis and specifying the vertical axis crosses at X = -2 as is shown when the format Axis dialogue box pops up with the double-click.
(Click On Image To See a Larger Version)
The vertical axis is now moved over. This plotted Normal Distribution Line should be only a line. The markers (round dots in the line) should be removed. Double-click on the line and select Format Data Series. Change the Marker Options from Automatic to None and the markers will disappear, leaving only the line.
(Click On Image To See a Larger Version)
There is a lot of empty space in the graph above and below the plotted line. Double-click on the vertical axis and reset the minimum and maximum axis values to 130,000 and 270,000 in the Format Axis dialogue box.
(Click On Image To See a Larger Version)
(Click On Image To See a Larger Version)
The first data series, the Normal Distribution Line, will now appear as follows:
(Click On Image To See a Larger Version)
Step 7 – Graph the Upper and
Lower Confidence Interval Bands
The upper and lower confidence interval bands are designated as the next two data series to be included in the graph. Right-click on the graph anywhere and click Select Data. Add the two data series in the following manner. Once again, the X values for both data series will be the Expected Z Scores.
(Click On Image To See a Larger Version)
(Click On Image To See a Larger Version)
Remove the markers from of the confidence interval series and they should appear graphed as follows. Note that the confidence interval band is at its narrowest point at the mean X value, which is the mean Expected Z Score of 0.
(Click On Image To See a Larger Version)
Step 8 – Graph Sorted Sample Values
The sorted sample values are the last series to be plotted. Once again the X values of this series will be the Expected Z Scores.
(Click On Image To See a Larger Version)
In the case of this plotted data series, only the markers should be plotted and not the connecting lines. Double-click on the plotted data series to bring up the Format Data Series dialogue box. Select No Line to remove the connecting lines.
(Click On Image To See a Larger Version)
The markers can also be configured here to have a blue color. The graph will now appear as follows:
(Click On Image To See a Larger Version)
Step 9 – Add Chart Title, Axes
Titles, and Legend
Chart elements such as chart title, legend, and axis title can be added to the chart by clicking on the chart and selecting Design / Add Chart Elements.
(Click On Image To See a Larger Version)
The final chart appears as follows:
(Click On Image To See a Larger Version)
The alpha of the chart was initial set to 0.05. Reducing alpha will widen the confidence interval. This the result of the following formula:
Level of Confidence = 1 – Level of Significance (alpha)
The above chart shows that the 95-percent confidence level does not contain many of the data points.
Reducing alpha to 0.0001 causes the confidence interval bands to widen as shown.
(Click On Image To See a Larger Version)
(Click On Image To See a Larger Version)
The Normal Probability Plot is a quick indicator of normality and should be used in combination with other normality-testing techniques. Here is an Excel histogram of this sample data set which shows the general shape of the data’s distribution:
(Click On Image To See a Larger Version)
Excel Master Series Blog Directory
Click Here To See a List Of All
Statistical Topics And Articles In
This Blog
You Will Become an Excel Statistical Master!
Could you please calculate confidence interval as MiniTab?
ReplyDeleteGreat post, but I question the result of the standard deviation as 33499.585. When I run your example problem, I get 3349.95 (1/10th your example value).
ReplyDeleteSorry. I found my error.
DeleteThis comment has been removed by the author.
Deleteyo
DeleteWould it be possible to get the file. I would like to walk it step by steps with your instructiosn
ReplyDeleteclicking on image does not working (giving nothing)
ReplyDeleteit would really help if we could actually see the figures. can you re-post the images please?
ReplyDeleteBeen returning to this page every couple of years as there are no other sites with this particular topic. Has anyone figured out how to perform these calculations?
ReplyDelete