Measures of Variation in
Excel
Measures of Variation Overview
Measures of variation describe the degree of spread in a random variable. Measures of variation are sometimes referred to as measures of dispersion. Measures of variation are properties describing probability distributions, populations, or samples taken from a population. Quite often the same measures of variation are calculated differently for populations and samples taken from those population.
The following measures of variation can be calculated in Excel and will be discussed here:
σ^{2} = Population variance – A measure of the degree of dispersion of values of all of the points contained within a population. σ^{2} equals the sum of the squared differences between the sample mean and each data value divided by N. Variances are additive but are not expressed in the same measure as the data values being described. The square root must be taken to get back to the original units.
s^{2} = Sample variance – A measure of the degree of dispersion of values of all of the points contained within a sample. s^{2} equals the sum of the squared differences between the sample mean and each data value divided by (n – 1).
σ = Population standard deviation – The square root of population variance. Standard deviations are not additive but are expressed in the same units as the data values being described.
s = Sample standard deviation – The square root of sample variance.
SS = Sum of squares  Because variances are additive, the sum of the squares of differences between a mean and data values is used to describe the total variance within specific categories. ANOVA and linear regression both uses the sum of the squares in their calculations.
MAD – Mean absolute deviation and Median absolute value – Mean absolute deviation equals the average distance (absolute distance) between data values and the mean. Median absolute value, which is the average distance between data values and the median, is more robust (less sensitive to outliers) than mean absolute deviation.
R = Range – Range describes the difference between the largest and smallest values of a data set. Range is sensitive to outliers. Range is used in SPC (Statistical Process Control) to describe the spread within each sample when sample size is small (typically 2 – 9). Standard deviation is used in SPC to describe spread within each sample when sample size become larger.
Interquartile range – Interquartile range is the spread or width of the middle two quartiles of the four quartiles that all data of a data set have been divided into. Quartiles divide rankordered data into four equal parts, i.e., quartiles Q1, Q2, Q3, and Q4 which contains the highest values. The box of a boxplot contains the middle two quartiles of data while the whiskers contain the outer quartiles of data. Interquartile range is very helpful in finding outliers.
Excel also has excellent tools to visually display the degree of variation existing in a data set. The Excel tools that will be described in this blog article that visually display variation of data are the following:

Histogram

Box plot

Scatter plot
Population Variance
Population variance is a measure of the degree of dispersion of values of all of the points contained within a population. σ^{2} (sigma squared) equals the sum of the squared differences between the sample mean and each data value divided by N, the total number of data values in a population. µ (mu) is the population mean. The formula for calculating population variance is the following:
(Click On Image To See a Larger Version)
Population variance is calculated in Excel 2010 and later with the following formula:
σ^{2} = VAR.P(data range)
Excel 2007 and earlier used the following formula, which also works in later Excel versions:
σ^{2} = VARP(data range)
Note that Greek letters are often used to denote population parameters while letters from the English alphabet are often used to denote sample statistics.
Variances are additive but are not expressed in the same measure as the data values being described. The square root must be taken to get back to the original units. The square root of the variance is the standard deviation. Standard deviations are not additive but they are conveniently expressed in the same units as the original data.
It is important to note that population variance is calculated by dividing the sum of the squared by N, the total number of data points in the population. Sample variance, on the other hand, is calculated by dividing the sum of the squares by (n – 1) instead of n, the number of data points in the sample. This will be discussed in detail shortly.
Sample Variance
Sample variance is a measure of the degree of dispersion of values of all of the points contained within a sample. s^{2} equals the sum of the squared differences between the sample mean and each data value divided by (n1), the total number of data values in sample minus 1. X_{avg} is the sample mean. The formula for calculating population variance is the following:
(Click On Image To See a Larger Version)
Sample variance is calculated in Excel 2010 and later with the following formula:
s^{2} = VAR.S(data range)
Excel 2007 and earlier used the following formula, which also works in later Excel versions:
s^{2} = VAR(data range)
Unbiased Sample Variance
Dividing the sum of the squares from sample data by n instead of (n1) would produce a biased estimate of the sample variation. The term n1, known as Bessel’s correction, replaces n as the divisor to remove the bias.
An intuitive explanation for the use of Bessel’s correction when calculating sample variance or sample standard deviation is as follows:
It is usually the case that when estimating the population standard deviation, σ, from a sample standard deviation, s, the population mean, µ, is also unknown. The sample standard deviation is therefore calculated using the sample mean, x_{avg}, in place of the population mean, µ. This substitution creates a slight bias because the sample’s data points are likely to be closer to x_{avg} than to µ. The sum of the squares calculated from (xi – x_{avg})^{2} will be smaller than (x_{i}  µ)^{2}. This is compensated for by using Bessel’s correction (n1) as the divisor instead of n when calculating sample standard deviation or sample variance.
Note that if the population mean, µ, is known and used instead of sample mean, x_{avg}, during the calculation of sample standard deviation or sample variance, then Bessel’s correction should not be used; the divisor should be n instead of n1. This is usually not the case however.
Bessel’s correction is a consistent estimator because it converges in probability to the population value as the sample size goes to infinity. As sample size increases, the need for the use of Bessel’s correction decreases. When n exceeds 75, the difference between the biased sample standard deviation (using n as the divisor instead of n1) and the population standard deviation is generally less than 1 percent. The use of uncorrected sample standard deviation for very large samples is generally acceptable.
Derivation of Bessel’s Correction
Bessel’s Correction can be derived in a number of ways. Here is one method that is straightforward and intuitive.
(Click On Image To See a Larger Version)
(Click On Image To See a Larger Version)
(Click On Image To See a Larger Version)
(Click On Image To See a Larger Version)
(Click On Image To See a Larger Version)
(Click On Image To See a Larger Version)
(Click On Image To See a Larger Version)
Population Standard Deviation
Population standard deviation, σ (sigma), is the square root of population variance, σ^{2}, and is calculated using the following formula:
µ (mu) = Population mean
N = total number of data points in the population
Population standard deviation is calculated in Excel 2010 and later with the following formula:
σ = STDEV.P(data range)
Excel 2007 and earlier used the following formula, which also works in later Excel versions:
σ = STDEVP(data range)
Note that Greek letters are often used to denote population parameters while letters from the English alphabet are often used to denote sample statistics.
Variances are additive but are not expressed in the same measure as the data values being described. The square root must be taken to get back to the original units. The square root of the variance is the standard deviation. Standard deviations are not additive but they are conveniently expressed in the same units as the original data.
It is important to note that population standard deviation is calculated by dividing the sum of the squared by N, the total number of data points in the population. Sample standard deviation, on the other hand, is calculated by dividing the sum of the squares by (n – 1) instead of n, the number of data points in the sample.
Sample Standard Deviation
Sample standard deviation, s, is the square root of sample variance, s^{2}, and is calculated using the following formula:
(Click On Image To See a Larger Version)
x_{avg} = Population mean
n = total number of data points in the sample
Sample standard deviation is calculated in Excel 2010 and later with the following formula:
s = STDEV.S(data range)
Excel 2007 and earlier used the following formula, which also works in later Excel versions:
s = STDEV(data range)
Descriptive Statistics in Excel
Excel provides a tool in the Data Analysis toolpak called Descriptive Statistics that calculates a data column’s sample standard deviation and sample variance along with sample statistics. The Data Analysis toolpak is an Excel addin that is available in most versions of Excel. This addin is initially inactive and must be activated by the user before the first use. After this addin has been activated, Descriptive Statistics can be accessed as follows:
Data tab / Data Analysis / Descriptive Statistics
The Descriptive Statistics dialogue box then appears. The data to which Descriptive Statistics will be applied needs to be arranged in a single column. An example of a column of sample data and the completed Descriptive Statistics dialogue box is shown as follows:
(Click On Image To See a Larger Version)
Clicking OK will produce the following output. The Excel formulas needed to calculate the same results are shown as well.
(Click On Image To See a Larger Version)
Note that the calculations of standard deviation and variance are done for samples using Bessel’s correction n1 as the divisor of the sum of squares and not n. Descriptive Statistics is intended to provide sample statistics and not population parameters.
Sum of Squares
SS = Sum of squares  Because variances are additive, the sum of the squares of differences between a mean and data values is used to describe the total variance within specific categories. The Excel formula to calculate the sum of the squares of deviations of sample points from the sample mean is as follows:
DEVSQ(data range)
This formula is the same for all versions of Excel.
SS = DEVSQ(data range) = ∑ (x_{i} – x_{avg})
ANOVA uses the sum of the squares to calculate the withinsample variance, SS_{withingroups}, and the betweensample variance, SS_{betweengroups}. The F Value for each F test in ANOVA is calculated as follows:
F Value = [ SS_{betweengroups} / df_{betweengroups} ] / [SS_{withingroups} / df_{withingroups} ]
The p Value for each F test in ANOVA is calculated in Excel as follows:
p Value = F.DIST.RT( F Value, df_{betweengroups} , df_{withingroups} )
If the calculated p Value is smaller than the specified alpha, the factor that is being evaluated in the F test is deemed to be significant, i.e., has an effect on the data values.
Linear regression uses the sum of the squares to calculate total variation (SST), explained variation (SSR), and unexplained variation (SSE). R Square of a linear regression is calculated as follows:
R Square = (Explained variation) / (Total variation) = SSR / SST
MAD
MAD can refer to either Mean Absolute Deviation (sometimes called Average Absolute Deviation) or Median Absolute Deviation. It is necessary to clarify which of the two is being calculated when deriving MAD for a set of data.
Mean Absolute Deviation
Mean Absolute Deviation is calculated by the following formula:
Mean Absolute Deviation = 1/n * ∑ (x_{i} – x_{avg})
Mean Absolute Deviation can be calculated in all versions of Excel with the following Excel formula:
AVEDEV(data range)
Following is an example of the calculation of Mean Absolute Deviation using both the formula and a more detailed showing each step of the calculation:
Median Absolute Deviation
Median Absolute Deviation relies on the median and is therefore more robust than Mean Absolute Deviation. Outliers have a much smaller effect on the median than on the mean.
Median Absolute Deviation is calculated by the following formula:
Mean Absolute Deviation = 1/n * ∑ (x_{i} – Median)
Excel does not have a single formula to calculate Median Absolute Deviation as there is for Mean Absolute Deviation. Each individual step of the calculation must be performed as follows:
Standard Deviation of a Continuous Distribution
This article has so far discussed the calculation of variation of discrete sets of data. The standard deviation of a population of data distributed according to probability function p(x) can be stated in general terms as follows:
(Click On Image To See a Larger Version)
Many of the individual statistical distributions use specific, unique formulas to calculate standard deviation. For example, the variance of data that are distributed according to the binomial distribution is calculated by the formula np(1p).
Effect of Changing Units
Unit can be changed by either adding a constant to each data measurement or by multiplying each data measurement by a constant. These two methods of altering the units of measuring have different effects on the measured variation for the data set as follows:
Adding a Constant to Each Data Value
Adding a constant to each data value does not change the distance between data values. All measures of variability therefore remain unchanged.
Multiplying Each Data Value By a Constant
Multiplying each data value by a constant multiplies range and standard deviation by the value of the constant. The variance is multiplied by the square of the constant.
Range
Range of a data set simply equals the highest data value minus the lowest data value. Excel does not have a formula that specifically calculates the range of a data set. This must be implemented in Excel as follows:
Range = MAX(data range) – MIN(data range)
Range is sensitive only to changes in the outermost data values but not changes to any other data values.
Interquartile Range
Quartiles divide rankordered data into four equal parts, i.e., quartiles Q1, Q2, Q3, and Q4 which contains the highest values. Each quartile represent a range into which one quarter of the data points fall. The upper and lower endpoints of each quartile are not data points of the data set but are the borders of the interval that contains one quarter of the data points.
The interquartile range is the width of the middle two quartiles combined. This is the distance between the upper boundary of Q2 and the lower boundary of Q3.
The boundaries between the quartiles are calculated in Excel using the QUARTILE.EXC() as shown in the following image. The highest and lowest data points within each quartile are calculated in Excel using MIN(), MAX(), INDEX, and MATCH() as shown in the following:
(Click On Image To See a Larger Version)
Here is are closeup images of these Excel commands.
(Click On Image To See a Larger Version)
The Interquartile range is calculated in Excel 2010 and later with the following:
Interquartile range = QUARTILE.EXC(data range,3) – QUARTILE.EXC(data range,1)
In Excel 2007 and prior, the Interquartile range is calculated using QUARTILE(). In Excel 2010 QUARTILE() is replaced by QUARTILE.INC(). This produces a slightly different result than the other Excel 2010 function QUARTILE.EXC().
In Excel 2007 the Interquartile range is calculated as follows:
Interquartile range = QUARTILE(data range,3) – QUARTILE(data range,1)
It should be noted that the median of a data set represent the boundary between the 2^{nd} and 3^{rd} quartiles.
The Excel algorithms for the 3 quartile functions – QUARTILE(), QUARTILE.INC(), and QUARTILE.EXC() – are shown as follows:
Visual Presentation of Variation in Excel
Two tools in Excel that provide a visual representation of the spread of onedimensional data set are the histogram and the box plot. The spread of twodimensional data (XY data) can be visually presented using a scatterplot graph in Excel.
Histogram in Excel
A histogram is a bar chart with each bar showing the frequency of occurrence of data values within a specified range. The ranges are called bins. Creating a histogram in Excel requires specifying the upper limits of each bin as follows:
The Excel histogram is one of the tools of the Data Analysis toolpak. The Analysis toolpak is an Excel addin that is initially inactive in Excel. The user must activate this addin prior to its first use. The Excel histogram tool can be accessed as follows:
Data tab / Data Analysis / Histogram
Doing so brings up the Histogram dialogue box which should be completed as follows:
(Click On Image To See a Larger Version)
Clicking OK produces the following Excel histogram.
(Click On Image To See a Larger Version)
The following two articles in this blog provide detailed descriptions on the following histogram topics:

Creating a histogram in Excel using the histogram tool:
http://blog.excelmasterseries.com/2014/05/howto createhistograminexcel2010_27.html

Creating an automatically updating histogram in 7 steps in Excel using formulas and a bar chart:
http://blog.excelmasterseries.com/2014/05/howtocreate histograminexcel2010.html
(Click On Image To See a Larger Version)
Frequency Tables
The Excel Histogram tool produces a frequency table that is part of its output. The frequency table displays a count of the number of data points that fall into each bin. This is the basis of the histogram chart shown as follows:
A standalone frequency table can also be produced in Excel using the FREQUENCY() formula. The FREQUENCY() formula is an array formula that is implemented as follows.
Step 1 – Highlight the Cells That Will Contain the Output
The raw data should be sorted in a column and the upper limits of each bin should also be listed in a column as shown in the following image.
The FREQUENCY() formula produces a frequency table in a separate data column that contains 1 more cell than the number of bins. The upper limits of six bins are listed in this example so there are six total bins. The FREQUENCY() formula will therefore require a column of seven cells to hold its output.
The FREQUENCY() is an array formula. The first step in implementing an array formula in Excel is to highlight (select) the cells that will hold the output. Click and drag the mouse to highlight seven empty cells in a column where the output should go.
Step 2 – Type in the Formula
While the seven cells remain highlighted, type in the FREQUENCY() formula as shown.
(Click On Image To See a Larger Version)
Step 3 – Type CTRL – SHIFT – ENTER at the Same Time
Doing so will produce the frequency table in cells H4:H10 that have been highlighted.
(Click On Image To See a Larger Version)
Box Plots
Box plots visually display the spread of data in each of the four quartiles. The interquartile range, Q2 and Q3, are contained the two boxes. The outer quartiles, Q1 and Q4, are displayed as whiskers above and below the boxes. Q4 is displayed as the upper whiskers and Q1 is displayed as the lower whisker. For this reason boxplots are sometimes called BoxandWhisker plots.
The boundary between the upper box (Q3) and the lower box (Q2) is the median. Box plot diagrams can also be constructed in Excel to display the data’s mean as well. An example of a box plot in Excel and the data from which it was derived is shown as follows:
(Click On Image To See a Larger Version)
The following article in this blog provides stepbystep instructions on how to produce the preceding box plot in 8 steps in Excel:
http://blog.excelmasterseries.com/2015/02/boxplotsin8steps inexcel.html
Scatter Plot in Excel
A scatter plot is a diagram using Cartesian coordinates to display the XY values of a twodimensional data set. The Y variable displayed on the vertical axis is often the dependent or response variable while the X variable displayed on the horizontal axis is related to that response variable. The scatter plot is also referred to also referred to as a scatter graph, scatter diagram, or scatter chart.
The following Excel scatter plot displays data points which contain a dependent variable (number of parts produced) as the X value and the residual associated with each dependent variable as the Y value. This scatter plot indicates that the residuals have slightly increasing variance as the dependent variable increases in value. This is indicated by the slight fanningout of the Y values (the residual values) as the X variable (number of parts produced) increases. This scatter plot was used as part of a singlevariable linear regression to determine whether one of the required conditions of linear regression, namely that the residual have constant variance, is met. The slight increase in variance of residuals as the dependent variable increases is not significant enough to invalidate the regression that was performed.
(Click On Image To See a Larger Version)
Variation in Statistical Process Control
Every measurable process contains variation. This variation can be classified in the following two categories:
Controlled Variation – This type of variation is consistent and stable over time and is the result of random statistical variation that is always present at least to some degree.
Uncontrolled Variation – This type of variation changes over time and is attributed to assignable causes. It is the job of management to determine the assignable causes of uncontrolled variation and reduce them as much as possible.
Variation in a measurable process can be displayed on control charts. Control charts in SPC (statistical process control) provide information about samples taken from a process. Individual samples are called subgroups. A subgroup is specific number of measurements periodically taken at a single point in a process.
An SPC control chart plots the location and dispersion of each subgroup. An SPC control chart consists of a pair of graphs; an upper graph charts the location of each successive subgroup and a lower graph charts the dispersion of each successive subgroup. Two common SPC control charts are the following:
XBar / R Charts are running records of subgroup average value (XBar) and subgroup range (R). Subgroups usually consist of 3 to 5 data measurements.
XBar / S Charts are running records of subgroup average value (XBar) and subgroup standard deviation (S). Subgroups usually consist of 8 to 10 data measurements.
A process can be deemed to be OOC (out of control) if the location or dispersion values of subgroups fall outside of defined upper or lower control limits or show patterns. The upper and lower control limits are approximately three sigma from the mean value being measured. One sigma is equal to the length of one standard deviation.
The interval between subgroups should be set so that changes between subgroups will be maximized. Subgroups should be taken over an interval that is long enough for potential variation to show up. The sampling interval between individual samples within the subgroups should be set so that variation within subgroups is minimized. Subgroup size should be held constant. Typically at least 20 subgroups are needed to judge statistical control and provide a reasonable estimate of parameters of mean and variation.
Excel Master Series Blog Directory
Click Here To See a List Of All
Statistical Topics And Articles In
This Blog
You Will Become an Excel Statistical Master!
Wow! i can imagine the dedication you gave in order to make this post. Thanks for sharing this. If you'd like i want to share this on Facebook coz this is very informative. I was just scrolling across the web looking for essay writer online and i found your amazing post. Thanks again.
ReplyDeleteExcel can be a very useful tool that you can use in different reports on a daily basis. As long as you know how to navigate it properly, no wonder students via AmericanWriters.org reviews are very keen on using this tool.
ReplyDelete