Monday, April 6, 2015

Measures of Variation in Excel

Measures of Variation in

Excel

Measures of Variation Overview

Measures of variation describe the degree of spread in a random variable. Measures of variation are sometimes referred to as measures of dispersion. Measures of variation are properties describing probability distributions, populations, or samples taken from a population. Quite often the same measures of variation are calculated differently for populations and samples taken from those population.

The following measures of variation can be calculated in Excel and will be discussed here:

σ2 = Population variance – A measure of the degree of dispersion of values of all of the points contained within a population. σ2 equals the sum of the squared differences between the sample mean and each data value divided by N. Variances are additive but are not expressed in the same measure as the data values being described. The square root must be taken to get back to the original units.

s2 = Sample variance – A measure of the degree of dispersion of values of all of the points contained within a sample. s2 equals the sum of the squared differences between the sample mean and each data value divided by (n – 1).

σ = Population standard deviation – The square root of population variance. Standard deviations are not additive but are expressed in the same units as the data values being described.

s = Sample standard deviation – The square root of sample variance.

SS = Sum of squares - Because variances are additive, the sum of the squares of differences between a mean and data values is used to describe the total variance within specific categories. ANOVA and linear regression both uses the sum of the squares in their calculations.

MAD – Mean absolute deviation and Median absolute value – Mean absolute deviation equals the average distance (absolute distance) between data values and the mean. Median absolute value, which is the average distance between data values and the median, is more robust (less sensitive to outliers) than mean absolute deviation.

R = Range – Range describes the difference between the largest and smallest values of a data set. Range is sensitive to outliers. Range is used in SPC (Statistical Process Control) to describe the spread within each sample when sample size is small (typically 2 – 9). Standard deviation is used in SPC to describe spread within each sample when sample size become larger.

Interquartile range – Interquartile range is the spread or width of the middle two quartiles of the four quartiles that all data of a data set have been divided into. Quartiles divide rank-ordered data into four equal parts, i.e., quartiles Q1, Q2, Q3, and Q4 which contains the highest values. The box of a boxplot contains the middle two quartiles of data while the whiskers contain the outer quartiles of data. Interquartile range is very helpful in finding outliers.

Excel also has excellent tools to visually display the degree of variation existing in a data set. The Excel tools that will be described in this blog article that visually display variation of data are the following:

  1. Histogram

  2. Box plot

  3. Scatter plot

Population Variance

Population variance is a measure of the degree of dispersion of values of all of the points contained within a population. σ2 (sigma squared) equals the sum of the squared differences between the sample mean and each data value divided by N, the total number of data values in a population. µ (mu) is the population mean. The formula for calculating population variance is the following:

Variation in Excel - Population Variance
(Click On Image To See a Larger Version)

Population variance is calculated in Excel 2010 and later with the following formula:

σ2 = VAR.P(data range)

Excel 2007 and earlier used the following formula, which also works in later Excel versions:

σ2 = VARP(data range)

Note that Greek letters are often used to denote population parameters while letters from the English alphabet are often used to denote sample statistics.

Variances are additive but are not expressed in the same measure as the data values being described. The square root must be taken to get back to the original units. The square root of the variance is the standard deviation. Standard deviations are not additive but they are conveniently expressed in the same units as the original data.

It is important to note that population variance is calculated by dividing the sum of the squared by N, the total number of data points in the population. Sample variance, on the other hand, is calculated by dividing the sum of the squares by (n – 1) instead of n, the number of data points in the sample. This will be discussed in detail shortly.

 

Sample Variance

Sample variance is a measure of the degree of dispersion of values of all of the points contained within a sample. s2 equals the sum of the squared differences between the sample mean and each data value divided by (n-1), the total number of data values in sample minus 1. Xavg is the sample mean. The formula for calculating population variance is the following:

Variation in Excel - Sample Standard Deviation
(Click On Image To See a Larger Version)

Sample variance is calculated in Excel 2010 and later with the following formula:

s2 = VAR.S(data range)

Excel 2007 and earlier used the following formula, which also works in later Excel versions:

s2 = VAR(data range)

 

Unbiased Sample Variance

Dividing the sum of the squares from sample data by n instead of (n-1) would produce a biased estimate of the sample variation. The term n-1, known as Bessel’s correction, replaces n as the divisor to remove the bias.

An intuitive explanation for the use of Bessel’s correction when calculating sample variance or sample standard deviation is as follows:

It is usually the case that when estimating the population standard deviation, σ, from a sample standard deviation, s, the population mean, µ, is also unknown. The sample standard deviation is therefore calculated using the sample mean, xavg, in place of the population mean, µ. This substitution creates a slight bias because the sample’s data points are likely to be closer to xavg than to µ. The sum of the squares calculated from (xi – xavg)2 will be smaller than (xi - µ)2. This is compensated for by using Bessel’s correction (n-1) as the divisor instead of n when calculating sample standard deviation or sample variance.

Note that if the population mean, µ, is known and used instead of sample mean, xavg, during the calculation of sample standard deviation or sample variance, then Bessel’s correction should not be used; the divisor should be n instead of n-1. This is usually not the case however.

Bessel’s correction is a consistent estimator because it converges in probability to the population value as the sample size goes to infinity. As sample size increases, the need for the use of Bessel’s correction decreases. When n exceeds 75, the difference between the biased sample standard deviation (using n as the divisor instead of n-1) and the population standard deviation is generally less than 1 percent. The use of uncorrected sample standard deviation for very large samples is generally acceptable.

 

Derivation of Bessel’s Correction

Bessel’s Correction can be derived in a number of ways. Here is one method that is straightforward and intuitive.

Variation in Excel - Bessel's Correction Derivation
(Click On Image To See a Larger Version)

Variation in Excel - Bessel's Correction Derivation
(Click On Image To See a Larger Version)

Variation in Excel - Bessel's Correction Derivation
(Click On Image To See a Larger Version)

Variation in Excel - Bessel's Correction Derivation

Variation in Excel - Bessel's Correction Derivation
(Click On Image To See a Larger Version)

Variation in Excel - Bessel's Correction Derivation
(Click On Image To See a Larger Version)

Variation in Excel - Bessel's Correction Derivation

Variation in Excel - Bessel's Correction Derivation

Variation in Excel - Bessel's Correction Derivation

Variation in Excel - Bessel's Correction Derivation

Variation in Excel - Bessel's Correction Derivation

Variation in Excel - Bessel's Correction Derivation

Variation in Excel - Bessel's Correction Derivation
(Click On Image To See a Larger Version)

Variation in Excel - Bessel's Correction Derivation
(Click On Image To See a Larger Version)

 

Population Standard Deviation

Population standard deviation, σ (sigma), is the square root of population variance, σ2, and is calculated using the following formula:

Variation in Excel - Population Standard Deviation

µ (mu) = Population mean

N = total number of data points in the population

Population standard deviation is calculated in Excel 2010 and later with the following formula:

σ = STDEV.P(data range)

Excel 2007 and earlier used the following formula, which also works in later Excel versions:

σ = STDEVP(data range)

Note that Greek letters are often used to denote population parameters while letters from the English alphabet are often used to denote sample statistics.

Variances are additive but are not expressed in the same measure as the data values being described. The square root must be taken to get back to the original units. The square root of the variance is the standard deviation. Standard deviations are not additive but they are conveniently expressed in the same units as the original data.

It is important to note that population standard deviation is calculated by dividing the sum of the squared by N, the total number of data points in the population. Sample standard deviation, on the other hand, is calculated by dividing the sum of the squares by (n – 1) instead of n, the number of data points in the sample.

 

Sample Standard Deviation

Sample standard deviation, s, is the square root of sample variance, s2, and is calculated using the following formula:

Variation in Excel - Sample Standard Deviation
(Click On Image To See a Larger Version)

xavg = Population mean

n = total number of data points in the sample

Sample standard deviation is calculated in Excel 2010 and later with the following formula:

s = STDEV.S(data range)

Excel 2007 and earlier used the following formula, which also works in later Excel versions:

s = STDEV(data range)

 

Descriptive Statistics in Excel

Excel provides a tool in the Data Analysis toolpak called Descriptive Statistics that calculates a data column’s sample standard deviation and sample variance along with sample statistics. The Data Analysis toolpak is an Excel add-in that is available in most versions of Excel. This add-in is initially inactive and must be activated by the user before the first use. After this add-in has been activated, Descriptive Statistics can be accessed as follows:

Data tab / Data Analysis / Descriptive Statistics

The Descriptive Statistics dialogue box then appears. The data to which Descriptive Statistics will be applied needs to be arranged in a single column. An example of a column of sample data and the completed Descriptive Statistics dialogue box is shown as follows:

Variation in Excel - Descriptive Statistics - Variance
(Click On Image To See a Larger Version)

Clicking OK will produce the following output. The Excel formulas needed to calculate the same results are shown as well.

Variation in Excel - Descriptive Statistics - Variance
(Click On Image To See a Larger Version)

Note that the calculations of standard deviation and variance are done for samples using Bessel’s correction n-1 as the divisor of the sum of squares and not n. Descriptive Statistics is intended to provide sample statistics and not population parameters.

Sum of Squares

SS = Sum of squares - Because variances are additive, the sum of the squares of differences between a mean and data values is used to describe the total variance within specific categories. The Excel formula to calculate the sum of the squares of deviations of sample points from the sample mean is as follows:

DEVSQ(data range)

This formula is the same for all versions of Excel.

SS = DEVSQ(data range) = ∑ (xi – xavg)

ANOVA uses the sum of the squares to calculate the within-sample variance, SSwithin-groups, and the between-sample variance, SSbetween-groups. The F Value for each F test in ANOVA is calculated as follows:

F Value = [ SSbetween-groups / dfbetween-groups ] / [SSwithin-groups / dfwithin-groups ]

The p Value for each F test in ANOVA is calculated in Excel as follows:

p Value = F.DIST.RT( F Value, dfbetween-groups , dfwithin-groups )

If the calculated p Value is smaller than the specified alpha, the factor that is being evaluated in the F test is deemed to be significant, i.e., has an effect on the data values.

Linear regression uses the sum of the squares to calculate total variation (SST), explained variation (SSR), and unexplained variation (SSE). R Square of a linear regression is calculated as follows:

R Square = (Explained variation) / (Total variation) = SSR / SST

 

MAD

MAD can refer to either Mean Absolute Deviation (sometimes called Average Absolute Deviation) or Median Absolute Deviation. It is necessary to clarify which of the two is being calculated when deriving MAD for a set of data.

 

Mean Absolute Deviation

Mean Absolute Deviation is calculated by the following formula:

Mean Absolute Deviation = 1/n * ∑ (xi – xavg)

Mean Absolute Deviation can be calculated in all versions of Excel with the following Excel formula:

AVEDEV(data range)

Following is an example of the calculation of Mean Absolute Deviation using both the formula and a more detailed showing each step of the calculation:

Variation in Excel - Mean Absolute Deviation

 

Median Absolute Deviation

Median Absolute Deviation relies on the median and is therefore more robust than Mean Absolute Deviation. Outliers have a much smaller effect on the median than on the mean.

Median Absolute Deviation is calculated by the following formula:

Mean Absolute Deviation = 1/n * ∑ (xi – Median)

Excel does not have a single formula to calculate Median Absolute Deviation as there is for Mean Absolute Deviation. Each individual step of the calculation must be performed as follows:

Variation in Excel - Median Absolute Deviation

Standard Deviation of a Continuous Distribution

This article has so far discussed the calculation of variation of discrete sets of data. The standard deviation of a population of data distributed according to probability function p(x) can be stated in general terms as follows:

Variation in Excel - Standard Deviation of a Continuous Variable
(Click On Image To See a Larger Version)

Many of the individual statistical distributions use specific, unique formulas to calculate standard deviation. For example, the variance of data that are distributed according to the binomial distribution is calculated by the formula np(1-p).

 

Effect of Changing Units

Unit can be changed by either adding a constant to each data measurement or by multiplying each data measurement by a constant. These two methods of altering the units of measuring have different effects on the measured variation for the data set as follows:

Adding a Constant to Each Data Value

Adding a constant to each data value does not change the distance between data values. All measures of variability therefore remain unchanged.

Multiplying Each Data Value By a Constant

Multiplying each data value by a constant multiplies range and standard deviation by the value of the constant. The variance is multiplied by the square of the constant.

 

Range

Range of a data set simply equals the highest data value minus the lowest data value. Excel does not have a formula that specifically calculates the range of a data set. This must be implemented in Excel as follows:

Range = MAX(data range) – MIN(data range)

Range is sensitive only to changes in the outermost data values but not changes to any other data values.

 

Interquartile Range

Quartiles divide rank-ordered data into four equal parts, i.e., quartiles Q1, Q2, Q3, and Q4 which contains the highest values. Each quartile represent a range into which one quarter of the data points fall. The upper and lower endpoints of each quartile are not data points of the data set but are the borders of the interval that contains one quarter of the data points.

The interquartile range is the width of the middle two quartiles combined. This is the distance between the upper boundary of Q2 and the lower boundary of Q3.

The boundaries between the quartiles are calculated in Excel using the QUARTILE.EXC() as shown in the following image. The highest and lowest data points within each quartile are calculated in Excel using MIN(), MAX(), INDEX, and MATCH() as shown in the following:

Variation in Excel - Quartiles
(Click On Image To See a Larger Version)

Here is are close-up images of these Excel commands.

Variation in Excel - Quartiles Close-up
(Click On Image To See a Larger Version)

Variation in Excel - Quartiles Close-up

The Interquartile range is calculated in Excel 2010 and later with the following:

Interquartile range = QUARTILE.EXC(data range,3) – QUARTILE.EXC(data range,1)

In Excel 2007 and prior, the Interquartile range is calculated using QUARTILE(). In Excel 2010 QUARTILE() is replaced by QUARTILE.INC(). This produces a slightly different result than the other Excel 2010 function QUARTILE.EXC().

In Excel 2007 the Interquartile range is calculated as follows:

Interquartile range = QUARTILE(data range,3) – QUARTILE(data range,1)

It should be noted that the median of a data set represent the boundary between the 2nd and 3rd quartiles.

The Excel algorithms for the 3 quartile functions – QUARTILE(), QUARTILE.INC(), and QUARTILE.EXC() – are shown as follows:

QUARTILE(), QUARTILE.INC() algorithms

QUARTILE.EXC() algorithm

 

Visual Presentation of Variation in Excel

Two tools in Excel that provide a visual representation of the spread of one-dimensional data set are the histogram and the box plot. The spread of two-dimensional data (X-Y data) can be visually presented using a scatterplot graph in Excel.

 

Histogram in Excel

A histogram is a bar chart with each bar showing the frequency of occurrence of data values within a specified range. The ranges are called bins. Creating a histogram in Excel requires specifying the upper limits of each bin as follows:

Variation in Excel - Histogram Raw Data

The Excel histogram is one of the tools of the Data Analysis toolpak. The Analysis toolpak is an Excel add-in that is initially inactive in Excel. The user must activate this add-in prior to its first use. The Excel histogram tool can be accessed as follows:

Data tab / Data Analysis / Histogram

Doing so brings up the Histogram dialogue box which should be completed as follows:

Variation in Excel - Histogram Dialogue Box
(Click On Image To See a Larger Version)

Clicking OK produces the following Excel histogram.

Variation in Excel - Histogram
(Click On Image To See a Larger Version)

The following two articles in this blog provide detailed descriptions on the following histogram topics:

  1. Creating a histogram in Excel using the histogram tool:

http://blog.excelmasterseries.com/2014/05/how-to- create-histogram-in-excel-2010_27.html

  1. Creating an automatically updating histogram in 7 steps in Excel using formulas and a bar chart:

http://blog.excelmasterseries.com/2014/05/how-to-create- histogram-in-excel-2010.html
(Click On Image To See a Larger Version)

 

Frequency Tables

The Excel Histogram tool produces a frequency table that is part of its output. The frequency table displays a count of the number of data points that fall into each bin. This is the basis of the histogram chart shown as follows:

Variation in Excel - Histogram

A stand-alone frequency table can also be produced in Excel using the FREQUENCY() formula. The FREQUENCY() formula is an array formula that is implemented as follows.

 

Step 1 – Highlight the Cells That Will Contain the Output

The raw data should be sorted in a column and the upper limits of each bin should also be listed in a column as shown in the following image.

The FREQUENCY() formula produces a frequency table in a separate data column that contains 1 more cell than the number of bins. The upper limits of six bins are listed in this example so there are six total bins. The FREQUENCY() formula will therefore require a column of seven cells to hold its output.

The FREQUENCY() is an array formula. The first step in implementing an array formula in Excel is to highlight (select) the cells that will hold the output. Click and drag the mouse to highlight seven empty cells in a column where the output should go.

Variance_Frequency_1_600

 

Step 2 – Type in the Formula

While the seven cells remain highlighted, type in the FREQUENCY() formula as shown.

Variation in Excel - Frequency
(Click On Image To See a Larger Version)

 

Step 3 – Type CTRL – SHIFT – ENTER at the Same Time

Doing so will produce the frequency table in cells H4:H10 that have been highlighted.

Variation in Excel - Frequency
(Click On Image To See a Larger Version)

 

Box Plots

Box plots visually display the spread of data in each of the four quartiles. The interquartile range, Q2 and Q3, are contained the two boxes. The outer quartiles, Q1 and Q4, are displayed as whiskers above and below the boxes. Q4 is displayed as the upper whiskers and Q1 is displayed as the lower whisker. For this reason boxplots are sometimes called Box-and-Whisker plots.

The boundary between the upper box (Q3) and the lower box (Q2) is the median. Box plot diagrams can also be constructed in Excel to display the data’s mean as well. An example of a box plot in Excel and the data from which it was derived is shown as follows:

Variation in Excel - Box Plot
(Click On Image To See a Larger Version)

The following article in this blog provides step-by-step instructions on how to produce the preceding box plot in 8 steps in Excel:

http://blog.excelmasterseries.com/2015/02/box-plots-in-8-steps- in-excel.html

 

Scatter Plot in Excel

A scatter plot is a diagram using Cartesian coordinates to display the X-Y values of a two-dimensional data set. The Y variable displayed on the vertical axis is often the dependent or response variable while the X variable displayed on the horizontal axis is related to that response variable. The scatter plot is also referred to also referred to as a scatter graph, scatter diagram, or scatter chart.

The following Excel scatter plot displays data points which contain a dependent variable (number of parts produced) as the X value and the residual associated with each dependent variable as the Y value. This scatter plot indicates that the residuals have slightly increasing variance as the dependent variable increases in value. This is indicated by the slight fanning-out of the Y values (the residual values) as the X variable (number of parts produced) increases. This scatter plot was used as part of a single-variable linear regression to determine whether one of the required conditions of linear regression, namely that the residual have constant variance, is met. The slight increase in variance of residuals as the dependent variable increases is not significant enough to invalidate the regression that was performed.

Variation in Excel - Scatter Plot
(Click On Image To See a Larger Version)

 

Variation in Statistical Process Control

Every measurable process contains variation. This variation can be classified in the following two categories:

Controlled Variation – This type of variation is consistent and stable over time and is the result of random statistical variation that is always present at least to some degree.

Uncontrolled Variation – This type of variation changes over time and is attributed to assignable causes. It is the job of management to determine the assignable causes of uncontrolled variation and reduce them as much as possible.

Variation in a measurable process can be displayed on control charts. Control charts in SPC (statistical process control) provide information about samples taken from a process. Individual samples are called subgroups. A subgroup is specific number of measurements periodically taken at a single point in a process.

An SPC control chart plots the location and dispersion of each subgroup. An SPC control chart consists of a pair of graphs; an upper graph charts the location of each successive subgroup and a lower graph charts the dispersion of each successive subgroup. Two common SPC control charts are the following:

X-Bar / R Charts are running records of subgroup average value (X-Bar) and subgroup range (R). Subgroups usually consist of 3 to 5 data measurements.

X-Bar / S Charts are running records of subgroup average value (X-Bar) and subgroup standard deviation (S). Subgroups usually consist of 8 to 10 data measurements.

A process can be deemed to be OOC (out of control) if the location or dispersion values of subgroups fall outside of defined upper or lower control limits or show patterns. The upper and lower control limits are approximately three sigma from the mean value being measured. One sigma is equal to the length of one standard deviation.

The interval between subgroups should be set so that changes between subgroups will be maximized. Subgroups should be taken over an interval that is long enough for potential variation to show up. The sampling interval between individual samples within the subgroups should be set so that variation within subgroups is minimized. Subgroup size should be held constant. Typically at least 20 subgroups are needed to judge statistical control and provide a reasonable estimate of parameters of mean and variation.

 

Excel Master Series Blog Directory

 

Click Here To See a List Of All

Statistical Topics And Articles In

This Blog

 

You Will Become an Excel Statistical Master!

4 comments:

  1. Wow! i can imagine the dedication you gave in order to make this post. Thanks for sharing this. If you'd like i want to share this on Facebook coz this is very informative. I was just scrolling across the web looking for essay writer online and i found your amazing post. Thanks again.

    ReplyDelete
  2. Excel can be a very useful tool that you can use in different reports on a daily basis. As long as you know how to navigate it properly, no wonder students via American-Writers.org reviews are very keen on using this tool.

    ReplyDelete
  3. It's clear that in life of every student was a moment when he asks himself “Is there anyone who can help me with my homework?”. Some of them can ignore fact that there are quality sources which can help with homeworks and continue to fight with tons of paper and wasting valuable time. Meanwhile others use sources like this https://rocketpaper.net/research-paper-for-students

    ReplyDelete
  4. When you google “paraphrase for me”, you get surprisingly few trustworthy companies. Some offer a limited range of services, while others openly use plagiarism rewriter software instead of professional writers. Here you get top-notch our plagiarism rewriter with your essays, research and term papers. The writers easily deal with all types of assignments. Even a dissertation can be professionally rewritten without a trace of its original structure and language.

    ReplyDelete