## Wednesday, May 28, 2014

### Pearson Correlation in 3 Steps in Excel 2010 and Excel 2013

This is one of the following four articles on correlation in Excel:

Overview of Correlation In Excel 2010 and Excel 2013

Pearson Correlation in 3 Steps in Excel 2010 and Excel 2013

Pearson Correlation – Calculating r Critical and p Value of r in Excel

Spearman Correlation in 6 Steps in Excel 2010 and Excel 2013

# Pearson Correlation Coefficient r in 3 Steps in Excel

Pearson’s Correlation Coefficient, r, is widely used as a measure of linear dependency between two variables. Pearson’s Correlation Coefficient is also referred to as Pearson’s r or Pearson’s Product Moment Correlation Coefficient.

r² is denoted as R Square and tells how well data points fit a line or curve. In simple linear regression, R Square is simply the square of the correlation coefficient between the dependent variable (the Y values) and the single independent variable (the X values). R Square represents the proportion of the total variance of the Y values that can be explained by the variance of the X values. R Square can assume values from 0 to +1.
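The relationship between r and R Square can be checked numerically outside of Excel. The following is a minimal Python sketch, not part of the original workbook: it computes r directly, then separately fits a least-squares line and computes R Square as 1 minus the ratio of residual to total sum of squares. The data values and function names are hypothetical and chosen only for illustration.

```python
import math

def pearson_r(xs, ys):
    """Sample Pearson correlation of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def r_square_from_regression(xs, ys):
    """R Square = 1 - SS_residual / SS_total for the least-squares line."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    intercept = mean_y - slope * mean_x
    ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1 - ss_res / ss_tot

# Hypothetical illustrative data, not taken from the article
xs = [1, 2, 3, 4, 5, 6]
ys = [2.1, 3.9, 6.2, 7.8, 10.1, 12.2]

r = pearson_r(xs, ys)
r_square = r_square_from_regression(xs, ys)
print("r =", r)
print("r squared =", r ** 2)
print("R Square from regression =", r_square)
```

The last two printed values should agree to within floating-point rounding, which is the point of the paragraph above.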

Pearson’s Correlation Coefficient, r, can assume values from -1 to +1.

An r value of +1 indicates that two variables have a perfect positive correlation. A perfect positive correlation means that one of the variables moves exactly the same positive amount for each unit positive change in the other variable. A scatterplot of linear data having a Pearson Correlation, r, near +1 shows points clustered tightly around an upward-sloping line.

An r value of -1 indicates that two variables have a perfect negative correlation. A perfect negative correlation means that one of the variables moves exactly the same negative amount for each unit positive change in the other variable. A scatterplot of linear data having a Pearson Correlation, r, near -1 shows points clustered tightly around a downward-sloping line.

An r value near 0 indicates very low correlation between two variables. The movements of one variable have a very low correspondence with the movements of the other variable. A scatterplot of data having a Pearson Correlation, r, near 0 shows a scattered cloud of points with no apparent upward or downward trend.
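The three cases can be illustrated with a few small data sets. This is only a sketch in Python with made-up numbers: a perfectly rising line, a perfectly falling line, and a set of Y values with no linear trend. The helper name `pearson_r` is hypothetical.

```python
import math

def pearson_r(xs, ys):
    """Sample Pearson correlation of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical X values used for all three cases
xs = [1, 2, 3, 4, 5]

r_up = pearson_r(xs, [2 * x + 1 for x in xs])     # points on a rising line
r_down = pearson_r(xs, [10 - 3 * x for x in xs])  # points on a falling line
r_flat = pearson_r(xs, [2, 1, 3, 1, 2])           # no linear trend in Y

print("rising line:  r =", round(r_up, 6))
print("falling line: r =", round(r_down, 6))
print("no trend:     r =", round(r_flat, 6))
```

Note that the slope of the line (2 versus -3 here) does not affect the size of r. Only the direction and the tightness of the fit do.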

## Pearson Correlation’s Six Required Assumptions

1) Both variables are either interval or ratio data.

2) The Pearson Correlation is most accurate when the variables are approximately normally distributed. Normality is not an absolute requirement for applying the Pearson Correlation though. The text indicates that it is, but that is incorrect. I have uploaded an Excel workbook to the Doc Sharing folder that automatically checks normality by creating a Normal Probability Plot for input data.

3) The relationship is reasonably linear. This can be seen on an X-Y scatterplot.

4) Outliers are removed or kept to a minimum. Outliers can badly skew the Pearson correlation.

5) Each variable has approximately the same variance. In statistical terms, variables with the same variance are said to be homoscedastic. Variance in data sets can be compared using the nonparametric tests Levene’s Test and the Brown-Forsythe Test. The F Test (available in Excel both as a function and as a Data Analysis tool) can also be used to compare variance in data sets but is highly sensitive to non-normality of data.

6) There is a monotonic relationship between the two variables.
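The effect of outliers mentioned in assumption 4 is easy to demonstrate. The sketch below, in Python with hypothetical numbers, computes r for a nearly linear data set and then again after one outlying pair is appended. The helper name `pearson_r` and all data values are assumptions made only for this illustration.

```python
import math

def pearson_r(xs, ys):
    """Sample Pearson correlation of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical, nearly linear data
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [1.2, 1.9, 3.1, 4.0, 5.2, 5.9, 7.1, 8.0]

r_clean = pearson_r(xs, ys)

# Same data with a single outlying pair (9, 0.5) appended
r_outlier = pearson_r(xs + [9], ys + [0.5])

print("r without outlier:", round(r_clean, 3))
print("r with one outlier:", round(r_outlier, 3))
```

A single stray pair out of nine is enough to pull r far below its original value, which is why checking a scatterplot before trusting r is worthwhile.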

Pearson’s Correlation can be applied to a population or to a sample.

## Pearson Correlation Formulas

Pearson’s Correlation when applied to a population is referred to as the Population Pearson’s Correlation Coefficient or simply the Population Correlation Coefficient. The Population Pearson Correlation Coefficient is designated by the symbol ρ (Greek letter “rho”) and is calculated as the population covariance of the two variables divided by the product of their population standard deviations:

$$\rho_{xy} = \frac{\sigma_{xy}}{\sigma_x \, \sigma_y}$$

Pearson’s Correlation when applied to a sample is referred to as the Sample Pearson’s Correlation Coefficient or simply the Sample Correlation Coefficient. The Sample Pearson Correlation Coefficient is designated by the symbol r or $r_{xy}$ and is equal to the sample covariance between two variables divided by the product of their sample standard deviations, as given by the following formula:

$$r_{xy} = \frac{s_{xy}}{s_x \, s_y}$$

$s_{xy}$ is the Sample Covariance between variables x and y and is calculated by the following formula:

$$s_{xy} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{n - 1}$$

$s_x$ is the Sample Standard Deviation of variable x and is calculated by the following formula:

$$s_x = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}}$$

$s_y$ is the Sample Standard Deviation of variable y and is calculated by the following formula:

$$s_y = \sqrt{\frac{\sum_{i=1}^{n} (y_i - \bar{y})^2}{n - 1}}$$
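To make the sample quantities concrete, here is a short worked calculation in Python. It is only a sketch: it assumes the usual n - 1 divisor for the sample covariance and sample standard deviations, and the five data pairs are hypothetical.

```python
import math

# Hypothetical sample of five X-Y pairs
xs = [2, 4, 6, 8, 10]
ys = [1, 3, 7, 9, 15]
n = len(xs)

# Sample means
mean_x = sum(xs) / n   # 6.0
mean_y = sum(ys) / n   # 7.0

# Sample covariance: sum of cross-products of deviations, divided by n - 1
s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)

# Sample standard deviations: square root of sum of squared deviations / (n - 1)
s_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs) / (n - 1))
s_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys) / (n - 1))

# Sample correlation: covariance divided by the product of the standard deviations
r_xy = s_xy / (s_x * s_y)

print("s_xy =", s_xy)
print("s_x  =", round(s_x, 4))
print("s_y  =", round(s_y, 4))
print("r_xy =", round(r_xy, 4))
```

For these numbers the cross-products of deviations sum to 68, so the sample covariance is 68 / 4 = 17, and r works out to roughly 0.98.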

## Example of Pearson Correlation in Excel

### Pearson Correlation Step 1 – Create a Scatterplot of the Data

Before calculating the Pearson Correlation between two variables, it is a good idea to create an X-Y scatterplot to determine if there appears to be a linear relationship between the two. Following is an example of creating an Excel scatterplot of a sample of X-Y data. The chart type in Excel is an X-Y scatterplot with only markers using Chart Layout 3, Style 2. A Least-Squares Line is created using Chart Layout 3.

The chart appears as follows: [Image: X-Y scatterplot of the sample data with a least-squares line]

The scatterplot chart shows a strong linear relationship between the two variables X and Y. The Pearson correlation would be the correct choice to determine the correlation between the two variables.

### Pearson Correlation Step 2 – Calculate r in Excel With One of Three Methods

The Pearson Sample Correlation Coefficient, $r_{xy}$, can be calculated using any of the three following methods in Excel:

1) Data Analysis Correlation Tool: This tool can also be used to create a correlation matrix between more than two variables. An example of this is shown later in this article.

2) Correlation Formula: The CORREL function, which has the following form:

CORREL(array1, array2)

3) Covariance Formula: The sample covariance between two variables divided by the product of their sample standard deviations, as given by the following formula:

COVARIANCE.S(array1, array2)/(STDEV.S(array1)*STDEV.S(array2))

These three methods are implemented in Excel as follows: [Image: Excel worksheet showing the three methods]
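As a cross-check outside of Excel, the agreement between the direct correlation calculation and the covariance-based calculation can be sketched in Python. The functions `correl`, `covariance_s` and `stdev_s` below are hypothetical stand-ins written to mirror what the Excel functions of similar name compute; they are not Excel's own code, and the data is made up for illustration.

```python
import math

def covariance_s(xs, ys):
    """Sample covariance (n - 1 divisor), analogous to COVARIANCE.S."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)

def stdev_s(values):
    """Sample standard deviation (n - 1 divisor), analogous to STDEV.S."""
    n = len(values)
    mean = sum(values) / n
    return math.sqrt(sum((v - mean) ** 2 for v in values) / (n - 1))

def correl(xs, ys):
    """Pearson correlation computed directly, analogous to CORREL."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    numerator = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    denominator = math.sqrt(sum((x - mean_x) ** 2 for x in xs)
                            * sum((y - mean_y) ** 2 for y in ys))
    return numerator / denominator

# Hypothetical sample data
array1 = [1, 2, 3, 4, 5, 6]
array2 = [3, 5, 4, 8, 9, 12]

r_direct = correl(array1, array2)
r_from_cov = covariance_s(array1, array2) / (stdev_s(array1) * stdev_s(array2))

print("direct correlation:      ", round(r_direct, 6))
print("covariance / (sd * sd):  ", round(r_from_cov, 6))
```

Both routes give the same r because the n - 1 divisors in the covariance and in the two standard deviations cancel out.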

### Pearson Correlation Step 3 - Determine Whether r Is Significant

After calculating the Pearson Correlation Coefficient, r, between two data sets, the significance of r should be checked. If r has been calculated based upon just a few pairs of numbers, it is difficult to determine whether this calculated correlation really exists between the two sets of numbers or if that calculated r is just a random occurrence because there are so few data pairs.

On the other hand, if r is calculated from a large number of data pairs, there is much greater certainty that the calculated correlation really does exist between the two sets of numbers.

There are two equivalent ways to determine whether or not the calculated r should be considered significant at a given α. These two methods are the following:

a) Calculate the p value and compare it to the specified α

b) Calculate r Critical and compare it to r

The blog article that follows this one will perform calculations a) and b) to determine whether r is significant.
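For readers who want a rough preview, the two checks can be sketched in Python. This is not the calculation from the follow-up article. It assumes the standard t test for a correlation coefficient, with t = r·√(n − 2)/√(1 − r²) on n − 2 degrees of freedom, and it integrates the t density numerically instead of calling a statistics library (Excel users would more likely reach for functions such as T.DIST.2T and T.INV.2T). The values of r, n and α are hypothetical.

```python
import math

def t_pdf(t, df):
    """Density of Student's t distribution with df degrees of freedom."""
    coef = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return coef * (1 + t * t / df) ** (-(df + 1) / 2)

def two_tailed_p(t, df, steps=2000):
    """Two-tailed p value: area beyond +/-|t|, using Simpson's rule on [0, |t|]."""
    t = abs(t)
    if t == 0:
        return 1.0
    h = t / steps  # steps must be even for Simpson's rule
    total = t_pdf(0.0, df) + t_pdf(t, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    area_zero_to_t = total * h / 3
    return max(0.0, 1 - 2 * area_zero_to_t)

def t_critical(alpha, df):
    """Two-tailed critical t value, found by bisection on two_tailed_p."""
    lo, hi = 0.0, 100.0
    for _ in range(60):
        mid = (lo + hi) / 2
        if two_tailed_p(mid, df) > alpha:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

# Hypothetical inputs
r = 0.75       # sample correlation
n = 10         # number of data pairs
alpha = 0.05   # significance level
df = n - 2

# a) p value of r, compared to alpha
t_stat = r * math.sqrt(df) / math.sqrt(1 - r * r)
p_value = two_tailed_p(t_stat, df)

# b) r Critical, compared to |r|
t_crit = t_critical(alpha, df)
r_crit = t_crit / math.sqrt(t_crit ** 2 + df)

print("p value    =", round(p_value, 4), "significant:", p_value < alpha)
print("r Critical =", round(r_crit, 4), "significant:", abs(r) > r_crit)
```

The two tests always reach the same verdict because r Critical is just the critical t value converted back to the r scale.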

## Performing Correlation Analysis On More Than Two Variables

As mentioned, the Data Analysis Correlation tool can be used to create a correlation matrix if there are more than two variables. An example of creating a correlation matrix between three variables is shown as follows: [Image: Excel correlation matrix for three variables]

Each r must be evaluated separately to determine if that r is significant. A correlation coefficient r is significant if its calculated p Value is less than alpha or, equivalently, if the absolute value of r is greater than r Critical. The p value and r Critical are calculated in the same way as before with the following formulas: [Images: Excel formulas for the p Value and r Critical]
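The idea of a correlation matrix can also be sketched in a few lines of Python. This is an illustration only, not what the Data Analysis tool does internally: it computes r for every pair of variables and arranges the results in a square table. The variable names X1, X2 and X3 and their values are hypothetical.

```python
import math

def pearson_r(xs, ys):
    """Sample Pearson correlation of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical data for three variables measured on the same five cases
data = {
    "X1": [1, 2, 3, 4, 5],
    "X2": [2, 4, 5, 4, 5],
    "X3": [5, 3, 4, 2, 1],
}

names = list(data)

# matrix[i][j] holds r between variable i and variable j
matrix = [[pearson_r(data[a], data[b]) for b in names] for a in names]

print("     " + "  ".join(f"{name:>6}" for name in names))
for name, row in zip(names, matrix):
    print(f"{name:<5}" + "  ".join(f"{value:6.3f}" for value in row))
```

The diagonal is always 1 because each variable is perfectly correlated with itself, and the matrix is symmetric, so only the entries below (or above) the diagonal need to be tested for significance.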
