Notes on Topic 15:
Correlation: The Relationship of Two Variables

    The Pearson Correlation Coefficient

    Definition: The Pearson correlation measures the degree and direction of the linear relationship between two variables.

    By far the most common measure of correlation is the Pearson product-moment correlation coefficient.

    Notation: The Pearson correlation is denoted by the letter r.


    Formulas:
    We begin by introducing the formulas for the Pearson correlation coefficient.

    1. Conceptual Formula

      Conceptually, the Pearson Correlation is the ratio of the variation shared by X and Y to the variation of X and Y separately. The conceptual formula is:

        r = (variability of X and Y together) / (variability of X and Y separately)

      Stated in statistical terminology:

        r = (covariability of X and Y) / (variability of X and Y separately)

      When there is a perfect linear relationship, every change in the X variable is accompanied by a corresponding change in the Y variable. In this case, all variation in X is shared with Y, so the ratio given above is r = 1.00 (or r = -1.00 for a perfect negative relationship). At the other extreme, when there is no linear relationship between X and Y, the numerator is zero, so r = 0.00.
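
      To make the two extremes concrete, here is a quick check in Python (the numbers are made up for illustration; the correlation function is in the standard library from Python 3.10 on). Note that the second Y variable is strongly related to X, but not linearly, so r is still zero:

        from statistics import correlation   # Python 3.10+

        x = [1, 2, 3, 4, 5]
        print(correlation(x, [2 * v + 1 for v in x]))   # perfect linear relation: r = 1.0
        print(correlation(x, [1, 4, 5, 4, 1]))          # curvilinear relation: r = 0.0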

    2. Sum of Products of Deviations --- The measure of shared variability:

      To calculate the Pearson correlation, it is necessary to introduce one new concept: The sum of the products of corresponding deviation scores for two variables.

      A similar concept we've already seen is the sum of squares. The sum of the squared deviation scores for a variable is used to measure the amount of variability of a single variable:

        SS_X = Σ(X - M_X)²

      where M_X is the mean of X.

      The sum of products, which is used to measure the variability shared between two variables, is defined as:

        SP = Σ(X - M_X)(Y - M_Y)

      Note that the name SP is short for the sum of the products of corresponding deviation scores for two variables.

      To calculate the SP, you first determine the deviation scores for each X and for each Y, then you calculate the products of each pair of deviation scores, and then (last) you sum the products.
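
      For example, here is a minimal Python sketch of this three-step calculation (the data are made-up values, used only for illustration):

        # SP: sum of the products of corresponding deviation scores
        X = [1, 2, 4, 5]
        Y = [3, 6, 4, 7]
        mean_x = sum(X) / len(X)
        mean_y = sum(Y) / len(Y)
        # deviations -> pairwise products -> sum
        SP = sum((x - mean_x) * (y - mean_y) for x, y in zip(X, Y))
        print(SP)   # 6.0 for these values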

    3. The Algebraic Formula:

      As noted above, conceptually the Pearson correlation coefficient is the ratio of the joint covariability of X and Y to the variability of X and Y separately. The formula uses SP as the measure of covariability, and the square root of the product of the SS for X and the SS for Y as the measure of separate variability. That is:

        r = SP / √(SS_X × SS_Y)
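
      Continuing the Python sketch from above (same made-up data), the algebraic formula is only a few more lines:

        import math

        SS_x = sum((x - mean_x) ** 2 for x in X)   # separate variability of X
        SS_y = sum((y - mean_y) ** 2 for y in Y)   # separate variability of Y
        r = SP / math.sqrt(SS_x * SS_y)            # r = SP / sqrt(SS_X * SS_Y)
        print(r)   # 0.6 for these values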

    4. Z-Scores and Pearson Correlation:

      If we have scores that are expressed as standardized scores -- Z-Scores with a mean of zero and a variance of one -- then the formula for the Pearson Correlation becomes particularly simple. It is:

        r = Σ(z_X z_Y) / n

      (with the divisor n - 1 instead of n if the z-scores were computed using the sample standard deviation).
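
      A quick, self-contained Python check with the same made-up numbers as above confirms that standardizing first gives the same r; the divisor n matches the population standard deviation used to compute the z-scores:

        import statistics

        X = [1, 2, 4, 5]
        Y = [3, 6, 4, 7]
        n = len(X)
        zx = [(v - statistics.mean(X)) / statistics.pstdev(X) for v in X]
        zy = [(v - statistics.mean(Y)) / statistics.pstdev(Y) for v in Y]
        print(round(sum(a * b for a, b in zip(zx, zy)) / n, 3))   # 0.6 again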


    Understanding and Interpreting the Pearson Correlation Coefficient

    1. Correlation is NOT Causation!

      One of the most common errors made in interpreting correlations is to assume that a correlation necessarily implies a cause-and-effect relationship between the two variables. Simply stated: Correlation is NOT Causation!

    2. Correlation and Restricted Range

      When a correlation is computed from scores with a restricted range, the correlation coefficient is lower than it would be if it were computed from scores with an unrestricted range.

      This happens when we look at the correlation between SAT and GPA among students in this class, since we only see those students who were admitted to UNC. Those with low SAT scores (who presumably would have had very low GPAs) were not admitted. Thus, we have a restricted range of observed SAT scores, and a lower correlation.

      Try this ViSta Correlation Demonstration Applet for a demonstration of the effect of restriction of range on the value of the correlation.

      The applet produces graphics like those shown below:

      Automobile Weight and Horsepower
      Unrestricted Relationship
      The unrestricted relationship between Automobile weight and Horsepower is shown in the scatterplot at the right.

      The Pearson Correlation is .92.

      Restricting the Range:
      If we lived in a country that restricted cars to have no more than 100 HP, then the data would be cut off like this:

      Restricted Range:
      In a country where the maximum horsepower is 100 HP, the relationship we would see would be based only on the cars with less than 100 HP. We would see the scatterplot at the right.

      Now the correlation is only .71, rather than .92.
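
      If you can't run the applet, a small simulation in Python makes the same point (the data here are randomly generated, not the actual automobile data): correlate two related variables, then truncate one of them and correlate again.

        import random
        from statistics import correlation   # Python 3.10+

        random.seed(1)
        x = [random.gauss(0, 1) for _ in range(1000)]
        y = [v + random.gauss(0, 0.5) for v in x]   # y is x plus noise

        print(round(correlation(x, y), 2))          # full range: near .9

        pairs = [(a, b) for a, b in zip(x, y) if b < 0]   # cut off the high-y cases
        xr = [a for a, _ in pairs]
        yr = [b for _, b in pairs]
        print(round(correlation(xr, yr), 2))        # restricted range: noticeably lower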

    3. Outliers (Outriders?)

      An outlier (which G&W call, for some unknown reason, an outrider) is an individual observation that has very large values of X and Y relative to all the other values of X and Y. For example, in this scatterplot of the Market Value of many companies plotted versus their Assets, the fact that IBM is so large compared to any other company completely obscures the relationship of the two variables to each other.
      Outliers and Pearson Correlation
      The correlation for these variables is .68, which is spuriously high.

      In fact, the correlation is reduced to .48 when IBM is removed.
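
      The same effect is easy to reproduce in Python (the numbers are invented, not the actual company data): a weakly related cloud of points, plus one extreme "IBM-like" point that inflates r.

        from statistics import correlation   # Python 3.10+

        x = [1, 2, 2, 3, 3, 4, 4, 5, 5, 6]
        y = [2, 1, 4, 2, 5, 3, 6, 4, 3, 5]
        print(round(correlation(x, y), 2))                # about .51 for the cloud alone
        print(round(correlation(x + [50], y + [50]), 2))  # about .99 with the extreme point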

    4. Leverage (Influence) Points

      Some points in a scatterplot can have a large influence on the value of the correlation coefficient. These points may possibly be outliers, but not all outliers are leverage points, nor are all leverage points outliers.

      The position of the regression line is heavily influenced by observations that are extreme in their value on the X variable. The correlation coefficient is heavily influenced by observations that are extreme in their value on either variable.

      Try this ViSta Correlation Demonstration Applet for a demonstration of the effect of leverage points on the value of the correlation.

    5. Correlation and Strength of Relationship

      The Pearson correlation measures the degree of relationship between two variables. It is not, however, interpreted as a percentage. On the other hand:

      The Coefficient of Determination:
      The Coefficient of Determination, which is the squared correlation coefficient, measures the percentage of variation shared between the two variables.
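
      For example, the VerbalSAT/GPA correlation of r = .47 reported below corresponds to a coefficient of determination of r² = .47² ≈ .22: the two variables share about 22% of their variance.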


    Hypothesis Testing with the Pearson Correlation Coefficient

    The Pearson Correlation coefficient is usually computed for sample data. Often we wish to make inferences from the sample correlation to the value of the correlation in the population. We can use standard hypothesis testing techniques to make this inference.

    The basic question answered by the hypothesis testing procedure for the Pearson correlation coefficient is whether it is significantly different from zero: i.e., whether or not a non-zero correlation exists in the population.

    Here are the four standard hypothesis testing steps, as augmented by a visualization step for the data:

    1. State the hypotheses

      The hypotheses concern whether or not there exists a non-zero correlation in the population. The 2-tailed hypotheses are:

        H0: ρ = 0   (there is no correlation in the population)
        H1: ρ ≠ 0   (there is a real, non-zero correlation in the population)

      There are also two possible 1-tailed hypotheses. Here's one of them (the one we use below):

        H0: ρ ≤ 0   (the population correlation is not positive)
        H1: ρ > 0   (the population correlation is positive)

    2. Set the decision criterion

      Choose an alpha level.
      The degrees of freedom are df = n - 2, where n is the number of pairs.
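
      As an aside (this goes beyond the table-based procedure used below), the critical r can also be computed directly, using the fact that t = r·√(df/(1 - r²)) follows a t distribution with df = n - 2 when the null hypothesis is true. A Python sketch, assuming scipy is available:

        from scipy.stats import t

        def critical_r(df, alpha=0.05, tails=1):
            # invert t = r * sqrt(df / (1 - r**2)) at the critical t value
            tc = t.ppf(1 - alpha / tails, df)
            return tc / (tc ** 2 + df) ** 0.5

        print(round(critical_r(39), 3))   # close to the .261 interpolated in step 5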

    3. Gather the data

      Let's use the data gathered in class about SAT-M and GPA. We can also use the SAT-V and GPA correlation. We observe that:

      • The MathSAT/GPA correlation is .32 -- 10% of the variance in GPA is explained by MathSAT.
      • The VerbalSAT/GPA correlation is .47 -- 22% of the variance in GPA is explained by VerbalSAT.
      • Remember that these correlations have been attenuated by restriction of range.
      • If we had a larger sample, and these correlation values stayed about the same, the correlations would be even more clearly significant. However, significance isn't everything: the size of the correlation tells us how strong the relationship is.

    4. Visualize the data

      Here are the two scatterplots:

    5. Evaluate the Hypothesis

      For df=39, alpha=.05, we can interpolate to find the critical one-tailed r.

      • For df = 35 the critical r = .275.
      • For df = 40 the critical r = .257.
      • Thus, for df = 39, the critical r
        = .275 - (.275 - .257)*(4/5)
        = .275 - .014
        = .261
      Note that we don't really need to interpolate since both observed correlations (.32 and .47) are beyond the larger critical value of .275.

      Therefore, for both relationships, we reject the null hypothesis that there is not a positive correlation in the population (in plain English, we say these correlations are significantly different from zero). The two SAT scales DO significantly predict your GPAs.
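
      If you are checking these results outside ViSta, modern software reduces the whole test to one call. Here is a sketch using scipy's pearsonr, which returns both r and a p-value (two-tailed by default); the arrays are placeholders standing in for the class's 41 (SAT, GPA) pairs, not the actual data:

        from scipy.stats import pearsonr

        sat = [500, 520, 540, 560, 580, 600, 620, 640, 660, 680]
        gpa = [2.4, 2.9, 2.6, 3.1, 2.8, 3.3, 3.0, 3.2, 3.5, 3.4]
        r, p = pearsonr(sat, gpa)
        print(round(r, 2), round(p, 3))   # reject H0 when p < alpha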


    Pearson Correlation and ViSta
    ViSta can compute and report Pearson correlations (as well as Spearman, point-biserial, and phi correlations), but it does not do significance testing for the computed correlation coefficients.

    The ViSta Applet demonstrates that you can compute Pearson correlations in two ways:

    • Summarize Data You can get a report of correlations among the active numeric variables by choosing the Data menu's Summarize Data item and then clicking on the Correlation Matrix check box. The report will contain a matrix of correlations.
    • Transform You can compute a data object containing correlations by choosing the Transform menu's Correlations item. A data object containing a matrix of correlations among the active numeric variables will be created.

      The commands to do this are:

        (browse-data)
        (summarize-data :moments t :correlations t)
        (correlations)
        (browse-data)

    Scatterplots of the relationship between the two variables being correlated can be obtained by asking for a Data Visualization when only the two variables in question are selected. You do this by

    1. Clicking on the desired data icon to make it active.
    2. Opening the Var window (use the List Variables item of the Data menu).
    3. Selecting, in the Var window, the two variables you want to show in the scatterplot.
    4. Using the Data menu's Visualize Data item.

    This will give you a scatterplot like those shown above. Note that it will not have a diagonal regression line on it; the regression line is produced by the regression procedure, as discussed in the regression notes.