1.9 - Hypothesis Test for the Population Correlation Coefficient

There is one more point we haven't stressed yet in our discussion about the correlation coefficient r and the coefficient of determination \(R^{2}\): namely, the two measures summarize the strength of a linear relationship in samples only. If we obtained a different sample, we would obtain different correlations, different \(R^{2}\) values, and therefore potentially different conclusions. As always, we want to draw conclusions about populations, not just samples. To do so, we either have to conduct a hypothesis test or calculate a confidence interval. In this section, we learn how to conduct a hypothesis test for the population correlation coefficient \(\rho\) (the Greek letter "rho").

In general, a researcher should use the hypothesis test for the population correlation \(\rho\) to learn of a linear association between two variables, when it isn't obvious which variable should be regarded as the response. Let's clarify this point with examples of two different research questions.

Consider evaluating whether or not a linear relationship exists between skin cancer mortality and latitude. We will see in Lesson 2 that we can perform either of the following tests:

  • t -test for testing \(H_{0} \colon \beta_{1}= 0\)
  • ANOVA F -test for testing \(H_{0} \colon \beta_{1}= 0\)

For this example, it is fairly obvious that latitude should be treated as the predictor variable and skin cancer mortality as the response.

By contrast, suppose we want to evaluate whether or not a linear relationship exists between a husband's age and his wife's age ( Husband and Wife data ). In this case, one could treat the husband's age as the response:

husband's age vs wife's age plot

...or one could treat the wife's age as the response:

wife's age vs husband's age plot

In cases such as these, we answer our research question concerning the existence of a linear relationship by using the t -test for testing the population correlation coefficient \(H_{0}\colon \rho = 0\).

Let's jump right to it! We follow standard hypothesis test procedures in conducting a hypothesis test for the population correlation coefficient \(\rho\).

Steps for Hypothesis Testing for \(\boldsymbol{\rho}\)

Step 1: Hypotheses

First, we specify the null and alternative hypotheses:

  • Null hypothesis \(H_{0} \colon \rho = 0\)
  • Alternative hypothesis \(H_{A} \colon \rho ≠ 0\) or \(H_{A} \colon \rho < 0\) or \(H_{A} \colon \rho > 0\)

Step 2: Test Statistic

Second, we calculate the value of the test statistic using the following formula:

Test statistic:  \(t^*=\dfrac{r\sqrt{n-2}}{\sqrt{1-R^2}}\) 

Step 3: P-Value

Third, we use the resulting test statistic to calculate the P -value. As always, the P -value is the answer to the question "how likely is it that we’d get a test statistic t* as extreme as we did if the null hypothesis were true?" The P -value is determined by referring to a t- distribution with n -2 degrees of freedom.

Step 4: Decision

Finally, we make a decision:

  • If the P -value is smaller than the significance level \(\alpha\), we reject the null hypothesis in favor of the alternative. We conclude that "there is sufficient evidence at the \(\alpha\) level to conclude that there is a linear relationship in the population between the predictor x and response y."
  • If the P -value is larger than the significance level \(\alpha\), we fail to reject the null hypothesis. We conclude "there is not enough evidence at the \(\alpha\) level to conclude that there is a linear relationship in the population between the predictor x and response y."

Example 1-5: Husband and Wife Data

Let's perform the hypothesis test on the husband's age and wife's age data in which the sample correlation based on n = 170 couples is r = 0.939. To test \(H_{0} \colon \rho = 0\) against the alternative \(H_{A} \colon \rho ≠ 0\), we obtain the following test statistic:

\begin{align} t^*&=\dfrac{r\sqrt{n-2}}{\sqrt{1-R^2}}\\ &=\dfrac{0.939\sqrt{170-2}}{\sqrt{1-0.939^2}}\\ &=35.39\end{align}

To obtain the P -value, we need to compare the test statistic to a t -distribution with 168 degrees of freedom (since 170 - 2 = 168). In particular, we need to find the probability that we'd observe a test statistic more extreme than 35.39, and then, since we're conducting a two-sided test, multiply the probability by 2. Minitab helps us out here:

Student's t distribution with 168 DF

The output tells us that the probability of getting a test statistic smaller than 35.39 is greater than 0.999. Therefore, the probability of getting a test statistic greater than 35.39 is less than 0.001. Because the test is two-sided, we multiply by 2 and determine that the P-value is less than 0.002.
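The same calculation can be reproduced outside Minitab. Below is a minimal Python/SciPy sketch (not part of the original lesson) that plugs the quoted values r = 0.939 and n = 170 into the test-statistic formula and finds the two-sided P-value:

    # Minimal sketch: t* and two-sided P-value for the husband/wife example
    import math
    from scipy import stats

    r, n = 0.939, 170
    t_star = r * math.sqrt(n - 2) / math.sqrt(1 - r**2)   # about 35.39
    p_value = 2 * stats.t.sf(abs(t_star), df=n - 2)       # far below 0.002
    print(round(t_star, 2), p_value)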

Since the P -value is small — smaller than 0.05, say — we can reject the null hypothesis. There is sufficient statistical evidence at the \(\alpha = 0.05\) level to conclude that there is a significant linear relationship between a husband's age and his wife's age.

Incidentally, we can let statistical software like Minitab do all of the dirty work for us. In doing so, Minitab reports:

Correlation: WAge, HAge

Pearson correlation of WAge and HAge = 0.939

P-Value = 0.000

Final Note

One final note: as always, we should clarify when it is okay to use the t-test for testing \(H_{0} \colon \rho = 0\). The guidelines are a straightforward extension of the "LINE" assumptions made for the simple linear regression model. It's okay:

  • When it is not obvious which variable is the response.
  • For each x , the y 's are normal with equal variances.
  • For each y , the x 's are normal with equal variances.
  • Either, y can be considered a linear function of x .
  • Or, x can be considered a linear function of y .
  • The ( x , y ) pairs are independent

12.4 Testing the Significance of the Correlation Coefficient

The correlation coefficient, r , tells us about the strength and direction of the linear relationship between x and y . However, the reliability of the linear model also depends on how many observed data points are in the sample. We need to look at both the value of the correlation coefficient r and the sample size n , together.

We perform a hypothesis test of the "significance of the correlation coefficient" to decide whether the linear relationship in the sample data is strong enough to use to model the relationship in the population.

The sample data are used to compute r , the correlation coefficient for the sample. If we had data for the entire population, we could find the population correlation coefficient. But because we have only sample data, we cannot calculate the population correlation coefficient. The sample correlation coefficient, r , is our estimate of the unknown population correlation coefficient.

  • The symbol for the population correlation coefficient is ρ , the Greek letter "rho."
  • ρ = population correlation coefficient (unknown)
  • r = sample correlation coefficient (known; calculated from sample data)

The hypothesis test lets us decide whether the value of the population correlation coefficient ρ is "close to zero" or "significantly different from zero". We decide this based on the sample correlation coefficient r and the sample size n .

If the test concludes that the correlation coefficient is significantly different from zero, we say that the correlation coefficient is "significant."

  • Conclusion: There is sufficient evidence to conclude that there is a significant linear relationship between x and y because the correlation coefficient is significantly different from zero.
  • What the conclusion means: There is a significant linear relationship between x and y . We can use the regression line to model the linear relationship between x and y in the population.

If the test concludes that the correlation coefficient is not significantly different from zero (it is close to zero), we say that the correlation coefficient is "not significant".

  • Conclusion: "There is insufficient evidence to conclude that there is a significant linear relationship between x and y because the correlation coefficient is not significantly different from zero."
  • What the conclusion means: There is not a significant linear relationship between x and y . Therefore, we CANNOT use the regression line to model a linear relationship between x and y in the population.
  • If r is significant and the scatter plot shows a linear trend, the line can be used to predict the value of y for values of x that are within the domain of observed x values.
  • If r is not significant OR if the scatter plot does not show a linear trend, the line should not be used for prediction.
  • If r is significant and if the scatter plot shows a linear trend, the line may NOT be appropriate or reliable for prediction OUTSIDE the domain of observed x values in the data.

PERFORMING THE HYPOTHESIS TEST

  • Null Hypothesis: H 0 : ρ = 0
  • Alternate Hypothesis: H a : ρ ≠ 0

WHAT THE HYPOTHESES MEAN IN WORDS:

  • Null Hypothesis H 0 : The population correlation coefficient IS NOT significantly different from zero. There IS NOT a significant linear relationship (correlation) between x and y in the population.
  • Alternate Hypothesis H a : The population correlation coefficient IS significantly DIFFERENT FROM zero. There IS A SIGNIFICANT LINEAR RELATIONSHIP (correlation) between x and y in the population.

DRAWING A CONCLUSION: There are two methods of making the decision. The two methods are equivalent and give the same result.

  • Method 1: Using the p -value
  • Method 2: Using a table of critical values

In this chapter of this textbook, we will always use a significance level of 5%, α = 0.05

Using the p -value method, you could choose any appropriate significance level you want; you are not limited to using α = 0.05. But the table of critical values provided in this textbook assumes that we are using a significance level of 5%, α = 0.05. (If we wanted to use a different significance level than 5% with the critical value method, we would need different tables of critical values that are not provided in this textbook.)

METHOD 1: Using a p -value to make a decision

Using the TI-83, 83+, 84, 84+ calculator.

To calculate the p -value using LinRegTTEST: On the LinRegTTEST input screen, on the line prompt for β or ρ , highlight " ≠ 0 " The output screen shows the p-value on the line that reads "p =". (Most computer statistical software can calculate the p -value.)

  • If the p -value is less than the significance level (α = 0.05):
  • Decision: Reject the null hypothesis.
  • Conclusion: "There is sufficient evidence to conclude that there is a significant linear relationship between x and y because the correlation coefficient is significantly different from zero."
  • If the p -value is NOT less than the significance level (α = 0.05):
  • Decision: DO NOT REJECT the null hypothesis.
  • Conclusion: "There is insufficient evidence to conclude that there is a significant linear relationship between x and y because the correlation coefficient is NOT significantly different from zero."
  • You will use technology to calculate the p -value. The following describes the calculations to compute the test statistic and the p -value:
  • The p -value is calculated using a t -distribution with n - 2 degrees of freedom.
  • The formula for the test statistic is \(t = \dfrac{r\sqrt{n-2}}{\sqrt{1-r^2}}\). The value of the test statistic, t, is shown in the computer or calculator output along with the p -value. The test statistic t has the same sign as the correlation coefficient r.
  • The p -value is the combined area in both tails.

An alternative way to calculate the p -value (p) given by LinRegTTest is the command 2*tcdf(abs(t),10^99, n-2) in 2nd DISTR.

THIRD-EXAM vs FINAL-EXAM EXAMPLE: p-value method

  • Consider the third exam/final exam example.
  • The line of best fit is: ŷ = -173.51 + 4.83 x with r = 0.6631 and there are n = 11 data points.
  • Can the regression line be used for prediction? Given a third exam score ( x value), can we use the line to predict the final exam score (predicted y value)?
  • H 0 : ρ = 0
  • H a : ρ ≠ 0
  • The p -value is 0.026 (from LinRegTTest on your calculator or from computer software).
  • The p -value, 0.026, is less than the significance level of α = 0.05.
  • Decision: Reject the Null Hypothesis H 0
  • Conclusion: There is sufficient evidence to conclude that there is a significant linear relationship between the third exam score ( x ) and the final exam score ( y ) because the correlation coefficient is significantly different from zero.

Because r is significant and the scatter plot shows a linear trend, the regression line can be used to predict final exam scores.
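As a quick cross-check of the quoted p-value, here is a small Python/SciPy sketch (an illustration, not from the textbook) that applies the test-statistic formula and the tcdf-style two-tailed calculation to this example's values:

    # Minimal sketch: third exam/final exam example (r = 0.6631, n = 11)
    import math
    from scipy import stats

    r, n = 0.6631, 11
    t = r * math.sqrt(n - 2) / math.sqrt(1 - r**2)
    p = 2 * stats.t.sf(abs(t), df=n - 2)    # combined area in both tails
    print(round(t, 3), round(p, 3))          # p is about 0.026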

METHOD 2: Using a table of Critical Values to make a decision

The 95% Critical Values of the Sample Correlation Coefficient Table can be used to give you a good idea of whether the computed value of r is significant or not. Compare r to the appropriate critical value in the table. If r is not between the positive and negative critical values, then the correlation coefficient is significant. If r is significant, then you may want to use the line for prediction.

Example 12.7

Suppose you computed r = 0.801 using n = 10 data points. df = n - 2 = 10 - 2 = 8. The critical values associated with df = 8 are -0.632 and + 0.632. If r < negative critical value or r > positive critical value, then r is significant. Since r = 0.801 and 0.801 > 0.632, r is significant and the line may be used for prediction. If you view this example on a number line, it will help you.
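The table's critical values can also be reproduced computationally. The following Python sketch assumes the 95% critical values come from the two-sided t test at α = 0.05 (this assumption reproduces the values quoted in these examples) and checks Example 12.7:

    # Sketch: critical value of r for a two-sided test at alpha = 0.05
    import math
    from scipy import stats

    def r_critical(n, alpha=0.05):
        df = n - 2
        t_crit = stats.t.ppf(1 - alpha / 2, df)
        return t_crit / math.sqrt(t_crit**2 + df)

    print(round(r_critical(10), 3))        # about 0.632 for df = 8
    print(abs(0.801) > r_critical(10))     # True, so r = 0.801 is significant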

Try It 12.7

For a given line of best fit, you computed that r = 0.6501 using n = 12 data points and the critical value is 0.576. Can the line be used for prediction? Why or why not?

Example 12.8

Suppose you computed r = –0.624 with 14 data points. df = 14 – 2 = 12. The critical values are –0.532 and 0.532. Since –0.624 < –0.532, r is significant and the line can be used for prediction.

Try It 12.8

For a given line of best fit, you compute that r = 0.5204 using n = 9 data points, and the critical value is 0.666. Can the line be used for prediction? Why or why not?

Example 12.9

Suppose you computed r = 0.776 and n = 6. df = 6 – 2 = 4. The critical values are –0.811 and 0.811. Since –0.811 < 0.776 < 0.811, r is not significant, and the line should not be used for prediction.

Try It 12.9

For a given line of best fit, you compute that r = –0.7204 using n = 8 data points, and the critical value is 0.707. Can the line be used for prediction? Why or why not?

THIRD-EXAM vs FINAL-EXAM EXAMPLE: critical value method

Consider the third exam/final exam example . The line of best fit is: ŷ = –173.51+4.83 x with r = 0.6631 and there are n = 11 data points. Can the regression line be used for prediction? Given a third-exam score ( x value), can we use the line to predict the final exam score (predicted y value)?

  • Use the "95% Critical Value" table for r with df = n – 2 = 11 – 2 = 9.
  • The critical values are –0.602 and +0.602
  • Since 0.6631 > 0.602, r is significant.
  • Conclusion: There is sufficient evidence to conclude that there is a significant linear relationship between the third exam score ( x ) and the final exam score ( y ) because the correlation coefficient is significantly different from zero.

Example 12.10

Suppose you computed the following correlation coefficients. Using the table at the end of the chapter, determine if r is significant and the line of best fit associated with each r can be used to predict a y value. If it helps, draw a number line.

  • r = –0.567 and the sample size, n , is 19. The df = n – 2 = 17. The critical value is –0.456. –0.567 < –0.456 so r is significant.
  • r = 0.708 and the sample size, n , is nine. The df = n – 2 = 7. The critical value is 0.666. 0.708 > 0.666 so r is significant.
  • r = 0.134 and the sample size, n , is 14. The df = 14 – 2 = 12. The critical value is 0.532. 0.134 is between –0.532 and 0.532 so r is not significant.
  • r = 0 and the sample size, n , is five. No matter what the dfs are, r = 0 is between the two critical values so r is not significant.

Try It 12.10

For a given line of best fit, you compute that r = 0 using n = 100 data points. Can the line be used for prediction? Why or why not?

Assumptions in Testing the Significance of the Correlation Coefficient

Testing the significance of the correlation coefficient requires that certain assumptions about the data are satisfied. The premise of this test is that the data are a sample of observed points taken from a larger population. We have not examined the entire population because it is not possible or feasible to do so. We are examining the sample to draw a conclusion about whether the linear relationship that we see between x and y in the sample data provides strong enough evidence so that we can conclude that there is a linear relationship between x and y in the population.

The regression line equation that we calculate from the sample data gives the best-fit line for our particular sample. We want to use this best-fit line for the sample as an estimate of the best-fit line for the population. Examining the scatterplot and testing the significance of the correlation coefficient helps us determine if it is appropriate to do this.

  • There is a linear relationship in the population that models the average value of y for varying values of x . In other words, the expected value of y for each particular value of x lies on a straight line in the population. (We do not know the equation for the line for the population. Our regression line from the sample is our best estimate of this line in the population.)
  • The y values for any particular x value are normally distributed about the line. This implies that there are more y values scattered closer to the line than are scattered farther away. Assumption (1) implies that these normal distributions are centered on the line: the means of these normal distributions of y values lie on the line.
  • The standard deviations of the population y values about the line are equal for each value of x . In other words, each of these normal distributions of y values has the same shape and spread about the line.
  • The residual errors are mutually independent (no pattern).
  • The data are produced from a well-designed, random sample or randomized experiment.


We can state a null and alternative hypothesis for the population correlation ρ based on our predictions for a correlation. Let's look at how this works in an example.

Now let's go through our hypothesis testing steps:

Step 1: State hypotheses and choose α level

Remember we're going to state hypotheses in terms of our population correlation ρ. In this example, we expect GPA to decrease as distance from campus increases. This means that we are making a directional hypothesis and using a 1-tailed test. It also means we expect to find a negative value of ρ, because that would indicate a negative relationship between GPA and distance from campus. So here are our hypotheses:

H 0 : ρ ≥ 0

H A : ρ < 0

We're making our predictions as a comparison with 0, because 0 would indicate no relationship. Note that if we were conducting a 2-tailed test, our hypotheses would be ρ = 0 for the null hypothesis and ρ not equal to 0 for the alternative hypothesis.

We'll use our conventional α = .05.

Step 2: Collect the sample

Here are our sample data:

Step 3: Calculate test statistic

For this example, we're going to calculate a Pearson r statistic. Recall the formula for Pearson r: the bottom of the formula requires us to calculate the sum of squares (SS) for each measure individually, and the top of the formula requires calculation of the sum of products of the two variables (SP), so that r = SP / √(SS for X × SS for Y).

We'll start with the SS terms. Remember the formula for SS is SS = Σ(X - X̄)². We'll calculate this for both GPA and Distance. If you need a review of how to calculate SS, review Lab 9. For our example, we get SS GPA = .58 and SS distance = 18.39.

Now we need to calculate the SP term. Remember the formula for SP is SP = Σ(X - X̄)(Y - Ȳ). If you need to review how to calculate the SP term, go to Lab 12. For our example, we get SP = -.63.

Plugging these SS and SP values into our r equation gives us r = -.19.

Now we need to find our critical value of r using a table like we did for our z and t-tests. We'll need to know our degrees of freedom, because like t, the r distribution changes depending on the sample size. For r, df = n - 2. So for our example, we have df = 5 - 2 = 3. Now, with df = 3, α = .05, and a one-tailed test, we can find r critical in the Table of Pearson r values. This table is organized and used in the same way that the t-table is used.
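Before looking up the critical value, here is a minimal Python sketch (not part of the original lab) that recomputes the observed r from the SS and SP values given above:

    # Sketch: r from the SS and SP values quoted in the text
    import math

    SS_gpa, SS_distance, SP = 0.58, 18.39, -0.63
    r = SP / math.sqrt(SS_gpa * SS_distance)
    print(round(r, 2))   # about -0.19, matching the value in the text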

Our r crit = .805. We write r crit (3) = -.805 (negative because we are doing a 1-tailed test looking for a negative relationship).

Step 4: Compare observed test statistic to critical test statistic and make a decision about H 0

Our r obs (3) = -.19 and r crit (3) = -.805

Since -.19 is not in the critical region that begins at -.805, we cannot reject the null. We must retain the null hypothesis and conclude that we have no evidence of a relationship between GPA and distance from campus.

Now try a few of these types of problems on your own. Show all four steps of hypothesis testing in your answer (some questions will require more for each step than others) and be sure to state hypotheses in terms of ρ.

(1) A high school counselor would like to know if there is a relationship between mathematical skill and verbal skill. A sample of n = 25 students is selected, and the counselor records achievement test scores in mathematics and English for each student. The Pearson correlation for this sample is r = +0.50. Do these data provide sufficient evidence for a real relationship in the population? Test at the .05 α level, two tails.

(2) It is well known that similarity in attitudes, beliefs, and interests plays an important role in interpersonal attraction. Thus, correlations for attitudes between married couples should be strong and positive. Suppose a researcher developed a questionnaire that measures how liberal or conservative one's attitudes are. Low scores indicate that the person has liberal attitudes, while high scores indicate conservatism. Here are the data from the study:

Couple A: Husband - 14, Wife - 11

Couple B: Husband - 7, Wife - 6

Couple C: Husband - 15, Wife - 18

Couple D: Husband - 7, Wife - 4

Couple E: Husband - 3, Wife - 1

Couple F: Husband - 9, Wife - 10

Couple G: Husband - 9, Wife - 5

Couple H: Husband - 3, Wife - 3

Test the researcher's hypothesis with α set at .05.

(3) A researcher believes that a person's belief in supernatural events (e.g., ghosts, ESP, etc) is related to their education level. For a sample of n = 30 people, he gives them a questionnaire that measures their belief in supernatural events (where a high score means they believe in more of these events) and asks them how many years of schooling they've had. He finds that SS beliefs = 10, SS schooling = 10, and SP = -8. With α = .01, test the researcher's hypothesis.

Using SPSS for Hypothesis Testing with Pearson r

We can also use SPSS to conduct a hypothesis test with Pearson r. We could calculate the Pearson r with SPSS and then look at the output to make our decision about H 0 . The output will give us a p value for our Pearson r (listed under Sig in the Output). We can compare this p value with alpha to determine if the p value is in the critical region.

Remember from Lab 12 , to calculate a Pearson r using SPSS:

The output that you get is a correlation matrix. It correlates each variable against each variable (including itself). You should notice that the table has redundant information on it (e.g., you'll find an r for height correlated with weight, and an r for weight correlated with height. These two statements are identical.)
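The same decision can be made outside SPSS. As an illustration (the variable names and numbers below are made up, not the lab's data), scipy.stats.pearsonr returns both r and the two-tailed p value that SPSS lists under Sig.:

    # Sketch: r and two-tailed p value, analogous to SPSS's Sig. column
    from scipy import stats

    height = [63, 67, 70, 72, 75, 61, 66]        # illustrative values only
    weight = [125, 140, 155, 170, 190, 118, 150]
    r, p = stats.pearsonr(height, weight)
    print(round(r, 3), round(p, 4))               # compare p with alpha to decide about H0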

Pearson correlation

This page offers all the basic information you need about the Pearson correlation coefficient and its significance test and confidence interval. It is part of Statkat’s wiki module, containing similarly structured info pages for many different statistical methods. The info pages give information about null and alternative hypotheses, assumptions, test statistics and confidence intervals, how to find p values, SPSS how-to’s and more.

To compare the Pearson correlation coefficient with other statistical methods, go to Statkat's Comparison tool or practice with the Pearson correlation coefficient at Statkat's Practice question center

  • 1. When to use
  • 2. Null hypothesis
  • 3. Alternative hypothesis
  • 4. Assumptions of test for correlation
  • 5. Test statistic
  • 6. Sampling distribution
  • 7. Significant?
  • 8. Approximate $C$% confidence interval for $\rho$
  • 9. Properties of the Pearson correlation coefficient
  • 10. Equivalent to
  • 11. Example context

When to use?

Note that theoretically, it is always possible to 'downgrade' the measurement level of a variable. For instance, a test that can be performed on a variable of ordinal measurement level can also be performed on a variable of interval measurement level, in which case the interval variable is downgraded to an ordinal variable. However, downgrading the measurement level of variables is generally a bad idea since it means you are throwing away important information in your data (an exception is the downgrade from ratio to interval level, which is generally irrelevant in data analysis).

If you are not sure which method you should use, you might like the assistance of our method selection tool or our method selection table .

Null hypothesis

The test for the Pearson correlation coefficient tests the following null hypothesis (H 0 ):

  • $H_0$: $\rho = \rho_0$, where $\rho_0$ is a specified value (usually 0, meaning no linear association in the population)

Alternative hypothesis

The test for the Pearson correlation coefficient tests the above null hypothesis against the following alternative hypothesis (H 1 or H a ):

  • $H_1$: $\rho \neq \rho_0$ (two sided)
  • $H_1$: $\rho > \rho_0$ (right sided)
  • $H_1$: $\rho < \rho_0$ (left sided)

Assumptions of test for correlation

Statistical tests always make assumptions about the sampling procedure that was used to obtain the sample data. So called parametric tests also make assumptions about how data are distributed in the population. Non-parametric tests are more 'robust' and make no or less strict assumptions about population distributions, but are generally less powerful. Violation of assumptions may render the outcome of statistical tests useless, although violation of some assumptions (e.g. independence assumptions) are generally more problematic than violation of other assumptions (e.g. normality assumptions in combination with large samples).

The test for the Pearson correlation coefficient makes the following assumptions:

  • In the population, the two variables are jointly normally distributed (this covers the normality, homoscedasticity, and linearity assumptions)
  • Sample of pairs is a simple random sample from the population of pairs. That is, pairs are independent of one another

Test statistic

The test for the Pearson correlation coefficient is based on the following test statistic:

  • $t = \dfrac{r \times \sqrt{N - 2}}{\sqrt{1 - r^2}} $ where $r$ is the sample correlation $r = \frac{1}{N - 1} \sum_{j}\Big(\frac{x_{j} - \bar{x}}{s_x} \Big) \Big(\frac{y_{j} - \bar{y}}{s_y} \Big)$ and $N$ is the sample size. This statistic is used when the null hypothesis specifies $\rho_0 = 0$.
  • For other values of $\rho_0$, an approximate test uses $z = (r_{Fisher} - \rho_{0_{Fisher}}) \times \sqrt{N - 3}$, where $r_{Fisher} = \dfrac{1}{2} \times \log\Bigg(\dfrac{1 + r}{1 - r} \Bigg )$ is the Fisher-transformed sample correlation and $\rho_{0_{Fisher}} = \dfrac{1}{2} \times \log\Bigg( \dfrac{1 + \rho_0}{1 - \rho_0} \Bigg )$ is the Fisher-transformed population correlation according to H0.

Sampling distribution

  • $t$ distribution with $N - 2$ degrees of freedom (for the $t$ statistic)
  • Approximately the standard normal distribution (for the Fisher-transformed $z$ statistic)

Significant?

This is how you find out if your test result is significant (the first three pairs of checks below apply to the $t$ statistic and the last three to the $z$ statistic; use the pair that matches your alternative hypothesis, and see the small decision sketch after the list):

  • Check if $t$ observed in sample is at least as extreme as critical value $t^*$ or
  • Find two sided $p$ value corresponding to observed $t$ and check if it is equal to or smaller than $\alpha$
  • Check if $t$ observed in sample is equal to or larger than critical value $t^*$ or
  • Find right sided $p$ value corresponding to observed $t$ and check if it is equal to or smaller than $\alpha$
  • Check if $t$ observed in sample is equal to or smaller than critical value $t^*$ or
  • Find left sided $p$ value corresponding to observed $t$ and check if it is equal to or smaller than $\alpha$
  • Check if $z$ observed in sample is at least as extreme as critical value $z^*$ or
  • Find two sided $p$ value corresponding to observed $z$ and check if it is equal to or smaller than $\alpha$
  • Check if $z$ observed in sample is equal to or larger than critical value $z^*$ or
  • Find right sided $p$ value corresponding to observed $z$ and check if it is equal to or smaller than $\alpha$
  • Check if $z$ observed in sample is equal to or smaller than critical value $z^*$ or
  • Find left sided $p$ value corresponding to observed $z$ and check if it is equal to or smaller than $\alpha$
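A minimal Python sketch of the p-value side of these decision rules (the helper name and structure are illustrative, not from Statkat):

    # Sketch: p value for a t statistic under each alternative hypothesis
    from scipy import stats

    def p_value(t, df, alternative="two-sided"):
        if alternative == "two-sided":
            return 2 * stats.t.sf(abs(t), df)   # two sided p value
        if alternative == "greater":
            return stats.t.sf(t, df)            # right sided p value
        return stats.t.cdf(t, df)               # left sided p value

    # reject H0 when the p value is equal to or smaller than alpha
    print(p_value(2.66, df=9) <= 0.05)          # True for the earlier exam example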

Approximate $C$% confidence interval for $\rho$

  • $lower_{Fisher} = r_{Fisher} - z^* \times \sqrt{\dfrac{1}{N - 3}}$
  • $upper_{Fisher} = r_{Fisher} + z^* \times \sqrt{\dfrac{1}{N - 3}}$
  • lower bound = $\dfrac{e^{2 \times lower_{Fisher}} - 1}{e^{2 \times lower_{Fisher}} + 1}$
  • upper bound = $\dfrac{e^{2 \times upper_{Fisher}} - 1}{e^{2 \times upper_{Fisher}} + 1}$
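The four formulas above can be chained together as follows; this is a small Python illustration (the r and N values are borrowed from the height/weight example later on this page, not from Statkat):

    # Sketch: approximate 95% confidence interval for rho via the Fisher transformation
    import math
    from scipy import stats

    r, N = 0.513, 354
    z_star = stats.norm.ppf(0.975)                     # critical z for 95% confidence
    r_fisher = 0.5 * math.log((1 + r) / (1 - r))
    half_width = z_star * math.sqrt(1 / (N - 3))
    lo_f, hi_f = r_fisher - half_width, r_fisher + half_width

    def to_r(f):
        # back-transform from the Fisher scale to the correlation scale
        return (math.exp(2 * f) - 1) / (math.exp(2 * f) + 1)

    print(round(to_r(lo_f), 3), round(to_r(hi_f), 3))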

Properties of the Pearson correlation coefficient

  • The Pearson correlation coefficient is a measure for the linear relationship between two quantitative variables.
  • The Pearson correlation coefficient squared reflects the proportion of variance explained in one variable by the other variable.
  • The Pearson correlation coefficient can take on values between -1 (perfect negative relationship) and 1 (perfect positive relationship). A value of 0 means no linear relationship.
  • The absolute size of the Pearson correlation coefficient is not affected by any linear transformation of the variables. However, the sign of the Pearson correlation will flip when the scores on one of the two variables are multiplied by a negative number (reversing the direction of measurement of that variable). For example:
  • the correlation between $x$ and $y$ is equivalent to the correlation between $3x + 5$ and $2y - 6$.
  • the absolute value of the correlation between $x$ and $y$ is equivalent to the absolute value of the correlation between $-3x + 5$ and $2y - 6$. However, the signs of the two correlation coefficients will be in opposite directions, due to the multiplication of $x$ by $-3$.
  • The Pearson correlation coefficient does not say anything about causality.
  • The Pearson correlation coefficient is sensitive to outliers.

Equivalent to

The test for the Pearson correlation coefficient is equivalent to:

  • Simple (OLS) regression of $y$ on $x$ with a single predictor, in which the estimated slope is $b_1 = r \times \frac{s_y}{s_x}$
  • The results of the significance test ($t$ and $p$ value) testing $H_0$: $\beta_1 = 0$ are equivalent to the results of the significance test testing $H_0$: $\rho = 0$
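A quick numerical check of this equivalence (made-up data; scipy.stats.linregress tests $H_0$: $\beta_1 = 0$ and scipy.stats.pearsonr tests $H_0$: $\rho = 0$):

    # Sketch: correlation test and simple-regression slope test give the same p value
    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(1)
    x = rng.normal(size=30)
    y = 0.5 * x + rng.normal(size=30)

    r, p_corr = stats.pearsonr(x, y)
    reg = stats.linregress(x, y)                                       # slope b1, with its own t test
    print(np.isclose(reg.slope, r * y.std(ddof=1) / x.std(ddof=1)))    # b1 = r * sy/sx
    print(np.isclose(p_corr, reg.pvalue))                              # identical two-sided p values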

Example context

The test for the Pearson correlation coefficient could, for instance, be used to answer questions about whether two quantitative variables are linearly related in the population.

How to compute the Pearson correlation coefficient in SPSS:

  • Put your two variables in the box below Variables

How to compute the Pearson correlation coefficient in jamovi:

  • Put your two variables in the white box at the right
  • Under Correlation Coefficients, select Pearson (selected by default)
  • Under Hypothesis, select your alternative hypothesis
SAS Tutorials: Pearson Correlation with PROC CORR

Sample Data Files

Our tutorials reference a dataset called "sample" in many examples. If you'd like to download the sample dataset to work through the examples, choose one of the files below:

  • Data definitions (*.pdf)
  • Data - Comma delimited (*.csv)
  • Data - Tab delimited (*.txt)
  • Data - Excel format (*.xlsx)
  • Data - SAS format (*.sas7bdat)
  • Data - SPSS format (*.sav)
  • SPSS Syntax (*.sps) Syntax to add variable labels, value labels, set variable types, and compute several recoded variables used in later tutorials.
  • SAS Syntax (*.sas) Syntax to read the CSV-format sample data and set variable labels and formats/value labels.

Pearson Correlation

The bivariate Pearson Correlation produces a sample correlation coefficient, r , which measures the strength and direction of linear relationships between pairs of continuous variables. By extension, the Pearson Correlation evaluates whether there is statistical evidence for a linear relationship among the same pairs of variables in the population, represented by a population correlation coefficient, ρ (“rho”). The Pearson Correlation is a parametric measure.

This measure is also known as:

  • Pearson’s correlation
  • Pearson product-moment correlation (PPMC)

Common Uses

The bivariate Pearson Correlation is commonly used to measure the following:

  • Correlations among pairs of variables
  • Correlations within and between sets of variables

The bivariate Pearson correlation indicates the following:

  • Whether a statistically significant linear relationship exists between two continuous variables
  • The strength of a linear relationship (i.e., how close the relationship is to being a perfectly straight line)
  • The direction of a linear relationship (increasing or decreasing)

Note: The bivariate Pearson Correlation cannot address non-linear relationships or relationships among categorical variables. If you wish to understand relationships that involve categorical variables and/or non-linear relationships, you will need to choose another measure of association.

Note: The bivariate Pearson Correlation only reveals associations among continuous variables. The bivariate Pearson Correlation does not provide any inferences about causation, no matter how large the correlation coefficient is.

Data Requirements

To use Pearson correlation, your data must meet the following requirements:

  • Two or more continuous variables (i.e., interval or ratio level)
  • Cases must have non-missing values on both variables
  • Linear relationship between the variables
  • Independent cases (i.e., independence of observations):
      ◦ the values for all variables across cases are unrelated
      ◦ for any case, the value for any variable cannot influence the value of any variable for other cases
      ◦ no case can influence another case on any variable
      ◦ Note: The bivariate Pearson correlation coefficient and corresponding significance test are not robust when independence is violated.
  • Bivariate normality:
      ◦ Each pair of variables is bivariately normally distributed
      ◦ Each pair of variables is bivariately normally distributed at all levels of the other variable(s)
      ◦ This assumption ensures that the variables are linearly related; violations of this assumption may indicate that non-linear relationships among variables exist. Linearity can be assessed visually using a scatterplot of the data.
  • Random sample of data from the population
  • No outliers

The null hypothesis ( H 0 ) and alternative hypothesis ( H 1 ) of the significance test for correlation can be expressed in the following ways, depending on whether a one-tailed or two-tailed test is requested:

Two-tailed significance test:

H 0 : ρ = 0 ("the population correlation coefficient is 0; there is no association")

H 1 : ρ ≠ 0 ("the population correlation coefficient is not 0; a nonzero correlation could exist")

One-tailed significance test:

H 0 : ρ = 0 ("the population correlation coefficient is 0; there is no association")

H 1 : ρ > 0 ("the population correlation coefficient is greater than 0; a positive correlation could exist")

OR

H 1 : ρ < 0 ("the population correlation coefficient is less than 0; a negative correlation could exist")

where ρ is the population correlation coefficient.

Test Statistic

The sample correlation coefficient between two variables x and y is denoted r or r xy , and can be computed as: $$ r_{xy} = \frac{\mathrm{cov}(x,y)}{\sqrt{\mathrm{var}(x)} \cdot \sqrt{\mathrm{var}(y)}} $$

where cov( x , y ) is the sample covariance of x and y ; var( x ) is the sample variance of x ; and var( y ) is the sample variance of y .
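As a small illustration (the numbers are made up), this formula can be checked directly in Python against numpy's built-in correlation:

    # Sketch: r as covariance over the product of standard deviations
    import numpy as np

    x = np.array([61.0, 65.5, 68.0, 70.2, 72.5])
    y = np.array([120.0, 135.0, 150.5, 162.0, 178.0])
    r_manual = np.cov(x, y, ddof=1)[0, 1] / (np.std(x, ddof=1) * np.std(y, ddof=1))
    r_builtin = np.corrcoef(x, y)[0, 1]
    print(round(r_manual, 4), round(r_builtin, 4))   # the two values agree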

Correlation can take on any value in the range [-1, 1]. The sign of the correlation coefficient indicates the direction of the relationship, while the magnitude of the correlation (how close it is to -1 or +1) indicates the strength of the relationship.

  •  -1 : perfectly negative linear relationship
  •   0 : no relationship
  • +1  : perfectly positive linear relationship

The strength can be assessed by these general guidelines [1] (which may vary by discipline):

  • .1 < | r | < .3 … small / weak correlation
  • .3 < | r | < .5 … medium / moderate correlation
  • .5 < | r | ……… large / strong correlation

Note: The direction and strength of a correlation are two distinct properties. The scatterplots below [2] show correlations that are r = +0.90, r = 0.00, and r = -0.90, respectively. The strength of the nonzero correlations is the same: 0.90. But the direction of the correlations is different: a negative correlation corresponds to a decreasing relationship, while a positive correlation corresponds to an increasing relationship.

Scatterplot of data with correlation r = -0.90

Note that the r = 0.00 correlation has no discernable increasing or decreasing linear pattern in this particular graph. However, keep in mind that Pearson correlation is only capable of detecting linear associations, so it is possible to have a pair of variables with a strong nonlinear relationship and a small Pearson correlation coefficient. It is good practice to create scatterplots of your variables to corroborate your correlation coefficients.

[1]  Cohen, J. (1988). Statistical power analysis for the behavioral sciences (2nd ed.). Hillsdale, NJ: Lawrence Erlbaum Associates.

[2]  Scatterplots created in R using ggplot2 , ggthemes::theme_tufte() , and MASS::mvrnorm() .

Data Set Up

Your data should include two or more continuous numeric variables.

Correlation with PROC CORR

The CORR procedure produces Pearson correlation coefficients of continuous numeric variables. The basic syntax of the CORR procedure is:

    PROC CORR DATA=dataset <options>;
        VAR variable1 variable2 ...;
        WITH variable3 variable4 ...;
    RUN;

In the first line of the SAS code above, PROC CORR tells SAS to execute the CORR procedure on the dataset given in the DATA= argument. Immediately following PROC CORR is where you put any procedure-level options you want to include. Let’s review some of the more common options:

  • NOMISS Excludes observations with missing values on any of the analysis variables specified in the VAR or WITH statements (i.e., listwise exclusion).
  • PLOTS=MATRIX Creates a scatterplot matrix of the variables in the VAR and/or WITH statements.
  • PLOTS=MATRIX(HISTOGRAM) Same as above, but changes the panels on the diagonal of the scatterplot matrix to display histograms of the variables in the VAR statement. (The HISTOGRAM option is ignored if you include a WITH statement.)
  • PLOTS=SCATTER Creates individual scatterplots of the variables in the VAR and/or WITH statements.
  • PLOTS(MAXPOINTS=n)= <...> Used to increase the limit on the number of datapoints used in a plot to some number n . By default, n is 5000. Can be used in conjunction with any of the above options for MATRIX and SCATTER. If you have included PLOTS syntax in your script but do not see any plots in your output, check your log window; if you see the message WARNING: The scatter plot matrix with more than 5000 points has been suppressed. Use the PLOTS(MAXPOINTS= ) option in the PROC CORR statement to change or override the cutoff. then you should try revising the code to PLOTS(MAXPOINTS=15000)= and rerun.

On the next line, the VAR statement is where you specify all of the variables you want to compute pairwise correlations for. You can list as many variables as you want, with each variable separated by a space. If the VAR statement is not included, then SAS will include every numeric variable that does not appear in any other of the statements.

The WITH statement is optional, but is typically used if you only want to run correlations between certain combinations of variables. If both the VAR and WITH statements are used, each variable in the WITH statement will be correlated against each variable in the VAR statement.

When ODS graphics are turned on and you request plots from PROC CORR, each plot will be saved as a PNG file in the same directory where your SAS code is. If you run the same code multiple times, it will create new graphics files for each run (rather than overwriting the old ones).
  • SAS 9.2 Procedures Guide - PROC CORR
  • SAS 9.2 Procedures Guide - PROC CORR - CORR Statement Options

Example: Understanding the linear association between height and weight

Problem Statement

Perhaps you would like to test whether there is a statistically significant linear relationship between two continuous variables, weight and height (and by extension, infer whether the association is significant in the population). You can use a bivariate Pearson Correlation to test whether there is a statistically significant linear relationship between height and weight, and to determine the strength and direction of the association.

Before the Test

Before we look at the Pearson correlations, we should look at the scatterplots of our variables to get an idea of what to expect. In particular, we need to determine if it's reasonable to assume that our variables have linear relationships. PROC CORR automatically includes descriptive statistics (including mean, standard deviation, minimum, and maximum) for the input variables, and can optionally create scatterplots and/or scatterplot matrices. (Note that the plots require the ODS graphics system . If you are using SAS 9.3 or later, ODS is turned on by default.)

In the sample data, we will use two variables: “Height” and “Weight.” The variable “Height” is a continuous measure of height in inches and exhibits a range of values from 55.00 to 84.41. The variable “Weight” is a continuous measure of weight in pounds and exhibits a range of values from 101.71 to 350.07.

Running the Test

SAS Program

The tutorial runs PROC CORR on the two variables; a minimal call consistent with the description above and the output shown below is:

    PROC CORR DATA=sample;
        VAR Weight Height;
    RUN;

The first two tables tell us what variables were analyzed, and their descriptive statistics.

2 Variables: Weight Height

The third table contains the Pearson correlation coefficients and test results.

Pearson correlation table for Weight and Height (correlation coefficients, two-tailed p-values, and pairwise N)

Notice that the correlations in the main diagonal (cells A and D) are all equal to 1. This is because a variable is always perfectly correlated with itself. Notice, however, that the sample sizes are different in cell A ( n =376) versus cell D ( n =408). This is because of missing data -- there are more missing observations for variable Weight than there are for variable Height.

The important cells we want to look at are either B or C. (Cells B and C are identical, because they include information about the same pair of variables.) Cells B and C contain the correlation coefficient itself, its p-value, and the number of complete pairwise observations that the calculation was based on.

In cell B (repeated in cell C), we can see that the Pearson correlation coefficient for height and weight is .513, which is significant ( p < .001 for a two-tailed test), based on 354 complete observations (i.e., cases with nonmissing values for both height and weight).
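As a cross-check of this result outside SAS, the t formula from earlier sections can be applied to the reported values (a Python sketch, not part of the SAS tutorial):

    # Sketch: verify the significance of r = .513 based on 354 complete pairs
    import math
    from scipy import stats

    r, n = 0.513, 354
    t = r * math.sqrt(n - 2) / math.sqrt(1 - r**2)
    p = 2 * stats.t.sf(abs(t), df=n - 2)
    print(round(t, 2), p)   # p is far below .001, consistent with the SAS output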

If you used the PLOTS=SCATTER option in the PROC CORR statement, you will see a scatter plot:

Scatter plot of Weight versus Height

Decision and Conclusions

Based on the results, we can state the following:

  • Weight and height have a statistically significant linear relationship ( r = 0.51, p < .001).
  • The direction of the relationship is positive (i.e., height and weight are positively correlated), meaning that these variables tend to increase together (i.e., greater height is associated with greater weight).
  • The magnitude, or strength, of the association is moderate (.3 < | r | < .5).


Pearson Product-Moment Correlation

What does this test do?

The Pearson product-moment correlation coefficient (or Pearson correlation coefficient, for short) is a measure of the strength of a linear association between two variables and is denoted by r . Basically, a Pearson product-moment correlation attempts to draw a line of best fit through the data of two variables, and the Pearson correlation coefficient, r , indicates how far away all these data points are to this line of best fit (i.e., how well the data points fit this new model/line of best fit).

What values can the Pearson correlation coefficient take?

The Pearson correlation coefficient, r , can take a range of values from +1 to -1. A value of 0 indicates that there is no association between the two variables. A value greater than 0 indicates a positive association; that is, as the value of one variable increases, so does the value of the other variable. A value less than 0 indicates a negative association; that is, as the value of one variable increases, the value of the other variable decreases. This is shown in the diagram below:

Pearson Coefficient - Different Values

How can we determine the strength of association based on the Pearson correlation coefficient?

The stronger the association of the two variables, the closer the Pearson correlation coefficient, r , will be to either +1 or -1 depending on whether the relationship is positive or negative, respectively. Achieving a value of +1 or -1 means that all your data points are included on the line of best fit – there are no data points that show any variation away from this line. Values for r between +1 and -1 (for example, r = 0.8 or -0.4) indicate that there is variation around the line of best fit. The closer the value of r to 0 the greater the variation around the line of best fit. Different relationships and their correlation coefficients are shown in the diagram below:

Different values for the Pearson Correlation Coefficient

Are there guidelines to interpreting Pearson's correlation coefficient?

Yes, the following guidelines have been proposed: a coefficient of roughly .1 to .3 in absolute value is usually described as a small/weak association, roughly .3 to .5 as medium/moderate, and roughly .5 and above as large/strong (the Cohen guidelines cited earlier on this page).

Remember that these values are guidelines and whether an association is strong or not will also depend on what you are measuring.

Can you use any type of variable for Pearson's correlation coefficient?

No, the two variables have to be measured on either an interval or ratio scale. However, both variables do not need to be measured on the same scale (e.g., one variable can be ratio and one can be interval). Further information about types of variable can be found in our Types of Variable guide. If you have ordinal data, you will want to use Spearman's rank-order correlation or a Kendall's Tau Correlation instead of the Pearson product-moment correlation.

Do the two variables have to be measured in the same units?

No, the two variables can be measured in entirely different units. For example, you could correlate a person's age with their blood sugar levels. Here, the units are completely different; age is measured in years and blood sugar level measured in mmol/L (a measure of concentration). Indeed, the calculations for Pearson's correlation coefficient were designed such that the units of measurement do not affect the calculation. This allows the correlation coefficient to be comparable and not influenced by the units of the variables used.

What about dependent and independent variables?

The Pearson product-moment correlation does not take into consideration whether a variable has been classified as a dependent or independent variable. It treats all variables equally. For example, you might want to find out whether basketball performance is correlated to a person's height. You might, therefore, plot a graph of performance against height and calculate the Pearson correlation coefficient. Let's say, for example, that r = .67. That is, as height increases so does basketball performance. This makes sense. However, if we plotted the variables the other way around and wanted to determine whether a person's height was determined by their basketball performance (which makes no sense), we would still get r = .67. This is because the Pearson correlation coefficient takes no account of any theory behind why you chose the two variables to compare. This is illustrated below:

Not influenced by Dependent and Independent Variables

Does the Pearson correlation coefficient indicate the slope of the line?

It is important to realize that the Pearson correlation coefficient, r , does not represent the slope of the line of best fit. Therefore, if you get a Pearson correlation coefficient of +1 this does not mean that for every unit increase in one variable there is a unit increase in another. It simply means that there is no variation between the data points and the line of best fit. This is illustrated below:

The Pearson Coefficient does not indicate the slope of the line of best fit.

What assumptions does Pearson's correlation make?

The first and most important step before analysing your data using Pearson’s correlation is to check whether it is appropriate to use this statistical test. After all, Pearson’s correlation will only give you valid/accurate results if your study design and data " pass/meet " seven assumptions that underpin Pearson’s correlation.

In many cases, Pearson’s correlation will be the incorrect statistical test to use because your data " violates/does not meet " one or more of these assumptions. This is not uncommon when working with real-world data, which is often "messy", as opposed to textbook examples. However, there is often a solution, whether this involves using a different statistical test , or making adjustments to your data so that you can continue to use Pearson’s correlation.

We briefly set out the seven assumptions below, three of which relate to your study design and how you measured your variables (i.e., Assumptions #1, #2 and #3 below), and four which relate to the characteristics of your data (i.e., Assumptions #4, #5, #6 and #7 below):

Note: We list seven assumptions below, but there is disagreement in the statistics literature whether the term "assumptions" should be used to describe all of these (e.g., see Nunnally, 1978). We highlight this point for transparency. However, we use the word "assumptions" to stress their importance and to indicate that they should be examined closely when using a Pearson’s correlation if you want accurate/valid results. We also use the word "assumptions" to indicate that where some of these are not met, Pearson’s correlation will no longer be the correct statistical test to analyse your data.

  • Assumption #1: Your two variables should be measured on a continuous scale (i.e., they are measured at the interval or ratio level). Examples of continuous variables include revision time (measured in hours), intelligence (measured using IQ score), exam performance (measured from 0 to 100), weight (measured in kg), driving speed (measured in km/h) and so forth.
  • Assumption #2: Your two continuous variables should be paired , which means that each case (e.g., each participant) has two values: one for each variable. These "values" are also referred to as "data points". For example, imagine that you had collected the revision times (measured in hours) and exam results (measured from 0 to 100) from 100 randomly sampled students at a university (i.e., you have two continuous variables: "revision time" and "exam performance"). Each of the 100 students would have a value for revision time (e.g., "student #1" studied for "23 hours") and an exam result (e.g., "student #1" scored "81 out of 100"). Therefore, you would have 100 paired values.
  • Assumption #3: There should be independence of cases, meaning that the two values for one case (e.g., one participant) are unrelated to the values for any other case.
  • Assumption #4: There should be a linear relationship between your two variables.
  • Assumption #5: Both variables should be approximately normally distributed (univariate normality).
  • Assumption #6: There should be homoscedasticity.
  • Assumption #7: There should be no significant outliers.

Note: The independence of cases assumption is also known as the independence of observations assumption.

Since assumptions #1, #2 and #3 relate to your study design and how you measured your variables , if any of these three assumptions are not met (i.e., if any of these assumptions do not fit with your research), Pearson’s correlation is the incorrect statistical test to analyse your data. It is likely that there will be other statistical tests you can use instead, but Pearson’s correlation is not the correct test.

After checking if your study design and variables meet assumptions #1, #2 and #3 , you should now check if your data also meets assumptions #4, #5, #6 and #7 below. When checking if your data meets these four assumptions, do not be surprised if this process takes up the majority of the time you dedicate to carrying out your analysis. As we mentioned above, it is not uncommon for one or more of these assumptions to be violated (i.e., not met) when working with real-world data rather than textbook examples. However, with the right guidance this does not need to be a difficult process and there are often other statistical analysis techniques that you can carry out that will allow you to continue with your analysis.

Note: If your two continuous, paired variables (i.e., Assumptions #1 and 2) follow a bivariate normal distribution, there will be linearity, univariate normality and homoscedasticity (i.e., Assumptions #4, #5 and #6 below; e.g., Lindeman et al., 1980). Unfortunately, the assumption of bivariate normality is very difficult to test, which is why we focus on linearity and univariate normality instead. Homoscedasticity is also difficult to test, but we include this so that you know why it is important. We include outliers at the end (i.e., Assumption #7) because they can not only lead to violations of the linearity and univariate normality assumptions, but they also have a large impact on the value of Pearson’s correlation coefficient, r (e.g., Wilcox, 2012).

Detecting a Linear Relationship.

Note: Pearson's correlation coefficient is a measure of the strength of a linear association between two variables. Put another way, it determines whether there is a linear component of association between two continuous variables. As such, linearity is not strictly an "assumption" of Pearson's correlation. However, you would not normally want to use Pearson's correlation to determine the strength and direction of a linear relationship when you already know the relationship between your two variables is not linear. Instead, the relationship between your two variables might be better described by another statistical measure (Cohen, 2013). For this reason, it is not uncommon to view the relationship between your two variables in a scatterplot to see if running a Pearson's correlation is the best choice as a measure of association or whether another measure would be better.

Note: The disagreements about the robustness of Pearson’s correlation are based on additional assumptions that are made to justify robustness under non-normality and whether these additional assumptions are likely to be true in practice. For further reading on this issue, see, for example, Edgell and Noon (1984) and Hogg and Craig (2014).

Homoscedasticity in Correlation.

Note: Outliers are not necessarily "bad", but due to the effect they have on the Pearson correlation coefficient, r , discussed on the next page , they need to be taken into account.

You can check whether your data meets assumptions #4, #5 and #7 using a number of statistics packages (to learn more, see our guides for: SPSS Statistics , Stata and Minitab ). If any of these seven assumptions are violated (i.e., not met), there are often other statistical analysis techniques that you can carry out that will allow you to continue with your analysis (e.g., see Shevlyakov and Oja, 2016).

On the next page we discuss other characteristics of Pearson's correlation that you should consider.


Pearson Correlation

A statistical technique to investigate the strength and direction of the relationship between two quantitative variables.


Pearson’s correlation is a commonly used statistical technique for investigating the strength and direction of the relationship between two quantitative variables, such as the relationship between age and height. In this article, we will explore the theory, assumptions and interpretation of Pearson’s correlation, including a worked example of how to calculate Pearson’s correlation coefficient, often referred to as Pearson’s r.

What is the Pearson correlation test, Pearson product moment correlation or Pearson’s r?

This article covers: scatter plots, the Pearson correlation coefficient formula, the assumptions of Pearson's correlation test, Pearson vs Spearman correlation and when to use Pearson's r, interpretation of Pearson's correlation coefficient, and a worked Pearson's r test example.

Pearson’s correlation helps us understand the relationship between two quantitative variables when the relationship between them is assumed to take a linear pattern. The relationship between two quantitative variables (also known as continuous variables), can be visualized using a scatter plot , and a straight line can be drawn through them. The closeness with which the points lie along this line is measured by Pearson’s correlation coefficient, also often denoted as Pearson’s r, and sometimes referred to as Pearson’s product moment correlation coefficient or simply the correlation coefficient. Pearson’s r can be thought of not just as a descriptive statistic but also an inferential statistic because, as with other statistical tests, a hypothesis test can be performed to make inferences and draw conclusions from the data.

A common first step when investigating the relationship between two quantitative variables is to plot the data in a scatter diagram, where the outcome (response or dependent) variable is on the vertical y-axis and the exposure (explanatory or independent) variable is on the horizontal x-axis, with markers representing the values of each variable an individual takes in the dataset. An example of a scatter plot can be found in Figure 3.

The formula for Pearson’s correlation coefficient, r, relates to how closely a line of best fit (i.e., a linear regression) predicts the relationship between the two variables. It is presented as follows:

\(r=\dfrac{\sum_{i}\left(x_{i}-\bar{x}\right)\left(y_{i}-\bar{y}\right)}{\sqrt{\sum_{i}\left(x_{i}-\bar{x}\right)^{2}\sum_{i}\left(y_{i}-\bar{y}\right)^{2}}}\)

where \(x_i\) and \(y_i\) represent the values of the exposure variable and outcome variable for each individual respectively, and \(\bar{x}\) and \(\bar{y}\) represent the means of the exposure and outcome variables in the dataset.

This formula works by numerically expressing how much of the variation in the outcome variable is shared with the exposure variable. In essence, we calculate the sum of products (multiplying the corresponding deviations from the mean for each pair and adding the results together) about the means of x and y, divided by the square root of the sum of squares about the mean of x multiplied by the sum of squares about the mean of y. The variation is expressed using sums of squared distances from the means (e.g. \((x_i - \bar{x})^2\)). This captures the variation of the values around the line of best fit and also ensures the correlation coefficient lies between −1 and +1.

There are some key assumptions that should be met before Pearson’s correlation can be used to give valid results:

  • The data should be on a continuous scale. Examples of continuous variables include age in years, height in centimeters and temperature in degrees Celsius. Sometimes continuous variables are referred to as quantitative variables, although it’s important to remember that, while all continuous variables are quantitative, not all quantitative variables are continuous.
  • The variables should take a Normal distribution . This assumption means the Pearson’s correlation test is a parametric test. If the data in the variables of interest take some other distribution, then a non-parametric test for correlation such as Spearman’s rank correlation should be used. This assumption can be checked using a histogram.
  • There should be no outliers in the dataset. Outliers are values that are notably different from the pattern of the rest of the data and may influence the line of best fit and warp the correlation coefficient.
  • The relationship between the two variables is assumed to be linear . This assumption is related to the “no outliers” assumption, in that the relationship should be able to be described by a straight line relatively well. These assumptions can be checked using scatter plots (Figure 1).

Scatter plots showing examples of linear and non-linear relationships.

Figure 1 : Scatter plots showing examples of linear and non-linear relationships. Credit: Technology Networks.

Pearson’s correlation and Spearman’s rank correlation share the same purpose of quantifying the strength and direction (negative or positive) of an association between two variables. The correlation coefficients can be used to conduct a hypothesis test for the strength of evidence for the correlation between two variables of interest.

Unlike Spearman’s correlation, Pearson’s correlation assumes normality and a linear relationship between the variables. Spearman’s correlation test is non-parametric and assumes neither a specific distribution nor a linear relationship, so it is suitable when these assumptions are not met. Pearson’s correlation is based on the covariance and standard deviations of the data points, whereas Spearman’s rank correlation is based on ranking the data points and is suitable only for measuring monotonic association (as one variable increases, the other consistently increases or consistently decreases).

The properties and interpretation of Pearson’s r can be summarized as follows:

  • r always lies between −1 and +1.
  • Positive correlation (positive values of r) means that as one variable increases, the other tends to increase. Negative correlation (negative values of r) means that as one variable increases, the other tends to decrease. See Figure 2 for some examples.
  • Values closer to −1 and +1 indicate stronger relationships, and r will be equal to 0 when the variables are not linearly associated, with −1 and +1 representing points that lie perfectly on the line of best fit. It should be noted that the variables may still be associated when r = 0, but in a more complex, non-linear way.
  • The more random the scatter of points around the line, the less correlated the data and the closer to 0 the value of r will be.

It is important to note that correlation coefficients assess only two variables at a time and should not be interpreted as causal relationships, where additional variables and multivariable methods are needed.

Scatter plots showing various relationships between two variables and their corresponding Pearson correlation coefficients, r.

Figure 2 : Scatter plots showing various relationships between two variables and their corresponding Pearson correlation coefficients, r. Credit: Technology Networks .

Suppose a team of chemists is interested in the relationship between a reactant’s concentration (in mol/L) and reaction rate (in units/min). Both variables take a Normal distribution and the relationship is assumed to be linear, so a Pearson’s correlation test is appropriate (Figure 3). In this section we will calculate Pearson’s r using the formula by hand, but in reality, these calculations can be done for us using statistical software.

Scatter plot showing reactant concentration (x) and reaction rate (y) in our worked example, with a best fit line included.

Figure 3 : Scatter plot showing reactant concentration (x) and reaction rate (y) in our worked example, with a best fit line included. Credit: Elliot McClenaghan.

For the by-hand calculation, so that we can follow the steps clearly in our dataset, we can rearrange the formula as follows, where n is the number of observations or data points:

\(r=\dfrac{\sum x_{i}y_{i}-\dfrac{\left(\sum x_{i}\right)\left(\sum y_{i}\right)}{n}}{\sqrt{\left(\sum x_{i}^{2}-\dfrac{\left(\sum x_{i}\right)^{2}}{n}\right)\left(\sum y_{i}^{2}-\dfrac{\left(\sum y_{i}\right)^{2}}{n}\right)}}\)

We need to calculate the sum of the x and y values, calculate \(x^2\) and \(y^2\) and their sums, and finally the sum of the cross products (each value of x multiplied by the corresponding value of y). The raw values for each observation and the calculated values are shown in Table 1.

Table 1 : Reactant concentration (x) and reaction rate (y) values required to calculate Pearson’s correlation coefficient, along with \(x^2\), \(y^2\) and the cross-product values. The sum (Σ) of each column is included in the bottom row.

Step two is to calculate the Pearson’s correlation coefficient, r, using the formula, the calculated parameters and our sample size (n = 20).

Calculating the Pearson’s correlation coefficient, step 1.

In our example, we find Pearson’s correlation coefficient to be r = −0.94. We can interpret this as evidence of strong negative correlation given that r is a negative value and close to −1.

Step three is to conduct a hypothesis test for the relationship between the two variables using Pearson’s r.

As with other correlation tests and statistical tests, we can conduct a hypothesis test for the strength of evidence of the correlation and to assess to what extent our finding could be attributed to chance. Our hypotheses for the test are as follows:

  • Null hypothesis (H0) is that the two variables are independent, that there is no linear relationship.
  • Alternative hypothesis (H1) is that there is a linear relationship between the two variables, and that there is correlation.

We calculate our test statistic (t) by plugging in our r value and n into the following formula:

\(t=\dfrac{r\sqrt{n-2}}{\sqrt{1-r^{2}}}\)

Next, we can use the degrees of freedom (df), the significance level of interest (normally α = 0.05) and whether we are interested in a one-sided or two-sided test to find a corresponding critical value of t using a t-distribution table. The degrees of freedom are the number of independent pieces of information available for the test; for Pearson's correlation, df = n − 2. The t-distribution is a probability distribution that gives the probabilities of different outcomes for an experiment and is commonly used in statistical hypothesis testing.

The two-sided p-value is usually of more interest as it considers both directions of the effect and better represents the hypotheses of our test. In our example, we find a two-sided p-value of p < 0.001. This p-value indicates strong evidence against the null hypothesis. Therefore, we conclude that there is evidence of a negative correlation between the two variables: as reactant concentration increases, reaction rate decreases.
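The 20 concentration and rate values from Table 1 are not reproduced above, so the short sketch below uses made-up stand-in values purely to illustrate the same workflow in Python with SciPy; with the article's actual data this calculation returns r = −0.94.

```python
import numpy as np
from scipy import stats

# Illustrative stand-in data showing the same negative trend (not the Table 1 values)
concentration = np.array([0.5, 0.8, 1.0, 1.3, 1.7, 2.0, 2.4, 2.9, 3.3, 3.8])  # mol/L
rate = np.array([9.1, 8.4, 8.0, 7.1, 6.0, 5.6, 4.3, 3.5, 2.8, 1.9])           # units/min

r, p_two_sided = stats.pearsonr(concentration, rate)

# Same t statistic as the by-hand test: t = r * sqrt(n - 2) / sqrt(1 - r^2)
n = len(concentration)
t = r * np.sqrt(n - 2) / np.sqrt(1 - r**2)

print(f"r = {r:.3f}, t = {t:.2f}, two-sided p = {p_two_sided:.4g}")
```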

Further reading

  • Bland M. An Introduction to Medical Statistics (4th ed.) . Oxford. Oxford University Press; 2015. ISBN: 9780199589920  
  • Laerd Statistics. Pearson product-moment correlation. https://statistics.laerd.com/statistical-guides/pearson-correlation-coefficient-statistical-guide.php . Accessed April 3, 2024.


Pearson Correlation Coefficient


Why is Pearson’s correlation used?

Pearson Correlation Coefficient is typically used to describe the strength of the linear relationship between two quantitative variables. Often, these two variables are designated X (predictor) and Y (outcome). Pearson’s r has values that range from −1.00 to +1.00. The sign of r provides information about the direction of the relationship between X and Y. A positive correlation indicates that as scores on X increase, scores on Y also tend to increase; a negative correlation indicates that as scores on X increase, scores on Y tend to decrease; and a correlation near 0 indicates that as scores on X increase, scores on Y neither increase nor decrease in a linear manner.

What does the Correlation coefficient tell you?

The absolute magnitude of Pearson’s r provides information about the strength of the linear association between scores on X and Y. For values of r close to 0, there is no linear association between X and Y. When r = +1.00, there is a perfect positive linear association; when r = −1.00, there is a perfect negative linear association. Intermediate values of r correspond to intermediate strength of the relationship. Figures 7.2 through 7.5 show examples of data for which the correlations are r = +.75, r = +.50, r = +.23, and r = .00.

Pearson’s r is a standardized or unit-free index of the strength of the linear relationship between two variables. No matter what units are used to express the scores on the X and Y variables, the possible values of Pearson’s r range from –1 (a perfect negative linear relationship) to +1 (a perfect positive linear relationship).

How do you explain correlation analysis?

Consider, for example, a correlation between height and weight. Height could be measured in inches, centimeters, or feet; weight could be measured in ounces, pounds, or kilograms. When we correlate scores on height and weight for a given sample of people, the correlation has the same value no matter which of these units are used to measure height and weight. This happens because the scores on X and Y are converted to z scores (i.e., they are converted to unit-free or standardized distances from their means) during the computation of Pearson’s r.

What is the null hypothesis for Pearson correlation?

A correlation coefficient may be tested to determine whether the coefficient significantly differs from zero. The value r is obtained on a sample. The value rho (ρ) is the population’s correlation coefficient. It is hoped that r closely approximates rho. The null and alternative hypotheses are as follows:

  • Null hypothesis \(H_{0}\colon \rho = 0\) (there is no linear relationship between X and Y in the population)
  • Alternative hypothesis \(H_{A}\colon \rho \neq 0\) (there is a linear relationship between X and Y in the population)

What are the assumptions for Pearson correlation?

The assumptions for the Pearson correlation coefficient are as follows: level of measurement, related pairs, absence of outliers, normality of variables, linearity, and homoscedasticity.

Linear Relationship

When using the Pearson correlation coefficient, it is assumed that the cluster of points is best fit by a straight line.

Homoscedasticity

A second assumption of the correlation coefficient is that of homoscedasticity. This assumption is met if the distance from the points to the line is relatively equal all along the line.

If the normality assumption is violated, you can use the Spearman rho correlation test instead.


Pearson Correlation

Pearson correlation analysis examines the relationship between two variables. For example, is there a correlation between a person's age and salary?


More specifically, we can use the Pearson correlation coefficient to measure the linear relationship between two variables.

Strength and direction of correlation

With a correlation analysis we can determine:

  • How strong the correlation is
  • and in which direction the correlation goes.

We can read the strength and direction of the correlation in the Pearson correlation coefficient r , whose value varies between -1 and 1.

Strength of the correlation

The strength of the correlation can be read from a table of conventional thresholds based on the absolute value of r. An absolute value of r between 0 and 0.1 indicates no correlation, while an absolute value between 0.7 and 1 indicates a very strong correlation.

Direction of the correlation

A positive relationship or correlation exists when large values of one variable are associated with large values of the other variable, or when small values of one variable are associated with small values of the other variable.

Positive Pearson correlation coefficient

A positive correlation results, for example, for height and shoe size. This results in a positive correlation coefficient.

positive correlation coefficient

A negative correlation is when large values of one variable are associated with small values of the other variable and vice versa.

negative Pearson correlation coefficient

A negative correlation is usually found between product price and sales volume. This results in a negative correlation coefficient.

negative correlation coefficient

Calculate Pearson correlation

The Pearson correlation coefficient is calculated using the following equation. Here r is the Pearson correlation coefficient, \(x_i\) are the individual values of one variable (e.g. age), \(y_i\) are the individual values of the other variable (e.g. salary), and \(\bar{x}\) and \(\bar{y}\) are the mean values of the two variables respectively.

\(r=\dfrac{\sum_{i}\left(x_{i}-\bar{x}\right)\left(y_{i}-\bar{y}\right)}{\sqrt{\sum_{i}\left(x_{i}-\bar{x}\right)^{2}\sum_{i}\left(y_{i}-\bar{y}\right)^{2}}}\)

In the equation, we can see that the respective mean value is first subtracted from both variables.

So in our example, we calculate the mean values of age and salary. We then subtract the mean values from each of age and salary. We then multiply both values. We then sum up the individual results of the multiplication. The expression in the denominator ensures that the correlation coefficient is scaled between -1 and 1.

If we now multiply two positive values we get a positive value. If we multiply two negative values we also get a positive value (minus times minus is plus). So all values that lie in these ranges have a positive influence on the correlation coefficient.

Positive correlation Pearson correlation

If we multiply a positive value and a negative value we get a negative value (minus times plus is minus). So all values that are in these ranges have a negative influence on the correlation coefficient.

negative correlation Pearson correlation

Therefore, if our values are predominantly in the two green areas from previous two figures, we get a positive correlation coefficient and therefore a positive correlation.

If our scores are predominantly in the two red areas from the figures, we get a negative correlation coefficient and thus a negative correlation.

If the points are distributed over all four areas, the positive terms and the negative terms cancel each other out and we might end up with a very small or no correlation.
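As a quick illustration of the computation just described, here is a minimal sketch in Python using made-up age and salary values; it subtracts the means, multiplies the deviations pairwise, sums them, and rescales by the denominator.

```python
import numpy as np

age = np.array([23, 31, 37, 42, 48, 55, 61])                           # made-up ages
salary = np.array([28000, 34000, 41000, 46000, 52000, 60000, 63000])   # made-up salaries

dx = age - age.mean()          # subtract the mean age from each age value
dy = salary - salary.mean()    # subtract the mean salary from each salary value

# Sum of products in the numerator; the denominator rescales r to lie between -1 and 1
r = np.sum(dx * dy) / np.sqrt(np.sum(dx**2) * np.sum(dy**2))

print(round(r, 3), round(np.corrcoef(age, salary)[0, 1], 3))  # both values agree
```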

Testing correlation coefficients for significance

In general, the correlation coefficient is calculated using data from a sample. In most cases, however, we want to test a hypothesis about the population.

Pearson Correlation Sample

In the case of correlation analysis, we then want to know if there is a correlation in the population.

For this, we test whether the correlation coefficient in the sample is statistically significantly different from zero.

Hypotheses in the Pearson Correlation

The null hypothesis and the alternative hypothesis in Pearson correlation are thus:

  • Null hypothesis: The correlation coefficient is not significantly different from zero (There is no linear relationship).
  • Alternative hypothesis: The correlation coefficient deviates significantly from zero (there is a linear correlation).

Note: the test always addresses the null hypothesis; the decision is either to reject it or not to reject it.

In our example with the salary and the age of a person, we could thus have the question: Is there a correlation between age and salary in the German population (the population)?

To find out, we draw a sample and test whether the correlation coefficient is significantly different from zero in this sample.

  • The null hypothesis is then: There is no correlation between salary and age in the German population.
  • and the alternative hypothesis: There is a correlation between salary and age in the German population.

Significance and the t-test

Whether the Pearson correlation coefficient is significantly different from zero based on the sample surveyed can be checked using a t-test. In the formula below, r is the correlation coefficient and n is the sample size.

\(t=\dfrac{r\sqrt{n-2}}{\sqrt{1-r^{2}}}\)

A p-value can then be calculated from the test statistic t . If the p-value is smaller than the specified significance level, which is usually 5%, then the null hypothesis is rejected, otherwise it is not.
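As a sketch of this step, assuming an example sample correlation of r = 0.30 from a sample of n = 50, the t statistic and two-sided p-value could be computed in Python as follows.

```python
import numpy as np
from scipy import stats

r, n = 0.30, 50                                  # example sample correlation and sample size
t = r * np.sqrt(n - 2) / np.sqrt(1 - r**2)       # test statistic
p = 2 * stats.t.sf(abs(t), df=n - 2)             # two-sided p-value from the t distribution

print(f"t = {t:.2f}, p = {p:.4f}")               # reject the null hypothesis if p < 0.05
```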

Assumptions of the Pearson correlation

But what about the assumptions for a Pearson correlation? Here we have to distinguish whether we just want to calculate the Pearson correlation coefficient, or whether we want to test a hypothesis.

To calculate the Pearson correlation coefficient, only two metric variables must be present. Metric variables are, for example, a person's weight, a person's salary or electricity consumption.

The Pearson correlation coefficient then tells us how strong the linear relationship is. If there is a non-linear relationship, we cannot read it from the Pearson correlation coefficient.


However, if we want to test whether the Pearson correlation coefficient is significantly different from zero in the sample, i.e. we want to test a hypothesis, the two variables must also be normally distributed!

Pearson correlation normal distribution

If this is not given, the calculated test statistic t or the p-value cannot be interpreted reliably. If the assumptions are not met, Spearman's rank correlation can be used.

Calculate Pearson correlation online with DATAtab

If you like, you can of course calculate a correlation analysis online with DATAtab. To do this, simply copy your data into this table in the statistics calculator and click on either the Hypothesis tests or Correlation tab.

If you now look at two metric variables, a Pearson correlation will be calculated automatically. If you don't know exactly how to interpret the results, you can also just click on Summary in words !


Statistical Hypothesis Analysis in Python with ANOVAs, Chi-Square, and Pearson Correlation


  • Introduction

Python is an incredibly versatile language, useful for a wide variety of tasks in a wide range of disciplines. One such discipline is statistical analysis on datasets, and along with SPSS, Python is one of the most common tools for statistics.

Python's user-friendly and intuitive nature makes running statistical tests and implementing analytical techniques easy, especially through the use of the statsmodels library .

  • Introducing the statsmodels Library in Python

The statsmodels library is a module for Python that gives easy access to a variety of statistical tools for carrying out statistical tests and exploring data. There are a number of statistical tests and functions that the library grants access to, including ordinary least squares (OLS) regressions, generalized linear models, logit models, Principal Component Analysis (PCA) , and Autoregressive Integrated Moving Average (ARIMA) models.

The results of the models are constantly tested against other statistical packages to ensure that the models are accurate. When combined with SciPy and Pandas , it's simple to visualize data, run statistical tests, and check relationships for significance.

  • Choosing a Dataset

Before we can practice statistics with Python, we need to select a dataset. We'll be making use of a dataset compiled by the Gapminder Foundation.

The Gapminder Dataset tracks many variables used to assess the general health and wellness of populations in countries around the world. We'll be using the dataset because it is very well documented, standardized, and complete. We won't have to do much in the way of preprocessing in order to make use of it.

There are a few things we'll want to do just to get the dataset ready to run regressions, ANOVAs, and other tests, but by and large the dataset is ready to work with.

The starting point for our statistical analysis of the Gapminder dataset is exploratory data analysis. We'll use some graphing and plotting functions from Matplotlib and Seaborn to visualize some interesting relationships and get an idea of what variable relationships we may want to explore.

  • Exploratory Data Analysis and Preprocessing

We'll start off by visualizing some possible relationships. Using Seaborn and Pandas we can do some regressions that look at the strength of correlations between the variables in our dataset to get an idea of which variable relationships are worth studying.

We'll import those two and any other libraries we'll be using here:
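The article's original code listings are not reproduced in this copy, so the following is a plausible reconstruction; the file name gapminder.csv and the column names used in this and the later sketches are assumptions rather than the article's exact identifiers.

```python
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import statsmodels.formula.api as smf
import statsmodels.stats.multicomp as multi
import scipy.stats as stats

# Load the Gapminder data (file name assumed)
df = pd.read_csv("gapminder.csv", low_memory=False)
```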

There isn't much preprocessing we have to do, but we do need to do a few things. First, we'll check for any missing or null data and convert any non-numeric entries to numeric. We'll also make a copy of the transformed dataframe that we'll work with:
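A rough reconstruction of this preprocessing step (column names assumed) might look like this:

```python
# Check for missing/null data
print(df.isnull().sum())

# Convert non-numeric entries to numeric; anything unparseable becomes NaN
cols = ["internetuserate", "lifeexpectancy", "employrate",
        "breastcancerper100th", "incomeperperson"]
for col in cols:
    df[col] = pd.to_numeric(df[col], errors="coerce")

# Working copy of the transformed dataframe
df_clean = df.copy()
```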

Here are the outputs:

There's a handful of missing values, but our numeric conversion should turn them into NaN values, allowing exploratory data analysis to be carried out on the dataset.

Specifically, we could try analyzing the relationship between internet use rate and life expectancy, or between internet use rate and employment rate. Let's try making individual graphs of some of these relationships using Seaborn and Matplotlib:
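A plausible plotting sketch for those relationships (column names assumed):

```python
# Scatter plots with fitted regression lines for the relationships discussed below
for y_col, title in [
    ("breastcancerper100th", "Internet use rate vs. breast cancer per 100,000"),
    ("lifeexpectancy", "Internet use rate vs. life expectancy"),
    ("employrate", "Internet use rate vs. employment rate"),
]:
    sns.regplot(x="internetuserate", y=y_col, data=df_clean, fit_reg=True)
    plt.title(title)
    plt.show()
```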

Here are the results of the graphs:

scatter plot of internet use and breast cancer

It looks like there are some interesting relationships that we could further investigate. Interestingly, there seems to be a fairly strong positive relationship between internet use rate and breast cancer, though this is likely just an artifact of better testing in countries that have more access to technology.

There also seems to be a fairly strong, though less linear relationship between life expectancy and internet use rate.

Finally, it looks like there is a parabolic, non-linear relationship between internet use rate and employment rate.

  • Selecting a Suitable Hypothesis

We want to pick out a relationship that merits further exploration. There are many potential relationships here that we could form a hypothesis about and explore the relationship with statistical tests. When we make a hypothesis and run a correlation test between the two variables, if the correlation test is significant, we then need to conduct statistical tests to see just how strong the correlation is and if we can reliably say that the correlation between the two variables is more than just chance.

The type of statistical test we use will depend on the nature of our explanatory and response variables, also known as independent and dependent variables. We'll go over how to run three different types of statistical tests:

  • ANOVAs
  • Chi-Square Tests
  • Regressions (Pearson Correlations).

We'll go with what we visualized above and choose to explore the relationship between internet use rates and life expectancy.

The null hypothesis is that there is no significant relationship between internet use rate and life expectancy, while our hypothesis is that there is a relationship between the two variables.

We're going to be conducting various types of hypothesis tests on the dataset. The type of hypothesis test that we use is dependent on the nature of our explanatory and response variables. Different combinations of explanatory and response variables require different statistical tests. For example, if one variable is categorical and one variable is quantitative in nature, an Analysis of Variance is required.

  • Analysis of Variance (ANOVA)

An Analysis of Variance (ANOVA) is a statistical test employed to compare two or more means. One-way ANOVA tests are utilized to analyze differences between groups and determine if the differences are statistically significant.

One-way ANOVAs compare two or more independent group means, though in practice they are most often used when there are at least three independent groups.

In order to carry out an ANOVA on the Gapminder dataset, we'll need to transform some of the features, as these values in the dataset are continuous but ANOVA analyses are appropriate for situations where one variable is categorical and one variable is quantitative.

We can transform the data from continuous to categorical by selecting a category and binning the variable in question, dividing it into percentiles. The independent variable will be converted into a categorical variable, while the dependent variable will stay continuous. We can use the qcut() function in Pandas to divide the dataframe into bins:
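A sketch of that binning step, assuming the column names introduced earlier and ten quantile-based groups:

```python
# Bin the explanatory variable into ten quantile-based groups (deciles)
df_clean["internet_group"] = pd.qcut(
    df_clean["internetuserate"], 10,
    labels=[f"group_{i}" for i in range(1, 11)],
)
```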

After the variables have been transformed and are ready to be analyzed, we can use the statsmodel library to carry out an ANOVA on the selected features. We'll print out the results of the ANOVA and check to see if the relationship between the two variables is statistically significant:
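A sketch of the ANOVA call with statsmodels' formula API (variable names assumed):

```python
# One-way ANOVA: life expectancy explained by the binned internet use rate
model = smf.ols("lifeexpectancy ~ C(internet_group)", data=df_clean).fit()
print(model.summary())
```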

Here's the output of the model:

We can see that the model gives a very small P-value ( Prob F-statistic ) of 1.71e-35 . This is far less than the usual significance threshold of 0.05 , so we conclude there is a significant relationship between life expectancy and internet use rate.

Since the correlation P-value does seem to be significant, and since we have 10 different categories, we'll want to run a post-hoc test to check that the difference between the means is still significant even after we check for type-1 errors. We can carry out post-hoc tests with the help of the multicomp module, utilizing a Tukey Honestly Significant Difference (Tukey HSD) test:
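A sketch of the Tukey HSD comparison using the multicomp module (variable names assumed):

```python
# Tukey HSD post-hoc comparisons between the ten internet-use groups
sub = df_clean[["lifeexpectancy", "internet_group"]].dropna()
comparison = multi.MultiComparison(sub["lifeexpectancy"], sub["internet_group"])
print(comparison.tukeyhsd().summary())
```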

Here are the results of the test:

Now we have some better insight into which groups in our comparison have statistically significant differences.

If the reject column has a label of True , we know it's recommended that we reject the null hypothesis and assume that there is a significant difference between the two groups being compared.

  • The Chi-Square Test of Independence

ANOVA is appropriate for instances where one variable is continuous and the other is categorical. Now we'll be looking at how to carry out a Chi-Square test of independence .

The Chi-Square test of independence is utilized when both explanatory and response variables are categorical. You likely also want to use the Chi-Square test when the explanatory variable is quantitative and the response variable is categorical, which you can do by dividing the explanatory variable into categories.


The Chi-Square test of independence is a statistical test used to analyze how significant a relationship between two categorical variables is. When a Chi-Square test is run, every category in one variable has its frequency compared against the second variable's categories. This means that the data can be displayed as a frequency table, where the rows represent the independent variables and the columns represent the dependent variables.

Much like we converted our independent variable into a categorical variable (by binning it), for the ANOVA test, we need to make both variables categorical in order to carry out the Chi-Square test. Our hypothesis for this problem is the same as the hypothesis in the previous problem, that there is a significant relationship between life expectancy and internet use rate.

We'll keep things simple for now and divide our internet use rate variable into two categories, though we could easily do more. We'll write a function to handle that.
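A hypothetical version of that helper; the cutoff of 30 is an arbitrary illustration, not the article's value:

```python
# Recode internet use rate into two categories around an assumed cutoff
def bin_internet_use(rate, cutoff=30.0):
    if pd.isnull(rate):
        return np.nan
    return "high" if rate > cutoff else "low"

df_clean["internet_binned"] = df_clean["internetuserate"].apply(bin_internet_use)
```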

We'll be conducting post-hoc comparison to guard against type-1 errors (false positives) using an approach called the Bonferroni Adjustment . In order to do this, you can carry out comparisons for the different possible pairs of your response variable, and then you check their adjusted significance.

We won't run comparisons for all the different possible pairs here, we'll just show how it can be done. We'll make a few different comparisons using a re-coding scheme and map the records into new feature columns.

Afterwards, we can check the observed counts and create tables of those comparisons:

Running a Chi-Square test and post-hoc comparison involves first constructing a cross-tabs comparison table. The cross-tabs comparison table shows the percentage of occurrence for the response variable for the different levels of the explanatory variable.

Just to get an idea of how this works, let's print out the results for all the life expectancy bin comparisons:

We can see that a cross-tab comparison checks for the frequency of one variable's categories in the second variable. Above we see the distribution of life expectancies in situations where they fall into one of the two bins we created.

Now we need to compute the cross-tabs for the different pairs we created above, as this is what we run through the Chi-Square test:

Once we have transformed the variables so that the Chi-Square test can be carried out, we can use the chi2_contingency function from SciPy (scipy.stats) to carry out the test.

We want to print out the column percentages as well as the results of the Chi-Square test, and we'll create a function to do this. We'll then use our function to do the Chi-Square test for the four comparison tables we created:
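A sketch of such a helper and one example comparison table; the exact recoded pairs from the article are not reproduced here, so the life-expectancy quartiles below are illustrative:

```python
# Print column percentages and Chi-Square results for a cross-tab
def chi_square_report(crosstab):
    print(crosstab / crosstab.sum(axis=0))                    # column percentages
    chi2, p, dof, expected = stats.chi2_contingency(crosstab)
    print(f"chi2 = {chi2:.3f}, p = {p:.3g}, dof = {dof}")

# Example comparison: life expectancy quartiles vs. the two internet-use categories
life_bins = pd.qcut(df_clean["lifeexpectancy"], 4)
table = pd.crosstab(life_bins, df_clean["internet_binned"])
chi_square_report(table)
```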

Here are the results:

If we're only looking at the results for the full count table, it looks like there's a P-value of 6.064860600653971e-18 .

However, in order to ascertain how the different groups diverge from one another, we need to carry out the Chi-Square test for the different pairs in our dataframe. We'll check to see if there is a statistically significant difference for each of the different pairs we selected. Note that the P-value which indicates a significant result changes depending on how many comparisons you are making, and while we won't cover that in this tutorial, you'll need to be mindful of it.

The 6 vs 9 comparison gives us a P-value of 0.127 , which is above the 0.05 threshold, indicating that the difference for that category may be non-significant. Seeing the differences of the comparisons helps us understand why we need to compare different levels with one another.

  • Pearson Correlation

We've covered the test you should use when you have a categorical explanatory variable and a quantitative response variable (ANOVA), as well as the test you use when you have two categorical variables (Chi-Squared).

We'll now take a look at the appropriate type of test to use when you have a quantitative explanatory variable and a quantitative response variable - the Pearson Correlation .

The Pearson Correlation test is used to analyze the strength of a relationship between two provided variables, both quantitative in nature. The value, or strength of the Pearson correlation, will be between +1 and -1 .

A correlation of 1 indicates a perfect association between the variables, and the correlation is either positive or negative. Correlation coefficients near 0 indicate very weak, almost non-existent, correlations. While there are other ways of measuring correlations between two variables, such as Spearman Correlation or Kendall Rank Correlation , the Pearson correlation is probably the most commonly used correlation test.

As the Gapminder dataset has its features represented with quantitative variables, we don't need to do any categorical transformation of the data before running a Pearson Correlation on it. Note that it's assumed that both variables are normally distributed and there aren't many significant outliers in the dataset. We'll need access to SciPy in order to carry out the Pearson correlation.

We'll graph the relationship between life expectancy and internet use rates, as well as internet use rate and employment rate, just to see what another correlation graph might look like. After creating a graphing function, we'll use the pearsonr() function from SciPy to carry out the correlation and check the results:
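A sketch of the correlation step itself (column names assumed):

```python
# Pearson correlations for the two relationships discussed below
sub = df_clean[["internetuserate", "lifeexpectancy", "employrate"]].dropna()

print(stats.pearsonr(sub["internetuserate"], sub["lifeexpectancy"]))
print(stats.pearsonr(sub["internetuserate"], sub["employrate"]))
```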

correlation between internet use rate and life expectancy

The first value is the direction and strength of the correlation, while the second is the P-value. The numbers suggest a fairly strong correlation between life expectancy and internet use rate that isn't due to chance. Meanwhile, there's a weaker, though still significant correlation between employment rate and internet use rate.

Note that it is also possible to run a Pearson Correlation on categorical data, though the results will look somewhat different. If we wanted to, we could group the income levels and run the Pearson Correlation on them. You can use it to check for the presence of moderating variables that could be having an effect on your association of interest.

  • Moderators and Statistical Interaction

Let's look at how to account for statistical interaction between multiple variables, AKA moderation.

Moderation is when a third (or more) variable impacts the strength of the association between the independent variable and the dependent variable.

There are different ways to test for moderation/statistical interaction between a third variable and the independent/dependent variables. For example, if you carried out an ANOVA test, you could test for moderation by doing a two-way ANOVA test in order to test for possible moderation.

However, a reliable way to test for moderation, no matter what type of statistical test you ran (ANOVA, Chi-Square, Pearson Correlation) is to check if there is an association between explanatory and response variables for every subgroup/level of the third variable.

To be more concrete, if you were carrying out ANOVA tests, you could just run an ANOVA for every category in the third variable (the variable you suspect might have a moderating effect on the relationship you are studying).

If you were using a Chi-Square test, you could just carry out a Chi-Square test on new dataframes holding all data points found within the categories of your moderating variable.

If your statistical test is a Pearson correlation, you would need to create categories or bins for the moderating variable and then run the Pearson correlation for all three of those bins.

Let's take a quick look at how to carry out Pearson Correlations for moderating variables. We'll create artificial categories/levels out of our continuous features. The process for testing for moderation for the other two test types (Chi-Square and ANOVA) is very similar, but you'll have pre-existing categorical variables to work with instead.

We'll want to choose a suitable variable to act as our moderating variable. Let's try income level per person and divide it into three different groups:
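A sketch of that moderation check, assuming the income column name and quantile-based cut points:

```python
# Split income per person into three groups and run the correlation within each one
sub = df_clean[["incomeperperson", "internetuserate", "lifeexpectancy"]].dropna().copy()
sub["income_group"] = pd.qcut(sub["incomeperperson"], 3, labels=["low", "middle", "high"])

for level, group in sub.groupby("income_group", observed=True):
    print(level, stats.pearsonr(group["internetuserate"], group["lifeexpectancy"]))
```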

Once more, the first value is the direction and strength of the correlation, while the second is the P-value.

statsmodels is an extremely useful library that allows Python users to analyze data and run statistical tests on datasets. You can carry out ANOVAs, Chi-Square Tests, Pearson Correlations and tests for moderation.

Once you become familiar with how to carry out these tests, you'll be able to test for significant relationships between dependent and independent variables, adapting for the categorical or continuous nature of the variables.



scipy.stats.pearsonr

Pearson correlation coefficient and p-value for testing non-correlation.

The Pearson correlation coefficient [1] measures the linear relationship between two datasets. Like other correlation coefficients, this one varies between -1 and +1 with 0 implying no correlation. Correlations of -1 or +1 imply an exact linear relationship. Positive correlations imply that as x increases, so does y. Negative correlations imply that as x increases, y decreases.

This function also performs a test of the null hypothesis that the distributions underlying the samples are uncorrelated and normally distributed. (See Kowalski [3] for a discussion of the effects of non-normality of the input on the distribution of the correlation coefficient.) The p-value roughly indicates the probability of an uncorrelated system producing datasets that have a Pearson correlation at least as extreme as the one computed from these datasets.

x, y : array_like
Input arrays; x and y must have the same length.

alternative : {'two-sided', 'less', 'greater'}, optional
Defines the alternative hypothesis. Default is ‘two-sided’. The following options are available:

‘two-sided’: the correlation is nonzero

‘less’: the correlation is negative (less than zero)

‘greater’: the correlation is positive (greater than zero)

New in version 1.9.0.

method : optional
Defines the method used to compute the p-value. If method is an instance of PermutationMethod / MonteCarloMethod , the p-value is computed using scipy.stats.permutation_test / scipy.stats.monte_carlo_test with the provided configuration options and other appropriate settings. Otherwise, the p-value is computed as documented in the notes.

New in version 1.11.0.

An object with the following attributes:

statistic : float
Pearson product-moment correlation coefficient.

pvalue : float
The p-value associated with the chosen alternative.

The object has the following method:

confidence_interval(confidence_level=0.95, method=None)
This computes the confidence interval of the correlation coefficient statistic for the given confidence level. The confidence interval is returned in a namedtuple with fields low and high . If method is not provided, the confidence interval is computed using the Fisher transformation [1] . If method is an instance of BootstrapMethod , the confidence interval is computed using scipy.stats.bootstrap with the provided configuration options and other appropriate settings. In some cases, confidence limits may be NaN due to a degenerate resample, and this is typical for very small samples (~6 observations).

Warns ConstantInputWarning
Raised if an input is a constant array. The correlation coefficient is not defined in this case, so np.nan is returned.

Warns NearConstantInputWarning
Raised if an input is “nearly” constant. The array x is considered nearly constant if norm(x - mean(x)) < 1e-13 * abs(mean(x)) . Numerical errors in the calculation x - mean(x) in this case might result in an inaccurate calculation of r.

See also: spearmanr
Spearman rank-order correlation coefficient.

See also: kendalltau
Kendall’s tau, a correlation measure for ordinal data.

The correlation coefficient is calculated as follows:

\(r=\dfrac{\sum \left(x-m_{x}\right)\left(y-m_{y}\right)}{\sqrt{\sum \left(x-m_{x}\right)^{2}\sum \left(y-m_{y}\right)^{2}}}\)

where \(m_x\) is the mean of the vector x and \(m_y\) is the mean of the vector y.

Under the assumption that x and y are drawn from independent normal distributions (so the population correlation coefficient is 0), the probability density function of the sample correlation coefficient r is ( [1] , [2] ):

\(f(r)=\dfrac{\left(1-r^{2}\right)^{n/2-2}}{\mathrm{B}\!\left(\tfrac{1}{2},\tfrac{n}{2}-1\right)}\)

where n is the number of samples, and B is the beta function. This is sometimes referred to as the exact distribution of r. This is the distribution that is used in pearsonr to compute the p-value when the method parameter is left at its default value (None). The distribution is a beta distribution on the interval [-1, 1], with equal shape parameters a = b = n/2 - 1. In terms of SciPy’s implementation of the beta distribution, the distribution of r is:
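The code snippet referenced here was stripped from this copy; a reconstruction consistent with the description above is:

```python
import scipy.stats

n = 10   # sample size (illustrative)
# r under the null hypothesis: a beta distribution with a = b = n/2 - 1, shifted/scaled to [-1, 1]
dist = scipy.stats.beta(n/2 - 1, n/2 - 1, loc=-1, scale=2)
```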

The default p-value returned by pearsonr is a two-sided p-value. For a given sample with correlation coefficient r, the p-value is the probability that abs(r’) of a random sample x’ and y’ drawn from the population with zero correlation would be greater than or equal to abs(r). In terms of the object dist shown above, the p-value for a given r and length n can be computed as:
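Likewise, the referenced snippet can be reconstructed as follows, continuing from the dist object defined above:

```python
r_obs = 0.6                       # observed sample correlation (illustrative)
p = 2 * dist.cdf(-abs(r_obs))     # two-sided p-value
```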

When n is 2, the above continuous distribution is not well-defined. One can interpret the limit of the beta distribution as the shape parameters a and b approach a = b = 0 as a discrete distribution with equal probability masses at r = 1 and r = -1. More directly, one can observe that, given the data x = [x1, x2] and y = [y1, y2], and assuming x1 != x2 and y1 != y2, the only possible values for r are 1 and -1. Because abs(r’) for any sample x’ and y’ with length 2 will be 1, the two-sided p-value for a sample of length 2 is always 1.

For backwards compatibility, the object that is returned also behaves like a tuple of length two that holds the statistic and the p-value.

[1] “Pearson correlation coefficient”, Wikipedia, https://en.wikipedia.org/wiki/Pearson_correlation_coefficient

[2] Student, “Probable error of a correlation coefficient”, Biometrika, Volume 6, Issue 2-3, 1 September 1908, pp. 302-310.

[3] C. J. Kowalski, “On the Effects of Non-Normality on the Distribution of the Sample Product-Moment Correlation Coefficient”, Journal of the Royal Statistical Society. Series C (Applied Statistics), Vol. 21, No. 1 (1972), pp. 1-12.

The documentation then gives examples showing how to perform an exact permutation version of the test, how to perform the test under the null hypothesis that the data were drawn from uniform distributions, how to produce an asymptotic 90% confidence interval, and how to obtain a bootstrap confidence interval.
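Those example snippets were not preserved in this copy; the following combined sketch, using illustrative data and assuming SciPy 1.11 or later for the resampling-method classes, shows the corresponding calls:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng()
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0])     # illustrative data
y = np.array([10.0, 9.0, 2.5, 6.0, 4.0, 3.0, 2.0])

res = stats.pearsonr(x, y)
print(res)

# Exact permutation version of the test
print(stats.pearsonr(x, y, method=stats.PermutationMethod(n_resamples=np.inf)))

# Test under the null hypothesis that the data were drawn from uniform distributions
print(stats.pearsonr(x, y, method=stats.MonteCarloMethod(rvs=(rng.uniform, rng.uniform))))

# Asymptotic 90% confidence interval (Fisher transformation)
print(res.confidence_interval(confidence_level=0.9))

# Bootstrap confidence interval
print(res.confidence_interval(confidence_level=0.9, method=stats.BootstrapMethod(method='BCa')))
```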

There is a linear dependence between x and y if y = a + b*x + e, where a,b are constants and e is a random error term, assumed to be independent of x. For simplicity, assume that x is standard normal, a=0, b=1 and let e follow a normal distribution with mean zero and standard deviation s>0.

This should be close to the exact value given by \(1/\sqrt{1+s^{2}}\).

For s=0.5, we observe a high level of correlation. In general, a large variance of the noise reduces the correlation, while the correlation approaches one as the variance of the error goes to zero.
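A reconstruction of the simulation described above, with s = 0.5 and an illustrative sample size:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(seed=0)
s = 0.5
x = rng.standard_normal(50_000)        # x is standard normal
e = rng.normal(scale=s, size=50_000)   # independent noise with standard deviation s
y = x + e                              # a = 0, b = 1

r_sample = stats.pearsonr(x, y).statistic
r_exact = 1 / np.sqrt(1 + s**2)        # exact population correlation
print(r_sample, r_exact)               # both close to 0.894
```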

It is important to keep in mind that no correlation does not imply independence unless (x, y) is jointly normal. Correlation can even be zero when there is a very simple dependence structure: if X follows a standard normal distribution, let y = abs(x). Note that the correlation between x and y is zero. Indeed, since the expectation of x is zero, cov(x, y) = E[x*y]. By definition, this equals E[x*abs(x)] which is zero by symmetry. The following lines of code illustrate this observation:
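A reconstruction of those lines, with an illustrative sample size:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(seed=1)
x = rng.standard_normal(10_000)
y = np.abs(x)
print(stats.pearsonr(x, y))   # correlation is close to zero even though y is a function of x
```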

A non-zero correlation coefficient can be misleading. For example, if X has a standard normal distribution, define y = x if x < 0 and y = 0 otherwise. A simple calculation shows that corr(x, y) = sqrt(2/Pi) = 0.797…, implying a high level of correlation:
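A reconstruction of the corresponding snippet:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(seed=2)
x = rng.standard_normal(10_000)
y = np.where(x < 0, x, 0)
print(stats.pearsonr(x, y))   # statistic close to sqrt(2/pi) ≈ 0.797
```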

This is unintuitive since there is no dependence of x and y if x is larger than zero which happens in about half of the cases if we sample x and y.

Test a hypothesis about the strength of a Pearson's r correlation

Description.

test_correlation is suitable for testing a hypothesis about the strength of the correlation between two continuous variables (designs in which Pearson's r is a suitable measure of correlation).

This function can be passed an esci_estimate object generated by estimate_r() .

It can test hypotheses about a specific value for the correlation (a point null) or about a range of values (an interval null).

Returns a list with 1-2 data frames

point_null - always returned

test_type - 'Nil hypothesis test', meaning a test against H0 = 0

outcome_variable_name - Name of the outcome variable

effect - Label for the effect being tested

null_words - Express the null in words

confidence - Confidence level, integer (95 for 95%, etc.)

LL - Lower boundary of the confidence% CI for the effect

UL - Upper boundary of the confidence% CI for the effect

CI - Character representation of the CI for the effect

CI_compare - Text description of relation between CI and null

t - If applicable, t value for hypothesis test

df - If applicable, degrees of freedom for hypothesis test

p - If applicable, p value for hypothesis test

p_result - Text representation of p value obtained

null_decision - Text representation of the decision for the null

conclusion - Text representation of conclusion to draw

significant - TRUE/FALSE if significant at alpha = 1-CI

interval_null - returned only if an interval null is specified

test_type - 'Practical significance test', meaning a test against an interval null

outcome_variable_name -

effect - Name of the outcome variable

rope - Text representation of the null interval

rope_compare - Text description of relation between CI and null interval


Statistics LibreTexts

12.5: Testing the Significance of the Correlation Coefficient


The correlation coefficient, \(r\), tells us about the strength and direction of the linear relationship between \(x\) and \(y\). However, the reliability of the linear model also depends on how many observed data points are in the sample. We need to look at both the value of the correlation coefficient \(r\) and the sample size \(n\), together. We perform a hypothesis test of the "significance of the correlation coefficient" to decide whether the linear relationship in the sample data is strong enough to use to model the relationship in the population.

The sample data are used to compute \(r\), the correlation coefficient for the sample. If we had data for the entire population, we could find the population correlation coefficient. But because we have only sample data, we cannot calculate the population correlation coefficient. The sample correlation coefficient, \(r\), is our estimate of the unknown population correlation coefficient.

  • The symbol for the population correlation coefficient is \(\rho\), the Greek letter "rho."
  • \(\rho =\) population correlation coefficient (unknown)
  • \(r =\) sample correlation coefficient (known; calculated from sample data)

The hypothesis test lets us decide whether the value of the population correlation coefficient \(\rho\) is "close to zero" or "significantly different from zero". We decide this based on the sample correlation coefficient \(r\) and the sample size \(n\).

If the test concludes that the correlation coefficient is significantly different from zero, we say that the correlation coefficient is "significant."

  • Conclusion: There is sufficient evidence to conclude that there is a significant linear relationship between \(x\) and \(y\) because the correlation coefficient is significantly different from zero.
  • What the conclusion means: There is a significant linear relationship between \(x\) and \(y\). We can use the regression line to model the linear relationship between \(x\) and \(y\) in the population.

If the test concludes that the correlation coefficient is not significantly different from zero (it is close to zero), we say that the correlation coefficient is "not significant".

  • Conclusion: "There is insufficient evidence to conclude that there is a significant linear relationship between \(x\) and \(y\) because the correlation coefficient is not significantly different from zero."
  • What the conclusion means: There is not a significant linear relationship between \(x\) and \(y\). Therefore, we CANNOT use the regression line to model a linear relationship between \(x\) and \(y\) in the population.
  • If \(r\) is significant and the scatter plot shows a linear trend, the line can be used to predict the value of \(y\) for values of \(x\) that are within the domain of observed \(x\) values.
  • If \(r\) is not significant OR if the scatter plot does not show a linear trend, the line should not be used for prediction.
  • If \(r\) is significant and if the scatter plot shows a linear trend, the line may NOT be appropriate or reliable for prediction OUTSIDE the domain of observed \(x\) values in the data.

PERFORMING THE HYPOTHESIS TEST

  • Null Hypothesis: \(H_{0}: \rho = 0\)
  • Alternate Hypothesis: \(H_{a}: \rho \neq 0\)

WHAT THE HYPOTHESES MEAN IN WORDS:

  • Null Hypothesis \(H_{0}\): The population correlation coefficient IS NOT significantly different from zero. There IS NOT a significant linear relationship (correlation) between \(x\) and \(y\) in the population.
  • Alternate Hypothesis \(H_{a}\) : The population correlation coefficient IS significantly DIFFERENT FROM zero. There IS A SIGNIFICANT LINEAR RELATIONSHIP (correlation) between \(x\) and \(y\) in the population.

DRAWING A CONCLUSION: There are two methods of making the decision. The two methods are equivalent and give the same result.

  • Method 1: Using the \(p\text{-value}\)
  • Method 2: Using a table of critical values

In this chapter of this textbook, we will always use a significance level of 5%, \(\alpha = 0.05\).

Using the \(p\text{-value}\) method, you could choose any appropriate significance level you want; you are not limited to using \(\alpha = 0.05\). But the table of critical values provided in this textbook assumes that we are using a significance level of 5%, \(\alpha = 0.05\). (If we wanted to use a different significance level than 5% with the critical value method, we would need different tables of critical values that are not provided in this textbook.)

METHOD 1: Using a \(p\text{-value}\) to make a decision

Using the TI-83, 83+, 84, 84+ calculator.

To calculate the \(p\text{-value}\) using LinRegTTEST:

On the LinRegTTEST input screen, on the line prompt for \(\beta\) or \(\rho\), highlight "\(\neq 0\)"

The output screen shows the \(p\text{-value}\) on the line that reads "\(p =\)".

(Most computer statistical software can calculate the \(p\text{-value}\).)
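For example, SciPy computes both the sample correlation and the two-tailed \(p\text{-value}\) in a single call. The sketch below is only an illustration; the paired data are made up:

    from scipy.stats import pearsonr

    x = [3, 5, 2, 8, 7, 6, 9, 4]        # hypothetical paired observations
    y = [9, 13, 7, 20, 16, 15, 22, 11]

    r, p = pearsonr(x, y)               # sample correlation and two-tailed p-value
    print(round(r, 3), round(p, 4))     # reject H0: rho = 0 at alpha = 0.05 if p < 0.05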

If the \(p\text{-value}\) is less than the significance level ( \(\alpha = 0.05\) ):

  • Decision: Reject the null hypothesis.
  • Conclusion: "There is sufficient evidence to conclude that there is a significant linear relationship between \(x\) and \(y\) because the correlation coefficient is significantly different from zero."

If the \(p\text{-value}\) is NOT less than the significance level ( \(\alpha = 0.05\) ):

  • Decision: DO NOT REJECT the null hypothesis.
  • Conclusion: "There is insufficient evidence to conclude that there is a significant linear relationship between \(x\) and \(y\) because the correlation coefficient is NOT significantly different from zero."

Calculation Notes:

  • You will use technology to calculate the \(p\text{-value}\). The following describes the calculations to compute the test statistic and the \(p\text{-value}\):
  • The \(p\text{-value}\) is calculated using a \(t\)-distribution with \(n - 2\) degrees of freedom.
  • The formula for the test statistic is \(t = \frac{r\sqrt{n-2}}{\sqrt{1-r^{2}}}\). The value of the test statistic, \(t\), is shown in the computer or calculator output along with the \(p\text{-value}\). The test statistic \(t\) has the same sign as the correlation coefficient \(r\).
  • The \(p\text{-value}\) is the combined area in both tails.

An alternative way to calculate the \(p\text{-value}\) ( \(p\) ) given by LinRegTTest is the command 2*tcdf(abs(t),10^99, n-2) in 2nd DISTR.
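The same calculation is easy to reproduce outside the calculator. Here is a minimal sketch in Python with SciPy (the helper name is ours, not from the text), mirroring the formula and the two-tailed area described above:

    from math import sqrt
    from scipy import stats

    def corr_t_test(r, n):
        # Two-tailed test of H0: rho = 0 from a sample correlation r and sample size n
        df = n - 2
        t = r * sqrt(df) / sqrt(1 - r**2)   # test statistic t = r*sqrt(n-2)/sqrt(1-r^2)
        p = 2 * stats.t.sf(abs(t), df)      # combined area in both tails
        return t, df, p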

THIRD-EXAM vs FINAL-EXAM EXAMPLE: \(p\text{-value}\) method

  • Consider the third exam/final exam example.
  • The line of best fit is: \(\hat{y} = -173.51 + 4.83x\) with \(r = 0.6631\) and there are \(n = 11\) data points.
  • Can the regression line be used for prediction? Given a third exam score ( \(x\) value), can we use the line to predict the final exam score (predicted \(y\) value)?
  • \(H_{0}: \rho = 0\)
  • \(H_{a}: \rho \neq 0\)
  • \(\alpha = 0.05\)
  • The \(p\text{-value}\) is 0.026 (from LinRegTTest on your calculator or from computer software).
  • The \(p\text{-value}\), 0.026, is less than the significance level of \(\alpha = 0.05\).
  • Decision: Reject the Null Hypothesis \(H_{0}\)
  • Conclusion: There is sufficient evidence to conclude that there is a significant linear relationship between the third exam score (\(x\)) and the final exam score (\(y\)) because the correlation coefficient is significantly different from zero.

Because \(r\) is significant and the scatter plot shows a linear trend, the regression line can be used to predict final exam scores.
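Running the sketch above on the third-exam numbers reproduces the calculator's output:

    t, df, p = corr_t_test(r=0.6631, n=11)
    print(round(t, 2), df, round(p, 3))    # about 2.66, 9, and 0.026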

METHOD 2: Using a table of Critical Values to make a decision

The 95% Critical Values of the Sample Correlation Coefficient Table can be used to give you a good idea of whether the computed value of \(r\) is significant or not . Compare \(r\) to the appropriate critical value in the table. If \(r\) is not between the positive and negative critical values, then the correlation coefficient is significant. If \(r\) is significant, then you may want to use the line for prediction.
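The tabled critical values come from the same \(t\)-distribution: setting the test statistic equal to the two-tailed critical value \(t^{*}\) with \(n - 2\) degrees of freedom and solving \(t^{*} = \frac{r\sqrt{n-2}}{\sqrt{1-r^{2}}}\) for \(r\) gives \(r_{crit} = t^{*}/\sqrt{(t^{*})^{2} + n - 2}\). A sketch, again assuming SciPy is available and using a helper name of our own:

    from math import sqrt
    from scipy import stats

    def r_critical(n, alpha=0.05):
        # Two-tailed critical value of r for testing H0: rho = 0 with n data points
        df = n - 2
        t_crit = stats.t.ppf(1 - alpha / 2, df)
        return t_crit / sqrt(t_crit**2 + df)

    print(round(r_critical(10), 3))    # 0.632, matching the table entry for df = 8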

Example \(\PageIndex{1}\)

Suppose you computed \(r = 0.801\) using \(n = 10\) data points. \(df = n - 2 = 10 - 2 = 8\). The critical values associated with \(df = 8\) are \(-0.632\) and \(+0.632\). If \(r <\) negative critical value or \(r >\) positive critical value, then \(r\) is significant. Since \(r = 0.801\) and \(0.801 > 0.632\), \(r\) is significant and the line may be used for prediction. If you view this example on a number line, it will help you.

Horizontal number line with values of -1, -0.632, 0, 0.632, 0.801, and 1. A dashed line above values -0.632, 0, and 0.632 indicates not significant values.

Exercise \(\PageIndex{1}\)

For a given line of best fit, you computed that \(r = 0.6501\) using \(n = 12\) data points and the critical value is 0.576. Can the line be used for prediction? Why or why not?

If the scatter plot looks linear then, yes, the line can be used for prediction, because \(r >\) the positive critical value.

Example \(\PageIndex{2}\)

Suppose you computed \(r = –0.624\) with 14 data points. \(df = 14 – 2 = 12\). The critical values are \(-0.532\) and \(0.532\). Since \(-0.624 < -0.532\), \(r\) is significant and the line can be used for prediction.

Horizontal number line with values of -0.624, -0.532, and 0.532.

Exercise \(\PageIndex{2}\)

For a given line of best fit, you compute that \(r = 0.5204\) using \(n = 9\) data points, and the critical value is \(0.666\). Can the line be used for prediction? Why or why not?

No, the line cannot be used for prediction, because \(r <\) the positive critical value.

Example \(\PageIndex{3}\)

Suppose you computed \(r = 0.776\) and \(n = 6\). \(df = 6 - 2 = 4\). The critical values are \(-0.811\) and \(0.811\). Since \(-0.811 < 0.776 < 0.811\), \(r\) is not significant, and the line should not be used for prediction.

Horizontal number line with values of -0.811, 0.776, and 0.811.

Exercise \(\PageIndex{3}\)

For a given line of best fit, you compute that \(r = -0.7204\) using \(n = 8\) data points, and the critical value is \(0.707\). Can the line be used for prediction? Why or why not?

Yes, the line can be used for prediction, because \(r <\) the negative critical value.

THIRD-EXAM vs FINAL-EXAM EXAMPLE: critical value method

Consider the third exam/final exam example. The line of best fit is: \(\hat{y} = -173.51 + 4.83x\) with \(r = 0.6631\) and there are \(n = 11\) data points. Can the regression line be used for prediction? Given a third-exam score ( \(x\) value), can we use the line to predict the final exam score (predicted \(y\) value)?

  • Use the "95% Critical Value" table for \(r\) with \(df = n - 2 = 11 - 2 = 9\).
  • The critical values are \(-0.602\) and \(+0.602\)
  • Since \(0.6631 > 0.602\), \(r\) is significant.
  • Conclusion: There is sufficient evidence to conclude that there is a significant linear relationship between the third exam score (\(x\)) and the final exam score (\(y\)) because the correlation coefficient is significantly different from zero.

Example \(\PageIndex{4}\)

Suppose you computed the following correlation coefficients. Using the table at the end of the chapter, determine if \(r\) is significant and whether the line of best fit associated with each \(r\) can be used to predict a \(y\) value. If it helps, draw a number line.

  • \(r = –0.567\) and the sample size, \(n\), is \(19\). The \(df = n - 2 = 17\). The critical value is \(-0.456\). \(-0.567 < -0.456\) so \(r\) is significant.
  • \(r = 0.708\) and the sample size, \(n\), is \(9\). The \(df = n - 2 = 7\). The critical value is \(0.666\). \(0.708 > 0.666\) so \(r\) is significant.
  • \(r = 0.134\) and the sample size, \(n\), is \(14\). The \(df = 14 - 2 = 12\). The critical value is \(0.532\). \(0.134\) is between \(-0.532\) and \(0.532\) so \(r\) is not significant.
  • \(r = 0\) and the sample size, \(n\), is five. No matter what the \(df\) is, \(r = 0\) is between the two critical values, so \(r\) is not significant.
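For reference, the r_critical sketch from Method 2 reproduces these table entries: r_critical(19), r_critical(9), and r_critical(14) return approximately 0.456, 0.666, and 0.532.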

Exercise \(\PageIndex{4}\)

For a given line of best fit, you compute that \(r = 0\) using \(n = 100\) data points. Can the line be used for prediction? Why or why not?

No, the line cannot be used for prediction no matter what the sample size is.

Assumptions in Testing the Significance of the Correlation Coefficient

Testing the significance of the correlation coefficient requires that certain assumptions about the data are satisfied. The premise of this test is that the data are a sample of observed points taken from a larger population. We have not examined the entire population because it is not possible or feasible to do so. We are examining the sample to draw a conclusion about whether the linear relationship that we see between \(x\) and \(y\) in the sample data provides strong enough evidence so that we can conclude that there is a linear relationship between \(x\) and \(y\) in the population.

The regression line equation that we calculate from the sample data gives the best-fit line for our particular sample. We want to use this best-fit line for the sample as an estimate of the best-fit line for the population. Examining the scatter plot and testing the significance of the correlation coefficient helps us determine if it is appropriate to do this.

The assumptions underlying the test of significance are:

  • There is a linear relationship in the population that models the average value of \(y\) for varying values of \(x\). In other words, the expected value of \(y\) for each particular value of \(x\) lies on a straight line in the population. (We do not know the equation for the line for the population. Our regression line from the sample is our best estimate of this line in the population.)
  • The \(y\) values for any particular \(x\) value are normally distributed about the line. This implies that there are more \(y\) values scattered closer to the line than are scattered farther away. Assumption (1) implies that these normal distributions are centered on the line: the means of these normal distributions of \(y\) values lie on the line.
  • The standard deviations of the population \(y\) values about the line are equal for each value of \(x\). In other words, each of these normal distributions of \(y\) values has the same shape and spread about the line.
  • The residual errors are mutually independent (no pattern).
  • The data are produced from a well-designed, random sample or randomized experiment.

The left graph shows three sets of points. Each set falls in a vertical line. The points in each set are normally distributed along the line — they are densely packed in the middle and more spread out at the top and bottom. A downward sloping regression line passes through the mean of each set. The right graph shows the same regression line plotted. A vertical normal curve is shown for each line.

Linear regression is a procedure for fitting a straight line of the form \(\hat{y} = a + bx\) to data. The conditions for regression are:

  • Linear In the population, there is a linear relationship that models the average value of \(y\) for different values of \(x\).
  • Independent The residuals are assumed to be independent.
  • Normal The \(y\) values are distributed normally for any value of \(x\).
  • Equal variance The standard deviation of the \(y\) values is equal for each \(x\) value.
  • Random The data are produced from a well-designed random sample or randomized experiment.

The slope \(b\) and intercept \(a\) of the least-squares line estimate the slope \(\beta\) and intercept \(\alpha\) of the population (true) regression line. To estimate the population standard deviation of \(y\), \(\sigma\), use the standard deviation of the residuals, \(s\), where \(s = \sqrt{\frac{SSE}{n-2}}\). The variable \(\rho\) (rho) is the population correlation coefficient. To test the null hypothesis \(H_{0}: \rho =\) hypothesized value, use a linear regression t-test. The most common null hypothesis is \(H_{0}: \rho = 0\), which indicates there is no linear relationship between \(x\) and \(y\) in the population. The TI-83, 83+, 84, 84+ calculator function LinRegTTest can perform this test (STAT > TESTS > LinRegTTest).

Formula Review

Least Squares Line or Line of Best Fit:

\[\hat{y} = a + bx\]

\[a = y\text{-intercept}\]

\[b = \text{slope}\]

Standard deviation of the residuals:

\[s = \sqrt{\frac{SSE}{n-2}}\]

\[SSE = \text{sum of squared errors}\]

\[n = \text{the number of data points}\]
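In software, the least-squares line and the residual standard deviation fall out of an ordinary fit. A minimal NumPy sketch (the helper name is ours, not from the text):

    import numpy as np

    def residual_sd(x, y):
        # x and y: equal-length sequences of observed values
        x = np.asarray(x, dtype=float)
        y = np.asarray(y, dtype=float)
        b, a = np.polyfit(x, y, 1)             # slope b and intercept a of y-hat = a + b*x
        sse = np.sum((y - (a + b * x)) ** 2)   # SSE, the sum of squared errors
        return np.sqrt(sse / (len(x) - 2))     # s, the standard deviation of the residuals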

Time Series Cross Correlation (Space Time Pattern Mining)

Calculates the cross correlation at various time lags between two time series stored in a space-time cube.

The cross correlation is calculated by pairing the corresponding values of each time series and calculating a Pearson correlation coefficient. The second time series is then shifted by one time step, and a new correlation is calculated. This shifting repeats up to a specified maximum number of time steps. The time lag (shift) with the strongest correlation is an estimate of the delay between changes in one time series and responses in the other (for example, the delay between advertising spending and sales revenue). You can filter and remove trends from the time series to test for statistically significant dependence between the variables. You can also include spatial neighbors in the calculations to incorporate spatial relationships between the two time series.
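The shifting idea is straightforward to express in code. The sketch below only illustrates lagged Pearson correlations between two equal-length series; it is not the ArcGIS implementation, which additionally handles trend removal, filtering, and spatial neighbors:

    import numpy as np

    def lagged_correlations(primary, secondary, max_lag):
        # Pearson correlation for each shift of the secondary series relative to the primary
        primary = np.asarray(primary, dtype=float)
        secondary = np.asarray(secondary, dtype=float)
        n = len(primary)
        results = {}
        for lag in range(-max_lag, max_lag + 1):
            if lag >= 0:
                a, b = primary[lag:], secondary[:n - lag]   # secondary shifted forward by lag
            else:
                a, b = primary[:n + lag], secondary[-lag:]  # secondary shifted backward by -lag
            results[lag] = np.corrcoef(a, b)[0, 1]
        return results

In this sketch a positive lag pairs the primary value at time t with the secondary value at time t - lag, which matches the description of shifting the secondary series forward on the time axis.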

Learn more about how Time Series Cross Correlation works

  • Illustration

Time Series Cross Correlation tool illustration

The sign (positive or negative) of a time lag value is interpreted as the shift of the secondary analysis variable relative to the primary analysis variable. For example, a time lag value of 5 means that the secondary variable is shifted five time steps forward (right on the time axis) before calculating the cross correlation. If the time lag with the strongest correlation is positive, it means that changes in the value of the secondary analysis variable occur before changes in the primary analysis variable. Similarly, a time lag value of -3 means that the secondary time series is shifted three time steps backward (left on the time axis). If the time lag with strongest correlation is negative, it means that changes in the primary analysis variable occur before changes in the secondary analysis variable.

Learn more about time lags

The primary output of the tool is a feature class containing the cross correlation results of each location for all time lags. In a map, a group layer will be added containing six layers from different fields of the output features: three layers of the strongest correlations (strongest positive, strongest negative, and strongest in absolute value) and three layers of the associated time lags for each of the strongest correlations. You can use these layers to quickly identify which locations had the strongest correlations and which time lags produced the correlations.

Optionally, you can create pop-up charts on the output features summarizing and visualizing the correlations across all lags at each location. You can also create output tables containing all individual correlations between locations at every time lag.

Learn more about tool outputs

Use the Spatial Neighbors to Include in Calculations parameter to calculate the cross correlations using neighborhoods around each location. This is appropriate when the time series of nearby locations tend to be more similar than time series of locations that are farther away. If neighbors are used, the cross correlation of a location is a weighted average of the correlations between the primary variable of the focal location and the secondary variable of each of its neighbors (including itself). For example, if a location has five neighbors, the cross correlation of the location is a weighted average of six correlations: the correlation between the primary variable of the focal location and secondary variable of the focal location, the correlation between the primary variable of the focal location and the secondary variable of the first neighbor, the correlation between the primary variable of the focal location and the secondary variable of the second neighbor, and so on. The Spatial Neighbor Weighting Method parameter specifies the weights that will be used in the weighted average.

To test the statistical significance of the cross correlations at each lag, the Filter and Remove Trends parameter must be checked. When checked, p-values and 95 percent confidence intervals will be calculated for all lags at all locations. Additionally, significance testing can only be performed on pairwise correlations between two time series (rather than a weighted average of multiple correlations), so if you include spatial neighbors in calculations, only the output pairwise correlations table will contain p-values and confidence intervals. If neighbors are not included, the output features and the output lagged correlations table will contain p-value and confidence interval fields.

The statistical significance tests are independently performed for each time lag of each location, and there is no correction for multiple hypothesis testing. Be cautious when interpreting the significance of any particular p-value or confidence interval.

Learn more about removing trends and filtering autocorrelation

The same analysis variable can be entered for both the primary and secondary analysis variables (called an autocorrelation analysis). However, the results may be difficult to interpret because a time series is always perfectly correlated with itself when the time lag value is zero (unshifted). The output features and correlation tables will contain the correlation results of all time lags, and the results at time lag zero can be filtered or deselected.

Derived Output

  • Name: in_cube
  • Explanation: The space-time cube containing the variable to be analyzed. Space-time cubes have a .nc file extension and are created using various tools in the Space Time Pattern Mining toolbox.

Code sample

The following Python script demonstrates how to use the TimeSeriesCrossCorrelation function.

  • Environments
  • Licensing information
  • Standard: Yes
  • Advanced: Yes

Related topics

  • How Time Series Cross Correlation works
  • Find a geoprocessing tool
