
Understanding the Null Hypothesis for Linear Regression

Linear regression is a technique we can use to understand the relationship between one or more predictor variables and a response variable.

If we only have one predictor variable and one response variable, we can use simple linear regression, which uses the following formula to estimate the relationship between the variables:

ŷ = β0 + β1x

  • ŷ: The estimated response value.
  • β0: The average value of y when x is zero.
  • β1: The average change in y associated with a one-unit increase in x.
  • x: The value of the predictor variable.

Simple linear regression uses the following null and alternative hypotheses:

  • H0: β1 = 0
  • HA: β1 ≠ 0

The null hypothesis states that the coefficient β1 is equal to zero. In other words, there is no statistically significant relationship between the predictor variable, x, and the response variable, y.

The alternative hypothesis states that β1 is not equal to zero. In other words, there is a statistically significant relationship between x and y.

If we have multiple predictor variables and one response variable, we can use multiple linear regression, which uses the following formula to estimate the relationship between the variables:

ŷ = β0 + β1x1 + β2x2 + … + βkxk

  • β0: The average value of y when all predictor variables are equal to zero.
  • βi: The average change in y associated with a one-unit increase in xi.
  • xi: The value of the predictor variable xi.

Multiple linear regression uses the following null and alternative hypotheses:

  • H0: β1 = β2 = … = βk = 0
  • HA: at least one βi ≠ 0

The null hypothesis states that all coefficients in the model are equal to zero. In other words, none of the predictor variables have a statistically significant relationship with the response variable, y.

The alternative hypothesis states that not every coefficient is simultaneously equal to zero.

The following examples show how to decide to reject or fail to reject the null hypothesis in both simple linear regression and multiple linear regression models.

Example 1: Simple Linear Regression

Suppose a professor would like to use the number of hours studied to predict the exam score that students will receive in his class. He collects data for 20 students and fits a simple linear regression model.

The following screenshot shows the output of the regression model:

[Figure: output of the simple linear regression model in Excel]

The fitted simple linear regression model is:

Exam Score = 67.1617 + 5.2503*(hours studied)

To determine if there is a statistically significant relationship between hours studied and exam score, we need to analyze the overall F value of the model and the corresponding p-value:

  • Overall F-Value:  47.9952
  • P-value:  0.000

Since this p-value is less than .05, we can reject the null hypothesis. In other words, there is a statistically significant relationship between hours studied and exam score received.
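To make the workflow concrete, here is a minimal sketch of the same kind of analysis in R. The data are made up (the professor's actual 20 observations are not shown above), so the numbers will not match the Excel output:

set.seed(6)
hours <- runif(20, 0, 6)                        # hypothetical hours studied
score <- 67 + 5.3 * hours + rnorm(20, sd = 4)   # hypothetical exam scores
fit <- lm(score ~ hours)
summary(fit)   # the overall F-statistic and its p-value appear on the last line of the output
anova(fit)     # the same overall F-test presented as an ANOVA table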

Example 2: Multiple Linear Regression

Suppose a professor would like to use the number of hours studied and the number of prep exams taken to predict the exam score that students will receive in his class. He collects data for 20 students and fits a multiple linear regression model.

[Figure: output of the multiple linear regression model in Excel]

The fitted multiple linear regression model is:

Exam Score = 67.67 + 5.56*(hours studied) – 0.60*(prep exams taken)

To determine if there is a jointly statistically significant relationship between the two predictor variables and the response variable, we need to analyze the overall F value of the model and the corresponding p-value:

  • Overall F-Value:  23.46
  • P-value:  0.00

Since this p-value is less than .05, we can reject the null hypothesis. In other words, hours studied and prep exams taken have a jointly statistically significant relationship with exam score.

Note: Although the individual p-value for prep exams taken (p = 0.52) is not significant, prep exams taken and hours studied together have a jointly significant relationship with exam score.
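As above, a hedged R sketch with simulated data (not the professor's actual data) shows where the joint F-test and the individual t-tests appear:

set.seed(7)
hours <- runif(20, 0, 6)
prep  <- rpois(20, 2)
score <- 68 + 5.5 * hours - 0.5 * prep + rnorm(20, sd = 4)   # made-up data
fit2 <- lm(score ~ hours + prep)
summary(fit2)   # the overall F-test judges both predictors jointly; the t-tests judge each slope on its own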

Additional Resources

  • Understanding the F-Test of Overall Significance in Regression
  • How to Read and Interpret a Regression Table
  • How to Report Regression Results
  • How to Perform Simple Linear Regression in Excel
  • How to Perform Multiple Linear Regression in Excel


Published by Zach



The Ultimate Guide to Linear Regression

Get all your linear regression questions answered here

Welcome! When most people think of statistical models, their first thought is linear regression models. What most people don’t realize is that linear regression is a specific type of regression.

With that in mind, we’ll start with an overview of regression models as a whole. Then, after we understand the purpose, we’ll focus on the linear part, including why it’s so popular and how to calculate regression lines-of-best-fit! (Or, if you already understand regression, you can skip straight down to the linear part.)

This guide will help you run and understand the intuition behind linear regression models. It’s intended to be a refresher resource for scientists and researchers, as well as to help new students gain better intuition about this useful modeling tool.

What is regression?

In its simplest form, regression is a type of model that uses one or more variables to estimate the actual values of another. There are plenty of different kinds of regression models, including the most commonly used linear regression, but they all have the basics in common. 

Usually the researcher has a response variable they are interested in predicting, and an idea of one or more predictor variables that could help in making an educated guess. Some simple examples include:

  • Predicting the progression of a disease such as diabetes using predictors such as age, cholesterol, etc. (linear regression)
  • Predicting survival rates or time-to-failure based on explanatory variables (survival analysis) 
  • Predicting political affiliation based on a person’s income level and years of education (logistic regression or some other classifier)
  • Predicting drug inhibition concentration at various dosages (nonlinear regression)

There are all sorts of applications, but the point is this: If we have a dataset of observations that links those variables together for each item in the dataset, we can regress the response on the predictors. Furthermore:

Fitting a model to your data can tell you how one variable increases or decreases as the value of another variable changes.

For example, if we have a dataset of houses that includes both their size and selling price, a regression model can help quantify the relationship between the two. (Not that any model will be perfect for this!)

The most noticeable aspect of a regression model is the equation it produces. This model equation gives a line of best fit, which can be used to produce estimates of a response variable based on any value of the predictors (within reason). We call the output of the model a point estimate because it is a point on the continuum of possibilities. Of course, how good that prediction actually is depends on everything from the accuracy of the data you’re putting into the model to how hard the question is in the first place.

Compare this to other methods like correlation, which can tell you the strength of the relationship between the variables but does not produce point estimates of the actual values of the response.

What is the difference between the variables in regression?

There are two different kinds of variables in regression: The one which helps predict (predictors), and the one you’re trying to predict (response).

Predictors were historically called independent variables in science textbooks. You may also see them referred to as x-variables, regressors, inputs, or covariates. Depending on the type of regression model you can have multiple predictor variables, which is called multiple regression . Predictors can be either continuous (numerical values such as height and weight) or categorical (levels of categories such as truck/SUV/motorcycle).

The response variable is often explained in layman’s terms as “the thing you actually want to predict or know more about”. It is usually the focus of the study and can be referred to as the dependent variable, y-variable, outcome, or target. In general, the response variable has a single value for each observation (e.g., predicting the temperature based on some other variables), but there can be multiple values (e.g., predicting the location of an object in latitude and longitude). The latter case is called multivariate regression (not to be confused with multiple regression). 

What are the purposes of regression analysis?

Regression Analysis has two main purposes:

  • Explanatory - A regression analysis explains the relationship between the response and predictor variables. For example, it can answer questions such as, does kidney function increase the severity of symptoms in some particular disease process? 
  • Predictive - A regression model can give a point estimate of the response variable based on the value of the predictors. 

How do I know which model best fits the data?

The most common way of determining the best model is by choosing the one that minimizes the squared difference between the actual values and the model’s estimated values. This is called least squares. Note that “least squares regression” is often used as a moniker for linear regression even though least squares is used for linear as well as nonlinear and other types of regression.
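A minimal R sketch of the idea, using the made-up house example from above (all numbers are invented for illustration): the least-squares coefficients can be computed by hand from the design matrix, and lm() arrives at the same answer.

size  <- c(1200, 1500, 1800, 2100, 2400, 2700)   # square feet (hypothetical)
price <- c(200, 245, 290, 320, 360, 410)         # selling price in $1000s (hypothetical)
X <- cbind(1, size)                              # design matrix with an intercept column
beta_hat <- solve(t(X) %*% X, t(X) %*% price)    # (X'X)^-1 X'y, the least-squares solution
beta_hat
coef(lm(price ~ size))                           # lm() minimizes the same sum of squared residuals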

What is linear regression?

The most popular form of regression is linear regression, which is used to predict the value of one numeric (continuous) response variable based on one or more predictor variables (continuous or categorical).

Most people think the name “linear regression” comes from a straight line relationship between the variables. For most cases, that’s a fine way to think of it intuitively: As a predictor variable increases, the response either increases or decreases at the same rate (all other things equal). If this relationship holds the same for any values of the variables, a straight line pattern will form in the data when graphed, as in the example below:

[Figure: Old Faithful eruption times showing a straight-line pattern]

However, the actual reason that it’s called linear regression is technical and has enough subtlety that it often causes confusion. For example, the graph below is linear regression, too, even though the resulting line is curved. The definition is mathematical and has to do with how the predictor variables relate to the response variable. Suffice it to say that linear regression handles most simple relationships, but can’t do complicated mathematical operations such as raising one predictor variable to the power of another predictor variable.
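For instance, a model with both x and x-squared as predictors produces a curved fit but is still linear regression, because the model is linear in its parameters. A small R sketch with simulated data:

set.seed(8)
x <- runif(50, 0, 10)
y <- 1 + 0.5 * x + 0.3 * x^2 + rnorm(50, sd = 2)   # simulated curved relationship
fit_curve <- lm(y ~ x + I(x^2))                    # x and x-squared enter as separate (linear) terms
coef(fit_curve)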

[Figure: linear regression example with a curved fitted line]

The most common linear regression models use the ordinary least squares algorithm to pick the parameters in the model and form the best line possible to show the relationship (the line-of-best-fit). Though it’s an algorithm shared by many models, linear regression is by far the most common application. If someone is discussing least-squares regression, it is more likely than not that they are talking about linear regression.

What are the major advantages of linear regression analysis?

Linear regression models are known for being easy to interpret thanks to the applications of the model equation, both for understanding the underlying relationship and in applying the model to predictions. The fact that regression analysis is great for explanatory analysis and often good enough for prediction is rare among modeling techniques.

In contrast, most techniques do one or the other. For example, a well-tuned AI-based artificial neural network model may be great at prediction but is a “black box” that offers little to no interpretability. 

There are some other benefits too:

  • Linear regression is computationally fast, particularly if you’re using statistical software. Though it’s not always a simple task to do by hand, it’s still much faster than the days it would take to calculate many other models.
  • The popularity of regression models is itself an advantage. The fact that it is a tried and tested approach used by so many scientists makes for easy collaboration.

Assumptions of linear regression

Just because scientists' initial reaction is usually to try a linear regression model, that doesn't mean it is always the right choice. In fact, there are some underlying assumptions that, if ignored, could invalidate the model.

  • Random sample - The observations in your data need to be independent from one another. There are many ways that dependence occurs, for example, one common way is with multiple response data, where a single subject is measured multiple times. The measurements on the same individual are presumably correlated, and you couldn’t use linear regression in this case.
  • Independence between predictors - If you have multiple predictors in your model, in theory, they shouldn’t be correlated with one another. If they are, this can cause instability in your model fit, although this affects the interpretation of your model rather than the predictions. See more about multicollinearity in the multiple linear regression section below.
  • Homoscedasticity - Meaning ‘equal scatter,’ this says that your residuals (the difference between the model prediction and the observed values) should be just as variable anywhere along the continuum. This is assessed with residual plots.
  • Residuals are normally distributed - In addition to having equal scatter, in the standard linear regression model, the residuals are assumed to come from a normal distribution. This is commonly assessed using a QQ-plot.
  • Linear relationship between predictors and response - The relationships must be linear as described above, ruling out some more complicated mathematical relationships. You can model some “curves” in your data using, say, variable X and variable X^2 ("X squared") as predictors.
  • No uncertainty in predictor measurements - The model assumes that all of the uncertainty is in the response variable. This is the most nuanced assumption: Even if you’re attempting to make inferences about a model with predictors that are themselves estimates, this would not affect you unless you need to attribute the uncertainty to the predictors. This field of study is called “measurement error.”

Other things to keep in mind for valid inference:

  • Representative sample - The dataset you are going to use should be a representative (and random!) sample of the population you’re trying to make inferences about. To use an intuitive example, you should not expect all people to act the same as those in your household. Since we often underestimate our own bias, the best bet is to have a random sample when you start.
  • Sample size - If your dataset only has 5 observations in it, the model will be less effective at finding a real pattern (if one exists) than if it has 100. There is no one-size-fits-all number for every study, but generally 30 or more is considered the low end of what regression needs.
  • Stay in range - Don’t try to make predictions outside the range of the dataset you used to build the model. For example, let’s say you are predicting home values based on square footage. If your dataset only has homes between 1,000 and 3,000 square feet, the model may not be a good judge of the value of an 800 or 4,000 square-foot house. This is called extrapolating, and is not recommended.

Types of linear regression

The two most common types of regression are simple linear regression and multiple linear regression, which only differ by the number of predictors in the model. Simple linear regression has a single predictor. 

Simple linear regression

It’s called simple for a reason: If you are testing a linear relationship between exactly two continuous variables (one predictor and one response variable), you’re looking for a simple linear regression model, also called a least squares regression line. Are you looking to use more predictors than that? Try a multiple linear regression model. That is the main difference between the two, but there are other considerations and differences involved too.

You can use statistical software such as Prism to calculate simple linear regression coefficients and graph the regression line it produces. For a quick simple linear regression analysis, try our free online linear regression calculator.

Interpreting a simple linear regression model

Remember the y = mx+b formula for a line from grade school? The slope was m , and the y-intercept was b , and both were necessary to draw a line. That’s what you’re basically building here too, but most textbooks and programs will write out the predictive equation for regression this way:

Y = β0 + β1X

Y is your response variable, and X is your predictor. The two 𝛽 symbols are called “parameters”, the things the model will estimate to create your line of best fit. The first (not connected to X) is the intercept, the other (the coefficient in front of X) is called the slope term.

As an example, we will use a sample Prism dataset with diabetes data to model the relationship between a person’s glucose level (predictor) and their glycosylated hemoglobin level (response). Once we run the analysis we get this output:

[Figure: simple linear regression results page in Prism]

Best-fit parameters and the regression equation

The first section in the Prism output for simple linear regression is all about the workings of the model itself. They can be called parameters, estimates, or (as they are above) best-fit values. Keep in mind, parameter estimates could be positive or negative in regression depending on the relationship.

There you see the slope (for glucose) and the y-intercept. The values for those help us build the equation the model uses to estimate and make predictions:

Glycosylated Hemoglobin = 2.24 + (0.0312*Glucose)

Notice: That same equation is given later in the output, near the bottom of the page.

Using this equation, we can plug in any number in the range of our dataset for glucose and estimate that person’s glycosylated hemoglobin level. For instance, a glucose level of 90 corresponds to an estimate of 5.048 for that person’s glycosylated hemoglobin level. But that’s just the start of how these parameters are used.
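That prediction is just arithmetic with the fitted equation; in R, for example:

glucose <- 90
2.24 + 0.0312 * glucose   # = 5.048, the estimated glycosylated hemoglobin level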

Interpreting parameter estimates

You can also interpret the parameters of simple linear regression on their own, and because there are only two it is pretty straightforward.

The slope parameter is often the most helpful: It means that for every 1 unit increase in glucose, the estimated glycosylated hemoglobin level will increase by 0.0312 units. As an aside, if it were negative (perhaps -0.04), we would say a 1 unit increase in glucose would decrease the estimated response by 0.04.

The intercept parameter is useful for fitting the model, because it shifts the best-fit-line up or down. In this example, the value it shows (2.24) is the predicted glycosylated hemoglobin level for a person with a glucose level of 0. In cases like this, the interpretation of the intercept isn’t very interesting or helpful.

Simply put, if there’s no predictor with a value of 0 in the dataset, you should ignore this part of the interpretation and consider the model as a whole and the slope. However, notice that if you plug in 0 for a person’s glucose, 2.24 is exactly what the full model estimates. 

Confidence intervals and standard error

The next couple sections seem technical, but really get back to the core of how no model is perfect. We can give “point estimates” for the best-fit parameters today, but there’s still some uncertainty involved in trying to find the true and exact relationship between the variables. 

Standard error and confidence intervals work together to give an estimate of that uncertainty. Roughly speaking, adding and subtracting about two standard errors from the estimate gives a fair range of possible values for that true relationship. With this 95% confidence interval, you can say you believe the true value of that parameter is somewhere between the two endpoints (for the slope of glucose, somewhere between 0.0285 and 0.0340).
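In R, the standard errors and 95% confidence intervals can be pulled straight from a fitted model. The sketch below uses simulated data rather than the Prism diabetes dataset, so the numbers are illustrative only:

set.seed(9)
glucose    <- runif(30, 70, 200)
hemoglobin <- 2.2 + 0.03 * glucose + rnorm(30, sd = 0.6)   # simulated response
fit <- lm(hemoglobin ~ glucose)
summary(fit)$coefficients    # estimate, standard error, t value, and p-value for each parameter
confint(fit, level = 0.95)   # 95% confidence intervals for the intercept and slope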

This method may seem too cautious at first, but is simply giving a range of real possibilities around the point estimate. After all, wouldn’t you like to know if the point estimate you gave was wildly variable? This gives you that missing piece. 

Goodness of fit

Determining how well your model fits can be done graphically and numerically. If you know what to look for, there’s nothing better than plotting your data to assess the fit and how well your data meet the assumptions of the model. These diagnostic graphics plot the residuals, which are the differences between the estimated model and the observed data points.

A good plot to use is a residual plot versus the predictor (X) variable. Here you want to look for equal scatter, meaning the points all vary roughly the same above and below the dotted line across all x values. The plot on the left looks great, whereas the plot on the right shows a clear parabolic shaped trend, which would need to be addressed.

[Figure: residual plots versus the predictor, equal scatter (left) versus a parabolic trend (right)]

Another way to assess the goodness of fit is with the R-squared statistic, which is the proportion of the variance in the response that is explained by the model. In this case, the value of 0.561 says that 56% of the variance in glycosylated hemoglobin can be explained by this very simple model equation (effectively, that person’s glucose level).

The name R-squared may remind you of a similar statistic: Pearson’s R, which measures the correlation between any two variables. Fun fact: As long as you’re doing simple linear regression, the square root of R-squared (which is to say, R) is equal, in absolute value, to the Pearson’s R correlation between the predictor and response variable.

The reason is that simple linear regression draws on the same mechanisms of least-squares that Pearson’s R does for correlation. Keep in mind, while regression and correlation are similar they are not the same thing . The differences usually come down to the purpose of the analysis, as correlation does not fit a line through the data points.
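A quick check of that fact in R, again with simulated data:

set.seed(10)
x <- rnorm(40)
y <- 1 + 0.5 * x + rnorm(40)
fit <- lm(y ~ x)
summary(fit)$r.squared   # R-squared of the simple linear regression
cor(x, y)^2              # the squared Pearson correlation gives the same value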

Significance and F-tests

So we have a model, and we know how to use it for predictions. We know R-squared gives an idea of how well the model fits the data… but how do we know if there is actually a significant relationship between the variables? 

A section at the bottom asks that same question: Is the slope significantly non-zero? This is especially important for this model, where the best-fit value (roughly 0.03) seems very close to 0 to the naked eye. How can we feel confident one way or another?

In this case, the slope  is  significantly non-zero: An F-test gives a p-value of less than 0.0001. F-tests answer this for the model as a whole rather than its individual slopes, but in this case there is only one slope anyway. P-values are always interpreted in comparison to a “significance threshold”: If it’s less than the threshold level, the model is said to show a trend that is significantly different from “no relationship” (or, the null hypothesis). And based on how we set up the regression analysis to use 0.05 as the threshold for significance, it tells us that the model points to a significant relationship. There is evidence that this relationship is real.

If it wasn’t, then we are effectively saying there is no evidence that the model gives any new information beyond random guessing. In other words: The model may output a number for a prediction, but if the slope is not significant, it may not be worth actually considering that prediction.

Graphing linear regression

Since a linear regression model produces an equation for a line, graphing linear regression’s line-of-best-fit in relation to the points themselves is a popular way to see how closely the model fits the eye test. Software like Prism makes the graphing part of regression incredibly easy, because a graph is created automatically alongside the details of the model. Here are some more graphing tips, along with an example from our analysis:

[Figure: simple linear regression line of best fit]

Multiple linear regression

If you understand the basics of simple linear regression, you understand about 80% of multiple linear regression, too. The inner-workings are the same, it is still based on the least-squares regression algorithm, and it is still a model designed to predict a response. But instead of just one predictor variable, multiple linear regression uses multiple predictors.

The model equation is similar to the previous one; the main thing you notice is that it’s longer because of the additional predictors. If you are using 3 predictor variables, the predictive equation will produce 3 slope estimates (one for each) along with an intercept term:

Y = β0 + β1X1 + β2X2 + β3X3

Prism makes it easy to create a multiple linear regression model, especially calculating regression slope coefficients and generating graphics to diagnose how well the model fits.

What do I need to know about multicollinearity?

The assumptions for multiple linear regression are the same as those discussed above. With multiple predictors, in addition to the interpretation getting more challenging, another complication is multicollinearity.

Multicollinearity occurs when two or more predictor variables “overlap” in what they measure. In other places you will see this referred to as the variables being dependent on one another. Ideally, the predictors are independent and no one predictor influences the values of another.

There are various ways of measuring multicollinearity, but the main thing to know is that multicollinearity won’t affect how well your model predicts point values. However, it garbles inference about how each individual variable affects the response.

For example, say that you want to estimate the height of a tree, and you have measured the circumference of the tree at two heights from the ground, one meter and two meters. The circumferences will be highly correlated. If you include both in the model, it’s very possible that you could end up with a negative slope parameter for one of those circumferences. Clearly, a tree doesn't get shorter when its circumference gets larger. Instead, that negative slope coefficient is acting as an adjustment to the other variable.
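One common way to quantify this overlap is the variance inflation factor (VIF). The sketch below assumes the car package (one of several R packages that provide VIFs) and uses simulated tree measurements rather than real data:

# install.packages("car")   # assumed to be available; other packages offer similar diagnostics
library(car)
set.seed(2)
circ_1m <- runif(40, 50, 150)                   # circumference at 1 m (cm), simulated
circ_2m <- 0.95 * circ_1m + rnorm(40, sd = 2)   # nearly the same measurement taken at 2 m
height  <- 5 + 0.10 * circ_1m + rnorm(40)
vif(lm(height ~ circ_1m + circ_2m))             # large VIFs (well above ~5-10) flag multicollinearity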

What is the difference between simple linear regression and multiple linear regression?

Once you’ve decided that your study is a good fit for a linear model, the choice between the two simply comes down to how many predictor variables you include. Just one? Simple linear. More than that? Multiple linear.

Based on that, you may be wondering, “Why would I ever do a simple linear regression when multiple linear regression can account for more variables?” Great question!

The answer is that sometimes less is more. A common misconception is that the goal of a model is to be 100% accurate. Scientists know that no model is perfect; a model is a simplified version of reality. So the goal isn’t perfection: Rather, the goal is to find as simple a model as possible to describe relationships so you understand the system, reach valid scientific conclusions, and design new experiments.

Still not convinced? Let’s say you were able to create a model that was 100% accurate for each point in your dataset. Most of the time if you’ve done this, you’ve done one of two things:

  • Come to an obvious conclusion that isn’t practically useful (100% of winning basketball teams score more points than their opponent) OR
  • You’ve modeled not only the trend in your data, but also the random “noise” that is too variable to count on. This is called “overfitting”: You tried so hard to account for every aspect of the past that the model ignores the differences that will arise in the future.

Other differences pop up on the technical side. To give some quick examples of that, using multiple linear regression means that:

  • In addition to the overall interpretation and significance of the model, each slope now has its own interpretation and question of significance.
  • R-squared is not as intuitive as it was for simple linear regression.
  • Graphing the equation is not a single line anymore. You could say that multiple linear regression just does not lend itself to graphing as easily.
All in all: simple regression is always more intuitive than multiple linear regression!

Interpreting multiple linear regression

We’ve said that multiple linear regression is harder to interpret than simple linear regression, and that is true. Taking the math and more technical aspects out of the question, overall interpretation is always harder the more factors are involved. But while there are more things to keep track of, the basic components of the thought process remain the same: parameters, confidence intervals and significance. We even use the model equation the same way.

Let’s use the same diabetes dataset to illustrate, but with a new wrinkle: In addition to glucose level, we will also include HDL and the person’s age as predictors of their glycosylated hemoglobin level (response). Here’s the output from Prism:

[Figure: multiple linear regression results page in Prism]

Analysis of variance and F-tests

While most scientists’ eyes go straight to the section with parameter estimates, the first section of output is valuable and is the best place to start. Analysis of variance tests the model as a whole (and some individual pieces) to tell you how good your model is before you make sense of the rest.

It includes the Sum of Squares table, and the F-test on the far right of that section is of highest interest. The “Regression” as a whole (on the top line of the section) has a p-value of less than 0.0001 and is significant at the 0.05 level we chose to use. Each parameter slope has its own individual F-test too, but it is easier to understand as a t-test.

Parameter estimates and T-tests

Now for the fun part: The model itself has the same structure and information we used for simple linear regression, and we interpret it very similarly. The key is to remember that you are interpreting each parameter in its own right (not something you have to keep in mind with only one parameter!). Prism puts all of the statistics for each parameter in one table, including (for each parameter):

  • The parameter’s estimate itself
  • Its standard error and confidence interval
  • A P-value from a t-test

The estimates themselves are straightforward and are used to make the model equation, just like before. In this case the model’s predictive equation is (when rounding to the nearest thousandth):

Glycosylated Hemoglobin = 1.870 + 0.029*Glucose - 0.005*HDL + 0.018*Age
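Using it is the same plug-and-compute exercise as before; for a hypothetical patient, for example:

glucose <- 120; hdl <- 50; age <- 55                    # hypothetical values, not from the dataset
1.870 + 0.029 * glucose - 0.005 * hdl + 0.018 * age     # ≈ 6.09 predicted glycosylated hemoglobin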

If you remember back to our simple linear regression model, the slope for glucose has changed slightly. That is because we are now accounting for other factors too. This distinction can sometimes change the interpretation of an individual predictor’s effect dramatically.

When interpreting the individual slope estimates for predictor variables, the difference goes back to how Multiple Regression assumes each predictor is independent of the others. For simple regression you can say “a 1 point increase in X usually corresponds to a 5 point increase in Y”. For multiple regression it’s more like “a 1 point increase in X usually corresponds to a 5 point increase in Y, assuming every other factor is equal.” That may not seem like a big jump, but it acknowledges 1) that there are more factors at play and 2) the need for those predictors to not have influence on one another for the model to be helpful.

The standard errors and confidence intervals are also shown for each parameter, giving an idea of the variability for each slope/intercept on its own. Interpreting each one of these is done exactly the same way as we mentioned in the simple linear regression example, but remember that if multicollinearity exists, the standard errors and confidence intervals get inflated (often drastically).

On the end are p-values, which, as you might guess, are interpreted just like we did for the first example. The underlying method behind the p-value here is a t-test. These only tell how significant each of the factors is; to evaluate the model as a whole we would need to use the F-test at the top.

Evaluating each on its own though is still helpful: In this case it shows that while the other predictors are all significant, HDL shows no significance since we have already considered the other factors. That is not to say that it has no significance on its own, only that it adds no value to a model of just glucose and age. In fact, now that we know this, we could choose to re-run our model with only glucose and age and dial in better parameter estimates for that simpler model.

Another difference in interpretation occurs when you have categorical predictor variables such as sex in our example data. When you add categorical variables to a model, you pick a “reference level.” In this case (image below), we selected female as our reference level. The model below says that males have slightly lower predicted response than females (about 0.15 less).

[Figure: multiple linear regression results with sex as a categorical predictor]

Assessing how well your model fits with multiple linear regression is more difficult than with simple linear regression, although the ideas remain the same, i.e., there are graphical and numerical diagnoses.

At the very least, it’s good to check a residual vs predicted plot to look for trends. In our diabetes model, this plot (included below) looks okay at first, but has some issues. Notice that values tend to miss high on the left and low on the right.

[Figure: residual versus predicted plot for the multiple regression model]

However, on further inspection, notice that there are only a few outlying points causing this unequal scatter. If you see outliers like above in your analysis that disrupt equal scatter, you have a few options.

As for numerical evaluations of goodness of fit, you have a lot more options with multiple linear regression. R-squared is still a go-to if you just want a measure to describe the proportion of variance in the response variable that is explained by your model. However, a common use of the goodness of fit statistics is to perform model selection, which means deciding which variables to include in the model. If that’s what you’re using the goodness of fit for, then you’re better off using adjusted R-squared or an information criterion such as AICc.
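A hedged R sketch of that kind of comparison, using simulated data where one candidate predictor is pure noise (base R's AIC is shown; AICc adds a small-sample correction on top of it):

set.seed(3)
n <- 50
x1 <- rnorm(n); x2 <- rnorm(n); x3 <- rnorm(n)   # x3 is unrelated noise
y  <- 1 + 2 * x1 - 1.5 * x2 + rnorm(n)
m_small <- lm(y ~ x1 + x2)
m_big   <- lm(y ~ x1 + x2 + x3)
c(summary(m_small)$adj.r.squared, summary(m_big)$adj.r.squared)   # adjusted R-squared penalizes the extra term
AIC(m_small, m_big)                                               # lower AIC is preferred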

Graphing multiple linear regression

Graphs are extremely useful to test how well a multiple linear regression model fits overall. With multiple predictors, it’s not feasible to plot the predictors against the response variable like it is in simple linear regression. A simple solution is to use the predicted response value on the x-axis and the residuals on the y-axis (as shown above). As a reminder, the residuals are the differences between the predicted and the observed response values. There are also several other plots using residuals that can be used to assess other model assumptions such as normally distributed error terms and serial correlation.

Model selection - choosing which predictor variables to include

How do you know which predictor variables to include in your model? It’s a great question and an active area of research.

For most researchers in the sciences, you’re dealing with a few predictor variables, and you have a pretty good hypothesis about the general structure of your model. If this is the case, then you might just try fitting a few different models and picking the one that looks best based on how the residuals look and on a goodness of fit metric such as adjusted R-squared or AICc.

Why doesn't my model fit well?

There are a lot of reasons that would cause your model to not fit well. One reason is having too much unexplained variance in the response. This could be because there were important predictor variables that you didn’t measure, or the relationship between the predictors and the response is more complicated than a simple linear regression model. In this last case, you can consider using interaction terms or transformations of the predictor variables.

If prediction accuracy is all that matters to you, meaning that you only want a good estimate of  the response and don’t need to understand how the predictors affect it, then there are a lot of clever, computational tools for building and selecting models. We won’t cover them in this guide, but if you want to know more about this topic, look into cross-validation and LASSO regression to get started.

Interactions

Interactions and transformations are useful tools to address situations where your model doesn't fit well by just using the unmodified predictor variables.

Interaction terms are found by multiplying two predictor variables together to create a new “interaction” variable. They greatly increase the complexity of describing how each variable affects the response. The primary use is to allow for more flexibility so that the effect of one predictor variable depends on the value of another predictor variable.

For a specific example using the diabetes data above, perhaps we have reason to believe that the effect of glucose on the response (hemoglobin %) changes depending on the age of the patient. Stats software makes this simple to do, but in effect, we multiply glucose by age, and include that new term in our model. Our new model when rounded is:

Glycosylated Hemoglobin = 0.42 + 0.044*Glucose - 0.004*HDL + 0.044*Age - 0.0003*Glucose*Age

For reference, our model without the interaction term was:

Glycosylated Hemoglobin = 1.865 + 0.029*Glucose - 0.005*HDL +0.018*Age

Adding the interaction term changed the other estimates by a lot! Interpreting what this means is challenging. At the very least, we can say that the effect of glucose depends on age for this model since the coefficients are statistically significant. We might also want to say that high glucose appears to matter less for older patients due to the negative coefficient estimate of the interaction term (-0.0002). However, there is very high multicollinearity in this model (and in nearly every model with interaction terms), so interpreting the coefficients should be done with caution. Even with this example, if we remove a few outliers, this interaction term is no longer statistically significant, so it is unstable and could simply be a byproduct of noisy data.

[Figure: results for the model with the glucose-by-age interaction]
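In R, an interaction is added with the : (or *) operator in the model formula. The sketch below uses simulated data, not the Prism diabetes dataset, so the estimates are illustrative only:

set.seed(4)
n <- 60
glucose <- runif(n, 70, 200); hdl <- runif(n, 30, 80); age <- runif(n, 20, 80)
hemo <- 2 + 0.03 * glucose + 0.02 * age + rnorm(n, sd = 0.5)   # simulated response
fit_int <- lm(hemo ~ glucose + hdl + age + glucose:age)        # glucose*age would add the main effects plus the interaction in one step
summary(fit_int)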

Transformations

In addition to interactions, another strategy to use when your model doesn't fit your data well is transformations of variables. You can transform your response or any of your predictor variables.

Transformations on the response variable change the interpretation quite a bit. Instead of the model fitting your response variable, y , it fits the transformed y . A common example where this is appropriate is with predicting height for various ages of an animal species. Log transformations on the response, height in this case, are used because the variability in height at birth is very small, but the variability of height with adult animals is much higher. This violates the assumption of equal scatter. 

In the plots below, notice the funnel type shape on the left, where the scatter widens as age increases. On the right hand side, the funnel shape disappears and the variability of the residuals looks consistent.

The linear model using the log transformed y fits much better, however now the interpretation of the model changes. Using the example data above, the predicted model is:

ln(y) = -0.4 + 0.2 * x

This means that a single unit change in x results in a 0.2 increase in the log of y. That doesn't mean much to most people. Instead, you probably want your interpretation to be on the original y scale. To do that, we need to exponentiate both sides of the equation, which (avoiding the mathematical details) means that a 1 unit increase in x results in a 22% increase in y.
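The 22% figure comes from exponentiating the slope:

exp(0.2)   # ≈ 1.22, so each 1-unit increase in x multiplies y by about 1.22 (a ~22% increase)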

All of that is to say that transformations can assist with fitting your model, but they can complicate interpretation. 

When linear regression doesn't work

The ubiquitous nature of linear regression is a positive for collaboration, but sometimes it causes researchers to assume (before doing their due diligence) that a linear regression model is the right model for every situation. Sometimes software even seems to reinforce this attitude, steering the choice of model rather than leaving the researcher in control of their own analysis.

Sure, linear regression is great for its simplicity and familiarity, but there are many situations where there are better alternatives.

Other types of regression

Logistic regression

Linear vs logistic regression: linear regression is appropriate when your response variable is continuous, but if your response has only two levels (e.g., presence/absence, yes/no, etc.), then look into simple logistic regression or multiple logistic regression.

Poisson regression

If instead, your response variable is a count (e.g., number of earthquakes in an area, number of males a female horseshoe crab has nesting nearby, etc.), then consider Poisson regression.

Nonlinear regression

For more complicated mathematical relationships between the predictors and response variables, such as dose-response curves in pharmacokinetics, check out nonlinear regression.

ANOVA models

If you’ve designed and run an experiment with a continuous response variable and your research factors are categorical (e.g., Diet 1/Diet 2, Treatment 1/Treatment 2, etc.), then you need ANOVA models. These are differentiated by the number of treatments (one-way ANOVA, two-way ANOVA, three-way ANOVA) or other characteristics such as repeated measures ANOVA.

Principal component regression

Principal component regression is useful when you have as many or more predictor variables than observations in your study. It offers a technique for reducing the “dimension” of your predictors, so that you can still fit a linear regression model.

Cox proportional hazards regression

Cox proportional hazards regression is the go-to technique for survival analysis, when you have data measuring time until an event.

Deming regression

Deming regression is useful when there are two variables (x and y), and there is measurement error in both variables. One common situation where this occurs is comparing results from two different methods (e.g., comparing two different machines that measure blood oxygen level or that check for a particular pathogen).


Linear regression - Hypothesis testing

by Marco Taboga , PhD

This lecture discusses how to perform tests of hypotheses about the coefficients of a linear regression model estimated by ordinary least squares (OLS).

Table of contents

  • Normal vs non-normal model
  • The linear regression model
  • Matrix notation
  • Tests of hypothesis in the normal linear regression model
  • Test of a restriction on a single coefficient (t test)
  • Test of a set of linear restrictions (F test)
  • Tests based on maximum likelihood procedures (Wald, Lagrange multiplier, likelihood ratio)
  • Tests of hypothesis when the OLS estimator is asymptotically normal
  • Test of a restriction on a single coefficient (z test)
  • Test of a set of linear restrictions (Chi-square test)
  • Learn more about regression analysis

The lecture is divided into two parts:

in the first part, we discuss hypothesis testing in the normal linear regression model , in which the OLS estimator of the coefficients has a normal distribution conditional on the matrix of regressors;

in the second part, we show how to carry out hypothesis tests in linear regression analyses where the hypothesis of normality holds only in large samples (i.e., the OLS estimator can be proved to be asymptotically normal).

Finally, we explain how to choose which test to carry out after estimating a linear regression model.


We now explain how to derive tests about the coefficients of the normal linear regression model.

It can be proved (see the lecture about the normal linear regression model) that the assumption of conditional normality implies that the OLS estimator of the coefficients is, conditional on the matrix of regressors, normally distributed around the true coefficient vector.

How the acceptance region is determined depends not only on the desired size of the test, but also on whether the test is two-tailed (deviations in either direction count as evidence against the null) or one-tailed (only one of the two directions, i.e., either smaller or larger, is possible).

For more details on how to determine the acceptance region, see the glossary entry on critical values .


The F test is one-tailed.

A critical value in the right tail of the F distribution is chosen so as to achieve the desired size of the test.

Then, the null hypothesis is rejected if the F statistic is larger than the critical value.

In this section we explain how to perform hypothesis tests about the coefficients of a linear regression model when the OLS estimator is asymptotically normal.

As we have shown in the lecture on the properties of the OLS estimator, in several cases (i.e., under different sets of assumptions) it can be proved that the OLS estimator is consistent and asymptotically normal.

These two properties are used to derive the asymptotic distribution of the test statistics used in hypothesis testing.

The test can be either one-tailed or two-tailed. The same comments made for the t-test apply here.


Like the F test, the Chi-square test is usually one-tailed.

The desired size of the test is achieved by appropriately choosing a critical value in the right tail of the Chi-square distribution.

The null is rejected if the Chi-square statistic is larger than the critical value.

Want to learn more about regression analysis? Here are some suggestions:

R squared of a linear regression ;

Gauss-Markov theorem ;

Generalized Least Squares ;

Multicollinearity ;

Dummy variables ;

Selection of linear regression models

Partitioned regression ;

Ridge regression .



Linear regression hypothesis testing: Concepts, Examples


In relation to machine learning, linear regression is defined as a predictive modeling technique that allows us to build a model which can help predict continuous response variables as a function of a linear combination of explanatory or predictor variables. While training linear regression models, we need to rely on hypothesis testing to determine the relationship between the response and predictor variables. In the case of the linear regression model, two types of hypothesis testing are done: T-tests and F-tests. In other words, there are two types of statistics that are used to assess whether a linear regression model exists representing the response and predictor variables: t-statistics and f-statistics. As data scientists, it is of utmost importance to determine if linear regression is the correct choice of model for our particular problem, and this can be done by performing hypothesis testing related to the linear regression response and predictor variables. These concepts are often not very clear to many data scientists. In this blog post, we will discuss linear regression and hypothesis testing related to t-statistics and f-statistics. We will also provide an example to help illustrate how these concepts work.


What are linear regression models?

A linear regression model can be defined as the function approximation that represents a continuous response variable as a function of one or more predictor variables. While building a linear regression model, the goal is to identify a linear equation that best predicts or models the relationship between the response or dependent variable and one or more predictor or independent variables.

There are two different kinds of linear regression models. They are as follows:

  • Simple or univariate linear regression models: These are linear regression models that are used to build a linear relationship between one response or dependent variable and one predictor or independent variable. The form of the equation that represents a simple linear regression model is Y = mX + b, where m is the coefficient of the predictor variable and b is the bias. When considering the linear regression line, m represents the slope and b represents the intercept.
  • Multiple or multivariate linear regression models: These are linear regression models that are used to build a linear relationship between one response or dependent variable and more than one predictor or independent variable. The form of the equation that represents a multiple linear regression model is Y = b0 + b1X1 + b2X2 + … + bnXn, where bi represents the coefficient of the ith predictor variable. In this type of linear regression model, each predictor variable has its own coefficient that is used to calculate the predicted value of the response variable.

While training linear regression models, the requirement is to determine the coefficients which can result in the best-fitted linear regression line. The learning algorithm used to find the most appropriate coefficients is known as least squares regression. In the least-squares regression method, the coefficients are calculated using the least-squares error function. The main objective of this method is to minimize or reduce the sum of squared residuals between actual and predicted response values. The sum of squared residuals is also called the residual sum of squares (RSS). The outcome of executing the least-squares regression method is coefficients that minimize the linear regression cost function.

The residual e of the ith observation is represented as follows, where [latex]Y_i[/latex] is the ith observation and [latex]\hat{Y_i}[/latex] is the prediction for the ith observation, i.e., the predicted value of the response variable for the ith observation.

[latex]e_i = Y_i - \hat{Y_i}[/latex]

The residual sum of squares can be represented as the following:

[latex]RSS = e_1^2 + e_2^2 + e_3^2 + \ldots + e_n^2[/latex]

The least-squares method represents the algorithm that minimizes the above term, RSS.
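A small R sketch of those quantities (simulated data, purely for illustration):

set.seed(5)
x <- rnorm(40)
y <- 3 + 2 * x + rnorm(40)
fit <- lm(y ~ x)
e   <- residuals(fit)   # e_i = Y_i - Y_hat_i
rss <- sum(e^2)         # the residual sum of squares that least squares minimizes
rss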

Once the coefficients are determined, can it be claimed that these coefficients are the most appropriate ones for linear regression? The answer is no. After all, the coefficients are only estimates and thus there will be standard errors associated with each of them. Recall that the standard error is used to construct the confidence interval within which the true value of the population parameter is expected to lie. In other words, it represents the error of estimating a population parameter based on the sample data. The value of the standard error is calculated as the standard deviation of the sample divided by the square root of the sample size. The formula below represents the standard error of a mean.

[latex]SE(\mu) = \frac{\sigma}{\sqrt{N}}[/latex]

Thus, without analyzing aspects such as the standard error associated with the coefficients, it cannot be claimed that the linear regression coefficients are the most suitable ones without performing hypothesis testing. This is where hypothesis testing is needed. Before we get into why we need hypothesis testing with the linear regression model, let’s briefly review what hypothesis testing is.

Train a Multiple Linear Regression Model using R

Before getting into the hypothesis testing concepts in relation to the linear regression model, let’s train a multivariate or multiple linear regression model and print the summary output of the model, which will be referred to in the next section.

The data used for creating the multiple linear regression model is BostonHousing, which can be loaded in RStudio by installing the mlbench package. The code is shown below:

install.packages("mlbench")
library(mlbench)
data("BostonHousing")

Once the data is loaded, the code shown below can be used to create the linear regression model.

attach(BostonHousing)
BostonHousing.lm <- lm(log(medv) ~ crim + chas + rad + lstat)
summary(BostonHousing.lm)

Executing the above commands will result in the creation of a linear regression model with the response variable as log(medv) and the predictor variables as crim, chas, rad, and lstat. The following represents the details related to the response and predictor variables:

  • log(medv) : Log of the median value of owner-occupied homes in USD 1000’s
  • crim : Per capita crime rate by town
  • chas : Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
  • rad : Index of accessibility to radial highways
  • lstat : Percentage of the lower status of the population

The following is the output of the summary command, which prints details relating to the model, including hypothesis testing details for the coefficients (t-statistics) and for the model as a whole (f-statistics):

[Figure: summary() output for the fitted linear regression model in R]
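If you prefer to pull these quantities out of the model object directly rather than reading the printed summary, the same pieces are available programmatically:

coef(summary(BostonHousing.lm))        # estimates, standard errors, t values, and Pr(>|t|) for each coefficient
summary(BostonHousing.lm)$fstatistic   # the overall F-statistic with its numerator and denominator degrees of freedom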

Hypothesis tests & Linear Regression Models

Hypothesis tests are the statistical procedure that is used to test a claim or assumption about the underlying distribution of a population based on the sample data. Here are key steps of doing hypothesis tests with linear regression models:

  • Hypothesis formulation for t-tests: In the case of linear regression, the claim is that there exists a relationship between the response variable and a predictor variable, and the claim is represented by a non-zero value of the coefficient of that predictor variable in the regression model. This is formulated as the alternative hypothesis. Thus, the null hypothesis is that there is no relationship between the response and the predictor variable, i.e., the coefficient of that predictor variable is equal to zero (0). So, if the linear regression model is Y = a0 + a1x1 + a2x2 + a3x3, then the null hypotheses for the individual tests are a1 = 0, a2 = 0, and a3 = 0. For each predictor variable, an individual hypothesis test is performed to determine whether the relationship between the response and that particular predictor variable is statistically significant based on the sample data used for training the model. Thus, if there are, say, 5 features, there will be five hypothesis tests, each with an associated null and alternative hypothesis.
  • Hypothesis formulation for the F-test: In addition, there is a hypothesis test done around the claim that there is a linear regression model representing the response variable and all the predictor variables jointly. The null hypothesis is that no such linear regression model exists, which essentially means that the values of all the (non-intercept) coefficients are equal to zero. So, if the linear regression model is Y = a0 + a1x1 + a2x2 + a3x3, then the null hypothesis states that a1 = a2 = a3 = 0.
  • F-statistic for testing the linear regression model as a whole: The F-test is used to test the null hypothesis that no linear regression model exists representing the relationship between the response variable y and the predictor variables x1, x2, x3, etc.; equivalently, that all the slope coefficients are simultaneously zero (a1 = a2 = a3 = ... = 0). The F-statistic is calculated as a function of the residual sum of squares for the restricted regression (the model containing only the intercept, with all slope coefficients set to zero) and the residual sum of squares for the unrestricted regression (the full linear regression model). In the model summary above, note the reported value of the F-statistic and its degrees of freedom.
  • Evaluate the t-statistic against the critical value/region: After calculating the t-statistic for each coefficient, a decision is made about whether to reject the null hypothesis. In order for this decision to be made, one needs to set a significance level, also known as the alpha level; a significance level of 0.05 is usually used. If the value of the t-statistic falls in the critical region, the null hypothesis is rejected. Equivalently, if the p-value comes out to be less than 0.05, the null hypothesis is rejected.
  • Evaluate the F-statistic against the critical value/region: The value of the F-statistic and its p-value are evaluated for testing the null hypothesis that no linear regression model exists relating the response and predictor variables. If the value of the F-statistic is greater than the critical value at the 0.05 level of significance, the null hypothesis is rejected. This means that a linear model exists with at least one non-zero coefficient.
  • Draw conclusions: The final step of hypothesis testing is to draw a conclusion by interpreting the results in terms of the original claim or hypothesis. If the null hypothesis for one or more predictor variables is rejected, this indicates that the relationship between the response and those predictor variables is statistically significant based on the evidence from the sample data used for training the model. Similarly, if the F-statistic lies in the critical region and its p-value is less than the alpha value (usually set at 0.05), one can conclude that a statistically significant linear regression model exists.
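As a minimal illustration of these steps in R, assuming the BostonHousing.lm model fitted earlier, the coefficient t-statistics, their p-values, and the overall F-statistic can be extracted from the model summary (the helper object names below are illustrative):

fit.summary <- summary(BostonHousing.lm)
# t-statistics and p-values for the individual coefficient tests (H0: coefficient = 0)
fit.summary$coefficients[, c("t value", "Pr(>|t|)")]
# overall F-statistic and its p-value (H0: all slope coefficients are zero)
f <- fit.summary$fstatistic
pf(f["value"], f["numdf"], f["dendf"], lower.tail = FALSE)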

Why hypothesis tests for linear regression models?

The reasons why we need to do hypothesis tests in the case of a linear regression model are the following:

  • By creating the model, we are establishing a new claim about the relationship between the response (dependent) variable and one or more predictor (independent) variables. In order to justify this claim, one or more tests are needed. These tests can be termed an act of testing the claim, in other words, hypothesis tests.
  • One kind of test is required to test the relationship between the response and each of the predictor variables individually (hence, t-tests).
  • Another kind of test is required to test the linear regression model representation as a whole. This is called the F-test.

While training linear regression models, hypothesis testing is done to determine whether the relationship between the response and each of the predictor variables is statistically significant. The coefficients related to each of the predictor variables are determined, and then individual hypothesis tests are performed to determine whether the relationship between the response and that particular predictor variable is statistically significant based on the sample data used for training the model. If the null hypothesis for a predictor variable is rejected, this indicates that there is a relationship between the response and that particular predictor variable. The t-statistic is used for performing this hypothesis testing because the standard deviation of the sampling distribution is unknown. The value of the t-statistic is compared with the critical value from the t-distribution table in order to decide whether to reject the null hypothesis regarding the relationship between the response and predictor variables. If the value falls in the critical region, the null hypothesis is rejected, which means that there is a statistically significant relationship between the response and that predictor variable. In addition to the t-tests, an F-test is performed to test the null hypothesis that the linear regression model does not exist, i.e., that the values of all the (non-intercept) coefficients are zero (0). Learn more about linear regression and the t-test in this blog – Linear regression t-test: formula, example.
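As a small sketch of the decision rule described above (assuming a 0.05 significance level and the BostonHousing.lm model from the earlier section), the two-sided critical value can be taken from the t-distribution and compared against each coefficient's t-statistic:

df.resid <- df.residual(BostonHousing.lm)   # degrees of freedom for the coefficient t-tests
t.critical <- qt(0.975, df = df.resid)      # two-sided critical value at alpha = 0.05
# reject H0 for a coefficient when its |t value| exceeds the critical value
abs(summary(BostonHousing.lm)$coefficients[, "t value"]) > t.critical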

5.2 - Writing Hypotheses

The first step in conducting a hypothesis test is to write the hypothesis statements that are going to be tested. For each test you will have a null hypothesis (\(H_0\)) and an alternative hypothesis (\(H_a\)).

When writing hypotheses there are three things that we need to know: (1) the parameter that we are testing, (2) the direction of the test (non-directional, right-tailed, or left-tailed), and (3) the value of the hypothesized parameter.

  • At this point we can write hypotheses for a single mean (\(\mu\)), paired means(\(\mu_d\)), a single proportion (\(p\)), the difference between two independent means (\(\mu_1-\mu_2\)), the difference between two proportions (\(p_1-p_2\)), a simple linear regression slope (\(\beta\)), and a correlation (\(\rho\)). 
  • The research question will give us the information necessary to determine if the test is two-tailed (e.g., "different from," "not equal to"), right-tailed (e.g., "greater than," "more than"), or left-tailed (e.g., "less than," "fewer than").
  • The research question will also give us the hypothesized parameter value. This is the number that goes in the hypothesis statements (i.e., \(\mu_0\) and \(p_0\)). For the difference between two groups, regression, and correlation, this value is typically 0.

Hypotheses are always written in terms of population parameters (e.g., \(p\) and \(\mu\)).  The tables below display all of the possible hypotheses for the parameters that we have learned thus far. Note that the null hypothesis always includes the equality (i.e., =).

Linear Regression in R | A Step-by-Step Guide & Examples

Published on February 25, 2020 by Rebecca Bevans . Revised on June 22, 2023.

Linear regression is a regression model that uses a straight line to describe the relationship between variables . It finds the line of best fit through your data by searching for the value of the regression coefficient(s) that minimizes the total error of the model.

There are two main types of linear regression:

  • Simple linear regression uses only one independent variable
  • Multiple linear regression uses two or more independent variables

In this step-by-step guide, we will walk you through linear regression in R using two sample datasets.

Download the sample datasets to try it yourself.

  • Simple regression dataset
  • Multiple regression dataset

Table of contents

  • Getting started in R
  • Step 1: Load the data into R
  • Step 2: Make sure your data meet the assumptions
  • Step 3: Perform the linear regression analysis
  • Step 4: Check for homoscedasticity
  • Step 5: Visualize the results with a graph
  • Step 6: Report your results
  • Other interesting articles

Start by downloading R and RStudio . Then open RStudio and click on File > New File > R Script .

As we go through each step , you can copy and paste the code from the text boxes directly into your script. To run the code, highlight the lines you want to run and click on the Run button on the top right of the text editor (or press ctrl + enter on the keyboard).

To install the packages you need for the analysis, run this code (you only need to do this once):
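The original text box is not reproduced here; a minimal sketch, assuming the two packages this guide relies on later (ggplot2 for plotting and ggpubr for the regression-equation layer), would be:

install.packages("ggplot2")
install.packages("ggpubr")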

Next, load the packages into your R environment by running this code (you need to do this every time you restart R):
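Again as a sketch, under the same assumption about which packages are used:

library(ggplot2)
library(ggpubr)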

Follow these four steps for each dataset:

  • In RStudio, go to File > Import dataset  > From Text (base) .
  • Choose the data file you have downloaded ( income.data or heart.data ), and an Import Dataset window pops up.
  • In the Data Frame window, you should see an X (index) column and columns listing the data for each of the variables ( income and happiness or biking , smoking , and heart.disease ).
  • Click on the Import button and the file should appear in your Environment tab on the upper right side of the RStudio screen.

After you’ve loaded the data, check that it has been read in correctly using summary() .

Simple regression

Because both our variables are quantitative , when we run this function we see a table in our console with a numeric summary of the data. This tells us the minimum, median , mean , and maximum values of the independent variable (income) and dependent variable (happiness):
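For example, assuming the imported data frame keeps the name of the downloaded file:

summary(income.data)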

Simple linear regression summary output in R

Multiple regression

Again, because the variables are quantitative, running the code produces a numeric summary of the data for the independent variables (smoking and biking) and the dependent variable (heart disease):
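Likewise, for the multiple regression dataset:

summary(heart.data)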

Multiple regression summary output in R

We can use R to check that our data meet the four main assumptions for linear regression .

  • Independence of observations (aka no autocorrelation)

Because we only have one independent variable and one dependent variable, we don’t need to test for any hidden relationships among variables.

If you know that you have autocorrelation within variables (i.e. multiple observations of the same test subject), then do not proceed with a simple linear regression! Use a structured model, like a linear mixed-effects model, instead.

To check whether the dependent variable follows a normal distribution , use the hist() function.
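A one-line sketch, assuming the income.data frame imported in step 1:

hist(income.data$happiness)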

Simple regression histogram

The observations are roughly bell-shaped (more observations in the middle of the distribution, fewer on the tails), so we can proceed with the linear regression.

The relationship between the independent and dependent variable must be linear. We can test this visually with a scatter plot to see if the distribution of data points could be described with a straight line.
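A quick way to draw that scatter plot in base R (again assuming income.data):

plot(happiness ~ income, data = income.data)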

Simple regression scatter plot

The relationship looks roughly linear, so we can proceed with the linear model.

  • Homoscedasticity  (aka homogeneity of variance )

This means that the prediction error doesn’t change significantly over the range of prediction of the model. We can test this assumption later, after fitting the linear model.

Use the cor() function to test the relationship between your independent variables and make sure they aren’t too highly correlated.
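For the two independent variables in the heart.data frame, the check might look like:

cor(heart.data$biking, heart.data$smoking)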

When we run this code, the output is 0.015. The correlation between biking and smoking is small (0.015 is only a 1.5% correlation), so we can include both parameters in our model.

Use the hist() function to test whether your dependent variable follows a normal distribution .
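For example:

hist(heart.data$heart.disease)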

Multiple regression histogram

The distribution of observations is roughly bell-shaped, so we can proceed with the linear regression.

We can check this using two scatterplots: one for biking and heart disease, and one for smoking and heart disease.
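In base R, those two plots could be drawn as follows:

plot(heart.disease ~ biking, data = heart.data)
plot(heart.disease ~ smoking, data = heart.data)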

Multiple regression scatter plot 1

Although the relationship between smoking and heart disease is a bit less clear, it still appears linear. We can proceed with linear regression.

  • Homoscedasticity

We will check this after we make the model.

Now that you’ve determined your data meet the assumptions, you can perform a linear regression analysis to evaluate the relationship between the independent and dependent variables.

Simple regression: income and happiness

Let’s see if there’s a linear relationship between income and happiness in our survey of 500 people with incomes ranging from $15k to $75k, where happiness is measured on a scale of 1 to 10.

To perform a simple linear regression analysis and check the results, you need to run two lines of code. The first line of code makes the linear model, and the second line prints out the summary of the model:
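A sketch of those two lines, using income.happiness.lm as the model name (the name referred to later in this guide):

income.happiness.lm <- lm(happiness ~ income, data = income.data)
summary(income.happiness.lm)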

The output looks like this:

Simple regression results

This output table first presents the model equation, then summarizes the model residuals (see step 4).

The Coefficients section shows:

  • The estimates ( Estimate ) for the model parameters – the value of the y-intercept (in this case 0.204) and the estimated effect of income on happiness (0.713).
  • The standard error of the estimated values ( Std. Error ).
  • The test statistic ( t value , in this case the t statistic ).
  • The p value ( Pr(>| t | ) ), aka the probability of finding the given t statistic if the null hypothesis of no relationship were true.

The final three lines are model diagnostics – the most important thing to note is the p value (here it is 2.2e-16, or almost zero), which will indicate whether the model fits the data well.

From these results, we can say that there is a significant positive relationship between income and happiness ( p value < 0.001), with a 0.713-unit (+/- 0.01) increase in happiness for every unit increase in income.

Multiple regression: biking, smoking, and heart disease

Let’s see if there’s a linear relationship between biking to work, smoking, and heart disease in our imaginary survey of 500 towns. The rates of biking to work range between 1 and 75%, rates of smoking between 0.5 and 30%, and rates of heart disease between 0.5% and 20.5%.

To test the relationship, we first fit a linear model with heart disease as the dependent variable and biking and smoking as the independent variables. Run these two lines of code:
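A sketch of those two lines; the model name heart.disease.lm is an assumption carried through the later diagnostic and plotting steps:

heart.disease.lm <- lm(heart.disease ~ biking + smoking, data = heart.data)
summary(heart.disease.lm)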

Multiple regression results

The estimated effect of biking on heart disease is -0.2, while the estimated effect of smoking is 0.178.

This means that for every 1% increase in biking to work, there is a correlated 0.2% decrease in the incidence of heart disease. Meanwhile, for every 1% increase in smoking, there is a 0.178% increase in the rate of heart disease.

The standard errors for these regression coefficients are very small, and the t statistics are very large (-147 and 50.4, respectively). The p values reflect these small errors and large t statistics. For both parameters, there is almost zero probability that this effect is due to chance.

Remember that these data are made up for this example, so in real life these relationships would not be nearly so clear!

Before proceeding with data visualization, we should make sure that our models fit the homoscedasticity assumption of the linear model.

We can run plot(income.happiness.lm) to check whether the observed data meets our model assumptions:

Note that the par(mfrow()) command will divide the Plots window into the number of rows and columns specified in the brackets. So par(mfrow=c(2,2)) divides it up into two rows and two columns. To go back to plotting one graph in the entire window, set the parameters again and replace the (2,2) with (1,1).
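Putting those pieces together, the diagnostic plots for the simple regression might be produced like this:

par(mfrow = c(2, 2))    # 2 x 2 grid of diagnostic plots
plot(income.happiness.lm)
par(mfrow = c(1, 1))    # reset to a single plot per window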

These are the residual plots produced by the code:

Simple regression diagnostic plots lm

Residuals are the unexplained variance . They are not exactly the same as model error, but they are calculated from it, so seeing a bias in the residuals would also indicate a bias in the error.

The most important thing to look for is that the red lines representing the mean of the residuals are all basically horizontal and centered around zero. This means there are no outliers or biases in the data that would make a linear regression invalid.

In the Normal Q-Q plot in the top right, we can see that the real residuals from our model form an almost perfectly one-to-one line with the theoretical residuals from a perfect model.

Based on these residuals, we can say that our model meets the assumption of homoscedasticity.

Again, we should check that our model is actually a good fit for the data, and that we don’t have large variation in the model error, by running this code:
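For the multiple regression model, the equivalent check (assuming the heart.disease.lm model from step 3) is:

par(mfrow = c(2, 2))
plot(heart.disease.lm)
par(mfrow = c(1, 1))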

Multiple regression diagnostic plots lm

As with our simple regression, the residuals show no bias, so we can say our model fits the assumption of homoscedasticity.

Next, we can plot the data and the regression line from our linear regression model so that the results can be shared.

Follow 4 steps to visualize the results of your simple linear regression.

  • Plot the data points on a graph
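A sketch of this step with ggplot2 (the plot object name income.graph is an illustrative choice):

income.graph <- ggplot(income.data, aes(x = income, y = happiness)) +
  geom_point()
income.graph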

Simple regression scatter plot

  • Add the linear regression line to the plotted data

Add the regression line using geom_smooth() and typing in lm as your method for creating the line. This will add the line of the linear regression as well as the standard error of the estimate (in this case +/- 0.01) as a light grey stripe surrounding the line:
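For example, building on the income.graph object from the previous step:

income.graph <- income.graph +
  geom_smooth(method = "lm", col = "black")
income.graph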

Simple regression line

  • Add the equation for the regression line.
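One way to do this is with the stat_regline_equation() layer from ggpubr (label placement arguments such as label.x and label.y can be added to position the equation on the plot):

income.graph <- income.graph +
  stat_regline_equation()
income.graph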

Simple regression equation

  • Make the graph ready for publication

We can add some style parameters using theme_bw() and making custom labels using labs() .
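A sketch of this final styling step; the label text is a placeholder to adjust for your own data:

income.graph +
  theme_bw() +
  labs(title = "Happiness as a function of income",   # placeholder labels
       x = "Income",
       y = "Happiness")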

This produces the finished graph that you can include in your papers:

Simple linear regression in R graph example

The visualization step for multiple regression is more difficult than for simple regression, because we now have two predictors. One option is to plot a plane, but these are difficult to read and not often published.

We will try a different method: plotting the relationship between biking and heart disease at different levels of smoking. In this example, smoking will be treated as a factor with three levels, just for the purposes of displaying the relationships in our data.

There are 7 steps to follow.

  • Create a new dataframe with the information needed to plot the model

Use the function expand.grid() to create a dataframe with the parameters you supply. Within this function we will:

  • Create a sequence from the lowest to the highest value of your observed biking data;
  • Choose the minimum, mean, and maximum values of smoking, in order to make 3 levels of smoking over which to predict rates of heart disease.
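A sketch of that expand.grid() call; the data frame name plotting.data and the sequence length are illustrative choices rather than values from the original text:

plotting.data <- expand.grid(
  biking = seq(min(heart.data$biking), max(heart.data$biking), length.out = 30),
  smoking = c(min(heart.data$smoking), mean(heart.data$smoking), max(heart.data$smoking)))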

This will not create anything new in your console, but you should see a new data frame appear in the Environment tab. Click on it to view it.

  • Predict the values of heart disease based on your linear model

Next we will save our ‘predicted y’ values as a new column in the dataset we just created.
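For example, using predict() with the heart.disease.lm model and the plotting.data frame created above (the column name predicted.y is illustrative):

plotting.data$predicted.y <- predict(heart.disease.lm, newdata = plotting.data)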

  • Round the smoking numbers to two decimals

This will make the legend easier to read later on.
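For instance:

plotting.data$smoking <- round(plotting.data$smoking, digits = 2)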

  • Change the ‘smoking’ variable into a factor

This allows us to plot the interaction between biking and heart disease at each of the three levels of smoking we chose.
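For instance:

plotting.data$smoking <- as.factor(plotting.data$smoking)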

  • Plot the original data
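A sketch of the base scatter plot with ggplot2 (the object name heart.plot is illustrative):

heart.plot <- ggplot(heart.data, aes(x = biking, y = heart.disease)) +
  geom_point()
heart.plot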

Multiple linear regression scatter plot

  • Add the regression lines
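The fitted lines can then be drawn from the plotting.data frame created earlier, one line per level of smoking:

heart.plot <- heart.plot +
  geom_line(data = plotting.data, aes(x = biking, y = predicted.y, color = smoking))
heart.plot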

Multiple regression lines

Because this graph has two regression coefficients, the stat_regline_equation() function won’t work here. But if we want to add our regression model to the graph, we can do so like this:
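One hedged way to do this is to build the label text from the model's own coefficients and place it with annotate(); the text position below is illustrative and will need adjusting for your plot:

coefs <- round(coef(heart.disease.lm), 3)
equation.label <- paste0("heart.disease = ", coefs[1],
                         " + ", coefs[2], "*biking + ", coefs[3], "*smoking")
heart.plot +
  annotate(geom = "text", x = 30, y = 15, label = equation.label)   # adjust x and y as needed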

This is the finished graph that you can include in your papers!

In addition to the graph, include a brief statement explaining the results of the regression model.

If you want to know more about statistics , methodology , or research bias , make sure to check out some of our other articles with explanations and examples.

Statistics

  • Chi square test of independence
  • Statistical power
  • Descriptive statistics
  • Degrees of freedom
  • Pearson correlation
  • Null hypothesis

Methodology

  • Double-blind study
  • Case-control study
  • Research ethics
  • Data collection
  • Hypothesis testing
  • Structured interviews

Research bias

  • Hawthorne effect
  • Unconscious bias
  • Recall bias
  • Halo effect
  • Self-serving bias
  • Information bias

Cite this Scribbr article

If you want to cite this source, you can copy and paste the citation below.

Bevans, R. (2023, June 22). Linear Regression in R | A Step-by-Step Guide & Examples. Scribbr. Retrieved April 15, 2024, from https://www.scribbr.com/statistics/linear-regression-in-r/


Linear Regression Analysis using SPSS Statistics

Introduction.

Linear regression is the next step up after correlation. It is used when we want to predict the value of a variable based on the value of another variable. The variable we want to predict is called the dependent variable (or sometimes, the outcome variable). The variable we are using to predict the other variable's value is called the independent variable (or sometimes, the predictor variable). For example, you could use linear regression to understand whether exam performance can be predicted based on revision time; whether cigarette consumption can be predicted based on smoking duration; and so forth. If you have two or more independent variables, rather than just one, you need to use multiple regression .

This "quick start" guide shows you how to carry out linear regression using SPSS Statistics, as well as interpret and report the results from this test. However, before we introduce you to this procedure, you need to understand the different assumptions that your data must meet in order for linear regression to give you a valid result. We discuss these assumptions next.

SPSS Statistics

Assumptions.

When you choose to analyse your data using linear regression, part of the process involves checking to make sure that the data you want to analyse can actually be analysed using linear regression. You need to do this because it is only appropriate to use linear regression if your data "passes" seven assumptions that are required for linear regression to give you a valid result. In practice, checking for these seven assumptions just adds a little bit more time to your analysis, requiring you to click a few more buttons in SPSS Statistics when performing your analysis, as well as think a little bit more about your data, but it is not a difficult task.

Before we introduce you to these seven assumptions, do not be surprised if, when analysing your own data using SPSS Statistics, one or more of these assumptions is violated (i.e., not met). This is not uncommon when working with real-world data rather than textbook examples, which often only show you how to carry out linear regression when everything goes well! However, don’t worry. Even when your data fails certain assumptions, there is often a solution to overcome this. First, let’s take a look at these seven assumptions:

  • Assumption #1: Your dependent variable should be measured at the continuous level (i.e., it is either an interval or ratio variable). Examples of continuous variables include revision time (measured in hours), intelligence (measured using IQ score), exam performance (measured from 0 to 100), weight (measured in kg), and so forth. You can learn more about interval and ratio variables in our article: Types of Variable .
  • Assumption #2: Your independent variable should also be measured at the continuous level (i.e., it is either an interval or ratio variable). See the bullet above for examples of continuous variables.

Types of relationship

  • Assumption #5: You should have independence of observations , which you can easily check using the Durbin-Watson statistic, which is a simple test to run using SPSS Statistics. We explain how to interpret the result of the Durbin-Watson statistic in our enhanced linear regression guide.

Homoscedasticity in linear regression

  • Assumption #7: Finally, you need to check that the residuals (errors) of the regression line are approximately normally distributed (we explain these terms in our enhanced linear regression guide). Two common methods to check this assumption include using either a histogram (with a superimposed normal curve) or a Normal P-P Plot. Again, in our enhanced linear regression guide, we: (a) show you how to check this assumption using SPSS Statistics, whether you use a histogram (with superimposed normal curve) or Normal P-P Plot; (b) explain how to interpret these diagrams; and (c) provide a possible solution if your data fails to meet this assumption.

You can check assumptions #3, #4, #5, #6 and #7 using SPSS Statistics. Assumption #3 should be checked first, before moving onto assumptions #4, #5, #6 and #7. We suggest testing the assumptions in this order because assumptions #3, #4, #5, #6 and #7 require you to run the linear regression procedure in SPSS Statistics first, so it is easier to deal with these after checking assumptions #1 and #2. Just remember that if you do not run the statistical tests on these assumptions correctly, the results you get when running a linear regression might not be valid. This is why we dedicate a number of sections of our enhanced linear regression guide to help you get this right. You can find out more about our enhanced content as a whole on our Features: Overview page, or more specifically, learn how we help with testing assumptions on our Features: Assumptions page.

In the section, Procedure , we illustrate the SPSS Statistics procedure to perform a linear regression assuming that no assumptions have been violated. First, we introduce the example that is used in this guide.

A salesperson for a large car brand wants to determine whether there is a relationship between an individual's income and the price they pay for a car. As such, the individual's "income" is the independent variable and the "price" they pay for a car is the dependent variable. The salesperson wants to use this information to determine which cars to offer potential customers in new areas where average income is known.

Setup in SPSS Statistics

In SPSS Statistics, we created two variables so that we could enter our data: Income (the independent variable), and Price (the dependent variable). It can also be useful to create a third variable, caseno , to act as a chronological case number. This third variable is used to make it easy for you to eliminate cases (e.g., significant outliers) that you have identified when checking for assumptions. However, we do not include it in the SPSS Statistics procedure that follows because we assume that you have already checked these assumptions. In our enhanced linear regression guide, we show you how to correctly enter data in SPSS Statistics to run a linear regression when you are also checking for assumptions. You can learn about our enhanced data setup content on our Features: Data Setup page. Alternately, see our generic, "quick start" guide: Entering Data in SPSS Statistics .

Test Procedure in SPSS Statistics

The five steps below show you how to analyse your data using linear regression in SPSS Statistics when none of the seven assumptions in the previous section, Assumptions, have been violated. At the end of these steps, we show you how to interpret the results from your linear regression. If you are looking for help to make sure your data meets assumptions #3, #4, #5, #6 and #7, which are required when using linear regression and can be tested using SPSS Statistics, you can learn more about our enhanced guides on our Features: Overview page.

Note: The procedure that follows is identical for SPSS Statistics versions 18 to 28, as well as the subscription version of SPSS Statistics, with version 28 and the subscription version being the latest versions of SPSS Statistics. However, in version 27 and the subscription version, SPSS Statistics introduced a new look to their interface called "SPSS Light", replacing the previous look for versions 26 and earlier, which was called "SPSS Standard". Therefore, if you have SPSS Statistics version 27 or 28 (or the subscription version of SPSS Statistics), the images that follow will be light grey rather than blue. However, the procedure is identical.

Menu for a linear regression in SPSS Statistics

Published with written permission from SPSS Statistics, IBM Corporation.

You will be presented with the Linear Regression dialogue box:

'Linear Regression' dialogue box in SPSS Statistics. Variables 'Income' & 'Price' on the left

Output of Linear Regression Analysis

SPSS Statistics will generate quite a few tables of output for a linear regression. In this section, we show you only the three main tables required to understand your results from the linear regression procedure, assuming that no assumptions have been violated. A complete explanation of the output you have to interpret when checking your data for the six assumptions required to carry out linear regression is provided in our enhanced guide. This includes relevant scatterplots, histogram (with superimposed normal curve), Normal P-P Plot, casewise diagnostics and the Durbin-Watson statistic. Below, we focus on the results for the linear regression analysis only.

The first table of interest is the Model Summary table, as shown below:

'Model Summary' table for a linear regression in SPSS Statistics, showing the 'R' and 'R Square' values.

This table provides the R and R 2 values. The R value represents the simple correlation and is 0.873 (the " R " Column), which indicates a high degree of correlation. The R 2 value (the " R Square " column) indicates how much of the total variation in the dependent variable, Price , can be explained by the independent variable, Income . In this case, 76.2% can be explained, which is very large.

The next table is the ANOVA table, which reports how well the regression equation fits the data (i.e., predicts the dependent variable) and is shown below:

'ANOVA' table for a linear regression in SPSS Statistics. Shows 'Sum of Squares', 'df', 'Mean Square', 'F' & 'Sig.'

This table indicates that the regression model predicts the dependent variable significantly well. How do we know this? Look at the " Regression " row and go to the " Sig. " column. This indicates the statistical significance of the regression model that was run. Here, p < 0.0005, which is less than 0.05, and indicates that, overall, the regression model statistically significantly predicts the outcome variable (i.e., it is a good fit for the data).

The Coefficients table provides us with the necessary information to predict price from income, as well as determine whether income contributes statistically significantly to the model (by looking at the " Sig. " column). Furthermore, we can use the values in the " B " column under the " Unstandardized Coefficients " column, as shown below:

'Coefficients' table for a linear regression in SPSS Statistics. Shows 'Unstandardized Coefficients', 'Standardized Coefficients', 't' & 'Sig.'

to present the regression equation as:

Price = 8287 + 0.564(Income)
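For illustration (the income value here is assumed, not taken from the guide): an individual with an income of $30,000 would have a predicted price of 8287 + 0.564 × 30,000 ≈ $25,207.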

If you are unsure how to interpret regression equations or how to use them to make predictions, we discuss this in our enhanced linear regression guide. We also show you how to write up the results from your assumptions tests and linear regression output if you need to report this in a dissertation/thesis, assignment or research report. We do this using the Harvard and APA styles. You can learn more about our enhanced content on our Features: Overview page.

We also have a "quick start" guide on how to perform a linear regression analysis in Stata .

Statistics LibreTexts

11.1: Testing the Hypothesis that β = 0

The correlation coefficient, \(r\), tells us about the strength and direction of the linear relationship between \(x\) and \(y\). However, the reliability of the linear model also depends on how many observed data points are in the sample. We need to look at both the value of the correlation coefficient \(r\) and the sample size \(n\), together. We perform a hypothesis test of the "significance of the correlation coefficient" to decide whether the linear relationship in the sample data is strong enough to use to model the relationship in the population.

The sample data are used to compute \(r\), the correlation coefficient for the sample. If we had data for the entire population, we could find the population correlation coefficient. But because we have only sample data, we cannot calculate the population correlation coefficient. The sample correlation coefficient, \(r\), is our estimate of the unknown population correlation coefficient.

  • The symbol for the population correlation coefficient is \(\rho\), the Greek letter "rho."
  • \(\rho =\) population correlation coefficient (unknown)
  • \(r =\) sample correlation coefficient (known; calculated from sample data)

The hypothesis test lets us decide whether the value of the population correlation coefficient \(\rho\) is "close to zero" or "significantly different from zero". We decide this based on the sample correlation coefficient \(r\) and the sample size \(n\).

If the test concludes that the correlation coefficient is significantly different from zero, we say that the correlation coefficient is "significant."

  • Conclusion: There is sufficient evidence to conclude that there is a significant linear relationship between \(x\) and \(y\) because the correlation coefficient is significantly different from zero.
  • What the conclusion means: There is a significant linear relationship between \(x\) and \(y\). We can use the regression line to model the linear relationship between \(x\) and \(y\) in the population.

If the test concludes that the correlation coefficient is not significantly different from zero (it is close to zero), we say that the correlation coefficient is "not significant".

  • Conclusion: "There is insufficient evidence to conclude that there is a significant linear relationship between \(x\) and \(y\) because the correlation coefficient is not significantly different from zero."
  • What the conclusion means: There is not a significant linear relationship between \(x\) and \(y\). Therefore, we CANNOT use the regression line to model a linear relationship between \(x\) and \(y\) in the population.
  • If \(r\) is significant and the scatter plot shows a linear trend, the line can be used to predict the value of \(y\) for values of \(x\) that are within the domain of observed \(x\) values.
  • If \(r\) is not significant OR if the scatter plot does not show a linear trend, the line should not be used for prediction.
  • If \(r\) is significant and if the scatter plot shows a linear trend, the line may NOT be appropriate or reliable for prediction OUTSIDE the domain of observed \(x\) values in the data.

PERFORMING THE HYPOTHESIS TEST

  • Null Hypothesis: \(H_{0}: \rho = 0\)
  • Alternate Hypothesis: \(H_{a}: \rho \neq 0\)

WHAT THE HYPOTHESES MEAN IN WORDS:

  • Null Hypothesis \(H_{0}\) : The population correlation coefficient IS NOT significantly different from zero. There IS NOT a significant linear relationship (correlation) between \(x\) and \(y\) in the population.
  • Alternate Hypothesis \(H_{a}\) : The population correlation coefficient IS significantly DIFFERENT FROM zero. There IS A SIGNIFICANT LINEAR RELATIONSHIP (correlation) between \(x\) and \(y\) in the population.

DRAWING A CONCLUSION: There are two methods of making the decision. The two methods are equivalent and give the same result.

  • Method 1: Using the \(p\text{-value}\)
  • Method 2: Using a table of critical values

In this chapter of this textbook, we will always use a significance level of 5%, \(\alpha = 0.05\)

Using the \(p\text{-value}\) method, you could choose any appropriate significance level you want; you are not limited to using \(\alpha = 0.05\). But the table of critical values provided in this textbook assumes that we are using a significance level of 5%, \(\alpha = 0.05\). (If we wanted to use a different significance level than 5% with the critical value method, we would need different tables of critical values that are not provided in this textbook.)

METHOD 1: Using a \(p\text{-value}\) to make a decision

To calculate the \(p\text{-value}\) using LinRegTTEST:

On the LinRegTTEST input screen, on the line prompt for \(\beta\) or \(\rho\), highlight "\(\neq 0\)"

The output screen shows the \(p\text{-value}\) on the line that reads "\(p =\)".

(Most computer statistical software can calculate the \(p\text{-value}\).)

If the \(p\text{-value}\) is less than the significance level ( \(\alpha = 0.05\) ):

  • Decision: Reject the null hypothesis.
  • Conclusion: "There is sufficient evidence to conclude that there is a significant linear relationship between \(x\) and \(y\) because the correlation coefficient is significantly different from zero."

If the \(p\text{-value}\) is NOT less than the significance level ( \(\alpha = 0.05\) )

  • Decision: DO NOT REJECT the null hypothesis.
  • Conclusion: "There is insufficient evidence to conclude that there is a significant linear relationship between \(x\) and \(y\) because the correlation coefficient is NOT significantly different from zero."

Calculation Notes:

  • You will use technology to calculate the \(p\text{-value}\). The following describes the calculations to compute the test statistics and the \(p\text{-value}\):
  • The \(p\text{-value}\) is calculated using a \(t\)-distribution with \(n - 2\) degrees of freedom.
  • The formula for the test statistic is \(t = \frac{r\sqrt{n-2}}{\sqrt{1-r^{2}}}\). The value of the test statistic, \(t\), is shown in the computer or calculator output along with the \(p\text{-value}\). The test statistic \(t\) has the same sign as the correlation coefficient \(r\).
  • The \(p\text{-value}\) is the combined area in both tails.

An alternative way to calculate the \(p\text{-value}\) ( \(p\) ) given by LinRegTTest is the command 2*tcdf(abs(t),10^99, n-2) in 2nd DISTR.

THIRD-EXAM vs FINAL-EXAM EXAMPLE: \(p\text{-value}\) method

  • Consider the third exam/final exam example.
  • The line of best fit is: \(\hat{y} = -173.51 + 4.83x\) with \(r = 0.6631\) and there are \(n = 11\) data points.
  • Can the regression line be used for prediction? Given a third exam score ( \(x\) value), can we use the line to predict the final exam score (predicted \(y\) value)?
  • \(H_{0}: \rho = 0\)
  • \(H_{a}: \rho \neq 0\)
  • \(\alpha = 0.05\)
  • The \(p\text{-value}\) is 0.026 (from LinRegTTest on your calculator or from computer software).
  • The \(p\text{-value}\), 0.026, is less than the significance level of \(\alpha = 0.05\).
  • Decision: Reject the Null Hypothesis \(H_{0}\)
  • Conclusion: There is sufficient evidence to conclude that there is a significant linear relationship between the third exam score (\(x\)) and the final exam score (\(y\)) because the correlation coefficient is significantly different from zero.

Because \(r\) is significant and the scatter plot shows a linear trend, the regression line can be used to predict final exam scores.
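For readers using statistical software rather than a TI calculator, a minimal R sketch of the same calculation, using the r and n values from this example, is:

r <- 0.6631
n <- 11
t <- r * sqrt(n - 2) / sqrt(1 - r^2)                      # test statistic
p.value <- 2 * pt(abs(t), df = n - 2, lower.tail = FALSE) # two-tailed p-value
t; p.value                                                # roughly t = 2.66, p = 0.026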

METHOD 2: Using a table of Critical Values to make a decision

The 95% Critical Values of the Sample Correlation Coefficient Table can be used to give you a good idea of whether the computed value of \(r\) is significant or not . Compare \(r\) to the appropriate critical value in the table. If \(r\) is not between the positive and negative critical values, then the correlation coefficient is significant. If \(r\) is significant, then you may want to use the line for prediction.

Example \(\PageIndex{1}\)

Suppose you computed \(r = 0.801\) using \(n = 10\) data points. \(df = n - 2 = 10 - 2 = 8\). The critical values associated with \(df = 8\) are \(-0.632\) and \(+0.632\). If \(r <\) negative critical value or \(r >\) positive critical value, then \(r\) is significant. Since \(r = 0.801\) and \(0.801 > 0.632\), \(r\) is significant and the line may be used for prediction. If you view this example on a number line, it will help you.

Exercise \(\PageIndex{1}\)

For a given line of best fit, you computed that \(r = 0.6501\) using \(n = 12\) data points and the critical value is 0.576. Can the line be used for prediction? Why or why not?

If the scatter plot looks linear then, yes, the line can be used for prediction, because \(r >\) the positive critical value.

Example \(\PageIndex{2}\)

Suppose you computed \(r = –0.624\) with 14 data points. \(df = 14 – 2 = 12\). The critical values are \(-0.532\) and \(0.532\). Since \(-0.624 < -0.532\), \(r\) is significant and the line can be used for prediction.

Exercise \(\PageIndex{2}\)

For a given line of best fit, you compute that \(r = 0.5204\) using \(n = 9\) data points, and the critical value is \(0.666\). Can the line be used for prediction? Why or why not?

No, the line cannot be used for prediction, because \(r <\) the positive critical value.

Example \(\PageIndex{3}\)

Suppose you computed \(r = 0.776\) and \(n = 6\). \(df = 6 - 2 = 4\). The critical values are \(-0.811\) and \(0.811\). Since \(-0.811 < 0.776 < 0.811\), \(r\) is not significant, and the line should not be used for prediction.

Exercise \(\PageIndex{3}\)

For a given line of best fit, you compute that \(r = -0.7204\) using \(n = 8\) data points, and the critical value is \(= 0.707\). Can the line be used for prediction? Why or why not?

Yes, the line can be used for prediction, because \(r <\) the negative critical value.

THIRD-EXAM vs FINAL-EXAM EXAMPLE: critical value method

Consider the third exam/final exam example. The line of best fit is: \(\hat{y} = -173.51 + 4.83x\) with \(r = 0.6631\) and there are \(n = 11\) data points. Can the regression line be used for prediction? Given a third-exam score ( \(x\) value), can we use the line to predict the final exam score (predicted \(y\) value)?

  • Use the "95% Critical Value" table for \(r\) with \(df = n - 2 = 11 - 2 = 9\).
  • The critical values are \(-0.602\) and \(+0.602\)
  • Since \(0.6631 > 0.602\), \(r\) is significant.
  • Conclusion:There is sufficient evidence to conclude that there is a significant linear relationship between the third exam score (\(x\)) and the final exam score (\(y\)) because the correlation coefficient is significantly different from zero.

Example \(\PageIndex{4}\)

Suppose you computed the following correlation coefficients. Using the table at the end of the chapter, determine if \(r\) is significant and the line of best fit associated with each r can be used to predict a \(y\) value. If it helps, draw a number line.

  • \(r = –0.567\) and the sample size, \(n\), is \(19\). The \(df = n - 2 = 17\). The critical value is \(-0.456\). \(-0.567 < -0.456\) so \(r\) is significant.
  • \(r = 0.708\) and the sample size, \(n\), is \(9\). The \(df = n - 2 = 7\). The critical value is \(0.666\). \(0.708 > 0.666\) so \(r\) is significant.
  • \(r = 0.134\) and the sample size, \(n\), is \(14\). The \(df = 14 - 2 = 12\). The critical value is \(0.532\). \(0.134\) is between \(-0.532\) and \(0.532\) so \(r\) is not significant.
  • \(r = 0\) and the sample size, \(n\), is five. No matter what the \(dfs\) are, \(r = 0\) is between the two critical values so \(r\) is not significant.

Exercise \(\PageIndex{4}\)

For a given line of best fit, you compute that \(r = 0\) using \(n = 100\) data points. Can the line be used for prediction? Why or why not?

No, the line cannot be used for prediction no matter what the sample size is.

Assumptions in Testing the Significance of the Correlation Coefficient

Testing the significance of the correlation coefficient requires that certain assumptions about the data are satisfied. The premise of this test is that the data are a sample of observed points taken from a larger population. We have not examined the entire population because it is not possible or feasible to do so. We are examining the sample to draw a conclusion about whether the linear relationship that we see between \(x\) and \(y\) in the sample data provides strong enough evidence so that we can conclude that there is a linear relationship between \(x\) and \(y\) in the population.

The regression line equation that we calculate from the sample data gives the best-fit line for our particular sample. We want to use this best-fit line for the sample as an estimate of the best-fit line for the population. Examining the scatter plot and testing the significance of the correlation coefficient helps us determine if it is appropriate to do this.

The assumptions underlying the test of significance are:

  • There is a linear relationship in the population that models the average value of \(y\) for varying values of \(x\). In other words, the expected value of \(y\) for each particular value lies on a straight line in the population. (We do not know the equation for the line for the population. Our regression line from the sample is our best estimate of this line in the population.)
  • The \(y\) values for any particular \(x\) value are normally distributed about the line. This implies that there are more \(y\) values scattered closer to the line than are scattered farther away. Assumption (1) implies that these normal distributions are centered on the line: the means of these normal distributions of \(y\) values lie on the line.
  • The standard deviations of the population \(y\) values about the line are equal for each value of \(x\). In other words, each of these normal distributions of \(y\) values has the same shape and spread about the line.
  • The residual errors are mutually independent (no pattern).
  • The data are produced from a well-designed, random sample or randomized experiment.

Linear regression is a procedure for fitting a straight line of the form \(\hat{y} = a + bx\) to data. The conditions for regression are:

  • Linear In the population, there is a linear relationship that models the average value of \(y\) for different values of \(x\).
  • Independent The residuals are assumed to be independent.
  • Normal The \(y\) values are distributed normally for any value of \(x\).
  • Equal variance The standard deviation of the \(y\) values is equal for each \(x\) value.
  • Random The data are produced from a well-designed random sample or randomized experiment.

The slope \(b\) and intercept \(a\) of the least-squares line estimate the slope \(\beta\) and intercept \(\alpha\) of the population (true) regression line. To estimate the population standard deviation of \(y\), \(\sigma\), use the standard deviation of the residuals, \(s\), where \(s = \sqrt{\frac{SSE}{n-2}}\). The variable \(\rho\) (rho) is the population correlation coefficient. To test the null hypothesis \(H_{0}: \rho =\) hypothesized value, use a linear regression t-test. The most common null hypothesis is \(H_{0}: \rho = 0\), which indicates there is no linear relationship between \(x\) and \(y\) in the population. The TI-83, 83+, 84, 84+ calculator function LinRegTTest can perform this test (STATS TESTS LinRegTTest).
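For readers working in R, the equivalent test of \(H_{0}: \rho = 0\) can be run with cor.test(); a minimal sketch with placeholder vectors x and y (replace these with your own data):

x <- c(1, 2, 3, 4, 5, 6, 7, 8)                  # placeholder data
y <- c(2.1, 2.9, 4.2, 4.8, 6.1, 6.8, 8.2, 8.9)  # placeholder data
cor.test(x, y)   # reports t, df = n - 2, the p-value, and the sample estimate of rho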

Formula Review

Least Squares Line or Line of Best Fit:

\[\hat{y} = a + bx\]

\[a = y\text{-intercept}\]

\[b = \text{slope}\]

Standard deviation of the residuals:

\[s = \sqrt{\frac{SSE}{n-2}}\]

\[SSE = \text{sum of squared errors}\]

\[n = \text{the number of data points}\]

How to Use the linearHypothesis() Function in R

In statistics, understanding how variables relate to each other is crucial, as it helps in making informed decisions. When we build regression models, we need to check whether certain combinations of variables are statistically significant. In R, the linearHypothesis() function in the "car" package serves this purpose, and this article gives a simple guide to using linearHypothesis() in R for such analyses.

What is a linear hypothesis?

The linearHypothesis() function is a tool in R's "car" package used to test linear hypotheses in regression models. It helps us determine whether certain combinations of variables have a significant impact on our model's outcome.

The general form of the null hypothesis being tested is H 0 : Cβ = 0, where:
  • H 0 denotes the null hypothesis.
  • C is a matrix representing the coefficients of the linear combination being tested.
  • β represents the vector of coefficients in the regression model.
  • 0 signifies a vector of zeros.
The basic usage is linearHypothesis(model, hypothesis.matrix), where:
  • model is the fitted regression model for which hypotheses are to be tested.
  • hypothesis.matrix is a matrix specifying the linear hypotheses to be tested.

Implementation of the linearHypothesis() Function in R

The hypothesis being tested is whether the coefficients of X1 and X2 are equal (i.e., X1 – X2 = 0).
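The article's original code is not reproduced here; the following is a minimal sketch assuming simulated predictors X1, X2, X3 and a response Y (the variable names come from the hypothesis statement, while the data itself is illustrative, so the printed p-value will differ from the 0.5459 quoted below):

library(car)
set.seed(123)                                   # illustrative simulated data
X1 <- rnorm(100); X2 <- rnorm(100); X3 <- rnorm(100)
Y  <- 2 + 1.5 * X1 + 1.4 * X2 - 0.8 * X3 + rnorm(100)
model <- lm(Y ~ X1 + X2 + X3)
linearHypothesis(model, "X1 = X2")              # tests H0: X1 - X2 = 0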

  • The output provides the results of the linear hypothesis test, including the Residual Degrees of Freedom (Res.Df), Residual Sum of Squares (RSS), the change in the RSS between the restricted and full models, the F-statistic (F), and the corresponding p-value (Pr(>F)).
  • In this case, the p-value is 0.5459, which indicates that we fail to reject the null hypothesis at conventional significance levels, suggesting that there is no significant difference between the coefficients of X1 and X2.

Perform linearHypothesis() Function on mtcars dataset

The hypothesis being tested is whether the coefficients of ‘cyl’ and ‘disp’ sum up to zero (i.e., cyl + disp = 0).
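A sketch of that test on mtcars; the exact predictors used in the article's model are not shown in the text, so the regression formula below is an assumption and its p-value may differ from the 0.3321 quoted next:

model.mtcars <- lm(mpg ~ cyl + disp + hp + wt, data = mtcars)
linearHypothesis(model.mtcars, "cyl + disp = 0")   # tests H0: cyl + disp = 0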

  • In this case, the p-value is 0.3321, which indicates that we fail to reject the null hypothesis at conventional significance levels. Therefore, we don’t have sufficient evidence to conclude that the sum of the coefficients of ‘cyl’ and ‘disp’ is significantly different from zero.

The `linearHypothesis()` function in R's "car" package provides a straightforward method for testing linear hypotheses in regression models. By assessing the significance of specific variable combinations, analysts can make informed decisions about the model's predictive power.


  14. How to Simplify Hypothesis Testing for Linear Regression in Python

    A Quick Reminder Regarding Linear Regression. Before I share the 4 assumptions that should be met in order to run a linear regression hypothesis test, there is one important point to keep in mind regarding linear regression. Linear regression can be thought of as a dual purpose tool: To predict future values for the y variable

  15. PDF Lecture 9: Linear Regression

    Regression. Technique used for the modeling and analysis of numerical data. Exploits the relationship between two or more variables so that we can gain information about one of them through knowing values of the other. Regression can be used for prediction, estimation, hypothesis testing, and modeling causal relationships.

  16. PDF Lecture 5 Hypothesis Testing in Multiple Linear Regression

    As in simple linear regression, under the null hypothesis t 0 = βˆ j seˆ(βˆ j) ∼ t n−p−1. We reject H 0 if |t 0| > t n−p−1,1−α/2. This is a partial test because βˆ j depends on all of the other predictors x i, i 6= j that are in the model. Thus, this is a test of the contribution of x j given the other predictors in the model.

  17. 15.5: Hypothesis Tests for Regression Models

    15.5: Hypothesis Tests for Regression Models. So far we've talked about what a regression model is, how the coefficients of a regression model are estimated, and how we quantify the performance of the model (the last of these, incidentally, is basically our measure of effect size). The next thing we need to talk about is hypothesis tests.

  18. Multiple Linear Regression

    The formula for a multiple linear regression is: = the predicted value of the dependent variable. = the y-intercept (value of y when all other parameters are set to 0) = the regression coefficient () of the first independent variable () (a.k.a. the effect that increasing the value of the independent variable has on the predicted y value ...

  19. Linear Regression in R

    To perform linear regression in R, there are 6 main steps. Use our sample data and code to perform simple or multiple regression. FAQ ... aka the probability of finding the given t statistic if the null hypothesis of no relationship were true. The final three lines are model diagnostics - the most important thing to note is the p ...

  20. Hypothesis Testing On Linear Regression

    Hypothesis Testing On Linear Regression. ... Hence, every time we perform linear regression, we need to test whether the fitted line is a significant one or not (in other terms, test whether β₁ ...

  21. Linear Regression Analysis using SPSS Statistics

    Linear regression is the next step up after correlation. It is used when we want to predict the value of a variable based on the value of another variable. The variable we want to predict is called the dependent variable (or sometimes, the outcome variable). The variable we are using to predict the other variable's value is called the ...

  22. 11.1: Testing the Hypothesis that β = 0

    Linear regression is a procedure for fitting a straight line of the form \(\hat{y} = a + bx\) to data. The conditions for regression are: Linear In the population, there is a linear relationship that models the average value of \(y\) for different values of \(x\). Independent The residuals are assumed to be independent.

  23. Linear Regression in Machine learning

    Hypothesis function in Linear Regression. As we have assumed earlier that our independent feature is the experience i.e X and the respective salary Y is the dependent variable. Let's assume there is a linear relationship between X and Y then the salary can be predicted using: [Tex]\hat{Y} = \theta_1 + \theta_2X [/Tex] ...

  24. How to Use the linearHypothesis() Function in R

    In R Programming Language a tool called linear hypothesis in the "car" package for this purpose. Also, this article gives a simple guide on using linear hypothesis in R for such analyses. What is a linear hypothesis? The linear hypothesis function is a tool in R's "car" package used to test linear hypotheses in regression models. It ...