
Linear Regression for Marketing Analytics [Hands-on]

If you are thinking about learning Predictive Analytics to improve your marketing efficiency, then Linear Regression is the concept to start with. In this discussion, I present a step-by-step guide to performing Linear Regression for Marketing Analytics - the first topic you should take up in your Marketing Analytics journey.

Introduction to Linear Regression

Linear Regression is one of the most powerful yet basic concepts in Marketing Analytics. If you are looking to learn Machine Learning to lend a helping hand to your Marketing education, then Linear Regression is the topic to begin with.

As a marketer, you know that Machine Learning and Data Science have a significant impact on decision making in Marketing. Marketing Analytics provides conclusive reasoning for decisions in a field that, for years and years, ran on the golden gut of marketers.

Learning Regression for Marketing Analytics gives you the ability to predict marketing variables that may not show any visible pattern. In this discussion, I will introduce you to linear regression and show how it can transform your marketing and sales analytics.

Therefore, if you want to get started with Machine Learning or Marketing Analytics, Linear Regression is the place to begin. I assure you that after this discussion, Linear Regression as a concept will be crystal clear to you.

Machine Learning for Marketing Analytics

You may already know that machine learning algorithms can be broadly divided into two categories: Supervised and Unsupervised Learning.

In Supervised Learning, the dataset you work with contains past observed values of the variable you are looking to predict.

For example, you could be required to create a model for predicting the Sales of a product based on the Advertising Expenditure and Sales Expenditure for a given quarter.

You would begin by asking the Sales Manager for data from the previous quarters, expecting a well-laid-out Excel sheet with Quarter, Advertising Expenditure, Sales Expenditure, and Sales in separate columns.

With such a dataset in your possession, the job of the predictive algorithm you create is to find the relationship between these variables. The relationship should be generalized enough that when you enter the advertising and sales expenditures for the coming quarter, it gives the predicted sales for that quarter.

Unsupervised Learning is when this observed variable is not made available to you. In that case, you find yourself solving a different kind of marketing problem altogether, not the prediction of a variable. Unsupervised Learning is not a part of this discussion.

Supervised Learning has two sub-categories of problems: Regression and Classification.

If you are pressed for time, you can go ahead and watch my video first in which I have explained all the concepts that I have shared below.

Supervised Learning for Marketing Analytics

I shared a common marketing use case above. In that example, you had to predict the sales for the quarter using two different kinds of expenditure variables. Now, we know that the value of sales can be any number - arguably a positive one. Sales could therefore range anywhere from 0 to some very high number.

Such a variable is a continuous variable: one that can take any value within a wide range. And this, in fact, is the simplest way to understand what regression is.

A prediction problem in which the variable to be predicted is a continuous variable is a Regression problem.

Let’s look at an entirely different marketing use case to understand what is not a regression problem.

You have been recruited into the marketing team of a large private bank (a common placement for many business students).

You are given data on the bank's prospects from the last year, with details like Age, Job, Marital Status, No. of Children, Previous Loans, Previous Defaults, etc. Along with that, you are also told whether each person took a loan from the bank or not (0 for did not take the loan, 1 for did take the loan).

Your job as a young analytical marketer is to predict whether a future prospect will take a loan from your bank or not.

Now, please note that in such a prediction problem, your task is simply to classify the prospects based on your algorithm's understanding of whether the prospect will buy or not. This means that the possible values for the outcome are discrete (0 or 1), not continuous.

A prediction problem in which the variable to be predicted is a discrete variable is a Classification problem.

There are a variety of prediction problems across industries. My objective in this discussion is to equip you with the intuition and some hands-on coding of Linear Regression so that you can appreciate the use cases irrespective of the industry.

Vocabulary of Regression

Before I dive straight into what Linear Regression is, let me help you form an understanding of the vocabulary used when explaining regression. I will link each term to the use cases mentioned above, so just reading through this section will make clear what is what.

Target Variable: The variable to be predicted is called the Target Variable. When you had to predict the Sales for the quarter using the Advertising Expenditure and Sales Expenditure, Sales was the target variable.

Naturally, the target variable can also be referred to as the Dependent Variable, as its value depends on the other variables in the system. In our marketing use case, Sales obviously depends on how much you have spent on Advertising and on Sales Promotions.

The target variable is commonly denoted as y.

Feature Variable: All the other variables that are used to predict the target variable are called the Feature variables.

The feature variables can also be called Independent Variables. In our examples, the Advertising and Sales Promotion expenditures are the independent variables. For a different machine learning problem altogether, such as image recognition, each pixel is a feature variable.

The feature variable is commonly denoted as x. Multiple feature variables are denoted as x1, x2, ..., xn.

Finally, here are the different pairs of names by which these two variables are referred to:

  • Independent Variable (x) and Dependent Variable (y)
  • Feature Variable (x) and Target Variable (y)
  • Predictor Variable (x) and Predicted Variable (y)
  • Input Variable (x) and Output Variable (y)

The Linear Regression Equation

There are many regression models, of which the most basic is Linear Regression. For nonlinear regression, there are models like Generalized Additive Models (GAMs) and tree-based models.

Since you are starting off with Marketing Analytics, my objective in this discussion is to take you only through Linear Regression for Marketing Analytics and develop your understanding with that as a base.

Note: What I am going to share in the remaining part of the article tends to tune out a lot of people who are frightened by anything that even looks like Math. You will see some mathematical equations with unfamiliar notation, and some formulas as well.

None of it involves a mathematical concept you would not have studied in school. If you just manage to sit through it, you will realize that it is nothing but plain English written in a jazzed-up manner with equations - which, by the way, are equally important.

For you as an Analytical Marketer, the intuition is important so please just focus on that.

A linear regression model for a use case which has just one independent/feature variable would look like:

ŷ = β0 + (β1 * x1)

When you use more than one feature variable in your model then the linear regression model will look like:

ŷ = β0 + (β1 * x1) + (β2 * x2) + ... + (βn * xn)

Let me quickly decipher what this equation means.

What are the β Parameters? 

The symbols that you see - β0, β1, β2 - are called Model Parameters. These are constants that determine the predicted value of the target variable. They can also be referred to as Model Coefficients.

Specifically, β0 is referred to as the Bias Parameter. You will notice that it is not multiplied by any variable/feature in the model; it is a standalone parameter (the intercept) that shifts the model up or down.

Notice carefully that I referred to the Model Parameters (β) as constants and the features (x) as variables. This distinction needs to be understood, and proper usage of the terms makes a big difference in understanding the topic.

Why is the predicted variable ŷ and not y?

As I mentioned above, y is the target variable - the variable we are trying to predict. While y represents the actual value of that variable, ŷ represents its predicted value.

Since there is always some error in the prediction, the predicted value is given a notation distinct from the actual variable.

Why is this called a ‘Linear’ regression model?

This is a simple concept straight from your class 10 textbook. If you look at the equation again, you will see that each of the independent variables (x) appears with a power of 1 (degree 1). This means that the variables are not raised to a higher power (i.e. x², x³, ...).

Such a model will always be represented by a straight line when plotted on a graph, as I will show later in this discussion.

Visual Representation of Linear Regression

I briefly touched upon a use case above in which you were to predict the Sales of a quarter based on the Advertisement Expenditure and the Sales Expenditure. Since there are two features in this model, the structure of the model will be like the equation below:

Sales = β0 + (β1 * Ad_Exp) + (β2 * Sales_Exp)

However, for simplicity, let us assume that the Sales Manager could only provide data for the Advertisement Expenditure, so there will be a single feature in our model. In this situation, this is how our model equation is going to look:

Sales = β0 + (β1 * Ad_Exp)

Here is the exact data that you received from the sales manager for you to work on.

[Table: quarterly data with Quarter, Ad_Exp, and Sales columns]

This data shows that for Quarter 1, when the Advertisement Expenditure was 24,000, the sales were 724,000. I am ignoring the units of currency for the time being; they could be Indian Rupees (INR), United States Dollars (USD), or anything else.

Now, I went ahead and plotted both of these variables on a scatter plot with the Advertisement Expenditure on the x-axis and the Sales on the y-axis.

[Scatter plot: Advertisement Expenditure (x-axis) vs Sales (y-axis)]

In this scatter plot, each dot represents one quarter given in the table. For that particular quarter, we will be able to determine the Advertisement Expenditure and the resulting Sales from the x and y axis, respectively.

[Scatter plot with one quarter's point marked]

How would you predict using Linear Regression?

You will remember that the objective of this exercise is to predict the sales of future quarters based on the features we have. In this case, we have just one feature variable, i.e. Ad_Exp.

In order to know where the next dot will lie on the scatter plot, you need to find the equation of a straight line that passes through these points, thereby representing the trend.

[Scatter plot with three candidate trendlines drawn through the points]

Using Python, I have drawn three lines that pass through these points, each represented by a different equation. Just by looking at the three, you can say that the one in the middle seems to pass through the points almost 'perfectly'. But how do we determine whether a line passes through the points well or not?

Which is the best trendline in Linear Regression?

Now obviously, you don't need to draw three lines on your scatter plot every time you do linear regression. This is just a device to explain the intuition behind how we choose the best-fitting line.

The best trendline through the scatter plot is the one that minimizes the difference between the actual value and the predicted value across all the points.

If you zoom in on one of the points, you will see exactly what this difference between the actual and the predicted value is. The metric used to capture the error of the entire model across all the points is called the Residual Sum of Squares (RSS), which I will discuss in my next article on errors.

But to explain briefly: each point's distance from the fitted line is squared, and these squares are added up. What we finally get is the Residual Sum of Squares: RSS = Σ (y - ŷ)² over all the points.

As our intuition already suggested, out of the three lines I plotted, the one in the center seems to have the least total difference across all the points. And if we run the curve fitting in Python, it indeed turns out to be the best-fitting line for this scatter plot.

[Scatter plot with the best-fitting trendline]

For this part of the discussion, my purpose was to just give you the intuition of Linear Regression for Marketing Analytics.

With this, you should be able to state the objective of a linear regression problem: to determine the regression model parameters (β0, β1, ..., βn) that minimize the error of the model.

Notice again that this trendline is linear, i.e. a straight line. In general, it is not at all necessary that a trendline be straight.

Non-linear regression is something that I will discuss later in the series once I have helped you develop an understanding for regression.

Hands-on Coding: Linear Regression Model with Marketing Data

This is the section where you will learn how to perform the regression in Python, continuing with the same data that the Sales Manager shared with you.

Sales is the target variable that needs to be predicted. Now, based on this data, your objective is to create a predictive model (just like the equation above), an equation in which you can plug in the Ad_exp value for a future quarter and predict the Sales for that quarter.

Let us straightaway get down to some hands-on coding to get this prediction done. Please do not feel left out if you do not have experience with Python; you will not require any prerequisite knowledge. In fact, the best way to learn is to get your hands dirty by solving a problem - like the one we are doing.

Step 1: Importing Python Libraries

The first step is to fire up your Jupyter notebook and load the prerequisite libraries. Here are the important libraries we will need for this linear regression:

  • numpy (to perform certain mathematical operations for regression)
  • pandas (the data that you load will be stored in a pandas DataFrame)
  • matplotlib.pyplot (you will use matplotlib to plot the data)

In order to load these, just start with these few lines of code in your first cell.
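A minimal version of that first cell, using the conventional np/pd/plt aliases, looks like this:

```python
# Import the libraries needed for this exercise
import numpy as np               # mathematical operations for the regression
import pandas as pd              # the data will be stored in a pandas DataFrame
import matplotlib.pyplot as plt  # plotting the data

# Display all plots inline, within the Jupyter notebook
%matplotlib inline
```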

The last line of code helps in displaying all the graphs that we will be making within the Jupyter notebook.

Step 2: Loading the data in a DataFrame

Let me now import my data into a DataFrame. A DataFrame is a data type provided by pandas; the simplest way to understand it is as a table that stores all your data. It is on this table that we will perform all of our Python operations.

Now, I am saving my table (which you saw above) in a variable called 'data'. After the '=' sign, I have used the command pd.read_csv.

This loads the .csv file stored at the path mentioned on my laptop into the Jupyter notebook. Please note that you will need to enter the path of the location where the .csv is stored on your machine.
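A sketch of that cell - the file name and path below are placeholders, so substitute your own:

```python
# Load the Sales Manager's .csv into a DataFrame
# (replace the path with the location of the file on your machine)
data = pd.read_csv("C:/Users/you/Documents/ad_sales_data.csv")

# Display the loaded table
data
```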

By running just the variable name ‘data’, as I have done in the second line of code, you will see the entire table loaded as a DataFrame.

Step 3: Separating the Feature and the Target Variable

You already know that Ad_exp is the feature variable, or the independent variable. Based on this variable, the target variable, i.e. Sales, needs to be predicted.

Therefore, just as in a classic mathematical equation, let me store the Ad_exp values in a variable x and the Sales values in a variable y. The notation makes sense because, in a mathematical equation, y is the output variable and x is the input variable. The same is the case here.
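Assuming the DataFrame's columns are named Ad_Exp and Sales, as in the table above, the cell would look like this:

```python
# Store the feature (Ad_Exp) and the target (Sales) separately
x = data["Ad_Exp"]
y = data["Sales"]

# Plot the feature against the target
plt.scatter(x, y)
plt.xlabel("Advertisement Expenditure")
plt.ylabel("Sales")
plt.show()
```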

The last line of code will display a scatter-plot on your Jupyter notebook which will look like this:

Please note, this is the same plot that you saw above in the intuition section. 

Step 4: Machine Learning! Line fitting

Let me tell you that until now you have not done any machine learning - only some basic data cleaning and preparation.

The glamorous Machine Learning part of the code starts here and also ends with this one line of code. 

From the NumPy library that you imported, you will now use the polyfit() method to find the coefficients of the straight line that best fits the data.

You already know from your school-level math that the equation of a straight line is given by:

y = (m * x) + c

Here, m is the slope of the line and c is the y-intercept. The trendline we are trying to find is no different: it follows the same equation, and with this code we will find its m and c values.

This method needs three parameters: the previously defined input and output variables (x, y), and an integer, 1, which defines the degree of the polynomial you want to fit.
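Putting that together, and assuming x and y are defined as in Step 3, the fitting step is a single line:

```python
# Fit a degree-1 polynomial (a straight line) through the data points
model = np.polyfit(x, y, 1)

# The output is an array of two numbers: the slope m and the intercept c
model
```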

As you might guess, if you changed that number from 1 to 2, 3, 4 and so on, you would be fitting a higher-degree regression, also referred to as Polynomial Regression.

That is also something that I will be discussing with you in the coming weeks.

As soon as you run this code, you see an output which is an array of two numbers. These two numbers are nothing but the values of m and c from the equation of a straight line.

Therefore, we now know that the best trendline that describes our data is:

y = 633.9931736 + (4.68585196 * x)

Believe it or not, we are actually done with our prediction problem. With the equation given above, you can just plug in the value of x - which, you should remember, is the Advertising Expenditure - and you will get the value of y, i.e. the Sales you are likely to make in that quarter.

But since we are already doing some interesting stuff in Python, why find the value of Sales manually? Let's make this better in our last and final step.

Step 5: Making the Predictions

Instead of doing the calculations manually, you can use another method from the NumPy library that we imported. The method is called poly1d().

Please follow the code given below.
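A sketch of that final cell, reusing the model variable from Step 4:

```python
# Wrap the fitted coefficients in a polynomial that can be called directly
Predict = np.poly1d(model)

# Predict the Sales for a quarter with an Advertisement Expenditure of 51
Predict(51)
```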

We had stored our equation coefficients in 'model'. I created a variable Predict which carries this model data and can predict values, courtesy of the NumPy method poly1d().

Now, when I enter an Advertisement Expenditure of 51, the predicted sales come out to 872.971.

Congratulations on your first step towards Marketing Analytics!

By executing these simple lines of code, you have successfully taken the first step towards learning Marketing Analytics. This is big! 

Let me tell you that Linear Regression is a fundamental concept in Marketing Analytics and in Data Science in general. Therefore, you should definitely spend all that time that you need to understand it really well.

Things get interesting from here. I have not yet spoken about how to measure the accuracy of your model. I have also not mentioned how to perform this regression when there is more than one feature or independent variable - that is called Multiple Regression.

Gradually, as we proceed in this journey, I will take you through all of these concepts, and also through higher-order regression, i.e. polynomial regression.

Having covered the most fundamental concept in machine learning, you are now ready to implement it on some of your datasets.

Whatever you learned in this discussion is more than sufficient for you to pick a simple dataset from your work and go ahead to create a linear regression model on it.

If you are not able to find a dataset for practice, rest assured: you can download a practice dataset for Linear Regression. This is a toy dataset I have created for your practice, so that you can build the necessary confidence.

Further, if you want to speed up the process of learning Marketing Analytics you can consider taking up this Data Scientist with Python career track on DataCamp. In order to help you get started with the career track, I have crafted a study plan for you so that you can sail through the course with ease.


Let’s learn some more Marketing Analytics in our next discussion.


About the Author: Darpan Saxena


The Strategic Value of Regression Analysis in Marketing Research

by Michael Lieberman, on December 14, 2023


Regression analysis offers significant value in modern business and research contexts. This article explores the strategic importance of regression analysis to shed light on its diverse applications and benefits. Included are several different case studies to help bring the concept to life.

Understanding Regression Analysis in Marketing

Regression analysis in marketing is used to examine how independent variables—such as advertising spend, demographics, pricing, and product features—influence a dependent variable, typically a measure of consumer behavior or business performance. The goal is to create models that capture these relationships accurately, allowing marketers to make informed decisions.

Benefits of Regression Analysis in Marketing

  • Data-driven decisions: Regression analysis empowers marketers to make data-driven decisions, reducing reliance on intuition and guesswork. This approach leads to more accurate and strategic marketing efforts.
  • Efficiency and cost savings: By optimizing marketing campaigns and resource allocation, regression analysis can significantly improve efficiency and cost-effectiveness. Companies can achieve better results with the same or fewer resources.
  • Personalization: Understanding consumer behavior through regression analysis allows for personalized marketing efforts. Tailored messages and offers can lead to higher engagement and conversion rates.
  • Competitive advantage: Marketers who employ regression analysis are better equipped to adapt to changing market conditions, outperform competitors, and stay ahead of industry trends.
  • Continuous improvement: Regression analysis is an iterative process. As new data becomes available, models can be updated and refined, ensuring that marketing strategies remain effective over time.

Strategic Applications

  • Consumer behavior prediction: Regression analysis helps marketers predict consumer behavior. By analyzing historical data and considering various factors, such as past purchases, online behavior, and demographic information, companies can build models to anticipate customer preferences, buying patterns, and churn rates.
  • Marketing campaign optimization: Businesses invest heavily in marketing campaigns. Regression analysis aids in optimizing these efforts by identifying which marketing channels, messages, or strategies have the greatest impact on key performance indicators (KPIs) like sales, click-through rates, or conversion rates.
  • Pricing strategy: Pricing is a critical aspect of marketing. Regression analysis can reveal the relationship between pricing strategies and sales volume, helping companies determine the optimal price points for their products or services.
  • Product development: In product development, regression analysis can be used to understand the relationship between product features and consumer satisfaction. Companies can then prioritize product enhancements based on customer preferences.

Case Study – Regression Analysis for Ranking Key Attributes

Let’s explore a specific example in a category known as Casual Dining Restaurants (CDR). In a survey, respondents are asked to rate several casual dining restaurants on a variety of attributes. For the purposes of this article, we will keep the number of demonstrated attributes to the top eight. The data for each restaurant is stacked into one regression. We are seeking to rank the attributes based on a regression analysis against an industry standard overall measurement: Net Promoter Score.

Table 1 shows the leading Casual Dining Restaurant chains in the United States used to 'rank' the key reasons that patrons visit this restaurant category, not specific to any one restaurant brand.

Table 1 – List of Leading Casual Dining Restaurant Chains in the United States

In Figure 1 we see a graphic example of key drivers across the CDR category.

Figure 1 – Key drivers of Net Promoter Score across Casual Dining Restaurants

The category-wide drivers of CDR visits are not particularly surprising: good food, good value, cleanliness, staff energy. There is one attribute, however, that restaurant executives might not intuitively rank as important: make sure your servers thank departing customers. Diners seek not just delicious cuisine at a reasonable price; they also desire a sense of appreciation.

Case Study – Regression and Brand Response to Crisis

A major automobile company has a public relations disaster. To restore its brand equity, the company commissions a series of regression analyses to gauge how buyers view its brand image. What it really wants to know is how American auto buyers view trust - the most valuable brand perception for this company's automotive product.

The disaster is fresh - a nationwide recall of thousands of cars over safety issues with airbags - so the company would like a composite of which values go into "Is this a Company I Trust." It therefore surveyed decision makers, stakeholders, owners, and prospects. We then stack the data into one dataset and run a strategic regression. Once performed, the regression beta values are summed and then reported as percentages of influence on the dependent variable. What we see are the major components of "Trust."
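As a simplified sketch of that sum-the-betas-and-report-percentages step (the data and attribute count below are simulated stand-ins; the actual case study used Shapley value regression, which involves more than a single least-squares fit):

```python
import numpy as np

# X: survey ratings on four hypothetical image attributes (one column each)
# y: ratings on "Is this a Company I Trust" (the dependent variable)
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 4))
y = X @ np.array([0.6, 0.3, 0.2, 0.1]) + rng.normal(scale=0.5, size=500)

# Ordinary least squares fit; the column of ones provides the intercept
X1 = np.column_stack([np.ones(len(X)), X])
betas, *_ = np.linalg.lstsq(X1, y, rcond=None)

# Report each attribute's beta as a percentage of the summed betas
attr_betas = np.abs(betas[1:])             # drop the intercept
influence = 100 * attr_betas / attr_betas.sum()
print(influence.round(1))                  # % influence per attribute
```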

Figure 2 – Percentage influence of components on "A Company I Trust"

Not surprisingly, family safety is the leading driver of Trust. More importantly, we now have Shapley Values for the major components. These findings would normally be handed over to the public relations team to begin damage control; within days, the company began running advertisements in major markets to reverse the negative narrative of the recall.

Case Study – Regression Analysis for Maximizing Product Lines

SparkleSquad Studios is a fictional startup hoping to find a niche among tween and teen girls and help reverse the tide of social media addiction. Though funded through venture capital investment, they found that, of their 40 potential product areas, they only have the capacity to produce eight. To determine which hobby products are most in demand, they fielded a study.

Table 2 – List of Potential Product Areas for Development

SparkleSquad Studios then conducted a large study, gathering data from thousands of web-based surveys of girls aged 10 to 16 across the United States. The survey is simple - no more than 5 minutes - and concise, to cater to respondents' shorter attention spans. Below are the key questions.

  • How much money do you typically allocate to hobbies unrelated to social media in a given month?
  • Please check off the hobbies that interest you from the list of 40 potential options below.

Question 1 serves as the dependent variable in the regression. Question 2 responses are coded into categorical variables (1 = Checked, 0 = Not Checked); these are the independent variables.
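A rough sketch of this setup in code (the hobby columns and numbers below are invented for illustration; the real study has 40 hobby columns and thousands of rows):

```python
import numpy as np
import pandas as pd

# Stand-in for the survey file: monthly hobby spend plus one
# checked/not-checked column per hobby (only 3 of the 40 shown)
df = pd.DataFrame({
    "hobby_spend": [25, 40, 10, 55, 30, 60],
    "crafting":    [1, 1, 0, 1, 0, 1],
    "journaling":  [0, 1, 0, 1, 1, 1],
    "photography": [0, 0, 1, 1, 0, 1],
})

# OLS with the 1/0 dummies as independent variables
hobbies = ["crafting", "journaling", "photography"]
X = np.column_stack([np.ones(len(df)), df[hobbies]])
betas, *_ = np.linalg.lstsq(X, df["hobby_spend"], rcond=None)

# Each coefficient estimates the extra monthly spend associated with
# interest in that hobby; ranking them shortlists the products
print(dict(zip(hobbies, betas[1:].round(2))))
```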

Results are shown below in Table 3.

Table 3 – Top 10 Hobby Products for Production, Determined Through Regression Analysis

Based on the resulting regression analysis, SparkleSquad will commence production of the ten statistically significant products. The data-driven approach ensures these offerings meet market demand as determined by the model.

Regression analysis gives businesses the ability to predict consumer behavior, optimize marketing efforts, and drive results through data-driven decision-making. By leveraging regression analysis, businesses can gain a competitive advantage and increase their efficiency and effectiveness. In an era where consumer preferences and market conditions are in constant flux, regression analysis remains an essential tool for marketers looking to stay ahead of the curve.


Michael Lieberman is the Founder and President of Multivariate Solutions, a statistical and market research consulting firm that works with major advertising, public relations, and political strategy firms. He can be reached at +1 646 257 3794, or [email protected].

Download "The 5 Keys to Estimating Market Sizing for Strategic Decision Making"

About This Blog

Our goal is to help you better understand your customer, market, and competition in order to help drive your business growth.

Popular Posts

  • A CEO’s Perspective on Harnessing AI for Market Research Excellence
  • 7 Key Advantages of Outsourcing Market Research Services
  • How to Use Market Research for Onboarding and Training Employees
  • 10 Global Industries That Will Boom in the Next 5 Years
  • Primary Data vs. Secondary Data: Market Research Methods

Recent Posts

Posts by topic.

  • Industry Insights (825)
  • Market Research Strategy (272)
  • Food & Beverage (134)
  • Healthcare (125)
  • The Freedonia Group (121)
  • How To's (108)
  • Market Research Provider (89)
  • Manufacturing & Construction (81)
  • Packaged Facts (78)
  • Pharmaceuticals (78)
  • Telecommunications & Wireless (70)
  • Heavy Industry (69)
  • Marketing (58)
  • Profound (56)
  • Retail (56)
  • Software & Enterprise Computing (54)
  • Transportation & Shipping (54)
  • House & Home (50)
  • Materials & Chemicals (47)
  • Medical Devices (46)
  • Consumer Electronics (45)
  • Energy & Resources (42)
  • Public Sector (40)
  • Biotechnology (37)
  • Demographics (37)
  • Business Services & Administration (36)
  • Education (36)
  • Custom Market Research (35)
  • Diagnostics (34)
  • Academic (33)
  • Travel & Leisure (33)
  • E-commerce & IT Outsourcing (32)
  • Financial Services (29)
  • Computer Hardware & Networking (26)
  • Simba Information (24)
  • Kalorama Information (21)
  • Knowledge Centers (19)
  • Apparel (18)
  • Cosmetics & Personal Care (17)
  • Social Media (16)
  • Advertising (14)
  • Big Data (14)
  • Market Research Subscription (14)
  • Holiday (11)
  • Emerging Markets (8)
  • Associations (1)
  • Religion (1)

MarketResearch.com 6116 Executive Blvd Suite 550 Rockville, MD 20852 800.298.5699 (U.S.) +1.240.747.3093 (International) [email protected]

From Our Blog

Subscribe to blog, connect with us.

LinkedIn

Root out friction in every digital experience, super-charge conversion rates, and optimize digital self-service

Uncover insights from any interaction, deliver AI-powered agent coaching, and reduce cost to serve

Increase revenue and loyalty with real-time insights and recommendations delivered to teams on the ground

Know how your people feel and empower managers to improve employee engagement, productivity, and retention

Take action in the moments that matter most along the employee journey and drive bottom line growth

Whatever they’re are saying, wherever they’re saying it, know exactly what’s going on with your people

Get faster, richer insights with qual and quant tools that make powerful market research available to everyone

Run concept tests, pricing studies, prototyping + more with fast, powerful studies designed by UX research experts

Track your brand performance 24/7 and act quickly to respond to opportunities and challenges in your market

Explore the platform powering Experience Management

  • Free Account
  • For Digital
  • For Customer Care
  • For Human Resources
  • For Researchers
  • Financial Services
  • All Industries

Popular Use Cases

  • Customer Experience
  • Employee Experience
  • Net Promoter Score
  • Voice of Customer
  • Customer Success Hub
  • Product Documentation
  • Training & Certification
  • XM Institute
  • Popular Resources
  • Customer Stories
  • Artificial Intelligence
  • Market Research
  • Partnerships
  • Marketplace

The annual gathering of the experience leaders at the world’s iconic brands building breakthrough business results, live in Salt Lake City.

  • English/AU & NZ
  • Español/Europa
  • Español/América Latina
  • Português Brasileiro
  • REQUEST DEMO
  • Experience Management
  • Survey Data Analysis & Reporting
  • Regression Analysis

Try Qualtrics for free

The complete guide to regression analysis

What is regression analysis and why is it useful? While most of us have heard the term, understanding regression analysis in detail may be something you need to brush up on. Here's what you need to know about this popular method of analysis.

When you rely on data to drive and guide business decisions, as well as predict market trends, just gathering and analyzing what you find isn’t enough — you need to ensure it’s relevant and valuable.

The challenge, however, is that so many variables can influence business data: market conditions, economic disruption, even the weather! As such, it’s essential you know which variables are affecting your data and forecasts, and what data you can discard.

And one of the most effective ways to determine data value and monitor trends (and the relationships between them) is to use regression analysis, a set of statistical methods used for the estimation of relationships between independent and dependent variables.

In this guide, we’ll cover the fundamentals of regression analysis, from what it is and how it works to its benefits and practical applications.


What is regression analysis?

Regression analysis is a statistical method. It’s used for analyzing different factors that might influence an objective – such as the success of a product launch, business growth, a new marketing campaign – and determining which factors are important and which ones can be ignored.

Regression analysis can also help leaders understand how different variables impact each other and what the outcomes are. For example, when forecasting financial performance, regression analysis can help leaders determine how changes in the business can influence revenue or expenses in the future.

Running an analysis of this kind, you might find that there’s a high correlation between the number of marketers employed by the company, the leads generated, and the opportunities closed.

This seems to suggest that a high number of marketers and a high number of leads generated influences sales success. But do you need both factors to close those sales? By analyzing the effects of these variables on your outcome,  you might learn that when leads increase but the number of marketers employed stays constant, there is no impact on the number of opportunities closed, but if the number of marketers increases, leads and closed opportunities both rise.

Regression analysis can help you tease out these complex relationships so you can determine which areas you need to focus on in order to get your desired results, and avoid wasting time with those that have little or no impact. In this example, that might mean hiring more marketers rather than trying to increase leads generated.

How does regression analysis work?

Regression analysis starts with variables that are categorized into two types: dependent and independent variables. The variables you select depend on the outcomes you’re analyzing.

Understanding variables

1. Dependent variable

This is the main variable that you want to analyze and predict. For example, operational (O) data such as your quarterly or annual sales, or experience (X) data such as your net promoter score (NPS) or customer satisfaction score (CSAT).

These variables are also called response variables, outcome variables, or left-hand-side variables (because they appear on the left-hand side of a regression equation).

There are three easy ways to identify them:

  • Is the variable measured as an outcome of the study?
  • Does the variable depend on another in the study?
  • Do you measure the variable only after other variables are altered?

2. Independent variable

Independent variables are the factors that could affect your dependent variables. For example, a price rise in the second quarter could make an impact on your sales figures.

You can identify independent variables with the following list of questions:

  • Is the variable manipulated, controlled, or used as a subject grouping method by the researcher?
  • Does this variable come before the other variable in time?
  • Are you trying to understand whether or how this variable affects another?

Independent variables are often referred to differently in regression depending on the purpose of the analysis. You might hear them called:

Explanatory variables

Explanatory variables are those which explain an event or an outcome in your study. For example, explaining why your sales dropped or increased.

Predictor variables

Predictor variables are used to predict the value of the dependent variable. For example, predicting how much sales will increase when new product features are rolled out.

Experimental variables

These are variables that can be manipulated or changed directly by researchers to assess the impact. For example, assessing how different product pricing ($10 vs $15 vs $20) will impact the likelihood to purchase.

Subject variables (also called fixed effects)

Subject variables can’t be changed directly, but vary across the sample. For example, age, gender, or income of consumers.

Unlike experimental variables, you can’t randomly assign or change subject variables, but you can design your regression analysis to determine the different outcomes of groups of participants with the same characteristics. For example, ‘how do price rises impact sales based on income?’

Carrying out regression analysis


So regression is about the relationships between dependent and independent variables. But how exactly do you do it?

Assuming you have your data collection done already, the first and foremost thing you need to do is plot your results on a graph. Doing this makes interpreting regression analysis results much easier as you can clearly see the correlations between dependent and independent variables.

Let’s say you want to carry out a regression analysis to understand the relationship between the number of ads placed and revenue generated.

On the Y-axis, you place the revenue generated. On the X-axis, the number of digital ads. By plotting the information on the graph, and drawing a line (called the regression line) through the middle of the data, you can see the relationship between the number of digital ads placed and revenue generated.
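As an illustrative sketch of that process in Python (the ad counts and revenue figures below are invented):

```python
import numpy as np
import matplotlib.pyplot as plt

# Illustrative data: number of digital ads placed vs revenue generated
ads = np.array([10, 15, 20, 25, 30, 35, 40])
revenue = np.array([120, 160, 195, 250, 280, 330, 360])

# Fit a straight line and draw it through the middle of the data
m, c = np.polyfit(ads, revenue, 1)
plt.scatter(ads, revenue)
plt.plot(ads, m * ads + c)
plt.xlabel("Number of digital ads")
plt.ylabel("Revenue generated")
plt.show()
```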


This regression line is the line that provides the best description of the relationship between your independent variables and your dependent variable. In this example, we’ve used a simple linear regression model.


Statistical analysis software can draw this line for you and precisely calculate the regression line. The software then provides a formula for the slope of the line, adding further context to the relationship between your dependent and independent variables.

Simple linear regression analysis

A simple linear model uses a single straight line to determine the relationship between a single independent variable and a dependent variable.

This regression model is mostly used when you want to determine the relationship between two variables (like price increases and sales) or the value of the dependent variable at certain points of the independent variable (for example the sales levels at a certain price rise).

While linear regression is useful, it does require you to make some assumptions.

For example, it requires you to assume that:

  • the data was collected using a statistically valid sample collection method that is representative of the target population
  • the observed relationship between the variables can't be explained by a 'hidden' third variable – in other words, there are no spurious correlations
  • the relationship between the independent variable and the dependent variable is linear – meaning that the best fit along the data points is a straight line and not a curved one

Multiple regression analysis

As the name suggests, multiple regression analysis is a type of regression that uses multiple variables. It uses multiple independent variables to predict the outcome of a single dependent variable. Of the various kinds of multiple regression, multiple linear regression is one of the best-known.

Multiple linear regression is a close relative of the simple linear regression model in that it looks at the impact of several independent variables on one dependent variable. However, like simple linear regression, multiple regression analysis also requires you to make some basic assumptions.

For example, you will be assuming that:

  • there is a linear relationship between the dependent and independent variables (it creates a straight line and not a curve through the data points)
  • the independent variables aren’t highly correlated in their own right

An example of multiple linear regression would be an analysis of how marketing spend, revenue growth, and general market sentiment affect the share price of a company.

With multiple linear regression models you can estimate how these variables will influence the share price, and to what extent.
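A minimal sketch of such a model, using invented quarterly figures, might look like this:

```python
import numpy as np

# Illustrative quarterly data: marketing spend, revenue growth (%),
# and a market-sentiment index as predictors of share price
X = np.array([
    [1.2, 3.1, 0.4],
    [1.5, 2.8, 0.6],
    [1.1, 3.5, 0.2],
    [1.8, 4.0, 0.7],
    [2.0, 3.9, 0.9],
    [1.7, 4.2, 0.5],
])
share_price = np.array([42.0, 45.5, 41.0, 50.2, 53.1, 49.8])

# Multiple linear regression via least squares (intercept column added)
X1 = np.column_stack([np.ones(len(X)), X])
coef, *_ = np.linalg.lstsq(X1, share_price, rcond=None)

# coef holds the intercept followed by one slope per predictor,
# i.e. how much each variable moves the share price, all else equal
print(coef.round(2))
```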

Multivariate linear regression

Multivariate linear regression involves more than one dependent variable as well as multiple independent variables, making it more complicated than linear or multiple linear regressions. However, this also makes it much more powerful and capable of making predictions about complex real-world situations.

For example, if an organization wants to establish or estimate how the COVID-19 pandemic has affected employees in its different markets, it can use multivariate linear regression, with the different geographical regions as dependent variables and the different facets of the pandemic as independent variables (such as mental health self-rating scores, proportion of employees working at home, lockdown durations and employee sick days).

Through multivariate linear regression, you can look at relationships between variables in a holistic way and quantify the relationships between them. As you can clearly visualize those relationships, you can make adjustments to dependent and independent variables to see which conditions influence them. Overall, multivariate linear regression provides a more realistic picture than looking at a single variable.

However, because multivariate techniques are complex, they involve high-level mathematics that require a statistical program to analyze the data.

Logistic regression

Logistic regression models the probability of a binary outcome based on independent variables.

So, what is a binary outcome? It’s when there are only two possible scenarios, either the event happens (1) or it doesn’t (0). e.g. yes/no outcomes, pass/fail outcomes, and so on. In other words, if the outcome can be described as being in either one of two categories.

Logistic regression makes predictions based on independent variables that are assumed or known to have an influence on the outcome. For example, the probability of a sports team winning their game might be affected by independent variables like weather, day of the week, whether they are playing at home or away and how they fared in previous matches.
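As a toy sketch of that example (simulated match data, using scikit-learn's LogisticRegression):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Each row: [played at home (1/0), won previous match (1/0)]
X = np.array([[1, 1], [1, 0], [0, 1], [0, 0],
              [1, 1], [0, 0], [1, 0], [0, 1]])
won = np.array([1, 1, 1, 0, 1, 0, 0, 0])  # binary outcome: won the game?

# Fit the model, then estimate the probability of a win for a
# home game following a victory
model = LogisticRegression().fit(X, won)
print(model.predict_proba([[1, 1]])[0, 1])
```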

What are some common mistakes with regression analysis?

Across the globe, businesses are increasingly relying on quality data and insights to drive decision-making — but to make accurate decisions, it’s important that the data collected and statistical methods used to analyze it are reliable and accurate.

Using the wrong data or the wrong assumptions can result in poor decision-making, lead to missed opportunities to improve efficiency and savings, and — ultimately — damage your business long term.

  • Assumptions

When running regression analysis, be it a simple linear or multiple regression, it’s really important to check that the assumptions your chosen method requires have been met. If your data points don’t conform to a straight line of best fit, for example, you need to apply additional statistical modifications to accommodate the non-linear data. For example, if you are looking at income data, which scales on a logarithmic distribution, you should take the Natural Log of Income as your variable then adjust the outcome after the model is created.

  • Correlation vs. causation

It’s a well-worn phrase that bears repeating – correlation does not equal causation. While variables that are linked by causality will always show correlation, the reverse is not always true. Moreover, there is no statistic that can determine causality (although the design of your study overall can).

If you observe a correlation in your results, such as in the first example we gave in this article where there was a correlation between leads and sales, you can’t assume that one thing has influenced the other. Instead, you should use it as a starting point for investigating the relationship between the variables in more depth.

  • Choosing the wrong variables to analyze

Before you use any kind of statistical method, it’s important to understand the subject you’re researching in detail. Doing so means you’re making informed choices of variables and you’re not overlooking something important that might have a significant bearing on your dependent variable.

  • Model building

The variables you include in your analysis are just as important as the variables you choose to exclude. That's because the strength of each independent variable is influenced by the other variables in the model. Other techniques, such as Key Drivers Analysis, are able to account for these variable interdependencies.

Benefits of using regression analysis

There are several benefits to using regression analysis to judge how changing variables will affect your business and to ensure you focus on the right things when forecasting.

Here are just a few of those benefits:

Make accurate predictions

Regression analysis is commonly used when forecasting and forward planning for a business. For example, when predicting sales for the year ahead, a number of different variables will come into play to determine the eventual result.

Regression analysis can help you determine which of these variables are likely to have the biggest impact based on previous events and help you make more accurate forecasts and predictions.

Identify inefficiencies

Using a regression equation, a business can identify areas for improvement when it comes to efficiency, whether in terms of people, processes, or equipment.

For example, regression analysis can help a car manufacturer determine order numbers based on external factors like the economy or environment.

They can then use the initial regression equation to determine how many staff members and how much equipment they need to meet orders.

Drive better decisions

Improving processes or business outcomes is always on the minds of owners and business leaders, but without actionable data, they’re simply relying on instinct, and this doesn’t always work out.

This is particularly true when it comes to issues of price. For example, to what extent will raising the price (and to what level) affect next quarter’s sales?

There’s no way to know this without data analysis. Regression analysis can help provide insights into the correlation between price rises and sales based on historical data.

How do businesses use regression? A real-life example

Marketing and advertising spending are common topics for regression analysis. Companies use regression when trying to assess the value of ad spend and marketing spend on revenue.

A typical example is using a regression equation to assess the correlation between ad costs and conversions of new customers. In this instance,

  • our dependent variable (the factor we’re trying to assess the outcomes of) will be our conversions
  • the independent variable (the factor we’ll change to assess how it changes the outcome) will be the daily ad spend
  • the regression equation will try to determine whether an increase in ad spend has a direct correlation with the number of conversions we have

The analysis is relatively straightforward — using historical data from an ad account, we can use daily data to judge ad spend vs conversions and how changes to the spend alter the conversions.

By assessing this data over time, we can make predictions not only on whether increasing ad spend will lead to increased conversions but also what level of spending will lead to what increase in conversions. This can help to optimize campaign spend and ensure marketing delivers good ROI.

This is an example of a simple linear model. If we wanted to carry out a more complex regression, we could also factor in other independent variables such as seasonality, GDP, and the current reach of our chosen advertising networks.

By increasing the number of independent variables, we can get a better understanding of whether ad spend is resulting in an increase in conversions, whether it’s exerting an influence in combination with another set of variables, or if we’re dealing with a correlation with no causal impact – which might be useful for predictions anyway, but isn’t a lever we can use to increase sales.

Using the estimated effect of each independent variable, we can more accurately predict how spend will change the conversion rate of advertising.

Regression analysis tools

Regression analysis is an important tool when it comes to better decision-making and improved business outcomes. To get the best out of it, you need to invest in the right kind of statistical analysis software.

The best option is likely to be one that sits at the intersection of powerful statistical analysis and intuitive ease of use, as this will empower everyone from beginners to expert analysts to uncover meaning from data, identify hidden trends and produce predictive models without statistical training being required.


To help prevent costly errors, choose a tool that automatically runs the right statistical tests and visualizations and then translates the results into simple language that anyone can put into action.

With software that’s both powerful and user-friendly, you can isolate key experience drivers, understand what influences the business, apply the most appropriate regression methods, identify data issues, and much more.


With Qualtrics’ Stats iQ™, you don’t have to worry about the regression equation because our statistical software will run the appropriate equation for you automatically based on the variable type you want to monitor. You can also use several equations, including linear regression and logistic regression, to gain deeper insights into business outcomes and make more accurate, data-driven decisions.


Regression Analysis – Methods, Types and Examples

Regression analysis is a set of statistical processes for estimating the relationships among variables. It includes many techniques for modeling and analyzing several variables when the focus is on the relationship between a dependent variable and one or more independent variables (or 'predictors').

Regression Analysis Methodology

Here is a general methodology for performing regression analysis:

  • Define the research question: Clearly state the research question or hypothesis you want to investigate. Identify the dependent variable (also called the response variable or outcome variable) and the independent variables (also called predictor variables or explanatory variables) that you believe are related to the dependent variable.
  • Collect data: Gather the data for the dependent variable and independent variables. Ensure that the data is relevant, accurate, and representative of the population or phenomenon you are studying.
  • Explore the data: Perform exploratory data analysis to understand the characteristics of the data, identify any missing values or outliers, and assess the relationships between variables through scatter plots, histograms, or summary statistics.
  • Choose the regression model: Select an appropriate regression model based on the nature of the variables and the research question. Common regression models include linear regression, multiple regression, logistic regression, polynomial regression, and time series regression, among others.
  • Assess assumptions: Check the assumptions of the regression model. Some common assumptions include linearity (the relationship between variables is linear), independence of errors, homoscedasticity (constant variance of errors), and normality of errors. Violation of these assumptions may require additional steps or alternative models.
  • Estimate the model: Use a suitable method to estimate the parameters of the regression model. The most common method is ordinary least squares (OLS), which minimizes the sum of squared differences between the observed and predicted values of the dependent variable.
  • Interpret the results: Analyze the estimated coefficients, p-values, confidence intervals, and goodness-of-fit measures (e.g., R-squared) to interpret the results. Determine the significance and direction of the relationships between the independent variables and the dependent variable.
  • Evaluate model performance: Assess the overall performance of the regression model using appropriate measures, such as R-squared, adjusted R-squared, and root mean squared error (RMSE). These measures indicate how well the model fits the data and how much of the variation in the dependent variable is explained by the independent variables.
  • Test assumptions and diagnose problems: Check the residuals (the differences between observed and predicted values) for any patterns or deviations from assumptions. Conduct diagnostic tests, such as examining residual plots, testing for multicollinearity among independent variables, and assessing heteroscedasticity or autocorrelation, if applicable.
  • Make predictions and draw conclusions: Once you have a satisfactory model, use it to make predictions on new or unseen data. Draw conclusions based on the results of the analysis, considering the limitations and potential implications of the findings.
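
As a minimal sketch of this workflow in R (the simulated marketing data set and the variable names ad_spend, price, and sales are illustrative assumptions, not real data):

set.seed(123)
# Simulate a small marketing data set: sales driven by ad spend and price
ads <- data.frame(ad_spend = runif(100, 0, 50),
                  price = runif(100, 5, 15))
ads$sales <- 20 + 1.8 * ads$ad_spend - 2.5 * ads$price + rnorm(100, sd = 10)

plot(ads$ad_spend, ads$sales)                   # explore the data
fit <- lm(sales ~ ad_spend + price, data = ads) # estimate via OLS
summary(fit)                                    # coefficients, p-values, R-squared
plot(fitted(fit), resid(fit))                   # residual diagnostics
predict(fit, newdata = data.frame(ad_spend = 30, price = 10))  # predict new data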

Types of Regression Analysis

Types of Regression Analysis are as follows:

Linear Regression

Linear regression is the most basic and widely used form of regression analysis. It models the linear relationship between a dependent variable and one or more independent variables. The goal is to find the best-fitting line that minimizes the sum of squared differences between observed and predicted values.

Multiple Regression

Multiple regression extends linear regression by incorporating two or more independent variables to predict the dependent variable. It allows for examining the simultaneous effects of multiple predictors on the outcome variable.

Polynomial Regression

Polynomial regression models non-linear relationships between variables by adding polynomial terms (e.g., squared or cubic terms) to the regression equation. It can capture curved or nonlinear patterns in the data.

Logistic Regression

Logistic regression is used when the dependent variable is binary or categorical. It models the probability of the occurrence of a certain event or outcome based on the independent variables. Logistic regression estimates the coefficients using the logistic function, which transforms the linear combination of predictors into a probability.

Ridge Regression and Lasso Regression

Ridge regression and Lasso regression are techniques used for addressing multicollinearity (high correlation between independent variables) and variable selection. Both methods introduce a penalty term to the regression equation to shrink or eliminate less important variables. Ridge regression uses L2 regularization, while Lasso regression uses L1 regularization.

Time Series Regression

Time series regression analyzes the relationship between a dependent variable and independent variables when the data is collected over time. It accounts for autocorrelation and trends in the data and is used in forecasting and studying temporal relationships.

Nonlinear Regression

Nonlinear regression models are used when the relationship between the dependent variable and independent variables is not linear. These models can take various functional forms and require estimation techniques different from those used in linear regression.

Poisson Regression

Poisson regression is employed when the dependent variable represents count data. It models the relationship between the independent variables and the expected count, assuming a Poisson distribution for the dependent variable.

Generalized Linear Models (GLM)

GLMs are a flexible class of regression models that extend the linear regression framework to handle different types of dependent variables, including binary, count, and continuous variables. GLMs incorporate various probability distributions and link functions.
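
In R, several of these models are available through the glm() function by choosing an appropriate family; a brief sketch (the data frames customers and ads and their columns are hypothetical):

# Logistic regression: binary purchase outcome
glm(purchased ~ age + income, data = customers, family = binomial)

# Poisson regression: count outcome, e.g. monthly store visits
glm(store_visits ~ promo_emails, data = customers, family = poisson)

# The gaussian family reproduces ordinary linear regression
glm(sales ~ ad_spend, data = ads, family = gaussian)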

Regression Analysis Formulas

Regression analysis involves estimating the parameters of a regression model to describe the relationship between the dependent variable (Y) and one or more independent variables (X). Here are the basic formulas for linear regression, multiple regression, and logistic regression:

Linear Regression:

Simple Linear Regression Model: Y = β0 + β1X + ε

Multiple Linear Regression Model: Y = β0 + β1X1 + β2X2 + … + βnXn + ε

In both formulas:

  • Y represents the dependent variable (response variable).
  • X represents the independent variable(s) (predictor variable(s)).
  • β0, β1, β2, …, βn are the regression coefficients or parameters that need to be estimated.
  • ε represents the error term or residual (the difference between the observed and predicted values).

Multiple Regression:

Multiple regression extends the concept of simple linear regression by including multiple independent variables.

Multiple Regression Model: Y = β0 + β1X1 + β2X2 + … + βnXn + ε

The formulas are similar to those in linear regression, with the addition of more independent variables.

Logistic Regression:

Logistic regression is used when the dependent variable is binary or categorical. The logistic regression model applies a logistic or sigmoid function to the linear combination of the independent variables.

Logistic Regression Model: p = 1 / (1 + e^-(β0 + β1X1 + β2X2 + … + βnXn))

In the formula:

  • p represents the probability of the event occurring (e.g., the probability of success or belonging to a certain category).
  • X1, X2, …, Xn represent the independent variables.
  • e is the base of the natural logarithm.

The logistic function ensures that the predicted probabilities lie between 0 and 1, allowing for binary classification.
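
To make the formula concrete, here is a small sketch in R with made-up coefficient values (beta0 and beta1 are assumptions for illustration):

beta0 <- -4    # hypothetical intercept
beta1 <- 0.08  # hypothetical effect per ad exposure
x <- 60        # a customer exposed to 60 ads
1 / (1 + exp(-(beta0 + beta1 * x)))  # predicted probability, about 0.69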

Regression Analysis Examples

Regression Analysis Examples are as follows:

  • Stock Market Prediction: Regression analysis can be used to predict stock prices based on various factors such as historical prices, trading volume, news sentiment, and economic indicators. Traders and investors can use this analysis to make informed decisions about buying or selling stocks.
  • Demand Forecasting: In retail and e-commerce, real-time regression analysis can help forecast demand for products. By analyzing historical sales data along with real-time data such as website traffic, promotional activities, and market trends, businesses can adjust their inventory levels and production schedules to meet customer demand more effectively.
  • Energy Load Forecasting: Utility companies often use real-time regression analysis to forecast electricity demand. By analyzing historical energy consumption data, weather conditions, and other relevant factors, they can predict future energy loads. This information helps them optimize power generation and distribution, ensuring a stable and efficient energy supply.
  • Online Advertising Performance: Regression analysis can be used to assess the performance of online advertising campaigns. By analyzing real-time data on ad impressions, click-through rates, conversion rates, and other metrics, advertisers can adjust their targeting, messaging, and ad placement strategies to maximize their return on investment.
  • Predictive Maintenance: Regression analysis can be applied to predict equipment failures or maintenance needs. By continuously monitoring sensor data from machines or vehicles, regression models can identify patterns or anomalies that indicate potential failures. This enables proactive maintenance, reducing downtime and optimizing maintenance schedules.
  • Financial Risk Assessment: Real-time regression analysis can help financial institutions assess the risk associated with lending or investment decisions. By analyzing real-time data on factors such as borrower financials, market conditions, and macroeconomic indicators, regression models can estimate the likelihood of default or assess the risk-return tradeoff for investment portfolios.

Importance of Regression Analysis

Importance of Regression Analysis is as follows:

  • Relationship Identification: Regression analysis helps in identifying and quantifying the relationship between a dependent variable and one or more independent variables. It allows us to determine how changes in independent variables impact the dependent variable. This information is crucial for decision-making, planning, and forecasting.
  • Prediction and Forecasting: Regression analysis enables us to make predictions and forecasts based on the relationships identified. By estimating the values of the dependent variable using known values of independent variables, regression models can provide valuable insights into future outcomes. This is particularly useful in business, economics, finance, and other fields where forecasting is vital for planning and strategy development.
  • Causality Assessment: While correlation does not imply causation, regression analysis provides a framework for assessing causality by considering the direction and strength of the relationship between variables. It allows researchers to control for other factors and assess the impact of a specific independent variable on the dependent variable. This helps in determining the causal effect and identifying significant factors that influence outcomes.
  • Model Building and Variable Selection: Regression analysis aids in model building by determining the most appropriate functional form of the relationship between variables. It helps researchers select relevant independent variables and eliminate irrelevant ones, reducing complexity and improving model accuracy. This process is crucial for creating robust and interpretable models.
  • Hypothesis Testing: Regression analysis provides a statistical framework for hypothesis testing. Researchers can test the significance of individual coefficients, assess the overall model fit, and determine if the relationship between variables is statistically significant. This allows for rigorous analysis and validation of research hypotheses.
  • Policy Evaluation and Decision-Making: Regression analysis plays a vital role in policy evaluation and decision-making processes. By analyzing historical data, researchers can evaluate the effectiveness of policy interventions and identify the key factors contributing to certain outcomes. This information helps policymakers make informed decisions, allocate resources effectively, and optimize policy implementation.
  • Risk Assessment and Control: Regression analysis can be used for risk assessment and control purposes. By analyzing historical data, organizations can identify risk factors and develop models that predict the likelihood of certain outcomes, such as defaults, accidents, or failures. This enables proactive risk management, allowing organizations to take preventive measures and mitigate potential risks.

When to Use Regression Analysis

  • Prediction : Regression analysis is often employed to predict the value of the dependent variable based on the values of independent variables. For example, you might use regression to predict sales based on advertising expenditure, or to predict a student’s academic performance based on variables like study time, attendance, and previous grades.
  • Relationship analysis: Regression can help determine the strength and direction of the relationship between variables. It can be used to examine whether there is a linear association between variables, identify which independent variables have a significant impact on the dependent variable, and quantify the magnitude of those effects.
  • Causal inference: Regression analysis can be used to explore cause-and-effect relationships by controlling for other variables. For example, in a medical study, you might use regression to determine the impact of a specific treatment while accounting for other factors like age, gender, and lifestyle.
  • Forecasting : Regression models can be utilized to forecast future trends or outcomes. By fitting a regression model to historical data, you can make predictions about future values of the dependent variable based on changes in the independent variables.
  • Model evaluation: Regression analysis can be used to evaluate the performance of a model or test the significance of variables. You can assess how well the model fits the data, determine if additional variables improve the model’s predictive power, or test the statistical significance of coefficients.
  • Data exploration : Regression analysis can help uncover patterns and insights in the data. By examining the relationships between variables, you can gain a deeper understanding of the data set and identify potential patterns, outliers, or influential observations.

Applications of Regression Analysis

Here are some common applications of regression analysis:

  • Economic Forecasting: Regression analysis is frequently employed in economics to forecast variables such as GDP growth, inflation rates, or stock market performance. By analyzing historical data and identifying the underlying relationships, economists can make predictions about future economic conditions.
  • Financial Analysis: Regression analysis plays a crucial role in financial analysis, such as predicting stock prices or evaluating the impact of financial factors on company performance. It helps analysts understand how variables like interest rates, company earnings, or market indices influence financial outcomes.
  • Marketing Research: Regression analysis helps marketers understand consumer behavior and make data-driven decisions. It can be used to predict sales based on advertising expenditures, pricing strategies, or demographic variables. Regression models provide insights into which marketing efforts are most effective and help optimize marketing campaigns.
  • Health Sciences: Regression analysis is extensively used in medical research and public health studies. It helps examine the relationship between risk factors and health outcomes, such as the impact of smoking on lung cancer or the relationship between diet and heart disease. Regression analysis also helps in predicting health outcomes based on various factors like age, genetic markers, or lifestyle choices.
  • Social Sciences: Regression analysis is widely used in social sciences like sociology, psychology, and education research. Researchers can investigate the impact of variables like income, education level, or social factors on various outcomes such as crime rates, academic performance, or job satisfaction.
  • Operations Research: Regression analysis is applied in operations research to optimize processes and improve efficiency. For example, it can be used to predict demand based on historical sales data, determine the factors influencing production output, or optimize supply chain logistics.
  • Environmental Studies: Regression analysis helps in understanding and predicting environmental phenomena. It can be used to analyze the impact of factors like temperature, pollution levels, or land use patterns on phenomena such as species diversity, water quality, or climate change.
  • Sports Analytics: Regression analysis is increasingly used in sports analytics to gain insights into player performance, team strategies, and game outcomes. It helps analyze the relationship between various factors like player statistics, coaching strategies, or environmental conditions and their impact on game outcomes.


What Is Regression Analysis in Business Analytics?


Countless factors impact every facet of business. How can you consider those factors and know their true impact?

Imagine you seek to understand the factors that influence people’s decision to buy your company’s product. They range from customers’ physical locations to satisfaction levels among sales representatives to your competitors' Black Friday sales.

Understanding the relationships between each factor and product sales can enable you to pinpoint areas for improvement, helping you drive more sales.

To learn how each factor influences sales, you need to use a statistical analysis method called regression analysis .

If you aren’t a business or data analyst, you may not run regressions yourself, but knowing how analysis works can provide important insight into which factors impact product sales and, thus, which are worth improving.


Foundational Concepts for Regression Analysis

Before diving into regression analysis, you need to build foundational knowledge of statistical concepts and relationships.

Independent and Dependent Variables

Start with the basics. What relationship are you aiming to explore? Try formatting your answer like this: “I want to understand the impact of [the independent variable] on [the dependent variable].”

The independent variable is the factor that could impact the dependent variable . For example, “I want to understand the impact of employee satisfaction on product sales.”

In this case, employee satisfaction is the independent variable, and product sales is the dependent variable. Identifying the dependent and independent variables is the first step toward regression analysis.

Correlation vs. Causation

One of the cardinal rules of statistically exploring relationships is to never assume correlation implies causation. In other words, just because two variables move in the same direction doesn’t mean one caused the other to occur.

If two or more variables are correlated , their directional movements are related. If two variables are positively correlated , it means that as one goes up or down, so does the other. Alternatively, if two variables are negatively correlated , one goes up while the other goes down.

A correlation’s strength can be quantified by calculating the correlation coefficient , sometimes represented by r . The correlation coefficient falls between negative one and positive one.

r = -1 indicates a perfect negative correlation.

r = 1 indicates a perfect positive correlation.

r = 0 indicates no correlation.

Causation means that one variable caused the other to occur. Proving a causal relationship between variables requires a true experiment with a control group (which doesn’t receive the independent variable) and an experimental group (which receives the independent variable).

While regression analysis provides insights into relationships between variables, it doesn’t prove causation. It can be tempting to assume that one variable caused the other—especially if you want it to be true—which is why you need to keep this in mind any time you run regressions or analyze relationships between variables.

With the basics under your belt, here’s a deeper explanation of regression analysis so you can leverage it to drive strategic planning and decision-making.


What Is Regression Analysis?

Regression analysis is the statistical method used to determine the structure of a relationship between two variables (single linear regression) or three or more variables (multiple regression).

According to the Harvard Business School Online course Business Analytics , regression is used for two primary purposes:

  • To study the magnitude and structure of the relationship between variables
  • To forecast a variable based on its relationship with another variable

Both of these insights can inform strategic business decisions.

“Regression allows us to gain insights into the structure of that relationship and provides measures of how well the data fit that relationship,” says HBS Professor Jan Hammond, who teaches Business Analytics, one of three courses that comprise the Credential of Readiness (CORe) program . “Such insights can prove extremely valuable for analyzing historical trends and developing forecasts.”

One way to think of regression is by visualizing a scatter plot of your data with the independent variable on the X-axis and the dependent variable on the Y-axis. The regression line is the line that best fits the scatter plot data. The regression equation represents the line’s slope and the relationship between the two variables, along with an estimation of error.

Physically creating this scatter plot can be a natural starting point for parsing out the relationships between variables.


Types of Regression Analysis

There are two types of regression analysis: single variable linear regression and multiple regression.

Single variable linear regression is used to determine the relationship between two variables: the independent and dependent. The equation for a single variable linear regression looks like this:

ŷ = α + βx

In the equation:

  • ŷ is the expected value of Y (the dependent variable) for a given value of X (the independent variable).
  • x is the independent variable.
  • α is the Y-intercept, the point at which the regression line intersects with the vertical axis.
  • β is the slope of the regression line, or the average change in the dependent variable as the independent variable increases by one.
  • ε is the error term, equal to Y – ŷ, or the difference between the actual value of the dependent variable and its expected value.

Multiple regression , on the other hand, is used to determine the relationship between three or more variables: the dependent variable and at least two independent variables. The multiple regression equation looks complex but is similar to the single variable linear regression equation:

ŷ = α + β1x1 + β2x2 + … + βkxk

Each component of this equation represents the same thing as in the previous equation, with the addition of the subscript k, which is the total number of independent variables being examined. For each independent variable you include in the regression, multiply the slope of the regression line by the value of the independent variable, and add it to the rest of the equation.

How to Run Regressions

You can use a host of statistical programs—such as Microsoft Excel, SPSS, and STATA—to run both single variable linear and multiple regressions. If you’re interested in hands-on practice with this skill, Business Analytics teaches learners how to create scatter plots and run regressions in Microsoft Excel, as well as make sense of the output and use it to drive business decisions.
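
R, though not listed above, handles the same tasks; a minimal sketch for the employee-satisfaction example, assuming a hypothetical data frame company_data:

fit <- lm(product_sales ~ employee_satisfaction, data = company_data)
summary(fit)  # slope, intercept, R-squared, and significance in one view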

Calculating Confidence and Accounting for Error

It’s important to note: This overview of regression analysis is introductory and doesn’t delve into calculations of confidence level, significance, variance, and error. When working in a statistical program, these calculations may be provided or require that you implement a function. When conducting regression analysis, these metrics are important for gauging how significant your results are and how much importance to place on them.


Why Use Regression Analysis?

Once you’ve generated a regression equation for a set of variables, you effectively have a roadmap for the relationship between your independent and dependent variables. If you input a specific X value into the equation, you can see the expected Y value.

This can be critical for predicting the outcome of potential changes, allowing you to ask, “What would happen if this factor changed by a specific amount?”

Returning to the earlier example, running a regression analysis could allow you to find the equation representing the relationship between employee satisfaction and product sales. You could input a higher level of employee satisfaction and see how sales might change accordingly. This information could lead to improved working conditions for employees, backed by data that shows the tie between high employee satisfaction and sales.

Whether predicting future outcomes, determining areas for improvement, or identifying relationships between seemingly unconnected variables, understanding regression analysis can enable you to craft data-driven strategies and determine the best course of action with all factors in mind.

Do you want to become a data-driven professional? Explore our eight-week Business Analytics course and our three-course Credential of Readiness (CORe) program to deepen your analytical skills and apply them to real-world business problems.


A Refresher on Regression Analysis


Understanding one of the most important types of data analysis.

You probably know by now that whenever possible you should be making data-driven decisions at work . But do you know how to parse through all the data available to you? The good news is that you probably don’t need to do the number crunching yourself (hallelujah!) but you do need to correctly understand and interpret the analysis created by your colleagues. One of the most important types of data analysis is called regression analysis.



Regression Analysis: Definition, Types, Usage & Advantages


Regression analysis is perhaps one of the most widely used statistical methods for investigating or estimating the relationship between a set of independent and dependent variables. In statistical analysis, distinguishing between categorical data and numerical data is essential, as categorical data involves distinct categories or labels, while numerical data consists of measurable quantities.

It is also used as a blanket term for various data analysis techniques utilized in quantitative research for modeling and analyzing numerous variables. In the regression method, the independent variable is a predictor or an explanatory element, and the dependent variable is the outcome or a response to a specific query.


Definition of Regression Analysis


Regression analysis is often used to model or analyze data. Most survey analysts use it to understand the relationship between the variables, which can be further utilized to predict the precise outcome.

For example, suppose a soft drink company wants to expand its manufacturing unit to a newer location. Before moving forward, the company wants to analyze its revenue generation model and the various factors that might impact it. Hence, the company conducts an online survey with a specific questionnaire.

After using regression analysis, it becomes easier for the company to analyze the survey results and understand the relationship between different variables, such as electricity costs, and its revenue – here, revenue is the dependent variable.


In addition, understanding the relationship between different independent variables like pricing, number of workers, and logistics with the revenue helps the company estimate the impact of varied factors on sales and profits.

Survey researchers often use this technique to examine and find a correlation between different variables of interest. It provides an opportunity to gauge the influence of different independent variables on a dependent variable.

Overall, regression analysis saves survey researchers the additional effort of arranging several independent variables in tables and testing or calculating their effect on a dependent variable. Different types of analytical research methods are widely used to evaluate new business ideas and make informed decisions.


Types of Regression Analysis

Researchers usually start by learning linear and logistic regression. Because these two methods are so widely known and easy to apply, many analysts assume they are the only types of regression models. In reality, each model has its own specialty and performs best when specific conditions are met.

This section explains seven commonly used types of regression analysis that can be used to interpret data in various formats.

01. Linear Regression Analysis

It is one of the most widely known modeling techniques, as it is among the first regression methods people pick up when learning predictive modeling. Here, the dependent variable is continuous, and the independent variables are continuous or discrete, with a linear regression line fitted to the data.

Please note that multiple linear regression has more than one independent variable, whereas simple linear regression has only one. Thus, linear regression is best used only when there is a linear relationship between the independent and dependent variables.

A business can use linear regression to measure the effectiveness of marketing campaigns, pricing, and promotions on the sales of a product. Suppose a company selling sports equipment wants to understand whether the funds it has invested in the marketing and branding of its products have yielded substantial returns.

Linear regression is the best statistical method to interpret the results. It also helps isolate the impact of each individual marketing and branding activity while controlling for the other factors that influence sales.

If the company is running several advertising campaigns simultaneously, say one on television and two on radio, linear regression can analyze the independent as well as the combined influence of these advertisements on sales.
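
A sketch of that analysis in R, assuming a hypothetical weekly data frame campaigns with one column of spend per campaign:

# Independent and combined influence of one TV and two radio campaigns
fit <- lm(sales ~ tv_spend + radio1_spend + radio2_spend, data = campaigns)
summary(fit)  # each coefficient isolates one campaign's effect on sales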


02. Logistic Regression Analysis

Logistic regression is commonly used to determine the probability of event success or failure. It is used whenever the dependent variable is binary, like 0/1, True/False, or Yes/No. Thus, logistic regression is typically used to analyze close-ended survey questions with binary responses.

Please note that, unlike linear regression, logistic regression does not require a linear relationship between the dependent and independent variables. It applies a non-linear log transformation to predict the odds ratio; therefore, it easily handles various types of relationships between a dependent and an independent variable.

Logistic regression is widely used to analyze categorical data, particularly binary response data, in business data modeling. More often, logistic regression is used when the dependent variable is categorical – for example, to predict whether a health claim made by a person is real (1) or fraudulent (0), or to understand whether a tumor is malignant (1) or benign (0).

Businesses use logistic regression to predict whether consumers in a particular demographic will purchase their product or buy from a competitor, based on age, income, gender, race, state of residence, previous purchases, etc.
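
A hedged sketch of such a model, assuming a hypothetical survey_data frame:

# Predict purchase (1 = buys our product, 0 = buys from a competitor)
fit <- glm(purchase ~ age + income + gender + prior_purchases,
           data = survey_data, family = binomial)
head(predict(fit, type = "response"))  # predicted purchase probabilities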

03. Polynomial Regression Analysis

Polynomial regression is commonly used to analyze curvilinear data, where an independent variable's power is greater than 1. In this method, the best-fit line is not a straight line but a curved line fitting the data points.

Please note that polynomial regression is better suited when some variables in the model have exponents (squared or cubic terms, for example) and others do not.

Additionally, it can model non-linear data, offering the freedom to choose the exact exponent for each variable, with full control over the modeling features available.

When combined with response surface analysis, polynomial regression is considered one of the sophisticated statistical methods commonly used in multisource feedback research. Polynomial regression is used mostly in finance and insurance-related industries where the relationship between dependent and independent variables is curvilinear.

Suppose a person wants to plan their budget by determining how long it would take to save a specific sum. By modeling income and expenses over time, polynomial regression can estimate how long that person would need to work to reach the target amount.
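
In R, polynomial terms can be added with poly(); a sketch assuming hypothetical data on months worked and accumulated savings:

# Savings typically grow non-linearly over time
fit <- lm(savings ~ poly(months_worked, 2), data = income_data)
predict(fit, newdata = data.frame(months_worked = 24))  # expected savings after two years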

04. Stepwise Regression Analysis

This is a semi-automated process in which a statistical model is built by adding or removing independent variables based on the t-statistics of their estimated coefficients.

If used properly, stepwise regression will give you more powerful results than almost any other method. It works well when you are working with a large number of independent variables, fine-tuning the model by adding or dropping variables one at a time.

Stepwise regression analysis is recommended to be used when there are multiple independent variables, wherein the selection of independent variables is done automatically without human intervention.

Please note that in stepwise regression modeling, variables are added to or removed from the set of explanatory variables. The set of added or removed variables is chosen depending on the test statistics of the estimated coefficients.

Suppose you have a set of independent variables like age, weight, body surface area, duration of hypertension, basal pulse, and stress index, based on which you want to analyze their impact on blood pressure.

In stepwise regression, the best subset of the independent variable is automatically chosen; it either starts by choosing no variable to proceed further (as it adds one variable at a time) or starts with all variables in the model and proceeds backward (removes one variable at a time).

Thus, using regression analysis, you can calculate the impact of each or a group of variables on blood pressure.
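
A sketch of this in R using the built-in step() function and a hypothetical patients data frame (note that step() selects variables by AIC rather than t-statistics):

full <- lm(blood_pressure ~ age + weight + body_surface_area +
             hypertension_duration + basal_pulse + stress_index,
           data = patients)
step(full, direction = "backward")  # or "forward" / "both"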

05. Ridge Regression Analysis

Ridge regression is based on the ordinary least squares method and is used to analyze data affected by multicollinearity (data where independent variables are highly correlated). Collinearity can be explained as a near-linear relationship between variables.

Whenever there is multicollinearity, the least squares estimates remain unbiased, but their variances are large, so they may be far from the true value. Ridge regression reduces the standard errors by adding some degree of bias to the regression estimates, with the aim of providing more reliable estimates.


Please note that the assumptions of ridge regression are similar to those of least squares regression, except that normality is not assumed. Although the coefficient values are shrunk in ridge regression, they never reach zero, which means ridge regression cannot perform variable selection.

Suppose you are crazy about two guitarists performing live at an event near you, and you go to watch their performance hoping to find out who is the better guitarist. But when the performance starts, you notice that both are playing at the same time.

Is it possible to tell which guitarist has the bigger impact on the sound when both are playing loud and fast? Because their parts overlap, it is substantially difficult to separate their contributions – a textbook case of multicollinearity, which tends to inflate the standard errors of the coefficients.

Ridge regression addresses multicollinearity in cases like these and includes bias or a shrinkage estimation to derive results.

06. Lasso Regression Analysis

Lasso (Least Absolute Shrinkage and Selection Operator) is similar to ridge regression; however, it uses an absolute-value (L1) penalty instead of the squared (L2) penalty used in ridge regression.

It was introduced as an alternative to the traditional least-squares estimate to reduce the problems related to overfitting when the data has a large number of independent variables.

Lasso can perform both variable selection and regularization via soft thresholding. Applying lasso regression makes it easier to derive a subset of predictors that minimizes prediction error when analyzing a quantitative response.

Please note that regression coefficients that reach exactly zero after shrinkage are excluded from the lasso model. Coefficients that remain nonzero are strongly associated with the response variable, and the explanatory variables can be quantitative, categorical, or both.

Suppose an automobile company wants to perform a research analysis on average fuel consumption by cars in the US. For samples, they chose 32 models of car and 10 features of automobile design – Number of cylinders, Displacement, Gross horsepower, Rear axle ratio, Weight, ¼ mile time, v/s engine, transmission, number of gears, and number of carburetors.

The response variable, mpg (miles per gallon), turns out to be strongly correlated with several predictors, such as weight, displacement, number of cylinders, and horsepower. The problem can be analyzed using the glmnet package in R, applying lasso regression for feature selection.
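
The data set described above matches R's built-in mtcars, so the analysis can be sketched directly (the glmnet package must be installed first):

library(glmnet)
x <- as.matrix(mtcars[, -1])     # the 10 design features
y <- mtcars$mpg                  # response: miles per gallon
cv <- cv.glmnet(x, y, alpha = 1) # alpha = 1 is lasso; 0 is ridge; values in between, elastic net
coef(cv, s = "lambda.min")       # shrunk coefficients; exact zeros are dropped from the model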

07. Elastic Net Regression Analysis

It is a mixture of the ridge and lasso regression models, trained with both the L1 and L2 penalties. The elastic net brings about a grouping effect, wherein strongly correlated predictors tend to enter or leave the model together. Using the elastic net regression model is recommended when the number of predictors is far greater than the number of observations.

Please note that the elastic net regression model came into existence as an alternative to the lasso regression model, whose variable selection was too dependent on the data and therefore unstable. By using elastic net regression, statisticians became capable of combining the penalties of ridge and lasso regression to get the best of both models.

A clinical research team having access to a microarray data set on leukemia (LEU) was interested in constructing a diagnostic rule based on the expression level of presented gene samples for predicting the type of leukemia. The data set they had, consisted of a large number of genes and a few samples.

Apart from that, they were given a specific set of samples to be used as training samples, out of which some were infected with type 1 leukemia (acute lymphoblastic leukemia) and some with type 2 leukemia (acute myeloid leukemia).

Model fitting and tuning parameter selection by tenfold CV were carried out on the training data. Then they compared the performance of those methods by computing their prediction mean-squared error on the test data to get the necessary results.

Regression Analysis Usage in Market Research

A market research survey focuses on three major metrics: Customer Satisfaction, Customer Loyalty, and Customer Advocacy. Remember, although these metrics tell us about customer health and intentions, they fail to tell us ways of improving the position. Therefore, an in-depth survey questionnaire that asks consumers the reason behind their dissatisfaction is definitely a way to gain practical insights.

However, it has been found that people often struggle to articulate their motivation or demotivation, or to describe their satisfaction or dissatisfaction. In addition, people tend to give undue importance to certain rational factors, such as price, packaging, etc. Overall, regression analysis acts as a predictive analytics and forecasting tool in market research.

When used as a forecasting tool, regression analysis can determine an organization’s sales figures by taking into account external market data. A multinational company conducts a market research survey to understand the impact of various factors such as GDP (Gross Domestic Product), CPI (Consumer Price Index), and other similar factors on its revenue generation model.

Obviously, regression analysis that takes forecasted marketing indicators into consideration was used to predict the tentative revenue that will be generated in future quarters and even in future years. However, the further into the future you forecast, the more unreliable the data becomes, leaving a wider margin of error.

Case study of using regression analysis

A water purifier company wanted to understand the factors leading to brand favorability. The survey was the best medium for reaching out to existing and prospective customers. A large-scale consumer survey was planned, and a discreet questionnaire was prepared using the best survey tool .

A number of questions related to the brand, favorability, satisfaction, and probable dissatisfaction were effectively asked in the survey. After getting optimum responses to the survey, regression analysis was used to narrow down the top ten factors responsible for driving brand favorability.

All ten derived attributes highlighted, in one way or another, their importance in impacting the favorability of that specific water purifier brand.

How Regression Analysis Derives Insights from Surveys

It is easy to run a regression analysis using Excel or SPSS, but while doing so, you must understand the importance of four numbers in interpreting the data.

The first two numbers out of the four numbers directly relate to the regression model itself.

  • F-Value: It measures the overall statistical significance of the survey model. Remember, the p-value associated with the F-test should be less than 0.05 to be considered meaningful; this ensures the survey analysis output is not due to chance.
  • R-Squared: This value shows how much of the dependent variable's movement the independent variables can explain. If the R-Squared value is 0.7, the tested independent variables explain 70% of the dependent variable's movement, which means the survey analysis output is highly predictive and can be considered accurate.

The other two numbers relate to each of the independent variables while interpreting regression analysis.

  • P-Value: The p-value indicates how relevant and statistically significant each independent variable's effect is. Once again, we are looking for a value of less than 0.05.
  • Coefficient: The fourth number is the coefficient obtained after measuring the impact of the variables. It tells us by what value the dependent variable is expected to increase when the independent variable under consideration increases by one, while all other independent variables are held constant.

In a few cases, the simple coefficient is replaced by a standardized coefficient, which shows how much each independent variable contributes to moving or changing the dependent variable.
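
All four numbers can be read off a single summary() call in R; a sketch with a hypothetical survey data frame:

fit <- lm(satisfaction ~ price_perception + support_quality, data = survey)
summary(fit)
# F-statistic and its p-value  -> overall model significance
# Multiple R-squared           -> share of the dependent variable's movement explained
# Coefficients table, Pr(>|t|) -> per-variable p-values
# Coefficients table, Estimate -> per-variable coefficients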

Advantages of Using Regression Analysis in an Online Survey

01. Get access to predictive analytics

Did you know that utilizing regression analysis to understand the outcome of a business survey is like having the power to unveil future opportunities and risks?

For example, businesses can use past campaign data to predict the sales a particular television advertisement slot is likely to generate and, from that, estimate the maximum bid worth placing for that slot. The finance and insurance industry as a whole depends heavily on regression analysis of survey data to identify trends and opportunities for more accurate planning and decision-making.

02. Enhance operational efficiency

Do you know businesses use regression analysis to optimize their business processes?

For example, before launching a new product line, businesses conduct consumer surveys to better understand the impact of various factors on the product’s production, packaging, distribution, and consumption.

A data-driven foresight helps eliminate the guesswork, hypothesis, and internal politics from decision-making. A deeper understanding of the areas impacting operational efficiencies and revenues leads to better business optimization.

03. Quantitative support for decision-making

Business surveys today generate a lot of data related to finance, revenue, operation, purchases, etc., and business owners are heavily dependent on various data analysis models to make informed business decisions.

For example, regression analysis helps enterprises to make informed strategic workforce decisions. Conducting and interpreting the outcome of employee surveys like Employee Engagement Surveys, Employee Satisfaction Surveys, Employer Improvement Surveys, Employee Exit Surveys, etc., boosts the understanding of the relationship between employees and the enterprise.

It also helps get a fair idea of certain issues impacting the organization’s working culture, working environment, and productivity. Furthermore, intelligent business-oriented interpretations reduce the huge pile of raw data into actionable information to make a more informed decision.

04. Prevent mistakes from happening due to intuitions

By knowing how to use regression analysis for interpreting survey results, one can easily provide factual support to management for making informed decisions. But did you know that it also helps keep faults in judgment in check?

For example, a mall manager thinks that extending the mall's closing time will result in more sales. Regression analysis may contradict this belief, showing that the predicted increase in revenue from additional sales would not cover the increased operating expenses arising from longer working hours.

Regression analysis is a useful statistical method for modeling and understanding the relationships between variables. It accommodates many data types and kinds of relationships. Researchers and analysts can gain useful insights into the factors influencing a dependent variable and use the results to make informed decisions.

With QuestionPro Research, you can improve the efficiency and accuracy of regression analysis by streamlining the data gathering, analysis, and reporting processes. The platform’s user-friendly interface and wide range of features make it a valuable tool for researchers and analysts conducting regression analysis as part of their research projects.



Marketing Research Design & Analysis 2019

6 Regression

This chapter is primarily based on:

  • Field, A., Miles, J., & Field, Z. (2012): Discovering Statistics Using R. Sage Publications (chapters 6, 7, 8).
  • James, G., Witten, D., Hastie, T., & Tibshirani, R. (2013): An Introduction to Statistical Learning with Applications in R. Springer (chapter 3).

You can download the corresponding R-Code here

6.1 Correlation

Before we start with regression analysis, we will review the basic concept of correlation first. Correlation helps us to determine the degree to which the variation in one variable, X, is related to the variation in another variable, Y.

6.1.1 Correlation coefficient

The correlation coefficient summarizes the strength of the linear relationship between two metric (interval or ratio scaled) variables. Let’s consider a simple example. Say you conduct a survey to investigate the relationship between the attitude towards a city and the duration of residency. The “Attitude” variable can take values between 1 (very unfavorable) and 12 (very favorable), and the “duration of residency” is measured in years. Let’s further assume for this example that the attitude measurement represents an interval scale (although it is usually not realistic to assume that the scale points on an itemized rating scale have the same distance). To keep it simple, let’s further assume that you only asked 12 people. We can create a short data set like this:
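
For example (the original code chunk is reconstructed here with values chosen to be consistent with the statistics reported in this chapter, i.e., a mean duration of 9.33 and r = .936):

att_data <- data.frame(
  attitude = c(6, 9, 8, 3, 10, 4, 5, 2, 11, 9, 10, 2),
  duration = c(10, 12, 12, 4, 12, 6, 8, 2, 18, 9, 17, 2)
)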

Let’s look at the data. The following graph shows the individual data points for the “duration of residency” variable, where the blue horizontal line represents the mean of the variable (9.33) and the vertical lines show the distance of the individual data points from the mean.


Figure 6.1: Scores for duration of residency variable

You can see that there are some respondents that have been living in the city longer than average and some respondents that have been living in the city shorter than average. Let’s do the same for the second variable (“Attitude”):


Figure 6.2: Scores for attitude variable

Again, we can see that some respondents have an above average attitude towards the city (more favorable) and some respondents have a below average attitude towards the city. Let’s plot the data in one graph now to see if there is some co-movement:


Figure 6.3: Scores for attitude and duration of residency variables

We can see that there is indeed some co-movement here. The variables covary because respondents who have an above (below) average attitude towards the city also appear to have been living in the city for an above (below) average amount of time and vice versa. Correlation helps us to quantify this relationship. Before you proceed to compute the correlation coefficient, you should first look at the data. We usually use a scatterplot to visualize the relationship between two metric variables:

Figure 6.4: Scatterplot for duration and attitude variables

How can we compute the correlation coefficient? Remember that the variance measures the average deviation from the mean of a variable:

\[\begin{equation} \begin{split} s_x^2&=\frac{\sum_{i=1}^{N} (X_i-\overline{X})^2}{N-1} \\ &= \frac{\sum_{i=1}^{N} (X_i-\overline{X})*(X_i-\overline{X})}{N-1} \end{split} \tag{6.1} \end{equation}\]

When we consider two variables, we multiply the deviation for one variable by the respective deviation for the second variable:

\((X_i-\overline{X})*(Y_i-\overline{Y})\)

This is called the cross-product deviation. Then we sum the cross-product deviations:

\(\sum_{i=1}^{N}(X_i-\overline{X})*(Y_i-\overline{Y})\)

… and compute the average of the sum of all cross-product deviations to get the covariance :

\[\begin{equation} Cov(x, y) =\frac{\sum_{i=1}^{N}(X_i-\overline{X})*(Y_i-\overline{Y})}{N-1} \tag{6.2} \end{equation}\]

You can easily compute the covariance manually as follows
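
For instance, with the att_data set created above:

x <- att_data$duration
y <- att_data$attitude
sum((x - mean(x)) * (y - mean(y))) / (length(x) - 1)  # covariance, approx. 16.33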

Or you simply use the built-in cov() function:
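
This returns the same value:

cov(att_data$duration, att_data$attitude)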

A positive covariance indicates that as one variable deviates from the mean, the other variable deviates in the same direction. A negative covariance indicates that as one variable deviates from the mean (e.g., increases), the other variable deviates in the opposite direction (e.g., decreases).

However, the size of the covariance depends on the scale of measurement. Larger scale units will lead to larger covariance. To overcome the problem of dependence on measurement scale, we need to convert covariance to a standard set of units through standardization by dividing the covariance by the standard deviation (i.e., similar to how we compute z-scores).

With two variables, there are two standard deviations. We simply multiply the two standard deviations. We then divide the covariance by the product of the two standard deviations to get the standardized covariance, which is known as a correlation coefficient r:

\[\begin{equation} r=\frac{Cov_{xy}}{s_x*s_y} \tag{6.3} \end{equation}\]

This is known as the product moment correlation (r) and it is straightforward to compute:
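
Continuing with the x and y vectors from the covariance example above:

cov(x, y) / (sd(x) * sd(y))  # correlation coefficient, approx. 0.94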

Or you could just use the cor() function:
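
This gives the identical result:

cor(att_data$attitude, att_data$duration)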

Properties of r:

  • ranges from -1 to +1
  • +1 indicates a perfect positive linear relationship
  • -1 indicates a perfect negative linear relationship
  • 0 indicates no linear relationship
  • ±0.1 represents a small effect
  • ±0.3 represents a medium effect
  • ±0.5 represents a large effect

6.1.2 Significance testing

How can we determine whether our two variables are significantly related? To test this, we denote the population product moment correlation ρ and test the null hypothesis of no relationship between the variables:

\[H_0:\rho=0\] \[H_1:\rho\ne0\]

The test statistic is:

\[\begin{equation} t=\frac{r*\sqrt{N-2}}{\sqrt{1-r^2}} \tag{6.4} \end{equation}\]

It has a t distribution with N - 2 degrees of freedom. We then follow the usual procedure of calculating the test statistic and comparing it to the critical value of the underlying probability distribution. If the calculated test statistic is larger than the critical value, the null hypothesis of no relationship between X and Y is rejected.

Or you can simply use the cor.test() function, which also produces the 95% confidence interval:
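A sketch, again using the hypothetical vectors from above:

```r
cor.test(duration, attitude, method = "pearson", conf.level = 0.95)
```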

To determine the linear relationship between variables, the data only needs to be measured using interval scales. If you want to test the significance of the association, the sampling distribution needs to be normally distributed (we usually assume this when our data are normally distributed or when N is large). If parametric assumptions are violated, you should use non-parametric tests:

  • Spearman’s correlation coefficient: requires ordinal data and ranks the data before applying Pearson’s equation.
  • Kendall’s tau: use when N is small or the number of tied ranks is large.

Report the results:

A Pearson product-moment correlation coefficient was computed to assess the relationship between the duration of residence in a city and the attitude toward the city. There was a positive correlation between the two variables, r = 0.936, n = 12, p < 0.05. A scatterplot summarizes the results (Figure XY).

A note on the interpretation of correlation coefficients:

Correlation coefficients give no indication of the direction of causality. In our example, we can conclude that the attitude toward the city is more positive as the years of residence increase. However, we cannot say that the years of residence cause the attitudes to be more positive. There are two main reasons for caution when interpreting correlations:

  • Third-variable problem: there may be other unobserved factors that affect the results.
  • Direction of causality: Correlations say nothing about which variable causes the other to change (reverse causality: attitudes may just as well cause the years of residence variable).

6.2 Regression

Correlations measure relationships between variables (i.e., how much two variables covary). Using regression analysis we can predict the outcome of a dependent variable (Y) from one or more independent variables (X). E.g., how many products will we sell if we increase the advertising expenditures by 1000 Euros? In regression analysis, we fit a model to our data and use it to predict the values of the dependent variable from one predictor variable (bivariate regression) or several predictor variables (multiple regression). The following table shows a comparison of correlation and regression analysis:

6.2.1 Simple linear regression

In simple linear regression, we assess the relationship between one dependent (regressand) and one independent (regressor) variable. The goal is to fit a line through a scatterplot of observations in order to find the line that best describes the data.

Suppose you are a marketing research analyst at a music label and your task is to suggest, on the basis of past data, a marketing plan for the next year that will maximize product sales. The data set that is available to you includes information on the sales of music downloads (thousands of units), advertising expenditures (in Euros), the number of radio plays an artist received per week (airplay), the number of previous releases of an artist (starpower), repertoire origin (country; 0 = local, 1 = international), and genre (1 = rock, 2 = pop, 3 = electronic). Let’s load and inspect the data first:
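A sketch of the loading step; the file name here is a placeholder for wherever the data set is stored:

```r
# read the data set (file name/path is hypothetical)
regression <- read.csv("music_sales_regression.csv")
# inspect the data
head(regression)
str(regression)
```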

As stated above, regression analysis may be used to relate a quantitative response (“dependent variable”) to one or more predictor variables (“independent variables”). In a simple linear regression, we have one dependent and one independent variable.

Here are a few important questions that we might seek to address based on the data:

  • Is there a relationship between advertising budget and sales?
  • How strong is the relationship between advertising budget and sales?
  • Which other variables contribute to sales?
  • How accurately can we estimate the effect of each variable on sales?
  • How accurately can we predict future sales?
  • Is the relationship linear?
  • Is there synergy among the advertising activities?

We may use linear regression to answer these questions. Let’s start with the first question and investigate the effect of advertising on sales.

6.2.1.1 Estimating the coefficients

A simple linear regression model only has one predictor and can be written as:

\[\begin{equation} Y=\beta_0+\beta_1X+\epsilon \tag{6.5} \end{equation}\]

In our specific context, let’s consider only the influence of advertising on sales for now:

\[\begin{equation} Sales=\beta_0+\beta_1*adspend+\epsilon \tag{6.6} \end{equation}\]

The word “adspend” represents the data on advertising expenditures that we have observed, and \(\beta_1\) (the “slope”) represents the unknown relationship between advertising expenditures and sales. It tells you by how much sales will increase for an additional Euro spent on advertising. \(\beta_0\) (the “intercept”) is the number of sales we would expect if no money is spent on advertising. Together, \(\beta_0\) and \(\beta_1\) represent the model coefficients or parameters. The error term (\(\epsilon\)) captures everything that we miss by using our model, including (1) misspecification (the true relationship might not be linear), (2) omitted variables (other variables might drive sales), and (3) measurement error (our measurement of the variables might be imperfect).

Once we have used our training data to produce estimates for the model coefficients, we can predict future sales on the basis of a particular value of advertising expenditures by computing:

\[\begin{equation} \hat{Sales}=\hat{\beta_0}+\hat{\beta_1}*adspend \tag{6.7} \end{equation}\]

We use the hat symbol, ^, to denote the estimated value of an unknown parameter or coefficient, or the predicted value of the response (sales). In practice, \(\beta_0\) and \(\beta_1\) are unknown and must be estimated from the data to make predictions. In the case of our advertising example, the data set consists of the advertising budget and product sales (n = 200). Our goal is to obtain coefficient estimates such that the linear model fits the available data well. In other words, we fit a line through the scatterplot of observations and try to find the line that best describes the data. The following graph shows the scatterplot for our data, where the black line shows the regression line. The grey vertical lines show the differences between the predicted values (the regression line) and the observed values. These differences are referred to as the residuals (“e”).

Figure 6.5: Ordinary least squares (OLS)

Estimation of the regression function is based on the method of least squares (OLS = ordinary least squares). The first step is to calculate the residuals by subtracting the predicted values from the observed values:

\(e_i = Y_i-(\beta_0+\beta_1X_i)\)

This difference is then minimized by minimizing the sum of the squared residuals:

\[\begin{equation} \sum_{i=1}^{N} e_i^2= \sum_{i=1}^{N} [Y_i-(\beta_0+\beta_1X_i)]^2\rightarrow min! \tag{6.8} \end{equation}\]

where:

  • \(e_i\): residuals (i = 1, 2, …, N)
  • \(Y_i\): values of the dependent variable (i = 1, 2, …, N)
  • \(\beta_0\): intercept
  • \(\beta_1\): regression coefficient / slope parameter
  • \(X_i\): values of the independent variable (i = 1, 2, …, N)
  • N: number of observations

This is also referred to as the residual sum of squares (RSS). Now we need to choose the values for \(\beta_0\) and \(\beta_1\) that minimize the RSS. So how can we derive these values for the regression coefficients? The equation for \(\beta_1\) is given by:

\[\begin{equation} \hat{\beta_1}=\frac{COV_{XY}}{s_x^2} \tag{6.9} \end{equation}\]

The exact mathematical derivation of this formula is beyond the scope of this script, but the intuition is to take the first derivative of the sum of squared residuals with respect to \(\beta_1\) and set it to zero, thereby finding the \(\beta_1\) that minimizes the term. Using the above formula, you can easily compute \(\beta_1\) using the following code:
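A sketch, assuming the data frame is named regression as in the model call further below:

```r
cov_xy <- cov(regression$adspend, regression$sales)  # covariance of X and Y
var_x  <- var(regression$adspend)                    # variance of X
beta_1 <- cov_xy / var_x
beta_1
```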

The interpretation of \(\beta_1\) is as follows:

For every extra Euro spent on advertising, sales can be expected to increase by 0.096 units. In other words, if we increase our marketing budget by 1,000 Euros, sales can be expected to increase by 96 units.

Using the estimated coefficient for \(\beta_1\), it is easy to compute \(\beta_0\) (the intercept) as follows:

\[\begin{equation} \hat{\beta_0}=\overline{Y}-\hat{\beta_1}\overline{X} \tag{6.10} \end{equation}\]

The R code for this is:
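Continuing the sketch from above:

```r
beta_0 <- mean(regression$sales) - beta_1 * mean(regression$adspend)
beta_0
```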

The interpretation of \(\beta_0\) is as follows:

If we spend no money on advertising, we would expect to sell 134.14 units.

You may also verify this based on a scatterplot of the data. The following plot shows the scatterplot including the regression line, which is estimated using OLS.

Figure 6.6: Scatterplot

You can see that the regression line intersects the y-axis at 134.14, which corresponds to the expected sales level when advertising expenditure (on the x-axis) is zero (i.e., the intercept \(\beta_0\)). The slope coefficient (\(\beta_1\)) tells you by how much sales (on the y-axis) would increase if advertising expenditures (on the x-axis) were increased by one unit.

6.2.1.2 Significance testing

In a next step, we assess whether the effect of advertising on sales is statistically significant. This means that we test the null hypothesis \(H_0\): “There is no relationship between advertising and sales” against the alternative hypothesis \(H_1\): “There is some relationship between advertising and sales”. Or, stated mathematically:

\[H_0:\beta_1=0\] \[H_1:\beta_1\ne0\]

How can we test if the effect is statistically significant? Recall the generalized equation to derive a test statistic:

\[\begin{equation} test\ statistic = \frac{effect}{error} \tag{6.11} \end{equation}\]

The effect is given by the \(\beta_1\) coefficient in this case. To compute the test statistic, we need a measure of uncertainty around this estimate (the error). This is because we use information from a sample to estimate the least squares line and make inferences regarding the regression line in the entire population. Since we only have access to one sample, the regression line will be slightly different every time we take a different sample from the population. This is sampling variation and it is perfectly normal! It just means that we need to take the uncertainty around the estimate into account, which is achieved by the standard error. Thus, the test statistic for our hypothesis is given by:

\[\begin{equation} t = \frac{\hat{\beta_1}}{SE(\hat{\beta_1})} \tag{6.12} \end{equation}\]

After calculating the test statistic, we compare its value to the values that we would expect to find if there was no effect, based on the t-distribution. In a regression context, the degrees of freedom are given by N - p - 1, where N is the sample size and p is the number of predictors. In our case, we have 200 observations and one predictor, so the degrees of freedom are 200 - 1 - 1 = 198. In the regression output below, R provides the exact probability of observing a t value of this magnitude (or larger) if the null hypothesis was true. This probability is the p-value. A small p-value indicates that it is unlikely to observe such a substantial association between the predictor and the outcome variable due to chance in the absence of any real association.

To estimate the regression model in R, you can use the lm() function. Within the function, you first specify the dependent variable (“sales”) and independent variable (“adspend”) separated by a ~ (tilde). As mentioned previously, this is known as formula notation in R. The data = regression argument specifies that the variables come from the data frame named “regression”. Strictly speaking, you use the lm() function to create an object called “simple_regression,” which holds the regression output. You can then view the results using the summary() function:
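A sketch of this step:

```r
simple_regression <- lm(sales ~ adspend, data = regression)
summary(simple_regression)
```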

Note that the estimated coefficients for \(\beta_0\) (134.14) and \(\beta_1\) (0.096) correspond to the results of our manual computation above. The associated t-values and p-values are given in the output. The t-values are larger than the critical t-values for the 95% confidence level, since the associated p-values are smaller than 0.05. In the case of the coefficient for \(\beta_1\), this means that the probability of observing an association between advertising and sales of this magnitude (or larger) is smaller than 0.05 if the true value of \(\beta_1\) was, in fact, 0.

The coefficients associated with the respective variables represent point estimates. To get a better feeling for the range of values that the coefficients could take, it is helpful to compute confidence intervals. A 95% confidence interval is defined as a range of values such that with a 95% probability the range will contain the true unknown value of the parameter. For example, for \(\beta_1\), the confidence interval can be computed as:

\[\begin{equation} CI = \hat{\beta_1}\pm(t_{1-\frac{\alpha}{2}}*SE(\hat{\beta_1})) \tag{6.13} \end{equation}\]

It is easy to compute confidence intervals in R using the confint() function. You just have to provide the name of your estimated model as an argument:
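For the simple model from above:

```r
confint(simple_regression)
```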

For our model, the 95% confidence interval for \(\beta_0\) is [119.28, 149], and the 95% confidence interval for \(\beta_1\) is [0.08, 0.12]. Thus, we can conclude that when we do not spend any money on advertising, sales will be somewhere between 119 and 149 units on average. In addition, for each increase in advertising expenditures by one Euro, there will be an average increase in sales of between 0.08 and 0.12 units.

6.2.1.3 Assessing model fit

Once we have rejected the null hypothesis in favor of the alternative hypothesis, the next step is to investigate to what extent the model represents (“fits”) the data. How can we assess the model fit?

  • First, we calculate the fit of the most basic model (i.e., the mean)
  • Then, we calculate the fit of the best model (i.e., the regression model)
  • A good model should fit the data significantly better than the basic model
  • \(R^2\) represents the percentage of the variation in the outcome that can be explained by the model
  • The F-ratio measures how much the model has improved the prediction of the outcome compared to the level of inaccuracy in the model

Similar to ANOVA, the calculation of model fit statistics relies on estimating the different sum of squares values. \(SS_T\) is the sum of squared differences between the observed data and the mean value of Y (the total variation). In the absence of any other information, the mean value of Y represents the best guess of where an observation at a given level of advertising will fall:

\[\begin{equation} SS_T= \sum_{i=1}^{N} (Y_i-\overline{Y})^2 \tag{6.14} \end{equation}\]

The following graph shows the total sum of squares:

Figure 6.7: Total sum of squares

Based on our linear model, the best guess about the sales level at a given level of advertising is the predicted value. The model sum of squares (\(SS_M\)) has the mathematical representation:

\[\begin{equation} SS_M= \sum_{i=1}^{N} (\hat{Y}_i-\overline{Y})^2 \tag{6.15} \end{equation}\]

The model sum of squares represents the improvement in prediction resulting from using the regression model rather than the mean of the data. The following graph shows the model sum of squares for our example:

Figure 6.8: Model sum of squares

The residual sum of squares (\(SS_R\)) is the sum of squared differences between the observed data and the values predicted by the regression line (i.e., the variation not explained by the model):

\[\begin{equation} SS_R= \sum_{i=1}^{N} (Y_i-\hat{Y}_i)^2 \tag{6.16} \end{equation}\]

The following graph shows the residual sum of squares for our example:

Figure 6.9: Residual sum of squares

The \(R^2\) statistic represents the proportion of variance that is explained by the model and is computed as:

\[\begin{equation} R^2= \frac{SS_M}{SS_T} \tag{6.17} \end{equation}\]

It takes values between 0 (very bad fit) and 1 (very good fit). Note that when the goal of your model is to predict future outcomes, a “too good” model fit can pose severe challenges. The reason is that the model might fit your specific sample so well that it will only predict well within the sample but will not generalize to other samples. This is called overfitting, and it shows that there is a trade-off between model fit and the out-of-sample predictive ability of the model if the goal is to predict beyond the sample.

You can get a first impression of the fit of the model by inspecting the scatter plot as can be seen in the plot below. If the observations are highly dispersed around the regression line (left plot), the fit will be lower compared to a data set where the values are less dispersed (right plot).

Figure 6.10: Good vs. bad model fit

The \(R^2\) statistic is reported in the regression output (see above). However, you can also extract the relevant sum of squares statistics from the regression object using the anova() function and compute it manually:
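For instance:

```r
anova(simple_regression)
```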

Now we can compute \(R^2\) in the same way that we computed \(\eta^2\) in the last section:
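A sketch, extracting the sums of squares by row and column name from the ANOVA table:

```r
anova_results <- anova(simple_regression)
ss_model    <- anova_results["adspend", "Sum Sq"]    # explained variation
ss_residual <- anova_results["Residuals", "Sum Sq"]  # unexplained variation
r_squared <- ss_model / (ss_model + ss_residual)     # SS_T = SS_M + SS_R
r_squared
```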

Adjusted R-squared

Due to the way the \(R^2\) statistic is calculated, it will never decrease when a new explanatory variable is introduced into the model. This means that every new independent variable either leaves the \(R^2\) unchanged or increases it, even if there is no real relationship between the new variable and the dependent variable. Hence, one could be tempted to add as many variables as possible just to increase the \(R^2\) and thus obtain a “better” model. However, this only adds noise and therefore results in a worse model.

To account for this, there is a statistic closely related to the \(R^2\): the adjusted \(R^2\). It can be calculated as follows:

\[\begin{equation} \overline{R^2} = 1 - (1 - R^2)\frac{n-1}{n - k - 1} \tag{6.18} \end{equation}\]

where n is the total number of observations and k is the total number of explanatory variables. The adjusted \(R^2\) is equal to or less than the regular \(R^2\) and can be negative. It will only increase if the added variable adds more explanatory power than one would expect by pure chance. Essentially, it contains a “penalty” for including unnecessary variables and therefore favors more parsimonious models. As such, it is well suited for comparing different models and is very useful in the model selection stage of a project. In R, the standard lm() function automatically reports the adjusted \(R^2\).

Another significance test is the F-test. It tests the null hypothesis:

\[H_0:R^2=0\]

This is equivalent to the following null hypothesis:

\[H_0:\beta_1=\beta_2=\dots=\beta_k=0\]

The F-test statistic is calculated as follows:

\[\begin{equation} F=\frac{\frac{SS_M}{k}}{\frac{SS_R}{(n-k-1)}}=\frac{MS_M}{MS_R} \tag{6.19} \end{equation}\]

which has an F distribution with k and (n - k - 1) degrees of freedom, where k is the number of predictors and n the number of observations. In other words, you divide the systematic (“explained”) variation due to the predictor variables by the unsystematic (“unexplained”) variation.

The result of the F-test is provided in the regression output. However, you can also compute the F-statistic manually using the ANOVA results from the model:
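A sketch along the same lines as the manual \(R^2\) computation:

```r
anova_results <- anova(simple_regression)
ss_model    <- anova_results["adspend", "Sum Sq"]
ss_residual <- anova_results["Residuals", "Sum Sq"]
k <- 1                        # number of predictors
n <- nobs(simple_regression)  # number of observations
f_stat <- (ss_model / k) / (ss_residual / (n - k - 1))
f_stat
```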

6.2.1.4 Using the model

After fitting the model, we can use the estimated coefficients to predict sales for different values of advertising. Suppose you want to predict sales for a new product, and the company plans to spend 800 Euros on advertising. How much will it sell? You can easily compute this either by hand:

\[\hat{sales}=134.134 + 0.09612*800 \approx 211\]

… or by extracting the estimated coefficients from the model summary:
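For example:

```r
coefs <- coef(simple_regression)
coefs["(Intercept)"] + coefs["adspend"] * 800
# equivalently, using the predict() function
predict(simple_regression, newdata = data.frame(adspend = 800))
```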

The predicted value of the dependent variable is 211 units, i.e., the product will (on average) sell 211 units.


6.2.2 Multiple linear regression

Multiple linear regression is a statistical technique that simultaneously tests the relationships between two or more independent variables and an interval-scaled dependent variable. The general form of the equation is given by:

\[\begin{equation} Y=\beta_0+\beta_1*X_1+\beta_2*X_2+\dots+\beta_n*X_n+\epsilon \tag{6.20} \end{equation}\]

Again, we aim to find the linear combination of predictors that correlates maximally with the outcome variable. Note that if you change the composition of predictors, the partial regression coefficient of an independent variable will generally differ from its bivariate regression coefficient. This is because the regressors are usually correlated: in a bivariate regression, any variation in Y that is shared by \(X_1\) and \(X_2\) is attributed to \(X_1\) alone. The interpretation of a partial regression coefficient is the expected change in Y when X is changed by one unit while all other predictors are held constant.

Let’s extend the previous example. Say, in addition to the influence of advertising, you are interested in estimating the influence of airplay on the number of album downloads. The corresponding equation would then be given by:

\[\begin{equation} Sales=\beta_0+\beta_1*adspend+\beta_2*airplay+\epsilon \tag{6.21} \end{equation}\]

The words “adspend” and “airplay” represent the data that we have observed on advertising expenditures and the number of radio plays, and \(\beta_1\) and \(\beta_2\) represent the unknown relationships between sales and advertising expenditures and radio airplay, respectively. The coefficients tell you by how much sales will increase for an additional Euro spent on advertising (when radio airplay is held constant) and by how much sales will increase for an additional radio play (when advertising expenditures are held constant). Thus, we can make predictions about album sales based not only on advertising spending, but also on radio airplay.

With several predictors, the partitioning of sum of squares is the same as in the bivariate model, except that the model is no longer a 2-D straight line. With two predictors, the regression line becomes a 3-D regression plane. In our example:

Figure 6.11: Regression plane

Like in the bivariate case, the plane is fitted to the data with the aim of predicting the observed data as well as possible. The deviations of the observations from the plane represent the residuals (the error we make in predicting the observed data from the model). Note that this is conceptually the same as in the bivariate case, except that the computation is more complex (we won’t go into details here). The model is fairly easy to plot using a 3-D scatterplot, because we only have two predictors. While multiple regression models that have more than two predictors are not as easy to visualize, you may apply the same principles when interpreting the model outcome:

  • The total sum of squares (\(SS_T\)) is still the difference between the observed data and the mean value of Y (total variation)
  • The residual sum of squares (\(SS_R\)) is still the difference between the observed data and the values predicted by the model (unexplained variation)
  • The model sum of squares (\(SS_M\)) is still the difference between the values predicted by the model and the mean value of Y (explained variation)
  • R measures the multiple correlation between the predictors and the outcome
  • \(R^2\) is the amount of variation in the outcome variable explained by the model

Estimating multiple regression models is straightforward using the lm() function. You just need to separate the individual predictors on the right hand side of the equation using the + symbol. For example, the model:

\[\begin{equation} Sales=\beta_0+\beta_1*adspend+\beta_2*airplay+\beta_3*starpower+\epsilon \tag{6.22} \end{equation}\]

could be estimated as follows:
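A sketch:

```r
multiple_regression <- lm(sales ~ adspend + airplay + starpower, data = regression)
summary(multiple_regression)
```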

The interpretation of the coefficients is as follows:

  • adspend (\(\beta_1\)): when advertising expenditures increase by 1 Euro, sales will increase by 0.08 units
  • airplay (\(\beta_2\)): when radio airplay increases by 1 play per week, sales will increase by 3.37 units
  • starpower (\(\beta_3\)): when the number of previous albums increases by 1, sales will increase by 11.09 units

The associated t-values and p-values are also given in the output. You can see that the p-values are smaller than 0.05 for all three coefficients. Hence, all effects are “significant”. This means that if the null hypothesis was true (i.e., there was no effect between the variables and sales), the probability of observing associations of the estimated magnitudes (or larger) is very small (e.g., smaller than 0.05).

Again, to get a better feeling for the range of values that the coefficients could take, it is helpful to compute confidence intervals:
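As before:

```r
confint(multiple_regression)
```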

What does this tell you? Recall that a 95% confidence interval is defined as a range of values such that with a 95% probability the range will contain the true unknown value of the parameter. For example, for \(\beta_3\), the confidence interval is [6.28, 15.89]. Thus, although we have computed a point estimate of 11.09 for the effect of starpower on sales based on our sample, the effect might just as well take any other value within this range, considering the sample size and the variability in our data.

The output also tells us that 66.47% of the variation can be explained by our model. You may also visually inspect the fit of the model by plotting the predicted values against the observed values. We can extract the predicted values using the predict() function. So let’s create a new variable yhat, which contains those predicted values:
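For example:

```r
regression$yhat <- predict(multiple_regression)
```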

We can now use this variable to plot the predicted values against the observed values. In the following plot, the model fit would be perfect if all points fell on the diagonal line. The larger the distance between the points and the line, the worse the model fit.

Figure 6.12: Model fit

Partial plots

In the context of a simple linear regression (i.e., with a single independent variable), a scatterplot of the dependent variable against the independent variable provides a good indication of the nature of the relationship. If there is more than one independent variable, however, things become more complicated. The reason is that although the scatterplot still shows the relationship between two variables, it does not take into account the effect of the other independent variables in the model. Partial regression plots show the effect of adding another variable to a model that already controls for the remaining variables in the model. In other words, each is a scatterplot of the residuals of the outcome variable and one predictor, when both variables are regressed separately on the remaining predictors. As an example, consider the effect of advertising expenditures on sales. In this case, the partial plot would show the effect of adding advertising expenditures as an explanatory variable while controlling for the variation that is explained by airplay and starpower in both variables (sales and advertising). Think of it as the purified relationship between advertising and sales that remains after controlling for other factors. Partial plots can easily be created using the avPlots() function from the car package:
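A sketch (the car package must be installed):

```r
library(car)
avPlots(multiple_regression)
```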

Figure 6.13: Partial plots

Using the model

After fitting the model, we can use the estimated coefficients to predict sales for different values of advertising, airplay, and starpower. Suppose you would like to predict sales for a new music album with advertising expenditures of 800, airplay of 30 and starpower of 5. How much will it sell?

\[\hat{sales}=-26.61 + 0.084*800 + 3.367*30 + 11.08*5 \approx 197.74\]

… or by extracting the estimated coefficients:
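For example:

```r
coefs <- coef(multiple_regression)
coefs["(Intercept)"] + coefs["adspend"] * 800 + coefs["airplay"] * 30 + coefs["starpower"] * 5
# equivalently
predict(multiple_regression, newdata = data.frame(adspend = 800, airplay = 30, starpower = 5))
```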

The predicted value of the dependent variable is 198 units, i.e., the product will sell (on average) 198 units.

Comparing effects

Using the output from the regression model above, it is difficult to compare the effects of the independent variables because they are all measured on different scales (Euros, radio plays, releases). Standardized regression coefficients can be used to judge the relative importance of the predictor variables. Standardization is achieved by multiplying the unstandardized coefficient by the ratio of the standard deviations of the independent and dependent variables:

\[\begin{equation} B_{k}=\beta_{k} * \frac{s_{x_k}}{s_y} \tag{6.23} \end{equation}\]

Hence, the standardized coefficient tells you by how many standard deviations the outcome will change as a result of a one standard deviation change in the predictor variable. Standardized coefficients can easily be computed using the lm.beta() function from the lm.beta package:
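A sketch (the lm.beta package must be installed):

```r
library(lm.beta)
lm.beta(multiple_regression)
```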

The results show that for adspend and airplay, a change by one standard deviation results in a 0.51 standard deviation change in sales, whereas for starpower, a one standard deviation change only leads to a 0.19 standard deviation change in sales. Hence, while the effects of adspend and airplay are comparable in magnitude, the effect of starpower is less strong.


6.3 Potential problems

Once you have built and estimated your model, it is important to run diagnostics to ensure that the results are accurate. The following sections discuss common problems.

6.3.1 Outliers


Outliers are data points that differ vastly from the trend. They can introduce bias into a model because they distort the parameter estimates. Consider the example below. A linear regression was performed twice on the same data set, except that during the second estimation the two green points were turned into outliers by moving them to the positions indicated in red. The solid red line is the regression line based on the unaltered data set, while the dotted line was estimated using the altered data set. As you can see, the second regression would lead to different conclusions than the first. Therefore, it is important to identify outliers and deal with them appropriately.

Figure 6.14: Effects of outliers

One quick way to visually detect outliers is to create a scatterplot (as above) and see whether anything seems off. Another approach is to inspect the studentized residuals. If there are no outliers in your data, about 95% of the studentized residuals should lie between -2 and 2, as per the properties of the normal distribution. Values well outside of this range are unlikely to occur by chance and warrant further inspection. As a rule of thumb, observations whose studentized residuals are greater than 3 in absolute value are potential outliers.

The studentized residuals can be obtained in R with the function rstudent(). We can use this function to create a new variable that contains the studentized residuals. The music sales regression from before yields the following residuals:
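A sketch, reusing the multiple regression model from above:

```r
regression$stud_resid <- rstudent(multiple_regression)
head(regression$stud_resid)
```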

A good way to visually inspect the studentized residuals is to plot them in a scatterplot and roughly check whether most of the observations are within the [-3, 3] bounds.

Figure 6.15: Plot of the studentized residuals

To identify potentially influential observations in our data set, we can apply a filter to our data:
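For example:

```r
subset(regression, abs(stud_resid) > 3)
```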

After a detailed inspection of the potential outliers, you may decide whether or not to delete the affected observations from the data set. If an outlier has resulted from an error in data collection, you might simply remove the observation. However, even though observations may have extreme values, they might not be influential in determining the regression line; that is, the results wouldn’t differ much whether we include or exclude them from the analysis. The decision of whether to exclude an outlier is therefore closely related to the question of whether it is an influential observation, as will be discussed next.

6.3.2 Influential observations

Related to the issue of outliers is that of influential observations, i.e., observations that exert undue influence on the parameter estimates. You can determine whether the results are driven by an influential observation by calculating how far the predicted values for your data would move if the model were fitted without this particular observation. This distance is called Cook’s distance. To identify influential observations, we can inspect the respective plots created from the model output. A rule of thumb is to classify an observation as influential if its Cook’s distance is greater than 1 (although opinions vary on this). The following plot shows the Cook’s distance associated with each data point:

Figure 6.16: Cook’s distance

It is easy to see that none of the Cook’s distance values comes close to the critical value of 1. Another useful plot for identifying influential observations is plot number 5 from the model output:

Figure 6.17: Residuals vs. Leverage

In this plot, we look for cases outside of the dashed lines, which represent thresholds for Cook’s distance. Lines for Cook’s distance thresholds of 0.5 and 1 are included by default. In our example, these lines are not even visible, since the Cook’s distance values are far below the critical values. Generally, you would watch out for outlying values in the upper right or lower right corner of the plot, since cases in those spots can be influential on the regression line. In our example, there are no influential cases.

To see how influential observations can impact your regression, have a look at this example.

6.3.3 Non-linearity

An important underlying assumption for OLS is that of linearity, meaning that the relationship between the dependent and the independent variable can be reasonably approximated in linear terms. One quick way to assess whether a linear relationship can be assumed is to inspect the added variable plots that we already came across earlier:

Figure 6.18: Partial plots

In our example, it appears that linear relationships can be reasonably assumed. Please note, however, that the assumption of linearity implies two things:

  • Constant marginal returns (e.g., an increase in ad-spend from 10€ to 11€ yields the same increase in sales as an increase from 100,000€ to 100,001€)
  • Elasticities increase with X (e.g., advertising becomes relatively more effective; i.e., a relatively smaller change in advertising expenditure will yield the same return)

These assumptions may not be justifiable in certain contexts and you might have to transform your data (e.g., using log-transformations) in these cases, as we will see below.

6.3.4 Non-constant error variance


Another important assumption of the linear model is that the error terms have a constant variance (i.e., homoscedasticity). The following plot from the model output shows the residuals (the vertical distance from an observed value to the predicted value) versus the fitted values (the predicted values from the regression model). If all the points fell exactly on the dashed grey line, we would have a perfect prediction. The residual variance (i.e., the spread of the values on the y-axis) should be similar across the scale of the fitted values on the x-axis.

Figure 6.19: Residuals vs. fitted values

In our case, this appears to hold. You can identify non-constant error variances (i.e., heteroscedasticity) by the presence of a funnel shape in the above plot. When the assumption of constant error variance is not met, this might be due to a misspecification of your model (e.g., the relationship might not be linear). In these cases, it often helps to transform your data (e.g., using log-transformations). The red line also helps you to identify potential misspecification of your model. It is a smoothed curve that passes through the residuals; if it lies close to the grey dashed line (as in our case), it suggests a correct specification. If the line deviated substantially from the dashed grey line (e.g., a U-shape or inverse U-shape), it would suggest that the linear model specification is not reasonable and that you should try a different specification.

If OLS is performed despite heteroscedasticity, the coefficient estimates will still be correct on average. However, the estimator is inefficient, meaning that the standard errors are wrong, which will impact the significance tests (i.e., the p-values will be wrong). However, there are robust regression methods that you can use to estimate your model despite the presence of heteroscedasticity.

6.3.5 Non-normally distributed errors

Another assumption of OLS is that the error term is normally distributed. This can be a reasonable assumption for many scenarios, but we still need a way to check whether it actually holds. As we cannot directly observe the actual error term, we have to work with the next best thing: the residuals.

A quick way to assess whether a given sample is approximately normally distributed is to use a Q-Q plot. It plots the theoretical position of the observations (under the assumption that they are normally distributed) against their actual position. The plot below, created from the model output, shows the residuals in a Q-Q plot. As you can see, most of the points roughly follow the theoretical distribution, as given by the straight line. If most of the points are close to the line, the data are approximately normally distributed.

Figure 6.20: Q-Q plot

Another way to check for normally distributed errors is to employ statistical tests of the null hypothesis that the data are normally distributed, such as the Shapiro–Wilk test. We can extract the residuals from our model using the resid() function and apply the shapiro.test() function to them:
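For example:

```r
shapiro.test(resid(multiple_regression))
```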

As you can see, we cannot reject the \(H_0\) of normally distributed residuals, which means that we can assume the residuals to be approximately normally distributed.

When the assumption of normally distributed errors is not met, this might again be due to a misspecification of your model, in which case it might help to transform your data (e.g., using log-transformations).

6.3.6 Correlation of errors

The assumption of independent errors implies that the residual terms of any two observations should be uncorrelated. This is also known as a lack of autocorrelation. This can be tested with the Durbin-Watson test, which checks whether adjacent residuals are correlated. However, be aware that the test is sensitive to the order of your data. Hence, it only makes sense if there is a natural order in the data (e.g., time-series data), in which case correlated adjacent residuals indicate autocorrelation. Since there is no natural order in our data, we don’t need to apply this test.

If you are confronted with data that has a natural order, you can perform the test using the durbinWatsonTest() function, which takes the object generated by the lm() function as an argument. The test statistic varies between 0 and 4, with values close to 2 being desirable. As a rule of thumb, values below 1 or above 3 are causes for concern.
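Purely for illustration (recall that our data have no natural order), the call would look like this:

```r
library(car)
durbinWatsonTest(multiple_regression)
```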

6.3.7 Collinearity

Multicollinearity (i.e., linear dependence of regressors) occurs when there is a strong linear relationship between the independent variables. Some correlation will always be present, but severe correlation can make proper estimation impossible. When present, it affects the model in several ways:

  • It limits the size of \(R^2\): when two variables are highly correlated, the amount of unique explained variance is low; the incremental change in \(R^2\) from including an additional predictor is therefore larger if the predictor is uncorrelated with the other predictors.
  • It increases the standard errors of the coefficients, making them less trustworthy.
  • It creates uncertainty about the importance of predictors: if two predictors explain similar variance in the outcome, we cannot know which of these variables is important.

A quick way to find obvious multicollinearity is to examine the correlation matrix of the data. Any correlation above 0.8-0.9 should be cause for concern. You can, for example, create a correlation matrix using the rcorr() function from the Hmisc package:
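A sketch (the Hmisc package must be installed; rcorr() expects a matrix):

```r
library(Hmisc)
rcorr(as.matrix(regression[, c("adspend", "airplay", "starpower")]))
```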

The bivariate correlations can also be shown in a plot:

Figure 6.21: Bivariate correlation plots

However, this only spots bivariate multicollinearity. Variance inflation factors (VIF) can be used to spot more subtle multicollinearity arising from multivariate relationships. The VIF is calculated by regressing \(X_i\) on all other predictors and using the resulting \(R_i^2\) to calculate:

\[\begin{equation} VIF_i=\frac{1}{1 - R_i^2} \tag{6.24} \end{equation}\]

VIF values over 4 are certainly cause for concern, and values over 2 should be further investigated. If the average VIF is over 1, the regression may be biased. The VIFs for all variables can easily be calculated in R with the vif() function from the car package:
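For example:

```r
library(car)
vif(multiple_regression)
```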

As you can see, the values are well below the cutoff, indicating that we do not have to worry about multicollinearity in our example.

6.3.8 Omitted Variables

If a variable that influences the outcome is left out of the model (“omitted”), a bias in the other variables’ coefficients might be introduced. Specifically, the other coefficients will be biased if the corresponding variables are correlated with the omitted variable. Intuitively, the variables left in the model “pick up” the effect of the omitted variable to the degree that they are related to it. Let’s illustrate this with an example.

Consider the following data on the number of people visiting concerts of smaller bands.

The data set contains three variables:

  • avg_rating : The average rating a band has, resulting from a ten-point scale.
  • followers : The number of followers the band has at the time of the concert.
  • concert_visitors : The number of tickets sold for the concert.

If we estimate a model to explain the number of tickets sold as a function of the average rating and the number of followers, the results would look as follows:
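A sketch, assuming the data frame is named concerts (a placeholder name):

```r
full_model <- lm(concert_visitors ~ avg_rating + followers, data = concerts)
summary(full_model)
```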

Now assume we don’t have data on the number of followers a band has, but we still have information on the average rating and want to explain the number of tickets sold. Fitting a linear model with just the avg_rating variable included yields the following results:
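Continuing the sketch:

```r
reduced_model <- lm(concert_visitors ~ avg_rating, data = concerts)
summary(reduced_model)
```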

What happens to the coefficient of avg_rating? Because avg_rating and followers are not independent (e.g., one could argue that bands with a higher average rating probably have more followers), the coefficient will be biased. In our case, we massively overestimate the effect that the average rating of a band has on ticket sales: in the original model, the effect was about 20.5; in the new, smaller model, the effect is approximately 3.1 times higher.

We can also work out intuitively what the bias will be. The marginal effect of followers on concert_visitors is captured by avg_rating to the degree that avg_rating is related to followers. There are two coefficients of interest:

  • What is the marginal effect of followers on concert_visitors ?
  • How much of that effect is captured by avg_rating ?

The former is just the coefficient of followers in the original regression.

The latter is the coefficient of avg_rating obtained from a regression of followers on avg_rating, since this coefficient shows how avg_rating and followers relate to each other.

Now we can calculate the bias induced by omitting followers:

To calculate the biased coefficient, simply add the bias to the coefficient from the original model:
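Putting the pieces together in the hypothetical concerts example:

```r
# marginal effect of followers in the full model
beta_followers <- coef(full_model)["followers"]
# how strongly followers and avg_rating are related
delta <- coef(lm(followers ~ avg_rating, data = concerts))["avg_rating"]
# omitted-variable bias and the biased coefficient it implies
bias <- beta_followers * delta
coef(full_model)["avg_rating"] + bias  # equals the avg_rating coefficient in the reduced model
```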

6.4 Categorical predictors

6.4.1 Two categories

Suppose you wish to investigate the effect of the variable “country” on sales. This is a categorical variable that can take only two levels (i.e., 0 = local artist, 1 = international artist). Categorical variables with two levels are also called binary predictors. It is straightforward to include these variables in your model as “dummy” variables. Dummy variables are factor variables that can take only two values. For our “country” variable, we can create a new predictor variable of the form:

\[\begin{equation} x_4 = \begin{cases} 1 & \quad \text{if } i \text{th artist is international}\\ 0 & \quad \text{if } i \text{th artist is local} \end{cases} \tag{6.25} \end{equation}\]

This new variable is then added to our regression equation from before, so that the equation becomes:

\[\begin{align} Sales =\beta_0 &+\beta_1*adspend\\ &+\beta_2*airplay\\ &+\beta_3*starpower\\ &+\beta_4*international+\epsilon \end{align}\]

where “international” represents the new dummy variable and \(\beta_4\) is the coefficient associated with this variable. Estimating the model is straightforward - you just need to include the variable as an additional predictor variable. Note that the variable needs to be specified as a factor variable before including it in your model. If you haven’t converted it to a factor variable before, you could also use the wrapper function as.factor() within the equation.
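A sketch:

```r
multiple_regression_bin <- lm(sales ~ adspend + airplay + starpower + as.factor(country),
                              data = regression)
summary(multiple_regression_bin)
```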

You can see that we now have an additional coefficient in the regression output, which tells us the effect of the binary predictor. The dummy variable can generally be interpreted as the average difference in the dependent variable between the two groups (similar to a t-test). In this case, the coefficient tells you the difference in sales between international and local artists, and whether this difference is significant. Specifically, it means that international artists on average sell 45.67 units more than local artists, and this difference is significant (i.e., p < 0.05).

6.4.2 More than two categories

Predictors with more than two categories, like our “genre” variable, can also be included in your model. However, in this case a single dummy variable cannot represent all possible values, since there are three genres (i.e., 1 = Rock, 2 = Pop, 3 = Electronic). Thus, we need to create additional dummy variables. For our “genre” variable, we create two dummy variables as follows:

\[\begin{equation} x_5 = \begin{cases} 1 & \quad \text{if } i \text{th product is from Pop genre}\\ 0 & \quad \text{if } i \text{th product is from Rock genre} \end{cases} \tag{6.26} \end{equation}\]

\[\begin{equation} x_6 = \begin{cases} 1 & \quad \text{if } i \text{th product is from Electronic genre}\\ 0 & \quad \text{if } i \text{th product is from Rock genre} \end{cases} \tag{6.27} \end{equation}\]

We would then add these variables as additional predictors in the regression equation and obtain the following model:

\[\begin{align} Sales =\beta_0 &+\beta_1*adspend\\ &+\beta_2*airplay\\ &+\beta_3*starpower\\ &+\beta_4*international\\ &+\beta_5*Pop\\ &+\beta_6*Electronic+\epsilon \end{align}\]

where “Pop” and “Electronic” represent our new dummy variables, and \(\beta_5\) and \(\beta_6\) represent the associated regression coefficients.

The interpretation of the coefficients is as follows: \(\beta_5\) is the difference in average sales between the genres “Rock” and “Pop”, while \(\beta_6\) is the difference in average sales between the genres “Rock” and “Electronic”. Note that the level for which no dummy variable is created is also referred to as the baseline. In our case, “Rock” is the baseline genre. This means that there will always be one fewer dummy variable than the number of levels.

You don’t have to create the dummy variables manually as R will do this automatically when you add the variable to your equation:
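A sketch, keeping the predictors from the previous model:

```r
multiple_regression_gen <- lm(sales ~ adspend + airplay + starpower + as.factor(country) +
                              as.factor(genre), data = regression)
summary(multiple_regression_gen)
```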

How can we interpret the coefficients? Based on our model, products from the “Pop” genre are estimated to sell on average 47.69 units more than products from the “Rock” genre, and products from the “Electronic” genre to sell on average 27.62 units more than products from the “Rock” genre. The p-values of both variables are smaller than 0.05, suggesting that there is statistical evidence for a real difference in sales between the genres.

The choice of the baseline category is arbitrary. As you have seen, R simply selects the first level as the baseline. If you would like to use a different baseline category, you can use the relevel() function and set the reference category using the ref argument. The following would estimate the same model using the second category as the baseline:
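A sketch; here ref = "2" selects the second genre level (Pop) as the baseline:

```r
regression$genre <- relevel(as.factor(regression$genre), ref = "2")
summary(lm(sales ~ adspend + airplay + starpower + as.factor(country) + genre,
           data = regression))
```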

Note that while your choice of the baseline category impacts the coefficients and the significance level, the prediction for each group will be the same regardless of this choice.

6.5 Extensions of the linear model

The standard linear regression model provides results that are easy to interpret and is useful for addressing many real-world problems. However, it makes rather restrictive assumptions that might be violated in many cases. Notably, it assumes that the relationship between the response and the predictor variables is additive and linear. The additive assumption states that the effect of an independent variable on the dependent variable is independent of the values of the other independent variables included in the model. The linear assumption means that the effect of a one-unit change in the independent variable on the dependent variable is the same regardless of the value of the independent variable. This is also referred to as constant marginal returns. For example, an increase in ad-spend from 10€ to 11€ yields the same increase in sales as an increase from 100,000€ to 100,001€. This section presents alternative model specifications for cases in which these assumptions do not hold.

6.5.1 Interaction effects

Regarding the additive assumption, it might be argued that the effects of some variables are not fully independent of the values of other variables. In our example, one could argue that the effect of advertising depends on the type of artist. For example, advertising might be more effective for local artists. We can investigate whether this is the case using a grouped scatterplot:

Figure 6.22: Effect of advertising by group

The scatterplot indeed suggests that there is a difference in advertising effectiveness between local and international artists. You can see this from the two different regression lines. We can incorporate this interaction effect by including an interaction term in the regression equation as follows:

\[\begin{align} Sales =\beta_0 &+\beta_1*adspend\\ &+\beta_2*airplay\\ &+\beta_3*starpower\\ &+\beta_4*international\\ &+\beta_5*(adspend*international)\\ &+\epsilon \end{align}\]

You can see that the effect of advertising now depends on the type of artist; hence, the additive assumption is relaxed. Note that if you decide to include an interaction effect, you should always include the main effects of the variables that are part of the interaction (even if the associated p-values do not suggest significant effects). It is easy to include an interaction effect in your model by adding an additional variable of the form var1:var2. In our example, this could be achieved using the following specification:
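A sketch:

```r
summary(lm(sales ~ adspend + airplay + starpower + as.factor(country) +
           adspend:as.factor(country), data = regression))
```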

How can we interpret the coefficients? The adspend main effect tells you the effect of advertising for the reference group, i.e., the group with factor level zero. In our example, this is the advertising effect for local artists: for local artists, spending an additional 1,000 Euros on advertising will result in approximately 89 additional unit sales. The interaction effect tells you by how much the effect differs for the other group (i.e., international artists) and whether this difference is significant. In our example, the effect for international artists can be computed as 0.0885 - 0.0347 = 0.0538. This means that for international artists, spending an additional 1,000 Euros on advertising will result in approximately 54 additional unit sales. Since the interaction effect is significant (p < 0.05), we can conclude that advertising is less effective for international artists.

The above example showed the interaction between a categorical variable (i.e., “country”) and a continuous variable (i.e., “adspend”). However, interaction effects can be defined for different combinations of variable types. For example, you might just as well specify an interaction between two continuous variables. In our example, you might suspect that there are synergy effects between advertising expenditures and radio airplay. It could be that advertising is more effective when an artist receives a large number of radio plays. In this case, we would specify our model as:

\[\begin{align} Sales =\beta_0 &+\beta_1*adspend\\ &+\beta_2*airplay\\ &+\beta_3*starpower\\ &+\beta_4*(adspend*airplay)\\ &+\epsilon \end{align}\]

In this case, we can interpret \(\beta_4\) as the increase in the effectiveness of advertising for a one unit increase in radio airplay (or vice versa). This can be translated to R using:
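A sketch:

```r
summary(lm(sales ~ adspend + airplay + starpower + adspend:airplay, data = regression))
```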

However, since the p-value of the interaction is larger than 0.05, there is little statistical evidence for an interaction between the two variables.

6.5.2 Non-linear relationships

In our example above, it appeared that linear relationships could be reasonably assumed. In many practical applications, however, this might not be the case. Recall the implications of a linear specification: constant marginal returns and elasticities that increase with X.

In many marketing contexts, these might not be reasonable assumptions. Consider the case of advertising. It is unlikely that the return on advertising is independent of the level of advertising expenditures. It is rather likely that saturation occurs at some level, meaning that the return from an additional Euro spent on advertising decreases with the level of advertising expenditures (i.e., decreasing marginal returns). In other words, at some point the advertising campaign has achieved a certain level of penetration, and an additional Euro spent on advertising won’t yield the same return as in the beginning.

Let’s use an example data set, containing the advertising expenditures of a company and the sales (in thousand units).

Now we inspect whether a linear specification is appropriate by looking at the scatterplot:

Figure 6.23: Non-linear relationship

It appears that a linear model might not represent the data well. Rather, the effect of an additional Euro spent on advertising appears to decrease with increasing levels of advertising expenditures; that is, we have decreasing marginal returns. We can put this to a test and estimate a linear model:
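A sketch, assuming the data frame is named non_linear with variables advertising and sales (placeholder names):

```r
linear_model <- lm(sales ~ advertising, data = non_linear)
summary(linear_model)
```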

Advertising appears to be positively related to sales, with an additional Euro spent on advertising resulting in 0.0005 additional sales. The \(R^2\) statistic suggests that approximately 51% of the total variation can be explained by the model.

To test if the linear specification is appropriate, let’s inspect some of the plots that are generated by R. We start by inspecting the residuals plot.

Figure 6.24: Residuals vs. Fitted

The plot suggests that the assumption of homoscedasticity is violated (i.e., the spread of values on the y-axis is different for different levels of the fitted values). In addition, the red line deviates from the dashed grey line, suggesting that the relationship might not be linear. Finally, the Q-Q plot of the residuals suggests that the residuals are not normally distributed.

Figure 6.25: Q-Q plot

To sum up, a linear specification might not be the best model for this data set.

In this case, a multiplicative model might be a better representation of the data. The multiplicative model has the following formal representation:

\[\begin{equation} Y =\beta_0 *X_1^{\beta_1}*X_2^{\beta_2}*\dots*X_J^{\beta_J}*\epsilon \tag{6.28} \end{equation}\]

This functional form can be linearized by taking the logarithm of both sides of the equation:

\[\begin{equation} log(Y) =log(\beta_0) + \beta_1*log(X_1) + \beta_2*log(X_2) + \dots + \beta_J*log(X_J) + log(\epsilon) \tag{6.29} \end{equation}\]

This means that taking logarithms of both sides of the equation makes linear estimation possible. Let’s see how the scatterplot looks if we use the logarithms of our variables, obtained with the log() function, instead of the original values.

Linearized effect

Figure 6.26: Linearized effect

It appears that now, with the log-transformed variables, a linear specification is a much better representation of the data. Hence, we can log-transform our variables and estimate the following equation:

\[\begin{equation} log(sales) = log(\beta_0) + \beta_1*log(advertising) + log(\epsilon) \tag{6.25} \end{equation}\]

This can be easily implemented in R by transforming the variables using the log() function:
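A sketch of this estimation, again assuming the data frame non_linear_reg:

# log-log specification: both sides are log-transformed inside the formula
log_reg <- lm(log(sales) ~ log(advertising), data = non_linear_reg)
summary(log_reg)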

Note that this specification implies decreasing marginal returns (i.e., the returns of advertising are decreasing with the level of advertising), which appear to be more consistent with the data. The specification is also consistent with proportional changes in advertising being associated with proportional changes in sales (i.e., advertising does not become more effective with increasing levels). This has important implications for the interpretation of the coefficients. In our example, you would interpret the coefficient as follows: a 1% increase in advertising leads to a 0.3% increase in sales. Hence, the interpretation is in proportional terms and no longer in units. This means that the coefficients in a log-log model can be directly interpreted as elasticities, which also makes communication easier. We can also inspect the \(R^2\) statistic to see that the model fit has increased compared to the linear specification (i.e., \(R^2\) has increased to 0.681 from 0.509). However, please note that the variables are now measured on a different scale, which means that the model fit is in theory not directly comparable. Also, we could use the residuals plot to confirm that the revised specification is more appropriate:

Residuals plot

Figure 6.27: Residuals plot

Q-Q plot

Figure 6.28: Q-Q plot

Finally, we can plot the predicted values against the observed values to see that the results from the log-log model (red) provide a better prediction than the results from the linear model (blue).

Comparison of model fit

Figure 6.29: Comparison of model fit

Another way of modelling non-linearities is including a squared term if there are decreasing or increasing effects. In fact, we can model non-constant slopes as long as the form is a linear combination of powers (i.e., squared, cubed, …) of the explanatory variables. Usually we do not expect many turning points, so squared or third-power terms suffice. Note that a polynomial of degree n can accommodate up to n − 1 turning points.

When using squared terms we can model diminishing and eventually negative returns. Think about advertising spending. If a brand is not well known, spending on ads will increase brand awareness and have a large effect on sales. In a regression model this translates to a steep slope for spending at the origin (i.e., for lower spending). However, as more and more people already know the brand, we expect that an additional Euro spent on advertising will have less and less of an effect the more the company spends. We say that the returns are diminishing. Eventually, if the company keeps putting more and more ads out, people get annoyed and some will stop buying from it. In that case the return might even turn negative. To model such a situation we need a linear as well as a squared term in the regression.

lm(...) can take squared (or any power) terms as input by adding I(X^2) as an explanatory variable. In the example below we see a clear quadratic relationship with a maximum at around 70. If we try to model this using the level of the covariate without the quadratic term, we do not get a very good fit.
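Since the original data are not shown here, the sketch below simulates data that peak around 70 and fits the misspecified linear model (all numbers are made up for illustration):

# Simulated quadratic relationship: predicted sales peak at advertising = 70
set.seed(123)
advertising <- 1:200
sales <- 300 + 14 * advertising - 0.1 * advertising^2 + rnorm(200, sd = 80)
ad_data <- data.frame(advertising, sales)

# Misspecified model: level of the covariate only, no quadratic term
lin_fit <- lm(sales ~ advertising, data = ad_data)
summary(lin_fit)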

(Figure: observed sales against advertising spend, with the misspecified linear model’s straight prediction line)

The graph above clearly shows that advertising spending of between 0 and 50 increases sales. However, the marginal increase (i.e., the slope of the data curve) is decreasing. Around 70 the curve reaches its maximum, and after that point additional ad spending actually decreases sales (e.g., people get annoyed). Notice that the prediction line is straight; that is, it implies that the marginal increase of sales due to additional spending on advertising is the same for any amount of spending. This shows the danger of basing business decisions on wrongly specified models. But even in the region in which the sign of the prediction is correct, we are quite far off. Let’s take a look at the top 5 sales values and the corresponding predictions:
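Continuing the simulated sketch from above, the comparison could look like this:

# Five highest observed sales values next to the linear model's (poor) predictions
top5 <- order(ad_data$sales, decreasing = TRUE)[1:5]
data.frame(observed = ad_data$sales[top5],
           predicted = predict(lin_fit)[top5])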

By including a quadratic term we can fit the data very well. This is still a linear model since the outcome variable is still explained by a linear combination of regressors even though one of the regressors is now just a non-linear function of the same variable (i.e. the squared value).
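In the simulated sketch, the quadratic term is added via I():

# Linear model with a squared term; I() protects the arithmetic inside the formula
quad_fit <- lm(sales ~ advertising + I(advertising^2), data = ad_data)
summary(quad_fit)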

(Figure: observed sales against advertising spend, with the quadratic model’s fitted curve)

Now the prediction of the model is very close to the actual data and we could base our production decisions on that model.

When interpreting the coefficients of the predictor in this model we have to be careful. Since we included the squared term, the slope is now different at each level of advertising (this can be seen in the graph above). That is, we no longer have a single coefficient to interpret as the slope. This can easily be shown by calculating the derivative of the model with respect to advertising.

\[ \text{Sales} = \alpha + \beta_1 \text{ Advertising} + \beta_2 \text{ Advertising}^2 + \varepsilon\\ {\delta \text{ Sales} \over \delta \text{ Advertising}} = \beta_1 + 2 \beta_2 \text{ Advertising} \equiv \text{Slope} \]

Intuitively, this means that the change of sales due to an additional Euro spent on advertising depends on the current level of advertising. \(\alpha\), the intercept, can still be interpreted as the expected value of sales given that we do not advertise at all (set advertising to 0 in the model). The sign of the squared term (\(\beta_2\)) can be used to determine the curvature of the function. If the sign is positive, the function is convex (curvature is upwards); if it is negative, it is concave (curvature is downwards). We can interpret \(\beta_1\) and \(\beta_2\) separately in terms of their influence on the slope. By setting advertising to \(0\) we observe that \(\beta_1\) is the slope at the origin. By taking the derivative of the slope with respect to advertising we see that the change of the slope due to additional spending on advertising is two times \(\beta_2\).

\[ {\delta Slope \over \delta Advertising} = 2\beta_2 \]

At the maximum predicted value the slope is close to \(0\) (theoretically it is equal to \(0\), but this would require decimals and we can only sell whole pieces). Above we only calculated the prediction for the observed data, so let’s first predict sales for all possible values between \(1\) and \(200\) to get the optimal advertising level according to our model.
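A sketch of that grid prediction, continuing the simulated example:

# Predict over the full range 1 to 200 and locate the spend with the highest predicted sales
grid <- data.frame(advertising = 1:200)
grid$predicted <- predict(quad_fit, newdata = grid)
grid$advertising[which.max(grid$predicted)]  # should land near 70 in this simulation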

For all other levels of advertising, we insert the spending level into the slope formula above to obtain the slope at that point.

6.6 Logistic regression

The following video summarizes how to visualize log-transformed regressions in R

6.6.1 Motivation and intuition

In the last section we saw how to predict continuous outcomes (sales, height, etc.) via linear regression models. Another interesting case is that of binary outcomes, i.e. when the variable we want to model can only take two values (yes or no, group 1 or group 2, dead or alive, etc.). To this end we would like to estimate how our predictor variables change the probability of a value being 0 or 1. In this case we can technically still use a linear model (e.g. OLS). However, its predictions will most likely not be particularly useful. A more useful method is the logistic regression. In particular we are going to have a look at the logit model. In the following dataset we are trying to predict whether a song will be a top-10 hit on a popular music streaming platform. In a first step we are going to use only the danceability index as a predictor. Later we are going to add more independent variables.

Below are two attempts to model the data. The left assumes a linear probability model (calculated with the same methods that we used in the last chapter), while the right model is a logistic regression model . As you can see, the linear probability model produces probabilities that are above 1 and below 0, which are not valid probabilities, while the logistic model stays between 0 and 1. Notice that songs with a higher danceability index (on the right of the x-axis) seem to cluster more at \(1\) and those with a lower more at \(0\) so we expect a positive influence of danceability on the probability of a song to become a top-10 hit.

The same binary data explained by two models; A linear probability model (on the left) and a logistic regression model (on the right)

Figure 6.30: The same binary data explained by two models; A linear probability model (on the left) and a logistic regression model (on the right)

A key insight at this point is that the connection between \(\mathbf{X}\) and \(Y\) is non-linear in the logistic regression model. As we can see in the plot, the probability of success is most strongly affected by danceability around values of \(0.5\) , while higher and lower values have a smaller marginal effect. This obviously also has consequences for the interpretation of the coefficients later on.

6.6.2 Technical details of the model

As the name suggests, the logistic function is an important component of the logistic regression model. It has the following form:

\[ f(\mathbf{X}) = \frac{1}{1 + e^{-\mathbf{X}}} \]

This function transforms all real numbers into the range between 0 and 1. We need this to model probabilities, as probabilities can only be between 0 and 1.

(Figure: the logistic function)

The logistic function on its own is not very useful yet, as we want to be able to determine how predictors influence the probability of a value to be equal to 1. To this end we replace the \(\mathbf{X}\) in the function above with our familiar linear specification, i.e.

\[ \mathbf{X} = \beta_0 + \beta_1 * x_{1,i} + \beta_2 * x_{2,i} + ... +\beta_m * x_{m,i}\\ f(\mathbf{X}) = P(y_i = 1) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 * x_{1,i} + \beta_2 * x_{2,i} + ... +\beta_m * x_{m,i})}} \]

In our case we only have \(\beta_0\) and \(\beta_1\) , the coefficient associated with danceability.

In general we now have a mathematical relationship between our predictor variables \((x_1, ..., x_m)\) and the probability of \(y_i\) being equal to one. The last step is to estimate the parameters of this model \((\beta_0, \beta_1, ..., \beta_m)\) to determine the magnitude of the effects.

6.6.3 Estimation in R

We are now going to show how to perform logistic regression in R. Instead of lm() we now use glm(Y~X, family=binomial(link = 'logit')) to use the logit model. We can still use the summary() command to inspect the output of the model.
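A minimal sketch, assuming a hypothetical data frame music_data with a 0/1 outcome top10 and the predictor danceability (names assumed for illustration):

logit_model <- glm(top10 ~ danceability, family = binomial(link = "logit"), data = music_data)
summary(logit_model)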

Notably, this output does not include an \(R^2\) value to assess model fit. Multiple “pseudo \(R^2\)s”, similar to the one used in OLS, have been developed. There are packages that return the \(R^2\) given a logit model (see rcompanion or pscl). The calculation by hand is also fairly simple. We define the function logisticPseudoR2s() that takes a logit model as an input and returns three popular pseudo \(R^2\) values.
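One common way to write such a function, sketched here from the model deviances (McFadden, Cox & Snell, and Nagelkerke variants):

logisticPseudoR2s <- function(LogModel) {
  dev <- LogModel$deviance            # residual deviance of the fitted model
  nullDev <- LogModel$null.deviance   # deviance of the intercept-only model
  modelN <- length(LogModel$fitted.values)
  R.mcf <- 1 - dev / nullDev                      # McFadden's R^2
  R.cs  <- 1 - exp(-(nullDev - dev) / modelN)     # Cox & Snell R^2
  R.n   <- R.cs / (1 - exp(-(nullDev / modelN)))  # Nagelkerke R^2
  cat("McFadden's R^2:  ", round(R.mcf, 3), "\n")
  cat("Cox & Snell R^2: ", round(R.cs, 3), "\n")
  cat("Nagelkerke R^2:  ", round(R.n, 3), "\n")
}

logisticPseudoR2s(logit_model)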

The coefficients of the model give the change in the log odds of the dependent variable due to a unit change in the regressor. This makes the exact interpretation of the coefficients difficult, but we can still interpret the signs and the p-values which will tell us if a variable has a significant positive or negative impact on the probability of the dependent variable being \(1\) . In order to get the odds ratios we can simply take the exponent of the coefficients.
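In R, this is a one-liner (using the logit_model object from the assumed sketch above):

exp(coef(logit_model))  # odds ratios: multiplicative change in the odds per unit change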

Notice that the coefficient is extremely large. That is (partly) due to the fact that the danceability variable is constrained to values between \(0\) and \(1\) and the coefficients are for a unit change. We can make the “unit-change” interpretation more meaningful by multiplying the danceability index by \(100\) . This linear transformation does not affect the model fit or the p-values.
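A sketch of that rescaling, using the assumed music_data frame:

# Rescale so that one unit equals one point on a 0-100 danceability scale
music_data$danceability_100 <- music_data$danceability * 100
logit_model <- glm(top10 ~ danceability_100, family = binomial(link = "logit"), data = music_data)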

We observe that danceability positively affects the likelihood of becoming a top-10 hit. To get the confidence intervals for the coefficients we can use the same function as with OLS:
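For instance (profile-likelihood intervals, using the assumed logit_model):

confint(logit_model)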

In order to get a rough idea about the magnitude of the effects we can calculate the partial effects at the mean of the data (that is the effect for the average observation). Alternatively, we can calculate the mean of the effects (that is the average of the individual effects). Both can be done with the logitmfx(...) function from the mfx package. If we set logitmfx(logit_model, data = my_data, atmean = FALSE) we calculate the latter. Setting atmean = TRUE will calculate the former. However, in general we are most interested in the sign and significance of the coefficient.
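A sketch of the call, assuming the mfx package is installed and using the rescaled variable from above:

library(mfx)

# Average partial effects (atmean = FALSE averages the individual marginal effects)
logitmfx(top10 ~ danceability_100, data = music_data, atmean = FALSE)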

This now gives the average partial effects in percentage points. An additional point on the danceability scale (from \(1\) to \(100\)), on average, makes it 1.57 percentage points more likely for a song to become a top-10 hit.

To get the effect of an additional point at a specific value, we can calculate the odds ratio by predicting the probability at a value and at that value \(+1\). For example, if we are interested in how much more likely a song with a danceability of 51 (compared to 50) is to become a hit, we can simply calculate the following:
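A sketch of that comparison, using the rescaled model from above:

prob_50 <- predict(logit_model, newdata = data.frame(danceability_100 = 50), type = "response")
prob_51 <- predict(logit_model, newdata = data.frame(danceability_100 = 51), type = "response")

# Convert the probabilities to odds and compare
odds_50 <- prob_50 / (1 - prob_50)
odds_51 <- prob_51 / (1 - prob_51)
odds_51 / odds_50  # odds ratio for moving from 50 to 51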

So the odds are 20% higher at 51 than at 50.

6.6.3.1 Logistic model with multiple predictors

Of course we can also use multiple predictors in logistic regression, as shown in the formula above. We might want to add Spotify followers (in millions) and weeks since the release of the song.

Again, the familiar formula interface can be used with the glm() function. All the model summaries shown above still work with multiple predictors.
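A sketch with assumed column names spotify_followers (in millions) and weeks_since_release:

multiple_logit <- glm(top10 ~ danceability_100 + spotify_followers + weeks_since_release,
                      family = binomial(link = "logit"), data = music_data)
summary(multiple_logit)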

6.6.3.2 Model selection

The question remains whether a variable should be added to the model. We will present two methods for model selection for logistic regression. The first is based on the Akaike Information Criterion (AIC). It is reported with the summary output for logit models. The value of the AIC is relative, meaning that it has no interpretation by itself. However, it can be used to compare and select models. The model with the lowest AIC value is the one that should be chosen. Note that the AIC does not indicate how well the model fits the data, but is merely used to compare models.

For example, consider the following model, where we exclude the followers covariate. Seeing as it was able to contribute significantly to the explanatory power of the model, the AIC increases, indicating that the model including followers is better suited to explain the data. We always want the lowest possible AIC.
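A sketch of that comparison, continuing the assumed example:

reduced_logit <- glm(top10 ~ danceability_100 + weeks_since_release,
                     family = binomial(link = "logit"), data = music_data)

AIC(multiple_logit)
AIC(reduced_logit)  # a higher value here favors the model that keeps the followers covariate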

As a second measure for variable selection, you can use the pseudo \(R^2\)s as shown above. The fit is distinctly worse according to all three values presented here when excluding the Spotify followers.

6.6.3.3 Predictions

We can predict the probability given an observation using the predict(my_logit, newdata = ..., type = "response") function. Replace ... with the observed values for which you would like to predict the outcome variable.
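For the prediction described below, a sketch could look like this (with my_logit standing for the fitted model; here we reuse the multiple-predictor sketch and its assumed column names):

predict(multiple_logit,
        newdata = data.frame(danceability_100 = 50, spotify_followers = 10, weeks_since_release = 1),
        type = "response")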

The prediction indicates that a song with a danceability of \(50\) from an artist with \(10\) million Spotify followers has a 66% chance of being in the top 10 one week after its release.

6.6.3.4 Perfect Prediction Logit

Perfect prediction occurs whenever a linear function of \(X\) can perfectly separate the \(1\)s from the \(0\)s in the dependent variable. This is problematic when estimating a logit model as it will result in biased estimators (also check the p-values in the example!). R will return the following message if this occurs:

glm.fit: fitted probabilities numerically 0 or 1 occurred

Given this warning, one should not use the output of the glm(...) function for the analysis. There are various ways to deal with this problem, one of which is to use Firth’s bias-reduced penalized-likelihood logistic regression with the logistf(Y~X) function from the logistf package.

6.6.3.4.1 Example

In this example data \(Y = 0\) if \(x_1 <0\) and \(Y=1\) if \(x_1>0\) and we thus have perfect prediction. As we can see the output of the regular logit model is not interpretable. The standard errors are huge compared to the coefficients and thus the p-values are \(1\) despite \(x_1\) being a predictor of \(Y\) . Thus, we turn to the penalized-likelihood version. This model correctly indicates that \(x_1\) is in fact a predictor for \(Y\) as the coefficient is significant.
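A self-contained sketch of this setup (simulated data; the logistf package is assumed to be installed):

# Simulate perfect separation: Y is 0 whenever x1 < 0 and 1 whenever x1 > 0
set.seed(42)
x1 <- rnorm(100)
sep_data <- data.frame(x1 = x1, Y = as.numeric(x1 > 0))

# Regular logit: triggers the warning, with huge standard errors and p-values near 1
summary(glm(Y ~ x1, family = binomial(link = "logit"), data = sep_data))

# Firth's bias-reduced penalized-likelihood logistic regression
library(logistf)
firth_model <- logistf(Y ~ x1, data = sep_data)
firth_model  # the coefficient for x1 is now significant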

Regression Analysis

  • First Online: 20 July 2018


Marko Sarstedt & Erik Mooi

Part of the book series: Springer Texts in Business and Economics ((STBE))


We first provide comprehensive, but simple, access to essential regression knowledge by discussing how regression analysis works, the requirements and assumptions on which it relies, and how you can specify a regression analysis model that allows you to make critical decisions for your business, clients, or project. Each step involved in regression analysis is linked to its execution in SPSS. We show how to use a range of SPSS’s easy-to-learn statistical procedures that underlie regression analysis, which will allow you to analyze, chart, and validate regression analysis results and to assess your analysis’s robustness. Interpretation of SPSS output can be difficult, but we make this easier by means of an annotated case study. We conclude with suggestions for further readings on the use, application, and interpretation of regression analysis.


Strictly speaking, the difference between the predicted and the observed y -values is \(\hat e.\)

This only applies to the standardized βs.

This is only a requirement if you are interested in the regression coefficients, which is the dominant use of regression. If you are only interested in prediction, collinearity is not important.

A related measure is the tolerance, which is 1/VIF and calculated as 1/(1 − R²).

The VIF is calculated using a completely separate regression analysis. In this regression analysis, the variable for which the VIF is calculated is regarded as a dependent variable and all other independent variables are regarded as independents. The R² that this model provides is deducted from 1, and the reciprocal value of this difference (i.e., 1/(1 − R²)) is the VIF. The VIF is therefore an indication of how much the regression model explains one independent variable. If the other variables explain much of the variance (the VIF is larger than 10), collinearity is likely a problem.

This term can be calculated manually, but also by using the function mmult in Microsoft Excel, where \(x^Tx\) is calculated. Once this matrix has been calculated, you can use the minverse function to arrive at \((x^Tx)^{-1}\).

The test also includes the predicted values squared and to the power of three.

This hypothesis can also be read as that a model with only an intercept is sufficient.

The AIC is specifically calculated as \(AIC = n\left[\log\left(\frac{SS_E}{n}\right) + \frac{2k}{n}\right]\), where \(SS_E\) is the error sum of squares, n is the number of observations and k the number of independent variables, while the BIC is calculated as \(BIC = n\left[\log\left(\frac{SS_E}{n}\right) + \frac{k \cdot \log(n)}{n}\right]\). Note that these formulations only hold in case of normally distributed residuals with constant variance (Burnham and Anderson 2013).

Cohen’s ( 1994 ) classical article “The Earth is Round (p < .05)” offers an interesting discussion on significance and effect sizes.

It is possible to compare regression coefficients statistically, avoiding the subjectivity of “similar.” Strictly speaking, the test for comparing coefficients is z-distributed with \(z = \frac{b_1 - b_2}{\sqrt{SE_1^2 + SE_2^2}}\) (see Paternoster et al. 1998).

Note that this only works, as shown in the lower left of Fig. 7.7 , if the “Python essentials” are installed.

Note that it is better to calculate if the R 2 increase is significant (as for Ramsey’s RESET test) but this needs to be done manually and falls outside of the scope of this book.

Note that a p -value is never exactly zero, but has values different from zero in later decimal places.

We would like to thank Dr. D.I. Gilliland and AgriPro for making the data and case study available.

Aiken, L. S., & West, S. G. (1991). Multiple regression: testing and interpreting interactions . Thousand Oaks, CA: Sage.


Baum, C. F. (2006). An introduction to modern econometrics using Stata . College Station, TX: Stata Press.

Burnham, K. P., & Anderson, D. R. (2013). Model selection and multimodel inference: A practical information-theoretic approach (2nd ed.). New York, NY: Springer.

Cohen, J. (1994). The earth is round (p < .05). The American Psychologist, 49(12), 997–1003.


Cook, R. D., & Weisberg, S. (1983). Diagnostics for heteroscedasticity in regression. Biometrika , 70 (1), 1–10.

Durbin, J., & Watson, G. S. (1951). Testing for serial correlation in least squares regression, II. Biometrika , 38 (1–2), 159–179.

Fabozzi, F. J., Focardi, S. M., Rachev, S. T., & Arshanapalli, B. G. (2014). The basics of financial econometrics: tools, concepts, and asset management applications . Hoboken, NJ: John Wiley & Sons.


Field, A. (2013). Discovering statistics using SPSS (4th ed.). London: Sage.

Green, S. B. (1991). How many subjects does it take to do a regression analysis? Multivariate Behavioral Research , 26 (3), 499–510.

Greene, W. H. (2011). Econometric analysis (7th ed.). Upper Saddle River, NJ: Prentice Hall.

Hair Jr., J. F., Black, W. C., Babin, B. J., & Anderson, R. E. (2019). Multivariate data analysis (8th ed.). Boston, MA: Cengage.

Hill, C., Griffiths, W., & Lim, G. C. (2011). Principles of econometrics (4th ed.). Hoboken, NJ: John Wiley & Sons.

Kelley, K., & Maxwell, S. E. (2003). Sample size for multiple regression: Obtaining regression coefficients that are accurate, not simply significant. Psychological Methods , 8 (3), 305–321.

Mason, C. H., & Perreault Jr., W. D. (1991), Collinearity, power, and interpretation of multiple regression analysis. Journal of Marketing Research , 28 (3), 268–280.

Mooi, E. A., & Frambach, R. T. (2009). A stakeholder perspective on buyer–supplier conflict. Journal of Marketing Channels , 16 (4), 291–307.

O’brien, R. M. (2007). A caution regarding rules of thumb for variance inflation factors. Quality & Quantity , 41 (5), 673–690.

Paternoster, R., Brame, R., Mazerolle, P., & Piquero, A. (1998). Using the correct statistical test for the equality of regression coefficients. Criminology , 36 (4), 859–866.

Ramsey, J. B. (1969). Test for specification errors in classical linear least-squares regression analysis. Journal of the Royal Statistical Society, Series B , 31 (2), 350–371.

Treiman, D. J. (2014). Quantitative data analysis: Doing social research to test ideas . Hoboken, NJ: John Wiley & Sons.

VanVoorhis, C. R. W., & Morgan, B. L. (2007). Understanding power and rules of thumb for determining sample sizes. Tutorials in Quantitative Methods for Psychology , 3 (2), 43–50.

Further Reading

Echambadi, R., & Hess, J. D. (2007). Mean-centering does not alleviate collinearity problems in moderated multiple regression models. Marketing Science , 26 (3), 438–445.

Iacobucci, D. (2008). Mediation analysis: Quantitative applications in the social sciences . Thousand Oaks, CA: Sage.

Shmueli, G. (2010). To explain or to predict? Statistical Science , 25 (3), 289–310.

Spiller, S. A., Fitzsimons, G. J., Lynch Jr., J. G., & McClelland, G. H. (2013). Spotlights, floodlights, and the magic number zero: Simple effects tests in moderated regression. Journal of Marketing Research , 50 (2), 277–288.

Zhao, X., Lynch, J. G., & Chen, Q. (2010). Reconsidering Baron and Kenny: Myths and truths about mediation analysis. Journal of Consumer Research , 37 (2), 197–206.



About this chapter

Sarstedt, M., Mooi, E. (2019). Regression Analysis. In: A Concise Guide to Market Research. Springer Texts in Business and Economics. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-662-56707-4_7


The MSR Group

Regression Analysis in Market Research

by Richard Nehrboss SR | Mar 14, 2023 | Customer Experience Management , Financial Services , Research Methodology

What is Regression Analysis & How Is It Used?


Regression analysis helps organizations make sense of priority areas and what factors have the most impact and influence on their customer relationships. It allows researchers and brands to read between the lines of the survey data. This article will help you understand the definition of regression analysis, how it is commonly used, and the benefits of using regression research.

Regression Analysis: Definition

Regression analysis is a common statistical method that helps organizations understand the relationship between independent variables and dependent variables.

  • Dependent variable: The main factor you want to measure or understand.
  • Independent variables: The secondary factors you believe to have an influence on your dependent variable.

More specifically, regression analysis tells you which factors are most important, which to disregard, and how these factors influence each other.

Importance of Regression Analysis

There are several benefits of regression analysis, most of which center around using it to achieve data-driven decision-making.

The advantages of using regression analysis in research include:

  • Great tool for forecasting: While there is no such thing as a magic crystal ball, regression research is a great approach to measuring predictive analytics and forecasting.
  • Focus attention on priority areas of improvement: Regression statistical analysis helps businesses and organizations prioritize efforts to improve customer satisfaction metrics such as net promoter score, customer effort score, and customer loyalty. Using regression analysis in quantitative research provides the opportunity to take corrective actions on the items that will most positively improve overall satisfaction.

When to Use Regression Analysis

A common use of regression analysis is understanding how the likelihood to recommend a product or service (dependent variable) is impacted by changes in wait time, price, and quantity purchased (presumably independent variables). A popular way to measure this is with net promoter score (NPS) as it is one of the most commonly used metrics in market research.

Net promoter score formula

The score is very telling: it helps your business understand how many raving fans your brand has in comparison to your key competitors and industry benchmarks. While our online survey company always recommends using an open-ended question after NPS to gather context on the driving forces behind the score, sometimes that alone does not tell the whole story.

Regression Analysis Example in Business

Keeping with the bank survey example, let’s say in the same survey you ask a series of customer satisfaction questions related to respondents’ experience with the bank. You believe the interest rates and customer service are good at your bank, but you think there might be some underlying drivers really pushing your high NPS. In this example, likelihood to recommend, or NPS, is your dependent variable A. Your more specific follow-up satisfaction questions are independent variables B, C, D, E, F, and G.

Through your regression analysis, you find out that independent variable C (friendliness of the staff) has the most significant effect on NPS. This means that how customers rate the friendliness of staff members will have the largest overall impact on how likely they are to recommend your bank. This is quite different from what customers said in the open-ended comments about interest rates and customer service. However, as the regression analysis shows, staff friendliness is essential.

Regression analysis is another tool market research firms use on a daily basis with their clients to help brands understand survey data from customers. The benefit of using a third-party market research firm is that you can leverage its expertise to tell you the “so what” of your customer survey data.

At The MSR Group, we use regression analysis to help our clients understand the relationship between independent and dependent variables. We have worked with banks to understand the impact that key market index scores had on sales projections, and whether increasing prices would have any impact on repeat customer purchases. We also help our clients prioritize efforts to improve customer satisfaction metrics such as net promoter score, customer effort score, and customer loyalty.

Regression analysis is a powerful tool that can help executives and management make data-driven decisions: it focuses attention on priority areas of improvement and supports predictive analytics and forecasting of how revenue might develop in future quarters. If you are interested in using regression analysis to help your business make data-driven decisions, contact The MSR Group by filling out an online contact form or emailing [email protected].



How to Use Regression Analysis to Forecast Sales: A Step-by-Step Guide

Flori Needle

Updated: August 27, 2021

Published: December 21, 2020

There are various ways to understand the effectiveness of your sales team activity or how well sales teams are at driving sales to reach operational and financial goals. Sales forecasting , a method that predicts sales performance based on historical performance, is one way to get this understanding.


Sales forecasting is important because it can help you identify what is going right, as well as what areas of your current strategy need to be adapted and changed to ensure future success.

For example, if your team is consistently below quotas, sales forecasting can help determine where and why these issues are happening. Forecasting can also help you decide on future business endeavors, like when you’d have the revenue to invest in new products or expand your business.

Some forecasting methods involve doing basic math, like adding up month to month sales, and others are more in-depth. Regression analysis is one of these methods, and it requires in-depth statistical analysis.

If you’re anything like me and not at all mathematically inclined, conducting this type of forecast may seem daunting. Thankfully, this piece will give an easy-to-understand breakdown of regression analysis in sales and guide you through an easy-to-follow example using Google Sheets.

What is regression analysis?

In statistics, regression analysis is a mathematical method used to understand the relationship between a dependent variable and an independent variable. Results of this analysis demonstrate the strength of the relationship between the two variables and if the dependent variable is significantly impacted by the independent variable.

There are multiple different types of regression analysis, but the most basic and common form is simple linear regression that uses the following equation: Y = bX + a

That type of explanation isn’t really helpful, though, if you don’t already have a grasp of mathematical processes, which I certainly don’t. Let’s take a look at what regression analysis means, in layman’s terms, for sales forecasting.

What is regression analysis in sales?

In simple terms, sales regression analysis is used to understand how certain factors in your sales process affect sales performance and predict how sales would change over time if you continued the same strategy or pivoted to different methods.

Independent and dependent variables are still at play here, but the dependent variable is always the same: sales performance. Whether it’s total revenue or number of deals closed, your dependent variable will always be sales performance. The independent variable is the factor you are examining that will change sales performance, like the number of salespeople you have or how much money is spent on advertising.

Sales regression forecasting results help businesses understand how their sales teams are or are not succeeding and what the future could look like based on past sales performance. The results can also be used to predict future sales based on changes that haven’t yet been made, like if hiring more salespeople would increase business revenue.

So, what do these words mean, math wise? Like I said before, I’m not good at math. But, I did conduct a simple sales regression analysis that is easy to follow and didn’t require many calculations on my part. Let’s go over this example below.

How To Use Regression Analysis To Forecast Sales

Let’s say that you want to run a sales forecast to understand if having your salespeople make more sales calls will mean that they close more deals. To conduct this forecast, you need historical data that depicts the number of sales calls made over a certain period. So, mathematically, the number of sales calls is the independent variable, or X value, and the dependent variable is the number of deals closed per month, or Y value.

I made up the data set below to represent monthly sales calls, and a corresponding number of deals closed over a two year period.

sample data set for regression sales forecast

So, the overall regression equation is Y = bX + a , where:

  • X is the independent variable (number of sales calls)
  • Y is the dependent variable (number of deals closed)
  • b is the slope of the line
  • a is the point of interception, or what Y equals when X is zero

Since we’re using Google Sheets, its built-in functions will do the math for us and we don’t need to try and calculate the values of these variables. We simply need to use the historical data table and select the correct graph to represent our data. The first step of the process is to highlight the numbers in the X and Y column and navigate to the toolbar, select Insert, and click Chart from the dropdown menu.

demo showing how to create a chart for sales regression forecasting

The default graph that appears isn’t what we need, so I clicked on the Chart editor tool and selected Scatter plot, as shown in the gif below.

After selecting the scatter plot, I clicked Customize, Series, and scrolled down to select the Trendline box (shown below).

After all of these customizations, I get the following scatter plot.

sales regression scatter plot example

The Sheets tool did the math for me, but the line in the chart is the b variable from the regression equation, or slope, that creates the line of best fit . The blue dots are the y values, or the number of deals closed based on the number of sales calls.


So, the scatter plot answers my overall question of whether having salespeople make more sales calls will close more deals. The answer is yes, and I know this because the line of best fit trendline is moving upwards, which indicates a positive relationship. Even though one month can have 20 sales calls and 10 deals and the next has 10 calls and 40 deals, the statistical analysis of the historical data in the table assumes that, on average, more sales calls means more deals closed.

I’m fine with this data. It means that simply having salespeople make more calls per month than they did before will increase deal count. However, this scatter plot does not give us the specific forecast numbers that you’ll need to understand your future sales performance. Let’s use the same example to obtain that information.

Let’s say your boss tells you that they want to generate more quarterly revenue, which is directly related to sales activity. You can assume closing more deals means generating more revenue, but you still want the data to prove that having your salespeople make more calls would actually close more deals.

The built-in FORECAST.LINEAR equation in Sheets will help you understand this, based on the historical data in the first table .

I made the table below within the same sheet to create my forecast breakdown. In my Sheets document, this new table uses the same columns as the first (A, B, and C) and begins in row 26.

I went with 50 because the highest number of sales calls made in any given month in the original data table is 40, and we want to know what happens to deal totals if that number actually increases. I could have used only 50, but I increased the number by 10 each month to get an accurate forecast that is based on statistics, not a one-off occurrence.

sample data for regression sales forecasting

After creating this chart, I followed this path within the Insert dropdown menu in the Sheets toolbar: Insert -> Function -> Statistical -> FORECAST.LINEAR .

This part gets a little bit technical, but it’s simpler than it looks. The instruction menu below tells me that I’ll obtain my forecasts by filling in the relevant column numbers for the target number of sales calls.

sales forecast equation breakdown in google sheets

Here is the breakdown of what the elements of the FORECAST.LINEAR equation mean:

  • x is the value on the x-axis (in the scatter plot) that we want to forecast, which is the target call volume.
  • data_y uses the first and last row number in column C in the original table, 2 and 24.
  • data_x uses the first and last row number in column B in the original table, 2 and 24.
  • data_y goes before data_x because the dependent variable in column C changes because of the number in column B.

This equation, as the FORECAST.LINEAR instructions tell us, will calculate the expected y value (number of deals closed) for a specific x value based on a linear regression of the original data set. There are two ways to fill out the equation. The first option, shown below, is to manually input the x value for the number of target calls and repeat for each row.

=FORECAST.LINEAR(50, C2:C24, B2:B24)

The second option is to use the corresponding cell number for the first x value and drag the equation down to each subsequent cell. This is what the equation would look like if I used the cell number for 50 in the second data table:

=FORECAST.LINEAR(B27, C2:C24, B2:B24)

To reiterate, I use the number 50 because I want to be sure that making more sales calls results in more closed deals and more revenue, not just a random occurrence. Below is what the number of deals closed would be, shown with exact decimals rather than rounded.

sample regression forecast results

Overall, the results of this linear regression analysis and expected forecast tell me that the number of sales calls is directly related to the number of deals closed per month. If you ask your salespeople to make ten more calls per month than the previous month, the number of deals closed will increase, which will help your business generate more revenue.

While Google Sheets helped me do the math without any further calculations, other tools are available to streamline and simplify this process.

Sales Regression Forecasting Tools

A critical factor in conducting a successful regression analysis is having enough data. While you can technically run a regression on just a couple of numbers, you need enough data points to determine whether there is a significant relationship between your variables. Without enough data points, it will be challenging to run an accurate forecast. If you don’t yet have enough data, it may be best to wait until you do.

Once you have the data you need, the list below covers tools that can help you collect, store, and export your sales data.

InsightSquared

InsightSquared is a revenue intelligence platform that uses AI to make accurate forecasting predictions.

While it can’t run a regression analysis, it can give you the data you need to conduct the regression on your own. Specifically, it provides data breakdowns of the teams, representatives, and sales activities that are driving the best results. You can use this insight to come up with further questions to ask in your regression analysis to better understand performance.

demo of data collection software for sales forecasting

MethodData

Since sorting through data is essential for beginning your analysis, MethodData is a valuable tool. The service can create custom sales reports based on the variables you need for your specific regression, and the automated processes save you time. Instead of digging through your data and cleaning it up enough to be usable, this happens automatically once you create your custom reports.

HubSpot Sales Hub

HubSpot’s Sales Hub automatically records and tracks all relevant sales and performance data related to your teams. Specific items collected include activity reports for sales calls, emails sent, and meetings taken with clients, but you can also create custom reports.

If you want an immediate overview of your sales forecast, the Sales Hub comes with a probability forecast report . It gives a breakdown of how likely it will be that you’ll meet your monthly or quarterly sales goals (shown in the image below). These projections can help you come up with further questions to analyze in your regression analysis to understand what is (or isn’t) going wrong.

regression analysis in marketing research example

Automate.io

If you’re a HubSpot Sales Hub user and you want to use Google Sheets to conduct your regression analysis as I did, Automate.io allows you to sync and export data to external apps, including Google Sheets, eliminating the risks that can sometimes come from a simple copy+paste.

Another factor that can affect your analysis is whether you’re even doing it right. Like I said before, I’m bad at math, so I used an online tool. If you feel confident enough, feel free to use a pen, paper, and a quality calculator to run your analysis by hand.

If you’re like me, using statistical analysis tools like Excel , Google Sheets, RStudio , and SPSS can help you through the process, no hard calculations required. Paired with one of the data export tools listed above, you’ll have a seamless strategy to clean and organize your data and run your linear regression analysis.

Regression Analysis Helps You Better Understand Sales Performance

A regression analysis will give you statistical insight into the factors that influence sales performance.

If you take the time to come up with a viable regression question that focuses on two business-specific variables and use the right data, you’ll be able to accurately forecast expected sales performance and understand what elements of your strategy can remain the same, or what needs to change to meet new business goals.




Regression Analysis for Marketing Mix Modeling

Mar 1, 2023 | Marketing Mix Modeling Blogs


This comprehensive guide will give you an overview of what regression analysis is, its different types and how it can be leveraged in Marketing Mix Modeling.

Regression analysis is an important part of model building, the fourth phase in the MMM workflow. It is a powerful approach used to uncover and measure the relationship between a set of variables and a specified KPI, and predict future outcomes. This makes it a very useful and common technique in building marketing mix models.


Regression analysis has predictive power and is often used to predict the value of a variable (the dependent variable, the KPI of interest) based on other variables (the independent variables).

By the end of this blog, you will be well-versed in the basics of regression analysis and ready to start making data-driven decisions by interpreting the results of your regression analysis.

Let’s start first by defining a key component of regression analysis.

WHY IS CORRELATION CRUCIAL FOR REGRESSION ANALYSIS?

Correlation – also called Pearson’s coefficient of correlation – is a key statistic that needs to be computed and analyzed as part of the explore phase of the MMM workflow.

This metric is useful to appreciate the strength of the relationship between two variables. However, the presence of a correlation does not necessarily mean causality. Possible explanations include:

• Direct cause and effect: water causes plants to grow.

• Both cause and effect: coffee consumption causes nervousness; nervous people drink more coffee.

• Relationship caused by a third variable: deaths due to drowning and soft drink consumption during summer are both related to heat and humidity (the third variable).

• Coincidental relationship: an increase in the number of people exercising and an increase in the number of people committing crimes.

Pearson’s Correlation coefficient is a standardized covariance:

\[ r = \frac{\mathrm{cov}(x,y)}{\sqrt{\mathrm{var}(x)}\,\sqrt{\mathrm{var}(y)}} \]

• Measures the relative strength of the linear relationship between two variables

• Unit-less

• Ranges between -1 and 1

• The closer to -1, the stronger the negative linear relationship

• The closer to 1, the stronger the positive linear relationship

• The closer to 0, the weaker any linear relationship

As mentioned before in this article, correlation is a key component to be calculated and analyzed as part of the MMM workflow. However, it is important to mention that it can only capture linear association; it fails to detect non-linear associations.
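A short R illustration of both points (the vectors are made up for illustration):

# Strong positive linear relationship
adspend <- c(10, 20, 30, 40, 50)
sales   <- c(12, 19, 29, 37, 45)
cor(adspend, sales)  # close to +1

# A perfect but non-linear (quadratic) relationship can still yield a correlation near 0
x <- seq(-3, 3, by = 0.1)
cor(x, x^2)          # approximately 0, although y is fully determined by x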

THE SIMPLE LINEAR REGRESSION

What is the impact of advertising on sales? In other terms, if you know how much budget you are investing in advertising, can you forecast how much in sales you will achieve? Regression analysis allows you to do that!

But before dealing with real-life examples that require elaborate types of regression, let’s use the simple linear regression to fully understand how the process works. First, you need to collect a sample of data periods about x, in this case the advertising spend, and a sample of data periods about y, in this case the sales volume. Then, you use the regression technique to estimate the relationship between the period-on-period variations of advertising and the variations in sales. Statistically speaking, this means estimating the coefficients \(\beta_0\) and \(\beta_1\). Once these are estimated, you can discover the level of sales that could be achieved when a specific advertising spend is deployed.

Now that we got the business element explained, let’s dive into the mathematical side of the simple linear regression:

The equation of a simple linear regression depicts a dependency relationship between two variables or two sets of variables. The first set is called the dependent variable, and the second set would be the predictors or independent/explanatory variables. Given that this is “simple” linear regression, the independent variables set is composed of only one variable . This variable is used to predict the outcome of the dependent variable.

To predict the outcome for different values (or scenarios) of x, the analyst needs to estimate \(\beta_0\) and \(\beta_1\) from the data collected.

\[ y = \beta_0 + \beta_1 x + \varepsilon \]

where:

y = dependent variable

x = independent variable

\(\beta_0\) = intercept

\(\beta_1\) = slope

\(\varepsilon\) = error term, a normal random variable with \(E(\varepsilon) = 0\) and \(V(\varepsilon) = \sigma^2\), i.e. \(\varepsilon \sim N(0, \sigma^2)\)

The conditional expectation of y given x is then:

\[ E(y|x) = \beta_0 + \beta_1 x \]


• \(\beta_0\) and \(\beta_1\) are unknown, and the goal is to estimate them using the data sample.

• \(\beta_0\) is the base.

• \(\beta_1\) represents the coefficient of the independent variable x.

• \(\varepsilon\) is the error term or residual. In regression analysis, it is important that the error term is random and not influenced by any other factor. It should be as small and as random as possible.

HOW DOES LINEAR REGRESSION WORK?

Let’s consider the following equation that models the impact of advertising on sales.


\[ \underbrace{y}_{\text{Sales}} = \underbrace{\beta_0}_{\text{Base Sales}} + \underbrace{\beta_1 x}_{\text{Advertising Impact}} + \underbrace{\varepsilon}_{\text{Random Error}} \]


When data is plotted, we obtain the following chart where each point has an x coordinate and y coordinate:

• x represents the advertising budget.

• y coordinates represent the sales that have been achieved for a particular advertising spend (value of x ).

(Figure: scatter plot of sales against advertising spend, with the fitted regression line in blue)

The goal of the simple regression analysis is to find the line (i.e., the intercept and slope) that passes as close as possible to the data points in this chart, minimizing the distance between each point and its projection onto that line.

The blue line is the best line and is the estimate \(\widehat{y}\). Point estimates are obtained by projecting the initial data points (the dots) onto the estimate line \(\widehat{y}\).

The fitted line equation is \(\hat{y} = \hat{\beta_0} + \hat{\beta_1}x\). Note that we are no longer dealing with \(\beta_0\) and \(\beta_1\) themselves, but with their hat variations, because the parameters \(\beta_0\) and \(\beta_1\) are being estimated by \(\hat{\beta_0}\) and \(\hat{\beta_1}\) based on the sample of data collected.

The intercept is where the blue line intercepts the y -axis. That represents the base sales.

The slope is that of the fitted line. Any data point is represented by two coordinates, x and y. Projecting a point onto the line yields its estimate \(\hat{y}\).

The difference between the initial position of a point and its projection on the line, represented by the orange dot, is the error term. The smaller the error term, the better the regression line is at estimating the relationship between x and y.

OLS, Ordinary Least Squares, is the method used to estimate the parameters \(\beta_0\) and \(\beta_1\) by minimizing the sum of the squared error terms.
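As a small illustration, assuming a hypothetical data frame mmm_data with columns sales and adspend (names assumed), OLS estimation in R is a one-liner:

fit <- lm(sales ~ adspend, data = mmm_data)  # OLS fit of the simple linear regression
coef(fit)       # estimated beta_0 (intercept) and beta_1 (slope)
residuals(fit)  # the estimated error terms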

HOW REFINED COULD YOUR FORECAST BE WITH REGRESSION ANALYSIS?


To estimate the volume of sales to be achieved in a given period, a simple method consists of quoting the average sales of the previous periods; in other words, using a simple forecasting method based on the average. If regression is used instead, the forecast is a little more sophisticated.

In regression analysis, the concept we try to illustrate is that when we know how much budget is to be deployed for advertising, this information about the "independent variable" can be used to refine the forecast of the sales to be achieved. Presumably, a forecast based on regression should be superior to a forecast based on the simple average method!

How much prediction improvement is achieved when using regression over the simple average method? This leads us to introduce the concept of variance decomposition. The aim is to decompose the total variance, SST (the total sum of squares), into two components:

What is explained by the model (the regression line), represented by SSR: the Sum of Squares due to Regression.

The residual (unexplained) part of the equation, represented by SSE: the Sum of Squared Errors.

Total variation thus consists of two parts:
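In symbols, this decomposition is the standard identity:

\underbrace{\sum_{i=1}^{n}(y_{i}-\bar{y})^{2}}_{SST}=\underbrace{\sum_{i=1}^{n}(\hat{y}_{i}-\bar{y})^{2}}_{SSR}+\underbrace{\sum_{i=1}^{n}(y_{i}-\hat{y}_{i})^{2}}_{SSE}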


In the following chart:

[Figure: scatter plot showing the fitted regression line (blue), the simple average of y (yellow), and an observed data point (red)]

The projection of a specific point on the blue line represents the estimate \hat{y}.

The difference between the estimate and the yellow line, which represents the simple average, is what is gained by using regression to do the forecasting (over the simple average method).

The difference between the real point, the red point in the graph, and its projection on the yellow line represents what is called the error, or residual.

The ideal situation is one where the sum of squared errors (the residual part) is very small compared to SSR, the sum of squares due to regression.

HOW MUCH VARIANCE IS EXPLAINED BY THE MODEL? R-SQUARED (R²)

Those who are familiar with Regression or Marketing Mix Models have certainly heard about R² multiple times! But what does it really mean in layman's terms?

R^{2}=\frac{SSR}{SST}, \text{ where } 0\leq R^{2}\leq 1


R² simply represents the portion of the variance in the dependent variable that is explained by the model, or the independent variable. The higher the R², the better. In our sales/advertising example, if R² equals 70%, it means that 70% of the variation in the sales data is explained by movements in the advertising variable. So advertising can explain 70% of the variation in the dependent variable (sales).

R² is a very important metric in Marketing Mix Models as it gives an indication of how successful the model, and its variables, are in explaining the variations in the modeled KPI.
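Continuing the made-up numbers from the earlier sketch, the decomposition and R² can be computed directly in Python:

    import numpy as np

    # Same illustrative data as in the OLS sketch above.
    x = np.array([10, 12, 15, 18, 20, 24, 25, 28], dtype=float)
    y = np.array([95, 101, 110, 120, 122, 135, 136, 148], dtype=float)

    # Fit the line, then decompose the total variation.
    slope, intercept = np.polyfit(x, y, deg=1)
    y_hat = intercept + slope * x

    sst = np.sum((y - y.mean()) ** 2)      # total sum of squares
    ssr = np.sum((y_hat - y.mean()) ** 2)  # part explained by the regression
    sse = np.sum((y - y_hat) ** 2)         # residual (unexplained) part

    print(f"R² = SSR/SST = {ssr / sst:.3f}")  # equivalently 1 - SSE/SST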

HOW ACCURATE IS YOUR MODEL?

The standard error of the estimate (SEE).

This is a measure of the accuracy of the model. It is essentially the standard deviation of the residuals (the differences between the real y values and their estimates \hat{y}). This statistic is very useful as it shows how accurate the results of the modeling are and how strong the prediction power of the model is.
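For simple linear regression, the standard formula (with n the number of observations) is:

SEE=\sqrt{\frac{SSE}{n-2}}=\sqrt{\frac{\sum_{i=1}^{n}(y_{i}-\hat{y}_{i})^{2}}{n-2}}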

Analyze your residuals and plot them at all times! Make sure the OLS assumptions are respected and use the relevant statistics to establish your assessment (Durbin-Watson, VIF, normality tests, homoscedasticity, etc.).
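As an illustration, here is a sketch of such diagnostics using statsmodels and scipy on simulated data; the dataset and its two hypothetical drivers are made up for the example:

    import numpy as np
    import statsmodels.api as sm
    from statsmodels.stats.stattools import durbin_watson
    from statsmodels.stats.outliers_influence import variance_inflation_factor
    from statsmodels.stats.diagnostic import het_breuschpagan
    from scipy import stats

    # Simulated dataset: two hypothetical drivers and a noisy response.
    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 2))
    y = 5 + 2 * X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=1.0, size=100)

    exog = sm.add_constant(X)
    fit = sm.OLS(y, exog).fit()

    print("Durbin-Watson (autocorrelation; ~2 is good):", durbin_watson(fit.resid))
    print("VIF per regressor:", [variance_inflation_factor(exog, i) for i in range(1, exog.shape[1])])
    print("Shapiro-Wilk normality p-value:", stats.shapiro(fit.resid).pvalue)
    print("Breusch-Pagan homoscedasticity p-value:", het_breuschpagan(fit.resid, exog)[1])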

HOW RELIABLE IS YOUR MODEL?

regression analysis in marketing research example

The t-stat is a standard output of the model and a very important statistic to look at when building MMM models. Remember that the dataset used to create a regression model is a sample. That sample is used to estimate the parameters β0 and β1 at the population level. It is therefore very important to ensure that the estimates we obtain from the sample are reliable and are good estimates of the population parameters. In other terms, if we had picked another sample from the same population, would we have obtained the same estimates (within a certain confidence interval)? That is exactly what the t-stat is for.

As a rule of thumb, the t-stat should be above 2 in absolute value to make sure that we are reporting reliable impacts.
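Formally, the t-stat reported for each estimated coefficient is the estimate divided by its standard error:

t=\frac{\hat{\beta}_{k}}{SE(\hat{\beta}_{k})}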

What Does Real-life Regression Look Like?

Now that you explored how Simple Linear Regression works, it’s time to move to what happens in real life when working on an MMM project.

There are many variations of Regression analysis that are used in Marketing Mix Modeling:

  • Multiple Regression
  • Loglinear modeling
  • Nested Models
  • Pooled Regression, etc.

In this part of the article, we're going to discuss Multiple Linear Regression. If you'd like to learn about the other types, check out the following article.

Multiple Regression

Multiple Regression is as simple as stating that the period-on-period movements/variations in the dependent variable (the KPI that we are interested in modeling, e.g. sales) are no longer explained by the variations of one single variable, but rather by the movements of multiple variables.

This is of course a more representative setting, as simple linear regression is hardly used in real-life MMM projects; it is too simplistic and does not handle the complexity of consumer behavior and the media landscape.

Thus, when the analyst starts adding more variables into the regression equation, they move from a simple regression setting to a multiple regression setting.

In a typical marketing mix modeling project, multiple variables impact the sales performance, e.g. media spend across online and offline channels, promotions, distribution, and seasonality.

To be able to measure the impact of those variables on sales or any other chosen KPI, the analyst needs to build a robust model which accounts for all the variables influencing the movement of sales.

y=\beta_{0}+\beta_{1}x_{1}+\beta_{2}x_{2}+\beta_{3}x_{3}+\beta_{4}x_{4}+...+\beta_{k}x_{k}+\varepsilon

  • \beta_{0},\beta_{1},...,\beta_{k}: the population model parameters (coefficients) to be estimated from the sample.
  • \beta_{0} represents the base, a very important concept in MMM.
  • x_{1},x_{2},...,x_{k} are the independent variables influencing sales.
  • The term \beta_{k}x_{k} represents the contribution of the variable x_{k} to sales: i.e. how much of sales is driven by the variable x_{k} (its incremental impact).
  • \varepsilon is the error term, assumed to be independent and normally distributed with mean 0 and constant variance.


Estimating this model will help the business better predict their sales performance. In doing so, they gain insights into:

  • The impact of every media channel on sales, be it online or offline.
  • The budget they should allocate to media before reaching saturation.
  • The promotional mechanics they should utilize.
  • The distribution levels to be maintained.
  • How seasonality could be leveraged.

Once the parameters \beta_{0},\beta_{1},...,\beta_{k} are estimated, they can easily plug the values of any media and marketing scenario they have been given into the equation and predict the incremental sales they will make. This becomes a powerful tool for decision making.
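Below is a minimal sketch of what this looks like in Python with statsmodels, fitted on simulated weekly data; the drivers (tv, search, promo) and every number are hypothetical, not taken from any real project:

    import numpy as np
    import pandas as pd
    import statsmodels.api as sm

    rng = np.random.default_rng(42)
    n = 104  # two years of weekly observations
    df = pd.DataFrame({
        "tv": rng.uniform(0, 100, n),     # e.g. TV GRPs
        "search": rng.uniform(0, 50, n),  # e.g. search impressions (millions)
        "promo": rng.integers(0, 2, n),   # promotion on/off
    })
    # Simulated sales: base + incremental media/promo contributions + noise.
    df["sales"] = (200 + 1.5 * df["tv"] + 3.0 * df["search"]
                   + 40 * df["promo"] + rng.normal(scale=25, size=n))

    X = sm.add_constant(df[["tv", "search", "promo"]])
    model = sm.OLS(df["sales"], X).fit()

    print(model.params)    # beta_0 (the base) and the beta_k estimates
    print(model.tvalues)   # t-stats: look for |t| > 2, as discussed above
    print(model.rsquared)  # share of the sales variance explained

A planned media scenario can then be passed to model.predict() to obtain the corresponding sales forecast.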

HOW THE PARAMETERS ARE ESTIMATED

The process is the same as for simple linear regression. The goal is to minimize the residuals, i.e. the difference between y and \widehat{y}. The method of estimation is also the same: the Ordinary Least Squares (OLS) method, which consists in finding the estimates of \beta_{0},\beta_{1},\beta_{2},..., and \beta_{k} that minimize the sum of squared errors:

\sum_{i=1}^{n}(y_{i}-\hat{y}_{i})^{2}

COEFFICIENT INTERPRETATION

In a given equation, the estimated coefficient \widehat{\beta_{k}} related to the advertising variable x_{k} represents the sensitivity of the dependent variable to variations in the independent variable x_{k}.

For instance, if x_{k} represents search activity expressed in millions of impressions, then we estimate that a 1-unit movement in x_{k} (i.e. 1 million impressions) will yield a \widehat{\beta_{k}}-unit increase in sales, while keeping all the other variables in the equation (seasonality, promotions, distribution, etc.) at the same level.

The coefficient in the context of multiple regression is also commonly called a partial regression coefficient, because it explains how much the dependent variable will move as a result of the movement of one independent variable while keeping all the other variables at the same level.

When estimating the coefficients in the context of multiple regression, you need to make sure they are reliable before you interpret them. To check this, compute the t-stat associated with each coefficient.

You now know how to conduct a regression analysis for your marketing mix modeling project. But it doesn't end there: there are other modeling techniques we didn't cover, such as other types of linear regression, as well as non-linear regression!

Related Articles

  • Maximizing Executive Support for Your Marketing Mix Modeling Results: A Strategic Guide
  • How Far Can You Scale With Marketing Mix Modeling
  • How to Become a Marketing Data Analyst in 2024


Regression Analysis – predicting the future

In marketing, regression analysis is used to predict how the relationship between two variables, such as advertising and sales, will develop over time. Business managers can draw the regression line using cases derived from the historical sales data available to them.

The purpose of regression analysis is to describe, predict and control the relationship between at least two variables. The basic principle is to minimise the distance between the actual data and the predictions of the regression line. Regression analysis is used to explain variations in market share, sales and brand preference, and this is normally done using variables such as advertising, price, distribution and quality.

Regression analysis is used:

  • To predict the values of the dependent variable
  • To determine whether the independent variables explain significant variation in the dependent variable, i.e. whether a relationship between the variables exists
  • To measure the strength of the relationship
  • To determine the structure or form of the relationship

An online t-shirt sales company invested in Google AdWords advertising:

  • £1000 in January
  • £1000 in February
  • £1000 in March

Their sales grew steadily in this period:

  • £5000 in January
  • £5500 in February
  • £6000 in March

The managers can predict, by looking at the regression line, that with the current level of advertising spend (£1000 per month) the sales in April will be £6500. This would obviously be the case if all other things remained equal, but in reality they never do. Sales managers should use the prediction data from the regression analysis as an additional managerial tool but should not rely on it exclusively. The level of sales can be affected by elements other than the level of advertising, including, but not limited to, factors such as weather conditions or the central bank's increase or decrease of base interest rates. Regression analysis is concerned with the nature and degree of association between variables but does not assume causality (it does not explain why there is a relationship between variables). Other good examples of how regression analysis can be used to test marketing-relevant hypotheses are: Can variation in demand be explained in terms of variation in fuel prices? Are consumers' perceptions of quality determined by their perceptions of price?
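This forecast can be reproduced in a few lines of Python; the sketch below simply regresses the three observed sales values from the example on the month index:

    import numpy as np

    months = np.array([1, 2, 3], dtype=float)           # January, February, March
    sales = np.array([5000, 5500, 6000], dtype=float)   # sales in £

    slope, intercept = np.polyfit(months, sales, deg=1)  # least-squares line
    april_forecast = intercept + slope * 4               # extrapolate to month 4
    print(f"Forecast for April: £{april_forecast:.0f}")  # £6500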

Regression analysis involves a number of statistics used to determine its accuracy and usefulness for a given purpose. Some of those statistics and methods are explained below:

  • Product Moment Correlation (r) is a statistic summarising the strength of association between two metric variables (for example, X and Y). It is used to determine whether a linear (straight-line) relationship exists between X and Y, and it indicates the degree to which the variation in one variable (X) is related to the variation in the other (Y). It is also known as the Pearson correlation, simple correlation, bivariate correlation or correlation coefficient. Covariance is a systematic relationship between two variables in which a change in one implies a corresponding change in the other (COV(X, Y)). The correlation coefficient between two variables is the same regardless of their units of measurement. If r = 0.93 (a value close to 1.0), one variable is strongly associated with the other. It does not matter which variable is considered dependent and which independent (X with Y, or Y with X). The 'r' is designed to measure the strength of a linear relationship; thus r = 0 does not suggest that there is no relationship between X and Y, as there could be a non-linear relationship between the two.

  • Residuals – the difference between the observed value of Y and the value predicted by the regression equation.

  • Partial Correlation Coefficient – measures the association between the variables after adjusting for the effect of one or more additional variables. For example: how strongly related are sales to advertising expenditure when the effect of price is controlled?
  • Part Correlation Coefficient – is a measure of the correlation between Y and X when the linear effects of the other independent variables have been removed from X but not from Y.
  • Non-metric Correlation – a correlation measure for two non-metric variables that rely on rankings to compute the correlations.
  • Least Squares Procedure – a technique for fitting a straight line to a scattergram by minimising the vertical distances of all the points from the line. The best-fitting line is the regression line. The vertical distance from a point to the line is the error (e).
  • Significance Testing – the significance of the linear relationship between X and Y may be tested by examining two hypotheses:
  • H0: There is no linear relationship between X and Y
  • H1: There is a (positive or negative) relationship between X and Y

The strength and significance of the association are measured by the coefficient of determination, r-squared (r²). Significance testing involves testing the significance of the overall regression equation as well as of specific partial regression coefficients.
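As a small illustration of both ideas, the sketch below computes the product moment correlation and its significance on made-up advertising/sales pairs (scipy's pearsonr returns r together with the p-value for the "no linear relationship" hypothesis):

    import numpy as np
    from scipy import stats

    # Made-up paired observations: advertising spend (X) and sales (Y).
    x = np.array([1.0, 1.2, 0.8, 1.5, 1.1, 1.7, 0.9, 1.4])
    y = np.array([5.1, 5.6, 4.7, 6.2, 5.3, 6.6, 4.9, 6.0])

    r, p_value = stats.pearsonr(x, y)
    print(f"r = {r:.2f}, r² = {r ** 2:.2f}, p-value = {p_value:.4f}")
    # A small p-value rejects H0 ("no linear relationship between X and Y").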

Multiple Regression

Multiple regression is extremely relevant to business analysis. It involves a single dependent variable, such as sales, and two or more independent variables, such as employee remuneration, number of staff, level of advertising, or online marketing spend. For example: can variation in sales be explained in terms of variation in advertising expenditures, prices and level of distribution? It is possible to add further independent variables to answer the question raised. A statistic relevant to multiple regression is the adjusted r-squared (r²): the coefficient of multiple determination adjusted for the number of independent variables and the sample size to account for diminishing returns.
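The standard adjusted r-squared formula, with n the sample size and k the number of independent variables, is:

R_{adj}^{2}=1-(1-R^{2})\frac{n-1}{n-k-1}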


Written by Michael Pawlicki


Regression Analysis



From overall customer satisfaction to satisfaction with your product quality and price, regression analysis measures the strength of a relationship between different variables.


How regression analysis works

While correlation analysis provides a single numeric summary of a relation ("the correlation coefficient"), regression analysis results in a prediction equation describing the relationship between the variables. If the relationship is strong – expressed by the R-square value – it can be used to predict values of one variable given known values of the other variables. For example, how will the overall satisfaction score change if satisfaction with product quality goes up from 6 to 7?
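As a toy illustration (all coefficients are made up for this example), the answer falls straight out of the fitted prediction equation:

    # Hypothetical fitted equation from a satisfaction study:
    #   overall = 1.2 + 0.45 * quality + 0.30 * price_satisfaction
    # Moving quality from 6 to 7 while holding price satisfaction constant
    # shifts the predicted overall score by the quality coefficient:
    delta_overall = 0.45 * (7 - 6)
    print(delta_overall)  # +0.45 points on the overall satisfaction scale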


Measuring customer satisfaction

Regression analysis can be used in customer satisfaction and employee satisfaction studies to answer questions such as: “Which product dimensions contribute most to someone’s overall satisfaction or loyalty to the brand?” This is often referred to as Key Drivers Analysis.

It can also be used to simulate the outcome when actions are taken. For example: “What will happen to the satisfaction score when product availability is improved?”



Regression Analysis

Regression analysis is a quantitative research method which is used when the study involves modelling and analysing several variables, where the relationship includes a dependent variable and one or more independent variables. In simple terms, regression analysis is a quantitative method used to test the nature of relationships between a dependent variable and one or more independent variables.

The basic form of regression models includes unknown parameters (β), independent variables (X), and the dependent variable (Y).

A regression model, basically, specifies the relation of the dependent variable (Y) to a function of the independent variables (X) and the unknown parameters (β):

Y ≈ f(X, β)

The regression equation can be used to predict the values of 'y' when the value of 'x' is given, where 'y' and 'x' are two sets of measures from a sample of size 'n'. For a simple linear regression \hat{y}=a+bx, the standard least-squares formulae are:

b=\frac{n\sum xy-\sum x\sum y}{n\sum x^{2}-(\sum x)^{2}}, \qquad a=\bar{y}-b\bar{x}

Do not be intimidated by the visual complexity of the correlation and regression formulae above. You don't have to apply the formulae manually; correlation and regression analyses can be run with popular analytical software such as Microsoft Excel, Microsoft Access, SPSS and others.

Linear regression analysis is based on the following set of assumptions:

1. Assumption of linearity. There is a linear relationship between the dependent and independent variables.

2. Assumption of homoscedasticity. The variance of the residuals is constant across all levels of the independent variables.

3. Assumption of absence of collinearity or multicollinearity. There is no correlation between two or more independent variables.

4. Assumption of normal distribution. The residuals are normally distributed.

John Dudovskiy

