Hypothesis Testing in Regression Analysis

Hypothesis testing is used to determine whether the estimated regression coefficients are statistically significant. Either the confidence interval approach or the t-test approach can be used. In this section, we will explore the t-test approach.

The t-test Approach

The following are the steps followed in the performance of the t-test:

  • Set the significance level for the test.
  • Formulate the null and the alternative hypotheses.
  • Calculate the t-statistic:

$$t=\frac{\widehat{b_1}-b_1}{s_{\widehat{b_1}}}$$

Where:

\(b_1\) = True slope coefficient.

\(\widehat{b_1}\) = Point estimate for \(b_1\)

\(s_{\widehat{b_1}}\) = Standard error of the estimated regression coefficient.

  • Compare the absolute value of the t-statistic to the critical t-value \((t_c)\). Reject the null hypothesis if the absolute value of the t-statistic is greater than the critical t-value, i.e., if \(t > +t_{\text{critical}}\) or \(t < -t_{\text{critical}}\).

Example: Hypothesis Testing of the Significance of Regression Coefficients

An analyst generates the following output from the regression analysis of inflation on unemployment:

$$\small{\begin{array}{llll}\hline{}& \textbf{Regression Statistics} &{}&{}\\ \hline{}& \text{Multiple R} & 0.8766 &{} \\ {}& \text{R Square} & 0.7684 &{} \\ {}& \text{Adjusted R Square} & 0.7394 & {}\\ {}& \text{Standard Error} & 0.0063 &{}\\ {}& \text{Observations} & 10 &{}\\ \hline {}& & & \\ \hline{} & \textbf{Coefficients} & \textbf{Standard Error} & \textbf{t-Stat}\\ \hline \text{Intercept} & 0.0710 & 0.0094 & 7.5160 \\\text{Forecast (Slope)} & -0.9041 & 0.1755 & -5.1516\\ \hline\end{array}}$$

At the 5% significance level, test whether the slope coefficient is different from one, that is,

$$ H_{0}: b_{1} = 1\ vs. \ H_{a}: b_{1}≠1 $$

The calculated t-statistic, \(\text{t}=\frac{\widehat{b_{1}}-b_1}{\widehat{S_{b_{1}}}}\) is equal to:

$$\begin{align*}\text{t}& = \frac{-0.9041-1}{0.1755}\\& = -10.85\end{align*}$$

The critical two-tail t-values from the table with \(n-2=8\) degrees of freedom are:

$$\text{t}_{c}=±2.306$$

Notice that \(|t|>t_{c}\), i.e., \(10.85>2.306\).

Therefore, we reject the null hypothesis and conclude that the estimated slope coefficient is statistically different from one.

Note that the confidence interval approach would arrive at the same conclusion.
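As a quick check, the test statistic and critical value above can be reproduced in R (a minimal sketch using only the values reported in the regression output; the variable names are illustrative):

    # Values taken from the regression output above
    b1_hat <- -0.9041                 # estimated slope coefficient
    se_b1  <- 0.1755                  # standard error of the slope
    b1_0   <- 1                       # hypothesized slope under the null hypothesis
    n      <- 10                      # number of observations

    t_stat <- (b1_hat - b1_0) / se_b1 # about -10.85
    t_crit <- qt(0.975, df = n - 2)   # about 2.306 for a two-tailed 5% test
    abs(t_stat) > t_crit              # TRUE, so reject the null hypothesis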

Question

Neeth Shinu, CFA, is forecasting the price elasticity of supply for a certain product. Shinu uses the quantity of the product supplied for the past 5 months as the dependent variable and the price per unit of the product as the independent variable. The regression results are shown below.

$$\small{\begin{array}{lccccc}\hline \textbf{Regression Statistics} & & & & & \\ \hline \text{Multiple R} & 0.9971 & {}& {}&{}\\ \text{R Square} & 0.9941 & & & \\ \text{Adjusted R Square} & 0.9922 & & & & \\ \text{Standard Error} & 3.6515 & & & \\ \text{Observations} & 5 & & & \\ \hline {}& \textbf{Coefficients} & \textbf{Standard Error} & \textbf{t Stat} & \textbf{P-value}\\ \hline\text{Intercept} & -159 & 10.520 & (15.114) & 0.001\\ \text{Slope} & 0.26 & 0.012 & 22.517 & 0.000\\ \hline\end{array}}$$

Which of the following most likely reports the correct value of the t-statistic for the slope and most accurately evaluates its statistical significance with 95% confidence?

A. \(t=21.67\); the slope is significantly different from zero.

B. \(t=3.18\); the slope is significantly different from zero.

C. \(t=22.57\); the slope is not significantly different from zero.

Solution

The correct answer is A.

The t-statistic is calculated using the formula:

$$\text{t}=\frac{\widehat{b_{1}}-b_1}{\widehat{S_{b_{1}}}}$$

Where:

\(b_{1}\) = True slope coefficient.

\(\widehat{b_{1}}\) = Point estimator for \(b_{1}\).

\(\widehat{S_{b_{1}}}\) = Standard error of the regression coefficient.

$$\begin{align*}\text{t}&=\frac{0.26-0}{0.012}\\&=21.67\end{align*}$$

The critical two-tail t-values from the t-table with \(n-2 = 3\) degrees of freedom are:

$$t_{c}=±3.18$$

Notice that \(|t|>t_{c}\) (i.e., \(21.67>3.18\)). Therefore, the null hypothesis can be rejected. Further, we can conclude that the estimated slope coefficient is statistically different from zero.

Regression Analysis – Methods, Types and Examples

Regression Analysis

Regression analysis is a set of statistical processes for estimating the relationships among variables . It includes many techniques for modeling and analyzing several variables when the focus is on the relationship between a dependent variable and one or more independent variables (or ‘predictors’).

Regression Analysis Methodology

Here is a general methodology for performing regression analysis; a short R sketch illustrating these steps follows the list:

  • Define the research question: Clearly state the research question or hypothesis you want to investigate. Identify the dependent variable (also called the response variable or outcome variable) and the independent variables (also called predictor variables or explanatory variables) that you believe are related to the dependent variable.
  • Collect data: Gather the data for the dependent variable and independent variables. Ensure that the data is relevant, accurate, and representative of the population or phenomenon you are studying.
  • Explore the data: Perform exploratory data analysis to understand the characteristics of the data, identify any missing values or outliers, and assess the relationships between variables through scatter plots, histograms, or summary statistics.
  • Choose the regression model: Select an appropriate regression model based on the nature of the variables and the research question. Common regression models include linear regression, multiple regression, logistic regression, polynomial regression, and time series regression, among others.
  • Assess assumptions: Check the assumptions of the regression model. Some common assumptions include linearity (the relationship between variables is linear), independence of errors, homoscedasticity (constant variance of errors), and normality of errors. Violation of these assumptions may require additional steps or alternative models.
  • Estimate the model: Use a suitable method to estimate the parameters of the regression model. The most common method is ordinary least squares (OLS), which minimizes the sum of squared differences between the observed and predicted values of the dependent variable.
  • Interpret the results: Analyze the estimated coefficients, p-values, confidence intervals, and goodness-of-fit measures (e.g., R-squared) to interpret the results. Determine the significance and direction of the relationships between the independent variables and the dependent variable.
  • Evaluate model performance: Assess the overall performance of the regression model using appropriate measures, such as R-squared, adjusted R-squared, and root mean squared error (RMSE). These measures indicate how well the model fits the data and how much of the variation in the dependent variable is explained by the independent variables.
  • Test assumptions and diagnose problems: Check the residuals (the differences between observed and predicted values) for any patterns or deviations from assumptions. Conduct diagnostic tests, such as examining residual plots, testing for multicollinearity among independent variables, and assessing heteroscedasticity or autocorrelation, if applicable.
  • Make predictions and draw conclusions: Once you have a satisfactory model, use it to make predictions on new or unseen data. Draw conclusions based on the results of the analysis, considering the limitations and potential implications of the findings.
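To make these steps concrete, the following R sketch walks through a minimal version of the workflow on simulated data; the data and all variable names are illustrative, not a prescription:

    set.seed(42)

    # Collect/simulate data: a response that depends linearly on two predictors
    n  <- 100
    x1 <- rnorm(n)
    x2 <- rnorm(n)
    y  <- 2 + 1.5 * x1 - 0.8 * x2 + rnorm(n, sd = 1)
    df <- data.frame(y, x1, x2)

    # Explore the data
    summary(df)
    pairs(df)

    # Estimate the model by ordinary least squares
    fit <- lm(y ~ x1 + x2, data = df)

    # Interpret the results: coefficients, p-values, R-squared
    summary(fit)

    # Test assumptions: residual plots, normality, influential points
    par(mfrow = c(2, 2))
    plot(fit)

    # Make predictions on new data
    predict(fit, newdata = data.frame(x1 = 0.5, x2 = -1))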

Types of Regression Analysis

Types of Regression Analysis are as follows:

Linear Regression

Linear regression is the most basic and widely used form of regression analysis. It models the linear relationship between a dependent variable and one or more independent variables. The goal is to find the best-fitting line that minimizes the sum of squared differences between observed and predicted values.

Multiple Regression

Multiple regression extends linear regression by incorporating two or more independent variables to predict the dependent variable. It allows for examining the simultaneous effects of multiple predictors on the outcome variable.

Polynomial Regression

Polynomial regression models non-linear relationships between variables by adding polynomial terms (e.g., squared or cubic terms) to the regression equation. It can capture curved or nonlinear patterns in the data.

Logistic Regression

Logistic regression is used when the dependent variable is binary or categorical. It models the probability of the occurrence of a certain event or outcome based on the independent variables. Logistic regression estimates the coefficients using the logistic function, which transforms the linear combination of predictors into a probability.
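As an illustration, a model of this kind can be fit in R with glm() and the binomial family; the sketch below uses simulated data and illustrative variable names:

    set.seed(1)

    # Simulated binary outcome driven by one predictor
    x <- rnorm(200)
    p <- plogis(-0.5 + 1.2 * x)          # true probabilities via the logistic function
    y <- rbinom(200, size = 1, prob = p)

    logit_fit <- glm(y ~ x, family = binomial)
    summary(logit_fit)                   # coefficients are on the log-odds scale

    # Predicted probability for a new observation
    predict(logit_fit, newdata = data.frame(x = 1), type = "response")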

Ridge Regression and Lasso Regression

Ridge regression and Lasso regression are techniques used for addressing multicollinearity (high correlation between independent variables) and variable selection. Both methods introduce a penalty term to the regression equation to shrink or eliminate less important variables. Ridge regression uses L2 regularization, while Lasso regression uses L1 regularization.
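In R, both penalties are available through the glmnet package, where the alpha argument switches between ridge (alpha = 0) and lasso (alpha = 1). The sketch below uses simulated data and a basic cross-validation; it is meant only to show the interface:

    # install.packages("glmnet")
    library(glmnet)

    set.seed(7)
    X <- matrix(rnorm(100 * 10), nrow = 100)   # 10 predictors
    y <- X[, 1] - 2 * X[, 2] + rnorm(100)      # only the first two matter

    ridge_fit <- glmnet(X, y, alpha = 0)       # L2 penalty: shrinks coefficients
    lasso_fit <- glmnet(X, y, alpha = 1)       # L1 penalty: can set coefficients to zero

    # Choose the penalty strength (lambda) by cross-validation and inspect coefficients
    cv_lasso <- cv.glmnet(X, y, alpha = 1)
    coef(cv_lasso, s = "lambda.min")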

Time Series Regression

Time series regression analyzes the relationship between a dependent variable and independent variables when the data is collected over time. It accounts for autocorrelation and trends in the data and is used in forecasting and studying temporal relationships.

Nonlinear Regression

Nonlinear regression models are used when the relationship between the dependent variable and independent variables is not linear. These models can take various functional forms and require estimation techniques different from those used in linear regression.

Poisson Regression

Poisson regression is employed when the dependent variable represents count data. It models the relationship between the independent variables and the expected count, assuming a Poisson distribution for the dependent variable.
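A Poisson model can likewise be fit in R with glm(), using the poisson family and a log link (a sketch on simulated count data; names are illustrative):

    set.seed(3)
    x      <- rnorm(150)
    counts <- rpois(150, lambda = exp(0.3 + 0.7 * x))   # counts with a log-linear mean

    pois_fit <- glm(counts ~ x, family = poisson(link = "log"))
    summary(pois_fit)            # coefficients are on the log scale of the expected count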

Generalized Linear Models (GLM)

GLMs are a flexible class of regression models that extend the linear regression framework to handle different types of dependent variables, including binary, count, and continuous variables. GLMs incorporate various probability distributions and link functions.

Regression Analysis Formulas

Regression analysis involves estimating the parameters of a regression model to describe the relationship between the dependent variable (Y) and one or more independent variables (X). Here are the basic formulas for linear regression, multiple regression, and logistic regression:

Linear Regression:

Simple Linear Regression Model: Y = β0 + β1X + ε

Multiple Linear Regression Model: Y = β0 + β1X1 + β2X2 + … + βnXn + ε

In both formulas:

  • Y represents the dependent variable (response variable).
  • X represents the independent variable(s) (predictor variable(s)).
  • β0, β1, β2, …, βn are the regression coefficients or parameters that need to be estimated.
  • ε represents the error term or residual (the difference between the observed and predicted values).

Multiple Regression:

Multiple regression extends the concept of simple linear regression by including multiple independent variables.

Multiple Regression Model: Y = β0 + β1X1 + β2X2 + … + βnXn + ε

The formulas are similar to those in linear regression, with the addition of more independent variables.

Logistic Regression:

Logistic regression is used when the dependent variable is binary or categorical. The logistic regression model applies a logistic or sigmoid function to the linear combination of the independent variables.

Logistic Regression Model: p = 1 / (1 + e^-(β0 + β1X1 + β2X2 + … + βnXn))

In the formula:

  • p represents the probability of the event occurring (e.g., the probability of success or belonging to a certain category).
  • X1, X2, …, Xn represent the independent variables.
  • e is the base of the natural logarithm.

The logistic function ensures that the predicted probabilities lie between 0 and 1, allowing for binary classification.
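To make the formula concrete, the predicted probability for a single observation can be computed directly from the coefficients; the values below are purely hypothetical:

    # Hypothetical coefficients and predictor values
    b0 <- -1.5; b1 <- 0.8; b2 <- 0.3
    x1 <- 2;    x2 <- -1

    eta <- b0 + b1 * x1 + b2 * x2   # linear combination of the predictors
    p   <- 1 / (1 + exp(-eta))      # logistic transform; identical to plogis(eta)
    p                               # a probability between 0 and 1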

Regression Analysis Examples

Regression Analysis Examples are as follows:

  • Stock Market Prediction: Regression analysis can be used to predict stock prices based on various factors such as historical prices, trading volume, news sentiment, and economic indicators. Traders and investors can use this analysis to make informed decisions about buying or selling stocks.
  • Demand Forecasting: In retail and e-commerce, regression analysis can help forecast demand for products. By analyzing historical sales data along with real-time data such as website traffic, promotional activities, and market trends, businesses can adjust their inventory levels and production schedules to meet customer demand more effectively.
  • Energy Load Forecasting: Utility companies often use real-time regression analysis to forecast electricity demand. By analyzing historical energy consumption data, weather conditions, and other relevant factors, they can predict future energy loads. This information helps them optimize power generation and distribution, ensuring a stable and efficient energy supply.
  • Online Advertising Performance: Regression analysis can be used to assess the performance of online advertising campaigns. By analyzing real-time data on ad impressions, click-through rates, conversion rates, and other metrics, advertisers can adjust their targeting, messaging, and ad placement strategies to maximize their return on investment.
  • Predictive Maintenance: Regression analysis can be applied to predict equipment failures or maintenance needs. By continuously monitoring sensor data from machines or vehicles, regression models can identify patterns or anomalies that indicate potential failures. This enables proactive maintenance, reducing downtime and optimizing maintenance schedules.
  • Financial Risk Assessment: Real-time regression analysis can help financial institutions assess the risk associated with lending or investment decisions. By analyzing real-time data on factors such as borrower financials, market conditions, and macroeconomic indicators, regression models can estimate the likelihood of default or assess the risk-return tradeoff for investment portfolios.

Importance of Regression Analysis

Importance of Regression Analysis is as follows:

  • Relationship Identification: Regression analysis helps in identifying and quantifying the relationship between a dependent variable and one or more independent variables. It allows us to determine how changes in independent variables impact the dependent variable. This information is crucial for decision-making, planning, and forecasting.
  • Prediction and Forecasting: Regression analysis enables us to make predictions and forecasts based on the relationships identified. By estimating the values of the dependent variable using known values of independent variables, regression models can provide valuable insights into future outcomes. This is particularly useful in business, economics, finance, and other fields where forecasting is vital for planning and strategy development.
  • Causality Assessment: While correlation does not imply causation, regression analysis provides a framework for assessing causality by considering the direction and strength of the relationship between variables. It allows researchers to control for other factors and assess the impact of a specific independent variable on the dependent variable. This helps in determining the causal effect and identifying significant factors that influence outcomes.
  • Model Building and Variable Selection: Regression analysis aids in model building by determining the most appropriate functional form of the relationship between variables. It helps researchers select relevant independent variables and eliminate irrelevant ones, reducing complexity and improving model accuracy. This process is crucial for creating robust and interpretable models.
  • Hypothesis Testing: Regression analysis provides a statistical framework for hypothesis testing. Researchers can test the significance of individual coefficients, assess the overall model fit, and determine if the relationship between variables is statistically significant. This allows for rigorous analysis and validation of research hypotheses.
  • Policy Evaluation and Decision-Making: Regression analysis plays a vital role in policy evaluation and decision-making processes. By analyzing historical data, researchers can evaluate the effectiveness of policy interventions and identify the key factors contributing to certain outcomes. This information helps policymakers make informed decisions, allocate resources effectively, and optimize policy implementation.
  • Risk Assessment and Control: Regression analysis can be used for risk assessment and control purposes. By analyzing historical data, organizations can identify risk factors and develop models that predict the likelihood of certain outcomes, such as defaults, accidents, or failures. This enables proactive risk management, allowing organizations to take preventive measures and mitigate potential risks.

When to Use Regression Analysis

  • Prediction : Regression analysis is often employed to predict the value of the dependent variable based on the values of independent variables. For example, you might use regression to predict sales based on advertising expenditure, or to predict a student’s academic performance based on variables like study time, attendance, and previous grades.
  • Relationship analysis: Regression can help determine the strength and direction of the relationship between variables. It can be used to examine whether there is a linear association between variables, identify which independent variables have a significant impact on the dependent variable, and quantify the magnitude of those effects.
  • Causal inference: Regression analysis can be used to explore cause-and-effect relationships by controlling for other variables. For example, in a medical study, you might use regression to determine the impact of a specific treatment while accounting for other factors like age, gender, and lifestyle.
  • Forecasting : Regression models can be utilized to forecast future trends or outcomes. By fitting a regression model to historical data, you can make predictions about future values of the dependent variable based on changes in the independent variables.
  • Model evaluation: Regression analysis can be used to evaluate the performance of a model or test the significance of variables. You can assess how well the model fits the data, determine if additional variables improve the model’s predictive power, or test the statistical significance of coefficients.
  • Data exploration : Regression analysis can help uncover patterns and insights in the data. By examining the relationships between variables, you can gain a deeper understanding of the data set and identify potential patterns, outliers, or influential observations.

Applications of Regression Analysis

Here are some common applications of regression analysis:

  • Economic Forecasting: Regression analysis is frequently employed in economics to forecast variables such as GDP growth, inflation rates, or stock market performance. By analyzing historical data and identifying the underlying relationships, economists can make predictions about future economic conditions.
  • Financial Analysis: Regression analysis plays a crucial role in financial analysis, such as predicting stock prices or evaluating the impact of financial factors on company performance. It helps analysts understand how variables like interest rates, company earnings, or market indices influence financial outcomes.
  • Marketing Research: Regression analysis helps marketers understand consumer behavior and make data-driven decisions. It can be used to predict sales based on advertising expenditures, pricing strategies, or demographic variables. Regression models provide insights into which marketing efforts are most effective and help optimize marketing campaigns.
  • Health Sciences: Regression analysis is extensively used in medical research and public health studies. It helps examine the relationship between risk factors and health outcomes, such as the impact of smoking on lung cancer or the relationship between diet and heart disease. Regression analysis also helps in predicting health outcomes based on various factors like age, genetic markers, or lifestyle choices.
  • Social Sciences: Regression analysis is widely used in social sciences like sociology, psychology, and education research. Researchers can investigate the impact of variables like income, education level, or social factors on various outcomes such as crime rates, academic performance, or job satisfaction.
  • Operations Research: Regression analysis is applied in operations research to optimize processes and improve efficiency. For example, it can be used to predict demand based on historical sales data, determine the factors influencing production output, or optimize supply chain logistics.
  • Environmental Studies: Regression analysis helps in understanding and predicting environmental phenomena. It can be used to analyze the impact of factors like temperature, pollution levels, or land use patterns on phenomena such as species diversity, water quality, or climate change.
  • Sports Analytics: Regression analysis is increasingly used in sports analytics to gain insights into player performance, team strategies, and game outcomes. It helps analyze the relationship between various factors like player statistics, coaching strategies, or environmental conditions and their impact on game outcomes.


Linear regression hypothesis testing: Concepts, Examples

In machine learning, linear regression is a predictive modeling technique for building models that predict a continuous response variable as a function of a linear combination of explanatory or predictor variables. When training linear regression models, we rely on hypothesis testing to determine whether a relationship exists between the response and predictor variables. Two types of hypothesis tests are used for a linear regression model: T-tests and F-tests. In other words, two types of statistics are used to assess whether a linear regression model relating the response and predictor variables exists: t-statistics and f-statistics. For data scientists, it is important to determine whether linear regression is the correct choice of model for a particular problem, and this can be done by performing hypothesis testing related to the linear regression response and predictor variables. These concepts are often unclear to many data scientists. In this blog post, we will discuss linear regression and hypothesis testing related to t-statistics and f-statistics, and provide an example to help illustrate how these concepts work.

What are linear regression models?

A linear regression model can be defined as the function approximation that represents a continuous response variable as a function of one or more predictor variables. While building a linear regression model, the goal is to identify a linear equation that best predicts or models the relationship between the response or dependent variable and one or more predictor or independent variables.

There are two different kinds of linear regression models. They are as follows:

  • Simple or Univariate linear regression models: These are linear regression models that are used to build a linear relationship between one response or dependent variable and one predictor or independent variable. The form of the equation that represents a simple linear regression model is Y=mX+b, where m is the coefficient of the predictor variable and b is the bias. When considering the linear regression line, m represents the slope and b represents the intercept.
  • Multiple or Multi-variate linear regression models: These are linear regression models that are used to build a linear relationship between one response or dependent variable and more than one predictor or independent variable. The form of the equation that represents a multiple linear regression model is Y=b0+b1X1+ b2X2 + … + bnXn, where bi represents the coefficient of the ith predictor variable. In this type of linear regression model, each predictor variable has its own coefficient that is used to calculate the predicted value of the response variable.

While training linear regression models, the requirement is to determine the coefficients which can result in the best-fitted linear regression line. The learning algorithm used to find the most appropriate coefficients is known as least squares regression . In the least-squares regression method, the coefficients are calculated using the least-squares error function. The main objective of this method is to minimize or reduce the sum of squared residuals between actual and predicted response values. The sum of squared residuals is also called the residual sum of squares (RSS). The outcome of executing the least-squares regression method is coefficients that minimize the linear regression cost function .

The residual, [latex]e_i[/latex], of the ith observation is the difference between the observed value [latex]Y_i[/latex] and the predicted value [latex]\hat{Y_i}[/latex] of the response variable:

[latex]e_i = Y_i - \hat{Y_i}[/latex]

The residual sum of squares can be represented as the following:

[latex]RSS = e_1^2 + e_2^2 + e_3^2 + \ldots + e_n^2[/latex]

The least-squares method represents the algorithm that minimizes the above term, RSS.
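For concreteness, the residuals and RSS can be recovered from a fitted model in R; the sketch below uses simulated data and illustrative names:

    set.seed(10)
    x <- rnorm(50)
    y <- 1 + 2 * x + rnorm(50)

    fit <- lm(y ~ x)
    e   <- y - fitted(fit)   # residuals Y_i - predicted Y_i (same as resid(fit))
    rss <- sum(e^2)          # residual sum of squares minimized by least squares
    rss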

Once the coefficients are determined, can it be claimed that these coefficients are the most appropriate ones for linear regression? The answer is no. After all, the coefficients are only the estimates and thus, there will be standard errors associated with each of the coefficients.  Recall that the standard error is used to calculate the confidence interval in which the mean value of the population parameter would exist. In other words, it represents the error of estimating a population parameter based on the sample data. The value of the standard error is calculated as the standard deviation of the sample divided by the square root of the sample size. The formula below represents the standard error of a mean.

[latex]SE(\mu) = \frac{\sigma}{\sqrt{N}}[/latex]
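As a quick numeric illustration of this formula (using the sample standard deviation in place of the population value):

    x  <- rnorm(100, mean = 5, sd = 2)
    se <- sd(x) / sqrt(length(x))   # standard error of the sample mean
    se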

Thus, without analyzing aspects such as the standard errors associated with the coefficients, it cannot be claimed that the estimated linear regression coefficients are the most suitable ones. This is where hypothesis testing is needed. Before we get into why we need hypothesis testing with the linear regression model, let's briefly look at what hypothesis testing is.

Train a Multiple Linear Regression Model using R

Before getting into understanding the hypothesis testing concepts in relation to the linear regression model, let’s train a multi-variate or multiple linear regression model and print the summary output of the model which will be referred to, in the next section. 

The data used for creating the multiple linear regression model is BostonHousing, which can be loaded in RStudio by installing the mlbench package. The code is shown below:

    install.packages("mlbench")   # install the package that ships the dataset
    library(mlbench)
    data("BostonHousing")

Once the data is loaded, the code shown below can be used to create the linear regression model.

    attach(BostonHousing)
    # Regress log(median home value) on crime rate, river dummy, highway access, and lower-status %
    BostonHousing.lm <- lm(log(medv) ~ crim + chas + rad + lstat)
    summary(BostonHousing.lm)

Executing the above command will result in the creation of a linear regression model with the response variable as medv and predictor variables as crim, chas, rad, and lstat. The following represents the details related to the response and predictor variables:

  • log(medv) : Log of the median value of owner-occupied homes in USD 1000’s
  • crim : Per capita crime rate by town
  • chas : Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
  • rad : Index of accessibility to radial highways
  • lstat : Percentage of the lower status of the population

The following is the output of the summary command, which prints details of the model, including hypothesis testing details for the coefficients (t-statistics) and for the model as a whole (F-statistic).

[Figure: summary() output of the fitted linear regression model, showing the coefficient estimates with their t-statistics and p-values, and the overall F-statistic.]

Hypothesis tests & Linear Regression Models

Hypothesis tests are statistical procedures used to test a claim or assumption about the underlying distribution of a population based on sample data. Here are the key steps for doing hypothesis tests with linear regression models:

  • Hypothesis formulation for T-tests: In the case of linear regression, the claim is made that there exists a relationship between the response and predictor variables, and the claim is represented using non-zero values of the coefficients of the predictor variables in the linear equation or regression model. This is formulated as the alternate hypothesis. Thus, the null hypothesis is that there is no relationship between the response and the predictor variables; hence, the coefficients related to each of the predictor variables are equal to zero (0). So, if the linear regression model is Y = a0 + a1x1 + a2x2 + a3x3, then the null hypothesis for each test states that a1 = 0, a2 = 0, a3 = 0, etc. For each predictor variable, an individual hypothesis test is done to determine whether the relationship between the response and that particular predictor variable is statistically significant based on the sample data used for training the model. Thus, if there are, say, 5 features, there will be five hypothesis tests, each with an associated null and alternate hypothesis.
  • Hypothesis formulation for F-test : In addition, there is a hypothesis test done around the claim that there is a linear regression model representing the response variable and all the predictor variables. The null hypothesis is that the linear regression model does not exist . This essentially means that the value of all the coefficients is equal to zero. So, if the linear regression model is Y = a0 + a1x1 + a2x2 + a3x3, then the null hypothesis states that a1 = a2 = a3 = 0.
  • F-statistics for testing hypothesis for linear regression model: The F-test is used to test the null hypothesis that a linear regression model does not exist, representing the relationship between the response variable y and the predictor variables x1, x2, x3, x4 and x5. The null hypothesis can also be represented as a1 = a2 = a3 = a4 = a5 = 0, i.e., all the slope coefficients are zero. The F-statistic is calculated as a function of the sum of squared residuals for the restricted regression (representing a linear regression model with only the intercept or bias, and all coefficient values set to zero) and the sum of squared residuals for the unrestricted regression (representing the full linear regression model). In the figure above, note the value of the F-statistic, 15.66, with 5 and 194 degrees of freedom.
  • Evaluate t-statistics against the critical value/region: After calculating the value of the t-statistic for each coefficient, it is time to make a decision about whether to accept or reject the null hypothesis. In order for this decision to be made, one needs to set a significance level, which is also known as the alpha level. A significance level of 0.05 is usually set for rejecting the null hypothesis or otherwise. If the value of the t-statistic falls in the critical region, the null hypothesis is rejected. Equivalently, if the p-value comes out to be less than 0.05, the null hypothesis is rejected.
  • Evaluate f-statistics against the critical value/region: The value of the F-statistic and its p-value are evaluated for testing the null hypothesis that the linear regression model representing the response and predictor variables does not exist. If the value of the F-statistic is greater than the critical value at the 0.05 level of significance, the null hypothesis is rejected. This means that the linear model exists with at least one non-zero coefficient.
  • Draw conclusions: The final step of hypothesis testing is to draw a conclusion by interpreting the results in terms of the original claim or hypothesis. If the null hypothesis for one or more predictor variables is rejected, it means that the relationship between the response and that particular predictor variable is statistically significant based on the evidence, i.e., the sample data used for training the model. Similarly, if the F-statistic falls in the critical region and its p-value is less than the alpha value, usually set at 0.05, one can say that there exists a linear regression model. The R sketch below shows how these quantities can be read from the model summary.
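As a sketch of how these quantities are read in practice, the components below are standard parts of the summary of an lm object; here they are pulled from the BostonHousing.lm model fitted earlier (assuming that object is still in the workspace):

    s <- summary(BostonHousing.lm)

    # t-statistics and p-values for each coefficient (T-tests)
    s$coefficients[, c("t value", "Pr(>|t|)")]

    # Overall F-statistic and its degrees of freedom (F-test)
    s$fstatistic

    # p-value of the F-test
    pf(s$fstatistic["value"], s$fstatistic["numdf"], s$fstatistic["dendf"],
       lower.tail = FALSE)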

Why hypothesis tests for linear regression models?

The reasons why we need hypothesis tests for a linear regression model are as follows:

  • By creating the model, we are establishing a new claim about the relationship between the response or dependent variable and one or more predictor or independent variables. To justify the claim, one or more tests are needed. These tests can be termed acts of testing the claim, or in other words, hypothesis tests.
  • One kind of test is required to test the relationship between response and each of the predictor variables (hence, T-tests)
  • Another kind of test is required to test the linear regression model representation as a whole. This is called the F-test.

While training linear regression models, hypothesis testing is done to determine whether the relationship between the response and each of the predictor variables is statistically significant. The coefficients related to each of the predictor variables are determined. Then, individual hypothesis tests are done to determine whether the relationship between the response and a particular predictor variable is statistically significant based on the sample data used for training the model. If the null hypothesis for a particular coefficient is rejected, it means that there is evidence of a relationship between the response and that predictor variable. T-statistics are used for performing the hypothesis tests because the standard deviation of the sampling distribution is unknown. The value of the t-statistic is compared with the critical value from the t-distribution table in order to make a decision about whether to accept or reject the null hypothesis regarding the relationship between the response and a predictor variable. If the value falls in the critical region, the null hypothesis is rejected, which means that the relationship between the response and that predictor variable is statistically significant. In addition to T-tests, an F-test is performed to test the null hypothesis that the linear regression model does not exist and that the value of all the coefficients is zero (0). Learn more about linear regression and the t-test in this blog – Linear regression t-test: formula, example.

Hypothesis Testing | A Step-by-Step Guide with Easy Examples

Published on November 8, 2019 by Rebecca Bevans . Revised on June 22, 2023.

Hypothesis testing is a formal procedure for investigating our ideas about the world using statistics . It is most often used by scientists to test specific predictions, called hypotheses, that arise from theories.

There are 5 main steps in hypothesis testing:

  • State your research hypothesis as a null hypothesis (H0) and an alternate hypothesis (Ha or H1).
  • Collect data in a way designed to test the hypothesis.
  • Perform an appropriate statistical test .
  • Decide whether to reject or fail to reject your null hypothesis.
  • Present the findings in your results and discussion section.

Though the specific details might vary, the procedure you will use when testing a hypothesis will always follow some version of these steps.

After developing your initial research hypothesis (the prediction that you want to investigate), it is important to restate it as a null (H0) and alternate (Ha) hypothesis so that you can test it mathematically.

The alternate hypothesis is usually your initial hypothesis that predicts a relationship between variables. The null hypothesis is a prediction of no relationship between the variables you are interested in.

  • H0: Men are, on average, not taller than women. Ha: Men are, on average, taller than women.

For a statistical test to be valid , it is important to perform sampling and collect data in a way that is designed to test your hypothesis. If your data are not representative, then you cannot make statistical inferences about the population you are interested in.

There are a variety of statistical tests available, but they are all based on the comparison of within-group variance (how spread out the data is within a category) versus between-group variance (how different the categories are from one another).

If the between-group variance is large enough that there is little or no overlap between groups, then your statistical test will reflect that by showing a low p -value . This means it is unlikely that the differences between these groups came about by chance.

Alternatively, if there is high within-group variance and low between-group variance, then your statistical test will reflect that with a high p -value. This means it is likely that any difference you measure between groups is due to chance.

Your choice of statistical test will be based on the type of variables and the level of measurement of your collected data .

For the height example, your statistical test will give you:

  • an estimate of the difference in average height between the two groups.
  • a p-value showing how likely you are to see this difference if the null hypothesis of no difference is true.

Based on the outcome of your statistical test, you will have to decide whether to reject or fail to reject your null hypothesis.

In most cases you will use the p -value generated by your statistical test to guide your decision. And in most cases, your predetermined level of significance for rejecting the null hypothesis will be 0.05 – that is, when there is a less than 5% chance that you would see these results if the null hypothesis were true.

In some cases, researchers choose a more conservative level of significance, such as 0.01 (1%). This minimizes the risk of incorrectly rejecting the null hypothesis ( Type I error ).

The results of hypothesis testing will be presented in the results and discussion sections of your research paper , dissertation or thesis .

In the results section you should give a brief summary of the data and a summary of the results of your statistical test (for example, the estimated difference between group means and associated p -value). In the discussion , you can discuss whether your initial hypothesis was supported by your results or not.

In the formal language of hypothesis testing, we talk about rejecting or failing to reject the null hypothesis. You will probably be asked to do this in your statistics assignments.

However, when presenting research results in academic papers we rarely talk this way. Instead, we go back to our alternate hypothesis (in this case, the hypothesis that men are on average taller than women) and state whether the result of our test did or did not support the alternate hypothesis.

If your null hypothesis was rejected, this result is interpreted as “supported the alternate hypothesis.”

These are superficial differences; you can see that they mean the same thing.

You might notice that we don’t say that we reject or fail to reject the alternate hypothesis . This is because hypothesis testing is not designed to prove or disprove anything. It is only designed to test whether a pattern we measure could have arisen spuriously, or by chance.

If we reject the null hypothesis based on our research (i.e., we find that it is unlikely that the pattern arose by chance), then we can say our test lends support to our hypothesis . But if the pattern does not pass our decision rule, meaning that it could have arisen by chance, then we say the test is inconsistent with our hypothesis .

If you want to know more about statistics , methodology , or research bias , make sure to check out some of our other articles with explanations and examples.

  • Normal distribution
  • Descriptive statistics
  • Measures of central tendency
  • Correlation coefficient

Methodology

  • Cluster sampling
  • Stratified sampling
  • Types of interviews
  • Cohort study
  • Thematic analysis

Research bias

  • Implicit bias
  • Cognitive bias
  • Survivorship bias
  • Availability heuristic
  • Nonresponse bias
  • Regression to the mean

Hypothesis testing is a formal procedure for investigating our ideas about the world using statistics. It is used by scientists to test specific predictions, called hypotheses , by calculating how likely it is that a pattern or relationship between variables could have arisen by chance.

A hypothesis states your predictions about what your research will find. It is a tentative answer to your research question that has not yet been tested. For some research projects, you might have to write several hypotheses that address different aspects of your research question.

A hypothesis is not just a guess — it should be based on existing theories and knowledge. It also has to be testable, which means you can support or refute it through scientific research methods (such as experiments, observations and statistical analysis of data).

Null and alternative hypotheses are used in statistical hypothesis testing . The null hypothesis of a test always predicts no effect or no relationship between variables, while the alternative hypothesis states your research prediction of an effect or relationship.

Understanding the Null Hypothesis for Linear Regression

Linear regression is a technique we can use to understand the relationship between one or more predictor variables and a response variable .

If we only have one predictor variable and one response variable, we can use simple linear regression , which uses the following formula to estimate the relationship between the variables:

ŷ = β 0 + β 1 x

  • ŷ: The estimated response value.
  • β 0 : The average value of y when x is zero.
  • β 1 : The average change in y associated with a one unit increase in x.
  • x: The value of the predictor variable.

Simple linear regression uses the following null and alternative hypotheses:

  • H 0 : β 1 = 0
  • H A : β 1 ≠ 0

The null hypothesis states that the coefficient β 1 is equal to zero. In other words, there is no statistically significant relationship between the predictor variable, x, and the response variable, y.

The alternative hypothesis states that β 1 is not equal to zero. In other words, there is a statistically significant relationship between x and y.

If we have multiple predictor variables and one response variable, we can use multiple linear regression , which uses the following formula to estimate the relationship between the variables:

ŷ = β 0 + β 1 x 1 + β 2 x 2 + … + β k x k

  • β 0 : The average value of y when all predictor variables are equal to zero.
  • β i : The average change in y associated with a one unit increase in x i .
  • x i : The value of the predictor variable x i .

Multiple linear regression uses the following null and alternative hypotheses:

  • H 0 : β 1 = β 2 = … = β k = 0
  • H A : At least one β i ≠ 0

The null hypothesis states that all coefficients in the model are equal to zero. In other words, none of the predictor variables have a statistically significant relationship with the response variable, y.

The alternative hypothesis states that not every coefficient is simultaneously equal to zero.

The following examples show how to decide to reject or fail to reject the null hypothesis in both simple linear regression and multiple linear regression models.

Example 1: Simple Linear Regression

Suppose a professor would like to use the number of hours studied to predict the exam score that students will receive in his class. He collects data for 20 students and fits a simple linear regression model.

The following screenshot shows the output of the regression model:

Output of simple linear regression in Excel

The fitted simple linear regression model is:

Exam Score = 67.1617 + 5.2503*(hours studied)

To determine if there is a statistically significant relationship between hours studied and exam score, we need to analyze the overall F value of the model and the corresponding p-value:

  • Overall F-Value:  47.9952
  • P-value:  0.000

Since this p-value is less than .05, we can reject the null hypothesis. In other words, there is a statistically significant relationship between hours studied and exam score received.

Example 2: Multiple Linear Regression

Suppose a professor would like to use the number of hours studied and the number of prep exams taken to predict the exam score that students will receive in his class. He collects data for 20 students and fits a multiple linear regression model.

Multiple linear regression output in Excel

The fitted multiple linear regression model is:

Exam Score = 67.67 + 5.56*(hours studied) – 0.60*(prep exams taken)

To determine if there is a jointly statistically significant relationship between the two predictor variables and the response variable, we need to analyze the overall F value of the model and the corresponding p-value:

  • Overall F-Value:  23.46
  • P-value:  0.00

Since this p-value is less than .05, we can reject the null hypothesis. In other words, hours studied and prep exams taken have a jointly statistically significant relationship with exam score.

Note: Although the p-value for prep exams taken (p = 0.52) is not significant, prep exams combined with hours studied has a significant relationship with exam score.
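If only the screenshot values are available, the overall p-values can be recovered in R from the reported F-statistics and the degrees of freedom implied by the sample size (a sketch assuming the 20-student samples described above, so the denominator degrees of freedom are n minus the number of predictors minus one):

    # Example 1: simple regression, n = 20, 1 predictor -> df1 = 1, df2 = 18
    pf(47.9952, df1 = 1, df2 = 18, lower.tail = FALSE)

    # Example 2: multiple regression, n = 20, 2 predictors -> df1 = 2, df2 = 17
    pf(23.46, df1 = 2, df2 = 17, lower.tail = FALSE)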

Additional Resources

  • Understanding the F-Test of Overall Significance in Regression
  • How to Read and Interpret a Regression Table
  • How to Report Regression Results
  • How to Perform Simple Linear Regression in Excel
  • How to Perform Multiple Linear Regression in Excel

Hypothesis Test for Regression Slope

This lesson describes how to conduct a hypothesis test to determine whether there is a significant linear relationship between an independent variable X and a dependent variable Y .

The test focuses on the slope of the regression line

Y = Β 0 + Β 1 X

where Β 0 is a constant, Β 1 is the slope (also called the regression coefficient), X is the value of the independent variable, and Y is the value of the dependent variable.

If we find that the slope of the regression line is significantly different from zero, we will conclude that there is a significant relationship between the independent and dependent variables.

Test Requirements

The approach described in this lesson is valid whenever the standard requirements for simple linear regression are met.

  • The dependent variable Y has a linear relationship to the independent variable X .
  • For each value of X, the probability distribution of Y has the same standard deviation σ.
  • The Y values are independent.
  • The Y values are roughly normally distributed (i.e., symmetric and unimodal ). A little skewness is ok if the sample size is large.

The test procedure consists of four steps: (1) state the hypotheses, (2) formulate an analysis plan, (3) analyze sample data, and (4) interpret results.

State the Hypotheses

If there is a significant linear relationship between the independent variable X and the dependent variable Y , the slope will not equal zero.

H o : Β 1 = 0

H a : Β 1 ≠ 0

The null hypothesis states that the slope is equal to zero, and the alternative hypothesis states that the slope is not equal to zero.

Formulate an Analysis Plan

The analysis plan describes how to use sample data to accept or reject the null hypothesis. The plan should specify the following elements.

  • Significance level. Often, researchers choose significance levels equal to 0.01, 0.05, or 0.10; but any value between 0 and 1 can be used.
  • Test method. Use a linear regression t-test (described in the next section) to determine whether the slope of the regression line differs significantly from zero.

Analyze Sample Data

Using sample data, find the standard error of the slope, the slope of the regression line, the degrees of freedom, the test statistic, and the P-value associated with the test statistic. The approach described in this section is illustrated in the sample problem at the end of this lesson.

  • Standard error. The standard error (SE) of the slope is computed from the sample data as:

$$SE = s_{b_1} = \frac{\sqrt{\sum{(y_i - \hat{y}_i)^2}/(n-2)}}{\sqrt{\sum{(x_i - \bar{x})^2}}}$$

  • Slope. Like the standard error, the slope of the regression line will be provided by most statistics software packages.

  • Degrees of freedom. For simple linear regression, the degrees of freedom are DF = n - 2, where n is the number of observations.

  • Test statistic. The test statistic is a t statistic, computed from the slope and its standard error:

t = b 1 / SE

  • P-value. The P-value is the probability of observing a sample statistic as extreme as the test statistic. Since the test statistic is a t statistic, use the t Distribution Calculator to assess the probability associated with the test statistic. Use the degrees of freedom computed above.

Interpret Results

If the sample findings are unlikely, given the null hypothesis, the researcher rejects the null hypothesis. Typically, this involves comparing the P-value to the significance level , and rejecting the null hypothesis when the P-value is less than the significance level.

Test Your Understanding

The local utility company surveys 101 randomly selected customers. For each survey participant, the company collects the following: annual electric bill (in dollars) and home size (in square feet). Output from a regression analysis appears below.

Is there a significant linear relationship between annual bill and home size? Use a 0.05 level of significance.

The solution to this problem takes four steps: (1) state the hypotheses, (2) formulate an analysis plan, (3) analyze sample data, and (4) interpret results. We work through those steps below:

H o : The slope of the regression line is equal to zero.

H a : The slope of the regression line is not equal to zero.

  • Formulate an analysis plan . For this analysis, the significance level is 0.05. Using sample data, we will conduct a linear regression t-test to determine whether the slope of the regression line differs significantly from zero.

We get the slope (b 1 ) and the standard error (SE) from the regression output.

b 1 = 0.55       SE = 0.24

We compute the degrees of freedom and the t statistic, using the following equations.

DF = n - 2 = 101 - 2 = 99

t = b 1 /SE = 0.55/0.24 = 2.29

where DF is the degrees of freedom, n is the number of observations in the sample, b 1 is the slope of the regression line, and SE is the standard error of the slope.

  • Interpret results. Since the P-value (0.0242) is less than the significance level (0.05), we reject the null hypothesis and conclude that there is a significant linear relationship between annual bill and home size. (The R sketch below reproduces this calculation.)
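The same decision can be reproduced in R from the reported slope and standard error (a sketch using the numbers given above):

    b1 <- 0.55
    se <- 0.24
    df <- 101 - 2                             # n - 2 degrees of freedom

    t_stat  <- b1 / se                        # about 2.29
    p_value <- 2 * pt(-abs(t_stat), df = df)  # about 0.024, two-tailed
    p_value < 0.05                            # TRUE, so reject the null hypothesis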

Statistics for Data Science: Key Concepts

Statistics is a foundational component of data science , providing powerful tools for analyzing and interpreting data. Data scientists rely on statistical techniques to draw out meaningful insights from large and complex data sets and to identify patterns and trends that can contribute to informed business decisions and guide future research. Key statistical concepts, such as probability, hypothesis testing, and regression analysis, are essential for understanding the relationships between different variables in a data set and identifying the factors that drive outcome changes.

Statistics is an essential tool for data science as it provides the framework to analyze, interpret, and draw meaningful insights from data. Data scientists use statistical methods to summarize and describe large and complex datasets, identify patterns and relationships, make predictions and forecasts, and evaluate the effectiveness of their models. With statistical analysis, data scientists can better understand the behavior of the data and thus can make informed decisions based on their findings.

Additionally, statistical inference techniques enable data scientists to make generalizations about the population from which they collected the data, even when they only have a sample. Through a solid foundation in statistics, data scientists can make sense of the vast amount of data available to them and avoid flawed or misleading conclusions.

Therefore, statistics play a crucial role in enabling data scientists to unlock the full potential of data and leverage it to drive insights and informed decision-making in various industries.

Learning Statistics for Data Science

Now that we have discussed the importance of statistics for data science, it is time to discuss how to build a solid foundation in statistics. You need to understand several key concepts to understand the fundamentals of statistics for data science . Some of these critical concepts include:

Keep reading to find out more about each of these concepts.

Regarding statistics for data science, probability is a fundamental concept that helps us understand uncertainty and make predictions based on available data. Probability means measuring the likelihood of an event occurring and is expressed as a value between 0 and 1. This concept allows us to quantify our confidence in our predictions based on the available data.

Probability theory is essential for developing statistical models, conducting hypothesis testing, and making informed decisions based on data. Understanding probability helps data scientists interpret the results of statistical analyses and communicate them effectively to stakeholders.

In statistics, sampling is selecting a representative sample or subset of individuals or items from a larger population to make statistical inferences. Because it is often impractical or even impossible to examine an entire population, data scientists use sampling to draw conclusions about the population as a whole.

Sampling methods can be random or non-random, and we use different techniques depending on the research question, the population size, and the level of accuracy or precision required.

The goal of sampling is to obtain a sample representative of the population that accurately estimates the population parameters of interest, such as mean, variance, or proportion.

Tendency and distribution are essential concepts in statistics that describe data’s central tendency and spread. Tendency refers to a data set’s typical value or center and is often measured using metrics such as mean, median, and mode. On the other hand, distribution describes how data is spread out or dispersed and can be represented using tools such as histograms, box plots, or probability distributions.

Understanding both tendency and distribution of data is crucial for making informed decisions and drawing accurate conclusions from data analysis . By examining the tendency and distribution of data, researchers and analysts can identify patterns, outliers, and other essential characteristics that can guide further investigation or action.
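
As a small illustration (the numbers are made up), the common measures of central tendency can be computed with Python's standard library:

```python
# Mean, median, and mode of a small illustrative sample.
import statistics as st

data = [2, 3, 3, 5, 7, 8, 8, 8, 10]
print(st.mean(data))    # arithmetic mean: 6.0
print(st.median(data))  # middle value: 7
print(st.mode(data))    # most frequent value: 8
```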

Hypothesis testing is a statistical method used to determine whether an observed result is statistically significant or simply due to chance. It involves setting up two hypotheses, a null hypothesis (H0) and an alternative hypothesis (Ha), and testing the data to see which hypothesis is more likely to be true. The null hypothesis is typically the default assumption that no significant difference or relationship exists between the tested variables. In contrast, the alternative hypothesis proposes that there is a significant difference or relationship between said variables.

Hypothesis testing allows researchers to make data-driven decisions and draw conclusions based on statistical evidence rather than relying solely on intuition or anecdotal evidence.
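
For instance, a two-sample t-test compares the means of two groups; a minimal sketch with made-up data:

```python
# Two-sample t-test: H0 says the two group means are equal, Ha says they differ.
from scipy import stats

group_a = [23.1, 24.8, 25.4, 22.9, 26.1, 24.0]
group_b = [27.2, 26.5, 28.1, 25.9, 27.8, 26.7]

t_stat, p_value = stats.ttest_ind(group_a, group_b)
alpha = 0.05
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")
print("reject H0" if p_value < alpha else "fail to reject H0")
```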

Variations in statistics refer to the degree of dispersion or spread of data in a sample or population. A higher variation indicates a greater data spread, whereas a lower variation suggests a tighter clustering of values. Various factors can affect variations, including sample size, outliers, and data distribution. Measures of variation, such as range, variance, and standard deviation, provide insights into the diversity and distribution of data points.

Understanding variations in statistics is crucial in drawing accurate conclusions from data and making informed decisions based on empirical evidence.
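
A quick sketch of the measures named above, using an illustrative sample:

```python
# Range, sample variance, and sample standard deviation.
import numpy as np

data = np.array([12.0, 15.0, 14.0, 10.0, 18.0, 16.0, 13.0])
print(data.max() - data.min())  # range
print(data.var(ddof=1))         # sample variance
print(data.std(ddof=1))         # sample standard deviation
```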


Regression is a statistical method used to study the relation between a dependent variable and one or multiple independent variables. It is commonly used in many fields, including finance, social sciences, and engineering. The basic idea of regression is to find a mathematical equation that can describe the relationship between the variables. We then use the equation to make predictions about the dependent variable based on the values of the independent variables.

There are many types of regression, including linear regression, logistic regression, and polynomial regression. The choice of regression model depends on the type of data and the research question. Regression analysis is a powerful tool for understanding complex relationships in data and making predictions.
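
As a minimal illustration of simple linear regression (the data are made up), a least-squares line can be fitted and used for prediction in a few lines:

```python
# Fit y = intercept + slope * x by ordinary least squares and predict a new value.
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 4.3, 6.2, 7.9, 10.1])

slope, intercept = np.polyfit(x, y, deg=1)
print(slope, intercept)          # estimated coefficients
print(intercept + slope * 6.0)   # prediction for x = 6
```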

2 Statistics Books for Data Science

Learning statistics through a book is a great way to begin your journey into the world of data science. Plenty of excellent books cover the fundamental principles of statistics clearly and concisely, with plenty of examples and exercises to help you practice. It’s important to take your time, carefully review each concept, and ensure you fully understand each new idea before moving on to the next. With a good book and a little determination, you’ll be well on your way to becoming a skilled data scientist in no time!

Here are two book recommendations on statistics for data science if you don’t know where to start.

Think Stats by Allen B. Downey is a fantastic book for beginners with a background in Python programming. This book uses clear and concise explanations to cover important statistical concepts, including probability, hypothesis testing, correlation, and regression analysis. In addition, it focuses on practical examples and exercises which will allow you to apply what you have learned to real-world data sets.

Statistics in Plain English by Timothy C. Urdan is an excellent introductory book for those who want to understand statistics without getting bogged down in technical jargon, i.e., through plain English. Using simple language, the book covers a wide range of statistical concepts and methods, including probability, hypothesis testing, correlation, and regression analysis.

In conclusion, statistics play a crucial role in data science, providing the tools and techniques needed to extract meaningful insights from data. Data scientists must have a solid foundation in statistical concepts and methods to analyze and interpret data effectively. By applying statistical techniques to large and complex data sets, data scientists can identify patterns and trends that allow businesses to make informed decisions, drive scientific research, and ultimately contribute to innovation and progress in various fields.


14.4: Hypothesis Test for Simple Linear Regression


  • Maurice A. Geraghty
  • De Anza College

We will now describe a hypothesis test to determine if the regression model is meaningful; in other words, does the value of \(X\) in any way help predict the expected value of \(Y\)?

Simple Linear Regression ANOVA Hypothesis Test

Model Assumptions

  • The residual errors are random and are normally distributed.
  • The standard deviation of the residual error does not depend on \(X\)
  • A linear relationship exists between \(X\) and \(Y\)
  • The samples are randomly selected

Test Hypotheses

\(H_o\):  \(X\) and \(Y\) are not correlated   

\(H_a\):  \(X\) and \(Y\) are correlated   

\(H_o\):  \(\beta_1\) (slope) = 0   

\(H_a\):  \(\beta_1\) (slope) ≠ 0

Test Statistic

\(F=\dfrac{M S_{\text {Regression }}}{M S_{\text {Error }}}\)

\(d f_{\text {num }}=1\)

\(d f_{\text {den }}=n-2\)

Sum of Squares

\(S S_{\text {Total }}=\sum(Y-\bar{Y})^{2}\)

\(S S_{\text {Error }}=\sum(Y-\hat{Y})^{2}\)

\(S S_{\text {Regression }}=S S_{\text {Total }}-S S_{\text {Error }}\)

In simple linear regression, this is equivalent to saying “Are X and Y correlated?”

In reviewing the model, \(Y=\beta_{0}+\beta_{1} X+\varepsilon\), as long as the slope (\(\beta_{1}\)) has any non‐zero value, \(X\) will add value in helping predict the expected value of \(Y\). However, if there is no correlation between X and Y, the value of the slope (\(\beta_{1}\)) will be zero. The model we can use is very similar to One Factor ANOVA.

The results of the test can be summarized in a special ANOVA table:

Example: Rainfall and sales of sunglasses

Design: Is there a significant correlation between rainfall and sales of sunglasses?

Research Hypotheses:

\(H_o\):  Sales and Rainfall are not correlated      \(H_o\):  \(\beta_1\) (slope) = 0

\(H_a\):  Sales and Rainfall are correlated      \(H_a\):  \(\beta_1\) (slope) ≠ 0

A Type I error would be to reject the null hypothesis and claim that rainfall is correlated with sales of sunglasses when they are not correlated. The test will be run at a level of significance (\(\alpha\)) of 5%.

The test statistic from the table will be \(\mathrm{F}=\dfrac{\text { MSRegression }}{\text { MSError }}\). The degrees of freedom for the numerator will be 1, and the degrees of freedom for the denominator will be \(5-2=3\).

The critical value for \(F\) at an \(\alpha\) of 5% with \(df_{num}=1\) and \(df_{den}=3\) is 10.13. Reject \(H_o\) if \(F>10.13\). We will also run this test using the \(p\)-value method with statistical software, such as Minitab.

Data/Results


\(F=341.422 / 12.859=26.551\), which is more than the critical value of 10.13, so Reject \(H_o\). Also, the \(p\)‐value = 0.0142 < 0.05 which also supports rejecting \(H_o\).  

Sales of Sunglasses and Rainfall are negatively correlated.
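
As a cross-check, the F statistic, its p-value, and the critical value can be recomputed from the mean squares reported above; a short sketch, assuming scipy is available:

```python
# Recompute the ANOVA F-test for the rainfall/sunglasses example.
from scipy import stats

ms_regression = 341.422
ms_error = 12.859
df_num, df_den = 1, 3                            # n = 5 observations, so df_den = n - 2 = 3

F = ms_regression / ms_error                     # about 26.55
p_value = stats.f.sf(F, df_num, df_den)          # upper-tail probability, about 0.014
f_critical = stats.f.ppf(0.95, df_num, df_den)   # about 10.13 at alpha = 0.05
print(F, p_value, f_critical)
```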

Hypothesis Testing and Regression Analysis in R

Welcome to Hypothesis Testing and Regression Analysis in R.

In this online module, participants will learn how to conduct hypothesis tests in R, along with correlation and regression analysis. The session will cover the t-test, paired t-test, ANOVA, regression, correlation, and covariance. The workshop is open to all who wish to learn about running data analysis in R; however, some prior basic knowledge of the software is essential.

Proceed to the Preparation page to get started.

Hypothesis testing for varying coefficient models in tail index regression

Regular Article | Open access | Published: 02 April 2024

Koki Momoki & Takuma Yoshida

Abstract

This study examines the varying coefficient model in tail index regression. The varying coefficient model is an efficient semiparametric model that avoids the curse of dimensionality when including large covariates in the model. In fact, the varying coefficient model is useful in mean, quantile, and other regressions. The tail index regression is not an exception. However, the varying coefficient model is flexible, but leaner and simpler models are preferred for applications. Therefore, it is important to evaluate whether the estimated coefficient function varies significantly with covariates. If the effect of the non-linearity of the model is weak, the varying coefficient structure is reduced to a simpler model, such as a constant or zero. Accordingly, the hypothesis test for model assessment in the varying coefficient model has been discussed in mean and quantile regression. However, there are no results in tail index regression. In this study, we investigate the asymptotic properties of an estimator and provide a hypothesis testing method for varying coefficient models for tail index regression.


1 Introduction

In various fields, predicting the high- or low-tail behavior of data distribution is of interest. Examples include events such as heavy rains, large earthquakes, and significant fluctuations in stock prices. Extreme value theory is a standard approach for analyzing the data of such extremal events. Let \(Y_1, Y_2, \ldots , Y_n\) be independent and identically distributed random variables with distribution function F . In extreme value theory, the following maximum domain of attraction assumption is standard: Assume that there exist sequences of constants \(a_n>0\) and \(b_n\in \mathbb {R}\) and a non-degenerate distribution function G such that

for each continuity point y in G . This assumption implies that there exist a constant \(\gamma \in \mathbb {R}\) and a positive function \(\sigma (t)\) such that

for all y for which \(1+\gamma y>0\) , where \(y^*=\sup \{y: F(y)<1\}\in (-\infty , \infty ]\) and the right-hand side for \(\gamma =0\) is interpreted as \(e^{-y}\) (see, Theorem 1.1.6 of de Haan and Ferreira ( 2006 )). The class of distributions on the right-hand side is called the generalized Pareto distribution and the parameter \(\gamma \) is called the extreme value index. Therefore, in extreme value theory, the tail behavior of the data distribution is characterized by the extreme value index \(\gamma \) . Its existing estimators include the Hill estimator (Hill 1975 ), Pickands estimator (Pickands 1975 ), kernel estimator (Csorgo et al. 1985 ), maximum likelihood estimator (Smith 1987 ), and moment estimator (Dekkers et al. 1989 ), etc. It is noteworthy that the generalized Pareto distribution has different features depending on the sign of \(\gamma \) . If \(\gamma >0\) , we have

for all \(y>0\) , where \(\mathcal {L}(y)\) is a slowly varying function at infinity; i.e., \(\mathcal {L}(ys)/\mathcal {L}(y)\rightarrow 1\) as \(y\rightarrow \infty \) for all \(s>0\) . The class of these distributions is called the Pareto-type distribution. This case seems to be common in areas such as finance and insurance, and we frequently observe extremely large values in the data compared to the case of \(\gamma \le 0\) . Therefore, many researchers in extreme value theory have focused on this case. The Hill estimator mentioned above is one of the estimators of the positive extreme value index \(\gamma \) and is widely used in many extreme value studies. In this study, we assume that the extreme value index \(\gamma \) is positive.

In recent years, various regression models of the conditional extreme value index have been studied in the so-called tail index regression, where the tail index refers to the inverse of the extreme value index. The nonparametric tail index regression estimators include Gardes and Girard ( 2010 ), Stupfler ( 2013 ), Daouia et al. ( 2013 ), Gardes and Stupfler ( 2014 ), Goegebeur et al. ( 2014 , 2015 ) and Ma et al. ( 2020 ). For fully nonparametric methods, the curse of dimensionality arises when multiple covariates are used. However, in many applications, extremal events are often triggered by multiple factors, and we hope to consider these factors. To avoid the curse of dimensionality, Wang and Tsai ( 2009 ) studied the parametric tail index regression assuming the linear model. However, in some applications of extreme value theory, the linear model may be too simple to predict the tail behavior of the distribution of the response. As an extension of Wang and Tsai ( 2009 ), Youngman ( 2019 ) studied additive models, Li et al. ( 2022 ) developed partially linear models, Yoshida ( 2023 ) provided single-index models, and Ma et al. ( 2019 ) provided varying coefficient models. The varying coefficient model is useful for analyzing time series and longitudinal data, etc. Because time or location is often important in many applications of extreme value theory, the varying coefficient model is expected to be useful in tail index regression. We are also interested in tail index regression assuming the varying coefficient model.

The varying coefficient models pioneered by Hastie and Tibshirani ( 1993 ) assume that the regression function \(m_Y(\mathbf{{X}}, \mathbf{{T}})\) of interest satisfies

for the given explanatory variable vectors \(\mathbf{{X}}\) and \(\mathbf{{T}}\) , and the response variable Y , where \({\varvec{\theta }}(\cdot )\) is the vector of unknown smooth functions of \(\mathbf{{T}}\) , which is denoted by the coefficient function vector. In mean and quantile regression, many authors have developed varying coefficient models, such as those of Wu et al. ( 1998 ), Fan and Zhang ( 1999 ), Huang et al. ( 2002 ), Huang et al. ( 2004 ), Kim ( 2007 ), Cai and Xu ( 2008 ), Andriyana et al. ( 2014 ), and Andriyana et al. ( 2018 ). Fan and Zhang ( 2008 ) provided a review article on the varying coefficient model. Some of these studies examined not only the estimation methods of the coefficient function, but also the hypothesis testing methods. We denote \({\varvec{\theta }}(\cdot )=(\theta _1(\cdot ), \theta _2(\cdot ), \ldots , \theta _p(\cdot ))^\top \) . The hypothesis test for the constancy of a specific component can be represented as

for an unknown constant \(C_0\) , where \(\mathrm{{H}}_{0\mathrm{{C}}}\) is the null hypothesis and \(\mathrm{{H}}_{1\mathrm{{C}}}\) is the alternative hypothesis. It is particularly important to test the sparsity of a specific covariate, which can be expressed as

where \(\mathrm{{H}}_{0\mathrm{{Z}}}\) is the null hypothesis and \(\mathrm{{H}}_{1\mathrm{{Z}}}\) is the alternative hypothesis. The varying coefficient model is flexible, but simpler models provide an easy interpretation of the data structure in real data analysis. The above hypothesis tests help to reduce the varying coefficient model to a simpler model. In mean and quantile regression, testing methods based on the comparison of the residual sum of squares include Cai et al. ( 2000 ), Fan and Li ( 2001 ), Huang et al. ( 2002 ), Kim ( 2007 ), among others, where they used the bootstrap to implement their test. In mean regression, Fan and Zhang ( 2000 ) proposed the testing method based on the asymptotic distribution of the maximum deviation between the estimated coefficient function and true coefficient function.

In this study, we employ a logarithmic transformation to link the extreme value index of the response variable Y to the explanatory variable vectors \(\mathbf{{X}}\) and \(\mathbf{{T}}\) via

To the best of our knowledge, Ma et al. ( 2019 ) also studied this model. They provided a kernel-type nonparametric estimator of \({\varvec{\theta }}(\mathbf{{T}})\) and established asymptotic normality. However, they did not discuss hypothesis testing. Therefore, there are no results for the hypothesis tests in tail index regression. Our study aims to establish a testing method for varying coefficient models for tail index regression.

The remainder of this paper is organized as follows. Section  2 introduces the local constant (Nadaraya–Watson type) maximum likelihood estimator of the coefficient functions, and Sect.  3 investigates its asymptotic properties. Section  4 introduces the proposed method for testing the structure of the coefficient functions and demonstrates the finite sample performance through simulations. A real example is analyzed in Sect.  5 . All technical details are provided in Appendix.

2 Model and method

2.1 Varying coefficient models in tail index regression

Let \(Y>0\) be the univariate response variable of interest, \(\mathbf{{X}}=(X_1, X_2, \ldots , X_p)^\top \in \mathbb {R}^p\) be the p -dimensional explanatory variable vector, and \(\mathbf{{T}}=(T_1, T_2, \ldots , T_q)^\top \in \mathbb {R}^q\) be the q -dimensional explanatory variable vector. In addition, let \(F(y\mid \mathbf{{x}}, \mathbf{{t}})=P(Y\le y\mid \mathbf{{X}}=\mathbf{{x}}, \mathbf{{T}}=\mathbf{{t}})\) be the conditional distribution function of Y given \((\mathbf{{X}}, \mathbf{{T}})=(\mathbf{{x}}, \mathbf{{t}})\) . We consider the Pareto-type distribution

where \(\gamma (\mathbf{{x}}, \mathbf{{t}})\) is a positive function of \(\mathbf{{x}}\) and \(\mathbf{{t}}\) , and \(\mathcal {L}(y; \mathbf{{x}}, \mathbf{{t}})\) is a covariate-dependent slowly varying function at infinity; i.e., \(\mathcal {L}(ys; \mathbf{{x}}, \mathbf{{t}})/\mathcal {L}(y; \mathbf{{x}}, \mathbf{{t}})\rightarrow 1\) as \(y\rightarrow \infty \) for any \(s>0\) . We assume that the slowly varying function \(\mathcal {L}(y; \mathbf{{x}}, \mathbf{{t}})\) converges to a constant at a reasonably high speed. Specifically, we assume

where \(c_0(\mathbf{{x}}, \mathbf{{t}})\) , \(c_1(\mathbf{{x}}, \mathbf{{t}})\) and \(\beta (\mathbf{{x}}, \mathbf{{t}})\) are functions of \(\mathbf{{x}}\) and \(\mathbf{{t}}\) with \(c_0(\mathbf{{x}}, \mathbf{{t}})>0\) and \(\beta (\mathbf{{x}}, \mathbf{{t}})>0\) , and \(o(y^{-\beta (\mathbf{{x}}, \mathbf{{t}})})\) is a higher-order term. This assumption is called the Hall class (Hall 1982 ). In our study, we adopt the varying coefficient model for the conditional extreme value index \(\gamma (\mathbf{{x}}, \mathbf{{t}})\) as

where \(\mathbf{{x}}=(x_1, x_2, \ldots , x_p)^\top \in \mathbb {R}^p\) , \({\varvec{\theta }}(\mathbf{{t}})=(\theta _0(\mathbf{{t}}), \theta _1(\mathbf{{t}}), \ldots , \theta _p(\mathbf{{t}}))^\top \in \mathbb {R}^{p+1}\) , and \(\theta _j(\mathbf{{t}}),\ j=0, 1, \ldots , p\) are the unknown smooth functions of \(\mathbf{{t}}\) .

2.2 Local constant maximum likelihood estimator

Let \(f(y\mid \mathbf{{x}}, \mathbf{{t}})\) be the conditional probability density function of Y given \((\mathbf{{X}}, \mathbf{{T}})=(\mathbf{{x}}, \mathbf{{t}})\) . If \(\mathcal {L}(\cdot ; \mathbf{{x}}, \mathbf{{t}})\) is differentiable, we obtain

Because \(\mathcal {L}(y; \mathbf{{x}}, \mathbf{{t}})\rightarrow c_0(\mathbf{{x}}, \mathbf{{t}})\) and \(\partial \mathcal {L}(y; \mathbf{{x}}, \mathbf{{t}})/\partial y\rightarrow 0\) as \(y\rightarrow \infty \) by using ( 2.2 ), we have

for sufficiently large \(y>0\) . Let \(\{(Y_i, \mathbf{{X}}_i, \mathbf{{T}}_i),\ i=1, 2, \ldots , n\}\) be an independent and identically distributed random sample with the same distribution as \((Y, \mathbf{{X}}, \mathbf{{T}})\) . We introduce a sufficiently high threshold \(\omega _n>0\) such that \(\omega _n\rightarrow \infty \) as \(n\rightarrow \infty \) and use the responses that exceed it. Let \(f(y\mid \mathbf{{x}}, \mathbf{{t}}, \omega _n)\) be the conditional probability density function of Y given \((\mathbf{{X}}, \mathbf{{T}})=(\mathbf{{x}}, \mathbf{{t}})\) and \(Y>\omega _n\) . Then, we have

for \(y>\omega _n\) . Thus, we can estimate the coefficient function vector \({\varvec{\theta }}(\mathbf{{t}})\) by using the following weighted maximum likelihood approach: Let

where \({\varvec{\theta }}\in \mathbb {R}^{p+1}\) , \(I(\cdot )\) is an indicator function, \(K(\cdot )\) is a kernel function on \(\mathbb {R}^q\) , and \(\mathbf{{H}}_n=\mathrm{{diag}}(h_{n1}, \ldots , h_{nq})\) is a q -order diagonal matrix whose components are bandwidths \(h_{nk},\ k=1, 2, \ldots , q\) such that \(h_{nk}\rightarrow 0\) as \(n\rightarrow \infty \) . We define the estimator of the coefficient function vector \({\varvec{\theta }}(\mathbf{{t}})\) as the minimizer of the objective function \(L_n({\varvec{\theta }})\) . We denote this estimator by \(\widehat{\varvec{\theta }}(\mathbf{{t}})=(\widehat{\theta }_0(\mathbf{{t}}), \widehat{\theta }_1(\mathbf{{t}}), \ldots , \widehat{\theta }_p(\mathbf{{t}}))^\top \in \mathbb {R}^{p+1}\) . Ma et al. ( 2019 ) provided the local linear maximum likelihood estimator. When \(p=0\) and \(q=0\) , the covariate-independent estimator \(\widehat{\theta }_0\) is explicitly obtained, and we have

which is the Hill estimator proposed by Hill ( 1975 ) and is widely used in univariate extreme value theory.
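
As an illustration (not part of the original paper), the classical Hill estimator to which the covariate-free case reduces can be computed in a few lines; the simulated Pareto sample and the choice of k below are purely illustrative.

```python
# Hill estimator of a positive extreme value index from the k largest observations.
import numpy as np

def hill_estimator(y, k):
    """Mean of log-excesses of the k largest values over the (k+1)-th largest."""
    y_desc = np.sort(y)[::-1]                  # descending order statistics
    return np.mean(np.log(y_desc[:k]) - np.log(y_desc[k]))

rng = np.random.default_rng(0)
sample = rng.pareto(a=2.0, size=2000) + 1.0    # Pareto tail with gamma = 1/a = 0.5
print(hill_estimator(sample, k=200))           # should be close to 0.5
```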

Note that the varying coefficient model corresponds to linear and nonparametric models as special cases. When \(q=0\) , ( 2.3 ) is simplified as

where \({\varvec{\theta }}=(\theta _0, \theta _1, \ldots , \theta _p)^\top \in \mathbb {R}^{p+1}\), and \(\theta _j,\ j=0, 1, \ldots , p\) are the regression coefficients. Wang and Tsai (2009) studied this tail index regression model. In contrast, when \(p=0\), we obtain a nonparametric estimator of the positive extreme value index as

which was studied by Goegebeur et al. ( 2014 , 2015 ).

2.3 Bandwidths and threshold selection

The threshold \(\omega _n\) and bandwidths \(h_{nk},\ k=1, \ldots , q\) are tuning parameters that control the balance between the bias and variance of the estimator \(\widehat{\varvec{\theta }}(\mathbf{{t}})\) . A larger value of \(h_{nk}\) or smaller value of \(\omega _n\) leads to more bias, whereas a larger value of \(\omega _n\) or smaller value of \(h_{nk}\) leads to a larger variance. Therefore, these tuning parameters must be appropriately selected.

The threshold selection is needed to obtain a good approximation of (2.4). To achieve this, the discrepancy measure proposed by Wang and Tsai (2009) is suitable. Meanwhile, the choice of the bandwidths controls the smoothness of the estimator. Therefore, we use cross-validation to select the bandwidths, as in other studies using kernel smoothing (e.g., Ma et al. 2019). Thus, we combine the discrepancy measure and cross-validation into the overall tuning-parameter selection method. The algorithm of the tuning parameter selection is as follows. In the first step, we select the bandwidths \(h_{nk},\ k=1, \ldots , q\) by D-fold cross-validation as

which is based on ( 2.5 ), where \(\omega _0\) is a predetermined threshold, \(\lfloor \cdot \rfloor \) is the floor function, and \(\{(Y_i^{(d)}, \mathbf{{X}}_i^{(d)}, \mathbf{{T}}_i^{(d)}),\ i=1, 2, \ldots , \lfloor n/D\rfloor \}\) is the d th test dataset. \(\widehat{\varvec{\theta }}^{(d)}(\cdot ; \omega _0, \mathbf{{H}})\) is the proposed estimator, with \(\omega _n=\omega _0\) and \(\mathbf{{H}}_n=\mathbf{{H}}\) , which is obtained from the d th training dataset. In the second step, we select the threshold \(\omega _n\) using the discrepancy measure. We denote the order statistics of \(\{\exp \{-\exp ((1, \mathbf{{X}}_i^\top )\widehat{\varvec{\theta }}(\mathbf{{T}}_i))\log (Y_i/\omega _n)\}: Y_i>\omega _n,\ i=1, \ldots , n\}\) as \(\widehat{U}_{1, n_0} \le \widehat{U}_{2, n_0} \le \ldots \le \widehat{U}_{n_0, n_0}\) , where \(n_0=\sum _{i=1}^nI(Y_i>\omega _n)\) is the number of responses that exceed the threshold \(\omega _n\) . Because the conditional distribution of \(\exp \{-\exp ((1, \mathbf{{X}}^\top ){\varvec{\theta }}(\mathbf{{T}}))\log (Y/\omega _n)\}\) given \(Y>\omega _n\) is approximately standard uniform, we can regard \(\{\widehat{U}_{l, n_0}\}_{l=1}^{n_0}\) as a sample from the standard uniform distribution. Therefore, we select the threshold \(\omega _n\) as

where \(\{\widehat{U}_{l, n_0}(\omega , \mathbf{{H}}_\mathrm{{CV}})\}_{l=1}^{n_0}\) are \(\{\widehat{U}_{l, n_0}\}_{l=1}^{n_0}\) with \(\omega _n=\omega \) and \(\mathbf{{H}}_n=\mathbf{{H}}_\mathrm{{CV}}\), and \(\widehat{F}(\cdot ; \omega , \mathbf{{H}}_\mathrm{{CV}})\) is an empirical distribution function of \(\{\widehat{U}_{l, n_0}(\omega , \mathbf{{H}}_\mathrm{{CV}})\}_{l=1}^{n_0}\).

3 Asymptotic properties

3.1 Conditions

In this section, we investigate the asymptotic properties of our proposed estimator. The following technical conditions are required: We define \(n_0(\mathbf{{t}})=n\det (\mathbf{{H}}_n)f_\mathbf{{T}}(\mathbf{{t}})P(Y>\omega _n\mid \mathbf{{T}}=\mathbf{{t}})\) and \(n(\mathbf{{t}})=n\det (\mathbf{{H}}_n)f_\mathbf{{T}}(\mathbf{{t}})\) , where \(f_\mathbf{{T}}(\mathbf{{t}})\) is the marginal probability density function of \(\mathbf{{T}}\) . We also define

The kernel function \(K(\cdot )\) is an absolutely continuous function that has compact support and satisfies the conditions

with \(\mathbf{{u}}=(u_1, u_2, \ldots , u_q)^\top \in \mathbb {R}^q\) .

The joint probability density function \(f(y, \mathbf{{x}}, \mathbf{{t}})\) of \((Y, \mathbf{{X}}, \mathbf{{T}})\) and the coefficient function \(\theta _j(\mathbf{{t}})\) have continuous second-order derivative on \(\mathbf{{t}}\) .

Assume \(n_0(\mathbf{{t}})\rightarrow \infty \) and

as \(n\rightarrow \infty \) for all \(\mathbf{{t}}\in \mathbb {R}^q\) , where \(\mathbf{{I}}_{p+1}\) is a \((p+1)\) -order identity matrix and the symbol “ \(\xrightarrow {P}\) ” stands for convergence in probability.

The remainder term \(o(y^{-\beta (\mathbf{{x}}, \mathbf{{t}})})\) defined in (2.2) satisfies

as \(y\rightarrow \infty \) .

For all \(\mathbf{{t}}\in \mathbb {R}^q\) , there exists a nonzero vector \(\mathbf{{b}}(\mathbf{{t}})\in \mathbb {R}^{p+1}\) such that

as \(n\rightarrow \infty \) .

The condition (C.1) is typically used for kernel estimation. The conditions (C.3)–(C.5) correspond to the conditions (C.1)–(C.3) of Wang and Tsai ( 2009 ). The condition (C.3) requires that a certain weak law of large numbers holds. The condition (C.4) regularizes the extreme behavior of the slowly varying function \(\mathcal {L}(y; \mathbf{{x}}, \mathbf{{t}})\) . The condition (C.5) specifies the optimal convergence rates of threshold \(\omega _n\) and bandwidths \(h_{nk},\ k=1, \ldots , q\) .

3.2 Asymptotic properties

The above \(\dot{\varvec{L}}_n({\varvec{\theta }})\) and \(\ddot{\varvec{L}}_n({\varvec{\theta }})\) are the gradient vector and Hessian matrix of the objective function \(L_n({\varvec{\theta }})\) , respectively. The proposed estimator \(\widehat{\varvec{\theta }}(\mathbf{{t}})\) is defined as the minimizer of \(L_n({\varvec{\theta }})\) and satisfies \(\dot{\varvec{L}}_n(\widehat{\varvec{\theta }}(\mathbf{{t}}))={\varvec{0}}\) . Therefore, similar to common approaches for establishing the asymptotic normality of the maximum likelihood estimator, we need to investigate the asymptotic properties of \(\dot{\varvec{L}}_n({\varvec{\theta }})\) and \(\ddot{\varvec{L}}_n({\varvec{\theta }})\) . Let \(\nu =\int K(\mathbf{{u}})^2\mathrm{{d}}{} \mathbf{{u}}\) , \({\varvec{\kappa }}=\int \mathbf{{u}}{} \mathbf{{u}}^\top K(\mathbf{{u}})\mathrm{{d}}{} \mathbf{{u}}\) ,

where \(\mathbf{{t}}=(t_1, t_2, \ldots , t_q)^\top \in \mathbb {R}^q\) and \(k_1, k_2\in \{1, 2, \ldots , q\}\) .

Let us suppose that conditions (C.1)-(C.5) are satisfied; then, as \(n\rightarrow \infty \) ,

where the symbol “ \(\xrightarrow {D}\) ” denotes convergence in distribution,


From Theorems 1 and 2 , we obtain the following asymptotic normality of our proposed estimator \(\widehat{\varvec{\theta }}(\mathbf{{t}})\) :

This result implies that \(\widehat{\varvec{\theta }}(\mathbf{{t}})\) is a consistent estimator of \({\varvec{\theta }}(\mathbf{{t}})\). The convergence rate of \(\widehat{\varvec{\theta }}(\mathbf{{t}})\) to \({\varvec{\theta }}(\mathbf{{t}})\) is of the same order as \([n_0(\mathbf{{t}}){\varvec{\Sigma }}_n(\mathbf{{t}})]^{-1/2}\). Here, \(n_0(\mathbf{{t}})=n\det (\mathbf{{H}}_n)f_\mathbf{{T}}(\mathbf{{t}})P(Y>\omega _n\mid \mathbf{{T}}=\mathbf{{t}})\) is proportional to the number of top-order statistics of the responses used for estimation at \(\mathbf{{t}}\), and \({\varvec{\Sigma }}_n(\mathbf{{t}})\) is defined in Sect. 3.1. The asymptotic bias is caused by two factors. The bias \(\mathbf{{b}}(\mathbf{{t}})\) is caused by the approximation of the tail of the conditional distribution of Y by the Pareto distribution in (2.4), which is related to the convergence rate of the slowly varying function \(\mathcal {L}(\cdot ; \mathbf{{x}}, \mathbf{{t}})\) to the constant \(c_0(\mathbf{{x}}, \mathbf{{t}})\). From the definition of \(\mathbf{{b}}(\mathbf{{t}})\) given in (C.5), we can see that the proposed estimator is more biased for larger \(\gamma (\mathbf{{x}}, \mathbf{{t}})\). In other words, the heavier the tail of the data, the more biased the estimator. Meanwhile, if \(\beta (\mathbf{{x}}, \mathbf{{t}})\) is small, the estimator incurs a large bias. Thus, the bias of our estimator is particularly sensitive to \(\gamma (\mathbf{{x}}, \mathbf{{t}})\) and \(\beta (\mathbf{{x}}, \mathbf{{t}})\). These parameters are related to the second order condition in extreme value theory (see Theorem 3.2.5 of de Haan and Ferreira (2006), Theorems 2 and 3 of Wang and Tsai (2009), and Theorem 2 of Li et al. (2022), to name a few). In contrast, the biases \(\mathbf{{\Lambda }}_n^{(1)}(\mathbf{{t}})\) and \(\mathbf{{\Lambda }}_n^{(2)}(\mathbf{{t}})\) are caused by kernel smoothing.

Our asymptotic normality is comparable to the asymptotic normality of the local linear maximum likelihood estimator of the coefficient function vector proposed by Ma et al. ( 2019 ). The difference between the two estimators is the asymptotic bias. In the asymptotic normality in Ma et al. ( 2019 ), it is assumed that the bias caused by the approximation ( 2.4 ) is negligible, so the bias \(\mathbf{{b}}(\mathbf{{t}})\) does not appear in their asymptotic normality. The essential difference is the bias caused by kernel smoothing. In the case of Ma et al. ( 2019 ), the bias caused by kernel smoothing is \(\mathbf{{\Lambda }}_n^{(2)}(\mathbf{{t}})\) . However, it has the same convergence rate as the bias \(\mathbf{{\Lambda }}_n^{(1)}(\mathbf{{t}})+\mathbf{{\Lambda }}_n^{(2)}(\mathbf{{t}})\) . The asymptotic variances of the two estimators are the same.

4 Testing for structure of the coefficient function

4.1 Testing method

In varying coefficient models, we often hope to test whether each coefficient function \(\theta _j(\cdot )\) is constant or zero. If some \(\theta _j(\mathbf{{t}})\) does not vary across \(\mathbf{{t}}\) , this motivates us to consider models that are simpler than the varying coefficient model. Generally, the hypothesis test can be represented as

for a given known function \(\eta (\cdot )\) , where \(\mathrm{{H}}_0\) is the null hypothesis and \(\mathrm{{H}}_1\) is the alternative hypothesis.

Without loss of generality, we assume that the explanatory variable vector \(\mathbf{{T}}\in \mathbb {R}^q\) is distributed on \([0, 1]^q\subset \mathbb {R}^q\). Then, we apply Lemma 1 of Fan and Zhang (2000) to

where \(\sigma _{nj}(\mathbf{{t}})=E[X_j^2I(Y>\omega _n)\mid \mathbf{{T}}=\mathbf{{t}}],\ j=0, 1, \ldots , p\) and \(X_0\equiv 1\) . The following conditions are required:

For all large \(n\in \mathbb {N}\) , the function \(\sigma _{nj}(\mathbf{{t}})\) is bounded away from zero for all \(\mathbf{{t}}\in [0, 1]^q\) and has a bounded partial derivative.

\(\lim _{n\rightarrow \infty }\sup _\mathbf{{t}}E[|X_j|^sI(Y>\omega _n)\mid \mathbf{{T}}=\mathbf{{t}}]<\infty \) for some \(s>2\) .

Under the conditions (C.1)-(C.7), if \(h_n:=h_{nk}=n^{-b},\ k=1, \ldots , q\) , for some \(0<b<1-2/s\) , we have

as \(n\rightarrow \infty \) , where

(see, Rosenblatt 1976 ) and

From Theorem 3 , we now have \(\widehat{\varvec{\theta }}(\mathbf{{t}})\rightarrow ^P{\varvec{\theta }}(\mathbf{{t}})\) as \(n\rightarrow \infty \) . By the first-order Taylor expansion around \(\widehat{\theta }_j(\mathbf{{t}})=\theta _j(\mathbf{{t}})\) , we obtain

The left-hand side of the above equation is zero because \(\widehat{\varvec{\theta }}(\mathbf{{t}})=(\widehat{\theta }_0(\mathbf{{t}}), \widehat{\theta }_{1}(\mathbf{{t}}), \ldots , \widehat{\theta }_p(\mathbf{{t}}))^\top \) is the minimizer of \(L_n({\varvec{\theta }})\) . From Theorems 2 and 3 , we also have \(\partial ^2L_n({\varvec{\theta }})/\partial \theta _j^2\mid _{{\varvec{\theta }}=(\widehat{\theta }_0(\mathbf{{t}}), \ldots , \widehat{\theta }_{j-1}(\mathbf{{t}}), \theta _j(\mathbf{{t}}), \widehat{\theta }_{j+1}(\mathbf{{t}}), \ldots , \widehat{\theta }_p(\mathbf{{t}}))^\top }\rightarrow ^Pn(\mathbf{{t}})\sigma _{nj}(\mathbf{{t}})\) as \(n\rightarrow \infty \) . Consequently, we have

This implies that \(\psi (\mathbf{{t}})\) in Theorem 4 can be approximated as

From this result, \(E[\psi (\mathbf{{t}})]\) is asymptotically equivalent to the j th component of \(-\mathbf{{b}}(\mathbf{{t}})-[n_0(\mathbf{{t}}){\varvec{\Sigma }}_n(\mathbf{{t}})]^{1/2}\sum _{l=1}^2\mathbf{{\Lambda }}_n^{(l)}(\mathbf{{t}})\) . This bias involves many unknown parameters. In particular, \(\beta (\mathbf{{x}}, \mathbf{{t}})\) included in \(\mathbf{{b}}(\mathbf{{t}})\) corresponds to the so-called second order parameter (see Gomes et al. 2002 ). However, the estimation method of the second order parameter has not yet been developed in the context of the tail index regression. Thus, at the present stage, checking that (C.5) is satisfied and estimating \(E[\psi (\mathbf{{t}})]\) are challenging and are posited as future work. Therefore, in this paper, we assume that \(E[\psi (\mathbf{{t}})]\) is zero, similar to Wang and Tsai ( 2009 ). Then, Theorem 4 can be used to test if ( 4.1 ). Under the null hypothesis \(\mathrm{{H}}_0: \theta _j(\mathbf{{t}})\equiv \eta (\mathbf{{t}})\) , we use the test statistic

where \(\widehat{[n(\mathbf{{t}})\sigma _{nj}(\mathbf{{t}})]}\) is the kernel estimator of \(n(\mathbf{{t}})\sigma _{nj}(\mathbf{{t}})\) based on (C.3). For a given significance level \(\alpha \) , we reject the null hypothesis \(\mathrm{{H}}_0\) if \(\widetilde{T}<-\log \{-0.5\log (\alpha /2)\}=e_{\alpha /2}\) or \(\widetilde{T}>-\log \{-0.5\log (1-\alpha /2)\}=e_{1-\alpha /2}\) .
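As an illustration (not from the paper), the two critical values are simple closed-form expressions; for \(\alpha =0.05\) they are approximately \(-0.61\) and \(4.37\).

```python
# Critical values e_x = -log(-0.5 * log(x)) for the maximum-deviation test.
import math

def critical_value(x):
    return -math.log(-0.5 * math.log(x))

alpha = 0.05
print(critical_value(alpha / 2))      # e_{alpha/2}, about -0.61
print(critical_value(1 - alpha / 2))  # e_{1-alpha/2}, about 4.37
```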

As mentioned above, we are mainly interested in the following two hypothesis tests: One is

If the null hypothesis \(\mathrm{{H}}_{0\textrm{Z}}\) is not rejected, the corresponding \(X_j\) may not be important for predicting the tail behavior of the distribution of the response Y . Thus, this can help judge the sparsity of a particular covariate. The other is

for an unknown constant \(C_0\) without prior knowledge. Under the null hypothesis \(\mathrm{{H}}_{0\mathrm{{C}}}\) , we estimate the unknown constant \(C_0\) as the average of the estimates \(\{\widehat{\theta }_j(\mathbf{{t}}_l)\}_{l=1}^L\) , where \(\mathbf{{t}}_l,\ l=1, 2, \ldots , L\) are equally spaced points in \([0, 1]^q\) . If the null hypothesis \(\mathrm{{H}}_{0\textrm{C}}\) is not rejected, it motivates us to adopt a simpler model that considers the coefficient function \(\theta _j(\cdot )\) to be constant.

The simultaneous test from Theorem 4 is more rigorous than the test statistic based on the residual sum of squares (see Cai et al. 2000 ). Here, we consider the separate hypotheses for each coefficient. One might think that the single hypothesis test on all coefficients would be of interest. However, such an extension is really difficult because we have to consider the distribution of \(\sup _\mathbf{{t}}\{\widehat{\theta }_0(\mathbf{{t}}), \widehat{\theta }_1(\mathbf{{t}}), \ldots , \widehat{\theta }_p(\mathbf{{t}})\}\) . In fact, such a method has not even been studied in the context of mean regression. Thus, the development of a simultaneous testing method into a single hypothesis test on all coefficient functions is posited as an important future work.

4.2 Simulation

We ran a simulation study to demonstrate the finite sample performance of the proposed estimator and test statistic. We present the results for the three model settings. In all settings, we simulated the responses \(\{Y_i\}_{i=1}^n\) using the following conditional distribution function:

where \(\log \gamma (\mathbf{{x}}, \mathbf{{t}})^{-1}=(1, \mathbf{{x}}^\top ){\varvec{\theta }}(\mathbf{{t}})\) . This conditional distribution function satisfies ( 2.2 ) with \(c_0(\mathbf{{x}},\mathbf{{t}})=1+\delta \) , \(c_1(\mathbf{{x}}, \mathbf{{t}})=-\delta (1+\delta )\) , and \(\beta (\mathbf{{x}}, \mathbf{{t}})=\gamma (\mathbf{{x}}, \mathbf{{t}})^{-1}\) . If \(\delta \ne 0\) , the above conditional distribution is not the Pareto distribution; therefore, we need to introduce the threshold \(\omega _n\) appropriately. Otherwise, modeling bias occurs, resulting in less accuracy in the estimation. We simulated the predictors \(\{(X_{i1}, X_{i2}, \ldots , X_{ip})\}_{i=1}^n\) based on the following procedure:

where \(\{(Z_{i1}, Z_{i2}, \ldots , Z_{ip})\}_{i=1}^n\) is an independent sample from the multivariate normal distribution with \(E[Z_{ij}]=0\) and \(\mathrm{{cov}}[Z_{ij_1}, Z_{ij_2}]=0.5^{|j_1-j_2|}\) , and \(\Phi (\cdot )\) is the cumulative distribution function of a standard normal. Consequently, for \(j=1, 2, \ldots , p\) , \(\{X_{ij}\}_{i=1}^n\) is uniformly distributed on \([-\sqrt{3}, \sqrt{3}]\) with unit variance. Meanwhile, we simulated the predictors \(\{(T_{i1}, T_{i2}, \ldots , T_{iq})\}_{i=1}^n\) from a uniform distribution on \([-0.2, 1.2]^q\subset \mathbb {R}^q\) with \(\mathrm{{cov}}[T_{ik_1}, T_{ik_2}]=0\) .

To measure the goodness of the estimator \(\widehat{\varvec{\theta }}(\mathbf{{t}})\) , we calculated the following empirical mean square error based on \(M=100\) simulations:

where \(\widehat{\theta }_j^{(m)}(\cdot )\) is the estimate of \(\theta _j(\cdot )\) using the m th dataset and \(\{\mathbf{{t}}_l\}_{l=1}^L\) are equally spaced points in \([0, 1]^q\) . In addition, to evaluate the performance of the test statistic, we obtained the probability of error as follows. When the null hypothesis is true, the empirical probability of the Type I error is defined as

where \(\widetilde{T}_m\) is the test statistic \(\widetilde{T}\) using the m th dataset. Meanwhile, when the null hypothesis is false, the empirical probability of the Type II error is given by

Now, the null hypotheses of interest, \(\mathrm{{H}}_{0\mathrm{{C}}}\) and \(\mathrm{{H}}_{0\mathrm{{Z}}}\) , are defined in Sect.  4.1 . Accordingly, if the null hypothesis \(\mathrm{{H}}_{0\mathrm{{C}}}\) is true, i.e., the given coefficient function \(\theta _j(\mathbf{{t}})\) is constant, we provide E1 to examine the performance of the constancy test; if not, E2 is provided. Similarly, if the null hypothesis \(\mathrm{{H}}_{0\mathrm{{Z}}}\) is true, i.e., \(\theta _j(\mathbf{{t}})\equiv 0\) , E1 is used to evaluate the accuracy for the sparsity test; if not, the result for \(\mathrm{{H}}_{0\mathrm{{Z}}}\) is given as E2.

In the first model setting, we set \(p=3\) and \(q=1\) and defined the coefficient functions \(\theta _j(\cdot ),\ j=1, 2, 3\) as

where the intercept term \(\theta _0(t)\) was not considered. We employed the Epanechnikov kernel in the proposed estimator. In the estimation process, we selected the threshold \(\omega _n\) and bandwidth \(h_n\) using the procedure described in Sect.  2.3 . We set the pre-determined sample fraction to \(n_0/n=0.2\) in \(D=20\) -fold cross-validation, where \(n_0=\sum _{i=1}^nI(Y_i>\omega _n)\) . Table 1 shows the calculated MSEs and empirical probabilities of error for each coefficient function \(\theta _j(\cdot )\) when \(\delta =0.1, 0.25, 0.5\) and \(n=200, 500, 1000\) . For each coefficient function \(\theta _j(\cdot )\) , the calculated MSEs improved as n increased. This result is desirable and suggests the consistency of the proposed estimator. Note that when testing the null hypothesis \(\mathrm{{H}}_{0\mathrm{{C}}}\) , we must estimate the unknown constant \(C_0\) . Since the maximum deviation between \(\widehat{\theta }_j(t)\) and the estimated \(C_0\) tends to be smaller than the maximum deviation between \(\widehat{\theta }_j(t)\) and the true value \(C_0\) , the empirical probabilities of the Type I error were smaller for the null hypothesis \(\mathrm{{H}}_{0\mathrm{{C}}}\) than for the null hypothesis \(\mathrm{{H}}_{0\mathrm{{Z}}}\) . In all settings, the empirical probability of the Type II error improved as n increased.

The second model setting focuses on the case where p is larger than in the first model setting. We set \(p=10\) and \(q=1\) and defined the coefficient functions \(\theta _j(\cdot ),\ j=1, 2, \ldots , 10\) as

where the intercept term \(\theta _0(t)\) was not considered. The kernel function was the Epanechnikov kernel, and the tuning parameters were selected in the same manner as in the first model setting. Table 2 shows the calculated MSEs and empirical probabilities of error for each coefficient function \(\theta _j(\cdot )\) when \(\delta =0.1, 0.25, 0.5\) and \(n=200, 500, 1000\) . The accuracy of the estimator and test statistic improved as n increased, with no significant deterioration compared to the first model setting with \(p=3\) , indicating that the proposed model can avoid the curse of dimensionality even when the dimension p is large. Figure  1 shows the results of the estimation. The two dotted lines are plots of the 5th and 95th largest estimates of the \(M=100\) estimates at each point \(t\in [0, 1]\) . The average estimates (dashed line) resembled the true value (solid line).

In the third model setting, we set \(p=2\) and \(q=2\) and defined the coefficient functions \(\theta _j(\cdot ),\ j=0, 1, 2\) as

We employed the kernel function of the Epanechnikov type as follows:

The tuning parameters were selected in the same manner as in the first model setting. Table 3 shows the calculated MSEs and empirical probabilities of error for each coefficient function \(\theta _j(\cdot )\) when \(\delta =0.1, 0.25, 0.5\) and \(n=3000, 5000\) . As with the first and second settings, the accuracy of the estimator and test statistic improved as n increased.

We note that Tables 1 , 2 , and 3 show the results of the hypothesis tests when the tuning parameters are automatically selected based on each dataset. As a result, the empirical probability of the Type I error tended to be smaller than the given significance level \(\alpha =0.05\) in many settings, and thus the results seem to be conservative. However, although ad hoc selection of the tuning parameters will yield more reasonable results about the Type I error, such tuning parameters cannot be determined in real data analysis.

Figure 1. The estimated coefficient functions in the second model setting with \(\delta =0.5\) and \(n=1000\): the true value (solid line), average estimates (dashed line), and empirical 95% confidence interval calculated from the estimates (dotted lines).

In this study, we used the plug-in test, but the bootstrap method may also be useful for testing. However, as described in de Haan and Zhou ( 2022 ), it is known that bootstrap methods do not always work efficiently in the context of extreme value theory. In addition, the effectiveness of bootstrapping in extreme value theory has only been partially revealed. Thus, at the present stage, the bootstrap test in our model is posited as future research.

Figure 2. Histograms of the response Y for male (top left panel) and female (top right panel); the bottom two panels show the histograms of the response Y greater than 15 for male (bottom left panel) and female (bottom right panel).

Figure 3. Time series plots of Y for male (left panel) and female (right panel), where Y exceeds the threshold \(\omega _n\).

5 Application

In this section, we apply the proposed method to a real dataset on white blood cells. The dataset is available in Kaggle ( https://www.kaggle.com/amithasanshuvo/cardiac-data-nhanes ). White blood cells play a role in processing foreign substances such as bacteria and viruses that have invaded the body, and are a type of blood cell that is indispensable for maintaining the normal immune function of the human body. Therefore, if the white blood cell count is abnormal, diseases may be suspected. The top left and right panels of Fig.  2 show histograms of the white blood cell counts for \(n=18{,}047\) males and \(n=19{,}032\) females aged 20 to 85, respectively, and the bottom two panels show histograms for those over 15 ( \(\times 10^3/\upmu \) L). We can judge whether the tails of these distributions have a positive extreme value index by comparing them to the normal distribution with a zero extreme value index. In many extreme value studies, kurtosis is often used. The sample kurtosis was about 403.8 for males and about 38.3 for females, indicating that the right tails of these distributions are heavy. In addition, Fig.  3 shows plots of the subject’s age and white blood cell count, suggesting that the number of abnormal white blood cell counts tends to increase with age.

The dataset also contains percentages by type: neutrophils, eosinophils, basophils, monocytes, and lymphocytes. White blood cell differentiation is a clinical test that identifies the types of white blood cells that cause an abnormal white blood cell count. These five types have different immune functions and can help detect certain diseases. The sample averages were about 58.02, 3.10, 0.69, 8.39, and 29.84% for males and about 58.70, 2.57, 0.71, 7.47, and 30.59% for females, respectively. Neutrophils and lymphocytes comprised the majority of white blood cells, and the correlation coefficient calculated from the transformed observations, as described below, was approximately \(-0.93\) for males and \(-0.95\) for females. In other words, there was a strong negative correlation between the percentages of neutrophils and lymphocytes. In this analysis, we define the response Y as the white blood cell count; the predictors \(X_1\) , \(X_2\) , \(X_3\) and \(X_4\) as the percentages of eosinophils, basophils, monocytes and lymphocytes in the white blood cells; and the predictor T as age. We denote \(\mathbf{{X}}=(X_1, X_2, X_3, X_4)^\top \) .

Figure 4. Three-dimensional scatter plots of \((X_j, T, Y),\ j=1, 2, 3, 4\) with \(Y>\omega _n\) for male. For the top left, top right, bottom left and bottom right panels, \(X_j\) is the percentage of eosinophils, basophils, monocytes and lymphocytes in the white blood cells, respectively.

Figure 5. Three-dimensional scatter plots of \((X_j, T, Y),\ j=1, 2, 3, 4\) with \(Y>\omega _n\) for female. For the top left, top right, bottom left and bottom right panels, \(X_j\) is the percentage of eosinophils, basophils, monocytes and lymphocytes in the white blood cells, respectively.

Figures 4 and 5 show the three-dimensional scatter plots of each \((X_j, T, Y)\) for male and female, respectively. As shown in these figures, the predictors \(X_1\) , \(X_2\) , \(X_3\) and \(X_4\) had many outliers. However, excluding these outliers also excludes the extreme values of the response Y . Therefore, we apply the normal score transformation to \(X_j\) . That is, if \(X_{ij}\) is the \(R_{ij}\) -th largest in the j th predictor sample \(\{X_{ij}\}_{i=1}^n\) , \(X_{ij}\) is redefined as

where all observations are jittered by uniform noise before applying the normal score transformation. Consequently, the redefined predictors \(X_1\) , \(X_2\) , \(X_3\) , and \(X_4\) are normally distributed. Wang and Tsai ( 2009 ) applied a similar transformation in their analysis of real data.
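
A sketch of a rank-based normal score transformation of this kind is given below (one common convention; the paper's exact displayed formula is not reproduced here, and the jitter scale is illustrative):

```python
# Rank-based normal score transformation with a small uniform jitter to break ties.
import numpy as np
from scipy import stats

def normal_scores(x, rng, jitter=1e-6):
    x = np.asarray(x, dtype=float)
    x = x + rng.uniform(-jitter, jitter, size=x.size)  # break ties before ranking
    ranks = stats.rankdata(x)                          # ranks 1, ..., n
    return stats.norm.ppf(ranks / (x.size + 1))        # map ranks to normal quantiles

rng = np.random.default_rng(1)
print(normal_scores([3.0, 3.0, 10.0, 1.0, 7.0], rng))
```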

Figure 6. The estimated coefficient functions (solid line) and their 95% confidence intervals (dashed lines) with bias ignored, for male (first column) and female (second column).

We assume that the conditional distribution function of Y given \((\mathbf{{X}}, T)=(\mathbf{{x}}, t)\) satisfies

where \(\mathcal {L}(\cdot ; \mathbf{{x}}, t)\) is a slowly varying function satisfying ( 2.2 ), and

where \(\mathbf{{x}}=(x_1, x_2, x_3, x_4)^\top \in \mathbb {R}^4\), and \(\theta _j(t),\ j=0, 1, 2, 3, 4\) are unknown smooth functions of t. The aim of the analysis is to investigate the effect of \(X_j\) on the extreme values of Y, where the effect of \(X_j\) varies with T. To do this, we first estimate the unknown coefficient functions \(\theta _j(\cdot ),\ j=0, 1, 2, 3, 4\). Then, we select the threshold \(\omega _n\) and bandwidth \(h_n\) using the procedure described in Sect. 2.3. We employ the Epanechnikov kernel in the proposed estimator and set the pre-determined sample fraction to \(n_0/n=0.030\) in \(D=20\)-fold cross-validation, where \(n_0=\sum _{i=1}^nI(Y_i>\omega _n)\). We obtained the optimal tuning parameters as \((h_n, n_0/n)=(0.21, 0.042)\) for male and \((h_n, n_0/n)=(0.30, 0.036)\) for female. Figure 6 shows the estimated coefficient functions \(\widehat{\theta }_j(\cdot ),\ j=0, 1, 2, 3, 4\) by the solid line and the following pointwise 95% confidence intervals, computed from the asymptotic normality of the proposed estimator, by the dashed lines:

where the bias is ignored, \(n(t)\sigma _{nj}(t)\) defined in Sect. 4.1 is estimated based on (C.3), and \(\nu =\int K(u)^2\mathrm{{d}}u\). For all coefficient functions, the trends were similar for male and female. The decreasing trend in the estimated intercept term \(\widehat{\theta }_0(\cdot )\) indicates that the number of abnormal white blood cell counts tends to increase with age. Some of the estimated coefficient functions deviated from zero and varied with age.

Table 4 presents the results of the statistical hypothesis tests for sparsity and constancy, as defined in Sect.  4.1 . For the significance level \(\alpha =0.05\) , we reject the null hypothesis if \(\widetilde{T}<-0.61\) or \(\widetilde{T}>4.37\) . The null hypothesis \(\mathrm{{H}}_{0\textrm{Z}}\) for sparsity was rejected for all coefficient functions, except \(\theta _2(\cdot )\) for both male and female. In addition, the null hypothesis \(\mathrm{{H}}_{0\textrm{C}}\) for constancy was rejected for \(\theta _1(\cdot )\) and \(\theta _4(\cdot )\) for male and \(\theta _0(\cdot )\) and \(\theta _4(\cdot )\) for female. Remarkably, eosinophils and monocytes, which represented a small percentage of white blood cells, were associated with abnormal white blood cell counts.

Figure 7. Q–Q plots for the proposed model for male (left panel) and female (right panel).

Figure 8. Q–Q plots for the linear model proposed by Wang and Tsai (2009) for male (left panel) and female (right panel).

We evaluate the goodness of fit of the model using the Q–Q plot (quantile–quantile plot). We regard \(\{\exp ((1, \mathbf{{X}}_i^\top )\widehat{\varvec{\theta }}(T_i))\log (Y_i/\omega _n): Y_i>\omega _n,\ i=1, \ldots , n\}\) as a random sample from the standard exponential distribution. Figure  7 shows plots of these empirical and theoretical quantiles. The two dashed lines show the pointwise 95% confidence intervals computed in the simulations. We can infer that the better the plots are aligned on a straight line, the better the model fits the data. Most of the plots were within the 95% confidence interval and the goodness of fit of the model did not seem to be bad. In contrast, Fig.  8 shows the plots for the linear model proposed by Wang and Tsai ( 2009 ), where the predictors are defined as T scaled on [0, 1], \(X_1\) , \(X_2\) , \(X_3\) and \(X_4\) . In this case, many plots were outside the 95% confidence interval and deviated significantly from the straight line, indicating that our model fits the data better.

Finally, because the null hypotheses \(\mathrm{{H}}_{0\textrm{Z}}\) and \(\mathrm{{H}}_{0\textrm{C}}\) were not rejected for \(\theta _2(\cdot )\) and \(\theta _3(\cdot )\), respectively, we consider simpler models. We first consider the model

which assumes the sparsity of \(X_2\). For model (5.1), the discrepancy measure described in Sect. 2.3 was approximately \(4.322\times 10^{-4}\) for males and \(3.015\times 10^{-4}\) for females. For model (5.2), it was approximately \(3.730\times 10^{-4}\) for males and \(3.017\times 10^{-4}\) for females, with \((h_n, n_0/n)=(0.22, 0.042)\) for males and \((h_n, n_0/n)=(0.30, 0.036)\) for females. For females, the discrepancy measure was almost the same for the two models, whereas for males it was smaller for model (5.2) than for model (5.1). Moreover, we consider the model

where \(\widehat{\theta }_3\), treated as a known constant, is the average of the estimates \(\{\widehat{\theta }_3(t_l)\}_{l=1}^L\) obtained from model (5.1). For model (5.3), the discrepancy measure was approximately \(3.628\times 10^{-4}\) for males and \(3.104\times 10^{-4}\) for females, with \((h_n, n_0/n)=(0.19, 0.042)\) for males and \((h_n, n_0/n)=(0.30, 0.036)\) for females. The discrepancy measure for males was again smaller for model (5.3) than for model (5.1). Therefore, from the point of view of the discrepancy measure, the data structure may be explained by a simpler model.
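Both the tuning parameters \((h_n, n_0/n)\) and the model comparisons above rely on the \(D\)-fold cross-validated discrepancy measure of Sect. 2.3, which is not reproduced in this section. The following is only a minimal sketch of such a grid search, with `fit_varying_coef` and `discrepancy` as hypothetical placeholders for the estimator and the criterion.

```python
import numpy as np

def select_tuning(Y, X, T, h_grid, frac_grid, fit_varying_coef, discrepancy, D=20, seed=0):
    """Grid search for (h_n, n0/n) by D-fold cross-validation.

    `fit_varying_coef(Y, X, T, h, omega)` and `discrepancy(theta_hat, Y, X, T, omega)`
    are hypothetical placeholders for the estimator and the criterion of Sect. 2.3.
    """
    rng = np.random.default_rng(seed)
    n = len(Y)
    folds = np.array_split(rng.permutation(n), D)
    best_pair, best_value = None, np.inf
    for h in h_grid:
        for frac in frac_grid:
            # Threshold chosen so that roughly a fraction `frac` of responses exceed it.
            omega = np.quantile(Y, 1.0 - frac)
            cv = 0.0
            for test in folds:
                train = np.setdiff1d(np.arange(n), test)
                theta_hat = fit_varying_coef(Y[train], X[train], T[train], h, omega)
                cv += discrepancy(theta_hat, Y[test], X[test], T[test], omega)
            if cv < best_value:
                best_pair, best_value = (h, frac), cv
    return best_pair
```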

Data availability

The data are available online.

Code availability

If requested by the reviewers, we will submit the code.

Andriyana Y, Gijbels I, Verhasselt A (2014) P-splines quantile regression estimation in varying coefficient models. TEST 23:153–194. https://doi.org/10.1007/s11749-013-0346-2


Andriyana Y, Gijbels I, Verhasselt A (2018) Quantile regression in varying-coefficient models: non-crossing quantile curves and heteroscedasticity. Stat Pap 59:1589–1621. https://doi.org/10.1007/s00362-016-0847-7

Cai Z, Xu X (2008) Nonparametric quantile estimations for dynamic smooth coefficient models. J Am Stat Assoc 103:1595–1608. https://doi.org/10.1198/016214508000000977

Cai Z, Fan J, Yao Q (2000) Functional-coefficient regression models for nonlinear time series. J Am Stat Assoc 95:941–956. https://doi.org/10.1080/01621459.2000.10474284

Csorgo S, Deheuvels P, Mason D (1985) Kernel estimates of the tail index of a distribution. Ann Stat 13:1050–1077. https://doi.org/10.1214/aos/1176349656

de Haan L, Ferreira A (2006) Extreme value theory: an introduction. Springer, New York. https://doi.org/10.1007/0-387-34471-3


de Haan L, Zhou C (2022) Bootstrapping extreme value estimators. J Am Stat Assoc. https://doi.org/10.1080/01621459.2022.2120400

Daouia A, Gardes L, Girard S (2013) On kernel smoothing for extremal quantile regression. Bernoulli 19:2557–2589. https://doi.org/10.3150/12-BEJ466

Dekkers ALM, Einmahl JHJ, de Haan L (1989) A moment estimator for the index of an extreme-value distribution. Ann Stat 17:1833–1855. https://doi.org/10.1214/aos/1176347397

Fan J, Li R (2001) Variable selection via nonconcave penalized likelihood and its oracle properties. J Am Stat Assoc 96:1348–1360. https://doi.org/10.1198/016214501753382273

Fan J, Zhang W (1999) Statistical estimation in varying coefficient models. Ann Stat 27:1491–1518. https://doi.org/10.1214/aos/1017939139

Fan J, Zhang W (2000) Simultaneous confidence bands and hypothesis testing in varying-coefficient models. Scand J Stat 27:715–731. https://doi.org/10.1111/1467-9469.00218

Fan J, Zhang W (2008) Statistical methods with varying coefficient models. Stat Interface 1:179–195. https://doi.org/10.4310/SII.2008.v1.n1.a15

Fan J, Zhang C, Zhang J (2001) Generalized likelihood ratio statistics and Wilks phenomenon. Ann Stat 29:153–193. https://doi.org/10.1214/aos/996986505

Gardes L, Girard S (2010) Conditional extremes from heavy-tailed distributions: application to the estimation of extreme rainfall return levels. Extremes 13:177–204. https://doi.org/10.1007/s10687-010-0100-z

Gardes L, Stupfler G (2014) Estimation of the conditional tail index using a smoothed local hill estimator. Extremes 17:45–75. https://doi.org/10.1007/s10687-013-0174-5

Goegebeur Y, Guillou A, Schorgen A (2014) Nonparametric regression estimation of conditional tails: random covariate case. Statistics 48:732–755. https://doi.org/10.1080/02331888.2013.800064

Goegebeur Y, Guillou A, Stupfler G (2015) Uniform asymptotic properties of the nonparametric regression estimator of conditional tails. Annales de l'Institut Henri Poincaré, Probabilités et Statistiques 51:1190–1213. https://doi.org/10.1214/14-AIHP624

Gomes M, de Haan L, Peng L (2002) Semi-parametric estimation of the second order parameter in statistics of extremes. Extremes 5:387–414. https://doi.org/10.1023/A:1025128326588

Hall P (1982) On some simple estimates of an exponent of regular variation. J R Stat Soc B 44:37–42. https://doi.org/10.1111/j.2517-6161.1982.tb01183.x

Hill BM (1975) A simple general approach to inference about the tail of a distribution. Ann Stat 3:1163–1174. https://doi.org/10.1214/aos/1176343247

Hastie T, Tibshirani R (1993) Varying-coefficient models. J R Stat Soc B 55:757–779. https://doi.org/10.1111/j.2517-6161.1993.tb01939.x

Huang JZ, Wu CO, Zhou L (2002) Varying-coefficient models and basis function approximations for the analysis of repeated measurements. Biometrika 89:111–128. https://doi.org/10.1093/biomet/89.1.111

Huang JZ, Wu CO, Zhou L (2004) Polynomial spline estimation and inference for varying coefficient models with longitudinal data. Stat Sin 14:763–788


Kim MO (2007) Quantile regression with varying coefficients. Ann Stat 35:92–108. https://doi.org/10.1214/009053606000000966

Li R, Leng C, You J (2022) Semiparametric tail index regression. J Bus Econ Stat 40:82–95. https://doi.org/10.1080/07350015.2020.1775616

Ma Y, Jiang Y, Huang W (2019) Tail index varying coefficient model. Commun Stat 48:235–256. https://doi.org/10.1080/03610926.2017.1406519

Ma Y, Wei B, Huang W (2020) A nonparametric estimator for the conditional tail index of Pareto-type distributions. Metrika 83:17–44. https://doi.org/10.1007/s00184-019-00723-8

Pickands J (1975) Statistical inference using extreme order statistics. Ann Stat 3:119–131. https://doi.org/10.1214/aos/1176343003

Rosenblatt M (1976) On the maximal deviation of \(k\) -dimensional density estimates. Ann Probab 4:1009–1015. https://doi.org/10.1214/aop/1176995945

Smith RL (1987) Estimating tails of probability distributions. Ann Stat 15:1174–1207. https://doi.org/10.1214/aos/1176350499

Stupfler G (2013) A moment estimator for the conditional extreme-value index. Electron J Stat 7:2298–2343. https://doi.org/10.1214/13-EJS846

Wang H, Tsai CL (2009) Tail index regression. J Am Stat Assoc 104:1233–1240. https://doi.org/10.1198/jasa.2009.tm08458

Wu CO, Chiang CT, Hoover DR (1998) Asymptotic confidence regions for kernel smoothing of a varying-coefficient model with longitudinal data. J Am Stat Assoc 93:1388–1402. https://doi.org/10.1080/01621459.1998.10473800

Yoshida T (2023) Single-index models for extreme value index regression. arXiv:2203.05758

Youngman BD (2019) Generalized additive models for exceedances of high thresholds with an application to return level estimation for U.S. wind gusts. J Am Stat Assoc 114:1865–1879. https://doi.org/10.1080/01621459.2018.1529596


Acknowledgements

We would like to thank the two anonymous reviewers, the Associate Editor, and the Editor for their helpful comments, which helped to improve the manuscript.

Open access funding provided by Kagoshima University. This research was partially financially supported by the JSPS KAKENHI (Grant Numbers 18K18011, 22K11935).

Author information

Authors and Affiliations

Kagoshima University, Kagoshima, 890-8580, Japan

Koki Momoki & Takuma Yoshida


Contributions

All authors contributed to the study conception and design. Material preparation, data collection and analysis were performed by Koki Momoki. The first draft of the manuscript was written by Koki Momoki and all authors commented on previous versions of the manuscript. All authors read and approved the final manuscript.

Corresponding author

Correspondence to Koki Momoki.

Ethics declarations

Conflict of interest

The authors have no conflicts of interest to declare that are relevant to the content of this article.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendix

In this appendix, we prove Theorems 1–3 for \(\mathbf{{t}}=\mathbf{{t}}_0\in \mathbb {R}^q\). For convenience, the intercept term \(\theta _0(\cdot )\) is not considered.

Proof of Theorem 1

\(\dot{\varvec{L}}_n({\varvec{\theta }}(\mathbf{{t}}_0))\) can be regarded as the sum of independent and identically distributed random variables. To apply the Central Limit Theorem, we show \(E[[n_0(\mathbf{{t}}_0){\varvec{\Sigma }}_n(\mathbf{{t}}_0)]^{-1/2}(\dot{\varvec{L}}_n({\varvec{\theta }}(\mathbf{{t}}_0))+n_0(\mathbf{{t}}_0){\varvec{\Sigma }}_n(\mathbf{{t}}_0)\sum _{l=1}^2\mathbf{{\Lambda }}_n^{(l)}(\mathbf{{t}}_0))]\rightarrow -\mathbf{{b}}(\mathbf{{t}}_0)\) as \(n\rightarrow \infty \) in the first step and \(\mathrm{{var}}[[n_0(\mathbf{{t}}_0){\varvec{\Sigma }}_n(\mathbf{{t}}_0)]^{-1/2}(\dot{\varvec{L}}_n({\varvec{\theta }}(\mathbf{{t}}_0))+n_0(\mathbf{{t}}_0){\varvec{\Sigma }}_n(\mathbf{{t}}_0)\sum _{l=1}^2\mathbf{{\Lambda }}_n^{(l)}(\mathbf{{t}}_0))]=\mathrm{{var}}[[n_0(\mathbf{{t}}_0){\varvec{\Sigma }}_n(\mathbf{{t}}_0)]^{-1/2}\dot{\varvec{L}}_n({\varvec{\theta }}(\mathbf{{t}}_0))]\rightarrow \nu \mathbf{{I}}_p\) as \(n\rightarrow \infty \) in the second step, where “var” denotes the variance-covariance matrix.

We can write \(\dot{\varvec{L}}_n({\varvec{\theta }}(\mathbf{{t}}_0))\) as

By the Taylor expansion and the condition (C.1), we have

From the condition (C.4) and model assumptions ( 2.1 ) and ( 2.2 ), we have

Analogously, we have

Under the condition (C.5), we have

as \(n\rightarrow \infty \) . Therefore, we have \(E[[n_0(\mathbf{{t}}_0)\mathbf{{\Sigma }}_n(\mathbf{{t}}_0)]^{-1/2}{} \mathbf{{A}}_n^{(1)}]\rightarrow -\mathbf{{b}}(\mathbf{{t}}_0)\) as \(n\rightarrow \infty \) . Using the second-order Taylor expansion, we have

Therefore, by the Taylor expansion and condition (C.1), we have

Because the conditional distribution of \(\gamma (\mathbf{{X}}, \mathbf{{t}}_0)^{-1}\log (Y/\omega _n)\) given \((\mathbf{{X}}, \mathbf{{T}})=(\mathbf{{x}}, \mathbf{{t}}_0)\) and \(Y>\omega _n\) is approximately a standard exponential, we have

where \(f_{(\mathbf{{X}}, \mathbf{{T}})}(\mathbf{{x}}, \mathbf{{t}})\) denotes the marginal density function of \((\mathbf{{X}},\mathbf{{T}})\) . Therefore, the right-hand side of ( A.1 ) can be written as

where \(\mathbf{{\Lambda }}_n^{(1)}(\mathbf{{t}})\) and \(\mathbf{{\Lambda }}_n^{(2)}(\mathbf{{t}})\) are defined in Sect.  3.2 . Therefore, we have \(E[[n_0(\mathbf{{t}}_0){} \mathbf{{\Sigma }}_n(\mathbf{{t}}_0)]^{-1/2}{} \mathbf{{A}}_n^{(2)}]-[n_0(\mathbf{{t}}_0)\mathbf{{\Sigma }}_n(\mathbf{{t}}_0)]^{1/2}\sum _{l=1}^2\mathbf{{\Lambda }}_n^{(l)}(\mathbf{{t}}_0)\rightarrow \mathbf{{0}}\) as \(n\rightarrow \infty \) . Hence, the proof of the first step is completed.

We abbreviate as

Because \(\{(Y_i, \mathbf{{X}}_i, \mathbf{{T}}_i)\}_{i=1}^n\) are independently distributed, we have

From the result of Step 1 , the second term on the right-hand side converges to the zero matrix as \(n\rightarrow \infty \) . Using the Taylor expansion, the first term on the right-hand side can be written as

Therefore, the right-hand side of ( A.2 ) can be written as

as \(n\rightarrow \infty \) . Hence, the proof of the second step is completed.

From the results of Steps 1 and 2 , we obtain Theorem 1 by applying the Central Limit Theorem. \(\square \)

Proof of Theorem 2

We show that \(E[[n_0(\mathbf{{t}}_0)\mathbf{{\Sigma }}_n(\mathbf{{t}}_0)]^{-1/2}\ddot{\varvec{L}}_n({\varvec{\theta }}(\mathbf{{t}}_0))[n_0(\mathbf{{t}}_0)\mathbf{{\Sigma }}_n(\mathbf{{t}}_0)]^{-1/2}]\rightarrow \mathbf{{I}}_p\) as \(n\rightarrow \infty \) in the third step and that \(\mathrm{{var}}[\mathrm{{vec}}([n_0(\mathbf{{t}}_0)\mathbf{{\Sigma }}_n(\mathbf{{t}}_0)]^{-1/2}\ddot{\varvec{L}}_n({\varvec{\theta }}(\mathbf{{t}}_0))[n_0(\mathbf{{t}}_0)\mathbf{{\Sigma }}_n(\mathbf{{t}}_0)]^{-1/2})]\rightarrow {\varvec{O}}\) as \(n\rightarrow \infty \) in the fourth step, where \(\mathrm{{vec}}(\cdot )\) denotes the vectorization operator.

The conditional distribution of \(\gamma (\mathbf{{X}},\mathbf{{t}}_0)^{-1}\log (Y/\omega _n)\) given \((\mathbf{{X}},\mathbf{{T}})=(\mathbf{{x}},\mathbf{{t}}_0)\) and \(Y>\omega _n\) is approximately a standard exponential. Consequently, we have

Under the condition (C.3), the right-hand side converges to \(\mathbf{{I}}_p\) as \(n\rightarrow \infty \) . Hence, the proof of the third step is completed.

where “ \(\otimes \) ” denotes the Kronecker product. Under the condition (C.3), the right-hand side converges to \({\varvec{O}}\) as \(n\rightarrow \infty \) . Hence, the proof of the fourth step is completed.

\(\square \)

Proof of Theorem 3

We define \({\varvec{\alpha }}_n=\mathbf{{\Sigma }}_n(\mathbf{{t}}_0)^{1/2}({\varvec{\theta }}-{\varvec{\theta }}(\mathbf{{t}}_0))\) and \({\varvec{\alpha }}_n^*=\mathbf{{\Sigma }}_n(\mathbf{{t}}_0)^{1/2}{\varvec{\theta }}(\mathbf{{t}}_0)\) . Additionally, we define \(\mathbf{{Z}}_{ni}=\mathbf{{\Sigma }}_n(\mathbf{{t}}_0)^{-1/2}{} \mathbf{{X}}_i,\ i=1, \ldots , n\) . The objective function \(L_n({\varvec{\theta }})\) can be written as

Let \(\dot{\varvec{L}}_n^*({\varvec{\alpha }}_n)\) be the gradient vector of \(L_n^*({\varvec{\alpha }}_n)\) and \(\ddot{\varvec{L}}_n^*({\varvec{\alpha }}_n)\) be the Hessian matrix of \(L_n^*({\varvec{\alpha }}_n)\). We assume that \(\ddot{\varvec{L}}_n^*({\varvec{\alpha }}_n)\) is positive definite for all \({\varvec{\alpha }}_n\in \mathbb {R}^p\); hence, \(L_n^*({\varvec{\alpha }}_n)\) is strictly convex.
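This implication is the standard second-order Taylor argument: for any \({\varvec{\alpha }}\ne {\varvec{\beta }}\) in \(\mathbb {R}^p\) there exists \(\tilde{\varvec{\alpha }}\) on the segment joining them such that

$$L_n^*({\varvec{\beta }})=L_n^*({\varvec{\alpha }})+\dot{\varvec{L}}_n^*({\varvec{\alpha }})^\top ({\varvec{\beta }}-{\varvec{\alpha }})+\frac{1}{2}({\varvec{\beta }}-{\varvec{\alpha }})^\top \ddot{\varvec{L}}_n^*(\tilde{\varvec{\alpha }})({\varvec{\beta }}-{\varvec{\alpha }})>L_n^*({\varvec{\alpha }})+\dot{\varvec{L}}_n^*({\varvec{\alpha }})^\top ({\varvec{\beta }}-{\varvec{\alpha }}),$$

since the quadratic form is strictly positive under positive definiteness; this first-order lower bound characterizes strict convexity.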

Using the Taylor expansion, we have

for fixed \(\mathbf{{s}}\in \mathbb {R}^p\). By applying Theorems 1 and 2, we have \(\dot{\varvec{L}}_n^*({\varvec{0}})/\sqrt{n_0(\mathbf{{t}}_0)}+[n_0(\mathbf{{t}}_0){\varvec{\Sigma }}_n(\mathbf{{t}}_0)]^{1/2}\sum _{l=1}^2\mathbf{{\Lambda }}_n^{(l)}(\mathbf{{t}}_0)\xrightarrow {D}\mathrm{{N}}(-\mathbf{{b}}(\mathbf{{t}}_0),\nu \mathbf{{I}}_p)\) and \(\ddot{\varvec{L}}_n^*({\varvec{0}})/n_0(\mathbf{{t}}_0)\xrightarrow {P}\mathbf{{I}}_p\). We assume that \([n_0(\mathbf{{t}}_0){\varvec{\Sigma }}_n(\mathbf{{t}}_0)]^{1/2}\sum _{l=1}^2\mathbf{{\Lambda }}_n^{(l)}(\mathbf{{t}}_0)\rightarrow 0\) as \(n\rightarrow \infty \). Consequently, for any \(\varepsilon >0\), there exists a constant \(C\) such that

which implies \(\widehat{\varvec{\alpha }}_n=\mathbf{{\Sigma }}_n(\mathbf{{t}}_0)^{1/2}(\widehat{\varvec{\theta }}(\mathbf{{t}}_0)-{\varvec{\theta }}(\mathbf{{t}}_0))=O_P(\sqrt{n_0(\mathbf{{t}}_0)})\) (Fan and Li 2001 ). From the Taylor expansion of \(\dot{\varvec{L}}_n^*(\widehat{\varvec{\alpha }}_n)={\varvec{0}}\) , we have

Therefore, by applying Theorems 1 and 2 , we obtain Theorem 3 . \(\square \)

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/ .


About this article

Momoki, K., Yoshida, T. Hypothesis testing for varying coefficient models in tail index regression. Stat Papers (2024). https://doi.org/10.1007/s00362-024-01538-0


Received: 24 May 2022

Revised: 08 December 2023

Published: 02 April 2024

DOI: https://doi.org/10.1007/s00362-024-01538-0

Keywords

  • Extreme value theory
  • Hypothesis testing
  • Pareto-type model
  • Tail index regression
  • Varying coefficient model

