## 6.6 - Confidence Intervals & Hypothesis Testing

Confidence intervals and hypothesis tests are similar in that they are both inferential methods that rely on an approximated sampling distribution. Confidence intervals use data from a sample to estimate a population parameter. Hypothesis tests use data from a sample to test a specified hypothesis. Hypothesis testing requires that we have a hypothesized parameter.

The simulation methods used to construct bootstrap distributions and randomization distributions are similar. One primary difference is a bootstrap distribution is centered on the observed sample statistic while a randomization distribution is centered on the value in the null hypothesis.
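The difference in centering can be seen in a small simulation. The following is a minimal sketch, not StatKey's implementation: the sample is generated data, the null value is hypothetical, and shifting the sample so that its mean equals the null value is assumed here as one common way to build a randomization distribution for a single mean.

```python
import random
import statistics

random.seed(0)

# Hypothetical sample and hypothetical null value (neither is from the lesson).
sample = [random.gauss(98.2, 0.7) for _ in range(50)]
null_mean = 98.6
sample_mean = statistics.mean(sample)


def resample_mean(values):
    """Mean of one resample drawn with replacement, same size as values."""
    return statistics.mean(random.choices(values, k=len(values)))


# Bootstrap distribution: resample the data exactly as observed.
bootstrap_means = [resample_mean(sample) for _ in range(2000)]

# Randomization distribution: shift the data so its mean equals the
# null value, then resample the shifted data.
shifted = [x - sample_mean + null_mean for x in sample]
randomization_means = [resample_mean(shifted) for _ in range(2000)]

print("sample mean:", round(sample_mean, 2))
print("bootstrap center:", round(statistics.mean(bootstrap_means), 2))
print("null value:", null_mean)
print("randomization center:", round(statistics.mean(randomization_means), 2))
```

The bootstrap center should land near the sample mean and the randomization center near the null value, which is the distinction described above.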

In Lesson 4, we learned confidence intervals contain a range of reasonable estimates of the population parameter. All of the confidence intervals we constructed in this course were two-tailed. These two-tailed confidence intervals go hand-in-hand with the two-tailed hypothesis tests we learned in Lesson 5. The conclusion drawn from a two-tailed confidence interval is usually the same as the conclusion drawn from a two-tailed hypothesis test. In other words, if the 95% confidence interval contains the hypothesized parameter, then a hypothesis test at the 0.05 \(\alpha\) level will almost always fail to reject the null hypothesis. If the 95% confidence interval does not contain the hypothesized parameter, then a hypothesis test at the 0.05 \(\alpha\) level will almost always reject the null hypothesis.

## Example: Mean

This example uses the Body Temperature dataset built into StatKey for constructing a bootstrap confidence interval and conducting a randomization test.

Let's start by constructing a 95% confidence interval using the percentile method in StatKey:

The 95% confidence interval for the mean body temperature in the population is [98.044, 98.474].
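For readers without StatKey, the percentile method can be sketched in a few lines. This is an illustration only: the sample below is generated data, not the Body Temperature dataset, so the resulting interval will not match the one reported above.

```python
import random
import statistics

random.seed(1)

# Hypothetical body temperature sample (assumed values, not the StatKey data).
sample = [random.gauss(98.25, 0.75) for _ in range(50)]

# Build the bootstrap distribution of the sample mean.
boot_means = [
    statistics.mean(random.choices(sample, k=len(sample)))
    for _ in range(5000)
]

# Percentile method: the 2.5th and 97.5th percentiles of the bootstrap
# means form the 95% interval. quantiles(n=40) cuts at every 2.5%.
cut_points = statistics.quantiles(boot_means, n=40)
lower, upper = cut_points[0], cut_points[-1]

print(f"95% bootstrap percentile interval: [{lower:.3f}, {upper:.3f}]")
```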

Now, what if we want to know if there is enough evidence that the mean body temperature is different from 98.6 degrees? We can conduct a hypothesis test. Because 98.6 is not contained within the 95% confidence interval, it is not a reasonable estimate of the population mean. We should expect to have a p value less than 0.05 and to reject the null hypothesis.

\(H_0: \mu=98.6\)

\(H_a: \mu \ne 98.6\)

\(p = 2*0.00080=0.00160\)

\(p \leq 0.05\), reject the null hypothesis

There is evidence that the population mean is different from 98.6 degrees.
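The agreement between the two procedures in this example can be checked directly. The sketch below uses only the numbers reported above; the variable names are illustrative.

```python
# Values reported in the example above.
ci_lower, ci_upper = 98.044, 98.474  # 95% bootstrap confidence interval
null_value = 98.6                    # hypothesized mean
one_tail_proportion = 0.00080        # proportion in one tail of the randomization distribution
alpha = 0.05

# Confidence interval approach: is the hypothesized value a plausible estimate?
null_in_interval = ci_lower <= null_value <= ci_upper

# Hypothesis test approach: double the one-tail proportion for a two-tailed test.
p_value = 2 * one_tail_proportion
reject_by_test = p_value <= alpha

print("98.6 inside the interval:", null_in_interval)
print("p-value:", round(p_value, 5), "reject H0:", reject_by_test)
```

Both checks point the same way: 98.6 falls outside the interval and the test rejects the null hypothesis.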

## Selecting the Appropriate Procedure

The decision of whether to use a confidence interval or a hypothesis test depends on the research question. If we want to estimate a population parameter, we use a confidence interval. If we are given a specific population parameter (i.e., hypothesized value), and want to determine the likelihood that a population with that parameter would produce a sample as different as our sample, we use a hypothesis test. Below are a few examples of selecting the appropriate procedure.
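The decision rule in this paragraph can be written as a tiny helper. This is only a sketch of the reasoning; the function name and its arguments are hypothetical and not part of any course software.

```python
def choose_procedure(parameter, hypothesized_value_given):
    """Return the inferential procedure suggested by the research question.

    parameter: description of the parameter, e.g. "a single mean".
    hypothesized_value_given: True if the question names a specific value to test.
    """
    if hypothesized_value_given:
        kind = "hypothesis test"
    else:
        kind = "confidence interval"
    return f"{kind} for {parameter}"


# "How much cheese does an average American adult consume annually?"
print(choose_procedure("a single mean", hypothesized_value_given=False))

# "Is the average age of all STAT 200 students greater than 30 years?"
print(choose_procedure("a single mean", hypothesized_value_given=True))
```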

## Example: Cheese Consumption

Research question: How much cheese (in pounds) does an average American adult consume annually?

What is the appropriate inferential procedure?

Cheese consumption, in pounds, is a quantitative variable. We have one group: American adults. We are not given a specific value to test, so the appropriate procedure here is a confidence interval for a single mean.

## Example: Age

Research question: Is the average age in the population of all STAT 200 students greater than 30 years?

There is one group: STAT 200 students. The variable of interest is age in years, which is quantitative. The research question includes a specific population parameter to test: 30 years. The appropriate procedure is a hypothesis test for a single mean.

## Try it!

For each research question, identify the variables and the parameter of interest, then decide on the appropriate inferential procedure.

Research question: How strong is the correlation between height (in inches) and weight (in pounds) in American teenagers?

There are two variables of interest: (1) height in inches and (2) weight in pounds. Both are quantitative variables. The parameter of interest is the correlation between these two variables.

We are not given a specific correlation to test. We are being asked to estimate the strength of the correlation. The appropriate procedure here is a confidence interval for a correlation.

Research question: Are the majority of registered voters planning to vote in the next presidential election?

The parameter that is being tested here is a single proportion. We have one group: registered voters. "The majority" would be more than 50%, or p>0.50. This is a specific parameter that we are testing. The appropriate procedure here is a hypothesis test for a single proportion.

Research question: On average, are STAT 200 students younger than STAT 500 students?

We have two independent groups: STAT 200 students and STAT 500 students. We are comparing them in terms of average (i.e., mean) age.

If STAT 200 students are younger than STAT 500 students, that translates to \(\mu_{200}<\mu_{500}\) which is an alternative hypothesis. This could also be written as \(\mu_{200}-\mu_{500}<0\), where 0 is a specific population parameter that we are testing.

The appropriate procedure here is a hypothesis test for the difference in two means.

Research question: On average, how much taller are adult male giraffes compared to adult female giraffes?

There are two groups: males and females. The response variable is height, which is quantitative. We are not given a specific parameter to test; instead, we are asked to estimate "how much" taller males are than females. The appropriate procedure is a confidence interval for the difference in two means.

Research question: Are STAT 500 students more likely than STAT 200 students to be employed full-time?

There are two independent groups: STAT 500 students and STAT 200 students. The response variable is full-time employment status which is categorical with two levels: yes/no.

If STAT 500 students are more likely than STAT 200 students to be employed full-time, that translates to \(p_{500}>p_{200}\) which is an alternative hypothesis. This could also be written as \(p_{500}-p_{200}>0\), where 0 is a specific parameter that we are testing. The appropriate procedure is a hypothesis test for the difference in two proportions.

Research question: Is there a relationship between outdoor temperature (in Fahrenheit) and coffee sales (in cups per day)?

There are two variables here: (1) temperature in Fahrenheit and (2) cups of coffee sold in a day. Both variables are quantitative. The parameter of interest is the correlation between these two variables.

If there is a relationship between the variables, that means that the correlation is different from zero. This is a specific parameter that we are testing. The appropriate procedure is a hypothesis test for a correlation.


## Hypothesis Test vs. Confidence Interval: What’s the Difference?

Two of the most commonly used procedures in statistics are hypothesis tests and confidence intervals.

Here’s the difference between the two:

- A hypothesis test is a formal statistical test that is used to determine if some hypothesis about a population parameter is true.
- A confidence interval is a range of values that is likely to contain a population parameter with a certain level of confidence.

This tutorial shares a brief overview of each method along with their similarities and differences.

## The Basics of Hypothesis Tests

A hypothesis test is used to test whether or not some hypothesis about a population parameter is true.

To perform a hypothesis test in the real world, researchers will obtain a random sample from the population and perform a hypothesis test on the sample data, using a null and alternative hypothesis:

- Null Hypothesis (H0): The sample data occurs purely from chance.
- Alternative Hypothesis (HA): The sample data is influenced by some non-random cause.

If the p-value of the hypothesis test is less than some significance level (e.g. α = .05), then we can reject the null hypothesis and conclude that we have sufficient evidence to say that the alternative hypothesis is true.

## Hypothesis Test Example

Suppose a manufacturing facility wants to test whether or not some new method changes the number of defective widgets produced per month, which is currently 250.

To test this, they may measure the mean number of defective widgets produced before and after using the new method for one month.

They can perform a hypothesis test using the following hypotheses:

- H0: μ_after = μ_before (the mean number of defective widgets is the same before and after using the new method)
- HA: μ_after ≠ μ_before (the mean number of defective widgets produced is different before and after using the new method)

Suppose they perform a one sample t-test and end up with a p-value of .0032.

Since this p-value is less than α = .05, the facility can reject the null hypothesis and conclude that the new method leads to a change in the number of defective widgets produced per month.

## The Basics of Confidence Intervals

To calculate a confidence interval in the real world, researchers will obtain a random sample from the population and use the following formula to calculate a confidence interval for the population mean:

Confidence Interval = x̄ ± z*(s/√n)

- x̄: sample mean
- z: the chosen z-value
- s: sample standard deviation
- n: sample size

The z-value that you will use is dependent on the confidence level that you choose. The following table shows the z-value that corresponds to popular confidence level choices:

| Confidence Level | z-value |
| --- | --- |
| 0.90 | 1.645 |
| 0.95 | 1.96 |
| 0.99 | 2.576 |
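The z-value for any confidence level can also be computed from the standard normal distribution. The sketch below uses Python's standard library; the helper name `z_critical` is illustrative.

```python
from statistics import NormalDist


def z_critical(confidence_level):
    """Two-sided critical z-value for the given confidence level (e.g. 0.95)."""
    alpha = 1 - confidence_level
    # Leave alpha/2 in each tail of the standard normal distribution.
    return NormalDist().inv_cdf(1 - alpha / 2)


for level in (0.90, 0.95, 0.99):
    print(f"{level:.0%} confidence level: z = {z_critical(level):.3f}")
```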

## Confidence Interval Example

Suppose a biologist wants to estimate the mean weight of turtles in a certain population so she collects a random sample of turtles with the following information:

- Sample size n = 25
- Sample mean weight x̄ = 300
- Sample standard deviation s = 18.5

Here is how to calculate the 90% confidence interval for the true population mean weight:

90% Confidence Interval: 300 +/- 1.645*(18.5/√25) = [293.91, 306.09]

The biologist can be 90% confident that the true mean weight of a turtle in this population is between 293.91 pounds and 306.09 pounds.
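The same arithmetic can be reproduced in code. This is a minimal sketch using the sample values and the z-value of 1.645 from the example; the function name `z_interval` is illustrative.

```python
import math


def z_interval(mean, sd, n, z):
    """Confidence interval: mean +/- z * (sd / sqrt(n))."""
    margin = z * sd / math.sqrt(n)
    return mean - margin, mean + margin


lower, upper = z_interval(mean=300, sd=18.5, n=25, z=1.645)
print(f"90% confidence interval: [{lower:.2f}, {upper:.2f}]")
```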

## Hypothesis Test vs. Confidence Interval: When to Use Each

The decision to use a hypothesis test or a confidence interval depends on the question you’re attempting to answer.

You should use a confidence interval when you want to estimate the value of a population parameter.

You should use a hypothesis test when you want to determine if some hypothesis about a population parameter is likely true or not.

To test your knowledge of when to use each procedure, consider the following scenarios.

## Scenario 1: Hours Spent Studying

Suppose an academic researcher wants to measure the mean number of hours that college students spend studying per week.

Which procedure should she use to answer this question?

She should use a confidence interval because she’s interested in estimating the value of a population parameter.

## Scenario 2: New Medication

Suppose a doctor wants to test whether or not a new medication is able to reduce blood pressure more than the current standard medication.

Which procedure should he use to answer this question?

He should use a hypothesis test because he’s interested in understanding whether or not a specific assumption about a population parameter is true.

## Additional Resources

The following tutorials provide additional information about hypothesis tests:

- Introduction to Hypothesis Testing
- Introduction to the One Sample t-test
- Introduction to the Two Sample t-test
- Introduction to the Paired Samples t-test

The following tutorials provide additional information about confidence intervals:

- Introduction to Confidence Intervals
- Confidence Interval for a Mean
- Confidence Interval for a Proportion

## Published by Zach


## 8.6 Relationship Between Confidence Intervals and Hypothesis Tests

Confidence intervals (CI) and hypothesis tests should give consistent results: we should not reject [latex]H_0[/latex] at the significance level [latex]\alpha[/latex] if the corresponding [latex](1 - \alpha) \times 100\%[/latex] confidence interval contains the hypothesized value [latex]\mu_0[/latex]. Two-sided confidence intervals correspond to two-tailed tests, upper-tailed confidence intervals correspond to right-tailed tests, and lower-tailed confidence intervals correspond to left-tailed tests.

A [latex](1 - \alpha) \times 100\%[/latex] two-sided [latex]t[/latex] confidence interval is given in the form [latex](\bar{x} - t_{\alpha / 2} \frac{s}{\sqrt{n}}, \bar{x} + t_{\alpha / 2} \frac{s}{\sqrt{n}})[/latex]. A [latex](1 - \alpha) \times 100\%[/latex] upper-tailed t confidence interval is given by [latex](\bar{x} - t_{\alpha} \frac{s}{\sqrt{n}}, \infty)[/latex] and the number [latex]\bar{x} - t_{\alpha} \frac{s}{\sqrt{n}}[/latex] is called the lower bound of the interval. A [latex](1 - \alpha) \times 100\%[/latex] lower-tailed t confidence interval is given by [latex](- \infty, \bar{x} + t_{\alpha} \frac{s}{\sqrt{n}})[/latex] and the number [latex]\bar{x} + t_{\alpha} \frac{s}{\sqrt{n}}[/latex] is called the upper bound of the interval. We can also use confidence intervals to make conclusions about hypothesis tests: reject the null hypothesis [latex]H_0[/latex] at the significance level [latex]\alpha[/latex] if the corresponding [latex](1 - \alpha) \times 100\%[/latex] confidence interval does not contain the hypothesized value [latex]\mu_0[/latex]. The relationship is summarized in the following table.

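Table 8.3: Relationship Between Confidence Interval and Hypothesis Test

| Test | Alternative hypothesis | Corresponding [latex](1 - \alpha) \times 100\%[/latex] confidence interval | Decision |
| --- | --- | --- | --- |
| Two-tailed | [latex]H_a: \mu \neq \mu_0[/latex] | Two-sided: [latex](\bar{x} - t_{\alpha / 2} \frac{s}{\sqrt{n}}, \bar{x} + t_{\alpha / 2} \frac{s}{\sqrt{n}})[/latex] | Reject [latex]H_0[/latex] if [latex]\mu_0[/latex] is outside the interval |
| Right-tailed | [latex]H_a: \mu > \mu_0[/latex] | Upper-tailed: [latex](\bar{x} - t_{\alpha} \frac{s}{\sqrt{n}}, \infty)[/latex] | Reject [latex]H_0[/latex] if [latex]\mu_0[/latex] is outside the interval |
| Left-tailed | [latex]H_a: \mu < \mu_0[/latex] | Lower-tailed: [latex](- \infty, \bar{x} + t_{\alpha} \frac{s}{\sqrt{n}})[/latex] | Reject [latex]H_0[/latex] if [latex]\mu_0[/latex] is outside the interval |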

Here is the reason we should reject [latex]H_0[/latex] if [latex]\mu_0[/latex] is outside the corresponding confidence interval.

Take the right-tailed test as an example. We should reject [latex]H_0[/latex] if the observed test statistic [latex]t_o[/latex] falls in the rejection region, that is, if [latex]t_o \geq t_{\alpha}[/latex]. This implies [latex]t_o = \frac{\bar{x} - \mu_0}{s / \sqrt{n}} \geq t_{\alpha} \Longrightarrow \mu_0 \leq \bar{x} - t_{\alpha} \frac{s}{\sqrt{n}}.[/latex] Given that the upper-tailed confidence interval for a right-tailed test is [latex](\bar{x} - t_{\alpha} \frac{s}{\sqrt{n}}, \infty)[/latex], [latex]\mu_0 \leq \bar{x} - t_{\alpha} \frac{s}{\sqrt{n}}[/latex] means the value of [latex]\mu_0[/latex] is outside the confidence interval. The same rationale applies to two-tailed and left-tailed tests. Therefore, we can reject [latex]H_0[/latex] at the significance level [latex]\alpha[/latex] if [latex]\mu_0[/latex] is outside the corresponding [latex](1 - \alpha) \times 100\%[/latex] confidence interval.
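For completeness, the left-tailed case can be worked out the same way. The following derivation is a sketch that mirrors the right-tailed argument, assuming the rejection region for a left-tailed test is [latex]t_o \leq -t_{\alpha}[/latex]:

[latex]t_o = \frac{\bar{x} - \mu_0}{s / \sqrt{n}} \leq -t_{\alpha} \Longrightarrow \bar{x} - \mu_0 \leq -t_{\alpha} \frac{s}{\sqrt{n}} \Longrightarrow \mu_0 \geq \bar{x} + t_{\alpha} \frac{s}{\sqrt{n}}.[/latex]

Since the lower-tailed confidence interval is [latex](- \infty, \bar{x} + t_{\alpha} \frac{s}{\sqrt{n}})[/latex], the condition [latex]\mu_0 \geq \bar{x} + t_{\alpha} \frac{s}{\sqrt{n}}[/latex] again says that [latex]\mu_0[/latex] lies outside the interval.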

Example: Relationship Between Confidence Intervals and Hypothesis Tests

The ankle-brachial index (ABI) compares the blood pressure of a patient’s arm to the blood pressure of the patient’s leg. The ABI can be an indicator of different diseases, including arterial diseases. A healthy (or normal) ABI is 0.9 or greater. Researchers obtained the ABI of 100 women with peripheral arterial disease and obtained a mean ABI of 0.64 with a standard deviation of 0.15.

- Set up the hypotheses: [latex]H_0: \mu \geq 0.9[/latex] versus [latex]H_a: \mu < 0.9[/latex].
- The significance level is [latex]\alpha = 0.05[/latex].
- Compute the value of the test statistic: [latex]t_o = \frac{\bar{x} - \mu_0}{s / \sqrt{n}} = \frac{0.64 - 0.9}{0.15 / \sqrt{100}} = \frac{-0.26}{0.015} = -17.333[/latex] with [latex]df = n-1 = 100 -1 = 99[/latex] (not given in Table IV, use 95, the closest one smaller than 99).
- Find the P-value. For a left-tailed test, the P-value is the area to the left of the observed test statistic [latex]t_o[/latex]. [latex]\mbox{P-value} = P(t \leq t_o) = P(t \leq -17.333) = P(t \geq 17.333) < 0.005[/latex], since [latex]17.333 > 2.629 (t_{0.005})[/latex].
- Decision: Since the P-value [latex]< 0.005 < 0.05(\alpha)[/latex], we should reject the null hypothesis [latex]H_0[/latex].
- Conclusion: At the 5% significance level, the data provide sufficient evidence that, on average, women with peripheral arterial disease have an unhealthy ABI.

The corresponding 95% lower-tailed confidence interval for the mean ABI (part b) is

[latex]\left( - \infty, \bar{x} + t_{\alpha} \frac{s}{\sqrt{n}} \right)= \left( - \infty, 0.64 + 1.661 \times \frac{0.15}{\sqrt{100}} \right) = (- \infty , 0.665)[/latex].

- Does the interval in part b) support the conclusion in part a)? In part a), we reject [latex]H_0[/latex] and claim that the mean ABI is below 0.9 for women with peripheral arterial disease. In part b), we are 95% confident that the mean ABI is less than 0.9 since the entire confidence interval is below 0.9. In other words, the hypothesized value 0.9 is outside the corresponding confidence interval, so we should reject the null hypothesis. Therefore, the results obtained in parts a) and b) are consistent.
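The consistency between the test and the interval in this example can be verified numerically. The sketch below uses the sample values and the tabled critical value 1.661 from the example; the variable names are illustrative.

```python
from math import sqrt

# Values from the ABI example.
n, sample_mean, sample_sd = 100, 0.64, 0.15
mu_0 = 0.9
t_critical = 1.661  # t_{0.05} with df = 95, the table value used in the example

standard_error = sample_sd / sqrt(n)

# Left-tailed test: reject H0 when the observed statistic is at or below -t_alpha.
t_observed = (sample_mean - mu_0) / standard_error
reject_by_test = t_observed <= -t_critical

# Lower-tailed 95% interval: (-infinity, upper_bound).
upper_bound = sample_mean + t_critical * standard_error
reject_by_interval = mu_0 >= upper_bound  # mu_0 lies outside the interval

print(f"t_o = {t_observed:.3f}, upper bound = {upper_bound:.3f}")
print("Same decision from both approaches:", reject_by_test == reject_by_interval)
```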

Introduction to Applied Statistics by Wanhua Su is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License , except where otherwise noted.


StatPearls [Internet]. Treasure Island (FL): StatPearls Publishing; 2024 Jan-.

## Hypothesis Testing, P Values, Confidence Intervals, and Significance

Jacob Shreffler; Martin R. Huecker.

Last Update: March 13, 2023.

## Definition/Introduction

Medical providers often rely on evidence-based medicine to guide decision-making in practice. Often a research hypothesis is tested with results provided, typically with p values, confidence intervals, or both. Additionally, statistical or research significance is estimated or determined by the investigators. Unfortunately, healthcare providers may have different comfort levels in interpreting these findings, which may affect the adequate application of the data.

## Issues of Concern

Without a foundational understanding of hypothesis testing, p values, confidence intervals, and the difference between statistical and clinical significance, healthcare providers may be unable to make clinical decisions without relying purely on the level of significance deemed by the research investigators. Therefore, an overview of these concepts is provided to allow medical professionals to use their expertise to determine if results are reported sufficiently and if the study outcomes are clinically appropriate to be applied in healthcare practice.

### Hypothesis Testing

Investigators conducting studies need research questions and hypotheses to guide analyses. Starting with broad research questions (RQs), investigators then identify a gap in current clinical practice or research. Any research problem or statement is grounded in a better understanding of relationships between two or more variables. For this article, we will use the following research question example:

Research Question: Is Drug 23 an effective treatment for Disease A?

Research questions do not directly imply specific guesses or predictions; we must formulate research hypotheses. A hypothesis is a predetermined declaration regarding the research question in which the investigator(s) makes a precise, educated guess about a study outcome. This is sometimes called the alternative hypothesis and ultimately allows the researcher to take a stance based on experience or insight from medical literature. An example of a hypothesis is below.

Research Hypothesis: Drug 23 will significantly reduce symptoms associated with Disease A compared to Drug 22.

The null hypothesis states that there is no statistical difference between groups based on the stated research hypothesis.

Null Hypothesis: There are no significant differences in the reduction of symptoms for Disease A between Drug 23 and Drug 22.

The null hypothesis is deemed true until a study presents significant data to support rejecting the null hypothesis. Based on the results, the investigators will either reject the null hypothesis (if they found significant differences or associations) or fail to reject the null hypothesis (they could not provide proof that there were significant differences or associations).

Researchers should be aware of journal recommendations when considering how to report p values, and manuscripts should remain internally consistent. Regarding p values, as the number of individuals enrolled in a study (the sample size) increases, the likelihood of finding a statistically significant effect increases. With very large sample sizes, the p-value can be very low.
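The effect of sample size on the p-value can be illustrated with a small calculation. This is a sketch under simplifying assumptions that are not from the article: a two-sided z-test, a fixed hypothetical observed difference of 0.1, and a standard deviation of 1.

```python
from math import sqrt
from statistics import NormalDist


def two_sided_p(effect, sd, n):
    """Two-sided z-test p-value for an observed mean difference from zero."""
    z = effect / (sd / sqrt(n))
    return 2 * (1 - NormalDist().cdf(abs(z)))


# The same small observed effect at increasing sample sizes.
for n in (25, 100, 400, 1600):
    print(f"n = {n:5d}  p = {two_sided_p(effect=0.1, sd=1.0, n=n):.4f}")
```

The observed effect never changes, yet the p-value drops below 0.05 once the sample is large enough, which is why a low p-value by itself says little about how large or important an effect is.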

To test a hypothesis, researchers obtain data on a representative sample to determine whether to reject or fail to reject a null hypothesis. In most research studies, it is not feasible to obtain data for an entire population. Using a sampling procedure allows for statistical inference, though this involves a certain possibility of error. [1] When determining whether to reject or fail to reject the null hypothesis, mistakes can be made: Type I and Type II errors. Though it is impossible to ensure that these errors have not occurred, researchers should limit the possibilities of these faults. [2]

### Significance

Significance is a term to describe the substantive importance of medical research. Statistical significance refers to how likely results at least as extreme as those observed would be if they were due to chance alone. [3] Healthcare providers should always distinguish statistical significance from clinical significance; conflating the two is a common error when reviewing biomedical research. [4] When conceptualizing findings reported as either significant or not significant, healthcare providers should not simply accept researchers' results or conclusions without considering the clinical significance. Healthcare professionals should consider the clinical importance of findings and understand both p values and confidence intervals so they do not have to rely on the researchers to determine the level of significance. [5] One criterion often used to determine statistical significance is the utilization of p values.

P values are used in research to determine whether the sample estimate is significantly different from a hypothesized value. The p-value is the probability that an effect at least as large as the one observed in the study would have occurred by chance if, in reality, there was no true effect. Conventionally, data yielding a p<0.05 or p<0.01 is considered statistically significant. While some have debated that the 0.05 level should be lowered, it is still widely practiced. [6] Hypothesis testing alone does not allow us to determine the size of the effect.

Examples of findings reported with p values are below:

Statement: Drug 23 reduced patients' symptoms compared to Drug 22. Patients who received Drug 23 (n = 100) were 2.1 times less likely than patients who received Drug 22 (n = 100) to experience symptoms of Disease A, p<0.05.

Statement: Individuals who were prescribed Drug 23 experienced fewer symptoms (M = 1.3, SD = 0.7) compared to individuals who were prescribed Drug 22 (M = 5.3, SD = 1.9). This finding was statistically significant, p = 0.02.

For either statement, if the threshold had been set at 0.05, the null hypothesis (that there was no relationship) should be rejected, and we should conclude significant differences. Noticeably, as can be seen in the two statements above, some researchers will report findings with < or > and others will provide an exact p-value (0.000001) but never zero. [6] When examining research, readers should understand how p values are reported. The best practice is to report all p values for all variables within a study design, rather than only providing p values for variables with significant findings. [7] The inclusion of all p values provides evidence for study validity and limits suspicion for selective reporting/data mining.

While researchers have historically used p values, experts who find p values problematic encourage the use of confidence intervals. [8] P-values alone do not allow us to understand the size or the extent of the differences or associations. [3] In March 2016, the American Statistical Association (ASA) released a statement on p values, noting that scientific decision-making and conclusions should not be based on a fixed p-value threshold (e.g., 0.05). They recommend focusing on the significance of results in the context of study design, quality of measurements, and validity of data. Ultimately, the ASA statement noted that in isolation, a p-value does not provide strong evidence. [9]

When conceptualizing clinical work, healthcare professionals should consider p values with a concurrent appraisal of study design validity. For example, a p-value from a double-blinded randomized clinical trial (designed to minimize bias) should be weighted higher than one from a retrospective observational study. [7] The p-value debate has smoldered since the 1950s [10], and replacement with confidence intervals has been suggested since the 1980s. [11]

### Confidence Intervals

A confidence interval provides a range of values that, with a given level of confidence (e.g., 95%), includes the true value of the statistical parameter within a targeted population. [12] Most research uses a 95% CI, but investigators can set any level (e.g., 90% CI, 99% CI). [13] A CI provides a range with the lower bound and upper bound limits of a difference or association that would be plausible for a population. [14] Therefore, a CI of 95% indicates that if a study were to be carried out 100 times, the range would contain the true value in about 95 of them. [15] Confidence intervals provide more evidence regarding the precision of an estimate compared to p-values. [6]
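The repeated-study interpretation of a 95% CI can be checked by simulation. The sketch below is illustrative only: the population mean, standard deviation, sample size, and number of simulated studies are all assumed values, and each interval is computed as mean ± 1.96 standard errors.

```python
import random
import statistics
from math import sqrt

random.seed(42)

# Assumed population and study design (hypothetical values).
TRUE_MEAN, TRUE_SD, N, STUDIES = 10.0, 2.0, 100, 2000
Z_95 = 1.96

covered = 0
for _ in range(STUDIES):
    # Simulate one study and compute its 95% confidence interval.
    sample = [random.gauss(TRUE_MEAN, TRUE_SD) for _ in range(N)]
    mean = statistics.mean(sample)
    margin = Z_95 * statistics.stdev(sample) / sqrt(N)
    if mean - margin <= TRUE_MEAN <= mean + margin:
        covered += 1

coverage = covered / STUDIES
print(f"Intervals containing the true mean: {coverage:.1%}")
```

The proportion of intervals that contain the true mean should come out close to 95%, although any single interval either contains it or does not.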

In consideration of the similar research example provided above, one could make the following statement with 95% CI:

Statement: Individuals who were prescribed Drug 23 had no symptoms after three days, which was significantly faster than those prescribed Drug 22; the mean difference in days to recovery between the two groups was 4.2 days (95% CI: 1.9 – 7.8).

It is important to note that the width of the CI is affected by the standard error and the sample size; reducing a study sample number will result in less precision of the CI (increase the width). [14] A larger width indicates a smaller sample size or a larger variability. [16] A researcher would want to increase the precision of the CI. For example, a 95% CI of 1.43 – 1.47 is much more precise than the one provided in the example above. In research and clinical practice, CIs provide valuable information on whether the interval includes or excludes any clinically significant values. [14]
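The link between sample size and CI width follows from the interval formula, where the width is twice the margin of error. The following sketch assumes a z-based interval with a hypothetical standard deviation of 10; the function name `ci_width` is illustrative.

```python
from math import sqrt


def ci_width(sd, n, z=1.96):
    """Width of a z-based confidence interval: 2 * z * sd / sqrt(n)."""
    return 2 * z * sd / sqrt(n)


# Same variability, increasing sample sizes.
for n in (25, 100, 400):
    print(f"n = {n:3d}  width = {ci_width(sd=10.0, n=n):.2f}")
```

Quadrupling the sample size halves the width, so gains in precision become progressively more expensive in terms of enrollment.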

Null values are sometimes used for differences with CI (zero for differential comparisons and 1 for ratios). However, CIs provide more information than that. [15] Consider this example: A hospital implements a new protocol that reduced wait time for patients in the emergency department by an average of 25 minutes (95% CI: -2.5 – 41 minutes). Because the range crosses zero, implementing this protocol in different populations could result in longer wait times; however, the range is much higher on the positive side. Thus, while the p-value used to detect statistical significance for this may result in "not significant" findings, individuals should examine this range, consider the study design, and weigh whether or not it is still worth piloting in their workplace.
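Checking a CI against its null value is a simple comparison. The sketch below applies it to the two intervals quoted in this article; the function name `includes_null` is illustrative, and the ratio interval in the last line is a hypothetical example.

```python
def includes_null(lower, upper, null_value=0.0):
    """True if the null value (0 for differences, 1 for ratios) lies in the CI."""
    return lower <= null_value <= upper


# Emergency department protocol: 95% CI of -2.5 to 41 minutes.
print(includes_null(-2.5, 41))  # True: the interval crosses zero

# Days to recovery, Drug 23 vs. Drug 22: 95% CI of 1.9 to 7.8 days.
print(includes_null(1.9, 7.8))  # False: zero is excluded

# Hypothetical ratio measure with 95% CI of 0.8 to 1.5; the null value is 1.
print(includes_null(0.8, 1.5, null_value=1.0))  # True
```

As the paragraph above notes, this yes/no check discards most of the information in the interval, so the position and width of the range still deserve attention.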

Similarly to p-values, 95% CIs cannot control for researchers' errors (e.g., study bias or improper data analysis). [14] In consideration of whether to report p-values or CIs, researchers should examine journal preferences. When in doubt, reporting both may be beneficial. [13] An example is below:

Reporting both: Individuals who were prescribed Drug 23 had no symptoms after three days, which was significantly faster than those prescribed Drug 22, p = 0.009. The mean difference in days to recovery between the two groups was 4.2 days (95% CI: 1.9 – 7.8).

## Clinical Significance

Recall that clinical significance and statistical significance are two different concepts. Healthcare providers should remember that a study with statistically significant differences and large sample size may be of no interest to clinicians, whereas a study with smaller sample size and statistically non-significant results could impact clinical practice. [14] Additionally, as previously mentioned, a non-significant finding may reflect the study design itself rather than relationships between variables.

Healthcare providers using evidence-based medicine to inform practice should use clinical judgment to determine the practical importance of studies through careful evaluation of the design, sample size, power, likelihood of type I and type II errors, data analysis, and reporting of statistical findings (p values, 95% CI or both). [4] Interestingly, some experts have called for "statistically significant" or "not significant" to be excluded from work as statistical significance never has and will never be equivalent to clinical significance. [17]

The decision on what is clinically significant can be challenging, depending on the providers' experience and especially the severity of the disease. Providers should use their knowledge and experiences to determine the meaningfulness of study results and make inferences based not only on significant or insignificant results by researchers but through their understanding of study limitations and practical implications.

## Nursing, Allied Health, and Interprofessional Team Interventions

All physicians, nurses, pharmacists, and other healthcare professionals should strive to understand the concepts in this chapter. These individuals should maintain the ability to review and incorporate new literature for evidence-based and safe care.


Disclosure: Jacob Shreffler declares no relevant financial relationships with ineligible companies.

Disclosure: Martin Huecker declares no relevant financial relationships with ineligible companies.

This book is distributed under the terms of the Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International (CC BY-NC-ND 4.0) ( http://creativecommons.org/licenses/by-nc-nd/4.0/ ), which permits others to distribute the work, provided that the article is not altered or used commercially. You are not required to obtain permission to distribute this article, provided that you credit the author and journal.

Cite this Page: Shreffler J, Huecker MR. Hypothesis Testing, P Values, Confidence Intervals, and Significance. [Updated 2023 Mar 13]. In: StatPearls [Internet]. Treasure Island (FL): StatPearls Publishing; 2024 Jan-.



## Hypothesis Testing

After completing this reading, you should be able to:

- Construct an appropriate null hypothesis and alternative hypothesis and distinguish between the two.
- Construct and apply confidence intervals for one-sided and two-sided hypothesis tests, and interpret the results of hypothesis tests with a specific level of confidence.
- Differentiate between a one-sided and a two-sided test and identify when to use each test.
- Explain the difference between Type I and Type II errors and how these relate to the size and power of a test.
- Understand how a hypothesis test and a confidence interval are related.
- Explain what the p-value of a hypothesis test measures.
- Interpret the results of hypothesis tests with a specific level of confidence.
- Identify the steps to test a hypothesis about the difference between two population means.
- Explain the problem of multiple testing and how it can bias results.

Hypothesis testing is the process of determining whether a hypothesis about a population parameter is consistent with the sample data. It starts by stating the null hypothesis and the alternative hypothesis. The null hypothesis is an assumption about the population parameter. The alternative hypothesis, on the other hand, specifies the parameter values for which the null hypothesis is rejected. The critical values that separate rejection from non-rejection are determined by the distribution of the test statistic (when the null hypothesis is true) and the size of the test (the probability of rejecting the null hypothesis when it is true).

## Components of the Hypothesis Testing

The elements of a hypothesis test include:

- The null hypothesis.
- The alternative hypothesis.
- The test statistic.
- The size of the hypothesis test and errors
- The critical value.
- The decision rule.

## The Null hypothesis

As stated earlier, the first stage of the hypothesis test is the statement of the null hypothesis. The null hypothesis is a statement about the values of the population parameter. It conveys the notion that “there is nothing unusual about the data.”

The null hypothesis, denoted as H 0 , represents the current state of knowledge about the population parameter that’s the subject of the test. In other words, it represents the “status quo.” For example, the U.S. Food and Drug Administration may walk into a cooking oil manufacturing plant intending to confirm that each 1 kg oil package has, say, 0.15% cholesterol and not more. The inspectors will formulate a hypothesis like:

H 0 : Each 1 kg package has 0.15% cholesterol.

A test would then be carried out to confirm or reject the null hypothesis.

Other typical statements of H 0 include:

$$H_0:\mu={\mu}_0$$

$$H_0:\mu≤{\mu}_0$$

Where \(μ\) is the true population mean and \(μ_0\) is the hypothesized population mean.

## The Alternative Hypothesis

The alternative hypothesis, denoted H 1 , is a contradiction of the null hypothesis. The alternative hypothesis determines the values of the population parameter at which the null hypothesis is rejected. Thus, rejecting H 0 lends support to H 1 . We accept the alternative hypothesis when the “status quo” is discredited and found to be untrue.

Using our FDA example above, the alternative hypothesis would be:

H 1 : Each 1 kg package does not have 0.15% cholesterol.

The typical statements of H 1 include:

$$H_1:\mu \neq {\mu}_0$$

$$H_1:\mu > {\mu}_0$$

Note that we have stated the alternative hypothesis, which contradicted the above statement of the null hypothesis.

## The Test Statistic

A test statistic is a standardized value computed from sample information when testing hypotheses. It compares the given data with what we would expect under the null hypothesis. Thus, it is a major determinant when deciding whether to reject H 0 , the null hypothesis.

We use the test statistic to gauge the degree of agreement between sample data and the null hypothesis. Analysts use the following formula when calculating the test statistic.

$$ \text{Test Statistic}= \frac{(\text{Sample Statistic–Hypothesized Value})}{(\text{Standard Error of the Sample Statistic})}$$

The test statistic is a random variable that changes from one sample to another. Test statistics assume a variety of distributions. We shall focus on normally distributed test statistics because they are used in hypothesis tests concerning means, regression coefficients, and other econometric models.

We shall consider the hypothesis test on the mean. Consider a null hypothesis \(H_0:μ=μ_0\). Assume that the data are iid and that the sample mean \(\hat{\mu}\) is asymptotically normally distributed as:

$$\sqrt{n} (\hat{\mu}-\mu) \sim N(0, {\sigma}^2)$$

Where \({\sigma}^2\) is the variance of the sequence of the iid random variable used. The asymptotic distribution leads to the test statistic:

$$T=\frac{\hat{\mu}-{\mu}_0}{\sqrt{\frac{\hat{\sigma}^2}{n}}}\sim N(0,1)$$

Note this is consistent with our initial definition of the test statistic.

The following table gives a brief outline of the various test statistics used regularly, based on the distribution that the data is assumed to follow:

$$\begin{array}{ll} \textbf{Hypothesis Test} & \textbf{Test Statistic}\\ \text{Z-test} & \text{z-statistic} \\ \text{Chi-Square Test} & \text{Chi-Square statistic}\\ \text{t-test} & \text{t-statistic} \\ \text{ANOVA} & \text{F-statistic}\\ \end{array}$$

We can subdivide the set of values that can be taken by the test statistic into two regions: the non-rejection region, which is consistent with H 0 , and the rejection region (critical region), which is inconsistent with H 0 . If the test statistic has a value found within the critical region, we reject H 0 .

Just like with any other statistic, the distribution of the test statistic must be specified entirely under H 0 when H 0 is true.

## The Size of the Hypothesis Test and the Type I and Type II Errors

While using sample statistics to draw conclusions about the parameters of the population as a whole, there is always the possibility that the sample collected does not accurately represent the population. Consequently, statistical tests carried out using such sample data may yield incorrect results that may lead to erroneous rejection (or lack thereof) of the null hypothesis. We have two types of errors:

## Type I Error

Type I error occurs when we reject a true null hypothesis. For example, a type I error would manifest in the form of rejecting \(H_0:μ=0\) when \(μ\) is actually zero.

## Type II Error

Type II error occurs when we fail to reject a false null hypothesis. In such a scenario, the test provides insufficient evidence to reject the null hypothesis when it’s false.

The level of significance, denoted by α, represents the probability of making a type I error, i.e., rejecting the null hypothesis when, in fact, it’s true. The probability of making a type II error is denoted by β. For a fixed sample size, the two trade off against each other: lowering α makes the null hypothesis harder to reject and therefore raises β. The ideal but practically impossible statistical test would be one that simultaneously minimizes α and β. We use α to determine critical values that subdivide the distribution into the rejection and the non-rejection regions.

## The Critical Value and the Decision Rule

The decision to reject or not to reject the null hypothesis is based on the distribution assumed by the test statistic. This means that if the test statistic follows a normal distribution, we use the level of significance (α) of the test to obtain critical values from the standard normal distribution.

The decision rule combines the critical value (denoted by \(C_α\)), the alternative hypothesis, and the test statistic (T). It determines whether to reject the null hypothesis in favor of the alternative hypothesis or to fail to reject the null hypothesis.

For the t-test, the decision rule is dependent on the alternative hypothesis. When testing against a two-sided alternative, reject the null hypothesis if \(|T|>C_α\), that is, if the absolute value of the test statistic is greater than the critical value. When testing against a one-sided alternative, reject the null hypothesis if \(T<C_α\) for a one-sided lower alternative (where \(C_α\) is the negative, lower-tail critical value) and if \(T>C_α\) for a one-sided upper alternative. When a null hypothesis is rejected at an α significance level, we say that the result is significant at the α significance level.

Note that prior to decision-making, one must decide whether the test should be one-tailed or two-tailed. The following is a brief summary of the decision rules under different scenarios:

## Left One-tailed Test

H 1 : parameter < X

Decision rule: Reject H 0 if the test statistic is less than the critical value. Otherwise, do not reject H 0.

## Right One-tailed Test

H 1 : parameter > X

Decision rule: Reject H 0 if the test statistic is greater than the critical value. Otherwise, do not reject H 0.

## Two-tailed Test

H 1 : parameter ≠ X (not equal to X)

Decision rule: Reject H 0 if the test statistic is greater than the upper critical value or less than the lower critical value.

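When the alternative is one-sided upper, the rejection region lies in the right tail of the distribution and the hypotheses are stated as:

H 0 : μ ≤ μ 0 vs. H 1 : μ > μ 0.

When the alternative is one-sided lower, the rejection region lies in the left tail and the hypotheses are stated as:

H 0 : μ ≥ μ 0 vs. H 1 : μ < μ 0.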
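The three decision rules above can be sketched in a few lines of Python. This is a minimal illustration that assumes a standard normal test statistic; the function names `critical_value` and `reject_null` are made up for this sketch and do not come from the reading.

```python
from statistics import NormalDist


def critical_value(alpha: float, tail: str) -> float:
    """Standard normal critical value for a left, right, or two-tailed test."""
    z = NormalDist()  # standard normal: mean 0, standard deviation 1
    if tail == "two":
        # Upper critical value; the lower one is its negative.
        return z.inv_cdf(1 - alpha / 2)
    if tail == "right":
        return z.inv_cdf(1 - alpha)
    if tail == "left":
        return z.inv_cdf(alpha)  # a negative number
    raise ValueError("tail must be 'left', 'right', or 'two'")


def reject_null(t_stat: float, alpha: float, tail: str) -> bool:
    """Apply the decision rule for the chosen alternative."""
    c = critical_value(alpha, tail)
    if tail == "two":
        return abs(t_stat) > c
    if tail == "right":
        return t_stat > c
    return t_stat < c  # left-tailed test
```

For example, `critical_value(0.05, "two")` is about 1.96 and `critical_value(0.05, "right")` is about 1.645, the two values used throughout this reading.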

## Example: Hypothesis Test on the Mean

Consider the returns from a portfolio \(X=(x_1,x_2,\dots, x_n)\) from 1980 through 2020. The approximated mean of the returns is 7.50%, with a standard deviation of 17%. We wish to determine whether the expected value of the return is different from 0 at a 5% significance level.

We start by stating the two-sided hypothesis test:

H 0 : μ =0 vs. H 1 : μ ≠ 0

The test statistic is:

$$T=\frac{\hat{\mu}-{\mu}_0}{\sqrt{\frac{\hat{\sigma}^2}{n}}} \sim N(0,1)$$

In this case, we have,

\(\hat{μ}=0.075\)

\(\hat{\sigma}^2=0.17^2\)

\(n=40\)

$$T=\frac{0.075-0}{\sqrt{\frac{0.17^2}{40}}} \approx 2.79$$

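At the significance level \(α=5\%\), the critical value is \(±1.96\). Since this is a two-sided test, the rejection regions are \((-\infty,-1.96)\) and \((1.96, \infty)\). The test statistic of 2.79 falls in the upper rejection region, so we reject the null hypothesis and conclude that the expected return is different from 0 at the 5% significance level.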
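The arithmetic in this example can be checked with a short script. This is a sketch under the example's own assumptions (a normally distributed test statistic and n = 40); the helper name `z_statistic` is illustrative.

```python
from math import sqrt
from statistics import NormalDist


def z_statistic(mean_hat: float, mu_0: float, sigma_hat: float, n: int) -> float:
    """(sample statistic - hypothesized value) / standard error."""
    return (mean_hat - mu_0) / (sigma_hat / sqrt(n))


t_stat = z_statistic(mean_hat=0.075, mu_0=0.0, sigma_hat=0.17, n=40)
critical = NormalDist().inv_cdf(0.975)  # two-sided test at the 5% level
reject = abs(t_stat) > critical

print(round(t_stat, 2), round(critical, 2), reject)  # 2.79 1.96 True
```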

The example above is a Z-test, which is the test mostly emphasized in this chapter and which follows immediately from the central limit theorem (CLT). However, we can use the Student’s t-distribution if the random variables are iid and normally distributed and the sample size is small (n<30).

With the Student’s t-distribution, we use the unbiased estimator of the variance. That is:

$$s^2=\frac{1}{n-1}\sum_{i=1}^{n}\left(x_i-\hat{\mu}\right)^2$$

Therefore the test statistic for \(H_0:μ=μ_0\) is given by:

$$T=\frac{\hat{\mu}-{\mu}_0}{\sqrt{\frac{s^2}{n}}} \sim t_{n-1}$$
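As a small illustration of the t-statistic, the sketch below computes it from raw data. The five returns are hypothetical values chosen for easy arithmetic, and `t_statistic` is an illustrative name. Python's `statistics.variance` divides by n - 1, so it matches the unbiased estimator \(s^2\).

```python
from math import sqrt
from statistics import mean, variance


def t_statistic(sample: list[float], mu_0: float) -> float:
    """t-statistic for H0: mu = mu_0, using the unbiased variance s^2."""
    n = len(sample)
    s2 = variance(sample)  # sample variance with the n - 1 divisor
    return (mean(sample) - mu_0) / sqrt(s2 / n)


# Hypothetical returns, not taken from the reading.
returns = [0.02, -0.01, 0.03, 0.05, 0.01]
t_stat = t_statistic(returns, mu_0=0.0)  # approximately 2.0, with n - 1 = 4 degrees of freedom
```

The resulting statistic is compared with a critical value from the \(t_{n-1}\) distribution, which has to come from a t-table or a statistics library because the Python standard library does not provide one.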

## The Type II Error and the Test Power

The power of a test is the complement of the probability of a type II error. While the level of significance gives us the probability of rejecting the null hypothesis when it’s, in fact, true, the power of a test gives the probability of correctly rejecting the null hypothesis when it is false. Denoting the probability of a type II error by \(\beta\), the power of a test is given by:

$$ \text{Power of a Test}=1-\beta $$

The power of a test measures the likelihood that a false null hypothesis is rejected. It is influenced by the sample size, the distance between the hypothesized parameter and the true value, and the size of the test.
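To make these three influences concrete, here is a minimal sketch of the power of a two-sided Z-test. It assumes a known standard deviation and a normally distributed test statistic; the function name `power_two_sided_z` and the input values are illustrative and not part of the reading.

```python
from math import sqrt
from statistics import NormalDist


def power_two_sided_z(mu_true: float, mu_0: float, sigma: float, n: int,
                      alpha: float = 0.05) -> float:
    """P(reject H0: mu = mu_0) when the true mean is mu_true."""
    z = NormalDist()
    c = z.inv_cdf(1 - alpha / 2)
    # Under the true mean, the test statistic is N(shift, 1) instead of N(0, 1).
    shift = (mu_true - mu_0) / (sigma / sqrt(n))
    upper_tail = 1 - z.cdf(c - shift)   # P(T > c)
    lower_tail = z.cdf(-c - shift)      # P(T < -c)
    return upper_tail + lower_tail


# Hypothetical scenario: the true mean really is 7.5% with sigma = 17%.
power_n40 = power_two_sided_z(0.075, 0.0, 0.17, n=40)
power_n160 = power_two_sided_z(0.075, 0.0, 0.17, n=160)
```

When `mu_true` equals `mu_0`, the function returns α itself, which is the size of the test. Moving the true mean away from the hypothesized value, or raising n, pushes the power toward 1.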

## Confidence Intervals

A confidence interval is a range of parameter values that is expected to contain the true parameter at a given confidence level. For instance, a 95% confidence interval constitutes the set of parameter values for which the null hypothesis cannot be rejected when using a 5% test size. Therefore, a \(1-α\) confidence interval contains the values that cannot be rejected at a test size of \(α\).

It is important to note that the confidence interval depends on the alternative hypothesis statement in the test. Let us start with the two-sided test alternatives.

$$ H_0:μ=0$$

$$H_1:μ≠0$$

Then the \(1-α\) confidence interval is given by:

$$\left[\hat{\mu} -C_{\alpha} \times \frac{\hat {\sigma}}{\sqrt{n}} ,\hat{\mu} + C_{\alpha} \times \frac{\hat {\sigma}}{\sqrt{n}} \right]$$

\(C_α\) is the critical value at \(α\) test size.
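The link between the test and the interval can be made explicit. As a short sketch, assuming the two-sided decision rule given earlier (fail to reject when \(|T| \le C_α\)) and a general hypothesized mean \(μ_0\), solving the non-rejection condition for \(μ_0\) recovers the interval above:

$$\begin{align*}|T| \le C_{\alpha} &\iff \left|\frac{\hat{\mu}-\mu_0}{\hat{\sigma}/\sqrt{n}}\right| \le C_{\alpha}\\ &\iff \hat{\mu}-C_{\alpha} \times \frac{\hat{\sigma}}{\sqrt{n}} \le \mu_0 \le \hat{\mu}+C_{\alpha} \times \frac{\hat{\sigma}}{\sqrt{n}}\end{align*}$$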

## Example: Calculating Two-Sided Alternative Confidence Intervals

Consider the returns from a portfolio \(X=(x_1,x_2,…, x_n)\) from 1980 through 2020. The approximated mean of the returns is 7.50%, with a standard deviation of 17%. Calculate the 95% confidence interval for the portfolio return.

The \(1-\alpha\) confidence interval is given by:

$$\begin{align*}&\left[\hat{\mu}-C_{\alpha} \times \frac{\hat {\sigma}}{\sqrt{n}} ,\hat{\mu} + C_{\alpha} \times \frac{\hat {\sigma}}{\sqrt{n}} \right]\\& =\left[0.0750-1.96 \times \frac{0.17}{\sqrt{40}}, 0.0750+1.96 \times \frac{0.17}{\sqrt{40}} \right]\\&=[0.02232,0.1277]\end{align*}$$

Thus, the confidence interval implies that any null value between 2.23% and 12.77% cannot be rejected against the two-sided alternative.
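The same interval can be reproduced numerically. The sketch below assumes a normal critical value and uses the example's inputs; `two_sided_ci` is an illustrative name.

```python
from math import sqrt
from statistics import NormalDist


def two_sided_ci(mean_hat: float, sigma_hat: float, n: int,
                 alpha: float = 0.05) -> tuple[float, float]:
    """Two-sided (1 - alpha) confidence interval for the mean."""
    c = NormalDist().inv_cdf(1 - alpha / 2)
    half_width = c * sigma_hat / sqrt(n)
    return mean_hat - half_width, mean_hat + half_width


lower, upper = two_sided_ci(mean_hat=0.075, sigma_hat=0.17, n=40)
print(round(lower, 4), round(upper, 4))  # 0.0223 0.1277
```

Because 0 lies outside this interval, the result agrees with the earlier two-sided test, which rejected \(H_0:μ=0\) at the 5% level.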

## One-Sided Alternative

For the one-sided alternative, the confidence interval is given by either:

$$\left(-\infty ,\hat{\mu} +C_{\alpha} \times \frac{\hat{\sigma}}{\sqrt{n}} \right )$$

for the lower alternative, or

$$\left ( \hat{\mu} -C_{\alpha} \times \frac{\hat{\sigma}}{\sqrt{n}},\infty \right )$$

for the upper alternative. Here \(C_α\) is the positive one-sided critical value, and each interval collects the null values that the corresponding one-sided test fails to reject.

## Example: Calculating the One-Sided Alternative Confidence Interval

Using the portfolio returns from the previous example, assume that we were conducting the following one-sided test:

\(H_0:μ≤0\)

\(H_1:μ>0\)

This is an upper alternative, so the 95% confidence interval for the portfolio return is:

$$\begin{align*}&\left(\hat{\mu} -C_{\alpha} \times \frac{\hat{\sigma}}{\sqrt{n}},\infty \right )\\&=\left(0.0750-1.645\times \frac{0.17}{\sqrt{40}},\infty\right)\\&=(0.0308, \infty)\end{align*}$$

On the other hand, if the hypothesis test was:

\(H_0:μ≥0\)

\(H_1:μ<0\)

This is a lower alternative, so the 95% confidence interval would be:

$$\begin{align*}&\left(-\infty ,\hat{\mu} +C_{\alpha} \times \frac{\hat{\sigma}}{\sqrt{n}} \right )\\&=\left(-\infty ,0.0750+1.645\times \frac{0.17}{\sqrt{40}}\right)\\&=(-\infty, 0.1192)\end{align*}$$

Note that the critical value decreased from 1.96 to 1.645 because the test is now one-sided, so the entire 5% test size sits in a single tail.
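A short sketch of the one-sided bounds follows. It assumes the convention stated in the Confidence Intervals section, namely that the interval is the set of null values the test fails to reject, so an upper alternative yields a lower bound and a lower alternative yields an upper bound. The function name `one_sided_ci` is illustrative.

```python
from math import inf, sqrt
from statistics import NormalDist


def one_sided_ci(mean_hat: float, sigma_hat: float, n: int,
                 alternative: str, alpha: float = 0.05) -> tuple[float, float]:
    """One-sided (1 - alpha) interval of null values that are not rejected."""
    c = NormalDist().inv_cdf(1 - alpha)  # about 1.645 for alpha = 0.05
    margin = c * sigma_hat / sqrt(n)
    if alternative == "upper":   # H1: mu > mu_0
        return mean_hat - margin, inf
    if alternative == "lower":   # H1: mu < mu_0
        return -inf, mean_hat + margin
    raise ValueError("alternative must be 'upper' or 'lower'")


upper_alt = one_sided_ci(0.075, 0.17, 40, "upper")
lower_alt = one_sided_ci(0.075, 0.17, 40, "lower")
```

With the portfolio inputs, the finite bounds are about 0.0308 for the upper alternative and about 0.1192 for the lower alternative.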

## The p-Value

When carrying out a statistical test with a fixed value of the significance level (α), we merely compare the observed test statistic with some critical value. For example, we might “reject H 0 using a 5% test” or “reject H 0 at 1% significance level”. The problem with this ‘classical’ approach is that it does not give us details about the strength of the evidence against the null hypothesis.

Determination of the p-value gives statisticians a more informative approach to hypothesis testing. The p-value is the lowest significance level at which we can reject H 0 . This means that the strength of the evidence against H 0 increases as the p-value becomes smaller. How the p-value is computed depends on the alternative.

## The p-Value for One-Tailed Test Alternative

For one-tailed tests, the p-value is the probability that lies below the calculated test statistic for left-tailed tests. Similarly, the probability that lies above the test statistic gives the p-value in right-tailed tests.

Denoting the test statistic by T, the p-value for \(H_1:μ>0\) is given by:

$$P(Z>T)=1-P(Z≤T)=1- \Phi (T) $$

Conversely, for \(H_1:μ<0 \) the p-value is given by:

$$ P(Z≤T)= \Phi (T)$$

Where Z is a standard normal random variable and \(\Phi\) is its cumulative distribution function. The sign of T is kept in one-tailed tests because only the tail named in the alternative counts as evidence against the null hypothesis.

## The p-Value for Two-Tailed Test Alternative

If the test is two-tailed, the p-value is given by the sum of the probabilities in the two tails. We start by determining the probability lying below the negative value of the test statistic. Then, we add this to the probability lying above the positive value of the test statistic. Taking the absolute value \(|T|\) ensures that the tails are measured correctly whether T is negative or positive. That is, the p-value for the two-tailed hypothesis test is given by:

$$2\left[1-\Phi (|T|)\right]$$
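The p-value rules can be collected into one helper. This is a sketch that assumes a standard normal test statistic, uses the signed T for one-tailed tests (as the coin example below does with P(Z < -2.05)) and \(|T|\) for the two-tailed test. The name `p_value` is illustrative.

```python
from statistics import NormalDist


def p_value(t_stat: float, tail: str) -> float:
    """p-value of a standard normal test statistic for the given alternative."""
    phi = NormalDist().cdf  # standard normal CDF
    if tail == "right":     # H1: parameter > hypothesized value
        return 1 - phi(t_stat)
    if tail == "left":      # H1: parameter < hypothesized value
        return phi(t_stat)
    if tail == "two":       # H1: parameter != hypothesized value
        return 2 * (1 - phi(abs(t_stat)))
    raise ValueError("tail must be 'left', 'right', or 'two'")
```

For instance, `p_value(2.2, "two")` is about 0.0278 and `p_value(-2.05, "left")` is about 0.0202, matching the two worked examples that follow.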

## Example 1: p-Value for One-Sided Alternative

Let θ represent the probability of obtaining a head when a coin is tossed. Suppose we toss the coin 200 times, and heads come up in 85 of the trials. Test the following hypothesis at 5% level of significance.

H 0 : θ = 0.5

H 1 : θ < 0.5

First, note that the number of heads in repeated coin tosses follows a binomial distribution.

Our p-value will be given by P(X ≤ 85), where X ~ Binomial(200, 0.5) with mean 100 (np = 200 × 0.5), assuming H 0 is true. Using the normal approximation with a continuity correction:

$$\begin{align*}P\left [ Z< \frac{85.5-100}{\sqrt{50}} \right]&=P(Z<-2.05)\\&=1-0.97982=0.02018 \end{align*}$$

Recall that for a binomial distribution, the variance is given by:

$$np(1-p)=200(0.5)(1-0.5)=50$$

(We have applied the Central Limit Theorem by taking the binomial distribution as approx. normal)

Since the p-value (0.02018) is less than 0.05, we reject H 0 at the 5% level of significance: the data provide strong evidence against H 0 in favor of H 1 . Thus, clearly expressing this result, we could say:

“There is very strong evidence against the hypothesis that the coin is fair. We, therefore, conclude that the coin is biased against heads.”

Remember, failure to reject H 0 does not mean it’s true. It means there’s insufficient evidence to justify rejecting H 0, given a certain level of significance.
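As a sanity check on the normal approximation used in this example, the sketch below compares it with the exact binomial tail probability. It relies only on the standard library; the variable names are illustrative.

```python
from math import comb, sqrt
from statistics import NormalDist

n, p0, heads = 200, 0.5, 85

# Exact left-tail probability P(X <= 85) under H0: theta = 0.5.
exact = sum(comb(n, k) * p0**k * (1 - p0) ** (n - k) for k in range(heads + 1))

# Normal approximation with continuity correction, as in the example.
z = (heads + 0.5 - n * p0) / sqrt(n * p0 * (1 - p0))
approx = NormalDist().cdf(z)

# Both values are close to 0.02 and fall below the 5% significance level.
reject = exact < 0.05 and approx < 0.05
```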

## Example 2: p-Value for Two-Sided Alternative

A CFA candidate conducts a statistical test about the mean value of a random variable X.

H 0 : μ = μ 0 vs. H 1 : μ ≠ μ 0

She obtains a test statistic of 2.2. Given a 5% significance level, determine and interpret the p-value.

$$ \text{P-value}=2P(Z>2.2)=2[1–P(Z≤2.2)] =1.39\%×2=2.78\%$$

(We have multiplied by two since this is a two-tailed test)

The p-value (2.78%) is less than the level of significance (5%). Therefore, we have sufficient evidence to reject H 0 . In fact, the evidence is so strong that we would also reject H 0 at significance levels of 4% and 3%. However, at significance levels of 2% or 1%, we would not reject H 0 since the p-value surpasses these values.

## Hypothesis about the Difference between Two Population Means

It’s common for analysts to be interested in establishing whether there exists a significant difference between the means of two different populations. For instance, they might want to know whether the average returns for two subsidiaries of a given company exhibit significant differences.

Now, consider a bivariate random variable:

$$W_i=[X_i,Y_i]$$

Assume that the components \(X_i\) and \(Y_i\) are both iid and are correlated. That is: \(\text{Corr} (X_i,Y_i )≠0\).

Now, suppose that we want to test the hypothesis that:

$$H_0:μ_X=μ_Y$$

$$H_1:μ_X≠μ_Y$$

In other words, we want to test whether the constituent random variables have equal means. Note that the hypothesis statement above can be written as:

$$H_0:μ_X-μ_Y=0$$

$$H_1:μ_X-μ_Y≠0$$

To execute this test, consider the variable:

$$Z_i=X_i-Y_i$$

Therefore, considering the above random variable, if the null hypothesis is correct then,

$$E(Z_i)=E(X_i)-E(Y_i)=μ_X-μ_Y=0$$

Intuitively, this can be considered as a standard hypothesis test of

H 0 : μ Z =0 vs. H 1 : μ Z ≠ 0.

The test statistic is given by:

$$T=\frac{\hat{\mu}_z}{\sqrt{\frac{\hat{\sigma}^2_z}{n}}} \sim N(0,1)$$

Note that the test statistic formula accounts for the correlation between \(X_i \) and \(Y_i\). It is easy to see that:

$$V(Z_i)=V(X_i )+V(Y_i)-2COV(X_i, Y_i)$$

In terms of sample estimates, this can be written as:

$$\hat{\sigma}^2_z =\hat{\sigma}^2_X +\hat{\sigma}^2_Y - 2\hat{\sigma}_{XY}$$

$$ \hat{\mu}_z =\hat{\mu}_X-\hat{\mu}_Y $$

And thus the test statistic formula can be written as:

$$T=\frac{\hat{\mu}_X -\hat{\mu}_Y}{\sqrt{\frac{\hat{\sigma}^2_X +\hat{\sigma}^2_Y - 2\hat{\sigma}_{XY}}{n}}}$$

This formula indicates that correlation plays a crucial role in determining the magnitude of the test statistic.

Another special case of the test statistic arises when \(X_i\) and \(Y_i\) are iid and independent of each other. The test statistic is then given by:

$$T=\frac{\hat{\mu}_X -\hat{\mu}_Y}{\sqrt{\frac{\hat{\sigma}^2_X}{n_X}+\frac{\hat{\sigma}^2_Y}{n_Y}}}$$

Where \(n_X\) and \(n_Y\) are the sample sizes of \(X_i\) and \(Y_i\), respectively.

## Example: Hypothesis Test on Two Means

An investment analyst wants to test whether there is a significant difference between the means of two portfolios at a 95% confidence level. The first portfolio X consists of 30 government-issued bonds and has a mean of 10% and a standard deviation of 2%. The second portfolio Y consists of 30 private bonds with a mean of 14% and a standard deviation of 3%. The correlation between the two portfolios is 0.7. Calculate the test statistic and state whether the null hypothesis is rejected.

The hypothesis statement is given by:

H 0 : μ X – μ Y =0 vs. H 1 : μ X – μ Y ≠ 0.

Note that this is a two-tailed test. At 95% level, the test size is α=5% and thus the critical value \(C_α=±1.96\).

Recall that:

$$Cov(X, Y)=σ_{XY}=ρ_{XY} σ_X σ_Y$$

Where \(ρ_{XY}\) is the correlation coefficient between X and Y.

Now the test statistic is given by:

$$T=\frac{\hat{\mu}_X -\hat{\mu}_Y}{\sqrt{\frac{\hat{\sigma}^2_X +\hat{\sigma}^2_Y - 2\hat{\sigma}_{XY}}{n}}}=\frac{\hat{\mu}_X -\hat{\mu}_Y}{\sqrt{\frac{\hat{\sigma}^2_X +\hat{\sigma}^2_Y - 2{\rho}_{XY} \hat{\sigma}_X \hat{\sigma}_Y}{n}}}$$

$$=\frac{0.10-0.14}{\sqrt{\frac{0.02^2 +0.03^2-2\times 0.7 \times 0.02 \times 0.03}{30}}}=-10.215$$

The test statistic is far less than -1.96. Therefore, the null hypothesis is rejected at the 5% significance level (95% confidence level).
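The calculation can be reproduced with the sketch below, which also shows how much the correlation matters by comparing it with the independent-samples formula on the same inputs. Both function names are illustrative, and treating the same portfolios as independent is a hypothetical comparison, not part of the example.

```python
from math import sqrt


def paired_statistic(mean_x: float, mean_y: float, sd_x: float, sd_y: float,
                     rho: float, n: int) -> float:
    """Test statistic for H0: mu_X = mu_Y when X and Y are correlated."""
    cov_xy = rho * sd_x * sd_y
    var_z = sd_x**2 + sd_y**2 - 2 * cov_xy  # variance of Z = X - Y
    return (mean_x - mean_y) / sqrt(var_z / n)


def independent_statistic(mean_x: float, mean_y: float, sd_x: float, sd_y: float,
                          n_x: int, n_y: int) -> float:
    """Test statistic for H0: mu_X = mu_Y when X and Y are independent."""
    return (mean_x - mean_y) / sqrt(sd_x**2 / n_x + sd_y**2 / n_y)


t_paired = paired_statistic(0.10, 0.14, 0.02, 0.03, rho=0.7, n=30)
t_indep = independent_statistic(0.10, 0.14, 0.02, 0.03, n_x=30, n_y=30)
```

Here `t_paired` is about -10.215, as in the example, while `t_indep` is about -6.08. The positive correlation shrinks the variance of the difference and therefore enlarges the magnitude of the test statistic.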

## The Problem of Multiple Testing

Multiple testing occurs when multiple hypothesis tests are conducted on the same data set. The reuse of data can produce spurious results and unreliable conclusions that do not hold up to scrutiny. The fundamental problem with multiple testing is that the test size (i.e., the probability that a true null is rejected) is only applicable for a single test. Repeated testing creates an overall test size that is much larger than the assumed size of α and therefore increases the probability of a Type I error.

Some control methods have been developed to combat multiple testing. These include the Bonferroni correction and procedures that control the Familywise Error Rate (FWER) or the False Discovery Rate (FDR).
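To get a feel for how quickly the overall test size grows, the sketch below computes the probability of at least one false rejection across m tests, under the simplifying assumption that the tests are independent, and applies the Bonferroni adjustment of dividing α by m. The function names are illustrative.

```python
def familywise_error_rate(alpha: float, m: int) -> float:
    """P(at least one Type I error) across m independent tests of size alpha."""
    return 1 - (1 - alpha) ** m


def bonferroni_alpha(alpha: float, m: int) -> float:
    """Per-test size that keeps the familywise error rate at or below alpha."""
    return alpha / m


fwer_unadjusted = familywise_error_rate(0.05, 10)                      # about 0.40
fwer_adjusted = familywise_error_rate(bonferroni_alpha(0.05, 10), 10)  # just under 0.05
```

With ten tests at the 5% level, the chance of at least one spurious rejection is roughly 40%. Testing each hypothesis at 0.5% instead brings the familywise rate back below 5%.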

## Practice Question

An experiment was done to find out the number of hours that candidates spend preparing for the FRM part 1 exam. For a sample of 10 students, the average study time was found to be 312.7 hours, with a standard deviation of 7.2 hours. What is the 95% confidence interval for the mean study time of all candidates?

- A. [307.5, 317.9]
- B. [310, 317]
- C. [300, 317]
- D. [307.5, 312.2]

The correct answer is A.

To calculate the 95% confidence interval for the mean study time of all candidates, we can use the formula for the confidence interval when the population variance is unknown:

$$\text{Confidence Interval} = \bar{X} \pm t_{1-\frac{\alpha}{2}} \times \frac{s}{\sqrt{n}}$$

Where:

- \(\bar{X}\) is the sample mean
- \(t_{1-\frac{\alpha}{2}}\) is the t-score corresponding to the desired confidence level and degrees of freedom
- \(s\) is the sample standard deviation
- \(n\) is the sample size

In this case:

- \(\bar{X} = 312.7\) hours (the average study time)
- \(s = 7.2\) hours (the standard deviation of study time)
- \(n = 10\) students (the sample size)

To find the t-score (\(t_{1-\frac{\alpha}{2}}\)), we look at the t-table for the 95% confidence level (which corresponds to \(\alpha = 0.05\)) and 9 degrees of freedom (\(n - 1 = 10 - 1 = 9\)). The t-score is 2.262.

Now, we can plug these values into the confidence interval formula:

$$\text{Confidence Interval} = 312.7 \pm 2.262 \times \frac{7.2}{\sqrt{10}}$$

Calculating the margin of error:

$$\text{Margin of Error} = 2.262 \times \frac{7.2}{\sqrt{10}} \approx 5.2$$

So the confidence interval is:

$$\text{Confidence Interval} = 312.7 \pm 5.2 = [307.5, 317.9]$$

Therefore, the 95% confidence interval for the mean study time of all candidates is [307.5, 317.9] hours.


## Understanding Hypothesis Tests: Confidence Intervals and Confidence Levels


In this series of posts, I show how hypothesis tests and confidence intervals work by focusing on concepts and graphs rather than equations and numbers.

Previously, I used graphs to show what statistical significance really means. In this post, I’ll explain both confidence intervals and confidence levels, and how they’re closely related to P values and significance levels.

## How to Correctly Interpret Confidence Intervals and Confidence Levels

A confidence interval is a range of values that is likely to contain an unknown population parameter. If you draw a random sample many times, a certain percentage of the confidence intervals will contain the population mean. This percentage is the confidence level.

Most frequently, you’ll use confidence intervals to bound the mean or standard deviation, but you can also obtain them for regression coefficients, proportions, rates of occurrence (Poisson), and for the differences between populations.

Just as there is a common misconception of how to interpret P values, there’s a common misconception of how to interpret confidence intervals. In this case, the confidence level is not the probability that a specific confidence interval contains the population parameter.

The confidence level represents the theoretical ability of the analysis to produce accurate intervals if you are able to assess many intervals and you know the value of the population parameter. For a specific confidence interval from one study, the interval either contains the population value or it does not—there’s no room for probabilities other than 0 or 1. And you can't choose between these two possibilities because you don’t know the value of the population parameter.

"The parameter is an unknown constant and no probability statement concerning its value may be made." —Jerzy Neyman, original developer of confidence intervals.

This will be easier to understand after we discuss the graph below...

With this in mind, how do you interpret confidence intervals?

Confidence intervals serve as good estimates of the population parameter because the procedure tends to produce intervals that contain the parameter. Confidence intervals consist of the point estimate (the most likely value) and a margin of error around that point estimate. The margin of error indicates the amount of uncertainty that surrounds the sample estimate of the population parameter.

In this vein, you can use confidence intervals to assess the precision of the sample estimate. For a specific variable, a narrower confidence interval [90 110] suggests a more precise estimate of the population parameter than a wider confidence interval [50 150].

## Confidence Intervals and the Margin of Error

Let’s move on to see how confidence intervals account for that margin of error. To do this, we’ll use the same tools that we’ve been using to understand hypothesis tests. I’ll create a sampling distribution using probability distribution plots, the t-distribution, and the variability in our data. We'll base our confidence interval on the energy cost data set that we've been using.

When we looked at significance levels, the graphs displayed a sampling distribution centered on the null hypothesis value, and the outer 5% of the distribution was shaded. For confidence intervals, we need to shift the sampling distribution so that it is centered on the sample mean and shade the middle 95%.

The shaded area shows the range of sample means that you’d obtain 95% of the time using our sample mean as the point estimate of the population mean. This range [267 394] is our 95% confidence interval.

Using the graph, it’s easier to understand how a specific confidence interval represents the margin of error, or the amount of uncertainty, around the point estimate. The sample mean is the most likely value for the population mean given the information that we have. However, the graph shows it would not be unusual at all for other random samples drawn from the same population to obtain different sample means within the shaded area. These other likely sample means all suggest different values for the population mean. Hence, the interval represents the inherent uncertainty that comes with using sample data.

You can use these graphs to calculate probabilities for specific values. However, notice that you can’t place the population mean on the graph because that value is unknown. Consequently, you can’t calculate probabilities for the population mean, just as Neyman said!

## Why P Values and Confidence Intervals Always Agree About Statistical Significance

You can use either P values or confidence intervals to determine whether your results are statistically significant. If a hypothesis test produces both, these results will agree.

The confidence level is equivalent to 1 – the alpha level. So, if your significance level is 0.05, the corresponding confidence level is 95%.

- If the P value is less than your significance (alpha) level, the hypothesis test is statistically significant.
- If the confidence interval does not contain the null hypothesis value, the results are statistically significant.
- If the P value is less than alpha, the confidence interval will not contain the null hypothesis value.

For our example, the P value (0.031) is less than the significance level (0.05), which indicates that our results are statistically significant. Similarly, our 95% confidence interval [267, 394] does not include the null hypothesis mean of 260 and we draw the same conclusion.

To understand why the results always agree, let’s recall how both the significance level and confidence level work.

- The significance level defines the distance the sample mean must be from the null hypothesis to be considered statistically significant.
- The confidence level defines the distance between the confidence limits and the sample mean.

Both the significance level and the confidence level define a distance from a limit to a mean. Guess what? The distances in both cases are exactly the same!

The distance equals the critical t-value × the standard error of the mean. For our energy cost example data, the distance works out to be $63.57.

Imagine this discussion between the null hypothesis mean and the sample mean:

Null hypothesis mean, hypothesis test representative : Hey buddy! I’ve found that you’re statistically significant because you’re more than $63.57 away from me!

Sample mean, confidence interval representative: Actually, I’m significant because you’re more than $63.57 away from me!

Very agreeable, aren’t they? And they always will agree as long as you compare the correct pairs of P values and confidence intervals. If you compare the incorrect pair, you can get conflicting results, as shown by common mistake #1 in this post.
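The agreement can be checked with a few lines of Python. This is a minimal sketch rather than the software behind the graphs: the function name `decisions` is hypothetical, and the sample mean of 330.5 is an assumption taken as the midpoint of the rounded interval [267, 394].

```python
def decisions(sample_mean, null_mean, margin):
    """Return (test_significant, ci_excludes_null) for a two-sided test.

    margin = critical t-value * standard error of the mean.
    """
    # Hypothesis-test view: is the sample mean farther than `margin`
    # from the null hypothesis mean?
    test_significant = abs(sample_mean - null_mean) > margin

    # Confidence-interval view: does [mean - margin, mean + margin]
    # exclude the null hypothesis mean?
    lower, upper = sample_mean - margin, sample_mean + margin
    ci_excludes_null = not (lower <= null_mean <= upper)

    return test_significant, ci_excludes_null


# 260 and 63.57 come from the example; 330.5 is an assumed sample mean
# (midpoint of the rounded interval [267, 394]).
print(decisions(330.5, 260, 63.57))  # (True, True)
```

Because both views compare the same two means against the same distance, the two flags can never disagree.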

## Closing Thoughts

In statistical analyses, there tends to be a greater focus on P values and simply detecting a significant effect or difference. However, a statistically significant effect is not necessarily meaningful in the real world. For instance, the effect might be too small to be of any practical value.

It’s important to pay attention to both the magnitude and the precision of the estimated effect. That’s why I'm rather fond of confidence intervals. They allow you to assess these important characteristics along with the statistical significance. You'd like to see a narrow confidence interval where the entire range represents an effect that is meaningful in the real world.

If you like this post, you might want to read the previous posts in this series that use the same graphical framework:

- Part One: Why We Need to Use Hypothesis Tests
- Part Two: Significance Levels (alpha) and P values

For more about confidence intervals, read my post where I compare them to tolerance intervals and prediction intervals.

If you'd like to see how I made the probability distribution plot, please read: How to Create a Graphical Version of the 1-sample t-Test.

© 2023 Minitab, LLC. All Rights Reserved.

## Understanding Confidence Intervals | Easy Examples & Formulas

Published on August 7, 2020 by Rebecca Bevans. Revised on June 22, 2023.

When you make an estimate in statistics, whether it is a summary statistic or a test statistic, there is always uncertainty around that estimate because the number is based on a sample of the population you are studying.

The confidence interval is the range of values that you expect your estimate to fall between a certain percentage of the time if you run your experiment again or re-sample the population in the same way.

The confidence level is the percentage of times you expect to reproduce an estimate between the upper and lower bounds of the confidence interval, and is set by the alpha value.

## Table of contents

- What exactly is a confidence interval?
- Calculating a confidence interval: what you need to know
- Confidence interval for the mean of normally-distributed data
- Confidence interval for proportions
- Confidence interval for non-normally distributed data
- Reporting confidence intervals
- Caution when using confidence intervals
- Other interesting articles
- Frequently asked questions about confidence intervals

A confidence interval is the mean of your estimate plus and minus the variation in that estimate. This is the range of values you expect your estimate to fall between if you redo your test, within a certain level of confidence.

Confidence, in statistics, is another way to describe probability. For example, if you construct a confidence interval with a 95% confidence level, you are confident that 95 out of 100 times the estimate will fall between the upper and lower values specified by the confidence interval.

Your desired confidence level is usually one minus the alpha (α) value you used in your statistical test:

Confidence level = 1 − α

So if you use an alpha value of p < 0.05 for statistical significance, then your confidence level would be 1 − 0.05 = 0.95, or 95%.

## When do you use confidence intervals?

You can calculate confidence intervals for many kinds of statistical estimates, including:

- Proportions
- Population means
- Differences between population means or proportions
- Estimates of variation among groups

These are all point estimates, and don’t give any information about the variation around the number. Confidence intervals are useful for communicating the variation around a point estimate.

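For example, suppose a survey of British and American respondents finds the same average number of hours of television watched in both groups. However, the British people surveyed had a wide variation in the number of hours watched, while the Americans all watched similar amounts.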

Even though both groups have the same point estimate (average number of hours watched), the British estimate will have a wider confidence interval than the American estimate because there is more variation in the data.

Most statistical programs will include the confidence interval of the estimate when you run a statistical test.

If you want to calculate a confidence interval on your own, you need to know:

- The point estimate you are constructing the confidence interval for
- The critical values for the test statistic
- The standard deviation of the sample
- The sample size

Once you know each of these components, you can calculate the confidence interval for your estimate by plugging them into the confidence interval formula that corresponds to your data.

## Point estimate

The point estimate of your confidence interval will be whatever statistical estimate you are making (e.g., population mean, the difference between population means, proportions, variation among groups).

## Finding the critical value

Critical values tell you how many standard deviations away from the mean you need to go in order to reach the desired confidence level for your confidence interval.

There are three steps to find the critical value.

- Choose your alpha (α) value.

The alpha value is the probability threshold for statistical significance. The most common alpha value is 0.05, but 0.1, 0.01, and even 0.001 are sometimes used. It’s best to look at the research papers published in your field to decide which alpha value to use.

- Decide if you need a one-tailed interval or a two-tailed interval.

You will most likely use a two-tailed interval unless you are doing a one-tailed t test.

For a two-tailed interval, divide your alpha by two to get the alpha value for the upper and lower tails.

- Look up the critical value that corresponds with the alpha value.

If your data follows a normal distribution, or if you have a large sample size (n > 30) that is approximately normally distributed, you can use the z distribution to find your critical values.

For a z statistic, the most common two-tailed critical values are approximately 1.645 (90% confidence), 1.96 (95% confidence), and 2.576 (99% confidence).

If you are using a small dataset (n ≤ 30) that is approximately normally distributed, use the t distribution instead.

The t distribution follows the same shape as the z distribution, but corrects for small sample sizes. For the t distribution, you need to know your degrees of freedom (sample size minus 1).

Check out this set of t tables to find your t statistic. We have included the confidence level and p values for both one-tailed and two-tailed tests to help you find the t value you need.

For symmetric distributions, like the t distribution and z distribution, the critical value is the same distance on either side of the mean.

For a two-tailed 95% confidence interval, the alpha value is 0.05, which leaves 0.025 in each tail, and the corresponding critical value is 1.96.
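The three steps above can be sketched in Python using only the standard library. The helper name `z_critical` is hypothetical. The sketch covers the z distribution only, because the standard library has no t distribution; for small samples you would still need a t table or a statistics package.

```python
from statistics import NormalDist


def z_critical(confidence_level, two_tailed=True):
    """Critical value of the standard normal (z) distribution."""
    alpha = 1 - confidence_level                 # step 1: alpha
    tail = alpha / 2 if two_tailed else alpha    # step 2: split alpha if two-tailed
    return NormalDist().inv_cdf(1 - tail)        # step 3: look up the critical value


for level in (0.90, 0.95, 0.99):
    print(f"{level:.0%} two-tailed: z* = {z_critical(level):.2f}")
```

`NormalDist().inv_cdf` plays the role of the lookup table: it returns the z value below which the given share of the distribution falls.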

## Finding the standard deviation

Most statistical software will have a built-in function to calculate your standard deviation, but to find it by hand you can first find your sample variance, then take the square root to get the standard deviation.

- Find the sample variance

Sample variance ($s^2$) is the sum of squared differences from the sample mean, divided by n − 1 (sample size minus 1):

$$s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$$

To find it, subtract your sample mean from each value in the dataset, square each resulting number, and add up all of the squares. Then divide that total by n − 1 to get your sample variance ($s^2$). For larger sample sets, it’s easiest to do this in Excel.

- Find the standard deviation.

The standard deviation of your estimate ($s$) is equal to the square root of the sample variance ($s^2$):

$$s = \sqrt{s^2}$$

In the survey example, the sample standard deviation is:

- 10 for the GB estimate.
- 5 for the USA estimate.
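The by-hand calculation can be sketched as follows. The `hours` values are hypothetical illustration data, not the survey data, and `sample_variance` is a name chosen for this sketch; Python's `statistics.stdev` is used only as a cross-check.

```python
import math
import statistics


def sample_variance(values):
    """Sum of squared differences from the mean, divided by n - 1."""
    n = len(values)
    mean = sum(values) / n
    squared_diffs = [(x - mean) ** 2 for x in values]
    return sum(squared_diffs) / (n - 1)


hours = [30, 35, 40, 25, 45]     # hypothetical values
s2 = sample_variance(hours)      # sample variance
s = math.sqrt(s2)                # sample standard deviation

print(s2)                                         # 62.5
print(math.isclose(s, statistics.stdev(hours)))   # True
```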

## Sample size

The sample size is the number of observations in your data set.

Normally-distributed data forms a bell shape when plotted on a graph, with the sample mean in the middle and the rest of the data distributed fairly evenly on either side of the mean.

The confidence interval for data which follows a standard normal distribution is:

$$CI = \bar{X} \pm Z^* \frac{\sigma}{\sqrt{n}}$$

- CI = the confidence interval
- $\bar{X}$ = the population mean
- $Z^*$ = the critical value of the z distribution
- $\sigma$ = the population standard deviation
- $\sqrt{n}$ = the square root of the sample size

The confidence interval for the t distribution follows the same formula, but replaces the $Z^*$ with the $t^*$.

In real life, you never know the true values for the population (unless you can do a complete census). Instead, we replace the population values with the values from our sample data, so the formula becomes:

$$CI = \bar{x} \pm Z^* \frac{s}{\sqrt{n}}$$

- $\bar{x}$ = the sample mean
- $s$ = the sample standard deviation

To calculate the 95% confidence interval, we can simply plug the values into the formula.

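For the USA, with a sample mean of 35, a standard deviation of 5, and a sample size of 100 (the values implied by the bounds reported below):

$$CI = 35 \pm 1.96 \times \frac{5}{\sqrt{100}} = 35 \pm 0.98$$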

So for the USA, the lower and upper bounds of the 95% confidence interval are 34.02 and 35.98.
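The same calculation can be sketched in Python. The sample mean of 35 and the sample size of 100 are assumptions inferred from the reported bounds and the standard deviations given earlier; they are not stated directly in the text. The function name `mean_ci` is hypothetical.

```python
import math
from statistics import NormalDist


def mean_ci(sample_mean, sample_sd, n, confidence_level=0.95):
    """z-based confidence interval: mean +/- z* * s / sqrt(n)."""
    z_star = NormalDist().inv_cdf(1 - (1 - confidence_level) / 2)
    margin = z_star * sample_sd / math.sqrt(n)
    return sample_mean - margin, sample_mean + margin


# Assumed inputs: mean 35 and n = 100 are inferred, not quoted.
usa = mean_ci(35, 5, 100)
gb = mean_ci(35, 10, 100)

print([round(b, 2) for b in usa])  # [34.02, 35.98]
print([round(b, 2) for b in gb])   # [33.04, 36.96]
```

Under these assumptions the GB interval is twice as wide as the USA interval, because its standard deviation is twice as large.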

The confidence interval for a proportion follows the same pattern as the confidence interval for means, but in place of the standard deviation you use the sample proportion times one minus the proportion:

$$CI = \hat{p} \pm Z^* \sqrt{\frac{\hat{p}(1 - \hat{p})}{n}}$$

- $\hat{p}$ = the proportion in your sample (e.g. the proportion of respondents who said they watched any television at all)
- $Z^*$ = the critical value of the z distribution
- n = the sample size
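A minimal sketch of the proportion interval, assuming the usual normal-approximation formula with $\sqrt{\hat{p}(1-\hat{p})/n}$ as the standard error. The inputs (a sample proportion of 0.5 from 100 respondents) are hypothetical, as is the function name `proportion_ci`.

```python
import math
from statistics import NormalDist


def proportion_ci(p_hat, n, confidence_level=0.95):
    """Normal-approximation confidence interval for a proportion."""
    z_star = NormalDist().inv_cdf(1 - (1 - confidence_level) / 2)
    margin = z_star * math.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - margin, p_hat + margin


lower, upper = proportion_ci(0.5, 100)   # hypothetical survey result
print(round(lower, 3), round(upper, 3))  # 0.402 0.598
```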

To calculate a confidence interval around the mean of data that is not normally distributed, you have two choices:

- You can find a distribution that matches the shape of your data and use that distribution to calculate the confidence interval.
- You can perform a transformation on your data to make it fit a normal distribution, and then find the confidence interval for the transformed data.

Performing data transformations is very common in statistics, for example, when data follows a logarithmic curve but we want to use it alongside linear data. You just have to remember to do the reverse transformation on your data when you calculate the upper and lower bounds of the confidence interval.
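As an illustration of the transform-then-reverse idea, the sketch below assumes a log transformation and a z-based interval on the log scale. The data and the function name `log_scale_ci` are hypothetical. Note that the back-transformed interval describes the geometric mean of the original data, not its arithmetic mean.

```python
import math
from statistics import NormalDist, mean, stdev


def log_scale_ci(values, confidence_level=0.95):
    """CI computed on log-transformed data, then reversed with exp()."""
    logs = [math.log(x) for x in values]          # forward transformation
    z_star = NormalDist().inv_cdf(1 - (1 - confidence_level) / 2)
    margin = z_star * stdev(logs) / math.sqrt(len(logs))
    center = mean(logs)
    # Reverse transformation applied to the bounds, not to the raw data.
    return math.exp(center - margin), math.exp(center + margin)


skewed = [1, 2, 2, 3, 5, 8, 13, 40, 120, 400]     # hypothetical right-skewed data
lower, upper = log_scale_ci(skewed)
geometric_mean = math.exp(mean(math.log(x) for x in skewed))
print(lower < geometric_mean < upper)             # True
```

With a sample this small a t critical value would be the more usual choice; the z value is used here only to keep the sketch within the standard library.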

Confidence intervals are sometimes reported in papers, though researchers more often report the standard deviation of their estimate.

If you are asked to report the confidence interval, you should include the upper and lower bounds of the confidence interval.

One place that confidence intervals are frequently used is in graphs. When showing the differences between groups, or plotting a linear regression, researchers will often include the confidence interval to give a visual representation of the variation around the estimate.

Confidence intervals are sometimes interpreted as saying that the ‘true value’ of your estimate lies within the bounds of the confidence interval.

This is not the case. The confidence interval cannot tell you how likely it is that you found the true value of your statistical estimate because it is based on a sample, not on the whole population.

The confidence interval only tells you what range of values you can expect to find if you re-do your sampling or run your experiment again in the exact same way.

The more accurate your sampling plan, or the more realistic your experiment, the greater the chance that your confidence interval includes the true value of your estimate. But this accuracy is determined by your research methods, not by the statistics you do after you have collected the data!

The confidence level is the percentage of times you expect to get close to the same estimate if you run your experiment again or resample the population in the same way.

The confidence interval consists of the upper and lower bounds of the estimate you expect to find at a given level of confidence.

For example, if you are estimating a 95% confidence interval around the mean proportion of female babies born every year based on a random sample of babies, you might find an upper bound of 0.56 and a lower bound of 0.48. These are the upper and lower bounds of the confidence interval. The confidence level is 95%.

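To calculate the confidence interval, you need to know:

- The point estimate you are constructing the confidence interval for
- The critical values for the test statistic
- The standard deviation of the sample
- The sample size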

Then you can plug these components into the confidence interval formula that corresponds to your data. The formula depends on the type of estimate (e.g. a mean or a proportion) and on the distribution of your data.

The standard normal distribution, also called the z-distribution, is a special normal distribution where the mean is 0 and the standard deviation is 1.

Any normal distribution can be converted into the standard normal distribution by turning the individual values into z-scores. In a z-distribution, z-scores tell you how many standard deviations away from the mean each value lies.

The z-score and t-score (aka z-value and t-value) show how many standard deviations away from the mean of the distribution you are, assuming your data follow a z-distribution or a t-distribution.

These scores are used in statistical tests to show how far from the mean of the predicted distribution your statistical estimate is. If your test produces a z-score of 2.5, this means that your estimate is 2.5 standard deviations from the predicted mean.

The predicted mean and distribution of your estimate are generated by the null hypothesis of the statistical test you are using. The more standard deviations away from the predicted mean your estimate is, the less likely it is that the estimate could have occurred under the null hypothesis.
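To make "less likely under the null hypothesis" concrete, the sketch below converts a z-score into a two-tailed p-value using the standard normal distribution. The function name `two_tailed_p_value` is hypothetical, and the calculation assumes the test statistic really does follow a z-distribution.

```python
from statistics import NormalDist


def two_tailed_p_value(z_score):
    """Probability of a z-score at least this extreme in either direction."""
    return 2 * (1 - NormalDist().cdf(abs(z_score)))


print(round(two_tailed_p_value(2.5), 4))   # 0.0124
print(round(two_tailed_p_value(1.96), 2))  # 0.05
```

A z-score of 2.5 therefore falls below the common 0.05 threshold, while 1.96 sits right at it.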

A critical value is the value of the test statistic which defines the upper and lower bounds of a confidence interval, or which defines the threshold of statistical significance in a statistical test. It describes how far from the mean of the distribution you have to go to cover a certain amount of the total variation in the data (i.e. 90%, 95%, 99%).

If you are constructing a 95% confidence interval and are using a threshold of statistical significance of p = 0.05, then your critical value will be identical in both cases.

If your confidence interval for a difference between groups includes zero, that means that if you run your experiment again you have a good chance of finding no difference between groups.

If your confidence interval for a correlation or regression includes zero, that means that if you run your experiment again there is a good chance of finding no correlation in your data.

In both of these cases, you will also find a high p-value when you run your statistical test, meaning that your results could have occurred under the null hypothesis of no relationship between variables or no difference between groups.

If you want to calculate a confidence interval around the mean of data that is not normally distributed, you have two choices:

- Find a distribution that matches the shape of your data and use that distribution to calculate the confidence interval.
- Perform a transformation on your data to make it fit a normal distribution, and then find the confidence interval for the transformed data.

## Cite this Scribbr article

Bevans, R. (2023, June 22). Understanding Confidence Intervals | Easy Examples & Formulas. Scribbr. Retrieved April 2, 2024, from https://www.scribbr.com/statistics/confidence-interval/

## Confidence Intervals

Hypothesis testing is the approach to statistical inference that we use when we have two competing theories that we are trying to choose between. A second approach to statistical inference is confidence intervals, which allow us to present a range of reasonable values for our unknown population parameter. The range of reasonable values allows us to understand the corresponding population better without requiring any ideas to be fully specified.

## General Motivation and Framework

We have access to our sample, but we would really like to make a statement about the corresponding population. For example, we can calculate that the median price per night for a Chicago Airbnb was \$126 for a sample. What we really want to know, though, is what the median price per night for a Chicago Airbnb is for the entire population of Airbnbs, so that we can make an appropriate statement for the population.

How can we extend our knowledge from the sample to the population? We can use confidence intervals to help us generate a range of reasonable values for our unknown parameter. This will help us to make reasonable conclusions that should extend to the population appropriately.

To do so, we will combine our knowledge of sampling distributions with our specific sample value. This has many similar flavors to hypothesis testing but is approaching the problem through a different framework. Below, we'll walk through an example followed by the process to generate a confidence interval.

## Confidence Interval Example

As mentioned above, the median price per night for a Chicago Airbnb was \$126 in our sample. Can we generate a sampling distribution for the possible values that the median price per night could take from repeated random samples?

We will use the resampling approach to generating a sampling distribution as described previously.

Histogram of the sampling distribution for the median price of a Chicago Airbnb.

We've now generated our sampling distribution for sample median prices of Airbnbs in Chicago. Now, suppose that I want to create a range of reasonable values for the population median prices with 90% confidence (we'll define what 90% confidence means soon). To do so, I'll find the middle 90% of this distribution by calculating the 5th percentile and the 95th percentile.

For our simulated sampling distribution, the middle 90% are between \$120 and \$132 per night for a Chicago Airbnb.

At last, we'll make a jump from making statements about samples to making statements about populations. We could say that a range of reasonable values for the population median price per night of a Chicago Airbnb is between \$120 and \$132 per night.
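The resampling steps above can be sketched in Python. This is an illustration only: the prices are randomly generated stand-ins rather than the Chicago Airbnb data, and the function name `bootstrap_median_interval`, the number of resamples, and the seeds are arbitrary choices.

```python
import random
import statistics


def bootstrap_median_interval(sample, confidence_pct=90, n_resamples=2000, seed=0):
    """Percentile interval for the median from resampling with replacement."""
    if (100 - confidence_pct) % 2:
        raise ValueError("this sketch needs whole-number tail percentiles (e.g. 90, 98)")
    rng = random.Random(seed)
    n = len(sample)
    # Each resample has the same size as the original sample.
    medians = [
        statistics.median(rng.choices(sample, k=n)) for _ in range(n_resamples)
    ]
    # quantiles(..., n=100) returns the 1st through 99th percentiles.
    cuts = statistics.quantiles(medians, n=100)
    tail = (100 - confidence_pct) // 2   # 5 for a 90% interval
    return cuts[tail - 1], cuts[100 - tail - 1]


# Hypothetical nightly prices; NOT the Chicago Airbnb data.
price_rng = random.Random(1)
prices = [round(price_rng.lognormvariate(4.8, 0.6)) for _ in range(200)]

lower, upper = bootstrap_median_interval(prices)
print(statistics.median(prices), lower, upper)
```

The two returned values are the 5th and 95th percentiles of the simulated medians, which is the middle 90% described in the text.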

## Confidence Interval Steps

To generate a confidence interval, we follow the same set of steps. We do apply some steps differently depending on our specific parameter of interest.

To generate a confidence interval, we should:

- Identify and define the parameter of interest
- Determine the confidence level
- Generate or use theory to specify the sampling distribution and check conditions
- Calculate the middle region of your sampling distribution, according to your confidence level
- Write a conclusion in the context of the problem.

## Identify Parameter of Interest

We discussed identifying and defining the parameter of interest when we first described hypothesis testing. This is repeated for confidence intervals.

In this example, our population of interest is all Chicago Airbnbs. We likely would want to specify a time frame as well, and since we are using March 2023 data, we may specify that this is for all Chicago Airbnbs in March 2023.

Our parameter of interest (the summary measure) is the median. We may define the parameter of interest as $M$, the population median price per night for a Chicago Airbnb.

## Determine the Confidence Level

The confidence level is analogous to the significance level. We'll provide a more exact definition and interpretation of the confidence level shortly. Confidence levels should be greater than 0% and less than 100%.

Confidence levels do not depend on the data and should be selected before observing the data. The confidence level is generally chosen based on the stakeholders and their requirements for the confidence in results. More confidence in the results is associated with higher confidence levels.

Common confidence levels include 90%, 95%, 98%, and 99%.

## Determine the Sampling Distribution for the Sample Statistic

We again will use the sampling distribution of the sample statistic as the basis for our confidence interval calculation. To do so, we can follow the same process outlined for hypothesis testing. Recall that we chose between a simulation-based resampling approach and a theory-based approach using the Central Limit Theorem to define the sampling distribution.

The biggest distinction between generating sampling distributions for confidence intervals compared to hypothesis testing is that we don't need to make any adjustments to our sampling distribution so that it is consistent with the null hypothesis. That is, recall that we wanted to adopt the skeptic's claim in hypothesis testing. When we were generating a sampling distribution, we would make any modifications necessary so that the sampling distribution fulfilled the condition of the null hypothesis. This distinction should be considered in two ways:

- when generating the sampling distribution
- when checking any necessary conditions

For example, if we were performing hypothesis testing with a simulation-based approach, we would need to first adjust the data so that the sample median was equal to the null value. However, without that condition for confidence intervals, we would use the data exactly as it is in the sample.

Similarly, some conditions for sampling distributions use information about the parameter of interest. For example, the theory-based approach with proportions requires that $n \times p$ and $n \times (1-p)$ are both at least 10. When we have a hypothesis, we should plug in the null value from the null hypothesis into these checks. With confidence intervals, if we don't have any requirements for the parameter, we can use our best estimate for $p$, which is often $\hat{p}$ when checking the conditions.
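The difference in what gets plugged into the check can be sketched as follows. The function name and the example numbers are hypothetical; the threshold of 10 is the one stated above.

```python
def meets_proportion_conditions(n, p, threshold=10):
    """True when both n*p and n*(1-p) are at least `threshold`."""
    return n * p >= threshold and n * (1 - p) >= threshold


n = 50

# Hypothesis test: plug in the null value (here a hypothetical p0 = 0.5).
print(meets_proportion_conditions(n, 0.5))    # True  (25 and 25)

# Confidence interval: plug in the best estimate, p-hat (hypothetically 0.08).
print(meets_proportion_conditions(n, 0.08))   # False (4 and 46)
```

The same sample can pass the check under a null value and fail it under the sample proportion, which is why the choice of $p$ matters.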

Again, the simulation-based approach requires the fewest assumptions. For our example, it is the only option for estimating the sampling distribution, since we haven't introduced theory that relates to the sampling distribution for a sample median.

## Calculate the Confidence Interval

After we have determined the sampling distribution, we want to actually calculate the confidence interval, which is the range of reasonable values for our parameter of interest.

We want to find the central part of the sampling distribution that corresponds to our confidence level to generate the confidence interval, regardless of the approach for generating the sampling distribution. That is, if we want a 95% confidence interval, we will want to find the 2.5th percentile and the 97.5th percentile of the sampling distribution, so that the middle 95% is contained within those two values. In general, if we say that our confidence level is represented as CL%, then we want the (100-CL)/2 and (100+CL)/2 percentiles. We can find these percentiles either for a simulated sampling distribution or for a well-defined distribution, as long as we provide Python with the appropriate information.
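A minimal sketch of both routes, using only the standard library. The helper names are hypothetical, the `percentile` function uses simple linear interpolation between sorted values (one of several common conventions), and the normal distribution with mean 100 and standard error 5 is an invented example of a well-defined sampling distribution.

```python
from statistics import NormalDist


def percentile_bounds(cl_pct):
    """Lower and upper percentiles for a CL% confidence interval."""
    return (100 - cl_pct) / 2, (100 + cl_pct) / 2


def percentile(values, pct):
    """Percentile of a list by linear interpolation between sorted values."""
    ordered = sorted(values)
    position = (len(ordered) - 1) * pct / 100
    low = int(position)
    high = min(low + 1, len(ordered) - 1)
    return ordered[low] + (position - low) * (ordered[high] - ordered[low])


low_pct, high_pct = percentile_bounds(95)          # (2.5, 97.5)

# Simulated sampling distribution: pass the list of simulated statistics.
simulated = list(range(101))                       # stand-in for simulated statistics
sim_interval = (percentile(simulated, low_pct), percentile(simulated, high_pct))

# Well-defined distribution: here an assumed normal with mean 100, SE 5.
theory = NormalDist(mu=100, sigma=5)
theory_interval = (theory.inv_cdf(low_pct / 100), theory.inv_cdf(high_pct / 100))

print(sim_interval)       # (2.5, 97.5)
print(theory_interval)
```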

This might seem counterintuitive, as we are using information about our sample to generate a guess about our population. To understand this, let's start by saying that this range would be a range of typical values for a sample statistic as calculated from our available data. Then, we're going to switch the order of the statement. This indicates that a sample statistic like the one we found would be reasonable if our parameter were anywhere in that range instead. Therefore, we'll say that the confidence interval that we calculated represents a range of reasonable values for the parameter.

## Write a Conclusion in the Context of the Problem

Finally, we've generated our confidence interval and want to communicate our results to other stakeholders. What exactly does the confidence interval mean?

Informally, we might say something like: it is reasonable to claim that the population median price for a Chicago Airbnb is between \$120 and \$136 per night, with 90% confidence.

The formal interpretation is that we are 90% confident that the true population median price for a Chicago Airbnb falls in the range of \$120 and \$136 per night.

## Confidence Interval Widths

Say that a stakeholder is not satisfied with a confidence interval. A common concern is that a confidence interval is too wide; that is, your stakeholder would like a narrower range of reasonable values. What can be changed to satisfy your stakeholder?

The two adjustable factors that affect the width of the confidence interval are the:

- sample size
- confidence level

Larger sample sizes result in narrower sampling distributions (recall this feature of the standard error from our sampling distribution module). This will also result in our confidence interval being narrower.

Larger confidence levels require a larger component of the sampling distribution to be included in the confidence interval. This will result in a wider confidence interval.

Therefore, if your stakeholder wants a narrower confidence interval, you could add more observations to your sample size or you could reduce your confidence level. It is also possible to estimate a desired sample size before gathering data that results in a confidence interval with limitations on the width of the confidence interval. We will skip over this calculation for our course, although you may encounter it in a future course.
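Both effects can be seen in a small sketch. It assumes a theory-based interval for a mean with a normal critical value, which is a simplification chosen for illustration rather than the resampling approach used for the median; the standard deviation of 40 and the function name are hypothetical.

```python
import math
from statistics import NormalDist


def interval_width(sample_sd, n, confidence_level):
    """Width of a z-based interval for a mean: 2 * z* * s / sqrt(n)."""
    z_star = NormalDist().inv_cdf(1 - (1 - confidence_level) / 2)
    return 2 * z_star * sample_sd / math.sqrt(n)


base = interval_width(40, 100, 0.95)
more_data = interval_width(40, 400, 0.95)        # four times the observations
less_confidence = interval_width(40, 100, 0.90)  # lower confidence level

print(round(base, 2), round(more_data, 2), round(less_confidence, 2))
```

In this setting, quadrupling the sample size halves the width, while dropping from 95% to 90% confidence narrows it by a smaller amount.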

## Confidence Interval Misconceptions and Misinterpretations

We've discussed briefly what a confidence interval means. Equally important is what a confidence interval does not imply.

A confidence interval does not correspond to:

- the probability that the parameter is in the confidence interval
- a range of reasonable values for the sample data
- a range of reasonable values for a sample statistic
- a range of reasonable values for any future results from another sample

These last three misconceptions stem from misunderstanding that the confidence interval is about the parameter of interest and not about the sample or any of its corresponding characteristics.

For the first statement, consider that the population is already defined, and the corresponding parameter value for the population could then be calculated. It is a specific number, and it doesn't change. For example, it might be 120 or it could be 145. However, since the population is fixed, it is that exact number.

Once the confidence interval is calculated, then the confidence interval is also set and determined. It won't change. In this case, the parameter will either be contained in our confidence interval or it won't be, so the probability associated with the parameter being in the confidence interval is either 0 (the confidence interval isn't correct) or 1 (the confidence interval is correct).

## Confidence Level Interpretation

We now understand how to calculate a confidence interval, what the confidence interval indicates, and what it doesn't indicate. However, we need to return to the second step where we set the confidence level for the interval. We know that this will have ramifications for the following steps of generating a confidence interval. But, what does it mean?

The confidence level means:

"If we gathered repeated random samples of the same size and calculated a CL% confidence interval for each, we would expect CL% of the resulting confidence intervals to contain the true parameter of interest."

Generally, this means that we expect CL% of our intervals to be correct. However, as we discussed above, we can't apply this reasoning to one specific interval after it's been calculated. This still does allow for variability and for different confidence intervals being generated from different samples.
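The repeated-sampling interpretation can be checked by simulation. The sketch below invents a population (normal, mean 50, standard deviation 10) so that the true parameter is known, then counts how often a 95% z-based interval for the mean contains it. All numbers, the seed, and the function name `coverage` are assumptions for illustration.

```python
import math
import random
from statistics import NormalDist, mean, stdev


def coverage(true_mean=50, true_sd=10, n=40, confidence_level=0.95,
             n_samples=2000, seed=0):
    """Share of simulated confidence intervals that contain the true mean."""
    rng = random.Random(seed)
    z_star = NormalDist().inv_cdf(1 - (1 - confidence_level) / 2)
    hits = 0
    for _ in range(n_samples):
        sample = [rng.gauss(true_mean, true_sd) for _ in range(n)]
        margin = z_star * stdev(sample) / math.sqrt(n)
        center = mean(sample)
        if center - margin <= true_mean <= center + margin:
            hits += 1
    return hits / n_samples


print(coverage())  # expected to land near 0.95
```

Each individual interval either contains 50 or does not; the confidence level describes the long-run share that do.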

## Hypothesis Testing Decisions through Confidence Intervals

You may have noticed that many of the steps used for confidence intervals are shared with hypothesis testing. While there are distinctions between the two, we can also use confidence intervals to help us determine the result of a hypothesis test.

Suppose that a friend found a report stating that the median price for all Chicago hotels is \$160 per night. They suspect that Airbnbs are less expensive per night, that is, that the population median price for Chicago Airbnbs is lower.

That is, the parameter of interest would be $M$, the population median price per night for all Chicago Airbnbs in March 2023. We can find (and have found) the corresponding sample statistic, $m$, the median price per night for the Chicago Airbnbs from our sample.

Because we don't have any data to analyze for Chicago hotels, we'll use this number as if it were true and treat this as a test for only one population. Our hypotheses would be:

$H_0: M = 160$

$H_a: M < 160$

What does the data say? If we've already generated a confidence interval, we don't need to repeat many of the steps for hypothesis testing. Instead, we can consider our calculated confidence interval as a range of reasonable values for our parameter. That is, it is reasonable that the population median price per night for all Chicago Airbnbs is between \$120 and \$136. In this case, the null value of 160 is not included in the range of reasonable values. Everything reasonable falls under the alternative hypothesis. We would want to reject the null hypothesis and adopt the alternative hypothesis as a more reasonable claim.

In this case, our confidence interval clearly supports our alternative hypothesis rather than our null hypothesis. However, in order to use confidence intervals to anticipate the decision for a hypothesis test, we need to ensure that we are using comparable confidence and significance levels:

- for a two-sided alternative hypothesis, use a confidence level of $1-\alpha$
- for a one-sided alternative hypothesis, use a confidence level of $1-2\times\alpha$
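The two rules above can be sketched in a few lines of Python. The function names are hypothetical, and the sketch assumes that the \$120 to \$136 interval quoted earlier was built at the confidence level that matches the chosen significance level.

```python
def matching_confidence_level(alpha, two_sided):
    """Confidence level whose interval anticipates a test at significance level alpha."""
    return 1 - alpha if two_sided else 1 - 2 * alpha


def decide_less_than(interval, null_value):
    """Decision for H_a: parameter < null_value, read off a matching interval."""
    lower, upper = interval
    # Reject only when every reasonable value sits below the null value.
    return "reject H0" if upper < null_value else "fail to reject H0"


alpha = 0.05
level = matching_confidence_level(alpha, two_sided=False)  # H_a: M < 160 is one-sided
interval = (120, 136)  # interval from the text; its confidence level is assumed to match
decision = decide_less_than(interval, null_value=160)
print(f"Use a {level:.0%} interval; decision: {decision}")
```

For a one-sided test at $\alpha = 0.05$, the matching interval is a 90% interval, because each tail of that interval holds 5%.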


## Hypothesis Testing, P Values, Confidence Intervals, and Significance

Definition/Introduction

Medical providers often rely on evidence-based medicine to guide decision-making in practice. Often a research hypothesis is tested with results provided, typically with p values, confidence intervals, or both. Additionally, statistical or research significance is estimated or determined by the investigators. Unfortunately, healthcare providers may have different comfort levels in interpreting these findings, which may affect the adequate application of the data.

## Issues of Concern


Lacking a foundational understanding of hypothesis testing, p values, confidence intervals, and the difference between statistical and clinical significance may affect healthcare providers' ability to make clinical decisions without relying purely on the level of significance deemed appropriate by the research investigators. Therefore, an overview of these concepts is provided to allow medical professionals to use their expertise to determine if results are reported sufficiently and if the study outcomes are clinically appropriate to be applied in healthcare practice.

Hypothesis Testing

Investigators conducting studies need research questions and hypotheses to guide analyses. Starting with broad research questions (RQs), investigators then identify a gap in current clinical practice or research. Any research problem or statement is grounded in a better understanding of relationships between two or more variables. For this article, we will use the following research question example:

Research Question: Is Drug 23 an effective treatment for Disease A?

Research questions do not directly imply specific guesses or predictions; we must formulate research hypotheses. A hypothesis is a predetermined declaration regarding the research question in which the investigator(s) makes a precise, educated guess about a study outcome. This is sometimes called the alternative hypothesis and ultimately allows the researcher to take a stance based on experience or insight from medical literature. An example of a hypothesis is below.

Research Hypothesis: Drug 23 will significantly reduce symptoms associated with Disease A compared to Drug 22.

The null hypothesis states that there is no statistical difference between groups based on the stated research hypothesis.

Null Hypothesis: There are no significant differences in the reduction of symptoms for Disease A between Drug 23 and Drug 22.

The null hypothesis is deemed true until a study presents significant data to support rejecting the null hypothesis. Based on the results, the investigators will either reject the null hypothesis (if they found significant differences or associations) or fail to reject the null hypothesis (if they could not provide proof that there were significant differences or associations).

Researchers should be aware of journal recommendations when considering how to report p values, and manuscripts should remain internally consistent. Regarding p values, as the number of individuals enrolled in a study (the sample size) increases, the likelihood of finding a statistically significant effect increases. With very large sample sizes, the p-value can be very low.

To test a hypothesis, researchers obtain data on a representative sample to determine whether to reject or fail to reject a null hypothesis. In most research studies, it is not feasible to obtain data for an entire population. Using a sampling procedure allows for statistical inference, though this involves a certain possibility of error. [1] When determining whether to reject or fail to reject the null hypothesis, mistakes can be made: Type I and Type II errors. Though it is impossible to ensure that these errors have not occurred, researchers should limit the possibilities of these faults. [2]

Significance

Significance is a term to describe the substantive importance of medical research. Statistical significance is the likelihood of results due to chance. [3] Healthcare providers should always distinguish statistical significance from clinical significance; conflating the two is a common error when reviewing biomedical research. [4] When conceptualizing findings reported as either significant or not significant, healthcare providers should not simply accept researchers' results or conclusions without considering the clinical significance. Healthcare professionals should consider the clinical importance of findings and understand both p values and confidence intervals so they do not have to rely on the researchers to determine the level of significance. [5] One criterion often used to determine statistical significance is the utilization of p values.

P values are used in research to determine whether the sample estimate is significantly different from a hypothesized value. The p-value is the probability of observing an effect at least as large as the one seen in the study by chance if, in reality, there were no true effect. Conventionally, data yielding a p<0.05 or p<0.01 is considered statistically significant. While some have argued that the 0.05 level should be lowered, it is still widely practiced. [6] Hypothesis testing allows us to judge whether an effect is distinguishable from chance; on its own, it does not convey the size of the effect.

Examples of findings reported with p values are below:

Statement: Drug 23 reduced patients' symptoms compared to Drug 22. Patients who received Drug 23 (n=100) were 2.1 times less likely than patients who received Drug 22 (n = 100) to experience symptoms of Disease A, p<0.05.

Statement: Individuals who were prescribed Drug 23 experienced fewer symptoms (M = 1.3, SD = 0.7) compared to individuals who were prescribed Drug 22 (M = 5.3, SD = 1.9). This finding was statistically significant, p = 0.02.

For either statement, if the threshold had been set at 0.05, the null hypothesis (that there was no relationship) should be rejected, and we should conclude that there are significant differences. As the two statements above show, some researchers report findings with < or >, while others provide an exact p-value (e.g., 0.000001), but never zero. [6] When examining research, readers should understand how p values are reported. The best practice is to report all p values for all variables within a study design, rather than only providing p values for variables with significant findings. [7] The inclusion of all p values provides evidence for study validity and limits suspicion of selective reporting or data mining.

While researchers have historically used p values, experts who find p values problematic encourage the use of confidence intervals. [8] P-values alone do not allow us to understand the size or the extent of the differences or associations. [3] In March 2016, the American Statistical Association (ASA) released a statement on p values, noting that scientific decision-making and conclusions should not be based on a fixed p-value threshold (e.g., 0.05). They recommend focusing on the significance of results in the context of study design, quality of measurements, and validity of data. Ultimately, the ASA statement noted that in isolation, a p-value does not provide strong evidence. [9]

When conceptualizing clinical work, healthcare professionals should consider p values with a concurrent appraisal of study design validity. For example, a p-value from a double-blinded randomized clinical trial (designed to minimize bias) should be weighted higher than one from a retrospective observational study. [7] The p-value debate has smoldered since the 1950s, [10] and replacement with confidence intervals has been suggested since the 1980s. [11]
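The point that a p-value alone does not convey the size of an effect can be illustrated with a short calculation. This is a sketch under stated assumptions: a two-sided z-test for a difference in means with a known common standard deviation, and hypothetical values (a mean difference of 0.1 and a standard deviation of 2.0) that do not come from any study cited here.

```python
import math


def two_sided_p_value(mean_diff, sigma, n_per_group):
    """Two-sided z-test p-value for a difference in means with known common sigma."""
    se = sigma * math.sqrt(2 / n_per_group)  # standard error of the difference
    z = mean_diff / se
    # erfc(|z| / sqrt(2)) equals 2 * (1 - Phi(|z|)) for the standard normal.
    return math.erfc(abs(z) / math.sqrt(2))


sample_sizes = (50, 500, 5000, 50000)
p_values = [two_sided_p_value(mean_diff=0.1, sigma=2.0, n_per_group=n) for n in sample_sizes]
for n, p in zip(sample_sizes, p_values):
    print(f"n per group = {n:>6}: p = {p:.4g}")
```

The effect is identical in every row; only the sample size changes. The p-value falls from clearly non-significant to far below 0.05, which is why the size of the effect has to be read from the estimate and its confidence interval instead.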

Confidence Intervals

A confidence interval provides a range of values that, with a given level of confidence (e.g., 95%), includes the true value of the statistical parameter within a targeted population. [12] Most research uses a 95% CI, but investigators can set any level (e.g., 90% CI, 99% CI). [13] A CI provides a range with the lower bound and upper bound limits of a difference or association that would be plausible for a population. [14] Therefore, a CI of 95% indicates that if a study were to be carried out 100 times, the range would contain the true value in about 95 of them. [15] Confidence intervals provide more evidence regarding the precision of an estimate than p-values do. [6]

In consideration of the similar research example provided above, one could make the following statement with 95% CI:

Statement: Individuals who were prescribed Drug 23 had no symptoms after three days, which was significantly faster than those prescribed Drug 22; the mean difference between the two groups in days to recovery was 4.2 days (95% CI: 1.9 – 7.8).

It is important to note that the width of the CI is affected by the standard error and the sample size; reducing the sample size results in a less precise (wider) CI. [14] A larger width indicates a smaller sample size or a larger variability. [16] A researcher would want to increase the precision of the CI. For example, a 95% CI of 1.43 – 1.47 is much more precise than the one provided in the example above. In research and clinical practice, CIs provide valuable information on whether the interval includes or excludes any clinically significant values. [14]
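The relationship between sample size and CI width can be sketched numerically. The example assumes an approximate 95% z-interval for a single mean, with width $2 \times 1.96 \times SD / \sqrt{n}$, and a hypothetical standard deviation of 1.9; neither assumption comes from the studies cited above.

```python
import math


def ci_width(sd, n, z=1.96):
    """Width of an approximate 95% z-interval for a mean: 2 * z * sd / sqrt(n)."""
    return 2 * z * sd / math.sqrt(n)


for n in (25, 100, 400):
    print(f"n = {n:>3}: width = {ci_width(sd=1.9, n=n):.2f}")
```

Under this formula, quadrupling the sample size halves the width, so precision improves with the square root of the sample size rather than in direct proportion to it.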

CIs are sometimes checked against a null value (zero for differences and 1 for ratios). However, CIs provide more information than that. [15] Consider this example: A hospital implements a new protocol that reduced wait time for patients in the emergency department by an average of 25 minutes (95% CI: -2.5 – 41 minutes). Because the range crosses zero, implementing this protocol in different populations could result in longer wait times; however, most of the range lies on the positive side. Thus, while the p-value used to detect statistical significance for this may result in "not significant" findings, individuals should examine this range, consider the study design, and weigh whether or not it is still worth piloting in their workplace.
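The null-value check described above can be written as a small helper. The function name and the `measure` argument are hypothetical; the two intervals are the ones quoted in this article.

```python
def includes_null(lower, upper, measure="difference"):
    """True if the interval contains the null value (0 for differences, 1 for ratios)."""
    null_value = 0.0 if measure == "difference" else 1.0
    return lower <= null_value <= upper


wait_time = (-2.5, 41.0)  # minutes, emergency department example
recovery = (1.9, 7.8)     # days, Drug 23 versus Drug 22 example

print(includes_null(*wait_time))  # True: the interval crosses zero
print(includes_null(*recovery))   # False: the interval excludes zero
```

The check only reproduces the significant or not significant verdict. The position and width of the interval still have to be judged against what would matter clinically.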

Similarly to p-values, 95% CIs cannot control for researchers' errors (e.g., study bias or improper data analysis). [14] In consideration of whether to report p-values or CIs, researchers should examine journal preferences. When in doubt, reporting both may be beneficial. [13] An example is below:

Reporting both: Individuals who were prescribed Drug 23 had no symptoms after three days, which was significantly faster than those prescribed Drug 22, p = 0.009. The mean difference between the two groups in days to recovery was 4.2 days (95% CI: 1.9 – 7.8).

## Clinical Significance

Recall that clinical significance and statistical significance are two different concepts. Healthcare providers should remember that a study with statistically significant differences and a large sample size may be of no interest to clinicians, whereas a study with a smaller sample size and statistically non-significant results could impact clinical practice. [14] Additionally, as previously mentioned, a non-significant finding may reflect the study design itself rather than relationships between variables.

Healthcare providers using evidence-based medicine to inform practice should use clinical judgment to determine the practical importance of studies through careful evaluation of the design, sample size, power, likelihood of type I and type II errors, data analysis, and reporting of statistical findings (p values, 95% CI or both). [4] Interestingly, some experts have called for "statistically significant" or "not significant" to be excluded from work as statistical significance never has and will never be equivalent to clinical significance. [17]

The decision on what is clinically significant can be challenging, depending on the providers' experience and especially the severity of the disease. Providers should use their knowledge and experiences to determine the meaningfulness of study results and make inferences based not only on significant or insignificant results by researchers but through their understanding of study limitations and practical implications.

## Nursing, Allied Health, and Interprofessional Team Interventions

All physicians, nurses, pharmacists, and other healthcare professionals should strive to understand the concepts in this chapter. These individuals should maintain the ability to review and incorporate new literature for evidence-based and safe care.

## References

Jones M, Gebski V, Onslow M, Packman A. Statistical power in stuttering research: a tutorial. Journal of speech, language, and hearing research : JSLHR. 2002 Apr:45(2):243-55 [PubMed PMID: 12003508]

Sedgwick P. Pitfalls of statistical hypothesis testing: type I and type II errors. BMJ (Clinical research ed.). 2014 Jul 3:349:g4287. doi: 10.1136/bmj.g4287. Epub 2014 Jul 3 [PubMed PMID: 24994622]

Fethney J. Statistical and clinical significance, and how to use confidence intervals to help interpret both. Australian critical care : official journal of the Confederation of Australian Critical Care Nurses. 2010 May:23(2):93-7. doi: 10.1016/j.aucc.2010.03.001. Epub 2010 Mar 29 [PubMed PMID: 20347326]

Hayat MJ. Understanding statistical significance. Nursing research. 2010 May-Jun:59(3):219-23. doi: 10.1097/NNR.0b013e3181dbb2cc [PubMed PMID: 20445438]

Ferrill MJ, Brown DA, Kyle JA. Clinical versus statistical significance: interpreting P values and confidence intervals related to measures of association to guide decision making. Journal of pharmacy practice. 2010 Aug:23(4):344-51. doi: 10.1177/0897190009358774. Epub 2010 Apr 13 [PubMed PMID: 21507834]

Infanger D, Schmidt-Trucksäss A. P value functions: An underused method to present research results and to promote quantitative reasoning. Statistics in medicine. 2019 Sep 20:38(21):4189-4197. doi: 10.1002/sim.8293. Epub 2019 Jul 3 [PubMed PMID: 31270842]

Dorey F. Statistics in brief: Interpretation and use of p values: all p values are not equal. Clinical orthopaedics and related research. 2011 Nov:469(11):3259-61. doi: 10.1007/s11999-011-2053-1 [PubMed PMID: 21918804]

Liu XS. Implications of statistical power for confidence intervals. The British journal of mathematical and statistical psychology. 2012 Nov:65(3):427-37. doi: 10.1111/j.2044-8317.2011.02035.x. Epub 2011 Oct 25 [PubMed PMID: 22026811]

Tijssen JG, Kolm P. Demystifying the New Statistical Recommendations: The Use and Reporting of p Values. Journal of the American College of Cardiology. 2016 Jul 12:68(2):231-3. doi: 10.1016/j.jacc.2016.05.026 [PubMed PMID: 27386779]

Spanos A. Recurring controversies about P values and confidence intervals revisited. Ecology. 2014 Mar:95(3):645-51 [PubMed PMID: 24804448]

Freire APCF, Elkins MR, Ramos EMC, Moseley AM. Use of 95% confidence intervals in the reporting of between-group differences in randomized controlled trials: analysis of a representative sample of 200 physical therapy trials. Brazilian journal of physical therapy. 2019 Jul-Aug:23(4):302-310. doi: 10.1016/j.bjpt.2018.10.004. Epub 2018 Oct 16 [PubMed PMID: 30366845]

Dorey FJ. In brief: statistics in brief: Confidence intervals: what is the real result in the target population? Clinical orthopaedics and related research. 2010 Nov:468(11):3137-8. doi: 10.1007/s11999-010-1407-4 [PubMed PMID: 20532716]

Porcher R. Reporting results of orthopaedic research: confidence intervals and p values. Clinical orthopaedics and related research. 2009 Oct:467(10):2736-7. doi: 10.1007/s11999-009-0952-1. Epub 2009 Jun 30 [PubMed PMID: 19565303]

Gardner MJ, Altman DG. Confidence intervals rather than P values: estimation rather than hypothesis testing. British medical journal (Clinical research ed.). 1986 Mar 15:292(6522):746-50 [PubMed PMID: 3082422]

Cooper RJ, Wears RL, Schriger DL. Reporting research results: recommendations for improving communication. Annals of emergency medicine. 2003 Apr:41(4):561-4 [PubMed PMID: 12658257]

Doll H, Carney S. Statistical approaches to uncertainty: P values and confidence intervals unpacked. Equine veterinary journal. 2007 May:39(3):275-6 [PubMed PMID: 17520981]

Colquhoun D. The reproducibility of research and the misinterpretation of p-values. Royal Society open science. 2017 Dec:4(12):171085. doi: 10.1098/rsos.171085. Epub 2017 Dec 6 [PubMed PMID: 29308247]


## 11: Hypothesis Testing and Confidence Intervals with Two Samples


You have learned to conduct hypothesis tests on single means and single proportions. You will expand upon that in this chapter. You will compare two means or two proportions to each other. The general procedure is still the same, just expanded. To compare two means or two proportions, you work with two groups. The groups are classified either as independent or matched pairs. Independent groups consist of two samples that are independent, that is, sample values selected from one population are not related in any way to sample values selected from the other population. Matched pairs consist of two samples that are dependent. The parameter tested using matched pairs is the population mean. The parameters tested using independent groups are either population means or population proportions.

- 11.1: Prelude to Hypothesis Testing with Two Samples This chapter deals with the following hypothesis tests: for independent groups (samples are independent), a test of two population means and a test of two population proportions; for matched or paired samples (samples are dependent), a test of two population means by testing one population mean of differences.
- 11.2: Two Population Means with Unknown Standard Deviations The comparison of two population means is very common. A difference between the two samples depends on both the means and the standard deviations. Very different means can occur by chance if there is great variation among the individual samples.
- 11.3: Two Population Means with Known Standard Deviations Even though this situation is not likely (knowing the population standard deviations is not likely), the following example illustrates hypothesis testing for independent means, known population standard deviations.
- 11.4: Comparing Two Independent Population Proportions Comparing two proportions, like comparing two means, is common. If two estimated proportions are different, it may be due to a difference in the populations or it may be due to chance. A hypothesis test can help determine if a difference in the estimated proportions reflects a difference in the population proportions.
- 11.5: Matched or Paired Samples When using a hypothesis test for matched or paired samples, the following characteristics should be present: Simple random sampling is used. Sample sizes are often small. Two measurements (samples) are drawn from the same pair of individuals or objects. Differences are calculated from the matched or paired samples. The differences form the sample that is used for the hypothesis test. Either the matched pairs have differences that come from a population that is normal or the number of difference
- 11.6: Hypothesis Testing for Two Means and Two Proportions (Worksheet) A statistics Worksheet: The student will select the appropriate distributions to use in each case. The student will conduct hypothesis tests and interpret the results.
- 11.7: Hypothesis Testing with Two Samples (Exercises) These are homework exercises to accompany the Textmap created for "Introductory Statistics" by OpenStax.



## Unit 11: Confidence intervals

## Introduction to confidence intervals

- Confidence intervals and margin of error
- Confidence interval simulation
- Interpreting confidence level example
- Interpreting confidence levels and confidence intervals

## Estimating a population proportion

- Confidence interval example
- Margin of error 1
- Margin of error 2
- Conditions for valid confidence intervals for a proportion
- Conditions for confidence interval for a proportion worked examples
- Reference: Conditions for inference on a proportion
- Critical value (z*) for a given confidence level
- Example constructing and interpreting a confidence interval for p
- Interpreting a z interval for a proportion
- Determining sample size based on confidence and margin of error
- Conditions for a z interval for a proportion
- Finding the critical value z* for a desired confidence level
- Calculating a z interval for a proportion
- Sample size and margin of error in a z interval for p

## Estimating a population mean

- Introduction to t statistics
- Simulation showing value of t statistic
- Conditions for valid t intervals
- Reference: Conditions for inference on a mean
- Example finding critical t value
- Example constructing a t interval for a mean
- Confidence interval for a mean with paired data
- Making a t interval for paired data
- Interpreting a confidence interval for a mean
- Sample size for a given margin of error for a mean
- Conditions for a t interval for a mean
- Finding the critical value t* for a desired confidence level
- Calculating a t interval for a mean
- Sample size and margin of error in a confidence interval for a mean

## More confidence interval videos

- T-statistic confidence interval
- Small sample size confidence intervals


## 9.3: Two Proportion Z-Test and Confidence Interval


- Rachel Webb
- Portland State University

This section will look at how to analyze a difference in the proportions for two independent samples. As with all other hypothesis tests and confidence intervals, the process of testing is the same, though the formulas and assumptions are different.

There are three types of hypothesis tests for comparing the difference in 2 population proportions \(p_{1}-p_{2}\), see Figure 9-7.

Note that for our purposes, the hypothesized difference is \(p_{1}-p_{2}=0\). We could also use a variant of this model to test for a magnitude difference when \(p_{1}-p_{2} \neq 0\), but we will not cover that scenario.

The z-test is a statistical test for comparing the proportions from two populations. It can be used when the samples are independent, \(n_{1} \hat{p}_{1}\) ≥ 10, \(n_{1} \hat{q}_{1}\) ≥ 10, \(n_{2} \hat{p}_{2}\) ≥ 10, and \(n_{2} \hat{q}_{2}\) ≥ 10.

The formula for the z-test statistic is:

\(z=\frac{\left(\hat{p}_{1}-\hat{p}_{2}\right)-\left(p_{1}-p_{2}\right)}{\sqrt{\left(\hat{p} \cdot \hat{q}\left(\frac{1}{n_{1}}+\frac{1}{n_{2}}\right)\right)}}\)

Where \(\hat{p}=\frac{\left(x_{1}+x_{2}\right)}{\left(n_{1}+n_{2}\right)}=\frac{\left(\hat{p}_{1} \cdot n_{1}+\hat{p}_{2} \cdot n_{2}\right)}{\left(n_{1}+n_{2}\right)}, \quad \hat{q}=1-\hat{p}, \quad \hat{p}_{1}=\frac{x_{1}}{n_{1}}, \hat{p}_{2}=\frac{x_{2}}{n_{2}}\).

The pooled proportion \(\hat{p}\) is a weighted mean of the proportions and \(\hat{q}\) is the complement of \(\hat{p}\). Some texts or software may use different notation for the pooled proportion; note that \(\hat{p}=\bar{p}\).

A vice principal wants to see if there is a difference between the proportion of students who are late to the first class of the day and the proportion who are late to their class right after lunch. To test whether the proportion of late students differs between first and after-lunch classes, the vice principal randomly selects 200 students from their first class and records if they are late, then randomly selects 200 students in their class after lunch and records if they are late. At the 0.05 level of significance, can a difference be concluded?
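The test statistic above can be computed directly, as in the following minimal sketch. The function name is hypothetical, the null difference is taken as \(p_{1}-p_{2}=0\), and the counts passed in at the end are invented for illustration only; they are not data from this section.

```python
import math


def two_proportion_z_test(x1, n1, x2, n2):
    """z statistic and two-sided p-value for H0: p1 - p2 = 0, using the pooled proportion."""
    p1_hat, p2_hat = x1 / n1, x2 / n2
    p_pool = (x1 + x2) / (n1 + n2)  # pooled proportion p-hat
    q_pool = 1 - p_pool             # its complement q-hat
    se = math.sqrt(p_pool * q_pool * (1 / n1 + 1 / n2))
    z = (p1_hat - p2_hat) / se
    # Two-sided p-value from the standard normal: 2 * (1 - Phi(|z|)).
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Hypothetical counts: 30 of 200 versus 18 of 200.
z, p_value = two_proportion_z_test(x1=30, n1=200, x2=18, n2=200)
print(f"z = {z:.3f}, two-sided p-value = {p_value:.4f}")
```

For a one-sided alternative, the p-value would instead be the area in a single tail, so the `erfc` line is the part to adapt.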

Two Proportions Z-Interval

A \(100(1-\alpha)\%\) confidence interval for the difference between two population proportions \(p_{1}-p_{2}\):

\(\left(\hat{p}_{1}-\hat{p}_{2}\right)-z_{\alpha / 2} \sqrt{\left(\frac{\hat{p}_{1} \hat{q}_{1}}{n_{1}}+\frac{\hat{p}_{2} \hat{q}_{2}}{n_{2}}\right)}<p_{1}-p_{2}<\left(\hat{p}_{1}-\hat{p}_{2}\right)+z_{\alpha / 2} \sqrt{\left(\frac{\hat{p}_{1} \hat{q}_{1}}{n_{1}}+\frac{\hat{p}_{2} \hat{q}_{2}}{n_{2}}\right)}\)

Or more compactly as \(\left(\hat{p}_{1}-\hat{p}_{2}\right) \pm z_{\alpha / 2} \sqrt{\left(\frac{\hat{p}_{1} \hat{q}_{1}}{n_{1}}+\frac{\hat{p}_{2} \hat{q}_{2}}{n_{2}}\right)}\)

The requirements are identical to the 2-proportion hypothesis test. Note that the standard error does not rely on a hypothesized proportion, so do not use a confidence interval to make decisions based on a hypothesis statement.
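The interval formula above translates into a short function. This is a sketch only: the function name is hypothetical, the critical value comes from Python's `statistics.NormalDist`, and the counts in the usage line are invented for illustration rather than taken from the late-student example.

```python
import math
from statistics import NormalDist


def two_proportion_z_interval(x1, n1, x2, n2, confidence=0.95):
    """Confidence interval for p1 - p2 using the unpooled standard error."""
    p1_hat, p2_hat = x1 / n1, x2 / n2
    q1_hat, q2_hat = 1 - p1_hat, 1 - p2_hat
    alpha = 1 - confidence
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)  # z_{alpha/2}
    se = math.sqrt(p1_hat * q1_hat / n1 + p2_hat * q2_hat / n2)
    diff = p1_hat - p2_hat
    return diff - z_crit * se, diff + z_crit * se


# Hypothetical counts: 30 of 200 versus 18 of 200.
lower, upper = two_proportion_z_interval(x1=30, n1=200, x2=18, n2=200)
print(f"95% CI for p1 - p2: ({lower:.4f}, {upper:.4f})")
```

Unlike the test statistic, this standard error uses each sample's own proportion instead of the pooled proportion, which is the difference the note above refers to.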

Find the 95% confidence interval for the difference in the proportion of late students in their first class and the proportion who are late to their class after lunch.

## Confidence distributions and hypothesis testing

- Regular Article
- Open access
- Published: 29 March 2024


- Eugenio Melilli (ORCID: orcid.org/0000-0003-2542-5286)
- Piero Veronese (ORCID: orcid.org/0000-0002-4416-2269)


The traditional frequentist approach to hypothesis testing has recently come under extensive debate, raising several critical concerns. Additionally, practical applications often blend the decision-theoretical framework pioneered by Neyman and Pearson with the inductive inferential process that relies on the p-value, as advocated by Fisher. The combination of the two methods has led to interpreting the p-value as both an observed error rate and a measure of empirical evidence for the hypothesis. Unfortunately, both interpretations pose difficulties. In this context, we propose that resorting to confidence distributions can offer a valuable solution to address many of these critical issues. Rather than suggesting an automatic procedure, we present a natural approach to tackle the problem within a broader inferential context. Through the use of confidence distributions, we show the possibility of defining two statistical measures of evidence that align with different types of hypotheses under examination. These measures, unlike the p-value, exhibit coherence, simplicity of interpretation, and ease of computation, as exemplified by various illustrative examples spanning diverse fields. Furthermore, we provide theoretical results that establish connections between our proposal, other measures of evidence given in the literature, and standard testing concepts such as size, optimality, and the p-value.


## 1 Introduction

In applied research, the standard frequentist approach to hypothesis testing is commonly regarded as a straightforward, coherent, and automatic method for assessing the validity of a conjecture represented by one of two hypotheses, denoted as \(\mathcal{H}_0\) and \(\mathcal{H}_1\). The probabilities \(\alpha\) and \(\beta\) of committing type I and type II errors (reject \(\mathcal{H}_0\) when it is true and accept \(\mathcal{H}_0\) when it is false, respectively) are controlled through a carefully designed experiment. After having fixed \(\alpha\) (usually at 0.05), the p-value is used to quantify the measure of evidence against the null hypothesis. If the p-value is less than \(\alpha\), the conclusion is deemed significant, suggesting that it is unlikely that the null hypothesis holds. Regrettably, this methodology is not as secure as it may seem, as evidenced by a large literature; see the ASA’s Statement on p-values (Wasserstein and Lazar 2016) and The American Statistician (2019, vol. 73, sup1) for a discussion of various principles, misconceptions, and recommendations regarding the utilization of p-values. The standard frequentist approach is, in fact, a blend of two different views on hypothesis testing presented by Neyman-Pearson and Fisher. The first authors approach hypothesis testing within a decision-theoretic framework, viewing it as a behavioral theory. In contrast, Fisher’s perspective considers testing as a component of an inductive inferential process that does not necessarily require an alternative hypothesis or concepts from decision theory such as loss, risk or admissibility; see Hubbard and Bayarri (2003). As emphasized by Goodman (1993), “the combination of the two methods has led to a reinterpretation of the p-value simultaneously as an ‘observed error rate’ and as a ‘measure of evidence’. Both of these interpretations are problematic...”.

It is out of our scope to review the extensive debate on hypothesis testing. Here, we briefly touch upon a few general points, without delving into the Bayesian approach.

i) The long-standing caution expressed by Berger and Sellke ( 1987 ) and Berger and Delampady ( 1987 ) that a p -value of 0.05 provides only weak evidence against the null hypothesis has been further substantiated by recent investigations into experiment reproducibility, see e.g., Open Science Collaboration OSC ( 2015 ) and Johnson et al. ( 2017 ). In light of this, 72 statisticians have stated “For fields where the threshold for defining statistical significance for new discoveries is \(p<0.05\) , we propose a change to \(p<0.005\) ”, see Benjamin et al. ( 2018 ).

ii) The ongoing debate regarding the selection of a one-sided or two-sided test leaves the standard practice of doubling the p-value , when moving from the first to the second type of test, without consistent support, see e.g., Freedman ( 2008 ).

iii) There has been a longstanding argument in favor of integrating hypothesis testing with estimation, see e.g. Yates ( 1951 , pp. 32–33) or more recently, Greenland et al. ( 2016 ) who emphasize that “... statistical tests should never constitute the sole input to inferences or decisions about associations or effects ... in most scientific settings, the arbitrary classification of results into significant and non-significant is unnecessary for and often damaging to valid interpretation of data”.

iv) Finally, the p -value is incoherent when it is regarded as a statistical measure of the evidence provided by the data in support of a hypothesis \({{{\mathcal {H}}}_{0}}\) . As shown by Schervish ( 1996 ), it is possible that the p -value for testing the hypothesis \({{{\mathcal {H}}}_{0}}\) is greater than that for testing \({{{\mathcal {H}}}_{0}}^{\prime } \supset {{{\mathcal {H}}}_{0}}\) for the same observed data.

While theoretical insights into hypothesis testing are valuable for elucidating various aspects, we believe they cannot be compelled to serve as a unique, definitive practical guide for real-world applications. For example, uniformly most powerful (UMP) tests for discrete models not only rarely exist, but nobody uses them because they are randomized. On the other hand, how can a test of size 0.05 be considered really different from one of size 0.047 or 0.053? Moreover, for one-sided hypotheses, why should the first type error always be much more severe than the second type one? Alternatively, why should the test for \({{{\mathcal {H}}}_{0}}: \theta \le \theta _0\) versus \({{{\mathcal {H}}}_{1}}: \theta >\theta _0\) always be considered equivalent to the test for \({{{\mathcal {H}}}_{0}}: \theta = \theta _0\) versus \({{{\mathcal {H}}}_{1}}: \theta >\theta _0\) ? Furthermore, the decision to test \({{{\mathcal {H}}}_{0}}: \theta =\theta _0\) rather than \({{{\mathcal {H}}}_{0}}: \theta \in [\theta _0-\epsilon , \theta _0+\epsilon ]\) , for a suitable positive \(\epsilon \) , should be driven by the specific requirements of the application and not solely by the existence of a good or simple test. In summary, we concur with Fisher ( 1973 ) that “no scientific worker has a fixed level of significance at which from year to year, and in all circumstances, he rejects hypotheses; he rather gives his mind to each particular case in the light of his evidence and his ideas”.

Considering all these crucial aspects, we believe it is essential to seek an applied hypothesis testing approach that encourages researchers to engage more deeply with the specific problem, avoids relying on standardized procedures, and is consistently integrated into a broader framework of inference. One potential solution can be found by resorting to the “confidence distribution” (CD) approach. The modern CD theory was introduced by Schweder and Hjort ( 2002 ) and Singh et al. ( 2005 ) and relies on the idea of constructing a data-dependent distribution for the parameter of interest to be used for inferential purposes. A CD should not be confused with a Bayesian posterior distribution. It is not derived through Bayes' theorem, and it does not require any prior distribution. Similar to the conventional practice in point or interval estimation, where one seeks a point or interval estimator, the objective of this theory is to discover a distribution estimator . Thanks to a clarification of this concept and a formalized definition of the CD within a purely frequentist setting, a wide literature on the topic has been developed encompassing both theoretical developments and practical applications, see e.g. for a general overview Schweder and Hjort ( 2016 ), Singh et al. ( 2007 ), and Xie and Singh ( 2013 ). We also remark that when inference is required for a real parameter, it is possible to establish a relationship between CDs and fiducial distributions, originally introduced by Fisher ( 1930 ). For a modern and general presentation of the fiducial inference see Hannig ( 2009 ) and Hannig et al. ( 2016 ), while for a connection with the CDs see Schweder and Hjort ( 2016 ) and Veronese and Melilli ( 2015 , 2018a ). Some results about the connection between CDs and hypothesis testing are presented in Singh et al. ( 2007 , Sec. 3.3) and Xie & Singh ( 2013 , Sec. 4.3), but the focus is only on the formal relationships between the support that a CD can provide for a hypothesis and the p -value.

In this paper we discuss in detail the application of CDs in hypothesis testing. We show how CDs can offer valuable solutions to address the aforementioned difficulties and how a test can naturally be viewed as a part of a more extensive inferential process. Once a CD has been specified, everything can be developed straightforwardly, without any particular technical difficulties. The core of our approach centers on the notion of support provided by the data to a hypothesis through a CD. We introduce two distinct but related types of support, the choice of which depends on the hypothesis under consideration. They are always coherent, easy to interpret and to compute, even in the case of interval hypotheses, contrary to what happens for the p -value. The flexibility, simplicity, and effectiveness of our proposal are illustrated by several examples from various fields and a simulation study. To enhance readability, we have postponed to the end of the paper the presentation of theoretical results, comparisons with other proposals found in the literature, and the connections with standard hypothesis testing concepts such as size, significance level, optimality, and p -values.

The paper is structured as follows: In Sect. 2 , we provide a review of the CD’s definition and the primary methods for its construction, with a particular focus on distinctive aspects that arise when dealing with discrete models (Sect. 2.1 ). Section 3 explores the application of the CD in hypothesis testing and introduces the two notions of support. In Sect. 4 , we discuss several examples to illustrate the benefits of utilizing the CD in various scenarios, offering comparisons with traditional p -values. Theoretical results about tests based on the CD and comparisons with other measures of support or plausibility for hypotheses are presented in Sect. 5 . Finally, in Sect. 6 , we summarize the paper’s findings and provide concluding remarks. For convenience, a table of CDs for some common statistical models can be found in Appendix A, while all the proofs of the propositions are presented in Appendix B.

## 2 Confidence distributions

The modern definition of confidence distribution for a real parameter \(\theta \) of interest, see Schweder & Hjort ( 2002 ; 2016 , sec. 3.2) and Singh et al. ( 2005 ; 2007 ) can be formulated as follows:

## Definition 1

Let \(\{P_{\theta ,\varvec{\lambda }},\theta \in \Theta \subseteq \mathbb {R}, \varvec{\lambda }\in \varvec{\Lambda }\}\) be a parametric model for data \(\textbf{X}\in {\mathcal {X}}\) ; here \(\theta \) is the parameter of interest and \(\varvec{\lambda }\) is a nuisance parameter. A function H of \(\textbf{X}\) and \(\theta \) is called a confidence distribution for \(\theta \) if: i) for each value \(\textbf{x}\) of \(\textbf{X}\) , \(H(\textbf{x},\cdot )=H_{\textbf{x}}(\cdot )\) is a continuous distribution function on \(\Theta \) ; ii) \(H(\textbf{X},\theta )\) , seen as a function of the random element \(\textbf{X}\) , has the uniform distribution on (0, 1), whatever the true parameter value \((\theta , \varvec{\lambda })\) . The function H is an asymptotic confidence distribution if the continuity requirement in i) is removed and ii) is replaced by: ii) \(^{\prime }\) \(H(\textbf{X},\theta )\) converges in law to the uniform distribution on (0, 1) for the sample size going to infinity, whatever the true parameter value \((\theta , \varvec{\lambda })\) .

The CD theory is placed in a purely frequentist context and the uniformity of the distribution ensures the correct coverage of the confidence intervals. The CD should be regarded as a distribution estimator of a parameter \(\theta \) and its mean, median or mode can serve as point estimates of \(\theta \) , see Xie and Singh ( 2013 ) for a detailed discussion. In essence, the CD can be employed in a manner similar to a Bayesian posterior distribution, but its interpretation differs and does not necessitate any prior distribution. Closely related to the CD is the confidence curve (CC) which, given an observation \(\textbf{x}\) , is defined as \( CC_{\textbf{x}}(\theta )=|1-2H_{\textbf{x}}(\theta )|\) ; see Schweder and Hjort ( 2002 ). This function provides the boundary points of equal-tailed confidence intervals for any level \(1-\alpha \) , with \(0<\alpha <1\) , and offers an immediate visualization of their length.

Various procedures can be adopted to obtain exact or asymptotic CDs starting, for example, from pivotal functions, likelihood functions and bootstrap distributions, as detailed in Singh et al. ( 2007 ), Xie and Singh ( 2013 ), Schweder and Hjort ( 2016 ). A CD (or an asymptotic CD) can also be derived directly from a real statistic T , provided that its exact or asymptotic distribution function \(F_{\theta }(t)\) is a continuously monotonic function in \(\theta \) and its limits are 0 and 1 as \(\theta \) approaches its boundaries. For example, if \(F_{\theta }(t)\) is nonincreasing in \(\theta \) , we can define

\[ H_t(\theta )=1-F_{\theta }(t). \tag{1} \]

Furthermore, if \(H_t(\theta )\) is differentiable in \(\theta \) , we can obtain the CD-density \(h_t(\theta )=-({\partial }/{\partial \theta }) F_{\theta }(t)\) , which coincides with the fiducial density suggested by Fisher. In particular, when the statistical model belongs to the real regular natural exponential family (NEF) with natural parameter \(\theta \) and sufficient statistic T , there always exists an “optimal” CD for \(\theta \) which is given by ( 1 ), see Veronese and Melilli ( 2015 ).

The CDs based on a real statistic play an important role in hypothesis testing. In this setting remarkable results are obtained when the model has monotone likelihood ratio (MLR). We recall that if \(\textbf{X}\) is a random vector distributed according to the family \(\{p_\theta , \theta \in \Theta \subseteq \mathbb {R}\}\) , this family is said to have MLR in the real statistic \(T(\textbf{X})\) if, for any \(\theta _1 <\theta _2\) , the ratio \(p_{\theta _2}(\textbf{x})/p_{\theta _1}(\textbf{x})\) is a nondecreasing function of \(T(\textbf{x})\) for values of \(\textbf{x}\) that induce at least one of \(p_{\theta _1}\) and \(p_{\theta _2}\) to be positive. Furthermore, for such families, it holds that \(F_{\theta _2}(t) \le F_{\theta _1}(t)\) for each t , see Shao ( 2003 , Sec. 6.1.2). Families with MLR not only allow the construction of Uniformly Most Powerful (UMP) tests in various scenarios but also identify the statistic T , which can be employed in constructing the CD for \(\theta \) . Indeed, because \(F_\theta (t)\) is nonincreasing in \(\theta \) for each t , \(H_t(\theta )\) can be defined as in ( 1 ) provided the conditions of continuity and limits of \(F_{\theta }(t)\) are met. Of course, if the MLR is nonincreasing in T a similar result holds and the CD for \(\theta \) is \(H_t(\theta )=F_\theta (t)\) .

An interesting characteristic of the CD that validates its suitability for use in a testing problem is its consistency , meaning that it increasingly concentrates around the “true” value of \(\theta \) as the sample size grows, leading to the correct decision.

## Definition 2

The sequence of CDs \(H(\textbf{X}_n, \cdot )\) is consistent at some \(\theta _0 \in \Theta \) if, for every neighborhood U of \(\theta _0\) , \(\int _U dH(\textbf{X}_n, \theta ) \rightarrow 1\) , as \(n\rightarrow \infty \) , in probability under \(\theta _0\) .

The following proposition provides some useful asymptotic properties of a CD for independent identically distributed (i.i.d.) random variables.

## Proposition 1

Let \(X_1,X_2,\ldots \) be a sequence of i.i.d. random variables from a distribution function \(F_{\theta }\) , parameterized by a real parameter \(\theta \) , and let \(H_{\textbf{x}_n}\) be the CD for \(\theta \) based on \(\textbf{x}_n=(x_1, \ldots , x_n)\) . If \(\theta _0\) denotes the true value of \(\theta \) , then \(H(\textbf{X}_n, \cdot )\) is consistent at \(\theta _0\) if one of the following conditions holds:

i) \(F_{\theta }\) belongs to a NEF;

ii) \(F_{\theta }\) is a continuous distribution function and standard regularity assumptions hold;

iii) its expected value and variance converge for \(n\rightarrow \infty \) to \(\theta _0\) , and 0, respectively, in probability under \(\theta _0\) .

Finally, if i) or ii) holds the CD is asymptotically normal.

Table 8 in Appendix A provides a list of CDs for various standard models. Here, we present two basic examples, while numerous others will be covered in Sect. 4 within an inferential and testing framework.

Example 1 ( Normal model ) Let \(\textbf{X}=(X_1,\ldots ,X_n)\) be an i.i.d. sample from a normal distribution N \((\mu ,\sigma ^2)\) , with \(\sigma ^2\) known. A standard pivotal function is \(Q({\bar{X}}, \mu )=\sqrt{n}({\bar{X}}-\mu )/ \sigma \) , where \(\bar{X}=\sum X_i/n\) . Since \(Q({\bar{X}}, \mu )\) is decreasing in \(\mu \) and has the standard normal distribution \(\Phi \) , the CD for \(\mu \) is \(H_{\bar{x}}(\mu )=1-\Phi (\sqrt{n}({\bar{x}}-\mu )/ \sigma )=\Phi (\sqrt{n}(\mu -{\bar{x}})/ \sigma )\) , that is, a N \(({\bar{x}},\sigma /\sqrt{n})\) distribution. When the variance is unknown we can use the pivotal function \(Q({\bar{X}}, \mu )=\sqrt{n}({\bar{X}}-\mu )/S\) , where \(S^2=\sum (X_i-\bar{X})^2/(n-1)\) , and the CD for \(\mu \) is \(H_{{\bar{x}},s}(\mu )=1-F^{T_{n-1}}(\sqrt{n}({\bar{x}}-\mu )/ s)=F^{T_{n-1}}(\sqrt{n}(\mu -{\bar{x}})/ s)\) , where \(F^{T_{n-1}}\) is the t-distribution function with \(n-1\) degrees of freedom.
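As a minimal sketch of the known-variance case, the following Python snippet evaluates the CD \(H_{\bar{x}}(\mu )\), the confidence curve \(|1-2H_{\bar{x}}(\mu )|\) and an equal-tailed interval read off the CD quantiles. The function names and the numerical values (\({\bar{x}}=2.7\), \(\sigma =1\), \(n=25\)) are assumptions chosen only for illustration.

```python
from math import sqrt
from statistics import NormalDist


def normal_cd(xbar, sigma, n):
    """CD for mu when sigma is known: a normal centred at xbar with
    standard deviation sigma / sqrt(n)."""
    return NormalDist(mu=xbar, sigma=sigma / sqrt(n))


def confidence_curve(cd, mu):
    """CC(mu) = |1 - 2 H(mu)|."""
    return abs(1.0 - 2.0 * cd.cdf(mu))


def equal_tailed_interval(cd, level):
    """Interval whose end points are the alpha/2 and 1 - alpha/2 CD quantiles."""
    alpha = 1.0 - level
    return cd.inv_cdf(alpha / 2.0), cd.inv_cdf(1.0 - alpha / 2.0)


# Hypothetical data summary, used only for illustration.
cd = normal_cd(xbar=2.7, sigma=1.0, n=25)
lower, upper = equal_tailed_interval(cd, level=0.95)

print(f"CD median: {cd.inv_cdf(0.5):.3f}")
print(f"95% interval: ({lower:.3f}, {upper:.3f})")
print(f"CC at the end points: {confidence_curve(cd, lower):.3f}, "
      f"{confidence_curve(cd, upper):.3f}")
```

The end points of the interval of level \(1-\alpha \) are exactly the values of \(\mu \) at which the confidence curve equals \(1-\alpha \), which is why the curve can be used to read off intervals of any level.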

Example 2 ( Uniform model ) Let \(\textbf{X}=(X_1,\ldots ,X_n)\) be an i.i.d. sample from the uniform distribution on \((0,\theta )\) , \(\theta >0\) . Consider the (sufficient) statistic \(T=\max (X_1, \ldots ,X_n)\) whose distribution function is \(F_\theta (t)=(t/\theta )^n\) , for \(0<t<\theta \) . Because \(F_\theta (t)\) is decreasing in \(\theta \) and the limit conditions are satisfied for \(\theta >t\) , the CD for \(\theta \) is \(H_t(\theta )=1-(t/\theta )^n\) , i.e. a Pareto distribution \(\text {Pa}(n, t)\) with parameters n (shape) and t (scale). Since the uniform distribution is not regular, the consistency of the CD follows from condition iii) of Proposition 1 . This is because \(E^{H_{t}}(\theta )=nt/(n-1)\) and \(Var^{H_{t}}(\theta )=nt^2/((n-2)(n-1)^2)\) , so that, for \(n\rightarrow \infty \) , \(E^{H_{t}}(\theta ) \rightarrow \theta _0\) (from the strong consistency of the estimator T of \(\theta \) , see e.g. Shao 2003 , p. 134) and \(Var^{H_{t}}(\theta )\rightarrow 0\) trivially.
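The uniform example can be explored numerically. The sketch below codes the Pareto CD \(H_t(\theta )=1-(t/\theta )^n\), its quantile function and its mean, and then checks by simulation that \(H(T,\theta _0)\) behaves like a uniform variable on (0, 1) at the true value, as required by Definition 1. The function names, the seed and the values \(\theta _0=3\), \(n=10\) are hypothetical choices made only for this illustration.

```python
import random


def uniform_cd(t, n, theta):
    """H_t(theta) = 1 - (t / theta)^n for theta > t, and 0 otherwise."""
    if theta <= t:
        return 0.0
    return 1.0 - (t / theta) ** n


def uniform_cd_quantile(t, n, u):
    """Invert H_t: the value theta with H_t(theta) = u, for 0 <= u < 1."""
    return t / (1.0 - u) ** (1.0 / n)


def uniform_cd_mean(t, n):
    """Mean of the Pareto(n, t) CD; requires n > 1."""
    return n * t / (n - 1)


# Monte Carlo check of requirement ii) in Definition 1: evaluated at the
# true theta_0, H(T, theta_0) should be uniform on (0, 1).
rng = random.Random(12345)          # fixed seed, arbitrary choice
theta_0, n, replications = 3.0, 10, 5000
values = []
for _ in range(replications):
    t = max(rng.uniform(0.0, theta_0) for _ in range(n))
    values.append(uniform_cd(t, n, theta_0))

mean_value = sum(values) / replications
share_below_quarter = sum(v <= 0.25 for v in values) / replications
print(f"mean of H(T, theta_0): {mean_value:.3f}  (uniform: 0.5)")
print(f"share below 0.25:      {share_below_quarter:.3f}  (uniform: 0.25)")
```

The simulated mean and the share of values below 0.25 should be close to 0.5 and 0.25, up to Monte Carlo error.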

## 2.1 Peculiarities of confidence distributions for discrete models

When the model is discrete, clearly we can only derive asymptotic CDs. However, a crucial question arises regarding uniqueness. Since \(F_{\theta }(t)=\text{ Pr}_\theta \{T \le t\}\) does not coincide with Pr \(_\theta \{T<t\}\) for any value t within the support \({\mathcal {T}}\) of T , it is possible to define two distinct “extreme” CDs. If \(F_\theta (t)\) is nonincreasing in \(\theta \) , we refer to the right CD as \(H_{t}^r(\theta )=1-\text{ Pr}_\theta \{T\le t\}\) and to the left CD as \(H_{t}^\ell (\theta )=1-\text{ Pr}_\theta \{T<t\}\) . Note that \(H_{t}^r(\theta ) < H_{t}^\ell (\theta )\) , for every \(t \in {{\mathcal {T}}}\) and \(\theta \in \Theta \) , so that the center (i.e. the mean or the median) of \(H_{t}^r(\theta )\) is greater than that of \(H_{t}^\ell (\theta )\) . If \(F_\theta (t)\) is increasing in \(\theta \) , we define \( H_{t}^\ell (\theta )=F_\theta (t)\) and \(H^r_t(\theta )=\text{ Pr}_\theta \{T<t\}\) and once again \(H_{t}^r(\theta ) < H_{t}^\ell (\theta )\) . Veronese & Melilli ( 2018b , sec. 3.2) suggest overcoming this nonuniqueness by averaging the CD-densities \(h_t^r\) and \(h_t^\ell \) using the geometric mean \(h_t^g(\theta )\propto \sqrt{h_t^r(\theta )h_t^\ell (\theta )}\) . This typically results in a simpler CD compared to the one obtained through the arithmetic mean, with shorter confidence intervals. Note that the (asymptotic) CD defined in ( 1 ) for discrete models corresponds to the right CD, and it is more appropriately referred to as \(H_t^r(\theta )\) hereafter. Clearly, \(H_{t}^\ell (\theta )\) can be obtained from \(H_{t}^r(\theta )\) by replacing t with its preceding value in the support \({\mathcal {T}}\) . For discrete models, the table in Appendix A reports \(H_{t}^r(\theta )\) , \(H_{t}^\ell (\theta )\) and \(H_t^g(\theta )\) . Compared to \(H^{\ell }_t\) and \(H^r_t\) , \(H^g_t\) offers the advantage of closely approximating a uniform distribution when viewed as a function of the random variable T .

## Proposition 2

Given a discrete statistic T with distribution indexed by a real parameter \(\theta \in \Theta \) and support \({{\mathcal {T}}}\) independent of \(\theta \) , assume that, for each \(\theta \in \Theta \) and \(t\in {\mathcal {T}}\) , \(H^r_t(\theta )< H^g_t(\theta ) < H^{\ell }_t(\theta )\) . Then, denoting by \(G^j\) the distribution function of \(H^j_T\) , with \(j=\ell ,g,r\) , we have \(G^\ell (u) \le u \le G^r(u)\) . Furthermore,

Notice that the assumption in Proposition 2 is always satisfied when the model belongs to a NEF, see Veronese and Melilli ( 2018a ).

The possibility of constructing different CDs using the same discrete statistic T plays an important role in connection with standard p -values, as we will see in Sect. 5 .

Example 3 (Binomial model) Let \(\textbf{X}=(X_1,\ldots , X_n)\) be an i.i.d. sample from a binomial distribution Bi(1, p ) with success probability p . Then \(T=\sum _{i=1}^n X_i\) is distributed as a Bi( n , p ) and by ( 1 ), recalling the well-known relationship between the binomial and beta distributions, it follows that the right CD for p is a Be( \(t+1,n-t\) ) for \(t=0,1,\ldots , n-1\) . Furthermore, the left CD is a Be( \(t,n-t+1\) ) and it easily follows that \(H_t^g(p)\) is a Be( \(t+1/2,n-t+1/2\) ). Figure 1 shows the corresponding three CD-densities along with their respective CCs, emphasizing the central position of \(h_t^g(p)\) and its confidence intervals in comparison to \(h_t^\ell (p)\) and \(h^r_t(p)\) .

Fig. 1 (Binomial model) CD-densities (left plot) and CCs (right plot) corresponding to \(H_t^g(p)\) (solid lines), \(H_t^{\ell }(p)\) (dashed lines) and \(H_t^r(p)\) (dotted lines) for the parameter p with \(n=15\) and \(t=5\) . In the CC plot, the horizontal dotted line is at level 0.95
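The three CDs of the binomial model can be compared numerically. The sketch below computes the right and left CDs directly from binomial tail probabilities and the geometric-mean CD as the distribution function of a Be( \(t+1/2,n-t+1/2\) ). It uses \(n=15\) and \(t=5\) as in the figure, while the evaluation point \(p=0.3\), the function names and the crude midpoint integration of the beta density are assumptions made for illustration; a statistical library would normally supply the beta distribution function.

```python
from math import comb


def binom_cdf(t, n, p):
    """Pr_p{T <= t} for T ~ Bi(n, p); equals 0 when t < 0."""
    return sum(comb(n, k) * p**k * (1.0 - p) ** (n - k) for k in range(t + 1))


def right_cd(t, n, p):
    """H^r_t(p) = 1 - Pr_p{T <= t}."""
    return 1.0 - binom_cdf(t, n, p)


def left_cd(t, n, p):
    """H^l_t(p) = 1 - Pr_p{T < t}."""
    return 1.0 - binom_cdf(t - 1, n, p)


def beta_cdf(p, a, b, steps=20000):
    """Crude midpoint-rule CDF of a Be(a, b) with a, b > 1 (numerical sketch)."""
    h = 1.0 / steps
    total = below = 0.0
    for i in range(steps):
        x = (i + 0.5) * h
        w = x ** (a - 1.0) * (1.0 - x) ** (b - 1.0)
        total += w
        if x <= p:
            below += w
    return below / total


def geometric_cd(t, n, p):
    """H^g_t(p): CDF of Be(t + 1/2, n - t + 1/2), for 0 < t < n."""
    return beta_cdf(p, t + 0.5, n - t + 0.5)


n, t, p = 15, 5, 0.3   # n and t as in the figure; p = 0.3 is an arbitrary point
print(f"right CD:     {right_cd(t, n, p):.4f}")
print(f"geometric CD: {geometric_cd(t, n, p):.4f}")
print(f"left CD:      {left_cd(t, n, p):.4f}")
```

At any fixed \(p\) the geometric-mean CD should fall between the right and the left CD, and the two extreme CDs should agree, up to integration error, with the Be( \(t+1,n-t\) ) and Be( \(t,n-t+1\) ) distribution functions.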

## 3 Confidence distributions in testing problems

As mentioned in Sect. 1 , we believe that introducing a CD can serve as a valuable and unifying approach, compelling individuals to think more deeply about the specific problem they aim to address rather than resorting to automatic rules. In fact, the availability of a whole distribution for the parameter of interest equips statisticians and practitioners with a versatile tool for handling a wide range of inference tasks, such as point and interval estimation, hypothesis testing, and more, without the need for ad hoc procedures. Here, we will address the issue in the simplest manner, referring to Sect. 5 for connections with related ideas in the literature and additional technical details.

Given a set \(A \subseteq \Theta \subseteq \mathbb {R}\) , it seems natural to measure the “support” that the data \(\textbf{x}\) provide to A through the CD \(H_{\textbf{x}}\) , as \(CD(A)=H_{\textbf{x}}(A)= \int _{A} dH_{\textbf{x}}(\theta )\) . Notice that, with a slight abuse of notation widely used in literature (see e.g., Singh et al. 2007 , who call \(H_{\textbf{x}}(A)\) strong-support ), we use \(H_{\textbf{x}}(\theta )\) to indicate the distribution function on \(\Theta \subseteq \mathbb {R}\) evaluated at \(\theta \) and \(H_{\textbf{x}}(A)\) to denote the mass that \(H_{\textbf{x}}\) induces on a (measurable) subset \(A\subseteq \Theta \) . It immediately follows that to compare the plausibility of k different hypotheses \({{\mathcal {H}}}_{i}: \theta \in \Theta _i\) , \(i=1,\ldots ,k\) , with \(\Theta _i \subseteq \Theta \) not being a singleton, it is enough to compute each \(H_{\textbf{x}}(\Theta _i)\) . We will call \(H_{\textbf{x}}(\Theta _i)\) the CD-support provided by \(H_{\textbf{x}}\) to the set \(\Theta _i\) . In particular, consider the usual case in which we have two hypotheses \({{{\mathcal {H}}}_{0}}: \theta \in \Theta _0\) and \({{{\mathcal {H}}}_{1}}: \theta \in \Theta _1\) , with \(\Theta _0 \cap \Theta _1= \emptyset \) , \(\Theta _0 \cup \Theta _1 = \Theta \) and assume that \({{{\mathcal {H}}}_{0}}\) is not a precise hypothesis (i.e. is not of type \(\theta =\theta _0\) ). As in the Bayesian approach one can compute the posterior odds, here we can evaluate the confidence odds \(CO_{0,1}\) of \({{{\mathcal {H}}}_{0}}\) against \({{{\mathcal {H}}}_{1}}\)

\[ CO_{0,1}=\frac{H_{\textbf{x}}(\Theta _0)}{H_{\textbf{x}}(\Theta _1)}. \]

If \(CO_{0,1}\) is greater than one, the data support \({{{\mathcal {H}}}_{0}}\) more than \({{{\mathcal {H}}}_{1}}\) and this support clearly increases with \(CO_{0,1}\) . Sometimes this type of information can be sufficient to have an idea of the reasonableness of the hypotheses, but if we need to take a decision, we can include the confidence odds in a full decision setting. Thus, writing the decision space as \({{\mathcal {D}}}=\{0,1\}\) , where i indicates accepting \({{{\mathcal {H}}}}_i\) , for \(i=0,1\) , a penalization for the two possible errors must be specified. A simple loss function is

\[ L(\delta ,\theta )=\begin{cases} a_0 & \text{if } \delta =1 \text{ and } \theta \in \Theta _0,\\ a_1 & \text{if } \delta =0 \text{ and } \theta \in \Theta _1,\\ 0 & \text{otherwise,} \end{cases} \tag{3} \]

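where \(\delta \) denotes the decision taken and \(a_i >0\) , \(i=0,1\) . The optimal decision is the one that minimizes the (expected) confidence loss

\[ E^{H_{\textbf{x}}}(L(\delta ,\theta ))=\int _{\Theta } L(\delta ,\theta )\, dH_{\textbf{x}}(\theta )=\begin{cases} a_1 H_{\textbf{x}}(\Theta _1) & \text{if } \delta =0,\\ a_0 H_{\textbf{x}}(\Theta _0) & \text{if } \delta =1. \end{cases} \]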

Therefore, we will choose \({{{\mathcal {H}}}_{0}}\) if \(a_0 H_{\textbf{x}}(\Theta _0) > a_1 H_{\textbf{x}}(\Theta _1)\) , that is if \(CO_{0,1}>a_1/a_0\) or equivalently if \(H_{\textbf{x}}(\Theta _0)>a_1/(a_0+a_1)=\gamma \) . Clearly, if there is no reason to penalize differently the two errors by setting an appropriate value for the ratio \(a_1/a_0\) , we assume \(a_0=a_1\) so that \(\gamma =0.5\) . This implies that the chosen hypothesis will be the one receiving the highest level of the CD-support. Therefore, we state the following

## Definition 3

Given the two (non precise) hypotheses \({{\mathcal {H}}}_i: \theta \in \Theta _i\) , \(i=0,1\) , the CD-support of \({{\mathcal {H}}}_i\) is defined as \(H_{\textbf{x}}(\Theta _i)\) . The hypothesis \({{{\mathcal {H}}}_{0}}\) is rejected according to the CD-test if the CD-support is less than a fixed threshold \(\gamma \) depending on the loss function ( 3 ) or, equivalently, if the confidence odds \(CO_{0,1}\) are less than \(a_1/a_0=\gamma /(1-\gamma )\) .
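The CD-test of Definition 3 can be sketched in a few lines of Python. The example assumes a normal CD N(2.7, 0.2) and the one-sided hypotheses \({{{\mathcal {H}}}_{0}}: \mu \le 2.5\) versus \({{{\mathcal {H}}}_{1}}: \mu >2.5\); these values, like the function names `cd_support` and `cd_test`, are hypothetical and serve only to illustrate the definitions.

```python
from math import inf
from statistics import NormalDist


def cd_support(cdf, lower, upper):
    """Mass that a continuous CD with distribution function `cdf` assigns to
    the interval from `lower` to `upper` (use -inf / inf for one-sided sets)."""
    upper_mass = 1.0 if upper == inf else cdf(upper)
    lower_mass = 0.0 if lower == -inf else cdf(lower)
    return upper_mass - lower_mass


def cd_test(support_h0, a0=1.0, a1=1.0):
    """Return (confidence odds, gamma, reject H0?) for complementary hypotheses."""
    support_h1 = 1.0 - support_h0
    odds = support_h0 / support_h1 if support_h1 > 0.0 else inf
    gamma = a1 / (a0 + a1)
    return odds, gamma, support_h0 < gamma


# Hypothetical setting: CD for mu is N(2.7, 0.2); H0: mu <= 2.5, H1: mu > 2.5.
cd = NormalDist(mu=2.7, sigma=0.2)
support_h0 = cd_support(cd.cdf, -inf, 2.5)
odds, gamma, reject = cd_test(support_h0)            # a0 = a1, so gamma = 0.5
print(f"CD-support of H0: {support_h0:.4f}")
print(f"confidence odds:  {odds:.4f}")
print(f"threshold gamma:  {gamma:.2f}, reject H0: {reject}")
```

With equal penalties the rule simply selects the hypothesis with the larger CD-support. Changing the ratio \(a_1/a_0\) moves the threshold \(\gamma \) and can reverse the decision for the same data.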

Unfortunately, the previous notion of CD-support fails for a precise hypothesis \({{{\mathcal {H}}}_{0}}:\theta =\theta _0\) , since in this case \(H_{\textbf{x}}(\{\theta _0\})\) trivially equals zero. Notice that the problem cannot be solved by transforming \({{{\mathcal {H}}}_{0}}:\theta =\theta _0\) into the seemingly more reasonable \({{{\mathcal {H}}}_{0}}^{\prime }:\theta \in [\theta _0-\epsilon , \theta _0+\epsilon ]\) because, apart from the arbitrariness of \(\epsilon \) , the CD-support for very narrow range intervals would typically remain negligible. We thus introduce an alternative way to assess the plausibility of a precise hypothesis or, more generally, of a “small” interval hypothesis.

Consider first \({{{\mathcal {H}}}_{0}}:\theta =\theta _0\) and assume, as usual, that \(H_{\textbf{x}}(\theta )\) is a CD for \(\theta \) , based on the data \(\textbf{x}\) . Looking at the confidence curve \(CC_{\textbf{x}}(\theta )=|1-2H_{\textbf{x}}(\theta )|\) in Fig. 2 , it is reasonable to assume that the closer \(\theta _0\) is to the median \(\theta _m\) of the CD, the greater the consistency of the value of \(\theta _0\) with respect to \(\textbf{x}\) . Conversely, the complement to 1 of the CC represents the unconsidered confidence relating to both tails of the distribution. We can thus define a measure of plausibility for \({{{\mathcal {H}}}_{0}}:\theta =\theta _0\) as \((1-CC_{\textbf{x}}(\theta _0))/2\) and this measure will be referred to as the CD*-support given by \(\textbf{x}\) to the hypothesis. It is immediate to see that

\[ \frac{1-CC_{\textbf{x}}(\theta _0)}{2}=\min \{H_{\textbf{x}}(\theta _0), 1-H_{\textbf{x}}(\theta _0)\}. \tag{4} \]

In other words, if \(\theta _0 < \theta _m\) \([\theta _0 > \theta _m]\) the CD*-support is \(H_{\textbf{x}}(\theta _0)\) \([1-H_{\textbf{x}}(\theta _0)]\) and corresponds to the CD-support of all \(\theta \) ’s that are less plausible than \(\theta _0\) among those located on the left [right] side of the CC . Clearly, if \(\theta _0 = \theta _m\) the CD*-support equals 1/2, its maximum value. Notice that in this case no alternative hypothesis is considered and that the CD*-support provides a measure of plausibility for \(\theta _0\) by examining “the direction of the observed departure from the null hypothesis”. This quotation is derived from Gibbons and Pratt ( 1975 ) and was originally stated to support their preference for reporting a one-tailed p -value over a two-tailed one. Here we are in a similar context and we refer to their paper for a detailed discussion of this recommendation.

Fig. 2 The CD*-supports of the points \(\theta _0\) , \(\theta _1\) , \(\theta _m\) and \(\theta _2\) correspond to half of the solid vertical lines and are given by \(H_{\textbf{x}}(\theta _0)\) , \(H_{\textbf{x}}(\theta _1)\) , \(H_{\textbf{x}}(\theta _m)=1/2\) and \(1-H_{\textbf{x}}(\theta _2)\) , respectively

An alternative way to intuitively justify formula ( 4 ) is as follows. Since \(H_{\textbf{x}}(\{\theta _0\})=0\) , we can look at the set K of values of \(\theta \) which are in some sense “more consistent” with the observed data \(\textbf{x}\) than \(\theta _0\) , and define the plausibility of \({{{\mathcal {H}}}_{0}}\) as \(1-H_{\textbf{x}}(K)\) . This procedure was followed in a Bayesian framework by Pereira et al. ( 1999 ) and Pereira et al. ( 2008 ) who, in order to identify K , rely on the posterior distribution of \(\theta \) and focus on its mode. We refer to these papers for a more detailed discussion of this idea. Here we emphasize only that the evidence \(1-H_{\textbf{x}}(K)\) supporting \({{{\mathcal {H}}}_{0}}\) cannot be considered as evidence against a possible alternative hypothesis. In our context, the set K can be identified as the set \(\{\theta \in \Theta : \theta < \theta _0\}\) if \(H_{\textbf{x}}(\theta _0)>1-H_{\textbf{x}}(\theta _0)\) or as \(\{\theta \in \Theta : \theta >\theta _0\}\) if \(H_{\textbf{x}}(\theta _0)\le 1-H_{\textbf{x}}(\theta _0)\) . It follows immediately that \(1-H_{\textbf{x}}(K)=\min \{H_{\textbf{x}}(\theta _0), 1-H_{\textbf{x}}(\theta _0)\}\) , which coincides with the CD*-support given in ( 4 ).

We can readily extend the previous definition of CD*-support to interval hypotheses \({{{\mathcal {H}}}_{0}}:\theta \in [\theta _1, \theta _2]\) . This extension becomes particularly pertinent when dealing with small intervals, where the CD-support may prove ineffective. In such cases, the set K of \(\theta \) values that are “more consistent” with the data \(\textbf{x}\) than those falling within the interval \([\theta _1, \theta _2]\) should clearly exclude this interval. Instead, it should include one of the two tails, namely, either \(\{\theta \in \Theta : \theta < \theta _1\}\) or \(\{\theta \in \Theta : \theta > \theta _2\}\) , depending on which one receives a greater mass from the CD. Then

\[ K=\begin{cases} \{\theta \in \Theta : \theta < \theta _1\} & \text{if } H_{\textbf{x}}(\theta _1)>1-H_{\textbf{x}}(\theta _2),\\ \{\theta \in \Theta : \theta > \theta _2\} & \text{if } H_{\textbf{x}}(\theta _1)\le 1-H_{\textbf{x}}(\theta _2), \end{cases} \]

so that the CD*-support of the interval \([\theta _1,\theta _2]\) is \(\text{ CD* }([\theta _1,\theta _2])=1-H_{\textbf{x}}(K)=\min \{H_{\textbf{x}}(\theta _2), 1-H_{\textbf{x}}(\theta _1)\}\) , which reduces to ( 4 ) in the case of a degenerate interval (i.e., when \(\theta _1=\theta _2=\theta _0\) ). Therefore, we can establish the following

## Definition 4

Given the hypothesis \({{{\mathcal {H}}}_{0}}: \theta \in [\theta _1,\theta _2]\) , with \(\theta _1 \le \theta _2 \) , the CD*-support of \({{{\mathcal {H}}}_{0}}\) is defined as \(\min \{H_{\textbf{x}}(\theta _2), 1-H_{\textbf{x}}(\theta _1)\}\) . If \(H_{\textbf{x}}(\theta _2) <1-H_{\textbf{x}}(\theta _1)\) it is more reasonable to consider values of \(\theta \) greater than those specified by \({{{\mathcal {H}}}_{0}}\) , and conversely, the opposite holds true in the reverse situation. Furthermore, the hypothesis \({{{\mathcal {H}}}_{0}}\) is rejected according to the CD*-test if its CD*-support is less than a fixed threshold \(\gamma ^*\) .
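Definition 4 translates directly into code. The sketch below computes the CD*-support for interval and precise hypotheses and applies the CD*-test, again under the assumption of a hypothetical normal CD N(2.7, 0.2); the function names and the chosen intervals are illustrative only.

```python
from statistics import NormalDist


def cd_support(cdf, theta1, theta2):
    """CD-support of [theta1, theta2]: H(theta2) - H(theta1)."""
    return cdf(theta2) - cdf(theta1)


def cd_star_support(cdf, theta1, theta2):
    """CD*-support of [theta1, theta2]: min{H(theta2), 1 - H(theta1)}.
    With theta1 == theta2 this is the support of a precise hypothesis."""
    return min(cdf(theta2), 1.0 - cdf(theta1))


def cd_star_test(cdf, theta1, theta2, gamma_star):
    """Reject H0 when its CD*-support falls below the threshold gamma_star."""
    return cd_star_support(cdf, theta1, theta2) < gamma_star


# Hypothetical CD: N(2.7, 0.2).
H = NormalDist(mu=2.7, sigma=0.2).cdf

narrow = cd_star_support(H, 2.4, 2.6)
wide = cd_star_support(H, 2.3, 2.6)       # contains the narrow interval
point = cd_star_support(H, 2.35, 2.35)    # precise hypothesis mu = 2.35

print(f"CD*-support of [2.4, 2.6]: {narrow:.4f}")
print(f"CD*-support of [2.3, 2.6]: {wide:.4f}")
print(f"CD-support  of [2.4, 2.6]: {cd_support(H, 2.4, 2.6):.4f}")
print(f"CD*-support of mu = 2.35:  {point:.4f}")
```

Because the support is a minimum of two quantities that are monotone in the interval end points, enlarging the interval can never lower it.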

The definition of CD*-support has been established for bounded interval (or precise) hypotheses. However, it can be readily extended to one-sided intervals such as \((-\infty , \theta _0]\) or \([\theta _0, +\infty )\) , and in these cases it is evident that the CD*- and the CD-support are equivalent. For a general interval hypothesis we observe that \(H_{\textbf{x}}([\theta _1, \theta _2])\le \min \{H_{\textbf{x}}(\theta _2), 1-H_{\textbf{x}}(\theta _1)\}\) . Consequently, the CD-support can never exceed the CD*-support, even though they exhibit significant similarity when \(\theta _1\) or \(\theta _2\) resides in the extreme region of one tail of the CD or when the CD is highly concentrated (see Examples 4 , 6 and 7 ).

It is crucial to emphasize that both CD-support and CD*-support are coherent measures of the evidence provided by the data for a hypothesis. This coherence arises from the fact that if \({{{\mathcal {H}}}_{0}}\subset {{{\mathcal {H}}}_{0}}^{\prime }\) , both the supports for \({{{\mathcal {H}}}_{0}}^{\prime }\) cannot be less than those for \({{{\mathcal {H}}}_{0}}\) . This is in stark contrast to the behavior of p -values, as demonstrated in Schervish ( 1996 ), Peskun ( 2020 ), and illustrated in Examples 4 and 7 .

Finally, as seen in Sect. 2.1 , various options for CDs are available for discrete models. Unless a specific problem suggests otherwise (see Sect. 5.1 ), we recommend using the geometric mean \(H_t^g\) as it offers a more impartial treatment of \({{{\mathcal {H}}}_{0}}\) and \({{{\mathcal {H}}}_{1}}\) , as shown in Proposition 2 .

In this section, we illustrate the behavior, effectiveness, and simplicity of CD- and CD*-supports in an inferential context through several examples. We examine various contexts to assess the flexibility and consistency of our approach and compare it with the standard one. It is worth noting that the computation of the p -value for interval hypotheses is challenging and does not have a closed form.

Example 4 ( Normal model ) As seen in Example 1 , the CD for the mean \(\mu \) of a normal model is N \(({\bar{x}},\sigma /\sqrt{n})\) , for \(\sigma \) known. For simplicity, we assume this case; otherwise, the CD would be a t-distribution. Figure 3 shows the CD-density and the corresponding CC for \({\bar{x}}=2.7\) with three different values of \(\sigma /\sqrt{n}\) : \(1/\sqrt{50}=0.141\) , \(1/\sqrt{25}=0.2\) and \(1/\sqrt{10}=0.316\) .

The observed \({\bar{x}}\) specifies the center of both the CD and the CC, and values of \(\mu \) that are far from it receive less support the smaller the dispersion \(\sigma /\sqrt{n}\) of the CD. Alternatively, values of \(\mu \) within the CC, i.e., within the confidence interval of a specific level, are more reasonable than values outside it. These values become more plausible as the level of the interval decreases. Table 1 clarifies these points by providing the CD-support, confidence odds, CD*-support, and the p -value of the UMPU test for different interval hypotheses and different values of \(\sigma /\sqrt{n}\) .

(Normal model) CD-densities (left plot) and CCs (right plot) for \(\mu \) with \({\bar{x}}=2.7\) and three values of \(\sigma /\sqrt{n}\) : \(1/\sqrt{50}\) (solid line), \(1/\sqrt{25}\) (dashed line) and \(1/\sqrt{10}\) (dotted line). In the CC plot the dotted horizontal line is at level 0.95

It can be observed that when the interval is sufficiently large, e.g., [2.0, 2.5], the CD- and the CD*-supports are similar. However, for smaller intervals, as in the other three cases, the difference between the CD- and the CD*-support increases with the dispersion of the CD, \(\sigma /\sqrt{n}\) , regardless of whether the interval contains the observation \({\bar{x}}\) or not. These aspects are general and depend on the form of the CD. Therefore, a comparison between these two measures can be useful to clarify whether an interval should be regarded as small or not, according to the problem under analysis. Regarding the p -value of the UMPU test (see Schervish 1996 , equation 2), it is similar to the CD*-support when the interval is large (first case). However, the difference grows with the dispersion in the other cases. Furthermore, enlarging the interval from [2.4, 2.6] to [2.3, 2.6] (a case not reported in Table 1 ) leaves the CD*-supports unchanged but reduces the p -values to 0.241, 0.331, and 0.479 for the three values of \(\sigma /\sqrt{n}\) considered. This once again highlights the incoherence of the p -value as a measure of the plausibility of a hypothesis.

Now, consider a precise hypothesis, for instance, \({{{\mathcal {H}}}_{0}}:\mu =2.35\) . For the three values used for \(\sigma /\sqrt{n}\) , the CD*-supports are 0.007, 0.040, and 0.134, respectively. From Fig. 3 , it is evident that the point \(\mu =2.35\) lies to the left of the median of the CD. Consequently, the data suggest values of \(\mu \) larger than 2.35. Furthermore, looking at the CC, it becomes apparent that 2.35 is not encompassed within the confidence interval of level 0.95 when \(\sigma /\sqrt{n}=1/\sqrt{50}\) , contrary to what occurs in the other two cases. Due to the symmetry of the normal model, the UMPU test coincides with the equal-tailed test, so that the p -value is equal to 2 times the CD*-support (see Remark 4 in Sect. 5.2 ). Furthermore, the size of the CD*-test is \(2\gamma ^*\) , where \(\gamma ^*\) is the threshold fixed to decide whether to reject the hypothesis or not (see Proposition 5 ). Thus, if a test of level 0.05 is desired, it is sufficient to fix \(\gamma ^*=0.025\) , and both the CD*-support and the p -value lead to the same decision, namely, rejecting \({{{\mathcal {H}}}_{0}}\) only for the case \(\sigma /\sqrt{n}=0.141\) .
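The supports discussed for the normal model can be checked numerically. The following is a minimal sketch, assuming the CD \(H(\mu )=\Phi (({\mu -\bar{x}})/(\sigma /\sqrt{n}))\) of Example 1, the CD-support \(H(\theta _2)-H(\theta _1)\) and the CD*-support \(\min \{H(\theta _2), 1-H(\theta _1)\}\) of an interval \([\theta _1,\theta _2]\) ; the function names are illustrative and not part of the paper.

```python
import math

def normal_cdf(z):
    """Standard normal distribution function via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def cd_normal(mu, xbar, se):
    """CD for the mean with known sigma: H(mu) = Phi((mu - xbar) / se)."""
    return normal_cdf((mu - xbar) / se)

def cd_support(lo, hi, xbar, se):
    """CD-support of [lo, hi]: the mass the CD assigns to the interval."""
    return cd_normal(hi, xbar, se) - cd_normal(lo, xbar, se)

def cd_star_support(lo, hi, xbar, se):
    """CD*-support of [lo, hi]: min{H(hi), 1 - H(lo)}; lo == hi is a precise hypothesis."""
    return min(cd_normal(hi, xbar, se), 1.0 - cd_normal(lo, xbar, se))

xbar = 2.7
for n in (50, 25, 10):
    se = 1.0 / math.sqrt(n)
    point = cd_star_support(2.35, 2.35, xbar, se)
    print(f"se={se:.3f}",
          f"CD*(mu=2.35)={point:.3f}",
          f"equal-tailed p={2 * point:.3f}",
          f"CD*([2.4,2.6])={cd_star_support(2.4, 2.6, xbar, se):.3f}",
          f"CD*([2.3,2.6])={cd_star_support(2.3, 2.6, xbar, se):.3f}")
```

With \({\bar{x}}=2.7\) the CD*-support of both [2.4, 2.6] and [2.3, 2.6] is determined by \(H(2.6)\) alone, which is why enlarging the interval on the left does not change it.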

To assess the effectiveness of the CD*-support, we conduct a brief simulation study. For different values of \(\mu \) , we generate 100000 values of \({\bar{x}}\) from a normal distribution with mean \(\mu \) and various standard deviations \(\sigma /\sqrt{n}\) . We obtain the corresponding CDs with the CD*-supports and also compute the p -values. In Table 2 , we consider \({{{\mathcal {H}}}_{0}}: \mu \in [2.0, 2.5]\) , and the performance of the CD*-support can be evaluated by looking, for example, at the proportions of values in the intervals [0, 0.4), [0.4, 0.6) and [0.6, 1]. Values of the CD*-support in the first interval suggest a low plausibility of \({{{\mathcal {H}}}_{0}}\) in the light of the data, while values in the third one suggest a high plausibility. We highlight the proportions of incorrect evaluations in boldface. The last column of the table reports the proportion of errors resulting from the use of the standard procedure based on the p -value for a threshold of 0.05. Note how the proportion of errors related to the CD*-support is generally quite low, with a maximum value of 0.301, contrary to what happens for the automatic procedure based on the p -value, which reaches a proportion of errors of 0.845. Notice that the maximum error due to the CD*-support is obtained when \({{{\mathcal {H}}}_{0}}\) is true, while that due to the p -value is obtained in the opposite case, as expected.

We consider now the two hypotheses \({{{\mathcal {H}}}_{0}}:\mu =2.35\) and \({{{\mathcal {H}}}_{0}}: \mu \in [2.75,2.85]\) . Notice that the interval in the second hypothesis should be regarded as small, because it can be checked that the CD- and CD*-supports consistently differ, as can be seen for example in Table 1 for the case \({\bar{x}}=2.7\) . Thus, this hypothesis can be considered not too different from a precise one. Because for a precise hypothesis the CD*-support cannot be larger than 0.5, to evaluate the performance of the CD*-support we can consider the three intervals [0, 0.2), [0.2, 0.3) and [0.3, 0.5].

Table 3 reports the results of the simulation, including again the proportion of errors resulting from the use of the p -value with threshold 0.05. For the precise hypothesis \({{{\mathcal {H}}}_{0}}: \mu =2.35\) , the proportion of values of the CD*-support less than 0.2 when \(\mu =2.35\) is, whatever the standard deviation, approximately equal to 0.4. This follows from the fact that, when a precise hypothesis is true, the CD*-support has a uniform distribution on the interval [0, 0.5] (see Proposition 5 ). This aspect must be taken into careful consideration when setting a threshold for a CD*-test. On the other hand, the proportion of values of the CD*-support in the interval [0.3, 0.5], which wrongly support \({{{\mathcal {H}}}_{0}}\) when it is false, goes from 0.159 to 0.333 for \(\mu =2.55\) and from 0.010 to 0.193 for \(\mu =2.75\) ; these proportions are clearly better than those obtained from the standard procedure based on the p -value. Take now the hypothesis \({{{\mathcal {H}}}_{0}}: \mu \in [2.75,2.85]\) . Since it can be considered not too different from a precise hypothesis, we consider the proportion of values of the CD*-support in the intervals [0, 0.2), [0.2, 0.3) and [0.3, 1]. Notice that, for simplicity, we assume 1 as the upper bound of the third interval, even though for small intervals the values of the CD*-support cannot be much larger than 0.5. In our simulation it does not exceed 0.635. For the different values of \(\mu \) considered, the behavior of the CD*-support and of the p -value is not too different from the previous case of a precise hypothesis, although the proportion of errors decreases for both when \({{{\mathcal {H}}}_{0}}\) is true and increases when \({{{\mathcal {H}}}_{0}}\) is false.
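The uniform behavior of the CD*-support under a true precise hypothesis can be illustrated with a small simulation. This is only a sketch in the spirit of the study described above: it uses 20000 replications rather than 100000, an arbitrary seed, and a single standard deviation 0.2, so its output is not meant to reproduce Table 3. All function names are hypothetical.

```python
import math
import random

def normal_cdf(z):
    """Standard normal distribution function via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def cd_star_point(mu0, xbar, se):
    """CD*-support of the precise hypothesis mu = mu0 under the normal CD."""
    h = normal_cdf((mu0 - xbar) / se)
    return min(h, 1.0 - h)

def simulate_proportions(mu_true, mu0, se, reps=20000, seed=1):
    """Proportions of CD*-supports falling in [0, 0.2), [0.2, 0.3) and [0.3, 0.5]."""
    rng = random.Random(seed)
    counts = [0, 0, 0]
    for _ in range(reps):
        xbar = rng.gauss(mu_true, se)          # simulated sample mean
        s = cd_star_point(mu0, xbar, se)
        counts[0 if s < 0.2 else 1 if s < 0.3 else 2] += 1
    return [c / reps for c in counts]

# H0: mu = 2.35 true: proportions should be close to 0.4, 0.2, 0.4
print(simulate_proportions(2.35, 2.35, 0.2))
# H0 false (mu = 2.75): mass moves towards the first interval
print(simulate_proportions(2.75, 2.35, 0.2))
```

Since the CD*-support is uniform on [0, 0.5] when the precise hypothesis is true, the three proportions in the first call should be close to 0.4, 0.2 and 0.4, whatever the standard deviation.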

( Binomial model ) Suppose we are interested in assessing the chances of candidate A winning the next ballot for a certain administrative position. The latest election poll based on a sample of size \(n=20\) , yielded \(t=9\) votes in favor of A . What can we infer? Clearly, we have a binomial model where the parameter p denotes the probability of having a vote in favor of A . The standard estimate of p is \(\hat{p}=9/20=0.45\) , which might suggest that A will lose the ballot. However, the usual (Wald) confidence interval of level 0.95 based on the normal approximation, i.e. \(\hat{p} \pm 1.96 \sqrt{\hat{p}(1-\hat{p})/n}\) , is (0.232, 0.668). Given its considerable width, this interval suggests that the previous estimate is unreliable. We could perform a statistical test with a significance level \(\alpha \) , but what is \({{{\mathcal {H}}}_{0}}\) , and what value of \(\alpha \) should we consider? If \({{{\mathcal {H}}}_{0}}: p \ge 0.5\) , implying \({{{\mathcal {H}}}_{1}}: p <0.5\) , the p -value is 0.327. This suggests not rejecting \({{{\mathcal {H}}}_{0}}\) for any usual value \(\alpha \) . However, if we choose \({{{\mathcal {H}}}_{0}}^\prime : p \le 0.5\) the p -value is 0.673, and in this case, we would not reject \({{{\mathcal {H}}}_{0}}^\prime \) . These results provide conflicting indications. As seen in Example 3 , the CD for p , \(H_t^g(p)\) , is Be(9.5,11.5) and Fig. 4 shows its CD-density along with the corresponding CC, represented by solid lines. The dotted horizontal line at 0.95 in the CC plot highlights the (non-asymptotic) equal-tailed confidence interval (0.251, 0.662), which is shorter than the Wald interval. Note that our interval can be easily obtained by computing the quantiles of order 0.025 and 0.975 of the beta distribution.

(Binomial model) CD-densities (left plot) and CCs (right plot) corresponding to \(H_t^g(p)\) , for the parameter p , with \(\hat{p}=t/n=0.45\) : \(n=20\) , \(t=9\) (solid lines) and \(n=60\) , \(t=27\) (dashed lines). In the CC plot the horizontal dotted line is at level 0.95

The CD-support provided by the data for the two hypotheses \({{{\mathcal {H}}}_{0}}:p \ge 0.5\) and \({{{\mathcal {H}}}_{1}}:p < 0.5\) (the choice of what is called \(H_0\) being irrelevant), is \(1-H_t^g(0.5)=0.328\) and \(H_t^g(0.5)=0.672\) respectively. Therefore, the confidence odds are \(CO_{0,1}=0.328/0.672=0.488\) , suggesting that the empirical evidence in favor of the victory of A is half of that of its defeat. Now, consider a sample of size \(n=60\) with \(t=27\) , so that again \(\hat{p}=0.45\) . While a standard analysis leads to the same conclusions (the p -values for \({{{\mathcal {H}}}_{0}}\) and \({{{\mathcal {H}}}_{0}}^{\prime }\) are 0.219 and 0.781, respectively), the use of the CD clarifies the differences between the two cases. The corresponding CD-density and CC are also reported in Fig. 4 (dashed lines) and, as expected, they are more concentrated around \(\hat{p}\) . Thus, the accuracy of the estimates of p is greater for the larger n and the length of the confidence intervals is smaller. Furthermore, for \(n=60\) , \(CO_{0,1}=0.281\) reducing the chance that A wins to about 1 to 4.
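The quantities of the polling example can be recomputed with the standard library alone. The sketch below assumes that \(H_t^g\) is the Be( \(t+1/2, n-t+1/2\) ) distribution function, as stated for \(n=20\) , \(t=9\) ; the beta distribution function is approximated by Simpson's rule and the quantiles by bisection, which are implementation choices made here for self-containment, not the paper's method. The helper names are hypothetical.

```python
import math

def beta_cdf(x, a, b, steps=2000):
    """Be(a, b) distribution function by composite Simpson's rule (adequate for a, b > 1)."""
    if x <= 0.0:
        return 0.0
    if x >= 1.0:
        return 1.0
    log_norm = math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)

    def pdf(u):
        if u <= 0.0 or u >= 1.0:
            return 0.0
        return math.exp(log_norm + (a - 1) * math.log(u) + (b - 1) * math.log(1 - u))

    h = x / steps  # steps must be even for Simpson's rule
    total = pdf(0.0) + pdf(x)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * pdf(i * h)
    return total * h / 3

def beta_quantile(q, a, b):
    """Quantile of Be(a, b) by bisection on the distribution function."""
    lo, hi = 0.0, 1.0
    for _ in range(50):
        mid = 0.5 * (lo + hi)
        if beta_cdf(mid, a, b) < q:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def analyse(n, t):
    """Wald interval, CD-supports of p >= 0.5 and p < 0.5, odds and CD interval."""
    p_hat = t / n
    half = 1.96 * math.sqrt(p_hat * (1 - p_hat) / n)
    a, b = t + 0.5, n - t + 0.5
    support_h1 = beta_cdf(0.5, a, b)      # H(0.5): support for p < 0.5
    support_h0 = 1.0 - support_h1         # support for p >= 0.5
    return {
        "wald": (p_hat - half, p_hat + half),
        "support_h0": support_h0,
        "support_h1": support_h1,
        "odds": support_h0 / support_h1,
        "cd_interval": (beta_quantile(0.025, a, b), beta_quantile(0.975, a, b)),
    }

for n, t in ((20, 9), (60, 27)):
    print(n, t, analyse(n, t))
```

The same \(\hat{p}=0.45\) yields different confidence odds for the two sample sizes because the CD for \(n=60\) is more concentrated, so less of its mass lies above 0.5.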

As a second application on the binomial model, we follow Johnson and Rossell ( 2010 ) and consider a stylized phase II trial of a new drug designed to improve the overall response rate from 20% to 40% for a specific population of patients with a common disease. The hypotheses are \({{{\mathcal {H}}}_{0}}:p \le 0.2\) versus \({{{\mathcal {H}}}_{1}}: p>0.2\) . It is assumed that patients are accrued and the trial continues until one of the two events occurs: (a) data clearly support one of the two hypotheses (indicated by a CD-support greater than 0.9) or (b) 50 patients have entered the trial. Trials that are not stopped before the 51st patient accrues are assumed to be inconclusive.

Based on a simulation of 1000 trials, Table 4 reports the proportions of trials that conclude in favor of each hypothesis, along with the average number of patients observed before each trial is stopped, for \(\theta =0.1\) (the central value of \({{{\mathcal {H}}}_{0}}\) ) and for \(\theta =0.4\) . A comparison with the results reported by Johnson and Rossell ( 2010 ) reveals that our approach is clearly superior to Bayesian inferences performed with standard priors and comparable to those obtained under their carefully specified non-local prior. Although there is a slight reduction in the proportion of trials stopped for \({{\mathcal {H}}}_0\) (0.814 compared to 0.91), the average number of involved patients is lower (12.7 compared to 17.7), and the power is higher (0.941 against 0.812).

(Exponential distribution) Suppose an investigator aims to compare the performance of a new item, measured in terms of average lifetime, with that of the one currently in use, which is 0.375. To model the item lifetime, it is common to use the exponential distribution with rate parameter \(\lambda \) , so that the mean is \(1/\lambda \) . The typical testing problem is defined by \({{\mathcal {H}}}_0: \lambda =1/0.375=2.667\) versus \({{\mathcal {H}}}_1: \lambda \ne 2.667\) . In many cases, it would be more realistic and interesting to consider hypotheses of the form \({{\mathcal {H}}}_0: \lambda \in [\lambda _1,\lambda _2]\) versus \({{\mathcal {H}}}_1: \lambda \notin [\lambda _1,\lambda _2]\) , and if \({{{\mathcal {H}}}_{0}}\) is rejected, it becomes valuable to know whether the new item is better or worse than the old one. Note that, although an UMPU test exists for this problem, calculating its p -value is not simple and cannot be expressed in a closed form. Here we consider two different null hypotheses: \({{\mathcal {H}}}_0: \lambda \in [2, 4]\) and \({{\mathcal {H}}}_0: \lambda \in [2.63, 2.70]\) , corresponding to a tolerance in the difference between the mean lifetimes of the new and old items equal to 0.125 and 0.005, respectively. Given a sample of n new items with mean \({\bar{x}}\) , it follows from Table 8 in Appendix A that the CD for \(\lambda \) is Ga( n , t ), where \(t=n\bar{x}\) . Assuming \(n=10\) , we consider two values of t , namely, 1.5 and 4.5. The corresponding CD-densities are illustrated in Fig. 5 showing how the observed value t significantly influences the shape of the distribution, altering both its center and its dispersion, in contrast to the normal model. Specifically, for \(t=1.5\) , the potential estimates of \(\lambda \) , represented by the mean and median of the CD, are 6.67 and 6.45, respectively. For \(t=4.5\) , these values change to 2.22 and 2.15.

Table 5 provides the CD- and the CD*-supports corresponding to the two null hypotheses considered, along with the p -values of the UMPU test. Figure 5 and Table 5 together make it evident that, for \(t=1.5\) , the supports of both interval null hypotheses are very low, leading to their rejection, unless the problem requires a loss function that strongly penalizes a wrong rejection. Furthermore, it is immediately apparent that the data suggest higher values of \(\lambda \) , indicating a lower average lifetime of the new item. Note that the standard criterion “ p -value \(< 0.05\) ” would imply not rejecting \({{{\mathcal {H}}}_{0}}: \lambda \in [2,4]\) . For \(t=4.5\) , when \({{{\mathcal {H}}}_{0}}: \lambda \in [2,4]\) , the median 2.15 of the CD falls within the interval [2, 4]. Consequently, both the CD- and the CD*-supports are greater than 0.5, leading to the acceptance of \({{{\mathcal {H}}}_{0}}\) , as also suggested by the p -value. When \({{{\mathcal {H}}}_{0}}: \lambda \in [2.63, 2.70]\) , the CD-support becomes meaningless, whereas the CD*-support is not negligible (0.256) and should be carefully evaluated in accordance with the problem under analysis. This contrasts with the indication provided by the p -value (0.555).

For the point null hypothesis \(\lambda =2.67\) , the analysis is similar to that for the interval [2.63, 2.70]. Note that, in this case, in addition to the UMPU test, it is also possible to consider the simpler and most frequently used equal-tailed test. The corresponding p -value is 0.016 for \(t=1.5\) and 0.484 for \(t=4.5\) ; these values are exactly two times the CD*-support, see Remark 4 .
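Because \(n=10\) is an integer, the Ga( n , t ) CD has an elementary closed form (an Erlang distribution function), so the supports for the exponential model can be sketched without special functions. The code assumes that t acts as the rate parameter of the gamma CD, which is consistent with the CD mean \(n/t\) reported above; function names are illustrative.

```python
import math

def gamma_cd(lam, n, t):
    """CD for the rate lambda: Ga(n, t) distribution function, integer shape n, rate t.

    Uses the Erlang identity P(Ga(n, t) <= lam) = 1 - sum_{k<n} exp(-t*lam) (t*lam)^k / k!.
    """
    x = t * lam
    term, total = 1.0, 1.0
    for k in range(1, n):
        term *= x / k
        total += term
    return 1.0 - math.exp(-x) * total

def cd_support(lo, hi, n, t):
    """CD-support of [lo, hi]."""
    return gamma_cd(hi, n, t) - gamma_cd(lo, n, t)

def cd_star_support(lo, hi, n, t):
    """CD*-support of [lo, hi]: min{H(hi), 1 - H(lo)}."""
    return min(gamma_cd(hi, n, t), 1.0 - gamma_cd(lo, n, t))

n = 10
lam0 = 1.0 / 0.375  # rate corresponding to the current mean lifetime
for t in (1.5, 4.5):
    point = cd_star_support(lam0, lam0, n, t)
    print(f"t={t}: CD mean={n / t:.2f}",
          f"CD*(point)={point:.3f}",
          f"equal-tailed p={2 * point:.3f}",
          f"CD([2.63,2.70])={cd_support(2.63, 2.70, n, t):.3f}",
          f"CD*([2.63,2.70])={cd_star_support(2.63, 2.70, n, t):.3f}")
```

Doubling the CD*-support of the point null should give values close to the equal-tailed p -values 0.016 and 0.484 quoted above; small differences in the last digit may arise from rounding \(\lambda _0\) .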

(Exponential model) CD-densities for the rate parameter \(\lambda \) , with \(n=10\) and \(t=1.5\) (dashed line) and \(t=4.5\) (solid line)

( Uniform model ) As seen in Example 2 , the CD for the parameter \(\theta \) of the uniform distribution \(\text {U}(0, \theta )\) is a Pareto distribution \(\text {Pa}(n, t)\) , where t is the sample maximum. Figure 6 shows the CD-density for \(n=10\) and \(t=2.1\) .

Consider now \({{{\mathcal {H}}}_{0}}: \theta \in [\theta _1, \theta _2]\) versus \({{{\mathcal {H}}}_{1}}: \theta \notin [\theta _1, \theta _2]\) . As usual, we can identify the interval \([\theta _1, \theta _2]\) on the plot of the CD-density and immediately recognize when the CD-test trivially rejects \({{{\mathcal {H}}}_{0}}\) (the interval lies on the left of t , i.e. \(\theta _2<t\) ), when the value of \(\theta _1\) is irrelevant and only the CD-support of \([t,\theta _2]\) determines the decision ( \(\theta _1<t<\theta _2\) ), or when the whole CD-support of \([\theta _1,\theta _2]\) must be considered ( \(t<\theta _1<\theta _2\) ). These facts are not as intuitive when the p -value is used. Indeed, for this problem, there exists the UMP test of level \(\alpha \) (see Eftekharian and Taheri 2015 ) and it is possible to write the p -value as

(we are not aware of previous mention of it). Table 6 reports the p -value of the UMP test, as well as the CD and CD*-supports, for the two hypotheses \({{{\mathcal {H}}}_{0}}: \theta \in [1.5, 2.2]\) and \({{{\mathcal {H}}}_{0}}^\prime : \theta \in [2.0, 2.2]\) for a sample of size \(n=10\) and various values of t .

It can be observed that, when t belongs to the interval \([\theta _1, \theta _2]\) , the CD- and CD*-supports do not depend on \(\theta _1\) , as previously remarked, while the p -value does. This reinforces the incoherence of the p -value shown by Schervish ( 1996 ). For instance, when \(t=2.19\) , the p -value for \({{{\mathcal {H}}}_{0}}\) is 0.046, while that for \({{{\mathcal {H}}}_{0}}^{\prime }\) (included in \({{{\mathcal {H}}}_{0}}\) ) is larger, namely 0.072. Thus, assuming \(\alpha =0.05\) , the UMP test leads to the rejection of \({{{\mathcal {H}}}_{0}}\) but it results in the acceptance of the smaller hypothesis \({{{\mathcal {H}}}_{0}}^{\prime }\) .
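The independence of the supports from \(\theta _1\) when \(\theta _1<t<\theta _2\) is easy to verify numerically. The sketch below assumes the Pareto CD \(H_t(\theta )=1-(t/\theta )^n\) for \(\theta \ge t\) (and 0 otherwise), which follows from the distribution function \((t/\theta )^n\) of the sample maximum; it does not compute the p -value of the UMP test. Function names are hypothetical.

```python
def pareto_cd(theta, n, t):
    """CD for theta in U(0, theta): Pa(n, t) distribution function, zero below t."""
    return 0.0 if theta <= t else 1.0 - (t / theta) ** n

def cd_support(lo, hi, n, t):
    """CD-support of [lo, hi]."""
    return pareto_cd(hi, n, t) - pareto_cd(lo, n, t)

def cd_star_support(lo, hi, n, t):
    """CD*-support of [lo, hi]: min{H(hi), 1 - H(lo)}."""
    return min(pareto_cd(hi, n, t), 1.0 - pareto_cd(lo, n, t))

n, t = 10, 2.19
for lo, hi in ((1.5, 2.2), (2.0, 2.2)):
    print(f"H0: theta in [{lo}, {hi}]",
          f"CD={cd_support(lo, hi, n, t):.4f}",
          f"CD*={cd_star_support(lo, hi, n, t):.4f}")

# An interval entirely to the left of t receives no support at all
print(cd_support(1.5, 2.0, n, t), cd_star_support(1.5, 2.0, n, t))
```

Both hypotheses receive the same CD- and CD*-support because the CD places no mass below t , so the nested hypothesis cannot be better supported than the larger one.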

(Uniform model) CD-density for \(\theta \) with \(n=10\) and \(t=2.1\)

( Sharpe ratio ) The Sharpe ratio is one of the most widely used measures of performance of stocks and funds. It is defined as the average excess return relative to the volatility, i.e. \(SR=\theta =(\mu _R-R_f)/\sigma _R\) , where \(\mu _R\) and \(\sigma _R\) are the mean and standard deviation of a return R and \(R_f\) is a risk-free rate. Under the typical assumption of constant risk-free rate, the excess returns \(X_1, X_2, \ldots , X_n\) of the fund over a period of length n are considered, leading to \(\theta =\mu /\sigma \) , where \(\mu \) and \(\sigma \) are the mean and standard deviation of each \(X_i\) . If the sample is not too small, the distribution and the dependence of the \(X_i\) ’s are not so crucial, and the inference on \(\theta \) is similar to that obtained under the basic assumption of i.i.d. normal random variables, as discussed in Opdyke ( 2007 ). Following this article, we consider the weekly returns of the mutual fund Fidelity Blue Chip Growth from 12/24/03 to 12/20/06 (these data are available for example on Yahoo! Finance, https://finance.yahoo.com/quote/FBGRX ) and assume that the excess returns are i.i.d. normal with a risk-free rate equal to 0.00052. Two different samples are analyzed: the first one includes all \(n_1=159\) observations from the entire period, while the second one is limited to the \(n_2=26\) weeks corresponding to the fourth quarter of 2005 and the first quarter of 2006. The sample mean, the standard deviation, and the corresponding sample Sharpe ratio for the first sample are \(\bar{x}_1=0.00011\) , \(s_1=0.01354\) , \(t_1=\bar{x}_1/s_1=0.00842\) . For the second sample, the values are \(\bar{x}_2=0.00280\) , \(s_2=0.01048\) , \(t_2=\bar{x}_2/s_2=0.26744\) .

We can derive the CD for \(\theta \) starting from the sampling distribution of the statistic \(W=\sqrt{n}T=\sqrt{n}\bar{X}/S\) , which has a noncentral t-distribution with \(n-1\) degrees of freedom and noncentrality parameter \(\tau =\sqrt{n}\mu /\sigma =\sqrt{n}\theta \) . This family has MLR (see Lehmann and Romano 2005 , p. 224) and the distribution function \(F^W_\tau \) of W is continuous in \(\tau \) with \(\lim _{\tau \rightarrow +\infty } F^W_\tau (w)=0\) and \(\lim _{\tau \rightarrow -\infty } F^W_\tau (w)=1\) , for each w in \(\mathbb {R}\) . Thus, from ( 1 ), the CD for \(\tau \) is \(H^\tau _w(\tau )=1-F^W_\tau (w)\) . Recalling that \(\theta =\tau /\sqrt{n}\) , the CD for \(\theta \) can be obtained using a trivial transformation which leads to \(H^\theta _w(\theta )=H^\tau _{w}(\sqrt{n}\theta )=1-F_{\sqrt{n}\theta }^W(w)\) , where \(w=\sqrt{n}t\) . In Figure 7 , the CD-densities for \(\theta \) relative to the two samples are plotted: they are symmetric and centered on the estimate t of \(\theta \) , and the dispersion is smaller for the one with the larger n .

Now, let us consider the typical hypotheses for the Sharpe ratio \({{\mathcal {H}}}_0: \theta \le 0\) versus \({{\mathcal {H}}}_1: \theta >0\) . From Table 7 , which reports the CD-supports and the corresponding odds for the two samples, and from Fig. 7 , it appears that the first sample clearly favors neither hypothesis, while \({{{\mathcal {H}}}_{1}}\) is strongly supported by the second one. Here, the p -value coincides with the CD-support (see Proposition 3 ), but choosing the usual values 0.05 or 0.01 to decide whether to reject \({{{\mathcal {H}}}_{0}}\) or not may lead to markedly different conclusions.

When the assumption of i.i.d. normal returns does not hold, it is possible to show (Opdyke 2007 ) that the asymptotic distribution of T is normal with mean and variance \(\theta \) and \(\sigma ^2_T=(1+\theta ^2(\gamma _4-1)/4-\theta \gamma _3)/n\) , where \(\gamma _3\) and \(\gamma _4\) are the skewness and kurtosis of the \(X_i\) ’s. Thus, the CD for \(\theta \) can be derived from the asymptotic distribution of T and is N( \(t,\hat{\sigma }^2_T)\) , where \(\hat{\sigma }^2_T\) is obtained by estimating the population moments using the sample counterparts. The last column of Table 7 shows that the asymptotic CD-supports for \({{{\mathcal {H}}}_{0}}\) are not too different from the previous ones.
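The asymptotic CD-support for \({{{\mathcal {H}}}_{0}}: \theta \le 0\) is simply the N( \(t,\hat{\sigma }^2_T\) ) distribution function evaluated at 0. The sketch below is an illustration only: the sample skewness and kurtosis of the returns are not given in the text, so the defaults \(\gamma _3=0\) and \(\gamma _4=3\) (the values of a normal distribution) are placeholders, and the printed numbers are not the entries of Table 7 . Function and argument names are hypothetical.

```python
import math

def normal_cdf(z):
    """Standard normal distribution function via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def asymptotic_support_h0(t, n, skew=0.0, kurt=3.0):
    """CD-support of theta <= 0 under the asymptotic CD N(t, var_T).

    var_T = (1 + t^2 (kurt - 1) / 4 - t * skew) / n, with the sample Sharpe ratio t
    plugged in for theta; skew and kurt default to normal values (placeholders).
    """
    var_t = (1.0 + t * t * (kurt - 1.0) / 4.0 - t * skew) / n
    return normal_cdf((0.0 - t) / math.sqrt(var_t))

for label, t, n in (("sample 1", 0.00842, 159), ("sample 2", 0.26744, 26)):
    s0 = asymptotic_support_h0(t, n)
    print(f"{label}: support(H0)={s0:.3f}  odds={s0 / (1.0 - s0):.3f}")
```

Replacing the default moments with the sample skewness and kurtosis of the excess returns would give the asymptotic supports under the non-normal model.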

(Sharpe ratio) CD-densities for \(\theta =\mu /\sigma \) with \(n_1=159\) , \(t_1=0.008\) (solid line) and \(n_2=26\) , \(t_2=0.267\) (dashed line)

( Ratio of Poisson rates ) The comparison of Poisson rates \(\mu _1\) and \(\mu _2\) is important in various contexts, as illustrated for example by Lehmann & Romano ( 2005 , sec. 4.5), who also derive the UMPU test for the ratio \(\phi =\mu _1/\mu _2\) . Given two i.i.d. samples of sizes \(n_1\) and \(n_2\) from independent Poisson distributions, we can summarize the data with the two sufficient sample sums \(S_1\) and \(S_2\) , where \(S_i \sim \) Po( \(n_i\mu _i\) ), \(i=1,2\) . Reparameterizing the joint density of \((S_1, S_2)\) with \(\phi =\mu _1/\mu _2\) and \(\lambda =n_1\mu _1+n_2\mu _2\) , it is simple to verify that the conditional distribution of \(S_1\) given \(S_1+S_2=s_1+s_2\) is Bi( \(s_1+s_2, w\phi /(1+w\phi )\) ), with \(w=n_1/n_2\) , while the marginal distribution of \(S_1+S_2\) depends only on \(\lambda \) . Thus, for making inference on \(\phi \) , it is reasonable to use the CD for \(\phi \) obtained from the previous conditional distribution. Referring to the table in Appendix A, the CD \(H^g_{s_1,s_2}\) for \(w\phi /(1+w\phi )\) is Be \((s_1+1/2, s_2+1/2)\) , enabling us to determine the CD-density for \(\phi \) through the change of variable rule:
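As a sketch of this step, assuming only the Be \((s_1+1/2, s_2+1/2)\) CD for \(y=w\phi /(1+w\phi )\) stated above, note that \(1-y=1/(1+w\phi )\) and \(dy/d\phi =w/(1+w\phi )^2\) . Writing \(h^g_{s_1,s_2}\) for the CD-density of \(\phi \) and \(B(\cdot ,\cdot )\) for the beta function (notation introduced here for illustration), the change of variable gives

\[
h^g_{s_1,s_2}(\phi )=\frac{1}{B(s_1+1/2,\, s_2+1/2)}\,\frac{w^{\,s_1+1/2}\,\phi ^{\,s_1-1/2}}{(1+w\phi )^{\,s_1+s_2+1}},\qquad \phi >0,
\]

and, correspondingly, the CD itself is the Be \((s_1+1/2, s_2+1/2)\) distribution function evaluated at \(w\phi /(1+w\phi )\) .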

We compare our results with those derived by the standard conditional test implemented through the function poisson.test in R. We use the “eba1977” data set available in the package ISwR, ( https://CRAN.R-project.org/package=ISwR ), which contains counts of incident lung cancer cases and population size in four neighboring Danish cities by age group. Specifically, we compare the \(s_1=11\) lung cancer cases in a population of \(n_1=800\) people aged 55–59 living in Fredericia with the \(s_2=21\) cases observed in the other cities, which have a total of \(n_2=3011\) residents. For the hypothesis \({{{\mathcal {H}}}_{0}}: \phi =1\) versus \({{{\mathcal {H}}}_{1}}: \phi \ne 1\) , the R-output provides a p -value of 0.080 and a 0.95 confidence interval of (0.858, 4.277). If a significance level \(\alpha =0.05\) is chosen, \({{{\mathcal {H}}}_{0}}\) is not rejected, leading to the conclusion that there should be no reason for the inhabitants of Fredericia to worry.

Looking at the three CD-densities for \(\phi \) in Fig. 8 , it is evident that values of \(\phi \) greater than 1 are more supported than values less than 1. Thus, one should test the hypothesis \({{{\mathcal {H}}}_{0}}: \phi \le 1\) versus \({{{\mathcal {H}}}_{1}}: \phi >1\) . Using ( 5 ), it follows that the CD-support of \({{{\mathcal {H}}}_{0}}\) is \(H^g_{s_1,s_2}(1)=0.037\) , and the confidence odds are \(CO_{0,1}=0.037/(1-0.037)=0.038\) . To avoid rejecting \({{{\mathcal {H}}}_{0}}\) , a very asymmetric loss function should be deemed suitable. Finally, we observe that the confidence interval computed in R, is the Clopper-Pearson one, which has exact coverage but, as generally recognized, is too wide. In our context, this corresponds to taking the lower bound of the interval using the CC generated by \(H^\ell _{s_1, s_2}\) and the upper bound using that generated by \(H^r_{s_1, s_2}\) (see Veronese and Melilli 2015 ). It includes the interval generated by \(H_{s_1, s_2}^g\) , namely (0.931, 4.026), as shown in the right plot of Fig. 8 .
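The CD-support and the interval based on \(H^g_{s_1,s_2}\) can be recomputed from the Be \((s_1+1/2, s_2+1/2)\) CD for \(w\phi /(1+w\phi )\) . The following is a minimal sketch using only the standard library; the Simpson-rule beta distribution function and the bisection quantile are implementation choices made here, not the paper's code, and it does not reproduce the poisson.test output.

```python
import math

def beta_cdf(x, a, b, steps=2000):
    """Be(a, b) distribution function by composite Simpson's rule (adequate for a, b > 1)."""
    if x <= 0.0:
        return 0.0
    if x >= 1.0:
        return 1.0
    log_norm = math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)

    def pdf(u):
        if u <= 0.0 or u >= 1.0:
            return 0.0
        return math.exp(log_norm + (a - 1) * math.log(u) + (b - 1) * math.log(1 - u))

    h = x / steps  # steps must be even for Simpson's rule
    total = pdf(0.0) + pdf(x)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * pdf(i * h)
    return total * h / 3

def beta_quantile(q, a, b):
    """Quantile of Be(a, b) by bisection on the distribution function."""
    lo, hi = 0.0, 1.0
    for _ in range(50):
        mid = 0.5 * (lo + hi)
        if beta_cdf(mid, a, b) < q:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

s1, n1 = 11, 800     # cases and population, Fredericia
s2, n2 = 21, 3011    # cases and population, other cities
w = n1 / n2
a, b = s1 + 0.5, s2 + 0.5

def cd_phi(phi):
    """H^g for phi: Be(a, b) distribution function at w*phi / (1 + w*phi)."""
    return beta_cdf(w * phi / (1.0 + w * phi), a, b)

def phi_from_y(y):
    """Inverse of y = w*phi / (1 + w*phi)."""
    return y / (w * (1.0 - y))

support_h0 = cd_phi(1.0)                      # CD-support of phi <= 1
odds = support_h0 / (1.0 - support_h0)        # confidence odds CO_{0,1}
interval = (phi_from_y(beta_quantile(0.025, a, b)),
            phi_from_y(beta_quantile(0.975, a, b)))
print(f"support(H0)={support_h0:.3f}  odds={odds:.3f}  "
      f"0.95 interval=({interval[0]:.3f}, {interval[1]:.3f})")
```

The interval is obtained by mapping the 0.025 and 0.975 quantiles of the beta CD back to the \(\phi \) scale, which is valid because the transformation is increasing.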

(Poisson-rates) CD-densities (left plot) and CCs (right plot) corresponding to \(H^g_{s_1,s_2}(\phi )\) (solid lines), \(H^\ell _{s_1,s_2}(\phi )\) (dashed lines) and \(H^r_{s_1,s_2}(\phi )\) (dotted lines) for the parameter \(\phi \) . In the CC plot the vertical lines identify the Clopper-Pearson confidence interval (dashed and dotted lines) and that based on \(H^g_{s_1,s_2}(\phi )\) (solid lines). The dotted horizontal line is at level 0.95

## 5 Properties of CD-support and CD*-support

## 5.1 One-sided hypotheses

The CD-support of a set is the mass assigned to it by the CD, making it a fundamental component in all inferential problems based on CDs. Nevertheless, its direct utilization in hypothesis testing is rare, with the exception of Xie and Singh ( 2013 ). It can also be viewed as a specific instance of evidential support , a notion introduced by Bickel ( 2022 ) within a broader category of models known as evidential models , which encompass both posterior distributions and confidence distributions as specific cases.

Let us now consider a classical testing problem. Let \(\textbf{X}\) be an i.i.d. sample with a distribution depending on a real parameter \(\theta \) and let \({{{\mathcal {H}}}_{0}}: \theta \le \theta _0\) versus \({{{\mathcal {H}}}_{1}}: \theta >\theta _0\) , where \(\theta _0\) is a fixed value (the case \({{{\mathcal {H}}}_{0}}^\prime : \theta \ge \theta _0\) versus \({{{\mathcal {H}}}_{1}}^\prime : \theta <\theta _0\) is perfectly specular and will not be analyzed). In order to compare our test with the standard one, we assume that the model has MLR in \(T=T(\textbf{X})\) . Suppose first that the distribution function \(F_\theta (t)\) of T is continuous and that the CD for \(\theta \) is \(H_t(\theta )=1- F_{\theta }(t)\) . From Sect. 3 , the CD-support for \({{{\mathcal {H}}}_{0}}\) (which coincides with the CD*-support) is \(H_t(\theta _0)\) . In this case, the UMP test exists, as established by the Karlin-Rubin theorem, and rejects \({{{\mathcal {H}}}_{0}}\) if \(t > t_\alpha \) , where \(t_\alpha \) depends on the chosen significance level \(\alpha \) , or alternatively, if the p -value \(\text{ Pr}_{\theta _0}(T\ge t)\) is less than \(\alpha \) . Since \(\text{ Pr}_{\theta _0}(T\ge t)=1-F_{\theta _0}(t)=H_t(\theta _0)\) , the p -value coincides with the CD-support. Thus, to define a CD-test with size \(\alpha \) , it is enough to fix its rejection region as \(\{t: H_t(\theta _0)<\alpha \}\) , and both tests lead to the same conclusion.

When the statistic T is discrete, we have seen that various choices of CDs are possible. Assuming that \(H^r_t(\theta )< H^g_t(\theta ) < H^{\ell }_t(\theta )\) , as occurs for models belonging to a real NEF, it follows immediately that \(H^{r}_t\) provides stronger support for \({{\mathcal {H}}}_0: \theta \le \theta _0\) than \(H^g_t\) does, while \(H^{\ell }_t\) provides stronger support for \({{\mathcal {H}}}_0^\prime : \theta \ge \theta _0\) than \(H^g_t\) does. In other words, \(H_t^{\ell }\) is more conservative than \(H^g_t\) for testing \({{{\mathcal {H}}}_{0}}\) and the same happens to \(H^r_t\) for \({{{\mathcal {H}}}_{0}}^{\prime }\) . Therefore, selecting the appropriate CD can lead to the standard testing result. For example, in the case of \({{{\mathcal {H}}}_{0}}:\theta \le \theta _0\) versus \({{{\mathcal {H}}}_{1}}: \theta > \theta _0\) , the p -value is \(\text{ Pr}_{\theta _0}(T\ge t)=1-\text{ Pr}_{\theta _0}(T<t)=H^{\ell }_t(\theta _0)\) , and the rejection region of the standard test and that of the CD-test based on \(H_t^{\ell }\) coincide if the threshold is the same. However, as both tests are non-randomized, their size is typically strictly less than the fixed threshold.
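The last remark can be illustrated with a binomial statistic. The sketch below assumes \(T\sim \) Bi( \(n,\theta \) ) with \(n=20\) and \(\theta _0=0.5\) (values chosen here only for illustration), computes the p -value \(\text{ Pr}_{\theta _0}(T\ge t)=H^{\ell }_t(\theta _0)\) for every t , and evaluates the size actually attained by the non-randomized test with threshold \(\alpha =0.05\) . Function names are hypothetical.

```python
import math

def binom_pmf(k, n, p):
    """Binomial probability mass function."""
    return math.comb(n, k) * p ** k * (1.0 - p) ** (n - k)

def upper_tail_p_value(t, n, theta0):
    """Pr_{theta0}(T >= t), i.e. the CD-support H^l_t(theta0) of H0: theta <= theta0."""
    return sum(binom_pmf(k, n, theta0) for k in range(t, n + 1))

def attained_size(n, theta0, alpha):
    """Probability under theta0 of the rejection region {t : p-value < alpha}."""
    return sum(binom_pmf(t, n, theta0)
               for t in range(n + 1)
               if upper_tail_p_value(t, n, theta0) < alpha)

n, theta0, alpha = 20, 0.5, 0.05
rejection_region = [t for t in range(n + 1)
                    if upper_tail_p_value(t, n, theta0) < alpha]
print("smallest rejected t:", min(rejection_region))
print("attained size:", round(attained_size(n, theta0, alpha), 4))
```

Because T takes only integer values, the attained size jumps between consecutive tail probabilities and generally falls strictly below the nominal threshold.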

The following proposition summarizes the previous considerations.

## Proposition 3

Consider a model indexed by a real parameter \(\theta \) with MLR in the statistic T and the one-sided hypotheses \({{{\mathcal {H}}}_{0}}: \theta \le \theta _0\) versus \({{{\mathcal {H}}}_{1}}: \theta >\theta _0\) , or \({{{\mathcal {H}}}_{0}}^\prime : \theta \ge \theta _0\) versus \({{{\mathcal {H}}}_{1}}^\prime : \theta <\theta _0\) . If T is continuous, then the CD-support and the p -value associated with the UMP test are equal. Thus, if a common threshold \(\alpha \) is set for both rejection regions, the two tests have size \(\alpha \) . If T is discrete, the CD-support coincides with the usual p -value if \(H^\ell _t [H^r_t]\) is chosen when \({{{\mathcal {H}}}_{0}}: \theta \le \theta _0\) \([{{{\mathcal {H}}}_{0}}^\prime : \theta \ge \theta _0]\) . For a fixed threshold \(\alpha \) , the two tests have a size not greater than \(\alpha \) .

The CD-tests with threshold \(\alpha \) mentioned in the previous proposition have significance level \(\alpha \) and are, therefore, valid , that is \(\sup _{\theta \in \Theta _0} Pr_\theta (H(T)\le \alpha ) \le \alpha \) (see Martin and Liu 2013 ). This is no longer true if, for a discrete T , we choose \(H^g_t\) . However, Proposition 2 implies that its average size is closer to \(\alpha \) compared to those of the tests obtained using \(H^\ell _t\) \([H^r_t]\) , making \(H^g_t\) more appropriate when the problem does not strongly suggest that the null hypothesis should be considered true “until proven otherwise”.

## 5.2 Precise and interval hypotheses

The notion of CD*-support surely demands more attention than that of CD-support. Recalling that the CD*-support only accounts for one direction of deviation from the precise or interval hypothesis, we will first briefly explore its connections with similar notions.

While the CD-support is an additive measure, meaning that for any set \(A \subseteq \Theta \) and its complement \(A^c\) , we always have \(\text{ CD }(A) +\text{ CD }(A^c)=1\) , the CD*-support is only a sub-additive measure, that is \(\text{ CD* }(A) +\text{ CD* }(A^c)\le 1\) , as can be easily checked. This suggests that the CD*-support can be related to a belief function. In essence, a belief function \(\text{ bel}_\textbf{x}(A)\) measures the evidence in \(\textbf{x}\) that supports A . However, due to its sub-additivity, it alone cannot provide sufficient information; it must be coupled with the plausibility function, defined as \(\text {pl}_\textbf{x}(A) = 1 - \text {bel}_\textbf{x}(A^c)\) . We refer to Martin and Liu ( 2013 ) for a detailed treatment of these notions within the general framework of Inferential Models , which admits a CD as a very specific case. We only mention here that they show that when \(A=\{\theta _0\}\) (i.e. a singleton), \(\text{ bel}_\textbf{x}(\{\theta _0\})=0\) , but \(\text{ bel}_\textbf{x}(\{\theta _0\}^c)\) can be different from 1. In particular, for the normal model N \((\theta ,1)\) , they found that, under some assumptions, \(\text{ bel}_\textbf{x}(\{\theta _0\}^c) =|2\Phi (x-\theta _0)-1|\) . Recalling the definition of the CC and the CD provided in Example 1 , it follows that the plausibility of \(\theta _0\) is \(\text {pl}_\textbf{x}(\{\theta _0\})=1-\text{ bel}_\textbf{x}(\{\theta _0\}^c)=1-|2\Phi (x-\theta _0)-1|= 1-CC_\textbf{x}(\theta _0)\) , and using ( 4 ), we can conclude that the CD*-support of \(\theta _0\) corresponds to half its plausibility.
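The final step can be spelled out as follows. This is a short sketch, assuming the CD \(H_x(\theta )=\Phi (\theta -x)=1-\Phi (x-\theta )\) for the N \((\theta ,1)\) model and the CD*-support of a singleton written as \(\min \{H_x(\theta _0), 1-H_x(\theta _0)\}\) . Since \(1-|2u-1|=2\min \{u,1-u\}\) for every \(u\in [0,1]\) , taking \(u=\Phi (x-\theta _0)\) gives

\[
\text {pl}_\textbf{x}(\{\theta _0\})=1-|2\Phi (x-\theta _0)-1|=2\min \{\Phi (x-\theta _0),\,1-\Phi (x-\theta _0)\}=2\min \{1-H_x(\theta _0),\,H_x(\theta _0)\}=2\,\text{CD*}(\{\theta _0\}).
\]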

The CD*-support for a precise hypothesis \({{{\mathcal {H}}}_{0}}: \theta =\theta _0\) is related to the notion of evidence, as defined in a Bayesian context by Pereira et al. ( 2008 ). Evidence is the posterior probability of the set \(\{\theta \in \Theta : p(\theta |\textbf{x})<p(\theta _0|\textbf{x})\}\) , where \(p(\theta |\textbf{x})\) is the posterior density of \(\theta \) . In particular, when a unimodal and symmetric CD is used as a posterior distribution, it is easy to check that the CD*-support coincides with half of the evidence.

The CD*-support is also related to the notion of weak-support defined by Singh et al. (2007) as \(\sup_{\theta \in [\theta_1,\theta_2]} 2 \min\{H_{\textbf{x}}(\theta), 1-H_{\textbf{x}}(\theta)\}\), but important differences exist. If the data give little support to \({{{\mathcal {H}}}_{0}}\), our definition shows more clearly whether values of \(\theta\) to the right or to the left of \({{{\mathcal {H}}}_{0}}\) are more reasonable. Moreover, if \({{{\mathcal {H}}}_{0}}\) is highly supported, that is, \(\theta_m \in [\theta_1,\theta_2]\), the weak-support is always equal to one, while the CD*-support takes values in the interval [0.5, 1], which allows better discrimination between different cases. Only when \({{{\mathcal {H}}}_{0}}\) is a precise hypothesis do the two definitions agree, apart from the multiplicative constant of two.
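The difference between the two measures can be illustrated with a short sketch. It assumes the \(\text{N}(\theta,1)\) model with CD \(H_x(\theta)=\Phi(\theta-x)\), takes \(\min\{H_x(\theta_2), 1-H_x(\theta_1)\}\) as the CD*-support of an interval, and approximates the supremum in the weak-support on a grid; names and grid size are illustrative.

```python
from math import erf, sqrt


def Phi(z):
    """Standard normal distribution function."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))


def cd(theta, x):
    """Assumed CD for the N(theta, 1) model: H_x(theta) = Phi(theta - x)."""
    return Phi(theta - x)


def cd_star_support(theta1, theta2, x):
    """CD*-support of [theta1, theta2]: min{H(theta2), 1 - H(theta1)}."""
    return min(cd(theta2, x), 1.0 - cd(theta1, x))


def weak_support(theta1, theta2, x, grid_size=2001):
    """Grid approximation of sup over [theta1, theta2] of 2 min{H, 1 - H}."""
    best = 0.0
    for i in range(grid_size):
        theta = theta1 + (theta2 - theta1) * i / (grid_size - 1)
        h = cd(theta, x)
        best = max(best, 2.0 * min(h, 1.0 - h))
    return best


# Both intervals contain the CD median (x = 0), so the weak-support is about 1
# in both cases, while the CD*-support separates the narrow from the wide one.
for theta1, theta2 in ((-0.1, 0.1), (-3.0, 3.0)):
    print(theta1, theta2, weak_support(theta1, theta2, 0.0),
          cd_star_support(theta1, theta2, 0.0))
```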

There exists a strong connection between the CD*-support and the e-value introduced by Peskun ( 2020 ). Under certain regularity assumptions, the e -value can be expressed in terms of a CD and coincides with the CD*-support, so that the properties and results originally established by Peskun for the e -value also apply to the CD*-support. More precisely, let us first consider the case of an observation x generated by the normal model \(\text {N}(\mu ,1)\) . Peskun shows that for the hypothesis \({{{\mathcal {H}}}_{0}}: \mu \in [\mu _1,\mu _2]\) , the e -value is equal to \(\min \{\Phi (x-\mu _1), \Phi (\mu _2-x)\}\) . Since, as shown in Example 1 , \(H_x(\mu )=1-\Phi (x-\mu )=\Phi (\mu -x)\) , it immediately follows that \(\min \{H_x(\mu _2),1-H_x(\mu _1)\}= \min \{\Phi (\mu _2-x), \Phi (x-\mu _1)\}\) , so that the e -value and the CD*-support coincide. For a more general case, we present the following result.

## Proposition 4

Let \(\textbf{X}\) be a random vector distributed according to the family of densities \(\{p_\theta , \theta \in \Theta \subseteq \mathbb {R}\}\) with a MLR in the real continuous statistic \(T=T(\textbf{X})\) , with distribution function \(F_\theta (t)\) . If \(F_\theta (t)\) is continuous in \(\theta \) with limits 0 and 1 for \(\theta \) tending to \(\sup (\Theta )\) and \(\inf (\Theta )\) , respectively, then the CD*-support and the e -value for the hypothesis \({{{\mathcal {H}}}_{0}}: \theta \in [\theta _1,\theta _2]\) , \(\theta _1 \le \theta _2\) , are equivalent.
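For the normal model \(\text{N}(\mu,1)\) discussed before the proposition, the equivalence can be checked numerically. The sketch below evaluates the general e-value expression recalled in the proof of Proposition 4, \(\min\{\max_\theta F_\theta(t), \max_\theta (1-F_\theta(t))\}\), on a grid over the interval, and compares it with the CD*-support computed from the CD \(H_x(\mu)=\Phi(\mu-x)\). The grid approximation and the function names are illustrative assumptions.

```python
from math import erf, sqrt


def Phi(z):
    """Standard normal distribution function."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))


def e_value(mu1, mu2, x, grid_size=1001):
    """min{max F_mu(x), max (1 - F_mu(x))} over a grid on [mu1, mu2],
    with F_mu(x) = Phi(x - mu) for the N(mu, 1) model."""
    grid = [mu1 + (mu2 - mu1) * i / (grid_size - 1) for i in range(grid_size)]
    max_lower = max(Phi(x - mu) for mu in grid)
    max_upper = max(1.0 - Phi(x - mu) for mu in grid)
    return min(max_lower, max_upper)


def cd_star_support(mu1, mu2, x):
    """CD*-support min{H_x(mu2), 1 - H_x(mu1)} with H_x(mu) = Phi(mu - x)."""
    return min(Phi(mu2 - x), 1.0 - Phi(mu1 - x))


for x in (-2.0, 0.3, 1.7):
    print(x, e_value(-0.5, 1.0, x), cd_star_support(-0.5, 1.0, x))
```

Because \(F_\mu(x)\) is monotone in \(\mu\), both maxima are attained at the endpoints of the interval, so the grid search is only there to mirror the general definition.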

We emphasize, however, that the advantage of the CD*-support over the e-value lies in the fact that knowledge of the entire CD allows us to embed the testing problem naturally in a more comprehensive and coherent inferential framework, in which the e-value is only one of the aspects to be taken into consideration.

Suppose now that a test of significance for \({{\mathcal {H}}}_0: \theta \in [\theta _1,\theta _2]\) , with \(\theta _1 \le \theta _2\) , is desired and that the CD for \(\theta \) is \(H_t(\theta )\) . Recall that the CD-support for \({{{\mathcal {H}}}_{0}}\) is \(H_t([\theta _1,\theta _2]) = \int _{\theta _1}^{\theta _2} dH_{t}(\theta ) = H_t(\theta _2)-H_t(\theta _1)\) , and that when \(\theta _1=\theta _2=\theta _0\) , or the interval \([\theta _1,\theta _2]\) is “small”, it becomes ineffective, and the CD*-support must be employed. The following proposition establishes some results about the CD- and the CD*-tests.

## Proposition 5

Given a statistical model parameterized by the real parameter \(\theta \) with MLR in the continuous statistic T , consider the hypothesis \({{{\mathcal {H}}}_{0}}: \theta \in [\theta _1,\theta _2]\) with \( \theta _1 \le \theta _2\) . Then,

i) both the CD- and the CD*-tests reject \({{{\mathcal {H}}}_{0}}\) for all values of T that are smaller or larger than suitable values;

ii) if a threshold \(\gamma\) is fixed for the CD-test, its size is not less than \(\gamma\);

iii) for a precise hypothesis, i.e., \(\theta_1=\theta_2\), the CD*-support, seen as a function of the random variable T, has the uniform distribution on (0, 0.5);

iv) if a threshold \(\gamma^*\) is fixed for the CD*-test, its size falls within the interval \([\gamma^*, \min(2\gamma^*,1)]\) and equals \(\min(2\gamma^*,1)\) when \(\theta_1=\theta_2\) (i.e., when \({{{\mathcal {H}}}_{0}}\) is a precise hypothesis);

v) the CD-support is never greater than the CD*-support, and if a common threshold is fixed for both tests, the size of the CD-test is not smaller than that of the CD*-test.
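The statements about a precise hypothesis (uniformity of the CD*-support on (0, 0.5) and size \(2\gamma^*\)) can be explored by simulation. The sketch below assumes a single observation \(T \sim \text{N}(\theta_0,1)\) with CD \(H_t(\theta)=\Phi(\theta-t)\) and rejects when the CD*-support falls below the threshold; the seed, sample size and function names are illustrative choices.

```python
import random
from math import erf, sqrt


def Phi(z):
    """Standard normal distribution function."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))


def cd_star_support(theta0, t):
    """CD*-support of {theta0}, assuming the CD H_t(theta) = Phi(theta - t)."""
    h = Phi(theta0 - t)
    return min(h, 1.0 - h)


def simulate_supports(theta0, n_draws, seed=0):
    """Draw T ~ N(theta0, 1) under the null and return the CD*-supports."""
    rng = random.Random(seed)
    return [cd_star_support(theta0, rng.gauss(theta0, 1.0))
            for _ in range(n_draws)]


def rejection_rate(supports, threshold):
    """Fraction of draws whose support falls below the threshold."""
    return sum(s < threshold for s in supports) / len(supports)


supports = simulate_supports(theta0=0.0, n_draws=100_000)
for gamma_star in (0.01, 0.05, 0.10):
    # Under the null, the rate should be close to 2 * gamma_star.
    print(gamma_star, rejection_rate(supports, gamma_star))
```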

Point i) highlights that the rejection regions generated by the CD- and CD*-tests are two-sided, resembling standard tests for hypotheses of this kind. However, even when \(\gamma = \gamma ^*\) , the rejection regions differ, with the CD-test being more conservative for \({{{\mathcal {H}}}_{0}}\) . This becomes crucial for small intervals, where the CD-test tends to reject the null hypothesis almost invariably.
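To make the small-interval remark concrete, the following sketch assumes the \(\text{N}(\theta,1)\) model with CD \(H_x(\theta)=\Phi(\theta-x)\) and the interval \([-0.05, 0.05]\). Under these assumptions the CD-support stays below 0.05 for every observation on the grid, so a CD-test with threshold \(\gamma=0.05\) would always reject, whereas the CD*-support exceeds 0.5 when x is at the centre of the interval. The interval, grid and threshold are illustrative.

```python
from math import erf, sqrt


def Phi(z):
    """Standard normal distribution function."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))


def cd_support(theta1, theta2, x):
    """CD-support H_x(theta2) - H_x(theta1) with H_x(theta) = Phi(theta - x)."""
    return Phi(theta2 - x) - Phi(theta1 - x)


def cd_star_support(theta1, theta2, x):
    """CD*-support min{H_x(theta2), 1 - H_x(theta1)}."""
    return min(Phi(theta2 - x), 1.0 - Phi(theta1 - x))


theta1, theta2 = -0.05, 0.05
xs = [-5.0 + 0.01 * i for i in range(1001)]  # observations from -5 to 5

max_cd = max(cd_support(theta1, theta2, x) for x in xs)
print("largest CD-support over the grid:", max_cd)
print("CD*-support at x = 0:", cd_star_support(theta1, theta2, 0.0))
```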

Under the assumption of Proposition 5 , the p -value corresponding to the commonly used equal tailed test for a precise hypothesis \({{{\mathcal {H}}}_{0}}:\theta =\theta _0\) is \(2\min \{F_{\theta _0}(t), 1-F_{\theta _0}(t)\}\) , so that it coincides with 2 times the CD*-support.
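The step behind this remark can be written out explicitly. The display below is a sketch that assumes, as in the proof of Proposition 5, that the CD is \(H_t(\theta)=1-F_\theta(t)\) and that the CD*-support of \(\{\theta_0\}\) is \(\min\{H_t(\theta_0), 1-H_t(\theta_0)\}\):

```latex
\begin{aligned}
\text{CD*}(\{\theta_0\})
  &= \min\{H_t(\theta_0),\; 1-H_t(\theta_0)\} \\
  &= \min\{1-F_{\theta_0}(t),\; F_{\theta_0}(t)\}, \\
p\text{-value}
  &= 2\min\{F_{\theta_0}(t),\; 1-F_{\theta_0}(t)\}
   = 2\,\text{CD*}(\{\theta_0\}).
\end{aligned}
```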

For interval hypotheses, a UMPU test essentially exists only for models within a NEF, and an interesting relationship can be established with the CD-test.

## Proposition 6

Given the CD based on the sufficient statistic of a continuous real NEF with natural parameter \(\theta \) , consider the hypothesis \({{\mathcal {H}}}_0: \theta \in [\theta _1,\theta _2]\) versus \({{\mathcal {H}}}_1: \theta \notin [\theta _1,\theta _2]\) , with \(\theta _1 < \theta _2\) . If the CD-test has size \(\alpha _{CD}\) , it is the UMPU test among all \(\alpha _{CD}\) -level tests.

For interval hypotheses, unlike one-sided hypotheses, when the statistic T is discrete, there is no clear reason to prefer either \(H_t^{\ell }\) or \(H_t^r\) . Neither test is more conservative, as their respective rejection regions are shifted by just one point in the support of T . Thus, \(H^g_t\) can be considered again a reasonable compromise, due to its greater proximity to the uniform distribution. Moreover, while the results stated for continuous statistics may not hold exactly for discrete statistics, they remain approximately valid for not too small sample sizes, thanks to the asymptotic normality of CDs, as stated in Proposition 1 .

## 6 Conclusions

In this article, we propose the use of confidence distributions to address a hypothesis testing problem concerning a real parameter of interest. Specifically, we introduce the CD- and CD*-supports, which are suitable for evaluating one-sided or large interval null hypotheses and precise or small interval null hypotheses, respectively. This approach does not necessarily require identifying type I and type II errors or fixing a significance level a priori. We do not propose an automatic procedure; instead, we suggest a careful and more general inferential analysis of the problem based on CDs. The CD- and CD*-supports are two simple, coherent measures of evidence for a hypothesis, with a clear meaning and interpretation. The p-value has none of these features: it is more complex and generally does not exist in closed form for interval hypotheses.

It is well known that the significance level \(\alpha\) of a test, which is crucial for making a decision, should be adjusted according to the sample size, but this is almost never done in practice. In our approach, the support provided by the CD to a hypothesis depends directly on the sample size through the dispersion of the CD. For example, if \({{{\mathcal {H}}}_{0}}: \theta \in [\theta_1,\theta_2]\), one can easily observe the effect of sample size on the CD-support of \({{{\mathcal {H}}}_{0}}\) by examining the interval \([\theta_1, \theta_2]\) on the CD-density plot. The CD-support can be non-negligible even when the length \(\Delta=\theta_2-\theta_1\) is small, provided the CD is sufficiently concentrated on the interval. The relationship between \(\Delta\) and the dispersion of the CD highlights again the importance of a thoughtful choice of the threshold used for decision-making and the unreasonableness of using standard values. Note that the CD- and CD*-tests are similar in many standard situations, as shown in the examples presented.
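The dependence on sample size can be sketched in a simple case. The code assumes an i.i.d. sample of size n from \(\text{N}(\theta,\sigma^2)\) with known \(\sigma\), so that the CD based on the sample mean \(\bar{x}\) is \(\text{N}(\bar{x}, \sigma^2/n)\); the interval, the value of \(\bar{x}\) and the function name are illustrative assumptions.

```python
from math import erf, sqrt


def Phi(z):
    """Standard normal distribution function."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))


def cd_support(theta1, theta2, xbar, n, sigma=1.0):
    """CD-support of [theta1, theta2] when the CD is N(xbar, sigma^2 / n)."""
    scale = sqrt(n) / sigma
    return Phi(scale * (theta2 - xbar)) - Phi(scale * (theta1 - xbar))


# Same short interval and same sample mean, increasing sample size:
# the CD concentrates around xbar and the support of the interval grows.
for n in (1, 100, 10_000):
    print(n, cd_support(-0.1, 0.1, xbar=0.02, n=n))
```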

Finally, we have investigated some theoretical aspects of the CD- and CD*-tests that are crucial in the standard approach. While for one-sided hypotheses an agreement with standard tests can be established, some distinctions must be made for two-sided hypotheses. If a threshold \(\gamma\) is fixed for a CD- or CD*-test, then its size is at least \(\gamma\), reaching \(2\gamma\) for a CD*-test of a precise hypothesis. This is because the CD*-support considers only the tail suggested by the data and does not follow the typical procedure of doubling the one-sided p-value, a procedure that can be criticized, as seen in Sect. 1. Of course, if one is convinced of the need to double the p-value, in our context it is sufficient to double the CD*-support. In the case of a precise hypothesis \({{{\mathcal {H}}}_{0}}: \theta = \theta_0\), this leads to a valid test because \(Pr_{\theta_0}\left( 2\min\{H_{\textbf{x}}(\theta_0),1-H_{\textbf{x}}(\theta_0)\}\le \alpha \right) \le \alpha\), as can be deduced by considering the relationship of the CD*-support with the e-value and the results in Peskun (2020, Sec. 2).

## References

Benjamin DJ, Berger JO, Johannesson M, Nosek BA, Wagenmakers E-J, Berk R, Bollen KA, Brembs B, Brown L, Camerer C et al (2018) Redefine statistical significance. Nat Hum Behav 2:6–10

Berger JO, Delampady M (1987) Testing precise hypotheses. Statist Sci 2:317–335

Berger JO, Sellke T (1987) Testing a point null hypothesis: the irreconcilability of p-values and evidence. J Amer Statist Assoc 82:112–122

Bickel DR (2022) Confidence distributions and empirical Bayes posterior distributions unified as distributions of evidential support. Comm Statist Theory Methods 51:3142–3163

Eftekharian A, Taheri SM (2015) On the GLR and UMP tests in the family with support dependent on the parameter. Stat Optim Inf Comput 3:221–228

Fisher RA (1930) Inverse probability. Proceedings of the Cambridge Philosophical Society 26:528–535

Fisher RA (1973) Statistical methods and scientific inference. Hafner Press, New York

Freedman LS (2008) An analysis of the controversy over classical one-sided tests. Clinical Trials 5:635–640

Gibbons JD, Pratt JW (1975) p-values: interpretation and methodology. Amer Statist 29:20–25

Goodman SN (1993) p-values, hypothesis tests, and likelihood: implications for epidemiology of a neglected historical debate. Am J Epidemiol 137:485–496

Greenland S, Senn SJ, Rothman KJ, Carlin JB, Poole C, Goodman SN, Altman DG (2016) Statistical tests, p-values, confidence intervals, and power: a guide to misinterpretations. Eur J Epidemiol 31:337–350

Hannig J (2009) On generalized fiducial inference. Statist Sinica 19:491–544

Hannig J, Iyer HK, Lai RCS, Lee TCM (2016) Generalized fiducial inference: a review and new results. J Amer Statist Assoc 44:476–483

Hubbard R, Bayarri MJ (2003) Confusion over measures of evidence (p’s) versus errors ( \(\alpha \) ’s) in Classical Statistical Testing. Amer Statist 57:171–178

Johnson VE, Rossell D (2010) On the use of non-local prior densities in Bayesian hypothesis tests. J R Stat Soc Ser B 72:143–170

Johnson VE, Payne RD, Wang T, Asher A, Mandal S (2017) On the reproducibility of psychological science. J Amer Statist Assoc 112:1–10

Lehmann EL, Romano JP (2005) Testing Statistical Hypotheses, 3rd edn. Springer, New York

Martin R, Liu C (2013) Inferential models: a framework for prior-free posterior probabilistic inference. J Amer Statist Assoc 108:301–313

Opdyke JD (2007) Comparing sharpe ratios: so where are the p -values? J Asset Manag 8:308–336

OSC (2015) Estimating the reproducibility of psychological science. Science 349:aac4716

Pereira CADB, Stern JM (1999) Evidence and credibility: full Bayesian significance test for precise hypotheses. Entropy 1:99–110

Pereira CADB, Stern JM, Wechsler S (2008) Can a significance test be genuinely Bayesian? Bayesian Anal 3:79–100

Peskun PH (2020) Two-tailed p-values and coherent measures of evidence. Amer Statist 74:80–86

Schervish MJ (1996) p values: What they are and what they are not. Amer Statist 50:203–206

Schweder T, Hjort NL (2002) Confidence and likelihood. Scand J Stat 29:309–332

Schweder T, Hjort NL (2016) Confidence, likelihood and probability. Cambridge University Press, London

Shao J (2003) Mathematical statistics. Springer-Verlag, New York

Singh K, Xie M, Strawderman M (2005) Combining information through confidence distributions. Ann Statist 33:159–183

Singh K, Xie M, Strawderman WE (2007) Confidence distribution (CD) – Distribution estimator of a parameter. In: Complex datasets and inverse problems: tomography, networks and beyond. Institute of Mathematical Statistics, pp 132–150

Veronese P, Melilli E (2015) Fiducial and confidence distributions for real exponential families. Scand J Stat 42:471–484

Veronese P, Melilli E (2018a) Fiducial, confidence and objective Bayesian posterior distributions for a multidimensional parameter. J Stat Plan Inference 195:153–173

Veronese P, Melilli E (2018b) Some asymptotic results for fiducial and confidence distributions. Statist Probab Lett 134:98–105

Wasserstein RL, Lazar NA (2016) The ASA statement on p-values: context, process, and purpose. Amer Statist 70:129–133

Xie M, Singh K (2013) Confidence distribution, the frequentist distribution estimator of a parameter: a review. Int Stat Rev 81:3–39

Yates F (1951) The influence of statistical methods for research workers on the development of the science of statistics. J Amer Statist Assoc 46:19–34

## Acknowledgements

Partial financial support was received from Bocconi University. The authors would like to thank the referees for their valuable comments, suggestions and references, which led to a significantly improved version of the manuscript.

Open access funding provided by Università Commerciale Luigi Bocconi within the CRUI-CARE Agreement.

## Author information

Authors and affiliations.

Bocconi University, Department of Decision Sciences, Milano, Italy

Eugenio Melilli & Piero Veronese

## Corresponding author

Correspondence to Eugenio Melilli .

## Additional information

Publisher's note.

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

## Appendix A. Table of confidence distributions

## Appendix B. Proof of propositions

## Proof of Proposition 1

The asymptotic normality and the consistency of the CD in i) and ii) follow from Veronese & Melilli (2015, Thm. 3) for models belonging to a NEF and from Veronese & Melilli (2018b, Thm. 1) for arbitrary continuous models. Part iii) of the proposition follows directly from Chebyshev's inequality. \(\diamond \)

## Proof of Proposition 2

Denote by \(F_{\theta }(t)\) the distribution function of T , assume that its support \({{\mathcal {T}}}=\{t_1,t_2,\ldots ,t_k\}\) is finite for simplicity and let \(p_j=p_j(\theta )=\text{ Pr}_\theta (T=t_j)\) , \(j=1,2,\ldots ,k\) for a fixed \(\theta \) . Consider the case \(H_t^r(\theta )=1-F_{\theta }(t)\) (if \(H_t^r(\theta )=F_{\theta }(t)\) the proof is similar) so that, for each \(j=2,\ldots ,k\) , \(H_{t_j}^\ell (\theta )=H_{t_{j-1}}^r(\theta )\) and \(H_{t_1}^\ell (\theta )=1\) . The supports of the random variables \(H^r_T(\theta )\) , \(H^\ell _T(\theta )\) and \(H^g_T(\theta )\) are, respectively,

where ( 6 ) holds because \(H^r_{t_j}(\theta )< H^g_{t_j}(\theta ) < H^{\ell }_{t_j}(\theta )\) . The probabilities corresponding to the points included in the three supports are of course the same, that is \(p_k,p_{k-1},\ldots ,p_1\) , in this order, so that \(G^\ell (u) \le u \le G^r(u)\) .

Let \(d(Q,R)=\int |Q(x)-R(x)|dx\) be the distance between the two arbitrary distribution functions Q and R . Denoting \(G^u\) as the uniform distribution function on (0, 1), we have

where the last inequality follows from ( 6 ). Thus, the distance from uniformity of \(H_T^g(\theta )\) is less than that of \(H_T^\ell (\theta )\) and of \(H_T^r(\theta )\) and ( 2 ) is proven. \(\diamond \)

## Proof of Proposition 4

Given the statistic T and the hypothesis \({{{\mathcal {H}}}_{0}}: \theta \in [\theta _1,\theta _2]\) , the e -value (see Peskun 2020 , equation 12) is \(\min \bigg \{\max _{\theta \in [\theta _1,\theta _2]} F_\theta (t), \max _{\theta \in [\theta _1,\theta _2]} (1-F_\theta (t))\bigg \}\) . Under the assumptions of the proposition, it follows that \(F_t(\theta )\) is monotonically nonincreasing in \(\theta \) for each t (see Section 2 ). As a result, the e -value simplifies to:

where the last expression coincides with the CD*-support of \({{{\mathcal {H}}}_{0}}\) . Note that the same result holds if the MLR is nondecreasing in T ensuring that \(F_t(\theta )\) is monotonically nondecreasing. \(\diamond \)

## Proof of Proposition 5

Point i). Consider first the CD-test and let \(g(t)=H_t([\theta _1,\theta _2])=H_t(\theta _2)-H_t(\theta _1)=F_{\theta _1}(t)-F_{\theta _2}(t)\) , which is a nonnegative, continuous function with \(\lim _{t\rightarrow \pm \infty }g(t)=0\) and with derivative \(g^\prime (t)=f_{\theta _1}(t)- f_{\theta _2}(t)\) . Let \(t_0 \in \mathbb {R}\) be a point such that g is nondecreasing for \(t<t_0\) and strictly decreasing for \(t \in (t_0,t_1)\) , for a suitable \(t_1>t_0\) ; the existence of \(t_0\) is guaranteed by the properties of g . It follows that \(g^\prime (t) \ge 0\) for \(t<t_0\) and \(g^\prime (t)<0\) in \((t_0,t_1)\) . We show that \(t_0\) is the unique point at which the function \(g^\prime \) changes sign. Indeed, if \(t_2\) were a point greater than \(t_1\) such that \(g^\prime (t)>0\) for t in a suitable interval \((t_2,t_3)\) , with \(t_3> t_2\) , we would have, in this interval, \(f_{\theta _1}(t)>f_{\theta _2}(t)\) . Since \(f_{\theta _1}(t)<f_{\theta _2}(t)\) for \(t \in (t_0,t_1)\) , this implies \(f_{\theta _2}(t)/f_{\theta _1}(t)>1\) for \(t \in (t_0,t_1)\) and \(f_{\theta _2}(t)/f_{\theta _1}(t)<1\) for \(t \in (t_2,t_3)\) , which contradicts the assumption of the (nondecreasing) MLR in T . Thus, g ( t ) is nondecreasing for \(t<t_0\) and nonincreasing for \(t>t_0\) , and the set \(\{t: H_t([\theta _1,\theta _2])< \gamma \}\) coincides with \( \{t: t<t^\prime \) or \(t>t^{\prime \prime }\}\) for suitable \(t^\prime \) and \(t^{\prime \prime }\) .

Consider now the CD*-test. The corresponding support is \(\min \{H_t(\theta _2), 1-H_t(\theta _1)\}= \min \{1-F_{\theta _2}(t), F_{\theta _1}(t)\}\) , which is a continuous function of t and approaches zero as \(t \rightarrow \pm \infty \) . Moreover, it equals \(F_{\theta _1}(t)\) for \(t\le t^*=\inf \{t: F_{\theta _1}(t)=1-F_{\theta _2}(t)\}\) and \(1-F_{\theta _2}(t)\) for \(t\ge t^*\) . Thus, the function is nondecreasing for \(t \le t^*\) and nonincreasing for \(t \ge t^*\) , and the result is proven.

Point ii). Suppose we have observed \(t^\prime = F_{\theta _1}^{-1}(\gamma )\) ; then the CD-support for \({{{\mathcal {H}}}_{0}}\) is

so that \(t^\prime \) belongs to the rejection region defined by the threshold \(\gamma \) . Due to the structure of this region specified in point i), all \(t\le t^{\prime }\) belong to it. Now,

because \(F_{\theta }(t) \le F_{\theta _1}(t)\) for each t and \(\theta \in [\theta _1,\theta _2]\) . It follows that the size of the CD-test with threshold \(\gamma \) is not smaller than \(\gamma \) .

Point iii). The result follows from the equality of the CD*-support with the e -value, as stated in Proposition 4 , and the uniformity of the e -value as proven in Peskun ( 2020 , Sec. 2).

Point iv). The size of the CD*-test with threshold \(\gamma ^*\) is the supremum on \([\theta _1,\theta _2]\) of the following probability

under the assumption that \(F_{\theta _1}^{-1}(\gamma ^*) <F_{\theta _2}^{-1}(1-\gamma ^*)\) , otherwise the probability is one. Because \(F_{\theta _2}(t) \le F_{\theta }(t) \le F_{\theta _1}(t)\) for each t and \(\theta \in [\theta _1,\theta _2]\) , it follows that \(F_{\theta }(F_{\theta _1}^{-1}(\gamma ^*)) \le F_{\theta _1}(F_{\theta _1}^{-1}(\gamma ^*))=\gamma ^*\) , and \(F_{\theta }(F_{\theta _2}^{-1}(1-\gamma ^*)) \ge F_{\theta _2}(F_{\theta _2}^{-1}(1-\gamma ^*)) = 1-\gamma ^*\) so that the size is

Finally, if \(\theta =\theta _2\) , from ( 7 ) we have

and thus the size of the CD*-test must be included in the interval \([\gamma ^*,2\gamma ^*]\) , provided that \(2\gamma ^*\) is less than 1. For the case \(\theta _1=\theta _2\) , it follows from ( 7 ) that the size of the CD*-test is \(2\gamma ^*\) .

Point v). Because \(H_t([\theta _1,\theta _2])=H_t(\theta _2)-H_t(\theta _1)\le H_t(\theta _2)\) and also \(H_t(\theta _2)-H_t(\theta _1) \le 1-H_t(\theta _1)\) , recalling Definition 4 , it immediately follows that the CD-support is not greater than the CD*-support. Thus, if the same threshold is fixed for the two tests, the rejection region of the CD-test includes that of the CD*-test, and the size of the first test is not smaller than that of the second one. \(\diamond \)

## Proof of Proposition 6

Recall from point i) of Proposition 5 , that the CD-test with threshold \(\gamma \) rejects \({{{\mathcal {H}}}_{0}}: \theta \in [\theta _1,\theta _2]\) for values of T less than \(t^\prime \) or greater than \(t^{\prime \prime }\) , with \(t^\prime \) and \(t^{\prime \prime }\) solutions of the equation \(F_{\theta _1}(t)-F_{\theta _2}(t)=\gamma \) . Denoting with \(\pi _{CD}\) its power function, we have

Thus the power function of the CD-test is equal in \(\theta _1\) and \(\theta _2\) and this condition characterizes the UMPU test for the exponential families, see Lehmann & Romano ( 2005 , p. 135). \(\diamond \)

## Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/ .

## About this article

Melilli, E., Veronese, P. Confidence distributions and hypothesis testing. Stat Papers (2024). https://doi.org/10.1007/s00362-024-01542-4

Received : 05 April 2023

Revised : 14 December 2023

Published : 29 March 2024

DOI : https://doi.org/10.1007/s00362-024-01542-4

## Keywords

- Confidence curve
- Precise and interval hypotheses
- Statistical measure of evidence
- Uniformly most powerful test
