
Understanding Hypothesis Tests: Significance Levels (Alpha) and P values in Statistics

Topics: Hypothesis Testing , Statistics

What do significance levels and P values mean in hypothesis tests? What is statistical significance anyway? In this post, I’ll continue to focus on concepts and graphs to help you gain a more intuitive understanding of how hypothesis tests work in statistics.

To bring it to life, I’ll add the significance level and P value to the graph in my previous post in order to perform a graphical version of the 1 sample t-test. It’s easier to understand when you can see what statistical significance truly means!

Here’s where we left off in my last post . We want to determine whether our sample mean (330.6) indicates that this year's average energy cost is significantly different from last year’s average energy cost of $260.

Descriptive statistics for the example

The probability distribution plot above shows the distribution of sample means we’d obtain under the assumption that the null hypothesis is true (population mean = 260) and we repeatedly drew a large number of random samples.

I left you with a question: where do we draw the line for statistical significance on the graph? Now we'll add in the significance level and the P value, which are the decision-making tools we'll need.

We'll use these tools to test the following hypotheses:

  • Null hypothesis: The population mean equals the hypothesized mean (260).
  • Alternative hypothesis: The population mean differs from the hypothesized mean (260).

What Is the Significance Level (Alpha)?

The significance level, also denoted as alpha or α, is the probability of rejecting the null hypothesis when it is true. For example, a significance level of 0.05 indicates a 5% risk of concluding that a difference exists when there is no actual difference.

These types of definitions can be hard to understand because of their technical nature. A picture makes the concepts much easier to comprehend!

The significance level determines how far out from the null hypothesis value we'll draw that line on the graph. To graph a significance level of 0.05, we need to shade the 5% of the distribution that is furthest away from the null hypothesis.

Probability plot that shows the critical regions for a significance level of 0.05

In the graph above, the two shaded areas are equidistant from the null hypothesis value and each area has a probability of 0.025, for a total of 0.05. In statistics, we call these shaded areas the critical region for a two-tailed test. If the population mean is 260, we’d expect to obtain a sample mean that falls in the critical region 5% of the time. The critical region defines how far away our sample statistic must be from the null hypothesis value before we can say it is unusual enough to reject the null hypothesis.

Our sample mean (330.6) falls within the critical region, which indicates it is statistically significant at the 0.05 level.

We can also see if it is statistically significant using the other common significance level of 0.01.

Probability plot that shows the critical regions for a significance level of 0.01

The two shaded areas each have a probability of 0.005, which adds up to a total probability of 0.01. This time our sample mean does not fall within the critical region and we fail to reject the null hypothesis. This comparison shows why you need to choose your significance level before you begin your study. It protects you from choosing a significance level because it conveniently gives you significant results!
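If you prefer to see the cut-offs as numbers rather than shaded areas, here is a minimal Python/SciPy sketch of the same idea. The standard error of 32.7 is a made-up illustrative value (it is not given in this excerpt), and a normal approximation stands in for the exact sampling distribution behind the plots.

```python
from scipy import stats

null_mean   = 260     # hypothesized population mean
sample_mean = 330.6   # observed sample mean
se          = 32.7    # assumed standard error of the mean (illustrative value only)

for alpha in (0.05, 0.01):
    # two-tailed critical region: the outermost `alpha` of the sampling distribution
    lower, upper = stats.norm.ppf([alpha / 2, 1 - alpha / 2], loc=null_mean, scale=se)
    significant = sample_mean < lower or sample_mean > upper
    print(f"alpha = {alpha}: critical region below {lower:.1f} or above {upper:.1f} "
          f"-> sample mean in critical region: {significant}")
```

With these assumed numbers, the sample mean of 330.6 lands inside the critical region at the 0.05 level but not at the 0.01 level, which is the same pattern the two graphs show.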

Thanks to the graph, we were able to determine that our results are statistically significant at the 0.05 level without using a P value. However, when you use the numeric output produced by statistical software , you’ll need to compare the P value to your significance level to make this determination.


What Are P values?

P-values are the probability of obtaining an effect at least as extreme as the one in your sample data, assuming the truth of the null hypothesis.

This definition of P values, while technically correct, is a bit convoluted. It’s easier to understand with a graph!

To graph the P value for our example data set, we need to determine the distance between the sample mean and the null hypothesis value (330.6 - 260 = 70.6). Next, we can graph the probability of obtaining a sample mean that is at least as extreme in both tails of the distribution (260 +/- 70.6).

Probability plot that shows the p-value for our sample mean

In the graph above, the two shaded areas each have a probability of 0.01556, for a total probability of 0.03112. This probability represents the likelihood of obtaining a sample mean that is at least as extreme as our sample mean in both tails of the distribution if the population mean is 260. That’s our P value!

When a P value is less than or equal to the significance level, you reject the null hypothesis. If we take the P value for our example and compare it to the common significance levels, it matches the previous graphical results. The P value of 0.03112 is statistically significant at an alpha level of 0.05, but not at the 0.01 level.
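Here is the same comparison expressed numerically, again using the made-up standard error of 32.7 and a normal approximation; the post's exact figure of 0.03112 comes from the distribution behind the graphs, so the sketch below gives a similar but not identical number.

```python
from scipy import stats

null_mean, sample_mean, se = 260, 330.6, 32.7    # se is an assumed illustrative value

diff = abs(sample_mean - null_mean)              # 70.6, the distance used above
p_value = 2 * stats.norm.sf(null_mean + diff, loc=null_mean, scale=se)

for alpha in (0.05, 0.01):
    decision = "reject the null hypothesis" if p_value <= alpha else "fail to reject the null hypothesis"
    print(f"p = {p_value:.3f} vs alpha = {alpha}: {decision}")
```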

If we stick to a significance level of 0.05, we can conclude that the average energy cost for the population differs from $260, and, because the sample mean is higher, the data point to it being greater.

A common mistake is to interpret the P-value as the probability that the null hypothesis is true. To understand why this interpretation is incorrect, please read my blog post  How to Correctly Interpret P Values .

Discussion about Statistically Significant Results

A hypothesis test evaluates two mutually exclusive statements about a population to determine which statement is best supported by the sample data. A test result is statistically significant when the sample statistic is unusual enough relative to the null hypothesis that we can reject the null hypothesis for the entire population. “Unusual enough” in a hypothesis test is defined by:

  • The assumption that the null hypothesis is true—the graphs are centered on the null hypothesis value.
  • The significance level—how far out do we draw the line for the critical region?
  • Our sample statistic—does it fall in the critical region?

Keep in mind that there is no magic significance level that distinguishes between the studies that have a true effect and those that don’t with 100% accuracy. The common alpha values of 0.05 and 0.01 are simply based on tradition. For a significance level of 0.05, expect to obtain sample means in the critical region 5% of the time when the null hypothesis is true . In these cases, you won’t know that the null hypothesis is true but you’ll reject it because the sample mean falls in the critical region. That’s why the significance level is also referred to as an error rate!

This type of error doesn’t imply that the experimenter did anything wrong or require any other unusual explanation. The graphs show that when the null hypothesis is true, it is possible to obtain these unusual sample means for no reason other than random sampling error. It’s just luck of the draw.

Significance levels and P values are important tools that help you quantify and control this type of error in a hypothesis test. Using these tools to decide when to reject the null hypothesis increases your chance of making the correct decision.

If you like this post, you might want to read the other posts in this series that use the same graphical framework:

  • Previous: Why We Need to Use Hypothesis Tests
  • Next: Confidence Intervals and Confidence Levels

If you'd like to see how I made these graphs, please read: How to Create a Graphical Version of the 1-sample t-Test .


Hypothesis Testing | A Step-by-Step Guide with Easy Examples

Published on November 8, 2019 by Rebecca Bevans . Revised on June 22, 2023.

Hypothesis testing is a formal procedure for investigating our ideas about the world using statistics . It is most often used by scientists to test specific predictions, called hypotheses, that arise from theories.

There are 5 main steps in hypothesis testing:

  • State your research hypothesis as a null hypothesis (H0) and an alternate hypothesis (Ha or H1).
  • Collect data in a way designed to test the hypothesis.
  • Perform an appropriate statistical test .
  • Decide whether to reject or fail to reject your null hypothesis.
  • Present the findings in your results and discussion section.

Though the specific details might vary, the procedure you will use when testing a hypothesis will always follow some version of these steps.

Table of contents

  • Step 1: State your null and alternate hypothesis
  • Step 2: Collect data
  • Step 3: Perform a statistical test
  • Step 4: Decide whether to reject or fail to reject your null hypothesis
  • Step 5: Present your findings
  • Other interesting articles
  • Frequently asked questions about hypothesis testing

After developing your initial research hypothesis (the prediction that you want to investigate), it is important to restate it as a null (H0) and alternate (Ha) hypothesis so that you can test it mathematically.

The alternate hypothesis is usually your initial hypothesis that predicts a relationship between variables. The null hypothesis is a prediction of no relationship between the variables you are interested in.

  • H0: Men are, on average, not taller than women.
  • Ha: Men are, on average, taller than women.


For a statistical test to be valid , it is important to perform sampling and collect data in a way that is designed to test your hypothesis. If your data are not representative, then you cannot make statistical inferences about the population you are interested in.

There are a variety of statistical tests available, but they are all based on the comparison of within-group variance (how spread out the data is within a category) versus between-group variance (how different the categories are from one another).

If the between-group variance is large enough that there is little or no overlap between groups, then your statistical test will reflect that by showing a low p -value . This means it is unlikely that the differences between these groups came about by chance.

Alternatively, if there is high within-group variance and low between-group variance, then your statistical test will reflect that with a high p -value. This means it is likely that any difference you measure between groups is due to chance.

Your choice of statistical test will be based on the type of variables and the level of measurement of your collected data .

In the height example, the statistical test (a t-test comparing men's and women's heights) will give you:

  • an estimate of the difference in average height between the two groups.
  • a p -value showing how likely you are to see this difference if the null hypothesis of no difference is true.

Based on the outcome of your statistical test, you will have to decide whether to reject or fail to reject your null hypothesis.

In most cases you will use the p -value generated by your statistical test to guide your decision. And in most cases, your predetermined level of significance for rejecting the null hypothesis will be 0.05 – that is, when there is a less than 5% chance that you would see these results if the null hypothesis were true.
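As a rough illustration of steps 3 and 4 for the height example, here is a minimal Python sketch. The height values are simulated purely for illustration, and SciPy's two-sample t-test stands in for whatever statistical software you actually use.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# simulated height data (in cm), for illustration only
men   = rng.normal(loc=178, scale=7, size=50)
women = rng.normal(loc=165, scale=6, size=50)

estimate = men.mean() - women.mean()            # estimated difference in average height
t_stat, p_value = stats.ttest_ind(men, women)   # two-sided two-sample t-test

alpha = 0.05
print(f"difference = {estimate:.1f} cm, t = {t_stat:.2f}, p = {p_value:.3g}")
print("reject the null hypothesis" if p_value <= alpha else "fail to reject the null hypothesis")
```

If you specifically want to test the one-sided alternative that men are taller, recent versions of SciPy accept an alternative="greater" argument to ttest_ind.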

In some cases, researchers choose a more conservative level of significance, such as 0.01 (1%). This minimizes the risk of incorrectly rejecting the null hypothesis ( Type I error ).


The results of hypothesis testing will be presented in the results and discussion sections of your research paper , dissertation or thesis .

In the results section you should give a brief summary of the data and a summary of the results of your statistical test (for example, the estimated difference between group means and associated p -value). In the discussion , you can discuss whether your initial hypothesis was supported by your results or not.

In the formal language of hypothesis testing, we talk about rejecting or failing to reject the null hypothesis. You will probably be asked to do this in your statistics assignments.

However, when presenting research results in academic papers we rarely talk this way. Instead, we go back to our alternate hypothesis (in this case, the hypothesis that men are on average taller than women) and state whether the result of our test did or did not support the alternate hypothesis.

If your null hypothesis was rejected, this result is interpreted as “supported the alternate hypothesis.”

These are superficial differences; you can see that they mean the same thing.

You might notice that we don’t say that we reject or fail to reject the alternate hypothesis . This is because hypothesis testing is not designed to prove or disprove anything. It is only designed to test whether a pattern we measure could have arisen spuriously, or by chance.

If we reject the null hypothesis based on our research (i.e., we find that it is unlikely that the pattern arose by chance), then we can say our test lends support to our hypothesis . But if the pattern does not pass our decision rule, meaning that it could have arisen by chance, then we say the test is inconsistent with our hypothesis .

If you want to know more about statistics , methodology , or research bias , make sure to check out some of our other articles with explanations and examples.

  • Normal distribution
  • Descriptive statistics
  • Measures of central tendency
  • Correlation coefficient

Methodology

  • Cluster sampling
  • Stratified sampling
  • Types of interviews
  • Cohort study
  • Thematic analysis

Research bias

  • Implicit bias
  • Cognitive bias
  • Survivorship bias
  • Availability heuristic
  • Nonresponse bias
  • Regression to the mean

Hypothesis testing is a formal procedure for investigating our ideas about the world using statistics. It is used by scientists to test specific predictions, called hypotheses , by calculating how likely it is that a pattern or relationship between variables could have arisen by chance.

A hypothesis states your predictions about what your research will find. It is a tentative answer to your research question that has not yet been tested. For some research projects, you might have to write several hypotheses that address different aspects of your research question.

A hypothesis is not just a guess — it should be based on existing theories and knowledge. It also has to be testable, which means you can support or refute it through scientific research methods (such as experiments, observations and statistical analysis of data).

Null and alternative hypotheses are used in statistical hypothesis testing . The null hypothesis of a test always predicts no effect or no relationship between variables, while the alternative hypothesis states your research prediction of an effect or relationship.


Hypothesis Testing (cont...)

The Null and Alternative Hypothesis

In order to undertake hypothesis testing you need to express your research hypothesis as a null and alternative hypothesis. The null hypothesis and alternative hypothesis are statements regarding the differences or effects that occur in the population. You will use your sample to test which statement (i.e., the null hypothesis or alternative hypothesis) is most likely (although technically, you test the evidence against the null hypothesis). So, with respect to our teaching example, the null and alternative hypothesis will reflect statements about all statistics students on graduate management courses.

The null hypothesis is essentially the "devil's advocate" position. That is, it assumes that whatever you are trying to prove did not happen (hint: it usually states that something equals zero). For example, the two different teaching methods did not result in different exam performances (i.e., zero difference). Another example might be that there is no relationship between anxiety and athletic performance (i.e., the slope is zero). The alternative hypothesis states the opposite and is usually the hypothesis you are trying to prove (e.g., the two different teaching methods did result in different exam performances). Initially, you can state these hypotheses in more general terms (e.g., using terms like "effect", "relationship", etc.), as shown below for the teaching methods example:

  • Null hypothesis (H0): Attending seminar classes in addition to lectures has no effect on students' exam performance.
  • Alternative hypothesis (Ha): Attending seminar classes in addition to lectures has a positive effect on students' exam performance.

How you want to "summarize" the exam performances determines how you write a more specific null and alternative hypothesis. For example, you could compare the mean exam performance of each group (i.e., the "seminar" group and the "lectures-only" group). This is what we will demonstrate here, but other options include comparing the distributions, medians, amongst other things. As such, we can state:

  • Null hypothesis (H0): The mean exam mark for the "seminar" group is equal to the mean exam mark for the "lectures-only" group.
  • Alternative hypothesis (Ha): The mean exam mark for the "seminar" group is not equal to the mean exam mark for the "lectures-only" group.

Now that you have identified the null and alternative hypotheses, you need to find evidence and develop a strategy for declaring your "support" for either the null or alternative hypothesis. We can do this using some statistical theory and some arbitrary cut-off points. Both these issues are dealt with next.

Significance levels

The level of statistical significance is often expressed as the so-called p -value . Depending on the statistical test you have chosen, you will calculate a probability (i.e., the p -value) of observing your sample results (or more extreme) given that the null hypothesis is true . Another way of phrasing this is to consider the probability that a difference in a mean score (or other statistic) could have arisen based on the assumption that there really is no difference. Let us consider this statement with respect to our example where we are interested in the difference in mean exam performance between two different teaching methods. If there really is no difference between the two teaching methods in the population (i.e., given that the null hypothesis is true), how likely would it be to see a difference in the mean exam performance between the two teaching methods as large as (or larger than) that which has been observed in your sample?

So, you might get a p-value such as 0.03 (i.e., p = .03). This means that there is a 3% chance of finding a difference as large as (or larger than) the one in your study given that the null hypothesis is true. However, you want to know whether this is "statistically significant". Typically, if there was a 5% or less chance (5 times in 100 or less) that the difference in the mean exam performance between the two teaching methods (or whatever statistic you are using) is as different as observed given the null hypothesis is true, you would reject the null hypothesis and accept the alternative hypothesis. Alternately, if the chance was greater than 5% (5 times in 100 or more), you would fail to reject the null hypothesis and would not accept the alternative hypothesis. As such, in this example where p = .03, we would reject the null hypothesis and accept the alternative hypothesis. We reject it because a result like ours would occur by chance only 3% of the time if the null hypothesis were true (i.e., less than the 5% cut-off), which is rare enough for us to be reasonably confident that it was the two teaching methods that had an effect on exam performance.

Whilst there is relatively little justification why a significance level of 0.05 is used rather than 0.01 or 0.10, for example, it is widely used in academic research. However, if you want to be particularly confident in your results, you can set a more stringent level of 0.01 (a 1% chance or less; 1 in 100 chance or less).


One- and two-tailed predictions

When considering whether we reject the null hypothesis and accept the alternative hypothesis, we need to consider the direction of the alternative hypothesis statement. For example, the alternative hypothesis that was stated earlier is:

  • Alternative hypothesis (Ha): Attending seminar classes in addition to lectures has a positive effect on students' exam performance.

The alternative hypothesis tells us two things. First, what predictions did we make about the effect of the independent variable(s) on the dependent variable(s)? Second, what was the predicted direction of this effect? Let's use our example to highlight these two points.

Sarah predicted that her teaching method (independent variable: teaching method), whereby she not only required her students to attend lectures, but also seminars, would have a positive effect on (that is, increase) students' performance (dependent variable: exam marks). If an alternative hypothesis has a direction (and this is how you want to test it), the hypothesis is one-tailed. That is, it predicts the direction of the effect. If the alternative hypothesis had stated that the effect was expected to be negative, this is also a one-tailed hypothesis.

Alternatively, a two-tailed prediction means that we do not make a choice over the direction that the effect of the experiment takes. Rather, it simply implies that the effect could be negative or positive. If Sarah had made a two-tailed prediction, the alternative hypothesis might have been:

  • Alternative hypothesis (Ha): Attending seminar classes in addition to lectures has an effect on students' exam performance.

In other words, we simply take out the word "positive", which implies the direction of our effect. In our example, making a two-tailed prediction may seem strange. After all, it would be logical to expect that "extra" tuition (going to seminar classes as well as lectures) would either have a positive effect on students' performance or no effect at all, but certainly not a negative effect. However, this is just our opinion (and hope) and certainly does not mean that we will get the effect we expect. Generally speaking, making a one-tailed prediction (and testing for it this way) is frowned upon as it usually reflects the hope of a researcher rather than any certainty that it will happen. Notable exceptions to this rule are when there is only one possible way in which a change could occur. This can happen, for example, when biological activity/presence is measured. That is, a protein might be "dormant" and the stimulus you are using can only possibly "wake it up" (i.e., it cannot possibly reduce the activity of a "dormant" protein). In addition, for some statistical tests, one-tailed tests are not possible.
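To make the one- versus two-tailed distinction concrete, the small sketch below shows how the p-value changes for the same test statistic depending on which prediction was made; the test statistic and degrees of freedom are made-up illustrative values.

```python
from scipy import stats

t_stat, df = 2.10, 28   # hypothetical test statistic and degrees of freedom

p_two_tailed = 2 * stats.t.sf(abs(t_stat), df)   # an effect in either direction counts
p_one_tailed = stats.t.sf(t_stat, df)            # only the predicted (positive) direction counts

print(f"two-tailed p = {p_two_tailed:.3f}")      # roughly 0.045
print(f"one-tailed p = {p_one_tailed:.3f}")      # roughly 0.022
```

The one-tailed p-value is half the two-tailed one here, which is exactly why a directional test is easier to "pass", and why it is viewed with suspicion unless the predicted direction is genuinely the only possibility.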

Rejecting or failing to reject the null hypothesis

Let's return finally to the question of whether we reject or fail to reject the null hypothesis.

If our statistical analysis shows that the significance level is below the cut-off value we have set (e.g., either 0.05 or 0.01), we reject the null hypothesis and accept the alternative hypothesis. Alternatively, if the significance level is above the cut-off value, we fail to reject the null hypothesis and cannot accept the alternative hypothesis. You should note that you cannot accept the null hypothesis, but only find evidence against it.


International Encyclopedia of Statistical Science, pp. 1318–1321

Significance Testing: An Overview

Elena Kulinskaya, Stephan Morgenthaler & Robert G. Staudte

Introduction

A significance test is a statistical procedure for testing a hypothesis based on experimental or observational data. Let, for example, \(\bar{X}_1\) and \(\bar{X}_2\) be the average scores obtained in two groups of randomly selected subjects and let \(\mu_1\) and \(\mu_2\) denote the corresponding population averages. The observed averages can be used to test the null hypothesis \(\mu_1 = \mu_2\), which expresses the idea that both populations have equal average scores. A significant result occurs if \(\bar{X}_1\) and \(\bar{X}_2\) are very different from each other, because this contradicts or falsifies the null hypothesis. If the two group averages are similar to each other, the null hypothesis is not contradicted by the data. What exact values of the difference \(\bar{X}_1 - \bar{X}_2\) of the group averages are judged as significant depends on various elements. The variation of the scores between the subjects, for example, must be taken into account. This...
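As a rough illustration of the two-group comparison described above, here is a minimal Python sketch using Welch's two-sample t-test, which compares the group averages while allowing the between-subject variation to differ between groups; all numbers are simulated for illustration only.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# simulated scores for two groups of randomly selected subjects
group1 = rng.normal(loc=52, scale=10, size=40)
group2 = rng.normal(loc=48, scale=12, size=35)

# test of the null hypothesis mu_1 = mu_2 based on the observed averages
t_stat, p_value = stats.ttest_ind(group1, group2, equal_var=False)
print(f"difference in averages = {group1.mean() - group2.mean():.2f}, "
      f"t = {t_stat:.2f}, p = {p_value:.3f}")
```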


Level of Significance & Hypothesis Testing


In hypothesis testing, the level of significance is a measure of how confident you can be about rejecting the null hypothesis. This blog post will explore what hypothesis testing is and why understanding significance levels is important for your data science projects. In addition, you will get to test your knowledge of the level of significance towards the end of the blog with the help of a quiz. These questions can help you check your understanding and prepare for data science / statistics interviews. Before we look into what the level of significance is, let’s quickly understand what hypothesis testing is.

Table of Contents

What is Hypothesis testing and how is it related to significance level?

Hypothesis testing can be defined as tests performed to evaluate whether a claim or theory about something is true or otherwise. In order to perform hypothesis tests, the following steps need to be taken:

  • Hypothesis formulation: Formulate the null and alternate hypothesis
  • Data collection: Gather the sample of data
  • Statistical tests: Determine the statistical test and test statistic. The test can be a z-test or a t-test, depending on the sample size and whether the population variance is known.
  • Set the level of significance
  • Calculate the p-value
  • Draw conclusions: Based on the p-value and the significance level, reject the null hypothesis or fail to reject it.

A detailed explanation is provided in one of my related posts titled hypothesis testing explained with examples .
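As a rough sketch of how those steps can be strung together in code, the helper below runs a one-sample test of a mean, using a z-test when the population standard deviation is known and a t-test otherwise. The function, data, and cut-off are hypothetical illustrations, not something taken from the linked post.

```python
import numpy as np
from scipy import stats

def one_sample_test(sample, mu0, sigma=None, alpha=0.05):
    """Test a hypothesized population mean mu0 against one sample of data."""
    sample = np.asarray(sample, dtype=float)
    n, xbar = sample.size, sample.mean()
    if sigma is not None:
        # population standard deviation known -> z-test
        z = (xbar - mu0) / (sigma / np.sqrt(n))
        stat_name, stat, p = "z", z, 2 * stats.norm.sf(abs(z))
    else:
        # population standard deviation unknown -> one-sample t-test
        stat, p = stats.ttest_1samp(sample, mu0)
        stat_name = "t"
    decision = "reject H0" if p <= alpha else "fail to reject H0"
    return stat_name, round(stat, 3), round(p, 3), decision

# hypothetical measurements tested against a hypothesized mean of 5.0
print(one_sample_test([4.8, 5.1, 5.4, 4.9, 5.2, 5.0], mu0=5.0))
```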

What is the level of significance?

The level of significance is defined as the criterion or threshold value on the basis of which one rejects or fails to reject the null hypothesis. The level of significance determines whether the outcome of hypothesis testing is statistically significant or not. The significance level is also called the alpha level.

Another way of looking at the level of significance is as the probability of making a Type I error. A Type I error occurs when you reject the null hypothesis by mistake; this scenario is also termed a "false positive". Take the example of a person accused of committing a crime. The null hypothesis is that the person is not guilty. A Type I error happens when you reject that null hypothesis by mistake: the innocent person is convicted.

The level of significance can take values such as 0.1, 0.05, or 0.01, with 0.05 being the most common. The lower the significance level, the lower the chance of a Type I error; the evidence from the experiment or hypothesis test would need to be that much stronger before you reject the null hypothesis, so the likelihood of a Type I error would be very low. However, this increases the chance of a Type II error, where you mistakenly fail to reject the null hypothesis. You may want to read more details in relation to Type I errors and Type II errors in this post – Type I errors and Type II errors in hypothesis testing

The outcome of the hypothesis testing is evaluated with the help of a p-value. If the p-value is less than the level of significance, then the hypothesis testing outcome is statistically significant. On the other hand, if the p-value is more than the level of significance, the outcome is not statistically significant and we fail to reject the null hypothesis. The same is represented in the picture below for a right-tailed test. I will be posting details on different types of tailed tests in future posts.

Figure: level of significance for a right-tailed test

The picture below represents the concept for a two-tailed hypothesis test:

Figure: level of significance for a two-tailed test

For example: Let’s say that a school principal wants to find out whether 2 hours of extra coaching after school helps students do better in their exams. The hypotheses would be as follows:

  • Null hypothesis: There is no difference in the performance of students after they are given 2 hours of extra coaching after school.
  • Alternate hypothesis: Students perform better when they get 2 hours of extra coaching after school. For this example we set the level of significance at 0.05; in other words, we require fairly strong evidence before concluding that there is actually a difference in students' performance based on whether they take the extra coaching.

Now, let’s say that we conduct this experiment with 100 students and measure their scores in the exams. The test statistic is computed to be z = -0.50 (p-value = 0.62). Since the p-value is more than 0.05, we fail to reject the null hypothesis. There is not enough evidence to show that there’s a difference in the performance of students based on whether they get extra coaching.
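For reference, the p-value of 0.62 quoted above is what a two-sided calculation on z = -0.50 gives; here is the arithmetic as a one-liner with SciPy.

```python
from scipy import stats

z = -0.50
p_two_sided = 2 * stats.norm.sf(abs(z))   # about 0.617, i.e. the 0.62 reported above
print(round(p_two_sided, 2))              # 0.62 > 0.05 -> fail to reject the null hypothesis
```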

While performing hypothesis tests or experiments, it is important to keep the level of significance in mind.

Why does one need a level of significance?

In hypothesis tests, if we did not have some threshold for deciding whether results are statistically significant enough to reject the null hypothesis, it would be hard to say whether our findings are meaningful. This is why we set a level of significance before performing hypothesis tests and experiments.

Since hypothesis testing helps us make decisions about our data, setting a level of significance up front tells us how often we are willing to attribute our findings to random sampling error when the null hypothesis is actually true. If you set your level of significance at 0.05, for example, you are accepting a 5% risk of declaring a difference between groups (assuming two groups are tested) when the observed difference is really just random sampling error. So if we found a difference in the performance of students based on whether they take extra coaching, we would still need to consider other factors that could have contributed to the difference.

This is why hypothesis testing and the level of significance go hand in hand: the hypothesis test tells us whether our data fall far enough from the null hypothesis value to be considered statistically significant, while the level of significance sets the threshold for how unlikely a result must be, under the null hypothesis, before we stop attributing it to random sampling error.

How is the level of significance used in hypothesis testing?

The level of significance, the test statistic, and the p-value together form a key part of hypothesis testing. The conclusion you draw depends on whether you reject or fail to reject the null hypothesis given your findings at each step. Before going into rejection vs non-rejection, let’s understand the terms better.

If the test statistic falls within the critical region, you reject the null hypothesis. This means that your findings are statistically significant and support the alternate hypothesis. The p-value tells you how likely such an outcome would be if, in fact, the null hypothesis were true. If the p-value is less than or equal to the level of significance, you reject the null hypothesis, meaning that your hypothesis testing outcome is statistically significant at that level and in favor of the alternate hypothesis.

If on the other hand, the p-value is greater than alpha level or significance level, then you fail to reject the null hypothesis. These findings are not statistically significant enough for one to reject the null hypothesis. The same is represented in the diagram below:

Figure: comparing the p-value with the level of significance

Level of Significance – Quiz / Interview Questions

Here are some practice questions that can help you test your understanding and prepare for interviews.

  • A p-value less than the level of significance would mean which of the following?
  • The level of significance is also called the ________.
  • A p-value of 0.03 is statistically significant for a significance level of 0.01: true or false?
  • Which of the following looks to be an inappropriate level of significance?
  • A statistically significant outcome of hypothesis testing would mean which of the following?
  • Which one of the following is considered the most popular choice of significance level?
  • Which of the following will result in a greater Type I error?
  • Which of the following will result in a greater Type II error?


Hypothesis testing is an important statistical concept that helps us determine whether the claim made about anything is true or otherwise. The hypothesis test statistic, level of significance, and p-value all work together to help you make decisions about your data. If our hypothesis tests show enough evidence to reject the null hypothesis, then we know statistically significant findings are at hand. This post gave you ideas for how you can use hypothesis testing in your experiments by understanding what it means when someone rejects or fails to reject the null hypothesis.


Significance Level and Power of a Hypothesis Test

  • Selecting an Appropriate Significance Level
  • Cautions about Significance Level
  • Power of a Hypothesis Test

1. Significance Level

Before we begin, you should first understand what is meant by statistical significance. When you calculate a test statistic in a hypothesis test, you can calculate the p-value. The p-value is the probability that you would have obtained a statistic as large (or small, or extreme) as the one you got if the null hypothesis is true. It's a conditional probability.

Sometimes you’re willing to attribute whatever difference you found between your statistic and the hypothesized parameter to chance. If that’s the case, you fail to reject the null hypothesis.

If you’re not, meaning it's just too far away from the mean to attribute to chance, then you’re going to reject the null hypothesis in favor of the alternative.

This is what it might look like for a two-tailed test.

Two-Tailed Test

The hypothesized mean is right in the center of the normal distribution. If the statistic falls too far away, something like two standard deviations or more from that mean, you would reject the null hypothesis. Anything within those two standard deviations you might attribute to chance, so you would fail to reject the null hypothesis. Again, this is all assuming that the null hypothesis is true.

However, think about this. The entire curve assumes that the null hypothesis is true, yet you decide to reject the null hypothesis anyway if the statistic you got is far away, because such a statistic would rarely happen by chance. But if the null hypothesis really is true, rejecting it is technically the wrong decision. The amount of this error we are comfortable making is called the significance level.

The probability of rejecting the null hypothesis in error, in other words, rejecting the null hypothesis when it is, in fact, true, is called a Type I Error.


Fortunately, you get to choose how big you want this error to be. You could state that anything three standard deviations or more from the mean, on either side, is "too far away". Or, for instance, you could say you only want to be wrong 1% of the time, or 5% of the time, meaning that you reject the null hypothesis in error that often.


When you choose how big you want alpha to be, you do it before you start the tests. You do it this way to reduce bias because if you already ran the tests, you could choose an alpha level that would automatically make your result seem more significant than it is. You don't want to bias your results that way.

Figure: two-tailed rejection regions with alpha = 0.05

The alpha, in this case, is 0.05. If you recall, the 68-95-99.7 rule says that 95% of the values will fall within two standard deviations of the mean, meaning that 5% of the values will fall outside of those two standard deviations. You will reject the null hypothesis 5% of the time when it is in fact true: the most extreme 5% of cases are the ones you are not willing to attribute to chance variation from the hypothesized mean.

The level of significance will also depend on the type of experiment that you're doing.

If you want to be really cautious and not reject the null hypothesis in error very much, you'll choose a low significance level, like 0.01. This means that only the most extreme 1% of cases will have the null hypothesis rejected.

If you don't believe a Type I Error is going to be that bad, you might allow the significance level to be something higher, like 0.05 or 0.10. Those still seem like low numbers. However, think about what that means. This means that one out of every 20, or one out of every ten samples of that particular size will have the null hypothesis rejected even when it's true. Are you willing to make that mistake one out of every 20 times or once every ten times? Or are you only willing to make that mistake one out of every 100 times? Setting this value to something really low reduces the probability that you make that error.

It is important to note that you don't want the significance level to be too low either. The problem with setting it really low is that as you lower the probability of a Type I error, you actually increase the probability of a Type II error.

A Type II Error is failing to reject the null hypothesis when a difference does exist. This reduces the power or the sensitivity of your significance test, meaning that you will not be able to detect very real differences from the null hypothesis when they actually exist if your alpha level is set too low.

2. Power of a Hypothesis Test

You might wonder, what is power? Power is the ability of a hypothesis test to detect a difference that is present.

Consider the curves below. Note that μ0 is the hypothesized mean and μA is the actual mean. The actual mean is different from the null hypothesis value; therefore, you should reject the null hypothesis. What you end up with is a second curve, identical in shape to the original normal curve but centered on the actual mean.

Figure: sampling distributions centered on the hypothesized mean (μ0) and the actual mean (μA)

If you take a look at the curve below, it illustrates the way the data are actually behaving, versus the way you thought they should behave based on the null hypothesis. The line in the sand (the rejection cut-off) still exists, and because we should be rejecting the null hypothesis, the area on the wrong side of that line, shown in orange, is a mistake.

Failing to reject the null hypothesis is wrong, if this is actually the mean, which is different from the null hypothesis' mean. This is a type II error.

Figure: Type II error region (failing to reject the null hypothesis when the actual mean differs)

Now, the area in yellow on the other side, where you are correctly rejecting the null hypothesis when a difference is present, is called power of a hypothesis test . Power is the probability of rejecting the null hypothesis correctly, rejecting when the null hypothesis is false, which is a correct decision.
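If you want to attach numbers to these pictures, here is a minimal sketch that computes the power of a two-sided z-test for a mean. All of the inputs (hypothesized mean, actual mean, standard deviation, sample size, alpha) are made up for illustration.

```python
import numpy as np
from scipy import stats

mu_0, mu_a = 100, 106     # hypothesized mean and actual mean (illustrative values)
sigma, n   = 15, 50       # population standard deviation and sample size
alpha      = 0.05

se = sigma / np.sqrt(n)                                     # standard error of the sample mean
lower = stats.norm.ppf(alpha / 2, loc=mu_0, scale=se)       # two-sided rejection region
upper = stats.norm.ppf(1 - alpha / 2, loc=mu_0, scale=se)   # under the null hypothesis

# power: probability the sample mean lands in the rejection region
# when the true mean is actually mu_a
power = stats.norm.cdf(lower, loc=mu_a, scale=se) + stats.norm.sf(upper, loc=mu_a, scale=se)
print(f"power = {power:.3f}, Type II error rate = {1 - power:.3f}")
```

With these made-up numbers the power comes out around 0.8, meaning a Type II error rate of roughly 0.2; shrinking alpha or the sample size would lower the power, just as the curves suggest.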

Figure: power of the test (correctly rejecting the null hypothesis when the actual mean differs)

Summary

The probability of a Type I error is a value that you get to choose in a hypothesis test. It is called the significance level and is denoted with the Greek letter alpha. Choosing a big significance level allows you to reject the null hypothesis more often, though the problem is that sometimes we reject the null hypothesis in error: when a difference really doesn't exist, we say that a difference does exist. However, if you choose a really small one, you reject the null hypothesis less often, and sometimes you fail to reject the null hypothesis in error as well. There's no foolproof method here. Usually, you want to keep your significance levels low, such as 0.05 or 0.01. Note that 0.05 is the default choice for most significance tests in most hypothesis testing. Good luck!

Source: Adapted from Sophia tutorial by Jonathan Osters.

Power: the probability that we reject the null hypothesis (correctly) when a difference truly does exist.


P-Value And Statistical Significance: What It Is & Why It Matters


The p-value in statistics quantifies the evidence against a null hypothesis. A low p-value suggests data is inconsistent with the null, potentially favoring an alternative hypothesis. Common significance thresholds are 0.05 or 0.01.

Figure: the p-value explained on a normal distribution

Hypothesis testing

When you perform a statistical test, a p-value helps you determine the significance of your results in relation to the null hypothesis.

The null hypothesis (H0) states no relationship exists between the two variables being studied (one variable does not affect the other). It states the results are due to chance and are not significant in supporting the idea being investigated. Thus, the null hypothesis assumes that whatever you try to prove did not happen.

The alternative hypothesis (Ha or H1) is the one you would believe if the null hypothesis is concluded to be untrue.

The alternative hypothesis states that the independent variable affected the dependent variable, and the results are significant in supporting the theory being investigated (i.e., the results are not due to random chance).

What a p-value tells you

A p-value, or probability value, is a number describing how likely it is that your data would have occurred by random chance (i.e., if the null hypothesis were true).

The level of statistical significance is often expressed as a p-value between 0 and 1.

The smaller the p -value, the less likely the results occurred by random chance, and the stronger the evidence that you should reject the null hypothesis.

Remember, a p-value doesn’t tell you if the null hypothesis is true or false. It just tells you how likely you’d see the data you observed (or more extreme data) if the null hypothesis was true. It’s a piece of evidence, not a definitive proof.

Example: Test Statistic and p-Value

Suppose you’re conducting a study to determine whether a new drug has an effect on pain relief compared to a placebo. If the new drug has no impact, your test statistic will be close to the one predicted by the null hypothesis (no difference between the drug and placebo groups), and the resulting p-value will be close to 1. It may not be precisely 1 because real-world variations may exist. Conversely, if the new drug indeed reduces pain significantly, your test statistic will diverge further from what’s expected under the null hypothesis, and the p-value will decrease. The p-value will never reach zero because there’s always a slim possibility, though highly improbable, that the observed results occurred by random chance.

P-value interpretation

The significance level (alpha) is a set probability threshold (often 0.05), while the p-value is the probability you calculate based on your study or analysis.

A p-value less than or equal to your significance level (typically ≤ 0.05) is statistically significant.

A p-value less than or equal to a predetermined significance level (often 0.05 or 0.01) indicates a statistically significant result, meaning the observed data provide strong evidence against the null hypothesis.

This suggests the effect under study likely represents a real relationship rather than just random chance.

For instance, if you set α = 0.05, you would reject the null hypothesis if your p -value ≤ 0.05. 

It indicates strong evidence against the null hypothesis, as there is less than a 5% probability of obtaining results this extreme by random chance if the null hypothesis were correct.

Therefore, we reject the null hypothesis and accept the alternative hypothesis.

Example: Statistical Significance

Upon analyzing the pain relief effects of the new drug compared to the placebo, the computed p-value is less than 0.01, which falls well below the predetermined alpha value of 0.05. Consequently, you conclude that there is a statistically significant difference in pain relief between the new drug and the placebo.

What does a p-value of 0.001 mean?

A p-value of 0.001 is highly statistically significant beyond the commonly used 0.05 threshold. It indicates strong evidence of a real effect or difference, rather than just random variation.

Specifically, a p-value of 0.001 means there is only a 0.1% chance of obtaining a result at least as extreme as the one observed, assuming the null hypothesis is correct.

Such a small p-value provides strong evidence against the null hypothesis, leading to rejecting the null in favor of the alternative hypothesis.

A p-value greater than the significance level (typically p > 0.05) is not statistically significant and indicates that the data do not provide strong evidence against the null hypothesis.

This means we retain the null hypothesis and do not accept the alternative hypothesis. You should note that you cannot accept the null hypothesis; we can only reject it or fail to reject it.

Note: a p-value does not give the probability that either hypothesis is true; in particular, a p-value at or below your threshold of significance does not mean that there is a 95% probability that the alternative hypothesis is true.

Figures: statistical significance shown for a one-tailed test and a two-tailed test

How do you calculate the p-value ?

Most statistical software packages like R, SPSS, and others automatically calculate your p-value. This is the easiest and most common way.

Online resources and tables are available to estimate the p-value based on your test statistic and degrees of freedom.

These tables help you understand how often you would expect to see your test statistic under the null hypothesis.
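Statistical software effectively does that lookup for you. As a small sketch, SciPy's distribution functions turn a (hypothetical) test statistic and its degrees of freedom into a p-value, which is exactly what printed tables approximate:

```python
from scipy import stats

# hypothetical t statistic and degrees of freedom
t_stat, df = 2.45, 30
p_t = 2 * stats.t.sf(abs(t_stat), df)       # two-tailed p-value for a t-test, roughly 0.02
print(f"t-test: p = {p_t:.4f}")

# hypothetical chi-squared statistic and degrees of freedom
chi2_stat, df_chi2 = 7.8, 3
p_chi2 = stats.chi2.sf(chi2_stat, df_chi2)  # upper-tail p-value for a chi-squared test, roughly 0.05
print(f"chi-squared test: p = {p_chi2:.4f}")
```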

Understanding the Statistical Test:

Different statistical tests are designed to answer specific research questions or hypotheses. Each test has its own underlying assumptions and characteristics.

For example, you might use a t-test to compare means, a chi-squared test for categorical data, or a correlation test to measure the strength of a relationship between variables.

Be aware that the number of independent variables you include in your analysis can influence the magnitude of the test statistic needed to produce the same p-value.

This factor is particularly important to consider when comparing results across different analyses.

Example: Choosing a Statistical Test

If you’re comparing the effectiveness of just two different drugs in pain relief, a two-sample t-test is a suitable choice for comparing these two groups. However, when you’re examining the impact of three or more drugs, it’s more appropriate to employ an Analysis of Variance ( ANOVA) . Utilizing multiple pairwise comparisons in such cases can lead to artificially low p-values and an overestimation of the significance of differences between the drug groups.

How to report

A statistically significant result cannot prove that a research hypothesis is correct (which implies 100% certainty).

Instead, we may state our results “provide support for” or “give evidence for” our research hypothesis (as there is still a slight probability that the results occurred by chance and the null hypothesis was correct – e.g., less than 5%).

Example: Reporting the results

In our comparison of the pain relief effects of the new drug and the placebo, we observed that participants in the drug group experienced a significant reduction in pain ( M = 3.5; SD = 0.8) compared to those in the placebo group ( M = 5.2; SD  = 0.7), resulting in an average difference of 1.7 points on the pain scale (t(98) = -9.36; p < 0.001).

The 6th edition of the APA style manual (American Psychological Association, 2010) states the following on the topic of reporting p-values:

“When reporting p values, report exact p values (e.g., p = .031) to two or three decimal places. However, report p values less than .001 as p < .001.

The tradition of reporting p values in the form p < .10, p < .05, p < .01, and so forth, was appropriate in a time when only limited tables of critical values were available.” (p. 114)

  • Do not use a 0 before the decimal point for the statistical value p, as it cannot be greater than 1. In other words, write p = .001 instead of p = 0.001.
  • Please pay attention to issues of italics ( p is always italicized) and spacing (either side of the = sign).
  • p = .000 (as outputted by some statistical packages such as SPSS) is impossible and should be written as p < .001.
  • The opposite of significant is “nonsignificant,” not “insignificant.”

Why is the p-value not enough?

A lower p-value  is sometimes interpreted as meaning there is a stronger relationship between two variables.

However, statistical significance only means that the observed data would be unlikely (e.g., less than a 5% probability) if the null hypothesis were true.

To understand the strength of the difference between the two groups (control vs. experimental) a researcher needs to calculate the effect size .

When do you reject the null hypothesis?

In statistical hypothesis testing, you reject the null hypothesis when the p-value is less than or equal to the significance level (α) you set before conducting your test. The significance level is the probability of rejecting the null hypothesis when it is true. Commonly used significance levels are 0.01, 0.05, and 0.10.

Remember, rejecting the null hypothesis doesn’t prove the alternative hypothesis; it just suggests that the alternative hypothesis may be plausible given the observed data.

The p -value is conditional upon the null hypothesis being true but is unrelated to the truth or falsity of the alternative hypothesis.

What does a p-value of 0.05 mean?

If your p-value is less than or equal to 0.05 (the significance level), you would conclude that your result is statistically significant. This means the evidence is strong enough to reject the null hypothesis in favor of the alternative hypothesis.

Are all p-values below 0.05 considered statistically significant?

Not necessarily in any meaningful sense. A p-value below 0.05 is statistically significant at the conventional 0.05 level, but that threshold is just a convention. Whether a statistically significant result actually matters depends on factors like the study design, sample size, and the magnitude of the observed effect.

A p-value below 0.05 means there is evidence against the null hypothesis, suggesting a real effect. However, it’s essential to consider the context and other factors when interpreting results.

Researchers also look at effect size and confidence intervals to determine the practical significance and reliability of findings.

How does sample size affect the interpretation of p-values?

Sample size can impact the interpretation of p-values. A larger sample size provides more reliable and precise estimates of the population, leading to narrower confidence intervals.

With a larger sample, even small differences between groups or effects can become statistically significant, yielding lower p-values. In contrast, smaller sample sizes may not have enough statistical power to detect smaller effects, resulting in higher p-values.

Therefore, a larger sample size increases the chances of finding statistically significant results when there is a genuine effect, making the findings more trustworthy and robust.
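
A small simulation sketch (with invented population values) makes this concrete: the same true mean difference will typically produce a much smaller p-value with a larger sample:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

def one_experiment(n, true_diff=0.3, sd=1.0):
    """Simulate two groups whose means differ by true_diff and return the t-test p-value."""
    group_a = rng.normal(0.0, sd, n)
    group_b = rng.normal(true_diff, sd, n)
    return stats.ttest_ind(group_a, group_b).pvalue

# Same underlying effect, different sample sizes per group
print("n = 20 per group:  p =", round(one_experiment(20), 4))
print("n = 500 per group: p =", round(one_experiment(500), 4))
```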

Can a non-significant p-value indicate that there is no effect or difference in the data?

No, a non-significant p-value does not necessarily indicate that there is no effect or difference in the data. It means that the observed data do not provide strong enough evidence to reject the null hypothesis.

There could still be a real effect or difference, but it might be smaller or more variable than the study was able to detect.

Other factors like sample size, study design, and measurement precision can influence the p-value. It’s important to consider the entire body of evidence and not rely solely on p-values when interpreting research findings.

Can P values be exactly zero?

While a p-value can be extremely small, it cannot be exactly zero. When a p-value is reported as p = 0.000, the actual p-value is simply too small for the software to display. This is often interpreted as strong evidence against the null hypothesis. For p-values less than 0.001, report them as p < .001.
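
A small helper function along these lines (a sketch, not part of any standard package) can format p-values for reporting in line with the APA guidance above:

```python
def format_p(p, decimals=3):
    """Format a p-value for reporting: exact to three decimals, or 'p < .001' for very small values."""
    if p < 0.001:
        return "p < .001"
    # Drop the leading zero, per APA style (p cannot be greater than 1)
    return f"p = {p:.{decimals}f}".replace("0.", ".", 1)

print(format_p(0.03125))   # p = .031
print(format_p(0.000004))  # p < .001
```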

Further Information

  • P-values and significance tests (Khan Academy)
  • Hypothesis testing and p-values (Khan Academy)
  • Wasserstein, R. L., Schirm, A. L., & Lazar, N. A. (2019). Moving to a world beyond “p < 0.05”.
  • Criticism of using the “p < 0.05” threshold.
  • Publication manual of the American Psychological Association
  • Statistics for Psychology Book Download

Null hypothesis significance testing: a short tutorial

Cyril Pernet

1 Centre for Clinical Brain Sciences (CCBS), Neuroimaging Sciences, The University of Edinburgh, Edinburgh, UK

Version Changes

Revised. Amendments from version 2.

This v3 includes minor changes that reflect the 3rd reviewers' comments - in particular the theoretical vs. practical difference between Fisher and Neyman-Pearson. Additional information and reference is also included regarding the interpretation of p-value for low powered studies.


Although thoroughly criticized, null hypothesis significance testing (NHST) remains the statistical method of choice used to provide evidence for an effect in biological, biomedical and social sciences. In this short tutorial, I first summarize the concepts behind the method, distinguishing the test of significance (Fisher) from the test of acceptance (Neyman-Pearson), and point to common interpretation errors regarding the p-value. I then present the related concept of confidence intervals and again point to common interpretation errors. Finally, I discuss what should be reported in which context. The goal is to clarify concepts to avoid interpretation errors and propose reporting practices.

The Null Hypothesis Significance Testing framework

NHST is a method of statistical inference by which an experimental factor is tested against a hypothesis of no effect or no relationship based on a given observation. The method is a combination of the concepts of significance testing developed by Fisher in 1925 and of acceptance based on critical rejection regions developed by Neyman & Pearson in 1928 . In the following I am first presenting each approach, highlighting the key differences and common misconceptions that result from their combination into the NHST framework (for a more mathematical comparison, along with the Bayesian method, see Christensen, 2005 ). I next present the related concept of confidence intervals. I finish by discussing practical aspects in using NHST and reporting practice.

Fisher, significance testing, and the p-value

The method developed by Fisher ( Fisher, 1934 ; Fisher, 1955 ; Fisher, 1959 ) allows one to compute the probability of observing a result at least as extreme as a test statistic (e.g. a t value), assuming the null hypothesis of no effect is true. This probability or p-value reflects (1) the conditional probability of achieving the observed outcome or larger: p(Obs≥t|H0), and (2) is therefore a cumulative probability rather than a point estimate. It is equal to the area under the null probability distribution curve from the observed test statistic to the tail of the null distribution ( Turkheimer et al. , 2004 ). The approach proposed is one of ‘proof by contradiction’ ( Christensen, 2005 ): we pose the null model and test whether the data conform to it.

In practice, it is recommended to set a level of significance (a theoretical p-value) that acts as a reference point to identify significant results, that is, to identify results that differ from the null-hypothesis of no effect. Fisher recommended using p=0.05 to judge whether an effect is significant or not, as it is roughly two standard deviations away from the mean for the normal distribution ( Fisher, 1934 page 45: ‘The value for which p=.05, or 1 in 20, is 1.96 or nearly 2; it is convenient to take this point as a limit in judging whether a deviation is to be considered significant or not’). A key aspect of Fisher’s theory is that only the null-hypothesis is tested, and therefore p-values are meant to be used in a graded manner to decide whether the evidence is worth additional investigation and/or replication ( Fisher, 1971 page 13: ‘it is open to the experimenter to be more or less exacting in respect of the smallness of the probability he would require […]’ and ‘no isolated experiment, however significant in itself, can suffice for the experimental demonstration of any natural phenomenon’). How small the level of significance should be is thus left to researchers.
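
To make the tail-area definition and Fisher's ‘nearly 2’ remark concrete, here is a small sketch in Python; the observed statistic is an arbitrary illustrative value:

```python
from scipy import stats

# The p-value is the area under the null distribution beyond the observed statistic.
# Illustrative example: an observed z statistic of 2.3 under a standard normal null.
z_obs = 2.3
p_one_sided = stats.norm.sf(z_obs)           # p(Obs >= z | H0)
p_two_sided = 2 * stats.norm.sf(abs(z_obs))  # both tails

# Fisher's remark: the value cutting off p = .05 two-sided is 1.96, "nearly 2"
print(stats.norm.ppf(0.975))                 # ~1.96
print(p_one_sided, p_two_sided)
```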

What is not a p-value? Common mistakes

The p-value is not an indication of the strength or magnitude of an effect . Any interpretation of the p-value in relation to the effect under study (strength, reliability, probability) is wrong, since p-values are conditioned on H0. In addition, while p-values are randomly distributed (if all the assumptions of the test are met) when there is no effect, their distribution depends on both the population effect size and the number of participants, making it impossible to infer the strength of an effect from them.

Similarly, 1-p is not the probability of replicating an effect . Often, a small value of p is considered to mean a strong likelihood of getting the same results on another try, but again this cannot be obtained because the p-value is not informative about the effect itself ( Miller, 2009 ). Because the p-value depends on the number of subjects, it can only be used in high powered studies to interpret results. In low powered studies (typically with a small number of subjects), the p-value has a large variance across repeated samples, making it unreliable to estimate replication ( Halsey et al. , 2015 ).

A (small) p-value is not an indication favouring a given hypothesis . Because a low p-value only indicates a misfit of the null hypothesis to the data, it cannot be taken as evidence in favour of a specific alternative hypothesis more than any other possible alternatives such as measurement error and selection bias ( Gelman, 2013 ). Some authors have even argued that the more (a priori) implausible the alternative hypothesis, the greater the chance that a finding is a false alarm ( Krzywinski & Altman, 2013 ; Nuzzo, 2014 ).

The p-value is not the probability of the null hypothesis being true, p(H0) ( Krzywinski & Altman, 2013 ). This common misconception arises from a confusion between the probability of an observation given the null, p(Obs≥t|H0), and the probability of the null given an observation, p(H0|Obs≥t), which is then taken as an indication of p(H0) (see Nickerson, 2000 ).

Neyman-Pearson, hypothesis testing, and the α-value

Neyman & Pearson (1933) proposed a framework of statistical inference for applied decision making and quality control. In such a framework, two hypotheses are proposed: the null hypothesis of no effect and the alternative hypothesis of an effect, along with a control of the long run probabilities of making errors. The first key concept in this approach is the establishment of an alternative hypothesis along with an a priori effect size. This differs markedly from Fisher, who proposed a general approach for scientific inference conditioned on the null hypothesis only. The second key concept is the control of error rates. Neyman & Pearson (1928) introduced the notion of critical intervals, therefore dichotomizing the space of possible observations into correct vs. incorrect zones. This dichotomization allows one to distinguish correct results (rejecting H0 when there is an effect and not rejecting H0 when there is no effect) from errors (rejecting H0 when there is no effect, the type I error, and not rejecting H0 when there is an effect, the type II error). In this context, alpha is the probability of committing a Type I error in the long run; beta is the probability of committing a Type II error in the long run.
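
A minimal simulation sketch (with an invented effect size and sample size) shows the long-run reading of these error rates: alpha is estimated from samples drawn under H0, and beta from samples drawn under a specific alternative:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
alpha, n, effect, n_sim = 0.05, 30, 0.5, 10_000

reject_under_h0 = 0  # counts type I errors
reject_under_h1 = 0  # counts correct rejections (their rate estimates 1 - beta, the power)

for _ in range(n_sim):
    # Under H0: the population mean really is 0
    if stats.ttest_1samp(rng.normal(0.0, 1.0, n), 0.0).pvalue <= alpha:
        reject_under_h0 += 1
    # Under H1: the population mean is `effect`
    if stats.ttest_1samp(rng.normal(effect, 1.0, n), 0.0).pvalue <= alpha:
        reject_under_h1 += 1

print("estimated type I error rate (alpha):", reject_under_h0 / n_sim)  # close to 0.05
print("estimated type II error rate (beta):", 1 - reject_under_h1 / n_sim)
```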

The (theoretical) difference in terms of hypothesis testing between Fisher and Neyman-Pearson is illustrated on Figure 1 . In the first case, we choose a level of significance for the observed data of 5%, and compute the p-value. If the p-value is below the level of significance, it is used to reject H0. In the second case, we set a critical interval based on the a priori effect size and error rates. If an observed statistic value falls outside the critical values (the bounds of the acceptance region), it is deemed significantly different from H0. In the NHST framework, the level of significance is (in practice) assimilated to the alpha level, which appears as a simple decision rule: if the p-value is less than or equal to alpha, the null is rejected. It is however a common mistake to assimilate these two concepts. The level of significance set for a given sample is not the same as the frequency of acceptance alpha found on repeated sampling, because alpha (a point estimate) is meant to reflect the long run probability whilst the p-value (a cumulative estimate) reflects the current probability ( Fisher, 1955 ; Hubbard & Bayarri, 2003 ).

Figure 1. The figure was prepared with G-power for a one-sided one-sample t-test, with a sample size of 32 subjects, an effect size of 0.45, and error rates alpha=0.049 and beta=0.80. In Fisher’s procedure, only the nil-hypothesis is posed, and the observed p-value is compared to an a priori level of significance. If the observed p-value is below this level (here p=0.05), one rejects H0. In Neyman-Pearson’s procedure, the null and alternative hypotheses are specified along with an a priori level of acceptance. If the observed statistical value is outside the critical region (here [-∞ +1.69]), one rejects H0.

Acceptance or rejection of H0?

The acceptance level α can also be viewed as the maximum probability that a test statistic falls into the rejection region when the null hypothesis is true ( Johnson, 2013 ). Therefore, one can only reject the null hypothesis if the test statistic falls into the critical region(s), or fail to reject this hypothesis. In the latter case, all we can say is that no significant effect was observed, but one cannot conclude that the null hypothesis is true. This is another common mistake in using NHST: there is a profound difference between accepting the null hypothesis and simply failing to reject it ( Killeen, 2005 ). By failing to reject, we simply continue to assume that H0 is true, which implies that one cannot argue against a theory from a non-significant result (absence of evidence is not evidence of absence). To accept the null hypothesis, tests of equivalence ( Walker & Nowacki, 2011 ) or Bayesian approaches ( Dienes, 2014 ; Kruschke, 2011 ) must be used.
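
To illustrate the equivalence-testing idea mentioned above, here is a sketch of the TOST (two one-sided tests) procedure for two independent samples; the data and the ±0.5 equivalence bounds are invented for illustration:

```python
import numpy as np
from scipy import stats

def tost_two_sample(a, b, low, high):
    """Two one-sided tests (TOST): a small p-value supports equivalence,
    i.e. that the true mean difference lies within [low, high]."""
    a, b = np.asarray(a, float), np.asarray(b, float)
    n1, n2 = len(a), len(b)
    diff = a.mean() - b.mean()
    pooled_var = ((n1 - 1) * a.var(ddof=1) + (n2 - 1) * b.var(ddof=1)) / (n1 + n2 - 2)
    se = np.sqrt(pooled_var * (1 / n1 + 1 / n2))
    df = n1 + n2 - 2
    p_lower = stats.t.sf((diff - low) / se, df)    # test against the lower bound
    p_upper = stats.t.cdf((diff - high) / se, df)  # test against the upper bound
    return max(p_lower, p_upper)

# Hypothetical data, with equivalence bounds of +/- 0.5 units
group1 = [5.1, 4.9, 5.3, 5.0, 5.2, 4.8, 5.1, 5.0]
group2 = [5.0, 5.2, 4.9, 5.1, 5.0, 5.3, 4.9, 5.1]
print("TOST p-value:", round(tost_two_sample(group1, group2, -0.5, 0.5), 4))
```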

Confidence intervals

Confidence intervals (CI) are builds that fail to cover the true value at a rate of alpha, the Type I error rate ( Morey & Rouder, 2011 ), and therefore indicate if observed values can be rejected by a (two tailed) test with a given alpha. CI have been advocated as alternatives to p-values because (i) they allow judging the statistical significance and (ii) provide estimates of effect size. Assuming the CI (a)symmetry and width are correct (but see Wilcox, 2012 ), they also give some indication about the likelihood that a similar value can be observed in future studies. For future studies of the same sample size, 95% CI give about an 83% chance of replication success ( Cumming & Maillardet, 2006 ). If sample sizes differ between studies, however, CI do not warrant any a priori coverage.

Although CI provide more information, they are not less subject to interpretation errors (see Savalei & Dunn, 2015 for a review). The most common mistake is to interpret CI as the probability that a parameter (e.g. the population mean) will fall in that interval X% of the time. The correct interpretation is that, for repeated measurements with the same sample sizes, taken from the same population, X% of the times the CI obtained will contain the true parameter value ( Tan & Tan, 2010 ). The alpha value has the same interpretation as testing against H0, i.e. we accept that 1-alpha CI are wrong in alpha percent of the times in the long run. This implies that CI do not allow us to make strong statements about the parameter of interest (e.g. the mean difference) or about H1 ( Hoekstra et al. , 2014 ). To make a statement about the probability of a parameter of interest (e.g. the probability of the mean), Bayesian intervals must be used.
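
A small coverage simulation (with made-up population values) illustrates this long-run reading: across many repeated samples, roughly 95% of the computed 95% intervals contain the true mean:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
true_mean, sd, n, n_sim = 100.0, 15.0, 25, 10_000

covered = 0
for _ in range(n_sim):
    sample = rng.normal(true_mean, sd, n)
    se = sample.std(ddof=1) / np.sqrt(n)
    # 95% t-based confidence interval for the mean of this sample
    lo, hi = stats.t.interval(0.95, n - 1, loc=sample.mean(), scale=se)
    covered += lo <= true_mean <= hi

print("long-run coverage:", covered / n_sim)  # close to 0.95
```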

The (correct) use of NHST

NHST has always been criticized, and yet it is still used every day in scientific reports ( Nickerson, 2000 ). One question to ask oneself is: what is the goal of the scientific experiment at hand? If the goal is to establish a discrepancy with the null hypothesis and/or establish a pattern of order, then NHST is a good tool, because both require ruling out equivalence ( Frick, 1996 ; Walker & Nowacki, 2011 ). If the goal is to test the presence of an effect and/or establish some quantitative values related to an effect, then NHST is not the method of choice, since testing is conditioned on H0.

While a Bayesian analysis is suited to estimating the probability that a hypothesis is correct, like NHST it does not prove a theory by itself, but adds to its plausibility ( Lindley, 2000 ). No matter what testing procedure is used and how strong results are, ( Fisher, 1959 p13) reminds us that ‘ […] no isolated experiment, however significant in itself, can suffice for the experimental demonstration of any natural phenomenon'. Similarly, the recent statement of the American Statistical Association ( Wasserstein & Lazar, 2016 ) makes it clear that conclusions should be based on the researchers’ understanding of the problem in context, along with all summary data and tests, and that no single value (be it a p-value, a Bayes factor, or anything else) can be used to support or invalidate a theory.

What to report and how?

Considering that quantitative reports will always have more information content than binary (significant or not) reports, we can always argue that raw and/or normalized effect size, confidence intervals, or Bayes factor must be reported. Reporting everything can however hinder the communication of the main result(s), and we should aim at giving only the information needed, at least in the core of a manuscript. Here I propose to adopt optimal reporting in the result section to keep the message clear, but have detailed supplementary material. When the hypothesis is about the presence/absence or order of an effect, and providing that a study has sufficient power, NHST is appropriate and it is sufficient to report in the text the actual p-value since it conveys the information needed to rule out equivalence. When the hypothesis and/or the discussion involve some quantitative value, and because p-values do not inform on the effect, it is essential to report on effect sizes ( Lakens, 2013 ), preferably accompanied with confidence or credible intervals. The reasoning is simply that one cannot predict and/or discuss quantities without accounting for variability. For the reader to understand and fully appreciate the results, nothing else is needed.
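
For example, a standardized effect size such as Cohen's d can be computed directly from the two samples and reported alongside the p-value (a sketch with invented data; the pooled-standard-deviation version of d is used here):

```python
import numpy as np

def cohens_d(a, b):
    """Cohen's d for two independent samples, using the pooled standard deviation."""
    a, b = np.asarray(a, float), np.asarray(b, float)
    n1, n2 = len(a), len(b)
    pooled_var = ((n1 - 1) * a.var(ddof=1) + (n2 - 1) * b.var(ddof=1)) / (n1 + n2 - 2)
    return (a.mean() - b.mean()) / np.sqrt(pooled_var)

# Invented scores, for illustration only
treatment = [3.1, 3.6, 2.9, 3.4, 3.8, 3.2]
control = [4.0, 4.4, 3.9, 4.2, 4.6, 4.1]
print("Cohen's d:", round(cohens_d(treatment, control), 2))
```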

Because scientific progress is obtained by cumulating evidence ( Rosenthal, 1991 ), scientists should also consider the secondary use of the data. With today’s electronic articles, there are no reasons for not including all of the derived data: means, standard deviations, effect sizes, CI, and Bayes factors should always be included as supplementary tables (or, even better, also share the raw data). It is also essential to report the context in which tests were performed – that is, to report all of the tests performed (all t, F, p values) because of the increased type I error rate due to selective reporting (multiple comparisons and p-hacking problems - Ioannidis, 2005 ). Providing all of this information allows (i) other researchers to directly and effectively compare their results in quantitative terms (replication of effects beyond significance, Open Science Collaboration, 2015 ), (ii) computing the power of future studies ( Lakens & Evers, 2014 ), and (iii) aggregating results for meta-analyses whilst minimizing publication bias ( van Assen et al. , 2014 ).


Funding Statement

The author(s) declared that no grants were involved in supporting this work.

  • Christensen R: Testing Fisher, Neyman, Pearson, and Bayes. The American Statistician. 2005; 59 ( 2 ):121–126. 10.1198/000313005X20871 [ CrossRef ] [ Google Scholar ]
  • Cumming G, Maillardet R: Confidence intervals and replication: Where will the next mean fall? Psychological Methods. 2006; 11 ( 3 ):217–227. 10.1037/1082-989X.11.3.217 [ PubMed ] [ CrossRef ] [ Google Scholar ]
  • Dienes Z: Using Bayes to get the most out of non-significant results. Front Psychol. 2014; 5 :781. 10.3389/fpsyg.2014.00781 [ PMC free article ] [ PubMed ] [ CrossRef ] [ Google Scholar ]
  • Fisher RA: Statistical Methods for Research Workers . (Vol. 5th Edition). Edinburgh, UK: Oliver and Boyd.1934. Reference Source [ Google Scholar ]
  • Fisher RA: Statistical Methods and Scientific Induction. Journal of the Royal Statistical Society, Series B. 1955; 17 ( 1 ):69–78. Reference Source [ Google Scholar ]
  • Fisher RA: Statistical methods and scientific inference . (2nd ed.). NewYork: Hafner Publishing,1959. Reference Source [ Google Scholar ]
  • Fisher RA: The Design of Experiments . Hafner Publishing Company, New-York.1971. Reference Source [ Google Scholar ]
  • Frick RW: The appropriate use of null hypothesis testing. Psychol Methods. 1996; 1 ( 4 ):379–390. 10.1037/1082-989X.1.4.379 [ CrossRef ] [ Google Scholar ]
  • Gelman A: P values and statistical practice. Epidemiology. 2013; 24 ( 1 ):69–72. 10.1097/EDE.0b013e31827886f7 [ PubMed ] [ CrossRef ] [ Google Scholar ]
  • Halsey LG, Curran-Everett D, Vowler SL, et al.: The fickle P value generates irreproducible results. Nat Methods. 2015; 12 ( 3 ):179–85. 10.1038/nmeth.3288 [ PubMed ] [ CrossRef ] [ Google Scholar ]
  • Hoekstra R, Morey RD, Rouder JN, et al.: Robust misinterpretation of confidence intervals. Psychon Bull Rev. 2014; 21 ( 5 ):1157–1164. 10.3758/s13423-013-0572-3 [ PubMed ] [ CrossRef ] [ Google Scholar ]
  • Hubbard R, Bayarri MJ: Confusion over measures of evidence (p’s) versus errors ([alpha]’s) in classical statistical testing. The American Statistician. 2003; 57 ( 3 ):171–182. 10.1198/0003130031856 [ CrossRef ] [ Google Scholar ]
  • Ioannidis JP: Why most published research findings are false. PLoS Med. 2005; 2 ( 8 ):e124. 10.1371/journal.pmed.0020124 [ PMC free article ] [ PubMed ] [ CrossRef ] [ Google Scholar ]
  • Johnson VE: Revised standards for statistical evidence. Proc Natl Acad Sci U S A. 2013; 110 ( 48 ):19313–19317. 10.1073/pnas.1313476110 [ PMC free article ] [ PubMed ] [ CrossRef ] [ Google Scholar ]
  • Killeen PR: An alternative to null-hypothesis significance tests. Psychol Sci. 2005; 16 ( 5 ):345–353. 10.1111/j.0956-7976.2005.01538.x [ PMC free article ] [ PubMed ] [ CrossRef ] [ Google Scholar ]
  • Kruschke JK: Bayesian Assessment of Null Values Via Parameter Estimation and Model Comparison. Perspect Psychol Sci. 2011; 6 ( 3 ):299–312. 10.1177/1745691611406925 [ PubMed ] [ CrossRef ] [ Google Scholar ]
  • Krzywinski M, Altman N: Points of significance: Significance, P values and t -tests. Nat Methods. 2013; 10 ( 11 ):1041–1042. 10.1038/nmeth.2698 [ PubMed ] [ CrossRef ] [ Google Scholar ]
  • Lakens D: Calculating and reporting effect sizes to facilitate cumulative science: a practical primer for t -tests and ANOVAs. Front Psychol. 2013; 4 :863. 10.3389/fpsyg.2013.00863 [ PMC free article ] [ PubMed ] [ CrossRef ] [ Google Scholar ]
  • Lakens D, Evers ER: Sailing From the Seas of Chaos Into the Corridor of Stability: Practical Recommendations to Increase the Informational Value of Studies. Perspect Psychol Sci. 2014; 9 ( 3 ):278–292. 10.1177/1745691614528520 [ PubMed ] [ CrossRef ] [ Google Scholar ]
  • Lindley D: The philosophy of statistics. Journal of the Royal Statistical Society. 2000; 49 ( 3 ):293–337. 10.1111/1467-9884.00238 [ CrossRef ] [ Google Scholar ]
  • Miller J: What is the probability of replicating a statistically significant effect? Psychon Bull Rev. 2009; 16 ( 4 ):617–640. 10.3758/PBR.16.4.617 [ PubMed ] [ CrossRef ] [ Google Scholar ]
  • Morey RD, Rouder JN: Bayes factor approaches for testing interval null hypotheses. Psychol Methods. 2011; 16 ( 4 ):406–419. 10.1037/a0024377 [ PubMed ] [ CrossRef ] [ Google Scholar ]
  • Neyman J, Pearson ES: On the Use and Interpretation of Certain Test Criteria for Purposes of Statistical Inference: Part I. Biometrika. 1928; 20A ( 1/2 ):175–240. 10.3389/fpsyg.2015.00245 [ CrossRef ] [ Google Scholar ]
  • Neyman J, Pearson ES: On the problem of the most efficient tests of statistical hypotheses. Philos Trans R Soc Lond Ser A. 1933; 231 ( 694–706 ):289–337. 10.1098/rsta.1933.0009 [ CrossRef ] [ Google Scholar ]
  • Nickerson RS: Null hypothesis significance testing: a review of an old and continuing controversy. Psychol Methods. 2000; 5 ( 2 ):241–301. 10.1037/1082-989X.5.2.241 [ PubMed ] [ CrossRef ] [ Google Scholar ]
  • Nuzzo R: Scientific method: statistical errors. Nature. 2014; 506 ( 7487 ):150–152. 10.1038/506150a [ PubMed ] [ CrossRef ] [ Google Scholar ]
  • Open Science Collaboration. PSYCHOLOGY. Estimating the reproducibility of psychological science. Science. 2015; 349 ( 6251 ):aac4716. 10.1126/science.aac4716 [ PubMed ] [ CrossRef ] [ Google Scholar ]
  • Rosenthal R: Cumulating psychology: an appreciation of Donald T. Campbell. Psychol Sci. 1991; 2 ( 4 ):213–221. 10.1111/j.1467-9280.1991.tb00138.x [ CrossRef ] [ Google Scholar ]
  • Savalei V, Dunn E: Is the call to abandon p -values the red herring of the replicability crisis? Front Psychol. 2015; 6 :245. 10.3389/fpsyg.2015.00245 [ PMC free article ] [ PubMed ] [ CrossRef ] [ Google Scholar ]
  • Tan SH, Tan SB: The Correct Interpretation of Confidence Intervals. Proceedings of Singapore Healthcare. 2010; 19 ( 3 ):276–278. 10.1177/201010581001900316 [ CrossRef ] [ Google Scholar ]
  • Turkheimer FE, Aston JA, Cunningham VJ: On the logic of hypothesis testing in functional imaging. Eur J Nucl Med Mol Imaging. 2004; 31 ( 5 ):725–732. 10.1007/s00259-003-1387-7 [ PubMed ] [ CrossRef ] [ Google Scholar ]
  • van Assen MA, van Aert RC, Nuijten MB, et al.: Why Publishing Everything Is More Effective than Selective Publishing of Statistically Significant Results. PLoS One. 2014; 9 ( 1 ):e84896. 10.1371/journal.pone.0084896 [ PMC free article ] [ PubMed ] [ CrossRef ] [ Google Scholar ]
  • Walker E, Nowacki AS: Understanding equivalence and noninferiority testing. J Gen Intern Med. 2011; 26 ( 2 ):192–196. 10.1007/s11606-010-1513-8 [ PMC free article ] [ PubMed ] [ CrossRef ] [ Google Scholar ]
  • Wasserstein RL, Lazar NA: The ASA’s Statement on p -Values: Context, Process, and Purpose. The American Statistician. 2016; 70 ( 2 ):129–133. 10.1080/00031305.2016.1154108 [ CrossRef ] [ Google Scholar ]
  • Wilcox R: Introduction to Robust Estimation and Hypothesis Testing . Edition 3, Academic Press, Elsevier: Oxford, UK, ISBN: 978-0-12-386983-8.2012. Reference Source [ Google Scholar ]

Referee response for version 3

Dorothy Vera Margaret Bishop

1 Department of Experimental Psychology, University of Oxford, Oxford, UK

I can see from the history of this paper that the author has already been very responsive to reviewer comments, and that the process of revising has now been quite protracted.

That makes me reluctant to suggest much more, but I do see potential here for making the paper more impactful. So my overall view is that, once a few typos are fixed (see below), this could be published as is, but I think there is an issue with the potential readership and that further revision could overcome this.

I suspect my take on this is rather different from other reviewers, as I do not regard myself as a statistics expert, though I am on the more quantitative end of the continuum of psychologists and I try to keep up to date. I think I am quite close to the target readership , insofar as I am someone who was taught about statistics ages ago and uses stats a lot, but never got adequate training in the kinds of topic covered by this paper. The fact that I am aware of controversies around the interpretation of confidence intervals etc is simply because I follow some discussions of this on social media. I am therefore very interested to have a clear account of these issues.

This paper contains helpful information for someone in this position, but it is not always clear, and I felt the relevance of some of the content was uncertain. So here are some recommendations:

  • As one previous reviewer noted, it’s questionable that there is a need for a tutorial introduction, and the limited length of this article does not lend itself to a full explanation. So it might be better to just focus on explaining as clearly as possible the problems people have had in interpreting key concepts. I think a title that made it clear this was the content would be more appealing than the current one.
  • P 3, col 1, para 3, last sentence. Although statisticians always emphasise the arbitrary nature of p < .05, we all know that in practice authors who use other values are likely to have their analyses queried. I wondered whether it would be useful here to note that in some disciplines different cutoffs are traditional, e.g. particle physics. Or you could cite David Colquhoun’s paper in which he recommends using p < .001 ( http://rsos.royalsocietypublishing.org/content/1/3/140216) - just to be clear that the traditional p < .05 has been challenged.

What I can’t work out is how you would explain the alpha from Neyman-Pearson in the same way (though I can see from Figure 1 that with N-P you could test an alternative hypothesis, such as the idea that the coin would be heads 75% of the time).

‘By failing to reject, we simply continue to assume that H0 is true, which implies that one cannot….’ have ‘In failing to reject, we do not assume that H0 is true; one cannot argue against a theory from a non-significant result.’

I felt most readers would be interested to read about tests of equivalence and Bayesian approaches, but many would be unfamiliar with these and might like to see an example of how they work in practice – if space permitted.

  • Confidence intervals: I simply could not understand the first sentence – I wondered what was meant by ‘builds’ here. I understand about difficulties in comparing CI across studies when sample sizes differ, but I did not find the last sentence on p 4 easy to understand.
  • P 5: The sentence starting: ‘The alpha value has the same interpretation’ was also hard to understand, especially the term ‘1-alpha CI’. Here too I felt some concrete illustration might be helpful to the reader. And again, I also found the reference to Bayesian intervals tantalising – I think many readers won’t know how to compute these and something like a figure comparing a traditional CI with a Bayesian interval and giving a source for those who want to read on would be very helpful. The reference to ‘credible intervals’ in the penultimate paragraph is very unclear and needs a supporting reference – most readers will not be familiar with this concept.

P 3, col 1, para 2, line 2; “allows us to compute”

P 3, col 2, para 2, ‘probability of replicating’

P 3, col 2, para 2, line 4 ‘informative about’

P 3, col 2, para 4, line 2 delete ‘of’

P 3, col 2, para 5, line 9 – ‘conditioned’ is either wrong or too technical here: would ‘based’ be acceptable as alternative wording

P 3, col 2, para 5, line 13 ‘This dichotomisation allows one to distinguish’

P 3, col 2, para 5, last sentence, delete ‘Alternatively’.

P 3, col 2, last para line 2 ‘first’

P 4, col 2, para 2, last sentence is hard to understand; not sure if this is better: ‘If sample sizes differ between studies, the distribution of CIs cannot be specified a priori’

P 5, col 1, para 2, ‘a pattern of order’ – I did not understand what was meant by this

P 5, col 1, para 2, last sentence unclear: possible rewording: “If the goal is to test the size of an effect then NHST is not the method of choice, since testing can only reject the null hypothesis.’ (??)

P 5, col 1, para 3, line 1 delete ‘that’

P 5, col 1, para 3, line 3 ‘on’ -> ‘by’

P 5, col 2, para 1, line 4 , rather than ‘Here I propose to adopt’ I suggest ‘I recommend adopting’

P 5, col 2, para 1, line 13 ‘with’ -> ‘by’

P 5, col 2, para 1 – recommend deleting last sentence

P 5, col 2, para 2, line 2 ‘consider’ -> ‘anticipate’

P 5, col 2, para 2, delete ‘should always be included’

P 5, col 2, para 2, ‘type one’ -> ‘Type I’

I have read this submission. I believe that I have an appropriate level of expertise to confirm that it is of an acceptable scientific standard, however I have significant reservations, as outlined above.

The University of Edinburgh, UK

I wondered about changing the focus slightly and modifying the title to reflect this to say something like: Null hypothesis significance testing: a guide to commonly misunderstood concepts and recommendations for good practice

Thank you for the suggestion – you indeed saw the intention behind the ‘tutorial’ style of the paper.

  • P 3, col 1, para 3, last sentence. Although statisticians always emphasise the arbitrary nature of p < .05, we all know that in practice authors who use other values are likely to have their analyses queried. I wondered whether it would be useful here to note that in some disciplines different cutoffs are traditional, e.g. particle physics. Or you could cite David Colquhoun’s paper in which he recommends using p < .001 ( http://rsos.royalsocietypublishing.org/content/1/3/140216)  - just to be clear that the traditional p < .05 has been challenged.

I have added a sentence on this citing Colquhoun 2014 and the new Benjamin 2017 on using .005.

I agree that this point is always hard to appreciate, especially because it seems like in practice it makes little difference. I added a paragraph but using reaction times rather than a coin toss – thanks for the suggestion.

Added an example based on new table 1, following figure 1 – giving CI, equivalence tests and Bayes Factor (with refs to easy to use tools)

Changed ‘builds’ to ‘constructs’ (this simply means they are something we build) and added that the implication of probability coverage not being warranted when sample sizes change is that we cannot compare CI.

I changed ‘ i.e. we accept that 1-alpha CI are wrong in alpha percent of the times in the long run’ to ‘, ‘e.g. a 95% CI is wrong in 5% of the times in the long run (i.e. if we repeat the experiment many times).’ – for Bayesian intervals I simply re-cited Morey & Rouder, 2011.

It is not that the CI cannot be specified, it’s that the interval is not predictive of anything anymore! I changed it to ‘If sample sizes, however, differ between studies, there is no warranty that a CI from one study will be true at the rate alpha in a different study, which implies that CI cannot be compared across studies as they rarely have the same sample sizes’

I added (i.e. establish that A > B) – we test that conditions are ordered, but without further specification of the probability of that effect nor its size

Yes it works – thx

P 5, col 2, para 2, ‘type one’ -> ‘Type I’ 

Typos fixed, and suggestions accepted – thanks for that.

Stephen J. Senn

1 Luxembourg Institute of Health, Strassen, L-1445, Luxembourg

The revisions are OK for me, and I have changed my status to Approved.

I have read this submission. I believe that I have an appropriate level of expertise to confirm that it is of an acceptable scientific standard.

Referee response for version 2

On the whole I think that this article is reasonable, my main reservation being that I have my doubts on whether the literature needs yet another tutorial on this subject.

A further reservation I have is that the author, following others, stresses what in my mind is a relatively unimportant distinction between the Fisherian and Neyman-Pearson (NP) approaches. The distinction stressed by many is that the NP approach leads to a dichotomy accept/reject based on probabilities established in advance, whereas the Fisherian approach uses tail area probabilities calculated from the observed statistic. I see this as being unimportant and not even true. Unless one considers that the person carrying out a hypothesis test (original tester) is mandated to come to a conclusion on behalf of all scientific posterity, then one must accept that any remote scientist can come to his or her conclusion depending on the personal type I error favoured. To operate the results of an NP test carried out by the original tester, the remote scientist then needs to know the p-value. The type I error rate is then compared to this to come to a personal accept or reject decision (1). In fact Lehmann (2), who was an important developer of and proponent of the NP system, describes exactly this approach as being good practice. (See Testing Statistical Hypotheses, 2nd edition P70). Thus using tail-area probabilities calculated from the observed statistics does not constitute an operational difference between the two systems.

A more important distinction between the Fisherian and NP systems is that the former does not use alternative hypotheses(3). Fisher's opinion was that the null hypothesis was more primitive than the test statistic but that the test statistic was more primitive than the alternative hypothesis. Thus, alternative hypotheses could not be used to justify choice of test statistic. Only experience could do that.

Further distinctions between the NP and Fisherian approach are to do with conditioning and whether a null hypothesis can ever be accepted.

I have one minor quibble about terminology. As far as I can see, the author uses the usual term 'null hypothesis' and the eccentric term 'nil hypothesis' interchangeably. It would be simpler if the latter were abandoned.

Referee response for version 1

Marcel A. L. M. van Assen

1 Department of Methodology and Statistics, Tilburg University, Tilburg, Netherlands

Null hypothesis significance testing (NHST) is a difficult topic, with misunderstandings arising easily. Many texts, including basic statistics books, deal with the topic, and attempt to explain it to students and anyone else interested. I would refer to a good basic textbook for a detailed explanation of NHST, or to a specialized article when wishing an explanation of the background of NHST. So, what is the added value of a new text on NHST? In any case, the added value should be described at the start of this text. Moreover, the topic is so delicate and difficult that errors, misinterpretations, and disagreements are easy. I attempted to show this by giving comments on many sentences in the text.

Abstract: “null hypothesis significance testing is the statistical method of choice in biological, biomedical and social sciences to investigate if an effect is likely”. No, NHST is the method to test the hypothesis of no effect.

Intro: “Null hypothesis significance testing (NHST) is a method of statistical inference by which an observation is tested against a hypothesis of no effect or no relationship.” What is an ‘observation’? NHST is difficult to describe in one sentence, particularly here. I would skip this sentence entirely, here.

Section on Fisher; also explain the one-tailed test.

Section on Fisher; p(Obs|H0) does not reflect the verbal definition (the ‘or more extreme’ part).

Section on Fisher; use a reference and citation to Fisher’s interpretation of the p-value

Section on Fisher; “This was however only intended to be used as an indication that there is something in the data that deserves further investigation. The reason for this is that only H0 is tested whilst the effect under study is not itself being investigated.” First sentence, can you give a reference? Many people say a lot about Fisher’s intentions, but the good man is dead and cannot reply… Second sentence is a bit awkward, because the effect is investigated in a way, by testing the H0.

Section on p-value; Layout and structure can be improved greatly, by first again stating what the p-value is, and then statement by statement, what it is not, using separate lines for each statement. Consider adding that the p-value is randomly distributed under H0 (if all the assumptions of the test are met), and that under H1 the p-value is a function of population effect size and N; the larger each is, the smaller the p-value generally is.

Skip the sentence “If there is no effect, we should replicate the absence of effect with a probability equal to 1-p”. Not insightful, and you did not discuss the concept ‘replicate’ (and do not need to).

Skip the sentence “The total probability of false positives can also be obtained by aggregating results ( Ioannidis, 2005 ).” Not strongly related to p-values, and introduces unnecessary concepts ‘false positives’ (perhaps later useful) and ‘aggregation’.

Consider deleting; “If there is an effect however, the probability to replicate is a function of the (unknown) population effect size with no good way to know this from a single experiment ( Killeen, 2005 ).”

The following sentence; “ Finally, a (small) p-value  is not an indication favouring a hypothesis . A low p-value indicates a misfit of the null hypothesis to the data and cannot be taken as evidence in favour of a specific alternative hypothesis more than any other possible alternatives such as measurement error and selection bias ( Gelman, 2013 ).” is surely not mainstream thinking about NHST; I would surely delete that sentence. In NHST, a p-value is used for testing the H0. Why did you not yet discuss significance level? Yes, before discussing what is not a p-value, I would explain NHST (i.e., what it is and how it is used). 

Also the next sentence “The more (a priori) implausible the alternative hypothesis, the greater the chance that a finding is a false alarm ( Krzywinski & Altman, 2013 ;  Nuzzo, 2014 ).“ is not fully clear to me. This is a Bayesian statement. In NHST, no likelihoods are attributed to hypotheses; the reasoning is “IF H0 is true, then…”.

Last sentence: “As  Nickerson (2000)  puts it ‘theory corroboration requires the testing of multiple predictions because the chance of getting statistically significant results for the wrong reasons in any given case is high’.” What is relation of this sentence to the contents of this section, precisely?

Next section: “For instance, we can estimate that the probability of a given F value to be in the critical interval [+2 +∞] is less than 5%” This depends on the degrees of freedom.

“When there is no effect (H0 is true), the erroneous rejection of H0 is known as type I error and is equal to the p-value.” Strange sentence. The Type I error is the probability of erroneously rejecting the H0 (so, when it is true). The p-value is … well, you explained it before; it surely does not equal the Type I error.

Consider adding a figure explaining the distinction between Fisher’s logic and that of Neyman and Pearson.

“When the test statistics falls outside the critical region(s)” What is outside?

“There is a profound difference between accepting the null hypothesis and simply failing to reject it ( Killeen, 2005 )” I agree with you, but perhaps you may add that some statisticians simply define “accept H0” as obtaining a p-value larger than the significance level. Did you already discuss the significance level, and its most commonly used values?

“To accept or reject equally the null hypothesis, Bayesian approaches ( Dienes, 2014 ;  Kruschke, 2011 ) or confidence intervals must be used.” Is ‘reject equally’ appropriate English? Also using Cis, one cannot accept the H0.

Do you start discussing alpha only in the context of Cis?

“CI also indicates the precision of the estimate of effect size, but unless using a percentile bootstrap approach, they require assumptions about distributions which can lead to serious biases in particular regarding the symmetry and width of the intervals ( Wilcox, 2012 ).” Too difficult, using new concepts. Consider deleting.

“Assuming the CI (a)symmetry and width are correct, this gives some indication about the likelihood that a similar value can be observed in future studies, with 95% CI giving about 83% chance of replication success ( Lakens & Evers, 2014 ).” This statement is, in general, completely false. It very much depends on the sample sizes of both studies. If the replication study has a much, much, much larger N, then the probability that the original CI will contain the effect size of the replication approaches (1-alpha)*100%. If the original study has a much, much, much larger N, then the probability that the original Ci will contain the effect size of the replication study approaches 0%.

“Finally, contrary to p-values, CI can be used to accept H0. Typically, if a CI includes 0, we cannot reject H0. If a critical null region is specified rather than a single point estimate, for instance [-2 +2] and the CI is included within the critical null region, then H0 can be accepted. Importantly, the critical region must be specified a priori and cannot be determined from the data themselves.” No. H0 cannot be accepted with Cis.

“The (posterior) probability of an effect can however not be obtained using a frequentist framework.” Frequentist framework? You did not discuss that, yet.

“X% of times the CI obtained will contain the same parameter value”. The same? True, you mean?

“e.g. X% of the times the CI contains the same mean” I do not understand; which mean?

“The alpha value has the same interpretation as when using H0, i.e. we accept that 1-alpha CI are wrong in alpha percent of the times. “ What do you mean, CI are wrong? Consider rephrasing.

“To make a statement about the probability of a parameter of interest, likelihood intervals (maximum likelihood) and credibility intervals (Bayes) are better suited.” ML gives the likelihood of the data given the parameter, not the other way around.

“Many of the disagreements are not on the method itself but on its use.” Bayesians may disagree.

“If the goal is to establish the likelihood of an effect and/or establish a pattern of order, because both requires ruling out equivalence, then NHST is a good tool ( Frick, 1996 )” NHST does not provide evidence on the likelihood of an effect.

“If the goal is to establish some quantitative values, then NHST is not the method of choice.” P-values are also quantitative… this is not a precise sentence. And NHST may be used in combination with effect size estimation (this is even recommended by, e.g., the American Psychological Association (APA)).

“Because results are conditioned on H0, NHST cannot be used to establish beliefs.” It can reinforce some beliefs, e.g., if H0 or any other hypothesis, is true.

“To estimate the probability of a hypothesis, a Bayesian analysis is a better alternative.” It is the only alternative?

“Note however that even when a specific quantitative prediction from a hypothesis is shown to be true (typically testing H1 using Bayes), it does not prove the hypothesis itself, it only adds to its plausibility.” How can we show something is true?

I do not agree on the contents of the last section on ‘minimal reporting’. I prefer ‘optimal reporting’ instead, i.e., reporting the information that is essential to the interpretation of the result for any reader, who may have other goals than the writer of the article. This reporting includes, for sure, an estimate of effect size, and preferably a confidence interval, which is in line with recommendations of the APA.

I have read this submission. I believe that I have an appropriate level of expertise to state that I do not consider it to be of an acceptable scientific standard, for reasons outlined above.

The idea of this short review was to point to common interpretation errors (stressing again and again that we are under H0) in using p-values or CI, and also to propose reporting practices to avoid bias. This is now stated at the end of the abstract.

Regarding text books, it is clear that many fail to clearly distinguish Fisher/Pearson/NHST, see Glinet et al (2012) J. Exp Education 71, 83-92. If you have 1 or 2 in mind that you know to be good, I’m happy to include them.

I agree – yet people use it to investigate (not test) if an effect is likely. The issue here is wording. What about adding this distinction at the end of the sentence?: ‘null hypothesis significance testing is the statistical method of choice in biological, biomedical and social sciences used to investigate if an effect is likely, even though it actually tests for the hypothesis of no effect’.

I think a definition is needed, as it offers a starting point. What about the following: ‘NHST is a method of statistical inference by which an experimental factor is tested against a hypothesis of no effect or no relationship based on a given observation’

The section on Fisher has been modified (more or less) as suggested: (1) avoiding talking about one or two tailed tests (2) updating for p(Obs≥t|H0) and (3) referring to Fisher more explicitly (ie pages from articles and book) ; I cannot tell his intentions but these quotes leave little space to alternative interpretations.

The reasoning here is as you state yourself, part 1: ‘a p-value is used for testing the H0’; and part 2: ‘no likelihoods are attributed to hypotheses’; it follows we cannot favour a hypothesis. It might seem contentious but this is the case: all we can do is reject the null – how could we favour a specific alternative hypothesis from there? This is explored further down the manuscript (and I now point to that) – note that we do not need to be Bayesian to favour a specific H1, all I’m saying is this cannot be attained with a p-value.

The point was to emphasise that a p value is not there to tell us a given H1 is true and can only be achieved through multiple predictions and experiments. I deleted it for clarity.

This sentence has been removed

Indeed, you are right and I have modified the text accordingly. When there is no effect (H0 is true), the erroneous rejection of H0 is known as type 1 error. Importantly, the type 1 error rate, or alpha value is determined a priori. It is a common mistake but the level of significance (for a given sample) is not the same as the frequency of acceptance alpha found on repeated sampling (Fisher, 1955).

A figure is now presented – with levels of acceptance, critical region, level of significance and p-value.

I should have clarified further here – as I had in mind tests of equivalence. To clarify, I simply state now: ‘To accept the null hypothesis, tests of equivalence or Bayesian approaches must be used.’

It is now presented in the paragraph before.

Yes, you are right, I completely overlooked this problem. The corrected sentence (with more accurate ref) is now “Assuming the CI (a)symmetry and width are correct, this gives some indication about the likelihood that a similar value can be observed in future studies. For future studies of the same sample size, 95% CI giving about 83% chance of replication success (Cumming and Maillardet, 2006). If sample sizes differ between studies, CI do not however warranty any a priori coverage”.

Again, I had in mind equivalence testing, but in both cases you are right we can only reject and I therefore removed that sentence.

Yes, p-values must be interpreted in context with effect size, but this is not what people do. The point here is to be pragmatic, does and don’t. The sentence was changed.

Not for testing, but for probability, I am not aware of anything else.

Cumulative evidence is, in my opinion, the only way to show it. Even in hard sciences like physics, multiple experiments are needed. In the recent CERN study on finding the Higgs boson, 2 different and complementary experiments ran in parallel – and the cumulative evidence was taken as proof of the true existence of the Higgs boson.

Daniel Lakens

1 School of Innovation Sciences, Eindhoven University of Technology, Eindhoven, Netherlands

I appreciate the author's attempt to write a short tutorial on NHST. Many people don't know how to use it, so attempts to educate people are always worthwhile. However, I don't think the current article reaches its aim. For one, I think it might be practically impossible to explain a lot in such an ultra short paper - every section would require more than 2 pages to explain, and there are many sections. Furthermore, there are some excellent overviews, which, although more extensive, are also much clearer (e.g., Nickerson, 2000 ). Finally, I found many statements to be unclear, and perhaps even incorrect (noted below). Because there is nothing worse than creating more confusion on such a topic, I have extremely high standards before I think such a short primer should be indexed. I note some examples of unclear or incorrect statements below. I'm sorry I can't make a more positive recommendation.

“investigate if an effect is likely” – ambiguous statement. I think you mean, whether the observed DATA is probable, assuming there is no effect?

The Fisher (1959) reference is not correct – Fisher developed his method much earlier.

“This p-value thus reflects the conditional probability of achieving the observed outcome or larger, p(Obs|H0)” – please add 'assuming the null-hypothesis is true'.

“p(Obs|H0)” – explain this notation for novices.

“Following Fisher, the smaller the p-value, the greater the likelihood that the null hypothesis is false.”  This is wrong, and any statement about this needs to be much more precise. I would suggest direct quotes.

“there is something in the data that deserves further investigation” –unclear sentence.

“The reason for this” – unclear what ‘this’ refers to.

“ not the probability of the null hypothesis of being true, p(H0)” – second of can be removed?

“Any interpretation of the p-value in relation to the effect under study (strength, reliability, probability) is indeed

wrong, since the p-value is conditioned on H0”  - incorrect. A big problem is that it depends on the sample size, and that the probability of a theory depends on the prior.

“If there is no effect, we should replicate the absence of effect with a probability equal to 1-p.” I don’t understand this, but I think it is incorrect.

“The total probability of false positives can also be obtained by aggregating results (Ioannidis, 2005).” Unclear, and probably incorrect.

“By failing to reject, we simply continue to assume that H0 is true, which implies that one cannot, from a nonsignificant result, argue against a theory” – according to which theory? From a NP perspective, you can ACT as if the theory is false.

“(Lakens & Evers, 2014”) – we are not the original source, which should be cited instead.

“ Typically, if a CI includes 0, we cannot reject H0.”  - when would this not be the case? This assumes a CI of 1-alpha.

“If a critical null region is specified rather than a single point estimate, for instance [-2 +2] and the CI is included within the critical null region, then H0 can be accepted.” – you mean practically, or formally? I’m pretty sure only the former.

The section on ‘The (correct) use of NHST’ seems to conclude only Bayesian statistics should be used. I don’t really agree.

“ we can always argue that effect size, power, etc. must be reported.” – which power? Post-hoc power? Surely not? Other types are unknown. So what do you mean?

The recommendation on what to report remains vague, and it is unclear why what should be reported.

This sentence was changed, following the other reviewer as well, to ‘null hypothesis significance testing is the statistical method of choice in biological, biomedical and social sciences to investigate if an effect is likely, even though it actually tests whether the observed data are probable, assuming there is no effect’

Changed, refers to Fisher 1925

I changed the sentence structure a little, which should make explicit that this is the conditional probability.

This has been changed to ‘[…] to decide whether the evidence is worth additional investigation and/or replication (Fisher, 1971 p13)’

My mistake – the sentence structure is now ‘not the probability of the null hypothesis p(H0), of being true’; I hope this makes more sense (and this way it refers back to p(Obs>t|H0)).

Fair enough – my point was to stress the fact that the p-value and the effect size or H1 have very little in common, but yes, the part they do have in common has to do with sample size. I left the conditioning on H0 but also point out the dependency on sample size.

The whole paragraph was changed to reflect a more philosophical take on scientific induction/reasoning. I hope this is clearer.

Changed to refer to equivalence testing

I rewrote this so as to show that frequentist analysis can be used – I'm not trying to sell Bayes more than any other approach.

I’m arguing we should report it all; that’s why there is no exhaustive list – I can add one if needed.

Statology

Statistics Made Easy

An Explanation of P-Values and Statistical Significance

In statistics, p-values are commonly used in hypothesis testing for t-tests, chi-square tests, regression analysis, ANOVAs, and a variety of other statistical methods.

Despite being so common, people often interpret p-values incorrectly, which can lead to errors when interpreting the findings from an analysis or a study. 

This post explains how to understand and interpret p-values in a clear, practical way.

Hypothesis Testing

To understand p-values, we first need to understand the concept of hypothesis testing .

A  hypothesis test  is a formal statistical test we use to reject or fail to reject some hypothesis. For example, we may hypothesize that a new drug, method, or procedure provides some benefit over a current drug, method, or procedure. 

To test this, we can conduct a hypothesis test where we use a null and alternative hypothesis:

Null hypothesis – There is no effect or difference between the new method and the old method.

Alternative hypothesis – There does exist some effect or difference between the new method and the old method.

A p-value measures how compatible the sample data are with the null hypothesis. Specifically, assuming the null hypothesis is true, the p-value tells us the probability of obtaining an effect at least as large as the one we actually observed in the sample data.

If the p-value of a hypothesis test is sufficiently low, we can reject the null hypothesis. Specifically, when we conduct a hypothesis test, we must choose a significance level at the outset. Common choices for significance levels are 0.01, 0.05, and 0.10.

If the p-value is  less than  our significance level, then we can reject the null hypothesis.

Otherwise, if the p-value is  equal to or greater than  our significance level, then we fail to reject the null hypothesis. 
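As a minimal sketch of this decision rule in Python (the p-value and alpha below are example numbers, not results from any particular test):

```python
# A minimal sketch of the reject / fail-to-reject decision rule.
p_value = 0.031   # example p-value, as if returned by some hypothesis test
alpha = 0.05      # significance level chosen before looking at the data

if p_value < alpha:
    print(f"p = {p_value} < alpha = {alpha}: reject the null hypothesis")
else:
    print(f"p = {p_value} >= alpha = {alpha}: fail to reject the null hypothesis")
```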

How to Interpret a P-Value

The textbook definition of a p-value is:

A p-value is the probability of observing a sample statistic that is at least as extreme as your sample statistic, given that the null hypothesis is true.

For example, suppose a factory claims that they produce tires that have a mean weight of 200 pounds. An auditor hypothesizes that the true mean weight of tires produced at this factory is different from 200 pounds so he runs a hypothesis test and finds that the p-value of the test is 0.04. Here is how to interpret this p-value:

If the factory does indeed produce tires that have a mean weight of 200 pounds, then 4% of all audits will observe the effect seen in the sample, or a larger one, because of random sampling error. This tells us that obtaining the sample data that the auditor did would be pretty rare if the factory really did produce tires with a mean weight of 200 pounds.

Depending on the significance level used in this hypothesis test, the auditor would likely reject the null hypothesis that the true mean weight of tires produced at this factory is indeed 200 pounds. The sample data that he obtained from the audit is not very consistent with the null hypothesis.
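If the auditor had the raw tire weights, a test along these lines could be run with SciPy; the weights below are simulated purely for illustration, so the resulting p-value will not match the 0.04 in the example.

```python
import numpy as np
from scipy import stats

# Simulated tire weights in pounds (placeholder data; the real audit sample is not given).
rng = np.random.default_rng(0)
weights = rng.normal(loc=202.5, scale=6.0, size=40)

# Two-sided one-sample t-test of H0: mean weight = 200 pounds
t_stat, p_value = stats.ttest_1samp(weights, popmean=200)
print(f"t = {t_stat:.3f}, p = {p_value:.4f}")
```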

How Not  to Interpret a P-Value

The biggest misconception about p-values is that they are equivalent to the probability of making a mistake by rejecting a true null hypothesis (known as a Type I error).

There are two primary reasons that p-values can’t be the error rate:

1.  P-values are calculated based on the assumption that the null hypothesis is true and that the difference between the sample data and the null hypothesis is simply caused by random chance. Thus, p-values can’t tell you the probability that the null is true or false, since, from the perspective of the calculations, it is assumed to be 100% true.

2. Although a low p-value indicates that your sample data are unlikely assuming the null is true, a p-value still can’t tell you which of the following cases is more likely:

  • The null is false
  • The null is true but you obtained an odd sample

In regards to the previous example, here is a correct and incorrect way to interpret the p-value:

  • Correct Interpretation: Assuming the factory does produce tires with a mean weight of 200 pounds, you would obtain the observed difference that you  did  obtain in your sample or a more extreme difference in 4% of audits due to random sampling error.
  • Incorrect Interpretation: If you reject the null hypothesis, there is a 4% chance that you are making a mistake.

Examples of Interpreting P-Values

The following examples illustrate correct ways to interpret p-values in the context of hypothesis testing.

A phone company claims that 90% of its customers are satisfied with their service. To test this claim, an independent researcher gathered a simple random sample of 200 customers and asked them if they are satisfied with their service, to which 85% responded yes. The p-value associated with this sample data turned out to be 0.018.

Correct interpretation of p-value:  Assuming that 90% of the customers actually are satisfied with their service, the researcher would obtain the observed difference that he did obtain in his sample, or a more extreme difference, in 1.8% of samples due to random sampling error.

A company invents a new battery for phones. The company claims that this new battery will work for at least 10 minutes longer than the old battery. To test this claim, a researcher takes a simple random sample of 80 new batteries and 80 old batteries. The new batteries run for an average of 120 minutes with a standard deviation of 12 minutes and the old batteries run for an average of 115 minutes with a standard deviation of 15 minutes. The p-value that results from the test for a difference in population means is 0.011.

Correct interpretation of p-value:  Assuming that the new battery works for the same amount of time or less than the old battery, the researcher would obtain the observed difference or a more extreme difference in 1.1% of studies due to random sampling error.
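The example reports only summary statistics, so one way to reproduce a p-value close to the quoted 0.011 is a one-sided Welch two-sample t-test computed from those summaries; this is a sketch under that assumption, not necessarily the exact test the example used.

```python
import numpy as np
from scipy import stats

# Summary statistics from the battery example (new vs. old batteries)
mean_new, sd_new, n_new = 120, 12, 80
mean_old, sd_old, n_old = 115, 15, 80

# Welch two-sample t statistic for H0: the new battery lasts no longer than the old one
se = np.sqrt(sd_new**2 / n_new + sd_old**2 / n_old)
t_stat = (mean_new - mean_old) / se

# Welch-Satterthwaite approximation for the degrees of freedom
df = se**4 / ((sd_new**2 / n_new)**2 / (n_new - 1) + (sd_old**2 / n_old)**2 / (n_old - 1))

# One-sided p-value: probability of a t statistic at least this large under H0
p_value = stats.t.sf(t_stat, df)
print(f"t = {t_stat:.2f}, df = {df:.0f}, p = {p_value:.3f}")  # roughly 0.011
```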



Statistics LibreTexts

9.8: The Observed Significance of a Test


Learning Objectives

  • To learn what the observed significance of a test is.
  • To learn how to compute the observed significance of a test.
  • To learn how to apply the \(p\)-value approach to hypothesis testing.

The Observed Significance

The conceptual basis of our testing procedure is that we reject \(H_0\) only if the data that we obtained would constitute a rare event if \(H_0\) were actually true. The level of significance α specifies what is meant by “rare.” The observed significance of the test is a measure of how rare the value of the test statistic that we have just observed would be if the null hypothesis were true. That is, the observed significance of the test just performed is the probability that, if the test were repeated with a new sample, the result of the new test would be at least as contrary to \(H_0\) and in support of \(H_a\) as what was observed in the original test.

Definition: observed significance

The observed significance or \(p\)-value of a specific test of hypotheses is the probability, on the supposition that \(H_0\) is true, of obtaining a result at least as contrary to \(H_0\) and in favor of \(H_a\) as the result actually observed in the sample data.

Think back to "Example 8.2.1", Section 8.2 concerning the effectiveness of a new pain reliever. This was a left-tailed test in which the value of the test statistic was \(-1.886\). To be as contrary to \(H_0\) and in support of \(H_a\) as the result \(Z=-1.886\) actually observed means to obtain a value of the test statistic in the interval \((-\infty ,-1.886]\). Rounding \(-1.886\) to \(-1.89\), we can read directly from Figure 7.1.5 that \(P(Z\leq -1.89)=0.0294\). Thus the \(p\)-value or observed significance of the test in "Example 8.2.1", Section 8.2 is \(0.0294\) or about \(3\%\). Under repeated sampling from this population, if \(H_0\) were true then only about \(3\%\) of all samples of size \(50\) would give a result as contrary to \(H_0\) and in favor of \(H_a\) as the sample we observed. Note that the probability \(0.0294\) is the area of the left tail cut off by the test statistic in this left-tailed test.

Analogous reasoning applies to a right-tailed or a two-tailed test, except that in the case of a two-tailed test being as far from \(0\) as the observed value of the test statistic but on the opposite side of \(0\) is just as contrary to \(H_0\) as being the same distance away and on the same side of \(0\), hence the corresponding tail area is doubled.

Computational Definition of the Observed Significance of a Test of Hypotheses

The observed significance of a test of hypotheses is the area of the tail of the distribution cut off by the test statistic (times two in the case of a two-tailed test).
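In code, this computational definition amounts to evaluating a tail area of the test statistic's distribution. A sketch with SciPy's standard normal functions, using the pain-reliever statistic quoted above:

```python
from scipy import stats

# Observed significance (p-value) as a tail area of the standard normal distribution.
z = -1.886   # test statistic from the pain-reliever example

p_left  = stats.norm.cdf(z)           # left-tailed test: area to the left of z
p_right = stats.norm.sf(z)            # right-tailed test: area to the right of z
p_two   = 2 * stats.norm.sf(abs(z))   # two-tailed test: the tail area doubled
print(f"left = {p_left:.4f}, right = {p_right:.4f}, two-tailed = {p_two:.4f}")
```

The left-tailed value is about 0.03, matching the 0.0294 read from the table after rounding the statistic to −1.89.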

Example \(\PageIndex{1}\)

Compute the observed significance of the test performed in "Example 8.2.2", Section 8.2.

The value of the test statistic was \(z=2.490\), which by Figure 7.1.5 cuts off a tail of area \(0.0064\), as shown in Figure \(\PageIndex{1}\). Since the test was two-tailed, the observed significance is \(2\times 0.0064=0.0128\).


The p-value Approach to Hypothesis Testing

In "Example 8.2.1", Section 8.2 the test was performed at the \(5\%\) level of significance: the definition of “rare” event was probability \(\alpha =0.05\) or less. We saw above that the observed significance of the test was \(p=0.0294\) or about \(3\%\). Since \(p=0.0294<0.05=\alpha\) (or \(3\%\) is less than \(5\%\)), the decision turned out to be to reject: what was observed was sufficiently unlikely to qualify as an event so rare as to be regarded as (practically) incompatible with \(H_0\).

In "Example 8.2.2", Section 8.2 the test was performed at the \(1\%\) level of significance: the definition of “rare” event was probability \(\alpha =0.01\) or less. The observed significance of the test was computed in "Example \(\PageIndex{1}\)" as \(p=0.0128\) or about \(1.3\%\). Since \(p=0.0128>0.01=\alpha\) (or \(1.3\%\) is greater than \(1\%\)), the decision turned out to be not to reject. The event observed was unlikely, but not sufficiently unlikely to lead to rejection of the null hypothesis.

The reasoning just presented is the basis for a slightly different but equivalent formulation of the hypothesis testing process. The first three steps are the same as before, but instead of using \(\alpha\) to compute critical values and construct a rejection region, one computes the \(p\)-value \(p\) of the test and compares it to \(\alpha\), rejecting \(H_0\) if \(p\leq \alpha\) and not rejecting if \(p>\alpha\).

Systematic Hypothesis Testing Procedure: p -Value Approach

  • Identify the null and alternative hypotheses.
  • Identify the relevant test statistic and its distribution.
  • Compute from the data the value of the test statistic.
  • Compute the \(p\)-value of the test.
  • Compare the value computed in Step 4 to significance level α and make a decision: reject \(H_0\) if \(p\leq \alpha\) and do not reject \(H_0\) if \(p>\alpha\). Formulate the decision in the context of the problem, if applicable.

Example \(\PageIndex{2}\)

The total score in a professional basketball game is the sum of the scores of the two teams. An expert commentator claims that the average total score for NBA games is \(202.5\). A fan suspects that this is an overstatement and that the actual average is less than \(202.5\). He selects a random sample of \(85\) games and obtains a mean total score of \(199.2\) with standard deviation \(19.63\). Determine, at the \(5\%\) level of significance, whether there is sufficient evidence in the sample to reject the expert commentator’s claim.

  • Step 1 . Let \(\mu\) be the true average total game score of all NBA games. The relevant test is \[H_0: \mu =202.5\\ \text{vs}\\ H_a: \mu <202.5\; @\; \alpha =0.05 \nonumber \]
  • Step 2 . The sample is large and the population standard deviation is unknown. Thus the test statistic is \[Z=\frac{\bar{x}-\mu _0}{s/\sqrt{n}} \nonumber \] and has the standard normal distribution.
  • Step 3 . Inserting the data into the formula for the test statistic gives \[Z=\frac{\bar{x}-\mu _0}{s/\sqrt{n}}=\frac{199.2-202.5}{19.63/\sqrt{85}}=-1.55 \nonumber \]
  • Step 4 . The area of the left tail cut off by \(z=-1.55\) is, by Figure 7.1.5, \(0.0606\), as illustrated in Figure \(\PageIndex{2}\). Since the test is left-tailed, the \(p\)-value is just this number, \(p=0.0606\).
  • Step 5 . Since \(p=0.0606>0.05=\alpha\), the decision is not to reject \(H_0\). In the context of the problem our conclusion is:

The data do not provide sufficient evidence, at the \(5\%\) level of significance, to conclude that the average total score of NBA games is less than \(202.5\).
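The same five steps can also be checked numerically; here is a sketch that reproduces the z statistic and left-tailed p-value for this example (SciPy's normal distribution is assumed in place of the printed table):

```python
import numpy as np
from scipy import stats

# NBA total-score example: H0: mu = 202.5 vs Ha: mu < 202.5 at alpha = 0.05
xbar, mu0, s, n, alpha = 199.2, 202.5, 19.63, 85, 0.05

z = (xbar - mu0) / (s / np.sqrt(n))
p_value = stats.norm.cdf(z)                # left-tailed test: area to the left of z
print(f"z = {z:.2f}, p = {p_value:.4f}")   # about -1.55 and 0.0606
print("reject H0" if p_value <= alpha else "do not reject H0")
```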


Example \(\PageIndex{3}\)

Mr. Prospero has been teaching Algebra II from a particular textbook at Remote Isle High School for many years. Over the years students in his Algebra II classes have consistently scored an average of \(67\) on the end of course exam (EOC). This year Mr. Prospero used a new textbook in the hope that the average score on the EOC test would be higher. The average EOC test score of the \(64\) students who took Algebra II from Mr. Prospero this year had mean \(69.4\) and sample standard deviation \(6.1\). Determine whether these data provide sufficient evidence, at the \(1\%\) level of significance, to conclude that the average EOC test score is higher with the new textbook.

  • Step 1 . Let \(\mu\) be the true average score on the EOC exam of all Mr. Prospero’s students who take the Algebra II course with the new textbook. The natural statement that would be assumed true unless there were strong evidence to the contrary is that the new book is about the same as the old one. The alternative, which it takes evidence to establish, is that the new book is better, which corresponds to a higher value of \(\mu\). Thus the relevant test is \[H_0: \mu =67\\ \text{vs}\\ H_a: \mu >67\; @\; \alpha =0.01 \nonumber \]
  • Step 2 . The sample is large and the population standard deviation is unknown, so, as before, the test statistic is \[Z=\frac{\bar{x}-\mu _0}{s/\sqrt{n}} \nonumber \] and it has the standard normal distribution.
  • Step 3 . Inserting the data into the formula for the test statistic gives \[Z=\frac{\bar{x}-\mu _0}{s/\sqrt{n}}=\frac{69.4-67}{6.1/\sqrt{64}}=3.15 \nonumber \]
  • Step 4 . The area of the right tail cut off by \(z=3.15\) is, by Figure 7.1.5, \(1-0.9992=0.0008\), as shown in Figure \(\PageIndex{3}\). Since the test is right-tailed, the \(p\)-value is just this number, \(p=0.0008\).
  • Step 5 . Since \(p=0.0008<0.01=\alpha\), the decision is to reject \(H_0\). In the context of the problem our conclusion is:

The data provide sufficient evidence, at the \(1\%\) level of significance, to conclude that the average EOC exam score of students taking the Algebra II course from Mr. Prospero using the new book is higher than the average score of those taking the course from him but using the old book.


Example \(\PageIndex{4}\)

For the surface water in a particular lake, local environmental scientists would like to maintain an average pH level at \(7.4\). Water samples are routinely collected to monitor the average pH level. If there is evidence of a shift in pH value, in either direction, then remedial action will be taken. On a particular day \(30\) water samples are taken and yield average pH reading of \(7.7\) with sample standard deviation \(0.5\). Determine, at the \(1\%\) level of significance, whether there is sufficient evidence in the sample to indicate that remedial action should be taken.

  • Step 1 . Let \(\mu\) be the true average pH level at the time the samples were taken. The relevant test is \[H_0: \mu =7.4\\ \text{vs}\\ H_a: \mu \neq 7.4\; @\; \alpha =0.01 \nonumber \]
  • Step 2 . The sample is large and the population standard deviation is unknown, so, as before, the test statistic is \[Z=\frac{\bar{x}-\mu _0}{s/\sqrt{n}} \nonumber \] and it has the standard normal distribution.
  • Step 3 . Inserting the data into the formula for the test statistic gives \[Z=\frac{\bar{x}-\mu _0}{s/\sqrt{n}}=\frac{7.7-7.4}{0.5/\sqrt{30}}=3.29 \nonumber \]
  • Step 4 . The area of the right tail cut off by \(z=3.29\) is, by Figure 7.1.5, \(1-0.9995=0.0005\), as illustrated in Figure \(\PageIndex{4}\). Since the test is two-tailed, the p-value is the double of this number, p=2×0.0005=0.0010.
  • Step 5 . Since \(p=0.0010<0.01=\alpha\), the decision is to reject \(H_0\). In the context of the problem our conclusion is:

The data provide sufficient evidence, at the \(1\%\) level of significance, to conclude that the average pH of surface water in the lake is different from \(7.4\). That is, remedial action is indicated.
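For the two-tailed pH example, the only change is that the tail area is doubled; a sketch of the same calculation:

```python
import numpy as np
from scipy import stats

# Lake pH example: H0: mu = 7.4 vs Ha: mu != 7.4 at alpha = 0.01
xbar, mu0, s, n, alpha = 7.7, 7.4, 0.5, 30, 0.01

z = (xbar - mu0) / (s / np.sqrt(n))
p_value = 2 * stats.norm.sf(abs(z))        # two-tailed: double the tail area cut off by z
print(f"z = {z:.2f}, p = {p_value:.4f}")   # about 3.29 and 0.0010
print("reject H0" if p_value <= alpha else "do not reject H0")
```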


Key Takeaway

  • The observed significance or \(p\)-value of a test is a measure of how inconsistent the sample result is with \(H_0\) and in favor of \(H_a\).
  • The \(p\)-value approach to hypothesis testing means that one merely compares the \(p\)-value to \(\alpha\) instead of constructing a rejection region.
  • There is a systematic five-step procedure for the \(p\)-value approach to hypothesis testing.


K12 LibreTexts

9.6: Significance Test for a Mean


Significance Testing for Means

Evaluating hypotheses for population means using large samples.

When testing a hypothesis for the mean of a normal distribution, we follow a series of six basic steps:

  • State the null and alternative hypotheses.
  • Choose an α level
  • Set the criterion (critical values) for rejecting the null hypothesis.
  • Compute the test statistic.
  • Make a decision (reject or fail to reject the null hypothesis)
  • Interpret the result

If we reject the null hypothesis we are saying that the difference between the observed sample mean and the hypothesized population mean is too great to be attributed to chance. When we fail to reject the null hypothesis, we are saying that the difference between the observed sample mean and the hypothesized population mean is probable if the null hypothesis is true. Essentially, we are willing to attribute this difference to sampling error.

The school nurse was wondering if the average height of 7th graders has been increasing. Over the last 5 years, the average height of a 7th grader was 145 cm with a standard deviation of 20 cm. The school nurse takes a random sample of 200 students and finds that the average height this year is 147 cm. Conduct a single-tailed hypothesis test using a .05 significance level to evaluate the null and alternative hypotheses.

First, we develop our null and alternative hypotheses:

H0: μ = 145

Ha: μ > 145

Choose α = .05. The critical value for this one-tailed test is 1.64. Any test statistic greater than 1.64 will be in the rejection region.

Next, we calculate the test statistic for the sample of 7th graders.

z = (x̄ − μ0)/(σ/√n) = (147 − 145)/(20/√200) ≈ 1.414

The calculated z-score of 1.414 is smaller than 1.64 and thus does not fall in the critical region. Our decision is to fail to reject the null hypothesis and to conclude that a sample mean of 147 could plausibly have arisen by chance if the population mean is 145.
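A sketch of this critical-value approach in Python (using SciPy for the normal quantile; the numbers are exactly those of the school-nurse example):

```python
import numpy as np
from scipy import stats

# School-nurse example: H0: mu = 145 vs Ha: mu > 145, sigma = 20, n = 200, alpha = 0.05
xbar, mu0, sigma, n, alpha = 147, 145, 20, 200, 0.05

z = (xbar - mu0) / (sigma / np.sqrt(n))
z_crit = stats.norm.ppf(1 - alpha)      # one-tailed critical value, about 1.64
print(f"z = {z:.3f}, critical value = {z_crit:.2f}")
print("reject H0" if z > z_crit else "fail to reject H0")
```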

Testing a Mean Hypothesis Using P-values

We can also test a mean hypothesis using p-values. The following examples show how to do this.

A sample of size 157 is taken from a normal distribution, with a standard deviation of 9. The sample mean is 65.12. Use the 0.01 significance level to test the claim that the population mean is greater than 65.

We always put equality in the null hypothesis, so our claim will be in the alternative hypothesis:

H0: μ = 65

HA: μ > 65

The test statistic is:

z = (x̄ − μ0)/(σ/√n) = (65.12 − 65)/(9/√157) ≈ 0.17

Now we will find the probability of observing a test statistic at least this extreme, assuming the null hypothesis is true. Since our alternative hypothesis is that the mean is greater, we want to find the probability of z-scores that are greater than our test statistic. The p-value we are looking for is:

p-value=P(z>0.17)=1−P(z<0.17)

Using a z-score table:

p-value = P(z > 0.17) = 1 − P(z < 0.17) = 1 − 0.5675 = 0.4325 > 0.01

The probability of observing a test statistic at least as big as z = 0.17 is 0.4325. Since this is greater than our significance level, 0.01, we fail to reject the null hypothesis. This means that the data does not support the claim that the mean is greater than 65.
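The same right-tailed p-value can be computed directly rather than read from a z-score table; a sketch with SciPy:

```python
import numpy as np
from scipy import stats

# Right-tailed z-test: H0: mu = 65 vs HA: mu > 65, with sigma = 9, n = 157, alpha = 0.01
xbar, mu0, sigma, n = 65.12, 65, 9, 157

z = (xbar - mu0) / (sigma / np.sqrt(n))
p_value = stats.norm.sf(z)                 # P(Z > z), the right-tail area
print(f"z = {z:.2f}, p = {p_value:.4f}")   # about 0.17 and 0.43, so fail to reject at 0.01
```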

Testing a Mean Hypothesis When the Population Standard Deviation is Known

We can also use the standard normal distribution, or z-scores, to test a mean hypothesis when the population standard deviation is known. The next two examples, though they have a smaller sample size, have a known population standard deviation.

1. A sample of size 50 is taken from a normal distribution, with a known population standard deviation of 26. The sample mean is 167.02. Use the 0.05 significance level to test the claim that the population mean is less than 170.

H0: μ = 170

HA: μ < 170

z = (x̄ − μ0)/(σ/√n) = (167.02 − 170)/(26/√50) ≈ −0.81

p-value = P(z < −0.81) = 1 − 0.791 = 0.209 > 0.05

The probability of observing a test statistic at least as extreme as z = −0.81 is 0.209. Since this is greater than our significance level, 0.05, we fail to reject the null hypothesis. This means that the data do not support the claim that the mean is less than 170.

2. A sample of size 20 is taken from a normal distribution, with a known population standard deviation of 0.04. The sample mean is 0.194. Use the 0.01 significance level to test the claim that the population mean is equal to 0.22.

We always put equality in the null hypothesis, so our claim will be in the null hypothesis. There is no reason to do a left or right tailed test, so we will do a two tailed test:

H0: μ = 0.22

HA: μ ≠ 0.22

z = (x̄ − μ0)/(σ/√n) = (0.194 − 0.22)/(0.04/√20) ≈ −2.91

Now we will find the probability of observing a test statistic at least this extreme when assuming the null hypothesis. Since our alternative hypothesis is that the mean is not equal to 0.22, we need to find the probability of being less than -2.91, and we also need to find the probability of being greater than positive 2.91. However, since the normal distribution is symmetric, these probabilities will be the same, so we can find one and multiply it by 2:

p-value = 2⋅P(z < −2.91) = 2⋅0.0018 = 0.0036 < 0.01

The probability of observing a test statistic at least as extreme as z=−2.91 is 0.0036. Since this is less than our significance level, 0.01, we reject the null hypothesis. This means that the data does not support the claim that the mean is equal to 0.22.

A sample of size 36 is taken from a normal distribution, with a known population standard deviation of 57. The sample mean is 988.93. Use the 0.05 significance level to test the claim that the population mean is less than 1000.

We always put equality in the null hypothesis, so our claim will be in the alternative hypothesis:

H0: μ = 1000

HA: μ < 1000

z = (x̄ − μ0)/(σ/√n) = (988.93 − 1000)/(57/√36) ≈ −1.17

Now we will find the probability of observing a test statistic at least this extreme when assuming the null hypothesis. Since our alternative hypothesis is that the mean is less than 1000, we need to find the probability of z scores less than -1.17:

p-value=P(z<−1.17)=0.1210>0.05

The probability of observing a test statistic at least as extreme as z=−1.17 is 0.1210. Since this is greater than our significance level, 0.05, we fail to reject the null hypothesis. This means that the data does not support the claim that the mean is less than 1000.

  • True or False: When we fail to reject the null hypothesis, we are saying that the difference between the observed sample mean and the hypothesized population mean is probable if the null hypothesis is true.
  • What would the null and alternative hypotheses be for this scenario?
  • What would the standard error be for this particular scenario?
  • Describe in your own words how you would set the critical regions and what they would be at an alpha level of .05.
  • Test the null hypothesis and explain your decision
  • A one-tailed or two-tailed test
  • .05 or .01 level of significance
  • A sample size of n=144 or n=444
  • A coal miner claims that the mean amount of coal mined per day is more than 30,000 pounds. A random sample of 150 days finds that the mean amount of coal mined is 20,000 pounds with a standard deviation of 1,000. Test the claim at the 5% level of significance.
  • A high school teacher claims that the average time a student spends on math homework is less than one hour. A random sample of 250 students is drawn and the mean time spent on math homework in this sample was 45 minutes with a standard deviation of 10. Test the teacher’s claim at the 1% level of significance.
  • A student claims that the average time spent studying for a statistics exam is 1.5 hours. A random sample of 200 students is drawn and the sample mean is 150 minutes with a standard deviation of 15. Test the claim at the 10% level of significance.

For problems 7-14 , IQ tests are designed to have a standard deviation of 15 points. They are intended to have a mean of 100 points. For the following data on scores for the new IQ tests, test the claim that their mean is equal to 100. Use 0.05 significance level.

  • n=107,x̄=94.77
  • n=56,x̄=109.0012
  • n=17,x̄=100.13
  • n=37,x̄=78.92
  • n=72,x̄=98.73
  • n=10,x̄=103.34
  • n=80,x̄=98.38
  • n=150,x̄=108.89

For 15-16, find the p-value. Explain whether you will reject or fail to reject based on the p-value.

  • Test the claim that the mean is greater than 27, if n=101,x̄=26.99,σ=5
  • Test the claim that the mean is less than 10,000, if n=81,x̄=9941.06,σ=1000

Review (Answers)

To view the Review answers, open this PDF file and look for section 8.4.

Additional Resources

Video: Z Test for Mean

Practice: Significance Test for a Mean

Real World: Paying Attention to Heredity


Tests of Significance: Process, Example and Type

A test of significance is a process for comparing observed data with a claim (also called a hypothesis) whose truth is being assessed in further analysis. Let’s learn about tests of significance, the null hypothesis, and significance testing below.

Tests of Significance in Statistics

In technical terms, a test of significance measures the probability that the outcome of a statistical test or experiment could have occurred by chance rather than reflecting a real effect. The ultimate goal of statistical research is to reveal the truth. In doing so, the researcher has to make sure that the sample is of good quality, the error is minimal, and the measures are precise. These requirements are met through several stages of the study, and the researcher needs to know whether the experimental outcomes result from a proper study process or merely from chance.

The sample size is the main factor determining the probability that an observed result could occur by chance alone, without any real effect. The evidence may be weak or strong depending on the level of statistical significance obtained, and its bearing on the conclusions may or may not make a practical difference. If a researcher is careless with the language used in reporting the experiment, the significance of the study can easily be misinterpreted.

Significance Testing

Statistics involves assessing whether a result obtained from an experiment is important enough to be taken seriously. For this purpose there are well-defined tests of significance; which test is appropriate depends on the type of data and the question being asked.

These tests guard against being misled by certain levels of error. The study designer is usually called upon to predefine the acceptable probability of sampling error in the initial stage of the experiment. Because a sample never studies the whole population, sampling error always exists, and testing for significance is therefore an essential part of statistical research.

Null Hypothesis

Every test for significance starts with a null hypothesis H 0 . H 0 represents a theory that has been suggested, either because it’s believed to be true or because it’s to be used as a basis for argument, but has not been proved. For example, during a clinical trial of a new drug, the null hypothesis could be that the new drug is no better, on average, than the current drug. We would write H 0 : there is no difference between the two drugs on average.

Process of Significance Testing

In the process of testing for statistical significance, the following steps must be taken:

Step 1: Start by coming up with a research idea or question for your study.

Step 2: Formulate a null hypothesis, the neutral comparison to test your research hypothesis against.

Step 3: Decide on the level of certainty you need for your results, that is, choose the significance level (α) before analyzing the data.

Step 4: Choose the appropriate statistical test and use it to analyze your data.

Step 5: Understand and explain what your results mean in the context of your research question.

Types of Errors

There are basically two types of errors:

Type I Error

Type II Error

Now let’s learn about these errors in detail.

A Type I error occurs when the researcher concludes that the presumed relationship exists when in fact it does not: the null hypothesis H 0 is rejected even though it is true, and the research hypothesis is wrongly accepted. The probability of committing a Type I error is α (alpha).

A Type II error is the reverse: the researcher fails to detect a relationship that actually exists. The null hypothesis is retained even though it is false, and the research hypothesis is wrongly dismissed. The probability of committing a Type II error, an error of omission, is denoted β (beta).

Statistical Tests

One-tailed and two-tailed statistical tests help determine how significant a finding is in a set of data.

When we think that a parameter might change in one specific direction from a baseline, we use a one-tailed test. For example, if we’re testing whether a new drug makes people perform better, we might only care if it improves performance, not if it makes it worse.

On the flip side, a two-tailed test comes into play when changes could go in either direction from the baseline. For instance, if we’re studying the effect of a new teaching method on test scores, we’d want to know if it makes scores better or worse, so we’d use a two-tailed test.

Types of Statistical Tests

Hypothesis testing can be done using either a one-tailed or a two-tailed statistical test. The purpose of these tests is to obtain the probability with which a parameter estimate from a given data set would arise by chance, and hence to judge whether it is statistically significant.

  • A one-tailed test is used when departures of the parameter estimate from the benchmark value in only one direction are considered plausible.
  • A two-tailed test is applied when deviations on either side of the benchmark value are considered possible.

The term “tail” is used because the observations that lead to rejecting the null hypothesis lie in the extreme ends of the distribution, the regions that “tail off” in a bell-shaped (normal) curve. Whether a one-tailed or a two-tailed test should be applied in a study is decided from the research hypothesis.

What is p-Value Testing?

For assessing the significance of data, the p-value is a further key term in hypothesis testing. It is computed from the observed sample result under the assumption that the null hypothesis is true. The threshold value must be determined before the test is started: this is the significance level, traditionally 1% or 5%, and it is denoted α.

If the p-value is less than or equal to α, there is an inconsistency between the null model and the data: the null hypothesis is rejected, and the alternative hypothesis may be regarded as supported. If the p-value is greater than α, we fail to reject the null hypothesis.

Example on Test of Significance

Some examples of test of significance are added below:

Example 1: T-Test in Medical Research

For example, consider a medical study investigating whether a new drug reduces blood pressure. The researchers predict that patients taking the new drug will show a larger decrease in blood pressure than participants given a placebo. They collect data from two groups: one group receives the experimental drug and the other receives the placebo.

Researchers apply a t-test to the data to estimate the difference between the two (assumed normal) populations and to judge whether it is statistically significant. The null hypothesis (H0) states that there is no significant difference in blood pressure between the two groups, while the alternative hypothesis (H1) states that there is a significant difference. Using the t-test, they can check whether the observed difference is statistically significant and thereby reduce the chance of drawing a mistaken conclusion.
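A sketch of such a two-sample t-test with SciPy; the blood-pressure reductions below are invented placeholder numbers, since the example does not provide data:

```python
import numpy as np
from scipy import stats

# Simulated reductions in blood pressure (mmHg), for illustration only.
rng = np.random.default_rng(1)
drug_group = rng.normal(loc=12.0, scale=5.0, size=30)
placebo_group = rng.normal(loc=5.0, scale=5.0, size=30)

# Independent two-sample t-test of H0: no difference in mean reduction between groups
t_stat, p_value = stats.ttest_ind(drug_group, placebo_group)
print(f"t = {t_stat:.2f}, p = {p_value:.4g}")
```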

Example 2: Chi-Square Analysis in Market Research

Consider a market research study that examines the link between customer satisfaction (satisfied, dissatisfied, or neutral) and product preference (Product A, Product B, or Product C). A chi-square test is used to check whether there is a substantial association between these two categorical variables.

The null hypothesis (H0) states that customer satisfaction and product preference are unrelated, while the alternative hypothesis (H1) states that they are related. By running the chi-square test on the gathered data, the researchers can find out whether the observed association between customer satisfaction and product preference is statistically significant, and thus draw conclusions about how customer satisfaction relates to product preference in the target market.
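A sketch of the chi-square test of independence with SciPy; the contingency table below is made up for illustration:

```python
import numpy as np
from scipy import stats

# Hypothetical counts: rows = satisfaction level, columns = preferred product (A, B, C).
observed = np.array([
    [40, 25, 15],   # satisfied
    [20, 30, 20],   # neutral
    [10, 15, 25],   # dissatisfied
])

# Chi-square test of H0: satisfaction and product preference are independent
chi2, p_value, dof, expected = stats.chi2_contingency(observed)
print(f"chi2 = {chi2:.2f}, dof = {dof}, p = {p_value:.4f}")
```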

Example 3: ANOVA in Educational Research

Consider a researcher who is studying whether different teaching methods differ in their effect on students' achievement. The null hypothesis (H0) asserts that there is no difference in mean scores across the groups, while the alternative hypothesis (HA) claims that at least one group has a different mean. Using Analysis of Variance (ANOVA), the researcher determines whether there is a statistically significant difference in performance across the teaching methods.
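A sketch of a one-way ANOVA with SciPy; the exam scores are simulated stand-ins for the three teaching methods:

```python
import numpy as np
from scipy import stats

# Simulated exam scores under three teaching methods (placeholder data).
rng = np.random.default_rng(2)
method_a = rng.normal(loc=70, scale=8, size=25)
method_b = rng.normal(loc=74, scale=8, size=25)
method_c = rng.normal(loc=69, scale=8, size=25)

# One-way ANOVA of H0: all group means are equal
f_stat, p_value = stats.f_oneway(method_a, method_b, method_c)
print(f"F = {f_stat:.2f}, p = {p_value:.4f}")
```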

Example 4: Regression Analysis in Economics

In an economic study, researchers examine the connection between advertising expenditure and revenue for a group of businesses that have recently disclosed their financial results. The null hypothesis proposes that there is no linear relationship between advertising spending and sales.

Regression analysis is used to determine whether changes in sales are attributable to changes in advertising to a statistically significant degree, that is, whether the slope of the regression line is significantly different from zero.
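A sketch of a simple linear regression where the reported p-value tests whether the slope differs from zero; the spending and sales figures are fabricated for illustration:

```python
import numpy as np
from scipy import stats

# Fabricated advertising spend and sales (both in $1000s) for 12 businesses.
rng = np.random.default_rng(3)
ad_spend = rng.uniform(10, 100, size=12)
sales = 50 + 2.5 * ad_spend + rng.normal(0, 20, size=12)

# Simple linear regression; result.pvalue tests H0: slope = 0
result = stats.linregress(ad_spend, sales)
print(f"slope = {result.slope:.2f}, p = {result.pvalue:.4g}")
```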

Example 5: Paired T-Test in Psychology

A psychologist conducts a study to find out whether a new type of therapy reduces anxiety. Patients' anxiety levels are evaluated before the intervention begins and again immediately after it.

The null hypothesis claims that there is no noticeable difference between the pre-intervention and post-intervention anxiety levels. Using a paired t-test on the anxiety scores collected before and after the intervention, the psychologist can assess whether the observed change in scores is statistically significant.
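A sketch of the paired t-test with SciPy; the before/after anxiety scores are simulated stand-ins:

```python
import numpy as np
from scipy import stats

# Simulated anxiety scores before and after therapy for the same 20 patients.
rng = np.random.default_rng(4)
before = rng.normal(loc=60, scale=10, size=20)
after = before - rng.normal(loc=8, scale=5, size=20)   # invented average improvement

# Paired t-test of H0: the mean change in anxiety score is zero
t_stat, p_value = stats.ttest_rel(before, after)
print(f"t = {t_stat:.2f}, p = {p_value:.4g}")
```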

Test of Significance – FAQs

What is a test of significance?

A test of significance is a process for comparing observed data with a claim (also called a hypothesis) whose truth is being assessed in further analysis.

What is a statistical significance test?

A result is statistically significant when it is unlikely to have arisen by chance alone, implying that some underlying cause is associated with the data. Statistical significance matters in any field or profession that relies heavily on numbers and research, such as finance, economics, investing, medicine, and biology.

What is the meaning of a test of significance?

Statistical significance tests determine whether the differences found in assessment data are merely due to random errors arising from sampling or reflect a real effect. Differences attributable to sampling error alone should not be over-interpreted.

What is the importance of the Significance test?

Significance tests have real applied value in experiments: they help researchers draw conclusions about whether the data support the null hypothesis or not, and therefore about whether the alternative hypothesis is supported.

How many types of Significance tests are there in statistical mathematics?

In statistics, we have tests such as the t-test, Z-test, chi-square test, ANOVA, binomial test, and median test, among others. Data that do not satisfy the assumptions of parametric tests can be analyzed with non-parametric tests.

How does choosing a significance level (α) influence the interpretation of the tests?

The significance level α sets the threshold that the p-value must fall below for the null hypothesis to be rejected. A smaller α means a stricter threshold: false positives are limited, but there may be an increase in false negatives.

Is significance testing limited to parametric methods, such as the comparison of two means, or can it also be applied to non-parametric data?

Significance testing is not limited to parametric methods; it can be adapted to both parametric and non-parametric data. Non-parametric tests, for instance the Mann-Whitney U test and the Wilcoxon signed-rank test, are often applied when the data do not meet the assumptions of parametric tests.





  15. Understanding P-Values and Statistical Significance

    In statistical hypothesis testing, you reject the null hypothesis when the p-value is less than or equal to the significance level (α) you set before conducting your test. The significance level is the probability of rejecting the null hypothesis when it is true. Commonly used significance levels are 0.01, 0.05, and 0.10.

  16. Null hypothesis significance testing: a short tutorial

    Fisher, significance testing, and the p-value. The method developed by ( Fisher, 1934; Fisher, 1955; Fisher, 1959) allows to compute the probability of observing a result at least as extreme as a test statistic (e.g. t value), assuming the null hypothesis of no effect is true.This probability or p-value reflects (1) the conditional probability of achieving the observed outcome or larger: p(Obs ...

  17. An Explanation of P-Values and Statistical Significance

    If the p-value of a hypothesis test is sufficiently low, we can reject the null hypothesis. Specifically, when we conduct a hypothesis test, we must choose a significance level at the outset. Common choices for significance levels are 0.01, 0.05, and 0.10. If the p-values is less than our significance level, then we can reject the null hypothesis.

  18. Statistical hypothesis test

    Hypothesis testing can mean any mixture of two formulations that both changed with time. Any discussion of significance testing vs hypothesis testing is doubly vulnerable to confusion. Fisher thought that hypothesis testing was a useful strategy for performing industrial quality control, ...

  19. 9.8: The Observed Significance of a Test

    The p-value Approach to Hypothesis Testing. In "Example 8.2.1", Section 8.2 the test was performed at the \(5\%\) level of significance: the definition of "rare" event was probability \(\alpha =0.05\) or less. We saw above that the observed significance of the test was \(p=0.0294\) or about \(3\%\).

  20. 9.6: Significance Test for a Mean

    The sample mean is 0.194. Use the 0.01 significance level to test the claim that the population mean is equal to 0.22. We always put equality in the null hypothesis, so our claim will be in the null hypothesis. There is no reason to do a left or right tailed test, so we will do a two tailed test: H 0 :μ=0.22.

  21. Tests of Significance: Process, Example and Type

    Process of Significance Testing. In the process of testing for statistical significance, the following steps must be taken: Step 1: Start by coming up with a research idea or question for your thesis. Step 2: Create a neutral comparison to test against your hypothesis.

  22. 8.2: The Observed Significance of a Test

    The p-value Approach to Hypothesis Testing. In "Example 8.2.1", Section 8.2 the test was performed at the \(5\%\) level of significance: the definition of "rare" event was probability \(\alpha =0.05\) or less. We saw above that the observed significance of the test was \(p=0.0294\) or about \(3\%\).