Guide to Essential BioStatistics IV: Statistical Significance, Power and Effect Size
In this fourth article in the LabCoat Guide to BioStatistics series, we learn about Statistical Significance, Power and Effect Size.
In the previous articles in this series, we explored the Scientific Method and Proposing Hypotheses, as well as Type-I and Type-II errors.
Future articles will cover: Designing and implementing experiments (Significance, Power, Effect, Variance, Replication and Randomization), Critically evaluating experimental data (Q-test; SD, SE and 95%CI), and Concluding whether to accept or reject the hypothesis (F- and T-tests, Chi-square, ANOVA and post-ANOVA testing).
Statistical Significance, Power and Effect Size
The significance level of an experiment, called α (alpha), is the probability of rejecting our Null hypothesis even though it is true, thus concluding that there is a difference between treatments even though there is none. This matters in biological trials, where we as researchers may be biased toward concluding that our experiments support the hoped-for effect – the alternative hypothesis.
Let us consider our proposed experiment, in which plants sprayed with a phytotoxic insecticide plus an added herbicide safener are compared to plants sprayed with the same phytotoxic insecticide without the safener. Our experimental data reveal that of the 20 plants sprayed without the safener, all 20 show phytotoxicity, while 19 of the 20 plants sprayed with the safener show phytotoxicity. Clearly, the single symptom-free plant could result from chance, and there would be no reason to reject the Null hypothesis.
If, however, 19 of the safener-treated plants showed no sign of phytotoxicity, common sense would lead us to conclude that this was extremely unlikely to have happened by chance, and we would reject the Null hypothesis and conclude that the treatment had an effect. But what if only nine plants showed no sign of phytotoxicity? Here, more than common sense is required to judge the likelihood of this occurring by chance: we need to calculate the probability of obtaining such a result if the Null hypothesis were true, and on that basis decide whether to reject it and conclude that the safener treatment has a significant effect.
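To put a number on that likelihood, a statistical test is needed. The following is a minimal sketch in Python, assuming the SciPy library is available and using Fisher's exact test on the counts from the hypothetical example above (one of several tests that could be applied to counts like these):

```python
# Sketch: how likely is the observed split if the safener has no effect at all?
# The counts are the hypothetical ones from the example above (20 plants per group).
from scipy.stats import fisher_exact

no_safener = [20, 0]    # 20 phytotoxic, 0 symptom-free
with_safener = [11, 9]  # 11 phytotoxic, 9 symptom-free

odds_ratio, p_value = fisher_exact([no_safener, with_safener], alternative="two-sided")
print(f"p-value = {p_value:.4f}")
# A p-value below the chosen significance level (see below) would lead us
# to reject the Null hypothesis and conclude that the safener has an effect.
```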
Significance
Stated in statistical terms, the typical significance level (alpha) of 0.05 (5%) corresponds to a 95% confidence level – we are 95% confident that we will not make a Type-I error (a false positive) by rejecting the Null hypothesis despite its being true. In other words, only 5% of our experiments will give us a false positive.
This threshold is termed the “critical value”. If the test statistic falls below the critical value (in this case, at α = 0.05), we accept the Null hypothesis (H0); if it falls above the critical value, we reject the Null hypothesis.
Figure 1: Significance level as determined by the critical value. If the test statistic falls below the critical value, we accept the null hypothesis (H0). α (alpha) represents the probability of making a Type-I error: rejecting our Null-hypothesis even though it is true (false positive).
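The decision rule in Figure 1 can be sketched in a few lines of Python, assuming (for illustration only) that the test statistic follows a standard normal distribution and that a two-sided test is used; the observed statistic below is a hypothetical value:

```python
# Sketch of the decision rule in Figure 1, assuming a two-sided z-test.
from scipy.stats import norm

alpha = 0.05
critical_value = norm.ppf(1 - alpha / 2)  # approx. 1.96 for a two-sided 5% test

test_statistic = 2.3                      # hypothetical observed value

if abs(test_statistic) > critical_value:
    print("Test statistic exceeds the critical value: reject H0")
else:
    print("Test statistic is below the critical value: accept H0")
```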
The 5% significance level is a threshold arbitrarily decided on by biologists, and it is the threshold most commonly used in the Biological Sciences. Depending on the trial circumstances, i.e. what the data are to be used for, a 1% or 10% significance level may be more appropriate.
▶︎ A biological rule of thumb is that a 5% significance level is considered appropriate, meaning that we can be 95% confident that there is in fact a difference between treatments.
If the cost of a false positive is high (for example, leading to a strategic decision to initiate a costly development process), the threshold for critical trials may be made more stringent (e.g. 1%).
Conversely, if the cost of a false negative is high (leading to a valuable discovery being missed), the threshold for initial screening experiments may be relaxed to e.g. 10%.
If the probability (the p-value, or “p”) of obtaining the observed test statistic by chance, assuming the Null hypothesis is true, is less than α (5%), we may reject the Null hypothesis. This is commonly phrased as “a small p-value indicates that the Null hypothesis is not a good explanation for the data – we can be 95% confident that the alternative hypothesis is true”.
This brings us to one of the most hotly debated topics in biological statistics – the misuse of the p-value. Statisticians argue that it is incorrect to transpose the observation of a “5% chance of getting the observed results if the Null hypothesis is correct” into an observation of a “95% probability that the Null hypothesis is false”, “95% certainty that the observed difference is real and could not have arisen by chance”, or “the difference is statistically significant”.
It is beyond the scope of this guide to expand on the debate, but suffice it to say that the above transposition is commonly used in the biological sciences, and the following paragraphs should be read with this in mind.
It is common practice in most of the sciences to provide a p-value resulting from a statistical test and on this basis to conclude that the results are significant or not significant.
As seen above, the 5% significance level is a threshold arbitrarily decided on by biologists. Almost immediately, it appears, biologists (probably those who had p < 0.06 datasets!) began to argue that fixed significance levels were too restrictive for biological data sets and devised a system of graduated levels of “statistical significance”.
Table 1: p-values and associated levels of “statistical significance”.
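A sketch of such a graduated scheme is shown below; the labels and thresholds are the conventional ones and are assumed here, so the exact wording in Table 1 may differ:

```python
# Conventional graduated significance labels (assumed; Table 1 may differ).
def significance_label(p: float) -> str:
    if p < 0.001:
        return "*** (highly significant)"
    if p < 0.01:
        return "** (very significant)"
    if p < 0.05:
        return "* (significant)"
    return "ns (not significant)"

for p in (0.2, 0.04, 0.008, 0.0004):
    print(f"p = {p}: {significance_label(p)}")
```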
This tendency to equate p-values with significance should be tempered by considering whether the observed differences are substantively important. For example, a small difference in efficacy between two treatments may be statistically significant according to the above (hypothesis p-value) definition, yet too small to be commercially meaningful (biologically significant).
For this reason, statisticians typically recommend estimating the effect size and evaluating confidence intervals for the differences between means, to emphasize the magnitude of the effect rather than a simple significant/non-significant verdict from hypothesis testing. These parameters are discussed in the following sections.
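A minimal sketch of that recommendation, using hypothetical efficacy data and SciPy, is to report the difference between treatment means together with a 95% confidence interval rather than only a significant/non-significant verdict:

```python
# Sketch: estimate the difference between two treatment means with a 95%
# confidence interval. The efficacy values below are hypothetical.
import numpy as np
from scipy import stats

treatment_a = np.array([82.0, 85.0, 79.0, 88.0, 84.0])  # % efficacy, treatment A (hypothetical)
treatment_b = np.array([75.0, 78.0, 80.0, 74.0, 77.0])  # % efficacy, treatment B (hypothetical)

diff = treatment_a.mean() - treatment_b.mean()

# Welch's standard error and Welch-Satterthwaite degrees of freedom
va = treatment_a.var(ddof=1) / len(treatment_a)
vb = treatment_b.var(ddof=1) / len(treatment_b)
se = np.sqrt(va + vb)
df = (va + vb) ** 2 / (va**2 / (len(treatment_a) - 1) + vb**2 / (len(treatment_b) - 1))

t_crit = stats.t.ppf(0.975, df)
print(f"difference in means = {diff:.1f} percentage points")
print(f"95% CI = ({diff - t_crit * se:.1f}, {diff + t_crit * se:.1f})")
```

If the whole interval lies above the smallest difference that is commercially or biologically meaningful, the result is both statistically and practically relevant.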
Power or Sensitivity of a Test
The power or sensitivity of a test may be used to determine the appropriate sample size for a test or experiment. Before planning an experiment, researchers must consider which level of statistical power is sufficient to ensure that the test is sensitive enough to identify the difference between treatments.
Stated in statistical terms, Power (1 − β) is the probability of correctly rejecting the Null hypothesis when it is false, i.e. of identifying a significant effect when such an effect exists.
▶︎ A biological rule of thumb is that a Power of 80% is considered appropriate, meaning that when a real difference in efficacy between the treatments exists, there is only a 20% chance (β) of erroneously concluding that there was none.
Figure 2: Power (or sensitivity of a test) is the probability of correctly rejecting the Null hypothesis when it is false and identifying a significant effect when such an effect exists. α (alpha) represents the probability of making a Type-I error: rejecting our Null-hypothesis even though it is true (false positive). β (beta) represents the probability of making a Type-II error: accepting our Null-hypothesis even though it is false (false negative).
The trade-off is that the principal way to increase statistical power is to increase the number of replicates per treatment, with the number required depending on the variability of the biological material. This can require so many replicates that the trial becomes economically and practically infeasible.
Researchers thus need to calculate the smallest sample size that will permit a statistically viable experiment. To do this, the desired statistical power and significance level must be defined, the effect size to be detected must be specified, and the variability (variance) within the experimental setup must be determined.
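A minimal sketch of such a sample-size calculation, assuming the final analysis will be a two-sample t-test and using the statsmodels library (the effect size, significance level and power below are illustrative assumptions):

```python
# Sketch: smallest number of replicates per treatment that gives 80% power,
# assuming a two-sample t-test, alpha = 0.05 and a standardized effect size
# (Cohen's d) of 0.8. All values are illustrative assumptions.
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()
n_per_group = analysis.solve_power(effect_size=0.8,        # standardized difference between means
                                   alpha=0.05,             # significance level
                                   power=0.80,             # desired power (1 - beta)
                                   alternative="two-sided")
print(f"replicates needed per treatment: {n_per_group:.1f}")
```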
Effect Size
Statisticians define effect size as the minimum acceptable deviation from the Null hypothesis; it is a quantitative measure of the magnitude of the difference between the two groups (H0 and H1). For crop protection researchers, effect size (or treatment effect) is often defined as the minimum improvement in efficacy needed to justify the costs of developing a new pesticide formulation.
▶︎ A biological rule of thumb is that an effect size (treatment effect or improvement in efficacy) of 20% is considered economically viable.
Figure 3: Effect size is a measure of the magnitude of the difference between H0 and H1 and is often defined as the minimum improvement in efficacy needed to justify the costs of developing a new product.
As we will see shortly, effect size can have a substantial influence on the number of replicates (sample size) needed to perform the experiment. The reality (typically dictated by budget) is a trade-off between cost and space – it is not always feasible to include the requisite number of replicates in, for example, a greenhouse experiment.
While larger sample sizes increase power and decrease estimation error, increasing the effect size to be detected (larger differences are easier to confirm as significant) makes it possible to execute a study within the constraints of a feasible sample size while retaining sufficient power.
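As a sketch of that trade-off, under the same assumptions as the power calculation above (two-sample t-test, α = 0.05, 80% power), the required sample size drops sharply as the standardized effect size grows:

```python
# Sketch: required replicates per treatment shrink as the standardized effect
# size grows (two-sample t-test, alpha = 0.05, 80% power; illustrative values).
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()
for d in (0.2, 0.5, 0.8, 1.2):   # small, medium, large and very large effects
    n = analysis.solve_power(effect_size=d, alpha=0.05, power=0.80,
                             alternative="two-sided")
    print(f"effect size d = {d}: about {n:.0f} replicates per treatment")
```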
With this information on Experimental Parameters, together with the Scientific Method and Hypotheses presented in the previous articles, we are ready to move on to the next article in this series: Experimental Design.
The first two books in the LABCOAT GUIDE TO CROP PROTECTION series are now published and available in eBook and Print formats!
Aimed at students, professionals, and others wishing to understand basic aspects of Pesticide and Biopesticide Mode Of Action & Formulation and Strategic R&D Management, this series is an easily accessible introduction to essential principles of Crop Protection Development and Research Management.
A little about myself
I am a Plant Scientist with a background in Molecular Plant Biology and Crop Protection.
20 years ago, I worked at Copenhagen University and the University of Adelaide on plant responses to biotic and abiotic stress in crops.
At that time, biology-based crop protection strategies had not taken off commercially, so I transitioned to conventional (chemical) crop protection R&D at Cheminova, later FMC.
During this period, public opinion, as well as increasing regulatory requirements, gradually closed the door of opportunity for conventional crop protection strategies, while the biological crop protection technology I had contributed to earlier began to reach commercial viability.