The problem with P values (and why it matters to evidence-based medicine)

By Naveed Saleh, MD, MS | Medically reviewed by Kristen Fuller, MD

Published October 26, 2022

Key Takeaways

Although P values can make or break the publication of significant findings, researchers contend that they are flawed when used alone.
P values don’t reflect effect sizes and can yield false positives. Furthermore, researchers can tailor their methods to produce desirable P values.
P values should be considered in the context of effect sizes and confidence intervals.

P values have long been considered the gold standard for judging quantitative results in clinical trials of treatment interventions, but that may be changing.

The effectiveness of P values, when used alone, has been called into question by the American Statistical Association, which wrote that they aren’t a good measure of evidence when exploring a model or hypothesis and that, on their own, they can yield false positives and don’t reflect the magnitude of an effect.

Should researchers and clinicians reconsider use of P values?

What is a P value?

A P value is the probability of obtaining the observed data, or data more extreme, if the null hypothesis is true. A null hypothesis is a statistical hypothesis proposing that no effect or difference exists in a set of given observations.

A P value less than 0.05 is typically considered statistically significant, in which case the null hypothesis is rejected. A P value greater than 0.05 means that the deviation from the null hypothesis is not statistically significant, and the null hypothesis is not rejected.
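As a rough illustration of this definition, the following Python sketch computes a one-sided P value for a made-up example: 15 of 20 patients improve, and the null hypothesis assumes each patient has a 50% chance of improving. The function name, the numbers, and the choice of an exact binomial test are assumptions for illustration and do not come from any study cited here.

```python
from math import comb

def binomial_p_value(successes, trials, null_prob=0.5):
    """One-sided P value: chance of seeing at least `successes`
    out of `trials` if the null hypothesis (null_prob) is true."""
    return sum(
        comb(trials, k) * null_prob**k * (1 - null_prob) ** (trials - k)
        for k in range(successes, trials + 1)
    )

# Hypothetical data: 15 of 20 patients improve; null says 50% would anyway.
p = binomial_p_value(15, 20)
print(f"P = {p:.4f}")  # P = 0.0207
```

Under the conventional 0.05 threshold, this hypothetical result would be called statistically significant, although the calculation says nothing about how large or clinically important the improvement is.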

According to an article published in the American Journal of Pharmaceutical Education (AJPE), “Contrary to popular misconception, it is not the probability that one’s results were obtained by chance, the probability that the null hypothesis is true, or the probability of a false positive result. In fact, the false positive rate associated with a P value of .05 is usually around 30% but can be much higher.”

Many investigators mistakenly treat the P value as the probability that the null hypothesis is true given their data, according to an article published in the Postgraduate Medical Journal. Instead, the Postgraduate Medical Journal authors wrote, the P value is the probability of observing their data assuming that the null hypothesis is true. The null hypothesis itself may be true, or it may be false.

“This common misinterpretation of the P value exaggerates the weight of evidence against the null hypothesis,” the authors wrote. “What we actually need is the false discovery rate, which is the proportion of reported discoveries that are false positives.”
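To see how the false discovery rate can differ from the 0.05 threshold, here is a minimal Python sketch. It assumes, purely for illustration, that 10% of tested hypotheses describe real effects and that studies have 80% power; these inputs and the function name are hypothetical and are not taken from the articles quoted above.

```python
def false_discovery_rate(prior_true, power, alpha=0.05):
    """Share of 'significant' results that are false positives.

    prior_true: assumed fraction of tested hypotheses that are real effects
    power:      assumed chance a real effect reaches significance
    alpha:      significance threshold
    """
    false_positives = alpha * (1 - prior_true)  # null is true, test rejects anyway
    true_positives = power * prior_true         # effect is real, test detects it
    return false_positives / (false_positives + true_positives)

# Hypothetical inputs: 10% of hypotheses are real effects, 80% power.
fdr = false_discovery_rate(prior_true=0.10, power=0.80)
print(f"False discovery rate: {fdr:.0%}")  # False discovery rate: 36%
```

With these assumed inputs, more than a third of significant findings would be false positives even though every test used the 0.05 threshold. Different assumptions about the prior or the power would give different rates.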

False positive rates are a serious concern.

The AJPE article noted that the results of only 38 out of 100 psychology studies could be replicated, suggesting that many of the other 62 may have been false positives. Furthermore, researchers from Bayer were able to replicate only 25% of 67 studies, while Amgen investigators could replicate the results of only 6 out of 53 landmark studies.

Limitations of P values

One issue with P values is that they don’t reflect the magnitude of an effect: with a large enough sample, even the smallest effect can be statistically significant. In an example discussed in the AJPE article, a study of the effect of aspirin on myocardial infarction (MI) yielded P < .00001, yet only one-tenth of 1% of the risk of experiencing an MI could be affected by aspirin.
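The interplay between sample size and significance can be sketched in Python. The example below applies a standard pooled two-proportion z-test to the same small difference in event rates (1.0% versus 0.9%) at two trial sizes. The event rates, trial sizes, and function name are hypothetical and are not the aspirin data.

```python
from math import erfc, sqrt

def two_proportion_test(events_a, n_a, events_b, n_b):
    """Return (risk difference, two-sided P value) from a pooled z-test."""
    rate_a = events_a / n_a
    rate_b = events_b / n_b
    pooled = (events_a + events_b) / (n_a + n_b)
    std_err = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (rate_a - rate_b) / std_err
    # Two-sided tail area of the standard normal distribution.
    p_value = erfc(abs(z) / sqrt(2))
    return rate_a - rate_b, p_value

# Same hypothetical 0.1-percentage-point difference, two trial sizes.
for n in (1_000, 1_000_000):
    diff, p = two_proportion_test(n // 100, n, 9 * n // 1000, n)
    print(f"n per arm = {n:>9,}: risk difference = {diff:.3f}, P = {p:.2g}")
```

The risk difference is identical in both runs. Only the sample size changes, yet the smaller trial is far from significant while the larger one falls well below 0.05.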

Ethical concerns can result when P values are considered the be-all and end-all of publication.

Investigators may intentionally seek small P values by performing a study multiple times but reporting only the successful attempts, the AJPE researchers wrote.

Investigators can also amass several variables but report only on those with statistically significant effects, expunge outliers, or adjust screening criteria after a trial begins. Variables can be recast via merging, splitting, or transformation. Tests for significance can also be done before an experiment is complete, with experiments truncated once results prove to be significant.

“All of these practices are ethically dubious, and can harm the replicability of one’s results,” the authors of the AJPE article wrote.
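A short Python sketch shows why testing many variables and reporting only the significant ones is a problem. It assumes the tests are independent and that no real effect exists for any variable. Both assumptions, and the function name, are simplifications introduced here for illustration.

```python
def chance_of_false_positive(num_tests, alpha=0.05):
    """Probability that at least one of `num_tests` independent tests
    is 'significant' when every null hypothesis is actually true."""
    return 1 - (1 - alpha) ** num_tests

for k in (1, 5, 10, 20):
    print(f"{k:>2} tests: {chance_of_false_positive(k):.2f}")
# Prints 0.05, 0.23, 0.40 and 0.64 for 1, 5, 10 and 20 tests.
```

Under these assumptions, an investigator who examines 20 unrelated variables has roughly a two-in-three chance of finding at least one P value below 0.05 by chance alone.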

A 2019 editorial in the New England Journal of Medicine (NEJM) discussed how “editors and statistical consultants have become increasingly concerned about the overuse and misinterpretation of significance testing and P values in the medical literature.”

"Along with their strengths, P values are subject to inherent weaknesses."

— Harrington, et al., New England Journal of Medicine

Moving forward

P values should be considered within the context of other variables including means, standard deviations, confidence intervals (CIs), effect sizes, and R² (ie, the coefficient of determination).

Expressing results in terms of effect sizes and CIs offers a more robust analysis and represents the statistical significance and the size of treatment effects.
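As one example of reporting an effect size with a CI, the Python sketch below computes a risk difference and an approximate 95% CI using the simple Wald (normal approximation) method. The event counts and the function name are hypothetical, and other interval methods may be preferable for small samples or rare events.

```python
from math import sqrt

def risk_difference_ci(events_treated, n_treated, events_control, n_control, z=1.96):
    """Return (risk difference, lower bound, upper bound) of a Wald CI.
    z=1.96 corresponds to an approximate 95% confidence level."""
    risk_t = events_treated / n_treated
    risk_c = events_control / n_control
    diff = risk_t - risk_c
    std_err = sqrt(
        risk_t * (1 - risk_t) / n_treated + risk_c * (1 - risk_c) / n_control
    )
    return diff, diff - z * std_err, diff + z * std_err

# Hypothetical trial: 30 of 200 treated patients and 45 of 200 controls have the event.
diff, low, high = risk_difference_ci(30, 200, 45, 200)
print(f"Risk difference: {diff:.3f} (95% CI {low:.3f} to {high:.3f})")
```

In this made-up trial the estimated reduction is 7.5 percentage points, and the interval runs from a reduction of about 15 points to essentially no effect. The interval conveys both the size of the effect and the uncertainty around it, which a lone P value would not.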

Even though treatment effects don’t offer a dichotomous determination, they can be deemed significant or not significant with null hypothesis significance testing.

“There are many equations and complex concepts for CIs and effect sizes, we should understand the exact meanings of these estimates and should use them appropriately when interpreting and describing statistical results,” the authors of a review in the Korean Journal of Anesthesiology wrote.

“A well-designed randomized or observational study will have a primary hypothesis and a prespecified method of analysis, and the significance level from that analysis is a reliable indicator of the extent to which the observed data contradict a null hypothesis of no association between an intervention or an exposure and a response,” wrote the authors of the NEJM editorial.

“Clinicians and regulatory agencies must make decisions about which treatment to use or to allow to be marketed, and P values interpreted by reliably calculated thresholds subjected to appropriate adjustments have a role in those decisions,” the NEJM authors concluded.

What this means for you

Although P values are considered the gold standard of statistical results, researchers contend that these measures are flawed, saying they don’t reflect effect sizes and can yield false positives. P values should be considered in terms of effect sizes and confidence intervals to overcome some of their potential limitations.