Family wise error rate (Ofer Abarbanel online library)

In statistics, the family-wise error rate (FWER) is the probability of making one or more false discoveries (type I errors) when performing multiple hypothesis tests.
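As a rough illustration of how quickly this probability grows, the sketch below computes the FWER for the special case in which all m null hypotheses are true and the m tests are independent, each run at level alpha. The function name and the independence assumption are illustrative choices, not part of the definition above.

```python
def fwer_independent(alpha: float, m: int) -> float:
    """FWER when m independent tests of true nulls are each run at level alpha."""
    # Under independence, P(no false rejection at all) = (1 - alpha) ** m,
    # so the probability of at least one false rejection is the complement.
    return 1.0 - (1.0 - alpha) ** m


if __name__ == "__main__":
    for m in (1, 5, 20, 100):
        print(m, round(fwer_independent(0.05, m), 3))
```

Under these assumptions, 20 tests at the 0.05 level already give a family-wise error rate of roughly 0.64.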


Tukey coined the terms “experimentwise error rate” and “error rate per-experiment” to indicate error rates that the researcher could use as a control level in a multiple hypothesis experiment.[citation needed]


Within the statistical framework, there are several definitions for the term “family”:

  • Hochberg & Tamhane defined “family” in 1987 as “any collection of inferences for which it is meaningful to take into account some combined measure of error”.[1]
  • According to Cox in 1982, a set of inferences should be regarded as a family:[citation needed]
  1. To take into account the selection effect due to data dredging
  2. To ensure simultaneous correctness of a set of inferences so as to guarantee a correct overall decision

To summarize, a family is best defined by the potential selective inference being faced: a family is the smallest set of items of inference in an analysis, interchangeable in their meaning for the goal of the research, from which results could be selected for action, presentation or highlighting (Yoav Benjamini).[citation needed]


A procedure controls the FWER in the strong sense if the FWER control at level α is guaranteed for any configuration of true and non-true null hypotheses (whether the global null hypothesis is true or not).[3]
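The requirement can be written out as follows. The notation is an assumption introduced here for illustration: m is the number of hypotheses, I_0 is the index set of the true null hypotheses, and V is the number of true null hypotheses that the procedure rejects.

```latex
% Strong control: the bound holds for every configuration of true nulls.
\[
  \mathrm{FWER} \;=\; \Pr\nolimits_{I_0}(V \ge 1) \;\le\; \alpha
  \qquad \text{for every } I_0 \subseteq \{1, \dots, m\}.
\]
% For contrast, control only under the global null (all m nulls true),
% usually called weak control, requires the bound for a single configuration:
\[
  \Pr\nolimits_{I_0}(V \ge 1) \;\le\; \alpha
  \qquad \text{for } I_0 = \{1, \dots, m\}.
\]
```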

Resampling procedures

The procedures of Bonferroni and Holm control the FWER under any dependence structure of the p-values (or equivalently the individual test statistics). Essentially, this is achieved by accommodating a “worst-case” dependence structure (which is close to independence for most practical purposes). Such an approach is conservative, however, if the dependence is actually positive. To give an extreme example, under perfect positive dependence there is effectively only one test, and thus the FWER is not inflated.
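For concreteness, the two procedures can be sketched as adjusted p-values, where a hypothesis is rejected at level alpha when its adjusted p-value is at most alpha. This is a minimal pure-Python sketch; the function names and the example p-values are illustrative, not taken from the cited sources.

```python
def bonferroni_adjust(pvalues):
    """Bonferroni-adjusted p-values: multiply each p-value by m, cap at 1."""
    m = len(pvalues)
    return [min(1.0, m * p) for p in pvalues]


def holm_adjust(pvalues):
    """Holm step-down adjusted p-values, returned in the input order."""
    m = len(pvalues)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, i in enumerate(order):
        # The k-th smallest p-value (rank = k - 1) is multiplied by m - k + 1.
        candidate = min(1.0, (m - rank) * pvalues[i])
        # Taking the running maximum keeps the adjusted values monotone.
        running_max = max(running_max, candidate)
        adjusted[i] = running_max
    return adjusted


if __name__ == "__main__":
    pvals = [0.01, 0.02, 0.03, 0.005]
    print("Bonferroni:", bonferroni_adjust(pvals))
    print("Holm:      ", holm_adjust(pvals))
```

With these example p-values and alpha = 0.05, Bonferroni rejects two hypotheses while Holm rejects all four. Holm's adjusted p-values are never larger than Bonferroni's.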

Accounting for the dependence structure of the p-values (or of the individual test statistics) produces more powerful procedures. This can be achieved by applying resampling methods, such as bootstrapping and permutation methods. The procedure of Westfall and Young (1993) requires a certain condition that does not always hold in practice (namely, subset pivotality).[6] The procedures of Romano and Wolf (2005a,b) dispense with this condition and are thus more generally valid.[7][8]
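To give a flavour of the resampling idea, here is a minimal sketch of a single-step max-statistic permutation adjustment for a two-group comparison, in the spirit of Westfall and Young. It is a simplified illustration, not the exact procedure from the cited works. The function name, the choice of the absolute difference in means as test statistic, and the default number of permutations are all assumptions made here. In practice a standardized statistic (such as a t-statistic) is usually preferable, so that high-variance measurements do not dominate the maximum.

```python
import random
from statistics import mean


def maxt_adjusted_pvalues(group_a, group_b, n_perm=2000, seed=0):
    """Single-step max-statistic permutation-adjusted p-values.

    group_a, group_b: lists of observations, each observation being a list
    of m measurements (one per hypothesis).
    """
    rng = random.Random(seed)
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    m = len(pooled[0])

    def stats(rows_a, rows_b):
        # One statistic per hypothesis: absolute difference in group means.
        return [
            abs(mean(r[j] for r in rows_a) - mean(r[j] for r in rows_b))
            for j in range(m)
        ]

    observed = stats(group_a, group_b)
    exceed = [0] * m
    for _ in range(n_perm):
        shuffled = pooled[:]
        # Whole observations are permuted, so the dependence between the
        # m measurements is preserved in every resample.
        rng.shuffle(shuffled)
        max_stat = max(stats(shuffled[:n_a], shuffled[n_a:]))
        for j in range(m):
            if max_stat >= observed[j]:
                exceed[j] += 1
    # The +1 terms count the observed labelling as one of the permutations.
    return [(count + 1) / (n_perm + 1) for count in exceed]
```

Each hypothesis is compared against the permutation distribution of the maximum statistic over all hypotheses, which is how the joint dependence structure enters the adjustment.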

Alternative approaches

FWER control is more stringent than false discovery rate (FDR) control. FWER control limits the probability of at least one false discovery, whereas FDR control limits (in a loose sense) the expected proportion of false discoveries. Thus, FDR procedures have greater power at the cost of increased rates of type I errors, i.e., rejecting null hypotheses that are actually true.[11]
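The difference in stringency can be seen on a small example. The sketch below compares the Holm procedure (FWER control) with the Benjamini-Hochberg step-up procedure, chosen here as one common FDR-controlling procedure; the text above does not name a specific FDR method, and the function names and p-values are illustrative.

```python
def holm_rejections(pvalues, alpha=0.05):
    """Indices rejected by the Holm step-down procedure (FWER control)."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    rejected = []
    for rank, i in enumerate(order):
        # Step down: stop at the first p-value above alpha / (m - rank).
        if pvalues[i] > alpha / (m - rank):
            break
        rejected.append(i)
    return sorted(rejected)


def bh_rejections(pvalues, alpha=0.05):
    """Indices rejected by the Benjamini-Hochberg step-up procedure (FDR control)."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    last = 0
    for rank, i in enumerate(order, start=1):
        # Step up: remember the largest rank k with p_(k) <= alpha * k / m.
        if pvalues[i] <= alpha * rank / m:
            last = rank
    return sorted(order[:last])


if __name__ == "__main__":
    pvals = [0.001, 0.012, 0.02, 0.03, 0.2]
    print("Holm rejects:", holm_rejections(pvals))
    print("BH rejects:  ", bh_rejections(pvals))
```

On these p-values at alpha = 0.05, Holm rejects the first two hypotheses while Benjamini-Hochberg rejects the first four, illustrating the gain in power that comes with the weaker error criterion.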

On the other hand, FWER control is less stringent than per-family error rate control, which limits the expected number of errors per family. Because FWER control is concerned only with whether at least one false discovery occurs, it does not, unlike per-family error rate control, treat multiple simultaneous false discoveries as any worse than a single one. The Bonferroni correction is often regarded as merely controlling the FWER, but in fact it also controls the per-family error rate.[12]
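A short derivation makes both statements concrete. The notation is introduced here for illustration: V is the number of false rejections, m the number of hypotheses, I_0 the index set of the true null hypotheses with m_0 = |I_0|, and p_i the p-value of hypothesis i, assumed valid in the sense that Pr(p_i ≤ t) ≤ t under a true null.

```latex
% The per-family error rate bounds the FWER (Markov's inequality,
% since V takes non-negative integer values):
\[
  \mathrm{FWER} \;=\; \Pr(V \ge 1) \;\le\; \mathbb{E}[V] \;=\; \mathrm{PFER}.
\]
% Bonferroni rejects hypothesis i when p_i <= alpha / m, so by linearity
% of expectation, with no assumption on the dependence between tests:
\[
  \mathrm{PFER} \;=\; \sum_{i \in I_0} \Pr\!\left(p_i \le \frac{\alpha}{m}\right)
  \;\le\; m_0 \,\frac{\alpha}{m} \;\le\; \alpha .
\]
```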


  1. Hochberg, Y.; Tamhane, A. C. (1987). Multiple Comparison Procedures. New York: Wiley. p. 5. ISBN 978-0-471-82222-6.
  2. Dmitrienko, Alex; Tamhane, Ajit; Bretz, Frank (2009). Multiple Testing Problems in Pharmaceutical Statistics (1 ed.). CRC Press. p. 37. ISBN 9781584889847.
  3. Dmitrienko, Alex; Tamhane, Ajit; Bretz, Frank (2009). Multiple Testing Problems in Pharmaceutical Statistics (1 ed.). CRC Press. p. 37. ISBN 9781584889847.
  4. Aickin, M; Gensler, H (1996). “Adjusting for multiple testing when reporting research results: the Bonferroni vs Holm methods”. American Journal of Public Health. 86 (5): 726–728. doi:10.2105/ajph.86.5.726. PMC 1380484. PMID 8629727.
  5. Hochberg, Yosef (1988). “A Sharper Bonferroni Procedure for Multiple Tests of Significance” (PDF). Biometrika. 75 (4): 800–802. doi:10.1093/biomet/75.4.800.
  6. Westfall, P. H.; Young, S. S. (1993). Resampling-Based Multiple Testing: Examples and Methods for p-Value Adjustment. New York: John Wiley. ISBN 978-0-471-55761-6.
  7. Romano, J.P.; Wolf, M. (2005a). “Exact and approximate stepdown methods for multiple hypothesis testing”. Journal of the American Statistical Association. 100 (469): 94–108. doi:10.1198/016214504000000539. hdl:10230/576.
  8. Romano, J.P.; Wolf, M. (2005b). “Stepwise multiple testing as formalized data snooping”. Econometrica. 73 (4): 1237–1282. CiteSeerX doi:10.1111/j.1468-0262.2005.00615.x.
  9. Good, I J (1958). “Significance tests in parallel and in series”. Journal of the American Statistical Association. 53 (284): 799–813. doi:10.1080/01621459.1958.10501480. JSTOR 2281953.
  10. Wilson, D J (2019). “The harmonic mean p-value for combining dependent tests”. Proceedings of the National Academy of Sciences USA. 116 (4): 1195–1200. doi:10.1073/pnas.1814092116. PMC 6347718. PMID 30610179.
  11. Shaffer, J. P. (1995). “Multiple hypothesis testing”. Annual Review of Psychology. 46: 561–584. doi:10.1146/ hdl:10338.dmlcz/142950.
  12. Frane, Andrew (2015). “Are per-family Type I error rates relevant in social and behavioral science?”. Journal of Modern Applied Statistical Methods. 14 (1): 12–23. doi:10.22237/jmasm/1430453040.

