Samuel Pawel and Leonhard Held’s contribution to the Discussion of ‘Safe Testing’ by Grünwald, de Heide, and Koolen

Journal of the Royal Statistical Society. Series B: Statistical Methodology · 2024

Samuel Pawel (corresponding author)
Leonhard Held

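Summary

This article discusses limitations of the ‘safe testing’ framework, noting that it may not resolve cognitive biases or model misspecification, and proposes plots of the e-value function in place of traditional p-values to promote gradual, quantitative inference.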

Abstract

We congratulate the authors on their paper introducing a ‘safe’ inference framework. However, despite their supposed safety, we have reservations about whether ‘safe’ tests can make the scientific enterprise more reliable. For instance, researchers may still prefer classical non-sequential over sequential study designs, as the latter can be difficult to implement in practice. For example, interim analyses in randomized clinical trials will require unblinding and may as such threaten the integrity of the trial (Ellenberg et al., 2019, Chapter 5). In addition, it has been argued that practical problems in the use of traditional p-values (rather than e-values) stem more from cognitive factors than from the choice of inferential statistics (Gelman, 2016). For example, an excessive focus on testing parameter values of ‘no effect’ while ignoring other relevant parameter values (‘nullism’), unnecessary dichotomization of quantitative information (‘dichotomania’), or treating statistical models as known physical laws rather than speculative assumptions (‘statistical reification’) can all lead to distorted interpretations of study results (Greenland, 2017). Especially with respect to the last point, we believe that naming a statistical test as ‘safe’ can mislead researchers into thinking that the method is ‘always valid’, when this is only true in a certain technical sense and under certain speculative assumptions (e.g. that the underlying data model is correctly specified). For instance, in meta-analysis, a promising application according to the authors, inferences based on e-values will not be ‘safe’ unless the analysis accounts for possible bias, for example, publication bias (Cooper et al., 2019, Chapter 18). In this case, can e-values be adjusted for potential model misspecification (Copas & Eguchi, 2005)? 
The evidence interpretation of e-values as Bayes factors is more aligned with our view, since Bayes factors quantify relative predictive performance of competing hypotheses without claims about ‘safe’ error rates, yet challenges with nullistic and dichotomous thinking remain. If conventional p-values and confidence intervals were replaced by e-value counterparts, how should results be reported to address these challenges? For testing of one- and two-dimensional parameters, we propose a graphical display inspired by p-value functions and confidence curves (Birnbaum, 1961; Cox, 1958), and their Bayes factor analogues (Pawel et al., 2024). The idea is to plot the e-value as a function of the tested parameter value; see Figure 1 for an illustration. Like the p-value function, this shifts emphasis away from nullistic-dichotomous to gradual-quantitative inferences.

Figure 1. Example of an e-value function, i.e. the e-value displayed as a function of the tested parameter value. The graph provides a wealth of information as e-values and conservative p-values (for any tested parameter value), conservative confidence intervals (at any level of interest), and point estimates (the parameter values with the least evidence against them) can be easily read off.

In summary, while ‘safe’ tests offer potential advantages, they are not always safe to use and can be affected by the same cognitive issues as other methods. Careful application and contextual interpretation are paramount to their success.
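
To make the e-value function proposal concrete, the following is a minimal sketch of how such a curve might be computed; it is not the construction behind the authors' Figure 1. It assumes a normally distributed estimate `xbar` with known standard error `se`, and it builds the e-value as a likelihood ratio against a normal mixture centred at the tested value with standard deviation `tau`. The model, the mixture, and all function names are illustrative assumptions. The conservative p-value is taken as min(1, 1/e), and the conservative confidence set at level 1 - alpha as the tested values with e < 1/alpha.

```python
import math


def e_value(theta0, xbar, se, tau):
    """E-value for H0: theta = theta0 (assumed model: xbar ~ N(theta, se^2)).

    Assumption: the alternative is a N(theta0, tau^2) mixture over theta,
    so xbar is marginally N(theta0, se^2 + tau^2) under the alternative.
    The ratio of that marginal density to the null density has expectation
    1 under H0, which makes it an e-value.
    """
    v_null = se ** 2
    v_alt = se ** 2 + tau ** 2
    log_e = 0.5 * math.log(v_null / v_alt) \
        + 0.5 * (xbar - theta0) ** 2 * (1.0 / v_null - 1.0 / v_alt)
    return math.exp(log_e)


def e_value_function(grid, xbar, se, tau):
    """Evaluate the e-value at every tested parameter value in `grid`."""
    return [e_value(theta0, xbar, se, tau) for theta0 in grid]


def conservative_p(e):
    """Conservative p-value implied by an e-value (Markov's inequality)."""
    return min(1.0, 1.0 / e)


def confidence_set(grid, evalues, alpha):
    """Range of grid values not rejected at level alpha, i.e. with e < 1/alpha."""
    kept = [t for t, e in zip(grid, evalues) if e < 1.0 / alpha]
    return (min(kept), max(kept)) if kept else None


def point_estimate(grid, evalues):
    """Grid value with the least evidence against it (smallest e-value)."""
    return min(zip(evalues, grid))[1]


# Hypothetical numbers, chosen only for illustration.
xbar, se, tau = 0.3, 0.1, 0.5
grid = [i / 100 for i in range(-50, 101)]  # tested values from -0.5 to 1.0
evalues = e_value_function(grid, xbar, se, tau)

print("e-value at theta0 = 0:", e_value(0.0, xbar, se, tau))
print("conservative p-value at theta0 = 0:", conservative_p(e_value(0.0, xbar, se, tau)))
print("conservative 95% confidence set:", confidence_set(grid, evalues, alpha=0.05))
print("point estimate:", point_estimate(grid, evalues))
```

Plotting `evalues` against `grid` (typically on a logarithmic vertical axis) would give the kind of display described above. A horizontal line at 1/alpha then marks the confidence set, and the minimum of the curve marks the point estimate.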

Keywords: statistics; hypothesis testing; scientific methodology; meta-analysis
