QUICK REVIEW

[Paper Review] Sequential Computation of p-values based on (Re-)Sampling with a Guaranteed Error Bound

Axel Gandy|arXiv (Cornell University)|Dec 17, 2006
Bayesian Modeling and Causal Inference | 1 citation
TL;DR

This paper proposes a sequential simulation procedure for computing p-values via resampling that guarantees, with high probability, the correct decision relative to a threshold (e.g., 0.05), even when exact p-values are intractable. By adaptively determining the number of samples needed and stopping early when sufficient evidence is gathered, it reduces computational cost while ensuring reproducibility and correctness.

ABSTRACT

When explicit forms of p-values are not available or cannot be evaluated efficiently, e.g. in the case of a bootstrap test, one usually resorts to simulation. Especially when a simulation step is computationally expensive it is of interest to draw a small number of samples. This article introduces a sequential procedure to evaluate the p-value using simulation. It guarantees that, up to a small error probability, the computed p-value is on the same side of a threshold, e.g. 0.05, as the theoretical p-value. This is important to guarantee that the results are reproducible. The procedure is open-ended, i.e. a maximum number of samples is not prespecified. By often being able to stop early, considerable computing time is being saved. The sequential procedure is suitable for use as standard algorithm for computing p-values based on (re-)sampling.

Motivation & Objective

  • To address the challenge of computing p-values in complex statistical tests where analytical forms are unavailable or infeasible to compute.
  • To reduce computational cost in resampling-based inference, especially when each simulation step is expensive.
  • To ensure that the simulated p-value is on the same side of a significance threshold (e.g., 0.05) as the true p-value with high probability.
  • To develop a method that is open-ended and stops early when sufficient evidence is accumulated, improving efficiency.
  • To provide a reliable, reproducible standard algorithm for p-value computation in resampling procedures.

Proposed method

  • The method draws resamples one at a time and uses sequential testing principles to decide whether the theoretical p-value lies above or below a critical threshold (e.g., 0.05).
  • It tracks the running count of resampled test statistics that are at least as extreme as the observed one and stops once that count crosses an upper or lower stopping boundary.
  • The procedure bounds the probability that the simulated p-value falls on the opposite side of the threshold from the true p-value.
  • Sampling continues until the evidence is strong enough to decide whether the p-value is significant or not, based on predefined error bounds.
  • The algorithm dynamically adjusts the number of resamples based on accumulating evidence, avoiding a fixed, potentially wasteful sample size.
  • It ensures that the final decision (significant or not) matches the theoretical p-value’s classification with high probability.
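The steps above can be illustrated with a minimal sketch. It is not the paper's boundary construction: it swaps in a simpler and more conservative stopping rule, a Hoeffding confidence interval with a union bound over sample sizes, which still keeps the misclassification probability at or below a chosen level. The names `sequential_p_value_side`, `draw_exceeds`, `epsilon` and `max_samples` are hypothetical, and the optional sample cap is a practical safeguard added here, whereas the paper's procedure is open-ended.

```python
import math
import random
from typing import Callable, Optional, Tuple


def sequential_p_value_side(
    draw_exceeds: Callable[[], bool],
    alpha: float = 0.05,
    epsilon: float = 1e-3,
    max_samples: Optional[int] = None,
) -> Tuple[Optional[str], float, int]:
    """Sample until the p-value can be placed above or below alpha.

    draw_exceeds() runs one (re-)sampling step and returns True when the
    simulated test statistic is at least as extreme as the observed one.
    Returns (decision, estimate, n), where decision is "below", "above",
    or None if the optional sample cap was reached first.
    """
    exceed_count = 0
    n = 0
    while max_samples is None or n < max_samples:
        n += 1
        exceed_count += bool(draw_exceeds())
        estimate = exceed_count / n
        # Error budget for step n; epsilon / (n (n + 1)) sums to epsilon over all n.
        eps_n = epsilon / (n * (n + 1))
        # Two-sided Hoeffding half-width that is violated with probability <= eps_n.
        half_width = math.sqrt(math.log(2.0 / eps_n) / (2.0 * n))
        if estimate + half_width < alpha:
            return "below", estimate, n
        if estimate - half_width > alpha:
            return "above", estimate, n
    return None, (exceed_count / n if n else float("nan")), n


# Toy usage (assumption): each draw exceeds the observed statistic with
# probability 0.5, so the true p-value is far above alpha = 0.05.
rng = random.Random(0)
decision, estimate, n = sequential_p_value_side(
    lambda: rng.random() < 0.5, max_samples=10_000
)
```

Because the interval is valid simultaneously for every sample size with probability at least 1 - epsilon, stopping the first time it excludes the threshold yields a wrong decision with probability at most epsilon. The Hoeffding bound is loose for small p-values, so this sketch will usually need more samples than a rule tailored to Bernoulli counts.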

Experimental results

Research questions

  • RQ1: Can a sequential simulation procedure be designed to compute p-values with guaranteed correctness relative to a significance threshold?
  • RQ2: How can computational efficiency be improved in resampling-based inference without sacrificing decision accuracy?
  • RQ3: What error bounds can be maintained to ensure that the simulated p-value is on the same side of the threshold as the true p-value?
  • RQ4: To what extent can early stopping reduce the number of required resamples in practice?
  • RQ5: Can such a method serve as a reliable, default algorithm for p-value computation in resampling frameworks?

Key findings

  • The proposed sequential procedure guarantees that the simulated p-value is on the same side of the significance threshold as the true p-value with high probability.
  • The method often stops early, significantly reducing the number of required resamples compared to fixed-sample approaches.
  • It maintains a controlled error rate in classification (significant vs. not significant), ensuring reproducibility of results.
  • The algorithm is suitable for use as a standard method in resampling-based hypothesis testing due to its reliability and efficiency.
  • By avoiding a prespecified maximum sample size, the method adapts to the data and evidence accumulation, improving computational savings.
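The same-side guarantee in the findings above can be written compactly. The notation is an assumption made here for illustration and is not taken from the abstract: p is the theoretical p-value, \hat{p} the value returned by the sequential procedure, \alpha the threshold (e.g., 0.05), and \epsilon the small error probability that is tolerated.

```latex
\[
  \Pr\nolimits_p\bigl(\hat{p} > \alpha\bigr) \le \epsilon
  \quad \text{for all } p \le \alpha,
  \qquad
  \Pr\nolimits_p\bigl(\hat{p} \le \alpha\bigr) \le \epsilon
  \quad \text{for all } p > \alpha .
\]
```

Read this way, the bound holds whatever the true p-value is, which is what makes a repeated run of the simulation very likely to reach the same significant or not-significant verdict.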

This review was created by AI and reviewed by human editors.