ACL 2026

Cut Your Losses! Learning to Prune Paths Early for Efficient Parallel Reasoning

STOP is a lightweight internal pruning module that reads KV-cache states and filters unpromising reasoning paths early, improving both accuracy and efficiency in parallel reasoning.

Jiaxi Bi, Tongxu Luo, Wenyu Du, Zhengyang Tang, Benyou Wang
The Chinese University of Hong Kong, Shenzhen · Shenzhen Loop Area Institute · USTB · DualityRL

TL;DR

STOP is a lightweight pruning module that identifies doomed reasoning paths from their early prefixes. It is the first instantiation of learnable internal pruning, improving parallel reasoning accuracy while reducing token usage by over 70% in many settings.

  • 70%+: token reduction in many settings
  • 1.5B–20B: model scales evaluated
  • 43: AIMO3 competition score with tool use

Why prune early?

In standard parallel reasoning, every sampled path is generated to completion and then aggregated. But many paths fail due to early mistakes yet consume just as much compute as successful ones. Worse, these failed paths can pollute the final vote.

Key insight. Early-prefix failure is often irreversible. The right question is not how to verify a finished trajectory, but how to stop a bad one early.
Motivation figure
Figure 1. Early errors often lead to irreversible failure; pruning them early both saves compute and improves aggregation quality.

A unified view of path pruning

We organize prior work by two axes: whether the pruning signal comes from internal or external states, and whether it is learnable or not. This taxonomy reveals an unexplored sweet spot: learnable internal pruning.

Taxonomy figure
Figure 2. STOP instantiates Type IV: learnable internal pruning.

Method: STOP

STOP is a lightweight, non-invasive module built on top of a frozen reasoning model. Instead of re-encoding the full text, it appends a short sequence of learnable [STOP] tokens to the cached reasoning prefix, reads the prefix KV cache, and predicts whether the trajectory should be resumed.

  • Launch: generate short prefixes and cache internal states.
  • Check: append [STOP] and score each prefix using a small adapter.
  • Resume: keep only the top-ranked candidates and continue generation.
Why it is efficient. STOP avoids re-encoding the full context; it directly reuses the heavy computation already stored in the prefix's KV cache.
STOP framework
Figure 3. Launch–Check–Resume: cache prefixes, score them with STOP, and resume only the most promising paths.
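The Launch–Check–Resume loop can be sketched as below. This is an illustrative outline, not the paper's implementation: `generate_prefix`, `stop_score`, and `resume` are placeholder callables standing in for the real model calls, where `stop_score` would append the learnable [STOP] tokens and read the cached prefix states.

```python
# Illustrative sketch of Launch-Check-Resume. The three callables are
# placeholders for real model operations; only the control flow is shown.
from typing import Callable, List

def prune_and_resume(
    question: str,
    n_paths: int,
    keep: int,
    generate_prefix: Callable[[str], object],  # returns a prefix state (with its KV cache)
    stop_score: Callable[[object], float],     # scores a cached prefix; higher = more promising
    resume: Callable[[object], str],           # continues generation from the cached state
) -> List[str]:
    # Launch: sample short prefixes and keep their internal states.
    states = [generate_prefix(question) for _ in range(n_paths)]
    # Check: score each cached prefix with the pruning adapter (no re-encoding).
    ranked = sorted(states, key=stop_score, reverse=True)
    # Resume: continue only the top-`keep` candidates to completion.
    return [resume(s) for s in ranked[:keep]]
```

In a real deployment the final answers would then be aggregated (e.g. by majority vote), but over a smaller, higher-quality set of completed paths.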

Key results

STOP consistently improves effectiveness and efficiency across reasoning models from 1.5B to 20B. It also generalizes beyond standard benchmarks to realistic tool-use settings.

Main results table
Main benchmark results. STOP consistently outperforms existing pruning baselines while reducing token usage.

Main takeaway

Setting         | Result
AIME24 (1.5B)   | 30.10 → 37.92
Token usage     | >70% reduction

AIMO3 competition with tool use

Method          | Score
Baseline + Tool | 39
STOP (24→8)     | 42
STOP (16→8)     | 43
The best setting reaches silver-level performance on the public leaderboard.

Scaling law for deployment

Beyond a single method, STOP also provides a practical deployment rule. We derive an empirical guideline that predicts the optimal retention ratio based on compute budget, prefix length, and task length.

Why it matters. Practitioners can choose pruning aggressiveness without exhaustive sweeps, making STOP practical under real resource constraints.
Scaling law figure
Figure 6. The inverse retention ratio follows a stable empirical trend across settings.
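A deployment rule of this shape might be applied as follows. The functional form and the coefficients `a`, `b` below are hypothetical placeholders, not the paper's fitted values; the sketch only shows how a predicted retention ratio could be turned into a concrete top-k for the Resume step.

```python
import math

def retention_ratio(budget_tokens: int, prefix_len: int, task_len: int,
                    a: float = 1.0, b: float = 0.5) -> float:
    """Hypothetical log-linear rule: the inverse retention ratio grows as
    task length outpaces the per-path budget. `a` and `b` are illustrative
    placeholders, NOT the coefficients fitted in the paper."""
    inv = a + b * math.log(1.0 + task_len * prefix_len / budget_tokens)
    return max(0.0, min(1.0, 1.0 / inv))

def paths_to_keep(n_launched: int, budget_tokens: int,
                  prefix_len: int, task_len: int) -> int:
    """Map the predicted ratio onto an integer top-k (always keep >= 1 path)."""
    ratio = retention_ratio(budget_tokens, prefix_len, task_len)
    return max(1, round(n_launched * ratio))
```

The point of such a rule is exactly what the paper argues: once the trend is fitted, practitioners can pick pruning aggressiveness from the budget and task profile instead of sweeping retention ratios per benchmark.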

Abstract

Parallel reasoning enhances Large Reasoning Models (LRMs) but incurs prohibitive costs due to futile paths caused by early errors. To mitigate this, path pruning at the prefix level is essential, yet existing research remains fragmented without a standardized framework. We propose the first systematic taxonomy of path pruning, identify learnable internal pruning as a missing paradigm, and instantiate it with STOP (Super TOken for Pruning). Extensive evaluations across LRMs ranging from 1.5B to 20B demonstrate that STOP improves both effectiveness and efficiency. We further validate its scalability under varying compute budgets and distill these observations into practical deployment guidelines.

Citation

@misc{bi2026cut,
  title={Cut Your Losses! Learning to Prune Paths Early for Efficient Parallel Reasoning},
  author={Bi, Jiaxi and Luo, Tongxu and Du, Wenyu and Tang, Zhengyang and Wang, Benyou},
  year={2026},
  eprint={2604.16029},
  archivePrefix={arXiv},
  primaryClass={cs.CL}
}