On the Step Length Confounding in
LLM Reasoning Data Selection

ACL 2026
Bing Wang1 Rui Miao1 Chen Shen2* Shaotian Yan2 Kaiyuan Liu3 Ximing Li1* Xiaosong Yuan1 Sinan Fan3 Jun Zhang4 Jieping Ye2
1Jilin University, China   2Alibaba Cloud Computing, China   3Zhejiang University, China   4University of Michigan, USA
* Corresponding authors.

Abstract

Large reasoning models have recently demonstrated strong performance on complex tasks that require long chain-of-thought reasoning, achieved through supervised fine-tuning on large-scale, high-quality datasets. To construct such datasets, existing pipelines generate long reasoning data from more capable Large Language Models (LLMs) and apply manual heuristics or naturalness-based selection methods to filter for high-quality samples.

Despite the proven effectiveness of naturalness-based data selection—which ranks data by the average log probability assigned by LLMs—our analysis shows that, when applied to LLM reasoning datasets, it systematically prefers samples with longer reasoning steps (i.e., more tokens per step) rather than higher-quality ones, a phenomenon we term step length confounding. Through quantitative analysis, we attribute this phenomenon to low-probability first tokens in reasoning steps; longer steps dilute their influence, thereby inflating the average log probabilities.

To address this issue, we propose two variant methods: ASLEC-drop, which drops first-token probabilities when computing average log probability, and ASLEC-casl, which applies a causal debiasing regression to remove the first tokens' confounding effect. Experiments across four LLMs and five evaluation benchmarks demonstrate the effectiveness of our approach in mitigating the step length confounding problem.

Motivation: Step Length Confounding

Existing naturalness-based data selection methods compute the average log probability assigned by the target LLM to each training sample, then select samples with the highest scores. While intuitive, our preliminary experiments reveal a critical flaw: these methods consistently prefer samples with longer reasoning steps rather than higher-quality ones.
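Concretely, naturalness-based selection ranks candidates by their mean token log probability under the target LLM and keeps the top scorers. A minimal self-contained sketch (the `avg_logprob` and `select_top_k` helpers and the toy numbers are ours, for illustration only; in practice the log probabilities come from a forward pass of the target LLM):

```python
def avg_logprob(token_logprobs):
    """Naturalness score: mean log probability over all tokens of a sample."""
    return sum(token_logprobs) / len(token_logprobs)

def select_top_k(samples, k):
    """Rank candidate samples by naturalness score and keep the top k."""
    ranked = sorted(samples, key=lambda s: avg_logprob(s["logprobs"]),
                    reverse=True)
    return ranked[:k]

# Toy candidates: each sample carries its per-token log probabilities.
samples = [
    {"id": "a", "logprobs": [-2.0, -0.1, -0.1, -0.1]},  # mean -0.575
    {"id": "b", "logprobs": [-2.0, -0.5, -0.6]},        # mean -1.033
]
print([s["id"] for s in select_top_k(samples, 1)])  # ['a']
```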

★ Conclusion 1.  The naturalness-based reasoning data selection approach tends to prefer samples with longer reasoning steps (i.e., more tokens per step).

As shown in the figure below, for all four representative naturalness-based selection methods (GRACE, Local LP, Min Entropy, Min Perplexity), the selected data consistently exhibits significantly longer step lengths compared to the unselected data. This reveals a systematic step length confounding problem.

Step length distribution

Figure 1. Step length distributions of data selected vs. unselected by four naturalness-based methods. Selected samples exhibit markedly longer steps.

Why Does Step Length Confounding Occur?

To understand the root cause, we examine the relationship between step length and step-level log probability. The figure below shows a clear monotonic increasing relationship: longer reasoning steps consistently receive higher average log probabilities.

Step length vs log probability

Figure 2. Relationship between step-level log probability and step length. A clear monotonic increasing trend is observed across all source LLMs.

The mechanism is as follows: the first token of each reasoning step consistently receives a lower log probability, because it marks the fork point where the model can branch into different lines of reasoning (and thus carries higher entropy). In longer steps, these low-probability first tokens account for a smaller share of the total token count, so their negative contribution is diluted, inflating the overall average log probability.
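This dilution can be reproduced arithmetically. In the hypothetical sketch below, every step is equally "good" per continuation token and only the step length varies; all log-probability values are illustrative assumptions, not measurements from the paper:

```python
# Each hypothetical step starts with a low-probability "fork" token
# (log p = -3.0) followed by high-probability continuation tokens
# (log p = -0.2 each).
FIRST_TOKEN_LP = -3.0
CONT_LP = -0.2

def step_avg_logprob(length):
    """Average log probability of a single step with `length` tokens."""
    return (FIRST_TOKEN_LP + CONT_LP * (length - 1)) / length

short_avg = step_avg_logprob(5)    # fork token is 20% of the step
long_avg = step_avg_logprob(50)    # fork token is only 2% of the step
print(round(short_avg, 3), round(long_avg, 3))  # -0.76 -0.256
```

The 50-token step scores markedly higher than the 5-token step purely because the same fork-token penalty is spread over more tokens, which is exactly the bias a length-agnostic selector inherits.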

★ Conclusion 2.  Step length confounding occurs because the low-probability first token constitutes a smaller ratio of longer responses. This increases average log probabilities, making samples with longer steps more likely to be selected.
Token-level log probabilities

Figure 3. Representative cases showing token-level log probabilities for steps of different lengths. The first token consistently exhibits lower log probability.

Our Proposed Method: ASLEC

Given the above analysis showing that low-probability first tokens cause step length confounding, we propose Alleviating Step Length Confounding (ASLEC), comprising two efficient variants that intervene on the first-token probability when computing selection scores.

▶  ASLEC-drop: Dropping the First Token

The most direct approach: when computing the average token log probability over a reasoning solution, we exclude the first token of each step. Formally, the scoring metric is:

$$s_i^{\mathrm{drop}} = \frac{1}{T_i - |S_i|} \sum_{s \in S_i} \sum_{t=2}^{|s|} \log P(s_t \mid \mathrm{context}),$$

where $T_i$ is the total number of tokens in sample $i$ and $|S_i|$ is its number of reasoning steps.

By dropping the first token at each step, we remove the dominant source of step-length bias. This approach is extremely lightweight and introduces zero additional computation.
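The scoring rule above can be sketched as follows; the data layout (a list of per-step token log-probability lists) is our assumption for illustration, not the paper's implementation:

```python
def aslec_drop_score(steps):
    """ASLEC-drop: mean log probability over all tokens except the first
    token of each reasoning step.

    `steps`: list of per-step token log-probability lists.
    """
    total, count = 0.0, 0
    for step in steps:
        total += sum(step[1:])   # drop the first token of each step
        count += len(step) - 1   # T_i - |S_i| tokens remain overall
    return total / count

def plain_avg(steps):
    """Baseline naturalness score: mean over all tokens, first tokens included."""
    toks = [lp for step in steps for lp in step]
    return sum(toks) / len(toks)

# Two toy solutions with identical continuation quality but different
# step lengths: the plain average prefers the longer one; ASLEC-drop
# scores them identically.
short_steps = [[-3.0, -0.2, -0.2]] * 4
long_steps = [[-3.0] + [-0.2] * 19] * 4
print(plain_avg(short_steps), plain_avg(long_steps))            # longer wins
print(aslec_drop_score(short_steps), aslec_drop_score(long_steps))  # equal
```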

▶  ASLEC-casl: Causally De-biasing

Although dropping the first token eliminates the bias, it also discards informative signals carried by that token. To preserve these signals, we apply a causal debiasing strategy inspired by regression adjustment.

We model the raw log-probability score via linear regression:

$$s_i^{\mathrm{logp}} = \beta_1\, s_i^{\mathrm{first}} + \beta_2\, s_i^{\mathrm{drop}} + \gamma\, Z_i + \varepsilon,$$

where $Z_i = |S_i| / T_i$ is the first-token ratio (the confounding factor) and $\gamma$ captures its confounding effect. We estimate $\beta_1$, $\beta_2$, and $\gamma$ via ordinary least squares, then compute the deconfounded score:

$$s_i^{\mathrm{casl}} = s_i^{\mathrm{logp}} - \gamma\, Z_i.$$

This closed-form solution is highly efficient (fitting completes in seconds) and preserves the informative first-token signal while removing the length-induced bias.
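A minimal sketch of this regression adjustment with ordinary least squares; the synthetic candidate pool, the added intercept column, and all numeric values are our assumptions for illustration:

```python
import numpy as np

def aslec_casl_scores(s_logp, s_first, s_drop, z):
    """ASLEC-casl sketch: regress the raw score on (s_first, s_drop, Z)
    via OLS, then subtract the estimated confounding term gamma * Z.

    All arguments are 1-D arrays over the candidate pool.
    """
    X = np.column_stack([s_first, s_drop, z, np.ones_like(z)])
    coef, *_ = np.linalg.lstsq(X, s_logp, rcond=None)
    beta1, beta2, gamma, intercept = coef
    return s_logp - gamma * z  # deconfounded score s_casl

# Synthetic pool: raw scores built with a known confounding weight (2.0)
# on the first-token ratio Z, so debiasing should remove the Z effect.
rng = np.random.default_rng(0)
s_first = rng.normal(-3.0, 0.3, 200)   # first-token component
s_drop = rng.normal(-0.3, 0.05, 200)   # first-token-free component
z = rng.uniform(0.02, 0.2, 200)        # first-token ratio |S_i| / T_i
s_logp = 0.1 * s_first + 0.9 * s_drop + 2.0 * z
s_casl = aslec_casl_scores(s_logp, s_first, s_drop, z)
print(float(np.corrcoef(s_casl, z)[0, 1]))  # near zero after debiasing
```

Since the fit is a single closed-form least-squares solve over the pool, it runs in well under a second even for tens of thousands of candidates.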

Experimental Results

We evaluate our methods on two LLM reasoning datasets (LIMO-v2 and AceReason-1.1-SFT), training four target LLMs of varying sizes (Qwen3-4B-Base, Qwen3-8B-Base, Qwen3-4B-Instruct, Qwen2.5-7B-Instruct), and evaluating on five benchmarks (AIME24, AIME25, MATH500, OlympiadBench, GPQA). We compare against two state-of-the-art naturalness-based selection methods: GRACE and Local LP.

Main Results on LIMO-v2

We select 4k responses from a pool of 16k (5 responses per source LLM × 4 source LLMs × 800 problems).

| Model | Method | AIME24 | AIME25 | MATH500 | OlympiadBench | Avg. |
|---|---|---|---|---|---|---|
| 4B-Base | GRACE | 16.66 | 15.83 | 59.40 | 33.33 | 31.42 |
| | Local LP | 19.16 | 20.83 | 71.60 | 34.11 | 36.50 |
| | + ASLEC-drop (ours) | 30.00 (↑10.84) | 28.33 (↑7.50) | 77.80 (↑6.20) | 38.38 (↑4.27) | 44.64 |
| | + ASLEC-casl (ours) | 31.66 (↑12.50) | 30.83 (↑10.00) | 80.00 (↑8.40) | 42.81 (↑8.70) | 47.54 |
| 8B-Base | GRACE | 30.83 | 21.66 | 72.00 | 39.70 | 42.36 |
| | Local LP | 34.16 | 20.83 | 76.60 | 42.81 | 44.06 |
| | + ASLEC-drop (ours) | 41.66 (↑10.50) | 36.66 (↑15.83) | 81.40 (↑4.80) | 47.85 (↑5.04) | 52.92 |
| | + ASLEC-casl (ours) | 45.00 (↑13.34) | 37.50 (↑16.67) | 85.40 (↑8.80) | 49.03 (↑6.22) | 56.15 |
| 4B-Instruct | GRACE | 59.16 | 50.00 | 79.36 | 47.79 | 63.82 |
| | Local LP | 61.66 | 49.16 | 80.75 | 50.14 | 65.84 |
| | + ASLEC-drop (ours) | 69.16 (↑7.50) | 56.66 (↑7.50) | 89.88 (↑9.13) | 57.64 (↑7.50) | 72.77 |
| | + ASLEC-casl (ours) | 71.66 (↑10.00) | 58.33 (↑9.17) | 93.20 (↑12.45) | 60.44 (↑10.30) | 76.16 |

Table 1. Experimental results on LIMO-v2. Arrows indicate improvement over Local LP (the SOTA baseline).

Analysis: Alleviating Step Length Confounding

Beyond performance improvements, we verify that our methods directly reduce the step length confounding effect. The figure below shows that data selected by ASLEC-drop and ASLEC-casl exhibits a much more balanced step-length distribution between selected and unselected samples, in stark contrast to prior methods.

Alleviating confounding analysis

Figure 4. Step length distributions for data selected vs. unselected by our two proposed variants. Both methods yield markedly smaller step-length disparities.

Overall Gains. Across four LLMs and two datasets, our methods achieve average accuracy improvements of +6.28% (ASLEC-drop) and +9.08% (ASLEC-casl) over the SOTA naturalness-based selection method Local LP. ASLEC-casl consistently outperforms ASLEC-drop, demonstrating the value of preserving first-token information through causal debiasing rather than simply discarding it.

Key Contributions

1. Discovery of Step Length Confounding. Through extensive experiments, we identify a systematic step length confounding problem in existing naturalness-based LLM reasoning data selection methods, and reveal that the root cause lies in the low-probability first token of each reasoning step.
2. Two Efficient Debiasing Variants. We propose ASLEC-drop and ASLEC-casl, which alleviate step length confounding by intervening on the first-token probability when computing the global average log probability. Both methods add negligible computational overhead.
3. Extensive Empirical Validation. Experiments across 4 LLMs, 2 SFT datasets, and 5 evaluation benchmarks demonstrate consistent and significant improvements over existing naturalness-based selection methods, with particular gains in low-resource SFT scenarios.

BibTeX

@inproceedings{wang2026aslec,
  title     = {On the Step Length Confounding in {LLM} Reasoning Data Selection},
  author    = {Bing Wang and Rui Miao and Chen Shen and Shaotian Yan and Kaiyuan Liu and Ximing Li and Xiaosong Yuan and Sinan Fan and Jun Zhang and Jieping Ye},
  booktitle = {Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics},
  year      = {2026}
}