QUICK REVIEW

[Paper Review] Diffusion Language Models Are Natively Length-Aware

Vittorio Rossi, Giacomo Cirò | arXiv (Cornell University) | 2026. 03. 06.

Topic Modeling | Citations: 0

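One-Line Summary

This paper presents SmartCrop, a zero-shot method that extracts a length signal from the initial EoS logits and crops the diffusion canvas before generation begins. Across four benchmarks, it substantially reduces FLOPs with minimal or no loss in performance.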

ABSTRACT

Unlike autoregressive language models, which terminate variable-length generation upon predicting an End-of-Sequence (EoS) token, Diffusion Language Models (DLMs) operate over a fixed maximum-length context window for a predetermined number of denoising steps. However, this process is independent of the required response length, resulting in computational waste for the majority of short responses common in reasoning and chat tasks. To address this problem, we conjecture that the latent prompt representation contains sufficient information to estimate the required output length. We provide empirical evidence for this phenomenon and propose a zero-shot mechanism to dynamically crop the context window before generation begins, leading to fewer diffusion steps and substantial computational savings. We evaluate our approach on four benchmarks with diverse tasks -- GSM8K (reasoning), HumanEval (code generation), IfEval (instruction following), and LongFormQA (question answering) -- revealing massive efficiency gains at minimal performance impact. We report significant reductions in FLOPs across all tasks, with no statistically significant performance degradation, and significant performance improvements in 2 out of 4 tasks.

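Motivation and Objectives

Reduce the inference waste that diffusion language models incur from fixed-length canvases and EoS padding.
Propose a zero-shot, model-intrinsic mechanism that predicts the output length from the latent prompt representation.
Show that dynamic canvas cropping reduces computation (FLOPs) while having a minimal, or even positive, effect on task performance.
Evaluate with an 8B-parameter diffusion LM (LLaDA) on diverse benchmarks: GSM8K, HumanEval, IfEval, and LongFormQA.
Analyze sensitivity to padding and demonstrate the robustness of the length prediction.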

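Proposed Method

Frame length prediction as estimating, from the EoS logits, the cumulative probability that the canvas has terminated.
Define a threshold-based cropping rule: crop at the first position where the cumulative termination probability exceeds τ (e.g., 0.9).
Apply SmartCrop to the original canvas without retraining, as a step that runs before standard diffusion denoising (architecture-agnostic).
Evaluate LLaDA with the fixed canvas vs. the cropped canvas on the four benchmarks, reporting FLOPs savings and task-specific metrics.
Run a sensitivity analysis that perturbs the cropped length and compares against a random-length baseline, to confirm that the length prediction is instance-specific.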
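
The threshold rule and the length perturbation described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `crop_length` and `perturb_length` are hypothetical, `eos_probs` is assumed to hold per-position EoS probabilities read from the initial forward pass, and the hazard-style formula for the cumulative termination probability is an assumption, since the review does not state the exact definition.

```python
import numpy as np

def crop_length(eos_probs, tau=0.9):
    """Smallest canvas length whose cumulative termination probability exceeds tau."""
    p = np.clip(np.asarray(eos_probs, dtype=float), 0.0, 1.0)
    # ASSUMPTION: hazard-style accumulation. The response is treated as
    # terminated by position i unless every position up to i is non-EoS.
    cumulative = 1.0 - np.cumprod(1.0 - p)
    hits = np.nonzero(cumulative > tau)[0]
    # Fall back to the full canvas if the threshold is never exceeded.
    return int(hits[0]) + 1 if hits.size else len(p)

def perturb_length(predicted_len, delta, max_len):
    """Shift a predicted length by a relative deviation delta (e.g. -0.5 to +0.5)."""
    shifted = int(round(predicted_len * (1.0 + delta)))
    # Keep the result inside the valid canvas range [1, max_len].
    return max(1, min(max_len, shifted))

# Toy per-position EoS probabilities for a 6-token canvas (made-up values).
toy_probs = [0.0, 0.0, 0.5, 0.5, 0.9, 1.0]
cropped = crop_length(toy_probs, tau=0.9)
print(cropped, perturb_length(cropped, 0.5, len(toy_probs)))
```

Under this assumed definition a higher τ can only lengthen the crop, which matches the trend in the results table, where the average processed length grows from SC-0.5 to SC-0.99.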

Figure 1: Predicted Length Distributions. Our SmartCrop ($\tau=0.9$) method successfully predicts task-specific output lengths across four benchmark datasets. The abrupt truncations observed in certain distributions correspond to context length constraints (refer to Section 4 for details).

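Experimental Results

Research Questions

RQ1: Do DLMs trained with EoS padding expose an internal, prompt-conditioned signal about the required output length?
RQ2: Does zero-shot canvas cropping based on the initial EoS logits reduce inference compute without harming task performance, or even improve it?
RQ3: How does SmartCrop perform across different output-length regimes (reasoning, code, instruction following, QA)?
RQ4: How robust is the length prediction to perturbations of the canvas size?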

Key Results

Benchmark	Method	L_p	Avg. processed length	Metric ↑	FLOPs saved % ↑	Perf. Δ % ↑
IfEval	FC	87.2	1367.2	0.4801	-	-
IfEval	SC-0.5		192.1	0.5342	98.47***	+11.25*
IfEval	SC-0.75		208.0	0.5521	98.05***	+14.99**
IfEval	SC-0.9		222.0	0.5459	97.64***	+13.70**
IfEval	SC-0.95		230.5	0.5450	97.37***	+13.50**
IfEval	SC-0.99		243.8	0.5694	96.92***	+18.58***
GSM8K	FC	140.7	396.7	0.5616	-	-
GSM8K	SC-0.5		239.2	0.5452	69.39***	-2.92
GSM8K	SC-0.75		261.2	0.5516	59.09***	-1.77
GSM8K	SC-0.9		278.8	0.5457	50.15***	-2.83
GSM8K	SC-0.95		288.5	0.5490	44.93***	-2.25
GSM8K	SC-0.99		302.8	0.5520	37.01***	-1.71
HumanEval	FC	178.5	690.5	0.4592	-	-
HumanEval	SC-0.5		488.2	0.4665	46.42***	+1.59
HumanEval	SC-0.75		506.7	0.4688	41.06***	+2.08
HumanEval	SC-0.9		521.9	0.4851	36.53***	+5.65
HumanEval	SC-0.95		531.0	0.4598	33.98***	+0.13
HumanEval	SC-0.99		543.6	0.4106	30.16***	-10.59
LongFormQA	FC	77.6	589.6	0.1341	-	-
LongFormQA	SC-0.5		155.1	0.2115	85.40***	+57.72***
LongFormQA	SC-0.75		164.4	0.2152	82.56***	+60.48***
LongFormQA	SC-0.9		172.7	0.2173	79.94***	+62.01***
LongFormQA	SC-0.95		177.5	0.2196	78.35***	+63.73***
LongFormQA	SC-0.99		185.2	0.2210	75.86***	+64.83***

SmartCrop reduces FLOPs by 30–98% across the tasks and thresholds in the table (46–98% at τ = 0.5), for an average saving of about 67%.
No performance degradation is statistically significant on any task, while IfEval and LongFormQA show significant improvements.
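On GSM8K and HumanEval, the cropped canvas yields large compute savings with almost no loss in the task metric.
On IfEval, the shorter canvas mitigates the degradation caused by padding, and accuracy improves.
On LongFormQA, canvas cropping raises ROUGE-1, suggesting more concise and information-dense answers.
The method greatly reduces the amount of canvas denoised at every step while preserving or improving performance.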
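
The headline numbers can be recomputed from the results table with a short Python check. Two assumptions are made here and are not confirmed by the review: that the 67% figure is the unweighted mean of "FLOPs saved %" over all twenty SC rows, and that "Perf. Δ %" is the relative change of the metric against the FC row of the same benchmark. The helper name `relative_change` is hypothetical, and 0.5342 is read as the IfEval SC-0.5 metric.

```python
# FLOPs saved (%) for the SC rows of the results table, ordered tau = 0.5 ... 0.99.
flops_saved = {
    "IfEval": [98.47, 98.05, 97.64, 97.37, 96.92],
    "GSM8K": [69.39, 59.09, 50.15, 44.93, 37.01],
    "HumanEval": [46.42, 41.06, 36.53, 33.98, 30.16],
    "LongFormQA": [85.40, 82.56, 79.94, 78.35, 75.86],
}

all_values = [v for row in flops_saved.values() for v in row]
# ASSUMPTION: the reported average is an unweighted mean over all SC rows.
mean_saving = sum(all_values) / len(all_values)

def relative_change(metric, baseline):
    """Percentage change of a metric relative to a baseline (assumed definition of Perf. delta)."""
    return 100.0 * (metric - baseline) / baseline

# IfEval: SC-0.5 metric vs. the fixed-canvas (FC) metric.
ifeval_delta = relative_change(0.5342, 0.4801)

print(f"mean FLOPs saved: {mean_saving:.2f}%")
print(f"range: {min(all_values):.2f}% to {max(all_values):.2f}%")
print(f"IfEval SC-0.5 delta: {ifeval_delta:+.2f}%")
```

Under these assumptions the mean comes out at about 66.96%, in line with the reported 67%, and the IfEval delta at about +11.27%, close to the tabulated +11.25%; the small gap is plausibly due to the metrics being rounded to four decimals.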

Figure 2: Sensitivity of IfEval Performance to Context Length Perturbations. We analyze the robustness of SmartCrop ($\tau=0.9$) by shifting the predicted length $\hat{L}$ by a deviation factor $\delta\in[-50\%,+50\%]$. The blue curve shows the model performance (mean $\pm$ 95% CI) across these

This review was generated by AI and reviewed by a human editor.