QUICK REVIEW

[Paper Review] propella-1: Multi-Property Document Annotation for LLM Data Curation at Scale

Maximilian Idahl, Benedikt Droste | arXiv (Cornell University) | 2026-02-12

Library Science and Information Systems | Citations: 0

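One-Line Summary

propella-1 introduces small multilingual LLMs that annotate documents across 18 properties in 57 languages, enabling multi-dimensional data curation for LLM pretraining, and releases a dataset of more than 3 billion annotations.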

ABSTRACT

Since FineWeb-Edu, data curation for LLM pretraining has predominantly relied on single scalar quality scores produced by small classifiers. A single score conflates multiple quality dimensions, prevents flexible filtering, and offers no interpretability. We introduce propella-1, a family of small multilingual LLMs (0.6B, 1.7B, 4B parameters) that annotate text documents across 18 properties organized into six categories: core content, classification, quality and value, audience and purpose, safety and compliance, and geographic relevance. The models support 57 languages and produce structured JSON annotations conforming to a predefined schema. Evaluated against a frontier commercial LLM as a reference annotator, the 4B model achieves higher agreement than much larger general-purpose models. We release propella-annotations, a dataset of over three billion document annotations covering major pretraining corpora including data from FineWeb-2, FinePDFs, HPLT 3.0, and Nemotron-CC. Using these annotations, we present a multi-dimensional compositional analysis of widely used pretraining datasets, revealing substantial differences in quality, reasoning depth, and content composition that single-score approaches cannot capture. All model weights and annotations are released under permissive, commercial-use licenses.

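Research Motivation and Goals

Address the limitations of single-score data quality filters in LLM pretraining.
Provide a structured multi-property annotation framework of 18 properties organized into six categories.
Develop compact multilingual decoder-only models that output JSON annotations.
Release large-scale annotations that enable flexible, composable data curation.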

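Proposed Method

Define 18 properties in six categories to capture complementary quality dimensions.
Train three decoder-only models (0.6B, 1.7B, 4B) based on Qwen-3, covering 57 languages.
Fine-tune with a 64K context length and a compact 800-token system prompt.
Output strictly structured JSON with enumerated values for each property.
Evaluate against reference labels from a frontier LLM (Gemini-3-Pro) using metrics matched to each property type.
Release the propella-annotations dataset with billions of document annotations.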
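
The structured JSON output described above is what makes composable filtering possible. The following is a minimal sketch of that idea, not the authors' pipeline: the property names (`content_quality`, `reasoning_depth`, `safety`), their value ranges, and the three sample records are hypothetical and do not reflect the actual propella-1 schema.

```python
import json

# Hypothetical annotation records. Property names and values are illustrative
# assumptions, NOT the actual propella-1 schema.
RAW_ANNOTATIONS = [
    '{"doc_id": "a", "language": "de", "content_quality": 4, "reasoning_depth": 3, "safety": "safe"}',
    '{"doc_id": "b", "language": "en", "content_quality": 5, "reasoning_depth": 1, "safety": "safe"}',
    '{"doc_id": "c", "language": "ko", "content_quality": 2, "reasoning_depth": 4, "safety": "unsafe"}',
]

REQUIRED_KEYS = {"doc_id", "language", "content_quality", "reasoning_depth", "safety"}


def parse_annotation(raw):
    """Parse one JSON annotation and check that the expected keys are present."""
    record = json.loads(raw)
    missing = REQUIRED_KEYS - set(record)
    if missing:
        raise ValueError(f"annotation is missing keys: {sorted(missing)}")
    return record


def keep(record, min_quality=4, min_reasoning=2):
    """Combine several property conditions instead of thresholding one score."""
    return (
        record["content_quality"] >= min_quality
        and record["reasoning_depth"] >= min_reasoning
        and record["safety"] == "safe"
    )


records = [parse_annotation(raw) for raw in RAW_ANNOTATIONS]
selected_ids = [r["doc_id"] for r in records if keep(r)]
print(selected_ids)
```

Because each condition is a separate property, thresholds can be changed independently (for example, relaxing `min_reasoning` for a conversational corpus) without retraining or rescoring anything.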

Figure 1: Overall annotation agreement scores across all evaluated models. propella-1-4b exceeds Gemini-3-Flash and significantly larger open models.

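Experimental Results

Research Questions

RQ1: Can multi-property annotation improve the flexibility of data curation beyond single-score quality filters?
RQ2: How do small multilingual models perform on structured multi-property annotation compared with larger baselines?
RQ3: What scope and utility does a large, publicly released multi-property annotation dataset offer for LLM pretraining?
RQ4: How do multi-property annotations vary across languages and data sources, and how can this inform language-specific curation strategies?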

Key Results

Model	GPU	Docs/s	Hours / 1M docs	Prompt tokens/s	Output tokens/s
propella-1-4b	A100 80GB	10.3	27.0	19.1K	1.5K
propella-1-4b	H100 96GB	22.4	12.4	41.6K	3.2K
propella-1-4b (fp8)	H100 96GB	27.0	10.3	50.1K	3.9K
propella-1-1.7b	A100 80GB	17.8	15.6	33.0K	2.6K
propella-1-1.7b	H100 96GB	35.8	7.8	66.5K	5.2K
propella-1-1.7b (fp8)	H100 96GB	39.1	7.1	72.7K	5.7K
propella-1-0.6b	A100 80GB	21.5	12.9	40.0K	3.1K
propella-1-0.6b	H100 96GB	39.9	7.0	74.2K	5.7K
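
The hours-per-million-documents column in the table above follows directly from the Docs/s column. The sketch below reproduces that conversion for three rows; the final extrapolation to 3 billion documents is an assumption added for illustration (one GPU, linear scaling, one model) and is not a cost figure reported in the source.

```python
def hours_per_million_docs(docs_per_second):
    """Convert throughput in documents/second to hours per one million documents."""
    return 1_000_000 / docs_per_second / 3600


# Docs/s values copied from the table above.
throughput = {
    "propella-1-4b / A100 80GB": 10.3,
    "propella-1-4b / H100 96GB": 22.4,
    "propella-1-0.6b / H100 96GB": 39.9,
}

for name, docs_per_second in throughput.items():
    print(f"{name}: {hours_per_million_docs(docs_per_second):.1f} h per 1M docs")

# Illustrative extrapolation only: assumes linear scaling on a single GPU
# and a hypothetical corpus of 3 billion documents.
corpus_docs = 3_000_000_000
gpu_hours = hours_per_million_docs(22.4) * corpus_docs / 1_000_000
print(f"about {gpu_hours:,.0f} GPU-hours at 22.4 docs/s")
```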

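The 4B propella-1 model achieves an overall annotation agreement of 0.779, outperforming Gemini-3-Flash and several open baselines.
Even the smallest 0.6B model reaches an overall score of 0.729, close to the larger models.
fp8 inference incurs only negligible annotation-quality degradation relative to bf16.
propella-annotations contains more than 3 billion document annotations spanning major pretraining corpora.
The annotations reveal multi-dimensional differences across data sources and languages that single-score filters miss.
The dataset enables scalable, compliance-aware, multilingual data analysis and curation workflows.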
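
A compositional analysis of the kind summarized above amounts to grouping annotations by source and comparing the distribution of a property. The following is a toy sketch under stated assumptions: the source names appear in the abstract, but the property name `content_type` and every record are made up for illustration and are not results from the paper.

```python
from collections import Counter, defaultdict

# Toy records. Source names come from the abstract; the property name
# "content_type" and all values are hypothetical, not real annotation data.
annotations = [
    {"source": "FineWeb-2", "content_type": "news"},
    {"source": "FineWeb-2", "content_type": "forum"},
    {"source": "FineWeb-2", "content_type": "news"},
    {"source": "FinePDFs", "content_type": "academic"},
    {"source": "FinePDFs", "content_type": "academic"},
    {"source": "FinePDFs", "content_type": "legal"},
]


def composition_by_source(records, prop):
    """Return, per source, the share of documents taking each value of `prop`."""
    counts = defaultdict(Counter)
    for record in records:
        counts[record["source"]][record[prop]] += 1
    shares = {}
    for source, counter in counts.items():
        total = sum(counter.values())
        shares[source] = {value: n / total for value, n in counter.items()}
    return shares


shares = composition_by_source(annotations, "content_type")
for source, dist in shares.items():
    print(source, {value: round(share, 2) for value, share in dist.items()})
```

The same function can be called with any other annotated property, which is the practical difference from a single scalar score: each dimension can be compared across corpora on its own.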

Figure 2: Per-property annotation agreement scores across all evaluated models (12 properties). See Figure 7 in Appendix C for the full breakdown of all 17 properties.


This review was generated by AI and reviewed by a human editor.