QUICK REVIEW

[Paper Review] propella-1: Multi-Property Document Annotation for LLM Data Curation at Scale

Maximilian Idahl, Benedikt Droste|arXiv (Cornell University)|Feb 12, 2026

Library Science and Information Systems | Citations: 0

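One-Line Summary

propella-1 introduces a family of small multilingual LLMs that annotate documents with 18 properties across 57 languages, enabling multi-dimensional data curation for LLM pretraining. The authors also release a dataset of over three billion document annotations.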

ABSTRACT

Since FineWeb-Edu, data curation for LLM pretraining has predominantly relied on single scalar quality scores produced by small classifiers. A single score conflates multiple quality dimensions, prevents flexible filtering, and offers no interpretability. We introduce propella-1, a family of small multilingual LLMs (0.6B, 1.7B, 4B parameters) that annotate text documents across 18 properties organized into six categories: core content, classification, quality and value, audience and purpose, safety and compliance, and geographic relevance. The models support 57 languages and produce structured JSON annotations conforming to a predefined schema. Evaluated against a frontier commercial LLM as a reference annotator, the 4B model achieves higher agreement than much larger general-purpose models. We release propella-annotations, a dataset of over three billion document annotations covering major pretraining corpora including data from FineWeb-2, FinePDFs, HPLT 3.0, and Nemotron-CC. Using these annotations, we present a multi-dimensional compositional analysis of widely used pretraining datasets, revealing substantial differences in quality, reasoning depth, and content composition that single-score approaches cannot capture. All model weights and annotations are released under permissive, commercial-use licenses.

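Motivation and Objectives

Overcome the limitations of single-score data quality filters in LLM pretraining.
Provide a structured multi-property annotation framework spanning 18 properties in six categories.
Develop compact multilingual decoder-only models that output JSON annotations.
Release annotations at scale to enable flexible, composable data curation.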

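Proposed Method

Define 18 properties across six categories to capture complementary quality dimensions.
Train three decoder-only models (0.6B, 1.7B, 4B) based on Qwen-3, covering 57 languages.
Fine-tune with a 64K context length and a compact 800-token system prompt.
Output strictly structured JSON with enumerated values for each property.
Evaluate against reference labels from a frontier LLM (Gemini-3-Pro), using multiple types of metrics.
Release the propella-annotations dataset, providing billions of document annotations.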
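
The structured-output step above can be illustrated with a small validator. This is a minimal sketch, not the authors' schema: the property names (`content_quality`, `reasoning_depth`, `safety`) and their allowed values are hypothetical placeholders, since the review does not list the actual 18 properties or their enumerations.

```python
import json

# Hypothetical schema fragment: these property names and allowed values are
# illustrative assumptions, NOT the actual propella-1 schema.
SCHEMA = {
    "content_quality": {"low", "medium", "high"},
    "reasoning_depth": {"none", "shallow", "deep"},
    "safety": {"safe", "sensitive", "unsafe"},
}


def validate_annotation(raw: str) -> dict:
    """Parse a model output string and check it against SCHEMA.

    Raises ValueError if the JSON is malformed, if a property is missing or
    unexpected, or if a value falls outside its enumerated set.
    """
    try:
        record = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"not valid JSON: {exc}") from exc
    if not isinstance(record, dict):
        raise ValueError("annotation must be a JSON object")
    mismatch = set(record) ^ set(SCHEMA)
    if mismatch:
        raise ValueError(f"missing or unexpected properties: {sorted(mismatch)}")
    for name, allowed in SCHEMA.items():
        value = record[name]
        if not isinstance(value, str) or value not in allowed:
            raise ValueError(f"{name}={value!r} not in {sorted(allowed)}")
    return record
```

Rejecting any output that deviates from the enumerated values keeps downstream filters simple, because every accepted record is guaranteed to carry the same keys and a closed set of values.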

Figure 1: Overall annotation agreement scores across all evaluated models. propella-1-4b exceeds Gemini-3-Flash and significantly larger open models.

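Experimental Results

Research Questions

RQ1: Does multi-property annotation improve the flexibility of data curation beyond single-score quality filters?
RQ2: How do small multilingual models perform on the structured multi-property annotation task compared with larger baselines?
RQ3: What are the scope and usefulness of a large, publicly released multi-property annotation dataset for LLM pretraining?
RQ4: How do multi-property annotations vary across languages and data sources, and how can they inform language-specific curation strategies?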

Key Findings

Model	GPU	Docs/s	h / 1M docs	Prompt TPS	Output TPS
propella-1-4b	A100 80GB	10.3	27.0	19.1K	1.5K
propella-1-4b	H100 96GB	22.4	12.4	41.6K	3.2K
propella-1-4b (fp8)	H100 96GB	27.0	10.3	50.1K	3.9K
propella-1-1.7b	A100 80GB	17.8	15.6	33.0K	2.6K
propella-1-1.7b	H100 96GB	35.8	7.8	66.5K	5.2K
propella-1-1.7b (fp8)	H100 96GB	39.1	7.1	72.7K	5.7K
propella-1-0.6b	A100 80GB	21.5	12.9	40.0K	3.1K
propella-1-0.6b	H100 96GB	39.9	7.0	74.2K	5.7K
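
The "h / 1M docs" column follows directly from "Docs/s": hours per million documents equal 1,000,000 / (docs per second × 3,600). The sketch below reproduces that arithmetic and extends it to a rough GPU-hour estimate for a corpus of arbitrary size. The function names are illustrative, and the estimate assumes the table's throughput holds constant, which may not be true for other document lengths or hardware setups.

```python
SECONDS_PER_HOUR = 3600


def hours_per_million_docs(docs_per_second: float) -> float:
    """Wall-clock hours a single GPU needs to annotate one million documents."""
    return 1_000_000 / docs_per_second / SECONDS_PER_HOUR


def gpu_hours(total_docs: float, docs_per_second: float) -> float:
    """Rough GPU-hours for a corpus, assuming constant per-GPU throughput."""
    return total_docs / docs_per_second / SECONDS_PER_HOUR


# Throughput values taken from the table above.
print(round(hours_per_million_docs(10.3), 1))  # propella-1-4b on A100 80GB
print(round(hours_per_million_docs(27.0), 1))  # propella-1-4b (fp8) on H100 96GB
```

Under the same assumption, dividing `gpu_hours(...)` by the number of GPUs running in parallel gives an approximate wall-clock time.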

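The 4B propella-1 model achieves an overall annotation agreement of 0.779, exceeding Gemini-3-Flash and many open baselines.
Even the smallest 0.6B model reaches an overall score of 0.729, approaching larger models on this task.
fp8 inference preserves annotation quality, with almost no degradation relative to bf16.
propella-annotations covers more than three billion document annotations across major pretraining corpora.
The annotations reveal multi-dimensional differences across data sources and languages that single-score filters miss.
The dataset enables scalable, compliance-aware multilingual data analysis and curation workflows.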
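
To make the contrast with a single scalar score concrete, the following sketch composes a filter from several annotated properties. The field names and values (`content_quality`, `reasoning_depth`, `safety`, `language`) and the sample records are hypothetical, chosen only for illustration; the real dataset's property names and value sets may differ.

```python
from typing import Callable

Annotation = dict[str, str]
Predicate = Callable[[Annotation], bool]


def has(prop: str, *allowed: str) -> Predicate:
    """Predicate that is true when the annotation's `prop` is one of `allowed`."""
    allowed_set = frozenset(allowed)
    return lambda ann: ann.get(prop) in allowed_set


def all_of(*preds: Predicate) -> Predicate:
    """Combine predicates with logical AND."""
    return lambda ann: all(p(ann) for p in preds)


# Hypothetical annotated documents (illustrative values only).
docs = [
    {"id": "a", "language": "de", "content_quality": "high",
     "reasoning_depth": "deep", "safety": "safe"},
    {"id": "b", "language": "de", "content_quality": "high",
     "reasoning_depth": "none", "safety": "safe"},
    {"id": "c", "language": "en", "content_quality": "medium",
     "reasoning_depth": "deep", "safety": "unsafe"},
]

# A curation recipe a single quality score could not express:
# high quality AND deep reasoning AND safe.
reasoning_subset = all_of(
    has("content_quality", "high"),
    has("reasoning_depth", "deep"),
    has("safety", "safe"),
)

selected_ids = [d["id"] for d in docs if reasoning_subset(d)]
print(selected_ids)
```

Because each predicate reads one property, recipes for different training stages or languages can be assembled by recombining the same building blocks rather than retraining a classifier.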

Figure 2: Per-property annotation agreement scores across all evaluated models (12 properties). See Figure 7 in Appendix C for the full breakdown of all 17 properties.

This review was generated by AI and checked by a human editor.