[Paper Review] Architecture-Agnostic Curriculum Learning for Document Understanding: Empirical Evidence from Text-Only and Multimodal
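This study evaluates progressive data scheduling (33%→67%→100%), a form of curriculum learning, for text-only BERT and multimodal LayoutLMv3 on FUNSD and CORD. It finds a consistent reduction in compute and architecture-dependent performance gains; a matched-compute analysis shows a genuine scheduling benefit in capacity-constrained models.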
We investigate whether progressive data scheduling, a curriculum learning strategy that incrementally increases training data exposure (33%→67%→100%), yields consistent efficiency gains across architecturally distinct document understanding models. By evaluating BERT (text-only, 110M parameters) and LayoutLMv3 (multimodal, 126M parameters) on the FUNSD and CORD benchmarks, we establish that this schedule reduces wall-clock training time by approximately 33%, commensurate with the reduction from 10.0 to 6.67 effective epoch-equivalents of data. To isolate curriculum effects from compute reduction, we introduce matched-compute baselines (Standard-7) that control for total gradient updates. On the FUNSD dataset, the curriculum significantly outperforms the matched-compute baseline for BERT (ΔF1 = +0.023, p = 0.022, d_z = 3.83), constituting evidence for a genuine scheduling benefit in capacity-constrained models. In contrast, no analogous benefit is observed for LayoutLMv3 (p = 0.621), whose multimodal representations provide sufficient inductive bias. On the CORD dataset, all conditions converge to equivalent F1 scores (≥ 0.947) irrespective of scheduling, indicating a performance ceiling. Schedule ablations comparing progressive, two-phase, reverse, and random pacing confirm that the efficiency gain derives from reduced data volume rather than ordering. Taken together, these findings demonstrate that progressive scheduling is a reliable compute-reduction strategy across model families, with curriculum-specific benefits contingent on the interaction between model capacity and task complexity.
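Motivation and Objectives
- Assess whether progressive data scheduling yields efficiency gains across architecturally distinct document understanding models (text-only vs. multimodal).
- Quantify the reduction in wall-clock training time of the three-stage curriculum relative to standard training.
- Use a matched-compute baseline (Standard-7) to separate curriculum effects from the effect of fewer gradient updates.
- Run schedule ablations to determine whether data ordering or data volume drives the improvement.
- Provide cross-architecture and statistical analyses to guide practical training protocols.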
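Proposed Method
- Implement a three-stage progressive data schedule (33%→67%→100%) over 10 total epochs, yielding 6.67 effective epochs of data exposure.
- Use a matched-compute baseline (Standard-7) to separate curriculum effects from the effect of fewer gradient updates.
- Compare BERT-base (text-only) with LayoutLMv3-base (multimodal) on the FUNSD and CORD benchmarks, measured by entity-level F1 computed with SeqEval.
- Run schedule ablations (two-phase, reverse, random) to assess how much ordering matters.
- Report paired statistical tests across three seeds, with Cohen's d_z effect sizes.
- Provide an extended-domain evaluation with synthetic data to test the framework's generality.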
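
The effective-epoch arithmetic behind the schedule can be sketched as follows. This is a minimal illustration, not the authors' code: it assumes three equal-length phases sharing the 10-epoch budget, exact thirds in place of the rounded 33%/67%, and nested subsets taken from a fixed ordering. The helper names `effective_epochs` and `phase_subset` are hypothetical.

```python
from fractions import Fraction


def effective_epochs(data_fractions, epochs_per_phase):
    """Sum of (data fraction x epochs) over phases, in full-dataset epoch equivalents."""
    return sum(f * e for f, e in zip(data_fractions, epochs_per_phase))


def phase_subset(indices, fraction):
    """Return the leading `fraction` of a fixed index ordering.

    Assumption: each phase's subset contains the previous one (nested subsets).
    """
    cutoff = round(len(indices) * fraction)
    return indices[:cutoff]


# Assumption: three equal-length phases that together span 10 epochs.
TOTAL_EPOCHS = 10
data_fractions = [Fraction(1, 3), Fraction(2, 3), Fraction(1)]
epochs_per_phase = [Fraction(TOTAL_EPOCHS, 3)] * 3

curriculum = effective_epochs(data_fractions, epochs_per_phase)
standard_10 = Fraction(10)
standard_7 = Fraction(7)

# Fraction of Standard-10 data exposure that each condition avoids.
curriculum_saving = 1 - curriculum / standard_10
standard_7_saving = 1 - standard_7 / standard_10

print(f"Curriculum-10 effective epochs: {float(curriculum):.2f}")
print(f"Curriculum-10 saving vs Standard-10: {float(curriculum_saving):.1%}")
print(f"Standard-7 saving vs Standard-10: {float(standard_7_saving):.1%}")
```

Under these assumptions the curriculum sees 20/3 epoch equivalents, one third less than Standard-10, while Standard-7 sees 30% less. Standard-7 therefore gets slightly more data exposure than the curriculum (7.0 vs. 6.67 epoch equivalents), so it is an approximate rather than exact compute match.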
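Experimental Results
Research Questions
- RQ1: Does progressive data scheduling reduce training time for both text-only and multimodal document understanding models?
- RQ2: Does the curriculum offer benefits beyond reduced compute, and do they appear in both architectures?
- RQ3: Does the data ordering (33%→67%→100%) provide a specific benefit over simple data subsampling?
- RQ4: How does the curriculum's behavior vary across document understanding tasks (FUNSD vs. CORD) and in extended domains?
Key Findings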
| Dataset | Architecture | Condition | Effective Epochs | Final Loss | Entity F1 | Precision / Recall | Time (s) | Speedup |
|---|---|---|---|---|---|---|---|---|
| FUNSD | BERT | Standard-10 | 10.0 | 0.508±0.013 | 0.562±0.009 | 0.514/0.620 | 53.7±0.2 | – |
| FUNSD | BERT | Curriculum-10 | 6.67 | 0.635±0.031 | 0.543±0.009 | 0.496/0.600 | 35.8±0.1 | 33.3% |
| FUNSD | BERT | Standard-7 | 7.0 | 0.733±0.006 | 0.521±0.010 | 0.469/0.585 | 37.5±0.0 | 30.2% |
| FUNSD | LayoutLMv3 | Standard-10 | 10.0 | 0.075±0.004 | 0.821±0.009 | 0.806/0.836 | 139.8±1.4 | – |
| FUNSD | LayoutLMv3 | Curriculum-10 | 6.67 | 0.193±0.009 | 0.807±0.003 | 0.781/0.833 | 92.5±0.7 | 33.9% |
| FUNSD | LayoutLMv3 | Standard-7 | 7.0 | 0.166±0.011 | 0.803±0.007 | 0.785/0.823 | 97.0±0.3 | 30.6% |
| CORD | BERT | Standard-10 | 10.0 | 0.021±0.002 | 0.947±0.003 | 0.951/0.943 | 277.8±0.3 | – |
| CORD | BERT | Curriculum-10 | 6.67 | 0.040±0.001 | 0.949±0.007 | 0.952/0.945 | 185.2±0.1 | 33.3% |
| CORD | BERT | Standard-7 | 7.0 | 0.041±0.002 | 0.948±0.003 | 0.952/0.945 | 194.5±0.2 | 30.0% |
| CORD | LayoutLMv3 | Standard-10 | 10.0 | 0.025±0.003 | 0.955±0.003 | 0.958/0.952 | 838.9±6.9 | – |
| CORD | LayoutLMv3 | Curriculum-10 | 6.67 | 0.059±0.003 | 0.953±0.009 | 0.958/0.947 | 557.8±1.2 | 33.5% |
| CORD | LayoutLMv3 | Standard-7 | 7.0 | 0.041±0.003 | 0.959±0.005 | 0.963/0.955 | 584.0±1.7 | 30.4% |
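- Relative to Standard-10, Curriculum-10 reduces wall-clock training time by about 33% for both architectures.
- On FUNSD, Curriculum-10 yields a higher F1 than Standard-7 for BERT (ΔF1 = +0.023, p = 0.022, d_z = 3.83).
- For LayoutLMv3 on FUNSD, Curriculum-10 shows no significant F1 gain over Standard-7 (p = 0.621).
- On CORD, all conditions reach similar F1 scores (≥ 0.947), indicating a performance ceiling.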
- Across architectures and datasets, the wall-clock speedup from Curriculum-10 ranges from 33.3% to 33.9%; the four table entries average about 33.5%.
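- Schedule ablations show no significant difference among progressive, two-phase, reverse, and random pacing at roughly 6.67 effective epochs, indicating that data volume, not ordering, drives the efficiency gain.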
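
The reported BERT statistics on FUNSD can be checked for internal consistency with a short sketch. It assumes the paired test is a two-sided paired t-test over the three seeds (the summary says only "paired statistical tests"), so that t = d_z·√n with n − 1 = 2 degrees of freedom. The function names are hypothetical, and the closed-form CDF used here holds only for 2 degrees of freedom.

```python
import math
import statistics


def cohen_dz(differences):
    """Cohen's d_z: mean of the paired differences divided by their sample standard deviation."""
    return statistics.mean(differences) / statistics.stdev(differences)


def paired_t_from_dz(d_z, n_pairs):
    """Paired t statistic implied by an effect size d_z over n_pairs pairs."""
    return d_z * math.sqrt(n_pairs)


def two_sided_p_df2(t):
    """Two-sided p-value for Student's t with exactly 2 degrees of freedom.

    Uses the closed-form CDF F(t) = 1/2 + t / (2 * sqrt(2 + t^2)), valid for df = 2 only.
    """
    cdf = 0.5 + abs(t) / (2.0 * math.sqrt(2.0 + t * t))
    return 2.0 * (1.0 - cdf)


# Reported values: d_z = 3.83 with three seeds (assumed to form three pairs).
t_stat = paired_t_from_dz(3.83, 3)
p_value = two_sided_p_df2(t_stat)

print(f"t = {t_stat:.3f}, two-sided p = {p_value:.3f}")
```

Under that assumption the implied p-value rounds to the reported 0.022, so the effect size and p-value agree with each other. With only three seeds the test has 2 degrees of freedom, which is why a large d_z is needed to reach significance. `cohen_dz` is included to show how the effect size would be computed from per-seed F1 differences, which the summary does not list.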
This review was generated by AI and checked by a human editor.