QUICK REVIEW

[Paper Review] Synthetic Data for Veterinary EHR De-identification: Benefits, Limits, and Safety Trade-offs Under Fixed Compute

David M. Brundage|arXiv (Cornell University)|Jan 13, 2026
Electronic Health Records Systems | Cited by 0
One-sentence summary

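This paper evaluates how LLM-generated synthetic veterinary narratives affect de-identification under different training regimes. Synthetic data helps when it expands training exposure, but under a fixed budget it cannot replace real labeled notes; the gains are driven largely by exposure.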

ABSTRACT

Veterinary electronic health records (vEHRs) contain privacy-sensitive identifiers that limit secondary use. While PetEVAL provides a benchmark for veterinary de-identification, the domain remains low-resource. This study evaluates whether large language model (LLM)-generated synthetic narratives improve de-identification safety under distinct training regimes, emphasizing (i) synthetic augmentation and (ii) fixed-budget substitution. We conducted a controlled simulation using a PetEVAL-derived corpus (3,750 holdout/1,249 train). We generated 10,382 synthetic notes using a privacy-preserving "template-only" regime where identifiers were removed prior to LLM prompting. Three transformer backbones (PetBERT, VetBERT, Bio_ClinicalBERT) were trained under varying mixtures. Evaluation prioritized document-level leakage rate (the fraction of documents with at least one missed identifier) as the primary safety outcome. Results show that under fixed-sample substitution, replacing real notes with synthetic ones monotonically increased leakage, indicating synthetic data cannot safely replace real supervision. Under compute-matched training, moderate synthetic mixing matched real-only performance, but high synthetic dominance degraded utility. Conversely, epoch-scaled augmentation improved performance: PetBERT span-overlap F1 increased from 0.831 to 0.850 +/- 0.014, and leakage decreased from 6.32% to 4.02% +/- 0.19%. However, these gains largely reflect increased training exposure rather than intrinsic synthetic data quality. Corpus diagnostics revealed systematic synthetic-real mismatches in note length and label distribution that align with persistent leakage. We conclude that synthetic augmentation is effective for expanding exposure but is complementary, not substitutive, for safety-critical veterinary de-identification.
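
The primary safety outcome defined above, document-level leakage rate, can be illustrated with a minimal sketch. It assumes identifiers are represented as character-offset spans and that a gold identifier counts as caught when any predicted span overlaps it; the span representation, the overlap rule, and the function names are illustrative assumptions, not the paper's evaluation code.

```python
from typing import List, Tuple

Span = Tuple[int, int]  # (start, end) character offsets, end exclusive (assumed format)


def overlaps(a: Span, b: Span) -> bool:
    """Return True when two half-open spans share at least one character."""
    return a[0] < b[1] and b[0] < a[1]


def document_leakage_rate(gold_docs: List[List[Span]],
                          pred_docs: List[List[Span]]) -> float:
    """Fraction of documents with at least one gold identifier that no prediction overlaps."""
    if len(gold_docs) != len(pred_docs):
        raise ValueError("gold_docs and pred_docs must have the same length")
    if not gold_docs:
        return 0.0
    leaked = 0
    for gold, pred in zip(gold_docs, pred_docs):
        # A document leaks if any gold span is missed by every predicted span.
        if any(not any(overlaps(g, p) for p in pred) for g in gold):
            leaked += 1
    return leaked / len(gold_docs)


# Toy data: document 1 misses one identifier, documents 2 and 3 leak nothing.
gold = [[(0, 5), (10, 14)], [(3, 8)], []]
pred = [[(0, 4)], [(2, 9)], []]
rate = document_leakage_rate(gold, pred)
print(f"document-level leakage: {rate:.2%}")
```

Because a single missed identifier marks the whole document as leaked, this metric can worsen even when span-level F1 stays high, which is why it is treated as the safety outcome rather than a utility score.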

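Research motivation and objectives

  • Motivate de-identification for vEHRs, where privacy constraints limit data sharing.
  • Test whether synthetic data improves safety (document-level leakage) and utility (span-level F1) under different training regimes.
  • Compare augmentation, substitution, and compute-matched controls to isolate exposure effects.
  • Characterize how synthetic data composition (PII-bearing vs. PII-free notes) affects recall, precision, and leakage in veterinary named entity recognition.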

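Proposed method

  • Use a PetEVAL-derived low-resource simulation with 3,750 real holdout notes and 1,249 real training notes.
  • Generate 10,382 synthetic notes under a template-only regime, using placeholders and deterministic instantiation.
  • Train three transformer backbones (PetBERT, VetBERT, Bio_ClinicalBERT) under three regimes: exposure-scaled (epoch-scaled), fixed-sample, and compute-matched.
  • Evaluate with token-level, span-level, and document-level metrics, treating document-level leakage as the primary safety outcome.
  • Run ablations that vary the proportion of PII-free synthetic notes, with sensitivity analysis across seeds.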
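
The difference between the training regimes is easiest to see as arithmetic on note counts. The sketch below uses the corpus sizes reported in this review (1,249 real training notes, 10,382 synthetic notes); the helper names, the rounding, the cap at the synthetic pool size, and the definition of a compute budget as "examples seen" are assumptions made for illustration and are not taken from the paper.

```python
from dataclasses import dataclass

N_REAL_TRAIN = 1_249   # real training notes (from the review)
N_SYNTHETIC = 10_382   # generated synthetic notes (from the review)


@dataclass(frozen=True)
class Mix:
    n_real: int
    n_synthetic: int

    @property
    def synthetic_fraction(self) -> float:
        total = self.n_real + self.n_synthetic
        return self.n_synthetic / total if total else 0.0


def augmentation_mix(target_fraction: float,
                     n_real: int = N_REAL_TRAIN,
                     n_syn_available: int = N_SYNTHETIC) -> Mix:
    """Keep every real note and add synthetic notes toward the target fraction.

    Assumption: if the pool is too small, use all available synthetic notes.
    """
    if not 0.0 <= target_fraction < 1.0:
        raise ValueError("target_fraction must be in [0, 1)")
    wanted = round(n_real * target_fraction / (1.0 - target_fraction))
    return Mix(n_real, min(wanted, n_syn_available))


def substitution_mix(target_fraction: float,
                     n_total: int = N_REAL_TRAIN) -> Mix:
    """Hold the training-set size fixed and swap real notes for synthetic ones."""
    if not 0.0 <= target_fraction <= 1.0:
        raise ValueError("target_fraction must be in [0, 1]")
    n_syn = round(n_total * target_fraction)
    return Mix(n_total - n_syn, n_syn)


def epochs_for_budget(mix: Mix, example_budget: int) -> float:
    """Compute-matched control (assumed form): fix examples seen, so larger mixes get fewer epochs."""
    return example_budget / (mix.n_real + mix.n_synthetic)


aug = augmentation_mix(0.90)
sub = substitution_mix(0.90)
print(aug, round(aug.synthetic_fraction, 3))
print(sub, round(sub.synthetic_fraction, 3))
print(epochs_for_budget(aug, example_budget=10 * N_REAL_TRAIN))
```

Under these assumptions, augmentation at a 0.90 target grows the training set roughly ninefold while keeping all real notes, whereas substitution at the same target leaves only about a tenth of the real notes. That contrast is the exposure effect the controls are designed to separate from synthetic data quality.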
Figure 1: Synthetic augmentation sweep ( $L{=}512$ , stride $=64$ ; $n{=}3$ seeds). Points show mean; error bars show $\pm$ 1 SD across seeds. Top: Span-overlap F1 increases with synthetic fraction across backbones. Bottom: Document-level overlap leakage decreases with synthetic fraction, with PetBE

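Experimental results

Research questions

  • RQ1: Do LLM-generated synthetic veterinary narratives improve de-identification safety and utility under different training regimes?
  • RQ2: Do span-level F1 and document-level leakage respond differently to increased exposure (augmentation) and to fixed-budget replacement (substitution)?
  • RQ3: How does the proportion of PII-free synthetic notes affect recall, precision, and leakage?
  • RQ4: Do the observed gains come mainly from increased exposure rather than from the intrinsic quality of the synthetic text?
  • RQ5: In what ways do structural mismatches between the synthetic and real corpora limit safe synthetic data design?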

Key findings

| Backbone | Target synthetic fraction | Span-overlap F1 (mean ± SD) | Doc leakage (overlap) (mean ± SD) |
|---|---|---|---|
| PetBERT | 0.90 | 0.850 ± 0.014 | 4.02% ± 0.19% |
| VetBERT | 0.90 | 0.777 ± 0.002 | 5.92% ± 0.10% |
| Bio_ClinicalBERT | 0.90 | 0.594 ± 0.004 | 9.01% ± 0.18% |

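  • Under epoch-scaled training, higher proportions of synthetic notes raised span-overlap F1 and lowered document-level leakage across backbones; PetBERT reached an F1 of 0.850 ± 0.014 and leakage of 4.02% ± 0.19% at a synthetic mix of about 90%.
  • Under fixed-sample substitution, replacing real notes with synthetic ones increased document-level leakage monotonically, even while span F1 stayed high (for example, PetBERT F1 fell from 0.847 at 100% real notes to 0.820 at 5% real notes).
  • Compute-matched training showed that a moderate synthetic mix (about 50%) gave the best F1 with low leakage, whereas heavy synthetic dominance reduced utility without lowering leakage.
  • Epoch-scaled augmentation increased recall for minority entity types (LOC, ORG) and lowered leakage, whereas excessive synthetic dominance in the compute-matched regime hurt performance.
  • Synthetic data skewed toward PII-free notes raised F1 but sometimes also raised leakage; a balanced mix with 50% PII-free synthetic notes worked best, reaching near-minimal leakage with stable recall across seeds.
  • The gains are driven mainly by increased exposure rather than any intrinsic advantage of synthetic text; whether synthetic data helps appreciably depends on training regime and data composition.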
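
The per-entity recall behind the minority-class finding (LOC, ORG versus PER) can be sketched as follows. This is an illustrative computation only: it assumes labeled character-offset spans and counts a gold entity as recalled when any predicted span overlaps it regardless of predicted label, on the reasoning that an identifier masked under the wrong type is still masked. Both the data format and that matching rule are assumptions, not the paper's scoring code.

```python
from collections import Counter
from typing import Dict, List, Tuple

Entity = Tuple[int, int, str]  # (start, end, label), end exclusive (assumed format)


def per_entity_overlap_recall(gold_docs: List[List[Entity]],
                              pred_docs: List[List[Entity]]) -> Dict[str, float]:
    """Recall per gold entity type, using label-agnostic span overlap."""
    found: Counter = Counter()
    total: Counter = Counter()
    for gold, pred in zip(gold_docs, pred_docs):
        for g_start, g_end, label in gold:
            total[label] += 1
            # The gold entity is recalled if any prediction overlaps it.
            if any(p_start < g_end and g_start < p_end
                   for p_start, p_end, _ in pred):
                found[label] += 1
    return {label: found[label] / total[label] for label in total}


# Toy data: both PER entities and the ORG entity are caught, the LOC entity is missed.
gold = [[(0, 5, "PER"), (10, 14, "LOC")], [(3, 8, "PER"), (20, 25, "ORG")]]
pred = [[(0, 5, "PER")], [(2, 9, "PER"), (21, 24, "LOC")]]
recall = per_entity_overlap_recall(gold, pred)
print(recall)
```

Breaking recall out by type makes it visible when an aggregate F1 is carried by a frequent class such as PER while rarer classes continue to leak.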
Figure 2: Per-entity overlap recall across synthetic fractions. Synthetic augmentation drives recall gains in minority classes (e.g., LOC/ORG) while high-frequency classes (PER) change modestly.


This review was generated by AI and reviewed by a human editor.