QUICK REVIEW

[Paper Review] Agentic Framework for Political Biography Extraction

Zhu, Yifei, Yang, Songpo|arXiv (Cornell University)|Feb 23, 2026
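Computational and Text Analysis Methods|Cited by 0
One-Sentence Summary

This paper proposes a two-stage Synthesis-Coding framework that uses agentic large language models to automatically extract structured elite biographies from open web sources, and validates it against human coders and a Wikipedia-derived benchmark.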

ABSTRACT

The production of large-scale political datasets typically demands extracting structured facts from vast piles of unstructured documents or web sources, a task that traditionally relies on expensive human experts and remains prohibitively difficult to automate at scale. In this paper, we leverage Large Language Models (LLMs) to automate the extraction of multi-dimensional elite biographies, addressing a long-standing bottleneck in political science research. We propose a two-stage "Synthesis-Coding" framework for complex extraction tasks: an upstream synthesis stage that uses recursive agentic LLMs to search, filter, and curate biographies from heterogeneous web sources, followed by a downstream coding stage that maps the curated biographies into structured dataframes. We validate this framework through three primary results. First, we demonstrate that, when given curated contexts, LLM coders match or outperform human experts in extraction accuracy. Second, we show that in web environments, the agentic system synthesizes more information from web resources than human collective intelligence (Wikipedia). Finally, we find that coding directly from long and multi-language corpora introduces bias that the synthesis stage can alleviate by curating evidence into signal-dense representations. Through comprehensive evaluation, we provide a generalizable, scalable framework for building transparent and expandable large-scale databases in political science.

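Research Motivation and Objectives

  • Formalize political fact extraction as open-domain, multi-field biography coding rather than simple classification.
  • Propose a Synthesis-Coding architecture with iterative, tool-based retrieval to overcome the bottleneck of open-web extraction.
  • Open-source an agentic package, and demonstrate scalability and transparency through a curated evaluation dataset.
  • Produce a large-scale, cross-national dataset of political elite biographies, lowering the barrier to data production in information-scarce settings.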

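Proposed Method

  • Propose a two-stage Synthesis-Coding workflow in which synthesis curates evidence from open web sources and coding converts it into structured facts.
  • Implement a recursive retrieval-and-synthesis loop in which an agentic LLM decides which sources to retrieve and how to condense the findings into a single synthesized report.
  • Use deterministic retrieval tools to gather evidence and produce compressed, Wikipedia-entry-like input for the coding stage.
  • Use CPED (the Chinese Political Elite Database) and Baidu Baike as the evidence base to validate LLM coding results against the human-coded benchmark.
  • Build a Consolidated Ground Truth (CGT) through claim normalization and LLM-assisted evidence verification, followed by manual audit checks.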
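
The two-stage workflow and the recursive retrieval-and-synthesis loop described above can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the function names (`synthesize`, `extract_biography`), the `Report` structure, and the toy search and "LLM" stand-ins are hypothetical and are not the API of the authors' released package. In a real system the three callables would wrap a web search tool and LLM calls.

```python
import re
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Report:
    """Running synthesis report: a condensed, wiki-style biography draft."""
    facts: list = field(default_factory=list)
    sources: list = field(default_factory=list)


def synthesize(name, search_tool, propose_query, extract_facts, max_rounds=3):
    """Synthesis stage (hypothetical): iteratively search and condense evidence."""
    report = Report()
    for _ in range(max_rounds):
        # The agent inspects the report so far and decides what to look up next.
        query = propose_query(name, report)
        if query is None:  # the agent judges the report complete
            break
        for url, text in search_tool(query):
            if url in report.sources:
                continue  # skip sources that were already read
            report.sources.append(url)
            for fact in extract_facts(name, text):
                if fact not in report.facts:
                    report.facts.append(fact)
    return report


def extract_biography(name, wiki_page, search_tool, propose_query,
                      extract_facts, coder):
    """Route between the two strategies, then run the coding stage."""
    if wiki_page:  # a curated page exists: single coding pass
        return coder(wiki_page)
    report = synthesize(name, search_tool, propose_query, extract_facts)
    return coder("\n".join(report.facts))


# --- Toy stand-ins (assumptions) for the search tool and the LLM calls ---
CORPUS = {
    "Jane Doe born": [("https://example.org/a", "Jane Doe was born in 1960.")],
    "Jane Doe career": [("https://example.org/b",
                         "Jane Doe served as mayor in 2001.")],
}


def toy_search(query):
    return CORPUS.get(query, [])


def toy_propose_query(name, report) -> Optional[str]:
    if not any("born" in f for f in report.facts):
        return f"{name} born"
    if not any("served" in f for f in report.facts):
        return f"{name} career"
    return None


def toy_extract_facts(name, text):
    return [s.strip() + "." for s in text.split(".") if name in s]


def toy_coder(text):
    """Coding stage stand-in: map free text to a structured record."""
    born = re.search(r"born in (\d{4})", text)
    office = re.search(r"served as (\w+) in (\d{4})", text)
    return {
        "birth_year": int(born.group(1)) if born else None,
        "office": office.group(1) if office else None,
        "office_year": int(office.group(2)) if office else None,
    }


record = extract_biography("Jane Doe", None, toy_search, toy_propose_query,
                           toy_extract_facts, toy_coder)
print(record)
```

Separating the stages this way means the coder only ever sees short, signal-dense text, whether it comes from a curated page or from the synthesized report.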
Figure 1 : Two coding strategies for elite biographies. Left: when a Wikipedia page exists, we code directly from the curated page with a single LLM pass. Right: when Wikipedia is missing or incomplete, we search across web sources and iteratively synthesize a synthetic report, then code from that r

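Experimental Results

Research Questions

  • RQ1: When given curated biographical text, can LLM coders match or exceed the accuracy of human coders?
  • RQ2: Does agentic synthesis from open web sources produce more reliable biographical data than human collective synthesis (e.g., Wikipedia)?
  • RQ3: Do long-text or multilingual inputs degrade coding quality, and can synthesis mitigate that degradation?
  • RQ4: Is the proposed Synthesis-Coding framework scalable, transparent, and generalizable to cross-national elite biographies?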

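Key Findings

  • When given curated Wikipedia or Baidu Baike text, LLM coders match or exceed human coding quality.
  • Agentic synthesis from open web sources outperforms Wikipedia as a source for compiling biographies of global political elites.
  • Coding directly from long and multilingual corpora lowers quality, but appropriate synthesis curates signal-dense representations that mitigate the bias.
  • The framework offers a generalizable, scalable approach to building transparent, extensible large-scale political biography datasets.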
Figure 2 : Experiment 1 Results: LLM coding performance relative to the human baseline (China sample, N=197). Points indicate coefficient estimates with 95% confidence intervals. The human-coded baseline ( Human_wiki ) is normalized to zero. Positive values indicate that LLMs outperform human coders
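
To make the kind of comparison in Figure 2 concrete, the sketch below scores two coders against a ground truth and reports the accuracy gap relative to the human baseline with a 95% interval. This is a simplified stand-in, not the paper's estimation procedure: the figure reports coefficient estimates, whereas this example uses a plain difference in cell-level accuracy with a normal-approximation interval that treats the two coders as independent samples. The data, field names, and helper names are all hypothetical.

```python
import math


def accuracy(coded, truth):
    """Share of (person, field) cells where the coded value equals the truth."""
    hits = total = 0
    for person, fields in truth.items():
        for name, value in fields.items():
            total += 1
            hits += coded.get(person, {}).get(name) == value
    return hits / total, total


def diff_with_ci(acc_llm, acc_human, n, z=1.96):
    """Accuracy gap (LLM minus human) with a normal-approximation interval.

    Assumes both coders are scored on the same n cells and, as a
    simplification, treats their errors as independent.
    """
    diff = acc_llm - acc_human
    se = math.sqrt(acc_llm * (1 - acc_llm) / n
                   + acc_human * (1 - acc_human) / n)
    return diff, diff - z * se, diff + z * se


# Toy data (invented for illustration only).
truth = {
    "p1": {"birth_year": 1960, "office": "mayor"},
    "p2": {"birth_year": 1955, "office": "governor"},
}
llm = {
    "p1": {"birth_year": 1960, "office": "mayor"},
    "p2": {"birth_year": 1955, "office": "governor"},
}
human = {
    "p1": {"birth_year": 1960, "office": "mayor"},
    "p2": {"birth_year": 1956, "office": "governor"},  # one miscoded cell
}

acc_llm, n = accuracy(llm, truth)
acc_human, _ = accuracy(human, truth)
diff, lo, hi = diff_with_ci(acc_llm, acc_human, n)
print(f"gap = {diff:+.2f}, 95% interval = [{lo:+.2f}, {hi:+.2f}]")
```

A positive gap corresponds to the LLM outperforming the human baseline, which sits at zero by construction. With only four toy cells the interval is wide and spans zero, which is why the reported sample size matters when reading the figure.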

This review was generated by AI and reviewed by a human editor.