QUICK REVIEW

[Paper Review] Evaluating Large Language Models for Public Health Classification and Extraction Tasks

Joshua D. Harris, Timothy Laurence | arXiv (Cornell University) | May 23, 2024

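One-Sentence Summary

This paper evaluates open-weight LLMs (7–70B parameters) on 17 public health classification and extraction tasks spanning three subdomains (burden, risk factors, and interventions), compares zero-shot with few-shot prompts, and benchmarks against GPT-4 on a subset of tasks.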

ABSTRACT

Advances in Large Language Models (LLMs) have led to significant interest in their potential to support human experts across a range of domains, including public health. In this work we present automated evaluations of LLMs for public health tasks involving the classification and extraction of free text. We combine six externally annotated datasets with seven new internally annotated datasets to evaluate LLMs for processing text related to: health burden, epidemiological risk factors, and public health interventions. We evaluate eleven open-weight LLMs (7-123 billion parameters) across all tasks using zero-shot in-context learning. We find that Llama-3.3-70B-Instruct is the highest performing model, achieving the best results on 8/16 tasks (using micro-F1 scores). We see significant variation across tasks with all open-weight LLMs scoring below 60% micro-F1 on some challenging tasks, such as Contact Classification, while all LLMs achieve greater than 80% micro-F1 on others, such as GI Illness Classification. For a subset of 11 tasks, we also evaluate three GPT-4 and GPT-4o series models and find comparable results to Llama-3.3-70B-Instruct. Overall, based on these initial results we find promising signs that LLMs may be useful tools for public health experts to extract information from a wide variety of free text sources, and support public health surveillance, research, and interventions.

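Motivation and Objectives

Evaluate how open-weight LLMs perform across a broad range of public health free-text classification and extraction tasks.
Identify the factors that drive performance differences across models, tasks, and data sources.
Provide a baseline for future fine-tuning and controlled deployment of public-health-specific LLMs.
Compare against GPT-4 on a subset of tasks to assess whether open-weight models are on par with proprietary ones.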

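Proposed Method

Compile six externally annotated and seven internally annotated datasets (13 in total) covering three subdomains (burden, risk factors, and interventions), for a total of 17 public health tasks.
Run zero-shot evaluation of five open-weight LLMs (7–70B parameters) using a standardized prompting and post-processing framework.
Serve the models through an internal LLM API on UKHSA HPC resources to ensure data governance and reproducibility.
On the most difficult tasks, compare zero-shot results with selected few-shot prompts (10-shot/7-shot).
Compute micro-F1 (the primary metric) and macro-F1 scores; extraction outputs are scored by exact match, and model outputs are post-processed to align with the ground-truth labels.
Include GPT-4 results on 12 tasks as an external benchmark.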
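
To make the scoring step concrete, here is a minimal Python sketch of micro-F1 and macro-F1 for a single-label classification task. It is not the paper's evaluation code: the function names (`normalize_output`, `f1_scores`), the prefix-matching rule, and the `INVALID` fallback label are all assumptions introduced for illustration.

```python
from collections import Counter


def normalize_output(raw, labels):
    """Map a raw model completion onto an allowed label.

    Hypothetical post-processing rule: case-insensitive prefix match,
    trying longer labels first so that one label cannot shadow another.
    """
    text = raw.strip().lower()
    for label in sorted(labels, key=len, reverse=True):
        if text.startswith(label.lower()):
            return label
    return "INVALID"  # assumed fallback; always scored as an error


def f1_scores(gold, pred, labels):
    """Return (micro_f1, macro_f1) computed over the given label set."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1  # predicted label was wrong
            fn[g] += 1  # gold label was missed

    def f1(t, f_pos, f_neg):
        denom = 2 * t + f_pos + f_neg
        return 2 * t / denom if denom else 0.0

    # Micro-F1 pools the counts across labels before computing F1.
    micro = f1(
        sum(tp[l] for l in labels),
        sum(fp[l] for l in labels),
        sum(fn[l] for l in labels),
    )
    # Macro-F1 averages per-label F1, so rare labels count equally.
    macro = sum(f1(tp[l], fp[l], fn[l]) for l in labels) / len(labels)
    return micro, macro


if __name__ == "__main__":
    labels = ["yes", "no"]
    gold = ["yes", "no", "yes", "no"]
    raw_outputs = ["Yes.", "no", "No, because ...", "maybe"]
    pred = [normalize_output(r, labels) for r in raw_outputs]
    print(f1_scores(gold, pred, labels))
```

When every prediction falls inside the label set, micro-F1 in this single-label setting reduces to plain accuracy, while macro-F1 weights each class equally regardless of how often it occurs.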

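Experimental Results

Research Questions

RQ1 How well do open-weight LLMs perform under zero-shot prompting across a broad range of public health classification and extraction tasks?
RQ2 How do task type, data source, and model size affect performance?
RQ3 Does few-shot prompting significantly improve performance on difficult public health tasks?
RQ4 How does GPT-4 compare with the top open-weight models on a subset of tasks?
RQ5 Which tasks remain challenging for current LLMs and need future improvement?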

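Key Findings

Llama-3-70B-Instruct achieves the highest micro-F1 under zero-shot prompting on 15 of the 17 tasks.
Some tasks remain challenging for all open-weight models (e.g., BioDex Drugs Extraction, Contact Classification, MMLU Virology), with micro-F1 below 60%.
GPT-4 matches or exceeds the open-weight models on roughly half of the 12 tasks, but is outperformed by Llama-3-70B-Instruct on several of them.
Few-shot prompting yields substantial gains on difficult tasks such as Contact Classification and Health Causal Claims Classification (typically more than 15 percentage points of micro-F1) for most models, with Flan-T5-xxl as the exception.
Model size and architecture matter: for example, moving from Llama-3-8B-Instruct to Llama-3-70B-Instruct yields micro-F1 gains of more than 10 points on key tasks (e.g., MMLU Genetics, MMLU Nutrition, Health Advice, Causal Relation).
Overall, open-weight LLMs show promise for structuring public health free text and supporting surveillance, but should be used with caution on tasks that require specialist knowledge.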
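
The zero-shot versus few-shot comparison in these findings comes down to whether labelled examples are placed in the prompt ahead of the query. The sketch below illustrates that difference only; the `Text:`/`Label:` template, the `build_prompt` name, and the sample instruction and examples are hypothetical and are not taken from the paper's prompts.

```python
def build_prompt(instruction, text, examples=()):
    """Build a classification prompt.

    With no examples this is a zero-shot prompt; with k (text, label)
    pairs it becomes a k-shot prompt. The template is an assumed one.
    """
    parts = [instruction.strip(), ""]
    for ex_text, ex_label in examples:
        # Each in-context example shows an input and its gold label.
        parts += [f"Text: {ex_text}", f"Label: {ex_label}", ""]
    # The query ends with an open "Label:" for the model to complete.
    parts += [f"Text: {text}", "Label:"]
    return "\n".join(parts)


if __name__ == "__main__":
    instruction = "Classify whether the text describes contact with a case."
    query = "Shared a household with a confirmed case last week."
    shots = [
        ("Sat next to a confirmed case on a flight.", "yes"),
        ("No known exposure reported.", "no"),
    ]
    print(build_prompt(instruction, query))           # zero-shot
    print("-" * 40)
    print(build_prompt(instruction, query, shots))    # 2-shot
```

Keeping the instruction and the query format identical across both settings means any change in score can be attributed to the in-context examples rather than to a reworded prompt.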

This review was generated by AI and checked by a human editor.