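[Paper Explained] Text Classification via Large Language Models
CARP (Clue And Reasoning Prompting) improves LLM-based text classification with a three-step process: (1) collect clues, (2) perform diagnostic reasoning, and (3) make the final decision. To work around the context token limit, it selects in-context demonstrations by kNN search with a fine-tuned model. It reaches SOTA on several benchmarks and performs strongly in low-resource and domain-adaptation settings.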
Despite the remarkable success of large-scale Language Models (LLMs) such as GPT-3, their performances still significantly underperform fine-tuned models in the task of text classification. This is due to (1) the lack of reasoning ability in addressing complex linguistic phenomena (e.g., intensification, contrast, irony etc); (2) limited number of tokens allowed in in-context learning. In this paper, we introduce Clue And Reasoning Prompting (CARP). CARP adopts a progressive reasoning strategy tailored to addressing the complex linguistic phenomena involved in text classification: CARP first prompts LLMs to find superficial clues (e.g., keywords, tones, semantic relations, references, etc), based on which a diagnostic reasoning process is induced for final decisions. To further address the limited-token issue, CARP uses a fine-tuned model on the supervised dataset for $k$NN demonstration search in the in-context learning, allowing the model to take the advantage of both LLM's generalization ability and the task-specific evidence provided by the full labeled dataset. Remarkably, CARP yields new SOTA performances on 4 out of 5 widely-used text-classification benchmarks, 97.39 (+1.24) on SST-2, 96.40 (+0.72) on AGNews, 98.78 (+0.25) on R8 and 96.95 (+0.6) on R52, and a performance comparable to SOTA on MR (92.39 v.s. 93.3). More importantly, we find that CARP delivers impressive abilities on low-resource and domain-adaptation setups. Specifically, using 16 examples per class, CARP achieves comparable performances to supervised models with 1,024 examples per class.
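Research Motivation and Goals
- Motivate research on text classification with large language models (LLMs), given that LLMs still trail fine-tuned models on this task.
- Propose CARP (Clue And Reasoning Prompting), which decomposes reasoning into clue collection, diagnostic reasoning, and a final decision.
- Address the context token limit by integrating kNN demonstrations, retrieved with a fine-tuned model, into in-context learning.
- Demonstrate state-of-the-art performance on popular text-classification benchmarks, with experiments in zero-shot, few-shot, and full-data settings.
- Show CARP's robustness in low-resource and domain-adaptation settings.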
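Proposed Method
- Decompose text-classification reasoning into three steps: collect clues (keywords, tone, semantic relations), induce diagnostic reasoning from the clues and the input, and make the final label decision.
- Use in-context learning with demonstrations: retrieve the k nearest neighbors with a fine-tuned RoBERTa encoder to form task-specific demonstrations (kNN), which mitigates the token limit.
- Apply a progressive prompting strategy (CARP): the LLM first identifies superficial clues, then reasons over them, and finally outputs the label.
- Use InstructGPT-3 (text-davinci-003) as the backbone in the zero-shot, few-shot, and full-data experiments, and compare against vanilla ICL, CoT, and supervised baselines.
- Select demonstrations with several sampling strategies (Random, SimCSE kNN-Sampler, FT kNN-Sampler), and propose voting schemes (majority vote, weighted probability vote) to aggregate the results of multiple runs.
- Evaluate on the SST-2, AGNews, R8, R52, and MR datasets; report the mean and standard deviation over 5 runs.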
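The retrieval and prompting steps above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names `knn_demonstrations` and `build_carp_prompt`, the prompt field labels (`INPUT`, `CLUES`, `REASONING`, `LABEL`), and the choice of cosine similarity are all hypothetical, and the toy 2-D vectors stand in for sentence embeddings that would come from the fine-tuned encoder.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def knn_demonstrations(query_vec, train_vecs, k):
    """Return indices of the k training examples most similar to the query.

    Assumption: vectors are embeddings from some encoder (hypothetically the
    fine-tuned one); here they are plain lists of floats.
    """
    sims = [cosine(query_vec, vec) for vec in train_vecs]
    ranked = sorted(range(len(train_vecs)), key=lambda i: sims[i], reverse=True)
    return ranked[:k]


def build_carp_prompt(task_description, demonstrations, test_input):
    """Assemble a clue -> reasoning -> label prompt (field names are illustrative).

    Each demonstration is a dict with keys: input, clues, reasoning, label.
    The prompt ends at "CLUES:" so the model continues with clues first,
    then its reasoning, then the label.
    """
    parts = [task_description]
    for demo in demonstrations:
        parts.append(
            f"INPUT: {demo['input']}\n"
            f"CLUES: {demo['clues']}\n"
            f"REASONING: {demo['reasoning']}\n"
            f"LABEL: {demo['label']}"
        )
    parts.append(f"INPUT: {test_input}\nCLUES:")
    return "\n\n".join(parts)


# Toy usage: pick the 2 nearest training examples, then build the prompt.
train_vecs = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [-1.0, 0.0]]
train_demos = [
    {"input": "A delightful film.", "clues": "delightful",
     "reasoning": "The adjective is clearly positive.", "label": "positive"},
    {"input": "The plot is set in Paris.", "clues": "none",
     "reasoning": "No evaluative wording.", "label": "positive"},
    {"input": "Really enjoyable.", "clues": "enjoyable, really",
     "reasoning": "An intensified positive adjective.", "label": "positive"},
    {"input": "A dull mess.", "clues": "dull, mess",
     "reasoning": "Both words are negative.", "label": "negative"},
]
nearest = knn_demonstrations([1.0, 0.0], train_vecs, k=2)
prompt = build_carp_prompt(
    "Classify the sentiment of the INPUT as positive or negative.",
    [train_demos[i] for i in nearest],
    "An absolute joy to watch.",
)
```

Retrieving only the nearest neighbors keeps the prompt short while still drawing on the full labeled set, which is the motivation the summary gives for kNN demonstration search.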
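Experimental Results
Research Questions
- RQ1: Does CARP, by decomposing reasoning into clues and diagnostic reasoning, improve over standard prompting on text classification?
- RQ2: Do kNN demonstrations retrieved with a task-fine-tuned encoder improve in-context learning under token constraints?
- RQ3: How does CARP compare with vanilla prompting, chain-of-thought prompting, and supervised baselines across datasets and resource settings?
- RQ4: Is CARP robust to domain shift and low-resource scenarios?
- RQ5: How do different demonstration sampling strategies and voting schemes affect CARP's performance?
Key Findings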
| Model | SST-2 | AGNews | R8 | R52 | MR | Average |
|---|---|---|---|---|---|---|
| Supervised Methods - RoBERTa-Large | 95.99 | 95.55 | 97.76 | 96.42 | 91.16 | 95.38 |
| Supervised Methods - RoBERTa-GCN | 95.80 | 95.68 | 98.20 | 96.10 | 89.70 | 95.10 |
| Supervised Methods - XLNet | 96.10 | 95.55 | - | - | - | - |
| Supervised Methods - VLAWE | - | - | - | - | 93.3 | - |
| Supervised Methods - GCN-SB | - | - | 98.53 | 96.35 | 87.59 | - |
| Zero-shot Setting - Vanilla | 91.55 | 90.72 | 90.19 | 89.06 | 88.69 | 90.04 |
| Zero-shot Setting - CoT | 92.11 | 91.25 | 90.48 | 91.24 | 89.37 | 90.89 |
| Zero-shot Setting - CARP | 93.01 | 92.60 | 91.75 | 91.80 | 89.94 | 91.82 |
| Few-shot Setting - Random Sampler - Vanilla | 92.36 | 91.74 | 91.58 | 91.56 | 89.15 | 91.28 |
| Few-shot Setting - Random Sampler - CoT | 94.56 | 95.02 | 92.49 | 92.03 | 89.91 | 92.80 |
| Few-shot Setting - Random Sampler - CARP | 96.20 | 95.18 | 97.60 | 96.19 | 90.03 | 95.04 |
| Few-shot Setting - SimCSE kNN-Sampler - Vanilla | 93.90 | 93.50 | 94.36 | 92.40 | 89.59 | 94.05 |
| Few-shot Setting - SimCSE kNN-Sampler - CoT | 94.21 | 94.28 | 95.07 | 92.98 | 90.27 | 93.69 |
| Few-shot Setting - SimCSE kNN-Sampler - CARP | 95.69 | 95.25 | 97.83 | 96.27 | 90.74 | 95.16 |
| Few-shot Setting - FT kNN-Sampler - Vanilla | 94.01 | 94.14 | 95.57 | 95.79 | 90.90 | 94.08 |
| Few-shot Setting - FT kNN-Sampler - CoT | 95.48 | 94.89 | 95.59 | 95.89 | 90.17 | 94.40 |
| Few-shot Setting - FT kNN-Sampler - CARP | 96.80 | 95.99 | 98.29 | 96.82 | 91.90 | 95.97 |
| Few-shot Setting - CARP (WP Vote) | 97.39 | 96.40 | 98.78 | 96.95 | 92.39 | 96.38 |
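- CARP sets a new SOTA on four of the five benchmarks (SST-2, AGNews, R8, R52); on MR it is comparable to, but slightly below, SOTA (92.39 vs. 93.3).
- Zero-shot and few-shot CARP consistently outperform the vanilla prompting and CoT baselines.
- With 16 examples per class, CARP performs comparably to supervised models trained with 1,024 examples per class; in low-resource settings, CARP approaches full-data supervised performance.
- For kNN demonstration retrieval, a task-fine-tuned encoder (FT RoBERTa) works better than a general semantic encoder such as SimCSE.
- WP voting improves results further: CARP (WP Vote) reaches 97.39, 96.40, 98.78, 96.95, and 92.39 on SST-2, AGNews, R8, R52, and MR, respectively.
- CARP shows strong domain adaptation: performance degrades only slightly when the demonstrations come from a different domain.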
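The two voting schemes named in the summary can be illustrated with a short sketch. The summary does not spell out how the weighted probability (WP) vote is computed, so the version below is one plausible reading and an assumption: each run contributes its predicted label together with a confidence score, the scores are summed per label, and the label with the largest total wins. The function names are hypothetical.

```python
from collections import Counter, defaultdict


def majority_vote(labels):
    """Return the label predicted most often across runs.

    Ties are broken in favor of the label that appeared first.
    """
    return Counter(labels).most_common(1)[0][0]


def weighted_probability_vote(predictions):
    """Return the label with the largest summed probability across runs.

    predictions: list of (label, probability) pairs, one pair per run.
    Assumption: summing per-label probabilities is only one possible
    definition of a weighted probability vote.
    """
    totals = defaultdict(float)
    for label, prob in predictions:
        totals[label] += prob
    return max(totals, key=totals.get)


# Toy runs: two low-confidence "positive" votes and one confident "negative".
runs = [("positive", 0.4), ("positive", 0.4), ("negative", 0.95)]
by_majority = majority_vote([label for label, _ in runs])
by_weight = weighted_probability_vote(runs)
```

In this toy case the two schemes disagree: the majority vote counts runs, while the weighted vote lets a single confident run outweigh two uncertain ones.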
This explanation was generated by AI and reviewed by a human editor.