QUICK REVIEW

[Paper Review] Text Classification via Large Language Models

Xiaofei Sun, Xiaoya Li|arXiv (Cornell University)|May 15, 2023

Topic Modeling | Cited by 14

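One-Sentence Summary

CARP (Clue And Reasoning Prompting) improves LLM-based text classification by prompting in three steps, (1) collecting clues, (2) diagnostic reasoning, and (3) a final decision, and by using a fine-tuned model for kNN demonstration search to work around the in-context token limit; it sets new SOTA results on four of five benchmarks and performs strongly in low-resource and domain-adaptation settings.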

ABSTRACT

Despite the remarkable success of large-scale Language Models (LLMs) such as GPT-3, their performances still significantly underperform fine-tuned models in the task of text classification. This is due to (1) the lack of reasoning ability in addressing complex linguistic phenomena (e.g., intensification, contrast, irony etc); (2) limited number of tokens allowed in in-context learning. In this paper, we introduce Clue And Reasoning Prompting (CARP). CARP adopts a progressive reasoning strategy tailored to addressing the complex linguistic phenomena involved in text classification: CARP first prompts LLMs to find superficial clues (e.g., keywords, tones, semantic relations, references, etc), based on which a diagnostic reasoning process is induced for final decisions. To further address the limited-token issue, CARP uses a fine-tuned model on the supervised dataset for $k$NN demonstration search in the in-context learning, allowing the model to take the advantage of both LLM's generalization ability and the task-specific evidence provided by the full labeled dataset. Remarkably, CARP yields new SOTA performances on 4 out of 5 widely-used text-classification benchmarks, 97.39 (+1.24) on SST-2, 96.40 (+0.72) on AGNews, 98.78 (+0.25) on R8 and 96.95 (+0.6) on R52, and a performance comparable to SOTA on MR (92.39 v.s. 93.3). More importantly, we find that CARP delivers impressive abilities on low-resource and domain-adaptation setups. Specifically, using 16 examples per class, CARP achieves comparable performances to supervised models with 1,024 examples per class.

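Motivation and Objectives

Motivate research on text classification with large language models (LLMs), which still lag behind fine-tuned models on this task.
Propose CARP (Clue And Reasoning Prompting), which decomposes reasoning into clue collection, diagnostic reasoning, and a final decision.
Address the in-context token limit by using a fine-tuned model to retrieve kNN demonstrations for in-context learning.
Demonstrate state-of-the-art performance on popular text classification benchmarks across zero-shot, few-shot, and full-data settings.
Show that CARP is robust in low-resource and domain-adaptation settings.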

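Proposed Method

Decompose text classification reasoning into three steps: collect clues (keywords, tones, relations), induce diagnostic reasoning from the clues and the input, and give the final label decision.
Use in-context learning with demonstrations; retrieve the k nearest neighbors of each input with a fine-tuned RoBERTa encoder to form task-specific demonstrations (kNN), which mitigates the token limit.
Adopt a progressive prompting strategy (CARP): the LLM first identifies superficial clues, then reasons on top of them, and finally outputs the label.
In the zero-shot, few-shot, and full-data experiments, InstructGPT-3 (text-davinci-003) serves as the backbone; CARP is compared against vanilla ICL, CoT, and supervised baselines.
Sample demonstrations with several strategies (Random, SimCSE kNN-Sampler, FT kNN-Sampler) and propose voting schemes (majority vote, weighted probability vote) to aggregate the results of multiple runs.
Evaluate on the SST-2, AGNews, R8, R52, and MR datasets; report the mean and standard deviation over 5 runs.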

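Experimental Results

Research Questions

RQ1: Does CARP improve over standard prompting on text classification by decomposing reasoning into clues and diagnostic reasoning?
RQ2: Do kNN demonstrations retrieved with a task-fine-tuned encoder improve in-context learning under token constraints?
RQ3: How does CARP compare with vanilla prompting, chain-of-thought prompting, and supervised baselines across datasets and resource settings?
RQ4: Is CARP robust to domain shift and low-resource scenarios?
RQ5: How do different demonstration sampling strategies and voting schemes affect CARP's performance?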

Key Findings

Model	SST-2	AGNews	R8	R52	MR	Average
Supervised Methods - RoBERTa-Large	95.99	95.55	97.76	96.42	91.16	95.38
Supervised Methods - RoBERTa-GCN	95.80	95.68	98.20	96.10	89.70	95.10
Supervised Methods - XLNet	96.10	95.55	-	-	-	-
Supervised Methods - VLAWE	-	-	-	-	93.3	-
Supervised Methods - GCN-SB	-	-	98.53	96.35	87.59	-
Zero-shot Setting - Vanilla	91.55	90.72	90.19	89.06	88.69	90.04
Zero-shot Setting - CoT	92.11	91.25	90.48	91.24	89.37	90.89
Zero-shot Setting - CARP	93.01	92.60	91.75	91.80	89.94	91.82
Few-shot Setting - Random Sampler - Vanilla	92.36	91.74	91.58	91.56	89.15	91.28
Few-shot Setting - Random Sampler - CoT	94.56	95.02	92.49	92.03	89.91	92.80
Few-shot Setting - Random Sampler - CARP	96.20	95.18	97.60	96.19	90.03	95.04
Few-shot Setting - SimCSE kNN-Sampler - Vanilla	93.90	93.50	94.36	92.40	89.59	94.05
Few-shot Setting - SimCSE kNN-Sampler - CoT	94.21	94.28	95.07	92.98	90.27	93.69
Few-shot Setting - SimCSE kNN-Sampler - CARP	95.69	95.25	97.83	96.27	90.74	95.16
Few-shot Setting - FT kNN-Sampler - Vanilla	94.01	94.14	95.57	95.79	90.90	94.08
Few-shot Setting - FT kNN-Sampler - CoT	95.48	94.89	95.59	95.89	90.17	94.40
Few-shot Setting - FT kNN-Sampler - CARP	96.80	95.99	98.29	96.82	91.90	95.97
Few-shot Setting - CARP (WP Vote)	97.39	96.40	98.78	96.95	92.39	96.38
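As a quick sanity check on how to read the last column, the sketch below recomputes it for two rows, assuming it is the unweighted mean of the five per-dataset accuracies (an assumption that matches these two rows; the helper name `mean_accuracy` is made up for illustration).

```python
def mean_accuracy(scores):
    """Unweighted mean of per-dataset accuracies, rounded to 2 decimals."""
    return round(sum(scores) / len(scores), 2)


# Per-dataset accuracies copied from the table: SST-2, AGNews, R8, R52, MR.
carp_wp_vote = [97.39, 96.40, 98.78, 96.95, 92.39]
roberta_large = [95.99, 95.55, 97.76, 96.42, 91.16]

carp_avg = mean_accuracy(carp_wp_vote)      # 96.38
roberta_avg = mean_accuracy(roberta_large)  # 95.38
gap = round(carp_avg - roberta_avg, 2)      # 1.0

print(carp_avg, roberta_avg, gap)
```

Under that assumption, CARP with WP voting leads the fine-tuned RoBERTa-Large baseline by about one point on average.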

CARP sets a new SOTA on four of the five benchmarks (SST-2, AGNews, R8, R52); on MR it is comparable to, but slightly below, the SOTA (92.39 vs. 93.3).
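Zero-shot and few-shot CARP consistently outperform the vanilla prompting and CoT baselines.
With 16 examples per class, CARP performs comparably to supervised models trained with 1,024 examples per class; in low-resource settings it approaches full-data supervised performance.
Retrieving kNN demonstrations with the fine-tuned encoder (FT RoBERTa) works better for task-specific retrieval than a general semantic encoder such as SimCSE.
WP voting improves results further: CARP (WP Vote) reaches 97.39, 96.40, 98.78, 96.95, and 92.39 on SST-2, AGNews, R8, R52, and MR, respectively.
CARP adapts well across domains: performance degrades only slightly when the demonstrations come from a different domain.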
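To make the two voting schemes concrete, here is a small sketch under stated assumptions: each run is taken to return a predicted label with a confidence score, and weighted probability (WP) voting is read as summing those scores per label and picking the largest total. The summary does not give the exact formula, so this is one plausible interpretation, and the function names are hypothetical.

```python
from collections import Counter, defaultdict
from typing import Dict, List, Tuple


def majority_vote(labels: List[str]) -> str:
    """Return the label predicted by the most runs."""
    return Counter(labels).most_common(1)[0][0]


def weighted_probability_vote(predictions: List[Tuple[str, float]]) -> str:
    """Sum each label's confidence across runs and return the top label.

    Assumed reading of WP voting; a run contributes nothing to labels
    it did not predict.
    """
    totals: Dict[str, float] = defaultdict(float)
    for label, prob in predictions:
        totals[label] += prob
    return max(totals, key=totals.get)


# Five toy runs: three weakly confident "Positive", two confident "Negative".
runs = [
    ("Positive", 0.51),
    ("Positive", 0.52),
    ("Negative", 0.99),
    ("Negative", 0.98),
    ("Positive", 0.50),
]

majority = majority_vote([label for label, _ in runs])
weighted = weighted_probability_vote(runs)
print(majority, weighted)
```

In this toy case the two schemes disagree: majority voting counts three runs for "Positive", while the confidence-weighted total favors "Negative" (1.97 vs. 1.53), which shows how run confidence can change the aggregate decision.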


This review was generated by AI and checked by a human editor.