QUICK REVIEW

[Paper Review] Text Classification via Large Language Models

Xiaofei Sun, Xiaoya Li|arXiv (Cornell University)|May 15, 2023
Topic Modeling | Cited by 14
One-Sentence Summary

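CARP (Clue And Reasoning Prompting) improves LLM-based text classification by prompting the model to (1) collect clues, (2) carry out diagnostic reasoning, and (3) make a final decision. To work around the token limit of in-context learning, it retrieves demonstrations by kNN search with a fine-tuned model. CARP sets a new SOTA on 4 of 5 benchmarks and performs strongly in low-resource and domain-adaptation settings.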

ABSTRACT

Despite the remarkable success of large-scale Language Models (LLMs) such as GPT-3, their performances still significantly underperform fine-tuned models in the task of text classification. This is due to (1) the lack of reasoning ability in addressing complex linguistic phenomena (e.g., intensification, contrast, irony etc); (2) limited number of tokens allowed in in-context learning. In this paper, we introduce Clue And Reasoning Prompting (CARP). CARP adopts a progressive reasoning strategy tailored to addressing the complex linguistic phenomena involved in text classification: CARP first prompts LLMs to find superficial clues (e.g., keywords, tones, semantic relations, references, etc), based on which a diagnostic reasoning process is induced for final decisions. To further address the limited-token issue, CARP uses a fine-tuned model on the supervised dataset for $k$NN demonstration search in the in-context learning, allowing the model to take the advantage of both LLM's generalization ability and the task-specific evidence provided by the full labeled dataset. Remarkably, CARP yields new SOTA performances on 4 out of 5 widely-used text-classification benchmarks, 97.39 (+1.24) on SST-2, 96.40 (+0.72) on AGNews, 98.78 (+0.25) on R8 and 96.95 (+0.6) on R52, and a performance comparable to SOTA on MR (92.39 v.s. 93.3). More importantly, we find that CARP delivers impressive abilities on low-resource and domain-adaptation setups. Specifically, using 16 examples per class, CARP achieves comparable performances to supervised models with 1,024 examples per class.

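Motivation and Objectives

  • Motivate the use of large language models (LLMs) for text classification, a task where they still lag behind fine-tuned models.
  • Propose CARP (Clue And Reasoning Prompting), which decomposes reasoning into clue collection, diagnostic reasoning, and a final decision.
  • Address the context token limit by bringing kNN-retrieved demonstrations from a fine-tuned model into in-context learning.
  • Demonstrate state-of-the-art performance on popular text classification benchmarks, with evaluation in zero-shot, few-shot, and full-data settings.
  • Show that CARP is robust in low-resource and domain-adaptation setups.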

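Proposed Method

  • Decomposes text classification reasoning into three steps: collect clues (keywords, tones, semantic relations), induce diagnostic reasoning from the clues and the input, and give the final label decision.
  • Uses in-context learning with demonstrations; to ease the token limit, it retrieves the k nearest neighbors with a fine-tuned RoBERTa-based encoder to form task-specific demonstrations (kNN).
  • Adopts a progressive prompting strategy (CARP): the LLM first identifies superficial clues, then reasons over them, and finally outputs the label.
  • Uses InstructGPT-3 (text-davinci-003) as the backbone in the zero-shot, few-shot, and full-data experiments; compares against vanilla ICL, CoT, and supervised baselines.
  • Selects demonstrations with several sampling strategies (Random, SimCSE kNN-Sampler, FT kNN-Sampler) and proposes voting schemes (majority vote, weighted probability vote) to aggregate the results of multiple runs.
  • Evaluates on the SST-2, AGNews, R8, R52, and MR datasets; reports the mean and standard deviation over 5 runs.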

Experimental Results

Research Questions

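  • RQ1: Does CARP improve over standard prompting on text classification by decomposing reasoning into clues and diagnostic reasoning?
  • RQ2: Do kNN demonstrations from a task-fine-tuned encoder improve in-context learning under token constraints?
  • RQ3: How does CARP compare with vanilla prompting, chain-of-thought prompting, and supervised baselines across datasets and resource settings?
  • RQ4: Is CARP robust to domain shift and low-resource scenarios?
  • RQ5: How do different demonstration sampling strategies and voting schemes affect CARP's performance?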

Key Findings

Model | SST-2 | AGNews | R8 | R52 | MR | Average
Supervised Methods
RoBERTa-Large | 95.99 | 95.55 | 97.76 | 96.42 | 91.16 | 95.38
RoBERTa-GCN | 95.80 | 95.68 | 98.20 | 96.10 | 89.70 | 95.10
XLNet | 96.10 | 95.55 | - | - | - | -
VLAWE | - | - | - | - | 93.3 | -
GCN-SB | - | - | 98.53 | 96.35 | 87.59 | -
Zero-shot Setting - Vanilla | 91.55 | 90.72 | 90.19 | 89.06 | 88.69 | 90.04
Zero-shot Setting - CoT | 92.11 | 91.25 | 90.48 | 91.24 | 89.37 | 90.89
Zero-shot Setting - CARP | 93.01 | 92.60 | 91.75 | 91.80 | 89.94 | 91.82
Few-shot Setting - Random Sampler - Vanilla | 92.36 | 91.74 | 91.58 | 91.56 | 89.15 | 91.28
Few-shot Setting - Random Sampler - CoT | 94.56 | 95.02 | 92.49 | 92.03 | 89.91 | 92.80
Few-shot Setting - Random Sampler - CARP | 96.20 | 95.18 | 97.60 | 96.19 | 90.03 | 95.04
Few-shot Setting - SimCSE kNN-Sampler - Vanilla | 93.90 | 93.50 | 94.36 | 92.40 | 89.59 | 94.05
Few-shot Setting - SimCSE kNN-Sampler - CoT | 94.21 | 94.28 | 95.07 | 92.98 | 90.27 | 93.69
Few-shot Setting - SimCSE kNN-Sampler - CARP | 95.69 | 95.25 | 97.83 | 96.27 | 90.74 | 95.16
Few-shot Setting - FT kNN-Sampler - Vanilla | 94.01 | 94.14 | 95.57 | 95.79 | 90.90 | 94.08
Few-shot Setting - FT kNN-Sampler - CoT | 95.48 | 94.89 | 95.59 | 95.89 | 90.17 | 94.40
Few-shot Setting - FT kNN-Sampler - CARP | 96.80 | 95.99 | 98.29 | 96.82 | 91.90 | 95.97
Few-shot Setting - CARP (WP Vote) | 97.39 | 96.40 | 98.78 | 96.95 | 92.39 | 96.38
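  • CARP sets a new SOTA on four of the five benchmarks (SST-2, AGNews, R8, R52); on MR it is comparable to, but slightly below, the SOTA (92.39 vs. 93.3).
  • Zero-shot and few-shot CARP consistently outperform the vanilla prompting and CoT baselines.
  • With 16 examples per class, CARP performs comparably to supervised models trained on 1,024 examples per class; in low-resource settings, CARP approaches full-data supervised performance.
  • Retrieving kNN demonstrations with the fine-tuned encoder (FT RoBERTa) works better for task-specific retrieval than a general semantic encoder such as SimCSE.
  • Weighted probability (WP) voting improves results further: CARP (WP Vote) reaches 97.39, 96.40, 98.78, 96.95, and 92.39 on SST-2, AGNews, R8, R52, and MR, respectively.
  • CARP shows strong domain-adaptation ability, with only a small performance drop when the demonstrations come from a different domain.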
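
To make the two voting schemes named above concrete, here is a minimal sketch. It rests on assumptions this review does not confirm: each run is taken to return a probability per label, the weights default to uniform, and the function names and numbers are made up for illustration.

```python
from collections import Counter
from typing import Dict, List, Optional


def majority_vote(labels: List[str]) -> str:
    """Pick the label predicted most often across runs."""
    return Counter(labels).most_common(1)[0][0]


def weighted_probability_vote(run_probs: List[Dict[str, float]],
                              weights: Optional[List[float]] = None) -> str:
    """Pick the label with the largest weighted sum of probabilities.

    Uniform weights are an assumption of this sketch.
    """
    if weights is None:
        weights = [1.0] * len(run_probs)
    totals: Dict[str, float] = {}
    for w, probs in zip(weights, run_probs):
        for label, p in probs.items():
            totals[label] = totals.get(label, 0.0) + w * p
    return max(totals, key=lambda label: totals[label])


# Three hypothetical runs on the same input (made-up probabilities).
runs = [
    {"positive": 0.55, "negative": 0.45},
    {"positive": 0.60, "negative": 0.40},
    {"positive": 0.05, "negative": 0.95},
]
run_labels = [max(p, key=lambda label: p[label]) for p in runs]

majority = majority_vote(run_labels)        # two of three runs say "positive"
weighted = weighted_probability_vote(runs)  # summed mass favors "negative"
print(majority, weighted)
```

In this toy case the two schemes disagree: two low-confidence runs outvote one high-confidence run under majority voting, while the probability-weighted sum lets the confident run dominate.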


This review was generated by AI and reviewed by human editors.