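[Paper Review] Pushing the Limits of ChatGPT on NLP Tasks
Summary: This paper analyzes why ChatGPT underperforms on NLP tasks and presents a set of enhancement strategies: prompt diversity, task formalization, retrieval, reasoning, self-verification, and paraphrasing. Across 21 datasets and 10 NLP tasks, these strategies substantially improve performance, approaching or exceeding supervised baselines.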
Despite the success of ChatGPT, its performance on most NLP tasks is still well below the supervised baselines. In this work, we looked into the causes, and discovered that its subpar performance was caused by the following factors: (1) the token limit in the prompt does not allow for the full utilization of the supervised datasets; (2) a mismatch between the generation nature of ChatGPT and NLP tasks; (3) intrinsic pitfalls of LLMs, e.g., hallucination, excessive focus on certain keywords, etc. In this work, we propose a collection of general modules to address these issues, in an attempt to push the limits of ChatGPT on NLP tasks. Our proposed modules include (1) a one-input-multiple-prompts strategy that employs multiple prompts for one input to accommodate more demonstrations; (2) using fine-tuned models for better demonstration retrieval; (3) transforming tasks to formats that are more tailored to the generation nature; (4) employing reasoning strategies that are tailored to addressing the task-specific complexity; (5) the self-verification strategy to address the hallucination issue of LLMs; (6) the paraphrase strategy to improve the robustness of model predictions. We conduct experiments on 21 datasets of 10 representative NLP tasks, including question answering, commonsense reasoning, natural language inference, sentiment analysis, named entity recognition, entity-relation extraction, event extraction, dependency parsing, semantic role labeling, and part-of-speech tagging. Using the proposed assembly of techniques, we are able to significantly boost the performance of ChatGPT on the selected NLP tasks, achieving performance comparable to or better than supervised baselines, or even existing SOTA performance.
Motivation and Objectives
- Identify the main factors limiting ChatGPT on NLP tasks (token limits, task misalignment, reasoning gaps, hallucinations).
- Develop a generic toolkit to push ChatGPT performance across diverse NLP tasks.
- Demonstrate effectiveness on a broad set of datasets spanning QA, reasoning, NER, entity-relation extraction, sentiment analysis, parsing, and more.
Proposed Method
- One-input-multiple-prompts to expand demonstrations within the token limit and ensemble via voting.
- FT-retrieval using fine-tuned models to retrieve task-specific demonstrations for better prompt quality.
- Transforming tasks into generation-friendly formats and incorporating task-tailored reasoning (chain-of-thought) explanations.
- Proper task formalization to align NLP tasks with generation, including copy-modify and N-binary vs N-class approaches.
- Self-verification to mitigate hallucinations by post-generation validation.
- Paraphrase strategy to improve robustness by evaluating multiple paraphrases of the input.
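The one-input-multiple-prompts idea from the list above can be illustrated with a minimal sketch. It is not the paper's implementation: the function names, the prompt template, and the use of a fixed number of demonstrations per prompt as a stand-in for the token limit are all assumptions made for illustration. The `llm` argument is a hypothetical callable that maps a prompt string to a model answer.

```python
from collections import Counter
from typing import Callable, List, Sequence, Tuple

Demo = Tuple[str, str]  # (input text, gold label)


def split_demonstrations(demos: Sequence[Demo], per_prompt: int) -> List[List[Demo]]:
    """Partition demonstrations into groups small enough for one prompt each.

    `per_prompt` is an assumed proxy for the token limit.
    """
    if per_prompt <= 0:
        raise ValueError("per_prompt must be positive")
    return [list(demos[i:i + per_prompt]) for i in range(0, len(demos), per_prompt)]


def build_prompt(demos: Sequence[Demo], query: str) -> str:
    """Format demonstrations and the query with a hypothetical template."""
    blocks = [f"Input: {text}\nLabel: {label}" for text, label in demos]
    blocks.append(f"Input: {query}\nLabel:")
    return "\n\n".join(blocks)


def predict_with_voting(
    query: str,
    demos: Sequence[Demo],
    per_prompt: int,
    llm: Callable[[str], str],
) -> str:
    """Query the model once per demonstration group and majority-vote the answers."""
    # Fall back to a single zero-shot prompt when no demonstrations are given.
    groups = split_demonstrations(demos, per_prompt) or [[]]
    answers = [llm(build_prompt(group, query)).strip() for group in groups]
    # Ties are resolved in favor of the answer that appeared first.
    return Counter(answers).most_common(1)[0][0]
```

With five demonstrations and `per_prompt=2`, the same input is sent in three prompts and the most frequent answer is returned. Fixing the group size keeps the sketch short; a token-aware splitter would be closer to the constraint the paper describes.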

Experimental Results
Research Questions
- RQ1: Can increasing the number and diversity of demonstrations via multiple prompts close the gap between ChatGPT and supervised baselines?
- RQ2: Does task-tailored demonstration retrieval (especially FT-based) improve ChatGPT’s performance beyond random or general-purpose retrieval?
- RQ3: Can generation-friendly task formalization and reasoning enhance ChatGPT’s accuracy across diverse NLP tasks?
- RQ4: To what extent do self-verification and paraphrasing mitigate hallucinations and improve robustness?
Key Findings
- The one-input-multiple-prompts strategy yields substantial gains across the 21 datasets by fitting in more demonstrations and voting over the resulting predictions.
- Fine-tuned retrieval (FT) substantially improves demonstration relevance over random or general semantic retrieval, bridging the gap to supervised baselines.
- Reasoning-oriented prompts (including chain-of-thought rationale) improve performance across tasks, with benefits amplified when combined with other strategies.
- Self-verification (SV) consistently boosts performance and reduces hallucinations, notably in NER and CB/NLI settings.
- Paraphrase strategy enhances robustness for sentence-level tasks by reducing token-dominance effects and enabling voting across paraphrases.
- Across QA, commonsense reasoning, NLI, sentiment analysis, NER, and related tasks, the combined strategies achieve performance comparable to or better than supervised baselines on several datasets; notable gains are reported in specific tasks (e.g., QA, SST-2, NER, and entity-relation extraction).
- On out-of-domain MRQA-OOD, ChatGPT with the proposed methods can surpass some supervised baselines, indicating strong domain adaptability.
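The self-verification step mentioned in the findings above can be sketched for NER as a second pass that asks the model to confirm each extracted entity. This is a rough illustration under stated assumptions: the verification prompt wording, the yes/no answer format, and the names `verification_prompt` and `self_verify` are hypothetical and do not come from the paper. The `llm` argument is again an assumed callable from prompt to answer.

```python
from typing import Callable, List, Sequence, Tuple

Entity = Tuple[str, str]  # (span text, entity type)


def verification_prompt(sentence: str, entity: Entity) -> str:
    """Build a yes/no question about one candidate entity (hypothetical wording)."""
    span, entity_type = entity
    return (
        f"Sentence: {sentence}\n"
        f'Is "{span}" in the sentence above a {entity_type} entity? '
        "Answer yes or no."
    )


def self_verify(
    sentence: str,
    candidates: Sequence[Entity],
    llm: Callable[[str], str],
) -> List[Entity]:
    """Keep only the candidates that the model confirms in a second pass."""
    kept: List[Entity] = []
    for entity in candidates:
        answer = llm(verification_prompt(sentence, entity)).strip().lower()
        # Anything other than an answer starting with "yes" is treated as a rejection.
        if answer.startswith("yes"):
            kept.append(entity)
    return kept
```

Treating every non-"yes" answer as a rejection is a conservative choice that favors precision; a different policy may suit tasks where recall matters more.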

This review was generated by AI and checked by a human editor.