QUICK REVIEW

[Paper Review] DSPy: Compiling Declarative Language Model Calls into Self-Improving Pipelines

Omar Khattab, Arnav Singhvi|arXiv (Cornell University)|Oct 5, 2023

Topic Modeling | Citations: 49

One-line summary

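DSPy represents LM pipelines as text transformation graphs and provides declarative, parameterized modules together with compilers (teleprompters) that optimize prompting and finetuning, which makes self-improving multi-stage NLP systems possible. The case studies show large performance gains over hand-crafted prompts with both small and large LMs.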

ABSTRACT

The ML community is rapidly exploring techniques for prompting language models (LMs) and for stacking them into pipelines that solve complex tasks. Unfortunately, existing LM pipelines are typically implemented using hard-coded "prompt templates", i.e. lengthy strings discovered via trial and error. Toward a more systematic approach for developing and optimizing LM pipelines, we introduce DSPy, a programming model that abstracts LM pipelines as text transformation graphs, i.e. imperative computational graphs where LMs are invoked through declarative modules. DSPy modules are parameterized, meaning they can learn (by creating and collecting demonstrations) how to apply compositions of prompting, finetuning, augmentation, and reasoning techniques. We design a compiler that will optimize any DSPy pipeline to maximize a given metric. We conduct two case studies, showing that succinct DSPy programs can express and optimize sophisticated LM pipelines that reason about math word problems, tackle multi-hop retrieval, answer complex questions, and control agent loops. Within minutes of compiling, a few lines of DSPy allow GPT-3.5 and llama2-13b-chat to self-bootstrap pipelines that outperform standard few-shot prompting (generally by over 25% and 65%, respectively) and pipelines with expert-created demonstrations (by up to 5-46% and 16-40%, respectively). On top of that, DSPy programs compiled to open and relatively small LMs like 770M-parameter T5 and llama2-13b-chat are competitive with approaches that rely on expert-written prompt chains for proprietary GPT-3.5. DSPy is available at https://github.com/stanfordnlp/dspy

Motivation and objectives

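Promote a systematic, modular approach to building LM pipelines that goes beyond hand-written prompt templates.
Introduce the DSPy abstractions (signatures, modules, teleprompters) and show how they can be compiled into optimized prompting/finetuning strategies.
Show through case studies that DSPy outperforms hand-crafted prompts and that smaller LMs can match approaches built on large models with expert-designed prompts.
Show that bootstrapped demonstrations and modular design improve multi-stage NLP tasks such as math word problems and multi-hop QA.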

Proposed method

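Defines DSPy as a Python-based framework that uses declarative modules to express LM pipelines as text transformation graphs.
Introduces natural-language signatures that specify a module's input/output behavior (e.g., question -> answer) and enables automatic prompting/finetuning through modules such as Predict, ChainOfThought, and ReAct.
Explains how modules are parameterized: the choice of LM, the prompt, and the demonstrations, where the demonstrations can be bootstrapped.
Presents teleprompters as compilers/optimizers that generate demonstrations, tune parameters, and adjust control flow to optimize a metric on a training/validation set.
Describes DSPy compilation in three stages: candidate generation (creating demonstrations for each module), parameter optimization (selecting among demonstrations and prompts), and higher-order program optimization (ensembles, pipeline changes).
Uses examples such as a simple retrieval-augmented generation (RAG) system and a GSM8K math-word-problem pipeline to show how combining modules affects performance.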
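
The ideas in the list above (a signature, a parameterized module, and a teleprompter that bootstraps demonstrations) can be illustrated with a toy sketch in plain Python. This is not the DSPy API: `parse_signature`, `ToyPredict`, `bootstrap_few_shot`, and the `fake_lm` stub are hypothetical names made up for illustration, and the "LM" is a deterministic function that can only add two integers. The sketch assumes a single output field per signature.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple


def parse_signature(sig: str) -> Tuple[List[str], List[str]]:
    """Split a signature such as 'context, question -> answer' into field names."""
    left, right = sig.split("->")
    inputs = [name.strip() for name in left.split(",") if name.strip()]
    outputs = [name.strip() for name in right.split(",") if name.strip()]
    return inputs, outputs


@dataclass
class ToyPredict:
    """Toy module: the demonstrations are its learnable parameter."""
    signature: str
    lm: Callable[[str], str]  # stand-in for a real LM call (assumption)
    demos: List[Dict[str, str]] = field(default_factory=list)

    def build_prompt(self, **inputs: str) -> str:
        in_fields, out_fields = parse_signature(self.signature)
        lines: List[str] = []
        for demo in self.demos:  # render each demonstration as field: value lines
            for name in in_fields + out_fields[:1]:
                lines.append(f"{name}: {demo[name]}")
            lines.append("")
        for name in in_fields:
            lines.append(f"{name}: {inputs[name]}")
        lines.append(f"{out_fields[0]}:")
        return "\n".join(lines)

    def __call__(self, **inputs: str) -> Dict[str, str]:
        _, out_fields = parse_signature(self.signature)
        return {out_fields[0]: self.lm(self.build_prompt(**inputs)).strip()}


def bootstrap_few_shot(module: ToyPredict, trainset, metric, max_demos: int = 2) -> ToyPredict:
    """Run the module on training inputs and keep only traces that pass the metric."""
    in_fields, _ = parse_signature(module.signature)
    demos: List[Dict[str, str]] = []
    for example in trainset:
        inputs = {name: example[name] for name in in_fields}
        prediction = module(**inputs)
        if metric(example, prediction):
            demos.append({**inputs, **prediction})
        if len(demos) >= max_demos:
            break
    return ToyPredict(module.signature, module.lm, demos)


def fake_lm(prompt: str) -> str:
    """Deterministic stub: answers 'a + b' questions, returns 'unknown' otherwise."""
    question = [line for line in prompt.splitlines() if line.startswith("question:")][-1]
    expr = question[len("question:"):].strip()
    if "+" not in expr:
        return "unknown"
    a, b = expr.split("+")
    return str(int(a) + int(b))


trainset = [
    {"question": "2 + 3", "answer": "5"},
    {"question": "4 - 1", "answer": "3"},  # the stub fails here, so this trace is dropped
    {"question": "1 + 1", "answer": "2"},
]
exact_match = lambda example, pred: pred["answer"] == example["answer"]

compiled = bootstrap_few_shot(ToyPredict("question -> answer", fake_lm), trainset, exact_match)
print(compiled.build_prompt(question="7 + 8"))
```

The point of the design is that the calling code never touches a prompt string: the signature names the fields, and the compile step decides which demonstrations end up in the prompt. In the real framework the metric-filtered traces come from actual LM calls rather than a stub.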

Experimental results

Research questions

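RQ1: Can DSPy replace hand-crafted prompt templates with modular, parameterized LM-call components without sacrificing performance?
RQ2: Is parameterizing prompting techniques and optimizing modules with a compiler more effective at adapting to different LMs than expert-designed prompts?
RQ3: How does DSPy's modular approach enable the exploration and optimization of complex multi-stage NLP pipelines?
RQ4: How large are the performance gains achievable with DSPy over baseline prompting methods on tasks such as GSM8K math problems and multi-hop QA?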

Key findings

Program	Compilation	Training	Dev (GPT-3.5)	Test (GPT-3.5)	Dev (Llama2-13b-chat)	Test (Llama2-13b-chat)
vanilla	none	n/a	24.0	25.2	7.0	9.4
vanilla	fewshot	trainset	33.1	–	4.3	–
vanilla	bootstrap	trainset	44.0	–	28.0	–
vanilla	bootstrap × 2	trainset	64.7	61.7	37.3	36.5
vanilla	+ ensemble	trainset	62.7	61.9	39.0	34.6
CoT	none	n/a	50.0	–	26.7	–
CoT	fewshot	trainset	63.0	–	27.3	–
CoT	fewshot + human_CoT	trainset	78.6	72.4	34.3	33.7
CoT	bootstrap	trainset	80.3	72.9	43.3	–
CoT	+ ensemble	trainset	86.7	–	49.0	46.9
reflection	none	n/a	65.0	–	36.7	–
reflection	fewshot	trainset	71.7	–	36.3	–
reflection	bootstrap	trainset	83.0	76.0	44.3	40.2
reflection	+ ensemble	trainset	86.7	–	49.0	46.9
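
To read the table as gains rather than raw scores, the arithmetic can be sketched as follows. The numbers are GPT-3.5 dev accuracies copied from rows of the table above. The `dev_gpt35` dictionary and the `gain` helper are hypothetical names introduced only for this illustration.

```python
# GPT-3.5 dev accuracies copied from the table: (program, compilation) -> score.
dev_gpt35 = {
    ("vanilla", "none"): 24.0,
    ("CoT", "none"): 50.0,
    ("CoT", "fewshot"): 63.0,
    ("CoT", "fewshot + human_CoT"): 78.6,
    ("reflection", "none"): 65.0,
    ("reflection", "fewshot"): 71.7,
    ("reflection", "bootstrap"): 83.0,
}


def gain(baseline, improved):
    """Return (absolute gain in points, relative gain in percent), rounded to 0.1."""
    base, new = dev_gpt35[baseline], dev_gpt35[improved]
    absolute = round(new - base, 1)
    relative = round(100 * (new - base) / base, 1)
    return absolute, relative


for baseline, improved in [
    (("CoT", "none"), ("CoT", "fewshot")),
    (("reflection", "none"), ("reflection", "bootstrap")),
]:
    points, percent = gain(baseline, improved)
    print(f"{baseline} -> {improved}: +{points} points ({percent}% relative)")
```

Absolute and relative gains tell different stories when the baseline is low, so it helps to state which one a claimed improvement refers to.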

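DSPy pipelines substantially outperform standard few-shot prompting across the LMs tested (e.g., GPT-3.5 and llama2-13b-chat).
Demonstrations bootstrapped within DSPy match or exceed human-written chain-of-thought (CoT) reasoning in many settings.
Ensembles and multi-stage reasoning modules (e.g., reflection and chain-of-thought variants) yield the largest gains, with marked improvements once the programs are compiled.
Small/open-source LMs (e.g., 770M-parameter T5, llama2-13b-chat) can compete with expert-written prompt chains designed for larger models.
Compiling DSPy programs, even with modest datasets and a weak training signal, yields substantial accuracy improvements in specific settings, for example GSM8K gains from 33% to 82% (GPT-3.5) and from 32% to 46% (llama2-13b-chat).
DSPy shows that high-performing LM systems can be built from modular components without extensive hand-crafted prompts, which enables scalable exploration of pipeline designs.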
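
The ensemble finding above can be sketched with a majority vote over the answers of several programs. This is a minimal sketch under assumptions: majority voting is one common way to reduce several outputs to one, the three lambdas stand in for independently compiled programs, and `majority_vote`, `ensemble`, and `reduce_fn` are hypothetical names used only here, not the framework's own API.

```python
from collections import Counter
from typing import Callable, List


def majority_vote(answers: List[str]) -> str:
    """Return the most frequent answer after light normalization."""
    counts = Counter(answer.strip().lower() for answer in answers)
    return counts.most_common(1)[0][0]


def ensemble(programs: List[Callable[[str], str]],
             reduce_fn: Callable[[List[str]], str] = majority_vote) -> Callable[[str], str]:
    """Combine several programs into one by reducing their outputs."""
    def run(question: str) -> str:
        return reduce_fn([program(question) for program in programs])
    return run


# Stand-ins for three compiled programs that disagree on one question.
programs = [
    lambda question: "42",
    lambda question: "42 ",  # same answer with stray whitespace
    lambda question: "41",
]

voted = ensemble(programs)
print(voted("What is 6 * 7?"))
```

Normalizing before counting matters in practice, because otherwise the same answer with different whitespace or casing would split the vote.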


This review was generated by AI and checked by a human editor.