QUICK REVIEW

[Paper Review] Approaching Human-Level Forecasting with Language Models

Danny Halawi, Fred Zhang | arXiv (Cornell University) | Feb 28, 2024

demographic modeling and climate adaptation | Cited by 5

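One-line summary

The authors build a retrieval-augmented language model system that forecasts binary events by retrieving recent news, reasoning with scratchpad prompts, and ensembling the resulting forecasts. On average it comes close to the human crowd, and in some settings it surpasses it.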

ABSTRACT

Forecasting future events is important for policy and decision making. In this work, we study whether language models (LMs) can forecast at the level of competitive human forecasters. Towards this goal, we develop a retrieval-augmented LM system designed to automatically search for relevant information, generate forecasts, and aggregate predictions. To facilitate our study, we collect a large dataset of questions from competitive forecasting platforms. Under a test set published after the knowledge cut-offs of our LMs, we evaluate the end-to-end performance of our system against the aggregates of human forecasts. On average, the system nears the crowd aggregate of competitive forecasters, and in some settings surpasses it. Our work suggests that using LMs to forecast the future could provide accurate predictions at scale and help to inform institutional decision making.

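Motivation and objectives

Bring automated forecasting of binary events to parity with human forecasters.
Use retrieval-augmented prompting to incorporate up-to-date information.
Develop a self-supervised fine-tuning method that improves reasoning.
Evaluate end-to-end forecasts against the crowd aggregate on a recent, large-scale dataset.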

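Proposed method

Build a retrieval-augmented LM pipeline with query generation, relevance ranking, and article summarization.
Use a reasoning module that generates forecasts from the question context and article summaries via carefully designed scratchpad prompts.
Fine-tune the model on self-supervised data in which the model's outputs beat the crowd, to improve its reasoning.
Ensemble multiple forecasts with a trimmed mean to produce the final forecast.
Evaluate the end-to-end system against the crowd aggregate using the Brier score and calibration metrics.
Run a hyperparameter search to optimize the retrieval, prompting, and ensembling strategies.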
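
The ensembling step can be illustrated with a minimal sketch. The review does not state the trimming rule or how much is trimmed, so the symmetric trim and the `trim_fraction` parameter below are assumptions for illustration, not the authors' settings.

```python
def trimmed_mean(forecasts, trim_fraction=0.2):
    """Average probability forecasts after dropping extremes from both tails.

    trim_fraction is a hypothetical parameter: the fraction of forecasts
    removed from EACH tail before averaging.
    """
    if not forecasts:
        raise ValueError("need at least one forecast")
    if not 0 <= trim_fraction < 0.5:
        raise ValueError("trim_fraction must be in [0, 0.5)")
    ordered = sorted(forecasts)
    k = int(len(ordered) * trim_fraction)  # number dropped from each tail
    kept = ordered[k:len(ordered) - k]     # never empty because trim_fraction < 0.5
    return sum(kept) / len(kept)


# Five sampled forecasts for one question (made-up values).
samples = [0.10, 0.60, 0.62, 0.65, 0.99]
final_forecast = trimmed_mean(samples, trim_fraction=0.2)
```

In this example the outliers 0.10 and 0.99 are dropped before averaging, so a single extreme sample cannot drag the final forecast far from the consensus.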

(a) Our retrieval system. The LM takes in the question and generates search queries to retrieve articles from historical news APIs. Then the LM ranks the articles on relevancy and summarizes the top $k$ articles.
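
The retrieval flow in the caption above can be sketched as a small function. This is only a structural outline under stated assumptions: `Article`, `retrieve_summaries`, and the four callables are hypothetical names, and the toy stand-ins replace the LM and the news API so the example runs offline.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Article:
    title: str
    text: str


def retrieve_summaries(
    question: str,
    generate_queries: Callable[[str], List[str]],    # LM: question -> search queries
    search_news: Callable[[str], List[Article]],     # news API: query -> articles
    rate_relevance: Callable[[str, Article], float], # LM: relevance score
    summarize: Callable[[str, Article], str],        # LM: article -> summary
    k: int = 5,
) -> List[str]:
    # 1. Generate search queries from the question.
    queries = generate_queries(question)
    # 2. Retrieve articles, de-duplicating by title (an assumed heuristic).
    found: Dict[str, Article] = {}
    for query in queries:
        for article in search_news(query):
            found.setdefault(article.title, article)
    # 3. Rank articles by relevance, most relevant first.
    ranked = sorted(found.values(),
                    key=lambda a: rate_relevance(question, a),
                    reverse=True)
    # 4. Summarize only the top k articles.
    return [summarize(question, a) for a in ranked[:k]]


# Toy stand-ins so the sketch runs without an LM or a news API.
corpus = [
    Article("Rate decision", "The central bank is expected to hold rates."),
    Article("Sports recap", "The home team won on Saturday."),
    Article("Inflation report", "Inflation cooled, easing pressure on rates."),
]
keywords = ("rate", "bank", "central")

summaries = retrieve_summaries(
    "Will the central bank cut rates?",
    generate_queries=lambda q: [q],
    search_news=lambda query: list(corpus),
    rate_relevance=lambda q, a: float(sum(w in a.text.lower() for w in keywords)),
    summarize=lambda q, a: a.title,
    k=2,
)
```

Passing the LM and search steps in as callables keeps the control flow visible and lets each stage be swapped or tested in isolation.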

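Experimental results

Research questions

RQ1: Can a retrieval-augmented LM system forecast binary events with accuracy equal or close to that of the human crowd?
RQ2: How do retrieval quality, reasoning prompts, and ensembling affect forecast accuracy and calibration?
RQ3: Does self-supervised fine-tuning for reasoning improve forecasting performance over a zero-shot baseline?
RQ4: In selective settings with enough relevant articles, can the system surpass the crowd aggregate on some metrics?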

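Key findings

The end-to-end system approaches crowd performance across all questions, with a Brier score of 0.179 versus 0.149 for the crowd (averaged over the full test set).
In aggregate accuracy across all questions, the system reached 71.5% and the crowd 77.0%.
In selective settings with enough relevant articles, the system can surpass the crowd aggregate on some metrics.
The system is well calibrated: its RMS calibration error is comparable to the crowd's and improves on the base models in the zero-shot setting.
Retrieving at least 5 relevant articles and forecasting at early retrieval dates improve performance relative to the crowd.
Fine-tuning the reasoning LM on data that includes forecasts beating the crowd further strengthens its forecasting ability.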
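
For readers unfamiliar with the metrics reported above, here is a minimal sketch of how they can be computed for binary questions. The Brier score is the mean squared difference between the forecast probability and the 0/1 outcome, so lower is better. The equal-width binning in `rms_calibration_error` and the `n_bins` parameter are assumptions; the review does not describe the exact binning used.

```python
import math


def brier_score(probs, outcomes):
    """Mean squared error between forecast probabilities and 0/1 outcomes."""
    if len(probs) != len(outcomes) or not probs:
        raise ValueError("probs and outcomes must be non-empty and equal length")
    return sum((p - o) ** 2 for p, o in zip(probs, outcomes)) / len(probs)


def accuracy(probs, outcomes):
    """Fraction of questions where the side given > 0.5 matches the outcome."""
    hits = sum((p > 0.5) == bool(o) for p, o in zip(probs, outcomes))
    return hits / len(probs)


def rms_calibration_error(probs, outcomes, n_bins=10):
    """Root-mean-square gap between mean forecast and outcome rate per bin.

    Uses equal-width probability bins weighted by bin size (an assumption).
    """
    bins = [[] for _ in range(n_bins)]
    for p, o in zip(probs, outcomes):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes in the last bin
        bins[idx].append((p, o))
    total = len(probs)
    squared_gap = 0.0
    for members in bins:
        if not members:
            continue
        mean_p = sum(p for p, _ in members) / len(members)
        mean_o = sum(o for _, o in members) / len(members)
        squared_gap += (len(members) / total) * (mean_p - mean_o) ** 2
    return math.sqrt(squared_gap)


# Made-up forecasts and resolutions for four questions.
probs = [0.9, 0.2, 0.7, 0.4]
outcomes = [1, 0, 0, 1]
score = brier_score(probs, outcomes)
acc = accuracy(probs, outcomes)
```

As a reference point, always answering 0.5 yields a Brier score of 0.25 under this definition, which puts the reported 0.179 (system) and 0.149 (crowd) in context.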

(b) Our reasoning system. The system takes in the question and summarized articles and prompts LMs to generate forecasts. The forecasts are then aggregated into a final forecast using the trimmed mean.


This review was generated by AI and checked by a human editor.