QUICK REVIEW

[Paper Review] CREATE: Testing LLMs for Associative Creativity

Manya Wadhwa, Tania Roy|arXiv (Cornell University)|Mar 10, 2026

Artificial Intelligence in Games | Citations: 0

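One-line summary

CREATE evaluates the associative creativity of LLMs by having them generate paths between real-world concepts in a knowledge graph and ranking those paths by quality, diversity, and uniqueness. Frontier models perform best, but producing highly unique paths remains difficult.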

ABSTRACT

A key component of creativity is associative reasoning: the ability to draw novel yet meaningful connections between concepts. We introduce CREATE, a benchmark designed to evaluate models' capacity for creative associative reasoning. CREATE requires models to generate sets of paths connecting concepts in a model's parametric knowledge. Paths should have high specificity (distinctiveness and closeness of the concept connection) and high diversity (dissimilarity from other paths), and models are scored more highly if they produce a larger set of strong, diverse paths. This task shares demands of real creativity tasks like hypothesis generation, including an extremely large search space, but enables collection of a sizable benchmark with objective answer grading. Evaluation of frontier models shows that the strongest models achieve higher creative utility than others, with the high multiplicity of answers and complexity of the search making benchmark saturation difficult to achieve. Furthermore, our results illustrate that thinking models are not always more effective on our task, even with high token budgets. Recent approaches for creative prompting give some but limited additional improvement. CREATE provides a sandbox for developing new methods to improve models' capacity for associative creativity.

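Motivation and objectives

Evaluate the ability of LLMs to generate creative, open-ended connections between real-world concepts.
Define and measure associative creativity through the quality, diversity, and uniqueness of paths in a knowledge graph.
Investigate how a model's thinking budget and prompting strategy affect creative output.
Provide a scalable, knowledge-driven benchmark that supports objective evaluation.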

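Proposed method

Formalize associative creativity as paths in a knowledge graph that connect entities through valid triples.
Define the quality of a path as the minimum specificity over its triples, while ensuring that each relation is factual.
Define the distance between paths as the embedding-based cosine distance between path strings.
Combine quality and distance into a creative utility metric governed by a patience parameter.
Build CREATE from Wikidata-derived queries spanning diverse domains, and validate it with human and LLM evaluation.
Evaluate a broad set of models, with and without thinking, using a base prompt and its variants as well as iterative and resampling prompting.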
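
The definitions above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the `Triple` class, the specificity values, and the embedding vectors are all hypothetical inputs, and `toy_utility` is only one assumed way to combine quality, distance, and a patience parameter. The review does not give the paper's actual utility formula.

```python
from dataclasses import dataclass
from math import sqrt


@dataclass(frozen=True)
class Triple:
    head: str
    relation: str
    tail: str
    specificity: float  # hypothetical score; how it is computed is not stated here


def path_quality(path):
    """Quality of a path: the minimum specificity over its triples."""
    return min(t.specificity for t in path)


def cosine_distance(u, v):
    """Cosine distance between two embedding vectors (1 - cosine similarity)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sqrt(sum(a * a for a in u))
    norm_v = sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def toy_utility(qualities, embeddings, patience=0.7):
    """ASSUMPTION: an illustrative aggregate, not the paper's formula.

    Paths are visited from highest to lowest quality. Each path contributes
    its quality, scaled by its distance to the nearest already-counted path
    and discounted by patience ** rank.
    """
    order = sorted(range(len(qualities)), key=lambda i: qualities[i], reverse=True)
    total = 0.0
    counted = []
    for rank, i in enumerate(order):
        novelty = min(
            (cosine_distance(embeddings[i], embeddings[j]) for j in counted),
            default=1.0,  # the first path has nothing to be compared against
        )
        total += (patience ** rank) * qualities[i] * novelty
        counted.append(i)
    return total


# Hypothetical two-hop path between placeholder entities.
example_path = [
    Triple("Entity A", "directed", "Work X", specificity=4.0),
    Triple("Work X", "cast member", "Entity B", specificity=2.0),
]
example_quality = path_quality(example_path)
```

In this sketch a path is only as strong as its weakest triple, and a near-duplicate of an already counted path adds almost nothing to the total, which matches the stated intent of rewarding sets of strong, diverse paths.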

Figure 1 : Motivating example of brainstorming paths in knowledge graphs. In CREATE, only the question is given; reasoning over the graph is implicit in the model’s parameters and thinking trace, similar to drawing connections for scientific research. Finding strong, distinct paths can be challengin

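Experimental results

Research questions

RQ1: Can LLMs generate multiple high-quality, diverse, and unique paths connecting real-world entities?
RQ2: How do thinking budget and prompt variants affect creative utility, quality, diversity, and uniqueness?
RQ3: What is the trade-off between factuality and creative utility, and which models balance it best?
RQ4: Do advanced prompting strategies reliably improve associative creativity across models?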

Key findings

Model	s0.7	s	sigma	d	|U|	avg num tokens
GPT-4.1-mini	6.15 (5.08)	7.16 (6.81)	3.09 (1.66)	0.81 (0.26)	3.59 (3.72)	797 (258)
GPT-4.1	7.49 (5.25)	9.39 (8.01)	3.31 (1.50)	0.77 (0.27)	6.05 (5.27)	1076 (430)
GPT-5-mini (low)	6.21 (4.19)	7.03 (5.40)	3.23 (1.47)	0.64 (0.31)	4.95 (3.75)	1918 (482)
GPT-5-mini (med)	7.09 (4.61)	8.54 (6.56)	3.36 (1.45)	0.61 (0.31)	7.94 (5.52)	6360 (1743)
GPT-5-mini (high)	7.83 (4.95)	10.16 (7.85)	3.41 (1.46)	0.57 (0.29)	15.48 (10.65)	23480 (5518)
GPT-5 (med)	8.98 (5.11)	12.03 (8.67)	3.63 (1.34)	0.58 (0.27)	18.84 (13.72)	19090 (4767)
Claude-3-Haiku	3.49 (3.38)	3.68 (3.83)	2.34 (1.57)	0.83 (0.29)	1.69 (2.02)	373 (108)
Claude-Haiku-4.5 (low)	4.50 (3.78)	4.91 (4.54)	2.65 (1.51)	0.74 (0.32)	2.78 (2.79)	1004 (259)
Claude-Haiku-4.5 (med)	4.84 (3.87)	5.30 (4.67)	2.77 (1.53)	0.71 (0.31)	3.12 (3.01)	1658 (477)
Claude-Haiku-4.5 (high)	4.86 (3.97)	5.36 (4.89)	2.81 (1.55)	0.69 (0.33)	3.16 (3.03)	2150 (529)
Qwen3-30B-Instruct	5.20 (4.60)	6.27 (6.42)	2.66 (1.58)	0.75 (0.30)	5.61 (7.12)	1905 (480)
Qwen3-32B (16k)	4.69 (3.88)	5.08 (4.64)	2. unknown	0.81 (0.27)	2.34 (2.40)	3347 (1255)
Qwen3-32B (32k)	4.71 (3.77)	5.11 (4.56)	2.78 (1.51)	0.83 (0.26)	2.38 (2.43)	3333 (1221)
Olmo-3.1-32B-Instruct	3.77 (3.58)	4.13 (4.34)	2.32 (1.56)	0.83 (0.26)	2.46 (3.06)	846 (313)
Olmo-3.1-32B-Think (16k)	4.78 (3.96)	5.25 (4.95)	2.86 (1.63)	0.72 (0.33)	3.19 (3.46)	11939 (2269)
Olmo-3.1-32B-Think (32k)	4.97 (4.24)	5.52 (5.35)	2.87 (1.66)	0.71 (0.33)	3.34 (3.66)	12139 (2481)
Gemini-3-pro	8.29 (5.19)	10.41 (7.95)	3.56 (1.42)	0.77 (0.25)	6.00 (4.93)	1770 (795)

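Frontier models achieve the highest creative utility across patience settings, compared with open-source and smaller models.
Generating more paths (higher patience) generally raises utility, though not for every model.
Utility tends to be higher when quality is high and path diversity is large. Some models reach comparable utility with fewer paths when those paths are strong and unique.
Iterative prompting and resampling substantially improve creative utility, whereas verbalized sampling reduces path validity.
Uniqueness nu(U) is similar across frontier models, but iterative prompting improves uniqueness more reliably than resampling.
There is a trade-off between factuality and utility: utility drops as the factuality requirement becomes stricter. GPT-5 balanced the two best under strict conditions.
Factuality judgments by the LLM judge are largely reliable, but precision and recall vary across classes.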
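
As a rough reading aid for the results table, the sketch below relates score to token cost for GPT-5-mini at its three thinking budgets. It assumes that the `s` column is the creative utility score and that dividing one reported mean by another is an acceptable approximation. The ratio itself (`utility_per_1k_tokens`) is a hypothetical convenience measure, not a metric from the paper.

```python
# Values copied from the results table for GPT-5-mini:
# (thinking budget, s, average number of tokens).
ROWS = [
    ("low", 7.03, 1918),
    ("med", 8.54, 6360),
    ("high", 10.16, 23480),
]


def utility_per_1k_tokens(s, tokens):
    """Hypothetical cost-effectiveness ratio: score per 1,000 generated tokens."""
    return s / (tokens / 1000)


ratios = {budget: utility_per_1k_tokens(s, tokens) for budget, s, tokens in ROWS}

for budget, ratio in ratios.items():
    print(f"GPT-5-mini ({budget}): {ratio:.2f} per 1k tokens")
```

Under these assumptions the score keeps rising with the thinking budget while the score per token falls, which is one way to see why a larger token budget does not automatically make a model more effective on this task.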

Figure 2 : Examples of model-generated paths $u$ compared against population paths, along with quality scores and minimum distance values. The first and last connect artists through classic relations of directing, acting, performing, etc. The second path is the weakest according to the assessed spec

This review was generated by AI and checked by a human editor.