QUICK REVIEW

[Paper Review] CREATE: Testing LLMs for Associative Creativity

Manya Wadhwa, Tania Roy|arXiv (Cornell University)|2026. 03. 10.

Artificial Intelligence in Games | Citations: 0

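One-line summary

CREATE evaluates the associative creativity of LLMs through the generation and ranking of high-quality, diverse, and distinctive paths between real-world concepts in a knowledge graph. Frontier models perform best, but achieving high distinctiveness remains challenging.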

ABSTRACT

A key component of creativity is associative reasoning: the ability to draw novel yet meaningful connections between concepts. We introduce CREATE, a benchmark designed to evaluate models' capacity for creative associative reasoning. CREATE requires models to generate sets of paths connecting concepts in a model's parametric knowledge. Paths should have high specificity (distinctiveness and closeness of the concept connection) and high diversity (dissimilarity from other paths), and models are scored more highly if they produce a larger set of strong, diverse paths. This task shares demands of real creativity tasks like hypothesis generation, including an extremely large search space, but enables collection of a sizable benchmark with objective answer grading. Evaluation of frontier models shows that the strongest models achieve higher creative utility than others, with the high multiplicity of answers and complexity of the search making benchmark saturation difficult to achieve. Furthermore, our results illustrate that thinking models are not always more effective on our task, even with high token budgets. Recent approaches for creative prompting give some but limited additional improvement. CREATE provides a sandbox for developing new methods to improve models' capacity for associative creativity.

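Motivation and goals

Evaluate how well LLMs generate creative, open-ended connections between real-world concepts.
Define and measure associative creativity through the quality, diversity, and distinctiveness of knowledge-graph paths.
Investigate how a model's thinking budget and prompting strategy affect its creative output.
Provide a scalable, knowledge-grounded benchmark with objective grading that can guide the development of creative AI.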

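Proposed method

Formalize associative creativity as paths in a knowledge graph that connect entities through valid triples.
Define the quality of a path as the minimum specificity of its triples, and enforce the factuality of each relation.
Define the distance between paths as the embedding-based cosine distance between their path strings.
Combine quality and distance into a creative utility metric that includes a patience parameter.
Construct CREATE from Wikidata-based queries spanning diverse domains, and validate it with human and LLM judgments.
Evaluate a broad range of models (non-thinking and thinking) using a base prompt and its variants, including iterative prompting and resampling prompting.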
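
The scoring ingredients listed above can be illustrated with a minimal Python sketch. Only two things come from the review: path quality is the minimum specificity over a path's triples, and path distance is a cosine distance between embeddings of the path strings. Everything else is an assumption made for illustration: `toy_embed` is a bag-of-words stand-in for a real embedding model, the entities and specificity values are made up, and `toy_utility` is a hypothetical way to combine quality, distance, and a patience discount. It is not the formula used in the paper.

```python
import math
from collections import Counter

# A path is a list of (head, relation, tail) triples.
Triple = tuple[str, str, str]


def path_quality(specificities: list[float]) -> float:
    """Quality of a path: the specificity of its weakest triple."""
    return min(specificities)


def path_string(path: list[Triple]) -> str:
    """Flatten a path into one string so it can be embedded."""
    return " ; ".join(" ".join(triple) for triple in path)


def toy_embed(text: str) -> Counter:
    """ASSUMPTION: bag-of-words counts stand in for a real embedding model."""
    return Counter(text.lower().split())


def cosine_distance(a: Counter, b: Counter) -> float:
    """1 - cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return 1.0 - dot / norm if norm else 1.0


def toy_utility(qualities: list[float], embeddings: list[Counter],
                patience: float = 0.7) -> float:
    """HYPOTHETICAL aggregate, not the paper's metric.

    Paths are visited from highest to lowest quality. Each path contributes
    quality * (distance to the nearest already-counted path), discounted by
    patience ** rank so that later paths matter less.
    """
    order = sorted(range(len(qualities)), key=lambda i: qualities[i], reverse=True)
    counted: list[int] = []
    total = 0.0
    for rank, i in enumerate(order):
        novelty = min(
            (cosine_distance(embeddings[i], embeddings[j]) for j in counted),
            default=1.0,  # the first path has nothing to be compared against
        )
        total += (patience ** rank) * qualities[i] * novelty
        counted.append(i)
    return total


# Made-up example: two paths between the same pair of placeholder entities.
path_a = [("Actor A", "starred in", "Film X"), ("Film X", "directed by", "Director B")]
path_b = [("Actor A", "born in", "City Y"), ("City Y", "birthplace of", "Director B")]
qualities = [path_quality([4.0, 3.0]), path_quality([2.0, 5.0])]
embeddings = [toy_embed(path_string(p)) for p in (path_a, path_b)]
print(toy_utility(qualities, embeddings, patience=0.7))
```

Under this toy rule, a near-duplicate path adds almost nothing because its distance to an already-counted path is close to zero, which mirrors the review's point that a larger set only helps when the paths are both strong and distinct.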

Figure 1 : Motivating example of brainstorming paths in knowledge graphs. In CREATE, only the question is given; reasoning over the graph is implicit in the model’s parameters and thinking trace, similar to drawing connections for scientific research. Finding strong, distinct paths can be challengin

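Experimental results

Research questions

RQ1: Can LLMs generate many high-quality, diverse, and distinctive paths connecting real-world entities?
RQ2: How do a model's thinking budget and prompt variations affect creative utility, quality, diversity, and distinctiveness?
RQ3: What is the trade-off between factuality and creative utility, and which models balance the two best?
RQ4: Do advanced prompting strategies reliably improve associative creativity across models?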

Key results

Model	s0.7	s	sigma	d	|U|	avg num tokens
GPT-4.1-mini	6.15 (5.08)	7.16 (6.81)	3.09 (1.66)	0.81 (0.26)	3.59 (3.72)	797 (258)
GPT-4.1	7.49 (5.25)	9.39 (8.01)	3.31 (1.50)	0.77 (0.27)	6.05 (5.27)	1076 (430)
GPT-5-mini (low)	6.21 (4.19)	7.03 (5.40)	3.23 (1.47)	0.64 (0.31)	4.95 (3.75)	1918 (482)
GPT-5-mini (med)	7.09 (4.61)	8.54 (6.56)	3.36 (1.45)	0.61 (0.31)	7.94 (5.52)	6360 (1743)
GPT-5-mini (high)	7.83 (4.95)	10.16 (7.85)	3.41 (1.46)	0.57 (0.29)	15.48 (10.65)	23480 (5518)
GPT-5 (med)	8.98 (5.11)	12.03 (8.67)	3.63 (1.34)	0.58 (0.27)	18.84 (13.72)	19090 (4767)
Claude-3-Haiku	3.49 (3.38)	3.68 (3.83)	2.34 (1.57)	0.83 (0.29)	1.69 (2.02)	373 (108)
Claude-Haiku-4.5 (low)	4.50 (3.78)	4.91 (4.54)	2.65 (1.51)	0.74 (0.32)	2.78 (2.79)	1004 (259)
Claude-Haiku-4.5 (med)	4.84 (3.87)	5.30 (4.67)	2.77 (1.53)	0.71 (0.31)	3.12 (3.01)	1658 (477)
Claude-Haiku-4.5 (high)	4.86 (3.97)	5.36 (4.89)	2.81 (1.55)	0.69 (0.33)	3.16 (3.03)	2150 (529)
Qwen3-30B-Instruct	5.20 (4.60)	6.27 (6.42)	2.66 (1.58)	0.75 (0.30)	5.61 (7.12)	1905 (480)
Qwen3-32B (16k)	4.69 (3.88)	5.08 (4.64)	2. unknown	0.81 (0.27)	2.34 (2.40)	3347 (1255)
Qwen3-32B (32k)	4.71 (3.77)	5.11 (4.56)	2.78 (1.51)	0.83 (0.26)	2.38 (2.43)	3333 (1221)
Olmo-3.1-32B-Instruct	3.77 (3.58)	4.13 (4.34)	2.32 (1.56)	0.83 (0.26)	2.46 (3.06)	846 (313)
Olmo-3.1-32B-Think (16k)	4.78 (3.96)	5.25 (4.95)	2.86 (1.63)	0.72 (0.33)	3.19 (3.46)	11939 (2269)
Olmo-3.1-32B-Think (32k)	4.97 (4.24)	5.52 (5.35)	2.87 (1.66)	0.71 (0.33)	3.34 (3.66)	12139 (2481)
Gemini-3-pro	8.29 (5.19)	10.41 (7.95)	3.56 (1.42)	0.77 (0.25)	6.00 (4.93)	1770 (795)
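
As a reading aid for the table, the following sketch relates the `s` column to the token column for a few rows copied from above. It rests on assumptions that the review does not state: that `s` is the creative utility score, and that each cell has the form "value (spread)". The helper names `parse_cell` and `utility_per_1k_tokens` are hypothetical, and the derived ratio is a reader's convenience rather than a metric from the paper.

```python
import re


def parse_cell(cell: str) -> tuple[float, float]:
    """Split a 'value (spread)' cell into two floats; reject anything else."""
    match = re.fullmatch(r"\s*(\d+(?:\.\d+)?)\s*\((\d+(?:\.\d+)?)\)\s*", cell)
    if match is None:
        raise ValueError(f"unparseable cell: {cell!r}")
    return float(match.group(1)), float(match.group(2))


def utility_per_1k_tokens(s_cell: str, token_cell: str) -> float:
    """ASSUMPTION: treats column `s` as utility and divides by tokens / 1000."""
    s_value, _ = parse_cell(s_cell)
    tokens, _ = parse_cell(token_cell)
    return s_value / (tokens / 1000.0)


# (s, avg num tokens) cells copied verbatim from the table above.
rows = {
    "GPT-4.1": ("9.39 (8.01)", "1076 (430)"),
    "GPT-5-mini (high)": ("10.16 (7.85)", "23480 (5518)"),
    "GPT-5 (med)": ("12.03 (8.67)", "19090 (4767)"),
    "Gemini-3-pro": ("10.41 (7.95)", "1770 (795)"),
}

for model, (s_cell, token_cell) in rows.items():
    print(f"{model}: {utility_per_1k_tokens(s_cell, token_cell):.2f} per 1k tokens")
```

A damaged cell such as the `sigma` entry for Qwen3-32B (16k) raises `ValueError` in `parse_cell` instead of being silently coerced into a number.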

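Frontier models achieve the highest creative utility at every patience setting, compared with open-source and smaller models.
Increasing the number of generated paths generally raises utility, but not universally across all models.
Higher quality and greater path diversity correlate with higher utility; a model with strong, distinct paths can reach similar utility with fewer paths.
Iterative prompting and resampling substantially improve creative utility, whereas verbalized sampling lowers path validity.
Distinctiveness nu(U) is similar across frontier models, but iteration improves distinctiveness more reliably than resampling.
There is a trade-off between factuality and utility: stricter factuality requirements reduce utility, and GPT-5 balances the two best under strict conditions.
Factuality judgments made by an LLM judge are reasonably reliable, although precision and recall differ somewhat by class.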

Figure 2 : Examples of model-generated paths $u$ compared against population paths, along with quality scores and minimum distance values. The first and last connect artists through classic relations of directing, acting, performing, etc. The second path is the weakest according to the assessed spec


This review was generated by AI and reviewed by a human editor.