Skip to main content
QUICK REVIEW

[論文レビュー] Art or Artifice? Large Language Models and the False Promise of Creativity

Tuhin Chakrabarty, Philippe Laban|arXiv (Cornell University)|Sep 25, 2023
Topic Modeling被引用数 9
ひとこと要約

本論文は TTCW を提案する。TTCW は Torrance に触発された、短編小説の創造性を評価するアーティファクト中心の評価法であり、創造性テストにおいて LLM が人間の作家より3–10倍遅れ、LLM も創造性を信頼できず評価できないことを示している。

ABSTRACT

Researchers have argued that large language models (LLMs) exhibit high-quality writing capabilities from blogs to stories. However, evaluating objectively the creativity of a piece of writing is challenging. Inspired by the Torrance Test of Creative Thinking (TTCT), which measures creativity as a process, we use the Consensual Assessment Technique [3] and propose the Torrance Test of Creative Writing (TTCW) to evaluate creativity as a product. TTCW consists of 14 binary tests organized into the original dimensions of Fluency, Flexibility, Originality, and Elaboration. We recruit 10 creative writers and implement a human assessment of 48 stories written either by professional authors or LLMs using TTCW. Our analysis shows that LLM-generated stories pass 3-10X less TTCW tests than stories written by professionals. In addition, we explore the use of LLMs as assessors to automate the TTCW evaluation, revealing that none of the LLMs positively correlate with the expert assessments.

研究の動機と目的

  • Adapt TTCT into an artifact-centric Torrance Test for Creative Writing (TTCW) to evaluate creativity in short stories.
  • Validate TTCW using expert judgments on 48 stories (professionals vs. LLMs) under the Consensual Assessment Technique.
  • Investigate whether LLMs can function as autonomous assessors of TTCW tests.
  • Compare performance across LLMs to identify dimension-specific strengths and weaknesses.
  • Release large-scale TTCW annotations to enable future research in AI-assisted creative writing.

提案手法

  • Grounded TTCW in TTCT dimensions of Fluency, Flexibility, Originality, and Elaboration.
  • Develop 14 binary tests (Yes/No) with open-ended rationales to assess each story across TTCW dimensions.
  • Use Consensual Assessment Technique with 10 expert raters to evaluate 48 stories (12 professional, 36 LLM-generated).
  • Analyze inter-rater agreement (Fleiss Kappa) and aggregate correlations (Pearson) to validate TTCW.
  • Compare LLM-generated stories against expert-written ones across tests to quantify pass rates (e.g., 84.7% vs. 9–30%).
  • Experiment with LLMs as TTCW assessors to measure correlation with expert judgments.
Figure 1 . Pipeline showing the construction of TTCW and evaluation of short stories using the TTCW framework
Figure 1 . Pipeline showing the construction of TTCW and evaluation of short stories using the TTCW framework

実験結果

リサーチクエスチョン

  • RQ1RQ1: TTCW-based creativity evaluation is consistent and reproducible among experts?
  • RQ2RQ2: Expert-written stories pass TTCW tests more often than LLM-generated stories, and which tests show largest gaps?
  • RQ3RQ3: Do different LLMs differ in passing TTCW tests, and are there dimension-specific strengths?
  • RQ4RQ4: Can LLMs reliably assess TTCW tests, i.e., do their judgments correlate with expert judgments?

主な発見

  • Experts show moderate agreement on individual TTCW tests (Fleiss Kappa ~0.41) and strong agreement when aggregating all tests (Pearson ~0.69).
  • Expert-written stories pass an average of 84.7% of TTCW tests, while ChatGPT-generated stories pass ~9% and Claude-generated stories pass ~30%.
  • LLMs exhibit dimension-specific strengths (GPT-4 favors Originality; Claude v1.3 favors Fluency, Flexibility, Elaboration).
  • LLMs as TTCW assessors largely fail to correlate with expert judgments (correlations near zero).
  • The TTCW framework is additive across tests; higher overall pass counts indicate greater perceived creativity.
Figure 2 . Pipeline showing how our test set is created for evaluation. For each human-written original New Yorker story, we generate 3 stories from one LLM each, based on the plot of the original story. The plot is a single-sentence summary of the original story automatically generated by GPT-4 and
Figure 2 . Pipeline showing how our test set is created for evaluation. For each human-written original New Yorker story, we generate 3 stories from one LLM each, based on the plot of the original story. The plot is a single-sentence summary of the original story automatically generated by GPT-4 and

より良い研究を、今すぐ始めましょう

論文設計から論文執筆まで、研究時間を劇的に削減しましょう。

クレジットカード登録不要

このレビューはAIが作成し、人間の編集者が確認しました。