QUICK REVIEW

[Paper Review] Architecture inside the mirage: evaluating generative image models on architectural style, elements, and typologies

Jamie Magrill, Leah Gornstein|arXiv (Cornell University)|Jan 14, 2026

Aesthetic Perception and Analysis|Citations: 0

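One-line summary

The study evaluates five GenAI image platforms on 30 architectural prompts, with two architectural historians scoring the generated images for accuracy against predefined criteria. Overall accuracy was limited, which supports calls for labeling and provenance requirements.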

ABSTRACT

Generative artificial intelligence (GenAI) text-to-image systems are increasingly used to generate architectural imagery, yet their capacity to reproduce accurate images in a historically rule-bound field remains poorly characterized. We evaluated five widely used GenAI image platforms (Adobe Firefly, DALL-E 3, Google Imagen 3, Microsoft Image Generator, and Midjourney) using 30 architectural prompts spanning styles, typologies, and codified elements. Each prompt-generator pair produced four images (n = 600 images total). Two architectural historians independently scored each image for accuracy against predefined criteria, resolving disagreements by consensus. Set-level performance was summarized as zero to four accurate images per four-image set. Image output from Common prompts was 2.7-fold more accurate than from Rare prompts (p < 0.05). Across platforms, overall accuracy was limited (highest accuracy score 52 percent; lowest 32 percent; mean 42 percent). All-correct (4 out of 4) outcomes were similar across platforms. By contrast, all-incorrect (0 out of 4) outcomes varied substantially, with Imagen 3 exhibiting the fewest failures and Microsoft Image Generator exhibiting the highest number of failures. Qualitative review of the image dataset identified recurring patterns including over-embellishment, confusion between medieval styles and their later revivals, and misrepresentation of descriptive prompts (for example, egg-and-dart, banded column, pendentive). These findings support the need for visible labeling of GenAI synthetic content, provenance standards for future training datasets, and cautious educational use of GenAI architectural imagery.

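Motivation and objectives

Assess how well five widely used GenAI image platforms reproduce architectural styles, typologies, and elements from text prompts.
Quantify image accuracy through independent expert scoring against predefined criteria.
Examine how prompt frequency (Common vs. Rare) affects the accuracy of generated images.
Characterize qualitative patterns in GenAI output to inform labeling and provenance standards.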

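Method

Use five GenAI platforms: Adobe Firefly, DALL-E 3, Google Imagen 3, Microsoft Image Generator, and Midjourney.
Create 30 architectural prompts spanning styles, typologies, and codified elements.
Generate four images per prompt-platform pair (n = 600 images).
Have two architectural historians independently score each image for accuracy against predefined criteria, resolving disagreements by consensus.
Summarize set-level performance as zero to four accurate images per four-image set.
Statistically compare Common and Rare prompts (reported difference significant at p < 0.05).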
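
The design above can be sketched in a few lines of Python to make the counts concrete. This is only an illustration under stated assumptions: the prompt IDs and the helper names `set_score` and `needs_consensus` are hypothetical and do not come from the paper, which does not publish its scoring tooling in the abstract.

```python
from itertools import product

# Platform names and counts are taken from the abstract.
PLATFORMS = [
    "Adobe Firefly",
    "DALL-E 3",
    "Google Imagen 3",
    "Microsoft Image Generator",
    "Midjourney",
]
N_PROMPTS = 30
IMAGES_PER_SET = 4

# Hypothetical prompt IDs; the real prompt texts are not listed here.
prompt_ids = [f"P{i:02d}" for i in range(1, N_PROMPTS + 1)]

# Every prompt is run on every platform: 30 x 5 = 150 four-image sets.
pairs = list(product(prompt_ids, PLATFORMS))
total_images = len(pairs) * IMAGES_PER_SET


def needs_consensus(rater_a, rater_b):
    """Return the indices of images on which the two raters disagree."""
    return [i for i, (a, b) in enumerate(zip(rater_a, rater_b)) if a != b]


def set_score(consensus_labels):
    """Count accurate images (0 to 4) in one four-image set."""
    if len(consensus_labels) != IMAGES_PER_SET:
        raise ValueError("each set must contain exactly four images")
    return sum(1 for accurate in consensus_labels if accurate)
```

Here `total_images` works out to 600, matching the reported n, and `set_score` returns the zero-to-four value used for the set-level summary once disagreements flagged by `needs_consensus` have been resolved.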

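Experimental results

Research questions

RQ1: How accurately do GenAI platforms reproduce architectural styles, typologies, and elements?
RQ2: How does prompt frequency (Common vs. Rare) affect output accuracy?
RQ3: Are there platform-specific patterns in accuracy and failure rates?
RQ4: Which qualitative patterns in GenAI architectural imagery affect its reliability and interpretability?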

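Key findings

Mean accuracy across platforms was 42% (range 32% to 52%).
Images from Common prompts were 2.7-fold more accurate than images from Rare prompts (p < 0.05).
The highest platform accuracy was 52% and the lowest 32%; all-correct (4 out of 4) outcomes were similar across platforms.
All-incorrect (0 out of 4) outcomes varied substantially by platform: Imagen 3 had the fewest failures and Microsoft Image Generator the most.
Qualitative patterns included over-embellishment, confusion between medieval styles and their later revivals, and misrepresentation of descriptive prompts (for example, egg-and-dart, banded column, pendentive).
The findings support visible labeling of GenAI synthetic content, provenance standards for future training datasets, and cautious educational use of GenAI architectural imagery.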
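
As a rough illustration of how such set-level figures could be tallied, the sketch below computes an accuracy rate, the all-correct and all-incorrect counts, and a Common-to-Rare fold difference. It assumes accuracy means accurate images divided by total images, which the abstract does not state explicitly. The function names and the toy scores are invented for the example and are not the paper's data.

```python
IMAGES_PER_SET = 4


def summarize(set_scores):
    """Summarize a list of set-level scores (each 0 to 4)."""
    n_sets = len(set_scores)
    return {
        # Assumed definition: accurate images / total images.
        "accuracy": sum(set_scores) / (IMAGES_PER_SET * n_sets),
        "all_correct": sum(1 for s in set_scores if s == IMAGES_PER_SET),
        "all_incorrect": sum(1 for s in set_scores if s == 0),
    }


def fold_difference(common_scores, rare_scores):
    """Ratio of Common-prompt accuracy to Rare-prompt accuracy."""
    rare_accuracy = summarize(rare_scores)["accuracy"]
    if rare_accuracy == 0:
        raise ValueError("Rare-prompt accuracy is zero; ratio undefined")
    return summarize(common_scores)["accuracy"] / rare_accuracy


# Made-up toy scores, NOT results from the paper.
toy_common = [4, 3, 2, 3]
toy_rare = [0, 1, 2, 1]

common_summary = summarize(toy_common)
rare_summary = summarize(toy_rare)
toy_fold = fold_difference(toy_common, toy_rare)
```

Applied per platform, the same summary would separate the two outcomes the review contrasts: all-correct sets, which were similar across platforms, and all-incorrect sets, which were not.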

This review was generated by AI and checked by a human editor.