QUICK REVIEW

[Paper Review] BEMEval-Doc2Schema: Benchmarking Large Language Models for Structured Data Extraction in Building Energy Modeling

Yiyuan Jia, Xiaoqin Fu | arXiv (Cornell University) | Feb 18, 2026

BIM and Construction Integration | Citations: 0

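One-Sentence Summary

Introduces BEMEval-Doc2Schema, a benchmark for evaluating LLMs on structured data extraction that provides the new KVOR metric and a cross-model comparison.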

ABSTRACT

Recent advances in foundation models, including large language models (LLMs), have created new opportunities to automate building energy modeling (BEM). However, systematic evaluation has remained challenging due to the absence of publicly available, task-specific datasets and standardized performance metrics. We present BEMEval, a benchmark framework designed to assess foundation models' performance across BEM tasks. The first benchmark in this suite, BEMEval-Doc2Schema, focuses on structured data extraction from building documentation, a foundational step toward automated BEM processes. BEMEval-Doc2Schema introduces the Key-Value Overlap Rate (KVOR), a metric that quantifies the alignment between LLM-generated structured outputs and ground-truth schema references. Using this framework, we evaluate two leading models (GPT-5 and Gemini 2.5) under zero-shot and few-shot prompting strategies across three datasets: HERS L100, NREL iUnit, and NIST NZERTF. Results show that Gemini 2.5 consistently outperforms GPT-5, and that few-shot prompts improve accuracy for both models. Performance also varies by schema: the EPC schema yields significantly higher KVOR scores than HPXML, reflecting its simpler and reduced hierarchical depth. By combining curated datasets, reproducible metrics, and cross-model comparisons, BEMEval-Doc2Schema establishes the first community-driven benchmark for evaluating LLMs in performing building energy modeling tasks, laying the groundwork for future research on AI-assisted BEM workflows.

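Research Motivation and Objectives

Advance automated building energy modeling (BEM) with foundation models and highlight the gap in how such models are evaluated.
Propose BEMEval as a benchmark framework for task-specific evaluation in BEM.
Introduce BEMEval-Doc2Schema, which focuses on structured data extraction from building documentation.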

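Proposed Method

Define the Key-Value Overlap Rate (KVOR) metric, which measures the alignment between LLM outputs and ground-truth schema references.
Evaluate two leading LLMs (GPT-5 and Gemini 2.5) under zero-shot and few-shot prompting.
Run the evaluation on three datasets (HERS L100, NREL iUnit, and NIST NZERTF).
Compare performance across schemas (EPC and HPXML) and note that hierarchical depth affects KVOR.
Provide a reproducible benchmark setup that includes curated datasets and cross-model comparisons.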
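
This review does not reproduce the paper's exact KVOR formula, so the following is only a minimal sketch of how a key-value overlap score could be computed. It assumes that outputs are nested dicts and lists, that pairs are compared by exact match on their full path, and that the score is the matched fraction of reference pairs. The function names and the sample field names are hypothetical and are not taken from the paper or from any real schema.

```python
from typing import Any, Dict, Iterator, Tuple

Path = Tuple[str, ...]


def flatten(node: Any, path: Path = ()) -> Iterator[Tuple[Path, Any]]:
    """Yield (path, leaf_value) pairs for a nested dict/list structure."""
    if isinstance(node, dict):
        for key, child in node.items():
            yield from flatten(child, path + (str(key),))
    elif isinstance(node, list):
        for index, child in enumerate(node):
            yield from flatten(child, path + (str(index),))
    else:
        yield path, node


def key_value_overlap(predicted: Dict[str, Any], reference: Dict[str, Any]) -> float:
    """Fraction of reference key-value pairs reproduced exactly in the prediction.

    Assumption: an illustrative overlap score, not the paper's KVOR definition.
    """
    ref_pairs = dict(flatten(reference))
    if not ref_pairs:
        return 0.0  # no reference pairs to match; 0.0 is an arbitrary convention
    pred_pairs = dict(flatten(predicted))
    matched = sum(
        1
        for path, value in ref_pairs.items()
        if path in pred_pairs and pred_pairs[path] == value
    )
    return matched / len(ref_pairs)


# Hypothetical ground truth and model output (field names are made up).
reference = {"building": {"floor_area_m2": 120.0,
                          "walls": [{"r_value": 13}, {"r_value": 19}]}}
predicted = {"building": {"floor_area_m2": 120.0,
                          "walls": [{"r_value": 13}, {"r_value": 21}]}}

score = key_value_overlap(predicted, reference)  # 2 of 3 reference pairs match
```

Exact matching on full paths is a deliberately strict choice: a value placed under the wrong parent counts as a miss, which is one plausible reason a deeper schema could score lower under a metric of this kind.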

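Experimental Results

Research Questions

RQ1 Can LLMs accurately extract structured data from building documentation under zero-shot and few-shot prompting?
RQ2 How well does KVOR reflect the alignment between generated outputs and ground-truth schema references?
RQ3 How do model choice (GPT-5 vs. Gemini 2.5) and dataset/schema complexity affect extraction performance?
RQ4 Does schema design (EPC vs. HPXML) affect extraction difficulty as measured by KVOR?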
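
To make the zero-shot versus few-shot distinction in RQ1 concrete, here is a rough sketch of how the two prompt styles might be assembled. The prompt wording, the `build_prompt` helper, and its parameters are assumptions for illustration. The prompts actually used in the paper are not given in this review.

```python
from typing import Sequence, Tuple


def build_prompt(
    document_text: str,
    schema_name: str,
    examples: Sequence[Tuple[str, str]] = (),
) -> str:
    """Assemble a zero-shot prompt (no examples) or a few-shot prompt (with examples).

    Each example is a (document, expected_output) pair shown before the target.
    """
    parts = [
        f"Extract the building data from the document below as "
        f"{schema_name}-conformant structured output."
    ]
    for number, (example_doc, example_output) in enumerate(examples, start=1):
        parts.append(f"Example {number} document:\n{example_doc}")
        parts.append(f"Example {number} output:\n{example_output}")
    parts.append(f"Document:\n{document_text}")
    parts.append("Output:")
    return "\n\n".join(parts)


# Zero-shot: only the instruction and the target document.
zero_shot = build_prompt("Two-story house, 120 m2 ...", "EPC")

# Few-shot: the same prompt preceded by one worked (hypothetical) example.
few_shot = build_prompt(
    "Two-story house, 120 m2 ...",
    "EPC",
    examples=[("Single-story house, 80 m2 ...", '{"floor_area_m2": 80}')],
)
```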

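Key Findings

Gemini 2.5 consistently outperforms GPT-5 in the KVOR-based evaluation.
Few-shot prompting improves extraction accuracy for both models.
The EPC schema yields higher KVOR scores than HPXML, which the authors attribute to its simpler structure and shallower hierarchy.
BEMEval-Doc2Schema is presented as the first community-driven, reproducible benchmark for evaluating LLMs on BEM tasks.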
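
The schema finding turns on hierarchical depth, which can be measured directly on a nested structure. The sketch below is illustrative only: the two toy structures are hypothetical and do not represent the real EPC or HPXML schemas, and `nesting_depth` is a helper written for this example.

```python
from typing import Any


def nesting_depth(node: Any) -> int:
    """Return the number of nested container levels (0 for a scalar leaf)."""
    if isinstance(node, dict):
        return 1 + max((nesting_depth(child) for child in node.values()), default=0)
    if isinstance(node, list):
        return 1 + max((nesting_depth(child) for child in node), default=0)
    return 0


# Hypothetical flat record: every field sits at the top level.
shallow = {"floor_area": 120.0, "wall_r_value": 13}

# Hypothetical deeply nested record carrying the same wall information.
deep = {"site": {"envelope": {"walls": [{"insulation": {"r_value": 13}}]}}}

shallow_depth = nesting_depth(shallow)  # 1
deep_depth = nesting_depth(deep)        # 6
```

In the deep variant a model has to reproduce six levels of containers correctly before the value `13` lands in the right place, whereas the flat variant needs only one.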

This review was generated by AI and checked by a human editor.