QUICK REVIEW

[Paper Digest] BEMEval-Doc2Schema: Benchmarking Large Language Models for Structured Data Extraction in Building Energy Modeling

Yiyuan Jia, Xiaoqin Fu|arXiv (Cornell University)|Feb 18, 2026

BIM and Construction Integration | Cited by 0

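One-Sentence Summary

This paper introduces BEMEval-Doc2Schema, a benchmark for evaluating how well large language models (LLMs) extract structured data for building energy modeling; it adds the KVOR metric and compares performance across models.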

ABSTRACT

Recent advances in foundation models, including large language models (LLMs), have created new opportunities to automate building energy modeling (BEM). However, systematic evaluation has remained challenging due to the absence of publicly available, task-specific datasets and standardized performance metrics. We present BEMEval, a benchmark framework designed to assess foundation models' performance across BEM tasks. The first benchmark in this suite, BEMEval-Doc2Schema, focuses on structured data extraction from building documentation, a foundational step toward automated BEM processes. BEMEval-Doc2Schema introduces the Key-Value Overlap Rate (KVOR), a metric that quantifies the alignment between LLM-generated structured outputs and ground-truth schema references. Using this framework, we evaluate two leading models (GPT-5 and Gemini 2.5) under zero-shot and few-shot prompting strategies across three datasets: HERS L100, NREL iUnit, and NIST NZERTF. Results show that Gemini 2.5 consistently outperforms GPT-5, and that few-shot prompts improve accuracy for both models. Performance also varies by schema: the EPC schema yields significantly higher KVOR scores than HPXML, reflecting its simpler structure and reduced hierarchical depth. By combining curated datasets, reproducible metrics, and cross-model comparisons, BEMEval-Doc2Schema establishes the first community-driven benchmark for evaluating LLMs in performing building energy modeling tasks, laying the groundwork for future research on AI-assisted BEM workflows.

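Motivation and Objectives

Advance automated building energy modeling (BEM) with foundation models, and highlight the gap in systematic evaluation.
Propose BEMEval as a benchmark framework for evaluating BEM-specific tasks.
Introduce BEMEval-Doc2Schema, which focuses on extracting structured data from building documentation.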

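Proposed Method

Define the Key-Value Overlap Rate (KVOR) metric to measure the alignment between LLM output and ground-truth schema references.
Evaluate two leading LLMs (GPT-5 and Gemini 2.5) under zero-shot and few-shot prompting.
Use three datasets (HERS L100, NREL iUnit, NIST NZERTF) to assess alignment under different schemas.
Compare performance across schemas (EPC vs. HPXML), noting that deeper hierarchies affect KVOR.
Provide a reproducible benchmark setup with curated datasets and cross-model comparisons.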
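
This digest does not give the exact KVOR formula, so the following is only a minimal sketch of one plausible reading: flatten both the model output and the ground-truth reference into (path, value) pairs and report the share of reference pairs that the output reproduces exactly. The helper names `flatten` and `kvor`, the exact-match rule, and the field names in the sample data are all assumptions made for illustration; the paper's definition may differ, for example in how it normalizes values or handles list ordering.

```python
from typing import Any, Dict, Tuple


def flatten(node: Any, prefix: Tuple = ()) -> Dict[Tuple, Any]:
    """Flatten nested dicts/lists into {path: leaf_value} pairs."""
    if isinstance(node, dict):
        items = node.items()
    elif isinstance(node, list):
        items = enumerate(node)
    else:
        # Leaf value: the accumulated path is its key.
        return {prefix: node}
    flat: Dict[Tuple, Any] = {}
    for key, child in items:
        flat.update(flatten(child, prefix + (key,)))
    return flat


def kvor(predicted: Any, reference: Any) -> float:
    """Assumed definition: fraction of reference key-value pairs matched exactly."""
    ref = flatten(reference)
    pred = flatten(predicted)
    if not ref:
        return 0.0
    matched = sum(
        1 for path, value in ref.items()
        if path in pred and pred[path] == value
    )
    return matched / len(ref)


# Hypothetical field names, not taken from HPXML or EPC.
reference = {
    "building": {"floor_area": 1500, "walls": [{"r_value": 13}, {"r_value": 19}]},
    "climate_zone": "5A",
}
predicted = {
    "building": {"floor_area": 1500, "walls": [{"r_value": 13}, {"r_value": 21}]},
    "climate_zone": "5A",
}
score = kvor(predicted, reference)  # 3 of 4 reference pairs match
```

Under this reading, a single wrong leaf value lowers the score by one pair, while a misplaced parent key invalidates every pair beneath it, which is one way a deeper hierarchy could make a schema harder to score well on.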

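Experimental Results

Research Questions

RQ1: Can LLMs accurately extract structured data from building documentation under zero-shot and few-shot prompting?
RQ2: How does KVOR reflect the alignment between generated output and ground-truth schema references?
RQ3: How do model choice (GPT-5 vs. Gemini 2.5) and dataset/schema complexity affect extraction performance?
RQ4: Does schema design (EPC vs. HPXML) affect extraction difficulty as measured by KVOR?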
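
The zero-shot versus few-shot contrast in RQ1 can be made concrete with a small prompt builder. This is a hedged sketch only: the digest does not reproduce the paper's prompts, so the instruction wording, the `build_prompt` helper, and the sample document and output are hypothetical. The only point it illustrates is that a few-shot prompt prepends worked document-to-output examples before the target document, while a zero-shot prompt contains the target alone.

```python
import json
from typing import Sequence, Tuple

# Hypothetical instruction text, not the wording used in the paper.
INSTRUCTION = (
    "Extract the building parameters from the text below "
    "as JSON following the {schema} schema."
)


def build_prompt(
    document: str,
    schema: str,
    examples: Sequence[Tuple[str, dict]] = (),
) -> str:
    """Return a zero-shot prompt, or a few-shot prompt when examples are given."""
    parts = [INSTRUCTION.format(schema=schema)]
    for example_doc, example_output in examples:
        # Each worked example pairs a source text with its expected JSON.
        parts.append(
            f"Document:\n{example_doc}\nOutput:\n{json.dumps(example_output)}"
        )
    # The target document always comes last, with the output left open.
    parts.append(f"Document:\n{document}\nOutput:")
    return "\n\n".join(parts)


target = "Two-story house, 1500 sq ft conditioned floor area."
zero_shot = build_prompt(target, "EPC")
few_shot = build_prompt(
    target,
    "EPC",
    examples=[("Single-story house, 900 sq ft.", {"floor_area": 900})],
)
```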

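Key Findings

Gemini 2.5 consistently outperforms GPT-5 on the KVOR benchmark evaluation.
Few-shot prompting improves extraction accuracy for both models.
The EPC schema yields higher KVOR scores than HPXML because its structure is simpler and its hierarchy shallower.
BEMEval-Doc2Schema provides a community-driven, reproducible benchmark for evaluating LLMs on BEM tasks.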
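
The link between hierarchical depth and extraction difficulty can be illustrated by measuring how deeply a structured output is nested. The sketch below is an assumption-laden illustration, not the paper's analysis: `max_depth` is a hypothetical helper, and the two sample records use made-up field names that merely contrast a flat layout with a nested one. They are not real EPC or HPXML fragments.

```python
from typing import Any


def max_depth(node: Any) -> int:
    """Count nested container levels; a bare leaf value has depth 0."""
    if isinstance(node, dict):
        return 1 + max((max_depth(v) for v in node.values()), default=0)
    if isinstance(node, list):
        return 1 + max((max_depth(v) for v in node), default=0)
    return 0


# Hypothetical flat record: every value sits directly under the root.
flat_record = {"floor_area": 1500, "wall_r_value": 13}

# Hypothetical nested record: the same wall value sits four containers deeper.
nested_record = {"site": {"envelope": {"walls": [{"r_value": 13}]}}}

flat_depth = max_depth(flat_record)
nested_depth = max_depth(nested_record)
```

Every additional level is one more key the model has to place correctly before a leaf value can count as a match, which is consistent with the reported gap between the two schemas.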

This digest was generated by AI and reviewed by a human editor.