QUICK REVIEW

[Paper Review] Large Language Model for OWL Proofs

Hui Yang, Jiaoyan Chen | arXiv (Cornell University) | Jan 18, 2026

Explainable Artificial Intelligence (XAI) | Cited by 0

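One-Sentence Summary

This paper evaluates the proof capabilities of LLMs over OWL ontologies, focusing on three tasks (Extraction, Simplification, and Explanation) and analyzing how complexity, language form, noise, and incomplete premises affect performance.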

ABSTRACT

The ability of Large Language Models (LLMs) to perform reasoning tasks such as deduction has been widely investigated in recent years. Yet their capacity to generate proofs, that is, faithful, human-readable explanations of why conclusions follow, remains largely underexplored. In this work, we study proof generation in the context of OWL ontologies, which are widely adopted for representing and reasoning over complex knowledge, by developing an automated dataset construction and evaluation framework. Our evaluation encompasses three sequential tasks for complete proving (Extraction, Simplification, and Explanation), as well as an additional task of assessing the Logic Completeness of the premises. Through extensive experiments on widely used reasoning LLMs, we obtain several important findings: (1) some models achieve strong overall results but remain limited on complex cases; (2) logical complexity, rather than representation format (formal logic language versus natural language), is the dominant factor shaping LLM performance; and (3) noise and incompleteness in the input data substantially diminish LLM performance. Together, these results underscore both the promise of LLMs for explanation with rigorous logic and the gap in supporting resilient reasoning under complex or imperfect conditions. Code and data are available at https://github.com/HuiYang1997/LLMOwlR.

Research Motivation and Objectives

Motivate and study proof construction for OWL ontologies, including how LLMs generate faithful, human-readable explanations.
Define three progressively harder tasks (Extraction, Simplification, Explanation) plus a Logic Completeness assessment under incomplete premises.
Propose an automatic dataset construction and evaluation framework for OWL ontologies in the EL Description Logic fragment.
Empirically compare multiple reasoning-capable LLMs across real-world ontologies to identify success factors and failure modes.

Proposed Method

Construct automatic datasets for EL ontologies using minimal justifications as gold evidence for conclusions.
Extract minimal axiom sets (justifications) required to entail a target subsumption.
Simplify extracted axioms to their essential components that contribute to the derivation.
Generate coherent explanations that describe how conclusions follow from premises, using a structured AXIOMS_USED, SIMPLIFY, DERIVE format.
Evaluate under variations including formal vs natural language, and complete vs incomplete premises, using prompt-based methods and multiple LLMs.
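To make the notion of a minimal justification concrete, here is a minimal sketch, not the paper's implementation. It assumes a toy ontology that contains only atomic subsumptions of the form A ⊑ B, so entailment reduces to reachability. A real EL ontology would need an EL reasoner in place of the hypothetical `entails` helper. The class names and the function names `entails` and `minimal_justification` are illustrative assumptions. The shrinking loop is the standard deletion-based technique: drop an axiom whenever the goal is still entailed without it.

```python
from typing import Iterable, List, Set, Tuple

# An axiom is a pair (sub, sup) standing for the atomic subsumption "sub ⊑ sup".
# Assumption: only atomic subsumptions are modelled, not full EL.
Axiom = Tuple[str, str]


def entails(axioms: Iterable[Axiom], goal: Axiom) -> bool:
    """Return True if `goal` follows from `axioms` by reflexivity and transitivity."""
    sub, sup = goal
    if sub == sup:
        return True
    axiom_list = list(axioms)
    reachable: Set[str] = {sub}
    frontier: List[str] = [sub]
    while frontier:
        current = frontier.pop()
        for a, b in axiom_list:
            if a == current and b not in reachable:
                if b == sup:
                    return True
                reachable.add(b)
                frontier.append(b)
    return False


def minimal_justification(axioms: List[Axiom], goal: Axiom) -> List[Axiom]:
    """Deletion-based shrinking: drop every axiom the goal does not depend on."""
    if not entails(axioms, goal):
        raise ValueError("goal is not entailed by the given axioms")
    kept = list(axioms)
    for axiom in list(axioms):
        candidate = [a for a in kept if a != axiom]
        if entails(candidate, goal):
            kept = candidate  # the axiom was redundant for this goal
    return kept


# Hypothetical toy ontology with two noise axioms.
ontology = [
    ("Espresso", "Coffee"),
    ("Coffee", "Beverage"),
    ("Beverage", "Food"),
    ("Espresso", "HotDrink"),  # noise: irrelevant to the goal
    ("Tea", "Beverage"),       # noise: irrelevant to the goal
]
justification = minimal_justification(ontology, ("Espresso", "Food"))
print(justification)
```

On this toy input the two noise axioms are discarded and the three-axiom chain from Espresso to Food remains. Because entailment is monotone, a single deletion pass yields a set that is minimal with respect to inclusion, though an ontology may admit several distinct justifications for the same conclusion.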

Figure 2. Distribution of the selected conclusions (subsumptions).

Experimental Results

Research Questions

RQ1: Can LLMs construct complete and minimal justifications for OWL entailments in EL ontologies?
RQ2: How do formal logic expressions versus natural language presentations affect LLM proof generation?
RQ3: What is the impact of noisy or incomplete premises on LLMs' ability to extract, simplify, and explain OWL proofs?
RQ4: Which factors (model size, prompts, and inference rules) most influence LLM performance on OWL proofs?

Key Findings

Some models achieve strong overall results but struggle on complex derivations beyond simple transitive patterns.
Logical complexity of the task, not the representation format, is the main determinant of performance.
Noise and incomplete premises can significantly reduce reasoning accuracy (up to about 38% drop with missing premises).
Supplying inference rules in the prompt substantially boosts performance for some models (e.g., Qwen3-32B) but has mixed effects for others (e.g., GPT-o4-mini).
GPT-o4-mini often yields the best simplification and overall derivation accuracy, with 7–8 steps per derivation on average.
Performance varies across ontologies, with SNOMED CT generally more challenging than GO-Plus and FoodOn.

This review was generated by AI and reviewed by a human editor.