QUICK REVIEW

[Paper Review] KG^2: Learning to Reason Science Exam Questions with Contextual Knowledge Graph Embeddings

Yuyu Zhang, Hanjun Dai | arXiv (Cornell University) | May 31, 2018

Topic Modeling | 21 references | 23 citations

TL;DR

KG² is a neural reasoning framework that builds contextual knowledge graphs from questions and supporting sentences to improve science question answering. By learning to reason over paired hypothesis and supporting-fact graphs, it scores 31.70 on the ARC Challenge Set, a relative improvement of about 17.5% over prior state-of-the-art methods.

ABSTRACT

The AI2 Reasoning Challenge (ARC), a new benchmark dataset for question answering (QA) has been recently released. ARC only contains natural science questions authored for human exams, which are hard to answer and require advanced logic reasoning. On the ARC Challenge Set, existing state-of-the-art QA systems fail to significantly outperform random baseline, reflecting the difficult nature of this task. In this paper, we propose a novel framework for answering science exam questions, which mimics human solving process in an open-book exam. To address the reasoning challenge, we construct contextual knowledge graphs respectively for the question itself and supporting sentences. Our model learns to reason with neural embeddings of both knowledge graphs. Experiments on the ARC Challenge Set show that our model outperforms the previous state-of-the-art QA systems.

Motivation & Objective

Address the challenge of answering complex, logic-intensive science exam questions that require deeper reasoning beyond surface-level patterns.
Overcome the limitations of existing QA systems that fail on the ARC Challenge Set despite using large corpora and neural models.
Mimic human problem-solving in open-book exams by combining question-stem and answer options into hypotheses, retrieving supporting facts, and verifying them via graph-based reasoning.
Develop a differentiable neural framework that learns to reason over structured representations of knowledge, improving generalization and interpretability.
Facilitate progress on the ARC benchmark by decomposing remaining difficulties into identifiable categories such as missing support, parsing errors, and complex reasoning.

Proposed method

Construct a hypothesis graph by combining the question stem and each answer option, using Open Information Extraction (Open IE) to extract subject-predicate-object triples.
Retrieve supporting sentences from the ARC Corpus using a search engine, then generate a supporting fact graph via Open IE to represent relevant knowledge.
Represent both hypothesis and supporting graphs as knowledge graphs where entities are nodes and relations are edges, enabling structured reasoning.
Train a differentiable neural reasoning engine that compares the structural patterns between hypothesis and supporting graphs to predict the correct answer.
Use a contrastive learning objective to align the reasoning patterns in the hypothesis graph with those in the supporting graph, improving generalization.
Optimize the model end-to-end using gradient descent to refine embeddings and reasoning decisions, with attention mechanisms to focus on relevant subgraphs.
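
The graph construction steps above can be sketched in a few lines. This is only an illustration under stated assumptions: the triples are written by hand rather than produced by an Open IE system, the names `build_graph`, `overlap_score`, and `pick_answer` are hypothetical, and the exact edge-overlap score is a crude stand-in for the learned neural comparison of graph embeddings, not the paper's model.

```python
from collections import defaultdict


def build_graph(triples):
    """Build an adjacency map: subject -> set of (predicate, object) edges."""
    graph = defaultdict(set)
    for subj, pred, obj in triples:
        graph[subj.lower()].add((pred.lower(), obj.lower()))
    return dict(graph)


def edge_set(graph):
    """Flatten a graph back into a set of (subject, predicate, object) edges."""
    return {(s, p, o) for s, edges in graph.items() for p, o in edges}


def overlap_score(hypothesis, support):
    """Fraction of hypothesis edges that also occur in the support graph.

    Assumption: exact-match overlap replaces the learned comparison.
    """
    hyp_edges = edge_set(hypothesis)
    if not hyp_edges:
        return 0.0
    return len(hyp_edges & edge_set(support)) / len(hyp_edges)


def pick_answer(hypotheses, support):
    """Return the option label whose hypothesis graph scores highest."""
    return max(hypotheses, key=lambda opt: overlap_score(hypotheses[opt], support))


# Hand-written example triples (hypothetical, not taken from the ARC Corpus).
support = build_graph([
    ("plants", "produce", "oxygen"),
    ("plants", "absorb", "carbon dioxide"),
])
hypotheses = {
    "A": build_graph([("plants", "produce", "oxygen")]),
    "B": build_graph([("plants", "produce", "nitrogen")]),
}
best = pick_answer(hypotheses, support)
```

Each answer option gets its own hypothesis graph, and all options are scored against the same supporting-fact graph. A learned model would replace `overlap_score` with a differentiable function of graph embeddings so that near matches and paraphrases can still contribute.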

Experimental results

Research questions

RQ1: Can a neural reasoning model that constructs contextual knowledge graphs from questions and supporting facts outperform existing QA systems on the ARC Challenge Set?
RQ2: To what extent does graph-based reasoning over structured representations improve performance on questions requiring advanced logic and comprehension?
RQ3: What are the primary failure modes in current QA systems on the ARC Challenge Set, and can they be mitigated by structured reasoning over knowledge graphs?
RQ4: How does the performance of the model scale with improvements in knowledge coverage and parsing quality?
RQ5: Can a differentiable, end-to-end framework for reasoning over knowledge graphs close the gap between neural QA and human-level performance on science exams?

Key findings

KG² achieves a test score of 31.70 on the ARC Challenge Set, a relative improvement of about 17.5% over the strongest baseline quoted here, TableILP (26.97); relative to a baseline score of 26.41 the gain is about 20%.
The model significantly outperforms all baselines, including strong models like BiDAF (26.54) and TableILP (26.97), demonstrating the effectiveness of graph-based reasoning.
The random baseline score is 25.02, indicating that prior methods perform only slightly better than random, highlighting the difficulty of the ARC Challenge Set.
Analysis shows that 50% of questions lack sufficient supporting information in the corpus, suggesting that knowledge coverage is a major bottleneck.
12% of questions fail due to Open IE parsing errors, indicating that more accurate sentence-level parsing could improve performance.
Only 15% of questions are considered 'learnable' under the current framework, suggesting that the upper bound for current methods is around 36.25 if all learnable questions are correctly answered.
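
The two derived figures in these findings can be checked with simple arithmetic. The sketch below assumes that the 17.5% figure is a relative improvement over the TableILP score of 26.97, and that every non-learnable question is answered at chance on a four-option question (25%); both are assumptions that happen to reproduce the reported numbers, not statements from the paper. The function names are hypothetical.

```python
def relative_improvement(new_score, old_score):
    """Relative gain of new_score over old_score, in percent."""
    return (new_score - old_score) / old_score * 100.0


def upper_bound(learnable_fraction, chance_accuracy):
    """Expected score (in percent) if all learnable questions are answered
    correctly and the rest are guessed at chance accuracy."""
    guessed = (1.0 - learnable_fraction) * chance_accuracy
    return 100.0 * (learnable_fraction + guessed)


gain = round(relative_improvement(31.70, 26.97), 1)
bound = round(upper_bound(0.15, 0.25), 2)
print(gain)   # 17.5
print(bound)  # 36.25
```

Under these assumptions the bound is 15 points from learnable questions plus 21.25 points from guessing on the remaining 85%, which matches the 36.25 quoted above.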

This review was created by AI and reviewed by human editors.