QUICK REVIEW

[Paper Review] ReCoRD: Bridging the Gap between Human and Machine Commonsense Reading Comprehension

Sheng Zhang, Xiaodong Liu | arXiv (Cornell University) | October 30, 2018

Topic Modeling | References: 30 | Citations: 215

One-Line Summary

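The paper introduces ReCoRD, a large-scale MRC dataset that requires commonsense reasoning, shows that humans substantially outperform state-of-the-art models on it, and highlights the gap that remains to be closed.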

ABSTRACT

We present a large-scale dataset, ReCoRD, for machine reading comprehension requiring commonsense reasoning. Experiments on this dataset demonstrate that the performance of state-of-the-art MRC systems fall far behind human performance. ReCoRD represents a challenge for future research to bridge the gap between human and machine commonsense reading comprehension. ReCoRD is available at http://nlp.jhu.edu/record.

Motivation and Objectives

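Motivate the need for reading comprehension that requires broad commonsense reasoning beyond surface-level text patterns.
Automatically generate a large benchmark (passages, cloze-style queries, answers) from news articles for evaluating commonsense reasoning.
Apply filtering and human validation so that queries require non-trivial reasoning and are unambiguous.
Provide the benchmark together with human performance to quantify the gap between machines and humans on commonsense MRC.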

Proposed Method

Automatically generate 770k (passage, query, answer) triples from CNN/Daily Mail news articles.
Form cloze-style queries by replacing a named entity with X in sentences that cite antecedents in the passage.
Filter easy triples using a strong MRC model (SAN) to keep 244k harder triples.
Crowdsource human validation to prune ambiguity and ensure correct answers, yielding a 120,730 query set across train/dev/test splits.
Evaluate multiple MRC models (including DocQA with/without ELMo, QANet, ASReader, SAN, language models) and human performance on exact match and F1 metrics.
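
The query-construction and filtering steps above can be sketched roughly as follows. This is a minimal illustration, not the authors' pipeline: the names `Triple`, `make_cloze`, and `filter_hard` are hypothetical, entity detection is assumed to have been done already, and `baseline_predict` stands in for a trained baseline such as SAN.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional

PLACEHOLDER = "X"  # the review states that the named entity is replaced with X


@dataclass
class Triple:
    passage: str
    query: str
    answer: str


def make_cloze(passage: str, sentence: str, entity: str) -> Optional[Triple]:
    """Build a (passage, query, answer) triple by masking `entity` in `sentence`.

    Assumption: the entity must also be mentioned in the passage, so the
    answer can be found there. Plain substring matching is used for brevity;
    a real pipeline would rely on a named-entity recognizer.
    """
    if entity not in sentence or entity not in passage:
        return None
    return Triple(passage, sentence.replace(entity, PLACEHOLDER), entity)


def filter_hard(
    triples: List[Triple],
    baseline_predict: Callable[[str, str], str],
) -> List[Triple]:
    """Keep only the triples that the baseline model answers incorrectly."""
    return [t for t in triples if baseline_predict(t.passage, t.query) != t.answer]
```

For example, `make_cloze("Alice met Bob in Paris.", "Later, Bob flew home.", "Bob")` yields the query `"Later, X flew home."` with answer `"Bob"`. A triple survives `filter_hard` only if the stand-in baseline predicts something other than the answer.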

Experimental Results

Research Questions

RQ1: How do current MRC models perform on a dataset that requires commonsense reasoning?
RQ2: What is the performance gap between humans and machines on ReCoRD across standard MRC architectures?
RQ3: What types of commonsense reasoning are most prevalent in ReCoRD and how do models fare on them?
RQ4: Does candidate-entity guidance (the cloze setting) help models, and how does data construction affect difficulty?

Key Results

Model	EM Dev	EM Test	F1 Dev	F1 Test
Human	91.28	91.31	91.64	91.69
DocQA w/ ELMo	44.13	45.44	45.39	46.65
DocQA w/o ELMo	36.59	38.52	37.89	39.76
SAN	38.14	39.77	39.09	40.72
QANet	35.38	36.51	36.75	37.79
ASReader	29.24	29.80	29.80	30.35
LM	16.73	17.57	17.41	18.15
Random Guess	18.41	18.55	19.06	19.12
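
As a quick arithmetic check on the table, the human-machine gap on the test set can be computed directly from the reported scores. The sketch below only copies the test-set columns from the table; the helper names `gaps_to_human` and `best_system` are illustrative.

```python
# Test-set scores copied from the results table: (EM, F1).
TEST_SCORES = {
    "Human": (91.31, 91.69),
    "DocQA w/ ELMo": (45.44, 46.65),
    "DocQA w/o ELMo": (38.52, 39.76),
    "SAN": (39.77, 40.72),
    "QANet": (36.51, 37.79),
    "ASReader": (29.80, 30.35),
    "LM": (17.57, 18.15),
    "Random Guess": (18.55, 19.12),
}


def gaps_to_human(scores):
    """Return {system: (EM gap, F1 gap)} relative to the human row."""
    human_em, human_f1 = scores["Human"]
    return {
        name: (round(human_em - em, 2), round(human_f1 - f1, 2))
        for name, (em, f1) in scores.items()
        if name != "Human"
    }


def best_system(scores):
    """Return the non-human system with the highest test EM."""
    systems = {k: v for k, v in scores.items() if k != "Human"}
    return max(systems, key=lambda k: systems[k][0])


if __name__ == "__main__":
    gaps = gaps_to_human(TEST_SCORES)
    top = best_system(TEST_SCORES)
    print(top, gaps[top])  # DocQA w/ ELMo (45.87, 45.04)
```

Even the strongest system in the table trails humans by about 45 points on both metrics.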

Humans achieve 91.31 EM and 91.69 F1 on the test set, while the best automatic method (DocQA with ELMo) achieves only 45.44 EM and 46.65 F1.
Queries retained after SAN-based filtering remain hard across models, all of which score substantially below humans.
Unsupervised language models perform similarly to random guessing on ReCoRD (17.57 vs. 18.55 test EM), suggesting domain knowledge gaps.
Eliciting answers from candidate entities (cloze setting) provides potential gains (~6% OOC reduction) if models leverage entity candidates.
Across 100 sampled queries, 75% require commonsense reasoning, with major types including conceptual knowledge and causal/naïve psychology reasoning.
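
The EM and F1 figures reported above can be illustrated with a small scoring sketch. The review does not spell out the scoring procedure, so the normalization below (lowercasing, stripping punctuation and English articles) and the token-overlap F1 follow the common SQuAD-style convention and are an assumption, not a description of the official ReCoRD evaluation script.

```python
import re
import string
from collections import Counter


def normalize(text: str) -> str:
    """Lowercase, drop punctuation and articles, collapse whitespace (assumed convention)."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction: str, gold: str) -> float:
    """1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize(prediction) == normalize(gold))


def token_f1(prediction: str, gold: str) -> float:
    """Harmonic mean of token-level precision and recall."""
    pred_tokens = normalize(prediction).split()
    gold_tokens = normalize(gold).split()
    if not pred_tokens or not gold_tokens:
        # Both empty counts as a match; one empty counts as a miss.
        return float(pred_tokens == gold_tokens)
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)
```

Under this sketch, predicting "Barack Obama" against the gold answer "Obama" scores 0 on EM but 2/3 on F1, which is why F1 sits slightly above EM for every row in the table.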

This review was generated by AI and reviewed by a human editor.