[Paper Review] NeurIPS 2020 EfficientQA Competition: Systems, Analyses and Lessons Learned
A comprehensive report on the EfficientQA open-domain QA competition at NeurIPS 2020, detailing memory-budgeted systems, their retrieval-reader architectures, evaluation (automatic and human), and insights on ambiguities in open-domain QA.
We review the EfficientQA competition from NeurIPS 2020. The competition focused on open-domain question answering (QA), where systems take natural language questions as input and return natural language answers. The aim of the competition was to build systems that can predict correct answers while also satisfying strict on-disk memory budgets. These memory budgets were designed to encourage contestants to explore the trade-off between storing retrieval corpora and storing the parameters of learned models. In this report, we describe the motivation and organization of the competition, review the best submissions, and analyze system predictions to inform a discussion of evaluation for open-domain QA.
Research Motivation and Goals
- Motivate and organize a memory-efficient open-domain QA competition.
- Survey top submissions across unrestricted and memory-constrained tracks.
- Evaluate predictions with both automatic metrics and human judgments to understand correctness under ambiguity.
- Compare system predictions to human trivia experts to assess upper bounds and practical performance.
Proposed Method
- Describe the competition setup, tracks, and memory budgets.
- Summarize leading participant systems and their retrieval-reader designs.
- Introduce a human evaluation scheme to assess correctness beyond exact-match metrics.
- Analyze automatic vs. human evaluation gaps and question ambiguity effects.
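To make the memory budgets concrete, the following is a minimal sketch of how a submission's on-disk footprint could be compared against a track budget. The budgets are the two named in this review (6GiB and 500MiB, in binary units). The helper names `directory_size_bytes` and `fits_budget` are hypothetical, and the competition's actual measurement procedure is not described here, so treat this as an illustration only.

```python
import os

GIB = 2 ** 30  # bytes in one gibibyte
MIB = 2 ** 20  # bytes in one mebibyte

# Track budgets named in this review, expressed in bytes.
TRACK_BUDGETS = {
    "6GiB": 6 * GIB,
    "500MiB": 500 * MIB,
}


def directory_size_bytes(root: str) -> int:
    """Sum the sizes of all regular files under root (hypothetical helper)."""
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            # Skip symlinks so linked files are not counted at their target size.
            if not os.path.islink(path):
                total += os.path.getsize(path)
    return total


def fits_budget(size_bytes: int, track: str) -> bool:
    """Return True if size_bytes is within the budget for the given track."""
    return size_bytes <= TRACK_BUDGETS[track]
```

Under these assumptions, `fits_budget(directory_size_bytes("my_system/"), "500MiB")` would report whether everything in a (hypothetical) `my_system/` directory, corpus and model weights together, stays within the smaller budget.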
Experimental Results
Research Questions
- RQ1: How do memory budgets influence retrieval and reasoning strategies in open-domain QA?
- RQ2: What retrieval and reading architectures yield the best accuracy under different memory constraints?
- RQ3: How well do automatic exact-match metrics reflect true answer correctness in open-domain QA, and how does human judgment differ?
- RQ4: What is the impact of question ambiguity on QA evaluation and system ranking?
Key Results
| Track | Model | Automatic eval (exact match, %) | Human eval, definitely correct (%) | Human eval, possibly correct (%) |
|---|---|---|---|---|
| Unrestricted | MS UnitedQA | 54.00 | 65.80 (+21.9%) | 78.12 (+44.7%) |
| Unrestricted | FB Hybrid | 53.89 | 67.38 (+25.0%) | 79.88 (+48.2%) |
| 6GiB | FB system | 53.33 | 65.18 (+22.2%) | 76.09 (+42.7%) |
| 6GiB | Ousia-Tohoku Soseki | 50.17 | 62.01 (+23.6%) | 73.83 (+47.2%) |
| 6GiB | BUT R2-D2 | 47.28 | 58.96 (+24.7%) | 70.33 (+49.2%) |
| 500MiB | NAVER RDR | 32.06 | 42.23 (+31.7%) | 54.95 (+71.4%) |
| 500MiB | UCLNLP-FB system (29M) | 33.44 | 39.40 (+17.8%) | 47.37 (+41.7%) |
| 25% smallest | UCLNLP-FB system (29M) | 26.78 | 32.45 (+21.2%) | 41.21 (+53.9%) |
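The parenthesized percentages in the table are consistent with a relative gain over the automatic score, rather than a difference in percentage points. A short sketch of that arithmetic, using the MS UnitedQA row (the function name `relative_gain` is ours, for illustration):

```python
def relative_gain(automatic: float, human: float) -> float:
    """Relative improvement of the human-judged score over the automatic score, in percent."""
    return (human - automatic) / automatic * 100.0


# Scores copied from the MS UnitedQA row of the table.
automatic, definitely, possibly = 54.00, 65.80, 78.12

print(f"{relative_gain(automatic, definitely):.1f}%")  # 21.9%
print(f"{relative_gain(automatic, possibly):.1f}%")    # 44.7%
```

Read this way, +21.9% means the human-judged accuracy is about 1.22 times the automatic accuracy, not that it is 21.9 points higher.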
- Top submissions across tracks substantially outperformed baselines by up to ~20% in accuracy, leveraging retrieval-augmented generation and compression techniques.
- The unrestricted and 6GiB tracks show close performance (54.00 vs. 53.33 automatic accuracy for the best system in each), indicating that strong compression and pruning can maintain accuracy.
- Automatic evaluation underestimates correctness for semantically equivalent or context-dependent answers; across the systems in the table, human evaluation yields relative gains of roughly 18% to 32% when counting definitely correct answers, and roughly 42% to 71% when also counting possibly correct answers.
- Ambiguity and time-dependence of open-domain questions significantly affect evaluation and rankings; agreement among human raters is moderate and depends on the definition of correctness.
- Systems with diverse retrieval strategies (e.g., combining dense retrievers, generative augmentation, and data augmentation) tend to yield complementary errors, improving ensemble potential.
- Memory-efficient systems (500MiB, 25% smallest) still reach roughly 27 to 33 automatic accuracy, compared with about 54 for the unrestricted track, through aggressive corpus pruning and model/embedding compression.
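The gap between automatic and human evaluation is easier to see with a small exact-match sketch. The normalization below follows the common SQuAD-style convention (lowercasing, removing punctuation and English articles, collapsing whitespace); that the competition's scoring script used exactly these steps is an assumption, and the example answers are hypothetical rather than drawn from the competition data.

```python
import re
import string
from typing import List


def normalize_answer(text: str) -> str:
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction: str, references: List[str]) -> bool:
    """True if the normalized prediction equals any normalized reference."""
    pred = normalize_answer(prediction)
    return any(pred == normalize_answer(ref) for ref in references)


# Hypothetical reference and predictions, not taken from the competition data.
references = ["New York City"]
print(exact_match("the New York City", references))  # True: only the article differs
print(exact_match("NYC", references))                # False: equivalent to a human, but not string-equal
```

The second prediction is the kind of answer a human rater might judge correct while exact match scores it as wrong, which is the effect the human evaluation in this review is meant to capture.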
This review was generated by AI and reviewed by a human editor.