[Paper Review] Dataset and Neural Recurrent Sequence Labeling Model for Open-Domain Factoid Question Answering
This paper introduces WebQA, a large-scale real-world factoid QA dataset with over 42,000 questions and 556,000 evidence passages, and proposes an end-to-end neural recurrent sequence labeling model that frames QA as a sequence labeling task using CRF. The model achieves an F1 score of 74.69% with word-based input and 70.97% with character-based input, demonstrating robustness and effectiveness without expensive softmax computation or predefined answer candidates.
While question answering (QA) with neural networks, i.e. neural QA, has achieved promising results in recent years, the lack of a large-scale real-world QA dataset is still a challenge for developing and evaluating neural QA systems. To alleviate this problem, we propose a large-scale human-annotated real-world QA dataset, WebQA, with more than 42k questions and 556k evidences. Because existing neural QA methods treat QA as either a sequence generation or a classification/ranking problem, they face the challenges of expensive softmax computation, handling unseen answers, or a separate candidate answer generation component. In this work, we cast neural QA as a sequence labeling problem and propose an end-to-end sequence labeling model, which overcomes all the above challenges. Experimental results on WebQA show that our model significantly outperforms the baselines with an F1 score of 74.69% with word-based input, and that performance drops by only 3.72 F1 points with the more challenging character-based input.
Motivation & Objective
- Address the lack of large-scale, real-world QA datasets suitable for training and evaluating end-to-end neural QA systems.
- Overcome limitations of existing neural QA methods that rely on sequence generation (expensive softmax) or classification/ranking (require predefined candidates or separate generation components).
- Develop an approach to answer production that is computationally efficient, handles out-of-vocabulary words, and supports end-to-end training.
- Enable research in evidence ranking and answer sentence selection by providing multiple human-annotated evidences per question.
Proposed method
- Frame open-domain factoid QA as a sequence labeling problem: the model assigns a label to every token of a retrieved evidence passage, and the labeled tokens mark the answer span.
- Use a conditional random field (CRF) layer to model label dependencies and improve span boundary prediction accuracy.
- Employ a bi-directional LSTM encoder for both questions and evidence passages to capture contextual representations.
- Compute question and evidence representations using a single-time attention mechanism to dynamically weigh relevant words.
- Integrate neural features (e.g., word embeddings, q-e.comm, e-e.comm) with CRF via joint training, avoiding manual feature engineering.
- Support both word-based and character-based input to improve robustness to rare or unseen words.
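The labeling-plus-CRF idea above can be illustrated with a minimal sketch. It assumes an O/B/I tag set and hand-picked toy scores; the review does not list the paper's exact labels, and in the real model the per-token scores would come from the LSTM layers and the transition scores would be learned jointly. The names `viterbi` and `extract_spans` are hypothetical helpers, not taken from the paper.

```python
from typing import List, Sequence

# Assumed tag set for illustration only: Outside, Begin, Inside.
LABELS = ["O", "B", "I"]


def viterbi(emissions: Sequence[Sequence[float]],
            transitions: Sequence[Sequence[float]]) -> List[int]:
    """Return the highest-scoring label index sequence of a linear-chain CRF.

    emissions[t][j]   : score of label j at token t (e.g. produced by an encoder)
    transitions[i][j] : score of moving from label i to label j
    """
    n_labels = len(transitions)
    score = list(emissions[0])
    backptr: List[List[int]] = []
    for t in range(1, len(emissions)):
        new_score, ptr = [], []
        for j in range(n_labels):
            # Best previous label for ending at label j on token t.
            best_i = max(range(n_labels),
                         key=lambda i: score[i] + transitions[i][j])
            ptr.append(best_i)
            new_score.append(score[best_i] + transitions[best_i][j]
                             + emissions[t][j])
        score = new_score
        backptr.append(ptr)
    # Follow the back-pointers from the best final label.
    path = [max(range(n_labels), key=lambda j: score[j])]
    for ptr in reversed(backptr):
        path.append(ptr[path[-1]])
    return path[::-1]


def extract_spans(tokens: Sequence[str], label_ids: Sequence[int]) -> List[str]:
    """Collect the token runs labeled B, I, I, ... as answer strings."""
    spans: List[str] = []
    current: List[str] = []
    for tok, lid in zip(tokens, label_ids):
        label = LABELS[lid]
        if label == "B":
            if current:
                spans.append(" ".join(current))
            current = [tok]
        elif label == "I" and current:
            current.append(tok)
        else:
            if current:
                spans.append(" ".join(current))
            current = []
    if current:
        spans.append(" ".join(current))
    return spans


# Toy evidence passage and made-up scores (columns: O, B, I).
tokens = ["Einstein", "was", "born", "in", "Ulm", "."]
emissions = [
    [2.0, 0.5, 0.0],
    [2.0, 0.0, 0.0],
    [2.0, 0.0, 0.0],
    [2.0, 0.0, 0.0],
    [0.5, 2.0, 1.0],
    [2.0, 0.0, 0.5],
]
transitions = [
    [0.0, 0.0, -1e4],  # from O: an I directly after O is effectively forbidden
    [0.0, -1.0, 0.5],  # from B
    [0.0, -1.0, 0.5],  # from I
]

path = viterbi(emissions, transitions)
answers = extract_spans(tokens, path)
```

Because the answer is read directly off the evidence tokens, this formulation needs neither a softmax over an output vocabulary nor a predefined candidate list, which is the property the review highlights.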
Experimental results
Research questions
- RQ1: Can a sequence labeling approach outperform traditional sequence generation and classification-based methods in open-domain factoid QA?
- RQ2: How effective is an end-to-end neural sequence labeling model with CRF in handling unseen words and reducing computational cost compared to softmax-based generation?
- RQ3: To what extent does pre-trained fixed word embedding improve generalization compared to trainable embeddings in the QA setting?
- RQ4: How does the model perform under character-based input, and how does it compare to word-based input in terms of robustness and accuracy?
- RQ5: What is the contribution of question-evidence interaction features (e.g., q-e.comm) to the overall performance of the sequence labeling model?
Key findings
- The proposed sequence labeling model achieves an F1 score of 74.69% on the WebQA dataset using word-based input, significantly outperforming baseline methods.
- With character-based input, the model maintains strong performance, achieving an F1 score of 70.97%, a drop of only 3.72 points from the word-based version, demonstrating robustness to out-of-vocabulary words.
- Fixed pre-trained word embeddings (e.g., from language models) lead to better generalization and lower overfitting compared to trainable embeddings, which degrade performance due to increased parameter count and poor inductive bias.
- The q-e.comm feature (indicating whether a word appears in both question and evidence) is highly effective, as it helps the model identify non-answer tokens, contributing significantly to performance.
- The single-time attention mechanism for question representation yields better results than max or average pooling, indicating that flexible, selective attention is more effective for capturing relevant question features.
- Deeper and wider LSTM structures with cross-layer connections improve performance, showing that modeling long-range dependencies in evidence is beneficial for answer span detection.
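As a rough illustration of the q-e.comm feature described in the findings, the sketch below marks each evidence token with 1 if it also appears in the question and 0 otherwise. The function name `qe_comm`, the case-insensitive matching, and the example sentences are assumptions made for this sketch; the review does not specify how the paper tokenizes or normalizes text.

```python
from typing import List


def qe_comm(question_tokens: List[str], evidence_tokens: List[str]) -> List[int]:
    """Binary indicator per evidence token: 1 if the token also occurs in the question."""
    # Lowercasing is an assumption of this sketch, not a detail from the paper.
    question_vocab = {tok.lower() for tok in question_tokens}
    return [1 if tok.lower() in question_vocab else 0 for tok in evidence_tokens]


question = ["Where", "was", "Einstein", "born", "?"]
evidence = ["Einstein", "was", "born", "in", "Ulm", "."]
features = qe_comm(question, evidence)  # [1, 1, 1, 0, 0, 0]
```

In this toy case the answer token "Ulm" gets a 0 while the tokens repeated from the question get a 1, which matches the intuition that words shared with the question are usually not part of the answer.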
This review was created by AI and reviewed by human editors.