QUICK REVIEW

[Paper Review] Two are Better than One: An Ensemble of Retrieval- and Generation-Based Dialog Systems

Yiping Song, Rui Yan | arXiv (Cornell University) | Oct 23, 2016

Topic Modeling | 29 references | 87 citations

TL;DR

This paper proposes an ensemble that combines retrieval-based and generation-based open-domain dialog systems to improve response quality. The user query and a retrieved candidate reply are both fed to a biseq2seq generator, and the retrieved and generated responses are then post-reranked together. The ensemble outperforms either component alone by a large margin on both automatic metrics and human evaluation.

ABSTRACT

Open-domain human-computer conversation has attracted much attention in the field of NLP. Contrary to rule- or template-based domain-specific dialog systems, open-domain conversation usually requires data-driven approaches, which can be roughly divided into two categories: retrieval-based and generation-based systems. Retrieval systems search a user-issued utterance (called a query) in a large database, and return a reply that best matches the query. Generative approaches, typically based on recurrent neural networks (RNNs), can synthesize new replies, but they suffer from the problem of generating short, meaningless utterances. In this paper, we propose a novel ensemble of retrieval-based and generation-based dialog systems in the open domain. In our approach, the retrieved candidate, in addition to the original query, is fed to an RNN-based reply generator, so that the neural model is aware of more information. The generated reply is then fed back as a new candidate for post-reranking. Experimental results show that such ensemble outperforms each single part of it by a large margin.

Motivation & Objective

To address the limitations of standalone retrieval and generation-based dialog systems in open-domain conversation, where retrieval systems lack novelty and generative models produce generic replies.
To explore whether combining retrieval and generation can yield better performance by leveraging the strengths of both approaches.
To investigate the impact of integrating retrieved candidates into the response generation process and the role of post-reranking in improving final response selection.
To validate the effectiveness of the ensemble through ablation studies and qualitative case analysis.

Proposed method

The system first retrieves a candidate reply from a large database of query-reply pairs using a standard information retrieval method.
The original query and the retrieved reply are then fed to a biseq2seq generator, which produces a new response conditioned on both sequences.
The biseq2seq model uses a dual-encoder architecture: the query and the retrieved reply are encoded separately, and their final hidden states are concatenated to form the initial decoder state.
The generated response is added to the candidate pool alongside the retrieved reply, and the retrieval system's own scoring function re-scores and re-ranks all candidates (post-reranking).
The highest-scoring candidate is returned as the final response.
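
The retrieve, generate, and post-rerank flow described above can be sketched in a few lines of Python. This is only an illustration under loud assumptions, not the authors' implementation: the word-overlap `score` function stands in for the retrieval system's learned scorer, the `generate` argument is a hypothetical placeholder for the neural biseq2seq model, and all function names and the toy database are invented for this example.

```python
from typing import Callable, List, Tuple


def tokens(text: str) -> set:
    """Lowercased word set; a deliberately crude tokenizer."""
    return set(text.lower().split())


def score(query: str, text: str) -> float:
    """Toy stand-in for the retrieval scorer (assumption): Jaccard word overlap."""
    q, t = tokens(query), tokens(text)
    union = q | t
    return len(q & t) / len(union) if union else 0.0


def retrieve(query: str, database: List[Tuple[str, str]]) -> str:
    """Return the reply paired with the stored query that best matches the user query."""
    _, best_reply = max(database, key=lambda pair: score(query, pair[0]))
    return best_reply


def initial_decoder_state(h_query: List[float], h_reply: List[float]) -> List[float]:
    """Dual-encoder idea: concatenate the two encoders' final hidden states."""
    return h_query + h_reply


def ensemble_reply(
    query: str,
    database: List[Tuple[str, str]],
    generate: Callable[[str, str], str],
) -> Tuple[str, str]:
    """Retrieve a candidate, generate from (query, candidate), then post-rerank both."""
    retrieved = retrieve(query, database)
    generated = generate(query, retrieved)  # hypothetical biseq2seq stand-in
    candidates = [("retrieved", retrieved), ("generated", generated)]
    # Post-reranking reuses the same scorer; ties favor the retrieved reply.
    return max(candidates, key=lambda c: score(query, c[1]))


# Toy data and a stub generator, both invented for illustration.
database = [
    ("do you like tea", "yes green tea is my favorite"),
    ("how is the weather", "it is sunny today"),
]


def stub_generate(query: str, retrieved: str) -> str:
    return "i like green tea"


print(ensemble_reply("do you like green tea", database, stub_generate))
```

With this toy data the generated candidate scores higher than the retrieved one (0.5 versus roughly 0.22) and is selected; a generator that returns a reply with no overlap with the query would lose to the retrieved reply instead. Passing the generator in as a callable keeps the reranking logic independent of how the response is produced.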

Experimental results

Research questions

RQ1: Can combining retrieval and generation-based systems improve response quality in open-domain dialog systems?
RQ2: Does incorporating the retrieved candidate into the generator’s input mitigate the 'low-substance' problem of generic responses?
RQ3: Is post-reranking effective in selecting the best response from both retrieved and generated candidates?
RQ4: Do both the biseq2seq generator and the post-reranking mechanism contribute significantly to the ensemble’s performance?

Key findings

The ensemble model outperforms both the retrieval-only and generation-only baselines across all evaluation metrics, including BLEU, ROUGE, and human evaluation scores.
The biseq2seq generator produces more meaningful responses than standard seq2seq, with content words from the retrieved reply often appearing in the generated output.
Post-reranking improves performance by filtering out low-quality candidates, whether retrieved or generated. In the biseq2seq-based ensemble, 55.23% of the final selections are generated responses and the remaining 44.77% are retrieved ones, indicating a strong contribution from the generator.
Ablation studies confirm that both the biseq2seq generator and the post-reranking mechanism are essential, as removing either leads to a drop in performance.
The model consistently outperforms baselines on both automatic and human evaluation, demonstrating the effectiveness of the ensemble strategy.


This review was created by AI and reviewed by human editors.