[论文解读] The Battle of LLMs: A Comparative Study in Conversational QA Tasks
对话式问答在四个数据集上的ChatGPT、GPT-4、Gemini、Mixtral和Claude的比较评估,使用多种指标和两模块生成管线来评估准确性、流畅性和一致性。
Large language models have gained considerable interest for their impressive performance on various tasks. Within this domain, ChatGPT and GPT-4, developed by OpenAI, and the Gemini, developed by Google, have emerged as particularly popular among early adopters. Additionally, Mixtral by Mistral AI and Claude by Anthropic are newly released, further expanding the landscape of advanced language models. These models are viewed as disruptive technologies with applications spanning customer service, education, healthcare, and finance. More recently, Mistral has entered the scene, captivating users with its unique ability to generate creative content. Understanding the perspectives of these users is crucial, as they can offer valuable insights into the potential strengths, weaknesses, and overall success or failure of these technologies in various domains. This research delves into the responses generated by ChatGPT, GPT-4, Gemini, Mixtral and Claude across different Conversational QA corpora. Evaluation scores were meticulously computed and subsequently compared to ascertain the overall performance of these models. Our study pinpointed instances where these models provided inaccurate answers to questions, offering insights into potential areas where they might be susceptible to errors. In essence, this research provides a comprehensive comparison and evaluation of these state of-the-art language models, shedding light on their capabilities while also highlighting potential areas for improvement
研究动机与目标
- 评估主流大语言模型(ChatGPT、GPT-4、Gemini、Mixtral、Claude)在对话式问答任务上的性能。
- 开发并验证一个可扩展的生成与评估大规模回答的流水线。
- 使用标准自然语言处理指标量化质量,并分析如准确性、相关性和一致性等局限性。
提出的方法
- 两模块流水线:问题生成(改述、增强、从问答语料库抽样)与回答生成(LLMs),以覆盖广泛的对话式问答。
- 使用四个对话式问答基准进行评估:CoQA、DialFact、FaVIQ、CoDAH。
- 量化指标:BLEU、METEOR、BART、NIST、Jaccard、ROUGE-L、TER;以及Chain-of-Thought、Zero-Shot和3-Shot设定。
- 从Rawte等人(2023)改编的Hallucination Vulnerability Index (HVI),用于评估模型在实体伪造与替换方面的脆弱性。

实验结果
研究问题
- RQ1在对话式问答任务中,ChatGPT、GPT-4、Gemini、Mixtral和Claude在准确性、相关性和一致性方面有何比较?
- RQ2对模型性能有何影响?设置(Zero-shot 与 3-Shot)和Chain-of-Thought推理的影响如何?
- RQ3这些模型如何处理同一情境中重复问题的幻觉和一致性?
- RQ4在对话式问答语料库上评估时,这些LLM暴露出的局限性和偏见有哪些?
主要发现
- GPT-4 与 Claude 在准确性、相关性和一致性方面超越在各测试情景中的 ChatGPT-3、Gemini 和 Mixtral。
- 在 Chain-of-Thought、Zero-Shot 和 3-Shot 评估中,GPT-4 与 Claude 展现出更高的一致性和情境相关性。
- 在某些评估中,整体平均指标包括 BLEU 约0.79 和 ROUGE-L 约0.53,且数据集间差异显著。
- 某些模型(尤其是 ChatGPT-3、Gemini、Mixtral)在相同上下文下表现出不一致性和偶发的误导性回答。
- 本研究引入 HVI 用于量化幻觉脆弱性,并报告各模型在实体伪造与替换方面的特定倾向。

更好的研究,从现在开始
从论文设计到论文写作,大幅缩短您的研究时间。
无需绑定信用卡
本解读由 AI 生成,并经人工编辑审核。