QUICK REVIEW

[Paper Review] Visualization Generation with Large Language Models: An Evaluation

Xinyu Wang, Liang, Chenwei|arXiv (Cornell University)|2024. 01. 20.

Data Mining Algorithms and Applications | Citations: 9

One-Line Summary

GPT-3.5 generating Vega-Lite shows strong NL2VIS performance on nvBench: few-shot prompting outperforms zero-shot prompting, and the results surpass prior NL2VIS work.

ABSTRACT

The frequent need for analysts to create visualizations to derive insights from data has driven extensive research into the generation of natural Language to Visualization (NL2VIS). While recent progress in large language models (LLMs) suggests their potential to effectively support NL2VIS tasks, existing studies lack a systematic investigation into the performance of different LLMs under various prompt strategies. This paper addresses this gap and contributes a crucial baseline evaluation of LLMs' capabilities in generating visualization specifications of NL2VIS tasks. Our evaluation utilizes the nvBench dataset, employing six representative LLMs and eight distinct prompt strategies to evaluate their performance in generating six target chart types using the Vega-Lite visualization specification. We assess model performance with multiple metrics, including vis accuracy, validity and legality. Our results reveal substantial performance disparities across prompt strategies, chart types, and LLMs. Furthermore, based on the evaluation results, we uncover several counterintuitive behaviors across these dimensions, and propose directions for enhancing the NL2VIS benchmark to better support future NL2VIS research.

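Motivation and Objectives

Evaluate how well large language models generate Vega-Lite visualizations from natural language queries (NL2VIS).
Assess how zero-shot versus few-shot prompting strategies affect NL2VIS performance.
Identify limitations in model outputs and in the benchmark to guide future NL2VIS research and evaluation.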

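Proposed Method

Use GPT-3.5 as the representative LLM for generating Vega-Lite specifications.
Express the target output as a specification written in the Vega-Lite grammar.
Use nvBench as the NL2VIS benchmark dataset.
Design and compare zero-shot and few-shot prompting strategies.
In zero-shot prompts, introduce rules inspired by the ground truth to reduce common errors.
In few-shot prompts, provide examples matched to the chart type to guide generation.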
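
The two prompting strategies listed above can be sketched roughly as follows. This is a minimal illustration, not the paper's prompts: the rule text, the example pool, and the names ZERO_SHOT_RULES, FEW_SHOT_EXAMPLES, build_zero_shot_prompt, and build_few_shot_prompt are all hypothetical, since this review does not reproduce the actual wording.

```python
import json

# Hypothetical rules; the actual rule text used in the paper is not given here.
ZERO_SHOT_RULES = [
    "Return a single JSON object that follows the Vega-Lite grammar.",
    "Use only column names that appear in the table columns.",
]

# Hypothetical example pool keyed by chart type (few-shot demonstrations).
FEW_SHOT_EXAMPLES = {
    "bar": [
        {
            "question": "Show the number of employees in each department.",
            "spec": {
                "mark": "bar",
                "encoding": {
                    "x": {"field": "department", "type": "nominal"},
                    "y": {"aggregate": "count", "type": "quantitative"},
                },
            },
        }
    ],
}

HEADER = "Generate a Vega-Lite specification."


def _task_block(question, columns):
    """Format the query to be answered; the model completes after 'Vega-Lite:'."""
    return f"Table columns: {', '.join(columns)}\nQuestion: {question}\nVega-Lite:"


def build_zero_shot_prompt(question, columns):
    """Zero-shot: instructions plus explicit rules, no demonstrations."""
    rules = "\n".join(f"- {rule}" for rule in ZERO_SHOT_RULES)
    return f"{HEADER}\nRules:\n{rules}\n\n{_task_block(question, columns)}"


def build_few_shot_prompt(question, columns, chart_type):
    """Few-shot: prepend demonstrations that match the requested chart type."""
    shots = FEW_SHOT_EXAMPLES.get(chart_type, [])
    demos = "\n\n".join(
        f"Question: {s['question']}\nVega-Lite: {json.dumps(s['spec'])}"
        for s in shots
    )
    parts = [HEADER]
    if demos:
        parts.append(demos)
    parts.append(_task_block(question, columns))
    return "\n\n".join(parts)
```

Calling build_few_shot_prompt("How many films per year?", ["title", "year"], "bar") yields a prompt with one bar-chart demonstration followed by the new question; the zero-shot variant carries the rules instead of demonstrations.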

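Experimental Results

Research Questions

RQ1: How well can GPT-3.5 generate correct Vega-Lite specifications from natural language queries?
RQ2: Does few-shot prompting yield higher NL2VIS accuracy for Vega-Lite generation than zero-shot prompting?
RQ3: What are the main sources of error in GPT-3.5's NL2VIS outputs, and how do they relate to its understanding of the Vega-Lite grammar and of data attributes?
RQ4: Which improvements to the nvBench benchmark would strengthen NL2VIS evaluation?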

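Key Findings

GPT-3.5 shows strong Vega-Lite generation performance on nvBench and outperforms prior NL2VIS approaches.
Few-shot prompting achieves higher NL2VIS accuracy than zero-shot prompting.
GPT-3.5 still makes Vega-Lite grammar errors and sometimes misinterprets data attributes, which lowers accuracy.
Several benchmark issues were identified: some ground-truth visualizations do not fully match their task descriptions or are ambiguous, which affects the evaluation.
Specific issues in the Vega-Lite conversion (e.g., the use of sorting) expose the limits of rule-based prompt guidance and of grammar compliance.
Ambiguity in the ground truth and in the benchmark is identified as a direction for future improvement.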
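
The two error types noted above, grammar errors and misread data attributes, can be screened for with a simple structural check. The sketch below is only an illustration under stated assumptions: check_spec is a hypothetical helper, it does not implement the paper's metrics, and it only verifies that the output parses as JSON, declares a mark, and references fields that exist in the table.

```python
import json


def check_spec(raw_output, columns):
    """Return a list of problems found in a model-generated Vega-Lite string.

    Hypothetical helper: a shallow screen, not a full Vega-Lite validator.
    """
    try:
        spec = json.loads(raw_output)
    except json.JSONDecodeError as exc:
        return [f"not valid JSON: {exc.msg}"]
    if not isinstance(spec, dict):
        return ["top-level value is not a JSON object"]

    problems = []
    if "mark" not in spec:
        problems.append("missing 'mark'")

    encoding = spec.get("encoding", {})
    if not isinstance(encoding, dict):
        problems.append("'encoding' is not an object")
        return problems

    # Flag channels whose field is not one of the table's columns.
    for channel, definition in encoding.items():
        if not isinstance(definition, dict):
            continue
        field = definition.get("field")
        if field is not None and field not in columns:
            problems.append(f"channel '{channel}' uses unknown field '{field}'")
    return problems
```

An empty list means the output passed this shallow screen; it says nothing about whether the chart actually answers the question, which is what an accuracy comparison against the ground truth would measure.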

This review was generated by AI and reviewed by a human editor.