[Paper Review] Do LLMs Have Visualization Literacy? An Evaluation on Modified Visualizations to Test Generalization in Data Interpretation
This paper assesses whether GPT-4 (vision) and Gemini possess visualization literacy by testing them on a modified VLAT (Visualization Literacy Assessment Test) with PNG visualizations, comparing their performance to human baselines, and analyzing whether their answers rely on pre-existing knowledge or on the visual data. It concludes that current LLMs lag behind human visualization literacy (VL) and often rely on prior knowledge, with variation by model, visualization type, and task, and it offers a methodological template for such evaluations.
In this paper, we assess the visualization literacy of two prominent Large Language Models (LLMs): OpenAI's Generative Pretrained Transformers (GPT), the backend of ChatGPT, and Google's Gemini, previously known as Bard, to establish benchmarks for assessing their visualization capabilities. While LLMs have shown promise in generating chart descriptions, captions, and design suggestions, their potential for evaluating visualizations remains under-explored. Collecting data from humans for evaluations has been a bottleneck for visualization research in terms of both time and money, and if LLMs were able to serve, even in some limited role, as evaluators, they could be a significant resource. To investigate the feasibility of using LLMs in the visualization evaluation process, we explore the extent to which LLMs possess visualization literacy -- a crucial factor for their effective utility in the field. We conducted a series of experiments using a modified 53-item Visualization Literacy Assessment Test (VLAT) for GPT-4 and Gemini. Our findings indicate that the LLMs we explored currently fail to achieve the same levels of visualization literacy when compared to data from the general public reported in VLAT, and LLMs heavily relied on their pre-existing knowledge to answer questions instead of utilizing the information provided by the visualization when answering questions.
Research Motivation and Objectives
- Define visualization literacy for LLM evaluation and establish a benchmark against human VL performance.
- Systematically test GPT-4 (vision) and Gemini (vision) on a modified VLAT with PNG visuals.
- Analyze whether LLMs rely on pre-existing knowledge or data from visualizations in answering questions.
- Quantify time and cost differences between LLMs and human evaluators in visualization interpretation.
Proposed Method
- Develop a testing template based on a modified 53-item VLAT to assess 12 visualizations and 8 tasks.
- Use PNG visualizations with randomized values to prevent memorization from VLAT training data; exclude data labels to force data extraction from visuals.
- Conduct Experiment 1 with GPT-4 Vision Preview and Gemini Pro Vision, 6,360 trials across 53 questions and 120 permutations of answer choices per question.
- Conduct Experiment 2 to test performance without visualizations using GPT-4 Turbo and Gemini Pro to isolate reliance on knowledge.
- Model the results with logistic regression across visualization type, task type, model, and visualization presence, with bootstrapped coefficient distributions for hypothesis testing.
- Apply hyperparameter tuning and bootstrapping (1,000 resamples) to compare model coefficients and probabilities.
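The trial count in Experiment 1 follows from simple arithmetic: 53 questions times 120 answer-choice orderings gives 6,360 trials. The sketch below enumerates such a grid. It assumes five answer options per question, since 5! = 120 is one way to arrive at 120 orderings; the review does not state the option count, and the option labels and the `build_trials` helper are hypothetical names used only for illustration.

```python
from itertools import permutations
from math import factorial

N_QUESTIONS = 53                       # number of VLAT items, from the review
OPTIONS = ("A", "B", "C", "D", "E")    # assumption: five options, so 5! = 120 orderings


def build_trials(n_questions=N_QUESTIONS, options=OPTIONS):
    """Return one (question_id, option_order) pair per trial.

    Every question is paired with every ordering of its answer options,
    so position effects in the answer list can be averaged out.
    """
    trials = []
    for question_id in range(1, n_questions + 1):
        for order in permutations(options):
            trials.append((question_id, order))
    return trials


trials = build_trials()
print(factorial(len(OPTIONS)))  # 120 orderings per question
print(len(trials))              # 6360 trials in total
```

Enumerating every ordering, rather than sampling a few, means a model that favors a fixed answer position cannot score above chance from that bias alone.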

Experimental Results
Research Questions
- RQ1: To what extent do LLMs have visualization literacy?
- RQ2: What are LLMs’ limitations in interpreting visualizations?
- RQ3: What are the cost differences between LLMs and humans in interpreting visualizations and answering related questions?
Key Findings
- LLMs do not achieve visualization literacy comparable to the general public according to VLAT baselines.
- GPT-4 and Gemini often rely on their pre-existing knowledge rather than information in the visualizations when answering questions.
- Performance varies by visualization type and task; some tasks show partial alignment with humans, but overall LLMs lag behind.
- Decontextualization (removing context) tended to improve GPT-4 more than Gemini in some cases.
- Cost analysis indicates LLMs are more time- and money-efficient than humans, with Gemini generally more cost-effective than GPT-4.
- Across the 53 visualization/task pairs, GPT-4 answered 14 correctly and Gemini 15; GPT-4 exceeded random chance on 25 questions and Gemini on 24.
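To make "exceeded random chance" concrete, the sketch below runs a one-sided exact binomial test on a single question. This is a substitute technique chosen for illustration: per the method summary, the paper itself tests hypotheses with logistic regression and bootstrapped coefficients. The counts, the chance level of 0.25, and the function names are hypothetical and are not results from the paper.

```python
from math import comb


def binom_tail(successes, trials, p_chance):
    """Exact P(X >= successes) for X ~ Binomial(trials, p_chance)."""
    return sum(
        comb(trials, k) * p_chance**k * (1 - p_chance) ** (trials - k)
        for k in range(successes, trials + 1)
    )


def exceeds_chance(successes, trials, p_chance, alpha=0.05):
    """True if the observed accuracy is significantly above the chance level."""
    return binom_tail(successes, trials, p_chance) < alpha


# Hypothetical question with four real answer options (chance = 0.25),
# asked 120 times, once per answer-choice ordering.
print(exceeds_chance(60, 120, 0.25))  # 50% observed accuracy vs. 25% chance
print(exceeds_chance(30, 120, 0.25))  # 25% observed accuracy vs. 25% chance
```

The chance level depends on how many answer options a question has, so `p_chance` would need to be set per question rather than fixed for the whole test.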

This review was generated by AI and checked by a human editor.