QUICK REVIEW

[Paper Review] RoadscapesQA: A Multitask, Multimodal Dataset for Visual Question Answering on Indian Roads

Vijayasri Iyer, Maahin Rathinagiriswaran|arXiv (Cornell University)|Feb 13, 2026

Multimodal Machine Learning Applications | Cited by 0

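One-Sentence Summary

RoadscapesQA introduces a multitask, multimodal VQA dataset of roughly 9k Indian road-scene images covering object detection, drivable-area segmentation, and image-level VQA, and provides zero-shot baselines for the VQA tasks.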

ABSTRACT

Understanding road scenes is essential for autonomous driving, as it enables systems to interpret visual surroundings to aid in effective decision-making. We present Roadscapes, a multitask multimodal dataset consisting of up to 9,000 images captured in diverse Indian driving environments, accompanied by manually verified bounding boxes. To facilitate scalable scene understanding, we employ rule-based heuristics to infer various scene attributes, which are subsequently used to generate question-answer (QA) pairs for tasks such as object grounding, reasoning, and scene understanding. The dataset includes a variety of scenes from urban and rural India, encompassing highways, service roads, village paths, and congested city streets, captured in both daytime and nighttime settings. Roadscapes has been curated to advance research on visual scene understanding in unstructured environments. In this paper, we describe the data collection and annotation process, present key dataset statistics, and provide initial baselines for image QA tasks using vision-language models.

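Motivation and Objectives

Fill the gap in driving VQA benchmarks for unstructured Indian road conditions (urban, rural, highway).
Provide a scalable data collection and annotation pipeline that uses monocular video, pre-annotation, manual verification, and heuristics.
Support object detection, drivable-area segmentation, and image-level VQA across varied lighting and conditions.
Enable evaluation of vision-language models on counting objects, describing objects, and describing the surrounding context on Indian roads.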

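Proposed Method

Collect roughly 9k monocular images with a front-facing camera along the Coimbatore–Kochi corridor and national highways.
Annotate for object detection and drivable-area segmentation, and generate VQA pairs through rule-based heuristics and scene-graph generation inferred from a large language model.
Anonymize license plates with a YOLOv5-based detector, followed by manual spot checks.
Evaluate zero-shot vision-language models on three VQA task types using embedding-based and exact-match metrics.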
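The rule-based QA generation step can be illustrated with a minimal sketch. It assumes each frame carries a list of verified bounding boxes with a class label; the `counting_qa` helper, the annotation layout, and the question template are hypothetical and are not taken from the paper.

```python
from collections import Counter


def counting_qa(boxes, category):
    """Build one object-counting QA pair from bounding-box annotations.

    `boxes` is a list of dicts with a "label" key (assumed layout).
    """
    counts = Counter(box["label"] for box in boxes)
    return {
        "task": "object_counting",
        "question": f"How many objects of type '{category}' are visible in the image?",
        "answer": str(counts.get(category, 0)),
    }


# Hypothetical annotations for one frame: label plus [x_min, y_min, x_max, y_max].
boxes = [
    {"label": "car", "bbox": [120, 300, 260, 400]},
    {"label": "autorickshaw", "bbox": [400, 310, 480, 390]},
    {"label": "car", "bbox": [600, 290, 720, 380]},
]

qa = counting_qa(boxes, "car")
print(qa["question"], "->", qa["answer"])
```

Deriving the answer directly from verified boxes keeps the ground truth tied to the manual annotation, which is the appeal of a heuristic pipeline over free-form labeling.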

Figure 1: An example of an image and corresponding questions from the VQA Dataset.

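Experimental Results

Research Questions

RQ1: How do zero-shot vision-language models perform on object counting, object description, and surrounding description in Indian road scenes?
RQ2: What are the common failure modes (hallucination, counting errors, attribute errors) of current VLMs in unstructured driving environments?
RQ3: How does RoadscapesQA differ from existing driving VQA datasets in diversity, lighting, and scene types?
RQ4: Which baseline insights on VQA in unstructured Indian road conditions can guide future model development?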

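Key Findings

Phi-3.5 achieves the highest object-counting accuracy at 0.667; 4o-mini reaches 0.628.
Paligemma performs best on object description, with a cosine similarity of 0.501.
4o performs best on surrounding description, with a cosine similarity of 0.701.
Hallucination rates for object description are high across models, at 50.8%–61.6% (reported as 258/500 to 308/500 predictions); in object counting, over-counting and false positives are the main problem.
Zero-shot VQA performance varies by task: for some models, contextual reasoning (surrounding description) is more reliable than fine-grained attribute tasks.
The dataset exposes real-world artifacts (motion blur, glare, windshield reflections) and the unstructured nature of Indian roads, highlighting reliability challenges for VLMs in practical deployment.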
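The scores above are accuracies, cosine similarities, and rates over a fixed number of predictions. A minimal sketch of how such numbers can be computed follows. The normalization used for exact match (trimming and lowercasing) and the toy vectors are assumptions for illustration; this review does not state which embedding model or normalization the paper uses.

```python
import math


def exact_match_accuracy(predictions, references):
    """Fraction of predictions equal to their reference after trimming and lowercasing."""
    if len(predictions) != len(references) or not references:
        raise ValueError("predictions and references must be non-empty and equal in length")
    hits = sum(p.strip().lower() == r.strip().lower()
               for p, r in zip(predictions, references))
    return hits / len(references)


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_u * norm_v)


def rate(flagged, total):
    """Share of predictions flagged (for example, as hallucinated)."""
    return flagged / total


# Toy counting answers (hypothetical): two of three match the reference.
accuracy = exact_match_accuracy(["2", " 3", "1"], ["2", "4", "1"])

# Toy 2-d vectors standing in for sentence embeddings (hypothetical).
similarity = cosine_similarity([1.0, 0.0], [1.0, 1.0])

# 308 flagged predictions out of 500 corresponds to the reported 61.6%.
upper_rate = rate(308, 500)

print(f"{accuracy:.3f} {similarity:.3f} {upper_rate:.1%}")
```

Exact match suits short answers such as counts, while embedding similarity tolerates paraphrase in free-text descriptions, which is presumably why the two metric types are paired with different tasks.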

Figure 2: A minimal working example to demonstrate how to place two images side-by-side.


This review was generated by AI and reviewed by human editors.