QUICK REVIEW

[Paper Review] Chitrakshara: A Large Multilingual Multimodal Dataset for Indian languages

Saiqa Khan, Ali Faraz|arXiv (Cornell University)|Mar 6, 2026
Multimodal Machine Learning Applications|Cited by 0
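One-Sentence Summary

Introduces the Chitrakshara dataset series for Indian languages: Chitrakshara-IL (193M images, 30B text tokens, 50M documents) and Chitrakshara-Cap (44M image-text pairs, 733M tokens), together with a detailed data pipeline and analysis aimed at more inclusive VLMs.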

ABSTRACT

Multimodal research has predominantly focused on single-image reasoning, with limited exploration of multi-image scenarios. Recent models have sought to enhance multi-image understanding through large-scale pretraining on interleaved image-text datasets. However, most Vision-Language Models (VLMs) are trained primarily on English datasets, leading to inadequate representation of Indian languages. To address this gap, we introduce the Chitrakshara dataset series, covering 11 Indian languages sourced from Common Crawl. It comprises (1) Chitrakshara-IL, a large-scale interleaved pretraining dataset with 193M images, 30B text tokens, and 50M multilingual documents, and (2) Chitrakshara-Cap, which includes 44M image-text pairs with 733M tokens. This paper details the data collection pipeline, including curation, filtering, and processing methodologies. Additionally, we present a comprehensive quality and diversity analysis to assess the dataset's representativeness across Indic languages and its potential for developing more culturally inclusive VLMs.

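Research Motivation and Goals

  • Address the under-representation of Indian languages in multimodal datasets.
  • Provide large-scale interleaved and caption data for training culturally inclusive VLMs for Indic languages.
  • Propose a robust web data collection and filtering pipeline for Indian languages.
  • Assess language distribution, domain coverage, and modality diversity to verify quality and coverage.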

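Proposed Method

  • Collect 95 Common Crawl dumps spanning 2013–2023 to maximize Indic language coverage.
  • Filter and deduplicate documents using a language detector (FastText LID) and heuristics.
  • Convert the cleaned HTML documents into interleaved multimodal sequences that preserve layout semantics.
  • Build Chitrakshara-IL as the interleaved dataset and Chitrakshara-Cap as image alt-text pairs.
  • Assess the quality and diversity of the dataset across language, domain, and modality.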
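
The filtering and deduplication step can be illustrated with a minimal sketch. It is not the authors' pipeline: the paper uses FastText LID, whereas this example swaps in a simple Unicode-script heuristic so that it runs without a model file, and it uses exact hashing of whitespace-normalized text for deduplication. The names `SCRIPT_RANGES`, `dominant_script`, `filter_and_dedup`, and the `min_ratio` threshold are all hypothetical. Note that a script is not a language (Devanagari, for instance, is shared by several languages), so this only approximates what a real language identifier does.

```python
import hashlib
import unicodedata

# Unicode blocks for several major Indic scripts. This list is an assumption:
# the summary does not enumerate the 11 languages covered by the dataset.
SCRIPT_RANGES = {
    "Devanagari": (0x0900, 0x097F),
    "Bengali": (0x0980, 0x09FF),
    "Gurmukhi": (0x0A00, 0x0A7F),
    "Gujarati": (0x0A80, 0x0AFF),
    "Odia": (0x0B00, 0x0B7F),
    "Tamil": (0x0B80, 0x0BFF),
    "Telugu": (0x0C00, 0x0C7F),
    "Kannada": (0x0C80, 0x0CFF),
    "Malayalam": (0x0D00, 0x0D7F),
}


def dominant_script(text, min_ratio=0.5):
    """Return the most frequent Indic script, or None if it covers too little text."""
    counts = {}
    letters = 0
    for ch in text:
        # Count letters and combining marks (vowel signs, virama, etc.).
        if ch.isalpha() or unicodedata.category(ch).startswith("M"):
            letters += 1
            cp = ord(ch)
            for name, (lo, hi) in SCRIPT_RANGES.items():
                if lo <= cp <= hi:
                    counts[name] = counts.get(name, 0) + 1
                    break
    if not letters or not counts:
        return None
    name, n = max(counts.items(), key=lambda kv: kv[1])
    return name if n / letters >= min_ratio else None


def filter_and_dedup(docs, min_ratio=0.5):
    """Keep documents dominated by an Indic script and drop exact duplicates."""
    seen = set()
    kept = []
    for doc in docs:
        script = dominant_script(doc, min_ratio)
        if script is None:
            continue
        # Hash whitespace-normalized text so trivial spacing changes still match.
        key = hashlib.sha256(" ".join(doc.split()).encode("utf-8")).hexdigest()
        if key in seen:
            continue
        seen.add(key)
        kept.append((script, doc))
    return kept


docs = ["नमस्ते दुनिया", "नमस्ते   दुनिया", "Hello world", "வணக்கம் உலகம்"]
result = filter_and_dedup(docs)
```

A production pipeline would more likely pair a trained language identifier with near-duplicate detection; exact hashing is used here only because it is the simplest correct form of deduplication.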
Figure 1 : Chitrakshara dataset creation pipeline

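Experimental Results

Research Questions

  • RQ1: How well represented and how diverse are Indic languages in web-sourced interleaved and caption multimodal data?
  • RQ2: Compared with English-dominated datasets, can a large-scale, India-focused interleaved dataset improve vision-language modeling for Indian languages?
  • RQ3: What are the actual characteristics of Chitrakshara-IL and Chitrakshara-Cap across the 11 languages, such as language distribution, document counts, and image counts?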

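Key Findings

  • Chitrakshara-IL contains roughly 193M images, 30B text tokens, and 50M multilingual documents, all sourced from Common Crawl.
  • Chitrakshara-Cap contains 44M image-text pairs with 733M text tokens.
  • For several Indian languages, Chitrakshara surpasses English-leaning multilingual interleaved datasets in the number of documents, tokens, and images per language.
  • The dataset spans a broad range of domains, dominated by news and entertainment content, and shows varied representation across the 11 languages.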
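
To put the headline figures in perspective, the short calculation below derives the averages they imply. These ratios are back-of-the-envelope values computed from the rounded totals in the abstract, not statistics reported by the authors, and the constant and key names are illustrative.

```python
# Rounded headline totals as stated in the abstract.
IL_IMAGES = 193_000_000
IL_TOKENS = 30_000_000_000
IL_DOCS = 50_000_000
CAP_PAIRS = 44_000_000
CAP_TOKENS = 733_000_000

# Implied averages (approximate, because the inputs are rounded).
averages = {
    "images_per_il_document": IL_IMAGES / IL_DOCS,
    "tokens_per_il_document": IL_TOKENS / IL_DOCS,
    "tokens_per_il_image": IL_TOKENS / IL_IMAGES,
    "tokens_per_caption": CAP_TOKENS / CAP_PAIRS,
}

for name, value in averages.items():
    print(f"{name}: {value:.2f}")
```

On these numbers, an interleaved document averages just under four images and about 600 tokens, while a caption averages roughly 17 tokens. That gap is consistent with the different roles of the two subsets: long-context interleaved pretraining versus short image-text alignment.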
Figure 2 : Illustration of multimodal document extraction from the web. On the left, Chitrakshara-Cap includes image alt-text pairs, while on the right, Chitrakshara-IL retains the interleaved structure (truncated) of text & images from the source Hindi document.
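
The two extraction modes described in the caption can be sketched with Python's standard `html.parser`. This is an illustrative approximation rather than the authors' converter: the class `InterleavedExtractor` and the function `extract` are hypothetical, and the sketch only preserves document order, whereas the actual pipeline is described as retaining layout semantics.

```python
from html.parser import HTMLParser


class InterleavedExtractor(HTMLParser):
    """Collect text and <img> tags in document order (hypothetical helper)."""

    SKIP = {"script", "style"}  # content of these tags is not visible text

    def __init__(self):
        super().__init__()
        self.items = []  # interleaved sequence: ("text", str) or ("image", src)
        self.pairs = []  # caption-style pairs: (src, alt)
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1
        elif tag == "img":
            attr = dict(attrs)
            src = attr.get("src")
            if src:
                self.items.append(("image", src))
                alt = (attr.get("alt") or "").strip()
                if alt:  # only images with non-empty alt text become pairs
                    self.pairs.append((src, alt))

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = " ".join(data.split())  # collapse runs of whitespace
        if text and self._skip_depth == 0:
            self.items.append(("text", text))


def extract(html):
    """Return (interleaved items, image alt-text pairs) for one HTML document."""
    parser = InterleavedExtractor()
    parser.feed(html)
    parser.close()
    return parser.items, parser.pairs


sample = (
    '<html><body><p>नमस्ते दुनिया</p><img src="a.jpg" alt="एक चित्र">'
    '<p>Second paragraph</p><img src="b.jpg"><script>var x = 1;</script>'
    "</body></html>"
)
items, pairs = extract(sample)
```

Here `items` corresponds to the interleaved view (Chitrakshara-IL style) and `pairs` to the caption view (Chitrakshara-Cap style). The image without alt text still appears in the interleaved sequence but yields no caption pair.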


This review was generated by AI and reviewed by human editors.