QUICK REVIEW

[Paper Review] Chitrakshara: A Large Multilingual Multimodal Dataset for Indian languages

Saiqa Khan, Ali Faraz|arXiv (Cornell University)|2026. 03. 06.

Multimodal Machine Learning Applications | Citations: 0

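One-line summary

Introduces the Chitrakshara dataset series for Indian languages: Chitrakshara-IL (193M images, 30B tokens, 50M documents) and Chitrakshara-Cap (44M image-text pairs, 733M tokens), along with a detailed data pipeline and analysis aimed at more inclusive VLMs.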

ABSTRACT

Multimodal research has predominantly focused on single-image reasoning, with limited exploration of multi-image scenarios. Recent models have sought to enhance multi-image understanding through large-scale pretraining on interleaved image-text datasets. However, most Vision-Language Models (VLMs) are trained primarily on English datasets, leading to inadequate representation of Indian languages. To address this gap, we introduce the Chitrakshara dataset series, covering 11 Indian languages sourced from Common Crawl. It comprises (1) Chitrakshara-IL, a large-scale interleaved pretraining dataset with 193M images, 30B text tokens, and 50M multilingual documents, and (2) Chitrakshara-Cap, which includes 44M image-text pairs with 733M tokens. This paper details the data collection pipeline, including curation, filtering, and processing methodologies. Additionally, we present a comprehensive quality and diversity analysis to assess the dataset's representativeness across Indic languages and its potential for developing more culturally inclusive VLMs.

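Research motivation and objectives

Address the insufficient representation of Indian languages in multimodal datasets.
Provide large-scale interleaved and caption data for training culturally inclusive VLMs for Indian languages.
Outline a robust web-data collection and filtering pipeline tailored to Indian languages.
Evaluate language distribution, domain representation, and modality diversity to ensure quality and coverage.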

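Proposed method

Collect 95 Common Crawl dumps spanning 2013 to 2023 to maximize coverage of Indic languages.
Filter and deduplicate documents using a language detector (FastText LID) and heuristics.
Convert the cleaned HTML documents into interleaved multimodal sequences while preserving layout semantics.
Produce Chitrakshara-IL as interleaved data and Chitrakshara-Cap as image and alt-text pairs.
Evaluate dataset quality and diversity across languages, domains, and modalities.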
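
To make the filtering and conversion steps concrete, here is a minimal Python sketch using only the standard library. It is not the authors' pipeline: the class and function names (`InterleavedExtractor`, `to_interleaved`, `to_caption_pairs`, `devanagari_ratio`, `dedupe_exact`) are hypothetical, a simple Unicode-script ratio stands in for FastText LID (which requires a downloaded model), and exact hashing stands in for whatever deduplication the paper actually uses.

```python
import hashlib
from html.parser import HTMLParser


class InterleavedExtractor(HTMLParser):
    """Collect text runs and <img> tags in document order (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.sequence = []    # ("text", str) or ("image", src, alt)
        self._skip_depth = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1
        elif tag == "img":
            attr_map = dict(attrs)
            if attr_map.get("src"):
                alt = (attr_map.get("alt") or "").strip()
                self.sequence.append(("image", attr_map["src"], alt))

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = " ".join(data.split())  # collapse whitespace
        if text and not self._skip_depth:
            self.sequence.append(("text", text))


def to_interleaved(html):
    """Return text and images in their original order (interleaved style)."""
    parser = InterleavedExtractor()
    parser.feed(html)
    parser.close()
    return parser.sequence


def to_caption_pairs(sequence):
    """Keep only images that carry non-empty alt text (caption style)."""
    return [(item[1], item[2]) for item in sequence
            if item[0] == "image" and item[2]]


def devanagari_ratio(text):
    """Share of alphabetic characters in the Devanagari block.

    Assumption: a crude stand-in for a real language identifier.
    """
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return 0.0
    return sum("\u0900" <= ch <= "\u097f" for ch in letters) / len(letters)


def dedupe_exact(docs):
    """Drop documents whose whitespace-normalized text was already seen."""
    seen, kept = set(), []
    for doc in docs:
        normalized = " ".join(doc.split())
        digest = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            kept.append(doc)
    return kept


# Made-up sample page, for illustration only.
SAMPLE_HTML = """
<html><body>
<h1>दिल्ली में बारिश</h1>
<img src="rain.jpg" alt="दिल्ली में भारी बारिश">
<p>आज सुबह तेज बारिश हुई।</p>
<script>var x = 1;</script>
<img src="spacer.gif" alt="">
</body></html>
"""

sequence = to_interleaved(SAMPLE_HTML)
pairs = to_caption_pairs(sequence)
```

In this sketch the interleaved view keeps every text run and image in page order, while the caption view keeps only images that have alt text, mirroring the split between Chitrakshara-IL and Chitrakshara-Cap. A production pipeline would also need near-duplicate detection, image filtering, and a proper multilingual language identifier.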

Figure 1 : Chitrakshara dataset creation pipeline

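Experimental results

Research questions

RQ1: How representative and diverse are Indian languages in web-sourced interleaved and captioned multimodal data?
RQ2: Can large-scale, India-centric interleaved data improve vision-language modeling for Indian languages compared with English-centric datasets?
RQ3: What are the practical characteristics (language distribution, documents, images) of Chitrakshara-IL and Chitrakshara-Cap across the 11 languages?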

Key results

Chitrakshara-IL contains about 193 million images, 30 billion text tokens, and 50 million multilingual documents drawn from Common Crawl.
Chitrakshara-Cap includes 44 million image-text pairs with 733 million tokens.
Chitrakshara offers more documents, tokens, and images per language than English-leaning multilingual interleaved datasets for several Indian languages.
The dataset exhibits broad domain coverage with a predominance of news and entertainment content and shows diverse linguistic representation across the 11 languages.
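
The headline counts above imply some rough per-document and per-caption averages. The short Python sketch below derives them; since the inputs are the rounded figures quoted in this review, the ratios are approximations and are not values reported by the paper.

```python
# Rounded headline figures quoted in this review.
IL_IMAGES = 193_000_000
IL_TOKENS = 30_000_000_000
IL_DOCUMENTS = 50_000_000
CAP_PAIRS = 44_000_000
CAP_TOKENS = 733_000_000

# Derived averages (approximate, because the inputs are rounded).
images_per_doc = IL_IMAGES / IL_DOCUMENTS    # about 3.9 images per document
tokens_per_doc = IL_TOKENS / IL_DOCUMENTS    # about 600 tokens per document
tokens_per_caption = CAP_TOKENS / CAP_PAIRS  # about 16.7 tokens per caption

print(f"Chitrakshara-IL:  {images_per_doc:.2f} images/doc, "
      f"{tokens_per_doc:.0f} tokens/doc")
print(f"Chitrakshara-Cap: {tokens_per_caption:.1f} tokens/caption")
```

On these figures, an average interleaved document carries several images alongside a few hundred tokens of text, while an average caption is short. This fits the multi-image framing of Chitrakshara-IL and the alt-text origin of Chitrakshara-Cap.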

Figure 2 : Illustration of multimodal document extraction from the web. On the left, Chitrakshara-Cap includes image alt-text pairs, while on the right, Chitrakshara-IL retains the interleaved structure (truncated) of text & images from the source Hindi document.


This review was generated by AI and reviewed by a human editor.