QUICK REVIEW

[Paper Review] Ostrakon-VL: Towards Domain-Expert MLLM for Food-Service and Retail Stores

Zhiyong Shen, Gongpeng Zhao | arXiv (Cornell University) | Jan 29, 2026

Multimodal Machine Learning Applications | Citations: 0

One-Line Summary

Ostrakon-VL is a multimodal LLM for Food-Service and Retail Stores (FSRS), built on Qwen3-VL-8B and trained on data curated with the QUAD pipeline. On the new ShopBench benchmark it averages 60.1, ahead of both the same-scale Qwen3-VL-8B (55.3) and the much larger Qwen3-VL-235B-A22B (59.4).

ABSTRACT

Multimodal Large Language Models (MLLMs) have recently achieved substantial progress in general-purpose perception and reasoning. Nevertheless, their deployment in Food-Service and Retail Stores (FSRS) scenarios encounters two major obstacles: (i) real-world FSRS data, collected from heterogeneous acquisition devices, are highly noisy and lack auditable, closed-loop data curation, which impedes the construction of high-quality, controllable, and reproducible training corpora; and (ii) existing evaluation protocols do not offer a unified, fine-grained and standardized benchmark spanning single-image, multi-image, and video inputs, making it challenging to objectively gauge model robustness. To address these challenges, we first develop Ostrakon-VL, an FSRS-oriented MLLM based on Qwen3-VL-8B. Second, we introduce ShopBench, the first public benchmark for FSRS. Third, we propose QUAD (Quality-aware Unbiased Automated Data-curation), a multi-stage multimodal instruction data curation pipeline. Leveraging a multi-stage training strategy, Ostrakon-VL achieves an average score of 60.1 on ShopBench, establishing a new state of the art among open-source MLLMs with comparable parameter scales and diverse architectures. Notably, it surpasses the substantially larger Qwen3-VL-235B-A22B (59.4) by +0.7, and exceeds the same-scale Qwen3-VL-8B (55.3) by +4.8, demonstrating significantly improved parameter efficiency. These results indicate that Ostrakon-VL delivers more robust and reliable FSRS-centric perception and decision-making capabilities. To facilitate reproducible research, we will publicly release Ostrakon-VL and the ShopBench benchmark.

Motivation and Objectives

Motivate domain-specific adaptation of multimodal LLMs for Food-Service and Retail Stores (FSRS).
Develop a robust, auditable data-curation pipeline to handle noisy, heterogeneous FSRS data.
Create a standardized FSRS benchmark (ShopBench) to enable fine-grained evaluation across image, multi-image, and video inputs.
Demonstrate a domain-aware training strategy that yields robust FSRS perception-to-reasoning capabilities.

Proposed Method

Introduce QUAD, a four-stage data-curation pipeline: Quality Filtering, Foundation Model Referenced Filtering, Multimodal Semantic Deduplication, and Capability Coverage Redistribution.
Synthesize data using a multimodal generator to form a large preliminary corpus and then prune it with QUAD.
Apply a multi-stage training strategy: Caption Bootstrapping, Offline Curriculum Learning, and Mixed Preference Optimization (MPO).
Use Caption Bootstrapping to inject FSRS domain knowledge via dense, evidence-rich captions; employ Offline Curriculum Learning to stage learning from easy to hard; apply MPO to align outputs with high-quality preferences and maintain generation stability.
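
The four QUAD stages named above can be pictured as a chain of filters over a candidate pool. The following is a minimal sketch, not the authors' implementation: the `Sample` fields, the thresholds, and every function name are hypothetical. Exact-match deduplication on normalized text stands in for multimodal semantic deduplication, and a simple per-capability cap stands in for capability coverage redistribution. The reference check is passed in as a plain callable because the review does not say how the foundation-model comparison is performed.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class Sample:
    """Hypothetical instruction sample; every field is illustrative."""
    question: str
    answer: str
    quality: float    # assumed precomputed quality score in [0, 1]
    capability: str   # assumed capability tag, e.g. "ocr" or "counting"


def quality_filter(samples: List[Sample], min_quality: float) -> List[Sample]:
    # Stage 1 (Quality Filtering): drop samples below an assumed threshold.
    return [s for s in samples if s.quality >= min_quality]


def reference_filter(
    samples: List[Sample], agrees_with_reference: Callable[[Sample], bool]
) -> List[Sample]:
    # Stage 2 (Foundation Model Referenced Filtering): keep samples that an
    # external check accepts. The check itself is supplied by the caller.
    return [s for s in samples if agrees_with_reference(s)]


def normalize(text: str) -> str:
    # Lowercase and collapse whitespace so trivial variants compare equal.
    return " ".join(text.lower().split())


def semantic_dedup(samples: List[Sample]) -> List[Sample]:
    # Stage 3 stand-in: exact match on normalized text, keeping the first
    # occurrence. A real pipeline would compare multimodal embeddings.
    seen = set()
    kept = []
    for s in samples:
        key = (normalize(s.question), normalize(s.answer))
        if key not in seen:
            seen.add(key)
            kept.append(s)
    return kept


def redistribute_coverage(samples: List[Sample], max_per_capability: int) -> List[Sample]:
    # Stage 4 stand-in: cap each capability bucket so that no single
    # capability dominates the curated corpus.
    counts = defaultdict(int)
    kept = []
    for s in samples:
        if counts[s.capability] < max_per_capability:
            counts[s.capability] += 1
            kept.append(s)
    return kept


def run_quad_sketch(
    samples: List[Sample],
    agrees_with_reference: Callable[[Sample], bool],
    min_quality: float = 0.5,
    max_per_capability: int = 2,
) -> List[Sample]:
    # Apply the four stages in the order listed in the review.
    stage1 = quality_filter(samples, min_quality)
    stage2 = reference_filter(stage1, agrees_with_reference)
    stage3 = semantic_dedup(stage2)
    return redistribute_coverage(stage3, max_per_capability)
```

Keeping each stage as a separate function that takes and returns a list makes it easy to log how many samples survive every step, which is one plausible way to support the auditability goal the paper emphasizes.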

Experimental Results

Research Questions

RQ1: Can a domain-focused MLLM, built with a dedicated data-curation and training loop, outperform general-purpose models on FSRS tasks?
RQ2: How can a standardized, auditable FSRS benchmark (ShopBench) enable robust evaluation of perception and reasoning across image, multi-image, and video inputs?
RQ3: What is the impact of a quality-driven, multi-stage data-curation pipeline on downstream FSRS model performance?
RQ4: Does a multi-stage, domain-aware training strategy yield superior end-to-end FSRS perception-to-reasoning capabilities?
RQ5: How does a domain-specific model compare in parameter efficiency to larger general-purpose models on FSRS tasks?

Key Findings

Ostrakon-VL achieves an average score of 60.1 on ShopBench, a new state-of-the-art among open-source MLLMs with similar parameter scales.
It surpasses the larger Qwen3-VL-235B-A22B (59.4) by +0.7 points.
It exceeds the same-scale Qwen3-VL-8B (55.3) by +4.8 points, showing improved parameter efficiency.
ShopBench provides fine-grained evaluation across single-image, multi-image, and video FSRS scenarios, enabling assessment of both robustness and evidence extraction.
QUAD distills a 69.25M candidate pool down to a high-signal 3.40M corpus (retaining ~1/20 of data), improving downstream performance.
Ostrakon-VL and ShopBench will be publicly released to support reproducible FSRS research.
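
The margins and the retention ratio quoted in these findings can be checked with a few lines of arithmetic. This is a small sketch that uses only the figures reported above; the variable and function names are illustrative.

```python
# Average ShopBench scores as reported in the review.
scores = {
    "Ostrakon-VL": 60.1,
    "Qwen3-VL-235B-A22B": 59.4,
    "Qwen3-VL-8B": 55.3,
}

# QUAD corpus sizes in millions of samples, as reported in the review.
candidate_pool_m = 69.25
curated_corpus_m = 3.40


def margin(model: str, baseline: str) -> float:
    # Score difference in points, rounded to the one decimal place
    # used by the reported scores.
    return round(scores[model] - scores[baseline], 1)


# Fraction of the candidate pool that survives curation.
retention = curated_corpus_m / candidate_pool_m

print(f"vs Qwen3-VL-235B-A22B: {margin('Ostrakon-VL', 'Qwen3-VL-235B-A22B'):+.1f}")
print(f"vs Qwen3-VL-8B: {margin('Ostrakon-VL', 'Qwen3-VL-8B'):+.1f}")
print(f"retention: {retention:.1%}")
```

The retained fraction comes out at about 4.9%, consistent with the "~1/20" quoted above, and the two margins match the reported +0.7 and +4.8.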

This review was generated by AI and checked by a human editor.