[Paper Review] Ostrakon-VL: Towards Domain-Expert MLLM for Food-Service and Retail Stores
Ostrakon-VL is a multimodal LLM for Food-Service and Retail Stores (FSRS), built on Qwen3-VL-8B. Paired with the ShopBench benchmark and the QUAD data-curation pipeline, it delivers strong FSRS perception and reasoning and outperforms substantially larger open-source models such as Qwen3-VL-235B-A22B.
Multimodal Large Language Models (MLLMs) have recently achieved substantial progress in general-purpose perception and reasoning. Nevertheless, their deployment in Food-Service and Retail Stores (FSRS) scenarios encounters two major obstacles: (i) real-world FSRS data, collected from heterogeneous acquisition devices, are highly noisy and lack auditable, closed-loop data curation, which impedes the construction of high-quality, controllable, and reproducible training corpora; and (ii) existing evaluation protocols do not offer a unified, fine-grained and standardized benchmark spanning single-image, multi-image, and video inputs, making it challenging to objectively gauge model robustness. To address these challenges, we first develop Ostrakon-VL, an FSRS-oriented MLLM based on Qwen3-VL-8B. Second, we introduce ShopBench, the first public benchmark for FSRS. Third, we propose QUAD (Quality-aware Unbiased Automated Data-curation), a multi-stage multimodal instruction data curation pipeline. Leveraging a multi-stage training strategy, Ostrakon-VL achieves an average score of 60.1 on ShopBench, establishing a new state of the art among open-source MLLMs with comparable parameter scales and diverse architectures. Notably, it surpasses the substantially larger Qwen3-VL-235B-A22B (59.4) by +0.7, and exceeds the same-scale Qwen3-VL-8B (55.3) by +4.8, demonstrating significantly improved parameter efficiency. These results indicate that Ostrakon-VL delivers more robust and reliable FSRS-centric perception and decision-making capabilities. To facilitate reproducible research, we will publicly release Ostrakon-VL and the ShopBench benchmark.
Research Motivation and Objectives
- Motivate domain-specific adaptation of multimodal LLMs for Food-Service and Retail Stores (FSRS).
- Develop a robust, auditable data-curation pipeline to handle noisy, heterogeneous FSRS data.
- Create a standardized FSRS benchmark (ShopBench) to enable fine-grained evaluation across image, multi-image, and video inputs.
- Demonstrate a domain-aware training strategy that yields robust FSRS perception-to-reasoning capabilities.
Proposed Method
- Introduce QUAD, a four-stage data-curation pipeline: Quality Filtering, Foundation Model Referenced Filtering, Multimodal Semantic Deduplication, and Capability Coverage Redistribution.
- Synthesize data using a multimodal generator to form a large preliminary corpus and then prune it with QUAD.
- Apply a multi-stage Training Strategy: Caption Bootstrapping, Offline Curriculum Learning, and Mixed Preference Optimization (MPO).
- Use Caption Bootstrapping to inject FSRS domain knowledge via dense, evidence-rich captions; employ Offline Curriculum Learning to stage learning from easy to hard; apply MPO to align outputs with high-quality preferences and maintain generation stability.
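The four QUAD stages listed above can be pictured as a chain of filters over a candidate pool. The following is only a minimal sketch under stated assumptions: the review does not describe the paper's actual scoring models, thresholds, or deduplication method, so the `Sample` fields, the threshold values, the text-normalization stand-in for semantic deduplication, and the per-capability cap are all hypothetical choices made for illustration.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Sample:
    text: str             # instruction text (images are omitted in this toy example)
    quality: float        # hypothetical quality score in [0, 1]
    ref_agreement: float  # hypothetical agreement with a reference foundation model
    capability: str       # hypothetical capability tag, e.g. "ocr" or "counting"


def quality_filter(samples, min_quality=0.5):
    # Stage 1 (Quality Filtering): drop samples below an assumed quality threshold.
    return [s for s in samples if s.quality >= min_quality]


def reference_filter(samples, min_agreement=0.5):
    # Stage 2 (Foundation Model Referenced Filtering): keep samples that an
    # assumed reference model agrees with.
    return [s for s in samples if s.ref_agreement >= min_agreement]


def semantic_dedup(samples):
    # Stage 3 (Multimodal Semantic Deduplication): a real system would likely
    # compare multimodal embeddings; normalized text is used here as a stand-in.
    seen, kept = set(), []
    for s in samples:
        key = " ".join(s.text.lower().split())
        if key not in seen:
            seen.add(key)
            kept.append(s)
    return kept


def redistribute(samples, cap_per_capability=2):
    # Stage 4 (Capability Coverage Redistribution): cap each capability so that
    # no single one dominates the curated corpus.
    counts, kept = Counter(), []
    for s in samples:
        if counts[s.capability] < cap_per_capability:
            counts[s.capability] += 1
            kept.append(s)
    return kept


def quad_like(samples):
    # Apply the four stages in the order the review lists them.
    for stage in (quality_filter, reference_filter, semantic_dedup, redistribute):
        samples = stage(samples)
    return samples


pool = [
    Sample("Count the cups on the shelf", 0.9, 0.8, "counting"),
    Sample("count the cups  on the shelf", 0.8, 0.9, "counting"),  # near-duplicate
    Sample("Read the price tag", 0.2, 0.9, "ocr"),                 # low quality
    Sample("Is the floor clean?", 0.7, 0.1, "inspection"),         # reference disagrees
    Sample("Read the menu board", 0.9, 0.9, "ocr"),
    Sample("Count the customers in line", 0.9, 0.9, "counting"),
    Sample("Count the trays on the counter", 0.9, 0.9, "counting"),  # exceeds cap
]
curated = quad_like(pool)
print([s.text for s in curated])
```

Keeping each stage as a function that takes and returns a list makes it easy to log how many samples each stage removes, which is one plausible way to support the auditability the paper emphasizes.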
Experimental Results
Research Questions
- RQ1: Can a domain-focused MLLM for FSRS, trained with a dedicated data-curation and training loop, outperform general-purpose MLLMs on FSRS tasks?
- RQ2: How can a standardized, auditable FSRS benchmark (ShopBench) enable robust evaluation of perception and reasoning across single-image, multi-image, and video inputs?
- RQ3: What is the impact of a quality-driven, multi-stage data-curation pipeline on downstream FSRS model performance?
- RQ4: Does a multi-stage, domain-aware training strategy yield superior end-to-end FSRS perception-to-reasoning capabilities?
- RQ5: How does a domain-specific model compare in parameter efficiency to larger general-purpose models on FSRS tasks?
Key Findings
- Ostrakon-VL achieves an average score of 60.1 on ShopBench, a new state-of-the-art among open-source MLLMs with similar parameter scales.
- It surpasses the larger Qwen3-VL-235B-A22B (59.4) by +0.7 points.
- It exceeds the same-scale Qwen3-VL-8B (55.3) by +4.8 points, showing improved parameter efficiency.
- ShopBench provides fine-grained evaluation across single-image, multi-image, and video FSRS scenarios, enabling assessment of model robustness and evidence extraction.
- QUAD distills a 69.25M candidate pool down to a high-signal 3.40M corpus (retaining ~1/20 of data), improving downstream performance.
- Ostrakon-VL and ShopBench will be publicly released to support reproducible FSRS research.
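The reported margins and the QUAD retention ratio follow directly from the figures quoted above. As a quick arithmetic check (the variable names are chosen here for illustration; the numbers are those stated in the findings):

```python
# ShopBench average scores as quoted in the findings above.
scores = {
    "Ostrakon-VL": 60.1,
    "Qwen3-VL-235B-A22B": 59.4,
    "Qwen3-VL-8B": 55.3,
}

# Margins over the larger model and over the same-scale base model.
margin_large = round(scores["Ostrakon-VL"] - scores["Qwen3-VL-235B-A22B"], 1)
margin_base = round(scores["Ostrakon-VL"] - scores["Qwen3-VL-8B"], 1)

# QUAD retention: 3.40M curated samples out of a 69.25M candidate pool.
retention = 3.40 / 69.25

print(f"margin vs. 235B model: +{margin_large}")
print(f"margin vs. 8B base:    +{margin_base}")
print(f"retention: {retention:.1%} (about 1/{1 / retention:.1f})")
```

The retention works out to roughly 4.9%, which is consistent with the "~1/20" figure in the findings.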
This review was generated by AI and checked by a human editor.