QUICK REVIEW

[Paper Review] Ostrakon-VL: Towards Domain-Expert MLLM for Food-Service and Retail Stores

Zhiyong Shen, Gongpeng Zhao|arXiv (Cornell University)|Jan 29, 2026
Multimodal Machine Learning Applications | Cited by 0
One-Sentence Summary

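Ostrakon-VL is a multimodal large language model focused on FSRS (Food-Service and Retail Stores) and built on Qwen3-VL-8B. Paired with the ShopBench benchmark and the QUAD data-curation pipeline, it delivers strong FSRS perception and reasoning and outperforms comparable open-source models, including a substantially larger one.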

ABSTRACT

Multimodal Large Language Models (MLLMs) have recently achieved substantial progress in general-purpose perception and reasoning. Nevertheless, their deployment in Food-Service and Retail Stores (FSRS) scenarios encounters two major obstacles: (i) real-world FSRS data, collected from heterogeneous acquisition devices, are highly noisy and lack auditable, closed-loop data curation, which impedes the construction of high-quality, controllable, and reproducible training corpora; and (ii) existing evaluation protocols do not offer a unified, fine-grained and standardized benchmark spanning single-image, multi-image, and video inputs, making it challenging to objectively gauge model robustness. To address these challenges, we first develop Ostrakon-VL, an FSRS-oriented MLLM based on Qwen3-VL-8B. Second, we introduce ShopBench, the first public benchmark for FSRS. Third, we propose QUAD (Quality-aware Unbiased Automated Data-curation), a multi-stage multimodal instruction data curation pipeline. Leveraging a multi-stage training strategy, Ostrakon-VL achieves an average score of 60.1 on ShopBench, establishing a new state of the art among open-source MLLMs with comparable parameter scales and diverse architectures. Notably, it surpasses the substantially larger Qwen3-VL-235B-A22B (59.4) by +0.7, and exceeds the same-scale Qwen3-VL-8B (55.3) by +4.8, demonstrating significantly improved parameter efficiency. These results indicate that Ostrakon-VL delivers more robust and reliable FSRS-centric perception and decision-making capabilities. To facilitate reproducible research, we will publicly release Ostrakon-VL and the ShopBench benchmark.

Motivation and Objectives

  • Motivate domain-specific adaptation of multimodal LLMs for Food-Service and Retail Stores (FSRS).
  • Develop a robust, auditable data-curation pipeline to handle noisy, heterogeneous FSRS data.
  • Create a standardized FSRS benchmark (ShopBench) to enable fine-grained evaluation across image, multi-image, and video inputs.
  • Demonstrate a domain-aware training strategy that yields robust FSRS perception-to-reasoning capabilities.

Proposed Method

  • Introduce QUAD, a four-stage data-curation pipeline: Quality Filtering, Foundation Model Referenced Filtering, Multimodal Semantic Deduplication, and Capability Coverage Redistribution.
  • Synthesize data using a multimodal generator to form a large preliminary corpus and then prune it with QUAD.
  • Apply a multi-stage Training Strategy: Caption Bootstrapping, Offline Curriculum Learning, and Mixed Preference Optimization (MPO).
  • Use Caption Bootstrapping to inject FSRS domain knowledge via dense, evidence-rich captions; employ Offline Curriculum Learning to stage learning from easy to hard; apply MPO to align outputs with high-quality preferences and maintain generation stability.
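The four QUAD stages listed above can be pictured as filters applied in sequence. The following is a minimal sketch under stated assumptions, not the authors' implementation: the `Sample` fields, the thresholds, the exact-text stand-in for semantic deduplication, and the per-capability cap are all hypothetical choices made for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Sample:
    sample_id: str
    text: str
    quality: float        # hypothetical quality score in [0, 1]
    ref_agreement: float  # hypothetical agreement with a reference model, in [0, 1]
    capability: str       # hypothetical capability tag, e.g. "ocr" or "counting"


def quality_filter(samples: List[Sample], min_quality: float = 0.5) -> List[Sample]:
    """Stage 1 (assumed form): drop samples below a quality threshold."""
    return [s for s in samples if s.quality >= min_quality]


def reference_filter(samples: List[Sample], min_agreement: float = 0.5) -> List[Sample]:
    """Stage 2 (assumed form): keep samples a reference model agrees with."""
    return [s for s in samples if s.ref_agreement >= min_agreement]


def semantic_dedup(samples: List[Sample]) -> List[Sample]:
    """Stage 3 stand-in: dedupe on normalized text.

    A real pipeline would likely compare multimodal embeddings instead.
    """
    seen = set()
    kept = []
    for s in samples:
        key = " ".join(s.text.lower().split())
        if key not in seen:
            seen.add(key)
            kept.append(s)
    return kept


def redistribute(samples: List[Sample], cap_per_capability: int = 2) -> List[Sample]:
    """Stage 4 (assumed form): cap each capability, keeping the best samples."""
    buckets = defaultdict(list)
    for s in samples:
        buckets[s.capability].append(s)
    kept = []
    for capability in sorted(buckets):
        ranked = sorted(buckets[capability], key=lambda s: s.quality, reverse=True)
        kept.extend(ranked[:cap_per_capability])
    return kept


def curate(samples: List[Sample], cap_per_capability: int = 2) -> List[Sample]:
    """Run the four stages in the order the summary lists them."""
    stage1 = quality_filter(samples)
    stage2 = reference_filter(stage1)
    stage3 = semantic_dedup(stage2)
    return redistribute(stage3, cap_per_capability)
```

Running the cheap filters first means the more expensive deduplication and rebalancing steps see a smaller pool; whether QUAD orders its stages for that reason is not stated in the summary.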

Experimental Results

Research Questions

  • RQ1: Can a domain-focused MLLM for FSRS, built with a dedicated data-curation and training loop, outperform general-purpose models on FSRS tasks?
  • RQ2: How can a standardized, auditable FSRS benchmark (ShopBench) enable robust evaluation of perception and reasoning across image, multi-image, and video inputs?
  • RQ3: What is the impact of a quality-driven, multi-stage data-curation pipeline on downstream FSRS model performance?
  • RQ4: Does a multi-stage, domain-aware training strategy yield superior end-to-end FSRS perception-to-reasoning capabilities?
  • RQ5: How does a domain-specific model compare in parameter efficiency to larger general-purpose models in FSRS tasks?

Key Findings

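  • Ostrakon-VL achieves an average score of 60.1 on ShopBench, a new state of the art among open-source MLLMs of comparable parameter scale.
  • It surpasses the substantially larger Qwen3-VL-235B-A22B (59.4) by +0.7.
  • It exceeds the same-scale Qwen3-VL-8B (55.3) by +4.8, indicating improved parameter efficiency.
  • ShopBench provides fine-grained evaluation of single-image, multi-image, and video FSRS scenarios, making it easier to assess robustness and evidence extraction.
  • QUAD distills a 69.25M candidate pool into a high-signal 3.40M corpus (retaining roughly 1/20 of the data), improving downstream performance.
  • Ostrakon-VL and ShopBench will be publicly released to support reproducible FSRS research.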
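The margins and the retention ratio in the findings above follow directly from the reported figures. The short check below is only a sketch of that arithmetic; the helper names `score_gain` and `retention_ratio` are hypothetical, and the numbers are the ones quoted in this review.

```python
def score_gain(model_score: float, baseline_score: float) -> float:
    """Difference in average ShopBench score, rounded to one decimal place."""
    return round(model_score - baseline_score, 1)


def retention_ratio(kept_millions: float, pool_millions: float) -> float:
    """Fraction of the candidate pool that survives curation."""
    return kept_millions / pool_millions


# Figures as reported in this review.
OSTRAKON_VL = 60.1
QWEN3_VL_235B_A22B = 59.4
QWEN3_VL_8B = 55.3

gain_vs_large = score_gain(OSTRAKON_VL, QWEN3_VL_235B_A22B)   # margin over the larger model
gain_vs_same_scale = score_gain(OSTRAKON_VL, QWEN3_VL_8B)     # margin over the same-scale base
kept_fraction = retention_ratio(3.40, 69.25)                  # share of the pool QUAD retains

print(f"gain vs 235B: +{gain_vs_large}")
print(f"gain vs 8B:   +{gain_vs_same_scale}")
print(f"retained:     {kept_fraction:.2%} (about 1/{1 / kept_fraction:.0f})")
```

The retained share works out to a little under 5%, which matches the "roughly 1/20" description.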

This review was generated by AI and reviewed by human editors.