QUICK REVIEW

[Paper Digest] TACIT Benchmark: A Programmatic Visual Reasoning Benchmark for Generative and Discriminative Models

Daniel Nobrega Medeiros|arXiv (Cornell University)|Feb 27, 2026

Multimodal Machine Learning Applications | Cited by 0

ONE-SENTENCE SUMMARY

TL;DR: TACIT Benchmark introduces a language-minimal, dual-track visual reasoning suite with deterministic verification across 10 tasks in 6 domains, enabling reproducible evaluation of generative and discriminative models on identical puzzles.

ABSTRACT

Existing visual reasoning benchmarks predominantly rely on natural language prompts, evaluate narrow reasoning modalities, or depend on subjective scoring procedures such as LLM-as-judge. We introduce the TACIT Benchmark, a programmatic visual reasoning benchmark comprising 10 tasks across 6 reasoning domains: spatial navigation, abstract pattern completion, causal simulation, logical constraint satisfaction, graph theory, and topology. The benchmark provides dual-track evaluation: a generative track in which models must produce solution images verified through deterministic computer-vision pipelines, and a discriminative track offering five-way multiple choice with structurally plausible near-miss distractors. Each distractor violates exactly one structural constraint, requiring models to reason about fine-grained visual differences rather than exploit superficial cues. Version 0.1.0 distributes 6,000 puzzles (108,000 PNG images across three resolutions) with fully deterministic seeded generation and reproducible verification. The dataset, generation code, and evaluation harness are released under the Apache 2.0 license on HuggingFace (DOI: 10.57967/hf/7904).
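
To make "deterministic verification" concrete, here is a minimal sketch of what a verifier for a spatial navigation puzzle might look like. It is not the benchmark's pipeline: the real harness works on PNG images, whereas this toy assumes the image has already been decoded into a character grid. The function name `verify_maze_solution` and the cell symbols are hypothetical.

```python
from collections import deque

def verify_maze_solution(puzzle, solution):
    """Return True if `solution` marks a 4-connected path from S to G.

    Both arguments are lists of equal-length strings (a toy stand-in for a
    decoded image): '#' wall, '.' free, 'S' start, 'G' goal, and '*' for
    cells the model painted as its path.
    """
    if len(puzzle) != len(solution) or any(
        len(p) != len(s) for p, s in zip(puzzle, solution)
    ):
        return False
    start = goal = None
    for r, row in enumerate(puzzle):
        for c, cell in enumerate(row):
            painted = solution[r][c]
            # The model may only paint over free cells; all else must match.
            if painted != cell and not (cell == "." and painted == "*"):
                return False
            if cell == "S":
                start = (r, c)
            elif cell == "G":
                goal = (r, c)
    if start is None or goal is None:
        return False
    # Breadth-first search restricted to painted cells (plus the goal).
    seen, queue = {start}, deque([start])
    while queue:
        r, c = queue.popleft()
        if (r, c) == goal:
            return True
        for nr, nc in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            inside = 0 <= nr < len(solution) and 0 <= nc < len(solution[0])
            if inside and (nr, nc) not in seen and solution[nr][nc] in "*G":
                seen.add((nr, nc))
                queue.append((nr, nc))
    return False

PUZZLE = ["S.#", "..#", "#.G"]
GOOD = ["S*#", ".*#", "#*G"]   # unbroken path from S to G
GAP = ["S*#", "..#", "#*G"]    # path is interrupted in the middle row
CHEAT = ["S**", ".*#", "#*G"]  # paints over a wall cell
```

Because the check involves no randomness and no learned judge, the same solution always receives the same verdict, which is the property the abstract contrasts with LLM-as-judge scoring.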

MOTIVATION AND OBJECTIVES

Objective 0: Provide a language-minimal, visually specified benchmark to isolate visual reasoning from linguistic ability.
Objective 1: Offer dual-track evaluation (generative and discriminative) on identical stimuli to diagnose constructive versus selective reasoning.
Objective 2: Ensure reproducible, deterministic scoring via computer-vision verification pipelines.
Objective 3: Cover diverse reasoning domains: spatial navigation, abstract pattern completion, causal simulation, logical constraint satisfaction, graph theory, and topology.
Objective 4: Release an extensible, open-source generation and evaluation pipeline for reproducible research.

PROPOSED METHOD

Method 0: Design ten tasks across six reasoning domains with parameterized difficulty levels.
Method 1: Implement a dual-track evaluation where models either generate a solution image or choose from five candidates.
Method 2: Use deterministic, task-specific computer-vision pipelines to verify generative outputs.
Method 3: Render puzzles from SVG sources and rasterize to three PNG resolutions for reproducible evaluation.
Method 4: Employ a single-constraint distractor system so each distractor violates exactly one structural constraint.
Method 5: Provide seed-based, deterministic puzzle generation with a fixed global seed to ensure reproducibility.
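
The seeded generation and single-constraint distractor ideas above can be sketched together as follows. This is an illustrative toy, not the released generator: the puzzle is reduced to a tuple of values with one equality constraint per position, and `GLOBAL_SEED`, `puzzle_rng`, `violated`, and `make_puzzle` are hypothetical names. Per-puzzle seeds are derived with SHA-256 because Python's built-in `hash()` on strings is salted per process.

```python
import hashlib
import random

GLOBAL_SEED = 42  # hypothetical value; the digest only says the seed is fixed

def puzzle_rng(task, index, global_seed=GLOBAL_SEED):
    """Derive an independent, reproducible RNG for one puzzle."""
    key = f"{global_seed}:{task}:{index}".encode()
    digest = hashlib.sha256(key).digest()
    return random.Random(int.from_bytes(digest[:8], "big"))

def violated(candidate, solution):
    """Indices of the per-position constraints that `candidate` breaks."""
    return [i for i, (a, b) in enumerate(zip(candidate, solution)) if a != b]

def make_puzzle(task, index, n_constraints=6, n_values=4, n_distractors=4):
    """Build one toy puzzle: a solution plus single-violation distractors."""
    rng = puzzle_rng(task, index)
    solution = tuple(rng.randrange(n_values) for _ in range(n_constraints))
    candidates = [solution]
    # Each distractor breaks a different single constraint.
    for pos in rng.sample(range(n_constraints), n_distractors):
        wrong = rng.choice([v for v in range(n_values) if v != solution[pos]])
        candidates.append(solution[:pos] + (wrong,) + solution[pos + 1:])
    rng.shuffle(candidates)
    return {
        "solution": solution,
        "candidates": candidates,  # five options: one correct, four near-misses
        "answer": candidates.index(solution),
    }
```

Calling `make_puzzle("maze", 0)` twice yields identical puzzles, and every wrong candidate differs from the solution in exactly one position, so a model cannot rule it out without checking that specific constraint.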

EXPERIMENTAL RESULTS

RESEARCH QUESTIONS

Research Question 0: Can models exhibit constructive visual reasoning by generating correct solution images that pass deterministic CV verification?
Research Question 1: What is the gap between generative and discriminative performance on identical TACIT puzzles across tasks?
Research Question 2: How do models perform across the six reasoning domains and three difficulty levels?
Research Question 3: Do near-miss distractors effectively diagnose specific reasoning weaknesses in models?
Research Question 4: How reproducible are results with a fully automated, seed-driven generation and verification pipeline?
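
One way the generative versus discriminative gap could be quantified is a per-task accuracy difference, sketched below. The record format, the track labels, and the function name `track_gap` are assumptions for illustration and do not come from the evaluation harness. The only fixed number is the 20% random-guess baseline implied by five-way multiple choice.

```python
from collections import defaultdict

CHANCE_5_WAY = 1 / 5  # random-guess accuracy on a five-way multiple choice

def track_gap(records):
    """Map task -> discriminative accuracy minus generative accuracy.

    `records` is an iterable of (task, track, passed) tuples, where `track`
    is "generative" or "discriminative" (hypothetical labels).
    """
    hits = defaultdict(lambda: [0, 0])  # (task, track) -> [passed, total]
    for task, track, passed in records:
        hits[(task, track)][0] += bool(passed)
        hits[(task, track)][1] += 1
    acc = {key: p / n for key, (p, n) in hits.items()}
    tasks = {task for task, _ in acc}
    # Only report tasks that were evaluated on both tracks.
    return {
        t: acc[(t, "discriminative")] - acc[(t, "generative")]
        for t in tasks
        if (t, "discriminative") in acc and (t, "generative") in acc
    }

# Synthetic placeholder records, not measurements from any model.
demo_records = [
    ("maze", "generative", False),
    ("maze", "generative", True),
    ("maze", "discriminative", True),
    ("maze", "discriminative", True),
]
```

A positive gap means the model selects the right answer more often than it can construct it. Discriminative accuracy should also be read against `CHANCE_5_WAY`, since a score near 0.2 indicates guessing.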

KEY FINDINGS

Key Finding 0: The benchmark provides 10 tasks across 6 domains with parameterized difficulty.
Key Finding 1: It supports dual-track evaluation (generative and discriminative) with deterministic CV-based verification for all generative outputs.
Key Finding 2: Distractors are generated to violate exactly one structural constraint, ensuring plausible but incorrect options.
Key Finding 3: The release includes 6,000 puzzles (108,000 PNG images across three resolutions) with seed-based deterministic generation.
Key Finding 4: All content and tooling are open-source under Apache 2.0 on HuggingFace, enabling reproducible research.
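
As a quick sanity check on the reported release size, the arithmetic below shows that the figures imply six images per puzzle at each resolution. The digest does not say what those six images are. One reading consistent with the five-way multiple choice is a puzzle image plus five candidates, but that is a guess. The even split across tasks is likewise an assumption.

```python
PUZZLES = 6_000
PNG_IMAGES = 108_000
RESOLUTIONS = 3
TASKS = 10

# The reported totals divide evenly, so integer division loses nothing.
assert PNG_IMAGES % (PUZZLES * RESOLUTIONS) == 0

images_per_resolution = PNG_IMAGES // RESOLUTIONS      # 36,000
images_per_puzzle = images_per_resolution // PUZZLES   # 6 per puzzle, per resolution

# Assumption: puzzles are split evenly across the ten tasks.
puzzles_per_task = PUZZLES // TASKS                    # 600
```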

This digest was generated by AI and reviewed by human editors.