QUICK REVIEW

[Paper Digest] Sketch2Feedback: Grammar-in-the-Loop Framework for Rubric-Aligned Feedback on Student STEM Diagrams

Aayam Bansal|arXiv (Cornell University)|Feb 19, 2026

Multimodal Machine Learning Applications|Cited by 0

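One-Sentence Summary

Sketch2Feedback proposes a four-stage grammar-in-the-loop pipeline that combines hybrid perception, symbolic graph reasoning, constraint checking, and constrained VLM feedback to give rubric-aligned feedback on student diagrams, with mixed results on the FBD and circuit datasets.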

ABSTRACT

Providing timely, rubric-aligned feedback on student-drawn diagrams is a persistent challenge in STEM education. While large multimodal models (LMMs) can jointly parse images and generate explanations, their tendency to hallucinate undermines trust in classroom deployments. We present Sketch2Feedback, a grammar-in-the-loop framework that decomposes the problem into four stages -- hybrid perception, symbolic graph construction, constraint checking, and constrained VLM feedback -- so that the language model verbalizes only violations verified by an upstream rule engine. We evaluate on two synthetic micro-benchmarks, FBD-10 (free-body diagrams) and Circuit-10 (circuit schematics), each with 500 images spanning standard and hard noise augmentation tiers, comparing our pipeline against end-to-end LMMs (LLaVA-1.5-7B, Qwen2-VL-7B), a vision-only detector, a YOLOv8-nano learned detector, and an ensemble oracle. On n=100 test samples per benchmark with 95% bootstrap CIs, results are mixed and instructive: Qwen2-VL-7B achieves the highest micro-F1 on both FBDs (0.570) and circuits (0.528), but with extreme hallucination rates (0.78, 0.98). An ensemble oracle that selects the best prediction per sample reaches F1=0.556 with hallucination 0.320 on FBDs, demonstrating exploitable complementarity between grammar and end-to-end approaches. Confidence thresholding at tau=0.7 reduces circuit hallucination from 0.970 to 0.880 with no F1 loss. Hard noise augmentation reveals domain-dependent robustness: FBD detection is resilient while circuit detection degrades sharply. An LLM-as-judge evaluation confirms that the grammar pipeline produces more actionable circuit feedback (4.85/5) than the end-to-end LMM (3.11/5). We release all code, datasets, and evaluation scripts.
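
The abstract reports micro-F1 with 95% bootstrap CIs on n=100 test samples. As a rough illustration of how such a number might be computed, here is a minimal sketch using only the Python standard library. The per-sample label sets, the percentile bootstrap, the resample count, and the names `micro_f1` and `bootstrap_ci` are assumptions made for this example; the paper's released evaluation scripts may differ.

```python
import random

def micro_f1(samples):
    """Micro-averaged F1 over (predicted, gold) pairs of error-label sets."""
    tp = fp = fn = 0
    for pred, gold in samples:
        tp += len(pred & gold)   # errors flagged and actually present
        fp += len(pred - gold)   # errors flagged but not present
        fn += len(gold - pred)   # errors present but missed
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def bootstrap_ci(samples, metric, n_resamples=1000, alpha=0.05, seed=0):
    """Percentile bootstrap CI (assumed variant): resample with replacement."""
    rng = random.Random(seed)
    n = len(samples)
    stats = sorted(
        metric([samples[rng.randrange(n)] for _ in range(n)])
        for _ in range(n_resamples)
    )
    lo = stats[int((alpha / 2) * n_resamples)]
    hi = stats[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    return lo, hi

# Toy data (hypothetical): each pair is (predicted errors, gold errors).
samples = [
    ({"missing_force"}, {"missing_force"}),
    ({"wrong_direction"}, set()),
    (set(), {"missing_ground"}),
    ({"missing_force", "wrong_direction"}, {"missing_force"}),
]

print(round(micro_f1(samples), 3))   # 0.571 (tp=2, fp=2, fn=1)
print(bootstrap_ci(samples, micro_f1))
```

With only four toy samples the interval is very wide; the point of the sketch is the resampling procedure, not the numbers.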

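Research Motivation and Objectives

Provide timely, rubric-aligned feedback on student-drawn STEM diagrams while addressing the hallucinations of end-to-end large models.
Separate perception from reasoning to make feedback more trustworthy and actionable.
Evaluate the four-stage pipeline on the FBD-10 and Circuit-10 benchmarks, which contain realistic errors.
Analyze where perception, reasoning, and generation succeed or fail, and attribute errors transparently.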

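Proposed Method

Stage 1 uses hybrid vision-based detection (CLAHE, adaptive thresholding, contours, HoughLinesP) to detect primitives.
Stage 2 builds a typed symbolic graph G=(V,E) from the detected primitives.
Stage 3 runs domain-specific local and non-local constraint checks against scene key points.
Stage 4 passes only verified violations to a constrained VLM (Qwen2-VL-2B) to generate rubric-aligned feedback, falling back to templates when needed.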

Figure 1: Sketch2Feedback pipeline overview. Stage 1: Hybrid CV perception detects primitives (arrows, wires, components, junctions) via CLAHE preprocessing, adaptive thresholding, contour analysis, and HoughLinesP. Stage 2: Detected primitives form a typed symbolic graph $G=(V,E)$ with spatial p
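
To make Stages 2 to 4 more concrete, the following is a minimal sketch for the circuit domain, assuming Stage 1 has already produced a list of detected primitives and their connections. Every name here (`SymbolicGraph`, `build_graph`, `check_constraints`, `generate_feedback`, the `dangling_component` rule, and the template strings) is hypothetical and chosen for illustration; only the `missing_ground` error type comes from the paper. The `vlm` argument is a placeholder callable, not the Qwen2-VL-2B interface.

```python
from dataclasses import dataclass, field

@dataclass
class SymbolicGraph:
    nodes: dict = field(default_factory=dict)   # node id -> primitive type
    edges: list = field(default_factory=list)   # (src id, dst id, relation)

    def degree(self, node_id):
        """Number of edges touching the node."""
        return sum(node_id in (s, d) for s, d, _ in self.edges)

def build_graph(primitives, connections):
    """Stage 2 (sketch): typed graph from detected primitives."""
    g = SymbolicGraph()
    for p in primitives:
        g.nodes[p["id"]] = p["type"]
    for src, dst in connections:
        if src in g.nodes and dst in g.nodes:
            g.edges.append((src, dst, "connected_to"))
    return g

def check_constraints(g):
    """Stage 3 (sketch): one local and one non-local rule, both assumed."""
    violations = []
    # Local rule: a two-terminal component needs at least two connections.
    for node_id, kind in g.nodes.items():
        if kind in {"resistor", "battery"} and g.degree(node_id) < 2:
            violations.append({"code": "dangling_component", "node": node_id})
    # Non-local rule: the circuit as a whole must contain a ground node.
    if "ground" not in g.nodes.values():
        violations.append({"code": "missing_ground", "node": None})
    return violations

TEMPLATES = {
    "dangling_component": "Component {node} is not connected at both terminals.",
    "missing_ground": "The circuit has no ground reference; add a ground symbol.",
}

def generate_feedback(violations, vlm=None):
    """Stage 4 (sketch): verbalize only verified violations, else use a template."""
    feedback = []
    for v in violations:
        text = None
        if vlm is not None:
            try:
                text = vlm(v)        # placeholder for a constrained VLM call
            except Exception:
                text = None          # any failure triggers the template fallback
        if not text:
            text = TEMPLATES[v["code"]].format(node=v["node"])
        feedback.append(text)
    return feedback

primitives = [{"id": "b1", "type": "battery"}, {"id": "r1", "type": "resistor"}]
connections = [("b1", "r1"), ("r1", "b1")]   # a closed loop without ground
graph = build_graph(primitives, connections)
violations = check_constraints(graph)
print(generate_feedback(violations))
```

Because the generator only ever sees the `violations` list, it cannot report an error that the rule engine did not flag. Any remaining false reports then trace back to the perception or rule stages rather than to the language model.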

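Experimental Results

Research Questions

RQ1: Can a grammar-in-the-loop pipeline give rubric-aligned feedback on student diagrams that is backed by grounded, verifiable observations?
RQ2: How does modular perception plus reasoning compare with end-to-end large models at detecting diagram errors and producing actionable feedback?
RQ3: Where do the perception or reasoning stages fail, and can error attribution inform future improvements?
RQ4: How does the proposed method perform on free-body diagrams and circuit schematics?
RQ5: What are the trade-offs among detection accuracy, feedback quality, hallucination, calibration, and potential latency?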

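Key Findings

The end-to-end large model outperforms the grammar pipeline at FBD error detection (micro-F1 0.471 vs 0.263) and gives stronger feedback in the FBD setting.
The grammar pipeline outperforms the end-to-end model on circuit diagrams (micro-F1 0.329 vs 0.038) and achieves full actionability (5.0/5).
Circuit hallucination in the grammar pipeline is high (0.925), but it stems from perception false positives rather than LLM fabrication, so the errors can be attributed precisely to Stage 1.
When violations are detected, template-based generation gives the grammar pipeline perfect circuit feedback actionability (5.0/5).
The vision-only baseline shows very low hallucination but poor detection, underscoring the need for structured reasoning to obtain actionable feedback.
Per-type analysis shows complementary strengths: the grammar pipeline excels at structural constraint violations in FBDs and at missing ground in circuits, while the end-to-end model is better at detecting omission-type errors such as missing forces.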

Figure 2: Model complementarity across error types. The grammar pipeline excels at structural constraint violations (wrong direction, missing ground), while the E2E-LMM detects omission-type errors (missing force). Neither model detects missing components or wrong polarity, indicating a shared perc
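
The abstract describes an ensemble oracle that selects the best prediction per sample as a way to quantify this complementarity. A minimal sketch of that idea follows. The selection criterion (per-sample F1 against the gold labels), the tie-breaking toward the grammar pipeline, and the function names are assumptions for illustration, not the paper's stated procedure.

```python
def sample_f1(pred, gold):
    """F1 between one predicted and one gold set of error labels."""
    if not pred and not gold:
        return 1.0                      # correctly predicting "no errors"
    tp = len(pred & gold)
    return 2 * tp / (len(pred) + len(gold))

def oracle_select(grammar_preds, e2e_preds, golds):
    """Per sample, keep whichever prediction scores higher (ties: grammar)."""
    chosen = []
    for g, e, y in zip(grammar_preds, e2e_preds, golds):
        chosen.append(g if sample_f1(g, y) >= sample_f1(e, y) else e)
    return chosen

# Toy data (hypothetical): each system gets a different sample right.
golds         = [{"wrong_direction"}, {"missing_force"}]
grammar_preds = [{"wrong_direction"}, set()]
e2e_preds     = [set(), {"missing_force"}]

print(oracle_select(grammar_preds, e2e_preds, golds))
```

Since the selection step consults the gold labels, an oracle of this kind is an upper bound on what a learned combiner could reach, not a deployable system.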

This digest was generated by AI and reviewed by human editors.