QUICK REVIEW

[Paper Review] Sketch2Feedback: Grammar-in-the-Loop Framework for Rubric-Aligned Feedback on Student STEM Diagrams

Aayam Bansal|arXiv (Cornell University)|Feb 19, 2026
Multimodal Machine Learning Applications | Cited by 0
One-Sentence Summary

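Sketch2Feedback proposes a four-stage grammar-in-the-loop pipeline that combines hybrid perception, symbolic graph reasoning, constraint checking, and constrained VLM feedback to give rubric-aligned feedback on student-drawn diagrams; results on the FBD and circuit datasets are mixed.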

ABSTRACT

Providing timely, rubric-aligned feedback on student-drawn diagrams is a persistent challenge in STEM education. While large multimodal models (LMMs) can jointly parse images and generate explanations, their tendency to hallucinate undermines trust in classroom deployments. We present Sketch2Feedback, a grammar-in-the-loop framework that decomposes the problem into four stages -- hybrid perception, symbolic graph construction, constraint checking, and constrained VLM feedback -- so that the language model verbalizes only violations verified by an upstream rule engine. We evaluate on two synthetic micro-benchmarks, FBD-10 (free-body diagrams) and Circuit-10 (circuit schematics), each with 500 images spanning standard and hard noise augmentation tiers, comparing our pipeline against end-to-end LMMs (LLaVA-1.5-7B, Qwen2-VL-7B), a vision-only detector, a YOLOv8-nano learned detector, and an ensemble oracle. On n=100 test samples per benchmark with 95% bootstrap CIs, results are mixed and instructive: Qwen2-VL-7B achieves the highest micro-F1 on both FBDs (0.570) and circuits (0.528), but with extreme hallucination rates (0.78, 0.98). An ensemble oracle that selects the best prediction per sample reaches F1=0.556 with hallucination 0.320 on FBDs, demonstrating exploitable complementarity between grammar and end-to-end approaches. Confidence thresholding at tau=0.7 reduces circuit hallucination from 0.970 to 0.880 with no F1 loss. Hard noise augmentation reveals domain-dependent robustness: FBD detection is resilient while circuit detection degrades sharply. An LLM-as-judge evaluation confirms that the grammar pipeline produces more actionable circuit feedback (4.85/5) than the end-to-end LMM (3.11/5). We release all code, datasets, and evaluation scripts.
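
The abstract reports micro-F1 and hallucination rates with 95% bootstrap confidence intervals. As a rough illustration of how such numbers can be computed, here is a minimal sketch. It assumes each sample is represented as a set of error labels, that a hallucination is any reported error absent from the ground truth, and that the interval is a percentile bootstrap; the paper's exact definitions may differ. The function names and the toy labels are hypothetical.

```python
import random

def micro_f1(preds, golds):
    """Micro-averaged F1 over per-sample sets of error labels."""
    tp = sum(len(p & g) for p, g in zip(preds, golds))
    fp = sum(len(p - g) for p, g in zip(preds, golds))
    fn = sum(len(g - p) for p, g in zip(preds, golds))
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def hallucination_rate(preds, golds):
    """ASSUMED definition: share of samples reporting at least one
    error that is not in the ground-truth label set."""
    if not preds:
        return 0.0
    return sum(1 for p, g in zip(preds, golds) if p - g) / len(preds)

def bootstrap_ci(metric, preds, golds, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap: resample samples with replacement."""
    rng = random.Random(seed)
    n = len(preds)
    stats = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        stats.append(metric([preds[i] for i in idx], [golds[i] for i in idx]))
    stats.sort()
    lo = stats[int((alpha / 2) * n_boot)]
    hi = stats[min(n_boot - 1, int((1 - alpha / 2) * n_boot))]
    return lo, hi

# Toy data (hypothetical labels, not taken from the benchmarks).
golds = [{"missing_force"}, {"wrong_direction"}, set(), {"missing_ground"}]
preds = [{"missing_force"}, {"missing_force"}, {"wrong_direction"}, {"missing_ground"}]

print(micro_f1(preds, golds))            # 4/7, about 0.571
print(hallucination_rate(preds, golds))  # 0.5
print(bootstrap_ci(micro_f1, preds, golds, n_boot=200))
```

With only four toy samples the interval is very wide; the reported evaluation uses n=100 test samples per benchmark.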

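Motivation and Goals

  • Provide timely, rubric-aligned feedback on student-drawn STEM diagrams while addressing the hallucinations of end-to-end large multimodal models.
  • Separate perception from reasoning to make feedback more trustworthy and actionable.
  • Evaluate the four-stage pipeline on the FBD-10 and Circuit-10 benchmarks, which contain realistic errors.
  • Analyze where perception, reasoning, and generation succeed or fail, and attribute errors transparently.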

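Proposed Method

  • Stage 1 uses hybrid visual detection (CLAHE, adaptive thresholding, contour analysis, HoughLinesP) to detect primitives.
  • Stage 2 builds a typed symbolic graph G=(V,E) from the detected primitives.
  • Stage 3 runs domain-specific local and non-local constraint checks against the key points of the scene.
  • Stage 4 passes only verified violations to a constrained VLM (Qwen2-VL-2B) to generate rubric-aligned feedback, falling back to templates when needed.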
Figure 1: Sketch2Feedback pipeline overview. Stage 1: Hybrid CV perception detects primitives (arrows, wires, components, junctions) via CLAHE preprocessing, adaptive thresholding, contour analysis, and HoughLinesP. Stage 2: Detected primitives form a typed symbolic graph $G=(V,E)$ with spatial p
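
To make the division of labor between the stages concrete, the following is a minimal sketch of Stages 2 to 4 for a circuit, assuming Stage 1 has already produced primitives with confidence scores. Everything here is illustrative: the class names, the `dangling_component` rule, the template wording, and the use of a confidence threshold `tau` at graph-construction time are assumptions, not the authors' implementation. Template output stands in for the constrained VLM, mirroring the template fallback.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    id: str
    kind: str                 # e.g. "battery", "resistor", "ground"
    confidence: float = 1.0   # detector confidence from Stage 1

@dataclass
class Graph:
    nodes: dict = field(default_factory=dict)
    edges: list = field(default_factory=list)  # (a, b) pairs, multi-edges allowed

    def degree(self, node_id):
        return sum((a == node_id) + (b == node_id) for a, b in self.edges)

def build_graph(primitives, wires, tau=0.0):
    """Stage 2: keep primitives with confidence >= tau, then wire them up."""
    g = Graph()
    for p in primitives:
        if p.confidence >= tau:
            g.nodes[p.id] = p
    for a, b in wires:
        if a in g.nodes and b in g.nodes:
            g.edges.append((a, b))
    return g

def check_circuit(g):
    """Stage 3: hypothetical rules returning (violation_code, node_id) pairs."""
    violations = []
    # Non-local rule: the whole graph must contain a ground symbol.
    if not any(n.kind == "ground" for n in g.nodes.values()):
        violations.append(("missing_ground", None))
    # Local rule: every non-ground component needs two connections.
    for n in g.nodes.values():
        if n.kind != "ground" and g.degree(n.id) < 2:
            violations.append(("dangling_component", n.id))
    return violations

TEMPLATES = {
    "missing_ground": "Your circuit has no ground reference. Add a ground symbol.",
    "dangling_component": "Component {node} is connected at only one end.",
}

def feedback(violations):
    """Stage 4 (template fallback): verbalize only verified violations."""
    if not violations:
        return ["No rule violations were detected."]
    return [TEMPLATES[code].format(node=node) for code, node in violations]

primitives = [Node("b1", "battery", 0.90), Node("r1", "resistor", 0.95),
              Node("r2", "resistor", 0.40)]  # r2: low-confidence detection
wires = [("b1", "r1"), ("r1", "b1"), ("r2", "b1")]

print(feedback(check_circuit(build_graph(primitives, wires))))
print(feedback(check_circuit(build_graph(primitives, wires, tau=0.7))))
```

In this toy case, raising `tau` to 0.7 drops the low-confidence resistor and the violation it triggered, while the missing-ground violation survives. Because the text generator only sees the output of `check_circuit`, any spurious feedback can be traced back to the perception stage.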

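Experimental Results

Research Questions

  • RQ1: Can a grammar-in-the-loop pipeline deliver rubric-aligned feedback on student diagrams that is grounded in verifiable observations?
  • RQ2: How does modular perception plus reasoning compare with end-to-end large models at detecting diagram errors and producing actionable feedback?
  • RQ3: Where do the perception or reasoning stages fail, and can error attribution inform future improvements?
  • RQ4: How does the proposed method perform on free-body diagrams and circuit schematics?
  • RQ5: What are the trade-offs among detection accuracy, feedback quality, hallucination, calibration, and potential latency?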

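Key Findings

  • The end-to-end large model outperforms the grammar pipeline on FBD error detection (micro-F1 0.471 vs 0.263) and gives stronger feedback in the FBD setting.
  • The grammar pipeline outperforms the end-to-end model on circuit diagrams (micro-F1 0.329 vs 0.038) and achieves full actionability (5.0/5).
  • Circuit hallucination in the grammar pipeline is high (0.925), but it stems from perception false positives rather than LLM fabrication, so the error can be attributed precisely to Stage 1.
  • When a violation is detected, template-based generation gives the grammar pipeline perfect actionability on circuit feedback (5.0/5).
  • The vision-only baseline shows very low hallucination but poor detection, underscoring the need for structured reasoning to obtain actionable feedback.
  • Per-type analysis shows complementary strengths: the grammar pipeline excels at structural constraint violations in FBDs and missing ground in circuits, while the end-to-end model is better at detecting omission-type errors (such as missing forces).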
Figure 2: Model complementarity across error types. The grammar pipeline excels at structural constraint violations (wrong direction, missing ground), while the E2E-LMM detects omission-type errors (missing force). Neither model detects missing components or wrong polarity, indicating a shared perc
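
The complementarity described here is what the ensemble oracle mentioned in the abstract exploits: for each sample it keeps whichever model's prediction is better. A minimal sketch follows, assuming "best" means higher per-sample F1 against the ground truth, with ties going to the grammar pipeline; the selection criterion and all names are assumptions. Because the oracle reads the ground truth, it is an upper bound on what a learned selector could achieve, not a deployable system.

```python
def sample_f1(pred, gold):
    """F1 between two label sets for a single sample."""
    if not pred and not gold:
        return 1.0  # correctly reporting "no errors"
    tp = len(pred & gold)
    return 2 * tp / (len(pred) + len(gold))

def oracle_select(grammar_preds, e2e_preds, golds):
    """Per sample, keep the prediction with the higher F1 (ties: grammar)."""
    chosen = []
    for gp, ep, gold in zip(grammar_preds, e2e_preds, golds):
        chosen.append(gp if sample_f1(gp, gold) >= sample_f1(ep, gold) else ep)
    return chosen

# Toy data (hypothetical): the grammar pipeline catches structural errors,
# the end-to-end model catches the omission.
golds   = [{"wrong_direction"}, {"missing_force"}, {"missing_ground"}]
grammar = [{"wrong_direction"}, set(),             {"missing_ground"}]
e2e     = [set(),               {"missing_force"}, {"missing_force"}]

print(oracle_select(grammar, e2e, golds))
```

On this toy data each model is wrong on at least one sample, yet the oracle recovers every ground-truth label set.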


This review was generated by AI and reviewed by human editors.