QUICK REVIEW

[Paper Review] Sketch2Feedback: Grammar-in-the-Loop Framework for Rubric-Aligned Feedback on Student STEM Diagrams

Aayam Bansal|arXiv (Cornell University)|2026. 02. 19.

Multimodal Machine Learning Applications | Citations: 0

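One-Line Summary

Sketch2Feedback presents a four-stage grammar-in-the-loop pipeline that combines hybrid perception, symbolic graph reasoning, constraint checking, and constrained VLM feedback to give rubric-aligned feedback on student diagrams, with mixed results on the FBD and circuit datasets.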

ABSTRACT

Providing timely, rubric-aligned feedback on student-drawn diagrams is a persistent challenge in STEM education. While large multimodal models (LMMs) can jointly parse images and generate explanations, their tendency to hallucinate undermines trust in classroom deployments. We present Sketch2Feedback, a grammar-in-the-loop framework that decomposes the problem into four stages -- hybrid perception, symbolic graph construction, constraint checking, and constrained VLM feedback -- so that the language model verbalizes only violations verified by an upstream rule engine. We evaluate on two synthetic micro-benchmarks, FBD-10 (free-body diagrams) and Circuit-10 (circuit schematics), each with 500 images spanning standard and hard noise augmentation tiers, comparing our pipeline against end-to-end LMMs (LLaVA-1.5-7B, Qwen2-VL-7B), a vision-only detector, a YOLOv8-nano learned detector, and an ensemble oracle. On n=100 test samples per benchmark with 95% bootstrap CIs, results are mixed and instructive: Qwen2-VL-7B achieves the highest micro-F1 on both FBDs (0.570) and circuits (0.528), but with extreme hallucination rates (0.78, 0.98). An ensemble oracle that selects the best prediction per sample reaches F1=0.556 with hallucination 0.320 on FBDs, demonstrating exploitable complementarity between grammar and end-to-end approaches. Confidence thresholding at tau=0.7 reduces circuit hallucination from 0.970 to 0.880 with no F1 loss. Hard noise augmentation reveals domain-dependent robustness: FBD detection is resilient while circuit detection degrades sharply. An LLM-as-judge evaluation confirms that the grammar pipeline produces more actionable circuit feedback (4.85/5) than the end-to-end LMM (3.11/5). We release all code, datasets, and evaluation scripts.

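Motivation and Objectives

Provide timely, rubric-aligned feedback on student-drawn STEM diagrams, a task on which end-to-end LMMs tend to hallucinate.
Separate perception from reasoning to make the feedback more trustworthy and actionable.
Evaluate the four-stage pipeline on the FBD-10 and Circuit-10 benchmarks, which carry ground-truth error labels.
Analyze where perception, reasoning, and generation succeed or fail, and attribute errors transparently.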

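Proposed Method

Stage 1 detects primitives with hybrid CV detection (CLAHE, adaptive thresholding, contour analysis, HoughLinesP).
Stage 2 builds a typed symbolic graph $G=(V,E)$ from the detected primitives.
Stage 3 runs domain-specific local and non-local constraint checks, keyed by scenario.
Stage 4 passes only the verified violations to a constrained VLM (Qwen2-VL-2B) to verbalize as feedback, falling back to templates when needed.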
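
Stages 2 through 4 can be illustrated with a small, standard-library-only sketch. It is not the authors' implementation: the `Node` and `Graph` classes, the two rules (`check_missing_ground`, `check_dangling_components`), the template strings, and the `vlm` callable are all hypothetical names introduced here. The point it shows is the grammar-in-the-loop contract: the text generator only ever sees violations that a rule has already verified, and a template is used when no VLM output is available.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Node:
    """A detected primitive (hypothetical schema, not from the paper)."""
    node_id: str
    kind: str  # e.g. "battery", "resistor", "ground", "junction"


@dataclass
class Graph:
    """Typed symbolic graph G=(V,E) built from Stage 1 detections."""
    nodes: dict = field(default_factory=dict)  # node_id -> Node
    edges: set = field(default_factory=set)    # frozenset({id_a, id_b})

    def add_node(self, node):
        self.nodes[node.node_id] = node

    def add_edge(self, a, b):
        self.edges.add(frozenset((a, b)))

    def degree(self, node_id):
        return sum(1 for e in self.edges if node_id in e)


def check_missing_ground(g):
    """Non-local rule: the whole graph must contain a ground node."""
    if not any(n.kind == "ground" for n in g.nodes.values()):
        return [("missing_ground", None)]
    return []


def check_dangling_components(g):
    """Local rule (assumed for illustration): two-terminal parts need 2 wires."""
    return [
        ("dangling_component", n.node_id)
        for n in g.nodes.values()
        if n.kind not in ("ground", "junction") and g.degree(n.node_id) < 2
    ]


RULES = [check_missing_ground, check_dangling_components]

# Fallback templates, one per violation code (wording is illustrative).
TEMPLATES = {
    "missing_ground": "The circuit has no ground reference. Add a ground symbol.",
    "dangling_component": "Component {target} is not connected at both terminals.",
}


def run_checks(g):
    """Stage 3: collect every verified violation as (code, target)."""
    violations = []
    for rule in RULES:
        violations.extend(rule(g))
    return violations


def verbalize(violations, vlm=None):
    """Stage 4: verbalize verified violations only; fall back to templates."""
    messages = []
    for code, target in violations:
        text = vlm(code, target) if vlm is not None else None
        if not text:  # no VLM, or the VLM returned nothing usable
            text = TEMPLATES[code].format(target=target)
        messages.append(text)
    return messages


# A two-component circuit with one wire and no ground.
g = Graph()
g.add_node(Node("B1", "battery"))
g.add_node(Node("R1", "resistor"))
g.add_edge("B1", "R1")

violations = run_checks(g)
feedback = verbalize(violations)
```

Because `verbalize` receives only the output of `run_checks`, any incorrect feedback in this sketch has to originate upstream, in the detections or the rules, and that separation is what makes stage-level error attribution possible.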

Figure 1 : Sketch2Feedback pipeline overview. Stage 1 : Hybrid CV perception detects primitives (arrows, wires, components, junctions) via CLAHE preprocessing, adaptive thresholding, contour analysis, and HoughLinesP. Stage 2 : Detected primitives form a typed symbolic graph $G=(V,E)$ with spatial p

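Experimental Results

Research Questions

RQ1: Can a grammar-in-the-loop pipeline give rubric-aligned feedback on student diagrams that is grounded in verifiable observations?
RQ2: How much better is modular perception plus reasoning than an end-to-end LMM at detecting diagram errors and producing actionable feedback?
RQ3: Where do the perception and reasoning stages fail, and what does error attribution reveal about future improvements?
RQ4: How does the proposed method perform on free-body diagrams (FBDs) and circuit schematics?
RQ5: What are the trade-offs among detection accuracy, feedback quality, hallucination, calibration, and latency?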

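Key Results

The end-to-end LMM outperforms the grammar pipeline at FBD error detection (micro-F1 0.471 vs. 0.263) and gives stronger feedback in the FBD setting.
The grammar pipeline outperforms the end-to-end model on circuit schematics (micro-F1 0.329 vs. 0.038) and achieves perfect actionability (5.0/5).
The grammar pipeline shows a high circuit hallucination rate (0.925) caused by perception false positives, which lets the failure be attributed precisely to Stage 1.
When a violation is detected, template-based generation yields perfect actionability for circuit feedback (5.0/5).
The vision-only baseline has very low hallucination but poor detection performance, underscoring the need for structural reasoning to produce actionable feedback.
Per-error-type analysis reveals complementary strengths: the grammar pipeline is strong on structural constraint violations in FBDs and on missing ground in circuits, while the end-to-end model is better at omission-type errors (e.g., missing force).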
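
For readers unfamiliar with the metric, the sketch below shows one common way to compute micro-F1 over per-image sets of error labels, together with a percentile bootstrap confidence interval of the kind the abstract mentions. It is an illustration under assumptions, not the paper's evaluation script: the function names, the label strings, and the toy predictions are made up here, and the paper's exact resampling procedure is not stated in this review.

```python
import random


def micro_f1(pred_sets, gold_sets):
    """Pool TP/FP/FN over all samples, then compute a single F1."""
    tp = fp = fn = 0
    for pred, gold in zip(pred_sets, gold_sets):
        tp += len(pred & gold)   # predicted and actually present
        fp += len(pred - gold)   # predicted but not present
        fn += len(gold - pred)   # present but not predicted
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def bootstrap_ci(pred_sets, gold_sets, n_resamples=1000, alpha=0.05, seed=0):
    """Percentile bootstrap CI: resample images with replacement."""
    rng = random.Random(seed)
    n = len(pred_sets)
    scores = []
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]
        scores.append(
            micro_f1([pred_sets[i] for i in idx], [gold_sets[i] for i in idx])
        )
    scores.sort()
    lower = scores[int((alpha / 2) * n_resamples)]
    upper = scores[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    return lower, upper


# Toy data (made up): one hit, one false alarm, one miss.
pred = [{"missing_ground"}, {"wrong_polarity"}, set()]
gold = [{"missing_ground"}, set(), {"missing_component"}]

score = micro_f1(pred, gold)          # 2*1 / (2*1 + 1 + 1) = 0.5
ci_low, ci_high = bootstrap_ci(pred, gold)
```

Micro-averaging pools counts across all images before computing F1, so frequent error types weigh more than rare ones. A macro average over error types would answer a different question.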

Figure 2 : Model complementarity across error types. The grammar pipeline excels at structural constraint violations (wrong direction, missing ground), while the E2E-LMM detects omission-type errors (missing force). Neither model detects missing components or wrong polarity, indicating a shared perc


This review was generated by AI and reviewed by a human editor.