QUICK REVIEW

[Paper Review] TorR: Towards Brain-Inspired Task-Oriented Reasoning via Cache-Oriented Algorithm-Architecture Co-design

Hyunwoo Oh, SungHeon Jeong|arXiv (Cornell University)|2026. 03. 24.

Advanced Neural Network Applications | Citations: 0

One-Line Summary

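TorR proposes a brain-inspired, cache-oriented algorithm-architecture co-design that replaces dense CLIP-style alignment with a hyperdimensional associative reasoner and query caching, enabling real-time, energy-efficient task-oriented detection at the edge. It reaches 30/60 FPS at millijoule-scale energy per frame while maintaining competitive AP@0.5 across five tasks.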

ABSTRACT

Task-oriented object detection (TOOD) atop CLIP offers open-vocabulary, prompt-driven semantics, yet dense per-window computation and heavy memory traffic hinder real-time, power-limited edge deployment. We present \emph{TorR}, a brain-inspired \textbf{algorithm--architecture co-design} that \textbf{replaces CLIP-style dense alignment with a hyperdimensional (HDC) associative reasoner} and turns temporal coherence into reuse. On the \emph{algorithm} side, TorR reformulates alignment as HDC similarity and graph composition, introducing \emph{partial-similarity reuse} via (i) query caching with per-class score accumulation, (ii) exact $\delta$-updates when only a small set of hypervector bits change, and (iii) similarity/load-gated bypass under high system load. On the \emph{architecture} side, TorR instantiates a lane-scalable, bit-sliced item memory with bank/precision gating and a lightweight controller that schedules bypass/$\delta$/full paths to meet RT-30/RT-60 targets as object counts vary. Synthesized in a TSMC 28\,nm process and exercised with a cycle-accurate simulator, TorR sustains real-time throughput with millijoule-scale energy per window ($\approx$50\,mJ at 60\,FPS; $\approx$113\,mJ at 30\,FPS) and low latency jitter, while delivering competitive AP@0.5 across five task prompts (mean 44.27\%) within a bounded margin to strong VLM baselines, but at orders-of-magnitude lower energy. The design exposes deployment-time configurability (effective dimension $D'$, thresholds, precision) to trade accuracy, latency, and energy for edge budgets.

Research Motivation and Goals

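Reconcile edge task-oriented detection with open-vocabulary (natural-language) semantics under strict power and latency budgets.
Replace dense CLIP-style alignment with a brain-inspired hyperdimensional associative reasoner.
Introduce cache-based partial-similarity reuse to exploit temporal coherence across frames.
Develop a hardware-software co-design comprising an event-driven encoder, a bit-sliced memory, and a lightweight controller to meet RT-30/RT-60.
Demonstrate real-time performance while staying energy efficient and preserving competitive task accuracy.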

Proposed Method

An event-driven SNN encoder produces a query hypervector q from DVS events.
Hyperdimensional computing (HDC) associates q with a bank of concept hypervectors h_j via cosine similarity.
Query caching plus partial-similarity (delta) updates reuse prior results when scene changes are small.
An HDC graph reasoner applies task-specific weights to the aligner scores to produce final per-item scores.
An FPS/QoS controller throttles work through bank/precision gating to meet 30/60 FPS under dynamic loads.
A hardware accelerator implements a cache-gated similarity kernel with delta/full paths and a lightweight controller.
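
The caching and path-selection steps above can be sketched in software. The following is a minimal illustration, not the authors' implementation: it assumes bipolar (+1/-1) hypervectors, a single-entry query cache, and hypothetical names (`CachedAligner`, `delta_budget`, `bypass_budget`) that do not appear in the paper. It omits the SNN encoder, per-class score accumulation, the graph reasoner, and bank/precision gating.

```python
import random


def dot(a, b):
    """Dot product of two equal-length bipolar (+1/-1) vectors."""
    return sum(x * y for x, y in zip(a, b))


class CachedAligner:
    """Toy HDC aligner with a one-entry query cache (hypothetical names)."""

    def __init__(self, item_memory, delta_budget, bypass_budget=0):
        self.H = item_memory                # M concept hypervectors of length D
        self.D = len(item_memory[0])
        self.delta_budget = delta_budget    # max changed positions for the delta path
        self.bypass_budget = bypass_budget  # max changed positions ignored under load
        self.q_prev = None                  # cached query
        self.dots = None                    # cached dot products against q_prev

    def scores(self, q, high_load=False):
        """Return (cosine scores, name of the path taken)."""
        if self.q_prev is None:
            return self._full(q)
        changed = [i for i, (a, b) in enumerate(zip(q, self.q_prev)) if a != b]
        if not changed or (high_load and len(changed) <= self.bypass_budget):
            # Bypass: reuse cached scores; the cache itself is left untouched.
            return self._cosines(), "bypass"
        if len(changed) > self.delta_budget:
            return self._full(q)
        for j, h in enumerate(self.H):
            # Exact correction: only the changed positions alter the dot product.
            self.dots[j] += sum(h[i] * (q[i] - self.q_prev[i]) for i in changed)
        self.q_prev = list(q)
        return self._cosines(), "delta"

    def _full(self, q):
        self.dots = [dot(h, q) for h in self.H]
        self.q_prev = list(q)
        return self._cosines(), "full"

    def _cosines(self):
        # For bipolar vectors every norm is sqrt(D), so cosine = dot / D.
        return [d / self.D for d in self.dots]


random.seed(0)
M, D = 8, 256
H = [[random.choice((-1, 1)) for _ in range(D)] for _ in range(M)]
q0 = [random.choice((-1, 1)) for _ in range(D)]

aligner = CachedAligner(H, delta_budget=16)
s0, path0 = aligner.scores(q0)   # first query: nothing cached, full path
q1 = list(q0)
for i in range(5):               # flip five positions to mimic a small scene change
    q1[i] = -q1[i]
s1, path1 = aligner.scores(q1)   # five changes fit the budget, delta path
```

In this sketch the delta path touches only the changed positions of each concept hypervector, and its scores match a full recomputation. The load-gated bypass, by contrast, returns stale scores, so it trades accuracy for latency.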

Experimental Results

Research Questions

RQ1: Can temporal reuse and cache-guided partial updates reduce data movement and energy in task-oriented detection at the edge?
RQ2: How does replacing dense CLIP-style alignment with an HDC-based associative reasoner impact accuracy and latency under real-time constraints?
RQ3: What deployment-time knobs (dimension D', delta budget, precision, thresholds) optimize accuracy, latency, and energy for varying scenes?
RQ4: Is a lane-scalable, memory-bound architecture able to sustain RT-30/RT-60 with millijoule-scale energy across multiple prompts?
RQ5: How does the proposed co-design compare to strong VLM baselines in AP@0.5 under edge budgets?
RQ6: What is the sensitivity of performance to scene dynamics (coherence vs. motion) and resource gating?

Key Results

TorR sustains 30/60 FPS with millijoule-scale energy per window (≈50 mJ at 60 FPS; ≈113 mJ at 30 FPS).
Mean AP@0.5 across five tasks is 44.27%, within a bounded margin of strong VLM baselines, at orders-of-magnitude lower energy.
Partial-similarity reuse reduces work from O(MD') to O(M|Δ|) and lowers memory traffic.
Aggressive reuse and cache-guided bypass provide predictable latency with low jitter under dynamic loads.
Hardware synthesis (28 nm) shows that the associative aligner dominates area and power; total runtime power peaks at about 4.66 W, and gating reduces the average.
RT targets are met across tasks, with p95 latency well within budgets and energy per frame scaling with scene reuse and motion.
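
The reduction from O(MD') to O(M|Δ|) reported above can be made concrete with a short derivation. It is a sketch under assumptions that the review does not state: hypervectors are bipolar, $q, h_j \in \{-1,+1\}^{D'}$; $M$ is taken to be the number of concept hypervectors; and $\Delta$ is the set of positions where the query changed between steps $t-1$ and $t$.

```latex
\begin{align*}
s_j^{(t)} &= \langle h_j, q^{(t)} \rangle
           = \sum_{i=1}^{D'} h_{j,i}\, q_i^{(t)} \\
          &= s_j^{(t-1)} + \sum_{i \in \Delta} h_{j,i}\,\bigl(q_i^{(t)} - q_i^{(t-1)}\bigr) \\
          &= s_j^{(t-1)} + 2 \sum_{i \in \Delta} h_{j,i}\, q_i^{(t)}
          \qquad \text{since } q_i^{(t)} = -q_i^{(t-1)} \text{ for } i \in \Delta, \\
\cos\bigl(h_j, q^{(t)}\bigr) &= \frac{s_j^{(t)}}{D'} .
\end{align*}
```

Each of the $M$ cached scores then needs $|\Delta|$ multiply-adds instead of $D'$, which gives O(M|Δ|) versus O(MD'). The update is exact rather than approximate, because unchanged positions contribute the same terms to both $s_j^{(t-1)}$ and $s_j^{(t)}$.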

This review was generated by AI and reviewed by a human editor.