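[Paper Review] EventFlash: Towards Efficient MLLMs for Event-Based Vision
EventFlash introduces spatiotemporal token sparsification for event-based vision, achieving higher throughput than its baseline while maintaining comparable accuracy, and enabling long-range event stream processing of up to 1,000 bins.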
Event-based multimodal large language models (MLLMs) enable robust perception in high-speed and low-light scenarios, addressing key limitations of frame-based MLLMs. However, current event-based MLLMs often rely on dense image-like processing paradigms, overlooking the spatiotemporal sparsity of event streams and resulting in high computational cost. In this paper, we propose EventFlash, a novel and efficient MLLM to explore spatiotemporal token sparsification for reducing data redundancy and accelerating inference. Technically, we build EventMind, a large-scale and scene-diverse dataset with over 500k instruction sets, providing both short and long event stream sequences to support our curriculum training strategy. We then present an adaptive temporal window aggregation module for efficient temporal sampling, which adaptively compresses temporal tokens while retaining key temporal cues. Finally, a sparse density-guided attention module is designed to improve spatial token efficiency by selecting informative regions and suppressing empty or sparse areas. Experimental results show that EventFlash achieves a $12.4\times$ throughput improvement over the baseline (EventFlash-Zero) while maintaining comparable performance. It supports long-range event stream processing with up to 1,000 bins, significantly outperforming the 5-bin limit of EventGPT. We believe EventFlash serves as an efficient foundation model for event-based vision.
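Research Motivation and Goals
- Address the computational inefficiency of dense, image-like processing of sparse event streams in MLLMs.
- Develop a spatiotemporal token sparsification framework that reduces redundancy in event data and speeds up inference.
- Build EventMind, a large-scale and diverse dataset, to support curriculum learning for long-range event understanding.
- Propose adaptive temporal sampling and density-guided spatial attention to preserve temporal cues and important regions.
- Demonstrate that the proposed method achieves substantial throughput gains while maintaining comparable performance.
Proposed Method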
- Adaptive temporal window aggregation (ATWA) to compress temporal tokens while preserving key motion cues.
- Sparse density-guided attention (SDGA) to select informative spatial regions and suppress low-density areas.
- Event encoder (e.g., CLIP-ViT) plus an event-language projector to align event tokens with text tokens.
- Fusion of compact event tokens with text tokens through an LLM decoder (e.g., Qwen-2.5).
- Two-stage density-aware merging with semantic similarity and event-density weighting for temporal sparsification.
- Short-to-long curriculum learning, which trains on short event streams before long ones, to enhance generalization for long-range understanding.
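
The two sparsification ideas in the list above can be illustrated with a minimal NumPy sketch. This is not the authors' ATWA or SDGA implementation: the function names, the top-k selection rule, the greedy single-pass merge, and the `keep_ratio` and `sim_threshold` parameters are all assumptions made for illustration. The sketch only shows the general pattern of keeping event-dense patches and merging similar consecutive temporal tokens with density weighting.

```python
import numpy as np


def select_dense_patches(event_counts, keep_ratio=0.25):
    """Keep the patches with the most events (hypothetical top-k rule).

    event_counts: array of per-patch event counts, any shape.
    Returns the sorted flat indices of the retained patches.
    """
    flat = np.asarray(event_counts).ravel()
    k = max(1, int(round(keep_ratio * flat.size)))
    # Stable descending sort so ties keep their original order.
    order = np.argsort(-flat, kind="stable")
    return np.sort(order[:k])


def merge_similar_bins(tokens, density, sim_threshold=0.9):
    """Greedily merge consecutive temporal tokens that look alike.

    tokens:  (T, D) array, one feature vector per temporal bin.
    density: (T,) array, number of events in each bin.
    A token is folded into the previous group when their cosine
    similarity reaches sim_threshold; the merged vector is the
    density-weighted average of the group (an assumed weighting).
    """
    tokens = np.asarray(tokens, dtype=float)
    density = np.asarray(density, dtype=float)
    eps = 1e-8
    merged_tokens = [tokens[0]]
    merged_density = [density[0]]
    for tok, den in zip(tokens[1:], density[1:]):
        prev = merged_tokens[-1]
        cos = prev @ tok / (np.linalg.norm(prev) * np.linalg.norm(tok) + eps)
        if cos >= sim_threshold:
            total = merged_density[-1] + den
            merged_tokens[-1] = (merged_density[-1] * prev + den * tok) / (total + eps)
            merged_density[-1] = total
        else:
            merged_tokens.append(tok)
            merged_density.append(den)
    return np.stack(merged_tokens), np.array(merged_density)


if __name__ == "__main__":
    counts = np.array([[0, 5], [1, 0]])
    print(select_dense_patches(counts, keep_ratio=0.5))

    toks = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
    dens = np.array([1.0, 3.0, 2.0])
    out_tokens, out_density = merge_similar_bins(toks, dens)
    print(out_tokens.shape, out_density)
```

In this toy setup the empty patches are dropped before any attention would run, and the two identical temporal tokens collapse into one, which is the kind of token reduction the throughput gains are attributed to. The paper's two-stage merging and attention mechanism are more involved than this single pass.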

Experimental Results
Research Questions
- RQ1: How can spatiotemporal sparsification of event streams reduce redundancy and improve efficiency in event-based MLLMs?
- RQ2: Can adaptive temporal sampling and density-guided spatial attention preserve essential temporal and spatial cues while achieving high throughput?
- RQ3: Does a curriculum that progresses from short to long event streams improve generalization and reasoning in multimodal event-based models?
- RQ4: What is the performance-efficiency trade-off of EventFlash across long-range event sequences of up to 1,000 bins?
- RQ5: How does EventFlash compare to existing event-based and video-based MLLMs on diverse event-stream tasks?
Key Results
- EventFlash achieves 12.4x higher throughput than the baseline (EventFlash-Zero) with comparable task performance.
- The model supports long-range event streams of up to 1,000 bins, versus the 5-bin limit of EventGPT.
- Throughput reaches 28.5 tokens/s (3B/7B variants shown) with strong results across GDC, FGQA, HAQA, and MCQA.
- Temporal and spatial sparsification both contribute to efficiency gains, with combined sparsification yielding the largest speedup.
- The large-scale EventMind dataset (over 500k instruction sets) supports curriculum learning across short, medium, and long event sequences.
- Open-ended evaluation demonstrates EventFlash’s robustness in high-speed and low-light scenarios.
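
To make the "bins" in the long-range result above concrete, here is a minimal sketch that assumes a bin is an equal-width temporal slice of the event stream. The review does not state how EventFlash or EventGPT actually partition events, so the function name `bin_events` and the equal-width rule are illustrative assumptions only.

```python
import numpy as np


def bin_events(timestamps, num_bins):
    """Assign events to equal-width temporal bins (assumed scheme).

    timestamps: 1-D sequence of event times in any consistent unit.
    Returns (bin index per event, event count per bin).
    """
    t = np.asarray(timestamps, dtype=float)
    t0 = t.min()
    span = max(t.max() - t0, 1e-12)  # guard against a zero-length stream
    idx = ((t - t0) / span * num_bins).astype(int)
    # The latest event would land at index num_bins, so clamp it.
    idx = np.minimum(idx, num_bins - 1)
    return idx, np.bincount(idx, minlength=num_bins)


if __name__ == "__main__":
    idx, counts = bin_events([0, 1, 2, 9, 10], num_bins=5)
    print(idx, counts)
```

Under this assumption, the per-bin counts also provide the kind of density signal that sparsification can use: bins with zero events carry no information and are natural candidates for merging or removal.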

This review was generated by AI and reviewed by a human editor.