QUICK REVIEW

[Paper Review] Memory Is All You Need: An Overview of Compute-in-Memory Architectures for Accelerating Large Language Model Inference

Christopher A. Wolters, Xiaoxuan Yang|arXiv (Cornell University)|2024. 06. 12.

Topic Modeling | Citations: 9

One-Line Summary

This paper surveys compute-in-memory (CIM) architectures for accelerating large language model inference and analyzes transformer workloads, memory bottlenecks, and the challenges of hardware-software co-design.

ABSTRACT

Large language models (LLMs) have recently transformed natural language processing, enabling machines to generate human-like text and engage in meaningful conversations. This development necessitates speed, efficiency, and accessibility in LLM inference as the computational and memory requirements of these systems grow exponentially. Meanwhile, advancements in computing and memory capabilities are lagging behind, exacerbated by the discontinuation of Moore's law. With LLMs exceeding the capacity of single GPUs, they require complex, expert-level configurations for parallel processing. Memory accesses become significantly more expensive than computation, posing a challenge for efficient scaling, known as the memory wall. Here, compute-in-memory (CIM) technologies offer a promising solution for accelerating AI inference by directly performing analog computations in memory, potentially reducing latency and power consumption. By closely integrating memory and compute elements, CIM eliminates the von Neumann bottleneck, reducing data movement and improving energy efficiency. This survey paper provides an overview and analysis of transformer-based models, reviewing various CIM architectures and exploring how they can address the imminent challenges of modern AI computing systems. We discuss transformer-related operators and their hardware acceleration schemes and highlight challenges, trends, and insights in corresponding CIM designs.

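Research Motivation and Objectives

Highlights the memory-wall problem in LLM inference and its impact on latency and energy.
Reviews transformer-based models and the core compute kernels suited to CIM acceleration.
Analyzes how well CIM technologies (CMOS and emerging NVM) fit transformer workloads.
Identifies design, reliability, and system-level challenges for CIM in LLM inference and suggests directions for future research.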
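The memory-wall point can be made concrete with a back-of-the-envelope calculation. The sketch below estimates the arithmetic intensity (FLOPs per byte moved) of a single matrix-vector product, the operation that dominates token-by-token decoding. The 4096 x 4096 layer size and the byte widths are illustrative assumptions, not figures from the paper.

```python
def mvm_arithmetic_intensity(rows: int, cols: int,
                             bytes_per_weight: float,
                             bytes_per_act: float = 2.0) -> float:
    """FLOPs per byte moved for y = W @ x, with W of shape (rows, cols)."""
    # One multiply and one add per weight element.
    flops = 2 * rows * cols
    # Every weight is read once; the input and output vectors are small by comparison.
    bytes_moved = rows * cols * bytes_per_weight + (rows + cols) * bytes_per_act
    return flops / bytes_moved


# Hypothetical layer size, chosen only for illustration.
intensity_fp16 = mvm_arithmetic_intensity(4096, 4096, bytes_per_weight=2.0)
intensity_int8 = mvm_arithmetic_intensity(4096, 4096, bytes_per_weight=1.0)
print(f"16-bit weights: {intensity_fp16:.3f} FLOP/byte")
print(f" 8-bit weights: {intensity_int8:.3f} FLOP/byte")
```

With roughly one to two FLOPs per byte fetched, the runtime of such a kernel is set by how fast weights can be moved rather than by how fast they can be multiplied, which is the situation CIM aims to avoid by leaving the weights in place.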

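Proposed Method

Describes the transformer architecture, its core operations (MVM, attention), and their implications for hardware acceleration.
Explains how CIM arrays operate and how analog MAC performs matrix-vector multiplication using memory conductances and Kirchhoff's law.
Compares memory technologies (SRAM, ReRAM, PCM, FeFET, MRAM) and their trade-offs for CIM.
Covers analog non-idealities, peripheral overhead (ADCs), precision limits, and endurance in CIM design.
Evaluates hardware-software co-design considerations for mapping LLM inference onto CIM hardware.
Presents design guidelines for balancing precision, latency, and energy, along with possible paths toward future CIM-based LLM accelerators.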
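The conductance-based matrix-vector multiplication mentioned above can be sketched numerically. The example below is a minimal, idealized model: weights are stored as conductances, inputs are applied as row voltages, Ohm's law gives each cell current, and Kirchhoff's current law sums the currents on each column. The differential (positive/negative) conductance pair, the function name `crossbar_mvm`, and the values of `g_max` and `v_read` are assumptions made for illustration and are not taken from the paper.

```python
import numpy as np


def crossbar_mvm(weights, x, g_max=1e-4, v_read=0.2):
    """Idealized analog MVM: returns an estimate of weights @ x.

    weights: (n_out, n_in) signed matrix; x: (n_in,) signed vector.
    g_max (siemens) and v_read (volts) are illustrative device limits.
    """
    w_scale = np.max(np.abs(weights))
    x_scale = np.max(np.abs(x))
    # Conductances cannot be negative, so signed weights are split across
    # two arrays; each array has one row per input and one column per output.
    g_pos = np.clip(weights.T, 0, None) / w_scale * g_max
    g_neg = np.clip(-weights.T, 0, None) / w_scale * g_max
    # Inputs become read voltages on the rows.
    v = x / x_scale * v_read
    # Ohm's law: each cell contributes G * V. Kirchhoff's current law:
    # the contributions add up along each column wire.
    i_pos = v @ g_pos
    i_neg = v @ g_neg
    # Subtract the paired columns and undo the scaling.
    return (i_pos - i_neg) * w_scale * x_scale / (g_max * v_read)


rng = np.random.default_rng(0)
W = rng.standard_normal((4, 8))
x = rng.standard_normal(8)
print(np.allclose(crossbar_mvm(W, x), W @ x))
```

In this ideal model the analog result matches the digital product up to floating-point error. Real arrays deviate from it through the non-idealities and converter limits listed above.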

Figure 1: Model size of state-of-the-art LLMs [7]

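Experimental Results

Research Questions

RQ1: How can CIM reduce the data-movement bottleneck in transformer-based LLM inference?
RQ2: Which CIM architectures and memory technologies are best suited to accelerating transformer workloads under realistic constraints?
RQ3: What are the main reliability, precision, and peripheral-overhead problems that CIM faces for LLMs, and how can they be mitigated?
RQ4: How does hardware-software co-design affect the effectiveness of CIM in LLM inference?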

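Key Findings

By performing MAC operations inside the memory, CIM has the potential to reduce data movement and improve latency and energy efficiency.
Emerging non-volatile memories (NVMs) offer high density and low leakage, which makes them attractive for CIM, especially for LLMs with large matrices.
Analog CIM loses accuracy to device non-idealities, drift, read noise, and limited endurance, so mitigation strategies are required.
Peripheral overhead, ADCs in particular, can dominate area and power, which calls for precision-aware and software-aware optimization.
Transformers introduce dynamic-weight operations (queries/keys/values) that complicate crossbar-based CIM workloads and require careful design and partitioning.
Overall system-level gains depend on crossbar size, precision, and co-design choices, which together determine how well accuracy, latency, and energy can be balanced.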
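The interaction between ADC resolution and analog noise noted in the findings can be illustrated with a small simulation. The sketch below quantizes crossbar column outputs with a uniform ADC at several bit widths, first on ideal outputs and then on outputs computed from perturbed weights. The 5% multiplicative Gaussian perturbation, the worst-case full-scale choice, and the helper name `adc_quantize` are assumptions for illustration and are not results or models from the paper.

```python
import numpy as np


def adc_quantize(values, bits, full_scale):
    """Uniform symmetric quantizer standing in for a column ADC."""
    levels = 2 ** (bits - 1) - 1          # largest positive output code
    step = full_scale / levels            # analog width of one code
    codes = np.clip(np.round(values / step), -levels, levels)
    return codes * step


rng = np.random.default_rng(0)
W = rng.standard_normal((16, 64))
x = rng.standard_normal(64)
ideal = W @ x

# Worst-case output magnitude, so the ideal outputs never clip.
full_scale = float(np.abs(W).sum(axis=1).max() * np.abs(x).max())

# Assumed device variation: 5% multiplicative Gaussian error per weight.
noisy_W = W * (1.0 + 0.05 * rng.standard_normal(W.shape))
analog = noisy_W @ x

quant_error = {}
total_error = {}
for bits in (4, 6, 8):
    quant_error[bits] = float(np.max(np.abs(adc_quantize(ideal, bits, full_scale) - ideal)))
    total_error[bits] = float(np.max(np.abs(adc_quantize(analog, bits, full_scale) - ideal)))
    print(f"{bits}-bit ADC: quantization-only error {quant_error[bits]:.3f}, "
          f"with device noise {total_error[bits]:.3f}")
```

In this model the quantization-only error is bounded by half an ADC step, so it shrinks as bits are added, while the error contributed by device variation does not. That is one way to see why ADC resolution, and the area and power that come with it, is best chosen together with the precision the model actually needs.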

Figure 2: The transformer model architecture [4]

This review was generated by AI and reviewed by a human editor.