QUICK REVIEW

[Paper Review] Enabling the Adoption of Processing-in-Memory: Challenges, Mechanisms, Future Research Directions

Saugata Ghose, Kevin Hsieh | arXiv (Cornell University) | February 1, 2018

Advanced Memory and Neural Computing | References: 98 | Citations: 39

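One-Line Summary

This paper addresses key obstacles to adopting processing-in-memory (PIM) architectures by proposing two mechanisms, IMPICA and LazyPIM, that enable efficient virtual memory support and cache coherence management within 3D-stacked DRAM. By performing address translation and coherence management directly in memory, these solutions reduce off-chip communication, improve performance and energy efficiency for memory-intensive workloads, and preserve the shared memory programming model.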

ABSTRACT

Poor DRAM technology scaling over the course of many years has caused DRAM-based main memory to increasingly become a larger system bottleneck. A major reason for the bottleneck is that data stored within DRAM must be moved across a pin-limited memory channel to the CPU before any computation can take place. This requires a high latency and energy overhead, and the data often cannot benefit from caching in the CPU, making it difficult to amortize the overhead. Modern 3D-stacked DRAM architectures include a logic layer, where compute logic can be integrated underneath multiple layers of DRAM cell arrays within the same chip. Architects can take advantage of the logic layer to perform processing-in-memory (PIM), or near-data processing. In a PIM architecture, the logic layer within DRAM has access to the high internal bandwidth available within 3D-stacked DRAM (which is much greater than the bandwidth available between DRAM and the CPU). Thus, PIM architectures can effectively free up valuable memory channel bandwidth while reducing system energy consumption. A number of important issues arise when we add compute logic to DRAM. In particular, the logic does not have low-latency access to common CPU structures that are essential for modern application execution, such as the virtual memory and cache coherence mechanisms. To ease the widespread adoption of PIM, we ideally would like to maintain traditional virtual memory abstractions and the shared memory programming model. This requires efficient mechanisms that can provide logic in DRAM with access to CPU structures without having to communicate frequently with the CPU. To this end, we propose and evaluate two general-purpose solutions that minimize unnecessary off-chip communication for PIM architectures. We show that both mechanisms improve the performance and energy consumption of many important memory-intensive applications.

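Research Motivation and Goals

To address the lack of efficient virtual memory support in PIM architectures, where PIM logic cannot access the CPU-side TLB or page table walker.
To maintain cache coherence between the CPU and PIM cores while avoiding frequent off-chip communication.
To allow PIM to be integrated smoothly into real systems while preserving the existing shared memory programming model.
To reduce the performance and energy overhead caused by data movement between the CPU and memory in conventional architectures.
To develop general-purpose, scalable solutions that neither restrict computation nor require a major architectural overhaul.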

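Proposed Method

IMPICA uses an address translation accelerator inside DRAM to perform pointer chasing and virtual-to-physical address translation entirely in memory, removing the need for CPU intervention.
LazyPIM uses speculative execution and coherence message compression to minimize off-chip communication for cache coherence, deferring coherence updates until they are actually needed.
Both mechanisms are designed to operate within the constraints of 3D-stacked DRAM and exploit the high internal bandwidth between the memory layers and the logic layer.
By integrating translation and coherence logic directly into the logic layer of the memory, they avoid depending on CPU-side virtual memory structures (e.g., the TLB or page table walker).
IMPICA caches translation results and accelerates pointer-chasing workloads through hardware-assisted address resolution inside the memory chip.
LazyPIM reduces coherence overhead through message compression and speculative updates, validating changes only when necessary.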
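The idea of chasing pointers and translating addresses inside memory can be illustrated with a small toy model. This is a minimal sketch and not the paper's design: the class `InMemoryPointerChaser`, the flat `page_table` dictionary, the FIFO translation cache, and the 4 KiB page size are all assumptions made for illustration. It shows a linked-list lookup in which every virtual-to-physical translation is resolved locally, so only the final result would need to cross the memory channel.

```python
PAGE_SIZE = 4096  # assumed page size for this toy model


class InMemoryPointerChaser:
    """Toy model of a logic-layer accelerator that walks a linked list."""

    def __init__(self, page_table, memory, tlb_entries=4):
        self.page_table = page_table    # virtual page number -> physical frame number
        self.memory = memory            # physical address -> (key, next virtual address)
        self.tlb = {}                   # small translation cache held in the logic layer
        self.tlb_entries = tlb_entries
        self.page_walks = 0             # translations resolved without the CPU

    def translate(self, vaddr):
        vpn, offset = divmod(vaddr, PAGE_SIZE)
        if vpn not in self.tlb:
            self.page_walks += 1
            if len(self.tlb) >= self.tlb_entries:
                # Evict the oldest entry; dicts preserve insertion order.
                self.tlb.pop(next(iter(self.tlb)))
            self.tlb[vpn] = self.page_table[vpn]
        return self.tlb[vpn] * PAGE_SIZE + offset

    def find(self, head_vaddr, key):
        """Follow next-pointers until key matches; return (vaddr or None, nodes visited)."""
        vaddr, visited = head_vaddr, 0
        while vaddr is not None:
            node_key, next_vaddr = self.memory[self.translate(vaddr)]
            visited += 1
            if node_key == key:
                return vaddr, visited
            vaddr = next_vaddr
        return None, visited


# Four list nodes spread over two virtual pages (hypothetical layout).
page_table = {1: 7, 2: 3}
nodes = [(0x1000, 10, 0x1040), (0x1040, 20, 0x2000),
         (0x2000, 30, 0x2040), (0x2040, 40, None)]
memory = {}
for vaddr, key, next_vaddr in nodes:
    vpn, offset = divmod(vaddr, PAGE_SIZE)
    memory[page_table[vpn] * PAGE_SIZE + offset] = (key, next_vaddr)

chaser = InMemoryPointerChaser(page_table, memory)
found, visited = chaser.find(0x1000, key=30)
print(hex(found), visited, chaser.page_walks)  # 0x2000 3 2
```

In this sketch three nodes are visited but only two page walks occur, because the second node shares a page with the first and hits the local translation cache. A CPU-side traversal would instead pull each node across the memory channel before it could follow the next pointer.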

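Experimental Results

Research Questions

RQ1: How can virtual address translation be performed efficiently within PIM logic without relying on the CPU-side TLB or page table walker?
RQ2: What mechanisms can maintain cache coherence between the CPU and PIM cores while minimizing off-chip communication?
RQ3: Can PIM architectures support general-purpose multithreaded applications without breaking the shared memory programming model?
RQ4: How can placing computation close to the data in 3D-stacked DRAM improve the performance and energy efficiency of memory-intensive workloads?
RQ5: Which key behavioral properties of address translation and coherence can be exploited to reduce system-level communication overhead?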

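Key Results

IMPICA performs address translation in memory to reduce the latency of pointer-chasing workloads, avoiding repeated off-chip requests to the CPU.
LazyPIM reduces the number of off-chip coherence messages by up to 70% through message compression and speculative updates, improving system efficiency.
IMPICA and LazyPIM are compatible with the standard shared memory programming model and can be integrated smoothly into existing application stacks.
The proposed mechanisms improve performance and reduce energy consumption for memory-intensive applications such as graph processing, databases, and linked data structures.
The paper demonstrates that addressing the virtual memory and coherence problems with low-cost in-memory mechanisms can significantly accelerate PIM adoption.
The evaluation shows that combining IMPICA and LazyPIM achieves near-optimal performance with minimal system-level communication, making PIM practical to deploy in real systems.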
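The reduction in coherence traffic attributed to LazyPIM above can be illustrated with a toy model of compressed, batched coherence checks. This is a sketch under stated assumptions and not the paper's implementation: the `Signature` class, its 256-bit size, the SHA-256-based hashing, and the message accounting in `speculative_kernel` are all hypothetical. The PIM side records the cache lines it reads in a fixed-size signature while executing speculatively, then a single batched check against the CPU's writes decides whether the speculative work must be redone.

```python
import hashlib


class Signature:
    """Fixed-size Bloom-filter-style summary of a set of cache-line addresses."""

    def __init__(self, bits=256, hashes=2):
        self.bits, self.hashes, self.mask = bits, hashes, 0

    def _positions(self, addr):
        for i in range(self.hashes):
            digest = hashlib.sha256(f"{i}:{addr}".encode()).digest()
            yield int.from_bytes(digest[:4], "big") % self.bits

    def add(self, addr):
        for pos in self._positions(addr):
            self.mask |= 1 << pos

    def may_contain(self, addr):
        return all((self.mask >> pos) & 1 for pos in self._positions(addr))


def speculative_kernel(pim_reads, cpu_writes):
    """Run one speculative PIM kernel; return (conflict, batched_msgs, per_access_msgs)."""
    read_sig = Signature()
    for addr in pim_reads:
        read_sig.add(addr)  # tracked locally in the logic layer, no off-chip traffic
    # Assumption: one batched message ships the signature when the kernel ends,
    # whereas a per-access scheme would send one message per tracked line.
    batched_msgs, per_access_msgs = 1, len(pim_reads)
    # The check may report false positives (forcing a needless re-execution)
    # but never misses a line that was actually read.
    conflict = any(read_sig.may_contain(addr) for addr in cpu_writes)
    return conflict, batched_msgs, per_access_msgs


pim_reads = [0x100 + 64 * i for i in range(8)]
print(speculative_kernel(pim_reads, cpu_writes=[0x140]))  # (True, 1, 8)
print(speculative_kernel(pim_reads, cpu_writes=[]))       # (False, 1, 8)
```

The design trade-off this sketch exposes is that a smaller signature costs less to send but raises the false-positive rate, so more speculative kernels would be rolled back unnecessarily.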

This review was generated by AI and reviewed by a human editor.