QUICK REVIEW

[Paper Review] qHiPSTER: The Quantum High Performance Software Testing Environment

Mikhail Smelyanskiy, Nicolas P. D. Sawaya|arXiv (Cornell University)|Jan 26, 2016

Quantum Computing Algorithms and Architecture | References: 30 | Citations: 115

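One-line summary

qHiPSTER is a distributed high-performance quantum circuit simulator for classical HPC systems. On 1000 nodes of Stampede it simulates circuits of up to 40 qubits built from general single-qubit gates and two-qubit controlled gates, and the paper analyzes its memory-bound and network-bound performance, its optimizations, and its scalability.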

ABSTRACT

We present qHiPSTER, the Quantum High Performance Software Testing Environment. qHiPSTER is a distributed high-performance implementation of a quantum simulator on a classical computer, that can simulate general single-qubit gates and two-qubit controlled gates. We perform a number of single- and multi-node optimizations, including vectorization, multi-threading, cache blocking, as well as overlapping computation with communication. Using the TACC Stampede supercomputer, we simulate quantum circuits ("quantum software") of up to 40 qubits. We carry out a detailed performance analysis to show that our simulator achieves both high performance and high hardware efficiency, limited only by the sustainable memory and network bandwidth of the machine.

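Motivation and objectives

Motivate high-accuracy quantum circuit simulation on HPC systems as a way to study algorithm performance, errors, and the effects of noise.
Develop a distributed simulator that handles general single-qubit gates and two-qubit controlled gates for up to about 40 qubits.
Evaluate performance and hardware efficiency on large-scale HPC hardware, focusing on memory bandwidth, network bandwidth, and the overlap of communication with computation.
Investigate architectural and algorithmic optimizations that push the limits of distributed quantum state simulation.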

Proposed method

Implement a distributed state-vector simulator that partitions the 2^n amplitudes across 2^p nodes, so that each node holds 2^m local amplitudes, where m = n - p.
Apply general single-qubit gates by updating amplitude pairs with a 2x2 unitary Q directly on the state vector (and similarly for two-qubit controlled gates).
Use a communication scheme in which pairs of processors exchange halves of their local state to apply gates across nodes, with four cases for controlled gates depending on whether the control and target qubits lie above or below the local memory boundary m.
Vectorize the inner loops (AVX2) with complex SIMD arithmetic, and use multi-threading to parallelize the outer and inner loops.
Develop cache-blocking and gate-fusion strategies that keep blocks of the state vector in the last-level cache (LLC) and increase effective memory bandwidth.
Push against the memory-bound and network-bound limits by splitting communication into multiple steps so that computation overlaps with data exchange, and by experimenting with gate fusion and LLC-resident blocks.
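
The amplitude-pair update and the pairing of nodes for high qubits can be illustrated with a minimal pure-Python sketch. It is not the qHiPSTER implementation: the function names are hypothetical, the "nodes" are plain lists in one process, and the exchange of half-states is replaced by direct access to the partner's list. It assumes qubit k corresponds to bit k of the amplitude index.

```python
import math

def apply_single_qubit_gate(state, k, Q):
    """Apply the 2x2 matrix Q to qubit k of a state vector, in place."""
    stride = 1 << k
    for base in range(0, len(state), 2 * stride):
        for i in range(base, base + stride):
            # i has bit k clear, i + stride has bit k set: one amplitude pair.
            a0, a1 = state[i], state[i + stride]
            state[i] = Q[0][0] * a0 + Q[0][1] * a1
            state[i + stride] = Q[1][0] * a0 + Q[1][1] * a1

def apply_controlled_gate(state, c, t, Q):
    """Apply Q to target qubit t only on amplitudes whose control bit c is 1."""
    stride, control_mask = 1 << t, 1 << c
    for base in range(0, len(state), 2 * stride):
        for i in range(base, base + stride):
            if i & control_mask:
                a0, a1 = state[i], state[i + stride]
                state[i] = Q[0][0] * a0 + Q[0][1] * a1
                state[i + stride] = Q[1][0] * a0 + Q[1][1] * a1

def apply_single_qubit_gate_distributed(chunks, k, m, Q):
    """chunks[r] stands in for node r and holds 2^m local amplitudes.

    For k < m every node works on its own data. For k >= m the two
    amplitudes of a pair live on nodes r and r | 2^(k-m); a real
    distributed code would exchange half-states between those two nodes.
    """
    if k < m:
        for chunk in chunks:
            apply_single_qubit_gate(chunk, k, Q)
        return
    bit = 1 << (k - m)
    for r in range(len(chunks)):
        if r & bit:
            continue  # handled together with its partner below
        lo, hi = chunks[r], chunks[r | bit]
        for j in range(len(lo)):
            a0, a1 = lo[j], hi[j]
            lo[j] = Q[0][0] * a0 + Q[0][1] * a1
            hi[j] = Q[1][0] * a0 + Q[1][1] * a1

h = 1 / math.sqrt(2)
H = [[h, h], [h, -h]]   # Hadamard
X = [[0, 1], [1, 0]]    # Pauli-X, used as the target gate of a CNOT

# Hadamard on qubit 0 followed by CNOT(control 0, target 1) on |00>.
bell = [1 + 0j, 0j, 0j, 0j]
apply_single_qubit_gate(bell, 0, H)
apply_controlled_gate(bell, 0, 1, X)
```

After the last two calls, `bell` holds the amplitudes of (|00> + |11>)/sqrt(2). The distributed variant gives the same result as the flat one when the flat vector is split into equal chunks in index order.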

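Experimental results

Research questions

RQ1: What is the performance (time per gate) of single-qubit and two-qubit operations in a large-scale distributed quantum state simulator?
RQ2: How close can a high-performance simulator get to the theoretical memory-bound and network-bound limits of HPC hardware?
RQ3: For large qubit counts, which architectural and algorithmic optimizations (vectorization, threading, cache blocking, gate fusion, multi-step communication) yield the largest performance gains?
RQ4: How does distributed gate application for up to 40 qubits scale from hundreds to a thousand nodes?
RQ5: What is the performance of the Quantum Fourier Transform (QFT) kernel in this framework?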

Key findings

Case	Analytical bound	Stampede (n=m=29, B_mem=40 GB/s, B_net=5.5 GB/s)
1	2^{m+5}/B_mem	0.43 sec
2	2^{m+5}/B_net	3.12 sec
3	2^{m+4}/B_mem	0.21 sec
4	2^{m+5}/B_mem	0.43 sec
5	2^{m+4}/B_net	1.56 sec
6	2^{m+5}/B_net	3.12 sec

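On 1000 Stampede nodes, with 32 TB of total memory, the simulator handles up to 40 qubits; gate performance is memory-bound or network-bound depending on qubit placement and communication.
Single-qubit gates on a qubit k < m are memory-bound, taking about 0.43 s for m = 29 (n = 29); for k ≥ m they become network-bound on this hardware, taking about 3.12 s.
Two-qubit controlled gates behave similarly: depending on m, the control qubit c, and whether inter-node communication is needed, they are memory-bound (0.21 s) or network-bound (≈3.12 s).
Gate fusion and LLC-aware cache blocking raise the effective memory bandwidth well above the baseline STREAM rate, reaching up to about 100 GB/s in the fused IQFT scenario at certain qubit counts.
At moderate node counts, when no communication is needed, multi-node strong scaling shows nearly linear speedup; once inter-node communication becomes the bottleneck, performance drops sharply. At large scale, LLC residency and network contention dominate performance.
QFT performance scales from ~0.27 s/gate at 29 qubits to ~1.22 s/gate at 40 qubits, showing that communication has a growing impact on larger circuits.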
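
The 40-qubit limit can be related to memory size with a short calculation. The sketch below is illustrative only: the helper names are hypothetical, it assumes 16 bytes per amplitude (double-precision complex), and it rounds the roughly 1000 nodes to 1024 so that the node count is a power of two, as the 2^p partitioning requires.

```python
def state_vector_bytes(n, bytes_per_amplitude=16):
    """Bytes needed to store all 2**n amplitudes of an n-qubit state."""
    return (1 << n) * bytes_per_amplitude

def local_qubits(n, num_nodes):
    """Return m = n - p for a state split evenly across 2**p nodes."""
    p = num_nodes.bit_length() - 1
    if (1 << p) != num_nodes:
        raise ValueError("num_nodes must be a power of two")
    return n - p

n = 40
nodes = 1024                                  # assumption: 2**10 nodes
m = local_qubits(n, nodes)                    # local qubits per node
total_tib = state_vector_bytes(n) / 2 ** 40   # whole state vector, in TiB
per_node_gib = state_vector_bytes(m) / 2 ** 30  # one node's share, in GiB

print(f"m = {m}, total = {total_tib:.0f} TiB, per node = {per_node_gib:.0f} GiB")
```

Under these assumptions the state vector alone takes 16 TiB, or 16 GiB per node, which fits within the 32 TB of total memory quoted above while leaving room for communication buffers.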


This review was generated by AI and checked by a human editor.