QUICK REVIEW

[Paper Review] Accelerating HPC codes on Intel(R) Omni-Path Architecture networks: From particle physics to Machine Learning

Peter A. Boyle, Michael Chuvelev|arXiv (Cornell University)|Nov 13, 2017

Advanced Data Storage Technologies | Cited by 4

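Brief summary

This paper presents optimization techniques for achieving near wirespeed performance on Intel Omni-Path Architecture clusters equipped with Xeon Phi 72xx processors. The two main workloads are halo exchange in structured grid PDE solvers and gradient reduction in synchronous stochastic gradient descent. Using 2MB huge pages, multiple PSM2 endpoints, and thread concurrency in the Intel MPI 2019 Technology Preview, the authors obtain a factor of ten speedup on a published Baidu Research reduction code and report performance gains for both the HPC and the machine learning workload.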

ABSTRACT

We discuss practical methods to ensure near wirespeed performance from clusters with either one or two Intel(R) Omni-Path host fabric interfaces (HFI) per node, and Intel(R) Xeon Phi(TM) 72xx (Knights Landing) processors, and using the Linux operating system. The study evaluates the performance improvements achievable and the required programming approaches in two distinct example problems: firstly in Cartesian communicator halo exchange problems, appropriate for structured grid PDE solvers that arise in quantum chromodynamics simulations of particle physics, and secondly in gradient reduction appropriate to synchronous stochastic gradient descent for machine learning. As an example, we accelerate a published Baidu Research reduction code and obtain a factor of ten speedup over the original code using the techniques discussed in this paper. This displays how a factor of ten speedup in strongly scaled distributed machine learning could be achieved when synchronous stochastic gradient descent is massively parallelised with a fixed mini-batch size. We find a significant improvement in performance robustness when memory is obtained using carefully allocated 2MB huge virtual memory pages, implying that non-standard allocation routines should be used for communication buffers. These can be accessed via an LD_PRELOAD override in the manner suggested by libhugetlbfs. We make use of the Intel(R) MPI 2019 library Technology Preview and underlying software to enable thread concurrency throughout the communication software stack via multiple PSM2 endpoints per process and use of multiple independent MPI communicators. When using a single MPI process per node, we find that this greatly accelerates delivered bandwidth in many-core Intel(R) Xeon Phi processors.

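Motivation and objectives

Optimize communication performance for HPC and machine learning workloads on clusters with Intel Omni-Path Architecture and Xeon Phi 72xx processors.
Evaluate how a memory allocation strategy based on 2MB huge pages affects the performance of communication buffers.
Enable thread concurrency in the MPI communication stack through multiple PSM2 endpoints and independent communicators.
Demonstrate performance gains in two representative HPC workloads: halo exchange in structured grid solvers and gradient reduction in synchronous SGD.
Show that careful memory management and communication stack tuning can deliver near wirespeed performance, including on nodes with two HFIs.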
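
To give a sense of why page size matters for communication buffers, the following back-of-the-envelope sketch counts how many virtual memory pages a buffer spans with base pages versus 2MB huge pages. The 64 MiB buffer size is an arbitrary illustrative value, not a figure from the paper, and the 4 KiB base page size is the usual x86-64 Linux default, assumed here.

```python
# Illustrative arithmetic only: pages spanned by one communication buffer.
BASE_PAGE = 4 * 1024          # assumed default x86-64 Linux page size (4 KiB)
HUGE_PAGE = 2 * 1024 * 1024   # 2 MiB huge page, the size discussed in the paper


def pages_needed(buffer_bytes: int, page_bytes: int) -> int:
    """Number of pages needed to back a buffer (ceiling division)."""
    return -(-buffer_bytes // page_bytes)


buffer_bytes = 64 * 1024 * 1024  # hypothetical 64 MiB buffer
base_pages = pages_needed(buffer_bytes, BASE_PAGE)
huge_pages = pages_needed(buffer_bytes, HUGE_PAGE)
print(base_pages, huge_pages, base_pages // huge_pages)  # prints: 16384 32 512
```

Each huge page covers 512 base pages, so the same buffer needs far fewer address translations. That is one plausible contributor to the robustness the authors report; the summary itself does not spell out the mechanism.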

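Proposed method

Back communication buffers with 2MB huge virtual memory pages through an LD_PRELOAD override in the manner suggested by libhugetlbfs, which the authors find makes performance significantly more robust.
Use the Intel MPI 2019 Technology Preview with multiple PSM2 endpoints per process to enable thread-level concurrency in communication.
Use multiple independent MPI communicators to separate and accelerate communication operations on many-core processors.
Optimize the communication pattern of Cartesian communicator halo exchange in structured grid PDE solvers used for quantum chromodynamics simulations.
Apply the same optimization techniques to a gradient reduction kernel for machine learning.
Benchmark the performance gains with one MPI process per node on systems with one or two Omni-Path HFIs per node.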
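
The neighbour lookup behind a Cartesian halo exchange can be sketched in plain Python. This is a minimal stand-in that calls no MPI library: it assumes a fully periodic grid with row-major rank ordering (the ordering MPI Cartesian topologies use), and the helper names `rank_to_coords`, `coords_to_rank`, and `shift`, as well as the grid shape, are hypothetical and not taken from the paper.

```python
# Pure-Python stand-in for the neighbour lookup of a Cartesian communicator.
# Assumptions: periodic boundaries in every dimension, row-major ranks.
from typing import List, Tuple


def rank_to_coords(rank: int, dims: List[int]) -> List[int]:
    """Row-major rank -> grid coordinates (last dimension varies fastest)."""
    coords = []
    for extent in reversed(dims):
        coords.append(rank % extent)
        rank //= extent
    return coords[::-1]


def coords_to_rank(coords: List[int], dims: List[int]) -> int:
    """Grid coordinates -> row-major rank, wrapping periodically."""
    rank = 0
    for c, extent in zip(coords, dims):
        rank = rank * extent + (c % extent)
    return rank


def shift(rank: int, dims: List[int], axis: int, disp: int = 1) -> Tuple[int, int]:
    """Return (source, destination) ranks for a periodic shift along one axis."""
    coords = rank_to_coords(rank, dims)
    up, down = coords.copy(), coords.copy()
    up[axis] += disp
    down[axis] -= disp
    return coords_to_rank(down, dims), coords_to_rank(up, dims)


dims = [2, 2, 2, 4]  # hypothetical 4-dimensional grid of 32 ranks
for axis in range(len(dims)):
    src, dst = shift(5, dims, axis)
    print(f"axis {axis}: receive halo from rank {src}, send halo to rank {dst}")
```

In a real solver each rank would post one send and one receive per direction and axis, so a four-dimensional grid gives eight independent messages per exchange. Those messages are natural candidates for being spread over separate threads and communicators.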

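Experimental results

Research questions

RQ1: How can communication performance be maximized on clusters with Intel Omni-Path Architecture and Xeon Phi 72xx processors?
RQ2: How much do 2MB huge pages improve the performance of MPI communication buffers?
RQ3: Can multiple PSM2 endpoints per process significantly increase delivered bandwidth on many-core processors?
RQ4: How much performance improvement do the proposed optimizations achieve in the halo exchange and gradient reduction workloads?
RQ5: How does the combination of memory allocation and MPI thread concurrency affect end-to-end application performance?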
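
The idea behind RQ3, several threads each driving an independent communication path, can be illustrated without any MPI calls. In the sketch below a `queue.Queue` is only a placeholder for one communicator or endpoint, and `chunk_bounds` and `threaded_transfer` are hypothetical helper names; nothing here reflects the actual Intel MPI or PSM2 interfaces.

```python
# Stand-in sketch: each worker thread moves its own slice of a buffer through
# its own channel, mimicking one communicator/endpoint per thread.
# queue.Queue is only a placeholder for a real network path.
import queue
import threading
from typing import List, Tuple


def chunk_bounds(n_items: int, n_chunks: int) -> List[Tuple[int, int]]:
    """Split range(n_items) into contiguous, near-equal [start, stop) spans."""
    base, extra = divmod(n_items, n_chunks)
    bounds, start = [], 0
    for i in range(n_chunks):
        stop = start + base + (1 if i < extra else 0)
        bounds.append((start, stop))
        start = stop
    return bounds


def threaded_transfer(send_buf: List[float], n_threads: int) -> List[float]:
    """Copy send_buf to a new buffer using one thread pair and one channel per chunk."""
    recv_buf = [0.0] * len(send_buf)
    channels = [queue.Queue() for _ in range(n_threads)]  # one per thread
    bounds = chunk_bounds(len(send_buf), n_threads)

    def sender(i: int) -> None:
        start, stop = bounds[i]
        channels[i].put((start, send_buf[start:stop]))

    def receiver(i: int) -> None:
        start, chunk = channels[i].get()  # blocks until the sender has put
        recv_buf[start:start + len(chunk)] = chunk  # disjoint slices per thread

    threads = [threading.Thread(target=f, args=(i,))
               for i in range(n_threads) for f in (sender, receiver)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return recv_buf


data = [float(i) for i in range(10)]
print(chunk_bounds(len(data), 4))
print(threaded_transfer(data, 4) == data)
```

Because each thread owns a disjoint slice and its own channel, no thread waits on another's traffic. That independence is what multiple endpoints and communicators are meant to provide in the real communication stack.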

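Key findings

Applying the proposed optimizations to the published Baidu Research gradient reduction code yields a factor of ten speedup over the original code.
Allocating memory in 2MB huge pages significantly improves performance robustness, implying that non-standard allocation routines should be used for communication buffers.
Multiple PSM2 endpoints per process enable effective thread concurrency and greatly increase delivered bandwidth on many-core Xeon Phi processors.
The bandwidth gain is reported for the configuration with a single MPI process per node, where one process must drive the node's HFIs through multiple threads.
Combining huge pages with multiple MPI communicators brings both the halo exchange and the gradient reduction workload close to wirespeed.
For strongly scaled distributed machine learning with a fixed mini-batch size, the result shows how a factor of ten speedup could be achieved when synchronous SGD is massively parallelised.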
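
For readers unfamiliar with gradient reduction, the following single-process simulation shows a ring allreduce, a common way to sum gradients across ranks. This is a sketch under an assumption: the summary does not state which algorithm the Baidu Research code uses, and the function name `ring_allreduce` and the sample vectors are illustrative only.

```python
# Single-process simulation of a ring allreduce (sum) over P "ranks".
# Assumption: a ring-style algorithm; the summary does not state which
# algorithm the Baidu code uses.
from typing import List


def ring_allreduce(vectors: List[List[float]]) -> List[List[float]]:
    """Return per-rank results after summing vectors elementwise around a ring."""
    p = len(vectors)
    n = len(vectors[0])
    data = [list(v) for v in vectors]  # work on copies
    # Chunk c covers indices [bounds[c], bounds[c + 1]).
    bounds = [(c * n) // p for c in range(p + 1)]

    # Phase 1, reduce-scatter: after p - 1 steps rank r holds the full sum
    # of chunk (r + 1) % p.
    for step in range(p - 1):
        # Snapshot outgoing chunks so all ranks "send" simultaneously.
        outgoing = []
        for r in range(p):
            c = (r - step) % p
            outgoing.append((c, data[r][bounds[c]:bounds[c + 1]]))
        for r in range(p):
            c, chunk = outgoing[(r - 1) % p]  # receive from left neighbour
            for k, value in enumerate(chunk):
                data[r][bounds[c] + k] += value

    # Phase 2, allgather: circulate the finished chunks around the ring.
    for step in range(p - 1):
        outgoing = []
        for r in range(p):
            c = (r + 1 - step) % p
            outgoing.append((c, data[r][bounds[c]:bounds[c + 1]]))
        for r in range(p):
            c, chunk = outgoing[(r - 1) % p]
            data[r][bounds[c]:bounds[c + 1]] = chunk
    return data


grads = [[1.0, 2.0, 3.0, 4.0], [10.0, 20.0, 30.0, 40.0], [100.0, 200.0, 300.0, 400.0]]
result = ring_allreduce(grads)
print(result[0], all(r == result[0] for r in result))
```

In this scheme each rank sends about 2(P - 1)/P times the vector length in total, so per-rank traffic stays nearly constant as ranks are added. With a fixed mini-batch the compute per rank shrinks under strong scaling, which is why the speed of the reduction itself becomes the limiting factor.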

This review was generated by AI and checked by a human editor.