QUICK REVIEW

[Paper Review] Constant RMR Recoverable Mutex under System-wide Crashes

Dhoked, Sahil; Golab, Wojciech | arXiv (Cornell University) | Jan 1, 2023

Distributed systems and fault tolerance | Cited by 2

Summary in brief

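This paper proposes two new recoverable mutual exclusion (RME) locks for the cache-coherent (CC) and distributed shared memory (DSM) models under system-wide crashes. Using only standard atomic instructions (CAS, FAS), the locks achieve O(1) worst-case RMR complexity and constant space per process, and they allow dynamically created threads to join. The main contribution is a proof of a clear separation in RMR complexity between the system-wide and individual crash models: the individual crash model has an Ω(log n / log log n) lower bound, whereas constant RMR performance is achievable under system-wide crashes.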

ABSTRACT

Recoverable mutual exclusion (RME) is a fault-tolerant variation of Dijkstra’s classic mutual exclusion (ME) problem that allows processes to fail by crashing as long as they recover eventually. A growing body of literature on this topic, starting with the problem formulation by Golab and Ramaraju (PODC'16), examines the cost of solving the RME problem, which is quantified by counting the expensive shared memory operations called remote memory references (RMRs), under a variety of conditions. Published results show that the RMR complexity of RME algorithms, among other factors, depends crucially on the failure model used: individual process versus system-wide. Recent work by Golab and Hendler (PODC'18) also suggests that explicit failure detection can be helpful in attaining constant RMR solutions to the RME problem in the system-wide failure model. Follow-up work by Jayanti, Jayanti, and Joshi (SPAA'23) shows that such a solution exists even without employing a failure detector, albeit this solution uses a more complex algorithmic approach. In this work, we dive deeper into the study of RMR-optimal RME algorithms for the system-wide failure model, and present contributions along multiple directions. First, we introduce the notion of withdrawing from a lock acquisition rather than resetting the lock. We use this notion to design a withdrawable RME algorithm with optimal O(1) RMR complexity for both cache-coherent (CC) and distributed shared memory (DSM) models in a modular way without using an explicit failure detector. In some sense, our technique marries the simplicity of Golab and Hendler’s algorithm with Jayanti, Jayanti and Joshi’s weaker system model. Second, we present a variation of our algorithm that supports fully dynamic process participation (i.e., both joining and leaving) in the CC model, while maintaining its constant RMR complexity. 
We show experimentally that our algorithm is substantially faster than Jayanti, Jayanti, and Joshi’s algorithm despite having stronger correctness properties. Finally, we establish an impossibility result for fully dynamic RME algorithms with bounded RMR complexity in the DSM model that are adaptive with respect to space, and provide a wait-free withdraw section.

Motivation and objectives

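Design RME locks for the CC and DSM models that tolerate all processes crashing simultaneously, for example because of a power failure.
Achieve O(1) worst-case RMR complexity in both the CC and DSM models, beating the known lower bound for the individual crash model.
Allow threads with arbitrary names to join the protocol dynamically at run time, without pre-allocation.
Ensure bounded recovery and bounded exit, so that restart after a crash is efficient and predictable.
Demonstrate a theoretical separation between the system-wide and individual crash models in terms of worst-case RMR complexity.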

Proposed method

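Designs a two-phase RME protocol that uses a three-level lock (Lock[0], Lock[1], Lock[2]) to manage access to the critical section and the recovery state.
Uses sequence numbers (Seq) and state tracking (Sp, Sq, CSowner, Stop) to synchronize process state and guarantee mutual exclusion during recovery.
Uses atomic instructions (CAS, FAS) to acquire and release locks, guaranteeing correctness even under system-wide crashes.
Introduces a recovery mechanism in which a process calls ℓ.recoverp() after restarting; the call returns IN REM or IN CS depending on the state the process was in immediately before the crash.
Uses a state machine with 28 program-counter states per process to guarantee that each process resumes correctly after a crash.
Adopts a three-level lock abstraction (Lock[i].try, Lock[i].exit, Lock[i].recover) to synchronize access and guarantee bounded recovery and bounded exit.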
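
The try / exit / recover interface described above can be illustrated with a deliberately simplified toy model. This is a sketch under stated assumptions, not the paper's algorithm: it does not use the three-level lock, the sequence numbers, or any queue, and it makes no claim about RMR complexity. The names ToyRecoverableLock, Status, persistent, and volatile are hypothetical, and CAS is simulated in sequential Python rather than executed atomically. The sketch only shows how a recover call could return IN REM or IN CS from state that survives a system-wide crash.

```python
from enum import Enum


class Status(Enum):
    IN_REM = "IN_REM"  # process resumes in the remainder section
    IN_CS = "IN_CS"    # process resumes inside the critical section


class ToyRecoverableLock:
    """Toy model only: `persistent` survives a crash, `volatile` does not."""

    def __init__(self) -> None:
        # Stand-in for non-volatile shared memory (hypothetical layout).
        self.persistent = {"cs_owner": None}
        # Stand-in for caches and registers that a crash wipes.
        self.volatile = {"spinning": set()}

    def _cas(self, key: str, expected, new) -> bool:
        """Simulated compare-and-swap on one persistent word.

        Atomic on real hardware; here it is plain sequential Python.
        """
        if self.persistent[key] == expected:
            self.persistent[key] = new
            return True
        return False

    def try_acquire(self, pid: int) -> bool:
        """One acquisition attempt; a real lock would wait instead of returning False."""
        if self._cas("cs_owner", None, pid):
            self.volatile["spinning"].discard(pid)
            return True
        self.volatile["spinning"].add(pid)
        return False

    def exit(self, pid: int) -> None:
        """Release the lock, but only if `pid` is the recorded owner."""
        self._cas("cs_owner", pid, None)

    def crash(self) -> None:
        """System-wide crash: every process loses its volatile state at once."""
        self.volatile = {"spinning": set()}

    def recover(self, pid: int) -> Status:
        """Decide where `pid` resumes, using persistent state only."""
        if self.persistent["cs_owner"] == pid:
            return Status.IN_CS
        return Status.IN_REM
```

In this toy, a process that held the lock when the crash hit gets IN_CS back from recover and can re-enter its critical section, while every other process restarts from the remainder section. Keeping the owner record in persistent memory is the design choice that makes this possible; the bounded recovery and bounded exit guarantees claimed in the review require the paper's actual protocol.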

Experimental results

Research questions

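RQ1: Can an RME lock be designed that achieves O(1) worst-case RMR complexity in the CC and DSM models even under system-wide crashes?
RQ2: Can an RME lock be designed that supports dynamically created threads without pre-allocation while keeping space and RMRs constant?
RQ3: Given the known lower bounds, can the system-wide crash model achieve better RMR complexity than the individual crash model?
RQ4: Can bounded recovery and bounded exit be achieved under the system-wide crash model while maintaining O(1) RMRs?
RQ5: Which structural properties of system-wide crashes make the RMR complexity separation from individual crashes possible?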

Key findings

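The first algorithm achieves O(1) worst-case RMR complexity in the CC model, uses constant space per process, and supports dynamic thread participation.
The second algorithm extends the first to achieve O(1) worst-case RMR complexity in both the CC and DSM models, with the same space and dynamic-participation constraints.
Assuming standard hardware support (CAS, FAS), the algorithms are the first to achieve O(1) worst-case RMR complexity in both models.
A theoretical separation is established: O(1) RMRs are achievable under system-wide crashes, whereas the individual crash model has an Ω(log n / log log n) lower bound.
The protocol satisfies critical section reentry (CSR), mutual exclusion, starvation freedom, bounded recovery, and bounded exit.
Formal inductive invariants are proved to verify correctness in all configurations, including crashes and restarts.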
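
The separation listed among the findings can be written compactly as follows. Here n is taken to be the number of processes, as is usual in this literature, and the subscripted names are shorthand introduced for this note rather than notation from the paper.

```latex
\[
\mathrm{RMR}_{\mathrm{sw}}(n) = O(1)
\qquad\text{versus}\qquad
\mathrm{RMR}_{\mathrm{ind}}(n) = \Omega\!\left(\frac{\log n}{\log\log n}\right)
\]
```

Because log n / log log n grows without bound as n increases, no constant-RMR algorithm can exist in the individual crash model, which is what makes the two models provably different.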

This review was generated by AI and checked by a human editor.