[Paper Review] Sparse Tensor Algebra as a Parallel Programming Model
This paper proposes sparse tensor algebra as a high-level parallel programming model that generalizes dense tensor operations to support sparse data structures and arbitrary elementwise functions. By extending the Cyclops Tensor Framework (CTF) with sparse layouts and customizable operations, the model enables efficient, communication-avoiding execution of iterative solvers, graph algorithms, and electronic structure calculations, achieving up to 6× speed-up in MP3 electronic structure computations and improved weak scalability through sparsity-aware runtime optimization.
Dense and sparse tensors allow the representation of most bulk data structures in computational science applications. We show that sparse tensor algebra can also be used to express many of the transformations on these datasets, especially those which are parallelizable. Tensor computations are a natural generalization of matrix and graph computations. We extend the usual basic operations of tensor summation and contraction to arbitrary functions, and further operations such as reductions and mapping. The expression of these transformations in a high-level sparse linear algebra domain specific language allows our framework to understand their properties at runtime to select the preferred communication-avoiding algorithm. To demonstrate the efficacy of our approach, we show how key graph algorithms as well as common numerical kernels can be succinctly expressed using our interface and provide performance results of a general library implementation.
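To make the idea of extending contraction to arbitrary functions concrete, here is a minimal pure-Python sketch. It is not the framework's API: the function name `sparse_contract` and the dictionary-of-coordinates storage are assumptions chosen for illustration. The same routine computes an ordinary matrix product or a min-plus (shortest-path) product depending on which `add` and `mul` functions are passed in.

```python
from collections import defaultdict
import operator

def sparse_contract(A, B, add, mul):
    """Compute C[i, j] = add-reduction over k of mul(A[i, k], B[k, j]).

    A and B are sparse matrices stored as {(row, col): value} dicts
    (a hypothetical layout used only for this sketch).
    """
    # Group the nonzeros of B by row so each A[i, k] only meets matching B[k, j].
    b_rows = defaultdict(list)
    for (k, j), b in B.items():
        b_rows[k].append((j, b))

    C = {}
    for (i, k), a in A.items():
        for j, b in b_rows.get(k, ()):
            prod = mul(a, b)
            # Accumulate with the user-supplied "addition".
            C[(i, j)] = add(C[(i, j)], prod) if (i, j) in C else prod
    return C

A = {(0, 0): 1.0, (0, 1): 2.0, (1, 1): 3.0}
B = {(0, 0): 4.0, (1, 0): 5.0, (1, 1): 6.0}

# Ordinary (+, *) contraction: a standard sparse matrix product.
plus_times = sparse_contract(A, B, operator.add, operator.mul)

# Tropical (min, +) contraction: the building block of shortest-path updates.
min_plus = sparse_contract(A, B, min, operator.add)
```

Only stored entries are visited, so the work scales with the number of matching nonzero pairs rather than with the full dense dimensions.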
Motivation & Objective
- To unify array, matrix, and graph computations under a single sparse tensor algebra abstraction for high-performance computing.
- To enable automatic selection of communication-avoiding algorithms at runtime based on tensor properties and data layout.
- To extend the CTF framework to support sparse tensors with arbitrary element types and user-defined functions.
- To demonstrate that sparsity can significantly improve performance and weak scalability in key scientific workloads.
- To provide a minimal yet powerful interface that avoids explicit loops and reduces programming errors in parallel code.
Proposed method
- Extends tensor operations—summation, contraction, mapping, and reductions—to arbitrary functions and algebraic structures, enabling flexible data transformations.
- Introduces a high-level C++ domain-specific language (DSL) that abstracts data layout and enables automatic parallelization across distributed memory.
- Leverages the cyclic CTF layout to randomize nonzero distribution, avoiding costly graph partitioning while maintaining near-optimal performance for random and arbitrary sparsity patterns.
- Uses runtime analysis to select optimal communication-avoiding algorithms based on sparsity and data structure characteristics.
- Supports sparse-sparse, sparse-dense, and dense-dense contractions, with performance tuned via integration with MKL and custom kernels for integer types.
- Employs bulk-synchronous parallel (BSP) execution to ensure low-depth, highly parallelizable programs, avoiding complex dependencies.
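The effect of a cyclic layout on load balance can be illustrated with a small sketch. This is a simplified model, not CTF's actual mapping code: the helper names and the 2D processor grid are assumptions. It compares how a clustered block of nonzeros is distributed under a blocked layout versus a cyclic one.

```python
from collections import Counter

def cyclic_owner(i, j, pr, pc):
    """Cyclic layout: element (i, j) goes to processor (i mod pr, j mod pc)."""
    return (i % pr, j % pc)

def blocked_owner(i, j, n, pr, pc):
    """Blocked layout: contiguous n/pr by n/pc tiles, one per processor."""
    return (i * pr // n, j * pc // n)

def load_per_processor(nonzeros, owner):
    """Count how many nonzeros each processor would own."""
    return Counter(owner(i, j) for i, j in nonzeros)

n, pr, pc = 8, 2, 2
# A clustered sparsity pattern: all nonzeros sit in the top-left 4x4 corner.
nonzeros = [(i, j) for i in range(4) for j in range(4)]

blocked_load = load_per_processor(
    nonzeros, lambda i, j: blocked_owner(i, j, n, pr, pc))
cyclic_load = load_per_processor(
    nonzeros, lambda i, j: cyclic_owner(i, j, pr, pc))
```

In this toy case the blocked layout places all 16 nonzeros on one processor, while the cyclic layout gives each of the four processors 4 nonzeros, without running any graph partitioner.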
Experimental results
Research questions
- RQ1: Can sparse tensor algebra serve as a unified, high-level programming model for diverse scientific workloads including graphs and PDE solvers?
- RQ2: How does sparsity in tensor operations affect communication and computation costs in distributed-memory systems?
- RQ3: To what extent can a high-level DSL with automatic algorithm selection improve performance and scalability compared to hand-optimized kernels?
- RQ4: Can user-defined functions and arbitrary data types be efficiently supported in a sparse tensor framework without sacrificing performance?
- RQ5: How does the use of sparsity impact weak and strong scaling in real-world scientific applications?
Key findings
- The sparse tensor algebra model achieved up to a 6× speed-up in MP3 electronic structure calculations compared to dense execution, with performance gains increasing under higher parallelism.
- Sparsity improved weak scalability significantly, particularly in path-doubling-based all-pairs shortest-path computations, where the sparse kernel outperformed the dense kernel in weak scaling due to reduced communication and computation costs.
- On 384 cores, the sparse path-doubling kernel spent 71.4% of its time in local kernels, compared with 86.6% for the dense kernel, so a larger share of the sparse run went to communication and other non-kernel work.
- For the dense graph APSP benchmark, the sparse kernel achieved 87.8% and 90.8% time in local kernels on 384 cores under weak scaling, showing better scalability than the dense alternative.
- The model reduced both computation and communication costs in proportion to the fraction of nonzeros, with performance gains observed even with minimal tuning.
- The framework’s ability to automatically select communication-avoiding algorithms at runtime enabled efficient execution without requiring manual partitioning or layout optimization.
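The path-doubling approach to all-pairs shortest paths mentioned above can be sketched in a few lines. This is a sequential, pure-Python illustration of the general technique (repeated min-plus squaring of the distance matrix), not the paper's distributed kernel. The function names and the dictionary storage, which keeps only finite distances, are assumptions.

```python
import math

def min_plus_square(D):
    """One path-doubling step: D[i, j] = min(D[i, j], min_k D[i, k] + D[k, j]).

    D is a sparse {(i, j): distance} dict holding only finite distances.
    """
    by_row = {}
    for (k, j), d in D.items():
        by_row.setdefault(k, []).append((j, d))

    out = dict(D)  # paths found so far remain valid
    for (i, k), d_ik in D.items():
        for j, d_kj in by_row.get(k, ()):
            cand = d_ik + d_kj
            if cand < out.get((i, j), math.inf):
                out[(i, j)] = cand
    return out

def apsp_path_doubling(n, edges):
    """Square the distance matrix until paths of up to n - 1 hops are covered."""
    D = {(v, v): 0.0 for v in range(n)}
    for (u, v), w in edges.items():
        if w < D.get((u, v), math.inf):
            D[(u, v)] = w
    hops, steps = 1, 0
    while hops < n - 1:
        D = min_plus_square(D)
        hops *= 2  # each squaring doubles the maximum path length covered
        steps += 1
    return D, steps

# Directed example graph: a chain 0 -> 1 -> 2 -> 3 plus a costly shortcut 0 -> 3.
edges = {(0, 1): 1.0, (1, 2): 2.0, (2, 3): 3.0, (0, 3): 10.0}
dist, steps = apsp_path_doubling(4, edges)
```

Because unreachable pairs are never stored, both the memory footprint and the work per squaring depend on the number of finite entries rather than on the full n by n matrix, which is the intuition behind the cost reduction reported for sparse inputs.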
This review was created by AI and reviewed by human editors.