Task-based Parallel Programming for Scalable Matrix Product Algorithms

Bibliographic Details
Published in:	ACM Transactions on Mathematical Software, 2023-06, Vol. 49 (2), p. 1-23, Article 15
Main Authors:	Agullo, Emmanuel; Buttari, Alfredo; Guermouche, Abdou; Herrmann, Julien; Jego, Antoine
Format:	Article
Language:	English
Subjects:	Computations on matrices; Computer Science; Computing methodologies; Distributed algorithms; Distributed programming languages; Distributed, Parallel, and Cluster Computing; Mathematical software performance; Mathematics of computing; Software and its engineering; Software design engineering

Description
Summary:	Task-based programming models have succeeded in gaining the interest of the high-performance mathematical software community because they relieve part of the burden of developing and implementing distributed-memory parallel algorithms in an efficient and portable way. In increasingly large and heterogeneous clusters of computers, these models appear as a way to maintain and enhance more complex algorithms. However, task-based programming models lack the flexibility and the features that are necessary to express, in an elegant and compact way, scalable algorithms that rely on advanced communication patterns. We show that the Sequential Task Flow paradigm can be extended to write compact yet efficient and scalable routines for linear algebra computations. Although this work focuses on dense General Matrix Multiplication (GEMM), the proposed features enable the implementation of more complex algorithms. We describe the implementation of these features and of the resulting GEMM operation. Finally, we present an experimental analysis on two homogeneous supercomputers showing that our approach is competitive with state-of-the-art libraries up to 32,768 CPU cores and may outperform them for some problem dimensions. Although our code can use GPUs straightforwardly, we do not deal with this case because it raises other issues that are outside the scope of this work.
ISSN:	0098-3500 1557-7295
DOI:	10.1145/3583560