Low-Latency Collectives for the Intel SCC
Main Authors: | , , , |
---|---|
Format: | Conference Proceeding |
Language: | English |
Summary: | Message passing has been adopted as the main programming paradigm for many-core processors that use on-chip networks for inter-core communication. To this end, message-passing libraries such as MPI can be used, as they provide well-known interfaces to application developers. Since MPI implementations were originally developed for macroscopic computer networks, the different characteristics of on-chip networks may require rethinking existing solutions. Using Allreduce as an example, we identify points where collective operations benefit from routines optimized for on-chip networks. The resulting optimizations are then applied to additional collectives, including Broadcast, Allgather, and All-to-all. The effectiveness of the proposed optimizations is demonstrated on the Single-Chip Cloud Computer (SCC), a many-core research chip created by Intel Labs. Experiments show that collective operations subjected to the identified optimizations are accelerated by factors of roughly 2 to 3 compared to current state-of-the-art implementations. In addition to synthetic benchmarks, we show that using the optimized routines accelerates a scientific application by more than 40%. |
---|---|
ISSN: | 1552-5244, 2168-9253 |
DOI: | 10.1109/CLUSTER.2012.58 |