Parallel Deep Neural Network Training for Big Data on Blue Gene/Q

Bibliographic Details
Published in: IEEE Transactions on Parallel and Distributed Systems, 2017-06, Vol. 28 (6), p. 1703-1714
Main Authors: Chung, I-Hsin, Sainath, Tara N., Ramabhadran, Bhuvana, Picheny, Michael, Gunnels, John, Austel, Vernon, Chaudhari, Upendra, Kingsbury, Brian
Format: Article
Language: English
Description
Summary: Deep Neural Networks (DNNs) have recently been shown to significantly outperform existing machine learning techniques in several pattern recognition tasks. DNNs are the state-of-the-art models used in image recognition, object detection, classification and tracking, and speech and language processing applications. The biggest drawback to DNNs has been the enormous cost in computation and time taken to train the parameters of the networks, often a tenfold increase relative to conventional technologies. Such training time costs can be mitigated by the application of parallel computing algorithms and architectures. However, these algorithms often run into difficulties because of inter-processor communication bottlenecks. In this paper, we describe how to enable Parallel Deep Neural Network Training on the IBM Blue Gene/Q (BG/Q) computer system. Specifically, we explore DNN training using the data-parallel Hessian-free 2nd-order optimization algorithm. Such an algorithm is particularly well-suited to parallelization across a large set of loosely coupled processors. BG/Q, with its excellent inter-processor communication characteristics, is an ideal match for this type of algorithm. The paper discusses how issues regarding programming model and data-dependent imbalances are addressed. Results on large-scale speech tasks show that the performance on BG/Q scales linearly up to 4,096 processes with no loss in accuracy. This allows us to train neural networks using billions of training examples in a few hours.
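
To make the abstract's description concrete, the following is a minimal, illustrative Python sketch (not the authors' code) of the data-parallel Hessian-free pattern it describes: each MPI rank holds a shard of the training data, computes local gradients and curvature-vector products, and the per-rank results are combined with allreduce inside a conjugate-gradient solve. The toy least-squares model, the mpi4py usage, and all hyperparameters are placeholder assumptions for illustration, not details taken from the paper.

# Hedged sketch of data-parallel Hessian-free optimization (toy model).
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD

def local_gradient(w, x_shard, y_shard):
    # Gradient of 0.5/N * ||Xw - y||^2 on this rank's shard only
    # (a least-squares toy standing in for the DNN loss).
    return x_shard.T @ (x_shard @ w - y_shard) / len(y_shard)

def local_hess_vec(w, v, x_shard):
    # Curvature (Gauss-Newton) vector product on the local shard only.
    return x_shard.T @ (x_shard @ v) / x_shard.shape[0]

def allreduce_mean(local, comm):
    # Sum per-rank contributions across all processes, then average.
    total = np.empty_like(local)
    comm.Allreduce(local, total, op=MPI.SUM)
    return total / comm.Get_size()

def hessian_free_step(w, x_shard, y_shard, comm, cg_iters=20, damping=1e-3):
    # Global gradient = average of per-rank gradients (one allreduce).
    g = allreduce_mean(local_gradient(w, x_shard, y_shard), comm)
    # Conjugate gradient on the damped system A d = -g, where every
    # curvature-vector product A p is itself a data-parallel allreduce.
    d = np.zeros_like(w)
    r = -g.copy()
    p = r.copy()
    rs = r @ r
    for _ in range(cg_iters):
        Ap = allreduce_mean(local_hess_vec(w, p, x_shard), comm) + damping * p
        alpha = rs / (p @ Ap)
        d += alpha * p
        r -= alpha * Ap
        rs_new = r @ r
        if np.sqrt(rs_new) < 1e-8:
            break
        p = r + (rs_new / rs) * p
        rs = rs_new
    return w + d

if __name__ == "__main__":
    rng = np.random.default_rng(comm.Get_rank())
    x_shard = rng.standard_normal((1000, 50))   # this rank's data shard
    y_shard = rng.standard_normal(1000)
    w = np.zeros(50)
    for step in range(5):
        w = hessian_free_step(w, x_shard, y_shard, comm)

Run with, for example, mpirun -np 4 python hf_sketch.py; the same communication pattern (local products, global allreduce) is what lets this style of algorithm scale to thousands of loosely coupled processes on a machine such as BG/Q.
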
ISSN: 1045-9219, 1558-2183
DOI: 10.1109/TPDS.2016.2626289