Title:
TorchBraid: High-Performance Layer-Parallel Training of Deep Neural Networks with MPI and GPU Acceleration.
Source:
ACM Transactions on Mathematical Software; Sep 2025, Vol. 51, Issue 3, p1-30, 30p
Database:
Complementary Index

Abstract:

TorchBraid is a high-performance implementation of layer-parallel training for deep neural networks (DNNs) supporting MPI-based parallelism and GPU acceleration. Layer-parallel training was developed to overcome the serialization inherent in the forward and backward propagation of DNNs, which limits utilization of computational resources in the strong scaling limit. To achieve this, TorchBraid integrates the PyTorch neural network framework with the state-of-the-art XBraid time-parallel library. This article presents the use and performance of TorchBraid, in addition to solutions for overcoming the algorithmic challenges inherent in combining automatic differentiation with layer-parallel training. Results are presented with and without GPU acceleration for the Tiny ImageNet and MNIST image classification data sets, as well as for recurrent neural networks. Overall, TorchBraid enables fast training of DNNs in both strong and weak scaling contexts. In addition to the TorchBraid software, several new advances in applying layer-parallel algorithms are detailed. Integration of layer-parallel with data-parallel algorithms is presented for the first time, showing the computational advantages of the combination. Standard deep learning techniques, such as batch normalization, are extended to layer-parallel training. Finally, a new approach combining layer-parallel with spatial coarsening to accelerate training for 3D image classification shows roughly a 10× speedup over serial execution. [ABSTRACT FROM AUTHOR]
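
To make the abstract's premise concrete, the following is a minimal sketch, not TorchBraid's actual API, of the viewpoint that layer-parallel training builds on: a residual network's forward propagation can be read as forward-Euler time stepping of an ODE, u_{k+1} = u_k + h·F(u_k; θ_k), so network depth plays the role of time. The `ODEBlock` class, channel counts, and step size below are illustrative assumptions; only standard PyTorch calls are used.

```python
import torch
import torch.nn as nn

class ODEBlock(nn.Module):
    """One 'layer' = one explicit Euler step of width h (illustrative)."""
    def __init__(self, channels: int, h: float):
        super().__init__()
        self.h = h
        self.F = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.ReLU(),
        )

    def forward(self, u: torch.Tensor) -> torch.Tensor:
        # Residual update u_{k+1} = u_k + h * F(u_k; theta_k)
        return u + self.h * self.F(u)

# Serial forward propagation: layer k cannot start before layer k-1
# finishes. This depth-wise serialization is the bottleneck that
# layer-parallel training removes.
num_layers, h = 8, 1.0 / 8
blocks = nn.Sequential(*[ODEBlock(channels=4, h=h) for _ in range(num_layers)])
u0 = torch.randn(2, 4, 16, 16)
out = blocks(u0)
```

In the approach the article describes, this chain of steps is distributed across MPI ranks and the layer/time recurrence is solved iteratively with XBraid's parallel-in-time multigrid, trading exact sequential propagation for an iterative method whose forward and backward passes converge to the serial result.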

Copyright of ACM Transactions on Mathematical Software is the property of Association for Computing Machinery and its content may not be copied or emailed to multiple sites without the copyright holder's express written permission. Additionally, content may not be used with any artificial intelligence tools or machine learning technologies. However, users may print, download, or email articles for individual use. This abstract may be abridged. No warranty is given about the accuracy of the copy. Users should refer to the original published version of the material for the full abstract. (Copyright applies to all Abstracts.)