Result: Systolic Arrays and Structured Pruning Co-design for Efficient Transformers in Edge Systems

Title:
Systolic Arrays and Structured Pruning Co-design for Efficient Transformers in Edge Systems
Contributors:
Universidad Complutense de Madrid = Complutense University of Madrid [Madrid] (UCM), Laboratoire Bordelais de Recherche en Informatique (LaBRI), Université de Bordeaux (UB)-École Nationale Supérieure d'Électronique, Informatique et Radiocommunications de Bordeaux (ENSEIRB)-Centre National de la Recherche Scientifique (CNRS), Ecole Polytechnique Fédérale de Lausanne (EPFL), European Project: 101016776,H2020-FETPROACT-2018-2020,H2020-FETPROACT-2020-01,FVLLMONTI(2021)
Source:
GLSVLSI '25: Great Lakes Symposium on VLSI 2025. :320-327
Publisher Information:
CCSD; ACM, 2025.
Publication Year:
2025
Collection:
collection:CNRS
collection:ENSEIRB
collection:UNIV-BORDEAUX
collection:UNIVERSITE-BORDEAUX
collection:DDRS-TEST-CJ
Original Identifier:
HAL: hal-05242392
Document Type:
Conference papers (conferenceObject)
Language:
English
Relation:
info:eu-repo/semantics/altIdentifier/doi/10.1145/3716368.3735158; info:eu-repo/grantAgreement//101016776/EU/Ferroelectric Vertical Low energy Low latency low volume Modules fOr Neural network Transformers In 3D/FVLLMONTI
DOI:
10.1145/3716368.3735158
Rights:
info:eu-repo/semantics/OpenAccess
Accession Number:
edshal.hal.05242392v1
Database:
HAL

Further Information

Efficient deployment of resource-intensive transformers on edge devices necessitates cross-stack optimization. We therefore study the interplay between structured pruning and systolic acceleration, matching the size of pruned blocks to the systolic array dimensions. In this setting, computations on pruned weight blocks can be skipped, reducing run time and energy consumption, but potentially impacting quality of service (QoS). To evaluate the trade-offs between systolic array size and sparsity opportunities, we present a novel co-design framework that integrates algorithmic optimization, system simulation, and hardware design. Using transformer-based speech recognition and machine translation as case studies, we analyze how configuration choices across the stack affect performance metrics. The results demonstrate that structured pruning on systems featuring systolic array acceleration can effectively increase performance while maintaining high QoS. System-wide speedups of up to 44% from structured pruning and quantization were measured, with only 1.4% word-error-rate degradation on the standard LibriSpeech dataset.

CCS Concepts: • Hardware → Hardware-software codesign; • Computer systems organization → Systolic arrays; • Computing methodologies → Neural networks.
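The abstract's core idea — aligning the granularity of structured pruning with the systolic array's tile size so that whole passes over zeroed weight blocks can be skipped — can be sketched in NumPy. This is a hypothetical illustration under assumptions of my own (the function names `block_prune` and `tiled_matmul_skip`, an 8×8 tile size, and L2-norm tile ranking), not the paper's actual co-design framework:

```python
import numpy as np

def block_prune(W, block=8, sparsity=0.5):
    """Zero out the lowest-norm (block x block) tiles of W.

    `block` is chosen to match the systolic array dimension, so each
    pruned tile corresponds to one whole array pass that can be skipped.
    (Hypothetical sketch, not the paper's implementation.)
    """
    rows, cols = W.shape
    assert rows % block == 0 and cols % block == 0
    # View W as a grid of tiles: tiles[i, r, j, c] == W[i*block+r, j*block+c]
    tiles = W.reshape(rows // block, block, cols // block, block)
    norms = np.linalg.norm(tiles, axis=(1, 3))        # one L2 norm per tile
    k = int(sparsity * norms.size)                    # number of tiles to prune
    thresh = np.partition(norms.ravel(), k)[k] if k else -np.inf
    mask = (norms >= thresh).astype(W.dtype)          # 1 = keep, 0 = pruned
    pruned = (tiles * mask[:, None, :, None]).reshape(rows, cols)
    return pruned, mask

def tiled_matmul_skip(Wp, x, mask, block=8):
    """Tile-wise mat-vec that skips pruned tiles, mimicking a systolic
    array that never schedules passes over all-zero weight blocks."""
    rows, cols = Wp.shape
    y = np.zeros(rows, dtype=Wp.dtype)
    for i in range(rows // block):
        for j in range(cols // block):
            if mask[i, j]:  # only non-pruned tiles cost compute
                y[i*block:(i+1)*block] += (
                    Wp[i*block:(i+1)*block, j*block:(j+1)*block]
                    @ x[j*block:(j+1)*block]
                )
    return y
```

With `sparsity=0.5` and 8×8 tiles on a 16×16 weight matrix, half of the four tiles are zeroed, and the skipping mat-vec reproduces the dense product of the pruned matrix while performing only the surviving tiles' work — the source of the run-time and energy savings described in the abstract.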