Title:
CQS-Attention: Scaling Up the Standard Attention Computation for Infinitely Long Sequences
Contributors:
Bian, Yiming; Somani, Arun K.; Department of Electrical and Computer Engineering
Publisher Information:
Institute of Electrical and Electronics Engineers (IEEE)
Publication Year:
2025
Collection:
Digital Repository @ Iowa State University
Document Type:
Article in journal/newspaper
File Description:
application/pdf
Language:
English
Rights:
Copyright 2025 The Authors. This open access article is licensed under a Creative Commons Attribution 4.0 License.
Accession Number:
edsbas.8CED8D2D
Database:
BASE

Further Information

Transformer models suffer from unaffordably high memory consumption when the sequence is long and standard self-attention is used. We developed a sequence-parallelism scheme called CQS-Attention that breaks the limit on sequence length. A long sequence is divided into multiple overlapping subsequences. The attention of each subsequence is computed independently and gathered into the final, exact attention of the original long sequence. CQS-Attention is a fork-join parallel model comprising three components: the Scheduler, the Workers, and the Tiler. The Scheduler partitions the computation responsibility equally and in a completely mutually exclusive manner, and ensures that the local subsequence length is minimal. Each Worker independently computes the standard attention of its assigned subsequence and transfers the local results to the Tiler, which produces the final attention. CQS-Attention makes attention computation embarrassingly parallel. Hence, it performs well in terms of single-device memory consumption, computation time, mathematical stability, and scalability. More importantly, it is fully compatible with all state-of-the-art attention optimizations. Our code and supplementary information (SI) are available at https://github.com/CQS-Attention/CQS_Attention.

This article is published as Bian, Yiming, and Arun K. Somani. "CQS-Attention: Scaling Up the Standard Attention Computation for Infinitely Long Sequences." IEEE Access (2025). doi: https://doi.org/10.1109/ACCESS.2025.3544550.
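
The abstract describes the fork-join structure only at a high level. As a rough illustration of the general idea (independent workers each computing standard attention on their own slice, with the pieces gathered into the exact full result), the following minimal NumPy sketch partitions the queries across workers and concatenates the per-chunk outputs. This is a hypothetical toy, not the authors' overlapping-subsequence Scheduler/Workers/Tiler design; the names forkjoin_attention, worker_attention, and num_workers are illustrative assumptions, and the actual implementation is in the linked repository.

```python
# Hypothetical sketch (not the authors' CQS-Attention implementation):
# fork-join attention where each worker computes standard attention for its
# own query chunk and a final join step concatenates the exact chunk outputs.
import numpy as np
from concurrent.futures import ThreadPoolExecutor

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract row max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def worker_attention(q_chunk, K, V, scale):
    """Standard scaled dot-product attention for one query chunk against full K/V."""
    scores = (q_chunk @ K.T) * scale
    return softmax(scores) @ V

def forkjoin_attention(Q, K, V, num_workers=4):
    """Split queries evenly (fork), run independent workers, concatenate (join)."""
    scale = 1.0 / np.sqrt(Q.shape[-1])
    chunks = np.array_split(Q, num_workers, axis=0)
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        outs = list(pool.map(lambda q: worker_attention(q, K, V, scale), chunks))
    return np.concatenate(outs, axis=0)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    Q = rng.standard_normal((1024, 64))
    K = rng.standard_normal((1024, 64))
    V = rng.standard_normal((1024, 64))
    out = forkjoin_attention(Q, K, V)
    # Reference: unpartitioned standard attention; the gathered result matches exactly.
    ref = softmax((Q @ K.T) / np.sqrt(64)) @ V
    assert np.allclose(out, ref)
```

Because softmax rows are independent, partitioning the queries preserves exactness; the paper's scheme instead partitions the sequence itself into minimal overlapping subsequences so that each worker's memory footprint is bounded, which the sketch above does not capture.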