Title:
CQS-Attention: Scaling Up the Standard Attention Computation for Infinitely Long Sequences
Contributors:
Bian, Yiming; Somani, Arun K.; Department of Electrical and Computer Engineering
Publisher Information:
Institute of Electrical and Electronics Engineers (IEEE)
Publication Year:
2025
Collection:
Digital Repository @ Iowa State University
Document Type:
Article in journal/newspaper
File Description:
application/pdf
Language:
English
Rights:
Copyright 2025 The Authors. This open access article is licensed under a Creative Commons Attribution 4.0 License.
Accession Number:
edsbas.8CED8D2D
Database:
BASE

Further Information

Transformer models suffer from unaffordably high memory consumption when the sequence is long and standard self-attention is used. We developed a sequence-parallelism scheme called CQS-Attention that breaks the limit on sequence length. A long sequence is divided into multiple overlapping subsequences. The attention of each subsequence is computed independently and gathered into the final, exact attention of the original long sequence. CQS-Attention is a fork-join parallel model comprising three components: the Scheduler, the Workers, and the Tiler. The Scheduler partitions the computation responsibility equally and in a completely mutually exclusive manner, and ensures that the local subsequence length is minimal. Each Worker independently computes the standard attention of its assigned subsequence and transfers the local results to the Tiler, which produces the final attention. CQS-Attention makes attention computation embarrassingly parallel. Hence, it performs well in terms of single-device memory consumption, computation time, mathematical stability, and scalability. More importantly, it is fully compatible with all state-of-the-art attention optimizations. Our code and supplementary information (SI) are available at https://github.com/CQS-Attention/CQS_Attention.

This article is published as Bian, Yiming, and Arun K. Somani. "CQS-Attention: Scaling Up the Standard Attention Computation for Infinitely Long Sequences." IEEE Access (2025). doi: https://doi.org/10.1109/ACCESS.2025.3544550.
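
The abstract describes the fork-join structure only at a high level. As a rough illustration of the general idea (independent workers each computing standard attention on their own slice, with the pieces gathered into the exact full result), the following minimal NumPy sketch partitions the queries across workers and concatenates the per-chunk outputs. This is a hypothetical toy, not the authors' overlapping-subsequence Scheduler/Workers/Tiler design; the names forkjoin_attention, worker_attention, and num_workers are illustrative assumptions, and the actual implementation is in the linked repository.

```python
# Hypothetical sketch (not the authors' CQS-Attention implementation):
# fork-join attention where each worker computes standard attention for its
# own query chunk and a final join step concatenates the exact chunk outputs.
import numpy as np
from concurrent.futures import ThreadPoolExecutor

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract row max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def worker_attention(q_chunk, K, V, scale):
    """Standard scaled dot-product attention for one query chunk against full K/V."""
    scores = (q_chunk @ K.T) * scale
    return softmax(scores) @ V

def forkjoin_attention(Q, K, V, num_workers=4):
    """Split queries evenly (fork), run independent workers, concatenate (join)."""
    scale = 1.0 / np.sqrt(Q.shape[-1])
    chunks = np.array_split(Q, num_workers, axis=0)
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        outs = list(pool.map(lambda q: worker_attention(q, K, V, scale), chunks))
    return np.concatenate(outs, axis=0)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    Q = rng.standard_normal((1024, 64))
    K = rng.standard_normal((1024, 64))
    V = rng.standard_normal((1024, 64))
    out = forkjoin_attention(Q, K, V)
    # Reference: unpartitioned standard attention; the gathered result matches exactly.
    ref = softmax((Q @ K.T) / np.sqrt(64)) @ V
    assert np.allclose(out, ref)
```

Because softmax rows are independent, partitioning the queries preserves exactness; the paper's scheme instead partitions the sequence itself into minimal overlapping subsequences so that each worker's memory footprint is bounded, which the sketch above does not capture.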