Publication Type

PhD Dissertation

Version

publishedVersion

Publication Date

6-2024

Abstract

This thesis studies the acceleration and optimization of Transformer inference, a subject of increasing importance with the emergence of Large Language Models (LLMs). It addresses the challenges posed by two inherent properties of Transformers during inference: the quadratic complexity of the attention mechanism in sequence length and the sequential nature of autoregressive inference. The research is structured into three parts. The first part enhances the learning capabilities of non-autoregressive Transformers, achieving a 15.0x acceleration on machine translation tasks. The second part focuses on lossless acceleration through speculative decoding, where the proposed algorithm, Glide with CAPE, is shown to accelerate 33-billion-parameter LLMs by approximately 2.5 times. The third part reduces the complexity of the attention mechanism to a constant level by means of a Markov autoregressive Transformer, without significantly compromising model performance. Together, these contributions address the computational challenges of Transformer models and support more efficient deployment of LLMs in real-world applications.

Keywords

Large Language Model, Neural Network, Language Processing, General AI

Degree Awarded

PhD in Computer Science

Discipline

Artificial Intelligence and Robotics | Programming Languages and Compilers

Supervisor(s)

JIANG, Jing

First Page

1

Last Page

72

Publisher

Singapore Management University

City or Country

Singapore

Citation

DU, Cunxiao. Towards faster inference of transformers: Strategies for accelerating decoding processes. (2024). 1-72.
Available at: https://ink.library.smu.edu.sg/etd_coll/613

Copyright Owner and License

Author

Creative Commons License

This work is licensed under a Creative Commons Attribution-NonCommercial-No Derivative Works 4.0 International License.

Included in

Artificial Intelligence and Robotics Commons, Programming Languages and Compilers Commons

Dissertations and Theses Collection (Open Access)

Towards faster inference of transformers: Strategies for accelerating decoding processes
