Publication Type
Journal Article
Version
acceptedVersion
Publication Date
9-2025
Abstract
Scaling language models to handle longer contexts introduces substantial memory challenges due to the growing cost of key-value (KV) caches. Motivated by the efficiency gains of hybrid models and the broad availability of pretrained large transformer backbones, we explore transitioning transformer models into hybrid architectures for more efficient generation. In this work, we propose LightTransfer, a lightweight method that transforms models such as LLaMA into hybrid variants. Our approach identifies lazy layers -- those focusing on recent or initial tokens -- and replaces their full attention with streaming attention. This transformation can be performed without any training for long-context understanding tasks, or with minimal fine-tuning for o1-like long-reasoning generation tasks that require stronger reasoning capabilities. Experiments across diverse benchmarks and models (e.g., LLaMA, Mistral, QwQ-STILL) demonstrate that, even when half of the layers are identified as lazy, LightTransfer achieves up to a 2.17× throughput improvement with minimal performance loss on LongBench, and the advanced o1-like long-reasoning model QwQ-STILL achieves 53.3% on the math benchmark AIME24.
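The abstract describes the core mechanism at a high level: score each layer by how much attention mass it concentrates on initial (sink) and recent tokens, flag high-scoring layers as lazy, and restrict those layers to streaming attention so their KV cache stays bounded. The sketch below illustrates that idea under stated assumptions; the function names, the 0.95 threshold, the sink/window sizes, and the NumPy-based scoring are illustrative choices, not the authors' implementation.

    # Illustrative sketch only: flagging "lazy" layers and building a
    # streaming-attention mask. All names and thresholds are assumptions.
    import numpy as np

    def lazy_score(attn, n_sink=4, window=64):
        # attn: [seq_len, seq_len] causal softmax weights for one layer
        # (one head, or averaged over heads). Returns the mean attention
        # mass each query places on the first n_sink keys plus its
        # `window` most recent keys.
        seq_len, _ = attn.shape
        per_query = []
        for q in range(seq_len):
            lo = max(n_sink, q - window + 1)   # avoid double-counting sink keys
            mass = attn[q, :n_sink].sum() + attn[q, lo:q + 1].sum()
            per_query.append(min(float(mass), 1.0))
        return float(np.mean(per_query))

    def identify_lazy_layers(per_layer_attn, threshold=0.95):
        # Flag layers whose sink+recent attention mass exceeds the threshold.
        return [i for i, a in enumerate(per_layer_attn) if lazy_score(a) > threshold]

    def streaming_mask(q_len, k_len, n_sink=4, window=64):
        # Boolean mask letting each query attend only to the n_sink initial
        # keys and its most recent `window` keys (causal), so the KV cache
        # for a lazy layer stays bounded.
        mask = np.zeros((q_len, k_len), dtype=bool)
        for q in range(q_len):
            mask[q, :n_sink] = True
            mask[q, max(0, q - window + 1):q + 1] = True
        return mask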
Discipline
Databases and Information Systems
Research Areas
Data Science and Engineering
Areas of Excellence
Digital transformation
Publication
Transactions on Machine Learning Research
ISSN
2835-8856
First Page
1
Last Page
26
Citation
ZHANG, Xuan; ZHANG, Fengzhuo; DU, Cunxiao; DU, Chao; PANG, Tianyu; GAO, Wei; and LIN, Min.
LightTransfer: Your long-context LLM is secretly a hybrid model with effortless adaptation. (2025). Transactions on Machine Learning Research. 1-26.
Available at: https://ink.library.smu.edu.sg/sis_research/10602
Creative Commons License

This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.
Additional URL
https://openreview.net/forum?id=kne4vWICr0