Publication Type
Conference Proceeding Article
Version
publishedVersion
Publication Date
11-2025
Abstract
Automated audio captioning (AAC) benefits from incorporating external context to interpret complex sounds, but doing so with retrieval-augmented generation (RAG) at inference is sometimes infeasible due to data availability or incurs significant latency and complexity. We propose DistillCaps, a novel training-time framework that leverages RAG to guide knowledge distillation for improved audio-language alignment, while lessening the reliance on retrieval during inference. In our framework, a RAG-equipped teacher model retrieves relevant textual information (e.g., similar captions) for each audio clip and uses it for training to generate context-enriched captions. Simultaneously, a student model is trained to imitate this teacher, learning to produce high-quality captions from audio alone. We further introduce a Fast Fourier Transform (FFT) adapter in the audio encoder to inject frequency-domain features, enhancing the quality of audio representations before feeding them into the language model. The result is an efficient captioning model that retains RAG’s contextual benefits without its deployment overhead. On standard AAC benchmarks (AudioCaps and Clotho), DistillCaps achieves performance competitive with or exceeding prior RAG-based systems despite using no retrieval at test time. Notably, our distilled model matches state-of-the-art captioning results under real-time settings, and when optionally allowing retrieval, it even outperforms previous models by up to 4% on the Clotho benchmark on the in-distribution setting, demonstrating the effectiveness of RAG-guided distillation for audio-language alignment. Code and dataset are available here
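Illustrative note: the abstract describes a student model trained to imitate a RAG-equipped teacher but does not state the training objective. The sketch below shows one common form of knowledge distillation, a temperature-scaled KL divergence between the teacher's and the student's per-token output distributions. It is an assumption for illustration only, not the DistillCaps loss; the function names, the temperature value, and the toy logits are all hypothetical.

```python
import math


def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over a list of logits (numerically stable)."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """Mean KL(teacher || student) over token positions, scaled by T^2.

    Both arguments are lists of per-token logit rows of equal shape.
    This is a generic distillation objective, assumed here for illustration.
    """
    total = 0.0
    for s_row, t_row in zip(student_logits, teacher_logits):
        p = softmax(t_row, temperature)  # teacher distribution (target)
        q = softmax(s_row, temperature)  # student distribution
        total += sum(
            pi * (math.log(pi) - math.log(qi))
            for pi, qi in zip(p, q)
            if pi > 0.0
        )
    return (temperature ** 2) * total / len(student_logits)


# Toy example: two token positions, vocabulary of three tokens.
# The teacher logits stand in for a model that saw retrieved captions;
# the student logits stand in for a model that saw audio only.
teacher = [[2.0, 0.5, -1.0], [0.1, 0.2, 3.0]]
student = [[1.0, 1.0, 0.0], [0.0, 0.5, 2.0]]

loss_same = distillation_loss(teacher, teacher)
loss_diff = distillation_loss(student, teacher)
```

In this sketch the loss is zero when the student reproduces the teacher's distribution and positive otherwise, so minimizing it pulls the audio-only student toward the context-enriched teacher.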
Keywords
Audio Captioning, Knowledge Distillation, RAG
Discipline
Artificial Intelligence and Robotics | Programming Languages and Compilers
Research Areas
Intelligent Systems and Optimization
Areas of Excellence
Digital transformation
Publication
CIKM '25: Proceedings of the 34th ACM International Conference on Information and Knowledge Management, Seoul, Korea, November 10-14
First Page
2346
Last Page
2356
Identifier
10.1145/3746252.3761269
Publisher
ACM
City or Country
New York
Citation
PHAM, Thinh; DIEP, Nghiem; LIAO, Lizi; and NGUYEN, Binh.
DistillCaps: Enhancing audio-language alignment in captioning via retrieval-augmented knowledge distillation. (2025). CIKM '25: Proceedings of the 34th ACM International Conference on Information and Knowledge Management, Seoul, Korea, November 10-14. 2346-2356.
Available at: https://ink.library.smu.edu.sg/sis_research/10755
Creative Commons License

This work is licensed under a Creative Commons Attribution-NonCommercial-No Derivative Works 4.0 International License.
Additional URL
https://doi.org/10.1145/3746252.3761269
Included in
Artificial Intelligence and Robotics Commons, Programming Languages and Compilers Commons