Publication Type

Conference Proceeding Article

Version

publishedVersion

Publication Date

11-2025

Abstract

Automated audio captioning (AAC) benefits from incorporating external context to interpret complex sounds, but doing so with retrieval-augmented generation (RAG) at inference is sometimes infeasible due to data availability or incurs significant latency and complexity. We propose DistillCaps, a novel training-time framework that leverages RAG to guide knowledge distillation for improved audio-language alignment, while lessening the reliance on retrieval during inference. In our framework, a RAG-equipped teacher model retrieves relevant textual information (e.g., similar captions) for each audio clip and uses it for training to generate context-enriched captions. Simultaneously, a student model is trained to imitate this teacher, learning to produce high-quality captions from audio alone. We further introduce a Fast Fourier Transform (FFT) adapter in the audio encoder to inject frequency-domain features, enhancing the quality of audio representations before feeding them into the language model. The result is an efficient captioning model that retains RAG’s contextual benefits without its deployment overhead. On standard AAC benchmarks (AudioCaps and Clotho), DistillCaps achieves performance competitive with or exceeding prior RAG-based systems despite using no retrieval at test time. Notably, our distilled model matches state-of-the-art captioning results under real-time settings, and when optionally allowing retrieval, it even outperforms previous models by up to 4% on the Clotho benchmark on the in-distribution setting, demonstrating the effectiveness of RAG-guided distillation for audio-language alignment. Code and dataset are available here
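
The teacher-student setup described in the abstract can be illustrated with a generic knowledge-distillation objective. The sketch below is not the DistillCaps loss, which this record does not specify; it is the common temperature-scaled formulation, applied to one token position, and it assumes the teacher logits come from a retrieval-augmented model while the student sees audio only. The function names and the `alpha` and `temperature` parameters are hypothetical.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits, target_index,
                      alpha=0.5, temperature=2.0):
    """Generic distillation objective for one token position (illustrative).

    alpha weights the cross-entropy against the reference caption token;
    (1 - alpha) weights the KL divergence from the teacher distribution
    to the student distribution, both softened by `temperature`.
    """
    # Supervised term: cross-entropy with the ground-truth token.
    student_probs = softmax(student_logits)
    cross_entropy = -math.log(student_probs[target_index])

    # Imitation term: KL(teacher || student) on softened distributions.
    teacher_soft = softmax(teacher_logits, temperature)
    student_soft = softmax(student_logits, temperature)
    kl = sum(p * math.log(p / q)
             for p, q in zip(teacher_soft, student_soft) if p > 0)

    # The temperature**2 factor keeps the scale of the soft-target term
    # comparable across temperatures, as is conventional in distillation.
    return alpha * cross_entropy + (1 - alpha) * (temperature ** 2) * kl


# Example: the student (audio only) disagrees with the teacher (audio + retrieved text).
loss = distillation_loss([0.2, 1.5, 0.3], [0.1, 2.5, 0.2], target_index=1)
```

When the student's logits equal the teacher's, the imitation term vanishes and only the supervised term remains, which is the sense in which the student learns to reproduce the teacher's behaviour without retrieval at test time.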

Keywords

Audio Captioning, Knowledge Distillation, RAG

Discipline

Artificial Intelligence and Robotics | Programming Languages and Compilers

Research Areas

Intelligent Systems and Optimization

Areas of Excellence

Digital transformation

Publication

CIKM '25: Proceedings of the 34th ACM International Conference on Information and Knowledge Management, Seoul, Korea, November 10-14

First Page

2346

Last Page

2356

Identifier

10.1145/3746252.3761269

Publisher

ACM

City or Country

New York

Additional URL

https://doi.org/10.1145/3746252.3761269
