Publication Type
Conference Proceeding Article
Version
acceptedVersion
Publication Date
1-2026
Abstract
Music, as a unique and integral element of human life, is characterized by its complex structures, intricate details, and the fusion of multimodal information. Recent studies advance music understanding by leveraging the knowledge and reasoning capabilities of Large Language Models (LLMs). However, they often lack compatibility and fail to fully utilize the complementary strengths of diverse representations (e.g., ABC, MIDI, waveform). To address these limitations, we propose a unified music-language model framework, named UniMuLM, transitioning from single-representation approaches to the integration of multiple music representations for LLMs. Unifying different music representation formats poses challenges such as patch integrity and boundary ambiguity, which arise from temporal discrepancies across these representations. To address these issues, UniMuLM employs a unified encoder that hierarchically aligns representations across multiple granularities, using contrastive learning and cross-reconstruction training to support coherent integration. Fine-tuned in multiple stages on open-source datasets, UniMuLM demonstrates the potential to handle dual-representation inputs. Notably, it achieves performance competitive with specialized waveform-only models on music understanding tasks, while surpassing open-source baselines in downstream applications such as music knowledge answering and ABC melody completion.
Keywords
Multimodal Language Model, Music Language Model, Music Understanding, Sound and Music Computing
Discipline
Artificial Intelligence and Robotics | Databases and Information Systems | Music
Research Areas
Data Science and Engineering
Publication
Multimedia Modeling: 32nd International Conference on Multimedia Modeling, MMM 2026, Prague, Czech Republic, January 29-31, Proceedings
First Page
89
Last Page
103
ISBN
9789819569564
Identifier
10.1007/978-981-95-6957-1_7
Publisher
Springer
City or Country
Cham
Citation
TU, Teng; LIU, Xiaohao; MA, Yunshan; QI, Ji; and CHUA, Tat-Seng.
Integrating symbolic and waveform music into Large Language Models. (2026). Multimedia Modeling: 32nd International Conference on Multimedia Modeling, MMM 2026, Prague, Czech Republic, January 29-31, Proceedings. 89-103.
Available at: https://ink.library.smu.edu.sg/sis_research/11026
Creative Commons License

This work is licensed under a Creative Commons Attribution-NonCommercial-No Derivative Works 4.0 International License.
Additional URL
https://doi.org/10.1007/978-981-95-6957-1_7
Included in
Artificial Intelligence and Robotics Commons, Databases and Information Systems Commons, Music Commons