Publication Type

Conference Proceeding Article

Version

acceptedVersion

Publication Date

1-2026

Abstract

Music, as a unique and integral element of human life, is characterized by its complex structures, intricate details, and the fusion of multimodal information. Recent studies advance music understanding by leveraging knowledge and reasoning capabilities derived from Large Language Models (LLMs). However, they often lack compatibility and fail to fully utilize the complementary strengths of diverse representations (e.g., ABC, MIDI, waveform). To address these limitations, we propose a unified music-language model framework, named UniMuLM, transitioning from single-representation approaches to the integration of multiple music representations for LLMs. Unifying different music representation formats poses challenges such as patch integrity and boundary ambiguity that arise from temporal discrepancies across these representations. To address these issues, UniMuLM employs a unified encoder that hierarchically aligns representations across multiple granularities, using contrastive learning and cross-reconstruction training to support coherent integration. Fine-tuned in multiple stages on open-source datasets, UniMuLM demonstrates the potential to handle dual-representation inputs. Notably, it achieves performance competitive with specialized waveform-only models on music understanding tasks, while surpassing open-source baselines in downstream applications such as music knowledge answering and ABC melody completion.

Keywords

Multimodal Language Model, Music Language Model, Music Understanding, Sound and Music Computing

Discipline

Artificial Intelligence and Robotics | Databases and Information Systems | Music

Research Areas

Data Science and Engineering

Publication

Multimedia Modeling: 32nd International Conference on Multimedia Modeling, MMM 2026, Prague, Czech Republic, January 29-31, Proceedings

First Page

89

Last Page

103

ISBN

9789819569564

Identifier

10.1007/978-981-95-6957-1_7

Publisher

Springer

City or Country

Cham

Additional URL

https://doi.org/10.1007/978-981-95-6957-1_7
