Publication Type

Conference Proceeding Article

Version

publishedVersion

Publication Date

10-2025

Abstract

Learning user preferences in recommendation systems is enriched by multimodal features, such as textual and visual content, and amplified by multi-interest modeling with Variational AutoEncoders (VAEs). However, prior efforts are limited by a single-modality focus and cumbersome, parameter-heavy architectures. To address these limitations, we introduce an innovative solution that blends the semantic richness of multimodal data with the representational power of multi-representation VAEs. Drawing inspiration from Mixture of Experts (MoE), we cast each VAE as an expert tailored to a specific modality, then fuse them via a novel parameter-merging function into a lean, unified model. This approach efficiently captures diverse user preferences behind multimodal data with minimal complexity. Rigorous experiments on real-world benchmarks show our method outshines state-of-the-art baselines while slashing parameter counts. Our work sets a new, streamlined standard for multimodal, multi-interest recommendation systems.
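
The abstract does not specify the paper's actual parameter-merging function; the sketch below only illustrates the general idea it describes, fusing per-modality VAE "experts" into one unified model, here by simple weighted averaging of their parameters in PyTorch. The class names, layer sizes, and the averaging rule are illustrative assumptions, not the authors' method.

# Minimal sketch (assumed, not the paper's method): each modality-specific VAE
# is an "expert"; the experts are fused into one model by weighted averaging
# of their parameters.
import torch
import torch.nn as nn


class SimpleVAE(nn.Module):
    """A tiny VAE over user-item interaction vectors (illustrative only)."""

    def __init__(self, n_items: int, latent_dim: int = 64):
        super().__init__()
        self.encoder = nn.Linear(n_items, 2 * latent_dim)  # produces mu and logvar
        self.decoder = nn.Linear(latent_dim, n_items)
        self.latent_dim = latent_dim

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        mu, logvar = self.encoder(x).chunk(2, dim=-1)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)  # reparameterization
        return self.decoder(z)


def merge_experts(experts: list[SimpleVAE], weights: list[float]) -> SimpleVAE:
    """Fuse per-modality expert VAEs into a single model by averaging parameters."""
    merged = SimpleVAE(experts[0].decoder.out_features, experts[0].latent_dim)
    merged_state = {
        name: sum(w * e.state_dict()[name] for e, w in zip(experts, weights))
        for name in merged.state_dict()
    }
    merged.load_state_dict(merged_state)
    return merged


if __name__ == "__main__":
    n_items = 1000
    text_vae, visual_vae = SimpleVAE(n_items), SimpleVAE(n_items)
    # Assume each expert has already been trained on its own modality's features.
    unified = merge_experts([text_vae, visual_vae], weights=[0.5, 0.5])
    scores = unified(torch.rand(4, n_items))  # recommendation scores for 4 users
    print(scores.shape)  # torch.Size([4, 1000])

In practice the merging weights could themselves be learned or made modality-dependent, which is one way an MoE-style gate could be folded into the merge; the uniform 0.5/0.5 weights above are purely for illustration.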

Keywords

multimodal recommendation, multi-representation VAE, parameter merging

Discipline

Artificial Intelligence and Robotics

Research Areas

Data Science and Engineering

Publication

MM '25: Proceedings of the 33rd ACM International Conference on Multimedia, Dublin, Ireland, October 27-31

First Page

6412

Last Page

6420

Identifier

10.1145/3746027.3754597

Publisher

ACM

City or Country

New York

Additional URL

https://doi.org/10.1145/3746027.3754597
