Publication Type
Conference Proceeding Article
Version
publishedVersion
Publication Date
10-2025
Abstract
Multimodal models leverage complementary information across modalities to enrich feature representations. While visual information has shown potential for representing structure in some combinatorial optimization problems (COPs), its application to complex scheduling problems such as the Flexible Job Shop Scheduling Problem (FJSP) remains underexplored. Current learning-based FJSP solvers rely predominantly on handcrafted state features; this dependence can lead to inconsistencies and may not fully capture the problem's intricate dynamics. Crucially, these methods overlook visual modalities. Visual representations offer a distinct advantage by inherently capturing the global topological structure and complex resource interactions within the FJSP state. Unlike localized handcrafted features, this holistic, structural view provides a richer foundation for understanding scheduling complexity and making informed decisions. To overcome these limitations, we leverage visual information, known for representing topological structures and providing richer state representations, and introduce the AO-framework. This multimodal feature fusion approach enhances handcrafted state features by integrating insights from visual data. Our core contribution is a novel fusion mechanism based on orthogonal projection and local attention. Unlike traditional methods that rely on simple concatenation of visual data, our method reduces redundancy by projecting global image-derived features onto local handcrafted features. This process extracts information distinct to the visual modality, significantly improving the quality and complementarity of the resulting state features and enabling more informed scheduling decisions. To our knowledge, the AO-framework is the first multimodal framework applied to scheduling problems, demonstrating the significant potential of visual information in this domain.
Extensive experiments across various FJSP solvers and datasets confirm that our framework yields substantial enhancements in solution quality, decision-making capabilities, and generalization.
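The orthogonal-projection idea described in the abstract can be sketched as follows; the function name `orthogonal_fusion` and the simple additive combination are illustrative assumptions for a single feature vector, not the paper's exact mechanism (which additionally employs local attention):

```python
import numpy as np

def orthogonal_fusion(local_feat, global_feat):
    """Sketch of redundancy-reducing fusion: keep only the component of a
    global image-derived feature that is orthogonal to the local
    handcrafted feature, then add it back to the local feature.
    (Illustrative; the actual framework also applies local attention.)"""
    # Project the global feature onto the local feature direction.
    scale = np.dot(global_feat, local_feat) / (np.dot(local_feat, local_feat) + 1e-8)
    projection = scale * local_feat
    # The residual carries visual information absent from the local feature.
    orthogonal = global_feat - projection
    # Combine the local feature with the non-redundant visual component.
    return local_feat + orthogonal

# Example: the redundant part of g along h is removed before fusion.
h = np.array([1.0, 0.0])      # local handcrafted feature
g = np.array([3.0, 4.0])      # global image-derived feature
fused = orthogonal_fusion(h, g)  # approx. [1.0, 4.0]
```

Subtracting the projection ensures the visual contribution is complementary rather than duplicating what the handcrafted features already encode.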
Keywords
Flexible Job-Shop Scheduling Problem, Multimodal Fusion, Combinatorial Optimization, Reinforcement Learning
Discipline
Artificial Intelligence and Robotics
Research Areas
Intelligent Systems and Optimization
Areas of Excellence
Sustainability
Publication
MM '25: Proceedings of the 33rd ACM International Conference on Multimedia, Dublin, Ireland, 2025 October 27-31
First Page
2496
Last Page
2505
Identifier
10.1145/3746027.37545
Publisher
ACM
City or Country
New York
Citation
ZHAO, Peng; CAO, Zhiguang; WANG, Di; SONG, Wen; PANG, Wei; ZHOU, You; and JIANG, Yuan.
Visual-enhanced multimodal framework for flexible job shop scheduling problem. (2025). MM '25: Proceedings of the 33rd ACM International Conference on Multimedia, Dublin, Ireland, 2025 October 27-31. 2496-2505.
Available at: https://ink.library.smu.edu.sg/sis_research/10561
Creative Commons License

This work is licensed under a Creative Commons Attribution-NonCommercial-No Derivative Works 4.0 International License.
Additional URL
https://doi.org/10.1145/3746027.37545