Publication Type
Conference Proceeding Article
Version
acceptedVersion
Publication Date
7-2025
Abstract
Despite their success in specific scenarios, existing foundation agents still struggle to generalize across various virtual scenarios, mainly due to the dramatically different encapsulations of environments with manually designed observation and action spaces. To handle this issue, we propose the General Computer Control (GCC) setting to restrict foundation agents to interact with software through the most unified and standardized interface, i.e., using screenshots as input and keyboard and mouse actions as output. We introduce Cradle, a modular and flexible LMM-powered framework, as a preliminary attempt towards GCC. Enhanced by six key modules, Information Gathering, Self-Reflection, Task Inference, Skill Curation, Action Planning, and Memory, Cradle is able to understand input screenshots and output executable code for low-level keyboard and mouse control after high-level planning and information retrieval, so that Cradle can interact with any software and complete long-horizon complex tasks without relying on any built-in APIs. Experimental results show that Cradle exhibits remarkable generalizability and impressive performance across four previously unexplored commercial video games (Red Dead Redemption 2, Cities: Skylines, Stardew Valley and Dealer’s Life 2), five software applications (Chrome, Outlook, Feishu, Meitu and CapCut), and a comprehensive benchmark, OSWorld. With a unified interface to interact with any software, Cradle greatly extends the reach of foundation agents, thus paving the way for generalist agents.
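The GCC interface described in the abstract (screenshots as the only input, keyboard and mouse actions as the only output) can be illustrated with a minimal Python sketch. This is not Cradle's implementation: every name here (KeyPress, MouseClick, run_episode, fake_capture, scripted_policy) is hypothetical, the screenshot is a placeholder byte string, and the policy is a fixed script standing in for the LMM-driven modules.

```python
from dataclasses import dataclass
from typing import Callable, List, Union


@dataclass(frozen=True)
class KeyPress:
    key: str


@dataclass(frozen=True)
class MouseClick:
    x: int
    y: int
    button: str = "left"


# Hypothetical action and observation types for this sketch.
Action = Union[KeyPress, MouseClick]
Screenshot = bytes  # raw image bytes; the only observation the agent receives


def run_episode(
    capture: Callable[[], Screenshot],
    policy: Callable[[Screenshot, List[Action]], List[Action]],
    execute: Callable[[Action], None],
    max_steps: int,
) -> List[Action]:
    """Screenshot in, keyboard/mouse actions out; no application API is called."""
    history: List[Action] = []
    for _ in range(max_steps):
        frame = capture()
        actions = policy(frame, history)
        if not actions:  # an empty plan means the policy considers the task done
            break
        for action in actions:
            execute(action)
            history.append(action)
    return history


# Stub components, for illustration only.
def fake_capture() -> Screenshot:
    return b"placeholder screenshot bytes"


def scripted_policy(frame: Screenshot, history: List[Action]) -> List[Action]:
    # Click once, then press one key, then stop.
    if len(history) == 0:
        return [MouseClick(x=100, y=200)]
    if len(history) == 1:
        return [KeyPress(key="enter")]
    return []


executed: List[Action] = []
trace = run_episode(fake_capture, scripted_policy, executed.append, max_steps=10)
```

In this sketch the loop never inspects application state directly; swapping the stubs for a real screen grabber, an LMM-backed planner, and an input injector would leave run_episode unchanged, which is the point of a unified interface.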
Discipline
Artificial Intelligence and Robotics
Areas of Excellence
Digital transformation
Publication
Proceedings of the 42nd International Conference on Machine Learning, Vancouver, Canada, 2025 July 13-19
First Page
58658
Last Page
58725
City or Country
Vancouver
Citation
TAN, Weihao, et al.
Cradle: Empowering foundation agents towards general computer control. (2025). Proceedings of the 42nd International Conference on Machine Learning, Vancouver, Canada, 2025 July 13-19. 58658-58725.
Available at: https://ink.library.smu.edu.sg/sis_research/10797
Creative Commons License

This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.