Publication Type
Journal Article
Version
publishedVersion
Publication Date
10-2025
Abstract
Website owner identification aims to recognize the organization or individual who owns a given website that is served on the web. It is a crucial step for cyberspace surveying and mapping, playing a significant role in cyberspace administration and governance. Existing widely employed solutions for website owner identification mainly fall into two paradigms: (1) querying the public information databases such as WHOIS, which store the Internet resource’s registered users or assignees; and (2) directly extracting the organization or individual name of the website owner from the webpage using the technique of named entity recognition. However, the former is less reliable due to the incomplete, encrypted, and outdated records in the public information databases. Meanwhile, the latter requires that the webpages explicitly and precisely present their owner names without ambiguity, which is often hard to guarantee in practice. To address these limitations, we propose to formulate website owner identification as a problem of webpage representation learning, thereby introducing a novel representation learning framework empowered by large language model-based text Rewriting and Multi-level contrastive learning, named ReMon. First, we devise a prompt to rewrite the webpages using large language models, which effectively filters out noise from the original webpages. Second, we model website–website, website–owner, and owner–owner interactions through multi-level contrastive learning, fully utilizing the self-supervision signals on long-tail items to learn the multi-level constraints. Third, we design a retrieval-based prediction framework and a clustering-based framework to apply websites’ and owners’ representations for different scenarios of the website owner identification task. To evaluate ReMon under our formulation, we construct two datasets based on real-world data.
Compared to existing approaches, our ReMon can address the challenging scenarios when valid information cannot be found in public information databases and the owner’s name does not appear on the webpage. Meanwhile, the experimental results show that ReMon outperforms all representation learning-based baselines and significantly enhances training efficiency. The code is available at https://github.com/tuchen9/ReMon.
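As an illustration of the website–owner contrastive objective and the retrieval-based prediction step described above, the following is a minimal sketch and not the ReMon implementation. It assumes an InfoNCE-style loss (the abstract does not name the loss used), cosine similarity as the scoring function, and hand-made two-dimensional embeddings; the names `cosine`, `info_nce`, `retrieve_owner`, `owner_a`, and `owner_b` are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def info_nce(anchor, candidates, positive_index, temperature=0.1):
    """InfoNCE loss for one anchor (assumed loss, not confirmed by the source).

    Equals -log softmax(similarities / temperature)[positive_index].
    """
    logits = [cosine(anchor, c) / temperature for c in candidates]
    peak = max(logits)  # subtract the max for numerical stability
    log_denominator = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return log_denominator - logits[positive_index]

def retrieve_owner(website_vec, owner_vecs):
    """Retrieval-based prediction: pick the owner with the most similar embedding."""
    return max(owner_vecs, key=lambda name: cosine(website_vec, owner_vecs[name]))

# Toy embeddings (hypothetical): the website lies close to owner_a.
owners = {"owner_a": [1.0, 0.0], "owner_b": [0.0, 1.0]}
site = [0.9, 0.1]

predicted = retrieve_owner(site, owners)
loss_good = info_nce(site, list(owners.values()), positive_index=0)
loss_bad = info_nce(site, list(owners.values()), positive_index=1)
```

In this toy setting the loss is small when the true owner is the nearest candidate and large otherwise, which is the signal a contrastive objective of this kind would use to pull a website's representation toward its owner's representation.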
Keywords
Website Owner Identification, Contrastive Learning, Text Representation Learning, Large Language Model
Discipline
Artificial Intelligence and Robotics | Databases and Information Systems
Research Areas
Intelligent Systems and Optimization
Areas of Excellence
Digital transformation
Publication
ACM Transactions on Knowledge Discovery from Data
Volume
19
Issue
9
First Page
1
Last Page
39
ISSN
1556-4681
Identifier
10.1145/3767155
Publisher
Association for Computing Machinery (ACM)
Citation
TU, Cheng; MA, Yunshan; LI, Yang; ZHANG, Min; HU, Miao; SHI, Fan; and WANG, Xiang.
Website owner identification through multi-level contrastive representation learning. (2025). ACM Transactions on Knowledge Discovery from Data. 19, (9), 1-39.
Available at: https://ink.library.smu.edu.sg/sis_research/10875
Creative Commons License

This work is licensed under a Creative Commons Attribution-NonCommercial-No Derivative Works 4.0 International License.
Additional URL
https://doi.org/10.1145/3767155