Publication Type

Journal Article

Version

publishedVersion

Publication Date

10-2025

Abstract

Website owner identification aims to recognize the organization or individual who owns a given website that is served on the web. It is a crucial step for cyberspace surveying and mapping, playing a significant role in cyberspace administration and governance. Existing widely employed solutions for website owner identification mainly fall into two paradigms: (1) querying the public information databases such as WHOIS, which store the Internet resource’s registered users or assignees; and (2) directly extracting the organization or individual name of the website owner from the webpage using the technique of named entity recognition. However, the former is less reliable due to the incomplete, encrypted, and outdated records in the public information databases. Meanwhile, the latter requires that the webpages explicitly and precisely present their owner names without ambiguity, which is often hard to guarantee in practice. To address these limitations, we propose to formulate website owner identification as a problem of webpage representation learning, thereby introducing a novel representation learning framework empowered by large language model-based text Rewriting and Multi-level contrastive learning, named ReMon. First, we devise a prompt to rewrite the webpages using large language models, which effectively filters out noise from the original webpages. Second, we model website–website, website–owner, and owner–owner interactions through multi-level contrastive learning, fully utilizing the self-supervision signals on long-tail items to learn the multi-level constraints. Third, we design a retrieval-based prediction framework and a clustering-based framework to apply websites’ and owners’ representations for different scenarios of the website owner identification task. To evaluate ReMon under our formulation, we construct two datasets based on real-world data.
Compared to existing approaches, our ReMon can address the challenging scenarios when valid information cannot be found in public information databases and the owner’s name does not appear on the webpage. Meanwhile, the experimental results show that ReMon outperforms all representation learning-based baselines and significantly enhances training efficiency. The code is available at https://github.com/tuchen9/ReMon.

Keywords

Website Owner Identification, Contrastive Learning, Text Representation Learning, Large Language Model

Discipline

Artificial Intelligence and Robotics | Databases and Information Systems

Research Areas

Intelligent Systems and Optimization

Areas of Excellence

Digital transformation

Publication

ACM Transactions on Knowledge Discovery from Data

Volume

19

Issue

9

First Page

1

Last Page

39

ISSN

1556-4681

Identifier

10.1145/3767155

Publisher

Association for Computing Machinery (ACM)

Additional URL

https://doi.org/10.1145/3767155
