Research Collection School Of Computing and Information Systems

Website owner identification through multi-level contrastive representation learning

Publication Type

Journal Article

Version

publishedVersion

Publication Date

10-2025

Abstract

Website owner identification aims to recognize the organization or individual that owns a given website served on the web. It is a crucial step in cyberspace surveying and mapping and plays a significant role in cyberspace administration and governance. Existing widely used solutions for website owner identification mainly fall into two paradigms: (1) querying public information databases such as WHOIS, which store the registered users or assignees of Internet resources; and (2) directly extracting the organization or individual name of the website owner from the webpage using named entity recognition. However, the former is less reliable because records in public information databases are often incomplete, encrypted, or outdated. Meanwhile, the latter requires that webpages present their owner names explicitly, precisely, and without ambiguity, which is often hard to guarantee in practice. To address these limitations, we formulate website owner identification as a webpage representation learning problem and introduce a novel representation learning framework, named ReMon, empowered by large language model-based text Rewriting and Multi-level contrastive learning. First, we devise a prompt to rewrite the webpages using large language models, which effectively filters out noise from the original webpages. Second, we model website–website, website–owner, and owner–owner interactions through multi-level contrastive learning, fully utilizing the self-supervision signals on long-tail items to learn the multi-level constraints. Third, we design a retrieval-based prediction framework and a clustering-based framework that apply the representations of websites and owners to different scenarios of the website owner identification task. To evaluate ReMon under our formulation, we construct two datasets based on real-world data. Compared to existing approaches, ReMon can address the challenging scenarios in which no valid information can be found in public information databases and the owner's name does not appear on the webpage. Meanwhile, the experimental results show that ReMon outperforms all representation learning-based baselines and significantly enhances training efficiency. The code is available at https://github.com/tuchen9/ReMon.
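
The website–owner contrastive objective and the retrieval-based prediction step described above can be illustrated with a minimal sketch. This is not the authors' implementation (see the linked repository for that): the abstract does not state the loss, so the sketch assumes the commonly used InfoNCE objective with cosine similarity, and the names `cosine`, `info_nce`, `retrieve_owner`, the temperature value, and the toy two-dimensional embeddings are all hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def info_nce(anchor, positive, negatives, temperature=0.1):
    """InfoNCE loss for one anchor (assumed objective, not confirmed by the source).

    The loss is the negative log-softmax score of the positive pair among
    the positive and all negatives.
    """
    logits = [cosine(anchor, positive) / temperature]
    logits += [cosine(anchor, n) / temperature for n in negatives]
    m = max(logits)  # subtract the max for numerical stability
    log_denom = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_denom - logits[0]

def retrieve_owner(website_vec, owner_vecs):
    """Retrieval-based prediction: pick the owner with the most similar embedding."""
    return max(owner_vecs, key=lambda name: cosine(website_vec, owner_vecs[name]))

# Hypothetical toy embeddings: two owners and one website close to owner_a.
owners = {"owner_a": [1.0, 0.0], "owner_b": [0.0, 1.0]}
site = [0.9, 0.1]

# Website-owner level: the loss is small when the true owner is the positive.
loss_matched = info_nce(site, owners["owner_a"], [owners["owner_b"]])
loss_mismatched = info_nce(site, owners["owner_b"], [owners["owner_a"]])

predicted = retrieve_owner(site, owners)
```

Under the same assumption, the website–website and owner–owner levels would reuse `info_nce` with pairs drawn from the corresponding level. Because the prediction is a nearest-neighbor lookup over owner embeddings, it does not depend on the owner's name appearing on the page.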

Keywords

Website Owner Identification, Contrastive Learning, Text Representation Learning, Large Language Model

Discipline

Artificial Intelligence and Robotics | Databases and Information Systems

Research Areas

Intelligent Systems and Optimization

Areas of Excellence

Digital transformation

Publication

ACM Transactions on Knowledge Discovery from Data

Volume

19

Issue

9

First Page

1

Last Page

39

ISSN

1556-4681

Identifier

10.1145/3767155

Publisher

Association for Computing Machinery (ACM)

Citation

TU, Cheng; MA, Yunshan; LI, Yang; ZHANG, Min; HU, Miao; SHI, Fan; and WANG, Xiang. Website owner identification through multi-level contrastive representation learning. (2025). ACM Transactions on Knowledge Discovery from Data. 19, (9), 1-39.
Available at: https://ink.library.smu.edu.sg/sis_research/10875

Creative Commons License

This work is licensed under a Creative Commons Attribution-NonCommercial-No Derivative Works 4.0 International License.

Additional URL

https://doi.org/10.1145/3767155

Included in

Artificial Intelligence and Robotics Commons, Databases and Information Systems Commons
