Publication Type

Conference Proceeding Article

Version

publishedVersion

Publication Date

8-2021

Abstract

Graph neural networks (GNNs) have emerged as the state-of-the-art representation learning methods on graphs and often rely on a large amount of labeled data to achieve satisfactory performance. Recently, to relieve the label scarcity issue, several works have proposed to pre-train GNNs in a self-supervised manner by distilling transferable knowledge from unlabeled graph structures. Unfortunately, these pre-training frameworks mainly target homogeneous graphs, while real interaction systems usually constitute large-scale heterogeneous graphs containing different types of nodes and edges, which raises new challenges of structural heterogeneity and scalability for graph pre-training. In this paper, we first study the problem of pre-training on large-scale heterogeneous graphs and propose a novel GNN pre-training framework named PT-HGNN. The proposed PT-HGNN designs both node- and schema-level pre-training tasks to contrastively preserve heterogeneous semantic and structural properties as a form of transferable knowledge for various downstream tasks. In addition, a relation-based personalized PageRank is proposed to sparsify the large-scale heterogeneous graph for efficient pre-training. Extensive experiments on one of the largest public heterogeneous graphs (OAG) demonstrate that PT-HGNN significantly outperforms various state-of-the-art baselines.
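
The relation-based personalized PageRank mentioned in the abstract builds on the standard personalized PageRank recurrence, pi = alpha * e + (1 - alpha) * pi * P, computed separately per relation type so that sparsification respects edge semantics. The sketch below is a minimal, hypothetical illustration of this general idea, not the paper's actual implementation; the function name, the dense NumPy formulation, and the top-k cutoff are all assumptions for exposition.

```python
# Minimal sketch: personalized PageRank restricted to a single relation
# type, used to keep each node's top-k neighbors. Illustrative only;
# not PT-HGNN's actual implementation.
import numpy as np

def relation_ppr_topk(adj, alpha=0.15, iters=50, k=5):
    """adj: dense adjacency matrix of ONE relation type, shape (n, n)."""
    n = adj.shape[0]
    # Row-normalize adj into a transition matrix for this relation;
    # rows with zero out-degree stay all-zero.
    deg = adj.sum(axis=1, keepdims=True)
    P = np.divide(adj, deg, out=np.zeros_like(adj, dtype=float), where=deg > 0)
    # Power iteration; row i of `pi` is the PPR vector restarting at node i.
    pi = np.eye(n)
    for _ in range(iters):
        pi = alpha * np.eye(n) + (1 - alpha) * pi @ P
    # Sparsify: keep only the k highest-scoring neighbors per node.
    topk = np.argsort(-pi, axis=1)[:, :k]
    return pi, topk
```

Under this reading, running the procedure once per relation type and taking the union of the retained edges would yield a sparsified heterogeneous graph on which pre-training can proceed more efficiently.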

Keywords

Heterogeneous graph, Self-supervised learning, Pre-training

Discipline

Databases and Information Systems

Research Areas

Data Science and Engineering

Publication

Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD '21), Virtual Event, August 14-18

First Page

756

Last Page

766

Identifier

10.1145/3447548.3467396

Publisher

ACM

City or Country

New York
