Publication Type

PhD Dissertation

Publication Date



Abstract

This dissertation studies the problem of preparing good-quality social network data for analysis and mining. Online social networks such as Twitter, Facebook, and LinkedIn have grown rapidly in popularity. The resulting wealth of social network data gives data analysis and mining researchers an unprecedented opportunity to extract useful and actionable information in fields such as social sciences, marketing, management, and security. However, raw social network data are vast, noisy, distributed, and sensitive in nature, which poses challenges for data mining and analysis tasks in areas such as storage, efficiency, and accuracy. Many mining algorithms cannot run, or cannot produce accurate results, on such vast and messy data. Social network data preparation therefore deserves special attention, as it processes raw data and transforms them into forms usable for data mining and analysis tasks. Data preparation consists of four main steps, namely data collection, data cleaning, data reduction, and data conversion, each of which addresses different challenges of the raw data. In this dissertation, we consider three important problems related to the data collection and data conversion steps in social network data preparation.
The first problem is the sampling issue in social network data collection. Restricted by processing power and resources, most research that analyzes user-generated content from social networks relies on samples obtained via social network APIs. However, the lack of attention to the quality and potential bias of these samples reduces the effectiveness and validity of the analysis results. To fill this gap, in the first work of the dissertation, we perform an exploratory analysis of data samples obtained from social network stream APIs to understand how well the samples represent the corresponding complete data and their potential for use in various data mining tasks.
The second problem is privacy protection at the data conversion step. We identify a new type of attack in which malicious adversaries use the connection information between an anonymous victim user and some known public users in a social network to re-identify the victim and compromise identity privacy. We call this type of attack the connection fingerprint (CFP) attack. In the second work of the dissertation, we investigate the potential risk of CFP attacks on social networks and propose two efficient k-anonymity-based network conversion algorithms that protect social networks against CFP attacks while preserving the utility of the converted networks.
The third problem is utility in privacy-preserving data conversion. Existing k-anonymization algorithms convert networks to protect privacy by modifying edges, and they preserve utility by minimizing the number of edges modified. We find that this simple utility model cannot reflect the real utility changes of networks with complex structure. Consequently, existing k-anonymization algorithms designed around this model cannot guarantee that the generated social networks have high utility. To solve this problem, in the third work of this dissertation, we propose a new utility benchmark that directly measures the change in network community structure caused by a network conversion algorithm. We also design a general k-anonymization algorithm framework based on this new utility model. Our algorithm can significantly improve the utility of generated networks compared with existing algorithms.
Our work in this dissertation emphasizes the importance of data preparation for social network analysis and mining tasks. Our study of the sampling issue in social network data collection provides guidelines on whether to use sampled social network content data in research. Our work on privacy-preserving social network conversion provides methods to better protect the identity privacy of social network users while maintaining the utility of social network data.
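As a rough illustration of the connection fingerprint idea summarized in the abstract, the Python sketch below treats a user's fingerprint as the set of known public users that the user is directly connected to, and flags users whose fingerprint is shared by fewer than k users. This reading of a fingerprint, the function names `connection_fingerprints` and `at_risk_users`, and the toy graph are all assumptions made for illustration; the dissertation's formal definition and its two conversion algorithms may differ.

```python
from collections import defaultdict


def connection_fingerprints(edges, public_users):
    """Map each non-public user to the set of public users it links to.

    Assumption: edges are undirected and only direct links count.
    """
    public = set(public_users)
    neighbours = defaultdict(set)
    for u, v in edges:
        neighbours[u].add(v)
        neighbours[v].add(u)
    return {
        user: frozenset(nbrs & public)
        for user, nbrs in neighbours.items()
        if user not in public
    }


def at_risk_users(fingerprints, k):
    """Return users whose fingerprint is shared by fewer than k users."""
    groups = defaultdict(list)
    for user, fingerprint in fingerprints.items():
        groups[fingerprint].append(user)
    return sorted(
        user
        for members in groups.values()
        if len(members) < k
        for user in members
    )


# Hypothetical toy network: P1 and P2 are public, a-d are anonymized users.
edges = [("a", "P1"), ("a", "P2"), ("b", "P1"), ("b", "P2"),
         ("c", "P1"), ("a", "b"), ("c", "d")]
fingerprints = connection_fingerprints(edges, {"P1", "P2"})
print(at_risk_users(fingerprints, k=2))
```

In this toy network, users a and b share the same fingerprint and so hide each other, while c and d each have a fingerprint that no other user shares. A k-anonymity-based conversion would need to modify edges until no user is flagged.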


Keywords

social network, Twitter, stream API, sample quality, privacy, data utility

Degree Awarded

PhD in Information Systems


Databases and Information Systems | Numerical Analysis and Scientific Computing | Social Media


ZHENG, Baihua

First Page


Last Page



Singapore Management University

City or Country


Copyright Owner and License


Creative Commons License

This work is licensed under a Creative Commons Attribution-Noncommercial-No Derivative Works 4.0 License.