The advent of online social media including Facebook, Twitter, Flickr and Youtube has drawn massive attention in recent years. These online platforms generate massive data capturing the behavior of multiple types of human actors as they interact with one another and with resources such as pictures, books and videos. Unfortunately, the openness of these platforms often leaves them highly susceptible to abuse by suspicious entities such as spammers. It therefore becomes increasingly important to automatically identify these suspicious entities and eliminate their threats. We call these suspicious entities anomalies in social data, as they often hold different agenda comparing to normal ones and manifest anomalous behaviors. In this dissertation, we are interested in two kinds of anomalous behaviors in social data, namely the unusual coalition among a collection of entities and the unusual conflicting opinions among entities. The two kinds of anomalous behaviors lead us to define two types of anomalies, namely, anomaly collections of the same entity type and anomalous nodes of different entity types in bipartite graphs. This dissertation introduces two anomaly collection definitions, namely, Extreme Rank Anomalous Collection (or ERAC) and Coherent Anomaly Collection (or CAC). An ERAC is a set of entities that cluster toward the top or bottom ranks, when all entities in the population are ranked on certain features. We propose a statistical model to quantify the anomalousness of an ERAC, and present the exact as well as heuristic algorithms for finding top-K ERACs. We then propose the follow-up problem of expanding top- K ERACs to anomalous supersets. We apply the algorithms for ERAC detection and expansion on both synthetic and real-life datasets, including a web spam, an IMDB and a Chinese online forum dataset. Results show that our algorithms achieve higher precisions compared to existing spam and anomaly detection methods. CAC is defined based on ERAC, emphasizing the coherence among members of an ERAC. As top-K ERACs are often overlapping with each other, for applications where disjoint anomaly collections are of interest, we propose to find top-K disjoint CACs with exact and heuristic algorithms. Experiments on both synthetic and reallife datasets, including a Twitter, a web spam, and a Chinese online forum dataset show that our approach discovers not only injected anomaly collections in synthetic datasets but also real-life coherent collections of hashtag spammer, web spammers and opinion spammers which are hard to detect by clustering-based methods. We detect the second type of anomalies in a bipartite graph, where nodes in one partite represent human actors, nodes in the other partite represent resources, and edges carry the agreeing and disagreeing opinions from human actors to resources. The anomalousness of nodes in one partite depends on that of their connected nodes in the other partite. Previous studies have shown that this mutual dependency can be positive or negative. We integrate both mutual dependency principles to model the anomalous behavior of nodes. We formulate our principles and design an iterative algorithm to simultaneously compute the anomaly scores of nodes in both partites. Our method is applied on synthetic graphs and the results show that our algorithm outperforms existing ones with only positive or negative mutual dependency principles. Results on two real-life datasets, namely Goodreads and Buzzcity, show that our method is able to detect suspected spammed books in Goodreads and fraudulent publishers in mobile advertising networks with higher precision than existing approaches.
anomaly detection, social media, collective anomaly, anomalous behavior, algorithm, graph anomaly
PhD in Information Systems
Computer Sciences | Databases and Information Systems | Social Media
LIM, Ee Peng
Singapore Management University
City or Country
Anomaly Detection on Social Data. (2013). 1-135. Dissertations and Theses Collection (Open Access).
Available at: http://ink.library.smu.edu.sg/etd_coll/90
Copyright Owner and License
Creative Commons License
This work is licensed under a Creative Commons Attribution-Noncommercial-No Derivative Works 4.0 License.