Publication Type

PhD Dissertation

Version

publishedVersion

Publication Date

7-2019

Abstract

The rapid advances of the Web have changed the ways information is distributed and exchanged among individuals and organizations. Various content from different domains are generated daily and contributed by users' daily activities, such as posting messages in a microblog platform, or collaborating in a question and answer site. To deal with such tremendous volume of user generated content, there is a need for approaches that are able to handle the mass amount of available data and to extract knowledge hidden in the user generated content. This dissertation attempts to make sense of the generated content to help in three concrete tasks.

In the first work performed as part of the dissertation, a machine learning approach was proposed to predict a customer's feedback behavior based on her first feedback tweet. First, a few categories of customers were observed based on their feedback frequency and the sentiment of the feedback. Three main categories were identified: spiteful, one-off, and kind. By using the Twitter API, user profile and content features were extracted. Next, a model was built to predict the category of a customer given his or her first feedback. The experiment results show that the prediction model performs better than a baseline approach in terms of precision, recall, and F-measure.

In the second work, a method was proposed to predict readers' emotion distribution affected by a news article. The approach analyzed affective annotations provided by readers of news articles taken from a non-English online news site. A new corpus was created from the annotated articles. A domain-specific emotion lexicon was constructed along with word embedding features. Finally, a multi-target regression model was built from a set of features extracted from online news articles. By combining lexicon and word embedding features, the regression model is able to predict the emotion distribution with RMSE scores between 0.067 to 0.232. For the final work of this dissertation, an approach was proposed to improve the effectiveness of knowledge extraction tasks by performing cross-platform analysis. This approach is based on transfer representation learning and word embedding to leverage information extracted from a source platform which contains rich domain-related content to solve tasks in another platform (considered as target platform) with less domain-related content. We first build a word embedding model as a representation learned from the source platform, and use the model to improve the performance of knowledge extraction tasks in the target platform. We experiment with Software Engineering Stack Exchange and Stack Overflow as source platforms, and two different target platforms, i.e., Twitter and YouTube. Our experiments show that our approach improves performance of existing work for the tasks of finding software-related tweets and filtering informative YouTube comments.

Degree Awarded

PhD in Information Systems

Discipline

Numerical Analysis and Scientific Computing | Software Engineering

Supervisor(s)

LO, David

First Page

1

Last Page

108

Publisher

Singapore Management University

City or Country

Singapore

Copyright Owner and License

Author

Share

COinS