Research Collection School Of Computing and Information Systems

Learning to classify e-mail

Publication Type

Journal Article

Version

publishedVersion

Publication Date

5-2007

Abstract

In this paper we study supervised and semi-supervised classification of e-mails. We consider two tasks: filing e-mails into folders and spam e-mail filtering. Firstly, in a supervised learning setting, we investigate the use of random forest for automatic e-mail filing into folders and spam e-mail filtering. We show that random forest is a good choice for these tasks as it runs fast on large and high-dimensional databases, is easy to tune and is highly accurate, outperforming popular algorithms such as decision trees, support vector machines and naive Bayes. We introduce a new accurate feature selector with linear time complexity. Secondly, we examine the applicability of the semi-supervised co-training paradigm for spam e-mail filtering by employing random forests, support vector machines, decision trees and naive Bayes as base classifiers. The study shows that a classifier trained on a small set of labelled examples can be successfully boosted using unlabelled examples to an accuracy rate only 5% lower than that of a classifier trained on all labelled examples. We investigate the performance of co-training with one natural feature split and show that in the domain of spam e-mail filtering it can be as competitive as co-training with two natural feature splits. (C) 2006 Elsevier Inc. All rights reserved.

Keywords

e-mail classification into folders, spam e-mail filtering, random forest, co-training, machine learning

Discipline

Databases and Information Systems

Research Areas

Data Science and Engineering

Publication

Information Sciences

Volume

177

Issue

10

First Page

2167

Last Page

2187

ISSN

0020-0255

Identifier

10.1016/j.ins.2006.12.005

Publisher

Elsevier

Citation

KOPRINSKA, Irena; POON, Josiah; CLARK, James; and CHAN, Jason Yuk Hin. Learning to classify e-mail. (2007). Information Sciences. 177, (10), 2167-2187.
Available at: https://ink.library.smu.edu.sg/sis_research/7703

Creative Commons License

This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.

Additional URL

https://doi.org/10.1016/j.ins.2006.12.005

Included in

Databases and Information Systems Commons
