Research Collection School Of Computing and Information Systems

Large-scale online feature selection for ultra-high dimensional sparse data

Publication Type

Journal Article

Version

acceptedVersion

Publication Date

8-2017

Abstract

Feature selection (FS) is an important technique in machine learning and data mining, especially for large scale high-dimensional data. Most existing studies have been restricted to batch learning, which is often inefficient and poorly scalable when handling big data in real world. As real data may arrive sequentially and continuously, batch learning has to retrain the model for the new coming data, which is very computationally intensive. Online feature selection (OFS) is a promising new paradigm that is more efficient and scalable than batch learning algorithms. However, existing online algorithms usually fall short in their inferior efficacy. In this article, we present a novel second-order OFS algorithm that is simple yet effective, very fast and extremely scalable to deal with large-scale ultra-high dimensional sparse data streams. The basic idea is to exploit the second-order information to choose the subset of important features with high confidence weights. Unlike existing OFS methods that often suffer from extra high computational cost, we devise a novel algorithm with a MaxHeap-based approach, which is not only more effective than the existing first order algorithms, but also significantly more efficient and scalable. Our extensive experiments validated that the proposed technique achieves highly competitive accuracy as compared with state-of-The-Art batch FS methods, meanwhile it consumes significantly less computational cost that is orders of magnitude lower. Impressively, on a billion-scale synthetic dataset (1-billion dimensions, 1-billion non-zero features, and 1- million samples), the proposed algorithm takes less than 3 minutes to run on a single PC.

Keywords

Feature selection, Second-order online learning, Sparsity, Ultra-high dimensionality

Discipline

Databases and Information Systems | Numerical Analysis and Computation | Theory and Algorithms

Research Areas

Data Science and Engineering

Publication

ACM Transactions on Knowledge Discovery from Data

Volume

Issue

First Page

48-1

Last Page

ISSN

1556-4681

Identifier

10.1145/3070646

Publisher

Association for Computing Machinery (ACM)

Citation

WU, Yue; HOI, Steven C. H.; MEI, Tao; and YU, Nenghai. Large-scale online feature selection for ultra-high dimensional sparse data. (2017). ACM Transactions on Knowledge Discovery from Data. 11, (4), 48-1-22.
Available at: https://ink.library.smu.edu.sg/sis_research/3781

Copyright Owner and License

LARC and Authors

Creative Commons License

This work is licensed under a Creative Commons Attribution-NonCommercial-No Derivative Works 4.0 International License.

Additional URL

https://doi.org/10.1145/3070646

Download

Find it in your library

Included in

Databases and Information Systems Commons, Numerical Analysis and Computation Commons, Theory and Algorithms Commons

COinS

Research Collection School Of Computing and Information Systems

Large-scale online feature selection for ultra-high dimensional sparse data

Publication Type

Version

Publication Date

Abstract

Keywords

Discipline

Research Areas

Publication

Volume

Issue

First Page

Last Page

ISSN

Identifier

Publisher

Citation

Copyright Owner and License

Creative Commons License

Additional URL

Included in

Search

Links

Browse

Links

Research Collection School Of Computing and Information Systems

Large-scale online feature selection for ultra-high dimensional sparse data

Author

Publication Type

Version

Publication Date

Abstract

Keywords

Discipline

Research Areas

Publication

Volume

Issue

First Page

Last Page

ISSN

Identifier

Publisher

Citation

Copyright Owner and License

Creative Commons License

Additional URL

Included in

Share

Search

Links

Browse

Links