Publication Type

Journal Article

Version

publishedVersion

Publication Date

11-2014

Abstract

Due to limited time and resources, web software engineers need support in identifying vulnerable code. A practical approach to predicting vulnerable code would enable them to prioritize security auditing efforts. In this paper, we propose using a set of hybrid (static+dynamic) code attributes that characterize input validation and input sanitization code patterns and are expected to be significant indicators of web application vulnerabilities. Because static and dynamic program analyses complement each other, both techniques are used to extract the proposed attributes in an accurate and scalable way. Current vulnerability prediction techniques rely on the availability of data labeled with vulnerability information for training. For many real-world applications, past vulnerability data is often not available or at least not complete. Hence, to address both situations where labeled past data is fully available or not, we apply both supervised and semi-supervised learning when building vulnerability predictors based on hybrid code attributes. Given that semi-supervised learning is entirely unexplored in this domain, we describe how to use this learning scheme effectively for vulnerability prediction. We performed empirical case studies on seven open source projects where we built and evaluated supervised and semi-supervised models. When cross-validated with fully available labeled data, the supervised models achieve an average of 77 percent recall and 5 percent probability of false alarm for predicting SQL injection, cross-site scripting, remote code execution and file inclusion vulnerabilities. With a low amount of labeled data, when compared to the supervised model, the semi-supervised model showed an average improvement of 24 percent higher recall and 3 percent lower probability of false alarm, thus suggesting semi-supervised learning may be a preferable solution for many real-world applications where vulnerability data is missing.
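The two measures reported in the abstract, recall and probability of false alarm, can be illustrated with a minimal sketch. It assumes the conventional definitions (recall = TP / (TP + FN), probability of false alarm = FP / (FP + TN)); the class and function names and the sample labels are hypothetical and are not taken from the paper or its tooling.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Confusion:
    """Counts from comparing predictions against known vulnerability labels."""
    tp: int  # vulnerable, predicted vulnerable
    fp: int  # not vulnerable, predicted vulnerable
    tn: int  # not vulnerable, predicted not vulnerable
    fn: int  # vulnerable, predicted not vulnerable

    def recall(self) -> float:
        # Share of truly vulnerable items that the predictor flagged.
        positives = self.tp + self.fn
        return self.tp / positives if positives else 0.0

    def false_alarm(self) -> float:
        # Share of non-vulnerable items that the predictor wrongly flagged.
        negatives = self.fp + self.tn
        return self.fp / negatives if negatives else 0.0


def tally(actual, predicted) -> Confusion:
    """Build confusion counts from two equal-length sequences of booleans."""
    tp = fp = tn = fn = 0
    for is_vuln, flagged in zip(actual, predicted):
        if is_vuln and flagged:
            tp += 1
        elif flagged:
            fp += 1
        elif is_vuln:
            fn += 1
        else:
            tn += 1
    return Confusion(tp, fp, tn, fn)


# Hypothetical labels for ten code units (True = vulnerable); illustration only.
actual = [True, True, True, True, False, False, False, False, False, False]
predicted = [True, True, True, False, True, False, False, False, False, False]

result = tally(actual, predicted)
print(f"recall={result.recall():.2f}, pf={result.false_alarm():.2f}")
```

A good predictor in this sense pushes recall toward 1 while keeping the probability of false alarm near 0, which is the trade-off the reported averages describe.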

Keywords

Vulnerability prediction, security measures, input validation and sanitization, program analysis, empirical study

Discipline

Information Security | Programming Languages and Compilers

Research Areas

Cybersecurity

Publication

IEEE Transactions on Dependable and Secure Computing

Volume

12

Issue

6

First Page

688

Last Page

707

ISSN

1545-5971

Identifier

10.1109/TDSC.2014.2373377

Publisher

Institute of Electrical and Electronics Engineers (IEEE)

Citation

SHAR, Lwin Khin; BRIAND, Lionel; and TAN, Hee Beng Kuan. Web application vulnerability prediction using hybrid program analysis and machine learning. (2014). IEEE Transactions on Dependable and Secure Computing. 12, (6), 688-707.
Available at: https://ink.library.smu.edu.sg/sis_research/4895

Creative Commons License

This work is licensed under a Creative Commons Attribution-NonCommercial-No Derivative Works 4.0 International License.

Additional URL

https://doi.org/10.1109/TDSC.2014.2373377

Included in

Information Security Commons, Programming Languages and Compilers Commons

Research Collection School Of Computing and Information Systems

Web application vulnerability prediction using hybrid program analysis and machine learning
