Research Collection School Of Computing and Information Systems

DexBERT: Effective, task-agnostic and fine-grained representation learning of Android bytecode

Tiezhu SUN
Kevin ALLIX
Kisub KIM
Xin ZHOU
Dongsun KIM
David LO, Singapore Management University
Tegawendé F. BISSYANDE
Jacques KLEIN

Publication Type

Journal Article

Version

publishedVersion

Publication Date

10-2023

Abstract

The automation of an increasingly large number of software engineering tasks is becoming possible thanks to Machine Learning (ML). One foundational building block in the application of ML to software artifacts is the representation of these artifacts (e.g., source code or executable code) into a form that is suitable for learning. Traditionally, researchers and practitioners have relied on manually selected features, based on expert knowledge, for the task at hand. Such knowledge is sometimes imprecise and generally incomplete. To overcome this limitation, many studies have leveraged representation learning, delegating to ML itself the job of automatically devising suitable representations and selections of the most relevant features. Yet, in the context of Android problems, existing models are either limited to coarse-grained whole-app level (e.g., apk2vec) or conducted for one specific downstream task (e.g., smali2vec). Thus, the produced representation may turn out to be unsuitable for fine-grained tasks or cannot generalize beyond the task that they have been trained on. Our work is part of a new line of research that investigates effective, task-agnostic, and fine-grained universal representations of bytecode to mitigate both of these two limitations. Such representations aim to capture information relevant to various low-level downstream tasks (e.g., at the class-level). We are inspired by the field of Natural Language Processing, where the problem of universal representation was addressed by building Universal Language Models, such as BERT, whose goal is to capture abstract semantic information about sentences, in a way that is reusable for a variety of tasks. We propose DexBERT, a BERT-like Language Model dedicated to representing chunks of DEX bytecode, the main binary format used in Android applications.
We empirically assess whether DexBERT is able to model the DEX language and evaluate the suitability of our model in three distinct class-level software engineering tasks: Malicious Code Localization, Defect Prediction, and Component Type Classification. We also experiment with strategies to deal with the problem of catering to apps having vastly different sizes, and we demonstrate one example of using our technique to investigate what information is relevant to a given task.

Keywords

Representation learning, Android app analysis, Code representation, Malicious code localization, Defect prediction, Predictive models, Operating systems, Software engineering

Discipline

Artificial Intelligence and Robotics | OS and Networks | Software Engineering

Research Areas

Cybersecurity; Intelligent Systems and Optimization; Software and Cyber-Physical Systems

Publication

IEEE Transactions on Software Engineering

Volume

49

Issue

10

First Page

4691

Last Page

4706

ISSN

0098-5589

Identifier

10.1109/TSE.2023.3310874

Publisher

Institute of Electrical and Electronics Engineers

Citation

SUN, Tiezhu; ALLIX, Kevin; KIM, Kisub; ZHOU, Xin; KIM, Dongsun; LO, David; BISSYANDE, Tegawendé F.; and KLEIN, Jacques. DexBERT: Effective, task-agnostic and fine-grained representation learning of Android bytecode. (2023). IEEE Transactions on Software Engineering. 49, (10), 4691-4706.
Available at: https://ink.library.smu.edu.sg/sis_research/8509

Creative Commons License

This work is licensed under a Creative Commons Attribution-NonCommercial-No Derivative Works 4.0 International License.

Additional URL

https://doi.org/10.1109/TSE.2023.3310874

Included in

Artificial Intelligence and Robotics Commons, OS and Networks Commons, Software Engineering Commons
