Publication Type
Conference Proceeding Article
Publication Date
9-2010
Abstract
Malware clustering and classification are important tools that enable analysts to prioritize their malware analysis efforts. The recent emergence of fully automated methods for malware clustering and classification that report high accuracy suggests that this problem may largely be solved. In this paper, we report the results of our attempt to confirm our conjecture that the method of selecting ground-truth data in prior evaluations biases their results toward high accuracy. To examine this conjecture, we apply clustering algorithms from a different domain (plagiarism detection), first to the dataset used in a prior work's evaluation and then to a wholly new malware dataset, to see if clustering algorithms developed without attention to subtleties of malware obfuscation are nevertheless successful. While these studies provide conflicting signals as to the correctness of our conjecture, our investigation of possible reasons uncovers, we believe, a cautionary note regarding the significance of highly accurate clustering results: such results can be impacted by testing on a dataset with a biased cluster-size distribution.
Keywords
malware clustering and classification, plagiarism detection
Discipline
Information Security
Research Areas
Information Security and Trust
Publication
Recent Advances in Intrusion Detection: 13th International Symposium, RAID 2010, Ottawa, Ontario, Canada, September 15-17, 2010: Proceedings
Volume
6307
First Page
238
Last Page
255
ISBN
9783642155123
Identifier
10.1007/978-3-642-15512-3_13
Publisher
Springer Verlag
City or Country
Ottawa, Ontario, Canada
Citation
LI, Peng; LIU, Limin; GAO, Debin; and Reiter, Michael K.
On Challenges in Evaluating Malware Clustering. (2010). Recent Advances in Intrusion Detection: 13th International Symposium, RAID 2010, Ottawa, Ontario, Canada, September 15-17, 2010: Proceedings. 6307, 238-255.
Available at: https://ink.library.smu.edu.sg/sis_research/1319
Creative Commons License
This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.
Additional URL
http://dx.doi.org/10.1007/978-3-642-15512-3_13