Publication Type

Conference Proceeding Article

Version

publishedVersion

Publication Date

6-2022

Abstract

Deep-learning-based video models, which achieve remarkable performance on action recognition tasks, have recently been shown to be vulnerable to adversarial samples, even those generated in the black-box setting. However, existing black-box attack methods are impractical against video models in real-world applications because they require a large number of queries. To this end, we propose to boost the efficiency of black-box attacks on video recognition models. Although videos carry rich temporal information, adjacent frames contain redundant spatial information. This motivates us to introduce the adaptive temporal grouping (ATG) method, which groups video frames by the similarity of their features, extracted with an ImageNet-pretrained image model. By selecting one key-frame from each group, ATG lets any black-box attack method optimize the adversarial perturbations over the key-frames instead of all frames, and the gradient estimated for each key-frame is shared with the other frames in its group. To balance the efficiency and precision of the estimated gradients, ATG adaptively adjusts the number of groups according to the magnitude of the current perturbation and the current query count. Through extensive experiments on the HMDB-51 and UCF-101 datasets, we demonstrate that ATG reduces the number of queries by more than 10% for the targeted attack.
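The grouping and gradient-sharing idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names `group_frames` and `share_key_frame_gradients` are hypothetical, the choice of cosine similarity, contiguous groups, and splitting at the least-similar adjacent frames is an assumption, and the number of groups is passed in as a fixed value because the abstract does not give the adaptive rule.

```python
import numpy as np

def group_frames(features, num_groups):
    """Split T frames into contiguous groups (assumed grouping rule).

    features: array of shape (T, D), one feature vector per frame,
    e.g. taken from a pretrained image model.
    Group boundaries are placed at the (num_groups - 1) adjacent
    frame pairs with the lowest cosine similarity.
    """
    feats = np.asarray(features, dtype=float)
    num_frames = feats.shape[0]
    num_groups = max(1, min(num_groups, num_frames))
    unit = feats / (np.linalg.norm(feats, axis=1, keepdims=True) + 1e-12)
    # Cosine similarity between frame t and frame t + 1.
    sims = np.sum(unit[:-1] * unit[1:], axis=1)
    # A cut at position c starts a new group at frame index c.
    cuts = np.sort(np.argsort(sims, kind="stable")[: num_groups - 1]) + 1
    return [g.tolist() for g in np.split(np.arange(num_frames), cuts)]

def share_key_frame_gradients(groups, key_grads, num_frames):
    """Copy each key-frame's estimated gradient to every frame in its group."""
    first = np.asarray(key_grads[0], dtype=float)
    full = np.zeros((num_frames,) + first.shape)
    for group, grad in zip(groups, key_grads):
        full[group] = np.asarray(grad, dtype=float)
    return full
```

With this layout, a black-box attack would only need to estimate one gradient per group (for the key-frame) and could then expand it to a full-video gradient with `share_key_frame_gradients`.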

Keywords

Black-box attacks, Video recognition models, Adaptive temporal grouping, Model security

Discipline

Graphics and Human Computer Interfaces

Publication

ICMR '22: Proceedings of the 2022 International Conference on Multimedia Retrieval, Newark, New Jersey, USA, June 27-30

First Page

587

Last Page

593

ISBN

9781450392389

Identifier

10.1145/3512527.3531411

Publisher

ACM

City or Country

New York

Additional URL

https://doi.org/10.1145/3512527.3531411
