A simple defense against adversarial attacks on heatmap explanations

Laura Rieger*, Lars Kai Hansen

*Corresponding author for this work

Research output: Chapter in Book/Report/Conference proceeding › Article in proceedings › Research › peer-review


Abstract

With machine learning models being used in increasingly sensitive applications, we rely on interpretability methods to show that no discriminating attributes were used for classification. A potential concern is so-called "fair-washing": manipulating a model such that the features actually used are hidden and more innocuous features are shown to be important instead. In this work we present an effective defense against such adversarial attacks on neural networks. By a simple aggregation of multiple explanation methods, the network becomes robust against manipulation. This holds even when the attacker has exact knowledge of the model weights and the explanation methods used.
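
The sketch below illustrates the aggregation idea mentioned in the abstract: several heatmap explanations for the same input are normalised and combined (here by a simple mean), so that manipulating any single explanation method has less influence on the final heatmap. The function names, the per-map normalisation, and the use of a plain mean are illustrative assumptions, not the authors' exact procedure.

```python
import numpy as np


def normalize_heatmap(h: np.ndarray) -> np.ndarray:
    """Scale a heatmap to [0, 1] so different methods are comparable."""
    h = np.abs(h)
    rng = h.max() - h.min()
    return (h - h.min()) / rng if rng > 0 else np.zeros_like(h)


def aggregate_explanations(heatmaps: list[np.ndarray]) -> np.ndarray:
    """Average several normalised heatmaps into one aggregated explanation."""
    stacked = np.stack([normalize_heatmap(h) for h in heatmaps], axis=0)
    return stacked.mean(axis=0)


if __name__ == "__main__":
    # Three hypothetical 8x8 heatmaps standing in for different explanation
    # methods (e.g. saliency, input*gradient, integrated gradients) applied
    # to the same input.
    rng = np.random.default_rng(0)
    maps = [rng.normal(size=(8, 8)) for _ in range(3)]
    combined = aggregate_explanations(maps)
    print(combined.shape)  # (8, 8)
```

The intuition is that an attacker who fine-tunes the model to fool one particular explanation method still leaves the other methods largely intact, so their average remains informative.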
Original language: English
Title of host publication: Proceedings of 2020 Workshop on Human Interpretability in Machine Learning
Number of pages: 22
Publication date: 2020
Publication status: Published - 2020
Event: 5th Annual Workshop on Human Interpretability in Machine Learning - Virtual event
Duration: 17 Jul 2020 → 17 Jul 2020

Workshop

Workshop: 5th Annual Workshop on Human Interpretability in Machine Learning
Location: Virtual event
Period: 17/07/2020 → 17/07/2020

