Abstract
With machine learning models being used for increasingly sensitive applications, we rely on interpretability methods to show that no discriminatory attributes were used for classification. A potential concern is so-called "fair-washing": manipulating a model such that the features actually used are hidden and more innocuous features are shown to be important instead. In our work we present an effective defence against such adversarial attacks on neural networks. By a simple aggregation of multiple explanation methods, the network becomes robust against manipulation. This holds even when the attacker has exact knowledge of the model weights and the explanation methods used.
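The abstract describes the defence only at a high level. Purely as an illustration of the general idea, the sketch below averages several standard gradient-based attribution maps (plain gradients, gradient×input, and a simple integrated-gradients approximation) into one explanation. The model, the choice of attribution methods, and the normalisation scheme are assumptions made for this example, not the paper's exact construction.

```python
# Minimal sketch (not the authors' implementation): aggregate several
# gradient-based explanation methods so that manipulating any single
# method has less effect on the combined explanation.
import torch
import torch.nn as nn


def saliency(model, x, target):
    """Absolute gradient of the target logit w.r.t. the input."""
    xi = x.detach().clone().requires_grad_(True)
    score = model(xi)[0, target]
    grad, = torch.autograd.grad(score, xi)
    return grad.abs()


def input_x_gradient(model, x, target):
    """Gradient-times-input attribution."""
    xi = x.detach().clone().requires_grad_(True)
    score = model(xi)[0, target]
    grad, = torch.autograd.grad(score, xi)
    return (grad * xi).abs()


def integrated_gradients(model, x, target, steps=32):
    """Simple integrated gradients with a zero baseline."""
    x = x.detach()
    baseline = torch.zeros_like(x)
    total = torch.zeros_like(x)
    for alpha in torch.linspace(0.0, 1.0, steps):
        xi = (baseline + alpha * (x - baseline)).requires_grad_(True)
        score = model(xi)[0, target]
        grad, = torch.autograd.grad(score, xi)
        total += grad
    return ((x - baseline) * total / steps).abs()


def aggregated_explanation(model, x, target):
    """Average several normalised attribution maps into one explanation."""
    maps = [
        saliency(model, x, target),
        input_x_gradient(model, x, target),
        integrated_gradients(model, x, target),
    ]
    # Normalise each map to sum to one so no single method dominates.
    normed = [m / (m.sum() + 1e-12) for m in maps]
    return torch.stack(normed).mean(dim=0)


if __name__ == "__main__":
    # Toy classifier and input, used only to exercise the aggregation.
    model = nn.Sequential(nn.Flatten(), nn.Linear(28 * 28, 64), nn.ReLU(),
                          nn.Linear(64, 10))
    x = torch.rand(1, 1, 28, 28)
    target = model(x).argmax(dim=1).item()
    explanation = aggregated_explanation(model, x, target)
    print(explanation.shape)  # same shape as the input
```

Averaging normalised maps is just one plausible aggregation rule; the point is that an attacker who tailors the model to fool one explanation method must simultaneously fool all of them.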
Original language | English |
---|---|
Title of host publication | Proceedings of 2020 Workshop on Human Interpretability in Machine Learning |
Number of pages | 22 |
Publication date | 2020 |
Publication status | Published - 2020 |
Event | 5th Annual Workshop on Human Interpretability in Machine Learning, Virtual event, 17 Jul 2020 → 17 Jul 2020 |
Workshop | 5th Annual Workshop on Human Interpretability in Machine Learning |
---|---|
Location | Virtual event |
Period | 17/07/2020 → 17/07/2020 |