Training algorithms on adversarial examples can potentially improve the accuracy of their predictions.
Adversarial examples are inputs to AI algorithms that are intentionally crafted by an attacker to make the algorithm predict erroneously. Adversarial attacks are commonly classified into two types: white-box and black-box attacks. In a white-box attack, the attacker has complete knowledge of the targeted algorithm, including its hyperparameters, weights, and biases. In a black-box attack, the adversary can only observe the algorithm's outputs and has no access to its internal details. Since adversarial attacks can drastically affect the output of AI algorithms, researchers and algorithm developers have come up with countermeasures that minimize the effect of adversarial examples.
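To make the white-box setting concrete, the sketch below crafts an adversarial example with the fast gradient sign method (FGSM), a classic white-box attack that relies on access to the model's gradients. The tiny PyTorch classifier, the input shapes, and the epsilon value are illustrative assumptions, not details from any particular system.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Hypothetical classifier; any differentiable model would do.
model = nn.Sequential(nn.Flatten(), nn.Linear(28 * 28, 10))
model.eval()

def fgsm_attack(model, x, y, epsilon=0.1):
    """White-box FGSM: computes the model's input gradient, which an
    attacker with full knowledge of the weights can do directly."""
    x_adv = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x_adv), y)
    loss.backward()
    # Nudge each pixel slightly in the direction that increases the loss.
    return (x_adv + epsilon * x_adv.grad.sign()).clamp(0, 1).detach()

# Example: one fake 28x28 "image" with (assumed) true label 3.
x = torch.rand(1, 1, 28, 28)
y = torch.tensor([3])
x_adv = fgsm_attack(model, x, y)
```

A black-box attacker cannot compute these gradients and instead has to probe the model through its outputs, for example by querying it repeatedly or by transferring examples crafted on a substitute model.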
Countermeasures for adversarial examples
When algorithm creators think about robustness, they usually think about how to ensure that an algorithm will do what it is intended to do. Adversarial examples, however, show that even sophisticated modern algorithms can be fooled by carefully crafted inputs. Techniques such as weight decay and dropout, which were earlier used to increase the robustness of algorithms, provide little defense against adversarial examples. Hence, researchers have come up with new methods that offer a more meaningful defense against adversarial examples.
Adversarial training
Adversarial training is a solution where algorithm creators intentionally generate large numbers of adversarial examples and include them in the training data so that the model learns to classify them correctly. For instance, a spam filter can be trained on adversarially modified spam messages so that it becomes better at detecting obfuscated spam words. However, it is not possible to train an algorithm on all possible adversarial examples, because even a minute change to the input data can produce a new one. Hence, adversarial training has shown mixed results in making AI algorithms robust against adversarial examples.
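As a rough illustration of how such training might look in practice, the PyTorch sketch below augments each batch with FGSM-perturbed copies and trains on both the clean and the perturbed inputs. The model, optimizer, epsilon, and the choice of FGSM rather than a stronger attack are all simplifying assumptions for the sake of brevity.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Hypothetical classifier and optimizer; real setups would use a deeper model.
model = nn.Sequential(nn.Flatten(), nn.Linear(28 * 28, 10))
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

def adversarial_training_step(x, y, epsilon=0.1):
    # 1. Craft adversarial versions of the current batch (white-box FGSM).
    x_adv = x.clone().detach().requires_grad_(True)
    F.cross_entropy(model(x_adv), y).backward()
    x_adv = (x_adv + epsilon * x_adv.grad.sign()).clamp(0, 1).detach()

    # 2. Train on both clean and adversarial inputs so the model also
    #    learns to classify the perturbed examples correctly.
    optimizer.zero_grad()
    loss = F.cross_entropy(model(x), y) + F.cross_entropy(model(x_adv), y)
    loss.backward()
    optimizer.step()
    return loss.item()

# Example step on one random batch of fake 28x28 images.
x, y = torch.rand(8, 1, 28, 28), torch.randint(0, 10, (8,))
adversarial_training_step(x, y)
```

Because the adversarial examples are regenerated from the current model at every step, the defense only covers the kinds of perturbations the chosen attack can produce, which is one reason the results are mixed.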
Defensive distillation
Defensive distillation is a strategy that smooths an algorithm's classification process to make it less sensitive to adversarial perturbations. A second model is trained to predict the output probabilities of a first model trained on the same task. For example, the first model can estimate the probability that a fingerprint is real, and its probability outputs are then used to train the second model, which acts as an additional filter and defends against similar attacks. Since the first model partly trains the second, the biggest advantage of distillation is that it can adapt to some unknown threats. Its biggest disadvantage, however, is that with enough probing and fine-tuning, an attacker can still reverse engineer the models and find fundamental exploits.
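The PyTorch sketch below shows one possible distillation step, assuming a teacher model that has already been trained with a softmax temperature T; the student is trained to match the teacher's temperature-softened probabilities, which is the core of the defensive distillation idea. The model architectures, the temperature value, and the KL-divergence loss are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

T = 20.0  # assumed softmax temperature used when training the teacher
teacher = nn.Sequential(nn.Flatten(), nn.Linear(28 * 28, 10))
student = nn.Sequential(nn.Flatten(), nn.Linear(28 * 28, 10))
optimizer = torch.optim.SGD(student.parameters(), lr=0.01)

def distillation_step(x):
    # The teacher's temperature-softened probabilities serve as soft labels.
    with torch.no_grad():
        soft_labels = F.softmax(teacher(x) / T, dim=1)
    # The student is trained to match those probabilities at the same
    # temperature, which smooths its decision surface against small perturbations.
    optimizer.zero_grad()
    loss = F.kl_div(F.log_softmax(student(x) / T, dim=1), soft_labels,
                    reduction="batchmean")
    loss.backward()
    optimizer.step()
    return loss.item()

# Example step on a batch of fake 28x28 images.
distillation_step(torch.rand(8, 1, 28, 28))
```

Training against soft labels makes small input changes move the student's predictions less abruptly, but, as noted above, a persistent attacker can still probe the distilled model and craft attacks against it.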
Adversarial examples can transfer from one AI model to another, which makes them difficult to defend against. As noted above, the countermeasures introduced to date can only minimize the effect of adversarial examples: they may block one kind of attack but leave other vulnerabilities open. If an attacker knows which defense is being used, it becomes easier to target the vulnerabilities that remain. Thus, while adversarial examples can improve training, they can also work against AI algorithms if attackers manage to insert them into the training data.