Unfortunately, we have almost no insight into how a neural network arrives at its decisions. It does
not “show its work” and would therefore fail every university exam. If a neural network makes a
fatal mistake, we need to know why that happened. If we doubt a neural network’s decision, it
helps to look at the reasoning behind it. If we use a neural network’s decision alongside other
inputs, e.g. to support a medical diagnosis, we need to know what that decision is based on.
Therefore my project aims to open up the black box that a neural network represents by looking at the
interpretability of its decisions in the context of uncertainty. By doing this, we can see how different
features of the input influence the certainty of the output decision. This will help us determine
whether we should conduct further medical tests to confirm a diagnosis, how vulnerable a system is to
malicious attacks and how to defend against them, and whether a system has an unwanted
discriminatory bias.
To achieve this, I will start by applying current methods for interpreting neural network decisions to
Bayesian neural networks, and by training neural networks on different subsets of datasets to gain
insight into the convergence of the learning process. The success of the applied methods can be
assessed by comparing the network’s explanations to human reasoning and attention for the same
decisions.
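
To give a concrete impression of what relating input features to output uncertainty could look like in practice, the sketch below (an illustration, not the project’s finished method) approximates a Bayesian neural network with Monte Carlo dropout and differentiates the predictive entropy with respect to the input, so that each input feature receives a sensitivity score for the model’s uncertainty. The architecture, data, and sample count are placeholders chosen for the example.

import torch
import torch.nn as nn

class DropoutMLP(nn.Module):
    # Small classifier with dropout; repeated stochastic forward passes
    # serve as approximate posterior samples (MC dropout).
    def __init__(self, n_in=20, n_hidden=64, n_classes=3, p=0.5):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_in, n_hidden), nn.ReLU(), nn.Dropout(p),
            nn.Linear(n_hidden, n_classes),
        )

    def forward(self, x):
        return self.net(x)

def uncertainty_saliency(model, x, n_samples=50):
    # Gradient of the predictive entropy with respect to the input:
    # which features would, if perturbed, change the uncertainty most?
    model.train()                      # keep dropout active at test time
    x = x.clone().requires_grad_(True)
    probs = torch.stack(
        [torch.softmax(model(x), dim=-1) for _ in range(n_samples)]
    ).mean(dim=0)                      # approximate predictive distribution
    entropy = -(probs * probs.clamp_min(1e-12).log()).sum(dim=-1).sum()
    entropy.backward()
    return x.grad                      # per-feature sensitivity of uncertainty

if __name__ == "__main__":
    model = DropoutMLP()
    x = torch.randn(5, 20)                       # dummy batch of 5 inputs
    print(uncertainty_saliency(model, x).shape)  # torch.Size([5, 20])

The same idea carries over to other approximate Bayesian inference schemes: any method that yields a predictive distribution allows an uncertainty measure to be attributed back to the input with standard interpretability techniques.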