Improving Deep Learning Explainability with Interpretable Outputs and Representations

Anders Christensen: Opening the black box: Making machine learning interpretable

Imagine you are going to a renowned restaurant. Upon entry, the waiter surprises you by asking you to wear a blindfold during dinner. Even though you know the restaurant has great ratings, can you enjoy the fish course that follows without fearing that you may accidentally swallow a fish bone?

Machine learning is a field focused on creating and training mathematical models that solve specific tasks. In recent years, machine learning (and specifically deep learning) has become synonymous with artificial intelligence due to its impressive results, sometimes far surpassing human performance. However, there is rarely access to the reasoning behind why a model yields a certain output for a given input. This is often referred to as the “black box problem” of deep learning: you give the model some data and get back an answer that may very well be good, but what has happened inside the model is typically obscure and cannot be interpreted by humans.

As such, it is difficult, if not impossible, to validate a specific output for some input if the correct answer is not already known. Model performance is therefore typically evaluated statistically in a controlled test setting, e.g., as the percentage of correct answers. This may be satisfactory in, say, a production setting, but what if the model were evaluated on you – for instance, to assist a doctor with a medical diagnosis?

And so we arrive back at the dilemma in the restaurant. Although the restaurant (the model) has a high rating (accuracy in the test setting), it is still difficult to enjoy the meal (trust the output), because the blindfold (the black box problem) prevents us from validating it.

My PhD will consist of a series of smaller projects that attempt to poke holes in the blindfold and make it slightly more see-through. As an example, the first project revolves around so-called zero-shot classification models. These models can generalize to new, unseen categories after deployment by using simple side information. For instance, a model trained to classify only European animals in images may, if it has zero-shot capabilities, still recognize zebras if it knows that a zebra looks like a “horse with black and white stripes” – just as a human would. This type of model is interesting due to its ability to continually adapt and incorporate new information from our ever-changing world.

But how can we know whether the model has actually understood what a zebra is? Our work allows the model to generate images of zebras without ever having seen one. This is like asking a person to draw a zebra from the striped-horse description in order to peer into their imagination. Users can then verify visually that the model has understood a particular concept, rather than latching onto random noise or spurious correlations that may lead to faulty predictions.
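The zebra example above can be sketched in code. The following is a minimal, illustrative toy: the class names, the four binary attributes, and the “image feature” vector are all assumptions made up for this sketch, not the project’s actual method (which works with learned image and text representations, not hand-written attribute tables). The key idea it demonstrates is that a class never seen during training can still be recognized purely from its side-information description.

```python
# Toy sketch of attribute-based zero-shot classification.
# Attributes (all hypothetical): striped, four-legged, black-and-white, solid-color.
# "zebra" is described only via side information - "a horse with
# black and white stripes" - and never appears in training data.
attributes = {
    "horse": [0, 1, 0, 1],
    "zebra": [1, 1, 1, 0],  # unseen class, known only by description
}

def classify(image_features, attribute_table):
    """Return the class whose attribute vector best matches the features."""
    def score(a, b):
        # Simple dot product as a stand-in for a learned compatibility score.
        return sum(x * y for x, y in zip(a, b))
    return max(attribute_table, key=lambda cls: score(image_features, attribute_table[cls]))

# A toy feature vector "extracted" from an image of a striped animal:
features = [1, 1, 1, 0]
print(classify(features, attributes))  # → zebra, despite no zebra training images
```

In practice, systems like this replace the hand-made table with text embeddings of class descriptions and the toy features with a deep image encoder, but the matching principle is the same.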

Deep learning has tremendous power that can be leveraged to improve society, but for these models to be used as advisory tools in crucial domains such as health care, their reasoning must be verifiable and explainable by domain experts such as doctors. During this PhD, we hope to develop techniques and knowledge that advance this goal.

PhD project

By: Anders Christensen

Section: Cognitive Systems

Principal supervisor: Ole Winther

Co-supervisor: Zeynep Akata, UT

Project title: Improving Deep Learning Explainability with Interpretable Outputs and Representations

Term: 01/12/2021 → 30/11/2024


Anders Christensen
PhD student
DTU Compute


Ole Winther
DTU Compute
+45 45 25 38 95