DTU.dk

Extracting Essential Data and Making Inference from Big Data

Rune Dodensig Kjærsgaard: Extracting Essential Data from Big Data

In recent years the value of extracting knowledge from big data has become increasingly evident. Unfortunately, big data often comes with unwanted noise, missing values and imbalance, all of which complicate big data projects. Additionally, in most big data projects only a small fraction of the data is relevant for creating the value of interest. Since no amount of irrelevant data is useful, a major hurdle in data science projects can be overcome by identifying and extracting the central data required to solve a given problem.

This PhD project seeks to address this problem by finding a smaller data representation that is equivalent to the original data, in the sense that studying the smaller representation is as relevant for the problem as studying the original data.
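A textbook illustration of such an equivalent smaller representation is a sufficient statistic. The sketch below is not taken from the project; it assumes a Gaussian model and uses hypothetical helper names (`summarize`, `estimate_from_summary`). Under that assumption, the count, the sum and the sum of squares carry everything needed to estimate the mean and variance, so the full sample can be replaced by three numbers.

```python
import math
import random
import statistics


def summarize(values):
    """Reduce a sample to three numbers: count, sum, and sum of squares."""
    n, total, total_sq = 0, 0.0, 0.0
    for v in values:  # single pass, so the data could also be streamed
        n += 1
        total += v
        total_sq += v * v
    return n, total, total_sq


def estimate_from_summary(summary):
    """Maximum-likelihood mean and variance of a Gaussian, from the summary alone."""
    n, total, total_sq = summary
    mean = total / n
    variance = total_sq / n - mean * mean
    return mean, variance


# Synthetic data, generated here only for illustration.
random.seed(0)
data = [random.gauss(10.0, 2.0) for _ in range(100_000)]

# Estimates computed from the three-number summary.
summary = summarize(data)
mean_small, var_small = estimate_from_summary(summary)

# Reference estimates computed from the full sample.
mean_full = statistics.fmean(data)
var_full = statistics.pvariance(data)

print(len(data), "values reduced to", len(summary))
print(math.isclose(mean_small, mean_full, rel_tol=1e-9))
print(math.isclose(var_small, var_full, rel_tol=1e-6))
```

The equivalence holds only relative to the chosen model and question: the same three numbers would not suffice for, say, estimating a median. Finding representations that stay small for richer problems is the harder task the project describes.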

The aim of the project is to arrive at this smaller data representation by combining techniques from statistical inference and machine learning with tools from algorithms and complexity theory. The project will investigate statistical inference as a means to address the problem of data compression, while also addressing the algorithmic issues that arise in the methodology. A range of popular machine learning methods will be relevant for the project, for example autoencoders for dimensionality reduction, and the hope is also to create new tools and techniques for studying data compression in both theory and practice.

In essence, the project seeks to determine which statistical inference methods are best suited to finding a small and relevant data representation in big data settings, and how quickly such a representation can be computed. Furthermore, the relation and potential synergies between statistical inference and algorithmic compression methods will be investigated.

The results of this PhD project will have wide applicability in areas with a strong demand for data-driven solutions on massive data sets, and will help pave the way towards a future where big data projects are easier to undertake.

Updated on 27 May 2022