PhD defence by Frederik Rye Skjoldjensen: Matching and Compression of Strings with Automata and Word Packing

PhD defence by Frederik Rye Skjoldjensen, Wednesday, 7 June, at 13:00, The Technical University of Denmark, Anker Engelunds Vej 1, 2800 Kgs. Lyngby, Building 101A, S01

Principal supervisor: Associate Professor Phillip Bille.
Co supervisor: Associate Professor Inge Li Gørtz.

Examiners: 
Associate Professor Carsten Witt, DTU Compute.
Professor Rasmus Pagh, IT University of Copenhagen, Denmark.
Professor Gad M. Landau, University of Haifa, Israel.

Chairperson at defence: Consultant Dorte Olesen.

Summary:
This dissertation presents new algorithms for pattern matching in text strings and compression of text strings. A text string could be the text of a book, the text on a web page or the base pairs of a DNA sequence or any other type of data that can be seen as a sequence of consecutive characters. A pattern could be a specific word that we want to find in a book. A compression of a text could be the substitution of string ATGCAAAAAAAA with the reduced size string ATGC8A where we have substituted the last eight characters with 8A to indicate eight times repetition of A. We assess the algorithms we develop based on how they scale in computation time and memory consumption when the amount of data increases.

The dissertation is divided in four parts: In the first part we consider compression of a source text with sequences of consecutive characters from a reference string. A compression for the string ABCDABC with respect to the string ABCD could for example be (1,4),(1,3) because the first string can be represented as the first four character of ABCD followed by the first three characters ABCD. We are interested in only storing the compressed representation and the reference string while still being able to find the i-th character of S fast. In doing this we present a new result that generalizes the best known result and a result that improves the previously best known result.

In the second part we consider subsequence matching in a formal model. A subsequence of a string is any string obtainable from the original string by removing zero or more characters. We are interested in deciding if one string is a subsequence of another string. We present new solutions to the problem by extending a classic model.

In the third part we consider substring matching. Here we compute and store a representation of a text string such that we afterwards can decide if a pattern is a substring of the text string. A substring is any sequence of consecutive characters from a string. We consider a specific setting where the space we use should be proportional in size to the text string and where randomization is disallowed. We additionally require that the strings are stored in efficiently in machine words. We generalize the best known result to work on strings stored efficiently in machine words.

In the last part, we present an algorithm for maintaining a set of numbers such that we can access each number fast, update each number fast and find the sum of the first i numbers fast. We do this in an unconventional model of computation that was recently proposed in the literature and present a new result that improves the previously best known result.

Read more about this thesis in DTU Orbit.

A copy of the PhD thesis is available for reading at the department.

Everyone is welcome.

Tidspunkt

ons 07 jun 17
13:00

Arrangør

DTU Compute

Hvor

The Technical University of Denmark, Anker Engelunds Vej 1, 2800 Kgs. Lyngby, Building 101A, S01