MSc Topics at the Multimedia Computing Group

Below you will find a list of potential MSc thesis topics open in recommender systems and information retrieval. It is important to choose a topic that you find interesting. However, the supervisor recommendation gives you an idea of which topics the supervisors themselves consider the most interesting and promising for an MSc student (also taking into account the context, the expertise and the ongoing activities in the department).

Projects at Companies and other Institutions


Europeana is a digital platform that aggregates more than fifty million cultural objects from more than 3700 museums, libraries, galleries and audio-visual archives around Europe. A good search functionality is key to grant access to this dynamic and heterogeneous collection. In order to improve this functionality, proper evaluation of the search engine must be conducted.

As opposed to typical offline evaluation, where a set of queries and assessed documents is used, we propose the use of implicit feedback (e.g. clicks, dwell times, etc.) obtained from actual users interacting with the Europeana search engine. The project will be developed in collaboration with the Europeana Foundation, whose main office is located in The Hague. It will roughly comprise the following tasks:

  1. Automatically extract interaction data (e.g. clicks) from the Europeana logging system
  2. Create a model to estimate the relevance of a document from these data, so we can compare the performance of a new search engine with respect to the one in production
  3. Apply this model to the Europeana collection
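As a rough illustration of task 2, the sketch below estimates the relevance of a document for a query as a smoothed click-through rate. This is only one simple possibility and not the model the project prescribes: the log format, the function name `estimate_relevance` and the prior values are assumptions made for this example, and the estimate ignores position bias, which a real click model would have to address.

```python
from collections import defaultdict


def estimate_relevance(log, prior_clicks=1.0, prior_views=2.0):
    """Return a smoothed click-through rate per (query, document) pair.

    `log` is an iterable of (query, doc_id, clicked) tuples; this format is
    hypothetical and does not reflect the actual Europeana logging system.
    The priors pull pairs with few impressions towards a rate of 0.5.
    """
    views = defaultdict(int)
    clicks = defaultdict(int)
    for query, doc_id, clicked in log:
        views[(query, doc_id)] += 1
        clicks[(query, doc_id)] += int(clicked)
    return {
        key: (clicks[key] + prior_clicks) / (views[key] + prior_views)
        for key in views
    }


# Made-up log entries: (query, document id, was the result clicked?)
log = [
    ("vermeer", "doc1", True),
    ("vermeer", "doc1", True),
    ("vermeer", "doc2", False),
    ("vermeer", "doc2", True),
    ("vermeer", "doc2", False),
]
scores = estimate_relevance(log)
```

Averaging such estimates over the documents each search engine returns would give one crude way to compare a new engine with the one in production.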

EUR-TUDELFT-X Master Thesis Program

The EUR-TUDELFT-X Master Thesis Program between Rotterdam School of Management and Computer Science of TUDelft aims to provide students with unique educational experiences in their thesis projects. We offer a number of projects; please contact Dr. Huijuan Wang.

Projects at TU Delft

For projects supervised by Elvin Isufi, please see the listings here.

For projects supervised by Pablo Cesar on affective computing, HCI and QoE, and multimedia systems, please see the listings here.

Analysing human neural responses to ambiguous speech using graph networks

There is ample evidence showing that listeners are able to quickly adapt their phoneme classes to ambiguous sounds using a process called lexically-guided perceptual learning. In a lexically-guided perceptual learning experiment, listeners are exposed to words containing an ambiguous sound, e.g., a sound between /f/ and /s/. Listeners exposed to the ambiguous sound in /f/-final words (e.g., gira[f/s], where giraf (giraffe) is an existing Dutch word and giras is not) learn to interpret the sound as /f/. Listeners exposed to the ambiguous sound in /s/-final words (e.g., mui[f/s], where muis (mouse) is an existing Dutch word and muif is not) learn to interpret the same ambiguous sound as /s/. In a subsequent phonetic-categorization task, the listeners exposed to ambiguous /f/-final stimuli categorize stimuli on an [ɛf-ɛs] continuum more often as [ɛf] than the other group, thus showing a retuning of their /f/ phoneme category.

In a recent experiment, we examined the neural correlates underlying this process (Scharenborg et al., 2019). Specifically, we compared the brain’s responses to ambiguous [f/s] sounds in Dutch non-native listeners of English (N=36) before and after exposure to the ambiguous sound to induce learning, using Event-Related Potentials (ERPs). We observed differences in mean ERP amplitude to ambiguous phonemes at pretest and posttest. However, we observed no significant correlation between the size of behavioral and neural pre/posttest effects. Possibly, the observed behavioral and ERP differences between pretest and posttest link to different aspects of the sound classification task.

In this project, we will analyse the neural responses in the pretest and posttest in a new way by using a network spectral analysis of the brain responses. The brain is in fact a complex network and its response to different stimuli changes accordingly. By using tools from graph signal processing, we are able to see the frequencies of the brain network and how much they contribute to the recorded signals. We will investigate 1) whether we observe differences between the neural responses at pretest and posttest and 2) whether these correlate with the behavioural results. These findings will pave the way for machine learning tools that are able to identify a sound by identifying patterns in the brain recording.
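For orientation, the "frequencies of the brain network" are usually formalised in graph signal processing through the graph Fourier transform. The block below gives the standard textbook definition, which is not necessarily the exact formulation this project will adopt. The symbols are introduced here for illustration only: W is a symmetric weighted adjacency matrix (for instance between recording channels, which is an assumption), D is the diagonal degree matrix, and x is the signal recorded on the nodes at one time instant.

```latex
\begin{align}
  L &= D - W, \\
  L &= U \Lambda U^{\top}, \\
  \hat{x} &= U^{\top} x, \qquad x = U \hat{x}.
\end{align}
```

The eigenvalues on the diagonal of \Lambda play the role of graph frequencies, and the magnitude of each entry of \hat{x} indicates how much the corresponding frequency contributes to the recorded signal.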

Relevant literature: Scharenborg, O., Koemans, J., Smith, C., Hasegawa-Johnson, M., Federmeier, K. (2019). The neural correlates underlying lexically-guided perceptual learning. Proceedings of Interspeech, Graz, Austria.

Speech2image: Retrieve Images Using Speech

A speech-to-image system learns to map images and speech to the same embedding space, and retrieves an image using spoken captions. While doing so, the deep neural network uses multi-modal input to discover speech units in an unsupervised manner, similar to how children learn their first language. See here for a short description of the task. This project investigates whether a different encoding of the data and highlighting certain areas of the image improve task performance and yield better speech units. Data and (parts of the) software are available.

Image2speech: Automatic Captioning of Images with Speech

The image2speech task generates a spoken description of an image. A baseline system which creates one caption for each image is described here. The goal of this project is to build a system for Dutch which outputs multiple descriptions for each image.

Adaptation of a Deep Neural Network to Non-Standard Speech

Human listeners have the ability to quickly adapt to non-standard speech, for instance an accent or a speech impediment. This project investigates deep neural networks’ ability to adapt to non-standard speech by computing the distance between sound clusters in the hidden layers for standard and non-standard speech and by visualizing the activations in the hidden layers. A description of initial work can be found here.
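One simple way to quantify the distance between sound clusters is to compare the centroids of the hidden-layer activations for each sound under both speech conditions. The sketch below does this with Euclidean distance. It is an illustrative assumption, not the measure used in the initial work: the activation values are made up, the two-dimensional vectors stand in for much larger hidden layers, and the helper names are hypothetical.

```python
import math


def centroid(vectors):
    """Component-wise mean of a list of equally long activation vectors."""
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]


def euclidean(a, b):
    """Euclidean distance between two vectors of the same length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cluster_shift(standard, non_standard):
    """Per-sound distance between the standard and non-standard centroids."""
    return {
        sound: euclidean(centroid(standard[sound]), centroid(non_standard[sound]))
        for sound in standard
    }


# Made-up hidden-layer activations per sound (2-D for readability).
standard = {"f": [[1.0, 0.0], [1.0, 2.0]], "s": [[5.0, 1.0], [7.0, 1.0]]}
non_standard = {"f": [[2.0, 1.0], [4.0, 1.0]], "s": [[6.0, 0.0], [6.0, 2.0]]}

shift = cluster_shift(standard, non_standard)
```

Tracking how such distances change from layer to layer, or before and after adaptation, would be one possible reading of the analysis described above.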

DeepSpeech: Visualising Speech Representations in Deep Neural Networks

Recently, Deep Neural Networks (DNNs) have achieved striking performance gains on multimedia analysis tasks involving processing of speech, images, music, and video. DNNs are inspired by the human brain, which the literature often suggests to be the source of their impressive abilities. Although DNNs resemble the brain at the level of neural connections, little is known about whether they actually solve specific tasks in the same way the brain does. In this project, we focus on speech recognition, which was one of the first multimedia processing areas to see remarkable gains due to the introduction of (deep) neural networks.

In order to investigate the way DNNs solve specific tasks, visualisation of the activations of the hidden nodes in the DNN’s hidden layers is crucial. Such tools/approaches exist in the field of automatic image processing, most notably “DeepEyes” developed at the Computer Graphics and Visualization Group in the Intelligent Systems Department at TU Delft. In this project, we aim to develop a speech counterpart to “DeepEyes”, which we refer to as “DeepSpeech”.

Label Extraction from Failure Reports

One of the main limitations in the application of Machine Learning techniques to the monitoring of engineering structures is the lack of accurate and comprehensive labels. These labels are rarely available and often inaccurate, since they are obtained by relying on people detecting and annotating events in a timely manner during operating conditions. One way around this is to extract the labels from the so-called failure reports. These reports are free-text documents, written after a failure investigation has been carried out to evaluate the root cause of failure. However, manually extracting the information from these reports can be time-consuming and costly. Alternatively, Natural Language Processing techniques can be developed to speed up the label extraction.

Each word in the report can be represented as a point in a multi-dimensional space, the so-called embedding, to perform mathematical operations on linguistic objects. However, not all the words contained in the report are relevant. This project will explore approaches based on text pre-processing and on Latent Dirichlet Allocation (LDA) models to identify the most relevant words, which yield meaningful failure labels. You will work on a very exciting dataset which reflects typical challenges in engineering applications: failure reports written as free text with typographical errors, very similar failure types to be identified, and an imbalanced number of failures per failure type. It is anticipated that the work will be presented at one international conference.
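To make the idea of filtering out irrelevant words concrete, the sketch below applies basic pre-processing (lowercasing, tokenisation, stop-word removal) and then ranks the remaining words in each report by TF-IDF. Note that TF-IDF is used here as a simple stand-in and is not the LDA approach the project targets. The reports, the stop-word list and the function names are made up for illustration, and the sketch does not handle typographical errors.

```python
import math
import re
from collections import Counter

# A tiny illustrative stop-word list; a real one would be much longer.
STOP_WORDS = {"the", "a", "of", "was", "in", "and", "to", "on"}


def tokenize(text):
    """Lowercase the text, keep alphabetic tokens, and drop stop words."""
    return [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOP_WORDS]


def tfidf_top_terms(reports, k=2):
    """Return the k highest-scoring TF-IDF terms for each report."""
    docs = [tokenize(r) for r in reports]
    # Document frequency: in how many reports does each term occur?
    df = Counter(term for doc in docs for term in set(doc))
    n = len(docs)
    top = []
    for doc in docs:
        tf = Counter(doc)
        scores = {t: tf[t] * math.log(n / df[t]) for t in tf}
        # Sort by descending score, breaking ties alphabetically.
        ranked = sorted(scores, key=lambda t: (-scores[t], t))
        top.append(ranked[:k])
    return top


# Made-up failure reports.
reports = [
    "Crack detected in the bearing housing",
    "Bearing overheated and seized",
    "Corrosion detected on the bearing bracket",
]
top_terms = tfidf_top_terms(reports)
```

In this toy example the word "bearing" occurs in every report and therefore receives a score of zero, which illustrates why frequent but uninformative words need to be separated from label candidates.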

Speech-2-text converter / Voice Activation Device

Do you want to be part of an innovation project and have a direct and tangible positive impact on people and society? In collaboration with a startup in the medical field and as a small sub-project, we need a very reliable and robust speech-to-text system for the medical field. The system takes short life-saving voice commands (e.g. “Eli help”) as input from a microphone and converts them to text for post-processing and for activating a device or system. About us: We are an innovative startup in the medical field supported by an international team of scientists and engineers on a mission to help and protect very fragile groups in society, such as premature babies and elderly people. Our investment is our knowledge, expertise and over 18 years of experience in healthcare at national and international level. We are independent in building and using state-of-the-art technology in a seamlessly integrated platform that includes remote sensing, multimedia, data monitoring and analytics. Over time, we have built strong relationships and collaborations with different working groups and disciplines at TU-Delft, HHS-Delft, companies and hospitals to help us make this innovative dream a reality.


Remark: You may use open-source programs and libraries to jumpstart the project progress.

Supervisor: Odette Scharenborg

Automatic Discovery of Speech Units: Multitask- and Multilingual Deep Neural Network Training

In order to train automatic speech recognition systems, deep neural networks require a lot of annotated training speech data. However, for most languages in the world, not enough (annotated) data is available. A new research strand focuses on the automatic discovery of phoneme-like acoustic units from raw speech using out-of-domain languages. Building on previous work, this project investigates the use of multi-task training for the deep neural network and training the deep neural net on multiple languages. Data and software are partially available.

Automatic Discovery of Speech Units: New Architectures and Creating Artificial Data for Deep Neural Network Training

In order to train automatic speech recognition systems, deep neural networks require a lot of annotated training speech data. However, for most languages in the world, not enough (annotated) data is available. A new research strand focuses on the automatic discovery of phoneme-like acoustic units from raw speech using out-of-domain languages. Building on previous work, this project investigates the creation of artificial data for deep neural network training and compares different deep neural network architectures. Data and software are partially available.

What is The Best Deep Neural Network-based Phoneme Recognition System? (multiple projects)

These projects involve building and comparing different deep neural network architectures for the task of phoneme recognition in Dutch. Multiple projects are available, each project focusing on a different architecture and/or speaking style (conversational Dutch versus read speech Dutch) and/or whether the training material is only Dutch or a combination of different languages (multi-lingual model). Questions include whether a recurrent network improves phoneme recognition, and whether a sequence-trained model (e.g., CTC) is better than a frame-based model.
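To illustrate one practical difference between the two model types: a frame-based model assigns a phoneme label to every frame, whereas a CTC-trained model may also emit a blank symbol, and its frame-level output is collapsed into a phoneme sequence by merging repeated labels and then removing blanks. The sketch below implements this standard greedy collapsing rule. The blank symbol "_" and the example frame labels are assumptions chosen for this illustration.

```python
BLANK = "_"  # Symbol chosen here to represent the CTC blank.


def ctc_collapse(frame_labels, blank=BLANK):
    """Collapse a per-frame label sequence into an output sequence.

    Consecutive repeats are merged first, then blanks are dropped, so a
    blank between two identical labels keeps them as separate outputs.
    """
    output = []
    previous = None
    for label in frame_labels:
        if label != previous and label != blank:
            output.append(label)
        previous = label
    return output


# Made-up best-label-per-frame output for the Dutch word "muis".
frames = ["_", "m", "m", "_", "ui", "ui", "ui", "s", "_"]
phonemes = ctc_collapse(frames)
```

Because the collapsed sequence has no fixed alignment to the frames, the two model types are typically compared on phoneme error rate rather than on frame accuracy.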

Confidence Intervals for Measurements of Dataset Reliability

Implement various formulations available in the literature to compute confidence intervals for certain statistical descriptors. Once implemented, evaluate their accuracy using simulation tools already available for IR. Some motivation can be found in sections 2 and 3 of this paper.
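As an example of what one such formulation might look like, the sketch below computes a percentile bootstrap confidence interval for an arbitrary statistic. It is offered only as an illustration: the formulations to be implemented are those found in the literature and may differ substantially, and the function name, default parameters and sample scores are assumptions.

```python
import random
import statistics


def bootstrap_ci(sample, stat=statistics.mean, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for `stat` of `sample`.

    Resamples with replacement, evaluates the statistic on each resample,
    and returns the empirical alpha/2 and 1 - alpha/2 quantiles.
    """
    rng = random.Random(seed)  # Fixed seed so the result is reproducible.
    n = len(sample)
    estimates = sorted(
        stat([rng.choice(sample) for _ in range(n)]) for _ in range(n_resamples)
    )
    lower = estimates[int((alpha / 2) * n_resamples)]
    upper = estimates[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper


# Made-up per-topic effectiveness scores.
scores = [0.21, 0.35, 0.48, 0.52, 0.33, 0.61, 0.44, 0.29, 0.57, 0.40]
lower, upper = bootstrap_ci(scores)
```

Evaluating accuracy would then amount to checking, over many simulated samples with a known true value, how often the interval actually contains that value.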

Prediction of User Annotations on Structural Segmentation

Similar to the topic on music similarity annotations, but from the ground up, and for the task of structural music segmentation. See a short description of the task here. Data and software are available here.