MSc Topics at the Multimedia Computing Group

Below you will find a list of potential MSc thesis topics open in recommender systems and information retrieval. It is important to choose a topic that you find interesting. However, the supervisor recommendation gives you an idea of which topics the supervisors themselves consider the most interesting and promising for an MSc student (also taking into account the context, the expertise and the ongoing activities in the department).

Projects at Companies and other Institutions

Europeana

Europeana is a digital platform that aggregates more than fifty million cultural objects from more than 3700 museums, libraries, galleries and audio-visual archives around Europe. A good search functionality is key to granting access to this dynamic and heterogeneous collection. To improve this functionality, the search engine must be properly evaluated.

As opposed to typical offline evaluation, where a set of queries and assessed documents is used, we propose the use of implicit feedback (e.g. clicks, dwell times, etc.) obtained from actual users interacting with the Europeana search engine. The project will be developed in collaboration with the Europeana Foundation, whose main office is located in The Hague. It will roughly comprise the following tasks:

- Automatically extract interaction data (e.g. clicks) from the Europeana logging system
- Create a model to estimate the relevance of a document from these data, so that the performance of a new search engine can be compared with that of the one in production
- Apply this model to the Europeana collection
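
The tasks above can be sketched in a few lines of Python. This is only a minimal illustration, not the model the project would build: it assumes a hypothetical log layout of (query, document, clicked) records and uses a smoothed click-through rate as a stand-in for estimated relevance. The function names and priors are made up for the example.

```python
from collections import defaultdict


def estimate_relevance(log, prior_clicks=1.0, prior_views=2.0):
    """Smoothed click-through rate per (query, document) pair.

    `log` is an iterable of (query, doc_id, clicked) records. This layout is
    hypothetical; the real Europeana logging format may differ.
    """
    views = defaultdict(int)
    clicks = defaultdict(int)
    for query, doc_id, clicked in log:
        views[(query, doc_id)] += 1
        clicks[(query, doc_id)] += int(clicked)
    # The priors pull rarely shown documents towards 0.5.
    return {
        pair: (clicks[pair] + prior_clicks) / (views[pair] + prior_views)
        for pair in views
    }


def score_ranking(query, ranking, relevance, unseen=0.5):
    """Mean estimated relevance of the documents a search engine returned."""
    estimates = [relevance.get((query, doc_id), unseen) for doc_id in ranking]
    return sum(estimates) / len(estimates)


# Toy data: two impressions of d1 (both clicked), one of d2 (not clicked).
log = [("monet", "d1", True), ("monet", "d1", True), ("monet", "d2", False)]
relevance = estimate_relevance(log)
production = score_ranking("monet", ["d2", "d1"], relevance)
candidate = score_ranking("monet", ["d1", "d3"], relevance)
```

Raw click-through rates ignore effects such as position bias, which is one reason a dedicated relevance model is part of the project.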

Supervisor recommendation: high (Julián Urbano)
Background: Information Retrieval, and Machine Learning or Statistics
Engineering/implement component: extraction of data from logs; experiment (probably crowdsourcing) to create a model to estimate the relevance of a document; automatically apply this model to compare two existing search engines using the same technology and collection.
Publication possibilities: high
Literature:
- B. Carterette and R. Jones, Evaluating Search Engines by Modeling the Relationship Between Relevance and Clicks, NIPS 2008.

EUR-TUDELFT-X Master Thesis Program

The EUR-TUDELFT-X Master Thesis Program between Rotterdam School of Management and Computer Science of TUDelft aims to provide students with unique educational experiences in their thesis projects. We offer a number of projects; please contact Dr. Huijuan Wang.

Projects at TU Delft

For projects supervised by Elvin Isufi, please see the listings here.

For projects supervised by Pablo Cesar on affective computing, HCI and QoE, and multimedia systems, please see the listings here.

Analysing human neural responses to ambiguous speech using graph networks

There is ample evidence showing that listeners are able to quickly adapt their phoneme classes to ambiguous sounds using a process called lexically-guided perceptual learning. In a lexically-guided perceptual learning experiment, listeners are exposed to words containing an ambiguous sound, e.g., a sound between /f/ and /s/. Listeners exposed to the ambiguous sound in /f/-final words (e.g., gira[f/s], where giraffe is an existing Dutch word and giras is not) learn to interpret the sound as /f/. Listeners exposed to the ambiguous sound in /s/-final words (e.g., mui[f/s], where muis (mouse) is an existing Dutch word) learn to interpret the same ambiguous sound as /s/. In a subsequent phonetic-categorization task, the listeners exposed to ambiguous /f/-final stimuli categorized stimuli on an [ɛf-ɛs] continuum more often as [ɛf] than the other group, thus showing a retuning of their /f/ phoneme category.

In a recent experiment, we examined the neural correlates underlying this process (Scharenborg et al., 2019). Specifically, we compared the brain’s responses to ambiguous [f/s] sounds in Dutch non-native listeners of English (N=36) before and after exposure to the ambiguous sound to induce learning, using Event-Related Potentials (ERPs). We observed differences in mean ERP amplitude to ambiguous phonemes at pretest and posttest. However, we observed no significant correlation between the size of behavioral and neural pre/posttest effects. Possibly, the observed behavioral and ERP differences between pretest and posttest link to different aspects of the sound classification task.

In this project, we will analyse the neural responses in the pretest and posttest in a new way, using a network spectral analysis of the brain responses. The brain is a complex network, and its response changes with the stimulus. Tools from graph signal processing let us identify the frequencies of the brain network and how much each of them contributes to the recorded signals. We will investigate 1) whether we observe differences between the pretest and posttest neural responses and 2) whether these correlate with the behavioural results. These findings will pave the way for machine learning tools that identify a sound from patterns in the brain recording.
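
As a rough illustration of what "frequencies of the brain network" means, the sketch below computes a graph Fourier transform with NumPy. It assumes an undirected graph with a symmetric adjacency matrix (for instance, electrodes as nodes) and uses the combinatorial Laplacian. The toy 4-node ring and the function name are made up for the example and are not taken from the project.

```python
import numpy as np


def graph_fourier_transform(adjacency, signal):
    """Project a graph signal onto the eigenvectors of the graph Laplacian."""
    adjacency = np.asarray(adjacency, dtype=float)
    laplacian = np.diag(adjacency.sum(axis=1)) - adjacency
    # For a symmetric matrix, eigh returns eigenvalues in ascending order;
    # small eigenvalues correspond to smooth (low-frequency) graph modes.
    frequencies, modes = np.linalg.eigh(laplacian)
    coefficients = modes.T @ np.asarray(signal, dtype=float)
    return frequencies, coefficients


# Toy example: 4 nodes connected in a ring, one value recorded per node.
adjacency = [
    [0, 1, 0, 1],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [1, 0, 1, 0],
]
signal = [1.0, 1.0, 1.0, 1.0]
frequencies, coefficients = graph_fourier_transform(adjacency, signal)
# Energy per graph frequency: how much each mode contributes to the signal.
energy = coefficients ** 2
```

A signal that is constant over the nodes puts all of its energy on the zero graph frequency. Comparing such energy profiles between pretest and posttest recordings is one possible reading of the analysis described above.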

Relevant literature: Scharenborg, O., Koemans, J., Smith, C., Hasegawa-Johnson, M., Federmeier, K. (2019). The neural correlates underlying lexically-guided perceptual learning. Proceedings of Interspeech, Graz, Austria. http://homepage.tudelft.nl/f7h35/papers/interspeech19.1.pdf

Supervisors: Elvin Isufi & Odette Scharenborg
Background: Graph neural networks, machine learning or statistics, signal processing, interest in speech.

Speech2image: Retrieve Images Using Speech

A speech-to-image system learns to map images and speech to the same embedding space, and retrieves an image using spoken captions. While doing so, the deep neural network uses multi-modal input to discover speech units in an unsupervised manner, similar to how children learn their first language. See here for a brief description of the task. This project investigates whether a different encoding of the data and highlighting certain areas of the image improve task performance and the discovered speech units. Data and (parts of the) software are available.
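
Once speech and images live in the same embedding space, retrieval reduces to a nearest-neighbour search. The sketch below shows that step only, assuming the embeddings have already been produced by some trained network. The two-dimensional vectors, the file names and the choice of cosine similarity are illustrative assumptions.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equally long vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def retrieve(speech_embedding, image_embeddings, top_k=1):
    """Return the names of the images closest to a spoken-caption embedding."""
    ranked = sorted(
        image_embeddings,
        key=lambda name: cosine(speech_embedding, image_embeddings[name]),
        reverse=True,
    )
    return ranked[:top_k]


# Hypothetical embeddings; a real system would use hundreds of dimensions.
images = {"dog.jpg": [1.0, 0.0], "cat.jpg": [0.0, 1.0]}
best = retrieve([0.9, 0.1], images)
```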

Supervisor recommendation: high (Odette Scharenborg)
Background: Machine Learning or Statistics, Signal Processing, Multimedia Analysis, Interest in speech and images
Engineering/implement component: Adaptation of software to work with new image encodings; extraction and analysis of the discovered speech units using visualisations of the deep neural nets’ hidden layers.
Publication possibilities: high

Image2speech: Automatic Captioning of Images with Speech

The image2speech task generates a spoken description of an image. A baseline system which creates one caption for each image is described here. The goal of this project is to build a system for Dutch which outputs multiple descriptions for each image.

Supervisor recommendation: high (Odette Scharenborg)
Background: Machine Learning or Statistics, Signal Processing, Multimedia Analysis, Interest in speech and images
Engineering/implement component: Adaptation and integration of existing software to build an image2speech system for Dutch.
Publication possibilities: high

Adaptation of a Deep Neural Network to Non-Standard Speech

Human listeners have the ability to quickly adapt to non-standard speech, for instance an accent or a speech impediment. This project investigates deep neural networks’ ability to adapt to non-standard speech by computing the distance between sound clusters in the hidden layers for standard and non-standard speech, and by visualizing the activations in the hidden layers. A description of initial work can be found here.
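
One simple way to quantify the distance between sound clusters is to compare the centroids of hidden-layer activations. The sketch below assumes that activations for one sound have already been extracted as lists of numbers. The centroid-based Euclidean distance is an illustrative choice, not necessarily the measure used in the initial work.

```python
import math


def centroid(vectors):
    """Component-wise mean of a list of equally long activation vectors."""
    count = len(vectors)
    return [sum(column) / count for column in zip(*vectors)]


def cluster_distance(standard, non_standard):
    """Euclidean distance between the centroids of two activation clusters."""
    return math.dist(centroid(standard), centroid(non_standard))


# Hypothetical 2-dimensional activations of one hidden layer for one sound.
standard = [[0.0, 0.0], [2.0, 0.0]]
non_standard = [[1.0, 3.0], [1.0, 5.0]]
distance = cluster_distance(standard, non_standard)
```

Tracking this distance per layer, before and after adaptation, would show where in the network the representation of a sound shifts.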

Supervisor recommendation: medium (Odette Scharenborg)
Background: Machine Learning or Statistics, Signal Processing, Multimedia Analysis, Interest in speech and images
Engineering/implement component: Adaptation of software to work with non-standard speech; analysis of the speech unit representations using visualisations of the deep neural nets’ hidden layers.
Publication possibilities: high

DeepSpeech: Visualising Speech Representations in Deep Neural Networks

Recently, Deep Neural Networks (DNNs) have achieved striking performance gains on multimedia analysis tasks involving processing of speech, images, music, and video. DNNs are inspired by the human brain, which the literature often suggests to be the source of their impressive abilities. Although DNNs resemble the brain at the level of neural connections, little is known about whether they actually solve specific tasks in the same way the brain does. In this project, we focus on speech recognition, which was one of the first multimedia processing areas to see remarkable gains due to the introduction of (deep) neural networks.

In order to investigate the way DNNs solve specific tasks, visualisation of the activations of the hidden nodes in the DNN’s hidden layers is crucial. Such tools/approaches exist in the field of automatic image processing, most notably “DeepEyes” developed at the Computer Graphics and Visualization Group in the Intelligent Systems Department at TU Delft. In this project, we aim to develop a speech counterpart to “DeepEyes”, which we refer to as “DeepSpeech”.

Supervisor recommendation: high (Odette Scharenborg, in collaboration with the Computer Graphics and Visualization Group).
Background: Machine Learning or Statistics, Signal Processing, Multimedia Analysis, Interest in speech and images
Engineering/implement component: Adaptation of DeepEyes software to work with speech; analysis of the speech unit representations using visualisations of the deep neural nets’ hidden layers.
Publication possibilities: high
Literature:
- N. Pezzotti, T. Höllt, J. van Gemert, B.P.F. Lelieveldt, E. Eisemann, A. Vilanova. DeepEyes: Progressive Visual Analytics for Designing Deep Neural Networks. Transactions on Visualization and Computer Graphics (Proceedings of IEEE VIS 2017), 2018.
- Scharenborg, O., van der Gouw, N., Larson, M., Marchiori, E. (2019). The representation of speech in deep neural networks. Proceedings of the International Conference on MultiMedia Modeling, Thessaloniki, Greece.

Label extraction from failure reports

One of the main limitations in applying Machine Learning techniques to the monitoring of engineering structures is the lack of accurate and comprehensive labels. Labels are rarely available and often inaccurate, since they are obtained by relying on people to detect and promptly annotate events during operating conditions. One way around this is to extract the labels from so-called failure reports. These reports are free-text documents, written after a failure investigation has been carried out to evaluate the root cause of failure. However, manually extracting the information from these reports can be time consuming and costly. Alternatively, Natural Language Processing techniques can be developed to speed up label extraction.

Each word in the report can be represented as a point in a multi-dimensional space, the so-called embedding, to perform mathematical operations on linguistic objects. However, not all the words contained in the report are relevant. This project will explore approaches based on text pre-processing and on Latent Dirichlet Allocation models to identify the most relevant words, which yield meaningful failure labels. You will work on a very exciting dataset which reflects typical challenges in engineering applications: failure reports written as free text with typographical errors, very similar failure types to be identified, and an imbalanced number of failures per failure type. It is anticipated that the work will be presented at an international conference.
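
The pre-processing step can be pictured with a small standard-library sketch. It lowercases the reports, drops a few stop words and keeps only terms that occur in several reports but not in nearly all of them. The stop-word list, the thresholds and the example reports are all made up for illustration. A topic model such as Latent Dirichlet Allocation would then be fitted on the surviving vocabulary.

```python
import re
from collections import Counter

# Illustrative stop-word list; a real project would use a fuller one.
STOPWORDS = {"the", "a", "of", "was", "and", "to", "in", "on", "at", "is"}


def tokenize(report):
    """Lowercase a report and keep alphabetic tokens that are not stop words."""
    tokens = re.findall(r"[a-z]+", report.lower())
    return [t for t in tokens if t not in STOPWORDS and len(t) > 2]


def candidate_terms(reports, min_docs=2, max_doc_ratio=0.9):
    """Terms that appear in at least `min_docs` reports but not in almost all."""
    doc_freq = Counter()
    for report in reports:
        doc_freq.update(set(tokenize(report)))
    total = len(reports)
    return sorted(
        term
        for term, freq in doc_freq.items()
        if freq >= min_docs and freq / total <= max_doc_ratio
    )


# Hypothetical failure reports.
reports = [
    "Bearing failure due to corrosion",
    "Corrosion of the shaft seal",
    "Bearing overheated; fatigue crack at shaft",
]
terms = candidate_terms(reports)
```

Typographical errors, which the dataset is said to contain, would split one term into several rare variants, so a spelling-normalisation step would probably be needed before counting.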

Book: D. Jurafsky, J.H. Martin. Speech and Language Processing: an Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition. Pearson Prentice Hall, 2009.
Daily supervisor: Dr. Alice Cicirello, supervisor from MMC: Odette Scharenborg
Background: Machine Learning or Statistics, Multimedia Analysis

What is The Best Deep Neural Network-based Phoneme Recognition System? (multiple projects)

This topic involves building and comparing different deep neural network architectures for the task of phoneme recognition in Dutch. Multiple projects are available, each focusing on a different architecture and/or speaking style (conversational versus read Dutch speech) and/or on whether the training material is only Dutch or a combination of different languages (multi-lingual model). Questions include whether a recurrent network improves phoneme recognition, and whether a sequence-trained model (e.g., CTC) is better than a frame-based model.
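
To make the comparison concrete, the sketch below shows two small building blocks: the standard CTC decoding rule (merge repeated labels, then drop blanks) and a phoneme error rate based on edit distance. The blank symbol, the function names and the toy phoneme sequences are assumptions for the example. A frame-based model would instead emit one label per frame without a blank.

```python
BLANK = "_"  # hypothetical blank symbol


def ctc_collapse(frame_labels, blank=BLANK):
    """Merge consecutive repeats, then remove blanks (greedy CTC decoding)."""
    output = []
    previous = None
    for label in frame_labels:
        if label != previous and label != blank:
            output.append(label)
        previous = label
    return output


def phoneme_error_rate(reference, hypothesis):
    """Edit distance divided by reference length (reference must be non-empty)."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref in enumerate(reference, start=1):
        current = [i]
        for j, hyp in enumerate(hypothesis, start=1):
            cost = 0 if ref == hyp else 1
            current.append(
                min(previous[j] + 1, current[j - 1] + 1, previous[j - 1] + cost)
            )
        previous = current
    return previous[-1] / len(reference)


frames = ["k", "k", "_", "ɑ", "ɑ", "t", "_", "t"]
decoded = ctc_collapse(frames)
per = phoneme_error_rate(["k", "ɑ", "t"], decoded)
```

The blank between the two "t" frames is what allows CTC to output a repeated phoneme; without it the two would be merged.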

Supervisor recommendation: medium/high (Odette Scharenborg)
Background: Machine Learning, Signal Processing, Multimedia Analysis, Interest in speech
Engineering/implement component: implementation of new deep neural network architectures; comparing alignments output by the recognition systems
Publication possibilities: high

Confidence Intervals for Measurements of Dataset Reliability

Implement various formulations available in the literature to compute confidence intervals for certain statistical descriptors. Once implemented, evaluate their accuracy using simulation tools already available for IR. Some motivation can be found in sections 2 and 3 of this paper.
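
As one generic example of such a formulation, the sketch below computes a percentile bootstrap confidence interval. It is not claimed to be one of the formulations discussed in the paper. The statistic (the mean), the number of resamples and the example scores are illustrative assumptions.

```python
import random
import statistics


def bootstrap_ci(scores, statistic=statistics.mean, level=0.95,
                 resamples=2000, seed=0):
    """Percentile bootstrap confidence interval for a statistic of `scores`."""
    rng = random.Random(seed)
    size = len(scores)
    # Recompute the statistic on samples drawn with replacement.
    estimates = sorted(
        statistic(rng.choices(scores, k=size)) for _ in range(resamples)
    )
    lower_index = round((1 - level) / 2 * resamples)
    upper_index = round((1 + level) / 2 * resamples) - 1
    return estimates[lower_index], estimates[upper_index]


# Hypothetical per-query effectiveness scores.
scores = [0.2, 0.4, 0.6, 0.8, 0.5, 0.3, 0.7, 0.55]
lower, upper = bootstrap_ci(scores)
```

Evaluating accuracy by simulation would then amount to repeating this on many simulated datasets with a known true value and counting how often the interval contains it.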

Supervisor recommendation: high (Julián Urbano)
Background: Information Retrieval, Statistics
Engineering/implement component: Implementation of a tool for the community to use
Publication possibilities: high

Prediction of User Annotations on Structural Segmentation

Similar to the topic on music similarity annotations, but from the ground up, and for the task of structural music segmentation. See a brief description of the task here. Data and software are available here.

Supervisor recommendation: medium (Julián Urbano)
Background: Information Retrieval, Machine Learning or Statistics, Signal Processing for Music
Engineering/implement component: Implementation of models in a tool for the community to use
Publication possibilities: high