MSc Topics at the Multimedia Computing Group

Below you will find a list of open MSc thesis topics in recommender systems and information retrieval. It is important to choose a topic that you find interesting. That said, the supervisors' recommendations give you an idea of which topics they themselves consider the most interesting and promising for an MSc student (also taking into account the context, the expertise and the ongoing activities in the department).

Speech2image: Retrieve Images Using Speech

A speech-to-image system learns to map images and speech to the same embedding space, and retrieves an image using spoken captions. While doing so, the deep neural network uses multi-modal input to discover speech units in an unsupervised manner, similar to how children learn their first language. See here for a brief description of the task. This project investigates whether a different encoding of the data, and whether highlighting certain areas of the image, improves retrieval performance and the discovered speech units. Data and (parts of the) software are available.
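
To make the joint-embedding idea concrete, here is a minimal PyTorch sketch of a dual encoder trained with a triplet hinge loss; the layer sizes, the GRU over acoustic frames and the loss formulation are assumptions for illustration, not the actual baseline system.

    # Illustrative sketch (not the actual baseline): a dual encoder that maps image
    # features and spoken captions into a shared embedding space.
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class SpeechImageEmbedder(nn.Module):
        def __init__(self, image_feat_dim=2048, speech_feat_dim=39, embed_dim=512):
            super().__init__()
            # Image branch: a linear projection of pre-extracted CNN features.
            self.image_proj = nn.Linear(image_feat_dim, embed_dim)
            # Speech branch: a GRU over acoustic frames (e.g., MFCCs).
            self.speech_rnn = nn.GRU(speech_feat_dim, embed_dim, batch_first=True)

        def forward(self, image_feats, speech_frames):
            img = F.normalize(self.image_proj(image_feats), dim=-1)
            _, h = self.speech_rnn(speech_frames)            # h: (1, batch, embed_dim)
            spc = F.normalize(h.squeeze(0), dim=-1)
            return img, spc

    def triplet_hinge_loss(img, spc, margin=0.2):
        """Hinge loss over all in-batch negatives (assumed training objective)."""
        scores = img @ spc.t()                               # cosine similarities
        pos = scores.diag().unsqueeze(1)
        cost_spc = (margin + scores - pos).clamp(min=0)      # wrong captions for an image
        cost_img = (margin + scores - pos.t()).clamp(min=0)  # wrong images for a caption
        mask = torch.eye(scores.size(0), dtype=torch.bool)
        return cost_spc.masked_fill(mask, 0).mean() + cost_img.masked_fill(mask, 0).mean()

    # Retrieval then amounts to ranking all images by cosine similarity with the
    # embedding of the spoken query.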

Image2speech: Automatic Captioning of Images with Speech

The image2speech task generates a spoken description of an image. A baseline system which creates one caption for each image is described here. The goal of this project is to build a system for Dutch which outputs multiple descriptions for each image.

Adaptation of a Deep Neural Network to Non-Standard Speech

Human listeners have the ability to quickly adapt to non-standard speech, for instance an accent or a speech impediment. This project investigates deep neural networks’ ability to adapt to non-standard speech by computing the distance between sound clusters in the hidden layers for standard and non-standard speech, and by visualising the activations in the hidden layers. A description of initial work can be found here.
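
As an illustration of the proposed analysis, the sketch below computes, per phoneme, the centroid of the hidden-layer activations and the distance between the centroids obtained for standard and non-standard speech; the pre-extracted activation and label arrays are assumed inputs, and the random data is only a stand-in.

    # Illustrative sketch: compare per-phoneme activation centroids between standard
    # and non-standard speech in a given hidden layer.
    import numpy as np

    def phoneme_centroids(activations, phoneme_labels):
        """activations: (n_frames, n_units); phoneme_labels: (n_frames,) strings."""
        return {p: activations[phoneme_labels == p].mean(axis=0)
                for p in np.unique(phoneme_labels)}

    def centroid_shift(act_standard, lab_standard, act_accented, lab_accented):
        """Euclidean distance between matching phoneme centroids of the two conditions."""
        c_std = phoneme_centroids(act_standard, lab_standard)
        c_acc = phoneme_centroids(act_accented, lab_accented)
        shared = set(c_std) & set(c_acc)
        return {p: float(np.linalg.norm(c_std[p] - c_acc[p])) for p in sorted(shared)}

    # Example with random stand-in data (replace with real hidden-layer activations):
    rng = np.random.default_rng(0)
    labels = rng.choice(["a", "e", "s"], size=200)
    shift = centroid_shift(rng.normal(size=(200, 64)), labels,
                           rng.normal(size=(200, 64)), labels)
    print(shift)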

DeepSpeech: Visualising Speech Representations in Deep Neural Networks

Recently, Deep Neural Networks (DNNs) have achieved striking performance gains on multimedia analysis tasks involving processing of speech, images, music, and video. DNNs are inspired by the human brain, which the literature often suggests is the source of their impressive abilities. Although DNNs resemble the brain at the level of neural connections, little is known about whether they actually solve specific tasks in the same way the brain does. In this project, we focus on speech recognition, which was one of the first multimedia processing areas to see remarkable gains due to the introduction of (deep) neural networks.

In order to investigate the way DNNs solve specific tasks, visualisation of the activations of the hidden nodes in the DNN’s hidden layers is crucial. Such tools/approaches exist in the field of automatic image processing, most notably “DeepEyes” developed at the Computer Graphics and Visualization Group in the Intelligent Systems Department at TU Delft. In this project, we aim to develop a speech counterpart to “DeepEyes”, which we refer to as “DeepSpeech”.
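
As a first step towards such a tool, hidden-layer activations of speech frames can be projected to two dimensions and coloured by phoneme label; below is a sketch using scikit-learn's t-SNE (an assumed choice of projection, shown here with random stand-in data).

    # Sketch: a first step towards a "DeepSpeech"-style visualisation, projecting
    # hidden-layer activations of speech frames to 2D with t-SNE, coloured by phoneme.
    import numpy as np
    import matplotlib.pyplot as plt
    from sklearn.manifold import TSNE

    def plot_hidden_layer(activations, phoneme_labels, perplexity=30):
        """activations: (n_frames, n_units); phoneme_labels: (n_frames,) strings."""
        coords = TSNE(n_components=2, perplexity=perplexity, init="pca").fit_transform(activations)
        for p in np.unique(phoneme_labels):
            idx = phoneme_labels == p
            plt.scatter(coords[idx, 0], coords[idx, 1], s=4, label=p)
        plt.legend(markerscale=3, fontsize=6)
        plt.title("Hidden-layer activations per phoneme (t-SNE)")
        plt.show()

    # Demo with random stand-in data (replace with activations from the trained DNN):
    rng = np.random.default_rng(1)
    plot_hidden_layer(rng.normal(size=(300, 128)), rng.choice(["a", "i", "s", "t"], size=300))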

Speech-2-text converter / Voice Activation Device

Do you want to be part of an innovation project and have a direct and tangible positive impact on people and society? In collaboration with a startup in the medical field, and as a small sub-project, we need a very reliable and robust speech-to-text system for the medical field. The system listens for short, life-saving voice commands (e.g. “Eli help”): it takes its input from a microphone and converts it to text for post-processing and for activating a device or system.

About us: We are an innovative startup in the medical field, supported by an international team of scientists and engineers, on a mission to help and protect very fragile groups in society, such as premature babies and elderly people. Our investment is our knowledge, expertise and over 18 years of experience in healthcare at national and international level. We are independent in building and using state-of-the-art technology in a seamlessly integrated platform that includes remote sensing, multimedia, data monitoring and analytics. Over time, we have built strong relationships and collaborations with different working groups and disciplines at TU Delft, HHS Delft, companies and hospitals to help us make this innovative dream a reality.

Assignment:

Remark: You may use open-source programs and libraries to jump-start the project.
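
As one possible starting point (an illustrative sketch only), the open-source SpeechRecognition library can be used to listen for the trigger phrase; the choice of library, the recognizer back-end and the placeholder activation function are assumptions, not requirements of the project.

    # Sketch of a microphone-driven voice trigger using the open-source SpeechRecognition
    # library; the keyword is the example command from the project description.
    import speech_recognition as sr

    KEYWORD = "eli help"

    def trigger_device():
        print("Keyword detected: activating device/system ...")   # placeholder action

    recognizer = sr.Recognizer()
    with sr.Microphone() as source:
        recognizer.adjust_for_ambient_noise(source)
        while True:
            audio = recognizer.listen(source, phrase_time_limit=3)
            try:
                text = recognizer.recognize_google(audio).lower()  # or an offline engine
            except sr.UnknownValueError:
                continue                                           # nothing intelligible heard
            if KEYWORD in text:
                trigger_device()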

Supervisor: Odette Scharenborg

Automatic Discovery of Speech Units: Multi-Task and Multilingual Deep Neural Network Training

In order to train automatic speech recognition systems, deep neural networks require a large amount of annotated training speech data. However, for most languages in the world, not enough (annotated) data is available. A new research strand focuses on the automatic discovery of phoneme-like acoustic units from raw speech using out-of-domain languages. Building on previous work, this project investigates multi-task training of the deep neural network and training the network on multiple languages. Data and software are partially available.
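
A minimal sketch of what multi-task, multilingual training could look like is given below; the layer sizes and the per-language phone inventories are assumptions for illustration.

    # Illustrative sketch: a shared acoustic encoder with one phone-classification head
    # per training language, so the hidden layers are shaped by several languages at once.
    import torch
    import torch.nn as nn

    class MultilingualPhoneNet(nn.Module):
        def __init__(self, feat_dim=40, hidden_dim=256, phones_per_language=None):
            super().__init__()
            phones_per_language = phones_per_language or {"nl": 46, "en": 40}  # assumed inventories
            self.encoder = nn.Sequential(
                nn.Linear(feat_dim, hidden_dim), nn.ReLU(),
                nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
            )
            # One softmax head per language (the multi-task part).
            self.heads = nn.ModuleDict({lang: nn.Linear(hidden_dim, n)
                                        for lang, n in phones_per_language.items()})

        def forward(self, frames, language):
            return self.heads[language](self.encoder(frames))

    model = MultilingualPhoneNet()
    loss_fn = nn.CrossEntropyLoss()
    # Per batch: pick a language and compute the loss with that language's head; the
    # shared encoder receives gradients from all languages and can later be used to
    # derive phoneme-like speech units.
    frames = torch.randn(8, 40)
    labels = torch.randint(0, 46, (8,))
    loss = loss_fn(model(frames, "nl"), labels)
    loss.backward()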

Automatic Discovery of Speech Units: New Architectures and Creating Artificial Data for Deep Neural Network Training

In order to train automatic speech recognition systems, deep neural networks require a large amount of annotated training speech data. However, for most languages in the world, not enough (annotated) data is available. A new research strand focuses on the automatic discovery of phoneme-like acoustic units from raw speech using out-of-domain languages. Building on previous work, this project investigates the creation of artificial data for deep neural network training and explores different deep neural network architectures. Data and software are partially available.
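
As an illustration of what creating artificial data could involve, here is a sketch using librosa for speed perturbation, pitch shifting and additive noise; the particular augmentations and parameter values are assumptions, not the project's prescribed method.

    # Sketch of simple artificial-data creation for low-resource training.
    import numpy as np
    import librosa

    def augment(path, noise_snr_db=20.0):
        y, sr = librosa.load(path, sr=16000)
        versions = {
            "original": y,
            "slow": librosa.effects.time_stretch(y, rate=0.9),
            "fast": librosa.effects.time_stretch(y, rate=1.1),
            "pitch_up": librosa.effects.pitch_shift(y, sr=sr, n_steps=2),
        }
        # Additive white noise at a chosen signal-to-noise ratio.
        noise = np.random.randn(len(y))
        scale = np.sqrt(np.mean(y ** 2) / (np.mean(noise ** 2) * 10 ** (noise_snr_db / 10)))
        versions["noisy"] = y + scale * noise
        return sr, versions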

What is The Best Deep Neural Network-based Phoneme Recognition System? (multiple projects)

Build and compare different architectures of deep neural networks for the task of phoneme recognition in Dutch. Multiple projects are available, each focusing on a different architecture and/or speaking style (conversational Dutch versus read-speech Dutch) and/or on whether the training material is only Dutch or a combination of different languages (multilingual model). Questions include: does a recurrent network improve phoneme recognition? Is a sequence-trained model (e.g., CTC) better than a frame-based model?
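
To illustrate the difference between the two training criteria mentioned above, here is a minimal PyTorch sketch; the shapes, the phone inventory size and the GRU encoder are assumptions for illustration.

    # Sketch contrasting a frame-based criterion with a sequence (CTC) criterion.
    import torch
    import torch.nn as nn

    n_phones, feat_dim, hidden = 46, 40, 128           # assumed phone set; +1 CTC blank below
    rnn = nn.GRU(feat_dim, hidden, batch_first=True)
    out = nn.Linear(hidden, n_phones + 1)               # index 0 reserved for the blank

    frames = torch.randn(2, 100, feat_dim)              # batch of 2 utterances, 100 frames each
    logits = out(rnn(frames)[0])                        # (batch, time, n_phones + 1)

    # Frame-based criterion: one phone label per frame (requires a frame-level alignment).
    frame_labels = torch.randint(1, n_phones + 1, (2, 100))
    frame_loss = nn.CrossEntropyLoss()(logits.reshape(-1, n_phones + 1), frame_labels.reshape(-1))

    # Sequence criterion (CTC): only the phone sequence is needed, no alignment.
    targets = torch.randint(1, n_phones + 1, (2, 30))   # 30 phones per utterance
    ctc = nn.CTCLoss(blank=0)
    log_probs = logits.log_softmax(-1).transpose(0, 1)  # CTC expects (time, batch, classes)
    ctc_loss = ctc(log_probs, targets,
                   input_lengths=torch.full((2,), 100),
                   target_lengths=torch.full((2,), 30))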

Confidence Intervals for Measurements of Dataset Reliability

Implement various formulations available in the literature to compute confidence intervals for certain statistical descriptors. Once implemented, evaluate their accuracy using simulation tools already available for IR. Some motivation can be found in sections 2 and 3 of this paper.
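
As a concrete baseline of what such an estimator and its simulation-based validation could look like, here is a sketch of a percentile-bootstrap confidence interval for a mean score; the choice of descriptor and the simulation set-up are illustrative assumptions, and the formulations from the literature would replace or complement this baseline.

    # Baseline example: percentile-bootstrap confidence interval for a mean score, with
    # a simple simulation to check its empirical coverage.
    import numpy as np

    def bootstrap_ci(scores, level=0.95, n_resamples=1000, rng=None):
        rng = rng or np.random.default_rng()
        scores = np.asarray(scores, dtype=float)
        means = np.array([rng.choice(scores, size=len(scores), replace=True).mean()
                          for _ in range(n_resamples)])
        alpha = (1.0 - level) / 2.0
        return np.quantile(means, alpha), np.quantile(means, 1.0 - alpha)

    # Simulation check: how often does the interval contain the true mean (0.5)?
    rng = np.random.default_rng(42)
    hits = 0
    for _ in range(200):
        lo, hi = bootstrap_ci(rng.normal(0.5, 0.1, size=50), rng=rng)
        hits += lo <= 0.5 <= hi
    print("empirical coverage:", hits / 200)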

Order Effects in Annotation of Music Similarity

Study how the order effect described in this paper (how annotators go back and change previous ratings) affects the conclusions in this paper and investigate ways to minimize the problem, such as preference judgments.

Better Prediction of User Annotations on Music Similarity

Following this work on music similarity, build better models to predict user annotations given output from systems and metadata.

Prediction of User Annotations on Structural Segmentation

Similar to the topic on music similarity annotations, but from the ground up, and for the task of structural music segmentation. See a brief description of the task here. Data and software are available here.

TextileEmotionSensing: Exploring smart textiles for emotion recognition in mobile interactions

Can the clothes we wear or the chairs we sit on know how we’re feeling? In this project, you will explore a range of smart textiles (e.g., capacitive pressure sensors) for emotion recognition in mobile interactions. Should the textile be embedded in a couch? Should it be attached to the user? Can we robustly detect affective states such as arousal, valence, joy, anger, etc.? This project will require knowledge and know-how of embedded systems, as well as the use of fabrication techniques for embedding sensors in such fabrics. It will involve running controlled user studies to collect (and later analyze) such biometric data.

ThermalEmotions: Exploring thermal cameras for pose-invariant emotion recognition

Thermal cameras have the unique advantage of being able to capture thermal signatures (heat radiation) from energy-emitting entities. Previous work has shown the potential of such cameras for cognitive load estimation, even under high pose variance. In this project, you will explore (using a standard computer vision approach) the potential of mobile (FLIR) or possibly higher-resolution thermal cameras for pose-invariant emotion recognition. The idea is that the emotional signature on a user’s face allows such recognition when coupled with standard facial expression features. You will explore different types of state-of-the-art (SOTA) deep neural network architectures to perform such supervised emotion classification.
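
As an illustration of the supervised classification set-up, here is a sketch that fine-tunes a pretrained CNN on thermal face images; the architecture, the label set and the folder layout are assumptions for illustration only.

    # Sketch: fine-tuning a pretrained CNN for supervised emotion classification from
    # thermal images (hypothetical labels and data path).
    import torch
    import torch.nn as nn
    from torchvision import datasets, models, transforms

    EMOTIONS = ["anger", "joy", "neutral", "sadness"]        # hypothetical label set

    transform = transforms.Compose([
        transforms.Resize((224, 224)),
        transforms.ToTensor(),
    ])
    train_set = datasets.ImageFolder("thermal_faces/train", transform=transform)  # hypothetical path
    loader = torch.utils.data.DataLoader(train_set, batch_size=32, shuffle=True)

    model = models.resnet18(weights="IMAGENET1K_V1")          # pretrained backbone (assumed choice)
    model.fc = nn.Linear(model.fc.in_features, len(EMOTIONS)) # replace the classifier head

    optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
    loss_fn = nn.CrossEntropyLoss()
    for images, labels in loader:                             # one epoch as illustration
        optimizer.zero_grad()
        loss = loss_fn(model(images), labels)
        loss.backward()
        optimizer.step()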

EmotionalFingerprints: exploring individual patterns (user profiling) of emotional behavior through physiological sensing

Do we all physiologically react to events and experiences in the same way? Are there commonalities across age groups, or gender differences? In this project, the goal is to investigate emotional fingerprints and to tease out to what extent our physiological data (EKG, GSR, EEG, pupil dilation) can be used to bucket users. This data may be combined with behavioral markers such as gait and posture. The project will involve running controlled in-lab user studies and user modeling, and can provide the first steps towards an affect-aware recommender system.
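
As an illustration of the user-bucketing idea, here is a sketch that clusters participants based on aggregated physiological features; the feature set, the number of clusters and the random stand-in data are assumptions.

    # Sketch: a first attempt at "bucketing" participants from aggregated physiological features.
    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.preprocessing import StandardScaler

    # One row per participant: e.g., mean heart rate, GSR peak rate, alpha-band EEG power,
    # mean pupil dilation (replace with features computed from the collected recordings).
    features = np.random.default_rng(0).normal(size=(40, 4))

    X = StandardScaler().fit_transform(features)
    labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
    print("participants per cluster:", np.bincount(labels))
    # Next step: relate clusters to age group, gender and behavioural markers (gait, posture).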

PhysiologicalPrivacy: Quantifying privacy and monetary costs of personal physiologically sensed data

The goal of this project is to understand the privacy and monetary costs of physiologically sensed data (e.g., heart rate, GSR, EEG). You will explore different auction-based protocols to model these economic and privacy factors. You will have to run controlled user studies to understand the perceived price users place on such personal data. The intended outcome is an empirically valid monetary and privacy model of physiological data.

AffectMaps: quantifying the mood of urban city locations with wearable physiological sensing

Is there a correlation between user-defined tags and physiological markers? What about between the photos users take and the semantics of those photos? In this project, you will investigate how we can physiologically crowdsource urban feelings. This will involve either finding a physiological dataset that is geotagged, or deploying an Android crowdsourcing app to collect such data. This project will be co-supervised by Telefonica Research in Spain, who will provide call detail record (CDR) datasets for Barcelona and London.

EmotionalFashion: Visualizing (emotional) biometric data for smart fashion

Wearable biotech fashion is a growing trend; however, we still know very little about the best means of visualizing such biometric data. How and when should a necklace visualize a user’s heartbeat? Should a person’s clothes indicate their body heat emission? What should such on-body wearable sensors look like, and should they actuate in the same or a different place? This project explores the intersection between fashion, aesthetics, and wearable biotech sensors. It should result in a series of smart wearable fashion prototypes, evaluated in the field.

Internet of Things (IoT) programming notation

Development of a programming notation for a new interactive IoT platform. A declarative interface technique is used to access and control the devices without having to worry about the variety of device interfaces. This project will analyse the programming notations of other, similar projects, such as IFTTT, and design a declarative programming notation based on state changes rather than on events or traditional programming. A hypothetical sketch of such a state-based rule is given below the description of Igor.

Igor is an IoT platform using new techniques for managing IoT devices by placing a thin layer around a non-homogeneous collection of Internet of Things devices, hiding the data-format and data-access differences, and auto-updating the devices as needed. See https://www.cwi.nl/~steven/iot/
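
To illustrate the difference between state-based declarative rules and event-based programming, here is a purely hypothetical Python sketch; the rule structure and path names are invented for illustration and do not reflect Igor's actual interface or the notation to be designed in this project.

    # Hypothetical sketch of a state-based (rather than event-based) declarative rule,
    # expressed here as plain Python data plus a tiny evaluator.
    RULES = [
        {   # "whenever nobody is home and the living-room lamp is on, it should be off"
            "when": {"house/occupancy": "empty", "devices/livingroom_lamp/state": "on"},
            "then": {"devices/livingroom_lamp/state": "off"},
        },
    ]

    def apply_rules(state, rules):
        """Compute the desired device state implied by the declared rules."""
        desired = dict(state)
        for rule in rules:
            if all(desired.get(k) == v for k, v in rule["when"].items()):
                desired.update(rule["then"])
        return desired

    current = {"house/occupancy": "empty", "devices/livingroom_lamp/state": "on"}
    print(apply_rules(current, RULES))   # -> lamp state becomes "off"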

Objective metrics for point cloud quality assessment

Volumetric data captured by state-of-the-art capture devices, in its most primitive form, consists of a collection of points called a point cloud (PC). A point cloud consists of a set of individual 3D points. Each point, in addition to having a 3D (x, y, z) position, i.e., a spatial attribute, may also contain a number of other attributes such as color, reflectance, surface normal, etc. There are no spatial connections or ordering relations specified among the individual points. When a PC signal is processed, for example undergoing lossy compression to reduce its size, it is critical to be able to quantify how well the processed signal approximates the original one as perceived by the end user, i.e., the human being who will visualise the signal. The goal of this project is to develop a new algorithm (i.e., an objective full-reference quality metric) to evaluate the perceptual fidelity of a processed PC with respect to its original version. A framework implementing the objective metrics currently available in the literature to assess PC visual quality, and comparing their performance to the proposed method, will also be developed. Subjective feedback on the visual quality of the signals will be collected from users to serve as ground truth.
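
For reference, a simplified version of the point-to-point (D1) distortion that the proposed metric would be compared against might look as follows; taking the symmetric maximum and using the bounding-box diagonal as the peak value are common but assumed choices.

    # Sketch of a point-to-point (D1) style metric: symmetric mean squared
    # nearest-neighbour distance, reported as PSNR.
    import numpy as np
    from scipy.spatial import cKDTree

    def d1_psnr(reference, degraded, peak=None):
        """reference, degraded: (N, 3) arrays of XYZ coordinates."""
        def mse(a, b):
            dists, _ = cKDTree(b).query(a)    # nearest neighbour in b for every point of a
            return np.mean(dists ** 2)
        sym_mse = max(mse(reference, degraded), mse(degraded, reference))
        if peak is None:                      # assumed peak: the reference bounding-box diagonal
            peak = np.linalg.norm(reference.max(axis=0) - reference.min(axis=0))
        return 10.0 * np.log10(peak ** 2 / sym_mse)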

User navigation in 6DoF Virtual Reality

Nowadays, Virtual Reality (VR) applications are typically designed to provide an immersive experience with three Degrees of Freedom (3DoF): a user who watches a 360-degree video on a Head-Mounted Display (HMD) can choose the portion of the spherical content to view by rotating the head in a specific direction. Nevertheless, the feeling of immersion in VR results not only from the possibility to turn the head and change the viewing direction, but also from changing the viewpoint, i.e., moving within the virtual scene. VR applications allowing translations inside the virtual scene are referred to as six Degrees of Freedom (6DoF) applications.

The goal of this project is the development of a platform to capture users’ navigation patterns in 6DoF VR. First, an interface to capture the user’s position in the virtual space will be implemented in Unity3D for an HMD equipped with controllers and possibly additional sensors for positional tracking. Second, a user study to collect the navigation patterns of actual users in a virtual environment, such as a 3D scene model, will be designed and performed. Third, the data will be analysed to explore correlations between different users’ navigation behaviour.
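
As an illustration of the analysis step, the sketch below resamples two users' logged trajectories onto a common time axis and correlates their coordinates; the log format (timestamps plus 3D positions starting at time zero) is an assumption.

    # Sketch: comparing the navigation of two users in the same scene by resampling their
    # logged positions on a common time grid and correlating the coordinates.
    import numpy as np

    def resample(t, xyz, t_common):
        """Linearly interpolate logged (x, y, z) positions onto a common time grid."""
        return np.stack([np.interp(t_common, t, xyz[:, i]) for i in range(3)], axis=1)

    def trajectory_correlation(log_a, log_b, n_samples=200):
        """log = (timestamps, positions) with positions of shape (n, 3); returns mean Pearson r."""
        t_a, p_a = log_a
        t_b, p_b = log_b
        t_common = np.linspace(0, min(t_a[-1], t_b[-1]), n_samples)
        ra = resample(t_a, p_a, t_common)
        rb = resample(t_b, p_b, t_common)
        return np.mean([np.corrcoef(ra[:, i], rb[:, i])[0, 1] for i in range(3)])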

Studying the impact of audio cues on Focus of Attention in 3DoF VR

Omnidirectional (i.e., 360-degree) videos are spherical signals used in Virtual Reality (VR) applications: a user who watches a 360-degree video on a Head-Mounted Display (HMD) can choose which portion of the spherical content to display by moving the head in a specific direction. This is referred to as three Degrees of Freedom (3DoF) navigation. The portion of the spherical surface attended by the user is projected onto a segment of a plane, called the viewport. Recently, many studies have appeared proposing datasets of users’ head movements during 360-degree video consumption, to analyse how users explore immersive virtual environments. Understanding VR content navigation is crucial for many applications, such as designing VR content, developing new compression algorithms, or learning computational models of saliency or visual attention. Nevertheless, most of the existing datasets consider video content without an audio channel.

The goal of this project is the creation of a dataset of head movements of users watching 360-degree videos, with or without an audio channel, on a Head-Mounted Display (HMD). First, a user experiment to collect such data will be designed and performed. Second, the collected data will be analysed by means of statistical tools to quantify the impact of audio cues, which might drive the user’s visual Focus of Attention.
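
One possible form of the statistical analysis is sketched below; using the circular spread of yaw angles as a per-user summary statistic and a Mann-Whitney U test are assumptions for illustration, not the prescribed analysis.

    # Sketch: testing whether head-movement behaviour differs between the with-audio and
    # without-audio conditions, using the angular spread of yaw angles per viewing.
    import numpy as np
    from scipy import stats

    def angular_spread(yaw_deg):
        """Circular standard deviation (degrees) of the logged yaw angles of one viewing."""
        rad = np.deg2rad(yaw_deg)
        r = np.abs(np.mean(np.exp(1j * rad)))    # mean resultant length
        return np.rad2deg(np.sqrt(-2.0 * np.log(r)))

    def compare_conditions(spreads_audio, spreads_no_audio):
        """Non-parametric test: do audio cues narrow the users' focus of attention?"""
        return stats.mannwhitneyu(spreads_audio, spreads_no_audio, alternative="two-sided")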

Comparing the performance of mesh versus point cloud-based compression

Recent advances in 3D capturing technologies enable the generation of dynamic and static volumetric visual signals from real-world scenes and objects, opening the way to a huge number of applications using this data, from robotics to immersive communications. Volumetric signals are typically represented as polygon meshes or point clouds and can be visualized from any viewpoint, providing six Degrees of Freedom (6DoF) viewing capabilities. They represent a key enabling technology for Augmented and Virtual Reality (AR and VR) applications, which are receiving a lot of attention from the main technological innovation players, both in academic and industrial communities. Volumetric signals have extremely high data rates and thus require efficient compression algorithms able to remove the visual redundancy in the data while preserving the perceptual quality of the processed visual signal. Existing compression technologies for mesh-based signals include open-source libraries such as Draco. Compression of point cloud signals is currently under standardisation.

The goal of this project is the development of a platform to compare the performance of mesh-based versus point cloud-based compression algorithms in terms of the visual quality of the resulting compressed volumetric object. Starting from a set of high-quality point cloud (or mesh) volumetric objects, the corresponding mesh (or point cloud) representations of the same objects are extracted. Each representation is then compressed using a point cloud/mesh-based codec, and the resulting compressed signals are evaluated in terms of objective and subjective quality.
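
As an illustration of the representation-conversion step of such a platform, here is a sketch using the Open3D library; the choice of library, the sampling density, the reconstruction settings and the file names are assumptions.

    # Sketch: converting between mesh and point cloud representations of the same object.
    import open3d as o3d

    # Mesh -> point cloud: sample points uniformly on the surface.
    mesh = o3d.io.read_triangle_mesh("object.obj")        # hypothetical input file
    cloud = mesh.sample_points_uniformly(number_of_points=500_000)

    # Point cloud -> mesh: estimate normals and run Poisson surface reconstruction.
    cloud.estimate_normals()
    recon, _ = o3d.geometry.TriangleMesh.create_from_point_cloud_poisson(cloud, depth=9)

    o3d.io.write_point_cloud("object.ply", cloud)
    o3d.io.write_triangle_mesh("object_recon.ply", recon)
    # Each representation would then be compressed (e.g., Draco for the mesh, an MPEG
    # point cloud codec for the cloud) and the decoded versions compared in terms of
    # objective and subjective visual quality.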

Human perception of volumes

Recent advances in 3D capturing technologies enable the generation of dynamic and static volumetric visual signals from real-world scenes and objects, opening the way to a huge number of applications using this data, from robotics to immersive communications. Volumetric signals are typically represented as polygon meshes or point clouds and can be visualized from any viewpoint, providing six Degrees of Freedom (6DoF) viewing capabilities. They represent a key enabling technology for Augmented and Virtual Reality (AR and VR) applications, which are receiving a lot of attention from the main technological innovation players, both in academic and industrial communities. The goal of this project is to design and perform a set of psychovisual experiments, using VR technology and visualisation via a Head-Mounted Display (HMD), in which the impact on human perception of different properties of the volumetric signal representation via point clouds or meshes, such as the convexity and concavity of a surface, the resolution, the illumination and the color, is analysed. First, a review of the state of the art on the perception of volumetric objects will be performed; second, a set of open research questions will be chosen and a set of experiments will be designed and performed; finally, the collected data will be analysed in order to answer the research questions.

Reconstructing high frame rate point clouds of human bodies

Volumetric sensing, based on range-sensing technology, allows capturing the depth of an object or an entire scene, in addition to its color information. A format that has recently become widespread for representing volumetric signals captured by such sensors, such as the Intel RealSense camera, is the point cloud. A point cloud is a set of individual points in 3D space, each associated with attributes, such as a color triplet. With respect to other volumetric representations, such as polygon meshes, point cloud content generation requires much less computational processing, and is thus suitable for live capture and transmission. The goal of this project is the development of a machine learning-based approach to interpolate point clouds representing a human body captured at different instants in time, in order to increase the frame rate of a dynamic point cloud capturing a user’s body. The core of the project will be the design of the network. The second main goal will be the collection of training data, produced using a capture set-up made of multiple Intel RealSense sensors, in order to train the neural network to learn how the point cloud signal representing body movement evolves over time.
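
A simple non-learned baseline, against which the learned interpolation could later be compared, is sketched below; nearest-neighbour matching followed by linear interpolation is an illustrative assumption, not the proposed method.

    # Baseline sketch: synthesise an intermediate frame between two captured point clouds
    # by matching points with nearest neighbours and linearly interpolating their positions.
    import numpy as np
    from scipy.spatial import cKDTree

    def interpolate_frames(frame_a, frame_b, t=0.5):
        """frame_a, frame_b: (N, 3) arrays of points from consecutive captures; 0 <= t <= 1."""
        _, nearest = cKDTree(frame_b).query(frame_a)    # match every point of frame_a to frame_b
        matched_b = frame_b[nearest]
        return (1.0 - t) * frame_a + t * matched_b      # intermediate point positions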