Value of multimodal networks
In natural language processing (NLP), researchers work with a single type of data, text, to perform tasks such as text classification, question answering, or language generation. In computer vision, algorithms such as support vector machines (SVMs) or convolutional neural networks (CNNs) use image data, extracted from static images or from video frames, to perform image classification, object detection, or instance segmentation. Multimodal networks use more than one data type, or modality, to perform a variety of real-world tasks, ranging from “affect” (emotion) recognition, to media description such as video captioning, to activity recognition.
Morency and Baltrusaitis define a modality as a “certain type of information and/or the representation form in which information is stored.” Examples include natural language (spoken or written), visual data (images or videos), audio (voice, sounds, music), haptics/touch, smell, taste, self-motion, and physiological signals (e.g., electrocardiogram), as well as infrared images, depth images, and fMRI.
The history of modeling multiple modalities dates back to the 1970s, beginning with an era of behavioral and psychological research focused on gestures, followed in the 1980s by computational research into affect (emotion) recognition and audio-visual speech recognition (AVSR). The 2000s marked the “interaction” era, focused on modeling human multimodal interaction; a representative example is the CALO Meeting Assistant, funded by the Defense Advanced Research Projects Agency (DARPA) in 2008, which provides distributed meeting capture, annotation, automatic transcription, and semantic analysis of multiparty meetings. Starting in the 2010s, multimodal research entered the “deep learning” era, driven by new large-scale multimodal datasets, access to GPU machines, and algorithmic advances in computer vision and NLP.
Fusing multiple modalities of data into one model offers several benefits over unimodal algorithms. The most obvious is that different modalities may capture complementary information that is not available from one type of data alone. Second, a fusion system can make predictions from a single modality even when another is unavailable; for example, emotion can be recognized from a face image alone when the person is not speaking and no acoustic signal can be heard. Multimodal machine learning (MMML) trains on multiple modalities at once, such as image and text combined (a minimal fusion sketch is shown below). Baltrusaitis et al. provide a taxonomy of MMML and describe five major challenges that arise from working with multimodal data: i) multimodal representation, ii) translation, iii) alignment, iv) fusion, and v) co-learning. Each challenge is discussed below with a combination of seminal papers and healthcare-relevant research on each topic. An overview of the papers surveyed is provided in Table 1.
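Before turning to the individual challenges, the sketch below illustrates the basic idea of feature-level fusion described above: embeddings from two modalities (e.g., an image and a text transcript) are concatenated into a joint representation and passed to a classifier. This is a minimal illustration only; the module name `SimpleFusionClassifier`, the embedding dimensions, and the seven-class output are illustrative assumptions, not taken from the surveyed papers.

```python
# Minimal sketch of concatenation-based multimodal fusion (illustrative only).
import torch
import torch.nn as nn

class SimpleFusionClassifier(nn.Module):
    """Concatenates an image embedding and a text embedding, then classifies."""
    def __init__(self, image_dim=512, text_dim=300, hidden_dim=256, num_classes=7):
        super().__init__()
        self.fusion = nn.Sequential(
            nn.Linear(image_dim + text_dim, hidden_dim),  # joint multimodal representation
            nn.ReLU(),
            nn.Linear(hidden_dim, num_classes),           # e.g., 7 hypothetical emotion labels
        )

    def forward(self, image_emb, text_emb):
        # Fuse the two modalities by concatenation along the feature dimension.
        joint = torch.cat([image_emb, text_emb], dim=-1)
        return self.fusion(joint)

# Example usage with random stand-in features for a batch of 4 samples.
model = SimpleFusionClassifier()
image_emb = torch.randn(4, 512)   # e.g., CNN features extracted from face images
text_emb = torch.randn(4, 300)    # e.g., averaged word embeddings of a transcript
logits = model(image_emb, text_emb)
print(logits.shape)  # torch.Size([4, 7])
```

In practice, concatenation is only one of several fusion strategies; the fusion challenge discussed later covers alternatives such as attention-based and tensor-based fusion.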