Multimodal Personality Detection


Role: Machine Learning Intern
Date: May 2024 - July 2024
Implemented a proprietary personality detection ML algorithm that takes in a real-time video feed of a user and outputs a score for each of the openness, conscientiousness, agreeableness, extroversion, and neuroticism traits. In the process, I designed a novel fusion mechanism that captures temporal dependencies across audio and video frame sequences.
So what was I trying to do?
How do we go from a raw video of a user... to this?
Mike's Apparent Personality (OCEAN Model)

| Openness | Conscientiousness | Extroversion | Agreeableness | Neuroticism |
|   0.63   |        0.82       |     0.48     |      0.56     |    0.12     |

All values are between 0 and 1.
The solution lies in Multimodal Machine Learning
But what is Multimodal Machine Learning?
Multimodal machine learning is a branch of machine learning in which models are trained to understand and work with multiple modalities of data, including text, audio, and images. In the case of personality detection, the model must work with raw audio as well as image frames from the video.
A multimodal machine learning model generally has the following structure:

Encoder (1): This module generates feature representations of the data.
Fusion (2): This module fuses the feature representations of the input modalities into a single vector.
Prediction (3): This module uses either classification or regression to produce the final model prediction. A rough sketch of this overall layout appears below.
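
Since the project code is proprietary, the following is only an illustration of this encoder, fusion, and prediction layout in PyTorch, with placeholder encoder modules and an assumed concatenation-based fusion; the concrete subnetworks I used are described next.

```python
# Illustrative only: a generic multimodal skeleton with placeholder encoders.
import torch
import torch.nn as nn

class MultimodalModel(nn.Module):
    def __init__(self, audio_encoder, visual_encoder, fused_dim, num_traits=5):
        super().__init__()
        self.audio_encoder = audio_encoder          # (1) encoders: raw modality -> features
        self.visual_encoder = visual_encoder
        self.predictor = nn.Linear(fused_dim, num_traits)  # (3) regression over OCEAN traits

    def forward(self, audio, frames):
        a = self.audio_encoder(audio)               # audio feature vector
        v = self.visual_encoder(frames)             # visual feature vector
        fused = torch.cat([a, v], dim=-1)           # (2) fusion by concatenation
        return self.predictor(fused)
```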
I experimented with various types of encoders. I refer to encoders used to produce feature representations from image frames as visual subnetworks and those that produce feature representations from raw audio as audio subnetworks.
Here are a couple of the visual subnetworks I worked with:

The Video Action Transformer (VAT) is a network in which an input clip is passed through a trunk network, represented by the initial layers of I3D, to produce a spatio-temporal feature representation. The center frame of the feature map is then passed into a Region Proposal Network (RPN) to generate bounding-box proposals. Augmented with location embeddings, the feature map and each proposal are passed into a stack of Action Transformer units that extract the features needed for regression.
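
The actual implementation is proprietary, so here is only a loose structural sketch of the VAT-style flow in PyTorch: a single 3D convolution stands in for the I3D trunk, proposals are supplied externally in place of an RPN, and attention units let each proposal feature attend over the center-frame feature map. All module sizes are illustrative assumptions.

```python
# Illustrative VAT-style sketch, not the original Action Transformer implementation.
import torch
import torch.nn as nn
from torchvision.ops import roi_align

class ActionTransformerUnit(nn.Module):
    """Proposal feature (query) attends over the flattened center-frame feature map."""
    def __init__(self, dim=256, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.ff = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))
        self.norm1, self.norm2 = nn.LayerNorm(dim), nn.LayerNorm(dim)

    def forward(self, query, context):
        query = self.norm1(query + self.attn(query, context, context)[0])
        return self.norm2(query + self.ff(query))

class VATSketch(nn.Module):
    def __init__(self, dim=256, num_units=2):
        super().__init__()
        # Stand-in for the I3D trunk: one 3D conv block producing spatio-temporal features.
        self.trunk = nn.Sequential(nn.Conv3d(3, dim, 3, stride=2, padding=1), nn.ReLU())
        self.units = nn.ModuleList(ActionTransformerUnit(dim) for _ in range(num_units))

    def forward(self, clip, boxes):
        # clip: (B, 3, T, H, W); boxes: list of one (1, 4) proposal per clip, already in
        # feature-map coordinates (an RPN would normally produce these).
        fmap = self.trunk(clip)                          # (B, dim, T', H', W')
        center = fmap[:, :, fmap.shape[2] // 2]          # center-frame map (B, dim, H', W')
        query = roi_align(center, boxes, output_size=1).flatten(1).unsqueeze(1)  # (B, 1, dim)
        context = center.flatten(2).transpose(1, 2)      # (B, H'*W', dim)
        for unit in self.units:
            query = unit(query, context)
        return query.squeeze(1)                          # per-person feature for the regressor
```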


FaceNet is a network that produces compact, discriminative embeddings of facial images; it was originally intended for face recognition, verification, and clustering tasks. The FaceNet architecture consists of a batch input layer and a deep CNN followed by L2 normalization.
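
As a hedged example of how per-frame face embeddings could be extracted, the sketch below assumes the open-source facenet-pytorch package with its pretrained 'vggface2' weights and MTCNN face detector; the internship code itself is proprietary and may differ.

```python
# Sketch: per-frame 512-d FaceNet embeddings via facenet-pytorch (an assumption).
import torch
from facenet_pytorch import MTCNN, InceptionResnetV1

mtcnn = MTCNN(image_size=160)                              # detects and crops the face
facenet = InceptionResnetV1(pretrained='vggface2').eval()  # 512-d face embeddings

def frame_embeddings(frames):
    """frames: list of PIL images sampled from the video."""
    embeddings = []
    with torch.no_grad():
        for frame in frames:
            face = mtcnn(frame)                            # (3, 160, 160) crop, or None
            if face is not None:
                embeddings.append(facenet(face.unsqueeze(0)))  # (1, 512)
    return torch.cat(embeddings) if embeddings else torch.empty(0, 512)
```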

Here are a couple of the audio subnetworks I worked with:

VGGish is a network inspired by the VGG architecture, originally intended for sound classification and music information retrieval. The VGGish architecture consists of convolutional layers with 3x3 filters, max pooling and global average pooling layers, as well as fully-connected layers.
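
To make the VGGish input pipeline concrete, here is a rough stand-in (not Google's released VGGish weights): log-mel patches close to VGGish's 96-frame by 64-mel-band input, fed to a small stack of 3x3 convolutions. The file name and layer sizes are placeholders.

```python
# VGG-style stand-in for VGGish: log-mel patches -> 3x3 conv stack -> 128-d embedding.
import torch
import torch.nn as nn
import torchaudio

mel = torchaudio.transforms.MelSpectrogram(
    sample_rate=16000, n_fft=400, hop_length=160, n_mels=64)   # ~10 ms hop, 64 mel bands

vgg_like = nn.Sequential(
    nn.Conv2d(1, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
    nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(),
    nn.Linear(128, 128),                                        # 128-d embedding, as in VGGish
)

waveform, sr = torchaudio.load("clip.wav")                      # placeholder mono 16 kHz clip
log_mel = (mel(waveform) + 1e-6).log()                          # (1, 64, frames)
patches = log_mel.unfold(-1, 96, 96).permute(2, 0, 1, 3)        # (num_patches, 1, 64, 96)
embeddings = vgg_like(patches)                                  # (num_patches, 128)
```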


Wav2Vec 2.0 is used as an audio subnetwork that produces audio embeddings from raw audio data. The Wav2Vec 2.0 model uses an encoder that processes the raw audio waveform with multiple convolutional layers, a quantization module that discretizes the encoder's output into a set of codebook vectors (used as targets during self-supervised pre-training), and a Transformer context network that processes the encoder's latent representations to capture long-range dependencies and produce contextualized embeddings.
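
As an example of extracting frame-level Wav2Vec 2.0 embeddings, the sketch below assumes the Hugging Face transformers implementation and the facebook/wav2vec2-base-960h checkpoint; the proprietary pipeline may load the model differently.

```python
# Sketch: contextual Wav2Vec 2.0 embeddings via Hugging Face transformers (an assumption).
import torch
import torchaudio
from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2Model

extractor = Wav2Vec2FeatureExtractor.from_pretrained("facebook/wav2vec2-base-960h")
wav2vec = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base-960h").eval()

waveform, sr = torchaudio.load("clip.wav")                        # placeholder clip
waveform = torchaudio.functional.resample(waveform, sr, 16000).mean(dim=0)

inputs = extractor(waveform.numpy(), sampling_rate=16000, return_tensors="pt")
with torch.no_grad():
    audio_seq = wav2vec(**inputs).last_hidden_state               # (1, frames, 768)
```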


WavLM is a self-supervised speech model that jointly learns masked speech prediction and denoising during pre-training. The WavLM model consists of a convolutional encoder and a Transformer encoder. The convolutional encoder is composed of seven blocks of temporal convolution, each followed by layer normalization and a GELU activation layer, while the Transformer is equipped with a convolution-based relative position embedding layer (kernel size 128, 16 groups) at the bottom.
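
WavLM drops into the same kind of pipeline as Wav2Vec 2.0; a short sketch assuming the transformers implementation and the microsoft/wavlm-base checkpoint follows.

```python
# Sketch: WavLM as a drop-in audio subnetwork (checkpoint name is an assumption).
import torch
from transformers import AutoFeatureExtractor, WavLMModel

extractor = AutoFeatureExtractor.from_pretrained("microsoft/wavlm-base")
wavlm = WavLMModel.from_pretrained("microsoft/wavlm-base").eval()

def wavlm_embeddings(waveform_16k):
    """waveform_16k: 1-D float tensor of mono 16 kHz audio."""
    inputs = extractor(waveform_16k.numpy(), sampling_rate=16000, return_tensors="pt")
    with torch.no_grad():
        return wavlm(**inputs).last_hidden_state      # (1, frames, 768) contextual embeddings
```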

The architecture of my personality detection model is the following:

The audio and visual subnetworks chosen for my model are Wav2Vec 2.0 and FaceNet, respectively. The feature representations produced by these subnetworks are fed into a temporal aggregation module, a key component of my model's fusion mechanism. The temporal aggregation module produces new feature representations that also capture the temporal dependencies in the sequence of image frames and audio signals. Of the fusion mechanisms I tried, including concatenation, mean, and cross-attention, concatenation worked best for fusing the aggregation vectors produced by the previous module. Finally, the fused vector is fed into the regression module to produce predictions.
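
To make this concrete, here is a minimal sketch of the fusion idea: per-frame FaceNet features and Wav2Vec 2.0 features are each temporally aggregated, concatenated, and regressed to five OCEAN scores. The GRU-based aggregator and all dimensions are illustrative assumptions, not the proprietary fusion mechanism.

```python
# Illustrative sketch of temporal aggregation + concatenation fusion + regression.
import torch
import torch.nn as nn

class TemporalAggregator(nn.Module):
    """Summarizes a (batch, time, dim) feature sequence into a single (batch, hidden) vector."""
    def __init__(self, in_dim, hidden=256):
        super().__init__()
        self.rnn = nn.GRU(in_dim, hidden, batch_first=True)

    def forward(self, sequence):
        _, last_hidden = self.rnn(sequence)       # (1, batch, hidden)
        return last_hidden.squeeze(0)

class PersonalityRegressor(nn.Module):
    def __init__(self, visual_dim=512, audio_dim=768, hidden=256):
        super().__init__()
        self.visual_agg = TemporalAggregator(visual_dim, hidden)
        self.audio_agg = TemporalAggregator(audio_dim, hidden)
        self.head = nn.Sequential(
            nn.Linear(2 * hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 5), nn.Sigmoid(),   # five OCEAN scores in [0, 1]
        )

    def forward(self, face_seq, audio_seq):
        fused = torch.cat([self.visual_agg(face_seq), self.audio_agg(audio_seq)], dim=-1)
        return self.head(fused)

model = PersonalityRegressor()
scores = model(torch.randn(2, 30, 512), torch.randn(2, 100, 768))   # -> (2, 5)
```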
Ultimately, I see personality detection as a highly relevant problem with plenty of potential applications. An artificial agent capable of understanding a user's apparent personality could improve game interactions and customer service, or even help inform company recruitment.
Takeaways from my experience working at Openstream.AI?
Working at Openstream.AI gave me great exposure to the field of multimodal machine learning, where I learned about state-of-the-art models and architectures. Working under my supervisor, I gained insight into the research process in industry, whether it was applying techniques and models from the literature, acquiring publicly available datasets for commercial use, or fine-tuning my experimentation process as I trained models and gathered results. In addition to enhancing my fundamental machine learning knowledge, I learned to write code more efficiently to reduce model runtime and make the best use of the GPUs offered on the company's remote servers.
** Please note all code is proprietary to Openstream.AI and therefore not publicly accessible **
