
Multimodal Personality Detection


Role: Machine Learning Intern

Date: May 2024 - July 2024

Implemented a proprietary personality detection ML algorithm that takes in a real-time video feed of a user and outputs a score for the openness, conscientiousness, extroversion, agreeableness, and neuroticism (OCEAN) traits. In the process, I designed a novel fusion mechanism that captures temporal dependencies across audio and video frame sequences.

So what was I trying to do?

How do we go from a raw video of a user...

To this?

Mike's apparent personality (OCEAN model):

| Openness | Conscientiousness | Extroversion | Agreeableness | Neuroticism |
| 0.63 | 0.82 | 0.48 | 0.56 | 0.12 |

All values lie between 0 and 1.

The solution lies in Multimodal Machine Learning

But what is Multimodal Machine Learning?

Multimodal machine learning is a type of learning in which models are trained to understand and work with multiple modalities of data, such as text, audio, and images. In the case of personality detection, the model must work with both raw audio and image frames from the video.

A multimodal machine learning model generally has the following structure:

1. Encoder: this module generates feature representations of the data.
2. Fusion: this module fuses the feature representations of the input modalities into a single vector.
3. Prediction: this module uses either classification or regression to produce the final model prediction.
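To make those three modules concrete, here is a minimal PyTorch sketch of an encoder → fusion → prediction pipeline. The module names, layer choices, and sizes are illustrative assumptions, not the proprietary implementation:

```python
import torch
import torch.nn as nn

class MultimodalRegressor(nn.Module):
    """Toy encoder -> fusion -> prediction pipeline for OCEAN trait regression."""

    def __init__(self, audio_dim=128, visual_dim=256, hidden_dim=256, num_traits=5):
        super().__init__()
        # (1) Encoders: stand-ins for the audio and visual subnetworks.
        self.audio_encoder = nn.Sequential(nn.Linear(audio_dim, hidden_dim), nn.ReLU())
        self.visual_encoder = nn.Sequential(nn.Linear(visual_dim, hidden_dim), nn.ReLU())
        # (3) Prediction: regression head; the sigmoid keeps each score in [0, 1].
        self.head = nn.Sequential(
            nn.Linear(2 * hidden_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, num_traits),
            nn.Sigmoid(),
        )

    def forward(self, audio_feats, visual_feats):
        # (2) Fusion: concatenate the two modality representations.
        fused = torch.cat([self.audio_encoder(audio_feats),
                           self.visual_encoder(visual_feats)], dim=-1)
        return self.head(fused)  # shape: (batch, 5) -> O, C, E, A, N scores

model = MultimodalRegressor()
scores = model(torch.randn(4, 128), torch.randn(4, 256))  # dummy batch of 4 clips
print(scores.shape)  # torch.Size([4, 5])
```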

I experimented with various types of encoders. I refer to encoders that produce feature representations from image frames as visual subnetworks, and to those that produce feature representations from raw audio as audio subnetworks.

Here are a couple of the visual subnetworks I worked with: 

[Figure: Video Action Transformer (VAT) architecture]

The Video Action Transformer (VAT) is a network in which an input clip is passed through a trunk network (the initial layers of I3D) to produce a spatio-temporal feature representation. The center frame of the feature map is then passed into a Region Proposal Network (RPN) to generate bounding-box proposals. Augmented with location embeddings, the feature map and each proposal are passed into a stack of Action Transformer units that extract the features needed for regression.

[Figure: FaceNet architecture]

FaceNet is a network that produces compact, discriminative embeddings of face images; it was originally intended for face recognition, verification, and clustering tasks. The FaceNet architecture consists of a batch input layer and a deep CNN followed by L2 normalization, which yields the face embedding.
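As a rough illustration of how a FaceNet-style encoder can embed individual video frames, here is a sketch using the open-source facenet-pytorch package; the frame path is a hypothetical placeholder, and this is not the proprietary code I wrote at Openstream.AI:

```python
import torch
from PIL import Image
from facenet_pytorch import MTCNN, InceptionResnetV1

mtcnn = MTCNN(image_size=160)                              # face detection + alignment
encoder = InceptionResnetV1(pretrained='vggface2').eval()  # FaceNet-style embedding network

frame = Image.open("frame_0001.jpg")   # hypothetical path to one extracted video frame
face = mtcnn(frame)                    # cropped, aligned face tensor (3, 160, 160), or None
if face is not None:
    with torch.no_grad():
        embedding = encoder(face.unsqueeze(0))  # (1, 512) embedding for downstream fusion
```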


Here are a few of the audio subnetworks I worked with:

[Figure: VGGish architecture]

VGGish is a network inspired by the VGG architecture, originally intended for sound classification and music information retrieval. The VGGish architecture consists of convolutional layers with 3x3 filters, max pooling and global average pooling layers, and fully connected layers.
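To give a sense of that structure, here is a toy VGG-style audio encoder over log-mel spectrogram patches; it mirrors the 3x3 convolutions, pooling, and fully connected embedding layer described above but is not the actual VGGish network or its weights:

```python
import torch
import torch.nn as nn

class VGGishStyleEncoder(nn.Module):
    """Toy VGG-style audio encoder: 3x3 convs + pooling, then a small embedding head."""

    def __init__(self, embed_dim=128):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 64, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(64, 128, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(128, 256, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),                 # global average pooling
        )
        self.fc = nn.Linear(256, embed_dim)          # fully connected embedding layer

    def forward(self, log_mel):                      # log_mel: (batch, 1, 96, 64) patches
        x = self.features(log_mel).flatten(1)
        return self.fc(x)                            # (batch, 128) audio embedding

encoder = VGGishStyleEncoder()
print(encoder(torch.randn(2, 1, 96, 64)).shape)      # torch.Size([2, 128])
```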

[Figure: Wav2Vec 2.0 architecture]

Wav2Vec 2.0 is an audio subnetwork that produces audio embeddings from raw audio data. The model uses a feature encoder that processes the raw waveform with multiple convolutional layers, a quantization module that discretizes the encoder's output into a set of codebook vectors (used as targets during pre-training), and a Transformer context network that processes the encoder's latent representations to capture long-range dependencies and produce contextualized embeddings.
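For reference, frame-level contextual embeddings can be extracted from a pretrained Wav2Vec 2.0 checkpoint with Hugging Face's transformers library, roughly as sketched below; the checkpoint name and dummy waveform are illustrative assumptions:

```python
import torch
from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2Model

extractor = Wav2Vec2FeatureExtractor.from_pretrained("facebook/wav2vec2-base-960h")
model = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base-960h").eval()

waveform = torch.randn(16000)  # 1 second of dummy 16 kHz audio; replace with a real clip
inputs = extractor(waveform.numpy(), sampling_rate=16000, return_tensors="pt")
with torch.no_grad():
    hidden = model(**inputs).last_hidden_state  # (1, num_frames, 768) contextual embeddings
```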

[Figure: WavLM architecture]

WavLM is a self-supervised speech model that jointly learns masked speech prediction and denoising during pre-training. The WavLM model consists of a convolutional encoder and a Transformer encoder. The convolutional encoder is composed of seven blocks of temporal convolution, each followed by layer normalization and a GELU activation, while the Transformer is equipped with a convolution-based relative position embedding layer (kernel size 128, 16 groups) at the bottom.
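WavLM can be queried in much the same way through transformers. The sketch below (checkpoint name assumed, not the proprietary setup) also exposes the per-layer hidden states, which can help when deciding which Transformer layer to feed into fusion:

```python
import torch
from transformers import AutoFeatureExtractor, WavLMModel

extractor = AutoFeatureExtractor.from_pretrained("microsoft/wavlm-base")
model = WavLMModel.from_pretrained("microsoft/wavlm-base").eval()

waveform = torch.randn(16000)  # dummy 16 kHz audio clip
inputs = extractor(waveform.numpy(), sampling_rate=16000, return_tensors="pt")
with torch.no_grad():
    out = model(**inputs, output_hidden_states=True)
layer_feats = out.hidden_states  # tuple of (num_layers + 1) tensors, each (1, frames, 768)
```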


The architecture of my personality detection model is as follows:

[Figure: Personality detection model architecture]

My model uses Wav2Vec 2.0 and FaceNet as its audio and visual subnetworks, respectively. The feature representations produced by these subnetworks are fed into a temporal aggregation module, a key component of my model's fusion mechanism. The temporal aggregation module produces new feature representations that also capture the temporal dependencies across the sequence of image frames and audio signals. Of the fusion mechanisms I evaluated (concatenation, mean, and cross-attention), concatenation worked best for fusing the aggregated vectors produced by the previous module. Finally, the fused vector is fed into the regression module to produce the predictions.
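To make the data flow concrete, here is a hypothetical sketch of that pipeline, with mean pooling standing in for the temporal aggregation module and all dimensions illustrative; the actual aggregation and fusion code remains proprietary:

```python
import torch
import torch.nn as nn

class TemporalFusionRegressor(nn.Module):
    """Illustrative: aggregate per-frame features over time, concatenate modalities, regress OCEAN."""

    def __init__(self, audio_dim=768, visual_dim=512, num_traits=5):
        super().__init__()
        self.head = nn.Sequential(
            nn.Linear(audio_dim + visual_dim, 256),
            nn.ReLU(),
            nn.Linear(256, num_traits),
            nn.Sigmoid(),  # keeps each trait score in [0, 1]
        )

    def forward(self, audio_seq, visual_seq):
        # audio_seq: (batch, audio_frames, audio_dim), e.g. Wav2Vec 2.0 outputs
        # visual_seq: (batch, video_frames, visual_dim), e.g. per-frame FaceNet embeddings
        audio_vec = audio_seq.mean(dim=1)    # temporal aggregation (mean pooling as a stand-in)
        visual_vec = visual_seq.mean(dim=1)
        fused = torch.cat([audio_vec, visual_vec], dim=-1)  # concatenation fusion
        return self.head(fused)              # (batch, 5): O, C, E, A, N scores

model = TemporalFusionRegressor()
scores = model(torch.randn(2, 49, 768), torch.randn(2, 30, 512))
print(scores)  # five scores per clip, each between 0 and 1
```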

Ultimately, I see personality detection as a highly relevant problem with plenty of potential applications. An artificial agent capable of understanding a user's apparent personality could improve game interactions and customer service, or even help inform company recruitment.

Takeaways from my experience working at Openstream.AI? 

Working at Openstream.AI gave me great exposure to the field of multimodal machine learning, where I learned about state-of-the-art models and architectures. Working under my supervisor, I gained insight into the industry research process, from applying techniques and models from the literature and acquiring publicly available datasets for commercial use to refining my experimentation process as I trained models and gathered results. In addition to strengthening my fundamental machine learning knowledge, I learned to write more efficient code to reduce model runtime and make the best use of the GPUs on the company's remote servers.

** Please note all code is proprietary to Openstream.AI and therefore not publicly accessible **


