Explaining Active Speaker Detection (ASD)

Example of Active Speaker Detection (ground-truth labeled image). Credit: "LoCoNet: Long-Short Context Network for Active Speaker Detection"

Introduction

Imagine a system that can model how people work and interact. One interaction worth modeling is who is speaking in a given scene or frame. Detecting this lets a model learn how people interact (e.g. two people taking turns, one waiting for the other, or both talking over each other), and it also supports human-model interaction, speech diarization (segmenting audio and identifying who spoke when), and much more.
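To make the diarization link concrete, here is a minimal sketch of how per-frame active-speaker labels could be turned into "who spoke when" segments. The function names, the speaker ids, and the frame rate are all hypothetical and chosen only for illustration; a real pipeline would also smooth the labels and work from the audio.

```python
from itertools import groupby

def frames_to_segments(speaking, fps):
    """Turn per-frame speaking flags (0/1) into (start_sec, end_sec) segments."""
    segments = []
    frame = 0
    for is_speaking, run in groupby(speaking):
        length = len(list(run))
        if is_speaking:
            segments.append((frame / fps, (frame + length) / fps))
        frame += length
    return segments

def diarize(per_speaker_flags, fps=25.0):
    """Map each (hypothetical) speaker id to the time segments in which they speak."""
    return {spk: frames_to_segments(flags, fps)
            for spk, flags in per_speaker_flags.items()}

# Toy labels: both speakers overlap on frame 2.
flags = {
    "speaker_A": [1, 1, 1, 0, 0, 0, 1, 1],
    "speaker_B": [0, 0, 1, 1, 1, 0, 0, 0],
}
timeline = diarize(flags, fps=1.0)
for speaker, segments in timeline.items():
    print(speaker, segments)
```

Overlapping segments across speakers correspond to the "both talking over each other" case.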

Model Architectures

In this section, we will discuss two models: TalkNet and LoCoNet.

TalkNet's goal is to capture long-term temporal context, since previous models focused only on short-term context drawn from small temporal segments.

Their solution uses two cross-attention mechanisms, swapping which modality supplies the queries: in one, the queries come from the audio features and the keys and values from the visual features; in the other, the roles are reversed. The two cross-attention outputs are then concatenated and passed through a self-attention mechanism. Finally, a linear layer produces the outputs.
Diagram of the multimodal fusion technique
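The fusion steps described above can be sketched in a few lines of NumPy. This is a single-head illustration under assumptions, not TalkNet's actual implementation: the sizes `T` and `d`, the helper names, and the random projection matrices (stand-ins for learned weights) are all hypothetical, and residual connections and normalization are left out.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)

def attention(query_src, context_src, w_q, w_k, w_v):
    """Scaled dot-product attention: queries from query_src, keys/values from context_src."""
    q, k, v = query_src @ w_q, context_src @ w_k, context_src @ w_v
    weights = softmax(q @ k.T / np.sqrt(q.shape[-1]))  # (T, T) frame-to-frame weights
    return weights @ v

rng = np.random.default_rng(0)

def random_proj(d_in, d_out):
    """Random stand-in for a learned projection matrix."""
    return rng.standard_normal((d_in, d_out)) / np.sqrt(d_in)

T, d = 8, 16  # hypothetical: 8 frames, 16-dimensional features per frame
audio = rng.standard_normal((T, d))
visual = rng.standard_normal((T, d))

# Cross-attention 1: audio queries attend over visual keys/values.
audio_ctx = attention(audio, visual, random_proj(d, d), random_proj(d, d), random_proj(d, d))
# Cross-attention 2: the roles are swapped.
visual_ctx = attention(visual, audio, random_proj(d, d), random_proj(d, d), random_proj(d, d))

# Concatenate along the feature axis, then apply self-attention over time.
fused = np.concatenate([audio_ctx, visual_ctx], axis=-1)  # (T, 2d)
refined = attention(fused, fused,
                    random_proj(2 * d, 2 * d), random_proj(2 * d, 2 * d), random_proj(2 * d, 2 * d))

# Final linear layer: per-frame scores for (not speaking, speaking).
probs = softmax(refined @ random_proj(2 * d, 2))
print(fused.shape, probs.shape)
```

Because the self-attention weights span all `T` frames, every output frame can draw on the whole clip, which is how the long-term context enters.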

LoCoNet (Wang et al. 2024)

LoCoNet, or Long-Short Context Network, is a transformer-based model that incorporates audio and visual cues to detect whether somebody is speaking or not. 

Their proposed solution is two-fold: use a self-attention mechanism to detect how a single face changes over time (Long-term Intra-Speaker Modeling), and use cues from other people in frames as context to detect whether or not a person is speaking (Short-Term Inter-Speaker Modeling). 
Overview of the LoCoNet model

The network encodes audio (a Mel-spectrogram) with a modified VGGish network: the authors remove the pooling layer from the fourth block and instead use a deconvolution layer to upsample the features.
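To see why swapping a pooling layer for a deconvolution changes the temporal resolution, the standard output-length formulas are enough. The kernel size, stride, and input length below are hypothetical values picked for illustration, not the settings from the paper.

```python
def pool_out_len(length, kernel=2, stride=2):
    """Output length of an unpadded pooling layer along one axis."""
    return (length - kernel) // stride + 1

def deconv_out_len(length, kernel=2, stride=2, padding=0):
    """Output length of a transposed convolution (no dilation or output padding)."""
    return (length - 1) * stride - 2 * padding + kernel

steps = 64  # hypothetical number of time steps entering the layer
print("after pooling:      ", pool_out_len(steps))    # halves the length
print("after deconvolution:", deconv_out_len(steps))  # doubles the length
```

With these settings a pooling layer halves the number of time steps, while a deconvolution doubles it, so the audio features end up with a finer temporal grid.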

Then, the encoded features are passed into the Long-term Intra-Speaker Modeling (LIM) block. Here, the audio and visual features each go through a self-attention mechanism and a linear layer. Subscript v denotes video; the same operations apply to audio, which is denoted by subscript a.
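For reference, a generic single-head form of the self-attention step can be written as follows. This is a sketch under assumptions rather than the paper's exact formulation: u_v (the visual features of one speaker across frames), the projection matrices W^Q, W^K, W^V, the feature dimension d, and the output name are notation introduced here, and details such as multiple heads, residual connections, and normalization are omitted.

```latex
\begin{aligned}
Q_v &= u_v W^{Q}, \qquad K_v = u_v W^{K}, \qquad V_v = u_v W^{V} \\
\mathrm{SelfAttn}(u_v) &= \mathrm{softmax}\!\left(\frac{Q_v K_v^{\top}}{\sqrt{d}}\right) V_v \\
\hat{u}_v &= \mathrm{Linear}\big(\mathrm{SelfAttn}(u_v)\big)
\end{aligned}
```

Replacing every subscript v with a gives the audio version.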

Then, the audio and visual features go through cross-attention mechanisms that fuse the two modalities. The result on the visual side is the audio-enhanced visual embeddings, and the same is done vice versa for audio.
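A generic single-head way to write this cross-attention is given below. It assumes the queries come from the modality being enhanced and the keys and values from the other one; Q_m, K_m, V_m stand for linear projections of the features of modality m (a for audio, v for video), d is the feature dimension, and the arrow subscripts are notation introduced here, not necessarily the paper's.

```latex
\begin{aligned}
u_{a \rightarrow v} &= \mathrm{softmax}\!\left(\frac{Q_v K_a^{\top}}{\sqrt{d}}\right) V_a
  && \text{(audio-enhanced visual embeddings)} \\
u_{v \rightarrow a} &= \mathrm{softmax}\!\left(\frac{Q_a K_v^{\top}}{\sqrt{d}}\right) V_v
  && \text{(visual-enhanced audio embeddings)}
\end{aligned}
```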

Next is the Short-term Inter-Speaker Modeling (SIM) block. Here, the authors use a CNN to extract features that capture how the visual cues and the audio cues change over time (each modality separately). The result is then passed through a linear layer to get the outputs/predictions.

The equation would be, 

Finally, the audio and visual features are concatenated into a single vector that holds the outputs.
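The SIM block and the final concatenation can be sketched as below. This is a rough illustration under assumptions: the kernel spans a few neighbouring speakers and a short window of frames, the sizes `S`, `T`, `D`, `s`, and `k` are hypothetical, the weights are random stand-ins for learned ones, and what is computed is a cross-correlation, as is usual for "convolutions" in deep learning.

```python
import numpy as np

def conv_speakers_time(x, weight, bias):
    """Zero-padded ('same') 2-D convolution over the (speaker, time) axes.

    x: (S, T, D_in), weight: (s, k, D_in, D_out) with odd s and k, bias: (D_out,).
    """
    s, k, _, d_out = weight.shape
    n_speakers, n_frames, _ = x.shape
    padded = np.pad(x, ((s // 2, s // 2), (k // 2, k // 2), (0, 0)))
    out = np.zeros((n_speakers, n_frames, d_out))
    for i in range(n_speakers):
        for t in range(n_frames):
            window = padded[i:i + s, t:t + k, :]  # nearby speakers, short time window
            out[i, t] = np.tensordot(window, weight, axes=3) + bias
    return out

def sim_block(x, weight, bias, w_linear):
    """Convolution, ReLU, then a linear layer applied to every (speaker, frame)."""
    hidden = np.maximum(conv_speakers_time(x, weight, bias), 0.0)
    return hidden @ w_linear

rng = np.random.default_rng(0)
S, T, D = 3, 10, 8  # hypothetical: 3 speakers, 10 frames, 8-dim features
s, k = 3, 5         # hypothetical kernel: 3 speakers x 5 frames

def make_params():
    """Random stand-ins for one modality's learned weights."""
    return (rng.standard_normal((s, k, D, D)) * 0.1,
            np.zeros(D),
            rng.standard_normal((D, D)) * 0.1)

audio_params, visual_params = make_params(), make_params()
audio = rng.standard_normal((S, T, D))
visual = rng.standard_normal((S, T, D))

# Each modality goes through its own block; the outputs are then concatenated.
audio_out = sim_block(audio, *audio_params)
visual_out = sim_block(visual, *visual_params)
combined = np.concatenate([audio_out, visual_out], axis=-1)  # (S, T, 2D)
print(combined.shape)
```

The short kernel is what makes the block "short-term": with `k = 5`, a frame only sees features from two frames on either side.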

Demo

Here's a demo of the LoCoNet model. In this video, there are two people in the scene, and one person has a bounding box (red = not speaking, green = speaking).


Link to YouTube video of TalkNet: https://www.youtube.com/watch?v=U-eiMA0STY4
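The colour coding in the demo amounts to thresholding a per-frame speaking score. A minimal sketch, where the function name and the 0.5 threshold are hypothetical and the colours are written in the BGR order that OpenCV uses:

```python
GREEN = (0, 255, 0)  # BGR: speaking
RED = (0, 0, 255)    # BGR: not speaking

def box_color(speaking_score, threshold=0.5):
    """Pick the bounding-box colour for one face in one frame."""
    return GREEN if speaking_score >= threshold else RED

print(box_color(0.9), box_color(0.1))
```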

Works Referenced

Here are the works I used to understand the material above.
"Self-Supervised Multimodal Learning: A Survey" by Zong et al. 2024 (arXiv; accepted to IEEE T-PAMI).
"Why Cross-Attention is the Secret Sauce of Multimodal Models" by Jakub Strawa 2025 (Medium).
  

