Explaining Active Speaker Detection (ASD)
Example of Active Speaker Detection (Ground Truth Labeled Image). Credit: "LoCoNet: Long-Short Context Network for Active Speaker Detection" (Link)
Introduction
Imagine a system that can model how people work and interact. Well, one interaction to model is detecting who spoke in a given scene/frame. This way, not only do models learn how people interact (e.g. two people talking, one waiting for the other, or both talking on top of each other), but can then be used for human-model interactions, and speech diarization (segmenting audio + identifying who spoke when), and much more.
Model Architectures
In this section, we will discuss two models.
TalkNet (Tao et. al. 2021)
TalkNet's goal is to capture long-term context features, as previous models only focused on short term context features, and small time/temporal segments.
Their solution utilizes two cross-attention mechanisms, swapping the query values in between the audio and visual features. Then, you concatenate the cross-attention values and pass it through a self-attention mechanism. Finally, they apply a linear layer to get the outputs.
LoCoNet (Wang et. al. 2024)
LoCoNet, or Long-Short Context Network, is a transformer-based model that incorporates audio and visual cues to detect whether somebody is speaking or not.
Their proposed solution is two-fold: use a self-attention mechanism to detect how a single face changes over time (Long-term Intra-Speaker Modeling), and use cues from other people in frames as context to detect whether or not a person is speaking (Short-Term Inter-Speaker Modeling).
Overview of the LoCoNet model
Cite: Wang et. al. 2024
The network encodes audio (Mel-Spectrogram) through a VGGish network, the ish coming from the fact they remove the pooling layer from the fourth block, and instead use a deconvolution layer to upsample the features.
Then, they are passed into the Long-term Intra-Speaker Modeling (LIM) block. Here, the audio and visual features go through self-attention mechanisms and a linear layer, where subscript v stands for video.
Cite: Wang et. al. 2024
And, this can be used for audio as well, which will be represented by subscript a. Then, both audio and visual features go through cross-attention mechanisms to fuse both modalities,
Cite: Wang et. al. 2024
(Also done vice-versa for audio). This is the audio-enchanced visual embeddings.Next is the Short-term Inter-Speaker Modeling (SIM) block. Here, the authors utilized a CNN to extract features to see how Visual Cues/Audio Cues change over time (both separately). Then, it is passed through a linear layer to get outputs/predictions.
Then, audio and visual features are concatenated to get one vector with the outputs. Demo
Here's a demo of the LocoNet model. In this video, there are two people in this scene, and one person has a bounding box (red = not speaking, green = speaking).
Works Referenced
Here are works I used to understand the previous materials above.
"Self-Supervised Multimodal Learning: A Survey" by Zong et. al. 2024 (arxiv; accepted to IEEE T-PAMI).
"Why Cross-Attention is the Secret Sauce of Multimodal Models" by Jakub Strawa 2025 (Medium).
Comments
Post a Comment