Posts

Showing posts from July, 2025

Explaining Active Speaker Detection (ASD)

Image
Example of Active Speaker Detection (Ground Truth Labeled Image). Credit: " LoCoNet: Long-Short Context Network for Active Speaker Detection " ( Link ) Introduction Imagine a system that can model how people work and interact. Well, one interaction to model is detecting who spoke in a given scene/frame. This way, not only do models learn how people interact (e.g. two people talking, one waiting for the other, or both talking on top of each other), but can then be used for human-model interactions, and speech diarization (segmenting audio + identifying who spoke when), and much more. Model Architectures In this section, we will discuss two models.  TalkNet ( Tao et. al. 2021 ) TalkNet's goal is to capture long-term context features, as previous models only focused on short term context features, and small time/temporal segments. Their solution utilizes two cross-attention mechanisms, swapping the query values in between the audio and visual features. Then, you concatenate ...