
One Model to Rule Them All ? Towards End-to-End Joint Speaker Diarization and Speech Recognition

Bibliographic Details
Main Authors: Cornell, Samuele, Jung, Jee-Weon, Watanabe, Shinji, Squartini, Stefano
Format: Conference Proceeding
Language: English
Description
Summary: This paper presents a novel framework for joint speaker diarization (SD) and automatic speech recognition (ASR), named SLIDAR (sliding-window diarization-augmented recognition). SLIDAR can process arbitrary-length inputs and can handle any number of speakers, effectively solving "who spoke what, when" concurrently. SLIDAR leverages a sliding-window approach and consists of an end-to-end diarization-augmented speech transcription (E2E DAST) model which provides, locally, for each window: transcripts, diarization and speaker embeddings. The E2E DAST model is based on an encoder-decoder architecture and leverages recent techniques such as serialized output training and "Whisper-style" prompting. The local outputs are then combined to obtain the final SD+ASR result by clustering the speaker embeddings into global speaker identities. Experiments performed on monaural recordings from the AMI corpus confirm the effectiveness of the method in both close-talk and far-field speech scenarios.
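
The summary describes a two-stage pipeline: a window-level model produces local transcripts, diarization and speaker embeddings, and the local outputs are then linked by clustering the embeddings into global speaker identities. The following is a minimal, illustrative Python sketch of how such a sliding-window-plus-clustering inference loop could be wired up. The `run_dast_window` function is a hypothetical placeholder standing in for the E2E DAST model, and the window/hop sizes, embedding dimension, clustering algorithm (agglomerative with cosine distance) and threshold are assumptions made for illustration, not details taken from the paper.

```python
# Illustrative sketch only: `run_dast_window` is a placeholder for the E2E DAST
# model; the clustering setup (agglomerative, cosine, threshold) is assumed.
import numpy as np
from sklearn.cluster import AgglomerativeClustering


def run_dast_window(window: np.ndarray) -> list[tuple[str, np.ndarray]]:
    """Stand-in for the window-level E2E DAST model: returns one
    (transcript, speaker embedding) pair per local speaker turn."""
    rng = np.random.default_rng(0)  # fixed seed so dummy "speakers" repeat across windows
    return [(f"dummy transcript, local speaker {i}", rng.standard_normal(192))
            for i in range(2)]


def slidar_like_inference(audio: np.ndarray, sr: int = 16000,
                          window_s: float = 30.0, hop_s: float = 15.0):
    win, hop = int(window_s * sr), int(hop_s * sr)
    turns = []  # (window start in seconds, transcript, local speaker embedding)

    # 1) Local pass: slide a fixed-length window over an arbitrarily long input.
    for start in range(0, max(len(audio) - win, 0) + 1, hop):
        for text, emb in run_dast_window(audio[start:start + win]):
            turns.append((start / sr, text, emb))

    # 2) Global pass: cluster all window-level speaker embeddings so the same
    #    speaker maps to a single identity across windows.
    embeddings = np.stack([emb for _, _, emb in turns])
    labels = AgglomerativeClustering(
        n_clusters=None, distance_threshold=0.5,
        metric="cosine", linkage="average",
    ).fit_predict(embeddings)

    # 3) Attach global speaker IDs to the local transcripts: "who spoke what, when".
    return [{"start_s": start, "speaker": f"spk{lab}", "text": text}
            for (start, text, _), lab in zip(turns, labels)]


if __name__ == "__main__":
    dummy_audio = np.zeros(16000 * 60)  # one minute of dummy audio
    for segment in slidar_like_inference(dummy_audio):
        print(segment)
```

Because the placeholder model reuses the same seed in every window, the same two dummy "speakers" recur across windows and the clustering step merges them into two global identities, which is the cross-window linking behavior the summary attributes to the embedding-clustering stage.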
ISSN: 2379-190X
DOI: 10.1109/ICASSP48485.2024.10447957