Who Said What? Recorder's On-device Solution for Labeling Speakers – Google AI Blog

In 2019 we launched Recorder, an audio recording app for Pixel phones that helps users create, manage, and edit audio recordings. It leverages recent advances in on-device machine learning to transcribe speech, recognize audio events, suggest tags for titles, and help users navigate transcripts.

However, some Recorder users found it difficult to navigate long recordings that have multiple speakers because it's not clear who said what. During this year's Made By Google event, we announced the "speaker labels" feature for the Recorder app. This opt-in feature annotates a recording transcript with unique and anonymous labels for each speaker (e.g., "Speaker 1", "Speaker 2", etc.) in real time during the recording. It significantly improves the readability and usability of the recording transcripts. This feature is powered by Google's new speaker diarization system named Turn-to-Diarize, which was first presented at ICASSP 2022.

Left: Recorder transcript without speaker labels. Right: Recorder transcript with speaker labels.

System Architecture

Our speaker diarization system leverages multiple highly optimized machine learning models and algorithms to diarize hours of audio in a real-time streaming fashion with limited computational resources on mobile devices. The system mainly consists of three components: a speaker turn detection model that detects a change of speaker in the input speech, a speaker encoder model that extracts voice characteristics from each speaker turn, and a multi-stage clustering algorithm that annotates speaker labels to each speaker turn in a highly efficient way. All components run fully on the device.
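The three components above compose into a simple sequential pipeline. The sketch below is purely illustrative: the on-device models themselves are not public, so `turn_detector`, `speaker_encoder`, and `clusterer` are hypothetical placeholders supplied by the caller.

```python
# Illustrative skeleton of the three-component diarization pipeline.
# The callables passed in stand in for the real on-device models.

def diarize(audio, turn_detector, speaker_encoder, clusterer):
    """Detect speaker turns, embed each turn, then cluster the embeddings."""
    turns = turn_detector(audio)                       # list of audio segments
    embeddings = [speaker_encoder(turn) for turn in turns]
    labels = clusterer(embeddings)                     # one integer label per turn
    return [(turn, f"Speaker {label + 1}")
            for turn, label in zip(turns, labels)]
```

Keeping the stages decoupled like this is what lets each one be optimized independently for the mobile compute budget.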

Architecture of the Turn-to-Diarize system.

Detecting Speaker Turns

The first component of our system is a speaker turn detection model based on a Transformer Transducer (T-T), which converts the acoustic features into text transcripts augmented with a special token <st> representing a speaker turn. Unlike previous customized systems that use role-specific tokens (e.g., <doctor> and <patient>) for conversations, this model is more generic and can be trained on and deployed to different application domains.
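To make the role of the <st> token concrete, here is a minimal post-processing sketch that splits a token sequence emitted by such a model into speaker turns. Only the token name "<st>" comes from the post; the function and example tokens are illustrative.

```python
def split_into_turns(tokens):
    """Group a flat token sequence into speaker turns, splitting on <st>."""
    turns, current = [], []
    for token in tokens:
        if token == "<st>":
            if current:          # close the current turn at a speaker change
                turns.append(current)
            current = []
        else:
            current.append(token)
    if current:                  # flush the final turn
        turns.append(current)
    return turns

tokens = ["good", "morning", "<st>", "hi", "doctor", "<st>", "please", "sit"]
print(split_into_turns(tokens))
# [['good', 'morning'], ['hi', 'doctor'], ['please', 'sit']]
```

Each resulting group is one homogeneous speaker turn, which is exactly the unit the speaker encoder consumes in the next stage.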

In most applications, the output of a diarization system is not directly shown to users, but combined with a separate automatic speech recognition (ASR) system that is trained to have smaller word errors. Therefore, for the diarization system, we are relatively more tolerant to word token errors than errors of the <st> token. Based on this intuition, we propose a new token-level loss function that allows us to train a small speaker turn detection model with high accuracy on predicted <st> tokens. Combined with edit-based minimum Bayes risk (EMBR) training, this new loss function significantly improved the interval-based F1 score on seven evaluation datasets.
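One simple way to encode this asymmetric tolerance is to up-weight the <st> token in a per-token cross-entropy. The actual loss in the paper is different (and is combined with EMBR training); the snippet below is only an illustrative sketch of the intuition, with all names and weights hypothetical.

```python
import numpy as np

def weighted_token_loss(log_probs, targets, st_id, st_weight=5.0):
    """Per-token negative log-likelihood, with the <st> token up-weighted
    so that speaker-turn errors cost more than ordinary word errors.

    log_probs: array of shape (num_tokens, vocab_size) of log-probabilities.
    targets:   list of target token ids.
    """
    targets = np.asarray(targets)
    weights = np.where(targets == st_id, st_weight, 1.0)
    nll = -log_probs[np.arange(len(targets)), targets]
    return float(np.sum(weights * nll) / np.sum(weights))
```

Under such a weighting, a model that confuses two words pays far less than one that misses a speaker change, which matches the division of labor between the ASR and diarization systems.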

Extracting Voice Characteristics

Once the audio recording has been segmented into homogeneous speaker turns, we use a speaker encoder model to extract an embedding vector (i.e., d-vector) to represent the voice characteristics of each speaker turn. This approach has several advantages over prior work that extracts embedding vectors from small fixed-length segments. First, it avoids extracting an embedding from a segment containing speech from multiple speakers. At the same time, each embedding covers a relatively large time range that contains sufficient signals from the speaker. It also reduces the total number of embeddings to be clustered, thus making the clustering step less expensive. These embeddings are processed entirely on-device until speaker labeling of the transcript is completed, and then deleted.
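A common way to turn the encoder's frame-level outputs into one d-vector per turn is mean pooling followed by L2 normalization, so that similarity between turns reduces to a dot product. The post does not specify the aggregation, so the following is a sketch under that common assumption.

```python
import numpy as np

def turn_dvector(frame_embeddings):
    """Aggregate frame-level embeddings of one speaker turn into a single
    unit-norm d-vector via mean pooling + L2 normalization."""
    v = np.asarray(frame_embeddings, dtype=float).mean(axis=0)
    return v / np.linalg.norm(v)

def cosine_similarity(a, b):
    """For unit-norm d-vectors, cosine similarity is just the dot product."""
    return float(np.dot(a, b))
```

Because each turn yields exactly one such vector, a recording with a few hundred turns hands the clustering stage only a few hundred points, regardless of how many audio frames it contains.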

Multi-Stage Clustering

After the audio recording is represented by a sequence of embedding vectors, the last step is to cluster these embedding vectors and assign a speaker label to each. However, since audio recordings from the Recorder app can be as short as a few seconds, or as long as up to 18 hours, it is critical for the clustering algorithm to handle sequences of drastically different lengths.

For this we propose a multi-stage clustering strategy to leverage the benefits of different clustering algorithms. First, we use the speaker turn detection outputs to determine whether there are at least two different speakers in the recording. For short sequences, we use agglomerative hierarchical clustering (AHC) as the fallback algorithm. For medium-length sequences, we use spectral clustering as our main algorithm, and use the eigen-gap criterion for accurate speaker count estimation. For long sequences, we reduce computational cost by using AHC to pre-cluster the sequence before feeding it to the main algorithm. During the streaming, we keep a dynamic cache of previous AHC cluster centroids that can be reused for future clustering calls. This mechanism allows us to enforce an upper bound on the entire system with constant time and space complexity.
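The eigen-gap criterion mentioned above estimates the number of speakers from the spectrum of a graph Laplacian: with k well-separated clusters, the k smallest eigenvalues are near zero and a large gap follows. Here is a minimal NumPy sketch of that idea (the production implementation and its affinity construction are not public, so the details below are assumptions).

```python
import numpy as np

def estimate_num_speakers(embeddings, max_speakers=8):
    """Eigen-gap sketch: the speaker count is the position of the largest gap
    in the sorted eigenvalues of the Laplacian of a cosine affinity matrix."""
    X = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    affinity = np.clip(X @ X.T, 0.0, 1.0)          # cosine similarity, non-negative
    laplacian = np.diag(affinity.sum(axis=1)) - affinity
    eigvals = np.sort(np.linalg.eigvalsh(laplacian))
    k_max = min(max_speakers, len(embeddings) - 1)
    gaps = np.diff(eigvals[: k_max + 1])           # gaps between adjacent eigenvalues
    return int(np.argmax(gaps)) + 1
```

With the speaker count fixed this way, spectral clustering can then assign the medium-length sequences, while AHC handles the short-sequence fallback and the pre-clustering of long sequences.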

This multi-stage clustering strategy is a critical optimization for on-device applications where the budget for CPU, memory, and battery is very small, and allows the system to run in a low-power mode even after diarizing hours of audio. As a tradeoff between quality and efficiency, the upper bound of the computational cost can be flexibly configured for devices with different computational resources.

Diagram of the multi-stage clustering strategy.

Correction and Customization

In our real-time streaming speaker diarization system, as the model consumes more audio input, it accumulates confidence on predicted speaker labels, and may occasionally make corrections to previously predicted low-confidence speaker labels. The Recorder app automatically updates the speaker labels on the screen during recording to reflect the latest and most accurate predictions.

At the same time, the Recorder app's UI allows the user to rename the anonymous speaker labels (e.g., "Speaker 2") to customized labels (e.g., "car dealer") for better readability and easier memorization for the user within each recording.

Recorder allows the user to rename the speaker labels for better readability.

Future Work

Currently, our diarization system mostly runs on the CPU block of Google Tensor, Google's custom-built chip that powers more recent Pixel phones. We are working on delegating more computations to the TPU block, which will further reduce the overall power consumption of the diarization system. Another future direction is to leverage the multilingual capabilities of the speaker encoder and speech recognition models to expand this feature to more languages.


The work described in this post represents joint efforts from multiple teams within Google. Contributors include Quan Wang, Yiling Huang, Evan Clark, Qi Cao, Han Lu, Guanlong Zhao, Wei Xia, Hasim Sak, Alvin Zhou, Jason Pelecanos, Luiza Timariu, Allen Su, Fan Zhang, Hugh Love, Kristi Bradford, Vincent Peng, Raff Tsai, Richard Chou, Yitong Lin, Ann Lu, Kelly Tsai, Hannah Bowman, Tracy Wu, Taral Joglekar, Dharmesh Mokani, Ajay Dudani, Ignacio Lopez Moreno, Diego Melendo Casado, Nino Tasca, Alex Gruenstein.
