Spectron: First end-to-end spoken language model

Brain Titan
3 min read · Nov 2, 2023



Traditional voice dialogue systems are cascades: they first perform speech recognition, then semantic understanding and text generation, and finally convert the generated text back into speech.

Google has developed an end-to-end trained spoken language model that learns and predicts directly from the sound's spectrogram (its "frequency image"), without first converting the audio into text.

Working directly on spectrograms lets the model capture the details of the sound more faithfully.

Spectrogram-driven: A novel approach to speech processing

What is a spectrogram?

A spectrogram is a representation of an audio signal that shows how the signal's energy is distributed across frequencies over time. The horizontal axis usually represents time, the vertical axis represents frequency, and color or brightness represents the energy at a given time and frequency.

In speech and audio processing, the spectrogram is a commonly used tool because it clearly reveals the frequency content of a sound signal.
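To make the time/frequency grid concrete, here is a minimal, toy spectrogram computation using only the Python standard library (a short-time DFT with magnitude output; frame length, hop size, and the test tone are arbitrary choices for illustration, not anything Spectron prescribes):

```python
import cmath
import math

def spectrogram(signal, frame_len=64, hop=32):
    """Magnitude spectrogram via a short-time DFT.

    Returns a list of frames (time axis); each frame is a list of
    magnitudes over non-negative frequency bins (frequency axis).
    """
    frames = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        frame = signal[start:start + frame_len]
        bins = []
        # DFT of one frame; keep bins 0 .. frame_len/2 only
        for k in range(frame_len // 2 + 1):
            acc = sum(x * cmath.exp(-2j * math.pi * k * n / frame_len)
                      for n, x in enumerate(frame))
            bins.append(abs(acc))
        frames.append(bins)
    return frames

# A pure 1000 Hz tone at 8000 Hz sampling rate: with 64-sample frames,
# bin resolution is 8000/64 = 125 Hz, so energy should peak at bin 8.
sr = 8000
tone = [math.sin(2 * math.pi * 1000 * t / sr) for t in range(512)]
spec = spectrogram(tone)
peak_bin = max(range(len(spec[0])), key=lambda k: spec[0][k])
```

For a pure tone, almost all the energy lands in a single frequency bin, which is exactly the kind of structure the spectrogram makes visible.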

The Spectron model operates directly on the spectrogram, which means it can directly process richer and more complete audio information without the need for discretized speech-to-text-to-speech conversion. One advantage of this approach is that it better preserves information in the audio data, such as the speaker’s vocal characteristics or the coherence of speech.

Main principle of Spectron:

A traditional speech pipeline has multiple stages (speech recognition, natural language understanding, and so on), each with its own input and output format. In a spectrogram-driven model, all of these stages are unified into a single end-to-end process performed directly at the spectrogram level.
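The contrast can be sketched as two data flows. All stage names below (`asr`, `nlu`, `tts`, `spectron_model`) are illustrative placeholders, not real APIs; the toy stand-ins only make the hand-offs runnable:

```python
def cascaded_pipeline(audio, asr, nlu, tts):
    """Traditional cascade: each stage hands the next a lossy
    intermediate (text), dropping prosody, timbre, speaker identity."""
    text = asr(audio)    # speech -> text
    reply = nlu(text)    # text -> text
    return tts(reply)    # text -> speech

def end_to_end(spectrogram, spectron_model):
    """Spectrogram-driven: one model maps input spectrogram directly
    to output spectrogram, so acoustic detail can flow through."""
    return spectron_model(spectrogram)

# Toy stand-ins so the cascade actually runs:
asr = lambda audio: "hello"
nlu = lambda text: text.upper()
tts = lambda text: list(text)
out = cascaded_pipeline([0.1, 0.2, 0.3], asr, nlu, tts)
```

The key design point is visible in the signatures: the cascade forces everything through a text bottleneck, while the end-to-end function never leaves the spectrogram domain.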

Technical advantages of Spectron:

Because the model operates directly on the spectrogram, it can more accurately capture subtle characteristics of the speech signal, such as pitch, rhythm, and intensity, which generally leads to more accurate results.

Work process of Spectron:

The Spectron model connects the encoder of a speech recognition model to a pre-trained Transformer-based decoder language model. During training, each speech utterance is split into a prompt and its continuation; the model then learns to reconstruct the entire transcript (prompt plus continuation) as well as the speech features of the continuation.
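The prompt/continuation split used during training can be shown with toy data (the frame values, frame count, and split point below are arbitrary choices for illustration):

```python
# One utterance represented as 10 one-bin spectrogram frames.
utterance = [[float(i)] for i in range(10)]
split = 4  # arbitrary split point

# Cut into a prompt and its continuation.
prompt, continuation = utterance[:split], utterance[split:]

# Training targets described above: the transcript of the WHOLE
# utterance, plus the acoustic frames of the continuation only.
transcript_target = "transcript of prompt + continuation"  # placeholder
acoustic_target = continuation
```

So the model sees the prompt frames as input but is supervised on both text (for the full utterance) and acoustics (for the continuation).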

1. Input stage: The Spectron model receives a spectrogram as input. This spectrogram is a graphical representation of a sound signal, showing the variation of different frequency components over time.

2. Encoder: The model first uses an encoder to process the spectrogram. The task of the encoder is to extract important features from the sound signal and encode them into a more manageable form.

3. Connect to the decoder: The encoded data is then passed to a pre-trained Transformer-based decoder. This decoder is a large language model specifically designed to generate text.

4. Generation and reconstruction: The decoder generates a text transcription (that is, what it thinks the input spectrogram represents as spoken language). At the same time, the model also attempts to reconstruct the input spectrogram to generate a new sound signal.

5. Output stage: Finally, the model outputs a new spectrogram, which is the model’s “response” or “continuation” of the input sound signal.
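The five steps above can be sketched end to end with stand-in components. In the real system the encoder and decoder are large neural networks; everything below is a toy illustration of the data flow only, with purely invented transformations:

```python
def encoder(spectrogram):
    """Step 2: compress each time frame to a single feature
    (here: the frame's mean, purely illustrative)."""
    return [sum(frame) / len(frame) for frame in spectrogram]

def decoder(features):
    """Steps 3-5: emit one transcript token per feature and, in
    parallel, predict a continuation frame (here: echo + 1.0, a toy
    stand-in for the language model's actual prediction)."""
    transcript = ["tok%d" % i for i, _ in enumerate(features)]
    continuation = [[f + 1.0] for f in features]
    return transcript, continuation

# Step 1: a tiny input spectrogram, 2 time frames x 2 frequency bins.
prompt_spec = [[0.0, 2.0], [4.0, 6.0]]
features = encoder(prompt_spec)
transcript, next_spec = decoder(features)  # text + output spectrogram
```

Note that the decoder produces two things at once, mirroring the description above: a text transcription and a new spectrogram that serves as the spoken "response" or "continuation".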

Performance evaluation:

Spectron was evaluated on the Libri-Light dataset, which contains 60,000 hours of English speech. Experimental results show that Spectron performs well at answering spoken questions and at speech continuation.

Detailed introduction: https://blog.research.google/2023/10/spoken-question-answering-and-speech.html

Project demo: michelleramanovich.github.io/spectron/spectron/

Paper: arxiv.org/abs/2305.15255
