Moshi: Open-Source Real-Time Speech Transformer

Moshi: Real-time Transformer model for speech, handling interruptions, emotions, and complex conversations with low latency.

Brain Titan
10 min read · Sep 26, 2024

Moshi is a sophisticated, multi-stream, real-time speech-to-speech generation Transformer model designed to support full-duplex voice conversations. Its primary features include the capability for simultaneous speech input and output (full-duplex), as well as the ability to manage complex conversational scenarios. These scenarios encompass overlapping speech, interruptions, and the integration of non-verbal cues such as emotional expressions.

By enabling simultaneous listening and speaking, Moshi aims to mitigate several issues inherent in traditional dialogue systems. These issues include latency, the loss of non-verbal information like emotions, and the rigid structure of conversational turns. Unlike traditional turn-based conversational systems, where one speaker must finish before the other begins, Moshi supports full-duplex communication. This allows it to generate voice responses while the user is still speaking, free from turn-taking constraints. It adeptly handles complex conversational dynamics, including overlapping speech, interruptions, and rapid feedback.

Moshi’s multi-stream processing capability allows it to listen and generate speech concurrently by managing multiple audio streams. This architecture ensures that voice interactions between users and the system remain fluid, maintaining the natural flow of the conversation without interruptions.

When compared to traditional voice dialogue systems, Moshi offers several significant advantages. Its real-time response speed, with a delay of only 160–200 milliseconds, closely mirrors the reaction speed found in natural conversation, providing a seamless conversational experience. Unlike traditional systems that rely on a speech-to-text-to-speech process, Moshi processes speech input directly and generates speech output, preserving non-verbal information such as tone and emotion. Additionally, Moshi’s full-duplex capability allows it to handle user and system speech simultaneously, accommodating overlapping speech and interruptions. This brings it closer to the natural form of human conversation.

Key Features of Moshi

Moshi offers real-time speech-to-speech conversations, generating audio output directly from audio input without the traditional speech-to-text-to-speech process. By directly processing speech data, Moshi preserves non-verbal cues such as tone, emotion, overlapping speech, and interruptions, ensuring conversations remain natural and fluid.

The system supports full-duplex communication, allowing it to listen and speak simultaneously. This capability enables Moshi to generate voice responses as the user speaks, eliminating the need for strict conversational turn-taking. It adeptly handles complex conversational scenarios, including overlapping speech and non-interruptive feedback like “hmm” or “I understand.”

Designed with low latency, Moshi achieves a theoretical latency of only 160 milliseconds and practical latency around 200 milliseconds. This responsiveness ensures Moshi can reply to user input almost in real time, providing a smoother conversational experience.
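For intuition, here is a small back-of-envelope sketch of where the 160 ms figure plausibly comes from, assuming the 12.5 Hz frame rate mentioned later in this article (80 ms per frame) plus roughly one frame of internal delay; this is an illustration, not the paper's exact derivation.

```python
# Back-of-envelope latency estimate (illustrative assumptions: 12.5 Hz frames,
# i.e. 80 ms each, plus about one frame of acoustic delay inside the model).
frame_rate_hz = 12.5
frame_ms = 1000 / frame_rate_hz            # 80 ms per audio frame
acoustic_delay_frames = 1                  # delay between semantic and acoustic tokens

theoretical_latency_ms = frame_ms + acoustic_delay_frames * frame_ms
print(theoretical_latency_ms)              # 160.0 ms; ~200 ms in practice with compute overhead
```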

Moshi employs an Inner Monologue method, predicting text tokens before generating speech. This approach significantly improves the linguistic quality and consistency of the generated speech and, as a by-product, gives the system streaming speech recognition and text-to-speech capabilities. The inner monologue mechanism supports simultaneous processing of language and audio within a continuous conversation flow.

Capable of processing multiple audio streams in parallel, Moshi manages both user and system voice streams simultaneously. This multi-stream processing allows Moshi not only to generate its own speech but also to understand and respond to the user’s speech in real time.

By processing speech directly rather than through intermediate text, Moshi excels in understanding and generating emotionally charged speech and handling complex conversational dynamics such as emotional expressions and voice inflection.

Moshi adeptly handles the intricate dynamics of natural conversations, including interruptions, interleaving, interjections, and responses. While traditional systems depend on clear conversational turns, Moshi transcends this limitation, making interactions more natural and engaging.

Model Architecture of Moshi

Moshi comprises three primary components: Helium, a 7B language model trained with 2.1 trillion tokens; Mimi, a neural audio codec that models semantic and acoustic information; and an innovative multi-stream architecture that handles the user’s and Moshi’s audio separately. These elements work in unison to facilitate seamless full-duplex conversations, emotional expression, and the management of intricate conversational dynamics.

Helium Text Language Model

At the heart of Moshi lies Helium, a robust text language model featuring 7 billion parameters based on the Transformer architecture, akin to GPT. Helium equips Moshi with formidable language comprehension and generation capabilities, enabling it to tackle complex text reasoning and dialogue tasks. Trained on 2.1 trillion English text tokens, Helium possesses extensive linguistic knowledge and proficiency.

Mimi Neural Audio Codec

Mimi serves as Moshi’s audio processing unit. This neural audio codec converts audio into discrete speech tokens and, in the reverse direction, decodes those tokens back into high-quality speech. Using Residual Vector Quantization (RVQ), Mimi encodes speech into discrete semantic and acoustic tokens, preserving high speech fidelity and linguistic consistency. By combining semantic and acoustic tokens, Mimi not only produces natural-sounding speech but also handles intricate speech contexts and emotional nuances.
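To make the data flow concrete, here is a minimal shape sketch of what a Mimi-like codec produces, assuming (purely for illustration) 24 kHz input audio, a 12.5 Hz token rate, and 8 RVQ levels; the real codec is a trained neural network, and these constants are assumptions, not figures quoted from the article.

```python
import numpy as np

sample_rate = 24_000       # assumed input sample rate
token_rate = 12.5          # assumed audio frames (token steps) per second
n_codebooks = 8            # assumed number of RVQ levels (semantic + acoustic)

seconds = 2.0
waveform = np.random.randn(int(sample_rate * seconds))   # stand-in for a real recording
n_steps = int(token_rate * seconds)

# "Encoding" here only shows the target shape: one integer token per codebook per step.
codes = np.random.randint(0, 2048, size=(n_codebooks, n_steps))
print(codes.shape)   # (8, 25): a 2-second utterance becomes an 8 x 25 grid of discrete tokens
```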

Inner Monologue Method

The inner monologue method is a pivotal technology for Moshi’s speech generation, enabling the model to predict time-aligned text tokens prior to generating the corresponding audio. This technique enhances the linguistic quality of the generated speech and allows Moshi to perform speech recognition and text-to-speech conversion in a streaming setting. Before generating audio, Moshi produces a text stream corresponding to its speech output, which serves as the scaffold for speech generation, thereby improving accuracy and facilitating the handling of complex conversational scenarios.

Designed to process multiple parallel audio streams, the model architecture generates speech and text in real-time. Moshi can produce system speech while simultaneously processing user speech, supporting uninterrupted natural conversations.

Detailed Technical Methods of Moshi

1. Speech-to-Speech Generation Architecture

Moshi’s core innovation redefines voice conversation by treating it as a speech-to-speech generation task, moving away from the traditional multi-component process of converting text to speech and vice versa. Conventional voice conversation systems rely on multiple independent modules such as Voice Activity Detection (VAD), Automatic Speech Recognition (ASR), Natural Language Understanding (NLU), Natural Language Generation (NLG), and Text-to-Speech (TTS). Moshi, however, directly generates speech tokens, eliminating the need for intermediate text representations during understanding and generation. This approach helps preserve crucial information such as emotion, tone, and non-verbal sounds.

2. Helium Text Language Model

At the heart of Moshi lies the Helium text language model, a powerful text generation model with 7 billion parameters. Pre-trained on 2.1 trillion tokens of English text, Helium exhibits robust capabilities in language understanding, reasoning, and generation. It forms the semantic foundation of Moshi, enabling sophisticated natural language processing functions, including open-ended conversations and question-answering.

Key Features of Helium:

  • Autoregressive Transformer Architecture: Helium uses a Transformer-based architecture with multi-layer attention and autoregressive modeling to process text input and generate output. With 7 billion parameters, it is well equipped to handle large-scale corpora.
  • RMS Normalization: Applied in the attention modules, feedforward modules, and output layer, RMS normalization improves training stability. (A generic sketch of this building block follows the list.)
  • Rotary Positional Encoding (RoPE): RoPE handles longer context windows (up to 4,096 tokens), allowing the model to capture long-range dependencies in conversations.
  • Efficient FlashAttention: Optimized attention computation makes inference over long sequences more memory- and compute-efficient.
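As promised in the list above, here is a minimal sketch of one of those building blocks, RMS normalization, in the form commonly used by Llama-style Transformers; it illustrates the general technique, not Helium’s actual implementation.

```python
import torch

class RMSNorm(torch.nn.Module):
    """Generic RMS normalization: scale by the reciprocal root-mean-square of the
    features, then apply a learned per-channel gain. Illustrative, not Helium's code."""
    def __init__(self, dim: int, eps: float = 1e-5):
        super().__init__()
        self.eps = eps
        self.weight = torch.nn.Parameter(torch.ones(dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        rms = x.pow(2).mean(dim=-1, keepdim=True).add(self.eps).rsqrt()
        return x * rms * self.weight

x = torch.randn(1, 4, 512)      # (batch, sequence, hidden)
print(RMSNorm(512)(x).shape)    # torch.Size([1, 4, 512])
```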

3. Mimi Neural Audio Codec

Mimi serves as the neural audio codec for speech processing within Moshi. Its primary function is to discretize continuous speech signals into audio tokens, akin to text tokens, which can convey detailed speech information. Utilizing Residual Vector Quantization (RVQ) technology, Mimi maintains high-quality audio at lower bit rates, facilitating real-time speech generation and processing.

Key Technologies of Mimi:

  • Residual Vector Quantization (RVQ): Mimi employs multi-level residual vector quantization to discretize complex audio signals into multiple levels of audio tokens. This method efficiently encodes both semantic and acoustic information at each time step while ensuring high-quality audio reconstruction. (A generic sketch of the residual quantization loop follows this list.)
  • Combination of Semantic and Acoustic Tokens: The audio tokens generated by Mimi include both semantic and acoustic information. Semantic tokens capture the content of the speech, while acoustic tokens describe characteristics such as timbre, emotion, and intonation.
  • Streaming Encoding and Decoding: Mimi supports streaming operation, enabling continuous speech generation and recognition in real-time conversations. This capability ensures that Moshi’s response speed closely mirrors natural conversation.
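The residual quantization loop referenced in the list can be sketched in a few lines. The code below is a generic illustration of RVQ with made-up dimensions and random codebooks, not Mimi’s trained codec.

```python
import torch

def residual_vector_quantize(x, codebooks):
    """Each level quantizes what the previous levels left unexplained."""
    residual = x
    codes, quantized = [], torch.zeros_like(x)
    for codebook in codebooks:                      # codebook: (codebook_size, dim)
        dists = torch.cdist(residual, codebook)     # distance to every codeword
        idx = dists.argmin(dim=-1)                  # nearest codeword per time step
        chosen = codebook[idx]
        codes.append(idx)
        quantized = quantized + chosen
        residual = residual - chosen                # next level only sees the leftover error
    return torch.stack(codes), quantized            # (levels, time) discrete tokens

torch.manual_seed(0)
frames = torch.randn(25, 64)                         # e.g. 2 s of latent frames, dim 64 (assumed)
books = [torch.randn(2048, 64) for _ in range(8)]    # 8 levels, 2048 codewords each (assumed)
tokens, approx = residual_vector_quantize(frames, books)
print(tokens.shape)                                   # torch.Size([8, 25])
```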

Mimi’s advanced technologies and seamless integration with Moshi’s architecture exemplify a sophisticated approach to speech processing, enhancing both the quality and efficiency of voice interactions.

4. Architecture of RQ-Transformer

Moshi employs a sophisticated multi-stream hierarchical generation architecture, enabling the parallel processing of multiple audio streams. This innovative design allows for flexible interaction within conversations by concurrently modeling both the user’s voice stream and the system’s voice stream. As a result, it adeptly handles complex conversational dynamics, including interleaving, interruptions, and interjections between speakers.

Originally proposed for discrete image generation, this architecture facilitates the modeling of a hierarchy of semantic and acoustic tokens without extending the Helium sequence length. Consequently, each second of audio only needs to pass through the 7B backbone model 12.5 times, allowing real-time operation on an L4 GPU or an M3 MacBook Pro. When combined with the token delay scheme from MusicGen, this approach delivers state-of-the-art performance in audio language modeling.

Moshi utilizes the RQ-Transformer (Residual Quantizer Transformer) to decompose audio tokens into multiple levels, generating audio through hierarchical autoregressive modeling. The model employs a larger Temporal Transformer to handle the time series initially, followed by a smaller Depth Transformer to process multiple subsequences at each time step. This design significantly enhances the efficiency of generating extended audio sequences.
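A toy version of that two-level loop looks roughly like the following: a larger model advances once per frame, and a smaller model emits that frame’s stack of codebook tokens. The modules, dimensions, and greedy decoding here are placeholders chosen for brevity, not the real RQ-Transformer.

```python
import torch
import torch.nn as nn

d_model, n_levels, vocab = 64, 8, 2048                                      # toy sizes, not Moshi's
temporal = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)   # stand-in Temporal Transformer
depth = nn.GRU(d_model, d_model, batch_first=True)                          # stand-in Depth Transformer
to_logits = nn.Linear(d_model, vocab)
embed = nn.Embedding(vocab, d_model)

context = torch.zeros(1, 1, d_model)           # running sequence of per-frame states
frames = []
for step in range(5):                          # generate 5 frames (~0.4 s at 12.5 Hz)
    h = temporal(context)[:, -1:]              # one temporal state per 80 ms frame
    tokens, inp = [], h
    for level in range(n_levels):              # emit this frame's stack of codebook tokens
        out, _ = depth(inp)
        tok = to_logits(out[:, -1]).argmax(-1)             # greedy pick, for the sketch only
        tokens.append(tok)
        inp = torch.cat([inp, embed(tok).unsqueeze(1)], dim=1)
    frames.append(torch.stack(tokens, dim=-1))
    context = torch.cat([context, h], dim=1)   # feed the frame state back to the backbone

print(torch.cat(frames).shape)                 # torch.Size([5, 8]): five frames of 8 tokens each
```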

The model generates multiple sequences simultaneously, including text, semantic tokens, and audio tokens, ensuring precise temporal alignment through the inner monologue mechanism. Each time step’s generated content includes both the current speech and the corresponding text prefix, resulting in speech content that is more semantically coherent.

5. “Inner Monologue” Mechanism

Moshi’s “Inner Monologue” mechanism stands out as a key innovation in speech generation. This mechanism predicts the corresponding time-aligned text tokens prior to generating audio, thereby enhancing the language consistency of the produced speech. It also supports real-time speech recognition (ASR) and text-to-speech (TTS) conversion.

Aligned text and audio generation is a hallmark of this mechanism. By predicting the text first, Moshi ensures that the generated speech is more accurate and fluent in both grammar and content. The introduction of a delay between text and audio generation allows Moshi to perform ASR and TTS tasks independently. For instance, if the text is generated first and the audio subsequently, the model operates in TTS mode; conversely, if the audio is generated first, it operates in ASR mode. This seamless switching capability ensures that the model can proficiently generate and recognize speech.
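A schematic way to see this is as two parallel token streams whose relative offset decides which modality leads. The snippet below is purely illustrative padding arithmetic, not Moshi’s training code.

```python
def shift(stream, delay, pad="<pad>"):
    """Delay a stream by prepending padding tokens."""
    return [pad] * delay + stream

text  = ["hel", "lo", "there"]
audio = ["a0", "a1", "a2"]              # one placeholder token per 80 ms frame

# Text two frames ahead of audio: the model "reads" what it is about to say (TTS-like).
tts_like = list(zip(shift(text, 0) + ["<pad>"] * 2, shift(audio, 2)))
# Audio two frames ahead of text: the model transcribes what it has heard (ASR-like).
asr_like = list(zip(shift(text, 2), shift(audio, 0) + ["<pad>"] * 2))

print(tts_like)   # [('hel', '<pad>'), ('lo', '<pad>'), ('there', 'a0'), ...]
print(asr_like)   # [('<pad>', 'a0'), ('<pad>', 'a1'), ('hel', 'a2'), ...]
```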

6. Multi-Stream Modeling

Moshi’s architecture is adept at handling multiple audio streams simultaneously, enabling it to monitor the user’s voice while generating its own responses. This capability allows Moshi to dynamically manage overlapping audio segments, such as interruptions and interleaving, without necessitating a pre-defined turn-taking structure. Consequently, conversations with Moshi feel more natural and fluid.

The system utilizes a synchronous mechanism for generating semantic and acoustic tokens, optimizing the dependencies between them through the introduction of time delays. By accurately modeling the audio flows of both users and the system, Moshi adeptly navigates complex conversational scenarios.

Moshi’s dual-stream audio processing allows it to handle user and system voice streams concurrently, achieving full-duplex conversations by modeling two autoregressive audio streams in parallel. This design is particularly effective in managing overlapping speech and interruptions, ensuring a seamless conversational experience.

By introducing a delay between semantic and audio tokens, Moshi ensures that the generated speech content remains coherent and efficient. Depending on the dynamics of the conversation, this delay can range from 1 to 2 frames.
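The effect of that delay on the per-step token stack can be pictured as shifting the acoustic rows of the token grid to the right. The layout below is illustrative (made-up values, three acoustic levels, a two-frame delay), meant only to show the pattern.

```python
import numpy as np

steps, acoustic_levels, delay = 6, 3, 2          # assumed sizes for the illustration
semantic = np.arange(steps)                      # semantic tokens at their true time step
acoustic = np.arange(steps * acoustic_levels).reshape(acoustic_levels, steps)

PAD = -1
delayed = np.full_like(acoustic, PAD)
delayed[:, delay:] = acoustic[:, :steps - delay] # acoustic rows lag by two frames

grid = np.vstack([semantic, delayed])            # what the model conditions on per step
print(grid)
# Row 0 holds semantic tokens; rows 1-3 hold acoustic tokens shifted right, so at step t
# the model commits to semantic content before filling in its acoustic detail.
```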

7. Model Training and Fine-Tuning

Moshi’s text language model, Helium, undergoes extensive pre-training on over 2.1 trillion English tokens, endowing it with robust language understanding and generation capabilities. This large-scale training on both text and voice data equips Moshi to handle a wide array of complex conversational scenarios.

The training process involves a multi-stage approach, incorporating both unsupervised and supervised methods. Initially, Moshi is pre-trained on large-scale unsupervised speech data. This is followed by post-training on multi-stream data that includes natural conversations, culminating in instruction fine-tuning to enhance its performance in real-world interactions.

The Helium text language model is first pre-trained on a vast text dataset, bolstering its language comprehension and reasoning skills. Subsequently, the multi-stream audio model is trained on an unlabeled audio dataset to master speech generation and semantic understanding.

Fine-tuning is conducted using the Fisher dataset, which contains two-channel voice dialogue data, to improve Moshi’s proficiency in handling multi-stream voice inputs. Finally, instruction fine-tuning is applied using generated dialogue data to further refine the model’s performance in natural conversational scenarios.

Throughout the training process, Moshi employs data enhancement techniques, such as adding background noise and simulating user echoes. These methods ensure that the model remains stable across various voice environments, thereby enhancing its robustness and reliability.
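A minimal sketch of that kind of augmentation is shown below: background noise mixed into the user stream at a target SNR, plus a delayed, attenuated copy of the system’s own speech to simulate echo. All parameter values are illustrative assumptions, not the ones used to train Moshi.

```python
import numpy as np

def augment(user_audio, system_audio, sample_rate=24_000,
            noise_snr_db=20.0, echo_delay_s=0.1, echo_gain=0.3):
    """Add background noise and a simulated echo of the system's speech (illustrative)."""
    # Background noise scaled to the requested signal-to-noise ratio.
    noise = np.random.randn(len(user_audio))
    signal_power = np.mean(user_audio ** 2) + 1e-9
    noise_power = signal_power / (10 ** (noise_snr_db / 10))
    noisy = user_audio + noise * np.sqrt(noise_power / (np.mean(noise ** 2) + 1e-9))

    # Echo: the system's output leaks back into the user's microphone, delayed and quieter.
    delay = int(echo_delay_s * sample_rate)
    echo = np.zeros_like(user_audio)
    echo[delay:] = system_audio[:len(user_audio) - delay] * echo_gain
    return noisy + echo

user = np.random.randn(24_000)       # 1-second stand-ins for real recordings
system = np.random.randn(24_000)
print(augment(user, system).shape)   # (24000,)
```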

……

For more specific details ↓

More about AI: https://kcgod.com
