
Real-Time TTS Streaming with Orpheus and SNAC on a single RTX 3090

Learn how to set up a low-latency, streaming Text-to-Speech system using an Orpheus-style model, vLLM for efficient inference, and SNAC for audio decoding. This setup can run on a single 24 GB GPU, or you can distribute the Orpheus model (vLLM/SGLang) and the decoder (FastAPI/SNAC) across multiple GPUs.



[Image: Streaming audio waveform illustration, created with my custom Flux LoRA on Replicate]

tl;dr - What is the Post About?

  • Building a streaming Text-to-Speech (TTS) pipeline with an Orpheus model (Kartoffel_Orpheus-3B_german_natural-v0.1).
  • Using vLLM or SGLang with FP8 quantization to run the LLM efficiently.
  • Decoding audio tokens on-the-fly using the SNAC model.
  • Creating a FastAPI server to manage the stream and send audio chunks with low latency.
  • Showing it's possible to run the whole pipeline on a single NVIDIA RTX 3090 (24GB VRAM).

Introduction

Most open-source TTS inference systems generate the full audio file before playback begins. For short sentences, the delay is minimal, but for longer texts, the delay can become significant. Orpheus-style TTS models, which use LLMs to generate discrete audio tokens (similar to LLaSA), open up possibilities for streaming audio generation. By processing the LLM's output token by token, or in SNAC-sized chunks of tokens, and immediately decoding these chunks into audio, we can significantly reduce latency, allowing users to hear the speech almost instantly.

This post demonstrates how to build such a streaming TTS system. We'll use my fine-tuned German Orpheus model, SebastianBodza/Kartoffel_Orpheus-3B_german_natural-v0.1, run the LLM component using vLLM with optional FP8 quantization or SGLang, decode the resulting audio tokens with the SNAC decoder, and wrap everything in a FastAPI application for easy access. This setup will run on a single RTX 3090.

Similarly, you can speed up the inference of non-streaming outputs by chunking the text at punctuation marks such as . , ? ! and processing the chunks in batches with vLLM/SGLang. Splitting the text into smaller chunks also improves the quality of longer audio sequences, as the model tends to generate artifacts when processing texts that exceed the trained maximum of 4k tokens without splitting.
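As a rough illustration, punctuation-based chunking could look like the sketch below. The splitting rule and the maximum chunk length are my own assumptions for the example, not the exact logic from the repository:

    # Hypothetical chunking helper (illustrative, not the repository code)
    import re

    def chunk_text(text: str, max_chars: int = 300) -> list[str]:
        """Split text at punctuation and pack the pieces into chunks."""
        # Keep the punctuation attached to each piece.
        pieces = re.split(r"(?<=[.,!?])\s+", text.strip())
        chunks, current = [], ""
        for piece in pieces:
            if current and len(current) + len(piece) + 1 > max_chars:
                chunks.append(current)
                current = piece
            else:
                current = f"{current} {piece}".strip()
        if current:
            chunks.append(current)
        return chunks

    # Each chunk can then be sent as a separate request, letting vLLM/SGLang
    # batch the requests internally.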

The Components

You can find the full implementation in this Github project: https://github.com/SebastianBodza/Orpheus_Distributed_FastAPI

And here is a video of the system in action:

1. The Orpheus LLM (via vLLM/SGLang)

The core of the TTS generation is the Orpheus LLM. Instead of directly generating waveforms, it predicts a sequence of discrete audio tokens representing the sound. To run this efficiently, especially on constrained hardware, we use an optimized inference server like vLLM or SGLang.

  • vLLM/SGLang: These frameworks implement techniques like PagedAttention to maximize throughput and minimize memory usage for LLM inference.
  • Quantization (Optional but Recommended): Both frameworks support FP8 quantization, as well as INT4 quantization via AWQ or GPTQ. This significantly reduces the VRAM required by the LLM while speeding up inference, with minimal impact on quality for this task. This optimization makes it possible to run both the SNAC model and the LLM on the same GPU with only 24 GB of VRAM.
  • API: We interact with the vLLM/SGLang server via an OpenAI-compatible API.
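
As a hedged example, requesting a token stream from the OpenAI-compatible endpoint with the official openai Python client could look like this. The prompt string and sampling parameters are placeholders; the real prompt formatting is handled by the FastAPI server described below:

    # Hedged example: streaming completion from the OpenAI-compatible API
    from openai import OpenAI

    client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

    stream = client.completions.create(
        model="SebastianBodza/Kartoffel_Orpheus-3B_german_natural-v0.1",
        prompt="<formatted Orpheus prompt goes here>",  # placeholder
        max_tokens=2048,
        temperature=0.6,  # assumed sampling setting
        stream=True,
    )

    for chunk in stream:
        # Each chunk carries newly generated text (audio-token pieces)
        # as soon as the server produces it.
        print(chunk.choices[0].text, end="", flush=True)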

2. The SNAC Model

The LLM outputs audio tokens, not playable sound. We need an audio codec (encoder/decoder) to convert these tokens into a waveform. We use the SNAC model for this.

  • Efficiency: SNAC is relatively lightweight and fast, making it suitable for real-time decoding.
  • Input Format: It expects audio tokens grouped in a specific structure (7 tokens per group across 3 layers in this setup).
  • Output: It outputs a raw audio waveform (24 kHz mono in this setup).
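
A minimal sketch of loading SNAC and decoding a batch of codes is shown below. The checkpoint name hubertsiuzdak/snac_24khz and the exact layer shapes are assumptions based on the 24 kHz, 7-tokens-per-group setup described here:

    # Minimal SNAC decoding sketch (checkpoint name and shapes are assumptions)
    import torch
    from snac import SNAC

    # Load the 24 kHz SNAC model once at startup.
    snac_model = SNAC.from_pretrained("hubertsiuzdak/snac_24khz").eval().to("cuda")

    # SNAC expects a list of three LongTensors of shape [batch, n_codes],
    # one per layer; each group of 7 LLM audio tokens contributes
    # 1 / 2 / 4 codes to layers 1 / 2 / 3 respectively.
    codes = [
        torch.zeros(1, 4, dtype=torch.long, device="cuda"),   # layer 1
        torch.zeros(1, 8, dtype=torch.long, device="cuda"),   # layer 2
        torch.zeros(1, 16, dtype=torch.long, device="cuda"),  # layer 3
    ]

    with torch.inference_mode():
        audio = snac_model.decode(codes)  # float waveform, shape [1, 1, n_samples]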

3. The FastAPI Streaming Server

This Python server acts as the orchestrator:

  • Receives text input via an HTTP request.
  • Formats the text into the prompt expected by the Orpheus model.
  • Sends the prompt to the vLLM/SGLang server and requests a stream of generated tokens.
  • As tokens arrive from the LLM:
    • Identifies the special audio code tokens.
    • Accumulates enough tokens to form complete groups for SNAC.
    • Calls the SNAC model to decode these token groups into audio chunks.
    • Streams the raw PCM audio chunks back to the client.
  • Handles details like WAV header generation, chunking, and applying fades for smoother playback.
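
Reduced to a minimal sketch, the server shape could look as follows. The endpoint name and request fields mirror the curl examples later in the post; generate_audio_stream is only a stand-in for the full pipeline, and make_wav_header is a hypothetical helper for the streaming WAV header:

    # Minimal server sketch; generate_audio_stream stands in for the real pipeline
    import struct

    from fastapi import FastAPI
    from fastapi.responses import StreamingResponse
    from pydantic import BaseModel

    app = FastAPI()

    class TTSRequest(BaseModel):
        text: str
        voice: str = "in_prompt"

    def make_wav_header(sample_rate: int = 24000, bits: int = 16, channels: int = 1) -> bytes:
        """WAV header with an 'unknown' data size, so players keep reading until EOF."""
        byte_rate = sample_rate * channels * bits // 8
        block_align = channels * bits // 8
        return (
            b"RIFF" + struct.pack("<I", 0xFFFFFFFF) + b"WAVE"
            + b"fmt " + struct.pack("<IHHIIHH", 16, 1, channels, sample_rate,
                                    byte_rate, block_align, bits)
            + b"data" + struct.pack("<I", 0xFFFFFFFF)
        )

    async def generate_audio_stream(text: str, voice: str):
        # 1) Send the WAV header immediately, 2) then yield PCM chunks as SNAC
        # decodes them. Here, 100 ms of silence stands in for the real audio.
        yield make_wav_header()
        yield b"\x00\x00" * 2400

    @app.post("/generate-audio-stream/")
    async def generate_audio(request: TTSRequest):
        return StreamingResponse(
            generate_audio_stream(request.text, request.voice),
            media_type="audio/wav",
        )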

Running the System on a Single RTX 3090

  1. Start the vLLM Server:

    • Choose your Orpheus model (e.g., SebastianBodza/Kartoffel_Orpheus-3B_german_natural-v0.1).
    • Use FP8 quantization (--quantization fp8) to reduce VRAM usage, or use GPTQ or AWQ for INT4 quantization after converting the models.
    • Adjust --max_model_len based on expected input/output lengths.
    # Example vLLM command with FP8
    vllm serve SebastianBodza/Kartoffel_Orpheus-3B_german_natural-v0.1 \
    --dtype auto \
    --quantization fp8 \
    --enable-chunked-prefill \
    --max_model_len 4096 \
    --gpu-memory-utilization 0.7 

    Alternatively, you can use SGLang:

    # Example SGLang command
    python -m sglang.launch_server --model-path SebastianBodza/Kartoffel_Orpheus-3B_german_natural-v0.1 \
    --context-length 4096 \
    --mem-fraction-static 0.7 

    Note: Adjust memory utilization (--gpu-memory-utilization) to leave enough VRAM for the FastAPI server and the SNAC model. Use CUDA_VISIBLE_DEVICES and --tensor-parallel-size (tp) to select and distribute GPUs.

  2. Start the FastAPI/SNAC Server:

    • Ensure the script points to the correct vLLM or SGLang API endpoint (VLLM_BASE_URL) and tokenizer path.
    • If you have multiple GPUs, make sure CUDA_VISIBLE_DEVICES is set correctly; with only one GPU available/visible, the server shares the same GPU as vLLM.
    • Run the Python script.
    # Set environment variables (adjust paths and GPU ID if needed)
    export HF_HOME="/path/to/your/hf_cache/"
    export MODEL_NAME="/path/to/your/model"
    export VLLM_BASE_URL="http://localhost:8000/v1" 
    export CUDA_VISIBLE_DEVICES="0" # Or the GPU ID vLLM is using
     
    # Run the server
    python streaming_api_server.py

    The script uses Uvicorn to run the FastAPI app on port 8001 by default.

  3. Test the Stream: Use curl and pipe the output to an audio player like paplay (Linux) or ffplay (cross-platform).

    # Example using paplay (Linux)
    curl -X POST "http://localhost:8001/generate-audio-stream/" \
         -H "Content-Type: application/json" \
         -d '{
               "text": "Hallo Welt! Dies ist ein Test der Streaming-Audioausgabe.",
               "voice": "in_prompt"
             }' \
         --no-buffer \
         | paplay --raw --format=s16le --rate=24000 --channels=1
     
    # Example using ffplay (cross-platform)
    curl -X POST "http://localhost:8001/generate-audio-stream/" \
         -H "Content-Type: application/json" \
         -d '{
               "text": "Ein weiterer Test mit einem etwas längeren Satz, um zu sehen, wie das Streaming funktioniert.",
               "voice": "in_prompt"
             }' \
         --no-buffer \
         | ffplay -i pipe:0 -nodisp -autoexit -f s16le -ar 24000 -ac 1

    The --no-buffer flag for curl is important to ensure it doesn't wait for the entire response before piping.
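
    If you prefer Python over curl, a rough client sketch using requests could look like this (the endpoint and payload mirror the examples above; writing to a file stands in for live playback):

    # Hedged Python client example
    import requests

    payload = {
        "text": "Hallo Welt! Dies ist ein Test der Streaming-Audioausgabe.",
        "voice": "in_prompt",
    }

    # stream=True makes requests hand over chunks as they arrive instead of
    # buffering the whole response.
    with requests.post(
        "http://localhost:8001/generate-audio-stream/", json=payload, stream=True
    ) as response:
        response.raise_for_status()
        with open("output.wav", "wb") as f:
            for chunk in response.iter_content(chunk_size=4096):
                f.write(chunk)  # or feed the bytes to an audio player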

The Streaming Logic

The core logic lives in an async function within the FastAPI app, generate_audio_stream. Here's the flow:

  1. Send Header First: As soon as a request comes in, format the prompt and immediately send back a standard WAV header. The client's audio player can start buffering, expecting 24kHz, 16-bit mono PCM data. This minimizes the initial delay.
  2. Start LLM Stream: Call the vLLM server's completion endpoint with stream=True. This tells vLLM to send back tokens as soon as they are generated, rather than waiting for the entire sequence.
  3. Collect and Process Tokens: Loop through the incoming token chunks from vLLM.
    • Accumulate the text/tokens.
    • Look for the special CODE_START_TOKEN_ID (128257 in this model) which marks the beginning of the audio token sequence.
    • Filter and collect valid audio tokens (checking they are within the expected range, e.g., >= CODE_TOKEN_OFFSET which is 128266 here).
    • Wait until you have enough tokens for a processing batch (e.g., STREAM_CHUNK_SIZE_GROUPS * 7 = 210 tokens). I used a smaller initial chunk (INITIAL_CHUNK_SIZE_GROUPS) to get the first bit of audio out even faster.
  4. Decode with SNAC:
    • Take the collected batch of raw audio tokens.
    • Offset them (subtract CODE_TOKEN_OFFSET) to get the actual code values (0-4095 range).
    • Restructure them into the 3-layer format SNAC expects using the redistribute_codes_sync function. This involves some indexing based on the Orpheus code.
    • Pass these structured codes to snac_model.decode().
  5. Send Audio Chunk:
    • Convert the resulting audio tensor from SNAC into raw 16-bit PCM bytes (convert_to_pcm16_bytes).
    • Crucially, apply a short fade-in/fade-out (e.g., 5-10 ms) to the ends of each chunk using apply_fade. This helps prevent audible clicks or pops at the chunk boundaries where the client player stitches the chunks back together (see the sketch after this list). The original Orpheus code takes a different approach, using a sliding window, which would probably also be worth adopting here.
    • yield the PCM bytes. This sends the audio chunk back to the client immediately.
  6. Repeat: Continue collecting tokens, decoding, and yielding audio chunks until the vLLM stream ends.
  7. Final Flush: Process any leftover tokens that didn't form a full chunk at the end.
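
To make steps 3-5 more concrete, here are hedged sketches of the helpers mentioned above. The function names mirror the ones referenced in this post, but the bodies are reconstructions based on the description and the public Orpheus reference code, not the exact repository implementation:

    # Reconstructed helper sketches (not the exact repository code)
    import numpy as np
    import torch

    CODE_TOKEN_OFFSET = 128266  # first audio-code token id in this model

    def redistribute_codes_sync(tokens: list[int]) -> list[torch.Tensor]:
        """Regroup flat audio tokens (7 per frame) into SNAC's three layers."""
        l1, l2, l3 = [], [], []
        for i in range(len(tokens) // 7):
            t = [tok - CODE_TOKEN_OFFSET for tok in tokens[7 * i : 7 * i + 7]]
            # Layout taken from the public Orpheus reference code: each of the
            # 7 positions has its own 4096-sized sub-vocabulary.
            l1.append(t[0])
            l2.append(t[1] - 4096)
            l3.append(t[2] - 2 * 4096)
            l3.append(t[3] - 3 * 4096)
            l2.append(t[4] - 4 * 4096)
            l3.append(t[5] - 5 * 4096)
            l3.append(t[6] - 6 * 4096)
        return [torch.tensor(l, dtype=torch.long).unsqueeze(0) for l in (l1, l2, l3)]

    def apply_fade(audio: np.ndarray, sample_rate: int = 24000, fade_ms: float = 5.0) -> np.ndarray:
        """Apply a short linear fade-in/out to avoid clicks at chunk boundaries."""
        n = min(int(sample_rate * fade_ms / 1000), len(audio) // 2)
        if n > 0:
            ramp = np.linspace(0.0, 1.0, n, dtype=audio.dtype)
            audio[:n] *= ramp
            audio[-n:] *= ramp[::-1]
        return audio

    def convert_to_pcm16_bytes(audio: np.ndarray) -> bytes:
        """Convert float audio in [-1, 1] to little-endian 16-bit PCM bytes."""
        return (np.clip(audio, -1.0, 1.0) * 32767.0).astype("<i2").tobytes()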

Summary

By combining an Orpheus-style TTS model with an efficient LLM inference server like vLLM (using FP8 quantization) and a fast encoder/decoder like SNAC, we can create a low-latency, streaming audio generation system. The FastAPI server orchestrates the process, fetching tokens from the LLM, decoding them into audio chunks with SNAC, and streaming the results back to the client. This entire pipeline can be deployed and run effectively on a single consumer GPU like an RTX 3090, making responsive TTS accessible without requiring high-end server hardware.