Any Thoughts on Latency Optimization: Batch to Streaming?
May 2, 2026. 10-minute read.
One of the things I quietly enjoy about being a lecturer is when students from other classes come to me with their technical problems. Not because I am smarter than their assigned lecturers (I am not), but because it means something I said or did in my own class made them feel safe enough to ask. That trust is not something you can demand. It is earned, one honest answer at a time.
Recently, a student reached out with a question about their AI counseling bot project. The architecture was straightforward: Speech-to-Text → AI Orchestrator → Text-to-Speech. The problem? High latency in audio response time. The user speaks, waits, waits some more, and finally hears a response. Not exactly the "companion" experience they were aiming for.
They had already tried FastAPI and attempted WebSocket streaming, but could not get the performance they needed. Here is what I told them.
The Real Problem: Cumulative Delay
The issue of "slow audio delivery" is rarely a single bottleneck. It is usually a cumulative delay caused by:
- Waiting for Speech-to-Text (STT) to finish transcribing the entire utterance
- Waiting for the LLM to generate the full response
- Text-to-Speech (TTS) starting only after the complete text is available
- Sending the audio in a heavy format or via an inefficient method
Each step waits for the previous one to fully complete before starting. The delays stack. A 500ms STT wait + 2s LLM generation + 1s TTS rendering + 300ms network transfer = nearly 4 seconds of silence. For a "companion" bot, that is an eternity.
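The arithmetic above can be written down as a tiny sketch. The stage names and the dictionary are only illustrative; the numbers are the example figures from this paragraph, not measurements from the student's project.

```python
# Example stage latencies in seconds, taken from the paragraph above.
# These are illustrative figures, not measurements.
STAGES = {
    "stt": 0.5,      # waiting for the full transcript
    "llm": 2.0,      # waiting for the full response
    "tts": 1.0,      # rendering the full audio
    "network": 0.3,  # transferring the full file
}


def batch_time_to_first_audio(stages: dict[str, float]) -> float:
    """In a batch pipeline each stage waits for the previous one, so delays add up."""
    return sum(stages.values())


total = batch_time_to_first_audio(STAGES)
print(f"Silence before first audio: {total:.1f} s")  # prints 3.8 s
```

In a batch design, shaving one stage only removes that stage's share of the total. The user still waits for the sum of everything else.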
For the best User Experience (UX), your target should be minimizing time-to-first-audio: how quickly the user hears the first sound, even if the entire response is not finished yet.
The Strategy: From "Batch" to "Streaming Pipeline"
If your current workflow is:
STT completes → LLM completes → TTS renders full audio → Send full file
It will inevitably feel slow. Every arrow (→) is a wall where the next step waits for the previous one to finish entirely.
The fix is to turn those walls into streams. Here are three steps:
1. Streaming STT (Partial Transcripts)
As soon as the STT engine generates a partial result (e.g., "So I feel..."), immediately pass it to the orchestrator rather than waiting for the user to stop talking. Most modern STT services (for example Google Cloud Speech-to-Text and Deepgram, or Whisper run behind a streaming wrapper) support partial/interim results. Use them.
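Here is a minimal sketch of how an orchestrator might consume interim results. The `TranscriptEvent` shape and the `forward_transcripts` helper are hypothetical and do not belong to any particular vendor's SDK. Real services expose a similar "is this final?" flag, but the field names differ.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator


@dataclass
class TranscriptEvent:
    """Hypothetical shape of one STT result; real SDKs name these fields differently."""
    text: str
    is_final: bool


def forward_transcripts(events: Iterable[TranscriptEvent]) -> Iterator[TranscriptEvent]:
    """Pass partial results on as they arrive, skipping unchanged repeats."""
    last_sent = None
    for event in events:
        # Always forward the final result; forward a partial only if its text changed.
        if event.is_final or event.text != last_sent:
            yield event
            last_sent = event.text


# Simulated stream of interim results followed by the final transcript.
stream = [
    TranscriptEvent("So", False),
    TranscriptEvent("So", False),
    TranscriptEvent("So I feel", False),
    TranscriptEvent("So I feel tired.", True),
]
forwarded = list(forward_transcripts(stream))
print([event.text for event in forwarded])
```

The orchestrator can start preparing context on the partials and commit to a reply once the final result arrives.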
2. Streaming LLM (Token Streaming)
Consume the LLM output as a stream of tokens. Do not wait for the full response. Buffer these tokens by phrase or sentence (roughly 6 to 15 words, or until a punctuation mark is reached). This gives you "chunks" of coherent text that are ready for the next step.
3. Chunked TTS
Take each coherent chunk of text and send it to the TTS engine immediately. Do not wait for the full paragraph. These audio chunks should be sent and played back-to-back on the client side, creating the illusion of continuous speech.
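The chunked TTS step can be sketched with `asyncio`. Everything here is a stand-in: `text_chunks` simulates segmented LLM output and `fake_tts` returns placeholder bytes instead of calling a real TTS engine. The sketch only shows the two ideas that matter: synthesis starts for each chunk as soon as it arrives, and playback keeps the original order.

```python
import asyncio
from typing import AsyncIterator


async def text_chunks() -> AsyncIterator[str]:
    """Stand-in for segmented LLM output arriving over time."""
    for chunk in ["I hear you.", "That sounds hard.", "Tell me more?"]:
        await asyncio.sleep(0.01)  # simulated generation delay
        yield chunk


async def fake_tts(text: str) -> bytes:
    """Stand-in for a TTS call; returns placeholder bytes, not real audio."""
    await asyncio.sleep(0.002 * len(text))  # longer text takes longer
    return text.encode("utf-8")


async def synthesize(chunks: AsyncIterator[str], queue: asyncio.Queue) -> None:
    async for chunk in chunks:
        # Start synthesis right away; do not wait for earlier chunks to finish.
        await queue.put(asyncio.create_task(fake_tts(chunk)))
    await queue.put(None)  # sentinel: no more audio is coming


async def play(queue: asyncio.Queue) -> list[bytes]:
    played = []
    while (task := await queue.get()) is not None:
        # Awaiting tasks in queue order keeps the speech in the right order.
        played.append(await task)  # a real client would play the audio here
    return played


async def main() -> list[bytes]:
    queue: asyncio.Queue = asyncio.Queue()
    _, played = await asyncio.gather(synthesize(text_chunks(), queue), play(queue))
    return played


played = asyncio.run(main())
print(played)
```

Queuing the tasks rather than the finished audio means a slow chunk cannot be overtaken by a faster one behind it, so the client can play each result as soon as it is ready without reordering.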
The result: the user hears the first word within a few hundred milliseconds to roughly 2 seconds, rather than waiting 4+ seconds for the entire pipeline to finish.
The Sweet Spot
Not all chunk sizes are equal:
- TTS per token (roughly one word or word fragment at a time): Too much network overhead. Sounds robotic. Prosody breaks.
- Waiting for a full paragraph: Takes far too long. Defeats the purpose of streaming.
- Per-sentence or short phrase: The sweet spot. Natural prosody. Fast enough. Manageable overhead.
The keyword here is "segmenter" (a small piece of logic that buffers tokens and decides when a chunk is "ready" to send to TTS). A simple implementation: flush the buffer when you hit a period, question mark, exclamation mark, or when the buffer exceeds ~15 words.
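A minimal sketch of such a segmenter follows, assuming tokens arrive as whole words. Real LLM tokens are often word fragments, so a production version would need to join them first. The `segment` function and the `MAX_WORDS` constant are illustrative names, and the 15-word limit is the rough figure from the text, not a tuned value.

```python
from typing import Iterable, Iterator

SENTENCE_END = (".", "?", "!")
MAX_WORDS = 15  # rough upper bound from the text; tune for your TTS engine


def segment(tokens: Iterable[str], max_words: int = MAX_WORDS) -> Iterator[str]:
    """Buffer tokens and flush a chunk at sentence end or when the buffer gets long."""
    buffer: list[str] = []
    for token in tokens:
        buffer.append(token)
        if token.endswith(SENTENCE_END) or len(buffer) >= max_words:
            yield " ".join(buffer)
            buffer = []
    if buffer:
        # Flush whatever is left when the stream ends without punctuation.
        yield " ".join(buffer)


reply = "I hear you. That sounds hard, and I am glad you told me. What happened next?"
chunks = list(segment(reply.split()))
for chunk in chunks:
    print(chunk)
```

Because `segment` is a generator, each chunk is available the moment its last token arrives, so the TTS call for the first sentence can start while the LLM is still producing the second.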
Why This Matters (Beyond the Technical)
Today is Hari Pendidikan Nasional (National Education Day of Indonesia). And I think this interaction captures something important about education: the best learning happens when students feel comfortable asking questions to anyone who might help, not just the person assigned to them. Knowledge does not belong to a single classroom or a single lecturer. It flows to whoever is willing to receive it and whoever is willing to share it.
To the student who asked: your instinct to reach out was correct. Your architecture was already 80% right. You just needed someone to point out that the arrows between your boxes do not have to be walls. They can be pipes.
Keep building.
@hepidad