Modern neural TTS models (ElevenLabs Turbo, OpenAI TTS, Cartesia Sonic) achieve sub-200 ms latency and near-human prosody. The arms race in TTS is now largely about emotional expressiveness and multi-language coverage.
Callsy's voice stack supports 40+ languages and 60+ regional voices, with sub-second end-to-end response latency.
TTS quality is one of the three factors (along with LLM reasoning and STT accuracy) that determine whether an AI voice agent feels natural or robotic.
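
Since a voice agent's response passes through STT, the LLM, and TTS in turn, an end-to-end latency target is effectively a budget shared across all three stages. The sketch below shows that arithmetic. The per-stage numbers and the names `StageLatency`, `total_latency_ms`, and `within_budget` are hypothetical and chosen only for illustration; they are not measurements of Callsy or of any vendor.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StageLatency:
    """Latency contributed by one pipeline stage, in milliseconds."""
    name: str
    millis: float


def total_latency_ms(stages):
    """Sum per-stage latencies, assuming the stages run strictly in sequence."""
    return sum(stage.millis for stage in stages)


def within_budget(stages, budget_ms=1000.0):
    """Return True if the summed latency is strictly under the budget."""
    return total_latency_ms(stages) < budget_ms


# Hypothetical figures for illustration only, not measured values.
stages = [
    StageLatency("stt", 300.0),  # speech-to-text transcription
    StageLatency("llm", 400.0),  # LLM reasoning / response generation
    StageLatency("tts", 200.0),  # text-to-speech synthesis
]

if __name__ == "__main__":
    print(f"total: {total_latency_ms(stages):.0f} ms")
    print(f"sub-second: {within_budget(stages)}")
```

The sequential sum is a simplification: pipelines that stream partial output between stages can overlap them, so perceived latency may be lower than this total.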