arxiv:2601.11141

FlashLabs Chroma 1.0: A Real-Time End-to-End Spoken Dialogue Model with Personalized Voice Cloning

Published on Jan 16 · Submitted by Rajkumar rawal on Jan 22

Abstract

Chroma 1.0 enables real-time spoken dialogue with personalized voice cloning through discrete speech representations and interleaved text-audio token scheduling.

AI-generated summary

Recent end-to-end spoken dialogue systems leverage speech tokenizers and neural audio codecs to enable LLMs to operate directly on discrete speech representations. However, these models often exhibit limited speaker identity preservation, hindering personalized voice interaction. In this work, we present Chroma 1.0, the first open-source, real-time, end-to-end spoken dialogue model that achieves both low-latency interaction and high-fidelity personalized voice cloning. Chroma achieves sub-second end-to-end latency through an interleaved text-audio token schedule (1:2) that supports streaming generation, while maintaining high-quality personalized voice synthesis across multi-turn conversations. Our experimental results demonstrate that Chroma achieves a 10.96% relative improvement in speaker similarity over the human baseline, with a Real-Time Factor (RTF) of 0.43, while maintaining strong reasoning and dialogue capabilities. Our code and models are publicly available at https://github.com/FlashLabs-AI-Corp/FlashLabs-Chroma and https://huggingface.co/FlashLabs/Chroma-4B .
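To make the interleaved schedule concrete, here is a minimal sketch of what a 1:2 text-audio interleaving could look like. It assumes "1:2" means one text token followed by two audio tokens per group, and that leftover audio tokens are emitted once the text runs out; both are assumptions, and the function name `interleave` is hypothetical rather than taken from the Chroma code base.

```python
from itertools import islice
from typing import Iterable, Iterator, Tuple


def interleave(
    text_tokens: Iterable[str],
    audio_tokens: Iterable[str],
    n_text: int = 1,
    n_audio: int = 2,
) -> Iterator[Tuple[str, str]]:
    """Yield (kind, token) pairs in repeating groups of
    n_text text tokens followed by n_audio audio tokens.

    Hypothetical illustration only: the tail behaviour (draining
    whichever stream still has tokens) is an assumption.
    """
    text_it, audio_it = iter(text_tokens), iter(audio_tokens)
    while True:
        text_chunk = list(islice(text_it, n_text))
        audio_chunk = list(islice(audio_it, n_audio))
        if not text_chunk and not audio_chunk:
            return  # both streams are exhausted
        for tok in text_chunk:
            yield ("text", tok)
        for tok in audio_chunk:
            yield ("audio", tok)


if __name__ == "__main__":
    schedule = interleave(["t0", "t1"], ["a0", "a1", "a2", "a3", "a4", "a5"])
    print([tok for _, tok in schedule])
    # prints ['t0', 'a0', 'a1', 't1', 'a2', 'a3', 'a4', 'a5']
```

Because audio tokens appear after every text token, a streaming decoder can start synthesizing sound before the full text response exists, which is the property the abstract credits for low latency.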

Community

Some observations from the paper:

-- End-to-end S2S advantage:
Chroma 1.0 avoids cascaded ASR → LLM → TTS pipelines, reducing latency and preserving paralinguistic cues such as timbre and prosody.

-- High-fidelity voice cloning:
With only a few seconds of reference audio, Chroma achieves 10.96% higher speaker similarity than the human baseline, outperforming existing open and commercial models.

-- Real-time streaming design:
The interleaved 1:2 text-audio token schedule enables sub-second responsiveness (TTFT ≈ 147 ms) and smooth streaming synthesis.

-- Efficiency at small scale:
Despite having only 4B parameters, Chroma maintains competitive understanding, reasoning, and dialogue performance compared to larger 7–9B models.

-- Naturalness vs. fidelity trade-off:
Subjective tests show that commercial systems may sound more "natural," but Chroma preserves speaker identity more faithfully, highlighting that listener preference does not always equal true speaker similarity.
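For readers unfamiliar with the two headline metrics, the sketch below shows how a Real-Time Factor and a relative improvement are conventionally computed. It uses the standard definitions (generation time divided by audio duration, and score difference divided by the baseline score); the paper's exact measurement protocol is not described on this page, and every input number below is a hypothetical placeholder, not a result from the paper.

```python
def real_time_factor(generation_seconds: float, audio_seconds: float) -> float:
    """RTF = time spent generating / duration of the generated audio.
    Values below 1.0 mean synthesis runs faster than playback."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return generation_seconds / audio_seconds


def relative_improvement(model_score: float, baseline_score: float) -> float:
    """Fractional gain of model_score over baseline_score."""
    if baseline_score == 0:
        raise ValueError("baseline_score must be non-zero")
    return (model_score - baseline_score) / baseline_score


if __name__ == "__main__":
    # Hypothetical timing: 4.3 s of compute for 10 s of speech.
    print(f"RTF: {real_time_factor(4.3, 10.0):.2f}")
    # Hypothetical similarity scores, chosen only to illustrate the formula.
    print(f"Relative gain: {relative_improvement(0.5548, 0.5):.2%}")
```

Under these definitions, an RTF of 0.43 means each second of speech takes about 0.43 seconds to generate, leaving headroom for streaming playback.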
