Single-Codec: Single-Codebook Speech Codec towards High-Performance Speech Generation

Hanzhao Li¹, Liumeng Xue², Haohan Guo³, Xinfa Zhu¹, Yuanjun Lv¹, Lei Xie¹, Yunlin Chen⁴, Hao Yin⁴, Zhifei Li⁴

¹Audio, Speech and Language Processing Group (ASLP@NPU), School of Computer Science, Northwestern Polytechnical University, Xi'an, China
²School of Data Science, The Chinese University of Hong Kong, Shenzhen (CUHK-Shenzhen), China
³The Chinese University of Hong Kong, Hong Kong SAR, China
⁴Shanghai Mobvoi Information Technology Co., Ltd.

0. Contents

Abstract
Samples on Audio Reconstruction: Ablation Study
Samples on Audio Reconstruction: Comparison Study
Audio Samples: VALL-E Zero-shot TTS based on different codecs

1. Abstract

Multi-codebook speech codecs enable the application of large language models (LLMs) to TTS, but the multi-sequence prediction they require bottlenecks efficiency and robustness. To avoid this obstacle, we propose Single-Codec, a single-codebook, single-sequence codec that employs a disentangled VQ-VAE to decouple speech into a time-invariant embedding and a phonetically rich discrete sequence. Furthermore, the encoder is enhanced with 1) contextual modeling with a BLSTM module to exploit temporal information, 2) a hybrid sampling module to alleviate distortion from upsampling and downsampling, and 3) a resampling module to encourage the discrete units to carry more phonetic information. Compared with multi-codebook codecs, e.g., EnCodec and TiCodec, Single-Codec demonstrates higher reconstruction quality at a lower bandwidth of only 304 bps. The effectiveness of Single-Codec is further validated by LLM-TTS experiments, which show improved naturalness and intelligibility.
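To make the bandwidth and multi-sequence points concrete, the sketch below computes the bitrate and the number of tokens per second an LLM would have to predict for a given codec configuration. The helper names and all configuration values (sample rate, hop size, codebook sizes, frame rates) are illustrative assumptions and are not stated on this page; the single-codebook values were chosen only because they land near the 304 bps figure quoted above.

```python
import math


def codec_bitrate_bps(frame_rate_hz, codebook_size, num_codebooks=1):
    """Bits per second when each frame emits one index per codebook."""
    bits_per_token = math.log2(codebook_size)
    return frame_rate_hz * bits_per_token * num_codebooks


def tokens_per_second(frame_rate_hz, num_codebooks=1):
    """Discrete tokens a language model must predict per second of speech."""
    return frame_rate_hz * num_codebooks


# Assumed single-codebook setup (hypothetical values, not from this page):
# 24 kHz audio, one frame per 1024 samples, one codebook with 8192 entries.
single_frame_rate = 24000 / 1024
single_bps = codec_bitrate_bps(single_frame_rate, codebook_size=8192)

# Assumed multi-codebook setup for contrast (hypothetical values):
# 75 frames per second, 8 codebooks with 1024 entries each.
multi_bps = codec_bitrate_bps(75, codebook_size=1024, num_codebooks=8)

print(f"single-codebook: {single_bps:.1f} bps, "
      f"{tokens_per_second(single_frame_rate):.1f} tokens/s")
print(f"multi-codebook:  {multi_bps:.1f} bps, "
      f"{tokens_per_second(75, 8):.1f} tokens/s")
```

Under these assumed values the single-codebook setup yields one short token sequence, whereas the multi-codebook setup yields several parallel sequences per frame, which is the prediction burden the abstract refers to.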

Fig.1 The architecture of Single-Codec.
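As a rough illustration of what "single-codebook, single-sequence" means, the following minimal sketch performs plain nearest-neighbor vector quantization with one codebook, so each frame becomes exactly one token. It is a generic VQ lookup under toy assumptions and does not reproduce the Single-Codec encoder, its BLSTM, hybrid sampling, or resampling modules; the function names, codebook, and frame values are hypothetical.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def quantize(frames, codebook):
    """Map each frame vector to the index of its nearest codebook entry."""
    return [
        min(range(len(codebook)), key=lambda k: squared_distance(f, codebook[k]))
        for f in frames
    ]


def dequantize(indices, codebook):
    """Look up the codebook vector for each token index."""
    return [codebook[i] for i in indices]


# Toy codebook with three 2-dimensional entries (hypothetical values).
codebook = [[0.0, 0.0], [1.0, 1.0], [-1.0, 2.0]]

# Toy encoder outputs, one vector per frame (hypothetical values).
frames = [[0.1, -0.2], [0.9, 1.2], [-0.8, 1.7]]

tokens = quantize(frames, codebook)          # one token per frame
reconstructed = dequantize(tokens, codebook)
print(tokens)
```

Because there is only one codebook, the output is a single index sequence whose length equals the number of frames, which is the form a language model can predict directly.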

2. Samples on Audio Reconstruction: Ablation Study

GT	VQVAE	Ref-short	Ref-long	Ref-BLSTM	Ref-HybSam	Ref-BLSTM-HybSam	Ref-BLSTM-HybSam-Conf	Single-Codec

3. Samples on Audio Reconstruction: Comparison Study

GT	VQVAE	EnCodec-1VQ	TiCodec-1VQ	TiCodec-2VQ	Single-Codec

4. Audio Samples: VALL-E Zero-shot TTS based on different codecs

Prompt Audio	VQVAE	EnCodec-1VQ	EnCodec-4VQ	EnCodec-8VQ	TiCodec-1VQ	Single-Codec
Chinese speaker
	That was before the day of high-school athletics.


	He had found your humble servant, then about six months old.


	Some, wounded or killed, fell back into the rooms, uttering piercing cries.


	She will do for me.

English speaker
	"Your wife!" cried Kate.


	She had heard all about the Princess's dream.


	Mrs. Presty interposed.


	She will do for me.

English speaker
	and a service of plate.


	It always ends, it is true, in an awakening, but the awakening is tardy.


	"I won't let the little beast kiss me," stipulated Victor.


	A beautiful silver studded sword was the King's gift to him.

Chinese speaker
	他以二十一米七四的成绩摘得一枚银牌 (He won a silver medal with a result of 21.74 meters.)


	CHAPTER sixteen.


	The man let me have him for my silver chain.


	而在基金销售市场发挥的作用越来越大 (and is playing an increasingly large role in the fund sales market)