Single-Codec: Single-Codebook Speech Codec towards High-Performance Speech Generation

Hanzhao Li1, Liumeng Xue2, Haohan Guo3, Xinfa Zhu1, Yuanjun Lv1, Lei Xie1
Yunlin Chen4, Hao Yin4, Zhifei Li4

1Audio, Speech and Language Processing Group (ASLP@NPU), School of Computer Science, Northwestern Polytechnical University, Xi'an, China
2School of Data Science, The Chinese University of Hong Kong, Shenzhen (CUHK-Shenzhen), China
3The Chinese University of Hong Kong, Hong Kong SAR, China
4Shanghai Mobvoi Information Technology Co., Ltd

0. Contents

  1. Abstract
  2. Samples on Audio Reconstruction: Ablation Study
  3. Samples on Audio Reconstruction: Comparison Study
  4. Audio Samples: VALL-E Zero-shot TTS based on different codecs


1. Abstract

The multi-codebook speech codec enables the application of large language models (LLMs) in TTS but bottlenecks efficiency and robustness because the model must predict multiple token sequences. To avoid this obstacle, we propose Single-Codec, a single-codebook, single-sequence codec that employs a disentangled VQ-VAE to decouple speech into a time-invariant embedding and a phonetically rich discrete sequence. Furthermore, the encoder is enhanced with 1) contextual modeling with a BLSTM module to exploit temporal information, 2) a hybrid sampling module to alleviate distortion from upsampling and downsampling, and 3) a resampling module to encourage discrete units to carry more phonetic information. Compared with multi-codebook codecs, e.g., EnCodec and TiCodec, Single-Codec demonstrates higher reconstruction quality at a lower bandwidth of only 304 bps. The effectiveness of Single-Codec is further validated by LLM-TTS experiments, which show improved naturalness and intelligibility.
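
To make the bandwidth figure concrete, the sketch below computes the bitrate of a discrete codec as frames per second times bits per index times the number of codebooks. The function name is hypothetical, and the Single-Codec settings used here (24 kHz audio, a 1024-sample hop, an 8192-entry codebook) are assumptions chosen for illustration because they land near the 304 bps quoted above; this page does not state them. The multi-codebook line uses the commonly cited EnCodec 24 kHz setup (75 frames per second, 8 codebooks of 1024 entries) as a point of comparison.

```python
import math


def codec_bitrate_bps(sample_rate_hz, hop_samples, codebook_size, num_codebooks=1):
    """Bits per second for a codec that emits one index per codebook per frame."""
    frames_per_second = sample_rate_hz / hop_samples
    bits_per_index = math.log2(codebook_size)
    return frames_per_second * bits_per_index * num_codebooks


# ASSUMPTION: illustrative single-codebook settings, not taken from this page.
single = codec_bitrate_bps(24_000, hop_samples=1024, codebook_size=8192)

# Multi-codebook comparison: 75 frames/s, 8 codebooks of 1024 entries each.
multi = codec_bitrate_bps(24_000, hop_samples=320, codebook_size=1024, num_codebooks=8)

print(f"single-codebook: {single:.1f} bps")
print(f"multi-codebook:  {multi:.1f} bps")
```

Under these assumptions the single-codebook codec also emits far fewer tokens per second (about 23 versus 600), which is the multi-sequence prediction burden the abstract refers to.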

Fig.1 The architecture of Single-Codec.

2. Samples on Audio Reconstruction: Ablation Study

GT | VQVAE | Ref-short | Ref-long | Ref-BLSTM | Ref-HybSam | Ref-BLSTM-HybSam | Ref-BLSTM-HybSam-Conf | Single-Codec

3. Samples on Audio Reconstruction: Comparison Study

GT | VQVAE | EnCodec-1VQ | TiCodec-1VQ | TiCodec-2VQ | Single-Codec

4. Audio Samples: VALL-E Zero-shot TTS based on different codecs

Prompt Audio | VQVAE | EnCodec-1VQ | EnCodec-4VQ | EnCodec-8VQ | TiCodec-1VQ | Single-Codec

Chinese speaker

That was before the day of high-school athletics.
He had found your humble servant, then about six months old.
Some, wounded or killed, fell back into the rooms, uttering piercing cries.
She will do for me.

English speaker

"Your wife!" cried Kate.
She had heard all about the Princess's dream.
Mrs. Presty interposed.
She will do for me.

English speaker

and a service of plate.
It always ends, it is true, in an awakening, but the awakening is tardy.
"I won't let the little beast kiss me," stipulated Victor.
A beautiful silver studded sword was the King's gift to him.

Chinese speaker

他以二十一米七四的成绩摘得一枚银牌 (English: He won a silver medal with a result of 21.74 meters.)
CHAPTER sixteen.
The man let me have him for my silver chain.
而在基金销售市场发挥的作用越来越大 (English: and it is playing an increasingly large role in the fund sales market.)