Masaya Kawamura1, Takuya Hasumi1, Yuma Shirahata1, Ryuichi Yamamoto1
1LY Corporation, Tokyo, Japan
Accepted to INTERSPEECH 2025
[Paper]
This paper proposes a highly compact, lightweight text-to-speech (TTS) model for on-device applications. To reduce the model size, the proposed model introduces two techniques. First, we introduce quantization-aware training (QAT), which quantizes model parameters during training to as low as 1.58 bits; in this case, most of the 32-bit model parameters are quantized to ternary values {-1, 0, 1}. Second, we propose a method named weight indexing, in which a group of 1.58-bit weights is stored as a single int8 index. This allows for efficient storage of model parameters, even on hardware that handles values in 8-bit units. Experimental results demonstrate that the proposed method achieves an 83% reduction in model size with reasonable synthesis quality.
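The weight-indexing idea can be sketched as follows. The group size is an assumption here (the abstract does not state it): with groups of five ternary weights, there are 3^5 = 243 ≤ 256 possible groups, so each group fits in one 8-bit index via base-3 encoding. The function names `pack_ternary` and `unpack_ternary` are illustrative, not from the paper.

```python
import numpy as np

def pack_ternary(weights):
    """Pack groups of 5 ternary weights {-1, 0, 1} into single uint8 indices.

    Each group is read as a base-3 number: 3**5 = 243 <= 256, so five
    1.58-bit values fit in one 8-bit index. (Group size of 5 is an
    assumption for illustration; the paper may use a different grouping.)
    """
    w = np.asarray(weights, dtype=np.int64)
    assert w.size % 5 == 0, "pad to a multiple of 5 before packing"
    digits = (w + 1).reshape(-1, 5)        # map {-1, 0, 1} -> {0, 1, 2}
    powers = 3 ** np.arange(5)             # base-3 place values
    return (digits * powers).sum(axis=1).astype(np.uint8)

def unpack_ternary(indices):
    """Invert pack_ternary: recover ternary weights from uint8 indices."""
    idx = np.asarray(indices, dtype=np.int64).reshape(-1, 1)
    digits = (idx // 3 ** np.arange(5)) % 3  # extract base-3 digits
    return (digits - 1).reshape(-1)          # map {0, 1, 2} -> {-1, 0, 1}
```

Stored this way, five weights occupy one byte (1.6 bits per weight) instead of 20 bytes in float32, consistent with the large size reductions reported above.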
| Method | Quantization (Acoustic Model) | Quantization (Vocoder Part) | Utterance ID: 46_128001_000006_000012 | Utterance ID: 65_125860_000024_000001 | Utterance ID: 118_47824_000110_000001 | Utterance ID: 52_123202_000019_000005 |
|---|---|---|---|---|---|---|
| 32-bit | - | - | | | | |
| 32-bit (small model) | - | - | | | | |
| 4-bit | ✓ | | | | | |
| 4-bit | | ✓ | | | | |
| 4-bit | ✓ | ✓ | | | | |
| 1.58-bit | ✓ | | | | | |
| 1.58-bit | | ✓ | | | | |
| 1.58-bit | ✓ | ✓ | | | | |
| Ground truth | - | - | | | | |