Masaya Kawamura1, Takuya Hasumi1, Yuma Shirahata1, Ryuichi Yamamoto1
1LY Corporation, Tokyo, Japan
Accepted to INTERSPEECH 2025
[Paper]
This paper proposes a highly compact, lightweight text-to-speech (TTS) model for on-device applications. To reduce the model size, the proposed model introduces two techniques. First, we introduce quantization-aware training (QAT), which quantizes model parameters during training to as low as 1.58 bits; in this case, most of the 32-bit model parameters are quantized to ternary values {-1, 0, 1}. Second, we propose a method named weight indexing, in which a group of 1.58-bit weights is stored as a single int8 index. This allows for efficient storage of model parameters, even on hardware that handles values in 8-bit units. Experimental results demonstrate that the proposed method achieves an 83% reduction in model size with reasonable synthesis quality.
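The weight-indexing idea can be sketched as follows. The group size is an assumption here (the abstract does not state it): with groups of five ternary weights, there are 3^5 = 243 ≤ 256 possible groups, so each group fits in one 8-bit index via base-3 encoding. The function names `pack_ternary` and `unpack_ternary` are illustrative, not from the paper.

```python
import numpy as np

def pack_ternary(weights):
    """Pack groups of 5 ternary weights {-1, 0, 1} into single uint8 indices.

    Each group is read as a base-3 number: 3**5 = 243 <= 256, so five
    1.58-bit values fit in one 8-bit index. (Group size of 5 is an
    assumption for illustration; the paper may use a different grouping.)
    """
    w = np.asarray(weights, dtype=np.int64)
    assert w.size % 5 == 0, "pad to a multiple of 5 before packing"
    digits = (w + 1).reshape(-1, 5)        # map {-1, 0, 1} -> {0, 1, 2}
    powers = 3 ** np.arange(5)             # base-3 place values
    return (digits * powers).sum(axis=1).astype(np.uint8)

def unpack_ternary(indices):
    """Invert pack_ternary: recover ternary weights from uint8 indices."""
    idx = np.asarray(indices, dtype=np.int64).reshape(-1, 1)
    digits = (idx // 3 ** np.arange(5)) % 3  # extract base-3 digits
    return (digits - 1).reshape(-1)          # map {0, 1, 2} -> {-1, 0, 1}
```

Stored this way, five weights occupy one byte (1.6 bits per weight) instead of 20 bytes in float32, consistent with the large size reductions reported above.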
| Method | Quantization (Acoustic Model) | Quantization (Vocoder Part) | Utterance ID: 46_128001_000006_000012 | Utterance ID: 65_125860_000024_000001 | Utterance ID: 118_47824_000110_000001 | Utterance ID: 52_123202_000019_000005 |
|---|---|---|---|---|---|---|
| 32-bit | - | - | | | | |
| 32-bit (small model) | - | - | | | | |
| 4-bit | ✓ | | | | | |
| 4-bit | | ✓ | | | | |
| 4-bit | ✓ | ✓ | | | | |
| 1.58-bit | ✓ | | | | | |
| 1.58-bit | | ✓ | | | | |
| 1.58-bit | ✓ | ✓ | | | | |
| Ground truth | - | - | | | | |