LibriTTS-P: A Corpus with Speaking Style and Speaker Identity Prompts for Text-to-Speech and Style Captioning
Masaya Kawamura¹, Ryuichi Yamamoto¹, Yuma Shirahata¹, Takuya Hasumi¹, Kentaro Tachibana¹
¹LY Corp., Japan
Accepted to INTERSPEECH 2024
[Paper] [Dataset]
Abstract
We introduce LibriTTS-P, a new corpus based on LibriTTS-R that includes utterance-level descriptions (i.e., prompts) of speaking style and speaker-level prompts of speaker characteristics. We employ a hybrid approach to construct the prompt annotations: (1) manual annotations that capture human perceptions of speaker characteristics and (2) synthetic annotations of speaking style. Compared to existing English prompt datasets, our corpus provides more diverse prompt annotations for all speakers of LibriTTS-R. Experimental results on prompt-based controllable TTS demonstrate that the TTS model trained with LibriTTS-P achieves higher naturalness than the model using a conventional dataset. Furthermore, the results on style captioning show that the model utilizing LibriTTS-P generates 2.5 times more accurate words than the model using a conventional dataset. Our corpus, LibriTTS-P, is available at https://github.com/line/LibriTTS-P.
Figure: Overview of LibriTTS-P and its applications.
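To make the two prompt granularities concrete, below is a minimal Python sketch of how a speaker-level identity prompt and an utterance-level speaking-style prompt could be joined into a single conditioning text for a prompt-based TTS model. The `speaker_prompts` and `style_prompts` tables, the example IDs, and the `build_condition` helper are illustrative assumptions; the actual file format of the released corpus is documented in the GitHub repository.

```python
# A minimal sketch, assuming hypothetical prompt tables keyed by
# LibriTTS(-R)-style IDs; the released LibriTTS-P files may use a
# different layout (see https://github.com/line/LibriTTS-P).

# Hypothetical speaker-level prompts (one per speaker).
speaker_prompts = {
    "103": "A young female speaker with a slightly high-pitched, clear voice.",
}

# Hypothetical utterance-level speaking-style prompts (one per utterance).
style_prompts = {
    "103_1241_000000_000001": "She speaks slowly and softly.",
}

def build_condition(utterance_id: str) -> str:
    """Concatenate the speaker-level and utterance-level prompts."""
    # LibriTTS utterance IDs begin with the speaker ID.
    speaker_id = utterance_id.split("_")[0]
    return f"{speaker_prompts[speaker_id]} {style_prompts[utterance_id]}"

print(build_condition("103_1241_000000_000001"))
# -> "A young female speaker with a slightly high-pitched, clear voice.
#     She speaks slowly and softly."
```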
Demos
This page presents three demos: StyleCap [1], PromptTTS++ [2] evaluated on PromptSpeech [3] data, and PromptTTS++ evaluated on LibriTTS-P data.
References
[1] K. Yamauchi, Y. Ijima, Y. Saito, "StyleCap: Automatic Speaking-Style Captioning from Speech Based on Speech and Language Self-supervised Learning Models", in Proc. ICASSP, 2024, pp. 11261-11265.
[2] R. Shimizu, R. Yamamoto, M. Kawamura, Y. Shirahata, H. Doi, T. Komatsu, K. Tachibana, "PromptTTS++: Controlling Speaker Identity in Prompt-Based Text-to-Speech Using Natural Language Descriptions", in Proc. ICASSP, 2024, pp. 12672-12676.
[3] Z. Guo, Y. Leng, Y. Wu, S. Zhao, X. Tan, "PromptTTS: Controllable Text-to-Speech with Text Descriptions", in Proc. ICASSP, 2023, pp. 1-5.