GPTInformal-Persian-Speech-Dataset

GPTInformal Persian is a freely licensed Persian dataset of audio and text pairs designed for speech synthesis and other speech-related tasks. The dataset was collected, processed, and annotated as part of the Mana-TTS project. For details on the data processing pipeline and statistics on this dataset, please refer to the paper in the Citation section.

Data Source

The text for this dataset was generated using GPT-4o, with prompts covering a wide range of subjects such as politics and nature. The texts are intentionally written in informal Persian. Below is the prompt format used to generate these texts:

Please give me a very long text written in informal Persian. I want it to be mostly about [SUBJECT].
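As a minimal sketch of how the prompt format above can be filled in for several subjects: the function name `build_prompt` and the subject list are illustrative assumptions, not part of the original pipeline. Only the template wording comes from this README.

```python
# Template wording taken from the README; {subject} replaces [SUBJECT].
PROMPT_TEMPLATE = (
    "Please give me a very long text written in informal Persian. "
    "I want it to be mostly about {subject}."
)


def build_prompt(subject: str) -> str:
    """Return the generation prompt for one subject (hypothetical helper)."""
    return PROMPT_TEMPLATE.format(subject=subject)


# Example subjects mentioned in the README; the full list is not given there.
subjects = ["politics", "nature"]
prompts = [build_prompt(s) for s in subjects]

for p in prompts:
    print(p)
```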

These generated texts were then recorded in a quiet environment. The audio and text files underwent forced alignment using aeneas, resulting in smaller chunks of audio-text pairs as presented in this dataset.
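The chunking step can be sketched roughly as follows. This is not the authors' code: it assumes the forced aligner has already written a JSON sync map in which each fragment carries `begin`, `end`, and `lines` fields (the layout aeneas produces for JSON output, as far as we know), and it uses a plain Python list as a stand-in for the audio samples. The helper names `load_fragments` and `slice_samples` are hypothetical.

```python
import json


def load_fragments(sync_map_json: str):
    """Parse a JSON sync map into (begin_sec, end_sec, text) tuples.

    Assumes each fragment has string "begin"/"end" times in seconds and a
    "lines" list holding the aligned text. Empty or zero-length fragments
    are skipped.
    """
    data = json.loads(sync_map_json)
    chunks = []
    for frag in data["fragments"]:
        begin = float(frag["begin"])
        end = float(frag["end"])
        text = " ".join(frag["lines"]).strip()
        if end > begin and text:
            chunks.append((begin, end, text))
    return chunks


def slice_samples(samples, sample_rate, begin, end):
    """Cut the samples that fall between begin and end (in seconds)."""
    start = int(round(begin * sample_rate))
    stop = int(round(end * sample_rate))
    return samples[start:stop]


# Synthetic sync map and audio, for illustration only.
sync_map = json.dumps({
    "fragments": [
        {"begin": "0.000", "end": "2.500", "lines": ["سلام، چطوری؟"]},
        {"begin": "2.500", "end": "5.000", "lines": ["امروز هوا خیلی خوبه."]},
        {"begin": "5.000", "end": "5.000", "lines": [""]},
    ]
})
sample_rate = 10               # toy value so the arithmetic is easy to follow
samples = list(range(50))      # 5 seconds of fake audio

chunks = load_fragments(sync_map)
pairs = [(slice_samples(samples, sample_rate, b, e), t) for b, e, t in chunks]
```

Each element of `pairs` is one audio-text pair of the kind stored in this dataset, here with toy values.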

Download

You can download the dataset from this repository.

Data Columns

Each Parquet file contains the following columns:

file name (string): The unique identifier of the audio file.
transcript (string): The ground-truth transcript of the audio.
duration (float64): Duration of the audio file in seconds.
subject (string): The subject used in the prompt to generate the original text.
audio (sequence): The actual audio data.
samplerate (float64): The sample rate of the audio.
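As a quick sanity check on these columns, the stored duration can be compared with the value implied by the audio length and sample rate. The sketch below works on a plain dictionary standing in for one row; it assumes `audio` is a flat sequence of mono samples, which this README does not state explicitly, and the helper name `check_row` and the sample values are made up for illustration.

```python
def check_row(row, tolerance=0.05):
    """Return True if duration matches len(audio) / samplerate.

    Assumes row["audio"] is a flat sequence of mono samples (an assumption,
    not confirmed by the README). tolerance is in seconds.
    """
    expected = len(row["audio"]) / row["samplerate"]
    return abs(expected - row["duration"]) <= tolerance


# Synthetic row that mirrors the column names listed above.
row = {
    "file name": "example_0001",      # hypothetical identifier
    "transcript": "سلام",
    "duration": 2.0,
    "subject": "nature",
    "audio": [0.0] * 44100,           # 44100 samples of silence
    "samplerate": 22050.0,
}

print(check_row(row))
```

With real data, the same function could be applied to each row after loading a Parquet file with a library such as pandas or pyarrow.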

Citation

If you use GPTInformal-Persian in your research or projects, please cite the following paper:

@article{fetrat2024manatts,
      title={ManaTTS Persian: a recipe for creating TTS datasets for lower resource languages}, 
      author={Mahta Fetrat Qharabagh and Zahra Dehghanian and Hamid R. Rabiee},
      journal={arXiv preprint arXiv:2409.07259},
      year={2024},
}

License

This dataset is available under the CC0 1.0 license. However, it should not be used to replicate or imitate the speaker’s voice for malicious or unethical purposes, including malicious voice cloning.
