Finetuning with existing transcript in large datasets. #363

Closed
Dolyfin opened this issue Oct 7, 2024 · 4 comments
Comments

@Dolyfin

Dolyfin commented Oct 7, 2024

Is your feature request related to a problem? Please describe.
I have a dataset with uncommon words that I cannot expect Whisper or any other ASR model to transcribe accurately. The dataset is already perfectly transcribed: each audio file has an accompanying .lab (label) file containing the manual transcription as raw text.

The dataset generation step only allows the transcript CSV to be modified after the WAVs have been automatically split. This makes the existing, accurate transcriptions essentially useless, and merging them in by hand is impractical on a large dataset (hours of audio).

Describe the solution you'd like
An option to skip ASR transcription and use existing text files instead. If files need to be within a certain length for training, make the user responsible for keeping audio files under the set maximum audio length.
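As a rough illustration of the "user is responsible for clip length" idea, a pre-check like the one below could flag clips that exceed a limit before training. This is only a sketch: the function name `clips_over_limit` and the 12-second default are placeholders chosen for illustration, not values taken from this project, and the standard-library `wave` module only reads uncompressed PCM WAV files.

```python
import wave
from pathlib import Path


def clips_over_limit(dataset_dir, max_seconds=12.0):
    """Return (filename, duration_in_seconds) for every .wav longer than max_seconds.

    max_seconds is a placeholder; use whatever limit the trainer actually enforces.
    """
    too_long = []
    for wav_path in sorted(Path(dataset_dir).glob("*.wav")):
        with wave.open(str(wav_path), "rb") as wav:
            # Duration = number of frames / frames per second.
            duration = wav.getnframes() / wav.getframerate()
        if duration > max_seconds:
            too_long.append((wav_path.name, round(duration, 2)))
    return too_long
```

Running it over the dataset folder and getting an empty list back would mean every clip is within the chosen limit.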

Describe alternatives you've considered
Manually modifying the CSV afterwards is not practical when the dataset is already fully transcribed.

@erew123
Owner

erew123 commented Oct 10, 2024

Hi @Dolyfin

You can manually jump to step 2 and populate the relevant boxes with your own CSV files; however, I appreciate you are talking about something slightly different here. I also intend to document that process a little better in the wiki (finetuning is still on my list of wiki pages to write at some point).

I can't find any details on the "lab" file format (other than for label printers, which clearly isn't right). Do you have any links to something about the file format, or can you tell me some software that works with it, so I can get a better understanding of what you are suggesting?

Thanks

@Dolyfin
Author

Dolyfin commented Oct 10, 2024

.lab is just raw text here. It’s what Fishspeech uses in their fine tuning process. I would just assume .txt for text instead.

@erew123
Owner

erew123 commented Oct 20, 2024

Hi @Dolyfin

Sorry for the late reply, but I've been dealing with other things in life for a while; see #377.

The dataset layout required for Coqui XTTS training is set by Coqui's scripts. Please see their documentation at https://docs.coqui.ai/en/latest/formatting_your_dataset.html#formatting-your-dataset (see the part that says "We recommend the following format delimited by |. In the following example, audio1, audio2 refer to files audio1.wav, audio2.wav etc.").

The only way I can see to handle what you are describing would be to write a script that goes through the .lab files and generates the CSV files Coqui requires. The only real decision the user would need to make is what percentage of the data goes into the evaluation CSV and what percentage into the training CSV. In principle, it wouldn't be too hard to knock something together to do this.

Q. I assume you would have one folder containing your whole dataset, populated with your audio and .lab files? That would then be the dataset to convert to the Coqui format.
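A minimal sketch of such a conversion script might look like the following. It assumes a single folder where every `clip.wav` has a matching `clip.lab`. The function name `build_metadata`, the output file names `metadata_train.csv` and `metadata_eval.csv`, the `audio_file|text|speaker_name` column layout, and the 15% evaluation default are all assumptions made for illustration; they should be checked against what the finetuning scripts actually expect before use.

```python
import random
from pathlib import Path


def build_metadata(dataset_dir, out_dir, eval_fraction=0.15, speaker="speaker", seed=0):
    """Pair each .wav with its same-named .lab file and write train/eval CSVs.

    Returns (number_of_training_rows, number_of_evaluation_rows).
    """
    dataset_dir, out_dir = Path(dataset_dir), Path(out_dir)
    rows = []
    for wav in sorted(dataset_dir.glob("*.wav")):
        lab = wav.with_suffix(".lab")
        if not lab.exists():
            continue  # skip audio that has no transcript
        # Collapse newlines/extra spaces and drop "|" so each row stays well-formed.
        text = " ".join(lab.read_text(encoding="utf-8").replace("|", " ").split())
        if text:
            rows.append((wav.name, text, speaker))

    # Seeded shuffle so the train/eval split is reproducible.
    random.Random(seed).shuffle(rows)
    n_eval = int(len(rows) * eval_fraction)
    splits = {"metadata_eval.csv": rows[:n_eval], "metadata_train.csv": rows[n_eval:]}

    out_dir.mkdir(parents=True, exist_ok=True)
    for name, split in splits.items():
        with open(out_dir / name, "w", encoding="utf-8") as f:
            f.write("audio_file|text|speaker_name\n")  # assumed header layout
            for row in split:
                f.write("|".join(row) + "\n")
    return len(rows) - n_eval, n_eval
```

Calling `build_metadata("my_dataset", "my_dataset/out", eval_fraction=0.15)` would write both CSVs into `my_dataset/out`; audio files without a matching `.lab` are silently skipped, which may or may not be the behaviour wanted for a large dataset.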

Thanks

@erew123
Owner

erew123 commented Nov 24, 2024

Hi @Dolyfin. The whole of finetuning has been rewritten and updated. You may well find it a good step up from what was there before. I can look into converting other dataset formats, but I would need some examples (a small dataset or so), and then I could probably get something working at some point.

Thanks

@erew123 erew123 closed this as completed Nov 24, 2024