"character limit of 273 for language 'fr'" error #153
At a quick glance, this looks really great! I will take a more careful look through all this, but it looks like it really could help when the source is a text file. Detecting sentences reliably is not easy, which is why I ended up using NLTK (Natural Language Toolkit). I like your approach though, and will play around with it some. I would also really like to figure out how to make a good guess at separating text files into chapters so they get useful "part" splits, but other than just trying to match on CHAPTER ##, I am not really sure how else to approach it (and obviously that won't work if the text file doesn't have the explicit word "chapter" at the start of each chapter). Thank you again for using this and helping to make it better, I really appreciate it! |
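For reference, the NLTK approach mentioned above boils down to something like this (a minimal sketch; the sample text is illustrative):

```python
import nltk
from nltk.tokenize import sent_tokenize

nltk.download("punkt")  # one-time download of the Punkt sentence model

text = "This is one sentence. Here is another! And a third?"
print(sent_tokenize(text))
# ['This is one sentence.', 'Here is another!', 'And a third?']
```

The Punkt model has learned many common abbreviations, which is part of why it tends to be more reliable than a bare regex split on periods.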
Thank you for your enthusiastic response. I understand the appeal of using NLTK for natural language processing, but I've encountered some challenges with this method in my files, so I propose an alternative that I believe could be more adaptable and less complex.

My idea is to convert EPUB files into text, CSV, or MD files. These formats are easily editable and allow for the manual or semi-automatic insertion of chapter markers. This method would involve searching for and replacing specific titles or formats with predefined tags, inspired by the Markdown format. For example, "Chapter 1" could be replaced with "# Chapter 1" to clearly indicate the start of a new chapter.

Following our conversation, I've also been thinking about integrating automated preprocessing into the main script. The idea would be to allow maximum flexibility: those looking for a quick and direct solution could opt for the integrated automation, while those who wish to customize the file further could use a separate script for manual preprocessing.

I believe this semi-automatic approach, combined with the option of automated or manual preprocessing, offers significant flexibility, particularly in adapting to different authors' styles and languages. It also allows for manual intervention by those who wish to further customize the structure of their files. I hope these proposals will be useful for the project, and I am open to any collaboration to develop these ideas further. |
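A minimal sketch of the tagging idea described above, assuming plain-text input (the heading pattern and file names are illustrative, not part of the project):

```python
import re

# Hypothetical heading pattern; real books vary ("Chapitre", roman numerals, ...)
CHAPTER_RE = re.compile(r"^(?:chapter|chapitre)\s+\S+.*$",
                        re.IGNORECASE | re.MULTILINE)

def tag_chapters(text):
    # Prefix each recognized chapter heading with "# " so downstream
    # tools can treat it as a part/chapter marker.
    return CHAPTER_RE.sub(lambda match: "# " + match.group(0), text)

with open("book.txt", encoding="utf-8") as f:
    tagged = tag_chapters(f.read())
with open("book_tagged.txt", "w", encoding="utf-8") as f:
    f.write(tagged)
```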
Let's move this to a discussion: #158 |
Hello @Vodou4460, I've been playing with your script and it works really well. I've made one change to have a more "human" reading experience. Because I'm using XTTS, I've changed two lines in epub2tts.py, at line 281. Excuse me if this can be done in a better place or in a more elegant way; I'm not a Python programmer. If you keep improving your function, please tell us! @aedocw please consider integrating the "split sentence" approach in your app. This combination works really well! |
I pushed up a branch that incorporates this suggestion, and it works well on a very small sample I tried. I'm going to try it with a full book before merging, but I think this is a nice improvement and I'm glad you suggested it! |
Thank you very much. I've been busy playing with this, and learning some Python to try to understand how things work and how to get it working better. I'm now testing this code (please excuse my coding). Tomorrow I'll try some more things, like quotes, double quotes, dialog punctuation marks (-, .-)... For now I have this (using @Vodou4460's code):

```python
import re
import datetime

import fire


def reformat_line(line):
    # Make sure every line ends with a period so TTS pauses after it.
    line = line.strip()
    if not line.endswith("."):
        line += "."
    return line


def split_sentence(line):
    # Split on ., ? or ! followed by whitespace, unless preceded by a
    # capital letter (to avoid splitting at initials and abbreviations).
    parts = re.split(r'(?<![A-Z])([.?!])\s', line)
    # Reconstruct sentences, re-attaching the punctuation characters.
    sentences = []
    # This is a mess, but it is the only way I've found to make it work.
    if len(parts) > 1:
        for i in range(0, len(parts) - 1, 2):
            sentences.append((parts[i] + parts[i + 1]).strip())
        sentences.append(parts[-1].strip())
    else:
        sentences.append(parts[0].strip())
    return sentences


def shorten_sentence(sentence, max_length):
    sentences = []
    while len(sentence) > max_length:
        # Cut at a "secondary" punctuation mark; failing that, at a space;
        # failing that, hard-cut at max_length.
        if (cut_point := max(sentence.rfind(',', 0, max_length),
                             sentence.rfind(';', 0, max_length),
                             sentence.rfind(':', 0, max_length))) <= 0:
            if (cut_point := sentence.rfind(' ', 0, max_length)) <= 0:
                cut_point = max_length
        sentences.append(sentence[:cut_point + 1].strip())
        # Keep processing the rest of the sentence.
        sentence = sentence[cut_point + 1:].strip()
    sentences.append(sentence)
    return sentences


def save_to_file(lines, original_filename, max_length):
    timestamp = datetime.datetime.now().strftime("%Y-%m-%d_%H-%M-%S")
    basename, ext = original_filename.rsplit('.', 1)
    new_filename = f"{timestamp}_{basename}_split_{max_length}.{ext}"
    with open(new_filename, 'w', encoding='utf-8') as new_file:
        for line in lines:
            new_file.write(line + '\n')
    return new_filename


def split_and_save_text(original_filename, max_length=239):
    with open(original_filename, 'r', encoding='utf-8') as file:
        text = file.readlines()
    # "Normalize" the text: delete empty lines and end every line with ".",
    # because only lines ending with '.' generate a pause after them.
    # I added this because things like:
    #
    #   "Don't explain your philosophy. Embody it."
    #   ― Epictetus
    #
    # were "joined" with the next line of text. The opposite happens for
    # lines processed in shorten_sentence.
    normalized = [reformat_line(line) for line in text if line.strip()]
    # Split sentences at "primary" punctuation marks.
    split_lines = []
    for line in normalized:
        if line.startswith('#'):
            split_lines.append(line)
        else:
            split_lines += split_sentence(line)
    # Split sentences longer than max_length at "secondary" punctuation,
    # or at a space.
    result = []
    for line in split_lines:
        if len(line) <= max_length:
            result.append(line)
        else:
            result.extend(shorten_sentence(line, max_length))
    print(save_to_file(result, original_filename, max_length))


if __name__ == "__main__":
    fire.Fire(split_and_save_text)
```

You can use it; I'm using it only for XTTS. Once again, thank you very much! |
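Since the script exposes split_and_save_text through fire, saving it as e.g. split_text.py (an illustrative name) makes it runnable from the command line as `python split_text.py book.txt --max_length 239`, with the positional argument and flag mapped from the function's parameters.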
Hi. Does this branch do the split into sentences? Or should I keep using the code above to split and then process using this branch? |
This branch splits into sentences, but it would be worth trying it with |
Ok. I will give it a go tomorrow, comparing with current results. Thanks |
Hello. I've tested the sentences-pause branch with two test texts I have. The beginning is
The only thing I've found is that it joins the 2nd, 3rd and 4th lines, reading
The rest worked ok. The output:
It is something I've detected: the lines must end with a punctuation mark. That is why I do:

```python
def reformat_line(line):
    line = line.strip()
    if not line.endswith("."):
        line += "."
    return line
```

This is redundant if the line already ends with ",", "?" or any other punctuation mark, so maybe something like

```python
line = line.strip()
if line[-1] not in [".", "!", "?", ","]:
    line += ","
```

could work. |
Hello, If I update the installation of epub2tts on my machine, will these enhancements be automatically made available? |
The latest update does not include everything noted in this ticket, but it does break down to individual sentences, and includes a consistent pause between each sentence (if you are using XTTS).
|
Please share a sample if you are able to reproduce this error with the current release. I think this issue is resolved now that we only send one sentence at a time to TTS. |
Dear aedocw,
I've been using your epub2tts script with CoquiTTS_XTTS for French language processing and encountered a couple of issues. Specifically, I frequently ran into the "character limit of 273 for language 'fr'" error and faced problems with empty data. These seemed to stem from processing text segments that were too lengthy for the XTTS system.
To address these, I experimented with two main modifications:
1. Modification of the `combine_sentences` function in `epub2tts.py`: Rather than combining sentences into longer segments, I tweaked this function to yield each sentence individually. This approach helps in managing the character limit more effectively. Here's the adjusted function:
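A minimal sketch of the one-sentence-at-a-time behavior described here, with a hypothetical signature (the real function in epub2tts.py may differ):

```python
def combine_sentences(sentences, length=1000):
    # Hypothetical sketch: instead of accumulating sentences into chunks
    # of up to `length` characters, yield each sentence on its own so no
    # chunk can exceed the per-language XTTS character limit.
    for sentence in sentences:
        yield sentence
```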
2. Preprocessing the Text with a Custom Function:
Additionally, I crafted a separate function for text preparation. This function employs regular expressions to split the text into sentences and then shortens each sentence to fit within the character limit. It also handles the replacement of certain characters for text cleanup.
Functionality Explanation:
split_sentences function: This splits a given text into sentences using a regular expression. It looks for punctuation marks like `.`, `?`, or `!` and ensures that these marks are not preceded by a capital letter (to avoid splitting at abbreviations).
shorten_sentences function: It shortens sentences to a specified maximum length. If a sentence is longer than the maximum length, it looks for a suitable point to split the sentence, preferably at a comma or semicolon, or else at a space. Each new sentence is ended with a period.
replace_characters function: This replaces specified characters in the text. It's useful for cleaning up the text or ensuring consistency in formatting.
save_to_file function: This function saves the modified sentences to a new file. The new file's name includes a timestamp for easy identification. It prints part of each sentence as it's saved to provide a progress update.
split_and_save_text_v9 function: This is the main function that orchestrates the process. It reads the text from a file, splits the text into sentences, shortens the sentences if necessary, replaces certain characters, and then saves the modified sentences to a new file. The maximum sentence length can be specified, with a default value of 300 characters.
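Of the helpers described above, replace_characters is the only one without a counterpart in the script posted earlier in this thread; a minimal sketch of such a cleanup step (the substitution table is an assumption, chosen with French text in mind):

```python
def replace_characters(sentences):
    # Assumed substitutions: guillemets, ellipses, non-breaking spaces.
    replacements = {"«": '"', "»": '"', "…": "...", "\u00a0": " "}
    cleaned = []
    for sentence in sentences:
        for old, new in replacements.items():
            sentence = sentence.replace(old, new)
        cleaned.append(sentence)
    return cleaned
```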
This preprocessing ensures that each sentence fed into the `combine_sentences` function conforms to the character limits imposed by CoquiTTS_XTTS, greatly reducing errors and enhancing the text-to-speech process for French.

While my solution is not perfect and can be considered a makeshift "bricolage," I wanted to share it with you. I believe you might find a much better solution, and I am eager to see how this can be further improved.
Best regards,