Improve/expand use of text files #158

aedocw · 2024-01-04T20:16:47Z

aedocw
Jan 4, 2024
Maintainer

Starting from #153, I thought it would be good to bring the conversation over here.

Your comments got me thinking and I am in agreement with you that for some use-cases, using text has big advantages. It obviously makes it easy to adjust any text or language, and could potentially even be useful for putting in phonetic spelling for words that TTS has a hard time pronouncing.

My inclination is to add an "export" flag that takes an epub and writes it out to a text file. This would be really easy to implement. Along the way it could insert a header at each chapter break that epub2tts can watch for later when creating an audiobook from the text file. As suggested in the linked issue, something like "# CHAPTER" would work, though I might be even more explicit (###CHAPTER### for instance).

The workflow could look something like:
epub2tts mybook.epub --export mybook.txt
Make any edits/adjustments to mybook.txt
epub2tts mybook.txt --sayparts

The more I think about this approach, the more I realize this would probably become my preferred approach too, since it will be very easy to validate it's reading just what I want (properly skipping footnotes or links for instance, etc).
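
To make the chapter-marker idea concrete, here is a minimal sketch of how an exported text file could be split back into chapters. It is not epub2tts code: the `split_chapters` function is hypothetical, and it assumes the marker is `###CHAPTER###` on a line of its own, which is only one of the options floated above.

```python
CHAPTER_MARKER = "###CHAPTER###"  # assumption: one of the markers suggested above


def split_chapters(text: str, marker: str = CHAPTER_MARKER) -> list[str]:
    """Split exported text into chapters at lines that contain only the marker."""
    chapters: list[str] = []
    current: list[str] = []
    for line in text.splitlines():
        if line.strip() == marker:
            # A marker closes whatever was collected so far and starts a new chapter.
            if any(l.strip() for l in current):
                chapters.append("\n".join(current).strip())
            current = []
        else:
            current.append(line)
    # Keep the text after the last marker (or the whole file if there is no marker).
    if any(l.strip() for l in current):
        chapters.append("\n".join(current).strip())
    return chapters


sample = "###CHAPTER###\nIt was a dark night.\n\n###CHAPTER###\nMorning came.\n"
print(split_chapters(sample))
```

Text that appears before the first marker is kept as its own chunk, so front matter is not silently dropped; whether that is the desired behaviour is a design choice.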

Vodou4460 · 2024-01-05T23:20:22Z

Vodou4460
Jan 5, 2024

Your idea of adding an export function to convert Epub files is excellent. To take it a step further, I propose using the Markdown format (.md) instead of plain text (.txt) for this conversion. Markdown offers several significant advantages:

Availability of Markdown Files: There is already a large number of files available in Markdown format. Using these in your script could easily integrate into an already rich ecosystem.
Ease of Conversion and Automation: Markdown is easily convertible from many file formats, paving the way for efficient automation of the conversion process. Automating the conversion of Epub files to Markdown and to audiobook files could simplify the process for end users.
Flexibility in Managing Chapters: Markdown makes it easier to identify chapter heads and sub-chapters, which would be particularly useful for creating audiobooks. Another interesting possibility is the ability to have chapter titles read in a different voice, making it easier to identify changes in chapters or sub-chapters during listening.

Additionally, it might be interesting to offer a choice of speaker (possibly random) from a list of available voices to read different chapters or layout styles, such as quotes. Markdown would greatly facilitate this, since it allows the insertion of specific tags.

I also suggest adding an option to insert a line break after each sentence in the Markdown file, to make sentence length easier to see. This behaviour would be optional and would depend on the system's ability to check the length of sentences.
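
As a rough illustration of the one-sentence-per-line option, the sketch below splits a paragraph after `.`, `!` or `?` followed by whitespace. Both function names are hypothetical, and the regex is deliberately naive: it will split wrongly after abbreviations such as "Dr." and does not know language-specific punctuation.

```python
import re


def one_sentence_per_line(paragraph: str) -> str:
    """Put each sentence of a paragraph on its own line (naive splitter)."""
    sentences = re.split(r"(?<=[.!?])\s+", paragraph.strip())
    return "\n".join(s for s in sentences if s)


def too_long(text: str, max_length: int) -> list[str]:
    """Return the lines that exceed max_length characters."""
    return [line for line in text.splitlines() if len(line) > max_length]


print(one_sentence_per_line("Hello there. How are you? Fine!"))
```

In Markdown a single newline inside a paragraph does not start a new paragraph, so the file still renders the same way after this change.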

The workflow could be as follows:

Convert an Epub to a Markdown file with the --export option:
```
epub2tts mybook.epub --export mybook.md
```
Make edits/adjustments to mybook.md.
Convert mybook.md into an audiobook with the --sayparts option:
```
epub2tts mybook.md --sayparts
```

This approach would offer not only a richer and more personalized listening experience but also the flexibility to read the file as a simple Markdown document.
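
To show why Markdown makes chapter heads easy to pick out, here is a small sketch that tags each non-empty line as a heading, a quote, or body text. The `classify_markdown` function and the three kind names are assumptions made for illustration; a real implementation would more likely use a proper Markdown parser, and this one ignores fenced code blocks, lists, and inline markup.

```python
import re

# ATX headings: one to six '#' characters, a space, then the title.
HEADING_RE = re.compile(r"^(#{1,6})\s+(.*\S)\s*$")


def classify_markdown(text: str) -> list[tuple[str, str]]:
    """Return (kind, text) pairs where kind is 'heading', 'quote' or 'body'."""
    segments: list[tuple[str, str]] = []
    for line in text.splitlines():
        stripped = line.strip()
        if not stripped:
            continue  # blank lines only separate blocks
        match = HEADING_RE.match(stripped)
        if match:
            segments.append(("heading", match.group(2)))
        elif stripped.startswith(">"):
            segments.append(("quote", stripped.lstrip(">").strip()))
        else:
            segments.append(("body", stripped))
    return segments


print(classify_markdown("# Chapter One\n\nIt began.\n> A quote\n"))
```

Each kind could then be routed to a different voice, which is the chapter-title idea described above.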

I am convinced that adopting the Markdown format and these improvements would increase the value and utility of your script. I am very enthusiastic about contributing to its development.

Best regards,


Vodou4460 · 2024-01-05T23:42:45Z

Vodou4460
Jan 5, 2024

Continuing our reflection on the project, I would like to discuss the possibility of automating the reading of various types of documents, such as emails, HTML files, Epubs, and PDFs, using our script. The idea would be to leverage the many converters already available to transform these documents into Markdown files, and then use our script to convert them into audio.

For instance, for an email, the script could first convert the email content into Markdown, and then the main script could transform this Markdown into an audio file. Similarly, for HTML, Epub, or PDF files, preliminary conversion scripts could be used to convert them into Markdown before passing them to our audio tool.

I believe an effective approach would be to focus on automating the conversion of Markdown files into audio files, using specific templates. This method could simplify and standardize the process, while offering great flexibility for customizing the listening experience.

In my opinion, it's not necessary for us to focus on implementing a feature that converts other types of documents into Markdown format, as there are already many converters capable of doing this. Our main goal could be to perfect the process of converting Markdown into audio, which would add significant value to our tool.

The templates could be designed to handle different styles of Markdown documents, automatically adjusting elements such as pauses, intonations, and voice selection for different chapters or sections. This would allow for a richer and more nuanced audio production, tailored to the specifics of each document.
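
One possible shape for such a template is a plain mapping from element type to voice and pause. Everything in this sketch is hypothetical: the `Style` class, the voice names, and the pause values are placeholders, not settings that exist in the tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Style:
    voice: str     # hypothetical voice name
    pause_ms: int  # silence to add after the segment


# Assumed template: one style per kind of Markdown element.
TEMPLATE = {
    "heading": Style(voice="narrator_b", pause_ms=1200),
    "quote": Style(voice="narrator_c", pause_ms=600),
    "body": Style(voice="narrator_a", pause_ms=300),
}


def plan(segments, template=TEMPLATE):
    """Pair each (kind, text) segment with a style, falling back to 'body'."""
    default = template["body"]
    return [(text, template.get(kind, default)) for kind, text in segments]


for text, style in plan([("heading", "Chapter One"), ("body", "It began.")]):
    print(style.voice, style.pause_ms, text)
```

Keeping the template as data rather than code would let users swap templates per document style without touching the conversion logic.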

I think this focus on automating Markdown-to-audio templates would be an important step towards creating a powerful and versatile tool for the production of audiobooks and podcasts.

I am curious to hear your thoughts on this approach and look forward to the possibility of collaborating on its development.


aedocw · 2024-01-05T23:44:32Z

aedocw
Jan 5, 2024
Maintainer Author

Love the idea of adding markdown support, it would definitely open this up to more use-cases.

I also like the idea of using a different voice for chapter titles. I was planning to start experimenting with switching voices, though XTTS is not really well suited to this: loading the model is resource intensive, and to use a different voice you would have to switch models. It's possible to load more than one as individual objects, but I don't know how much RAM that ends up consuming (thus the experimentation). I'm going to start a new discussion about making audiobooks with multiple voices by using an LLM to look at the whole book, figure out how many different characters there are, and try to use different voices for narrator and characters. Though making that work would be a long way off :)

I'm going to take a quick stab at making a TXT export, and will share a link to that branch as soon as I do it. I'd be very happy to have you contribute the markdown export bits, as well as support for using markdown as the source file.


Vodou4460 · 2024-01-08T09:40:04Z

Vodou4460
Jan 8, 2024

I am delighted with the positive reception of the idea to add Markdown support.

I've been considering various options to optimize our text-to-speech conversion process. One idea is to split the text into lines while respecting a maximum character count per line. This segmentation would allow us to create a code or identifier for each line, thereby facilitating the allocation of specific lines to particular voice models. For instance, one model could handle titles, another could deal with quotes, and so on, with the ability to load only one model at a time.

However, I see a potential drawback: we may need to modify the text post-production, for example to correct a misinterpretation by the TTS system. It would be practical to be able to modify individual sentences and have them reprocessed by the appropriate model. A solution could be to retain the uncompiled versions of the different sentences, making modifications easier without necessitating the complete regeneration of the book.

To manage modifications of lines that might be too long, we could consider an incremental numbering system with increments of 10, similar to programming in Basic. This system allows us to easily interpose additional lines as needed, for example, when a line is too long and needs to be split. With line numbers like 10, 20, 30, etc., we can insert new lines (such as 15, 25) without having to renumber the entire document. This provides flexibility for modifications while preserving the order and synchronization with the associated voice models.
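
The numbering scheme can be sketched in a few lines. The helper names below are hypothetical; the only thing taken from the proposal is the step of 10 and the idea of slotting a new line between two existing numbers.

```python
def number_lines(lines, step=10):
    """Number lines 10, 20, 30, ... in the style of BASIC."""
    return {(i + 1) * step: line for i, line in enumerate(lines)}


def split_line(numbered, number, first, second, step=10):
    """Replace line `number` with `first` and insert `second` before the next line."""
    keys = sorted(numbered)
    idx = keys.index(number)
    next_key = keys[idx + 1] if idx + 1 < len(keys) else number + step
    new_key = (number + next_key) // 2  # midpoint between this line and the next
    if new_key == number:
        raise ValueError("no free number left between lines; renumber first")
    result = dict(numbered)
    result[number] = first
    result[new_key] = second
    return result


book = number_lines(["a", "b c", "d"])
print(sorted(split_line(book, 20, "b", "c").items()))
```

Using the midpoint leaves room for further splits on either side; once two neighbouring numbers are adjacent, the whole document has to be renumbered, just as in BASIC.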

As your code already checks whether each part exists, it would detect the parts that are good and have not been deleted, and regenerate only those that have been modified, allowing easy reprocessing of problematic phrases. It would also make it possible to split a difficult line in two without having to regenerate the entire book. I hope I am clear on this point.
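
One way to get "regenerate only what changed" from a simple existence check is to put a hash of the line's text into the file name. This is an assumption about how it could be done, not a description of how epub2tts names its parts; `part_path` and `lines_to_render` are hypothetical helpers.

```python
import hashlib
from pathlib import Path


def part_path(out_dir: Path, line_no: int, text: str) -> Path:
    """File name that changes whenever the line's text changes."""
    digest = hashlib.sha256(text.encode("utf-8")).hexdigest()[:12]
    return out_dir / f"{line_no:06d}-{digest}.wav"


def lines_to_render(numbered: dict[int, str], out_dir: Path) -> list[int]:
    """Line numbers whose audio part is missing, i.e. new, edited, or deleted."""
    return [
        n
        for n, text in sorted(numbered.items())
        if not part_path(out_dir, n, text).exists()
    ]
```

With this naming, editing a sentence makes its old audio file stale automatically, and deleting a bad audio file by hand forces that one line to be rendered again.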

Regarding the use of an LLM to analyze an entire book, I think it could be costly and time-consuming to do it online. An alternative could be to use a local LLM, although this requires resources. However, I wonder if this step is essential at the moment. For most books, a quick modification or integration of predefined tags might suffice to identify speakers in plays, for example. As for novels, using an LLM to identify protagonists might not be a major asset. Perhaps we could consider this functionality in a later phase, possibly developing a dedicated external software.

I am curious to hear your thoughts on these proposals and to discuss ways to integrate them into our project.


aedocw · 2024-01-08T20:13:48Z

aedocw
Jan 8, 2024
Maintainer Author

There's a lot to discuss in here, but one thing I just wanted to explain was what I meant with using an LLM for analysis. There is a standard markup language for TTS (SSML, quick example here).

What I would want to use the LLM for is to look at an entire novel, figure out who (say) the most frequently speaking characters are, detect tone (e.g. scared, relaxed, angry), and wrap all the text in SSML. Then you can specify a voice for the narrator and each character. I know folks at Amazon/Audible are really far down the road already with automating this, but it would be cool to do the same thing in open source.
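
For readers who have not seen SSML, this sketch shows roughly what the wrapped output could look like, using the standard `<speak>` and `<voice>` elements. The `to_ssml` function and the voice names are hypothetical, tone is left out, and nothing here implies that a given TTS engine accepts SSML as input.

```python
from xml.sax.saxutils import escape, quoteattr


def to_ssml(segments, lang="en-US"):
    """Wrap (voice_name, text) pairs in a minimal SSML document."""
    parts = [
        '<speak version="1.1" xmlns="http://www.w3.org/2001/10/synthesis" '
        f"xml:lang={quoteattr(lang)}>"
    ]
    for voice, text in segments:
        # escape() protects &, < and > so the result stays well-formed XML.
        parts.append(f"  <voice name={quoteattr(voice)}>{escape(text)}</voice>")
    parts.append("</speak>")
    return "\n".join(parts)


print(to_ssml([("narrator", "She opened the door."), ("alice", "Run & hide!")]))
```

The LLM's job in the idea above would be to produce the list of (voice, text) pairs; turning that list into markup is the easy part.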


Vodou4460 · 2024-01-08T21:30:24Z

Vodou4460
Jan 8, 2024

Thank you for sharing this information with me. I was not previously aware of this method or of this markup. I find it very interesting and potentially useful. Indeed, making it available as open source would add a unique depth and richness to audiobooks. This project also seems particularly relevant for people with visual impairments, as they rely on text-to-speech conversion. Improving the quality and diversity of these readings, while letting listeners choose the narrator, is also a social cause that I am passionate about.

In any case, I thank you for this project, and I propose to remain involved in advancing the various possible solutions.


mmol67 · 2024-01-18T13:35:06Z

mmol67
Jan 18, 2024

Hello. I've been testing and I've been getting some errors, like strange sounds, artifacts and voices in between words, and sometimes duplicated phrases or words... I think this could be because the app is exceeding the character limit (in this case 239 for Spanish).

Anyway, I think that correctly splitting the text into sentences is key to getting the most natural reading possible.

I've been looking at the code in #153 (comment) posted by @Vodou4460 and it looks very promising!

One thing I've noted is that the punctuation signs are different for every language. Say, in Spanish you have: (.) (,) (;) (:) and (¿) (?) (¡) (!), the ellipsis (...), double quotation marks ("") or («»), inverted (“”) and sometimes single (''), and you can use (-) for dialogue. Of course parentheses and brackets too. A lot, I know. In French I think they are the same, except for the opening question and exclamation marks.

Maybe you can use a specially formatted text file, as @Vodou4460 suggested, to keep the split sentences and assign some kind of code for timing, like:
time between words = 100ms -> used when a sentence is longer than max_length and you have to split at a space
time after a comma (,) = 300ms
time after a semicolon (;) or a period (.) with no new line = 700ms
time after a period (.) followed by a new line = 900ms
time for an empty new line = I don't know...

And use these timings when you collate the audio sentences...
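
A minimal sketch of that idea follows, assuming the pause values listed above and the 239-character limit mentioned for Spanish. The `split_with_pauses` function is hypothetical; the 700ms used after `?`, `!` or an unpunctuated ending is my own assumption because the list does not cover those cases, and the last chunk of a paragraph always gets the 900ms "new line" pause.

```python
import re
import textwrap

PAUSE_AFTER = {",": 300, ";": 700, ".": 700}  # values from the list above
PAUSE_SPACE_SPLIT = 100    # forced split at a space inside an over-long clause
PAUSE_PARAGRAPH_END = 900  # end of paragraph (period followed by a new line)
PAUSE_DEFAULT = 700        # assumption: '?', '!' and other endings are not in the list


def split_with_pauses(paragraph, max_length=239):
    """Return (chunk, pause_ms) pairs with every chunk at most max_length chars."""
    clauses = re.split(r"(?<=[.,;?!])\s+", paragraph.strip())
    result = []
    for clause in clauses:
        if not clause:
            continue
        # textwrap.wrap breaks at spaces and never exceeds the given width.
        pieces = textwrap.wrap(clause, width=max_length)
        for piece in pieces[:-1]:
            result.append((piece, PAUSE_SPACE_SPLIT))
        last = pieces[-1]
        result.append((last, PAUSE_AFTER.get(last[-1], PAUSE_DEFAULT)))
    if result:
        text, _ = result[-1]
        result[-1] = (text, PAUSE_PARAGRAPH_END)
    return result


print(split_with_pauses("Hola, mundo. Adiós"))
```

Splitting at every comma produces many short chunks; merging neighbouring clauses until the limit is reached would mean fewer TTS calls, at the cost of relying on the model's own pauses inside each chunk.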

