In-memory streaming conversions and other performance improvements #62

pgaskin · 2021-06-07T15:33:28Z

Benefits of in-memory streaming conversions:

* Faster conversions.
* Safer/more secure conversions (no disk extraction).
* Possibility to compile to WASM.
* Simpler code.
* More flexible if using io/fs.

Other potential performance improvements:

* Go 1.17 archive/zip raw/copy functions.
* State-machine (potential zero-allocation) instead of regexp for sentence splitting.
* Pre-allocated buffers where possible.

* Use the OPF package document to find HTML files (fixes #55).
* Refactor content/files/opf transformation code.
* Rewrite conversion code to allow converting directly from an EPUB zip.Reader or fs.FS into an output zipped KEPUB io.Writer (closes #62).
  * This simplifies the code for kepub conversion.
    * Input and transformation code is unified.
    * Useless function to convert a directory to a KEPUB in-place has been eliminated.
  * Arbitrary virtual file-systems can now be used as input.
    * This makes the code easier to test.
    * This resolves security concerns with extracting untrusted EPUBs directly into a temp folder with limited path sanitization, which is important when embedding kepubify into server-side software.
    * This allows kepubify to easily be compiled and used as a WebAssembly library.
  * This allows us to greatly reduce the amount of IO required when converting books with a large amount of media or fonts by directly piping it into the output file. On Go 1.17+, directly copying the untransformed compressed files as-is also significantly reduces CPU time, and more of the time is spent doing content transformation in parallel rather than waiting for unchanged files to compress.
  * The slightly increased memory cost (~8-12%) is negligible compared to the performance gains mentioned in the previous point and the reduced time waiting for disk IO (especially on HDDs).
* Make use of Go 1.16's io/fs for more flexible code and tests.
* Remove cascadia dependency.
  * We don't need full selector parsing or specificity.
  * Depending on cascadia complicates replacing x/net/html.
  * Doing things manually is slightly more efficient, and almost as concise.
* Reduce exposed functions for kepub library.
  * They weren't really used.
  * Removing them increases flexibility for future improvements.
* Use a less obtrusive hack for giving kobotest access to the un-exported kepub functions.
* Make documentation for transformations more detailed.

* Converted files should be identical to before this change, except for:
  * Whitespace changes in content.opf.
  * Improved MathML/SVG tag filtering in content files (some instances which would have previously been incorrectly modified are now left as-is).
  * Content files not listed in the package document are now left as-is.
  * Content files with nonstandard extensions, but listed in the package document, should now be converted correctly.
* Performance should be equal or better than before this change on Go 1.16, and significantly faster for books with many non-content files on Go 1.17. On slow storage, kepubify should also be much faster.

pgaskin added the enhancement label Jun 7, 2021

pgaskin self-assigned this Jun 7, 2021

pgaskin changed the title ~~In-memory streaming conversions~~ In-memory streaming conversions and other performance improvements Jun 13, 2021

pgaskin closed this as completed in 948788e Jul 2, 2021
