Unicode.replace_invalid(body, :utf8) on some content seems to get stuck forever. #10
Wow, that's definitely unexpected. The implementation from @Moosieus is really quick and efficient, and the implementation in … To help guide debugging, can you let me know:
@Moosieus I ran a quick test with a 1.2 MB text file. It runs quickly, about 8 ms, on my machine:

```elixir
iex> text = File.read! "/Users/kip/Desktop/moby_dick.txt"
"The Project Gutenberg eBook of Moby Dick; Or, The Whale\n\nThis ebook is for the use of anyone anywhere in the United States and\nmost other parts of the world at no cost and with almost no restrictions\nwhatsoever. You may copy it, give it away or re-use it under the terms\nof the Project Gutenberg License" <> ...

iex> String.length(text)
1238159

iex> :timer.tc fn -> UniRecover.sub(text) end
{8238,
 "The Project Gutenberg eBook of Moby Dick; Or, The Whale\n\nThis ebook is for the use of anyone anywhere in the United States and\nmost other parts of the world at no cost and with almost no restrictions\nwhatsoever. You may copy it, give it away or re-use it under" <> ...
```

The content comes from https://www.gutenberg.org/cache/epub/2701/pg2701.txt
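For anyone wanting to reproduce the comparison, a sketch along these lines times the two implementations side by side. This assumes both the `unicode` and `uni_recover` packages are installed; the file path is a placeholder for any large sample file:

```elixir
# Hypothetical comparison script, not part of either library.
path = "moby_dick.txt"
text = File.read!(path)

# :timer.tc/1 returns {microseconds, result}.
{us_unicode, _} = :timer.tc(fn -> Unicode.replace_invalid(text, :utf8) end)
{us_recover, _} = :timer.tc(fn -> UniRecover.sub(text) end)

IO.puts("Unicode.replace_invalid/2: #{us_unicode} µs")
IO.puts("UniRecover.sub/1: #{us_recover} µs")
```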
Sorry for the late reply, I had to jump.
Yes, it works with UniRecover; we rolled back the change and it's working as expected again.
We only have relatively large files to process and are seeing it for all of them, but in iex it works as expected.
The one I tested had valid UTF-8, but I have not seen any being processed, with or without valid UTF-8. I see you can reproduce it :). Thanks for looking into it even with my rushed report.
Damnedest thing... The same bits of code (such as de-structuring …) … Outsized memory allocation appears to be the issue, which I specifically sought to avoid even if it traded some speed. Actively sorting this out; I'll let you know what I find. I see the worst case as falling back on the original fugly-but-only-inefficient-when-there's-mostly-errors implementation.
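To illustrate the kind of allocation trade-off being discussed (this is an assumption about the general technique, not the actual patch in either library): accumulating output by repeatedly appending to a binary can trigger extra copying when the BEAM's binary-append optimization doesn't apply, whereas collecting an iolist and flattening once at the end keeps intermediate allocations small. `ReplaceSketch` is a hypothetical module written for this comment:

```elixir
defmodule ReplaceSketch do
  # U+FFFD, the conventional replacement character.
  @replacement "�"

  # Naive variant: appends to a growing binary on every step.
  def sub_binary(s), do: do_binary(s, <<>>)
  defp do_binary(<<cp::utf8, rest::binary>>, acc), do: do_binary(rest, acc <> <<cp::utf8>>)
  defp do_binary(<<_byte, rest::binary>>, acc), do: do_binary(rest, acc <> @replacement)
  defp do_binary(<<>>, acc), do: acc

  # Iolist variant: constant-size list cells, one final flatten.
  def sub_iolist(s) do
    do_iolist(s, []) |> Enum.reverse() |> IO.iodata_to_binary()
  end

  defp do_iolist(<<cp::utf8, rest::binary>>, acc), do: do_iolist(rest, [<<cp::utf8>> | acc])
  defp do_iolist(<<_byte, rest::binary>>, acc), do: do_iolist(rest, [@replacement | acc])
  defp do_iolist(<<>>, acc), do: acc
end
```

Both clauses rely on the `::utf8` segment type failing to match on an invalid byte, so the fallback clause consumes one byte and substitutes the replacement character.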
https://github.com/elixir-unicode/unicode/tree/validation-perf Calls to … Once that's sorted, we're golden.
With this branch I can run the function on a ~52 MB file in under 2 seconds, which looks to be closer to the uni_recover package.
Good stuff @Moosieus, let me know when you're ready and I'll merge and publish to hex.
I've never found a good way to do release-to-release performance regression testing. Any ideas along those lines would be welcome too.
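One common approach in the Elixir ecosystem (an assumption on my part, not something the project currently uses) is Benchee's save/load feature: save a tagged baseline run from one release, then load it in later runs so Benchee prints a comparison. The scenario input and file paths here are placeholders:

```elixir
# Hypothetical benchmark script; assumes the Benchee package is a dev dependency.
# A synthetic input with many invalid bytes, repeated to a meaningful size.
invalid = :binary.copy(<<"abc", 0xFF, "def">>, 100_000)

Benchee.run(
  %{"replace_invalid" => fn -> Unicode.replace_invalid(invalid, :utf8) end},
  time: 5,
  # memory_time also measures per-invocation memory, relevant to the
  # allocation concerns in this thread.
  memory_time: 2,
  # Tag and save this run, and compare against any previously saved runs.
  save: [path: "bench/runs.benchee", tag: "v1.0"],
  load: "bench/*.benchee"
)
```

Running the same script after each release (with a new `tag:`) gives a release-to-release diff of both runtime and memory.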
Pushed another update; turned … I'll look into performance regression testing next. I think ensuring allocations stay 128 bytes is an achievable target.
What do you think about making …?
I'd be down for that 👍 |
Cool. I guess …
I will add more detail as I discover it, but logging this in the meantime.
After moving from UniRecover.sub(body) (https://github.com/Moosieus/UniRecover) to Unicode.replace_invalid(body, :utf8), we realised that our processes get stuck for hours on .replace_invalid without using much memory or CPU. This is for contents larger than ~30 MB read from a file. Unfortunately the files have sensitive information, so I can't share them.
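Since I can't share the files, here is the shape of the call that hangs, as a minimal sketch (the path and size are placeholders; any file over roughly 30 MB read in one go should match our setup):

```elixir
# Hypothetical reproduction script for the hang described above.
path = "large_file.txt"
body = File.read!(path)

IO.puts("input: #{byte_size(body)} bytes, valid? #{String.valid?(body)}")

# On our systems this call never returns for large inputs,
# while consuming little memory or CPU.
{micros, _result} = :timer.tc(fn -> Unicode.replace_invalid(body, :utf8) end)
IO.puts("replace_invalid took #{micros} µs")
```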