
Unicode.replace_invalid(body, :utf8) on some content seems to get stuck forever. #10

Closed
wkirschbaum opened this issue Nov 8, 2023 · 12 comments

Comments

@wkirschbaum

I will add more detail as I discover more, but logging this in the meantime.

After moving from UniRecover.sub(body) (https://github.com/Moosieus/UniRecover) to Unicode.replace_invalid(body, :utf8), we realised that our processes get stuck for hours in .replace_invalid without using much memory or CPU. This happens for content larger than 30 MB read from a file. Unfortunately the files contain sensitive information, so I can't share them.

@kipcole9
Collaborator

kipcole9 commented Nov 8, 2023

Wow, that's definitely unexpected. The implementation from @Moosieus is really quick and efficient, and the implementation in Unicode is a direct copy of that module. So something very weird is going on for sure.

To help guide debugging can you let me know:

  1. Your use case works with UniRecover but not Unicode?
  2. The issue is with all examples, or just with "large" binaries?
  3. Do you know if your failing example is actually valid UTF8, or known to be invalid - or do you not know whether it's valid?

@kipcole9
Collaborator

kipcole9 commented Nov 8, 2023

@Moosieus I ran a quick test with a 1.2MB text file. It runs quickly - about 8ms - on my machine with UniRecover.sub/1. The version we have in Unicode.replace_invalid/2 didn't complete in several minutes, so I killed it. Do you think you could take a look and see where we have diverged?

iex> text = File.read! "/Users/kip/Desktop/moby_dick.txt"
"The Project Gutenberg eBook of Moby Dick; Or, The Whale\n\nThis ebook is for the use of anyone anywhere in the United States and\nmost other parts of the world at no cost and with almost no restrictions\nwhatsoever. You may copy it, give it away or re-use it under the terms\nof the Project Gutenberg License" <> ...
iex> String.length(text)
1238159
iex> :timer.tc fn -> UniRecover.sub(text) end
{8238,
 "The Project Gutenberg eBook of Moby Dick; Or, The Whale\n\nThis ebook is for the use of anyone anywhere in the United States and\nmost other parts of the world at no cost and with almost no restrictions\nwhatsoever. You may copy it, give it away or re-use it under" <> ...

The content comes from https://www.gutenberg.org/cache/epub/2701/pg2701.txt

@wkirschbaum
Author

Sorry for the late reply, I had to jump.

> Wow, that's definitely unexpected. The implementation from @Moosieus is really quick and efficient. And the implementation in Unicode is a direct copy of that module. So there is something very weird going on for sure.
>
> To help guide debugging can you let me know:
>
> 1. Your use case works with UniRecover but not Unicode?

Yes, it works with UniRecover; we rolled back the change and it's working as expected again.

> 2. The issue is with all examples, or just with "large" binaries?

We only have relatively large files to process and are seeing it for all of them, but in iex it works as expected.

> 3. Do you know if your failing example is actually valid UTF8, or known to be invalid - or you don't know if it's valid or not?

The one I tested had valid UTF8, but I haven't seen any of them finish processing, with or without valid UTF8.

I see you can reproduce it :). Thanks for looking into it even with my rushed report.

@Moosieus
Collaborator

Moosieus commented Nov 9, 2023

Damnedest thing... The same bits of code (such as destructuring <<n_lead::2, rest::bits>> and passing it back) result in heap allocations here instead of the stack allocations they produce in UniRecover.

Outsized memory allocation appears to be the issue, which is exactly what I sought to avoid, even at the cost of some speed.

Actively sorting this out, I'll let you know what I find. I see the worst case as falling back to the original fugly-but-only-inefficient-when-there's-mostly-errors implementation.
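For illustration, here is a minimal sketch (not the library's actual code; the module name is hypothetical) of the recursion shape under discussion. Whether `rest` stays a cheap match context or becomes a freshly heap-allocated sub-binary on each step depends on how the compiler optimizes this pattern, which is the difference being debugged:

```elixir
defmodule BitWalk do
  # Walk a bitstring two bits at a time, destructuring the lead bits
  # and recursing on the rest - the same shape as the code at issue.
  # When the compiler keeps `rest` as a match context, this runs with
  # no per-step binary allocation; otherwise each step allocates a
  # sub-binary on the heap.
  def count(<<_lead::2, rest::bits>>, acc), do: count(rest, acc + 1)
  def count(<<>>, acc), do: acc
end

# 16 bits consumed 2 at a time -> 8 steps.
BitWalk.count(<<255, 255>>, 0)
```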

@Moosieus
Collaborator

Moosieus commented Nov 9, 2023

https://github.com/elixir-unicode/unicode/tree/validation-perf
👆 This branch has performance more in line with what I originally observed and sought.

Calls to ii_of_iii/2, ii_of_iv/2, and iii_of_iv/2 aren't being optimized though, resulting in an outsized number of allocations. That said, replacing them with acc <> rep <> rep, acc <> rep <> rep, and acc <> rep <> rep <> rep respectively achieves a consistent 128 bytes allocated for any size input (what I originally intended).

Once that's sorted, we're golden.
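As a sketch of the inlining described above (the module and function names are illustrative, not the branch's actual code), assuming `rep` is the replacement string:

```elixir
defmodule ScrubSketch do
  @rep "\uFFFD"

  # Illustrative only: instead of routing through a helper such as
  # ii_of_iii/2, append the replacement characters directly onto the
  # accumulator. This lets the compiler use its optimized in-place
  # binary append rather than building intermediate binaries.
  def two_of_three(acc), do: acc <> @rep <> @rep
  def three_of_four(acc), do: acc <> @rep <> @rep <> @rep
end

ScrubSketch.two_of_three("ab")
# "ab" followed by two replacement characters
```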

@wkirschbaum
Author

With this branch I can run the function on a ~52 MB file in under 2 seconds, which looks to be closer to the uni_recover package.

@kipcole9
Collaborator

kipcole9 commented Nov 9, 2023

Good stuff @Moosieus, let me know when you're ready and I'll merge and publish to hex.

@kipcole9
Collaborator

kipcole9 commented Nov 9, 2023

I've never found a good way to do release-to-release performance regression testing. Any ideas along those lines would be welcome too.

@Moosieus
Collaborator

Moosieus commented Nov 9, 2023

Pushed another update: turned ii_of_iii/2 et al. into bitwise guards. Getting 128 bytes allocated in benchee for any arbitrarily sized input, as intended.

I'll look into performance regression testing next. I think ensuring allocations stay at 128 bytes is an achievable target.
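On regression testing, one dependency-free sketch (rough and hedged; benchee's memory measurement is the more precise tool) is to run the call in a fresh process and diff the process memory around it:

```elixir
# Rough allocation check without benchee: spawn a fresh process so
# only the measured call's allocations are counted, and compare
# :erlang.process_info(self(), :memory) before and after the call.
measure = fn fun ->
  parent = self()

  spawn(fn ->
    {:memory, before} = :erlang.process_info(self(), :memory)
    _result = fun.()
    {:memory, after_mem} = :erlang.process_info(self(), :memory)
    send(parent, after_mem - before)
  end)

  receive do
    delta -> delta
  end
end

# Example: measure building a small binary.
measure.(fn -> :binary.copy("a", 1_000) end)
```

A CI check could then assert the delta stays under a fixed budget across releases.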

@kipcole9
Collaborator

kipcole9 commented Nov 9, 2023

What do you think about making Unicode.replace_invalid(string, :utf8, replacement) delegate to String.replace_invalid/2 on Elixir 1.16 and later?

@Moosieus
Collaborator

Moosieus commented Nov 9, 2023

I'd be down for that 👍

@kipcole9
Collaborator

kipcole9 commented Nov 9, 2023

Cool. I guess String.replace_invalid/2 will land in the next Elixir 1.16 RC. Anyway, I can conditionally check for it now, so I'll do that.
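A sketch of that conditional check (module name and fallback are illustrative placeholders, not the library's actual code): on Elixir 1.16+, String.replace_invalid/2 exists, so delegate to it; otherwise compile in a local implementation.

```elixir
defmodule DelegationSketch do
  # The check runs at compile time, so only one branch is compiled in.
  if Code.ensure_loaded?(String) and
       function_exported?(String, :replace_invalid, 2) do
    def replace_invalid(string, :utf8, replacement \\ "\uFFFD") do
      String.replace_invalid(string, replacement)
    end
  else
    def replace_invalid(string, :utf8, replacement \\ "\uFFFD") do
      local_replace_invalid(string, replacement)
    end

    # Placeholder for the library's own scrubbing implementation;
    # here it just passes the input through.
    defp local_replace_invalid(string, _replacement), do: string
  end
end

# Valid input passes through unchanged on either branch.
DelegationSketch.replace_invalid("abc", :utf8)
```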
