
Unicode.replace_invalid(body, :utf8) on some content seems to get stuck forever. #10

Closed
wkirschbaum opened this issue Nov 8, 2023 · 12 comments

Comments

@wkirschbaum

I will add more detail as I discover more, but logging this in the meantime.

After moving from UniRecover.sub(body) (https://github.com/Moosieus/UniRecover) to Unicode.replace_invalid(body, :utf8), we realised that our processes get stuck for hours in .replace_invalid without using much memory or CPU. This happens for content larger than 30 MB read from a file. Unfortunately the files contain sensitive information, so I can't share them.

@kipcole9
Collaborator

kipcole9 commented Nov 8, 2023

Wow, that's definitely unexpected. The implementation from @Moosieus is really quick and efficient, and the implementation in Unicode is a direct copy of that module. So something very weird is going on for sure.

To help guide debugging can you let me know:

  1. Your use case works with UniRecover but not Unicode?
  2. The issue is with all examples, or just with "large" binaries?
  3. Do you know if your failing example is actually valid UTF8, or known to be invalid - or do you not know whether it's valid?

@kipcole9
Collaborator

kipcole9 commented Nov 8, 2023

@Moosieus I ran a quick test with a 1.2MB text file. It runs quickly - about 8ms - on my machine with UniRecover.sub/1. The version we have in Unicode.replace_invalid/2 didn't complete in several minutes, so I killed it. Do you think you could take a look and see where we have diverged?

iex> text = File.read! "/Users/kip/Desktop/moby_dick.txt"
"The Project Gutenberg eBook of Moby Dick; Or, The Whale\n\nThis ebook is for the use of anyone anywhere in the United States and\nmost other parts of the world at no cost and with almost no restrictions\nwhatsoever. You may copy it, give it away or re-use it under the terms\nof the Project Gutenberg License" <> ...
iex> String.length(text)
1238159
iex> :timer.tc fn -> UniRecover.sub(text) end
{8238,
 "The Project Gutenberg eBook of Moby Dick; Or, The Whale\n\nThis ebook is for the use of anyone anywhere in the United States and\nmost other parts of the world at no cost and with almost no restrictions\nwhatsoever. You may copy it, give it away or re-use it under" <> ...

The content comes from https://www.gutenberg.org/cache/epub/2701/pg2701.txt

@wkirschbaum
Author

Sorry for the late reply, I had to jump.

> Wow, that's definitely unexpected. The implementation from @Moosieus is really quick and efficient. And the implementation in Unicode is a direct copy of that module. So there is something very weird going on for sure.
>
> To help guide debugging can you let me know:
>
> 1. Your use case works with UniRecover but not Unicode?

Yes, it works with UniRecover; we rolled back the change and it's working as expected again.

> 2. The issue is with all examples, or just with "large" binaries?

We only have relatively large files to process and are seeing it for all of them, but in iex it works as expected.

> 3. Do you know if your failing example is actually valid UTF8, or known to be invalid - or you don't know if it's valid or not?

The one I tested had valid UTF8, but I haven't seen any of them finish processing, with or without valid UTF8.

I see you can reproduce it :). Thanks for looking into it even with my rushed report.

@Moosieus
Collaborator

Moosieus commented Nov 9, 2023

Damnedest thing... The same bits of code (such as destructuring <<n_lead::2, rest::bits>> and passing it back) result in heap allocations here instead of the stack allocations they produce in UniRecover.

Outsized memory allocation appears to be the issue, which is exactly what I sought to avoid, even at the cost of some speed.

Actively sorting this out, I'll let you know what I find. I see the worst case as falling back to the original fugly-but-only-inefficient-when-there's-mostly-errors implementation.
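For illustration, here is a minimal sketch (not the library's actual code; the module name is hypothetical) of the recursion shape under discussion. Whether `rest` stays a cheap match context or becomes a freshly heap-allocated sub-binary on each step depends on how the compiler optimizes this pattern, which is the difference being debugged:

```elixir
defmodule BitWalk do
  # Walk a bitstring two bits at a time, destructuring the lead bits
  # and recursing on the rest - the same shape as the code at issue.
  # When the compiler keeps `rest` as a match context, this runs with
  # no per-step binary allocation; otherwise each step allocates a
  # sub-binary on the heap.
  def count(<<_lead::2, rest::bits>>, acc), do: count(rest, acc + 1)
  def count(<<>>, acc), do: acc
end

# 16 bits consumed 2 at a time -> 8 steps.
BitWalk.count(<<255, 255>>, 0)
```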

@Moosieus
Collaborator

Moosieus commented Nov 9, 2023

https://github.com/elixir-unicode/unicode/tree/validation-perf
👆 This branch has performance more in line with what I originally observed and sought.

Calls to ii_of_iii/2, ii_of_iv/2, and iii_of_iv/2 aren't being optimized though, resulting in an outsized number of allocations. That said, replacing them with acc <> rep <> rep, acc <> rep <> rep, and acc <> rep <> rep <> rep respectively achieves a consistent 128 bytes allocated for any size input (what I originally intended).

Once that's sorted, we're golden.
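As a sketch of the inlining described above (the module and function names are illustrative, not the branch's actual code), assuming `rep` is the replacement string:

```elixir
defmodule ScrubSketch do
  @rep "\uFFFD"

  # Illustrative only: instead of routing through a helper such as
  # ii_of_iii/2, append the replacement characters directly onto the
  # accumulator. This lets the compiler use its optimized in-place
  # binary append rather than building intermediate binaries.
  def two_of_three(acc), do: acc <> @rep <> @rep
  def three_of_four(acc), do: acc <> @rep <> @rep <> @rep
end

ScrubSketch.two_of_three("ab")
# "ab" followed by two replacement characters
```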

@wkirschbaum
Author

With this branch I can run the function on a ~52 MB file in under 2 seconds, which looks to be closer to the uni_recover package.

@kipcole9
Collaborator

kipcole9 commented Nov 9, 2023

Good stuff @Moosieus, let me know when you're ready and I'll merge and publish to hex.

@kipcole9
Collaborator

kipcole9 commented Nov 9, 2023

I've never found a good way to do release-to-release performance regression testing. Any ideas along those lines would be welcome too.

@Moosieus
Collaborator

Moosieus commented Nov 9, 2023

Pushed another update: turned ii_of_iii/2 et al. into bitwise guards. Getting 128 bytes allocated in benchee for any arbitrarily sized input, as intended.

I'll look into performance regression testing next. I think ensuring allocations stay at 128 bytes is an achievable target.
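On regression testing, one dependency-free sketch (rough and hedged; benchee's memory measurement is the more precise tool) is to run the call in a fresh process and diff the process memory around it:

```elixir
# Rough allocation check without benchee: spawn a fresh process so
# only the measured call's allocations are counted, and compare
# :erlang.process_info(self(), :memory) before and after the call.
measure = fn fun ->
  parent = self()

  spawn(fn ->
    {:memory, before} = :erlang.process_info(self(), :memory)
    _result = fun.()
    {:memory, after_mem} = :erlang.process_info(self(), :memory)
    send(parent, after_mem - before)
  end)

  receive do
    delta -> delta
  end
end

# Example: measure building a small binary.
measure.(fn -> :binary.copy("a", 1_000) end)
```

A CI check could then assert the delta stays under a fixed budget across releases.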

@kipcole9
Collaborator

kipcole9 commented Nov 9, 2023

What do you think about making Unicode.replace_invalid(string, :utf8, replacement) delegate to String.replace_invalid/2 on Elixir 1.16 and later?

@Moosieus
Collaborator

Moosieus commented Nov 9, 2023

I'd be down for that 👍

@kipcole9
Collaborator

kipcole9 commented Nov 9, 2023

Cool. I guess String.replace_invalid/2 will land in the next Elixir 1.16 RC. Anyway, I can conditionally check for it now, so I'll do that.
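A sketch of that conditional check (module name and fallback are illustrative placeholders, not the library's actual code): on Elixir 1.16+, String.replace_invalid/2 exists, so delegate to it; otherwise compile in a local implementation.

```elixir
defmodule DelegationSketch do
  # The check runs at compile time, so only one branch is compiled in.
  if Code.ensure_loaded?(String) and
       function_exported?(String, :replace_invalid, 2) do
    def replace_invalid(string, :utf8, replacement \\ "\uFFFD") do
      String.replace_invalid(string, replacement)
    end
  else
    def replace_invalid(string, :utf8, replacement \\ "\uFFFD") do
      local_replace_invalid(string, replacement)
    end

    # Placeholder for the library's own scrubbing implementation;
    # here it just passes the input through.
    defp local_replace_invalid(string, _replacement), do: string
  end
end

# Valid input passes through unchanged on either branch.
DelegationSketch.replace_invalid("abc", :utf8)
```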
