This repository has been archived by the owner on Feb 18, 2024. It is now read-only.

Improved performance of utf8 validation of large strings via simdutf8 (-40%) #426

Merged Sep 20, 2021 (7 commits)

Conversation

Dandandan
Collaborator

This is helpful for strings >= 8 bytes.

@Dandandan changed the title from "Add simdutf8 feature" to "Add simdutf8 feature to speed up utf8 validation" on Sep 19, 2021
@codecov

codecov bot commented Sep 19, 2021

Codecov Report

Merging #426 (9fdf174) into main (55ff79c) will decrease coverage by 0.02%.
The diff coverage is 71.87%.


@@            Coverage Diff             @@
##             main     #426      +/-   ##
==========================================
- Coverage   80.80%   80.78%   -0.03%     
==========================================
  Files         353      372      +19     
  Lines       22649    22651       +2     
==========================================
- Hits        18302    18299       -3     
- Misses       4347     4352       +5     
Impacted Files                            Coverage Δ
src/error.rs                              20.00% <0.00%> (-1.43%) ⬇️
src/io/csv/read/deserialize.rs            57.42% <42.85%> (ø)
src/array/specification.rs                67.74% <50.00%> (-2.26%) ⬇️
src/array/ord.rs                          64.21% <81.81%> (ø)
src/compute/cast/mod.rs                   88.67% <100.00%> (ø)
src/compute/like.rs                       47.36% <100.00%> (ø)
src/ffi/schema.rs                         66.07% <100.00%> (ø)
src/io/avro/read/deserialize.rs           81.13% <100.00%> (ø)
src/io/avro/read/util.rs                  80.43% <100.00%> (ø)
src/io/parquet/read/schema/metadata.rs    79.06% <100.00%> (ø)
... and 45 more

Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@jorgecarleitao
Owner

Looks great!

I am tempted to define a utils::from_utf8 function that branches on the feature, and to use it everywhere in the crate so that we cover all cases. I can do it in a new PR; I just wanted to check whether this makes sense to you.
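
A minimal sketch of what such a helper might look like, assuming a Cargo feature named `simdutf8` that gates an optional `simdutf8` dependency. The function name mirrors the `utils::from_utf8` proposed above; the `Option` return type and the choice of `simdutf8::basic` are assumptions made for illustration, not the crate's actual code.

```rust
/// Hypothetical crate-wide helper: validates `bytes` as UTF-8 and returns
/// the borrowed `&str`, or `None` if the bytes are not valid UTF-8.
///
/// With the (assumed) `simdutf8` feature enabled, validation goes through
/// the SIMD-accelerated implementation from the `simdutf8` crate.
#[cfg(feature = "simdutf8")]
pub fn from_utf8(bytes: &[u8]) -> Option<&str> {
    // `basic` reports only success or failure, without error position.
    simdutf8::basic::from_utf8(bytes).ok()
}

/// Fallback when the feature is disabled: the standard library validator.
#[cfg(not(feature = "simdutf8"))]
pub fn from_utf8(bytes: &[u8]) -> Option<&str> {
    std::str::from_utf8(bytes).ok()
}
```

Callers would use the same `from_utf8(bytes)` call in both configurations, so the feature switch stays in one place. If error positions are needed, `simdutf8::compat::from_utf8` might be a better fit than `basic`.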

@Dandandan
Collaborator Author

My benchmarks indicate that it's not an improvement on the existing benchmarks or on the TPC-H files.
I misread the documentation of simdutf8: it only kicks in from 64 bytes, which makes it more useful for larger texts, documents, etc.

I don't think the overhead is very high, but in that case it seems less useful for anything other than larger documents.

@Dandandan
Collaborator Author

Let me first add some examples of >= 64 bytes to the parquet benchmark. I think it makes sense to have proof that it is faster before going further.

@Dandandan
Collaborator Author

@jorgecarleitao
I benchmarked this change in 1fbe8e3, and it gives results of around -40% for reading a column containing a large string of emojis from parquet, which I think is pretty impressive. I couldn't spot any clear negative consequences on smaller texts, so it looks like a useful feature to enable in dependencies and in other places where utf8 validation is done.
The write_parquet approach is a bit hard to make memory-efficient, though, so for now I reverted the addition of the test.
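
The large-string case described above can be approximated outside of parquet with a rough timing harness. This is only a sketch under stated assumptions: the helper names `emoji_string` and `time_validation` are hypothetical, it measures raw validation rather than a parquet column read, and it is not the benchmark from 1fbe8e3.

```rust
use std::hint::black_box;
use std::time::{Duration, Instant};

/// Hypothetical helper: builds a string of `n` emoji.
/// Each "😀" is 4 bytes in UTF-8, so the result is `4 * n` bytes long.
pub fn emoji_string(n: usize) -> String {
    "😀".repeat(n)
}

/// Hypothetical helper: runs `validate` on `bytes` `iters` times and
/// returns the total elapsed time. Panics if validation ever fails.
pub fn time_validation<F>(bytes: &[u8], iters: u32, validate: F) -> Duration
where
    F: Fn(&[u8]) -> bool,
{
    let start = Instant::now();
    for _ in 0..iters {
        // `black_box` discourages the optimizer from removing the call.
        assert!(black_box(validate(black_box(bytes))));
    }
    start.elapsed()
}
```

One could pass `|b| std::str::from_utf8(b).is_ok()` and, with the `simdutf8` crate available, `|b| simdutf8::basic::from_utf8(b).is_ok()` to compare the two validators on inputs below and above 64 bytes. A proper benchmark harness would give more reliable numbers than this single-shot timing.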

@jorgecarleitao
Owner

Looks great. IMO we can make simdutf8 a non-optional dependency and use it throughout the crate (likely until it is merged into std).

@jorgecarleitao changed the title from "Add simdutf8 feature to speed up utf8 validation" to "Speed up utf8 validation via simdutf8 (-40%)" on Sep 20, 2021
@jorgecarleitao changed the title from "Speed up utf8 validation via simdutf8 (-40%)" to "Improved performance of utf8 validation of large strings via simdutf8 (-40%)" on Sep 20, 2021
@jorgecarleitao added the "enhancement" label (An improvement to an existing feature) on Sep 20, 2021
@jorgecarleitao
Owner

I've updated the title accordingly. Could you paste a summary of the benches into the description, so that anyone who visits this PR can see the results at a glance?

@jorgecarleitao merged commit 7dedd02 into jorgecarleitao:main on Sep 20, 2021