Add Substring_by_char
#1784
Signed-off-by: remzi <[email protected]>
Codecov Report
All modified and coverable lines are covered by tests ✅
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #1784      +/-   ##
==========================================
+ Coverage   83.39%   83.42%   +0.02%
==========================================
  Files         198      198
  Lines       56142    56303     +161
==========================================
+ Hits        46822    46969     +147
- Misses       9320     9334      +14
I think this looks good. I left some comments on how it could be improved, but I'm happy for those to go into a follow-up PR.
length.to_usize().unwrap().min(char_count - start)
});

val.chars().skip(start).take(length).collect::<String>()
Allocating a temporary string, only to copy out of it, is likely a significant portion of the slow-down, combined with the null handling.
This could definitely be handled as a separate PR, but you might want to consider doing something like the following (not tested):
let nulls = // align bitmap to 0 offset, copying if already aligned
let mut vals = BufferBuilder::<u8>::new(array.value_data().len());
let mut indexes = BufferBuilder::<OffsetSize>::new(array.len() + 1);
indexes.append(0);
for val in array.iter() {
let char_count = val.chars().count();
let start = if start >= 0 {
start.to_usize().unwrap().min(char_count)
} else {
char_count - (-start).to_usize().unwrap().min(char_count)
};
let length = length.map_or(char_count - start, |length| {
length.to_usize().unwrap().min(char_count - start)
});
let mut start_byte = 0;
let mut end_byte = val.len();
for (idx, (byte_idx, _)) in val.char_indices().enumerate() {
if idx == start {
start_byte = byte_idx;
} else if idx == start + length {
end_byte = byte_idx;
break
}
}
// Could even be unchecked
vals.append_slice(&val[start_byte..end_byte]);
indexes.append(vals.len() as _);
}
// Offsets buffer first, then the values buffer; each builder is finished exactly once
let data = ArrayDataBuilder::new(array.data_type().clone()).len(array.len())
.add_buffer(indexes.finish()).add_buffer(vals.finish());
Ok(GenericStringArray::<OffsetSize>::from(unsafe {data.build_unchecked()}))
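The core of the suggestion above, mapping a character range to a byte range and slicing without a temporary `String`, can be sketched on a plain `&str`. This is only an illustration under stated assumptions: the helper names `char_range_to_byte_range` and `substring_by_char_str` are hypothetical, it uses only the standard library rather than the Arrow buffers, and it is not the code that was merged. It uses two separate `if` checks so that a zero-length range and a start at the end of the string both yield an empty slice.

```rust
/// Hypothetical helper: resolve a possibly negative character `start` and an
/// optional character `length` into a byte range of `val`.
fn char_range_to_byte_range(val: &str, start: i64, length: Option<u64>) -> (usize, usize) {
    let char_count = val.chars().count();
    // A negative start counts back from the end; both cases clamp to the string.
    let start = if start >= 0 {
        (start as u64).min(char_count as u64) as usize
    } else {
        char_count - (start.unsigned_abs().min(char_count as u64) as usize)
    };
    let remaining = char_count - start;
    let length = length.map_or(remaining, |l| l.min(remaining as u64) as usize);
    let end = start + length;

    // Default both bounds to the end of the string, so a start or end equal
    // to `char_count` (never visited by the loop) maps to `val.len()`.
    let mut start_byte = val.len();
    let mut end_byte = val.len();
    for (idx, (byte_idx, _)) in val.char_indices().enumerate() {
        if idx == start {
            start_byte = byte_idx;
        }
        // Checked separately from the start, so `length == 0` is handled.
        if idx == end {
            end_byte = byte_idx;
            break;
        }
    }
    (start_byte, end_byte)
}

/// Hypothetical helper: borrow the substring instead of collecting a new `String`.
fn substring_by_char_str(val: &str, start: i64, length: Option<u64>) -> &str {
    let (start_byte, end_byte) = char_range_to_byte_range(val, start, length);
    &val[start_byte..end_byte]
}
```

Because both byte indices come from `char_indices()` or `val.len()`, they always fall on character boundaries, which is why the slice cannot panic here.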
Tracked by #1800
(input_vals.clone(), -4, Some(4), vec!["ello", "", "⊢x:T"]),
];

cases.into_iter().try_for_each::<_, Result<()>>(
Perhaps we could extract this as a function instead of duplicating it?
Tracked by #1801
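One possible shape for such an extracted helper is a single function that walks a table of cases, so each test only supplies its data. The sketch below is an assumption-laden illustration: `Case`, `check_cases`, and the stand-in kernel `substring_by_char_vec` are hypothetical names, and the kernel works on a slice of `Option<&str>` rather than on Arrow string arrays.

```rust
/// Hypothetical case tuple: (start, length, expected output).
type Case<'a> = (usize, Option<usize>, Vec<Option<&'a str>>);

/// Stand-in kernel for illustration only: a char-based substring applied to
/// each optional string, with nulls passed through unchanged.
fn substring_by_char_vec(
    input: &[Option<&str>],
    start: usize,
    length: Option<usize>,
) -> Vec<Option<String>> {
    input
        .iter()
        .map(|v| {
            v.map(|s| {
                let chars = s.chars().skip(start);
                match length {
                    Some(n) => chars.take(n).collect::<String>(),
                    None => chars.collect::<String>(),
                }
            })
        })
        .collect()
}

/// Shared driver: runs every case and stops at the first mismatch.
fn check_cases(input: &[Option<&str>], cases: &[Case<'_>]) -> Result<(), String> {
    cases.iter().try_for_each(|(start, length, expected)| {
        let actual = substring_by_char_vec(input, *start, *length);
        let expected: Vec<Option<String>> =
            expected.iter().map(|v| v.map(str::to_string)).collect();
        if actual == expected {
            Ok(())
        } else {
            Err(format!(
                "start={start}, length={length:?}: expected {expected:?}, got {actual:?}"
            ))
        }
    })
}
```

Each test function would then reduce to building its input and case table and calling `check_cases`.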
}

fn generic_string_by_char_with_non_zero_offset<O: OffsetSizeTrait>() -> Result<()> {
let values = "S→T = Πx:S.T";
I think it would be easier to follow if this just constructed the regular array and then called array.slice(1, 2).
Tracked by #1801
@@ -1083,6 +1135,164 @@ mod tests {
generic_string_with_non_zero_offset::<i64>()
}

fn with_nulls_generic_string_by_char<O: OffsetSizeTrait>() -> Result<()> {
Loving the test coverage 👍
Co-authored-by: Raphael Taylor-Davies <[email protected]>
Which issue does this PR close?
Closes #1768.
Rationale for this change
Support substring by char
What changes are included in this PR?
substring_by_char
Performance
substring_by_char runs at 1/6 of the speed of substring (by byte).
Are there any user-facing changes?
Yes.