Add StringArray::num_chars for calculating number of characters #1503
Changes from 2 commits
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -78,6 +78,39 @@ impl<OffsetSize: StringOffsetSizeTrait> GenericStringArray<OffsetSize> { | |
self.data.buffers()[1].clone() | ||
} | ||
|
||
/// Returns the number of chars in the string at index `i`. | ||
/// # Panics | ||
/// Panics if an invalid UTF-8 leading byte is found. | ||
/// However, this function does not check every byte, so the result | ||
/// may be wrong if the string is not valid UTF-8. | ||
/// # Performance | ||
/// This function has `O(n)` time complexity, where `n` is the string length in bytes. | ||
/// If all chars in the string are known to be in the range `U+0000` ~ `U+007F`, | ||
/// use [`value_length`](#method.value_length) instead, which has `O(1)` time complexity. | ||
pub fn num_chars(&self, i: usize) -> usize { | ||
I haven't found an elegant way to make the returned type as
I think returning the length as a
||
let offsets = self.value_offsets(); | ||
let start = offsets[i].to_usize().unwrap(); | ||
let end = offsets[i + 1].to_usize().unwrap(); | ||
let chars = &self.data.buffers()[1].as_slice()[start..end]; | ||
|
||
let mut char_iter = chars.iter(); | ||
let mut length: usize = 0; | ||
while let Some(prefix) = char_iter.next() { | ||
let ones = prefix.leading_ones() as usize; | ||
match ones { | ||
HaoYang670 marked this conversation as resolved.
|
||
0 => {} | ||
2..=4 => { | ||
char_iter.nth(ones - 2); | ||
} | ||
_ => { | ||
panic!("invalid utf-8 format"); | ||
} | ||
}; | ||
length += 1; | ||
} | ||
length | ||
} | ||
|
||
/// Returns the element at index | ||
/// # Safety | ||
/// caller is responsible for ensuring that index is within the array bounds | ||
|
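The prefix-byte technique used by `num_chars` can be tried outside of Arrow. The following is a minimal sketch, assuming a plain `&[u8]` input instead of the array's value buffer. The function name `count_chars_by_prefix` is hypothetical, and unlike the diff it returns `None` rather than panicking, and also reports a truncated multi-byte sequence. Like the diff, it inspects only the leading byte of each sequence and does not validate continuation bytes.

```rust
/// Counts UTF-8 encoded characters by inspecting only the first byte of each
/// sequence. Hypothetical helper for illustration, not part of the arrow crate.
fn count_chars_by_prefix(bytes: &[u8]) -> Option<usize> {
    let mut iter = bytes.iter();
    let mut count = 0usize;
    while let Some(prefix) = iter.next() {
        match prefix.leading_ones() {
            // 0xxxxxxx: a single-byte (ASCII) character.
            0 => {}
            // 110xxxxx, 1110xxxx, 11110xxx: skip the 1, 2 or 3 continuation bytes.
            ones @ 2..=4 => {
                // nth(k) consumes k + 1 items; None means the input ended early.
                iter.nth(ones as usize - 2)?;
            }
            // 10xxxxxx (stray continuation byte) or 5+ leading ones: not a valid start.
            _ => return None,
        }
        count += 1;
    }
    Some(count)
}
```

Returning `Option` instead of panicking is a design choice of this sketch only; it lets a caller decide how to handle malformed input.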
@@ -407,9 +440,9 @@ mod tests { | |
|
||
#[test] | ||
fn test_large_string_array_from_u8_slice() { | ||
let values: Vec<&str> = vec!["hello", "", "parquet"]; | ||
let values: Vec<&str> = vec!["hello", "", "A£ऀ𖼚𝌆৩ƐZ"]; | ||
|
||
// Array data: ["hello", "", "parquet"] | ||
// Array data: ["hello", "", "A£ऀ𖼚𝌆৩ƐZ"] | ||
let string_array = LargeStringArray::from(values); | ||
|
||
assert_eq!(3, string_array.len()); | ||
|
@@ -418,10 +451,13 @@ mod tests { | |
assert_eq!("hello", unsafe { string_array.value_unchecked(0) }); | ||
assert_eq!("", string_array.value(1)); | ||
assert_eq!("", unsafe { string_array.value_unchecked(1) }); | ||
assert_eq!("parquet", string_array.value(2)); | ||
assert_eq!("parquet", unsafe { string_array.value_unchecked(2) }); | ||
assert_eq!("A£ऀ𖼚𝌆৩ƐZ", string_array.value(2)); | ||
assert_eq!("A£ऀ𖼚𝌆৩ƐZ", unsafe { | ||
string_array.value_unchecked(2) | ||
}); | ||
assert_eq!(5, string_array.value_offsets()[2]); | ||
assert_eq!(7, string_array.value_length(2)); | ||
assert_eq!(20, string_array.value_length(2)); // 1 + 2 + 3 + 4 + 4 + 3 + 2 + 1 | ||
assert_eq!(8, string_array.num_chars(2)); | ||
for i in 0..3 { | ||
assert!(string_array.is_valid(i)); | ||
assert!(!string_array.is_null(i)); | ||
|
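The `20` and `8` in the updated assertions follow from the UTF-8 width of each character in the new test string. A small sketch using only the standard library (the helper name `utf8_widths` is hypothetical) makes the arithmetic in the `// 1 + 2 + 3 + 4 + 4 + 3 + 2 + 1` comment explicit:

```rust
/// Returns the UTF-8 encoded width, in bytes, of every char in `s`.
/// Hypothetical helper for illustration only.
fn utf8_widths(s: &str) -> Vec<usize> {
    s.chars().map(char::len_utf8).collect()
}
```

For the test string, the sum of the widths is the byte length that `value_length(2)` reports, and the number of entries is what `num_chars(2)` is expected to return.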
The length of the string equals the number of chars when all chars are in the range U+0000 ~ U+007F.
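The observation that byte length equals char count for ASCII-only data is why the doc comment points to `value_length` in that case. As a rough illustration on plain `&str` values (the helper `num_chars_fast` and its `known_ascii` flag are hypothetical and not part of the PR):

```rust
/// Char count that uses the byte length directly for ASCII-only input.
/// Hypothetical helper; `known_ascii` is an assumption supplied by the caller,
/// because checking it with `is_ascii` would itself scan every byte.
fn num_chars_fast(s: &str, known_ascii: bool) -> usize {
    if known_ascii {
        // Sanity check in debug builds only.
        debug_assert!(s.is_ascii());
        // Every char in U+0000..=U+007F occupies exactly one byte.
        s.len()
    } else {
        // General case: walk the string once.
        s.chars().count()
    }
}
```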
As described in https://stackoverflow.com/questions/46290655/get-the-string-length-in-characters-in-rust, have you considered using `array.value(i).chars().count()` to count UTF-8 code points?
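The standard-library approach suggested here can be sketched as follows, assuming a plain `&str` such as the one returned by `array.value(i)`. The helper name `num_chars_std` is hypothetical. Note that `Chars` does not implement `ExactSizeIterator`, so the count comes from `count()`, which walks the string once.

```rust
/// Counts the Unicode scalar values in `s` using the standard library.
/// Hypothetical helper; a `&str` is always valid UTF-8, so no panic path is needed.
fn num_chars_std(s: &str) -> usize {
    s.chars().count()
}
```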
Thank you for your helpful suggestion, I will have a try!
Done!