Add StringArray::num_chars for calculating number of characters #1503
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -78,6 +78,15 @@ impl<OffsetSize: StringOffsetSizeTrait> GenericStringArray<OffsetSize> { | |
self.data.buffers()[1].clone() | ||
} | ||
|
||
/// Returns the number of `Unicode Scalar Value` in the string at index `i`. | ||
/// # Performance | ||
/// This function has `O(n)` time complexity where `n` is the string length. | ||
/// If you can make sure that all chars in the string are in the range `U+0x0000` ~ `U+0x007F`, | ||
/// please use the function [`value_length`](#method.value_length) which has O(1) time complexity. | ||
pub fn num_chars(&self, i: usize) -> usize { | ||
I don't find an elegant way to make the returned type as |
I think returning the length as a |
||
self.value(i).chars().count() | ||
} | ||
|
||
/// Returns the element at index | ||
/// # Safety | ||
/// caller is responsible for ensuring that index is within the array bounds | ||
|
@@ -377,9 +386,9 @@ mod tests { | |
|
||
#[test] | ||
fn test_string_array_from_u8_slice() { | ||
- let values: Vec<&str> = vec!["hello", "", "parquet"]; | ||
+ let values: Vec<&str> = vec!["hello", "", "A£ऀ𖼚𝌆৩ƐZ"]; | ||
👍 |
||
|
||
- // Array data: ["hello", "", "parquet"] | ||
+ // Array data: ["hello", "", "A£ऀ𖼚𝌆৩ƐZ"] | ||
let string_array = StringArray::from(values); | ||
|
||
assert_eq!(3, string_array.len()); | ||
|
@@ -388,10 +397,12 @@ mod tests { | |
assert_eq!("hello", unsafe { string_array.value_unchecked(0) }); | ||
assert_eq!("", string_array.value(1)); | ||
assert_eq!("", unsafe { string_array.value_unchecked(1) }); | ||
- assert_eq!("parquet", string_array.value(2)); | ||
- assert_eq!("parquet", unsafe { string_array.value_unchecked(2) }); | ||
- assert_eq!(5, string_array.value_offsets()[2]); | ||
- assert_eq!(7, string_array.value_length(2)); | ||
+ assert_eq!("A£ऀ𖼚𝌆৩ƐZ", string_array.value(2)); | ||
+ assert_eq!("A£ऀ𖼚𝌆৩ƐZ", unsafe { | ||
+ string_array.value_unchecked(2) | ||
+ }); | ||
+ assert_eq!(20, string_array.value_length(2)); // 1 + 2 + 3 + 4 + 4 + 3 + 2 + 1 | ||
+ assert_eq!(8, string_array.num_chars(2)); | ||
for i in 0..3 { | ||
assert!(string_array.is_valid(i)); | ||
assert!(!string_array.is_null(i)); | ||
|
@@ -407,9 +418,9 @@ mod tests { | |
|
||
#[test] | ||
fn test_large_string_array_from_u8_slice() { | ||
- let values: Vec<&str> = vec!["hello", "", "parquet"]; | ||
+ let values: Vec<&str> = vec!["hello", "", "A£ऀ𖼚𝌆৩ƐZ"]; | ||
|
||
- // Array data: ["hello", "", "parquet"] | ||
+ // Array data: ["hello", "", "A£ऀ𖼚𝌆৩ƐZ"] | ||
let string_array = LargeStringArray::from(values); | ||
|
||
assert_eq!(3, string_array.len()); | ||
|
@@ -418,10 +429,13 @@ mod tests { | |
assert_eq!("hello", unsafe { string_array.value_unchecked(0) }); | ||
assert_eq!("", string_array.value(1)); | ||
assert_eq!("", unsafe { string_array.value_unchecked(1) }); | ||
- assert_eq!("parquet", string_array.value(2)); | ||
- assert_eq!("parquet", unsafe { string_array.value_unchecked(2) }); | ||
+ assert_eq!("A£ऀ𖼚𝌆৩ƐZ", string_array.value(2)); | ||
+ assert_eq!("A£ऀ𖼚𝌆৩ƐZ", unsafe { | ||
+ string_array.value_unchecked(2) | ||
+ }); | ||
assert_eq!(5, string_array.value_offsets()[2]); | ||
- assert_eq!(7, string_array.value_length(2)); | ||
+ assert_eq!(20, string_array.value_length(2)); // 1 + 2 + 3 + 4 + 4 + 3 + 2 + 1 | ||
+ assert_eq!(8, string_array.num_chars(2)); | ||
for i in 0..3 { | ||
assert!(string_array.is_valid(i)); | ||
assert!(!string_array.is_null(i)); | ||
|
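To illustrate why the doc comment in this diff calls `value_length` O(1) and `num_chars` O(n), here is a minimal sketch of an offsets-plus-values layout. `TinyStringArray` and its `from_strs` and `value_offset` methods are hypothetical names made up for this example, and `usize` offsets are a simplification; none of this is the crate's actual implementation.

```rust
/// Hypothetical stand-in for a variable-length string array: all strings are
/// concatenated into one byte buffer, and `offsets[i]..offsets[i + 1]` marks
/// the bytes that belong to element `i`.
pub struct TinyStringArray {
    offsets: Vec<usize>,
    values: Vec<u8>,
}

impl TinyStringArray {
    /// Builds the two buffers from a slice of string slices.
    pub fn from_strs(items: &[&str]) -> Self {
        let mut offsets = vec![0];
        let mut values = Vec::new();
        for s in items {
            values.extend_from_slice(s.as_bytes());
            offsets.push(values.len());
        }
        Self { offsets, values }
    }

    /// Start offset (in bytes) of element `i` in the value buffer.
    pub fn value_offset(&self, i: usize) -> usize {
        self.offsets[i]
    }

    /// Length in bytes: a subtraction of two offsets, independent of the
    /// string's size.
    pub fn value_length(&self, i: usize) -> usize {
        self.offsets[i + 1] - self.offsets[i]
    }

    /// The element at index `i` as a string slice.
    pub fn value(&self, i: usize) -> &str {
        let bytes = &self.values[self.offsets[i]..self.offsets[i + 1]];
        std::str::from_utf8(bytes).expect("buffer was built from valid &str values")
    }

    /// Number of Unicode scalar values: every byte of the element has to be
    /// decoded, so the cost grows with the string's length.
    pub fn num_chars(&self, i: usize) -> usize {
        self.value(i).chars().count()
    }
}
```

Built from `["hello", "", "A£ऀ𖼚𝌆৩ƐZ"]`, this sketch reports a start offset of 5, a byte length of 20, and a character count of 8 for element 2, matching the assertions in the tests above.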
The length of the string equals the number of chars when all chars are in the range `U+0000` ~ `U+007F`.
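A small sketch of that observation, using only the standard library. The helper name `num_chars_with_ascii_shortcut` is hypothetical and not part of the PR. Note that `str::is_ascii` itself scans the bytes, so this check is not O(1); the constant-time path only applies when the caller already knows the data is ASCII.

```rust
/// Hypothetical helper: character count with a shortcut for ASCII-only input.
pub fn num_chars_with_ascii_shortcut(s: &str) -> usize {
    if s.is_ascii() {
        // Every char in U+0000 ~ U+007F occupies exactly one byte in UTF-8,
        // so the byte length is also the character count.
        s.len()
    } else {
        // Otherwise the string has to be decoded char by char.
        s.chars().count()
    }
}
```

For `"parquet"` both branches would agree on 7; for `"A£ऀ𖼚𝌆৩ƐZ"` the byte length is 20 while the character count is 8.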
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder if you have considered using
array.value(i).chars().len()
to count utf8 codepoints, as described inhttps://stackoverflow.com/questions/46290655/get-the-string-length-in-characters-in-rust?
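As a worked example of what counting with `chars()` means for the new test string, the sketch below lists the UTF-8 width of each char. `utf8_widths` is a hypothetical helper name; `char::len_utf8` is the standard-library method it relies on.

```rust
/// UTF-8 width in bytes of each char in `s`, in order of appearance.
pub fn utf8_widths(s: &str) -> Vec<usize> {
    s.chars().map(|c| c.len_utf8()).collect()
}
```

For `"A£ऀ𖼚𝌆৩ƐZ"` the widths are 1, 2, 3, 4, 4, 3, 2, 1: eight chars that sum to the 20 bytes asserted for `value_length(2)` in the tests.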
Thank you for your helpful suggestion, I will have a try!
Done!