-
Notifications
You must be signed in to change notification settings - Fork 3.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ARROW-4605: [Rust] Move filter and limit code from DataFusion into compute module #3741
Conversation
rust/arrow/src/compute/array_ops.rs
Outdated
} | ||
|
||
pub fn limit(array: &Array, num_rows_to_read: usize) -> Result<ArrayRef> { | ||
if array.len() < num_rows_to_read { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm curious as to why we return an error here. Why not set the limit to the lower of the 2 instead?
vec![1,2,3,4,5].iter().take(7)
doesn't throw an error, and SELECT * FROM table_with_100_records LIMIT 200
will also only return the 100 records. What do you think about the suggestion?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My initial thought was that if someone calls this function with a num_rows_to_read
larger than array.len()
, they will probably want to know rather than silently choosing the minimum.
However given take
and LIMIT
behavior this is indeed probably not what is expected from this function. I will change it, good call.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think if you limit the num_rows_to_read
to the array length.
Have a look at https://issues.apache.org/jira/browse/ARROW-3954, which will make filter more efficient in future.
…String > &str > &[u8]
create BinaryArray directly from byte slice
/// | ||
/// Returns the whole array if the number of elements specified is larger than the length of the array | ||
pub fn limit(array: &Array, num_elements: usize) -> Result<ArrayRef> { | ||
let num_elements_safe: usize = cmp::min(array.len(), num_elements); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One last nit, we could return the array as ArrayRef
immediately if the limit >= len. I sold have thought of it earlier, my apologies.
I'm happy with everything else
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe you can help me here: how can I wrap the reference to array
in an Arc<Array>
/ ArrayRef
? Since the reference to array
can be freed at any time it would leave the Arc
invalid, so I have to specify the lifetime somehow. Can I do that with Arc
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, you're right, I missed that part. We can improve limit when we have zero-copy array slicing 👍🏾
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM but I made one suggestion
rust/arrow/src/array.rs
Outdated
@@ -627,6 +627,26 @@ impl<'a> From<Vec<&'a str>> for BinaryArray { | |||
} | |||
} | |||
|
|||
impl<'a> From<Vec<&[u8]>> for BinaryArray { | |||
fn from(v: Vec<&[u8]>) -> Self { | |||
let mut offsets = vec![]; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It would be more efficient to initialize these vectors with Vec::with_capacity(v.len())
rust/arrow/src/array.rs
Outdated
@@ -609,8 +609,8 @@ impl From<ArrayDataRef> for BinaryArray { | |||
|
|||
impl<'a> From<Vec<&'a str>> for BinaryArray { | |||
fn from(v: Vec<&'a str>) -> Self { | |||
let mut offsets = vec![]; | |||
let mut values = vec![]; | |||
let mut offsets = Vec::with_capacity(v.len());; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry, I guess this should be Vec::with_capacity(v.len() + 1)
. Also you have an extra ;
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry, one more comment ...
rust/arrow/src/array.rs
Outdated
let mut offsets = vec![]; | ||
let mut values = vec![]; | ||
let mut offsets = Vec::with_capacity(v.len() + 1); | ||
let mut values = Vec::with_capacity(v.len() + 1); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
After reviewing this again, I realize that this change is good for offsets
but maybe not for values
. We can only allocate values
to the correct size if we first go and sum the lengths of all of the input strings, which would be expensive. Maybe we should just go back to vec![]
or Vec::new()
for this one?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, yes. Also since v
can contain empty strings, we do not even have a lower bound for an initial capacity, so I'll revert to Vec::new()
for values
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM - thanks @ntrinquier !
No description provided.