Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ARROW-4605: [Rust] Move filter and limit code from DataFusion into compute module #3741

Closed
wants to merge 14 commits into from
Closed

Conversation

ntrinquier
Copy link

No description provided.

}

pub fn limit(array: &Array, num_rows_to_read: usize) -> Result<ArrayRef> {
if array.len() < num_rows_to_read {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm curious as to why we return an error here. Why not set the limit to the lower of the 2 instead?

vec![1,2,3,4,5].iter().take(7) doesn't throw an error, and SELECT * FROM table_with_100_records LIMIT 200 will also only return the 100 records. What do you think about the suggestion?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My initial thought was that if someone calls this function with a num_rows_to_read larger than array.len(), they will probably want to know rather than silently choosing the minimum.

However given take and LIMIT behavior this is indeed probably not what is expected from this function. I will change it, good call.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think if you limit the num_rows_to_read to the array length.
Have a look at https://issues.apache.org/jira/browse/ARROW-3954, which will make filter more efficient in future.

///
/// Returns the whole array if the number of elements specified is larger than the length of the array
pub fn limit(array: &Array, num_elements: usize) -> Result<ArrayRef> {
let num_elements_safe: usize = cmp::min(array.len(), num_elements);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One last nit, we could return the array as ArrayRef immediately if the limit >= len. I sold have thought of it earlier, my apologies.

I'm happy with everything else

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe you can help me here: how can I wrap the reference to array in an Arc<Array> / ArrayRef? Since the reference to array can be freed at any time it would leave the Arc invalid, so I have to specify the lifetime somehow. Can I do that with Arc?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, you're right, I missed that part. We can improve limit when we have zero-copy array slicing 👍🏾

Copy link
Member

@andygrove andygrove left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM but I made one suggestion

@@ -627,6 +627,26 @@ impl<'a> From<Vec<&'a str>> for BinaryArray {
}
}

impl<'a> From<Vec<&[u8]>> for BinaryArray {
fn from(v: Vec<&[u8]>) -> Self {
let mut offsets = vec![];
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be more efficient to initialize these vectors with Vec::with_capacity(v.len())

@@ -609,8 +609,8 @@ impl From<ArrayDataRef> for BinaryArray {

impl<'a> From<Vec<&'a str>> for BinaryArray {
fn from(v: Vec<&'a str>) -> Self {
let mut offsets = vec![];
let mut values = vec![];
let mut offsets = Vec::with_capacity(v.len());;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, I guess this should be Vec::with_capacity(v.len() + 1). Also you have an extra ;

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!

Copy link
Member

@andygrove andygrove left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, one more comment ...

let mut offsets = vec![];
let mut values = vec![];
let mut offsets = Vec::with_capacity(v.len() + 1);
let mut values = Vec::with_capacity(v.len() + 1);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After reviewing this again, I realize that this change is good for offsets but maybe not for values. We can only allocate values to the correct size if we first go and sum the lengths of all of the input strings, which would be expensive. Maybe we should just go back to vec![] or Vec::new() for this one?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, yes. Also since v can contain empty strings, we do not even have a lower bound for an initial capacity, so I'll revert to Vec::new() for values.

Copy link
Member

@andygrove andygrove left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM - thanks @ntrinquier !

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants