Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add column_by_name() for RecordBatch #3448

Closed
wjones127 opened this issue Jan 4, 2023 · 4 comments
Closed

Add column_by_name() for RecordBatch #3448

wjones127 opened this issue Jan 4, 2023 · 4 comments
Labels
enhancement Any new improvement worthy of a entry in the changelog

Comments

@wjones127
Copy link
Member

wjones127 commented Jan 4, 2023

Is your feature request related to a problem or challenge? Please describe what you are trying to do.

This is quite verbose:

let col: Option<ArrayRef> = record_batch.column(record_batch.schema().column_with_name("my_column")?.0)

Describe the solution you'd like

The simplest option is to add column_by_name() for RecordBatch. It is already implemented for StructArray.

let col: Option<ArrayRef> = record_batch.column_by_name("my_column")

But it might also be desirable to implement std::ops::Index for RecordBatch:

let col: Option<ArrayRef> = record_batch["my_column"]

Describe alternatives you've considered

Additional context

I was working with highly nested schemas with multiple levels of StructArrays in some unit tests, and it was a little annoying to access those deep fields. In an ideal world, I might be able to do something along the lines of:

let col: Option<ArrayRef> = record_batch["top_level_column"]?
    .as_struct_array()?["mid_level_column"]
    .as_struct_array()?["inner_column"];

This implies StructArray would implement Index in the same way that RecordBatch does. If we wanted to get fancy, it might not be crazy to implement this for ListArray or even MapArray.

@wjones127 wjones127 added the enhancement Any new improvement worthy of a entry in the changelog label Jan 4, 2023
@tustvold
Copy link
Contributor

tustvold commented Jan 4, 2023

Makes sense to me

@askoa
Copy link
Contributor

askoa commented Jan 4, 2023

I'll pick this up.

Edit: I'll work on RecordBatch and StructArray. I am not planning to pick up ListArray and MapArray

@askoa
Copy link
Contributor

askoa commented Jan 5, 2023

I don't think we can return Option (or Result) from std::ops::Index. The trait function index returns a reference and hence cannot be a value created in the function. The function has to return reference to an item in self.

pub trait Index<Idx>
where
    Idx: ?Sized,
{
    type Output: ?Sized;

    fn index(&self, index: Idx) -> &Self::Output;
}

So the behavior of record_batch["name"] will be similar to record_batch.column(index).

Below are the return types I am planning. Let me know if you have alternate suggestion.

  • column_by_name will return Option<&ArrayRef>
  • column[name] will return &ArrayRef. The function will panic if the given name is not in the schema.

@ttencate
Copy link
Contributor

This is done and can be closed, right?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement Any new improvement worthy of a entry in the changelog
Projects
None yet
Development

No branches or pull requests

4 participants