Add Schema::project and RecordBatch project function to project / select a subset of columns #1014

Closed
alamb opened this issue Dec 9, 2021 · 3 comments · Fixed by #1033
Labels
enhancement (Any new improvement worthy of an entry in the changelog), good first issue (Good for newcomers)

Comments


alamb commented Dec 9, 2021

Is your feature request related to a problem or challenge? Please describe what you are trying to do.
It is common to "project" (that is, pick a subset of) columns from a schema, and likewise from a RecordBatch, for processing.

https://github.com/apache/arrow-datafusion/blob/299ab7d1c37c707fcd500d3428abbdbe4dc5399b/datafusion/src/datasource/empty.rs#L65-L71

https://github.com/apache/arrow-datafusion/blob/0facd4d483e8c289ee4e3a89487d0cd1ede1a110/datafusion/src/physical_plan/file_format/mod.rs#L83-L93

There are many instances of projection code like the following:

            // apply projection
            match &self.projection {
                Some(columns) => Some(RecordBatch::try_new(
                    self.schema.clone(),
                    columns.iter().map(|i| batch.column(*i).clone()).collect(),
                )),
                None => Some(Ok(batch.clone())),
            }

Many (most) instances of projection don't handle metadata, which leads to bugs like apache/datafusion#1361

Describe the solution you'd like
Add projection functions to Schema and RecordBatch structs in the arrow-rs crate that properly handle metadata.

Proposed signatures:

/// Returns a new schema consisting of only the specified columns
///
/// So if a schema had Fields A, B and C (indices 0, 1 and 2), schema.project([2,1])
/// would return a new schema with Fields C and B, in that order
///
/// TODO example
fn Schema::project(&self, indices: impl IntoIterator<Item=usize>) -> Result<Schema> {
...
}
/// Returns a new RecordBatch consisting of only the specified columns
///
/// So if a RecordBatch had Columns A, B and C (indices 0, 1 and 2), batch.project([2,1])
/// would return a new RecordBatch with Columns C and B, in that order
///
/// TODO example
fn RecordBatch::project(&self, indices: impl IntoIterator<Item=usize>) -> Result<RecordBatch> {
...
}
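
To make the intended behavior concrete, here is a minimal sketch of the two proposed functions. It does not use the arrow-rs types: `Field`, `Schema` and `RecordBatch` below are simplified, hypothetical stand-ins (columns are plain `Vec<i64>`, errors are plain `String`s), chosen only so the example is self-contained. The points it illustrates are that indices are zero-based, that output order follows the order of `indices`, that an out-of-range index is an error, and that schema-level metadata is carried over.

```rust
use std::collections::HashMap;

// Hypothetical, simplified stand-ins for arrow's Field, Schema and RecordBatch.
// These are NOT the arrow-rs types; they model only what the sketch needs.
#[derive(Debug, Clone, PartialEq)]
pub struct Field {
    pub name: String,
}

#[derive(Debug, Clone, PartialEq)]
pub struct Schema {
    pub fields: Vec<Field>,
    pub metadata: HashMap<String, String>,
}

#[derive(Debug, Clone, PartialEq)]
pub struct RecordBatch {
    pub schema: Schema,
    // One Vec<i64> per column (assumption: real arrow columns are typed arrays).
    pub columns: Vec<Vec<i64>>,
}

impl Schema {
    /// Returns a new schema with only the fields at `indices`, in that order,
    /// keeping the schema-level metadata unchanged.
    pub fn project(
        &self,
        indices: impl IntoIterator<Item = usize>,
    ) -> Result<Schema, String> {
        let fields = indices
            .into_iter()
            .map(|i| {
                self.fields.get(i).cloned().ok_or_else(|| {
                    format!(
                        "project index {} out of bounds for {} fields",
                        i,
                        self.fields.len()
                    )
                })
            })
            .collect::<Result<Vec<_>, _>>()?;
        Ok(Schema {
            fields,
            metadata: self.metadata.clone(),
        })
    }
}

impl RecordBatch {
    /// Returns a new batch with only the columns at `indices`, in that order.
    /// The schema is projected with the same indices, so metadata is preserved.
    pub fn project(
        &self,
        indices: impl IntoIterator<Item = usize>,
    ) -> Result<RecordBatch, String> {
        // Collect once so the same indices drive both schema and columns.
        let indices: Vec<usize> = indices.into_iter().collect();
        let schema = self.schema.project(indices.iter().copied())?;
        let columns = indices
            .iter()
            .map(|&i| {
                self.columns
                    .get(i)
                    .cloned()
                    .ok_or_else(|| format!("column index {} out of bounds", i))
            })
            .collect::<Result<Vec<_>, _>>()?;
        Ok(RecordBatch { schema, columns })
    }
}
```

With fields A, B and C, calling `project(vec![2, 1])` on this sketch yields C followed by B, and the projected schema keeps the original metadata map. Delegating `RecordBatch::project` to `Schema::project` keeps the metadata handling in one place, which is the kind of bug the issue is trying to prevent.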

Describe alternatives you've considered

Additional context
@hntd187 added this feature in DataFusion in apache/datafusion#1378


hntd187 commented Dec 9, 2021

I suppose it only makes sense to tackle this one next. 🔢

@novemberkilo

Is this still open for me to pick up, or are you on it, @hntd187 // @alamb?


hntd187 commented Dec 10, 2021

I was going to tackle this. I'm still learning the code base, but I think I'll manage.
