Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add RecordBatchWriter trait and implement it for CSV, JSON, IPC and P… #4206

Merged
merged 1 commit into from
May 12, 2023

Conversation

alexandreyc
Copy link
Contributor

Which issue does this PR close?

This PR doesn't close any particular issue.

Rationale for this change

I found myself needing to work generically with writers of record batches and I need a common interface for doing that.

Do you find this useful? Feel free to reject the PR if you don't see any use case for it.

What changes are included in this PR?

A new trait is introduced and implemented for CSV, JSON, IPC and Parquet :

/// Trait for types that can write `RecordBatch`'s.
pub trait RecordBatchWriter {
    /// Write a single batch to the writer.
    fn write(&mut self, batch: &RecordBatch) -> Result<(), ArrowError>;
}

Are there any user-facing changes?

According to my analysis there are at least one breaking change:

Prototypes of arrow_json::writer::Writer::write and arrow_json::writer::record_batches_to_json_rows have changed and now takes a reference to the record batch instead of taking ownership of it.

@github-actions github-actions bot added arrow Changes to the arrow crate parquet Changes to the parquet crate labels May 11, 2023
Copy link
Contributor

@tustvold tustvold left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me, only thought is whether we also want a schema method?

@tustvold tustvold added the api-change Changes to the arrow API label May 12, 2023
@alexandreyc
Copy link
Contributor Author

I'm not sure we absolutely need a schema method.

First, the caller should already know the schema since it gives the writer RecordBatchs which have a schema.

Second, it would be a bigger refactoring because the schema is not directly available on the writer object for arrow_ipc::writer::StreamWriter, arrow_csv::writer::Writer and arrow_json::writer::Writer.

Additionally, I wonder if we shouldn't add a deprecated attribute to the original write method on the writer object?

What's your opinion?

@tustvold
Copy link
Contributor

First, the caller should already know the schema since it gives the writer RecordBatchs which have a schema.

Fair enough

Additionally, I wonder if we shouldn't add a deprecated attribute to the original write method on the writer object?

I would rather not force all users to have a trait in scope, so I would prefer not to

@tustvold tustvold merged commit 0190408 into apache:master May 12, 2023
@alexandreyc
Copy link
Contributor Author

I would rather not force all users to have a trait in scope, so I would prefer not to

I'm fine with that!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
api-change Changes to the arrow API arrow Changes to the arrow crate parquet Changes to the parquet crate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants