This issue was moved to a discussion.

Parquet writing examples/macro/guidance #58

Closed
lyuben-todorov opened this issue Oct 11, 2021 · 1 comment
Labels
question Further information is requested

Comments

@lyuben-todorov

Hi, I have a long-standing puzzle with Parquet's presence in Rust.

My end goal is to write Parquet files containing my data in Rust efficiently (no JSON, no senseless conversions, etc.). My logical approach would be to find, or write, a derive macro that generates Parquet writers for structs. However, the parquet_derive crate lacks a lot of features (nested structures) and, according to the ASF Slack, isn't actively developed at the moment. I tried implementing a derive macro for an Arrow RecordBatch writer (from the arrow crate), but I quickly ran into problems with using the arrow crate itself. Because of that experience I'm starting to think that writing Parquet in anything other than Java was not meant to be, even though in theory that should not be the case.

I'm asking the maintainer, as a person with more experience with the Parquet format and ecosystem: are my goals possible? If yes, could you please give me some guidance on what would need to be done and the best way to approach it, and maybe a code example of converting data (Vec) into a Parquet file?

@jorgecarleitao
Owner

Thanks a lot for reaching out! Some questions:

  1. Is the data columnar in nature (or is it e.g. a stream of rows)?
  2. Can you lay out your data according to the Arrow format?
  3. Is the data flat (i.e. no nested structures)?

If the answer to all three is yes, then I would try using arrow2 directly.

If the answer to 1 or 2 is no, I would try out arrow2-derive to build a RecordBatch.

If the answer to 3 is no, arrow2 does not support that yet (it is on the roadmap, see e.g. jorgecarleitao/arrow2#504).
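
To make questions 1 and 2 concrete, here is a minimal sketch of turning a stream of rows into a columnar layout loosely modelled on the Arrow memory format: one contiguous buffer per field, a validity bitmap for the nullable field, and an offsets buffer for the string field. It uses only the Rust standard library and is not the arrow2 API. The names `Reading`, `ReadingColumns` and `rows_to_columns` are hypothetical and exist only for illustration.

```rust
// Hypothetical row type; not taken from the thread or from any crate.
#[derive(Debug, Clone)]
struct Reading {
    sensor_id: i64,
    temperature: Option<f64>, // nullable field
    label: String,            // variable-length field
}

// Column-oriented layout, loosely following the Arrow memory format.
#[derive(Debug)]
struct ReadingColumns {
    sensor_id: Vec<i64>,
    temperature: Vec<f64>,
    // One bit per row, least-significant bit first; 1 means "value present".
    temperature_validity: Vec<u8>,
    // rows + 1 entries; label i spans label_bytes[offsets[i]..offsets[i + 1]].
    label_offsets: Vec<i32>,
    label_bytes: Vec<u8>,
}

// Transpose a slice of rows into per-field buffers in a single pass.
fn rows_to_columns(rows: &[Reading]) -> ReadingColumns {
    let mut cols = ReadingColumns {
        sensor_id: Vec::with_capacity(rows.len()),
        temperature: Vec::with_capacity(rows.len()),
        temperature_validity: vec![0u8; (rows.len() + 7) / 8],
        label_offsets: Vec::with_capacity(rows.len() + 1),
        label_bytes: Vec::new(),
    };
    cols.label_offsets.push(0);

    for (i, row) in rows.iter().enumerate() {
        cols.sensor_id.push(row.sensor_id);

        match row.temperature {
            Some(t) => {
                cols.temperature.push(t);
                cols.temperature_validity[i / 8] |= 1 << (i % 8);
            }
            // A null still occupies a slot; only the bitmap marks it as missing.
            None => cols.temperature.push(0.0),
        }

        cols.label_bytes.extend_from_slice(row.label.as_bytes());
        cols.label_offsets.push(cols.label_bytes.len() as i32);
    }
    cols
}

fn main() {
    let rows = vec![
        Reading { sensor_id: 1, temperature: Some(20.5), label: "a".to_string() },
        Reading { sensor_id: 2, temperature: None, label: "bc".to_string() },
    ];
    let cols = rows_to_columns(&rows);
    println!("{:?}", cols);
}
```

If the data can be kept in buffers like these from the start, the answers to 1 and 2 are "yes" and no per-row conversion is needed before writing. If it only exists as a `Vec` of structs, a transposition of this kind is roughly the step a derive macro would have to generate.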

Repository owner locked and limited conversation to collaborators Oct 11, 2021
@jorgecarleitao added the no-changelog and question labels and removed the no-changelog label Oct 18, 2021

2 participants