Short-term roadmap for this implementation #34

Open
12 of 16 tasks
waynexia opened this issue Nov 3, 2023 · 5 comments
Labels
documentation Improvements or additions to documentation

Comments

@waynexia
Collaborator

waynexia commented Nov 3, 2023

Previous discussion: apache/datafusion#4707

Though the ORC format is not as widely used as parquet in arrow-rs and datafusion related projects, there is still some (growing, in my impression) interest in and demand for this format. As @Jefffrey said here, a notable and viable milestone for this project is getting it merged into arrow-rs. This draft roadmap is meant to help us discuss, arrange, and direct our efforts toward that milestone.

Although the ORC format is less complex than parquet, there is still much work to do in various aspects. Here is a list of functionality that needs to be implemented if we consider making ORC files queryable from datafusion the primary use case at this stage. Please feel free to add, remove, or prioritize items. It's likely that we can't finish all of them in the short term, so marking what is going to be done is also important.

The items below are also related but have lower priority:

Long term items:

  • encryption

Then there are some things I'm not sure about and am looking for more information on. Also feel free to change the previous two lists.

@Jefffrey
Contributor

Jefffrey commented Nov 3, 2023

Thanks for writing this here.

Just to preface, I'm no expert in ORC, nor do I technically have a use case for it, so take my thoughts with a grain of salt. With that said:

  • Encryption can probably be given the lowest priority and moved into the longer-term roadmap. Even parquet in arrow-rs doesn't yet support encryption
  • We can probably bump the encodings to the highest priority. I assume you're referring to the V1 encodings, which should be simpler to implement than V2, which seems to already be present
  • I haven't looked into statistics and indexes much, but they do seem important for stuff like predicate pushdown, so can be medium priority or so
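To illustrate why statistics matter for predicate pushdown, here is a minimal sketch of pruning stripes by per-stripe min/max values. Everything in it (`StripeStats`, `Predicate`, `may_match`, `stripes_to_read`) is a hypothetical name invented for this example and is not the API of this library or of any ORC reader; it only assumes each stripe exposes a min and max for an integer column.

```rust
/// Hypothetical min/max statistics for one integer column in one stripe.
#[derive(Debug, Clone, Copy, PartialEq)]
struct StripeStats {
    min: i64,
    max: i64,
}

/// Hypothetical predicate forms on that column, for illustration only.
#[derive(Debug, Clone, Copy)]
enum Predicate {
    Gt(i64),
    Lt(i64),
    Eq(i64),
}

/// Returns true if the stripe might contain rows satisfying the predicate.
/// A false result means the stripe can be skipped without decoding it.
fn may_match(stats: &StripeStats, pred: Predicate) -> bool {
    match pred {
        Predicate::Gt(v) => stats.max > v,
        Predicate::Lt(v) => stats.min < v,
        Predicate::Eq(v) => stats.min <= v && v <= stats.max,
    }
}

/// Returns the indices of the stripes that still need to be read.
fn stripes_to_read(stats: &[StripeStats], pred: Predicate) -> Vec<usize> {
    stats
        .iter()
        .enumerate()
        .filter(|(_, s)| may_match(s, pred))
        .map(|(i, _)| i)
        .collect()
}
```

The check is conservative: a stripe is skipped only when its statistics prove that no row can match, so the surviving rows still have to be filtered after decoding.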

I'll create more issues based on this roadmap

Also I assume all our focus will be on a read implementation first, with write coming much later

Another question I have is whether we'll focus solely on arrow interop (that is, reading from ORC directly into arrow arrays). The parquet crate in arrow-rs seems to support a more generic ColumnReader API for users who don't need arrow. If we focus only on arrow then we can optimize the read behaviour accordingly, whereas a more generic API might require a separate read implementation

@alamb
Contributor

alamb commented Nov 4, 2023

BTW some potentially relevant documents in case anyone is interested:

A Deep Dive into Common Open Formats for Analytical

An Empirical Evaluation of Columnar Storage Formats

@Jefffrey
Contributor

Jefffrey commented Nov 4, 2023

> BTW some potentially relevant documents in case anyone is interested:
>
> A Deep Dive into Common Open Formats for Analytical
>
> An Empirical Evaluation of Columnar Storage Formats

Thanks for these, will definitely give a read!

WenyXu referenced this issue in WenyXu/datafusion-orc Nov 9, 2023
@Jefffrey Jefffrey added the documentation Improvements or additions to documentation label Apr 1, 2024
@klangner
Contributor

What is missing from this roadmap that is required for this library to be added to datafusion (and arrow-rs, polars)?
I would be interested in helping with this effort, as we have ORC files which we would like to query.

@Jefffrey
Contributor

Hey @klangner, thanks for the interest!

For DataFusion there is an issue for it: datafusion-contrib/datafusion-orc#63

Right now it lacks support for projection, not to mention that the code is sequestered in an example instead of being part of the library.

For arrow-rs it's basically just... all features needed for supporting read use cases (sorry if this is too vague 😅 )

I'm not familiar with polars so I can't say on that front.

For now I'm imagining enhancing the API for RecordBatch reading (akin to what parquet provides in arrow-rs) and also creating the necessary impls to allow DataFusion to read from ORC files using this library.

waynexia referenced this issue in datafusion-contrib/datafusion-orc Oct 24, 2024
@waynexia waynexia transferred this issue from datafusion-contrib/datafusion-orc Oct 30, 2024