GH-38837: [Format] Add the specification to pass statistics through the Arrow C data interface #43553
base: main
@github-actions crossbow submit preview-docs
I'm not a native English speaker. Wording suggestions are very welcome. I'll add examples after I implement a convenient API in C++.
Revision: 22336f4 Submitted crossbow builds: ursacomputing/crossbow @ actions-28c2a45b3d
* Provide a common way to pass statistics that can be used for
  other interfaces such Arrow Flight too.
What about the Arrow IPC format? Can you add a sentence here that explains why we do not recommend using this to pass statistics over Arrow IPC?
Good point. This may fit the Arrow IPC format. (A producer could send data and statistics as two separate pieces of Arrow IPC format data.)
But the Arrow IPC format allows other approaches too. For example, the Arrow IPC format can have metadata for each record batch: https://arrow.apache.org/docs/format/Columnar.html#encapsulated-message-format (The Arrow C data interface can't have metadata for ArrowArray.)
The Arrow IPC format can also be used with other mechanisms such as Arrow Flight and ADBC.
So this may not be the best approach for the Arrow IPC format. We should discuss this use case for the Arrow IPC format separately.
I'll add something here.
For example, ADBC has the statistics related APIs. This specification
doesn't replace them.
Thanks. I should have done it.
@pdet please take a look and add comments if you have any, thanks!
The format and contents LGTM! I was just slightly confused for one second that the second mapping is the value @Tmonster I think the proposed
@ianmcook Thanks for your suggestions! I've merged all of them!
The ``ARROW`` pattern is a reserved namespace for pre-defined
statistics keys. User-defined statistics must not use it.

Here are pre-defined statistics keys:
They are the same as ADBC's: https://github.com/apache/arrow-adbc/blob/05fa60d643c66b572d426ab28aa78fc52e9520e8/c/include/arrow-adbc/adbc.h#L524-L570
Ah, it makes sense. I'll improve it. Thanks.
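For readers following the reserved-namespace rule quoted in this thread, here is a minimal sketch of how a producer might reject user-defined keys that collide with it. Only the ``ARROW`` namespace comes from the quoted text. The `:` separator between namespace and key name, and the function names, are assumptions made for illustration and are not taken from the specification.

```python
# The reserved namespace is taken from the quoted specification text.
RESERVED_NAMESPACE = "ARROW"

# Assumption: a key is written as "<namespace>:<name>". The separator is
# hypothetical and only used here to make the check concrete.
SEPARATOR = ":"


def is_reserved_key(key: str) -> bool:
    """Return True if the key falls inside the reserved namespace."""
    return key == RESERVED_NAMESPACE or key.startswith(
        RESERVED_NAMESPACE + SEPARATOR
    )


def validate_user_defined_key(key: str) -> str:
    """Return the key unchanged, or raise if it uses the reserved namespace."""
    if is_reserved_key(key):
        raise ValueError(
            f"{key!r} uses the reserved {RESERVED_NAMESPACE!r} namespace"
        )
    return key
```

A producer could call `validate_user_defined_key()` on every application-specific key before emitting it, so pre-defined and user-defined statistics never share a name.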
@github-actions crossbow submit preview-docs
6cf71e0 to ed0cbe2
I've added the original DuckDB use case. DuckDB may be able to get statistics without I'll start a discussion on the mailing list tomorrow.
The discussion thread: https://lists.apache.org/thread/b6chzlyn95rztoybs39b6olz907g12gj
Rationale for this change
Statistics are useful for fast query processing. Many query engines
use statistics to optimize their query plans.
The Apache Arrow format doesn't have statistics, but other formats that
can be read as Apache Arrow data may have them. For example, Apache
Parquet C++ can read an Apache Parquet file as Apache Arrow data, and
an Apache Parquet file may have statistics.
One of the Arrow C data interface use cases is the following:
Arrow C data interface
If module A can pass the statistics associated with the Apache Parquet
file to module B through the Arrow C data interface, module B can use
the statistics to optimize its query plan.
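As an illustration of the kind of optimization module B could make, here is a minimal sketch of skipping data based on a maximum-value statistic. The dictionary shape used for the statistics and the function name are hypothetical. They stand in for whatever module B decodes from the statistics it receives and are not the layout defined by this specification. The sketch also assumes the maximum is exact or at least a valid upper bound.

```python
from typing import Mapping, Optional

# Hypothetical in-memory shape: column name -> {"min": ..., "max": ...}.
# This is NOT the layout defined by the specification.
ColumnStatistics = Mapping[str, Optional[int]]


def can_skip_greater_than(
    statistics: Mapping[str, ColumnStatistics],
    column: str,
    threshold: int,
) -> bool:
    """Return True if no row can satisfy `column > threshold`.

    Missing statistics never allow skipping, because the data must then
    be scanned to evaluate the filter.
    """
    column_statistics = statistics.get(column)
    if column_statistics is None:
        return False
    max_value = column_statistics.get("max")
    if max_value is None:
        return False
    # If even the largest value is not above the threshold, no row matches.
    return max_value <= threshold
```

For example, with a maximum of 10 for a `price` column, a filter `price > 10` can skip the data entirely, while `price > 9` cannot.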
What changes are included in this PR?
Add the specification to pass statistics through the Arrow C data interface based on the discussion on the dev@ mailing list: https://lists.apache.org/thread/z0jz2bnv61j7c6lbk7lympdrs49f69cx
Are these changes tested?
Yes.
Are there any user-facing changes?
Yes.