-
Notifications
You must be signed in to change notification settings - Fork 3.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
GH-32538: [C++][Parquet] Add JSON canonical extension type #13901
Conversation
Thanks for opening a pull request! If this is not a minor PR. Could you open an issue for this pull request on JIRA? https://issues.apache.org/jira/browse/ARROW Opening JIRAs ahead of time contributes to the Openness of the Apache Arrow project. Then could you also rename pull request title in the following format?
or
See also: |
|
We don't have a way to standardize extension types yet so I posted a proposal of how to do it: Until that is decided I think this PR should be converted to draft. Does that sound ok @progger-dev ? |
@pitrou Absolutely! That's fine. I was hoping that this PR would trigger that conversation. |
@progger-dev Since we now have an official way to standardize extension types, do you want to resurrect this by posting a proposal to the ML? |
|
||
// When the original Arrow schema isn't stored and Arrow extensions are disabled, | ||
// LogicalType::JSON is read as Binary. | ||
const auto binary_array = ::arrow::ArrayFromJSON(::arrow::binary(), json); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The PR description says
This extension is backed by utf8()
So I would have naively expected this to be inferred as arrow::utf8
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is to preserve the existing behavior.
We currently read LogicalType::JSON
as arrow::binary
. So, if there is no schema information and the extensions are disabled, then we use the current behavior.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you foresee an issue with my changing the Rust parquet reader to infer as UTF-8 in apache/arrow-rs#3376, at least until such a time as this extension type is stabilised?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think that should be fine. The behavior should be implementation dependent. So, if the Rust implementation reads the Parquet JSON type as utf8
it should continue to do that.
I'm not currently planning on touching the Rust implementation. If and when it is updated to support this extension, I would expect the current behavior to be preserved if the user requests the extension to be disabled.
@pradeepg26 did you want to continue work on this? There was some work done on canonical extension types lately (see cpp/src/arrow/extension, docs/source/format/CanonicalExtensions.rst) that could perhaps help here. |
I expected someone else at BigQuery to take over this work. I would love to finish this PR, but I don't think I'll have time for a while. |
I have some time next week and can try to move it forward if you don't mind? |
35ff87d
to
598c467
Compare
JNI failure seems unrelated. |
Co-authored-by: Antoine Pitrou <[email protected]>
Co-authored-by: Antoine Pitrou <[email protected]>
8b27962
to
1ca8f1b
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1, thanks a lot for pushing this through @rok !
Thanks for the reviews all! |
Created #44066 for the Python wrapper. |
After merging your PR, Conbench analyzed the 4 benchmarking runs that have been run so far on merge-commit 27acf8b. There was 1 benchmark result indicating a performance regression:
The full Conbench report has more details. It also includes information about 274 possible false positives for unstable benchmarks that are known to sometimes produce them. |
…che#13901) Arrow now provides a canonical extension type for JSON data. This extension is backed by utf8(). Parquet will recognize this extension and appropriately propagate the LogicalType to the storage format. * GitHub Issue: apache#32538 Lead-authored-by: Rok Mihevc <[email protected]> Co-authored-by: Pradeep Gollakota <[email protected]> Co-authored-by: Antoine Pitrou <[email protected]> Co-authored-by: mwish <[email protected]> Co-authored-by: Antoine Pitrou <[email protected]> Signed-off-by: Antoine Pitrou <[email protected]>
bool JsonExtensionType::ExtensionEquals(const ExtensionType& other) const { | ||
return other.extension_name() == this->extension_name(); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This equality check does not take into account the storage type, but only the name.
As a consequence, a JsonExtensionType<string>
type will be seen as equal to JsonExtensionType<large_string>
. Was that intentional?
While from a user point of view, it certainly makes sense to have those seen as equal, but the same is true for string vs large_string itself. And in general in Arrow C++, the types are concrete types where variants of the same "logical" type (eg string vs large_string) are not seen as equal. So should the same logic be followed here?
I assume that such type equality will for example be used to check if schemas are equal to see if a set of batches can be concatenated or written to the same IPC stream, etc, and for those cases we require exact equality?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, that's certainly a bug. Sorry for not spotting this, and feel free to submit a fix :-)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh, I suppose I missed that when switching from string only to it being a parametric type. I can make a fix later today if no one started on it yet.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I didn't start yet
ARROW_CHECK(storage_type->id() != Type::STRING || | ||
storage_type->id() != Type::STRING_VIEW || | ||
storage_type->id() != Type::LARGE_STRING); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This check is not correct also.
…#44215) ### Rationale for this change As noted in #13901 (review): ```cpp bool JsonExtensionType::ExtensionEquals(const ExtensionType& other) const { return other.extension_name() == this->extension_name(); } ``` > This equality check does not take into account the storage type, but only the name. > As a consequence, a JsonExtensionType<string> type will be seen as equal to JsonExtensionType<large_string>. ### What changes are included in this PR? This change introduces storage equality check into `JsonExtensionType` equality check. This also fixes a storage type check in `JsonExtensionType::Make`. ### Are these changes tested? Yes. ### Are there any user-facing changes? No. * GitHub Issue: #44214 Lead-authored-by: Rok Mihevc <[email protected]> Co-authored-by: Antoine Pitrou <[email protected]> Signed-off-by: Antoine Pitrou <[email protected]>
### Rationale for this change We [added canonical JsonExtensionType](#13901) and we should make it usable from Python. ### What changes are included in this PR? Python wrapper for `JsonExtensionType` and `JsonArray` are added on Python side as well as `JsonArray` on c++ side. ### Are these changes tested? Python tests for the extension type and array are included. ### Are there any user-facing changes? This adds a json canonical extension type to pyarrow. * GitHub Issue: #44066 Lead-authored-by: Rok Mihevc <[email protected]> Co-authored-by: Joris Van den Bossche <[email protected]> Signed-off-by: Antoine Pitrou <[email protected]>
Arrow now provides a canonical extension type for JSON data. This
extension is backed by utf8(). Parquet will recognize this extension
and appropriately propagate the LogicalType to the storage format.