Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

GH-32538: [C++][Parquet] Add JSON canonical extension type #13901

Merged
merged 26 commits into from
Sep 11, 2024

Conversation

progger-dev
Copy link
Contributor

@progger-dev progger-dev commented Aug 16, 2022

Arrow now provides a canonical extension type for JSON data. This
extension is backed by utf8(). Parquet will recognize this extension
and appropriately propagate the LogicalType to the storage format.

@github-actions
Copy link

Thanks for opening a pull request!

If this is not a minor PR. Could you open an issue for this pull request on JIRA? https://issues.apache.org/jira/browse/ARROW

Opening JIRAs ahead of time contributes to the Openness of the Apache Arrow project.

Then could you also rename pull request title in the following format?

ARROW-${JIRA_ID}: [${COMPONENT}] ${SUMMARY}

or

MINOR: [${COMPONENT}] ${SUMMARY}

See also:

@progger-dev progger-dev changed the title [ARROW-17255] Add JSON canonical extension type ARROW-17255: [C++][Parquet] Add JSON canonical extension type Aug 16, 2022
@github-actions
Copy link

@github-actions
Copy link

⚠️ Ticket has not been started in JIRA, please click 'Start Progress'.

@pitrou
Copy link
Member

pitrou commented Aug 17, 2022

We don't have a way to standardize extension types yet so I posted a proposal of how to do it:
https://lists.apache.org/thread/qxc1g7h9ow79qt6r7sqtgbj8mdbdgnhb

Until that is decided I think this PR should be converted to draft. Does that sound ok @progger-dev ?

@progger-dev
Copy link
Contributor Author

@pitrou Absolutely! That's fine. I was hoping that this PR would trigger that conversation.

@progger-dev progger-dev marked this pull request as draft August 17, 2022 20:05
@pitrou
Copy link
Member

pitrou commented Oct 5, 2022

@progger-dev Since we now have an official way to standardize extension types, do you want to resurrect this by posting a proposal to the ML?


// When the original Arrow schema isn't stored and Arrow extensions are disabled,
// LogicalType::JSON is read as Binary.
const auto binary_array = ::arrow::ArrayFromJSON(::arrow::binary(), json);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The PR description says

This extension is backed by utf8()

So I would have naively expected this to be inferred as arrow::utf8?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is to preserve the existing behavior.

We currently read LogicalType::JSON as arrow::binary. So, if there is no schema information and the extensions are disabled, then we use the current behavior.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you foresee an issue with my changing the Rust parquet reader to infer as UTF-8 in apache/arrow-rs#3376, at least until such a time as this extension type is stabilised?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that should be fine. The behavior should be implementation dependent. So, if the Rust implementation reads the Parquet JSON type as utf8 it should continue to do that.

I'm not currently planning on touching the Rust implementation. If and when it is updated to support this extension, I would expect the current behavior to be preserved if the user requests the extension to be disabled.

@rok
Copy link
Member

rok commented Apr 10, 2024

@pradeepg26 did you want to continue work on this? There was some work done on canonical extension types lately (see cpp/src/arrow/extension, docs/source/format/CanonicalExtensions.rst) that could perhaps help here.

@progger-dev
Copy link
Contributor Author

I expected someone else at BigQuery to take over this work. I would love to finish this PR, but I don't think I'll have time for a while.

@rok
Copy link
Member

rok commented Apr 12, 2024

I have some time next week and can try to move it forward if you don't mind?

@rok rok force-pushed the json_support branch 4 times, most recently from 35ff87d to 598c467 Compare April 15, 2024 22:51
@rok rok marked this pull request as ready for review April 16, 2024 11:45
@rok rok requested a review from wgtmac as a code owner April 16, 2024 11:45
@rok rok requested review from pradeepg26, pitrou and mapleFU and removed request for pradeepg26 April 16, 2024 11:45
@rok
Copy link
Member

rok commented Apr 16, 2024

JNI failure seems unrelated.

Copy link
Member

@pitrou pitrou left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1, thanks a lot for pushing this through @rok !

@github-actions github-actions bot added awaiting change review Awaiting change review and removed awaiting changes Awaiting changes labels Sep 11, 2024
@rok
Copy link
Member

rok commented Sep 11, 2024

Thanks for the reviews all!

@pitrou pitrou merged commit 27acf8b into apache:main Sep 11, 2024
41 checks passed
@pitrou pitrou removed the awaiting change review Awaiting change review label Sep 11, 2024
@rok
Copy link
Member

rok commented Sep 11, 2024

Created #44066 for the Python wrapper.

Copy link

After merging your PR, Conbench analyzed the 4 benchmarking runs that have been run so far on merge-commit 27acf8b.

There was 1 benchmark result indicating a performance regression:

The full Conbench report has more details. It also includes information about 274 possible false positives for unstable benchmarks that are known to sometimes produce them.

khwilson pushed a commit to khwilson/arrow that referenced this pull request Sep 14, 2024
…che#13901)

Arrow now provides a canonical extension type for JSON data. This
extension is backed by utf8(). Parquet will recognize this extension
and appropriately propagate the LogicalType to the storage format.
* GitHub Issue: apache#32538

Lead-authored-by: Rok Mihevc <[email protected]>
Co-authored-by: Pradeep Gollakota <[email protected]>
Co-authored-by: Antoine Pitrou <[email protected]>
Co-authored-by: mwish <[email protected]>
Co-authored-by: Antoine Pitrou <[email protected]>
Signed-off-by: Antoine Pitrou <[email protected]>
Comment on lines +30 to +32
bool JsonExtensionType::ExtensionEquals(const ExtensionType& other) const {
return other.extension_name() == this->extension_name();
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This equality check does not take into account the storage type, but only the name.

As a consequence, a JsonExtensionType<string> type will be seen as equal to JsonExtensionType<large_string>. Was that intentional?

While from a user point of view, it certainly makes sense to have those seen as equal, but the same is true for string vs large_string itself. And in general in Arrow C++, the types are concrete types where variants of the same "logical" type (eg string vs large_string) are not seen as equal. So should the same logic be followed here?

I assume that such type equality will for example be used to check if schemas are equal to see if a set of batches can be concatenated or written to the same IPC stream, etc, and for those cases we require exact equality?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, that's certainly a bug. Sorry for not spotting this, and feel free to submit a fix :-)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, I suppose I missed that when switching from string only to it being a parametric type. I can make a fix later today if no one started on it yet.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didn't start yet

@github-actions github-actions bot added the awaiting changes Awaiting changes label Sep 24, 2024
Comment on lines +55 to +57
ARROW_CHECK(storage_type->id() != Type::STRING ||
storage_type->id() != Type::STRING_VIEW ||
storage_type->id() != Type::LARGE_STRING);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This check is not correct also.

@rok
Copy link
Member

rok commented Sep 24, 2024

Created and issue #44214 and opened a PR #44215 addressing both of these.

pitrou added a commit that referenced this pull request Oct 8, 2024
…#44215)

### Rationale for this change

As noted in #13901 (review):
```cpp
bool JsonExtensionType::ExtensionEquals(const ExtensionType& other) const {
  return other.extension_name() == this->extension_name();
}
```
> This equality check does not take into account the storage type, but only the name.
> As a consequence, a JsonExtensionType<string> type will be seen as equal to JsonExtensionType<large_string>.

### What changes are included in this PR?

This change introduces storage equality check into `JsonExtensionType` equality check.

This also fixes a storage type check in `JsonExtensionType::Make`.

### Are these changes tested?

Yes.

### Are there any user-facing changes?

No.
* GitHub Issue: #44214

Lead-authored-by: Rok Mihevc <[email protected]>
Co-authored-by: Antoine Pitrou <[email protected]>
Signed-off-by: Antoine Pitrou <[email protected]>
pitrou pushed a commit that referenced this pull request Oct 22, 2024
### Rationale for this change

We [added canonical JsonExtensionType](#13901) and we should make it usable from Python.

### What changes are included in this PR?

Python wrapper for `JsonExtensionType` and `JsonArray` are added on Python side as well as `JsonArray` on c++ side.

### Are these changes tested?

Python tests for the extension type and array are included.

### Are there any user-facing changes?

This adds a json canonical extension type to pyarrow.
* GitHub Issue: #44066

Lead-authored-by: Rok Mihevc <[email protected]>
Co-authored-by: Joris Van den Bossche <[email protected]>
Signed-off-by: Antoine Pitrou <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[C++][Parquet] Add JSON canonical extension type
9 participants