GH-32538: [C++][Parquet] Add JSON canonical extension type #13901

progger-dev · 2022-08-16T20:38:56Z

Arrow now provides a canonical extension type for JSON data. This
extension is backed by utf8(). Parquet will recognize this extension
and appropriately propagate the LogicalType to the storage format.

GitHub Issue: [C++][Parquet] Add JSON canonical extension type #32538

github-actions · 2022-08-16T20:39:16Z

Thanks for opening a pull request!

If this is not a minor PR, could you open an issue for this pull request on JIRA? https://issues.apache.org/jira/browse/ARROW

Opening JIRAs ahead of time contributes to the Openness of the Apache Arrow project.

Then could you also rename the pull request title in the following format?

ARROW-${JIRA_ID}: [${COMPONENT}] ${SUMMARY}

or

MINOR: [${COMPONENT}] ${SUMMARY}

github-actions · 2022-08-16T20:43:50Z

https://issues.apache.org/jira/browse/ARROW-17255

github-actions · 2022-08-16T20:43:52Z

⚠️ Ticket has not been started in JIRA, please click 'Start Progress'.

pitrou · 2022-08-17T16:45:42Z

We don't have a way to standardize extension types yet so I posted a proposal of how to do it:
https://lists.apache.org/thread/qxc1g7h9ow79qt6r7sqtgbj8mdbdgnhb

Until that is decided I think this PR should be converted to draft. Does that sound ok @progger-dev ?

progger-dev · 2022-08-17T17:29:10Z

@pitrou Absolutely! That's fine. I was hoping that this PR would trigger that conversation.

pitrou · 2022-10-05T16:02:13Z

@progger-dev Since we now have an official way to standardize extension types, do you want to resurrect this by posting a proposal to the ML?

tustvold · 2022-12-20T15:15:19Z

cpp/src/parquet/arrow/arrow_reader_writer_test.cc

+
+  // When the original Arrow schema isn't stored and Arrow extensions are disabled,
+  // LogicalType::JSON is read as Binary.
+  const auto binary_array = ::arrow::ArrayFromJSON(::arrow::binary(), json);


The PR description says "This extension is backed by utf8()", so I would have naively expected this to be inferred as arrow::utf8?

pradeepg26 · 2022-12-20

This is to preserve the existing behavior.

We currently read LogicalType::JSON as arrow::binary. So, if there is no schema information and the extensions are disabled, then we use the current behavior.

tustvold · 2022-12-20

Do you foresee an issue with my changing the Rust parquet reader to infer as UTF-8 in apache/arrow-rs#3376, at least until such a time as this extension type is stabilised?

pradeepg26 · 2022-12-20

I think that should be fine. The behavior should be implementation dependent. So, if the Rust implementation reads the Parquet JSON type as utf8 it should continue to do that.

I'm not currently planning on touching the Rust implementation. If and when it is updated to support this extension, I would expect the current behavior to be preserved if the user requests the extension to be disabled.
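The read-side fallback discussed in this thread can be pictured as a small decision function. This is only a sketch under stated assumptions: `ReadAs` and `ResolveJsonColumn` are hypothetical names, not Arrow or Parquet API, and only the final branch (no stored schema, extensions disabled, read as binary) is stated in the thread. The other two branches are assumptions about how the extension would be surfaced.

```cpp
// Hypothetical model of the reader's choice for a Parquet column whose
// logical type is JSON. None of these names exist in Arrow.
enum class ReadAs {
  kStoredArrowType,    // type restored from an Arrow schema saved in the file
  kJsonExtensionUtf8,  // JSON extension type backed by utf8()
  kBinary              // pre-existing behavior for LogicalType::JSON
};

constexpr ReadAs ResolveJsonColumn(bool has_stored_schema,
                                   bool extensions_enabled) {
  return has_stored_schema    ? ReadAs::kStoredArrowType    // assumption
         : extensions_enabled ? ReadAs::kJsonExtensionUtf8  // assumption
                              : ReadAs::kBinary;  // stated in the thread
}

// The one case the thread spells out: nothing stored, extensions off.
static_assert(ResolveJsonColumn(false, false) == ReadAs::kBinary,
              "legacy behavior is kept when extensions are disabled");
```

Under this model, a reader in another language (such as the Rust one mentioned above) could pick a different legacy type for the last branch without affecting the other two.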

rok · 2024-04-10T22:46:09Z

@pradeepg26 did you want to continue work on this? There was some work done on canonical extension types lately (see cpp/src/arrow/extension, docs/source/format/CanonicalExtensions.rst) that could perhaps help here.

progger-dev · 2024-04-12T20:43:23Z

I expected someone else at BigQuery to take over this work. I would love to finish this PR, but I don't think I'll have time for a while.

rok · 2024-04-12T20:48:39Z

I have some time next week and can try to move it forward if you don't mind?

rok · 2024-04-16T14:40:29Z

JNI failure seems unrelated.

pitrou

+1, thanks a lot for pushing this through @rok !

rok · 2024-09-11T10:15:39Z

Thanks for the reviews all!

rok · 2024-09-11T12:17:12Z

Created #44066 for the Python wrapper.

conbench-apache-arrow · 2024-09-12T00:22:53Z

After merging your PR, Conbench analyzed the 4 benchmarking runs that have been run so far on merge-commit 27acf8b.

There was 1 benchmark result indicating a performance regression:

Commit Run on amd64-m5-4xlarge-linux at 2024-09-11 12:10:49Z
- dataset-serialize (Python) with dataset=nyctaxi_multi_parquet_s3, format=feather, selectivity=1pc

The full Conbench report has more details. It also includes information about 274 possible false positives for unstable benchmarks that are known to sometimes produce them.

…che#13901)

Arrow now provides a canonical extension type for JSON data. This extension is backed by utf8(). Parquet will recognize this extension and appropriately propagate the LogicalType to the storage format.

* GitHub Issue: apache#32538

Lead-authored-by: Rok Mihevc <[email protected]>
Co-authored-by: Pradeep Gollakota <[email protected]>
Co-authored-by: Antoine Pitrou <[email protected]>
Co-authored-by: mwish <[email protected]>
Co-authored-by: Antoine Pitrou <[email protected]>
Signed-off-by: Antoine Pitrou <[email protected]>

jorisvandenbossche · 2024-09-24T07:17:08Z

cpp/src/arrow/extension/json.cc

+bool JsonExtensionType::ExtensionEquals(const ExtensionType& other) const {
+  return other.extension_name() == this->extension_name();
+}


This equality check does not take into account the storage type, but only the name.

As a consequence, a JsonExtensionType<string> type will be seen as equal to JsonExtensionType<large_string>. Was that intentional?

From a user point of view, it certainly makes sense to have those seen as equal, but the same is true for string vs large_string itself. In general in Arrow C++, types are concrete, and variants of the same "logical" type (e.g. string vs large_string) are not seen as equal. So should the same logic be followed here?

I assume that such type equality will, for example, be used to check whether schemas are equal, to see if a set of batches can be concatenated or written to the same IPC stream, etc., and for those cases we require exact equality?

pitrou · 2024-09-24

No, that's certainly a bug. Sorry for not spotting this, and feel free to submit a fix :-)

rok · 2024-09-24

Oh, I suppose I missed that when switching from string only to it being a parametric type. I can make a fix later today if no one has started on it yet.

jorisvandenbossche · 2024-09-24

I haven't started yet.
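The difference between the two notions of equality in this thread can be shown without Arrow at all. The sketch below is a simplified model, not the fix that was eventually merged: `ExtName`, `StorageId`, `ExtType` and both comparison helpers are hypothetical stand-ins for an extension type that carries a name and a storage type.

```cpp
// Hypothetical stand-ins; these are not Arrow types.
enum class ExtName { JSON, OTHER };
enum class StorageId { STRING, LARGE_STRING, STRING_VIEW };

struct ExtType {
  ExtName name;       // plays the role of extension_name()
  StorageId storage;  // plays the role of storage_type()
};

// Mirrors the quoted ExtensionEquals: only the extension name is compared.
constexpr bool NameOnlyEquals(const ExtType& a, const ExtType& b) {
  return a.name == b.name;
}

// Concrete-type equality: name and storage type must both match.
constexpr bool StrictEquals(const ExtType& a, const ExtType& b) {
  return a.name == b.name && a.storage == b.storage;
}

constexpr ExtType kJsonString{ExtName::JSON, StorageId::STRING};
constexpr ExtType kJsonLargeString{ExtName::JSON, StorageId::LARGE_STRING};

static_assert(NameOnlyEquals(kJsonString, kJsonLargeString),
              "name-only comparison conflates the two storage variants");
static_assert(!StrictEquals(kJsonString, kJsonLargeString),
              "strict comparison tells them apart");
static_assert(StrictEquals(kJsonString, kJsonString),
              "a type still equals itself");
```

With the name-only comparison, two schemas that differ only in the JSON column's storage type would compare equal, which is the situation the concatenation and IPC-stream concern above refers to.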

rok · 2024-09-24T10:44:38Z

cpp/src/arrow/extension/json.cc

+  ARROW_CHECK(storage_type->id() != Type::STRING ||
+              storage_type->id() != Type::STRING_VIEW ||
+              storage_type->id() != Type::LARGE_STRING);


This check is also incorrect.
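A short illustration of why the quoted condition cannot reject anything: no type id can equal three distinct constants at once, so at least one of the `!=` comparisons always holds and the `||` chain is always true. The sketch below uses a hypothetical `TypeId` enum as a stand-in for Arrow's type ids; `QuotedCheck` and `IsStringLike` are illustrative helper names, and the second one only shows what the check presumably intends, not necessarily how the fix in #44215 is written.

```cpp
// Hypothetical stand-in for Arrow's type ids; only a few values are modeled.
enum class TypeId { STRING, LARGE_STRING, STRING_VIEW, BINARY, INT32 };

// Same shape as the quoted condition: true for every possible id.
constexpr bool QuotedCheck(TypeId id) {
  return id != TypeId::STRING || id != TypeId::STRING_VIEW ||
         id != TypeId::LARGE_STRING;
}

// Presumed intent: accept only string-like storage types.
constexpr bool IsStringLike(TypeId id) {
  return id == TypeId::STRING || id == TypeId::STRING_VIEW ||
         id == TypeId::LARGE_STRING;
}

static_assert(QuotedCheck(TypeId::INT32),
              "non-string storage passes the quoted check");
static_assert(QuotedCheck(TypeId::STRING),
              "string storage passes it as well");
static_assert(!IsStringLike(TypeId::INT32),
              "the intended check rejects non-string storage");
static_assert(IsStringLike(TypeId::LARGE_STRING),
              "the intended check accepts string storage");
```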

rok · 2024-09-24T11:59:17Z

Created an issue #44214 and opened a PR #44215 addressing both of these.

…#44215)

### Rationale for this change

As noted in #13901 (review):

```cpp
bool JsonExtensionType::ExtensionEquals(const ExtensionType& other) const {
  return other.extension_name() == this->extension_name();
}
```

> This equality check does not take into account the storage type, but only the name.

> As a consequence, a JsonExtensionType<string> type will be seen as equal to JsonExtensionType<large_string>.

### What changes are included in this PR?

This change introduces storage equality check into `JsonExtensionType` equality check. This also fixes a storage type check in `JsonExtensionType::Make`.

### Are these changes tested?

Yes.

### Are there any user-facing changes?

No.

* GitHub Issue: #44214

Lead-authored-by: Rok Mihevc <[email protected]>
Co-authored-by: Antoine Pitrou <[email protected]>
Signed-off-by: Antoine Pitrou <[email protected]>

### Rationale for this change

We [added canonical JsonExtensionType](#13901) and we should make it usable from Python.

### What changes are included in this PR?

Python wrappers for `JsonExtensionType` and `JsonArray` are added on the Python side, as well as `JsonArray` on the C++ side.

### Are these changes tested?

Python tests for the extension type and array are included.

### Are there any user-facing changes?

This adds a JSON canonical extension type to pyarrow.

* GitHub Issue: #44066

Lead-authored-by: Rok Mihevc <[email protected]>
Co-authored-by: Joris Van den Bossche <[email protected]>
Signed-off-by: Antoine Pitrou <[email protected]>

github-actions bot added Component: C++ Component: Parquet labels Aug 16, 2022

progger-dev changed the title ~~[ARROW-17255] Add JSON canonical extension type~~ ARROW-17255: [C++][Parquet] Add JSON canonical extension type Aug 16, 2022

progger-dev marked this pull request as draft August 17, 2022 20:05

tustvold reviewed Dec 20, 2022

tustvold mentioned this pull request Dec 20, 2022

Infer Parquet JSON Logical and Converted Type as UTF-8 apache/arrow-rs#3376

Merged

asfimport mentioned this pull request Dec 20, 2022

[C++][Parquet] Add JSON canonical extension type #32538

Closed

rok force-pushed the json_support branch from c940208 to 22b37df Compare April 14, 2024 21:59

github-actions bot added the Component: Documentation and awaiting review labels Apr 14, 2024

rok force-pushed the json_support branch 4 times, most recently from 35ff87d to 598c467 Compare April 15, 2024 22:51

rok marked this pull request as ready for review April 16, 2024 11:45

rok requested a review from wgtmac as a code owner April 16, 2024 11:45

rok requested review from pradeepg26, pitrou and mapleFU and removed request for pradeepg26 April 16, 2024 11:45

rok and others added 7 commits September 11, 2024 11:30

Review feedback

5551d7b

Review feedback

e9b44ad

Apply suggestions from code review

9c09cbe

Co-authored-by: Antoine Pitrou <[email protected]>

Review feedback

f518ebf

Update cpp/src/parquet/arrow/arrow_schema_test.cc

e2f82a8

Co-authored-by: Antoine Pitrou <[email protected]>

Review feedback

e32805e

Fix lint

1ca8f1b

pitrou force-pushed the json_support branch from 8b27962 to 1ca8f1b Compare September 11, 2024 09:31

pitrou approved these changes Sep 11, 2024

github-actions bot added the awaiting change review label and removed the awaiting changes label Sep 11, 2024

pitrou merged commit 27acf8b into apache:main Sep 11, 2024
41 checks passed

pitrou removed the awaiting change review label Sep 11, 2024

This was referenced Sep 11, 2024

feat: support UUID as arrow type lancedb/lance#1471

Closed

[Arrow] Add UUID and JSON extension types duckdb/duckdb#13446

Merged

[Python] Add Python wrapper for JsonExtensionType #44066

Closed

This was referenced Sep 11, 2024

GH-38007: [C++] Add VariableShapeTensor implementation #38008

Open

GH-44066: [Python] Add Python wrapper for JsonExtensionType #44070

Merged

jorisvandenbossche reviewed Sep 24, 2024

github-actions bot added the awaiting changes label Sep 24, 2024

rok reviewed Sep 24, 2024

This was referenced Sep 24, 2024

[C++] JsonExtensionType equality check ignores storage type #44214

Closed

GH-44214: [C++] JsonExtensionType equality check ignores storage type #44215

Merged
