Add in the ability to fingerprint JSON columns #11002

revans2 · 2024-06-07T18:31:17Z

This gives datagen the ability to automatically gather statistics about JSON-formatted columns, so that the tool can produce data that works with JSON parsing functions like get_json_object or from_json. It also replaces some of the previous JSON generation code.
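To illustrate the idea of gathering statistics about a JSON column, here is a minimal sketch in plain Python. It is not the datagen implementation from this PR (which lives in Scala); the function name `fingerprint_json_column`, the path notation, and the choice of statistics (a count of value types per path, plus a count of unparseable rows) are all assumptions made for illustration.

```python
import json
from collections import Counter, defaultdict


def fingerprint_json_column(rows):
    """Collect, per JSON path, how often each value type appears.

    `rows` is an iterable of strings (or None) standing in for one column.
    Returns (stats, invalid), where stats maps path -> Counter of type names
    and invalid is the number of rows that did not parse as JSON.
    """
    stats = defaultdict(Counter)
    invalid = 0

    def walk(value, path):
        if isinstance(value, dict):
            stats[path]["object"] += 1
            for key, child in value.items():
                walk(child, f"{path}.{key}")
        elif isinstance(value, list):
            stats[path]["array"] += 1
            for child in value:
                walk(child, f"{path}[*]")
        elif value is None:
            stats[path]["null"] += 1
        elif isinstance(value, bool):
            # bool must be checked before int: bool is a subclass of int.
            stats[path]["boolean"] += 1
        elif isinstance(value, (int, float)):
            stats[path]["number"] += 1
        else:
            stats[path]["string"] += 1

    for row in rows:
        if row is None:
            continue  # SQL NULLs are skipped in this sketch
        try:
            walk(json.loads(row), "$")
        except json.JSONDecodeError:
            invalid += 1
    return dict(stats), invalid


sample = ['{"a": 1, "b": [true, null]}', '{"a": "x"}', "not json", None]
stats, invalid = fingerprint_json_column(sample)
```

A generator could then use `stats` to decide which keys to emit and which value types to produce at each path, and `invalid` to decide how often to emit malformed rows.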

It does this by introducing the concept of using multiple different data gens to produce a single column. Right now that is limited to string columns, but it could be expanded to other types in the future. I would also like to extend the same concepts so that we could fingerprint a table the same way and use the result as a good starting point.
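The idea of several data gens feeding one string column can be sketched as follows. This is only an illustration in plain Python, not the datagen API: the three sub-generators, the `mixed_string_column` helper, the weights, and the per-row seeding scheme are all hypothetical.

```python
import json
import random


def json_object_gen(rng):
    # Hypothetical sub-generator: a small, valid JSON object.
    return json.dumps({"id": rng.randint(0, 99), "tag": rng.choice(["a", "b"])})


def json_array_gen(rng):
    # Hypothetical sub-generator: a valid JSON array of numbers.
    return json.dumps([rng.randint(0, 9) for _ in range(rng.randint(0, 3))])


def invalid_gen(rng):
    # Hypothetical sub-generator: deliberately malformed JSON.
    return '{"unterminated": '


def mixed_string_column(num_rows, weighted_gens, seed=0):
    """Build one string column by picking a sub-generator for each row."""
    gens = [g for g, _ in weighted_gens]
    weights = [w for _, w in weighted_gens]
    column = []
    for row in range(num_rows):
        # Assumption: seed per row, so a row's value does not depend on
        # the rows generated before it.
        rng = random.Random(seed * 1_000_003 + row)
        gen = rng.choices(gens, weights=weights, k=1)[0]
        column.append(gen(rng))
    return column


weighted = [(json_object_gen, 6), (json_array_gen, 3), (invalid_gen, 1)]
column = mixed_string_column(50, weighted, seed=42)
```

In this sketch the weights could come from a fingerprint of real data, so that the generated column has a similar mix of objects, arrays, and invalid rows.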

Signed-off-by: Robert (Bobby) Evans <[email protected]>

revans2 · 2024-06-07T18:31:27Z

build

revans2 · 2024-06-10T20:39:07Z

build

datagen/src/main/scala/org/apache/spark/sql/tests/datagen/bigDataGen.scala

revans2 · 2024-06-12T15:50:59Z

@jlowe please take another look when you get a chance

revans2 · 2024-06-12T15:51:08Z

build

This reverts commit d9686d4.

Revert "Add in the ability to fingerprint JSON columns (#11002)" [skip ci]

Signed-off-by: Robert (Bobby) Evans <[email protected]>

Add in the ability to fingerprint JSON columns

676c61f

Signed-off-by: Robert (Bobby) Evans <[email protected]>

sameerz added the data gen label Jun 7, 2024

revans2 added 2 commits June 10, 2024 14:48

Add main for JSON fingerprint

9eca4f2

Scala 2.13

4c67226

jlowe reviewed Jun 11, 2024

Review Comments

1403f60

jlowe approved these changes Jun 12, 2024

revans2 merged commit d9686d4 into NVIDIA:branch-24.08 Jun 12, 2024
44 checks passed

revans2 deleted the json_datagen branch June 12, 2024 21:26

firestarman mentioned this pull request Jun 13, 2024

[BUG] Build on Databricks 330 fails #11053

Closed

revans2 added a commit to revans2/spark-rapids that referenced this pull request Jun 13, 2024

Revert "Add in the ability to fingerprint JSON columns (NVIDIA#11002)"

cfd8f00

This reverts commit d9686d4.

revans2 added a commit that referenced this pull request Jun 13, 2024

Merge pull request #11059 from revans2/revert_json_datagen

900ae6f

Revert "Add in the ability to fingerprint JSON columns (#11002)" [skip ci]

revans2 mentioned this pull request Jun 13, 2024

Add in the ability to fingerprint JSON columns [databricks] #11060

Merged

SurajAralihalli pushed a commit to SurajAralihalli/spark-rapids that referenced this pull request Jul 12, 2024

Add in the ability to fingerprint JSON columns (NVIDIA#11002)

688ac10

Signed-off-by: Robert (Bobby) Evans <[email protected]>

SurajAralihalli pushed a commit to SurajAralihalli/spark-rapids that referenced this pull request Jul 12, 2024

Revert "Add in the ability to fingerprint JSON columns (NVIDIA#11002)"

71415ca

This reverts commit d9686d4.
