Add in the ability to fingerprint JSON columns #11002
Merged
Signed-off-by: Robert (Bobby) Evans <[email protected]>
build
build
jlowe
reviewed
Jun 11, 2024
datagen/src/main/scala/org/apache/spark/sql/tests/datagen/bigDataGen.scala
@jlowe please take another look when you get a chance
build
jlowe
approved these changes
Jun 12, 2024
revans2
added a commit
to revans2/spark-rapids
that referenced
this pull request
Jun 13, 2024
This reverts commit d9686d4.
revans2
added a commit
that referenced
this pull request
Jun 13, 2024
Revert "Add in the ability to fingerprint JSON columns (#11002)" [skip ci]
SurajAralihalli
pushed a commit
to SurajAralihalli/spark-rapids
that referenced
this pull request
Jul 12, 2024
Signed-off-by: Robert (Bobby) Evans <[email protected]>
SurajAralihalli
pushed a commit
to SurajAralihalli/spark-rapids
that referenced
this pull request
Jul 12, 2024
This reverts commit d9686d4.
This gives datagen the ability to automatically gather statistics about JSON-formatted columns, so that the data gen tool can produce data that works with JSON parsing functions such as get_json_object or from_json. It replaces some of the previous JSON generation code.
It does this by introducing the concept of using multiple data gens to produce a single column. Right now that is limited to string columns, but it could be extended to other types in the future. I also want to extend these same concepts so that we could fingerprint a whole table the same way, which would be a good starting point.
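To make the idea concrete, here is a minimal, language-neutral sketch in Python of the two steps described above: gathering simple statistics from a sample of JSON strings, then producing one string column from several weighted sub-generators. It is not the code in `bigDataGen.scala`. The names `fingerprint_json`, `make_sub_gen`, and `make_column_gen` are hypothetical, as is the choice to fingerprint only the top-level shape of each value.

```python
import json
import random
from collections import Counter


def fingerprint_json(values):
    """Count how often each top-level JSON shape appears in a sample.

    A shape is a (kind, keys) tuple. This is a deliberately coarse,
    hypothetical fingerprint; it ignores nesting and value types.
    """
    shapes = Counter()
    for raw in values:
        try:
            doc = json.loads(raw)
        except (TypeError, ValueError):
            # Not parseable as JSON (json.JSONDecodeError is a ValueError).
            shapes[("invalid", ())] += 1
            continue
        if isinstance(doc, dict):
            shapes[("object", tuple(sorted(doc)))] += 1
        elif isinstance(doc, list):
            shapes[("array", ())] += 1
        else:
            shapes[("scalar", ())] += 1
    return shapes


def make_sub_gen(shape):
    """Return a function rng -> str that emits one value of the given shape."""
    kind, keys = shape
    if kind == "object":
        return lambda rng: json.dumps({k: rng.randint(0, 99) for k in keys})
    if kind == "array":
        return lambda rng: json.dumps(
            [rng.randint(0, 99) for _ in range(rng.randint(0, 3))])
    if kind == "scalar":
        return lambda rng: json.dumps(rng.randint(0, 99))
    # Intentionally malformed so the invalid share of the sample is kept.
    return lambda rng: "{not json"


def make_column_gen(shapes, seed=0):
    """Combine several sub-generators into one generator for a single column.

    Each call picks a sub-generator with probability proportional to how
    often its shape was seen. Assumes `shapes` is non-empty.
    """
    rng = random.Random(seed)
    kinds = sorted(shapes)  # stable order so a given seed is reproducible
    weights = [shapes[k] for k in kinds]
    gens = [make_sub_gen(k) for k in kinds]

    def next_value():
        gen = rng.choices(gens, weights=weights, k=1)[0]
        return gen(rng)

    return next_value


# Usage: fingerprint a small sample, then generate look-alike values.
sample = ['{"a": 1, "b": 2}', '{"b": 3, "a": 4}', '[1, 2]', '7', 'oops']
stats = fingerprint_json(sample)
next_value = make_column_gen(stats, seed=42)
generated = [next_value() for _ in range(5)]
```

Keeping the weights next to the sub-generators is the design point here: the generated column follows the observed mix of shapes, including malformed rows, which matters when exercising parsers on realistic input.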