Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Apache DataSketches plugin for Trino #6643

Conversation

ShashwatArghode
Copy link
Contributor

Plugin to query Apache Datasketches (https://datasketches.apache.org/docs/Theta/ThetaSketchFramework.html). This includes merge and estimate function for theta sketch.

Copy link
Member

@Praveen2112 Praveen2112 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for working on this. Can we have some tests for like validating the result when we hit it to Trino. And a docs describing the functions would also be helpful.

return GroupedSketchState.class;
}

public static class GroupedSketchState
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since they are more details to capture, can we move it to a new file ? WDYT ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think keeping the structure consistent with other state factories would make more sense. Ex. HyperLogLog, LongApproximateMostFrequent, NumericHistogram etc.

@Praveen2112
Copy link
Member

Please change the commit message to Trino

@findepi findepi added enhancement New feature or request syntax-needs-review labels Jan 19, 2021
@findepi findepi requested a review from martint January 19, 2021 13:02
@ShashwatArghode ShashwatArghode force-pushed the trino_datasketches_plugin branch from b1d0670 to 6fec5ed Compare June 2, 2021 21:27
@cla-bot cla-bot bot added the cla-signed label Jun 2, 2021
@ShashwatArghode ShashwatArghode force-pushed the trino_datasketches_plugin branch from 6fec5ed to 52fc9b9 Compare June 2, 2021 21:28
@ShashwatArghode ShashwatArghode changed the title Apache DataSketches plugin for Presto Apache DataSketches plugin for Trino Jun 2, 2021
@ShashwatArghode ShashwatArghode force-pushed the trino_datasketches_plugin branch from 636a2f5 to f814770 Compare June 2, 2021 22:57
@ShashwatArghode
Copy link
Contributor Author

Apologies for the delay in getting back to this. I have addressed all the comments and added the doc as well. Thanks for reviewing the PR, do let me know if it needs more changes.

plugin/trino-datasketches/pom.xml Outdated Show resolved Hide resolved
plugin/trino-datasketches/pom.xml Outdated Show resolved Hide resolved
@@ -0,0 +1,36 @@
======================
DataSketches Connector
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this plugin is set of functions as Teradata functions and Machine learning functions. I would move to functions directory.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have made the suggested changes to the doc. Please let me know if the there is any issue with sphinx syntax or anything needs to be changed

docs/src/main/sphinx/connector/datasketches.rst Outdated Show resolved Hide resolved
docs/src/main/sphinx/connector/datasketches.rst Outdated Show resolved Hide resolved
plugin/trino-datasketches/pom.xml Outdated Show resolved Hide resolved
union.update(sketch2);
Sketch unionResult = union.getResult();

assertEquals(unionResult.getEstimate(), estimate, 0);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you also add tests using Trino queries? You can refer to TestMLQueries or any other function tests.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will look into this as soon as I get some time.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looked into TestMLQueries and adding test cases using Trino queries, but, for that, we will either need to support the creation of datasketches in Presto or register Hive UDFs for test cases for inserting Sketch in the database in binary format. We can add this when we add support for the creation of Sketches in Presto.

Copy link
Member

@ebyhr ebyhr Aug 31, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As we talked in Slack, I assume it's testable without such supports. We can prepare Sketches data in Java library and then cast to varbinary format at this time.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

BytesWritable from hadoop turned out to be the key to insert the sketches in binary format in the database. Added the test case.

@ShashwatArghode ShashwatArghode force-pushed the trino_datasketches_plugin branch from f814770 to 8a09d22 Compare August 26, 2021 06:18
@ShashwatArghode ShashwatArghode force-pushed the trino_datasketches_plugin branch from 127131b to a638a5e Compare September 1, 2021 02:05
@ShashwatArghode ShashwatArghode force-pushed the trino_datasketches_plugin branch from a638a5e to b772682 Compare September 1, 2021 02:07
@ShashwatArghode
Copy link
Contributor Author

@ebyhr, Do let me know if any more changes are needed to this.

@bitsondatadev
Copy link
Member

👋 @ShashwatArghode - this PR has become inactive. If you're still interested in working on it, please let us know, and we can try to get reviewers to help with that.

We're working on closing out old and inactive PRs, so if you're too busy or this has too many merge conflicts to be worth picking back up, we'll be making another pass to close it out in a few weeks.

@ShashwatArghode
Copy link
Contributor Author

Hi @bitsondatadev,
All the old review comments have been addressed, we can try to get reviewers to move this forward. I will resolve the conflicts in the meantime.

public interface SketchState
extends AccumulatorState
{
Slice getSketch();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

#14290 added a Theta sketch aggregation function: IcebergThetaSketchForStats
I forgot about this PR (and it is still waiting for "syntax review"), but anyway that function is for Iceberg plugin's internal use. Also this happens to be sketch building aggregation function that's not being added here.

Anyway, i was able to use UpdateSketch directly in the state class.
For sketch union aggregation, the state could probably hold directly onto an org.apache.datasketches.theta.Union object

private UnionWithParams() {}

@InputFunction
public static void input(@AggregationState SketchState state, @SqlType(StandardTypes.VARBINARY) Slice inputValue, @SqlType(StandardTypes.INTEGER) Integer normEntries, @SqlType(StandardTypes.BIGINT) Long seed)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What does normEntries mean?

{
state.setNominalEntries(normEntries);
state.setSeed(seed);
state.setSketch(inputValue);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can this overwrite the state/result after previous input function invocation?

DataSketches functions
----------------------

.. function:: theta_sketch_union(sketches) -> sketch
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

there are two variants of theta_sketch_union, one having parameters. The parameters need to be documented.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

also, should the docs say this is an aggregation function?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

-> sketch

there is no "sketch" type. should this be "varbinary"?

@findepi
Copy link
Member

findepi commented Oct 26, 2022

I was just skimming. Not reviewing until the syntax is reviewed (#6643 (comment))

@vgankidi
Copy link

vgankidi commented Dec 5, 2022

Hi @findepi , could I ask what open concerns are there on the syntax? I couldn't find relevant context from the comment above.

@vgankidi
Copy link

vgankidi commented Dec 5, 2022

We are interested in this as well and can provide any help required to take this forward.

@bitsondatadev
Copy link
Member

Hey @vgankidi, syntax needs review needs a particular review generally from @martint, @electrum, or @dain.

This just means that there needs to be a review around the consensus on how things are being done. I believe this is something we should work on defining or making a little more transparent.

@mosabua
Copy link
Member

mosabua commented Jan 11, 2024

👋 @ShashwatArghode - this PR is inactive and doesn't seem to be under development. If you'd like to continue work on this at any point in the future, feel free to re-open.

@mosabua mosabua closed this Jan 11, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Development

Successfully merging this pull request may close these issues.

7 participants