Add support for StringSet metric in Java SDK to track set of unique s… #31789

Merged: 10 commits, Jul 8, 2024

rohitsinha54
Contributor

Add support for a StringSet metric in the Java SDK to track a set of unique strings as a metric.
addresses #31788



@rohitsinha54
Contributor Author

R: @robertwb

github-actions bot commented Jul 6, 2024

Stopping reviewer notifications for this pull request: review requested by someone other than the bot, ceding control

@github-actions github-actions bot added the model label Jul 6, 2024
Contributor

@robertwb robertwb left a comment

Thanks, this looks really good.

}

/**
* Return a {@code StringSetCell} named {@code metricName}.If it doesn't exist, return {@code
Contributor

Nit: space after period.

Contributor Author

Done.

public static ByteString encodeStringSet(StringSetData data) {
try (ByteStringOutputStream output = new ByteStringOutputStream()) {
// encode the length of set
STRING_CODER.encode(String.valueOf(data.stringSet().size()), output);
Contributor

Use IterableCoder.of(StringUtf8Coder.of()) rather than encoding the length as a string.

Contributor Author

This is nice!!! Thank you.

(I will work on this and other review comments tomorrow. Didn't realize the difference between "Add single comment" and "start review" in previous comment)
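
To make the suggestion above concrete, here is a minimal sketch of a binary, length-prefixed layout for a set of strings: a 4-byte element count, then each string as a byte length followed by its UTF-8 bytes. It uses only the JDK and is not Beam's IterableCoder wire format; the class name StringSetEncodingSketch and its methods are hypothetical and exist only for illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashSet;
import java.util.Set;

// Hypothetical helper, not part of Beam: shows a binary length-prefixed layout.
final class StringSetEncodingSketch {

  // Writes the element count as a 4-byte int, then each string as
  // (4-byte length, UTF-8 bytes). No number is ever encoded as text.
  static byte[] encode(Set<String> values) {
    ByteArrayOutputStream bytes = new ByteArrayOutputStream();
    try (DataOutputStream out = new DataOutputStream(bytes)) {
      out.writeInt(values.size());
      for (String value : values) {
        byte[] utf8 = value.getBytes(StandardCharsets.UTF_8);
        out.writeInt(utf8.length);
        out.write(utf8);
      }
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
    return bytes.toByteArray();
  }

  // Reads back exactly what encode wrote.
  static Set<String> decode(byte[] encoded) {
    Set<String> values = new LinkedHashSet<>();
    try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(encoded))) {
      int count = in.readInt();
      for (int i = 0; i < count; i++) {
        byte[] utf8 = new byte[in.readInt()];
        in.readFully(utf8);
        values.add(new String(utf8, StandardCharsets.UTF_8));
      }
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
    return values;
  }

  public static void main(String[] args) {
    Set<String> original = new LinkedHashSet<>();
    original.add("a");
    original.add("bc");
    System.out.println(decode(encode(original)).equals(original));
  }
}
```

Writing the count as a binary integer avoids the String.valueOf call in the snippet under review, along with the matching parse on decode that it implies.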


@Override
public void add(String value) {
update(StringSetData.create(new HashSet<>(Collections.singletonList(value))));
Contributor

Perhaps optimize for the case where value is already in the set by skipping the update (and with it the creation of sets just to do a no-op merge, and the setting of the dirty bit).

Contributor Author

Good idea. Thanks.
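
The optimization discussed above could look roughly like the following sketch. It is a simplified stand-in that keeps one mutable set behind a lock and is not the StringSetCell implementation from this PR; DedupingCellSketch and its methods are hypothetical names.

```java
import java.util.Collections;
import java.util.HashSet;
import java.util.Set;

// Hypothetical stand-in for a metric cell; not Beam's StringSetCell.
final class DedupingCellSketch {
  private final Set<String> values = new HashSet<>();
  private boolean dirty = false;

  // Set.add returns false when the value is already present. In that case
  // no temporary set is allocated, no merge runs, and the dirty flag is untouched.
  synchronized void add(String value) {
    if (values.add(value)) {
      dirty = true;
    }
  }

  synchronized boolean isDirty() {
    return dirty;
  }

  // Called after the current contents have been reported.
  synchronized void markClean() {
    dirty = false;
  }

  // Returns a read-only copy so callers cannot change the cell's state.
  synchronized Set<String> snapshot() {
    return Collections.unmodifiableSet(new HashSet<>(values));
  }

  public static void main(String[] args) {
    DedupingCellSketch cell = new DedupingCellSketch();
    cell.add("a");
    cell.markClean();
    cell.add("a"); // duplicate: the cell stays clean
    System.out.println(cell.isDirty());
  }
}
```

The point of the early exit is that a repeated value costs one hash lookup and does not trigger another metric update.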

@AutoValue
public abstract class StringSetData implements Serializable {

public abstract Set<String> stringSet();
Contributor

Perhaps here and below we should document whether the sets are (expected to be?) immutable or not. E.g. is one allowed to modify the set after passing it to create? Modify the set returned from stringSet()?

Contributor Author

Done.
I also see that I had not defined the mutability behavior for StringSetData and StringSetResult; that is fixed now. The result needs to be immutable. For the data, I think either would work: making it mutable allows more efficient combining but can lead to a confusing contract, especially for EmptyStringSetData. I am going to make StringSetData immutable, so instances can only be combined by copying.

Contributor Author

Note: We ended up changing StringSetData to be mutable because in certain IOs, such as TextIO, the number of lineage records (per file) is very large and the cost of copying immutable hash sets became huge.

https://github.com/apache/beam/pull/32650/files#diff-5bbbc1e2641f65bfb9040d37f09cddb696d642e7a0bb5f4779b996350141b9ba
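
The two contract questions raised in this thread, and the copy cost mentioned in the note, can be illustrated with a small JDK-only sketch. It assumes the immutable variant: create takes a defensive copy, stringSet() returns a read-only set, and combine allocates a new set each time. ImmutableStringSetSketch is a hypothetical class and not the StringSetData from the PR.

```java
import java.util.Collections;
import java.util.HashSet;
import java.util.Set;

// Hypothetical immutable holder; not Beam's StringSetData.
final class ImmutableStringSetSketch {
  private final Set<String> values;

  private ImmutableStringSetSketch(Set<String> values) {
    this.values = values;
  }

  // Defensive copy: later changes to the caller's set are not visible here.
  static ImmutableStringSetSketch create(Set<String> source) {
    return new ImmutableStringSetSketch(Collections.unmodifiableSet(new HashSet<>(source)));
  }

  // Read-only: mutating calls on the result throw UnsupportedOperationException.
  Set<String> stringSet() {
    return values;
  }

  // Every combine copies both operands into a fresh set. With very large
  // sets this repeated copying is the cost the thread refers to.
  ImmutableStringSetSketch combine(ImmutableStringSetSketch other) {
    Set<String> merged = new HashSet<>(values);
    merged.addAll(other.values);
    return new ImmutableStringSetSketch(Collections.unmodifiableSet(merged));
  }

  public static void main(String[] args) {
    Set<String> source = new HashSet<>();
    source.add("a");
    ImmutableStringSetSketch data = create(source);
    source.add("b"); // does not affect data
    System.out.println(data.stringSet().contains("b"));
  }
}
```

A mutable variant would instead add elements in place, trading the simpler contract for lower allocation when sets are large.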


StringSetCell differentDirty = new StringSetCell(MetricName.named("namespace", "name"));
differentDirty.getDirty().afterModification();
Assert.assertNotEquals(stringSetCell, differentDirty);
Contributor

I'll admit this feels a bit odd. But when do cells need to be compared (or be hashable)? (If it's just following the convention of what's done elsewhere, that's fine.)

Contributor Author

This is just conventional. I did not find any direct (or even indirect) usage in my work so far.

if (metricUpdate.getSet() == null) {
return StringSetResult.empty();
}
return StringSetResult.create(new HashSet<>(((ArrayList) metricUpdate.getSet())));
Contributor

ImmutableSet.copyOf(...) is likely faster, especially for sets of {0,1} elements. (Might be worth using elsewhere too.)

Contributor Author

Oh ya true. Thanks.

@@ -184,6 +189,13 @@ private MetricUpdate makeCounterMetricUpdate(
return setStructuredName(update, name, namespace, step, tentative);
}

private MetricUpdate makeStringSetMetricUpdate(
String name, String namespace, String step, List<String> setValues, boolean tentative) {
Contributor

Why is this a List of values? Should it be a Set or Container?

Contributor Author

Ah, you are right, it can be a set. Fixed. Earlier the documentation tripped me up (https://screenshot.googleplex.com/4npsge3GVUo3zRs); I thought it needed to be a "list".


@Override
public void add(String value) {
stringSetData.combine(StringSetData.create(ImmutableSet.of("ab")));
Contributor

Again, perhaps worth optimizing for the case where value is already present.

Contributor Author

Fixed. Also fixed my typo of ignoring value and using "ab".

public StringSetData combine(Iterable<StringSetData> updates) {
StringSetData result = StringSetData.empty();
for (StringSetData update : updates) {
result = result.combine(update);
Contributor

Doing a one-at-a-time binary combine of multiple sets might be inefficient (vs. creating a single set and adding everything to it). I don't know how often this method is called though.

Contributor Author

Ah good point. Done.
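
The difference between the two strategies can be sketched with plain java.util sets. This assumes that each binary combine produces a new set, as an immutable data type would; BulkCombineSketch and its methods are hypothetical names and not the code that was merged.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical comparison of two ways to merge many sets.
final class BulkCombineSketch {

  // One-at-a-time binary combine: every step copies the accumulated result,
  // so early elements are copied again for each later update.
  static Set<String> combinePairwise(Iterable<Set<String>> updates) {
    Set<String> result = new HashSet<>();
    for (Set<String> update : updates) {
      Set<String> next = new HashSet<>(result);
      next.addAll(update);
      result = next;
    }
    return result;
  }

  // Single accumulator: each element is inserted exactly once.
  static Set<String> combineAll(Iterable<Set<String>> updates) {
    Set<String> result = new HashSet<>();
    for (Set<String> update : updates) {
      result.addAll(update);
    }
    return result;
  }

  public static void main(String[] args) {
    List<Set<String>> updates =
        Arrays.asList(
            new HashSet<>(Arrays.asList("a", "b")),
            new HashSet<>(Arrays.asList("b", "c")));
    System.out.println(combineAll(updates).equals(combinePairwise(updates)));
  }
}
```

Both methods return the same union; only the amount of copying differs, which matters when there are many updates or the sets are large.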

Contributor Author

@rohitsinha54 rohitsinha54 left a comment

@robertwb
Thanks for the prompt review. I have addressed the review comments. Please have another look.

Thank you.


robertwb added 2 commits July 8, 2024 13:06
* Null-containing sets don't need to be tested as they can no longer be constructed.
* Use vendered guava.
@robertwb robertwb merged commit de4645d into apache:master Jul 8, 2024
113 checks passed
acrites pushed a commit to acrites/beam that referenced this pull request Jul 17, 2024
Add support for StringSet metric in Java SDK to track set of unique string as metric.
addresses apache#31788

* Add support for StringSet metric in Java SDK to track set of unique string as metric.

* Fix compilation and tests

* Add support for StringSet in PortableRunner and fix some spotless java checks

* Add support for StringSet in JetRunner

* Fix precommit errors

* Fixes for review comments

* Other fixes

* Fixes for spotless java

* Fix a couple of tests.

* Null-containing sets don't need to be tested as they can no longer be constructed.
* Use vendered guava.

* unused imports

---------

Co-authored-by: Robert Bradshaw
reeba212 pushed a commit to reeba212/beam that referenced this pull request Dec 4, 2024