Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BEAM-36 : TimestampedValueInMultipleWindows changes LinkedHashSet to … #43

Closed
wants to merge 1 commit into from

Conversation

yssharma
Copy link

Minor fix for TimestampedValueInMultipleWindows. Changed LinkedHashSet to HashSet.
Checked build with test cases.

@davorbonaci
Copy link
Member

R: @kennknowles

@yssharma
Copy link
Author

If we need both order and distinct I think LinkedHashSet is still the best bet. For Array based implementation we will need to compare every element for uniqueness. Thoughts ?

@bjchambers
Copy link
Contributor

We looked some more at the original Jira issue and realized that it is likely a non-issue. It was created to track the fact we needed to examine our usage of a HashSet there, since we ran into problems with the over-allocation of a hash set (eg., 64 slots to hold 23 items, etc.). When we have 1000 of these in memory at a time, the over-allocation starts to hurt.

Upon further scrutiny, those WindowedValues should only be getting turned into a Set when we need to do equals or hashCode, to make sure we get an order-independent comparison. Assuming this is limited to tests, we can probably resolve the Jira issue as won't fix.

What follows is more detail about the solution we had in mind when we created the Jira ticket:

In general, an array-based Set can be more efficient for small numbers of elements when the equality comparison is cheap enough. It also definitely has better memory usage, since most hash tables will be over-allocated to prevent collisions.

In this case case we know the following properties about window sets:

  1. We don't call contains on these sets.
  2. These sets tend to be small (~30 elements at most)
  3. The only reason we turn them into sets is to get order-independent equals and hashCode.
  4. These sets were already de-duplicated before serialiazation. Further, since we don't create the set until calling hashCode or equality, the outside world doesn't care whether the representation is a Set or not.

Given this isn't likely on the hot-path, it may be reasonable to close the Jira ticket as "won't fix". If we want to keep it around, another option is to implement the Set as follows:

  1. Implement the Set interface, using a backing collection which has the values sorted by their hashCode.
  2. Use this collection to implement hashCode and equals. Resolve "collisions" by a linear scan over all elements with equal hashCode (similar to chaining or linear probing).

Note that this data structure is essentially a hash table where we've collapsed the table into a list.

This would have 0-additional memory overhead and it would support linear time equals and hashCode. It could throw UnsupportedOperationException for insert/contains or implement them using a binary-search on the hashCode and then a linear scan over items with equal hashCode.

@yssharma
Copy link
Author

Closing pull request. Can one of the committers mark the issue as wont fix. - Thanks

@yssharma yssharma closed this Mar 15, 2016
aljoscha pushed a commit to aljoscha/beam that referenced this pull request Mar 29, 2018
Fix compile errors after changes to SdkHarnessClient
mareksimunek pushed a commit to mareksimunek/beam that referenced this pull request May 9, 2018
dmvk pushed a commit to dmvk/beam that referenced this pull request May 15, 2018
tvalentyn pushed a commit to tvalentyn/beam that referenced this pull request May 15, 2018
kennknowles pushed a commit that referenced this pull request Oct 16, 2018
robertwb pushed a commit to robertwb/incubator-beam that referenced this pull request Apr 30, 2020
sjvanrossum pushed a commit to sjvanrossum/beam that referenced this pull request May 25, 2023
feat: add multiplexing data channel
pl04351820 pushed a commit to pl04351820/beam that referenced this pull request Dec 20, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants