Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Intern IndexFieldCapabilities Type String on Read #76405

Conversation

original-brownbear
Copy link
Member

Kind of a brute force approach to the problem but this should improve a few reported instances of slow+memory intesive field caps at no real risk IMO:

In case of handling a large number of these messages, i.e. when fetching field caps
for many indices (and/or those indices contain lots of fields) the type string is repeated
many times over. As these strings are already interned because they are constants, taking
the performance hit of interning them on deserialization seems a reasonable trade-off
for the benefit of saving a non-trivial amount of memory for large clusters as well as
speeding up org.elasticsearch.action.fieldcaps.TransportFieldCapabilitiesAction#merge
which uses these strings in map lookup and will run significantly faster with interned strings
instead of fresh strings that do not have their hash values cached yet.

In case of handling a large number of these messages, i.e. when fetching field caps
for many indices (and/or those indices contain lots of fields) the type string is repeated
many times over. As these strings are already interned because they are constants, taking
the performance hit of interning them on deserialization seems a reasonable trade-off
for the benefit of saving a non-trivial amount of memory for large clusters as well as
speeding up `org.elasticsearch.action.fieldcaps.TransportFieldCapabilitiesAction#merge`
which uses these strings in map lookup and will run significantly faster with interned strings
instead of fresh strings that do not have their hash values cached yet.
@original-brownbear original-brownbear added :Data Management/Indices APIs APIs to create and manage indices and templates v8.0.0 v7.15.0 labels Aug 12, 2021
@elasticmachine elasticmachine added the Team:Data Management Meta label for data/management team label Aug 12, 2021
@elasticmachine
Copy link
Collaborator

Pinging @elastic/es-core-features (Team:Core/Features)

@@ -50,7 +50,7 @@

IndexFieldCapabilities(StreamInput in) throws IOException {
this.name = in.readString();
this.type = in.readString();
this.type = in.readString().intern();
Copy link
Contributor

@arteam arteam Aug 12, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe String.intern is usually not recommended because you have no control over the size of the string pool. It used to be pretty bad before Java 8 where the string pool was finally moved to the heap from PermGen. I think a better approach would be to use a ConcurrentMap with Weak/SoftReferences or an actual Cache if it's available.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@arteam True true, in this one instance I figured I could get away with it because we know for a fact that these strings will be in the pool (these are field types and the field mapper objects all have them as concrete strings in the source).
I could certainly see the point of using a static constant CHM and putting the interned strings in therefor performance which will probably be quite a bit faster even in recent JDK. It's just extra code and maybe a bit of a risk in case the type ever stops coming from a constant pool of strings.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@arteam I made this a little more "high-tech" now by putting a CHM in front of intern :) Should be way faster now with the same result as before. Let me know what you think :)

Copy link
Contributor

@arteam arteam left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

@jtibshirani jtibshirani added the :Search Foundations/Mapping Index mappings, including merging and defining field types label Aug 16, 2021
@elasticmachine elasticmachine added the Team:Search Meta label for search team label Aug 16, 2021
@elasticmachine
Copy link
Collaborator

Pinging @elastic/es-search (Team:Search)

Copy link
Contributor

@henningandersen henningandersen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am largely good with this but have a couple of detailed comments.


private static final Logger logger = LogManager.getLogger(StringLiteralDeduplicator.class);

private static final int MAX_SIZE = 1000;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will start thrashing once we exceed 1000 strings. I worry about the reuse of this INSTANCE for other purposes. Can we instead make a single instance that is used solely for the field capabilities type field by making a private static instance in the IndexFieldCapabilities class? If other needs arise for this deduplicator, those would thereby be separated completely.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

++ made it a private static cache for now just limited to this class

final String interned = string.intern();
if (map.size() > MAX_SIZE) {
boolean cleared = false;
synchronized (this) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think exceeding the MAX_SIZE is unexpected. I would advocate not synchronizing in that case, since this runs on the transport thread and simply clear the map in all threads that enter this.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure if we only use this in a limited case ++ to that, simplified as requested :)

Copy link
Contributor

@henningandersen henningandersen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM.

@jtibshirani
Copy link
Contributor

Sorry for jumping in late. I was wondering if you had rough numbers showing how much this helps in terms of speed + memory? I am curious because it does add a little complexity, and imagine there are other big contributors to speed/ memory.

@original-brownbear
Copy link
Member Author

No worries @jtibshirani .

I was wondering if you had rough numbers showing how much this helps in terms of speed + memory?

Speed I'm having a hard time predicting. The only thing I have to go by here was a user running into trouble in the last step of this transport action where using the type a map key got slow on hashing it (because it's a different string over and over I assume combined with the fact that its internal bytes might not be in the CPU cache after waiting for other responses for a longer period of time) => I could see a bit of a gain here from just keying by the same type string instances every time when merging and in the specific case it would be a measurable speedup I think.

Memory is more straight forward though I think. If you take a field type string like "_routing" that is ~50b for each string instance. That quickly translates into a couple of MB saved and I've seen nodes under heavy load from these requests have hundreds of MB of duplicate type strings on heap in real-world dumps.

@original-brownbear
Copy link
Member Author

I'm merging this one now. We've identified additional spots where deduplicating these very same strings will be helpful so the added complexity will soon be reusable :)

Thanks everyone!

@original-brownbear original-brownbear merged commit 6d20dbc into elastic:master Sep 15, 2021
@original-brownbear original-brownbear deleted the intern-index-field-caps-type-on-deserialization branch September 15, 2021 11:00
original-brownbear added a commit to original-brownbear/elasticsearch that referenced this pull request Sep 15, 2021
In case of handling a large number of these messages, i.e. when fetching field caps
for many indices (and/or those indices contain lots of fields) the type string is repeated
many times over. As these strings are already interned because they are constants, taking
the performance hit of interning them on deserialization seems a reasonable trade-off
for the benefit of saving a non-trivial amount of memory for large clusters as well as
speeding up `org.elasticsearch.action.fieldcaps.TransportFieldCapabilitiesAction#merge`
which uses these strings in map lookup and will run significantly faster with interned strings
instead of fresh strings that do not have their hash values cached yet.
original-brownbear added a commit that referenced this pull request Sep 15, 2021
In case of handling a large number of these messages, i.e. when fetching field caps
for many indices (and/or those indices contain lots of fields) the type string is repeated
many times over. As these strings are already interned because they are constants, taking
the performance hit of interning them on deserialization seems a reasonable trade-off
for the benefit of saving a non-trivial amount of memory for large clusters as well as
speeding up `org.elasticsearch.action.fieldcaps.TransportFieldCapabilitiesAction#merge`
which uses these strings in map lookup and will run significantly faster with interned strings
instead of fresh strings that do not have their hash values cached yet.
@original-brownbear original-brownbear restored the intern-index-field-caps-type-on-deserialization branch April 18, 2023 20:55
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
:Data Management/Indices APIs APIs to create and manage indices and templates >non-issue :Search Foundations/Mapping Index mappings, including merging and defining field types Team:Data Management Meta label for data/management team Team:Search Meta label for search team v7.16.0 v8.0.0-beta1
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants