Report stats related to new sizing guidance #86639

Closed
3 tasks done
Tracked by #77466
DaveCTurner opened this issue May 10, 2022 · 33 comments
Labels
:Data Management/Stats Statistics tracking and retrieval APIs >enhancement Team:Data Management Meta label for data/management team

Comments

@DaveCTurner
Contributor

DaveCTurner commented May 10, 2022

In #86223 we adjusted our guidance about node capacity to be more in line with actual resource usage and constraints given recent changes (#77466). The metrics mentioned in the guidance are a little difficult to understand since they depend on opaque mechanisms like mapping deduplication. We should report the relevant stats directly to avoid any misunderstandings.

In particular, in nodes stats we should report the total estimated overhead of field mappers on each node, and in cluster stats we should report the total number of fields in the cluster (after deduplication).

Edited (2022-06-09) to change "total number of field mappers" to "total estimated overhead of field mappers", because I expect we'll want to refine our "1kB-per-field" guidance in the future, which we can do if we build this guidance into ES too. Total number of field mappers is also likely still useful.

@DaveCTurner DaveCTurner added >enhancement :Data Management/Stats Statistics tracking and retrieval APIs labels May 10, 2022
@elasticmachine elasticmachine added the Team:Data Management Meta label for data/management team label May 10, 2022
@elasticmachine
Collaborator

Pinging @elastic/es-data-management (Team:Data Management)

DaveCTurner added a commit to DaveCTurner/elasticsearch that referenced this issue Jun 9, 2022
Adds measures of the total size of all mappings and the total number of
fields in the cluster (both before and after deduplication).

Relates elastic#86639
Relates elastic#77466
@DaveCTurner
Contributor Author

#87556 addresses the cluster-wide stats. Adding the node-level stats is a little bit tricky because today everything is done at the shard level, and yet mappers are shared across all shards in an index on each node.

DaveCTurner added a commit that referenced this issue Jun 14, 2022
Adds measures of the total size of all mappings and the total number of
fields in the cluster (both before and after deduplication).

Relates #86639
Relates #77466
@kingherc
Contributor

I briefly discussed this feature with @original-brownbear, who gave me the following tips on getting started:

  • We can introduce a new mappers field under NodeIndicesStats with two values, one for the total field count (note that there's no deduplication for data nodes) and one for the estimated overhead (just multiply the field count by 1KiB).
  • In the org.elasticsearch.indices.IndicesService#statsByShard function, for each index, use the IndexService to get the MapperService, and from that get the count of the fieldMappers. There's a small doubt about whether this count includes nested ones, but I will verify.
  • Maybe add a test for NodeStats similar to ClusterStatsIT.

@kingherc kingherc self-assigned this Aug 25, 2022
@kingherc
Contributor

kingherc commented Aug 25, 2022

@original-brownbear , I see that somebody can also call _nodes/stats?level=shards for shard-level info. Reading @DaveCTurner 's comment above, I see that mappers are shared across all shards in an index on each node. Thus, I understand that the total number of fields we want to expose should only be part of the node level (under nodes > indices in the final json) and not part of the indices level (I mean under nodes > indices > indices in the final json) nor the shards level. Feel free to correct me if I am wrong.

@DaveCTurner
Contributor Author

note that there's no deduplication for data nodes

Although there's no deduplication of mappers today I expect this will be subject to improvement in a future version (see #86440). It would be good if the solution on which we land doesn't assume "no deduplication" too fundamentally.

the total number of fields we want to expose should only be part of the node level ... and not part of the indices ... nor shards level

Definitely yes to node-level and no to shard-level, but I think it would make sense to report it at the index level too. That way users can see which indices are contributing most to this memory usage which will help them address it.
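A minimal sketch of why the numbers belong at the node and index levels rather than the shard level: since mappers are shared across all shards of an index on a node, each index should be counted once per node no matter how many of its shards live there. The function name, the shard tuples, and the field counts below are all hypothetical, and the 1KiB-per-field factor is the current guidance rather than a measured value.

```python
from collections import defaultdict

KIB = 1024  # assumed per-field overhead, from the 1kB-per-field guidance


def node_mapping_overhead(shards, fields_per_index, bytes_per_field=KIB):
    """Estimate mapping overhead per node and per index on that node.

    shards: iterable of (node, index, shard_id) tuples (hypothetical layout).
    fields_per_index: mapping of index name to its field count.
    """
    indices_on_node = defaultdict(set)
    for node, index, _shard_id in shards:
        # A set ensures several shards of one index count only once per node.
        indices_on_node[node].add(index)

    result = {}
    for node, indices in indices_on_node.items():
        per_index = {
            index: fields_per_index[index] * bytes_per_field
            for index in sorted(indices)
        }
        result[node] = {"indices": per_index, "total": sum(per_index.values())}
    return result


shards = [
    ("node-1", "logs", 0),
    ("node-1", "logs", 1),
    ("node-1", "metrics", 0),
    ("node-2", "logs", 2),
]
fields_per_index = {"logs": 58, "metrics": 21}
overhead = node_mapping_overhead(shards, fields_per_index)
print(overhead)
```

The per-index breakdown is what would let a user see which indices contribute most to a node's total.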

@kingherc
Contributor

kingherc commented Sep 5, 2022

Thanks @DaveCTurner ! I have an implementation underway that exposes the field mapper stats in the _nodes/stats API at both the node and index levels, but not at the shard level. A PR should follow in the next couple of days. It will look like this:

"field_mappers": {
  "total_count": 58,
  "total_estimated_overhead": "58kb",
  "total_estimated_overhead_in_bytes": 59392
}

And we should be able to extend it with more stats about deduplicated fields. The only minor complication would be adding the deduplicated stats at the node level only and not at the index level (since I guess the deduplication would happen across indices, if my reasoning is correct). But that should be an implementation complication to handle, rather than an API one.
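A minimal sketch of how the three values in the example stats relate, assuming the overhead is simply the field count multiplied by 1KiB. The helper name is hypothetical, and the human-readable string is produced by integer division, which is a simplification of however Elasticsearch formats byte sizes.

```python
KIB = 1024  # assumed per-field overhead, from the 1kB-per-field guidance


def field_mapper_stats(total_count, bytes_per_field=KIB):
    """Build a stats object shaped like the example above (hypothetical helper)."""
    overhead_bytes = total_count * bytes_per_field
    return {
        "total_count": total_count,
        # Simplified formatting: whole kilobytes only.
        "total_estimated_overhead": f"{overhead_bytes // KIB}kb",
        "total_estimated_overhead_in_bytes": overhead_bytes,
    }


stats = field_mapper_stats(58)
print(stats)
```

Reporting the byte value rather than only the count means the per-field factor can change later without changing the shape of the response.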

@DaveCTurner
Contributor Author

DaveCTurner commented Sep 5, 2022

Sounds good.

I guess the deduplication would happen across indices if my reasoning is correct

Maybe, although intra-index deduplication would be enough, and perhaps easier to implement. Edit: I was chatting with Luca and it turns out that the search folks tried some ideas for intra-index deduplication and didn't get the gains they expected, so it seems you're right 😁

I'm not sure we will need any extra stats when deduplication lands, as long as the total estimated overhead is updated to match the new implementation. Indeed that kind of future-proofing is why I think it better to report the actual bytes of estimated overhead even if it's just total_count * 1kiB today.

@kingherc
Contributor

kingherc commented Sep 5, 2022

Are labels here correct? I just realized the team is data management. But I see @DaveCTurner handled a part in this PR with different labels, and I also thought so far it was for the distributed team.

@DaveCTurner
Contributor Author

It could be owned by several teams - the distrib team broadly own the recent sizing guidance work (#77466) even if this particular sizing concern could arguably be handled under :Search/Mapping, and the data management team own stats in general.

@kingherc
Contributor

kingherc commented Sep 5, 2022

OK, thanks. Just to note: I am working on the node-level stats for field mappings stats. I have a PR in the works.

Definitely yes to node-level and no to shard-level, but I think it would make sense to report it at the index level too.

@DaveCTurner just to be clear on this point: I discussed with @original-brownbear and we think for now we can add the new field mapper stats to the _nodes/stats API only. They will be available at both ?level=node and ?level=indices but not at ?level=shards. For the moment we do not think it is useful to add these stats to the cluster-wide index/_stats API.

@DaveCTurner
Contributor Author

For the moment we think there is no usefulness to add these stats in the cluster-wide index/_stats API.

I see some value in having these stats in the indices stats API too, because I'd expect if you told users that they had too many fields in the node stats then they'd mostly use the indices stats API to investigate further. But I see that this is quite a different thing from the changes to the nodes stats and not quite so important so I'm ok with leaving this out for now.

@kingherc
Contributor

kingherc commented Sep 6, 2022

Indeed I also thought the same way, but we discussed with @original-brownbear that it's a bit of a chicken-and-egg problem: one needs to actually create the indices to get the estimated overhead of the field mappers on the data nodes, and then potentially re-try a different allocation or something to again get a new estimation and so on. I think ideally one would need an offline tool to give an estimation of the total dataset (or indices) and then how many data nodes would be ideal to have.

So indeed for the moment we will expose them in the node stats. But if in the future we also want to expose them in index stats, that will be also possible with some further implementation.

@javanna
Member

javanna commented Sep 6, 2022

For more info about the deduplication effort of mappings in a data node, see here. I see the mention of field mappers in this issue, and I wanted to raise that while indexed fields have both an instance of a field mapper as well as a mapped field type, runtime fields don't have a corresponding field mapper but only a mapped field type. Should the new stats target also runtime fields, and in that case be less specific about "field mappers" in the API response? Should we rather have a higher-level estimation of the overhead of a MappingLookup for a certain index, which is shared across shards of the same index that are allocated on the same data node?

@kingherc
Contributor

kingherc commented Sep 6, 2022

Hi, thanks for the conversation as I am also getting more nuances and terminology :)

For the moment, the implementation I am trying counts the field mappers using essentially the line indexService.mapperService().mappingLookup().fieldMappers().size(). This does not account for runtime fields. I am not sure we would want to consider runtime fields, since we would like to estimate the overhead of mapped fields in data nodes.

Here is an example of what my in-progress implementation does. If an index has the following mapping:

  "mappings": {
    "runtime": {
      "day_of_week": {
        "type": "keyword",
        "script": {
          "source": "emit(doc['@timestamp'].value.dayOfWeekEnum.getDisplayName(TextStyle.FULL, Locale.ROOT))",
          "lang": "painless"
        }
      }
    },
    "properties": {
      "@timestamp": {
        "type": "date"
      },
      "authors": {
        "properties": {
          "company": {
            "type": "text",
            "fields": {
              "keyword": {
                "type": "keyword",
                "ignore_above": 256
              }
            }
          },
          "first_name": {
            "type": "keyword"
          },
          "full_name": {
            "type": "text"
          },
          "last_name": {
            "type": "keyword"
          }
        }
      },
      "title": {
        "type": "text"
      },
      "url": {
        "type": "keyword"
      }
    }
  }

It will then calculate the following fields:

1. authors.last_name - {"last_name":{"type":"keyword"}}
2. _data_stream_timestamp - {}
3. _routing - {}
4. _feature - {}
5. authors.full_name - {"full_name":{"type":"text"}}
6. _source - {}
7. _id - {}
8. @timestamp - {"@timestamp":{"type":"date"}}
9. _version - {}
10. url - {"url":{"type":"keyword"}}
11. title - {"title":{"type":"text"}}
12. authors.company - {"company":{"type":"text","fields":{"keyword":{"type":"keyword","ignore_above":256}}}}
13. _index - {}
14. authors.first_name - {"first_name":{"type":"keyword"}}
15. _seq_no - {}
16. _nested_path - {}
17. authors.company.keyword - {"keyword":{"type":"keyword","ignore_above":256}}
18. _tier - {}
19. _ignored - {}
20. _field_names - {}
21. _doc_count - {}

Meaning a total of 21 fields and thus an estimated overhead of 21*1KiB = 21KiB. As you see it does not include the runtime field. I have two questions:

  • Would we need to include the runtime field(s) in the calculation?
  • Note also that it does include a lot of other "artificial" fields that the user did not define (e.g., _tier), and I am a bit unaware of those. Should they not be counted?

Thanks!
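A minimal sketch that reproduces the count of 21 from the mapping JSON alone, assuming that every leaf field and every multi-field is one field mapper, that object fields such as authors are not, and that the metadata mappers are exactly the 13 that appear in the listing above (the set may differ between versions). The helper name is hypothetical and the runtime section is deliberately left out, matching the in-progress implementation described above.

```python
KIB = 1024  # assumed per-field overhead, per the 1KiB-per-field guidance

# Metadata mappers copied from the listing above; may differ by version.
METADATA_FIELDS = [
    "_data_stream_timestamp", "_routing", "_feature", "_source", "_id",
    "_version", "_index", "_seq_no", "_nested_path", "_tier", "_ignored",
    "_field_names", "_doc_count",
]


def leaf_field_names(properties, prefix=""):
    """Return full names of leaf fields and their multi-fields."""
    names = []
    for name, spec in properties.items():
        full_name = prefix + name
        if "properties" in spec:
            # Object field: only its leaves are counted here.
            names.extend(leaf_field_names(spec["properties"], full_name + "."))
        else:
            names.append(full_name)
            # Multi-fields such as authors.company.keyword count separately.
            names.extend(f"{full_name}.{sub}" for sub in spec.get("fields", {}))
    return names


properties = {
    "@timestamp": {"type": "date"},
    "authors": {"properties": {
        "company": {
            "type": "text",
            "fields": {"keyword": {"type": "keyword", "ignore_above": 256}},
        },
        "first_name": {"type": "keyword"},
        "full_name": {"type": "text"},
        "last_name": {"type": "keyword"},
    }},
    "title": {"type": "text"},
    "url": {"type": "keyword"},
}

user_fields = leaf_field_names(properties)
total = len(user_fields) + len(METADATA_FIELDS)
print(len(user_fields), total, total * KIB)  # 8 21 21504
```

Of the 21 entries, only 8 come from fields the user defined; the other 13 are the metadata fields the second question asks about.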

@javanna
Member

javanna commented Sep 6, 2022

My take on runtime fields is that they should be taken into account, and the same applies to object fields which are not among field mappers. I don't see a reason that they should be treated differently. I think a more accurate way to count would be: number of mapped field types + number of object mappers. I'm afraid the former is not currently exposed though.

@kingherc
Contributor

kingherc commented Sep 6, 2022

Thanks @javanna ! When you refer to object fields, do you mean the nested field like the one I have in the example in my comment above? If that's the case, then my current implementation takes them into account (you will see in the output that they are included in the flattened list).

For the runtime fields, would you be able to give me a hint of where to find them in the code so that I can count them?

@kingherc
Contributor

kingherc commented Sep 6, 2022

OK, I looked into the code a bit more. I understand that I could use the following instead:

indexService.mapperService().mappingLookup().fieldTypeLookup.fullNameToFieldType.size()

This will include field mappers, runtime fields, and the flattened list of object fields, so I believe it covers everything we need, @javanna .

Now, if we take into account the runtime fields as well, then the name I have chosen for the stats is not accurate. Maybe I can name the new node stats like this:

"mapping_lookup": {
  "total_field_count": 58,
  "total_estimated_overhead": "58kb",
  "total_estimated_overhead_in_bytes": 59392
}

What do you think @javanna , @DaveCTurner , @original-brownbear on including runtime fields and the naming?

@DaveCTurner
Contributor Author

I'm not sure mapping_lookup means anything to end-users. How about just mappings?

The important numbers from a sizing perspective are total_estimated_overhead (and total_estimated_overhead_in_bytes) - I expect we might refine the computation of these values in future versions but the fields themselves seem pretty future-proof. total_field_count also makes sense to me and explains the overhead computation (for now) and we will likely add more stats in future.

@kingherc
Contributor

kingherc commented Sep 6, 2022

Hi @DaveCTurner . I see there's already an org.elasticsearch.action.admin.cluster.stats.MappingStats which is exposed as mappings in Cluster stats. Those mappings seem to be more exact than the ones we try to estimate here in the node stats. For this reason, I would suggest having a slightly different name to differentiate between the two mapping stats (even though they appear in different APIs).

Would any of these work?

  1. field_mappings
  2. mapped_fields
  3. mapped_types
  4. mapped_field_types

@javanna
Member

javanna commented Sep 6, 2022

Objects are not counted in field type lookup: objects have their own object mapper that takes memory too, and they contribute to the total fields limit, hence I was thinking you may want to count them too. When you have some structure in your docs and hundreds of leaf fields, it's very common to end up with tens if not hundreds of objects. You can count them by inspecting MappingLookup#objectMappers.

@DaveCTurner
Contributor Author

I see there's already a org.elasticsearch.action.admin.cluster.stats.MappingStats which is exposed as mappings in Cluster stats

That's ok I think, these are also mapping stats at the node level.

@kingherc
Contributor

kingherc commented Sep 6, 2022

@javanna are you sure? I just debugged the above example I mentioned and I see the following:

[image: debugger screenshot]

This shows that the authors is indeed an objectMapper but it also appears flattened in fieldMappers and also in fieldTypeLookup.fullNameToFieldType. Moreover, the latter seems to also have the runtime field of the example. So that is why I was thinking that finally counting the fields of fieldTypeLookup.fullNameToFieldType would be fine. Right?

@kingherc
Contributor

kingherc commented Sep 6, 2022

That's ok I think, these are also mapping stats at the node level.

@DaveCTurner , oh I did not see mapping stats at the node level. Where are they in the code and in the API?

Still I would suggest we have a different name. Sounds like future trouble to just have it mappings while it's different stats with a different underlying object.

@kingherc
Contributor

kingherc commented Sep 6, 2022

Hi again! For the moment, I am thinking of naming the stats field_mappings which sounds a bit more generic than field_mappers. What do you think?

Also, in using fieldTypeLookup.fullNameToFieldType to count the mappings, I realize this will include also alias fields. Would we want to exclude alias fields from the calculations? I guess not, since I see them taking a place in the MappingLookup structures.

@javanna
Member

javanna commented Sep 6, 2022

The authors object does not appear in the field type lookup nor in the field mappers, only the leaves that belong to it do. In your case you have a single object, hence the count should be what you have + 1, but with many objects the difference is bigger (every level that contains sub-fields is a separate object).

@DaveCTurner
Contributor Author

oh I did not see mapping stats at the node level

Nono I mean we're adding node-level stats about mappings here, we should just call them mappings. The cluster-level mappings stats don't make sense at the node level so I don't see how this might cause future problems.

@kingherc
Contributor

kingherc commented Sep 8, 2022

Hi @DaveCTurner , OK, I will name them as mappings in the node-level stats if you are certain :)

@javanna , ah, I see what you mean: I should also account for the top-level field of each object. OK, I made a further example with the following mapping:

  "mappings": {
    "runtime": {
      "day_of_week": {
        "type": "keyword",
        "script": {
          "source": "emit(doc['@timestamp'].value.dayOfWeekEnum.getDisplayName(TextStyle.FULL, Locale.ROOT))",
          "lang": "painless"
        }
      }
    },
    "properties": {
      "@timestamp": {
        "type": "date"
      },
      "authors": {
        "properties": {
          "age": {
            "type": "long"
          },
          "company": {
            "type": "text",
            "fields": {
              "keyword": {
                "type": "keyword",
                "ignore_above": 256
              }
            }
          },
          "name": {
            "properties": {
              "first_name": {
                "type": "keyword"
              },
              "full_name": {
                "type": "text"
              },
              "last_name": {
                "type": "keyword"
              }
            }
          }
        }
      },
      "link": {
        "type": "alias",
        "path": "url"
      },
      "title": {
        "type": "text"
      },
      "url": {
        "type": "keyword"
      }
    }
  }

And indeed I see the following:

[image: debugger screenshot]

I understand now that the final number I'm looking for (for the field count) is the sum of fieldMappers.size(), objectMappers.size(), and runtimeFieldMappersCount of the MappingLookup object. Please tell me if I am mistaken.
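A rough sketch of that three-way sum, applied to the mapping JSON of the second example rather than to the real MappingLookup object. It rests on several assumptions that should be checked against the actual code: the alias field is counted as a field mapper, every level that contains sub-fields counts as one object mapper, the root object and the metadata fields are ignored, and the function name is hypothetical. The numbers it prints are therefore not expected to match the debugger values exactly.

```python
def count_mapping_entries(mapping):
    """Count leaf fields, objects, and runtime fields in a mapping dict."""
    counts = {
        "field_mappers": 0,
        "object_mappers": 0,
        "runtime_fields": len(mapping.get("runtime", {})),
    }

    def walk(properties):
        for spec in properties.values():
            if "properties" in spec:
                # Each level that contains sub-fields is a separate object.
                counts["object_mappers"] += 1
                walk(spec["properties"])
            else:
                # One mapper for the field plus one per multi-field.
                counts["field_mappers"] += 1 + len(spec.get("fields", {}))

    walk(mapping.get("properties", {}))
    return counts


mapping = {
    "runtime": {"day_of_week": {"type": "keyword"}},
    "properties": {
        "@timestamp": {"type": "date"},
        "authors": {"properties": {
            "age": {"type": "long"},
            "company": {
                "type": "text",
                "fields": {"keyword": {"type": "keyword", "ignore_above": 256}},
            },
            "name": {"properties": {
                "first_name": {"type": "keyword"},
                "full_name": {"type": "text"},
                "last_name": {"type": "keyword"},
            }},
        }},
        "link": {"type": "alias", "path": "url"},
        "title": {"type": "text"},
        "url": {"type": "keyword"},
    },
}

counts = count_mapping_entries(mapping)
total_field_count = sum(counts.values())
print(counts, total_field_count)
```

With two objects (authors and authors.name) and one runtime field, the sum is three higher than a count of leaf fields alone, and the gap grows with every extra level of nesting.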

kingherc added a commit to kingherc/elasticsearch that referenced this issue Sep 8, 2022
So that they are visible in NodeIndicesStats only at
the node and index (but not shard) levels.
Also visible in the _cat/nodes table.

Relates to issue elastic#86639
@kingherc
Contributor

kingherc commented Sep 8, 2022

Hi @DaveCTurner , @javanna , opened a PR at #89807 . Should I invite you as reviewers as well (apart from @original-brownbear )? Or feel free to add yourselves as reviewers and provide feedback. Thanks!

@kingherc
Contributor

PR #89807 got merged. But there is a remaining task in this ticket: "Update sizing guidance docs to refer to new stats". In the PR, we briefly updated the documentation (see it here) to tell readers to consult the new mappings node stats for the estimate. On the PR, @DaveCTurner had mentioned "I think we should consider rephrasing this whole section in terms of these stats now that they're available. We can however do that in a followup - it's the third item on the list for #86639."

@original-brownbear , @DaveCTurner do you have some recommendations on how to rephrase the section, or what would you expect it to mention (and I can try working out a first PR and then receive suggestions)?

@DaveCTurner
Contributor Author

I'd like the guidance to be in terms of the stats we expose, and ideally include some instructions about how to obtain those stats (GET _nodes/stats?filter_path=nodes.*.mappings.total_estimated_overhead* I guess) and compare them to the overall heap size on each node (also available in nodes stats at some other ?filter_path). I'd probably start from those instructions and then rework the surrounding prose so it all fits in. I'm happy to take on the docs change if you'd prefer.
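A minimal sketch of the comparison described above, working on an already-parsed nodes stats response rather than calling a cluster. The JSON paths are assumptions: the mapping stats are read from indices.mappings under each node (following the earlier comment that they sit under nodes > indices), and the heap size from jvm.mem.heap_max_in_bytes. The sample response, node ID, and function name are made up for illustration, so the paths should be verified against a real response.

```python
def mapping_overhead_ratio(nodes_stats):
    """Return {node name: estimated mapping overhead / max heap} per node."""
    ratios = {}
    for node_id, node in nodes_stats["nodes"].items():
        # Assumed path for the new stats: nodes.<id>.indices.mappings
        overhead = node["indices"]["mappings"]["total_estimated_overhead_in_bytes"]
        # Assumed path for the heap size: nodes.<id>.jvm.mem
        heap_max = node["jvm"]["mem"]["heap_max_in_bytes"]
        ratios[node.get("name", node_id)] = overhead / heap_max
    return ratios


# Made-up sample shaped like a filtered nodes stats response.
sample_response = {
    "nodes": {
        "abc123": {
            "name": "node-1",
            "indices": {
                "mappings": {
                    "total_field_count": 58,
                    "total_estimated_overhead_in_bytes": 59392,
                }
            },
            "jvm": {"mem": {"heap_max_in_bytes": 1073741824}},
        }
    }
}

ratios = mapping_overhead_ratio(sample_response)
for name, ratio in ratios.items():
    print(f"{name}: mappings use about {ratio:.4%} of max heap")
```

Expressing the result as a fraction of heap keeps the guidance independent of the absolute node size.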

@kingherc
Contributor

Thanks @DaveCTurner ! No worries, I can give it a try to give you a head start, and then you can modify the PR directly if you'd like :)

kingherc added a commit to kingherc/elasticsearch that referenced this issue Sep 22, 2022
Now that we have the estimated field mappings heap overhead
in nodes stats, we can refer to them in the guide for sizing
data nodes appropriately.

Relates to elastic#86639
kingherc added a commit that referenced this issue Sep 30, 2022
Now that we have the estimated field mappings heap overhead
in nodes stats, we can refer to them in the guide for sizing
data nodes appropriately.

Relates to #86639
kingherc added a commit to kingherc/elasticsearch that referenced this issue Sep 30, 2022
Now that we have the estimated field mappings heap overhead
in nodes stats, we can refer to them in the guide for sizing
data nodes appropriately.

Relates to elastic#86639
elasticsearchmachine pushed a commit that referenced this issue Sep 30, 2022
Now that we have the estimated field mappings heap overhead
in nodes stats, we can refer to them in the guide for sizing
data nodes appropriately.

Relates to #86639
@kingherc
Contributor

kingherc commented Oct 3, 2022

Closed with latest PR #90274

@kingherc kingherc closed this as completed Oct 3, 2022