
Cannot use Stack Monitoring to have Elasticsearch monitor itself #4709

Closed
thbkrkr opened this issue Jul 29, 2021 · 6 comments · Fixed by #5489
Labels
>enhancement Enhancement of existing functionality

Comments

@thbkrkr
Contributor

thbkrkr commented Jul 29, 2021

Configuring an Elasticsearch cluster to monitor itself is not possible due to a circular dependency issue (#4627).

The Elasticsearch monitoring cluster referenced in monitoring.metrics and monitoring.logs must be a separate cluster.

This is currently documented as a limitation:

CAUTION: You cannot configure an Elasticsearch cluster to monitor itself, the monitoring cluster has to be a separate cluster.

@thbkrkr thbkrkr added the >bug Something isn't working label Jul 29, 2021
@shubhaat
Contributor

I've thought a bit more about this, and on the whole I don't think this is super important for ECK or ECE (though ECE supports it). I think we see this more with ESS customers, mostly because some customers don't want to set up another cluster for monitoring. Our recommendation (best practice) is to set up a separate cluster for monitoring, because it helps when your monitored cluster is overwhelmed and keeps concerns separated. I'd not call this a bug, but just a limitation for now, and see if we get requests for this.

@malcolm061990

Hi, guys.
Firstly, thanks for your product :)
But this issue is still very relevant. Previously we deployed a raw Elasticsearch cluster (not Elastic Cloud) with several Beats, and it monitored itself. Simple.
Please sort out this issue for elastic-cloud.

@pebrc pebrc added >enhancement Enhancement of existing functionality and removed >bug Something isn't working labels Nov 23, 2021
@brsolomon-deloitte

@shubhaat is it also recommended to have a separate Kibana instance dedicated to only displaying Stack Monitoring? (Separate from a 'main' Kibana instance used to discover/search data from a 'main' Elasticsearch instance.)

@shubhaat
Contributor

Yes, that would be the case @brsolomon-deloitte. A self-monitoring cluster is easier to set up, but if your cluster goes down so does your monitoring cluster, which can be inconvenient. For production use cases it is recommended to set up a separate monitoring cluster, so that when the monitored cluster is under stress, the monitoring cluster continues to work and any alerts and such still fire.

@thbkrkr
Contributor Author

thbkrkr commented Feb 17, 2022

Update: with #5339 it almost works, because we now avoid deploying ES with an invalid monitoring config, but a tricky issue remains.

The Elasticsearch controller starts by reconciling the required k8s objects for ES (http/transport secrets and services, user/role secrets, ...).
In parallel, the association controller configures the es->es association and, as soon as the http service and the user secret exist, it sets the association conf annotation on the ES resource.
Since association reconciliation is much faster than ES reconciliation, there is no window in which pods get created without monitoring. ES reconciliation fails when adjusting the discovery config, due to a conflict on update, because the association controller has already updated the resource. Note that for ES we accept reconciling even if an association is not configured; we just requeue if it is not. I think for self-monitoring we should be strict about that. I need to check why we are doing this.
Then, the Elasticsearch controller reconciles the ES resource again, and this time it creates the pods with monitoring.
Everything looks good until the cluster is ready. In the background, ES reconciliation keeps being requeued until we can get the cluster uuid and record it in an annotation.
As soon as we reconcile the cluster uuid into the annotation, monitoring is removed and the last pod is rotated once, until the next reconciliation recreates the pods with monitoring. -The end-.

The bug is caused by the fact that we consider the association configured, and configure monitoring, iff the assocConfs map is populated. Because this map is not persisted and is set only at runtime at the beginning of the reconciliation loop, any update to the ES resource wipes the map.

# Pseudo code flow
Reconcile Elasticsearch (R1)
|- FetchWithAssociations // set assocConfs from the annotation
|- ReconcileNodeSpecs    // prepare pods specs
   |- monitoring.IsReconcilable // yes, monitoring ref is defined and configured (depends on assocConfs)
      |- WithMonitoring // yes, configure monitoring
=> pods are created with monitoring

... when the cluster is started and we get its cluster uuid

Reconcile Elasticsearch (R2)
|- FetchWithAssociations // set assocConfs from the annotation
|- ReconcileClusterUUID  // update ES with the cluster uuid annotation => reset assocConfs
|- ReconcileNodeSpecs    // prepare pods specs
   |- monitoring.IsReconcilable // no, monitoring is not configured because no assocConfs
=> pods are recreated without monitoring, last pod is rotated

...

Reconcile Elasticsearch again (like R1)
// ReconcileClusterUUID is done, it will never update the ES again
=> pods are recreated with monitoring, last pod is rotated

@thbkrkr
Contributor Author

thbkrkr commented Feb 17, 2022

Summary

If you have an Elasticsearch resource whose associations are configured, we populate a map of AssociationConf at the beginning of the reconciliation loop, using the AssociationConf stored as JSON in an annotation on the resource:

requeue, err := r.fetchElasticsearchWithAssociations(ctx, request, &es)

This map is not persisted, only set at runtime:

AssocConfs map[types.NamespacedName]commonv1.AssociationConf `json:"-"`

If there is an update to the ES resource:

return k8sClient.Update(context.Background(), cluster)

It resets the map!
So, if you depend on the map after the update, you will see that the associations are not configured even though they are 💥 .
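
For illustration, here is a minimal, self-contained sketch of the failure mode. The types and the annotation key are hypothetical simplifications, not the actual ECK code:

// Minimal, hypothetical sketch of the failure mode; not the actual ECK types.
package main

import "fmt"

// illustrative annotation key, not necessarily the real one
const assocAnnotation = "association.k8s.elastic.co/es-conf"

type Elasticsearch struct {
	Annotations map[string]string // persisted with the resource
	AssocConfs  map[string]string // runtime-only (json:"-" on the real type)
}

// fetchWithAssociations mimics populating AssocConfs from the annotation
// at the beginning of a reconciliation loop.
func fetchWithAssociations(es *Elasticsearch) {
	if conf, ok := es.Annotations[assocAnnotation]; ok {
		es.AssocConfs = map[string]string{"monitoring": conf}
	}
}

// update mimics the effect of k8sClient.Update refreshing the object:
// the runtime-only map is lost because it is never serialized.
func update(es *Elasticsearch) {
	*es = Elasticsearch{Annotations: es.Annotations}
}

func main() {
	es := &Elasticsearch{Annotations: map[string]string{
		assocAnnotation: `{"url":"https://monitoring-es-http:9200"}`,
	}}

	fetchWithAssociations(es)
	fmt.Println("configured before update:", len(es.AssocConfs) > 0) // true

	update(es) // e.g. ReconcileClusterUUID annotating the cluster
	fmt.Println("configured after update:", len(es.AssocConfs) > 0) // false 💥
}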

Note: there are several places during ES reconciliation where we update the resource for safety reasons. For example, we don't want to re-bootstrap a cluster that is already bootstrapped.

Ideas to solve this

  • Stop updating the ES resource on the fly during reconciliation and just send one update at the end. We would lose the safety benefit of these early updates.
  • Reorder the code so that we never depend on the assocConfs map after an update of the ES resource. Super wonky.
  • When we read the map, always verify that if the annotation has an association conf, the map is populated (see the sketch below).
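
A rough sketch of that third idea, reusing the simplified types from the sketch above (assocConfs and isMonitoringConfigured are hypothetical helper names, not the real ECK functions):

// Hypothetical defensive accessor: treat the annotation as the source of
// truth and lazily repopulate the runtime-only map if it was wiped.
func assocConfs(es *Elasticsearch) map[string]string {
	if len(es.AssocConfs) == 0 {
		if conf, ok := es.Annotations[assocAnnotation]; ok {
			es.AssocConfs = map[string]string{"monitoring": conf}
		}
	}
	return es.AssocConfs
}

// A check like monitoring.IsReconcilable would then go through the accessor
// instead of reading the map directly, so a mid-reconciliation update can no
// longer make the association look unconfigured.
func isMonitoringConfigured(es *Elasticsearch) bool {
	return len(assocConfs(es)) > 0
}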
