Current Behavior
On a Cloud Service (Managed) ROSA Cluster, the data_science_pipelines_application_apiserver_ready alerts are not firing and the metrics return "Empty query result" for Operator v1.34
Expected Behavior
data_science_pipelines_application_apiserver_ready alerts should fire and metrics return data
Steps To Reproduce
In the Dashboard, create a project (e.g. 'test-dspa-alerts') and deploy an example pipeline
Verify the following metrics:
data_science_pipelines_application_ready{dspa_namespace="test-dspa-alerts"}
data_science_pipelines_application_apiserver_ready{dspa_namespace="test-dspa-alerts"}
data_science_pipelines_application_persistenceagent_ready{dspa_namespace="test-dspa-alerts"}
data_science_pipelines_application_scheduledworkflow_ready{dspa_namespace="test-dspa-alerts"}
Expected result: All metrics should have value = 1
Result: data_science_pipelines_application_apiserver_ready{dspa_namespace="test-dspa-alerts"} returns "Empty query result"; the other three queries return 1.
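The metric check above could be scripted roughly as follows. This is a minimal sketch, assuming the queries are sent to the standard Prometheus HTTP API (`/api/v1/query`) and the JSON response is passed in as a dict; the helper names `build_query` and `read_instant_value` are hypothetical and not part of the operator. It distinguishes an empty result vector (the "Empty query result" case) from a metric that exists with value 0 or 1.

```python
# Metric names taken from the reproduction steps above.
METRICS = [
    "data_science_pipelines_application_ready",
    "data_science_pipelines_application_apiserver_ready",
    "data_science_pipelines_application_persistenceagent_ready",
    "data_science_pipelines_application_scheduledworkflow_ready",
]


def build_query(metric, namespace):
    """Build the PromQL selector used in the steps (hypothetical helper)."""
    return f'{metric}{{dspa_namespace="{namespace}"}}'


def read_instant_value(response):
    """Return the first sample's value as a float, or None for an empty result.

    `response` is the decoded JSON body of a Prometheus instant query.
    """
    if response.get("status") != "success":
        raise ValueError(f"query failed: {response.get('error')}")
    result = response["data"]["result"]
    if not result:
        # This is what the console reports as "Empty query result".
        return None
    # Each instant-vector sample carries value = [timestamp, "string value"].
    return float(result[0]["value"][1])
```

With a helper like this, a return value of `None` for the apiserver metric would correspond to the behavior reported here, while `1.0` or `0.0` would indicate the metric is being exported.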
Provoke a disruption in the service providing the Data Science Pipelines API in the user's namespace
Workloads > Deployments
Project: test-dspa-alerts
Scale down to 0 pods:
ds-pipeline-persistenceagent-pipelines-definition
ds-pipeline-pipelines-definition
ds-pipeline-scheduledworkflow-pipelines-definition
Result: data_science_pipelines_application_apiserver_ready{dspa_namespace="test-dspa-alerts"} returns "Empty query result"; the other three queries return 0.
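The scale-down in this step was done through the console, but it could also be scripted. The sketch below only builds the `oc scale` argument lists for the three deployments named above; `scale_command` is a hypothetical helper, and actually running the commands (for example with `subprocess.run(cmd, check=True)`) assumes a logged-in `oc` client with access to the project.

```python
# Deployment names and project taken from the reproduction steps above.
NAMESPACE = "test-dspa-alerts"
DEPLOYMENTS = [
    "ds-pipeline-persistenceagent-pipelines-definition",
    "ds-pipeline-pipelines-definition",
    "ds-pipeline-scheduledworkflow-pipelines-definition",
]


def scale_command(deployment, namespace=NAMESPACE, replicas=0):
    """Return the argv list for scaling one deployment (hypothetical helper)."""
    return [
        "oc", "scale", f"deployment/{deployment}",
        f"--replicas={replicas}",
        "-n", namespace,
    ]


# One command per deployment, ready to be passed to subprocess.run.
COMMANDS = [scale_command(name) for name in DEPLOYMENTS]
```

Passing `replicas=1` to the same helper would give the commands to restore the deployments after the test.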
Verify that after 5 minutes of disruption the following alerts are firing:
Data Science Pipeline Application Unavailable
Data Science Pipeline APIServer Unavailable
Data Science Pipeline PersistenceAgent Unavailable
Data Science Pipeline ScheduledWorkflows Unavailable
Data Science Pipelines Application Route Error Burn Rate (for 2m)
Result: All 4 alerts were firing except for the "Data Science Pipeline APIServer Unavailable" one.
Verify alerts are firing also in Alertmanager:
Result: All 4 alerts were firing except for the "Data Science Pipeline APIServer Unavailable" one.
Verify that the alert can be seen in OpenShift Cluster Monitoring Prometheus:
OpenShift Console > Monitoring > Metrics
Run this query: ALERTS{namespace=~"redhat-ods-applications|redhat-ods-monitoring|redhat-ods-operator|rhods-notebooks"}
Expected result: Alerts should be active
Actual result: All 4 alerts were firing except for the "Data Science Pipeline APIServer Unavailable" one.
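The comparison of expected versus firing alerts could be automated along these lines. This is a rough sketch, assuming the `ALERTS` query above is run through the Prometheus HTTP API and that the `alertname` label matches the display names listed in the steps, which has not been confirmed here. `firing_alert_names` and `missing_alerts` are hypothetical helpers.

```python
# Alert names as listed in the steps above; assumed (not confirmed) to match
# the `alertname` label on the ALERTS series.
EXPECTED_ALERTS = [
    "Data Science Pipeline Application Unavailable",
    "Data Science Pipeline APIServer Unavailable",
    "Data Science Pipeline PersistenceAgent Unavailable",
    "Data Science Pipeline ScheduledWorkflows Unavailable",
    "Data Science Pipelines Application Route Error Burn Rate",
]


def firing_alert_names(response):
    """Collect alertname labels of series whose alertstate is "firing"."""
    names = set()
    for sample in response["data"]["result"]:
        labels = sample["metric"]
        if labels.get("alertstate") == "firing":
            names.add(labels.get("alertname"))
    return names


def missing_alerts(response, expected=EXPECTED_ALERTS):
    """Return the expected alerts that are not firing, in listed order."""
    firing = firing_alert_names(response)
    return [name for name in expected if name not in firing]
```

Under those assumptions, the behavior reported in this issue would show up as `missing_alerts` returning only the APIServer alert.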
asanzgom changed the title from "[Bug]: On Cloud Service (Managed) Cluster data_science_pipelines_application_apiserver_ready alert is not firing for Operator v1.34" to "[Bug]: On Cloud Service (Managed) ROSA Cluster data_science_pipelines_application_apiserver_ready alert is not firing for Operator v1.34" on Oct 24, 2023
ODH Component
Data Science Pipelines
Current Behavior
On a Cloud Service (Managed) ROSA Cluster, the data_science_pipelines_application_apiserver_ready alerts are not firing and the metrics return "Empty query result" for Operator v1.34
Expected Behavior
data_science_pipelines_application_apiserver_ready alerts should fire and metrics return data
Steps To Reproduce
data_science_pipelines_application_ready{dspa_namespace="test-dspa-alerts"}
data_science_pipelines_application_apiserver_ready{dspa_namespace="test-dspa-alerts"}
data_science_pipelines_application_persistenceagent_ready{dspa_namespace="test-dspa-alerts"}
data_science_pipelines_application_scheduledworkflow_ready{dspa_namespace="test-dspa-alerts"}
Expected result: All metrics should have value = 1
Result: data_science_pipelines_application_apiserver_ready{dspa_namespace="test-dspa-alerts"} returns "Empty query result"; the other three queries return 1.
Provoke a disruption in the service providing the Data Science Pipelines API in the user's namespace
Workloads > Deployments
Project: test-dspa-alerts
Scale down to 0 pods:
ds-pipeline-persistenceagent-pipelines-definition
ds-pipeline-pipelines-definition
ds-pipeline-scheduledworkflow-pipelines-definition
Verify metrics:
data_science_pipelines_application_ready{dspa_namespace="test-dspa-alerts"}
data_science_pipelines_application_apiserver_ready{dspa_namespace="test-dspa-alerts"}
data_science_pipelines_application_persistenceagent_ready{dspa_namespace="test-dspa-alerts"}
data_science_pipelines_application_scheduledworkflow_ready{dspa_namespace="test-dspa-alerts"}
Expected result: All metrics should have value = 0
Result: data_science_pipelines_application_apiserver_ready{dspa_namespace="test-dspa-alerts"} returns "Empty query result"; the other three queries return 0.
Verify that after 5 minutes of disruption the following alerts are firing:
Data Science Pipeline Application Unavailable
Data Science Pipeline APIServer Unavailable
Data Science Pipeline PersistenceAgent Unavailable
Data Science Pipeline ScheduledWorkflows Unavailable
Data Science Pipelines Application Route Error Burn Rate (for 2m)
Result: All 4 alerts were firing except for the "Data Science Pipeline APIServer Unavailable" one.
Verify alerts are firing also in Alertmanager:
Result: All 4 alerts were firing except for the "Data Science Pipeline APIServer Unavailable" one.
Verify that the alert can be seen in OpenShift Cluster Monitoring Prometheus:
OpenShift Console > Monitoring > Metrics
Run this query:
ALERTS{namespace=~"redhat-ods-applications|redhat-ods-monitoring|redhat-ods-operator|rhods-notebooks"}
Expected result: Alerts should be active
Actual result: All 4 alerts were firing except for the "Data Science Pipeline APIServer Unavailable" one.