Incorrect values for millisSinceLastRepairForSchedule in prometheusMetrics #1454

Open
rtib opened this issue Jan 3, 2024 · 4 comments

rtib commented Jan 3, 2024

Project board link

I'm running Reaper in sidecar mode and tried to monitor it via Prometheus metrics. I've created a dashboard showing repair progress along with the time since the last scheduled repair. While repair progress works fine, the time since the last repair shows a correct value for only one schedule; the others are stuck at the epoch value.

Screenshot 2024-01-03 at 09 57 04

Reaper shows on the webui that all schedules have run within 7 days

Screenshot 2024-01-03 at 09 57 24

and the schedules are still active

Screenshot 2024-01-03 at 09 57 35

When looking at the prometheusMetrics endpoint, however, the values exported are wrong

Screenshot 2024-01-03 at 09 57 53

However, this is not a prometheus-exporter issue: the Dropwizard report shows the same problem

Screenshot 2024-01-03 at 10 22 41

Looking into it, I've found that millis since epoch is the fallback value for a repair schedule if no repairs from this schedule were completed.

.orElse(DateTime.now().getMillis()); // Return epoch if no repairs from this schedule were completed
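
To illustrate why the fallback shows up as a huge value on the dashboard, here is a minimal sketch. It assumes the gauge reports `now - endTime` when a completed run is found and otherwise falls back to the current epoch millis, as the `.orElse(...)` line above suggests. The class and method names are hypothetical, and `java.time` is used in place of the Joda `DateTime` seen in the snippet so that the example needs no extra dependencies.

```java
import java.time.Instant;
import java.util.Optional;

final class LastRepairGaugeSketch {

    // Hypothetical stand-in for the gauge: elapsed millis since the last
    // completed run, or the current epoch millis when no run is known.
    static long millisSinceLastRepair(Optional<Instant> lastRunEnd, Instant now) {
        return lastRunEnd
            .map(end -> now.toEpochMilli() - end.toEpochMilli())
            .orElse(now.toEpochMilli()); // fallback: looks like "last repair in 1970"
    }

    public static void main(String[] args) {
        Instant now = Instant.parse("2024-01-03T09:57:00Z");
        Instant lastEnd = Instant.parse("2024-01-01T09:57:00Z");

        // A schedule with a known completed run: two days in millis.
        System.out.println(millisSinceLastRepair(Optional.of(lastEnd), now));

        // A schedule without a known run: the full epoch offset of "now".
        System.out.println(millisSinceLastRepair(Optional.empty(), now));
    }
}
```

With this shape, a schedule whose last run cannot be resolved is indistinguishable from one that was last repaired at the epoch.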

┆Issue is synchronized with this Jira Story by Unito
┆Issue Number: REAP-18

rtib commented Jan 3, 2024

Looking at the database, the corresponding repair_run entry does have a valid end_time.

rtib commented Jan 3, 2024

Digging a bit deeper revealed that the last_run field of repair_schedule_v1 contains null for all but one entry. That makes millisSinceLastRepairForSchedule fall back to returning epoch.
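
A minimal sketch of that failure mode, under the assumption that the end time is resolved through the schedule's `last_run` reference: a null reference yields an empty result even though a matching repair_run row with a valid end_time exists. The class, method, and map below are hypothetical and only model the lookup; they are not Reaper's storage API.

```java
import java.time.Instant;
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;
import java.util.UUID;

final class NullLastRunSketch {

    // Hypothetical lookup: resolve the end time of a schedule's last run
    // through its last_run reference, which may be null.
    static Optional<Instant> lastRunEnd(UUID lastRun, Map<UUID, Instant> endTimeByRunId) {
        return Optional.ofNullable(lastRun).map(endTimeByRunId::get);
    }

    public static void main(String[] args) {
        UUID runId = UUID.randomUUID();
        Map<UUID, Instant> endTimeByRunId = new HashMap<>();
        endTimeByRunId.put(runId, Instant.parse("2024-01-01T09:57:00Z"));

        // last_run is set: the end time is found.
        System.out.println(lastRunEnd(runId, endTimeByRunId).isPresent());

        // last_run is null: empty, although the run itself has an end_time.
        System.out.println(lastRunEnd(null, endTimeByRunId).isPresent());
    }
}
```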

Nassz commented Jan 3, 2024

This metric also doesn't work with multiple hosts (e.g. a 3-node cluster), which may explain why you only see the metric for one keyspace. Alternatively, you can use 7 days - io_cassandrareaper_service_RepairRunner_millisSinceLastRepair :)
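
As a rough sketch of that arithmetic, assuming a 7-day repair interval and a metric value in milliseconds (the class and method names are hypothetical):

```java
import java.time.Duration;

final class RepairBudgetSketch {

    // Assumed repair interval; adjust to the schedule's actual interval.
    static final long INTERVAL_MILLIS = Duration.ofDays(7).toMillis();

    // Millis left before the interval is exceeded; negative means overdue.
    static long millisUntilOverdue(long millisSinceLastRepair) {
        return INTERVAL_MILLIS - millisSinceLastRepair;
    }

    public static void main(String[] args) {
        long twoDays = Duration.ofDays(2).toMillis();
        System.out.println(millisUntilOverdue(twoDays)); // five days in millis
    }
}
```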

rtib commented Jan 3, 2024

That would also contain manually started repairs, which is okay, but then it is hard to distinguish between schedules and manual runs.
