Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Scheduled flows accumulate sleep overhead between long appointment delta #1214

Closed
andreleblanc11 opened this issue Sep 9, 2024 · 5 comments
Assignees
Labels
bug Something isn't working likely-fixed likely fix is in the repository, success not confirmed yet. Priority 3 - Important worrisome impact... should study carefully v3only Only affects v3 branches.

Comments

@andreleblanc11
Copy link
Member

andreleblanc11 commented Sep 9, 2024

A scheduled flow will sometimes miss appointments when there are big gaps between appointments.

# In a config
scheduled_hour 19,20
scheduled_minute 0
# In the logs
ddsr-cmc-stage01: 2024-09-06 18:55:38,766 [INFO] sarracenia.flowcb.scheduled wait_until appointment at: 2024-09-06 19:00:00+00:00, need to wait: 0:04:21.233847)
ddsr-cmc-stage01: 2024-09-06 19:00:49,033 [INFO] sarracenia.flowcb.scheduled wait_until_next sleep interrupted, returning for housekeeping.
ddsr-cmc-stage01: 2024-09-06 19:00:49,034 [INFO] sarracenia.flow _runHousekeeping on_housekeeping pid: 2150 flow/scheduled_albertaforestry_pm instance: 1
ddsr-cmc-stage01: 2024-09-06 19:00:49,036 [INFO] sarracenia.flowcb.housekeeping.resources on_housekeeping Current cpu_times: user=1.94 system=1.5
ddsr-cmc-stage01: 2024-09-06 19:00:49,036 [INFO] sarracenia.flowcb.housekeeping.resources on_housekeeping Current mem usage: 135.6 MiB, accumulating count (2 or 2/100 so far) before self-setting threshold
ddsr-cmc-stage01: 2024-09-06 19:00:49,036 [INFO] sarracenia.flowcb.log stats version: 3.00.54rc3, started: a day ago, last_housekeeping: 310.3 seconds ago
ddsr-cmc-stage01: 2024-09-06 19:00:49,036 [INFO] sarracenia.flowcb.log stats messages received: 0, accepted: 0, rejected: 0   rate accepted: 0.0% or 0.0 m/s
ddsr-cmc-stage01: 2024-09-06 19:00:49,036 [INFO] sarracenia.flowcb.log stats files transferred: 0 bytes: 0 Bytes rate: 0 Bytes/sec
ddsr-cmc-stage01: 2024-09-06 19:00:49,036 [INFO] sarracenia.flow metricsFlowReset looking for old metrics for /local/home/sarra/.cache/sr3/metrics/flow_scheduled_albertaforestry_pm_01.json
ddsr-cmc-stage01: 2024-09-06 19:00:49,041 [INFO] sarracenia.flowcb.scheduled wait_until_next already too late to 2024-09-06 19:00:00+00:00 skipping
ddsr-cmc-stage01: 2024-09-06 19:00:49,041 [INFO] sarracenia.flowcb.scheduled wait_until appointment at: 2024-09-06 20:00:00+00:00, need to wait: 0:59:10.958606)
ddsr-cmc-stage01: 2024-09-06 19:01:19,065 [INFO] sarracenia.flowcb.scheduled wait_until_next ok 0 appointments left today
ddsr-cmc-stage01: 2024-09-06 19:01:19,065 [INFO] sarracenia.flowcb.scheduled gather time to run
ddsr-cmc-stage01: 2024-09-06 19:01:19,066 [INFO] sarracenia.flowcb.log after_accept accepted: (lag: 0.00 ) https://wildfire.alberta.ca /files/pmwx.csv
ddsr-cmc-stage01: 2024-09-06 19:01:19,069 [INFO] sarracenia.flowcb.log after_post posted to exchange: xs_PROVINCIAL topic: v02.post.files a file with baseUrl: https://wildfire.alberta.ca relPath: files/pmwx.c
sv
ddsr-cmc-stage01: 2024-09-06 19:01:19,074 [INFO] sarracenia.flowcb.scheduled update_appointments for 2024-09-07 00:00:00+00:00: ["2024-09-07 19:00:00+00:00", "2024-09-07 20:00:00+00:00"]
ddsr-cmc-stage01: 2024-09-06 19:01:19,074 [INFO] sarracenia.flowcb.scheduled wait_until appointment at: 2024-09-07 19:00:00+00:00, need to wait: 23:58:40.925420)
ddsr-cmc-stage01: 2024-09-06 19:06:29,317 [INFO] sarracenia.flowcb.scheduled wait_until_next sleep interrupted, returning for housekeeping.

I tried looking at the code to find if there was an area where the sleep interval wasn't being properly adjusted, but couldn't find anything significant.

We already have logic to adjust the total sleep time for the housekeeping interval duration. We also update the sleep time at every call of wait_until which should take into account the sleep overhead accounted in the while loop of wait_until_seconds

now = datetime.datetime.fromtimestamp(time.time(),datetime.timezone.utc)
sleepfor=appointment-now
logger.info( f"appointment at: {appointment}, need to wait: {sleepfor})" )
self.wait_seconds( sleepfor )

Regardless, @petersilva and I agreed it would be good that the code checks for a 1 minute buffer before and after the appointment time, so that we don't skip apointments for a little bit of overhead.This should cover most cases where there is some overhead.

That would likely mean changing line 188

next_appointment=None
missed_appointments=[]
for t in self.appointments:
if now < t:
next_appointment=t
break

@reidsunderland
Copy link
Member

reidsunderland commented Sep 11, 2024

I refactored the scheduling to not do any sleep inside the scheduled class. It relies on the sleep in flow/__init__.py, which simplifies the scheduling because it doesn't need to deal with stopping or housekeeping.

@petersilva , @MagikEh , @andreleblanc11 and I discussed on Teams whether the scheduled next time to gather should be calculated from the current time when the gather actually runs + scheduled_interval, or from the previously scheduled time + scheduled_interval.

My example:

example to demonstrate my question: 
scheduled interval is 30s
 
calculating next using current time:
we run at:
00:00:02 (next scheduled for 00:00:32)
00:00:34 (next scheduled for 00:01:04)
00:01:05 (next scheduled for 00:01:35)
00:01:38 (next scheduled for 00:02:08)
--> we always poll the remote server >= 30 seconds apart. 
 
or calculating next using the last scheduled time:
we run at:
00:00:02 (next scheduled for 00:00:32)
00:00:34 (next scheduled for 00:01:02)
00:01:03 (next scheduled for 00:01:32)
00:01:34 (next scheduled for 00:02:02)
 
--> we always poll the remote server <= 30 seconds apart.
 
 
Would we rather overshoot or undershoot our interval?

@MagikEh brought up good points about what would happen if polling (or other work) takes longer than the scheduled_interval. If every poll takes longer than the scheduled interval, the next poll would always happen immediately.

If the poll takes longer than the scheduled interval occasionally, then using the last scheduled time to calculate the next gather time means that every subsequent gather after the long one would happen immediately, until the lost time is made up.

Using the current time for scheduling the next gather, the next scheduled gather after the long poll would happen immediately, but then the next gather after that would be scheduled for a future time. This gives a bit of a cool down period.


@petersilva said "An argument against precise intervals is it tends to synchronize work... that we don't want synchronized... if you have 150 polls... is there a good reason for them all to start polling at exactly the same time? It's not synchronized swimming."

Calculating the next scheduled time from the last scheduled time would cause a lot of synchronization, leading to load spikes at common intervals (30s, 1m, 15m, 30m, etc.), so it seems like calculating based on the current gather time is the better option for us.

@petersilva
Copy link
Contributor

petersilva commented Sep 11, 2024

fwiw, what james wants already works:


scheduled_minute 2,17,32,57

will run pretty close to every fifteen minutes starting at exactly those times. This is new in 3.0.55, because prior to that you would have had to also specify scheduled_hour 0,1,2,3... (all of them.) which was a bit ugly, now it just defaults to all of them if omitted.

@reidsunderland
Copy link
Member

I re-read the original description from Andre, and I think the problem was partially caused by the housekeeping needing to run at roughly the same time as the appointment. It would sleep until housekeeping was required, which coincided with when the next appointment was scheduled. So it would run housekeeping, but by the time that finished and the scheduled gather was called again, the appointment was already in the past.

But I added likely fixed because we merged #1223, that refactors the scheduling so all sleeping happens in the main flow run loop. Now the scheduled code doesn't need to worry about housekeeping at all. And if it is too late to the next scheduled appointment, it will still do the poll/gather and only "future" appointments might get skipped if they're already in the past.

@reidsunderland reidsunderland added the likely-fixed likely fix is in the repository, success not confirmed yet. label Sep 17, 2024
@petersilva
Copy link
Contributor

When @andreleblanc11 gets back, he can try to reproduce the issue he had with the new version. I agree that it is likely resolved... I'm just not sure.

@andreleblanc11
Copy link
Member Author

With the refactored code, there doesn't seem to be any skipped appointments like previously.

We used to be missing posts because of missed appointments, but now we aren't getting the same scenario.

### Config file ###
post_baseUrl             https://wildfire.alberta.ca

# What directory holds the data
path                    /files/amwx.csv
path                    /files/pmwx.csv
scheduled_time          13:30,14:30,19:00,20:00
######

### Results , there's been no missed apointments for the last 3 days ###
ddsr-cmc-dev01: flow_scheduled_albertaforestry_01.log.2024-09-30:2024-09-30 13:30:01,368 [INFO] sarracenia.flowcb.log after_post posted to exchange: xs_PROVINCIAL topic: v02.post.files a file with baseUrl: https://wildfire.alberta.ca relPath: files/amwx.csv
ddsr-cmc-dev01: flow_scheduled_albertaforestry_01.log.2024-09-30:2024-09-30 13:30:01,368 [INFO] sarracenia.flowcb.log after_post posted to exchange: xs_PROVINCIAL topic: v02.post.files a file with baseUrl: https://wildfire.alberta.ca relPath: files/pmwx.csv
ddsr-cmc-dev01: flow_scheduled_albertaforestry_01.log.2024-09-30:2024-09-30 14:30:02,951 [INFO] sarracenia.flowcb.log after_post posted to exchange: xs_PROVINCIAL topic: v02.post.files a file with baseUrl: https://wildfire.alberta.ca relPath: files/amwx.csv
ddsr-cmc-dev01: flow_scheduled_albertaforestry_01.log.2024-09-30:2024-09-30 14:30:02,951 [INFO] sarracenia.flowcb.log after_post posted to exchange: xs_PROVINCIAL topic: v02.post.files a file with baseUrl: https://wildfire.alberta.ca relPath: files/pmwx.csv
ddsr-cmc-dev01: flow_scheduled_albertaforestry_01.log.2024-09-30:2024-09-30 19:00:04,129 [INFO] sarracenia.flowcb.log after_post posted to exchange: xs_PROVINCIAL topic: v02.post.files a file with baseUrl: https://wildfire.alberta.ca relPath: files/amwx.csv
ddsr-cmc-dev01: flow_scheduled_albertaforestry_01.log.2024-09-30:2024-09-30 19:00:04,129 [INFO] sarracenia.flowcb.log after_post posted to exchange: xs_PROVINCIAL topic: v02.post.files a file with baseUrl: https://wildfire.alberta.ca relPath: files/pmwx.csv
ddsr-cmc-dev01: flow_scheduled_albertaforestry_01.log.2024-09-30:2024-09-30 20:00:02,256 [INFO] sarracenia.flowcb.log after_post posted to exchange: xs_PROVINCIAL topic: v02.post.files a file with baseUrl: https://wildfire.alberta.ca relPath: files/amwx.csv
ddsr-cmc-dev01: flow_scheduled_albertaforestry_01.log.2024-09-30:2024-09-30 20:00:02,257 [INFO] sarracenia.flowcb.log after_post posted to exchange: xs_PROVINCIAL topic: v02.post.files a file with baseUrl: https://wildfire.alberta.ca relPath: files/pmwx.csv
ddsr-cmc-dev01: flow_scheduled_albertaforestry_01.log.2024-10-01:2024-10-01 13:30:01,278 [INFO] sarracenia.flowcb.log after_post posted to exchange: xs_PROVINCIAL topic: v02.post.files a file with baseUrl: https://wildfire.alberta.ca relPath: files/amwx.csv
ddsr-cmc-dev01: flow_scheduled_albertaforestry_01.log.2024-10-01:2024-10-01 13:30:01,279 [INFO] sarracenia.flowcb.log after_post posted to exchange: xs_PROVINCIAL topic: v02.post.files a file with baseUrl: https://wildfire.alberta.ca relPath: files/pmwx.csv
ddsr-cmc-dev01: flow_scheduled_albertaforestry_01.log.2024-10-01:2024-10-01 14:30:03,740 [INFO] sarracenia.flowcb.log after_post posted to exchange: xs_PROVINCIAL topic: v02.post.files a file with baseUrl: https://wildfire.alberta.ca relPath: files/amwx.csv
ddsr-cmc-dev01: flow_scheduled_albertaforestry_01.log.2024-10-01:2024-10-01 14:30:03,740 [INFO] sarracenia.flowcb.log after_post posted to exchange: xs_PROVINCIAL topic: v02.post.files a file with baseUrl: https://wildfire.alberta.ca relPath: files/pmwx.csv
ddsr-cmc-dev01: flow_scheduled_albertaforestry_01.log.2024-10-01:2024-10-01 19:00:03,510 [INFO] sarracenia.flowcb.log after_post posted to exchange: xs_PROVINCIAL topic: v02.post.files a file with baseUrl: https://wildfire.alberta.ca relPath: files/amwx.csv
ddsr-cmc-dev01: flow_scheduled_albertaforestry_01.log.2024-10-01:2024-10-01 19:00:03,510 [INFO] sarracenia.flowcb.log after_post posted to exchange: xs_PROVINCIAL topic: v02.post.files a file with baseUrl: https://wildfire.alberta.ca relPath: files/pmwx.csv
ddsr-cmc-dev01: flow_scheduled_albertaforestry_01.log.2024-10-01:2024-10-01 20:00:01,505 [INFO] sarracenia.flowcb.log after_post posted to exchange: xs_PROVINCIAL topic: v02.post.files a file with baseUrl: https://wildfire.alberta.ca relPath: files/amwx.csv
ddsr-cmc-dev01: flow_scheduled_albertaforestry_01.log.2024-10-01:2024-10-01 20:00:01,505 [INFO] sarracenia.flowcb.log after_post posted to exchange: xs_PROVINCIAL topic: v02.post.files a file with baseUrl: https://wildfire.alberta.ca relPath: files/pmwx.csv
ddsr-cmc-dev01: flow_scheduled_albertaforestry_01.log.2024-10-02:2024-10-02 13:30:04,194 [INFO] sarracenia.flowcb.log after_post posted to exchange: xs_PROVINCIAL topic: v02.post.files a file with baseUrl: https://wildfire.alberta.ca relPath: files/amwx.csv
ddsr-cmc-dev01: flow_scheduled_albertaforestry_01.log.2024-10-02:2024-10-02 13:30:04,194 [INFO] sarracenia.flowcb.log after_post posted to exchange: xs_PROVINCIAL topic: v02.post.files a file with baseUrl: https://wildfire.alberta.ca relPath: files/pmwx.csv
ddsr-cmc-dev01: flow_scheduled_albertaforestry_01.log.2024-10-02:2024-10-02 14:30:02,224 [INFO] sarracenia.flowcb.log after_post posted to exchange: xs_PROVINCIAL topic: v02.post.files a file with baseUrl: https://wildfire.alberta.ca relPath: files/amwx.csv
ddsr-cmc-dev01: flow_scheduled_albertaforestry_01.log.2024-10-02:2024-10-02 14:30:02,225 [INFO] sarracenia.flowcb.log after_post posted to exchange: xs_PROVINCIAL topic: v02.post.files a file with baseUrl: https://wildfire.alberta.ca relPath: files/pmwx.csv
ddsr-cmc-dev01: flow_scheduled_albertaforestry_01.log.2024-10-02:2024-10-02 19:00:02,020 [INFO] sarracenia.flowcb.log after_post posted to exchange: xs_PROVINCIAL topic: v02.post.files a file with baseUrl: https://wildfire.alberta.ca relPath: files/amwx.csv
ddsr-cmc-dev01: flow_scheduled_albertaforestry_01.log.2024-10-02:2024-10-02 19:00:02,020 [INFO] sarracenia.flowcb.log after_post posted to exchange: xs_PROVINCIAL topic: v02.post.files a file with baseUrl: https://wildfire.alberta.ca relPath: files/pmwx.csv
ddsr-cmc-dev01: flow_scheduled_albertaforestry_01.log.2024-10-02:2024-10-02 20:00:00,061 [INFO] sarracenia.flowcb.log after_post posted to exchange: xs_PROVINCIAL topic: v02.post.files a file with baseUrl: https://wildfire.alberta.ca relPath: files/amwx.csv
ddsr-cmc-dev01: flow_scheduled_albertaforestry_01.log.2024-10-02:2024-10-02 20:00:00,061 [INFO] sarracenia.flowcb.log after_post posted to exchange: xs_PROVINCIAL topic: v02.post.files a file with baseUrl: https://wildfire.alberta.ca relPath: files/pmwx.csv
######

I also checked if we had similar results earlier (2 weeks ago), and the results were the same.

I think this issue can be closed.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working likely-fixed likely fix is in the repository, success not confirmed yet. Priority 3 - Important worrisome impact... should study carefully v3only Only affects v3 branches.
Projects
None yet
Development

No branches or pull requests

3 participants