Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

FileNotFoundError during crop_gribs startup due to race condition with collect_weather startup #197

Closed
douglatornell opened this issue Aug 29, 2023 · 0 comments · Fixed by #203
Assignees
Labels
bug Something isn't working Workers
Milestone

Comments

@douglatornell
Copy link
Member

Log messages just before failure:

2023-08-28 20:49:47,132 INFO [manager] launching NextWorker(module='nowcast.workers.collect_weather', args=['06', '2.5km'], host='localhost')
2023-08-28 20:49:47,134 DEBUG [manager] cmd = ['/SalishSeaCast/nowcast-env/bin/python3', '-m', 'nowcast.workers.collect_weather', '/SalishSeaCast/SalishSeaNowcast/config/nowcast.yaml', '06', '2.5km']
2023-08-28 20:49:47,134 DEBUG [collect_weather] received message from manager: (ack) message acknowledged
2023-08-28 20:49:47,135 DEBUG [collect_weather] shutting down
2023-08-28 20:49:47,138 INFO [manager] launching NextWorker(module='nowcast.workers.crop_gribs', args=['06', '--fcst-date', '2023-08-29'], host='localhost')
2023-08-28 20:49:47,139 DEBUG [manager] cmd = ['/SalishSeaCast/nowcast-env/bin/python3', '-m', 'nowcast.workers.crop_gribs', '/SalishSeaCast/SalishSeaNowcast/config/nowcast.yaml', '06', '--fcst-date', '2023-08-29']
2023-08-28 20:49:47,144 DEBUG [manager] listening...
2023-08-28 20:49:51,174 DEBUG [crop_gribs] wrote GRIB file cropped to SalishSeaCast subdomain: /results/forcing/atmospheric/continental2.5/GRIB/20230829/00/048/20230829T00Z_MSC_HRDPS_APCP_Sfc_RLatLon0.0225_PT048H_SSC.grib2
2023-08-28 20:49:51,174 DEBUG [crop_gribs] observer thread files remaining to process: 5
2023-08-28 20:49:52,831 INFO [crop_gribs] running in process 25692
2023-08-28 20:49:52,832 INFO [collect_weather] running in process 25690
2023-08-28 20:49:52,833 INFO [crop_gribs] read config from /SalishSeaCast/SalishSeaNowcast/config/nowcast.yaml
2023-08-28 20:49:52,834 INFO [crop_gribs] publishing log messages to tcp://*:5562
2023-08-28 20:49:52,835 INFO [crop_gribs] connected to skookum.eos.ubc.ca port 5555
2023-08-28 20:49:52,836 INFO [crop_gribs] cropping 2023-08-29 ECCC HRDPS 2.5km continental 06Z GRIB files to SalishSeaCast subdomain
2023-08-28 20:49:52,838 DEBUG [crop_gribs] creating ECCC GRIB file paths list for 20230829 06Z forecast
2023-08-28 20:49:52,838 INFO [collect_weather] read config from /SalishSeaCast/SalishSeaNowcast/config/nowcast.yaml
2023-08-28 20:49:52,839 INFO [collect_weather] publishing log messages to tcp://*:5563
2023-08-28 20:49:52,840 INFO [collect_weather] connected to skookum.eos.ubc.ca port 5555
2023-08-28 20:49:52,850 DEBUG [collect_weather] calculated set of expected file paths for 2.5 km 20230829/06
2023-08-28 20:49:52,851 INFO [crop_gribs] starting to watch for ECCC grib files to crop in /results/forcing/atmospheric/continental2.5/GRIB/20230829/06/
2023-08-28 20:49:52,853 DEBUG [collect_weather] created /results/forcing/atmospheric/continental2.5/GRIB/20230829/
2023-08-28 20:49:52,853 DEBUG [collect_weather] created /results/forcing/atmospheric/continental2.5/GRIB/20230829/06/
2023-08-28 20:49:52,854 INFO [collect_weather] starting to watch for files in /SalishSeaCast/datamart/hrdps-continental/06/
2023-08-28 20:49:52,862 CRITICAL [crop_gribs] unhandled exception:

Note that crop_gribs starts to watch /results/forcing/atmospheric/continental2.5/GRIB/20230829/06/ before collect_weather creates it.

Traceback:

2023-08-28 20:49:52,862 CRITICAL [crop_gribs] unhandled exception:
Traceback (most recent call last):
  File "/SalishSeaCast/NEMO_Nowcast/nemo_nowcast/worker.py", line 391, in _do_work
    checklist = self.worker_func(
                ^^^^^^^^^^^^^^^^^
  File "/SalishSeaCast/SalishSeaNowcast/nowcast/workers/crop_gribs.py", line 135, in crop_gribs
    observer.start()
  File "/SalishSeaCast/nowcast-env/lib/python3.11/site-packages/watchdog/observers/api.py", line 261, in start
    emitter.start()
  File "/SalishSeaCast/nowcast-env/lib/python3.11/site-packages/watchdog/utils/__init__.py", line 92, in start
    self.on_thread_start()
  File "/SalishSeaCast/nowcast-env/lib/python3.11/site-packages/watchdog/observers/inotify.py", line 119, in on_thread_start
    self._inotify = InotifyBuffer(path, self.watch.is_recursive)
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/SalishSeaCast/nowcast-env/lib/python3.11/site-packages/watchdog/observers/inotify_buffer.py", line 37, in __init__
    self._inotify = Inotify(path, recursive)
                    ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/SalishSeaCast/nowcast-env/lib/python3.11/site-packages/watchdog/observers/inotify_c.py", line 181, in __init__
    self._add_watch(path, event_mask)
  File "/SalishSeaCast/nowcast-env/lib/python3.11/site-packages/watchdog/observers/inotify_c.py", line 416, in _add_watch
    Inotify._raise_error()
  File "/SalishSeaCast/nowcast-env/lib/python3.11/site-packages/watchdog/observers/inotify_c.py", line 432, in _raise_error
    raise OSError(err, os.strerror(err))
FileNotFoundError: [Errno 2] No such file or directory
@douglatornell douglatornell added bug Something isn't working Workers labels Aug 29, 2023
@douglatornell douglatornell added this to the v23.1 milestone Aug 29, 2023
@douglatornell douglatornell linked a pull request Sep 25, 2023 that will close this issue
douglatornell added a commit that referenced this issue Sep 25, 2023
Create the directory we're going to watch to handle 2 rare situations that can
cause a FileNotFoundError to be raised (issue #197):
1. collect_weather and crop_gribs are launched concurrently. Occasionally
   crop_gribs tries to start watching the directory before collect_weather
   creates it.
2. When collect_weather --backfill has to be run to recover from automation
   problems we need to start crop_gribs first. So, it needs to create the
   directory.

re: issue #197
douglatornell added a commit that referenced this issue Sep 26, 2023
* Update pkgs & versions used in recent dev env

* Enable Slack notifications for crop_gribs completion

* Change observer thread timeout to 0.5s

Tests in productions showed that timeout of 2s did not appear to reduce the
likelihood of the observer stalling with 1 or a few files uncropped
compared to the original 1s timeout. So, perhaps a shorter timeout will be
better.

* Add retry & timeout for crop_gribs stalled observer

If crop_gribs instance has been running for 8 hours the file system observer
thread may have stalled, or there may be a problem with the HRDPS data
dissemination.

If >10 files remain in the set of files to be cropped, it is probably due to a
data dissemination problem. So, log a critical message that includes
the set of uncropped files so that we can send it to MSC support.

If <=10 files remain, it is possible that the watchdog observer thread just
missed 1 or more files. So, log an info message and retry cropping the set of
uncropped files. If that fails with FileNotFoundError, it is probably due to a
data dissemination problem. So, log a critical message that includes the set of
uncropped files so that we can send it to MSC support.

* Bump cryptography from 41.0.3 to 41.0.4 in /envs (#202)

Bumps [cryptography](https://github.com/pyca/cryptography) from 41.0.3 to 41.0.4.
- [Changelog](https://github.com/pyca/cryptography/blob/main/CHANGELOG.rst)
- [Commits](pyca/cryptography@41.0.3...41.0.4)

---
updated-dependencies:
- dependency-name: cryptography
  dependency-type: direct:production
...

Signed-off-by: dependabot[bot] <[email protected]>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* Handle directory to watch no yet created

Create the directory we're going to watch to handle 2 rare situations that can
cause a FileNotFoundError to be raised (issue #197):
1. collect_weather and crop_gribs are launched concurrently. Occasionally
   crop_gribs tries to start watching the directory before collect_weather
   creates it.
2. When collect_weather --backfill has to be run to recover from automation
   problems we need to start crop_gribs first. So, it needs to create the
   directory.

re: issue #197

* Fix broken NOAA link in docs found by sphinx linkcheck
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working Workers
Projects
None yet
Development

Successfully merging a pull request may close this issue.

1 participant