Support MAUDE telemetry for events #343

Open. Wants to merge 9 commits into base: master

taldcroft
Member

@taldcroft taldcroft commented Dec 5, 2024

Description

Support using MAUDE as the source of telemetry for kadi events. This has been a long-standing request from FOT engineering and will allow the events to be up to date with ground telemetry soon after MAUDE dump data ingest completes.

In order to realize this reduction in latency, we need to change a couple of things in ground processing:

  • Run kadi_update_events more frequently. Running this once per hour is realistic, but up for discussion.
  • rsync to GRETA at the same cadence but offset in phase (e.g. once per hour at 10 minutes after the hour).

Interface impacts

No interface impact.

Testing

Unit tests

  • No unit tests
  • Mac
  • Linux
  • Windows

Independent check of unit tests by [REVIEWER NAME]

  • [PLATFORM]:

Functional tests

  • Run an update test with a daily update over a time span covering the two nearby safe modes.
  • Run this for at least a week on HEAD in exactly the way that is planned for the flight job (e.g. as an hourly cron job).

In both cases confirm that the test database matches the flight database (modulo expected small timing differences).

@taldcroft taldcroft changed the title Support maude telemetry for events Support MAUDE telemetry for events Dec 5, 2024
@jeanconn
Contributor

jeanconn commented Dec 5, 2024

Few questions:

  1. We don't really rsync to GRETA anymore, we run cheta_sync there too. Does the whole cheta sync process run into issues if we make more updates and sync them more frequently?
  2. On a related note, it seems like we run into more opportunities for race conditions on the syncing - would we need any other status monitoring in update jobs to avoid cheta-syncing while an update was in progress? Or because of the way the files are moved into position does it just not matter?
  3. Could there be value in running the update task conditional on something kicked off by MAUDE? (Though if even a job to watch for that would be a regular cron, it would kinda be the same.)

One comment:
It seems like functional testing would also need to include a period of real on-the-side updates in addition to the faux historical ones.

@taldcroft
Member Author

taldcroft commented Dec 6, 2024

Few questions:

  1. We don't really rsync to GRETA anymore, we run cheta_sync there too. Does the whole cheta sync process run into issues if we make more updates and sync them more frequently?

See the sync_ska_data_occ repo, which includes a call to rsync for the non-cheta data. The file in question for kadi events is /proj/sot/ska3/events3.db3, so just that one file would be synced more frequently.

  2. On a related note, it seems like we run into more opportunities for race conditions on the syncing - would we need any other status monitoring in update jobs to avoid cheta-syncing while an update was in progress? Or because of the way the files are moved into position does it just not matter?

I think there is a disconnect. The cheta data archive is specifically unrelated to this PR since the idea is to use MAUDE instead.

  3. Could there be value in running the update task conditional on something kicked off by MAUDE? (Though if even a job to watch for that would be a regular cron, it would kinda be the same.)

I've been planning on something equivalent, which is to not do the kadi events update at all (even for non-telemetry events) if there is no new telemetry since the last update. In that way the script can kick off every hour, but only about 1 in 8 times does it actually end up updating events3.db3. Likewise an hourly rsync will mostly just do nothing.
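
A minimal sketch of that skip-if-no-new-telemetry check, for illustration only. `UpdateState`, `maybe_update_events`, and the time values are hypothetical names and numbers, not kadi or MAUDE APIs; the real update step is replaced by a log entry.

```python
from dataclasses import dataclass, field


@dataclass
class UpdateState:
    """Hypothetical bookkeeping persisted between hourly runs."""
    last_telemetry_time: float = 0.0
    update_log: list = field(default_factory=list)


def maybe_update_events(state, latest_telemetry_time):
    """Run the events update only if telemetry has advanced.

    Returns True if an update ran, False if the run was a no-op.
    """
    if latest_telemetry_time <= state.last_telemetry_time:
        return False  # nothing new: leave the database file untouched
    # Placeholder for the real update over the new telemetry interval.
    state.update_log.append((state.last_telemetry_time, latest_telemetry_time))
    state.last_telemetry_time = latest_telemetry_time
    return True


# Eight hourly runs; telemetry only advances before the fourth run.
state = UpdateState(last_telemetry_time=100.0)
latest_times = [100.0, 100.0, 100.0, 250.0, 250.0, 250.0, 250.0, 250.0]
ran = [maybe_update_events(state, t) for t in latest_times]
```

In this made-up schedule only one of the eight runs does any work, so the database file (and therefore the hourly rsync) is untouched the rest of the time.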

My biggest concern is what happens to a client application that is accessing the database while the file gets updated. One option that would be a possibility with this PR is to actually run the update process on GRETA as well since it has access to MAUDE. My main concern there is reduced visibility into processing problems.

One comment: It seems like functional testing would also need to include a period of real on-the-side updates in addition to the faux historical ones.

Agreed, I updated the description.

@jeanconn
Contributor

jeanconn commented Dec 6, 2024

Thanks Tom! "I think there is a disconnect": yeah, I immediately went to "what if we were updating the cxc cheta archive based on nrt maude", which is explicitly not what you are doing with this kadi PR. Sorry! I suppose the better question is whether there are any differences between data sources, such as timing, that will cause us any confusion in the future with the events.

@taldcroft
Member Author

My biggest concern is what happens to a client application that is accessing the database while the file gets updated.

I just did some experiments on chimchim with updating the events3.db3 file after doing a few kadi event queries. After changing the file with updates, nothing changed on my side. I then did a more extreme change and deleted events3.db3 entirely and even this did not disrupt the running application. Only after quitting IPython and then starting over did it notice that there was no events database file.

Which is to say that the Linux kernel / NFS do interesting things and may protect us from problems related to rsync'ing while people are working.
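
A small sketch of the behavior that is probably at work here, under stated assumptions: a local POSIX filesystem (not NFS, which can behave differently), a plain file handle rather than a SQLite connection, and a writer that replaces the file by renaming a new copy over it (rsync's default when `--inplace` is not used). The function name and file contents are hypothetical.

```python
import os
import tempfile


def read_after_replace_and_delete():
    """Show that an already-open handle keeps seeing the old file."""
    with tempfile.TemporaryDirectory() as tmpdir:
        db_path = os.path.join(tmpdir, "events3.db3")
        with open(db_path, "w") as fh:
            fh.write("old contents")

        # A client opens the file before the update happens.
        reader = open(db_path)

        # The updater writes a new copy and renames it into place.
        new_path = db_path + ".new"
        with open(new_path, "w") as fh:
            fh.write("new contents")
        os.replace(new_path, db_path)
        seen_after_replace = reader.read()

        # The more extreme case: the file is deleted entirely.
        os.remove(db_path)
        reader.seek(0)
        seen_after_delete = reader.read()
        reader.close()

        # Only a fresh open notices that the file is gone.
        path_exists = os.path.exists(db_path)
        return seen_after_replace, seen_after_delete, path_exists


after_replace, after_delete, path_exists = read_after_replace_and_delete()
```

On POSIX systems the open handle refers to the original inode, so both reads return the old contents even though the path now points elsewhere or nowhere. This does not cover a client that reopens the file mid-job, which may be what differs in the long-running cron case.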

@taldcroft
Member Author

whether there are any differences between data sources, such as timing, that will cause us any confusion in the future with the events.

There will be differences in timing of up to 1/2 the sample period, so e.g. up to 1/2 second for maneuver event times. In practice this doesn't matter since we already have that disconnect between MAUDE/GRETA and CXC times all over the place. Going forward, using MAUDE times will probably be less confusing for FOT engineers.

@jeanconn
Contributor

jeanconn commented Dec 6, 2024

Right, the "does it matter if the file updates while the user is working" question is also important. My long-running cron jobs would still crash if the kadi events changed while the job was running (so when working on HEAD I usually made a local events copy), but that seems inconsistent with your testing just now.
