-
Notifications
You must be signed in to change notification settings - Fork 206
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Improve CFE_TIME_GetReference
error handling
#1544
Comments
The intent of this loop is to be able to read the "ReferenceState" structure in a lock-less manner. This avoids the possibility of a task switch which could happen if a lock/mutex was used here. It also allows multiple tasks to read the structure at the same time. The "VersionCounter" within the structure is incremented every time the structure is updated by the tone task or command. The code is checking that, after copying all data out, that the "VersionCounter" within the global data structure still matches the expected value. If it does not match, this means that the structure was updated during read, and the data might not be consistent, so it reads it again. The "RetryCount" simply puts a reasonable limit on the number of times this will happen. Important to note that the possibility of a concurrent update is already very low - with the buffering done now, its quite unlikely that it will happen even once. So in reality the number of retries just has to be more than 1 to handle this remote possibility, but if it retries more than once and the version counter is still not matching, then something is likely broken. The possibility is very remote because there are multiple copies of the state structure stored in the global, for persistence across updates. In order to get to the point of overwriting the existing entry, it would require So to summarize:
|
Sounds like any time it takes more than 2 attempts there should be some indication to the user there's abnormal behavior, and if all 4 attempts are exhausted there should be an indication of the system being broken/unhealthy. |
Yes, the main issue with this code is that a failure is not really reported at all. If the retry limit is reached then at least a syslog is warranted, consider also an error event (as I would consider this a significant error, something is fundamentally broken). |
Use one of the unused time state bits to indicate if an error has occurred where CFE_TIME_GetReference was not able to get a consistent copy of the reference state data. In a functional system this should never occur - there should be at most one retry, which only happens in the event there was a burst of updates (more than 4) concurrently with reading the structure. The previous implementation did not report or handle the condition at all, this at least sets a TLM status bit and returns a reference struct filled with all zeros.
Use one of the unused time state bits to indicate if an error has occurred where CFE_TIME_GetReference was not able to get a consistent copy of the reference state data. In a functional system this should never occur - there should be at most one retry, which only happens in the event there was a burst of updates (more than 4) concurrently with reading the structure. The previous implementation did not report or handle the condition at all, this at least sets a TLM status bit and returns a reference struct filled with all zeros.
Fix #1544, add time get reference error bit
Is your feature request related to a problem? Please describe.
Hard coding Retry count to 4 is a bit "magic", not clear how much margin this has, not clear statistically how may conflicts are occurring, no error reporting, no way for the calling routine to take action or track, not clear even if these values are always required to be in sync or what happens if they aren't, not obvious what is intended here.
Describe the solution you'd like
Clarify design, track/monitor/report performance/errors. If there's uses where it's critical, may need to deconflict (protect query from update). Maybe provide API that's slower but always correct, vs faster but possibly invalid if that's really a need.
Describe alternatives you've considered
None
Additional context
Code review
Requester Info
Jacob Hageman - NASA/GSFC
The text was updated successfully, but these errors were encountered: