Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add option to mangle alert words in retry warnings and ruff it #57

Merged
merged 4 commits into from
Feb 23, 2024

Conversation

taldcroft
Copy link
Member

@taldcroft taldcroft commented Feb 17, 2024

Description

This is intended to make it so that warnings about HDF5 file "resource unavailable" which are eventually resolved (the H5 file open finally succeeds) do NOT result in a task_schedule alert. This has been a problem because it is difficult to distinguish from the email whether it failed and stopped processing or it succeeded and processing completed successfully.

The idea here is a little hacky, but when issuing a logger warning it mangles certain alert keywords, replacing an i or l with 1 and o with 0. So WARNING: Failing fatAL Error exception message becomes WARN1NG: Fai1ing fatA1 Err0r excepti0n message. This will remove all the matches in task_schedule checking for warnings only.

Any exceptions (i.e. it failed every time) will be passed through as normal.

Unrelated fixes

This replaces use of logging.getLogger with ska_helpers.logging.basic_logger following modern best-practices. In particular logging.getLogger will end up delegating to the root logger and there are no guarantees on how that is configured.

This introduces use of ruff in the standard way (https://github.com/sot/skare3/wiki/Configuration-for-linting-and-formatting-with-ruff). It was failing black and lint and it was just easier to migrate to ruff since that is what we see in our IDE.

Interface impacts

Testing

Unit tests

  • Mac
(ska3) ➜  ska_helpers git:(retry-mangle-alert-words) git rev-parse HEAD
7dc1fc19e067e749e4e81a778393016daef13f20
(ska3) ➜  ska_helpers git:(retry-mangle-alert-words) pytest
========================================================================== test session starts ==========================================================================
platform darwin -- Python 3.10.8, pytest-7.2.1, pluggy-1.0.0
rootdir: /Users/aldcroft/git, configfile: pytest.ini
plugins: timeout-2.1.0, anyio-3.6.2
collected 67 items                                                                                                                                                      

ska_helpers/retry/tests/test_retry.py ..............                                                                                                              [ 20%]
ska_helpers/tests/test_chandra_models.py ..................                                                                                                       [ 47%]
ska_helpers/tests/test_git_helpers.py s                                                                                                                           [ 49%]
ska_helpers/tests/test_paths.py ......                                                                                                                            [ 58%]
ska_helpers/tests/test_run_info.py ..                                                                                                                             [ 61%]
ska_helpers/tests/test_utils.py ..........................                                                                                                        [100%]

Independent check of unit tests by Jean

  • Linux
ska3-jeanconn-fido> git rev-parse HEAD
c68a9fd46908b9b11765a69047db0318a87e8f0b
ska3-jeanconn-fido> pytest
==================================================================================== test session starts ====================================================================================
platform linux -- Python 3.10.8, pytest-7.2.1, pluggy-1.0.0
rootdir: /proj/sot/ska/jeanproj/git, configfile: pytest.ini
plugins: anyio-3.6.2, timeout-2.1.0
collected 67 items                                                                                                                                                                          

ska_helpers/retry/tests/test_retry.py ..............                                                                                                                                  [ 20%]
ska_helpers/tests/test_chandra_models.py ..................                                                                                                                           [ 47%]
ska_helpers/tests/test_git_helpers.py .                                                                                                                                               [ 49%]
ska_helpers/tests/test_paths.py ......                                                                                                                                                [ 58%]
ska_helpers/tests/test_run_info.py ..                                                                                                                                                 [ 61%]
ska_helpers/tests/test_utils.py ..........................                                                                                                                            [100%]

==================================================================================== 67 passed in 11.54

Functional tests

No functional testing.

@taldcroft taldcroft changed the title Add option to mangle alert words in retry warnings Add option to mangle alert words in retry warnings and ruff it Feb 17, 2024
.pre-commit-config.yaml Outdated Show resolved Hide resolved
@taldcroft taldcroft requested a review from jeanconn February 19, 2024 11:11
@jeanconn
Copy link
Contributor

@taldcroft Can you point me to an example of one of the task schedule alerts that was unhelpful? I'm a little confused as I figured that if we don't want the retry warnings as warnings, the fix would be to go to

logger.warning(
and set those as debug messages.

@taldcroft
Copy link
Member Author

taldcroft commented Feb 22, 2024

@jeanconn - I want the full warning messages to be in the logs, but just not trigger task_schedule alerts if they are warnings and not actual exceptions. Setting the logger to debug would make the warnings invisible to us since e.g. cheta runs logging at info during normal processing. I don't think we want to lose log visibility entirely that this is happening, but warnings should not trigger an alert email that is indistinguishable from the cheta update process crashing and not completing.

FYI I gave @javierggt a detailed code walkthrough and after some convincing he said this was at least "acceptable".

to:aca "Engineering telemetry archive: ALERT" retrying should bring up some examples.

@jeanconn
Copy link
Contributor

"I want the full warning messages to be in the logs, but just not trigger task_schedule alerts if they are warnings and not actual exceptions". OK, though then instead of changing the logging output to debug wouldn't that just be changing the "warning" text to "retry warn" or something in the try output? I'm still a little confused about the need for mangling.

@jeanconn
Copy link
Contributor

And using a reference email, I'm not confident about how the task_schedule regex is applied, but I'd figure the back trace in the example email would get caught by the first check like:

https://github.com/sot/cheta/blob/9f192dcc99b815fb1ce5928421eb488b06435242/cheta/task_schedule.cfg#L50

@taldcroft
Copy link
Member Author

taldcroft commented Feb 22, 2024

It's a good point that this PR should also be accompanied by a cheta PR to effectively back out sot/cheta#253. In retrospect that was not a great solution and I think this PR is better.

This PR applies universally for all applications that use tables_open_file.

  • The current situation requires applying the arcane negative lookaheads in the task schedule check rules for each application.
  • These lookaheads could potentially block reporting of real errors. This is important and we are just crossing our fingers on this one because we have no idea on the full scope of possible warning/error messages.
  • If tables changes their exception reporting it might break the negative lookaheads.

I'm still a little confused about the need for mangling.

To the last bullet point, the contents of the exception message from tables (that gets put into the logger.warn() output) should be considered out of our control. So instead of just using current experience to carefully craft a regex that seems to work, instead define a set of alert words and systematically sanitize (mangle) them.

Copy link
Contributor

@jeanconn jeanconn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think there are probably ways to trigger the alerts we want and not see the others without this technique but if 1/3 of the user base wants this, it seems fine to me.


Example::

>>> mangle_alert_words("WARNING: This is a fatal Error message.")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In passing I note this example suggests mangle_alert_words is available as a function without the leading underscore.

@taldcroft
Copy link
Member Author

I think there are probably ways to trigger the alerts we want and not see the others without this technique but if 1/3 of the user base wants this, it seems fine to me.

I am open to concrete suggestions which address all of the concerns I raised above.

@jeanconn
Copy link
Contributor

jeanconn commented Feb 22, 2024

OK, first then, do we need full tracebacks in the logs for warnings -- the ones handled in the retry?

@jeanconn
Copy link
Contributor

Your comment "the contents of the exception message from tables (that gets put into the logger.warn() output) should be considered out of our control." was a bit confusing to me, as I thought including the stack trace is within our control and seems to be one of the things we're working around here.

@jeanconn
Copy link
Contributor

And if the stack trace is required (and with my code reading and no real debugging I'm confused about why we would be seeing those traces for handled exceptions out of ska_helpers retry in the first place) it seems like the error and the trace and anything out of ska_helpers retry could be reformatted to something more transparent like prepending something like "log_check_ignore" to every line out of that retry function? And removing the lookaheads.

@taldcroft
Copy link
Member Author

anything out of ska_helpers retry could be reformatted to something more transparent like prepending something like "log_check_ignore" to every line out of that retry function? And removing the lookaheads.

True, though that is just a different form of mangling in which each line is mangled with a magic word in front. In this case you would still need a negative lookbefore assertion to preclude matching any alert word which is on a line that starts with log_check_ignore. Or else a code change to watch_cron_logs to ignore any log line starting with the magic word. Neither of those options seem much better than the mangling in this PR.

@taldcroft
Copy link
Member Author

Your comment "the contents of the exception message from tables (that gets put into the logger.warn() output) should be considered out of our control." was a bit confusing to me, as I thought including the stack trace is within our control and seems to be one of the things we're working around here.

I meant that the original exception message is generated by tables or other packages and it could change without notice. Of course once we get that message text we can anything we like with it.

@taldcroft
Copy link
Member Author

@jeanconn - the original version transformed like warning => warn-ing and error => err-or. Would that make you any more comfortable with this?

@jeanconn
Copy link
Contributor

Yes, I don't disagree that adding word to mark for exclusion would just be another kind of mangling, but I though it would be a little easier to find these "try then succeeds" in the logs without the letter-to-number mangling.

I think the other problem I had in review was that I've only got 7 email messages "to:aca "Engineering telemetry archive: ALERT" retrying, with a confirmation that the last 3 were legit and a thought that basically the others looked legit too.

@taldcroft
Copy link
Member Author

sot/cheta#253 does actually work. But mica uses retry_call and I'm not sure if the one negative lookahead there is handling that.

Basically I just want to upstream the solution to this particular problem instead of using regex lookahead/behind assertions (which always break my brain). If you want to make a new PR that mangles the lines and then make and test the sister PR to watch_cron_logs then I'd be fine with that. Otherwise I'm done with this. I'd like one solution in speedy.

@jeanconn
Copy link
Contributor

That's fine. I approved when fine with this. I agree that my suggestion to mangle differently is not substantively different from this PR. I was a little confused about the tracebacks and lookahead so the other PR to remove the lookahead check for cheta is a good outcome of this conversation.

@taldcroft taldcroft merged commit bba06f4 into master Feb 23, 2024
2 checks passed
@taldcroft taldcroft deleted the retry-mangle-alert-words branch February 23, 2024 16:51
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants