
Error logging on multiple cores #47

Closed
Higginbottom opened this issue Sep 2, 2013 · 6 comments

@Higginbottom
Collaborator

The current method of causing Python to exit if it sees (1 million / number of processors) errors of any single type is not really working well.
This is because there are some fairly benign errors, like 'no photons in band', that can easily exceed this number if you are running on a few hundred cores.
I think that for the time being we should go back to simply failing on 1 million errors, full stop, and be careful to check our total errors.
In the long term, we need to think carefully about which errors we really care about, and which we merely want to be informed of, rather than causing many-hour-long jobs to simply stop.

@kslong
Collaborator

kslong commented Sep 2, 2013

I believe the current method, unless it has been changed, is to stop if any single error exceeds 1 million. If that is not the case, I certainly agree with you that we need to think about this, especially in light of wanting to run on multiple cores.

There are at least 3 topics associated with this:

  1. We wanted a way of logging the number of errors so we could see if something was happening a large number of times in a particular run. (In single-processor mode, this was one of the easiest ways of telling whether something had gone wrong in a particular run. In multi-processor mode, to preserve this capability and to be able to compare runs with different numbers of processors, we need a way to collect all the error counts.)
  2. We wanted to stop the program and not waste computer time if something went seriously wrong. This was the original reason I put these limits in.
  3. Is something an error? If you are prepared to have an error happen a million times and not care, perhaps you should reclassify it as not an error, or write a better set of conditions to establish that it is a real error in a given case. (I guess the reason you might want to do this is that we do not have a capability to count other kinds of messages, though perhaps one could be added.)

So my statement of this problem would be:

We need a way to collect up errors across multiple processors so that we can compare results obtained in single- and multiprocessor mode.
We need to stop classifying as errors things that we are willing to have happen a million times.
We should avoid doing the same calculation on multiple cores, so errors are generated only once. (Indeed, I did not realize we still did any of this.)

@jhmatthews
Collaborator

For errors which we don't want Python to quit on, we can quite easily count them with a few edits to the scripts py_error.py and watchdog.py. This would just require a unique string like 'Warning', or a new function like Warning() in log.c. These could even be counted by error_log/error_count internally in Python itself, and wouldn't take much effort.
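A Warning() routine paralleling Error() might look roughly like the sketch below. This is only an illustration, not the actual log.c implementation: the globals warn_count and verbosity, the threshold of 3, and the output stream are all assumptions based on the discussion in this thread.

```c
/* Hypothetical sketch of a Warning() routine paralleling Error() in log.c.
 * warn_count and verbosity are illustrative globals, not the real ones. */
#include <stdio.h>
#include <stdarg.h>

static int warn_count = 0;  /* running tally of all warnings issued */
static int verbosity = 5;   /* assumed global verbosity level */

int
Warning (const char *format, ...)
{
  va_list ap;

  warn_count++;             /* count the warning even if we don't print it */
  if (verbosity < 3)        /* suppress warning output at low verbosity */
    return warn_count;

  fprintf (stderr, "Warning: ");
  va_start (ap, format);
  vfprintf (stderr, format, ap);
  va_end (ap);
  return warn_count;
}
```

The key design point is that a warning is tallied like an error but never contributes to the quit threshold.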

The watchdog script also counts errors across various threads fairly well, although because it counts by looking in the diag file it can only pick up 100 errors per thread (as that is the maximum Python reports). The py_error script does count them properly, as it runs after the run has finished and so looks at the error summary for all processes.

I think the best solution is firstly to implement some kind of 'error warning' that isn't as serious as an actual error, and also to have a value nerr_tot, the total of all errors, which is communicated between the threads at certain stages (say in wind_updates and spectrum_summary); we can then quit if this number is exceeded. We would then never quit on a single error type, but rather on the total errors exceeding one million. Is that satisfactory? We could also quit if a single error on a certain thread exceeds a slightly smaller number, perhaps.

If we still want to quit when the total of a single error across all threads exceeds 1 million, then that will be a bit trickier, but may be doable. It would mean either some kind of clever memory being assigned as errors are found, or a list of all possible errors being declared at the start.
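The nerr_tot idea could be split into a pure decision check plus a reduction across ranks, roughly as below. should_abort, MAX_TOTAL_ERRORS, and nerr_local are illustrative names, not the actual code; the MPI call is shown only in a comment so the decision logic stands alone.

```c
/* Sketch of a global error-count check. The threshold test is separated
 * from the communication so it can be reasoned about (and tested) alone. */

#define MAX_TOTAL_ERRORS 1000000L   /* illustrative global limit */

/* Return 1 if the run should stop given the summed error count. */
int
should_abort (long nerr_tot)
{
  return nerr_tot > MAX_TOTAL_ERRORS;
}

/* In an MPI build, each rank would do something like:
 *
 *   long nerr_tot;
 *   MPI_Allreduce (&nerr_local, &nerr_tot, 1, MPI_LONG, MPI_SUM,
 *                  MPI_COMM_WORLD);
 *   if (should_abort (nerr_tot))
 *     MPI_Abort (MPI_COMM_WORLD, 1);
 *
 * at the synchronisation points mentioned above (wind_updates,
 * spectrum_summary), so every rank agrees on whether to quit.
 */
```

Because every rank receives the same total from the reduction, all ranks make the same quit decision, avoiding a hang where one rank exits while others wait.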

@kslong
Collaborator

kslong commented Sep 3, 2013

This is in response to James' message

> For errors which we don't want Python to quit on, we can quite easily count them with a few edits to the scripts py_error.py and watchdog.py. This would just require a unique string like 'Warning', or a new function like Warning() in log.c. These could even be counted by error_log/error_count internally in Python itself, and wouldn't take much effort.

Yes this would be possible, if necessary, but it creates differences between single processor mode and multiprocessor mode.

> The watchdog script also counts errors across various threads fairly well, although because it counts by looking in the diag file it can only pick up 100 errors per thread (as that is the maximum Python reports). The py_error script does count them properly, as it runs after the run has finished and so looks at the error summary for all processes.

I don't think this is what we are talking about.

> I think the best solution is firstly to implement some kind of 'error warning' that isn't as serious as an actual error, and also to have a value nerr_tot, the total of all errors, which is communicated between the threads at certain stages (say in wind_updates and spectrum_summary); we can then quit if this number is exceeded. We would then never quit on a single error type, but rather on the total errors exceeding one million. Is that satisfactory? We could also quit if a single error on a certain thread exceeds a slightly smaller number, perhaps.

> If we still want to quit when the total of a single error across all threads exceeds 1 million, then that will be a bit trickier, but may be doable. It would mean either some kind of clever memory being assigned as errors are found, or a list of all possible errors being declared at the start.

I think the tricky question is whether we can take an array from a thread, bring it to the master thread, and then process it in some way. Our current structure for keeping track of the errors consists of an 'Error message' and a count.

We need to be able to meld two such structures, not simply add or average them. We actually need to do work on them.
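One sketch of "melding" two such (message, count) tables, assuming each process holds a fixed-size array of entries: counts for matching messages are summed, and messages not yet seen are appended. error_entry, meld_errors, and the sizes are illustrative names, not the code's actual error structure.

```c
/* Sketch of merging two per-process error tables keyed by message text.
 * The struct layout and limits are assumptions for illustration. */
#include <string.h>

#define NERROR_MAX 500      /* assumed maximum number of distinct errors */
#define LINELENGTH 132      /* assumed maximum message length */

typedef struct
{
  char description[LINELENGTH]; /* the error message text */
  int n;                        /* how many times it occurred */
} error_entry;

/* Fold table src (src_len entries) into dst (*dst_len entries):
 * matching messages have their counts summed, new ones are appended.
 * Returns the new length of dst. */
int
meld_errors (error_entry *dst, int *dst_len,
             const error_entry *src, int src_len)
{
  for (int i = 0; i < src_len; i++)
    {
      int j;
      for (j = 0; j < *dst_len; j++)
        {
          if (strcmp (dst[j].description, src[i].description) == 0)
            {
              dst[j].n += src[i].n; /* same message: add the counts */
              break;
            }
        }
      if (j == *dst_len && *dst_len < NERROR_MAX)
        {
          dst[*dst_len] = src[i];   /* new message: append it */
          (*dst_len)++;
        }
    }
  return *dst_len;
}
```

This is exactly the "do work on them" step: the merge is keyed on the message string rather than being a blind element-wise sum, so tables from ranks that hit different errors still combine correctly.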

However, in the short term, I think we need to pay attention to the last major point in my earlier message. If we don't care whether something happens a million times, then it should not be an error. If we do care, then it should not be a problem that the program stops when it happens a million times.



@Higginbottom
Collaborator Author

So, I have implemented a new command, 'warning', that one can use for things that one wants to know about but does not necessarily think are problems that should cause the code to crash.
It exactly parallels Error, and can be turned off by dropping the verbosity below 3.
At the moment, 'no photons in band' has been changed to a warning, since this is the one that was causing Python to stop for me in the Proga models, plus it is not an 'error' as such.
I'm just testing it, and will commit it with a load of other changes over the weekend...

Nick

@jhmatthews
Collaborator

Proposed fix for me to implement:

  • There is always a limit of, say, 10,000 errors per core, after which the program quits (we want to get this number down); set to 100,000 at the moment.
  • we introduce a -e flag / line in the pf file to allow a developmental user like me to 'force' the program to change the number of errors before quitting. done: see commit agnwinds/python@59d726d
  • one runs py_error.py or watchdog.py to get overall errors (in ~/Dropbox/Python/reporting)
  • remove the warning function and replace it with errors. done: see commit agnwinds/python@5236128
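Parsing for a -e flag of this kind might look like the sketch below. This is a guess at the shape, not the contents of commit agnwinds/python@59d726d; max_errors and parse_args are illustrative names.

```c
/* Hypothetical sketch of parsing a -e flag that overrides the per-core
 * error limit before the program quits. Names are illustrative. */
#include <stdlib.h>
#include <string.h>

long max_errors = 100000;   /* default per-core limit from the thread above */

void
parse_args (int argc, char *argv[])
{
  for (int i = 1; i < argc; i++)
    {
      /* -e takes one argument: the new error limit */
      if (strcmp (argv[i], "-e") == 0 && i + 1 < argc)
        max_errors = atol (argv[++i]);
    }
}
```

A flag like this keeps the safety net on by default while letting a developer deliberately raise or lower the limit for a long multi-core run.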

@jhmatthews
Collaborator

Closing this as the above tasks are now done. The commit in question had a missing `,`, but agnwinds/python@75cf093 addresses this.
