
Error logging on multiple cores #47

Closed
Higginbottom opened this issue Sep 2, 2013 · 6 comments

@Higginbottom
Collaborator

The current method of causing Python to exit if it sees (1 million / number of processors) errors of any single type is not really working well.
This is because there are some fairly benign errors, like 'no photons in band', that can easily exceed this number if you are running on a few hundred cores.
I think that for the time being we should go back to simply failing on 1 million errors, full stop, and be careful to check our total errors.
In the long term, we need to think carefully about which errors we really care about, and which we merely want to be informed of, rather than causing many-hour-long jobs to simply stop.

@kslong
Collaborator

kslong commented Sep 2, 2013

I believe the current method, unless it has been changed, is to stop if any single error exceeds 1 million. If that is not the case, I certainly agree with you that we need to think about this, especially in light of wanting to run on multiple cores.

There are at least 3 topics associated with this:

  1. We wanted a way of logging the number of errors so we could see if something was happening a large number of times in a particular run. (In single-processor mode, this was one of the easiest ways of telling whether something had gone wrong in a particular run. In multi-processor mode, to preserve this capability and to be able to compare runs with different numbers of processors, we need a way to collect all the error counts.)
  2. We wanted to stop the program and not waste computer time if something went seriously wrong. This was the original reason I put these limits in.
  3. Is something an error? If you are prepared to have an error happen a million times and not care, perhaps you should reclassify it as not an error, or write a better set of conditions to establish that it is a real error in a given case. (I guess the reason you might want to do this is that we do not have a capability to count other kinds of messages, though perhaps one could be added.)

So my statement of this problem would be:

We need a way to collect up errors across multiple processors so that we can compare results obtained in single- and multiprocessor mode.
We need to stop classifying as errors things that we are willing to have happen a million times.
We should avoid doing the same calculation on multiple cores, so errors are generated only once. (Indeed, I did not realize we still did any of this.)

@jhmatthews
Collaborator

For errors which we don't want Python to quit on, we can quite easily count them with a few edits to the scripts py_error.py and watchdog.py. This would just require a unique string like 'Warning', or a new function like Warning() in log.c. These could even be counted by error_log/error_count internally in Python itself, and wouldn't take much effort.
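A Warning() routine paralleling Error() might look roughly like the sketch below. This is only an illustration, not the actual log.c implementation: the globals warn_count and verbosity, the threshold of 3, and the output stream are all assumptions based on the discussion in this thread.

```c
/* Hypothetical sketch of a Warning() routine paralleling Error() in log.c.
 * warn_count and verbosity are illustrative globals, not the real ones. */
#include <stdio.h>
#include <stdarg.h>

static int warn_count = 0;  /* running tally of all warnings issued */
static int verbosity = 5;   /* assumed global verbosity level */

int
Warning (const char *format, ...)
{
  va_list ap;

  warn_count++;             /* count the warning even if we don't print it */
  if (verbosity < 3)        /* suppress warning output at low verbosity */
    return warn_count;

  fprintf (stderr, "Warning: ");
  va_start (ap, format);
  vfprintf (stderr, format, ap);
  va_end (ap);
  return warn_count;
}
```

The key design point is that a warning is tallied like an error but never contributes to the quit threshold.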

The watchdog script also counts errors across various threads fairly well, although because it counts by looking in the diag file it can only pick up 100 errors per thread (as that is the maximum Python reports). The py_error script does count them properly, as it runs after the run has finished and so looks at the error summary for all processes.

I think the best solution is firstly to implement some kind of 'error warning' that isn't as serious as an actual error, and also to have a value nerr_tot, the total of all errors, which is communicated between the threads at certain stages (say in wind_updates and spectrum_summary); we can then quit if this number is exceeded. We would then never quit on a single error type, but rather on the total errors exceeding one million. Is that satisfactory? We could also quit if a single error on a certain thread exceeds a slightly smaller number, perhaps.

If we still want to quit when the total of a single error across all threads exceeds 1 million, then that will be a bit trickier, but may be doable. It would mean either some kind of clever memory being assigned as errors are found, or a list of all possible errors being declared at the start.
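The nerr_tot idea could be split into a pure decision check plus a reduction across ranks, roughly as below. should_abort, MAX_TOTAL_ERRORS, and nerr_local are illustrative names, not the actual code; the MPI call is shown only in a comment so the decision logic stands alone.

```c
/* Sketch of a global error-count check. The threshold test is separated
 * from the communication so it can be reasoned about (and tested) alone. */

#define MAX_TOTAL_ERRORS 1000000L   /* illustrative global limit */

/* Return 1 if the run should stop given the summed error count. */
int
should_abort (long nerr_tot)
{
  return nerr_tot > MAX_TOTAL_ERRORS;
}

/* In an MPI build, each rank would do something like:
 *
 *   long nerr_tot;
 *   MPI_Allreduce (&nerr_local, &nerr_tot, 1, MPI_LONG, MPI_SUM,
 *                  MPI_COMM_WORLD);
 *   if (should_abort (nerr_tot))
 *     MPI_Abort (MPI_COMM_WORLD, 1);
 *
 * at the synchronisation points mentioned above (wind_updates,
 * spectrum_summary), so every rank agrees on whether to quit.
 */
```

Because every rank receives the same total from the reduction, all ranks make the same quit decision, avoiding a hang where one rank exits while others wait.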

@kslong
Collaborator

kslong commented Sep 3, 2013

This is in response to James' message

> For errors which we don't want Python to quit on, we can quite easily count them with a few edits to the scripts py_error.py and watchdog.py. This would just require a unique string like 'Warning', or a new function like Warning() in log.c. These could even be counted by error_log/error_count internally in Python itself, and wouldn't take much effort.

Yes this would be possible, if necessary, but it creates differences between single processor mode and multiprocessor mode.

> The watchdog script also counts errors across various threads fairly well, although because it counts by looking in the diag file it can only pick up 100 errors per thread (as that is the maximum Python reports). The py_error script does count them properly, as it runs after the run has finished and so looks at the error summary for all processes.

I don't think this is what we are talking about.

> I think the best solution is firstly to implement some kind of 'error warning' that isn't as serious as an actual error, and also to have a value nerr_tot, the total of all errors, which is communicated between the threads at certain stages (say in wind_updates and spectrum_summary); we can then quit if this number is exceeded. We would then never quit on a single error type, but rather on the total errors exceeding one million. Is that satisfactory? We could also quit if a single error on a certain thread exceeds a slightly smaller number, perhaps.

> If we still want to quit when the total of a single error across all threads exceeds 1 million, then that will be a bit trickier, but may be doable. It would mean either some kind of clever memory being assigned as errors are found, or a list of all possible errors being declared at the start.

I think the tricky question is whether we can take an array from a thread, bring it to the master thread, and then process it in some way. Our current structure for keeping track of the errors consists of an 'Error message' and a count.

We need to be able to meld two such structures, not simply add or average them. We actually need to do work on them.
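One sketch of "melding" two such (message, count) tables, assuming each process holds a fixed-size array of entries: counts for matching messages are summed, and messages not yet seen are appended. error_entry, meld_errors, and the sizes are illustrative names, not the code's actual error structure.

```c
/* Sketch of merging two per-process error tables keyed by message text.
 * The struct layout and limits are assumptions for illustration. */
#include <string.h>

#define NERROR_MAX 500      /* assumed maximum number of distinct errors */
#define LINELENGTH 132      /* assumed maximum message length */

typedef struct
{
  char description[LINELENGTH]; /* the error message text */
  int n;                        /* how many times it occurred */
} error_entry;

/* Fold table src (src_len entries) into dst (*dst_len entries):
 * matching messages have their counts summed, new ones are appended.
 * Returns the new length of dst. */
int
meld_errors (error_entry *dst, int *dst_len,
             const error_entry *src, int src_len)
{
  for (int i = 0; i < src_len; i++)
    {
      int j;
      for (j = 0; j < *dst_len; j++)
        {
          if (strcmp (dst[j].description, src[i].description) == 0)
            {
              dst[j].n += src[i].n; /* same message: add the counts */
              break;
            }
        }
      if (j == *dst_len && *dst_len < NERROR_MAX)
        {
          dst[*dst_len] = src[i];   /* new message: append it */
          (*dst_len)++;
        }
    }
  return *dst_len;
}
```

This is exactly the "do work on them" step: the merge is keyed on the message string rather than being a blind element-wise sum, so tables from ranks that hit different errors still combine correctly.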

However, in the short term, I think we need to pay attention to the last major point in my earlier message. If we don't care whether something happens a million times, then it should not be an error. If we do care, then it should not be a problem that the program stops when it happens a million times.



@Higginbottom
Collaborator Author

So, I have implemented a new command, 'warning', that one can use for things that one wants to know about but does not necessarily think are problems that should cause the code to crash.
It exactly parallels Error, and can be turned off by dropping the verbosity below 3.
At the moment, 'no photons in band' has been changed to a warning, since this is the one that was causing Python to stop for me in the Proga models, plus it is not an 'error' as such.
I'm just testing it, and will commit it with a load of other changes over the weekend...

Nick

@jhmatthews
Collaborator

Proposed fix for me to implement:

  • There is always a limit of, say, 10,000 errors per core, after which the program quits (we want to get this number down); set to 100,000 at the moment.
  • we introduce a -e flag / line in the pf file to allow a developmental user like me to 'force' the program to change the number of errors before quitting. done: see commit agnwinds/python@59d726d
  • one runs py_error.py or watchdog.py to get overall errors (in ~/Dropbox/Python/reporting)
  • remove the warning function and replace it with errors. done: see commit agnwinds/python@5236128
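Parsing for a -e flag of this kind might look like the sketch below. This is a guess at the shape, not the contents of commit agnwinds/python@59d726d; max_errors and parse_args are illustrative names.

```c
/* Hypothetical sketch of parsing a -e flag that overrides the per-core
 * error limit before the program quits. Names are illustrative. */
#include <stdlib.h>
#include <string.h>

long max_errors = 100000;   /* default per-core limit from the thread above */

void
parse_args (int argc, char *argv[])
{
  for (int i = 1; i < argc; i++)
    {
      /* -e takes one argument: the new error limit */
      if (strcmp (argv[i], "-e") == 0 && i + 1 < argc)
        max_errors = atol (argv[++i]);
    }
}
```

A flag like this keeps the safety net on by default while letting a developer deliberately raise or lower the limit for a long multi-core run.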

@jhmatthews
Collaborator

Closing this as the above tasks are now done. The commit in question had a missing `,`, but agnwinds/python@75cf093 addresses this.
