Error logging on multiple cores #47
I believe the current method, unless it has been changed, is to stop if any single error exceeds 1 million. If that is not the case, I certainly agree with you that we need to think about this, especially in light of wanting to run on multiple cores. There are at least 3 topics associated with this:
So my statement of this problem would be: we need a way to collect errors across multiple processors so that we can compare results obtained in single-processor and multiprocessor mode.
For errors which we don't want python to quit for, we can quite easily count them with a few edits to the scripts py_error.py and watchdog.py. This would just require a unique string like 'Warning' or a new function like Warning() in log.c. This could even be counted by error_log/error_count internally in python itself and wouldn't take much effort.

The watchdog script also counts errors across the various threads fairly well, although because it counts by looking in the diag file it can only pick up 100 errors per thread (as that is the maximum python reports). The py_error script does count them properly, as it runs after the run has finished and so looks at the error summary for all processes.

I think the best solution is firstly to implement some kind of 'error warning' that isn't as serious as an actual error, and also to have a value nerr_tot which is the total of all errors and is communicated between the threads at certain stages (say in wind_updates and spectrum_summary); we can then quit if this number is exceeded. We would then never quit on a single error, but rather on the total errors exceeding one million. Is that satisfactory? We could also quit if a single error on a certain thread exceeds a slightly smaller number, perhaps. If we still want to quit when the total of a single error across all threads exceeds 1 million, that will be a bit trickier, but may be doable. It would mean either some kind of clever memory being assigned as errors are found, or a list of all possible errors being declared at the start.
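A Warning() routine of the kind proposed here could be quite small. The following is a minimal sketch only, assuming a hypothetical global counter; the names (warning_count, get_warning_count) are illustrative and not the actual log.c API:

```c
/* Sketch of a Warning() function for log.c: logs a message tagged
 * 'Warning' and tallies it separately from the error count, so that
 * benign events never push a run over the error limit.
 * Assumption: a file-scope counter; names are hypothetical. */
#include <stdarg.h>
#include <stdio.h>

static int warning_count = 0;   /* tallied separately from errors */

int
Warning (char *format, ...)
{
  va_list ap;

  warning_count++;
  fprintf (stderr, "Warning: ");
  va_start (ap, format);
  vfprintf (stderr, format, ap);
  va_end (ap);
  return warning_count;
}

int
get_warning_count (void)
{
  return warning_count;
}
```

Because warnings carry a unique prefix, py_error.py and watchdog.py could then count them with a simple string match, without touching the error-exit logic.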
This is in response to James' message.

> For errors which we don't want python to quit for, we can quite easily count them with a few edits to the scripts py_error.py and watchdog.py. This would just require a unique string like 'Warning' or a new function like Warning() in log.c. This could even be counted by error_log/error_count internally in python itself and wouldn't take much effort.

Yes, this would be possible if necessary, but it creates differences between single-processor and multiprocessor mode.

> The watchdog script also counts errors across the various threads fairly well, although because it counts by looking in the diag file it can only pick up 100 errors per thread (as that is the maximum python reports). The py_error script does count them properly, as it runs after the run has finished and so looks at the error summary for all processes.

I don't think this is what we are talking about.

> I think the best solution is firstly to implement some kind of 'error warning' that isn't as serious as an actual error, and also to have a value nerr_tot which is the total of all errors and is communicated between the threads at certain stages (say in wind_updates and spectrum_summary); we can then quit if this number is exceeded. We would then never quit on a single error, but rather on the total errors exceeding one million. If we still want to quit when the total of a single error across all threads exceeds 1 million, that will be a bit trickier, but may be doable.

I think the tricky question is whether we can take an array from a thread, bring it to the master thread, and then process it in some way. Our current structure for keeping track of the errors consists of an 'Error message' and a count. We need to be able to meld two such structures, not simply add or average them; we actually need to do work on them.

However, in the short term, I think we need to pay attention to my last major point in my earlier message. If we don't care whether something happens a million times, then it should not be an error. If we do care, then it should not be a problem that the program stops when it happens a million times.
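The melding step described above can be separated from the MPI transport: once each thread's table of (message, count) pairs reaches the master, merging is a pure operation. This is a sketch under stated assumptions only; the struct layout, NERROR_MAX, and error_merge are hypothetical names, not the actual log.c structures:

```c
/* Merge one thread's error table into the master's table: counts for
 * matching messages are summed, unseen messages are appended.  This is
 * the "do work on them" step -- neither a simple add nor an average.
 * Assumption: fixed-size tables keyed by the message string. */
#include <string.h>

#define NERROR_MAX 100          /* hypothetical table capacity */
#define LINELENGTH 160

struct error_entry
{
  char description[LINELENGTH]; /* the 'Error message' string */
  int n;                        /* times this error was logged */
};

void
error_merge (struct error_entry *dst, int *nd,
             const struct error_entry *src, int ns)
{
  int i, j;

  for (i = 0; i < ns; i++)
    {
      for (j = 0; j < *nd; j++)
        {
          if (strcmp (dst[j].description, src[i].description) == 0)
            {
              dst[j].n += src[i].n; /* same message: sum the counts */
              break;
            }
        }
      if (j == *nd && *nd < NERROR_MAX)
        {
          dst[*nd] = src[i];    /* new message: append it */
          (*nd)++;
        }
    }
}
```

With a merge like this on the master thread, the combined table can be checked against the 1 million limit per error type across all threads, which is the trickier case raised above.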
So, I have implemented a new command 'warning' that one can use for things that one wants to know about, but does not necessarily think are problems that should cause the code to crash. Nick
Proposed fix for me to implement:
Closing this as the above tasks are now done. The commit in question had a missing `,`, but agnwinds/python@75cf093 addresses this.
The current method of causing python to exit if it sees (1 million / number of processors) occurrences of any single error type is not really working well.
This is because there are some fairly benign errors, like 'no photons in band' that can easily exceed this number if you are running on a few hundred cores.
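To make the scaling problem above concrete, here is a minimal sketch of the per-processor budget under the current scheme; the function name is illustrative, not part of the code:

```c
/* Per-processor error budget under the scheme described in the issue:
 * the overall 1 million limit divided by the number of processors.
 * On many cores the per-core budget becomes small enough that a benign
 * error like 'no photons in band' can exhaust it. */
#define NERR_TOTAL 1000000      /* overall error budget */

int
per_proc_threshold (int nproc)
{
  return NERR_TOTAL / nproc;
}
```

For example, on 200 cores each core is allowed only 5000 occurrences of a given error type before the whole run dies, which benign errors can easily exceed.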
I think that for the time being, we should go back to simply setting the failure threshold at 1 million errors, full stop, and be careful to check our total errors.
In the long term, we need to think carefully about which errors we really care about, and which we merely want to be informed of, rather than having them cause many-hour-long jobs to simply stop.