-
Notifications
You must be signed in to change notification settings - Fork 161
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add CallWithTimeout and related. #331
Conversation
Some minor issues: Passing 0,0 as the seconds/microseconds makes the loop go forever. The following code segfaults gap:
|
Setting the kernel timer to 0 is how you unset it, which explains the first one. This is probably best dealt with in the GAP wrapper function. I can also reproduce the crash on OS X. It would be useful to know if anyone sees it on Linux. Also does any of this work under cygwin at all? |
Chris's crash is messy. I'm not yet 100% sure of the details, but essentially when the user quits from the inner break loop, the longjump buffer used for the timeout is left with its value from the inner timeout, which is invalid since it comes from stack frame that we have left (via the separate longjump used for error handling) so when the outer timeout fires, we try to longjump to a place that no longer exists, and segfault. The only really nice way to fix this is a full review of the whole error/exception handling concept, and the related business of saving and restoring reader and execution state. Ultimately this would be nice but it will be quite hard and most likely break a few things along the way. In the meantime, we either add another piece of sticking plaster by trying to find all the right places to save and restore the timeout jump buffer, or we avoid the problem by banning nested timeouts, or something like that. |
Have added sticking plaster. Not quite as complex as I thought. |
b9e9a82
to
ff39f83
Compare
As far as I know, this is now ready for merge, although more testing is always good. It has documentation and a basic testfile. |
Can make this code crash about 1 time in 3 with the following nasty invocation:
Looks like an incorrect longjmp is occuring (there is no backtrace, just longjmping into nowhere). |
@ChrisJefferson I can't reproduce this. |
Did you try popping it in a |
I can reproduce it now, although rarely. I'll have a look at the code. There might be a race condition around setting up the jump buffer. |
The problem is that a we try calling with 'NumAlarmJumpBuffers' = 0, which I think should never happen. This suggests an alarm is getting called after we disabled the alarms (might be something written somewhere that that can happen. |
Think I know what the problem is. If the signal comes in after we execute the last statement of the function, we are left with InterruptExecStats having been called and SyAlarmHasGoneOff set to true, I can fix. |
Not sure what is going on with travis's timing -- seems to be going crazy but not sure how it's going crazy. |
OK. I think I've fixed the race condition. Travis timings probably relate to which timers see the system CPU time associated with calls to RunTime(). Possibly need to rethink how to test this code. |
I tried injecting some 'Runtimes()' calls into the test to see where the time is being spent, it looks like there isn't any great amount of time being spent anywhere:
|
That’s 8ms of system time, which is 8000 microseconds, so the alarm might go off if that is being counted.
|
Woops, I got my units wrong, you are right. So just change
|
I am wondering whether it would be a good idea to add I also think there is a potential race condition (including in the existing code for handling SIGINT). Remember that the compiler is unaware of signals and can reorder assignments to |
Thanks.
|
Setting up signal delivery and the timing mechanism should work the same way as in the single-threaded version (assuming that per-thread CPU accounting is actually correct and not faked [1]); however, the actual signal delivery and handling is complicated when you have multiple threads (note that this is also the case, if, say, a GAP package starts a new thread [2]). The underlying problem is that POSIX doesn't guarantee that a signal is delivered to a specific thread, but allows the OS to pick any thread (even randomly) that doesn't block the signal. The usual solution is to block the signal in all but a dedicated signal handler thread and have that signal be forwarded by that thread to the thread it was intended for via One final caveat: The current implementation of [1] This can be simulated, in the same way that profilers do sample-based profiling, but that may become hairy. [2] I.e. if a GAP package did this, it would have to block signals that the main thread is prepared to handle; this is not a problem in HPC-GAP, which blocks the signals it handles itself in the main thread and worker threads, and new threads inherit the signal mask of the creating thread. |
OK Thanks for the information. |
@stevelinton tested on Windows VM, compiling and running GAP in Cygwin shell. It builds without warnings and errors, but then
|
@alex-konovalov Great. So cygwin provides either setiimer or timer-create, but then it doesn't work. Can you get the compile log from that system? Even better, can you try to generate this error interactively and tell me what LastSystemError shows? Thanks |
Thanks @alex-konovalov , quick google reveals cygwin doesn't do ITIMER_VIRTUAL, we'll have to use ITIMER_REAL on cygwin (which does measure wall time rather than process time, but should be a reasonable substitute) |
Well, just replacing ITIMER_VIRTUAL by ITIMER_REAL in two places where it's used doesn't help... |
Still error in the same place? |
@ChrisJefferson: yes :( I gave you the wrong fragment of code. It happens here:
|
@stevelinton The
|
They must be using it to mean “no such timer”. I would guess that the timer-create call in InitAlarms, the return value of which, mea culpa, is not being checked, must be failing. I imagine it is more-or-less the same as what Chris reported — CPU timing isn’t supported, so we have to use CLOCK_REALTIME instead of one of the others.
|
My windows install is currently broken -- once I fix it, I'll figure out exactly what does and doesn't work in cygwin. |
ba07060
to
b8fdb15
Compare
The current version should in theory work on cygwin.If it does, then as far as I know it's ready to merge, although it might be sensible to squash the commits a bit. |
Yep, works fine with cygwin |
b8fdb15
to
6a61af2
Compare
@stevelinton @ChrisJefferson thanks - that was the only issue remaining, so let's merge this then. |
Modify test file to hopefully please travis Add warning to documentation that a timeout may leave things inconsistent. Some robsutness improvements suggested by Reimer Check return value from timer_create
Make alarms work with TakeInterrupt() This allows long, but cooperative kernel functions to be interrupted.
6a61af2
to
3ac7ce3
Compare
Add CallWithTimeout and related.
No further comments for 4 days - merged now. Thanks |
@stevelinton interestingly, the test passes on Travis infrastructure, but fails on all three machines that we use for Jenkins nightly tests:
|
@stevelinton could it be a race condition somewhere? The longer is the timeout, the more robust it is. For example, I have
each time, with the 1st argument being 10000 I can get |
Most likely, it's just the timeout being hit. As I understand it, the value is given in microseconds, and 5000 µs = 5 ms can easily fail if the machine is just a bit busy (or if the OS doesn't support high resolution timers). |
@rbehrends I think you're right... so I suggest to increase the timeout in the failing calls from 5000 to 50000 to make the test reproducible. |
Changes Unknown when pulling 3ac7ce3 on stevelinton:sl/alarm into ** on gap-system:master**. |
This PR adds a GAP level function
CALL_WITH_TIMEOUT( <seconds>, <useconds>, <fn>, <args> )
<seconds>
and<useconds>
must be small intsThe return value is
The timer is paused during a break loop (and might even by nestable) and reset on quitting from a break.
There are some support functions:
TIMEOUTS_SUPPORTED(); -- returns true if timeouts could be compiled in this system and false otherwise
STOP_TIMEOUT();
-- stops the timer and returns a length 2 list[<seconds>, <useconds>]
RESUME_TIMEOUT( <state> );
-- takes such a list and sets the timer to it and starts it.these are used in the break loop handling.
This is all intended to be wrapped up at GAP level into something a little bit nicer. There are two implementations behind the scenes, depending on your OS -- one using setitimer and a nicer one using the newer POSIX interface --
timer-create()
etc. The second of these could be made to work sensibly in HPC-GAP, I think, but is not available on a Mac.I'd welcome testing.