
Frequent crashes at various steps of the workflow #33

Open
suhrig opened this issue Sep 3, 2018 · 1 comment

suhrig commented Sep 3, 2018

Dear Keiran,

The containers available at Dockstore crash frequently in our environment. I tried versions 1.0.8, 1.1.2, 1.1.3, and 1.1.4. The crashes occur at random steps of the workflow, even for the same dataset, which led me to believe that it is a technical issue unrelated to the data. With few exceptions, I could not find any error messages in the log files. The *.wrapper.log files contained an exit code of 255, as did the files inside the timings folder, but beyond that none of the other log files gave any hint about the source of the error.

After extensive debugging I managed to track down the crashes to two issues:

  • In order to launch a job, a command is written to a shell script, for example WGS_tumor_vs_control/caveman/tmpCaveman/logs/Sanger_CGP_Caveman_Implement_caveman_estep.94.sh. This script is then made executable and called right afterwards. Apparently, some versions/storage drivers of Docker have an issue with this: when there is no delay between making the script executable and running it, the permission change occasionally has not yet taken effect by the time the script is run, resulting in the error Text file busy and the termination of the workflow. Others have reported this issue, too: Running chmod on file results in 'text file busy' when running straight after (moby/moby#9547). Supposedly, it helps to insert a sync or sleep 1 between making the script executable and running it (see the first sketch after this list). I cannot confirm this, because switching to Singularity fixed the issue for me, so I did not bother to find out which scripts would need to be modified and actually try it out. Even though this is not a bug in the workflow itself but in Docker, you might want to consider inserting a sync, because other users might run into the same error.

  • After solving the above issue, only about half of the runs crashed (rather than 9 out of 10). The remaining crashes were caused by the need_backoff function in /opt/wtsi-cgp/lib/perl5/PCAP/Threaded.pm. The following line occasionally threw the error Use of uninitialized value $one_min:
    $ret = 1 if($one_min > $self->{'system_cpus'});
    I was unable to find out why $one_min is sometimes undefined. I tried writing the value of $uptime to STDERR to check whether the regex fails to match, but for reasons I do not understand the values never appeared in the log files of the workflow. I also tried replacing the uptime tool with something that is guaranteed to produce an output string matching the regex, but the error still occurred. At this point, I suspect that the call to the external uptime command from within Perl fails from time to time (see the second sketch after this list). I eventually gave up, since it takes days to reproduce the issue and I was able to avoid the crashes altogether by simply wrapping the offending line like this:

if (defined $one_min) {
  $ret = 1 if($one_min > $self->{'system_cpus'});
}
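
For illustration, the first workaround could look roughly like this in the code that writes and launches a job script. This is only a sketch; the subroutine and variable names are made up and do not correspond to the actual PCAP-core code.

use strict;
use warnings;

# Hypothetical job launcher: write the command to a script, make it
# executable, flush filesystem metadata, then run it.
sub launch_job {
    my ($script_path, $command) = @_;

    open my $fh, '>', $script_path or die "Cannot write $script_path: $!";
    print $fh "#!/bin/bash\nset -e\n$command\n";
    close $fh or die "Cannot close $script_path: $!";

    chmod 0755, $script_path or die "chmod failed for $script_path: $!";

    # Give the permission change time to take effect before executing the
    # freshly chmod-ed script (a sleep 1 would be an alternative to sync).
    system('sync');

    system($script_path) == 0
        or die "$script_path exited with code " . ($? >> 8);
}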

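To illustrate the second point: if need_backoff obtains $one_min by matching the output of the external uptime command with a regex roughly like the one below, then a failed call or unexpected output leaves $one_min undefined, which is exactly what the guard above catches. This is a sketch of the suspected failure mode, not the actual code from Threaded.pm.

use strict;
use warnings;

# Illustrative only: derive the one-minute load average from uptime.
my $uptime = `uptime`;    # may be empty if the external call fails
my $one_min;
($one_min) = $uptime =~ /load averages?:\s*([\d.]+)/ if defined $uptime;

my $system_cpus = 8;      # placeholder for $self->{'system_cpus'}
my $ret = 0;
# Without the defined() guard, an unmatched regex leaves $one_min undefined
# and the comparison warns "Use of uninitialized value $one_min".
$ret = 1 if defined $one_min && $one_min > $system_cpus;
print "need_backoff would return $ret\n";
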
I assume you do not run into these issues as often as I do, because you would certainly have noticed an error that affects such a large fraction of runs. I have no explanation as to why these two errors happen so frequently in our environment. Still, I was able to reproduce the issues on various systems (openSUSE/CentOS) with various kernel/Docker versions and various storage drivers, so other users might be affected, too. I therefore figured it is reasonable to take precautions against these errors and wanted to give you this feedback.

Regards,
Sebastian

@keiranmraine
Contributor

Thank you for the detailed report.

The actual logs are quite deeply buried; we will try to improve the docs on how to investigate issues.

The module that handles the script creation can do a sync, but this requires an environment variable to be set (which happens automatically via CWL):

https://github.com/cancerit/PCAP-core/blob/develop/lib/PCAP/Threaded.pm#L296-L298

However, it may be that we need to move that sync to after the chmod. We are currently reviewing many of our tools, so this should get picked up relatively soon.
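
That change would amount to roughly the following ordering; the environment variable name and surrounding lines here are placeholders rather than the actual implementation:

# Sketch of the proposed ordering only; PCAP_DO_SYNC is a placeholder name.
chmod 0755, $script_path or die "chmod failed for $script_path: $!";
system('sync') if $ENV{'PCAP_DO_SYNC'};   # settle metadata after chmod, before exec
system($script_path) == 0 or die "$script_path exited with " . ($? >> 8);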

FYI, we do seem to be finding that users on CentOS have more problems; we are not sure why.
