two instances of the same task running in parallel #6329
Have managed to reproduce using this example:

```
[scheduler]
    allow implicit tasks = True

[task parameters]
    a = 1..3
    b = 1..50

[scheduling]
    [[queues]]
        [[[default]]]
            limit = 40
    [[graph]]
        R1 = """
            <a> => <b>
        """

[runtime]
    [[<a=1>]]
        script = """
            if [[ $CYLC_TASK_SUBMIT_NUMBER -lt 2 ]]; then
                false
            fi
        """
```

When you run this example, one of the first three tasks will fail. Unstick the workflow by triggering all tasks:

```bash
# start the workflow
cylc vip .

# wait for the first task to fail

# trigger all tasks
cylc trigger '<wid>//*'
```

Then look for evidence of parallel submissions in the workflow log:

```
$ cylc cat-log tmp.zu3j5ygsks | grep 'succeeded for job'
2024-08-27T12:45:19+01:00 WARNING - [1/_b28/02:preparing] (received-ignored)succeeded for job(01) != job(02)
2024-08-27T12:45:19+01:00 WARNING - [1/_b24/02:preparing] (received-ignored)succeeded for job(01) != job(02)
2024-08-27T12:45:19+01:00 WARNING - [1/_b30/02:preparing] (received-ignored)succeeded for job(01) != job(02)
2024-08-27T12:45:19+01:00 WARNING - [1/_b26/02:preparing] (received-ignored)succeeded for job(01) != job(02)
2024-08-27T12:45:19+01:00 WARNING - [1/_b35/02:preparing] (received-ignored)succeeded for job(01) != job(02)
2024-08-27T12:45:19+01:00 WARNING - [1/_b41/02:preparing] (received-ignored)succeeded for job(01) != job(02)
2024-08-27T12:45:19+01:00 WARNING - [1/_b37/02:preparing] (received-ignored)succeeded for job(01) != job(02)
2024-08-27T12:45:19+01:00 WARNING - [1/_b27/02:preparing] (received-ignored)succeeded for job(01) != job(02)
```

I think it's the triggering of large numbers of tasks that leads to this bug.

...

Confirmed! Manually triggered tasks are being released via ...
Confirmed here too.
It looks like the same task is coming off the queue twice, could this be a queue-side bug?
That was difficult to track down, but I got it in the end. It's sort of a queue bug, but not quite - see the PR.
Just adding this here, I'm not sure if it's related or not, but someone at my workplace reported this on Friday evening.
Four submissions in the space of 3 seconds. You can see she was reloading the workflow, if that has any impact. I think this is Cylc 8.2.
Thanks Tom. Actually it was 8.3.3 that I had the issue in.
I stand corrected, 8.3.3. Also, as @sjrennie pointed out, all of hers were submitted as job 01.
Hmm, this looks different, I think. There's no resetting of ...
No, same platform, different PBS IDs.
So you can confirm those were all real job submissions?
Yes, PBS thought so! I could qstat them all, and they all reported back to job.status, and ran for 20 mins before I decided to kill them, since they wouldn't survive trying to write to the same file all at once at the end of the job.
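For anyone wanting to cross-check this kind of duplicate submission themselves, here is a rough sketch; the log path layout and the job.status field name are assumptions and may differ between Cylc versions and installations:

```bash
# List the job runner (PBS) ID recorded for each submit of the task, then ask
# PBS about them. The path and the CYLC_JOB_ID key are assumptions - check the
# job.status files on the job host for the exact field names.
grep -H 'CYLC_JOB_ID' ~/cylc-run/<workflow-id>/log/job/20231130T1100Z/aa_um_recon_host/*/job.status

# Query PBS for one of the recorded job IDs
qstat <pbs-job-id>
```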
OK, is that a grepped subset of the log? It might be useful to see the whole thing (you can email the relevant excerpt to me if it's large) to see exactly where the reload occurred.
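In case it helps, one rough way to line the reload up against the submissions is to filter the scheduler log for both (a sketch; `<workflow-id>` and `<task>` are placeholders, and the wording of the reload message is an assumption):

```bash
# Show reload events alongside the state changes of the affected task so the
# ordering is visible. Adjust the patterns to the exact messages in your log.
cylc cat-log '<workflow-id>' | grep -Ei 'reload|<task>'
```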
Here is the exact bit where it started to go wrong. I can extract a larger bit to send to you if necessary:

```
2024-08-30T07:09:43Z INFO - [20231130T1100Z/aa_um_recon_host/01:preparing] => submitted
```
The ...
OK, thanks for the more detailed log. Maybe it's something to do with reloading while the task is in the preparing state. I'll see if I can reproduce this...
A-ha, reproduced - thanks for the bug report! I'll open a new issue for this.
Closed by #6337
@ColemanTom @sjrennie - I backed out of my first attempt at a fix for your new duplicate job submission, but the real thing is up now: #6345
A bug where two instances of the same task have been observed running at the same time.
Whilst investigating #6315, I discovered an example where there were two parallel active submissions of the same task. This doesn't appear to be the cause of #6315, as #6315 has also been observed in the absence of this issue.
State Transitions
This line is particularly concerning:
Which is followed by this line providing some evidence of the two parallel tasks:
A task should never be able to regress from running to preparing, right?!
Looking through the log for this workflow, there are 17 instances of the first message and 11 of the second. These messages relate to various tasks, but they always relate to even-numbered jobs (02, 04, 06), never odd-numbered ones.
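For reference, a rough way to tally these messages and check the submit-number pattern (a sketch; `<workflow-id>` and `<message pattern>` are placeholders for the workflow and the exact log message):

```bash
# Count how often a given message appears in the scheduler log, then break the
# matches down by submit number (the "/NN:" part of the task proxy ID).
cylc cat-log '<workflow-id>' | grep -c '<message pattern>'
cylc cat-log '<workflow-id>' | grep '<message pattern>' | grep -o '/[0-9][0-9]:' | sort | uniq -c
```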
Job Timings
For further evidence of the parallel jobs we can inspect the timings in the remote logs:
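One way to pull those timings out is to compare the timestamps recorded in each submit's job.status file (a sketch; the path layout and the CYLC_JOB_*_TIME field names are assumptions and may vary between Cylc versions):

```bash
# Compare the submit/init/exit timestamps recorded for every submit of the
# task, on the job host or locally after log retrieval. <task> is a
# placeholder for the affected task name.
grep -H 'TIME' ~/cylc-run/<workflow-id>/log/job/20190331T0000Z/<task>/*/job.status
```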
The task was triggered (along with many others) at 2024-07-29T08:46:51Z via `cylc trigger '20190331T0000Z/*'`.

Reproducible(ish) example
See #6329 (comment)