Restarting Pulsar is not safe #340

Closed
natefoo opened this issue Nov 2, 2023 · 4 comments

natefoo commented Nov 2, 2023

It breaks jobs that were in the process of staging. It will in fact try to resume staging in (that's good!) but fails because it doesn't understand files that have already been fully transferred, e.g.:

2023-11-02 17:54:16,796 DEBUG [pulsar.managers.staging.pre][[manager=jetstream2]-[action=preprocess]-[job=53412028]] Staging jobdir 'tool_script.sh' via FileAction[path=/corral4/main/jobs/053/412/53412028/tool_script.sh,action_type=remote_transfer,url=https://galaxy-web-04.galaxyproject.org/_job_files?job_id=beef&job_key=c0ffee&path=%2Fcorral4%2Fmain%2Fjobs%2F053%2F412%2F53412028%2Ftool_script.sh&file_type=jobdir] to /jetstream2/scratch/main/jobs/53412028/tool_script.sh
2023-11-02 17:54:16,821 INFO  [pulsar.client.transport.curl][[manager=jetstream2]-[action=preprocess]-[job=53412028]] transfer of https://galaxy-web-04.galaxyproject.org/_job_files?job_id=beef&job_key=c0ffee&path=%2Fcorral4%2Fmain%2Fjobs%2F053%2F412%2F53412028%2Ftool_script.sh&file_type=jobdir will resume at 1789 bytes
Nov 02 17:54:16 jetstream2.galaxyproject.org pulsar[1233136]: 2023-11-02 17:54:16,826 INFO  [pulsar.managers.util.retry][[manager=jetstream2]-[action=preprocess]-[job=53408428]] Failed to execute action[Staging jobdir 'tool_script.sh' via FileAction[path=/corral4/main/jobs/053/408/53408428/tool_script.sh,action_type=remote_transfer,url=https://galaxy-web-03.galaxyproject.org/_job_files?job_id=beef&job_key=c0ffee&path=%2Fcorral4%2Fmain%2Fjobs%2F053%2F408%2F53408428%2Ftool_script.sh&file_type=jobdir] to /jetstream2/scratch/main/jobs/53408428/tool_script.sh], retrying in 4.0 seconds.
Traceback (most recent call last):
  File "/srv/pulsar/main/venv/lib64/python3.9/site-packages/pulsar/managers/util/retry.py", line 93, in _retry_over_time
    return fun(*args, **kwargs)
  File "/srv/pulsar/main/venv/lib64/python3.9/site-packages/pulsar/managers/staging/pre.py", line 20, in <lambda>
    action_executor.execute(lambda: action.write_to_path(path), "action[%s]" % description)
  File "/srv/pulsar/main/venv/lib64/python3.9/site-packages/pulsar/client/action_mapper.py", line 479, in write_to_path
    get_file(self.url, path)
  File "/srv/pulsar/main/venv/lib64/python3.9/site-packages/pulsar/client/transport/curl.py", line 99, in get_file
    raise Exception(message)
Exception: Failed to get_file properly for url https://galaxy-web-03.galaxyproject.org/_job_files?job_id=beef&job_key=c0ffee&path=%2Fcorral4%2Fmain%2Fjobs%2F053%2F408%2F53408428%2Ftool_script.sh&file_type=jobdir, remote server returned status code of 416.

mvdbeek commented Nov 6, 2023

> it doesn't understand files that have already been fully transferred

Pulsar shouldn't have to know that. Doesn't the 416 response from Galaxy indicate that the job isn't active anymore?

natefoo commented Nov 8, 2023

No, I don't think so. When we spoke about these errors before, I was confusing them with Pulsar trying to stage out data for jobs that Galaxy already considers terminal (e.g. due to user deletion). The 416 here, I believe, is because Pulsar sets the resume offset of completed stage-in files to EOF+1 and then requests that range from Galaxy (nginx X-Accel-Redirect), which returns 416 since it can't seek beyond the end of the file.
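
To illustrate the mechanism described above, here is a minimal sketch. It is not Pulsar's actual code and not necessarily how the problem was fixed: the function names are hypothetical, and the 1789-byte size in the usage note is only an assumption taken from the resume offset in the log. It shows why a resume request for a fully transferred file is unsatisfiable, and one way a client could recognize that case from the `Content-Range: bytes */<length>` header that servers normally attach to a 416 response.

```python
import re
from typing import Optional

# Hypothetical helpers for illustration only; none of these names exist in Pulsar.


def resume_range_header(local_size: int) -> Optional[str]:
    """Build the Range header that asks for the bytes not yet on disk."""
    if local_size <= 0:
        return None  # nothing on disk yet, so request the whole file
    return f"bytes={local_size}-"


def range_is_satisfiable(first_byte: int, remote_size: int) -> bool:
    """A byte range is satisfiable only if it starts inside the file."""
    return first_byte < remote_size


# Matches the unsatisfied-range form of Content-Range, e.g. "bytes */1789".
_UNSATISFIED_RANGE = re.compile(r"^bytes \*/(\d+)$")


def already_complete(status: int, content_range: Optional[str], local_size: int) -> bool:
    """Treat a 416 as "nothing left to fetch" when the reported length matches.

    Returns False if the header is missing or the sizes differ, so a real
    error is still surfaced to the caller.
    """
    if status != 416 or content_range is None:
        return False
    match = _UNSATISFIED_RANGE.match(content_range.strip())
    return match is not None and int(match.group(1)) == local_size
```

Assuming a 1789-byte `tool_script.sh` that is already fully on disk, `resume_range_header(1789)` yields `bytes=1789-`, `range_is_satisfiable(1789, 1789)` is `False` (hence the 416), and `already_complete(416, "bytes */1789", 1789)` is `True`. Comparing against the reported length keeps a 416 for a locally oversized or mismatched file from being silently accepted.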

mvdbeek commented Nov 9, 2023

Thanks, I was wondering where that is coming from.

mvdbeek self-assigned this Nov 14, 2023

mvdbeek commented Dec 18, 2023

Fixed in #348

mvdbeek closed this as completed Dec 18, 2023
github-project-automation bot moved this from Triage/Discuss to Done in Backend Working Group - weeklies Dec 18, 2023