Conflicting Namespace- Using the SLURM scheduler to check the status of a project tries to find bundled operations from other users and/or instances of FlowProject #758
Comments
Here's some information that may be helpful for tracking down the cause of the bug. For individual jobs (not bundles), signac-flow includes the project's full path in the hash when generating the submission names. Lines 970 to 973 in c0f44b2
For bundles, the bundle ID is dependent on the ids of the input job-operations. In my understanding this should already incorporate the project's full path (which we use to ensure uniqueness across projects). If the users are operating on separate signac data spaces, I would not have expected this to be a problem. Line 2156 in c0f44b2
Can you confirm if the projects share a path? Are the users operating on independent signac projects, or two FlowProjects pointing to the same signac project? Maybe you can insert a breakpoint and share the output of |
I would lean away from solutions that insert the username into the job/bundle identifier, because theoretically two users operating on a shared signac project should be able to submit and query over the same jobs without conflict. Adding a username to the mix would mean that signac-flow can't identify whether duplicate work is being submitted by both users acting on the same data space. |
This is a fair point about not inserting the username for that use case. For the example error message posted above, the two projects were at different paths with the two FlowProjects pointing to different signac projects, but still conflicted when checking their status, allowing neither to be checked while the other had jobs in the queue or active on nodes. I'll generate a toy example, such as what I posted above and see if I can replicate the error while also inserting that breakpoint. Thanks for your help! |
Okay, here's a toy example within the file tree.
Code to reproduce the error:
craven76@head :~$ ls ./
project1 project2
craven76@head :~$ cd project1
craven76@head :~$ python init_project.py
craven76@head :~$ python project.py submit -n 4 --bundle 2
Using environment configuration: Rahman
Querying scheduler...
Submitting cluster job 'TestProject/bundle/2b7a0a6de2e0451ac1a2805d7623eefa12d1d780':
WARNING:flow.project:Unable to load template from package. Original Error '__main__.__spec__ is None'.
- Group: wait_1min(89e412cabc2300a734873ff43b2dd367)
- Group: wait_1min(f3e56c602771e9541aef61d502562b89)
Submitting cluster job 'TestProject/bundle/0333205b13dc702ff1adafb5fe7a1ba309c90a2c':
- Group: wait_1min(e535d5106574b0506407aebaec71318e)
- Group: wait_1min(c7426fb8d903c232eeb81f6454c81069)
craven76@head :~$ cd ../project2
craven76@head :~$ python project.py status
Using environment configuration: Rahman
WARNING:flow.project:Unable to load template from package. Original Error '__main__.__spec__ is None'.
Querying scheduler...
ERROR:flow.project:Error during status update: [Errno 2] No such file or directory: '/raid6/homes/craven76/test_signac_bug/project2/.bundles/TestProject/bundle/2b7a0a6de2e0451ac1a2805d7623eefa12d1d780'
Use '--ignore-errors' to complete the update anyways.
Traceback (most recent call last):
File "project.py", line 30, in <module>
TestProject().main()
File "/raid6/homes/craven76/.conda/envs/switchable38/lib/python3.8/site-packages/flow/project.py", line 5120, in main
args.func(args)
File "/raid6/homes/craven76/.conda/envs/switchable38/lib/python3.8/site-packages/flow/project.py", line 4767, in _main_status
raise error
File "/raid6/homes/craven76/.conda/envs/switchable38/lib/python3.8/site-packages/flow/project.py", line 4761, in _main_status
self.print_status(jobs=aggregates, **args)
File "/raid6/homes/craven76/.conda/envs/switchable38/lib/python3.8/site-packages/flow/project.py", line 2980, in print_status
status_results, job_labels, individual_jobs = self._fetch_status(
File "/raid6/homes/craven76/.conda/envs/switchable38/lib/python3.8/site-packages/flow/project.py", line 2646, in _fetch_status
scheduler_info = self._query_scheduler_status(
File "/raid6/homes/craven76/.conda/envs/switchable38/lib/python3.8/site-packages/flow/project.py", line 2554, in _query_scheduler_status
return {
File "/raid6/homes/craven76/.conda/envs/switchable38/lib/python3.8/site-packages/flow/project.py", line 2554, in <dictcomp>
return {
File "/raid6/homes/craven76/.conda/envs/switchable38/lib/python3.8/site-packages/flow/project.py", line 2198, in scheduler_jobs
yield from self._expand_bundled_jobs(scheduler.jobs())
File "/raid6/homes/craven76/.conda/envs/switchable38/lib/python3.8/site-packages/flow/project.py", line 2172, in _expand_bundled_jobs
with open(self._fn_bundle(job.name())) as file:
FileNotFoundError: [Errno 2] No such file or directory: '/raid6/homes/craven76/test_signac_bug/project2/.bundles/TestProject/bundle/2b7a0a6de2e0451ac1a2805d7623eefa12d1d780' |
I then added the breakpoint, with the following result:
craven76@head :~$ python project.py submit -n 4 --bundle 2
WARNING:root:The operations for the space are ['TestProject/e535d5106574b0506407aebaec71318e/wait_1min/19d14bb5f93e4a7089005cd827fc69eb', 'TestProject/c7426fb8d903c232eeb81f6454c81069/wait_1min/f99b4a74c99916888f94527db2002770']
craven76@head :~$ cd ../project2
craven76@head :~$ python project.py status
Using environment configuration: Rahman
WARNING:flow.project:Unable to load template from package. Original Error '__main__.__spec__ is None'.
Querying scheduler...
ERROR:flow.project:Error during status update: [Errno 2] No such file or directory: '/raid6/homes/craven76/test_signac_bug/project2/.bundles/TestProject/bundle/2b7a0a6de2e0451ac1a2805d7623eefa12d1d780'
Use '--ignore-errors' to complete the update anyways.
Traceback (most recent call last):
File "project.py", line 30, in <module>
TestProject().main()
File "/raid6/homes/craven76/.conda/envs/switchable38/lib/python3.8/site-packages/flow/project.py", line 5124, in main
args.func(args)
File "/raid6/homes/craven76/.conda/envs/switchable38/lib/python3.8/site-packages/flow/project.py", line 4771, in _main_status
raise error
File "/raid6/homes/craven76/.conda/envs/switchable38/lib/python3.8/site-packages/flow/project.py", line 4765, in _main_status
self.print_status(jobs=aggregates, **args)
File "/raid6/homes/craven76/.conda/envs/switchable38/lib/python3.8/site-packages/flow/project.py", line 2984, in print_status
status_results, job_labels, individual_jobs = self._fetch_status(
File "/raid6/homes/craven76/.conda/envs/switchable38/lib/python3.8/site-packages/flow/project.py", line 2650, in _fetch_status
scheduler_info = self._query_scheduler_status(
File "/raid6/homes/craven76/.conda/envs/switchable38/lib/python3.8/site-packages/flow/project.py", line 2558, in _query_scheduler_status
return {
File "/raid6/homes/craven76/.conda/envs/switchable38/lib/python3.8/site-packages/flow/project.py", line 2558, in <dictcomp>
return {
File "/raid6/homes/craven76/.conda/envs/switchable38/lib/python3.8/site-packages/flow/project.py", line 2202, in scheduler_jobs
yield from self._expand_bundled_jobs(scheduler.jobs())
File "/raid6/homes/craven76/.conda/envs/switchable38/lib/python3.8/site-packages/flow/project.py", line 2176, in _expand_bundled_jobs
with open(self._fn_bundle(job.name())) as file:
FileNotFoundError: [Errno 2] No such file or directory: '/raid6/homes/craven76/test_signac_bug/project2/.bundles/TestProject/bundle/2b7a0a6de2e0451ac1a2805d7623eefa12d1d780' |
Original .py files for generating the issues above. |
@CalCraven I am not sure how this is happening with two users, given signac-flow/flow/scheduling/slurm.py Lines 46 to 49 in 74de383
but for a single user with two projects, I think it would error due to Lines 2169 to 2174 in 74de383
where Thus, I am surprised that someone else's submissions are causing errors, but it makes sense that one person's submissions could. We assume that if it has the same bundle prefix it is the same, but the prefix is not at all expected to be unique. Solutions:
Out of these, the second is most appealing from a design perspective (we would just need to place the jobs in a |
I also accidentally triggered this error, using two FlowProjects that have the same class name (MyProject in my case). Debugging with @cbkerr! A workaround that I'm using is to name the class after the specific project I'm working on and hope that no one else is using that name. There is a slurm job name of
The following is generated when calling print_status():
Querying scheduler...
Traceback (most recent call last):
File "/gpfs/accounts/sglotzer_root/sglotzer0/gabs/why_is_signac_not_working/testSignac.py", line 35, in <module>
project.print_status(detailed=True, parameters=['n'])
File "/home/gabs/anaconda3/envs/HOOMDTester/lib/python3.9/site-packages/flow/project.py", line 2980, in print_status
status_results, job_labels, individual_jobs = self._fetch_status(
File "/home/gabs/anaconda3/envs/HOOMDTester/lib/python3.9/site-packages/flow/project.py", line 2646, in _fetch_status
scheduler_info = self._query_scheduler_status(
File "/home/gabs/anaconda3/envs/HOOMDTester/lib/python3.9/site-packages/flow/project.py", line 2554, in _query_scheduler_status
return {
File "/home/gabs/anaconda3/envs/HOOMDTester/lib/python3.9/site-packages/flow/project.py", line 2554, in <dictcomp>
return {
File "/home/gabs/anaconda3/envs/HOOMDTester/lib/python3.9/site-packages/flow/project.py", line 2198, in scheduler_jobs
yield from self._expand_bundled_jobs(scheduler.jobs())
File "/home/gabs/anaconda3/envs/HOOMDTester/lib/python3.9/site-packages/flow/project.py", line 2172, in _expand_bundled_jobs
with open(self._fn_bundle(job.name())) as file:
FileNotFoundError: [Errno 2] No such file or directory: '/gpfs/accounts/sglotzer_root/sglotzer0/gabs/why_is_signac_not_working/.bundles/MyProject/bundle/3572942c6bf210551200f6013ddfac753c662383' |
Thanks for reminding me of this thread @seoulfood! Yeah, I use the same workaround and hope my naming is unique. Will keep an eye on this if you and @cbkerr can think up a more permanent fix. I can also test anything on our cluster if need be. |
@CalCraven @seoulfood I posted some solutions above. The problem is that we check all scheduler jobs by a user, regardless of project. The bundle prefix, which determines if we assume the scheduler job comes from the current project, only depends on the project class name. |
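To make the prefix problem concrete, here is a minimal sketch of the check as described above. It is not the signac-flow source: `bundle_prefix` and `looks_like_own_bundle` are hypothetical stand-ins, and the only thing taken from this thread is the `Project_name/bundle/job_hash` naming pattern and the bundle name from the toy example.

```python
def bundle_prefix(project_name: str) -> str:
    # Hypothetical stand-in: the prefix is built from the class name only,
    # so it carries no information about the project's path.
    return f"{project_name}/bundle/"


def looks_like_own_bundle(scheduler_job_name: str, project_name: str) -> bool:
    # A scheduler job is treated as "ours" if its name starts with the prefix.
    return scheduler_job_name.startswith(bundle_prefix(project_name))


# Bundle submitted from project1 in the toy example above.
queued = "TestProject/bundle/2b7a0a6de2e0451ac1a2805d7623eefa12d1d780"

# project2 also defines a class named TestProject, so it claims the bundle too,
# then fails when it looks for the matching file in its own .bundles directory.
claimed_by_project2 = looks_like_own_bundle(queued, "TestProject")
```

With a differently named class the check returns False, which is why renaming the class works as a workaround.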
Yeah, that seems strange; it could be something unique to the setup of our cluster. I haven't been able to replicate it, so I'm not necessarily worried about that issue. My intuition says it could easily have come from myself and another person each using the default "FlowProject" label for two separate projects, with each of us hitting the same bug within our own jobs. I will dig a little more to see if that's the case, but I think you're right that a conflict between users shouldn't be an issue.
I agree with your opinion, although I know little about how much work the other three solutions would take to implement. Especially when operating on a huge cluster with thousands of jobs, it makes sense to operate only on jobs you know exist in the current project. |
I don't think this alone completely solves the problem. If the same user has two projects of the same name AND submits the same operations on the same jobs in each, then there is no way to tell them apart. I agree that looping only over bundles known to the project is a good improvement, but I think we additionally need to disambiguate the bundle ids. Perhaps add a hash of the project's absolute path? In signac 2.0, it is not possible to store 2 projects at the same path. We wouldn't necessarily need to add additional hash characters to an already long bundle id: we could include the project path as a salt in the hash that is already computed. |
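A rough sketch of the salting idea, assuming the bundle id is a SHA-1 digest over the operation ids (the 40-character hashes in the logs above are consistent with that, but this is not confirmed from the source here). `salted_bundle_id` is a hypothetical helper, not an existing signac-flow function.

```python
import hashlib
import os
from typing import Iterable


def salted_bundle_id(
    project_name: str, project_path: str, operation_ids: Iterable[str]
) -> str:
    """Hypothetical helper: hash the operation ids with the project path as salt."""
    digest = hashlib.sha1()
    # Salt: the absolute project path, so equal operations in two projects differ.
    digest.update(os.path.abspath(project_path).encode("utf-8"))
    for op_id in operation_ids:
        # A separator byte keeps adjacent ids from running together.
        digest.update(b"\0")
        digest.update(op_id.encode("utf-8"))
    # The id keeps its current shape and length: name/bundle/<40 hex characters>.
    return f"{project_name}/bundle/{digest.hexdigest()}"
```

The same operations submitted from two different paths then get different ids, while the id stays stable for repeated submissions from one project.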
I agree this sounds like the right solution. However, when I've looked at this issue in the past, I was confused. I saw there is already a project path being included here: Line 981 in 2657151
Perhaps we're missing this somewhere else? |
We currently assume that if a job exists in the cluster's scheduler with the bundle prefix (which only includes the project name and the word bundle), the scheduler job is for that project, and we attempt to open the file corresponding to that bundle; however, the file may not exist if two projects with the same name exist in the same place. @joaander You are right, I didn't think about false positives the other way around. We do need to disambiguate the bundle prefix more regardless of going from file to scheduler job or scheduler job to bundle. |
@bdice pointed out that this is already done. I failed to read the entire comment thread or look at the code. |
This is true of scheduler jobs, but as I mentioned above and can also be seen here, Lines 2129 to 2132 in 74de383
and here, Lines 2169 to 2174 in 74de383
when it comes to searching for extant bundles, bundle_prefix does not contain disambiguating information, and thus we attempt to open non-existent files when we have two projects with the same name using bundles.
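For illustration, the "only loop over bundles known to the project" idea from earlier in the thread could look roughly like this. It is a sketch under assumptions, not a patch: `expand_known_bundles` is a hypothetical function, and it assumes a bundle file lists one operation id per line under the project's .bundles directory (the directory layout matches the paths in the tracebacks above).

```python
import os
from typing import Iterable, Iterator


def expand_known_bundles(
    scheduler_job_names: Iterable[str], bundles_dir: str, prefix: str
) -> Iterator[str]:
    """Hypothetical sketch: expand bundles, skipping ones this project never wrote."""
    for name in scheduler_job_names:
        if not name.startswith(prefix):
            # Not a bundle: pass the scheduler job name through unchanged.
            yield name
            continue
        bundle_file = os.path.join(bundles_dir, name)
        if not os.path.isfile(bundle_file):
            # Same prefix but no local file: the bundle belongs to another
            # project, so skip it instead of raising FileNotFoundError.
            continue
        with open(bundle_file) as file:
            for line in file:
                line = line.strip()
                if line:
                    yield line  # assumed format: one operation id per line
```

As noted above, this avoids the crash but does not by itself tell apart two projects that produce identical bundle ids.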
|
Description
When submitting jobs to a scheduler using the default SLURM template, the job status is returned with the [A], [Q], etc. progress markers to keep track of these jobs. The jobs themselves are labeled in the scheduler via their NAME attribute, which looks something like Project_name/bundle/job_hash. The issue arises when two users give their projects the same Project_name when creating the FlowProject subclass. An error will be raised because the status check sees that there's a job under your project, but that job_hash will not exist in the .bundles directory. A possible solution might be to append the username to the Project_name when the job is posted to the scheduler, i.e. Project_name-user_name/bundle/job_hash.
To reproduce
./project1/project1.py
./project2/project2.py
python project1.py submit
cd ../project2
python project2.py status
Error output