After a job update, the DeviceIDs assigned to the allocation change while the previous allocation keeps running, even though nothing actually changes on the device itself.
Observed with an nvidia device and the qemu driver.
In this case the scheduler misbehaves, because it then tries to place a new allocation on a DeviceID that is already in use.
Reproduction steps
Create a new job with Resources.Devices != nil.
Update any field of this job in a way that keeps the current allocation (an in-place update).
The DeviceID is changed by GenericStack.
Nomad server logs
//log debug info from https://github.com/hashicorp/nomad/blob/master/scheduler/rank.go#L189
task request resources:
&{Name:nvidia/gpu/p102 Count:1 Constraints:[] Affinities:[]}
create offer for &{Name:nvidia/gpu/p102 Count:1 Constraints:[] Affinities:[]}
device offer &{Vendor:nvidia Type:gpu Name:p102 DeviceIDs:[0c:00.0]}
addReserved reqID {Vendor:nvidia Type:gpu Name:p102} devInst &{Device:0xc000820af0 Instances:map[04:00.0:0 05:00.0:0 06:00.0:1 07:00.0:1 08:00.0:1 09:00.0:1 0a:00.0:1 0b:00.0:0 0c:00.0:0 0d:00.0:0]} deviceIds [0c:00.0]
//nomad logger
[DEBUG] worker: submitted plan for evaluation: eval_id=1d118d67-4f98-fcad-a425-f8cc0288b2dc
[DEBUG] worker.service_sched: setting eval status: eval_id=1d118d67-4f98-fcad-a425-f8cc0288b2dc job_id=8.74.8.71e3d7dc-b3ff-e526-3ecf-758b745de200 namespace=default status=complete
/////////Many others
//log debug info from https://github.com/hashicorp/nomad/blob/master/scheduler/rank.go#L189
task request resources:
&{Name:nvidia/gpu/p102 Count:1 Constraints:[] Affinities:[]}
create offer for &{Name:nvidia/gpu/p102 Count:1 Constraints:[] Affinities:[]}
device offer &{Vendor:nvidia Type:gpu Name:p102 DeviceIDs:[0b:00.0]}
addReserved reqID {Vendor:nvidia Type:gpu Name:p102} devInst &{Device:0xc000820af0 Instances:map[04:00.0:0 05:00.0:0 06:00.0:1 07:00.0:1 08:00.0:1 09:00.0:1 0a:00.0:1 0b:00.0:0 0c:00.0:0 0d:00.0:0]} deviceIds [0b:00.0]
[DEBUG] worker.service_sched: reconciled current state with desired state: eval_id=66725242-1f5b-1286-ed28-64e709a011ef job_id=8.74.8.71e3d7dc-b3ff-e526-3ecf-758b745de200 namespace=default results="Total changes: (place 0) (destructive 0) (inplace 1) (stop 0)
//nomad logger
Created Deployment: "5b23ed79-804a-9a1f-e017-4253bc36d2da"
Deployment Update for ID "10ca81e7-eb09-b443-6efc-75b3f240ca3a": Status "cancelled"; Description "Cancelled due to newer version of job"
I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.
Nomad version
Nomad v0.10.4 and older
Operating system and Environment details
5.4.30-1-MANJARO
4.15.0-91-generic #92-Ubuntu
Any
Fix
Restore the device resources from the existing allocation, as is already done for network resources in this case:
https://github.com/hashicorp/nomad/blob/master/scheduler/util.go#L899
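The proposed fix can be illustrated with a minimal, self-contained sketch. It does not use Nomad's real types or functions: `AllocatedDevice`, `TaskResources`, `restoreFromExisting` and the task name `gpu-task` are hypothetical stand-ins, and only the device IDs are taken from the logs above. The assumption is that an in-place update should carry over the device assignment of the running allocation in the same way the network assignment is carried over.

```go
package main

import "fmt"

// AllocatedDevice is a hypothetical, simplified stand-in for a device
// assignment; its fields mirror the "device offer" lines in the logs.
type AllocatedDevice struct {
	Vendor    string
	Type      string
	Name      string
	DeviceIDs []string
}

// TaskResources is a hypothetical, reduced view of the resources of one task.
type TaskResources struct {
	Networks []string
	Devices  []AllocatedDevice
}

// restoreFromExisting overwrites the networks and devices of each offered
// task with those of the existing (still running) allocation, so an in-place
// update cannot move the task to a different DeviceID. Tasks that are not
// present in the existing allocation keep their new offer.
func restoreFromExisting(existing, offered map[string]*TaskResources) {
	for task, res := range offered {
		prev, ok := existing[task]
		if !ok {
			continue
		}
		res.Networks = prev.Networks
		res.Devices = prev.Devices
	}
}

func main() {
	// The running allocation holds GPU 0c:00.0.
	existing := map[string]*TaskResources{
		"gpu-task": {Devices: []AllocatedDevice{
			{Vendor: "nvidia", Type: "gpu", Name: "p102", DeviceIDs: []string{"0c:00.0"}},
		}},
	}
	// The new offer computed during the update picked 0b:00.0 instead.
	offered := map[string]*TaskResources{
		"gpu-task": {Devices: []AllocatedDevice{
			{Vendor: "nvidia", Type: "gpu", Name: "p102", DeviceIDs: []string{"0b:00.0"}},
		}},
	}

	restoreFromExisting(existing, offered)
	fmt.Println(offered["gpu-task"].Devices[0].DeviceIDs) // prints [0c:00.0]
}
```

With this kind of restore step, the plan submitted for the in-place update would keep reporting the DeviceID the task is actually using, so later placements would not be offered the same device.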