Following the instructions in the Nomad documentation (as mentioned in #33), I tried to configure the Nvidia plugin on my computer, which has an NVIDIA GeForce GTX 1050 Ti, but I couldn't get it to work.
The problem I observe is that no Nvidia device appears when I run nomad node status <my node id>.
In particular,
it is not clear to me if some operation is required to install the Nomad Nvidia Plugin,
other than letting Nomad load the following configuration:
For instance: should I install the plugin somehow? Or is it shipped with Nomad?
If it is shipped with Nomad, where should I find it?
If not, what should I do?
Details about what I did
In my setup, Nomad is installed on Manjaro Linux via Homebrew.
Docker is installed as well, along with the Nvidia runtime.
I tested them and they work.
Nomad is configured with a single machine acting as both the server and the client.
The data_dir is /var/nomad/data, hence the plugin_dir defaults to /var/nomad/data/plugins.
When I start the Nomad daemon, the logs are as follows:
==> WARNING: mTLS is not configured - Nomad is not secure without mTLS!
==> WARNING: Bootstrap mode enabled! Potentially unsafe operation.
==> Loaded configuration from /etc/nomad.d/docker.hcl, /etc/nomad.d/nomad.hcl, /etc/nomad.d/nvidia.hcl, /etc/nomad.d/raw_exec.hcl
==> Starting Nomad agent...
2024-08-09T15:14:57.237+0200 [TRACE] plugin.stdio: waiting for stdio data
==> Nomad agent configuration:
Advertise Addrs: HTTP: 172.21.0.1:4646; RPC: 172.21.0.1:4647; Serf: 172.21.0.1:4648
Bind Addrs: HTTP: [0.0.0.0:4646]; RPC: 0.0.0.0:4647; Serf: 0.0.0.0:4648
Client: true
Log Level: INFO
Node Id: 8d565b58-fb42-4627-eac6-b8f98d0c09ff
Region: global (DC: dc1)
Server: true
Version: 1.8.2
==> Nomad agent started! Log data will stream in below:
2024-08-09T15:14:47.180+0200 [INFO] nomad: setting up raft bolt store: no_freelist_sync=false
2024-08-09T15:14:47.182+0200 [INFO] nomad.raft: initial configuration: index=1 servers="[{Suffrage:Voter ID:8aee8578-229b-bee2-86a7-3bfb368652a7 Address:172.21.0.1:4647}]"
2024-08-09T15:14:47.182+0200 [INFO] nomad.raft: entering follower state: follower="Node at 172.21.0.1:4647 [Follower]" leader-address= leader-id=
2024-08-09T15:14:47.184+0200 [INFO] nomad: serf: EventMemberJoin: node-pc-nicolas.global 172.21.0.1
2024-08-09T15:14:47.184+0200 [INFO] nomad: starting scheduling worker(s): num_workers=16 schedulers=["service", "batch", "system", "sysbatch", "_core"]
2024-08-09T15:14:47.184+0200 [WARN] nomad: serf: Failed to re-join any previously known node
2024-08-09T15:14:47.184+0200 [INFO] nomad: started scheduling worker(s): num_workers=16 schedulers=["service", "batch", "system", "sysbatch", "_core"]
2024-08-09T15:14:47.184+0200 [INFO] nomad: adding server: server="node-pc-nicolas.global (Addr: 172.21.0.1:4647) (DC: dc1)"
2024-08-09T15:14:47.184+0200 [WARN] agent.plugin_loader: skipping subdir in plugin folder: plugin_dir=/var/nomad/data/plugins subdir=/var/nomad/data/plugins/nomad-device-nvidia
2024-08-09T15:14:47.185+0200 [INFO] agent: detected plugin: name=exec type=driver plugin_version=0.1.0
2024-08-09T15:14:47.185+0200 [INFO] agent: detected plugin: name=qemu type=driver plugin_version=0.1.0
2024-08-09T15:14:47.185+0200 [INFO] agent: detected plugin: name=java type=driver plugin_version=0.1.0
2024-08-09T15:14:47.185+0200 [INFO] agent: detected plugin: name=docker type=driver plugin_version=0.1.0
2024-08-09T15:14:47.185+0200 [INFO] agent: detected plugin: name=raw_exec type=driver plugin_version=0.1.0
2024-08-09T15:14:47.185+0200 [INFO] client: using state directory: state_dir=/var/nomad/data/client
2024-08-09T15:14:47.185+0200 [INFO] client: using alloc directory: alloc_dir=/var/nomad/data/alloc
2024-08-09T15:14:47.185+0200 [INFO] client: using dynamic ports: min=20000 max=32000 reserved=""
2024-08-09T15:14:47.200+0200 [WARN] client.fingerprint_mgr.cni_plugins: failed to read CNI plugins directory: cni_path=/opt/cni/bin error="open /opt/cni/bin: no such file or directory"
2024-08-09T15:14:48.917+0200 [WARN] nomad.raft: heartbeat timeout reached, starting election: last-leader-addr= last-leader-id=
2024-08-09T15:14:48.917+0200 [INFO] nomad.raft: entering candidate state: node="Node at 172.21.0.1:4647 [Candidate]" term=8
2024-08-09T15:14:48.924+0200 [INFO] nomad.raft: election won: term=8 tally=1
2024-08-09T15:14:48.924+0200 [INFO] nomad.raft: entering leader state: leader="Node at 172.21.0.1:4647 [Leader]"
2024-08-09T15:14:48.924+0200 [INFO] nomad: cluster leadership acquired
2024-08-09T15:14:48.935+0200 [ERROR] nomad.fsm: DeleteServiceRegistrationByID failed: error="service registration not found"
2024-08-09T15:14:48.935+0200 [ERROR] nomad.fsm: DeleteServiceRegistrationByID failed: error="service registration not found"
2024-08-09T15:14:48.935+0200 [ERROR] nomad.fsm: DeleteServiceRegistrationByID failed: error="service registration not found"
2024-08-09T15:14:48.936+0200 [INFO] nomad: eval broker status modified: paused=false
2024-08-09T15:14:48.936+0200 [INFO] nomad: blocked evals status modified: paused=false
2024-08-09T15:14:57.205+0200 [INFO] client.proclib.cg2: initializing nomad cgroups: cores=0-15
2024-08-09T15:14:57.205+0200 [INFO] client.plugin: starting plugin manager: plugin-type=csi
2024-08-09T15:14:57.205+0200 [INFO] client.plugin: starting plugin manager: plugin-type=driver
2024-08-09T15:14:57.205+0200 [INFO] client.plugin: starting plugin manager: plugin-type=device
2024-08-09T15:14:57.227+0200 [INFO] client.alloc_runner.task_runner: Task event: alloc_id=1a46b8ad-a5d9-a06a-face-c01731a911e9 task=web type=Received msg="Task received by client" failed=false
2024-08-09T15:14:57.233+0200 [INFO] client.alloc_runner.task_runner: Task event: alloc_id=a59d0d66-25f5-75d7-b900-71c370e9ae92 task=web type=Received msg="Task received by client" failed=false
2024-08-09T15:14:57.235+0200 [INFO] client: node registration complete
2024-08-09T15:14:57.238+0200 [INFO] client: started client: node_id=f4e76f5b-287e-a96b-0388-1eb09fa7dc8e
2024-08-09T15:14:57.238+0200 [INFO] client.gc: marking allocation for GC: alloc_id=1a46b8ad-a5d9-a06a-face-c01731a911e9
2024-08-09T15:14:57.239+0200 [ERROR] nomad.fsm: DeleteServiceRegistrationByID failed: error="service registration not found"
2024-08-09T15:14:57.239+0200 [ERROR] client.rpc: error performing RPC to server: error="rpc error: service registration not found" rpc=ServiceRegistration.DeleteByID server=172.21.0.1:4647
2024-08-09T15:14:57.239+0200 [ERROR] client.rpc: error performing RPC to server which is not safe to automatically retry: error="rpc error: service registration not found" rpc=ServiceRegistration.DeleteByID server=172.21.0.1:4647
2024-08-09T15:14:57.239+0200 [INFO] client.service_registration.nomad: attempted to delete non-existent service registration: service_id=_nomad-task-1a46b8ad-a5d9-a06a-face-c01731a911e9-group-servers-hello-world-servers-www namespace=default
2024-08-09T15:14:57.241+0200 [INFO] agent: (runner) creating new runner (dry: false, once: false)
2024-08-09T15:14:57.242+0200 [INFO] agent: (runner) creating watcher
2024-08-09T15:14:57.242+0200 [INFO] agent: (runner) starting
So the Nomad agent is up and running, but the logs never mention the Nvidia plugin being loaded.
The only relevant line is
agent.plugin_loader: skipping subdir in plugin folder: plugin_dir=/var/nomad/data/plugins subdir=/var/nomad/data/plugins/nomad-device-nvidia
which I guess appears because I cloned this repository into the plugin_dir. I suppose that is not the correct way to install the plugin?
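To check that guess, a small sketch like the one below could list what the plugin_dir actually contains. It assumes, based only on the warning quoted above, that Nomad skips subdirectories and only considers files placed directly in plugin_dir. The function name classify_plugin_dir is hypothetical and is not part of Nomad.

```python
from pathlib import Path


def classify_plugin_dir(plugin_dir):
    """Group the direct entries of plugin_dir by kind.

    Assumption (from the "skipping subdir in plugin folder" warning only):
    subdirectories are ignored by the plugin loader, so a cloned repository
    would land in "subdirs" rather than "executables".
    """
    result = {"executables": [], "subdirs": [], "other": []}
    for entry in sorted(Path(plugin_dir).iterdir()):
        if entry.is_dir():
            # e.g. a git clone such as nomad-device-nvidia/
            result["subdirs"].append(entry.name)
        elif entry.is_file() and entry.stat().st_mode & 0o111:
            # regular file with at least one execute bit set
            result["executables"].append(entry.name)
        else:
            result["other"].append(entry.name)
    return result
```

On my machine this would be called as classify_plugin_dir("/var/nomad/data/plugins"). If nomad-device-nvidia only shows up under "subdirs", that would be consistent with the warning in the logs.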
Output of nomad node status
ID = f4e76f5b-287e-a96b-0388-1eb09fa7dc8e
Name = *****
Node Pool = default
Class = <none>
DC = dc1
Drain = false
Eligibility = eligible
Status = ready
CSI Controllers = <none>
CSI Drivers = <none>
Uptime = 4h58m55s
Host Volumes = <none>
Host Networks = <none>
CSI Volumes = <none>
Driver Status = docker,exec,java,qemu,raw_exec
Node Events
Time Subsystem Message
2024-08-09T14:29:29+02:00 Driver: docker Healthy
2024-08-09T14:28:02+02:00 Driver: docker Failed to connect to docker daemon
2024-08-09T14:14:47+02:00 Cluster Node registered
Allocated Resources
CPU Memory Disk
0/56000 MHz 0 B/62 GiB 0 B/582 GiB
Allocation Resource Utilization
CPU Memory
0/56000 MHz 0 B/62 GiB
Host Resource Utilization
CPU Memory Disk
2015/56000 MHz 14 GiB/62 GiB 308 GiB/938 GiB
Allocations
ID Node ID Task Group Version Desired Status Created Modified
a59d0d66 f4e76f5b servers 0 stop complete 1h11m ago 6m45s ago
1a46b8ad f4e76f5b servers 0 stop failed 1h22m ago 44m2s ago