Again, not so clear if and how should nomad be installed #51

Open
gciatto opened this issue Aug 9, 2024 · 0 comments


gciatto commented Aug 9, 2024

Following the instructions in the Nomad documentation (as mentioned in #33), I tried to configure the Nvidia plugin on my computer, which has an NVIDIA GeForce GTX 1050 Ti, but I couldn't get it to work.

The problem I observe is that no Nvidia device appears when I run nomad node status <my node id>.

In particular,
it is not clear to me if some operation is required to install the Nomad Nvidia Plugin,
other than letting Nomad load the following configuration:

plugin "nomad-device-nvidia" {
  config {
    enabled            = true
    # ignored_gpu_ids    = ["GPU-fef8089b", "GPU-ac81e44d"]
    fingerprint_period = "1m"
  }
}

For instance: should I install the plugin somehow? Or is it shipped with Nomad?
If it is shipped with Nomad, where can I find it?
If not, what should I do?

Details about what I did

In my setup, Nomad is installed on Manjaro Linux via Homebrew.
Docker is installed as well, along with the Nvidia runtime.
I tested them and they work.

Nomad is configured with a single machine acting as both the server and the client.
The data_dir is /var/nomad/data, hence the plugin_dir defaults to /var/nomad/data/plugins.

When I start the Nomad daemon, the logs are as follows:

==> WARNING: mTLS is not configured - Nomad is not secure without mTLS!
==> WARNING: Bootstrap mode enabled! Potentially unsafe operation.
==> Loaded configuration from /etc/nomad.d/docker.hcl, /etc/nomad.d/nomad.hcl, /etc/nomad.d/nvidia.hcl, /etc/nomad.d/raw_exec.hcl
==> Starting Nomad agent...
2024-08-09T15:14:57.237+0200 [TRACE] plugin.stdio: waiting for stdio data
==> Nomad agent configuration:

       Advertise Addrs: HTTP: 172.21.0.1:4646; RPC: 172.21.0.1:4647; Serf: 172.21.0.1:4648
            Bind Addrs: HTTP: [0.0.0.0:4646]; RPC: 0.0.0.0:4647; Serf: 0.0.0.0:4648
                Client: true
             Log Level: INFO
               Node Id: 8d565b58-fb42-4627-eac6-b8f98d0c09ff
                Region: global (DC: dc1)
                Server: true
               Version: 1.8.2

==> Nomad agent started! Log data will stream in below:

    2024-08-09T15:14:47.180+0200 [INFO]  nomad: setting up raft bolt store: no_freelist_sync=false
    2024-08-09T15:14:47.182+0200 [INFO]  nomad.raft: initial configuration: index=1 servers="[{Suffrage:Voter ID:8aee8578-229b-bee2-86a7-3bfb368652a7 Address:172.21.0.1:4647}]"
    2024-08-09T15:14:47.182+0200 [INFO]  nomad.raft: entering follower state: follower="Node at 172.21.0.1:4647 [Follower]" leader-address= leader-id=
    2024-08-09T15:14:47.184+0200 [INFO]  nomad: serf: EventMemberJoin: node-pc-nicolas.global 172.21.0.1
    2024-08-09T15:14:47.184+0200 [INFO]  nomad: starting scheduling worker(s): num_workers=16 schedulers=["service", "batch", "system", "sysbatch", "_core"]
    2024-08-09T15:14:47.184+0200 [WARN]  nomad: serf: Failed to re-join any previously known node
    2024-08-09T15:14:47.184+0200 [INFO]  nomad: started scheduling worker(s): num_workers=16 schedulers=["service", "batch", "system", "sysbatch", "_core"]
    2024-08-09T15:14:47.184+0200 [INFO]  nomad: adding server: server="node-pc-nicolas.global (Addr: 172.21.0.1:4647) (DC: dc1)"
    2024-08-09T15:14:47.184+0200 [WARN]  agent.plugin_loader: skipping subdir in plugin folder: plugin_dir=/var/nomad/data/plugins subdir=/var/nomad/data/plugins/nomad-device-nvidia
    2024-08-09T15:14:47.185+0200 [INFO]  agent: detected plugin: name=exec type=driver plugin_version=0.1.0
    2024-08-09T15:14:47.185+0200 [INFO]  agent: detected plugin: name=qemu type=driver plugin_version=0.1.0
    2024-08-09T15:14:47.185+0200 [INFO]  agent: detected plugin: name=java type=driver plugin_version=0.1.0
    2024-08-09T15:14:47.185+0200 [INFO]  agent: detected plugin: name=docker type=driver plugin_version=0.1.0
    2024-08-09T15:14:47.185+0200 [INFO]  agent: detected plugin: name=raw_exec type=driver plugin_version=0.1.0
    2024-08-09T15:14:47.185+0200 [INFO]  client: using state directory: state_dir=/var/nomad/data/client
    2024-08-09T15:14:47.185+0200 [INFO]  client: using alloc directory: alloc_dir=/var/nomad/data/alloc
    2024-08-09T15:14:47.185+0200 [INFO]  client: using dynamic ports: min=20000 max=32000 reserved=""
    2024-08-09T15:14:47.200+0200 [WARN]  client.fingerprint_mgr.cni_plugins: failed to read CNI plugins directory: cni_path=/opt/cni/bin error="open /opt/cni/bin: no such file or directory"
    2024-08-09T15:14:48.917+0200 [WARN]  nomad.raft: heartbeat timeout reached, starting election: last-leader-addr= last-leader-id=
    2024-08-09T15:14:48.917+0200 [INFO]  nomad.raft: entering candidate state: node="Node at 172.21.0.1:4647 [Candidate]" term=8
    2024-08-09T15:14:48.924+0200 [INFO]  nomad.raft: election won: term=8 tally=1
    2024-08-09T15:14:48.924+0200 [INFO]  nomad.raft: entering leader state: leader="Node at 172.21.0.1:4647 [Leader]"
    2024-08-09T15:14:48.924+0200 [INFO]  nomad: cluster leadership acquired
    2024-08-09T15:14:48.935+0200 [ERROR] nomad.fsm: DeleteServiceRegistrationByID failed: error="service registration not found"
    2024-08-09T15:14:48.935+0200 [ERROR] nomad.fsm: DeleteServiceRegistrationByID failed: error="service registration not found"
    2024-08-09T15:14:48.935+0200 [ERROR] nomad.fsm: DeleteServiceRegistrationByID failed: error="service registration not found"

    2024-08-09T15:14:48.936+0200 [INFO]  nomad: eval broker status modified: paused=false
    2024-08-09T15:14:48.936+0200 [INFO]  nomad: blocked evals status modified: paused=false
    2024-08-09T15:14:57.205+0200 [INFO]  client.proclib.cg2: initializing nomad cgroups: cores=0-15
    2024-08-09T15:14:57.205+0200 [INFO]  client.plugin: starting plugin manager: plugin-type=csi
    2024-08-09T15:14:57.205+0200 [INFO]  client.plugin: starting plugin manager: plugin-type=driver
    2024-08-09T15:14:57.205+0200 [INFO]  client.plugin: starting plugin manager: plugin-type=device
    2024-08-09T15:14:57.227+0200 [INFO]  client.alloc_runner.task_runner: Task event: alloc_id=1a46b8ad-a5d9-a06a-face-c01731a911e9 task=web type=Received msg="Task received by client" failed=false
    2024-08-09T15:14:57.233+0200 [INFO]  client.alloc_runner.task_runner: Task event: alloc_id=a59d0d66-25f5-75d7-b900-71c370e9ae92 task=web type=Received msg="Task received by client" failed=false
    2024-08-09T15:14:57.235+0200 [INFO]  client: node registration complete
    2024-08-09T15:14:57.238+0200 [INFO]  client: started client: node_id=f4e76f5b-287e-a96b-0388-1eb09fa7dc8e
    2024-08-09T15:14:57.238+0200 [INFO]  client.gc: marking allocation for GC: alloc_id=1a46b8ad-a5d9-a06a-face-c01731a911e9
    2024-08-09T15:14:57.239+0200 [ERROR] nomad.fsm: DeleteServiceRegistrationByID failed: error="service registration not found"
    2024-08-09T15:14:57.239+0200 [ERROR] client.rpc: error performing RPC to server: error="rpc error: service registration not found" rpc=ServiceRegistration.DeleteByID server=172.21.0.1:4647
    2024-08-09T15:14:57.239+0200 [ERROR] client.rpc: error performing RPC to server which is not safe to automatically retry: error="rpc error: service registration not found" rpc=ServiceRegistration.DeleteByID server=172.21.0.1:4647
    2024-08-09T15:14:57.239+0200 [INFO]  client.service_registration.nomad: attempted to delete non-existent service registration: service_id=_nomad-task-1a46b8ad-a5d9-a06a-face-c01731a911e9-group-servers-hello-world-servers-www namespace=default
    2024-08-09T15:14:57.241+0200 [INFO]  agent: (runner) creating new runner (dry: false, once: false)
    2024-08-09T15:14:57.242+0200 [INFO]  agent: (runner) creating watcher
    2024-08-09T15:14:57.242+0200 [INFO]  agent: (runner) starting

So the Nomad agent is up and running, but the logs make no mention of the Nvidia plugin being loaded.
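To make this kind of check repeatable, here is a minimal sketch that scans agent log text for the plugin-related lines. The two patterns are copied from the log format shown above and are an assumption about that format only; other Nomad versions may word these lines differently. The function name `summarize_plugin_lines` is hypothetical.

```python
import re

# Patterns taken from the log lines quoted in this issue (assumed format,
# not an official Nomad log specification).
DETECTED = re.compile(r"agent: detected plugin: name=(\S+) type=(\S+)")
SKIPPED = re.compile(
    r"agent\.plugin_loader: skipping subdir in plugin folder: .*\bsubdir=(\S+)"
)


def summarize_plugin_lines(log_text):
    """Return (detected, skipped) extracted from agent log text.

    detected: list of (plugin_name, plugin_type) pairs
    skipped:  list of subdirectory paths the plugin loader reported skipping
    """
    detected = DETECTED.findall(log_text)
    skipped = SKIPPED.findall(log_text)
    return detected, skipped
```

Running it on the log above should list the five built-in drivers as detected and report the nomad-device-nvidia subdirectory as skipped, with no detected plugin of type device.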

The only relevant line is

agent.plugin_loader: skipping subdir in plugin folder: plugin_dir=/var/nomad/data/plugins subdir=/var/nomad/data/plugins/nomad-device-nvidia

which I guess is because I cloned this repository into the plugin_dir. Is that not the correct way to install the plugin?
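A minimal sketch for inspecting the plugin directory, under two assumptions: the warning above means the loader ignores subdirectories, and only executable files placed directly in plugin_dir are candidates for loading (the second point is my guess, not something the logs confirm). The function name `classify_plugin_dir` is hypothetical.

```python
import os
from pathlib import Path


def classify_plugin_dir(plugin_dir):
    """Split the direct children of plugin_dir into three name lists."""
    result = {"executables": [], "subdirs": [], "other": []}
    for entry in sorted(Path(plugin_dir).iterdir()):
        if entry.is_dir():
            # The agent log reports "skipping subdir in plugin folder" for these.
            result["subdirs"].append(entry.name)
        elif entry.is_file() and os.access(entry, os.X_OK):
            # Assumption: executable files at the top level are plugin candidates.
            result["executables"].append(entry.name)
        else:
            result["other"].append(entry.name)
    return result
```

Calling `classify_plugin_dir("/var/nomad/data/plugins")` on my machine would presumably put nomad-device-nvidia under "subdirs", which matches the warning: the cloned repository is a directory of sources, not an executable file.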

Output of nomad node status

ID              = f4e76f5b-287e-a96b-0388-1eb09fa7dc8e
Name            = *****
Node Pool       = default
Class           = <none>
DC              = dc1
Drain           = false
Eligibility     = eligible
Status          = ready
CSI Controllers = <none>
CSI Drivers     = <none>
Uptime          = 4h58m55s
Host Volumes    = <none>
Host Networks   = <none>
CSI Volumes     = <none>
Driver Status   = docker,exec,java,qemu,raw_exec

Node Events
Time                       Subsystem       Message
2024-08-09T14:29:29+02:00  Driver: docker  Healthy
2024-08-09T14:28:02+02:00  Driver: docker  Failed to connect to docker daemon
2024-08-09T14:14:47+02:00  Cluster         Node registered

Allocated Resources
CPU          Memory      Disk
0/56000 MHz  0 B/62 GiB  0 B/582 GiB

Allocation Resource Utilization
CPU          Memory
0/56000 MHz  0 B/62 GiB

Host Resource Utilization
CPU             Memory         Disk
2015/56000 MHz  14 GiB/62 GiB  308 GiB/938 GiB

Allocations
ID        Node ID   Task Group  Version  Desired  Status    Created    Modified
a59d0d66  f4e76f5b  servers     0        stop     complete  1h11m ago  6m45s ago
1a46b8ad  f4e76f5b  servers     0        stop     failed    1h22m ago  44m2s ago