
Run job deregistering in a single transaction #4861

Merged
notnoop merged 2 commits into master from b-batch-deregister-transaction on Nov 13, 2018

Conversation

notnoop
Contributor

@notnoop notnoop commented Nov 10, 2018

Fixes #4299

Upon investigating this case further, we determined the issue to be a race between applying the `JobBatchDeregisterRequest` FSM operation and processing job-deregister evals.

Processing job-deregister evals should wait until the FSM log message finishes applying, by using the snapshot index. However, with `JobBatchDeregister`, each individual job deregistration incremented the snapshot index as it was applied, which allowed the job-deregister evals to be processed too early. When a Nomad server receives an eval for a job in the batch that has not yet been deleted, it may accidentally re-run the job depending on the state of its allocations.

This change ensures that we deregister all of the jobs and insert all evals in a single transaction, thus blocking processing of the related evals until deregistration completes.
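
Conceptually, the fix makes the whole batch visible atomically. As a rough illustration, here is a minimal, self-contained Go sketch with toy names (not the real Nomad types; the actual change lives in nomad/fsm.go and nomad/state/state_store.go): every job deletion and eval insert in the batch is applied as one unit, and the index only advances after the entire batch has been applied, so a worker that waits on the snapshot index never observes a half-applied batch.

package main

import (
	"fmt"
	"sync"
)

// state is a toy stand-in for the Nomad state store.
type state struct {
	mu    sync.Mutex
	index uint64            // last fully applied raft index
	jobs  map[string]bool   // job ID -> exists
	evals map[string]string // eval ID -> job ID
}

// applyBatchDeregister deletes every job and inserts every eval as a single
// unit, then advances the index exactly once.
func (s *state) applyBatchDeregister(index uint64, jobIDs []string, evals map[string]string) {
	s.mu.Lock()
	defer s.mu.Unlock()

	for _, id := range jobIDs {
		delete(s.jobs, id)
	}
	for evalID, jobID := range evals {
		s.evals[evalID] = jobID
	}
	// The index moves forward only after the whole batch has been applied.
	s.index = index
}

// snapshotIndex is what an eval worker would wait on before processing.
func (s *state) snapshotIndex() uint64 {
	s.mu.Lock()
	defer s.mu.Unlock()
	return s.index
}

func main() {
	s := &state{
		index: 10,
		jobs:  map[string]bool{"job-a": true, "job-b": true},
		evals: map[string]string{},
	}

	s.applyBatchDeregister(11,
		[]string{"job-a", "job-b"},
		map[string]string{"eval-1": "job-a", "eval-2": "job-b"},
	)

	// A worker that waited for index >= 11 now sees the fully applied batch.
	fmt.Println("index:", s.snapshotIndex(), "jobs left:", len(s.jobs), "evals:", len(s.evals))
}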

@notnoop notnoop requested a review from dadgar November 10, 2018 03:42
@notnoop
Contributor Author

notnoop commented Nov 10, 2018

I have tested this change with my reproduction setup and didn't notice a case of a job retrying even after 10 hours - previously a case would appear within ~5-10 minutes. So success?!

@dadgar
Contributor

dadgar commented Nov 12, 2018

LGTM after comments

@notnoop notnoop force-pushed the b-batch-deregister-transaction branch from a422d8c to 9405473 on November 12, 2018 21:09
@notnoop notnoop merged commit 91ea405 into master Nov 13, 2018
@endocrimes endocrimes deleted the b-batch-deregister-transaction branch November 13, 2018 04:55
@tantra35
Contributor

@notnoop after we backported this to 0.8.6, we began to see the following from time to time when we manually trigger GC via curl -XPUT http://localhost:4646/v1/system/gc:

Nov 18 19:29:38 consulnomad-2 nomad.sh[24477]:     2018/11/18 19:29:38 [DEBUG] memberlist: TCP connection from=127.0.0.1:41524
Nov 18 19:29:38 consulnomad-2 nomad.sh[24477]:     2018/11/18 19:29:38.900118 [DEBUG] worker: dequeued evaluation 42512a64-2cb9-0e2e-f8f7-9ee9428366be
Nov 18 19:29:38 consulnomad-2 nomad.sh[24477]:     2018/11/18 19:29:38.900525 [DEBUG] sched.core: forced job GC
Nov 18 19:29:38 consulnomad-2 nomad.sh[24477]:     2018/11/18 19:29:38.900720 [DEBUG] sched.core: job GC: 2 jobs, 3 evaluations, 2 allocs eligible
Nov 18 19:29:38 consulnomad-2 nomad.sh[24477]: panic: runtime error: invalid memory address or nil pointer dereference
Nov 18 19:29:38 consulnomad-2 nomad.sh[24477]: [signal SIGSEGV: segmentation violation code=0x1 addr=0x38 pc=0x74ce52]
Nov 18 19:29:38 consulnomad-2 nomad.sh[24477]: goroutine 45 [running]:
Nov 18 19:29:38 consulnomad-2 nomad.sh[24477]: github.com/hashicorp/nomad/vendor/github.com/hashicorp/go-immutable-radix.(*Iterator).Next(0xc420571d20, 0x1b190e0, 0xc420868d00, 0x1d6bce0, 0xc4205dd8e0, 0xc4205dd910, 0xc4205dd910)
Nov 18 19:29:38 consulnomad-2 nomad.sh[24477]:         /opt/gopath/src/github.com/hashicorp/nomad/vendor/github.com/hashicorp/go-immutable-radix/iter.go:81 +0xc2
Nov 18 19:29:38 consulnomad-2 nomad.sh[24477]: github.com/hashicorp/nomad/vendor/github.com/hashicorp/go-memdb.(*radixIterator).Next(0xc4205dd840, 0x1bee89a, 0xb)
Nov 18 19:29:38 consulnomad-2 nomad.sh[24477]:         /opt/gopath/src/github.com/hashicorp/nomad/vendor/github.com/hashicorp/go-memdb/txn.go:634 +0x2e
Nov 18 19:29:38 consulnomad-2 nomad.sh[24477]: github.com/hashicorp/nomad/nomad/state.(*StateStore).deleteJobVersions(0xc420333a00, 0xe6, 0xc4204bd520, 0xc4205c7d40, 0xc420571ce0, 0x0)
Nov 18 19:29:38 consulnomad-2 nomad.sh[24477]:         /opt/gopath/src/github.com/hashicorp/nomad/nomad/state/state_store.go:1095 +0x1ae
Nov 18 19:29:38 consulnomad-2 nomad.sh[24477]: github.com/hashicorp/nomad/nomad/state.(*StateStore).DeleteJobTxn(0xc420333a00, 0xe6, 0xc42070b0b0, 0x7, 0xc42070b090, 0xd, 0xc4205c7d40, 0x1d640a0, 0xc4205dd770)
Nov 18 19:29:38 consulnomad-2 nomad.sh[24477]:         /opt/gopath/src/github.com/hashicorp/nomad/nomad/state/state_store.go:1067 +0x46f
Nov 18 19:29:38 consulnomad-2 nomad.sh[24477]: github.com/hashicorp/nomad/nomad.(*nomadFSM).handleJobDeregister(0xc4202ef6e0, 0xe6, 0xc42070b090, 0xd, 0xc42070b0b0, 0x7, 0x197d001, 0xc4205c7d40, 0x0, 0x0)
Nov 18 19:29:38 consulnomad-2 nomad.sh[24477]:         /opt/gopath/src/github.com/hashicorp/nomad/nomad/fsm.go:556 +0x1ab
Nov 18 19:29:38 consulnomad-2 nomad.sh[24477]: github.com/hashicorp/nomad/nomad.(*nomadFSM).applyBatchDeregisterJob.func1(0xc4205c7d40, 0x1c6c8a0, 0xc4205c7d40)
Nov 18 19:29:38 consulnomad-2 nomad.sh[24477]:         /opt/gopath/src/github.com/hashicorp/nomad/nomad/fsm.go:524 +0x12f
Nov 18 19:29:38 consulnomad-2 nomad.sh[24477]: github.com/hashicorp/nomad/nomad/state.(*StateStore).WithWriteTransaction(0xc420333a00, 0xc4207d9c58, 0x0, 0x0)
Nov 18 19:29:38 consulnomad-2 nomad.sh[24477]:         /opt/gopath/src/github.com/hashicorp/nomad/nomad/state/state_store.go:3936 +0x79
Nov 18 19:29:38 consulnomad-2 nomad.sh[24477]: github.com/hashicorp/nomad/nomad.(*nomadFSM).applyBatchDeregisterJob(0xc4202ef6e0, 0xc420728001, 0x42d, 0x4bf, 0xe6, 0x0, 0x0)
Nov 18 19:29:38 consulnomad-2 nomad.sh[24477]:         /opt/gopath/src/github.com/hashicorp/nomad/nomad/fsm.go:522 +0x1b1
Nov 18 19:29:38 consulnomad-2 nomad.sh[24477]: github.com/hashicorp/nomad/nomad.(*nomadFSM).Apply(0xc4202ef6e0, 0xc420cf8c20, 0x2ba2d40, 0x3)
Nov 18 19:29:38 consulnomad-2 nomad.sh[24477]:         /opt/gopath/src/github.com/hashicorp/nomad/nomad/fsm.go:242 +0xf81
Nov 18 19:29:38 consulnomad-2 nomad.sh[24477]: github.com/hashicorp/nomad/vendor/github.com/hashicorp/raft.(*Raft).runFSM.func1(0xc4205dd360)
Nov 18 19:29:38 consulnomad-2 nomad.sh[24477]:         /opt/gopath/src/github.com/hashicorp/nomad/vendor/github.com/hashicorp/raft/fsm.go:57 +0x15a
Nov 18 19:29:38 consulnomad-2 nomad.sh[24477]: github.com/hashicorp/nomad/vendor/github.com/hashicorp/raft.(*Raft).runFSM(0xc420536580)
Nov 18 19:29:38 consulnomad-2 nomad.sh[24477]:         /opt/gopath/src/github.com/hashicorp/nomad/vendor/github.com/hashicorp/raft/fsm.go:120 +0x2fa
Nov 18 19:29:38 consulnomad-2 nomad.sh[24477]: github.com/hashicorp/nomad/vendor/github.com/hashicorp/raft.(*Raft).(github.com/hashicorp/nomad/vendor/github.com/hashicorp/raft.runFSM)-fm()
Nov 18 19:29:38 consulnomad-2 nomad.sh[24477]:         /opt/gopath/src/github.com/hashicorp/nomad/vendor/github.com/hashicorp/raft/api.go:506 +0x2a
Nov 18 19:29:38 consulnomad-2 nomad.sh[24477]: github.com/hashicorp/nomad/vendor/github.com/hashicorp/raft.(*raftState).goFunc.func1(0xc420536580, 0xc420440100)
Nov 18 19:29:38 consulnomad-2 nomad.sh[24477]:         /opt/gopath/src/github.com/hashicorp/nomad/vendor/github.com/hashicorp/raft/state.go:146 +0x53
Nov 18 19:29:38 consulnomad-2 nomad.sh[24477]: created by github.com/hashicorp/nomad/vendor/github.com/hashicorp/raft.(*raftState).goFunc
Nov 18 19:29:38 consulnomad-2 nomad.sh[24477]:         /opt/gopath/src/github.com/hashicorp/nomad/vendor/github.com/hashicorp/raft/state.go:144 +0x66

After that the Nomad server cluster is fully damaged and stops functioning properly.

@tantra35
Contributor

On our test stand we launch the following jobs:

job "test"
{
        datacenters = ["test"]
        type = "batch"

        constraint
        {
                attribute = "${attr.kernel.name}"
                value = "linux"
        }

        constraint
        {
                attribute = "${node.unique.name}"
                operator = "="
                value = "dockerworker-1"
        }

        task "diamondbcapacitycollector"
        {
                driver = "exec"

                config
                {
                        command = "sleep"
                        args = ["600"]
                }

                logs
                {
                        max_files = 3
                        max_file_size = 10
                }

                resources
                {
                        cpu = 100
                        memory = 300
                }
        }
}

and

job "test_debug-00"
{
        datacenters = ["test"]
        priority = 50

        constraint
        {
                attribute = "${attr.kernel.name}"
                value = "linux"
        }

        constraint
        {
                distinct_hosts = true
        }

        update
        {
                stagger = "10s"
                max_parallel = 1
        }

        meta
        {
                CONTAINER_VERSION = "02"
        }

        group test
        {
                count = 2

                task "test_debug_task"
                {
                        driver = "raw_exec"
                        kill_timeout = "1m"

                        artifact
                        {
                                source = "http://docker.service.consul/playrix-vault_debug-${NOMAD_META_CONTAINER_VERSION}.tar.gz"

                                options
                                {
                                        archive=false
                                }
                        }

                        config
                        {
                                command = "bash"
                                args = ["-c", "sleep 10; exit 10"]
                        }

                        vault
                        {
                                policies = ["bind_password","vault_debug-consul"]
                                change_mode   = "noop"
                        }

                        template
                        {
                                data = <<EOH
                                CONSUL_TOKEN="{{with secret "secrets/consul/plr/global/creds/vault_debug-consul"}}{{.Data.token}}{{end}}"
EOH
                                destination = "secrets/consul_token.env"
                                env = true
                        }

                        logs
                        {
                                max_files = 3
                                max_file_size = 10
                        }

                        resources
                        {
                                memory = 100
                                cpu = 200
                        }
                }
        }
}

@tantra35
Contributor

Before the cluster crash, the state of the test job was:

vagrant@consulnomad-1:~/nomad$ nomad status test
ID            = test
Name          = test
Submit Date   = 2018-11-18T17:06:04+03:00
Type          = batch
Priority      = 50
Datacenters   = test
Status        = dead
Periodic      = false
Parameterized = false

Summary
Task Group                 Queued  Starting  Running  Failed  Complete  Lost
diamondbcapacitycollector  0       0         0        0       2         0

Allocations
ID        Node ID   Task Group                 Version  Desired  Status    Created    Modified
56bb5e80  a2455479  diamondbcapacitycollector  2        run      complete  2h22m ago  2h12m ago
4fb97421  a2455479  diamondbcapacitycollector  0        stop     complete  2h25m ago  2h24m ago
vagrant@consulnomad-1:~/nomad$ curl -XPUT http://localhost:4646/v1/system/gc
vagrant@consulnomad-1:~/nomad$ nomad status test
Error querying search with id: "Unexpected response code: 500 (rpc error: rpc error: failed to get conn: dial tcp 192.168.142.103:4647: connect: connection refused)"

@tantra35
Contributor

We are absolutely sure that we made the backport correctly.

@tantra35
Contributor

First of all, we compared our version with the 0.8.7 branch, and we can also provide our patch.

@tantra35
Contributor

I must also say that this is not 100% reproducible; we have caught it only two (2) times.

@notnoop
Contributor Author

notnoop commented Nov 18, 2018

@tantra35 Thank you so much for reporting the issue and providing a test case. I'll dig into this tomorrow (on Monday) and follow up.

@tantra35
Contributor

@notnoop thanks for the reaction. I must also say that I'm not sure the cause is exactly this PR, because our version also includes

#4764
#4839
#4842

backported, but when we tested them we didn't see any issues.

It seems that this happens when we wait for the batch job to end successfully and then launch it again, so our steps are as follows:

nomad run test.nomad
<wait until it stops normally>
nomad run test.nomad
<wait some time>
curl -XPUT http://localhost:4646/v1/system/gc

@notnoop
Contributor Author

notnoop commented Nov 20, 2018

@tantra35 Thanks again for reporting the issue - I was able to reproduce the panic reliably and have a fix in #4903 - would you be able to test that PR in your environment by any chance?

@tantra35
Contributor

@notnoop We tested this patch and all seems to work well. Now we can't launch the same job if its previous version is in the dead state; only after we run a GC does Nomad allow us to launch a new one.

@preetapan
Contributor

@tantra35 re: "we can't launch the same job if its previous version is in the dead state; only after we run a GC does Nomad allow us to launch a new one"

If a job is resubmitted with a change, it should run without needing a GC. If a batch job's allocs are all complete and you resubmit it with no changes, Nomad will not rerun the completed allocations.

@tantra35
Contributor

@preetapan before @notnoop made the final fixes, we had the ability to re-launch a batch job that hadn't changed, and you can see this in the node status provided earlier in this issue.

@notnoop
Contributor Author

notnoop commented Nov 21, 2018

That's interesting - I wasn't able to reproduce it with different nomad releases. I captured my output for Nomad 0.8.2 for example https://gist.github.com/notnoop/151105c99070d93333bed23aec7ce42c - and you can see that resubmitting the same job doesn't trigger a new allocation (i.e. doesn't run again) until the job is modified.

If you'd like to investigate this further, please open a new ticket with the version of Nomad you are using and the commands you ran, and we'll follow up there - I'm afraid of losing valuable insights and issues in PR comments.

Thanks!

@tantra35
Contributor

@notnoop this is 100% reproducible in the case I described in #4532; yes, there we manually stop batch jobs. I must also mention that I again see that I can re-launch it:

vagrant@consulnomad-1:~$ nomad status e2914cd3
ID                  = e2914cd3
Eval ID             = f826017e
Name                = test.diamondbcapacitycollector[0]
Node ID             = aafdcb7a
Job ID              = test
Job Version         = 0
Client Status       = complete
Client Description  = <none>
Desired Status      = run
Desired Description = <none>
Created             = 11m40s ago
Modified            = 1m1s ago

Task "diamondbcapacitycollector" is "dead"
Task Resources
CPU      Memory   Disk     IOPS  Addresses
100 MHz  300 MiB  300 MiB  0

Task Events:
Started At     = 2018-11-24T11:37:18Z
Finished At    = 2018-11-24T11:47:18Z
Total Restarts = 0
Last Restart   = N/A

Recent Events:
Time                       Type        Description
2018-11-24T14:47:18+03:00  Terminated  Exit Code: 0
2018-11-24T14:37:18+03:00  Started     Task started by client
2018-11-24T14:36:39+03:00  Task Setup  Building Task Directory
2018-11-24T14:36:39+03:00  Received    Task received by client
vagrant@consulnomad-1:~$ nomad status test
ID            = test
Name          = test
Submit Date   = 2018-11-24T14:36:38+03:00
Type          = batch
Priority      = 50
Datacenters   = test
Status        = dead
Periodic      = false
Parameterized = false

Summary
Task Group                 Queued  Starting  Running  Failed  Complete  Lost
diamondbcapacitycollector  0       0         0        0       1         0

Allocations
ID        Node ID   Task Group                 Version  Desired  Status    Created     Modified
e2914cd3  aafdcb7a  diamondbcapacitycollector  0        run      complete  11m45s ago  1m6s ago
vagrant@consulnomad-1:~$ nomad run ./test.nomad
Error getting job struct: Error getting jobfile from "./test.nomad": source path error: stat /home/vagrant/test.nomad: no such file or directory
vagrant@consulnomad-1:~$ cd ./nomad/
vagrant@consulnomad-1:~/nomad$ nomad run ./test.nomad
==> Monitoring evaluation "db5d4074"
    Evaluation triggered by job "test"
    Allocation "f77d2310" created: node "aafdcb7a", group "diamondbcapacitycollector"
    Evaluation status changed: "pending" -> "complete"
==> Evaluation "db5d4074" finished with status "complete"
vagrant@consulnomad-1:~/nomad$ nomad status
ID    Type   Priority  Status   Submit Date
test  batch  50        running  2018-11-24T14:48:41+03:00
vagrant@consulnomad-1:~/nomad$ nomad status test
ID            = test
Name          = test
Submit Date   = 2018-11-24T14:48:41+03:00
Type          = batch
Priority      = 50
Datacenters   = test
Status        = running
Periodic      = false
Parameterized = false

Summary
Task Group                 Queued  Starting  Running  Failed  Complete  Lost
diamondbcapacitycollector  0       0         1        0       1         0

Allocations
ID        Node ID   Task Group                 Version  Desired  Status    Created     Modified
f77d2310  aafdcb7a  diamondbcapacitycollector  1        run      running   16s ago     10s ago
e2914cd3  aafdcb7a  diamondbcapacitycollector  0        run      complete  12m19s ago  1m40s ago

@tantra35
Contributor

Here we can see that the version of the allocation changes, but the actual job description does not.

@tantra35
Contributor

This happens randomly in the test Vagrant environment.
