Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Nomad Client generates lots of IOPS when idle, saturating HDD #9047

Closed
ashtuchkin opened this issue Oct 8, 2020 · 5 comments · Fixed by #9093
Closed

Nomad Client generates lots of IOPS when idle, saturating HDD #9047

ashtuchkin opened this issue Oct 8, 2020 · 5 comments · Fixed by #9093

Comments

@ashtuchkin
Copy link
Contributor

Nomad version

Nomad v0.12.1 (14a6893)

Operating system and Environment details

Ubuntu 16.04, Linux kernel 4.15.0-70
3 on-premise servers, each having 64Gb RAM, 32 cores CPU, 1Tb HDD.

Issue

I'm testing batch job scheduling on my small cluster and noticed that after running 1-2 thousand allocations, the Nomad Clients slow down to a crawl: 5-10 minutes to start small jobs, even if there's no other jobs running. Restarting the clients doesn't help. The problem disappears only after I issue nomad system gc command.

After a short investigation, I noticed that almost all the time is spent by the Nomad Client continuously writing to its state file (<data root>/client/state.db), fully saturating disk iops, which slows down everything else on the host. Note, this is with an idle cluster - no jobs are running at that point.

A deeper investigation (golang profiling) took me to Client.saveState() function. It's called every 60 seconds by Client.periodicSnapshot() and calls PersistState() for all allocations tracked by the client. Each PersistState() internally issues a BoltDB transaction to unconditionally write current state of the allocation to the state file, ending with a mandatory fsync(). This results is a very high number of IOPS, especially for an HDD (which unfortunately I have to use), saturating disk access for the whole host.

As an example, an average HDD tops out at maybe ~120 IOPS. If we assume 2 IO operations per PersistState() transaction, then the disk i/o would be at 100% at ~3600 allocations. In my tests it's even less than that - around 2000. Again, this is in idle state - nothing is actually running at that time on the cluster.

In my use case each batch job can have 1000+ allocations, easily surpassing the numbers above. After this happens, the cluster becomes unresponsive and it's pretty hard to fix it. The only way I found is issuing a nomad system gc command.

So I have a couple of questions:

  1. Is it really necessary to persist the state of every allocation every minute in idle state? Can you add a condition there to persist only if something has actually changed?
  2. If the above is hard to do, can you batch the writes so that e.g. 100 allocations be persisted at a time? That would decrease the IOPS requirements dramatically.
  3. Is there any workaround you could suggest?

Note, I think garbage collection of allocations would help the situation somewhat, but if jobs are executed in an unpredictable manner, it would be very hard to tune the gc threshold period while still keeping the logs for a bit to help debug things.

Separately, minor thing, I noticed that there's an attempt at parallelizing the persistence in Client.saveState() function by running a goroutine per allocation. AFAIK it doesn't help at all, because write transactions in BoltDB are serialized using locks and each goroutine have to wait for its order anyway, making the whole thing sequential. Not a big issue, just muddies the code a little.

Reproduction steps

  1. Set up a cluster with HDD drives. I used 3 nodes, but it should work with one as well.
  2. Run an echo "hello world" batch job with job.group.count=3000.
  3. Observe saturated HDD long after the job is finished and the cluster is idle. Observe it takes 5-10 minutes to run a trivial task.

Job file

Any simple job should do. I'm using docker with something like this:

job "test" {
  datacenters = ["dc1"]
  type = "batch"
  group "tasks" {
    count = 3000
    task "main" {
      driver = "docker"
      config {
        image = "alpine"
        args = ["echo", "Hello world"]
      }
    }
  }
}

Nomad Client logs (if appropriate)

I have some logs and pprof traces, but not sure how helpful they would be. Let me know if you need them.

@ashtuchkin
Copy link
Contributor Author

Update: just stumbled on the logic in helper/boltdd package that uses hashing to avoid writing the same data to the database (https://github.com/hashicorp/nomad/blob/master/helper/boltdd/boltdd.go#L316), so it seems that my first question is moot. On the other hand, the IOPS is still there, so my current hypothesis is that the actual writes happen in transaction begin/commit stages.

@tgross
Copy link
Member

tgross commented Oct 12, 2020

Hi @ashtuchkin just wanted to make sure you knew we saw this one and someone will look into it.

@ashtuchkin
Copy link
Contributor Author

ashtuchkin commented Oct 12, 2020 via email

@ashtuchkin
Copy link
Contributor Author

I've added a PR that fixes this issue in our environment. We'll use my fork with this change for now to unblock the tests, but I'd be happy to go back to using upstream when this issue is fixed (whether by merging the PR or not).

ashtuchkin added a commit to ashtuchkin/nomad that referenced this issue Oct 20, 2020
Fixes hashicorp#9047, see problem details there.

As a solution, we use BoltDB's 'Batch' mode that combines multiple
parallel writes into small number of transactions. See
https://github.com/boltdb/bolt#batch-read-write-transactions for
more information.
tgross pushed a commit that referenced this issue Oct 20, 2020
Fixes #9047, see problem details there.

As a solution, we use BoltDB's 'Batch' mode that combines multiple
parallel writes into small number of transactions. See
https://github.com/boltdb/bolt#batch-read-write-transactions for
more information.
fredrikhgrelland pushed a commit to fredrikhgrelland/nomad that referenced this issue Oct 22, 2020
…hicorp#9093)

Fixes hashicorp#9047, see problem details there.

As a solution, we use BoltDB's 'Batch' mode that combines multiple
parallel writes into small number of transactions. See
https://github.com/boltdb/bolt#batch-read-write-transactions for
more information.
@github-actions
Copy link

I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Oct 29, 2022
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants