Client Side Allocation Directory GC #1418

Closed
camerondavison opened this issue Jul 13, 2016 · 16 comments
@camerondavison
Contributor

nomad 0.4.0

Since the disk fingerprint checks the disk for "free" space, and disk space is also allocated for logs, the logs on disk can end up being counted twice.

If, for instance, I have 450 MB of disk space and 2 tasks that are each set to have 10 rolling log files of 10 MB, and I set both of those tasks to require 100 MB of space, then after my 2 tasks run for a long time and fill up all of their logs I can end up with a resource allocation of 200/250, since the logs physically take up 200 MB on disk while I am also allocating 200 MB for them.

This would mean that I could not put another task with a 100 MB disk allocation onto this node, because according to the resource check there is only 50 MB available. But since we know that 200 MB of the 250 have already been accounted for, we should be able to provision the task.
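As a rough illustration of the double counting described above (the numbers mirror the example; this is only a sketch of the arithmetic, not Nomad's actual fingerprinting code):

```go
package main

import "fmt"

func main() {
	// Numbers from the example above (all values in MB).
	totalDisk := 450       // physical disk size
	reservedPerTask := 100 // disk resource each task asks for
	logsPerTask := 100     // 10 rolling files x 10 MB each, once the logs are full

	// The fingerprint reports whatever the OS says is free, which already
	// excludes the log files the running allocations have written.
	fingerprintedFree := totalDisk - 2*logsPerTask // 250

	// The scheduler then subtracts the reserved disk for the two tasks
	// from that fingerprinted number, counting the same 200 MB again.
	allocated := 2 * reservedPerTask             // 200
	schedulable := fingerprintedFree - allocated // 50

	fmt.Printf("fingerprinted free: %d MB, allocated: %d MB, schedulable: %d MB\n",
		fingerprintedFree, allocated, schedulable)
	// A new task asking for 100 MB is rejected even though the 200 MB of logs
	// is the same data the existing 200 MB allocation already accounts for.
}
```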

@camerondavison
Contributor Author

Looking through the code more, it looks like the storage fingerprint is not periodic. I think this is especially hurting me while trying to upgrade from 0.3.2 to 0.4.0. I ran node-drain, which removed all of the tasks from the node (but probably did not GC the allocated log dirs), then stopped Nomad, restarted the machine, and started Nomad again. This time, when Nomad came back up, it fingerprinted the disk as having much less free space, because all the old logs had not been GC'd yet.

@dadgar changed the title from "disk resource fingerprint with full logs, allocation can count twice" to "Client Side Allocation Directory GC" on Jul 13, 2016
@dadgar
Contributor

dadgar commented Jul 13, 2016

Hey,

This is something we are aware of and will be fixing. It is really due to the client not garbage collecting the allocation directories it manages. The client currently waits for a signal from the server, which occurs on an interval, and that is incorrect.

@camerondavison
Contributor Author

Are you saying that you want to gc the allocation directories before startup, and before the fingerprint runs?

If you want to try to re-attach to any executors that are still running after startup (or run this check periodically), then you will encounter the problem of counting logs twice.

@diptanu
Contributor

diptanu commented Jul 13, 2016

@a86c6f7964 We will GC allocations which are dead when new tasks are trying to get disk space.

@stephenlb

+1 yo

@camerondavison
Contributor Author

I can wait to see what happens, but I feel like I am a little lost.

Current state of the world

  • disk allocation is counted out of the checked free disk space
  • a single disk check for free space at startup
  • free space as calculated by the OS (total - OS usage - any alloc logs on disk (both running and dead))

State that I think would be good

  • disk allocation would still be counted out of free space
  • disk check runs more often, maybe every 10 minutes
  • free space calculated as OS free space plus all running alloc logs (we need to add these back to the free space if we are in fact also going to use the disk resource allocation checks); see the sketch below. This would mean that the non-running allocations would eat into the free space (but could be GC'd when space is needed, as you stated above)
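A minimal sketch of that proposed calculation, under the assumption above (the types, field names, and function here are hypothetical, not Nomad's API):

```go
package main

import "fmt"

// Alloc is a hypothetical stand-in for a client-side allocation record.
type Alloc struct {
	Running  bool
	LogBytes int64 // bytes its log files currently occupy on disk
}

// effectiveFree adds the log usage of running allocations back onto the
// OS-reported free space, so that space is only counted once (via the disk
// resource allocation) instead of twice. Dead allocations are left out, so
// their logs eat into free space until they are GC'd.
func effectiveFree(osFreeBytes int64, allocs []Alloc) int64 {
	free := osFreeBytes
	for _, a := range allocs {
		if a.Running {
			free += a.LogBytes
		}
	}
	return free
}

func main() {
	allocs := []Alloc{
		{Running: true, LogBytes: 100 << 20},  // running task with full logs
		{Running: false, LogBytes: 100 << 20}, // dead alloc, not yet GC'd
	}
	osFree := int64(250) << 20 // what the OS reports as free
	fmt.Printf("effective free: %d MB\n", effectiveFree(osFree, allocs)>>20)
}
```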

@diptanu
Contributor

diptanu commented Jul 15, 2016

There is also a PR which is going to land soon that will kill tasks when they exceed their allocated quota.
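A rough sketch of that kind of enforcement (purely illustrative of the idea; this is not the code from the PR, and the names here are hypothetical):

```go
package quota

import "fmt"

// checkDiskQuota is a hypothetical periodic check: if the bytes a task has
// written into its allocation directory exceed the disk it asked for, the
// task is killed via the provided callback.
func checkDiskQuota(usedBytes, allocatedBytes int64, kill func(reason string)) {
	if usedBytes > allocatedBytes {
		kill(fmt.Sprintf("disk quota exceeded: used %d of %d allocated bytes",
			usedBytes, allocatedBytes))
	}
}
```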

@dadgar
Contributor

dadgar commented Jul 15, 2016

@a86c6f7964 What you described is the goal.

@jshaw86

jshaw86 commented Jul 18, 2016

@dadgar So we are seeing two issues:

  1. the disk space is being reported as incorrect
  2. the disk space never gets cleaned up

From the above conversation it seems that my first point is being addressed. I'm unclear, though, whether the disk will actually be cleaned up automatically for stopped or failed allocations, or if we will need to run a GC task to clean up the file system manually.

@diptanu
Contributor

diptanu commented Jul 19, 2016

@jshaw86 You won't have to run a GC task to clean up the dead allocations. Nomad will clean them up automatically once we have implemented the client GC feature.

@camerondavison
Contributor Author

Also, they are currently cleaned up automatically when the master server periodically issues a GC.


@jshaw86

jshaw86 commented Jul 20, 2016

@a86c6f7964 are you currently seeing this automatic cleanup behavior from the master server GC? We are not seeing any disk cleanup under 0.4.0, even after 24 hours.

@camerondavison
Contributor Author

I saw them go away because, in order to accomplish an upgrade of Nomad, I did the following:

nomad node-drain -self -enable
<wait for drain>
curl $NOMAD_SERVER_CLUSTER_ADDR/v1/system/gc
<restart server to upgrade os and nomad, wait for new nomad version to be up in the cluster>
nomad node-drain -self -disable

So maybe it only does it if you run the system gc?

@camerondavison
Contributor Author

Does anyone know if #2081 helps this issue out at all?

@diptanu
Contributor

diptanu commented Jan 3, 2017

Fixed via #2081

@diptanu closed this as completed on Jan 3, 2017
@github-actions

I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

The github-actions bot locked this issue as resolved and limited conversation to collaborators on Dec 17, 2022