Enable periodic cleanup of work_dir directories in ballista executor #1780

Closed
Ted-Jiang opened this issue Feb 8, 2022 · 3 comments · Fixed by #1783
Labels
enhancement New feature or request

Comments

@Ted-Jiang
Member

Is your feature request related to a problem or challenge? Please describe what you are trying to do.

Enable periodic cleanup of work_dir directories in the Ballista executor. This introduces three arguments:
executor_cleanup_enable: Enables periodic cleanup of work_dir directories.
executor_cleanup_interval: The interval, in seconds, at which the worker cleans up old job dirs on the local machine.
executor_cleanup_ttl: The number of seconds to retain a job's work_dir on each executor. This is a time to live (TTL) and should be chosen based on the amount of disk space available.
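
As a rough illustration of how the three arguments relate, here is a minimal sketch. The `CleanupConfig` struct, `from_args`, and `is_expired` are hypothetical names chosen for this example and are not taken from the Ballista codebase; the only things drawn from the proposal are the three argument names and the fact that the interval and TTL are given in seconds.

```rust
use std::time::Duration;

/// Hypothetical holder for the three proposed executor arguments.
#[derive(Debug, Clone, PartialEq)]
struct CleanupConfig {
    /// executor_cleanup_enable
    enable: bool,
    /// executor_cleanup_interval, converted from seconds
    interval: Duration,
    /// executor_cleanup_ttl, converted from seconds
    ttl: Duration,
}

impl CleanupConfig {
    /// Builds the config from raw argument values (both durations in seconds).
    fn from_args(enable: bool, interval_secs: u64, ttl_secs: u64) -> Self {
        CleanupConfig {
            enable,
            interval: Duration::from_secs(interval_secs),
            ttl: Duration::from_secs(ttl_secs),
        }
    }

    /// True if cleanup is enabled and something last modified `age` ago
    /// has outlived the TTL.
    fn is_expired(&self, age: Duration) -> bool {
        self.enable && age >= self.ttl
    }
}
```

Keeping the interval and the TTL as separate values lets the sweep run often while still retaining each job's data for much longer than one sweep period.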

Describe the solution you'd like
The executor periodically spawns a task to clean work_dir. If none of the files in a job_dir have been modified within the last executor_cleanup_ttl seconds, the job_dir is deleted.
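
The check described above could look roughly like the following sketch, which uses only the Rust standard library. It assumes that each direct child directory of work_dir is one job_dir, and it treats an empty job_dir as expired; both are assumptions of this example, as are the function names `all_files_older_than` and `clean_expired_job_dirs`. It is not the implementation from the linked fix.

```rust
use std::fs;
use std::io;
use std::path::{Path, PathBuf};
use std::time::{Duration, SystemTime};

/// Returns true if every file under `dir` (recursively) was last modified
/// at least `ttl` before `now`. An empty directory counts as expired
/// (an assumption of this sketch).
fn all_files_older_than(dir: &Path, ttl: Duration, now: SystemTime) -> io::Result<bool> {
    for entry in fs::read_dir(dir)? {
        let entry = entry?;
        let meta = entry.metadata()?;
        if meta.is_dir() {
            if !all_files_older_than(&entry.path(), ttl, now)? {
                return Ok(false);
            }
        } else {
            // A modification time in the future is treated as an age of zero.
            let age = now
                .duration_since(meta.modified()?)
                .unwrap_or(Duration::ZERO);
            if age < ttl {
                return Ok(false);
            }
        }
    }
    Ok(true)
}

/// Deletes every job directory directly under `work_dir` whose files have
/// all gone unmodified for at least `ttl`, and returns the removed paths.
fn clean_expired_job_dirs(work_dir: &Path, ttl: Duration) -> io::Result<Vec<PathBuf>> {
    let now = SystemTime::now();
    let mut removed = Vec::new();
    for entry in fs::read_dir(work_dir)? {
        let path = entry?.path();
        if path.is_dir() && all_files_older_than(&path, ttl, now)? {
            fs::remove_dir_all(&path)?;
            removed.push(path);
        }
    }
    Ok(removed)
}
```

An executor would call something like `clean_expired_job_dirs` from a background task once every executor_cleanup_interval seconds. Because a single recently modified file keeps the whole job_dir alive, a job that is still writing shuffle files is not removed in the middle of a run.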

Describe alternatives you've considered
The scheduler sends an RPC call to the executors to delete the files when a job is done.

Additional context
apache/datafusion-ballista#9

@Ted-Jiang Ted-Jiang added the enhancement New feature or request label Feb 8, 2022
@houqp
Member

houqp commented Feb 8, 2022

On top of a background GC task, would it make sense to also clean up job dirs on job completion preemptively?

@mingmwang
Contributor

preemptively
@houqp
Sorry for my confusion. Do you mean that if a job has 3 stages, we can delete stage 1's data first while stage 3 is running?

IMO, once a SQL query has finished, all of the intermediate shuffle data can be cleared, except for the result data.

@Ted-Jiang
Member Author

preemptively
@houqp
Sorry for my confusion. Do you mean that if a job has 3 stages, we can delete stage 1's data first while stage 3 is running?

IMO, once a SQL query has finished, all of the intermediate shuffle data can be cleared, except for the result data.

@houqp @mingmwang That sounds very reasonable. I think it would also handle some error cases, which improves robustness.
IMHO, we should keep both approaches and create a separate issue to track this as a future improvement (maybe after shuffle data and result data are separated).
