-
Notifications
You must be signed in to change notification settings - Fork 1.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Enable periodic cleanup of work_dir directories in ballista executor #1783
Conversation
@@ -112,6 +116,21 @@ async fn main() -> Result<()> { | |||
.context("Could not connect to scheduler")?; | |||
|
|||
let scheduler_policy = opt.task_scheduling_policy; | |||
let cleanup_ttl = opt.executor_cleanup_ttl; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder if it might be possible to use NamedTempFile
or some other struct from tempfile
(already a dependency): https://crates.io/crates/tempfile
There are at least two benefits:
- The files are dropped immediately after they are no longer used so required intermediate diskspace is minimized
- They can't be accidentally dropped while still in use (which perhaps affects long running queries)
I did something similar in DataFusion here: #1680
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@alamb Thanks for your advice 😊!
IMHP, if one job has 3 stage, stage2 read stage1 input then delete the file, but stage2 task fail,
In ballista, scheduler will start a task to reload stage1 input. I think using NamedTempFile
will cause some trouble and complexity.
we need keep the file for task-recovery and stage retry (like spark). So i decide if all the files under job_dir not modified in TTL we can safely delete it.
If i am not right, Please correct me 🙈
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't honestly know enough about Ballista and its executor to know.
What do you think @yahoNanJing and @liukun4515 ?
@Ted-Jiang this is a great step toward making Ballista production ready 👍 Thanks for making the change. |
cc @liukun4515 |
name = "executor_cleanup_enable" | ||
type = "bool" | ||
doc = "Enable periodic cleanup of work_dir directories." | ||
default = "true" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we set the default value to false and open it after a long time of testing?
@Ted-Jiang
If the default is false, this feature will not impact the current status. Users who want to use this feature can open it.
Do you agree? @alamb
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it would be a good idea to put this feature in turned off by default so it can be tested more thoroughly
|
||
[[param]] | ||
name = "executor_cleanup_ttl" | ||
type = "i64" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it's better to use u64
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks fix !
Once someone who focuses on Ballista is happy with this PR I am happy to merge it |
I think @Ted-Jiang will address the comments and please hold the status. |
fix info Co-authored-by: Kun Liu <[email protected]>
fix info Co-authored-by: Kun Liu <[email protected]>
fix Co-authored-by: Kun Liu <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @Ted-Jiang
LGTM
Which issue does this PR close?
Closes #1780 .
Rationale for this change
Avoid work_dir quickly fill up disk space.
What changes are included in this PR?
like spark standalone mode, Executor periodic spawn a task to clean work_dir, if all the files in job_dir not modified in executor_cleanup_ttl seconds, it will be deleted.
https://spark.apache.org/docs/latest/spark-standalone.html
spark.worker.cleanup.enabled