Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Enable periodic cleanup of work_dir directories in ballista executor #1783

Merged
merged 11 commits into from
Mar 8, 2022

Conversation

Ted-Jiang
Copy link
Member

Which issue does this PR close?

Closes #1780 .

Rationale for this change

Avoid work_dir quickly fill up disk space.

What changes are included in this PR?

like spark standalone mode, Executor periodic spawn a task to clean work_dir, if all the files in job_dir not modified in executor_cleanup_ttl seconds, it will be deleted.
https://spark.apache.org/docs/latest/spark-standalone.html spark.worker.cleanup.enabled

@@ -112,6 +116,21 @@ async fn main() -> Result<()> {
.context("Could not connect to scheduler")?;

let scheduler_policy = opt.task_scheduling_policy;
let cleanup_ttl = opt.executor_cleanup_ttl;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if it might be possible to use NamedTempFile or some other struct from tempfile (already a dependency): https://crates.io/crates/tempfile

There are at least two benefits:

  1. The files are dropped immediately after they are no longer used so required intermediate diskspace is minimized
  2. They can't be accidentally dropped while still in use (which perhaps affects long running queries)

I did something similar in DataFusion here: #1680

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@alamb Thanks for your advice 😊!
IMHP, if one job has 3 stage, stage2 read stage1 input then delete the file, but stage2 task fail,
In ballista, scheduler will start a task to reload stage1 input. I think using NamedTempFile will cause some trouble and complexity.
we need keep the file for task-recovery and stage retry (like spark). So i decide if all the files under job_dir not modified in TTL we can safely delete it.
If i am not right, Please correct me 🙈

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't honestly know enough about Ballista and its executor to know.

What do you think @yahoNanJing and @liukun4515 ?

@realno
Copy link
Contributor

realno commented Feb 9, 2022

@Ted-Jiang this is a great step toward making Ballista production ready 👍 Thanks for making the change.

ballista/rust/executor/src/main.rs Outdated Show resolved Hide resolved
ballista/rust/executor/src/main.rs Outdated Show resolved Hide resolved
ballista/rust/executor/src/main.rs Show resolved Hide resolved
ballista/rust/executor/src/main.rs Show resolved Hide resolved
ballista/rust/executor/src/main.rs Outdated Show resolved Hide resolved
ballista/rust/executor/src/main.rs Outdated Show resolved Hide resolved
@Ted-Jiang
Copy link
Member Author

cc @houqp @yahoNanJing

@Ted-Jiang
Copy link
Member Author

cc @liukun4515

name = "executor_cleanup_enable"
type = "bool"
doc = "Enable periodic cleanup of work_dir directories."
default = "true"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we set the default value to false and open it after a long time of testing?
@Ted-Jiang
If the default is false, this feature will not impact the current status. Users who want to use this feature can open it.
Do you agree? @alamb

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it would be a good idea to put this feature in turned off by default so it can be tested more thoroughly


[[param]]
name = "executor_cleanup_ttl"
type = "i64"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it's better to use u64

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks fix !

@alamb
Copy link
Contributor

alamb commented Mar 4, 2022

Once someone who focuses on Ballista is happy with this PR I am happy to merge it

@liukun4515
Copy link
Contributor

Once someone who focuses on Ballista is happy with this PR I am happy to merge it

I think @Ted-Jiang will address the comments and please hold the status.
If I think it is ok, I will give my LGTM.

Copy link
Contributor

@liukun4515 liukun4515 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @Ted-Jiang
LGTM

@alamb alamb merged commit 432861e into apache:master Mar 8, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Enable periodic cleanup of work_dir directories in ballista executor
5 participants