
Downgrade docker-compose.yaml to version 3.3 so that we can support Ubuntu 20.04.4 LTS #329

Merged
merged 5 commits on Oct 15, 2022

Conversation

andygrove
Member

Which issue does this PR close?

N/A

Rationale for this change

I could not build Docker images on Ubuntu 20.04.4 LTS, which ships an older docker-compose that rejects the compose file version we previously specified.

What changes are included in this PR?

Specify version 3.3 in docker-compose.yaml

Are there any user-facing changes?

@@ -54,8 +54,7 @@ services:
     volumes:
       - ./benchmarks/data:/data
     depends_on:
-      ballista-scheduler:
-        condition: service_healthy
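For context, the long-form `depends_on` with a `condition` is not part of the compose 3.x file format that older docker-compose releases validate, which is why the diff drops it. A minimal sketch of the downgraded file follows; the service names and the data volume come from the diff, and the scheduler port from the log output later in this thread, but the images and other fields are illustrative assumptions, not the actual project file:

```yaml
# Minimal sketch, not the actual project file. Service names and the
# volume come from the PR diff; images and other fields are assumptions.
version: "3.3"
services:
  ballista-scheduler:
    image: ballista-scheduler
    ports:
      - "50050:50050"
  ballista-executor:
    image: ballista-executor
    volumes:
      - ./benchmarks/data:/data
    depends_on:
      # Compose file format 3.x only supports this short form; the
      # `condition: service_healthy` long form was removed in the diff.
      - ballista-scheduler
```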
Member Author
We should be able to start executors before the scheduler runs, but I haven't tested this in a long time. If we require the scheduler to be running first, then we should file a bug for that.

Member Author

I tested this and the executor fails if it cannot connect to the scheduler. In theory, this should be fine in an environment such as k8s because the pod will just keep restarting until it succeeds.

Contributor

This is docker-compose, not k8s - so maybe we need to mess with restart policy? @iajoiner might have more to say.

Contributor

The restart config option could be added to the compose file: https://docs.docker.com/compose/compose-file/#restart

restart: always
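A sketch of that suggestion applied to the executor service (the service name comes from the diff; everything else is illustrative):

```yaml
# Illustrative only: restart the executor whenever it exits, e.g. because
# it could not reach the scheduler on its first connection attempt.
services:
  ballista-executor:
    restart: always
    depends_on:
      - ballista-scheduler
```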

Member Author

A restart policy only takes effect after a container starts successfully. In this case, starting successfully means that the container is up for at least 10 seconds and Docker has started monitoring it. This prevents a container which does not start at all from going into a restart loop.

😢
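One common workaround under compose 3.3 is to poll the scheduler port in the entrypoint before launching the executor, so a failed first connection never kills the container. This is a sketch only; the image name and binary path are assumptions, not from this repo, and the port comes from the log output later in this thread:

```yaml
# Illustrative sketch: block until the scheduler's port accepts
# connections, then exec the real binary. Image and path are assumed.
services:
  ballista-executor:
    image: ballista-executor
    depends_on:
      - ballista-scheduler
    entrypoint: >
      sh -c 'until nc -z ballista-scheduler 50050; do sleep 1; done;
             exec /usr/local/bin/ballista-executor'
```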

Contributor

I am tempted to add a --retry-timeout-seconds 10 option to the executor, which would fix this. Ballista is targeted at enterprise users, who are often not on the latest distro, so it would be nice for this to work out of the box for as many people as possible.

I was thinking of the same option, given that the application does not seem to have a hard requirement to be available within milliseconds of startup. I also second having it work out of the box for as many users as possible.

Contributor

@onthebridgetonowhere Oct 15, 2022

@andygrove - I am not too familiar with the codebase yet, but I saw that if the scheduler is killed while the executor is running, the executor retries the connection forever, and once the scheduler is restarted, the executor is fine. Probably a silly question, but couldn't this behaviour apply from the start? Why does the executor need to connect to a running scheduler instance first, instead of retrying the default/configured address until it succeeds?

Member Author

TaskRunnerPool loops forever, polling schedulers to fetch tasks to run. This code opens a new connection to the scheduler each time, which is inefficient but does mean that it eventually recovers from scheduler downtime.

Member Author

I pushed a commit to add the new config. This fixes it for me.

ballista-executor_1   | 2022-10-15T15:44:09.490225Z  WARN main ThreadId(01) ballista_executor: Failed to connect to scheduler at http://ballista-scheduler:50050 (Could not connect to scheduler); retrying ...    
ballista-executor_1   | 2022-10-15T15:44:09.991817Z  INFO main ThreadId(01) ballista_executor: Connected to scheduler at http://ballista-scheduler:50050    

Contributor

> TaskRunnerPool loops forever, polling schedulers to fetch tasks to run. This code opens a new connection to the scheduler each time, which is inefficient but does mean that it eventually recovers from scheduler downtime.

I see, I thought it might be the case. Thanks for taking the time to explain!

@andygrove andygrove changed the title Use Docker version 3.3 so that we can support Ubuntu 20.04.4 LTS Downgrade docker-compose.yaml to version 3.3 so that we can support Ubuntu 20.04.4 LTS Oct 8, 2022
@andygrove andygrove mentioned this pull request Oct 8, 2022
Contributor

@avantgardnerio left a comment

LGTM

@andygrove
Member Author

Thanks for the reviews @avantgardnerio and @onthebridgetonowhere

@andygrove andygrove merged commit 9ad583e into apache:master Oct 15, 2022
@andygrove andygrove deleted the downgrade-docker-compose-version branch October 15, 2022 21:18