Creating the EC2Architect #603

Merged
merged 11 commits into from
Nov 12, 2021
Conversation

JackUrb
Contributor

@JackUrb JackUrb commented Nov 11, 2021

Overview

Introduces a new backend for deploying Mephisto tasks. It uses AWS EC2 instances, along with a host of other services, to set up an automated structure that relies on just one SSL certificate. This allows multiple users to host tasks, and includes a fallback server that workers see if one of the task servers is down. The fallback server also logs these bad requests, so we can track down broken tasks.

Implementation

The implementation is split into two parts: the pre-setup, handled by prepare_ec2_servers.py, and the deployment of task-related instances, handled by the EC2Architect. Both rely heavily on helpers in ec2_helpers.py.

The core of the pre-work is to set up a VPC to host our servers, a load balancer that points to the fallback by default, and various auxiliary components to make this happen (like requesting and registering the SSL certificate). The fallback takes anything routed to *.<registered_domain.com>.

When an instance is being launched for a specific task, we register it to <task_name>.<registered_domain.com>, and route traffic from the load balancer based on that subdomain to the task server.
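A minimal sketch of the routing decision described above, written as plain Python rather than as actual load balancer rules. The function names, the `routes` mapping, and the `FALLBACK_TARGET` label are hypothetical and are not taken from ec2_helpers.py.

```python
from typing import Dict

# Hypothetical label standing in for the fallback server's target.
FALLBACK_TARGET = "fallback"


def subdomain_for_task(task_name: str, registered_domain: str) -> str:
    """Build the <task_name>.<registered_domain> hostname for a task server."""
    return f"{task_name}.{registered_domain}".lower()


def resolve_target(host: str, routes: Dict[str, str], registered_domain: str) -> str:
    """Pick a target for a request host, defaulting to the fallback.

    `routes` maps fully qualified task hostnames to task server targets.
    Any other host under the registered domain goes to the fallback.
    """
    host = host.lower().rstrip(".")
    if host in routes:
        return routes[host]
    if host.endswith("." + registered_domain.lower()):
        return FALLBACK_TARGET
    raise ValueError(f"{host} is not under {registered_domain}")
```

With a route registered for `my-task.example.com`, a request for that host resolves to its task server, while a request for a subdomain whose server has been shut down resolves to the fallback.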

When a server is actually deployed, we set it up remotely to run under systemctl (which is what the .service files are for). The same approach is used for both the fallback server and the task servers.
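For readers unfamiliar with .service files, here is a rough sketch of generating a generic systemd unit from Python. It is not the content of the .service files in this PR; the function name and the example paths are assumptions for illustration only.

```python
def render_service_unit(description: str, working_dir: str, exec_start: str) -> str:
    """Render a generic systemd unit that keeps a server process running.

    Illustrative only: the real .service files in this PR may differ.
    """
    lines = [
        "[Unit]",
        f"Description={description}",
        "After=network.target",
        "",
        "[Service]",
        f"WorkingDirectory={working_dir}",
        f"ExecStart={exec_start}",
        # Restart the process if it exits for any reason.
        "Restart=always",
        "",
        "[Install]",
        "WantedBy=multi-user.target",
    ]
    return "\n".join(lines) + "\n"
```

A unit like this would typically be copied to the instance and started with `systemctl enable` and `systemctl start`, so the server comes back up after a reboot or crash.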

Information about launched servers is stored in files in the servers directory, so the servers can be referenced later and cleaned up when they are no longer needed (even if there's an issue with the Python process).
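The bookkeeping idea can be sketched as follows, assuming one JSON file per launched server. The file layout and the function names are hypothetical, not the format this PR actually writes.

```python
import json
from pathlib import Path
from typing import Any, Dict, List


def save_server_details(servers_dir: Path, server_name: str, details: Dict[str, Any]) -> Path:
    """Write the details of a launched server to <servers_dir>/<server_name>.json."""
    servers_dir.mkdir(parents=True, exist_ok=True)
    path = servers_dir / f"{server_name}.json"
    path.write_text(json.dumps(details, indent=2))
    return path


def list_leftover_servers(servers_dir: Path) -> List[Dict[str, Any]]:
    """Load every recorded server, e.g. to clean up after a crashed process."""
    return [json.loads(p.read_text()) for p in sorted(servers_dir.glob("*.json"))]


def forget_server(servers_dir: Path, server_name: str) -> None:
    """Remove a server's record once its resources have been deleted."""
    (servers_dir / f"{server_name}.json").unlink(missing_ok=True)
```

Because the record lives on disk rather than in memory, a later run can still find and tear down a server whose launching process died.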

For the future

Still needed: an in-depth document on the architecture, covering which services are set up to run this architect, how the pre-setup work (involving the VPC and load balancer) differs from what is done to launch instances later, and more.

Testing

Tested by launching parlai_chat with an EC2 architect after running the pre-setup with one of our internal roles. Servers successfully launch, deploy, and then clean up after the task is done. After everything has run, the only remaining resources are those for the fallback server.

@JackUrb JackUrb requested a review from pringshia November 11, 2021 00:32
@facebook-github-bot facebook-github-bot added the CLA Signed label Nov 11, 2021
@JackUrb JackUrb requested a review from s0mya November 11, 2021 19:24
@codecov-commenter

codecov-commenter commented Nov 11, 2021

Codecov Report

Merging #603 (fd5f629) into main (4937db1) will decrease coverage by 2.46%.
The diff coverage is 22.69%.


@@            Coverage Diff             @@
##             main     #603      +/-   ##
==========================================
- Coverage   66.36%   63.90%   -2.47%     
==========================================
  Files          83       85       +2     
  Lines        7605     8050     +445     
==========================================
+ Hits         5047     5144      +97     
- Misses       2558     2906     +348     
| Impacted Files | Coverage Δ |
|---|---|
| ...ueprints/static_html_task/static_html_blueprint.py | 41.79% <ø> (ø) |
| ...ephisto/abstractions/architects/ec2/ec2_helpers.py | 16.72% <16.72%> (ø) |
| ...histo/abstractions/architects/ec2/ec2_architect.py | 35.25% <35.25%> (ø) |
| mephisto/operations/registry.py | 87.17% <100.00%> (+0.16%) ⬆️ |
| ...tractions/architects/channels/websocket_channel.py | 79.06% <0.00%> (-1.17%) ⬇️ |
| mephisto/operations/supervisor.py | 80.26% <0.00%> (-0.67%) ⬇️ |

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@kushalchawla

Dear @JackUrb, thank you so much for this PR. I have been facing some pushback from my IT team to use Heroku. Using EC2Architect would mean an all-AWS + AMT solution, which is perfect for me.

Requests:

  1. Please let us know when EC2Architect is stable enough for us to explore and use for our studies.
  2. When do you expect 1) to happen? What is the timeline in your mind?
  3. The documentation that you plan for the future would be super helpful. But apart from that, let us know what additional steps are required at our end such as setting up a domain name, etc.
  4. I am willing to provide feedback on this by trying it out for my use case. Let me know if you need me there. Otherwise, I will just wait for a while for this PR to mature.

Thanks!

@JackUrb
Contributor Author

JackUrb commented Nov 12, 2021

Hi @kushalchawla, I'm hoping to have it committed Monday. The only thing you need is to register a domain name from somewhere, either with a standard registrar or somewhere that offers free domains. After that, running the setup script will tell you where to delegate your domain to, and then it will set up the rest (including registering certificates).

The one downside is that the initial setup takes around 2 hours to propagate, and you incur costs for as long as it stays running (on the order of a dollar or so a day, given my current testing).

@JackUrb
Contributor Author

JackUrb commented Nov 12, 2021

Merging without documentation for now after extended testing. I will write official usage documentation next week.

@JackUrb JackUrb merged commit bc3526f into main Nov 12, 2021
@JackUrb JackUrb deleted the ec2-architect branch November 12, 2021 18:28
@JackUrb JackUrb added this to the 🚀 Mephisto 1.0 milestone Jan 5, 2022