Replies: 35 comments
-
Actually, I am starting to question this. Now that I have gone through an AWS Batch tutorial and successfully run some R code in a job, I appreciate how straightforward
-
AWS is definitely on the list of things I want to support, so thank you for raising this! That said, I've got zero experience with it, so I don't even know which steps I need to consider, and I'm afraid I won't be able to read up on this very soon. Should we coordinate roadmaps a bit with yours and @HenrikBengtsson's?
-
Awesome! I think all of us care about this, and I would love to coordinate. Also cc @davidkretch. I have been experimenting some with Batch through the console, and thanks to you I can now send jobs to workers inside a Docker container via SSH. SSH into Batch jobs sounds trickier (ref: paws-r/paws#330, https://stackoverflow.com/questions/64342893/ssh-into-aws-batch-jobs). However, the user guide repeatedly mentions the possibility of SSH for Batch container instances, so I am not convinced that it is impossible.
No worries, I think I have the least general knowledge here. |
-
Here's another idea: for the moment, given the unexpected difficulty of tunneling into AWS Batch jobs, why not drop one level lower and work with EC2 instances directly? From https://gist.github.com/DavisVaughan/865d95cf0101c24df27b37f4047dd2e5, EC2 seems easier for us than Batch, and tackling the former first may help us work up to the latter later. |
-
There's a lot we can do to improve on https://gist.github.com/DavisVaughan/865d95cf0101c24df27b37f4047dd2e5, such as
|
-
For the kinds of workflows we deal with, the only added value I see in Batch relative to EC2 is cost-optimized resource provisioning, e.g. waiting for spot instances to get cheap before submitting jobs. Not sure we can do that with EC2 alone. (That and the ability to automatically connect to S3, which
-
It is worth noting that in the AWS Batch console, I cannot select the |
-
Hi @wlandau: With respect to testing on AWS, I highly recommend applying for the AWS Open Source promotional credits program, which is described more here. We got this for the Paws package. I haven't put nearly as much thought into this as you have, but if R is going to be in control of starting and stopping instances, then I agree that Batch doesn't get you much extra. With respect to Batch, I think some of its pros and cons are:
Pros
Cons
To use spot prices with EC2, you could fetch them with the DescribeSpotPriceHistory API call, but I think supporting spot pricing would likely be a lot of work, since it would also have to handle things like restarting jobs when instances get stopped due to changing prices.
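As a rough sketch of the price-reading step only (it does not touch the harder problem of restarting interrupted jobs): the records returned by DescribeSpotPriceHistory could be reduced to the cheapest availability zone as below. The helper `cheapest_spot()` and the prices are made up for illustration, and the sketch assumes one most-recent record per zone. In paws, the records would presumably come from the `SpotPriceHistory` element of `paws::ec2()$describe_spot_price_history()`.

```r
# Hypothetical helper: return the record with the lowest spot price.
# `history` mimics a list of spot price records, where `SpotPrice`
# is a character string, as in the API response.
cheapest_spot <- function(history) {
  prices <- vapply(history, function(x) as.numeric(x$SpotPrice), numeric(1))
  history[[which.min(prices)]]
}

# Made-up records for illustration only (not real prices).
history <- list(
  list(AvailabilityZone = "us-east-1a", InstanceType = "m5.large", SpotPrice = "0.0420"),
  list(AvailabilityZone = "us-east-1b", InstanceType = "m5.large", SpotPrice = "0.0385"),
  list(AvailabilityZone = "us-east-1c", InstanceType = "m5.large", SpotPrice = "0.0401")
)

best <- cheapest_spot(history)
print(best$AvailabilityZone)
```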
-
Thanks for the advice. Your assessment is helpful, and I had no idea about the promotional credit. |
-
How does the |
-
We don't know workers' node or IP address because the scheduler will assign them to nodes. Instead, each worker connects back to the main process (which is listening on the port corresponding to the So each worker needs to know either the host+port of the main process or of the SSH tunnel. |
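A minimal sketch of that last point, with hypothetical names and ports (this is not clustermq's internal code): the worker only needs one address to connect to, and that address is either the main process itself or the endpoint of the SSH tunnel.

```r
# Hypothetical helper: build the ZeroMQ-style address a worker connects to.
# If a tunnel endpoint is given, the worker uses it instead of the
# main process's own host and port.
worker_address <- function(main_host, main_port,
                           tunnel_host = NULL, tunnel_port = NULL) {
  if (!is.null(tunnel_port)) {
    host <- if (is.null(tunnel_host)) "localhost" else tunnel_host
    return(sprintf("tcp://%s:%d", host, as.integer(tunnel_port)))
  }
  sprintf("tcp://%s:%d", main_host, as.integer(main_port))
}

# Direct connection: the worker can reach the main process on the network.
direct <- worker_address("head-node.example.org", 6124)

# Tunneled connection: the worker talks to a forwarded local port.
tunneled <- worker_address("head-node.example.org", 6124, tunnel_port = 55000)
```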
-
Awesome! Sounds like my whole quixotic pursuit of worker IPs is moot! This makes me more optimistic about the potential to work through the scheduling software of Batch itself. I bet By the way, I asked about ZeroMQ compatibility with Batch and got a great answer here. I do not have the expertise to understand all the cloud-specific network barriers, but I take it as more evidence that a new Batch QSys class is possible. |
-
I think we will need to handle AWS Batch differently from the other schedulers. Instead of traditional template files, Batch uses JSON strings called "job definitions". In the AWS CLI, users can pass job definitions to But for our purposes, I think we should require the user to create a job definition in advance through the AWS web console or other means, then pass the job definition name to @mschubert, what do you think? If this sounds reasonable, what help would be most useful?
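To illustrate, here is a sketch of what "pass the job definition name" could look like on the R side. The helper `batch_submit_args()` and the default job name are hypothetical; the argument names follow the Batch SubmitJob request as exposed by paws. A request for more than one worker is expressed as an array job, which Batch supports for sizes from 2 to 10,000.

```r
# Hypothetical helper: assemble the arguments for a Batch job submission
# from a pre-existing job definition name and job queue name.
batch_submit_args <- function(job_definition, job_queue, n_workers = 1L,
                              job_name = "cmq-workers") {
  stopifnot(
    is.character(job_definition), length(job_definition) == 1L,
    is.character(job_queue), length(job_queue) == 1L,
    n_workers >= 1, n_workers <= 10000
  )
  args <- list(
    jobDefinition = job_definition,
    jobName = job_name,
    jobQueue = job_queue
  )
  # Array jobs need at least two children, so only add the field then.
  if (n_workers >= 2) {
    args$arrayProperties <- list(size = as.integer(n_workers))
  }
  args
}

args <- batch_submit_args("job-definition", "job-queue", n_workers = 3)

# Actually submitting needs AWS credentials and the paws package:
# do.call(paws::batch()$submit_job, args)
```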
-
Array jobs in AWS Batch are straightforward with `paws::batch()$submit_job()`:

```r
paws::batch()$submit_job(
  jobDefinition = "job-definition",
  jobName = "example-job-array",
  jobQueue = "job-queue",
  arrayProperties = list(size = 3)
)
```

Prework:

```
$ aws batch describe-job-definitions
{
    "jobDefinitions": [
        {
            "jobDefinitionName": "job-definition",
            "jobDefinitionArn": "arn:aws:batch:us-east-1:912265024257:job-definition/job-definition:3",
            "revision": 3,
            "status": "ACTIVE",
            "type": "container",
            "parameters": {},
            "containerProperties": {
                "image": "wlandau/cmq-docker",
                "vcpus": 2,
                "memory": 2048,
                "command": [
                    "Rscript",
                    "-e",
                    "print(Sys.getenv('AWS_BATCH_JOB_ARRAY_INDEX'))"
                ],
                "volumes": [],
                "environment": [],
                "mountPoints": [],
                "ulimits": [],
                "resourceRequirements": [],
                "linuxParameters": {
                    "devices": []
                }
            }
        }
    ]
}
```

So I think that sketches out how to submit array jobs to AWS Batch from R. If that looks good, is there anything else I can do to help get an implementation going?
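For the worker side, a minimal sketch of how a child job could use its array index. Batch sets `AWS_BATCH_JOB_ARRAY_INDEX` to a 0-based index, so the hypothetical helper `array_index()` shifts it to R's 1-based indexing. The static chunking of tasks is purely illustrative and is an assumption of this sketch, not how clustermq distributes work.

```r
# Hypothetical helper: read the 0-based Batch array index and
# convert it to a 1-based R index.
array_index <- function(value = Sys.getenv("AWS_BATCH_JOB_ARRAY_INDEX")) {
  index <- suppressWarnings(as.integer(value))
  if (length(index) != 1L || is.na(index)) {
    stop("AWS_BATCH_JOB_ARRAY_INDEX is not set; not inside an array job?")
  }
  index + 1L
}

# Illustration: split 10 tasks round-robin across an array of size 3.
tasks <- 1:10
chunks <- split(tasks, rep(1:3, length.out = length(tasks)))

# Simulate the first and last child jobs by passing the index explicitly.
first_child <- chunks[[array_index("0")]]
last_child <- chunks[[array_index("2")]]
```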
-
This looks great, thank you so much! 👍 I still see a few issues before this can be implemented:
|
-
Awesome! I will propose an hour on Google Meet. I attempted to capture some of our ideas in https://github.com/wlandau/r-cloud-ideas at a high level. |
-
I agree in principle, but we never require internet access of workers or SSH access between HPC compute nodes. If, however, a user wants the convenience of submitting remotely via SSH, then of course this is a requirement. The comment above was for caching data on the remote end, which only makes sense if the connection of local>remote is much slower than remote>workers. |
-
This answer claims reverse tunneling should be possible if the compute environment is unmanaged. The poster recommends using the metadata endpoint to find out which EC2 instance a worker is running on. @davidkretch, is that the same as paws-r/paws#330 (comment)? |
-
@wlandau The metadata endpoint would be much, much easier than the example code I created. The metadata endpoint is a local web API on the instance/container that you can query to get info about the machine you're running on. Paws doesn't have any publicly exposed functions for accessing them at the moment but it's pretty easy (example here). Is it true that if the worker is reverse SSH tunneling to your R session, it would need to know your IP, and it's less critical that you know the worker nodes' IPs? |
-
Not quite: The SSH connection is established from the local session to AWS (or remote HPC), which attaches a reverse tunnel to the same connection (so we need to be able to access the remote session). The result is that the remote session can access this tunnel to connect to the local process. |
-
Ah, I understand. In that case (local to worker), the metadata endpoint would be unhelpful, because it is only accessible from the worker, whereas we need that information in our local session.
-
To clarify further, the remote session thinks it's connected to a port on its own localhost; it just happens to be tunneled back to a port on your local computer via that SSH reverse tunnel. Here's an illustration of how parallelly SSHes into a remote machine with a reverse tunnel so that localhost:12345 on the remote machine will talk to localhost:12345 on your local computer:

```
> cl <- parallelly::makeClusterPSOCK("remote.server.org", port = 12345L, revtunnel = TRUE, dryrun=TRUE)
----------------------------------------------------------------------
Manually, (i) login into external machine 'remote.server.org':
'/usr/bin/ssh' -R 12345:localhost:12345 remote.server.org
and (ii) start worker #1 from there:
'Rscript' --default-packages=datasets,utils,grDevices,graphics,stats,methods -e 'workRSOCK <- tryCatch(parallel:::.slaveRSOCK, error=function(e) parallel:::.workRSOCK); workRSOCK()' MASTER=localhost PORT=12345 OUT=/dev/null TIMEOUT=2592000 XDR=FALSE
Alternatively, start worker #1 from the local machine by combining both step in a single call:
'/usr/bin/ssh' -R 12345:localhost:12345 remote.server.org "'Rscript' --default-packages=datasets,utils,grDevices,graphics,stats,methods -e 'workRSOCK <- tryCatch(parallel:::.slaveRSOCK, error=function(e) parallel:::.workRSOCK); workRSOCK()' MASTER=localhost PORT=12345 OUT=/dev/null TIMEOUT=2592000 XDR=FALSE"
```
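The core of that output is the `-R remote_port:localhost:local_port` argument. A small sketch of assembling it, where `reverse_tunnel_cmd()` is a hypothetical helper and not part of parallelly or clustermq:

```r
# Hypothetical helper: build an ssh command that forwards `remote_port`
# on the remote machine back to `local_port` on the local machine.
reverse_tunnel_cmd <- function(host, local_port, remote_port = local_port,
                               ssh = "ssh") {
  sprintf(
    "%s -R %d:localhost:%d %s",
    ssh, as.integer(remote_port), as.integer(local_port), host
  )
}

cmd <- reverse_tunnel_cmd("remote.server.org", 12345)
print(cmd)

# Running it would require SSH access to the host, e.g.:
# system(cmd)
```

Anything on the remote machine that connects to its own localhost at the remote port then reaches the process listening on the local port.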
-
Maybe I am getting too far ahead here, but would it make sense to implement all this ZeroMQ + R + cloud functionality in an entirely new package of its own? (Say,
-
Then again, there is a lot more to |
-
My current impression is:
|
-
Automatic spot pricing and Docker image support seem compelling (#208 (comment)). And if we pursue Lambda and Fargate in similar ways, we might see shorter worker startup times. |
-
Another thought: persistent |
-
I am thinking GHA runners could just make it easier to get R code on the cloud, provided the setup and teardown happens automatically. |
-
It didn't look to me like GitHub Actions self-hosted runners do automatic setup/teardown. It looks like they use an agent that runs on your infrastructure, polls GitHub, and waits for work. If you already have compute infrastructure, I think you could also hypothetically have clustermq talk to it directly rather than through GHA. In terms of clustermq, which assumes an SSH connection, I think the only real options on AWS are
All other options I can think of would prohibit SSH connections and would need an S3 or other intermediary for delivering results. The two open questions I have about Batch are:
I am currently slowly working on a non-clustermq approach using Lambda, but will look into the Batch approach next, hopefully later this month. |
-
Thanks Will & David! I'm not sure if routing AWS access via GHA would simplify or complicate things. I'm definitely still interested in exploring an SSH-based approach (but also trying to wrap up a science project over here, so unfortunately not much spare time right now) |
-
I propose AWS Batch as a new `clustermq` scheduler. Batch has become extremely popular, especially as traditional HPC is waning. I have a strong personal interest in making Batch integrate nicely with R (ref: ropensci/targets#152, ropensci/tarchetypes#8, https://wlandau.github.io/targets-manual/cloud.html, #102) and I am eager to help on the implementation side.

Batch is super easy to set up through the AWS web console, and I think it would fit nicely into `clustermq`'s existing interface: `options(clustermq.scheduler = "aws_batch")` and `options(clustermq.template = "batch.tmpl")`, where `batch.tmpl` contains an AWS API call with the compute environment, job queue, job definition, and key pair. I think we could use `curl` directly instead of the much larger and rapidly developing `paws` package. I think this direct approach could be far more seamless and parsimonious than the existing SSH connector with multiple hosts.
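To make the template idea concrete, here is a minimal sketch, assuming a `{{ name }}` placeholder syntax. The template contents, the field values, and the helper `fill_template()` are all hypothetical; they only show how a `batch.tmpl` could carry the job queue and job definition for the request body.

```r
# Hypothetical helper: replace {{ key }} placeholders with values.
# Keys are assumed to be plain identifiers (letters, digits, underscores).
fill_template <- function(template, values) {
  for (key in names(values)) {
    pattern <- sprintf("\\{\\{\\s*%s\\s*\\}\\}", key)
    template <- gsub(pattern, as.character(values[[key]]), template, perl = TRUE)
  }
  template
}

# Hypothetical contents of batch.tmpl: a JSON request body for a job submission.
template <- paste(
  '{',
  '  "jobName": "{{ job_name }}",',
  '  "jobQueue": "{{ job_queue }}",',
  '  "jobDefinition": "{{ job_definition }}",',
  '  "arrayProperties": { "size": {{ n_jobs }} }',
  '}',
  sep = "\n"
)

body <- fill_template(template, list(
  job_name = "cmq-workers",
  job_queue = "job-queue",
  job_definition = "job-definition",
  n_jobs = 3
))
cat(body, "\n")
```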