Google Cloud integrations #720
Comments
Hi Mark! Thank you for your interest! Google Cloud integration has been in the back of my mind, and I would love to support it in
Storage
To start, it would be great to have nicely abstracted utilities for Google Cloud Storage comparable to the Amazon ones ( After that, the next step is to create a new abstract storage class to govern internal behaviors like hashing and metadata storage, as well as concrete subclasses that inherit from both that abstract class and classes specific to each supported file format. (I was thinking of supporting Compute
|
Great thanks, will get started.
No problem and first PR will do this.
Was looking through that as a place to get started,
This is already supported on Google VMs as
library(future)
library(targets)
library(googleComputeEngineR)
vms <- gce_vm_cluster()
plan <- plan(cluster, workers = as.cluster(vms))
tar_resources_future(plan = plan)
... But I think there is an opportunity to move this more into a serverless direction, as the cloud build steps seem to seamlessly map to
As an example an equivalent
library(googleCloudRunner)
bs <- c(
cr_buildstep_gcloud("gsutil",
id = "raw_data_file",
args = c("gsutil",
"cp",
"gs://your-bucket/data/raw_data.csv",
"/workspace/data/raw_data.csv")),
# normally would not use readRDS()/saveRDS() in multiple steps but for sake of example
cr_buildstep_r("read_csv('/workspace/data/raw_data.csv', col_types = cols()) %>% saveRDS('raw_data')",
id = "raw_data",
name = "verse"),
cr_buildstep_r("readRDS('raw_data') %>% filter(!is.na(Ozone)) %>% saveRDS('data')",
id = "data",
name = "verse"),
cr_buildstep_r("create_plot(readRDS('data')) %>% saveRDS('hist')",
id = "hist",
waitFor = "data", # so it runs concurrently to 'fit'
name = "verse"),
cr_buildstep_r("biglm(Ozone ~ Wind + Temp, readRDS('data'))",
waitFor = "data", # so it runs concurrently to 'hist'
id = "fit",
name = "gcr.io/mydocker/biglm")
)
bs |> cr_build_yaml()
Normally I would put all the R steps in one buildstep sourced from a file but have added
Makes this yaml object that I think maps to
==cloudRunnerYaml==
steps:
- name: gcr.io/google.com/cloudsdktool/cloud-sdk:alpine
entrypoint: gsutil
args:
- gsutil
- cp
- gs://your-bucket/data/raw_data.csv
- /workspace/data/raw_data.csv
id: raw_data_file
- name: rocker/verse
args:
- Rscript
- -e
- read_csv('/workspace/data/raw_data.csv', col_types = cols()) %>% saveRDS('raw_data')
id: raw_data
- name: rocker/verse
args:
- Rscript
- -e
- readRDS('raw_data') %>% filter(!is.na(Ozone)) %>% saveRDS('data')
id: data
- name: rocker/verse
args:
- Rscript
- -e
- create_plot(readRDS('data')) %>% saveRDS('hist')
id: hist
waitFor:
- data
- name: gcr.io/mydocker/biglm
args:
- Rscript
- -e
- biglm(Ozone ~ Wind + Temp, readRDS('data'))
id: fit
waitFor:
- data
(more build args here)
Do the build on GCP via
And/or each buildstep could be its own dedicated
This holds several advantages:
I see that as a tool that is better than Airflow for visualising DAGs, taking care of state management on whether each node needs to be run but with a lot of scale to build each step in a cloud environment. |
I think looking through another simple addition will be to create a version of |
Sounds good, I am totally willing to work through future PRs with you that add the OO-based functionality. Perhaps the next one could be the class that contains user-defined resources that will get passed to GCS, e.g. the bucket and ACL. The AWS equivalent is at https://github.com/ropensci/targets/blob/main/R/class_resources_aws.R. That PR could include a user-facing function to create an object and an argument to add it to the whole resources object (either for a target or default for the pipeline). With that in place, it will be easier to create GCS classes equivalent to https://github.com/ropensci/targets/blob/main/R/class_aws.R and https://github.com/ropensci/targets/blob/main/R/class_aws_parquet.R, etc. |
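To make the shape of that proposal concrete, here is a rough sketch in base R of a user-facing constructor for GCS settings and a container that lets a target override pipeline defaults. Every name here (`resources_gcs_sketch`, `resources_sketch`, `resolve_resources`, the `prefix` field and the ACL strings) is hypothetical and chosen only for illustration; the real targets classes are structured differently, so treat this as a picture of the idea rather than the package's implementation.

```r
# Hypothetical constructor for GCS settings; not part of the targets API.
resources_gcs_sketch <- function(bucket, prefix = "_targets/objects", acl = "private") {
  stopifnot(
    is.character(bucket), length(bucket) == 1L, nzchar(bucket),
    is.character(prefix), length(prefix) == 1L,
    is.character(acl), length(acl) == 1L
  )
  structure(
    list(bucket = bucket, prefix = prefix, acl = acl),
    class = "resources_gcs_sketch"
  )
}

# Hypothetical container with one named slot per backend.
resources_sketch <- function(...) {
  structure(list(...), class = "resources_sketch")
}

# Target-level slots replace pipeline-level defaults of the same name.
resolve_resources <- function(default, target = NULL) {
  if (is.null(target)) {
    return(default)
  }
  out <- unclass(default)
  out[names(unclass(target))] <- unclass(target)
  structure(out, class = "resources_sketch")
}

pipeline_default <- resources_sketch(
  gcs = resources_gcs_sketch("my-default-bucket")
)
target_specific <- resources_sketch(
  gcs = resources_gcs_sketch("my-other-bucket", acl = "publicRead")
)
resolved <- resolve_resources(pipeline_default, target_specific)
print(resolved$gcs$bucket)
```

Validating the bucket and ACL once, in the constructor, means the storage classes that later consume the object can assume well-formed input.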
Not for Compute Engine, I think.
I agree that serverless computing is an ideal direction, and
# _targets.R file:
library(targets)
library(tarchetypes)
source("R/functions.R")
options(tidyverse.quiet = TRUE)
tar_option_set(packages = c("biglm", "dplyr", "ggplot2", "readr", "tidyr"))
library(future.googlecloudrunner)
plan <- future::tweak(cloudrunner, cores = 4)
resources_gcp <- tar_resources(
future = tar_resources_future(plan = plan) # Run on the cloud.
)
list(
tar_target(
raw_data_file,
"data/raw_data.csv",
format = "file",
deployment = "main" # run locally
),
tar_target(
raw_data,
read_csv(raw_data_file, col_types = cols()),
deployment = "main" # run locally; valid values are "main" and "worker"
),
tar_target(
data,
raw_data %>%
filter(!is.na(Ozone)),
resources = resources_gcp
),
tar_target(
hist,
create_plot(data),
resources = resources_gcp
),
tar_target(fit, biglm(Ozone ~ Wind + Temp, data), resources = resources_gcp),
tar_render(report, "index.Rmd", deployment = "main") # not run on the cloud
) |
I think I see your point about directly mapping So these days, I prefer that
Line 125 in e144bdb
Line 193 in e144bdb
Line 189 in e144bdb
These 3 tasks would be cumbersome to handle directly in |
Thanks for the valuable feedback :) I think I can get what I'm looking for by building on top of existing code, now that I've looked at the GitHub trigger. The key thing is how to use targets to signal the state of the pipeline between builds, which I think the GCS integration will do, e.g. can the targets folder be downloaded between builds to indicate whether a step should run or not? Some boilerplate code to do that could then sit in googleCloudRunner, possibly with an S3 method for a target build step, but I will get it working first. To prepare for that, I have built a public Docker image with renv and targets installed, which will be a requirement; it is at "gcr.io/gcer-public/targets". |
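The idea of carrying state between builds can be illustrated with a toy example. The sketch below uses a local directory as a stand-in for the bucket and a single MD5 hash as the stored state; the helper names `step_is_outdated` and `record_state` are invented for this illustration. targets itself tracks far more than one file hash (commands, dependencies, metadata), so this only shows the principle of "download state, decide, upload state".

```r
# A local directory stands in for the Cloud Storage bucket (assumption for
# illustration; a real build would copy files with gsutil or similar).
bucket_dir <- file.path(tempdir(), "fake_bucket")
workspace <- file.path(tempdir(), "workspace")
dir.create(bucket_dir, showWarnings = FALSE)
dir.create(workspace, showWarnings = FALSE)

# Hypothetical helper: a step is outdated when no state exists yet or when
# the recorded hash of its input differs from the current one.
step_is_outdated <- function(input, state_file) {
  current <- unname(tools::md5sum(input))
  if (!file.exists(state_file)) {
    return(TRUE)
  }
  previous <- readLines(state_file)
  !identical(current, previous)
}

# Hypothetical helper: store the current input hash as the new state.
record_state <- function(input, state_file) {
  writeLines(unname(tools::md5sum(input)), state_file)
}

input <- file.path(workspace, "raw_data.csv")
writeLines(c("Ozone,Wind", "41,7.4"), input)
state_in_bucket <- file.path(bucket_dir, "raw_data.md5")
unlink(state_in_bucket) # start from an empty "bucket"

first_run <- step_is_outdated(input, state_in_bucket)  # no state yet
record_state(input, state_in_bucket)                   # "upload" the state
second_run <- step_is_outdated(input, state_in_bucket) # input unchanged
writeLines(c("Ozone,Wind", "36,8.0"), input)           # the data changes
third_run <- step_is_outdated(input, state_in_bucket)
```

Here the step would run on the first build, be skipped on the second, and run again once the input changes.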
Yeah, if all the target output is in GCS, you only need to download
Awesome! So then are you thinking of using
Would you elaborate? I am not sure I follow the connection with GitHub actions. |
I hope something like a normal
I see how the GitHub action deals with loading packages (via I think if GCS can take the role of the Replicating it will necessitate including boilerplate code (the docker image, downloading the The |
So kind of like treating Related: so I take it the idea of developing a |
Maybe I'm misunderstanding, does that mean you'll run
I agree, handling packages beforehand through the Dockerfile seems ideal. I believe Henrik has anticipated some situations where packages are not known in advance and have to be installed dynamically (or marshaled, if that is possible). Really excited to try out a prototype when this all comes together. |
Yes, I think it will start to make sense for long-running tasks (>10 mins) and/or those that can run in parallel a lot, since there is a long start time but practically infinite resources if you have the cash, and it needs less cash than running on a traditional VM cluster since it is charged per second of build time. My immediate use case will be for much smaller pipelines than that, but ones that can be triggered by changes in an API, BigQuery table or Cloud Storage file, since the
That sounds like a nice future project that perhaps I can look at in 2022. I'm not sure if it's a fit, since Cloud Build is API-based, not SSH-based, but I have contacted Henrik about it. There is also already the existing
My first example runs one pipeline in one Cloud Build with
For the "one Cloud Build per target" work, I will think about what's most useful. There is scope to:
...and all the permutations of that ;) In all of them, though, the Cloud Storage bucket keeps state between builds. |
Thanks, Mark! I really appreciate your openness to all these directions. |
For the DSL approach, I think |
Nice, will take a look at
I have the one-build-per-pipeline function working for my local example, but I'd like some tests to check that it's not re-running steps etc. when it downloads the Cloud Storage artifacts. There were a few rabbit holes, but otherwise it turned out to be not much code for, I hope, a powerful impact. MarkEdmondson1234/googleCloudRunner#155
The current workflow is:
Using
The test build I'm doing takes around 1m30 vs 2m30 for the first run (it downloads from an API, does some dplyr transformations, then uploads results to BigQuery if the API data has updated).
Example based off my local tests
The function
cr_build_targets(path=tempfile())
# adding custom environment args and secrets to the build
cr_build_targets(
task_image = "gcr.io/my-project/my-targets-pipeline",
options = list(env = c("ENV1=1234",
"ENV_USER=Dave")),
availableSecrets = cr_build_yaml_secrets("MY_PW","my-pw"),
task_args = list(secretEnv = "MY_PW"))
Resulting in build:
==cloudRunnerYaml==
steps:
- name: gcr.io/google.com/cloudsdktool/cloud-sdk:alpine
entrypoint: bash
args:
- -c
- gsutil -m cp -r ${_TARGET_BUCKET}/* /workspace/_targets || exit 0
id: get previous _targets metadata
- name: ubuntu
args:
- bash
- -c
- ls -lR
id: debug file list
- name: gcr.io/my-project/my-targets-pipeline
args:
- Rscript
- -e
- targets::tar_make()
id: target pipeline
secretEnv:
- MY_PW
timeout: 3600s
options:
env:
- ENV1=1234
- ENV_USER=Dave
substitutions:
_TARGET_BUCKET: gs://mark-edmondson-public-files/googleCloudRunner/_targets
availableSecrets:
secretManager:
- versionName: projects/mark-edmondson-gde/secrets/my-pw/versions/latest
env: MY_PW
artifacts:
objects:
location: gs://mark-edmondson-public-files/googleCloudRunner/_targets/meta
paths:
- /workspace/_targets/meta/**
Looks like this when built after I commit to the repo. For my use case I would also put it on a daily schedule. |
Nice! One pattern I have been thinking about for parallel workflows is |
I have some tests now which can run without needing a cloudbuild.yaml file or trigger. They confirm
There is now also a
The minimal example takes about 1 minute to run, with 20 seconds for the
https://github.com/MarkEdmondson1234/googleCloudRunner/blob/master/R/build_targets.R
If I get time before Christmas, I will look at the comments from the pull request and create a
|
Amazing! Excellent alternative to
Yup, the DSL we talked about. That would at least convert
As an extension of the current
Yeah, I saw
With futureverse/future#567 or https://cloudyr.github.io/googleComputeEngineR/articles/massive-parallel.html#remote-r-cluster, right? Another reason I like these options is that many pipelines do not need distributed computing for all targets.
# _targets.R file
library(targets)
library(tarchetypes)
# For tar_make_clustermq() on a SLURM cluster:
options(
clustermq.scheduler = "slurm",
clustermq.template = "my_slurm_template.tmpl"
)
list(
tar_target(model, run_model()), # Runs on a worker.
tar_render(report, "report.Rmd", deployment = "main") # Runs locally.
) |
I've done a bit of restructuring to allow passing in the different strategies outlined above, customising the buildsteps you send up.
Now in via
library(googleCloudRunner)
targets::tar_script(
list(
targets::tar_target(file1, "targets/mtcars.csv", format = "file"),
targets::tar_target(input1, read.csv(file1)),
targets::tar_target(result1, sum(input1$mpg)),
targets::tar_target(result2, mean(input1$mpg)),
targets::tar_target(result3, max(input1$mpg)),
targets::tar_target(result4, min(input1$mpg)),
targets::tar_target(merge1, paste(result1, result2, result3, result4))
),
ask = FALSE
)
cr_buildstep_targets_multi()
ℹ 2021-12-21 11:57:07 > targets cloud location: gs://bucket/folder
ℹ 2021-12-21 11:57:07 > Resolving targets::tar_manifest()
── # Building DAG: ─────────────────────────────────────────────────────────────
ℹ 2021-12-21 11:57:09 > [ get previous _targets metadata ] -> [ file1 ]
ℹ 2021-12-21 11:57:09 > [ file1 ] -> [ input1 ]
ℹ 2021-12-21 11:57:09 > [ input1 ] -> [ result1 ]
ℹ 2021-12-21 11:57:09 > [ input1 ] -> [ result2 ]
ℹ 2021-12-21 11:57:09 > [ input1 ] -> [ result3 ]
ℹ 2021-12-21 11:57:09 > [ input1 ] -> [ result4 ]
ℹ 2021-12-21 11:57:09 > [ result1, result2, result3, result4 ] -> [ merge1 ]
ℹ 2021-12-21 11:57:09 > [ merge1 ] -> [ Upload Artifacts ]
|
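The DAG log above can be reproduced in miniature. The following is only a sketch in base R, assuming toy node and edge tables typed in by hand that mirror the example pipeline; in the real function these would come from targets::tar_manifest() and targets::tar_network(). The `setup_id` name and the rule that every root node waits on the setup step are assumptions made for illustration, and may differ from what cr_buildstep_targets_multi() actually does.

```r
# Toy node and edge tables mirroring the example pipeline (hand-written here,
# not produced by targets).
nodes <- c("file1", "input1", "result1", "result2", "result3", "result4", "merge1")
edges <- data.frame(
  from = c("file1", "input1", "input1", "input1", "input1",
           "result1", "result2", "result3", "result4"),
  to = c("input1", "result1", "result2", "result3", "result4",
         "merge1", "merge1", "merge1", "merge1"),
  stringsAsFactors = FALSE
)

# Assumed id of the build step that downloads earlier metadata.
setup_id <- "get previous _targets metadata"

# For each node, collect its upstream nodes; nodes with no upstream
# dependency wait on the setup step instead.
wait_for <- lapply(nodes, function(node) {
  upstream <- edges$from[edges$to == node]
  if (length(upstream) == 0L) setup_id else upstream
})
names(wait_for) <- nodes

# Print the dependency structure in the same style as the log.
for (node in nodes) {
  message("[ ", paste(wait_for[[node]], collapse = ", "), " ] -> [ ", node, " ]")
}
```

Each element of `wait_for` is what would be passed as the waitFor field of the corresponding build step, which is how independent targets such as result1 to result4 end up eligible to run concurrently.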
My test works, but with a real _targets file I'm coming across an error in my DAG, where it seems an edge exists that is not in nodes. My target list is similar to:
list(
tar_target(
cmd_args,
parse_args(),
cue = tar_cue(mode = "always")
),
tar_target(
surveyid_file,
"data/surveyids.csv",
format = "file"
),
tar_target(
surveyIds,
parse_surveyIds(surveyid_file, cmd_args)
),
...
parse_args() takes command line arguments, so it is the first entry into the DAG. This creates from
Am I approaching this the wrong way? Is there a way to handle the above situation? The current pertinent code is
myMessage("Resolving targets::tar_manifest()", level = 3)
nodes <- targets::tar_manifest()
edges <- targets::tar_network()$edges
first_id <- nodes$name[[1]]
myMessage("# Building DAG:", level = 3)
bst <- lapply(nodes$name, function(x){
wait_for <- edges[edges$to == x,"from"][[1]]
if(length(wait_for) == 0){
wait_for <- NULL
}
if(x == first_id){
wait_for <- "get previous _targets metadata"
}
myMessage("[",
paste(wait_for, collapse = ", "),
"] -> [", x, "]",
level = 3)
cr_buildstep_targets(
task_args = list(
waitFor = wait_for
),
tar_make = c(tar_config, sprintf("targets::tar_make('%s')", x)),
task_image = task_image,
id = x
)
})
bst <- unlist(bst, recursive = FALSE)
if(is.null(last_id)){
last_id <- nodes$name[[nrow(nodes)]]
}
last_id <- nodes$name[[nrow(nodes)]]
myMessage("[",last_id,"] -> [ Upload Artifacts ]", level = 3)
c(
cr_buildstep_targets_setup(target_bucket),
bst,
cr_buildstep_targets_teardown(target_bucket,
last_id = last_id)
) |
I think |
Thanks, all working! |
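The "edge that is not in nodes" situation described above can be reproduced with toy data. This sketch assumes, for illustration only, that the edge table contains vertices for user functions such as parse_args and parse_surveyIds alongside targets, while the manifest lists targets only; the data frames are hand-written rather than real tar_manifest() or tar_network() output. One way to guard against dangling references is to keep only edges whose endpoints are both targets. targets::tar_network() also has a targets_only argument that may serve the same purpose.

```r
# Hand-written stand-ins for the manifest (targets only) and the network
# edges (assumed to include function vertices as well).
nodes <- data.frame(
  name = c("cmd_args", "surveyid_file", "surveyIds"),
  stringsAsFactors = FALSE
)
edges <- data.frame(
  from = c("parse_args", "parse_surveyIds", "surveyid_file", "cmd_args"),
  to = c("cmd_args", "surveyIds", "surveyIds", "surveyIds"),
  stringsAsFactors = FALSE
)

# Vertices that appear as dependencies but are not targets.
dangling <- setdiff(edges$from, nodes$name)

# Keep only edges that connect one target to another.
keep <- edges$from %in% nodes$name & edges$to %in% nodes$name
target_edges <- edges[keep, , drop = FALSE]

print(dangling)
print(target_edges)
```

With the filtered table, every waitFor entry refers to a build step that actually exists.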
Thanks for all your work on #722, Mark! With that PR merged, I think we can move on to resources, basically replicating https://github.com/ropensci/targets/blob/main/R/tar_resources_aws.R and https://github.com/ropensci/targets/blob/main/R/class_resources_aws.R for GCP and adding a Are you still interested in implementing this? |
Sure, I will take a look and see how far I get. |
Thanks for #748, Mark! I think we are ready for a new |
Also, I think I am coming around to the idea of GCR inside |
Thanks, I will take a look. May I ask if it's going to be a case of a central function with perhaps different parsing of the blobs of bytes for the different formats? (e.g.
I will take your word on what is the best approach for the partial workers; it does seem more complicated than at first blush. It makes sense to me that there is some kind of meta layer on top that chooses between Cloud Build, future, local etc., and I would say this is probably the trend in where cloud compute is going, so it is worth looking at. |
That's what I was initially picturing. For AWS, the
Lines 91 to 103 in 70846eb
Looks like that could speed things up in cases where connection objects are supported. However, it does require that we hold both the serialized blob and the unserialized R object in memory at the same time, and for a moment the garbage collector cannot clean up either one. This drawback turned out to be limiting in |
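The memory trade-off mentioned above can be seen in a few lines of base R. This is a sketch of the two general approaches, not of how targets or any cloud client implements uploads: serializing to an in-memory raw vector keeps the object and its blob alive together, while staging through a temporary file lets the bytes live on disk instead.

```r
# An example object to store.
x <- data.frame(id = 1:1000, value = rnorm(1000))

# Approach 1: serialize to an in-memory raw vector. While `blob` exists,
# both the object and its serialized copy occupy memory.
blob <- serialize(x, connection = NULL)
from_blob <- unserialize(blob)

# Approach 2: stage through a temporary file, which a client could then
# upload from disk without holding the serialized bytes in R.
path <- tempfile(fileext = ".rds")
saveRDS(x, path)
from_file <- readRDS(path)

# Rough view of the extra memory the in-memory blob costs.
message("object bytes: ", as.numeric(object.size(x)))
message("blob bytes:   ", as.numeric(object.size(blob)))

unlink(path)
```

For small objects the difference is negligible; it matters when a single target is a large fraction of available RAM.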
Prework
Proposal
Integrate targets with googleCloudRunner and/or googleCloudStorageR. I would like to prepare a pull request to enable this and would appreciate some guidance on where best to spend time.
I was inspired by the recent AWS S3 integration and I would like to have similar functionality for googleCloudStorageR. From what I see the versioning of cloud objects that is required is available via the existing gcs_get_object() function to check for updates.
The most interesting integration I think would be with googleCloudRunner, since via Cloud Build and Cloud Run parallel processing of R jobs in a cheap cloud environment could be achieved.
Cloud Build is based on a yaml format that seems to map closely with targets including an id and wait-for attribute that can create DAGs. I propose targets help create those ids, and then download the Build Logs to check for changes? Get a bit woolly here what's best to do. I anticipate lots of cr_buildstep_r() calls with cr_build() called in the final make. I think this can be done via existing targets code calling R scripts with library(googleCloudRunner) within them, but I would like to see if there is anything deserving a pull request within targets itself that would make the process more streamlined.
Cloud Run can be used to create lots of R micro-service APIs via plumber that could trigger R scripts for target steps. There is an example at the bottom of here showing parallel execution. I propose targets could help create the parallel jobs.