Seamless integration with AWS S3 buckets #154
To clarify, the idea is simple: instead of writing this:

```r
# _targets.R
library(targets)
tar_pipeline(
  tar_target(raw_data_file, "data/raw_data.csv", format = "file"),
  tar_target(raw_data, read_csv(raw_data_file, col_types = cols())),
  tar_target(
    data,
    raw_data %>%
      mutate(Ozone = replace_na(Ozone, mean(Ozone, na.rm = TRUE)))
  ),
  tar_target(hist, create_plot(data)),
  tar_target(fit, biglm(Ozone ~ Wind + Temp, data))
)
```

we just write this:

```r
# _targets.R
library(targets)
library(targets.aws.s3)
tar_pipeline(
  tar_target(raw_data_file, "data/raw_data.csv", format = "file"),
  tar_target(raw_data, read_csv(raw_data_file, col_types = cols()), format = "aws_fst_tbl"),
  tar_target(
    data,
    raw_data %>%
      mutate(Ozone = replace_na(Ozone, mean(Ozone, na.rm = TRUE))),
    format = "aws_fst_tbl"
  ),
  tar_target(hist, create_plot(data), format = "aws_qs"),
  tar_target(fit, biglm(Ozone ~ Wind + Temp, data), format = "aws_qs")
)
```

or even just this:

```r
# _targets.R
library(targets)
library(targets.aws.s3)
tar_option_set(format = "aws_qs")
tar_pipeline(
  tar_target(raw_data_file, "data/raw_data.csv", format = "file"),
  tar_target(raw_data, read_csv(raw_data_file, col_types = cols())),
  tar_target(
    data,
    raw_data %>%
      mutate(Ozone = replace_na(Ozone, mean(Ozone, na.rm = TRUE)))
  ),
  tar_target(hist, create_plot(data)),
  tar_target(fit, biglm(Ozone ~ Wind + Temp, data))
)
```

and the data will go to a bucket (lots of …
I take that back. We would need to export methods from the …

But I still went through most of the prework, using S3 dispatch to break apart …

I am now comfortable with the relevant parts of the AWS console and CLI for S3. Next to learn: AWS S3 web API and regular curl, closely followed by R …
Also, the "eventual" part of AWS eventual consistency means that if I overwrite a target, there may be a delay until the new target becomes available: https://stackoverflow.com/questions/64073793/etag-availability-guarantees-for-aws-s3-objects/64079706#64079706. So I think we should just hash locally, stick the hash in the metadata, and poll HEAD until the bucket has the right value. |
On second thought, direct use of …
Uploads are super simple, and `aws.s3::put_object()` can carry the custom hash in a metadata header:

```r
aws.s3::put_object(
  file = "object-local-file-path",
  object = "object-key",
  bucket = "bucket-name",
  multipart = TRUE,
  headers = c("x-amz-meta-targets-hash" = "custom-hash")
)
```
Getting the custom hash:

```r
tryCatch({
  x <- suppressMessages(aws.s3::head_object(
    object = "object-key",
    bucket = "bucket-name"
  ))
  attr(x, "x-amz-meta-targets-hash")
}, error = function(e) NA_character_)
```
And finally, downloading an S3 object:

```r
aws.s3::save_object(
  object = "object-name",
  bucket = "bucket-name",
  file = "file-path"
)
```

Seems to all work super smoothly.
Got a sketch in https://github.com/wlandau/targets/tree/154. It fits the existing internals reasonably well. And because of double inheritance (e.g. from classes …
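For illustration, this is roughly what that kind of double inheritance looks like with S3 class vectors. The class and generic names below are hypothetical, not the ones in the branch:

```r
# Hypothetical sketch: a store for format = "aws_qs" inherits from both an
# AWS parent class and the existing qs class, so bucket-related methods can
# live on "tar_aws" while serialization falls through to "tar_qs".
store_new_aws_qs <- function(file = NULL) {
  structure(list(file = file), class = c("tar_aws_qs", "tar_aws", "tar_qs", "tar_store"))
}

store_exists <- function(store, ...) UseMethod("store_exists")
store_exists.tar_aws <- function(store, ...) {
  # AWS-specific behavior: check the bucket instead of the local file system.
  aws.s3::object_exists(object = store$file, bucket = "bucket-name")
}

store_serialize <- function(store, object, ...) UseMethod("store_serialize")
store_serialize.tar_qs <- function(store, object, ...) {
  # Inherited behavior: same serialization as the plain qs format.
  qs::qserialize(object)
}

store <- store_new_aws_qs(file = "objects/fit")
# store_exists(store) dispatches to the tar_aws method;
# store_serialize(store, mtcars) dispatches to the tar_qs method.
```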
Prework
After playing around with Metaflow's sandbox, I think there are two aspects of Metaflow-like cloud/HPC support that we want in `targets`: (1) sending computation to the cloud, and (2) storing target data in the cloud. `targets` already has a head start here. This issue is about (2).
How Metaflow does it
I have yet to learn how to set up a serious batch computing environment on AWS. But locally, all one needs to do is set a bunch of environment variables in `~/.metaflow/config.json`. Then, any flow called locally will automatically store all the data to S3 and retrieve it when needed, regardless of whether the steps use decorators for AWS Batch. As far as I can tell, the data never needs to touch down locally. I am not sure how much of this behavior or these env vars differs outside the sandbox, but I suspect things are similar enough.

Proposal for targets: more formats

In `targets`, storage classes such as `fst` and `keras` support custom methods for saving, loading, hashing, and serialization. I think it would be straightforward and elegant to write AWS versions of most of these. If we export the S3 generics, offload `store_formats()` to individual methods, and use S3 dispatch in the constructors of subclasses, we could even put these new methods in a new package (say, `targets.aws.s3`). Because the number of extra exports is so small, I doubt we will run into the same problem as #148 (comment).

If we're talking about just S3 storage, I think this approach will be far smoother than ropensci/tarchetypes#8 or ropensci/tarchetypes#11 in large pipelines. It also opens up possibilities for remote storage interaction. cc @mdneuzerling, @MilesMcBain, @noamross.
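To make the "S3 dispatch in the constructors" part concrete, here is a hedged sketch of how an exported constructor generic could let an external package register a new format. The generic and class names are invented for illustration; they are not the real `targets` internals:

```r
# Hypothetical sketch inside targets: convert the format string to a class
# and let S3 dispatch choose the store constructor, so formats do not have
# to be hard-coded in one big switch statement.
store_init <- function(format = "rds") {
  store_new(structure(list(), class = format))
}
store_new <- function(format, ...) UseMethod("store_new")
store_new.default <- function(format, ...) {
  stop("unsupported format: ", paste(class(format), collapse = "/"))
}
store_new.rds <- function(format, ...) {
  structure(list(), class = c("tar_rds", "tar_store"))
}

# Hypothetical sketch inside targets.aws.s3: because store_new() is an
# exported generic, the AWS package only needs to register one method
# per new format.
store_new.aws_qs <- function(format, ...) {
  structure(list(), class = c("tar_aws_qs", "tar_aws", "tar_qs", "tar_store"))
}

store_init("aws_qs") # dispatches to the method contributed by targets.aws.s3
```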