Cloud metadata #1130

Merged
wlandau merged 36 commits into main from 1109 on Aug 28, 2023
Conversation

wlandau (Member) commented Aug 28, 2023

Summary

In this PR, tar_make(), tar_make_clustermq(), and tar_make_future() gain the ability to continuously upload metadata to the cloud. As with local writes to files such as _targets/meta/meta, the metadata is uploaded every seconds_meta seconds unless a deployment = "main" target is currently blocking the main process. The metadata files live in AWS S3 or GCP GCS, depending on the new repository_meta option in tar_option_set() in _targets.R, and they go to the bucket you configure with the resources option in tar_option_set(). (repository_meta defaults to repository, so there is no need to manually opt in to this feature.) On another machine, you can manage the cloud metadata with the new functions tar_meta_download(), tar_meta_sync(), tar_meta_upload(), and tar_meta_delete().
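For concreteness, here is a minimal sketch of a _targets.R file that uses this feature. The bucket and prefix names (and the example targets) are hypothetical; repository_meta is spelled out even though it defaults to repository:

```r
# _targets.R: minimal sketch. "my-bucket" and "my-project" are hypothetical.
library(targets)
tar_option_set(
  repository = "aws",      # upload target data to AWS S3
  repository_meta = "aws", # redundant (defaults to repository), shown for clarity
  resources = tar_resources(
    aws = tar_resources_aws(
      bucket = "my-bucket",
      prefix = "my-project"
    )
  )
)
list(
  tar_target(data, airquality),
  tar_target(model, lm(Ozone ~ Temp, data = data))
)
```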

These changes align targets with the idea that in cloud computing, you are renting the machines you work with, and you want your EC2 instances and EBS volumes to disappear as soon as possible. With all the metadata and all the target data in a bucket, the local file system that ran the pipeline is free to vanish as soon as the pipeline finishes. You could even set the data store (tar_config_set(store = "...")) to a node-specific temporary directory to be kinder to shared file systems (e.g. EFS) on enterprise architectures (FYI @rpodcast). Then on a different machine, simply pull the code, then pull the metadata with tar_meta_download(). At that point, you can read objects, check the progress of a running pipeline, and even rerun the pipeline there if the original run finished. In other words, targets pipelines adopt a decentralized model similar to that of Git/GitHub (although they can't realistically go quite that far).
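Here is a sketch of that workflow on a second machine, continuing the hypothetical pipeline above. It assumes the same _targets.R (with the same bucket settings) is already present in the working directory; the node-local store is optional:

```r
# Sketch of resuming work on a different machine.
library(targets)
tar_config_set(store = tempfile()) # optional: node-local temporary data store
tar_meta_download()                # pull the metadata files from the bucket
tar_progress()                     # inspect the status of each target
tar_read(model)                    # read a target's value from the cloud
tar_make()                         # resume/rerun the pipeline if appropriate
```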

Unfortunately, after this PR, cloud targets with custom prefixes may need to rerun. This is because I needed to shift target data to a PREFIX/objects/ location in order to make room for metadata in PREFIX/meta. But I think the change is worth the inconvenience, especially given that the solution to #1108 already invalidates existing targets.
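For illustration, the resulting bucket layout looks roughly like this (the exact key names under each prefix are my assumption; only the objects/ and meta/ locations are named above):

```
PREFIX/meta/meta      # pipeline metadata (new in this PR)
PREFIX/objects/data   # target data, previously stored directly under PREFIX/
PREFIX/objects/model
```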

wlandau merged commit b7d8183 into main Aug 28, 2023
wlandau deleted the 1109 branch August 28, 2023 17:16