-
Notifications
You must be signed in to change notification settings - Fork 782
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add an option to define your own AWS client provider #620
Merged
Conversation
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
You can now specify a function that returns an AWS client. This is useful if you want to use something other than boto3 and can be used with the metaflow_custom mechanism to provide your own authentication
savingoyal
approved these changes
Jul 20, 2021
romain-intel
added a commit
that referenced
this pull request
Aug 17, 2021
* Add inputs to task_pre_step (#615) * Add inputs to task_pre_step * Addressed comments * Forgot to remove an import * Refactor @resources decorator (#617) * Refactor @resources decorator @resources decorator is shared by all compute related decorators - @Batch, @lambda, @K8s, @titus. This patch moves it out of batch_decorator.py so that other decorators can cleanly reference it. * Update __init__.py * Add an option to define your own AWS client provider (#620) You can now specify a function that returns an AWS client. This is useful if you want to use something other than boto3 and can be used with the metaflow_custom mechanism to provide your own authentication * Add S3 tests (#613) * Add S3 tests * Addressed comments * silence tar timestamp warnings (#627) * Handle None as default parameters properly in AWS Step Functions (#630) A parameter specification like - ``` Parameter(name="test_param", type=int, default=None) ``` will result in an error even though the default has been specified ``` Flow failed: The value of parameter test_param is ambiguous. It does not have a default and it is not required. ``` This PR fixes this issue. * IncludeFile now returns the included file in the client (#607) * DRAFT: IncludeFile now returns the included file in the client and CLI THIS IS NOT FINISHED; DO NOT MERGE AS IS. * Fix the tests * Forgot to update type check for multiple encoding * Add resource tags to AWS Batch jobs (#631) * Add resource tags to AWS Batch jobs This PR assumes that Batch:TagResource is whitelisted for the role which submits the AWS Batch job. * propagate tags * add env var * better commens * add production token * New patch release (#633) * Bug fix with s3tail exception (#635) * Revert "IncludeFile now returns the included file in the client (#607)" (#637) This reverts commit 0575f8d. * patch release (#639) * Update Unbounded Foreach Test in the case of a Conda environment (#626) Co-authored-by: Savin <[email protected]> Co-authored-by: Oleg Avdeev <[email protected]>
romain-intel
added a commit
that referenced
this pull request
Sep 24, 2021
* Add external_field to parameter * Fix timedate conversion from service to client * New datastore implementation This commit contains only the new datastore code and none of the backend implementations. The datastore is now split into different files: - flow_datastore.py contains the top-level FlowDataStore implementation - task_datastore.py contains the task-level datastore implementation - content_addressed_store.py contains the underlying content addressed store used by both previous data-stores. * Local backend for the datastore. The local backend is used to save and load local files. * Datatools performance improvements Datatools will now cache the s3 client it uses for single operations resulting in faster operation times. Another optimization (non advertised) is that datatools can now take an IOBase directly to avoid having an additional copy. Finally, __del__ now performs a close so the S3 datatool can be used as a regular object as opposed to just within a context. * Added S3 datastore backend. This backend allows the datastore to interface with S3. * Pure non-semantic changes (name changes and comment changes). One tiny semantic change in the way a Tar file is read (using the recommended open method instead of TarFile object) * Integrate the new datastore implementation into the current code * Remove no longer needed files for the datastore. * New features for the S3 datatools: - support for range queries (in get and get_many) - support for content-type and user metadata in put, put_many and put_files (metadata can also be retrieved using any of the get calls) - support for info and info_many to retrieve information about a file without fetching it. * Fix metadata to allow for non-string arguments * [WIP] Modify new datastore to use metadata in backend storage Instead of encoding needed information directly in the file, we now encode this information as file metadata leveraging the support for metadata for the S3 datatools and implementing support for it in the local filesystem (creating a separate file) * Properly use download_fileobj when no range is specified * Re-add intermediate compatibility mode for loading blobs from the CAS * Remove print message; fix local_backend in some cases * [WIP] Datatools: See if moving info to worker makes a performance difference * Fix issue with s3 datastore backend for multiple files * Fix issue with s3 datastore backend for multiple files * Addressed comments * Improve import of metaflow_custom to allow for top-level imports * Fix small issue with sidecar message types * Fix tags for attempt-done in task-datastore * Fix small issue with sidecar message types * Addressed comments -- minor tweaks and file refactoring * Addressed comment: removed handling of ephemeral CAS encoding directly in file * Addressed comments; refactored use of pipes Addressed the minor style comments. Removed the use of a pipe to communicate between workers and master. Communication now occurs via a file. * Forgot file in previous commit * Typo in parentheses causing issues with put of multiple files * Fix issue with S3PutObject * Fix bugs due to int/string conversion issues with new file mechanism * Fix issue with is_none and py2 vs py3 * Fix issue with metaflow listing all flows * Handle non-integer task/run IDs in local metadata * WIP: MFLog on convergence branch This is still in progress but pushing for reference. * Fixups to track master MFLog and specific convergence fixups * Fix to Batch logging * Add support for getting all artifacts from the filecache * Improve MFLog: - Save logs later in runtime to get all logs (even messages printed on error) - Allow for escaping variables in bash commands * Fix a bug when missing data is requested from the datastore * Trigger test on convergence branch * Fix failing tests * Fix issue caused by merge_artifacts now having artifacts read from datastore in write mode (#577) * Convergence master merge (#578) * UBF Implementation (#554) Unbounded Foreach Implementation Co-authored-by: Savin <[email protected]> * Check if R step names are valid Python identifiers (#573) * Check if R step names are valid Python identifiers * Accidentally changed a docstring step_name to validator. Reverting. * get rid of _METAFLOW_RESUMED_RUN (#575) * Add Unbounded Foreach support Co-authored-by: Savin <[email protected]> Co-authored-by: David Neuzerling <[email protected]> Co-authored-by: Oleg Avdeev <[email protected]> * Revert "Add external_field to parameter" This reverts commit b118685. * Small MFlog refactor * Add a way to pass an external client to S3() * Cache the S3 boto connection in the datastore (#570) * The datastore now caches the S3 boto connection and passes it to the S3 client This allows us to use S3() using with (and therefore properly clean up the temp directory) but also benefit from the connection reuse. As part of this change, we also had to change the interface to load_bytes. This change is invisible to the user as it is 100% internal to the datastore. Cleaned up the s3tail.py and s3util.py files to co-locate them with the datatools instead of the datastore (since they were no longer used directly in the datastore). This simplified the import story. * Addressed comments and simplified the code a bit - Use a single CloseAfterUse class. - Remove the use of caching on write (in artifact and blob cache); the cache now only caches things that are read which is the way it is being used - Simplified the types of caches used in datastore_set.py * Remove no longer needed file * Clean up usage of CloseAfterUse * remove artifact_cache, simplify blob_cache * get rid of redundant LazyFile * fix FileBlobCache root * fix paths in content_addressed_store * Addressed comments * Move check on file existence in CAS to backend for better efficiency * Fix modes for datastore due to merge_artifacts functionality * Ignore error when overwrite is False and file exists Co-authored-by: Ville Tuulos <[email protected]> * Add an option to define your own AWS client provider (#611) * Add an option to define your own AWS client provider You can now specify a function that returns an AWS client. This is useful if you want to use something other than boto3 and can be used with the metaflow_custom mechanism to provide your own authentication * Typo * Do not remove _expected_extensions since it is used in a function * Another typo * Addressed comments * Add inputs to task_pre_step (#616) * Tweaks and optimizations to datastore * Revert "Move check on file existence in CAS to backend for better efficiency" This reverts commit d0ac6fe. * Change encoding type to gzip+pickle to keep forward compatibility * Reduce memory usage of artifact save and load We ensure that we do not keep around temporaries and allow the GC to remove objects that are no longer needed * Make is_file parallel We can now call is_file in parallel. It is unclear if this helps a lot given the overhead of loading the s3op and starting them all. It may help in the case of a very large number of artifacts * Set overwrite=True when uploading metadata * Improve fetching info and file at the same time * reduce duplicate copies in datastore by using generators * minimize memory consumption by handling artifacts one by one * fix docstrings and documentation * fix @Batch to work with the new datastore * do not implicitly assume a correct key order in flow_datastore.load_data * Other minor consistency tweaks Co-authored-by: Romain Cledat <[email protected]> * Merge master into convergence * Add inputs to task_pre_step (#615) * Add inputs to task_pre_step * Addressed comments * Forgot to remove an import * Refactor @resources decorator (#617) * Refactor @resources decorator @resources decorator is shared by all compute related decorators - @Batch, @lambda, @K8s, @titus. This patch moves it out of batch_decorator.py so that other decorators can cleanly reference it. * Update __init__.py * Add an option to define your own AWS client provider (#620) You can now specify a function that returns an AWS client. This is useful if you want to use something other than boto3 and can be used with the metaflow_custom mechanism to provide your own authentication * Add S3 tests (#613) * Add S3 tests * Addressed comments * silence tar timestamp warnings (#627) * Handle None as default parameters properly in AWS Step Functions (#630) A parameter specification like - ``` Parameter(name="test_param", type=int, default=None) ``` will result in an error even though the default has been specified ``` Flow failed: The value of parameter test_param is ambiguous. It does not have a default and it is not required. ``` This PR fixes this issue. * IncludeFile now returns the included file in the client (#607) * DRAFT: IncludeFile now returns the included file in the client and CLI THIS IS NOT FINISHED; DO NOT MERGE AS IS. * Fix the tests * Forgot to update type check for multiple encoding * Add resource tags to AWS Batch jobs (#631) * Add resource tags to AWS Batch jobs This PR assumes that Batch:TagResource is whitelisted for the role which submits the AWS Batch job. * propagate tags * add env var * better commens * add production token * New patch release (#633) * Bug fix with s3tail exception (#635) * Revert "IncludeFile now returns the included file in the client (#607)" (#637) This reverts commit 0575f8d. * patch release (#639) * Update Unbounded Foreach Test in the case of a Conda environment (#626) Co-authored-by: Savin <[email protected]> Co-authored-by: Oleg Avdeev <[email protected]> * Fix Py2 compatibility * Fix typo in s3op.py * Pin test executions to master branch * Fix names of metadata file to include ending * Renamed metaflow_custom to metaflow_extensions * Remove duplicate code * remove .json suffix from metadata * test commit * remove spurious commits * Fix merge * remove new line * Fixed batch related bug in convergence. (#710) * fix merge conflicts * Batch fixes for convergence branch * fixup aws batch * fix whitespace * fix imports * fix newline * fix merge * fix local metadata for aws batch * merge mflog * stylistic fixes * Added preliminary documentation on the datastore (#593) * Added preliminary documentation on the datastore * Update datastore.md Update with new name for the backend. * Final set of patches to convergence (#714) * convergence fixes * fixes * gzip ts changes * fix logs subcommand * typo * Address comments; add support for var_transform in bash_capture_logs * Fix local metadata sync * Forgot to remove duplicate sync_metadata Co-authored-by: Romain Cledat <[email protected]> Co-authored-by: Romain <[email protected]> * Typo in error message Co-authored-by: Savin <[email protected]> Co-authored-by: David Neuzerling <[email protected]> Co-authored-by: Oleg Avdeev <[email protected]> Co-authored-by: Ville Tuulos <[email protected]> Co-authored-by: Valay Dave <[email protected]>
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
You can now specify a function that returns an AWS client. This is
useful if you want to use something other than boto3 and can be used
with the metaflow_custom mechanism to provide your own authentication
This is #611 cherry-picked on top of master.