From aa28d214089d6b70ab4a49864c234193ddd018cd Mon Sep 17 00:00:00 2001
From: Nissan Pow
Date: Wed, 11 Jan 2023 21:45:11 -0800
Subject: [PATCH] Rebaseline from master (#2)
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

* Don't explicitly break py2 support (#962)
  * Don't explicitly break py2 support
  * Typo
  * Typo

* Pass the paths used by the external interpreter to the one launching the Env Escape server (#960)
  We were not reflecting values passed via PYTHONPATH, or added programmatically, to the interpreter launching the environment escape server.

* Bump follow-redirects in /metaflow/plugins/cards/ui (#966)
  Bumps [follow-redirects](https://github.com/follow-redirects/follow-redirects) from 1.14.7 to 1.14.9.
  - [Release notes](https://github.com/follow-redirects/follow-redirects/releases)
  - [Commits](https://github.com/follow-redirects/follow-redirects/compare/v1.14.7...v1.14.9)
  ---
  updated-dependencies:
  - dependency-name: follow-redirects
    dependency-type: indirect
  ...
  Signed-off-by: dependabot[bot]
  Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* Adding export of graph to JSON for cli (#955)

* Fix "Too many symbolic links" error when using Conda + Batch on MacOS (#972)
  * Fix "Too many symbolic links" error when using Conda + Batch on MacOS
  * Ran black

* emit app tag for AWS Batch jobs (#970)

* Bump to 2.5.3 (#974)

* Extension packaging improvements (#959)
  * Fixed dual distribution import in some cases; fixed get_plugin_cli extension
    If sys.path contains the same directory multiple times, metadata.distributions() will list a package multiple times, which breaks our import mechanism. This is addressed by skipping duplicate packages (a sketch of the idea appears below). get_plugin_cli for extensions wasn't evaluated late enough, which prevented additional parameters from being added.
  * Fix issue with metaflow_extensions and Conda environment
    In a Conda environment, it is not possible to re-resolve all the metaflow_extensions packages since they appear as a single directory. In that case, we rely on the INFO file to give us the proper information. Also fixed the way metaflow_extensions are added to the Conda environment to avoid leaking more information than needed.
  * Validate configuration files for extensions and de-duplicate for internal ns packages
  * Update packaging of metaflow_extensions packages
    Packaging is now handled directly by extension_support.py and each distribution/package can define its own suffixes to include.
  * Rework the import mechanism for modules contained in metaflow_extensions
    This change does away with the deprecated load_module and also improves handling of loading children modules and _orig modules.
  * Address comments
  * Squash merge of origin/master
  * Fix issue when packages were left out
  * Merge get_pinned_conda_libs
  * Fix issue with __init__.py in non-distribution packages

* Moving card import changes on top of Romain's branch (#973)
  * Moving card import changes on top of Romain's branch
    - tiny bug fix to extensions.
    - Added card-related refactor to support mfextinit and regular packages.
    - removing card packaging related code in decorator + everywhere else.
  * Tiny bug fix on top of Romain's new changes.
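  Editorial aside: the "skip duplicate packages" idea described under #959 above can be sketched roughly as follows. This is a minimal illustration using the standard-library `importlib.metadata` (Python 3.8+), not Metaflow's actual extension_support.py code:

  ```python
  # Illustrative sketch only: de-duplicate distributions when sys.path lists the
  # same directory more than once, so each package is considered a single time.
  import importlib.metadata


  def unique_distributions():
      seen = set()
      for dist in importlib.metadata.distributions():
          name = (dist.metadata["Name"] or "").lower()
          if name and name not in seen:
              seen.add(name)
              yield dist


  if __name__ == "__main__":
      for d in unique_distributions():
          print(d.metadata["Name"], d.version)
  ```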
  * removing unnecessary logic
  * Added a bit more inline documentation and addressed documentation comments
  * Properly handle parsing of package requirements
  * Fix Pylint errors in some cases of metaflow_extensions module aliasing
  * Properly handle case of files at the root of a PYTHONPATH package

* Tests for Extensions (#978)
  * Added tests for extensions. Checking if they work.
  * dummy commit to see if things work.
  * fix
  * debug
  * bug fix.
  * possible fix.
  * tweaking context
  * Bug fix to tests.
  * dummy commit
  * bug fix
  * Added extension test to core tests.
    - remove separate gh action
  * removing files.
  * fix
  * added extension test to py3 context
    - remove redundant complexity.
  Co-authored-by: Valay Dave

* pass DEFAULT_AWS_CLIENT_PROVIDER to remote tasks (#982)

* Fixing some hacky plumbing in card test suite (#967)
  * Fixing some hacky plumbing for card tests.
    - Added `--file` arg to card list cli
    - Using files to get the information written by cli
    - changes to test to use files instead of stdout
    - Piping errors to stderr to capture card-not-found errors.
  * Removing `\n` from all card tests assertions
    - `\n` was there because earlier we read stdout.
    - Now that we read files, it is not needed.

* Simplify mflog (#979)

* Fix extension root determination in some cases + card tweaks (#989)

* Use importlib_metadata 2.8.3 for Python 3.4, 3.5 but 4.8.3 for Python 3.6+ (#988)
  * Use importlib_metadata 2.8.3 for Python 3.4 and 3.5 but 4.8.3 for Python 3.6+
    This is required because importlib_metadata 3.4.0 introduced a field called `_normalized_name` that later versions of importlib_metadata rely on. When importlib_metadata looks for distributions, it queries all registered/installed versions of importlib_metadata, and if an older version (like 2.8.3) returns an object without this field, it causes a crash.
  * Add __init__.py files to make the vendored package non-namespace

* Allow configuring the root directory where artifacts go when pulling from S3. (#991)
  Co-authored-by: kgullikson

* Add an option to pass a `role` argument to S3() to provide a role (#987)
  * Add an option to pass a `role` argument to S3() to provide a role for the S3 client
  * Move to partial to be able to pickle ops
  * Address comments

* bump version to 2.5.4 (#993)

* Bug fix for datetime index in default card (#981)
  * Bug fix for datetime_indexes.
  * ran black

* Bump minimist from 1.2.5 to 1.2.6 in /metaflow/plugins/cards/ui (#995)
  Bumps [minimist](https://github.com/substack/minimist) from 1.2.5 to 1.2.6.
  - [Release notes](https://github.com/substack/minimist/releases)
  - [Commits](https://github.com/substack/minimist/compare/1.2.5...1.2.6)
  ---
  updated-dependencies:
  - dependency-name: minimist
    dependency-type: indirect
  ...
  Signed-off-by: dependabot[bot]
  Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* Fix black - set upper bound for click version (#1006)

* Add tags to `current` singleton (#1019)
  * first attempt
  * fix _set_env param typo
  * add assertions for tags to current singleton test

* Dispatch Metaflow flows to Argo Workflows (#992)
  * Dispatch Metaflow flows to Argo Workflows
  * Remove spurious files
  * Make black happy
  * Make black happy again
  * Remove spurious TODOs
  * Handle non-numeric task ids
  * Fix typos
  * don't actively automount service account tokens
  * make black happy
  * Update @kubernetes
  * remove spurious commits
  * fix issue with hyphens
  * raise memory for tests
  * change limits to requests
  * pin back test resources
  * update refresh timeout
  * stylistic nits
  * Add lint check
  * make black happy
  * support gpus
  * drop extra bracket
  * fix

* Skip sleep if not retrying S3 operations (#1001)
  * Don't sleep if not retrying S3 operations
  * Remove unused env variable
  * VSCode is too smart

* Bump minor version for release (#1022)

* Alternate way of adding metadata to cloned tasks (#1003)
  * Alternate way of adding metadata to cloned tasks
    This adds it on the client side.
  * Clean up metadata passing. Also optimize use within the client
  * Typo
  * Fix black test
  * Keep black happy
  * Added tests
  * Ignore namespace when getting origin task

* Bump cross-fetch from 3.1.4 to 3.1.5 in /metaflow/plugins/cards/ui (#1027)
  Bumps [cross-fetch](https://github.com/lquixada/cross-fetch) from 3.1.4 to 3.1.5.
  - [Release notes](https://github.com/lquixada/cross-fetch/releases)
  - [Commits](https://github.com/lquixada/cross-fetch/compare/v3.1.4...v3.1.5)
  ---
  updated-dependencies:
  - dependency-name: cross-fetch
    dependency-type: indirect
  ...
  Signed-off-by: dependabot[bot]
  Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* fixed bug from #1023 (#1025)

* current.pathspec to return None when used outside Flow (#1033)
  * current.pathspec to return None when used outside Flow
  * formatted with black

* Fixing bug in the `card list` command (#1044)
  - echo_always defaulted to stderr :(

* Enable MinIO/EMC-ECS as blob stores for Metaflow (#1045)
  * Enable EMC-ECS
  * Enable MinIO & EMC-ECS blob stores for Metaflow
  * fix InvalidRange issue

* [tag-mutation-project] Stash system info in task metadata (#1039)
  * stash system info as metadata (in addition to system_tags today)
  * improve
  * avoid parsing on ":" for user id
  * remove date - it's not well defined and created_at is available
  * python_version is in metadata now, and can vary between orig and resume runs
  * address CR comments
  * add the comments

* Address issues with S3 get and ranges (#1034)
  * Address issues with S3 get and ranges
    This addresses two issues:
    - get and get_many would not properly populate the range_info in S3Object
    - get_many on different ranges of the same file would return incorrect results
  * Add one more test (merge with Netflix internal tests)
  * Put Run() test behind a check
  * Address comments

* Fix _new_task calling bug in LocalMetadataProvider (#1046)

* Release 2.6.1 (#1047)

* Support default secrets (#1048)

* Run id file should be written prior to execution when resuming (#1051)
  * Run id file should be written prior to execution when resuming
  * separate run id test so we can exclude it from batch

* release 2.6.2 (#1052)

* Fix instance metadata calls for IMDSv2 (#1053)

* bump version to 2.6.3 (#1054)

* Small reorder of imports to make things more consistent (#1055)

* [tag-mutation-project] Local metadata provider - object read paths to return ancestral Run tags (#1043)
  * wip
  * Fix nits

* typo fix. (#1068)

* This is Jackie's tag changes with an additional test and a fix.
  (#1063)
  * rebase on master
  * add replace_many as candidate subcommand
  * updates from some CR suggestions
  * propagate opt rename
  * more changes following CR
  * improve consistency guarantees of local metadata tag mutations (retries if a race was lost)
  * write comment
  * use replace-many logic to back "tag replace"
  * finalize client API for tag operations
  * fix retry logic
  * Add task.tags == run.tags check in test
  * [tag-mutation-project] Add tag mutation support to ServiceMetadataProvider
  * fix version
  * Enable TagMutationTest in batch and k8s
  * Bogus commit
  * minor UX fixes on CLI
  * Fully support CliCheck for TagMutationTest
  * add tag and tag set validation to local metadata code path
    Validate tags on run / resume paths too
  * flat nor flatten on CLI
  * Logs really do come via stdout - related to CliCheck refactor
  * argo and step functions - validate tags as soon as possible (before other user messaging)
  * no capture_output pre 3.7
  * Add another test for tag mutation
  * Properly pass down the metadata provider
    There were two cases where the metadata provider was not properly passed down to the client:
    - when calling the client within a flow
    - when invoking the `tag` command
    This fixes both issues and also allows the user to introspect the metadata provider and datastore being used for the flow.
  * Do not make public metadata and datastore in current object
  * Typo
  Co-authored-by: Jackie Tung
  Co-authored-by: jackie-ob <104396218+jackie-ob@users.noreply.github.com>

* Add two options to resume: the ability to specify a run-id and the ab… (#1059)
  * Add two options to resume: the ability to specify a run-id and the ability to only clone tasks
    The scenario for this is mainly external schedulers that are capable of "resuming" failed runs by re-executing only the tasks that failed or didn't execute the first time. We need a way for Metaflow to "clone" the tasks that are not re-executed.
  * Small formatting fix
  * Made clone-only resume possibly re-entrant
  * Addressed comments
  * Clarify the reentrant behavior of resume

* Improve sidecar message handling (#1057)
  * Improve sidecar message handling
    Several changes:
    - allow side-passing of a static "context" to the sidecar process to keep messages as small as possible
    - clean up the interfaces and make it clearer what is what (i.e., there is an emitter inside the process and then a sidecar that runs outside)
    - better error checking
    - better handling of shutdown (sidecars will now properly get a shutdown message giving them the opportunity to clean up -- this may add a bit of latency at the end of the program execution)
    Also improved the message shown when plugins don't load due to an issue with Metaflow extensions.
  * Add the possibility to pass additional context to sidecars
    This allows more information to be provided after the sidecar is created while still keeping message sizes small.
  * Better handling of initial context message for sidecars
    In the previous implementation, a failure to send the initial context message could crash the flow. This change handles that message like any other message (so it has no impact on the flow itself). It also implements better retry policies for sending the context. Finally, this change improves error handling: previously, if a message failed to send, the next successful one was likely to be invalid JSON. (A sketch of this message pattern appears below.)
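  Editorial aside: the pattern described above (a one-time static context message, small follow-up events, an explicit shutdown message, and retries that never crash the flow) can be sketched roughly as follows. The names here (`SidecarEmitter`, `MSG_*`) are illustrative assumptions, not Metaflow's actual sidecar API:

  ```python
  # Illustrative sketch only -- not Metaflow's actual sidecar implementation.
  import json
  import subprocess

  MSG_CONTEXT, MSG_EVENT, MSG_SHUTDOWN = "context", "event", "shutdown"


  class SidecarEmitter:
      """Sends newline-delimited JSON messages to a sidecar worker subprocess."""

      def __init__(self, worker_cmd, static_context):
          self._proc = subprocess.Popen(worker_cmd, stdin=subprocess.PIPE, text=True)
          # Send the static context once so that later messages can stay small.
          self._send({"type": MSG_CONTEXT, "payload": static_context})

      def _send(self, msg, retries=3):
          for _ in range(retries):
              try:
                  self._proc.stdin.write(json.dumps(msg) + "\n")
                  self._proc.stdin.flush()
                  return True
              except (BrokenPipeError, OSError):
                  continue
          return False  # a failed send must never crash the flow itself

      def emit(self, payload):
          self._send({"type": MSG_EVENT, "payload": payload})

      def shutdown(self):
          # An explicit shutdown message gives the worker a chance to clean up.
          self._send({"type": MSG_SHUTDOWN})
          self._proc.stdin.close()
          self._proc.wait(timeout=10)
  ```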
  * Improve handling of corner conditions on message send
  * Restart sidecar faster when the sidecar process dies
  * Properly send context after sidecar restart
  * Remove the specific context from monitors/loggers
    Subclasses can now handle contexts directly. Refactoring also cleaned up what a sidecar and a sidecar worker do by moving some of the code out.
  * Typo: fix names
  * Minor typos for sidecar changes
  * Clean up sidecars
  * More cleanups for sidecars
  * Addressed final comments for sidecars

* Small typo in NullEventLogger.send (#1071)
  Pushing directly per our conversation with Savin

* Bump version to 2.7.0

* Change behavior of env escape when module is present (#1072)
  When a module configured in the env escape was present in the local environment, we would raise an error. We now print a message and use the module present in the environment (not the environment escape). This allows env escape modules to provide a "fallback" in case the module is not present, while still using the local module if it is present.

* Bump version to 2.7.1 (#1073)

* Fix an issue with the environment escape server directory (#1074)
  The environment escape server would launch in the cwd. This could cause issues if, for example, a `metaflow` directory existed in that cwd (because it would then take precedence over the actual metaflow package). To remediate this, the server is now launched in the configuration directory (where we know what is there). Also took care of a small annoyance: an empty overrides.py file no longer has to be created for everything.

* Support session_vars and client_params in S3() (#1069)
  * Support session_vars and client_params in S3()
    This allows the user to pass down session variables and client parameters to set when getting an S3 client.
  * Addressed comments about default configuration

* Support M1 Macs (#1077)

* changes to @kubernetes for sandbox (#1079)

* Changed error to warning in AWS retry (#1081)
  Co-authored-by: Preetam Joshi

* bump to 2.7.2 (#1083)

* Fixed s3util test (#1084)
  * Fixing s3util test case
  * added comment
  Co-authored-by: Preetam Joshi

* Bump svelte from 3.46.1 to 3.49.0 in /metaflow/plugins/cards/ui (#1086)
  Bumps [svelte](https://github.com/sveltejs/svelte) from 3.46.1 to 3.49.0.
  - [Release notes](https://github.com/sveltejs/svelte/releases)
  - [Changelog](https://github.com/sveltejs/svelte/blob/master/CHANGELOG.md)
  - [Commits](https://github.com/sveltejs/svelte/compare/v3.46.1...v3.49.0)
  ---
  updated-dependencies:
  - dependency-name: svelte
    dependency-type: direct:development
  ...
  Signed-off-by: dependabot[bot]
  Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* Metadata version check flag (#1088)

* Bump terser from 5.10.0 to 5.14.2 in /metaflow/plugins/cards/ui (#1090)
  Bumps [terser](https://github.com/terser/terser) from 5.10.0 to 5.14.2.
  - [Release notes](https://github.com/terser/terser/releases)
  - [Changelog](https://github.com/terser/terser/blob/master/CHANGELOG.md)
  - [Commits](https://github.com/terser/terser/commits)
  ---
  updated-dependencies:
  - dependency-name: terser
    dependency-type: indirect
  ...
  Signed-off-by: dependabot[bot]
  Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* fix r CI deps (#1092)
  * fix r CI deps

* fix fractional resource handling for batch (#1089)

* bump version (#1093)

* Fix docstrings for the API reference (no functional changes!)
  (#1076)
  Make the docstring format in public APIs compliant with the new API reference framework

* Move a sys.path modification in s3op to __main__ (#1095)
  In its current location, this could modify the sys.path of the currently running metaflow, which could have nefarious consequences for the escape hatch, which uses sys.path to determine the outside environment's python path. The following scenario would cause issues:
  - metaflow is installed in the usual path on the system
  - a conda environment was manually bootstrapped from a directory A.
    + at this point, sys.path starts with `A` and then contains the other system includes
    + s3op.py is imported at some point by the Conda installer when it calls `get_many`
    + this modifies sys.path to insert, at the beginning, the parent of Metaflow; so in this case, sys.path looks something like ['/apps/python3/lib/python3.7/site-packages', 'A', '/apps/python3/lib/python3.7/site-packages'...]
    + when the escape hatch trampolines are created, this sys.path is used to determine what the sys.path for the outside interpreter is.
    + in A, we create:
      * INFO
      * metaflow
      * metaflow_extensions
      which properly describe the installation of metaflow
  - when the escape hatch client runs, it runs in the conda environment and uses the metaflow created in A.
  - when the client wants to start the server is where we run into issues: at this point, the server will use the PYTHONPATH, which starts with '/apps/python3/lib/python3.7/site-packages', in which it will find metaflow. It will therefore use that metaflow (which is the same as the one linked in A) to start the server. This runs into issues because A is also in PYTHONPATH, so the extension support loader will also try to load `A/metaflow_extensions`. This causes issues if multiple extensions are installed there (it will complain about duplicate configurations, for example). The `INFO` file typically used to solve this problem is not read as it was not present for the top-level metaflow.
  This patch simply moves the modification of sys.path to where it is actually needed and avoids polluting sys.path when the module is simply imported (and not called as a script).

* Airflow Support (#1094)
  * Airflow on Kubernetes minus foreachs.
    - Support for all metaflow constructs except foreach and sensors

  Squashed commit of the following:

  commit ef8b1e3768695bc4d3375a947ab1da9c6520bcf1  Author: Valay Dave  Date: Fri Jul 29 01:06:26 2022 +0000
    Removed sensors and banned foreach's
  commit 8d517c4fecc6568777ad03eca81aaacfa3e91156  Author: Valay Dave  Date: Fri Jul 29 00:59:01 2022 +0000
    committing k8s related file from master.
  commit a7e1ecdbf7b8b8d1cc21321cc8e196053f8305e4  Author: Valay Dave  Date: Fri Jul 29 00:54:45 2022 +0000
    Uncommented code for foreach support with k8s
    KubernetesPodOperator version 4.2.0 renamed `resources` to `container_resources` - Check: (https://github.com/apache/airflow/pull/24673) / (https://github.com/apache/airflow/commit/45f4290712f5f779e57034f81dbaab5d77d5de85)
    This was done because `KubernetesPodOperator` didn't play nice with dynamic task mapping and the `resources` argument had to be deprecated. Hence this codepath checks the version of `KubernetesPodOperator` and sets the argument accordingly: if the version is < 4.2.0 we pass `resources`; if it is >= 4.2.0 we pass `container_resources`. The `resources` argument of KubernetesPodOperator is going to be deprecated in the future.
So we will only use it for `KuberentesPodOperator` version < 4.2.0 The `resources` argument will also not work for foreach's. commit 2719f5d792ada91e3ae0af6f1a9a0c7d90f74660 Author: Valay Dave Date: Mon Jul 18 18:31:58 2022 +0000 nit fixes : - fixing comments. - refactor some variable/function names. commit 2079293fbba0d3d862476a7d67b36af8a3389342 Author: Valay Dave Date: Mon Jul 18 18:14:53 2022 +0000 change `token` to `production_token` commit 14aad5ff717418e4183a88fa84b2f5e5bb13927a Author: Valay Dave Date: Mon Jul 18 18:11:56 2022 +0000 Refactored import Airflow Sensors. commit b1472d5f7a629024ca45e8b83700400d02a4d455 Author: Valay Dave Date: Mon Jul 18 18:08:41 2022 +0000 new comment on `startup_timeout_seconds` env var. commit 6d81b758e8f06911258d26f790029f557488a0d7 Author: Valay Dave Date: Mon Jul 18 18:06:09 2022 +0000 Removing traces of `@airflow_schedule_interval` commit 0673db7475b22f3ce17c2680fc0a7c4271b5c946 Author: Valay Dave Date: Thu Jul 14 12:43:08 2022 -0700 Foreach polish (valayDave/metaflow#62) * Removing unused imports * Added validation logic for airflow version numbers with foreaches * Removed `airflow_schedule_interval` decorator. * Added production/deployment token related changes - Uses s3 as a backend to store the production token - Token used for avoiding nameclashes - token stored via `FlowDatastore` * Graph type validation for airflow foreachs - Airflow foreachs only support single node fanout. - validation invalidates graphs with nested foreachs * Added configuration about startup_timeout. * Added final todo on `resources` argument of k8sOp - added a commented code block - it needs to be uncommented when airflow releasese the patch for the op - Code seems feature complete keeping aside airflow patch commit 4b2dd1211fe2daeb76e29e4084f21e96b10cdae9 Author: Valay Dave Date: Thu Jul 7 19:33:07 2022 +0000 Removed retries from user-defaults. commit 0e87a97fea15ba3aaa6d4228b141bd796b767c43 Author: Valay Dave Date: Wed Jul 6 16:29:33 2022 +0000 updated pod startup time commit fce2bd263f368dbb78a34ac71f64e13c89277222 Author: Valay Dave Date: Wed Jun 29 18:44:11 2022 +0000 Adding default 1 retry for any airflow worker. commit 5ef6bbcde51b1f4923a192291ed0e07d07ec7321 Author: Valay Dave Date: Mon Jun 27 01:22:42 2022 +0000 Airflow Foreach Integration - Simple one node foreach-join support as gaurenteed by airflow - Fixed env variable setting issue - introduced MetaflowKuberentesOperator - Created a new operator to allow smootness in plumbing xcom values - Some todos commit d319fa915c558d82f1d127736ce34d3ae0da521d Author: Valay Dave Date: Fri Jun 24 21:12:09 2022 +0000 simplifying run-id macro. commit 0ffc813b1c4e6ba0103be51520f42d191371741a Author: Valay Dave Date: Fri Jun 24 11:51:42 2022 -0700 Refactored parameter macro settings. (valayDave/metaflow#60) commit a3a495077f34183d706c0edbe56d6213766bf5f6 Author: Valay Dave Date: Fri Jun 24 02:05:57 2022 +0000 added comment on need for `start_date` commit a3147bee08a260aa78ab2fb14c6232bfab2c2dec Author: Valay Dave Date: Tue Jun 21 06:03:56 2022 +0000 Refactored an `id_creator` method. 
commit 04d7f207ef2dae0ce2da2ec37163ac871f4517bc Author: Valay Dave Date: Tue Jun 21 05:52:05 2022 +0000 refactor : -`RUN_ID_LEN` to `RUN_HASH_ID_LEN` - `TASK_ID_LEN` to `TASK_ID_HASH_LEN` commit cde4605cd57ad9214f5a6afd7f58fe4c377e09e2 Author: Valay Dave Date: Tue Jun 21 05:48:55 2022 +0000 refactored an error string commit 11458188b6c59d044fca0dd2d1f5024ec84f6488 Author: Valay Dave Date: Mon Jun 20 22:42:36 2022 -0700 addressing savins comments. (#59) - Added many adhoc changes based for some comments. - Integrated secrets and `KUBERNETES_SECRETS` - cleaned up parameter setting - cleaned up setting of scheduling interval - renamed `AIRFLOW_TASK_ID_TEMPLATE_VALUE` to `AIRFLOW_TASK_ID` - renamed `AirflowSensorDecorator.compile` to `AirflowSensorDecorator.validate` - Checking if dagfile and flow file are same. - fixing variable names. - checking out `kubernetes_decorator.py` from master (6441ed5) - bug fixing secret setting in airflow. - simplified parameter type parsing logic - refactoring airflow argument parsing code. commit 83b20a7c6a13b3aedb7e603e139f07f0ef2fb646 Author: Valay Dave Date: Mon Jun 13 14:02:57 2022 -0700 Addressing Final comments. (#57) - Added dag-run timeout. - airflow related scheduling checks in decorator. - Auto naming sensors if no name is provided - Annotations to k8s operators - fix: argument serialization for `DAG` arguments (method names refactored like `to_dict` became `serialize`) - annotation bug fix - setting`workflow-timeout` for only scheduled dags commit 4931f9c84e6a1d20fc3ecb41cf138b72e5dee629 Author: Valay Dave Date: Mon Jun 6 04:50:49 2022 +0000 k8s bug fix commit 200ae8ed4a00028f094281f73a939e7a4dcdf83a Author: Valay Dave Date: Mon Jun 6 04:39:50 2022 +0000 removed un-used function commit 70e285e9a7cfbec71fc293508a62c96f33562a01 Author: Valay Dave Date: Mon Jun 6 04:38:37 2022 +0000 Removed unused `sanitize_label` function commit 84fc622d8b11e718a849b2e2d91ceb3ea69917e6 Author: Valay Dave Date: Mon Jun 6 04:37:34 2022 +0000 GPU support added + container naming same as argo commit c92280d8796ec12b4ff17fa2ff3c736c7244f39c Author: Valay Dave Date: Mon Jun 6 04:25:17 2022 +0000 Refactored sensors to different files + bug fix - bug caused due `util.compress_list`. - The function doesn't play nice with strings with variety of characters. - Ensured that exceptions are handled appropriately. - Made new file for each sensor under `airflow.sensors` module. commit b72a1dcf0dbbbcb814581d92738fd27ec31ef673 Author: Valay Dave Date: Sat Jun 4 01:41:49 2022 +0000 ran black. commit 558c82f65b383ed0d61ded6bc80326471e284550 Author: Valay Dave Date: Fri Jun 3 18:32:48 2022 -0700 Moving information from airflow_utils to compiler (#56) - commenting todos to organize unfinished changes. - some environment variables set via`V1EnvVar` - `client.V1ObjectFieldSelector` mapped env vars were not working in json form - Moving k8s operator import into its own function. - env vars moved. commit 9bb5f638792a671164ec95891e97f599e9a3385f Author: Valay Dave Date: Fri Jun 3 18:06:03 2022 +0000 added mising Run-id prefixes to variables. - merged `hash` and `dash_connect` filters. commit 37b5e6a9d8ca93cc91244c8d77c7d4f61280ba59 Author: Valay Dave Date: Fri Jun 3 18:00:22 2022 +0000 nit fix : variable name change. commit 660756f952ebd92ba1e26d7f908b81036c31ff10 Author: Valay Dave Date: Fri Jun 3 17:58:34 2022 +0000 nit fixes to dag.py's templating variables. 
commit 1202f5bc92f76df52b5957f11c8574cadfa62196 Author: Valay Dave Date: Fri Jun 3 17:56:53 2022 +0000 Fixed defaults passing - Addressed comments for airflow.py commit b9387dd428c1a37f9a3bfe2c72cab475da708c02 Author: Valay Dave Date: Fri Jun 3 17:52:24 2022 +0000 Following Changes: - Refactors setting scheduling interval - refactor dag file creating function - refactored is_active to is_paused_upon_creation - removed catchup commit 054e3f389febc6c447494a1dedb01228f5f5650f Author: Valay Dave Date: Fri Jun 3 17:33:25 2022 +0000 Multiple Changes based on comments: 1. refactored `create_k8s_args` into _to_job 2. Addressed comments for snake casing 3. refactored `attrs` for simplicity. 4. refactored `metaflow_parameters` to `parameters`. 5. Refactored setting of `input_paths` commit d481b2fca7914b6b657a69af407cfe1a894a46dc Author: Valay Dave Date: Fri Jun 3 16:42:24 2022 +0000 Removed Sensor metadata extraction. commit d8e6ec044ef8c285d7fbe1b83c10c07d51c063e3 Author: Valay Dave Date: Fri Jun 3 16:30:34 2022 +0000 porting savin's comments - next changes : addressing comments. commit 3f2353a647e53bc240e28792769c42a71ea8f8c9 Merge: d370ffb c1ff469 Author: Valay Dave Date: Thu Jul 28 23:52:16 2022 +0000 Merge branch 'master' into airflow commit d370ffb248411ad4675f9d55de709dbd75d3806e Merge: a82f144 e4eb751 Author: Valay Dave Date: Thu Jul 14 19:38:48 2022 +0000 Merge branch 'master' into airflow commit a82f1447b414171fc5611758cb6c12fc692f55f9 Merge: bdb1f0d 6f097e3 Author: Valay Dave Date: Wed Jul 13 00:35:49 2022 +0000 Merge branch 'master' into airflow commit bdb1f0dd248d01318d4a493c75b6f54248c7be64 Merge: 8511215 f9a4968 Author: Valay Dave Date: Wed Jun 29 18:44:51 2022 +0000 Merge branch 'master' into airflow commit 85112158cd352cb7de95a2262c011c6f43d98283 Author: Valay Dave Date: Tue Jun 21 02:53:11 2022 +0000 Bug fix from master merge. commit 90c06f12bb14eda51c6a641766c5f67d6763abaa Merge: 0fb73af 6441ed5 Author: Valay Dave Date: Mon Jun 20 21:20:20 2022 +0000 Merge branch 'master' into airflow commit 0fb73af8af9fca2875261e3bdd305a0daab1b229 Author: Valay Dave Date: Sat Jun 4 00:53:10 2022 +0000 squashing bugs after changes from master. commit 09c6ba779f6b1b6ef1d7ed5b1bb2be70ec76575d Merge: 7bdf662 ffff49b Author: Valay Dave Date: Sat Jun 4 00:20:38 2022 +0000 Merge branch 'master' into af-mmr commit 7bdf662e14966b929b8369c65d5bd3bbe5741937 Author: Valay Dave Date: Mon May 16 17:42:38 2022 -0700 Airflow sensor api (#3) * Fixed run-id setting - Change gaurentees that multiple dags triggered at same moment have unique run-id * added allow multiple in `Decorator` class * Airflow sensor integration. >> support added for : - ExternalTaskSensor - S3KeySensor - SqlSensor >> sensors allow multiple decorators >> sensors accept those arguments which are supported by airflow * Added `@airflow_schedule_interval` decorator * Fixing bug run-id related in env variable setting. commit 2604a29452e794354cf4c612f48bae7cf45856ee Author: Valay Dave Date: Thu Apr 21 18:26:59 2022 +0000 Addressed comments. commit 584e88b679fed7d6eec8ce564bf3707359170568 Author: Valay Dave Date: Wed Apr 20 03:33:55 2022 +0000 fixed printing bug commit 169ac1535e5567149d94749ddaf70264e882d62c Author: Valay Dave Date: Wed Apr 20 03:30:59 2022 +0000 Option help bug fix. 
commit 6f8489bcc3bd715b65d8a8554a0f3932dc78c6f5 Author: Valay Dave Date: Wed Apr 20 03:25:54 2022 +0000 variable renamemetaflow_specific_args commit 0c779abcd1d9574878da6de8183461b53e0da366 Merge: d299b13 5a61508 Author: Valay Dave Date: Wed Apr 20 03:23:10 2022 +0000 Merge branch 'airflow-tests' into airflow commit 5a61508e61583b567ef8d3fea04e049d74a6d973 Author: Valay Dave Date: Wed Apr 20 03:22:54 2022 +0000 Removing un-used code / resolved-todos. commit d030830f2543f489a1c4ebd17da1b47942f041d6 Author: Valay Dave Date: Wed Apr 20 03:06:03 2022 +0000 ran black, commit 2d1fc06e41cbe45ccfd46e03bc87b09c7a78da45 Merge: f2cb319 7921d13 Author: Valay Dave Date: Wed Apr 20 03:04:19 2022 +0000 Merge branch 'master' into airflow-tests commit d299b13ce38d027ab27ce23c9bbcc0f43b222cfa Merge: f2cb319 7921d13 Author: Valay Dave Date: Wed Apr 20 03:02:37 2022 +0000 Merge branch 'master' into airflow commit f2cb3197725f11520da0d49cbeef8de215c243eb Author: Valay Dave Date: Wed Apr 20 02:54:03 2022 +0000 reverting change. commit 05b9db9cf0fe8b40873b2b74e203b4fc82e7fea4 Author: Valay Dave Date: Wed Apr 20 02:47:41 2022 +0000 3 changes: - Removing s3 dep - remove uesless import - added `deployed_on` in dag file template commit c6afba95f5ec05acf7f33fd3228cffd784556e3b Author: Valay Dave Date: Fri Apr 15 22:50:52 2022 +0000 Fixed passing secrets with kubernetes. commit c3ce7e9faa5f7a23d309e2f66f778dbca85df22a Author: Valay Dave Date: Fri Apr 15 22:04:22 2022 +0000 Refactored code . - removed compute/k8s.py - Moved k8s code to airflow_compiler.py - ran isort to airflow_compiler.py commit d1c343dbbffbddbebd2aeda26d6846e595144e0b Author: Valay Dave Date: Fri Apr 15 18:02:25 2022 +0000 Added validations about: - un-supported decorators - foreach Changed where validations are done to not save the package. commit 7b19f8e66e278c75d836daf6a1c7ed2c607417ce Author: Valay Dave Date: Fri Apr 15 03:34:26 2022 +0000 Fixing mf log related bug - No double logging on metaflow. commit 4d1f6bf9bb32868c949d8c103c8fe44ea41b3f13 Author: Valay Dave Date: Thu Apr 14 03:10:51 2022 +0000 Removed usless code WRT project decorator. commit 5ad9a3949e351b0ac13f11df13446953932e8ffc Author: Valay Dave Date: Thu Apr 14 03:03:19 2022 +0000 Remove readme. commit 60cb6a79404efe2bcf9bf9a118a68f0b98c7d771 Author: Valay Dave Date: Thu Apr 14 03:02:38 2022 +0000 Made file path required arguement. commit 9f0dc1b2e01ee04b05620630f3a0ec04fe873a31 Author: Valay Dave Date: Thu Apr 14 03:01:07 2022 +0000 changed `--is-active`->`--is-paused-upon-creation` - dags are active by default. commit 5b98f937a62ee74de8aed8b0efde5045a28f068b Author: Valay Dave Date: Thu Apr 14 02:55:46 2022 +0000 shortened length of run-id and task-id hashes. commit e53426eaa4b156e8bd70ae7510c2e7c66745d101 Author: Valay Dave Date: Thu Apr 14 02:41:32 2022 +0000 Removing un-used args. commit 72cbbfc7424f9be415c22d9144b16a0953f15295 Author: Valay Dave Date: Thu Apr 14 02:39:59 2022 +0000 Moved exceptions to airflow compiler commit b2970ddaa86c393c8abb7f203f6507c386ecbe00 Author: Valay Dave Date: Thu Apr 14 02:33:02 2022 +0000 Changes based on PR comments: - removed airflow xcom push file , moved to decorator code - removed prefix configuration - nit fixes. commit 9e622bac5a75eb9e7a6594d8fa0e47f076634b44 Author: Valay Dave Date: Mon Apr 11 20:39:00 2022 +0000 Removing un-used code paths + code cleanup commit 7425f62cff2c9128eea785223ddeb40fa2d8f503 Author: Valay Dave Date: Mon Apr 11 19:45:04 2022 +0000 Fixing bug fix in schedule. 
commit eb775cbadd1d2d2c90f160a95a0f42c8ff0d7f4c Author: Valay Dave Date: Sun Apr 10 02:52:59 2022 +0000 Bug fixes WRT Kubernetes secrets + k8s deployments. - Fixing some error messages. - Added some comments. commit 04c92b92c312a4789d3c1e156f61ef57b08dba9f Author: Valay Dave Date: Sun Apr 10 01:20:53 2022 +0000 Added secrets support. commit 4a0a85dff77327640233767e567aee2b379ac13e Author: Valay Dave Date: Sun Apr 10 00:11:46 2022 +0000 Bug fix. commit af91099c0a30c26b58d58696a3ef697ec49a8503 Author: Valay Dave Date: Sun Apr 10 00:03:34 2022 +0000 bug fix. commit c17f04a253dfe6118e2779db79da9669aa2fcef2 Author: Valay Dave Date: Sat Apr 9 23:55:41 2022 +0000 Bug fix in active defaults. commit 0d372361297857076df6af235d1de7005ac1544a Author: Valay Dave Date: Sat Apr 9 23:54:02 2022 +0000 @project, @schedule, default active dag support. - Added a flag to allow setting dag as active on creation - Airflow compatible schedule interval - Project name fixes. commit 5c97b15cb11b5e8279befc5b14c239463750e9b7 Author: Valay Dave Date: Thu Apr 7 21:15:18 2022 +0000 Max workers and worker pool support. commit 9c973f2f44c3cb3a98e3e63f6e4dcef898bc8bf2 Author: Valay Dave Date: Thu Apr 7 19:34:33 2022 +0000 Adding exceptions for missing features. commit 2a946e2f083a34b4b6ed84c70aebf96b084ee8a2 Author: Valay Dave Date: Mon Mar 28 19:34:11 2022 +0000 2 changes : - removed hacky line - added support to directly throw dags in s3. commit e0772ec1bad473482c6fd19f8c5e8b9845303c0a Author: Valay Dave Date: Wed Mar 23 22:38:20 2022 +0000 fixing bugs in service account setting commit 874b94aeeabc664f12551864eff9d8fdc24dc37b Author: Valay Dave Date: Sun Mar 20 23:49:15 2022 +0000 Added support for Branching with Airflow - remove `next` function in `AirflowTask` - `AirflowTask`s has no knowledge of next tasks. - removed todos + added some todos - Graph construction on airflow side using graph_structure datastructure. - graph_structure comes from`FlowGraph.output_steps()[1]` commit 8e9f649bd8c51171c38a1e5af70a44a85e7009ca Author: Valay Dave Date: Sun Mar 20 02:33:04 2022 +0000 Added hacky line commit fd5db04cf0a81b14efda5eaf40cd9227e2bac0d3 Author: Valay Dave Date: Sun Mar 20 02:06:38 2022 +0000 Removed hacky line. commit 5b23eb7d8446bef71246d853b11edafa93c6ef95 Author: Valay Dave Date: Sun Mar 20 01:44:57 2022 +0000 Added support for Parameters. - Supporting int, str, bool, float, JSONType commit c9378e9b284657357ad2997f2b492bc2f4aaefac Author: Valay Dave Date: Sun Mar 20 00:14:10 2022 +0000 Removed todos + added some validation logic. commit 7250a44e1dea1da3464f6f71d0c5188bd314275a Author: Valay Dave Date: Sat Mar 19 23:45:15 2022 +0000 Fixing logs related change from master. commit d125978619ab666dcf96db330acdca40f41b7114 Merge: 8cdac53 7e210a2 Author: Valay Dave Date: Sat Mar 19 23:42:48 2022 +0000 Merge branch 'master' into aft-mm commit 8cdac53dd32648455e36955badb8e0ef7b95a2b3 Author: Valay Dave Date: Sat Mar 19 23:36:47 2022 +0000 making changes sync with master commit 5a93d9f5198c360b2a84ab13a86496986850953c Author: Valay Dave Date: Sat Mar 19 23:29:47 2022 +0000 Fixed bug when using catch + retry commit 62bc8dff68a6171b3b4222075a8e8ac109f65b4c Author: Valay Dave Date: Sat Mar 19 22:58:37 2022 +0000 Changed retry setting. 
commit 563a20036a2dfcc48101f680f29d4917d53aa247 Author: Valay Dave Date: Sat Mar 19 22:42:57 2022 +0000 Fixed setting `task_id` : - switch task-id from airflow job is to hash to "runid/stepname" - refactor xcom setting variables - added comments commit e2a1e502221dc603385263c82e2c068b9f055188 Author: Valay Dave Date: Sat Mar 19 17:51:59 2022 +0000 setting retry logic. commit a697b56052210c8f009b68772c902bbf77713202 Author: Valay Dave Date: Thu Mar 17 01:02:11 2022 +0000 Nit fix. commit 68f13beb17c7e73c0dddc142ef2418675a506439 Author: Valay Dave Date: Wed Mar 16 20:46:19 2022 +0000 Added @schedule support + readme commit 57bdde54f9ad2c8fe5513dbdb9fd02394664e234 Author: Valay Dave Date: Tue Mar 15 19:47:06 2022 +0000 Fixed setting run-id / task-id to labels in k8s - Fixed setting run-id has from cli macro - added hashing macro to ensure that jinja template set the correct run-id to k8s labels - commit 3d6c31917297d0be5f9915b13680fc415ddb4421 Author: Valay Dave Date: Tue Mar 15 05:39:04 2022 +0000 Got linear workflows working on airflow. - Still not feature complete as lots of args are still unfilled / lots of unknows - minor tweek in eks to ensure airflow is k8s compatible. - passing state around via xcom-push - HACK : AWS keys are passed in a shady way. : Reverse this soon. commit db074b8012f76d9d85225a4ceddb2cde8fefa0f4 Author: Valay Dave Date: Fri Mar 11 12:34:33 2022 -0800 Tweeks commit a9f0468c4721a2017f1b26eb8edcdd80aaa57203 Author: Valay Dave Date: Tue Mar 1 17:14:47 2022 -0800 some changes based on savin's comments. - Added changes to task datastore for different reason : (todo) Decouple these - Added comments to SFN for reference. - Airflow DAG is no longer dependent on metaflow commit f32d089cd3865927bc7510f24ba3418d859410b6 Author: Valay Dave Date: Wed Feb 23 00:54:17 2022 -0800 First version of dynamic dag compiler. - Not completely finished code - Creates generic .py file a JSON that is parsed to create Airflow DAG. - Currently only boiler plate to make a linear dag but doesn't execute anything. - Unfinished code. commit d2def665a86d6a6622d6076882c1c2d54044e773 Author: Valay Dave Date: Sat Feb 19 14:01:47 2022 -0800 more tweeks. commit b176311f166788cc3dfc93354a0c5045a4e6a3d4 Author: Valay Dave Date: Thu Feb 17 09:04:29 2022 -0800 commit 0 - unfinished code. * Making version compatibility changes. - Minimum support version to 2.2.0 * Task-id macro related logic refactor - Done for better version support. * bug fix: param related task-id setting. * applied black * Reverting `decorators.py` to master * Move sys.path insert earlier in s3op.py (#1098) * Update setup.py (#1099) * Fix for env_escape bug when importing local packages (#1100) * Bump to 2.7.5 (#1102) * Fix another issue with the escape hatch and paths (#1105) * Bump to 2.7.6 (#1106) * add a flag to overwrite config when running metaflow configure sandbox (#1103) * card dev docs tiny fix. (#1108) * Fix an issue with get_cards not respecting a Task's ds-root (#1111) get_cards would not always respect a Task's ds-root leading to cases where a Task has cards but they cannot be accessed because of an invalid path. 
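  Editorial aside on the s3op sys.path changes above (#1095 and #1098): the underlying idea is to defer the path manipulation to script execution so that merely importing the module never pollutes sys.path. A minimal sketch, with an illustrative (not actual) path computation:

  ```python
  # Illustrative sketch of the #1095/#1098 idea, not the actual s3op.py:
  # mutate sys.path only when run as a script, never on plain import.
  import os
  import sys


  def main():
      # ... S3 operation logic would live here ...
      pass


  if __name__ == "__main__":
      # Only the standalone script invocation needs the parent package on sys.path.
      sys.path.insert(0, os.path.dirname(os.path.dirname(os.path.abspath(__file__))))
      main()
  ```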
* Adding support for Azure Blob Storage as a datastore (#1091)
  * Azure Storage
  * fix BrokenProcessPool import for older pythons
  * Fix batch
  * add azure configure; cannot validate storage account url after all (catch-22 in configure)
  * rename storage account url to AZURE_STORAGE_BLOB_SERVICE_ENDPOINT
  * fix indent
  * clean up kubernetes config cli, add secrets
  * remove signature
  * Fix issue with get_pinned_conda_libs and metaflow extensions
  * clean up includefile a bit, remove from_env (not used)
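  Editorial aside on the Airflow squash notes above: the `resources` vs `container_resources` selection for `KubernetesPodOperator` can be sketched as below. The helper name and the use of `packaging` are illustrative assumptions, not Metaflow's actual Airflow compiler code:

  ```python
  # Illustrative sketch only: choose the KubernetesPodOperator kwarg based on the
  # cncf.kubernetes provider version (4.2.0 renamed `resources` to `container_resources`).
  from packaging.version import Version


  def pod_operator_resources_kwarg(provider_version: str) -> str:
      """Hypothetical helper: return the kwarg name accepted by KubernetesPodOperator."""
      if Version(provider_version) >= Version("4.2.0"):
          return "container_resources"
      return "resources"


  # Example usage (resource_requirements built elsewhere):
  # kwargs = {pod_operator_resources_kwarg(installed_version): resource_requirements}
  ```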
commit a7e1ecdbf7b8b8d1cc21321cc8e196053f8305e4 Author: Valay Dave Date: Fri Jul 29 00:54:45 2022 +0000 Uncommented code for foreach support with k8s KubernetesPodOperator version 4.2.0 renamed `resources` to `container_resources` - Check : (https://github.com/apache/airflow/pull/24673) / - (https://github.com/apache/airflow/commit/45f4290712f5f779e57034f81dbaab5d77d5de85) This was done because `KubernetesPodOperator` didn't play nice with dynamic task mapping and they had to deprecate the `resources` argument. Hence the below codepath checks for the version of `KubernetesPodOperator` and then sets the argument. If the version < 4.2.0 then we set the argument as `resources`. If it is > 4.2.0 then we set the argument as `container_resources` The `resources` argument of KuberentesPodOperator is going to be deprecated soon in the future. So we will only use it for `KuberentesPodOperator` version < 4.2.0 The `resources` argument will also not work for foreach's. commit 2719f5d792ada91e3ae0af6f1a9a0c7d90f74660 Author: Valay Dave Date: Mon Jul 18 18:31:58 2022 +0000 nit fixes : - fixing comments. - refactor some variable/function names. commit 2079293fbba0d3d862476a7d67b36af8a3389342 Author: Valay Dave Date: Mon Jul 18 18:14:53 2022 +0000 change `token` to `production_token` commit 14aad5ff717418e4183a88fa84b2f5e5bb13927a Author: Valay Dave Date: Mon Jul 18 18:11:56 2022 +0000 Refactored import Airflow Sensors. commit b1472d5f7a629024ca45e8b83700400d02a4d455 Author: Valay Dave Date: Mon Jul 18 18:08:41 2022 +0000 new comment on `startup_timeout_seconds` env var. commit 6d81b758e8f06911258d26f790029f557488a0d7 Author: Valay Dave Date: Mon Jul 18 18:06:09 2022 +0000 Removing traces of `@airflow_schedule_interval` commit 0673db7475b22f3ce17c2680fc0a7c4271b5c946 Author: Valay Dave Date: Thu Jul 14 12:43:08 2022 -0700 Foreach polish (valayDave/metaflow#62) * Removing unused imports * Added validation logic for airflow version numbers with foreaches * Removed `airflow_schedule_interval` decorator. * Added production/deployment token related changes - Uses s3 as a backend to store the production token - Token used for avoiding nameclashes - token stored via `FlowDatastore` * Graph type validation for airflow foreachs - Airflow foreachs only support single node fanout. - validation invalidates graphs with nested foreachs * Added configuration about startup_timeout. * Added final todo on `resources` argument of k8sOp - added a commented code block - it needs to be uncommented when airflow releasese the patch for the op - Code seems feature complete keeping aside airflow patch commit 4b2dd1211fe2daeb76e29e4084f21e96b10cdae9 Author: Valay Dave Date: Thu Jul 7 19:33:07 2022 +0000 Removed retries from user-defaults. commit 0e87a97fea15ba3aaa6d4228b141bd796b767c43 Author: Valay Dave Date: Wed Jul 6 16:29:33 2022 +0000 updated pod startup time commit fce2bd263f368dbb78a34ac71f64e13c89277222 Author: Valay Dave Date: Wed Jun 29 18:44:11 2022 +0000 Adding default 1 retry for any airflow worker. commit 5ef6bbcde51b1f4923a192291ed0e07d07ec7321 Author: Valay Dave Date: Mon Jun 27 01:22:42 2022 +0000 Airflow Foreach Integration - Simple one node foreach-join support as gaurenteed by airflow - Fixed env variable setting issue - introduced MetaflowKuberentesOperator - Created a new operator to allow smootness in plumbing xcom values - Some todos commit d319fa915c558d82f1d127736ce34d3ae0da521d Author: Valay Dave Date: Fri Jun 24 21:12:09 2022 +0000 simplifying run-id macro. 
commit 0ffc813b1c4e6ba0103be51520f42d191371741a Author: Valay Dave Date: Fri Jun 24 11:51:42 2022 -0700 Refactored parameter macro settings. (valayDave/metaflow#60) commit a3a495077f34183d706c0edbe56d6213766bf5f6 Author: Valay Dave Date: Fri Jun 24 02:05:57 2022 +0000 added comment on need for `start_date` commit a3147bee08a260aa78ab2fb14c6232bfab2c2dec Author: Valay Dave Date: Tue Jun 21 06:03:56 2022 +0000 Refactored an `id_creator` method. commit 04d7f207ef2dae0ce2da2ec37163ac871f4517bc Author: Valay Dave Date: Tue Jun 21 05:52:05 2022 +0000 refactor : -`RUN_ID_LEN` to `RUN_HASH_ID_LEN` - `TASK_ID_LEN` to `TASK_ID_HASH_LEN` commit cde4605cd57ad9214f5a6afd7f58fe4c377e09e2 Author: Valay Dave Date: Tue Jun 21 05:48:55 2022 +0000 refactored an error string commit 11458188b6c59d044fca0dd2d1f5024ec84f6488 Author: Valay Dave Date: Mon Jun 20 22:42:36 2022 -0700 addressing savins comments. (#59) - Added many adhoc changes based for some comments. - Integrated secrets and `KUBERNETES_SECRETS` - cleaned up parameter setting - cleaned up setting of scheduling interval - renamed `AIRFLOW_TASK_ID_TEMPLATE_VALUE` to `AIRFLOW_TASK_ID` - renamed `AirflowSensorDecorator.compile` to `AirflowSensorDecorator.validate` - Checking if dagfile and flow file are same. - fixing variable names. - checking out `kubernetes_decorator.py` from master (6441ed5) - bug fixing secret setting in airflow. - simplified parameter type parsing logic - refactoring airflow argument parsing code. commit 83b20a7c6a13b3aedb7e603e139f07f0ef2fb646 Author: Valay Dave Date: Mon Jun 13 14:02:57 2022 -0700 Addressing Final comments. (#57) - Added dag-run timeout. - airflow related scheduling checks in decorator. - Auto naming sensors if no name is provided - Annotations to k8s operators - fix: argument serialization for `DAG` arguments (method names refactored like `to_dict` became `serialize`) - annotation bug fix - setting`workflow-timeout` for only scheduled dags commit 4931f9c84e6a1d20fc3ecb41cf138b72e5dee629 Author: Valay Dave Date: Mon Jun 6 04:50:49 2022 +0000 k8s bug fix commit 200ae8ed4a00028f094281f73a939e7a4dcdf83a Author: Valay Dave Date: Mon Jun 6 04:39:50 2022 +0000 removed un-used function commit 70e285e9a7cfbec71fc293508a62c96f33562a01 Author: Valay Dave Date: Mon Jun 6 04:38:37 2022 +0000 Removed unused `sanitize_label` function commit 84fc622d8b11e718a849b2e2d91ceb3ea69917e6 Author: Valay Dave Date: Mon Jun 6 04:37:34 2022 +0000 GPU support added + container naming same as argo commit c92280d8796ec12b4ff17fa2ff3c736c7244f39c Author: Valay Dave Date: Mon Jun 6 04:25:17 2022 +0000 Refactored sensors to different files + bug fix - bug caused due `util.compress_list`. - The function doesn't play nice with strings with variety of characters. - Ensured that exceptions are handled appropriately. - Made new file for each sensor under `airflow.sensors` module. commit b72a1dcf0dbbbcb814581d92738fd27ec31ef673 Author: Valay Dave Date: Sat Jun 4 01:41:49 2022 +0000 ran black. commit 558c82f65b383ed0d61ded6bc80326471e284550 Author: Valay Dave Date: Fri Jun 3 18:32:48 2022 -0700 Moving information from airflow_utils to compiler (#56) - commenting todos to organize unfinished changes. - some environment variables set via`V1EnvVar` - `client.V1ObjectFieldSelector` mapped env vars were not working in json form - Moving k8s operator import into its own function. - env vars moved. commit 9bb5f638792a671164ec95891e97f599e9a3385f Author: Valay Dave Date: Fri Jun 3 18:06:03 2022 +0000 added mising Run-id prefixes to variables. 
- merged `hash` and `dash_connect` filters. commit 37b5e6a9d8ca93cc91244c8d77c7d4f61280ba59 Author: Valay Dave Date: Fri Jun 3 18:00:22 2022 +0000 nit fix : variable name change. commit 660756f952ebd92ba1e26d7f908b81036c31ff10 Author: Valay Dave Date: Fri Jun 3 17:58:34 2022 +0000 nit fixes to dag.py's templating variables. commit 1202f5bc92f76df52b5957f11c8574cadfa62196 Author: Valay Dave Date: Fri Jun 3 17:56:53 2022 +0000 Fixed defaults passing - Addressed comments for airflow.py commit b9387dd428c1a37f9a3bfe2c72cab475da708c02 Author: Valay Dave Date: Fri Jun 3 17:52:24 2022 +0000 Following Changes: - Refactors setting scheduling interval - refactor dag file creating function - refactored is_active to is_paused_upon_creation - removed catchup commit 054e3f389febc6c447494a1dedb01228f5f5650f Author: Valay Dave Date: Fri Jun 3 17:33:25 2022 +0000 Multiple Changes based on comments: 1. refactored `create_k8s_args` into _to_job 2. Addressed comments for snake casing 3. refactored `attrs` for simplicity. 4. refactored `metaflow_parameters` to `parameters`. 5. Refactored setting of `input_paths` commit d481b2fca7914b6b657a69af407cfe1a894a46dc Author: Valay Dave Date: Fri Jun 3 16:42:24 2022 +0000 Removed Sensor metadata extraction. commit d8e6ec044ef8c285d7fbe1b83c10c07d51c063e3 Author: Valay Dave Date: Fri Jun 3 16:30:34 2022 +0000 porting savin's comments - next changes : addressing comments. commit 3f2353a647e53bc240e28792769c42a71ea8f8c9 Merge: d370ffb c1ff469 Author: Valay Dave Date: Thu Jul 28 23:52:16 2022 +0000 Merge branch 'master' into airflow commit d370ffb248411ad4675f9d55de709dbd75d3806e Merge: a82f144 e4eb751 Author: Valay Dave Date: Thu Jul 14 19:38:48 2022 +0000 Merge branch 'master' into airflow commit a82f1447b414171fc5611758cb6c12fc692f55f9 Merge: bdb1f0d 6f097e3 Author: Valay Dave Date: Wed Jul 13 00:35:49 2022 +0000 Merge branch 'master' into airflow commit bdb1f0dd248d01318d4a493c75b6f54248c7be64 Merge: 8511215 f9a4968 Author: Valay Dave Date: Wed Jun 29 18:44:51 2022 +0000 Merge branch 'master' into airflow commit 85112158cd352cb7de95a2262c011c6f43d98283 Author: Valay Dave Date: Tue Jun 21 02:53:11 2022 +0000 Bug fix from master merge. commit 90c06f12bb14eda51c6a641766c5f67d6763abaa Merge: 0fb73af 6441ed5 Author: Valay Dave Date: Mon Jun 20 21:20:20 2022 +0000 Merge branch 'master' into airflow commit 0fb73af8af9fca2875261e3bdd305a0daab1b229 Author: Valay Dave Date: Sat Jun 4 00:53:10 2022 +0000 squashing bugs after changes from master. commit 09c6ba779f6b1b6ef1d7ed5b1bb2be70ec76575d Merge: 7bdf662 ffff49b Author: Valay Dave Date: Sat Jun 4 00:20:38 2022 +0000 Merge branch 'master' into af-mmr commit 7bdf662e14966b929b8369c65d5bd3bbe5741937 Author: Valay Dave Date: Mon May 16 17:42:38 2022 -0700 Airflow sensor api (#3) * Fixed run-id setting - Change gaurentees that multiple dags triggered at same moment have unique run-id * added allow multiple in `Decorator` class * Airflow sensor integration. >> support added for : - ExternalTaskSensor - S3KeySensor - SqlSensor >> sensors allow multiple decorators >> sensors accept those arguments which are supported by airflow * Added `@airflow_schedule_interval` decorator * Fixing bug run-id related in env variable setting. commit 2604a29452e794354cf4c612f48bae7cf45856ee Author: Valay Dave Date: Thu Apr 21 18:26:59 2022 +0000 Addressed comments. 
commit 584e88b679fed7d6eec8ce564bf3707359170568 Author: Valay Dave Date: Wed Apr 20 03:33:55 2022 +0000 fixed printing bug commit 169ac1535e5567149d94749ddaf70264e882d62c Author: Valay Dave Date: Wed Apr 20 03:30:59 2022 +0000 Option help bug fix. commit 6f8489bcc3bd715b65d8a8554a0f3932dc78c6f5 Author: Valay Dave Date: Wed Apr 20 03:25:54 2022 +0000 variable renamemetaflow_specific_args commit 0c779abcd1d9574878da6de8183461b53e0da366 Merge: d299b13 5a61508 Author: Valay Dave Date: Wed Apr 20 03:23:10 2022 +0000 Merge branch 'airflow-tests' into airflow commit 5a61508e61583b567ef8d3fea04e049d74a6d973 Author: Valay Dave Date: Wed Apr 20 03:22:54 2022 +0000 Removing un-used code / resolved-todos. commit d030830f2543f489a1c4ebd17da1b47942f041d6 Author: Valay Dave Date: Wed Apr 20 03:06:03 2022 +0000 ran black, commit 2d1fc06e41cbe45ccfd46e03bc87b09c7a78da45 Merge: f2cb319 7921d13 Author: Valay Dave Date: Wed Apr 20 03:04:19 2022 +0000 Merge branch 'master' into airflow-tests commit d299b13ce38d027ab27ce23c9bbcc0f43b222cfa Merge: f2cb319 7921d13 Author: Valay Dave Date: Wed Apr 20 03:02:37 2022 +0000 Merge branch 'master' into airflow commit f2cb3197725f11520da0d49cbeef8de215c243eb Author: Valay Dave Date: Wed Apr 20 02:54:03 2022 +0000 reverting change. commit 05b9db9cf0fe8b40873b2b74e203b4fc82e7fea4 Author: Valay Dave Date: Wed Apr 20 02:47:41 2022 +0000 3 changes: - Removing s3 dep - remove uesless import - added `deployed_on` in dag file template commit c6afba95f5ec05acf7f33fd3228cffd784556e3b Author: Valay Dave Date: Fri Apr 15 22:50:52 2022 +0000 Fixed passing secrets with kubernetes. commit c3ce7e9faa5f7a23d309e2f66f778dbca85df22a Author: Valay Dave Date: Fri Apr 15 22:04:22 2022 +0000 Refactored code . - removed compute/k8s.py - Moved k8s code to airflow_compiler.py - ran isort to airflow_compiler.py commit d1c343dbbffbddbebd2aeda26d6846e595144e0b Author: Valay Dave Date: Fri Apr 15 18:02:25 2022 +0000 Added validations about: - un-supported decorators - foreach Changed where validations are done to not save the package. commit 7b19f8e66e278c75d836daf6a1c7ed2c607417ce Author: Valay Dave Date: Fri Apr 15 03:34:26 2022 +0000 Fixing mf log related bug - No double logging on metaflow. commit 4d1f6bf9bb32868c949d8c103c8fe44ea41b3f13 Author: Valay Dave Date: Thu Apr 14 03:10:51 2022 +0000 Removed usless code WRT project decorator. commit 5ad9a3949e351b0ac13f11df13446953932e8ffc Author: Valay Dave Date: Thu Apr 14 03:03:19 2022 +0000 Remove readme. commit 60cb6a79404efe2bcf9bf9a118a68f0b98c7d771 Author: Valay Dave Date: Thu Apr 14 03:02:38 2022 +0000 Made file path required arguement. commit 9f0dc1b2e01ee04b05620630f3a0ec04fe873a31 Author: Valay Dave Date: Thu Apr 14 03:01:07 2022 +0000 changed `--is-active`->`--is-paused-upon-creation` - dags are active by default. commit 5b98f937a62ee74de8aed8b0efde5045a28f068b Author: Valay Dave Date: Thu Apr 14 02:55:46 2022 +0000 shortened length of run-id and task-id hashes. commit e53426eaa4b156e8bd70ae7510c2e7c66745d101 Author: Valay Dave Date: Thu Apr 14 02:41:32 2022 +0000 Removing un-used args. commit 72cbbfc7424f9be415c22d9144b16a0953f15295 Author: Valay Dave Date: Thu Apr 14 02:39:59 2022 +0000 Moved exceptions to airflow compiler commit b2970ddaa86c393c8abb7f203f6507c386ecbe00 Author: Valay Dave Date: Thu Apr 14 02:33:02 2022 +0000 Changes based on PR comments: - removed airflow xcom push file , moved to decorator code - removed prefix configuration - nit fixes. 
commit 9e622bac5a75eb9e7a6594d8fa0e47f076634b44 Author: Valay Dave Date: Mon Apr 11 20:39:00 2022 +0000 Removing unused code paths + code cleanup commit 7425f62cff2c9128eea785223ddeb40fa2d8f503 Author: Valay Dave Date: Mon Apr 11 19:45:04 2022 +0000 Fixing bug in schedule. commit eb775cbadd1d2d2c90f160a95a0f42c8ff0d7f4c Author: Valay Dave Date: Sun Apr 10 02:52:59 2022 +0000 Bug fixes WRT Kubernetes secrets + k8s deployments. - Fixing some error messages. - Added some comments. commit 04c92b92c312a4789d3c1e156f61ef57b08dba9f Author: Valay Dave Date: Sun Apr 10 01:20:53 2022 +0000 Added secrets support. commit 4a0a85dff77327640233767e567aee2b379ac13e Author: Valay Dave Date: Sun Apr 10 00:11:46 2022 +0000 Bug fix. commit af91099c0a30c26b58d58696a3ef697ec49a8503 Author: Valay Dave Date: Sun Apr 10 00:03:34 2022 +0000 bug fix. commit c17f04a253dfe6118e2779db79da9669aa2fcef2 Author: Valay Dave Date: Sat Apr 9 23:55:41 2022 +0000 Bug fix in active defaults. commit 0d372361297857076df6af235d1de7005ac1544a Author: Valay Dave Date: Sat Apr 9 23:54:02 2022 +0000 @project, @schedule, default active dag support. - Added a flag to allow setting dag as active on creation - Airflow compatible schedule interval - Project name fixes. commit 5c97b15cb11b5e8279befc5b14c239463750e9b7 Author: Valay Dave Date: Thu Apr 7 21:15:18 2022 +0000 Max workers and worker pool support. commit 9c973f2f44c3cb3a98e3e63f6e4dcef898bc8bf2 Author: Valay Dave Date: Thu Apr 7 19:34:33 2022 +0000 Adding exceptions for missing features. commit 2a946e2f083a34b4b6ed84c70aebf96b084ee8a2 Author: Valay Dave Date: Mon Mar 28 19:34:11 2022 +0000 2 changes: - removed hacky line - added support to directly push dags to S3. commit e0772ec1bad473482c6fd19f8c5e8b9845303c0a Author: Valay Dave Date: Wed Mar 23 22:38:20 2022 +0000 fixing bugs in service account setting commit 874b94aeeabc664f12551864eff9d8fdc24dc37b Author: Valay Dave Date: Sun Mar 20 23:49:15 2022 +0000 Added support for Branching with Airflow - remove `next` function in `AirflowTask` - `AirflowTask`s have no knowledge of next tasks. - removed todos + added some todos - Graph construction on airflow side using the graph_structure data structure. - graph_structure comes from `FlowGraph.output_steps()[1]` commit 8e9f649bd8c51171c38a1e5af70a44a85e7009ca Author: Valay Dave Date: Sun Mar 20 02:33:04 2022 +0000 Added hacky line commit fd5db04cf0a81b14efda5eaf40cd9227e2bac0d3 Author: Valay Dave Date: Sun Mar 20 02:06:38 2022 +0000 Removed hacky line. commit 5b23eb7d8446bef71246d853b11edafa93c6ef95 Author: Valay Dave Date: Sun Mar 20 01:44:57 2022 +0000 Added support for Parameters. - Supporting int, str, bool, float, JSONType commit c9378e9b284657357ad2997f2b492bc2f4aaefac Author: Valay Dave Date: Sun Mar 20 00:14:10 2022 +0000 Removed todos + added some validation logic. commit 7250a44e1dea1da3464f6f71d0c5188bd314275a Author: Valay Dave Date: Sat Mar 19 23:45:15 2022 +0000 Fixing logs-related change from master. commit d125978619ab666dcf96db330acdca40f41b7114 Merge: 8cdac53 7e210a2 Author: Valay Dave Date: Sat Mar 19 23:42:48 2022 +0000 Merge branch 'master' into aft-mm commit 8cdac53dd32648455e36955badb8e0ef7b95a2b3 Author: Valay Dave Date: Sat Mar 19 23:36:47 2022 +0000 making changes in sync with master commit 5a93d9f5198c360b2a84ab13a86496986850953c Author: Valay Dave Date: Sat Mar 19 23:29:47 2022 +0000 Fixed bug when using catch + retry commit 62bc8dff68a6171b3b4222075a8e8ac109f65b4c Author: Valay Dave Date: Sat Mar 19 22:58:37 2022 +0000 Changed retry setting.
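[Editor's note] The "Added support for Parameters" commit above lists the parameter types handled by the Airflow compiler (int, str, bool, float, JSONType). The snippet below is only a minimal, illustrative sketch of a flow declaring parameters of those types through the standard Metaflow Parameter API; the flow and parameter names are made up and this does not show the Airflow translation itself.

    from metaflow import FlowSpec, Parameter, JSONType, step


    class ParamDemoFlow(FlowSpec):
        # Illustrative parameters covering the types mentioned in the commit above.
        alpha = Parameter("alpha", type=float, default=0.5, help="A float parameter")
        rounds = Parameter("rounds", type=int, default=3, help="An int parameter")
        label = Parameter("label", type=str, default="demo", help="A str parameter")
        verbose = Parameter("verbose", default=False, help="A bool parameter (type inferred from the default)")
        config = Parameter("config", type=JSONType, default='{"lr": 0.01}', help="A JSON-typed parameter")

        @step
        def start(self):
            # Parameters are available as read-only artifacts on the flow instance.
            print(self.alpha, self.rounds, self.label, self.verbose, self.config)
            self.next(self.end)

        @step
        def end(self):
            pass


    if __name__ == "__main__":
        ParamDemoFlow()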
commit 563a20036a2dfcc48101f680f29d4917d53aa247 Author: Valay Dave Date: Sat Mar 19 22:42:57 2022 +0000 Fixed setting `task_id`: - switch task-id from the airflow job id to a hash of "runid/stepname" - refactor xcom setting variables - added comments commit e2a1e502221dc603385263c82e2c068b9f055188 Author: Valay Dave Date: Sat Mar 19 17:51:59 2022 +0000 setting retry logic. commit a697b56052210c8f009b68772c902bbf77713202 Author: Valay Dave Date: Thu Mar 17 01:02:11 2022 +0000 Nit fix. commit 68f13beb17c7e73c0dddc142ef2418675a506439 Author: Valay Dave Date: Wed Mar 16 20:46:19 2022 +0000 Added @schedule support + readme commit 57bdde54f9ad2c8fe5513dbdb9fd02394664e234 Author: Valay Dave Date: Tue Mar 15 19:47:06 2022 +0000 Fixed setting run-id / task-id to labels in k8s - Fixed setting run-id hash from cli macro - added hashing macro to ensure that the jinja template sets the correct run-id on k8s labels - commit 3d6c31917297d0be5f9915b13680fc415ddb4421 Author: Valay Dave Date: Tue Mar 15 05:39:04 2022 +0000 Got linear workflows working on airflow. - Still not feature complete as lots of args are still unfilled / lots of unknowns - minor tweak in eks to ensure airflow is k8s compatible. - passing state around via xcom-push - HACK: AWS keys are passed in a shady way. Reverse this soon. commit db074b8012f76d9d85225a4ceddb2cde8fefa0f4 Author: Valay Dave Date: Fri Mar 11 12:34:33 2022 -0800 Tweaks commit a9f0468c4721a2017f1b26eb8edcdd80aaa57203 Author: Valay Dave Date: Tue Mar 1 17:14:47 2022 -0800 some changes based on savin's comments. - Added changes to task datastore for a different reason: (todo) decouple these - Added comments to SFN for reference. - Airflow DAG is no longer dependent on metaflow commit f32d089cd3865927bc7510f24ba3418d859410b6 Author: Valay Dave Date: Wed Feb 23 00:54:17 2022 -0800 First version of dynamic dag compiler. - Not completely finished code - Creates a generic .py file and a JSON file that is parsed to create the Airflow DAG. - Currently only boilerplate to make a linear dag but doesn't execute anything. - Unfinished code. commit d2def665a86d6a6622d6076882c1c2d54044e773 Author: Valay Dave Date: Sat Feb 19 14:01:47 2022 -0800 more tweaks. commit b176311f166788cc3dfc93354a0c5045a4e6a3d4 Author: Valay Dave Date: Thu Feb 17 09:04:29 2022 -0800 commit 0 - unfinished code. * Making version compatibility changes. - Minimum supported version is 2.2.0 * Task-id macro related logic refactor - Done for better version support. * bug fix: param-related task-id setting.
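[Editor's note] The "Fixed setting `task_id`" commit above derives the Airflow-side task id from a hash of "runid/stepname", and a later commit shortens those hashes. The sketch below only illustrates that idea (a truncated, deterministic hash of the pathspec-like string); it is not the actual implementation in airflow_utils.py, and the hash function and length are assumptions.

    import hashlib


    def short_task_id(run_id, step_name, length=8):
        # Illustrative only: a short, deterministic id derived from "runid/stepname",
        # mirroring the scheme described in the commit message above.
        digest = hashlib.md5(("%s/%s" % (run_id, step_name)).encode("utf-8")).hexdigest()
        return digest[:length]


    if __name__ == "__main__":
        print(short_task_id("airflow-run-1", "start"))  # prints an 8-character hex id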
* applied black * Reverting `decorators.py` to master * Move sys.path insert earlier in s3op.py (#1098) * Update setup.py (#1099) * Fix for env_escape bug when importing local packages (#1100) * Black reformat * Bump to 2.7.5 (#1102) * Fix another issue with the escape hatch and paths (#1105) * Bump to 2.7.6 (#1106) * add a flag to overwrite config when running metaflow configure sandbox (#1103) * Azure Storage * fix BrokenProcessPool import for older pythons * Fix batch * add azure configure, cannot validate storage account url after all (catch-22 in configure) * rename storage account url to AZURE_STORAGE_BLOB_SERVICE_ENDPOINT * fix indent * clean up kubernetes config cli, add secrets * remove signature * Fix issue with get_pinned_conda_libs and metaflow extensions * clean up includefile a bit, remove from_env (not used) * Black reformat * Fix order of imports in __init__.py * More black * fix "configure azure" merge conflict Co-authored-by: Oleg Avdeev Co-authored-by: Ville Tuulos Co-authored-by: Romain Cledat Co-authored-by: Romain Co-authored-by: Valay Dave Co-authored-by: Savin Co-authored-by: Shashank Srikanth <108034001+hunsdiecker@users.noreply.github.com> * more robust resource type conversions for aws batch/sfn (#1118) * Update setup.py (#1122) * Support airflow with metaflow on azure (#1127) * Fix issue with S3 invocation for conda bootstrap (#1128) * Fix issue with S3 invocation for conda bootstrap * add comment * apply black * Bump minor version to 2.7.8 for release (#1129) * Fix issue with S3 URLs (#1130) * Patch release - 2.7.9 (#1131) * Card bug fix when task-ids are non-unique (#1126) * Card bug fix when task-ids are non-unique - When task-ids are non-unique, the cards don't create the expected directory paths - To fix the bug we introduce a `steps/` folder - The new changes will be able to read the older version's cards - The older version of clients won't be able to read the new version's cards * inferring path from folder structure * fix * Writing cards to both paths (steps/tasks) - Added thorough comment about context of change * added `_HACK_SKIP_CARD_DUALWRITE` config var - turns off double writing for cards. * introduced mf config var to skip double write. * Bump version to 2.7.10 to prepare for release (#1136) * Fix DeprecationWarning on invalid escape sequence (#1133) * fix docstring for MetaflowCode (#1134) * fix cpu value formatting for aws batch/sfn (#1140) * Update setup.py (#1141) * Make plugins.airflow.plumbing a well-formed module (#1148) * bump patch version for release (#1149) * Add `cmd` extension point to allow MF extensions to extend the `metaf… (#1143) * Add `cmd` extension point to allow MF extensions to extend the `metaflow` command Simply add a `cmd` directory in your extension and the usual __init__.py or mfextinit_*.py file containing a function called `get_cmd_clis` which returns a list of CLIs to add.
Example:

@click.group()
@click.pass_context
def cli(ctx):
    pass

@cli.group(help="My commands")
@click.pass_context
def foobar(ctx):
    pass

@cli.command(help="Overrides provided `status` command")
@click.pass_context
def status(ctx):
    print("I am overriding the usual status command")

@foobar.command(help="Some other command")
@click.pass_context
def baz(ctx):
    print("Hi!")

def get_cmd_clis():
    return [cli]

* Minor string change * Fix periodic messages printed at runtime (#1061) * Fix periodic messages printed at runtime * Addressed comments -- also rebased * Merged master branch + addressed comments * Pass datastore_type to validate_environment (#1152) * Remove message introduced in #1061 (#1151) * Remove message introduced in #1061 * More message tweak * Minor log msg fix (cannot subscript a set) (#1159) * Support `kubernetes_conn_id` in Airflow integration (#1153) * added `kubernetes_conn_id` * ran black * comment about config variable. * Use json to dump/load decorator specs. (#1144) * Use json to dump/load decorator specs. This cleaned up a few uses of the decospec where individual plugins were having to parse the decospec. Two other minor changes to improve error messages and allow for echoing without a newline * Fix to work with remote systems * Addressed comment of making it simpler for simple types (int, float, str). You no longer have to deal with the fun of quoting things for simple types * Removed testing code * argo use kubernetes client class (#1163) * Rewrite IncludeFile implementation (#1109) * Rewrite IncludeFile implementation This rewrite addresses several issues: - IncludeFile values were not properly returned in dump and via the client - Improves logic in create/trigger for step-functions for example - Properly consider the size of the included file when dumping artifacts - default value for IncludeFile can also now be a function The code has been cleaned up overall as well and many more comments added. * Fixed tests and removed debugging messages We can now check for the actual artifact value in the CLI checker which is what we now do. * Addressed comments * Fixing bug with include file for schedulers (#1154) * Minor nits * Simplifying parameter creation (#1155) * Add back Azure support (#1162) * Add back Azure support * add some comments explaining return_missing Co-authored-by: Valay Dave Co-authored-by: jackie-ob <104396218+jackie-ob@users.noreply.github.com> * Add options to make card generation faster in the presence of loggers/monitors (#1167) * Env escape improvements and bug fixes (#1166) * Env escape improvements and bug fixes - Properly handle the case of multiple `overrides` for different escaped libraries (previously, only the first override was considered) - Add datetime.timedelta to the list of simple types transferred - Allow the specification of override functions (and getattr/setattr) on additional proxied types. * More cleanup * Forgot a file * Load overrides as relative module even on server side * Terminate server if EOF reached * Allow figures in `Image.from_matplotlib` (#1147) * Allow figures in `Image.from_matplotlib` * nit fix (see the usage sketch below) * Bump for release (#1168) * fix pandas call bug (#1173) pandas is imported as pd, so pandas.DataFrame will fail * Metaflow pathspec in Airflow UI (#1119) * changes to allow metaflow pathspec in airflow ui - pathspec accessible in the airflow rendered template section of task instance * added runid to the list of rendered task strings * added more metadata about run in rendered fields * tiny refactor.
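[Editor's note] The "Allow figures in `Image.from_matplotlib` (#1147)" entry above extends the card Image component to accept matplotlib Figure objects. A minimal usage sketch, assuming matplotlib is installed in the step's execution environment; the flow and step names are illustrative, not part of this change.

    from metaflow import FlowSpec, step, card, current


    class PlotCardFlow(FlowSpec):

        @card
        @step
        def start(self):
            import matplotlib.pyplot as plt
            from metaflow.cards import Image

            # Build a figure and attach it to this step's card as an Image component.
            fig, ax = plt.subplots()
            ax.plot([1, 2, 3], [1, 4, 9])
            current.card.append(Image.from_matplotlib(fig))
            self.next(self.end)

        @step
        def end(self):
            pass


    if __name__ == "__main__":
        PlotCardFlow()

After a run, the generated card (including the rendered figure) can be inspected with the card CLI's view/get commands described elsewhere in this patch.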
* Allow the input paths to be passed via a file (#1181) * Check compatibility for R 4.2 (#1160) Test PR - DNR * issue 1040 fix: apply _sanitize to template names in Argo workflows (#1180) * issue 1040 fix: apply _sanitize to template names in Argo workflows * apply _sanitize to foreach step template names in Argo workflows * Bump version for release * Handle aborted Kubernetes workloads. (#1195) Log the message and the exit code. * Bump loader-utils from 3.2.0 to 3.2.1 in /metaflow/plugins/cards/ui (#1194) Bumps [loader-utils](https://github.com/webpack/loader-utils) from 3.2.0 to 3.2.1. - [Release notes](https://github.com/webpack/loader-utils/releases) - [Changelog](https://github.com/webpack/loader-utils/blob/master/CHANGELOG.md) - [Commits](https://github.com/webpack/loader-utils/compare/v3.2.0...v3.2.1) --- updated-dependencies: - dependency-name: loader-utils dependency-type: indirect ... Signed-off-by: dependabot[bot] Signed-off-by: dependabot[bot] Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> * Fix `._orig` access for submodules for MF extensions (#1174) * Fix `._orig` access for submodules for MF extensions This addresses an issue where submodules like `mymodule._orig.submodule` would not load properly. This will properly load things like `from .foofa import xyz` from a module or sub-module in the original location but currently does not work if the import path is absolute. * Add test that orig module is accessible (#1192) * Black format Co-authored-by: Tom Furmston * update black (#1199) * allow = in deco spec values (#1197) Co-authored-by: Adam Merberg * Typo repair and PEP8 cleanup (#1190) * Fixed typos, spelling, and grammar * Fixed several simple PEP warnings * Reverted changes to _vendor folder * Black formatting * Black linting * Reset _vendor folder after black formatting * Pin GH tests to Ubuntu 20.04 (#1201) * Pin GH tests to Ubuntu 20.04 Ubuntu latest, which maps to 22.04, doesn't support Py 3.5/3.6 on GH actions * Update test.yml * Set gpu resources correctly "--with kubernetes" (#1202) * Set gpu resources correctly "--with kubernetes" GPU resources were not getting propagated when using "--with kubernetes". Now, they are. Testing Done: - Reproduced the problem by creating a step with a @resources(gpu=1) decorator and running it "--with kubernetes". The pods were scheduled without GPUs. - Verified that after the fix, the pods were scheduled with GPUs (request and limit). * Incorporate review comments + fix black. * Clean up configuration variables (#1183) * Clean up configuration variables * Fix bug in argo/airflow; update default value of AZURE_STORAGE_WORKLOAD_TYPE * Forgot file * Clean up what values are passed down. Now, if a value is the default, it is not propagated by default since it can be reconstructed simply from the code. With this change, we basically propagate (by default) only those values that require some external knowledge (env vars, configuration file) to set. This reduces the set of values propagated to only the ones the user overrides.
* Remove propagate flag; change name of SERVICE_URL for batch * Fixed more instances of INTERNAL_SERVICE_URL; fixed comments * GCP datastore implementation (#1135) * GCP * card rendering to tolerate runs not tagged with metaflow_version * patch GCP impl to match new includefile impl * fix validate function bug * fix GS includefile merge * fix GS workload type config * Bump version; remove R tests (#1204) * Bump version; remove R tests * Remove more R tests * Deal with transient errors (like SlowDowns) more effectively for S3 (#1186) * Deal with transient errors (like SlowDowns) more effectively for S3 In the previous incarnation, SlowDown errors were treated as regular errors and everything was retried (making things worse). With this change, we continue retrying the operations that were unsuccessful (and only those). Also use a better internal boto retry policy (unless specified by the user). Modified the tests to inject failures to be able to test the functionality. Tests now have a failure injection rate of 0, 10, 50 or 90 percent. * Added more comments * Addressed comments; added more comments in code * Fix/move data files (#1206) * Move files -- no code change * Fixups for datatools and datastores * Pure moving of code around -- no code modification This commit can be safely ignored when reviewing code; it has no semantic change although, on its own, it clearly does not work. * Modify CLI commands to make functional * Bump version number * Fix regression causing CL tool to not work. (#1209) * Fix regression causing CL tool to not work. Update version for immediate release * Fix tox to keep GitHub actions happy * Bump qs from 6.5.2 to 6.5.3 in /metaflow/plugins/cards/ui (#1208) Bumps [qs](https://github.com/ljharb/qs) from 6.5.2 to 6.5.3. - [Release notes](https://github.com/ljharb/qs/releases) - [Changelog](https://github.com/ljharb/qs/blob/main/CHANGELOG.md) - [Commits](https://github.com/ljharb/qs/compare/v6.5.2...v6.5.3) --- updated-dependencies: - dependency-name: qs dependency-type: indirect ... 
Signed-off-by: dependabot[bot] Signed-off-by: dependabot[bot] Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> * Adds check for tutorials dir and flattens if necessary (#1211) * Fix bug with datastore backend instantiation (#1210) * Version bump * Reduce @environment arg length for step-functions create (#1215) * Add support for Kubernetes tolerations (#1207) * Add support for Kubernetes tolerations * Revert get_docker_registry import * Fix KUBERNETES_TOLERATIONS default value * Update toleration example * Remove KUBERNETES_NODE_SELECTOR in kubernetes_job.py * Fix black code style * Add param doc to KubernetesDecorator * Fix typo * Serialize tolerations in runtime_step_cli * Fix KUBERNETES_TOLERATIONS config * Fix node_selector env var in the kubernetes decorator * JSON loads KUBERNETES_TOLERATIONS in kubernetes_decorator init * Parse node_selector and tolerations in the decorator * Update comment * Validate tolerations object in kubernetes_decorator.py * Use hard coded tolerations attribute_map * Fix black code style * String formatting compatible with python 3.5 * Use V1Toleration.attribute_map to validate tolerations * Fix black lint * Improve error handling * Fix black lint * readded changes (#1205) Co-authored-by: Dan * Fix CVE-2007-4559 (tar.extractall) (#1213) * Fix CVE-2007-4559 (tar.extractall) See #1177 for more details * Address comment * Support .conda packages (#1221) * Support .conda packages * fix black issues Signed-off-by: dependabot[bot] Co-authored-by: Romain Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> Co-authored-by: Valay Dave Co-authored-by: bishax Co-authored-by: Savin Co-authored-by: Oleg Avdeev Co-authored-by: Kevin Gullikson Co-authored-by: kgullikson Co-authored-by: Yun Wu Co-authored-by: sam-watts <35522542+sam-watts@users.noreply.github.com> Co-authored-by: Kevin Smith Co-authored-by: jackie-ob <104396218+jackie-ob@users.noreply.github.com> Co-authored-by: Jackie Tung Co-authored-by: Romain Cledat Co-authored-by: Preetam Joshi Co-authored-by: Preetam Joshi Co-authored-by: mrfalconer Co-authored-by: Ville Tuulos Co-authored-by: Shashank Srikanth <108034001+hunsdiecker@users.noreply.github.com> Co-authored-by: Tommy Brecher Co-authored-by: Maciej (Mike) Balajewicz <102193656+mbalajew@users.noreply.github.com> Co-authored-by: John Parker Co-authored-by: Shri Javadekar Co-authored-by: Tom Furmston Co-authored-by: Adam Merberg Co-authored-by: Adam Merberg Co-authored-by: James Budarz Co-authored-by: ashrielbrian Co-authored-by: Riccardo Bini Co-authored-by: Daniel Corvesor Co-authored-by: Dan --- .github/workflows/test.yml | 14 +- .pre-commit-config.yaml | 5 +- R/inst/tutorials/02-statistics/README.md | 2 +- R/inst/tutorials/02-statistics/stats.Rmd | 2 +- .../tutorials/05-statistics-redux/README.md | 2 +- README.md | 3 +- docs/Environment escape.md | 18 +- docs/cards.md | 22 +- docs/concurrency.md | 6 +- docs/datastore.md | 14 +- metaflow/__init__.py | 27 +- metaflow/_vendor/v3_5/__init__.py | 1 + .../{ => v3_5}/importlib_metadata.LICENSE | 0 .../{ => v3_5}/importlib_metadata/__init__.py | 2 +- .../{ => v3_5}/importlib_metadata/_compat.py | 0 metaflow/_vendor/{ => v3_5}/zipp.LICENSE | 0 metaflow/_vendor/{ => v3_5}/zipp.py | 0 metaflow/_vendor/v3_6/__init__.py | 1 + .../_vendor/v3_6/importlib_metadata.LICENSE | 13 + .../v3_6/importlib_metadata/__init__.py | 1063 ++++++ .../v3_6/importlib_metadata/_adapters.py | 68 + .../v3_6/importlib_metadata/_collections.py | 30 + 
.../v3_6/importlib_metadata/_compat.py | 71 + .../v3_6/importlib_metadata/_functools.py | 104 + .../v3_6/importlib_metadata/_itertools.py | 73 + .../_vendor/v3_6/importlib_metadata/_meta.py | 48 + .../_vendor/v3_6/importlib_metadata/_text.py | 99 + .../v3_6/importlib_metadata/py.typed} | 0 .../_vendor/v3_6/typing_extensions.LICENSE | 254 ++ metaflow/_vendor/v3_6/typing_extensions.py | 2908 +++++++++++++++++ metaflow/_vendor/v3_6/zipp.LICENSE | 19 + metaflow/_vendor/v3_6/zipp.py | 329 ++ metaflow/_vendor/vendor_any.txt | 1 + .../_vendor/{vendor.txt => vendor_v3_5.txt} | 1 - metaflow/_vendor/vendor_v3_6.txt | 1 + metaflow/cli.py | 143 +- metaflow/cli_args.py | 2 +- metaflow/client/core.py | 528 ++- metaflow/client/filecache.py | 13 +- metaflow/cmd/__init__.py | 0 .../{main_cli.py => cmd/configure_cmd.py} | 614 ++-- metaflow/cmd/main_cli.py | 140 + metaflow/cmd/tutorials_cmd.py | 160 + metaflow/cmd/util.py | 23 + metaflow/current.py | 125 +- metaflow/datastore/__init__.py | 5 - metaflow/datastore/datastore_storage.py | 4 +- metaflow/datastore/flow_datastore.py | 4 +- metaflow/datastore/task_datastore.py | 40 +- metaflow/datatools/s3.py | 1047 ------ metaflow/debug.py | 7 +- metaflow/decorators.py | 64 +- metaflow/event_logger.py | 38 +- metaflow/exception.py | 4 + metaflow/extension_support.py | 872 +++-- metaflow/flowspec.py | 141 +- metaflow/graph.py | 4 +- metaflow/includefile.py | 691 ++-- metaflow/lint.py | 1 + metaflow/metadata/heartbeat.py | 29 +- metaflow/metadata/metadata.py | 206 +- metaflow/metadata/util.py | 4 +- metaflow/metaflow_config.py | 384 ++- metaflow/metaflow_config_funcs.py | 120 + metaflow/metaflow_environment.py | 74 +- metaflow/metaflow_version.py | 6 +- metaflow/mflog/__init__.py | 44 +- metaflow/mflog/mflog.py | 2 +- metaflow/mflog/redirect_streams.py | 54 - metaflow/mflog/save_logs.py | 5 +- metaflow/mflog/save_logs_periodically.py | 11 +- metaflow/monitor.py | 263 +- metaflow/multicore_utils.py | 8 +- metaflow/package.py | 49 +- metaflow/parameters.py | 122 +- metaflow/plugins/__init__.py | 57 +- metaflow/plugins/airflow/__init__.py | 0 metaflow/plugins/airflow/airflow.py | 693 ++++ metaflow/plugins/airflow/airflow_cli.py | 434 +++ metaflow/plugins/airflow/airflow_decorator.py | 66 + metaflow/plugins/airflow/airflow_utils.py | 672 ++++ metaflow/plugins/airflow/dag.py | 9 + metaflow/plugins/airflow/exception.py | 12 + metaflow/plugins/airflow/plumbing/__init__.py | 0 .../airflow/plumbing/set_parameters.py | 21 + metaflow/plugins/argo/__init__.py | 0 metaflow/plugins/argo/argo_client.py | 182 ++ metaflow/plugins/argo/argo_workflows.py | 1404 ++++++++ metaflow/plugins/argo/argo_workflows_cli.py | 513 +++ .../plugins/argo/argo_workflows_decorator.py | 63 + metaflow/plugins/argo/process_input_paths.py | 19 + metaflow/plugins/aws/aws_client.py | 56 +- metaflow/plugins/aws/aws_utils.py | 10 +- metaflow/plugins/aws/batch/batch.py | 44 +- metaflow/plugins/aws/batch/batch_cli.py | 7 +- metaflow/plugins/aws/batch/batch_client.py | 49 +- metaflow/plugins/aws/batch/batch_decorator.py | 77 +- metaflow/plugins/aws/eks/kubernetes.py | 362 -- .../plugins/aws/eks/kubernetes_decorator.py | 257 -- .../aws/step_functions/dynamo_db_client.py | 22 +- .../aws/step_functions/schedule_decorator.py | 18 + .../aws/step_functions/step_functions.py | 55 +- .../aws/step_functions/step_functions_cli.py | 25 +- .../step_functions_decorator.py | 2 +- metaflow/plugins/azure/__init__.py | 0 metaflow/plugins/azure/azure_exceptions.py | 13 + metaflow/plugins/azure/azure_tail.py | 94 + 
metaflow/plugins/azure/azure_utils.py | 218 ++ .../azure/blob_service_client_factory.py | 171 + metaflow/plugins/azure/includefile_support.py | 123 + metaflow/plugins/cards/card_cli.py | 43 +- metaflow/plugins/cards/card_client.py | 141 +- metaflow/plugins/cards/card_datastore.py | 152 +- metaflow/plugins/cards/card_decorator.py | 93 +- .../plugins/cards/card_modules/__init__.py | 93 +- metaflow/plugins/cards/card_modules/basic.py | 31 +- metaflow/plugins/cards/card_modules/card.py | 42 +- .../cards/card_modules/chevron/renderer.py | 7 +- .../plugins/cards/card_modules/components.py | 271 +- .../card_modules/convert_to_native_type.py | 24 +- .../cards/card_modules/renderer_tools.py | 8 +- .../plugins/cards/component_serializer.py | 73 +- metaflow/plugins/cards/exception.py | 6 +- metaflow/plugins/cards/ui/package.json | 2 +- .../plugins/cards/ui/public/card-example.json | 2 +- metaflow/plugins/cards/ui/src/store.ts | 2 +- metaflow/plugins/cards/ui/yarn.lock | 138 +- metaflow/plugins/catch_decorator.py | 27 +- metaflow/plugins/conda/__init__.py | 46 + metaflow/plugins/conda/batch_bootstrap.py | 61 +- metaflow/plugins/conda/conda.py | 5 +- metaflow/plugins/conda/conda_environment.py | 16 +- .../plugins/conda/conda_flow_decorator.py | 28 +- .../plugins/conda/conda_step_decorator.py | 134 +- metaflow/plugins/datastores/__init__.py | 0 metaflow/plugins/datastores/azure_storage.py | 397 +++ metaflow/plugins/datastores/gs_storage.py | 275 ++ .../datastores}/local_storage.py | 5 +- .../datastores}/s3_storage.py | 18 +- metaflow/{ => plugins}/datatools/__init__.py | 5 +- metaflow/plugins/datatools/local.py | 152 + metaflow/plugins/datatools/s3/__init__.py | 9 + metaflow/plugins/datatools/s3/s3.py | 1645 ++++++++++ .../datatools/s3}/s3op.py | 747 +++-- .../datatools/s3}/s3tail.py | 4 +- .../datatools/s3}/s3util.py | 38 +- metaflow/plugins/debug_logger.py | 27 +- metaflow/plugins/debug_monitor.py | 53 +- metaflow/plugins/env_escape/__init__.py | 67 +- metaflow/plugins/env_escape/client.py | 113 +- metaflow/plugins/env_escape/client_modules.py | 45 +- .../env_escape/communication/channel.py | 2 +- .../communication/socket_bytestream.py | 3 + .../plugins/env_escape/communication/utils.py | 4 +- .../plugins/env_escape/data_transferer.py | 7 +- metaflow/plugins/env_escape/server.py | 49 +- metaflow/plugins/env_escape/stub.py | 4 +- metaflow/plugins/environment_decorator.py | 17 +- metaflow/plugins/frameworks/pytorch.py | 2 +- metaflow/plugins/gcp/__init__.py | 0 metaflow/plugins/gcp/gs_exceptions.py | 5 + .../plugins/gcp/gs_storage_client_factory.py | 21 + metaflow/plugins/gcp/gs_tail.py | 85 + metaflow/plugins/gcp/gs_utils.py | 65 + metaflow/plugins/gcp/includefile_support.py | 108 + metaflow/plugins/kubernetes/__init__.py | 0 metaflow/plugins/kubernetes/kubernetes.py | 348 ++ .../{aws/eks => kubernetes}/kubernetes_cli.py | 80 +- .../plugins/kubernetes/kubernetes_client.py | 57 + .../kubernetes/kubernetes_decorator.py | 396 +++ .../kubernetes_job.py} | 512 ++- metaflow/plugins/metadata/local.py | 300 +- metaflow/plugins/metadata/service.py | 236 +- metaflow/plugins/parallel_decorator.py | 2 +- metaflow/plugins/project_decorator.py | 14 + metaflow/plugins/resources_decorator.py | 31 +- metaflow/plugins/retry_decorator.py | 26 +- metaflow/plugins/storage_executor.py | 164 + metaflow/plugins/tag_cli.py | 531 +++ .../test_unbounded_foreach_decorator.py | 4 +- metaflow/plugins/timeout_decorator.py | 23 +- metaflow/pylint_wrapper.py | 9 + metaflow/runtime.py | 367 ++- metaflow/sidecar.py | 156 - 
metaflow/sidecar/__init__.py | 3 + metaflow/sidecar/sidecar.py | 31 + metaflow/sidecar/sidecar_messages.py | 34 + metaflow/sidecar/sidecar_subprocess.py | 237 ++ metaflow/sidecar/sidecar_worker.py | 68 + metaflow/sidecar_messages.py | 24 - metaflow/sidecar_worker.py | 61 - metaflow/tagging_util.py | 76 + metaflow/task.py | 62 +- metaflow/tutorials/01-playlist/playlist.py | 2 +- metaflow/tutorials/02-statistics/README.md | 2 +- metaflow/tutorials/02-statistics/stats.ipynb | 2 +- metaflow/tutorials/02-statistics/stats.py | 4 +- .../tutorials/03-playlist-redux/playlist.py | 4 +- metaflow/tutorials/04-playlist-plus/README.md | 4 +- .../tutorials/04-playlist-plus/playlist.py | 4 +- metaflow/tutorials/05-helloaws/README.md | 8 +- .../tutorials/06-statistics-redux/README.md | 6 +- .../tutorials/06-statistics-redux/stats.ipynb | 2 +- metaflow/tutorials/07-worldview/README.md | 4 +- metaflow/tutorials/08-autopilot/README.md | 2 +- metaflow/util.py | 49 +- metaflow/vendor.py | 143 +- setup.py | 4 +- test/README.md | 6 +- test/core/contexts.json | 38 +- .../test_org/plugins/frameworks/__init__.py | 0 .../test_org/plugins/frameworks/pytorch.py | 8 + .../test_org/plugins/mfextinit_test_org.py | 2 +- test/core/metaflow_test/__init__.py | 40 + test/core/metaflow_test/cli_check.py | 122 +- test/core/metaflow_test/formatter.py | 6 +- test/core/metaflow_test/metadata_check.py | 27 +- test/core/tests/basic_include.py | 28 +- test/core/tests/basic_log.py | 8 +- test/core/tests/basic_tags.py | 10 +- test/core/tests/card_default_editable.py | 4 +- .../tests/card_default_editable_customize.py | 2 +- .../tests/card_default_editable_with_id.py | 11 +- test/core/tests/card_error.py | 3 +- test/core/tests/card_extension_test.py | 58 + test/core/tests/card_id_append.py | 4 +- test/core/tests/card_import.py | 4 +- test/core/tests/card_resume.py | 2 +- test/core/tests/card_simple.py | 6 +- test/core/tests/card_timeout.py | 2 +- test/core/tests/catch_retry.py | 3 +- test/core/tests/current_singleton.py | 8 + test/core/tests/extensions.py | 1 + test/core/tests/large_artifact.py | 6 +- test/core/tests/large_mflog.py | 5 +- test/core/tests/resume_end_step.py | 29 + test/core/tests/resume_start_step.py | 9 +- test/core/tests/run_id_file.py | 34 + test/core/tests/tag_catch.py | 6 +- test/core/tests/tag_mutation.py | 114 + test/data/__init__.py | 6 +- test/data/s3/s3_data.py | 51 +- test/data/s3/test_s3.py | 312 +- test/extensions/README.md | 5 + test/extensions/install_packages.sh | 3 + .../packages/card_via_extinit/README.md | 3 + .../plugins/cards/card_a/__init__.py | 15 + .../plugins/cards/card_b/__init__.py | 15 + .../plugins/cards/mfextinit_X.py | 4 + .../packages/card_via_extinit/setup.py | 21 + .../packages/card_via_init/README.md | 3 + .../card_via_init/plugins/cards/__init__.py | 15 + .../packages/card_via_init/setup.py | 21 + .../packages/card_via_ns_subpackage/README.md | 3 + .../plugins/cards/nssubpackage/__init__.py | 15 + .../packages/card_via_ns_subpackage/setup.py | 21 + test/unit/test_compute_resource_attributes.py | 61 +- test/unit/test_k8s_job_name_sanitizer.py | 33 - test/unit/test_k8s_label_sanitizer.py | 28 - test/unit/test_local_metadata_provider.py | 31 + test_runner | 8 +- tox.ini | 1 + 262 files changed, 22207 insertions(+), 5583 deletions(-) create mode 100644 metaflow/_vendor/v3_5/__init__.py rename metaflow/_vendor/{ => v3_5}/importlib_metadata.LICENSE (100%) rename metaflow/_vendor/{ => v3_5}/importlib_metadata/__init__.py (99%) rename metaflow/_vendor/{ => v3_5}/importlib_metadata/_compat.py 
(100%) rename metaflow/_vendor/{ => v3_5}/zipp.LICENSE (100%) rename metaflow/_vendor/{ => v3_5}/zipp.py (100%) create mode 100644 metaflow/_vendor/v3_6/__init__.py create mode 100644 metaflow/_vendor/v3_6/importlib_metadata.LICENSE create mode 100644 metaflow/_vendor/v3_6/importlib_metadata/__init__.py create mode 100644 metaflow/_vendor/v3_6/importlib_metadata/_adapters.py create mode 100644 metaflow/_vendor/v3_6/importlib_metadata/_collections.py create mode 100644 metaflow/_vendor/v3_6/importlib_metadata/_compat.py create mode 100644 metaflow/_vendor/v3_6/importlib_metadata/_functools.py create mode 100644 metaflow/_vendor/v3_6/importlib_metadata/_itertools.py create mode 100644 metaflow/_vendor/v3_6/importlib_metadata/_meta.py create mode 100644 metaflow/_vendor/v3_6/importlib_metadata/_text.py rename metaflow/{plugins/aws/eks/__init__.py => _vendor/v3_6/importlib_metadata/py.typed} (100%) create mode 100644 metaflow/_vendor/v3_6/typing_extensions.LICENSE create mode 100644 metaflow/_vendor/v3_6/typing_extensions.py create mode 100644 metaflow/_vendor/v3_6/zipp.LICENSE create mode 100644 metaflow/_vendor/v3_6/zipp.py create mode 100644 metaflow/_vendor/vendor_any.txt rename metaflow/_vendor/{vendor.txt => vendor_v3_5.txt} (66%) create mode 100644 metaflow/_vendor/vendor_v3_6.txt create mode 100644 metaflow/cmd/__init__.py rename metaflow/{main_cli.py => cmd/configure_cmd.py} (62%) create mode 100644 metaflow/cmd/main_cli.py create mode 100644 metaflow/cmd/tutorials_cmd.py create mode 100644 metaflow/cmd/util.py delete mode 100644 metaflow/datatools/s3.py create mode 100644 metaflow/metaflow_config_funcs.py delete mode 100644 metaflow/mflog/redirect_streams.py create mode 100644 metaflow/plugins/airflow/__init__.py create mode 100644 metaflow/plugins/airflow/airflow.py create mode 100644 metaflow/plugins/airflow/airflow_cli.py create mode 100644 metaflow/plugins/airflow/airflow_decorator.py create mode 100644 metaflow/plugins/airflow/airflow_utils.py create mode 100644 metaflow/plugins/airflow/dag.py create mode 100644 metaflow/plugins/airflow/exception.py create mode 100644 metaflow/plugins/airflow/plumbing/__init__.py create mode 100644 metaflow/plugins/airflow/plumbing/set_parameters.py create mode 100644 metaflow/plugins/argo/__init__.py create mode 100644 metaflow/plugins/argo/argo_client.py create mode 100644 metaflow/plugins/argo/argo_workflows.py create mode 100644 metaflow/plugins/argo/argo_workflows_cli.py create mode 100644 metaflow/plugins/argo/argo_workflows_decorator.py create mode 100644 metaflow/plugins/argo/process_input_paths.py delete mode 100644 metaflow/plugins/aws/eks/kubernetes.py delete mode 100644 metaflow/plugins/aws/eks/kubernetes_decorator.py create mode 100644 metaflow/plugins/azure/__init__.py create mode 100644 metaflow/plugins/azure/azure_exceptions.py create mode 100644 metaflow/plugins/azure/azure_tail.py create mode 100644 metaflow/plugins/azure/azure_utils.py create mode 100644 metaflow/plugins/azure/blob_service_client_factory.py create mode 100644 metaflow/plugins/azure/includefile_support.py create mode 100644 metaflow/plugins/datastores/__init__.py create mode 100644 metaflow/plugins/datastores/azure_storage.py create mode 100644 metaflow/plugins/datastores/gs_storage.py rename metaflow/{datastore => plugins/datastores}/local_storage.py (96%) rename metaflow/{datastore => plugins/datastores}/s3_storage.py (91%) rename metaflow/{ => plugins}/datatools/__init__.py (82%) create mode 100644 metaflow/plugins/datatools/local.py create mode 100644 
metaflow/plugins/datatools/s3/__init__.py create mode 100644 metaflow/plugins/datatools/s3/s3.py rename metaflow/{datatools => plugins/datatools/s3}/s3op.py (50%) rename metaflow/{datatools => plugins/datatools/s3}/s3tail.py (89%) rename metaflow/{datatools => plugins/datatools/s3}/s3util.py (63%) create mode 100644 metaflow/plugins/gcp/__init__.py create mode 100644 metaflow/plugins/gcp/gs_exceptions.py create mode 100644 metaflow/plugins/gcp/gs_storage_client_factory.py create mode 100644 metaflow/plugins/gcp/gs_tail.py create mode 100644 metaflow/plugins/gcp/gs_utils.py create mode 100644 metaflow/plugins/gcp/includefile_support.py create mode 100644 metaflow/plugins/kubernetes/__init__.py create mode 100644 metaflow/plugins/kubernetes/kubernetes.py rename metaflow/plugins/{aws/eks => kubernetes}/kubernetes_cli.py (75%) create mode 100644 metaflow/plugins/kubernetes/kubernetes_client.py create mode 100644 metaflow/plugins/kubernetes/kubernetes_decorator.py rename metaflow/plugins/{aws/eks/kubernetes_client.py => kubernetes/kubernetes_job.py} (55%) create mode 100644 metaflow/plugins/storage_executor.py create mode 100644 metaflow/plugins/tag_cli.py delete mode 100644 metaflow/sidecar.py create mode 100644 metaflow/sidecar/__init__.py create mode 100644 metaflow/sidecar/sidecar.py create mode 100644 metaflow/sidecar/sidecar_messages.py create mode 100644 metaflow/sidecar/sidecar_subprocess.py create mode 100644 metaflow/sidecar/sidecar_worker.py delete mode 100644 metaflow/sidecar_messages.py delete mode 100644 metaflow/sidecar_worker.py create mode 100644 metaflow/tagging_util.py create mode 100644 test/core/metaflow_extensions/test_org/plugins/frameworks/__init__.py create mode 100644 test/core/metaflow_extensions/test_org/plugins/frameworks/pytorch.py create mode 100644 test/core/tests/card_extension_test.py create mode 100644 test/core/tests/run_id_file.py create mode 100644 test/core/tests/tag_mutation.py create mode 100644 test/extensions/README.md create mode 100644 test/extensions/install_packages.sh create mode 100644 test/extensions/packages/card_via_extinit/README.md create mode 100644 test/extensions/packages/card_via_extinit/metaflow_extensions/card_via_extinit/plugins/cards/card_a/__init__.py create mode 100644 test/extensions/packages/card_via_extinit/metaflow_extensions/card_via_extinit/plugins/cards/card_b/__init__.py create mode 100644 test/extensions/packages/card_via_extinit/metaflow_extensions/card_via_extinit/plugins/cards/mfextinit_X.py create mode 100644 test/extensions/packages/card_via_extinit/setup.py create mode 100644 test/extensions/packages/card_via_init/README.md create mode 100644 test/extensions/packages/card_via_init/metaflow_extensions/card_via_init/plugins/cards/__init__.py create mode 100644 test/extensions/packages/card_via_init/setup.py create mode 100644 test/extensions/packages/card_via_ns_subpackage/README.md create mode 100644 test/extensions/packages/card_via_ns_subpackage/metaflow_extensions/card_via_ns_subpackage/plugins/cards/nssubpackage/__init__.py create mode 100644 test/extensions/packages/card_via_ns_subpackage/setup.py delete mode 100644 test/unit/test_k8s_job_name_sanitizer.py delete mode 100644 test/unit/test_k8s_label_sanitizer.py create mode 100644 test/unit/test_local_metadata_provider.py diff --git a/.github/workflows/test.yml b/.github/workflows/test.yml index 6267603aeea..93a8495edd4 100644 --- a/.github/workflows/test.yml +++ b/.github/workflows/test.yml @@ -10,7 +10,7 @@ on: jobs: pre-commit: - runs-on: ubuntu-latest + 
runs-on: ubuntu-20.04 steps: - uses: actions/checkout@v2 - uses: actions/setup-python@v2 @@ -22,8 +22,8 @@ jobs: strategy: fail-fast: false matrix: - os: [ubuntu-latest, macos-latest] - ver: ['3.5', '3.6', '3.7', '3.8', '3.9','3.10',] + os: [ubuntu-20.04, macos-latest] + ver: ['3.5', '3.6', '3.7', '3.8', '3.9', '3.10', '3.11',] steps: - uses: actions/checkout@v2 @@ -47,8 +47,8 @@ jobs: strategy: fail-fast: false matrix: - os: [ubuntu-latest, macos-latest] - ver: ['4.0', '4.1'] + os: [macos-latest] + ver: ['4.0'] steps: - uses: actions/checkout@v2 @@ -58,8 +58,8 @@ jobs: r-version: ${{ matrix.ver }} - name: Install R ${{ matrix.ver }} system dependencies - if: matrix.os == 'ubuntu-latest' - run: sudo apt-get update; sudo apt-get install -y libcurl4-openssl-dev qpdf libgit2-dev + if: matrix.os == 'ubuntu-20.04' + run: sudo apt-get update; sudo apt-get install -y libcurl4-openssl-dev qpdf libgit2-dev libharfbuzz-dev libfribidi-dev - name: Install R ${{ matrix.ver }} Rlang dependencies run: | diff --git a/.pre-commit-config.yaml b/.pre-commit-config.yaml index 0fdd3bae214..648aaa7bf51 100644 --- a/.pre-commit-config.yaml +++ b/.pre-commit-config.yaml @@ -6,8 +6,9 @@ repos: - id: check-yaml - id: check-json - repo: https://github.com/ambv/black - rev: 21.9b0 + rev: 22.10.0 hooks: - id: black language_version: python3 - exclude: "^metaflow/_vendor/" \ No newline at end of file + exclude: "^metaflow/_vendor/" + additional_dependencies: ["click<8.1.0"] diff --git a/R/inst/tutorials/02-statistics/README.md b/R/inst/tutorials/02-statistics/README.md index e96085d119f..523b74f94c5 100644 --- a/R/inst/tutorials/02-statistics/README.md +++ b/R/inst/tutorials/02-statistics/README.md @@ -1,6 +1,6 @@ # Episode 02-statistics: Is this Data Science? -**Use metaflow to load the movie metadata CSV file into a data frame and compute some movie genre specific statistics. These statistics are then used in +**Use metaflow to load the movie metadata CSV file into a data frame and compute some movie genre-specific statistics. These statistics are then used in later examples to improve our playlist generator. You can optionally use the Metaflow client to eyeball the results in a Markdown Notebook, and make some simple plots.** diff --git a/R/inst/tutorials/02-statistics/stats.Rmd b/R/inst/tutorials/02-statistics/stats.Rmd index 6db6f0345d5..a805e194899 100644 --- a/R/inst/tutorials/02-statistics/stats.Rmd +++ b/R/inst/tutorials/02-statistics/stats.Rmd @@ -5,7 +5,7 @@ output: df_print: paged --- -MovieStatsFlow loads the movie metadata CSV file into a Pandas Dataframe and computes some movie genre specific statistics. You can use this notebook and the Metaflow client to eyeball the results and make some simple plots. +MovieStatsFlow loads the movie metadata CSV file into a Pandas Dataframe and computes some movie genre-specific statistics. You can use this notebook and the Metaflow client to eyeball the results and make some simple plots. ```{r} suppressPackageStartupMessages(library(metaflow)) diff --git a/R/inst/tutorials/05-statistics-redux/README.md b/R/inst/tutorials/05-statistics-redux/README.md index e1873ca9754..22142fcc1c0 100644 --- a/R/inst/tutorials/05-statistics-redux/README.md +++ b/R/inst/tutorials/05-statistics-redux/README.md @@ -6,7 +6,7 @@ running on remote compute. In this example we re-run the 'stats.R' workflow adding the '--with batch' command line argument. This instructs Metaflow to run all your steps on AWS batch without changing any code. 
You can control the behavior with additional arguments, like '--max-workers'. For this example, -'max-workers' is used to limit the number of parallel genre specific statistics +'max-workers' is used to limit the number of parallel genre-specific statistics computations. You can then access the data artifacts (even the local CSV file) from anywhere because the data is being stored in AWS S3.** diff --git a/README.md b/README.md index edae62e93da..aae74f14b26 100644 --- a/README.md +++ b/README.md @@ -51,4 +51,5 @@ We welcome contributions to Metaflow. Please see our [contribution guide](https: ### Code style -We use [black](https://black.readthedocs.io/en/stable/) as a code formatter. The easiest way to ensure your commits are always formatted with the correct version of `black` it is to use [pre-commit](https://pre-commit.com/): install it and then run `pre-commit install` once in your local copy of the repo. \ No newline at end of file +We use [black](https://black.readthedocs.io/en/stable/) as a code formatter. The easiest way to ensure your commits are always formatted with the correct version of `black` it is to use [pre-commit](https://pre-commit.com/): install it and then run `pre-commit install` once in your local copy of the repo. + diff --git a/docs/Environment escape.md b/docs/Environment escape.md index a4d76e3ecd7..c0b03354872 100644 --- a/docs/Environment escape.md +++ b/docs/Environment escape.md @@ -20,7 +20,7 @@ but *some* can execute in another Python environment. At a high-level, the environment escape plugin allows a Python interpreter to forward calls to another interpreter. To set semantics, we will say that a *client* interpreter escapes to a *server* interpreter. The *server* interpreter -operates in a slave-like mode with regards to the *client*. To give a concrete +operates in a slave-like mode with regard to the *client*. To give a concrete example, imagine a package ``data_accessor`` that is available in the base environment you are executing in but not in your Conda environment. When executing within the Conda environment, the *client* interpreter is the Conda @@ -69,7 +69,7 @@ identifier to find the correct stub. There is therefore a **one-to-one mapping between stub objects on the client and backing objects on the server**. The next method called on ```job``` is ```wait``` which returns ```None```. In -this system, by design, only certain objects are able to be transferred between +this system, by design, only certain objects may be transferred between the client and the server: - any Python basic type; this can be extended to any object that can be pickled without any external library; @@ -224,9 +224,9 @@ everything to the server: performs computations at the request of the client when the client is unable to do so. - The server is thus started by the client and the client is responsible for - terminating it when it dies. A big part of the client and server code consist - in loading the configuration for the emulated module, particularly the + The server is thus started by the client, and the client is responsible for + terminating the server when it dies. A big part of the client and server code + consist in loading the configuration for the emulated module, particularly the overrides. The steps to bringing up the client/server connection are as follows: @@ -274,7 +274,7 @@ used). 
## Defining an emulated module -To define an emulated module, you need to create a sub directory in +To define an emulated module, you need to create a subdirectory in ```plugins/env_escape/configurations``` called ```emulate_``` where `````` is the name of the library you want to emulate. It can be a "list" where ```__``` is the list separator; this allows multiple libraries to be @@ -286,9 +286,9 @@ create two files: - ```EXPORTED_CLASSES```: This is a dictionary of dictionary describing the whitelisted classes. The outermost key is either a string or a tuple of strings and corresponds to the "module" name (it doesn't really have to be - the module but the prefix of the full name of the whitelisted class)). The + the module but the prefix of the full name of the whitelisted class). The inner key is a string and corresponds to the suffix of the whitelisted - class. Finally, the value is the class that the class maps to internally. If + class. Finally, the value is the class to which the class maps internally. If the outermost key is a tuple, all strings in that tuple will be considered aliases of one another. - ```EXPORTED_FUNCTIONS```: This is the same structure as @@ -324,7 +324,7 @@ create two files: define how attributes are accessed. Note that this is not restricted to attributes accessed using the ```getattr``` and ```setattr``` functions but any attribute. Both of these functions take as arguments ```stub```, - ```name``` and ```func``` which is the function to call to call the remote + ```name``` and ```func``` which is the function to call in order to call the remote ```getattr``` or ```setattr```. The ```setattr``` version takes an additional ```value``` argument. The remote versions simply take the target object and the name of the attribute (and ```value``` if it is a ```setattr``` override) diff --git a/docs/cards.md b/docs/cards.md index cf3a5d9fe02..5bddab2b729 100644 --- a/docs/cards.md +++ b/docs/cards.md @@ -29,7 +29,7 @@ Metaflow cards can be created by placing an [`@card` decorator](#@card-decorator Since the cards are stored in the datastore we can access them via the `view/get` commands in the [card_cli](#card-cli) or by using the `get_cards` [function](../metaflow/plugins/cards/card_client.py). -Metaflow ships with a [DefaultCard](#defaultcard) which visualizes artifacts, images, and `pandas.Dataframe`s. Metaflow also ships custom components like `Image`, `Table`, `Markdown` etc. These can be added to a card at `Task` runtime. Cards can also be edited from `@step` code using the [current.card](#editing-metaflowcard-from-@step-code) interface. `current.card` helps add `MetaflowCardComponent`s from `@step` code to a `MetaflowCard`. `current.card` offers methods like `current.card.append` or `current.card['myid']` to helps add components to a card. Since there can be many `@card`s over a `@step`, `@card` also comes with an `id` argument. The `id` argument helps disambigaute the card a component goes to when using `current.card`. For example, setting `@card(id='myid')` and calling `current.card['myid'].append(x)` will append `MetaflowCardComponent` `x` to the card with `id='myid'`. +Metaflow ships with a [DefaultCard](#defaultcard) which visualizes artifacts, images, and `pandas.Dataframe`s. Metaflow also ships custom components like `Image`, `Table`, `Markdown` etc. These can be added to a card at `Task` runtime. Cards can also be edited from `@step` code using the [current.card](#editing-metaflowcard-from-@step-code) interface. 
`current.card` helps add `MetaflowCardComponent`s from `@step` code to a `MetaflowCard`. `current.card` offers methods like `current.card.append` or `current.card['myid']` to helps add components to a card. Since there can be many `@card`s over a `@step`, `@card` also comes with an `id` argument. The `id` argument helps disambiguate the card a component goes to when using `current.card`. For example, setting `@card(id='myid')` and calling `current.card['myid'].append(x)` will append `MetaflowCardComponent` `x` to the card with `id='myid'`. ### `@card` decorator The `@card` [decorator](../metaflow/plugins/cards/card_decorator.py) is implemented by inheriting the `StepDecorator`. The decorator can be placed over `@step` to create an HTML file visualizing information from the task. @@ -75,7 +75,7 @@ if __name__ == "__main__": ### `CardDatastore` -The [CardDatastore](../metaflow/plugins/cards/card_datastore.py) is used by the the [card_cli](#card-cli) and the [metaflow card client](#access-cards-in-notebooks) (`get_cards`). It exposes methods to get metadata about a card and the paths to cards for a `pathspec`. +The [CardDatastore](../metaflow/plugins/cards/card_datastore.py) is used by the [card_cli](#card-cli) and the [metaflow card client](#access-cards-in-notebooks) (`get_cards`). It exposes methods to get metadata about a card and the paths to cards for a `pathspec`. ### Card CLI Methods exposed by the [card_cli](../metaflow/plugins/cards/.card_cli.py). : @@ -142,12 +142,12 @@ class CustomCard(MetaflowCard): The class consists of the `_get_mustache` method that returns [chevron](https://github.com/noahmorrison/chevron) object ( a `mustache` based [templating engine](http://mustache.github.io/mustache.5.html) ). Using the `mustache` templating engine you can rewrite HTML template file. In the above example the `PATH_TO_CUSTOM_HTML` is the file that holds the `mustache` HTML template. #### Attributes -- `type (str)` : The `type` of card. Needs to ensure correct resolution. -- `ALLOW_USER_COMPONENTS (bool)` : Setting this to `True` will make the a card be user editable. More information on user editable cards can be found [here](#editing-metaflowcard-from-@step-code). +- `type (str)` : The `type` of card. Needs to ensure correct resolution. +- `ALLOW_USER_COMPONENTS (bool)` : Setting this to `True` will make the card be user editable. More information on user editable cards can be found [here](#editing-metaflowcard-from-@step-code). #### `__init__` Parameters - `components` `(List[str])`: `components` is a list of `render`ed `MetaflowCardComponent`s created at `@step` runtime. These are passed to the `card create` cli command via a tempfile path in the `--component-file` argument. -- `graph` `(Dict[str,dict])`: The DAG associated to the flow. It is a dictionary of the form `stepname:step_attributes`. `step_attributes` is a dictionary of metadata about a step , `stepname` is the name of the step in the DAG. +- `graph` `(Dict[str,dict])`: The DAG associated to the flow. It is a dictionary of the form `stepname:step_attributes`. `step_attributes` is a dictionary of metadata about a step , `stepname` is the name of the step in the DAG. - `options` `(dict)`: helps control the behavior of individual cards. - For example, the `DefaultCard` supports `options` as dictionary of the form `{"only_repr":True}`. Here setting `only_repr` as `True` will ensure that all artifacts are serialized with `reprlib.repr` function instead of native object serialization. 
@@ -201,7 +201,7 @@ class CustomCard(MetaflowCard): ``` ### `DefaultCard` -The [DefaultCard](../metaflow/plugins/cards/card_modules/basic.py) is a default card exposed by metaflow. This will be used when the `@card` decorator is called without any `type` argument or called with `type='default'` argument. It will also be the default card used with cli. The card uses a [HTML template](../metaflow/plugins/cards/card_modules/base.html) along with a [JS](../metaflow/plugins/cards/card_modules/main.js) and a [CSS](../metaflow/plugins/cards/card_modules/bundle.css) files. +The [DefaultCard](../metaflow/plugins/cards/card_modules/basic.py) is a default card exposed by metaflow. This will be used when the `@card` decorator is called without any `type` argument or called with `type='default'` argument. It will also be the default card used with cli. The card uses an [HTML template](../metaflow/plugins/cards/card_modules/base.html) along with a [JS](../metaflow/plugins/cards/card_modules/main.js) and a [CSS](../metaflow/plugins/cards/card_modules/bundle.css) files. The [HTML](../metaflow/plugins/cards/card_modules/base.html) is a template which works with [JS](../metaflow/plugins/cards/card_modules/main.js) and [CSS](../metaflow/plugins/cards/card_modules/bundle.css). @@ -229,25 +229,25 @@ The JS and CSS are created after building the JS and CSS from the [cards-ui](../ def train(self): from metaflow.cards import Markdown from metaflow import current - current.card.append(Markdown('# This is present in the blank card with id "a"')) - current.card['a'].append(Markdown('# This is present in the default card')) + current.card['a'].append(Markdown('# This is present in the blank card with id "a"')) + current.card.append(Markdown('# This is present in the default card')) self.t = dict( hi = 1, hello = 2 ) self.next(self.end) ``` -In the above scenario there are two `@card` decorators which are being customized by `current.card`. The `current.card.append`/ `current.card['a'].append` methods only accepts objects which are subclasses of `MetaflowCardComponent`. The `current.card.append`/ `current.card['a'].append` methods only add a component to **one** card. Since there can be many cards for a `@step`, a **default editabled card** is resolved to disambiguate which card has access to the `append`/`extend` methods within the `@step`. A default editable card is a card that will have access to the `current.card.append`/`current.card.extend` methods. `current.card` resolve the default editable card before a `@step` code gets executed. It sets the default editable card once the last `@card` decorator calls the `task_pre_step` callback. In the above case, `current.card.append` will add a `Markdown` component to the card of type `default`. `current.card['a'].append` will add the `Markdown` to the `blank` card whose `id` is `a`. A `MetaflowCard` can be user editable, if `ALLOW_USER_COMPONENTS` is set to `True`. Since cards can be of many types, **some cards can also be non editable by users** (Cards with `ALLOW_USER_COMPONENTS=False`). Those cards won't be eligible to access the `current.card.append`. A non user editable card can be edited through expicitly setting an `id` and accessing it via `current.card['myid'].append` or by looking it up by its type via `current.card.get(type=’pytorch’)`. +In the above scenario there are two `@card` decorators which are being customized by `current.card`. 
The `current.card.append`/ `current.card['a'].append` methods only accepts objects which are subclasses of `MetaflowCardComponent`. The `current.card.append`/ `current.card['a'].append` methods only add a component to **one** card. Since there can be many cards for a `@step`, a **default editable card** is resolved to disambiguate which card has access to the `append`/`extend` methods within the `@step`. A default editable card is a card that will have access to the `current.card.append`/`current.card.extend` methods. `current.card` resolve the default editable card before a `@step` code gets executed. It sets the default editable card once the last `@card` decorator calls the `task_pre_step` callback. In the above case, `current.card.append` will add a `Markdown` component to the card of type `default`. `current.card['a'].append` will add the `Markdown` to the `blank` card whose `id` is `a`. A `MetaflowCard` can be user editable, if `ALLOW_USER_COMPONENTS` is set to `True`. Since cards can be of many types, **some cards can also be non-editable by users** (Cards with `ALLOW_USER_COMPONENTS=False`). Those cards won't be eligible to access the `current.card.append`. A non-user editable card can be edited through explicitly setting an `id` and accessing it via `current.card['myid'].append` or by looking it up by its type via `current.card.get(type=’pytorch’)`. #### `current.card` (`CardComponentCollector`) The `CardComponentCollector` is the object responsible for resolving a `MetaflowCardComponent` to the card referenced in the `@card` decorator. -Since there can be many cards, `CardComponentCollector` has a `_finalize` function. The `_finalize` function is called once the **last** `@card` decorator calls `task_pre_step`. The `_finalize` function will try to find the **default editable card** from all the `@card` decorators on the `@step`. The default editable card is the card that can access the `current.card.append`/`current.card.extend` methods. If there are multiple editable cards with no `id` then `current.card` will throw warnings when users call `current.card.append`. This is done because `current.card` cannot resolve which card the component belongs. +Since there can be many cards, `CardComponentCollector` has a `_finalize` function. The `_finalize` function is called once the **last** `@card` decorator calls `task_pre_step`. The `_finalize` function will try to find the **default editable card** from all the `@card` decorators on the `@step`. The default editable card is the card that can access the `current.card.append`/`current.card.extend` methods. If there are multiple editable cards with no `id` then `current.card` will throw warnings when users call `current.card.append`. This is done because `current.card` cannot resolve which card the component belongs. The `@card` decorator also exposes another argument called `customize=True`. **Only one `@card` decorator over a `@step` can have `customize=True`**. Since cards can also be added from CLI when running a flow, adding `@card(customize=True)` will set **that particular card** from the decorator as default editable. This means that `current.card.append` will append to the card belonging to `@card` with `customize=True`. If there is more than one `@card` decorator with `customize=True` then `current.card` will throw warnings that `append` won't work. -One important feature of the `current.card` object is that it will not fail. 
Even when users try to access `current.card.append` with multiple editable cards, we throw warnings but don't fail. `current.card` will also not fail when a user tries to access a card of a non-existing id via `current.card['mycard']`. Since `current.card['mycard']` gives reference to a `list` of `MetaflowCardComponent`s, `current.card` will return a non-referenced `list` when users try to access the dictionary inteface with a non existing id (`current.card['my_non_existant_card']`). +One important feature of the `current.card` object is that it will not fail. Even when users try to access `current.card.append` with multiple editable cards, we throw warnings but don't fail. `current.card` will also not fail when a user tries to access a card of a non-existing id via `current.card['mycard']`. Since `current.card['mycard']` gives reference to a `list` of `MetaflowCardComponent`s, `current.card` will return a non-referenced `list` when users try to access the dictionary interface with a nonexistent id (`current.card['my_non_existant_card']`). Once the `@step` completes execution, every `@card` decorator will call `current.card._serialize` (`CardComponentCollector._serialize`) to get a JSON serializable list of `str`/`dict` objects. The `_serialize` function internally calls all [component's](#metaflowcardcomponent) `render` function. This list is `json.dump`ed to a `tempfile` and passed to the `card create` subprocess where the `MetaflowCard` can use them in the final output. diff --git a/docs/concurrency.md b/docs/concurrency.md index 53f439c9203..ed058830ab5 100644 --- a/docs/concurrency.md +++ b/docs/concurrency.md @@ -29,7 +29,7 @@ Concurrency is practically never needed during the first two phases. We divide the concurrency constructs into two categories: Primary and Secondary. Whenever possible, you should prefer the constructs in -the first category. The patterns are well established and they have +the first category. The patterns are well established and have been used successfully in the core Metaflow modules, `runtime.py` and `task.py`. The constructs in the second category can be used in subprocesses, outside the core code paths in `runtime.py` and `task.py`. @@ -109,7 +109,7 @@ delay, to avoid the parent from blocking. The sidecar subprocess may die for various reasons, in which case messages sent to it by the parent may be lost. To keep communication -essentially non-blocking and fast, there is no blocking acklowdgement of +essentially non-blocking and fast, there is no blocking acknowledgement of successful message processing by the sidecar. Hence the communication is lossy. In this sense, communication with a sidecar is more akin to UDP than TCP. @@ -139,7 +139,7 @@ Use a sidecar if you need a task that runs during scheduling or execution of user code. A sidecar task can not perform any critical operations that must succeed in order for a task or a run to be considered valid. This makes sidecars suitable only for opportunistic, -best effort tasks. +best-effort tasks. ### 3. Data Parallelism diff --git a/docs/datastore.md b/docs/datastore.md index ba74d85336d..4cb7c83c39e 100644 --- a/docs/datastore.md +++ b/docs/datastore.md @@ -33,8 +33,8 @@ items to operate on (for example, all the keys to fetch) than to call the same API multiple times with a single key at a time. All APIs are designed with batch processing in mind where it makes sense. -#### Separation of responsabilities -Each class implements few functionalities and we attempted to maximize reuse. 
+#### Separation of responsibilities +Each class implements few functionalities, and we attempted to maximize reuse. The idea is that this will also help in developing newer implementations going forward and being able to surgically change a few things while keeping most of the code the same. @@ -46,7 +46,7 @@ Before going into the design of the datastore itself, it is worth considering Metaflow considers a datastore to have a `datastore_root` which is the base directory of the datastore. Within that directory, Metaflow will create multiple -sub-directories, one per flow (identified by the name of the flow). Within each +subdirectories, one per flow (identified by the name of the flow). Within each of those directories, Metaflow will create one directory per run as well as a `data` directory which will contain all the artifacts ever produced by that flow. @@ -73,7 +73,7 @@ The datastore has several components (starting at the lowest-level): - a `FlowDataStore` ties everything together. A `FlowDataStore` will include a `ContentAddressedStore` and all the `TaskDataStore`s for all the tasks that are part of the flow. The `FlowDataStore` includes functions to find the - `TaskDataStore` for a given task as well as save and load data directly ( + `TaskDataStore` for a given task as well as to save and load data directly ( this is used primarily for data that is not tied to a single task, for example code packages which are more tied to runs). @@ -111,7 +111,7 @@ additional operations: - transforms the data prior to storing; we currently only compress the data but other operations are possible. -Data is always de-duplicated but you can choose to skip the transformation step +Data is always de-duplicated, but you can choose to skip the transformation step by telling the content address store that the data should be stored `raw` (ie: with no transformation). Note that the de-duplication logic happens *prior* to any transformation (so the transformation itself will not impact the de-duplication @@ -120,7 +120,7 @@ logic). Content stored by the content addressed store is addressable using a `key` which is returned when `save_blobs` is called. `raw` objects can also directly be accessed using a `uri` (also returned by `save_blobs`); the `uri` will point to the location -of the `raw` bytes in the underlying `DataStoreStorage` (so for exmaple a local +of the `raw` bytes in the underlying `DataStoreStorage` (so, for example, a local filesystem path or a S3 path). Objects that are not `raw` do not return a `uri` as they should only be accessed through the content addressed store. @@ -155,7 +155,7 @@ At a high level, the `TaskDataStore` is responsible for: - storing artifacts (functions like `save_artifacts`, `persist` help with this) - storing other metadata about the task execution; this can include logs, general information about the task, user-level metadata and any other information - the user wishes the persist about the task. Functions for this include + the user wishes to persist about the task. Functions for this include `save_logs` and `save_metadata`. Internally, functions like `done` will also store information about the task. diff --git a/metaflow/__init__.py b/metaflow/__init__.py index 9e319b36de3..c306eb3429d 100644 --- a/metaflow/__init__.py +++ b/metaflow/__init__.py @@ -39,7 +39,7 @@ class and related decorators. # More questions? If you have any questions, feel free to post a bug report/question on the -Metaflow Github page. +Metaflow GitHub page. 
""" import importlib @@ -93,20 +93,29 @@ class and related decorators. tl_package = m.split(".")[1] lazy_load_aliases(alias_submodules(extension_module, tl_package, None)) -from .event_logger import EventLogger +# Utilities +from .multicore_utils import parallel_imap_unordered, parallel_map +from .metaflow_profile import profile + +# current runtime singleton +from .current import current # Flow spec from .flowspec import FlowSpec -from .includefile import IncludeFile + from .parameters import Parameter, JSONTypeClass JSONType = JSONTypeClass() -# current runtime singleton -from .current import current - # data layer -from .datatools import S3 +# For historical reasons, we make metaflow.plugins.datatools accessible as +# metaflow.datatools. S3 is also a tool that has historically been available at the +# TL so keep as is. +lazy_load_aliases({"metaflow.datatools": "metaflow.plugins.datatools"}) +from .plugins.datatools import S3 + +# includefile +from .includefile import IncludeFile # Decorators from .decorators import step, _import_plugin_decorators @@ -134,10 +143,6 @@ class and related decorators. DataArtifact, ) -# Utilities -from .multicore_utils import parallel_imap_unordered, parallel_map -from .metaflow_profile import profile - __version_addl__ = [] _ext_debug("Loading top-level modules") for m in _tl_modules: diff --git a/metaflow/_vendor/v3_5/__init__.py b/metaflow/_vendor/v3_5/__init__.py new file mode 100644 index 00000000000..22ae0c5f40e --- /dev/null +++ b/metaflow/_vendor/v3_5/__init__.py @@ -0,0 +1 @@ +# Empty file \ No newline at end of file diff --git a/metaflow/_vendor/importlib_metadata.LICENSE b/metaflow/_vendor/v3_5/importlib_metadata.LICENSE similarity index 100% rename from metaflow/_vendor/importlib_metadata.LICENSE rename to metaflow/_vendor/v3_5/importlib_metadata.LICENSE diff --git a/metaflow/_vendor/importlib_metadata/__init__.py b/metaflow/_vendor/v3_5/importlib_metadata/__init__.py similarity index 99% rename from metaflow/_vendor/importlib_metadata/__init__.py rename to metaflow/_vendor/v3_5/importlib_metadata/__init__.py index 4e3680aa972..429bfa66c4f 100644 --- a/metaflow/_vendor/importlib_metadata/__init__.py +++ b/metaflow/_vendor/v3_5/importlib_metadata/__init__.py @@ -6,7 +6,7 @@ import abc import csv import sys -from metaflow._vendor import zipp +from metaflow._vendor.v3_5 import zipp import operator import functools import itertools diff --git a/metaflow/_vendor/importlib_metadata/_compat.py b/metaflow/_vendor/v3_5/importlib_metadata/_compat.py similarity index 100% rename from metaflow/_vendor/importlib_metadata/_compat.py rename to metaflow/_vendor/v3_5/importlib_metadata/_compat.py diff --git a/metaflow/_vendor/zipp.LICENSE b/metaflow/_vendor/v3_5/zipp.LICENSE similarity index 100% rename from metaflow/_vendor/zipp.LICENSE rename to metaflow/_vendor/v3_5/zipp.LICENSE diff --git a/metaflow/_vendor/zipp.py b/metaflow/_vendor/v3_5/zipp.py similarity index 100% rename from metaflow/_vendor/zipp.py rename to metaflow/_vendor/v3_5/zipp.py diff --git a/metaflow/_vendor/v3_6/__init__.py b/metaflow/_vendor/v3_6/__init__.py new file mode 100644 index 00000000000..22ae0c5f40e --- /dev/null +++ b/metaflow/_vendor/v3_6/__init__.py @@ -0,0 +1 @@ +# Empty file \ No newline at end of file diff --git a/metaflow/_vendor/v3_6/importlib_metadata.LICENSE b/metaflow/_vendor/v3_6/importlib_metadata.LICENSE new file mode 100644 index 00000000000..be7e092b0b0 --- /dev/null +++ b/metaflow/_vendor/v3_6/importlib_metadata.LICENSE @@ -0,0 +1,13 @@ +Copyright 2017-2019 Jason R. 
Coombs, Barry Warsaw + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + +http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. diff --git a/metaflow/_vendor/v3_6/importlib_metadata/__init__.py b/metaflow/_vendor/v3_6/importlib_metadata/__init__.py new file mode 100644 index 00000000000..8d3b7814d50 --- /dev/null +++ b/metaflow/_vendor/v3_6/importlib_metadata/__init__.py @@ -0,0 +1,1063 @@ +import os +import re +import abc +import csv +import sys +from metaflow._vendor.v3_6 import zipp +import email +import pathlib +import operator +import textwrap +import warnings +import functools +import itertools +import posixpath +import collections + +from . import _adapters, _meta +from ._collections import FreezableDefaultDict, Pair +from ._compat import ( + NullFinder, + install, + pypy_partial, +) +from ._functools import method_cache, pass_none +from ._itertools import always_iterable, unique_everseen +from ._meta import PackageMetadata, SimplePath + +from contextlib import suppress +from importlib import import_module +from importlib.abc import MetaPathFinder +from itertools import starmap +from typing import List, Mapping, Optional, Union + + +__all__ = [ + 'Distribution', + 'DistributionFinder', + 'PackageMetadata', + 'PackageNotFoundError', + 'distribution', + 'distributions', + 'entry_points', + 'files', + 'metadata', + 'packages_distributions', + 'requires', + 'version', +] + + +class PackageNotFoundError(ModuleNotFoundError): + """The package was not found.""" + + def __str__(self): + return f"No package metadata was found for {self.name}" + + @property + def name(self): + (name,) = self.args + return name + + +class Sectioned: + """ + A simple entry point config parser for performance + + >>> for item in Sectioned.read(Sectioned._sample): + ... print(item) + Pair(name='sec1', value='# comments ignored') + Pair(name='sec1', value='a = 1') + Pair(name='sec1', value='b = 2') + Pair(name='sec2', value='a = 2') + + >>> res = Sectioned.section_pairs(Sectioned._sample) + >>> item = next(res) + >>> item.name + 'sec1' + >>> item.value + Pair(name='a', value='1') + >>> item = next(res) + >>> item.value + Pair(name='b', value='2') + >>> item = next(res) + >>> item.name + 'sec2' + >>> item.value + Pair(name='a', value='2') + >>> list(res) + [] + """ + + _sample = textwrap.dedent( + """ + [sec1] + # comments ignored + a = 1 + b = 2 + + [sec2] + a = 2 + """ + ).lstrip() + + @classmethod + def section_pairs(cls, text): + return ( + section._replace(value=Pair.parse(section.value)) + for section in cls.read(text, filter_=cls.valid) + if section.name is not None + ) + + @staticmethod + def read(text, filter_=None): + lines = filter(filter_, map(str.strip, text.splitlines())) + name = None + for value in lines: + section_match = value.startswith('[') and value.endswith(']') + if section_match: + name = value.strip('[]') + continue + yield Pair(name, value) + + @staticmethod + def valid(line): + return line and not line.startswith('#') + + +class DeprecatedTuple: + """ + Provide subscript item access for backward compatibility. 
+ + >>> recwarn = getfixture('recwarn') + >>> ep = EntryPoint(name='name', value='value', group='group') + >>> ep[:] + ('name', 'value', 'group') + >>> ep[0] + 'name' + >>> len(recwarn) + 1 + """ + + _warn = functools.partial( + warnings.warn, + "EntryPoint tuple interface is deprecated. Access members by name.", + DeprecationWarning, + stacklevel=pypy_partial(2), + ) + + def __getitem__(self, item): + self._warn() + return self._key()[item] + + +class EntryPoint(DeprecatedTuple): + """An entry point as defined by Python packaging conventions. + + See `the packaging docs on entry points + `_ + for more information. + """ + + pattern = re.compile( + r'(?P[\w.]+)\s*' + r'(:\s*(?P[\w.]+))?\s*' + r'(?P\[.*\])?\s*$' + ) + """ + A regular expression describing the syntax for an entry point, + which might look like: + + - module + - package.module + - package.module:attribute + - package.module:object.attribute + - package.module:attr [extra1, extra2] + + Other combinations are possible as well. + + The expression is lenient about whitespace around the ':', + following the attr, and following any extras. + """ + + dist: Optional['Distribution'] = None + + def __init__(self, name, value, group): + vars(self).update(name=name, value=value, group=group) + + def load(self): + """Load the entry point from its definition. If only a module + is indicated by the value, return that module. Otherwise, + return the named object. + """ + match = self.pattern.match(self.value) + module = import_module(match.group('module')) + attrs = filter(None, (match.group('attr') or '').split('.')) + return functools.reduce(getattr, attrs, module) + + @property + def module(self): + match = self.pattern.match(self.value) + return match.group('module') + + @property + def attr(self): + match = self.pattern.match(self.value) + return match.group('attr') + + @property + def extras(self): + match = self.pattern.match(self.value) + return list(re.finditer(r'\w+', match.group('extras') or '')) + + def _for(self, dist): + vars(self).update(dist=dist) + return self + + def __iter__(self): + """ + Supply iter so one may construct dicts of EntryPoints by name. + """ + msg = ( + "Construction of dict of EntryPoints is deprecated in " + "favor of EntryPoints." + ) + warnings.warn(msg, DeprecationWarning) + return iter((self.name, self)) + + def matches(self, **params): + attrs = (getattr(self, param) for param in params) + return all(map(operator.eq, params.values(), attrs)) + + def _key(self): + return self.name, self.value, self.group + + def __lt__(self, other): + return self._key() < other._key() + + def __eq__(self, other): + return self._key() == other._key() + + def __setattr__(self, name, value): + raise AttributeError("EntryPoint objects are immutable.") + + def __repr__(self): + return ( + f'EntryPoint(name={self.name!r}, value={self.value!r}, ' + f'group={self.group!r})' + ) + + def __hash__(self): + return hash(self._key()) + + +class DeprecatedList(list): + """ + Allow an otherwise immutable object to implement mutability + for compatibility. 
+ + >>> recwarn = getfixture('recwarn') + >>> dl = DeprecatedList(range(3)) + >>> dl[0] = 1 + >>> dl.append(3) + >>> del dl[3] + >>> dl.reverse() + >>> dl.sort() + >>> dl.extend([4]) + >>> dl.pop(-1) + 4 + >>> dl.remove(1) + >>> dl += [5] + >>> dl + [6] + [1, 2, 5, 6] + >>> dl + (6,) + [1, 2, 5, 6] + >>> dl.insert(0, 0) + >>> dl + [0, 1, 2, 5] + >>> dl == [0, 1, 2, 5] + True + >>> dl == (0, 1, 2, 5) + True + >>> len(recwarn) + 1 + """ + + _warn = functools.partial( + warnings.warn, + "EntryPoints list interface is deprecated. Cast to list if needed.", + DeprecationWarning, + stacklevel=pypy_partial(2), + ) + + def _wrap_deprecated_method(method_name: str): # type: ignore + def wrapped(self, *args, **kwargs): + self._warn() + return getattr(super(), method_name)(*args, **kwargs) + + return wrapped + + for method_name in [ + '__setitem__', + '__delitem__', + 'append', + 'reverse', + 'extend', + 'pop', + 'remove', + '__iadd__', + 'insert', + 'sort', + ]: + locals()[method_name] = _wrap_deprecated_method(method_name) + + def __add__(self, other): + if not isinstance(other, tuple): + self._warn() + other = tuple(other) + return self.__class__(tuple(self) + other) + + def __eq__(self, other): + if not isinstance(other, tuple): + self._warn() + other = tuple(other) + + return tuple(self).__eq__(other) + + +class EntryPoints(DeprecatedList): + """ + An immutable collection of selectable EntryPoint objects. + """ + + __slots__ = () + + def __getitem__(self, name): # -> EntryPoint: + """ + Get the EntryPoint in self matching name. + """ + if isinstance(name, int): + warnings.warn( + "Accessing entry points by index is deprecated. " + "Cast to tuple if needed.", + DeprecationWarning, + stacklevel=2, + ) + return super().__getitem__(name) + try: + return next(iter(self.select(name=name))) + except StopIteration: + raise KeyError(name) + + def select(self, **params): + """ + Select entry points from self that match the + given parameters (typically group and/or name). + """ + return EntryPoints(ep for ep in self if ep.matches(**params)) + + @property + def names(self): + """ + Return the set of all names of all entry points. + """ + return {ep.name for ep in self} + + @property + def groups(self): + """ + Return the set of all groups of all entry points. + + For coverage while SelectableGroups is present. + >>> EntryPoints().groups + set() + """ + return {ep.group for ep in self} + + @classmethod + def _from_text_for(cls, text, dist): + return cls(ep._for(dist) for ep in cls._from_text(text)) + + @staticmethod + def _from_text(text): + return ( + EntryPoint(name=item.value.name, value=item.value.value, group=item.name) + for item in Sectioned.section_pairs(text or '') + ) + + +class Deprecated: + """ + Compatibility add-in for mapping to indicate that + mapping behavior is deprecated. + + >>> recwarn = getfixture('recwarn') + >>> class DeprecatedDict(Deprecated, dict): pass + >>> dd = DeprecatedDict(foo='bar') + >>> dd.get('baz', None) + >>> dd['foo'] + 'bar' + >>> list(dd) + ['foo'] + >>> list(dd.keys()) + ['foo'] + >>> 'foo' in dd + True + >>> list(dd.values()) + ['bar'] + >>> len(recwarn) + 1 + """ + + _warn = functools.partial( + warnings.warn, + "SelectableGroups dict interface is deprecated. 
Use select.", + DeprecationWarning, + stacklevel=pypy_partial(2), + ) + + def __getitem__(self, name): + self._warn() + return super().__getitem__(name) + + def get(self, name, default=None): + self._warn() + return super().get(name, default) + + def __iter__(self): + self._warn() + return super().__iter__() + + def __contains__(self, *args): + self._warn() + return super().__contains__(*args) + + def keys(self): + self._warn() + return super().keys() + + def values(self): + self._warn() + return super().values() + + +class SelectableGroups(Deprecated, dict): + """ + A backward- and forward-compatible result from + entry_points that fully implements the dict interface. + """ + + @classmethod + def load(cls, eps): + by_group = operator.attrgetter('group') + ordered = sorted(eps, key=by_group) + grouped = itertools.groupby(ordered, by_group) + return cls((group, EntryPoints(eps)) for group, eps in grouped) + + @property + def _all(self): + """ + Reconstruct a list of all entrypoints from the groups. + """ + groups = super(Deprecated, self).values() + return EntryPoints(itertools.chain.from_iterable(groups)) + + @property + def groups(self): + return self._all.groups + + @property + def names(self): + """ + for coverage: + >>> SelectableGroups().names + set() + """ + return self._all.names + + def select(self, **params): + if not params: + return self + return self._all.select(**params) + + +class PackagePath(pathlib.PurePosixPath): + """A reference to a path in a package""" + + def read_text(self, encoding='utf-8'): + with self.locate().open(encoding=encoding) as stream: + return stream.read() + + def read_binary(self): + with self.locate().open('rb') as stream: + return stream.read() + + def locate(self): + """Return a path-like object for this path""" + return self.dist.locate_file(self) + + +class FileHash: + def __init__(self, spec): + self.mode, _, self.value = spec.partition('=') + + def __repr__(self): + return f'' + + +class Distribution: + """A Python distribution package.""" + + @abc.abstractmethod + def read_text(self, filename): + """Attempt to load metadata file given by the name. + + :param filename: The name of the file in the distribution info. + :return: The text if found, otherwise None. + """ + + @abc.abstractmethod + def locate_file(self, path): + """ + Given a path to a file in this distribution, return a path + to it. + """ + + @classmethod + def from_name(cls, name): + """Return the Distribution for the given package name. + + :param name: The name of the distribution package to search for. + :return: The Distribution instance (or subclass thereof) for the named + package, if found. + :raises PackageNotFoundError: When the named package's distribution + metadata cannot be found. + """ + for resolver in cls._discover_resolvers(): + dists = resolver(DistributionFinder.Context(name=name)) + dist = next(iter(dists), None) + if dist is not None: + return dist + else: + raise PackageNotFoundError(name) + + @classmethod + def discover(cls, **kwargs): + """Return an iterable of Distribution objects for all packages. + + Pass a ``context`` or pass keyword arguments for constructing + a context. + + :context: A ``DistributionFinder.Context`` object. + :return: Iterable of Distribution objects for all packages. 
+ """ + context = kwargs.pop('context', None) + if context and kwargs: + raise ValueError("cannot accept context and kwargs") + context = context or DistributionFinder.Context(**kwargs) + return itertools.chain.from_iterable( + resolver(context) for resolver in cls._discover_resolvers() + ) + + @staticmethod + def at(path): + """Return a Distribution for the indicated metadata path + + :param path: a string or path-like object + :return: a concrete Distribution instance for the path + """ + return PathDistribution(pathlib.Path(path)) + + @staticmethod + def _discover_resolvers(): + """Search the meta_path for resolvers.""" + declared = ( + getattr(finder, 'find_distributions', None) for finder in sys.meta_path + ) + return filter(None, declared) + + @classmethod + def _local(cls, root='.'): + from pep517 import build, meta + + system = build.compat_system(root) + builder = functools.partial( + meta.build, + source_dir=root, + system=system, + ) + return PathDistribution(zipp.Path(meta.build_as_zip(builder))) + + @property + def metadata(self) -> _meta.PackageMetadata: + """Return the parsed metadata for this Distribution. + + The returned object will have keys that name the various bits of + metadata. See PEP 566 for details. + """ + text = ( + self.read_text('METADATA') + or self.read_text('PKG-INFO') + # This last clause is here to support old egg-info files. Its + # effect is to just end up using the PathDistribution's self._path + # (which points to the egg-info file) attribute unchanged. + or self.read_text('') + ) + return _adapters.Message(email.message_from_string(text)) + + @property + def name(self): + """Return the 'Name' metadata for the distribution package.""" + return self.metadata['Name'] + + @property + def _normalized_name(self): + """Return a normalized version of the name.""" + return Prepared.normalize(self.name) + + @property + def version(self): + """Return the 'Version' metadata for the distribution package.""" + return self.metadata['Version'] + + @property + def entry_points(self): + return EntryPoints._from_text_for(self.read_text('entry_points.txt'), self) + + @property + def files(self): + """Files in this distribution. + + :return: List of PackagePath for this distribution or None + + Result is `None` if the metadata file that enumerates files + (i.e. RECORD for dist-info or SOURCES.txt for egg-info) is + missing. + Result may be empty if the metadata exists but is empty. + """ + + def make_file(name, hash=None, size_str=None): + result = PackagePath(name) + result.hash = FileHash(hash) if hash else None + result.size = int(size_str) if size_str else None + result.dist = self + return result + + @pass_none + def make_files(lines): + return list(starmap(make_file, csv.reader(lines))) + + return make_files(self._read_files_distinfo() or self._read_files_egginfo()) + + def _read_files_distinfo(self): + """ + Read the lines of RECORD + """ + text = self.read_text('RECORD') + return text and text.splitlines() + + def _read_files_egginfo(self): + """ + SOURCES.txt might contain literal commas, so wrap each line + in quotes. 
+ """ + text = self.read_text('SOURCES.txt') + return text and map('"{}"'.format, text.splitlines()) + + @property + def requires(self): + """Generated requirements specified for this Distribution""" + reqs = self._read_dist_info_reqs() or self._read_egg_info_reqs() + return reqs and list(reqs) + + def _read_dist_info_reqs(self): + return self.metadata.get_all('Requires-Dist') + + def _read_egg_info_reqs(self): + source = self.read_text('requires.txt') + return source and self._deps_from_requires_text(source) + + @classmethod + def _deps_from_requires_text(cls, source): + return cls._convert_egg_info_reqs_to_simple_reqs(Sectioned.read(source)) + + @staticmethod + def _convert_egg_info_reqs_to_simple_reqs(sections): + """ + Historically, setuptools would solicit and store 'extra' + requirements, including those with environment markers, + in separate sections. More modern tools expect each + dependency to be defined separately, with any relevant + extras and environment markers attached directly to that + requirement. This method converts the former to the + latter. See _test_deps_from_requires_text for an example. + """ + + def make_condition(name): + return name and f'extra == "{name}"' + + def quoted_marker(section): + section = section or '' + extra, sep, markers = section.partition(':') + if extra and markers: + markers = f'({markers})' + conditions = list(filter(None, [markers, make_condition(extra)])) + return '; ' + ' and '.join(conditions) if conditions else '' + + def url_req_space(req): + """ + PEP 508 requires a space between the url_spec and the quoted_marker. + Ref python/importlib_metadata#357. + """ + # '@' is uniquely indicative of a url_req. + return ' ' * ('@' in req) + + for section in sections: + space = url_req_space(section.value) + yield section.value + space + quoted_marker(section.name) + + +class DistributionFinder(MetaPathFinder): + """ + A MetaPathFinder capable of discovering installed distributions. + """ + + class Context: + """ + Keyword arguments presented by the caller to + ``distributions()`` or ``Distribution.discover()`` + to narrow the scope of a search for distributions + in all DistributionFinders. + + Each DistributionFinder may expect any parameters + and should attempt to honor the canonical + parameters defined below when appropriate. + """ + + name = None + """ + Specific name for which a distribution finder should match. + A name of ``None`` matches all distributions. + """ + + def __init__(self, **kwargs): + vars(self).update(kwargs) + + @property + def path(self): + """ + The sequence of directory path that a distribution finder + should search. + + Typically refers to Python installed package paths such as + "site-packages" directories and defaults to ``sys.path``. + """ + return vars(self).get('path', sys.path) + + @abc.abstractmethod + def find_distributions(self, context=Context()): + """ + Find distributions. + + Return an iterable of all Distribution instances capable of + loading the metadata for packages matching the ``context``, + a DistributionFinder.Context instance. + """ + + +class FastPath: + """ + Micro-optimized class for searching a path for + children. 
+ + >>> FastPath('').children() + ['...'] + """ + + @functools.lru_cache() # type: ignore + def __new__(cls, root): + return super().__new__(cls) + + def __init__(self, root): + self.root = str(root) + + def joinpath(self, child): + return pathlib.Path(self.root, child) + + def children(self): + with suppress(Exception): + return os.listdir(self.root or '.') + with suppress(Exception): + return self.zip_children() + return [] + + def zip_children(self): + zip_path = zipp.Path(self.root) + names = zip_path.root.namelist() + self.joinpath = zip_path.joinpath + + return dict.fromkeys(child.split(posixpath.sep, 1)[0] for child in names) + + def search(self, name): + return self.lookup(self.mtime).search(name) + + @property + def mtime(self): + with suppress(OSError): + return os.stat(self.root).st_mtime + self.lookup.cache_clear() + + @method_cache + def lookup(self, mtime): + return Lookup(self) + + +class Lookup: + def __init__(self, path: FastPath): + base = os.path.basename(path.root).lower() + base_is_egg = base.endswith(".egg") + self.infos = FreezableDefaultDict(list) + self.eggs = FreezableDefaultDict(list) + + for child in path.children(): + low = child.lower() + if low.endswith((".dist-info", ".egg-info")): + # rpartition is faster than splitext and suitable for this purpose. + name = low.rpartition(".")[0].partition("-")[0] + normalized = Prepared.normalize(name) + self.infos[normalized].append(path.joinpath(child)) + elif base_is_egg and low == "egg-info": + name = base.rpartition(".")[0].partition("-")[0] + legacy_normalized = Prepared.legacy_normalize(name) + self.eggs[legacy_normalized].append(path.joinpath(child)) + + self.infos.freeze() + self.eggs.freeze() + + def search(self, prepared): + infos = ( + self.infos[prepared.normalized] + if prepared + else itertools.chain.from_iterable(self.infos.values()) + ) + eggs = ( + self.eggs[prepared.legacy_normalized] + if prepared + else itertools.chain.from_iterable(self.eggs.values()) + ) + return itertools.chain(infos, eggs) + + +class Prepared: + """ + A prepared search for metadata on a possibly-named package. + """ + + normalized = None + legacy_normalized = None + + def __init__(self, name): + self.name = name + if name is None: + return + self.normalized = self.normalize(name) + self.legacy_normalized = self.legacy_normalize(name) + + @staticmethod + def normalize(name): + """ + PEP 503 normalization plus dashes as underscores. + """ + return re.sub(r"[-_.]+", "-", name).lower().replace('-', '_') + + @staticmethod + def legacy_normalize(name): + """ + Normalize the package name as found in the convention in + older packaging tools versions and specs. + """ + return name.lower().replace('-', '_') + + def __bool__(self): + return bool(self.name) + + +@install +class MetadataPathFinder(NullFinder, DistributionFinder): + """A degenerate finder for distribution packages on the file system. + + This finder supplies only a find_distributions() method for versions + of Python that do not have a PathFinder find_distributions(). + """ + + def find_distributions(self, context=DistributionFinder.Context()): + """ + Find distributions. + + Return an iterable of all Distribution instances capable of + loading the metadata for packages matching ``context.name`` + (or all names if ``None`` indicated) along the paths in the list + of directories ``context.path``. 
+ """ + found = self._search_paths(context.name, context.path) + return map(PathDistribution, found) + + @classmethod + def _search_paths(cls, name, paths): + """Find metadata directories in paths heuristically.""" + prepared = Prepared(name) + return itertools.chain.from_iterable( + path.search(prepared) for path in map(FastPath, paths) + ) + + def invalidate_caches(cls): + FastPath.__new__.cache_clear() + + +class PathDistribution(Distribution): + def __init__(self, path: SimplePath): + """Construct a distribution. + + :param path: SimplePath indicating the metadata directory. + """ + self._path = path + + def read_text(self, filename): + with suppress( + FileNotFoundError, + IsADirectoryError, + KeyError, + NotADirectoryError, + PermissionError, + ): + return self._path.joinpath(filename).read_text(encoding='utf-8') + + read_text.__doc__ = Distribution.read_text.__doc__ + + def locate_file(self, path): + return self._path.parent / path + + @property + def _normalized_name(self): + """ + Performance optimization: where possible, resolve the + normalized name from the file system path. + """ + stem = os.path.basename(str(self._path)) + return self._name_from_stem(stem) or super()._normalized_name + + def _name_from_stem(self, stem): + name, ext = os.path.splitext(stem) + if ext not in ('.dist-info', '.egg-info'): + return + name, sep, rest = stem.partition('-') + return name + + +def distribution(distribution_name): + """Get the ``Distribution`` instance for the named package. + + :param distribution_name: The name of the distribution package as a string. + :return: A ``Distribution`` instance (or subclass thereof). + """ + return Distribution.from_name(distribution_name) + + +def distributions(**kwargs): + """Get all ``Distribution`` instances in the current environment. + + :return: An iterable of ``Distribution`` instances. + """ + return Distribution.discover(**kwargs) + + +def metadata(distribution_name) -> _meta.PackageMetadata: + """Get the metadata for the named package. + + :param distribution_name: The name of the distribution package to query. + :return: A PackageMetadata containing the parsed metadata. + """ + return Distribution.from_name(distribution_name).metadata + + +def version(distribution_name): + """Get the version string for the named package. + + :param distribution_name: The name of the distribution package to query. + :return: The version string for the package as defined in the package's + "Version" metadata key. + """ + return distribution(distribution_name).version + + +def entry_points(**params) -> Union[EntryPoints, SelectableGroups]: + """Return EntryPoint objects for all installed packages. + + Pass selection parameters (group or name) to filter the + result to entry points matching those properties (see + EntryPoints.select()). + + For compatibility, returns ``SelectableGroups`` object unless + selection parameters are supplied. In the future, this function + will return ``EntryPoints`` instead of ``SelectableGroups`` + even when no selection parameters are supplied. + + For maximum future compatibility, pass selection parameters + or invoke ``.select`` with parameters on the result. + + :return: EntryPoints or SelectableGroups for all installed packages. 
+ """ + norm_name = operator.attrgetter('_normalized_name') + unique = functools.partial(unique_everseen, key=norm_name) + eps = itertools.chain.from_iterable( + dist.entry_points for dist in unique(distributions()) + ) + return SelectableGroups.load(eps).select(**params) + + +def files(distribution_name): + """Return a list of files for the named package. + + :param distribution_name: The name of the distribution package to query. + :return: List of files composing the distribution. + """ + return distribution(distribution_name).files + + +def requires(distribution_name): + """ + Return a list of requirements for the named package. + + :return: An iterator of requirements, suitable for + packaging.requirement.Requirement. + """ + return distribution(distribution_name).requires + + +def packages_distributions() -> Mapping[str, List[str]]: + """ + Return a mapping of top-level packages to their + distributions. + + >>> import collections.abc + >>> pkgs = packages_distributions() + >>> all(isinstance(dist, collections.abc.Sequence) for dist in pkgs.values()) + True + """ + pkg_to_dist = collections.defaultdict(list) + for dist in distributions(): + for pkg in _top_level_declared(dist) or _top_level_inferred(dist): + pkg_to_dist[pkg].append(dist.metadata['Name']) + return dict(pkg_to_dist) + + +def _top_level_declared(dist): + return (dist.read_text('top_level.txt') or '').split() + + +def _top_level_inferred(dist): + return { + f.parts[0] if len(f.parts) > 1 else f.with_suffix('').name + for f in always_iterable(dist.files) + if f.suffix == ".py" + } diff --git a/metaflow/_vendor/v3_6/importlib_metadata/_adapters.py b/metaflow/_vendor/v3_6/importlib_metadata/_adapters.py new file mode 100644 index 00000000000..aa460d3eda5 --- /dev/null +++ b/metaflow/_vendor/v3_6/importlib_metadata/_adapters.py @@ -0,0 +1,68 @@ +import re +import textwrap +import email.message + +from ._text import FoldedCase + + +class Message(email.message.Message): + multiple_use_keys = set( + map( + FoldedCase, + [ + 'Classifier', + 'Obsoletes-Dist', + 'Platform', + 'Project-URL', + 'Provides-Dist', + 'Provides-Extra', + 'Requires-Dist', + 'Requires-External', + 'Supported-Platform', + 'Dynamic', + ], + ) + ) + """ + Keys that may be indicated multiple times per PEP 566. + """ + + def __new__(cls, orig: email.message.Message): + res = super().__new__(cls) + vars(res).update(vars(orig)) + return res + + def __init__(self, *args, **kwargs): + self._headers = self._repair_headers() + + # suppress spurious error from mypy + def __iter__(self): + return super().__iter__() + + def _repair_headers(self): + def redent(value): + "Correct for RFC822 indentation" + if not value or '\n' not in value: + return value + return textwrap.dedent(' ' * 8 + value) + + headers = [(key, redent(value)) for key, value in vars(self)['_headers']] + if self._payload: + headers.append(('Description', self.get_payload())) + return headers + + @property + def json(self): + """ + Convert PackageMetadata to a JSON-compatible format + per PEP 0566. 
+ """ + + def transform(key): + value = self.get_all(key) if key in self.multiple_use_keys else self[key] + if key == 'Keywords': + value = re.split(r'\s+', value) + tk = key.lower().replace('-', '_') + return tk, value + + return dict(map(transform, map(FoldedCase, self))) diff --git a/metaflow/_vendor/v3_6/importlib_metadata/_collections.py b/metaflow/_vendor/v3_6/importlib_metadata/_collections.py new file mode 100644 index 00000000000..cf0954e1a30 --- /dev/null +++ b/metaflow/_vendor/v3_6/importlib_metadata/_collections.py @@ -0,0 +1,30 @@ +import collections + + +# from jaraco.collections 3.3 +class FreezableDefaultDict(collections.defaultdict): + """ + Often it is desirable to prevent the mutation of + a default dict after its initial construction, such + as to prevent mutation during iteration. + + >>> dd = FreezableDefaultDict(list) + >>> dd[0].append('1') + >>> dd.freeze() + >>> dd[1] + [] + >>> len(dd) + 1 + """ + + def __missing__(self, key): + return getattr(self, '_frozen', super().__missing__)(key) + + def freeze(self): + self._frozen = lambda key: self.default_factory() + + +class Pair(collections.namedtuple('Pair', 'name value')): + @classmethod + def parse(cls, text): + return cls(*map(str.strip, text.split("=", 1))) diff --git a/metaflow/_vendor/v3_6/importlib_metadata/_compat.py b/metaflow/_vendor/v3_6/importlib_metadata/_compat.py new file mode 100644 index 00000000000..3680940f0b0 --- /dev/null +++ b/metaflow/_vendor/v3_6/importlib_metadata/_compat.py @@ -0,0 +1,71 @@ +import sys +import platform + + +__all__ = ['install', 'NullFinder', 'Protocol'] + + +try: + from typing import Protocol +except ImportError: # pragma: no cover + from metaflow._vendor.v3_6.typing_extensions import Protocol # type: ignore + + +def install(cls): + """ + Class decorator for installation on sys.meta_path. + + Adds the backport DistributionFinder to sys.meta_path and + attempts to disable the finder functionality of the stdlib + DistributionFinder. + """ + sys.meta_path.append(cls()) + disable_stdlib_finder() + return cls + + +def disable_stdlib_finder(): + """ + Give the backport primacy for discovering path-based distributions + by monkey-patching the stdlib O_O. + + See #91 for more background for rationale on this sketchy + behavior. + """ + + def matches(finder): + return getattr( + finder, '__module__', None + ) == '_frozen_importlib_external' and hasattr(finder, 'find_distributions') + + for finder in filter(matches, sys.meta_path): # pragma: nocover + del finder.find_distributions + + +class NullFinder: + """ + A "Finder" (aka "MetaClassFinder") that never finds any modules, + but may find distributions. + """ + + @staticmethod + def find_spec(*args, **kwargs): + return None + + # In Python 2, the import system requires finders + # to have a find_module() method, but this usage + # is deprecated in Python 3 in favor of find_spec(). + # For the purposes of this finder (i.e. being present + # on sys.meta_path but having no other import + # system functionality), the two methods are identical. + find_module = find_spec + + +def pypy_partial(val): + """ + Adjust for variable stacklevel on partial under PyPy. + + Workaround for #327. 
+ """ + is_pypy = platform.python_implementation() == 'PyPy' + return val + is_pypy diff --git a/metaflow/_vendor/v3_6/importlib_metadata/_functools.py b/metaflow/_vendor/v3_6/importlib_metadata/_functools.py new file mode 100644 index 00000000000..71f66bd03cb --- /dev/null +++ b/metaflow/_vendor/v3_6/importlib_metadata/_functools.py @@ -0,0 +1,104 @@ +import types +import functools + + +# from jaraco.functools 3.3 +def method_cache(method, cache_wrapper=None): + """ + Wrap lru_cache to support storing the cache data in the object instances. + + Abstracts the common paradigm where the method explicitly saves an + underscore-prefixed protected property on first call and returns that + subsequently. + + >>> class MyClass: + ... calls = 0 + ... + ... @method_cache + ... def method(self, value): + ... self.calls += 1 + ... return value + + >>> a = MyClass() + >>> a.method(3) + 3 + >>> for x in range(75): + ... res = a.method(x) + >>> a.calls + 75 + + Note that the apparent behavior will be exactly like that of lru_cache + except that the cache is stored on each instance, so values in one + instance will not flush values from another, and when an instance is + deleted, so are the cached values for that instance. + + >>> b = MyClass() + >>> for x in range(35): + ... res = b.method(x) + >>> b.calls + 35 + >>> a.method(0) + 0 + >>> a.calls + 75 + + Note that if method had been decorated with ``functools.lru_cache()``, + a.calls would have been 76 (due to the cached value of 0 having been + flushed by the 'b' instance). + + Clear the cache with ``.cache_clear()`` + + >>> a.method.cache_clear() + + Same for a method that hasn't yet been called. + + >>> c = MyClass() + >>> c.method.cache_clear() + + Another cache wrapper may be supplied: + + >>> cache = functools.lru_cache(maxsize=2) + >>> MyClass.method2 = method_cache(lambda self: 3, cache_wrapper=cache) + >>> a = MyClass() + >>> a.method2() + 3 + + Caution - do not subsequently wrap the method with another decorator, such + as ``@property``, which changes the semantics of the function. + + See also + http://code.activestate.com/recipes/577452-a-memoize-decorator-for-instance-methods/ + for another implementation and additional justification. + """ + cache_wrapper = cache_wrapper or functools.lru_cache() + + def wrapper(self, *args, **kwargs): + # it's the first call, replace the method with a cached, bound method + bound_method = types.MethodType(method, self) + cached_method = cache_wrapper(bound_method) + setattr(self, method.__name__, cached_method) + return cached_method(*args, **kwargs) + + # Support cache clear even before cache has been created. + wrapper.cache_clear = lambda: None + + return wrapper + + +# From jaraco.functools 3.3 +def pass_none(func): + """ + Wrap func so it's not called if its first param is None + + >>> print_text = pass_none(print) + >>> print_text('text') + text + >>> print_text(None) + """ + + @functools.wraps(func) + def wrapper(param, *args, **kwargs): + if param is not None: + return func(param, *args, **kwargs) + + return wrapper diff --git a/metaflow/_vendor/v3_6/importlib_metadata/_itertools.py b/metaflow/_vendor/v3_6/importlib_metadata/_itertools.py new file mode 100644 index 00000000000..d4ca9b9140e --- /dev/null +++ b/metaflow/_vendor/v3_6/importlib_metadata/_itertools.py @@ -0,0 +1,73 @@ +from itertools import filterfalse + + +def unique_everseen(iterable, key=None): + "List unique elements, preserving order. Remember all elements ever seen." 
+ # unique_everseen('AAAABBBCCDAABBB') --> A B C D + # unique_everseen('ABBCcAD', str.lower) --> A B C D + seen = set() + seen_add = seen.add + if key is None: + for element in filterfalse(seen.__contains__, iterable): + seen_add(element) + yield element + else: + for element in iterable: + k = key(element) + if k not in seen: + seen_add(k) + yield element + + +# copied from more_itertools 8.8 +def always_iterable(obj, base_type=(str, bytes)): + """If *obj* is iterable, return an iterator over its items:: + + >>> obj = (1, 2, 3) + >>> list(always_iterable(obj)) + [1, 2, 3] + + If *obj* is not iterable, return a one-item iterable containing *obj*:: + + >>> obj = 1 + >>> list(always_iterable(obj)) + [1] + + If *obj* is ``None``, return an empty iterable: + + >>> obj = None + >>> list(always_iterable(None)) + [] + + By default, binary and text strings are not considered iterable:: + + >>> obj = 'foo' + >>> list(always_iterable(obj)) + ['foo'] + + If *base_type* is set, objects for which ``isinstance(obj, base_type)`` + returns ``True`` won't be considered iterable. + + >>> obj = {'a': 1} + >>> list(always_iterable(obj)) # Iterate over the dict's keys + ['a'] + >>> list(always_iterable(obj, base_type=dict)) # Treat dicts as a unit + [{'a': 1}] + + Set *base_type* to ``None`` to avoid any special handling and treat objects + Python considers iterable as iterable: + + >>> obj = 'foo' + >>> list(always_iterable(obj, base_type=None)) + ['f', 'o', 'o'] + """ + if obj is None: + return iter(()) + + if (base_type is not None) and isinstance(obj, base_type): + return iter((obj,)) + + try: + return iter(obj) + except TypeError: + return iter((obj,)) diff --git a/metaflow/_vendor/v3_6/importlib_metadata/_meta.py b/metaflow/_vendor/v3_6/importlib_metadata/_meta.py new file mode 100644 index 00000000000..37ee43e6ef4 --- /dev/null +++ b/metaflow/_vendor/v3_6/importlib_metadata/_meta.py @@ -0,0 +1,48 @@ +from ._compat import Protocol +from typing import Any, Dict, Iterator, List, TypeVar, Union + + +_T = TypeVar("_T") + + +class PackageMetadata(Protocol): + def __len__(self) -> int: + ... # pragma: no cover + + def __contains__(self, item: str) -> bool: + ... # pragma: no cover + + def __getitem__(self, key: str) -> str: + ... # pragma: no cover + + def __iter__(self) -> Iterator[str]: + ... # pragma: no cover + + def get_all(self, name: str, failobj: _T = ...) -> Union[List[Any], _T]: + """ + Return all values associated with a possibly multi-valued key. + """ + + @property + def json(self) -> Dict[str, Union[str, List[str]]]: + """ + A JSON-compatible form of the metadata. + """ + + +class SimplePath(Protocol): + """ + A minimal subset of pathlib.Path required by PathDistribution. + """ + + def joinpath(self) -> 'SimplePath': + ... # pragma: no cover + + def __truediv__(self) -> 'SimplePath': + ... # pragma: no cover + + def parent(self) -> 'SimplePath': + ... # pragma: no cover + + def read_text(self) -> str: + ... # pragma: no cover diff --git a/metaflow/_vendor/v3_6/importlib_metadata/_text.py b/metaflow/_vendor/v3_6/importlib_metadata/_text.py new file mode 100644 index 00000000000..c88cfbb2349 --- /dev/null +++ b/metaflow/_vendor/v3_6/importlib_metadata/_text.py @@ -0,0 +1,99 @@ +import re + +from ._functools import method_cache + + +# from jaraco.text 3.5 +class FoldedCase(str): + """ + A case insensitive string class; behaves just like str + except compares equal when the only variation is case. 
+ + >>> s = FoldedCase('hello world') + + >>> s == 'Hello World' + True + + >>> 'Hello World' == s + True + + >>> s != 'Hello World' + False + + >>> s.index('O') + 4 + + >>> s.split('O') + ['hell', ' w', 'rld'] + + >>> sorted(map(FoldedCase, ['GAMMA', 'alpha', 'Beta'])) + ['alpha', 'Beta', 'GAMMA'] + + Sequence membership is straightforward. + + >>> "Hello World" in [s] + True + >>> s in ["Hello World"] + True + + You may test for set inclusion, but candidate and elements + must both be folded. + + >>> FoldedCase("Hello World") in {s} + True + >>> s in {FoldedCase("Hello World")} + True + + String inclusion works as long as the FoldedCase object + is on the right. + + >>> "hello" in FoldedCase("Hello World") + True + + But not if the FoldedCase object is on the left: + + >>> FoldedCase('hello') in 'Hello World' + False + + In that case, use in_: + + >>> FoldedCase('hello').in_('Hello World') + True + + >>> FoldedCase('hello') > FoldedCase('Hello') + False + """ + + def __lt__(self, other): + return self.lower() < other.lower() + + def __gt__(self, other): + return self.lower() > other.lower() + + def __eq__(self, other): + return self.lower() == other.lower() + + def __ne__(self, other): + return self.lower() != other.lower() + + def __hash__(self): + return hash(self.lower()) + + def __contains__(self, other): + return super().lower().__contains__(other.lower()) + + def in_(self, other): + "Does self appear in other?" + return self in FoldedCase(other) + + # cache lower since it's likely to be called frequently. + @method_cache + def lower(self): + return super().lower() + + def index(self, sub): + return self.lower().index(sub.lower()) + + def split(self, splitter=' ', maxsplit=0): + pattern = re.compile(re.escape(splitter), re.I) + return pattern.split(self, maxsplit) diff --git a/metaflow/plugins/aws/eks/__init__.py b/metaflow/_vendor/v3_6/importlib_metadata/py.typed similarity index 100% rename from metaflow/plugins/aws/eks/__init__.py rename to metaflow/_vendor/v3_6/importlib_metadata/py.typed diff --git a/metaflow/_vendor/v3_6/typing_extensions.LICENSE b/metaflow/_vendor/v3_6/typing_extensions.LICENSE new file mode 100644 index 00000000000..583f9f6e617 --- /dev/null +++ b/metaflow/_vendor/v3_6/typing_extensions.LICENSE @@ -0,0 +1,254 @@ +A. HISTORY OF THE SOFTWARE +========================== + +Python was created in the early 1990s by Guido van Rossum at Stichting +Mathematisch Centrum (CWI, see http://www.cwi.nl) in the Netherlands +as a successor of a language called ABC. Guido remains Python's +principal author, although it includes many contributions from others. + +In 1995, Guido continued his work on Python at the Corporation for +National Research Initiatives (CNRI, see http://www.cnri.reston.va.us) +in Reston, Virginia where he released several versions of the +software. + +In May 2000, Guido and the Python core development team moved to +BeOpen.com to form the BeOpen PythonLabs team. In October of the same +year, the PythonLabs team moved to Digital Creations (now Zope +Corporation, see http://www.zope.com). In 2001, the Python Software +Foundation (PSF, see http://www.python.org/psf/) was formed, a +non-profit organization created specifically to own Python-related +Intellectual Property. Zope Corporation is a sponsoring member of +the PSF. + +All Python releases are Open Source (see http://www.opensource.org for +the Open Source Definition). Historically, most, but not all, Python +releases have also been GPL-compatible; the table below summarizes +the various releases. 
+ + Release Derived Year Owner GPL- + from compatible? (1) + + 0.9.0 thru 1.2 1991-1995 CWI yes + 1.3 thru 1.5.2 1.2 1995-1999 CNRI yes + 1.6 1.5.2 2000 CNRI no + 2.0 1.6 2000 BeOpen.com no + 1.6.1 1.6 2001 CNRI yes (2) + 2.1 2.0+1.6.1 2001 PSF no + 2.0.1 2.0+1.6.1 2001 PSF yes + 2.1.1 2.1+2.0.1 2001 PSF yes + 2.1.2 2.1.1 2002 PSF yes + 2.1.3 2.1.2 2002 PSF yes + 2.2 and above 2.1.1 2001-now PSF yes + +Footnotes: + +(1) GPL-compatible doesn't mean that we're distributing Python under + the GPL. All Python licenses, unlike the GPL, let you distribute + a modified version without making your changes open source. The + GPL-compatible licenses make it possible to combine Python with + other software that is released under the GPL; the others don't. + +(2) According to Richard Stallman, 1.6.1 is not GPL-compatible, + because its license has a choice of law clause. According to + CNRI, however, Stallman's lawyer has told CNRI's lawyer that 1.6.1 + is "not incompatible" with the GPL. + +Thanks to the many outside volunteers who have worked under Guido's +direction to make these releases possible. + + +B. TERMS AND CONDITIONS FOR ACCESSING OR OTHERWISE USING PYTHON +=============================================================== + +PYTHON SOFTWARE FOUNDATION LICENSE VERSION 2 +-------------------------------------------- + +1. This LICENSE AGREEMENT is between the Python Software Foundation +("PSF"), and the Individual or Organization ("Licensee") accessing and +otherwise using this software ("Python") in source or binary form and +its associated documentation. + +2. Subject to the terms and conditions of this License Agreement, PSF hereby +grants Licensee a nonexclusive, royalty-free, world-wide license to reproduce, +analyze, test, perform and/or display publicly, prepare derivative works, +distribute, and otherwise use Python alone or in any derivative version, +provided, however, that PSF's License Agreement and PSF's notice of copyright, +i.e., "Copyright (c) 2001, 2002, 2003, 2004, 2005, 2006, 2007, 2008, 2009, 2010, +2011, 2012, 2013, 2014 Python Software Foundation; All Rights Reserved" are +retained in Python alone or in any derivative version prepared by Licensee. + +3. In the event Licensee prepares a derivative work that is based on +or incorporates Python or any part thereof, and wants to make +the derivative work available to others as provided herein, then +Licensee hereby agrees to include in any such work a brief summary of +the changes made to Python. + +4. PSF is making Python available to Licensee on an "AS IS" +basis. PSF MAKES NO REPRESENTATIONS OR WARRANTIES, EXPRESS OR +IMPLIED. BY WAY OF EXAMPLE, BUT NOT LIMITATION, PSF MAKES NO AND +DISCLAIMS ANY REPRESENTATION OR WARRANTY OF MERCHANTABILITY OR FITNESS +FOR ANY PARTICULAR PURPOSE OR THAT THE USE OF PYTHON WILL NOT +INFRINGE ANY THIRD PARTY RIGHTS. + +5. PSF SHALL NOT BE LIABLE TO LICENSEE OR ANY OTHER USERS OF PYTHON +FOR ANY INCIDENTAL, SPECIAL, OR CONSEQUENTIAL DAMAGES OR LOSS AS +A RESULT OF MODIFYING, DISTRIBUTING, OR OTHERWISE USING PYTHON, +OR ANY DERIVATIVE THEREOF, EVEN IF ADVISED OF THE POSSIBILITY THEREOF. + +6. This License Agreement will automatically terminate upon a material +breach of its terms and conditions. + +7. Nothing in this License Agreement shall be deemed to create any +relationship of agency, partnership, or joint venture between PSF and +Licensee. 
This License Agreement does not grant permission to use PSF +trademarks or trade name in a trademark sense to endorse or promote +products or services of Licensee, or any third party. + +8. By copying, installing or otherwise using Python, Licensee +agrees to be bound by the terms and conditions of this License +Agreement. + + +BEOPEN.COM LICENSE AGREEMENT FOR PYTHON 2.0 +------------------------------------------- + +BEOPEN PYTHON OPEN SOURCE LICENSE AGREEMENT VERSION 1 + +1. This LICENSE AGREEMENT is between BeOpen.com ("BeOpen"), having an +office at 160 Saratoga Avenue, Santa Clara, CA 95051, and the +Individual or Organization ("Licensee") accessing and otherwise using +this software in source or binary form and its associated +documentation ("the Software"). + +2. Subject to the terms and conditions of this BeOpen Python License +Agreement, BeOpen hereby grants Licensee a non-exclusive, +royalty-free, world-wide license to reproduce, analyze, test, perform +and/or display publicly, prepare derivative works, distribute, and +otherwise use the Software alone or in any derivative version, +provided, however, that the BeOpen Python License is retained in the +Software, alone or in any derivative version prepared by Licensee. + +3. BeOpen is making the Software available to Licensee on an "AS IS" +basis. BEOPEN MAKES NO REPRESENTATIONS OR WARRANTIES, EXPRESS OR +IMPLIED. BY WAY OF EXAMPLE, BUT NOT LIMITATION, BEOPEN MAKES NO AND +DISCLAIMS ANY REPRESENTATION OR WARRANTY OF MERCHANTABILITY OR FITNESS +FOR ANY PARTICULAR PURPOSE OR THAT THE USE OF THE SOFTWARE WILL NOT +INFRINGE ANY THIRD PARTY RIGHTS. + +4. BEOPEN SHALL NOT BE LIABLE TO LICENSEE OR ANY OTHER USERS OF THE +SOFTWARE FOR ANY INCIDENTAL, SPECIAL, OR CONSEQUENTIAL DAMAGES OR LOSS +AS A RESULT OF USING, MODIFYING OR DISTRIBUTING THE SOFTWARE, OR ANY +DERIVATIVE THEREOF, EVEN IF ADVISED OF THE POSSIBILITY THEREOF. + +5. This License Agreement will automatically terminate upon a material +breach of its terms and conditions. + +6. This License Agreement shall be governed by and interpreted in all +respects by the law of the State of California, excluding conflict of +law provisions. Nothing in this License Agreement shall be deemed to +create any relationship of agency, partnership, or joint venture +between BeOpen and Licensee. This License Agreement does not grant +permission to use BeOpen trademarks or trade names in a trademark +sense to endorse or promote products or services of Licensee, or any +third party. As an exception, the "BeOpen Python" logos available at +http://www.pythonlabs.com/logos.html may be used according to the +permissions granted on that web page. + +7. By copying, installing or otherwise using the software, Licensee +agrees to be bound by the terms and conditions of this License +Agreement. + + +CNRI LICENSE AGREEMENT FOR PYTHON 1.6.1 +--------------------------------------- + +1. This LICENSE AGREEMENT is between the Corporation for National +Research Initiatives, having an office at 1895 Preston White Drive, +Reston, VA 20191 ("CNRI"), and the Individual or Organization +("Licensee") accessing and otherwise using Python 1.6.1 software in +source or binary form and its associated documentation. + +2. 
Subject to the terms and conditions of this License Agreement, CNRI +hereby grants Licensee a nonexclusive, royalty-free, world-wide +license to reproduce, analyze, test, perform and/or display publicly, +prepare derivative works, distribute, and otherwise use Python 1.6.1 +alone or in any derivative version, provided, however, that CNRI's +License Agreement and CNRI's notice of copyright, i.e., "Copyright (c) +1995-2001 Corporation for National Research Initiatives; All Rights +Reserved" are retained in Python 1.6.1 alone or in any derivative +version prepared by Licensee. Alternately, in lieu of CNRI's License +Agreement, Licensee may substitute the following text (omitting the +quotes): "Python 1.6.1 is made available subject to the terms and +conditions in CNRI's License Agreement. This Agreement together with +Python 1.6.1 may be located on the Internet using the following +unique, persistent identifier (known as a handle): 1895.22/1013. This +Agreement may also be obtained from a proxy server on the Internet +using the following URL: http://hdl.handle.net/1895.22/1013". + +3. In the event Licensee prepares a derivative work that is based on +or incorporates Python 1.6.1 or any part thereof, and wants to make +the derivative work available to others as provided herein, then +Licensee hereby agrees to include in any such work a brief summary of +the changes made to Python 1.6.1. + +4. CNRI is making Python 1.6.1 available to Licensee on an "AS IS" +basis. CNRI MAKES NO REPRESENTATIONS OR WARRANTIES, EXPRESS OR +IMPLIED. BY WAY OF EXAMPLE, BUT NOT LIMITATION, CNRI MAKES NO AND +DISCLAIMS ANY REPRESENTATION OR WARRANTY OF MERCHANTABILITY OR FITNESS +FOR ANY PARTICULAR PURPOSE OR THAT THE USE OF PYTHON 1.6.1 WILL NOT +INFRINGE ANY THIRD PARTY RIGHTS. + +5. CNRI SHALL NOT BE LIABLE TO LICENSEE OR ANY OTHER USERS OF PYTHON +1.6.1 FOR ANY INCIDENTAL, SPECIAL, OR CONSEQUENTIAL DAMAGES OR LOSS AS +A RESULT OF MODIFYING, DISTRIBUTING, OR OTHERWISE USING PYTHON 1.6.1, +OR ANY DERIVATIVE THEREOF, EVEN IF ADVISED OF THE POSSIBILITY THEREOF. + +6. This License Agreement will automatically terminate upon a material +breach of its terms and conditions. + +7. This License Agreement shall be governed by the federal +intellectual property law of the United States, including without +limitation the federal copyright law, and, to the extent such +U.S. federal law does not apply, by the law of the Commonwealth of +Virginia, excluding Virginia's conflict of law provisions. +Notwithstanding the foregoing, with regard to derivative works based +on Python 1.6.1 that incorporate non-separable material that was +previously distributed under the GNU General Public License (GPL), the +law of the Commonwealth of Virginia shall govern this License +Agreement only as to issues arising under or with respect to +Paragraphs 4, 5, and 7 of this License Agreement. Nothing in this +License Agreement shall be deemed to create any relationship of +agency, partnership, or joint venture between CNRI and Licensee. This +License Agreement does not grant permission to use CNRI trademarks or +trade name in a trademark sense to endorse or promote products or +services of Licensee, or any third party. + +8. By clicking on the "ACCEPT" button where indicated, or by copying, +installing or otherwise using Python 1.6.1, Licensee agrees to be +bound by the terms and conditions of this License Agreement. 
+ + ACCEPT + + +CWI LICENSE AGREEMENT FOR PYTHON 0.9.0 THROUGH 1.2 +-------------------------------------------------- + +Copyright (c) 1991 - 1995, Stichting Mathematisch Centrum Amsterdam, +The Netherlands. All rights reserved. + +Permission to use, copy, modify, and distribute this software and its +documentation for any purpose and without fee is hereby granted, +provided that the above copyright notice appear in all copies and that +both that copyright notice and this permission notice appear in +supporting documentation, and that the name of Stichting Mathematisch +Centrum or CWI not be used in advertising or publicity pertaining to +distribution of the software without specific, written prior +permission. + +STICHTING MATHEMATISCH CENTRUM DISCLAIMS ALL WARRANTIES WITH REGARD TO +THIS SOFTWARE, INCLUDING ALL IMPLIED WARRANTIES OF MERCHANTABILITY AND +FITNESS, IN NO EVENT SHALL STICHTING MATHEMATISCH CENTRUM BE LIABLE +FOR ANY SPECIAL, INDIRECT OR CONSEQUENTIAL DAMAGES OR ANY DAMAGES +WHATSOEVER RESULTING FROM LOSS OF USE, DATA OR PROFITS, WHETHER IN AN +ACTION OF CONTRACT, NEGLIGENCE OR OTHER TORTIOUS ACTION, ARISING OUT +OF OR IN CONNECTION WITH THE USE OR PERFORMANCE OF THIS SOFTWARE. diff --git a/metaflow/_vendor/v3_6/typing_extensions.py b/metaflow/_vendor/v3_6/typing_extensions.py new file mode 100644 index 00000000000..43c05bdcd22 --- /dev/null +++ b/metaflow/_vendor/v3_6/typing_extensions.py @@ -0,0 +1,2908 @@ +import abc +import collections +import collections.abc +import operator +import sys +import types as _types +import typing + +# After PEP 560, internal typing API was substantially reworked. +# This is especially important for Protocol class which uses internal APIs +# quite extensively. +PEP_560 = sys.version_info[:3] >= (3, 7, 0) + +if PEP_560: + GenericMeta = type +else: + # 3.6 + from typing import GenericMeta, _type_vars # noqa + + +# Please keep __all__ alphabetized within each category. +__all__ = [ + # Super-special typing primitives. + 'ClassVar', + 'Concatenate', + 'Final', + 'LiteralString', + 'ParamSpec', + 'Self', + 'Type', + 'TypeVarTuple', + 'Unpack', + + # ABCs (from collections.abc). + 'Awaitable', + 'AsyncIterator', + 'AsyncIterable', + 'Coroutine', + 'AsyncGenerator', + 'AsyncContextManager', + 'ChainMap', + + # Concrete collection types. + 'ContextManager', + 'Counter', + 'Deque', + 'DefaultDict', + 'OrderedDict', + 'TypedDict', + + # Structural checks, a.k.a. protocols. + 'SupportsIndex', + + # One-off things. + 'Annotated', + 'assert_never', + 'dataclass_transform', + 'final', + 'IntVar', + 'is_typeddict', + 'Literal', + 'NewType', + 'overload', + 'Protocol', + 'reveal_type', + 'runtime', + 'runtime_checkable', + 'Text', + 'TypeAlias', + 'TypeGuard', + 'TYPE_CHECKING', + 'Never', + 'NoReturn', + 'Required', + 'NotRequired', +] + +if PEP_560: + __all__.extend(["get_args", "get_origin", "get_type_hints"]) + +# The functions below are modified copies of typing internal helpers. +# They are needed by _ProtocolMeta and they provide support for PEP 646. + + +def _no_slots_copy(dct): + dict_copy = dict(dct) + if '__slots__' in dict_copy: + for slot in dict_copy['__slots__']: + dict_copy.pop(slot, None) + return dict_copy + + +_marker = object() + + +def _check_generic(cls, parameters, elen=_marker): + """Check correct count for parameters of a generic cls (internal helper). + This gives a nice error message in case of count mismatch. 
+ """ + if not elen: + raise TypeError(f"{cls} is not a generic class") + if elen is _marker: + if not hasattr(cls, "__parameters__") or not cls.__parameters__: + raise TypeError(f"{cls} is not a generic class") + elen = len(cls.__parameters__) + alen = len(parameters) + if alen != elen: + if hasattr(cls, "__parameters__"): + parameters = [p for p in cls.__parameters__ if not _is_unpack(p)] + num_tv_tuples = sum(isinstance(p, TypeVarTuple) for p in parameters) + if (num_tv_tuples > 0) and (alen >= elen - num_tv_tuples): + return + raise TypeError(f"Too {'many' if alen > elen else 'few'} parameters for {cls};" + f" actual {alen}, expected {elen}") + + +if sys.version_info >= (3, 10): + def _should_collect_from_parameters(t): + return isinstance( + t, (typing._GenericAlias, _types.GenericAlias, _types.UnionType) + ) +elif sys.version_info >= (3, 9): + def _should_collect_from_parameters(t): + return isinstance(t, (typing._GenericAlias, _types.GenericAlias)) +else: + def _should_collect_from_parameters(t): + return isinstance(t, typing._GenericAlias) and not t._special + + +def _collect_type_vars(types, typevar_types=None): + """Collect all type variable contained in types in order of + first appearance (lexicographic order). For example:: + + _collect_type_vars((T, List[S, T])) == (T, S) + """ + if typevar_types is None: + typevar_types = typing.TypeVar + tvars = [] + for t in types: + if ( + isinstance(t, typevar_types) and + t not in tvars and + not _is_unpack(t) + ): + tvars.append(t) + if _should_collect_from_parameters(t): + tvars.extend([t for t in t.__parameters__ if t not in tvars]) + return tuple(tvars) + + +# 3.6.2+ +if hasattr(typing, 'NoReturn'): + NoReturn = typing.NoReturn +# 3.6.0-3.6.1 +else: + class _NoReturn(typing._FinalTypingBase, _root=True): + """Special type indicating functions that never return. + Example:: + + from typing import NoReturn + + def stop() -> NoReturn: + raise Exception('no way') + + This type is invalid in other positions, e.g., ``List[NoReturn]`` + will fail in static type checkers. + """ + __slots__ = () + + def __instancecheck__(self, obj): + raise TypeError("NoReturn cannot be used with isinstance().") + + def __subclasscheck__(self, cls): + raise TypeError("NoReturn cannot be used with issubclass().") + + NoReturn = _NoReturn(_root=True) + +# Some unconstrained type variables. These are used by the container types. +# (These are not for export.) +T = typing.TypeVar('T') # Any type. +KT = typing.TypeVar('KT') # Key type. +VT = typing.TypeVar('VT') # Value type. +T_co = typing.TypeVar('T_co', covariant=True) # Any type covariant containers. +T_contra = typing.TypeVar('T_contra', contravariant=True) # Ditto contravariant. + +ClassVar = typing.ClassVar + +# On older versions of typing there is an internal class named "Final". +# 3.8+ +if hasattr(typing, 'Final') and sys.version_info[:2] >= (3, 7): + Final = typing.Final +# 3.7 +elif sys.version_info[:2] >= (3, 7): + class _FinalForm(typing._SpecialForm, _root=True): + + def __repr__(self): + return 'typing_extensions.' + self._name + + def __getitem__(self, parameters): + item = typing._type_check(parameters, + f'{self._name} accepts only single type') + return typing._GenericAlias(self, (item,)) + + Final = _FinalForm('Final', + doc="""A special typing construct to indicate that a name + cannot be re-assigned or overridden in a subclass. 
+ For example: + + MAX_SIZE: Final = 9000 + MAX_SIZE += 1 # Error reported by type checker + + class Connection: + TIMEOUT: Final[int] = 10 + class FastConnector(Connection): + TIMEOUT = 1 # Error reported by type checker + + There is no runtime checking of these properties.""") +# 3.6 +else: + class _Final(typing._FinalTypingBase, _root=True): + """A special typing construct to indicate that a name + cannot be re-assigned or overridden in a subclass. + For example: + + MAX_SIZE: Final = 9000 + MAX_SIZE += 1 # Error reported by type checker + + class Connection: + TIMEOUT: Final[int] = 10 + class FastConnector(Connection): + TIMEOUT = 1 # Error reported by type checker + + There is no runtime checking of these properties. + """ + + __slots__ = ('__type__',) + + def __init__(self, tp=None, **kwds): + self.__type__ = tp + + def __getitem__(self, item): + cls = type(self) + if self.__type__ is None: + return cls(typing._type_check(item, + f'{cls.__name__[1:]} accepts only single type.'), + _root=True) + raise TypeError(f'{cls.__name__[1:]} cannot be further subscripted') + + def _eval_type(self, globalns, localns): + new_tp = typing._eval_type(self.__type__, globalns, localns) + if new_tp == self.__type__: + return self + return type(self)(new_tp, _root=True) + + def __repr__(self): + r = super().__repr__() + if self.__type__ is not None: + r += f'[{typing._type_repr(self.__type__)}]' + return r + + def __hash__(self): + return hash((type(self).__name__, self.__type__)) + + def __eq__(self, other): + if not isinstance(other, _Final): + return NotImplemented + if self.__type__ is not None: + return self.__type__ == other.__type__ + return self is other + + Final = _Final(_root=True) + + +if sys.version_info >= (3, 11): + final = typing.final +else: + # @final exists in 3.8+, but we backport it for all versions + # before 3.11 to keep support for the __final__ attribute. + # See https://bugs.python.org/issue46342 + def final(f): + """This decorator can be used to indicate to type checkers that + the decorated method cannot be overridden, and decorated class + cannot be subclassed. For example: + + class Base: + @final + def done(self) -> None: + ... + class Sub(Base): + def done(self) -> None: # Error reported by type checker + ... + @final + class Leaf: + ... + class Other(Leaf): # Error reported by type checker + ... + + There is no runtime checking of these properties. The decorator + sets the ``__final__`` attribute to ``True`` on the decorated object + to allow runtime introspection. + """ + try: + f.__final__ = True + except (AttributeError, TypeError): + # Skip the attribute silently if it is not writable. + # AttributeError happens if the object has __slots__ or a + # read-only property, TypeError if it's a builtin class. + pass + return f + + +def IntVar(name): + return typing.TypeVar(name) + + +# 3.8+: +if hasattr(typing, 'Literal'): + Literal = typing.Literal +# 3.7: +elif sys.version_info[:2] >= (3, 7): + class _LiteralForm(typing._SpecialForm, _root=True): + + def __repr__(self): + return 'typing_extensions.' + self._name + + def __getitem__(self, parameters): + return typing._GenericAlias(self, parameters) + + Literal = _LiteralForm('Literal', + doc="""A type that can be used to indicate to type checkers + that the corresponding value has a value literally equivalent + to the provided parameter. For example: + + var: Literal[4] = 4 + + The type checker understands that 'var' is literally equal to + the value 4 and no other value. + + Literal[...] cannot be subclassed. 
There is no runtime + checking verifying that the parameter is actually a value + instead of a type.""") +# 3.6: +else: + class _Literal(typing._FinalTypingBase, _root=True): + """A type that can be used to indicate to type checkers that the + corresponding value has a value literally equivalent to the + provided parameter. For example: + + var: Literal[4] = 4 + + The type checker understands that 'var' is literally equal to the + value 4 and no other value. + + Literal[...] cannot be subclassed. There is no runtime checking + verifying that the parameter is actually a value instead of a type. + """ + + __slots__ = ('__values__',) + + def __init__(self, values=None, **kwds): + self.__values__ = values + + def __getitem__(self, values): + cls = type(self) + if self.__values__ is None: + if not isinstance(values, tuple): + values = (values,) + return cls(values, _root=True) + raise TypeError(f'{cls.__name__[1:]} cannot be further subscripted') + + def _eval_type(self, globalns, localns): + return self + + def __repr__(self): + r = super().__repr__() + if self.__values__ is not None: + r += f'[{", ".join(map(typing._type_repr, self.__values__))}]' + return r + + def __hash__(self): + return hash((type(self).__name__, self.__values__)) + + def __eq__(self, other): + if not isinstance(other, _Literal): + return NotImplemented + if self.__values__ is not None: + return self.__values__ == other.__values__ + return self is other + + Literal = _Literal(_root=True) + + +_overload_dummy = typing._overload_dummy # noqa +overload = typing.overload + + +# This is not a real generic class. Don't use outside annotations. +Type = typing.Type + +# Various ABCs mimicking those in collections.abc. +# A few are simply re-exported for completeness. + + +class _ExtensionsGenericMeta(GenericMeta): + def __subclasscheck__(self, subclass): + """This mimics a more modern GenericMeta.__subclasscheck__() logic + (that does not have problems with recursion) to work around interactions + between collections, typing, and typing_extensions on older + versions of Python, see https://github.com/python/typing/issues/501. 
+ """ + if self.__origin__ is not None: + if sys._getframe(1).f_globals['__name__'] not in ['abc', 'functools']: + raise TypeError("Parameterized generics cannot be used with class " + "or instance checks") + return False + if not self.__extra__: + return super().__subclasscheck__(subclass) + res = self.__extra__.__subclasshook__(subclass) + if res is not NotImplemented: + return res + if self.__extra__ in subclass.__mro__: + return True + for scls in self.__extra__.__subclasses__(): + if isinstance(scls, GenericMeta): + continue + if issubclass(subclass, scls): + return True + return False + + +Awaitable = typing.Awaitable +Coroutine = typing.Coroutine +AsyncIterable = typing.AsyncIterable +AsyncIterator = typing.AsyncIterator + +# 3.6.1+ +if hasattr(typing, 'Deque'): + Deque = typing.Deque +# 3.6.0 +else: + class Deque(collections.deque, typing.MutableSequence[T], + metaclass=_ExtensionsGenericMeta, + extra=collections.deque): + __slots__ = () + + def __new__(cls, *args, **kwds): + if cls._gorg is Deque: + return collections.deque(*args, **kwds) + return typing._generic_new(collections.deque, cls, *args, **kwds) + +ContextManager = typing.ContextManager +# 3.6.2+ +if hasattr(typing, 'AsyncContextManager'): + AsyncContextManager = typing.AsyncContextManager +# 3.6.0-3.6.1 +else: + from _collections_abc import _check_methods as _check_methods_in_mro # noqa + + class AsyncContextManager(typing.Generic[T_co]): + __slots__ = () + + async def __aenter__(self): + return self + + @abc.abstractmethod + async def __aexit__(self, exc_type, exc_value, traceback): + return None + + @classmethod + def __subclasshook__(cls, C): + if cls is AsyncContextManager: + return _check_methods_in_mro(C, "__aenter__", "__aexit__") + return NotImplemented + +DefaultDict = typing.DefaultDict + +# 3.7.2+ +if hasattr(typing, 'OrderedDict'): + OrderedDict = typing.OrderedDict +# 3.7.0-3.7.2 +elif (3, 7, 0) <= sys.version_info[:3] < (3, 7, 2): + OrderedDict = typing._alias(collections.OrderedDict, (KT, VT)) +# 3.6 +else: + class OrderedDict(collections.OrderedDict, typing.MutableMapping[KT, VT], + metaclass=_ExtensionsGenericMeta, + extra=collections.OrderedDict): + + __slots__ = () + + def __new__(cls, *args, **kwds): + if cls._gorg is OrderedDict: + return collections.OrderedDict(*args, **kwds) + return typing._generic_new(collections.OrderedDict, cls, *args, **kwds) + +# 3.6.2+ +if hasattr(typing, 'Counter'): + Counter = typing.Counter +# 3.6.0-3.6.1 +else: + class Counter(collections.Counter, + typing.Dict[T, int], + metaclass=_ExtensionsGenericMeta, extra=collections.Counter): + + __slots__ = () + + def __new__(cls, *args, **kwds): + if cls._gorg is Counter: + return collections.Counter(*args, **kwds) + return typing._generic_new(collections.Counter, cls, *args, **kwds) + +# 3.6.1+ +if hasattr(typing, 'ChainMap'): + ChainMap = typing.ChainMap +elif hasattr(collections, 'ChainMap'): + class ChainMap(collections.ChainMap, typing.MutableMapping[KT, VT], + metaclass=_ExtensionsGenericMeta, + extra=collections.ChainMap): + + __slots__ = () + + def __new__(cls, *args, **kwds): + if cls._gorg is ChainMap: + return collections.ChainMap(*args, **kwds) + return typing._generic_new(collections.ChainMap, cls, *args, **kwds) + +# 3.6.1+ +if hasattr(typing, 'AsyncGenerator'): + AsyncGenerator = typing.AsyncGenerator +# 3.6.0 +else: + class AsyncGenerator(AsyncIterator[T_co], typing.Generic[T_co, T_contra], + metaclass=_ExtensionsGenericMeta, + extra=collections.abc.AsyncGenerator): + __slots__ = () + +NewType = typing.NewType 
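
For orientation, here is a minimal usage sketch of the collection aliases and NewType handled above. It is illustrative only and not part of the vendored file or of this patch; it assumes the names are imported from the public typing_extensions distribution, and the vendored copy under metaflow._vendor.v3_6.typing_extensions exposes the same names.

import collections

from typing_extensions import Counter, Deque, NewType

UserId = NewType("UserId", int)   # distinct type for type checkers; a plain int at runtime


def tally(words: Deque[str]) -> Counter[str]:
    # At runtime these aliases stand in for collections.deque / collections.Counter,
    # resolved from typing when available, otherwise from the backports defined above.
    return collections.Counter(words)


print(tally(collections.deque(["a", "b", "a"])))  # Counter({'a': 2, 'b': 1})
print(UserId(7) + 1)                              # 8 -- NewType adds no runtime wrapper
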
+Text = typing.Text +TYPE_CHECKING = typing.TYPE_CHECKING + + +def _gorg(cls): + """This function exists for compatibility with old typing versions.""" + assert isinstance(cls, GenericMeta) + if hasattr(cls, '_gorg'): + return cls._gorg + while cls.__origin__ is not None: + cls = cls.__origin__ + return cls + + +_PROTO_WHITELIST = ['Callable', 'Awaitable', + 'Iterable', 'Iterator', 'AsyncIterable', 'AsyncIterator', + 'Hashable', 'Sized', 'Container', 'Collection', 'Reversible', + 'ContextManager', 'AsyncContextManager'] + + +def _get_protocol_attrs(cls): + attrs = set() + for base in cls.__mro__[:-1]: # without object + if base.__name__ in ('Protocol', 'Generic'): + continue + annotations = getattr(base, '__annotations__', {}) + for attr in list(base.__dict__.keys()) + list(annotations.keys()): + if (not attr.startswith('_abc_') and attr not in ( + '__abstractmethods__', '__annotations__', '__weakref__', + '_is_protocol', '_is_runtime_protocol', '__dict__', + '__args__', '__slots__', + '__next_in_mro__', '__parameters__', '__origin__', + '__orig_bases__', '__extra__', '__tree_hash__', + '__doc__', '__subclasshook__', '__init__', '__new__', + '__module__', '_MutableMapping__marker', '_gorg')): + attrs.add(attr) + return attrs + + +def _is_callable_members_only(cls): + return all(callable(getattr(cls, attr, None)) for attr in _get_protocol_attrs(cls)) + + +# 3.8+ +if hasattr(typing, 'Protocol'): + Protocol = typing.Protocol +# 3.7 +elif PEP_560: + + def _no_init(self, *args, **kwargs): + if type(self)._is_protocol: + raise TypeError('Protocols cannot be instantiated') + + class _ProtocolMeta(abc.ABCMeta): + # This metaclass is a bit unfortunate and exists only because of the lack + # of __instancehook__. + def __instancecheck__(cls, instance): + # We need this method for situations where attributes are + # assigned in __init__. + if ((not getattr(cls, '_is_protocol', False) or + _is_callable_members_only(cls)) and + issubclass(instance.__class__, cls)): + return True + if cls._is_protocol: + if all(hasattr(instance, attr) and + (not callable(getattr(cls, attr, None)) or + getattr(instance, attr) is not None) + for attr in _get_protocol_attrs(cls)): + return True + return super().__instancecheck__(instance) + + class Protocol(metaclass=_ProtocolMeta): + # There is quite a lot of overlapping code with typing.Generic. + # Unfortunately it is hard to avoid this while these live in two different + # modules. The duplicated code will be removed when Protocol is moved to typing. + """Base class for protocol classes. Protocol classes are defined as:: + + class Proto(Protocol): + def meth(self) -> int: + ... + + Such classes are primarily used with static type checkers that recognize + structural subtyping (static duck-typing), for example:: + + class C: + def meth(self) -> int: + return 0 + + def func(x: Proto) -> int: + return x.meth() + + func(C()) # Passes static type check + + See PEP 544 for details. Protocol classes decorated with + @typing_extensions.runtime act as simple-minded runtime protocol that checks + only the presence of given attributes, ignoring their type signatures. + + Protocol classes can be generic, they are defined as:: + + class GenProto(Protocol[T]): + def meth(self) -> T: + ... 
+ """ + __slots__ = () + _is_protocol = True + + def __new__(cls, *args, **kwds): + if cls is Protocol: + raise TypeError("Type Protocol cannot be instantiated; " + "it can only be used as a base class") + return super().__new__(cls) + + @typing._tp_cache + def __class_getitem__(cls, params): + if not isinstance(params, tuple): + params = (params,) + if not params and cls is not typing.Tuple: + raise TypeError( + f"Parameter list to {cls.__qualname__}[...] cannot be empty") + msg = "Parameters to generic types must be types." + params = tuple(typing._type_check(p, msg) for p in params) # noqa + if cls is Protocol: + # Generic can only be subscripted with unique type variables. + if not all(isinstance(p, typing.TypeVar) for p in params): + i = 0 + while isinstance(params[i], typing.TypeVar): + i += 1 + raise TypeError( + "Parameters to Protocol[...] must all be type variables." + f" Parameter {i + 1} is {params[i]}") + if len(set(params)) != len(params): + raise TypeError( + "Parameters to Protocol[...] must all be unique") + else: + # Subscripting a regular Generic subclass. + _check_generic(cls, params, len(cls.__parameters__)) + return typing._GenericAlias(cls, params) + + def __init_subclass__(cls, *args, **kwargs): + tvars = [] + if '__orig_bases__' in cls.__dict__: + error = typing.Generic in cls.__orig_bases__ + else: + error = typing.Generic in cls.__bases__ + if error: + raise TypeError("Cannot inherit from plain Generic") + if '__orig_bases__' in cls.__dict__: + tvars = typing._collect_type_vars(cls.__orig_bases__) + # Look for Generic[T1, ..., Tn] or Protocol[T1, ..., Tn]. + # If found, tvars must be a subset of it. + # If not found, tvars is it. + # Also check for and reject plain Generic, + # and reject multiple Generic[...] and/or Protocol[...]. + gvars = None + for base in cls.__orig_bases__: + if (isinstance(base, typing._GenericAlias) and + base.__origin__ in (typing.Generic, Protocol)): + # for error messages + the_base = base.__origin__.__name__ + if gvars is not None: + raise TypeError( + "Cannot inherit from Generic[...]" + " and/or Protocol[...] multiple types.") + gvars = base.__parameters__ + if gvars is None: + gvars = tvars + else: + tvarset = set(tvars) + gvarset = set(gvars) + if not tvarset <= gvarset: + s_vars = ', '.join(str(t) for t in tvars if t not in gvarset) + s_args = ', '.join(str(g) for g in gvars) + raise TypeError(f"Some type variables ({s_vars}) are" + f" not listed in {the_base}[{s_args}]") + tvars = gvars + cls.__parameters__ = tuple(tvars) + + # Determine if this is a protocol or a concrete subclass. + if not cls.__dict__.get('_is_protocol', None): + cls._is_protocol = any(b is Protocol for b in cls.__bases__) + + # Set (or override) the protocol subclass hook. 
+ def _proto_hook(other): + if not cls.__dict__.get('_is_protocol', None): + return NotImplemented + if not getattr(cls, '_is_runtime_protocol', False): + if sys._getframe(2).f_globals['__name__'] in ['abc', 'functools']: + return NotImplemented + raise TypeError("Instance and class checks can only be used with" + " @runtime protocols") + if not _is_callable_members_only(cls): + if sys._getframe(2).f_globals['__name__'] in ['abc', 'functools']: + return NotImplemented + raise TypeError("Protocols with non-method members" + " don't support issubclass()") + if not isinstance(other, type): + # Same error as for issubclass(1, int) + raise TypeError('issubclass() arg 1 must be a class') + for attr in _get_protocol_attrs(cls): + for base in other.__mro__: + if attr in base.__dict__: + if base.__dict__[attr] is None: + return NotImplemented + break + annotations = getattr(base, '__annotations__', {}) + if (isinstance(annotations, typing.Mapping) and + attr in annotations and + isinstance(other, _ProtocolMeta) and + other._is_protocol): + break + else: + return NotImplemented + return True + if '__subclasshook__' not in cls.__dict__: + cls.__subclasshook__ = _proto_hook + + # We have nothing more to do for non-protocols. + if not cls._is_protocol: + return + + # Check consistency of bases. + for base in cls.__bases__: + if not (base in (object, typing.Generic) or + base.__module__ == 'collections.abc' and + base.__name__ in _PROTO_WHITELIST or + isinstance(base, _ProtocolMeta) and base._is_protocol): + raise TypeError('Protocols can only inherit from other' + f' protocols, got {repr(base)}') + cls.__init__ = _no_init +# 3.6 +else: + from typing import _next_in_mro, _type_check # noqa + + def _no_init(self, *args, **kwargs): + if type(self)._is_protocol: + raise TypeError('Protocols cannot be instantiated') + + class _ProtocolMeta(GenericMeta): + """Internal metaclass for Protocol. + + This exists so Protocol classes can be generic without deriving + from Generic. + """ + def __new__(cls, name, bases, namespace, + tvars=None, args=None, origin=None, extra=None, orig_bases=None): + # This is just a version copied from GenericMeta.__new__ that + # includes "Protocol" special treatment. (Comments removed for brevity.) + assert extra is None # Protocols should not have extra + if tvars is not None: + assert origin is not None + assert all(isinstance(t, typing.TypeVar) for t in tvars), tvars + else: + tvars = _type_vars(bases) + gvars = None + for base in bases: + if base is typing.Generic: + raise TypeError("Cannot inherit from plain Generic") + if (isinstance(base, GenericMeta) and + base.__origin__ in (typing.Generic, Protocol)): + if gvars is not None: + raise TypeError( + "Cannot inherit from Generic[...] or" + " Protocol[...] 
multiple times.") + gvars = base.__parameters__ + if gvars is None: + gvars = tvars + else: + tvarset = set(tvars) + gvarset = set(gvars) + if not tvarset <= gvarset: + s_vars = ", ".join(str(t) for t in tvars if t not in gvarset) + s_args = ", ".join(str(g) for g in gvars) + cls_name = "Generic" if any(b.__origin__ is typing.Generic + for b in bases) else "Protocol" + raise TypeError(f"Some type variables ({s_vars}) are" + f" not listed in {cls_name}[{s_args}]") + tvars = gvars + + initial_bases = bases + if (extra is not None and type(extra) is abc.ABCMeta and + extra not in bases): + bases = (extra,) + bases + bases = tuple(_gorg(b) if isinstance(b, GenericMeta) else b + for b in bases) + if any(isinstance(b, GenericMeta) and b is not typing.Generic for b in bases): + bases = tuple(b for b in bases if b is not typing.Generic) + namespace.update({'__origin__': origin, '__extra__': extra}) + self = super(GenericMeta, cls).__new__(cls, name, bases, namespace, + _root=True) + super(GenericMeta, self).__setattr__('_gorg', + self if not origin else + _gorg(origin)) + self.__parameters__ = tvars + self.__args__ = tuple(... if a is typing._TypingEllipsis else + () if a is typing._TypingEmpty else + a for a in args) if args else None + self.__next_in_mro__ = _next_in_mro(self) + if orig_bases is None: + self.__orig_bases__ = initial_bases + elif origin is not None: + self._abc_registry = origin._abc_registry + self._abc_cache = origin._abc_cache + if hasattr(self, '_subs_tree'): + self.__tree_hash__ = (hash(self._subs_tree()) if origin else + super(GenericMeta, self).__hash__()) + return self + + def __init__(cls, *args, **kwargs): + super().__init__(*args, **kwargs) + if not cls.__dict__.get('_is_protocol', None): + cls._is_protocol = any(b is Protocol or + isinstance(b, _ProtocolMeta) and + b.__origin__ is Protocol + for b in cls.__bases__) + if cls._is_protocol: + for base in cls.__mro__[1:]: + if not (base in (object, typing.Generic) or + base.__module__ == 'collections.abc' and + base.__name__ in _PROTO_WHITELIST or + isinstance(base, typing.TypingMeta) and base._is_protocol or + isinstance(base, GenericMeta) and + base.__origin__ is typing.Generic): + raise TypeError(f'Protocols can only inherit from other' + f' protocols, got {repr(base)}') + + cls.__init__ = _no_init + + def _proto_hook(other): + if not cls.__dict__.get('_is_protocol', None): + return NotImplemented + if not isinstance(other, type): + # Same error as for issubclass(1, int) + raise TypeError('issubclass() arg 1 must be a class') + for attr in _get_protocol_attrs(cls): + for base in other.__mro__: + if attr in base.__dict__: + if base.__dict__[attr] is None: + return NotImplemented + break + annotations = getattr(base, '__annotations__', {}) + if (isinstance(annotations, typing.Mapping) and + attr in annotations and + isinstance(other, _ProtocolMeta) and + other._is_protocol): + break + else: + return NotImplemented + return True + if '__subclasshook__' not in cls.__dict__: + cls.__subclasshook__ = _proto_hook + + def __instancecheck__(self, instance): + # We need this method for situations where attributes are + # assigned in __init__. 
+ if ((not getattr(self, '_is_protocol', False) or + _is_callable_members_only(self)) and + issubclass(instance.__class__, self)): + return True + if self._is_protocol: + if all(hasattr(instance, attr) and + (not callable(getattr(self, attr, None)) or + getattr(instance, attr) is not None) + for attr in _get_protocol_attrs(self)): + return True + return super(GenericMeta, self).__instancecheck__(instance) + + def __subclasscheck__(self, cls): + if self.__origin__ is not None: + if sys._getframe(1).f_globals['__name__'] not in ['abc', 'functools']: + raise TypeError("Parameterized generics cannot be used with class " + "or instance checks") + return False + if (self.__dict__.get('_is_protocol', None) and + not self.__dict__.get('_is_runtime_protocol', None)): + if sys._getframe(1).f_globals['__name__'] in ['abc', + 'functools', + 'typing']: + return False + raise TypeError("Instance and class checks can only be used with" + " @runtime protocols") + if (self.__dict__.get('_is_runtime_protocol', None) and + not _is_callable_members_only(self)): + if sys._getframe(1).f_globals['__name__'] in ['abc', + 'functools', + 'typing']: + return super(GenericMeta, self).__subclasscheck__(cls) + raise TypeError("Protocols with non-method members" + " don't support issubclass()") + return super(GenericMeta, self).__subclasscheck__(cls) + + @typing._tp_cache + def __getitem__(self, params): + # We also need to copy this from GenericMeta.__getitem__ to get + # special treatment of "Protocol". (Comments removed for brevity.) + if not isinstance(params, tuple): + params = (params,) + if not params and _gorg(self) is not typing.Tuple: + raise TypeError( + f"Parameter list to {self.__qualname__}[...] cannot be empty") + msg = "Parameters to generic types must be types." + params = tuple(_type_check(p, msg) for p in params) + if self in (typing.Generic, Protocol): + if not all(isinstance(p, typing.TypeVar) for p in params): + raise TypeError( + f"Parameters to {repr(self)}[...] must all be type variables") + if len(set(params)) != len(params): + raise TypeError( + f"Parameters to {repr(self)}[...] must all be unique") + tvars = params + args = params + elif self in (typing.Tuple, typing.Callable): + tvars = _type_vars(params) + args = params + elif self.__origin__ in (typing.Generic, Protocol): + raise TypeError(f"Cannot subscript already-subscripted {repr(self)}") + else: + _check_generic(self, params, len(self.__parameters__)) + tvars = _type_vars(params) + args = params + + prepend = (self,) if self.__origin__ is None else () + return self.__class__(self.__name__, + prepend + self.__bases__, + _no_slots_copy(self.__dict__), + tvars=tvars, + args=args, + origin=self, + extra=self.__extra__, + orig_bases=self.__orig_bases__) + + class Protocol(metaclass=_ProtocolMeta): + """Base class for protocol classes. Protocol classes are defined as:: + + class Proto(Protocol): + def meth(self) -> int: + ... + + Such classes are primarily used with static type checkers that recognize + structural subtyping (static duck-typing), for example:: + + class C: + def meth(self) -> int: + return 0 + + def func(x: Proto) -> int: + return x.meth() + + func(C()) # Passes static type check + + See PEP 544 for details. Protocol classes decorated with + @typing_extensions.runtime act as simple-minded runtime protocol that checks + only the presence of given attributes, ignoring their type signatures. + + Protocol classes can be generic, they are defined as:: + + class GenProto(Protocol[T]): + def meth(self) -> T: + ... 
+ """ + __slots__ = () + _is_protocol = True + + def __new__(cls, *args, **kwds): + if _gorg(cls) is Protocol: + raise TypeError("Type Protocol cannot be instantiated; " + "it can be used only as a base class") + return typing._generic_new(cls.__next_in_mro__, cls, *args, **kwds) + + +# 3.8+ +if hasattr(typing, 'runtime_checkable'): + runtime_checkable = typing.runtime_checkable +# 3.6-3.7 +else: + def runtime_checkable(cls): + """Mark a protocol class as a runtime protocol, so that it + can be used with isinstance() and issubclass(). Raise TypeError + if applied to a non-protocol class. + + This allows a simple-minded structural check very similar to the + one-offs in collections.abc such as Hashable. + """ + if not isinstance(cls, _ProtocolMeta) or not cls._is_protocol: + raise TypeError('@runtime_checkable can be only applied to protocol classes,' + f' got {cls!r}') + cls._is_runtime_protocol = True + return cls + + +# Exists for backwards compatibility. +runtime = runtime_checkable + + +# 3.8+ +if hasattr(typing, 'SupportsIndex'): + SupportsIndex = typing.SupportsIndex +# 3.6-3.7 +else: + @runtime_checkable + class SupportsIndex(Protocol): + __slots__ = () + + @abc.abstractmethod + def __index__(self) -> int: + pass + + +if hasattr(typing, "Required"): + # The standard library TypedDict in Python 3.8 does not store runtime information + # about which (if any) keys are optional. See https://bugs.python.org/issue38834 + # The standard library TypedDict in Python 3.9.0/1 does not honour the "total" + # keyword with old-style TypedDict(). See https://bugs.python.org/issue42059 + # The standard library TypedDict below Python 3.11 does not store runtime + # information about optional and required keys when using Required or NotRequired. + TypedDict = typing.TypedDict + _TypedDictMeta = typing._TypedDictMeta + is_typeddict = typing.is_typeddict +else: + def _check_fails(cls, other): + try: + if sys._getframe(1).f_globals['__name__'] not in ['abc', + 'functools', + 'typing']: + # Typed dicts are only for static structural subtyping. 
+ raise TypeError('TypedDict does not support instance and class checks') + except (AttributeError, ValueError): + pass + return False + + def _dict_new(*args, **kwargs): + if not args: + raise TypeError('TypedDict.__new__(): not enough arguments') + _, args = args[0], args[1:] # allow the "cls" keyword be passed + return dict(*args, **kwargs) + + _dict_new.__text_signature__ = '($cls, _typename, _fields=None, /, **kwargs)' + + def _typeddict_new(*args, total=True, **kwargs): + if not args: + raise TypeError('TypedDict.__new__(): not enough arguments') + _, args = args[0], args[1:] # allow the "cls" keyword be passed + if args: + typename, args = args[0], args[1:] # allow the "_typename" keyword be passed + elif '_typename' in kwargs: + typename = kwargs.pop('_typename') + import warnings + warnings.warn("Passing '_typename' as keyword argument is deprecated", + DeprecationWarning, stacklevel=2) + else: + raise TypeError("TypedDict.__new__() missing 1 required positional " + "argument: '_typename'") + if args: + try: + fields, = args # allow the "_fields" keyword be passed + except ValueError: + raise TypeError('TypedDict.__new__() takes from 2 to 3 ' + f'positional arguments but {len(args) + 2} ' + 'were given') + elif '_fields' in kwargs and len(kwargs) == 1: + fields = kwargs.pop('_fields') + import warnings + warnings.warn("Passing '_fields' as keyword argument is deprecated", + DeprecationWarning, stacklevel=2) + else: + fields = None + + if fields is None: + fields = kwargs + elif kwargs: + raise TypeError("TypedDict takes either a dict or keyword arguments," + " but not both") + + ns = {'__annotations__': dict(fields)} + try: + # Setting correct module is necessary to make typed dict classes pickleable. + ns['__module__'] = sys._getframe(1).f_globals.get('__name__', '__main__') + except (AttributeError, ValueError): + pass + + return _TypedDictMeta(typename, (), ns, total=total) + + _typeddict_new.__text_signature__ = ('($cls, _typename, _fields=None,' + ' /, *, total=True, **kwargs)') + + class _TypedDictMeta(type): + def __init__(cls, name, bases, ns, total=True): + super().__init__(name, bases, ns) + + def __new__(cls, name, bases, ns, total=True): + # Create new typed dict class object. + # This method is called directly when TypedDict is subclassed, + # or via _typeddict_new when TypedDict is instantiated. This way + # TypedDict supports all three syntaxes described in its docstring. + # Subclasses and instances of TypedDict return actual dictionaries + # via _dict_new. 
+ ns['__new__'] = _typeddict_new if name == 'TypedDict' else _dict_new + tp_dict = super().__new__(cls, name, (dict,), ns) + + annotations = {} + own_annotations = ns.get('__annotations__', {}) + msg = "TypedDict('Name', {f0: t0, f1: t1, ...}); each t must be a type" + own_annotations = { + n: typing._type_check(tp, msg) for n, tp in own_annotations.items() + } + required_keys = set() + optional_keys = set() + + for base in bases: + annotations.update(base.__dict__.get('__annotations__', {})) + required_keys.update(base.__dict__.get('__required_keys__', ())) + optional_keys.update(base.__dict__.get('__optional_keys__', ())) + + annotations.update(own_annotations) + if PEP_560: + for annotation_key, annotation_type in own_annotations.items(): + annotation_origin = get_origin(annotation_type) + if annotation_origin is Annotated: + annotation_args = get_args(annotation_type) + if annotation_args: + annotation_type = annotation_args[0] + annotation_origin = get_origin(annotation_type) + + if annotation_origin is Required: + required_keys.add(annotation_key) + elif annotation_origin is NotRequired: + optional_keys.add(annotation_key) + elif total: + required_keys.add(annotation_key) + else: + optional_keys.add(annotation_key) + else: + own_annotation_keys = set(own_annotations.keys()) + if total: + required_keys.update(own_annotation_keys) + else: + optional_keys.update(own_annotation_keys) + + tp_dict.__annotations__ = annotations + tp_dict.__required_keys__ = frozenset(required_keys) + tp_dict.__optional_keys__ = frozenset(optional_keys) + if not hasattr(tp_dict, '__total__'): + tp_dict.__total__ = total + return tp_dict + + __instancecheck__ = __subclasscheck__ = _check_fails + + TypedDict = _TypedDictMeta('TypedDict', (dict,), {}) + TypedDict.__module__ = __name__ + TypedDict.__doc__ = \ + """A simple typed name space. At runtime it is equivalent to a plain dict. + + TypedDict creates a dictionary type that expects all of its + instances to have a certain set of keys, with each key + associated with a value of a consistent type. This expectation + is not checked at runtime but is only enforced by type checkers. + Usage:: + + class Point2D(TypedDict): + x: int + y: int + label: str + + a: Point2D = {'x': 1, 'y': 2, 'label': 'good'} # OK + b: Point2D = {'z': 3, 'label': 'bad'} # Fails type check + + assert Point2D(x=1, y=2, label='first') == dict(x=1, y=2, label='first') + + The type info can be accessed via the Point2D.__annotations__ dict, and + the Point2D.__required_keys__ and Point2D.__optional_keys__ frozensets. 
+ TypedDict supports two additional equivalent forms:: + + Point2D = TypedDict('Point2D', x=int, y=int, label=str) + Point2D = TypedDict('Point2D', {'x': int, 'y': int, 'label': str}) + + The class syntax is only supported in Python 3.6+, while two other + syntax forms work for Python 2.7 and 3.2+ + """ + + if hasattr(typing, "_TypedDictMeta"): + _TYPEDDICT_TYPES = (typing._TypedDictMeta, _TypedDictMeta) + else: + _TYPEDDICT_TYPES = (_TypedDictMeta,) + + def is_typeddict(tp): + """Check if an annotation is a TypedDict class + + For example:: + class Film(TypedDict): + title: str + year: int + + is_typeddict(Film) # => True + is_typeddict(Union[list, str]) # => False + """ + return isinstance(tp, tuple(_TYPEDDICT_TYPES)) + +if hasattr(typing, "Required"): + get_type_hints = typing.get_type_hints +elif PEP_560: + import functools + import types + + # replaces _strip_annotations() + def _strip_extras(t): + """Strips Annotated, Required and NotRequired from a given type.""" + if isinstance(t, _AnnotatedAlias): + return _strip_extras(t.__origin__) + if hasattr(t, "__origin__") and t.__origin__ in (Required, NotRequired): + return _strip_extras(t.__args__[0]) + if isinstance(t, typing._GenericAlias): + stripped_args = tuple(_strip_extras(a) for a in t.__args__) + if stripped_args == t.__args__: + return t + return t.copy_with(stripped_args) + if hasattr(types, "GenericAlias") and isinstance(t, types.GenericAlias): + stripped_args = tuple(_strip_extras(a) for a in t.__args__) + if stripped_args == t.__args__: + return t + return types.GenericAlias(t.__origin__, stripped_args) + if hasattr(types, "UnionType") and isinstance(t, types.UnionType): + stripped_args = tuple(_strip_extras(a) for a in t.__args__) + if stripped_args == t.__args__: + return t + return functools.reduce(operator.or_, stripped_args) + + return t + + def get_type_hints(obj, globalns=None, localns=None, include_extras=False): + """Return type hints for an object. + + This is often the same as obj.__annotations__, but it handles + forward references encoded as string literals, adds Optional[t] if a + default value equal to None is set and recursively replaces all + 'Annotated[T, ...]', 'Required[T]' or 'NotRequired[T]' with 'T' + (unless 'include_extras=True'). + + The argument may be a module, class, method, or function. The annotations + are returned as a dictionary. For classes, annotations include also + inherited members. + + TypeError is raised if the argument is not of a type that can contain + annotations, and an empty dictionary is returned if no annotations are + present. + + BEWARE -- the behavior of globalns and localns is counterintuitive + (unless you are familiar with how eval() and exec() work). The + search order is locals first, then globals. + + - If no dict arguments are passed, an attempt is made to use the + globals from obj (or the respective module's globals for classes), + and these are also used as the locals. If the object does not appear + to have globals, an empty dictionary is used. + + - If one dict argument is passed, it is used for both globals and + locals. + + - If two dict arguments are passed, they specify globals and + locals, respectively. 
+ """ + if hasattr(typing, "Annotated"): + hint = typing.get_type_hints( + obj, globalns=globalns, localns=localns, include_extras=True + ) + else: + hint = typing.get_type_hints(obj, globalns=globalns, localns=localns) + if include_extras: + return hint + return {k: _strip_extras(t) for k, t in hint.items()} + + +# Python 3.9+ has PEP 593 (Annotated) +if hasattr(typing, 'Annotated'): + Annotated = typing.Annotated + # Not exported and not a public API, but needed for get_origin() and get_args() + # to work. + _AnnotatedAlias = typing._AnnotatedAlias +# 3.7-3.8 +elif PEP_560: + class _AnnotatedAlias(typing._GenericAlias, _root=True): + """Runtime representation of an annotated type. + + At its core 'Annotated[t, dec1, dec2, ...]' is an alias for the type 't' + with extra annotations. The alias behaves like a normal typing alias, + instantiating is the same as instantiating the underlying type, binding + it to types is also the same. + """ + def __init__(self, origin, metadata): + if isinstance(origin, _AnnotatedAlias): + metadata = origin.__metadata__ + metadata + origin = origin.__origin__ + super().__init__(origin, origin) + self.__metadata__ = metadata + + def copy_with(self, params): + assert len(params) == 1 + new_type = params[0] + return _AnnotatedAlias(new_type, self.__metadata__) + + def __repr__(self): + return (f"typing_extensions.Annotated[{typing._type_repr(self.__origin__)}, " + f"{', '.join(repr(a) for a in self.__metadata__)}]") + + def __reduce__(self): + return operator.getitem, ( + Annotated, (self.__origin__,) + self.__metadata__ + ) + + def __eq__(self, other): + if not isinstance(other, _AnnotatedAlias): + return NotImplemented + if self.__origin__ != other.__origin__: + return False + return self.__metadata__ == other.__metadata__ + + def __hash__(self): + return hash((self.__origin__, self.__metadata__)) + + class Annotated: + """Add context specific metadata to a type. + + Example: Annotated[int, runtime_check.Unsigned] indicates to the + hypothetical runtime_check module that this type is an unsigned int. + Every other consumer of this type can ignore this metadata and treat + this type as int. + + The first argument to Annotated must be a valid type (and will be in + the __origin__ field), the remaining arguments are kept as a tuple in + the __extra__ field. + + Details: + + - It's an error to call `Annotated` with less than two arguments. + - Nested Annotated are flattened:: + + Annotated[Annotated[T, Ann1, Ann2], Ann3] == Annotated[T, Ann1, Ann2, Ann3] + + - Instantiating an annotated type is equivalent to instantiating the + underlying type:: + + Annotated[C, Ann1](5) == C(5) + + - Annotated can be used as a generic type alias:: + + Optimized = Annotated[T, runtime.Optimize()] + Optimized[int] == Annotated[int, runtime.Optimize()] + + OptimizedList = Annotated[List[T], runtime.Optimize()] + OptimizedList[int] == Annotated[List[int], runtime.Optimize()] + """ + + __slots__ = () + + def __new__(cls, *args, **kwargs): + raise TypeError("Type Annotated cannot be instantiated.") + + @typing._tp_cache + def __class_getitem__(cls, params): + if not isinstance(params, tuple) or len(params) < 2: + raise TypeError("Annotated[...] should be used " + "with at least two arguments (a type and an " + "annotation).") + allowed_special_forms = (ClassVar, Final) + if get_origin(params[0]) in allowed_special_forms: + origin = params[0] + else: + msg = "Annotated[t, ...]: t must be a type." 
+ origin = typing._type_check(params[0], msg) + metadata = tuple(params[1:]) + return _AnnotatedAlias(origin, metadata) + + def __init_subclass__(cls, *args, **kwargs): + raise TypeError( + f"Cannot subclass {cls.__module__}.Annotated" + ) +# 3.6 +else: + + def _is_dunder(name): + """Returns True if name is a __dunder_variable_name__.""" + return len(name) > 4 and name.startswith('__') and name.endswith('__') + + # Prior to Python 3.7 types did not have `copy_with`. A lot of the equality + # checks, argument expansion etc. are done on the _subs_tre. As a result we + # can't provide a get_type_hints function that strips out annotations. + + class AnnotatedMeta(typing.GenericMeta): + """Metaclass for Annotated""" + + def __new__(cls, name, bases, namespace, **kwargs): + if any(b is not object for b in bases): + raise TypeError("Cannot subclass " + str(Annotated)) + return super().__new__(cls, name, bases, namespace, **kwargs) + + @property + def __metadata__(self): + return self._subs_tree()[2] + + def _tree_repr(self, tree): + cls, origin, metadata = tree + if not isinstance(origin, tuple): + tp_repr = typing._type_repr(origin) + else: + tp_repr = origin[0]._tree_repr(origin) + metadata_reprs = ", ".join(repr(arg) for arg in metadata) + return f'{cls}[{tp_repr}, {metadata_reprs}]' + + def _subs_tree(self, tvars=None, args=None): # noqa + if self is Annotated: + return Annotated + res = super()._subs_tree(tvars=tvars, args=args) + # Flatten nested Annotated + if isinstance(res[1], tuple) and res[1][0] is Annotated: + sub_tp = res[1][1] + sub_annot = res[1][2] + return (Annotated, sub_tp, sub_annot + res[2]) + return res + + def _get_cons(self): + """Return the class used to create instance of this type.""" + if self.__origin__ is None: + raise TypeError("Cannot get the underlying type of a " + "non-specialized Annotated type.") + tree = self._subs_tree() + while isinstance(tree, tuple) and tree[0] is Annotated: + tree = tree[1] + if isinstance(tree, tuple): + return tree[0] + else: + return tree + + @typing._tp_cache + def __getitem__(self, params): + if not isinstance(params, tuple): + params = (params,) + if self.__origin__ is not None: # specializing an instantiated type + return super().__getitem__(params) + elif not isinstance(params, tuple) or len(params) < 2: + raise TypeError("Annotated[...] should be instantiated " + "with at least two arguments (a type and an " + "annotation).") + else: + if ( + isinstance(params[0], typing._TypingBase) and + type(params[0]).__name__ == "_ClassVar" + ): + tp = params[0] + else: + msg = "Annotated[t, ...]: t must be a type." + tp = typing._type_check(params[0], msg) + metadata = tuple(params[1:]) + return self.__class__( + self.__name__, + self.__bases__, + _no_slots_copy(self.__dict__), + tvars=_type_vars((tp,)), + # Metadata is a tuple so it won't be touched by _replace_args et al. 
+ args=(tp, metadata), + origin=self, + ) + + def __call__(self, *args, **kwargs): + cons = self._get_cons() + result = cons(*args, **kwargs) + try: + result.__orig_class__ = self + except AttributeError: + pass + return result + + def __getattr__(self, attr): + # For simplicity we just don't relay all dunder names + if self.__origin__ is not None and not _is_dunder(attr): + return getattr(self._get_cons(), attr) + raise AttributeError(attr) + + def __setattr__(self, attr, value): + if _is_dunder(attr) or attr.startswith('_abc_'): + super().__setattr__(attr, value) + elif self.__origin__ is None: + raise AttributeError(attr) + else: + setattr(self._get_cons(), attr, value) + + def __instancecheck__(self, obj): + raise TypeError("Annotated cannot be used with isinstance().") + + def __subclasscheck__(self, cls): + raise TypeError("Annotated cannot be used with issubclass().") + + class Annotated(metaclass=AnnotatedMeta): + """Add context specific metadata to a type. + + Example: Annotated[int, runtime_check.Unsigned] indicates to the + hypothetical runtime_check module that this type is an unsigned int. + Every other consumer of this type can ignore this metadata and treat + this type as int. + + The first argument to Annotated must be a valid type, the remaining + arguments are kept as a tuple in the __metadata__ field. + + Details: + + - It's an error to call `Annotated` with less than two arguments. + - Nested Annotated are flattened:: + + Annotated[Annotated[T, Ann1, Ann2], Ann3] == Annotated[T, Ann1, Ann2, Ann3] + + - Instantiating an annotated type is equivalent to instantiating the + underlying type:: + + Annotated[C, Ann1](5) == C(5) + + - Annotated can be used as a generic type alias:: + + Optimized = Annotated[T, runtime.Optimize()] + Optimized[int] == Annotated[int, runtime.Optimize()] + + OptimizedList = Annotated[List[T], runtime.Optimize()] + OptimizedList[int] == Annotated[List[int], runtime.Optimize()] + """ + +# Python 3.8 has get_origin() and get_args() but those implementations aren't +# Annotated-aware, so we can't use those. Python 3.9's versions don't support +# ParamSpecArgs and ParamSpecKwargs, so only Python 3.10's versions will do. +if sys.version_info[:2] >= (3, 10): + get_origin = typing.get_origin + get_args = typing.get_args +# 3.7-3.9 +elif PEP_560: + try: + # 3.9+ + from typing import _BaseGenericAlias + except ImportError: + _BaseGenericAlias = typing._GenericAlias + try: + # 3.9+ + from typing import GenericAlias + except ImportError: + GenericAlias = typing._GenericAlias + + def get_origin(tp): + """Get the unsubscripted version of a type. + + This supports generic types, Callable, Tuple, Union, Literal, Final, ClassVar + and Annotated. Return None for unsupported types. Examples:: + + get_origin(Literal[42]) is Literal + get_origin(int) is None + get_origin(ClassVar[int]) is ClassVar + get_origin(Generic) is Generic + get_origin(Generic[T]) is Generic + get_origin(Union[T, int]) is Union + get_origin(List[Tuple[T, T]][int]) == list + get_origin(P.args) is P + """ + if isinstance(tp, _AnnotatedAlias): + return Annotated + if isinstance(tp, (typing._GenericAlias, GenericAlias, _BaseGenericAlias, + ParamSpecArgs, ParamSpecKwargs)): + return tp.__origin__ + if tp is typing.Generic: + return typing.Generic + return None + + def get_args(tp): + """Get type arguments with all substitutions performed. + + For unions, basic simplifications used by Union constructor are performed. 
+ Examples:: + get_args(Dict[str, int]) == (str, int) + get_args(int) == () + get_args(Union[int, Union[T, int], str][int]) == (int, str) + get_args(Union[int, Tuple[T, int]][str]) == (int, Tuple[str, int]) + get_args(Callable[[], T][int]) == ([], int) + """ + if isinstance(tp, _AnnotatedAlias): + return (tp.__origin__,) + tp.__metadata__ + if isinstance(tp, (typing._GenericAlias, GenericAlias)): + if getattr(tp, "_special", False): + return () + res = tp.__args__ + if get_origin(tp) is collections.abc.Callable and res[0] is not Ellipsis: + res = (list(res[:-1]), res[-1]) + return res + return () + + +# 3.10+ +if hasattr(typing, 'TypeAlias'): + TypeAlias = typing.TypeAlias +# 3.9 +elif sys.version_info[:2] >= (3, 9): + class _TypeAliasForm(typing._SpecialForm, _root=True): + def __repr__(self): + return 'typing_extensions.' + self._name + + @_TypeAliasForm + def TypeAlias(self, parameters): + """Special marker indicating that an assignment should + be recognized as a proper type alias definition by type + checkers. + + For example:: + + Predicate: TypeAlias = Callable[..., bool] + + It's invalid when used anywhere except as in the example above. + """ + raise TypeError(f"{self} is not subscriptable") +# 3.7-3.8 +elif sys.version_info[:2] >= (3, 7): + class _TypeAliasForm(typing._SpecialForm, _root=True): + def __repr__(self): + return 'typing_extensions.' + self._name + + TypeAlias = _TypeAliasForm('TypeAlias', + doc="""Special marker indicating that an assignment should + be recognized as a proper type alias definition by type + checkers. + + For example:: + + Predicate: TypeAlias = Callable[..., bool] + + It's invalid when used anywhere except as in the example + above.""") +# 3.6 +else: + class _TypeAliasMeta(typing.TypingMeta): + """Metaclass for TypeAlias""" + + def __repr__(self): + return 'typing_extensions.TypeAlias' + + class _TypeAliasBase(typing._FinalTypingBase, metaclass=_TypeAliasMeta, _root=True): + """Special marker indicating that an assignment should + be recognized as a proper type alias definition by type + checkers. + + For example:: + + Predicate: TypeAlias = Callable[..., bool] + + It's invalid when used anywhere except as in the example above. + """ + __slots__ = () + + def __instancecheck__(self, obj): + raise TypeError("TypeAlias cannot be used with isinstance().") + + def __subclasscheck__(self, cls): + raise TypeError("TypeAlias cannot be used with issubclass().") + + def __repr__(self): + return 'typing_extensions.TypeAlias' + + TypeAlias = _TypeAliasBase(_root=True) + + +# Python 3.10+ has PEP 612 +if hasattr(typing, 'ParamSpecArgs'): + ParamSpecArgs = typing.ParamSpecArgs + ParamSpecKwargs = typing.ParamSpecKwargs +# 3.6-3.9 +else: + class _Immutable: + """Mixin to indicate that object should not be copied.""" + __slots__ = () + + def __copy__(self): + return self + + def __deepcopy__(self, memo): + return self + + class ParamSpecArgs(_Immutable): + """The args for a ParamSpec object. + + Given a ParamSpec object P, P.args is an instance of ParamSpecArgs. + + ParamSpecArgs objects have a reference back to their ParamSpec: + + P.args.__origin__ is P + + This type is meant for runtime introspection and has no special meaning to + static type checkers. 
+ """ + def __init__(self, origin): + self.__origin__ = origin + + def __repr__(self): + return f"{self.__origin__.__name__}.args" + + def __eq__(self, other): + if not isinstance(other, ParamSpecArgs): + return NotImplemented + return self.__origin__ == other.__origin__ + + class ParamSpecKwargs(_Immutable): + """The kwargs for a ParamSpec object. + + Given a ParamSpec object P, P.kwargs is an instance of ParamSpecKwargs. + + ParamSpecKwargs objects have a reference back to their ParamSpec: + + P.kwargs.__origin__ is P + + This type is meant for runtime introspection and has no special meaning to + static type checkers. + """ + def __init__(self, origin): + self.__origin__ = origin + + def __repr__(self): + return f"{self.__origin__.__name__}.kwargs" + + def __eq__(self, other): + if not isinstance(other, ParamSpecKwargs): + return NotImplemented + return self.__origin__ == other.__origin__ + +# 3.10+ +if hasattr(typing, 'ParamSpec'): + ParamSpec = typing.ParamSpec +# 3.6-3.9 +else: + + # Inherits from list as a workaround for Callable checks in Python < 3.9.2. + class ParamSpec(list): + """Parameter specification variable. + + Usage:: + + P = ParamSpec('P') + + Parameter specification variables exist primarily for the benefit of static + type checkers. They are used to forward the parameter types of one + callable to another callable, a pattern commonly found in higher order + functions and decorators. They are only valid when used in ``Concatenate``, + or s the first argument to ``Callable``. In Python 3.10 and higher, + they are also supported in user-defined Generics at runtime. + See class Generic for more information on generic types. An + example for annotating a decorator:: + + T = TypeVar('T') + P = ParamSpec('P') + + def add_logging(f: Callable[P, T]) -> Callable[P, T]: + '''A type-safe decorator to add logging to a function.''' + def inner(*args: P.args, **kwargs: P.kwargs) -> T: + logging.info(f'{f.__name__} was called') + return f(*args, **kwargs) + return inner + + @add_logging + def add_two(x: float, y: float) -> float: + '''Add two numbers together.''' + return x + y + + Parameter specification variables defined with covariant=True or + contravariant=True can be used to declare covariant or contravariant + generic types. These keyword arguments are valid, but their actual semantics + are yet to be decided. See PEP 612 for details. + + Parameter specification variables can be introspected. e.g.: + + P.__name__ == 'T' + P.__bound__ == None + P.__covariant__ == False + P.__contravariant__ == False + + Note that only parameter specification variables defined in global scope can + be pickled. + """ + + # Trick Generic __parameters__. 
+ __class__ = typing.TypeVar + + @property + def args(self): + return ParamSpecArgs(self) + + @property + def kwargs(self): + return ParamSpecKwargs(self) + + def __init__(self, name, *, bound=None, covariant=False, contravariant=False): + super().__init__([self]) + self.__name__ = name + self.__covariant__ = bool(covariant) + self.__contravariant__ = bool(contravariant) + if bound: + self.__bound__ = typing._type_check(bound, 'Bound must be a type.') + else: + self.__bound__ = None + + # for pickling: + try: + def_mod = sys._getframe(1).f_globals.get('__name__', '__main__') + except (AttributeError, ValueError): + def_mod = None + if def_mod != 'typing_extensions': + self.__module__ = def_mod + + def __repr__(self): + if self.__covariant__: + prefix = '+' + elif self.__contravariant__: + prefix = '-' + else: + prefix = '~' + return prefix + self.__name__ + + def __hash__(self): + return object.__hash__(self) + + def __eq__(self, other): + return self is other + + def __reduce__(self): + return self.__name__ + + # Hack to get typing._type_check to pass. + def __call__(self, *args, **kwargs): + pass + + if not PEP_560: + # Only needed in 3.6. + def _get_type_vars(self, tvars): + if self not in tvars: + tvars.append(self) + + +# 3.6-3.9 +if not hasattr(typing, 'Concatenate'): + # Inherits from list as a workaround for Callable checks in Python < 3.9.2. + class _ConcatenateGenericAlias(list): + + # Trick Generic into looking into this for __parameters__. + if PEP_560: + __class__ = typing._GenericAlias + else: + __class__ = typing._TypingBase + + # Flag in 3.8. + _special = False + # Attribute in 3.6 and earlier. + _gorg = typing.Generic + + def __init__(self, origin, args): + super().__init__(args) + self.__origin__ = origin + self.__args__ = args + + def __repr__(self): + _type_repr = typing._type_repr + return (f'{_type_repr(self.__origin__)}' + f'[{", ".join(_type_repr(arg) for arg in self.__args__)}]') + + def __hash__(self): + return hash((self.__origin__, self.__args__)) + + # Hack to get typing._type_check to pass in Generic. + def __call__(self, *args, **kwargs): + pass + + @property + def __parameters__(self): + return tuple( + tp for tp in self.__args__ if isinstance(tp, (typing.TypeVar, ParamSpec)) + ) + + if not PEP_560: + # Only required in 3.6. + def _get_type_vars(self, tvars): + if self.__origin__ and self.__parameters__: + typing._get_type_vars(self.__parameters__, tvars) + + +# 3.6-3.9 +@typing._tp_cache +def _concatenate_getitem(self, parameters): + if parameters == (): + raise TypeError("Cannot take a Concatenate of no types.") + if not isinstance(parameters, tuple): + parameters = (parameters,) + if not isinstance(parameters[-1], ParamSpec): + raise TypeError("The last parameter to Concatenate should be a " + "ParamSpec variable.") + msg = "Concatenate[arg, ...]: each arg must be a type." + parameters = tuple(typing._type_check(p, msg) for p in parameters) + return _ConcatenateGenericAlias(self, parameters) + + +# 3.10+ +if hasattr(typing, 'Concatenate'): + Concatenate = typing.Concatenate + _ConcatenateGenericAlias = typing._ConcatenateGenericAlias # noqa +# 3.9 +elif sys.version_info[:2] >= (3, 9): + @_TypeAliasForm + def Concatenate(self, parameters): + """Used in conjunction with ``ParamSpec`` and ``Callable`` to represent a + higher order function which adds, removes or transforms parameters of a + callable. + + For example:: + + Callable[Concatenate[int, P], int] + + See PEP 612 for detailed information. 
+ """ + return _concatenate_getitem(self, parameters) +# 3.7-8 +elif sys.version_info[:2] >= (3, 7): + class _ConcatenateForm(typing._SpecialForm, _root=True): + def __repr__(self): + return 'typing_extensions.' + self._name + + def __getitem__(self, parameters): + return _concatenate_getitem(self, parameters) + + Concatenate = _ConcatenateForm( + 'Concatenate', + doc="""Used in conjunction with ``ParamSpec`` and ``Callable`` to represent a + higher order function which adds, removes or transforms parameters of a + callable. + + For example:: + + Callable[Concatenate[int, P], int] + + See PEP 612 for detailed information. + """) +# 3.6 +else: + class _ConcatenateAliasMeta(typing.TypingMeta): + """Metaclass for Concatenate.""" + + def __repr__(self): + return 'typing_extensions.Concatenate' + + class _ConcatenateAliasBase(typing._FinalTypingBase, + metaclass=_ConcatenateAliasMeta, + _root=True): + """Used in conjunction with ``ParamSpec`` and ``Callable`` to represent a + higher order function which adds, removes or transforms parameters of a + callable. + + For example:: + + Callable[Concatenate[int, P], int] + + See PEP 612 for detailed information. + """ + __slots__ = () + + def __instancecheck__(self, obj): + raise TypeError("Concatenate cannot be used with isinstance().") + + def __subclasscheck__(self, cls): + raise TypeError("Concatenate cannot be used with issubclass().") + + def __repr__(self): + return 'typing_extensions.Concatenate' + + def __getitem__(self, parameters): + return _concatenate_getitem(self, parameters) + + Concatenate = _ConcatenateAliasBase(_root=True) + +# 3.10+ +if hasattr(typing, 'TypeGuard'): + TypeGuard = typing.TypeGuard +# 3.9 +elif sys.version_info[:2] >= (3, 9): + class _TypeGuardForm(typing._SpecialForm, _root=True): + def __repr__(self): + return 'typing_extensions.' + self._name + + @_TypeGuardForm + def TypeGuard(self, parameters): + """Special typing form used to annotate the return type of a user-defined + type guard function. ``TypeGuard`` only accepts a single type argument. + At runtime, functions marked this way should return a boolean. + + ``TypeGuard`` aims to benefit *type narrowing* -- a technique used by static + type checkers to determine a more precise type of an expression within a + program's code flow. Usually type narrowing is done by analyzing + conditional code flow and applying the narrowing to a block of code. The + conditional expression here is sometimes referred to as a "type guard". + + Sometimes it would be convenient to use a user-defined boolean function + as a type guard. Such a function should use ``TypeGuard[...]`` as its + return type to alert static type checkers to this intention. + + Using ``-> TypeGuard`` tells the static type checker that for a given + function: + + 1. The return value is a boolean. + 2. If the return value is ``True``, the type of its argument + is the type inside ``TypeGuard``. + + For example:: + + def is_str(val: Union[str, float]): + # "isinstance" type guard + if isinstance(val, str): + # Type of ``val`` is narrowed to ``str`` + ... + else: + # Else, type of ``val`` is narrowed to ``float``. + ... + + Strict type narrowing is not enforced -- ``TypeB`` need not be a narrower + form of ``TypeA`` (it can even be a wider form) and this may lead to + type-unsafe results. The main reason is to allow for things like + narrowing ``List[object]`` to ``List[str]`` even though the latter is not + a subtype of the former, since ``List`` is invariant. 
The responsibility of + writing type-safe type guards is left to the user. + + ``TypeGuard`` also works with type variables. For more information, see + PEP 647 (User-Defined Type Guards). + """ + item = typing._type_check(parameters, f'{self} accepts only single type.') + return typing._GenericAlias(self, (item,)) +# 3.7-3.8 +elif sys.version_info[:2] >= (3, 7): + class _TypeGuardForm(typing._SpecialForm, _root=True): + + def __repr__(self): + return 'typing_extensions.' + self._name + + def __getitem__(self, parameters): + item = typing._type_check(parameters, + f'{self._name} accepts only a single type') + return typing._GenericAlias(self, (item,)) + + TypeGuard = _TypeGuardForm( + 'TypeGuard', + doc="""Special typing form used to annotate the return type of a user-defined + type guard function. ``TypeGuard`` only accepts a single type argument. + At runtime, functions marked this way should return a boolean. + + ``TypeGuard`` aims to benefit *type narrowing* -- a technique used by static + type checkers to determine a more precise type of an expression within a + program's code flow. Usually type narrowing is done by analyzing + conditional code flow and applying the narrowing to a block of code. The + conditional expression here is sometimes referred to as a "type guard". + + Sometimes it would be convenient to use a user-defined boolean function + as a type guard. Such a function should use ``TypeGuard[...]`` as its + return type to alert static type checkers to this intention. + + Using ``-> TypeGuard`` tells the static type checker that for a given + function: + + 1. The return value is a boolean. + 2. If the return value is ``True``, the type of its argument + is the type inside ``TypeGuard``. + + For example:: + + def is_str(val: Union[str, float]): + # "isinstance" type guard + if isinstance(val, str): + # Type of ``val`` is narrowed to ``str`` + ... + else: + # Else, type of ``val`` is narrowed to ``float``. + ... + + Strict type narrowing is not enforced -- ``TypeB`` need not be a narrower + form of ``TypeA`` (it can even be a wider form) and this may lead to + type-unsafe results. The main reason is to allow for things like + narrowing ``List[object]`` to ``List[str]`` even though the latter is not + a subtype of the former, since ``List`` is invariant. The responsibility of + writing type-safe type guards is left to the user. + + ``TypeGuard`` also works with type variables. For more information, see + PEP 647 (User-Defined Type Guards). + """) +# 3.6 +else: + class _TypeGuard(typing._FinalTypingBase, _root=True): + """Special typing form used to annotate the return type of a user-defined + type guard function. ``TypeGuard`` only accepts a single type argument. + At runtime, functions marked this way should return a boolean. + + ``TypeGuard`` aims to benefit *type narrowing* -- a technique used by static + type checkers to determine a more precise type of an expression within a + program's code flow. Usually type narrowing is done by analyzing + conditional code flow and applying the narrowing to a block of code. The + conditional expression here is sometimes referred to as a "type guard". + + Sometimes it would be convenient to use a user-defined boolean function + as a type guard. Such a function should use ``TypeGuard[...]`` as its + return type to alert static type checkers to this intention. + + Using ``-> TypeGuard`` tells the static type checker that for a given + function: + + 1. The return value is a boolean. + 2. 
If the return value is ``True``, the type of its argument + is the type inside ``TypeGuard``. + + For example:: + + def is_str(val: Union[str, float]): + # "isinstance" type guard + if isinstance(val, str): + # Type of ``val`` is narrowed to ``str`` + ... + else: + # Else, type of ``val`` is narrowed to ``float``. + ... + + Strict type narrowing is not enforced -- ``TypeB`` need not be a narrower + form of ``TypeA`` (it can even be a wider form) and this may lead to + type-unsafe results. The main reason is to allow for things like + narrowing ``List[object]`` to ``List[str]`` even though the latter is not + a subtype of the former, since ``List`` is invariant. The responsibility of + writing type-safe type guards is left to the user. + + ``TypeGuard`` also works with type variables. For more information, see + PEP 647 (User-Defined Type Guards). + """ + + __slots__ = ('__type__',) + + def __init__(self, tp=None, **kwds): + self.__type__ = tp + + def __getitem__(self, item): + cls = type(self) + if self.__type__ is None: + return cls(typing._type_check(item, + f'{cls.__name__[1:]} accepts only a single type.'), + _root=True) + raise TypeError(f'{cls.__name__[1:]} cannot be further subscripted') + + def _eval_type(self, globalns, localns): + new_tp = typing._eval_type(self.__type__, globalns, localns) + if new_tp == self.__type__: + return self + return type(self)(new_tp, _root=True) + + def __repr__(self): + r = super().__repr__() + if self.__type__ is not None: + r += f'[{typing._type_repr(self.__type__)}]' + return r + + def __hash__(self): + return hash((type(self).__name__, self.__type__)) + + def __eq__(self, other): + if not isinstance(other, _TypeGuard): + return NotImplemented + if self.__type__ is not None: + return self.__type__ == other.__type__ + return self is other + + TypeGuard = _TypeGuard(_root=True) + + +if sys.version_info[:2] >= (3, 7): + # Vendored from cpython typing._SpecialFrom + class _SpecialForm(typing._Final, _root=True): + __slots__ = ('_name', '__doc__', '_getitem') + + def __init__(self, getitem): + self._getitem = getitem + self._name = getitem.__name__ + self.__doc__ = getitem.__doc__ + + def __getattr__(self, item): + if item in {'__name__', '__qualname__'}: + return self._name + + raise AttributeError(item) + + def __mro_entries__(self, bases): + raise TypeError(f"Cannot subclass {self!r}") + + def __repr__(self): + return f'typing_extensions.{self._name}' + + def __reduce__(self): + return self._name + + def __call__(self, *args, **kwds): + raise TypeError(f"Cannot instantiate {self!r}") + + def __or__(self, other): + return typing.Union[self, other] + + def __ror__(self, other): + return typing.Union[other, self] + + def __instancecheck__(self, obj): + raise TypeError(f"{self} cannot be used with isinstance()") + + def __subclasscheck__(self, cls): + raise TypeError(f"{self} cannot be used with issubclass()") + + @typing._tp_cache + def __getitem__(self, parameters): + return self._getitem(self, parameters) + + +if hasattr(typing, "LiteralString"): + LiteralString = typing.LiteralString +elif sys.version_info[:2] >= (3, 7): + @_SpecialForm + def LiteralString(self, params): + """Represents an arbitrary literal string. + + Example:: + + from metaflow._vendor.v3_6.typing_extensions import LiteralString + + def query(sql: LiteralString) -> ...: + ... + + query("SELECT * FROM table") # ok + query(f"SELECT * FROM {input()}") # not ok + + See PEP 675 for details. 
+ + """ + raise TypeError(f"{self} is not subscriptable") +else: + class _LiteralString(typing._FinalTypingBase, _root=True): + """Represents an arbitrary literal string. + + Example:: + + from metaflow._vendor.v3_6.typing_extensions import LiteralString + + def query(sql: LiteralString) -> ...: + ... + + query("SELECT * FROM table") # ok + query(f"SELECT * FROM {input()}") # not ok + + See PEP 675 for details. + + """ + + __slots__ = () + + def __instancecheck__(self, obj): + raise TypeError(f"{self} cannot be used with isinstance().") + + def __subclasscheck__(self, cls): + raise TypeError(f"{self} cannot be used with issubclass().") + + LiteralString = _LiteralString(_root=True) + + +if hasattr(typing, "Self"): + Self = typing.Self +elif sys.version_info[:2] >= (3, 7): + @_SpecialForm + def Self(self, params): + """Used to spell the type of "self" in classes. + + Example:: + + from typing import Self + + class ReturnsSelf: + def parse(self, data: bytes) -> Self: + ... + return self + + """ + + raise TypeError(f"{self} is not subscriptable") +else: + class _Self(typing._FinalTypingBase, _root=True): + """Used to spell the type of "self" in classes. + + Example:: + + from typing import Self + + class ReturnsSelf: + def parse(self, data: bytes) -> Self: + ... + return self + + """ + + __slots__ = () + + def __instancecheck__(self, obj): + raise TypeError(f"{self} cannot be used with isinstance().") + + def __subclasscheck__(self, cls): + raise TypeError(f"{self} cannot be used with issubclass().") + + Self = _Self(_root=True) + + +if hasattr(typing, "Never"): + Never = typing.Never +elif sys.version_info[:2] >= (3, 7): + @_SpecialForm + def Never(self, params): + """The bottom type, a type that has no members. + + This can be used to define a function that should never be + called, or a function that never returns:: + + from metaflow._vendor.v3_6.typing_extensions import Never + + def never_call_me(arg: Never) -> None: + pass + + def int_or_str(arg: int | str) -> None: + never_call_me(arg) # type checker error + match arg: + case int(): + print("It's an int") + case str(): + print("It's a str") + case _: + never_call_me(arg) # ok, arg is of type Never + + """ + + raise TypeError(f"{self} is not subscriptable") +else: + class _Never(typing._FinalTypingBase, _root=True): + """The bottom type, a type that has no members. + + This can be used to define a function that should never be + called, or a function that never returns:: + + from metaflow._vendor.v3_6.typing_extensions import Never + + def never_call_me(arg: Never) -> None: + pass + + def int_or_str(arg: int | str) -> None: + never_call_me(arg) # type checker error + match arg: + case int(): + print("It's an int") + case str(): + print("It's a str") + case _: + never_call_me(arg) # ok, arg is of type Never + + """ + + __slots__ = () + + def __instancecheck__(self, obj): + raise TypeError(f"{self} cannot be used with isinstance().") + + def __subclasscheck__(self, cls): + raise TypeError(f"{self} cannot be used with issubclass().") + + Never = _Never(_root=True) + + +if hasattr(typing, 'Required'): + Required = typing.Required + NotRequired = typing.NotRequired +elif sys.version_info[:2] >= (3, 9): + class _ExtensionsSpecialForm(typing._SpecialForm, _root=True): + def __repr__(self): + return 'typing_extensions.' + self._name + + @_ExtensionsSpecialForm + def Required(self, parameters): + """A special typing construct to mark a key of a total=False TypedDict + as required. 
For example: + + class Movie(TypedDict, total=False): + title: Required[str] + year: int + + m = Movie( + title='The Matrix', # typechecker error if key is omitted + year=1999, + ) + + There is no runtime checking that a required key is actually provided + when instantiating a related TypedDict. + """ + item = typing._type_check(parameters, f'{self._name} accepts only single type') + return typing._GenericAlias(self, (item,)) + + @_ExtensionsSpecialForm + def NotRequired(self, parameters): + """A special typing construct to mark a key of a TypedDict as + potentially missing. For example: + + class Movie(TypedDict): + title: str + year: NotRequired[int] + + m = Movie( + title='The Matrix', # typechecker error if key is omitted + year=1999, + ) + """ + item = typing._type_check(parameters, f'{self._name} accepts only single type') + return typing._GenericAlias(self, (item,)) + +elif sys.version_info[:2] >= (3, 7): + class _RequiredForm(typing._SpecialForm, _root=True): + def __repr__(self): + return 'typing_extensions.' + self._name + + def __getitem__(self, parameters): + item = typing._type_check(parameters, + '{} accepts only single type'.format(self._name)) + return typing._GenericAlias(self, (item,)) + + Required = _RequiredForm( + 'Required', + doc="""A special typing construct to mark a key of a total=False TypedDict + as required. For example: + + class Movie(TypedDict, total=False): + title: Required[str] + year: int + + m = Movie( + title='The Matrix', # typechecker error if key is omitted + year=1999, + ) + + There is no runtime checking that a required key is actually provided + when instantiating a related TypedDict. + """) + NotRequired = _RequiredForm( + 'NotRequired', + doc="""A special typing construct to mark a key of a TypedDict as + potentially missing. For example: + + class Movie(TypedDict): + title: str + year: NotRequired[int] + + m = Movie( + title='The Matrix', # typechecker error if key is omitted + year=1999, + ) + """) +else: + # NOTE: Modeled after _Final's implementation when _FinalTypingBase available + class _MaybeRequired(typing._FinalTypingBase, _root=True): + __slots__ = ('__type__',) + + def __init__(self, tp=None, **kwds): + self.__type__ = tp + + def __getitem__(self, item): + cls = type(self) + if self.__type__ is None: + return cls(typing._type_check(item, + '{} accepts only single type.'.format(cls.__name__[1:])), + _root=True) + raise TypeError('{} cannot be further subscripted' + .format(cls.__name__[1:])) + + def _eval_type(self, globalns, localns): + new_tp = typing._eval_type(self.__type__, globalns, localns) + if new_tp == self.__type__: + return self + return type(self)(new_tp, _root=True) + + def __repr__(self): + r = super().__repr__() + if self.__type__ is not None: + r += '[{}]'.format(typing._type_repr(self.__type__)) + return r + + def __hash__(self): + return hash((type(self).__name__, self.__type__)) + + def __eq__(self, other): + if not isinstance(other, type(self)): + return NotImplemented + if self.__type__ is not None: + return self.__type__ == other.__type__ + return self is other + + class _Required(_MaybeRequired, _root=True): + """A special typing construct to mark a key of a total=False TypedDict + as required. For example: + + class Movie(TypedDict, total=False): + title: Required[str] + year: int + + m = Movie( + title='The Matrix', # typechecker error if key is omitted + year=1999, + ) + + There is no runtime checking that a required key is actually provided + when instantiating a related TypedDict. 
+ """ + + class _NotRequired(_MaybeRequired, _root=True): + """A special typing construct to mark a key of a TypedDict as + potentially missing. For example: + + class Movie(TypedDict): + title: str + year: NotRequired[int] + + m = Movie( + title='The Matrix', # typechecker error if key is omitted + year=1999, + ) + """ + + Required = _Required(_root=True) + NotRequired = _NotRequired(_root=True) + + +if sys.version_info[:2] >= (3, 9): + class _UnpackSpecialForm(typing._SpecialForm, _root=True): + def __repr__(self): + return 'typing_extensions.' + self._name + + class _UnpackAlias(typing._GenericAlias, _root=True): + __class__ = typing.TypeVar + + @_UnpackSpecialForm + def Unpack(self, parameters): + """A special typing construct to unpack a variadic type. For example: + + Shape = TypeVarTuple('Shape') + Batch = NewType('Batch', int) + + def add_batch_axis( + x: Array[Unpack[Shape]] + ) -> Array[Batch, Unpack[Shape]]: ... + + """ + item = typing._type_check(parameters, f'{self._name} accepts only single type') + return _UnpackAlias(self, (item,)) + + def _is_unpack(obj): + return isinstance(obj, _UnpackAlias) + +elif sys.version_info[:2] >= (3, 7): + class _UnpackAlias(typing._GenericAlias, _root=True): + __class__ = typing.TypeVar + + class _UnpackForm(typing._SpecialForm, _root=True): + def __repr__(self): + return 'typing_extensions.' + self._name + + def __getitem__(self, parameters): + item = typing._type_check(parameters, + f'{self._name} accepts only single type') + return _UnpackAlias(self, (item,)) + + Unpack = _UnpackForm( + 'Unpack', + doc="""A special typing construct to unpack a variadic type. For example: + + Shape = TypeVarTuple('Shape') + Batch = NewType('Batch', int) + + def add_batch_axis( + x: Array[Unpack[Shape]] + ) -> Array[Batch, Unpack[Shape]]: ... + + """) + + def _is_unpack(obj): + return isinstance(obj, _UnpackAlias) + +else: + # NOTE: Modeled after _Final's implementation when _FinalTypingBase available + class _Unpack(typing._FinalTypingBase, _root=True): + """A special typing construct to unpack a variadic type. For example: + + Shape = TypeVarTuple('Shape') + Batch = NewType('Batch', int) + + def add_batch_axis( + x: Array[Unpack[Shape]] + ) -> Array[Batch, Unpack[Shape]]: ... + + """ + __slots__ = ('__type__',) + __class__ = typing.TypeVar + + def __init__(self, tp=None, **kwds): + self.__type__ = tp + + def __getitem__(self, item): + cls = type(self) + if self.__type__ is None: + return cls(typing._type_check(item, + 'Unpack accepts only single type.'), + _root=True) + raise TypeError('Unpack cannot be further subscripted') + + def _eval_type(self, globalns, localns): + new_tp = typing._eval_type(self.__type__, globalns, localns) + if new_tp == self.__type__: + return self + return type(self)(new_tp, _root=True) + + def __repr__(self): + r = super().__repr__() + if self.__type__ is not None: + r += '[{}]'.format(typing._type_repr(self.__type__)) + return r + + def __hash__(self): + return hash((type(self).__name__, self.__type__)) + + def __eq__(self, other): + if not isinstance(other, _Unpack): + return NotImplemented + if self.__type__ is not None: + return self.__type__ == other.__type__ + return self is other + + # For 3.6 only + def _get_type_vars(self, tvars): + self.__type__._get_type_vars(tvars) + + Unpack = _Unpack(_root=True) + + def _is_unpack(obj): + return isinstance(obj, _Unpack) + + +class TypeVarTuple: + """Type variable tuple. 
+ + Usage:: + + Ts = TypeVarTuple('Ts') + + In the same way that a normal type variable is a stand-in for a single + type such as ``int``, a type variable *tuple* is a stand-in for a *tuple* type such as + ``Tuple[int, str]``. + + Type variable tuples can be used in ``Generic`` declarations. + Consider the following example:: + + class Array(Generic[*Ts]): ... + + The ``Ts`` type variable tuple here behaves like ``tuple[T1, T2]``, + where ``T1`` and ``T2`` are type variables. To use these type variables + as type parameters of ``Array``, we must *unpack* the type variable tuple using + the star operator: ``*Ts``. The signature of ``Array`` then behaves + as if we had simply written ``class Array(Generic[T1, T2]): ...``. + In contrast to ``Generic[T1, T2]``, however, ``Generic[*Shape]`` allows + us to parameterise the class with an *arbitrary* number of type parameters. + + Type variable tuples can be used anywhere a normal ``TypeVar`` can. + This includes class definitions, as shown above, as well as function + signatures and variable annotations:: + + class Array(Generic[*Ts]): + + def __init__(self, shape: Tuple[*Ts]): + self._shape: Tuple[*Ts] = shape + + def get_shape(self) -> Tuple[*Ts]: + return self._shape + + shape = (Height(480), Width(640)) + x: Array[Height, Width] = Array(shape) + y = abs(x) # Inferred type is Array[Height, Width] + z = x + x # ... is Array[Height, Width] + x.get_shape() # ... is tuple[Height, Width] + + """ + + # Trick Generic __parameters__. + __class__ = typing.TypeVar + + def __iter__(self): + yield self.__unpacked__ + + def __init__(self, name): + self.__name__ = name + + # for pickling: + try: + def_mod = sys._getframe(1).f_globals.get('__name__', '__main__') + except (AttributeError, ValueError): + def_mod = None + if def_mod != 'typing_extensions': + self.__module__ = def_mod + + self.__unpacked__ = Unpack[self] + + def __repr__(self): + return self.__name__ + + def __hash__(self): + return object.__hash__(self) + + def __eq__(self, other): + return self is other + + def __reduce__(self): + return self.__name__ + + def __init_subclass__(self, *args, **kwds): + if '_root' not in kwds: + raise TypeError("Cannot subclass special typing classes") + + if not PEP_560: + # Only needed in 3.6. + def _get_type_vars(self, tvars): + if self not in tvars: + tvars.append(self) + + +if hasattr(typing, "reveal_type"): + reveal_type = typing.reveal_type +else: + def reveal_type(__obj: T) -> T: + """Reveal the inferred type of a variable. + + When a static type checker encounters a call to ``reveal_type()``, + it will emit the inferred type of the argument:: + + x: int = 1 + reveal_type(x) + + Running a static type checker (e.g., ``mypy``) on this example + will produce output similar to 'Revealed type is "builtins.int"'. + + At runtime, the function prints the runtime type of the + argument and returns it unchanged. + + """ + print(f"Runtime type is {type(__obj).__name__!r}", file=sys.stderr) + return __obj + + +if hasattr(typing, "assert_never"): + assert_never = typing.assert_never +else: + def assert_never(__arg: Never) -> Never: + """Assert to the type checker that a line of code is unreachable. + + Example:: + + def int_or_str(arg: int | str) -> None: + match arg: + case int(): + print("It's an int") + case str(): + print("It's a str") + case _: + assert_never(arg) + + If a type checker finds that a call to assert_never() is + reachable, it will emit an error. + + At runtime, this throws an exception when called. 
+ + """ + raise AssertionError("Expected code to be unreachable") + + +if hasattr(typing, 'dataclass_transform'): + dataclass_transform = typing.dataclass_transform +else: + def dataclass_transform( + *, + eq_default: bool = True, + order_default: bool = False, + kw_only_default: bool = False, + field_descriptors: typing.Tuple[ + typing.Union[typing.Type[typing.Any], typing.Callable[..., typing.Any]], + ... + ] = (), + ) -> typing.Callable[[T], T]: + """Decorator that marks a function, class, or metaclass as providing + dataclass-like behavior. + + Example: + + from metaflow._vendor.v3_6.typing_extensions import dataclass_transform + + _T = TypeVar("_T") + + # Used on a decorator function + @dataclass_transform() + def create_model(cls: type[_T]) -> type[_T]: + ... + return cls + + @create_model + class CustomerModel: + id: int + name: str + + # Used on a base class + @dataclass_transform() + class ModelBase: ... + + class CustomerModel(ModelBase): + id: int + name: str + + # Used on a metaclass + @dataclass_transform() + class ModelMeta(type): ... + + class ModelBase(metaclass=ModelMeta): ... + + class CustomerModel(ModelBase): + id: int + name: str + + Each of the ``CustomerModel`` classes defined in this example will now + behave similarly to a dataclass created with the ``@dataclasses.dataclass`` + decorator. For example, the type checker will synthesize an ``__init__`` + method. + + The arguments to this decorator can be used to customize this behavior: + - ``eq_default`` indicates whether the ``eq`` parameter is assumed to be + True or False if it is omitted by the caller. + - ``order_default`` indicates whether the ``order`` parameter is + assumed to be True or False if it is omitted by the caller. + - ``kw_only_default`` indicates whether the ``kw_only`` parameter is + assumed to be True or False if it is omitted by the caller. + - ``field_descriptors`` specifies a static list of supported classes + or functions, that describe fields, similar to ``dataclasses.field()``. + + At runtime, this decorator records its arguments in the + ``__dataclass_transform__`` attribute on the decorated object. + + See PEP 681 for details. + + """ + def decorator(cls_or_fn): + cls_or_fn.__dataclass_transform__ = { + "eq_default": eq_default, + "order_default": order_default, + "kw_only_default": kw_only_default, + "field_descriptors": field_descriptors, + } + return cls_or_fn + return decorator + + +# We have to do some monkey patching to deal with the dual nature of +# Unpack/TypeVarTuple: +# - We want Unpack to be a kind of TypeVar so it gets accepted in +# Generic[Unpack[Ts]] +# - We want it to *not* be treated as a TypeVar for the purposes of +# counting generic parameters, so that when we subscript a generic, +# the runtime doesn't try to substitute the Unpack with the subscripted type. +if not hasattr(typing, "TypeVarTuple"): + typing._collect_type_vars = _collect_type_vars + typing._check_generic = _check_generic diff --git a/metaflow/_vendor/v3_6/zipp.LICENSE b/metaflow/_vendor/v3_6/zipp.LICENSE new file mode 100644 index 00000000000..353924be0e5 --- /dev/null +++ b/metaflow/_vendor/v3_6/zipp.LICENSE @@ -0,0 +1,19 @@ +Copyright Jason R. 
Coombs + +Permission is hereby granted, free of charge, to any person obtaining a copy +of this software and associated documentation files (the "Software"), to +deal in the Software without restriction, including without limitation the +rights to use, copy, modify, merge, publish, distribute, sublicense, and/or +sell copies of the Software, and to permit persons to whom the Software is +furnished to do so, subject to the following conditions: + +The above copyright notice and this permission notice shall be included in +all copies or substantial portions of the Software. + +THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR +IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, +FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE +AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER +LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING +FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS +IN THE SOFTWARE. diff --git a/metaflow/_vendor/v3_6/zipp.py b/metaflow/_vendor/v3_6/zipp.py new file mode 100644 index 00000000000..26b723c1fd3 --- /dev/null +++ b/metaflow/_vendor/v3_6/zipp.py @@ -0,0 +1,329 @@ +import io +import posixpath +import zipfile +import itertools +import contextlib +import sys +import pathlib + +if sys.version_info < (3, 7): + from collections import OrderedDict +else: + OrderedDict = dict + + +__all__ = ['Path'] + + +def _parents(path): + """ + Given a path with elements separated by + posixpath.sep, generate all parents of that path. + + >>> list(_parents('b/d')) + ['b'] + >>> list(_parents('/b/d/')) + ['/b'] + >>> list(_parents('b/d/f/')) + ['b/d', 'b'] + >>> list(_parents('b')) + [] + >>> list(_parents('')) + [] + """ + return itertools.islice(_ancestry(path), 1, None) + + +def _ancestry(path): + """ + Given a path with elements separated by + posixpath.sep, generate all elements of that path + + >>> list(_ancestry('b/d')) + ['b/d', 'b'] + >>> list(_ancestry('/b/d/')) + ['/b/d', '/b'] + >>> list(_ancestry('b/d/f/')) + ['b/d/f', 'b/d', 'b'] + >>> list(_ancestry('b')) + ['b'] + >>> list(_ancestry('')) + [] + """ + path = path.rstrip(posixpath.sep) + while path and path != posixpath.sep: + yield path + path, tail = posixpath.split(path) + + +_dedupe = OrderedDict.fromkeys +"""Deduplicate an iterable in original order""" + + +def _difference(minuend, subtrahend): + """ + Return items in minuend not in subtrahend, retaining order + with O(1) lookup. + """ + return itertools.filterfalse(set(subtrahend).__contains__, minuend) + + +class CompleteDirs(zipfile.ZipFile): + """ + A ZipFile subclass that ensures that implied directories + are always included in the namelist. + """ + + @staticmethod + def _implied_dirs(names): + parents = itertools.chain.from_iterable(map(_parents, names)) + as_dirs = (p + posixpath.sep for p in parents) + return _dedupe(_difference(as_dirs, names)) + + def namelist(self): + names = super(CompleteDirs, self).namelist() + return names + list(self._implied_dirs(names)) + + def _name_set(self): + return set(self.namelist()) + + def resolve_dir(self, name): + """ + If the name represents a directory, return that name + as a directory (with the trailing slash). 
+ """ + names = self._name_set() + dirname = name + '/' + dir_match = name not in names and dirname in names + return dirname if dir_match else name + + @classmethod + def make(cls, source): + """ + Given a source (filename or zipfile), return an + appropriate CompleteDirs subclass. + """ + if isinstance(source, CompleteDirs): + return source + + if not isinstance(source, zipfile.ZipFile): + return cls(_pathlib_compat(source)) + + # Only allow for FastLookup when supplied zipfile is read-only + if 'r' not in source.mode: + cls = CompleteDirs + + source.__class__ = cls + return source + + +class FastLookup(CompleteDirs): + """ + ZipFile subclass to ensure implicit + dirs exist and are resolved rapidly. + """ + + def namelist(self): + with contextlib.suppress(AttributeError): + return self.__names + self.__names = super(FastLookup, self).namelist() + return self.__names + + def _name_set(self): + with contextlib.suppress(AttributeError): + return self.__lookup + self.__lookup = super(FastLookup, self)._name_set() + return self.__lookup + + +def _pathlib_compat(path): + """ + For path-like objects, convert to a filename for compatibility + on Python 3.6.1 and earlier. + """ + try: + return path.__fspath__() + except AttributeError: + return str(path) + + +class Path: + """ + A pathlib-compatible interface for zip files. + + Consider a zip file with this structure:: + + . + ├── a.txt + └── b + ├── c.txt + └── d + └── e.txt + + >>> data = io.BytesIO() + >>> zf = zipfile.ZipFile(data, 'w') + >>> zf.writestr('a.txt', 'content of a') + >>> zf.writestr('b/c.txt', 'content of c') + >>> zf.writestr('b/d/e.txt', 'content of e') + >>> zf.filename = 'mem/abcde.zip' + + Path accepts the zipfile object itself or a filename + + >>> root = Path(zf) + + From there, several path operations are available. + + Directory iteration (including the zip file itself): + + >>> a, b = root.iterdir() + >>> a + Path('mem/abcde.zip', 'a.txt') + >>> b + Path('mem/abcde.zip', 'b/') + + name property: + + >>> b.name + 'b' + + join with divide operator: + + >>> c = b / 'c.txt' + >>> c + Path('mem/abcde.zip', 'b/c.txt') + >>> c.name + 'c.txt' + + Read text: + + >>> c.read_text() + 'content of c' + + existence: + + >>> c.exists() + True + >>> (b / 'missing.txt').exists() + False + + Coercion to string: + + >>> import os + >>> str(c).replace(os.sep, posixpath.sep) + 'mem/abcde.zip/b/c.txt' + + At the root, ``name``, ``filename``, and ``parent`` + resolve to the zipfile. Note these attributes are not + valid and will raise a ``ValueError`` if the zipfile + has no filename. + + >>> root.name + 'abcde.zip' + >>> str(root.filename).replace(os.sep, posixpath.sep) + 'mem/abcde.zip' + >>> str(root.parent) + 'mem' + """ + + __repr = "{self.__class__.__name__}({self.root.filename!r}, {self.at!r})" + + def __init__(self, root, at=""): + """ + Construct a Path from a ZipFile or filename. + + Note: When the source is an existing ZipFile object, + its type (__class__) will be mutated to a + specialized type. If the caller wishes to retain the + original type, the caller should either create a + separate ZipFile object or pass a filename. + """ + self.root = FastLookup.make(root) + self.at = at + + def open(self, mode='r', *args, pwd=None, **kwargs): + """ + Open this entry as text or binary following the semantics + of ``pathlib.Path.open()`` by passing arguments through + to io.TextIOWrapper(). 
+ """ + if self.is_dir(): + raise IsADirectoryError(self) + zip_mode = mode[0] + if not self.exists() and zip_mode == 'r': + raise FileNotFoundError(self) + stream = self.root.open(self.at, zip_mode, pwd=pwd) + if 'b' in mode: + if args or kwargs: + raise ValueError("encoding args invalid for binary operation") + return stream + return io.TextIOWrapper(stream, *args, **kwargs) + + @property + def name(self): + return pathlib.Path(self.at).name or self.filename.name + + @property + def suffix(self): + return pathlib.Path(self.at).suffix or self.filename.suffix + + @property + def suffixes(self): + return pathlib.Path(self.at).suffixes or self.filename.suffixes + + @property + def stem(self): + return pathlib.Path(self.at).stem or self.filename.stem + + @property + def filename(self): + return pathlib.Path(self.root.filename).joinpath(self.at) + + def read_text(self, *args, **kwargs): + with self.open('r', *args, **kwargs) as strm: + return strm.read() + + def read_bytes(self): + with self.open('rb') as strm: + return strm.read() + + def _is_child(self, path): + return posixpath.dirname(path.at.rstrip("/")) == self.at.rstrip("/") + + def _next(self, at): + return self.__class__(self.root, at) + + def is_dir(self): + return not self.at or self.at.endswith("/") + + def is_file(self): + return self.exists() and not self.is_dir() + + def exists(self): + return self.at in self.root._name_set() + + def iterdir(self): + if not self.is_dir(): + raise ValueError("Can't listdir a file") + subs = map(self._next, self.root.namelist()) + return filter(self._is_child, subs) + + def __str__(self): + return posixpath.join(self.root.filename, self.at) + + def __repr__(self): + return self.__repr.format(self=self) + + def joinpath(self, *other): + next = posixpath.join(self.at, *map(_pathlib_compat, other)) + return self._next(self.root.resolve_dir(next)) + + __truediv__ = joinpath + + @property + def parent(self): + if not self.at: + return self.filename.parent + parent_at = posixpath.dirname(self.at.rstrip('/')) + if parent_at: + parent_at += '/' + return self._next(parent_at) diff --git a/metaflow/_vendor/vendor_any.txt b/metaflow/_vendor/vendor_any.txt new file mode 100644 index 00000000000..f7fc59bf3bb --- /dev/null +++ b/metaflow/_vendor/vendor_any.txt @@ -0,0 +1 @@ +click==7.1.2 diff --git a/metaflow/_vendor/vendor.txt b/metaflow/_vendor/vendor_v3_5.txt similarity index 66% rename from metaflow/_vendor/vendor.txt rename to metaflow/_vendor/vendor_v3_5.txt index 168fe394978..43295d035a6 100644 --- a/metaflow/_vendor/vendor.txt +++ b/metaflow/_vendor/vendor_v3_5.txt @@ -1,2 +1 @@ -click==7.1.2 importlib_metadata==2.1.3 diff --git a/metaflow/_vendor/vendor_v3_6.txt b/metaflow/_vendor/vendor_v3_6.txt new file mode 100644 index 00000000000..7ab36b23bb5 --- /dev/null +++ b/metaflow/_vendor/vendor_v3_6.txt @@ -0,0 +1 @@ +importlib_metadata==4.8.3 \ No newline at end of file diff --git a/metaflow/cli.py b/metaflow/cli.py index 2674757aa82..1bc1ba711a0 100644 --- a/metaflow/cli.py +++ b/metaflow/cli.py @@ -1,5 +1,4 @@ import inspect -import os import sys import traceback from datetime import datetime @@ -15,6 +14,7 @@ from . import namespace from . 
import current from .cli_args import cli_args +from .tagging_util import validate_tags from .util import ( resolve_identity, decompress_list, @@ -24,11 +24,12 @@ from .task import MetaflowTask from .exception import CommandException, MetaflowException from .graph import FlowGraph -from .datastore import DATASTORES, FlowDataStore, TaskDataStoreSet, TaskDataStore +from .datastore import FlowDataStore, TaskDataStoreSet, TaskDataStore from .runtime import NativeRuntime from .package import MetaflowPackage from .plugins import ( + DATASTORES, ENVIRONMENTS, LOGGING_SIDECARS, METADATA_PROVIDERS, @@ -44,8 +45,6 @@ ) from .metaflow_environment import MetaflowEnvironment from .pylint_wrapper import PyLint -from .event_logger import EventLogger -from .monitor import Monitor from .R import use_r, metaflow_r_version from .mflog import mflog, LOG_SOURCES from .unbounded_foreach import UBF_CONTROL, UBF_TASK @@ -111,7 +110,7 @@ def echo_always(line, **kwargs): click.secho(ERASE_TO_EOL, **kwargs) -def logger(body="", system_msg=False, head="", bad=False, timestamp=True): +def logger(body="", system_msg=False, head="", bad=False, timestamp=True, nl=True): if timestamp: if timestamp is True: dt = datetime.now() @@ -121,7 +120,7 @@ def logger(body="", system_msg=False, head="", bad=False, timestamp=True): click.secho(tstamp + " ", fg=LOGGER_TIMESTAMP, nl=False) if head: click.secho(head, fg=LOGGER_COLOR, nl=False) - click.secho(body, bold=system_msg, fg=LOGGER_BAD_COLOR if bad else None) + click.secho(body, bold=system_msg, fg=LOGGER_BAD_COLOR if bad else None, nl=nl) @click.group() @@ -174,10 +173,22 @@ def help(ctx): @cli.command(help="Output internal state of the flow graph.") +@click.option("--json", is_flag=True, help="Output the flow graph in JSON format.") @click.pass_obj -def output_raw(obj): - echo("Internal representation of the flow:", fg="magenta", bold=False) - echo_always(str(obj.graph), err=False) +def output_raw(obj, json): + if json: + import json as _json + + _msg = "Internal representation of the flow in JSON format:" + _graph_dict, _graph_struct = obj.graph.output_steps() + _graph = _json.dumps( + dict(graph=_graph_dict, graph_structure=_graph_struct), indent=4 + ) + else: + _graph = str(obj.graph) + _msg = "Internal representation of the flow:" + echo(_msg, fg="magenta", bold=False) + echo_always(_graph, err=False) @cli.command(help="Visualize the flow with Graphviz.") @@ -240,7 +251,7 @@ def dump(obj, input_path, private=None, max_value_size=None, include=None, file= run_id, step_name, task_id = parts else: raise CommandException( - "input_path should either be run_id/step_name" "or run_id/step_name/task_id" + "input_path should either be run_id/step_name or run_id/step_name/task_id" ) datastore_set = TaskDataStoreSet( @@ -386,7 +397,7 @@ def echo_unicode(line, **kwargs): log = ds.load_log_legacy(stream) if log and timestamps: raise CommandException( - "We can't show --timestamps for " "old runs. Sorry!" + "We can't show --timestamps for old runs. Sorry!" 
) echo_unicode(log, nl=False) else: @@ -416,7 +427,13 @@ def echo_unicode(line, **kwargs): ) @click.option( "--input-paths", - help="A comma-separated list of pathspecs " "specifying inputs for this step.", + help="A comma-separated list of pathspecs specifying inputs for this step.", +) +@click.option( + "--input-paths-filename", + type=click.Path(exists=True, readable=True, dir_okay=False, resolve_path=True), + help="A filename containing the argument typically passed to `input-paths`", + hidden=True, ) @click.option( "--split-index", @@ -438,7 +455,7 @@ def echo_unicode(line, **kwargs): "--namespace", "opt_namespace", default=None, - help="Change namespace from the default (your username) to " "the specified tag.", + help="Change namespace from the default (your username) to the specified tag.", ) @click.option( "--retry-count", @@ -456,10 +473,17 @@ def echo_unicode(line, **kwargs): help="Pathspec of the origin task for this task to clone. Do " "not execute anything.", ) +@click.option( + "--clone-wait-only/--no-clone-wait-only", + default=False, + show_default=True, + help="If specified, waits for an external process to clone the task", + hidden=True, +) @click.option( "--clone-run-id", default=None, - help="Run id of the origin flow, if this task is part of a flow " "being resumed.", + help="Run id of the origin flow, if this task is part of a flow being resumed.", ) @click.option( "--with", @@ -473,7 +497,7 @@ def echo_unicode(line, **kwargs): "--ubf-context", default="none", type=click.Choice(["none", UBF_CONTROL, UBF_TASK]), - help="Provides additional context if this task is of type " "unbounded foreach.", + help="Provides additional context if this task is of type unbounded foreach.", ) @click.option( "--num-parallel", @@ -489,11 +513,13 @@ def step( run_id=None, task_id=None, input_paths=None, + input_paths_filename=None, split_index=None, opt_namespace=None, retry_count=None, max_user_code_retries=None, clone_only=None, + clone_wait_only=False, clone_run_id=None, decospecs=None, ubf_context="none", @@ -526,6 +552,10 @@ def step( cli_args._set_step_kwargs(step_kwargs) ctx.obj.metadata.add_sticky_tags(tags=opt_tag) + if not input_paths and input_paths_filename: + with open(input_paths_filename, mode="r", encoding="utf-8") as f: + input_paths = f.read().strip(" \n\"'") + paths = decompress_list(input_paths) if input_paths else [] task = MetaflowTask( @@ -539,7 +569,14 @@ def step( ubf_context, ) if clone_only: - task.clone_only(step_name, run_id, task_id, clone_only, retry_count) + task.clone_only( + step_name, + run_id, + task_id, + clone_only, + retry_count, + wait_only=clone_wait_only, + ) else: task.run_step( step_name, @@ -662,6 +699,26 @@ def wrapper(*args, **kwargs): help="ID of the run that should be resumed. By default, the " "last run executed locally.", ) +@click.option( + "--run-id", + default=None, + help="Run ID for the new run. 
By default, a new run-id will be generated", + hidden=True, +) +@click.option( + "--clone-only/--no-clone-only", + default=False, + show_default=True, + help="Only clone tasks without continuing execution", + hidden=True, +) +@click.option( + "--reentrant/--no-reentrant", + default=False, + show_default=True, + hidden=True, + help="If specified, allows this call to be called in parallel", +) @click.argument("step-to-rerun", required=False) @cli.command(help="Resume execution of a previous run of this flow.") @common_run_options @@ -671,6 +728,9 @@ def resume( tags=None, step_to_rerun=None, origin_run_id=None, + run_id=None, + clone_only=False, + reentrant=False, max_workers=None, max_num_splits=None, max_log_size=None, @@ -700,6 +760,17 @@ def resume( ) clone_steps = {step_to_rerun} + if run_id: + # Run-ids that are provided by the metadata service are always integers. + # External providers or run-ids (like external schedulers) always need to + # be non-integers to avoid any clashes. This condition ensures this. + try: + int(run_id) + except: + pass + else: + raise CommandException("run-id %s cannot be an integer" % run_id) + runtime = NativeRuntime( obj.flow, obj.graph, @@ -711,17 +782,19 @@ def resume( obj.entrypoint, obj.event_logger, obj.monitor, + run_id=run_id, clone_run_id=origin_run_id, + clone_only=clone_only, + reentrant=reentrant, clone_steps=clone_steps, max_workers=max_workers, max_num_splits=max_num_splits, max_log_size=max_log_size * 1024 * 1024, ) + write_run_id(run_id_file, runtime.run_id) runtime.persist_constants() runtime.execute() - write_run_id(run_id_file, runtime.run_id) - @parameters.add_custom_parameters(deploy_mode=True) @cli.command(help="Run the workflow locally.") @@ -784,6 +857,8 @@ def write_run_id(run_id_file, run_id): def before_run(obj, tags, decospecs): + validate_tags(tags) + # There's a --with option both at the top-level and for the run # subcommand. Why? # @@ -791,8 +866,8 @@ def before_run(obj, tags, decospecs): # This is a very common use case of --with. # # A downside is that we need to have the following decorators handling - # in two places in this module and we need to make sure that - # _init_step_decorators doesn't get called twice. + # in two places in this module and make sure _init_step_decorators + # doesn't get called twice. if decospecs: decorators._attach_decorators(obj.flow, decospecs) obj.graph = FlowGraph(obj.flow.__class__) @@ -802,6 +877,7 @@ def before_run(obj, tags, decospecs): decorators._init_step_decorators( obj.flow, obj.graph, obj.environment, obj.flow_datastore, obj.logger ) + obj.metadata.add_sticky_tags(tags=tags) # Package working directory only once per run. 
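One behavior introduced above is easy to miss: `resume --run-id` accepts only identifiers that do not parse as integers, because integer run-ids are reserved for the metadata service. A minimal standalone sketch of that rule, for illustration only (the `validate_external_run_id` helper is hypothetical, not a Metaflow API):

    def validate_external_run_id(run_id):
        # Integer-looking ids are reserved for the metadata service; externally
        # supplied ids (e.g. from a scheduler) must never collide with them.
        try:
            int(run_id)
        except ValueError:
            return run_id
        raise ValueError("run-id %s cannot be an integer" % run_id)

    validate_external_run_id("argo-2023-01-11-a")  # accepted
    # validate_external_run_id("42")               # raises ValueError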
@@ -848,13 +924,13 @@ def version(obj): "--datastore", default=DEFAULT_DATASTORE, show_default=True, - type=click.Choice(DATASTORES), + type=click.Choice([d.TYPE for d in DATASTORES]), help="Data backend type", ) @click.option("--datastore-root", help="Root path for datastore") @click.option( "--package-suffixes", - help="A comma-separated list of file suffixes to include " "in the code package.", + help="A comma-separated list of file suffixes to include in the code package.", default=DEFAULT_PACKAGE_SUFFIXES, show_default=True, ) @@ -918,6 +994,7 @@ def start( cli_args._set_top_kwargs(ctx.params) ctx.obj.echo = echo ctx.obj.echo_always = echo_always + ctx.obj.is_quiet = quiet ctx.obj.graph = FlowGraph(ctx.obj.flow.__class__) ctx.obj.logger = logger ctx.obj.check = _check @@ -926,21 +1003,26 @@ def start( ctx.obj.package_suffixes = package_suffixes.split(",") ctx.obj.reconstruct_cli = _reconstruct_cli - ctx.obj.event_logger = EventLogger(event_logger) - ctx.obj.environment = [ e for e in ENVIRONMENTS + [MetaflowEnvironment] if e.TYPE == environment ][0](ctx.obj.flow) - ctx.obj.environment.validate_environment(echo) + ctx.obj.environment.validate_environment(echo, datastore) + + ctx.obj.event_logger = LOGGING_SIDECARS[event_logger]( + flow=ctx.obj.flow, env=ctx.obj.environment + ) + ctx.obj.event_logger.start() - ctx.obj.monitor = Monitor(monitor, ctx.obj.environment, ctx.obj.flow.name) + ctx.obj.monitor = MONITOR_SIDECARS[monitor]( + flow=ctx.obj.flow, env=ctx.obj.environment + ) ctx.obj.monitor.start() ctx.obj.metadata = [m for m in METADATA_PROVIDERS if m.TYPE == metadata][0]( ctx.obj.environment, ctx.obj.flow, ctx.obj.event_logger, ctx.obj.monitor ) - ctx.obj.datastore_impl = DATASTORES[datastore] + ctx.obj.datastore_impl = [d for d in DATASTORES if d.TYPE == datastore][0] if datastore_root is None: datastore_root = ctx.obj.datastore_impl.get_datastore_root_from_config( @@ -964,7 +1046,7 @@ def start( ) # It is important to initialize flow decorators early as some of the - # things they provide may be used by some of the objects initialize after. + # things they provide may be used by some of the objects initialized after. decorators._init_flow_decorators( ctx.obj.flow, ctx.obj.graph, @@ -1105,3 +1187,8 @@ def main(flow, args=None, handle_exceptions=True, entrypoint=None): sys.exit(1) else: raise + finally: + if hasattr(state, "monitor") and state.monitor is not None: + state.monitor.terminate() + if hasattr(state, "event_logger") and state.event_logger is not None: + state.event_logger.terminate() diff --git a/metaflow/cli_args.py b/metaflow/cli_args.py index 680fd1bdb3d..40918f984ff 100644 --- a/metaflow/cli_args.py +++ b/metaflow/cli_args.py @@ -57,7 +57,7 @@ def _options(mapping): for k, v in mapping.items(): # None or False arguments are ignored - # v needs to be explicitly False, not falsy, eg. 0 is an acceptable value + # v needs to be explicitly False, not falsy, e.g. 
0 is an acceptable value if v is None or v is False: continue diff --git a/metaflow/client/core.py b/metaflow/client/core.py index 62ae7553115..bd268d0dc60 100644 --- a/metaflow/client/core.py +++ b/metaflow/client/core.py @@ -8,16 +8,17 @@ from itertools import chain from metaflow.metaflow_environment import MetaflowEnvironment +from metaflow.current import current from metaflow.exception import ( MetaflowNotFound, MetaflowNamespaceMismatch, MetaflowInternalError, ) - +from metaflow.includefile import IncludedFile from metaflow.metaflow_config import DEFAULT_METADATA, MAX_ATTEMPTS from metaflow.plugins import ENVIRONMENTS, METADATA_PROVIDERS from metaflow.unbounded_foreach import CONTROL_TASK_TAG -from metaflow.util import cached_property, resolve_identity, to_unicode +from metaflow.util import cached_property, resolve_identity, to_unicode, is_stringish from .filecache import FileCache @@ -47,6 +48,11 @@ def metadata(ms): for example, not allow access to information stored in remote metadata providers. + Note that you don't typically have to call this function directly. Usually + the metadata provider is set through the Metaflow configuration file. If you + need to switch between multiple providers, you can use the `METAFLOW_PROFILE` + environment variable to switch between configurations. + Parameters ---------- ms : string @@ -58,7 +64,7 @@ def metadata(ms): ------- string The description of the metadata selected (equivalent to the result of - get_metadata()) + get_metadata()). """ global current_metadata infos = ms.split("@", 1) @@ -90,11 +96,12 @@ def get_metadata(): """ Returns the current Metadata provider. - This call returns the current Metadata being used to return information - about Metaflow objects. + If this is not set explicitly using `metadata`, the default value is + determined through the Metaflow configuration. You can use this call to + check that your configuration is set up properly. - If this is not set explicitly using metadata(), the default value is - determined through environment variables. + If multiple configuration profiles are present, this call returns the one + selected through the `METAFLOW_PROFILE` environment variable. Returns ------- @@ -110,10 +117,8 @@ def get_metadata(): def default_metadata(): """ - Resets the Metadata provider to the default value. - - The default value of the Metadata provider is determined through a combination of - environment variables. + Resets the Metadata provider to the default value, that is, to the value + that was used prior to any `metadata` calls. Returns ------- @@ -121,6 +126,12 @@ def default_metadata(): The result of get_metadata() after resetting the provider. """ global current_metadata + + # We first check if we are in a flow -- if that is the case, we use the + # metadata provider that is being used there + if current._metadata_str: + return metadata(current._metadata_str) + default = [m for m in METADATA_PROVIDERS if m.TYPE == DEFAULT_METADATA] if default: current_metadata = default[0] @@ -174,15 +185,13 @@ def get_namespace(): def default_namespace(): """ - Sets or resets the namespace used to filter objects. - - The default namespace is in the form 'user:' and is intended to filter - objects belonging to the user. + Resets the namespace used to filter objects to the default one, i.e. the one that was + used prior to any `namespace` calls. Returns ------- string - The result of get_namespace() after + The result of get_namespace() after the namespace has been reset. 
""" global current_namespace current_namespace = resolve_identity() @@ -198,11 +207,10 @@ class Metaflow(object): Attributes ---------- - flows : List of all flows. - Returns the list of all flows. Note that only flows present in the set namespace will - be returned. A flow is present in a namespace if it has at least one run in the - namespace. - + flows : List[Flow] + Returns the list of all `Flow` objects known to this metadata provider. Note that only + flows present in the current namespace will be returned. A `Flow` is present in a namespace + if it has at least one run in the namespace. """ def __init__(self): @@ -302,7 +310,11 @@ class MetaflowObject(object): Attributes ---------- tags : Set - Tags associated with the object. + Tags associated with the run this object belongs to (user and system tags). + user_tags: Set + User tags associated with the run this object belongs to. + system_tags: Set + System tags associated with the run this object belongs to. created_at : datetime Date and time this object was first created. parent : MetaflowObject @@ -349,7 +361,7 @@ def __init__( raise MetaflowNotFound( "Attempt can only be smaller than %d" % MAX_ATTEMPTS ) - # NOTE: It is possible that no attempt exists but we can't + # NOTE: It is possible that no attempt exists, but we can't # distinguish between "attempt will happen" and "no such # attempt exists". @@ -379,6 +391,8 @@ def __init__( self._tags = frozenset( chain(self._object.get("system_tags") or [], self._object.get("tags") or []) ) + self._user_tags = frozenset(self._object.get("tags") or []) + self._system_tags = frozenset(self._object.get("system_tags") or []) if _namespace_check and not self.is_in_namespace(): raise MetaflowNamespaceMismatch(current_namespace) @@ -545,6 +559,30 @@ def tags(self): """ return self._tags + @property + def system_tags(self): + """ + System defined tags associated with this object. + + Returns + ------- + Set[string] + System tags associated with the object + """ + return self._system_tags + + @property + def user_tags(self): + """ + User defined tags associated with this object. + + Returns + ------- + Set[string] + User tags associated with the object + """ + return self._user_tags + @property def created_at(self): """ @@ -613,7 +651,7 @@ def parent(self): if self._parent is None: pathspec = self.pathspec parent_pathspec = pathspec[: pathspec.rfind("/")] - # Only artifacts and tasks have attempts right now so we get the + # Only artifacts and tasks have attempts right now, so we get the # right parent if we are an artifact. attempt_to_pass = self._attempt if self._NAME == "artifact" else None # We can skip the namespace check because if self._NAME = 'run', @@ -664,6 +702,33 @@ def path_components(self): class MetaflowData(object): + """ + Container of data artifacts produced by a `Task`. This object is + instantiated through `Task.data`. + + `MetaflowData` allows results to be retrieved by their name + through a convenient dot notation: + + ```python + Task(...).data.my_object + ``` + + You can also test the existence of an object + + ```python + if 'my_object' in Task(...).data: + print('my_object found') + ``` + + Note that this container relies on the local cache to load all data + artifacts. 
If your `Task` contains a lot of data, a more efficient + approach is to load artifacts individually like so + + ``` + Task(...)['my_object'].data + ``` + """ + def __init__(self, artifacts): self._artifacts = dict((art.id, art) for art in artifacts) @@ -682,22 +747,28 @@ def __repr__(self): class MetaflowCode(object): """ - Describes the code that is occasionally stored with a run. + Snapshot of the code used to execute this `Run`. Instantiate the object through + `Run(...).code` (if all steps are executed remotely) or `Task(...).code` for an + individual task. The code package is the same for all steps of a `Run`. + + `MetaflowCode` includes a package of the user-defined `FlowSpec` class and supporting + files, as well as a snapshot of the Metaflow library itself. - A code package will contain the version of Metaflow that was used (all the files comprising - the Metaflow library) as well as selected files from the directory containing the Python - file of the FlowSpec. + Currently, `MetaflowCode` objects are stored only for `Run`s that have at least one `Step` + executing outside the user's local environment. + + The `TarFile` for the `Run` is given by `Run(...).code.tarball` Attributes ---------- path : string - Location (in the datastore provider) of the code package + Location (in the datastore provider) of the code package. info : Dict - Dictionary of information related to this code-package + Dictionary of information related to this code-package. flowspec : string - Source code of the file containing the FlowSpec in this code package + Source code of the file containing the `FlowSpec` in this code package. tarball : TarFile - Tar ball containing all the code + Python standard library `tarfile.TarFile` archive containing all the code. """ def __init__(self, flow_name, code_package): @@ -775,16 +846,19 @@ def __str__(self): class DataArtifact(MetaflowObject): """ - A single data artifact and associated metadata. + A single data artifact and associated metadata. Note that this object does + not contain other objects as it is the leaf object in the hierarchy. Attributes ---------- data : object - The unpickled representation of the data contained in this artifact + The data contained in this artifact, that is, the object produced during + execution of this run. sha : string - Encoding representing the unique identity of this artifact + A unique ID of this artifact. finished_at : datetime - Alias for created_at + Corresponds roughly to the `Task.finished_at` time of the parent `Task`. + An alias for `DataArtifact.created_at`. """ _NAME = "artifact" @@ -809,6 +883,7 @@ def data(self): if filecache is None: # TODO: Pass proper environment to properly extract artifacts filecache = FileCache() + # "create" the metadata information that the datastore needs # to access this object. # TODO: We can store more information in the metadata, particularly @@ -824,12 +899,15 @@ def data(self): }, } if location.startswith(":root:"): - return filecache.get_artifact(ds_type, location[6:], meta, *components) + obj = filecache.get_artifact(ds_type, location[6:], meta, *components) else: # Older artifacts have a location information which we can use. - return filecache.get_artifact_by_location( + obj = filecache.get_artifact_by_location( ds_type, location, meta, *components ) + if isinstance(obj, IncludedFile): + return obj.decode(self.id) + return obj @property def size(self): @@ -895,48 +973,53 @@ def finished_at(self): class Task(MetaflowObject): """ - A Task represents an execution of a step. 
+ A `Task` represents an execution of a `Step`. - As such, it contains all data artifacts associated with that execution as - well as all metadata associated with the execution. + It contains all `DataArtifact` objects produced by the task as + well as metadata related to execution. - Note that you can also get information about a specific *attempt* of a - task. By default, the latest finished attempt is returned but you can + Note that the `@retry` decorator may cause multiple attempts of + the task to be present. Usually you want the latest attempt, which + is what instantiating a `Task` object returns by default. If + you need to e.g. retrieve logs from a failed attempt, you can explicitly get information about a specific attempt by using the following syntax when creating a task: - `Task('flow/run/step/task', attempt=)`. Note that you will not be able to - access a specific attempt of a task through the `.tasks` method of a step - for example (that will always return the latest attempt). + + `Task('flow/run/step/task', attempt=)` + + where `attempt=0` corresponds to the first attempt etc. Attributes ---------- metadata : List[Metadata] - List of all metadata associated with the task + List of all metadata events associated with the task. metadata_dict : Dict - Dictionary where the keys are the names of the metadata and the value are the values - associated with those names + A condensed version of `metadata`: A dictionary where keys + are names of metadata events and values the latest corresponding event. data : MetaflowData - Container of all data artifacts produced by this task + Container of all data artifacts produced by this task. Note that this + call downloads all data locally, so it can be slower than accessing + artifacts individually. See `MetaflowData` for more information. artifacts : MetaflowArtifacts - Container of DataArtifact objects produced by this task + Container of `DataArtifact` objects produced by this task. successful : boolean - True if the task successfully completed + True if the task completed successfully. finished : boolean - True if the task completed + True if the task completed. exception : object - Exception raised by this task if there was one + Exception raised by this task if there was one. finished_at : datetime - Time this task finished + Time this task finished. runtime_name : string - Runtime this task was executed on + Runtime this task was executed on. stdout : string - Standard output for the task execution + Standard output for the task execution. stderr : string - Standard error output for the task execution + Standard error output for the task execution. code : MetaflowCode - Code package for this task (if present) + Code package for this task (if present). See `MetaflowCode`. environment_info : Dict - Information about the execution environment (for example Conda) + Information about the execution environment. """ _NAME = "task" @@ -967,16 +1050,59 @@ def metadata(self): self._NAME, "metadata", None, self._attempt, *self.path_components ) all_metadata = all_metadata if all_metadata else [] - return [ - Metadata( - name=obj.get("field_name"), - value=obj.get("value"), - created_at=obj.get("ts_epoch"), - type=obj.get("type"), - task=self, + + # For "clones" (ie: they have an origin-run-id AND a origin-task-id), we + # copy a set of metadata from the original task. 
This is needed to make things + # like logs work (which rely on having proper values for ds-root for example) + origin_run_id = None + origin_task_id = None + result = [] + existing_keys = [] + for obj in all_metadata: + result.append( + Metadata( + name=obj.get("field_name"), + value=obj.get("value"), + created_at=obj.get("ts_epoch"), + type=obj.get("type"), + task=self, + ) ) - for obj in all_metadata - ] + existing_keys.append(obj.get("field_name")) + if obj.get("field_name") == "origin-run-id": + origin_run_id = obj.get("value") + elif obj.get("field_name") == "origin-task-id": + origin_task_id = obj.get("value") + + if origin_task_id: + # This is a "cloned" task. We consider that it has the same + # metadata as the last attempt of the cloned task. + + origin_obj_pathcomponents = self.path_components + origin_obj_pathcomponents[1] = origin_run_id + origin_obj_pathcomponents[3] = origin_task_id + origin_task = Task( + "/".join(origin_obj_pathcomponents), _namespace_check=False + ) + latest_metadata = { + m.name: m + for m in sorted(origin_task.metadata, key=lambda m: m.created_at) + } + # We point to ourselves in the Metadata object + for v in latest_metadata.values(): + if v.name in existing_keys: + continue + result.append( + Metadata( + name=v.name, + value=v.value, + created_at=v.created_at, + type=v.type, + task=self, + ) + ) + + return result @property def metadata_dict(self): @@ -1280,35 +1406,52 @@ def environment_info(self): if not env_type: return None env = [m for m in ENVIRONMENTS + [MetaflowEnvironment] if m.TYPE == env_type][0] - return env.get_client_info(self.path_components[0], self.metadata_dict) + meta_dict = self.metadata_dict + return env.get_client_info(self.path_components[0], meta_dict) def _load_log(self, stream): - log_location = self.metadata_dict.get("log_location_%s" % stream) + meta_dict = self.metadata_dict + log_location = meta_dict.get("log_location_%s" % stream) if log_location: return self._load_log_legacy(log_location, stream) else: - return "".join(line + "\n" for _, line in self.loglines(stream)) + return "".join( + line + "\n" for _, line in self.loglines(stream, meta_dict=meta_dict) + ) def _get_logsize(self, stream): - log_location = self.metadata_dict.get("log_location_%s" % stream) + meta_dict = self.metadata_dict + log_location = meta_dict.get("log_location_%s" % stream) if log_location: return self._legacy_log_size(log_location, stream) else: - return self._log_size(stream) + return self._log_size(stream, meta_dict) - def loglines(self, stream, as_unicode=True): + def loglines(self, stream, as_unicode=True, meta_dict=None): """ Return an iterator over (utc_timestamp, logline) tuples. - If as_unicode=False, logline is returned as a byte object. Otherwise, - it is returned as a (unicode) string. + Parameters + ---------- + stream : string + Either 'stdout' or 'stderr'. + as_unicode : boolean + If as_unicode=False, each logline is returned as a byte object. Otherwise, + it is returned as a (unicode) string. + + Returns + ------- + Iterator[(datetime, string)] + Iterator over timestamp, logline pairs. 
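+
+        An illustrative sketch, assuming `task` is an existing `Task` object:
+
+        ```
+        for ts, line in task.loglines('stdout'):
+            print(ts, line)
+        ```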
""" from metaflow.mflog.mflog import merge_logs global filecache - ds_type = self.metadata_dict.get("ds-type") - ds_root = self.metadata_dict.get("ds-root") + if meta_dict is None: + meta_dict = self.metadata_dict + ds_type = meta_dict.get("ds-type") + ds_root = meta_dict.get("ds-root") if ds_type is None or ds_root is None: yield None, "" return @@ -1355,11 +1498,11 @@ def _legacy_log_size(self, log_location, logtype): ds_type, location, logtype, int(attempt), *self.path_components ) - def _log_size(self, stream): + def _log_size(self, stream, meta_dict): global filecache - ds_type = self.metadata_dict.get("ds-type") - ds_root = self.metadata_dict.get("ds-root") + ds_type = meta_dict.get("ds-type") + ds_root = meta_dict.get("ds-root") if ds_type is None or ds_root is None: return 0 if filecache is None: @@ -1373,20 +1516,21 @@ def _log_size(self, stream): class Step(MetaflowObject): """ - A Step represents a user-defined Step (a method annotated with the @step decorator). + A `Step` represents a user-defined step, that is, a method annotated with the `@step` decorator. - As such, it contains all Tasks associated with the step (ie: all executions of the - Step). A linear Step will have only one associated task whereas a foreach Step will have - multiple Tasks. + It contains `Task` objects associated with the step, that is, all executions of the + `Step`. The step may contain multiple `Task`s in the case of a foreach step. Attributes ---------- task : Task - Returns a Task object from the step + The first `Task` object in this step. This is a shortcut for retrieving the only + task contained in a non-foreach step. finished_at : datetime - Time this step finished (time of completion of the last task) + Time when the latest `Task` of this step finished. Note that in the case of foreaches, + this time may change during execution of the step. environment_info : Dict - Information about the execution environment (for example Conda) + Information about the execution environment. """ _NAME = "step" @@ -1410,41 +1554,45 @@ def task(self): def tasks(self, *tags): """ - Returns an iterator over all the tasks in the step. + [Legacy function - do not use] - An optional filter is available that allows you to filter on tags. - If tags are specified, only tasks associated with all specified tags - are returned. + Returns an iterator over all `Task` objects in the step. This is an alias + to iterating the object itself, i.e. + ``` + list(Step(...)) == list(Step(...).tasks()) + ``` Parameters ---------- tags : string - Tags to match + No op (legacy functionality) Returns ------- Iterator[Task] - Iterator over Task objects in this step + Iterator over all `Task` objects in this step. """ return self._filtered_children(*tags) @property def control_task(self): """ + [Unpublished API - use with caution!] + Returns a Control Task object belonging to this step. This is useful when the step only contains one control task. + Returns ------- Task A control task in the step """ - children = super(Step, self).__iter__() - for t in children: - if CONTROL_TASK_TAG in t.tags: - return t + return next(self.control_tasks(), None) def control_tasks(self, *tags): """ + [Unpublished API - use with caution!] + Returns an iterator over all the control tasks in the step. An optional filter is available that allows you to filter on tags. 
The control tasks returned if the filter is specified will contain all the @@ -1459,11 +1607,23 @@ def control_tasks(self, *tags): Iterator over Control Task objects in this step """ children = super(Step, self).__iter__() - filter_tags = [CONTROL_TASK_TAG] - filter_tags.extend(tags) for child in children: - if all(tag in child.tags for tag in filter_tags): + # first filter by standard tag filters + if not all(tag in child.tags for tag in tags): + continue + # Then look for control task indicator in one of two ways + # Look in tags - this path will activate for metadata service + # backends that pre-date tag mutation release + if CONTROL_TASK_TAG in child.tags: yield child + else: + # Look in task metadata + for task_metadata in child.metadata: + if ( + task_metadata.name == "internal_task_type" + and task_metadata.value == CONTROL_TASK_TAG + ): + yield child def __iter__(self): children = super(Step, self).__iter__() @@ -1511,24 +1671,22 @@ def environment_info(self): class Run(MetaflowObject): """ - A Run represents an execution of a Flow - - As such, it contains all Steps associated with the flow. + A `Run` represents an execution of a `Flow`. It is a container of `Step`s. Attributes ---------- data : MetaflowData - Container of all data artifacts produced by this run + a shortcut to run['end'].task.data, i.e. data produced by this run. successful : boolean - True if the run successfully completed + True if the run completed successfully. finished : boolean - True if the run completed + True if the run completed. finished_at : datetime - Time this run finished + Time this run finished. code : MetaflowCode - Code package for this run (if present) + Code package for this run (if present). See `MetaflowCode`. end_task : Task - Task for the end step (if it is present already) + `Task` for the end step (if it is present already). """ _NAME = "run" @@ -1541,21 +1699,23 @@ def _iter_filter(self, x): def steps(self, *tags): """ - Returns an iterator over all the steps in the run. + [Legacy function - do not use] - An optional filter is available that allows you to filter on tags. - If tags are specified, only steps associated with all specified tags - are returned. + Returns an iterator over all `Step` objects in the step. This is an alias + to iterating the object itself, i.e. + ``` + list(Run(...)) == list(Run(...).steps()) + ``` Parameters ---------- tags : string - Tags to match + No op (legacy functionality) Returns ------- Iterator[Step] - Iterator over Step objects in this run + Iterator over `Step` objects in this run. """ return self._filtered_children(*tags) @@ -1667,20 +1827,134 @@ def end_task(self): return end_step.task + def add_tag(self, tag): + """ + Add a tag to this `Run`. + + Note that if the tag is already a system tag, it is not added as a user tag, + and no error is thrown. + + Parameters + ---------- + tag : string + Tag to add. + """ + + # For backwards compatibility with Netflix's early version of this functionality, + # this function shall accept both an individual tag AND iterables of tags. + # + # Iterable of tags support shall be removed in future once existing + # usage has been migrated off. + if is_stringish(tag): + tag = [tag] + return self.replace_tag([], tag) + + def add_tags(self, tags): + """ + Add one or more tags to this `Run`. + + Note that if any tag is already a system tag, it is not added as a user tag + and no error is thrown. + + Parameters + ---------- + tags : Iterable[string] + Tags to add. 
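+
+        A minimal sketch (the pathspec below is a placeholder):
+
+        ```
+        run = Run('MyFlow/123')
+        run.add_tags(['gold', 'reviewed'])
+        ```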
+ """ + return self.replace_tag([], tags) + + def remove_tag(self, tag): + """ + Remove one tag from this `Run`. + + Removing a system tag is an error. Removing a non-existent + user tag is a no-op. + + Parameters + ---------- + tag : string + Tag to remove. + """ + + # For backwards compatibility with Netflix's early version of this functionality, + # this function shall accept both an individual tag AND iterables of tags. + # + # Iterable of tags support shall be removed in future once existing + # usage has been migrated off. + if is_stringish(tag): + tag = [tag] + return self.replace_tag(tag, []) + + def remove_tags(self, tags): + """ + Remove one or more tags to this `Run`. + + Removing a system tag will result in an error. Removing a non-existent + user tag is a no-op. + + Parameters + ---------- + tags : Iterable[string] + Tags to remove. + """ + return self.replace_tags(tags, []) + + def replace_tag(self, tag_to_remove, tag_to_add): + """ + Remove a tag and add a tag atomically. Removal is done first. + The rules for `Run.add_tag` and `Run.remove_tag` also apply here. + + Parameters + ---------- + tag_to_remove : string + Tag to remove. + tag_to_add : string + Tag to add. + """ + + # For backwards compatibility with Netflix's early version of this functionality, + # this function shall accept both individual tags AND iterables of tags. + # + # Iterable of tags support shall be removed in future once existing + # usage has been migrated off. + if is_stringish(tag_to_remove): + tag_to_remove = [tag_to_remove] + if is_stringish(tag_to_add): + tag_to_add = [tag_to_add] + return self.replace_tags(tag_to_remove, tag_to_add) + + def replace_tags(self, tags_to_remove, tags_to_add): + """ + Remove and add tags atomically; the removal is done first. + The rules for `Run.add_tag` and `Run.remove_tag` also apply here. + + Parameters + ---------- + tags_to_remove : Iterable[string] + Tags to remove. + tags_to_add : Iterable[string] + Tags to add. + """ + flow_id = self.path_components[0] + final_user_tags = self._metaflow.metadata.mutate_user_tags_for_run( + flow_id, self.id, tags_to_remove=tags_to_remove, tags_to_add=tags_to_add + ) + # refresh Run object with the latest tags + self._user_tags = frozenset(final_user_tags) + self._tags = frozenset([*self._user_tags, *self._system_tags]) + class Flow(MetaflowObject): """ A Flow represents all existing flows with a certain name, in other words, - classes derived from 'FlowSpec' - - As such, it contains all Runs (executions of a flow) related to this flow. + classes derived from `FlowSpec`. A container of `Run` objects. Attributes ---------- latest_run : Run - Latest Run (in progress or completed, successfully or not) of this Flow + Latest `Run` (in progress or completed, successfully or not) of this flow. latest_successful_run : Run - Latest successfully completed Run of this Flow + Latest successfully completed `Run` of this flow. """ _NAME = "flow" @@ -1722,21 +1996,21 @@ def latest_successful_run(self): def runs(self, *tags): """ - Returns an iterator over all the runs in the flow. + Returns an iterator over all `Run`s of this flow. An optional filter is available that allows you to filter on tags. - If tags are specified, only runs associated with all specified tags - are returned. + If multiple tags are specified, only runs that have all the + specified tags are returned. Parameters ---------- tags : string - Tags to match + Tags to match. 
Returns ------- Iterator[Run] - Iterator over Run objects in this flow + Iterator over `Run` objects in this flow. """ return self._filtered_children(*tags) diff --git a/metaflow/client/filecache.py b/metaflow/client/filecache.py index 5adf1a06c60..8c7d945d588 100644 --- a/metaflow/client/filecache.py +++ b/metaflow/client/filecache.py @@ -6,7 +6,7 @@ from tempfile import NamedTemporaryFile from hashlib import sha1 -from metaflow.datastore import DATASTORES, FlowDataStore +from metaflow.datastore import FlowDataStore from metaflow.datastore.content_addressed_store import BlobCache from metaflow.exception import MetaflowException from metaflow.metaflow_config import ( @@ -16,6 +16,8 @@ CLIENT_CACHE_MAX_TASKDATASTORE_COUNT, ) +from metaflow.plugins import DATASTORES + NEW_FILE_QUARANTINE = 10 if sys.version_info[0] >= 3 and sys.version_info[1] >= 2: @@ -23,7 +25,6 @@ def od_move_to_end(od, key): od.move_to_end(key) - else: # Not very efficient but works and most people are on 3.2+ def od_move_to_end(od, key): @@ -320,7 +321,7 @@ def _task_ds_id(ds_type, ds_root, flow_name, run_id, step_name, task_id, attempt def _garbage_collect(self): now = time.time() - while self._objects and self._total > self._max_size * 1024 ** 2: + while self._objects and self._total > self._max_size * 1024**2: if now - self._objects[0][0] < NEW_FILE_QUARANTINE: break ctime, size, path = self._objects.pop(0) @@ -345,10 +346,10 @@ def _makedirs(path): @staticmethod def _get_datastore_storage_impl(ds_type): - storage_impl = DATASTORES.get(ds_type, None) - if storage_impl is None: + storage_impl = [d for d in DATASTORES if d.TYPE == ds_type] + if len(storage_impl) == 0: raise FileCacheException("Datastore %s was not found" % ds_type) - return storage_impl + return storage_impl[0] def _get_flow_datastore(self, ds_type, ds_root, flow_name): cache_id = self._flow_ds_id(ds_type, ds_root, flow_name) diff --git a/metaflow/cmd/__init__.py b/metaflow/cmd/__init__.py new file mode 100644 index 00000000000..e69de29bb2d diff --git a/metaflow/main_cli.py b/metaflow/cmd/configure_cmd.py similarity index 62% rename from metaflow/main_cli.py rename to metaflow/cmd/configure_cmd.py index 709bee87ba1..c5457f2911b 100644 --- a/metaflow/main_cli.py +++ b/metaflow/cmd/configure_cmd.py @@ -1,270 +1,18 @@ -from metaflow._vendor import click import json import os -import shutil +import sys from os.path import expanduser -from metaflow.datastore.local_storage import LocalStorage -from metaflow.metaflow_config import DATASTORE_LOCAL_DIR +from metaflow.util import to_unicode +from metaflow._vendor import click from metaflow.util import to_unicode -def makedirs(path): - # This is for python2 compatibility. - # Python3 has os.makedirs(exist_ok=True). 
- try: - os.makedirs(path) - except OSError as x: - if x.errno == 17: - return - else: - raise - - -def echo_dev_null(*args, **kwargs): - pass - - -def echo_always(line, **kwargs): - click.secho(line, **kwargs) - - -@click.group(invoke_without_command=True) -@click.pass_context -def main(ctx): - global echo - echo = echo_always - - import metaflow - - echo("Metaflow ", fg="magenta", bold=True, nl=False) - - if ctx.invoked_subcommand is None: - echo("(%s): " % metaflow.__version__, fg="magenta", bold=False, nl=False) - else: - echo("(%s)\n" % metaflow.__version__, fg="magenta", bold=False) - - if ctx.invoked_subcommand is None: - echo("More data science, less engineering\n", fg="magenta") - - # metaflow URL - echo("http://docs.metaflow.org", fg="cyan", nl=False) - echo(" - Read the documentation") - - # metaflow chat - echo("http://chat.metaflow.org", fg="cyan", nl=False) - echo(" - Chat with us") - - # metaflow help email - echo("help@metaflow.org", fg="cyan", nl=False) - echo(" - Get help by email\n") - - # print a short list of next steps. - short_help = { - "tutorials": "Browse and access metaflow tutorials.", - "configure": "Configure metaflow to access the cloud.", - "status": "Display the current working tree.", - "help": "Show all available commands to run.", - } - - echo("Commands:", bold=False) - - for cmd, desc in short_help.items(): - echo(" metaflow {0:<10} ".format(cmd), fg="cyan", bold=False, nl=False) - - echo("%s" % desc) - - -@main.command(help="Show all available commands.") -@click.pass_context -def help(ctx): - print(ctx.parent.get_help()) - - -@main.command(help="Show flows accessible from the current working tree.") -def status(): - from metaflow.client import get_metadata - - res = get_metadata() - if res: - res = res.split("@") - else: - raise click.ClickException("Unknown status: cannot find a Metadata provider") - if res[0] == "service": - echo("Using Metadata provider at: ", nl=False) - echo('"%s"\n' % res[1], fg="cyan") - echo("To list available flows, type:\n") - echo("1. python") - echo("2. from metaflow import Metaflow") - echo("3. list(Metaflow())") - return - - from metaflow.client import namespace, metadata, Metaflow - - # Get the local data store path - path = LocalStorage.get_datastore_root_from_config(echo, create_on_absent=False) - # Throw an exception - if path is None: - raise click.ClickException( - "Could not find " - + click.style('"%s"' % DATASTORE_LOCAL_DIR, fg="red") - + " in the current working tree." 
- ) - - stripped_path = os.path.dirname(path) - namespace(None) - metadata("local@%s" % stripped_path) - echo("Working tree found at: ", nl=False) - echo('"%s"\n' % stripped_path, fg="cyan") - echo("Available flows:", fg="cyan", bold=True) - for flow in Metaflow(): - echo("* %s" % flow, fg="cyan") - - -@main.group(help="Browse and access the metaflow tutorial episodes.") -def tutorials(): - pass - - -def get_tutorials_dir(): - metaflow_dir = os.path.dirname(__file__) - package_dir = os.path.dirname(metaflow_dir) - tutorials_dir = os.path.join(package_dir, "metaflow", "tutorials") - - return tutorials_dir - - -def get_tutorial_metadata(tutorial_path): - metadata = {} - with open(os.path.join(tutorial_path, "README.md")) as readme: - content = readme.read() - - paragraphs = [paragraph.strip() for paragraph in content.split("#") if paragraph] - metadata["description"] = paragraphs[0].split("**")[1] - header = paragraphs[0].split("\n") - header = header[0].split(":") - metadata["episode"] = header[0].strip()[len("Episode ") :] - metadata["title"] = header[1].strip() - - for paragraph in paragraphs[1:]: - if paragraph.startswith("Before playing"): - lines = "\n".join(paragraph.split("\n")[1:]) - metadata["prereq"] = lines.replace("```", "") - - if paragraph.startswith("Showcasing"): - lines = "\n".join(paragraph.split("\n")[1:]) - metadata["showcase"] = lines.replace("```", "") - - if paragraph.startswith("To play"): - lines = "\n".join(paragraph.split("\n")[1:]) - metadata["play"] = lines.replace("```", "") - - return metadata - - -def get_all_episodes(): - episodes = [] - for name in sorted(os.listdir(get_tutorials_dir())): - # Skip hidden files (like .gitignore) - if not name.startswith("."): - episodes.append(name) - return episodes - - -@tutorials.command(help="List the available episodes.") -def list(): - echo("Episodes:", fg="cyan", bold=True) - for name in get_all_episodes(): - path = os.path.join(get_tutorials_dir(), name) - metadata = get_tutorial_metadata(path) - echo("* {0: <20} ".format(metadata["episode"]), fg="cyan", nl=False) - echo("- {0}".format(metadata["title"])) - - echo("\nTo pull the episodes, type: ") - echo("metaflow tutorials pull", fg="cyan") - - -def validate_episode(episode): - src_dir = os.path.join(get_tutorials_dir(), episode) - if not os.path.isdir(src_dir): - raise click.BadArgumentUsage( - "Episode " - + click.style('"{0}"'.format(episode), fg="red") - + " does not exist." - " To see a list of available episodes, " - "type:\n" + click.style("metaflow tutorials list", fg="cyan") - ) - - -def autocomplete_episodes(ctx, args, incomplete): - return [k for k in get_all_episodes() if incomplete in k] - +from .util import echo_always, makedirs -@tutorials.command(help="Pull episodes " "into your current working directory.") -@click.option( - "--episode", - default="", - help="Optional episode name " "to pull only a single episode.", -) -def pull(episode): - tutorials_dir = get_tutorials_dir() - if not episode: - episodes = get_all_episodes() - else: - episodes = [episode] - # Validate that the list is valid. - for episode in episodes: - validate_episode(episode) - # Create destination `metaflow-tutorials` dir. - dst_parent = os.path.join(os.getcwd(), "metaflow-tutorials") - makedirs(dst_parent) - - # Pull specified episodes. - for episode in episodes: - dst_dir = os.path.join(dst_parent, episode) - # Check if episode has already been pulled before. 
- if os.path.exists(dst_dir): - if click.confirm( - "Episode " - + click.style('"{0}"'.format(episode), fg="red") - + " has already been pulled before. Do you wish " - "to delete the existing version?" - ): - shutil.rmtree(dst_dir) - else: - continue - echo("Pulling episode ", nl=False) - echo('"{0}"'.format(episode), fg="cyan", nl=False) - # TODO: Is the following redundant? - echo(" into your current working directory.") - # Copy from (local) metaflow package dir to current. - src_dir = os.path.join(tutorials_dir, episode) - shutil.copytree(src_dir, dst_dir) - - echo("\nTo know more about an episode, type:\n", nl=False) - echo("metaflow tutorials info [EPISODE]", fg="cyan") - - -@tutorials.command(help="Find out more about an episode.") -@click.argument("episode", autocompletion=autocomplete_episodes) -def info(episode): - validate_episode(episode) - src_dir = os.path.join(get_tutorials_dir(), episode) - metadata = get_tutorial_metadata(src_dir) - echo("Synopsis:", fg="cyan", bold=True) - echo("%s" % metadata["description"]) - - echo("\nShowcasing:", fg="cyan", bold=True, nl=True) - echo("%s" % metadata["showcase"]) - - if "prereq" in metadata: - echo("\nBefore playing:", fg="cyan", bold=True, nl=True) - echo("%s" % metadata["prereq"]) - - echo("\nTo play:", fg="cyan", bold=True) - echo("%s" % metadata["play"]) +echo = echo_always # NOTE: This code needs to be in sync with metaflow/metaflow_config.py. METAFLOW_CONFIGURATION_DIR = expanduser( @@ -272,7 +20,12 @@ def info(episode): ) -@main.group(help="Configure Metaflow to access the cloud.") +@click.group() +def cli(): + pass + + +@cli.group(help="Configure Metaflow to access the cloud.") def configure(): makedirs(METAFLOW_CONFIGURATION_DIR) @@ -283,7 +36,7 @@ def get_config_path(profile): return path -def overwrite_config(profile): +def confirm_overwrite_config(profile): path = get_config_path(profile) if os.path.exists(path): if not click.confirm( @@ -424,7 +177,7 @@ def import_from(profile, input_filename): echo('"%s"' % input_path, fg="cyan") # Persist configuration. - overwrite_config(profile) + confirm_overwrite_config(profile) persist_env(env_dict, profile) @@ -436,8 +189,16 @@ def import_from(profile, input_filename): help="Configure a named profile. Activate the profile by setting " "`METAFLOW_PROFILE` environment variable.", ) -def sandbox(profile): - overwrite_config(profile) +@click.option( + "--overwrite/--no-overwrite", + "-o/", + default=False, + show_default=True, + help="Overwrite profile configuration without asking", +) +def sandbox(profile, overwrite): + if not overwrite: + confirm_overwrite_config(profile) # Prompt for user input. encoded_str = click.prompt( "Following instructions from " @@ -447,7 +208,8 @@ def sandbox(profile): ) # Decode the bytes to env_dict. try: - import base64, zlib + import base64 + import zlib from metaflow.util import to_bytes env_dict = json.loads( @@ -501,6 +263,44 @@ def configure_s3_datastore(existing_env): return env +def configure_azure_datastore(existing_env): + env = {} + # Set Azure Blob Storage as default datastore. + env["METAFLOW_DEFAULT_DATASTORE"] = "azure" + # Set Azure Blob Storage folder for datastore. + # TODO rename this Blob Endpoint! + env["METAFLOW_AZURE_STORAGE_BLOB_SERVICE_ENDPOINT"] = click.prompt( + cyan("[METAFLOW_AZURE_STORAGE_BLOB_SERVICE_ENDPOINT]") + + " Azure Storage Account URL, for the account holding the Blob container to be used. " + + "(E.g. 
https://.blob.core.windows.net/)", + default=existing_env.get("METAFLOW_AZURE_STORAGE_BLOB_SERVICE_ENDPOINT"), + show_default=True, + ) + env["METAFLOW_DATASTORE_SYSROOT_AZURE"] = click.prompt( + cyan("[METAFLOW_DATASTORE_SYSROOT_AZURE]") + + " Azure Blob Storage folder for Metaflow artifact storage " + + "(Format: /)", + default=existing_env.get("METAFLOW_DATASTORE_SYSROOT_AZURE"), + show_default=True, + ) + return env + + +def configure_gs_datastore(existing_env): + env = {} + # Set Google Cloud Storage as default datastore. + env["METAFLOW_DEFAULT_DATASTORE"] = "gs" + # Set Google Cloud Storage folder for datastore. + env["METAFLOW_DATASTORE_SYSROOT_GS"] = click.prompt( + cyan("[METAFLOW_DATASTORE_SYSROOT_GS]") + + " Google Cloud Storage folder for Metaflow artifact storage " + + "(Format: gs:///)", + default=existing_env.get("METAFLOW_DATASTORE_SYSROOT_GS"), + show_default=True, + ) + return env + + def configure_metadata_service(existing_env): empty_profile = False if not existing_env: @@ -520,7 +320,7 @@ def configure_metadata_service(existing_env): cyan("[METAFLOW_SERVICE_INTERNAL_URL]") + yellow(" (optional)") + " URL for Metaflow Service " - + "(Accessible only within VPC).", + + "(Accessible only within VPC [AWS] or a Kubernetes cluster [if the service runs in one]).", default=existing_env.get( "METAFLOW_SERVICE_INTERNAL_URL", env["METAFLOW_SERVICE_URL"] ), @@ -537,7 +337,83 @@ def configure_metadata_service(existing_env): return env -def configure_datastore_and_metadata(existing_env): +def configure_azure_datastore_and_metadata(existing_env): + empty_profile = False + if not existing_env: + empty_profile = True + env = {} + + # Configure Azure Blob Storage as the datastore. + use_azure_as_datastore = click.confirm( + "\nMetaflow can use " + + yellow("Azure Blob Storage as the storage backend") + + " for all code and data artifacts on " + + "Azure.\nAzure Blob Storage is a strict requirement if you " + + "intend to execute your flows on a Kubernetes cluster on Azure (AKS or self-managed)" + + ".\nWould you like to configure Azure Blob Storage " + + "as the default storage backend?", + default=empty_profile + or existing_env.get("METAFLOW_DEFAULT_DATASTORE", "") == "azure", + abort=False, + ) + if use_azure_as_datastore: + env.update(configure_azure_datastore(existing_env)) + + # Configure Metadata service for tracking. + if click.confirm( + "\nMetaflow can use a " + + yellow("remote Metadata Service to track") + + " and persist flow execution metadata.\nConfiguring the " + "service is a requirement if you intend to schedule your " + "flows with Kubernetes on Azure (AKS or self-managed).\nWould you like to " + "configure the Metadata Service?", + default=empty_profile + or existing_env.get("METAFLOW_DEFAULT_METADATA", "") == "service", + abort=False, + ): + env.update(configure_metadata_service(existing_env)) + return env + + +def configure_gs_datastore_and_metadata(existing_env): + empty_profile = False + if not existing_env: + empty_profile = True + env = {} + + # Configure Google Cloud Storage as the datastore. 
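+    # The prompt defaults to "yes" for a fresh profile, or when GS is already
+    # the configured default datastore.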
+ use_gs_as_datastore = click.confirm( + "\nMetaflow can use " + + yellow("Google Cloud Storage as the storage backend") + + " for all code and data artifacts on " + + "Google Cloud Storage.\nGoogle Cloud Storage is a strict requirement if you " + + "intend to execute your flows on a Kubernetes cluster on GCP (GKE or self-managed)" + + ".\nWould you like to configure Google Cloud Storage " + + "as the default storage backend?", + default=empty_profile + or existing_env.get("METAFLOW_DEFAULT_DATASTORE", "") == "gs", + abort=False, + ) + if use_gs_as_datastore: + env.update(configure_gs_datastore(existing_env)) + + # Configure Metadata service for tracking. + if click.confirm( + "\nMetaflow can use a " + + yellow("remote Metadata Service to track") + + " and persist flow execution metadata.\nConfiguring the " + "service is a requirement if you intend to schedule your " + "flows with Kubernetes on GCP (GKE or self-managed).\nWould you like to " + "configure the Metadata Service?", + default=empty_profile + or existing_env.get("METAFLOW_DEFAULT_METADATA", "") == "service", + abort=False, + ): + env.update(configure_metadata_service(existing_env)) + return env + + +def configure_aws_datastore_and_metadata(existing_env): empty_profile = False if not existing_env: empty_profile = True @@ -664,10 +540,11 @@ def check_kubernetes_client(ctx): import kubernetes except ImportError: echo( - "Please install python kubernetes client first " - + "(run " - + yellow("pip install kubernetes") - + " or equivalent in your favorite python package manager)" + "Could not import module 'Kubernetes'.\nInstall Kubernetes " + + "Python package (https://pypi.org/project/kubernetes/) first.\n" + "You can install the module by executing - \n" + + yellow("%s -m pip install kubernetes" % sys.executable) + + " \nor equivalent in your favorite Python package manager\n" ) ctx.abort() @@ -687,20 +564,20 @@ def check_kubernetes_config(ctx): ) except config.config_exception.ConfigException as e: click.confirm( - "\nYou don't seem to have a valid kubernetes configuration file. " - + "The error from kubernetes client library: " + "\nYou don't seem to have a valid Kubernetes configuration file. " + + "The error from Kubernetes client library: " + red(str(e)) + "." + "To create a kubernetes configuration for EKS, you typically need to run " + yellow("aws eks update-kubeconfig --name ") - + ". For further details, refer to AWS Documentation at https://docs.aws.amazon.com/eks/latest/userguide/create-kubeconfig.html\n" - "Do you want to proceed with configuring Metaflow for EKS anyway?", + + ". For further details, refer to AWS documentation at https://docs.aws.amazon.com/eks/latest/userguide/create-kubeconfig.html\n" + "Do you want to proceed with configuring Metaflow for Kubernetes anyway?", default=False, abort=True, ) -def configure_eks(existing_env): +def configure_kubernetes(existing_env): empty_profile = False if not existing_env: empty_profile = True @@ -745,6 +622,15 @@ def configure_eks(existing_env): default=existing_env.get("METAFLOW_KUBERNETES_CONTAINER_IMAGE", ""), show_default=True, ) + # Set default Kubernetes secrets to source into pod envs + env["METAFLOW_KUBERNETES_SECRETS"] = click.prompt( + cyan("[METAFLOW_KUBERNETES_SECRETS]") + + yellow(" (optional)") + + " Comma-delimited list of secret names. Jobs will" + " gain environment variables from these secrets. 
", + default=existing_env.get("METAFLOW_KUBERNETES_SECRETS", ""), + show_default=True, + ) return env @@ -773,6 +659,144 @@ def verify_aws_credentials(ctx): ctx.abort() +def verify_azure_credentials(ctx): + # Verify that the user has configured AWS credentials on their computer. + if not click.confirm( + "\nMetaflow relies on " + + yellow("Azure access credentials") + + " present on your computer to access resources on Azure." + "\nBefore proceeding further, please confirm that you " + "have already configured these access credentials on " + "this computer.", + default=True, + ): + echo( + "There are many ways to setup your Azure access credentials. You " + "can get started by getting familiar with the following: ", + nl=False, + fg="yellow", + ) + echo("") + echo( + "- https://docs.microsoft.com/en-us/cli/azure/authenticate-azure-cli", + fg="cyan", + ) + echo( + "- https://docs.microsoft.com/en-us/cli/azure/azure-cli-configuration", + fg="cyan", + ) + ctx.abort() + + +def verify_gcp_credentials(ctx): + # Verify that the user has configured AWS credentials on their computer. + if not click.confirm( + "\nMetaflow relies on " + + yellow("GCP access credentials") + + " present on your computer to access resources on GCP." + "\nBefore proceeding further, please confirm that you " + "have already configured these access credentials on " + "this computer.", + default=True, + ): + echo( + "There are many ways to setup your GCP access credentials. You " + "can get started by getting familiar with the following: ", + nl=False, + fg="yellow", + ) + echo("") + echo( + "- https://cloud.google.com/docs/authentication/provide-credentials-adc", + fg="cyan", + ) + ctx.abort() + + +@configure.command(help="Configure metaflow to access Microsoft Azure.") +@click.option( + "--profile", + "-p", + default="", + help="Configure a named profile. Activate the profile by setting " + "`METAFLOW_PROFILE` environment variable.", +) +@click.pass_context +def azure(ctx, profile): + + # Greet the user! + echo( + "Welcome to Metaflow! Follow the prompts to configure your installation.\n", + bold=True, + ) + + # Check for existing configuration. + if not confirm_overwrite_config(profile): + ctx.abort() + + verify_azure_credentials(ctx) + + existing_env = get_env(profile) + + env = {} + env.update(configure_azure_datastore_and_metadata(existing_env)) + + persist_env({k: v for k, v in env.items() if v}, profile) + + # Prompt user to also configure Kubernetes for compute if using azure + if env.get("METAFLOW_DEFAULT_DATASTORE") == "azure": + click.echo( + "\nFinal note! Metaflow can scale your flows by " + + yellow("executing your steps on Kubernetes.") + + "\nYou may use Azure Kubernetes Service (AKS)" + " or a self-managed Kubernetes cluster on Azure VMs." + + " If/when your Kubernetes cluster is ready for use," + " please run 'metaflow configure kubernetes'.", + ) + + +@configure.command(help="Configure metaflow to access Google Cloud Platform.") +@click.option( + "--profile", + "-p", + default="", + help="Configure a named profile. Activate the profile by setting " + "`METAFLOW_PROFILE` environment variable.", +) +@click.pass_context +def gcp(ctx, profile): + + # Greet the user! + echo( + "Welcome to Metaflow! Follow the prompts to configure your installation.\n", + bold=True, + ) + + # Check for existing configuration. 
+ if not confirm_overwrite_config(profile): + ctx.abort() + + verify_gcp_credentials(ctx) + + existing_env = get_env(profile) + + env = {} + env.update(configure_gs_datastore_and_metadata(existing_env)) + + persist_env({k: v for k, v in env.items() if v}, profile) + + # Prompt user to also configure Kubernetes for compute if using Google Cloud Storage + if env.get("METAFLOW_DEFAULT_DATASTORE") == "gs": + click.echo( + "\nFinal note! Metaflow can scale your flows by " + + yellow("executing your steps on Kubernetes.") + + "\nYou may use Google Kubernetes Engine (GKE)" + " or a self-managed Kubernetes cluster on Google Compute Engine VMs." + + " If/when your Kubernetes cluster is ready for use," + " please run 'metaflow configure kubernetes'.", + ) + + @configure.command(help="Configure metaflow to access self-managed AWS resources.") @click.option( "--profile", @@ -791,7 +815,7 @@ def aws(ctx, profile): ) # Check for existing configuration. - if not overwrite_config(profile): + if not confirm_overwrite_config(profile): ctx.abort() verify_aws_credentials(ctx) @@ -802,7 +826,7 @@ def aws(ctx, profile): empty_profile = True env = {} - env.update(configure_datastore_and_metadata(existing_env)) + env.update(configure_aws_datastore_and_metadata(existing_env)) # Configure AWS Batch for compute if using S3 if env.get("METAFLOW_DEFAULT_DATASTORE") == "s3": @@ -821,7 +845,7 @@ def aws(ctx, profile): persist_env({k: v for k, v in env.items() if v}, profile) -@configure.command(help="Configure metaflow to use AWS EKS.") +@configure.command(help="Configure metaflow to use Kubernetes.") @click.option( "--profile", "-p", @@ -830,7 +854,7 @@ def aws(ctx, profile): "`METAFLOW_PROFILE` environment variable.", ) @click.pass_context -def eks(ctx, profile): +def kubernetes(ctx, profile): check_kubernetes_client(ctx) @@ -843,30 +867,23 @@ def eks(ctx, profile): check_kubernetes_config(ctx) # Check for existing configuration. - if not overwrite_config(profile): + if not confirm_overwrite_config(profile): ctx.abort() - verify_aws_credentials(ctx) - existing_env = get_env(profile) env = existing_env.copy() - if existing_env.get("METAFLOW_DEFAULT_DATASTORE") == "s3": - # Skip S3 configuration if it is already configured - pass - elif not existing_env.get("METAFLOW_DEFAULT_DATASTORE"): - env.update(configure_s3_datastore(existing_env)) - else: - # If configured to use something else, offer to switch to S3 - click.confirm( - "\nMetaflow on EKS needs to use S3 as a datastore, " - + "but your existing configuration is not using S3. " - + "Would you like to reconfigure it to use S3?", - default=True, - abort=True, + # We used to push user straight to S3 configuration inline. + # Now that we support >1 cloud, it gets too complicated. + # Therefore, we instruct the user to configure datastore first, by + # a separate command. + if existing_env.get("METAFLOW_DEFAULT_DATASTORE") == "local": + click.echo( + "\nCannot run Kubernetes with local datastore. Please run" + " 'metaflow configure aws' or 'metaflow configure azure'." ) - env.update(configure_s3_datastore(existing_env)) + click.Abort() # Configure remote metadata. if existing_env.get("METAFLOW_DEFAULT_METADATA") == "service": @@ -883,10 +900,7 @@ def eks(ctx, profile): ): env.update(configure_metadata_service(existing_env)) - # Configure AWS EKS for compute. - env.update(configure_eks(existing_env)) + # Configure Kubernetes for compute. 
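+    # configure_kubernetes() prompts for the Kubernetes-specific settings,
+    # e.g. the container image and the secrets to expose to jobs.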
+ env.update(configure_kubernetes(existing_env)) persist_env({k: v for k, v in env.items() if v}, profile) - - -main() diff --git a/metaflow/cmd/main_cli.py b/metaflow/cmd/main_cli.py new file mode 100644 index 00000000000..1a976a04c91 --- /dev/null +++ b/metaflow/cmd/main_cli.py @@ -0,0 +1,140 @@ +import os +import traceback + +from metaflow._vendor import click + +from metaflow.plugins.datastores.local_storage import LocalStorage +from metaflow.metaflow_config import DATASTORE_LOCAL_DIR + +from .util import echo_always + + +@click.group() +def main(ctx): + pass + + +@main.command(help="Show all available commands.") +@click.pass_context +def help(ctx): + print(ctx.parent.get_help()) + + +@main.command(help="Show flows accessible from the current working tree.") +def status(): + from metaflow.client import get_metadata + + res = get_metadata() + if res: + res = res.split("@") + else: + raise click.ClickException("Unknown status: cannot find a Metadata provider") + if res[0] == "service": + echo("Using Metadata provider at: ", nl=False) + echo('"%s"\n' % res[1], fg="cyan") + echo("To list available flows, type:\n") + echo("1. python") + echo("2. from metaflow import Metaflow") + echo("3. list(Metaflow())") + return + + from metaflow.client import namespace, metadata, Metaflow + + # Get the local data store path + path = LocalStorage.get_datastore_root_from_config(echo, create_on_absent=False) + # Throw an exception + if path is None: + raise click.ClickException( + "Could not find " + + click.style('"%s"' % DATASTORE_LOCAL_DIR, fg="red") + + " in the current working tree." + ) + + stripped_path = os.path.dirname(path) + namespace(None) + metadata("local@%s" % stripped_path) + echo("Working tree found at: ", nl=False) + echo('"%s"\n' % stripped_path, fg="cyan") + echo("Available flows:", fg="cyan", bold=True) + for flow in Metaflow(): + echo("* %s" % flow, fg="cyan") + + +try: + from metaflow.extension_support import get_modules, load_module, _ext_debug + + _modules_to_import = get_modules("cmd") + _clis = [] + # Reverse to maintain "latest" overrides (in Click, the first one will get it) + for m in reversed(_modules_to_import): + _get_clis = m.module.__dict__.get("get_cmd_clis") + if _get_clis: + _clis.extend(_get_clis()) + +except Exception as e: + _ext_debug("\tWARNING: ignoring all plugins due to error during import: %s" % e) + print( + "WARNING: Command extensions did not load -- ignoring all of them which may not " + "be what you want: %s" % e + ) + _clis = [] + traceback.print_exc() + +from .configure_cmd import cli as configure_cli +from .tutorials_cmd import cli as tutorials_cli + + +@click.command( + cls=click.CommandCollection, + sources=_clis + [main, configure_cli, tutorials_cli], + invoke_without_command=True, +) +@click.pass_context +def start(ctx): + global echo + echo = echo_always + + import metaflow + + echo("Metaflow ", fg="magenta", bold=True, nl=False) + + if ctx.invoked_subcommand is None: + echo("(%s): " % metaflow.__version__, fg="magenta", bold=False, nl=False) + else: + echo("(%s)\n" % metaflow.__version__, fg="magenta", bold=False) + + if ctx.invoked_subcommand is None: + echo("More data science, less engineering\n", fg="magenta") + + # metaflow URL + echo("http://docs.metaflow.org", fg="cyan", nl=False) + echo(" - Read the documentation") + + # metaflow chat + echo("http://chat.metaflow.org", fg="cyan", nl=False) + echo(" - Chat with us") + + # metaflow help email + echo("help@metaflow.org", fg="cyan", nl=False) + echo(" - Get help by email\n") + + 
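+    # Print the help for the combined command collection so that commands
+    # contributed by extensions are listed alongside the built-in groups.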
print(ctx.get_help()) + + +start() + +for _n in [ + "get_modules", + "load_module", + "_modules_to_import", + "m", + "_get_clis", + "_clis", + "ext_debug", + "e", +]: + try: + del globals()[_n] + except KeyError: + pass +del globals()["_n"] diff --git a/metaflow/cmd/tutorials_cmd.py b/metaflow/cmd/tutorials_cmd.py new file mode 100644 index 00000000000..6e9e68e0789 --- /dev/null +++ b/metaflow/cmd/tutorials_cmd.py @@ -0,0 +1,160 @@ +import os +import shutil + +from metaflow._vendor import click + +from .util import echo_always, makedirs + +echo = echo_always + + +@click.group() +def cli(): + pass + + +@cli.group(help="Browse and access the metaflow tutorial episodes.") +def tutorials(): + pass + + +def get_tutorials_dir(): + metaflow_dir = os.path.dirname(__file__) + package_dir = os.path.dirname(metaflow_dir) + tutorials_dir = os.path.join(package_dir, "metaflow", "tutorials") + + if not os.path.exists(tutorials_dir): + tutorials_dir = os.path.join(package_dir, "tutorials") + + return tutorials_dir + + +def get_tutorial_metadata(tutorial_path): + metadata = {} + with open(os.path.join(tutorial_path, "README.md")) as readme: + content = readme.read() + + paragraphs = [paragraph.strip() for paragraph in content.split("#") if paragraph] + metadata["description"] = paragraphs[0].split("**")[1] + header = paragraphs[0].split("\n") + header = header[0].split(":") + metadata["episode"] = header[0].strip()[len("Episode ") :] + metadata["title"] = header[1].strip() + + for paragraph in paragraphs[1:]: + if paragraph.startswith("Before playing"): + lines = "\n".join(paragraph.split("\n")[1:]) + metadata["prereq"] = lines.replace("```", "") + + if paragraph.startswith("Showcasing"): + lines = "\n".join(paragraph.split("\n")[1:]) + metadata["showcase"] = lines.replace("```", "") + + if paragraph.startswith("To play"): + lines = "\n".join(paragraph.split("\n")[1:]) + metadata["play"] = lines.replace("```", "") + + return metadata + + +def get_all_episodes(): + episodes = [] + for name in sorted(os.listdir(get_tutorials_dir())): + # Skip hidden files (like .gitignore) + if not name.startswith("."): + episodes.append(name) + return episodes + + +@tutorials.command(help="List the available episodes.") +def list(): + echo("Episodes:", fg="cyan", bold=True) + for name in get_all_episodes(): + path = os.path.join(get_tutorials_dir(), name) + metadata = get_tutorial_metadata(path) + echo("* {0: <20} ".format(metadata["episode"]), fg="cyan", nl=False) + echo("- {0}".format(metadata["title"])) + + echo("\nTo pull the episodes, type: ") + echo("metaflow tutorials pull", fg="cyan") + + +def validate_episode(episode): + src_dir = os.path.join(get_tutorials_dir(), episode) + if not os.path.isdir(src_dir): + raise click.BadArgumentUsage( + "Episode " + + click.style('"{0}"'.format(episode), fg="red") + + " does not exist." + " To see a list of available episodes, " + "type:\n" + click.style("metaflow tutorials list", fg="cyan") + ) + + +def autocomplete_episodes(ctx, args, incomplete): + return [k for k in get_all_episodes() if incomplete in k] + + +@tutorials.command(help="Pull episodes " "into your current working directory.") +@click.option( + "--episode", + default="", + help="Optional episode name " "to pull only a single episode.", +) +def pull(episode): + tutorials_dir = get_tutorials_dir() + if not episode: + episodes = get_all_episodes() + else: + episodes = [episode] + # Validate that the list is valid. + for episode in episodes: + validate_episode(episode) + # Create destination `metaflow-tutorials` dir. 
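+    # makedirs() tolerates an already-existing directory, so repeated pulls
+    # are safe.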
+ dst_parent = os.path.join(os.getcwd(), "metaflow-tutorials") + makedirs(dst_parent) + + # Pull specified episodes. + for episode in episodes: + dst_dir = os.path.join(dst_parent, episode) + # Check if episode has already been pulled before. + if os.path.exists(dst_dir): + if click.confirm( + "Episode " + + click.style('"{0}"'.format(episode), fg="red") + + " has already been pulled before. Do you wish " + "to delete the existing version?" + ): + shutil.rmtree(dst_dir) + else: + continue + echo("Pulling episode ", nl=False) + echo('"{0}"'.format(episode), fg="cyan", nl=False) + # TODO: Is the following redundant? + echo(" into your current working directory.") + # Copy from (local) metaflow package dir to current. + src_dir = os.path.join(tutorials_dir, episode) + shutil.copytree(src_dir, dst_dir) + + echo("\nTo know more about an episode, type:\n", nl=False) + echo("metaflow tutorials info [EPISODE]", fg="cyan") + + +@tutorials.command(help="Find out more about an episode.") +@click.argument("episode", autocompletion=autocomplete_episodes) +def info(episode): + validate_episode(episode) + src_dir = os.path.join(get_tutorials_dir(), episode) + metadata = get_tutorial_metadata(src_dir) + echo("Synopsis:", fg="cyan", bold=True) + echo("%s" % metadata["description"]) + + echo("\nShowcasing:", fg="cyan", bold=True, nl=True) + echo("%s" % metadata["showcase"]) + + if "prereq" in metadata: + echo("\nBefore playing:", fg="cyan", bold=True, nl=True) + echo("%s" % metadata["prereq"]) + + echo("\nTo play:", fg="cyan", bold=True) + echo("%s" % metadata["play"]) diff --git a/metaflow/cmd/util.py b/metaflow/cmd/util.py new file mode 100644 index 00000000000..ceff9869c25 --- /dev/null +++ b/metaflow/cmd/util.py @@ -0,0 +1,23 @@ +import os + +from metaflow._vendor import click + + +def makedirs(path): + # This is for python2 compatibility. + # Python3 has os.makedirs(exist_ok=True). + try: + os.makedirs(path) + except OSError as x: + if x.errno == 17: + return + else: + raise + + +def echo_dev_null(*args, **kwargs): + pass + + +def echo_always(line, **kwargs): + click.secho(line, **kwargs) diff --git a/metaflow/current.py b/metaflow/current.py index 0dc8b15776e..06c58ce05f9 100644 --- a/metaflow/current.py +++ b/metaflow/current.py @@ -14,6 +14,7 @@ def __init__(self): self._origin_run_id = None self._namespace = None self._username = None + self._metadata_str = None self._is_running = False def _raise(ex): @@ -33,7 +34,9 @@ def _set_env( origin_run_id=None, namespace=None, username=None, + metadata_str=None, is_running=True, + tags=None, ): if flow is not None: self._flow_name = flow.name @@ -46,7 +49,9 @@ def _set_env( self._origin_run_id = origin_run_id self._namespace = namespace self._username = username + self._metadata_str = metadata_str self._is_running = is_running + self._tags = tags def _update_env(self, env): for k, v in env.items(): @@ -60,42 +65,151 @@ def get(self, key, default=None): @property def is_running_flow(self): + """ + Returns True if called inside a running Flow, False otherwise. + + You can use this property e.g. inside a library to choose the desired + behavior depending on the execution context. + + Returns + ------- + bool + True if called inside a run, False otherwise. + """ return self._is_running @property def flow_name(self): + """ + The name of the currently executing flow. + + Returns + ------- + str + Flow name. + """ return self._flow_name @property def run_id(self): + """ + The run ID of the currently executing run. + + Returns + ------- + str + Run ID. 
+ """ return self._run_id @property def step_name(self): + """ + The name of the currently executing step. + + Returns + ------- + str + Step name. + """ return self._step_name @property def task_id(self): + """ + The task ID of the currently executing task. + + Returns + ------- + str + Task ID. + """ return self._task_id @property def retry_count(self): + """ + The index of the task execution attempt. + + This property returns 0 for the first attempt to execute the task. + If the @retry decorator is used and the first attempt fails, this + property returns the number of times the task was attempted prior + to the current attempt. + + Returns + ------- + int + The retry count. + """ return self._retry_count @property def origin_run_id(self): + """ + The run ID of the original run this run was resumed from. + + This property returns None for ordinary runs. If the run + was started by the resume command, the property returns + the ID of the original run. + + You can use this property to detect if the run is resumed + or not. + + Returns + ------- + str + Run ID of the original run. + """ return self._origin_run_id @property def pathspec(self): - return "/".join((self._flow_name, self._run_id, self._step_name, self._task_id)) + """ + Pathspec of the current run, i.e. a unique + identifier of the current task. The returned + string follows this format: + ``` + {flow_name}/{run_id}/{step_name}/{task_id} + ``` + + Returns + ------- + str + Pathspec. + """ + + pathspec_components = ( + self._flow_name, + self._run_id, + self._step_name, + self._task_id, + ) + if any(v is None for v in pathspec_components): + return None + return "/".join(pathspec_components) @property def namespace(self): + """ + The current namespace. + + Returns + ------- + str + Namespace. + """ return self._namespace @property def username(self): + """ + The name of the user who started the run, if available. + + Returns + ------- + str + User name. + """ return self._username @property @@ -106,6 +220,15 @@ def parallel(self): node_index=int(os.environ.get("MF_PARALLEL_NODE_INDEX", "0")), ) + @property + def tags(self): + """ + [Legacy function - do not use] + + Access tags through the Run object instead. + """ + return self._tags + # instantiate the Current singleton. This will be populated # by task.MetaflowTask before a task is executed. diff --git a/metaflow/datastore/__init__.py b/metaflow/datastore/__init__.py index 794ea23ccef..793251b0cff 100644 --- a/metaflow/datastore/__init__.py +++ b/metaflow/datastore/__init__.py @@ -2,8 +2,3 @@ from .flow_datastore import FlowDataStore from .datastore_set import TaskDataStoreSet from .task_datastore import TaskDataStore - -from .local_storage import LocalStorage -from .s3_storage import S3Storage - -DATASTORES = {"local": LocalStorage, "s3": S3Storage} diff --git a/metaflow/datastore/datastore_storage.py b/metaflow/datastore/datastore_storage.py index dfbbd0ef2c0..e8fad3c2bb1 100644 --- a/metaflow/datastore/datastore_storage.py +++ b/metaflow/datastore/datastore_storage.py @@ -229,6 +229,8 @@ def save_bytes(self, path_and_bytes_iter, overwrite=False, len_hint=0): BufferedIOBase. overwrite : bool True if the objects can be overwritten. Defaults to False. + Even when False, it is NOT an error condition to see an existing object. + Simply do not perform the upload operation. 
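# --- Editorial usage sketch (not part of the patch) ---------------------------
# How the `current` properties documented above are typically consumed from a
# step; the flow and step names here are placeholders.
from metaflow import FlowSpec, step, current

class ExampleFlow(FlowSpec):
    @step
    def start(self):
        if current.is_running_flow:
            # pathspec is documented above to be None unless the flow name,
            # run id, step name and task id are all known.
            print("running %s (retry %d)" % (current.pathspec, current.retry_count))
        self.next(self.end)

    @step
    def end(self):
        pass

if __name__ == "__main__":
    ExampleFlow()
# ------------------------------------------------------------------------------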
len_hint : int Estimated number of items produced by the iterator @@ -260,7 +262,7 @@ def load_bytes(self, keys): Metadata will be None if no metadata is present; otherwise it is a dictionary of metadata associated with the object. - Note that the file at `file_path` may no longer be accessible outside of + Note that the file at `file_path` may no longer be accessible outside the scope of the returned object. The order of items in the list is not to be relied on (ie: rely on the key diff --git a/metaflow/datastore/flow_datastore.py b/metaflow/datastore/flow_datastore.py index 4dfc6321f8f..f2707c10166 100644 --- a/metaflow/datastore/flow_datastore.py +++ b/metaflow/datastore/flow_datastore.py @@ -89,7 +89,7 @@ def get_latest_task_datastores( must also be specified, by default None pathspecs : List[str], optional Full task specs (run_id/step_name/task_id). Can be used instead of - specifiying run_id and steps, by default None + specifying run_id and steps, by default None allow_not_done : bool, optional If True, returns the latest attempt of a task even if that attempt wasn't marked as done, by default False @@ -204,7 +204,7 @@ def save_data(self, data_iter, len_hint=0): Parameters ---------- - data : Iterator[bytes] + data_iter : Iterator[bytes] Iterator over blobs to save; each item in the list will be saved individually. len_hint : int Estimate of the number of items that will be produced by the iterator, diff --git a/metaflow/datastore/task_datastore.py b/metaflow/datastore/task_datastore.py index f99856682f9..3f003173a0f 100644 --- a/metaflow/datastore/task_datastore.py +++ b/metaflow/datastore/task_datastore.py @@ -16,6 +16,8 @@ from .exceptions import DataException, UnpicklableArtifactException +_included_file_type = "" + def only_if_not_done(f): @wraps(f) @@ -150,7 +152,7 @@ def __init__( if self._attempt is None: self._attempt = max_attempt elif max_attempt is None or self._attempt > max_attempt: - # In this case, the attempt does not exist so we can't load + # In this case the attempt does not exist, so we can't load # anything self._objects = {} self._info = {} @@ -299,6 +301,7 @@ def pickle_iter(): "type": str(type(obj)), "encoding": encode_type, } + artifact_names.append(name) yield blob @@ -386,7 +389,11 @@ def get_artifact_sizes(self, names): """ for name in names: info = self._info.get(name) - yield name, info.get("size", 0) + if info["type"] == _included_file_type: + sz = self[name].size + else: + sz = info.get("size", 0) + yield name, sz @require_mode("r") def get_legacy_log_size(self, stream): @@ -569,7 +576,7 @@ def is_none(self, name): # Conservatively check if the actual object is None, # in case the artifact is stored using a different python version. # Note that if an object is None and stored in Py2 and accessed in - # Py3, this test will fail and we will fallback to the slow path. This + # Py3, this test will fail and we will fall back to the slow path. 
This # is intended (being conservative) if obj_type == str(type(None)): return True @@ -681,7 +688,7 @@ def persist(self, flow): self._info.update(flow._datastore._info) # we create a list of valid_artifacts in advance, outside of - # artifacts_iter so we can provide a len_hint below + # artifacts_iter, so we can provide a len_hint below valid_artifacts = [] for var in dir(flow): if var.startswith("__") or var in flow._EPHEMERAL: @@ -783,18 +790,37 @@ def to_dict(self, show_private=False, max_value_size=None, include=None): continue if k[0] == "_" and not show_private: continue - if max_value_size is not None and self._info[k]["size"] > max_value_size: - d[k] = ArtifactTooLarge() + + info = self._info[k] + if max_value_size is not None: + if info["type"] == _included_file_type: + sz = self[k].size + else: + sz = info.get("size", 0) + + if sz == 0 or sz > max_value_size: + d[k] = ArtifactTooLarge() + else: + d[k] = self[k] + if info["type"] == _included_file_type: + d[k] = d[k].decode(k) else: d[k] = self[k] + if info["type"] == _included_file_type: + d[k] = d[k].decode(k) + return d @require_mode("r") def format(self, **kwargs): def lines(): for k, v in self.to_dict(**kwargs).items(): + if self._info[k]["type"] == _included_file_type: + sz = self[k].size + else: + sz = self._info[k]["size"] yield k, "*{key}* [size: {size} type: {type}] = {value}".format( - key=k, value=v, **self._info[k] + key=k, value=v, size=sz, type=self._info[k]["type"] ) return "\n".join(line for k, line in sorted(lines())) diff --git a/metaflow/datatools/s3.py b/metaflow/datatools/s3.py deleted file mode 100644 index a55f41b9e02..00000000000 --- a/metaflow/datatools/s3.py +++ /dev/null @@ -1,1047 +0,0 @@ -import json -import os -import sys -import time -import shutil -import random -import subprocess -from io import RawIOBase, BytesIO, BufferedIOBase -from itertools import chain, starmap -from tempfile import mkdtemp, NamedTemporaryFile - -from .. 
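# --- Editorial sketch (not part of the patch) ---------------------------------
# The same size-selection pattern appears above in get_artifact_sizes, to_dict
# and format: for artifacts whose recorded type matches `_included_file_type`,
# the metadata "size" is not authoritative, so the stored wrapper object (which
# exposes a `.size` attribute) is consulted instead. A standalone restatement,
# with hypothetical helper names for illustration only:
def _artifact_size(info, load_artifact):
    # `info` is the per-artifact metadata dict; `load_artifact` returns the
    # stored object.
    if info["type"] == _included_file_type:
        return load_artifact().size
    return info.get("size", 0)
# ------------------------------------------------------------------------------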
import FlowSpec -from ..current import current -from ..metaflow_config import DATATOOLS_S3ROOT, S3_RETRY_COUNT -from ..util import ( - namedtuple_with_defaults, - is_stringish, - to_bytes, - to_unicode, - to_fileobj, - url_quote, - url_unquote, -) -from ..exception import MetaflowException -from ..debug import debug - -try: - # python2 - from urlparse import urlparse -except: - # python3 - from urllib.parse import urlparse - -from .s3util import get_s3_client, read_in_chunks, get_timestamp - -try: - import boto3 - from boto3.s3.transfer import TransferConfig - - DOWNLOAD_FILE_THRESHOLD = 2 * TransferConfig().multipart_threshold - DOWNLOAD_MAX_CHUNK = 2 * 1024 * 1024 * 1024 - 1 - boto_found = True -except: - boto_found = False - - -def ensure_unicode(x): - return None if x is None else to_unicode(x) - - -S3GetObject = namedtuple_with_defaults("S3GetObject", "key offset length") - -S3PutObject = namedtuple_with_defaults( - "S3PutObject", - "key value path content_type metadata", - defaults=(None, None, None, None), -) - -RangeInfo = namedtuple_with_defaults( - "RangeInfo", "total_size request_offset request_length", defaults=(0, -1) -) - - -class MetaflowS3InvalidObject(MetaflowException): - headline = "Not a string-like object" - - -class MetaflowS3URLException(MetaflowException): - headline = "Invalid address" - - -class MetaflowS3Exception(MetaflowException): - headline = "S3 access failed" - - -class MetaflowS3NotFound(MetaflowException): - headline = "S3 object not found" - - -class MetaflowS3AccessDenied(MetaflowException): - headline = "S3 access denied" - - -class S3Object(object): - """ - This object represents a path or an object in S3, - with an optional local copy. - Get or list calls return one or more of S3Objects. - """ - - def __init__( - self, - prefix, - url, - path, - size=None, - content_type=None, - metadata=None, - range_info=None, - last_modified=None, - ): - - # all fields of S3Object should return a unicode object - prefix, url, path = map(ensure_unicode, (prefix, url, path)) - - self._size = size - self._url = url - self._path = path - self._key = None - self._content_type = content_type - self._last_modified = last_modified - - self._metadata = None - if metadata is not None and "metaflow-user-attributes" in metadata: - self._metadata = json.loads(metadata["metaflow-user-attributes"]) - - if range_info and ( - range_info.request_length is None or range_info.request_length < 0 - ): - self._range_info = RangeInfo( - range_info.total_size, range_info.request_offset, range_info.total_size - ) - else: - self._range_info = range_info - - if path: - self._size = os.stat(self._path).st_size - - if prefix is None or prefix == url: - self._key = url - self._prefix = None - else: - self._key = url[len(prefix.rstrip("/")) + 1 :].rstrip("/") - self._prefix = prefix - - @property - def exists(self): - """ - Does this key correspond to an object in S3? - """ - return self._size is not None - - @property - def downloaded(self): - """ - Has this object been downloaded? - """ - return bool(self._path) - - @property - def url(self): - """ - S3 location of the object - """ - return self._url - - @property - def prefix(self): - """ - Prefix requested that matches the object. - """ - return self._prefix - - @property - def key(self): - """ - Key corresponds to the key given to the get call that produced - this object. This may be a full S3 URL or a suffix based on what - was requested. 
- """ - return self._key - - @property - def path(self): - """ - Path to the local file corresponding to the object downloaded. - This file gets deleted automatically when a S3 scope exits. - Returns None if this S3Object has not been downloaded. - """ - return self._path - - @property - def blob(self): - """ - Contents of the object as a byte string. - Returns None if this S3Object has not been downloaded. - """ - if self._path: - with open(self._path, "rb") as f: - return f.read() - - @property - def text(self): - """ - Contents of the object as a Unicode string. - Returns None if this S3Object has not been downloaded. - """ - if self._path: - return self.blob.decode("utf-8", errors="replace") - - @property - def size(self): - """ - Size of the object in bytes. - Returns None if the key does not correspond to an object in S3. - """ - return self._size - - @property - def has_info(self): - """ - Returns true if this S3Object contains the content-type or user-metadata. - If False, this means that content_type and range_info will not return the - proper information - """ - return self._content_type is not None or self._metadata is not None - - @property - def metadata(self): - """ - Returns a dictionary of user-defined metadata - """ - return self._metadata - - @property - def content_type(self): - """ - Returns the content-type of the S3 object; if unknown, returns None - """ - return self._content_type - - @property - def range_info(self): - """ - Returns a namedtuple containing the following fields: - - total_size: size in S3 of the object - - request_offset: the starting offset in this S3Object - - request_length: the length in this S3Object - """ - return self._range_info - - @property - def last_modified(self): - """ - Returns the last modified unix timestamp of the object, or None - if not fetched. - """ - return self._last_modified - - def __str__(self): - if self._path: - return "" % (self._url, self._size) - elif self._size: - return "" % (self._url, self._size) - else: - return "" % self._url - - def __repr__(self): - return str(self) - - -class S3Client(object): - def __init__(self): - self._s3_client = None - self._s3_error = None - - @property - def client(self): - if self._s3_client is None: - self.reset_client() - return self._s3_client - - @property - def error(self): - if self._s3_error is None: - self.reset_client() - return self._s3_error - - def reset_client(self): - self._s3_client, self._s3_error = get_s3_client() - - -class S3(object): - @classmethod - def get_root_from_config(cls, echo, create_on_absent=True): - return DATATOOLS_S3ROOT - - def __init__( - self, tmproot=".", bucket=None, prefix=None, run=None, s3root=None, **kwargs - ): - """ - Initialize a new context for S3 operations. This object is used as - a context manager for a with statement. - There are two ways to initialize this object depending whether you want - to bind paths to a Metaflow run or not. - 1. With a run object: - run: (required) Either a FlowSpec object (typically 'self') or a - Run object corresponding to an existing Metaflow run. These - are used to add a version suffix in the S3 path. - bucket: (optional) S3 bucket. - prefix: (optional) S3 prefix. - 2. Without a run object: - s3root: (optional) An S3 root URL for all operations. If this is - not specified, all operations require a full S3 URL. 
- These options are supported in both the modes: - tmproot: (optional) Root path for temporary files (default: '.') - """ - - if not boto_found: - raise MetaflowException("You need to install 'boto3' in order to use S3.") - - if run: - # 1. use a (current) run ID with optional customizations - parsed = urlparse(DATATOOLS_S3ROOT) - if not bucket: - bucket = parsed.netloc - if not prefix: - prefix = parsed.path - if isinstance(run, FlowSpec): - if current.is_running_flow: - prefix = os.path.join(prefix, current.flow_name, current.run_id) - else: - raise MetaflowS3URLException( - "Initializing S3 with a FlowSpec outside of a running " - "flow is not supported." - ) - else: - prefix = os.path.join(prefix, run.parent.id, run.id) - - self._s3root = u"s3://%s" % os.path.join(bucket, prefix.strip("/")) - elif s3root: - # 2. use an explicit S3 prefix - parsed = urlparse(to_unicode(s3root)) - if parsed.scheme != "s3": - raise MetaflowS3URLException( - "s3root needs to be an S3 URL prefxied with s3://." - ) - self._s3root = s3root.rstrip("/") - else: - # 3. use the client only with full URLs - self._s3root = None - - self._s3_client = kwargs.get("external_client", S3Client()) - self._tmpdir = mkdtemp(dir=tmproot, prefix="metaflow.s3.") - - def __enter__(self): - return self - - def __exit__(self, *args): - self.close() - - def close(self): - """ - Delete all temporary files downloaded in this context. - """ - try: - if not debug.s3client: - if self._tmpdir: - shutil.rmtree(self._tmpdir) - self._tmpdir = None - except: - pass - - def _url(self, key_value): - # NOTE: All URLs are handled as Unicode objects (unicode in py2, - # string in py3) internally. We expect that all URLs passed to this - # class as either Unicode or UTF-8 encoded byte strings. All URLs - # returned are Unicode. - key = getattr(key_value, "key", key_value) - if self._s3root is None: - parsed = urlparse(to_unicode(key)) - if parsed.scheme == "s3" and parsed.path: - return key - else: - if current.is_running_flow: - raise MetaflowS3URLException( - "Specify S3(run=self) when you use S3 inside a running " - "flow. Otherwise you have to use S3 with full " - "s3:// urls." - ) - else: - raise MetaflowS3URLException( - "Initialize S3 with an 's3root' or 'run' if you don't " - "want to specify full s3:// urls." - ) - elif key: - if key.startswith("s3://"): - raise MetaflowS3URLException( - "Don't use absolute S3 URLs when the S3 client is " - "initialized with a prefix. URL: %s" % key - ) - return os.path.join(self._s3root, key) - else: - return self._s3root - - def _url_and_range(self, key_value): - url = self._url(key_value) - start = getattr(key_value, "offset", None) - length = getattr(key_value, "length", None) - range_str = None - # Range specification are inclusive so getting from offset 500 for 100 - # bytes will read as bytes=500-599 - if start is not None or length is not None: - if start is None: - start = 0 - if length is None: - # Fetch from offset till the end of the file - range_str = "bytes=%d-" % start - elif length < 0: - # Fetch from end; ignore start value here - range_str = "bytes=-%d" % (-length) - else: - # Typical range fetch - range_str = "bytes=%d-%d" % (start, start + length - 1) - return url, range_str - - def list_paths(self, keys=None): - """ - List the next level of paths in S3. If multiple keys are - specified, listings are done in parallel. The returned - S3Objects have .exists == False if the url refers to a - prefix, not an existing S3 object. 
- Args: - keys: (required) a list of suffixes for paths to list. - Returns: - a list of S3Objects (not downloaded) - Example: - Consider the following paths in S3: - A/B/C - D/E - In this case, list_paths(['A', 'D']), returns ['A/B', 'D/E']. The - first S3Object has .exists == False, since it does not refer to an - object in S3. It is just a prefix. - """ - - def _list(keys): - if keys is None: - keys = [None] - urls = ((self._url(key).rstrip("/") + "/", None) for key in keys) - res = self._read_many_files("list", urls) - for s3prefix, s3url, size in res: - if size: - yield s3prefix, s3url, None, int(size) - else: - yield s3prefix, s3url, None, None - - return list(starmap(S3Object, _list(keys))) - - def list_recursive(self, keys=None): - """ - List objects in S3 recursively. If multiple keys are - specified, listings are done in parallel. The returned - S3Objects have always .exists == True, since they refer - to existing objects in S3. - Args: - keys: (required) a list of suffixes for paths to list. - Returns: - a list of S3Objects (not downloaded) - Example: - Consider the following paths in S3: - A/B/C - D/E - In this case, list_recursive(['A', 'D']), returns ['A/B/C', 'D/E']. - """ - - def _list(keys): - if keys is None: - keys = [None] - res = self._read_many_files( - "list", map(self._url_and_range, keys), recursive=True - ) - for s3prefix, s3url, size in res: - yield s3prefix, s3url, None, int(size) - - return list(starmap(S3Object, _list(keys))) - - def info(self, key=None, return_missing=False): - """ - Get information about a single object from S3 - Args: - key: (optional) a suffix identifying the object. - return_missing: (optional, default False) if set to True, do - not raise an exception for a missing key but - return it as an S3Object with .exists == False. - Returns: - an S3Object containing information about the object. The - downloaded property will be false and exists will indicate whether - or not the file exists - """ - url = self._url(key) - src = urlparse(url) - - def _info(s3, tmp): - resp = s3.head_object(Bucket=src.netloc, Key=src.path.lstrip('/"')) - return { - "content_type": resp["ContentType"], - "metadata": resp["Metadata"], - "size": resp["ContentLength"], - "last_modified": get_timestamp(resp["LastModified"]), - } - - info_results = None - try: - _, info_results = self._one_boto_op(_info, url, create_tmp_file=False) - except MetaflowS3NotFound: - if return_missing: - info_results = None - else: - raise - if info_results: - return S3Object( - self._s3root, - url, - path=None, - size=info_results["size"], - content_type=info_results["content_type"], - metadata=info_results["metadata"], - last_modified=info_results["last_modified"], - ) - return S3Object(self._s3root, url, None) - - def info_many(self, keys, return_missing=False): - """ - Get information about many objects from S3 in parallel. - Args: - keys: (required) a list of suffixes identifying the objects. - return_missing: (optional, default False) if set to True, do - not raise an exception for a missing key but - return it as an S3Object with .exists == False. - Returns: - a list of S3Objects corresponding to the objects requested. The - downloaded property will be false and exists will indicate whether - or not the file exists. - """ - - def _head(): - from . 
import s3op - - res = self._read_many_files( - "info", map(self._url_and_range, keys), verbose=False, listing=True - ) - - for s3prefix, s3url, fname in res: - if fname: - # We have a metadata file to read from - with open(os.path.join(self._tmpdir, fname), "r") as f: - info = json.load(f) - if info["error"] is not None: - # We have an error, we check if it is a missing file - if info["error"] == s3op.ERROR_URL_NOT_FOUND: - if return_missing: - yield self._s3root, s3url, None - else: - raise MetaflowS3NotFound() - elif info["error"] == s3op.ERROR_URL_ACCESS_DENIED: - raise MetaflowS3AccessDenied() - else: - raise MetaflowS3Exception("Got error: %d" % info["error"]) - else: - yield self._s3root, s3url, None, info["size"], info[ - "content_type" - ], info["metadata"], None, info["last_modified"] - else: - # This should not happen; we should always get a response - # even if it contains an error inside it - raise MetaflowS3Exception("Did not get a response to HEAD") - - return list(starmap(S3Object, _head())) - - def get(self, key=None, return_missing=False, return_info=True): - """ - Get a single object from S3. - Args: - key: (optional) a suffix identifying the object. Can also be - an object containing the properties `key`, `offset` and - `length` to specify a range query. `S3GetObject` is such an object. - return_missing: (optional, default False) if set to True, do - not raise an exception for a missing key but - return it as an S3Object with .exists == False. - return_info: (optional, default True) if set to True, fetch the - content-type and user metadata associated with the object. - Returns: - an S3Object corresponding to the object requested. - """ - url, r = self._url_and_range(key) - src = urlparse(url) - - def _download(s3, tmp): - if r: - resp = s3.get_object( - Bucket=src.netloc, Key=src.path.lstrip("/"), Range=r - ) - else: - resp = s3.get_object(Bucket=src.netloc, Key=src.path.lstrip("/")) - sz = resp["ContentLength"] - if not r and sz > DOWNLOAD_FILE_THRESHOLD: - # In this case, it is more efficient to use download_file as it - # will download multiple parts in parallel (it does it after - # multipart_threshold) - s3.download_file(src.netloc, src.path.lstrip("/"), tmp) - else: - with open(tmp, mode="wb") as t: - read_in_chunks(t, resp["Body"], sz, DOWNLOAD_MAX_CHUNK) - if return_info: - return { - "content_type": resp["ContentType"], - "metadata": resp["Metadata"], - "last_modified": get_timestamp(resp["LastModified"]), - } - return None - - addl_info = None - try: - path, addl_info = self._one_boto_op(_download, url) - except MetaflowS3NotFound: - if return_missing: - path = None - else: - raise - if addl_info: - return S3Object( - self._s3root, - url, - path, - content_type=addl_info["content_type"], - metadata=addl_info["metadata"], - last_modified=addl_info["last_modified"], - ) - return S3Object(self._s3root, url, path) - - def get_many(self, keys, return_missing=False, return_info=True): - """ - Get many objects from S3 in parallel. - Args: - keys: (required) a list of suffixes identifying the objects. Each - item in the list can also be an object containing the properties - `key`, `offset` and `length to specify a range query. - `S3GetObject` is such an object. - return_missing: (optional, default False) if set to True, do - not raise an exception for a missing key but - return it as an S3Object with .exists == False. - return_info: (optional, default True) if set to True, fetch the - content-type and user metadata associated with the object. 
- Returns: - a list of S3Objects corresponding to the objects requested. - """ - - def _get(): - res = self._read_many_files( - "get", - map(self._url_and_range, keys), - allow_missing=return_missing, - verify=True, - verbose=False, - info=return_info, - listing=True, - ) - - for s3prefix, s3url, fname in res: - if return_info: - if fname: - # We have a metadata file to read from - with open( - os.path.join(self._tmpdir, "%s_meta" % fname), "r" - ) as f: - info = json.load(f) - yield self._s3root, s3url, os.path.join( - self._tmpdir, fname - ), None, info["content_type"], info["metadata"], None, info[ - "last_modified" - ] - else: - yield self._s3root, s3prefix, None - else: - if fname: - yield self._s3root, s3url, os.path.join(self._tmpdir, fname) - else: - # missing entries per return_missing=True - yield self._s3root, s3prefix, None - - return list(starmap(S3Object, _get())) - - def get_recursive(self, keys, return_info=False): - """ - Get many objects from S3 recursively in parallel. - Args: - keys: (required) a list of suffixes for paths to download - recursively. - return_info: (optional, default False) if set to True, fetch the - content-type and user metadata associated with the object. - Returns: - a list of S3Objects corresponding to the objects requested. - """ - - def _get(): - res = self._read_many_files( - "get", - map(self._url_and_range, keys), - recursive=True, - verify=True, - verbose=False, - info=return_info, - listing=True, - ) - - for s3prefix, s3url, fname in res: - if return_info: - # We have a metadata file to read from - with open(os.path.join(self._tmpdir, "%s_meta" % fname), "r") as f: - info = json.load(f) - yield self._s3root, s3url, os.path.join( - self._tmpdir, fname - ), None, info["content_type"], info["metadata"], None, info[ - "last_modified" - ] - else: - yield s3prefix, s3url, os.path.join(self._tmpdir, fname) - - return list(starmap(S3Object, _get())) - - def get_all(self, return_info=False): - """ - Get all objects from S3 recursively (in parallel). This request - only works if S3 is initialized with a run or a s3root prefix. - Args: - return_info: (optional, default False) if set to True, fetch the - content-type and user metadata associated with the object. - Returns: - a list of S3Objects corresponding to the objects requested. - """ - if self._s3root is None: - raise MetaflowS3URLException( - "Can't get_all() when S3 is initialized without a prefix" - ) - else: - return self.get_recursive([None], return_info) - - def put(self, key, obj, overwrite=True, content_type=None, metadata=None): - """ - Put an object to S3. - Args: - key: (required) suffix for the object. - obj: (required) a bytes, string, or a unicode object to - be stored in S3. - overwrite: (optional) overwrites the key with obj, if it exists - content_type: (optional) string representing the MIME type of the - object - metadata: (optional) User metadata to store alongside the object - Returns: - an S3 URL corresponding to the object stored. - """ - if isinstance(obj, (RawIOBase, BufferedIOBase)): - if not obj.readable() or not obj.seekable(): - raise MetaflowS3InvalidObject( - "Object corresponding to the key '%s' is not readable or seekable" - % key - ) - blob = obj - else: - if not is_stringish(obj): - raise MetaflowS3InvalidObject( - "Object corresponding to the key '%s' is not a string " - "or a bytes object." 
% key - ) - blob = to_fileobj(obj) - # We override the close functionality to prevent closing of the - # file if it is used multiple times when uploading (since upload_fileobj - # will/may close it on failure) - real_close = blob.close - blob.close = lambda: None - - url = self._url(key) - src = urlparse(url) - extra_args = None - if content_type or metadata: - extra_args = {} - if content_type: - extra_args["ContentType"] = content_type - if metadata: - extra_args["Metadata"] = { - "metaflow-user-attributes": json.dumps(metadata) - } - - def _upload(s3, _): - # We make sure we are at the beginning in case we are retrying - blob.seek(0) - s3.upload_fileobj( - blob, src.netloc, src.path.lstrip("/"), ExtraArgs=extra_args - ) - - if overwrite: - self._one_boto_op(_upload, url, create_tmp_file=False) - real_close() - return url - else: - - def _head(s3, _): - s3.head_object(Bucket=src.netloc, Key=src.path.lstrip("/")) - - try: - self._one_boto_op(_head, url, create_tmp_file=False) - except MetaflowS3NotFound: - self._one_boto_op(_upload, url, create_tmp_file=False) - finally: - real_close() - return url - - def put_many(self, key_objs, overwrite=True): - """ - Put objects to S3 in parallel. - Args: - key_objs: (required) an iterator of (key, value) tuples. Value must - be a string, bytes, or a unicode object. Instead of - (key, value) tuples, you can also pass any object that - has the following properties 'key', 'value', 'content_type', - 'metadata' like the S3PutObject for example. 'key' and - 'value' are required but others are optional. - overwrite: (optional) overwrites the key with obj, if it exists - Returns: - a list of (key, S3 URL) tuples corresponding to the files sent. - """ - - def _store(): - for key_obj in key_objs: - if isinstance(key_obj, tuple): - key = key_obj[0] - obj = key_obj[1] - else: - key = key_obj.key - obj = key_obj.value - store_info = { - "key": key, - "content_type": getattr(key_obj, "content_type", None), - } - metadata = getattr(key_obj, "metadata", None) - if metadata: - store_info["metadata"] = { - "metaflow-user-attributes": json.dumps(metadata) - } - if isinstance(obj, (RawIOBase, BufferedIOBase)): - if not obj.readable() or not obj.seekable(): - raise MetaflowS3InvalidObject( - "Object corresponding to the key '%s' is not readable or seekable" - % key - ) - else: - if not is_stringish(obj): - raise MetaflowS3InvalidObject( - "Object corresponding to the key '%s' is not a string " - "or a bytes object." % key - ) - obj = to_fileobj(obj) - with NamedTemporaryFile( - dir=self._tmpdir, - delete=False, - mode="wb", - prefix="metaflow.s3.put_many.", - ) as tmp: - tmp.write(obj.read()) - tmp.close() - yield tmp.name, self._url(key), store_info - - return self._put_many_files(_store(), overwrite) - - def put_files(self, key_paths, overwrite=True): - """ - Put files to S3 in parallel. - Args: - key_paths: (required) an iterator of (key, path) tuples. Instead of - (key, path) tuples, you can also pass any object that - has the following properties 'key', 'path', 'content_type', - 'metadata' like the S3PutObject for example. 'key' and - 'path' are required but others are optional. - overwrite: (optional) overwrites the key with obj, if it exists - Returns: - a list of (key, S3 URL) tuples corresponding to the files sent. 
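# --- Editorial usage sketch (not part of the patch) ---------------------------
# The get/put surface documented above is exercised like this. The bucket and
# prefix are placeholders, and this assumes the S3 datatool remains importable
# as `from metaflow import S3` after this refactor:
from metaflow import S3

with S3(s3root="s3://my-bucket/some/prefix") as s3:
    url = s3.put("greeting", "hello world")       # returns the full s3:// URL
    obj = s3.get("greeting")                      # S3Object, downloaded locally
    print(obj.text, obj.size, obj.url)
    many = s3.get_many(["greeting"], return_missing=True)
# Leaving the `with` block deletes the temporary local copies.
# ------------------------------------------------------------------------------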
- """ - - def _check(): - for key_path in key_paths: - if isinstance(key_path, tuple): - key = key_path[0] - path = key_path[1] - else: - key = key_path.key - path = key_path.path - store_info = { - "key": key, - "content_type": getattr(key_path, "content_type", None), - } - metadata = getattr(key_path, "metadata", None) - if metadata: - store_info["metadata"] = { - "metaflow-user-attributes": json.dumps(metadata) - } - if not os.path.exists(path): - raise MetaflowS3NotFound("Local file not found: %s" % path) - yield path, self._url(key), store_info - - return self._put_many_files(_check(), overwrite) - - def _one_boto_op(self, op, url, create_tmp_file=True): - error = "" - for i in range(S3_RETRY_COUNT + 1): - tmp = None - if create_tmp_file: - tmp = NamedTemporaryFile( - dir=self._tmpdir, prefix="metaflow.s3.one_file.", delete=False - ) - try: - side_results = op(self._s3_client.client, tmp.name if tmp else None) - return tmp.name if tmp else None, side_results - except self._s3_client.error as err: - from . import s3op - - error_code = s3op.normalize_client_error(err) - if error_code == 404: - raise MetaflowS3NotFound(url) - elif error_code == 403: - raise MetaflowS3AccessDenied(url) - elif error_code == "NoSuchBucket": - raise MetaflowS3URLException("Specified S3 bucket doesn't exist.") - error = str(err) - except Exception as ex: - # TODO specific error message for out of disk space - error = str(ex) - if tmp: - os.unlink(tmp.name) - self._s3_client.reset_client() - # add some jitter to make sure retries are not synchronized - time.sleep(2 ** i + random.randint(0, 10)) - raise MetaflowS3Exception( - "S3 operation failed.\n" "Key requested: %s\n" "Error: %s" % (url, error) - ) - - # NOTE: re: _read_many_files and _put_many_files - # All file IO is through binary files - we write bytes, we read - # bytes. All inputs and outputs from these functions are Unicode. - # Conversion between bytes and unicode is done through - # and url_unquote. 
- def _read_many_files(self, op, prefixes_and_ranges, **options): - prefixes_and_ranges = list(prefixes_and_ranges) - with NamedTemporaryFile( - dir=self._tmpdir, - mode="wb", - delete=not debug.s3client, - prefix="metaflow.s3.inputs.", - ) as inputfile: - inputfile.write( - b"\n".join( - [ - b" ".join([url_quote(prefix)] + ([url_quote(r)] if r else [])) - for prefix, r in prefixes_and_ranges - ] - ) - ) - inputfile.flush() - stdout, stderr = self._s3op_with_retries( - op, inputs=inputfile.name, **options - ) - if stderr: - raise MetaflowS3Exception( - "Getting S3 files failed.\n" - "First prefix requested: %s\n" - "Error: %s" % (prefixes_and_ranges[0], stderr) - ) - else: - for line in stdout.splitlines(): - yield tuple(map(url_unquote, line.strip(b"\n").split(b" "))) - - def _put_many_files(self, url_info, overwrite): - url_info = list(url_info) - url_dicts = [ - dict( - chain([("local", os.path.realpath(local)), ("url", url)], info.items()) - ) - for local, url, info in url_info - ] - - with NamedTemporaryFile( - dir=self._tmpdir, - mode="wb", - delete=not debug.s3client, - prefix="metaflow.s3.put_inputs.", - ) as inputfile: - lines = [to_bytes(json.dumps(x)) for x in url_dicts] - inputfile.write(b"\n".join(lines)) - inputfile.flush() - stdout, stderr = self._s3op_with_retries( - "put", - filelist=inputfile.name, - verbose=False, - overwrite=overwrite, - listing=True, - ) - if stderr: - raise MetaflowS3Exception( - "Uploading S3 files failed.\n" - "First key: %s\n" - "Error: %s" % (url_info[0][2]["key"], stderr) - ) - else: - urls = set() - for line in stdout.splitlines(): - url, _, _ = map(url_unquote, line.strip(b"\n").split(b" ")) - urls.add(url) - return [(info["key"], url) for _, url, info in url_info if url in urls] - - def _s3op_with_retries(self, mode, **options): - from . 
import s3op - - cmdline = [sys.executable, os.path.abspath(s3op.__file__), mode] - for key, value in options.items(): - key = key.replace("_", "-") - if isinstance(value, bool): - if value: - cmdline.append("--%s" % key) - else: - cmdline.append("--no-%s" % key) - else: - cmdline.extend(("--%s" % key, value)) - - for i in range(S3_RETRY_COUNT + 1): - with NamedTemporaryFile( - dir=self._tmpdir, - mode="wb+", - delete=not debug.s3client, - prefix="metaflow.s3op.stderr", - ) as stderr: - try: - debug.s3client_exec(cmdline) - stdout = subprocess.check_output( - cmdline, cwd=self._tmpdir, stderr=stderr.file - ) - return stdout, None - except subprocess.CalledProcessError as ex: - stderr.seek(0) - err_out = stderr.read().decode("utf-8", errors="replace") - stderr.seek(0) - if ex.returncode == s3op.ERROR_URL_NOT_FOUND: - raise MetaflowS3NotFound(err_out) - elif ex.returncode == s3op.ERROR_URL_ACCESS_DENIED: - raise MetaflowS3AccessDenied(err_out) - print("Error with S3 operation:", err_out) - time.sleep(2 ** i + random.randint(0, 10)) - - return None, err_out diff --git a/metaflow/debug.py b/metaflow/debug.py index c4b1e4ac03d..fbfaca95dc8 100644 --- a/metaflow/debug.py +++ b/metaflow/debug.py @@ -1,4 +1,5 @@ from __future__ import print_function +import inspect import sys from functools import partial @@ -22,7 +23,7 @@ def __init__(self): import metaflow.metaflow_config as config for typ in config.DEBUG_OPTIONS: - if getattr(config, "METAFLOW_DEBUG_%s" % typ.upper()): + if getattr(config, "DEBUG_%s" % typ.upper()): op = partial(self.log, typ) else: op = self.noop @@ -37,7 +38,9 @@ def log(self, typ, args): s = args else: s = " ".join(args) - print("debug[%s]: %s" % (typ, s), file=sys.stderr) + lineno = inspect.currentframe().f_back.f_lineno + filename = inspect.stack()[1][1] + print("debug[%s %s:%s]: %s" % (typ, filename, lineno, s), file=sys.stderr) def noop(self, args): pass diff --git a/metaflow/decorators.py b/metaflow/decorators.py index 0042e71dcaa..8723367b79b 100644 --- a/metaflow/decorators.py +++ b/metaflow/decorators.py @@ -1,4 +1,5 @@ from functools import partial +import json import re import os import sys @@ -13,6 +14,13 @@ from metaflow._vendor import click +try: + unicode +except NameError: + unicode = str + basestring = str + + class BadStepDecoratorException(MetaflowException): headline = "Syntax error" @@ -114,21 +122,42 @@ def __init__(self, attributes=None, statically_defined=False): @classmethod def _parse_decorator_spec(cls, deco_spec): - top = deco_spec.split(":", 1) - if len(top) == 1: + if len(deco_spec) == 0: return cls() - else: - name, attrspec = top - attrs = dict( - map(lambda x: x.strip(), a.split("=")) - for a in re.split(""",(?=[\s\w]+=)""", attrspec.strip("\"'")) - ) - return cls(attributes=attrs) + + attrs = {} + # TODO: Do we really want to allow spaces in the names of attributes?!? + for a in re.split(""",(?=[\s\w]+=)""", deco_spec): + name, val = a.split("=", 1) + try: + val_parsed = json.loads(val.strip().replace('\\"', '"')) + except json.JSONDecodeError: + # In this case, we try to convert to either an int or a float or + # leave as is. Prefer ints if possible. 
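# --- Editorial sketch (not part of the patch) ---------------------------------
# Standalone restatement of the value-coercion step _parse_decorator_spec
# performs right here: JSON first (so quoted dicts and lists survive), then
# int, then float, then the raw string.
import json

def _coerce_attr_value(val):
    val = val.strip()
    try:
        return json.loads(val.replace('\\"', '"'))
    except json.JSONDecodeError:
        pass
    for conv in (int, float):
        try:
            return conv(val)
        except ValueError:
            pass
    return val

# _coerce_attr_value("3") -> 3, _coerce_attr_value("0.5") -> 0.5,
# _coerce_attr_value("foo") -> "foo", _coerce_attr_value('{\\"a\\": 1}') -> {"a": 1}
# ------------------------------------------------------------------------------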
+ try: + val_parsed = int(val.strip()) + except ValueError: + try: + val_parsed = float(val.strip()) + except ValueError: + val_parsed = val.strip() + + attrs[name.strip()] = val_parsed + return cls(attributes=attrs) def make_decorator_spec(self): attrs = {k: v for k, v in self.attributes.items() if v is not None} if attrs: - attrstr = ",".join("%s=%s" % x for x in attrs.items()) + attr_list = [] + # We dump simple types directly as string to get around the nightmare quote + # escaping but for more complex types (typically dictionaries or lists), + # we dump using JSON. + for k, v in attrs.items(): + if isinstance(v, (int, float, unicode, basestring)): + attr_list.append("%s=%s" % (k, str(v))) + else: + attr_list.append("%s=%s" % (k, json.dumps(v).replace('"', '\\"'))) + attrstr = ",".join(attr_list) return "%s:%s" % (self.name, attrstr) else: return self.name @@ -148,7 +177,7 @@ class FlowDecorator(Decorator): options = {} def __init__(self, *args, **kwargs): - # Note that this assumes we are executing one flow per process so we have a global list of + # Note that this assumes we are executing one flow per process, so we have a global list of # _flow_decorators. A similar setup is used in parameters. self._flow_decorators.append(self) super(FlowDecorator, self).__init__(*args, **kwargs) @@ -167,7 +196,7 @@ def get_top_level_options(self): options that should be passed to subprocesses (tasks). The option names should be a subset of the keys in self.options. - If the decorator has a non-empty set of options in self.options, you + If the decorator has a non-empty set of options in `self.options`, you probably want to return the assigned values in this method. """ return [] @@ -450,8 +479,10 @@ def _attach_decorators_to_step(step, decospecs): from .plugins import STEP_DECORATORS decos = {decotype.name: decotype for decotype in STEP_DECORATORS} + for decospec in decospecs: - deconame = decospec.strip("'").split(":")[0] + splits = decospec.split(":", 1) + deconame = splits[0] if deconame not in decos: raise UnknownStepDecoratorException(deconame) # Attach the decorator to step if it doesn't have the decorator @@ -461,8 +492,11 @@ def _attach_decorators_to_step(step, decospecs): deconame not in [deco.name for deco in step.decorators] or decos[deconame].allow_multiple ): - # if the decorator is present in a step and is of type allow_mutliple then add the decorator to the step - deco = decos[deconame]._parse_decorator_spec(decospec) + # if the decorator is present in a step and is of type allow_multiple + # then add the decorator to the step + deco = decos[deconame]._parse_decorator_spec( + splits[1] if len(splits) > 1 else "" + ) step.decorators.append(deco) diff --git a/metaflow/event_logger.py b/metaflow/event_logger.py index 7e559c443fd..aee08d6902a 100644 --- a/metaflow/event_logger.py +++ b/metaflow/event_logger.py @@ -1,33 +1,29 @@ -from .sidecar import SidecarSubProcess -from .sidecar_messages import Message, MessageTypes +from metaflow.sidecar import Message, MessageTypes, Sidecar class NullEventLogger(object): + TYPE = "nullSidecarLogger" + def __init__(self, *args, **kwargs): - pass + # Currently passed flow and env in kwargs + self._sidecar = Sidecar(self.TYPE) def start(self): - pass - - def log(self, payload): - pass + return self._sidecar.start() def terminate(self): - pass - + return self._sidecar.terminate() -class EventLogger(NullEventLogger): - def __init__(self, logger_type): - # type: (str) -> None - self.sidecar_process = None - self.logger_type = logger_type - - def 
start(self): - self.sidecar_process = SidecarSubProcess(self.logger_type) + def send(self, msg): + # Arbitrary message sending. Useful if you want to override some different + # types of messages. + self._sidecar.send(msg) def log(self, payload): - msg = Message(MessageTypes.LOG_EVENT, payload) - self.sidecar_process.msg_handler(msg) + if self._sidecar.is_active: + msg = Message(MessageTypes.BEST_EFFORT, payload) + self._sidecar.send(msg) - def terminate(self): - self.sidecar_process.kill() + @classmethod + def get_worker(cls): + return None diff --git a/metaflow/exception.py b/metaflow/exception.py index f3552bc3340..fb2a48689ca 100644 --- a/metaflow/exception.py +++ b/metaflow/exception.py @@ -95,6 +95,10 @@ class MetaflowInternalError(MetaflowException): headline = "Internal error" +class MetaflowTaggingError(MetaflowException): + headline = "Tagging error" + + class MetaflowUnknownUser(MetaflowException): headline = "Unknown user" diff --git a/metaflow/extension_support.py b/metaflow/extension_support.py index b6d04935570..59203c1a112 100644 --- a/metaflow/extension_support.py +++ b/metaflow/extension_support.py @@ -1,3 +1,5 @@ +from __future__ import print_function + import importlib import json import os @@ -7,24 +9,76 @@ from collections import defaultdict, namedtuple +from importlib.abc import MetaPathFinder, Loader from itertools import chain +# +# This file provides the support for Metaflow's extension mechanism which allows +# a Metaflow developer to extend metaflow by providing a package `metaflow_extensions`. +# Multiple such packages can be provided, and they will all be loaded into Metaflow in a +# way that is transparent to the user. +# +# NOTE: The conventions used here may change over time and this is an advanced feature. +# +# The general functionality provided here can be divided into three phases: +# - Package discovery: in this part, packages that provide metaflow extensions +# are discovered. This is contained in the `_get_extension_packages` function +# - Integration with Metaflow: throughout the Metaflow code, extension points +# are provided (they are given below in `_extension_points`). At those points, +# the core Metaflow code will invoke functions to load the packages discovered +# in the first phase. These functions are: +# - get_modules: Returns all modules that are contributing to the extension +# point; this is typically done first. +# - load_module: Simple loading of a specific module +# - load_globals: Utility function to load the globals from a module into +# another globals()-like object +# - alias_submodules: Determines the aliases for modules allowing metaflow.Z to alias +# metaflow_extensions.X.Y.Z for example. This supports the __mf_promote_submodules__ +# construct as well as aliasing any modules present in the extension. This is +# typically used in conjunction with lazy_load_aliases which takes care of actually +# making the aliasing work lazily (ie: modules that are not already loaded are only +# loaded on use). 
+# - lazy_load_aliases: Adds loaders for all the module aliases produced by +# alias_submodules for example +# - multiload_globals: Convenience function to `load_globals` on all modules returned +# by `get_modules` +# - multiload_all: Convenience function to `load_globals` and +# `lazy_load_aliases(alias_submodules()) on all modules returned by `get_modules` +# - Packaging the extensions: when extensions need to be included in the code package, +# this allows the extensions to be properly included (including potentially non .py +# files). To support this: +# - dump_module_info dumps information in the INFO file allowing packaging to work +# in a Conda environment or a remote environment (it saves file paths, load order, etc) +# - package_mfext_package: allows the packaging of a single extension +# - package_mfext_all: packages all extensions +# +# The get_aliases_modules is used by Pylint to ignore some of the errors arising from +# aliasing packages + __all__ = ( "load_module", "get_modules", "dump_module_info", + "get_aliased_modules", + "package_mfext_package", + "package_mfext_all", "load_globals", "alias_submodules", "EXT_PKG", "lazy_load_aliases", "multiload_globals", "multiload_all", + "_ext_debug", ) EXT_PKG = "metaflow_extensions" EXT_CONFIG_REGEXP = re.compile(r"^mfextinit_[a-zA-Z0-9_-]+\.py$") +EXT_META_REGEXP = re.compile(r"^mfextmeta_[a-zA-Z0-9_-]+\.py$") +REQ_NAME = re.compile(r"^(([a-zA-Z0-9][a-zA-Z0-9._-]*[a-zA-Z0-9])|[a-zA-Z0-9]).*$") +EXT_EXCLUDE_SUFFIXES = [".pyc"] -METAFLOW_DEBUG_EXT_MECHANISM = os.environ.get("METAFLOW_DEBUG_EXT", False) +# To get verbose messages, set METAFLOW_DEBUG_EXT to 1 +DEBUG_EXT = os.environ.get("METAFLOW_DEBUG_EXT", False) MFExtPackage = namedtuple("MFExtPackage", "package_name tl_package config_module") @@ -47,33 +101,72 @@ def get_modules(extension_point): ) _ext_debug("Getting modules for extension point '%s'..." 
% extension_point) for pkg in _pkgs_per_extension_point.get(extension_point, []): - _ext_debug("\tFound TL '%s' from '%s'" % (pkg.tl_package, pkg.package_name)) - m = _get_extension_config(pkg.tl_package, extension_point, pkg.config_module) + _ext_debug(" Found TL '%s' from '%s'" % (pkg.tl_package, pkg.package_name)) + m = _get_extension_config( + pkg.package_name, pkg.tl_package, extension_point, pkg.config_module + ) if m: modules_to_load.append(m) - _ext_debug("\tLoaded %s" % str(modules_to_load)) + _ext_debug(" Loaded %s" % str(modules_to_load)) return modules_to_load def dump_module_info(): - return "ext_info", [_all_packages, _pkgs_per_extension_point] + _filter_files_all() + sanitized_all_packages = dict() + # Strip out root_paths (we don't need it and no need to expose user's dir structure) + for k, v in _all_packages.items(): + sanitized_all_packages[k] = { + "root_paths": None, + "meta_module": v["meta_module"], + "files": v["files"], + } + return "ext_info", [sanitized_all_packages, _pkgs_per_extension_point] + + +def get_aliased_modules(): + return _aliased_modules + + +def package_mfext_package(package_name): + from metaflow.util import to_unicode + + _ext_debug("Packaging '%s'" % package_name) + _filter_files_package(package_name) + pkg_info = _all_packages.get(package_name, None) + if pkg_info and pkg_info.get("root_paths", None): + single_path = len(pkg_info["root_paths"]) == 1 + for p in pkg_info["root_paths"]: + root_path = to_unicode(p) + for f in pkg_info["files"]: + f_unicode = to_unicode(f) + fp = os.path.join(root_path, f_unicode) + if single_path or os.path.isfile(fp): + _ext_debug(" Adding '%s'" % fp) + yield fp, os.path.join(EXT_PKG, f_unicode) + + +def package_mfext_all(): + for p in _all_packages: + for path_tuple in package_mfext_package(p): + yield path_tuple def load_globals(module, dst_globals, extra_indent=False): if extra_indent: - extra_indent = "\t" + extra_indent = " " else: extra_indent = "" _ext_debug("%sLoading globals from '%s'" % (extra_indent, module.__name__)) for n, o in module.__dict__.items(): if not n.startswith("__") and not isinstance(o, types.ModuleType): - _ext_debug("%s\tImporting '%s'" % (extra_indent, n)) + _ext_debug("%s Importing '%s'" % (extra_indent, n)) dst_globals[n] = o def alias_submodules(module, tl_package, extension_point, extra_indent=False): if extra_indent: - extra_indent = "\t" + extra_indent = " " else: extra_indent = "" lazy_load_custom_modules = {} @@ -107,7 +200,7 @@ def alias_submodules(module, tl_package, extension_point, extra_indent=False): ) if lazy_load_custom_modules: _ext_debug( - "%s\tFound explicit promotions in __mf_promote_submodules__: %s" + "%s Found explicit promotions in __mf_promote_submodules__: %s" % (extra_indent, str(list(lazy_load_custom_modules.keys()))) ) for n, o in module.__dict__.items(): @@ -123,15 +216,16 @@ def alias_submodules(module, tl_package, extension_point, extra_indent=False): else: lazy_load_custom_modules["metaflow.%s" % n] = o _ext_debug( - "%s\tWill create the following module aliases: %s" + "%s Will create the following module aliases: %s" % (extra_indent, str(list(lazy_load_custom_modules.keys()))) ) + _aliased_modules.extend(lazy_load_custom_modules.keys()) return lazy_load_custom_modules def lazy_load_aliases(aliases): if aliases: - sys.meta_path = [_LazyLoader(aliases)] + sys.meta_path + sys.meta_path = [_LazyFinder(aliases)] + sys.meta_path def multiload_globals(modules, dst_globals): @@ -141,7 +235,7 @@ def multiload_globals(modules, dst_globals): def 
multiload_all(modules, extension_point, dst_globals): for m in modules: - # Note that we load aliases separately (as opposed to ine one fell swoop) so + # Note that we load aliases separately (as opposed to in one fell swoop) so # modules loaded later in `modules` can depend on them lazy_load_aliases( alias_submodules(m.module, m.tl_package, extension_point, extra_indent=True) @@ -149,44 +243,50 @@ def multiload_all(modules, extension_point, dst_globals): load_globals(m.module, dst_globals) -_py_ver = sys.version_info[0] * 10 + sys.version_info[1] +_py_ver = sys.version_info[:2] _mfext_supported = False +_aliased_modules = [] -if _py_ver >= 34: +if _py_ver >= (3, 4): import importlib.util - from importlib.machinery import ModuleSpec - if _py_ver >= 38: + if _py_ver >= (3, 8): from importlib import metadata + elif _py_ver >= (3, 6): + from metaflow._vendor.v3_6 import importlib_metadata as metadata else: - from metaflow._vendor import importlib_metadata as metadata + from metaflow._vendor.v3_5 import importlib_metadata as metadata _mfext_supported = True -else: - # Something random so there is no syntax error - ModuleSpec = None -# IMPORTANT: More specific paths must appear FIRST (before any less specific one) +# Extension points are the directories that can be present in a EXT_PKG to +# contribute to that extension point. For example, if you have +# metaflow_extensions/X/plugins, your extension contributes to the plugins +# extension point. +# IMPORTANT: More specific paths must appear FIRST (before any less specific one). For +# efficiency, put the less specific ones directly under more specific ones. _extension_points = [ "plugins.env_escape", "plugins.cards", + "plugins.datatools", + "plugins", "config", - "datatools", "exceptions", - "plugins", "toplevel", + "cmd", ] def _ext_debug(*args, **kwargs): - if METAFLOW_DEBUG_EXT_MECHANISM: + if DEBUG_EXT: init_str = "%s:" % EXT_PKG + kwargs["file"] = sys.stderr print(init_str, *args, **kwargs) def _get_extension_packages(): if not _mfext_supported: _ext_debug("Not supported for your Python version -- 3.4+ is needed") - return [], {} + return {}, {} # If we have an INFO file with the appropriate information (if running from a saved # code package for example), we use that directly @@ -194,7 +294,7 @@ def _get_extension_packages(): from metaflow import INFO_FILE try: - with open(INFO_FILE, "r") as contents: + with open(INFO_FILE, encoding="utf-8") as contents: all_pkg, ext_to_pkg = json.load(contents).get("ext_info", (None, None)) if all_pkg is not None and ext_to_pkg is not None: _ext_debug("Loading pre-computed information from INFO file") @@ -210,13 +310,22 @@ def _get_extension_packages(): try: extensions_module = importlib.import_module(EXT_PKG) except ImportError as e: - if _py_ver >= 36: + if _py_ver >= (3, 6): # e.name is set to the name of the package that fails to load # so don't error ONLY IF the error is importing this module (but do # error if there is a transitive import error) if not (isinstance(e, ModuleNotFoundError) and e.name == EXT_PKG): raise - return [], {} + return {}, {} + + # There are two "types" of packages: + # - those installed on the system (distributions) + # - those present in the PYTHONPATH + # We have more information on distributions (including dependencies) and more + # effective ways to get file information from them (they include the full list of + # files installed) so we treat them separately from packages purely in PYTHONPATH. 
+ # They are also the more likely way that users will have extensions present, so + # we optimize for that case. # At this point, we look at all the paths and create a set. As we find distributions # that match it, we will remove from the set and then will be left with any @@ -227,54 +336,117 @@ def _get_extension_packages(): list_ext_points = [x.split(".") for x in _extension_points] init_ext_points = [x[0] for x in list_ext_points] - # TODO: This relies only on requirements to determine import order; we may want + # NOTE: For distribution packages, we will rely on requirements to determine the + # load order of extensions: if distribution A and B both provide EXT_PKG and + # distribution A depends on B then when returning modules in `get_modules`, we will + # first return B and THEN A. We may want # other ways of specifying "load me after this if it exists" without depending on # the package. One way would be to rely on the description and have that info there. - # Not sure of the use though so maybe we can skip for now. - mf_ext_packages = [] - # Key: distribution name/full path to package - # Value: - # Key: TL package name - # Value: MFExtPackage + # Not sure of the use, though, so maybe we can skip for now. + + # Key: distribution name/package path + # Value: Dict containing: + # root_paths: The root path for all the files in this package. Can be a list in + # some rare cases + # meta_module: The module to the meta file (if any) that contains information about + # how to package this extension (suffixes to include/exclude) + # files: The list of files to be included (or considered for inclusion) when + # packaging this extension + mf_ext_packages = dict() + + # Key: extension point (one of _extension_point) + # Value: another dictionary with + # Key: distribution name/full path to package + # Value: another dictionary with + # Key: TL package name (so in metaflow_extensions.X...., the X) + # Value: MFExtPackage extension_points_to_pkg = defaultdict(dict) + + # Key: string: configuration file for a package + # Value: list: packages that this configuration file is present in config_to_pkg = defaultdict(list) + # Same as config_to_pkg for meta files + meta_to_pkg = defaultdict(list) + + # 1st step: look for distributions (the common case) for dist in metadata.distributions(): if any( [pkg == EXT_PKG for pkg in (dist.read_text("top_level.txt") or "").split()] ): + if dist.metadata["Name"] in mf_ext_packages: + _ext_debug( + "Ignoring duplicate package '%s' (duplicate paths in sys.path? (%s))" + % (dist.metadata["Name"], str(sys.path)) + ) + continue _ext_debug("Found extension package '%s'..." % dist.metadata["Name"]) # Remove the path from the paths to search. This is not 100% accurate because - # it is possible that at that same location there is a package and a non - # package but it is exceedingly unlikely so we are going to ignore this. - all_paths.discard(dist.locate_file(EXT_PKG).as_posix()) + # it is possible that at that same location there is a package and a non-package, + # but it is exceedingly unlikely, so we are going to ignore this. + dist_root = dist.locate_file(EXT_PKG).as_posix() + all_paths.discard(dist_root) - mf_ext_packages.append(dist.metadata["Name"]) + files_to_include = [] + meta_module = None # At this point, we check to see what extension points this package # contributes to. 
This is to enable multiple namespace packages to contribute # to the same extension point (for example, you may have multiple packages # that have plugins) for f in dist.files: - # Make sure EXT_PKG is a ns package - if f.as_posix() == "%s/__init__.py" % EXT_PKG: - raise RuntimeError( - "Package '%s' providing '%s' is not an implicit namespace " - "package as required" % (dist.metadata["Name"], EXT_PKG) - ) - parts = list(f.parts) - if ( - len(parts) > 1 - and parts[0] == EXT_PKG - and parts[1] in init_ext_points - ): - # This is most likely a problem as we need an intermediate "identifier" - raise RuntimeError( - "Package '%s' should conform to %s.X.%s and not %s.%s where " - "X is your organization's name for example" - % (dist.metadata["Name"], EXT_PKG, parts[1], EXT_PKG, parts[1]) - ) + + if len(parts) > 1 and parts[0] == EXT_PKG: + # Ensure that we don't have a __init__.py to force this package to + # be a NS package + if parts[1] == "__init__.py": + raise RuntimeError( + "Package '%s' providing '%s' is not an implicit namespace " + "package as required" % (dist.metadata["Name"], EXT_PKG) + ) + + # Record the file as a candidate for inclusion when packaging if + # needed + if not any( + parts[-1].endswith(suffix) for suffix in EXT_EXCLUDE_SUFFIXES + ): + files_to_include.append(os.path.join(*parts[1:])) + + if parts[1] in init_ext_points: + # This is most likely a problem as we need an intermediate + # "identifier" + raise RuntimeError( + "Package '%s' should conform to '%s.X.%s' and not '%s.%s' where " + "X is your organization's name for example" + % ( + dist.metadata["Name"], + EXT_PKG, + parts[1], + EXT_PKG, + parts[1], + ) + ) + + # Check for any metadata; we can only have one metadata per + # distribution at most + if EXT_META_REGEXP.match(parts[1]) is not None: + potential_meta_module = ".".join([EXT_PKG, parts[1][:-3]]) + if meta_module: + raise RuntimeError( + "Package '%s' defines more than one meta configuration: " + "'%s' and '%s' (at least)" + % ( + dist.metadata["Name"], + meta_module, + potential_meta_module, + ) + ) + meta_module = potential_meta_module + _ext_debug( + "Found meta '%s' for '%s'" % (meta_module, dist_full_name) + ) + meta_to_pkg[meta_module].append(dist_full_name) if len(parts) > 3 and parts[0] == EXT_PKG: # We go over _extension_points *in order* to make sure we get more @@ -290,9 +462,9 @@ def _get_extension_packages(): # Check if this is an "init" file config_module = None - if ( - len(parts) == len(ext_list) + 3 - and EXT_CONFIG_REGEXP.match(parts[-1]) is not None + if len(parts) == len(ext_list) + 3 and ( + EXT_CONFIG_REGEXP.match(parts[-1]) is not None + or parts[-1] == "__init__.py" ): parts[-1] = parts[-1][:-3] # Remove the .py config_module = ".".join(parts) @@ -320,42 +492,48 @@ def _get_extension_packages(): ) if config_module is not None: _ext_debug( - "\tTL %s found config file '%s'" + " TL '%s' found config file '%s'" % (parts[1], config_module) ) extension_points_to_pkg[_extension_points[idx]][ dist.metadata["Name"] ][parts[1]] = MFExtPackage( - package_name=dist_full_name, + package_name=dist.metadata["Name"], tl_package=parts[1], config_module=config_module, ) else: _ext_debug( - "\tTL %s extends '%s' with config '%s'" + " TL '%s' extends '%s' with config '%s'" % (parts[1], _extension_points[idx], config_module) ) extension_points_to_pkg[_extension_points[idx]][ dist.metadata["Name"] ][parts[1]] = MFExtPackage( - package_name=dist_full_name, + package_name=dist.metadata["Name"], tl_package=parts[1], config_module=config_module, ) break 
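The exact pattern behind `EXT_CONFIG_REGEXP` is defined earlier in the file and not visible in this hunk; judging from how it is used here and from the per-extension-point handling further down (a configuration file is either a `__init__.py` or an explicit `mfextinit_*.py`), the check can be restated roughly as below. The regular expression itself is an assumption:

```
import re

# Assumed shape of the configuration-file pattern (mfextinit_<name>.py); the
# real EXT_CONFIG_REGEXP is defined earlier in extension_support.py.
CONFIG_RE = re.compile(r"^mfextinit_[a-zA-Z0-9_-]+\.py$")


def config_module_for(parts):
    # parts is a file path relative to the distribution root, e.g.
    # ["metaflow_extensions", "myorg", "plugins", "cards", "mfextinit_myorg.py"]
    fname = parts[-1]
    if CONFIG_RE.match(fname) or fname == "__init__.py":
        return ".".join(parts[:-1] + [fname[:-3]])  # strip the ".py"
    return None


print(
    config_module_for(
        ["metaflow_extensions", "myorg", "plugins", "cards", "mfextinit_myorg.py"]
    )
)
# Prints "metaflow_extensions.myorg.plugins.cards.mfextinit_myorg"
```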
- + mf_ext_packages[dist.metadata["Name"]] = { + "root_paths": [dist_root], + "meta_module": meta_module, + "files": files_to_include, + } # At this point, we have all the packages that contribute to EXT_PKG, # we now check to see if there is an order to respect based on dependencies. We will # return an ordered list that respects that order and is ordered alphabetically in # case of ties. We do not do any checks because we rely on pip to have done those. + # Basically topological sort based on dependencies. pkg_to_reqs_count = {} req_to_dep = {} - mf_ext_packages_set = set(mf_ext_packages) for pkg_name in mf_ext_packages: req_count = 0 - req_pkgs = [x.split()[0] for x in metadata.requires(pkg_name) or []] + req_pkgs = [ + REQ_NAME.match(x).group(1) for x in metadata.requires(pkg_name) or [] + ] for req_pkg in req_pkgs: - if req_pkg in mf_ext_packages_set: + if req_pkg in mf_ext_packages: req_count += 1 req_to_dep.setdefault(req_pkg, []).append(pkg_name) pkg_to_reqs_count[pkg_name] = req_count @@ -389,112 +567,217 @@ def _get_extension_packages(): # Check that we got them all if len(pkg_to_reqs_count) > 0: raise RuntimeError( - "Unresolved dependencies in %s: %s" % (EXT_PKG, str(pkg_to_reqs_count)) + "Unresolved dependencies in '%s': %s" + % (EXT_PKG, ", and ".join("'%s'" % p for p in pkg_to_reqs_count)) ) + _ext_debug("'%s' distributions order is %s" % (EXT_PKG, str(mf_pkg_list))) + # We check if we have any additional packages that were not yet installed that - # we need to use. We always put them *last*. - if len(all_paths) > 0: + # we need to use. We always put them *last* in the load order and put them + # alphabetically. + all_paths_list = list(all_paths) + all_paths_list.sort() + + # This block of code is the equivalent of the one above for distributions except + # for PYTHONPATH packages. The functionality is identical, but it looks a little + # different because we construct the file list instead of having it nicely provided + # to us. + package_name_to_path = dict() + if len(all_paths_list) > 0: _ext_debug("Non installed packages present at %s" % str(all_paths)) - packages_to_add = set() - for package_path in all_paths: - _ext_debug("Walking path %s" % package_path) + for package_count, package_path in enumerate(all_paths_list): + # We give an alternate name for the visible package name. It is + # not exposed to the end user but used to refer to the package, and it + # doesn't provide much additional information to have the full path + # particularly when it is on a remote machine. + # We keep a temporary mapping around for error messages while loading for + # the first time. 
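The dependency-based ordering described above is essentially Kahn's topological sort keyed on how many other extension distributions a given distribution requires, with alphabetical order breaking ties. A self-contained sketch of the same idea (the distribution names are hypothetical):

```
from heapq import heapify, heappop, heappush


def order_extensions(requires):
    # requires: distribution name -> set of other *extension* distributions it
    # depends on. Returns a load order where dependencies come first and ties
    # are broken alphabetically.
    counts = {pkg: len(deps) for pkg, deps in requires.items()}
    dependents = {}
    for pkg, deps in requires.items():
        for dep in deps:
            dependents.setdefault(dep, []).append(pkg)
    ready = [pkg for pkg, count in counts.items() if count == 0]
    heapify(ready)
    ordered = []
    while ready:
        pkg = heappop(ready)
        del counts[pkg]
        ordered.append(pkg)
        for nxt in dependents.get(pkg, []):
            counts[nxt] -= 1
            if counts[nxt] == 0:
                heappush(ready, nxt)
    if counts:
        raise RuntimeError(
            "Unresolved dependencies: %s" % ", ".join(sorted(counts))
        )
    return ordered


# "ext-b" depends on "ext-a", so "ext-a" must load first.
print(order_extensions({"ext-a": set(), "ext-b": {"ext-a"}, "ext-c": set()}))
# Prints ['ext-a', 'ext-b', 'ext-c']
```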
+ package_name = "_pythonpath_%d" % package_count + _ext_debug( + "Walking path %s (package name %s)" % (package_path, package_name) + ) + package_name_to_path[package_name] = package_path base_depth = len(package_path.split("/")) + files_to_include = [] + meta_module = None for root, dirs, files in os.walk(package_path): parts = root.split("/") cur_depth = len(parts) + # relative_root strips out metaflow_extensions + relative_root = "/".join(parts[base_depth:]) + relative_module = ".".join(parts[base_depth - 1 :]) + files_to_include.extend( + [ + "/".join([relative_root, f]) if relative_root else f + for f in files + if not any( + [f.endswith(suffix) for suffix in EXT_EXCLUDE_SUFFIXES] + ) + ] + ) if cur_depth == base_depth: if "__init__.py" in files: raise RuntimeError( - "%s at '%s' is not an implicit namespace package as required" + "'%s' at '%s' is not an implicit namespace package as required" % (EXT_PKG, root) ) for d in dirs: if d in init_ext_points: raise RuntimeError( - "Package at %s should conform to %s.X.%s and not %s.%s " - "where X is your organization's name for example" + "Package at '%s' should conform to' %s.X.%s' and not " + "'%s.%s' where X is your organization's name for example" % (root, EXT_PKG, d, EXT_PKG, d) ) + # Check for meta files for this package + meta_files = [ + x for x in map(EXT_META_REGEXP.match, files) if x is not None + ] + if meta_files: + # We should have one meta file at most + if len(meta_files) > 1: + raise RuntimeError( + "Package at '%s' defines more than one meta file: %s" + % ( + package_path, + ", and ".join( + ["'%s'" % x.group(0) for x in meta_files] + ), + ) + ) + else: + meta_module = ".".join( + [relative_module, meta_files[0].group(0)[:-3]] + ) + elif cur_depth > base_depth + 1: # We want at least a TL name and something under tl_name = parts[base_depth] - tl_fullname = "/".join([package_path, tl_name]) + tl_fullname = "%s[%s]" % (package_path, tl_name) prefix_match = parts[base_depth + 1 :] - next_dirs = None for idx, ext_list in enumerate(list_ext_points): if prefix_match == ext_list: + # We check to see if this is an actual extension point + # or if we just have a directory on the way to another + # extension point. 
To do this, we check to see if we have + # any files or directories that are *not* directly another + # extension point + skip_extension = len(files) == 0 + if skip_extension: + next_dir_idx = len(list_ext_points[idx]) + ok_subdirs = [ + list_ext_points[j][next_dir_idx] + for j in range(0, idx) + if len(list_ext_points[j]) > next_dir_idx + ] + skip_extension = set(dirs).issubset(set(ok_subdirs)) + + if skip_extension: + _ext_debug( + " Skipping '%s' as no files/directory of interest" + % _extension_points[idx] + ) + continue + # Check for any "init" files init_files = [ - x + x.group(0) for x in map(EXT_CONFIG_REGEXP.match, files) if x is not None ] + if "__init__.py" in files: + init_files.append("__init__.py") + config_module = None if len(init_files) > 1: raise RuntimeError( - "Package at %s defines more than one configuration " + "Package at '%s' defines more than one configuration " "file for '%s': %s" % ( tl_fullname, ".".join(prefix_match), - ", and ".join( - ["'%s'" % x.group(0) for x in init_files] - ), + ", and ".join(["'%s'" % x for x in init_files]), ) ) elif len(init_files) == 1: config_module = ".".join( - parts[base_depth - 1 :] - + [init_files[0].group(0)[:-3]] + [relative_module, init_files[0][:-3]] ) config_to_pkg[config_module].append(tl_fullname) + d = extension_points_to_pkg[_extension_points[idx]][ - tl_fullname + package_name ] = dict() d[tl_name] = MFExtPackage( - package_name=tl_fullname, + package_name=package_name, tl_package=tl_name, config_module=config_module, ) _ext_debug( - "\tExtends '%s' with config '%s'" + " Extends '%s' with config '%s'" % (_extension_points[idx], config_module) ) - packages_to_add.add(tl_fullname) - else: - # Check what directories we need to go down if any - if len(ext_list) > 1 and prefix_match == ext_list[:-1]: - if next_dirs is None: - next_dirs = [] - next_dirs.append(ext_list[-1]) - if next_dirs is not None: - dirs[:] = next_dirs[:] - - # Add all these new packages to the list of packages as well. - packages_to_add = list(packages_to_add) - packages_to_add.sort() - mf_pkg_list.extend(packages_to_add) - - # Sanity check that we only have one package per configuration file + mf_pkg_list.append(package_name) + mf_ext_packages[package_name] = { + "root_paths": [package_path], + "meta_module": meta_module, + "files": files_to_include, + } + + # Sanity check that we only have one package per configuration file. + # This prevents multiple packages from providing the same named configuration + # file which would result in one overwriting the other if they are both installed. errors = [] for m, packages in config_to_pkg.items(): if len(packages) > 1: errors.append( - "\tPackages %s define the same configuration module '%s'" - % (", and ".join(packages), m) + " Packages %s define the same configuration module '%s'" + % (", and ".join(["'%s'" % p for p in packages]), m) + ) + for m, packages in meta_to_pkg.items(): + if len(packages) > 1: + errors.append( + " Packages %s define the same meta module '%s'" + % (", and ".join(["'%s'" % p for p in packages]), m) ) if errors: raise RuntimeError( - "Conflicts in %s configuration files:\n%s" % (EXT_PKG, "\n".join(errors)) + "Conflicts in '%s' files:\n%s" % (EXT_PKG, "\n".join(errors)) ) extension_points_to_pkg.default_factory = None - # Figure out the per extension point order + + # We have the load order globally; we now figure it out per extension point. 
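The `skip_extension` test near the top of this hunk is the subtle part of the walk: a directory that matches an extension point is skipped when it contains no files and every one of its subdirectories is just the next path component of a more specific extension point. A standalone restatement of that check (directory contents are hypothetical):

```
# Ordered as in _extension_points: more specific entries first.
EXT_POINTS = [
    ["plugins", "env_escape"],
    ["plugins", "cards"],
    ["plugins", "datatools"],
    ["plugins"],
]


def only_on_the_way(idx, files, dirs):
    # True when the directory matching EXT_POINTS[idx] holds nothing of its own
    # and is merely a container for more specific extension points.
    if files:
        return False
    next_idx = len(EXT_POINTS[idx])
    ok_subdirs = {
        EXT_POINTS[j][next_idx]
        for j in range(idx)
        if len(EXT_POINTS[j]) > next_idx
    }
    return set(dirs).issubset(ok_subdirs)


# metaflow_extensions/myorg/plugins contains only a "cards" sub-package, so the
# walk skips the generic "plugins" extension point for this package.
print(only_on_the_way(EXT_POINTS.index(["plugins"]), files=[], dirs=["cards"]))
# Prints True
```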
for k, v in extension_points_to_pkg.items(): + + # v is a dict distributionName/packagePath -> (dict tl_name -> MFPackage) l = [v[pkg].values() for pkg in mf_pkg_list if pkg in v] - extension_points_to_pkg[k] = list(chain(*l)) - return mf_pkg_list, extension_points_to_pkg + # In the case of the plugins.cards extension we allow those packages + # to be ns packages, so we only list the package once (in its first position). + # In all other cases, we error out if we don't have a configuration file for the + # package (either a __init__.py of an explicit mfextinit_*.py) + final_list = [] + null_config_tl_package = set() + for pkg in chain(*l): + if pkg.config_module is None: + if k == "plugins.cards": + # This is allowed here but we only keep one + if pkg.tl_package in null_config_tl_package: + continue + null_config_tl_package.add(pkg.tl_package) + else: + package_path = package_name_to_path.get(pkg.package_name) + if package_path: + package_path = "at '%s'" % package_path + else: + package_path = "'%s'" % pkg.package_name + raise RuntimeError( + "Package %s does not define a configuration file for '%s'" + % (package_path, k) + ) + final_list.append(pkg) + extension_points_to_pkg[k] = final_list + return mf_ext_packages, extension_points_to_pkg _all_packages, _pkgs_per_extension_point = _get_extension_packages() @@ -504,7 +787,7 @@ def _attempt_load_module(module_name): try: extension_module = importlib.import_module(module_name) except ImportError as e: - if _py_ver >= 36: + if _py_ver >= (3, 6): # e.name is set to the name of the package that fails to load # so don't error ONLY IF the error is importing this module (but do # error if there is a transitive import error) @@ -514,122 +797,303 @@ def _attempt_load_module(module_name): errored_names.append("%s.%s" % (errored_names[-1], p)) if not (isinstance(e, ModuleNotFoundError) and e.name in errored_names): print( - "The following exception occured while trying to load %s ('%s')" + "The following exception occurred while trying to load '%s' ('%s')" % (EXT_PKG, module_name) ) raise else: - _ext_debug("\t\tUnknown error when loading '%s': %s" % (module_name, e)) + _ext_debug( + " Unknown error when loading '%s': %s" % (module_name, e) + ) return None else: return extension_module -def _get_extension_config(tl_pkg, extension_point, config_module): - module_name = ".".join([EXT_PKG, tl_pkg, extension_point]) - if config_module is not None: - _ext_debug("\t\tAttempting to load '%s'" % config_module) - extension_module = _attempt_load_module(config_module) +def _get_extension_config(distribution_name, tl_pkg, extension_point, config_module): + if config_module is not None and not config_module.endswith("__init__"): + module_name = config_module + # file_path below will be /root/metaflow_extensions/X/Y/mfextinit_Z.py and + # module name is metaflow_extensions.X.Y.mfextinit_Z so if we want to strip to + # /root/metaflow_extensions, we need to remove this number of elements from the + # filepath + strip_from_filepath = len(module_name.split(".")) - 1 else: - _ext_debug("\t\tAttempting to load '%s'" % module_name) - extension_module = _attempt_load_module(module_name) + module_name = ".".join([EXT_PKG, tl_pkg, extension_point]) + # file_path here will be /root/metaflow_extensions/X/Y/__init__.py BUT + # module name is metaflow_extensions.X.Y so we have a 1 off compared to the + # previous case + strip_from_filepath = len(module_name.split(".")) + + _ext_debug(" Attempting to load '%s'" % module_name) + + extension_module = 
_attempt_load_module(module_name) + if extension_module: + # We update the path to this module. This is useful if we need to package this + # package again. Note that in most cases, packaging happens in the outermost + # local python environment (non Conda and not remote) so we already have the + # root_paths set when we are initially looking for metaflow_extensions package. + # This code allows for packaging while running inside a Conda environment or + # remotely where the root_paths has been changed since the initial packaging. + # This currently does not happen much. + if _all_packages[distribution_name]["root_paths"] is None: + file_path = getattr(extension_module, "__file__") + if file_path: + # Common case where this is an actual init file (mfextinit_X.py or __init__.py) + root_paths = ["/".join(file_path.split("/")[:-strip_from_filepath])] + else: + # Only used for plugins.cards where the package can be a NS package. In + # this case, __path__ will have things like /root/metaflow_extensions/X/Y + # and module name will be metaflow_extensions.X.Y + root_paths = [ + "/".join(p.split("/")[: -len(module_name.split(".")) + 1]) + for p in extension_module.__path__ + ] + + _ext_debug("Package '%s' is rooted at %s" % (distribution_name, root_paths)) + _all_packages[distribution_name]["root_paths"] = root_paths + return MFExtModule(tl_package=tl_pkg, module=extension_module) return None -class _LazyLoader(object): - # This _LazyLoader implements the Importer Protocol defined in PEP 302 - # TODO: Need to move to find_spec, exec_module and create_module as - # find_module and load_module are deprecated +def _filter_files_package(package_name): + pkg = _all_packages.get(package_name) + if pkg and pkg["root_paths"] and pkg["meta_module"]: + meta_module = _attempt_load_module(pkg["meta_module"]) + if meta_module: + include_suffixes = meta_module.__dict__.get("include_suffixes") + exclude_suffixes = meta_module.__dict__.get("exclude_suffixes") + + # Behavior is as follows: + # - if nothing specified, include all files (so do nothing here) + # - if include_suffixes, only include those suffixes + # - if *not* include_suffixes but exclude_suffixes, include everything *except* + # files ending with that suffix + if include_suffixes: + new_files = [ + f + for f in pkg["files"] + if any([f.endswith(suffix) for suffix in include_suffixes]) + ] + elif exclude_suffixes: + new_files = [ + f + for f in pkg["files"] + if not any([f.endswith(suffix) for suffix in exclude_suffixes]) + ] + else: + new_files = pkg["files"] + pkg["files"] = new_files + + +def _filter_files_all(): + for p in _all_packages: + _filter_files_package(p) + + +class _AliasLoader(Loader): + def __init__(self, alias, orig): + self._alias = alias + self._orig = orig + + def create_module(self, spec): + _ext_debug( + "Loading aliased module '%s' at '%s' " % (str(self._orig), spec.name) + ) + if isinstance(self._orig, str): + try: + return importlib.import_module(self._orig) + except ImportError: + raise ImportError( + "No module found '%s' (aliasing '%s')" % (spec.name, self._orig) + ) + elif isinstance(self._orig, types.ModuleType): + # We are aliasing a module, so we just return that one + return self._orig + else: + return super().create_module(spec) + + def exec_module(self, module): + # Override the name to make it a bit nicer. 
We keep the old name so that + # we can refer to it when we load submodules + if not hasattr(module, "__orig_name__"): + module.__orig_name__ = module.__name__ + module.__name__ = self._alias + + +class _OrigLoader(Loader): + def __init__( + self, + fullname, + orig_loader, + previously_loaded_module=None, + previously_loaded_parent_module=None, + ): + self._fullname = fullname + self._orig_loader = orig_loader + self._previously_loaded_module = previously_loaded_module + self._previously_loaded_parent_module = previously_loaded_parent_module + + def create_module(self, spec): + _ext_debug( + "Loading original module '%s' (will be loaded at '%s'); spec is %s" + % (spec.name, self._fullname, str(spec)) + ) + self._orig_name = spec.name + return self._orig_loader.create_module(spec) + + def exec_module(self, module): + try: + # Perform all actions of the original loader + self._orig_loader.exec_module(module) + except BaseException: + raise # We re-raise it always; the `finally` clause will still restore things + else: + # It loaded, we move and rename appropriately + module.__spec__.name = self._fullname + module.__orig_name__ = module.__name__ + module.__name__ = self._fullname + module.__package__ = module.__spec__.parent # assumption since 3.6 + sys.modules[self._fullname] = module + del sys.modules[self._orig_name] + + finally: + # At this point, the original module is loaded with the original name. We + # want to replace it with previously_loaded_module if it exists. We + # also replace the parent properly + if self._previously_loaded_module: + sys.modules[self._orig_name] = self._previously_loaded_module + if self._previously_loaded_parent_module: + sys.modules[ + ".".join(self._orig_name.split(".")[:-1]) + ] = self._previously_loaded_parent_module + + +class _LazyFinder(MetaPathFinder): + # This _LazyFinder implements the Importer Protocol defined in PEP 302 def __init__(self, handled): - # Modules directly loaded (this is either new modules or overrides of existing ones) + # Dictionary: + # Key: name of the module to handle + # Value: + # - A string: a pathspec to the module to load + # - A module: the module to load self._handled = handled if handled else {} - # This is used to revert back to regular loading when trying to load + # This is used to revert to regular loading when trying to load # the over-ridden module - self._tempexcluded = set() - - # This is used when loading a module alias to load any submodule - self._alias_to_orig = {} - - def find_module(self, fullname, path=None): - if fullname in self._tempexcluded: + self._temp_excluded_prefix = set() + + # This is used to determine if we should be searching in _orig modules. Basically, + # when a relative import is done from a module in _orig, we want to search in + # the _orig "tree" + self._orig_search_paths = set() + + def find_spec(self, fullname, path, target=None): + # If we are trying to load a shadowed module (ending in ._orig), we don't + # say we handle it + _ext_debug( + "Looking for %s in %s with target %s" % (fullname, str(path), target) + ) + if any([fullname.startswith(e) for e in self._temp_excluded_prefix]): return None - if fullname in self._handled or ( - fullname.endswith("._orig") and fullname[:-6] in self._handled - ): - return self - name_parts = fullname.split(".") - if len(name_parts) > 1 and name_parts[-1] != "_orig": - # We check if we had an alias created for this module and if so, - # we are going to load it to properly fully create aliases all - # the way down. 
- parent_name = ".".join(name_parts[:-1]) - if parent_name in self._alias_to_orig: - return self - return None - def load_module(self, fullname): - if fullname in sys.modules: - return sys.modules[fullname] - if not self._can_handle_orig_module() and fullname.endswith("._orig"): - # We return a nicer error message - raise ImportError( - "Attempting to load '%s' -- loading shadowed modules in Metaflow " - "Extensions are only supported in Python 3.4+" % fullname + # If this is something we directly handle, return our loader + if fullname in self._handled: + return importlib.util.spec_from_loader( + fullname, _AliasLoader(fullname, self._handled[fullname]) ) - to_import = self._handled.get(fullname, None) - - # If to_import is None, two cases: - # - we are loading a ._orig module - # - OR we are loading a submodule - if to_import is None: - if fullname.endswith("._orig"): - try: - # We exclude this module temporarily from what we handle to - # revert back to the non-shadowing mode of import - self._tempexcluded.add(fullname) - to_import = importlib.util.find_spec(fullname) - finally: - self._tempexcluded.remove(fullname) - else: - name_parts = fullname.split(".") - submodule = name_parts[-1] - parent_name = ".".join(name_parts[:-1]) - to_import = ".".join([self._alias_to_orig[parent_name], submodule]) - if isinstance(to_import, str): - try: - to_import_mod = importlib.import_module(to_import) - except ImportError: - raise ImportError( - "No module found '%s' (aliasing %s)" % (fullname, to_import) + # For the first pass when we try to load a shadowed module, we send it back + # without the ._orig and that will find the original spec of the module + # Note that we handle mymodule._orig.orig_submodule as well as mymodule._orig. + # Basically, the original module and any of the original submodules are + # available under _orig. + name_parts = fullname.split(".") + try: + orig_idx = name_parts.index("_orig") + except ValueError: + orig_idx = -1 + if orig_idx > -1 and ".".join(name_parts[:orig_idx]) in self._handled: + orig_name = ".".join(name_parts[:orig_idx] + name_parts[orig_idx + 1 :]) + parent_name = None + if orig_idx != len(name_parts) - 1: + # We have a parent module under the _orig portion so for example, if + # we load mymodule._orig.orig_submodule, our parent is mymodule._orig. + # However, since mymodule is currently shadowed, we need to reset + # the parent module properly. We know it is already loaded (since modules + # are loaded hierarchically) + parent_name = ".".join( + name_parts[:orig_idx] + name_parts[orig_idx + 1 : -1] ) - sys.modules[fullname] = to_import_mod - self._alias_to_orig[fullname] = to_import_mod.__name__ - elif isinstance(to_import, types.ModuleType): - sys.modules[fullname] = to_import - self._alias_to_orig[fullname] = to_import.__name__ - elif self._can_handle_orig_module() and isinstance(to_import, ModuleSpec): - # This loads modules that end in _orig - m = importlib.util.module_from_spec(to_import) - to_import.loader.exec_module(m) - sys.modules[fullname] = m - elif to_import is None and fullname.endswith("._orig"): - # This happens when trying to access a shadowed ._orig module - # when actually, there is no shadowed module; print a nicer message - # Condition is a bit overkill and most likely only checking to_import - # would be OK. Being extra sure in case _LazyLoader is misused and - # a None value is passed in. 
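The deprecated `find_module`/`load_module` pair being removed here is replaced by the `find_spec`/`Loader` protocol (PEP 451) that `_LazyFinder` and `_AliasLoader` implement above. For readers unfamiliar with that protocol, a minimal standalone aliasing finder looks roughly like the sketch below; it is not the patch's implementation, and the alias name `my_alias_json` is made up:

```
import importlib
import importlib.abc
import importlib.util
import sys


class AliasLoader(importlib.abc.Loader):
    def __init__(self, target):
        self._target = target  # dotted name of the real module

    def create_module(self, spec):
        # Return the already-importable target module instead of a new one;
        # the alias ends up pointing at the very same module object.
        return importlib.import_module(self._target)

    def exec_module(self, module):
        # Nothing to execute; the target module is already initialized.
        pass


class AliasFinder(importlib.abc.MetaPathFinder):
    def __init__(self, aliases):
        self._aliases = aliases  # alias name -> real module name

    def find_spec(self, fullname, path, target=None):
        if fullname in self._aliases:
            return importlib.util.spec_from_loader(
                fullname, AliasLoader(self._aliases[fullname])
            )
        return None  # let the regular import machinery handle everything else


sys.meta_path.insert(0, AliasFinder({"my_alias_json": "json"}))

import my_alias_json  # resolves to the stdlib json module

print(my_alias_json.dumps({"aliased": True}))
```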
- raise ImportError( - "Metaflow Extensions shadowed module '%s' does not exist" % fullname + _ext_debug("Looking for original module '%s'" % orig_name) + prefix = ".".join(name_parts[:orig_idx]) + self._temp_excluded_prefix.add(prefix) + # We also have to remove the module temporarily while we look for the + # new spec since otherwise it returns the spec of that loaded module. + # module is also restored *after* we call `create_module` in the loader + # otherwise it just returns None. We also swap out the parent module so that + # the search can start from there. + loaded_module = sys.modules.get(orig_name) + if loaded_module: + del sys.modules[orig_name] + parent_module = sys.modules.get(parent_name) if parent_name else None + if parent_module: + sys.modules[parent_name] = sys.modules[".".join([parent_name, "_orig"])] + + # This finds the spec that would have existed had we not added all our + # _LazyFinders + spec = importlib.util.find_spec(orig_name) + + self._temp_excluded_prefix.remove(prefix) + + if not spec: + return None + + if spec.submodule_search_locations: + self._orig_search_paths.update(spec.submodule_search_locations) + + _ext_debug("Found original spec %s" % spec) + + # Change the spec + spec.loader = _OrigLoader( + fullname, + spec.loader, + loaded_module, + parent_module, ) - else: - raise ImportError - return sys.modules[fullname] - @staticmethod - def _can_handle_orig_module(): - return sys.version_info[0] >= 3 and sys.version_info[1] >= 4 + return spec + + for p in path or []: + if p in self._orig_search_paths: + # We need to look in some of the "_orig" modules + orig_override_name = ".".join( + name_parts[:-1] + ["_orig", name_parts[-1]] + ) + _ext_debug( + "Looking for %s as an original module: searching for %s" + % (fullname, orig_override_name) + ) + return importlib.util.find_spec(orig_override_name) + if len(name_parts) > 1: + # This checks for submodules of things we handle. We check for the most + # specific submodule match and use that + chop_idx = 1 + while chop_idx < len(name_parts): + parent_name = ".".join(name_parts[:-chop_idx]) + if parent_name in self._handled: + orig = self._handled[parent_name] + if isinstance(orig, types.ModuleType): + orig_name = ".".join( + [orig.__orig_name__] + name_parts[-chop_idx:] + ) + else: + orig_name = ".".join([orig] + name_parts[-chop_idx:]) + return importlib.util.spec_from_loader( + fullname, _AliasLoader(fullname, orig_name) + ) + chop_idx += 1 + return None diff --git a/metaflow/flowspec.py b/metaflow/flowspec.py index fb9dec0ad76..606a400e94d 100644 --- a/metaflow/flowspec.py +++ b/metaflow/flowspec.py @@ -7,7 +7,7 @@ from types import FunctionType, MethodType from . import cmd_with_io -from .parameters import Parameter +from .parameters import DelayedEvaluationParameter, Parameter from .exception import ( MetaflowException, MissingInMergeArtifactsException, @@ -51,7 +51,6 @@ class FlowSpec(object): Attributes ---------- - script_name index input """ @@ -107,6 +106,8 @@ def __init__(self, use_cli=True): @property def script_name(self): """ + [Legacy function - do not use. Use `current` instead] + Returns the name of the script containing the flow Returns @@ -144,9 +145,8 @@ def _set_constants(self, graph, kwargs): for var, param in self._get_parameters(): seen.add(var) val = kwargs[param.name.replace("-", "_").lower()] - # Support for delayed evaluation of parameters. This is used for - # includefile in particular - if callable(val): + # Support for delayed evaluation of parameters. 
+ if isinstance(val, DelayedEvaluationParameter): val = val() val = val.split(param.separator) if val and param.separator else val setattr(self, var, val) @@ -203,6 +203,8 @@ def _set_datastore(self, datastore): def __iter__(self): """ + [Legacy function - do not use] + Iterate over all steps in the Flow Returns @@ -223,26 +225,27 @@ def __getattr__(self, name): raise AttributeError("Flow %s has no attribute '%s'" % (self.name, name)) def cmd(self, cmdline, input={}, output=[]): + """ + [Legacy function - do not use] + """ return cmd_with_io.cmd(cmdline, input=input, output=output) @property def index(self): """ - Index of the task in a foreach step + The index of this foreach branch. In a foreach step, multiple instances of this step (tasks) will be executed, - one for each element in the foreach. - This property returns the zero based index of the current task. If this is not - a foreach step, this returns None. + one for each element in the foreach. This property returns the zero based index + of the current task. If this is not a foreach step, this returns None. - See Also - -------- - foreach_stack: A detailed example is given in the documentation of this function + If you need to know the indices of the parent tasks in a nested foreach, use + `FlowSpec.foreach_stack`. Returns ------- int - Index of the task in a foreach step + Index of the task in a foreach step. """ if self._foreach_stack: return self._foreach_stack[-1].index @@ -250,30 +253,28 @@ def index(self): @property def input(self): """ - Value passed to the task in a foreach step + The value of the foreach artifact in this foreach branch. In a foreach step, multiple instances of this step (tasks) will be executed, - one for each element in the foreach. - This property returns the element passed to the current task. If this is not - a foreach step, this returns None. + one for each element in the foreach. This property returns the element passed + to the current task. If this is not a foreach step, this returns None. - See Also - -------- - foreach_stack: A detailed example is given in the documentation of this function + If you need to know the values of the parent tasks in a nested foreach, use + `FlowSpec.foreach_stack`. Returns ------- object - Input passed to the task (can be any object) + Input passed to the foreach task. """ return self._find_input() def foreach_stack(self): """ - Returns the current stack of foreach steps for the current step + Returns the current stack of foreach indexes and values for the current step. - This effectively corresponds to the indexes and values at the various levels of nesting. - For example, considering the following code: + Use this information to understand what data is being processed in the current + foreach branch. For example, considering the following code: ``` @step def root(self): @@ -289,26 +290,31 @@ def nest_1(self): def nest_2(self): foo = self.foreach_stack() ``` - foo will take the following values in the various tasks for nest_2: + + `foo` will take the following values in the various tasks for nest_2: + ``` [(0, 3, 'a'), (0, 4, 'd')] [(0, 3, 'a'), (1, 4, 'e')] ... [(0, 3, 'a'), (3, 4, 'g')] [(1, 3, 'b'), (0, 4, 'd')] ... - + ``` where each tuple corresponds to: - - the index of the task for that level of the loop - - the number of splits for that level of the loop - - the value for that level of the loop + + - The index of the task for that level of the loop. + - The number of splits for that level of the loop. + - The value for that level of the loop. 
+ Note that the last tuple returned in a task corresponds to: - - first element: value returned by self.index - - third element: value returned by self.input + + - 1st element: value returned by `self.index`. + - 3rd element: value returned by `self.input`. Returns ------- List[Tuple[int, int, object]] - An array describing the current stack of foreach steps + An array describing the current stack of foreach steps. """ return [ (frame.index, frame.num_splits, self._find_input(stack_index=i)) @@ -349,15 +355,16 @@ def _find_input(self, stack_index=None): def merge_artifacts(self, inputs, exclude=[], include=[]): """ - Merge the artifacts coming from each merge branch (from inputs) + Helper function for merging artifacts in a join step. This function takes all the artifacts coming from the branches of a join point and assigns them to self in the calling step. Only artifacts not set in the current step are considered. If, for a given artifact, different - values are present on the incoming edges, an error will be thrown (and the artifacts - that "conflict" will be reported). + values are present on the incoming edges, an error will be thrown and the artifacts + that conflict will be reported. As a few examples, in the simple graph: A splitting into B and C and joining in D: + ``` A: self.x = 5 self.y = 6 @@ -369,34 +376,34 @@ def merge_artifacts(self, inputs, exclude=[], include=[]): D: merge_artifacts(inputs) - + ``` In D, the following artifacts are set: - - y (value: 6), b_var (value: 1) - - if from_b and from_c are the same, x will be accessible and have value from_b - - if from_b and from_c are different, an error will be thrown. To prevent this error, - you need to manually set self.x in D to a merged value (for example the max) prior to - calling merge_artifacts. + - `y` (value: 6), `b_var` (value: 1) + - if `from_b` and `from_c` are the same, `x` will be accessible and have value `from_b` + - if `from_b` and `from_c` are different, an error will be thrown. To prevent this error, + you need to manually set `self.x` in D to a merged value (for example the max) prior to + calling `merge_artifacts`. Parameters ---------- inputs : List[Steps] - Incoming steps to the join point + Incoming steps to the join point. exclude : List[str], optional If specified, do not consider merging artifacts with a name in `exclude`. - Cannot specify if `include` is also specified + Cannot specify if `include` is also specified. include : List[str], optional If specified, only merge artifacts specified. Cannot specify if `exclude` is - also specified + also specified. Raises ------ MetaflowException - This exception is thrown if this is not called in a join step + This exception is thrown if this is not called in a join step. UnhandledInMergeArtifactsException - This exception is thrown in case of unresolved conflicts + This exception is thrown in case of unresolved conflicts. MissingInMergeArtifactsException - This exception is thrown in case an artifact specified in `include cannot - be found + This exception is thrown in case an artifact specified in `include` cannot + be found. 
""" node = self._graph[self._current_step] if node.type != "join": @@ -436,7 +443,7 @@ def merge_artifacts(self, inputs, exclude=[], include=[]): if v not in to_merge and not hasattr(self, v): missing.append(v) if unresolved: - # We have unresolved conflicts so we do not set anything and error out + # We have unresolved conflicts, so we do not set anything and error out msg = ( "Step *{step}* cannot merge the following artifacts due to them " "having conflicting values:\n[{artifacts}].\nTo remedy this issue, " @@ -483,22 +490,32 @@ def _validate_ubf_step(self, step_name): def next(self, *dsts, **kwargs): """ - Indicates the next step to execute at the end of this step + Indicates the next step to execute after this step has completed. - This statement should appear once and only once in each and every step (except the `end` - step). Furthermore, it should be the last statement in the step. + This statement should appear as the last statement of each step, except + the end step. There are several valid formats to specify the next step: - - Straight-line connection: self.next(self.next_step) where `next_step` is a method in - the current class decorated with the `@step` decorator - - Static fan-out connection: self.next(self.step1, self.step2, ...) where `stepX` are - methods in the current class decorated with the `@step` decorator - - Foreach branch: - self.next(self.foreach_step, foreach='foreach_iterator') - In this situation, `foreach_step` is a method in the current class decorated with the - `@step` docorator and `foreach_iterator` is a variable name in the current class that - evaluates to an iterator. A task will be launched for each value in the iterator and - each task will execute the code specified by the step `foreach_step`. + + - Straight-line connection: `self.next(self.next_step)` where `next_step` is a method in + the current class decorated with the `@step` decorator. + + - Static fan-out connection: `self.next(self.step1, self.step2, ...)` where `stepX` are + methods in the current class decorated with the `@step` decorator. + + - Foreach branch: + ``` + self.next(self.foreach_step, foreach='foreach_iterator') + ``` + In this situation, `foreach_step` is a method in the current class decorated with the + `@step` decorator and `foreach_iterator` is a variable name in the current class that + evaluates to an iterator. A task will be launched for each value in the iterator and + each task will execute the code specified by the step `foreach_step`. + + Parameters + ---------- + dsts : Method + One or more methods annotated with `@step`. Raises ------ diff --git a/metaflow/graph.py b/metaflow/graph.py index 9651f34bb2f..4ea41e0114f 100644 --- a/metaflow/graph.py +++ b/metaflow/graph.py @@ -5,7 +5,7 @@ def deindent_docstring(doc): if doc: - # Find the indent to remove from the doctring. We consider the following possibilities: + # Find the indent to remove from the docstring. 
We consider the following possibilities: # Option 1: # """This is the first line # This is the second line @@ -186,7 +186,7 @@ def _create_nodes(self, flow): def _postprocess(self): # any node who has a foreach as any of its split parents - # has is_inside_foreach=True *unless* all of those foreaches + # has is_inside_foreach=True *unless* all of those `foreach`s # are joined by the node for node in self.nodes.values(): foreaches = [ diff --git a/metaflow/includefile.py b/metaflow/includefile.py index 9131c012ed6..eac99560e42 100644 --- a/metaflow/includefile.py +++ b/metaflow/includefile.py @@ -1,3 +1,4 @@ +from collections import namedtuple import gzip import io @@ -8,259 +9,218 @@ from metaflow._vendor import click -from . import parameters -from .current import current from .exception import MetaflowException -from .metaflow_config import DATATOOLS_LOCALROOT, DATATOOLS_SUFFIX -from .parameters import DeployTimeField, Parameter -from .util import to_unicode +from .parameters import ( + DelayedEvaluationParameter, + DeployTimeField, + Parameter, + ParameterContext, +) +from .util import get_username -try: - # python2 - from urlparse import urlparse -except: - # python3 - from urllib.parse import urlparse +import functools -# TODO: This local "client" and the general notion of dataclients should probably -# be moved somewhere else. Putting here to keep this change compact for now -class MetaflowLocalURLException(MetaflowException): - headline = "Invalid path" +# _tracefunc_depth = 0 -class MetaflowLocalNotFound(MetaflowException): - headline = "Local object not found" +# def tracefunc(func): +# """Decorates a function to show its trace.""" +# @functools.wraps(func) +# def tracefunc_closure(*args, **kwargs): +# global _tracefunc_depth +# """The closure.""" +# print(f"{_tracefunc_depth}: {func.__name__}(args={args}, kwargs={kwargs})") +# _tracefunc_depth += 1 +# result = func(*args, **kwargs) +# _tracefunc_depth -= 1 +# print(f"{_tracefunc_depth} => {result}") +# return result -class LocalObject(object): - """ - This object represents a local object. It is a very thin wrapper - to allow it to be used in the same way as the S3Object (only as needed - in this usecase) - - Get or list calls return one or more of LocalObjects. - """ - - def __init__(self, url, path): - - # all fields of S3Object should return a unicode object - def ensure_unicode(x): - return None if x is None else to_unicode(x) - - path = ensure_unicode(path) - - self._path = path - self._url = url - - if self._path: - try: - os.stat(self._path) - except FileNotFoundError: - self._path = None - - @property - def exists(self): - """ - Does this key correspond to an actual file? - """ - return self._path is not None and os.path.isfile(self._path) - - @property - def url(self): - """ - Local location of the object; this is the path prefixed with local:// - """ - return self._url - - @property - def path(self): - """ - Path to the local file - """ - return self._path +# return tracefunc_closure -class Local(object): - """ - This class allows you to access the local filesystem in a way similar to the S3 datatools - client. It is a stripped down version for now and only implements the functionality needed - for this use case. - - In the future, we may want to allow it to be used in a way similar to the S3() client. 
- """ - - @staticmethod - def _makedirs(path): - try: - os.makedirs(path) - except OSError as x: - if x.errno == 17: - return - else: - raise - - @classmethod - def get_root_from_config(cls, echo, create_on_absent=True): - result = DATATOOLS_LOCALROOT - if result is None: - from .datastore.local_storage import LocalStorage - - result = LocalStorage.get_datastore_root_from_config(echo, create_on_absent) - result = os.path.join(result, DATATOOLS_SUFFIX) - if create_on_absent and not os.path.exists(result): - os.mkdir(result) - return result - - def __init__(self): - """ - Initialize a new context for Local file operations. This object is based used as - a context manager for a with statement. - """ - pass - - def __enter__(self): - return self - - def __exit__(self, *args): - pass - - def _path(self, key): - key = to_unicode(key) - if key.startswith(u"local://"): - return key[8:] - elif key[0] != u"/": - if current.is_running_flow: - raise MetaflowLocalURLException( - "Specify Local(run=self) when you use Local inside a running " - "flow. Otherwise you have to use Local with full " - "local:// urls or absolute paths." - ) - else: - raise MetaflowLocalURLException( - "Initialize Local with an 'localroot' or 'run' if you don't " - "want to specify full local:// urls or absolute paths." - ) - else: - return key - - def get(self, key=None, return_missing=False): - p = self._path(key) - url = u"local://%s" % p - if not os.path.isfile(p): - if return_missing: - p = None - else: - raise MetaflowLocalNotFound("Local URL %s not found" % url) - return LocalObject(url, p) - - def put(self, key, obj, overwrite=True): - p = self._path(key) - if overwrite or (not os.path.exists(p)): - Local._makedirs(os.path.dirname(p)) - with open(p, "wb") as f: - f.write(obj) - return u"local://%s" % p +_DelayedExecContext = namedtuple( + "_DelayedExecContext", "flow_name path is_text encoding handler_type echo" +) # From here on out, this is the IncludeFile implementation. 
-from .datatools import S3 +from metaflow.plugins.datatools import Local, S3 +from metaflow.plugins.azure.includefile_support import Azure +from metaflow.plugins.gcp.includefile_support import GS -DATACLIENTS = {"local": Local, "s3": S3} +DATACLIENTS = { + "local": Local, + "s3": S3, + "azure": Azure, + "gs": GS, +} -class LocalFile: - def __init__(self, is_text, encoding, path): - self._is_text = is_text - self._encoding = encoding - self._path = path +class IncludedFile(object): + # Thin wrapper to indicate to the MF client that this object is special + # and should be handled as an IncludedFile when returning it (ie: fetching + # the actual content) - @classmethod - def is_file_handled(cls, path): - if path: - decoded_value = Uploader.decode_value(to_unicode(path)) - if decoded_value["type"] == "self": - return ( - True, - LocalFile( - decoded_value["is_text"], - decoded_value["encoding"], - decoded_value["url"], - ), - None, - ) - path = decoded_value["url"] - for prefix, handler in DATACLIENTS.items(): - if path.startswith(u"%s://" % prefix): - return True, Uploader(handler), None - try: - with open(path, mode="r") as _: - pass - except OSError: - return False, None, "IncludeFile: could not open file '%s'" % path - return True, None, None + # @tracefunc + def __init__(self, descriptor): + self._descriptor = descriptor + self._cached_size = None - def __str__(self): - return self._path + @property + def descriptor(self): + return self._descriptor - def __repr__(self): - return self._path - - def __call__(self, ctx): - # We check again if this is a local file that exists. We do this here because - # we always convert local files to DeployTimeFields irrespective of whether - # the file exists. - ok, _, err = LocalFile.is_file_handled(self._path) - if not ok: - raise MetaflowException(err) - client = DATACLIENTS.get(ctx.ds_type) - if client: - return Uploader(client).store( - ctx.flow_name, self._path, self._is_text, self._encoding, ctx.logger + @property + # @tracefunc + def size(self): + if self._cached_size is not None: + return self._cached_size + handler = UPLOADERS.get(self.descriptor.get("type", None), None) + if handler is None: + raise MetaflowException( + "Could not interpret size of IncludedFile: %s" + % json.dumps(self.descriptor) ) - raise MetaflowException( - "IncludeFile: no client found for datastore type %s" % ctx.ds_type - ) + self._cached_size = handler.size(self._descriptor) + return self._cached_size + + # @tracefunc + def decode(self, name, var_type="Artifact"): + # We look for the uploader for it and decode it + handler = UPLOADERS.get(self.descriptor.get("type", None), None) + if handler is None: + raise MetaflowException( + "%s '%s' could not be loaded (IncludedFile) because no handler found: %s" + % (var_type, name, json.dumps(self.descriptor)) + ) + return handler.load(self._descriptor) class FilePathClass(click.ParamType): name = "FilePath" - # The logic for this class is as follows: - # - It will always return a path that indicates the persisted path of the file. 
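The `DATACLIENTS` registry introduced above maps a URL scheme to a data client (Local, S3, Azure, GS), and the `_get_handler` helper added further down in this file resolves an uploaded file's URL back to its client the same way. A minimal sketch of that scheme-based dispatch (the URLs, and the strings standing in for the client classes, are illustrative):

```
# Sketch of scheme-based dispatch; the real registry maps to Local/S3/Azure/GS.
DATACLIENTS = {"local": "Local", "s3": "S3", "azure": "Azure", "gs": "GS"}


def handler_for(url):
    prefix_pos = url.find("://")
    if prefix_pos < 0:
        raise ValueError("Malformed URL: '%s'" % url)
    handler = DATACLIENTS.get(url[:prefix_pos])
    if handler is None:
        raise ValueError("No data client for '%s'" % url[:prefix_pos])
    return handler


print(handler_for("s3://my-bucket/flow/abc123"))      # -> S3
print(handler_for("local:///tmp/metaflow/flow/abc"))  # -> Local
```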
- # + If the value is already such a string, nothing happens and it returns that same value - # + If the value is a LocalFile, it will persist the local file and return the path - # of the persisted file - # - The artifact will be persisted prior to any run (for non-scheduled runs through persist_constants) - # + This will therefore persist a simple string - # - When the parameter is loaded again, the load_parameter in the IncludeFile class will get called - # which will download and return the bytes of the persisted file. + def __init__(self, is_text, encoding): self._is_text = is_text self._encoding = encoding def convert(self, value, param, ctx): - if callable(value): - # Already a correct type + # Click can call convert multiple times, so we need to make sure to only + # convert once. This function will return a DelayedEvaluationParameter + # (if it needs to still perform an upload) or an IncludedFile if not + if isinstance(value, (DelayedEvaluationParameter, IncludedFile)): return value - value = os.path.expanduser(value) - ok, file_type, err = LocalFile.is_file_handled(value) - if not ok: - self.fail(err) - if file_type is None: - # Here, we need to store the file - return lambda is_text=self._is_text, encoding=self._encoding, value=value, ctx=parameters.context_proto: LocalFile( - is_text, encoding, value - )( - ctx + # Value will be a string containing one of two things: + # - Scenario A: a JSON blob indicating that the file has already been uploaded. + # This scenario this happens in is as follows: + # + `step-functions create` is called and the IncludeFile has a default + # value. At the time of creation, the file is uploaded and a URL is + # returned; this URL is packaged in a blob by Uploader and passed to + # step-functions as the value of the parameter. + # + when the step function actually runs, the value is passed to click + # through METAFLOW_INIT_XXX; this value is the one returned above + # - Scenario B: A path. The path can either be: + # + B.1: :// like s3://foo/bar or local:///foo/bar + # (right now, we are disabling support for this because the artifact + # can change unlike all other artifacts. It is trivial to re-enable + # + B.2: an actual path to a local file like /foo/bar + # In the first case, we just store an *external* reference to it (so we + # won't upload anything). In the second case we will want to upload something, + # but we only do that in the DelayedEvaluationParameter step. + + # ctx can be one of two things: + # - the click context (when called normally) + # - the ParameterContext (when called through _eval_default) + # If not a ParameterContext, we convert it to that + if not isinstance(ctx, ParameterContext): + ctx = ParameterContext( + flow_name=ctx.obj.flow.name, + user_name=get_username(), + parameter_name=param.name, + logger=ctx.obj.echo, + ds_type=ctx.obj.datastore_impl.TYPE, ) - elif isinstance(file_type, LocalFile): - # This is a default file that we evaluate now (to delay upload - # until *after* the flow is checked) - return lambda f=file_type, ctx=parameters.context_proto: f(ctx) + + if len(value) > 0 and value[0] == "{": + # This is a blob; no URL starts with `{`. 
We are thus in scenario A + try: + value = json.loads(value) + except json.JSONDecodeError as e: + raise MetaflowException( + "IncludeFile '%s' (value: %s) is malformed" % (param.name, value) + ) + # All processing has already been done, so we just convert to an `IncludedFile` + return IncludedFile(value) + + path = os.path.expanduser(value) + + prefix_pos = path.find("://") + if prefix_pos > 0: + # Scenario B.1 + raise MetaflowException( + "IncludeFile using a direct reference to a file in cloud storage is no " + "longer supported. Contact the Metaflow team if you need this supported" + ) + # if DATACLIENTS.get(path[:prefix_pos]) is None: + # self.fail( + # "IncludeFile: no handler for external file of type '%s' " + # "(given path is '%s')" % (path[:prefix_pos], path) + # ) + # # We don't need to do anything more -- the file is already uploaded so we + # # just return a blob indicating how to get the file. + # return IncludedFile( + # CURRENT_UPLOADER.encode_url( + # "external", path, is_text=self._is_text, encoding=self._encoding + # ) + # ) else: - # We will just store the URL in the datastore along with text/encoding info - return lambda is_text=self._is_text, encoding=self._encoding, value=value: Uploader.encode_url( - "external", value, is_text=is_text, encoding=encoding + # Scenario B.2 + # Check if this is a valid local file + try: + with open(path, mode="r") as _: + pass + except OSError: + self.fail("IncludeFile: could not open file '%s' for reading" % path) + handler = DATACLIENTS.get(ctx.ds_type) + if handler is None: + self.fail( + "IncludeFile: no data-client for datastore of type '%s'" + % ctx.ds_type + ) + + # Now that we have done preliminary checks, we will delay uploading it + # until later (so it happens after PyLint checks the flow, but we prepare + # everything for it) + lambda_ctx = _DelayedExecContext( + flow_name=ctx.flow_name, + path=path, + is_text=self._is_text, + encoding=self._encoding, + handler_type=ctx.ds_type, + echo=ctx.logger, + ) + + def _delayed_eval_func(ctx=lambda_ctx, return_str=False): + incl_file = IncludedFile( + CURRENT_UPLOADER.store( + ctx.flow_name, + ctx.path, + ctx.is_text, + ctx.encoding, + DATACLIENTS[ctx.handler_type], + ctx.echo, + ) + ) + if return_str: + return json.dumps(incl_file.descriptor) + return incl_file + + return DelayedEvaluationParameter( + ctx.parameter_name, + "default", + functools.partial(_delayed_eval_func, ctx=lambda_ctx), ) def __str__(self): @@ -271,84 +231,115 @@ def __repr__(self): class IncludeFile(Parameter): + """ + Includes a local file as a parameter for the flow. + + `IncludeFile` behaves like `Parameter` except that it reads its value from a file instead of + the command line. The user provides a path to a file on the command line. The file contents + are saved as a read-only artifact which is available in all steps of the flow. + + Parameters + ---------- + name : str + User-visible parameter name. + default : str or a function + Default path to a local file. A function + implies that the parameter corresponds to a *deploy-time parameter*. + is_text : bool + Convert the file contents to a string using the provided `encoding` (default: True). + If False, the artifact is stored in `bytes`. + encoding : str + Use this encoding to decode the file contexts if `is_text=True` (default: `utf-8`). + required : bool + Require that the user specified a value for the parameter. + `required=True` implies that the `default` is not used. + help : str + Help text to show in `run --help`. 
+ show_default : bool + If True, show the default value in the help text (default: True). + """ + def __init__( self, name, required=False, is_text=True, encoding=None, help=None, **kwargs ): - # Defaults are DeployTimeField + # If a default is specified, it needs to be uploaded when the flow is deployed + # (for example when doing a `step-functions create`) so we make the default + # be a DeployTimeField. This means that it will be evaluated in two cases: + # - by deploy_time_eval for `step-functions create` and related. + # - by Click when evaluating the parameter. + # + # In the first case, we will need to fully upload the file whereas in the + # second case, we can just return the string as the FilePath.convert method + # will take care of evaluating things. v = kwargs.get("default") if v is not None: - _, file_type, _ = LocalFile.is_file_handled(v) - # Ignore error because we may never use the default - if file_type is None: - o = {"type": "self", "is_text": is_text, "encoding": encoding, "url": v} - kwargs["default"] = DeployTimeField( - name, - str, - "default", - lambda ctx, full_evaluation, o=o: LocalFile( - o["is_text"], o["encoding"], o["url"] - )(ctx) - if full_evaluation - else json.dumps(o), - print_representation=v, - ) - else: - kwargs["default"] = DeployTimeField( - name, - str, - "default", - lambda _, __, is_text=is_text, encoding=encoding, v=v: Uploader.encode_url( - "external-default", v, is_text=is_text, encoding=encoding - ), - print_representation=v, - ) + # If the default is a callable, we have two DeployTimeField: + # - the callable nature of the default will require us to "call" the default + # (so that is the outer DeployTimeField) + # - IncludeFile defaults are always DeployTimeFields (since they need to be + # uploaded) + # + # Therefore, if the default value is itself a callable, we will have + # a DeployTimeField (upload the file) wrapping another DeployTimeField + # (call the default) + if callable(v) and not isinstance(v, DeployTimeField): + # If default is a callable, make it a DeployTimeField (the inner one) + v = DeployTimeField(name, str, "default", v, return_str=True) + kwargs["default"] = DeployTimeField( + name, + str, + "default", + IncludeFile._eval_default(is_text, encoding, v), + print_representation=v, + ) super(IncludeFile, self).__init__( name, required=required, help=help, type=FilePathClass(is_text, encoding), - **kwargs + **kwargs, ) - def load_parameter(self, val): - if val is None: - return val - ok, file_type, err = LocalFile.is_file_handled(val) - if not ok: - raise MetaflowException( - "Parameter '%s' could not be loaded: %s" % (self.name, err) - ) - if file_type is None or isinstance(file_type, LocalFile): - raise MetaflowException( - "Parameter '%s' was not properly converted" % self.name - ) - return file_type.load(val) + def load_parameter(self, v): + if v is None: + return v + return v.decode(self.name, var_type="Parameter") + @staticmethod + def _eval_default(is_text, encoding, default_path): + # NOTE: If changing name of this function, check comments that refer to it to + # update it. + def do_eval(ctx, deploy_time): + if isinstance(default_path, DeployTimeField): + d = default_path(deploy_time=deploy_time) + else: + d = default_path + if deploy_time: + fp = FilePathClass(is_text, encoding) + val = fp.convert(d, None, ctx) + if isinstance(val, DelayedEvaluationParameter): + val = val() + # At this point this is an IncludedFile, but we need to make it + # into a string so that it can be properly saved. 
+ return json.dumps(val.descriptor) + else: + return d -class Uploader: + return do_eval - file_type = "uploader-v1" - def __init__(self, client_class): - self._client_class = client_class +class UploaderV1: + file_type = "uploader-v1" - @staticmethod - def encode_url(url_type, url, **kwargs): - # Avoid encoding twice (default -> URL -> _convert method of FilePath for example) - if url is None or len(url) == 0 or url[0] == "{": - return url + @classmethod + def encode_url(cls, url_type, url, **kwargs): return_value = {"type": url_type, "url": url} return_value.update(kwargs) - return json.dumps(return_value) - - @staticmethod - def decode_value(value): - if value is None or len(value) == 0 or value[0] != "{": - return {"type": "base", "url": value} - return json.loads(value) + return return_value - def store(self, flow_name, path, is_text, encoding, echo): + @classmethod + def store(cls, flow_name, path, is_text, encoding, handler, echo): sz = os.path.getsize(path) unit = ["B", "KB", "MB", "GB", "TB"] pos = 0 @@ -370,39 +361,147 @@ def store(self, flow_name, path, is_text, encoding, echo): "large to be properly handled by Python 2.7" % path ) sha = sha1(input_file).hexdigest() - path = os.path.join( - self._client_class.get_root_from_config(echo, True), flow_name, sha - ) + path = os.path.join(handler.get_root_from_config(echo, True), flow_name, sha) buf = io.BytesIO() + with gzip.GzipFile(fileobj=buf, mode="wb", compresslevel=3) as f: f.write(input_file) buf.seek(0) - with self._client_class() as client: + + with handler() as client: url = client.put(path, buf.getvalue(), overwrite=False) - echo("File persisted at %s" % url) - return Uploader.encode_url( - Uploader.file_type, url, is_text=is_text, encoding=encoding + + return cls.encode_url(cls.file_type, url, is_text=is_text, encoding=encoding) + + @classmethod + def size(cls, descriptor): + # We never have the size so we look it up + url = descriptor["url"] + handler = cls._get_handler(url) + with handler() as client: + obj = client.info(url, return_missing=True) + if obj.exists: + return obj.size + raise FileNotFoundError("File at '%s' does not exist" % url) + + @classmethod + def load(cls, descriptor): + url = descriptor["url"] + handler = cls._get_handler(url) + with handler() as client: + obj = client.get(url, return_missing=True) + if obj.exists: + if descriptor["type"] == cls.file_type: + # We saved this file directly, so we know how to read it out + with gzip.GzipFile(filename=obj.path, mode="rb") as f: + if descriptor["is_text"]: + return io.TextIOWrapper( + f, encoding=descriptor.get("encoding") + ).read() + return f.read() + else: + # We open this file according to the is_text and encoding information + if descriptor["is_text"]: + return io.open( + obj.path, mode="rt", encoding=descriptor.get("encoding") + ).read() + else: + return io.open(obj.path, mode="rb").read() + raise FileNotFoundError("File at '%s' does not exist" % descriptor["url"]) + + @staticmethod + def _get_handler(url): + prefix_pos = url.find("://") + if prefix_pos < 0: + raise MetaflowException("Malformed URL: '%s'" % url) + prefix = url[:prefix_pos] + handler = DATACLIENTS.get(prefix) + if handler is None: + raise MetaflowException("Could not find data client for '%s'" % prefix) + return handler + + +class UploaderV2: + + file_type = "uploader-v2" + + @classmethod + def encode_url(cls, url_type, url, **kwargs): + return_value = { + "note": "Internal representation of IncludeFile(%s)" % url, + "type": cls.file_type, + "sub-type": url_type, + "url": url, + } + 
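        # Illustrative example of a complete descriptor once UploaderV2.store() has
        # filled everything in (values are made up):
        #   {
        #       "note": "Internal representation of IncludeFile(data/train.csv)",
        #       "type": "uploader-v2",
        #       "sub-type": "uploaded",
        #       "url": "s3://my-bucket/metaflow/data/MyFlow/<sha1-of-contents>",
        #       "is_text": True,
        #       "encoding": None,
        #       "size": 1048576,
        #   }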
return_value.update(kwargs) + return return_value + + @classmethod + def store(cls, flow_name, path, is_text, encoding, handler, echo): + r = UploaderV1.store(flow_name, path, is_text, encoding, handler, echo) + + # In V2, we store size for faster access + r["note"] = "Internal representation of IncludeFile(%s)" % path + r["type"] = cls.file_type + r["sub-type"] = "uploaded" + r["size"] = os.stat(path).st_size + return r + + @classmethod + def size(cls, descriptor): + if descriptor["sub-type"] == "uploaded": + return descriptor["size"] + else: + # This was a file that was external, so we get information on it + url = descriptor["url"] + handler = cls._get_handler(url) + with handler() as client: + obj = client.info(url, return_missing=True) + if obj.exists: + return obj.size + raise FileNotFoundError( + "%s file at '%s' does not exist" + % (descriptor["sub-type"].capitalize(), url) ) - def load(self, value): - value_info = Uploader.decode_value(value) - with self._client_class() as client: - obj = client.get(value_info["url"], return_missing=True) + @classmethod + def load(cls, descriptor): + url = descriptor["url"] + # We know the URL is in a :// format so we just extract the handler + handler = cls._get_handler(url) + with handler() as client: + obj = client.get(url, return_missing=True) if obj.exists: - if value_info["type"] == Uploader.file_type: - # We saved this file directly so we know how to read it out + if descriptor["sub-type"] == "uploaded": + # We saved this file directly, so we know how to read it out with gzip.GzipFile(filename=obj.path, mode="rb") as f: - if value_info["is_text"]: + if descriptor["is_text"]: return io.TextIOWrapper( - f, encoding=value_info.get("encoding") + f, encoding=descriptor.get("encoding") ).read() return f.read() else: # We open this file according to the is_text and encoding information - if value_info["is_text"]: + if descriptor["is_text"]: return io.open( - obj.path, mode="rt", encoding=value_info.get("encoding") + obj.path, mode="rt", encoding=descriptor.get("encoding") ).read() else: return io.open(obj.path, mode="rb").read() - raise FileNotFoundError("File at %s does not exist" % value_info["url"]) + # If we are here, the file does not exist + raise FileNotFoundError( + "%s file at '%s' does not exist" + % (descriptor["sub-type"].capitalize(), url) + ) + + @staticmethod + def _get_handler(url): + return UploaderV1._get_handler(url) + + +UPLOADERS = { + "uploader-v1": UploaderV1, + "external": UploaderV1, + "uploader-v2": UploaderV2, +} +CURRENT_UPLOADER = UploaderV2 diff --git a/metaflow/lint.py b/metaflow/lint.py index c282cc31b76..ebcb8e98e24 100644 --- a/metaflow/lint.py +++ b/metaflow/lint.py @@ -246,6 +246,7 @@ def traverse(node, split_stack): ) else: raise LintWarn(msg2.format(node), node.func_lineno) + # check that incoming steps come from the same lineage # (no cross joins) def parents(n): diff --git a/metaflow/metadata/heartbeat.py b/metaflow/metadata/heartbeat.py index d2f01f2fb4b..7a73c0cf90d 100644 --- a/metaflow/metadata/heartbeat.py +++ b/metaflow/metadata/heartbeat.py @@ -3,8 +3,8 @@ import json from threading import Thread -from metaflow.sidecar_messages import MessageTypes, Message -from metaflow.metaflow_config import METADATA_SERVICE_HEADERS +from metaflow.sidecar import MessageTypes, Message +from metaflow.metaflow_config import SERVICE_HEADERS from metaflow.exception import MetaflowException HB_URL_KEY = "hb_url" @@ -19,8 +19,8 @@ def __init__(self, msg): class MetadataHeartBeat(object): def __init__(self): - self.headers 
= METADATA_SERVICE_HEADERS - self.req_thread = Thread(target=self.ping) + self.headers = SERVICE_HEADERS + self.req_thread = Thread(target=self._ping) self.req_thread.daemon = True self.default_frequency_secs = 10 self.hb_url = None @@ -28,19 +28,22 @@ def __init__(self): def process_message(self, msg): # type: (Message) -> None if msg.msg_type == MessageTypes.SHUTDOWN: - # todo shutdown doesnt do anything yet? should it still be called - self.shutdown() - if (not self.req_thread.is_alive()) and msg.msg_type == MessageTypes.LOG_EVENT: + self._shutdown() + if not self.req_thread.is_alive(): # set post url self.hb_url = msg.payload[HB_URL_KEY] # start thread self.req_thread.start() - def ping(self): + @classmethod + def get_worker(cls): + return cls + + def _ping(self): retry_counter = 0 while True: try: - frequency_secs = self.heartbeat() + frequency_secs = self._heartbeat() if frequency_secs is None or frequency_secs <= 0: frequency_secs = self.default_frequency_secs @@ -49,9 +52,9 @@ def ping(self): retry_counter = 0 except HeartBeatException as e: retry_counter = retry_counter + 1 - time.sleep(4 ** retry_counter) + time.sleep(4**retry_counter) - def heartbeat(self): + def _heartbeat(self): if self.hb_url is not None: response = requests.post(url=self.hb_url, data="{}", headers=self.headers) # Unfortunately, response.json() returns a string that we need @@ -67,6 +70,6 @@ def heartbeat(self): ) return None - def shutdown(self): + def _shutdown(self): # attempts sending one last heartbeat - self.heartbeat() + self._heartbeat() diff --git a/metaflow/metadata/metadata.py b/metaflow/metadata/metadata.py index c20b441f09e..70427c3589a 100644 --- a/metaflow/metadata/metadata.py +++ b/metaflow/metadata/metadata.py @@ -3,11 +3,11 @@ import re import time from collections import namedtuple -from datetime import datetime - -from metaflow.exception import MetaflowInternalError -from metaflow.util import get_username, resolve_identity +from itertools import chain +from metaflow.exception import MetaflowInternalError, MetaflowTaggingError +from metaflow.tagging_util import validate_tag +from metaflow.util import get_username, resolve_identity_as_tuple, is_stringish DataArtifact = namedtuple("DataArtifact", "name ds_type ds_root url type sha") @@ -47,6 +47,33 @@ def decorator(cls): return decorator +class ObjectOrder: + # Consider this list a constant that should never change. 
+ # Lots of code depend on the membership of this list as + # well as exact ordering + _order_as_list = [ + "root", + "flow", + "run", + "step", + "task", + "artifact", + "metadata", + "self", + ] + _order_as_dict = {v: i for i, v in enumerate(_order_as_list)} + + @staticmethod + def order_to_type(order): + if order < len(ObjectOrder._order_as_list): + return ObjectOrder._order_as_list[order] + return None + + @staticmethod + def type_to_order(obj_type): + return ObjectOrder._order_as_dict.get(obj_type) + + @with_metaclass(MetadataProviderMeta) class MetadataProvider(object): @classmethod @@ -128,6 +155,10 @@ def register_run_id(self, run_id, tags=None, sys_tags=None): Tags to apply to this particular run, by default None sys_tags : list, optional System tags to apply to this particular run, by default None + Returns + ------- + bool + True if a new run was registered; False if it already existed """ raise NotImplementedError() @@ -173,6 +204,10 @@ def register_task_id( Tags to apply to this particular run, by default [] sys_tags : list, optional System tags to apply to this particular run, by default [] + Returns + ------- + bool + True if a new run was registered; False if it already existed """ raise NotImplementedError() @@ -355,22 +390,12 @@ def get_object(cls, obj_type, sub_type, filters, attempt, *args): object or list : Depending on the call, the type of object return varies """ - obj_order = { - "root": 0, - "flow": 1, - "run": 2, - "step": 3, - "task": 4, - "artifact": 5, - "metadata": 6, - "self": 7, - } - type_order = obj_order.get(obj_type) - sub_order = obj_order.get(sub_type) + type_order = ObjectOrder.type_to_order(obj_type) + sub_order = ObjectOrder.type_to_order(sub_type) if type_order is None: raise MetaflowInternalError(msg="Cannot find type %s" % obj_type) - if type_order > 5: + if type_order >= ObjectOrder.type_to_order("metadata"): raise MetaflowInternalError(msg="Type %s is not allowed" % obj_type) if sub_order is None: @@ -400,17 +425,93 @@ def get_object(cls, obj_type, sub_type, filters, attempt, *args): pre_filter = cls._get_object_internal( obj_type, type_order, sub_type, sub_order, filters, attempt_int, *args ) - if attempt_int is None or sub_order != 6: + if attempt_int is None or sub_type != "metadata": # If no attempt or not for metadata, just return as is return pre_filter return MetadataProvider._reconstruct_metadata_for_attempt( pre_filter, attempt_int ) + @classmethod + def mutate_user_tags_for_run( + cls, flow_id, run_id, tags_to_remove=None, tags_to_add=None + ): + """ + Mutate the set of user tags for a run. + + Removals logically get applied after additions. Operations occur as a batch atomically. + Parameters + ---------- + flow_id : str + Flow id, that the run belongs to. 
+ run_id: str + Run id, together with flow_id, that identifies the specific Run whose tags to mutate + tags_to_remove: iterable over str + Iterable over tags to remove + tags_to_add: iterable over str + Iterable over tags to add + + Return + ------ + Run tags after mutation operations + """ + # perform common validation, across all provider implementations + if tags_to_remove is None: + tags_to_remove = [] + if tags_to_add is None: + tags_to_add = [] + if not tags_to_add and not tags_to_remove: + raise MetaflowTaggingError("Must add or remove at least one tag") + + if is_stringish(tags_to_add): + raise MetaflowTaggingError("tags_to_add may not be a string") + + if is_stringish(tags_to_remove): + raise MetaflowTaggingError("tags_to_remove may not be a string") + + def _is_iterable(something): + try: + iter(something) + return True + except TypeError: + return False + + if not _is_iterable(tags_to_add): + raise MetaflowTaggingError("tags_to_add must be iterable") + if not _is_iterable(tags_to_remove): + raise MetaflowTaggingError("tags_to_remove must be iterable") + + # check each tag is valid + for tag in chain(tags_to_add, tags_to_remove): + validate_tag(tag) + + # onto subclass implementation + final_user_tags = cls._mutate_user_tags_for_run( + flow_id, run_id, tags_to_add=tags_to_add, tags_to_remove=tags_to_remove + ) + return final_user_tags + + @classmethod + def _mutate_user_tags_for_run( + cls, flow_id, run_id, tags_to_add=None, tags_to_remove=None + ): + """ + To be implemented by subclasses of MetadataProvider. + + See mutate_user_tags_for_run() for expectations. + """ + raise NotImplementedError() + def _all_obj_elements(self, tags=None, sys_tags=None): + return MetadataProvider._all_obj_elements_static( + self._flow_name, tags=tags, sys_tags=sys_tags + ) + + @staticmethod + def _all_obj_elements_static(flow_name, tags=None, sys_tags=None): user = get_username() return { - "flow_id": self._flow_name, + "flow_id": flow_name, "user_name": user, "tags": list(tags) if tags else [], "system_tags": list(sys_tags) if sys_tags else [], @@ -424,11 +525,17 @@ def _flow_to_json(self): return {"flow_id": self._flow_name, "ts_epoch": int(round(time.time() * 1000))} def _run_to_json(self, run_id=None, tags=None, sys_tags=None): + return MetadataProvider._run_to_json_static( + self._flow_name, run_id=run_id, tags=tags, sys_tags=sys_tags + ) + + @staticmethod + def _run_to_json_static(flow_name, run_id=None, tags=None, sys_tags=None): if run_id is not None: d = {"run_number": run_id} else: d = {} - d.update(self._all_obj_elements(tags, sys_tags)) + d.update(MetadataProvider._all_obj_elements_static(flow_name, tags, sys_tags)) return d def _step_to_json(self, run_id, step_name, tags=None, sys_tags=None): @@ -497,28 +604,55 @@ def _metadata_to_json(self, run_id, step_name, task_id, metadata): for datum in metadata ] - def _tags(self): + def _get_system_info_as_dict(self): + """This function drives: + - sticky system tags initialization + - task-level metadata generation + """ + sys_info = dict() env = self._environment.get_environment_info() - tags = [ - resolve_identity(), - "runtime:" + env["runtime"], - "python_version:" + env["python_version_code"], - "date:" + datetime.utcnow().strftime("%Y-%m-%d"), - ] + sys_info["runtime"] = env["runtime"] + sys_info["python_version"] = env["python_version_code"] + identity_type, identity_value = resolve_identity_as_tuple() + sys_info[identity_type] = identity_value if env["metaflow_version"]: - tags.append("metaflow_version:" + env["metaflow_version"]) + 
sys_info["metaflow_version"] = env["metaflow_version"] if "metaflow_r_version" in env: - tags.append("metaflow_r_version:" + env["metaflow_r_version"]) + sys_info["metaflow_r_version"] = env["metaflow_r_version"] if "r_version_code" in env: - tags.append("r_version:" + env["r_version_code"]) - return tags + sys_info["r_version"] = env["r_version_code"] + return sys_info + + def _get_system_tags(self): + """Convert system info dictionary into a list of system tags""" + return [ + "{}:{}".format(k, v) for k, v in self._get_system_info_as_dict().items() + ] - def _register_code_package_metadata(self, run_id, step_name, task_id, attempt): + def _register_system_metadata(self, run_id, step_name, task_id, attempt): + """Gather up system and code packaging info and register them as task metadata""" metadata = [] + # Take everything from system info and store them as metadata + sys_info = self._get_system_info_as_dict() + + # field, and type could get long in theory...can the metadata backend handle it? + # E.g. as of 5/9/2022 Metadata service's DB says VARCHAR(255). + # It is likely overkill to fail a flow over an over-flow. We should expect the + # backend to try to tolerate this (e.g. enlarge columns, truncation fallback). + metadata.extend( + MetaDatum( + field=str(k), + value=str(v), + type=str(k), + tags=["attempt_id:{0}".format(attempt)], + ) + for k, v in sys_info.items() + ) + # Also store code packaging information code_sha = os.environ.get("METAFLOW_CODE_SHA") - code_url = os.environ.get("METAFLOW_CODE_URL") - code_ds = os.environ.get("METAFLOW_CODE_DS") if code_sha: + code_url = os.environ.get("METAFLOW_CODE_URL") + code_ds = os.environ.get("METAFLOW_CODE_DS") metadata.append( MetaDatum( field="code-package", @@ -529,8 +663,6 @@ def _register_code_package_metadata(self, run_id, step_name, task_id, attempt): tags=["attempt_id:{0}".format(attempt)], ) ) - # We don't tag with attempt_id here because not readily available; this - # is ok though as this doesn't change from attempt to attempt. 
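        # For illustration, each entry registered below ends up looking like
        # (values vary by environment):
        #   MetaDatum(field="runtime", value="dev", type="runtime",
        #             tags=["attempt_id:0"])
        #   MetaDatum(field="metaflow_version", value="2.7.1",
        #             type="metaflow_version", tags=["attempt_id:0"])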
if metadata: self.register_metadata(run_id, step_name, task_id, metadata) @@ -604,4 +736,4 @@ def __init__(self, environment, flow, event_logger, monitor): self._monitor = monitor self._environment = environment self._runtime = os.environ.get("METAFLOW_RUNTIME_NAME", "dev") - self.add_sticky_tags(sys_tags=self._tags()) + self.add_sticky_tags(sys_tags=self._get_system_tags()) diff --git a/metaflow/metadata/util.py b/metaflow/metadata/util.py index ec6ab58fcd5..705cfb50f52 100644 --- a/metaflow/metadata/util.py +++ b/metaflow/metadata/util.py @@ -5,7 +5,7 @@ from distutils.dir_util import copy_tree from metaflow import util -from metaflow.datastore.local_storage import LocalStorage +from metaflow.plugins.datastores.local_storage import LocalStorage def sync_local_metadata_to_datastore(metadata_local_dir, task_ds): @@ -27,7 +27,7 @@ def echo_none(*args, **kwargs): _, tarball = next(task_ds.parent_datastore.load_data([key_to_load])) with util.TempDir() as td: with tarfile.open(fileobj=BytesIO(tarball), mode="r:gz") as tar: - tar.extractall(td) + util.tar_safe_extract(tar, td) copy_tree( os.path.join(td, metadata_local_dir), LocalStorage.get_datastore_root_from_config(echo_none), diff --git a/metaflow/metaflow_config.py b/metaflow/metaflow_config.py index 8a2037bf465..8bc78adee99 100644 --- a/metaflow/metaflow_config.py +++ b/metaflow/metaflow_config.py @@ -1,55 +1,34 @@ -import os -import json import logging -import pkg_resources +import os import sys import types +import pkg_resources from metaflow.exception import MetaflowException +from metaflow.metaflow_config_funcs import from_conf, get_validate_choice_fn # Disable multithreading security on MacOS if sys.platform == "darwin": os.environ["OBJC_DISABLE_INITIALIZE_FORK_SAFETY"] = "YES" - -def init_config(): - # Read configuration from $METAFLOW_HOME/config_.json. - home = os.environ.get("METAFLOW_HOME", "~/.metaflowconfig") - profile = os.environ.get("METAFLOW_PROFILE") - path_to_config = os.path.join(home, "config.json") - if profile: - path_to_config = os.path.join(home, "config_%s.json" % profile) - path_to_config = os.path.expanduser(path_to_config) - config = {} - if os.path.exists(path_to_config): - with open(path_to_config) as f: - return json.load(f) - elif profile: - raise MetaflowException( - "Unable to locate METAFLOW_PROFILE '%s' in '%s')" % (profile, home) - ) - return config - - -# Initialize defaults required to setup environment variables. -METAFLOW_CONFIG = init_config() - - -def from_conf(name, default=None): - return os.environ.get(name, METAFLOW_CONFIG.get(name, default)) - +## NOTE: Just like Click's auto_envar_prefix `METAFLOW` (see in cli.py), all environment +## variables here are also named METAFLOW_XXX. 
So, for example, in the statement: +## `DEFAULT_DATASTORE = from_conf("DEFAULT_DATASTORE", "local")`, to override the default +## value, either set `METAFLOW_DEFAULT_DATASTORE` in your configuration file or set +## an environment variable called `METAFLOW_DEFAULT_DATASTORE` ### # Default configuration ### -DEFAULT_DATASTORE = from_conf("METAFLOW_DEFAULT_DATASTORE", "local") -DEFAULT_ENVIRONMENT = from_conf("METAFLOW_DEFAULT_ENVIRONMENT", "local") -DEFAULT_EVENT_LOGGER = from_conf("METAFLOW_DEFAULT_EVENT_LOGGER", "nullSidecarLogger") -DEFAULT_METADATA = from_conf("METAFLOW_DEFAULT_METADATA", "local") -DEFAULT_MONITOR = from_conf("METAFLOW_DEFAULT_MONITOR", "nullSidecarMonitor") -DEFAULT_PACKAGE_SUFFIXES = from_conf("METAFLOW_DEFAULT_PACKAGE_SUFFIXES", ".py,.R,.RDS") -DEFAULT_AWS_CLIENT_PROVIDER = from_conf("METAFLOW_DEFAULT_AWS_CLIENT_PROVIDER", "boto3") + +DEFAULT_DATASTORE = from_conf("DEFAULT_DATASTORE", "local") +DEFAULT_ENVIRONMENT = from_conf("DEFAULT_ENVIRONMENT", "local") +DEFAULT_EVENT_LOGGER = from_conf("DEFAULT_EVENT_LOGGER", "nullSidecarLogger") +DEFAULT_METADATA = from_conf("DEFAULT_METADATA", "local") +DEFAULT_MONITOR = from_conf("DEFAULT_MONITOR", "nullSidecarMonitor") +DEFAULT_PACKAGE_SUFFIXES = from_conf("DEFAULT_PACKAGE_SUFFIXES", ".py,.R,.RDS") +DEFAULT_AWS_CLIENT_PROVIDER = from_conf("DEFAULT_AWS_CLIENT_PROVIDER", "boto3") ### @@ -57,159 +36,250 @@ def from_conf(name, default=None): ### # Path to the local directory to store artifacts for 'local' datastore. DATASTORE_LOCAL_DIR = ".metaflow" -DATASTORE_SYSROOT_LOCAL = from_conf("METAFLOW_DATASTORE_SYSROOT_LOCAL") +DATASTORE_SYSROOT_LOCAL = from_conf("DATASTORE_SYSROOT_LOCAL") # S3 bucket and prefix to store artifacts for 's3' datastore. -DATASTORE_SYSROOT_S3 = from_conf("METAFLOW_DATASTORE_SYSROOT_S3") +DATASTORE_SYSROOT_S3 = from_conf("DATASTORE_SYSROOT_S3") +# Azure Blob Storage container and blob prefix +DATASTORE_SYSROOT_AZURE = from_conf("DATASTORE_SYSROOT_AZURE") +DATASTORE_SYSROOT_GS = from_conf("DATASTORE_SYSROOT_GS") +# GS bucket and prefix to store artifacts for 'gs' datastore + + +### +# Datastore local cache +### + +# Path to the client cache +CLIENT_CACHE_PATH = from_conf("CLIENT_CACHE_PATH", "/tmp/metaflow_client") +# Maximum size (in bytes) of the cache +CLIENT_CACHE_MAX_SIZE = from_conf("CLIENT_CACHE_MAX_SIZE", 10000) +# Maximum number of cached Flow and TaskDatastores in the cache +CLIENT_CACHE_MAX_FLOWDATASTORE_COUNT = from_conf( + "CLIENT_CACHE_MAX_FLOWDATASTORE_COUNT", 50 +) +CLIENT_CACHE_MAX_TASKDATASTORE_COUNT = from_conf( + "CLIENT_CACHE_MAX_TASKDATASTORE_COUNT", CLIENT_CACHE_MAX_FLOWDATASTORE_COUNT * 100 +) + + +### +# Datatools (S3) configuration +### +S3_ENDPOINT_URL = from_conf("S3_ENDPOINT_URL") +S3_VERIFY_CERTIFICATE = from_conf("S3_VERIFY_CERTIFICATE") + +# S3 retry configuration +# This is useful if you want to "fail fast" on S3 operations; use with caution +# though as this may increase failures. Note that this is the number of *retries* +# so setting it to 0 means each operation will be tried once. +S3_RETRY_COUNT = from_conf("S3_RETRY_COUNT", 7) + +# Number of retries on *transient* failures (such as SlowDown errors). 
Note +# that if after S3_TRANSIENT_RETRY_COUNT times, all operations haven't been done, +# it will try up to S3_RETRY_COUNT again so the total number of tries can be up to +# (S3_RETRY_COUNT + 1) * (S3_TRANSIENT_RETRY_COUNT + 1) +# You typically want this number fairly high as transient retires are "cheap" (only +# operations that have not succeeded retry as opposed to all operations for the +# top-level retries) +S3_TRANSIENT_RETRY_COUNT = from_conf("S3_TRANSIENT_RETRY_COUNT", 20) + +# Threshold to start printing warnings for an AWS retry +RETRY_WARNING_THRESHOLD = 3 + # S3 datatools root location -DATATOOLS_SUFFIX = from_conf("METAFLOW_DATATOOLS_SUFFIX", "data") +DATATOOLS_SUFFIX = from_conf("DATATOOLS_SUFFIX", "data") DATATOOLS_S3ROOT = from_conf( - "METAFLOW_DATATOOLS_S3ROOT", - os.path.join(from_conf("METAFLOW_DATASTORE_SYSROOT_S3"), DATATOOLS_SUFFIX) - if from_conf("METAFLOW_DATASTORE_SYSROOT_S3") + "DATATOOLS_S3ROOT", + os.path.join(DATASTORE_SYSROOT_S3, DATATOOLS_SUFFIX) + if DATASTORE_SYSROOT_S3 + else None, +) + +DATATOOLS_CLIENT_PARAMS = from_conf("DATATOOLS_CLIENT_PARAMS", {}) +if S3_ENDPOINT_URL: + DATATOOLS_CLIENT_PARAMS["endpoint_url"] = S3_ENDPOINT_URL +if S3_VERIFY_CERTIFICATE: + DATATOOLS_CLIENT_PARAMS["verify"] = S3_VERIFY_CERTIFICATE + +DATATOOLS_SESSION_VARS = from_conf("DATATOOLS_SESSION_VARS", {}) + +# Azure datatools root location +# Note: we do not expose an actual datatools library for Azure (like we do for S3) +# Similar to DATATOOLS_LOCALROOT, this is used ONLY by the IncludeFile's internal implementation. +DATATOOLS_AZUREROOT = from_conf( + "DATATOOLS_AZUREROOT", + os.path.join(DATASTORE_SYSROOT_AZURE, DATATOOLS_SUFFIX) + if DATASTORE_SYSROOT_AZURE + else None, +) +# GS datatools root location +# Note: we do not expose an actual datatools library for GS (like we do for S3) +# Similar to DATATOOLS_LOCALROOT, this is used ONLY by the IncludeFile's internal implementation. 
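# For example, with METAFLOW_DATASTORE_SYSROOT_GS set to "gs://my-bucket/metaflow"
# and no explicit override, DATATOOLS_GSROOT below resolves to
# "gs://my-bucket/metaflow/data" (DATATOOLS_SUFFIX defaults to "data").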
+DATATOOLS_GSROOT = from_conf( + "DATATOOLS_GSROOT", + os.path.join(DATASTORE_SYSROOT_GS, DATATOOLS_SUFFIX) + if DATASTORE_SYSROOT_GS else None, ) # Local datatools root location DATATOOLS_LOCALROOT = from_conf( - "METAFLOW_DATATOOLS_LOCALROOT", - os.path.join(from_conf("METAFLOW_DATASTORE_SYSROOT_LOCAL"), DATATOOLS_SUFFIX) - if from_conf("METAFLOW_DATASTORE_SYSROOT_LOCAL") + "DATATOOLS_LOCALROOT", + os.path.join(DATASTORE_SYSROOT_LOCAL, DATATOOLS_SUFFIX) + if DATASTORE_SYSROOT_LOCAL else None, ) +# The root directory to save artifact pulls in, when using S3 or Azure +ARTIFACT_LOCALROOT = from_conf("ARTIFACT_LOCALROOT", os.getcwd()) + # Cards related config variables -DATASTORE_CARD_SUFFIX = "mf.cards" -DATASTORE_CARD_LOCALROOT = from_conf("METAFLOW_CARD_LOCALROOT") -DATASTORE_CARD_S3ROOT = from_conf( - "METAFLOW_CARD_S3ROOT", - os.path.join(from_conf("METAFLOW_DATASTORE_SYSROOT_S3"), DATASTORE_CARD_SUFFIX) - if from_conf("METAFLOW_DATASTORE_SYSROOT_S3") +CARD_SUFFIX = "mf.cards" +CARD_LOCALROOT = from_conf("CARD_LOCALROOT") +CARD_S3ROOT = from_conf( + "CARD_S3ROOT", + os.path.join(DATASTORE_SYSROOT_S3, CARD_SUFFIX) if DATASTORE_SYSROOT_S3 else None, +) +CARD_AZUREROOT = from_conf( + "CARD_AZUREROOT", + os.path.join(DATASTORE_SYSROOT_AZURE, CARD_SUFFIX) + if DATASTORE_SYSROOT_AZURE else None, ) -CARD_NO_WARNING = from_conf("METAFLOW_CARD_NO_WARNING", False) +CARD_GSROOT = from_conf( + "CARD_GSROOT", + os.path.join(DATASTORE_SYSROOT_GS, CARD_SUFFIX) if DATASTORE_SYSROOT_GS else None, +) +CARD_NO_WARNING = from_conf("CARD_NO_WARNING", False) -# S3 endpoint url -S3_ENDPOINT_URL = from_conf("METAFLOW_S3_ENDPOINT_URL", None) -S3_VERIFY_CERTIFICATE = from_conf("METAFLOW_S3_VERIFY_CERTIFICATE", None) +SKIP_CARD_DUALWRITE = from_conf("SKIP_CARD_DUALWRITE", False) -# S3 retry configuration -# This is useful if you want to "fail fast" on S3 operations; use with caution -# though as this may increase failures. Note that this is the number of *retries* -# so setting it to 0 means each operation will be tried once. -S3_RETRY_COUNT = int(from_conf("METAFLOW_S3_RETRY_COUNT", 7)) +# Azure storage account URL +AZURE_STORAGE_BLOB_SERVICE_ENDPOINT = from_conf("AZURE_STORAGE_BLOB_SERVICE_ENDPOINT") -### -# Datastore local cache -### -# Path to the client cache -CLIENT_CACHE_PATH = from_conf("METAFLOW_CLIENT_CACHE_PATH", "/tmp/metaflow_client") -# Maximum size (in bytes) of the cache -CLIENT_CACHE_MAX_SIZE = int(from_conf("METAFLOW_CLIENT_CACHE_MAX_SIZE", 10000)) -# Maximum number of cached Flow and TaskDatastores in the cache -CLIENT_CACHE_MAX_FLOWDATASTORE_COUNT = int( - from_conf("METAFLOW_CLIENT_CACHE_MAX_FLOWDATASTORE_COUNT", 50) -) -CLIENT_CACHE_MAX_TASKDATASTORE_COUNT = int( - from_conf( - "METAFLOW_CLIENT_CACHE_MAX_TASKDATASTORE_COUNT", - CLIENT_CACHE_MAX_FLOWDATASTORE_COUNT * 100, - ) +# Azure storage can use process-based parallelism instead of threads. +# Processes perform better for high throughput workloads (e.g. many huge artifacts) +AZURE_STORAGE_WORKLOAD_TYPE = from_conf( + "AZURE_STORAGE_WORKLOAD_TYPE", + default="general", + validate_fn=get_validate_choice_fn(["general", "high_throughput"]), ) +# GS storage can use process-based parallelism instead of threads. +# Processes perform better for high throughput workloads (e.g. 
many huge artifacts) +GS_STORAGE_WORKLOAD_TYPE = from_conf( + "GS_STORAGE_WORKLOAD_TYPE", + "general", + validate_fn=get_validate_choice_fn(["general", "high_throughput"]), +) ### # Metadata configuration ### -METADATA_SERVICE_URL = from_conf("METAFLOW_SERVICE_URL") -METADATA_SERVICE_NUM_RETRIES = int(from_conf("METAFLOW_SERVICE_RETRY_COUNT", 5)) -METADATA_SERVICE_AUTH_KEY = from_conf("METAFLOW_SERVICE_AUTH_KEY") -METADATA_SERVICE_HEADERS = json.loads(from_conf("METAFLOW_SERVICE_HEADERS", "{}")) -if METADATA_SERVICE_AUTH_KEY is not None: - METADATA_SERVICE_HEADERS["x-api-key"] = METADATA_SERVICE_AUTH_KEY +SERVICE_URL = from_conf("SERVICE_URL") +SERVICE_RETRY_COUNT = from_conf("SERVICE_RETRY_COUNT", 5) +SERVICE_AUTH_KEY = from_conf("SERVICE_AUTH_KEY") +SERVICE_HEADERS = from_conf("SERVICE_HEADERS", {}) +if SERVICE_AUTH_KEY is not None: + SERVICE_HEADERS["x-api-key"] = SERVICE_AUTH_KEY +# Checks version compatibility with Metadata service +SERVICE_VERSION_CHECK = from_conf("SERVICE_VERSION_CHECK", True) # Default container image -DEFAULT_CONTAINER_IMAGE = from_conf("METAFLOW_DEFAULT_CONTAINER_IMAGE") +DEFAULT_CONTAINER_IMAGE = from_conf("DEFAULT_CONTAINER_IMAGE") # Default container registry -DEFAULT_CONTAINER_REGISTRY = from_conf("METAFLOW_DEFAULT_CONTAINER_REGISTRY") +DEFAULT_CONTAINER_REGISTRY = from_conf("DEFAULT_CONTAINER_REGISTRY") ### # AWS Batch configuration ### # IAM role for AWS Batch container with Amazon S3 access # (and AWS DynamoDb access for AWS StepFunctions, if enabled) -ECS_S3_ACCESS_IAM_ROLE = from_conf("METAFLOW_ECS_S3_ACCESS_IAM_ROLE") +ECS_S3_ACCESS_IAM_ROLE = from_conf("ECS_S3_ACCESS_IAM_ROLE") # IAM role for AWS Batch container for AWS Fargate -ECS_FARGATE_EXECUTION_ROLE = from_conf("METAFLOW_ECS_FARGATE_EXECUTION_ROLE") +ECS_FARGATE_EXECUTION_ROLE = from_conf("ECS_FARGATE_EXECUTION_ROLE") # Job queue for AWS Batch -BATCH_JOB_QUEUE = from_conf("METAFLOW_BATCH_JOB_QUEUE") +BATCH_JOB_QUEUE = from_conf("BATCH_JOB_QUEUE") # Default container image for AWS Batch -BATCH_CONTAINER_IMAGE = ( - from_conf("METAFLOW_BATCH_CONTAINER_IMAGE") or DEFAULT_CONTAINER_IMAGE -) +BATCH_CONTAINER_IMAGE = from_conf("BATCH_CONTAINER_IMAGE", DEFAULT_CONTAINER_IMAGE) # Default container registry for AWS Batch -BATCH_CONTAINER_REGISTRY = ( - from_conf("METAFLOW_BATCH_CONTAINER_REGISTRY") or DEFAULT_CONTAINER_REGISTRY +BATCH_CONTAINER_REGISTRY = from_conf( + "BATCH_CONTAINER_REGISTRY", DEFAULT_CONTAINER_REGISTRY ) # Metadata service URL for AWS Batch -BATCH_METADATA_SERVICE_URL = from_conf( - "METAFLOW_SERVICE_INTERNAL_URL", METADATA_SERVICE_URL -) -BATCH_METADATA_SERVICE_HEADERS = METADATA_SERVICE_HEADERS +SERVICE_INTERNAL_URL = from_conf("SERVICE_INTERNAL_URL", SERVICE_URL) # Assign resource tags to AWS Batch jobs. Set to False by default since # it requires `Batch:TagResource` permissions which may not be available # in all Metaflow deployments. Hopefully, some day we can flip the # default to True. 
-BATCH_EMIT_TAGS = from_conf("METAFLOW_BATCH_EMIT_TAGS", False) +BATCH_EMIT_TAGS = from_conf("BATCH_EMIT_TAGS", False) ### # AWS Step Functions configuration ### # IAM role for AWS Step Functions with AWS Batch and AWS DynamoDb access # https://docs.aws.amazon.com/step-functions/latest/dg/batch-iam.html -SFN_IAM_ROLE = from_conf("METAFLOW_SFN_IAM_ROLE") +SFN_IAM_ROLE = from_conf("SFN_IAM_ROLE") # AWS DynamoDb Table name (with partition key - `pathspec` of type string) -SFN_DYNAMO_DB_TABLE = from_conf("METAFLOW_SFN_DYNAMO_DB_TABLE") +SFN_DYNAMO_DB_TABLE = from_conf("SFN_DYNAMO_DB_TABLE") # IAM role for AWS Events with AWS Step Functions access # https://docs.aws.amazon.com/eventbridge/latest/userguide/auth-and-access-control-eventbridge.html -EVENTS_SFN_ACCESS_IAM_ROLE = from_conf("METAFLOW_EVENTS_SFN_ACCESS_IAM_ROLE") +EVENTS_SFN_ACCESS_IAM_ROLE = from_conf("EVENTS_SFN_ACCESS_IAM_ROLE") # Prefix for AWS Step Functions state machines. Set to stack name for Metaflow # sandbox. -SFN_STATE_MACHINE_PREFIX = from_conf("METAFLOW_SFN_STATE_MACHINE_PREFIX") +SFN_STATE_MACHINE_PREFIX = from_conf("SFN_STATE_MACHINE_PREFIX") # Optional AWS CloudWatch Log Group ARN for emitting AWS Step Functions state # machine execution logs. This needs to be available when using the # `step-functions create --log-execution-history` command. -SFN_EXECUTION_LOG_GROUP_ARN = from_conf("METAFLOW_SFN_EXECUTION_LOG_GROUP_ARN") +SFN_EXECUTION_LOG_GROUP_ARN = from_conf("SFN_EXECUTION_LOG_GROUP_ARN") ### # Kubernetes configuration ### # Kubernetes namespace to use for all objects created by Metaflow -KUBERNETES_NAMESPACE = from_conf("METAFLOW_KUBERNETES_NAMESPACE", "default") -# Service account to use by K8S jobs created by Metaflow -KUBERNETES_SERVICE_ACCOUNT = from_conf("METAFLOW_KUBERNETES_SERVICE_ACCOUNT") +KUBERNETES_NAMESPACE = from_conf("KUBERNETES_NAMESPACE", "default") +# Default service account to use by K8S jobs created by Metaflow +KUBERNETES_SERVICE_ACCOUNT = from_conf("KUBERNETES_SERVICE_ACCOUNT") +# Default node selectors to use by K8S jobs created by Metaflow - foo=bar,baz=bab +KUBERNETES_NODE_SELECTOR = from_conf("KUBERNETES_NODE_SELECTOR", "") +KUBERNETES_TOLERATIONS = from_conf("KUBERNETES_TOLERATIONS", "") +KUBERNETES_SECRETS = from_conf("KUBERNETES_SECRETS", "") +# Default GPU vendor to use by K8S jobs created by Metaflow (supports nvidia, amd) +KUBERNETES_GPU_VENDOR = from_conf("KUBERNETES_GPU_VENDOR", "nvidia") # Default container image for K8S -KUBERNETES_CONTAINER_IMAGE = ( - from_conf("METAFLOW_KUBERNETES_CONTAINER_IMAGE") or DEFAULT_CONTAINER_IMAGE +KUBERNETES_CONTAINER_IMAGE = from_conf( + "KUBERNETES_CONTAINER_IMAGE", DEFAULT_CONTAINER_IMAGE ) # Default container registry for K8S -KUBERNETES_CONTAINER_REGISTRY = ( - from_conf("METAFLOW_KUBERNETES_CONTAINER_REGISTRY") or DEFAULT_CONTAINER_REGISTRY +KUBERNETES_CONTAINER_REGISTRY = from_conf( + "KUBERNETES_CONTAINER_REGISTRY", DEFAULT_CONTAINER_REGISTRY ) -# + +## +# Airflow Configuration +## +# This configuration sets `startup_timeout_seconds` in airflow's KubernetesPodOperator. +AIRFLOW_KUBERNETES_STARTUP_TIMEOUT_SECONDS = from_conf( + "AIRFLOW_KUBERNETES_STARTUP_TIMEOUT_SECONDS", 60 * 60 +) +# This configuration sets `kubernetes_conn_id` in airflow's KubernetesPodOperator. 
+AIRFLOW_KUBERNETES_CONN_ID = from_conf("AIRFLOW_KUBERNETES_CONN_ID") + ### # Conda configuration ### # Conda package root location on S3 -CONDA_PACKAGE_S3ROOT = from_conf( - "METAFLOW_CONDA_PACKAGE_S3ROOT", - "%s/conda" % from_conf("METAFLOW_DATASTORE_SYSROOT_S3"), -) +CONDA_PACKAGE_S3ROOT = from_conf("CONDA_PACKAGE_S3ROOT") +# Conda package root location on Azure +CONDA_PACKAGE_AZUREROOT = from_conf("CONDA_PACKAGE_AZUREROOT") +# Conda package root location on GS +CONDA_PACKAGE_GSROOT = from_conf("CONDA_PACKAGE_GSROOT") # Use an alternate dependency resolver for conda packages instead of conda # Mamba promises faster package dependency resolution times, which # should result in an appreciable speedup in flow environment initialization. -CONDA_DEPENDENCY_RESOLVER = from_conf("METAFLOW_CONDA_DEPENDENCY_RESOLVER", "conda") +CONDA_DEPENDENCY_RESOLVER = from_conf("CONDA_DEPENDENCY_RESOLVER", "conda") ### # Debug configuration @@ -217,34 +287,31 @@ def from_conf(name, default=None): DEBUG_OPTIONS = ["subcommand", "sidecar", "s3client"] for typ in DEBUG_OPTIONS: - vars()["METAFLOW_DEBUG_%s" % typ.upper()] = from_conf( - "METAFLOW_DEBUG_%s" % typ.upper() - ) + vars()["DEBUG_%s" % typ.upper()] = from_conf("DEBUG_%s" % typ.upper()) ### # AWS Sandbox configuration ### # Boolean flag for metaflow AWS sandbox access -AWS_SANDBOX_ENABLED = bool(from_conf("METAFLOW_AWS_SANDBOX_ENABLED", False)) +AWS_SANDBOX_ENABLED = from_conf("AWS_SANDBOX_ENABLED", False) # Metaflow AWS sandbox auth endpoint -AWS_SANDBOX_STS_ENDPOINT_URL = from_conf("METAFLOW_SERVICE_URL") +AWS_SANDBOX_STS_ENDPOINT_URL = SERVICE_URL # Metaflow AWS sandbox API auth key -AWS_SANDBOX_API_KEY = from_conf("METAFLOW_AWS_SANDBOX_API_KEY") +AWS_SANDBOX_API_KEY = from_conf("AWS_SANDBOX_API_KEY") # Internal Metadata URL -AWS_SANDBOX_INTERNAL_SERVICE_URL = from_conf( - "METAFLOW_AWS_SANDBOX_INTERNAL_SERVICE_URL" -) +AWS_SANDBOX_INTERNAL_SERVICE_URL = from_conf("AWS_SANDBOX_INTERNAL_SERVICE_URL") # AWS region -AWS_SANDBOX_REGION = from_conf("METAFLOW_AWS_SANDBOX_REGION") +AWS_SANDBOX_REGION = from_conf("AWS_SANDBOX_REGION") # Finalize configuration if AWS_SANDBOX_ENABLED: os.environ["AWS_DEFAULT_REGION"] = AWS_SANDBOX_REGION - BATCH_METADATA_SERVICE_URL = AWS_SANDBOX_INTERNAL_SERVICE_URL - METADATA_SERVICE_HEADERS["x-api-key"] = AWS_SANDBOX_API_KEY - SFN_STATE_MACHINE_PREFIX = from_conf("METAFLOW_AWS_SANDBOX_STACK_NAME") + SERVICE_INTERNAL_URL = AWS_SANDBOX_INTERNAL_SERVICE_URL + SERVICE_HEADERS["x-api-key"] = AWS_SANDBOX_API_KEY + SFN_STATE_MACHINE_PREFIX = from_conf("AWS_SANDBOX_STACK_NAME") +KUBERNETES_SANDBOX_INIT_SCRIPT = from_conf("KUBERNETES_SANDBOX_INIT_SCRIPT") # MAX_ATTEMPTS is the maximum number of attempts, including the first # task, retries, and the final fallback task and its retries. 
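All of the configuration values above flow through `from_conf`, so each one can be overridden either with a `METAFLOW_`-prefixed environment variable or with the matching `METAFLOW_*` key in the Metaflow config JSON. The snippet below is a simplified sketch of that lookup order; it ignores profiles, type coercion, and validation, which `metaflow_config_funcs.from_conf` handles:

import json
import os


def resolve(name, default=None, config_path="~/.metaflowconfig/config.json"):
    # Environment variable wins, then the config file, then the hard-coded default.
    env_name = "METAFLOW_%s" % name
    try:
        with open(os.path.expanduser(config_path), encoding="utf-8") as f:
            config = json.load(f)
    except FileNotFoundError:
        config = {}
    return os.environ.get(env_name, config.get(env_name, default))


# e.g. resolve("DEFAULT_DATASTORE", "local") returns "s3" when
# METAFLOW_DEFAULT_DATASTORE=s3 is exported or present in config.json.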
@@ -279,15 +346,27 @@ def get_version(pkg): # PINNED_CONDA_LIBS are the libraries that metaflow depends on for execution # and are needed within a conda environment -def get_pinned_conda_libs(python_version): - return { +def get_pinned_conda_libs(python_version, datastore_type): + pins = { "requests": ">=2.21.0", - "boto3": ">=1.14.0", } + if datastore_type == "s3": + pins["boto3"] = ">=1.14.0" + elif datastore_type == "azure": + pins["azure-identity"] = ">=1.10.0" + pins["azure-storage-blob"] = ">=12.12.0" + elif datastore_type == "gs": + pins["google-cloud-storage"] = ">=2.5.0" + pins["google-auth"] = ">=2.11.0" + elif datastore_type == "local": + pass + else: + raise MetaflowException( + msg="conda lib pins for datastore %s are undefined" % (datastore_type,) + ) + return pins -METAFLOW_EXTENSIONS_ADDL_SUFFIXES = set([]) - # Check if there are extensions to Metaflow to load and override everything try: from metaflow.extension_support import get_modules @@ -300,19 +379,40 @@ def get_pinned_conda_libs(python_version): if n == "DEBUG_OPTIONS": DEBUG_OPTIONS.extend(o) for typ in o: - vars()["METAFLOW_DEBUG_%s" % typ.upper()] = from_conf( - "METAFLOW_DEBUG_%s" % typ.upper() + vars()["DEBUG_%s" % typ.upper()] = from_conf( + "DEBUG_%s" % typ.upper() ) - elif n == "METAFLOW_EXTENSIONS_ADDL_SUFFIXES": - METAFLOW_EXTENSIONS_ADDL_SUFFIXES.update(o) + elif n == "get_pinned_conda_libs": + + def _new_get_pinned_conda_libs( + python_version, datastore_type, f1=globals()[n], f2=o + ): + d1 = f1(python_version, datastore_type) + d2 = f2(python_version, datastore_type) + for k, v in d2.items(): + d1[k] = v if k not in d1 else ",".join([d1[k], v]) + return d1 + + globals()[n] = _new_get_pinned_conda_libs elif not n.startswith("__") and not isinstance(o, types.ModuleType): globals()[n] = o - METAFLOW_EXTENSIONS_ADDL_SUFFIXES = list(METAFLOW_EXTENSIONS_ADDL_SUFFIXES) - if len(METAFLOW_EXTENSIONS_ADDL_SUFFIXES) == 0: - METAFLOW_EXTENSIONS_ADDL_SUFFIXES = None finally: # Erase all temporary names to avoid leaking things - for _n in ["m", "n", "o", "ext_modules", "get_modules"]: + for _n in [ + "m", + "n", + "o", + "type", + "ext_modules", + "get_modules", + "_new_get_pinned_conda_libs", + "d1", + "d2", + "k", + "v", + "f1", + "f2", + ]: try: del globals()[_n] except KeyError: diff --git a/metaflow/metaflow_config_funcs.py b/metaflow/metaflow_config_funcs.py new file mode 100644 index 00000000000..365a18d75eb --- /dev/null +++ b/metaflow/metaflow_config_funcs.py @@ -0,0 +1,120 @@ +import json +import os + +from collections import namedtuple + +from metaflow.exception import MetaflowException +from metaflow.util import is_stringish + +ConfigValue = namedtuple("ConfigValue", "value serializer is_default") + +NON_CHANGED_VALUES = 1 +NULL_VALUES = 2 +ALL_VALUES = 3 + + +def init_config(): + # Read configuration from $METAFLOW_HOME/config_.json. + home = os.environ.get("METAFLOW_HOME", "~/.metaflowconfig") + profile = os.environ.get("METAFLOW_PROFILE") + path_to_config = os.path.join(home, "config.json") + if profile: + path_to_config = os.path.join(home, "config_%s.json" % profile) + path_to_config = os.path.expanduser(path_to_config) + config = {} + if os.path.exists(path_to_config): + with open(path_to_config, encoding="utf-8") as f: + return json.load(f) + elif profile: + raise MetaflowException( + "Unable to locate METAFLOW_PROFILE '%s' in '%s')" % (profile, home) + ) + return config + + +# Initialize defaults required to setup environment variables. 
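# For example, with METAFLOW_PROFILE=dev, init_config() reads
# ~/.metaflowconfig/config_dev.json, which might contain (illustrative values;
# any METAFLOW_* key understood by from_conf can appear here):
#   {
#       "METAFLOW_DEFAULT_DATASTORE": "s3",
#       "METAFLOW_DATASTORE_SYSROOT_S3": "s3://my-bucket/metaflow"
#   }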
+METAFLOW_CONFIG = init_config() + +_all_configs = {} + + +def config_values(include=0): + # By default, we just return non-null values and that + # are not default. This is the common use case because in all other cases, the code + # is sufficient to recreate the value (ie: there is no external source for the value) + for name, config_value in _all_configs.items(): + if (config_value.value is not None or include & NULL_VALUES) and ( + not config_value.is_default or include & NON_CHANGED_VALUES + ): + yield name, config_value.serializer(config_value.value) + + +def from_conf(name, default=None, validate_fn=None): + """ + First try to pull value from environment, then from metaflow config JSON + + Prior to a value being returned, we will validate using validate_fn (if provided). + Only non-None values are validated. + + validate_fn should accept (name, value). + If the value validates, return None, else raise an MetaflowException. + """ + env_name = "METAFLOW_%s" % name + is_default = True + value = os.environ.get(env_name, METAFLOW_CONFIG.get(env_name, default)) + if validate_fn and value is not None: + validate_fn(env_name, value) + if default is not None: + # In this case, value is definitely not None because default is the ultimate + # fallback and all other cases will return a string (even if an empty string) + if isinstance(default, (list, dict)): + # If we used the default, value is already a list or dict, else it is a + # string so we can just compare types to determine is_default + if isinstance(value, (list, dict)): + is_default = True + else: + try: + value = json.loads(value) + except json.JSONDecodeError: + raise ValueError( + "Expected a valid JSON for %s, got: %s" % (env_name, value) + ) + _all_configs[env_name] = ConfigValue( + value=value, + serializer=json.dumps, + is_default=is_default, + ) + return value + elif isinstance(default, (bool, int, float)) or is_stringish(default): + try: + value = type(default)(value) + # Here we can compare values + is_default = value == default + except ValueError: + raise ValueError( + "Expected a %s for %s, got: %s" % (type(default), env_name, value) + ) + else: + raise RuntimeError( + "Default of type %s for %s is not supported" % (type(default), env_name) + ) + _all_configs[env_name] = ConfigValue( + value=value, + serializer=str, + is_default=is_default, + ) + return value + + +def get_validate_choice_fn(choices): + """Returns a validate_fn for use with from_conf(). + The validate_fn will check a value against a list of allowed choices. + """ + + def _validate_choice(name, value): + if value not in choices: + raise MetaflowException( + "%s must be set to one of %s. Got '%s'." % (name, choices, value) + ) + + return _validate_choice diff --git a/metaflow/metaflow_environment.py b/metaflow/metaflow_environment.py index 476a02bf132..839e29a8640 100644 --- a/metaflow/metaflow_environment.py +++ b/metaflow/metaflow_environment.py @@ -28,7 +28,7 @@ def init_environment(self, echo): """ pass - def validate_environment(self, echo): + def validate_environment(self, echo, datastore_type): """ Run before any command to validate that we are operating in a desired environment. @@ -42,7 +42,7 @@ def decospecs(self): """ return () - def bootstrap_commands(self, step_name): + def bootstrap_commands(self, step_name, datastore_type): """ A list of shell commands to bootstrap this environment in a remote runtime. 
""" @@ -79,20 +79,80 @@ def get_client_info(cls, flow_name, metadata): """ return "Local environment" - def get_package_commands(self, code_package_url): + def _get_download_code_package_cmd(self, code_package_url, datastore_type): + """Return a command that downloads the code package from the datastore. We use various + cloud storage CLI tools because we don't have access to Metaflow codebase (which we + are about to download in the command). + + The command should download the package to "job.tar" in the current directory. + + It should work silently if everything goes well. + """ + if datastore_type == "s3": + return ( + '%s -m awscli ${METAFLOW_S3_ENDPOINT_URL:+--endpoint-url=\\"${METAFLOW_S3_ENDPOINT_URL}\\"} ' + + "s3 cp %s job.tar >/dev/null" + ) % (self._python(), code_package_url) + elif datastore_type == "azure": + from .plugins.azure.azure_utils import parse_azure_full_path + + container_name, blob = parse_azure_full_path(code_package_url) + # remove a trailing slash, if present + blob_endpoint = "${METAFLOW_AZURE_STORAGE_BLOB_SERVICE_ENDPOINT%/}" + return "download-azure-blob --blob-endpoint={blob_endpoint} --container={container} --blob={blob} --output-file=job.tar".format( + blob_endpoint=blob_endpoint, + blob=blob, + container=container_name, + ) + elif datastore_type == "gs": + from .plugins.gcp.gs_utils import parse_gs_full_path + + bucket_name, gs_object = parse_gs_full_path(code_package_url) + return ( + "download-gcp-object --bucket=%s --object=%s --output-file=job.tar" + % (bucket_name, gs_object) + ) + else: + raise NotImplementedError( + "We don't know how to generate a download code package cmd for datastore %s" + % datastore_type + ) + + def _get_install_dependencies_cmd(self, datastore_type): + cmds = ["%s -m pip install requests -qqq" % self._python()] + if datastore_type == "s3": + cmds.append("%s -m pip install awscli boto3 -qqq" % self._python()) + elif datastore_type == "azure": + cmds.append( + "%s -m pip install azure-identity azure-storage-blob simple-azure-blob-downloader -qqq" + % self._python() + ) + elif datastore_type == "gs": + cmds.append( + "%s -m pip install google-cloud-storage google-auth simple-gcp-object-downloader -qqq" + % self._python() + ) + else: + raise NotImplementedError( + "We don't know how to generate an install dependencies cmd for datastore %s" + % datastore_type + ) + return " && ".join(cmds) + + def get_package_commands(self, code_package_url, datastore_type): cmds = [ BASH_MFLOG, "mflog 'Setting up task environment.'", - "%s -m pip install awscli requests boto3 -qqq" % self._python(), + self._get_install_dependencies_cmd(datastore_type), "mkdir metaflow", "cd metaflow", "mkdir .metaflow", # mute local datastore creation log "i=0; while [ $i -le 5 ]; do " "mflog 'Downloading code package...'; " - "%s -m awscli s3 cp %s job.tar >/dev/null && \ - mflog 'Code package downloaded.' && break; " + + self._get_download_code_package_cmd(code_package_url, datastore_type) + + " && mflog 'Code package downloaded.' && break; " "sleep 10; i=$((i+1)); " - "done" % (self._python(), code_package_url), + "done", "if [ $i -gt 5 ]; then " "mflog 'Failed to download code package from %s " "after 6 tries. Exiting...' 
&& exit 1; " diff --git a/metaflow/metaflow_version.py b/metaflow/metaflow_version.py index 108b666779f..9a36dc79ae0 100644 --- a/metaflow/metaflow_version.py +++ b/metaflow/metaflow_version.py @@ -20,7 +20,7 @@ if name == "nt": def find_git_on_windows(): - """find the path to the git executable on windows""" + """find the path to the git executable on Windows""" # first see if git is in the path try: check_output(["where", "/Q", "git"]) @@ -29,7 +29,7 @@ def find_git_on_windows(): # catch the exception thrown if git was not found except CalledProcessError: pass - # There are several locations git.exe may be hiding + # There are several locations where git.exe may be hiding possible_locations = [] # look in program files for msysgit if "PROGRAMFILES(X86)" in environ: @@ -38,7 +38,7 @@ def find_git_on_windows(): ) if "PROGRAMFILES" in environ: possible_locations.append("%s/Git/cmd/git.exe" % environ["PROGRAMFILES"]) - # look for the github version of git + # look for the GitHub version of git if "LOCALAPPDATA" in environ: github_dir = "%s/GitHub" % environ["LOCALAPPDATA"] if path.isdir(github_dir): diff --git a/metaflow/mflog/__init__.py b/metaflow/mflog/__init__.py index 691c1568ea8..c80b3a57c8f 100644 --- a/metaflow/mflog/__init__.py +++ b/metaflow/mflog/__init__.py @@ -4,6 +4,7 @@ from .mflog import refine, set_should_persist from metaflow.util import to_unicode +from metaflow.exception import MetaflowInternalError # Log source indicates the system that *minted the timestamp* # for the logline. This means that for a single task we can @@ -24,8 +25,8 @@ TASK_LOG_SOURCE = "task" # Loglines from all sources need to be merged together to -# produce a complete view of logs. Hence keep this list short -# since every items takes a DataStore access. +# produce a complete view of logs. Hence, keep this list short +# since each item takes a DataStore access. LOG_SOURCES = [RUNTIME_LOG_SOURCE, TASK_LOG_SOURCE] # BASH_MFLOG defines a bash function that outputs valid mflog @@ -43,18 +44,20 @@ BASH_SAVE_LOGS_ARGS = ["python", "-m", "metaflow.mflog.save_logs"] BASH_SAVE_LOGS = " ".join(BASH_SAVE_LOGS_ARGS) + # this function returns a bash expression that redirects stdout -# and stderr of the given command to mflog -def capture_output_to_mflog(command_and_args, var_transform=None): +# and stderr of the given bash expression to mflog.tee +def bash_capture_logs(bash_expr, var_transform=None): if var_transform is None: var_transform = lambda s: "$%s" % s - return "python -m metaflow.mflog.redirect_streams %s %s %s %s" % ( - TASK_LOG_SOURCE, - var_transform("MFLOG_STDOUT"), - var_transform("MFLOG_STDERR"), - command_and_args, + cmd = "python -m metaflow.mflog.tee %s %s" + parts = ( + bash_expr, + cmd % (TASK_LOG_SOURCE, var_transform("MFLOG_STDOUT")), + cmd % (TASK_LOG_SOURCE, var_transform("MFLOG_STDERR")), ) + return "(%s) 1>> >(%s) 2>> >(%s >&2)" % parts # update_delay determines how often logs should be uploaded to S3 @@ -76,8 +79,7 @@ def update_delay(secs_since_start): # this function is used to generate a Bash 'export' expression that -# sets environment variables that are used by 'redirect_streams' and -# 'save_logs'. +# sets environment variables that are used by 'tee' and 'save_logs'. # Note that we can't set the env vars statically, as some of them # may need to be evaluated during runtime def export_mflog_env_vars( @@ -142,3 +144,23 @@ def _available_logs(tail, stream, echo, should_persist=False): # tailed. 
_available_logs(stdout_tail, "stdout", echo) _available_logs(stderr_tail, "stderr", echo) + + +def get_log_tailer(log_url, datastore_type): + if datastore_type == "s3": + from metaflow.plugins.datatools.s3.s3tail import S3Tail + + return S3Tail(log_url) + elif datastore_type == "azure": + from metaflow.plugins.azure.azure_tail import AzureTail + + return AzureTail(log_url) + elif datastore_type == "gs": + from metaflow.plugins.gcp.gs_tail import GSTail + + return GSTail(log_url) + else: + raise MetaflowInternalError( + "Log tailing implementation missing for datastore type %s" + % (datastore_type,) + ) diff --git a/metaflow/mflog/mflog.py b/metaflow/mflog/mflog.py index fe29bdde4a2..fd04cb768ba 100644 --- a/metaflow/mflog/mflog.py +++ b/metaflow/mflog/mflog.py @@ -9,7 +9,7 @@ VERSION = b"0" -RE = b"(\[!)?" b"\[MFLOG\|" b"(0)\|" b"(.+?)Z\|" b"(.+?)\|" b"(.+?)\]" b"(.*)" +RE = rb"(\[!)?" rb"\[MFLOG\|" rb"(0)\|" rb"(.+?)Z\|" rb"(.+?)\|" rb"(.+?)\]" rb"(.*)" # the RE groups defined above must match the MFLogline fields below # except utc_timestamp, which is filled in by the parser based on utc_tstamp_str diff --git a/metaflow/mflog/redirect_streams.py b/metaflow/mflog/redirect_streams.py deleted file mode 100644 index 36aac03342c..00000000000 --- a/metaflow/mflog/redirect_streams.py +++ /dev/null @@ -1,54 +0,0 @@ -import os -import sys -import subprocess -import threading - -from .mflog import decorate - -# This script runs another process and captures stderr and stdout to a file, decorating -# lines with mflog metadata. -# -# Usage: redirect_streams SOURCE STDOUT_FILE STDERR_FILE PROGRAM ARG1 ARG2 ... - - -def reader_thread(SOURCE, dest_file, dest_stream, src): - with open(dest_file, mode="ab", buffering=0) as f: - if sys.version_info < (3, 0): - # Python 2 - for line in iter(sys.stdin.readline, ""): - # https://bugs.python.org/issue3907 - decorated = decorate(SOURCE, line) - f.write(decorated) - sys.stdout.write(line) - else: - # Python 3 - for line in src: - decorated = decorate(SOURCE, line) - f.write(decorated) - dest_stream.buffer.write(line) - - -if __name__ == "__main__": - SOURCE = sys.argv[1].encode("utf-8") - stdout_dest = sys.argv[2] - stderr_dest = sys.argv[3] - - p = subprocess.Popen( - sys.argv[4:], - env=os.environ, - stdout=subprocess.PIPE, - stderr=subprocess.PIPE, - ) - - stdout_reader = threading.Thread( - target=reader_thread, args=(SOURCE, stdout_dest, sys.stdout, p.stdout) - ) - stdout_reader.start() - stderr_reader = threading.Thread( - target=reader_thread, args=(SOURCE, stderr_dest, sys.stderr, p.stderr) - ) - stderr_reader.start() - rc = p.wait() - stdout_reader.join() - stderr_reader.join() - sys.exit(rc) diff --git a/metaflow/mflog/save_logs.py b/metaflow/mflog/save_logs.py index 4931766ae94..ea7ca673288 100644 --- a/metaflow/mflog/save_logs.py +++ b/metaflow/mflog/save_logs.py @@ -3,7 +3,8 @@ # This script is used to upload logs during task bootstrapping, so # it shouldn't have external dependencies besides Metaflow itself # (e.g. no click for parsing CLI args). -from metaflow.datastore import DATASTORES, FlowDataStore +from metaflow.datastore import FlowDataStore +from metaflow.plugins import DATASTORES from metaflow.util import Path from . 
import TASK_LOG_SOURCE @@ -23,7 +24,7 @@ def _read_file(path): paths = (os.environ["MFLOG_STDOUT"], os.environ["MFLOG_STDERR"]) flow_name, run_id, step_name, task_id = pathspec.split("/") - storage_impl = DATASTORES[ds_type] + storage_impl = [d for d in DATASTORES if d.TYPE == ds_type][0] if ds_root is None: def print_clean(line, **kwargs): diff --git a/metaflow/mflog/save_logs_periodically.py b/metaflow/mflog/save_logs_periodically.py index 32fc3d4c98e..d5618f95a30 100644 --- a/metaflow/mflog/save_logs_periodically.py +++ b/metaflow/mflog/save_logs_periodically.py @@ -4,8 +4,7 @@ import subprocess from threading import Thread -from metaflow.metaflow_profile import profile -from metaflow.sidecar import SidecarSubProcess +from metaflow.sidecar import MessageTypes from . import update_delay, BASH_SAVE_LOGS_ARGS @@ -16,10 +15,12 @@ def __init__(self): self._thread.start() def process_message(self, msg): - pass + if msg.msg_type == MessageTypes.SHUTDOWN: + self.is_alive = False - def shutdown(self): - self.is_alive = False + @classmethod + def get_worker(cls): + return cls def _update_loop(self): def _file_size(path): diff --git a/metaflow/monitor.py b/metaflow/monitor.py index ea33b5cd123..bd696333299 100644 --- a/metaflow/monitor.py +++ b/metaflow/monitor.py @@ -2,85 +2,67 @@ from contextlib import contextmanager -from .sidecar import SidecarSubProcess -from .sidecar_messages import Message, MessageTypes +from metaflow.sidecar import Message, MessageTypes, Sidecar COUNTER_TYPE = "COUNTER" GAUGE_TYPE = "GAUGE" -MEASURE_TYPE = "MEASURE" TIMER_TYPE = "TIMER" class NullMonitor(object): + TYPE = "nullSidecarMonitor" + def __init__(self, *args, **kwargs): - pass + # Currently passed flow and env as kwargs + self._sidecar = Sidecar(self.TYPE) def start(self): - pass - - @contextmanager - def count(self, name): - yield - - @contextmanager - def measure(self, name): - yield - - def gauge(self, gauge): - pass + return self._sidecar.start() def terminate(self): - pass + return self._sidecar.terminate() - -class Monitor(NullMonitor): - def __init__(self, monitor_type, env, flow_name): - # type: (str) -> None - self.sidecar_process = None - self.monitor_type = monitor_type - self.env_info = env.get_environment_info() - self.env_info["flow_name"] = flow_name - - def start(self): - if self.sidecar_process is None: - self.sidecar_process = SidecarSubProcess(self.monitor_type) + def send(self, msg): + # Arbitrary message sending. Useful if you want to override some different + # types of messages. 
+ self._sidecar.send(msg) @contextmanager def count(self, name): - if self.sidecar_process is not None: - counter = Counter(name, self.env_info) + if self._sidecar.is_active: + counter = Counter(name) counter.increment() - payload = {"counter": counter.to_dict()} - msg = Message(MessageTypes.LOG_EVENT, payload) + payload = {"counter": counter.serialize()} + msg = Message(MessageTypes.BEST_EFFORT, payload) yield - self.sidecar_process.msg_handler(msg) + self._sidecar.send(msg) else: yield @contextmanager def measure(self, name): - if self.sidecar_process is not None: - timer = Timer(name + "_timer", self.env_info) - counter = Counter(name + "_counter", self.env_info) + if self._sidecar.is_active: + timer = Timer(name + "_timer") + counter = Counter(name + "_counter") timer.start() counter.increment() yield timer.end() - payload = {"counter": counter.to_dict(), "timer": timer.to_dict()} - msg = Message(MessageTypes.LOG_EVENT, payload) - self.sidecar_process.msg_handler(msg) + payload = {"counter": counter.serialize(), "timer": timer.serialize()} + msg = Message(MessageTypes.BEST_EFFORT, payload) + self._sidecar.send(msg) else: yield def gauge(self, gauge): - if self.sidecar_process is not None: - payload = {"gauge": gauge.to_dict()} - msg = Message(MessageTypes.LOG_EVENT, payload) - self.sidecar_process.msg_handler(msg) + if self._sidecar.is_active: + payload = {"gauge": gauge.serialize()} + msg = Message(MessageTypes.BEST_EFFORT, payload) + self._sidecar.send(msg) - def terminate(self): - if self.sidecar_process is not None: - self.sidecar_process.kill() + @classmethod + def get_worker(cls): + return None class Metric(object): @@ -88,84 +70,93 @@ class Metric(object): Abstract base class """ - def __init__(self, type, env): - self._env = env - self._type = type + def __init__(self, metric_type, name, context=None): + self._type = metric_type + self._name = name + self._context = context @property - def name(self): - raise NotImplementedError() + def metric_type(self): + return self._type @property - def flow_name(self): - return self._env["flow_name"] + def name(self): + return self._name @property - def env(self): - return self._env + def context(self): + return self._context + + @context.setter + def context(self, new_context): + self._context = new_context @property def value(self): raise NotImplementedError() - def set_env(self, env): - self._env = env - - def to_dict(self): - return { - "_env": self._env, - "_type": self._type, - } + def serialize(self): + # We purposefully do not serialize the context as it can be large; + # it will be transferred using a different mechanism and reset on the other + # end. 
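+        # Subclasses extend this dict with their own fields (Counter adds _count,
+        # Timer adds _start/_end); deserialize() below rebuilds the right subclass
+        # on the receiving end by dispatching on "_type" via _str_type_to_type.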
+ return {"_name": self._name, "_type": self._type} + + @classmethod + def deserialize(cls, value): + if value is None: + return None + metric_type = value.get("_type", "INVALID") + metric_name = value.get("_name", None) + metric_cls = _str_type_to_type.get(metric_type, None) + if metric_cls: + return metric_cls.deserialize(metric_name, value) + else: + raise NotImplementedError("Metric class %s is not supported" % metric_type) class Timer(Metric): - def __init__(self, name, env): - super(Timer, self).__init__(TIMER_TYPE, env) - self._name = name + def __init__(self, name, env=None): + super(Timer, self).__init__(TIMER_TYPE, name, env) self._start = 0 self._end = 0 - @property - def name(self): - return self._name - - def start(self): - self._start = time.time() - - def end(self): - self._end = time.time() - - def set_start(self, start): - self._start = start + def start(self, now=None): + if now is None: + now = time.time() + self._start = now - def set_end(self, end): - self._end = end + def end(self, now=None): + if now is None: + now = time.time() + self._end = now - def get_duration(self): + @property + def duration(self): return self._end - self._start @property def value(self): - return (self._end - self._start) * 1000 + return self.duration * 1000 + + def serialize(self): + parent_ser = super(Timer, self).serialize() + parent_ser["_start"] = self._start + parent_ser["_end"] = self._end + return parent_ser - def to_dict(self): - parent_dict = super(Timer, self).to_dict() - parent_dict["_name"] = self.name - parent_dict["_start"] = self._start - parent_dict["_end"] = self._end - return parent_dict + @classmethod + def deserialize(cls, metric_name, value): + t = Timer(metric_name) + t.start(value.get("_start", 0)) + t.end(value.get("_end", 0)) + return t class Counter(Metric): - def __init__(self, name, env): - super(Counter, self).__init__(COUNTER_TYPE, env) - self._name = name + def __init__(self, name, env=None): + super(Counter, self).__init__(COUNTER_TYPE, name, env) self._count = 0 - @property - def name(self): - return self._name - def increment(self): self._count += 1 @@ -176,23 +167,23 @@ def set_count(self, count): def value(self): return self._count - def to_dict(self): - parent_dict = super(Counter, self).to_dict() - parent_dict["_name"] = self.name - parent_dict["_count"] = self._count - return parent_dict + def serialize(self): + parent_ser = super(Counter, self).serialize() + parent_ser["_count"] = self._count + return parent_ser + + @classmethod + def deserialize(cls, metric_name, value): + c = Counter(metric_name) + c.set_count(value.get("_count", 0)) + return c class Gauge(Metric): - def __init__(self, name, env): - super(Gauge, self).__init__(GAUGE_TYPE, env) - self._name = name + def __init__(self, name, env=None): + super(Gauge, self).__init__(GAUGE_TYPE, name, env) self._value = 0 - @property - def name(self): - return self._name - def set_value(self, val): self._value = val @@ -203,47 +194,15 @@ def increment(self): def value(self): return self._value - def to_dict(self): - parent_dict = super(Gauge, self).to_dict() - parent_dict["_name"] = self.name - parent_dict["_value"] = self.value - return parent_dict - - -def deserialize_metric(metrics_dict): - if metrics_dict is None: - return - - type = metrics_dict.get("_type") - name = metrics_dict.get("_name") - if type == COUNTER_TYPE: - try: - counter = Counter(name, None) - counter.set_env(metrics_dict.get("_env")) - except Exception as ex: - return - - counter.set_count(metrics_dict.get("_count")) - return 
counter - elif type == TIMER_TYPE: - timer = Timer(name, None) - timer.set_start(metrics_dict.get("_start")) - timer.set_end(metrics_dict.get("_end")) - timer.set_env(metrics_dict.get("_env")) - return timer - elif type == GAUGE_TYPE: - gauge = Gauge(name, None) - gauge.set_env(metrics_dict.get("_env")) - gauge.set_value(metrics_dict.get("_value")) - return gauge - else: - raise NotImplementedError("UNSUPPORTED MESSAGE TYPE IN MONITOR") - - -def get_monitor_msg_type(msg): - if msg.payload.get("gauge") is not None: - return GAUGE_TYPE - if msg.payload.get("counter") is not None: - if msg.payload.get("timer") is not None: - return MEASURE_TYPE - return COUNTER_TYPE + def serialize(self): + parent_ser = super(Gauge, self).serialize() + parent_ser["_value"] = self._value + + @classmethod + def deserialize(cls, metric_name, value): + g = Gauge(metric_name) + g.set_value(value.get("_value", 0)) + return g + + +_str_type_to_type = {COUNTER_TYPE: Counter, GAUGE_TYPE: Gauge, TIMER_TYPE: Timer} diff --git a/metaflow/multicore_utils.py b/metaflow/multicore_utils.py index 486ea0c5a23..1ac81039c12 100644 --- a/metaflow/multicore_utils.py +++ b/metaflow/multicore_utils.py @@ -31,8 +31,8 @@ def _spawn(func, arg, dir): with NamedTemporaryFile(prefix="parallel_map_", dir=dir, delete=False) as tmpfile: output_file = tmpfile.name - # make sure stdout and stderr are flushed before forking. Otherwise - # we may print multiple copies of the same output + # Make sure stdout and stderr are flushed before forking, + # or else we may print multiple copies of the same output sys.stderr.flush() sys.stdout.flush() pid = os.fork() @@ -47,13 +47,13 @@ def _spawn(func, arg, dir): exit_code = 0 except: # we must not let any exceptions escape this function - # which might trigger unintended side-effects + # which might trigger unintended side effects traceback.print_exc() finally: sys.stderr.flush() sys.stdout.flush() # we can't use sys.exit(0) here since it raises SystemExit - # that may have unintended side-effects (e.g. triggering + # that may have unintended side effects (e.g. triggering # finally blocks). os._exit(exit_code) diff --git a/metaflow/package.py b/metaflow/package.py index cc1d1e0d85d..f68d15224ca 100644 --- a/metaflow/package.py +++ b/metaflow/package.py @@ -6,13 +6,14 @@ import json from io import BytesIO -from .extension_support import EXT_PKG -from .metaflow_config import DEFAULT_PACKAGE_SUFFIXES, METAFLOW_EXTENSIONS_ADDL_SUFFIXES +from .extension_support import EXT_PKG, package_mfext_all +from .metaflow_config import DEFAULT_PACKAGE_SUFFIXES from .exception import MetaflowException from .util import to_unicode from . 
import R DEFAULT_SUFFIXES_LIST = DEFAULT_PACKAGE_SUFFIXES.split(",") +METAFLOW_SUFFIXES_LIST = [".py", ".html", ".css", ".js"] class NonUniqueFileNameToFilePathMappingException(MetaflowException): @@ -21,7 +22,7 @@ class NonUniqueFileNameToFilePathMappingException(MetaflowException): def __init__(self, filename, file_paths, lineno=None): msg = ( "Filename %s included in the code package includes multiple different paths for the same name : %s.\n" - "The `filename` in the `add_to_package` decorator hook requires a unqiue `file_path` to `file_name` mapping" + "The `filename` in the `add_to_package` decorator hook requires a unique `file_path` to `file_name` mapping" % (filename, ", ".join(file_paths)) ) super().__init__(msg=msg, lineno=lineno) @@ -62,13 +63,6 @@ def __init__(self, flow, environment, echo, suffixes=DEFAULT_SUFFIXES_LIST): self.suffixes = list(set().union(suffixes, DEFAULT_SUFFIXES_LIST)) self.environment = environment self.metaflow_root = os.path.dirname(__file__) - try: - ext_package = importlib.import_module(EXT_PKG) - except ImportError as e: - self.metaflow_extensions_root = [] - else: - self.metaflow_extensions_root = list(ext_package.__path__) - self.metaflow_extensions_addl_suffixes = METAFLOW_EXTENSIONS_ADDL_SUFFIXES self.flow_name = flow.name self._flow = flow @@ -79,9 +73,9 @@ def __init__(self, flow, environment, echo, suffixes=DEFAULT_SUFFIXES_LIST): deco.package_init(flow, step.__name__, environment) self.blob = self._make() - def _walk(self, root, exclude_hidden=True, addl_suffixes=None): - if addl_suffixes is None: - addl_suffixes = [] + def _walk(self, root, exclude_hidden=True, suffixes=None): + if suffixes is None: + suffixes = [] root = to_unicode(root) # handle files/folder with non ascii chars prefixlen = len("%s/" % os.path.dirname(root)) for ( @@ -96,9 +90,7 @@ def _walk(self, root, exclude_hidden=True, addl_suffixes=None): for fname in files: if fname[0] == ".": continue - if any( - fname.endswith(suffix) for suffix in self.suffixes + addl_suffixes - ): + if any(fname.endswith(suffix) for suffix in suffixes): p = os.path.join(path, fname) yield p, p[prefixlen:] @@ -109,17 +101,16 @@ def path_tuples(self): """ # We want the following contents in the tarball # Metaflow package itself - for path_tuple in self._walk(self.metaflow_root, exclude_hidden=False): + for path_tuple in self._walk( + self.metaflow_root, exclude_hidden=False, suffixes=METAFLOW_SUFFIXES_LIST + ): + yield path_tuple + + # Metaflow extensions; for now, we package *all* extensions but this may change + # at a later date; it is possible to call `package_mfext_package` instead of + # `package_mfext_all` + for path_tuple in package_mfext_all(): yield path_tuple - # Metaflow customization if any - if self.metaflow_extensions_root: - for root in self.metaflow_extensions_root: - for path_tuple in self._walk( - root, - exclude_hidden=False, - addl_suffixes=self.metaflow_extensions_addl_suffixes, - ): - yield path_tuple # Any custom packages exposed via decorators deco_module_paths = {} @@ -142,7 +133,9 @@ def path_tuples(self): yield path_tuple if R.use_r(): # the R working directory - for path_tuple in self._walk("%s/" % R.working_dir()): + for path_tuple in self._walk( + "%s/" % R.working_dir(), suffixes=self.suffixes + ): yield path_tuple # the R package for path_tuple in R.package_paths(): @@ -150,7 +143,7 @@ def path_tuples(self): else: # the user's working directory flowdir = os.path.dirname(os.path.abspath(sys.argv[0])) + "/" - for path_tuple in self._walk(flowdir): + for path_tuple in 
self._walk(flowdir, suffixes=self.suffixes): yield path_tuple def _add_info(self, tar): diff --git a/metaflow/parameters.py b/metaflow/parameters.py index ac8173620aa..4f1a6ccb485 100644 --- a/metaflow/parameters.py +++ b/metaflow/parameters.py @@ -22,7 +22,13 @@ # breaking backwards compatibility but don't remove any fields! ParameterContext = namedtuple( "ParameterContext", - ["flow_name", "user_name", "parameter_name", "logger", "ds_type"], + [ + "flow_name", + "user_name", + "parameter_name", + "logger", + "ds_type", + ], ) # currently we execute only one flow per process, so we can treat @@ -55,9 +61,9 @@ def __repr__(self): class DeployTimeField(object): """ This a wrapper object for a user-defined function that is called - at the deploy time to populate fields in a Parameter. The wrapper + at deploy time to populate fields in a Parameter. The wrapper is needed to make Click show the actual value returned by the - function instead of a function pointer in its help text. Also this + function instead of a function pointer in its help text. Also, this object curries the context argument for the function, and pretty prints any exceptions that occur during evaluation. """ @@ -83,23 +89,37 @@ def __init__( if self.print_representation is None: self.print_representation = str(self.fun) - def __call__(self, full_evaluation=False): - # full_evaluation is True if there will be no further "convert" called - # by click and the parameter should be fully evaluated. + def __call__(self, deploy_time=False): + # This is called in two ways: + # - through the normal Click default parameter evaluation: if a default + # value is a callable, Click will call it without any argument. In other + # words, deploy_time=False. This happens for a normal "run" or the "trigger" + # functions for step-functions for example. Anything that has the + # @add_custom_parameters decorator will trigger this. Once click calls this, + # it will then pass the resulting value to the convert() functions for the + # type for that Parameter. + # - by deploy_time_eval which is invoked to process the parameters at + # deploy_time and outside of click processing (ie: at that point, Click + # is not involved since anytime deploy_time_eval is called, no custom parameters + # have been added). In that situation, deploy_time will be True. Note that in + # this scenario, the value should be something that can be converted to JSON. + # The deploy_time value can therefore be used to determine which type of + # processing is requested. ctx = context_proto._replace(parameter_name=self.parameter_name) try: try: - # Not all functions take two arguments - val = self.fun(ctx, full_evaluation) + # Most user-level functions may not care about the deploy_time parameter + # but IncludeFile does. + val = self.fun(ctx, deploy_time) except TypeError: val = self.fun(ctx) except: raise ParameterFieldFailed(self.parameter_name, self.field) else: - return self._check_type(val) + return self._check_type(val, deploy_time) - def _check_type(self, val): - # it is easy to introduce a deploy-time function that that accidentally + def _check_type(self, val, deploy_time): + # it is easy to introduce a deploy-time function that accidentally # returns a value whose type is not compatible with what is defined # in Parameter. Let's catch those mistakes early here, instead of # showing a cryptic stack trace later. 
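To make the deploy_time discussion above concrete: a deploy-time parameter is simply a Parameter whose default is a callable, which gets wrapped in a DeployTimeField and evaluated either by Click during a normal run (deploy_time=False) or by deploy_time_eval on a scheduler (deploy_time=True). A minimal sketch, with a made-up flow and parameter name:

    from datetime import datetime

    from metaflow import FlowSpec, Parameter, step


    class NightlyFlow(FlowSpec):
        # The callable receives the ParameterContext; when a scheduler evaluates it
        # (deploy_time=True), the return value must be a string or JSON-encodable,
        # as enforced by _check_type below.
        run_date = Parameter(
            "run_date", default=lambda ctx: datetime.utcnow().strftime("%Y-%m-%d")
        )

        @step
        def start(self):
            print("running for", self.run_date)
            self.next(self.end)

        @step
        def end(self):
            pass


    if __name__ == "__main__":
        NightlyFlow()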
@@ -120,6 +140,15 @@ def _check_type(self, val): raise ParameterFieldTypeMismatch(msg) return str(val) if self.return_str else val else: + if deploy_time: + try: + if not is_stringish(val): + val = json.dumps(val) + except TypeError: + msg += "Expected a JSON-encodable object or a string." + raise ParameterFieldTypeMismatch(msg) + return val + # If not deploy_time, we expect a string if not is_stringish(val): msg += "Expected a string." raise ParameterFieldTypeMismatch(msg) @@ -142,7 +171,7 @@ def __repr__(self): def deploy_time_eval(value): if isinstance(value, DeployTimeField): - return value(full_evaluation=True) + return value(deploy_time=True) else: return value @@ -159,7 +188,76 @@ def set_parameter_context(flow_name, echo, datastore): ) +class DelayedEvaluationParameter(object): + """ + This is a very simple wrapper to allow parameter "conversion" to be delayed until + the `_set_constants` function in FlowSpec. Typically, parameters are converted + by click when the command line option is processed. For some parameters, like + IncludeFile, this is too early as it would mean we would trigger the upload + of the file too early. If a parameter converts to a DelayedEvaluationParameter + object through the usual click mechanisms, `_set_constants` knows to invoke the + __call__ method on that DelayedEvaluationParameter; in that case, the __call__ + method is invoked without any parameter. The return_str parameter will be used + by schedulers when they need to convert DelayedEvaluationParameters to a + string to store them + """ + + def __init__(self, name, field, fun): + self._name = name + self._field = field + self._fun = fun + + def __call__(self, return_str=False): + try: + return self._fun(return_str=return_str) + except Exception as e: + raise ParameterFieldFailed(self._name, self._field) + + class Parameter(object): + """ + Defines a parameter for a flow. + + Parameters must be instantiated as class variables in flow classes, e.g. + ``` + class MyFlow(FlowSpec): + param = Parameter('myparam') + ``` + in this case, the parameter is specified on the command line as + ``` + python myflow.py run --myparam=5 + ``` + and its value is accessible through a read-only artifact like this: + ``` + print(self.param == 5) + ``` + Note that the user-visible parameter name, `myparam` above, can be + different from the artifact name, `param` above. + + The parameter value is converted to a Python type based on the `type` + argument or to match the type of `default`, if it is set. + + Parameters + ---------- + name : str + User-visible parameter name. + default : str or float or int or bool or `JSONType` or a function. + Default value for the parameter. Use a special `JSONType` class to + indicate that the value must be a valid JSON object. A function + implies that the parameter corresponds to a *deploy-time parameter*. + The type of the default value is used as the parameter `type`. + type : type + If `default` is not specified, define the parameter type. Specify + one of `str`, `float`, `int`, `bool`, or `JSONType` (default: str). + help : str + Help text to show in `run --help`. + required : bool + Require that the user specified a value for the parameter. + `required=True` implies that the `default` is not used. + show_default : bool + If True, show the default value in the help text (default: True). 
+ """ + def __init__(self, name, **kwargs): self.name = name self.kwargs = kwargs diff --git a/metaflow/plugins/__init__.py b/metaflow/plugins/__init__.py index b3039c3721a..bb5c52ea746 100644 --- a/metaflow/plugins/__init__.py +++ b/metaflow/plugins/__init__.py @@ -1,4 +1,5 @@ import sys +import traceback import types @@ -14,7 +15,10 @@ def _merge_lists(base, overrides, attr): def _merge_funcs(base_func, override_func): - r = base_func() + override_func() + # IMPORTANT: This is a `get_plugin_cli` type of function, and we need to *delay* + # evaluation of it until after the flowspec is loaded. + old_default = base_func.__defaults__[0] + r = lambda: base_func(old_default) + override_func() base_func.__defaults__ = (r,) @@ -43,7 +47,7 @@ def _merge_funcs(base_func, override_func): [], lambda base, overrides: _merge_lists(base, overrides, "name"), ), - "get_plugin_cli": (lambda l=None: [] if l is None else l, _merge_funcs), + "get_plugin_cli": (lambda l=None: [] if l is None else l(), _merge_funcs), } @@ -63,6 +67,11 @@ def _merge_funcs(base_func, override_func): v[1](_ext_plugins[k], module_override) except Exception as e: _ext_debug("\tWARNING: ignoring all plugins due to error during import: %s" % e) + print( + "WARNING: Plugins did not load -- ignoring all of them which may not " + "be what you want: %s" % e + ) + traceback.print_exc() _ext_plugins = {k: v[0] for k, v in _expected_extensions.items()} _ext_debug("\tWill import the following plugins: %s" % str(_ext_plugins)) @@ -77,9 +86,12 @@ def get_plugin_cli(): # Add new CLI commands in this list from . import package_cli from .aws.batch import batch_cli - from .aws.eks import kubernetes_cli + from .kubernetes import kubernetes_cli from .aws.step_functions import step_functions_cli + from .airflow import airflow_cli + from .argo import argo_workflows_cli from .cards import card_cli + from . 
import tag_cli return _ext_plugins["get_plugin_cli"]() + [ package_cli.cli, @@ -87,6 +99,9 @@ def get_plugin_cli(): card_cli.cli, kubernetes_cli.cli, step_functions_cli.cli, + airflow_cli.cli, + argo_workflows_cli.cli, + tag_cli.cli, ] @@ -98,7 +113,8 @@ def get_plugin_cli(): from .retry_decorator import RetryDecorator from .resources_decorator import ResourcesDecorator from .aws.batch.batch_decorator import BatchDecorator -from .aws.eks.kubernetes_decorator import KubernetesDecorator +from .kubernetes.kubernetes_decorator import KubernetesDecorator +from .argo.argo_workflows_decorator import ArgoWorkflowsInternalDecorator from .aws.step_functions.step_functions_decorator import StepFunctionsInternalDecorator from .test_unbounded_foreach_decorator import ( InternalTestUnboundedForeachDecorator, @@ -107,6 +123,7 @@ def get_plugin_cli(): from .conda.conda_step_decorator import CondaStepDecorator from .cards.card_decorator import CardDecorator from .frameworks.pytorch import PytorchParallelDecorator +from .airflow.airflow_decorator import AirflowInternalDecorator STEP_DECORATORS = [ @@ -123,9 +140,19 @@ def get_plugin_cli(): ParallelDecorator, PytorchParallelDecorator, InternalTestUnboundedForeachDecorator, + AirflowInternalDecorator, + ArgoWorkflowsInternalDecorator, ] _merge_lists(STEP_DECORATORS, _ext_plugins["STEP_DECORATORS"], "name") +# Datastores +from .datastores.azure_storage import AzureStorage +from .datastores.gs_storage import GSStorage +from .datastores.local_storage import LocalStorage +from .datastores.s3_storage import S3Storage + +DATASTORES = [AzureStorage, GSStorage, LocalStorage, S3Storage] + # Add Conda environment from .conda.conda_environment import CondaEnvironment @@ -146,7 +173,12 @@ def get_plugin_cli(): from .aws.step_functions.schedule_decorator import ScheduleDecorator from .project_decorator import ProjectDecorator -FLOW_DECORATORS = [CondaFlowDecorator, ScheduleDecorator, ProjectDecorator] + +FLOW_DECORATORS = [ + CondaFlowDecorator, + ScheduleDecorator, + ProjectDecorator, +] _merge_lists(FLOW_DECORATORS, _ext_plugins["FLOW_DECORATORS"], "name") # Cards @@ -182,7 +214,8 @@ def get_plugin_cli(): TestNonEditableCard, BlankCard, DefaultCardJSON, -] + MF_EXTERNAL_CARDS +] +_merge_lists(CARDS, MF_EXTERNAL_CARDS, "type") # Sidecars from ..mflog.save_logs_periodically import SaveLogsPeriodicallySidecar from metaflow.metadata.heartbeat import MetadataHeartBeat @@ -195,14 +228,22 @@ def get_plugin_cli(): # Add logger from .debug_logger import DebugEventLogger +from metaflow.event_logger import NullEventLogger -LOGGING_SIDECARS = {"debugLogger": DebugEventLogger, "nullSidecarLogger": None} +LOGGING_SIDECARS = { + DebugEventLogger.TYPE: DebugEventLogger, + NullEventLogger.TYPE: NullEventLogger, +} LOGGING_SIDECARS.update(_ext_plugins["LOGGING_SIDECARS"]) # Add monitor from .debug_monitor import DebugMonitor +from metaflow.monitor import NullMonitor -MONITOR_SIDECARS = {"debugMonitor": DebugMonitor, "nullSidecarMonitor": None} +MONITOR_SIDECARS = { + DebugMonitor.TYPE: DebugMonitor, + NullMonitor.TYPE: NullMonitor, +} MONITOR_SIDECARS.update(_ext_plugins["MONITOR_SIDECARS"]) SIDECARS.update(LOGGING_SIDECARS) diff --git a/metaflow/plugins/airflow/__init__.py b/metaflow/plugins/airflow/__init__.py new file mode 100644 index 00000000000..e69de29bb2d diff --git a/metaflow/plugins/airflow/airflow.py b/metaflow/plugins/airflow/airflow.py new file mode 100644 index 00000000000..5480a79c59e --- /dev/null +++ b/metaflow/plugins/airflow/airflow.py @@ -0,0 +1,693 @@ +from io import 
BytesIO +import json +import os +import random +import string +import sys +from datetime import datetime, timedelta +from metaflow.includefile import FilePathClass + +import metaflow.util as util +from metaflow.decorators import flow_decorators +from metaflow.exception import MetaflowException +from metaflow.metaflow_config import ( + SERVICE_HEADERS, + SERVICE_INTERNAL_URL, + CARD_S3ROOT, + DATASTORE_SYSROOT_S3, + DATATOOLS_S3ROOT, + KUBERNETES_SERVICE_ACCOUNT, + KUBERNETES_SECRETS, + AIRFLOW_KUBERNETES_STARTUP_TIMEOUT_SECONDS, + AZURE_STORAGE_BLOB_SERVICE_ENDPOINT, + DATASTORE_SYSROOT_AZURE, + CARD_AZUREROOT, + AIRFLOW_KUBERNETES_CONN_ID, +) +from metaflow.parameters import DelayedEvaluationParameter, deploy_time_eval +from metaflow.plugins.kubernetes.kubernetes import Kubernetes + +# TODO: Move chevron to _vendor +from metaflow.plugins.cards.card_modules import chevron +from metaflow.plugins.timeout_decorator import get_run_time_limit_for_task +from metaflow.util import dict_to_cli_options, get_username, compress_list +from metaflow.parameters import JSONTypeClass + +from . import airflow_utils +from .exception import AirflowException +from .airflow_utils import ( + TASK_ID_XCOM_KEY, + AirflowTask, + Workflow, + AIRFLOW_MACROS, +) +from metaflow import current + +AIRFLOW_DEPLOY_TEMPLATE_FILE = os.path.join(os.path.dirname(__file__), "dag.py") + + +class Airflow(object): + + TOKEN_STORAGE_ROOT = "mf.airflow" + + def __init__( + self, + name, + graph, + flow, + code_package_sha, + code_package_url, + metadata, + flow_datastore, + environment, + event_logger, + monitor, + production_token, + tags=None, + namespace=None, + username=None, + max_workers=None, + worker_pool=None, + description=None, + file_path=None, + workflow_timeout=None, + is_paused_upon_creation=True, + ): + self.name = name + self.graph = graph + self.flow = flow + self.code_package_sha = code_package_sha + self.code_package_url = code_package_url + self.metadata = metadata + self.flow_datastore = flow_datastore + self.environment = environment + self.event_logger = event_logger + self.monitor = monitor + self.tags = tags + self.namespace = namespace # this is the username space + self.username = username + self.max_workers = max_workers + self.description = description + self._file_path = file_path + _, self.graph_structure = self.graph.output_steps() + self.worker_pool = worker_pool + self.is_paused_upon_creation = is_paused_upon_creation + self.workflow_timeout = workflow_timeout + self.schedule = self._get_schedule() + self.parameters = self._process_parameters() + self.production_token = production_token + self.contains_foreach = self._contains_foreach() + + @classmethod + def get_existing_deployment(cls, name, flow_datastore): + _backend = flow_datastore._storage_impl + token_paths = _backend.list_content([cls.get_token_path(name)]) + if len(token_paths) == 0: + return None + + with _backend.load_bytes([token_paths[0]]) as get_results: + for _, path, _ in get_results: + if path is not None: + with open(path, "r") as f: + data = json.loads(f.read()) + return (data["owner"], data["production_token"]) + + @classmethod + def get_token_path(cls, name): + return os.path.join(cls.TOKEN_STORAGE_ROOT, name) + + @classmethod + def save_deployment_token(cls, owner, token, flow_datastore): + _backend = flow_datastore._storage_impl + _backend.save_bytes( + [ + ( + cls.get_token_path(token), + BytesIO( + bytes( + json.dumps({"production_token": token, "owner": owner}), + "utf-8", + ) + ), + ) + ], + overwrite=False, + ) + + def 
_get_schedule(self): + # Using the cron presets provided here : + # https://airflow.apache.org/docs/apache-airflow/stable/dag-run.html?highlight=schedule%20interval#cron-presets + schedule = self.flow._flow_decorators.get("schedule") + if not schedule: + return None + if schedule.attributes["cron"]: + return schedule.attributes["cron"] + elif schedule.attributes["weekly"]: + return "@weekly" + elif schedule.attributes["hourly"]: + return "@hourly" + elif schedule.attributes["daily"]: + return "@daily" + return None + + def _get_retries(self, node): + max_user_code_retries = 0 + max_error_retries = 0 + foreach_default_retry = 1 + # Different decorators may have different retrying strategies, so take + # the max of them. + for deco in node.decorators: + user_code_retries, error_retries = deco.step_task_retry_count() + max_user_code_retries = max(max_user_code_retries, user_code_retries) + max_error_retries = max(max_error_retries, error_retries) + parent_is_foreach = any( # The immediate parent is a foreach node. + self.graph[n].type == "foreach" for n in node.in_funcs + ) + + if parent_is_foreach: + max_user_code_retries + foreach_default_retry + return max_user_code_retries, max_user_code_retries + max_error_retries + + def _get_retry_delay(self, node): + retry_decos = [deco for deco in node.decorators if deco.name == "retry"] + if len(retry_decos) > 0: + retry_mins = retry_decos[0].attributes["minutes_between_retries"] + return timedelta(minutes=int(retry_mins)) + return None + + def _process_parameters(self): + airflow_params = [] + type_transform_dict = { + int.__name__: "integer", + str.__name__: "string", + bool.__name__: "string", + float.__name__: "number", + } + + for var, param in self.flow._get_parameters(): + # Airflow requires defaults set for parameters. + value = deploy_time_eval(param.kwargs.get("default")) + # Setting airflow related param args. + airflow_param = dict( + name=param.name, + ) + if value is not None: + airflow_param["default"] = value + if param.kwargs.get("help"): + airflow_param["description"] = param.kwargs.get("help") + + # Since we will always have a default value and `deploy_time_eval` resolved that to an actual value + # we can just use the `default` to infer the object's type. + # This avoids parsing/identifying types like `JSONType` or `FilePathClass` + # which are returned by calling `param.kwargs.get("type")` + param_type = type(airflow_param["default"]) + + # extract the name of the type and resolve the type-name + # compatible with Airflow. + param_type_name = getattr(param_type, "__name__", None) + if param_type_name in type_transform_dict: + airflow_param["type"] = type_transform_dict[param_type_name] + + if param_type_name == bool.__name__: + airflow_param["default"] = str(airflow_param["default"]) + + airflow_params.append(airflow_param) + + return airflow_params + + def _compress_input_path( + self, + steps, + ): + """ + This function is meant to compress the input paths, and it specifically doesn't use + `metaflow.util.compress_list` under the hood. The reason is that the `AIRFLOW_MACROS.RUN_ID` is a complicated + macro string that doesn't behave nicely with `metaflow.util.decompress_list`, since the `decompress_util` + function expects a string which doesn't contain any delimiter characters and the run-id string does. Hence, we + have a custom compression string created via `_compress_input_path` function instead of `compress_list`. 
+ """ + return "%s:" % (AIRFLOW_MACROS.RUN_ID) + ",".join( + self._make_input_path(step, only_task_id=True) for step in steps + ) + + def _make_foreach_input_path(self, step_name): + return ( + "%s/%s/:{{ task_instance.xcom_pull(task_ids='%s',key='%s') | join_list }}" + % ( + AIRFLOW_MACROS.RUN_ID, + step_name, + step_name, + TASK_ID_XCOM_KEY, + ) + ) + + def _make_input_path(self, step_name, only_task_id=False): + """ + This is set using the `airflow_internal` decorator to help pass state. + This will pull the `TASK_ID_XCOM_KEY` xcom which holds task-ids. + The key is set via the `MetaflowKubernetesOperator`. + """ + task_id_string = "/%s/{{ task_instance.xcom_pull(task_ids='%s',key='%s') }}" % ( + step_name, + step_name, + TASK_ID_XCOM_KEY, + ) + + if only_task_id: + return task_id_string + + return "%s%s" % (AIRFLOW_MACROS.RUN_ID, task_id_string) + + def _to_job(self, node): + """ + This function will transform the node's specification into Airflow compatible operator arguments. + Since this function is long, below is the summary of the two major duties it performs: + 1. Based on the type of the graph node (start/linear/foreach/join etc.) + it will decide how to set the input paths + 2. Based on node's decorator specification convert the information into + a job spec for the KubernetesPodOperator. + """ + # Add env vars from the optional @environment decorator. + env_deco = [deco for deco in node.decorators if deco.name == "environment"] + env = {} + if env_deco: + env = env_deco[0].attributes["vars"] + + # The below if/else block handles "input paths". + # Input Paths help manage dataflow across the graph. + if node.name == "start": + # POSSIBLE_FUTURE_IMPROVEMENT: + # We can extract metadata about the possible upstream sensor triggers. + # There is a previous commit (7bdf6) in the `airflow` branch that has `SensorMetaExtractor` class and + # associated MACRO we have built to handle this case if a metadata regarding the sensor is needed. + # Initialize parameters for the flow in the `start` step. + # `start` step has no upstream input dependencies aside from + # parameters. + + if len(self.parameters): + env["METAFLOW_PARAMETERS"] = AIRFLOW_MACROS.PARAMETERS + input_paths = None + else: + # If it is not the start node then we check if there are many paths + # converging into it or a single path. Based on that we set the INPUT_PATHS + if node.parallel_foreach: + raise AirflowException( + "Parallel steps are not supported yet with Airflow." + ) + is_foreach_join = ( + node.type == "join" + and self.graph[node.split_parents[-1]].type == "foreach" + ) + if is_foreach_join: + input_paths = self._make_foreach_input_path(node.in_funcs[0]) + + elif len(node.in_funcs) == 1: + # set input paths where this is only one parent node + # The parent-task-id is passed via the xcom; There is no other way to get that. + # One key thing about xcoms is that they are immutable and only accepted if the task + # doesn't fail. + # From airflow docs : + # "Note: If the first task run is not succeeded then on every retry task + # XComs will be cleared to make the task run idempotent." + input_paths = self._make_input_path(node.in_funcs[0]) + else: + # this is a split scenario where there can be more than one input paths. 
+ input_paths = self._compress_input_path(node.in_funcs) + + # env["METAFLOW_INPUT_PATHS"] = input_paths + + env["METAFLOW_CODE_URL"] = self.code_package_url + env["METAFLOW_FLOW_NAME"] = self.flow.name + env["METAFLOW_STEP_NAME"] = node.name + env["METAFLOW_OWNER"] = self.username + + metadata_env = self.metadata.get_runtime_environment("airflow") + env.update(metadata_env) + + metaflow_version = self.environment.get_environment_info() + metaflow_version["flow_name"] = self.graph.name + metaflow_version["production_token"] = self.production_token + env["METAFLOW_VERSION"] = json.dumps(metaflow_version) + + # Extract the k8s decorators for constructing the arguments of the K8s Pod Operator on Airflow. + k8s_deco = [deco for deco in node.decorators if deco.name == "kubernetes"][0] + user_code_retries, _ = self._get_retries(node) + retry_delay = self._get_retry_delay(node) + # This sets timeouts for @timeout decorators. + # The timeout is set as "execution_timeout" for an airflow task. + runtime_limit = get_run_time_limit_for_task(node.decorators) + + k8s = Kubernetes(self.flow_datastore, self.metadata, self.environment) + user = util.get_username() + + labels = { + "app": "metaflow", + "app.kubernetes.io/name": "metaflow-task", + "app.kubernetes.io/part-of": "metaflow", + "app.kubernetes.io/created-by": user, + # Question to (savin) : Should we have username set over here for created by since it is the + # airflow installation that is creating the jobs. + # Technically the "user" is the stakeholder but should these labels be present. + } + additional_mf_variables = { + "METAFLOW_CODE_SHA": self.code_package_sha, + "METAFLOW_CODE_URL": self.code_package_url, + "METAFLOW_CODE_DS": self.flow_datastore.TYPE, + "METAFLOW_USER": user, + "METAFLOW_SERVICE_URL": SERVICE_INTERNAL_URL, + "METAFLOW_SERVICE_HEADERS": json.dumps(SERVICE_HEADERS), + "METAFLOW_DATASTORE_SYSROOT_S3": DATASTORE_SYSROOT_S3, + "METAFLOW_DATATOOLS_S3ROOT": DATATOOLS_S3ROOT, + "METAFLOW_DEFAULT_DATASTORE": "s3", + "METAFLOW_DEFAULT_METADATA": "service", + "METAFLOW_KUBERNETES_WORKLOAD": str( + 1 + ), # This is used by kubernetes decorator. 
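+            # Everything in this mapping is merged into `env` and surfaces in the pod
+            # through the operator's env_vars (see k8s_operator_args below), so the
+            # Metaflow task inside the container resolves the same datastore, metadata
+            # service and Airflow-generated run/task ids.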
+ "METAFLOW_RUNTIME_ENVIRONMENT": "kubernetes", + "METAFLOW_CARD_S3ROOT": CARD_S3ROOT, + "METAFLOW_RUN_ID": AIRFLOW_MACROS.RUN_ID, + "METAFLOW_AIRFLOW_TASK_ID": AIRFLOW_MACROS.create_task_id( + self.contains_foreach + ), + "METAFLOW_AIRFLOW_DAG_RUN_ID": AIRFLOW_MACROS.AIRFLOW_RUN_ID, + "METAFLOW_AIRFLOW_JOB_ID": AIRFLOW_MACROS.AIRFLOW_JOB_ID, + "METAFLOW_PRODUCTION_TOKEN": self.production_token, + "METAFLOW_ATTEMPT_NUMBER": AIRFLOW_MACROS.ATTEMPT, + } + env[ + "METAFLOW_AZURE_STORAGE_BLOB_SERVICE_ENDPOINT" + ] = AZURE_STORAGE_BLOB_SERVICE_ENDPOINT + env["METAFLOW_DATASTORE_SYSROOT_AZURE"] = DATASTORE_SYSROOT_AZURE + env["METAFLOW_CARD_AZUREROOT"] = CARD_AZUREROOT + env.update(additional_mf_variables) + + service_account = ( + KUBERNETES_SERVICE_ACCOUNT + if k8s_deco.attributes["service_account"] is None + else k8s_deco.attributes["service_account"] + ) + k8s_namespace = ( + k8s_deco.attributes["namespace"] + if k8s_deco.attributes["namespace"] is not None + else "default" + ) + + resources = dict( + requests={ + "cpu": k8s_deco.attributes["cpu"], + "memory": "%sM" % str(k8s_deco.attributes["memory"]), + "ephemeral-storage": str(k8s_deco.attributes["disk"]), + } + ) + if k8s_deco.attributes["gpu"] is not None: + resources.update( + dict( + limits={ + "%s.com/gpu".lower() + % k8s_deco.attributes["gpu_vendor"]: str( + k8s_deco.attributes["gpu"] + ) + } + ) + ) + + annotations = { + "metaflow/production_token": self.production_token, + "metaflow/owner": self.username, + "metaflow/user": self.username, + "metaflow/flow_name": self.flow.name, + } + if current.get("project_name"): + annotations.update( + { + "metaflow/project_name": current.project_name, + "metaflow/branch_name": current.branch_name, + "metaflow/project_flow_name": current.project_flow_name, + } + ) + + k8s_operator_args = dict( + # like argo workflows we use step_name as name of container + name=node.name, + namespace=k8s_namespace, + service_account_name=service_account, + node_selector=k8s_deco.attributes["node_selector"], + cmds=k8s._command( + self.flow.name, + AIRFLOW_MACROS.RUN_ID, + node.name, + AIRFLOW_MACROS.create_task_id(self.contains_foreach), + AIRFLOW_MACROS.ATTEMPT, + code_package_url=self.code_package_url, + step_cmds=self._step_cli( + node, input_paths, self.code_package_url, user_code_retries + ), + ), + annotations=annotations, + image=k8s_deco.attributes["image"], + resources=resources, + execution_timeout=dict(seconds=runtime_limit), + retries=user_code_retries, + env_vars=[dict(name=k, value=v) for k, v in env.items() if v is not None], + labels=labels, + task_id=node.name, + startup_timeout_seconds=AIRFLOW_KUBERNETES_STARTUP_TIMEOUT_SECONDS, + get_logs=True, + do_xcom_push=True, + log_events_on_failure=True, + is_delete_operator_pod=True, + retry_exponential_backoff=False, # todo : should this be a arg we allow on CLI. not right now - there is an open ticket for this - maybe at some point we will. 
+ reattach_on_restart=False, + secrets=[], + ) + if AIRFLOW_KUBERNETES_CONN_ID is not None: + k8s_operator_args["kubernetes_conn_id"] = AIRFLOW_KUBERNETES_CONN_ID + else: + k8s_operator_args["in_cluster"] = True + + if k8s_deco.attributes["secrets"]: + if isinstance(k8s_deco.attributes["secrets"], str): + k8s_operator_args["secrets"] = k8s_deco.attributes["secrets"].split(",") + elif isinstance(k8s_deco.attributes["secrets"], list): + k8s_operator_args["secrets"] = k8s_deco.attributes["secrets"] + if len(KUBERNETES_SECRETS) > 0: + k8s_operator_args["secrets"] += KUBERNETES_SECRETS.split(",") + + if retry_delay: + k8s_operator_args["retry_delay"] = dict(seconds=retry_delay.total_seconds()) + + return k8s_operator_args + + def _step_cli(self, node, paths, code_package_url, user_code_retries): + cmds = [] + + script_name = os.path.basename(sys.argv[0]) + executable = self.environment.executable(node.name) + + entrypoint = [executable, script_name] + + top_opts_dict = { + "with": [ + decorator.make_decorator_spec() + for decorator in node.decorators + if not decorator.statically_defined + ] + } + # FlowDecorators can define their own top-level options. They are + # responsible for adding their own top-level options and values through + # the get_top_level_options() hook. See similar logic in runtime.py. + for deco in flow_decorators(): + top_opts_dict.update(deco.get_top_level_options()) + + top_opts = list(dict_to_cli_options(top_opts_dict)) + + top_level = top_opts + [ + "--quiet", + "--metadata=%s" % self.metadata.TYPE, + "--environment=%s" % self.environment.TYPE, + "--datastore=%s" % self.flow_datastore.TYPE, + "--datastore-root=%s" % self.flow_datastore.datastore_root, + "--event-logger=%s" % self.event_logger.TYPE, + "--monitor=%s" % self.monitor.TYPE, + "--no-pylint", + "--with=airflow_internal", + ] + + if node.name == "start": + # We need a separate unique ID for the special _parameters task + task_id_params = "%s-params" % AIRFLOW_MACROS.create_task_id( + self.contains_foreach + ) + # Export user-defined parameters into runtime environment + param_file = "".join( + random.choice(string.ascii_lowercase) for _ in range(10) + ) + # Setup Parameters as environment variables which are stored in a dictionary. + export_params = ( + "python -m " + "metaflow.plugins.airflow.plumbing.set_parameters %s " + "&& . `pwd`/%s" % (param_file, param_file) + ) + # Setting parameters over here. + params = ( + entrypoint + + top_level + + [ + "init", + "--run-id %s" % AIRFLOW_MACROS.RUN_ID, + "--task-id %s" % task_id_params, + ] + ) + + # Assign tags to run objects. + if self.tags: + params.extend("--tag %s" % tag for tag in self.tags) + + # If the start step gets retried, we must be careful not to + # regenerate multiple parameters tasks. Hence, we check first if + # _parameters exists already. + exists = entrypoint + [ + # Dump the parameters task + "dump", + "--max-value-size=0", + "%s/_parameters/%s" % (AIRFLOW_MACROS.RUN_ID, task_id_params), + ] + cmd = "if ! 
%s >/dev/null 2>/dev/null; then %s && %s; fi" % ( + " ".join(exists), + export_params, + " ".join(params), + ) + cmds.append(cmd) + # set input paths for parameters + paths = "%s/_parameters/%s" % (AIRFLOW_MACROS.RUN_ID, task_id_params) + + step = [ + "step", + node.name, + "--run-id %s" % AIRFLOW_MACROS.RUN_ID, + "--task-id %s" % AIRFLOW_MACROS.create_task_id(self.contains_foreach), + "--retry-count %s" % AIRFLOW_MACROS.ATTEMPT, + "--max-user-code-retries %d" % user_code_retries, + "--input-paths %s" % paths, + ] + if self.tags: + step.extend("--tag %s" % tag for tag in self.tags) + if self.namespace is not None: + step.append("--namespace=%s" % self.namespace) + + parent_is_foreach = any( # The immediate parent is a foreach node. + self.graph[n].type == "foreach" for n in node.in_funcs + ) + if parent_is_foreach: + step.append("--split-index %s" % AIRFLOW_MACROS.FOREACH_SPLIT_INDEX) + + cmds.append(" ".join(entrypoint + top_level + step)) + return cmds + + def _contains_foreach(self): + for node in self.graph: + if node.type == "foreach": + return True + return False + + def compile(self): + # Visit every node of the flow and recursively build the state machine. + def _visit(node, workflow, exit_node=None): + parent_is_foreach = any( # Any immediate parent is a foreach node. + self.graph[n].type == "foreach" for n in node.in_funcs + ) + state = AirflowTask( + node.name, is_mapper_node=parent_is_foreach + ).set_operator_args(**self._to_job(node)) + if node.type == "end": + workflow.add_state(state) + + # Continue linear assignment within the (sub)workflow if the node + # doesn't branch or fork. + elif node.type in ("start", "linear", "join", "foreach"): + workflow.add_state(state) + _visit( + self.graph[node.out_funcs[0]], + workflow, + ) + + elif node.type == "split": + workflow.add_state(state) + for func in node.out_funcs: + _visit( + self.graph[func], + workflow, + ) + else: + raise AirflowException( + "Node type *%s* for step *%s* " + "is not currently supported by " + "Airflow." % (node.type, node.name) + ) + + return workflow + + # set max active tasks here , For more info check here : + # https://airflow.apache.org/docs/apache-airflow/stable/_api/airflow/models/dag/index.html#airflow.models.dag.DAG + airflow_dag_args = ( + {} if self.max_workers is None else dict(max_active_tasks=self.max_workers) + ) + airflow_dag_args["is_paused_upon_creation"] = self.is_paused_upon_creation + + # workflow timeout should only be enforced if a dag is scheduled. 
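+        # (Airflow documents dagrun_timeout as applying only to scheduled runs, so
+        # setting it for manually triggered DAGs would have no effect anyway.)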
+ if self.workflow_timeout is not None and self.schedule is not None: + airflow_dag_args["dagrun_timeout"] = dict(seconds=self.workflow_timeout) + + workflow = Workflow( + dag_id=self.name, + default_args=self._create_defaults(), + description=self.description, + schedule_interval=self.schedule, + # `start_date` is a mandatory argument even though the documentation lists it as optional value + # Based on the code, Airflow will throw a `AirflowException` when `start_date` is not provided + # to a DAG : https://github.com/apache/airflow/blob/0527a0b6ce506434a23bc2a6f5ddb11f492fc614/airflow/models/dag.py#L2170 + start_date=datetime.now(), + tags=self.tags, + file_path=self._file_path, + graph_structure=self.graph_structure, + metadata=dict( + contains_foreach=self.contains_foreach, flow_name=self.flow.name + ), + **airflow_dag_args + ) + workflow = _visit(self.graph["start"], workflow) + + workflow.set_parameters(self.parameters) + return self._to_airflow_dag_file(workflow.to_dict()) + + def _to_airflow_dag_file(self, json_dag): + util_file = None + with open(airflow_utils.__file__) as f: + util_file = f.read() + with open(AIRFLOW_DEPLOY_TEMPLATE_FILE) as f: + return chevron.render( + f.read(), + dict( + # Converting the configuration to base64 so that there can be no indentation related issues that can be caused because of + # malformed strings / json. + config=json_dag, + utils=util_file, + deployed_on=str(datetime.now()), + ), + ) + + def _create_defaults(self): + defu_ = { + "owner": get_username(), + # If set on a task and the previous run of the task has failed, + # it will not run the task in the current DAG run. + "depends_on_past": False, + # TODO: Enable emails + "execution_timeout": timedelta(days=5), + "retry_delay": timedelta(seconds=200), + # check https://airflow.apache.org/docs/apache-airflow/stable/_api/airflow/models/baseoperator/index.html?highlight=retry_delay#airflow.models.baseoperator.BaseOperatorMeta + } + if self.worker_pool is not None: + defu_["pool"] = self.worker_pool + + return defu_ diff --git a/metaflow/plugins/airflow/airflow_cli.py b/metaflow/plugins/airflow/airflow_cli.py new file mode 100644 index 00000000000..5ac676978c2 --- /dev/null +++ b/metaflow/plugins/airflow/airflow_cli.py @@ -0,0 +1,434 @@ +import os +import re +import sys +import base64 +from metaflow import current, decorators +from metaflow._vendor import click +from metaflow.exception import MetaflowException, MetaflowInternalError +from metaflow.package import MetaflowPackage +from hashlib import sha1 +from metaflow.plugins.kubernetes.kubernetes_decorator import KubernetesDecorator +from metaflow.util import get_username, to_bytes, to_unicode + +from .airflow import Airflow +from .exception import AirflowException, NotSupportedException + +from metaflow.plugins.aws.step_functions.production_token import ( + load_token, + new_token, + store_token, +) + + +class IncorrectProductionToken(MetaflowException): + headline = "Incorrect production token" + + +VALID_NAME = re.compile("[^a-zA-Z0-9_\-\.]") + + +def resolve_token( + name, token_prefix, obj, authorize, given_token, generate_new_token, is_project +): + # 1) retrieve the previous deployment, if one exists + + workflow = Airflow.get_existing_deployment(name, obj.flow_datastore) + if workflow is None: + obj.echo( + "It seems this is the first time you are deploying *%s* to " + "Airflow." 
% name + ) + prev_token = None + else: + prev_user, prev_token = workflow + + # 2) authorize this deployment + if prev_token is not None: + if authorize is None: + authorize = load_token(token_prefix) + elif authorize.startswith("production:"): + authorize = authorize[11:] + + # we allow the user who deployed the previous version to re-deploy, + # even if they don't have the token + if prev_user != get_username() and authorize != prev_token: + obj.echo( + "There is an existing version of *%s* on Airflow which was " + "deployed by the user *%s*." % (name, prev_user) + ) + obj.echo( + "To deploy a new version of this flow, you need to use the same " + "production token that they used. " + ) + obj.echo( + "Please reach out to them to get the token. Once you have it, call " + "this command:" + ) + obj.echo(" airflow create --authorize MY_TOKEN", fg="green") + obj.echo( + 'See "Organizing Results" at docs.metaflow.org for more information ' + "about production tokens." + ) + raise IncorrectProductionToken( + "Try again with the correct production token." + ) + + # 3) do we need a new token or should we use the existing token? + if given_token: + if is_project: + # we rely on a known prefix for @project tokens, so we can't + # allow the user to specify a custom token with an arbitrary prefix + raise MetaflowException( + "--new-token is not supported for @projects. Use --generate-new-token " + "to create a new token." + ) + if given_token.startswith("production:"): + given_token = given_token[11:] + token = given_token + obj.echo("") + obj.echo("Using the given token, *%s*." % token) + elif prev_token is None or generate_new_token: + token = new_token(token_prefix, prev_token) + if token is None: + if prev_token is None: + raise MetaflowInternalError( + "We could not generate a new token. This is unexpected. " + ) + else: + raise MetaflowException( + "--generate-new-token option is not supported after using " + "--new-token. Use --new-token to make a new namespace." + ) + obj.echo("") + obj.echo("A new production token generated.") + Airflow.save_deployment_token(get_username(), token, obj.flow_datastore) + else: + token = prev_token + + obj.echo("") + obj.echo("The namespace of this production flow is") + obj.echo(" production:%s" % token, fg="green") + obj.echo( + "To analyze results of this production flow add this line in your notebooks:" + ) + obj.echo(' namespace("production:%s")' % token, fg="green") + obj.echo( + "If you want to authorize other people to deploy new versions of this flow to " + "Airflow, they need to call" + ) + obj.echo(" airflow create --authorize %s" % token, fg="green") + obj.echo("when deploying this flow to Airflow for the first time.") + obj.echo( + 'See "Organizing Results" at https://docs.metaflow.org/ for more ' + "information about production tokens." + ) + obj.echo("") + store_token(token_prefix, token) + + return token + + +@click.group() +def cli(): + pass + + +@cli.group(help="Commands related to Airflow.") +@click.option( + "--name", + default=None, + type=str, + help="Airflow DAG name. The flow name is used instead if this option is not " + "specified", +) +@click.pass_obj +def airflow(obj, name=None): + obj.check(obj.graph, obj.flow, obj.environment, pylint=obj.pylint) + obj.dag_name, obj.token_prefix, obj.is_project = resolve_dag_name(name) + + +@airflow.command(help="Compile a new version of this flow to Airflow DAG.") +@click.argument("file", required=True) +@click.option( + "--authorize", + default=None, + help="Authorize using this production token. 
You need this " + "when you are re-deploying an existing flow for the first " + "time. The token is cached in METAFLOW_HOME, so you only " + "need to specify this once.", +) +@click.option( + "--generate-new-token", + is_flag=True, + help="Generate a new production token for this flow. " + "This will move the production flow to a new namespace.", +) +@click.option( + "--new-token", + "given_token", + default=None, + help="Use the given production token for this flow. " + "This will move the production flow to the given namespace.", +) +@click.option( + "--tag", + "tags", + multiple=True, + default=None, + help="Annotate all objects produced by Airflow DAG executions " + "with the given tag. You can specify this option multiple " + "times to attach multiple tags.", +) +@click.option( + "--is-paused-upon-creation", + default=False, + is_flag=True, + help="Generated Airflow DAG is paused/unpaused upon creation.", +) +@click.option( + "--namespace", + "user_namespace", + default=None, + # TODO (savin): Identify the default namespace? + help="Change the namespace from the default to the given tag. " + "See run --help for more information.", +) +@click.option( + "--max-workers", + default=100, + show_default=True, + help="Maximum number of parallel processes.", +) +@click.option( + "--workflow-timeout", + default=None, + type=int, + help="Workflow timeout in seconds. Enforced only for scheduled DAGs.", +) +@click.option( + "--worker-pool", + default=None, + show_default=True, + help="Worker pool for Airflow DAG execution.", +) +@click.pass_obj +def create( + obj, + file, + authorize=None, + generate_new_token=False, + given_token=None, + tags=None, + is_paused_upon_creation=False, + user_namespace=None, + max_workers=None, + workflow_timeout=None, + worker_pool=None, +): + if os.path.abspath(sys.argv[0]) == os.path.abspath(file): + raise MetaflowException( + "Airflow DAG file name cannot be the same as flow file name" + ) + + # Validate if the workflow is correctly parsed. + _validate_workflow( + obj.flow, obj.graph, obj.flow_datastore, obj.metadata, workflow_timeout + ) + + obj.echo("Compiling *%s* to Airflow DAG..." % obj.dag_name, bold=True) + token = resolve_token( + obj.dag_name, + obj.token_prefix, + obj, + authorize, + given_token, + generate_new_token, + obj.is_project, + ) + + flow = make_flow( + obj, + obj.dag_name, + token, + tags, + is_paused_upon_creation, + user_namespace, + max_workers, + workflow_timeout, + worker_pool, + file, + ) + with open(file, "w") as f: + f.write(flow.compile()) + + obj.echo( + "DAG *{dag_name}* " + "for flow *{name}* compiled to " + "Airflow successfully.\n".format(dag_name=obj.dag_name, name=current.flow_name), + bold=True, + ) + + +def make_flow( + obj, + dag_name, + production_token, + tags, + is_paused_upon_creation, + namespace, + max_workers, + workflow_timeout, + worker_pool, + file, +): + # Attach @kubernetes. + decorators._attach_decorators(obj.flow, [KubernetesDecorator.name]) + + decorators._init_step_decorators( + obj.flow, obj.graph, obj.environment, obj.flow_datastore, obj.logger + ) + + # Save the code package in the flow datastore so that both user code and + # metaflow package can be retrieved during workflow execution. 
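+    # The returned URL and sha are passed to Airflow() below as code_package_url /
+    # code_package_sha and reach the pods as METAFLOW_CODE_URL / METAFLOW_CODE_SHA.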
+ obj.package = MetaflowPackage( + obj.flow, obj.environment, obj.echo, obj.package_suffixes + ) + package_url, package_sha = obj.flow_datastore.save_data( + [obj.package.blob], len_hint=1 + )[0] + + return Airflow( + dag_name, + obj.graph, + obj.flow, + package_sha, + package_url, + obj.metadata, + obj.flow_datastore, + obj.environment, + obj.event_logger, + obj.monitor, + production_token, + tags=tags, + namespace=namespace, + username=get_username(), + max_workers=max_workers, + worker_pool=worker_pool, + workflow_timeout=workflow_timeout, + description=obj.flow.__doc__, + file_path=file, + is_paused_upon_creation=is_paused_upon_creation, + ) + + +def _validate_foreach_constraints(graph): + # Todo :Invoke this function when we integrate `foreach`s + def traverse_graph(node, state): + if node.type == "foreach" and node.is_inside_foreach: + raise NotSupportedException( + "Step *%s* is a foreach step called within a foreach step. " + "This type of graph is currently not supported with Airflow." + % node.name + ) + + if node.type == "foreach": + state["foreach_stack"] = [node.name] + + if node.type in ("start", "linear", "join", "foreach"): + if node.type == "linear" and node.is_inside_foreach: + state["foreach_stack"].append(node.name) + + if len(state["foreach_stack"]) > 2: + raise NotSupportedException( + "The foreach step *%s* created by step *%s* needs to have an immediate join step. " + "Step *%s* is invalid since it is a linear step with a foreach. " + "This type of graph is currently not supported with Airflow." + % ( + state["foreach_stack"][1], + state["foreach_stack"][0], + state["foreach_stack"][-1], + ) + ) + + traverse_graph(graph[node.out_funcs[0]], state) + + elif node.type == "split": + for func in node.out_funcs: + traverse_graph(graph[func], state) + + traverse_graph(graph["start"], {}) + + +def _validate_workflow(flow, graph, flow_datastore, metadata, workflow_timeout): + seen = set() + for var, param in flow._get_parameters(): + # Throw an exception if the parameter is specified twice. + norm = param.name.lower() + if norm in seen: + raise MetaflowException( + "Parameter *%s* is specified twice. " + "Note that parameter names are " + "case-insensitive." % param.name + ) + seen.add(norm) + if "default" not in param.kwargs: + raise MetaflowException( + "Parameter *%s* does not have a " + "default value. " + "A default value is required for parameters when deploying flows on Airflow." + ) + # check for other compute related decorators. + for node in graph: + if node.parallel_foreach: + raise AirflowException( + "Deploying flows with @parallel decorator(s) " + "to Airflow is not supported currently." + ) + + if node.type == "foreach": + raise NotSupportedException( + "Step *%s* is a foreach step and Foreach steps are not currently supported with Airflow." + % node.name + ) + if any([d.name == "batch" for d in node.decorators]): + raise NotSupportedException( + "Step *%s* is marked for execution on AWS Batch with Airflow which isn't currently supported." + % node.name + ) + + if flow_datastore.TYPE not in ("azure", "s3"): + raise AirflowException( + 'Datastore of type "s3" or "azure" required with `airflow create`' + ) + + +def resolve_dag_name(name): + project = current.get("project_name") + is_project = False + + if project: + is_project = True + if name: + raise MetaflowException( + "--name is not supported for @projects. " "Use --branch instead." 
+ ) + dag_name = current.project_flow_name + if dag_name and VALID_NAME.search(dag_name): + raise MetaflowException( + "Name '%s' contains invalid characters. Please construct a name using regex %s" + % (dag_name, VALID_NAME.pattern) + ) + project_branch = to_bytes(".".join((project, current.branch_name))) + token_prefix = ( + "mfprj-%s" + % to_unicode(base64.b32encode(sha1(project_branch).digest()))[:16] + ) + else: + if name and VALID_NAME.search(name): + raise MetaflowException( + "Name '%s' contains invalid characters. Please construct a name using regex %s" + % (name, VALID_NAME.pattern) + ) + dag_name = name if name else current.flow_name + token_prefix = dag_name + return dag_name, token_prefix.lower(), is_project diff --git a/metaflow/plugins/airflow/airflow_decorator.py b/metaflow/plugins/airflow/airflow_decorator.py new file mode 100644 index 00000000000..11bdecaaa8b --- /dev/null +++ b/metaflow/plugins/airflow/airflow_decorator.py @@ -0,0 +1,66 @@ +import json +import os +from metaflow.decorators import StepDecorator +from metaflow.metadata import MetaDatum + +from .airflow_utils import ( + TASK_ID_XCOM_KEY, + FOREACH_CARDINALITY_XCOM_KEY, +) + +K8S_XCOM_DIR_PATH = "/airflow/xcom" + + +def safe_mkdir(dir): + try: + os.makedirs(dir) + except FileExistsError: + pass + + +def push_xcom_values(xcom_dict): + safe_mkdir(K8S_XCOM_DIR_PATH) + with open(os.path.join(K8S_XCOM_DIR_PATH, "return.json"), "w") as f: + json.dump(xcom_dict, f) + + +class AirflowInternalDecorator(StepDecorator): + name = "airflow_internal" + + def task_pre_step( + self, + step_name, + task_datastore, + metadata, + run_id, + task_id, + flow, + graph, + retry_count, + max_user_code_retries, + ubf_context, + inputs, + ): + meta = {} + meta["airflow-dag-run-id"] = os.environ["METAFLOW_AIRFLOW_DAG_RUN_ID"] + meta["airflow-job-id"] = os.environ["METAFLOW_AIRFLOW_JOB_ID"] + entries = [ + MetaDatum( + field=k, value=v, type=k, tags=["attempt_id:{0}".format(retry_count)] + ) + for k, v in meta.items() + ] + + # Register book-keeping metadata for debugging. + metadata.register_metadata(run_id, step_name, task_id, entries) + + def task_finished( + self, step_name, flow, graph, is_task_ok, retry_count, max_user_code_retries + ): + # This will pass the xcom when the task finishes. + xcom_values = { + TASK_ID_XCOM_KEY: os.environ["METAFLOW_AIRFLOW_TASK_ID"], + } + if graph[step_name].type == "foreach": + xcom_values[FOREACH_CARDINALITY_XCOM_KEY] = flow._foreach_num_splits + push_xcom_values(xcom_values) diff --git a/metaflow/plugins/airflow/airflow_utils.py b/metaflow/plugins/airflow/airflow_utils.py new file mode 100644 index 00000000000..26e544e8d61 --- /dev/null +++ b/metaflow/plugins/airflow/airflow_utils.py @@ -0,0 +1,672 @@ +import hashlib +import json +import sys +import platform +from collections import defaultdict +from datetime import datetime, timedelta + + +TASK_ID_XCOM_KEY = "metaflow_task_id" +FOREACH_CARDINALITY_XCOM_KEY = "metaflow_foreach_cardinality" +FOREACH_XCOM_KEY = "metaflow_foreach_indexes" +RUN_HASH_ID_LEN = 12 +TASK_ID_HASH_LEN = 8 +RUN_ID_PREFIX = "airflow" +AIRFLOW_FOREACH_SUPPORT_VERSION = "2.3.0" +AIRFLOW_MIN_SUPPORT_VERSION = "2.2.0" +KUBERNETES_PROVIDER_FOREACH_VERSION = "4.2.0" + + +class KubernetesProviderNotFound(Exception): + headline = "Kubernetes provider not found" + + +class ForeachIncompatibleException(Exception): + headline = "Airflow version is incompatible to support Metaflow `foreach`s." 
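The airflow_decorator module above hands results back to Airflow through the KubernetesPodOperator xcom sidecar convention: whatever the pod writes to /airflow/xcom/return.json is published as the task's `return_values` XCom once xcom pushing is enabled on the operator. Below is a minimal standalone sketch of that handoff, not the plugin's own code; it uses a temporary directory in place of /airflow/xcom and a hypothetical task id value.

import json
import os
import tempfile

# Stand-in for K8S_XCOM_DIR_PATH ("/airflow/xcom") so the sketch runs anywhere.
xcom_dir = tempfile.mkdtemp()

# Same shape as the dictionary push_xcom_values() writes; the value is hypothetical.
payload = {"metaflow_task_id": "airflow-1a2b3c4d"}
with open(os.path.join(xcom_dir, "return.json"), "w") as f:
    json.dump(payload, f)

# With xcom pushing enabled, the sidecar reads return.json after the main
# container exits and publishes it as the task's `return_values` XCom;
# MetaflowKubernetesOperator (defined later in airflow_utils.py) then re-pushes
# TASK_ID_XCOM_KEY so downstream tasks can pull the Metaflow task id directly.
with open(os.path.join(xcom_dir, "return.json")) as f:
    print(f.read())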
+ + +class IncompatibleVersionException(Exception): + headline = "Metaflow is incompatible with current version of Airflow." + + def __init__(self, version_number) -> None: + msg = ( + "Airflow version %s is incompatible with Metaflow. Metaflow requires Airflow a minimum version %s" + % (version_number, AIRFLOW_MIN_SUPPORT_VERSION) + ) + super().__init__(msg) + + +class IncompatibleKubernetesProviderVersionException(Exception): + headline = ( + "Kubernetes Provider version is incompatible with Metaflow `foreach`s. " + "Install the provider via " + "`%s -m pip install apache-airflow-providers-cncf-kubernetes==%s`" + ) % (sys.executable, KUBERNETES_PROVIDER_FOREACH_VERSION) + + +def create_absolute_version_number(version): + abs_version = None + # For all digits + if all(v.isdigit() for v in version.split(".")): + abs_version = sum( + [ + (10 ** (3 - idx)) * i + for idx, i in enumerate([int(v) for v in version.split(".")]) + ] + ) + # For first two digits + elif all(v.isdigit() for v in version.split(".")[:2]): + abs_version = sum( + [ + (10 ** (3 - idx)) * i + for idx, i in enumerate([int(v) for v in version.split(".")[:2]]) + ] + ) + return abs_version + + +def _validate_dynamic_mapping_compatibility(): + from airflow.version import version + + af_ver = create_absolute_version_number(version) + if af_ver is None or af_ver < create_absolute_version_number( + AIRFLOW_FOREACH_SUPPORT_VERSION + ): + ForeachIncompatibleException( + "Please install airflow version %s to use Airflow's Dynamic task mapping functionality." + % AIRFLOW_FOREACH_SUPPORT_VERSION + ) + + +def get_kubernetes_provider_version(): + try: + from airflow.providers.cncf.kubernetes.get_provider_info import ( + get_provider_info, + ) + except ImportError as e: + raise KubernetesProviderNotFound( + "This DAG utilizes `KubernetesPodOperator`. " + "Install the Airflow Kubernetes provider using " + "`%s -m pip install apache-airflow-providers-cncf-kubernetes`" + % sys.executable + ) + return get_provider_info()["versions"][0] + + +def _validate_minimum_airflow_version(): + from airflow.version import version + + af_ver = create_absolute_version_number(version) + if af_ver is None or af_ver < create_absolute_version_number( + AIRFLOW_MIN_SUPPORT_VERSION + ): + raise IncompatibleVersionException(version) + + +def _check_foreach_compatible_kubernetes_provider(): + provider_version = get_kubernetes_provider_version() + ver = create_absolute_version_number(provider_version) + if ver is None or ver < create_absolute_version_number( + KUBERNETES_PROVIDER_FOREACH_VERSION + ): + raise IncompatibleKubernetesProviderVersionException() + + +def datetimeparse(isotimestamp): + ver = int(platform.python_version_tuple()[0]) * 10 + int( + platform.python_version_tuple()[1] + ) + if ver >= 37: + return datetime.fromisoformat(isotimestamp) + else: + return datetime.strptime(isotimestamp, "%Y-%m-%dT%H:%M:%S.%f") + + +def get_xcom_arg_class(): + try: + from airflow import XComArg + except ImportError: + return None + return XComArg + + +class AIRFLOW_MACROS: + # run_id_creator is added via the `user_defined_filters` + RUN_ID = "%s-{{ [run_id, dag_run.dag_id] | run_id_creator }}" % RUN_ID_PREFIX + PARAMETERS = "{{ params | json_dump }}" + + STEPNAME = "{{ ti.task_id }}" + + # AIRFLOW_MACROS.TASK_ID will work for linear/branched workflows. + # ti.task_id is the stepname in metaflow code. + # AIRFLOW_MACROS.TASK_ID uses a jinja filter called `task_id_creator` which helps + # concatenate the string using a `/`. 
Since run-id will keep changing and stepname will be + # the same task id will change. Since airflow doesn't encourage dynamic rewriting of dags + # we can rename steps in a foreach with indexes (eg. `stepname-$index`) to create those steps. + # Hence : `foreach`s will require some special form of plumbing. + # https://stackoverflow.com/questions/62962386/can-an-airflow-task-dynamically-generate-a-dag-at-runtime + TASK_ID = ( + "%s-{{ [run_id, ti.task_id, dag_run.dag_id] | task_id_creator }}" + % RUN_ID_PREFIX + ) + + FOREACH_TASK_ID = ( + "%s-{{ [run_id, ti.task_id, dag_run.dag_id, ti.map_index] | task_id_creator }}" + % RUN_ID_PREFIX + ) + + # Airflow run_ids are of the form : "manual__2022-03-15T01:26:41.186781+00:00" + # Such run-ids break the `metaflow.util.decompress_list`; this is why we hash the runid + # We do `echo -n` because it emits line breaks, and we don't want to consider that, since we want same hash value + # when retrieved in python. + RUN_ID_SHELL = ( + "%s-$(echo -n {{ run_id }}-{{ dag_run.dag_id }} | md5sum | awk '{print $1}' | awk '{print substr ($0, 0, %s)}')" + % (RUN_ID_PREFIX, str(RUN_HASH_ID_LEN)) + ) + + ATTEMPT = "{{ task_instance.try_number - 1 }}" + + AIRFLOW_RUN_ID = "{{ run_id }}" + + AIRFLOW_JOB_ID = "{{ ti.job_id }}" + + FOREACH_SPLIT_INDEX = "{{ ti.map_index }}" + + @classmethod + def create_task_id(cls, is_foreach): + if is_foreach: + return cls.FOREACH_TASK_ID + else: + return cls.TASK_ID + + @classmethod + def pathspec(cls, flowname, is_foreach=False): + return "%s/%s/%s/%s" % ( + flowname, + cls.RUN_ID, + cls.STEPNAME, + cls.create_task_id(is_foreach), + ) + + +def run_id_creator(val): + # join `[dag-id,run-id]` of airflow dag. + return hashlib.md5("-".join([str(x) for x in val]).encode("utf-8")).hexdigest()[ + :RUN_HASH_ID_LEN + ] + + +def task_id_creator(val): + # join `[dag-id,run-id]` of airflow dag. + return hashlib.md5("-".join([str(x) for x in val]).encode("utf-8")).hexdigest()[ + :TASK_ID_HASH_LEN + ] + + +def id_creator(val, hash_len): + # join `[dag-id,run-id]` of airflow dag. + return hashlib.md5("-".join([str(x) for x in val]).encode("utf-8")).hexdigest()[ + :hash_len + ] + + +def json_dump(val): + return json.dumps(val) + + +class AirflowDAGArgs(object): + + # `_arg_types` is a dictionary which represents the types of the arguments of an Airflow `DAG`. + # `_arg_types` is used when parsing types back from the configuration json. + # It doesn't cover all the arguments but covers many of the important one which can come from the cli. + _arg_types = { + "dag_id": str, + "description": str, + "schedule_interval": str, + "start_date": datetime, + "catchup": bool, + "tags": list, + "dagrun_timeout": timedelta, + "default_args": { + "owner": str, + "depends_on_past": bool, + "email": list, + "email_on_failure": bool, + "email_on_retry": bool, + "retries": int, + "retry_delay": timedelta, + "queue": str, # which queue to target when running this job. Not all executors implement queue management, the CeleryExecutor does support targeting specific queues. 
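+            # NOTE: datetime and timedelta values in this mapping survive the
+            # round trip through the generated DAG file because serialize()
+            # below stores them as ISO strings / {"seconds": ...} dicts and
+            # deserialize() rebuilds the original types using _arg_types.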
+ "pool": str, # the slot pool this task should run in, slot pools are a way to limit concurrency for certain tasks + "priority_weight": int, + "wait_for_downstream": bool, + "sla": timedelta, + "execution_timeout": timedelta, + "trigger_rule": str, + }, + } + + # Reference for user_defined_filters : https://stackoverflow.com/a/70175317 + filters = dict( + task_id_creator=lambda v: task_id_creator(v), + json_dump=lambda val: json_dump(val), + run_id_creator=lambda val: run_id_creator(val), + join_list=lambda x: ",".join(list(x)), + ) + + def __init__(self, **kwargs): + self._args = kwargs + + @property + def arguments(self): + return dict(**self._args, user_defined_filters=self.filters) + + def serialize(self): + def parse_args(dd): + data_dict = {} + for k, v in dd.items(): + if isinstance(v, dict): + data_dict[k] = parse_args(v) + elif isinstance(v, datetime): + data_dict[k] = v.isoformat() + elif isinstance(v, timedelta): + data_dict[k] = dict(seconds=v.total_seconds()) + else: + data_dict[k] = v + return data_dict + + return parse_args(self._args) + + @classmethod + def deserialize(cls, data_dict): + def parse_args(dd, type_check_dict): + kwrgs = {} + for k, v in dd.items(): + if k not in type_check_dict: + kwrgs[k] = v + elif isinstance(v, dict) and isinstance(type_check_dict[k], dict): + kwrgs[k] = parse_args(v, type_check_dict[k]) + elif type_check_dict[k] == datetime: + kwrgs[k] = datetimeparse(v) + elif type_check_dict[k] == timedelta: + kwrgs[k] = timedelta(**v) + else: + kwrgs[k] = v + return kwrgs + + return cls(**parse_args(data_dict, cls._arg_types)) + + +def _kubernetes_pod_operator_args(operator_args): + from kubernetes import client + + from airflow.kubernetes.secret import Secret + + # Set dynamic env variables like run-id, task-id etc from here. + secrets = [ + Secret("env", secret, secret) for secret in operator_args.get("secrets", []) + ] + args = operator_args + args.update( + { + "secrets": secrets, + # Question for (savin): + # Default timeout in airflow is 120. I can remove `startup_timeout_seconds` for now. how should we expose it to the user? + } + ) + # We need to explicitly add the `client.V1EnvVar` over here because + # `pod_runtime_info_envs` doesn't accept arguments in dictionary form and strictly + # Requires objects of type `client.V1EnvVar` + additional_env_vars = [ + client.V1EnvVar( + name=k, + value_from=client.V1EnvVarSource( + field_ref=client.V1ObjectFieldSelector(field_path=str(v)) + ), + ) + for k, v in { + "METAFLOW_KUBERNETES_POD_NAMESPACE": "metadata.namespace", + "METAFLOW_KUBERNETES_POD_NAME": "metadata.name", + "METAFLOW_KUBERNETES_POD_ID": "metadata.uid", + "METAFLOW_KUBERNETES_SERVICE_ACCOUNT_NAME": "spec.serviceAccountName", + }.items() + ] + args["pod_runtime_info_envs"] = additional_env_vars + + resources = args.get("resources") + # KubernetesPodOperator version 4.2.0 renamed `resources` to + # `container_resources` (https://github.com/apache/airflow/pull/24673) / (https://github.com/apache/airflow/commit/45f4290712f5f779e57034f81dbaab5d77d5de85) + # This was done because `KubernetesPodOperator` didn't play nice with dynamic task mapping and they had to + # deprecate the `resources` argument. Hence, the below code path checks for the version of `KubernetesPodOperator` + # and then sets the argument. If the version < 4.2.0 then we set the argument as `resources`. 
+ # If it is > 4.2.0 then we set the argument as `container_resources` + # The `resources` argument of `KubernetesPodOperator` is going to be deprecated soon in the future. + # So we will only use it for `KubernetesPodOperator` version < 4.2.0 + # The `resources` argument will also not work for `foreach`s. + provider_version = get_kubernetes_provider_version() + k8s_op_ver = create_absolute_version_number(provider_version) + if k8s_op_ver is None or k8s_op_ver < create_absolute_version_number( + KUBERNETES_PROVIDER_FOREACH_VERSION + ): + # Since the provider version is less than `4.2.0` so we need to use the `resources` argument + # We need to explicitly parse `resources`/`container_resources` to `k8s.V1ResourceRequirements`, + # otherwise airflow tries to parse dictionaries to `airflow.providers.cncf.kubernetes.backcompat.pod.Resources` + # object via `airflow.providers.cncf.kubernetes.backcompat.backward_compat_converts.convert_resources` function. + # This fails many times since the dictionary structure it expects is not the same as + # `client.V1ResourceRequirements`. + args["resources"] = client.V1ResourceRequirements( + requests=resources["requests"], + limits=None if "limits" not in resources else resources["limits"], + ) + else: # since the provider version is greater than `4.2.0` so should use the `container_resources` argument + args["container_resources"] = client.V1ResourceRequirements( + requests=resources["requests"], + limits=None if "limits" not in resources else resources["limits"], + ) + del args["resources"] + + if operator_args.get("execution_timeout"): + args["execution_timeout"] = timedelta( + **operator_args.get( + "execution_timeout", + ) + ) + if operator_args.get("retry_delay"): + args["retry_delay"] = timedelta(**operator_args.get("retry_delay")) + return args + + +def get_metaflow_kubernetes_operator(): + try: + from airflow.contrib.operators.kubernetes_pod_operator import ( + KubernetesPodOperator, + ) + except ImportError: + try: + from airflow.providers.cncf.kubernetes.operators.kubernetes_pod import ( + KubernetesPodOperator, + ) + except ImportError as e: + raise KubernetesProviderNotFound( + "This DAG utilizes `KubernetesPodOperator`. " + "Install the Airflow Kubernetes provider using " + "`%s -m pip install apache-airflow-providers-cncf-kubernetes`" + % sys.executable + ) + + class MetaflowKubernetesOperator(KubernetesPodOperator): + """ + ## Why Inherit the `KubernetesPodOperator` class ? + + Two key reasons : + + 1. So that we can override the `execute` method. + The only change we introduce to the method is to explicitly modify xcom relating to `return_values`. + We do this so that the `XComArg` object can work with `expand` function. + + 2. So that we can introduce a keyword argument named `mapper_arr`. + This keyword argument can help as a dummy argument for the `KubernetesPodOperator.partial().expand` method. Any Airflow Operator can be dynamically mapped to runtime artifacts using `Operator.partial(**kwargs).extend(**mapper_kwargs)` post the introduction of [Dynamic Task Mapping](https://airflow.apache.org/docs/apache-airflow/stable/concepts/dynamic-task-mapping.html). + The `expand` function takes keyword arguments taken by the operator. + + ## Why override the `execute` method ? 
+ + When we dynamically map vanilla Airflow operators with artifacts generated at runtime, we need to pass that information via `XComArg` to a operator's keyword argument in the `expand` [function](https://airflow.apache.org/docs/apache-airflow/stable/concepts/dynamic-task-mapping.html#mapping-over-result-of-classic-operators). + The `XComArg` object retrieves XCom values for a particular task based on a `key`, the default key being `return_values`. + Oddly dynamic task mapping [doesn't support XCom values from any other key except](https://github.com/apache/airflow/blob/8a34d25049a060a035d4db4a49cd4a0d0b07fb0b/airflow/models/mappedoperator.py#L150) `return_values` + The values of XCom passed by the `KubernetesPodOperator` are mapped to the `return_values` XCom key. + + The biggest problem this creates is that the values of the Foreach cardinality are stored inside the dictionary of `return_values` and cannot be accessed trivially like : `XComArg(task)['foreach_key']` since they are resolved during runtime. + This puts us in a bind since the only xcom we can retrieve is the full dictionary and we cannot pass that as the iterable for the mapper tasks. + Hence, we inherit the `execute` method and push custom xcom keys (needed by downstream tasks such as metaflow taskids) and modify `return_values` captured from the container whenever a foreach related xcom is passed. + When we encounter a foreach xcom we resolve the cardinality which is passed to an actual list and return that as `return_values`. + This is later useful in the `Workflow.compile` where the operator's `expand` method is called and we are able to retrieve the xcom value. + """ + + template_fields = KubernetesPodOperator.template_fields + ( + "metaflow_pathspec", + "metaflow_run_id", + "metaflow_task_id", + "metaflow_attempt", + "metaflow_step_name", + "metaflow_flow_name", + ) + + def __init__( + self, + *args, + mapper_arr=None, + flow_name=None, + flow_contains_foreach=False, + **kwargs + ) -> None: + super().__init__(*args, **kwargs) + self.mapper_arr = mapper_arr + self._flow_name = flow_name + self._flow_contains_foreach = flow_contains_foreach + self.metaflow_pathspec = AIRFLOW_MACROS.pathspec( + self._flow_name, is_foreach=self._flow_contains_foreach + ) + self.metaflow_run_id = AIRFLOW_MACROS.RUN_ID + self.metaflow_task_id = AIRFLOW_MACROS.create_task_id( + self._flow_contains_foreach + ) + self.metaflow_attempt = AIRFLOW_MACROS.ATTEMPT + self.metaflow_step_name = AIRFLOW_MACROS.STEPNAME + self.metaflow_flow_name = self._flow_name + + def execute(self, context): + result = super().execute(context) + if result is None: + return + ti = context["ti"] + if TASK_ID_XCOM_KEY in result: + ti.xcom_push( + key=TASK_ID_XCOM_KEY, + value=result[TASK_ID_XCOM_KEY], + ) + if FOREACH_CARDINALITY_XCOM_KEY in result: + return list(range(result[FOREACH_CARDINALITY_XCOM_KEY])) + + return MetaflowKubernetesOperator + + +class AirflowTask(object): + def __init__( + self, + name, + operator_type="kubernetes", + flow_name=None, + is_mapper_node=False, + flow_contains_foreach=False, + ): + self.name = name + self._is_mapper_node = is_mapper_node + self._operator_args = None + self._operator_type = operator_type + self._flow_name = flow_name + self._flow_contains_foreach = flow_contains_foreach + + @property + def is_mapper_node(self): + return self._is_mapper_node + + def set_operator_args(self, **kwargs): + self._operator_args = kwargs + return self + + def to_dict(self): + return { + "name": self.name, + "is_mapper_node": 
self._is_mapper_node, + "operator_type": self._operator_type, + "operator_args": self._operator_args, + } + + @classmethod + def from_dict(cls, task_dict, flow_name=None, flow_contains_foreach=False): + op_args = {} if "operator_args" not in task_dict else task_dict["operator_args"] + is_mapper_node = ( + False if "is_mapper_node" not in task_dict else task_dict["is_mapper_node"] + ) + return cls( + task_dict["name"], + is_mapper_node=is_mapper_node, + operator_type=task_dict["operator_type"] + if "operator_type" in task_dict + else "kubernetes", + flow_name=flow_name, + flow_contains_foreach=flow_contains_foreach, + ).set_operator_args(**op_args) + + def _kubernetes_task(self): + MetaflowKubernetesOperator = get_metaflow_kubernetes_operator() + k8s_args = _kubernetes_pod_operator_args(self._operator_args) + return MetaflowKubernetesOperator( + flow_name=self._flow_name, + flow_contains_foreach=self._flow_contains_foreach, + **k8s_args + ) + + def _kubernetes_mapper_task(self): + MetaflowKubernetesOperator = get_metaflow_kubernetes_operator() + k8s_args = _kubernetes_pod_operator_args(self._operator_args) + return MetaflowKubernetesOperator.partial( + flow_name=self._flow_name, + flow_contains_foreach=self._flow_contains_foreach, + **k8s_args + ) + + def to_task(self): + if self._operator_type == "kubernetes": + if not self.is_mapper_node: + return self._kubernetes_task() + else: + return self._kubernetes_mapper_task() + + +class Workflow(object): + def __init__(self, file_path=None, graph_structure=None, metadata=None, **kwargs): + self._dag_instantiation_params = AirflowDAGArgs(**kwargs) + self._file_path = file_path + self._metadata = metadata + tree = lambda: defaultdict(tree) + self.states = tree() + self.metaflow_params = None + self.graph_structure = graph_structure + + def set_parameters(self, params): + self.metaflow_params = params + + def add_state(self, state): + self.states[state.name] = state + + def to_dict(self): + return dict( + metadata=self._metadata, + graph_structure=self.graph_structure, + states={s: v.to_dict() for s, v in self.states.items()}, + dag_instantiation_params=self._dag_instantiation_params.serialize(), + file_path=self._file_path, + metaflow_params=self.metaflow_params, + ) + + def to_json(self): + return json.dumps(self.to_dict()) + + @classmethod + def from_dict(cls, data_dict): + re_cls = cls( + file_path=data_dict["file_path"], + graph_structure=data_dict["graph_structure"], + metadata=data_dict["metadata"], + ) + re_cls._dag_instantiation_params = AirflowDAGArgs.deserialize( + data_dict["dag_instantiation_params"] + ) + + for sd in data_dict["states"].values(): + re_cls.add_state( + AirflowTask.from_dict(sd, flow_name=data_dict["metadata"]["flow_name"]) + ) + re_cls.set_parameters(data_dict["metaflow_params"]) + return re_cls + + @classmethod + def from_json(cls, json_string): + data = json.loads(json_string) + return cls.from_dict(data) + + def _construct_params(self): + from airflow.models.param import Param + + if self.metaflow_params is None: + return {} + param_dict = {} + for p in self.metaflow_params: + name = p["name"] + del p["name"] + param_dict[name] = Param(**p) + return param_dict + + def compile(self): + from airflow import DAG + + # Airflow 2.0.0 cannot import this, so we have to do it this way. + # `XComArg` is needed for dynamic task mapping and if the airflow installation is of the right + # version (+2.3.0) then the class will be importable. 
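+        # `get_xcom_arg_class` returns None when `airflow.XComArg` cannot be
+        # imported; that only matters for foreach mapper nodes below, which
+        # require an Airflow 2.3.0+ installation where the import succeeds.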
+ XComArg = get_xcom_arg_class() + + _validate_minimum_airflow_version() + + if self._metadata["contains_foreach"]: + _validate_dynamic_mapping_compatibility() + # We need to verify if KubernetesPodOperator is of version > 4.2.0 to support foreachs / dynamic task mapping. + # If the dag uses dynamic Task mapping then we throw an error since the `resources` argument in the `KubernetesPodOperator` + # doesn't work for dynamic task mapping for `KubernetesPodOperator` version < 4.2.0. + # For more context check this issue : https://github.com/apache/airflow/issues/24669 + _check_foreach_compatible_kubernetes_provider() + + params_dict = self._construct_params() + # DAG Params can be seen here : + # https://airflow.apache.org/docs/apache-airflow/2.0.0/_api/airflow/models/dag/index.html#airflow.models.dag.DAG + # Airflow 2.0.0 Allows setting Params. + dag = DAG(params=params_dict, **self._dag_instantiation_params.arguments) + dag.fileloc = self._file_path if self._file_path is not None else dag.fileloc + + def add_node(node, parents, dag): + """ + A recursive function to traverse the specialized + graph_structure datastructure. + """ + if type(node) == str: + task = self.states[node].to_task() + if parents: + for parent in parents: + # Handle foreach nodes. + if self.states[node].is_mapper_node: + task = task.expand(mapper_arr=XComArg(parent)) + parent >> task + return [task] # Return Parent + + # this means a split from parent + if type(node) == list: + # this means branching since everything within the list is a list + if all(isinstance(n, list) for n in node): + curr_parents = parents + parent_list = [] + for node_list in node: + last_parent = add_node(node_list, curr_parents, dag) + parent_list.extend(last_parent) + return parent_list + else: + # this means no branching and everything within the list is not a list and can be actual nodes. 
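+                # A flat list such as ["a", "b"] (hypothetical node names)
+                # takes this branch: each entry is chained onto the tasks
+                # returned for the previous entry, producing a linear sequence.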
+ curr_parents = parents + for node_x in node: + curr_parents = add_node(node_x, curr_parents, dag) + return curr_parents + + with dag: + parent = None + for node in self.graph_structure: + parent = add_node(node, parent, dag) + + return dag diff --git a/metaflow/plugins/airflow/dag.py b/metaflow/plugins/airflow/dag.py new file mode 100644 index 00000000000..2720fe40e0e --- /dev/null +++ b/metaflow/plugins/airflow/dag.py @@ -0,0 +1,9 @@ +# Deployed on {{deployed_on}} + +CONFIG = {{{config}}} + +{{{utils}}} + +dag = Workflow.from_dict(CONFIG).compile() +with dag: + pass diff --git a/metaflow/plugins/airflow/exception.py b/metaflow/plugins/airflow/exception.py new file mode 100644 index 00000000000..a76a755e22c --- /dev/null +++ b/metaflow/plugins/airflow/exception.py @@ -0,0 +1,12 @@ +from metaflow.exception import MetaflowException + + +class AirflowException(MetaflowException): + headline = "Airflow Exception" + + def __init__(self, msg): + super().__init__(msg) + + +class NotSupportedException(MetaflowException): + headline = "Not yet supported with Airflow" diff --git a/metaflow/plugins/airflow/plumbing/__init__.py b/metaflow/plugins/airflow/plumbing/__init__.py new file mode 100644 index 00000000000..e69de29bb2d diff --git a/metaflow/plugins/airflow/plumbing/set_parameters.py b/metaflow/plugins/airflow/plumbing/set_parameters.py new file mode 100644 index 00000000000..7a2e4dd3112 --- /dev/null +++ b/metaflow/plugins/airflow/plumbing/set_parameters.py @@ -0,0 +1,21 @@ +import os +import json +import sys + + +def export_parameters(output_file): + input = json.loads(os.environ.get("METAFLOW_PARAMETERS", "{}")) + with open(output_file, "w") as f: + for k in input: + # Replace `-` with `_` is parameter names since `-` isn't an + # allowed character for environment variables. cli.py will + # correctly translate the replaced `-`s. 
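+            # For a hypothetical parameter {"alpha-rate": 0.5} this writes:
+            #   export METAFLOW_INIT_ALPHA_RATE=0.5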
+ f.write( + "export METAFLOW_INIT_%s=%s\n" + % (k.upper().replace("-", "_"), json.dumps(input[k])) + ) + os.chmod(output_file, 509) + + +if __name__ == "__main__": + export_parameters(sys.argv[1]) diff --git a/metaflow/plugins/argo/__init__.py b/metaflow/plugins/argo/__init__.py new file mode 100644 index 00000000000..e69de29bb2d diff --git a/metaflow/plugins/argo/argo_client.py b/metaflow/plugins/argo/argo_client.py new file mode 100644 index 00000000000..bfe70cff5b9 --- /dev/null +++ b/metaflow/plugins/argo/argo_client.py @@ -0,0 +1,182 @@ +import json +import os +import sys + +from metaflow.exception import MetaflowException +from metaflow.plugins.kubernetes.kubernetes_client import KubernetesClient + + +class ArgoClientException(MetaflowException): + headline = "Argo Client error" + + +class ArgoClient(object): + def __init__(self, namespace=None): + + self._kubernetes_client = KubernetesClient() + self._namespace = namespace or "default" + self._group = "argoproj.io" + self._version = "v1alpha1" + + def get_workflow_template(self, name): + client = self._kubernetes_client.get() + try: + return client.CustomObjectsApi().get_namespaced_custom_object( + group=self._group, + version=self._version, + namespace=self._namespace, + plural="workflowtemplates", + name=name, + ) + except client.rest.ApiException as e: + if e.status == 404: + return None + raise ArgoClientException( + json.loads(e.body)["message"] if e.body is not None else e.reason + ) + + def register_workflow_template(self, name, workflow_template): + # Unfortunately, Kubernetes client does not handle optimistic + # concurrency control by itself unlike kubectl + client = self._kubernetes_client.get() + try: + workflow_template["metadata"][ + "resourceVersion" + ] = client.CustomObjectsApi().get_namespaced_custom_object( + group=self._group, + version=self._version, + namespace=self._namespace, + plural="workflowtemplates", + name=name, + )[ + "metadata" + ][ + "resourceVersion" + ] + except client.rest.ApiException as e: + if e.status == 404: + try: + return client.CustomObjectsApi().create_namespaced_custom_object( + group=self._group, + version=self._version, + namespace=self._namespace, + plural="workflowtemplates", + body=workflow_template, + ) + except client.rest.ApiException as e: + raise ArgoClientException( + json.loads(e.body)["message"] + if e.body is not None + else e.reason + ) + else: + raise ArgoClientException( + json.loads(e.body)["message"] if e.body is not None else e.reason + ) + try: + return client.CustomObjectsApi().replace_namespaced_custom_object( + group=self._group, + version=self._version, + namespace=self._namespace, + plural="workflowtemplates", + body=workflow_template, + name=name, + ) + except client.rest.ApiException as e: + raise ArgoClientException( + json.loads(e.body)["message"] if e.body is not None else e.reason + ) + + def trigger_workflow_template(self, name, parameters={}): + client = self._kubernetes_client.get() + body = { + "apiVersion": "argoproj.io/v1alpha1", + "kind": "Workflow", + "metadata": {"generateName": name + "-"}, + "spec": { + "workflowTemplateRef": {"name": name}, + "arguments": { + "parameters": [ + {"name": k, "value": json.dumps(v)} + for k, v in parameters.items() + ] + }, + }, + } + try: + return client.CustomObjectsApi().create_namespaced_custom_object( + group=self._group, + version=self._version, + namespace=self._namespace, + plural="workflows", + body=body, + ) + except client.rest.ApiException as e: + raise ArgoClientException( + 
json.loads(e.body)["message"] if e.body is not None else e.reason + ) + + def schedule_workflow_template(self, name, schedule=None): + # Unfortunately, Kubernetes client does not handle optimistic + # concurrency control by itself unlike kubectl + client = self._kubernetes_client.get() + body = { + "apiVersion": "argoproj.io/v1alpha1", + "kind": "CronWorkflow", + "metadata": {"name": name}, + "spec": { + "suspend": schedule is None, + "schedule": schedule, + "workflowSpec": {"workflowTemplateRef": {"name": name}}, + }, + } + try: + body["metadata"][ + "resourceVersion" + ] = client.CustomObjectsApi().get_namespaced_custom_object( + group=self._group, + version=self._version, + namespace=self._namespace, + plural="cronworkflows", + name=name, + )[ + "metadata" + ][ + "resourceVersion" + ] + except client.rest.ApiException as e: + # Scheduled workflow does not exist and we want to schedule a workflow + if e.status == 404: + if schedule is None: + return + try: + return client.CustomObjectsApi().create_namespaced_custom_object( + group=self._group, + version=self._version, + namespace=self._namespace, + plural="cronworkflows", + body=body, + ) + except client.rest.ApiException as e: + raise ArgoClientException( + json.loads(e.body)["message"] + if e.body is not None + else e.reason + ) + else: + raise ArgoClientException( + json.loads(e.body)["message"] if e.body is not None else e.reason + ) + try: + return client.CustomObjectsApi().replace_namespaced_custom_object( + group=self._group, + version=self._version, + namespace=self._namespace, + plural="cronworkflows", + body=body, + name=name, + ) + except client.rest.ApiException as e: + raise ArgoClientException( + json.loads(e.body)["message"] if e.body is not None else e.reason + ) diff --git a/metaflow/plugins/argo/argo_workflows.py b/metaflow/plugins/argo/argo_workflows.py new file mode 100644 index 00000000000..071e097baf0 --- /dev/null +++ b/metaflow/plugins/argo/argo_workflows.py @@ -0,0 +1,1404 @@ +import json +import os +import shlex +import sys +from collections import defaultdict + +from metaflow import current +from metaflow.decorators import flow_decorators +from metaflow.exception import MetaflowException +from metaflow.metaflow_config import ( + SERVICE_HEADERS, + SERVICE_INTERNAL_URL, + CARD_S3ROOT, + DATASTORE_SYSROOT_S3, + DATATOOLS_S3ROOT, + DEFAULT_METADATA, + KUBERNETES_NAMESPACE, + KUBERNETES_NODE_SELECTOR, + KUBERNETES_SANDBOX_INIT_SCRIPT, + KUBERNETES_SECRETS, + S3_ENDPOINT_URL, + AZURE_STORAGE_BLOB_SERVICE_ENDPOINT, + DATASTORE_SYSROOT_AZURE, + DATASTORE_SYSROOT_GS, + CARD_AZUREROOT, + CARD_GSROOT, +) +from metaflow.mflog import BASH_SAVE_LOGS, bash_capture_logs, export_mflog_env_vars +from metaflow.parameters import deploy_time_eval +from metaflow.util import compress_list, dict_to_cli_options, to_camelcase + +from .argo_client import ArgoClient + + +class ArgoWorkflowsException(MetaflowException): + headline = "Argo Workflows error" + + +class ArgoWorkflowsSchedulingException(MetaflowException): + headline = "Argo Workflows scheduling error" + + +# List of future enhancements - +# 1. Configure Argo metrics. +# 2. Support Argo Events. +# 3. Support resuming failed workflows within Argo Workflows. +# 4. Support gang-scheduled clusters for distributed PyTorch/TF - One option is to +# use volcano - https://github.com/volcano-sh/volcano/tree/master/example/integrations/argo +# 5. Support GitOps workflows. +# 6. Add Metaflow tags to labels/annotations. +# 7. 
Support Multi-cluster scheduling - https://github.com/argoproj/argo-workflows/issues/3523#issuecomment-792307297 +# 8. Support for workflow notifications. +# 9. Support R lang. +# 10.Ping @savin at slack.outerbounds.co for any feature request. + + +class ArgoWorkflows(object): + def __init__( + self, + name, + graph, + flow, + code_package_sha, + code_package_url, + production_token, + metadata, + flow_datastore, + environment, + event_logger, + monitor, + tags=None, + namespace=None, + username=None, + max_workers=None, + workflow_timeout=None, + workflow_priority=None, + ): + # Some high-level notes - + # + # Fail-fast behavior for Argo Workflows - Argo stops + # scheduling new steps as soon as it detects that one of the DAG nodes + # has failed. After waiting for all the scheduled DAG nodes to run till + # completion, Argo with fail the DAG. This implies that after a node + # has failed, it may be awhile before the entire DAG is marked as + # failed. There is nothing Metaflow can do here for failing even + # faster (as of Argo 3.2). + # + # argo stop` vs `argo terminate` - since we don't currently + # rely on any exit handlers, it's safe to either stop or terminate any running + # argo workflow deployed through Metaflow. This may not hold true, once we + # integrate with Argo Events. + # + # Currently, an Argo Workflow can only execute entirely within a single + # Kubernetes namespace. Multi-cluster / Multi-namespace execution is on the + # deck for v3.4 release for Argo Workflows; beyond which point, we will be + # able to support them natively. + # + # Since this implementation generates numerous templates on the fly, please + # ensure that your Argo Workflows controller doesn't restrict + # templateReferencing. + + self.name = name + self.graph = graph + self.flow = flow + self.code_package_sha = code_package_sha + self.code_package_url = code_package_url + self.production_token = production_token + self.metadata = metadata + self.flow_datastore = flow_datastore + self.environment = environment + self.event_logger = event_logger + self.monitor = monitor + self.tags = tags + self.namespace = namespace + self.username = username + self.max_workers = max_workers + self.workflow_timeout = workflow_timeout + self.workflow_priority = workflow_priority + + self.parameters = self._process_parameters() + self._workflow_template = self._compile() + self._cron = self._cron() + + def __str__(self): + return str(self._workflow_template) + + def deploy(self): + try: + ArgoClient(namespace=KUBERNETES_NAMESPACE).register_workflow_template( + self.name, self._workflow_template.to_json() + ) + except Exception as e: + raise ArgoWorkflowsException(str(e)) + + @staticmethod + def _sanitize(name): + # Metaflow allows underscores in node names, which are disallowed in Argo + # Workflow template names - so we swap them with hyphens which are not + # allowed by Metaflow - guaranteeing uniqueness. + return name.replace("_", "-") + + @classmethod + def trigger(cls, name, parameters=None): + if parameters is None: + parameters = {} + try: + workflow_template = ArgoClient( + namespace=KUBERNETES_NAMESPACE + ).get_workflow_template(name) + except Exception as e: + raise ArgoWorkflowsException(str(e)) + if workflow_template is None: + raise ArgoWorkflowsException( + "The workflow *%s* doesn't exist on Argo Workflows in namespace *%s*. " + "Please deploy your flow first." 
% (name, KUBERNETES_NAMESPACE) + ) + else: + try: + # Check that the workflow was deployed through Metaflow + workflow_template["metadata"]["annotations"]["metaflow/owner"] + except KeyError as e: + raise ArgoWorkflowsException( + "An existing non-metaflow workflow with the same name as " + "*%s* already exists in Argo Workflows. \nPlease modify the " + "name of this flow or delete your existing workflow on Argo " + "Workflows before proceeding." % name + ) + try: + return ArgoClient(namespace=KUBERNETES_NAMESPACE).trigger_workflow_template( + name, parameters + ) + except Exception as e: + raise ArgoWorkflowsException(str(e)) + + def _cron(self): + schedule = self.flow._flow_decorators.get("schedule") + if schedule: + # Remove the field "Year" if it exists + return " ".join(schedule.schedule.split()[:5]) + return None + + def schedule(self): + try: + ArgoClient(namespace=KUBERNETES_NAMESPACE).schedule_workflow_template( + self.name, self._cron + ) + except Exception as e: + raise ArgoWorkflowsSchedulingException(str(e)) + + def trigger_explanation(self): + if self._cron: + return ( + "This workflow triggers automatically via the CronWorkflow *%s*." + % self.name + ) + else: + return "No triggers defined. You need to launch this workflow manually." + + @classmethod + def get_existing_deployment(cls, name): + workflow_template = ArgoClient( + namespace=KUBERNETES_NAMESPACE + ).get_workflow_template(name) + if workflow_template is not None: + try: + return ( + workflow_template["metadata"]["annotations"]["metaflow/owner"], + workflow_template["metadata"]["annotations"][ + "metaflow/production_token" + ], + ) + except KeyError as e: + raise ArgoWorkflowsException( + "An existing non-metaflow workflow with the same name as " + "*%s* already exists in Argo Workflows. \nPlease modify the " + "name of this flow or delete your existing workflow on Argo " + "Workflows before proceeding." % name + ) + return None + + def _process_parameters(self): + parameters = [] + has_schedule = self._cron() is not None + seen = set() + for var, param in self.flow._get_parameters(): + # Throw an exception if the parameter is specified twice. + norm = param.name.lower() + if norm in seen: + raise MetaflowException( + "Parameter *%s* is specified twice. " + "Note that parameter names are " + "case-insensitive." % param.name + ) + seen.add(norm) + + is_required = param.kwargs.get("required", False) + # Throw an exception if a schedule is set for a flow with required + # parameters with no defaults. We currently don't have any notion + # of data triggers in Argo Workflows. + + # TODO: Support Argo Events for data triggering in the near future. + if "default" not in param.kwargs and is_required and has_schedule: + raise MetaflowException( + "The parameter *%s* does not have a default and is required. " + "Scheduling such parameters via Argo CronWorkflows is not " + "currently supported." % param.name + ) + value = deploy_time_eval(param.kwargs.get("default")) + # If the value is not required and the value is None, we set the value to + # the JSON equivalent of None to please argo-workflows. + if not is_required or value is not None: + value = json.dumps(value) + parameters.append( + dict(name=param.name, value=value, description=param.kwargs.get("help")) + ) + return parameters + + def _compile(self): + # This method compiles a Metaflow FlowSpec into Argo WorkflowTemplate + # + # WorkflowTemplate + # | + # -- WorkflowSpec + # | + # -- Array