Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix vfs.ls with access_credentials_name #512

Merged
merged 6 commits into from
Apr 3, 2024
Merged

Fix vfs.ls with access_credentials_name #512

merged 6 commits into from
Apr 3, 2024

Conversation

johnkerl
Copy link
Collaborator

@johnkerl johnkerl commented Feb 9, 2024

Overview

When attempting to use the run_ingest_workflow function to ingest H5AD files into SOMA, users are expected to provide AWS credentials through the extra_tiledb_config argument. This is a departure from other 1-line ingestors, which allow users to provide only an access_credential_name (or a config) referencing their role-based credential.

The function fails if AWS credentials are not defined in the local environment or passed through the extra_tiledb_config argument, even when access_credential_name is provided because a tiledb.VFS instance is created locally to determine whether the input_uri points to a file or directory.

Why this is

  • run_ingest_workflow runs on the client
  • We need vfs in order to:
    • Decide if the input URI is a leaf (.h5ad) or not (a prefix)
    • In the latter case, to enumerate over the .h5ad leaves at that prefix

Sample script

stamp=$(vdatetime)
prefix="s3://tiledb-johnkerl/s/a/stack-small"
scratch="tiledb://johnkerl-tiledb/s3://tiledb-johnkerl/scratch"
unset AWS_ACCESS_KEY_ID
unset AWS_SECRET_ACCESS_KEY
unset AWS_DEFAULT_REGION
./clingt.py $prefix $scratch/csi-ing-b-prefix-env-no-$stamp

clingt.py

#!/usr/bin/env python

import tiledb.cloud
import tiledb.cloud.soma
import datetime
import sys
import os

input_uri        = "s3://tiledb-johnkerl/s/a/stack-small"
stamp            = datetime.datetime.today().strftime('%Y%m%d-%H%M%S')
output_uri       = f"tiledb://johnkerl-tiledb/s3://tiledb-johnkerl/scratch/csi-{stamp}"
measurement_name = "RNA"
namespace        = "TileDB-Inc"
dry_run          = False

args = sys.argv[1:]

if len(args) >= 1 and args[0] == "-n":
    dry_run = True
    args = args[1:]

if len(args) == 1:
    input_uri  = args[0]
if len(args) == 2:
    input_uri  = args[0]
    output_uri = args[1]

print(input_uri)
print(output_uri)

dikt = tiledb.cloud.soma.run_ingest_workflow(
    output_uri=output_uri,
    input_uri=input_uri,
    measurement_name=measurement_name,
    namespace=namespace,
    access_credentials_name="tiledb-cloud-sandbox-role",
    resources={"cpu": "8", "memory": "32Gi"},
    extra_tiledb_config={"config.logging_level": "5"},
    dry_run=dry_run,
)
print(dikt)
print("https://cloud.tiledb.com/activity/taskgraphs/TileDB-Inc/" + dikt["graph_id"])

Expected behavior

$ aws s3 ls s3://tiledb-johnkerl/s/a/stack-small/
2023-08-11 15:37:15      20200 stack1.h5ad
2023-08-11 15:37:15      20200 stack2.h5ad
2023-08-11 15:37:15      20200 stack3.h5ad
2023-08-11 15:37:15      20200 stack4.h5ad

Since the prefix provided to clingt.py is s3://tiledb-johnkerl/s/a/stack-small, I expect one enumerator node to vfs.ls that prefix, and four leaves to be launched, one for each .h5ad file at that prefix.

Sample logs

The first one is user-visible, from clingt.py, and the second one is linked to from there:

The next one (the enumerator node) can't be found from the ones above. The user has to discover it by searching compute logs :(. The first one is the enumerator (four-point candelabra); the ones after are the four leaves:

See also

[sc-35137]

Copy link

@johnkerl johnkerl marked this pull request as ready for review February 9, 2024 21:48
@johnkerl
Copy link
Collaborator Author

@thetorpedodog ping 🙏

src/tiledb/cloud/soma/ingest.py Outdated Show resolved Hide resolved
src/tiledb/cloud/soma/ingest.py Outdated Show resolved Hide resolved
src/tiledb/cloud/soma/ingest.py Outdated Show resolved Hide resolved
src/tiledb/cloud/soma/ingest.py Outdated Show resolved Hide resolved
@johnkerl
Copy link
Collaborator Author

@thetorpedodog thank you for punctuational review.

What I really need help with is the two big-picture questions I wrote about in the description field.

Thanks in advance for any help you can offer. I am in need of expert guidance here.

@thetorpedodog
Copy link
Contributor

I guess the question I have at this stage is: why does the setup task run on the client side? It seems like if that were run on the server side, that would solve the problem.

@johnkerl
Copy link
Collaborator Author

I guess the question I have at this stage is: why does the setup task run on the client side? It seems like if that were run on the server side, that would solve the problem.

This PR has the setup task run server-side as described in the description section of this PR.

@thetorpedodog
Copy link
Contributor

Coming back to this after way too long, I think I have a solution here. The change that needs to happen is in the run_ingest_workflow function.

Right now it starts the one-node graph to build the actual ingestion graph, but it returns the ID of the one-node graph.

Instead, we need to wait for that one-node graph to complete (i.e., for it to build and launch the actual ingestion graph). Then, the one-node graph will return the ID of the actual ingestion graph, and the run_ingest_workflow function can look the output of that node, which contains the graph ID, and return that to the user:

def run_ingest_workflow(...) -> Dict[str, str]:
    ...
    grf = build_ingest_workflow_graph(...)
    grf.compute()
    the_node = next(iter(grf.nodes.values()))
    real_graph_uuid = the_node.result()
    return {
        "status": "started",
        "graph_id": str(real_graph_uuid),
    }

@johnkerl johnkerl force-pushed the kerl/vfs-refactor branch 2 times, most recently from 8e0973e to afa2f06 Compare March 18, 2024 16:15
@johnkerl johnkerl changed the title Fix vfs.ls with access_credentials_name [WIP] Fix vfs.ls with access_credentials_name Mar 18, 2024
@ihnorton ihnorton closed this Mar 18, 2024
@ihnorton ihnorton reopened this Mar 18, 2024
@johnkerl
Copy link
Collaborator Author

@thetorpedodog ready for next round of review!

Copy link
Contributor

@thetorpedodog thetorpedodog left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A few minor things, otherwise this looks pretty good!

src/tiledb/cloud/soma/ingest.py Outdated Show resolved Hide resolved
src/tiledb/cloud/soma/ingest.py Outdated Show resolved Hide resolved
src/tiledb/cloud/soma/ingest.py Outdated Show resolved Hide resolved
@johnkerl johnkerl force-pushed the kerl/vfs-refactor branch from 66f0af7 to 37fe7de Compare April 2, 2024 21:26
@johnkerl johnkerl requested a review from thetorpedodog April 2, 2024 21:36
@johnkerl
Copy link
Collaborator Author

johnkerl commented Apr 2, 2024

@thetorpedodog ready for the next round of review!

@johnkerl
Copy link
Collaborator Author

johnkerl commented Apr 3, 2024

@thetorpedodog ping 🙏

logging.debug("ENUMERATOR ENTER")
logging.debug("ENUMERATOR INPUT_URI %s", input_uri)
logging.debug("ENUMERATOR OUTPUT_URI %s", output_uri)
logging.debug("ENUMERATOR DRY_RUN %s", str(dry_run))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no need for str here; %s transforms everything into a string.

@johnkerl johnkerl merged commit 93e8fe1 into main Apr 3, 2024
17 checks passed
@johnkerl johnkerl deleted the kerl/vfs-refactor branch April 3, 2024 19:14
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants