Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Upload folder implementation. #618

Open
wants to merge 7 commits into
base: main
Choose a base branch
from

Conversation

JohnMoutafis
Copy link
Contributor

@JohnMoutafis JohnMoutafis commented Jul 26, 2024

Upload a local folder, utilizing the upload_file method.

  • Respects the uploading folder structure by default and recreates it in the destination.
  • Ingests the files into a given Group, or creates a new one form the folder name

@JohnMoutafis JohnMoutafis self-assigned this Jul 26, 2024
src/tiledb/cloud/files/utils.py Outdated Show resolved Hide resolved
namespace, name = utils.split_uri(output_uri)
_, sp, acn = groups._default_ns_path_cred(namespace=namespace)

storage_path = name if name.startswith("s3://") else sp
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why are we treating Amazon URIs specially here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am now checking if the name contains :// will that suffice?

src/tiledb/cloud/files/utils.py Outdated Show resolved Hide resolved
src/tiledb/cloud/files/utils.py Outdated Show resolved Hide resolved
src/tiledb/cloud/files/utils.py Outdated Show resolved Hide resolved
src/tiledb/cloud/files/utils.py Outdated Show resolved Hide resolved
)
uploaded += 1
except Exception as exc:
upload_errors[fname] = str(exc)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why are we turning the exception here into a string? it strips away any useful information that might have been included.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I prefer the result to be serializable.

src/tiledb/cloud/files/utils.py Outdated Show resolved Hide resolved
src/tiledb/cloud/files/utils.py Show resolved Hide resolved
@Shelnutt2
Copy link
Member

Do we have a design doc for this? There seems to be miss understanding of what the goals here are. Lets get this cleared up in a design doc before we continue iterating here,

cc: @antalakas

@JohnMoutafis
Copy link
Contributor Author

@Shelnutt2 I have added you in the relevant ticket


vfs = tiledb.VFS(config=config)
# List local folder
input_ls: List[str] = vfs.ls(input_uri)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is also a recursive=True flag available, I think this might be needed to list the whole hierarchy below?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The recursive vfs.ls flag lists every file AND folder in the input_uri.
The upload process does the recursion internally creating a Group for every sub-folder and then uploading the folder level listed files in it.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@JohnMoutafis can you check the performance of the current implementation?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Using the following snippet:

def vfs_ls_recursive(uri):
    vfs = tiledb.VFS()
    files = []
    for fname in vfs.ls(uri):
        files.append(fname)
        if vfs.is_dir(fname):
            files += vfs_ls_recursive(fname)
    return files

to time the implementation's manual recursion against vfs.ls(<folder URI>, recursive=True) and vfs.ls_recursive on a local folder containing sub-folders, got the following results:

  • vfs_ls_recursive: 1.18 ms ± 24.4 µs per loop (mean ± std. dev. of 100 runs, 1,000 loops each)
  • vfs.ls(<folder URI>, recursive=True): 1.95 ms ± 121 µs per loop (mean ± std. dev. of 100 runs, 1,000 loops each)
  • vfs.ls_recursive: 1.96 ms ± 35.4 µs per loop (mean ± std. dev. of 100 runs, 1,000 loops each)

Timings were collected using IPython's %timeit for 100 runs of 1000 loops per run.
The solutions seem comparable.

@Shelnutt2 Shelnutt2 requested a review from sgillies September 11, 2024 11:15
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants