Upload folder implementation. #618
src/tiledb/cloud/files/utils.py
Outdated
namespace, name = utils.split_uri(output_uri)
_, sp, acn = groups._default_ns_path_cred(namespace=namespace)

storage_path = name if name.startswith("s3://") else sp
Why are we treating Amazon URIs specially here?
I am now checking whether the name contains "://". Will that suffice?
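For illustration, the scheme-agnostic check described in this reply could look roughly like the sketch below. It is not the PR's code: `choose_storage_path` is a hypothetical helper, and it assumes that any name containing `://` is already a full storage URI, while anything else falls back to the namespace default path.

```python
def choose_storage_path(name: str, default_path: str) -> str:
    """Return `name` when it already looks like a full URI, else the default.

    Hypothetical helper for illustration only. The check does not single
    out any provider, so s3://, gs://, azure:// and file:// names are all
    treated the same way.
    """
    return name if "://" in name else default_path


# A full URI is kept as-is; a bare name falls back to the default path.
explicit = choose_storage_path("s3://bucket/folder", "s3://default/path")
fallback = choose_storage_path("my_folder", "s3://default/path")
```

Note that a plain substring test also matches `://` appearing in the middle of a name; checking for a leading scheme would be stricter, if that case matters.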
src/tiledb/cloud/files/utils.py
Outdated
    )
    uploaded += 1
except Exception as exc:
    upload_errors[fname] = str(exc)
Why are we turning the exception into a string here? That strips away any useful information it might have carried.
I prefer the result to be serializable.
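One possible middle ground between the two positions is to record structured but still serializable details instead of only `str(exc)`. The following is a minimal sketch under that assumption; `describe_exception` is a hypothetical helper that is not part of the PR, and the `FileNotFoundError` merely stands in for a failed upload.

```python
import json
import traceback
from typing import Dict


def describe_exception(exc: BaseException) -> Dict[str, str]:
    """Flatten an exception into JSON-friendly strings (hypothetical helper)."""
    return {
        "type": type(exc).__name__,
        "message": str(exc),
        # Full formatted traceback, so the origin of the failure is not lost.
        "traceback": "".join(
            traceback.format_exception(type(exc), exc, exc.__traceback__)
        ),
    }


upload_errors: Dict[str, Dict[str, str]] = {}
try:
    raise FileNotFoundError("missing.txt")  # stand-in for a failed upload
except Exception as exc:
    upload_errors["missing.txt"] = describe_exception(exc)

# The collected errors can still be serialized as plain JSON.
serialized = json.dumps(upload_errors)
```

The result keeps the exception type and traceback for debugging while remaining a plain dictionary of strings.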
Do we have a design doc for this? There seems to be a misunderstanding of what the goals are here. Let's get this cleared up in a design doc before we continue iterating. cc: @antalakas
@Shelnutt2 I have added you in the relevant ticket
vfs = tiledb.VFS(config=config)
# List local folder
input_ls: List[str] = vfs.ls(input_uri) |
There is also a recursive=True flag available. I think this might be needed to list the whole hierarchy below?
The recursive vfs.ls flag lists every file AND folder under input_uri.
The upload process does the recursion internally, creating a Group for every sub-folder and then uploading the files listed at that folder level into it.
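The per-folder recursion described in this reply can be sketched as follows. This is an illustration under stated assumptions, not the PR's implementation: `upload_folder_sketch` is a hypothetical name, `os.scandir` stands in for the non-recursive `vfs.ls`, and appending to a list stands in for the real group-creation and file-upload calls.

```python
import os
import tempfile
from typing import List, Tuple


def upload_folder_sketch(folder: str, group: str, log: List[Tuple[str, str]]) -> None:
    """Walk `folder` one level at a time, mirroring it as nested groups."""
    log.append(("create_group", group))
    for entry in sorted(os.scandir(folder), key=lambda e: e.name):
        if entry.is_dir():
            # A sub-folder becomes a nested group; recurse into it.
            upload_folder_sketch(entry.path, f"{group}/{entry.name}", log)
        else:
            # A file is uploaded into the group of the folder that lists it.
            log.append(("upload_file", f"{group}/{entry.name}"))


# Build a tiny local tree: a.txt and sub/b.txt.
with tempfile.TemporaryDirectory() as root:
    os.mkdir(os.path.join(root, "sub"))
    for rel in ("a.txt", os.path.join("sub", "b.txt")):
        open(os.path.join(root, rel), "w").close()
    actions: List[Tuple[str, str]] = []
    upload_folder_sketch(root, "demo", actions)
```

Because each level is listed separately, a flat recursive listing is not needed: the folder structure is recovered from the recursion itself.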
@JohnMoutafis can you check the performance of the current implementation?
Using the following snippet:
def vfs_ls_recursive(uri):
    vfs = tiledb.VFS()
    files = []
    for fname in vfs.ls(uri):
        files.append(fname)
        if vfs.is_dir(fname):
            files += vfs_ls_recursive(fname)
    return files
to time the implementation's manual recursion against vfs.ls(<folder URI>, recursive=True) and vfs.ls_recursive on a local folder containing sub-folders, got the following results:
vfs_ls_recursive: 1.18 ms ± 24.4 µs per loop (mean ± std. dev. of 100 runs, 1,000 loops each)
vfs.ls(<folder URI>, recursive=True): 1.95 ms ± 121 µs per loop (mean ± std. dev. of 100 runs, 1,000 loops each)
vfs.ls_recursive: 1.96 ms ± 35.4 µs per loop (mean ± std. dev. of 100 runs, 1,000 loops each)
Timings were collected using IPython's %timeit for 100 runs of 1000 loops per run.
The solutions seem comparable.
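Outside IPython, a similar runs-of-loops measurement can be approximated with the standard `timeit` module. The sketch below is illustrative only and does not reproduce the figures above: it times a hand-written recursion against `os.walk` on a temporary local folder as stand-ins for the VFS calls, the helper names are hypothetical, and it uses far fewer runs and loops.

```python
import os
import statistics
import tempfile
import timeit
from typing import Callable, List


def ls_manual(path: str) -> List[str]:
    """Hand-written recursion, analogous to the snippet in this comment."""
    found = []
    for entry in os.scandir(path):
        found.append(entry.path)
        if entry.is_dir():
            found += ls_manual(entry.path)
    return found


def ls_walk(path: str) -> List[str]:
    """Built-in recursive traversal, standing in for a recursive listing."""
    found = []
    for root, dirs, files in os.walk(path):
        found += [os.path.join(root, name) for name in dirs + files]
    return found


def per_loop_seconds(fn: Callable[[str], List[str]], path: str,
                     runs: int = 5, loops: int = 20) -> List[float]:
    """Per-loop time for each run, following the runs-of-loops scheme."""
    totals = timeit.repeat(lambda: fn(path), repeat=runs, number=loops)
    return [total / loops for total in totals]


with tempfile.TemporaryDirectory() as root:
    os.makedirs(os.path.join(root, "a", "b"))
    open(os.path.join(root, "a", "f.txt"), "w").close()
    # Both approaches should list the same entries before being timed.
    same_listing = sorted(ls_manual(root)) == sorted(ls_walk(root))
    for fn in (ls_manual, ls_walk):
        samples = per_loop_seconds(fn, root)
        print(fn.__name__, statistics.mean(samples), statistics.stdev(samples))
```

Checking that both functions return the same entries before timing them guards against comparing listings that differ in scope.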
…/implement-upload-of-local-folder
Upload a local folder, utilizing the upload_file method.