convert platybrowser data to zarr #1
Comments
@tischi : want to add me to this repo so I can assign myself? |
Do you have any info on the sizes of all these volumes? I'm running a recursive |
That's the issue with these millions of small files...doing anything with them but lazy loading chunks is not much fun. |
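For future reference, a minimal sketch of how the file count and total size of a locally mounted volume could be gathered in one pass. This is not the command used above (that one did not survive in this thread); `summarize_tree` is a hypothetical helper name and it assumes the `.n5` container is reachable as an ordinary directory.

```python
import os


def summarize_tree(root):
    """Return (file_count, total_bytes) for all regular files under root."""
    file_count = 0
    total_bytes = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            # Skip symlinks so that linked data is not counted twice.
            if os.path.islink(path):
                continue
            file_count += 1
            total_bytes += os.path.getsize(path)
    return file_count, total_bytes
```

With millions of small chunk files the walk itself will be slow, so it is probably worth running it per resolution layer rather than on the whole container.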
...running now |
and the other one is much smaller. |
Doh. My command was done after my 🏃 :
|
I am curious how long it will take to copy and zarrify this. It might be interesting to time it for future reference. |
The transfer is progressing extremely slowly. Do you have a small example dataset I could use to write a script and then you could transform locally? Edit: actually, I'm now getting permission denied when I try to access s3.embl.de! |
The myosin data set in the list above is small.
Interesting, this may be related to this: |
No luck:
|
I will write IT... |
We also have the exact same data on a file system. |
Yeah, if you can provide me a small- to mid-size download, I'll get started on a script and/or docker you can run. |
Just to keep track of what I do, in case I have to repeat it:
Note: it is important to add the root folder to the upload destination @joshmoore |
sbem-6dpf-1-whole-segmented-cells.n5
started...
I think it finished:
Took 6.5 hours, seems to have arrived.
TODO:
|
Based on the above experiment, if I extrapolate how long it would take to upload the 3D volume EM raw data using
Any thoughts? |
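A rough sketch of the extrapolation, assuming transfer time scales linearly with the amount of data. Only the 6.5 hours comes from the experiment above; the function name and the size ratio in the usage line are hypothetical, since the actual volume sizes are not stated in this thread.

```python
def estimate_transfer_hours(reference_bytes, reference_hours, target_bytes):
    """Linearly scale a measured transfer time to a different data size.

    Assumption: throughput stays the same as in the reference transfer.
    """
    if reference_bytes <= 0:
        raise ValueError("reference_bytes must be positive")
    return reference_hours * target_bytes / reference_bytes


# Hypothetical example: a volume 20 times larger than the one that took 6.5 hours.
hours = estimate_transfer_hours(reference_bytes=1, reference_hours=6.5, target_bytes=20)
```

With many small chunk files the per-file overhead may dominate, so scaling by file count instead of bytes (same function, different units) could give a more realistic number.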
@constantinpape @martinschorb |
One idea could be to start several copy processes, e.g., parallelising over the resolution layers:
I would think both our local file system and Josh's receiving S3 storage should handle 10 parallel processes. |
I found that it was much faster from a 3dcloud VM than from the cluster. But that could be specific to the network connectivity to the S3 machines. |
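A minimal sketch of the idea of parallelising over resolution layers. The real upload command is not reproduced here, so `shutil.copytree` is only a local stand-in for it; `copy_layer`, `copy_layers_in_parallel` and the layer names are hypothetical.

```python
import shutil
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def copy_layer(src_root, dst_root, layer):
    """Copy one resolution layer directory (e.g. 's0').

    Stand-in for the actual transfer command, which is an assumption here.
    """
    src = Path(src_root) / layer
    dst = Path(dst_root) / layer
    shutil.copytree(src, dst, dirs_exist_ok=True)
    return layer


def copy_layers_in_parallel(src_root, dst_root, layers, max_workers=10):
    """Run one copy job per resolution layer, at most max_workers at a time."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [
            pool.submit(copy_layer, src_root, dst_root, layer) for layer in layers
        ]
        # result() re-raises any exception from a worker, so failures are not silent.
        return [future.result() for future in futures]
```

Since the layers differ a lot in size (s0 is by far the largest), splitting the biggest layer further into subfolders would probably balance the jobs better than one job per layer.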
I think about a day. I used a cluster node (gpu6 or 7 probably). |
I am not sure if this is helpful, but I could also convert the data to zarr on the EMBL side. |
I think this is very interesting indeed, but @joshmoore should comment, because I don't know whether he needs some specific zarr flavour. |
Not immediately, unless you want to also try tar'ing it up.
@constantinpape : if you want to kick off a |
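In case tar'ing is tried: a small sketch of packing one resolution layer into a single archive, so that the transfer deals with one large file instead of many small ones. `tar_directory` is a hypothetical helper, and the archive should be written outside the directory being packed.

```python
import tarfile
from pathlib import Path


def tar_directory(src_dir, archive_path):
    """Pack src_dir into an uncompressed tar archive at archive_path."""
    src_dir = Path(src_dir)
    with tarfile.open(archive_path, "w") as tar:
        # arcname keeps only the directory name, not the full local path.
        tar.add(src_dir, arcname=src_dir.name)
    return archive_path
```

The archive is left uncompressed here on the assumption that the chunks are already compressed, so gzip would mostly cost time.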
I think I'll just start it, resolution layer by resolution layer... |
Edit: Sorry, I wrote this before Tischi's last comment. If you want to do it, Tischi, go ahead.
|
s9
This finished instantly...
|
👍 for however it happens but the equivalent, yeah. 👍 |
@joshmoore ok, let's see if we can get Tischi's conversion to run first and then have this as a fallback.
I can't log into VPN right now, will check later. |
@tischi s9 has exactly one chunk, which is 41kb, so I would expect it to copy almost immediately:
|
|
Here's a quick script which looks to be working locally. I'm unsure if setups are always channels for this data and if there's ever more than one channel and/or setup.
|
For now we always have a single setup, corresponding to a single channel. |
I don't know if this is a problem in
whereas if I edit the file I get:
|
I have written these files with Did this maybe change recently to be more in line with the zarr group metadata? (It shouldn't without changing major version because I think this would be a breaking change.) Or is it just a bug in the Anyway, for now we can fix it by adding the attributes to find the underlying issue. |
@joshmoore Note that I am using |
Yes. zarr-developers/zarr-python#651
👍 I'll look more tomorrow.
It previously wasn't on the zarr side, so in the ome-zarr spec it's prevented. I agree! I'd very much like to move to nested storage in the next version bump. |
Ok, I updated it to support the flat chunk hierarchy. |
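To make the flat vs. nested distinction concrete, here is a small sketch of how the chunk keys relate. It is not the conversion script itself; the helper names are hypothetical, and it assumes that N5 orders the chunk grid position as x/y/z while zarr (C order) uses z.y.x.

```python
def flat_to_nested(chunk_id):
    """Flat zarr chunk key '0.1.2' -> nested key '0/1/2'."""
    return chunk_id.replace(".", "/")


def nested_to_flat(chunk_id):
    """Nested chunk key '0/1/2' -> flat zarr chunk key '0.1.2'."""
    return chunk_id.replace("/", ".")


def n5_to_zarr_flat(n5_chunk_id):
    """N5 chunk path 'x/y/z' -> flat zarr key 'z.y.x'.

    Assumption: the axis order is reversed between the two formats.
    """
    return ".".join(reversed(n5_chunk_id.split("/")))
```

With the flat layout every chunk of a resolution level ends up in one directory (or under one key prefix), which is where the "millions of small files" problem gets worse.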
@joshmoore
Does it work for you? |
Ah, possibly. I've canceled my
|
@joshmoore |
I think you are right that the lower paths are struggling under the number of subelements. Certainly listing the top .n5 works (--> |
@joshmoore
From the results I hope to deduce what has been copied already such that I do not start the sync in more subfolders than necessary. |
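A sketch of how the two listings could be compared, assuming both are available as plain lists of relative keys such as `s0/0/0/1`. The function names are hypothetical, and how the listings are produced is left open.

```python
from collections import Counter


def missing_by_subfolder(source_keys, copied_keys):
    """Count, per top-level subfolder, the source keys that are not yet copied."""
    missing = set(source_keys) - set(copied_keys)
    return dict(Counter(key.split("/", 1)[0] for key in missing))


def subfolders_to_sync(source_keys, copied_keys):
    """Return only the top-level subfolders that still have missing keys."""
    return sorted(missing_by_subfolder(source_keys, copied_keys))
```

Only the subfolders returned by `subfolders_to_sync` would then need another sync run.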
@tischi Sure! output
|
@joshmoore
any ideas? |
@joshmoore
Maybe the server is kind of down? |
@joshmoore
All those xml point to n5s3 datasets: https://github.com/mobie/platybrowser-datasets/tree/master/data/1.0.1/images/remote
Within the xml you can see all information needed to access the object in the bucket.
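A minimal sketch of pulling that information out of one of the xml files. The element names (`Key`, `SigningRegion`, `ServiceEndpoint`, `BucketName`) and the sample values are assumptions about the layout, not copied from the repository, so they should be checked against a real file; `read_s3_location` is a hypothetical helper.

```python
import xml.etree.ElementTree as ET

# Assumed layout of the ImageLoader section; all values are placeholders.
SAMPLE_XML = """<SpimData version="0.2">
  <SequenceDescription>
    <ImageLoader format="bdv.n5.s3">
      <Key>example/path/image.n5</Key>
      <SigningRegion>us-west-2</SigningRegion>
      <ServiceEndpoint>https://s3.example.org</ServiceEndpoint>
      <BucketName>example-bucket</BucketName>
    </ImageLoader>
  </SequenceDescription>
</SpimData>"""


def read_s3_location(xml_text):
    """Return the child elements of the first ImageLoader as a dict."""
    root = ET.fromstring(xml_text)
    loader = root.find(".//ImageLoader")
    if loader is None:
        raise ValueError("no ImageLoader element found")
    location = {child.tag: (child.text or "").strip() for child in loader}
    location["format"] = loader.get("format", "")
    return location
```

Running this over all files in the `remote` folder would give the list of bucket keys to convert.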
It would be cool to have those converted to zarr: