tiledb #120

rabernat · 2018-02-20T19:40:39Z

As long as we are discussing cloud storage formats, it seems like we should be looking at TileDB:
https://www.tiledb.io/
https://github.com/TileDB-Inc/TileDB

TileDB manages data that can be represented as dense or sparse arrays. It can support any number of dimensions and store in each array element any number of attributes of various data types.

If it lives up to the promised goals, TileDB could really be the solution to our storage problems.

cc @kaipak, @sirrice

jhamman · 2018-02-20T23:58:23Z

@rabernat - I've seen this before too and it does look quite promising. I get the impression from reading the documentation that the Python API is not quite ready to go yet. In theory though, this could be yet another backend for xarray.

stavrospapadopoulos · 2018-02-23T16:28:57Z

Hello folks. We just released a new version of TileDB, which has optimized support for AWS S3 and a Python API that uses NumPy (there is a lot of info and examples in our docs: https://docs.tiledb.io/). We would love to get some feedback from you.

cc @jakebolewski, @tdenniston

mrocklin · 2018-02-23T16:47:59Z

Hi @stavrospapadopoulos ! Thanks for chiming in.

Do you happen to have a demo TileDB deployment somewhere publicly available that we could hit to try things out?

rabernat · 2018-02-23T17:42:41Z

Thanks @stavrospapadopoulos for sharing this information! TileDB looks like a promising possible storage layer for xarray / pangeo.

I have read through the documentation and have a couple of questions that I hope you can clarify whenever you might have the time:

Technical Questions

1. Does TileDB use a client / server model? This was my impression when I saw the word "DB", but, looking more closely at the documentation, it appears that it is more like a [virtual] file format + an API for accessing those [virtual] files (similar to HDF5 or zarr).
2. What sort of indexing is supported on the array-like objects returned by tiledb (e.g. the dense_array object in this example)?
3. Does TileDB support distributed reads / writes on the same array (either via a shared filesystem or an S3 bucket) from multiple processes on a distributed cluster? If locking is required, does the python API provide a way for the user to pass a shared lock object?
4. Does the python API support asynchronous I/O?
5. The docs contain lots of examples, but can you point me to the complete python API documentation?
6. Any estimate on when the conda-forge package will be available? (This greatly lowers the bar for the python community to try something new.)

Social Questions

1. Can you point us towards some projects that are currently using TileDB?
2. How does TileDB compare to zarr, our current library of choice for chunked / compressed array storage?
3. Is anyone from your team interested in collaborating more directly on testing TileDB in the context of dask / xarray?

rabernat · 2018-02-23T18:01:44Z

Oh and one more technical question

Is there a way to store array metadata in TileDB (e.g. units or other arbitrary attributes), as in HDF5, netCDF, zarr, etc?

stavrospapadopoulos · 2018-02-23T22:26:52Z

@jakebolewski, @tdenniston will chime in (probably next week), but here is my take.

Responses to Technical Questions

1. TileDB is an embeddable library (written in C and C++, more similar to HDF5 than zarr) that comes with a Python wrapper. You should truly think of TileDB as an alternative to HDF5, with (among some other important things) the extra capability of storing sparse arrays in addition to dense and also supporting a wide variety of backends in addition to posix filesystems (currently HDFS and S3, more in the future).
   - About “DB” in the name: when we started working on TileDB at MIT, we thought that we would be an alternative to SciDB and follow its client/server architecture. We eventually decided to make TileDB a lightweight embeddable library (although we are exploring implementing some client/server functionality along with a REST API). We kept the name because people already knew the software as TileDB and because, well, we liked it. :)
2. The Python API currently supports only positional indexing. We assume though that you want this, so this is where we are going. Please stay tuned.
3. Absolutely, this is the big strength of TileDB.
   - First, TileDB supports both concurrent reads and writes (even interleaved) without any locking. Please check our concurrency and consistency model. For any operation where locking is required (e.g., creating an array), we implement our own locking (via mutexes, as well as file locks for the backends that support them) on the C++ side.
   - Second, note that TileDB implements its own tight integration with the various backends via C/C++ SDKs (e.g., libhdfs for HDFS and AWS C++ SDK for S3), and its own LRU cache. We do not rely on another package (e.g., s3fs like zarr). This enables us to do some pretty cool stuff (performance-wise) under the covers (e.g., we use our own async IO internally, we are currently implementing another parallel functionality, etc). You may also want to understand how our updates work, which is ideal for append-only backends like S3. So, TileDB is designed and continuously optimized for extreme parallelism for distributed backends, even beyond posix.
4. Our C and C++ APIs do, but we haven’t wrapped this for Python yet. It is trivial to do so for a NULL callback, but we need more time to make it work for arbitrary Python functions (e.g., lambdas). This is certainly on our roadmap.
5. There is a menu item API Reference in the docs. Here are the Python API docs.
6. We have already started working on this and will publish it very soon. For now, the easiest way to start playing is:

docker pull tiledb/tiledb:1.2.0
docker run -it tiledb/tiledb:1.2.0
python3
import tiledb

@mrocklin we will put together a demo very soon as well.
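
For intuition about the lock-free update model mentioned in the answer on concurrent reads and writes above, here is a toy sketch. It assumes that every write lands in a new, uniquely named, immutable object and that readers overlay those objects in timestamp order. `ToyFragmentStore` is a hypothetical in-memory stand-in written for this illustration; it is not TileDB's API or its actual implementation.

```python
import time
import uuid


class ToyFragmentStore:
    """Toy model: each write creates a new immutable 'fragment'."""

    def __init__(self):
        # name -> {cell coordinates: value}; stands in for an object store
        self.objects = {}

    def write(self, cells, timestamp_ms=None):
        """Store one batch of cell updates under a fresh, unique name."""
        ts = int(time.time() * 1000) if timestamp_ms is None else timestamp_ms
        name = f"__{ts}_{ts}_{uuid.uuid4().hex}"
        # Existing objects are never modified, so writers need no shared lock.
        self.objects[name] = dict(cells)
        return name

    def read(self):
        """Overlay all fragments in timestamp order; later writes win."""
        merged = {}
        for name in sorted(self.objects, key=lambda n: int(n.split("_")[2])):
            merged.update(self.objects[name])
        return merged


store = ToyFragmentStore()
store.write({(0, 0): 1.0, (0, 1): 2.0}, timestamp_ms=1)
store.write({(0, 1): 9.0}, timestamp_ms=2)   # overwrites one cell only
print(store.read())
```

Because a write only ever adds an object, this pattern also fits stores such as S3 where objects cannot be modified in place.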

Yes. Please take a look at our key-value store functionality and a Python example. You can attach a key-value store to an array (say, with URI /path/to/array/), by storing it in URI, say, /path/to/array/meta. The C/C++ implementation is very flexible (and much more powerful than HDF5's attributes), as it allows you to store any type of keys (not just strings) and any number of attributes of arbitrary types as values (not just strings), inheriting all the benefits from TileDB arrays. Currently the Python API supports only string keys/values, but we will extend it to support equivalent functionality to C/C++ very soon.

Responses to Social Questions

1. TileDB is used at the core of GenomicsDB, which is the product of a collaboration between Intel HLS and the Broad Institute (which we started when I was still at Intel). GenomicsDB is officially a part of GATK 4.0. We are currently working with the Oak Ridge National Lab on another genomics project. We built a LiDAR data adaptor for Naval Postgraduate School (which we will release pretty soon). We have also started a POC with the JGI on yet another genomics project. We would love to see what value TileDB can bring to a use case like pangeo :).
2. TileDB is very similar to zarr in that respect. Some of the key differences: (i) TileDB is built exclusively in C and C++, which allows us to bind it to other high-level languages beyond Python as well (e.g., our Java bindings are coming soon, and we are starting work on R, which is very popular in genomics), (ii) TileDB natively supports sparse arrays with a powerful new format, and (iii) TileDB builds its own tight integration with the storage backends, rather than relying on generic libraries like s3fs, which allows us to do some nice low-level optimizations. We would be very interested to compare TileDB vs zarr on your workloads though.
3. Absolutely! We love what you guys do and we will benefit enormously from your feedback. Please let me know if you would like to start an email thread ({stavros,jake,tyler}@tiledb.io).

mrocklin · 2018-02-23T22:58:04Z

The Python API currently supports only positional indexing. We assume though that you want this, so this is where we are going. Please stay tuned.

I suspect that positional indexing may not be required. XArray is accustomed to dealing with data stores that don't support this (like numpy itself). It has logic to deal with it. I suspect that the desired answer is instead "anything numpy can provide, including integers, slices (of all varieties), lists, or boolean numpy arrays" or some subset of that.
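
As a rough illustration of the index kinds listed above, here is how plain NumPy behaves for each of them. This is only a sketch on a small in-memory array; the variable names are made up for the example and nothing here calls TileDB.

```python
import numpy as np

# A small in-memory stand-in for an array held by some storage backend.
data = np.arange(20).reshape(4, 5)

by_int = data[2]             # integer: row 2
by_slice = data[1:3, ::2]    # slices, one with a step: rows 1-2, every other column
by_list = data[[0, 3]]       # list of integers: rows 0 and 3
mask = data[:, 0] >= 10      # boolean array derived from the first column
by_mask = data[mask]         # boolean mask: rows whose first element is >= 10

print(by_int.shape, by_slice.shape, by_list.shape, by_mask.shape)
```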

@mrocklin we will put together a demo very soon as well.

The ideal solution from our perspective would be some publicly-available data on a GCS bucket (we could probably front the cost for this) and an easy way to install TileDB, ideally with a conda package. We could also set up a smaller example to run from S3, but are unlikely to host.

Conda

I'm biased here (I am employed by Anaconda) but I strongly recommend creating a conda package if you're looking for adoption within the broader numeric Python community.

One way to do this is through conda-forge, which operates a build farm with linux/windows/mac support.
This is a community group that is very friendly and supportive. As an example, here are some links for HDF5 and H5Py which I'm guessing is similar-to but more-complex than your situation:

stavrospapadopoulos · 2018-02-23T23:17:13Z

I suspect that positional indexing may not be required. XArray is accustomed to dealing with data stores that don't support this (like numpy itself). It has logic to deal with it. I suspect that the desired answer is instead "anything numpy can provide, including integers, slices (of all varieties), lists, or boolean numpy arrays" or some subset of that.

Correct. Currently we support integers and slices, and we are working on lists and boolean numpy arrays.

The ideal solution from our perspective would be some publicly-available data on a GCS bucket (we could probably front the cost for this) and an easy way to install TileDB, ideally with a conda package. We could also set up a smaller example to run from S3, but are unlikely to host.

As I mentioned above, TileDB provides tight integration with the backends, which means that we implement our own low-level IO functionality using the backend's C or C++ SDK. The new release has AWS S3 support. GCS is in our roadmap, but it will require quite some work to tightly integrate with it using its SDK. Do you prefer GCS to S3? Please note that we have no bias - we will eventually support both.

As a first step, we can certainly provide you with access to an S3 bucket + a conda package.

I'm biased here (I am employed by Anaconda) but I strongly recommend creating a conda package if you're looking for adoption within the broader numeric Python community.

As mentioned above, we are almost there. :)

mrocklin · 2018-02-23T23:22:53Z

Do you prefer GCS to S3?

The software packages have no preference.

But the particular distributed deployment of http://pangeo.pydata.org/ runs on GCP, which would make it trivial for folks to try running scalability and performance tests without paying data transfer costs (though short term I don't think that these will amount to much).

On a personal/OSS note I'd like to push people to support more than the dominant cloud vendor.

mrocklin · 2018-02-23T23:27:03Z

Having data on some cloud accessible system would make it easy to repeat this experience: https://youtu.be/rSOJKbfNBNk

mrocklin · 2018-02-23T23:27:24Z

Also cc @llllllllll who has been interested in comparing array storage systems

stavrospapadopoulos · 2018-02-23T23:38:55Z

OK, all this sounds good. We will set something up on S3 so that we can get some feedback, and we can run detailed benchmarks once we have the GCS integration. Thanks for taking a look!

mrocklin · 2018-02-23T23:39:49Z

One can also run dask/xarray feasibility studies and benchmarks on a single machine. It's just a bit less compelling due to the wealth of options with a local disk

stavrospapadopoulos · 2018-02-23T23:41:40Z

On a side note, this is our up-to-date repo, which we maintain and further develop at TileDB-Inc:
https://github.com/TileDB-Inc/TileDB

Not to be confused with the one that I used to develop at Intel Labs:
https://github.com/Intel-HLS/TileDB

mrocklin · 2018-02-23T23:51:26Z

Good to know. I suspect that this is due to Intel-HLS/TileDB#72 . Want me to resubmit?

stavrospapadopoulos · 2018-02-23T23:52:20Z

No worries, fixed (with thanks).

rabernat · 2018-02-26T17:07:45Z

Thanks so much for your quick and detailed replies to our questions. At this point, it is clear to me that we definitely need to evaluate TileDB as a potential backend for xarray.

I imagine the first step will be to evaluate TileDB with dask. Perhaps @mrocklin can suggest the best approach to that.

The main practical obstacle for testing with xarray will be the need for a new xarray backend. I see two main options:

1. Duplicate / modify the zarr backend. This involves about 500 lines of code plus another 200 of tests. (Note that the backend code is due for a refactor; see pydata/xarray#1087, "WIP: New DataStore / Encoder / Decoder API for review".)
2. Create a tiledbnetcdf project, similar to h5netcdf. This would essentially wrap TileDB with the same API as the netcdf4-python library, allowing xarray to use it as a storage backend with less (but still some) new backend code.

I think option 2 is attractive for several reasons. Geosciences are potentially one of the biggest sources of TileDB-type data (e.g. NASA is putting > 300 PB of data into the cloud over the next 5 years). NetCDF is already a familiar interface for our community, so this might significantly lower the bar for the broader community.

The pangeo project is stretched pretty thin right now in terms of developer time. So the main challenge will be to find someone to work on this. Maybe @sirrice can help us get a Columbia CS student interested.

mrocklin · 2018-02-26T17:11:05Z

Testing locally with dask is probably pretty easy, assuming that TileDB exposes a Python object that supports slicing (which I think we've already established that it does). Relevant dask docs here: http://dask.pydata.org/en/latest/array-creation.html#numpy-slicing
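
For intuition about why slicing support is enough: a chunked reader only ever asks the store for rectangular blocks through `__getitem__`. The sketch below mimics that access pattern with the standard library. It is not dask's implementation, and `chunk_slices` and `RecordingStore` are hypothetical names used only here.

```python
from itertools import product


def chunk_slices(shape, chunks):
    """Yield one tuple of slices per block of a regular chunk grid."""
    per_axis = [
        [slice(start, min(start + step, size)) for start in range(0, size, step)]
        for size, step in zip(shape, chunks)
    ]
    yield from product(*per_axis)


class RecordingStore:
    """Stand-in for a storage object: it only needs a shape and slicing."""

    def __init__(self, shape):
        self.shape = shape
        self.requests = []

    def __getitem__(self, key):
        # A real store would return the block's data; here we only log the request.
        self.requests.append(key)
        return key


store = RecordingStore((4, 6))
for block in chunk_slices(store.shape, chunks=(2, 3)):
    store[block]

print(len(store.requests), "block reads")
```

Each logged request is independent of the others, which is what lets a scheduler hand the blocks to different workers.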

rabernat · 2018-02-27T14:27:10Z

cc @kaipak, who may be interested in this discussion

kaipak · 2018-02-27T14:42:49Z

Yep, I've been tagging along for the discussion :).

stavrospapadopoulos · 2018-02-27T16:17:20Z

Before we consider integrating TileDB with xarray or building a netcdf wrapper for TileDB, we suggest we perform some simple experiments to evaluate TileDB's performance on parallel writes and reads on nd-arrays using dask.

We need to start with the following questions:

What are we comparing with (e.g., zarr, hdf5, netcdf)?
What is the format of the input data to be ingested to TileDB and the other approaches? One idea would be to generate some nd-array data and store them in some simple CSV or flat binary format. We are open to any suggestions here.
Where will the input and ingested data be stored? We suggest two experiments: (i) both locally, (ii) both on S3. In the future we can test with more backends.

mrocklin · 2018-02-27T17:19:21Z

What are we comparing with (e.g., zarr, hdf5, netcdf)?

On normal file systems NetCDF4 on HDF5 is probably the thing to beat, at least for the science domains that interest the community active on this issue tracker (which is sizable). On cloud-based systems there is no obvious contender. I wrote about cloud options and concerns here: http://matthewrocklin.com/blog/work/2018/02/06/hdf-in-the-cloud

What is the format of the input data to be ingested to TileDB and the other approaches? One idea would be to generate some nd-array data and store them in some simple CSV or flat binary format. We are open to any suggestions here.

I don't think that this matters much. This community's workloads are typically write-once read-many. Data tends to be delivered in the form of NetCDF4 files. I don't think that people care strongly about the cost-to-convert. I think that the broader point is that you have to convert, which is a negative for any format other than NetCDF.

Where will the input and ingested data be stored? We suggest two experiments: (i) both locally, (ii) both on S3. In the future we can test with more backends.

Those are both common. So too are large parallel POSIX file systems.

mrocklin · 2018-02-27T17:40:58Z

Also, to be clear, if your goal is to gain adoption then I suspect that demonstrating modest factor speed improvements is not sufficient. Disk IO isn't necessarily a pain point, and the inertia to existing data formats is very high. Instead I think you would need to go a bit more broadly and demonstrate that various workflows now become feasible where they were not feasible before. Parallel and random access into cloud-storage is one such workflow, but there are likely others.

rabernat · 2018-02-27T17:49:05Z

If you are looking for an appropriate test dataset, I would recommend something from NASA. For example:
GHRSST Level 4 MUR Global Foundation Sea Surface Temperature Analysis (v4.1)

A single file (e.g. ftp://podaac-ftp.jpl.nasa.gov/allData/ghrsst/data/GDS2/L4/GLOB/JPL/MUR/v4.1/2018/057/20180226090000-JPL-L4_GHRSST-SSTfnd-MUR-GLOB-v02.0-fv04.1.nc) is 377 MB, and there is one every day for the past 16 years, ~2.2 TB total. Any reasonable subset of this will make a good test dataset.
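
The quoted total can be checked with quick arithmetic, assuming exactly one 377 MB file per day for 16 years (about four of them leap years) and decimal units. The real archive will differ somewhat.

```python
file_size_mb = 377
days = 16 * 365 + 4                          # 16 years of daily files, about four leap days
total_tb = file_size_mb * days / 1_000_000   # decimal units: 1 TB = 10**6 MB
print(f"{days} files, about {total_tb:.1f} TB")
```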

Simple yet representative things our users might want to do are:

- get a timeseries at a single point
- aggregations
  - global mean / variance of SST as a function of time
  - mean / variance over longitude and time (i.e. zonal mean)
  - monthly climatology
- convolution (smoothing) over spatial dimensions
- multidimensional Fourier transforms

There are of course more complicated workflows, but these are representative of the I/O bound ones.
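
To make a few of the access patterns listed above concrete, here is a minimal NumPy sketch on a small synthetic array. The (time, lat, lon) layout, the sizes, and the random values are assumptions for illustration and are not the MUR dataset. The sketch also ignores area weighting, which a real global mean would need.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic "SST" cube with dimensions (time, lat, lon).
sst = rng.normal(loc=290.0, scale=2.0, size=(24, 18, 36))

point_series = sst[:, 9, 18]          # timeseries at a single grid point
global_mean = sst.mean(axis=(1, 2))   # unweighted spatial mean as a function of time
zonal_mean = sst.mean(axis=(0, 2))    # mean over time and longitude, one value per latitude

# Monthly climatology, treating the 24 time steps as 2 years of monthly data.
climatology = sst.reshape(2, 12, 18, 36).mean(axis=0)

print(point_series.shape, global_mean.shape, zonal_mean.shape, climatology.shape)
```

The point timeseries reads a single value from every time step, while the global mean reads every value, which is why the chunk layout of the store matters so much for these workloads.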

mrocklin · 2018-02-27T17:50:41Z

Dask.array can do all of these computations. The question would be how things feel when Dask accesses data from TileDB when doing these computations.

stavrospapadopoulos · 2018-02-27T17:52:02Z

@mrocklin we are on the same page about adoption and TileDB data access using Dask.
@rabernat this is very helpful, we can start with that.

jreadey · 2018-02-27T18:38:00Z

@rabernat - I like the strategy you outlined. Would it make sense to create a repo with a set of performance benchmarks based on this data collection? Ideally it would enable various I/O backends (e.g. HDF5 files, Fuse to S3, zarr, hsds, etc.) to be plugged in. This would seem a good framework to evaluate performance for these common use cases.
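
One possible shape for such a pluggable benchmark, sketched with the standard library only. The registry, the `register` decorator, and the two dummy workloads are hypothetical placeholders. A real version would open the same dataset through HDF5, zarr, hsds, TileDB, and so on.

```python
import time

BACKENDS = {}  # backend name -> function that performs one read workload


def register(name):
    """Decorator that adds a workload function to the registry under `name`."""
    def wrap(func):
        BACKENDS[name] = func
        return func
    return wrap


@register("dummy-list")
def read_from_list():
    # Placeholder workload: sum a million integers held in a list.
    return sum(list(range(1_000_000)))


@register("dummy-range")
def read_from_range():
    # Placeholder workload: the same sum without materializing the list.
    return sum(range(1_000_000))


def run_benchmarks(repeats=3):
    """Return the best wall-clock time in seconds for each registered backend."""
    results = {}
    for name, func in BACKENDS.items():
        timings = []
        for _ in range(repeats):
            start = time.perf_counter()
            func()
            timings.append(time.perf_counter() - start)
        results[name] = min(timings)
    return results


for name, seconds in run_benchmarks().items():
    print(f"{name}: {seconds:.4f} s")
```

Adding a backend then means writing one decorated function, and the timing loop stays untouched.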

jhamman · 2018-02-27T18:49:35Z

@jreadey et al. I think a formal benchmarking repository would be a great contribution to the community (see also #45 and #5). This is something I'd be happy to see move forward and would eagerly participate in getting setup.

jreadey · 2018-02-27T19:05:35Z

@rabernat - I was planning on doing some benchmarking with hdf5lib vs s3vfd vs hsds. I can start with some of the simpler codes and then ask you for help when I get stuck!

For now I'm thinking to keep it to just a single client (i.e. without dask distributed).

Is the Sea Temperature data on S3? Alternatively, we could use the DCP-30 dataset: s3://nasanex/NEX-DCP30. Any issues with that?

rabernat · 2018-02-27T20:43:02Z

Would it make sense to create a repo with a set of performance benchmarks based on this data collection?

Yes, I really like this idea. A modular system would be ideal and would save us from writing a lot of boilerplate.

Ideally it would enable various I/O backends (e.g. HDF5 files, Fuse to S3, zarr, hsds, etc.) to be plugged in.

This is kind of what xarray already does! 😏 Unfortunately, we do not have direct xarray support for some of these storage layers (e.g. TileDB), so we can't just use xarray. We will end up reinventing some of xarray's backend logic, but I suppose that's acceptable.

Does it make sense to try to integrate airspeedvelocity or is that overkill?

@kaipak, a Pangeo intern here at Columbia, has some time to contribute to this benchmark project. @jreadey, it would be great if you two could work together.

scottyhq · 2019-08-14T21:41:35Z

Reopening this issue because I think there is a lot of great discussion and there are some new developments worth pointing out.

Thanks to @normanb TileDB support is now available in GDAL>3.0 (OSGeo/gdal#1402)! Documentation here: https://gdal.org/drivers/raster/tiledb.html.

The upcoming GDAL 3.1 release will also have better support for multidimensional raster data. See:

Once rasterio releases support gdal>3 (conda-forge/gdal-feedstock#326), this should provide an easy way to bring tileDB into an xarray dataset. See also rasterio/rasterio#1699.

I'd love to see a Zarr driver for GDAL as well, and given the overlap with TileDB, it seems like this would be fairly straightforward to implement (https://gdal.org/tutorials/raster_driver_tut.html). There has been a fair amount of discussion and interest on how to convert existing h5 and netcdf files to Zarr format (just one example here: #686). Maybe someone is already doing this? But it seems like there isn't much cross talk between the zarr, gdal, and rasterio communities. I'm happy to open some issues in the respective repos, but probably won't have time in the near future myself for a PR to GDAL.

pinging @jakebolewski, @shoyer, @jhamman, @davidbrochart, @rabernat, @lewismc

scottyhq · 2019-08-14T21:47:21Z

Also, a quick illustration of how to convert h5 to tiledb and read it with gdal:

conda create -n gdal3 gdal=3
wget https://github.com/OSGeo/gdal/raw/master/autotest/gdrivers/data/DeepBlue-SeaWiFS-1.0_L3_20100101_v004-20130604T131317Z.h5
gdal_translate -of TileDB -SDS DeepBlue-SeaWiFS-1.0_L3_20100101_v004-20130604T131317Z.h5 DeepBlue-SeaWiFS-1.0_L3_20100101_v004-20130604T131317Z.tiledb
gdalinfo DeepBlue-SeaWiFS-1.0_L3_20100101_v004-20130604T131317Z.tiledb

Currently trying to read over s3 throws an error:

gdalinfo /vsis3/pangeo-data-upload-virginia/gdal3-test/DeepBlue-SeaWiFS-1.0_L3_20100101_v004-20130604T131317Z.tiledb
ERROR 1: [TileDB::StorageManager] Error: Cannot open array; Array does not exist
gdalinfo failed - unable to open '/vsis3/pangeo-data-upload-virginia/gdal3-test/DeepBlue-SeaWiFS-1.0_L3_20100101_v004-20130604T131317Z.tiledb'.

I'm pretty sure this is b/c it's a folder, so maybe I need to point to a file under .tiledb? But I'm not sure how to read these since it's a directory structure on S3:

DeepBlue-SeaWiFS-1.0_L3_20100101_v004-20130604T131317Z.tiledb/
├── DeepBlue-SeaWiFS-1.0_L3_20100101_v004-20130604T131317Z.tdb.aux.xml
├── __1565818782587_1565818782587_4938da7e10f443a5bb583be9ad448ee1
│   ├── __fragment_metadata.tdb
│   ├── solar_zenith_angle.tdb
│   └── viewing_zenith_angle.tdb
├── __array_schema.tdb
└── __lock.tdb
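
The fragment directory name in the listing appears to follow the pattern `__<timestamp>_<timestamp>_<id>`, with timestamps in milliseconds since the Unix epoch. That reading is an assumption based on this one listing, not something confirmed in this thread. Under it, the name decodes to a time a few minutes before this comment was posted. `parse_fragment_name` is a hypothetical helper.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def parse_fragment_name(name):
    """Split '__<start_ms>_<end_ms>_<id>' into two UTC datetimes and the id."""
    start_ms, end_ms, frag_id = name.lstrip("_").split("_", 2)
    start = EPOCH + timedelta(milliseconds=int(start_ms))
    end = EPOCH + timedelta(milliseconds=int(end_ms))
    return start, end, frag_id


start, end, frag_id = parse_fragment_name(
    "__1565818782587_1565818782587_4938da7e10f443a5bb583be9ad448ee1"
)
print(start.isoformat(), frag_id)
```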

normanb · 2019-08-15T16:21:06Z

@scottyhq the TileDB driver within GDAL uses its own methods to access S3 as opposed to the GDAL VSI approach.

To access the array try

gdalinfo -oo TILEDB_CONFIG=aws.config s3://pangeo-data-upload-virginia/gdal3-test/DeepBlue-SeaWiFS-1.0_L3_20100101_v004-20130604T131317Z.tiledb

The GDAL driver code currently relies on there being a config file to determine S3 access https://github.com/OSGeo/gdal/blob/master/gdal/frmts/tiledb/tiledbdataset.cpp#L908. This isn't ideal for public buckets with no config parameters and I will change this check to remove the need for the config file when I add the new GDAL multi-dimensional API to this driver that is currently in master https://gdal.org/tutorials/multidimensional_api_tut.html.

The TileDB configuration parameters are described here - https://docs.tiledb.io/en/stable/tutorials/config.html#summary-of-parameters

If there are any problems then I am happy to work through them.

scottyhq · 2019-08-28T04:47:47Z

Thanks @normanb for the pointers. I'm still having trouble getting tiledb to work with a config file.
Full details here: https://gist.github.com/scottyhq/8222b99c3400209f96826d09389482c1

Reading the local file without a config file works (gdalinfo DeepBlue-SeaWiFS-1.0_L3_20100101_v004-20130604T131317Z.tiledb):


Driver: TileDB/TileDB
Files: DeepBlue-SeaWiFS-1.0_L3_20100101_v004-20130604T131317Z.tiledb/DeepBlue-SeaWiFS-1.0_L3_20100101_v004-20130604T131317Z.tdb
       DeepBlue-SeaWiFS-1.0_L3_20100101_v004-20130604T131317Z.tiledb
       DeepBlue-SeaWiFS-1.0_L3_20100101_v004-20130604T131317Z.tiledb/DeepBlue-SeaWiFS-1.0_L3_20100101_v004-20130604T131317Z.tdb.aux.xml
Size is 512, 512
Metadata:
  solar_zenith_angle_long_name=solar zenith angle
  solar_zenith_angle_standard_name=solar_zenith_angle
  solar_zenith_angle_units=degrees
  solar_zenith_angle_valid_range=0 90 
  solar_zenith_angle__FillValue=-999 
  viewing_zenith_angle_long_name=viewing zenith angle
  viewing_zenith_angle_units=degrees
  viewing_zenith_angle_valid_range=0 90 
  viewing_zenith_angle__FillValue=-999 
Subdatasets:
  SUBDATASET_1_NAME=TILEDB:"DeepBlue-SeaWiFS-1.0_L3_20100101_v004-20130604T131317Z.tiledb":solar_zenith_angle
  SUBDATASET_1_DESC=[1x360x180] solar_zenith_angle (Float32)
  SUBDATASET_2_NAME=TILEDB:"DeepBlue-SeaWiFS-1.0_L3_20100101_v004-20130604T131317Z.tiledb":viewing_zenith_angle
  SUBDATASET_2_DESC=[1x360x180] viewing_zenith_angle (Float32)
Corner Coordinates:
Upper Left  (    0.0,    0.0)
Lower Left  (    0.0,  512.0)
Upper Right (  512.0,    0.0)
Lower Right (  512.0,  512.0)
Center      (  256.0,  256.0)

I've tried the simple case of pointing TILEDB_CONFIG at a file containing a copy of the default config (CPL_DEBUG=ON gdalinfo -oo TILEDB_CONFIG=tiledb.config DeepBlue-SeaWiFS-1.0_L3_20100101_v004-20130604T131317Z.tiledb), but then get a different error:


ERROR 4: `DeepBlue-SeaWiFS-1.0_L3_20100101_v004-20130604T131317Z.tiledb' not recognized as a supported file format.
gdalinfo failed - unable to open 'DeepBlue-SeaWiFS-1.0_L3_20100101_v004-20130604T131317Z.tiledb'.

normanb · 2019-08-28T20:45:17Z

@scottyhq I ran through the notebook you linked to. Your last command should be

gdalinfo -oo TILEDB_CONFIG=tiledb.config s3://pangeo-data-upload-virginia/gdal3-test/DeepBlue-SeaWiFS-1.0_L3_20100101_v004-20130604T131317Z.tiledb

and the config should include the following keys (and add your own values)

vfs.s3.aws_access_key_id  xxxxxx
vfs.s3.aws_secret_access_key xxxxx

The config is used to determine whether to access s3 or not, which is a bug (as noted above), as a config file can also be used locally. I will change that with my current updates.
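
A small sketch of generating the config file described above from the standard AWS environment variables, so that credentials do not have to be typed into the file by hand. The two keys are the ones listed above and the `key value` layout follows that listing. The helper name `write_tiledb_config` and the temporary output location are assumptions for the example.

```python
import os
import tempfile


def write_tiledb_config(path, access_key, secret_key):
    """Write a two-line config file holding the S3 credentials."""
    lines = [
        f"vfs.s3.aws_access_key_id {access_key}",
        f"vfs.s3.aws_secret_access_key {secret_key}",
    ]
    with open(path, "w") as f:
        f.write("\n".join(lines) + "\n")


# Written to a temporary directory here; placeholders are used if the variables are unset.
config_path = os.path.join(tempfile.mkdtemp(), "tiledb.config")
write_tiledb_config(
    config_path,
    os.environ.get("AWS_ACCESS_KEY_ID", "xxxxxx"),
    os.environ.get("AWS_SECRET_ACCESS_KEY", "xxxxx"),
)
print(config_path)
```

The printed path can then be passed to gdalinfo as `-oo TILEDB_CONFIG=<path>`.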

stale · 2019-10-27T21:11:31Z

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

stale · 2019-11-03T21:34:53Z

This issue has been automatically closed because it had not seen recent activity. The issue can always be reopened at a later date.

JackKelly · 2020-02-04T08:56:36Z

Great conversations! Just wondering if there are still plans afoot to implement an Xarray backend for TileDB? (That would be wonderful!)

Hoeze · 2020-02-04T10:48:32Z

I would love having xarray + https://feedback.tiledb.com/tiledb-core/p/support-axes-labels
This way, we could have e.g. native-performance string indexing.

rabernat · 2020-02-04T13:31:22Z

I would say there is broad interest in TileDB support in xarray. Xarray is developed by volunteers. If anyone on this thread wants to volunteer to implement a TileDB backend, the xarray devs will be glad to provide advice.

JackKelly · 2020-02-04T14:52:41Z

I hear you, @rabernat! I've sent a tweet asking for folks to help implement a TileDB backend to Xarray :)

petacube · 2020-02-04T21:34:57Z

I would be happy to implement a TileDB backend for xarray. I am waiting on TileDB to implement the heterogeneous dimensions feature, which should be out soon. Stan

JackKelly · 2020-02-04T21:53:02Z

Awesome, thank you Stan!

JackKelly · 2020-02-07T09:27:20Z

@petacube I'm also interested in tiledb's heterogeneous dimensions feature. Do you know roughly when it might be implemented? I found the feature listed in tiledb's roadmap, but I couldn't see it in the tiledb github issue queue.

stavrospapadopoulos · 2020-02-07T23:11:24Z

@JackKelly this feature is currently under heavy development. It is blocked by #93. You can see some related merged PRs here. I expect the feature to be ready by early March.

Ledenel · 2020-05-18T03:58:18Z

Just want to mention that the heterogeneous dimensions feature has been implemented in the TileDB 2.0 release (1 May).

JackKelly · 2020-05-18T12:10:45Z

Should we re-open this issue? (It was automatically closed due to inactivity. But maybe now's the time to re-open this issue, because heterogeneous dims are now implemented in TileDB?)

JackKelly · 2020-08-17T15:33:27Z

@petacube would you still be interested in implementing a TileDB backend for xarray? It would be hugely useful! :)

petacube · 2020-08-17T16:08:44Z

We can, of course. Do you guys have a project/data to test it on?

JackKelly · 2020-08-17T16:46:44Z

Awesome! Yeah, I plan to benchmark TileDB vs Zarr vs cloud-optimised-GeoTIFF on a satellite dataset (on Google Cloud Storage) over the next few weeks.

stavrospapadopoulos · 2020-08-17T17:22:01Z

Just FYI folks, here is a good thread about our plans to model netcdf data with TileDB, after great discussions with @rabernat and @DPeterK, which is very relevant to our TileDB-xarray integration. We will first share a proposal on a general spec for modeling labeled arrays (encompassing netcdf), as well as a proof implementation with our upcoming "axes labels" core TileDB feature. Eventually, this axes labels feature will drive the TileDB-xarray integration.

jp-dark · 2021-03-11T21:35:39Z

Just to update this discussion: We developed a read-only TileDB backend for xarray, available here. It uses the new xarray backend plugin architecture, which is soon to be released.

rabernat · 2021-03-12T02:45:00Z

Thanks so much @jp-dark! So exciting!

Just curious how you recommend starting the transition to tiledb without a way to write files? Our community doesn't have any existing tiledb data. With Zarr, we can basically write code like this:

import xarray as xr

ds = xr.open_dataset('file.nc')  # read the existing netCDF file
ds.to_zarr('store.zarr')  # write it back out as a Zarr store

...and convert all our existing data to a new format. What sort of workflow would you recommend for converting existing netCDF data to tiledb?

jp-dark · 2021-03-12T22:56:33Z

@rabernat This xarray plugin was developed for other customers that would like to visualize and manipulate data from existing TileDB arrays with xarray. A full-fledged NetCDF data model in TileDB (including an ingestor from NetCDF) and a specification for comment are coming soon.

stale bot closed this as completed Jun 10, 2019

scottyhq reopened this Aug 14, 2019

stale bot removed the stale label Aug 14, 2019

stale bot added the stale label Oct 27, 2019

stale bot closed this as completed Nov 3, 2019

grumbling-tom mentioned this issue Oct 21, 2020

Modularity informatics-lab/tiledb_netcdf#48

Closed
