
question about writing parallel and group handling #172

Open
asparsa opened this issue Jun 21, 2024 · 39 comments

@asparsa

asparsa commented Jun 21, 2024

I'm using the tensorstore C++ API to write a Zarr file, and I need to address these two issues in my code:
1- I want to create a group, as in Zarr Python, and build a hierarchy of namespaces and datasets. Right now I'm handling it using the kvstore path when creating them. What is the optimal way? This approach takes more time than the Python version!

2- I'm trying to write data into each chunk in parallel. I wanted to avoid multithreading for now and use MPI instead. I'm writing everything from scratch for this purpose, as I couldn't find anything in the documentation. Is there a better way to avoid multithreading?

Thanks

@jbms
Collaborator

jbms commented Jun 21, 2024

Tensorstore automatically handles chunks in parallel, so you can just issue the write from a single thread and all relevant chunks will be handled in parallel.

Groups currently aren't supported, unfortunately.

@asparsa
Author

asparsa commented Jun 21, 2024

Thanks, the timing makes more sense now.

As an example, if I have {10,10} chunks and I want to write to the [35:45][35:45] region, will it be divided into four sections [35:40][35:40], [35:40][40:45], [40:45][35:40], and [40:45][40:45] and written 4-way parallel?

And what is the maximum number of parallel writes? Can it be equal to the number of processes?

@jbms
Collaborator

jbms commented Jun 21, 2024

> Thanks, the timing makes more sense now.
>
> As an example, if I have {10,10} chunks and I want to write to the [35:45][35:45] region, will it be divided into four sections [35:40][35:40], [35:40][40:45], [40:45][35:40], and [40:45][40:45] and written 4-way parallel?

Exactly.

> And what is the maximum number of parallel writes? Can it be equal to the number of processes?

It depends on the underlying kvstore --- with the local filesystem: https://google.github.io/tensorstore/kvstore/file/index.html#json-Context.file_io_concurrency

With GCS: https://google.github.io/tensorstore/kvstore/gcs/index.html#json-Context.gcs_request_concurrency

I'm not sure what you mean as far as it being equal to the number of processes, or how you intend to use MPI.

You can indeed also write in parallel from multiple processes, but tensorstore doesn't itself contain any support for that. You will need to partition your work between processes, and for efficiency you should ensure that the partitions are aligned to chunk boundaries.
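For illustration, here is a minimal sketch of that kind of chunk-aligned partitioning (assumptions: the 100x100 array with 10x10 chunks discussed below; `rank`/`nranks` would come from MPI_Comm_rank/MPI_Comm_size, whose setup is elided; this is not an official tensorstore recipe):

// Hedged sketch: each rank writes a disjoint, chunk-aligned range of rows,
// so no two processes ever touch the same chunk file.
#include <algorithm>

#include "absl/status/status.h"
#include "tensorstore/index.h"
#include "tensorstore/index_space/dim_expression.h"
#include "tensorstore/tensorstore.h"

absl::Status WritePartition(int rank, int nranks,
                            tensorstore::TensorStore<int16_t, 2> store,
                            tensorstore::SharedArray<const int16_t, 2> data) {
  constexpr tensorstore::Index kRows = 100;      // assumed total rows
  constexpr tensorstore::Index kChunkRows = 10;  // assumed chunk extent in dim 0
  const tensorstore::Index num_chunks = kRows / kChunkRows;
  const tensorstore::Index chunks_per_rank = (num_chunks + nranks - 1) / nranks;
  const tensorstore::Index lo = std::min(kRows, rank * chunks_per_rank * kChunkRows);
  const tensorstore::Index hi = std::min(kRows, (rank + 1) * chunks_per_rank * kChunkRows);
  if (lo >= hi) return absl::OkStatus();  // nothing left for this rank

  // Restrict both source and target to this rank's rows and block on completion.
  return tensorstore::Write(data | tensorstore::Dims(0).HalfOpenInterval(lo, hi),
                            store | tensorstore::Dims(0).HalfOpenInterval(lo, hi))
      .status();
}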

@asparsa
Author

asparsa commented Jun 26, 2024

Thanks.
I believe limit:"shared" is what I was searching for.
Can I ask how TensorStore handles the parallelism?

@laramiel
Collaborator

laramiel commented Jun 26, 2024

It depends. There are two mechanisms in use: thread pools and admission queues.

Generally the "shared" concurrency objects create a thread pool with a concurrency limit of at least 4 (https://github.com/google/tensorstore/blob/master/tensorstore/internal/file_io_concurrency_resource.cc). File I/O operations are then queued on the thread pool for completion, and the pool schedules up to 4 operations to run in parallel.

In the case of writing to gcs, the underlying kvstore uses an admission queue with a default value of 32 to control concurrency (https://github.com/google/tensorstore/blob/master/tensorstore/kvstore/gcs_http/gcs_resource.h). The admission queue then allows up to 32 parallel HTTP operations.
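For reference, raising those limits is done through the context resources in the spec; a minimal sketch (the path and the value 32 are placeholders):

::nlohmann::json spec{
    {"driver", "zarr3"},
    {"kvstore", {{"driver", "file"}, {"path", "/tmp/example"}}},  // placeholder path
    // Override the shared file_io_concurrency context resource:
    {"context", {{"file_io_concurrency", {{"limit", 32}}}}},
};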

@asparsa
Author

asparsa commented Jun 27, 2024

Thanks again,
I have another question.
I am using zarr3 to write an array. When the file size becomes big, the write does not finish, and it leaves the folder with only the metadata (it will be only 4 KB instead of 5 GB).

auto write_result = tensorstore::Write(array, store);
write_result.Force();
auto write_result2 = write_result.result();

How can I ensure the writing process is complete? I want to ensure the data is on disk, not in a buffer or cache.

@laramiel
Collaborator

That's an open-ended question. Have you tried logging the write_result.status()?

auto write_result = tensorstore::Write(array, store);
ABSL_LOG(INFO) << write_result.status();
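One detail relevant to "is the data on disk": the WriteFutures value returned by tensorstore::Write exposes separate copy and commit stages, and it is the commit stage that corresponds to the data having been handed off to the kvstore. A minimal sketch, reusing array and store from above:

auto write = tensorstore::Write(array, store);
// Blocks until the source array has been copied; the buffer may then be reused:
ABSL_LOG(INFO) << "copy: " << write.copy_future.status();
// Blocks until the data has been committed to the underlying kvstore:
ABSL_LOG(INFO) << "commit: " << write.commit_future.status();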

You can look at how we write to tensorstore in examples/compute_percentiles.cc.

Can you distill your problem into a function which you can include in a comment?

@jbms
Collaborator

jbms commented Jun 27, 2024

Also there was a recent fix for sharded writing --- make sure you are using a sufficiently new version (0.1.63 or later).

@asparsa
Author

asparsa commented Jun 28, 2024

Sorry for the ambiguity. Thanks for notifying me about sharded writing.

> That's an open-ended question. Have you tried logging the write_result.status()?

Yeah, I tried both status and result, and they returned OK.

I saw the example; I am using the same method. Zarr creates the chunk, but the chunk remains empty.
I can create a simple code example to reproduce the results if you want.

@laramiel
Collaborator

Yes, please post your code.

@asparsa
Author

asparsa commented Jun 28, 2024

I created a repo for that.
https://github.com/asparsa/zarrtest/tree/main
but here is the main.cpp (edited by laramiel):

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#include <iostream>
#include <typeinfo>

#include "tensorstore/context.h"
#include "tensorstore/index.h"
#include "tensorstore/open.h"
#include "tensorstore/open_mode.h"
#include "tensorstore/tensorstore.h"
#include "tensorstore/util/status.h"

using ::tensorstore::Context;

::nlohmann::json GetJsonSpec() {
  return {
      {"driver", "zarr3"},
      {"kvstore", {{"driver", "file"}, {"path", "/tmp/testo"}}},
      {"metadata",
       {{"data_type", "int16"},
        {"shape", {100, 100}},
        {"chunk_grid",
         {{"name", "regular"},
          {"configuration", {{"chunk_shape", {10, 10}}}}}}}},
  };
}


// try to create a zarr and write a 100*100 random array in it
int main() {
  auto context = Context::Default();
  auto store_result = tensorstore::Open(GetJsonSpec(), context,
                                        tensorstore::OpenMode::create |
                                            tensorstore::OpenMode::open,
                                        tensorstore::ReadWriteMode::read_write)
                          .result();
  if (!store_result.ok()) {
    std::cerr << "Error creating Zarr file: " << store_result.status() << std::endl;
    return -1;
  }
  std::cout << "Zarr file created successfully!" << std::endl;

  srand(time(0));
  auto array = tensorstore::AllocateArray<int16_t>({100, 100});
  for (tensorstore::Index i = 0; i < 100; ++i) {
    for (tensorstore::Index j = 0; j < 100; ++j) {
      array(i, j) = (rand() % 10) + 1;
    }
  }

  auto write_result = tensorstore::Write(array, store_result).status();
  if (!write_result.ok()) {
    std::cerr << "Error creating Zarr file: " << write_result << std::endl;
    return -1;
  }

  std::cout << "Zarr file wrote successfully!" << std::endl;
  return 0;
}

No matter how much data I write into it, the zarr file remains 4 KB.

@laramiel
Collaborator

laramiel commented Jun 30, 2024

NOTE: I reformatted the above example and removed some unnecessary lines to make it more readable; I also changed the output location from testo to /tmp/testo.

The code above is essentially correct, but I think that you merely misunderstand the data storage format. The zarr.json file is the zarr-format metadata which describes the dataset. The data is stored in individual files within subdirectories of /tmp/testo, such as:

$ find /tmp/testo | sort

/tmp/testo
/tmp/testo/c
/tmp/testo/c/0
/tmp/testo/c/0/0
/tmp/testo/c/0/1
/tmp/testo/c/0/2
/tmp/testo/c/0/3
...
/tmp/testo/c/9/7
/tmp/testo/c/9/8
/tmp/testo/c/9/9
/tmp/testo/zarr.json

Since your chunk shape is 10x10, and you write into indices 0..99 x 0..99, these files have indices from 0..9.

@asparsa
Author

asparsa commented Jun 30, 2024

Can I see your code?
I was talking about the whole folder size, which never exceeds 4 KB.
Yes, I was checking the data, not the metadata.

@laramiel
Collaborator

laramiel commented Jun 30, 2024

I edited your above comment; that is exactly what I compiled.

Are you perhaps misunderstanding the 4.0K 'byte size' field in the ls output? The 4.0K for the c directory is the size of the directory structure itself on disk, likely a single block, not the size of all data within the directory. To get the sum of all file sizes you need to use the du command.

$ ls -lAh /tmp/testo
total 8.0K
drwxr-xr-x 12 laramiel laramiel 4.0K Jun 30 06:14 c
-rw-r--r--  1 laramiel laramiel  266 Jun 30 06:14 zarr.json

$ du -sh /tmp/testo
452K  /tmp/testo

@asparsa
Author

asparsa commented Jul 1, 2024

Oh, I see. Sorry, I confused that with the file size.
Thank you so much.
I have another question. I am trying to use the MakeArray() function with different data types from different variables, but I have difficulty using it every time. For examples I looked at this, but most were created from literal numbers, not variables. Is there any example or documentation on that?
This is one of the examples I tried to write:

std::vector<int16_t> data_to_use(parsed_data.begin() + 4, parsed_data.end());
auto array = tensorstore::MakeArray<int16_t>(data_to_use);

I tried every combination of using and not using <int16_t> and still get a no matching function for call to 'MakeArray<int16_t>(std::vector&)' error.
How should I handle them, and what other variable types can I pass to MakeArray?
If I use MakeArrayView, can I write it to a Zarr driver?

@asparsa
Author

asparsa commented Jul 1, 2024

Is there any way to turn off all the compression in zarr3, so that compression is no longer enabled?

@brian-michell

Setting the clevel to 0 in the Blosc codec should effectively disable compression.
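In the zarr3 metadata that would look something like this (a sketch, mirroring the codec syntax in the example below):

"codecs": [{"name": "blosc", "configuration": {"cname": "lz4", "clevel": 0}}]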

@asparsa
Author

asparsa commented Jul 2, 2024

Won't that add overhead? Calling into the Blosc codec and coming back out without actually compressing.

> Oh, I see. Sorry, I confused that with the file size. [...] I am trying to use the MakeArray() function with different data types from different variables, but I have difficulty using it every time. [...] How should I handle them, and what other variable types can I pass to MakeArray? If I use MakeArrayView, can I write it to a Zarr driver?

What do you think about this?

@brian-michell

brian-michell commented Jul 2, 2024

> Won't that add overhead? Calling into the Blosc codec and coming back out without actually compressing.

I'm not too familiar with Zarr3, but it looks like specifying no codec will result in uncompressed data.

import tensorstore as ts
import numpy as np

spec = {
    'driver': 'zarr3',
    'kvstore': {
        'driver': 'file',
        'path': 'tmp/zarr3',
    },
    'create': True,
    'delete_existing': True,
    'metadata': {
        'shape': [100, 100],
        'data_type': 'float64',
        "codecs": [{"name": "blosc", "configuration": {"cname": "lz4", "clevel": 9}}] # You could remove this line
    }
}
tensorstore_object_result = ts.open(spec)
tensorstore_object = tensorstore_object_result.result()
data = np.random.rand(100, 100)
write_future = tensorstore_object.write(data)

# tmp/zarr3/c/0 results in 72K according to `du -sh 0`
# tmp/zarr3/c/0 results in 80K according to `du -sh 0` without the codec line

> What do you think about this?

I have never used the MakeArray() function. SharedArray has served my purposes with the Zarr driver so far; your mileage may vary.
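For what it's worth, a minimal sketch of the SharedArray route (assuming a compatible open store named store): tensorstore::AllocateArray returns a SharedArray, so the asynchronous Write can retain a reference to it without any conversion:

auto array = tensorstore::AllocateArray<int16_t>({3, 4});
array(0, 0) = 42;  // ... fill the array ...
auto status = tensorstore::Write(array, store).status();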

@laramiel
Collaborator

laramiel commented Jul 3, 2024

Yes, you can write any of ArrayView / SharedArrayView / Array / SharedArray to tensorstore. So to make a tensorstore ArrayView of your parsed data you don't need to copy it to a vector; it should be possible to use this MakeArrayView method, more or less like this:

// No need to copy the data into an array, just use a tensorstore::span to pass the pointer + length.
auto array_view = tensorstore::MakeArrayView(
    tensorstore::span(parsed_data.begin() + 4, parsed_data.end()));
auto status = tensorstore::Write(array_view, store).result();

If you want full control over the array shape, such as creating an ArrayView with the shape 3x4
from the parsed_data, then you can construct an ArrayView directly, more or less like this:

// NOTE: In this example, std::distance(parsed_data.begin() + 4, parsed_data.end())
// must be at least 12.
tensorstore::StridedLayout<2> array_layout(tensorstore::c_order, sizeof(uint16_t), {3, 4});

// NOTE: array_view references array_layout, so the array_layout must outlive the array_view.
tensorstore::ArrayView<uint16_t, 2> array_view(&*(parsed_data.begin() + 4), array_layout);

auto status = tensorstore::Write(array_view, store).result();

@asparsa
Author

asparsa commented Jul 8, 2024

I followed the exact instructions, and I got:

 error: no matching function for call to 'Write(tensorstore::Array<double, 2, tensorstore::ArrayOriginKind::zero, tensorstore::ContainerKind::view>&)'
  213 |  auto write_result = tensorstore::Write(data).result();
      |                      ~~~~~~~~~~~~~~~~~~^~~~~~
In file included from /u/asalimiparsa/iometry/build/_deps/tensorstore-src/tensorstore/open.h:28,
                 from /u/asalimiparsa/iometry/plugins/macsio_zarr.c:24:
/u/asalimiparsa/iometry/build/_deps/tensorstore-src/tensorstore/tensorstore.h:772:1: note: candidate: 'template<class SourceArray, class Target> tensorstore::internal::EnableIfCanCopyArrayToTensorStore<typename tensorstore::internal_result::UnwrapResultHelper<typename std::remove_cv<typename std::remove_reference<_Tp>::type>::type>::type, typename tensorstore::internal_result::UnwrapResultHelper<typename std::remove_cv<typename std::remove_reference<_Arg>::type>::type>::type, tensorstore::WriteFutures> tensorstore::Write(SourceArray&&, Target&&, tensorstore::WriteOptions)'
  772 | Write(SourceArray&& source, Target&& target, WriteOptions options) {

This is the same error I got when converting from other types.

@laramiel
Collaborator

laramiel commented Jul 8, 2024

Sorry, I omitted the second TensorStore parameter from Write in the above example; I've updated it.

You can see from the error message that more parameters are required.

@asparsa
Author

asparsa commented Jul 8, 2024

I'm still getting an error:

no matching function for call to 'DriverWrite(tensorstore::Array<double, 2, tensorstore::ArrayOriginKind::zero, tensorstore::ContainerKind::view>&, tensorstore::internal::Driver::Handle&, std::remove_reference<tensorstore::WriteOptions&>::type)'

@laramiel
Collaborator

laramiel commented Jul 9, 2024

My best advice here is to (1) look for the differences between what you have and working examples, (2) try and determine what you want to be doing based on reading the code, and (3) always post your code when you want help with an error message. Otherwise anyone who tries to help is just guessing.

@asparsa
Author

asparsa commented Jul 9, 2024

My bad! I didn't notice it would be ambiguous.
I just changed the datatype to double in your code and received that error.

tensorstore::ArrayView<double, 2> array_view(&*(parsed_data.begin() + 4), array_layout);

auto status = tensorstore::Write(array_view, store).result();

And I received:

 error: no matching function for call to 'Write(tensorstore::Array<double, 2, tensorstore::ArrayOriginKind::zero, tensorstore::ContainerKind::view>&)'
  213 |  auto write_result = tensorstore::Write(data).result();
      |                      ~~~~~~~~~~~~~~~~~~^~~~~~
In file included from /u/asalimiparsa/iometry/build/_deps/tensorstore-src/tensorstore/open.h:28,
                 from /u/asalimiparsa/iometry/plugins/macsio_zarr.c:24:
/u/asalimiparsa/iometry/build/_deps/tensorstore-src/tensorstore/tensorstore.h:772:1: note: candidate: 'template<class SourceArray, class Target> tensorstore::internal::EnableIfCanCopyArrayToTensorStore<typename tensorstore::internal_result::UnwrapResultHelper<typename std::remove_cv<typename std::remove_reference<_Tp>::type>::type>::type, typename tensorstore::internal_result::UnwrapResultHelper<typename std::remove_cv<typename std::remove_reference<_Arg>::type>::type>::type, tensorstore::WriteFutures> tensorstore::Write(SourceArray&&, Target&&, tensorstore::WriteOptions)'
  772 | Write(SourceArray&& source, Target&& target, WriteOptions options) {

I also tested the code without changing the datatype and received the same error with:
error: no matching function for call to 'Write(tensorstore::Array<uint16_t, 2, tensorstore::ArrayOriginKind::zero, tensorstore::ContainerKind::view>&)'

@jbms
Collaborator

jbms commented Jul 9, 2024

Write takes two parameters. But additionally, since Write is asynchronous, the array needs to have a shared element pointer so that Write can retain a reference. Here, since you are calling result(), you can be sure the lifetime of the source array is sufficient --- it is therefore safe to use UnownedToShared to convert the unowned array to a SharedArray.

@asparsa
Author

asparsa commented Jul 15, 2024

I used all the methods you suggested, and each produced an error.
For example:

void const *buf = 0;
buf = json_object_extarr_data(extarr_obj);
const double* double_buf = static_cast<const double*>(buf);
const Index rows = shape[0];
const Index cols = shape[1];
tensorstore::StridedLayout<2> array_layout(tensorstore::c_order, sizeof(double), {rows,cols});
tensorstore::ArrayView<double, 2> array_view(&*(double_buf), array_layout);
auto data= tensorstore::Open(json_spec, context, tensorstore::OpenMode::open, tensorstore::ReadWriteMode::read_write);
auto write_result = tensorstore::Write(arrayview, data).result();

Make error:

 error: no matching function for call to 'DriverWrite(tensorstore::Array<double, 2, tensorstore::ArrayOriginKind::zero, tensorstore::ContainerKind::view>&, tensorstore::internal::Driver::Handle&, std::remove_reference<tensorstore::WriteOptions&>::type)'
  776 |         return internal::DriverWrite(
      |                ~~~~~~~~~~~~~~~~~~~~~^
  777 |             std::forward<decltype(unwrapped_source)>(unwrapped_source),
      |             ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
  778 |             internal::TensorStoreAccess::handle(
      |             ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
  779 |                 std::forward<decltype(unwrapped_target)>(unwrapped_target)),
      |                 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
  780 |             std::move(options));

And I have no idea how to fix it.

@brian-michell

I can't reproduce your exact code as it's lacking some context, but I believe you have the parameters to the write incorrect.

Write expects tensorstore::Write(SourceArray, Target, WriteOptions...), but it looks like you're providing tensorstore::Write(Target, WriteOptions).

I believe what you want is auto write_result = tensorstore::Write(array_view, data, /*optional options go here*/).result();

@asparsa
Author

asparsa commented Jul 15, 2024

I'm sorry, I made a typo when writing the question. I am passing both the data and the target to the write, and it still produces that error.

@laramiel
Collaborator

laramiel commented Jul 16, 2024

Try something like:

auto write_result = tensorstore::Write(tensorstore::UnownedToShared(arrayview), data).result();

@asparsa
Author

asparsa commented Jul 25, 2024

Thanks.
Another question:
I am using an HPC system. I'm trying to write a huge file to storage, and I've tried different numbers of threads for writing.
It seems that after 16 there is no improvement, and it won't scale across CPUs on different NUMA nodes.
This is what lscpu looks like:

Architecture:        x86_64
CPU op-mode(s):      32-bit, 64-bit
Byte Order:          Little Endian
CPU(s):              128
On-line CPU(s) list: 0-127
Thread(s) per core:  1
Core(s) per socket:  64
Socket(s):           2
NUMA node(s):        8
Vendor ID:           AuthenticAMD
CPU family:          25
Model:               1
Model name:          AMD EPYC 7763 64-Core Processor
Stepping:            1
CPU MHz:             3243.376
CPU max MHz:         2450.0000
CPU min MHz:         1500.0000
BogoMIPS:            4891.06
Virtualization:      AMD-V
L1d cache:           32K
L1i cache:           32K
L2 cache:            512K
L3 cache:            32768K
NUMA node0 CPU(s):   0-15
NUMA node1 CPU(s):   16-31
NUMA node2 CPU(s):   32-47
NUMA node3 CPU(s):   48-63
NUMA node4 CPU(s):   64-79
NUMA node5 CPU(s):   80-95
NUMA node6 CPU(s):   96-111
NUMA node7 CPU(s):   112-127

I tried taskset -c 0-127, --cpunodebind=0,1, --cpu-bind=cores, and --physcpubind.
How should I approach this problem? And what about using more than one node?

@jbms
Collaborator

jbms commented Jul 25, 2024

Can you share the code you are using for writing?

@asparsa
Author

asparsa commented Jul 25, 2024

Same code as #172 (comment), just with this change:

::nlohmann::json GetJsonSpec(int limit) {
  return {
      {"driver", "zarr3"},
      {"kvstore",
       {{"driver", "file"},
        {"path", "testo/"},
        {"file_io_concurrency", {{"limit", limit}}}}},
      {"metadata",
       {{"data_type", "int32"},
        {"shape", {16384, 16384}},
        {"chunk_grid",
         {{"name", "regular"},
          {"configuration", {{"chunk_shape", {4096, 512}}}}}}}},
  };
}

I ran with limit from 1 to 128; improvement stopped at 16.

@asparsa
Author

asparsa commented Aug 2, 2024

I'm trying to get the best write time out of the zarr3 driver. I ran the code mentioned in #172 (comment) with different limits and different system configurations. Surprisingly, improvement stops at 16, the number of cores in one NUMA node. I tried the approaches mentioned above to overcome this issue, but none worked. I want to scale my writing to more than one node, and I am asking how this can be handled.

@laramiel
Collaborator

laramiel commented Aug 2, 2024

It's really hard to tell you what to do, since all of your comments lack enough detail to debug and your test repo clearly hasn't been updated to include any of the suggested changes. We have no idea what your NUMA machine looks like, nor the characteristics of the storage, etc. (Mostly IO is going to be storage-bound, not CPU-bound, so adding additional threads is unlikely to yield a large increase in write throughput.) Thus we have no idea what actual performance you are seeing, nor what your expectations are.

If you want to run a benchmark with different options you can try our existing benchmarks replacing the default --tensorstore_spec option with your spec. See the comments for more:

https://github.com/google/tensorstore/tree/master/tensorstore/internal/benchmark

To run them, try something like this:

git clone https://github.com/google/tensorstore.git
cd tensorstore

bazelisk.py run -c opt \
  //tensorstore/internal/benchmark:ts_benchmark -- \
  --strategy=random       \
  --total_read_bytes=-10  \
  --total_write_bytes=-2  \
  --repeat_reads=16       \
  --repeat_writes=8       \
  --context_spec='{ 
      "cache_pool": { "total_bytes_limit": 268435456 },
      "file_io_concurrency": {"limit": 32}
    }'  \
  --tensorstore_spec='{
  "driver": "zarr3",
  "kvstore": "file:///tmp/abc/",
  "metadata": {
    "shape": [16384, 16384],
    "chunk_grid": {"name": "regular", "configuration": {"chunk_shape": [1024, 1024]}},
    "chunk_key_encoding": {"name": "default"},
    "data_type": "int32"
  }
}'

When I run this on my vm I see output like:

... Write summary: 2149121600 bytes in 12871 ms:  166.973 MB/second (1025 chunks of 2096704 bytes)
... Write summary: 2149121600 bytes in 11377 ms:  188.898 MB/second (1025 chunks of 2096704 bytes)
<snip>
... Read summary: 10739317888 bytes in 2507 ms:  4283.797 MB/second (5122 chunks of 2096704 bytes)
... Read summary: 10739317888 bytes in 2469 ms:  4349.583 MB/second (5122 chunks of 2096704 bytes)
<snip>

/tensorstore/cache/chunk_cache/reads=235168
/tensorstore/cache/chunk_cache/writes=23395
/tensorstore/cache/evict_count=151204
/tensorstore/cache/hit_count=107296
/tensorstore/cache/kvs_cache_read<category>[changed]=151268
/tensorstore/cache/kvs_cache_read<category>[unchanged]=73524
/tensorstore/cache/miss_count=151268
/tensorstore/kvstore/file/batch_read=151011
/tensorstore/kvstore/file/bytes_read=633386041344
/tensorstore/kvstore/file/bytes_written=98125742354
/tensorstore/kvstore/file/open_read=224792
/tensorstore/kvstore/file/read=224792
/tensorstore/kvstore/file/read_latency_ms={count=151011 mean=4.85197 buckets=[0,  47,13368,44601,65487,27101,364,41,2]}
/tensorstore/kvstore/file/write=23396
/tensorstore/kvstore/file/write_latency_ms={count=23396 mean=68.0924 buckets=[0,  0,0,0,1,413,4144,14163,1845,2176,523,  51,80]}

For my vm and this example, increasing from the default concurrency (32) to 64 looks like this:

... Write summary: 2149121600 bytes in 10092 ms:  212.947 MB/second (1025 chunks of 2096704 bytes)

/tensorstore/kvstore/file/write_latency_ms={count=23525 mean=204.975 buckets=[0,  0,0,0,0,59,132,237,1474,17843,3780]}

And going to 128 looks like this:

... Write summary: 2149121600 bytes in 9942 ms:  216.161 MB/second (1025 chunks of 2096704 bytes)

/tensorstore/kvstore/file/write_latency_ms={count=23692 mean=368.27 buckets=[0,  0,0,0,0,38,131,242,393,2385,18557,  1946]}

That is, individual file write times went up while total throughput barely changed. Clearly there is no benefit going from 64 to 128.

It also appears to me that some of the issues you're running into have to do with C++ having a more difficult API to decipher and use than Python. Try running your test code in Python and see if that lets you iterate faster to figure out what you actually need to do, then convert that to C++. The Python code is a wrapper around the C++ library, so in theory it lets you test the performance of the underlying code in a faster-to-iterate configuration.

You can also look at the comment I made here #179 (comment) which demonstrates creating a relatively large volume.

I think that your best course of action is to make your above github repository into a self-contained test/benchmark which you can use to get an idea of the performance characteristics. Once that works, if there are issues, it's possible to use profiling tools to dig into the performance in a more detailed way.

@asparsa
Author

asparsa commented Aug 2, 2024

I believe I shared the needed details in the comments, but they are scattered, so to wrap everything up:
I'm running this code, taking the limit as input:

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#include <iostream>
#include <typeinfo>

#include "tensorstore/context.h"
#include "tensorstore/index.h"
#include "tensorstore/index_space/dim_expression.h"
#include "tensorstore/index_space/index_transform.h"
#include "tensorstore/index_space/transformed_array.h"
#include "tensorstore/open.h"
#include "tensorstore/open_mode.h"
#include "tensorstore/tensorstore.h"
#include "tensorstore/util/status.h"

using ::tensorstore::Context;

::nlohmann::json GetJsonSpec(int limit) {
  return {
      {"driver", "zarr3"},
      {"kvstore",
       {{"driver", "file"},
        {"path", "testo/"},
        {"file_io_concurrency", {{"limit", limit}}}}},
      {"metadata",
       {{"data_type", "int32"},
        {"shape", {16384, 16384}},
        {"chunk_grid",
         {{"name", "regular"},
          {"configuration", {{"chunk_shape", {4096, 512}}}}}}}},
  };
}

int main() {
  int num;
  scanf("%d", &num);
  auto context = Context::Default();
  auto store_result = tensorstore::Open(GetJsonSpec(num), context,
                                        tensorstore::OpenMode::create |
                                            tensorstore::OpenMode::open,
                                        tensorstore::ReadWriteMode::read_write)
                          .result();
  if (!store_result.ok()) {
    std::cerr << "Error creating Zarr file: " << store_result.status() << std::endl;
    return -1;
  }

  auto array = tensorstore::AllocateArray<int32_t>({16384, 16384});
  for (tensorstore::Index i = 0; i < 16384; ++i) {
    for (tensorstore::Index j = 0; j < 16384; ++j) {
      array(i, j) = (rand() % 10) + 1;
    }
  }

  struct timespec start, end;
  clock_gettime(CLOCK_MONOTONIC, &start);
  auto write_result = tensorstore::Write(array, store_result).status();
  clock_gettime(CLOCK_MONOTONIC, &end);
  double elapsed_time = (end.tv_sec - start.tv_sec) +
                        (end.tv_nsec - start.tv_nsec) / 1e9;
  printf("Elapsed time: %.5f seconds\n", elapsed_time);
  return 0;
}

The times I get with limit values in [1,128] are shown in the chart below:
[chart: elapsed write time vs. file_io_concurrency limit]

As stated earlier, tensorstore writes each chunk in parallel, so I expected the time to halve when doubling the limit until reaching the number of chunks. But it seems to stop improving at 16. I ran the test with different numbers of chunks and data sizes, and the results were the same.
I thought maybe there was an issue with how I was running the code, so I tried sharing the thread pool between different NUMA nodes and cores, and binding CPUs together!
But looking at the results and the benchmark you shared, I can see the improvement is only around 1% when doubling the limit!
Will there be an update on that? I'm looking to get the best write time out of Zarr.

@laramiel
Collaborator

laramiel commented Aug 2, 2024

That's a nice update. Given the code above, tensorstore is writing ~1 GB across 128 files.

While there aren't numbers on the chart, it indicates that you're getting around 2 gigabytes/second.
Do you expect it to be higher? Why? Do you have other benchmarks which write at a higher rate? What are they?

For example, a Samsung 970 Pro SSD can sustain about 2.8 GB/s. See Tom's Hardware SSD benchmark page for an idea of SSD performance; the copy (MB/s) column is probably close to the performance you'd expect in real-world applications.

Also, can you try running the benchmark bundled with tensorstore on your machine using the same configuration as I've posted above, which is Zarr3, 1GB volume, etc? I'd be interested in seeing the output of the rates and the metrics.

Edit:

> As stated earlier, tensorstore writes each chunk in parallel, so I expected the time to halve when doubling the limit until reaching the number of chunks. But it seems to stop improving at 16. I ran the test with different numbers of chunks and data sizes, and the results were the same.

This expectation is not warranted. The OS schedules writes to a disk, and it is easily possible to saturate the IO bandwidth with a small number of writes.

@asparsa
Author

asparsa commented Aug 2, 2024

Here are the timing numbers: y = [4.78404, 2.37955, 1.38456, 0.93017, 0.73465, 0.77801, 0.72834, 0.60902]
So when the limit is 128, it's writing at about 1.64 GB/s.
I tried to run the benchmark, but it produced errors:

error: taking address of temporary [-fpermissive]
                    ArrayView<Element, RankConstraint::Subtract(SfinaeR, 1),
                                       ~~~~~~~~~~~~~~~~~~~~~~~~^~~~~~~~~~~~
./tensorstore/array.h:855:30: error: no matching function for call to 'tensorstore::RankConstraint::operator tensorstore::DimensionIndex(tensorstore::RankConstraint*)'
                              array_origin_kind>>
                    and error: no matching function for call to 'tensorstore::RankConstraint::operator tensorstore::DimensionIndex(tensorstore::RankConstraint*)'
   span<MaybeConstIndex, RankConstraint::FromInlineRank(Rank)> byte_strides() {
                                                             ^
In file included from ./tensorstore/internal/multi_vector.h:29,
                 from ./tensorstore/box.h:34,
                 from ./tensorstore/driver/downsample/downsample_util.h:21,
                 from tensorstore/driver/downsample/downsample_util.cc:15:

I'm trying to figure it out now.

I'm running on Delta HPC. If I increase the stripe count to the number of available OSTs, will that prevent IO saturation?

@laramiel
Collaborator

laramiel commented Aug 2, 2024

To diagnose the compiler error I may need more of a log file, but that shouldn't happen. Can you share the log when building all of tensorstore along with your gcc/clang version?

git clone https://github.com/google/tensorstore.git
cd tensorstore
gcc -v
clang -v
./bazelisk.py build -k -c opt //tensorstore/...

By Delta, you mean this: https://www.ncsa.illinois.edu/research/project-highlights/delta/

Your tensorstore spec is writing to the local NVMe disk, and the throughput seems pretty reasonable for local storage. I don't know how the Lustre filesystem works; however, it "provides a POSIX standard-compliant UNIX file system interface." So there is some path that you could use for tensorstore writes, something like "file:///lustre/user/blah" (you'd probably need to talk to a sysadmin about that), and using it could give you higher throughput.
