[ntuple] adjust default cluster size
Following the experience with AGC testing, double the default compressed
cluster size to 100 MB and also double the maximum uncompressed cluster
size to 1 GiB.
jblomer committed Oct 30, 2024
1 parent 5baa552 commit 543900d
Showing 4 changed files with 7 additions and 7 deletions.
4 changes: 2 additions & 2 deletions tree/ntuple/v7/doc/tuning.md
@@ -5,15 +5,15 @@ A cluster contains all the data of a given event range.
As clusters are usually compressed and tied to event boundaries, an exact size cannot be enforced.
Instead, RNTuple uses a *target size* for the compressed data as a guideline for when to flush a cluster.

-The default cluster target size is 50MB of compressed data.
+The default cluster target size is 100 MB of compressed data.
The default can be changed by the `RNTupleWriteOptions`.
The default should work well in the majority of cases.
In general, larger clusters provide room for more and larger pages and should improve compression ratio and speed.
However, clusters also need to be buffered during write and (partially) during read,
so larger clusters increase the memory footprint.

A second option in `RNTupleWriteOptions` specifies the maximum uncompressed cluster size.
-The default is 512MiB.
+The default is 1 GiB.
This setting acts as an "emergency brake" and should prevent very compressible clusters from growing too large.

Given the two settings, writing works as follows:
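The compressed target and the uncompressed cap can be overridden per writer through RNTupleWriteOptions. The following sketch is not part of this commit; it assumes ROOT's experimental RNTuple headers at the time of this change, and the field name "pt" and the file name "data.root" are placeholders.

#include <ROOT/RNTupleModel.hxx>
#include <ROOT/RNTupleWriteOptions.hxx>
#include <ROOT/RNTupleWriter.hxx>

#include <utility>

int main()
{
   auto model = ROOT::Experimental::RNTupleModel::Create();
   auto pt = model->MakeField<float>("pt");

   ROOT::Experimental::RNTupleWriteOptions options;
   options.SetApproxZippedClusterSize(100 * 1000 * 1000);  // 100 MB compressed target (the new default)
   options.SetMaxUnzippedClusterSize(1024 * 1024 * 1024);  // 1 GiB uncompressed cap (the new default)

   auto writer = ROOT::Experimental::RNTupleWriter::Recreate(std::move(model), "ntpl", "data.root", options);
   for (int i = 0; i < 1000; ++i) {
      *pt = 0.5f * i;
      writer->Fill();
   }
   return 0;  // the writer flushes any open cluster when it goes out of scope
}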
4 changes: 2 additions & 2 deletions tree/ntuple/v7/inc/ROOT/RNTupleWriteOptions.hxx
@@ -61,10 +61,10 @@ public:
protected:
int fCompression{RCompressionSetting::EDefaults::kUseGeneralPurpose};
/// Approximation of the target compressed cluster size
-std::size_t fApproxZippedClusterSize = 50 * 1000 * 1000;
+std::size_t fApproxZippedClusterSize = 100 * 1000 * 1000;
/// Memory limit for committing a cluster: with very high compression ratio, we need a limit
/// on how large the I/O buffer can grow during writing.
-std::size_t fMaxUnzippedClusterSize = 512 * 1024 * 1024;
+std::size_t fMaxUnzippedClusterSize = 1024 * 1024 * 1024;
/// Initially, columns start with a page large enough to hold the given number of elements. The initial
/// page size is the given number of elements multiplied by the column's element size.
/// If more elements are needed, pages are increased up until the byte limit given by fMaxUnzippedPageSize
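For illustration, the decision implied by these two members can be sketched as below. This is a simplified stand-in written for this note, not the actual flush logic of the RNTuple writer, which is not part of this diff.

#include <cstddef>

// Simplified sketch: flush when the estimated compressed size reaches the
// target, or when the in-memory (uncompressed) buffer hits the hard cap.
bool ShouldFlushCluster(std::size_t estimatedZippedBytes, std::size_t unzippedBytes,
                        std::size_t approxZippedClusterSize,  // default now 100 * 1000 * 1000
                        std::size_t maxUnzippedClusterSize)   // default now 1024 * 1024 * 1024
{
   if (estimatedZippedBytes >= approxZippedClusterSize)
      return true;
   // "Emergency brake": highly compressible data keeps the zipped estimate small,
   // but the uncompressed write buffer must not grow without bound.
   return unzippedBytes >= maxUnzippedClusterSize;
}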
2 changes: 1 addition & 1 deletion tree/ntupleutil/v7/inc/ROOT/RNTupleImporter.hxx
@@ -105,7 +105,7 @@ public:
/// Used to make adjustments to the fields of the output model.
using FieldModifier_t = std::function<void(RFieldBase &)>;

-/// Used to report every ~50MB (compressed), and at the end about the status of the import.
+/// Used to report every ~100 MB (compressed), and at the end about the status of the import.
class RProgressCallback {
public:
virtual ~RProgressCallback() = default;
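For context, a minimal import that triggers these progress reports could look as follows. The tree name "Events" and the file names are placeholders; Create() and Import() come from RNTupleImporter, but other details (such as SetIsQuiet) should be checked against the ROOT version in use.

#include <ROOT/RNTupleImporter.hxx>

int main()
{
   // Convert the TTree "Events" from in.root into an RNTuple in out.root.
   auto importer = ROOT::Experimental::RNTupleImporter::Create("in.root", "Events", "out.root");
   // importer->SetIsQuiet(true);  // assumption: suppresses the default progress output
   // During Import(), the default callback prints a status line roughly every
   // 100 MB of compressed data written, as described above.
   importer->Import();
   return 0;
}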
4 changes: 2 additions & 2 deletions tree/ntupleutil/v7/src/RNTupleImporter.cxx
@@ -43,14 +43,14 @@ namespace {

class RDefaultProgressCallback : public ROOT::Experimental::RNTupleImporter::RProgressCallback {
private:
-static constexpr std::uint64_t gUpdateFrequencyBytes = 50 * 1000 * 1000; // report every 50MB
+static constexpr std::uint64_t gUpdateFrequencyBytes = 100 * 1000 * 1000; // report every 100 MB
std::uint64_t fNbytesNext = gUpdateFrequencyBytes;

public:
~RDefaultProgressCallback() override {}
void Call(std::uint64_t nbytesWritten, std::uint64_t neventsWritten) final
{
-// Report if more than 50MB (compressed) were written since the last status update
+// Report if more than 100 MB (compressed) were written since the last status update
if (nbytesWritten < fNbytesNext)
return;
std::cout << "Wrote " << nbytesWritten / 1000 / 1000 << "MB, " << neventsWritten << " entries\n";
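The part of the hunk that advances fNbytesNext is not shown above. As a generic illustration of the throttling pattern (not the exact ROOT code), the reporter can skip ahead in 100 MB steps so that at most one line is printed per threshold crossed:

#include <cstdint>
#include <iostream>

class ThrottledReporter {
   static constexpr std::uint64_t kUpdateFrequencyBytes = 100 * 1000 * 1000;  // report every ~100 MB
   std::uint64_t fNbytesNext = kUpdateFrequencyBytes;

public:
   void Call(std::uint64_t nbytesWritten, std::uint64_t neventsWritten)
   {
      if (nbytesWritten < fNbytesNext)
         return;
      std::cout << "Wrote " << nbytesWritten / 1000 / 1000 << " MB, " << neventsWritten << " entries\n";
      // Advance past what has already been written so the next report comes
      // after roughly another 100 MB.
      while (fNbytesNext <= nbytesWritten)
         fNbytesNext += kUpdateFrequencyBytes;
   }
};

int main()
{
   ThrottledReporter reporter;
   reporter.Call(30 * 1000 * 1000, 1000);   // below the threshold: no output
   reporter.Call(130 * 1000 * 1000, 5000);  // prints "Wrote 130 MB, 5000 entries"
   return 0;
}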
