(WIP) Making compression ratio dynamically calculated based on bytes written #347
Conversation
@vinothchandar I would like to have a quick discussion on this before you take a pass.
  public HoodieParquetConfig(HoodieAvroWriteSupport writeSupport,
-     CompressionCodecName compressionCodecName, int blockSize, int pageSize, long maxFileSize,
+     CompressionCodecName compressionCodecName, int blockSize, int pageSize, int maxFileSize,
      Configuration hadoopConf) {
type change here, probably do a rebase?
thanks, done.
  SizeAwareFSDataOutputStream os =
      new SizeAwareFSDataOutputStream(fsDataOutputStream, new Runnable() {
        @Override
        public void run() {
-         openStreams.remove(path.getName());
+         // openStreams.remove(path.getName());
Doing this makes it possible to call fs.getBytesWritten(file) even after the stream is closed, and it gives back the exact number of uncompressed bytes written. @vinothchandar
The stream being closed implies that the file (block, I think) is fully written?
Yes, and so the wrappedStream has the correct number of bytes to return..
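For readers following along: the wrapper tallies every byte handed to the underlying stream, and the close callback is where the openStreams entry would normally be evicted, so keeping the entry around is what makes the count queryable after close. Below is a minimal sketch of that counting pattern in plain java.io; it is illustrative only and not the actual SizeAwareFSDataOutputStream (which wraps Hadoop's FSDataOutputStream and is tracked per path in openStreams).

import java.io.FilterOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.util.concurrent.atomic.AtomicLong;

// Illustrative sketch only (not the real SizeAwareFSDataOutputStream): count every
// byte handed to the wrapped stream and run a callback when the stream is closed,
// so the caller can decide whether to keep or evict its bookkeeping entry.
public class CountingOutputStream extends FilterOutputStream {

  private final AtomicLong bytesWritten = new AtomicLong(0L);
  private final Runnable closeCallback;

  public CountingOutputStream(OutputStream out, Runnable closeCallback) {
    super(out);
    this.closeCallback = closeCallback;
  }

  @Override
  public void write(int b) throws IOException {
    out.write(b);
    bytesWritten.incrementAndGet();
  }

  @Override
  public void write(byte[] b, int off, int len) throws IOException {
    out.write(b, off, len);
    bytesWritten.addAndGet(len);
  }

  @Override
  public void close() throws IOException {
    super.close();
    // In the PR's version, the callback no longer removes the entry from
    // openStreams, so the byte count stays queryable after close.
    closeCallback.run();
  }

  public long getBytesWritten() {
    return bytesWritten.get();
  }
}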
  } catch (Throwable t) {
    // make this fail safe.
  }
  return HoodieStorageConfig.DEFAULT_STREAM_COMPRESSION_RATIO;
We can make the DEFAULT be calculated based on maxFileSize.
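To make the fallback concrete, here is a hedged sketch of what the dynamic calculation boils down to: use the observed uncompressed-to-compressed byte totals from earlier write stats when they exist, otherwise return the default. The class and method names are hypothetical; only DEFAULT_STREAM_COMPRESSION_RATIO and the uncompressed write-bytes stat come from this PR.

// Hypothetical helper, not existing Hudi API: derive the compression ratio from
// a prior commit's write stats, or fall back to the configured default (e.g.
// HoodieStorageConfig.DEFAULT_STREAM_COMPRESSION_RATIO) when there is no history.
public final class CompressionRatioEstimator {

  private CompressionRatioEstimator() {
  }

  public static double ratioOrDefault(long uncompressedBytes, long compressedBytes,
      double defaultRatio) {
    if (uncompressedBytes <= 0 || compressedBytes <= 0) {
      // No usable history yet (e.g. first commit on the dataset): use the default.
      return defaultRatio;
    }
    return (double) uncompressedBytes / (double) compressedBytes;
  }
}

A caller would feed in the uncompressed total tracked via the new setTotalUncompressedWriteBytes(..) stat alongside the compressed bytes actually written to the file.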
@n3nash Can we check in the underlying OutputStream (by adding a wrapper) to see how much is getting written? This will help us correctly throttle file size.
@ovj this code actually already does that; the change is just to make sure we can get the bytes before the stream is closed.
High level: we can go with two approaches
A) Just care about the compressed_size_per_record, based on the commit metadata from previous commits
B) Get the uncompressed and compressed sizes and determine the compression ratio.. (this PR)
Neither really tackles the case when there is no history/commits to get a sense of the record size.. (correct me if I am missing something)
I am actually leaning more towards doubling down on A (which is what the partitioner uses to pack data today). Is that grossly inaccurate in sizing partitions?
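For comparison, a hedged sketch of what doubling down on (A) would look like: estimate compressed bytes per record from a previous commit's totals, the same signal the partitioner uses to pack inserts today. The names below are illustrative, not an existing API, and the no-history case still needs a configured fallback either way.

// Hypothetical helper for approach (A): compressed bytes per record from the
// previous commit's metadata, plus how many more records a file can take before
// it reaches maxFileSize. Not an existing Hudi API; names are illustrative.
public final class RecordSizeEstimator {

  private RecordSizeEstimator() {
  }

  public static long avgCompressedBytesPerRecord(long totalBytesWritten, long recordsWritten,
      long fallbackBytesPerRecord) {
    if (totalBytesWritten <= 0 || recordsWritten <= 0) {
      // No commit history: neither (A) nor (B) helps here, so fall back to config.
      return fallbackBytesPerRecord;
    }
    return totalBytesWritten / recordsWritten;
  }

  public static long recordsUntilFull(long maxFileSize, long currentFileSize,
      long avgBytesPerRecord) {
    long remainingBytes = Math.max(0L, maxFileSize - currentFileSize);
    return remainingBytes / Math.max(1L, avgBytesPerRecord);
  }
}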
Also can you confirm this has been set..
// Config to control whether we control insert split sizes automatically based on average record sizes
public static final String COPY_ON_WRITE_TABLE_AUTO_SPLIT_INSERTS = "hoodie.copyonwrite.insert.auto.split";
// its off by default
public static final String DEFAULT_COPY_ON_WRITE_TABLE_AUTO_SPLIT_INSERTS = String.valueOf(false);
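If it is not set, a hedged example of turning it on, passing the key quoted above as a plain property (the exact write-config builder call is not shown in this thread, so this sticks to java.util.Properties):

import java.util.Properties;

// Hedged example: enable insert auto-splitting via the config key quoted above.
// It is off by default (DEFAULT_COPY_ON_WRITE_TABLE_AUTO_SPLIT_INSERTS = "false").
public class EnableAutoSplitInserts {

  public static Properties writeProps() {
    Properties props = new Properties();
    props.setProperty("hoodie.copyonwrite.insert.auto.split", "true");
    return props;
  }
}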
@@ -219,6 +219,9 @@ public void close() {
    }
    writeStatus.getStat().setNumWrites(recordsWritten);
    writeStatus.getStat().setNumDeletes(recordsDeleted);
+   // an estimate of the number of bytes written
+   writeStatus.getStat().setTotalUncompressedWriteBytes(recordsWritten*averageRecordSize);
I'd really like to understand how your IDE is set up :).. how come it missed formatting the space around the * in this diff, while it changed it everywhere in the others..
we really need to do #287
@@ -50,8 +54,29 @@
    HoodieParquetConfig parquetConfig =
        new HoodieParquetConfig(writeSupport, CompressionCodecName.GZIP,
            config.getParquetBlockSize(), config.getParquetPageSize(),
-           config.getParquetMaxFileSize(), hoodieTable.getHadoopConf());
+           config.getParquetMaxFileSize(), hoodieTable.getHadoopConf(), compressionRatioPerRecord(hoodieTable));
this will be opened in each executor?
@vinothchandar I see a few ways to pass the information about the number of records needed in a partition to the create handle:
I personally like 1 over 2 and am not in favor of 3. But I also want to explore the addition of the metric (this PR), which makes things simpler.
Option #4 (lmk what you think)
I'm fine with this approach too. Ideally, I was looking for something that requires less refactoring and can be a quick way to fix this compression issue, so I can spend time starting to run a dataset end to end, tune the compaction process, and move towards running this in prod; hence this PR and my suggested approaches..
Understand where you are coming from. Unfortunately, this is not straightforward. If that's your stated short-term goal, I suggest just introducing a config for the compression ratio, setting it as desired, and moving on..
Yeah, let me do that for now. Once I start to run some datasets and am able to do validations, I can spend time on this again.
Closing for now, keeping the issue open.