[WIP] Adaptive fragment sizes in Parquet writer #12627
Conversation
@vuule you had mentioned a while back that you thought there was a function somewhere that calculates a column's size. I couldn't find it, so I implemented my own. If you can point me to something better I'll use it.
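For reference, a minimal sketch of the kind of size helper being described, under stated assumptions: the name `column_size_approx`, the recursion over children, and the handling of sliced string columns are my own illustration, not the PR's actual `column_size()`.

```cpp
// Hypothetical helper: approximate a column's in-memory data size in bytes.
// Assumes the cudf APIs used elsewhere in this diff (detail::get_value, etc.).
#include <cudf/column/column_view.hpp>
#include <cudf/detail/get_value.cuh>
#include <cudf/strings/strings_column_view.hpp>
#include <cudf/utilities/traits.hpp>
#include <cudf/utilities/type_dispatcher.hpp>
#include <rmm/cuda_stream_view.hpp>

size_t column_size_approx(cudf::column_view const& column, rmm::cuda_stream_view stream)
{
  if (column.is_empty()) { return 0; }

  if (cudf::is_fixed_width(column.type())) {
    // Fixed-width types: element size times row count.
    return cudf::size_of(column.type()) * column.size();
  }

  if (column.type().id() == cudf::type_id::STRING) {
    // Strings: difference between the last and first offsets of this view's
    // rows (also correct for sliced views), plus the offsets themselves.
    auto const scv   = cudf::strings_column_view(column);
    auto const first = cudf::detail::get_value<cudf::size_type>(scv.offsets(), scv.offset(), stream);
    auto const last  = cudf::detail::get_value<cudf::size_type>(
      scv.offsets(), scv.offset() + scv.size(), stream);
    return static_cast<size_t>(last - first) + sizeof(cudf::size_type) * (column.size() + 1);
  }

  // Nested types (lists/structs): sum the children. Offsets children are fixed
  // width and handled by the branch above. Note: children of a sliced nested
  // column are not themselves sliced, so this overestimates for slices.
  size_t total = 0;
  for (auto child = column.child_begin(); child != column.child_end(); ++child) {
    total += column_size_approx(*child, stream);
  }
  return total;
}
```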
A side benefit is better performance in the list-column benchmarks, without the performance regression seen when simply setting the default fragment size to 1000. Before this PR:
With these changes:
Another side benefit is faster reading of files with list columns, when written with this change :)
😃 Right you are. Before:
This PR:
auto const avg_len = column_size(column, stream) / num_rows;

if (avg_len > 0) {
  size_type frag_size = max_page_size_bytes / avg_len;
Is this too large? IIUC, max_page_size_bytes / avg_len is the average number of rows in each page, which means any deviation in size between rows would cause us to overshoot the max page size.
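To put rough numbers on the concern (figures are purely illustrative): with max_page_size_bytes = 512 KiB and avg_len = 512 bytes, frag_size comes out to 1024 rows. If the rows that happen to land in one fragment average 2 KiB instead, that single fragment holds about 2 MiB of data, roughly four times the page size limit.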
Yes, avg will tend to overshoot. But with the deeply nested cases, we overshoot anyway, now just by much less. 😅 I could perhaps try max row length, but that will be trickier to calculate than total size for the nested case.
I tried to suggest a better option, but anything small enough to get really precise page sizes in the largest column would significantly degrade performance for other columns.
The current implementation is actually a good compromise.
if (avg_len > 0) {
  size_type frag_size = max_page_size_bytes / avg_len;
  max_page_fragment_size_ = std::min(frag_size, max_page_fragment_size_);
I see, so we use the fragment size based on the largest column.
Do you expect (perf) issues when we have a single large column and many small columns? The benchmarks show the best case scenario, where each table has columns of similar size.
When we talked about dynamic fragment sizes I envisioned a per-column fragment size, which seems better than the static size in all cases. I'm trying to figure out whether we can claim that this PR is also always better than the (current) static option.
Yes, this will probably be bad in the single large column/many fixed length columns case. I'm interested to see what this does with @jbrennan333's user data, which seems more mixed than the test data he generated.
I really see this as a POC to demonstrate the value of changing up the fragment size. I agree that a per-column fragment size would be best, but that's a heavier lift too. Maybe with these numbers it can be justified.
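To make the per-column idea concrete, a rough hypothetical sketch follows; the function name, signature, and the reuse of a `column_size()`-style helper are assumptions for illustration, not something this PR implements:

```cpp
// Hypothetical per-column fragment sizes; names and structure are illustrative.
#include <algorithm>
#include <vector>

#include <cudf/table/table_view.hpp>
#include <cudf/types.hpp>
#include <rmm/cuda_stream_view.hpp>

// Assumes a helper that approximates a column's data size in bytes.
size_t column_size_approx(cudf::column_view const& column, rmm::cuda_stream_view stream);

std::vector<cudf::size_type> per_column_fragment_sizes(cudf::table_view const& table,
                                                       size_t max_page_size_bytes,
                                                       cudf::size_type default_fragment_size,
                                                       rmm::cuda_stream_view stream)
{
  std::vector<cudf::size_type> frag_sizes;
  frag_sizes.reserve(table.num_columns());
  for (auto const& col : table) {
    auto const num_rows = col.size();
    auto const avg_len  = num_rows > 0 ? column_size_approx(col, stream) / num_rows : size_t{0};
    // Narrow columns keep the default fragment size; wide columns get a
    // smaller one so a single fragment still fits in the page size limit.
    auto frag_size = default_fragment_size;
    if (avg_len > 0) {
      auto const rows_per_page = std::max<size_t>(size_t{1}, max_page_size_bytes / avg_len);
      frag_size = std::min<cudf::size_type>(default_fragment_size,
                                            static_cast<cudf::size_type>(rows_per_page));
    }
    frag_sizes.push_back(frag_size);
  }
  return frag_sizes;
}
```

Each column would then build its pages from fragments sized for its own rows, so one very wide column would no longer force a small fragment size onto every narrow column in the table.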
Have you run the "parquet_write_io_compression" group of benchmarks? That one has an even mix of all supported types.
I have, but my workstation is shut down now. I can post tomorrow, but IIRC there was not a big difference, with this code being maybe a percent or two faster in most cases. I actually want to run that benchmark with ZSTD too, since that has issues with the run length=32 cases.
I will try this with the customer data. Have you already verified it with the test data I provided in #12613?
Here is the parquet-tools inspect output for the gpu file:
############ file meta data ############
created_by:
num_columns: 7
num_rows: 17105
num_row_groups: 1
format_version: 1.0
serialized_size: 5922
############ Columns ############
format
hash
data
id
part
offset
relayTs
############ Column(format) ############
name: format
path: format
max_definition_level: 1
max_repetition_level: 0
physical_type: BYTE_ARRAY
logical_type: None
converted_type (legacy): NONE
compression: ZSTD (space_saved: 100%)
############ Column(hash) ############
name: hash
path: hash
max_definition_level: 1
max_repetition_level: 0
physical_type: BYTE_ARRAY
logical_type: None
converted_type (legacy): NONE
compression: ZSTD (space_saved: 47%)
############ Column(data) ############
name: data
path: data
max_definition_level: 1
max_repetition_level: 0
physical_type: BYTE_ARRAY
logical_type: None
converted_type (legacy): NONE
compression: ZSTD (space_saved: 85%)
############ Column(id) ############
name: id
path: origin.id
max_definition_level: 2
max_repetition_level: 0
physical_type: INT32
logical_type: None
converted_type (legacy): NONE
compression: UNCOMPRESSED (space_saved: 0%)
############ Column(part) ############
name: part
path: origin.part
max_definition_level: 2
max_repetition_level: 0
physical_type: INT32
logical_type: None
converted_type (legacy): NONE
compression: UNCOMPRESSED (space_saved: 0%)
############ Column(offset) ############
name: offset
path: origin.offset
max_definition_level: 2
max_repetition_level: 0
physical_type: INT64
logical_type: None
converted_type (legacy): NONE
compression: ZSTD (space_saved: 89%)
############ Column(relayTs) ############
name: relayTs
path: relayTs
max_definition_level: 1
max_repetition_level: 0
physical_type: INT64
logical_type: None
converted_type (legacy): NONE
compression: ZSTD (space_saved: 54%)
Hmm, that's somewhat disappointing. Can you also run parquet-tools dump -d -n on the gpu and cpu files?
I don't have those options in my version of parquet-tools. Here is the washed inspect --detail output for each.
cust-inspect-detail-cpu.txt
cust-inspect-detail-gpu.txt
Sorry, I didn't realize parquet-tools was an overloaded name. I was referring to the (now deprecated, it seems) jar that comes with parquet-mr. Thanks for the extra details... combing through them now.
@jbrennan333 So compression is worse for the 'data' column ("only" 85% vs 92%). That may be down to there being fewer pages, given the 1 MB default page size for parquet-mr vs 512 KB for libcudf. But it seems the fragment sizes are allowing ZSTD to compress things, so that's good news.
@vuule here are the parquet_write_io_compression benchmarks. Before:
This PR:
cpp/src/io/parquet/writer_impl.cu
if (column.type().id() == type_id::STRING) {
  auto scol = strings_column_view(column);
  size_type colsize = cudf::detail::get_value<size_type>(scol.offsets(), column.size(), stream);
What if there are sliced rows at the start of a column? I think we would need to subtract the first offset from this.
I'm frankly amazed this code worked at all 😉 Yes, I'll subtract offsets[0].
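A sketch of what the adjusted calculation might look like; it subtracts the first offset as agreed above, and additionally (my assumption, not stated in the thread) indexes the offsets relative to column.offset() so a sliced view measures only its own rows:

```cpp
// Illustrative only; mirrors the fix discussed above, not the final code.
if (column.type().id() == cudf::type_id::STRING) {
  auto const scol = cudf::strings_column_view(column);
  // offsets() is the un-sliced offsets child, so index it with the view's
  // own offset to get the byte range covered by just these rows.
  auto const first_offset = cudf::detail::get_value<cudf::size_type>(
    scol.offsets(), column.offset(), stream);
  auto const last_offset = cudf::detail::get_value<cudf::size_type>(
    scol.offsets(), column.offset() + column.size(), stream);
  size_type colsize = last_offset - first_offset;
  // ... colsize then feeds into the average row length as before ...
}
```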
Ack. The size calculation is all wrong... it should use the leaf columns rather than the whole nested mess. I'm closing this to look at ways to do the per-column fragment sizes we were talking about.
Description
Writing Parquet files with very wide rows can produce pages that are much too large, due to the default fragment size of 5000 rows. This in turn can have an adverse effect on file size when using Zstandard compression. This PR addresses the problem by reducing the fragment size to a value at which each fragment still fits within the desired page size.
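For context, a minimal sketch of the scenario this targets, written against the public libcudf API; the file name is made up, and the availability of the compression and max_page_size_bytes builder options in the reader's libcudf version is an assumption:

```cpp
// Rough sketch: write a table with very wide rows using ZSTD compression.
#include <cudf/io/parquet.hpp>
#include <cudf/table/table_view.hpp>

void write_wide_rows(cudf::table_view const& wide_table)
{
  auto const sink = cudf::io::sink_info{"wide_rows.parquet"};  // hypothetical output path
  auto const opts = cudf::io::parquet_writer_options::builder(sink, wide_table)
                      .compression(cudf::io::compression_type::ZSTD)
                      .max_page_size_bytes(512 * 1024)  // libcudf's default page size limit
                      .build();
  // With the fixed default of 5000 rows per fragment, very wide rows make each
  // fragment (and thus each page) far exceed this limit; the adaptive fragment
  // size shrinks the fragment so pages stay near the requested size.
  cudf::io::write_parquet(opts);
}
```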
Fixes #12613
Checklist