Cleanup hostdevice_vector and add more APIs #15252
Conversation
Signed-off-by: Nghia Truong <[email protected]>
Signed-off-by: Nghia Truong <[email protected]>
Signed-off-by: Nghia Truong <[email protected]>
How did you manage to open this hours after #15079 merged?!
How did this come up?
I've just merged #15079 into my chunked ORC reader and all the tests crashed, so I tracked down this bug.
I'm using
I've just tried to remove push_back. Note that this will change the interface of hostdevice_vector.
If you think you can quickly remove it, please go ahead. I'll be happy to see it gone.
Signed-off-by: Nghia Truong <[email protected]>
Signed-off-by: Nghia Truong <[email protected]>
Signed-off-by: Nghia Truong <[email protected]>
Signed-off-by: Nghia Truong <[email protected]>
@bdice first of all, 😂
cudf::detail::hostdevice_vector<gpu::CompressedStreamInfo> compinfo(stream_info.size(), stream);
for (std::size_t idx = 0; idx < stream_info.size(); ++idx) {
  auto const& info = stream_info[idx];
  compinfo[idx] = gpu::CompressedStreamInfo(
    static_cast<uint8_t const*>(stripe_data[info.stripe_idx].data()) + info.dst_pos, info.length);
}
Are you kidding me, this was it?!
What's wrong with it?
Nothing, I just thought (feared) bigger changes were needed to remove push_back.
Apparently the span tests also use push_back.
So far we have the new names:
Any other candidate? @nvdbaranec?
Not sure if this is clear: IMO we should not rename it in this PR. Maybe something we can evaluate for 24.06.
Signed-off-by: Nghia Truong <[email protected]>
I'd point out:
For me "buffer" implies typeless, ie |
We could add a
I don’t want to force a name change in this PR. Just had to
I think we'd better keep the current name. So, keep hostdevice_vector.
Feel free to limit this PR to its original scope. I always felt like
This reverts commit 0c149c3.
This reverts commit f931d43. (Conflicts: cpp/src/io/utilities/hostdevice_vector.hpp)
This reverts commit eac0d51. (Conflicts: cpp/src/io/utilities/hostdevice_vector.hpp)
Signed-off-by: Nghia Truong <[email protected]>
/merge
This implements the ORC chunked reader, supporting reading ORC files such that:
* The output is multiple tables instead of a single one; each is issued by a call to `read_chunk()` and has a limited size that stays within a given `output_limit` parameter.
* The temporary device memory usage can be limited by a soft-limit `data_read_limit` parameter, allowing very large ORC files to be read without OOM.
* ORC files containing many billions of rows can be read chunk-by-chunk without hitting the size overflow issue that occurs when the number of rows exceeds the cudf size limit (`2^31` rows).

Depends on:
* #14911
* #15008
* #15169
* #15252

Partially contributes to #12228.

---

## Benchmarks

Due to some small optimizations in the ORC reader, reading ORC files all at once (reading the entire file into a single output table) can be slightly faster. For example, with the `orc_read_io_compression` benchmark:

```
## [0] Quadro RTX 6000

| io | compression | cardinality | run_length | Ref Time | Ref Noise | Cmp Time | Cmp Noise | Diff | %Diff | Status |
|---------------|---------------|---------------|--------------|------------|-------------|------------|-------------|---------------|---------|----------|
| FILEPATH | SNAPPY | 0 | 1 | 183.027 ms | 7.45% | 157.293 ms | 4.72% | -25733.837 us | -14.06% | FAIL |
| FILEPATH | SNAPPY | 1000 | 1 | 198.228 ms | 6.43% | 164.395 ms | 4.14% | -33833.020 us | -17.07% | FAIL |
| FILEPATH | SNAPPY | 0 | 32 | 96.676 ms | 6.19% | 82.522 ms | 1.36% | -14153.945 us | -14.64% | FAIL |
| FILEPATH | SNAPPY | 1000 | 32 | 94.508 ms | 4.80% | 81.078 ms | 0.48% | -13429.672 us | -14.21% | FAIL |
| FILEPATH | NONE | 0 | 1 | 161.868 ms | 5.40% | 139.849 ms | 2.44% | -22018.910 us | -13.60% | FAIL |
| FILEPATH | NONE | 1000 | 1 | 164.902 ms | 5.80% | 142.041 ms | 3.43% | -22861.258 us | -13.86% | FAIL |
| FILEPATH | NONE | 0 | 32 | 88.298 ms | 5.15% | 74.924 ms | 1.97% | -13374.607 us | -15.15% | FAIL |
| FILEPATH | NONE | 1000 | 32 | 87.147 ms | 5.61% | 72.502 ms | 0.50% | -14645.122 us | -16.81% | FAIL |
| HOST_BUFFER | SNAPPY | 0 | 1 | 124.990 ms | 0.39% | 111.670 ms | 2.13% | -13320.483 us | -10.66% | FAIL |
| HOST_BUFFER | SNAPPY | 1000 | 1 | 149.858 ms | 4.10% | 126.266 ms | 0.48% | -23591.543 us | -15.74% | FAIL |
| HOST_BUFFER | SNAPPY | 0 | 32 | 92.499 ms | 4.46% | 77.653 ms | 1.58% | -14846.471 us | -16.05% | FAIL |
| HOST_BUFFER | SNAPPY | 1000 | 32 | 93.373 ms | 4.14% | 80.033 ms | 3.19% | -13340.002 us | -14.29% | FAIL |
| HOST_BUFFER | NONE | 0 | 1 | 111.792 ms | 0.50% | 97.083 ms | 0.50% | -14709.530 us | -13.16% | FAIL |
| HOST_BUFFER | NONE | 1000 | 1 | 117.646 ms | 5.60% | 97.634 ms | 0.44% | -20012.301 us | -17.01% | FAIL |
| HOST_BUFFER | NONE | 0 | 32 | 84.983 ms | 4.96% | 66.975 ms | 0.50% | -18007.403 us | -21.19% | FAIL |
| HOST_BUFFER | NONE | 1000 | 32 | 82.648 ms | 4.42% | 65.510 ms | 0.91% | -17137.910 us | -20.74% | FAIL |
| DEVICE_BUFFER | SNAPPY | 0 | 1 | 65.538 ms | 4.02% | 59.399 ms | 2.54% | -6138.560 us | -9.37% | FAIL |
| DEVICE_BUFFER | SNAPPY | 1000 | 1 | 101.427 ms | 4.10% | 92.276 ms | 3.30% | -9150.278 us | -9.02% | FAIL |
| DEVICE_BUFFER | SNAPPY | 0 | 32 | 80.133 ms | 4.64% | 73.959 ms | 3.50% | -6173.818 us | -7.70% | FAIL |
| DEVICE_BUFFER | SNAPPY | 1000 | 32 | 86.232 ms | 4.71% | 77.446 ms | 3.32% | -8786.606 us | -10.19% | FAIL |
| DEVICE_BUFFER | NONE | 0 | 1 | 52.189 ms | 6.62% | 45.018 ms | 4.11% | -7171.043 us | -13.74% | FAIL |
| DEVICE_BUFFER | NONE | 1000 | 1 | 54.664 ms | 6.76% | 46.855 ms | 3.35% | -7809.803 us | -14.29% | FAIL |
| DEVICE_BUFFER | NONE | 0 | 32 | 67.975 ms | 5.12% | 60.553 ms | 4.22% | -7422.279 us | -10.92% | FAIL |
| DEVICE_BUFFER | NONE | 1000 | 32 | 68.485 ms | 4.86% | 62.253 ms | 6.23% | -6232.340 us | -9.10% | FAIL |
```

When memory is limited, chunked reading can help avoid OOM, at some cost in performance. For example, reading a table of size 500 MB from file using a 64 MB output limit and a 640 MB data read limit:

```
| io | compression | cardinality | run_length | Ref Time | Ref Noise | Cmp Time | Cmp Noise | Diff | %Diff | Status |
|---------------|---------------|---------------|--------------|------------|-------------|------------|-------------|------------|---------|----------|
| FILEPATH | SNAPPY | 0 | 1 | 183.027 ms | 7.45% | 350.824 ms | 2.74% | 167.796 ms | 91.68% | FAIL |
| FILEPATH | SNAPPY | 1000 | 1 | 198.228 ms | 6.43% | 322.414 ms | 3.46% | 124.186 ms | 62.65% | FAIL |
| FILEPATH | SNAPPY | 0 | 32 | 96.676 ms | 6.19% | 133.363 ms | 4.78% | 36.686 ms | 37.95% | FAIL |
| FILEPATH | SNAPPY | 1000 | 32 | 94.508 ms | 4.80% | 128.897 ms | 0.37% | 34.389 ms | 36.39% | FAIL |
| FILEPATH | NONE | 0 | 1 | 161.868 ms | 5.40% | 316.637 ms | 4.21% | 154.769 ms | 95.61% | FAIL |
| FILEPATH | NONE | 1000 | 1 | 164.902 ms | 5.80% | 326.043 ms | 3.06% | 161.141 ms | 97.72% | FAIL |
| FILEPATH | NONE | 0 | 32 | 88.298 ms | 5.15% | 124.819 ms | 5.17% | 36.520 ms | 41.36% | FAIL |
| FILEPATH | NONE | 1000 | 32 | 87.147 ms | 5.61% | 123.047 ms | 5.82% | 35.900 ms | 41.19% | FAIL |
| HOST_BUFFER | SNAPPY | 0 | 1 | 124.990 ms | 0.39% | 285.718 ms | 0.78% | 160.728 ms | 128.59% | FAIL |
| HOST_BUFFER | SNAPPY | 1000 | 1 | 149.858 ms | 4.10% | 263.491 ms | 2.89% | 113.633 ms | 75.83% | FAIL |
| HOST_BUFFER | SNAPPY | 0 | 32 | 92.499 ms | 4.46% | 127.881 ms | 0.86% | 35.382 ms | 38.25% | FAIL |
| HOST_BUFFER | SNAPPY | 1000 | 32 | 93.373 ms | 4.14% | 128.022 ms | 0.98% | 34.650 ms | 37.11% | FAIL |
| HOST_BUFFER | NONE | 0 | 1 | 111.792 ms | 0.50% | 241.064 ms | 1.89% | 129.271 ms | 115.64% | FAIL |
| HOST_BUFFER | NONE | 1000 | 1 | 117.646 ms | 5.60% | 248.134 ms | 3.08% | 130.488 ms | 110.92% | FAIL |
| HOST_BUFFER | NONE | 0 | 32 | 84.983 ms | 4.96% | 118.049 ms | 5.99% | 33.066 ms | 38.91% | FAIL |
| HOST_BUFFER | NONE | 1000 | 32 | 82.648 ms | 4.42% | 114.577 ms | 2.34% | 31.929 ms | 38.63% | FAIL |
| DEVICE_BUFFER | SNAPPY | 0 | 1 | 65.538 ms | 4.02% | 232.466 ms | 3.28% | 166.928 ms | 254.71% | FAIL |
| DEVICE_BUFFER | SNAPPY | 1000 | 1 | 101.427 ms | 4.10% | 221.578 ms | 1.43% | 120.152 ms | 118.46% | FAIL |
| DEVICE_BUFFER | SNAPPY | 0 | 32 | 80.133 ms | 4.64% | 120.604 ms | 0.35% | 40.471 ms | 50.50% | FAIL |
| DEVICE_BUFFER | SNAPPY | 1000 | 32 | 86.232 ms | 4.71% | 125.521 ms | 3.93% | 39.289 ms | 45.56% | FAIL |
| DEVICE_BUFFER | NONE | 0 | 1 | 52.189 ms | 6.62% | 182.943 ms | 0.29% | 130.754 ms | 250.54% | FAIL |
| DEVICE_BUFFER | NONE | 1000 | 1 | 54.664 ms | 6.76% | 190.501 ms | 0.49% | 135.836 ms | 248.49% | FAIL |
| DEVICE_BUFFER | NONE | 0 | 32 | 67.975 ms | 5.12% | 107.172 ms | 3.56% | 39.197 ms | 57.66% | FAIL |
| DEVICE_BUFFER | NONE | 1000 | 32 | 68.485 ms | 4.86% | 108.097 ms | 2.92% | 39.611 ms | 57.84% | FAIL |
```

And if memory is very limited, chunked reading with an 8 MB output limit and an 80 MB data read limit:

```
| io | compression | cardinality | run_length | Ref Time | Ref Noise | Cmp Time | Cmp Noise | Diff | %Diff | Status |
|---------------|---------------|---------------|--------------|------------|-------------|------------|-------------|------------|---------|----------|
| FILEPATH | SNAPPY | 0 | 1 | 183.027 ms | 7.45% | 732.926 ms | 1.98% | 549.899 ms | 300.45% | FAIL |
| FILEPATH | SNAPPY | 1000 | 1 | 198.228 ms | 6.43% | 834.309 ms | 4.21% | 636.081 ms | 320.88% | FAIL |
| FILEPATH | SNAPPY | 0 | 32 | 96.676 ms | 6.19% | 363.033 ms | 1.66% | 266.356 ms | 275.51% | FAIL |
| FILEPATH | SNAPPY | 1000 | 32 | 94.508 ms | 4.80% | 313.813 ms | 1.28% | 219.305 ms | 232.05% | FAIL |
| FILEPATH | NONE | 0 | 1 | 161.868 ms | 5.40% | 607.700 ms | 2.90% | 445.832 ms | 275.43% | FAIL |
| FILEPATH | NONE | 1000 | 1 | 164.902 ms | 5.80% | 616.101 ms | 3.46% | 451.199 ms | 273.62% | FAIL |
| FILEPATH | NONE | 0 | 32 | 88.298 ms | 5.15% | 267.703 ms | 0.46% | 179.405 ms | 203.18% | FAIL |
| FILEPATH | NONE | 1000 | 32 | 87.147 ms | 5.61% | 250.528 ms | 0.43% | 163.381 ms | 187.48% | FAIL |
| HOST_BUFFER | SNAPPY | 0 | 1 | 124.990 ms | 0.39% | 636.270 ms | 0.44% | 511.280 ms | 409.06% | FAIL |
| HOST_BUFFER | SNAPPY | 1000 | 1 | 149.858 ms | 4.10% | 747.264 ms | 0.50% | 597.406 ms | 398.65% | FAIL |
| HOST_BUFFER | SNAPPY | 0 | 32 | 92.499 ms | 4.46% | 359.660 ms | 0.19% | 267.161 ms | 288.82% | FAIL |
| HOST_BUFFER | SNAPPY | 1000 | 32 | 93.373 ms | 4.14% | 311.608 ms | 0.43% | 218.235 ms | 233.73% | FAIL |
| HOST_BUFFER | NONE | 0 | 1 | 111.792 ms | 0.50% | 493.797 ms | 0.13% | 382.005 ms | 341.71% | FAIL |
| HOST_BUFFER | NONE | 1000 | 1 | 117.646 ms | 5.60% | 516.706 ms | 0.12% | 399.060 ms | 339.20% | FAIL |
| HOST_BUFFER | NONE | 0 | 32 | 84.983 ms | 4.96% | 258.477 ms | 0.46% | 173.495 ms | 204.15% | FAIL |
| HOST_BUFFER | NONE | 1000 | 32 | 82.648 ms | 4.42% | 248.028 ms | 5.30% | 165.380 ms | 200.10% | FAIL |
| DEVICE_BUFFER | SNAPPY | 0 | 1 | 65.538 ms | 4.02% | 606.010 ms | 3.76% | 540.472 ms | 824.68% | FAIL |
| DEVICE_BUFFER | SNAPPY | 1000 | 1 | 101.427 ms | 4.10% | 742.774 ms | 4.64% | 641.347 ms | 632.33% | FAIL |
| DEVICE_BUFFER | SNAPPY | 0 | 32 | 80.133 ms | 4.64% | 364.701 ms | 2.70% | 284.568 ms | 355.12% | FAIL |
| DEVICE_BUFFER | SNAPPY | 1000 | 32 | 86.232 ms | 4.71% | 320.387 ms | 2.80% | 234.155 ms | 271.54% | FAIL |
| DEVICE_BUFFER | NONE | 0 | 1 | 52.189 ms | 6.62% | 458.100 ms | 2.15% | 405.912 ms | 777.78% | FAIL |
| DEVICE_BUFFER | NONE | 1000 | 1 | 54.664 ms | 6.76% | 478.527 ms | 1.41% | 423.862 ms | 775.39% | FAIL |
| DEVICE_BUFFER | NONE | 0 | 32 | 67.975 ms | 5.12% | 260.009 ms | 3.71% | 192.034 ms | 282.51% | FAIL |
| DEVICE_BUFFER | NONE | 1000 | 32 | 68.485 ms | 4.86% | 243.705 ms | 2.09% | 175.220 ms | 255.85% | FAIL |
```

Authors:
- Nghia Truong (https://github.com/ttnghia)

Approvers:
- Bradley Dice (https://github.com/bdice)
- https://github.com/nvdbaranec
- Vukasin Milovanovic (https://github.com/vuule)

URL: #15094
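For illustration, here is a minimal sketch of how the chunked-read loop described above might be driven. Only `read_chunk()`, `output_limit`, and `data_read_limit` are named in the description; the `chunked_orc_reader` class name, its constructor arguments, and `has_next()` are assumptions modeled on cudf's chunked-reader style and are not verified against the merged API.

```
// Hypothetical usage sketch; names and signatures are assumptions, not the verified libcudf API.
#include <cudf/io/orc.hpp>

#include <cstddef>
#include <string>

void read_orc_in_chunks(std::string const& path)
{
  auto const options =
    cudf::io::orc_reader_options::builder(cudf::io::source_info{path}).build();

  // output_limit: soft cap on the size of each returned table (bytes).
  // data_read_limit: soft cap on temporary device memory used while decoding (bytes).
  std::size_t const output_limit    = 64UL * 1024 * 1024;   // example: 64 MB
  std::size_t const data_read_limit = 640UL * 1024 * 1024;  // example: 640 MB

  cudf::io::chunked_orc_reader reader(output_limit, data_read_limit, options);
  while (reader.has_next()) {
    auto chunk = reader.read_chunk();  // one bounded-size table per call
    // ... consume chunk.tbl here ...
  }
}
```

The trade-off shown in the benchmarks follows directly from this loop: smaller limits mean more `read_chunk()` calls and more repeated decoding work, but a lower peak memory footprint.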
This work includes:
* Fixing a bug in `hostdevice_vector` where the host buffer does not change its size when appending new elements; instead, the new elements are written directly into raw memory (which is out of bounds). Previously this did not trigger any issue, since the host buffer reserves plenty of memory upon construction, until I attempted to access the `front()` and `back()` elements of it.
* Adding `front()` and `back()` accessors, which return the first and last elements in the host buffer.
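To make the failure mode concrete, here is a small, host-only sketch. The `toy_hostdevice_vector` class below is hypothetical (it is not the actual cudf::detail::hostdevice_vector implementation); it only illustrates what goes wrong when an append writes into reserved memory without updating the element count, and how `front()`/`back()`-style accessors expose the bug.

```
// Hypothetical toy model of the bug, not the real cudf class.
#include <cassert>
#include <cstddef>
#include <vector>

template <typename T>
class toy_hostdevice_vector {
 public:
  explicit toy_hostdevice_vector(std::size_t initial_size, std::size_t max_size)
    : h_data_(max_size), size_(initial_size)
  {
    assert(initial_size <= max_size);
  }

  // Buggy append: writes into the reserved storage but forgets to grow size_,
  // so the element is out of bounds with respect to the logical size.
  void buggy_push_back(T const& value) { h_data_[size_] = value; }

  // Correct append: writes the element and records the new size.
  void push_back(T const& value)
  {
    assert(size_ < h_data_.size());
    h_data_[size_++] = value;
  }

  // Accessors analogous to the front()/back() added by this PR.
  T& front() { assert(size_ > 0); return h_data_[0]; }
  T& back()  { assert(size_ > 0); return h_data_[size_ - 1]; }

  std::size_t size() const { return size_; }

 private:
  std::vector<T> h_data_;  // stands in for the pinned host allocation
  std::size_t size_;       // number of valid elements, not the reserved capacity
};

int main()
{
  toy_hostdevice_vector<int> v(0, 8);
  v.buggy_push_back(42);  // element lands in reserved memory, but size() stays 0
  // v.back() would now assert (or silently read garbage without the check),
  // which is how the chunked ORC reader tests exposed the original bug.
  v.push_back(7);         // correct path: size() becomes 1 and back() returns 7
  return v.back() == 7 ? 0 : 1;
}
```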