Refactor pinned memory vector and ORC+Parquet writers #13206

Merged: 19 commits into rapidsai:branch-23.06 on Apr 25, 2023

Conversation

ttnghia (Contributor) commented on Apr 24, 2023

Currently, cudf::detail::pinned_allocator is used in various places to implement a pinned host vector. This PR standardizes that usage, removing cudf::detail::pinned_allocator from the call sites and replacing it with a single cudf::detail::pinned_host_vector type.
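For context, such a type can be as small as the sketch below, built here on Thrust's (since-deprecated) experimental pinned allocator; the actual cudf::detail definition may differ in its details:

```cpp
// Sketch only: a pinned host vector as a thrust::host_vector with a
// page-locked (pinned) allocator. The include path and allocator are
// Thrust's experimental ones, not necessarily what libcudf uses.
#include <thrust/host_vector.h>
#include <thrust/system/cuda/experimental/pinned_allocator.h>

template <typename T>
using pinned_host_vector =
  thrust::host_vector<T, thrust::system::cuda::experimental::pinned_allocator<T>>;

pinned_host_vector<int> staging(1024);  // host memory suitable for fast async H2D/D2H copies
```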

Some small changes are also made to the ORC/Parquet writer classes, replacing the bool _single_write_mode member with a SingleWriteMode _single_write_mode.
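A sketch of what the enum-over-bool change buys; the enum body here is an assumption, not necessarily cudf's exact definition:

```cpp
// Assumed shape of the enum; the real cudf definition may differ.
enum class SingleWriteMode : bool { YES, NO };

struct writer_state {
  // before: bool _single_write_mode;   // true/false is opaque at call sites
  SingleWriteMode _single_write_mode;   // after: intent is explicit
};
```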

This is needed for #13076.

@ttnghia ttnghia added the labels 3 - Ready for Review (Ready for review by team), libcudf blocker, libcudf (Affects libcudf (C++/CUDA) code), Spark (Functionality that helps Spark RAPIDS), improvement (Improvement / enhancement to an existing function), non-breaking (Non-breaking change) on Apr 24, 2023
@ttnghia ttnghia requested a review from a team as a code owner April 24, 2023 17:37
@ttnghia ttnghia self-assigned this Apr 24, 2023
@github-actions github-actions bot added the conda label Apr 24, 2023
@ttnghia ttnghia changed the title Refactor pinned memory vector Refactor pinned memory vector and ORC+Parquet writers Apr 24, 2023
ajschmidt8 (Member) left a comment:

Approving ops-codeowner file changes

A Member commented:

IIUC, pinned memory is always on the host side, so I'm not sure this renaming is really needed.

ttnghia (Contributor, Author) replied on Apr 25, 2023:

IMO the name *_host_vector better expresses its purpose, similar to having thrust::host_vector instead of just thrust::vector.

A Contributor replied:

I read the name as host_vector in pinned memory, so the name looks good.

Two review threads on cpp/src/io/orc/writer_impl.cu (outdated, resolved).
/**
 * @brief Helper for pinned host memory
 */
template <typename T>
using pinned_buffer = std::unique_ptr<T, decltype(&cudaFreeHost)>;
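For readers outside the diff, a buffer of this shape is typically created along the lines of the following sketch; the make_pinned_buffer helper is hypothetical, not code from the PR:

```cpp
#include <cuda_runtime.h>
#include <cstddef>
#include <memory>
#include <new>

// Hypothetical helper: allocate page-locked host memory and hand ownership
// to a pinned_buffer, whose deleter (cudaFreeHost) releases it.
template <typename T>
pinned_buffer<T> make_pinned_buffer(std::size_t count)
{
  T* ptr = nullptr;
  if (cudaMallocHost(reinterpret_cast<void**>(&ptr), count * sizeof(T)) != cudaSuccess) {
    throw std::bad_alloc();
  }
  return pinned_buffer<T>{ptr, cudaFreeHost};
}
```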
A Contributor commented:
This looks good, with one minor concern. For places that already used host_vector, there's no change. But for this use case we are introducing initialization into memory that does not need to be initialized (and previously wasn't). If you don't mind, please run the ORC or Parquet writer benchmarks.
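The concern in minimal form (assuming Thrust behaves like std::vector here): sized construction value-initializes every element, whereas the raw cudaMallocHost path leaves contents indeterminate.

```cpp
#include <thrust/host_vector.h>

int main()
{
  // Sized construction value-initializes all elements (the cost the reviewer
  // is worried about); a raw cudaMallocHost allocation would skip this.
  thrust::host_vector<int> v(1024);
  return v[0];  // always 0
}
```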

ttnghia (Contributor, Author) replied:

Sorry, I don't see the initialization you mentioned. The new pinned_host_vector uses an allocator that also doesn't initialize the internal buffer:

__host__ inline pointer allocate(size_type cnt, const_pointer /*hint*/ = 0)
{
  if (cnt > this->max_size()) { throw std::bad_alloc(); }  // end if

  pointer result(0);
  CUDF_CUDA_TRY(cudaMallocHost(reinterpret_cast<void**>(&result), cnt * sizeof(value_type)));
  return result;
}

ttnghia (Contributor, Author) replied:

Please see benchmarks here: #13206 (comment)

vuule (Contributor) replied on Apr 24, 2023:

I thought that host_vector has its own initialization outside of the allocator (same as std::vector). Either way, it does not seem to impact the overall performance.
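That container-side initialization is exactly what a no-op construct() allocator can sidestep; the following is purely an illustrative sketch of that general technique, not something this PR does:

```cpp
#include <memory>
#include <new>
#include <utility>
#include <vector>

// Illustrative only: an allocator whose zero-argument construct() is a no-op,
// so sized construction of a vector skips value-initializing trivial elements.
template <typename T>
struct no_init_allocator : std::allocator<T> {
  template <typename U>
  struct rebind { using other = no_init_allocator<U>; };

  template <typename U, typename... Args>
  void construct(U* p, Args&&... args)
  {
    if constexpr (sizeof...(Args) > 0) {
      ::new (static_cast<void*>(p)) U(std::forward<Args>(args)...);
    }
    // zero-argument case: intentionally left uninitialized
  }
};

int main()
{
  std::vector<int, no_init_allocator<int>> v(1024);  // elements not zeroed
  return 0;
}
```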

ttnghia (Contributor, Author) commented on Apr 24, 2023:

@vuule Here are some benchmarks for the ORC writer:

['./before.json', 'after.json']
# orc_write_encode

## [0] Quadro RTX 6000

|    data_type    |  cardinality  |  run_length  |   Ref Time |   Ref Noise |   Cmp Time |   Cmp Noise |         Diff |   %Diff |  Status  |
|-----------------|---------------|--------------|------------|-------------|------------|-------------|--------------|---------|----------|
| INTEGRAL_SIGNED |       0       |      1       | 127.838 ms |       3.63% | 126.039 ms |       2.71% | -1799.682 us |  -1.41% |   PASS   |
| INTEGRAL_SIGNED |     1000      |      1       | 254.980 ms |       1.54% | 257.532 ms |       1.94% |     2.552 ms |   1.00% |   PASS   |
| INTEGRAL_SIGNED |       0       |      32      |  52.873 ms |       3.76% |  54.257 ms |       4.01% |     1.385 ms |   2.62% |   PASS   |
| INTEGRAL_SIGNED |     1000      |      32      |  55.094 ms |       1.51% |  55.554 ms |       2.56% |   459.615 us |   0.83% |   PASS   |
|     STRING      |       0       |      1       | 545.909 ms |       0.02% | 547.088 ms |       0.37% |     1.178 ms |   0.22% |   FAIL   |
|     STRING      |     1000      |      1       | 813.042 ms |       0.11% | 813.765 ms |       0.15% |   722.949 us |   0.09% |   PASS   |
|     STRING      |       0       |      32      | 545.776 ms |       0.07% | 546.026 ms |       0.10% |   250.000 us |   0.05% |   PASS   |
|     STRING      |     1000      |      32      | 650.413 ms |       0.07% | 650.709 ms |       0.12% |   295.630 us |   0.05% |   PASS   |
|      LIST       |       0       |      1       | 133.945 ms |       0.31% | 134.238 ms |       0.35% |   293.109 us |   0.22% |   PASS   |
|      LIST       |     1000      |      1       | 336.120 ms |       0.10% | 337.486 ms |       0.82% |     1.367 ms |   0.41% |   FAIL   |
|      LIST       |       0       |      32      |  88.915 ms |       0.35% |  88.975 ms |       0.40% |    60.099 us |   0.07% |   PASS   |
|      LIST       |     1000      |      32      | 104.201 ms |       0.34% | 104.000 ms |       0.30% |  -201.286 us |  -0.19% |   PASS   |
|     STRUCT      |       0       |      1       | 547.717 ms |       0.16% | 548.105 ms |       0.15% |   387.793 us |   0.07% |   PASS   |
|     STRUCT      |     1000      |      1       | 902.261 ms |       0.43% | 901.978 ms |       0.08% |  -282.324 us |  -0.03% |   PASS   |
|     STRUCT      |       0       |      32      | 527.460 ms |       0.08% | 528.633 ms |       0.25% |     1.173 ms |   0.22% |   FAIL   |
|     STRUCT      |     1000      |      32      | 607.662 ms |       0.27% | 608.766 ms |       0.52% |     1.103 ms |   0.18% |   PASS   |

# orc_write

## [0] Quadro RTX 6000

|  num_columns  |   Ref Time |   Ref Noise |   Cmp Time |   Cmp Noise |       Diff |   %Diff |  Status  |
|---------------|------------|-------------|------------|-------------|------------|---------|----------|
|       8       | 165.981 ms |       0.30% | 166.784 ms |       0.49% | 803.009 us |   0.48% |   FAIL   |
|      64       |    1.558 s |       1.41% |    1.591 s |       3.23% |  32.855 ms |   2.11% |   FAIL   |

# orc_chunked_write

## [0] Quadro RTX 6000

|  num_columns  |  num_chunks  |   Ref Time |   Ref Noise |   Cmp Time |   Cmp Noise |          Diff |   %Diff |  Status  |
|---------------|--------------|------------|-------------|------------|-------------|---------------|---------|----------|
|       8       |      8       | 393.974 ms |       0.51% | 425.155 ms |       2.94% |     31.181 ms |   7.91% |   FAIL   |
|      64       |      8       |    3.969 s |        inf% |    4.332 s |        inf% |    362.998 ms |   9.15% |   PASS   |
|       8       |      64      |    1.831 s |       0.12% |    1.976 s |       0.95% |    144.625 ms |   7.90% |   FAIL   |
|      64       |      64      |    7.401 s |        inf% |    7.366 s |        inf% | -34970.378 us |  -0.47% |   PASS   |

vuule (Contributor) left a comment:

Thank you for taking the time to make these improvements.

@ttnghia ttnghia requested a review from PointKernel April 25, 2023 02:15
ttnghia (Contributor, Author) commented on Apr 25, 2023:

/merge

@rapids-bot rapids-bot bot merged commit 848dd5a into rapidsai:branch-23.06 Apr 25, 2023
@ttnghia ttnghia deleted the refactor_pinned_allocator branch April 25, 2023 02:55
rapids-bot bot pushed a commit that referenced this pull request May 1, 2023
Similar to #12949, this refactors the Parquet writer to support a retry mechanism. The internal `writer::impl::write()` function is rewritten so that it is separated into multiple pieces:
 * A free function that compresses/encodes the input table into intermediate results. These intermediate results are entirely independent of the writer.
 * Once the intermediate results from the previous step are available, they are applied to the output data sink to perform the actual write (a sketch follows below).
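A minimal sketch of that two-phase structure, with all names invented for illustration (this is not the cudf API):

```cpp
#include <vector>

struct encoded_data {
  std::vector<char> pages;  // stand-in for compressed/encoded buffers
};

// Phase 1: a free function holding no writer state; a failure here leaves
// the writer untouched, so the call can simply be retried.
encoded_data encode_input(std::vector<int> const& input)
{
  encoded_data out;
  out.pages.assign(input.begin(), input.end());  // stand-in for real encoding
  return out;
}

// Phase 2: only after encoding succeeds are the results committed to the sink.
struct writer {
  std::vector<char> sink;
  void write(std::vector<int> const& input)
  {
    auto encoded = encode_input(input);  // may throw; retryable
    sink.insert(sink.end(), encoded.pages.begin(), encoded.pages.end());
  }
};
```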

Closes: 
 * #13042

Depends on:
 * #13206

Authors:
  - Nghia Truong (https://github.com/ttnghia)

Approvers:
  - Vukasin Milovanovic (https://github.com/vuule)
  - https://github.com/nvdbaranec

URL: #13076