Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Automate include grouping order in .clang-format #14993

Closed

Conversation

harrism
Copy link
Member

@harrism harrism commented Feb 7, 2024

Description

This uses the IncludeCategories settings in .clang-format to attempt to enforce our documented #include order in libcudf. See https://docs.rapids.ai/api/libcudf/stable/developer_guide

I realize that there was a previous attempt at this by @bdice that met with some resistance. Reading it, I wouldn't say it was vetoed; rather, reviewers requested something much simpler. I have a few reasons to attempt this again.

  1. To make a separate task much easier. We are undertaking a refactoring of RMM that will replace rmm::mr::device_memory_resource* with rmm::device_async_resource-ref everywhere in RAPIDS (not just cuDF). This requires adding an include to MANY files. Getting the location of the include right everywhere is very difficult without automatic grouping of headers. I started out writing a bash script to do this before realizing clang-format has the necessary feature. And I realized that my script would never properly handle files like this.
  2. To increase velocity. Everywhere in RAPIDS that we have automated code standard/style/formatting/other, the benefits to velocity have outweighed the costs. To paraphrase @bdice, $auto \nearrow \rightarrow \mu \searrow \rightarrow v \nearrow$
  3. The previous PR Automatic include sorting with clang-format #12760 had nearly 50 categories of headers. There was no way this could be applied universally across RAPIDS repos. My proposal has 10 categories. I tried to reduce it further but realized that it wouldn't be much less configuration to maintain, so I stopped at 10.

Note that one of the ways that having few categories can work while still maintaining clear groups is that this PR updates many files to use quotes ("") instead of angle brackets (<>) for local cuDF headers that do not live in cudf/cpp/include. With our "near to far" include ordering policy, these are arguably the nearest files, and using quotes allows us to have our first category simply check for quotes. These files will be grouped and sorted without blank lines, but in practice this does not lose clarity because typically headers from more than two directories are not included from the same file. The downside of this change is I don't yet know how to automatically enforce it. I hope that when developers accidentally use <> for internal includes that don't start with (e.g.) "cudf", they will be grouped one of the lowest priority categories, and perhaps this will induce them to switch to "" to get the headers listed at the top. The rule is simple: if it's in libcudf but not in cpp/include/cudf, then use quotes. For everything else, use angle brackets.

Other than headers from RAPIDS repos, I group all other headers that have a file extension in a single group, and all files that have no file extension in another group. Since the latter also matches includes some files from libcu++, I have an explicit category for <cuda/ includes to keep them separate from STL includes. A frequent effect of the single "." group is that cub and thrust headers get grouped without a blank line between them. I don't think this is a problem.

Below I'm listing the (fairly simple, in my opinion) .clang-format settings for this PR. Note that categories 2-5 will require tweaking for different RAPIDS repos.

Some may ask why I ordered cudf_test headers before cudf headers. I tried both orders, and putting cudf_test first generated significantly fewer changes in the PR, meaning that it's already the more common ordering (I suppose cudf_test is closer to the files that include it, since they are libcudf tests).

I've opened a similar PR for RMM with only 5 groups. rapidsai/rmm#1463

CC @davidwendt @vyasr @wence- @GregoryKimball for feedback

@isVoid contributed to this PR via pair programming.

IncludeBlocks: Regroup
IncludeCategories:
  - Regex:           '^"' # quoted includes
    Priority:        1
  - Regex:           '^<(benchmarks|tests)/' # benchmark includes
    Priority:        2
  - Regex:           '^<cudf_test/' # cuDF includes
    Priority:        3
  - Regex:           '^<(nvtext|cudf_kafka)' # other libcudf includes
    Priority:        4
  - Regex:           '^<cudf/' # cuDF includes
    Priority:        5
  - Regex:           '^<(cugraph|cuml|cuspatial|raft|kvikio)' # Other RAPIDS includes
    Priority:        6
  - Regex:           '^<rmm/' # RMM includes
    Priority:        7
  - Regex:           '^<(thrust|cub|cuda)/' # CCCL includes
    Priority:        8
  - Regex:           '^<(cuda.h|cuda_runtime|device_types|math_constants|cooperative_groups)' # CUDART includes
    Priority:        8
  - Regex:           '^<.*\..*' # other system includes (e.g. with a '.')
    Priority:        9
  - Regex:           '^<[^.]+' # STL includes (no '.')
    Priority:        10

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@harrism harrism added feature request New feature or request non-breaking Non-breaking change labels Feb 7, 2024
@harrism harrism self-assigned this Feb 7, 2024
@github-actions github-actions bot added libcudf Affects libcudf (C++/CUDA) code. Python Affects Python cuDF API. labels Feb 7, 2024
@harrism harrism changed the title Enforce include grouping order in .clang-format Automate include grouping order in .clang-format Feb 7, 2024
@harrism
Copy link
Member Author

harrism commented Feb 7, 2024

Note that these changes did turn up two places where needed headers were not included.

davidwendt and others added 4 commits February 7, 2024 15:08
…rapidsai#14868)

Changes some internal offset arrays used for managing temporary unicode code-points to int64 type.
This effects the nvtext normalize and subword-tokenizer functions.

Authors:
  - David Wendt (https://github.com/davidwendt)

Approvers:
  - Bradley Dice (https://github.com/bdice)
  - Karthikeyan (https://github.com/karthikeyann)

URL: rapidsai#14868
This PR ensures that all calls to `ctest` include the flag `--no-tests=error`. See rapidsai/build-planning#18.

Authors:
  - Bradley Dice (https://github.com/bdice)

Approvers:
  - Jake Awe (https://github.com/AyodeAwe)

URL: rapidsai#14983
…ai#14989)

This PR filters all `DeprecationWarning`'s that are being originated by `ArrowTable.to_pandas`

Authors:
  - GALI PREM SAGAR (https://github.com/galipremsagar)

Approvers:
  - Matthew Roeschke (https://github.com/mroeschke)

URL: rapidsai#14989
Copy link
Contributor

@bdice bdice left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Flushing comments for now. I'll look further at this later.

cpp/benchmarks/io/orc/orc_writer_chunks.cpp Outdated Show resolved Hide resolved
cpp/benchmarks/synchronization/synchronization.hpp Outdated Show resolved Hide resolved
.clang-format Outdated Show resolved Hide resolved
.clang-format Outdated Show resolved Hide resolved
galipremsagar and others added 2 commits February 8, 2024 08:02
…idsai#14995)

This PR fixes `DataFrame.sort_index` to properly ignore indexes for all values of `axis`. This is fixed in pandas-2.2, hence xfailing the tests with a version check.

Authors:
  - GALI PREM SAGAR (https://github.com/galipremsagar)

Approvers:
  - Bradley Dice (https://github.com/bdice)

URL: rapidsai#14995
@davidwendt
Copy link
Contributor

davidwendt commented Feb 8, 2024

Seems something here makes the jitify code-paths fail. I know jitify has issues with certain include files.
I would recommend addressing the build/test errors sooner than later.

mroeschke and others added 4 commits February 8, 2024 13:41
Adds offsetalator in place of hardcoded offset type arrays to the strings split functions:
- `cudf::strings::split()`
- `cudf::strings::rsplit()`
- `cudf::strings::split_record()`
- `cudf::strings::rsplit_record()`
- `cudf::strings::split_re()`
- `cudf::strings::rsplit_re()`
- `cudf::strings::split_record_re()`
- `cudf::strings::rsplit_record_re()`

Authors:
  - David Wendt (https://github.com/davidwendt)

Approvers:
  - Bradley Dice (https://github.com/bdice)
  - Mike Wilson (https://github.com/hyperbolic2346)

URL: rapidsai#14757
For comparison:

pandas-dev/pandas#55856
pandas-dev/pandas#55895
pandas-dev/pandas#55499

The `errors="ignore"` parameter is the only one that is implemented so just added a test for that deprecation

Authors:
  - Matthew Roeschke (https://github.com/mroeschke)
  - GALI PREM SAGAR (https://github.com/galipremsagar)

Approvers:
  - GALI PREM SAGAR (https://github.com/galipremsagar)

URL: rapidsai#14984
…#14998)

A small bug in our previous implementation leads to a segfault when calling `.get_groups()` with no `values`. Thankfully, the cuDF Python API always calls this function with a value, but it's possible `pylibcudf` consumers will not.

Authors:
  - Ashwin Srinath (https://github.com/shwina)

Approvers:
  - Vyas Ramasubramani (https://github.com/vyasr)

URL: rapidsai#14998
@wence-
Copy link
Contributor

wence- commented Feb 12, 2024

I am very much in favour of this. I like the groupings. One minor question around the colour of the bikeshed:

Do we want to (probably not in this change due to the churn it would cause) have separate groups for the public (<cudf/public_header.hpp>) and detail (<cudf/detail/private_header.hpp>) include separation?

davidwendt and others added 5 commits February 12, 2024 15:46
Updates `cudf::strings::wrap()` to use the offsetalator instead of hardcoded int32 type for offsets column data.

Authors:
  - David Wendt (https://github.com/davidwendt)

Approvers:
  - Nghia Truong (https://github.com/ttnghia)
  - Karthikeyan (https://github.com/karthikeyann)

URL: rapidsai#15002
…i#14956)

Resolves [10219](NVIDIA/spark-rapids#10219)

This PR introduces a new class named `GetJsonObjectOptions` that holds the configurations to control the behavior of the underlying `cudf::get_json_object` function. It incorporates this new class into the `getJSONObject` JAVA API as an additional argument but also keeps the previous API to maintain backwards compatibility.  It also includes a test case, `testGetJSONObjectWithSingleQuotes`, validating the behavior of `getJSONObject` when single quotes are enabled.

Authors:
  - Suraj Aralihalli (https://github.com/SurajAralihalli)

Approvers:
  - Robert (Bobby) Evans (https://github.com/revans2)
  - MithunR (https://github.com/mythrocks)
  - Karthikeyan (https://github.com/karthikeyann)

URL: rapidsai#14956
Forward-merge branch-24.02 to branch-24.04
raydouglass and others added 12 commits February 13, 2024 09:22
I think unpinning `numba` in the conda recipe was just missed in rapidsai#14616.

I discovered this issue [trying to build the `24.02` release](https://github.com/rapidsai/cudf/actions/runs/7878153691/job/21496377912#step:7:1674).

PRs & nightly builds are working because the `rapidsai-nightly` channel has an older version of `pynvjitlink` that supported `numba>=0.57` whereas the `rapidsai` channel only has the latest version which pins to `numba>=0.58`.

Authors:
   - Ray Douglass (https://github.com/raydouglass)

Approvers:
   - Vyas Ramasubramani (https://github.com/vyasr)
   - Bradley Dice (https://github.com/bdice)
…apidsai#15018)

Found while working on large strings where copy-if is called. In places where `copy_if_safe` utility is called the non-stencil overload calls the stencil-ed function by forwarding the `first` iterator as the `stencil` parameter. This works logically because both values will return the same result. Unfortunately, this can be a performance issue if the iterator is complex/slow transform iterator since it would be called twice (an inlined twice). Changing the non-stencil version to call `thrust::copy_if` directly fixes the potential issue.

Authors:
  - David Wendt (https://github.com/davidwendt)

Approvers:
  - Bradley Dice (https://github.com/bdice)
  - Yunsong Wang (https://github.com/PointKernel)
  - Mike Wilson (https://github.com/hyperbolic2346)

URL: rapidsai#15018
Forward-merge branch-24.02 to branch-24.04
Toward pandas 2.2 compat: Deprecated in pandas in pandas-dev/pandas#56557

Authors:
  - Matthew Roeschke (https://github.com/mroeschke)

Approvers:
  - Ashwin Srinath (https://github.com/shwina)

URL: rapidsai#14986
Updates `cudf::get_json_object()` to use the offsetalator to build the output strings column.
It adds a sizes vector to hold the output row lengths which is then converted to offsets using the new `make_offsets_child_column()` utitlity.

Authors:
  - David Wendt (https://github.com/davidwendt)

Approvers:
  - Mark Harris (https://github.com/harrism)
  - Vukasin Milovanovic (https://github.com/vuule)

URL: rapidsai#15009
Updates `cudf::interleave_columns()` to use the new `make_offsets_child_column` utility and the offsetalator to build the output strings column.

Authors:
  - David Wendt (https://github.com/davidwendt)

Approvers:
  - Nghia Truong (https://github.com/ttnghia)
  - Vukasin Milovanovic (https://github.com/vuule)

URL: rapidsai#15004
…#15052)

Developers expect that 'cleaning' a build directory will remove all forms of cached files ( objects, libraries, jit cache, etc ). To ensure that happens consistenly we also need to remove the jitify cache objects for cudf.

Authors:
  - Robert Maynard (https://github.com/robertmaynard)

Approvers:
  - Bradley Dice (https://github.com/bdice)
  - Mark Harris (https://github.com/harrism)

URL: rapidsai#15052
Reworks the `cudf::strings::detail::copy_if_else()` to improve performance for long strings. The rework builds a vector of rows to pass to the `make_strings_column` factory that uses the optimized `gather_chars` function.
Also includes a benchmark for copy_if_else specifically for strings columns.

Closes rapidsai#15014

Authors:
  - David Wendt (https://github.com/davidwendt)

Approvers:
  - Bradley Dice (https://github.com/bdice)
  - Vukasin Milovanovic (https://github.com/vuule)

URL: rapidsai#15017
Closes rapidsai#14495

Adds support for reading and writing ORC and Parquet files with LZ4 compression.
Also adds the new value to the Python API.

Included basic C++ and Python tests so that the option is exercised in CI.

Authors:
  - Vukasin Milovanovic (https://github.com/vuule)

Approvers:
  - Bradley Dice (https://github.com/bdice)
  - Shruti Shivakumar (https://github.com/shrshi)
  - MithunR (https://github.com/mythrocks)
  - GALI PREM SAGAR (https://github.com/galipremsagar)

URL: rapidsai#14906
Copy link

copy-pr-bot bot commented Feb 15, 2024

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@github-actions github-actions bot added CMake CMake build issue conda Java Affects Java cuDF API. labels Feb 15, 2024
@harrism
Copy link
Member Author

harrism commented Feb 15, 2024

The commit history on this PR has gotten out of hand somehow. I think it's best to start a new PR with the latest agreed-upon .clang-format changes and then rerun the pre-commit hooks and apply any other applicable reviewer suggestions from this PR. This way I will minimize changes and have a better change of figuring out any CI failures.

@wence-
Copy link
Contributor

wence- commented Feb 15, 2024

Closing in favour of #15063

@wence- wence- closed this Feb 15, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
CMake CMake build issue feature request New feature or request Java Affects Java cuDF API. libcudf Affects libcudf (C++/CUDA) code. non-breaking Non-breaking change Python Affects Python cuDF API.
Projects
None yet
Development

Successfully merging this pull request may close these issues.