Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Automatic include sorting with clang-format #12760

Closed
wants to merge 2 commits into from

Conversation

bdice
Copy link
Contributor

@bdice bdice commented Feb 11, 2023

Description

This PR is inspired by a comment from @wence-

I guess it would be nice to do this automatically with clang-format, but perhaps IncludeBlocks: Regroup plus IncludeCategories leads to too many bad automated decisions.

I looked into these clang-format options and I came up with some settings that I think are pretty good.

I haven't quite finished iterating on the rule set, so I'm opening this draft just to mark my progress. If you're interested in seeing the results or tweak the rules, check out this PR and run pre-commit run clang-format --all-files.

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@github-actions github-actions bot added the libcudf Affects libcudf (C++/CUDA) code. label Feb 11, 2023
@davidwendt
Copy link
Contributor

Adding specific directory names to clang-format seems problematic. It means we are not likely to be able to share a clang-format with other RAPIDS projects very easily. Also, it becomes a maintenance issue to try to remember to update this correctly.
Is this really important to solve this way? I don't think this is worth doing.

@bdice
Copy link
Contributor Author

bdice commented Feb 12, 2023

@davidwendt I think that is valid criticism. I posed this PR for discussion because I am frequently commenting in PR reviews to keep includes grouped in accordance with our developer guide, and that seems like a job for tooling to handle on our behalf. I think maintaining this file with explicit rules might be easier than the continuous burden of cleaning up includes that are automatically added by other developers’ IDEs in the wrong places or for developers who are not familiarized with the include order we prescribe in the developer guide. I would rather comment on PRs when things are organized incorrectly and add a new rule for clang-format. However, I don’t want to push this if it is viewed as an excessive maintenance burden relative to the benefit it might provide.

@wence-
Copy link
Contributor

wence- commented Feb 15, 2023

I assume the reason for include sorting is style, rather than correctness? Modulo bugs in libstdc++ like this one you would hope that include sorting doesn't change the correctness of the code. Would it therefore be reasonable to just try and sort into buckets of "cuda", "libstdc++", and "other"? That would make things more portable across projects. FWIW, as an occasional (for now at least) contributed to the C++ code in cudf, I haven't fully internalised how "far away" everything is, and definitely clangd autocompletion providing IWYU by magic helps, but it obviously doesn't know about the rules so can't know where to put things.

If I contrast with the python side, I just import where-ever at the top of the file and then isort automagically fixes it for me, so I never have to think; the same criticism that David raises about the configuration being project-specific could be raised against isort (since one has to specify things), but that doesn't seem like a massive problem in usage AFAICT.

@bdice
Copy link
Contributor Author

bdice commented Feb 27, 2023

If I contrast with the python side, I just import where-ever at the top of the file and then isort automagically fixes it for me, so I never have to think; the same criticism that David raises about the configuration being project-specific could be raised against isort (since one has to specify things), but that doesn't seem like a massive problem in usage AFAICT.

This agrees with my experience for Python. In practice, it hasn't been painful to maintain for isort. The primary downsides are that the C++ code has a few more "categories" than Python, and it is harder to maintain a regex for some of the categories, such as what's in the C++ STL (isort maintains the list of Python standard libraries for us) and what's local to libcudf (e.g. <io/...>).

@vyasr
Copy link
Contributor

vyasr commented Feb 28, 2023

The difference I see between the way isort is configured in Python and how clang-format is configured for includes is that the includes from a given library may be in an arbitrary paths whereas Python package imports can be easily identified by either import pkg or from pkg[*] import *. IOW it's sufficient to tell isort to group all cudf imports, whereas for clang-format we need a regex for every top-level directory that we want to consider "part of cudf". Therefore this list is likely to see more churn than the isort list, although probably not all that much more.

My biggest takeaway from glancing through this PR is that our "distance metric" is too complicated. To @wence-'s point, I suspect that most infrequent contributors don't know the metric we use, and based on how often these comments are raised in reviews I would guess that many frequent contributors haven't quite internalized this either. Before going down the programmatic sorting route I think we should consider reducing the complexity of our groupings as well so that there is a smaller list to learn and maintain. I don't personally get much value out of groups beyond internal, stdlib, and thrust/cub/libcu++. Enforcing a simpler set would also reduce PR reviewer burden.

@bdice bdice self-assigned this Jun 28, 2023
@bdice
Copy link
Contributor Author

bdice commented Jul 28, 2023

Closing this as stale -- there are several difficult questions to answer about simplifying the rules we use to sort includes, and I don't think I can prioritize this work anytime soon.

@bdice bdice closed this Jul 28, 2023
rapids-bot bot pushed a commit that referenced this pull request Feb 21, 2024
This uses the `IncludeCategories` settings in .clang-format to attempt to enforce our documented `#include` order in libcudf. See https://docs.rapids.ai/api/libcudf/stable/developer_guide

I realize that there was a [previous attempt at this](#12760) by @bdice that met with some resistance. Reading it, I wouldn't say it was vetoed; rather, reviewers requested something much simpler. I have a few reasons to attempt this again.

1. To make a separate task much easier. We are undertaking a refactoring of RMM that will replace `rmm::mr::device_memory_resource*` with `rmm::device_async_resource-ref` everywhere in RAPIDS (not just cuDF). This requires adding an include to MANY files. Getting the location of the include right everywhere is very difficult without automatic grouping of headers. I started out writing a bash script to do this before realizing clang-format has the necessary feature. And I realized that my script would never properly handle [files like this](https://github.com/rapidsai/raft/blob/branch-24.04/cpp/bench/ann/src/raft/raft_cagra_wrapper.h).
2. To increase velocity. Everywhere in RAPIDS that we have automated code standard/style/formatting/other, the benefits to velocity have outweighed the costs. To paraphrase @bdice, $auto \nearrow \rightarrow \mu \searrow \rightarrow v \nearrow$
3. The previous PR #12760 had nearly 50 categories of headers. There was no way this could be applied universally across RAPIDS repos. My proposal has 10 categories. I tried to reduce it further but realized that it wouldn't be much less configuration to maintain, so I stopped at 10.

Note that one of the ways that having few categories can work while still maintaining clear groups is that this PR updates many files to use quotes ("") instead of angle brackets (<>) for local cuDF headers that do not live in `cudf/cpp/include`. With our "near to far" include ordering policy, these are arguably the nearest files, and using quotes allows us to have our first category simply check for quotes. These files will be grouped and sorted without blank lines, but in practice this does not lose clarity because typically headers from more than two directories are not included from the same file. The downside of this change is I don't yet know how to automatically enforce it. I hope that when developers accidentally use <> for internal includes that don't start with (e.g.) "cudf", they will be grouped one of the lowest priority categories, and perhaps this will induce them to switch to "" to get the headers listed at the top. The rule is simple: if it's in libcudf but not in `cpp/include/cudf`, then use quotes. For **everything** else, use angle brackets.

Other than headers from RAPIDS repos, we have a group for all CCCL/CUDA headers, a group for all other headers that have a file extension, and a final group for all files that have no file extension (e.g. STL).

Below I'm listing the (fairly simple, in my opinion) .clang-format settings for this PR. Note that categories 2-5 will require tweaking for different RAPIDS repos. 

Some may ask why I ordered `cudf_test` headers before `cudf` headers. I tried both orders, and putting `cudf_test` first generated significantly fewer changes in the PR, meaning that it's already the more common ordering (I suppose `cudf_test` is closer to the files that include it, since they are libcudf tests).

I've opened a similar PR for RMM with only 5 groups. rapidsai/rmm#1463

CC @davidwendt @vyasr @wence- @GregoryKimball for feedback

@isVoid contributed to this PR via pair programming.

```
IncludeBlocks: Regroup
IncludeCategories:
  - Regex:           '^"' # quoted includes
    Priority:        1
  - Regex:           '^<(benchmarks|tests)/' # benchmark includes
    Priority:        2
  - Regex:           '^<cudf_test/' # cuDF includes
    Priority:        3
  - Regex:           '^<cudf/' # cuDF includes
    Priority:        4
  - Regex:           '^<(nvtext|cudf_kafka)' # other libcudf includes
    Priority:        5
  - Regex:           '^<(cugraph|cuml|cuspatial|raft|kvikio)' # Other RAPIDS includes
    Priority:        6
  - Regex:           '^<rmm/' # RMM includes
    Priority:        7
  - Regex:           '^<(thrust|cub|cuda)/' # CCCL includes
    Priority:        8
  - Regex:           '^<(cooperative_groups|cuco|cuda.h|cuda_runtime|device_types|math_constants|nvtx3)' # CUDA includes
    Priority:        8
  - Regex:           '^<.*\..*' # other system includes (e.g. with a '.')
    Priority:        9
  - Regex:           '^<[^.]+' # STL includes (no '.')
    Priority:        10
```

Authors:
  - Mark Harris (https://github.com/harrism)

Approvers:
  - David Wendt (https://github.com/davidwendt)
  - Bradley Dice (https://github.com/bdice)
  - Nghia Truong (https://github.com/ttnghia)

URL: #15063
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
libcudf Affects libcudf (C++/CUDA) code.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants