-
Notifications
You must be signed in to change notification settings - Fork 915
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Automatic include sorting with clang-format #12760
Conversation
Adding specific directory names to clang-format seems problematic. It means we are not likely to be able to share a clang-format with other RAPIDS projects very easily. Also, it becomes a maintenance issue to try to remember to update this correctly. |
@davidwendt I think that is valid criticism. I posed this PR for discussion because I am frequently commenting in PR reviews to keep includes grouped in accordance with our developer guide, and that seems like a job for tooling to handle on our behalf. I think maintaining this file with explicit rules might be easier than the continuous burden of cleaning up includes that are automatically added by other developers’ IDEs in the wrong places or for developers who are not familiarized with the include order we prescribe in the developer guide. I would rather comment on PRs when things are organized incorrectly and add a new rule for clang-format. However, I don’t want to push this if it is viewed as an excessive maintenance burden relative to the benefit it might provide. |
I assume the reason for include sorting is style, rather than correctness? Modulo bugs in libstdc++ like this one you would hope that include sorting doesn't change the correctness of the code. Would it therefore be reasonable to just try and sort into buckets of "cuda", "libstdc++", and "other"? That would make things more portable across projects. FWIW, as an occasional (for now at least) contributed to the C++ code in cudf, I haven't fully internalised how "far away" everything is, and definitely clangd autocompletion providing IWYU by magic helps, but it obviously doesn't know about the rules so can't know where to put things. If I contrast with the python side, I just import where-ever at the top of the file and then |
This agrees with my experience for Python. In practice, it hasn't been painful to maintain for isort. The primary downsides are that the C++ code has a few more "categories" than Python, and it is harder to maintain a regex for some of the categories, such as what's in the C++ STL (isort maintains the list of Python standard libraries for us) and what's local to libcudf (e.g. |
The difference I see between the way isort is configured in Python and how clang-format is configured for includes is that the includes from a given library may be in an arbitrary paths whereas Python package imports can be easily identified by either My biggest takeaway from glancing through this PR is that our "distance metric" is too complicated. To @wence-'s point, I suspect that most infrequent contributors don't know the metric we use, and based on how often these comments are raised in reviews I would guess that many frequent contributors haven't quite internalized this either. Before going down the programmatic sorting route I think we should consider reducing the complexity of our groupings as well so that there is a smaller list to learn and maintain. I don't personally get much value out of groups beyond internal, stdlib, and thrust/cub/libcu++. Enforcing a simpler set would also reduce PR reviewer burden. |
Closing this as stale -- there are several difficult questions to answer about simplifying the rules we use to sort includes, and I don't think I can prioritize this work anytime soon. |
This uses the `IncludeCategories` settings in .clang-format to attempt to enforce our documented `#include` order in libcudf. See https://docs.rapids.ai/api/libcudf/stable/developer_guide I realize that there was a [previous attempt at this](#12760) by @bdice that met with some resistance. Reading it, I wouldn't say it was vetoed; rather, reviewers requested something much simpler. I have a few reasons to attempt this again. 1. To make a separate task much easier. We are undertaking a refactoring of RMM that will replace `rmm::mr::device_memory_resource*` with `rmm::device_async_resource-ref` everywhere in RAPIDS (not just cuDF). This requires adding an include to MANY files. Getting the location of the include right everywhere is very difficult without automatic grouping of headers. I started out writing a bash script to do this before realizing clang-format has the necessary feature. And I realized that my script would never properly handle [files like this](https://github.com/rapidsai/raft/blob/branch-24.04/cpp/bench/ann/src/raft/raft_cagra_wrapper.h). 2. To increase velocity. Everywhere in RAPIDS that we have automated code standard/style/formatting/other, the benefits to velocity have outweighed the costs. To paraphrase @bdice, $auto \nearrow \rightarrow \mu \searrow \rightarrow v \nearrow$ 3. The previous PR #12760 had nearly 50 categories of headers. There was no way this could be applied universally across RAPIDS repos. My proposal has 10 categories. I tried to reduce it further but realized that it wouldn't be much less configuration to maintain, so I stopped at 10. Note that one of the ways that having few categories can work while still maintaining clear groups is that this PR updates many files to use quotes ("") instead of angle brackets (<>) for local cuDF headers that do not live in `cudf/cpp/include`. With our "near to far" include ordering policy, these are arguably the nearest files, and using quotes allows us to have our first category simply check for quotes. These files will be grouped and sorted without blank lines, but in practice this does not lose clarity because typically headers from more than two directories are not included from the same file. The downside of this change is I don't yet know how to automatically enforce it. I hope that when developers accidentally use <> for internal includes that don't start with (e.g.) "cudf", they will be grouped one of the lowest priority categories, and perhaps this will induce them to switch to "" to get the headers listed at the top. The rule is simple: if it's in libcudf but not in `cpp/include/cudf`, then use quotes. For **everything** else, use angle brackets. Other than headers from RAPIDS repos, we have a group for all CCCL/CUDA headers, a group for all other headers that have a file extension, and a final group for all files that have no file extension (e.g. STL). Below I'm listing the (fairly simple, in my opinion) .clang-format settings for this PR. Note that categories 2-5 will require tweaking for different RAPIDS repos. Some may ask why I ordered `cudf_test` headers before `cudf` headers. I tried both orders, and putting `cudf_test` first generated significantly fewer changes in the PR, meaning that it's already the more common ordering (I suppose `cudf_test` is closer to the files that include it, since they are libcudf tests). I've opened a similar PR for RMM with only 5 groups. rapidsai/rmm#1463 CC @davidwendt @vyasr @wence- @GregoryKimball for feedback @isVoid contributed to this PR via pair programming. ``` IncludeBlocks: Regroup IncludeCategories: - Regex: '^"' # quoted includes Priority: 1 - Regex: '^<(benchmarks|tests)/' # benchmark includes Priority: 2 - Regex: '^<cudf_test/' # cuDF includes Priority: 3 - Regex: '^<cudf/' # cuDF includes Priority: 4 - Regex: '^<(nvtext|cudf_kafka)' # other libcudf includes Priority: 5 - Regex: '^<(cugraph|cuml|cuspatial|raft|kvikio)' # Other RAPIDS includes Priority: 6 - Regex: '^<rmm/' # RMM includes Priority: 7 - Regex: '^<(thrust|cub|cuda)/' # CCCL includes Priority: 8 - Regex: '^<(cooperative_groups|cuco|cuda.h|cuda_runtime|device_types|math_constants|nvtx3)' # CUDA includes Priority: 8 - Regex: '^<.*\..*' # other system includes (e.g. with a '.') Priority: 9 - Regex: '^<[^.]+' # STL includes (no '.') Priority: 10 ``` Authors: - Mark Harris (https://github.com/harrism) Approvers: - David Wendt (https://github.com/davidwendt) - Bradley Dice (https://github.com/bdice) - Nghia Truong (https://github.com/ttnghia) URL: #15063
Description
This PR is inspired by a comment from @wence-
I looked into these clang-format options and I came up with some settings that I think are pretty good.
I haven't quite finished iterating on the rule set, so I'm opening this draft just to mark my progress. If you're interested in seeing the results or tweak the rules, check out this PR and run
pre-commit run clang-format --all-files
.Checklist