
[TASK] Review types used for size and indexing throughout codebase #4105

Open
wphicks opened this issue Jul 27, 2021 · 8 comments

wphicks commented Jul 27, 2021

As noted in the discussion on #4075, there are currently places in the codebase where we use signed integers for positive indexing or to represent container sizes or other quantities that are better represented by other types (generally std::size_t). In order to clearly signal the semantic meaning of these variables, to avoid signed/unsigned casts and comparisons, and to reduce the likelihood of overflow/underflow bugs, I propose the following rules:

  1. Avoid raw indexed loops wherever possible, instead giving preference to the relevant STL algorithm.
  2. Always represent the size of an STL container with some_container<T>::size_type and use the same type for raw positive indexing. Liberal use of auto should make this less onerous, and rule 1 should make it rare.
  3. When representing the size of a quantity other than the size of a container or doing any other kind of raw positive indexing, use std::size_t unless there is a clear performance/resource reason to use a smaller type. In this case, an unsigned integer of the desired size should be used explicitly (e.g. uint32_t) and a comment should be added where the variable is declared explaining the performance consideration. If this is necessary for a custom container, alias custom_container<T>::size_type to the selected integer type.
  4. For negative indexing, the same basic rules apply. Use iterator::difference_type where possible or explicitly invoke std::ptrdiff_t otherwise.
  5. Use int to represent mathematical integers: quantities which can take on negative values and which are not sizes or indexes. A signed integer should not be used simply to include -1 as a special case; prefer std::optional or other ways of signaling this more clearly.

If we are better about rule 1, rules 2-5 should become less and less relevant.
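
As a quick illustration, rules 2-5 look something like this in practice (a minimal, untested sketch; all names here are made up):

  #include <algorithm>
  #include <cstddef>
  #include <optional>
  #include <vector>

  void sketch(std::vector<float> const& values)
  {
    // Rule 2: container sizes and positive indexes use the container's size_type;
    // auto keeps this from being verbose.
    auto n = values.size();  // std::vector<float>::size_type

    // Rule 3: a size that is not tied to a container is a std::size_t.
    std::size_t bytes_needed = n * sizeof(float);

    // Rule 4: differences between positions are signed (difference_type / std::ptrdiff_t).
    std::ptrdiff_t span = values.end() - values.begin();

    // Rules 1 and 5: prefer an algorithm over a raw indexed loop, and express
    // "no valid index yet" with std::optional rather than an int sentinel of -1.
    auto it = std::find_if(values.begin(), values.end(), [](float v) { return v < 0.0f; });
    std::optional<std::size_t> first_negative;
    if (it != values.end()) { first_negative = static_cast<std::size_t>(it - values.begin()); }

    (void)bytes_needed; (void)span; (void)first_negative;  // unused in this sketch
  }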

TL;DR version:

  • for-loops bad, algorithms good
  • container::size_type for sizes and positive indexing where possible, std::size_t otherwise
  • iterator::difference_type for negative indexing where possible, std::ptrdiff_t otherwise
  • int is a mathematical quantity, not a size or an index
  • Performance exceptions are always allowed, but be explicit about them

I'm not married to these rules, but they seem like a reasonably small and complete set that is consistent with other style and usage guidelines. We can use this issue for any necessary discussion until #4075 is merged, and then I'll be happy to work on implementing whatever we land on.

@wphicks wphicks added feature request New feature or request proposal Change current process or code CUDA / C++ CUDA issue labels Jul 27, 2021
@wphicks wphicks self-assigned this Jul 27, 2021
levsnv commented Aug 19, 2021

1. When traversing arrays, we often need to accumulate or do several things with the elements at once. In performance-critical sections of code, I'd prefer a more nuanced rule, perhaps on ways to cogently describe actor classes, instead of accepting a 2% slowdown in exchange for more readable code.
3. @canonizer @wphicks does this mean existing code should switch to spelling out std::size_t instead of size_t, even where it was effectively the former all along?
5. What about the nice property that int indices often contain negative numbers when left uninitialized, rather than a plausible-looking positive index?

wphicks commented Aug 19, 2021

For 1, can you give me an example of code where a raw loop provides a performance benefit that cannot be achieved without a raw loop, so I can make sure I understand what you're getting at?

For 3, yes, definitely. size_t is a C type, std::size_t is a C++ type. If we use size_t it should be a signal to future developers that we're expecting to interface with C code at this point.
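
Purely to illustrate the distinction (not taken from any real file in the codebase):

  #include <cstddef>   // C++ header: guarantees std::size_t
  #include <stddef.h>  // C header: declares ::size_t

  std::size_t n = 0;  // what we'd write in C++ code under this proposal
  size_t m = 0;       // reads as "this code is interfacing with C"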

For 5, that's where the recommendation to use std::optional comes in. It carries the clear semantic meaning that this might not yet have a valid value.
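
Rough sketch of what I mean (untested, and the variable name is made up):

  #include <cstddef>
  #include <optional>

  void sketch()
  {
    std::optional<std::size_t> selected_row;  // unambiguously empty until assigned

    if (!selected_row) {
      // nothing selected yet; with `int selected_row = -1;` this check would be a
      // comparison against a sentinel that is easy to forget or get wrong
    }

    selected_row = std::size_t{42};  // arbitrary example value
    if (selected_row) {
      auto row = *selected_row;  // guaranteed to hold a real index here
      (void)row;
    }
  }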

levsnv commented Aug 19, 2021

  1. e.g. I have this in a PR:
    Timer together = Timer::start("together");
    // count nodes for each feature id, while splitting the sets between nodes
    std::size_t bit_pool_size = 0;
    for (std::size_t node_id = 0; node_id < num_nodes; ++node_id) {
      int fid = fids_h[node_id];
      
      if (!feature_categorical[fid] || is_leafs_h[node_id]) is_categoricals_h[node_id] = 0.0f;
      
      if (is_categoricals_h[node_id] == 1.0) {
        // might allocate a categorical set for an unreachable parent node. That's OK.
        ++cf[fid].n_nodes;
        node_cat_set[node_id] = bit_pool_size;
        bit_pool_size += categorical_sets::sizeof_mask_from_max_matching(cf[fid].max_matching);
      }
    }
    together.stop();

which I could split like this for readability, sacrificing 0.66% of runtime in a third of our tests. I agree that there may be larger opportunities to optimize, but this one doesn't require a perf investigation, and I feel it's significant enough that we should settle on a decent approach from the start, in both readability and performance.

    Timer fct = Timer::start("fc");
    for (std::size_t node_id = 0; node_id < num_nodes; ++node_id) {
      int fid = fids_h[node_id];
      if (!feature_categorical[fid] || is_leafs_h[node_id]) is_categoricals_h[node_id] = 0.0f;
    }
    fct.stop();
    Timer cft = Timer::start("cf");
    // count nodes for each feature id, while splitting the sets between nodes
    std::size_t bit_pool_size = 0;
    for (std::size_t node_id = 0; node_id < num_nodes; ++node_id) {
      if (is_categoricals_h[node_id] == 1.0) {
        int fid = fids_h[node_id];
        // might allocate a categorical set for an unreachable parent node. That's OK.
        ++cf[fid].n_nodes;
        node_cat_set[node_id] = bit_pool_size;
        bit_pool_size += categorical_sets::sizeof_mask_from_max_matching(cf[fid].max_matching);
      }
    }
    cft.stop();

If I split it further (e.g. bit_pool_size = std::accumulate(...)), I might again lose that much, since there's a cost to move is_categoricals_h[node_id] in and out of cache.

wphicks commented Aug 19, 2021

Sorry, I should have been more specific; I was more looking for a standalone example. If we need to dive into this further, we might look at something we can play with in Godbolt, but here's a very rough cut at one way to move the code you posted away from raw loops:

  auto begin = zip_iterator(is_categoricals_h.begin(), fids_h.begin(), is_leafs_h.begin(), node_cat_set.begin());
  auto end = zip_iterator(is_categoricals_h.end(), fids_h.end(), is_leafs_h.end(), node_cat_set.end());
  auto bit_pool_size = std::accumulate(
      begin, end, std::size_t{},
      [&feature_categorical, &cf](auto result, auto&& tup) {
        auto& [cat, fid, leaf, ncs] = tup;

        if (!feature_categorical[fid] || leaf) {
          cat = 0.0f;
        }
        if (cat == 1.0f) {
          ++cf[fid].n_nodes;
          ncs = result;  // the running total plays the role of bit_pool_size
          result += categorical_sets::sizeof_mask_from_max_matching(
              cf[fid].max_matching);
        }
        return result;
      });

All that's untested, so don't try to grab it as is ;). Note that we should only bring each element of is_categoricals_h into cache once with this implementation. Switching from a raw loop to an algorithm should never force you to break up a loop into multiple passes, although it may highlight ways that you can restructure your data to keep relevant bits closer together.

Do you have any other examples of basic problems with performance that might emerge from moving away from a raw loop? I won't say they don't exist, but I've yet to encounter one myself. I've encountered plenty of situations where raw loops are slower than an algorithm-based implementation, though.

levsnv commented Aug 20, 2021

I'll skip the nitpicking and focus on the main thing:
One line of for (std::size_t node_id = 0; node_id < num_nodes; ++node_id) turns into six lines of routing things through std::accumulate. IIUC, neither zip_iterator nor std::accumulate enforces a same-length constraint, which is what would have made the loop actually safer, rather than just hiding the indexing behind syntactic sugar.
I wish std:: had things like a zip_iterator_pair (this one could be easy to implement), and that std::accumulate had an overload which effectively calls std::apply, so that we could declare (auto& cat, auto& fid, auto& leaf, auto& ncs) in the lambda signature.
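
For reference, an apply-style adapter could look roughly like this (untested; unzipped is a made-up name, and this assumes the zip iterator dereferences to a std::tuple-like type that std::apply accepts):

  #include <tuple>
  #include <utility>

  // Wrap a reduction callable so the zipped tuple is unpacked into separate
  // reference parameters before the call.
  template <typename F>
  auto unzipped(F f)
  {
    return [f = std::move(f)](auto acc, auto&& tup) {
      return std::apply(
          [&](auto&... elems) { return f(acc, elems...); },
          std::forward<decltype(tup)>(tup));
    };
  }

  // Usage sketch:
  //   auto bit_pool_size = std::accumulate(
  //       begin, end, std::size_t{},
  //       unzipped([&](auto result, auto& cat, auto& fid, auto& leaf, auto& ncs) {
  //         /* ... */
  //         return result;
  //       }));
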
I've shared a standalone example with you, since there are a couple of hiccups I couldn't resolve myself.

wphicks commented Aug 25, 2021

I was focusing solely on the performance question you raised (i.e. whether avoiding raw loops necessarily imposes a performance penalty in some cases). If you're interested in specific recommendations for refactoring on that PR that would also keep things concise, you can always tag me for code review, but I can't say much about what the "right" way to structure that would be out of context.

For the purposes of this particular issue, I'm more interested in any general problems that might arise from avoiding raw loops. If there are general pitfalls, it would be great to document them here. While avoiding raw loops is generally a good idea for other reasons as well (these gentlemen lay out the case very nicely), that basic principle is not the focus of this proposal either. It's more about establishing consistency in how we index, and one way to do that is to avoid indexing altogether, which can also eliminate other bugs.

rapids-bot pushed a commit that referenced this issue Nov 5, 2021
Related to #4105.

This PR fixes the types used for indexing and sizes in the PCA/TSVD C++ code.

Authors:
  - Micka (https://github.com/lowener)
  - Corey J. Nolet (https://github.com/cjnolet)

Approvers:
  - Corey J. Nolet (https://github.com/cjnolet)

URL: #4255
@github-actions
This issue has been labeled inactive-90d due to no recent activity in the past 90 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed.

@github-actions

This issue has been labeled inactive-30d due to no recent activity in the past 30 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be labeled inactive-90d if there is no activity in the next 60 days.

vimarsh6739 pushed a commit to vimarsh6739/cuml that referenced this issue Oct 9, 2023
Related to rapidsai#4105.

URL: rapidsai#4255