
[TASK] Review types used for size and indexing throughout codebase #4105

Open
wphicks opened this issue Jul 27, 2021 · 8 comments

wphicks commented Jul 27, 2021

As noted in the discussion on #4075, there are currently places in the codebase where we use signed integers for positive indexing or to represent container sizes or other quantities that are better represented by other types (generally std::size_t). In order to clearly signal the semantic meaning of these variables, to avoid signed/unsigned casts and comparisons, and to reduce the likelihood of overflow/underflow bugs, I propose the following rules:

  1. Avoid raw indexed loops wherever possible, instead giving preference to the relevant STL algorithm.
  2. Always represent the size of an STL container with some_container<T>::size_type and use the same type for raw positive indexing. Liberal use of auto should make this less onerous, and rule 1 should make it rare.
  3. When representing the size of a quantity other than the size of a container or doing any other kind of raw positive indexing, use std::size_t unless there is a clear performance/resource reason to use a smaller type. In this case, an unsigned integer of the desired size should be used explicitly (e.g. uint32_t) and a comment should be added where the variable is declared explaining the performance consideration. If this is necessary for a custom container, alias custom_container<T>::size_type to the selected integer type.
  4. For negative indexing, the same basic rules apply. Use iterator::difference_type where possible or explicitly invoke std::ptrdiff_t otherwise.
  5. Use int to represent mathematical integers: quantities which can take on negative values and which are not sizes or indexes. A signed integer should not be used simply to include -1 as a special case; prefer std::optional or other ways of signaling this more clearly.

If we are better about rule 1, rules 2-5 should become less and less relevant.
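
As a quick illustration, rules 2-5 look something like this in practice (a minimal, untested sketch; all names here are made up):

  #include <algorithm>
  #include <cstddef>
  #include <optional>
  #include <vector>

  void sketch(std::vector<float> const& values)
  {
    // Rule 2: container sizes and positive indexes use the container's size_type;
    // auto keeps this from being verbose.
    auto n = values.size();  // std::vector<float>::size_type

    // Rule 3: a size that is not tied to a container is a std::size_t.
    std::size_t bytes_needed = n * sizeof(float);

    // Rule 4: differences between positions are signed (difference_type / std::ptrdiff_t).
    std::ptrdiff_t span = values.end() - values.begin();

    // Rules 1 and 5: prefer an algorithm over a raw indexed loop, and express
    // "no valid index yet" with std::optional rather than an int sentinel of -1.
    auto it = std::find_if(values.begin(), values.end(), [](float v) { return v < 0.0f; });
    std::optional<std::size_t> first_negative;
    if (it != values.end()) { first_negative = static_cast<std::size_t>(it - values.begin()); }

    (void)bytes_needed; (void)span; (void)first_negative;  // unused in this sketch
  }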

TL;DR version:

  • for-loops bad, algorithms good
  • container::size_type for sizes and positive indexing where possible, std::size_t otherwise
  • iterator::difference_type for negative indexing where possible, std::ptrdiff_t otherwise
  • int is a mathematical quantity, not a size or an index
  • Performance exceptions are always allowed, but be explicit about them

I'm not married to these rules, but they seem like a reasonably small and complete set that is consistent with other style and usage guidelines. We can use this issue for any necessary discussion until #4075 is merged, and then I'll be happy to work on implementing whatever we land on.

@wphicks wphicks added feature request New feature or request proposal Change current process or code CUDA / C++ CUDA issue labels Jul 27, 2021
@wphicks wphicks self-assigned this Jul 27, 2021
levsnv commented Aug 19, 2021

1. When traversing arrays, we often need to accumulate or do several things with the elements at once. In performance-critical sections of code, I'd prefer a more nuanced rule, perhaps on ways to cogently describe actor classes, instead of accepting a 2% slowdown in exchange for more readable code.
3. @canonizer @wphicks does this mean existing code should switch to spelling out std::size_t instead of size_t, even where it was effectively the former all along?
5. What about the nice property that int indices often contain negative numbers when left uninitialized, rather than a plausible-looking positive index?

wphicks commented Aug 19, 2021

For 1, can you give me an example of code where a raw loop provides a performance benefit that cannot be achieved without a raw loop, so I can make sure I understand what you're getting at?

For 3, yes, definitely. size_t is a C type, std::size_t is a C++ type. If we use size_t it should be a signal to future developers that we're expecting to interface with C code at this point.
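
Purely to illustrate the distinction (not taken from any real file in the codebase):

  #include <cstddef>   // C++ header: guarantees std::size_t
  #include <stddef.h>  // C header: declares ::size_t

  std::size_t n = 0;  // what we'd write in C++ code under this proposal
  size_t m = 0;       // reads as "this code is interfacing with C"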

For 5, that's where the recommendation to use std::optional comes in. It carries the clear semantic meaning that this might not yet have a valid value.
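
Rough sketch of what I mean (untested, and the variable name is made up):

  #include <cstddef>
  #include <optional>

  void sketch()
  {
    std::optional<std::size_t> selected_row;  // unambiguously empty until assigned

    if (!selected_row) {
      // nothing selected yet; with `int selected_row = -1;` this check would be a
      // comparison against a sentinel that is easy to forget or get wrong
    }

    selected_row = std::size_t{42};  // arbitrary example value
    if (selected_row) {
      auto row = *selected_row;  // guaranteed to hold a real index here
      (void)row;
    }
  }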

levsnv commented Aug 19, 2021

  1. e.g. I have this in a PR:
    Timer together = Timer::start("together");
    // count nodes for each feature id, while splitting the sets between nodes
    std::size_t bit_pool_size = 0;
    for (std::size_t node_id = 0; node_id < num_nodes; ++node_id) {
      int fid = fids_h[node_id];
      
      if (!feature_categorical[fid] || is_leafs_h[node_id]) is_categoricals_h[node_id] = 0.0f;
      
      if (is_categoricals_h[node_id] == 1.0) {
        // might allocate a categorical set for an unreachable parent node. That's OK.
        ++cf[fid].n_nodes;
        node_cat_set[node_id] = bit_pool_size;
        bit_pool_size += categorical_sets::sizeof_mask_from_max_matching(cf[fid].max_matching);
      }
    }
    together.stop();

which I could split like this for readability, sacrificing 0.66% of runtime in a third of our tests. I agree that there may be larger opportunities to optimize, but this one doesn't require a perf investigation, and I feel it's significant enough that we should settle on a decent approach from the start, in both readability and performance.

    Timer fct = Timer::start("fc");
    for (std::size_t node_id = 0; node_id < num_nodes; ++node_id) {
      int fid = fids_h[node_id];
      if (!feature_categorical[fid] || is_leafs_h[node_id]) is_categoricals_h[node_id] = 0.0f;
    }
    fct.stop();
    Timer cft = Timer::start("cf");
    // count nodes for each feature id, while splitting the sets between nodes
    std::size_t bit_pool_size = 0;
    for (std::size_t node_id = 0; node_id < num_nodes; ++node_id) {
      if (is_categoricals_h[node_id] == 1.0) {
        int fid = fids_h[node_id];
        // might allocate a categorical set for an unreachable parent node. That's OK.
        ++cf[fid].n_nodes;
        node_cat_set[node_id] = bit_pool_size;
        bit_pool_size += categorical_sets::sizeof_mask_from_max_matching(cf[fid].max_matching);
      }
    }
    cft.stop();

If I split it further (e.g. bit_pool_size = std::accumulate(...)), I might again lose that much, since there's a cost to move is_categoricals_h[node_id] in and out of cache.

wphicks commented Aug 19, 2021

Sorry, I should have been more specific; I was more looking for a standalone example. If we need to dive into this further, we might look at something we can play with in Godbolt, but here's a very rough cut at one way to move the code you posted away from raw loops:

  auto begin = zip_iterator(is_categoricals_h.begin(), fids_h.begin(), is_leafs_h.begin(), node_cat_set.begin());
  auto end = zip_iterator(is_categoricals_h.end(), fids_h.end(), is_leafs_h.end(), node_cat_set.end());
  auto bit_pool_size = std::accumulate(
      begin, end, std::size_t{},
      [&feature_categorical, &cf](auto result, auto&& tup) {
        auto& [cat, fid, leaf, ncs] = tup;

        if (!feature_categorical[fid] || leaf) {
          cat = 0.0f;
        }
        if (cat == 1.0f) {
          ++cf[fid].n_nodes;
          ncs = result;  // the running total plays the role of bit_pool_size
          result += categorical_sets::sizeof_mask_from_max_matching(
              cf[fid].max_matching);
        }
        return result;
      });

All that's untested, so don't try to grab it as is ;). Note that we should only bring each element of is_categoricals_h into cache once with this implementation. Switching from a raw loop to an algorithm should never force you to break up a loop into multiple passes, although it may highlight ways that you can restructure your data to keep relevant bits closer together.

Do you have any other examples of basic problems with performance that might emerge from moving away from a raw loop? I won't say they don't exist, but I've yet to encounter one myself. I've encountered plenty of situations where raw loops are slower than an algorithm-based implementation, though.

levsnv commented Aug 20, 2021

I'll skip the nitpicking and focus on the main thing:
One line of for (std::size_t node_id = 0; node_id < num_nodes; ++node_id) turns into six lines of routing things through std::accumulate. IIUC, neither zip_iterator nor std::accumulate enforces a same-length constraint, which is what would have made the loop actually safer, rather than just hiding the indexing behind syntactic sugar.
I wish std:: had things like a zip_iterator_pair (this one could be easy to implement), and that std::accumulate had an overload which effectively calls std::apply, so that we could declare (auto& cat, auto& fid, auto& leaf, auto& ncs) in the lambda signature.
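
For reference, an apply-style adapter could look roughly like this (untested; unzipped is a made-up name, and this assumes the zip iterator dereferences to a std::tuple-like type that std::apply accepts):

  #include <tuple>
  #include <utility>

  // Wrap a reduction callable so the zipped tuple is unpacked into separate
  // reference parameters before the call.
  template <typename F>
  auto unzipped(F f)
  {
    return [f = std::move(f)](auto acc, auto&& tup) {
      return std::apply(
          [&](auto&... elems) { return f(acc, elems...); },
          std::forward<decltype(tup)>(tup));
    };
  }

  // Usage sketch:
  //   auto bit_pool_size = std::accumulate(
  //       begin, end, std::size_t{},
  //       unzipped([&](auto result, auto& cat, auto& fid, auto& leaf, auto& ncs) {
  //         /* ... */
  //         return result;
  //       }));
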
I've shared a standalone example with you, since there are a couple of hiccups I couldn't resolve myself.

wphicks commented Aug 25, 2021

I was focusing solely on the performance question you raised (i.e. whether avoiding raw loops necessarily imposes a performance penalty in some cases). If you're interested in specific recommendations for refactoring on that PR that would also keep things concise, you can always tag me for code review, but I can't say much about what the "right" way to structure that would be out of context.

For the purposes of this particular issue, I'm more interested in any general problems that might arise from avoiding raw loops. If there are general pitfalls, it would be great to document them here. While avoiding raw loops is generally a good idea for other reasons as well (these gentlemen lay out the case very nicely), that basic principle is not the focus of this proposal either. It's more about establishing consistency in how we index, and one way to do that is to avoid indexing altogether, which can also eliminate other bugs.

rapids-bot pushed a commit that referenced this issue Nov 5, 2021
Related to #4105.

This PR fixes the types used for indexing and sizes in the PCA/TSVD C++ code.

Authors:
  - Micka (https://github.com/lowener)
  - Corey J. Nolet (https://github.com/cjnolet)

Approvers:
  - Corey J. Nolet (https://github.com/cjnolet)

URL: #4255
@github-actions
This issue has been labeled inactive-90d due to no recent activity in the past 90 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed.

@github-actions

This issue has been labeled inactive-30d due to no recent activity in the past 30 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be labeled inactive-90d if there is no activity in the next 60 days.

vimarsh6739 pushed a commit to vimarsh6739/cuml that referenced this issue Oct 9, 2023
Related to rapidsai#4105.

URL: rapidsai#4255