Header-only Refactor of `point_in_polygon` #587

isVoid · 2022-07-08T23:59:05Z

Overview

This PR closes #569. This PR rewrites the core pip test with Eric Haine's "crossings-multiply" algorithm: http://erich.realtimerendering.com/ptinpoly/

And this gives 30%~70% kernel time reduction (please ignore Status column):

['before_optim.json', 'optim_2.json']
# point_in_polygon_benchmark

## [0] Tesla V100-SXM2-32GB

|  CoordsType  |  NumTestPoints  |  NumSidesPerRing  |   Ref Time |   Ref Noise |   Cmp Time |   Cmp Noise |          Diff |   %Diff |  Status  |
|--------------|-----------------|-------------------|------------|-------------|------------|-------------|---------------|---------|----------|
|     F32      |      1000       |         4         |  63.706 us |       6.09% |  36.483 us |       6.33% |    -27.223 us | -42.73% |   FAIL   |
|     F32      |     100000      |         4         | 113.862 us |       0.90% |  55.468 us |       1.75% |    -58.394 us | -51.29% |   FAIL   |
|     F32      |    10000000     |         4         |   9.167 ms |       0.02% |   3.785 ms |       0.14% |  -5382.554 us | -58.71% |   FAIL   |
|     F32      |      1000       |        10         | 109.911 us |       1.04% |  55.120 us |       1.82% |    -54.791 us | -49.85% |   FAIL   |
|     F32      |     100000      |        10         | 190.151 us |       0.57% |  81.772 us |       1.14% |   -108.380 us | -57.00% |   FAIL   |
|     F32      |    10000000     |        10         |  16.934 ms |       0.17% |   5.987 ms |       0.06% | -10947.199 us | -64.65% |   FAIL   |
|     F32      |      1000       |        100        | 813.854 us |       0.14% | 314.469 us |       0.30% |   -499.384 us | -61.36% |   FAIL   |
|     F32      |     100000      |        100        |   1.292 ms |       0.18% | 343.747 us |       0.37% |   -948.724 us | -73.40% |   FAIL   |
|     F32      |    10000000     |        100        | 124.471 ms |       0.09% |  29.318 ms |       0.06% | -95152.593 us | -76.45% |   FAIL   |
|     F64      |      1000       |         4         |  73.915 us |       1.40% |  46.365 us |       2.23% |    -27.550 us | -37.27% |   FAIL   |
|     F64      |     100000      |         4         | 149.814 us |       2.77% |  74.001 us |       1.42% |    -75.813 us | -50.60% |   FAIL   |
|     F64      |    10000000     |         4         |  10.541 ms |       0.08% |   5.186 ms |       0.05% |  -5354.802 us | -50.80% |   FAIL   |
|     F64      |      1000       |        10         | 133.100 us |       0.65% |  76.647 us |       1.25% |    -56.453 us | -42.41% |   FAIL   |
|     F64      |     100000      |        10         | 277.337 us |       3.51% | 117.104 us |       0.89% |   -160.233 us | -57.78% |   FAIL   |
|     F64      |    10000000     |        10         |  19.648 ms |       0.06% |   9.214 ms |       0.05% | -10434.298 us | -53.11% |   FAIL   |
|     F64      |      1000       |        100        |   1.017 ms |       0.19% | 485.950 us |       0.37% |   -531.440 us | -52.24% |   FAIL   |
|     F64      |     100000      |        100        |   1.992 ms |       0.63% | 517.869 us |       0.25% |  -1474.597 us | -74.01% |   FAIL   |
|     F64      |    10000000     |        100        | 137.110 ms |       0.08% |  40.617 ms |       0.07% | -96493.030 us | -70.38% |   FAIL   |

# Summary

- Total Matches: 18
  - Pass    (diff <= min_noise): 0
  - Unknown (infinite noise):    0
  - Failure (diff > min_noise):  18

Potential impact introducing divergence and removal of division

EDIT: there's no divergence: #587 (comment)

@zhangjianting and @harrism mentions in different occasions of the changes in

cuspatial/cpp/include/cuspatial/experimental/detail/point_in_polygon.cuh

Lines 85 to 88 in e3a2134

    
           if (rise > 0 && (test_point.x - a.x) * rise < run * rise_to_point) 
        
             point_is_within = not point_is_within; 
        
           else if (rise < 0 && (test_point.x - a.x) * rise > run * rise_to_point) 
        
             point_is_within = not point_is_within;

may cause unwanted overhead due to divergence. The assumption to do so is that the cycles cost by division is greater than that of warp convergence. This is apparently dependent on many factors, such as compiling with/without --fast-math flag or how fast the hardware supports warp convergence. On a Tesla V100 the benchmark between divergence branch and division branch is as follows:

['optim_1.json', 'optim_2.json']
# point_in_polygon_benchmark

## [0] Tesla V100-SXM2-32GB

|  CoordsType  |  NumTestPoints  |  NumSidesPerRing  |   Ref Time |   Ref Noise |   Cmp Time |   Cmp Noise |         Diff |   %Diff |  Status  |
|--------------|-----------------|-------------------|------------|-------------|------------|-------------|--------------|---------|----------|
|     F32      |      1000       |         4         |  37.351 us |       6.39% |  36.483 us |       6.33% |    -0.868 us |  -2.32% |   PASS   |
|     F32      |     100000      |         4         |  56.265 us |       1.63% |  55.468 us |       1.75% |    -0.797 us |  -1.42% |   PASS   |
|     F32      |    10000000     |         4         |   3.858 ms |       0.02% |   3.785 ms |       0.14% |   -72.843 us |  -1.89% |   FAIL   |
|     F32      |      1000       |        10         |  56.490 us |       2.04% |  55.120 us |       1.82% |    -1.370 us |  -2.43% |   FAIL   |
|     F32      |     100000      |        10         |  85.581 us |       1.08% |  81.772 us |       1.14% |    -3.810 us |  -4.45% |   FAIL   |
|     F32      |    10000000     |        10         |   6.327 ms |       0.11% |   5.987 ms |       0.06% |  -340.206 us |  -5.38% |   FAIL   |
|     F32      |      1000       |        100        | 314.513 us |       0.41% | 314.469 us |       0.30% |    -0.044 us |  -0.01% |   PASS   |
|     F32      |     100000      |        100        | 347.750 us |       0.49% | 343.747 us |       0.37% |    -4.003 us |  -1.15% |   FAIL   |
|     F32      |    10000000     |        100        |  29.822 ms |       0.07% |  29.318 ms |       0.06% |  -504.290 us |  -1.69% |   FAIL   |
|     F64      |      1000       |         4         |  49.142 us |       1.83% |  46.365 us |       2.23% |    -2.777 us |  -5.65% |   FAIL   |
|     F64      |     100000      |         4         |  78.167 us |       1.23% |  74.001 us |       1.42% |    -4.166 us |  -5.33% |   FAIL   |
|     F64      |    10000000     |         4         |   5.641 ms |       0.05% |   5.186 ms |       0.05% |  -454.642 us |  -8.06% |   FAIL   |
|     F64      |      1000       |        10         |  80.904 us |       1.13% |  76.647 us |       1.25% |    -4.257 us |  -5.26% |   FAIL   |
|     F64      |     100000      |        10         | 125.702 us |       0.65% | 117.104 us |       0.89% |    -8.598 us |  -6.84% |   FAIL   |
|     F64      |    10000000     |        10         |  10.074 ms |       0.03% |   9.214 ms |       0.05% |  -860.423 us |  -8.54% |   FAIL   |
|     F64      |      1000       |        100        | 487.959 us |       0.40% | 485.950 us |       0.37% |    -2.009 us |  -0.41% |   FAIL   |
|     F64      |     100000      |        100        | 528.070 us |       0.35% | 517.869 us |       0.25% |   -10.201 us |  -1.93% |   FAIL   |
|     F64      |    10000000     |        100        |  44.663 ms |       0.18% |  40.617 ms |       0.07% | -4045.961 us |  -9.06% |   FAIL   |

which gives visible speedups. This PR adopts the change based on this benchmark.

Minor improvement and follow ups

This PR also introduces benchmarking suites for point_in_polygon API.

Follow ups to this PR includes:

Introduce a set of functions to do input validation. This can help alleviate API misuse of inconsistent result with cuspatial.point_in_polygon #492.
Improve test cases of this function. The function supports polygons of any sides, but test cases contains largely only quads.

cpp/include/cuspatial/detail/utility/traits.hpp

…into refactor/pip_header_only

harrism · 2022-07-25T22:19:52Z

cpp/include/cuspatial/experimental/detail/point_in_polygon.cuh

-
-      if (y_in_bounds && test_point.x < (run / rise) * rise_to_point + a.x) {
-        point_is_within = not point_is_within;
+    for (auto point_idx = ring_begin; point_idx < ring_end; point_idx++) {


It would be nice to know how much of the speedup came just from eliminating the division vs. all the other changes. Because there is potential for more thread divergence in the new code.

I will include the comparison data of division in the PR description. I will also include a comment stating the assumption we have here is that the cycles it costs on division is greater than that of warp syncs. It's apparently dependent on many factors such as whether user compiles the code with fast-math, or how fast the hardware synchronizes divergence.

That said, I don't think it's appropriate to mention benchmark results in the comments as it's hardware dependent and will get stale very quickly.

harrism · 2022-07-25T22:22:35Z

cpp/tests/spatial/point_in_polygon_test.cpp

+struct PointInPolygonUnsupportedChronoTypesTest : public BaseFixture {
+};
+
+TYPED_TEST_CASE(PointInPolygonUnsupportedChronoTypesTest, ChronoTypes);


I can't remember why we have unsupported types tests specifically for Chrono types but not for other non-numerical types (e.g. nested types). Is it necessary to test these at all? Every type we test increases compile time, so I wonder if we can just test one unsupported type as a proxy for all the others?

Don't necessarily need to decide this in the present PR!

harrism

Nice optimization!

…into refactor/pip_header_only

cpp/include/cuspatial/experimental/detail/point_in_polygon.cuh

Co-authored-by: Mark Harris <[email protected]>

isVoid · 2022-07-27T21:29:00Z

@gpucibot merge

isVoid added 12 commits July 6, 2022 11:50

Initial conversion

6cb6180

add first header only test

7133e34

add second test

4efbe67

Add more tests

a6209e1

Separate the two offset iterator types

c8afc71

style

f7ab8c2

Remove some type tests

aef327d

Replace iterator_trait with helpers

acef20e

Add folded is_integral

b3d5b7a

move iterator type helpers to traits.hpp

e1d0e30

Remove overlapping tests in cudf API

0432721

documentation

27fb73c

isVoid requested review from a team as code owners July 8, 2022 23:59

isVoid requested review from cwharris and zhangjianting July 8, 2022 23:59

github-actions bot added cmake Related to CMake code or build configuration libcuspatial Relates to the cuSpatial C++ library labels Jul 8, 2022

thomcom approved these changes Jul 11, 2022

View reviewed changes

cpp/include/cuspatial/detail/utility/traits.hpp Show resolved Hide resolved

isVoid added 4 commits July 13, 2022 13:58

Adopt explict type for references to iterator values

794f6ad

Factor out the *single* point in polygon logic

d4fd3dc

Refactor single point test kernel interface

56fdbd8

remove experiemental prefix in doxygen_groups.h

fe9c220

isVoid added 10 commits July 13, 2022 17:14

Add todo

f192862

fix typo

ae44ef3

Merge branch 'branch-22.08' of https://github.com/rapidsai/cuspatial …

0c5298c

…into refactor/pip_header_only

Introduce point in polygon benchmark

61366d0

Implement Eric's style of "wrap around" technique

399e80d

Further remove the division in the pip test

041d722

Reintroduce the type tests for cudf API pip test.

1db2574

style

c5eaa2d

Update docs

f015f28

More docstrings

047551d

harrism reviewed Jul 25, 2022

View reviewed changes

harrism approved these changes Jul 25, 2022

View reviewed changes

harrism removed request for cwharris and zhangjianting July 26, 2022 00:15

isVoid added 3 commits July 26, 2022 18:07

Add comment on potential impact on divergence

e3a2134

docs update

faaf9a3

Merge branch 'branch-22.08' of https://github.com/rapidsai/cuspatial …

8a7b34b

…into refactor/pip_header_only

harrism reviewed Jul 27, 2022

View reviewed changes

cpp/include/cuspatial/experimental/detail/point_in_polygon.cuh Outdated Show resolved Hide resolved

isVoid and others added 2 commits July 27, 2022 11:50

Update cpp/include/cuspatial/experimental/detail/point_in_polygon.cuh

841b69d

Co-authored-by: Mark Harris <[email protected]>

style

901a4d8

rapids-bot bot merged commit f7d9687 into rapidsai:branch-22.08 Jul 27, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Header-only Refactor of `point_in_polygon` #587

Header-only Refactor of `point_in_polygon` #587

isVoid commented Jul 8, 2022 •

edited

Loading

harrism Jul 25, 2022

isVoid Jul 27, 2022 •

edited

Loading

harrism Jul 25, 2022

harrism left a comment

isVoid commented Jul 27, 2022

	if (rise > 0 && (test_point.x - a.x) * rise < run * rise_to_point)
	point_is_within = not point_is_within;
	else if (rise < 0 && (test_point.x - a.x) * rise > run * rise_to_point)
	point_is_within = not point_is_within;

Header-only Refactor of point_in_polygon #587

Header-only Refactor of point_in_polygon #587

Conversation

isVoid commented Jul 8, 2022 • edited Loading

Overview

Potential impact introducing divergence and removal of division

Minor improvement and follow ups

harrism Jul 25, 2022

Choose a reason for hiding this comment

isVoid Jul 27, 2022 • edited Loading

Choose a reason for hiding this comment

harrism Jul 25, 2022

Choose a reason for hiding this comment

harrism left a comment

Choose a reason for hiding this comment

isVoid commented Jul 27, 2022

Header-only Refactor of `point_in_polygon` #587

Header-only Refactor of `point_in_polygon` #587

isVoid commented Jul 8, 2022 •

edited

Loading

isVoid Jul 27, 2022 •

edited

Loading