
Use hostdevice_vector in kernel_error to avoid the pageable copy #15140

Merged
merged 9 commits on Mar 5, 2024

Conversation

vuule
Contributor

@vuule vuule commented Feb 24, 2024

Description

Issue #15122

The addition of kernel error checking introduced a 5% performance regression in Spark-RAPIDS. It was determined that the pageable copy of the error back to the host caused this overhead, presumably because of contention on CUDA's bounce buffer used for pageable copies.

This PR aims to eliminate most of the error checking overhead by using hostdevice_vector in the kernel_error class. The hostdevice_vector uses pinned memory, so the device-to-host copy is no longer pageable. The PR also removes the redundant sync after we read the error.

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@vuule vuule added cuIO cuIO issue Performance Performance related issue Spark Functionality that helps Spark RAPIDS improvement Improvement / enhancement to an existing function non-breaking Non-breaking change labels Feb 24, 2024
@vuule vuule self-assigned this Feb 24, 2024
@github-actions github-actions bot added the libcudf Affects libcudf (C++/CUDA) code. label Feb 24, 2024
@GregoryKimball
Contributor

Thank you @vuule for opening this. Do you think the pageable copy could be the root cause of the performance difference Spark-RAPIDS observed with #14167 on A100? (Also note the tangentially related V100 issue #14415.)

@vuule
Contributor Author

vuule commented Feb 26, 2024

Do you think the pageable copy could be the root cause of the performance difference Spark-RAPIDS observed with #14167 on A100?

@abellina suspects this might be the case. It's possible that the shared bounce buffer used for pageable copies creates a bottleneck only in multi-threaded use cases, which would explain why the regression is specific to Spark. I opened the PR to enable @abellina to test that hypothesis.

@abellina
Contributor

@vuule @GregoryKimball we will review/test this, sorry for the delayed response.

@abellina
Contributor

@vuule I have done some testing with this, but I need a bit more time. So far it looks like the version with the error code checking plus the _stream.synchronize() is about 10 seconds slower than the version without them:

in reader_impl.cpp when we check the error code:

This:

  //if (error_code.value() != 0) {
  //  CUDF_FAIL("Parquet data decode failed with code(s) " + error_code.str());
  //}
  // error_code.value() is synchronous; explicitly sync here for better visibility
  //_stream.synchronize();

Is 10 seconds faster than:

  if (error_code.value() != 0) {
    CUDF_FAIL("Parquet data decode failed with code(s) " + error_code.str());
  }
  // error_code.value() is synchronous; explicitly sync here for better visibility
  _stream.synchronize();

Which is much better than what I had measured before (with the pageable copy we had a 20-second, or 5%, regression).

I also want to try skipping that last _stream.synchronize() and see if that's part of it. Even with this change, the pinned copy could be contending with a busy copy engine that is trying to handle large H2D copies to get data into the Parquet decode kernels, so it makes sense that there's unintended synchronization here.

@abellina
Contributor

abellina commented Feb 28, 2024

@vuule here are my updates so far:

  1. I ran our spark-rapids integration tests and found an unexpected test failure that led me to dig in. The issue is with removing the synchronize (your hostdevice_vector change is fine; we just don't want to remove the synchronize here: https://github.com/rapidsai/cudf/blob/branch-24.04/cpp/src/io/parquet/reader_impl.cpp#L251). It looks to be protecting the subpass null fixing that happens later, in https://github.com/rapidsai/cudf/blob/branch-24.04/cpp/src/io/parquet/reader_impl.cpp#L295. Posting here to make sure others agree; I am not entirely sure whether we expect some of that state to still be in flight at this stage. If so, we should probably adjust the comment above it.

  2. Assuming that a test without the sync is invalid, here are the results for checking error_code vs. not checking it, on top of a branch that has several optimizations (micro kernel PR, pinned pool, chunking optimization from Dave), while always keeping the synchronize I pointed out in (1):

    Not checking as the baseline vs pooled+pinned approach (overall ~3 seconds slower, noise IMHO)
    query1: Previous (1817.6666666666667 ms) vs Current (1760.6666666666667 ms) Diff 57 E2E 1.03x
    query2: Previous (2549.0 ms) vs Current (2553.3333333333335 ms) Diff -4 E2E 1.00x
    query3: Previous (668.6666666666666 ms) vs Current (690.0 ms) Diff -21 E2E 0.97x
    query4: Previous (13728.666666666666 ms) vs Current (13486.666666666666 ms) Diff 242 E2E 1.02x
    query5: Previous (2851.6666666666665 ms) vs Current (2786.6666666666665 ms) Diff 65 E2E 1.02x
    query6: Previous (1088.0 ms) vs Current (1245.3333333333333 ms) Diff -157 E2E 0.87x
    query7: Previous (1536.0 ms) vs Current (1461.0 ms) Diff 75 E2E 1.05x
    query8: Previous (1369.3333333333333 ms) vs Current (1390.3333333333333 ms) Diff -21 E2E 0.98x
    query9: Previous (6805.666666666667 ms) vs Current (7064.666666666667 ms) Diff -259 E2E 0.96x
    query10: Previous (1643.3333333333333 ms) vs Current (2533.6666666666665 ms) Diff -890 E2E 0.65x
    query11: Previous (6980.666666666667 ms) vs Current (6977.333333333333 ms) Diff 3 E2E 1.00x
    query12: Previous (790.0 ms) vs Current (1067.3333333333333 ms) Diff -277 E2E 0.74x
    query13: Previous (2043.0 ms) vs Current (2370.0 ms) Diff -327 E2E 0.86x
    query14_part1: Previous (8817.666666666666 ms) vs Current (8887.333333333334 ms) Diff -69 E2E 0.99x
    query14_part2: Previous (7198.666666666667 ms) vs Current (7580.0 ms) Diff -381 E2E 0.95x
    query15: Previous (1237.6666666666667 ms) vs Current (1445.3333333333333 ms) Diff -207 E2E 0.86x
    query16: Previous (9869.0 ms) vs Current (9954.0 ms) Diff -85 E2E 0.99x
    query17: Previous (2513.6666666666665 ms) vs Current (2261.0 ms) Diff 252 E2E 1.11x
    query18: Previous (4320.0 ms) vs Current (4429.0 ms) Diff -109 E2E 0.98x
    query19: Previous (1328.6666666666667 ms) vs Current (1783.3333333333333 ms) Diff -454 E2E 0.75x
    query20: Previous (835.3333333333334 ms) vs Current (696.6666666666666 ms) Diff 138 E2E 1.20x
    query21: Previous (503.0 ms) vs Current (506.3333333333333 ms) Diff -3 E2E 0.99x
    query22: Previous (1204.6666666666667 ms) vs Current (1172.6666666666667 ms) Diff 32 E2E 1.03x
    query23_part1: Previous (17476.666666666668 ms) vs Current (17624.0 ms) Diff -147 E2E 0.99x
    query23_part2: Previous (23662.0 ms) vs Current (23583.666666666668 ms) Diff 78 E2E 1.00x
    query24_part1: Previous (8408.666666666666 ms) vs Current (8234.333333333334 ms) Diff 174 E2E 1.02x
    query24_part2: Previous (8295.0 ms) vs Current (8625.666666666666 ms) Diff -330 E2E 0.96x
    query25: Previous (1959.3333333333333 ms) vs Current (1831.3333333333333 ms) Diff 128 E2E 1.07x
    query26: Previous (1651.6666666666667 ms) vs Current (1309.3333333333333 ms) Diff 342 E2E 1.26x
    query27: Previous (1693.6666666666667 ms) vs Current (1607.0 ms) Diff 86 E2E 1.05x
    query28: Previous (7103.0 ms) vs Current (7196.0 ms) Diff -93 E2E 0.99x
    query29: Previous (3470.3333333333335 ms) vs Current (3352.3333333333335 ms) Diff 118 E2E 1.04x
    query30: Previous (3395.0 ms) vs Current (3432.3333333333335 ms) Diff -37 E2E 0.99x
    query31: Previous (3170.6666666666665 ms) vs Current (3190.3333333333335 ms) Diff -19 E2E 0.99x
    query32: Previous (1783.6666666666667 ms) vs Current (1884.6666666666667 ms) Diff -101 E2E 0.95x
    query33: Previous (1560.3333333333333 ms) vs Current (1564.3333333333333 ms) Diff -4 E2E 1.00x
    query34: Previous (2351.6666666666665 ms) vs Current (2397.3333333333335 ms) Diff -45 E2E 0.98x
    query35: Previous (2431.6666666666665 ms) vs Current (2564.3333333333335 ms) Diff -132 E2E 0.95x
    query36: Previous (1705.6666666666667 ms) vs Current (1758.0 ms) Diff -52 E2E 0.97x
    query37: Previous (1485.0 ms) vs Current (1451.6666666666667 ms) Diff 33 E2E 1.02x
    query38: Previous (4119.666666666667 ms) vs Current (4177.333333333333 ms) Diff -57 E2E 0.99x
    query39_part1: Previous (1526.3333333333333 ms) vs Current (1491.0 ms) Diff 35 E2E 1.02x
    query39_part2: Previous (1244.0 ms) vs Current (1170.0 ms) Diff 74 E2E 1.06x
    query40: Previous (1706.0 ms) vs Current (1629.3333333333333 ms) Diff 76 E2E 1.05x
    query41: Previous (414.3333333333333 ms) vs Current (437.6666666666667 ms) Diff -23 E2E 0.95x
    query42: Previous (340.3333333333333 ms) vs Current (345.0 ms) Diff -4 E2E 0.99x
    query43: Previous (827.6666666666666 ms) vs Current (872.3333333333334 ms) Diff -44 E2E 0.95x
    query44: Previous (538.0 ms) vs Current (582.3333333333334 ms) Diff -44 E2E 0.92x
    query45: Previous (1118.6666666666667 ms) vs Current (1206.0 ms) Diff -87 E2E 0.93x
    query46: Previous (1755.3333333333333 ms) vs Current (2000.0 ms) Diff -244 E2E 0.88x
    query47: Previous (2244.3333333333335 ms) vs Current (2237.6666666666665 ms) Diff 6 E2E 1.00x
    query48: Previous (1035.6666666666667 ms) vs Current (1055.0 ms) Diff -19 E2E 0.98x
    query49: Previous (2834.0 ms) vs Current (3222.0 ms) Diff -388 E2E 0.88x
    query50: Previous (9709.333333333334 ms) vs Current (9730.0 ms) Diff -20 E2E 1.00x
    query51: Previous (2882.6666666666665 ms) vs Current (2956.3333333333335 ms) Diff -73 E2E 0.98x
    query52: Previous (453.3333333333333 ms) vs Current (473.3333333333333 ms) Diff -20 E2E 0.96x
    query53: Previous (838.6666666666666 ms) vs Current (869.3333333333334 ms) Diff -30 E2E 0.96x
    query54: Previous (1625.6666666666667 ms) vs Current (1795.6666666666667 ms) Diff -170 E2E 0.91x
    query55: Previous (448.6666666666667 ms) vs Current (441.0 ms) Diff 7 E2E 1.02x
    query56: Previous (1036.0 ms) vs Current (1093.6666666666667 ms) Diff -57 E2E 0.95x
    query57: Previous (2095.3333333333335 ms) vs Current (1909.0 ms) Diff 186 E2E 1.10x
    query58: Previous (1326.0 ms) vs Current (1398.0 ms) Diff -72 E2E 0.95x
    query59: Previous (2669.3333333333335 ms) vs Current (2130.6666666666665 ms) Diff 538 E2E 1.25x
    query60: Previous (1957.6666666666667 ms) vs Current (2025.6666666666667 ms) Diff -68 E2E 0.97x
    query61: Previous (1129.6666666666667 ms) vs Current (1188.0 ms) Diff -58 E2E 0.95x
    query62: Previous (1534.0 ms) vs Current (1535.0 ms) Diff -1 E2E 1.00x
    query63: Previous (1024.0 ms) vs Current (994.6666666666666 ms) Diff 29 E2E 1.03x
    query64: Previous (17846.0 ms) vs Current (18693.333333333332 ms) Diff -847 E2E 0.95x
    query65: Previous (4099.0 ms) vs Current (3879.6666666666665 ms) Diff 219 E2E 1.06x
    query66: Previous (4788.333333333333 ms) vs Current (4858.0 ms) Diff -69 E2E 0.99x
    query67: Previous (26821.666666666668 ms) vs Current (26586.666666666668 ms) Diff 235 E2E 1.01x
    query68: Previous (1419.3333333333333 ms) vs Current (1506.3333333333333 ms) Diff -87 E2E 0.94x
    query69: Previous (1475.6666666666667 ms) vs Current (1413.3333333333333 ms) Diff 62 E2E 1.04x
    query70: Previous (2310.0 ms) vs Current (2130.0 ms) Diff 180 E2E 1.08x
    query71: Previous (3797.6666666666665 ms) vs Current (3436.0 ms) Diff 361 E2E 1.11x
    query72: Previous (3469.0 ms) vs Current (3404.0 ms) Diff 65 E2E 1.02x
    query73: Previous (1002.6666666666666 ms) vs Current (978.0 ms) Diff 24 E2E 1.03x
    query74: Previous (5282.333333333333 ms) vs Current (5346.0 ms) Diff -63 E2E 0.99x
    query75: Previous (7436.333333333333 ms) vs Current (7329.666666666667 ms) Diff 106 E2E 1.01x
    query76: Previous (3003.6666666666665 ms) vs Current (3002.6666666666665 ms) Diff 1 E2E 1.00x
    query77: Previous (1577.6666666666667 ms) vs Current (1612.0 ms) Diff -34 E2E 0.98x
    query78: Previous (10374.666666666666 ms) vs Current (10545.666666666666 ms) Diff -171 E2E 0.98x
    query79: Previous (1587.0 ms) vs Current (1674.0 ms) Diff -87 E2E 0.95x
    query80: Previous (4167.0 ms) vs Current (4153.333333333333 ms) Diff 13 E2E 1.00x
    query81: Previous (2775.6666666666665 ms) vs Current (2792.3333333333335 ms) Diff -16 E2E 0.99x
    query82: Previous (2502.3333333333335 ms) vs Current (2510.3333333333335 ms) Diff -8 E2E 1.00x
    query83: Previous (11964.0 ms) vs Current (11808.666666666666 ms) Diff 155 E2E 1.01x
    query84: Previous (1756.3333333333333 ms) vs Current (1899.6666666666667 ms) Diff -143 E2E 0.92x
    query85: Previous (2163.3333333333335 ms) vs Current (2054.0 ms) Diff 109 E2E 1.05x
    query86: Previous (1393.3333333333333 ms) vs Current (948.0 ms) Diff 445 E2E 1.47x
    query87: Previous (4516.666666666667 ms) vs Current (4484.666666666667 ms) Diff 32 E2E 1.01x
    query88: Previous (6350.666666666667 ms) vs Current (6380.0 ms) Diff -29 E2E 1.00x
    query89: Previous (1077.6666666666667 ms) vs Current (1089.6666666666667 ms) Diff -12 E2E 0.99x
    query90: Previous (922.3333333333334 ms) vs Current (1442.0 ms) Diff -519 E2E 0.64x
    query91: Previous (831.6666666666666 ms) vs Current (855.0 ms) Diff -23 E2E 0.97x
    query92: Previous (1125.0 ms) vs Current (1072.0 ms) Diff 53 E2E 1.05x
    query93: Previous (12711.0 ms) vs Current (12922.333333333334 ms) Diff -211 E2E 0.98x
    query94: Previous (4993.666666666667 ms) vs Current (5130.0 ms) Diff -136 E2E 0.97x
    query95: Previous (8324.666666666666 ms) vs Current (8345.666666666666 ms) Diff -21 E2E 1.00x
    query96: Previous (1311.3333333333333 ms) vs Current (1380.0 ms) Diff -68 E2E 0.95x
    query97: Previous (2320.6666666666665 ms) vs Current (2329.3333333333335 ms) Diff -8 E2E 1.00x
    query98: Previous (2410.3333333333335 ms) vs Current (2406.0 ms) Diff 4 E2E 1.00x
    query99: Previous (2567.3333333333335 ms) vs Current (2483.6666666666665 ms) Diff 83 E2E 1.03x
    benchmark: Previous (395333.3333333333 ms) vs Current (398666.6666666667 ms) Diff -3333 E2E 0.99x
    

    Checking the error code the old way (pageable) vs. the pinned way that this PR allows. The pinned approach is significantly faster overall (15 seconds, or 4%), and query9, a heavy Parquet user, is for example 26% faster.

    Pageable as the baseline vs the pooled+pinned approach
    query1: Previous (2036.3333333333333 ms) vs Current (1760.6666666666667 ms) Diff 275 E2E 1.16x
    query2: Previous (2658.0 ms) vs Current (2553.3333333333335 ms) Diff 104 E2E 1.04x
    query3: Previous (777.6666666666666 ms) vs Current (690.0 ms) Diff 87 E2E 1.13x
    query4: Previous (13619.0 ms) vs Current (13486.666666666666 ms) Diff 132 E2E 1.01x
    query5: Previous (2868.6666666666665 ms) vs Current (2786.6666666666665 ms) Diff 82 E2E 1.03x
    query6: Previous (1238.0 ms) vs Current (1245.3333333333333 ms) Diff -7 E2E 0.99x
    query7: Previous (1586.3333333333333 ms) vs Current (1461.0 ms) Diff 125 E2E 1.09x
    query8: Previous (1365.0 ms) vs Current (1390.3333333333333 ms) Diff -25 E2E 0.98x
    query9: Previous (8893.666666666666 ms) vs Current (7064.666666666667 ms) Diff 1828 E2E 1.26x
    query10: Previous (1890.6666666666667 ms) vs Current (2533.6666666666665 ms) Diff -642 E2E 0.75x
    query11: Previous (7476.666666666667 ms) vs Current (6977.333333333333 ms) Diff 499 E2E 1.07x
    query12: Previous (874.3333333333334 ms) vs Current (1067.3333333333333 ms) Diff -192 E2E 0.82x
    query13: Previous (2479.3333333333335 ms) vs Current (2370.0 ms) Diff 109 E2E 1.05x
    query14_part1: Previous (9293.333333333334 ms) vs Current (8887.333333333334 ms) Diff 406 E2E 1.05x
    query14_part2: Previous (7501.666666666667 ms) vs Current (7580.0 ms) Diff -78 E2E 0.99x
    query15: Previous (1336.6666666666667 ms) vs Current (1445.3333333333333 ms) Diff -108 E2E 0.92x
    query16: Previous (10106.333333333334 ms) vs Current (9954.0 ms) Diff 152 E2E 1.02x
    query17: Previous (2467.6666666666665 ms) vs Current (2261.0 ms) Diff 206 E2E 1.09x
    query18: Previous (4462.333333333333 ms) vs Current (4429.0 ms) Diff 33 E2E 1.01x
    query19: Previous (1619.0 ms) vs Current (1783.3333333333333 ms) Diff -164 E2E 0.91x
    query20: Previous (706.3333333333334 ms) vs Current (696.6666666666666 ms) Diff 9 E2E 1.01x
    query21: Previous (520.6666666666666 ms) vs Current (506.3333333333333 ms) Diff 14 E2E 1.03x
    query22: Previous (1250.0 ms) vs Current (1172.6666666666667 ms) Diff 77 E2E 1.07x
    query23_part1: Previous (17595.666666666668 ms) vs Current (17624.0 ms) Diff -28 E2E 1.00x
    query23_part2: Previous (24149.666666666668 ms) vs Current (23583.666666666668 ms) Diff 566 E2E 1.02x
    query24_part1: Previous (8733.333333333334 ms) vs Current (8234.333333333334 ms) Diff 499 E2E 1.06x
    query24_part2: Previous (8806.333333333334 ms) vs Current (8625.666666666666 ms) Diff 180 E2E 1.02x
    query25: Previous (2005.0 ms) vs Current (1831.3333333333333 ms) Diff 173 E2E 1.09x
    query26: Previous (1447.3333333333333 ms) vs Current (1309.3333333333333 ms) Diff 138 E2E 1.11x
    query27: Previous (1747.3333333333333 ms) vs Current (1607.0 ms) Diff 140 E2E 1.09x
    query28: Previous (8014.666666666667 ms) vs Current (7196.0 ms) Diff 818 E2E 1.11x
    query29: Previous (3444.3333333333335 ms) vs Current (3352.3333333333335 ms) Diff 92 E2E 1.03x
    query30: Previous (3492.3333333333335 ms) vs Current (3432.3333333333335 ms) Diff 60 E2E 1.02x
    query31: Previous (3596.3333333333335 ms) vs Current (3190.3333333333335 ms) Diff 406 E2E 1.13x
    query32: Previous (1913.6666666666667 ms) vs Current (1884.6666666666667 ms) Diff 29 E2E 1.02x
    query33: Previous (1593.3333333333333 ms) vs Current (1564.3333333333333 ms) Diff 29 E2E 1.02x
    query34: Previous (2539.0 ms) vs Current (2397.3333333333335 ms) Diff 141 E2E 1.06x
    query35: Previous (2964.6666666666665 ms) vs Current (2564.3333333333335 ms) Diff 400 E2E 1.16x
    query36: Previous (1872.3333333333333 ms) vs Current (1758.0 ms) Diff 114 E2E 1.07x
    query37: Previous (1487.3333333333333 ms) vs Current (1451.6666666666667 ms) Diff 35 E2E 1.02x
    query38: Previous (4559.666666666667 ms) vs Current (4177.333333333333 ms) Diff 382 E2E 1.09x
    query39_part1: Previous (1547.6666666666667 ms) vs Current (1491.0 ms) Diff 56 E2E 1.04x
    query39_part2: Previous (1297.3333333333333 ms) vs Current (1170.0 ms) Diff 127 E2E 1.11x
    query40: Previous (2219.3333333333335 ms) vs Current (1629.3333333333333 ms) Diff 590 E2E 1.36x
    query41: Previous (448.0 ms) vs Current (437.6666666666667 ms) Diff 10 E2E 1.02x
    query42: Previous (411.6666666666667 ms) vs Current (345.0 ms) Diff 66 E2E 1.19x
    query43: Previous (1002.3333333333334 ms) vs Current (872.3333333333334 ms) Diff 130 E2E 1.15x
    query44: Previous (557.6666666666666 ms) vs Current (582.3333333333334 ms) Diff -24 E2E 0.96x
    query45: Previous (1201.6666666666667 ms) vs Current (1206.0 ms) Diff -4 E2E 1.00x
    query46: Previous (1886.6666666666667 ms) vs Current (2000.0 ms) Diff -113 E2E 0.94x
    query47: Previous (2415.0 ms) vs Current (2237.6666666666665 ms) Diff 177 E2E 1.08x
    query48: Previous (1178.3333333333333 ms) vs Current (1055.0 ms) Diff 123 E2E 1.12x
    query49: Previous (2992.0 ms) vs Current (3222.0 ms) Diff -230 E2E 0.93x
    query50: Previous (9786.333333333334 ms) vs Current (9730.0 ms) Diff 56 E2E 1.01x
    query51: Previous (3071.6666666666665 ms) vs Current (2956.3333333333335 ms) Diff 115 E2E 1.04x
    query52: Previous (528.6666666666666 ms) vs Current (473.3333333333333 ms) Diff 55 E2E 1.12x
    query53: Previous (1395.3333333333333 ms) vs Current (869.3333333333334 ms) Diff 525 E2E 1.61x
    query54: Previous (1741.6666666666667 ms) vs Current (1795.6666666666667 ms) Diff -54 E2E 0.97x
    query55: Previous (475.3333333333333 ms) vs Current (441.0 ms) Diff 34 E2E 1.08x
    query56: Previous (1192.0 ms) vs Current (1093.6666666666667 ms) Diff 98 E2E 1.09x
    query57: Previous (2074.6666666666665 ms) vs Current (1909.0 ms) Diff 165 E2E 1.09x
    query58: Previous (1713.6666666666667 ms) vs Current (1398.0 ms) Diff 315 E2E 1.23x
    query59: Previous (2273.0 ms) vs Current (2130.6666666666665 ms) Diff 142 E2E 1.07x
    query60: Previous (2155.0 ms) vs Current (2025.6666666666667 ms) Diff 129 E2E 1.06x
    query61: Previous (1405.6666666666667 ms) vs Current (1188.0 ms) Diff 217 E2E 1.18x
    query62: Previous (1573.6666666666667 ms) vs Current (1535.0 ms) Diff 38 E2E 1.03x
    query63: Previous (1261.6666666666667 ms) vs Current (994.6666666666666 ms) Diff 267 E2E 1.27x
    query64: Previous (18739.333333333332 ms) vs Current (18693.333333333332 ms) Diff 46 E2E 1.00x
    query65: Previous (4060.6666666666665 ms) vs Current (3879.6666666666665 ms) Diff 181 E2E 1.05x
    query66: Previous (5219.333333333333 ms) vs Current (4858.0 ms) Diff 361 E2E 1.07x
    query67: Previous (26947.0 ms) vs Current (26586.666666666668 ms) Diff 360 E2E 1.01x
    query68: Previous (1568.3333333333333 ms) vs Current (1506.3333333333333 ms) Diff 62 E2E 1.04x
    query69: Previous (1645.3333333333333 ms) vs Current (1413.3333333333333 ms) Diff 232 E2E 1.16x
    query70: Previous (2375.6666666666665 ms) vs Current (2130.0 ms) Diff 245 E2E 1.12x
    query71: Previous (3673.0 ms) vs Current (3436.0 ms) Diff 237 E2E 1.07x
    query72: Previous (3622.3333333333335 ms) vs Current (3404.0 ms) Diff 218 E2E 1.06x
    query73: Previous (1173.6666666666667 ms) vs Current (978.0 ms) Diff 195 E2E 1.20x
    query74: Previous (5551.666666666667 ms) vs Current (5346.0 ms) Diff 205 E2E 1.04x
    query75: Previous (7977.666666666667 ms) vs Current (7329.666666666667 ms) Diff 648 E2E 1.09x
    query76: Previous (3162.0 ms) vs Current (3002.6666666666665 ms) Diff 159 E2E 1.05x
    query77: Previous (1804.6666666666667 ms) vs Current (1612.0 ms) Diff 192 E2E 1.12x
    query78: Previous (10676.0 ms) vs Current (10545.666666666666 ms) Diff 130 E2E 1.01x
    query79: Previous (1681.0 ms) vs Current (1674.0 ms) Diff 7 E2E 1.00x
    query80: Previous (4389.333333333333 ms) vs Current (4153.333333333333 ms) Diff 236 E2E 1.06x
    query81: Previous (2883.3333333333335 ms) vs Current (2792.3333333333335 ms) Diff 91 E2E 1.03x
    query82: Previous (2568.0 ms) vs Current (2510.3333333333335 ms) Diff 57 E2E 1.02x
    query83: Previous (11272.0 ms) vs Current (11808.666666666666 ms) Diff -536 E2E 0.95x
    query84: Previous (1752.0 ms) vs Current (1899.6666666666667 ms) Diff -147 E2E 0.92x
    query85: Previous (2042.3333333333333 ms) vs Current (2054.0 ms) Diff -11 E2E 0.99x
    query86: Previous (960.6666666666666 ms) vs Current (948.0 ms) Diff 12 E2E 1.01x
    query87: Previous (4850.333333333333 ms) vs Current (4484.666666666667 ms) Diff 365 E2E 1.08x
    query88: Previous (7409.666666666667 ms) vs Current (6380.0 ms) Diff 1029 E2E 1.16x
    query89: Previous (1223.3333333333333 ms) vs Current (1089.6666666666667 ms) Diff 133 E2E 1.12x
    query90: Previous (1035.0 ms) vs Current (1442.0 ms) Diff -407 E2E 0.72x
    query91: Previous (1005.0 ms) vs Current (855.0 ms) Diff 150 E2E 1.18x
    query92: Previous (1188.0 ms) vs Current (1072.0 ms) Diff 116 E2E 1.11x
    query93: Previous (12937.333333333334 ms) vs Current (12922.333333333334 ms) Diff 15 E2E 1.00x
    query94: Previous (5003.333333333333 ms) vs Current (5130.0 ms) Diff -126 E2E 0.98x
    query95: Previous (8532.333333333334 ms) vs Current (8345.666666666666 ms) Diff 186 E2E 1.02x
    query96: Previous (1324.0 ms) vs Current (1380.0 ms) Diff -56 E2E 0.96x
    query97: Previous (2628.6666666666665 ms) vs Current (2329.3333333333335 ms) Diff 299 E2E 1.13x
    query98: Previous (2663.6666666666665 ms) vs Current (2406.0 ms) Diff 257 E2E 1.11x
    query99: Previous (2666.6666666666665 ms) vs Current (2483.6666666666665 ms) Diff 183 E2E 1.07x
    benchmark: Previous (414000.0 ms) vs Current (398666.6666666667 ms) Diff 15333 E2E 1.04x
    
  3. As you mentioned before, only Spark will have a pool-backed hostdevice_vector so far, so all other users will need to either set up a pool or somehow fall back to a pageable version of error_code again? The cudaHostAlloc + cudaFreeHost calls for 4 bytes may add up quickly, especially in multi-threaded cases, but I haven't benchmarked this scenario yet.

@nvdbaranec
Contributor

  1. I ran our spark-rapids integration tests and found an unexpected test failure that led me to dig in. The issue is with removing the synchronize (your hostdevice_vector change is fine; we just don't want to remove the synchronize here: https://github.com/rapidsai/cudf/blob/branch-24.04/cpp/src/io/parquet/reader_impl.cpp#L251). It looks to be protecting the subpass null fixing that happens later, in https://github.com/rapidsai/cudf/blob/branch-24.04/cpp/src/io/parquet/reader_impl.cpp#L295. Posting here to make sure others agree; I am not entirely sure whether we expect some of that state to still be in flight at this stage. If so, we should probably adjust the comment above it.

Correct. The synchronization there is needed because the null count code needs the updated data from the page and nesting info buffers that get computed in the kernel. The .device_to_host_async() calls just above the original error_code check kick those copies off.

@etseidl
Contributor

etseidl commented Feb 28, 2024

  1. I ran our spark-rapids integration tests and found an unexpected test failure that led me to dig in. The issue is with removing the synchronize (your hostdevice_vector change is fine; we just don't want to remove the synchronize here: https://github.com/rapidsai/cudf/blob/branch-24.04/cpp/src/io/parquet/reader_impl.cpp#L251). It looks to be protecting the subpass null fixing that happens later, in https://github.com/rapidsai/cudf/blob/branch-24.04/cpp/src/io/parquet/reader_impl.cpp#L295. Posting here to make sure others agree; I am not entirely sure whether we expect some of that state to still be in flight at this stage. If so, we should probably adjust the comment above it.

Correct. The synchronization there is needed because the null count code needs the updated data from the page and nesting info buffers that get computed in the kernel. The .device_to_host_async() calls just above the original error_code check kick those copies off.

But the call to error_code.value() explicitly calls synchronize() on _stream, so I really don't understand why a second sync on the same stream immediately after is necessary, unless there's some funny business with copying cuda_stream_views.

@abellina
Contributor

  1. I ran our spark-rapids integration tests and found an unexpected test failure that led me to dig in. The issue is with removing the synchronize (your hostdevice_vector change is fine; we just don't want to remove the synchronize here: https://github.com/rapidsai/cudf/blob/branch-24.04/cpp/src/io/parquet/reader_impl.cpp#L251). It looks to be protecting the subpass null fixing that happens later, in https://github.com/rapidsai/cudf/blob/branch-24.04/cpp/src/io/parquet/reader_impl.cpp#L295. Posting here to make sure others agree; I am not entirely sure whether we expect some of that state to still be in flight at this stage. If so, we should probably adjust the comment above it.

Correct. The synchronization there is needed because the null count code needs the updated data from the page and nesting info buffers that get computed in the kernel. The .device_to_host_async() calls just above the original error_code check kick those copies off.

But the call to error_code.value() explicitly calls synchronize() on _stream, so I really don't understand why a second sync on the same stream immediately after is necessary, unless there's some funny business with copying cuda_stream_views.

Sorry, the funny business is on my end. I also didn't call error_code.value(), because I wanted to remove the synchronization added by that pageable copy. So yes, with the error_code.value() synchronization we'd be fine as well, but we probably want to improve the documentation here for the future.

@etseidl
Contributor

etseidl commented Feb 28, 2024

Sorry, the funny business is on my end. I also didn't call error_code.value(), because I wanted to remove the synchronization added by that pageable copy. So yes, with the error_code.value() synchronization we'd be fine as well, but we probably want to improve the documentation here for the future.

What if we change value() to value_sync()? Then the synchronization is explicit in the name.

@abellina
Contributor

Got an extra data point by using hostdevice_vector without any pooling, so this case goes through cudaMallocHost and cudaFreeHost for the 4 bytes:

Comparing it with a baseline that is the pageable version of this (the code before this PR), it's essentially unchanged overall. Some queries are faster, and some are slower. Query 9 is 9% faster.

Details:

query1: Previous (2036.3333333333333 ms) vs Current (1887.6666666666667 ms) Diff 148 E2E 1.08x
query2: Previous (2658.0 ms) vs Current (2952.3333333333335 ms) Diff -294 E2E 0.90x
query3: Previous (777.6666666666666 ms) vs Current (761.3333333333334 ms) Diff 16 E2E 1.02x
query4: Previous (13619.0 ms) vs Current (13861.333333333334 ms) Diff -242 E2E 0.98x
query5: Previous (2868.6666666666665 ms) vs Current (2735.6666666666665 ms) Diff 133 E2E 1.05x
query6: Previous (1238.0 ms) vs Current (1198.3333333333333 ms) Diff 39 E2E 1.03x
query7: Previous (1586.3333333333333 ms) vs Current (1625.6666666666667 ms) Diff -39 E2E 0.98x
query8: Previous (1365.0 ms) vs Current (1405.0 ms) Diff -40 E2E 0.97x
query9: Previous (8893.666666666666 ms) vs Current (8147.666666666667 ms) Diff 745 E2E 1.09x
query10: Previous (1890.6666666666667 ms) vs Current (2038.3333333333333 ms) Diff -147 E2E 0.93x
query11: Previous (7476.666666666667 ms) vs Current (7891.0 ms) Diff -414 E2E 0.95x
query12: Previous (874.3333333333334 ms) vs Current (819.3333333333334 ms) Diff 55 E2E 1.07x
query13: Previous (2479.3333333333335 ms) vs Current (2219.3333333333335 ms) Diff 260 E2E 1.12x
query14_part1: Previous (9293.333333333334 ms) vs Current (9385.333333333334 ms) Diff -92 E2E 0.99x
query14_part2: Previous (7501.666666666667 ms) vs Current (7342.333333333333 ms) Diff 159 E2E 1.02x
query15: Previous (1336.6666666666667 ms) vs Current (1307.6666666666667 ms) Diff 29 E2E 1.02x
query16: Previous (10106.333333333334 ms) vs Current (9841.333333333334 ms) Diff 265 E2E 1.03x
query17: Previous (2467.6666666666665 ms) vs Current (2595.6666666666665 ms) Diff -128 E2E 0.95x
query18: Previous (4462.333333333333 ms) vs Current (4460.333333333333 ms) Diff 2 E2E 1.00x
query19: Previous (1619.0 ms) vs Current (1640.3333333333333 ms) Diff -21 E2E 0.99x
query20: Previous (706.3333333333334 ms) vs Current (742.3333333333334 ms) Diff -36 E2E 0.95x
query21: Previous (520.6666666666666 ms) vs Current (577.6666666666666 ms) Diff -57 E2E 0.90x
query22: Previous (1250.0 ms) vs Current (1257.0 ms) Diff -7 E2E 0.99x
query23_part1: Previous (17595.666666666668 ms) vs Current (18077.333333333332 ms) Diff -481 E2E 0.97x
query23_part2: Previous (24149.666666666668 ms) vs Current (23993.666666666668 ms) Diff 156 E2E 1.01x
query24_part1: Previous (8733.333333333334 ms) vs Current (8505.666666666666 ms) Diff 227 E2E 1.03x
query24_part2: Previous (8806.333333333334 ms) vs Current (8578.0 ms) Diff 228 E2E 1.03x
query25: Previous (2005.0 ms) vs Current (1961.0 ms) Diff 44 E2E 1.02x
query26: Previous (1447.3333333333333 ms) vs Current (1436.3333333333333 ms) Diff 11 E2E 1.01x
query27: Previous (1747.3333333333333 ms) vs Current (1779.0 ms) Diff -31 E2E 0.98x
query28: Previous (8014.666666666667 ms) vs Current (8848.0 ms) Diff -833 E2E 0.91x
query29: Previous (3444.3333333333335 ms) vs Current (3466.3333333333335 ms) Diff -22 E2E 0.99x
query30: Previous (3492.3333333333335 ms) vs Current (3882.6666666666665 ms) Diff -390 E2E 0.90x
query31: Previous (3596.3333333333335 ms) vs Current (3594.3333333333335 ms) Diff 2 E2E 1.00x
query32: Previous (1913.6666666666667 ms) vs Current (1850.3333333333333 ms) Diff 63 E2E 1.03x
query33: Previous (1593.3333333333333 ms) vs Current (1595.3333333333333 ms) Diff -2 E2E 1.00x
query34: Previous (2539.0 ms) vs Current (2689.0 ms) Diff -150 E2E 0.94x
query35: Previous (2964.6666666666665 ms) vs Current (2799.6666666666665 ms) Diff 165 E2E 1.06x
query36: Previous (1872.3333333333333 ms) vs Current (1691.0 ms) Diff 181 E2E 1.11x
query37: Previous (1487.3333333333333 ms) vs Current (1526.6666666666667 ms) Diff -39 E2E 0.97x
query38: Previous (4559.666666666667 ms) vs Current (4462.0 ms) Diff 97 E2E 1.02x
query39_part1: Previous (1547.6666666666667 ms) vs Current (1521.0 ms) Diff 26 E2E 1.02x
query39_part2: Previous (1297.3333333333333 ms) vs Current (1221.6666666666667 ms) Diff 75 E2E 1.06x
query40: Previous (2219.3333333333335 ms) vs Current (1734.6666666666667 ms) Diff 484 E2E 1.28x
query41: Previous (448.0 ms) vs Current (420.6666666666667 ms) Diff 27 E2E 1.06x
query42: Previous (411.6666666666667 ms) vs Current (392.6666666666667 ms) Diff 19 E2E 1.05x
query43: Previous (1002.3333333333334 ms) vs Current (962.6666666666666 ms) Diff 39 E2E 1.04x
query44: Previous (557.6666666666666 ms) vs Current (564.3333333333334 ms) Diff -6 E2E 0.99x
query45: Previous (1201.6666666666667 ms) vs Current (1165.3333333333333 ms) Diff 36 E2E 1.03x
query46: Previous (1886.6666666666667 ms) vs Current (1999.6666666666667 ms) Diff -113 E2E 0.94x
query47: Previous (2415.0 ms) vs Current (2421.6666666666665 ms) Diff -6 E2E 1.00x
query48: Previous (1178.3333333333333 ms) vs Current (1189.6666666666667 ms) Diff -11 E2E 0.99x
query49: Previous (2992.0 ms) vs Current (3087.0 ms) Diff -95 E2E 0.97x
query50: Previous (9786.333333333334 ms) vs Current (9685.0 ms) Diff 101 E2E 1.01x
query51: Previous (3071.6666666666665 ms) vs Current (3105.0 ms) Diff -33 E2E 0.99x
query52: Previous (528.6666666666666 ms) vs Current (493.3333333333333 ms) Diff 35 E2E 1.07x
query53: Previous (1395.3333333333333 ms) vs Current (956.3333333333334 ms) Diff 438 E2E 1.46x
query54: Previous (1741.6666666666667 ms) vs Current (1763.3333333333333 ms) Diff -21 E2E 0.99x
query55: Previous (475.3333333333333 ms) vs Current (471.0 ms) Diff 4 E2E 1.01x
query56: Previous (1192.0 ms) vs Current (1155.6666666666667 ms) Diff 36 E2E 1.03x
query57: Previous (2074.6666666666665 ms) vs Current (1961.6666666666667 ms) Diff 112 E2E 1.06x
query58: Previous (1713.6666666666667 ms) vs Current (1601.6666666666667 ms) Diff 112 E2E 1.07x
query59: Previous (2273.0 ms) vs Current (2261.3333333333335 ms) Diff 11 E2E 1.01x
query60: Previous (2155.0 ms) vs Current (2040.0 ms) Diff 115 E2E 1.06x
query61: Previous (1405.6666666666667 ms) vs Current (1265.6666666666667 ms) Diff 140 E2E 1.11x
query62: Previous (1573.6666666666667 ms) vs Current (1545.6666666666667 ms) Diff 28 E2E 1.02x
query63: Previous (1261.6666666666667 ms) vs Current (1374.3333333333333 ms) Diff -112 E2E 0.92x
query64: Previous (18739.333333333332 ms) vs Current (18136.0 ms) Diff 603 E2E 1.03x
query65: Previous (4060.6666666666665 ms) vs Current (3973.6666666666665 ms) Diff 87 E2E 1.02x
query66: Previous (5219.333333333333 ms) vs Current (5110.333333333333 ms) Diff 109 E2E 1.02x
query67: Previous (26947.0 ms) vs Current (26934.666666666668 ms) Diff 12 E2E 1.00x
query68: Previous (1568.3333333333333 ms) vs Current (1555.0 ms) Diff 13 E2E 1.01x
query69: Previous (1645.3333333333333 ms) vs Current (1650.0 ms) Diff -4 E2E 1.00x
query70: Previous (2375.6666666666665 ms) vs Current (2705.0 ms) Diff -329 E2E 0.88x
query71: Previous (3673.0 ms) vs Current (3724.0 ms) Diff -51 E2E 0.99x
query72: Previous (3622.3333333333335 ms) vs Current (3991.3333333333335 ms) Diff -369 E2E 0.91x
query73: Previous (1173.6666666666667 ms) vs Current (1055.3333333333333 ms) Diff 118 E2E 1.11x
query74: Previous (5551.666666666667 ms) vs Current (5652.333333333333 ms) Diff -100 E2E 0.98x
query75: Previous (7977.666666666667 ms) vs Current (7479.0 ms) Diff 498 E2E 1.07x
query76: Previous (3162.0 ms) vs Current (3141.3333333333335 ms) Diff 20 E2E 1.01x
query77: Previous (1804.6666666666667 ms) vs Current (1625.0 ms) Diff 179 E2E 1.11x
query78: Previous (10676.0 ms) vs Current (10498.0 ms) Diff 178 E2E 1.02x
query79: Previous (1681.0 ms) vs Current (2097.3333333333335 ms) Diff -416 E2E 0.80x
query80: Previous (4389.333333333333 ms) vs Current (3983.3333333333335 ms) Diff 405 E2E 1.10x
query81: Previous (2883.3333333333335 ms) vs Current (2792.0 ms) Diff 91 E2E 1.03x
query82: Previous (2568.0 ms) vs Current (2673.6666666666665 ms) Diff -105 E2E 0.96x
query83: Previous (11272.0 ms) vs Current (11555.0 ms) Diff -283 E2E 0.98x
query84: Previous (1752.0 ms) vs Current (1808.6666666666667 ms) Diff -56 E2E 0.97x
query85: Previous (2042.3333333333333 ms) vs Current (1974.3333333333333 ms) Diff 68 E2E 1.03x
query86: Previous (960.6666666666666 ms) vs Current (1087.0 ms) Diff -126 E2E 0.88x
query87: Previous (4850.333333333333 ms) vs Current (5534.0 ms) Diff -683 E2E 0.88x
query88: Previous (7409.666666666667 ms) vs Current (7733.666666666667 ms) Diff -324 E2E 0.96x
query89: Previous (1223.3333333333333 ms) vs Current (1153.6666666666667 ms) Diff 69 E2E 1.06x
query90: Previous (1035.0 ms) vs Current (1036.6666666666667 ms) Diff -1 E2E 1.00x
query91: Previous (1005.0 ms) vs Current (907.0 ms) Diff 98 E2E 1.11x
query92: Previous (1188.0 ms) vs Current (1196.6666666666667 ms) Diff -8 E2E 0.99x
query93: Previous (12937.333333333334 ms) vs Current (12793.666666666666 ms) Diff 143 E2E 1.01x
query94: Previous (5003.333333333333 ms) vs Current (4964.0 ms) Diff 39 E2E 1.01x
query95: Previous (8532.333333333334 ms) vs Current (8335.666666666666 ms) Diff 196 E2E 1.02x
query96: Previous (1324.0 ms) vs Current (1493.0 ms) Diff -169 E2E 0.89x
query97: Previous (2628.6666666666665 ms) vs Current (2459.6666666666665 ms) Diff 169 E2E 1.07x
query98: Previous (2663.6666666666665 ms) vs Current (2461.6666666666665 ms) Diff 202 E2E 1.08x
query99: Previous (2666.6666666666665 ms) vs Current (2565.3333333333335 ms) Diff 101 E2E 1.04x
benchmark: Previous (414000.0 ms) vs Current (412666.6666666667 ms) Diff 1333 E2E 1.00x

@abellina
Contributor

Sorry, the funny business is on my end. I also didn't call error_code.value() because I wanted to avoid the synchronization added by that pageable copy, so yes, with the error_code.value() synchronization we'd be fine as well, but we probably want to improve the documentation here for the future.

What if we change value() to value_sync()? Then the synchronization is explicit in the name.

Re-reading the comment, I think it makes sense now, so I am not sure we need to reword.

Hope we have enough data for this PR. If you need other tests let me know.

@vuule
Contributor Author

vuule commented Mar 1, 2024

No measurable performance impact on libcudf read_parquet benchmarks.
I'm cleaning up the PR and the kernel_error use, should be ready for review soon.

@vuule
Contributor Author

vuule commented Mar 1, 2024

I reworked the kernel_error API a bit to avoid redundant syncs, which appear to actually impact perf.
@nvdbaranec @etseidl please review the new API
@abellina the only change that impacts performance compared to the first draft is removal of stream.synchronize() after checking the error. The overhead should now be minimal.

@etseidl
Contributor

etseidl commented Mar 1, 2024

LGTM! Thanks!

@vuule vuule marked this pull request as ready for review March 4, 2024 19:54
@vuule vuule requested a review from a team as a code owner March 4, 2024 19:54
@vuule vuule requested a review from nvdbaranec March 5, 2024 18:54
*/
[[nodiscard]] std::string str() const
[[nodiscard]] static std::string to_string(value_type value)
{
Contributor

Instead of static, shouldn't this be a free function rather than a member function?

Contributor Author

It could be, but I like how it reads at the call site: kernel_error::to_string(error) and it does not really change much else about the function.
I'm not insisting on this option, so I'm open to suggestions :)

Contributor

@hyperbolic2346 left a comment

I typically don't like to see if (auto ret = foo(); ret == true) type constructs, but I'll give you a pass on this usage, @vuule

@vuule
Contributor Author

vuule commented Mar 5, 2024

I typically don't like to see if (auto ret = foo(); ret == true) type constructs, but I'll give you a pass on this usage, @vuule

Yeah, it would be pretty pointless if not for the repeated synchronization this avoids.

@vuule
Contributor Author

vuule commented Mar 5, 2024

/merge

@rapids-bot rapids-bot bot merged commit 3ea947a into rapidsai:branch-24.04 Mar 5, 2024
73 checks passed
@vuule vuule deleted the perf-hd_v-kernel_error branch March 5, 2024 20:53
7 participants