
Use hostdevice_vector in kernel_error to avoid the pageable copy #15140

Merged
merged 9 commits on Mar 5, 2024

Conversation

vuule
Contributor

@vuule vuule commented Feb 24, 2024

Description

Issue #15122

The addition of kernel error checking introduced a 5% performance regression in Spark-RAPIDS. It was determined that the pageable copy of the error back to the host caused this overhead, presumably because of contention on CUDA's bounce buffer used for pageable copies.

This PR aims to eliminate most of the error checking overhead by using hostdevice_vector in the kernel_error class. The hostdevice_vector uses pinned memory, so the device-to-host copy is no longer pageable. The PR also removes the redundant sync after we read the error.

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@vuule vuule added cuIO cuIO issue Performance Performance related issue Spark Functionality that helps Spark RAPIDS improvement Improvement / enhancement to an existing function non-breaking Non-breaking change labels Feb 24, 2024
@vuule vuule self-assigned this Feb 24, 2024
@github-actions github-actions bot added the libcudf Affects libcudf (C++/CUDA) code. label Feb 24, 2024
@GregoryKimball
Contributor

Thank you @vuule for opening this. Do you think the pageable copy could be the root cause of the performance difference Spark-RAPIDS observed with #14167 on A100? (Also note the tangentially related V100 issue #14415.)

@vuule
Contributor Author

vuule commented Feb 26, 2024

Do you think the pageable copy could be the root cause of the performance difference Spark-RAPIDS observed with #14167 on A100?

@abellina suspects this might be the case. It's possible that the shared bounce buffer used for pageable copies creates a bottleneck only in multi-threaded use cases, which would explain why the regression is specific to Spark. I opened the PR to enable @abellina to test that hypothesis.

@abellina
Contributor

@vuule @GregoryKimball we will review/test this, sorry for the delayed response.

@abellina
Contributor

@vuule I have done some testing with this, but I need a bit more time. So far it looks like the version with the error code checking plus the _stream.synchronize() is about 10 seconds slower than the version without them:

in reader_impl.cpp when we check the error code:

This:

  //if (error_code.value() != 0) {
  //  CUDF_FAIL("Parquet data decode failed with code(s) " + error_code.str());
  //}
  // error_code.value() is synchronous; explicitly sync here for better visibility
  //_stream.synchronize();

Is 10 seconds faster than:

  if (error_code.value() != 0) {
    CUDF_FAIL("Parquet data decode failed with code(s) " + error_code.str());
  }
  // error_code.value() is synchronous; explicitly sync here for better visibility
  _stream.synchronize();

Which is much better than what I had measured before (with the pageable copy we had a 20-second, or 5%, regression).

I also want to try skipping that last _stream.synchronize() and see if that's part of it. Even with this change, the pinned copy could be contending with a busy copy engine that is trying to handle large H2D copies to get data into the Parquet decode kernels, so it makes sense that there's unintended synchronization here.

@abellina
Contributor

abellina commented Feb 28, 2024

@vuule here are my updates so far:

  1. I ran our spark-rapids integration tests and found an unexpected test failure that led me to dig in. The issue is with removing the synchronize (your hostdevice_vector change is fine; we just don't want to remove the synchronize here: https://github.com/rapidsai/cudf/blob/branch-24.04/cpp/src/io/parquet/reader_impl.cpp#L251). It looks to be protecting the subpass null fixing that happens later, in https://github.com/rapidsai/cudf/blob/branch-24.04/cpp/src/io/parquet/reader_impl.cpp#L295. Posting here to make sure others agree; I am not entirely sure whether we expect some of that state to still be in flight at this stage. If so, we should probably adjust the comment above it.

  2. Assuming that a test without the sync is invalid, here are the results for checking error_code vs. not checking it, on top of a branch that has several optimizations (micro kernel PR, pinned pool, chunking optimization from Dave), while always keeping the synchronize I pointed out in (1):

    Not checking as the baseline vs pooled+pinned approach (overall ~3 seconds slower, noise IMHO)
    query1: Previous (1817.6666666666667 ms) vs Current (1760.6666666666667 ms) Diff 57 E2E 1.03x
    query2: Previous (2549.0 ms) vs Current (2553.3333333333335 ms) Diff -4 E2E 1.00x
    query3: Previous (668.6666666666666 ms) vs Current (690.0 ms) Diff -21 E2E 0.97x
    query4: Previous (13728.666666666666 ms) vs Current (13486.666666666666 ms) Diff 242 E2E 1.02x
    query5: Previous (2851.6666666666665 ms) vs Current (2786.6666666666665 ms) Diff 65 E2E 1.02x
    query6: Previous (1088.0 ms) vs Current (1245.3333333333333 ms) Diff -157 E2E 0.87x
    query7: Previous (1536.0 ms) vs Current (1461.0 ms) Diff 75 E2E 1.05x
    query8: Previous (1369.3333333333333 ms) vs Current (1390.3333333333333 ms) Diff -21 E2E 0.98x
    query9: Previous (6805.666666666667 ms) vs Current (7064.666666666667 ms) Diff -259 E2E 0.96x
    query10: Previous (1643.3333333333333 ms) vs Current (2533.6666666666665 ms) Diff -890 E2E 0.65x
    query11: Previous (6980.666666666667 ms) vs Current (6977.333333333333 ms) Diff 3 E2E 1.00x
    query12: Previous (790.0 ms) vs Current (1067.3333333333333 ms) Diff -277 E2E 0.74x
    query13: Previous (2043.0 ms) vs Current (2370.0 ms) Diff -327 E2E 0.86x
    query14_part1: Previous (8817.666666666666 ms) vs Current (8887.333333333334 ms) Diff -69 E2E 0.99x
    query14_part2: Previous (7198.666666666667 ms) vs Current (7580.0 ms) Diff -381 E2E 0.95x
    query15: Previous (1237.6666666666667 ms) vs Current (1445.3333333333333 ms) Diff -207 E2E 0.86x
    query16: Previous (9869.0 ms) vs Current (9954.0 ms) Diff -85 E2E 0.99x
    query17: Previous (2513.6666666666665 ms) vs Current (2261.0 ms) Diff 252 E2E 1.11x
    query18: Previous (4320.0 ms) vs Current (4429.0 ms) Diff -109 E2E 0.98x
    query19: Previous (1328.6666666666667 ms) vs Current (1783.3333333333333 ms) Diff -454 E2E 0.75x
    query20: Previous (835.3333333333334 ms) vs Current (696.6666666666666 ms) Diff 138 E2E 1.20x
    query21: Previous (503.0 ms) vs Current (506.3333333333333 ms) Diff -3 E2E 0.99x
    query22: Previous (1204.6666666666667 ms) vs Current (1172.6666666666667 ms) Diff 32 E2E 1.03x
    query23_part1: Previous (17476.666666666668 ms) vs Current (17624.0 ms) Diff -147 E2E 0.99x
    query23_part2: Previous (23662.0 ms) vs Current (23583.666666666668 ms) Diff 78 E2E 1.00x
    query24_part1: Previous (8408.666666666666 ms) vs Current (8234.333333333334 ms) Diff 174 E2E 1.02x
    query24_part2: Previous (8295.0 ms) vs Current (8625.666666666666 ms) Diff -330 E2E 0.96x
    query25: Previous (1959.3333333333333 ms) vs Current (1831.3333333333333 ms) Diff 128 E2E 1.07x
    query26: Previous (1651.6666666666667 ms) vs Current (1309.3333333333333 ms) Diff 342 E2E 1.26x
    query27: Previous (1693.6666666666667 ms) vs Current (1607.0 ms) Diff 86 E2E 1.05x
    query28: Previous (7103.0 ms) vs Current (7196.0 ms) Diff -93 E2E 0.99x
    query29: Previous (3470.3333333333335 ms) vs Current (3352.3333333333335 ms) Diff 118 E2E 1.04x
    query30: Previous (3395.0 ms) vs Current (3432.3333333333335 ms) Diff -37 E2E 0.99x
    query31: Previous (3170.6666666666665 ms) vs Current (3190.3333333333335 ms) Diff -19 E2E 0.99x
    query32: Previous (1783.6666666666667 ms) vs Current (1884.6666666666667 ms) Diff -101 E2E 0.95x
    query33: Previous (1560.3333333333333 ms) vs Current (1564.3333333333333 ms) Diff -4 E2E 1.00x
    query34: Previous (2351.6666666666665 ms) vs Current (2397.3333333333335 ms) Diff -45 E2E 0.98x
    query35: Previous (2431.6666666666665 ms) vs Current (2564.3333333333335 ms) Diff -132 E2E 0.95x
    query36: Previous (1705.6666666666667 ms) vs Current (1758.0 ms) Diff -52 E2E 0.97x
    query37: Previous (1485.0 ms) vs Current (1451.6666666666667 ms) Diff 33 E2E 1.02x
    query38: Previous (4119.666666666667 ms) vs Current (4177.333333333333 ms) Diff -57 E2E 0.99x
    query39_part1: Previous (1526.3333333333333 ms) vs Current (1491.0 ms) Diff 35 E2E 1.02x
    query39_part2: Previous (1244.0 ms) vs Current (1170.0 ms) Diff 74 E2E 1.06x
    query40: Previous (1706.0 ms) vs Current (1629.3333333333333 ms) Diff 76 E2E 1.05x
    query41: Previous (414.3333333333333 ms) vs Current (437.6666666666667 ms) Diff -23 E2E 0.95x
    query42: Previous (340.3333333333333 ms) vs Current (345.0 ms) Diff -4 E2E 0.99x
    query43: Previous (827.6666666666666 ms) vs Current (872.3333333333334 ms) Diff -44 E2E 0.95x
    query44: Previous (538.0 ms) vs Current (582.3333333333334 ms) Diff -44 E2E 0.92x
    query45: Previous (1118.6666666666667 ms) vs Current (1206.0 ms) Diff -87 E2E 0.93x
    query46: Previous (1755.3333333333333 ms) vs Current (2000.0 ms) Diff -244 E2E 0.88x
    query47: Previous (2244.3333333333335 ms) vs Current (2237.6666666666665 ms) Diff 6 E2E 1.00x
    query48: Previous (1035.6666666666667 ms) vs Current (1055.0 ms) Diff -19 E2E 0.98x
    query49: Previous (2834.0 ms) vs Current (3222.0 ms) Diff -388 E2E 0.88x
    query50: Previous (9709.333333333334 ms) vs Current (9730.0 ms) Diff -20 E2E 1.00x
    query51: Previous (2882.6666666666665 ms) vs Current (2956.3333333333335 ms) Diff -73 E2E 0.98x
    query52: Previous (453.3333333333333 ms) vs Current (473.3333333333333 ms) Diff -20 E2E 0.96x
    query53: Previous (838.6666666666666 ms) vs Current (869.3333333333334 ms) Diff -30 E2E 0.96x
    query54: Previous (1625.6666666666667 ms) vs Current (1795.6666666666667 ms) Diff -170 E2E 0.91x
    query55: Previous (448.6666666666667 ms) vs Current (441.0 ms) Diff 7 E2E 1.02x
    query56: Previous (1036.0 ms) vs Current (1093.6666666666667 ms) Diff -57 E2E 0.95x
    query57: Previous (2095.3333333333335 ms) vs Current (1909.0 ms) Diff 186 E2E 1.10x
    query58: Previous (1326.0 ms) vs Current (1398.0 ms) Diff -72 E2E 0.95x
    query59: Previous (2669.3333333333335 ms) vs Current (2130.6666666666665 ms) Diff 538 E2E 1.25x
    query60: Previous (1957.6666666666667 ms) vs Current (2025.6666666666667 ms) Diff -68 E2E 0.97x
    query61: Previous (1129.6666666666667 ms) vs Current (1188.0 ms) Diff -58 E2E 0.95x
    query62: Previous (1534.0 ms) vs Current (1535.0 ms) Diff -1 E2E 1.00x
    query63: Previous (1024.0 ms) vs Current (994.6666666666666 ms) Diff 29 E2E 1.03x
    query64: Previous (17846.0 ms) vs Current (18693.333333333332 ms) Diff -847 E2E 0.95x
    query65: Previous (4099.0 ms) vs Current (3879.6666666666665 ms) Diff 219 E2E 1.06x
    query66: Previous (4788.333333333333 ms) vs Current (4858.0 ms) Diff -69 E2E 0.99x
    query67: Previous (26821.666666666668 ms) vs Current (26586.666666666668 ms) Diff 235 E2E 1.01x
    query68: Previous (1419.3333333333333 ms) vs Current (1506.3333333333333 ms) Diff -87 E2E 0.94x
    query69: Previous (1475.6666666666667 ms) vs Current (1413.3333333333333 ms) Diff 62 E2E 1.04x
    query70: Previous (2310.0 ms) vs Current (2130.0 ms) Diff 180 E2E 1.08x
    query71: Previous (3797.6666666666665 ms) vs Current (3436.0 ms) Diff 361 E2E 1.11x
    query72: Previous (3469.0 ms) vs Current (3404.0 ms) Diff 65 E2E 1.02x
    query73: Previous (1002.6666666666666 ms) vs Current (978.0 ms) Diff 24 E2E 1.03x
    query74: Previous (5282.333333333333 ms) vs Current (5346.0 ms) Diff -63 E2E 0.99x
    query75: Previous (7436.333333333333 ms) vs Current (7329.666666666667 ms) Diff 106 E2E 1.01x
    query76: Previous (3003.6666666666665 ms) vs Current (3002.6666666666665 ms) Diff 1 E2E 1.00x
    query77: Previous (1577.6666666666667 ms) vs Current (1612.0 ms) Diff -34 E2E 0.98x
    query78: Previous (10374.666666666666 ms) vs Current (10545.666666666666 ms) Diff -171 E2E 0.98x
    query79: Previous (1587.0 ms) vs Current (1674.0 ms) Diff -87 E2E 0.95x
    query80: Previous (4167.0 ms) vs Current (4153.333333333333 ms) Diff 13 E2E 1.00x
    query81: Previous (2775.6666666666665 ms) vs Current (2792.3333333333335 ms) Diff -16 E2E 0.99x
    query82: Previous (2502.3333333333335 ms) vs Current (2510.3333333333335 ms) Diff -8 E2E 1.00x
    query83: Previous (11964.0 ms) vs Current (11808.666666666666 ms) Diff 155 E2E 1.01x
    query84: Previous (1756.3333333333333 ms) vs Current (1899.6666666666667 ms) Diff -143 E2E 0.92x
    query85: Previous (2163.3333333333335 ms) vs Current (2054.0 ms) Diff 109 E2E 1.05x
    query86: Previous (1393.3333333333333 ms) vs Current (948.0 ms) Diff 445 E2E 1.47x
    query87: Previous (4516.666666666667 ms) vs Current (4484.666666666667 ms) Diff 32 E2E 1.01x
    query88: Previous (6350.666666666667 ms) vs Current (6380.0 ms) Diff -29 E2E 1.00x
    query89: Previous (1077.6666666666667 ms) vs Current (1089.6666666666667 ms) Diff -12 E2E 0.99x
    query90: Previous (922.3333333333334 ms) vs Current (1442.0 ms) Diff -519 E2E 0.64x
    query91: Previous (831.6666666666666 ms) vs Current (855.0 ms) Diff -23 E2E 0.97x
    query92: Previous (1125.0 ms) vs Current (1072.0 ms) Diff 53 E2E 1.05x
    query93: Previous (12711.0 ms) vs Current (12922.333333333334 ms) Diff -211 E2E 0.98x
    query94: Previous (4993.666666666667 ms) vs Current (5130.0 ms) Diff -136 E2E 0.97x
    query95: Previous (8324.666666666666 ms) vs Current (8345.666666666666 ms) Diff -21 E2E 1.00x
    query96: Previous (1311.3333333333333 ms) vs Current (1380.0 ms) Diff -68 E2E 0.95x
    query97: Previous (2320.6666666666665 ms) vs Current (2329.3333333333335 ms) Diff -8 E2E 1.00x
    query98: Previous (2410.3333333333335 ms) vs Current (2406.0 ms) Diff 4 E2E 1.00x
    query99: Previous (2567.3333333333335 ms) vs Current (2483.6666666666665 ms) Diff 83 E2E 1.03x
    benchmark: Previous (395333.3333333333 ms) vs Current (398666.6666666667 ms) Diff -3333 E2E 0.99x
    

    Checking the error code the old way (pageable) vs. the pinned way that this PR allows. The pinned approach is significantly faster overall (15 seconds, or 4%), and query9, a heavy Parquet user, is for example 26% faster.

    Pageable as the baseline vs the pooled+pinned approach
    query1: Previous (2036.3333333333333 ms) vs Current (1760.6666666666667 ms) Diff 275 E2E 1.16x
    query2: Previous (2658.0 ms) vs Current (2553.3333333333335 ms) Diff 104 E2E 1.04x
    query3: Previous (777.6666666666666 ms) vs Current (690.0 ms) Diff 87 E2E 1.13x
    query4: Previous (13619.0 ms) vs Current (13486.666666666666 ms) Diff 132 E2E 1.01x
    query5: Previous (2868.6666666666665 ms) vs Current (2786.6666666666665 ms) Diff 82 E2E 1.03x
    query6: Previous (1238.0 ms) vs Current (1245.3333333333333 ms) Diff -7 E2E 0.99x
    query7: Previous (1586.3333333333333 ms) vs Current (1461.0 ms) Diff 125 E2E 1.09x
    query8: Previous (1365.0 ms) vs Current (1390.3333333333333 ms) Diff -25 E2E 0.98x
    query9: Previous (8893.666666666666 ms) vs Current (7064.666666666667 ms) Diff 1828 E2E 1.26x
    query10: Previous (1890.6666666666667 ms) vs Current (2533.6666666666665 ms) Diff -642 E2E 0.75x
    query11: Previous (7476.666666666667 ms) vs Current (6977.333333333333 ms) Diff 499 E2E 1.07x
    query12: Previous (874.3333333333334 ms) vs Current (1067.3333333333333 ms) Diff -192 E2E 0.82x
    query13: Previous (2479.3333333333335 ms) vs Current (2370.0 ms) Diff 109 E2E 1.05x
    query14_part1: Previous (9293.333333333334 ms) vs Current (8887.333333333334 ms) Diff 406 E2E 1.05x
    query14_part2: Previous (7501.666666666667 ms) vs Current (7580.0 ms) Diff -78 E2E 0.99x
    query15: Previous (1336.6666666666667 ms) vs Current (1445.3333333333333 ms) Diff -108 E2E 0.92x
    query16: Previous (10106.333333333334 ms) vs Current (9954.0 ms) Diff 152 E2E 1.02x
    query17: Previous (2467.6666666666665 ms) vs Current (2261.0 ms) Diff 206 E2E 1.09x
    query18: Previous (4462.333333333333 ms) vs Current (4429.0 ms) Diff 33 E2E 1.01x
    query19: Previous (1619.0 ms) vs Current (1783.3333333333333 ms) Diff -164 E2E 0.91x
    query20: Previous (706.3333333333334 ms) vs Current (696.6666666666666 ms) Diff 9 E2E 1.01x
    query21: Previous (520.6666666666666 ms) vs Current (506.3333333333333 ms) Diff 14 E2E 1.03x
    query22: Previous (1250.0 ms) vs Current (1172.6666666666667 ms) Diff 77 E2E 1.07x
    query23_part1: Previous (17595.666666666668 ms) vs Current (17624.0 ms) Diff -28 E2E 1.00x
    query23_part2: Previous (24149.666666666668 ms) vs Current (23583.666666666668 ms) Diff 566 E2E 1.02x
    query24_part1: Previous (8733.333333333334 ms) vs Current (8234.333333333334 ms) Diff 499 E2E 1.06x
    query24_part2: Previous (8806.333333333334 ms) vs Current (8625.666666666666 ms) Diff 180 E2E 1.02x
    query25: Previous (2005.0 ms) vs Current (1831.3333333333333 ms) Diff 173 E2E 1.09x
    query26: Previous (1447.3333333333333 ms) vs Current (1309.3333333333333 ms) Diff 138 E2E 1.11x
    query27: Previous (1747.3333333333333 ms) vs Current (1607.0 ms) Diff 140 E2E 1.09x
    query28: Previous (8014.666666666667 ms) vs Current (7196.0 ms) Diff 818 E2E 1.11x
    query29: Previous (3444.3333333333335 ms) vs Current (3352.3333333333335 ms) Diff 92 E2E 1.03x
    query30: Previous (3492.3333333333335 ms) vs Current (3432.3333333333335 ms) Diff 60 E2E 1.02x
    query31: Previous (3596.3333333333335 ms) vs Current (3190.3333333333335 ms) Diff 406 E2E 1.13x
    query32: Previous (1913.6666666666667 ms) vs Current (1884.6666666666667 ms) Diff 29 E2E 1.02x
    query33: Previous (1593.3333333333333 ms) vs Current (1564.3333333333333 ms) Diff 29 E2E 1.02x
    query34: Previous (2539.0 ms) vs Current (2397.3333333333335 ms) Diff 141 E2E 1.06x
    query35: Previous (2964.6666666666665 ms) vs Current (2564.3333333333335 ms) Diff 400 E2E 1.16x
    query36: Previous (1872.3333333333333 ms) vs Current (1758.0 ms) Diff 114 E2E 1.07x
    query37: Previous (1487.3333333333333 ms) vs Current (1451.6666666666667 ms) Diff 35 E2E 1.02x
    query38: Previous (4559.666666666667 ms) vs Current (4177.333333333333 ms) Diff 382 E2E 1.09x
    query39_part1: Previous (1547.6666666666667 ms) vs Current (1491.0 ms) Diff 56 E2E 1.04x
    query39_part2: Previous (1297.3333333333333 ms) vs Current (1170.0 ms) Diff 127 E2E 1.11x
    query40: Previous (2219.3333333333335 ms) vs Current (1629.3333333333333 ms) Diff 590 E2E 1.36x
    query41: Previous (448.0 ms) vs Current (437.6666666666667 ms) Diff 10 E2E 1.02x
    query42: Previous (411.6666666666667 ms) vs Current (345.0 ms) Diff 66 E2E 1.19x
    query43: Previous (1002.3333333333334 ms) vs Current (872.3333333333334 ms) Diff 130 E2E 1.15x
    query44: Previous (557.6666666666666 ms) vs Current (582.3333333333334 ms) Diff -24 E2E 0.96x
    query45: Previous (1201.6666666666667 ms) vs Current (1206.0 ms) Diff -4 E2E 1.00x
    query46: Previous (1886.6666666666667 ms) vs Current (2000.0 ms) Diff -113 E2E 0.94x
    query47: Previous (2415.0 ms) vs Current (2237.6666666666665 ms) Diff 177 E2E 1.08x
    query48: Previous (1178.3333333333333 ms) vs Current (1055.0 ms) Diff 123 E2E 1.12x
    query49: Previous (2992.0 ms) vs Current (3222.0 ms) Diff -230 E2E 0.93x
    query50: Previous (9786.333333333334 ms) vs Current (9730.0 ms) Diff 56 E2E 1.01x
    query51: Previous (3071.6666666666665 ms) vs Current (2956.3333333333335 ms) Diff 115 E2E 1.04x
    query52: Previous (528.6666666666666 ms) vs Current (473.3333333333333 ms) Diff 55 E2E 1.12x
    query53: Previous (1395.3333333333333 ms) vs Current (869.3333333333334 ms) Diff 525 E2E 1.61x
    query54: Previous (1741.6666666666667 ms) vs Current (1795.6666666666667 ms) Diff -54 E2E 0.97x
    query55: Previous (475.3333333333333 ms) vs Current (441.0 ms) Diff 34 E2E 1.08x
    query56: Previous (1192.0 ms) vs Current (1093.6666666666667 ms) Diff 98 E2E 1.09x
    query57: Previous (2074.6666666666665 ms) vs Current (1909.0 ms) Diff 165 E2E 1.09x
    query58: Previous (1713.6666666666667 ms) vs Current (1398.0 ms) Diff 315 E2E 1.23x
    query59: Previous (2273.0 ms) vs Current (2130.6666666666665 ms) Diff 142 E2E 1.07x
    query60: Previous (2155.0 ms) vs Current (2025.6666666666667 ms) Diff 129 E2E 1.06x
    query61: Previous (1405.6666666666667 ms) vs Current (1188.0 ms) Diff 217 E2E 1.18x
    query62: Previous (1573.6666666666667 ms) vs Current (1535.0 ms) Diff 38 E2E 1.03x
    query63: Previous (1261.6666666666667 ms) vs Current (994.6666666666666 ms) Diff 267 E2E 1.27x
    query64: Previous (18739.333333333332 ms) vs Current (18693.333333333332 ms) Diff 46 E2E 1.00x
    query65: Previous (4060.6666666666665 ms) vs Current (3879.6666666666665 ms) Diff 181 E2E 1.05x
    query66: Previous (5219.333333333333 ms) vs Current (4858.0 ms) Diff 361 E2E 1.07x
    query67: Previous (26947.0 ms) vs Current (26586.666666666668 ms) Diff 360 E2E 1.01x
    query68: Previous (1568.3333333333333 ms) vs Current (1506.3333333333333 ms) Diff 62 E2E 1.04x
    query69: Previous (1645.3333333333333 ms) vs Current (1413.3333333333333 ms) Diff 232 E2E 1.16x
    query70: Previous (2375.6666666666665 ms) vs Current (2130.0 ms) Diff 245 E2E 1.12x
    query71: Previous (3673.0 ms) vs Current (3436.0 ms) Diff 237 E2E 1.07x
    query72: Previous (3622.3333333333335 ms) vs Current (3404.0 ms) Diff 218 E2E 1.06x
    query73: Previous (1173.6666666666667 ms) vs Current (978.0 ms) Diff 195 E2E 1.20x
    query74: Previous (5551.666666666667 ms) vs Current (5346.0 ms) Diff 205 E2E 1.04x
    query75: Previous (7977.666666666667 ms) vs Current (7329.666666666667 ms) Diff 648 E2E 1.09x
    query76: Previous (3162.0 ms) vs Current (3002.6666666666665 ms) Diff 159 E2E 1.05x
    query77: Previous (1804.6666666666667 ms) vs Current (1612.0 ms) Diff 192 E2E 1.12x
    query78: Previous (10676.0 ms) vs Current (10545.666666666666 ms) Diff 130 E2E 1.01x
    query79: Previous (1681.0 ms) vs Current (1674.0 ms) Diff 7 E2E 1.00x
    query80: Previous (4389.333333333333 ms) vs Current (4153.333333333333 ms) Diff 236 E2E 1.06x
    query81: Previous (2883.3333333333335 ms) vs Current (2792.3333333333335 ms) Diff 91 E2E 1.03x
    query82: Previous (2568.0 ms) vs Current (2510.3333333333335 ms) Diff 57 E2E 1.02x
    query83: Previous (11272.0 ms) vs Current (11808.666666666666 ms) Diff -536 E2E 0.95x
    query84: Previous (1752.0 ms) vs Current (1899.6666666666667 ms) Diff -147 E2E 0.92x
    query85: Previous (2042.3333333333333 ms) vs Current (2054.0 ms) Diff -11 E2E 0.99x
    query86: Previous (960.6666666666666 ms) vs Current (948.0 ms) Diff 12 E2E 1.01x
    query87: Previous (4850.333333333333 ms) vs Current (4484.666666666667 ms) Diff 365 E2E 1.08x
    query88: Previous (7409.666666666667 ms) vs Current (6380.0 ms) Diff 1029 E2E 1.16x
    query89: Previous (1223.3333333333333 ms) vs Current (1089.6666666666667 ms) Diff 133 E2E 1.12x
    query90: Previous (1035.0 ms) vs Current (1442.0 ms) Diff -407 E2E 0.72x
    query91: Previous (1005.0 ms) vs Current (855.0 ms) Diff 150 E2E 1.18x
    query92: Previous (1188.0 ms) vs Current (1072.0 ms) Diff 116 E2E 1.11x
    query93: Previous (12937.333333333334 ms) vs Current (12922.333333333334 ms) Diff 15 E2E 1.00x
    query94: Previous (5003.333333333333 ms) vs Current (5130.0 ms) Diff -126 E2E 0.98x
    query95: Previous (8532.333333333334 ms) vs Current (8345.666666666666 ms) Diff 186 E2E 1.02x
    query96: Previous (1324.0 ms) vs Current (1380.0 ms) Diff -56 E2E 0.96x
    query97: Previous (2628.6666666666665 ms) vs Current (2329.3333333333335 ms) Diff 299 E2E 1.13x
    query98: Previous (2663.6666666666665 ms) vs Current (2406.0 ms) Diff 257 E2E 1.11x
    query99: Previous (2666.6666666666665 ms) vs Current (2483.6666666666665 ms) Diff 183 E2E 1.07x
    benchmark: Previous (414000.0 ms) vs Current (398666.6666666667 ms) Diff 15333 E2E 1.04x
    
  3. As you mentioned before, only Spark will have a pool-backed hostdevice_vector so far, so all other users will need to either set up a pool or somehow fall back to a pageable version of error_code again? The cudaHostAlloc + cudaFreeHost calls for 4 bytes may add up quickly, especially in multi-threaded cases, but I haven't benchmarked this scenario yet.

@nvdbaranec
Contributor

  1. I ran our spark-rapids integration tests and found an unexpected test failure that led me to dig in. The issue is with removing the synchronize (your hostdevice_vector change is fine; we just don't want to remove the synchronize here: https://github.com/rapidsai/cudf/blob/branch-24.04/cpp/src/io/parquet/reader_impl.cpp#L251). It looks to be protecting the subpass null fixing that happens later, in https://github.com/rapidsai/cudf/blob/branch-24.04/cpp/src/io/parquet/reader_impl.cpp#L295. Posting here to make sure others agree; I am not entirely sure whether we expect some of that state to still be in flight at this stage. If so, we should probably adjust the comment above it.

Correct. The synchronization there is needed because the null count code needs the updated data from the page and nesting info buffers that get computed in the kernel. The .device_to_host_async() calls just above the original error_code check kick those copies off.

@etseidl
Contributor

etseidl commented Feb 28, 2024

  1. I ran our spark-rapids integration tests and found an unexpected test failure that led me to dig in. The issue is with removing the synchronize (your hostdevice_vector change is fine; we just don't want to remove the synchronize here: https://github.com/rapidsai/cudf/blob/branch-24.04/cpp/src/io/parquet/reader_impl.cpp#L251). It looks to be protecting the subpass null fixing that happens later, in https://github.com/rapidsai/cudf/blob/branch-24.04/cpp/src/io/parquet/reader_impl.cpp#L295. Posting here to make sure others agree; I am not entirely sure whether we expect some of that state to still be in flight at this stage. If so, we should probably adjust the comment above it.

Correct. The synchronization there is needed because the null count code needs the updated data from the page and nesting info buffers that get computed in the kernel. The .device_to_host_async() calls just above the original error_code check kick those copies off.

But the call to error_code.value() explicitly calls synchronize() on _stream, so I really don't understand why a second sync on the same stream immediately after is necessary, unless there's some funny business with copying cuda_stream_views.

@abellina
Contributor

  1. I ran our spark-rapids integration tests and found an unexpected test failure that led me to dig in. The issue is with removing the synchronize (your hostdevice_vector change is fine; we just don't want to remove the synchronize here: https://github.com/rapidsai/cudf/blob/branch-24.04/cpp/src/io/parquet/reader_impl.cpp#L251). It looks to be protecting the subpass null fixing that happens later, in https://github.com/rapidsai/cudf/blob/branch-24.04/cpp/src/io/parquet/reader_impl.cpp#L295. Posting here to make sure others agree; I am not entirely sure whether we expect some of that state to still be in flight at this stage. If so, we should probably adjust the comment above it.

Correct. The synchronization there is needed because the null count code needs the updated data from the page and nesting info buffers that get computed in the kernel. The .device_to_host_async() calls just above the original error_code check kick those copies off.

But the call to error_code.value() explicitly calls synchronize() on _stream, so I really don't understand why a second sync on the same stream immediately after is necessary, unless there's some funny business with copying cuda_stream_views.

Sorry, the funny business is on my end. I also didn't call error_code.value(), because I wanted to remove the synchronization added by that pageable copy. So yes, with the error_code.value() synchronization we'd be fine as well, but we probably want to improve the documentation here for the future.

@etseidl
Contributor

etseidl commented Feb 28, 2024

Sorry, the funny business is on my end. I also didn't call error_code.value(), because I wanted to remove the synchronization added by that pageable copy. So yes, with the error_code.value() synchronization we'd be fine as well, but we probably want to improve the documentation here for the future.

What if we change value() to value_sync()? Then the synchronization is explicit in the name.

@abellina
Contributor

Got an extra data point by using hostdevice_vector without any pooling, so this case goes through cudaMallocHost and cudaFreeHost for the 4 bytes:

Comparing it with a baseline that is the pageable version of this (the code before this PR), it's essentially unchanged overall. Some queries are faster, and some are slower. Query 9 is 9% faster.

Details:

query1: Previous (2036.3333333333333 ms) vs Current (1887.6666666666667 ms) Diff 148 E2E 1.08x
query2: Previous (2658.0 ms) vs Current (2952.3333333333335 ms) Diff -294 E2E 0.90x
query3: Previous (777.6666666666666 ms) vs Current (761.3333333333334 ms) Diff 16 E2E 1.02x
query4: Previous (13619.0 ms) vs Current (13861.333333333334 ms) Diff -242 E2E 0.98x
query5: Previous (2868.6666666666665 ms) vs Current (2735.6666666666665 ms) Diff 133 E2E 1.05x
query6: Previous (1238.0 ms) vs Current (1198.3333333333333 ms) Diff 39 E2E 1.03x
query7: Previous (1586.3333333333333 ms) vs Current (1625.6666666666667 ms) Diff -39 E2E 0.98x
query8: Previous (1365.0 ms) vs Current (1405.0 ms) Diff -40 E2E 0.97x
query9: Previous (8893.666666666666 ms) vs Current (8147.666666666667 ms) Diff 745 E2E 1.09x
query10: Previous (1890.6666666666667 ms) vs Current (2038.3333333333333 ms) Diff -147 E2E 0.93x
query11: Previous (7476.666666666667 ms) vs Current (7891.0 ms) Diff -414 E2E 0.95x
query12: Previous (874.3333333333334 ms) vs Current (819.3333333333334 ms) Diff 55 E2E 1.07x
query13: Previous (2479.3333333333335 ms) vs Current (2219.3333333333335 ms) Diff 260 E2E 1.12x
query14_part1: Previous (9293.333333333334 ms) vs Current (9385.333333333334 ms) Diff -92 E2E 0.99x
query14_part2: Previous (7501.666666666667 ms) vs Current (7342.333333333333 ms) Diff 159 E2E 1.02x
query15: Previous (1336.6666666666667 ms) vs Current (1307.6666666666667 ms) Diff 29 E2E 1.02x
query16: Previous (10106.333333333334 ms) vs Current (9841.333333333334 ms) Diff 265 E2E 1.03x
query17: Previous (2467.6666666666665 ms) vs Current (2595.6666666666665 ms) Diff -128 E2E 0.95x
query18: Previous (4462.333333333333 ms) vs Current (4460.333333333333 ms) Diff 2 E2E 1.00x
query19: Previous (1619.0 ms) vs Current (1640.3333333333333 ms) Diff -21 E2E 0.99x
query20: Previous (706.3333333333334 ms) vs Current (742.3333333333334 ms) Diff -36 E2E 0.95x
query21: Previous (520.6666666666666 ms) vs Current (577.6666666666666 ms) Diff -57 E2E 0.90x
query22: Previous (1250.0 ms) vs Current (1257.0 ms) Diff -7 E2E 0.99x
query23_part1: Previous (17595.666666666668 ms) vs Current (18077.333333333332 ms) Diff -481 E2E 0.97x
query23_part2: Previous (24149.666666666668 ms) vs Current (23993.666666666668 ms) Diff 156 E2E 1.01x
query24_part1: Previous (8733.333333333334 ms) vs Current (8505.666666666666 ms) Diff 227 E2E 1.03x
query24_part2: Previous (8806.333333333334 ms) vs Current (8578.0 ms) Diff 228 E2E 1.03x
query25: Previous (2005.0 ms) vs Current (1961.0 ms) Diff 44 E2E 1.02x
query26: Previous (1447.3333333333333 ms) vs Current (1436.3333333333333 ms) Diff 11 E2E 1.01x
query27: Previous (1747.3333333333333 ms) vs Current (1779.0 ms) Diff -31 E2E 0.98x
query28: Previous (8014.666666666667 ms) vs Current (8848.0 ms) Diff -833 E2E 0.91x
query29: Previous (3444.3333333333335 ms) vs Current (3466.3333333333335 ms) Diff -22 E2E 0.99x
query30: Previous (3492.3333333333335 ms) vs Current (3882.6666666666665 ms) Diff -390 E2E 0.90x
query31: Previous (3596.3333333333335 ms) vs Current (3594.3333333333335 ms) Diff 2 E2E 1.00x
query32: Previous (1913.6666666666667 ms) vs Current (1850.3333333333333 ms) Diff 63 E2E 1.03x
query33: Previous (1593.3333333333333 ms) vs Current (1595.3333333333333 ms) Diff -2 E2E 1.00x
query34: Previous (2539.0 ms) vs Current (2689.0 ms) Diff -150 E2E 0.94x
query35: Previous (2964.6666666666665 ms) vs Current (2799.6666666666665 ms) Diff 165 E2E 1.06x
query36: Previous (1872.3333333333333 ms) vs Current (1691.0 ms) Diff 181 E2E 1.11x
query37: Previous (1487.3333333333333 ms) vs Current (1526.6666666666667 ms) Diff -39 E2E 0.97x
query38: Previous (4559.666666666667 ms) vs Current (4462.0 ms) Diff 97 E2E 1.02x
query39_part1: Previous (1547.6666666666667 ms) vs Current (1521.0 ms) Diff 26 E2E 1.02x
query39_part2: Previous (1297.3333333333333 ms) vs Current (1221.6666666666667 ms) Diff 75 E2E 1.06x
query40: Previous (2219.3333333333335 ms) vs Current (1734.6666666666667 ms) Diff 484 E2E 1.28x
query41: Previous (448.0 ms) vs Current (420.6666666666667 ms) Diff 27 E2E 1.06x
query42: Previous (411.6666666666667 ms) vs Current (392.6666666666667 ms) Diff 19 E2E 1.05x
query43: Previous (1002.3333333333334 ms) vs Current (962.6666666666666 ms) Diff 39 E2E 1.04x
query44: Previous (557.6666666666666 ms) vs Current (564.3333333333334 ms) Diff -6 E2E 0.99x
query45: Previous (1201.6666666666667 ms) vs Current (1165.3333333333333 ms) Diff 36 E2E 1.03x
query46: Previous (1886.6666666666667 ms) vs Current (1999.6666666666667 ms) Diff -113 E2E 0.94x
query47: Previous (2415.0 ms) vs Current (2421.6666666666665 ms) Diff -6 E2E 1.00x
query48: Previous (1178.3333333333333 ms) vs Current (1189.6666666666667 ms) Diff -11 E2E 0.99x
query49: Previous (2992.0 ms) vs Current (3087.0 ms) Diff -95 E2E 0.97x
query50: Previous (9786.333333333334 ms) vs Current (9685.0 ms) Diff 101 E2E 1.01x
query51: Previous (3071.6666666666665 ms) vs Current (3105.0 ms) Diff -33 E2E 0.99x
query52: Previous (528.6666666666666 ms) vs Current (493.3333333333333 ms) Diff 35 E2E 1.07x
query53: Previous (1395.3333333333333 ms) vs Current (956.3333333333334 ms) Diff 438 E2E 1.46x
query54: Previous (1741.6666666666667 ms) vs Current (1763.3333333333333 ms) Diff -21 E2E 0.99x
query55: Previous (475.3333333333333 ms) vs Current (471.0 ms) Diff 4 E2E 1.01x
query56: Previous (1192.0 ms) vs Current (1155.6666666666667 ms) Diff 36 E2E 1.03x
query57: Previous (2074.6666666666665 ms) vs Current (1961.6666666666667 ms) Diff 112 E2E 1.06x
query58: Previous (1713.6666666666667 ms) vs Current (1601.6666666666667 ms) Diff 112 E2E 1.07x
query59: Previous (2273.0 ms) vs Current (2261.3333333333335 ms) Diff 11 E2E 1.01x
query60: Previous (2155.0 ms) vs Current (2040.0 ms) Diff 115 E2E 1.06x
query61: Previous (1405.6666666666667 ms) vs Current (1265.6666666666667 ms) Diff 140 E2E 1.11x
query62: Previous (1573.6666666666667 ms) vs Current (1545.6666666666667 ms) Diff 28 E2E 1.02x
query63: Previous (1261.6666666666667 ms) vs Current (1374.3333333333333 ms) Diff -112 E2E 0.92x
query64: Previous (18739.333333333332 ms) vs Current (18136.0 ms) Diff 603 E2E 1.03x
query65: Previous (4060.6666666666665 ms) vs Current (3973.6666666666665 ms) Diff 87 E2E 1.02x
query66: Previous (5219.333333333333 ms) vs Current (5110.333333333333 ms) Diff 109 E2E 1.02x
query67: Previous (26947.0 ms) vs Current (26934.666666666668 ms) Diff 12 E2E 1.00x
query68: Previous (1568.3333333333333 ms) vs Current (1555.0 ms) Diff 13 E2E 1.01x
query69: Previous (1645.3333333333333 ms) vs Current (1650.0 ms) Diff -4 E2E 1.00x
query70: Previous (2375.6666666666665 ms) vs Current (2705.0 ms) Diff -329 E2E 0.88x
query71: Previous (3673.0 ms) vs Current (3724.0 ms) Diff -51 E2E 0.99x
query72: Previous (3622.3333333333335 ms) vs Current (3991.3333333333335 ms) Diff -369 E2E 0.91x
query73: Previous (1173.6666666666667 ms) vs Current (1055.3333333333333 ms) Diff 118 E2E 1.11x
query74: Previous (5551.666666666667 ms) vs Current (5652.333333333333 ms) Diff -100 E2E 0.98x
query75: Previous (7977.666666666667 ms) vs Current (7479.0 ms) Diff 498 E2E 1.07x
query76: Previous (3162.0 ms) vs Current (3141.3333333333335 ms) Diff 20 E2E 1.01x
query77: Previous (1804.6666666666667 ms) vs Current (1625.0 ms) Diff 179 E2E 1.11x
query78: Previous (10676.0 ms) vs Current (10498.0 ms) Diff 178 E2E 1.02x
query79: Previous (1681.0 ms) vs Current (2097.3333333333335 ms) Diff -416 E2E 0.80x
query80: Previous (4389.333333333333 ms) vs Current (3983.3333333333335 ms) Diff 405 E2E 1.10x
query81: Previous (2883.3333333333335 ms) vs Current (2792.0 ms) Diff 91 E2E 1.03x
query82: Previous (2568.0 ms) vs Current (2673.6666666666665 ms) Diff -105 E2E 0.96x
query83: Previous (11272.0 ms) vs Current (11555.0 ms) Diff -283 E2E 0.98x
query84: Previous (1752.0 ms) vs Current (1808.6666666666667 ms) Diff -56 E2E 0.97x
query85: Previous (2042.3333333333333 ms) vs Current (1974.3333333333333 ms) Diff 68 E2E 1.03x
query86: Previous (960.6666666666666 ms) vs Current (1087.0 ms) Diff -126 E2E 0.88x
query87: Previous (4850.333333333333 ms) vs Current (5534.0 ms) Diff -683 E2E 0.88x
query88: Previous (7409.666666666667 ms) vs Current (7733.666666666667 ms) Diff -324 E2E 0.96x
query89: Previous (1223.3333333333333 ms) vs Current (1153.6666666666667 ms) Diff 69 E2E 1.06x
query90: Previous (1035.0 ms) vs Current (1036.6666666666667 ms) Diff -1 E2E 1.00x
query91: Previous (1005.0 ms) vs Current (907.0 ms) Diff 98 E2E 1.11x
query92: Previous (1188.0 ms) vs Current (1196.6666666666667 ms) Diff -8 E2E 0.99x
query93: Previous (12937.333333333334 ms) vs Current (12793.666666666666 ms) Diff 143 E2E 1.01x
query94: Previous (5003.333333333333 ms) vs Current (4964.0 ms) Diff 39 E2E 1.01x
query95: Previous (8532.333333333334 ms) vs Current (8335.666666666666 ms) Diff 196 E2E 1.02x
query96: Previous (1324.0 ms) vs Current (1493.0 ms) Diff -169 E2E 0.89x
query97: Previous (2628.6666666666665 ms) vs Current (2459.6666666666665 ms) Diff 169 E2E 1.07x
query98: Previous (2663.6666666666665 ms) vs Current (2461.6666666666665 ms) Diff 202 E2E 1.08x
query99: Previous (2666.6666666666665 ms) vs Current (2565.3333333333335 ms) Diff 101 E2E 1.04x
benchmark: Previous (414000.0 ms) vs Current (412666.6666666667 ms) Diff 1333 E2E 1.00x

@abellina
Contributor

Sorry, the funny business is on my end. I also didn't call error_code.value() because I wanted to avoid the synchronization added by that pageable copy, so yes, with the error_code.value() synchronization we'd be fine as well, but we probably want to improve the documentation here for the future.

What if we change value() to value_sync()? Then the synchronization is explicit in the name.

Re-reading the comment, I think it makes sense now, so I am not sure we need to reword.

Hope we have enough data for this PR. If you need other tests let me know.

@vuule
Contributor Author

vuule commented Mar 1, 2024

No measurable performance impact on libcudf read_parquet benchmarks.
I'm cleaning up the PR and the kernel_error use, should be ready for review soon.

@vuule
Contributor Author

vuule commented Mar 1, 2024

I reworked the kernel_error API a bit to avoid redundant syncs, which appear to actually impact perf.
@nvdbaranec @etseidl please review the new API
@abellina the only change that impacts performance compared to the first draft is removal of stream.synchronize() after checking the error. The overhead should now be minimal.

@etseidl
Contributor

etseidl commented Mar 1, 2024

LGTM! Thanks!

@vuule vuule marked this pull request as ready for review March 4, 2024 19:54
@vuule vuule requested a review from a team as a code owner March 4, 2024 19:54
@vuule vuule requested a review from nvdbaranec March 5, 2024 18:54
*/
[[nodiscard]] std::string str() const
[[nodiscard]] static std::string to_string(value_type value)
{
Contributor

Instead of static, shouldn't this be a free function rather than a member function?

Contributor Author

It could be, but I like how it reads at the call site: kernel_error::to_string(error) and it does not really change much else about the function.
I'm not insisting on this option, so I'm open to suggestions :)

Contributor

@hyperbolic2346 left a comment

I typically don't like to see if (auto ret = foo(); ret == true) type constructs, but I'll give you a pass on this usage, @vuule

@vuule
Contributor Author

vuule commented Mar 5, 2024

I typically don't like to see if (auto ret = foo(); ret == true) type constructs, but I'll give you a pass on this usage, @vuule

Yeah, it would be pretty pointless if not for the repeated synchronization this avoids.

@vuule
Contributor Author

vuule commented Mar 5, 2024

/merge

@rapids-bot rapids-bot bot merged commit 3ea947a into rapidsai:branch-24.04 Mar 5, 2024
73 checks passed
@vuule vuule deleted the perf-hd_v-kernel_error branch March 5, 2024 20:53
7 participants