Optimize has_duplicate_edges (rapidsai#2409)
This PR fixes the scalability of the duplicate-edge check by removing `apply`, which does serial processing.

**Benchmark Data**

```python
n_nodes = 100_000
n_rows = 1_500_000
df = cudf.DataFrame({'src': cp.random.randint(0, n_nodes, n_rows),
                     'dst': cp.random.randint(0, n_nodes, n_rows)})
```

### After PR:

```python
17.8 ms ± 536 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
```

### Before PR:

```python
26.3 s ± 78.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
```

Authors:
- Vibhu Jawa (https://github.com/VibhuJawa)

Approvers:
- Brad Rees (https://github.com/BradReesWork)
- Rick Ratzel (https://github.com/rlratzel)

URL: rapidsai#2409
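A minimal sketch of the vectorized approach the PR describes: compare the row count before and after `drop_duplicates` instead of iterating with `apply`. This is an illustration, not the cuGraph source; it uses pandas (whose `DataFrame` API cudf mirrors on the GPU), and the `src`/`dst` column names are taken from the benchmark above.

```python
import pandas as pd  # cudf.DataFrame exposes the same drop_duplicates API

def has_duplicate_edges(df):
    # Vectorized check: if dropping duplicate (src, dst) pairs shrinks the
    # frame, at least one edge appeared more than once. No per-row apply.
    return len(df) != len(df.drop_duplicates(subset=['src', 'dst']))

edges = pd.DataFrame({'src': [0, 1, 0], 'dst': [1, 2, 1]})
print(has_duplicate_edges(edges))  # (0, 1) appears twice, so True
```

Because `drop_duplicates` runs as a single hash-based pass over the columns, the cost scales with the number of rows rather than paying Python-level per-row overhead, which accounts for the ms-vs-seconds gap in the benchmark.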