[REVIEW] DataFrame `insert` and creation optimizations #10285

galipremsagar · 2022-02-14T20:05:45Z

This PR removes double index equality & reindexing checks in DataFrame construction and insert code-flows.
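
As a rough illustration of the idea (not the actual cuDF code), the sketch below compares each input's index with the target index once and reindexes only when they differ. `ToySeries`, `align_to`, and `build_columns` are hypothetical names, and using the first series' index as the frame index is a simplifying assumption.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class ToySeries:
    """Hypothetical stand-in for a Series: index labels plus values."""
    index: List[float]
    values: List[Optional[int]]


def align_to(series: ToySeries, target_index: List[float]) -> ToySeries:
    """Reindex `series` onto `target_index`; missing labels become None."""
    lookup = dict(zip(series.index, series.values))
    return ToySeries(list(target_index), [lookup.get(label) for label in target_index])


def build_columns(data: Dict[str, ToySeries]) -> Dict[str, List[Optional[int]]]:
    """Assemble columns, comparing each index with the target exactly once."""
    columns: Dict[str, List[Optional[int]]] = {}
    target_index = None
    for name, series in data.items():
        if target_index is None:
            # Simplification: the first series defines the frame index.
            target_index = series.index
        elif series.index != target_index:
            # Reindex only when the single equality check fails.
            series = align_to(series, target_index)
        columns[name] = series.values
    return columns


s1 = ToySeries([0.0, 1.0, 2.0], [1, 2, 3])
s2 = ToySeries([0.0, 1.0, 2.0], [2, 3, 3])
s3 = ToySeries([2.0, 1.0, 0.0], [12, 11, 10])
columns = build_columns({"a": s1, "b": s2, "c": s3})
```

Here `s2` shares the target index, so its values are reused without reindexing; only `s3` goes through `align_to`.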

scalar_broadcast_to, a 2.2x speedup for str dtype; performance for numeric types remains the same:

# THIS PR

In [12]: %timeit cudf.utils.utils.scalar_broadcast_to('abc', 400000000, "str")
100 ms ± 208 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

# branch-22.04
In [25]: %timeit cudf.utils.utils.scalar_broadcast_to('abc', 400000000, "str")
227 ms ± 10 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
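
To collect similar numbers outside IPython, a rough equivalent of the `%timeit` summary can be built from the standard library. This is a sketch: `time_per_loop` and `broadcast_stand_in` are hypothetical names, and the CPU workload is a placeholder to be replaced by the cuDF call being measured.

```python
import statistics
import timeit


def time_per_loop(func, runs=7, loops=10):
    """Return (mean, std dev) of seconds per loop over `runs` runs of `loops` calls."""
    totals = timeit.repeat(func, repeat=runs, number=loops)
    per_loop = [total / loops for total in totals]
    return statistics.mean(per_loop), statistics.pstdev(per_loop)


def broadcast_stand_in():
    # Hypothetical CPU placeholder; swap in the call under test.
    return ["abc"] * 100_000


mean_s, std_s = time_per_loop(broadcast_stand_in)
print(f"{mean_s * 1e3:.3f} ms ± {std_s * 1e3:.3f} ms per loop")
```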

assign, a 1.2x-1.4x speedup:

In [9]: import numpy as np

In [10]: df = cudf.DataFrame(
    ...:     {
    ...:         "a": [1, 2, 3, np.nan] * 10000000,
    ...:         "b": ["a", "b", "c", "d"] * 10000000,
    ...:         "c": [0.0, 0.12, np.nan, 10.12] * 10000000,
    ...:     },
    ...: )

# THIS PR
In [11]: %timeit df.assign(f=10)
28.1 ms ± 365 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

# branch-22.04
In [4]: %timeit df.assign(f=10)
35 ms ± 10.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)



# THIS PR
In [4]: %timeit df.assign(g='hello world')
38.4 ms ± 250 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

# branch-22.04
In [4]: %timeit df.assign(g='hello world')
54 ms ± 82.6 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

DataFrame constructor, a 2x speedup:

In [2]: s1 = cudf.Series([1, 2, 3] * 1000000)

In [4]: s1.index = s1.index.astype("float64")

In [5]: s2 = cudf.Series([2, 3, 3] * 1000000)

In [6]: s2.index = s2.index.astype("float64")


In [12]: s3 = cudf.Series([10, 11, 12] * 1000000)

In [13]: s3.index = s3.index.astype("float64")

# THIS PR
In [14]: %timeit cudf.DataFrame({'a':s1, 'b':s2, 'c':s3})
2.81 ms ± 63.6 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

# branch-22.04
In [9]: %timeit cudf.DataFrame({'a':s1, 'b':s2, 'c':s3})
5.37 ms ± 188 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
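
The quoted speedups can be checked directly from the mean timings above. The snippet below only redoes that arithmetic; the labels are descriptive names chosen here.

```python
# Mean timings copied from the %timeit output above, in milliseconds:
# (benchmark, this PR, branch-22.04)
timings_ms = [
    ("scalar_broadcast_to (str)", 100.0, 227.0),
    ("assign(f=10)", 28.1, 35.0),
    ("assign(g='hello world')", 38.4, 54.0),
    ("DataFrame constructor", 2.81, 5.37),
]

# Speedup = baseline time / time with this PR.
speedups = {name: baseline / this_pr for name, this_pr, baseline in timings_ms}

for name, ratio in speedups.items():
    print(f"{name}: {ratio:.2f}x")
# scalar_broadcast_to (str): 2.27x
# assign(f=10): 1.25x
# assign(g='hello world'): 1.41x
# DataFrame constructor: 1.91x
```

Note that the branch-22.04 measurement for `df.assign(f=10)` has a standard deviation of 10.3 ms from single-loop runs, so the lower end of the 1.2x-1.4x range is the noisiest of these figures.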

galipremsagar · 2022-02-14T20:06:14Z

cc: @randerzander @ayushdg

codecov · 2022-02-14T21:34:25Z

Codecov Report

Merging #10285 (3e798f8) into branch-22.04 (a7d88cd) will increase coverage by 0.24%.
The diff coverage is n/a.

❗ Current head 3e798f8 differs from pull request most recent head 33bad4d. Consider uploading reports for the commit 33bad4d to get more accurate results.

@@               Coverage Diff                @@
##           branch-22.04   #10285      +/-   ##
================================================
+ Coverage         10.42%   10.67%   +0.24%     
================================================
  Files               119      122       +3     
  Lines             20603    20878     +275     
================================================
+ Hits               2148     2228      +80     
- Misses            18455    18650     +195

| Impacted Files | Coverage Δ |
|---|---|
| python/cudf/cudf/_fuzz_testing/fuzzer.py | `0.00% <ø> (ø)` |
| python/cudf/cudf/_fuzz_testing/io.py | `0.00% <ø> (ø)` |
| python/cudf/cudf/_fuzz_testing/main.py | `0.00% <ø> (ø)` |
| python/cudf/cudf/_version.py | `0.00% <ø> (ø)` |
| python/cudf/cudf/comm/gpuarrow.py | `0.00% <ø> (ø)` |
| python/cudf/cudf/core/_base_index.py | `0.00% <ø> (ø)` |
| python/cudf/cudf/core/column/categorical.py | `0.00% <ø> (ø)` |
| python/cudf/cudf/core/column/column.py | `0.00% <ø> (ø)` |
| python/cudf/cudf/core/column/datetime.py | `0.00% <ø> (ø)` |
| python/cudf/cudf/core/column/methods.py | `0.00% <ø> (ø)` |
| ... and 62 more | |

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update f263820...33bad4d. Read the comment docs.
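
The percentages in the Coverage Diff above follow from the Hits and Lines rows. The sketch below redoes that arithmetic with exact fractions. That the displayed values are truncated rather than rounded is an assumption inferred from these numbers, not something the report states; `coverage_pct` and `truncate` are hypothetical helper names.

```python
from fractions import Fraction

# Figures copied from the Coverage Diff above.
base = {"hits": 2148, "misses": 18455, "lines": 20603}  # branch-22.04
head = {"hits": 2228, "misses": 18650, "lines": 20878}  # #10285


def coverage_pct(report):
    """Exact coverage percentage: 100 * hits / lines."""
    return Fraction(report["hits"] * 100, report["lines"])


def truncate(value, places=2):
    """Drop digits beyond `places` decimals without rounding (assumed display rule)."""
    scale = 10 ** places
    return int(value * scale) / scale


base_pct = coverage_pct(base)        # about 10.4257
head_pct = coverage_pct(head)        # about 10.6715
delta_pct = head_pct - base_pct      # about 0.2459

print(truncate(base_pct), truncate(head_pct), truncate(delta_pct))
# 10.42 10.67 0.24
```

Under that assumption the +0.24% delta is consistent with the table, even though subtracting the two displayed percentages would give 0.25.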

vyasr

LGTM. Mostly just suggestions to remove the default parameter since we changed it and that should be the default expected behavior for _insert. Other than that, one suggestion where you might be able to optimize a constructor. I'll leave it to you to finalize.

vyasr · 2022-02-16T01:04:56Z

python/cudf/cudf/core/dataframe.py

-            value = Series(value, nan_as_null=nan_as_null)._align_to_index(
-                self._index, how="right", sort=False
-            )
+            value = Series(value, nan_as_null=nan_as_null)


There's probably a faster way to construct this if we know the input is a `cudf.Series`. I'm not sure how much we could save for a pandas Series by handling it manually; I don't think we typically do anything special for those. That may be worth exploring in a future PR (basically seeing if we can implement `Series.from_pandas` more efficiently than just calling the constructor), but it is out of scope for now.

Yeah, I think it is out of scope for this PR. Other places in the codebase use similar patterns, so we might be better off tackling them all at once in a separate PR.
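
The kind of fast path discussed in this exchange might look roughly like the sketch below. It is purely illustrative and deferred to a future PR in the discussion above: `ToySeries` and `as_series` are hypothetical names, not cuDF APIs, and the list copy merely stands in for whatever conversion work a general-purpose constructor does.

```python
class ToySeries:
    """Hypothetical stand-in for a Series; not cuDF's class."""

    def __init__(self, values):
        # Generic path: materialize a copy of the input, standing in for
        # the conversion cost of a general-purpose constructor.
        self.values = list(values)


def as_series(value):
    """Return `value` unchanged if it is already a ToySeries, else construct one."""
    if isinstance(value, ToySeries):
        return value  # fast path: no conversion, no copy
    return ToySeries(value)


existing = ToySeries([1, 2, 3])
same = as_series(existing)      # reuses the existing object
fresh = as_series([1, 2, 3])    # goes through the constructor
```

One trade-off of such a shortcut is that the result shares data with the input, so it only fits call sites that do not mutate the returned object.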

galipremsagar · 2022-02-16T04:13:33Z

@gpucibot merge

galipremsagar added 8 commits February 11, 2022 10:46

faster scalar broadcasting

5470de9

Merge remote-tracking branch 'upstream/branch-22.04' into insert_opti…

eeb1f34

…mizations

ignore double reindexing

c42c0e8

reduce reindexing

4b20ef8

Merge remote-tracking branch 'upstream/branch-22.04' into insert_opti…

4860ed5

…mizations

Merge remote-tracking branch 'upstream/branch-22.04' into insert_opti…

c1d8457

…mizations

more optimization

9d6142b

Merge remote-tracking branch 'upstream/branch-22.04' into insert_opti…

334eb4d

…mizations

galipremsagar added the 3 - Ready for Review, Python, and 4 - Needs cuDF (Python) Reviewer labels Feb 14, 2022

galipremsagar requested review from quasiben and shwina February 14, 2022 20:05

galipremsagar requested a review from a team as a code owner February 14, 2022 20:05

galipremsagar self-assigned this Feb 14, 2022

galipremsagar requested a review from rgsl888prabhu February 14, 2022 20:05

galipremsagar added the non-breaking label Feb 14, 2022

galipremsagar added the improvement label Feb 14, 2022

vyasr reviewed Feb 14, 2022

vyasr reviewed Feb 14, 2022

address reviews

d3e062f

vyasr approved these changes Feb 16, 2022

galipremsagar added 2 commits February 15, 2022 17:34

Merge remote-tracking branch 'upstream/branch-22.04' into insert_opti…

42515d2

…mizations

address reviews

33bad4d

galipremsagar removed the 3 - Ready for Review label Feb 16, 2022

galipremsagar added the 5 - Ready to Merge label and removed the 4 - Needs cuDF (Python) Reviewer label Feb 16, 2022

rapids-bot bot merged commit 203f7b0 into rapidsai:branch-22.04 Feb 16, 2022
