Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[REVIEW] DataFrame insert and creation optimizations #10285

Merged
merged 11 commits into from
Feb 16, 2022

Conversation

galipremsagar
Copy link
Contributor

@galipremsagar galipremsagar commented Feb 14, 2022

This PR removes double index equality & reindexing checks in DataFrame construction and insert code-flows.

scalar_broadcast_to, a 2.2x speedup for str dtype, numeric types perf remains the same:

# This PR:

In [12]: %timeit cudf.utils.utils.scalar_broadcast_to('abc', 400000000, "str")
100 ms ± 208 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

# branch-22.04
In [25]: %timeit cudf.utils.utils.scalar_broadcast_to('abc', 400000000, "str")
227 ms ± 10 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

assign, a 1.2x-1.4x speedup:

In [9]: import numpy as np

In [10]: df = cudf.DataFrame(
    ...:     {
    ...:         "a": [1, 2, 3, np.nan] * 10000000,
    ...:         "b": ["a", "b", "c", "d"] * 10000000,
    ...:         "c": [0.0, 0.12, np.nan, 10.12] * 10000000,
    ...:     },
    ...: )

# THIS PR
In [11]: %timeit df.assign(f=10)
28.1 ms ± 365 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

# branch-22.04
In [4]: %timeit df.assign(f=10)
35 ms ± 10.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)



# THIS PR
In [4]: %timeit df.assign(g='hello world')
38.4 ms ± 250 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

# branch-22.04
In [4]: %timeit df.assign(g='hello world')
54 ms ± 82.6 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

DataFrame constructor, a 2x speedup:

In [2]: s1 = cudf.Series([1, 2, 3] * 1000000)

In [4]: s1.index = s1.index.astype("float64")

In [5]: s2 = cudf.Series([2, 3, 3] * 1000000)

In [6]: s2.index = s2.index.astype("float64")


In [12]: s3 = cudf.Series([10, 11, 12] * 1000000)

In [13]: s3.index = s3.index.astype("float64")

# THIS PR
In [14]: %timeit cudf.DataFrame({'a':s1, 'b':s2, 'c':s3})
2.81 ms ± 63.6 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

# branch-22.04
In [9]: %timeit cudf.DataFrame({'a':s1, 'b':s2, 'c':s3})
5.37 ms ± 188 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

@galipremsagar galipremsagar added 3 - Ready for Review Ready for review by team Python Affects Python cuDF API. 4 - Needs cuDF (Python) Reviewer labels Feb 14, 2022
@galipremsagar galipremsagar requested a review from a team as a code owner February 14, 2022 20:05
@galipremsagar galipremsagar self-assigned this Feb 14, 2022
@galipremsagar
Copy link
Contributor Author

cc: @randerzander @ayushdg

@galipremsagar galipremsagar added the non-breaking Non-breaking change label Feb 14, 2022
@galipremsagar galipremsagar added the improvement Improvement / enhancement to an existing function label Feb 14, 2022
@codecov
Copy link

codecov bot commented Feb 14, 2022

Codecov Report

Merging #10285 (3e798f8) into branch-22.04 (a7d88cd) will increase coverage by 0.24%.
The diff coverage is n/a.

❗ Current head 3e798f8 differs from pull request most recent head 33bad4d. Consider uploading reports for the commit 33bad4d to get more accurate results

Impacted file tree graph

@@               Coverage Diff                @@
##           branch-22.04   #10285      +/-   ##
================================================
+ Coverage         10.42%   10.67%   +0.24%     
================================================
  Files               119      122       +3     
  Lines             20603    20878     +275     
================================================
+ Hits               2148     2228      +80     
- Misses            18455    18650     +195     
Impacted Files Coverage Δ
python/cudf/cudf/_fuzz_testing/fuzzer.py 0.00% <ø> (ø)
python/cudf/cudf/_fuzz_testing/io.py 0.00% <ø> (ø)
python/cudf/cudf/_fuzz_testing/main.py 0.00% <ø> (ø)
python/cudf/cudf/_version.py 0.00% <ø> (ø)
python/cudf/cudf/comm/gpuarrow.py 0.00% <ø> (ø)
python/cudf/cudf/core/_base_index.py 0.00% <ø> (ø)
python/cudf/cudf/core/column/categorical.py 0.00% <ø> (ø)
python/cudf/cudf/core/column/column.py 0.00% <ø> (ø)
python/cudf/cudf/core/column/datetime.py 0.00% <ø> (ø)
python/cudf/cudf/core/column/methods.py 0.00% <ø> (ø)
... and 62 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update f263820...33bad4d. Read the comment docs.

Copy link
Contributor

@vyasr vyasr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Mostly just suggestions to remove the default parameter since we changed it and that should be the default expected behavior for _insert. Other than that, one suggestion where you might be able to optimize a constructor. I'll leave it to you to finalize.

python/cudf/cudf/core/dataframe.py Outdated Show resolved Hide resolved
value = Series(value, nan_as_null=nan_as_null)._align_to_index(
self._index, how="right", sort=False
)
value = Series(value, nan_as_null=nan_as_null)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's probably a faster way to construct this if we know the input is a cudf.Series. I'm not sure how much we could save for a pandas Series by handling it manually, I don't think we typically do anything special for those. That may be worth exploring in a future PR (basically seeing if we can implement Series.from_pandas in a more efficient manner than just calling the constructor), but is out of scope for now.

Copy link
Contributor Author

@galipremsagar galipremsagar Feb 16, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I think it is out of scope for this PR. The reason is we have other places in the code-base which use similar patterns we might be better off tackling that in a separate PR all at a time.

python/cudf/cudf/core/dataframe.py Outdated Show resolved Hide resolved
python/cudf/cudf/core/dataframe.py Outdated Show resolved Hide resolved
python/cudf/cudf/core/dataframe.py Outdated Show resolved Hide resolved
python/cudf/cudf/core/frame.py Outdated Show resolved Hide resolved
python/cudf/cudf/core/groupby/groupby.py Outdated Show resolved Hide resolved
@galipremsagar galipremsagar removed the 3 - Ready for Review Ready for review by team label Feb 16, 2022
@galipremsagar galipremsagar added 5 - Ready to Merge Testing and reviews complete, ready to merge and removed 4 - Needs cuDF (Python) Reviewer labels Feb 16, 2022
@galipremsagar
Copy link
Contributor Author

@gpucibot merge

@rapids-bot rapids-bot bot merged commit 203f7b0 into rapidsai:branch-22.04 Feb 16, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
5 - Ready to Merge Testing and reviews complete, ready to merge improvement Improvement / enhancement to an existing function non-breaking Non-breaking change Python Affects Python cuDF API.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants