-
-
Notifications
You must be signed in to change notification settings - Fork 18.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ENH: Add ignore_index for df.drop_duplicates #30405
ENH: Add ignore_index for df.drop_duplicates #30405
Conversation
pandas/core/frame.py
Outdated
@@ -4606,6 +4607,8 @@ def drop_duplicates( | |||
- False : Drop all duplicates. | |||
inplace : bool, default False | |||
Whether to drop duplicates in place or to return a copy. | |||
ignore_index : bool, default False | |||
If True, the resulting axis will be labeled 0, …, n - 1. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If True, the resulting axis will be labeled 0, …, n - 1. | |
If True, the resulting axis will be labeled 0, 1, …, n - 1. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thanks, changed!
pandas/core/frame.py
Outdated
self._update_inplace(new_data) | ||
else: | ||
if ignore_index: | ||
idx = ibase.default_index(len(self[-duplicated])) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this block could be more succinctly written as:
result = self[~duplicated]
if ignore_index:
result = result.reset_index(drop=True)
return result
Or something similar. FWIW I think the current method of evaluating self[~duplicated]
twice can be costly for larger frames
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ahh, good call! @WillAyd
will change! i think idx = ibase.default_index(sum(-duplicated))
should be also faster than using self[-duplicated]
doc/source/whatsnew/v1.0.0.rst
Outdated
@@ -474,7 +474,7 @@ Other API changes | |||
Supplying anything else than ``how`` to ``**kwargs`` raised a ``TypeError`` previously (:issue:`29388`) | |||
- When testing pandas, the new minimum required version of pytest is 5.0.1 (:issue:`29664`) | |||
- :meth:`Series.str.__iter__` was deprecated and will be removed in future releases (:issue:`28277`). | |||
|
|||
- Add ``ignore_index`` to :meth:`DataFrame.drop_duplicates` to reset index (:issue:`30114`) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
reverse the ordering here, e.g. .drop_duplicates has gained the ignore_index
keyword.
move to other enhancements
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
moved and rephrased!
pandas/core/frame.py
Outdated
@@ -4621,9 +4624,15 @@ def drop_duplicates( | |||
if inplace: | |||
(inds,) = (-duplicated)._ndarray_values.nonzero() | |||
new_data = self._data.take(inds) | |||
|
|||
if ignore_index: | |||
new_data = new_data.reset_index(drop=True) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
don't use reset_index, simply use what you did in the other PR, default_index. .reset_index()
causes another copy here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yeah, changed!
pandas/core/frame.py
Outdated
return self[-duplicated] | ||
result = self[-duplicated] | ||
if ignore_index: | ||
return result.reset_index(drop=True) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
here you have to copy before you update the index
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i did not use copy, but .index
assignment, and i think it is also correct.
pandas/core/frame.py
Outdated
result = self[-duplicated] | ||
|
||
if ignore_index: | ||
result.index = ibase.default_index(sum(-duplicated)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
result.index = ibase.default_index(len(result))
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
changed
tm.assert_frame_equal(df, DataFrame(origin_dict)) | ||
|
||
|
||
@pytest.mark.parametrize( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you incorporate these in the above test as well.
can you also assert if inplace== True that the input is unchanged (copy it before and check equality)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
incorporated, and added test
thanks @jreback
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgtm
thanks @charlesdong1991 |
…ndexing-1row-df * upstream/master: (333 commits) CI: troubleshoot Web_and_Docs failing (pandas-dev#30534) WARN: Ignore NumbaPerformanceWarning in test suite (pandas-dev#30525) DEPR: camelCase in offsets, get_offset (pandas-dev#30340) PERF: implement scalar ops blockwise (pandas-dev#29853) DEPR: Remove Series.compress (pandas-dev#30514) ENH: Add numba engine for rolling apply (pandas-dev#30151) [ENH] Add to_markdown method (pandas-dev#30350) DEPR: Deprecate pandas.np module (pandas-dev#30386) ENH: Add ignore_index for df.drop_duplicates (pandas-dev#30405) BUG: The setting xrot=0 in DataFrame.hist() doesn't work with by and subplots pandas-dev#30288 (pandas-dev#30491) CI: Fix GBQ Tests (pandas-dev#30478) Bug groupby quantile listlike q and int columns (pandas-dev#30485) ENH: Add ignore_index for df.sort_values and series.sort_values (pandas-dev#30402) TYP: Typing hints in pandas/io/formats/{css,csvs}.py (pandas-dev#30398) BUG: raise on non-hashable Index name, closes pandas-dev#29069 (pandas-dev#30335) Replace "foo!r" to "repr(foo)" syntax pandas-dev#29886 (pandas-dev#30502) BUG: preserve EA dtype in transpose (pandas-dev#30091) BLD: add check to prevent tempita name error, clsoes pandas-dev#28836 (pandas-dev#30498) REF/TST: method-specific files for test_append (pandas-dev#30503) marked unused parameters (pandas-dev#30504) ...
black pandas
git diff upstream/master -u -- "*.py" | flake8 --diff