Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ENH: DataFrame.plot.scatter argument c now accepts a column of strings, where rows with the same string are colored identically #59239

Merged
merged 17 commits into from
Sep 3, 2024

Conversation

michaelmannino
Copy link
Contributor

@michaelmannino michaelmannino commented Jul 12, 2024

@michaelmannino
Copy link
Contributor Author

import pandas as pd
import random as rand

df = pd.DataFrame({
    'dataX': [rand.randint(0, 100) for _ in range(100)],
    'dataY': [rand.randint(0, 100) for _ in range(100)],
    'fav_fruit': [rand.choice(['Apples', 'Bananas', 'Grapes', 'Peaches']) for _ in range(100)],
})
df.plot.scatter('dataX', 'dataY', c='fav_fruit')

image

Michael Vincent Mannino added 2 commits July 12, 2024 23:28
@michaelmannino michaelmannino changed the title ENH: Update plot.scatter color to define colors on iterables of non-numerics and non-mpl colors ENH: DataFrame.plot.scatter argument c now accepts a column of strings, where rows with the same string are colored identically Jul 13, 2024
@michaelmannino michaelmannino marked this pull request as ready for review July 13, 2024 04:18
Copy link
Member

@WillAyd WillAyd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Dr-Irv in case you have any feedback as well

@@ -1390,6 +1407,38 @@ def _get_c_values(self, color, color_by_categorical: bool, c_is_column: bool):
c_values = c
return c_values

def _are_valid_colors(self, c_values: np.ndarray | list):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In what instances is c_values a list? Might be misreading but would be better if we only worked with a pd.Series and could call .unique on that, instead of checking every single value in a loop

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated to take a pd.Series, not np.ndarray | list

except (TypeError, ValueError) as _:
return False

def _uniquely_color_strs(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am no matplotlib expert but I think we need to defer to that somehow to get the desired colors, instead of trying to write this out ourselves

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did some research looking into how other libs which support this functionality, including seaborn, handle this workflow. Each utilize the current mpl.colormap and draw colors from a linear space across this map. This allows users to change the choice colors of the same way they would with all other graphs

Additionally, I've added an automatic legend to this type of plot since the chosen colors are not exposed to the user, similar to how a colorbar is drawn in some cases of the same function

@WillAyd WillAyd added the Visualization plotting label Jul 26, 2024
Comment on lines 220 to 221
with tm.assert_produces_warning(None):
ax = df.plot.scatter(x=0, y=1, c="state")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not 100% sure that the test here should be in the tm.assert_produces_warning(None) context. My intuition is that should be removed.

There should also be tests where the column state contains values that are colors, as well as a mix of colors and non-colors.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I removed the tm.assert_produces_warning(None). I added 2 test cases: 1 containing only strings and 1 containing both valid mpl strings and invalid mpl strings.

Additionally, I updated some other functionality tests to confirm they are not executing the new logic added and coloring properly

@michaelmannino
Copy link
Contributor Author

df = pd.DataFrame({
    'dataX': [rand.randint(0, 100) for _ in range(100)],
    'dataY': [rand.randint(0, 100) for _ in range(100)],
    'fav_fruit': [rand.choice(['Grapes', 'Bananas', 'Apple', 'Peaches']) for _ in range(100)],
})

df.plot.scatter('dataX', 'dataY', c='fav_fruit')
image
df = pd.DataFrame({
    'dataX': [rand.randint(0, 100) for _ in range(100)],
    'dataY': [rand.randint(0, 100) for _ in range(100)],
    'fav_fruit': [rand.choice(['Grapes', 'Bananas', 'Apple', 'Peaches']) for _ in range(100)],
})

df.plot.scatter('dataX', 'dataY', c='fav_fruit', cmap='rainbow')
image

@michaelmannino
Copy link
Contributor Author

michaelmannino commented Aug 30, 2024

Not sure if I need to add this as part of this PR, but there is some documentation (and maybe more) about using the c arg in df.plot.scatter found in: https://github.com/pandas-dev/pandas/blob/main/doc/source/user_guide/visualization.rst?plain=1#L636-L637

Should I update this to include this new feature?

Copy link
Contributor

@Dr-Irv Dr-Irv left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm OK with this. Will leave it to @WillAyd to decide/merge

@WillAyd WillAyd merged commit f3e1991 into pandas-dev:main Sep 3, 2024
47 checks passed
@WillAyd
Copy link
Member

WillAyd commented Sep 3, 2024

Thanks @michaelmannino - very impressive work! Thanks for all the efforts on the refactors / feedback

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Scatter plot with colour_by and size_by variables Meaning of color argument in DataFrame.plot.scatter()
3 participants