pd.Categorical and level ordering and adding to a DataFrame #6242

Closed
jankatins opened this issue Feb 3, 2014 · 10 comments
Labels
Categorical Categorical Data Type Enhancement Internals Related to non-user accessible pandas implementation

Comments

@jankatins
Contributor

As part of yhat/ggpy#188 I'm trying to build a factor function which works the same way as the one in R (e.g. https://www.stat.berkeley.edu/classes/s133/factors.html) and which I could use to store factors in a dataframe.

What currently does not work is ordering:

mons = c("March","April","January","November","January",
         "September","October","September","November","August",
         "January","November","November","February","May","August",
         "July","December","August","August","September","November",
         "February","April")
factor(mons, levels=c("January","February","March",
                      "April","May","June","July","August","September",
                      "October","November","December"), ordered=TRUE)

Currently I see no way to set the ordering of the levels to a predefined list.

In [69]: pd.Categorical(["b","a","c","a","c"], levels=["a","b","c"])
Out[69]: <repr(<pandas.core.categorical.Categorical at 0xb7f28d0>) failed: ValueError: invalid literal for long() with base 10: 'b'>

[The problem here seems to be that pd.Categorical.__init__() expects the arguments to be already transformed into a factor data structure (list of index positions + labels), whereas the above is actually a "labels + label ordering" structure.]
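To make the two structures concrete, here is a minimal pure-Python sketch of the conversion from "labels + label ordering" to "index positions + levels". The helper name `encode` is hypothetical and is not a pandas function; it only illustrates the transformation the constructor appears to expect to have happened already.

```python
def encode(values, levels):
    """Map each value to its position in `levels`.

    Hypothetical helper for illustration only, not a pandas API.
    """
    position = {level: i for i, level in enumerate(levels)}
    missing = [v for v in values if v not in position]
    if missing:
        raise ValueError(f"values not found in levels: {missing}")
    return [position[v] for v in values]


codes = encode(["b", "a", "c", "a", "c"], levels=["a", "b", "c"])
print(codes)  # [1, 0, 2, 0, 2]
```

The pair `(codes, levels)` is the "factor data structure"; the original strings can be recovered with `[levels[c] for c in codes]`.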

pd.Categorical.from_array() only takes a data argument, but no ordering :-(

It seems that pandas.core.algorithms.factorize() was once intended to do that, but the order keyword is not used in the actual implementation (https://github.com/pydata/pandas/blob/master/pandas/core/algorithms.py#L116).

I'm not sure what to do with ordered=True/False. I think ordered == False should disable some of the standard ordering operations?
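One possible reading of the `ordered` flag, sketched in plain Python: ordering operations such as sorting follow the level order when `ordered` is true and are refused otherwise. The function `sort_values` and the choice of `TypeError` are assumptions made for this sketch, not existing pandas behaviour.

```python
def sort_values(values, levels, ordered):
    """Sort values by level position; refuse when the factor is unordered.

    Hypothetical helper illustrating one possible meaning of `ordered`.
    """
    if not ordered:
        raise TypeError("unordered factor: sorting by level is not defined")
    position = {level: i for i, level in enumerate(levels)}
    return sorted(values, key=position.__getitem__)


months = ["January", "February", "March", "April"]
result = sort_values(["March", "April", "January", "January"], months, ordered=True)
print(result)  # ['January', 'January', 'March', 'April']
```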

Another problem is that I can't roundtrip factors into a dataframe:

>>> df = pd.DataFrame({"a":["b","a","c","a","c"]})
>>> df["factor"] = pd.Categorical.from_array(df.a)
>>> df
   a factor
0  b      b
1  a      a
2  c      c
3  a      a
4  c      c
[5 rows x 2 columns]
>>> df.factor
0    b
1    a
2    c
3    a
4    c
Name: factor, dtype: object

-> There is no level attribute anymore :-(

@jreback
Contributor

jreback commented Feb 3, 2014

see related #6219

Categorical is not a first-class object that can be embedded in a series (and thus a dataframe), e.g. like datetimes, floats, strings etc.

I don't think this is too difficult, but it would help if you could post some examples of what you could do if this were the case. E.g. pretend that you can add a categorical as a Series (e.g. it has dtype='categorical'); can you show some operations on a frame that make sense for that?

or does this make sense at all?

@jankatins
Contributor Author

What I want to do is this:

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
# factor() is the proposed function; it does not exist in pandas yet
df = pd.DataFrame({"a":["b","a","c","a"]})
df["factor"] = factor(df.a, levels=["a","b","c", "d"], labels=["Jan", "Feb", "Mar", "Apr"])
indentation = np.arange(len(df.factor.levels))
weights = pd.value_counts(df.factor)
# weights must be [2,1,1,0] -> sorted by levels, zero in case a level is
# not present in the series!
bar_width = 0.8
plt.bar(indentation, weights, bar_width)
# matplotlib bars are not centered but left aligned :-/
plt.gca().set_xticks(indentation + bar_width/2)
plt.gca().set_xticklabels(df.factor.labels)
# matplotlib: give the bars some space :-)
plt.gca().set_xlim(indentation.min() - 0.3, indentation.max() + bar_width + 0.3)
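The requirement on `weights` (counts sorted by level, with zero for levels that never occur) can be sketched without pandas. The helper `level_counts` is a hypothetical name used only for this illustration.

```python
from collections import Counter


def level_counts(values, levels):
    """Count occurrences per level, in level order; unused levels count as 0.

    Hypothetical helper, not a pandas API.
    """
    seen = Counter(values)
    # Counter returns 0 for keys it has never seen.
    return [seen[level] for level in levels]


weights = level_counts(["b", "a", "c", "a"], levels=["a", "b", "c", "d"])
print(weights)  # [2, 1, 1, 0]
```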

Also:

if df[column_name].is_categorical:
    # do the stuff for discrete variables
else:
    # do the stuff for continuous variables

Also nice:

df.groupby("categorical_variable").size()
# -> Would give 0 for levels which are not in the dataframe
[...]
df["factor"] = factor(df.a, levels=["a","b","c", "d"],
                      labels=["Jan", "Feb", "Mar", "Apr"], ordered=True)
df = df.set_index("factor") # would raise if ordered=False
# -> would be sorted according to levels and be sliceable by labels
first_quarter = df["Jan":"Mar"]
# -> Would contain all values for the first quarter of the year
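The label slicing described here (sorted by level order, with the stop label included) could behave like the following pure-Python sketch. The function `slice_by_label` and the representation of rows as `(label, value)` pairs are assumptions made for illustration.

```python
def slice_by_label(rows, start, stop, levels):
    """Return rows sorted by level order whose label lies in [start, stop].

    `rows` is a list of (label, value) pairs. Hypothetical helper; the stop
    label is included, mirroring label-based slicing.
    """
    position = {level: i for i, level in enumerate(levels)}
    lo, hi = position[start], position[stop]
    ordered_rows = sorted(rows, key=lambda row: position[row[0]])
    return [row for row in ordered_rows if lo <= position[row[0]] <= hi]


levels = ["Jan", "Feb", "Mar", "Apr"]
rows = [("Feb", "b"), ("Jan", "a"), ("Mar", "c"), ("Jan", "a")]
selected = slice_by_label(rows, "Jan", "Feb", levels)
print(selected)  # [('Jan', 'a'), ('Jan', 'a'), ('Feb', 'b')]
```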

@jreback
Contributor

jreback commented Feb 3, 2014

can you show what a sample frame (include also a float as well) looks like?

@jankatins
Contributor Author

@jreback ?

[Edit: changed .value into .val_pointer]
Is this what you mean?:

df = pd.DataFrame({"a":["b","a","c","a"], "b":[4,3,2,0], "c":[1.,2.,3.,3.3]})
df["fa"] = factor(df.a)
assert df["fa"].levels == ["b","a","c"]
assert df["fa"].val_pointer == [0,1,2,1]
df["fb"] = factor(df.b)
assert df["fb"].levels == [4,3,2,0]
assert df["fb"].val_pointer == [0,1,2,3]
with assert_raises(exception): 
    # len(labels) != len(levels)
    df["fb"] = factor(df.b, labels=[1,2,3,4,5])
with assert_raises(exception): 
    # 0 is not in the supplied levels; levels that do not occur in the series are ok
    df["fb"] = factor(df.b, levels=[1,2,3,4,5])
df["fb"] = factor(df.b, levels=[0,1,2,3,4,5])
assert df["fb"].levels == [0,1,2,3,4,5]
assert df["fb"].val_pointer == [4,3,2,0]
with assert_raises(exception):
    # len(labels) != len(levels)
    df["fb"] = factor(df.b, levels=[0,1,2,3,4,5], labels=["a","b","c","d","e"])
df["fb"] = factor(df.b, levels=[0,1,2,3,4,5], labels=["a","b","c","d","e","f"])
assert df["fb"].levels == ["a","b","c","d","e","f"]
assert df["fb"].val_pointer == [4,3,2,0]

# floats are the same: each unique value gets a level of its own
# Probably messy if one is to supply levels
df["fc"] = factor(df.c)
assert df["fc"].levels == [1.,2.,3.,3.3]
assert df["fc"].val_pointer == [0,1,2,3]

#groupby with factors
grouped = df.groupby("fb")
assert grouped.groups == {'a': [3], 'b': [],'c': [2],'d': [1],'e': [0],'f': []}

# value counts: empty labels are set to zero
assert pd.value_counts(df.fb) == [1,0,1,1,1,0]

# series of factors get real values
assert pd.Series(df.fb) == pd.Series(["e","d","c","a"])

# Factors of factors get their levels shrunk to only the ones which are used
smaller = factor(factor([1,2], levels=[0,1,2,3]))
assert smaller.levels == [1,2]

# indexing with factors
with assert_raises(Exception):
    # not ordered
    df.set_index("fb")
df["fb"] = factor(df.b, levels=[0,1,2,3,4,5], labels=["a","b","c","d","e","f"], ordered=True)
df.set_index("fb")
# [4,3,2,0] : ["b","a","c","a"] -factors-> [e,d,c,a] : ["b","a","c","a"] -index-> [a,c,d,e] : ["a","c","a","b"]
df["a":"d"]["a"] == Series(["a","c","a"]) # including "stop"

# concat
a = factor([1,2,3])
b = factor(["a","b","c"])
c = factor([1,2,3,4])
with assert_raises(exception):
    # only factors with the same levels (items and order) are concatable
    concat(a,b)
assert concat(a,c) == factor([1,2,3,1,2,3,4])
d = factor(concat(Series(a), Series(b))) # would work
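A few of the expectations above can be checked against a minimal pure-Python model. The `Factor` class below is a hypothetical stand-in for the proposed `factor` function, with `codes` playing the role of `val_pointer`; it is a sketch under those assumptions and not a pandas class.

```python
class Factor:
    """Minimal factor: `levels` plus integer `codes` pointing into them.

    Hypothetical class for illustration only.
    """

    def __init__(self, values, levels=None):
        values = list(values)
        if levels is None:
            # Default: levels in order of first appearance.
            levels = list(dict.fromkeys(values))
        position = {level: i for i, level in enumerate(levels)}
        unknown = [v for v in values if v not in position]
        if unknown:
            raise ValueError(f"values not in levels: {unknown}")
        self.levels = list(levels)
        self.codes = [position[v] for v in values]

    def values(self):
        """Decode the codes back into the level values."""
        return [self.levels[c] for c in self.codes]

    def groups(self):
        """Row positions per level; unused levels map to an empty list."""
        out = {level: [] for level in self.levels}
        for row, code in enumerate(self.codes):
            out[self.levels[code]].append(row)
        return out


fa = Factor(["b", "a", "c", "a"])
print(fa.levels, fa.codes)  # ['b', 'a', 'c'] [0, 1, 2, 1]

fb = Factor([4, 3, 2, 0], levels=[0, 1, 2, 3, 4, 5])
print(fb.codes)     # [4, 3, 2, 0]
print(fb.groups())  # {0: [3], 1: [], 2: [2], 3: [1], 4: [0], 5: []}

# Re-factoring the decoded values drops the unused levels.
smaller = Factor(Factor([1, 2], levels=[0, 1, 2, 3]).values())
print(smaller.levels)  # [1, 2]
```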

cut and #3943 would also be candidates.

And, looking at #3678:

f = factor([nan,"a","b","c"])
assert f.levels == ["a","b","c"]
assert f.val_pointer == [nan,0,1,2]
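The missing-value case could be modelled as in the sketch below. Using `None` as the marker for a missing pointer is an assumption of this sketch (the example above writes `nan`), and `encode_with_missing` is a hypothetical helper name.

```python
import math


def encode_with_missing(values, levels=None):
    """Encode values as positions in `levels`, keeping missing values out of the levels.

    Hypothetical helper; missing entries (None or NaN) get the code None.
    """
    def is_missing(v):
        return v is None or (isinstance(v, float) and math.isnan(v))

    if levels is None:
        # Levels in order of first appearance, ignoring missing values.
        levels = list(dict.fromkeys(v for v in values if not is_missing(v)))
    position = {level: i for i, level in enumerate(levels)}
    codes = [None if is_missing(v) else position[v] for v in values]
    return list(levels), codes


levels, codes = encode_with_missing([float("nan"), "a", "b", "c"])
print(levels)  # ['a', 'b', 'c']
print(codes)   # [None, 0, 1, 2]
```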

@jreback
Contributor

jreback commented Feb 3, 2014

What does the factor look like when you print the frame? For example, given:

df = pd.DataFrame({"a":["b","a","c","a"], "b":[4,3,2,0], "c":[1.,2.,3.,3.3]})
df["fa"] = factor(df.a)

i.e. what is shown if you print out df?

@jankatins
Contributor Author

I would think it would look the same as if you printed out Series(factor(...)); only the dtypes and the like would show that it is a factor.

@jankatins
Contributor Author

#5313

@jreback
Contributor

jreback commented Feb 3, 2014

yep....do we want to close this (and just ref from #5313) ?

@jankatins
Contributor Author

I think this makes sense. I'll add a comment there about the "what should work" items above.

@jreback
Contributor

jreback commented May 23, 2014

@JanSchulz just saw your examples.....let me take a look
