Inconsistent values while using .applyrows() #171

Closed
Nanthini10 opened this issue Jul 27, 2018 · 3 comments
Labels
bug Something isn't working

Comments

@Nanthini10

I'm trying to convert a string (categorical) column to numeric values, the equivalent of applying pd.to_numeric() in pandas.

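import numpy as np
import pandas as pd
# DataFrame below is the library's GPU DataFrame class (its import is not shown in the report)

df_left = pd.DataFrame()
nelem = 5
df_left['key1'] = np.random.randint(0, 3, nelem)
df_left['key2'] = np.random.randint(0, 5, nelem)
df_left['val1'] = ['1', '2', '5', '3', '2']
df_left['val1'] = df_left['val1'].astype('category')
# Convert the pandas frame to a GPU DataFrame
df_left = DataFrame.from_pandas(df_left)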

df_left
  key1 key2 val1
0    2    0    1
1    0    4    2
2    1    3    5
3    1    4    3
4    0    4    2
def kernel(val1, out1, extra1):
    for i, val in enumerate(val1):
        out1[i] = int(val)

val1 = df_left['val1']
outdf = df_left.apply_rows(kernel, incols=['val1'], 
                           outcols=dict(out1=np.int64), kwargs=dict(extra1=1))
outdf
  key1 key2 val1 out1
0    2    0    1    0
1    0    4    2    1
2    1    3    5    3
3    1    4    3    2
4    0    4    2    1

Expected: outdf['out1'] should contain exactly the same values as val1, but with an integer dtype.

Pandas equivalent:

df = pd.DataFrame()
nelem = 5
df['key1'] = np.random.randint(0, 3, nelem)
df['key2'] = np.random.randint(0, 5, nelem)
df['val1'] = ['1', '2', '5', '3', '2']
df['val1'] = df['val1'].astype('category')
df['out1'] = pd.to_numeric(df['val1'])
df

   key1  key2 val1  out1
0     1     0    1     1
1     1     1    2     2
2     0     2    5     5
3     2     3    3     3
4     2     0    2     2
@kkraus14
Collaborator

kkraus14 commented Aug 2, 2018

It looks like apply_rows is operating on the integer representation of the categorical column, as opposed to the string values in the dictionary.

To confirm, could you check whether what you're seeing matches df['val1'].cat.codes?
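For reference, here is a minimal pandas-only sketch of the difference between a categorical column's integer codes and its category values, using the val1 data from the report. It does not touch the GPU DataFrame or apply_rows; the variable names are illustrative only.

```python
import pandas as pd

# Same values as the 'val1' column in the report
s = pd.Series(['1', '2', '5', '3', '2']).astype('category')

# The "dictionary": the sorted unique string values
categories = list(s.cat.categories)

# The integer representation: each entry is an index into `categories`
codes = s.cat.codes.tolist()

print(categories)  # ['1', '2', '3', '5']
print(codes)       # [0, 1, 3, 2, 1]
```

The codes printed here are the same numbers that appear in the reported out1 column, which is what you would expect if the kernel received the codes rather than the strings.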

@Nanthini10
Author

Yes, the df['val1'].cat.codes values match the output of the kernel.

@mike-wendt mike-wendt added the bug Something isn't working label Aug 6, 2018
@mike-wendt mike-wendt changed the title Inconsistent values while using .applyrows() Inconsistent values while using .applyrows() Aug 8, 2018
mike-wendt pushed a commit that referenced this issue Oct 26, 2018
@kkraus14
Collaborator

Closing as not a bug.
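For anyone landing here with the same need, one possible way to recover the numeric values is to convert the categories once and then gather through the codes. This is a hedged sketch in plain pandas and NumPy, not the library's own API; `lookup` and `out1` are illustrative names, and it assumes every category parses as an integer and that there are no missing values (which pandas encodes as code -1).

```python
import numpy as np
import pandas as pd

s = pd.Series(['1', '2', '5', '3', '2']).astype('category')

# Convert each category string to an integer once: a small lookup table
lookup = np.array([int(c) for c in s.cat.categories], dtype=np.int64)

# Gather: index the lookup table with the per-row integer codes
codes = s.cat.codes.to_numpy()
out1 = lookup[codes]

print(out1.tolist())  # [1, 2, 5, 3, 2]
```

The lookup table has one entry per distinct category, so the string-to-integer conversion runs once per category rather than once per row.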

raydouglass pushed a commit that referenced this issue Nov 7, 2023
* Use Apache-2.0 license.

* Pin libarrow the same way as build time.

3 participants