[BUG] Use of applymap on StringColumns #3646

Closed
dmitra79 opened this issue Dec 19, 2019 · 8 comments
Labels
bug Something isn't working numba Numba issue Python Affects Python cuDF API.

Comments

@dmitra79

dmitra79 commented Dec 19, 2019

I am trying to use applymap method on a String column to convert to integers. (The values are ex. 'A101', 'B236', etc). I am using cudf 0.10.0 on Ubuntu 18.04.2 LTS (GNU/Linux 4.15.0-45-generic x86_64)

My code:
```
def id2int(x):
    return int(x[1:])

s = cudf.Series(id_df['id'])

z = s.applymap(id2int)
```

The error is:

  3 s=cudf.Series(id_df['id'])

----> 4 z=s.applymap(id2int)

~/anaconda3/envs/mindsynchro-test/lib/python3.7/site-packages/cudf/core/series.py in applymap(self, udf, out_dtype)
1735 """
1736 if callable(udf):
-> 1737 res_col = self._unaryop(udf)
1738 else:
1739 res_col = self._column.applymap(udf, out_dtype=out_dtype)

~/anaconda3/envs/mindsynchro-test/lib/python3.7/site-packages/cudf/core/series.py in _unaryop(self, fn)
607 operand.
608 """
--> 609 outcol = self._column.unary_operator(fn)
610 return self._copy_construct(data=outcol)
611

AttributeError: 'StringColumn' object has no attribute 'unary_operator'
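
For reference, the per-element logic that `id2int` is meant to apply can be checked on the host with plain Python. This is only a CPU sketch on a made-up sample list (`sample_ids` is hypothetical); it does not use cuDF and does not reproduce the error above.

```python
def id2int(x: str) -> int:
    """Drop the leading letter code and parse the remainder as an integer."""
    return int(x[1:])


# Hypothetical sample values shaped like the ids described in the report.
sample_ids = ["A101", "B236"]
converted = [id2int(x) for x in sample_ids]
print(converted)  # [101, 236]
```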

@dmitra79 dmitra79 added Needs Triage Need team to review and classify bug Something isn't working labels Dec 19, 2019
@taureandyernv
Contributor

taureandyernv commented Jan 2, 2020

Hey @dmitra79, one option is to use nvstrings to do the string manipulation. However, I'm unsure of your end goal with the output. Is it to
a) just get an integer representation of the string value? It's an id column, so this may be unlikely...
or
b) actually manipulate the string values because the integer portion is useful to you?

If a), we may need to tweak the process a bit. Please try:

import nvstrings
import cudf
a = ['A101', 'A236','A101', 'A236'] #this is your "id_df['id']"
s = nvstrings.to_device(a)
cs=cudf.Series(s.hash()) # gives you a hashed value of your strings
cs.head()

Output is:

0     667292359
1    3485306761
2     667292359
3    3485306761
dtype: int64
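
The idea behind option a), equal strings mapping to equal integers, can be sketched on the CPU with the standard library. This sketch assumes CRC-32 from `zlib` as a stand-in hash; it is not the hash that nvstrings uses, so the values will not match the output above. `stable_code` is a hypothetical helper name.

```python
import zlib


def stable_code(s: str) -> int:
    """Map a string to an unsigned 32-bit integer using CRC-32 (deterministic)."""
    return zlib.crc32(s.encode("utf-8"))


ids = ["A101", "A236", "A101", "A236"]  # same sample ids as above
codes = [stable_code(s) for s in ids]

# Identical ids receive identical codes; distinct ids receive distinct codes here.
for original, code in zip(ids, codes):
    print(original, code)
```

As with any hash, the integers carry no ordering information about the ids, which is why option b) is the better fit when the numeric part itself matters.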

If b), you'll bring in numpy as well, but only for typecasting in this instance:

import nvstrings
import cudf
import numpy as np
a = ['A101', 'A236','A101', 'A236'] #this is your "id_df['id']"
s = nvstrings.to_device(a)
cs = cudf.Series(s.slice(1)).astype(np.int32)  # does what you wanted applymap to do
cs.head()

Output is:

0    101
1    236
2    101
3    236
dtype: int32

Please let me know if this helps!

Just an FYI, I couldn't get your applymap implementation to work on a Series in either RAPIDS or pandas. I could get it to work in pandas if it is a DataFrame. Unfortunately, cuDF's DataFrames don't seem to have a string implementation for this function yet.

@dmitra79
Author

dmitra79 commented Jan 2, 2020

Hi @taureandyernv ,

Thank you for your response! My goal was to use just the integer part (the id is essentially an integer with a single letter code, and I don't need the letter) to sort the ids and partition the set systematically. We ended up doing this without converting to integers (but had to convert to pandas). For example:
np.array_split(df['id'].unique().sort_values().to_pandas().values, options.n_splits)
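
The same sort-and-partition step can be sketched without pandas or NumPy. This is a minimal sketch, assuming every id is one letter followed by digits; `split_sorted_ids` is a hypothetical helper, and unlike the line above it sorts by the numeric part rather than by the raw string. The chunk sizes follow the `np.array_split` convention, where the first `len % n_splits` chunks get one extra element.

```python
def split_sorted_ids(ids, n_splits):
    """Deduplicate ids, sort by their numeric part, and split into n_splits chunks."""
    # Tie-break on the full string so the order is deterministic.
    unique_sorted = sorted(set(ids), key=lambda s: (int(s[1:]), s))
    base, extra = divmod(len(unique_sorted), n_splits)
    parts, start = [], 0
    for i in range(n_splits):
        size = base + (1 if i < extra else 0)  # earlier chunks absorb the remainder
        parts.append(unique_sorted[start:start + size])
        start += size
    return parts


ids = ["A236", "A101", "B7", "A101", "C1500", "B42"]  # made-up sample
parts = split_sorted_ids(ids, 2)
print(parts)  # [['B7', 'B42', 'A101'], ['A236', 'C1500']]
```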

Thanks for mentioning nvstrings - I wasn't aware of it, will keep in mind for the future.
Your approach (b) should be good (and the example above works) but:

  • when I try it with my real data the kernel dies - not sure why, and can't spend too much time on it now;
  • it seems to require converting from a cudf column to a list. Or would nvstrings.to_device() accept a cudf Series or a column as input?
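
The cause of the kernel crash is not established in this thread. One thing that may be worth ruling out on a host-side sample is whether every value really has the letter-plus-digits shape and fits in an int32. The sketch below is plain Python; the pattern, the `find_suspect_ids` helper, and the sample values are all assumptions for illustration.

```python
import re

# Assumed id shape: exactly one ASCII letter followed by one or more digits.
ID_PATTERN = re.compile(r"[A-Za-z][0-9]+")
INT32_MAX = 2**31 - 1


def find_suspect_ids(ids):
    """Return values that are missing, malformed, or too large for int32."""
    suspects = []
    for value in ids:
        if value is None or not ID_PATTERN.fullmatch(value):
            suspects.append(value)
        elif int(value[1:]) > INT32_MAX:
            suspects.append(value)
    return suspects


sample = ["A101", "B236", None, "", "C12x", "D99999999999"]  # made-up sample
print(find_suspect_ids(sample))  # [None, '', 'C12x', 'D99999999999']
```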

@taureandyernv
Contributor

taureandyernv commented Jan 3, 2020

@dmitra79, courtesy of @VibhuJawa, we do have a string accessor for dataframes:

import cudf
import numpy as np
df = cudf.DataFrame({'a':['A101', 'A236','A101', 'A236']})
df['a'] = df['a'].str.slice(1).astype(np.int32) #this slices the string, then typecasts the output to int32
df.head()

and then

df.dtypes

Outputs

0    101
1    236
2    101
3    236
a    int32
dtype: object

Here is a great blog post for reference: https://medium.com/rapids-ai/show-me-the-word-count-3146e1173801

@dmitra79
Author

dmitra79 commented Jan 3, 2020

Great - thank you!

@taureandyernv
Contributor

taureandyernv commented Jan 3, 2020

@dantegd if @dmitra79 agrees, the use case in this issue is solved by using string accessors instead of applymap. We may want to add string support to applymap to replicate this pandas DataFrame functionality, as some people may expect it:

import pandas as pd
t = pd.DataFrame(['A101', 'A236'], ['A101', 'A236'])
t.applymap(lambda x: (x[1:]))

outputs:

	0
A101	101
A236	236

Thoughts?

@dmitra79
Author

dmitra79 commented Jan 5, 2020

I agree that this resolves the issue. Having this functionality in applymap would be great, or alternatively better documentation describing how to handle strings. I was not aware of the string functionality or of nvstrings from the cudf documentation.

@kkraus14 kkraus14 added Python Affects Python cuDF API. numba Numba issue and removed Needs Triage Need team to review and classify labels Jan 22, 2020
@kkraus14
Collaborator

This is being explored for the long term, but string UDFs remain difficult because of memory and branching challenges.

@beckernick
Member

As noted, this is a challenging problem. I'm going to close this issue to consolidate further discussion in #3802
