[BUG] Use of applymap on StringColumns #3646

dmitra79 · 2019-12-19T15:59:35Z

I am trying to use the applymap method on a string column to convert its values to integers (the values look like 'A101', 'B236', etc.). I am using cudf 0.10.0 on Ubuntu 18.04.2 LTS (GNU/Linux 4.15.0-45-generic x86_64).

My code:
```
def id2int(x):
    return int(x[1:])

s=cudf.Series(id_df['id'])

z=s.applymap(id2int)
```

The error is:

  3 s=cudf.Series(id_df['id'])
----> 4 z=s.applymap(id2int)

~/anaconda3/envs/mindsynchro-test/lib/python3.7/site-packages/cudf/core/series.py in applymap(self, udf, out_dtype)
1735 """
1736 if callable(udf):
-> 1737 res_col = self._unaryop(udf)
1738 else:
1739 res_col = self._column.applymap(udf, out_dtype=out_dtype)

~/anaconda3/envs/mindsynchro-test/lib/python3.7/site-packages/cudf/core/series.py in _unaryop(self, fn)
607 operand.
608 """
--> 609 outcol = self._column.unary_operator(fn)
610 return self._copy_construct(data=outcol)
611

AttributeError: 'StringColumn' object has no attribute 'unary_operator'
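
For readers following the traceback: a callable passed to `applymap` is forwarded to `_unaryop`, which calls `unary_operator` on the underlying column, and `StringColumn` does not define that method. The sketch below reproduces only that dispatch pattern with toy classes in plain Python. The class bodies are hypothetical stand-ins, not cuDF's actual implementation, and they ignore the GPU side entirely.

```python
class NumericColumn:
    """Toy stand-in for a column type that supports element-wise functions."""

    def __init__(self, values):
        self.values = list(values)

    def unary_operator(self, fn):
        # Apply fn to every element and wrap the result in a new column
        return NumericColumn(fn(v) for v in self.values)


class StringColumn:
    """Toy stand-in for a string column: it defines no unary_operator."""

    def __init__(self, values):
        self.values = list(values)


class Series:
    """Toy stand-in that forwards a callable to its column, as in the traceback."""

    def __init__(self, column):
        self._column = column

    def applymap(self, udf):
        return Series(self._column.unary_operator(udf))


# Works: the numeric toy column implements unary_operator
numeric_result = Series(NumericColumn([1, 2, 3])).applymap(lambda x: x + 1)

# Fails the same way as the report: the string toy column lacks the method
try:
    Series(StringColumn(["A101", "B236"])).applymap(lambda x: int(x[1:]))
    error_message = None
except AttributeError as exc:
    error_message = str(exc)

print(numeric_result._column.values)
print(error_message)
```

The failure happens at attribute lookup, before the user function is ever called, so rewriting `id2int` would not change the outcome.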


taureandyernv · 2020-01-02T22:41:33Z

Hey @dmitra79, one option is to use nvstrings for the string manipulation. However, I'm unsure of your end goal for the output. Is it to
a) just get an integer representation of the string value? It's an id column, so this may be unlikely...
or
b) actually manipulate the string values because the integer portion is useful to you?

If a), we may need to tweak the process a bit. Please try:

import nvstrings
import cudf
a = ['A101', 'A236','A101', 'A236'] #this is your "id_df['id']"
s = nvstrings.to_device(a)
cs=cudf.Series(s.hash()) # gives you a hashed value of your strings
cs.head()

Output is:

0     667292359
1    3485306761
2     667292359
3    3485306761
dtype: int64

If b), you'll bring in numpy as well, but only for typecasting in this instance:

import nvstrings
import cudf
import numpy as np
a = ['A101', 'A236','A101', 'A236'] #this is your "id_df['id']"
s = nvstrings.to_device(a)
cs = cudf.Series(s.slice(1)).astype(np.int32) # does what you wanted applymap to do
cs.head()

Output is:

0    101
1    236
2    101
3    236
dtype: int32

Please let me know if this helps!

Just an FYI, I couldn't get your applymap implementation to work on a Series in either RAPIDS or pandas. I could get it to work in pandas on a DataFrame. Unfortunately, cuDF's DataFrames don't seem to have a string implementation for this function yet.
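
On the pandas side of that remark: a pandas Series has no `applymap` method, and the element-wise equivalent there is `Series.map`. A minimal CPU-only sketch, using sample values like those in the original report (this says nothing about what cuDF supports on string columns):

```python
import pandas as pd

s = pd.Series(["A101", "A236"])

# Series.map applies the function to each element in turn
as_int = s.map(lambda x: int(x[1:]))

print(as_int.tolist())
```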

dmitra79 · 2020-01-02T23:22:38Z

Hi @taureandyernv ,

Thank you for your response! My goal was to use just the integer part (the id is essentially an integer with a single letter code, and I don't need the letter) to sort the ids and partition the set systematically. We ended up doing this without converting to integers (but had to convert to pandas). For example:
np.array_split(df['id'].unique().sort_values().to_pandas().values, options.n_splits)

Thanks for mentioning nvstrings - I wasn't aware of it, will keep in mind for the future.
Your approach (b) should be good (and the example above works), but:

- when I try it with my real data the kernel dies - not sure why, and can't spend too much time on it now;
- it seems to require converting from a cudf column to a list? Or would nvstrings.to_device() take a cudf Series or a column as input?
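
A note on the sort-and-split step mentioned above: sorting the ids as strings is lexicographic, which differs from numeric order when the number of digits varies. Below is a minimal sketch in plain Python and NumPy of ordering by the integer part before calling `np.array_split`. The sample ids and `n_splits = 2` are made up for illustration and stand in for the real data and `options.n_splits`.

```python
import numpy as np

ids = ["A101", "B236", "A20", "C7", "A101"]  # made-up sample ids
n_splits = 2  # hypothetical stand-in for options.n_splits

unique_ids = set(ids)

# String order compares character by character, so "A101" sorts before "A20"
lexicographic = sorted(unique_ids)

# Numeric order: drop the one-letter prefix and compare the integer part
by_number = sorted(unique_ids, key=lambda s: int(s[1:]))

# Split the ordered ids into n_splits nearly equal chunks
chunks = [c.tolist() for c in np.array_split(np.array(by_number), n_splits)]

print(lexicographic)
print(by_number)
print(chunks)
```

If all ids share the same number of digits, the two orderings agree apart from the letter prefix, and sorting the strings directly is enough.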

taureandyernv · 2020-01-03T18:03:20Z

@dmitra79 , Courtesy of @VibhuJawa, we do have a string accessor for dataframes

import cudf
import numpy as np
df = cudf.DataFrame({'a':['A101', 'A236','A101', 'A236']})
df['a'] = df['a'].str.slice(1).astype(np.int32) #this slices the string, then typecasts the output to int32
df.head()

and then

df.dtypes

Outputs

a    int32
dtype: object

Here is a great blog post for reference: https://medium.com/rapids-ai/show-me-the-word-count-3146e1173801
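
The slice-then-cast pattern assumes every value is one letter followed by digits. As a rough CPU-side check of that assumption, here is a sketch in pandas, whose `.str` accessor has the same `slice` call. The `bad-id` value and the regular expression are assumptions added for illustration, and nothing here is claimed about cuDF's behavior on malformed values.

```python
import numpy as np
import pandas as pd

ids = pd.Series(["A101", "A236", "A101", "A236", "bad-id"])

# True where the value is exactly one letter followed by one or more digits
well_formed = ids.str.match(r"^[A-Za-z][0-9]+$")

# Slice off the prefix and cast only the rows that passed the check
numbers = ids[well_formed].str.slice(1).astype(np.int32)

print(well_formed.tolist())
print(numbers.tolist())
```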

dmitra79 · 2020-01-03T20:29:20Z

Great - thank you!

taureandyernv · 2020-01-03T23:56:40Z

@dantegd if @dmitra79 agrees, the use case is solved by using string accessors instead of applymap. We may want to add string functionality to applymap to replicate this pandas DataFrame behavior, as some people may expect it:

import pandas as pd
t = pd.DataFrame(['A101', 'A236'], ['A101', 'A236'])
t.applymap(lambda x: (x[1:]))

outputs:

	0
A101	101
A236	236

Thoughts?

dmitra79 · 2020-01-05T02:47:27Z

I agree that this resolves the issue. Having this functionality in applymap would be great, or alternatively better documentation describing how to handle strings. I was not aware of the string functionality or of nvstrings from the cuDF documentation.

kkraus14 · 2020-01-22T00:35:59Z

This is being explored for the long term, but string UDFs remain an ongoing challenge because of memory and branching issues.

beckernick · 2021-07-02T16:09:11Z

As noted, this is a challenging problem. I'm going to close this issue to consolidate further discussion in #3802.

dmitra79 added the Needs Triage (Need team to review and classify) and bug (Something isn't working) labels on Dec 19, 2019

antw-cg mentioned this issue Jan 8, 2020

[BUG] apply_rows string AttributeError #3700

Closed

kkraus14 added the Python (Affects Python cuDF API.) and numba (Numba issue) labels and removed the Needs Triage (Need team to review and classify) label on Jan 22, 2020

jrhemstad mentioned this issue Feb 8, 2021

[FEA] Support of StringColumn as numba udf input #7301

Closed

beckernick closed this as completed Jul 2, 2021

beckernick mentioned this issue Jul 2, 2021

[FEA] UDF support on string columns #1195

Closed
