BUG: not properly converting S1 in astype ,on PY3 #12857

cchrysostomou · 2016-04-11T03:25:32Z

I am trying to create a dataframe where each cell is stored as a single character rather than a Python object. I am able to create and work with the dataframe when using the .astype command. However, if I try to print a larger portion of the table, I get an error.

Code Sample, a copy-pastable example if possible

import random
import pandas as pd
lets = 'ACDEFGHIJKLMNOP'
slen = 50
nseqs = 1000
words = [[random.choice(lets) for x in range(slen)] for _ in range(nseqs)]
df = pd.DataFrame(words).astype('S1')
#this will print correctly:
print(df.iloc[:60, :])
#this will raise an error:
print(df.iloc[:61, :])

error raised

C:\Anaconda3\lib\site-packages\pandas\core\internals.py in _vstack(to_stack, dtype)
   4248 
   4249     # work around NumPy 1.6 bug
-> 4250     if dtype == _NS_DTYPE or dtype == _TD_DTYPE:
   4251         new_values = np.vstack([x.view('i8') for x in to_stack])
   4252         return new_values.view(dtype)
TypeError: data type "bytes8" not understood
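The "bytes8" in the message looks like NumPy's own name for the S1 dtype (one byte, so 8 bits). A minimal sketch, assuming only NumPy is installed, that prints that name and checks whether NumPy can parse it back into a dtype; the variable names are illustrative and the outcome of the round trip may depend on the NumPy version:

```python
import numpy as np

# The fixed-width, one-byte string dtype used in the report.
s1 = np.dtype('S1')

# NumPy builds the name from the scalar type plus the width in bits.
print(s1.name)
print(s1.itemsize)

# Check whether that name is accepted as a dtype specification.
# The result is recorded rather than assumed, since it may vary by version.
try:
    np.dtype(s1.name)
    round_trips = True
except TypeError:
    round_trips = False
print(round_trips)
```

If the name does not round-trip, any code path that compares or converts using the name rather than the dtype object could fail in the way the traceback shows; that reading is an inference from the message, not something confirmed in this thread.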

output of `pd.show_versions()`

commit: None
python: 3.4.4.final.0
python-bits: 64
OS: Windows
OS-release: 7
machine: AMD64
processor: Intel64 Family 6 Model 61 Stepping 4, GenuineIntel
byteorder: little
LC_ALL: None
LANG: None

pandas: 0.17.1
nose: None
pip: 8.1.1
setuptools: 20.3
Cython: 0.23.4
numpy: 1.10.4
scipy: 0.17.0
statsmodels: 0.6.1
IPython: 4.1.1
sphinx: 1.4b1
patsy: 0.4.0
dateutil: 2.4.2
pytz: 2015.7
blosc: None
bottleneck: 1.0.0
tables: 3.2.2
numexpr: 2.4.6
matplotlib: 1.5.1
openpyxl: 2.3.2
xlrd: 0.9.4
xlwt: 1.0.0
xlsxwriter: 0.8.4
lxml: 3.5.0
bs4: 4.4.1
html5lib: None
httplib2: None
apiclient: None
sqlalchemy: 1.0.11
pymysql: None
psycopg2: None
Jinja2: 2.8


jorisvandenbossche · 2016-04-11T11:24:14Z

@costas821 I cannot reproduce this (also using Windows 7, pandas 0.17.1). If you run the above code sample in a new session, do you get that error?

jreback · 2016-04-11T12:34:55Z

This fails on the astype. dtype S1 (like all fixed-size string dtypes) is not supported and should be converted to object. I'm kind of puzzled why it is not, so I'll mark this as a bug.

jreback · 2016-04-11T12:38:03Z

So .astype('U1') works as expected (IOW it coerces to object), but I think we need to either raise on S dtypes in PY3 or just coerce as we do for unicode, though the user is technically saying that they want to encode.

cchrysostomou · 2016-04-11T13:10:40Z

Well, I was kind of hoping that this datatype could be supported. When it's represented as an object, the memory it takes up is extremely high, when all I need is for each cell to take up a single byte. Everything except for 'printing' seemed to work for me. Is there any workaround for this?
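To make the memory concern concrete, here is a small NumPy-only sketch (not pandas, and not the reporter's exact code) comparing a fixed-width S1 array with an object array built from the same data. The seed and the names `fixed` and `boxed` are illustrative assumptions:

```python
import random
import numpy as np

random.seed(0)  # arbitrary seed, only for repeatability
lets = 'ACDEFGHIJKLMNOP'
slen, nseqs = 50, 1000
words = [[random.choice(lets) for _ in range(slen)] for _ in range(nseqs)]

# Fixed-width bytes: each cell is stored inline as a single byte.
fixed = np.array(words, dtype='S1')

# Object dtype: each cell is a pointer to a separate Python str object.
# Note that nbytes counts only the pointers, not the str objects themselves.
boxed = np.array(words, dtype=object)

print(fixed.itemsize, fixed.nbytes)
print(boxed.itemsize, boxed.nbytes)
```

The object array's true footprint is larger still than `nbytes` suggests, because every distinct string object carries its own overhead.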

jreback · 2016-04-11T13:16:58Z

you are much better off using categoricals

# your frame
In [17]: df.memory_usage(deep=True).sum()
Out[17]: 2300072

In [18]: uniques = np.sort(pd.unique(df.values.ravel()))

# converted to categoricals (I happen to preserve the mappings, but it's actually not necessary)
In [19]: df.apply(lambda x: x.astype('category',categories=uniques)).memory_usage(deep=True).sum()
Out[19]: 84572
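As far as I know, recent pandas releases no longer accept the `categories=` keyword in `astype('category', ...)`. A rough equivalent of the conversion above, assuming a pandas version that provides `CategoricalDtype`, might look like the following; the seed and the names `cat_dtype`, `obj_bytes` and `cat_bytes` are illustrative, and the exact byte counts will differ from the figures quoted above:

```python
import random
import pandas as pd
from pandas.api.types import CategoricalDtype

random.seed(0)  # arbitrary seed, only for repeatability
lets = 'ACDEFGHIJKLMNOP'
words = [[random.choice(lets) for _ in range(50)] for _ in range(1000)]
df = pd.DataFrame(words)

# One shared dtype so every column uses the same category-to-code mapping.
cat_dtype = CategoricalDtype(categories=sorted(set(lets)))
df_cat = df.astype(cat_dtype)

# Compare the deep memory footprint before and after the conversion.
obj_bytes = df.memory_usage(deep=True).sum()
cat_bytes = df_cat.memory_usage(deep=True).sum()
print(obj_bytes, cat_bytes)
```

Each categorical column stores small integer codes plus one copy of the categories, which is where the saving comes from.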

cchrysostomou · 2016-04-11T14:40:34Z

OK I can go that route, but now I am having some functionality issues. Some things that worked before, no longer work when I set it as a category. If you don't think this is pertinent to the issue, then should I just send you a personal message of what I am trying to do and some sample code?

#  set my frame as category
uniques = np.sort(pd.unique(df.values.ravel()))
df = df.apply(lambda x: x.astype('category', categories=uniques))

# slicing and search operations
df_ints = pd.DataFrame(np.zeros((10000, 500)))
df_ints[5,3] = 1
# when df is a category, I cannot do the following
df[df_ints==0] = 'Z'  
# this also raises an error
df_ints == 'A'

jreback · 2016-04-11T14:56:07Z

Categoricals only accept values from a fixed set, IOW the categories themselves. You can add 'Z' to the categories first:

In [75]: df2 = df.apply(lambda x: x.astype('category', categories=uniques.tolist() + ['Z']))

In [77]: df2.iloc[0,1] = 'Z'
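The same idea on a single Series, as a minimal sketch assuming a pandas version with the `.cat.add_categories` accessor method. The toy data and the `rejected` flag are illustrative, and the exception type raised for an unknown value has differed between pandas versions, so both are caught:

```python
import pandas as pd

# Toy categorical Series; its categories are inferred as A, C, D.
s = pd.Series(list('ACDA'), dtype='category')

# Assigning a value outside the categories is expected to be rejected.
try:
    s.iloc[1] = 'Z'
    rejected = False
except (TypeError, ValueError):
    rejected = True

# Extend the allowed set first, then the assignment goes through.
s = s.cat.add_categories(['Z'])
s.iloc[1] = 'Z'

print(rejected)
print(s.tolist())
print(list(s.cat.categories))
```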

cchrysostomou · 2016-04-11T15:00:02Z

Whoops that was a bad example, my mistake. What I was trying to show was that I cannot use the dataframe df_ints to change values:

df[df_ints==0] = 'A' # where 'A' is already defined in set.
or find where df is a:
df[df=='A']

jreback · 2016-04-11T15:12:28Z

Hmm, that should work, see #12861. Well, good of you to test this out!
In the meantime you can do .astype('U1') to save some memory (or of course pull requests to fix issues are always welcome!)

mroeschke · 2019-02-22T02:35:59Z

This looks fixed on master. Could use a test.

topper-123 · 2019-06-15T12:17:45Z

Removing the p2/p3 compat label, as Python2 is being dropped and this issue still needs tests.

TomAugspurger · 2019-12-30T14:16:49Z

This was fixed by #30327 (ccbe7be specifically I think).

jreback added Bug Dtype Conversions Unexpected or buggy dtype conversions Difficulty Intermediate labels Apr 11, 2016

jreback added this to the 0.18.1 milestone Apr 11, 2016

jreback changed the title ~~Problems printing dataframe with datatype of 'S' or 'a' (fixed string size) of given size~~ BUG: not properly converting S1 in astype ,on PY3 Apr 11, 2016

jreback added the Error Reporting Incorrect or improved errors from pandas label Apr 11, 2016

jreback mentioned this issue Apr 11, 2016

BUG: indexing with boolean array and categoricals #12861

Closed

jreback modified the milestones: 0.18.1, 0.18.2 Apr 26, 2016

jorisvandenbossche modified the milestones: Next Major Release, 0.19.0 Aug 21, 2016

jreback mentioned this issue Mar 5, 2017

BUG: DataFrame.consolidate throws TypeError with bytes blocks #15482

Closed

jreback added the 2/3 Compat label Mar 5, 2017

jreback modified the milestones: 0.20.0, Next Major Release Mar 5, 2017

jreback mentioned this issue Mar 5, 2017

Converting dtype of column from str to S not working on reassignment #15575

Closed

jreback modified the milestones: 0.20.0, Next Major Release Mar 23, 2017

jreback modified the milestones: Next Major Release, 0.20.0 Mar 23, 2017

mroeschke added good first issue Needs Tests Unit test(s) needed to prevent regressions and removed Difficulty Intermediate labels Feb 22, 2019

topper-123 removed the 2/3 Compat label Jun 15, 2019

jbrockmendel removed the Effort Low label Oct 21, 2019

jbrockmendel added a commit to jbrockmendel/pandas that referenced this issue Dec 18, 2019

TST: tests for needs-test issues pandas-dev#12857, pandas-dev#12689

74fc0cc

jbrockmendel mentioned this issue Dec 20, 2019

TST: tests for needs-test issues #12857 #12689 #30327

Merged

7 tasks

jreback modified the milestones: Contributions Welcome, 1.0 Dec 20, 2019

jreback pushed a commit that referenced this issue Dec 24, 2019

TST: tests for needs-test issues #12857 #12689 (#30327)

ccbe7be

AlexKirko pushed a commit to AlexKirko/pandas that referenced this issue Dec 29, 2019

TST: tests for needs-test issues pandas-dev#12857 pandas-dev#12689 (p…

7872f78

…andas-dev#30327)

TomAugspurger closed this as completed Dec 30, 2019
