BUG: not properly converting S1 in astype ,on PY3 #12857

Closed
cchrysostomou opened this issue Apr 11, 2016 · 12 comments
Labels: Bug; Dtype Conversions (Unexpected or buggy dtype conversions); Error Reporting (Incorrect or improved errors from pandas); good first issue; Needs Tests (Unit test(s) needed to prevent regressions)

@cchrysostomou

I am trying to create a DataFrame in which each cell is stored as a single character rather than as a Python object. I can create and work with the DataFrame after calling .astype. However, if I try to print a larger portion of the table, I get an error.

Code Sample, a copy-pastable example if possible

import random
import pandas as pd
lets = 'ACDEFGHIJKLMNOP'
slen = 50
nseqs = 1000
words = [[random.choice(lets) for x in range(slen)] for _ in range(nseqs)]
df = pd.DataFrame(words).astype('S1')
#this will print correctly:
print(df.iloc[:60, :])
#this will raise an error:
print(df.iloc[:61, :])

error raised

C:\Anaconda3\lib\site-packages\pandas\core\internals.py in _vstack(to_stack, dtype)
   4248 
   4249     # work around NumPy 1.6 bug
-> 4250     if dtype == _NS_DTYPE or dtype == _TD_DTYPE:
   4251         new_values = np.vstack([x.view('i8') for x in to_stack])
   4252         return new_values.view(dtype)
TypeError: data type "bytes8" not understood

output of pd.show_versions()

commit: None
python: 3.4.4.final.0
python-bits: 64
OS: Windows
OS-release: 7
machine: AMD64
processor: Intel64 Family 6 Model 61 Stepping 4, GenuineIntel
byteorder: little
LC_ALL: None
LANG: None

pandas: 0.17.1
nose: None
pip: 8.1.1
setuptools: 20.3
Cython: 0.23.4
numpy: 1.10.4
scipy: 0.17.0
statsmodels: 0.6.1
IPython: 4.1.1
sphinx: 1.4b1
patsy: 0.4.0
dateutil: 2.4.2
pytz: 2015.7
blosc: None
bottleneck: 1.0.0
tables: 3.2.2
numexpr: 2.4.6
matplotlib: 1.5.1
openpyxl: 2.3.2
xlrd: 0.9.4
xlwt: 1.0.0
xlsxwriter: 0.8.4
lxml: 3.5.0
bs4: 4.4.1
html5lib: None
httplib2: None
apiclient: None
sqlalchemy: 1.0.11
pymysql: None
psycopg2: None
Jinja2: 2.8

@jorisvandenbossche
Member

@costas821 I cannot reproduce this (also using Windows 7, pandas 0.17.1). If you run the above code sample in a new session, you get that error?

@jreback
Contributor

jreback commented Apr 11, 2016

This fails on the astype. Dtype S1 (like all fixed-size string dtypes) is not supported and should be converted to object. Kind of puzzled why it is not, so I'll mark this as a bug.

@jreback added the Bug, Dtype Conversions (Unexpected or buggy dtype conversions), and Difficulty Intermediate labels Apr 11, 2016
@jreback added this to the 0.18.1 milestone Apr 11, 2016
@jreback changed the title from "Problems printing dataframe with datatype of 'S' or 'a' (fixed string size) of given size" to "BUG: not properly converting S1 in astype ,on PY3" Apr 11, 2016
@jreback
Contributor

jreback commented Apr 11, 2016

So .astype('U1') works as expected (IOW it coerces to object), but I think on PY3 we need to either raise on S dtypes or just coerce them as we do unicode, though the user is technically saying that they want to encode.

@jreback added the Error Reporting (Incorrect or improved errors from pandas) label Apr 11, 2016
@cchrysostomou
Author

Well, I was kind of hoping that datatype could be supported. When the data is represented as object, the memory it takes up is extremely high, when all I need is for each cell to take up a single byte. Everything except for printing seemed to work for me. Is there any workaround for this?
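For readers who only need the one-byte-per-cell layout, a minimal sketch under one assumption that is not part of the thread: the data is kept in a plain NumPy array of dtype `S1` instead of a DataFrame. It reuses the `lets`, `slen`, and `nseqs` values from the report and shows the storage size, a comparison mask, and a masked assignment.

```python
import random

import numpy as np

random.seed(0)
lets = 'ACDEFGHIJKLMNOP'
slen, nseqs = 50, 1000
words = [[random.choice(lets) for _ in range(slen)] for _ in range(nseqs)]

# Fixed-width byte strings: one byte per cell, no Python object per element.
arr = np.array(words, dtype='S1')
print(arr.shape, arr.dtype, arr.nbytes)

# Element-wise comparison against a bytes literal gives a boolean mask.
mask = arr == b'A'

# Boolean-mask assignment happens in place; no set of allowed values
# has to be declared beforehand.
arr[mask] = b'Z'
```

This gives up pandas indexing and labels, so it may only suit cases where positional access is enough.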

@jreback
Contributor

jreback commented Apr 11, 2016

you are much better off using categoricals

# your frame
In [17]: df.memory_usage(deep=True).sum()
Out[17]: 2300072

In [18]: uniques = np.sort(pd.unique(df.values.ravel()))

# converted to categoricals (I happen to preserve the mappings, but it's actually not necessary)
In [19]: df.apply(lambda x: x.astype('category',categories=uniques)).memory_usage(deep=True).sum()
Out[19]: 84572
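
The session above passes `categories=` directly to `astype('category', ...)`. To my knowledge, later pandas releases no longer accept that keyword, so the sketch below assumes a pandas version that provides `pandas.api.types.CategoricalDtype` and uses it for the same comparison. The frame is rebuilt from the reporter's parameters with an explicit object dtype; exact byte counts will differ by version and platform.

```python
import random

import numpy as np
import pandas as pd
from pandas.api.types import CategoricalDtype

random.seed(0)
lets = 'ACDEFGHIJKLMNOP'
words = [[random.choice(lets) for _ in range(50)] for _ in range(1000)]

# Object-dtype frame: every cell is a separate Python string.
df = pd.DataFrame(words, dtype=object)

# Shared, sorted categories so every column uses the same mapping.
uniques = np.sort(pd.unique(df.values.ravel()))
cat_dtype = CategoricalDtype(categories=uniques)
df_cat = df.astype(cat_dtype)

object_bytes = df.memory_usage(deep=True).sum()
category_bytes = df_cat.memory_usage(deep=True).sum()
print(object_bytes, category_bytes)
```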

@cchrysostomou
Author

OK I can go that route, but now I am having some functionality issues. Some things that worked before, no longer work when I set it as a category. If you don't think this is pertinent to the issue, then should I just send you a personal message of what I am trying to do and some sample code?

#  set my frame as category
uniques = np.sort(pd.unique(df.values.ravel()))
df = df.apply(lambda x: x.astype('category', categories=uniques))

# slicing and search operations
df_ints = pd.DataFrame(np.zeros((10000, 500)))
df_ints[5,3] = 1
# when df is a category, I cannot do the following
df[df_ints==0] = 'Z'  
# this also raises an error
df_ints == 'A'

@jreback
Contributor

jreback commented Apr 11, 2016

Categoricals have a set of allowed values, IOW the categories themselves. You can add the new value to the categories:

In [75]: df2 = df.apply(lambda x: x.astype('category', categories=uniques.tolist() + ['Z']))

In [77]: df2.iloc[0,1] = 'Z'
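
A small sketch of the same restriction on a toy `pd.Categorical`, assuming a pandas version that has `Series.cat.add_categories`. The names `cat`, `ser`, and `rejected` are illustrative only. The exception raised for an undeclared value has differed between pandas versions, so both candidates are caught.

```python
import pandas as pd

cat = pd.Categorical(['A', 'C', 'A', 'D'], categories=['A', 'C', 'D'])

# Assigning a value outside the declared categories is rejected.
try:
    cat[1] = 'Z'
    rejected = False
except (TypeError, ValueError):  # exception type varies by pandas version
    rejected = True

# Declaring 'Z' first makes the same assignment legal.
ser = pd.Series(cat).cat.add_categories(['Z'])
ser.iloc[1] = 'Z'
print(rejected, ser.tolist())
```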

@cchrysostomou
Author

Whoops that was a bad example, my mistake. What I was trying to show was that I cannot use the dataframe df_ints to change values:

df[df_ints==0] = 'A' # where 'A' is already defined in set.
or find where df is a:
df[df=='A']

@jreback
Contributor

jreback commented Apr 11, 2016

Hmm, that should work, see #12861. Well, good of you to test this out!
In the meantime you can do .astype('U1') to save some memory (and of course pull requests to fix issues are always welcome!)

@jreback modified the milestones: 0.18.1, 0.18.2 Apr 26, 2016
@jorisvandenbossche modified the milestones: Next Major Release, 0.19.0 Aug 21, 2016
@jreback modified the milestones: 0.20.0, Next Major Release Mar 5, 2017
@jreback modified the milestones: 0.20.0, Next Major Release Mar 23, 2017
@jreback modified the milestones: Next Major Release, 0.20.0 Mar 23, 2017
@mroeschke
Member

This looks fixed on master. Could use a test.

@mroeschke added the good first issue and Needs Tests (Unit test(s) needed to prevent regressions) labels and removed the Difficulty Intermediate label Feb 22, 2019
@topper-123
Contributor

Removing the p2/p3 compat label, as Python2 is being dropped and this issue still needs tests.

@TomAugspurger
Contributor

This was fixed by #30327 (ccbe7be specifically I think).

7 participants