BUG: extending object series on assignment with datetime coerces to int #18410

Closed
bear0330 opened this issue Nov 21, 2017 · 9 comments
Labels
Bug Datetime Datetime data dtype Dtype Conversions Unexpected or buggy dtype conversions

Comments

@bear0330

bear0330 commented Nov 21, 2017

"Assigning with extending" an object series with a datetime / timestamp introduces an int:

In [63]: s = pd.Series([pd.Timestamp("2016-01-01")])

In [64]: s
Out[64]: 
0   2016-01-01
dtype: datetime64[ns]

In [65]: s[1] = datetime.datetime(2016, 1, 2)

In [66]: s
Out[66]: 
0   2016-01-01
1   2016-01-02
dtype: datetime64[ns]

In [67]: s = pd.Series([pd.Timestamp("2016-01-01")], dtype=object)

In [68]: s[1] = datetime.datetime(2016, 1, 2)

In [69]: s
Out[69]: 
0    2016-01-01 00:00:00
1    1451692800000000000
dtype: object
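For reference, the integer in row 1 is consistent with the assigned datetime expressed as nanoseconds since the Unix epoch. A minimal sketch to check this, relying only on `pd.Timestamp` treating a bare integer as nanoseconds (the variable names are illustrative):

```python
import pandas as pd

# Integer that replaced the datetime in the object Series above.
coerced = 1451692800000000000

# pd.Timestamp interprets a bare integer as nanoseconds since the Unix epoch.
recovered = pd.Timestamp(coerced)

print(recovered)  # 2016-01-02 00:00:00
```

So the value itself is not lost; only its type changes on assignment.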

original report:

Code Sample, a copy-pastable example if possible

import urllib.request
import datetime as dt
import pandas as pd

def get_google_data(symbol, period, window, exch = 'NYSE'):
    url_root = ('http://www.google.com/finance/getprices?i='
                + str(period) + '&p=' + str(window)
                + 'd&f=d,o,h,l,c,v&df=cpct&x=' + exch.upper()
                + '&q=' + symbol.upper())
    response = urllib.request.urlopen(url_root)
    data=response.read().decode().split('\n')       #decode() required for Python 3
    data = [data[i].split(',') for i in range(len(data)-1)]
    header = data[0:7]
    data = data[7:]
    header[4][0] = header[4][0][8:]                 #get rid of 'Columns:' for label row
    df=pd.DataFrame(data, columns=header[4])
    df = df.dropna()                                #to fix the inclusion of more timezone shifts in the .csv returned from the goog api
    df.index = range(len(df))                       #fix the index from the previous dropna()

    ind=pd.Series(len(df))
    for i in range(len(df)):
        if df['DATE'].ix[i][0] == 'a':
            anchor_time = dt.datetime.fromtimestamp(int(df['DATE'].ix[i][1:]))  #make datetime object out of 'a' prefixed unix timecode
            ind[i]=anchor_time
        else:
            t = anchor_time +dt.timedelta(seconds = (period * int(df['DATE'].ix[i])))
            ind[i] = t
    df.index = ind

    df=df.drop('DATE', 1)

    for column in df.columns:                
        df[column]=pd.to_numeric(df[column])

    return df

Problem description

Before 0.21 (0.20.3, 0.19, etc.) I got the expected result when I called

>>> from get_google_data import get_google_data
>>> get_google_data('NUGT', 300, 200, 'NYSEARCA')
                       CLOSE     HIGH      LOW     OPEN    VOLUME
2017-09-13 04:00:00  39.5500  39.8400  37.9201  38.4100   4506107
2017-09-13 04:05:00  37.3800  39.3300  37.0800  39.1500   5723511
2017-09-13 04:10:00  38.0600  38.4400  36.5000  37.3000   5892153
...

But after upgrading to 0.21, I got

>>> from get_google_data import get_google_data
>>> get_google_data('NUGT', 300, 20, 'NYSEARCA')
                       CLOSE     HIGH      LOW     OPEN    VOLUME
2017-10-25 04:00:00  31.02  31.8500  30.8500  31.5000   6947080
1508904600000000000  30.43  31.0800  29.8500  31.0300   7214226
1508905200000000000  28.83  30.5800  28.4100  30.4300  10026369
1508905800000000000  29.42  29.6500  28.4000  28.8600   6217398
...

The datetime objects assigned to the index become ints. The assignment in question is:
ind[i] = t

Expected Output

                       CLOSE     HIGH      LOW     OPEN    VOLUME
2017-09-13 04:00:00  39.5500  39.8400  37.9201  38.4100   4506107
2017-09-13 04:05:00  37.3800  39.3300  37.0800  39.1500   5723511
2017-09-13 04:10:00  38.0600  38.4400  36.5000  37.3000   5892153

Output of pd.show_versions()

0.21

@jreback
Contributor

jreback commented Nov 22, 2017

can you make a much simpler example

@omytea

omytea commented Nov 29, 2017

I am facing a similar issue. If I store a DataFrame with a date index to HDF5 in table format, the index comes back as integers.

In [187]:
from datetime import datetime, timedelta
df = pd.DataFrame(np.random.random((40, 9)), index=[datetime.now().date() - timedelta(days=i) for i in range(40, 0, -1)])

df.to_hdf('d:/tmp/tmp.h5', 'test', format='table', append=False)
df.index.dtype, df.index

Out[187]:
(dtype('O'),
 Index([2017-10-20, 2017-10-21, 2017-10-22, 2017-10-23, 2017-10-24, 2017-10-25,
        2017-10-26, 2017-10-27, 2017-10-28, 2017-10-29, 2017-10-30, 2017-10-31,
        2017-11-01, 2017-11-02, 2017-11-03, 2017-11-04, 2017-11-05, 2017-11-06,
        2017-11-07, 2017-11-08, 2017-11-09, 2017-11-10, 2017-11-11, 2017-11-12,
        2017-11-13, 2017-11-14, 2017-11-15, 2017-11-16, 2017-11-17, 2017-11-18,
        2017-11-19, 2017-11-20, 2017-11-21, 2017-11-22, 2017-11-23, 2017-11-24,
        2017-11-25, 2017-11-26, 2017-11-27, 2017-11-28],
       dtype='object'))

In [188]:
index = pd.read_hdf('d:/tmp/tmp.h5', 'test').index
index.dtype, index

Out[188]:
(dtype('int64'),
 Int64Index([736622, 736623, 736624, 736625, 736626, 736627, 736628, 736629,
             736630, 736631, 736632, 736633, 736634, 736635, 736636, 736637,
             736638, 736639, 736640, 736641, 736642, 736643, 736644, 736645,
             736646, 736647, 736648, 736649, 736650, 736651, 736652, 736653,
             736654, 736655, 736656, 736657, 736658, 736659, 736660, 736661],
            dtype='int64'))

It does not happen when storing in fixed format.

In [191]:
df.to_hdf('d:/tmp/tmp.h5', 'test', format='fixed')
df.index.dtype, df.index

Out[191]:
(dtype('O'),
 Index([2017-10-20, 2017-10-21, 2017-10-22, 2017-10-23, 2017-10-24, 2017-10-25,
        2017-10-26, 2017-10-27, 2017-10-28, 2017-10-29, 2017-10-30, 2017-10-31,
        2017-11-01, 2017-11-02, 2017-11-03, 2017-11-04, 2017-11-05, 2017-11-06,
        2017-11-07, 2017-11-08, 2017-11-09, 2017-11-10, 2017-11-11, 2017-11-12,
        2017-11-13, 2017-11-14, 2017-11-15, 2017-11-16, 2017-11-17, 2017-11-18,
        2017-11-19, 2017-11-20, 2017-11-21, 2017-11-22, 2017-11-23, 2017-11-24,
        2017-11-25, 2017-11-26, 2017-11-27, 2017-11-28],
       dtype='object'))

In [192]:
index = pd.read_hdf('d:/tmp/tmp.h5', 'test').index
index.dtype, index

Out[192]:
(dtype('O'),
 Index([2017-10-20, 2017-10-21, 2017-10-22, 2017-10-23, 2017-10-24, 2017-10-25,
        2017-10-26, 2017-10-27, 2017-10-28, 2017-10-29, 2017-10-30, 2017-10-31,
        2017-11-01, 2017-11-02, 2017-11-03, 2017-11-04, 2017-11-05, 2017-11-06,
        2017-11-07, 2017-11-08, 2017-11-09, 2017-11-10, 2017-11-11, 2017-11-12,
        2017-11-13, 2017-11-14, 2017-11-15, 2017-11-16, 2017-11-17, 2017-11-18,
        2017-11-19, 2017-11-20, 2017-11-21, 2017-11-22, 2017-11-23, 2017-11-24,
        2017-11-25, 2017-11-26, 2017-11-27, 2017-11-28],
       dtype='object'))

@jreback
Contributor

jreback commented Nov 29, 2017

@omytea you are using datetime.date, which is not a first-class type; wrap the index creation in pd.to_datetime and it will work.
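A minimal sketch of that suggestion, without the HDF5 round trip. It also checks one reading of the integers in the report: they match proleptic Gregorian ordinals as returned by `datetime.date.toordinal()`, which is an observation about the values and not a statement about what the table format does internally. Variable names are illustrative.

```python
import datetime

import pandas as pd

# Plain datetime.date objects, as in the report above.
dates = [datetime.date(2017, 10, 20) + datetime.timedelta(days=i) for i in range(3)]

# A bare list of dates gives an object-dtype Index.
object_index = pd.Index(dates)

# Wrapping the same list in pd.to_datetime gives a DatetimeIndex instead.
datetime_index = pd.to_datetime(dates)

# 736622, the first integer read back in the report, maps to the first date.
first_date = datetime.date.fromordinal(736622)

print(object_index.dtype)
print(type(datetime_index).__name__)
print(first_date)
```

Passing `datetime_index` as the `index=` argument when building the DataFrame is the change suggested above.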

@jreback
Contributor

jreback commented Nov 29, 2017

closing as not repro

jreback closed this as completed Nov 29, 2017
@jorisvandenbossche
Member

So, a small example with 0.20.3:

In [55]: import datetime

In [56]: s = pd.Series(5)

In [57]: for i in range(5):
    ...:     s[i] = datetime.datetime(2016, 1, i+1)  
    ...:     

In [58]: s
Out[58]: 
0   2016-01-01
1   2016-01-02
2   2016-01-03
3   2016-01-04
4   2016-01-05
dtype: datetime64[ns]

and with master:

In [22]: import datetime

In [23]: s = pd.Series(5)

In [24]: for i in range(5):
    ...:     s[i] = datetime.datetime(2016, 1, i+1)  

In [25]: s
Out[25]: 
0    2016-01-01 00:00:00
1    1451692800000000000
2    1451779200000000000
3    1451865600000000000
4    1451952000000000000
dtype: object

@jorisvandenbossche
Member

So there are two different things under the hood:

1. "Assigning with extending" an object series with a datetime / timestamp introduces an int:

In [63]: s = pd.Series([pd.Timestamp("2016-01-01")])

In [64]: s
Out[64]: 
0   2016-01-01
dtype: datetime64[ns]

In [65]: s[1] = datetime.datetime(2016, 1, 2)

In [66]: s
Out[66]: 
0   2016-01-01
1   2016-01-02
dtype: datetime64[ns]

In [67]: s = pd.Series([pd.Timestamp("2016-01-01")], dtype=object)

In [68]: s[1] = datetime.datetime(2016, 1, 2)

In [69]: s
Out[69]: 
0    2016-01-01 00:00:00
1    1451692800000000000
dtype: object

The above happens both on master and in 0.20.3 (so it didn't change), and looks like a bug to me.

2. When you have a series of int and assign a Timestamp / datetime into it, previously it changed to datetime64 dtype, now it changes to object dtype:

In [74]: pd.__version__
Out[74]: '0.20.3'

In [75]: s = pd.Series([1])

In [76]: s
Out[76]: 
0    1
dtype: int64

In [77]: s[0] = datetime.datetime(2016, 1, 1)

In [78]: s
Out[78]: 
0   2016-01-01
dtype: datetime64[ns]

In [35]: pd.__version__
Out[35]: '0.22.0.dev0+260.g27931f6.dirty'

In [36]: s = pd.Series([1])

In [37]: s
Out[37]: 
0    1
dtype: int64

In [38]: s[0] = datetime.datetime(2016, 1, 1)

In [39]: s
Out[39]: 
0    2016-01-01 00:00:00
dtype: object

That change caused the 'bug' to surface, but it was a good change: we should not special-case a Series of length 1 (if you have a longer int series and assign a Timestamp into it, you end up with a mixed object dtype).
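How a direct assignment into an int64 Series is upcast has varied between pandas versions, so the sketch below makes the mixed object dtype explicit by casting first. This is one way to get a version-independent result under that assumption, not the behaviour the thread is describing. Variable names are illustrative.

```python
import datetime

import pandas as pd

s = pd.Series([1, 2, 3])

# Cast to object up front so the mixed dtype is explicit and does not depend
# on how a given pandas version upcasts on assignment.
mixed = s.astype(object)
mixed.iloc[0] = datetime.datetime(2016, 1, 1)

print(mixed.dtype)
print(mixed.iloc[0], mixed.iloc[1])
```

The original integer Series `s` is left untouched because `astype` returns a new object.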

@jorisvandenbossche
Member

@bear0330 you can easily fix your code by changing the ind=pd.Series(len(df)) line to not create an integer series (ideally you should also not iteratively expand a series, but that requires some more changes)
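One way to follow both suggestions is to collect the timestamps in a plain list and build the index once. The helper below is hypothetical (`build_datetime_index` is not part of pandas or of the original script), and it assumes the 'a'-prefixed codes are Unix seconds read as UTC, whereas the original code used local time via `datetime.fromtimestamp`.

```python
import pandas as pd


def build_datetime_index(date_codes, period):
    """Hypothetical helper: turn 'a<unix>' anchors and integer offsets into a DatetimeIndex."""
    stamps = []
    anchor = None
    for code in date_codes:
        if code.startswith("a"):
            # Anchor row: 'a' followed by a Unix timestamp in seconds (assumed UTC).
            anchor = pd.Timestamp(int(code[1:]), unit="s")
            stamps.append(anchor)
        else:
            if anchor is None:
                raise ValueError("offset row encountered before any anchor row")
            # Offset row: number of periods elapsed since the last anchor.
            stamps.append(anchor + pd.Timedelta(seconds=period * int(code)))
    # Build the index in one step; no integer Series, no iterative enlargement.
    return pd.DatetimeIndex(stamps)


index = build_datetime_index(["a1508904000", "1", "2"], period=300)
print(index)
```

In the original function this would stand in for the `ind = pd.Series(len(df))` loop, with `df.index = build_datetime_index(df['DATE'], period)`.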

jorisvandenbossche changed the title from "Assign datetime object to index becomes int after upgrade to 0.21" to "BUG: extending object series on assignment with datetime coerces to int" Nov 30, 2017
@jorisvandenbossche
Member

Updated top post and title to only deal with issue 1. from #18410 (comment)

@mroeschke
Member

Closing in favor of #13910, but will reference examples from this issue.
