
BUG: various bug fixes for DataFrame/Series construction #2752

Merged Feb 10, 2013 (1 commit)


jreback
Contributor

@jreback jreback commented Jan 25, 2013

0 and 1 len ndarrays were not inferring dtype correctly
datetimes passed as single objects were not inferring dtype
mixed datetimes and objects (GH #2751) were casting datetimes to object
timedelta64 is now created on Series subtraction (of datetime64[ns])
astype on datetimes to object is now handled (NaT values are converted to np.nan)

astype conversion

In [4]: s = pd.Series([datetime.datetime(2001, 1, 2, 0, 0) for i in range(3)])

In [5]: s.dtype
Out[5]: dtype('datetime64[ns]')

In [6]: s[1] = np.nan

In [7]: s
Out[7]: 
0   2001-01-02 00:00:00
1                   NaT
2   2001-01-02 00:00:00

In [8]: s.dtype
Out[8]: dtype('datetime64[ns]')

In [9]: s = s.astype('O')

In [10]: s
Out[10]: 
0    2001-01-02 00:00:00
1                    NaN
2    2001-01-02 00:00:00

In [11]: s.dtype
Out[11]: dtype('object')
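
For readers who want to try the astype conversion outside an IPython session, here is a minimal standalone sketch. It assumes a recent pandas, where the missing entry may print as NaT rather than NaN after the cast and the datetime resolution may differ from `ns`, so it checks only the dtype kind and uses `pd.isna` instead of relying on the printed value.

```python
from datetime import datetime

import numpy as np
import pandas as pd

# Three identical datetimes are inferred as a datetime64 Series.
s = pd.Series([datetime(2001, 1, 2) for _ in range(3)])

# Assigning np.nan keeps the datetime64 dtype; the slot becomes a missing value.
s.iloc[1] = np.nan

# Casting to object boxes each element as a Python-level object.
as_object = s.astype(object)

print(s.dtype, as_object.dtype)
print(pd.isna(as_object.iloc[1]))  # the missing slot is still recognised as missing
```

`pd.isna` treats both NaT and np.nan as missing, which makes the check independent of how a given pandas version represents the converted value.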

construction with datetimes

In [7]: pd.DataFrame({'A' : 1, 'B' : 'foo', 'C' : 'bar', 
    'D' : pd.Timestamp("20010101"),
    'E' : datetime.datetime(2001,1,2,0,0) }, index=np.arange(5)).get_dtype_counts()
Out[7]: 
datetime64[ns]    2
int64             1
object            2

In [16]: pd.DataFrame({'a':[1,2,4,7], 'b':[1.2, 2.3, 5.1, 6.3], 
    'c':list('abcd'), 
    'd':[datetime.datetime(2000,1,1) for i in range(4)] }).get_dtype_counts()
Out[16]: 
datetime64[ns]    1
float64           1
int64             1
object            1

1 len ndarrays

In [16]: pd.DataFrame({'a': 1., 'b': 2, 'c': 'foo', 'float64' : np.array(1.,dtype='float64'),
    'int64' : np.array(1,dtype='int64')}, index=np.arange(10)).get_dtype_counts()
Out[16]: 
float64    2
int64      2
object     1

        0 and 1 len ndarrays
        datetimes that are single objects
        mixed datetimes and objects (GH pandas-dev#2751)
        astype now converts correctly with a datetime64 type to object, NaT are converted to np.nan
        _get_numeric_data with empty mixed-type returning empty, but index was missing
DOC: release notes updated, added missing_data section to docs, whatsnew 0.10.2
wesm added a commit that referenced this pull request Feb 10, 2013
@wesm wesm merged commit 132d90d into pandas-dev:master Feb 10, 2013
@wesm
Member

wesm commented Feb 10, 2013

Thanks jeff

@stephenwlin
Contributor

fyi: this combined with #2708 is causing test_constructor_with_datetimes and test_get_numeric_data to fail on 32-bit, because the tests assume that integer dtypes default to int64

@jreback
Contributor Author

jreback commented Feb 10, 2013

just saw that
I can do a PR to fix the test a bit later, or
you are welcome to do it if you want

@stephenwlin
Contributor

actually, here's what i have:

        intname = np.dtype(np.int_).name
        floatname = np.dtype(np.float_).name
        datetime64name = np.dtype('M8[ns]').name
        objectname = np.dtype(np.object_).name

        # single item
        df = DataFrame({'A' : 1, 'B' : 'foo', 'C' : 'bar', 'D' : Timestamp("20010101"), 'E' : datetime(2001,1,2,0,0) },
                       index=np.arange(10))
        result = df.get_dtype_counts()
        expected = Series({intname: 1, datetime64name: 2, objectname : 2})
        assert_series_equal(result, expected)

that's working fine

but this is not:

        # GH #2751 (construction with no index specified)
        df = DataFrame({'a':[1,2,4,7], 'b':[1.2, 2.3, 5.1, 6.3], 'c':list('abcd'), 'd':[datetime(2000,1,1) for i in range(4)] })
        result = df.get_dtype_counts()
        expected = Series({intname: 1, floatname : 1, datetime64name: 1, objectname : 1})
        assert_series_equal(result, expected)

because the integer list is being promoted to int64, even on 32-bit. is that intended?

see commit 9a047679c2e6f2064fb4b656a2461cceba7df679 (b05b97986f779cdd9007f281c4255c0d31ab263c below, if it's still showing, is out of date)
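
A rough standalone version of the same idea, looking dtype names up at runtime instead of hard-coding `int64`, might look like the sketch below. It is written against a recent pandas, where `get_dtype_counts()` is assumed to be unavailable, so `df.dtypes.astype(str).value_counts()` is used as a stand-in; the variable names are illustrative and not taken from the pandas test suite.

```python
from datetime import datetime

import numpy as np
import pandas as pd

# Look the names up at runtime so nothing platform specific is hard-coded.
int_name = np.dtype(np.int_).name       # platform default integer
float_name = np.dtype(np.float64).name  # 'float64'

df = pd.DataFrame({
    "a": [1, 2, 4, 7],
    "b": [1.2, 2.3, 5.1, 6.3],
    "c": list("abcd"),
    "d": [datetime(2000, 1, 1) for _ in range(4)],
})

# Stand-in for get_dtype_counts(): count columns per dtype name.
dtype_counts = df.dtypes.astype(str).value_counts()

print(dtype_counts)
print("platform int name:", int_name)
```

Comparing the printed counts with `int_name` shows directly whether the integer column followed the platform default or was promoted.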

@stephenwlin
Contributor

In [29]: p.DataFrame([1,2]).dtypes
Out[29]: 
0    int32
Dtype: object

In [30]: p.DataFrame({'a': [1,2]}).dtypes
Out[30]: 
a    int64
Dtype: object
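
The comparison above can be repeated as a small script. This is only a sketch: on a recent pandas both constructors are expected to agree, so the int32 result is specific to the 2013 32-bit build under discussion.

```python
import numpy as np
import pandas as pd

# Same integers, two construction paths.
from_list = pd.DataFrame([1, 2]).dtypes.iloc[0]
from_dict = pd.DataFrame({"a": [1, 2]}).dtypes.iloc[0]

# NumPy's platform default integer, for reference.
platform_int = np.dtype(np.int_)

consistent = from_list == from_dict
print(from_list, from_dict, platform_int, consistent)
```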

@jreback
Contributor Author

jreback commented Feb 10, 2013

this does look inconsistent
will see if I can track it down

is your example on 32-bit?

(default for ints should be np.int_)

  • platform dependent (should not be upcast)
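
As background on "platform dependent": in NumPy releases before 2.0, `np.int_` corresponds to the C `long` type, whose width differs between platforms. The standard-library sketch below reports the native widths on the machine it runs on; the helper names are illustrative only.

```python
import struct


def c_long_bits() -> int:
    """Width of the native C `long` in bits."""
    return struct.calcsize("l") * 8


def pointer_bits() -> int:
    """Width of a native pointer in bits (32 on a 32-bit build)."""
    return struct.calcsize("P") * 8


print("C long:", c_long_bits(), "bits")
print("pointer:", pointer_bits(), "bits")
```

On a 32-bit build both values are typically 32, which is why a test that hard-codes `int64` can pass on one machine and fail on another.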


@stephenwlin
Contributor

yeah, 32-bit

@stephenwlin
Contributor

ok, just fyi to save you time, I've narrowed it down to lib.maybe_convert_objects, called by way of series._sanitize_array

here's the relevant lines, with some print statements added (and what they print when calling p.DataFrame.from_dict({'a': [1,2]}))

            print data, np.asarray(data).dtype 
            # "[1, 2] int32"
            subarr = lib.list_to_object_array(data)
            print subarr, subarr.dtype
            # "[1 2] object"
            print [np.asarray(x).dtype for x in subarr]
            # "[dtype('int32'), dtype('int32')]"
            subarr = lib.maybe_convert_objects(subarr)
            print subarr, subarr.dtype
            # "[1 2] int64"
            subarr = com._possibly_cast_to_datetime(subarr, dtype)
            print subarr, subarr.dtype
            # "[1 2] int64"

i can look at it further if you want, but since you've been working on this I'm guessing you're more familiar with it and less likely to break something else with a fix (if this needs to be fixed)
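
The object-array round trip traced above can be approximated with public APIs. In this sketch, `Series.infer_objects()` is used as a stand-in for the internal `lib.maybe_convert_objects` call; that substitution is an assumption made for illustration, and the resulting integer width depends on the pandas version and platform.

```python
import numpy as np
import pandas as pd

data = [1, 2]

# NumPy picks its platform default integer for a plain list.
as_platform = np.asarray(data)

# Boxing the values first discards the dtype, as list_to_object_array does.
as_objects = np.array(data, dtype=object)

# Public soft conversion of object data back to a concrete dtype.
inferred = pd.Series(as_objects).infer_objects()

print(as_platform.dtype, as_objects.dtype, inferred.dtype)
```

If the first and last printed dtypes differ, the conversion step, not NumPy, chose the final integer width.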

@jreback
Contributor Author

jreback commented Feb 10, 2013

hmm
I didn't change anything here
but if I recall correctly, maybe_convert_objects DOES upcast (so I will have to take that out, and hope it doesn't break anything else)


@stephenwlin
Contributor

ok made a PR to just fix the build for now (leaving that particular test referencing int64 instead of platform int)

wesm pushed a commit that referenced this pull request Feb 11, 2013
@jreback jreback mentioned this pull request Feb 12, 2013