Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Regression in make_block_same_class (tests failing for new fastparquet release) #19431

Closed
jorisvandenbossche opened this issue Jan 28, 2018 · 12 comments
Labels
IO Parquet parquet, feather
Milestone

Comments

@jorisvandenbossche
Copy link
Member

Travis is red for master at the moment due to some changes we did in the internals (#19265)

Issue on fastparquet is here: dask/fastparquet#297

@jorisvandenbossche jorisvandenbossche changed the title TST: tests failing for new fastparquet release Regression in make_block_same_class (tests failing for new fastparquet release) Jan 28, 2018
@jreback
Copy link
Contributor

jreback commented Jan 28, 2018

no this was working before
this is fastparquet breaking things

@jorisvandenbossche
Copy link
Member Author

From a quick look, I think this is a regression in the changes to make_block_same_class that we did. In case of a DatetimeTZ block, you are required to pass a dtype.

In [4]: df = pd.DataFrame({'a': pd.date_range("2012-01-01", periods=3, tz='Europe/Brussels')})

In [8]: b = df._data.blocks[0]

In [9]: b.make_block_same_class(np.empty(5, np.dtype('datetime64[ns]')))
...
ValueError: cannot create a DatetimeTZBlock without a tz

In [10]: b.make_block_same_class(np.empty(5, np.dtype('datetime64[ns]')), dtype=pd.api.types.DatetimeTZDtype(unit='ns', tz='Europe/Brussels'))
...
TypeError: make_block_same_class() got an unexpected keyword argument 'dtype'

@jorisvandenbossche
Copy link
Member Author

this is fastparquet breaking things

Fastparquet fixed their tz handling for released pandas (it is working for that), we changed how make_block_same_class works on master, which breaks their code.
Of course internals is rather internal, but as far as I can see, the change we did makes difficult to use make_block_same_class giving it a numpy array (instead of DatetimeIndex).

@minggli
Copy link
Contributor

minggli commented Jan 28, 2018

looking at the same issue here. #19265 seems to have caused side-effects for fastparquet to fail in Travis CI. I've tried to restore **kwargs in Block class so it no longer raises TypeError, but https://github.com/pandas-dev/pandas/blob/master/pandas/tests/io/test_parquet.py still breaks for an assertion error.

@jorisvandenbossche
Copy link
Member Author

It might be that the test needs to be updated independent from the bocks fix. As before fastparquet didn't handle tz columns, which was asserted in the tests I think. The latest release now handles them correctly.

@minggli
Copy link
Contributor

minggli commented Jan 28, 2018

yes, I wonder why the test looks weird. will fix update the test now that fastparquet handles timezones.

@jreback
Copy link
Contributor

jreback commented Jan 28, 2018

@mdurant you cannot use make_block_same_klass like this, instead use make_block

@jreback jreback added the IO Parquet parquet, feather label Jan 28, 2018
@jreback
Copy link
Contributor

jreback commented Jan 28, 2018

this is the correct usage.

In [10]: from pandas.core.internals import make_block

In [11]: make_block(pd.DatetimeIndex([], tz='Europe/Brussels'), placement=[0])
Out[11]: DatetimeTZBlock: slice(0, 1, 1), 1 x 0, dtype: datetime64[ns, Europe/Brussels]

@martindurant
Copy link
Contributor

@jreback , there are three calls to make_block_same_klass in the code, and I do not know how they should be transformed into make_block, especially given the mystery parameter placement=. Your help here would be appreciated (in fact, the current method was, of course, your suggestion some 15 months ago ago).

In fact, I believe that the empty function ought to reside in pandas, as it is generally useful, but could do with more polish and rigour. That way we can avoid the cross-library problem.

@jreback
Copy link
Contributor

jreback commented Jan 29, 2018

you can just use make_block directly.

@martindurant
Copy link
Contributor

I don't know how to call make_block, e.g.

        if (new_blknos == -1).any():
>           raise AssertionError("Gaps in blk ref_locs")
E           AssertionError: Gaps in blk ref_locs

@jorisvandenbossche
Copy link
Member Author

you can pass the block.mgr_locs of the existing block for placement in make_block to have a correct value for that. But, then you are actually just replicating what the make_block_same_class is doing ..

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
IO Parquet parquet, feather
Projects
None yet
Development

No branches or pull requests

4 participants