REF: repr - allow block to override values that get formatted #17143

Merged
8 changes: 8 additions & 0 deletions pandas/core/internals.py
@@ -159,6 +159,10 @@ def internal_values(self, dtype=None):
         """
         return self.values

+    def formatting_values(self):
+        """Return the internal values used by the DataFrame/SeriesFormatter"""
+        return self.internal_values()
+
     def get_values(self, dtype=None):
         """
         return an internal format, currently just the ndarray
@@ -4316,6 +4320,10 @@ def external_values(self):
     def internal_values(self):
         return self._block.internal_values()

+    def formatting_values(self):
+        """Return the internal values used by the DataFrame/SeriesFormatter"""
+        return self._block.formatting_values()
+
     def get_values(self):
         """ return a dense type view """
         return np.array(self._block.to_dense(), copy=False)
6 changes: 6 additions & 0 deletions pandas/core/series.py
@@ -398,6 +398,12 @@ def _values(self):
         """ return the internal repr of this data """
         return self._data.internal_values()

+    def _formatting_values(self):
+        """Return the values that can be formatted (used by SeriesFormatter
+        and DataFrameFormatter)
+        """
+        return self._data.formatting_values()
+
     def get_values(self):
         """ same as values (but handles sparseness conversions); is a view """
         return self._data.get_values()
6 changes: 4 additions & 2 deletions pandas/io/formats/format.py
@@ -237,7 +237,8 @@ def _get_formatted_index(self):
         return fmt_index, have_header

     def _get_formatted_values(self):
-        return format_array(self.tr_series._values, None,
+        values_to_format = self.tr_series._formatting_values()
+        return format_array(values_to_format, None,
                             float_format=self.float_format, na_rep=self.na_rep)

     def to_string(self):
@@ -694,7 +695,8 @@ def to_latex(self, column_format=None, longtable=False, encoding=None,
     def _format_col(self, i):
         frame = self.tr_frame
         formatter = self._get_formatter(i)
-        return format_array(frame.iloc[:, i]._values, formatter,
+        values_to_format = frame.iloc[:, i]._formatting_values()
+        return format_array(values_to_format, formatter,
                             float_format=self.float_format, na_rep=self.na_rep,
                             space=self.col_space, decimal=self.decimal)

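The delegation chain this diff introduces (formatter -> Series._formatting_values() -> manager.formatting_values() -> block.formatting_values()) can be sketched with toy stand-ins. The classes below (ToyBlock, PrefixedBlock, ToyManager, ToySeries) are hypothetical and are not the pandas classes; they only illustrate how a block subclass can change what gets printed without changing the stored values.

```python
class ToyBlock:
    """Hypothetical stand-in for a pandas Block (not the real class)."""

    def __init__(self, values):
        self.values = list(values)

    def internal_values(self):
        return self.values

    def formatting_values(self):
        # Default hook: format exactly what is stored internally.
        return self.internal_values()


class PrefixedBlock(ToyBlock):
    """Overrides only the formatting hook, like CustomBlock in the test."""

    def formatting_values(self):
        return ["Val: {}".format(v) for v in self.values]


class ToyManager:
    """Hypothetical stand-in for SingleBlockManager: pure delegation."""

    def __init__(self, block):
        self._block = block

    def internal_values(self):
        return self._block.internal_values()

    def formatting_values(self):
        return self._block.formatting_values()


class ToySeries:
    """Hypothetical stand-in for Series with a tiny repr."""

    def __init__(self, manager):
        self._data = manager

    @property
    def _values(self):
        return self._data.internal_values()

    def _formatting_values(self):
        return self._data.formatting_values()

    def __repr__(self):
        # The "formatter" asks for formatting values, not internal values.
        rows = self._formatting_values()
        return "\n".join("{}    {}".format(i, v) for i, v in enumerate(rows))


plain = ToySeries(ToyManager(ToyBlock([0, 1, 2])))
custom = ToySeries(ToyManager(PrefixedBlock([0, 1, 2])))
print(repr(custom))
```

In this sketch the stored data (`custom._values`) is unchanged; only the text produced by the repr differs between the two block types.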
Empty file.
28 changes: 28 additions & 0 deletions pandas/tests/internals/test_external_block.py
@@ -0,0 +1,28 @@
+# -*- coding: utf-8 -*-
Contributor:

need to add this in setup.py as well

+# pylint: disable=W0102
+
+import numpy as np
+
+import pandas as pd
+from pandas.core.internals import Block, BlockManager
+
+
+class CustomBlock(Block):
+
+    def formatting_values(self):
+        return np.array(["Val: {}".format(i) for i in self.values])
+
+
+def test_custom_repr():
+    values = np.arange(3)
+
+    # series
+    block = CustomBlock(values, placement=slice(0, 3))
Contributor:

I wouldn't pass fastpath, that's not really a public option.

Member Author:

If you don't use fastpath, it does not preserve the Block type. Eg:

In [53]: block = CustomBlock(values, placement=slice(0, 3))

In [54]: s = pd.Series(block, index=pd.RangeIndex(3), fastpath=True)

In [55]: s._data._block
Out[55]: CustomBlock: 3 dtype: int64

In [56]: s = pd.Series(block, index=pd.RangeIndex(3))

In [57]: s._data._block
Out[57]: ObjectBlock: 3 dtype: object

The reason is that we don't check for isinstance(data, Block) in the Series __init__ (we do check for isinstance(data, SingleBlockManager) in the non-fastpath code path).

For that reason I am also using the fastpath in GeoPandas.

Contributor:

hmm, that looks like a bug. if you change that does it break anything else? (could be followup as well)

Member Author:

In general it has proven a bit difficult to construct Series and DataFrame objects from given blocks, without re-creating the blocks (eg in Series, the block gets converted to array, which is then passed to SingleBlockManager, which does not preserve the block type)
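The type loss described in this thread can be illustrated with a toy constructor. The sketch below uses hypothetical names (ToyBlock, CustomToyBlock, build, build_checked) and does not reproduce the pandas Series constructor; it only shows why a path that unwraps its input and rebuilds a default container drops the subclass, while a trusted fast path or an explicit isinstance check keeps it.

```python
class ToyBlock:
    """Hypothetical stand-in for a pandas Block (not the real class)."""

    def __init__(self, values):
        self.values = list(values)


class CustomToyBlock(ToyBlock):
    """Hypothetical subclass whose type we would like to preserve."""


def build(data, fastpath=False):
    """Toy constructor: the slow path rewraps the raw values."""
    if fastpath:
        # Trust the caller and keep the object exactly as passed.
        return data
    # Slow path: pull out the raw values and wrap them in the default
    # block type, which silently discards any subclass.
    raw = data.values if isinstance(data, ToyBlock) else data
    return ToyBlock(raw)


def build_checked(data):
    """Toy constructor with the isinstance check the thread asks about."""
    if isinstance(data, ToyBlock):
        return data
    return ToyBlock(data)


custom = CustomToyBlock([0, 1, 2])
print(type(build(custom, fastpath=True)).__name__)  # CustomToyBlock
print(type(build(custom)).__name__)                 # ToyBlock
print(type(build_checked(custom)).__name__)         # CustomToyBlock
```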

Contributor:

you generally need to give it a SingleBlockManager or BlockManager

blocks are a lower-level item, and Series/DataFrame don't/shouldn't know about them

Member Author:

Yes, that's fine. But to slightly amend my comment: it has also proven difficult to add a block to a BlockManager while preserving the block type.

Once we have a BlockManager, it is indeed simply a matter of passing it to DataFrame(..) to get a df. That's what we do to create dataframes; I should probably take the same approach and create a SingleBlockManager for the series case instead of using that fastpath.

Contributor:

why is it hard to add a Block to BM
something is wrong if it's hard

Member Author:

you have BlockManager.set and BlockManager.insert methods to add things to a BlockManager, but those also do not preserve the block you pass

Member Author:

See here for a workaround that I now use in the geopandas refactor branch: geopandas/geopandas#467 (comment)
Basically it constructs a dataframe, takes the blocks and axes out of it, appends a Block to the blocks and a value to the column axis, and creates a new BlockManager from those.

(but given the length of the code in BlockManager.insert/set, this is maybe actually a simple way)

Member Author:

BTW, I added a commit with an attempt to remove the usage of fastpath

+    s = pd.Series(block, index=pd.RangeIndex(3), fastpath=True)
+    assert repr(s) == '0    Val: 0\n1    Val: 1\n2    Val: 2\ndtype: int64'
Contributor:

so the Windows tests fail because this is int32 there. you have to use is_platform_windows() to make this test pass (e.g. use int32 on Windows, else use int64). we do this somewhere else in the suite, in test_format.py I think.

Member Author:

I just specified the dtype as int64, I suppose that is fine as well?

Contributor:

yep thats good too
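A minimal sketch of the fix agreed on here, assuming only NumPy: np.arange without a dtype uses the platform's default integer (which can be 32-bit on some platforms, historically Windows), while an explicit dtype makes the "dtype: int64" line of the expected repr independent of the platform. The helper expected_series_repr is hypothetical and only mimics the shape of the string asserted in the test.

```python
import numpy as np

# Platform-dependent: the default integer type may be int32 or int64.
default_values = np.arange(3)

# Pinned: always int64, so a hard-coded "dtype: int64" stays valid.
pinned_values = np.arange(3, dtype="int64")


def expected_series_repr(values):
    """Hypothetical helper building the repr string the test compares to."""
    rows = ["{}    Val: {}".format(i, v) for i, v in enumerate(values)]
    rows.append("dtype: {}".format(values.dtype))
    return "\n".join(rows)


print(expected_series_repr(pinned_values))
```

Pinning the dtype keeps the test free of platform branches such as is_platform_windows(), at the cost of no longer exercising the platform default.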


+    # dataframe
+    block = CustomBlock(values.reshape(1, -1), placement=slice(0, 1))
+    blk_mgr = BlockManager([block], [['col'], range(3)])
+    df = pd.DataFrame(blk_mgr)
+    assert repr(df) == '      col\n0  Val: 0\n1  Val: 1\n2  Val: 2'