DOC: Add expanded index descriptors for specifying for RangeIndex-as-metadata in Parquet file schema (#25709)
wesm authored and jorisvandenbossche committed Aug 8, 2019
1 parent d320ef7 commit 78c6843
Showing 1 changed file with 41 additions and 19 deletions.
60 changes: 41 additions & 19 deletions doc/source/development/developer.rst
@@ -37,12 +37,19 @@ So that a ``pandas.DataFrame`` can be faithfully reconstructed, we store a

.. code-block:: text
   {'index_columns': ['__index_level_0__', '__index_level_1__', ...],
   {'index_columns': [<descr0>, <descr1>, ...],
    'column_indexes': [<ci0>, <ci1>, ..., <ciN>],
    'columns': [<c0>, <c1>, ...],
    'pandas_version': $VERSION}
    'pandas_version': $VERSION,
    'creator': {
      'library': $LIBRARY,
      'version': $LIBRARY_VERSION
    }}
Here, ``<c0>``/``<ci0>`` and so forth are dictionaries containing the metadata
The "descriptor" values ``<descr0>`` in the ``'index_columns'`` field are
strings (referring to a column) or dictionaries with values as described below.

The ``<c0>``/``<ci0>`` and so forth are dictionaries containing the metadata
for each column, *including the index columns*. This has JSON form:

.. code-block:: text
@@ -53,26 +60,37 @@ for each column, *including the index columns*. This has JSON form:
'numpy_type': numpy_type,
'metadata': metadata}
.. note::
See below for the detailed specification for these.

Index Metadata Descriptors
~~~~~~~~~~~~~~~~~~~~~~~~~~

``RangeIndex`` can be stored as metadata only, not requiring serialization. The
descriptor format for these is as follows:

Every index column is stored with a name matching the pattern
``__index_level_\d+__`` and its corresponding column information can be
found with the following code snippet.
.. code-block:: python
Following this naming convention isn't strictly necessary, but strongly
suggested for compatibility with Arrow.
   index = pd.RangeIndex(0, 10, 2)
   {'kind': 'range',
    'name': index.name,
    'start': index.start,
    'stop': index.stop,
    'step': index.step}
Here's an example of how the index metadata is structured in pyarrow:
Other index types must be serialized as data columns along with the other
DataFrame columns. The metadata for these is a string indicating the name of
the field in the data columns, for example ``'__index_level_0__'``.

.. code-block:: python
If an index has a non-None ``name`` attribute, and there is no other column
with a name matching that value, then the ``index.name`` value can be used as
the descriptor. Otherwise (for unnamed indexes and ones with names colliding
with other column names) a disambiguating name matching the pattern
``__index_level_\d+__`` should be used. When a named index is stored as a data
column, its ``name`` attribute is always stored in the column descriptors as
above.
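
The naming rule above can be sketched as a small pure-Python function. The
helper name ``descriptor_for_level`` and its arguments are hypothetical and
chosen only for illustration; this is not how pandas or pyarrow implement the
rule. The sketch assumes the index level is serialized as a data column (that
is, it is not a ``RangeIndex`` stored as metadata only).

```python
def descriptor_for_level(level_number, name, column_names):
    """Return the string descriptor for an index level stored as a data column.

    Hypothetical helper for illustration; not part of pandas or pyarrow.
    """
    if name is not None and name not in column_names:
        # A named index whose name does not collide with a data column
        # can be referred to by that name directly.
        return name
    # Unnamed indexes, and names that collide with a data column, fall
    # back to the disambiguating __index_level_<N>__ pattern.
    return '__index_level_{}__'.format(level_number)


# Made-up data column names, for illustration only.
data_columns = ['a', 'b']
named = descriptor_for_level(0, 'key', data_columns)
colliding = descriptor_for_level(1, 'a', data_columns)
unnamed = descriptor_for_level(2, None, data_columns)
```

Here ``named`` is the index name itself, while ``colliding`` and ``unnamed``
both receive a generated ``__index_level_<N>__`` name.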

   # assuming there's at least 3 levels in the index
   index_columns = metadata['index_columns']  # noqa: F821
   columns = metadata['columns']  # noqa: F821
   ith_index = 2
   assert index_columns[ith_index] == '__index_level_2__'
   ith_index_info = columns[-len(index_columns):][ith_index]
   ith_index_level_name = ith_index_info['name']
Column Metadata
~~~~~~~~~~~~~~~

``pandas_type`` is the logical type of the column, and is one of:

@@ -161,4 +179,8 @@ As an example of fully-formed metadata:
'numpy_type': 'int64',
'metadata': None}
],
'pandas_version': '0.20.0'}
'pandas_version': '0.20.0',
'creator': {
'library': 'pyarrow',
'version': '0.13.0'
}}
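
As a rough sketch of how a reader might consume the expanded
``'index_columns'`` field described in this change, the snippet below separates
range descriptors (dictionaries) from column-name descriptors (strings). The
function name ``split_index_descriptors`` and the sample ``metadata`` dictionary
are assumptions made for illustration, not part of pandas or pyarrow, and a
Python ``range`` stands in for ``RangeIndex`` so the example needs only the
standard library.

```python
def split_index_descriptors(metadata):
    """Separate range descriptors from column-name descriptors.

    Hypothetical helper for illustration; not part of pandas or pyarrow.
    """
    ranges = []
    column_names = []
    for descr in metadata['index_columns']:
        if isinstance(descr, dict):
            # Dictionary descriptors describe an index stored as metadata only.
            if descr.get('kind') != 'range':
                raise ValueError(
                    'unknown index descriptor kind: {!r}'.format(descr.get('kind')))
            ranges.append(range(descr['start'], descr['stop'], descr['step']))
        else:
            # String descriptors name a field among the data columns.
            column_names.append(descr)
    return ranges, column_names


# Made-up metadata containing one descriptor of each form.
metadata = {
    'index_columns': [
        {'kind': 'range', 'name': None, 'start': 0, 'stop': 10, 'step': 2},
        '__index_level_1__',
    ],
}
ranges, column_names = split_index_descriptors(metadata)
```

Rejecting unknown ``'kind'`` values is a design choice of this sketch, made so
that descriptor kinds it does not understand are not silently dropped.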
