[Python] When converting nested types to pandas, use tuples #20222

Open
asfimport opened this issue Nov 16, 2018 · 3 comments

asfimport commented Nov 16, 2018

When converting to pandas, convert nested types (e.g. list) to tuples. Columns with lists are difficult to query. Here are a few unsuccessful attempts:

>>> mini
    CHROM    POS           ID            REF    ALTS  QUAL
80     20  63521  rs191905748              G     [A]   100
81     20  63541  rs117322527              C     [A]   100
82     20  63548  rs541129280              G    [GT]   100
83     20  63553  rs536661806              T     [C]   100
84     20  63555  rs553463231              T     [C]   100
85     20  63559  rs138359120              C     [A]   100
86     20  63586  rs545178789              T     [G]   100
87     20  63636  rs374311122              G     [A]   100
88     20  63696  rs149160003              A     [G]   100
89     20  63698  rs544072005              A     [C]   100
90     20  63729  rs181483669              G     [A]   100
91     20  63733   rs75670495              C     [T]   100
92     20  63799    rs1418258              C     [T]   100
93     20  63808   rs76004960              G     [C]   100
94     20  63813  rs532151719              G     [A]   100
95     20  63857  rs543686274  CCTGGAAAGGATT     [C]   100
96     20  63865  rs551938596              G     [A]   100
97     20  63902  rs571779099              A     [T]   100
98     20  63963  rs531152674              G     [A]   100
99     20  63967  rs116770801              A     [G]   100
100    20  63977  rs199703510              C     [G]   100
101    20  64016  rs143263863              G     [A]   100
102    20  64062  rs148297240              G     [A]   100
103    20  64139  rs186497980              G  [A, T]   100
104    20  64150    rs7274499              C     [A]   100
105    20  64151  rs190945171              C     [T]   100
106    20  64154  rs537656456              T     [G]   100
107    20  64175  rs116531220              A     [G]   100
108    20  64186  rs141793347              C     [G]   100
109    20  64210  rs182418654              G     [C]   100
110    20  64303  rs559929739              C     [A]   100

  1. I think this one fails because it tries to broadcast the comparison.

    >>> mini[mini.ALTS == ["A", "T"]]
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "/home/user/miniconda3/lib/python3.6/site-packages/pandas/core/ops.py", line 1283, in wrapper
        res = na_op(values, other)
      File "/home/user/miniconda3/lib/python3.6/site-packages/pandas/core/ops.py", line 1143, in na_op
        result = _comp_method_OBJECT_ARRAY(op, x, y)
      File "/home/user/miniconda3/lib/python3.6/site-packages/pandas/core/ops.py", line 1120, in _comp_method_OBJECT_ARRAY
        result = libops.vec_compare(x, y, op)
      File "pandas/_libs/ops.pyx", line 128, in pandas._libs.ops.vec_compare
    ValueError: Arrays were different lengths: 31 vs 2

  2. I think this fails for a similar reason, but the broadcasting happens in a different place.

    >>> mini[mini.ALTS.apply(lambda x: x == ["A", "T"])]
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "/home/user/miniconda3/lib/python3.6/site-packages/pandas/core/frame.py", line 2682, in __getitem__
        return self._getitem_array(key)
      File "/home/user/miniconda3/lib/python3.6/site-packages/pandas/core/frame.py", line 2726, in _getitem_array
        indexer = self.loc._convert_to_indexer(key, axis=1)
      File "/home/user/miniconda3/lib/python3.6/site-packages/pandas/core/indexing.py", line 1314, in _convert_to_indexer
        indexer = check = labels.get_indexer(objarr)
      File "/home/user/miniconda3/lib/python3.6/site-packages/pandas/core/indexes/base.py", line 3259, in get_indexer
        indexer = self._engine.get_indexer(target._ndarray_values)
      File "pandas/_libs/index.pyx", line 301, in pandas._libs.index.IndexEngine.get_indexer
      File "pandas/_libs/hashtable_class_helper.pxi", line 1544, in pandas._libs.hashtable.PyObjectHashTable.lookup
    TypeError: unhashable type: 'numpy.ndarray'
    >>> mini.ALTS.apply(lambda x: x == ["A", "T"]).head()
    80     [True, False]
    81     [True, False]
    82    [False, False]
    83    [False, False]
    84    [False, False]

  3. Unfortunately, this clever hack fails as well!

    >>> c = np.empty(1, object)
    >>> c[0] = ["A", "T"]
    >>> mini[mini.ALTS.values == c]
    Traceback (most recent call last):
      File "/home/user/miniconda3/lib/python3.6/site-packages/pandas/core/indexes/base.py", line 3078, in get_loc
        return self._engine.get_loc(key)
      File "pandas/_libs/index.pyx", line 140, in pandas._libs.index.IndexEngine.get_loc
      File "pandas/_libs/index.pyx", line 162, in pandas._libs.index.IndexEngine.get_loc
      File "pandas/_libs/hashtable_class_helper.pxi", line 1492, in pandas._libs.hashtable.PyObjectHashTable.get_item
      File "pandas/_libs/hashtable_class_helper.pxi", line 1500, in pandas._libs.hashtable.PyObjectHashTable.get_item
    KeyError: False
    >>> mini.ALTS.values == c
    False

    Finally, what succeeds is the following (probably because tuples are immutable):

    >>> mini["ALTS2"] = mini.ALTS.apply(tuple)
    >>> mini.head()
       CHROM    POS           ID REF  ALTS  QUAL  ALTS2
    80    20  63521  rs191905748   G   [A]   100   (A,)
    81    20  63541  rs117322527   C   [A]   100   (A,)
    82    20  63548  rs541129280   G  [GT]   100  (GT,)
    83    20  63553  rs536661806   T   [C]   100   (C,)
    84    20  63555  rs553463231   T   [C]   100   (C,)
    >>> mini[mini["ALTS2"] == ("A", "T")]
        CHROM    POS           ID REF    ALTS  QUAL   ALTS2
    103    20  64139  rs186497980   G  [A, T]   100  (A, T)
    >>> mini[mini["ALTS2"] == ("GT",)]
       CHROM    POS           ID REF  ALTS  QUAL  ALTS2
    82    20  63548  rs541129280   G  [GT]   100  (GT,)
    >>> mini[mini["ALTS2"] == tuple("C")]
        CHROM    POS           ID            REF ALTS  QUAL ALTS2
    83     20  63553  rs536661806              T  [C]   100  (C,)
    84     20  63555  rs553463231              T  [C]   100  (C,)
    89     20  63698  rs544072005              A  [C]   100  (C,)
    93     20  63808   rs76004960              G  [C]   100  (C,)
    95     20  63857  rs543686274  CCTGGAAAGGATT  [C]   100  (C,)
    109    20  64210  rs182418654              G  [C]   100  (C,)
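
For comparison, here is a minimal sketch of a row-wise mask that avoids the tuple column altogether. It assumes the ALTS cells are NumPy arrays, which is what the `unhashable type: 'numpy.ndarray'` error and the element-wise `[True, False]` output in attempt 2 suggest. The three-row frame and the helper name `has_alts` are hypothetical stand-ins, not part of the original data.

```python
import numpy as np
import pandas as pd

# Hypothetical three-row stand-in for `mini`. Each ALTS cell is a NumPy
# array (an assumption based on the tracebacks above).
alts_values = [["A"], ["GT"], ["A", "T"]]
cells = np.empty(len(alts_values), dtype=object)
for i, alleles in enumerate(alts_values):
    cells[i] = np.array(alleles, dtype=object)

mini = pd.DataFrame(
    {"POS": [63521, 63548, 64139], "ALTS": cells},
    index=[80, 82, 103],
)


def has_alts(cell, wanted):
    # Compare as plain lists so each row yields a single bool
    # instead of an element-wise array.
    return list(cell) == list(wanted)


mask = mini["ALTS"].apply(lambda cell: has_alts(cell, ["A", "T"]))
hits = mini[mask]
print(hits.index.tolist())  # [103]
```

This works, but it runs a Python function per row and is easy to get wrong, which is the motivation for asking for tuples in the first place.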

Environment: Fedora 29, pyarrow installed with conda
Reporter: Suvayu Ali / @suvayu

Note: This issue was originally created as ARROW-3806. Please see the migration documentation for further details.

Wes McKinney / @wesm:
Hm, I don't have a particularly strong opinion about our use of lists. Do we have type inference for tuples on the round trip? @xhochy @pitrou what do you think about this?

Joris Van den Bossche / @jorisvandenbossche:
There is currently no automatic type inference for tuples to ListArray:

In [41]: pyarrow.array(np.array([(1, 2), (1, 2, 3), (3,)]))
---------------------------------------------------------------------------
ArrowInvalid                              Traceback (most recent call last)
<ipython-input-41-ca2c4b264383> in <module>
----> 1 pyarrow.array(np.array([(1, 2), (1, 2, 3), (3,)]))

~/scipy/repos/arrow/python/pyarrow/array.pxi in pyarrow.lib.array()

~/scipy/repos/arrow/python/pyarrow/array.pxi in pyarrow.lib._ndarray_to_array()

~/scipy/repos/arrow/python/pyarrow/error.pxi in pyarrow.lib.check_status()

ArrowInvalid: Could not convert (1, 2) with type tuple: did not recognize Python value type when inferring an Arrow data type

but conversion from tuples is implemented when specifying the proper type:

In [42]: pyarrow.array(np.array([(1, 2), (1, 2, 3), (3,)]), type=pyarrow.list_(pyarrow.int64()))
Out[42]: 
<pyarrow.lib.ListArray object at 0x7fad2254d9f8>
[
  [
    1,
    2
  ],
  [
    1,
    2,
    3
  ],
  [
    3
  ]
]

Antoine Pitrou / @pitrou:
Well, we could add proper inference for tuples. ListArray is the only applicable type AFAICT (StructArray cannot apply since the field names are not known).
