
Self joins with non-unique indexes #3791

Closed
eric-czech opened this issue Feb 21, 2020 · 5 comments

@eric-czech

Hi, is there a good way to self-join arrays?

For example, given a dataset like this:

import pandas as pd
df = pd.DataFrame(dict(
    x=[1, 1, 2, 2], 
    y=['1', '1', '2', '2'], 
    z=['a', 'b', 'c', 'd']))
df

#    x  y  z
# 0  1  1  a
# 1  1  1  b
# 2  2  2  c
# 3  2  2  d

I am not looking for the pandas concat behavior for alignment:

pd.concat([
    df.set_index(['x', 'y'])[['z']].rename(columns={'z': 'z_x'}),
    df.set_index(['x', 'y'])[['z']].rename(columns={'z': 'z_y'})
], axis=1, join='inner')

#      z_x z_y
# x y
# 1 1    a   a
#        b   b
# 2 2    c   c
#        d   d

but rather the merge behavior for a join by index:

pd.merge(df, df, on=['x', 'y'])

#    x  y z_x z_y
# 0  1  1   a   a
# 1  1  1   a   b
# 2  1  1   b   a
# 3  1  1   b   b
# 4  2  2   c   c
# 5  2  2   c   d
# 6  2  2   d   c
# 7  2  2   d   d

I tried using xarray.merge, but that seems to give concat-like behavior (i.e. alignment rather than joining). Even if the operation is possible, this is a large dataset that I need to process out-of-core via dask, and I have found that it takes some elbow grease to get it working with dask DataFrames: the number of partitions has to be chosen well, and the divisions have to be known before joining by index. Should I expect this sort of operation to work well with xarray (if it is possible at all), given that it's hard enough to do directly with dask without hitting OOM errors?
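For reference, here is a minimal sketch of the dask route I mean (illustrative only: dask can merge efficiently on a sorted single-column index with known divisions, but it has no MultiIndex, so the real two-column key would have to be packed into one column first):

import pandas as pd
import dask.dataframe as dd

df = pd.DataFrame(dict(
    x=[1, 1, 2, 2],
    y=['1', '1', '2', '2'],
    z=['a', 'b', 'c', 'd']))

# set_index sorts the data and records the partition divisions,
# which lets the index merge below run partition-by-partition
# instead of as a full shuffle. npartitions=2 is arbitrary here.
ddf = dd.from_pandas(df, npartitions=2).set_index('x')

# Self join on the index; suffixes disambiguate the shared columns.
result = ddf.merge(ddf, left_index=True, right_index=True,
                   suffixes=('_l', '_r'))
print(result.compute())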

Output of xr.show_versions()

INSTALLED VERSIONS

commit: None
python: 3.7.6 | packaged by conda-forge | (default, Jan 7 2020, 22:33:48)
[GCC 7.3.0]
python-bits: 64
OS: Linux
OS-release: 5.3.0-28-generic
machine: x86_64
processor:
byteorder: little
LC_ALL: en_US.UTF-8
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8
libhdf5: 1.10.4
libnetcdf: 4.7.3

xarray: 0.15.0
pandas: 0.25.2
numpy: 1.17.2
scipy: 1.4.1
netCDF4: 1.5.3
pydap: None
h5netcdf: None
h5py: 2.10.0
Nio: None
zarr: 2.3.2
cftime: 1.0.4.2
nc_time_axis: None
PseudoNetCDF: None
rasterio: None
cfgrib: None
iris: None
bottleneck: 1.3.1
dask: 2.11.0
distributed: 2.11.0
matplotlib: 3.1.1
cartopy: None
seaborn: 0.9.0
numbagg: None
setuptools: 45.2.0.post20200209
pip: 20.0.2
conda: None
pytest: None
IPython: 7.12.0
sphinx: None

@max-sixty (Collaborator)

Hi @eric-czech -- thanks for the issue.

Unfortunately, xarray isn't strong at these sorts of relational joins, and I don't think there's a way of doing that specific operation. Relational algebra generally operates on data along a single dimension, which fits xarray's model less well.

Feel free to post back here with follow-up questions, though.

@mrocklin (Contributor)

I wonder if there are multi-dimensional analogs that might be interesting.

@eric-czech, if you have time to say a bit more about the data and the operation you're trying to do, I think it would be an interesting exercise to see how to express that operation with xarray's current functionality. I wouldn't be surprised to learn that there is some way to do what you want that goes under a different name here.

@eric-czech (Author) commented Mar 26, 2020

Hey @mrocklin (cc @max-sixty), sure thing.

My original question was about how to implement a join in the typical relational-algebra sense, where rows with identical values in the join clause are repeated. But I think I have an even simpler problem that is much more common in our workflows (and that touches on how duplicated index values are supported).

For example, I'd like to do something like this:

import xarray as xr
import numpy as np
import pandas as pd

# Assume we have a dataset of 3 individuals, one of African 
# ancestry and two of European ancestry
a = pd.DataFrame({'pop_name': ['AFR', 'EUR', 'EUR'], 'sample_id': [1, 2, 3]})

# Join on ancestry to get population size
b = pd.DataFrame({'pop_name': ['AFR', 'EUR'], 'pop_size': [10, 100]})
pd.merge(a, b, on='pop_name')
#   pop_name  sample_id  pop_size
# 0      AFR          1        10
# 1      EUR          2       100
# 2      EUR          3       100

With xarray, the closest equivalent to this I can find is:

a = xr.DataArray(
    data=[1, 2, 3], dims='x',
    coords=dict(pop_name=('x', ['AFR', 'EUR', 'EUR'])),
    name='sample_id'
).set_index(dict(x='pop_name'))
# <xarray.DataArray 'sample_id' (x: 3)>
# array([1, 2, 3])
# Coordinates:
#   * x        (x) object 'AFR' 'EUR' 'EUR'

b = xr.DataArray(
    data=[10, 100], dims='x',
    coords=dict(pop_name=('x', ['AFR', 'EUR'])),
    name='pop_size'
).set_index(dict(x='pop_name'))
# <xarray.DataArray 'pop_size' (x: 2)>
# array([ 10, 100])
# Coordinates:
#   * x        (x) object 'AFR' 'EUR'

xr.merge([a, b])
# InvalidIndexError: Reindexing only valid with uniquely valued Index objects

The above does exactly what I want as long as the population names used as the merge coordinate are unique, but that obviously doesn't hold when those names correspond to a bunch of individuals drawn from a small number of populations. The error itself makes sense: xr.merge aligns its inputs by reindexing them on the union of index values, and reindexing a pandas index that contains duplicates is ambiguous.

The larger context for this is that genetic data itself is typically some 2+ dimensional array with the first two dimensions corresponding to genomic sites and people. Xarray is perfect for carrying around the extra information relating to those dimensions as coordinates, but being able to attach new coordinate values by joins to external tables is important.

Am I missing something obvious in the API that will do this? Or am I likely better off converting DataArrays to DataFrames, doing my operations with a DataFrame API, and then converting back?
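For concreteness, the round-trip I have in mind would look something like this (just a sketch; names follow the example above, and the output comment is written from the pandas/xarray semantics rather than pasted):

import pandas as pd
import xarray as xr

a = xr.DataArray(
    data=[1, 2, 3], dims='x',
    coords=dict(pop_name=('x', ['AFR', 'EUR', 'EUR'])),
    name='sample_id')
b = xr.DataArray(
    data=[10, 100], dims='x',
    coords=dict(pop_name=('x', ['AFR', 'EUR'])),
    name='pop_size')

# Drop to pandas for the relational join ...
df = pd.merge(a.to_dataframe(), b.to_dataframe(), on='pop_name')

# ... then rebuild a Dataset, keeping pop_name as a non-index
# coordinate since its values are not unique.
ds = df.rename_axis('x').to_xarray().set_coords('pop_name')
# <xarray.Dataset>
# Dimensions:    (x: 3)
# Coordinates:
#   * x          (x) int64 0 1 2
#     pop_name   (x) object 'AFR' 'EUR' 'EUR'
# Data variables:
#     sample_id  (x) int64 1 2 3
#     pop_size   (x) int64 10 100 100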

@keewis (Collaborator) commented Mar 26, 2020

The only way I could come up with is:

a = xr.DataArray(
    name="sample_id",
    data=[1, 2, 3],
    dims="population_name",
    coords={"population_name": ["AFR", "EUR", "EUR"]},
)
b = xr.DataArray(
    name="population_size",
    data=[10, 100],
    dims="population_name",
    coords={"population_name": ["AFR", "EUR"]},
)
a.to_dataset().assign({b.name: b.sel(population_name=a.population_name)})
# <xarray.Dataset>
# Dimensions:          (population_name: 3)
# Coordinates:
#   * population_name  (population_name) <U3 'AFR' 'EUR' 'EUR'
# Data variables:
#     sample_id        (population_name) int64 1 2 3
#     population_size  (population_name) int64 10 100 100

which is a manual join?
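It works because b's population_name index is unique, so .sel with a's (non-unique) labels behaves like a left join: each label pulls the matching row from b. If it helps, a hypothetical helper (made-up name, not part of xarray) that applies the same trick to every variable of a lookup Dataset:

import xarray as xr

def join_on(samples: xr.Dataset, lookup: xr.Dataset, dim: str) -> xr.Dataset:
    # Hypothetical helper, not an xarray API. `lookup` must have a
    # unique index along `dim`; `samples[dim]` may contain repeats,
    # and .sel replicates the matching rows of each lookup variable.
    return samples.assign(
        {name: var.sel({dim: samples[dim]})
         for name, var in lookup.data_vars.items()}
    )

With the example above, join_on(a.to_dataset(), b.to_dataset(), "population_name") should give the same result.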

@eric-czech (Author)

That'll work, thanks @keewis!

FWIW, the use case from my initial question, where there are repeated index values on both sides of the join, has turned out to be much less common for us.
