Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

using DataFrame.resample with 'agg' method on non-existant columns provides unexpected behavior #16766

Closed
dwilson-icr opened this issue Jun 24, 2017 · 1 comment · Fixed by #19552
Labels
Bug Resample resample method
Milestone

Comments

@dwilson-icr
Copy link

Code Sample, a copy-pastable example if possible

import pandas as pd
from datetime import datetime

data = [
    {
        't': datetime(2017,6,1,0),
        'x': 1.0,
        'y': 2.0
    },
    {
        't': datetime(2017,6,1,1),
        'x': 2.0,
        'y': 2.0
    },
    {
        't': datetime(2017,6,1,2),
        'x': 3.0,
        'y': 1.5
    }

]

df = pd.DataFrame(data)
df = df.set_index('t')

# Perform a resample is get a binned time series DataFrame... this works fine
ts = df.resample('30T').agg({'x':['mean'],'y':['median']})
print ts['x'].shape
# (5,1)

# What if I put a field in there that doesn't exist? 
ts = df.resample('30T').agg({'x':['mean'],'y':['median'],'z':['sum']})
print ts['x'].shape
# (5,2) ??? I don't understand why the shape isn't (5,1)
print ts['x'].values
#[[ 1.   2. ]
# [ nan  nan]
# [ 2.   2. ]
# [ nan  nan]
# [ 3.   1.5]]
# Looks like a copy of the full aggregation even though I only requested the 'x' column
# Furthermore, now ts['z'] exists

Output:

(5, 1)
[[ 1.]
[ nan]
[ 2.]
[ nan]
[ 3.]]
(5, 2)
[[ 1. 2. ]
[ nan nan]
[ 2. 2. ]
[ nan nan]
[ 3. 1.5]]

Problem description

I am using pandas on records from an Elasticsearch database. The queries are pulling from multiple indices with overlapping column/field names. Most records will have most of their data in common, but when available, I want to know the values of other fields. I'm essentially creating a time series for each column with a specific time-based binning and a per-column aggregation.

I think this should either ignore columns that don't exist or raise an exception. If it silently ignores these columns, ts['x'] should return the same result as the first example. I spent several hours on a workaround today that required that I check for which columns were available for each aggregation and remove those from my agg dictionary. I feel like the current behavior doesn't have a purpose, but perhaps I'm missing something.

Expected Output

(5, 1)
[[  1.]
 [ nan]
 [  2.]
 [ nan]
 [  3.]]
(5, 1)
[[  1.]
 [ nan]
 [  2.]
 [ nan]
 [  3.]]

OR

ValueError("Column 'z' does not exist!")

Output of pd.show_versions()

INSTALLED VERSIONS ------------------ commit: None python: 2.7.5.final.0 python-bits: 64 OS: Linux OS-release: 3.10.0-514.2.2.el7.x86_64 machine: x86_64 processor: x86_64 byteorder: little LC_ALL: None LANG: en_US.UTF-8 LOCALE: None.None

pandas: 0.20.2
pytest: None
pip: 9.0.1
setuptools: 0.9.8
Cython: None
numpy: 1.11.3
scipy: 0.18.1
xarray: None
IPython: None
sphinx: None
patsy: None
dateutil: 1.5
pytz: 2012d
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: 1.5.3
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: None
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: None
s3fs: None
pandas_gbq: None
pandas_datareader: None

@chris-b1
Copy link
Contributor

Yeah, this looks buggy - groupby does the right thing. PRs welcome!

df.groupby('x').agg({'z': ['sum']})

KeyError: 'z'

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Bug Resample resample method
Projects
None yet
3 participants