using DataFrame.resample with 'agg' method on non-existent columns provides unexpected behavior #16766

dwilson-icr · 2017-06-24T05:41:33Z

Code Sample, a copy-pastable example if possible

import pandas as pd
from datetime import datetime

data = [
    {
        't': datetime(2017,6,1,0),
        'x': 1.0,
        'y': 2.0
    },
    {
        't': datetime(2017,6,1,1),
        'x': 2.0,
        'y': 2.0
    },
    {
        't': datetime(2017,6,1,2),
        'x': 3.0,
        'y': 1.5
    }

]

df = pd.DataFrame(data)
df = df.set_index('t')

# Resample to get a binned time series DataFrame... this works fine
ts = df.resample('30T').agg({'x':['mean'],'y':['median']})
print(ts['x'].shape)
# (5, 1)

# What if I put a field in there that doesn't exist?
ts = df.resample('30T').agg({'x':['mean'],'y':['median'],'z':['sum']})
print(ts['x'].shape)
# (5, 2) ??? I don't understand why the shape isn't (5, 1)
print(ts['x'].values)
# [[ 1.   2. ]
#  [ nan  nan]
#  [ 2.   2. ]
#  [ nan  nan]
#  [ 3.   1.5]]
# Looks like a copy of the full aggregation even though I only requested the 'x' column
# Furthermore, ts['z'] now exists

Output:

(5, 1)
[[  1.]
 [ nan]
 [  2.]
 [ nan]
 [  3.]]
(5, 2)
[[ 1.   2. ]
 [ nan  nan]
 [ 2.   2. ]
 [ nan  nan]
 [ 3.   1.5]]

Problem description

I am using pandas on records from an Elasticsearch database. The queries pull from multiple indices with overlapping column/field names. Most records share most of their fields, but when other fields are available, I want their values too. I'm essentially creating a time series for each column with a specific time-based binning and a per-column aggregation.

I think this should either ignore columns that don't exist or raise an exception. If it silently ignores these columns, ts['x'] should return the same result as in the first example. I spent several hours today on a workaround that checks which columns are available for each aggregation and removes the missing ones from my agg dictionary. The current behavior doesn't seem to serve a purpose, but perhaps I'm missing something.
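A workaround along the lines described above might look like the following minimal sketch. The helper name `filter_agg_spec` is hypothetical (it is not part of pandas), and the frequency alias `'30min'` is used here instead of `'30T'` only so the snippet also runs on newer pandas releases.

```python
import pandas as pd
from datetime import datetime


def filter_agg_spec(frame, agg_spec):
    """Hypothetical helper: keep only the agg entries whose key is a column of `frame`."""
    return {col: funcs for col, funcs in agg_spec.items() if col in frame.columns}


# Same data as in the code sample, built directly with a datetime index.
df = pd.DataFrame(
    {"x": [1.0, 2.0, 3.0], "y": [2.0, 2.0, 1.5]},
    index=pd.DatetimeIndex([datetime(2017, 6, 1, h) for h in range(3)], name="t"),
)

# 'z' is requested but does not exist in df, so it is dropped before aggregating.
requested = {"x": ["mean"], "y": ["median"], "z": ["sum"]}
safe_spec = filter_agg_spec(df, requested)

ts = df.resample("30min").agg(safe_spec)
print(ts["x"].shape)
# (5, 1)
```

Filtering up front keeps the aggregation call itself unchanged, at the cost of silently skipping the missing columns, which is one of the two behaviors proposed above.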

Expected Output

(5, 1)
[[  1.]
 [ nan]
 [  2.]
 [ nan]
 [  3.]]
(5, 1)
[[  1.]
 [ nan]
 [  2.]
 [ nan]
 [  3.]]

OR

ValueError("Column 'z' does not exist!")

Output of `pd.show_versions()`

INSTALLED VERSIONS
------------------
commit: None
python: 2.7.5.final.0
python-bits: 64
OS: Linux
OS-release: 3.10.0-514.2.2.el7.x86_64
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: None.None

pandas: 0.20.2
pytest: None
pip: 9.0.1
setuptools: 0.9.8
Cython: None
numpy: 1.11.3
scipy: 0.18.1
xarray: None
IPython: None
sphinx: None
patsy: None
dateutil: 1.5
pytz: 2012d
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: 1.5.3
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: None
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: None
s3fs: None
pandas_gbq: None
pandas_datareader: None


chris-b1 · 2017-06-26T15:21:21Z

Yeah, this looks buggy - groupby does the right thing. PRs welcome!

df.groupby('x').agg({'z': ['sum']})

KeyError: 'z'
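For anyone wanting to reproduce the groupby behavior in isolation, here is a small self-contained sketch. The three-row frame is illustrative only, and the exact wording of the error message may differ between pandas versions, so the snippet just reports whatever is raised.

```python
import pandas as pd

# Illustrative frame with columns 'x' and 'y' only; there is no 'z' column.
df = pd.DataFrame({"x": [1.0, 2.0, 3.0], "y": [2.0, 2.0, 1.5]})

try:
    df.groupby("x").agg({"z": ["sum"]})
    raised = False
except KeyError as exc:
    # groupby rejects the unknown column instead of silently aggregating.
    raised = True
    print("KeyError:", exc)
```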

chris-b1 added Bug Resample resample method labels Jun 26, 2017

chris-b1 added this to the Next Major Release milestone Jun 26, 2017

jreback added Difficulty Intermediate labels Jun 28, 2017

leosartaj mentioned this issue Jul 15, 2017

BUG: resample with non-existant columns #16973

Closed

4 tasks

jreback mentioned this issue Oct 8, 2017

BUG: groupby with resample using on parameter errors when selecting column to apply function #17813

Closed

discort mentioned this issue Feb 6, 2018

using DataFrame.resample with 'agg' method on non-existant columns provides unexpected behavior #19552

Merged

4 tasks

jreback modified the milestones: Next Major Release, 0.23.0 Feb 7, 2018

jreback closed this as completed in #19552 Feb 7, 2018

nabeelio mentioned this issue Apr 20, 2020

Update Pandas alpacahq/pylivetrader#145

Closed
