Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ERR: validate partial string indexing with tz-aware end-points #16785

Closed
1kastner opened this issue Jun 28, 2017 · 13 comments · Fixed by #25263
Closed

ERR: validate partial string indexing with tz-aware end-points #16785

1kastner opened this issue Jun 28, 2017 · 13 comments · Fixed by #25263
Labels
Datetime Datetime data dtype Error Reporting Incorrect or improved errors from pandas Indexing Related to indexing on series/frames, not to indexes themselves Timezones Timezone data dtype
Milestone

Comments

@1kastner
Copy link

1kastner commented Jun 28, 2017

Code Sample, a copy-pastable example if possible

import pandas as pd
import dateutil.parser

df = pd.DataFrame(index=pd.date_range('2016-01-01T00:00', '2016-12-31T23:59', freq='T'))

df[
    "2016-01-01T00:00-02:00"
    :
    "2016-01-01T02:03"
]  # returns 124 entries

df[
    dateutil.parser.parse("2016-01-01T00:00-02:00")
    :
    dateutil.parser.parse("2016-01-01T02:03")
]  # returns 4 entries

df[
    pd.Timestamp("2016-01-01T00:00-02:00")
    :
    pd.Timestamp("2016-01-01T02:03")
]  # returns 4 entries

Problem description

The current behavior is that the time zone information is ignored without any warning when providing strings.

Expected Output

Either use the time zone information and only return the four desired entries OR warn the user about the fact that the time zone information is not used

Output of pd.show_versions()

INSTALLED VERSIONS

commit: None
python: 3.6.0.final.0
python-bits: 32
OS: Windows
OS-release: 10
machine: AMD64
processor: Intel64 Family 6 Model 94 Stepping 3, GenuineIntel
byteorder: little
LC_ALL: None
LANG: None
LOCALE: None.None

pandas: 0.20.1
pytest: None
pip: 9.0.1
setuptools: 28.8.0
Cython: None
numpy: 1.12.1
scipy: None
xarray: None
IPython: None
sphinx: None
patsy: None
dateutil: 2.6.0
pytz: 2017.2
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: 2.0.2
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: None
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: None
s3fs: None
pandas_gbq: None
pandas_datareader: None

@1kastner 1kastner changed the title pandas DatetimeIndex not time zone sensitive DatetimeIndex not time zone sensitive Jun 28, 2017
@jreback
Copy link
Contributor

jreback commented Jun 28, 2017

Your end-point is naive (FYI dateutil.parser is not used anywhere in pandas except as a very last resort as it is slow).

In [8]: pd.Timestamp("2016-01-01T00:00-02:00")
Out[8]: Timestamp('2016-01-01 00:00:00-0200', tz='pytz.FixedOffset(-120)')

In [9]: pd.Timestamp("2016-01-01T02:03")
Out[9]: Timestamp('2016-01-01 02:03:00')

In [12]: dateutil.parser.parse("2016-01-01T02:03")
Out[12]: datetime.datetime(2016, 1, 1, 2, 3)

In [13]: dateutil.parser.parse("2016-01-01T00:00-02:00")
Out[13]: datetime.datetime(2016, 1, 1, 0, 0, tzinfo=tzoffset(None, -7200))

Full working example

In [4]: idx = pd.date_range('2016-01-01T00:00', '2016-12-31T23:59', freq='T')

In [5]: df = pd.DataFrame({'A': range(len(idx))}, index=idx)

In [6]: df
Out[6]: 
                          A
2016-01-01 00:00:00       0
2016-01-01 00:01:00       1
2016-01-01 00:02:00       2
2016-01-01 00:03:00       3
2016-01-01 00:04:00       4
2016-01-01 00:05:00       5
...                     ...
2016-12-31 23:54:00  527034
2016-12-31 23:55:00  527035
2016-12-31 23:56:00  527036
2016-12-31 23:57:00  527037
2016-12-31 23:58:00  527038
2016-12-31 23:59:00  527039

[527040 rows x 1 columns]

In [7]: df["2016-01-01T00:00-02:00":"2016-01-02T00:00-02:00"]
Out[7]: 
                        A
2016-01-01 00:00:00     0
2016-01-01 00:01:00     1
2016-01-01 00:02:00     2
2016-01-01 00:03:00     3
2016-01-01 00:04:00     4
2016-01-01 00:05:00     5
...                   ...
2016-01-01 23:55:00  1435
2016-01-01 23:56:00  1436
2016-01-01 23:57:00  1437
2016-01-01 23:58:00  1438
2016-01-01 23:59:00  1439
2016-01-02 00:00:00  1440

[1441 rows x 1 columns]

@1kastner if you are proposing that we raise if the partial-string indexing with a tz must match on both end-points (if given), I would agree.

If you would like to submit a PR would be great. Just step thru to see where the conversions happens.

@jreback jreback added Difficulty Intermediate Error Reporting Incorrect or improved errors from pandas Indexing Related to indexing on series/frames, not to indexes themselves Datetime Datetime data dtype Timezones Timezone data dtype labels Jun 28, 2017
@jreback jreback added this to the 0.21.0 milestone Jun 28, 2017
@jreback jreback changed the title DatetimeIndex not time zone sensitive ERR: validate partial string indexing with tz-aware end-points Jun 28, 2017
@1kastner
Copy link
Author

Thank you very much for this explanation! I will try to find a good spot to insert the proposed exception.

@jreback
Copy link
Contributor

jreback commented Sep 23, 2017

@1kastner want to do a PR for this?

@jreback jreback modified the milestones: 0.21.0, Next Major Release Sep 23, 2017
@1kastner
Copy link
Author

After 17th Oct. (probably a few days earlier) I will have some free time and I'd be interested to solve this. If you prefer to have it solved by somebody else before this date, that'd be no deal for me.

@jreback
Copy link
Contributor

jreback commented Sep 24, 2017

@1kastner totally fine.

@1kastner
Copy link
Author

1kastner commented Oct 19, 2017

So the easiest solution seems now to invoke pd.Timestamp("timestamp-string") on the respective endpoint. I do not agree that the timezones should match. Because in some countries they have daylight saving time, so it is completely normal for several countries to change their timezone twice a year.

Because of that it is so critical (when not working with UTC) that the timezone information is not ignored.

@1kastner
Copy link
Author

Well, in datetimes.py the parsed offset is totally ignored in the method _parsed_string_to_bounds, for my case lines 1292-1296. Instead, the assumed timezone defaults to the timezone of the DatetimeIndex. As said before, because of daylight saving time (also called "summer time" in German), this does not necessarily make sense. You should be free to choose whatever timezone you need for your project, changing with different contexts. I can understand that if no timezone is provided, it defaults to the timezone of the DatetimeIndex timezone instead of UTC. But if a timezone is provided, neglecting it here is not nice.

@bkmrkr
Copy link

bkmrkr commented Oct 19, 2017 via email

@1kastner
Copy link
Author

@bkmrkr Can't you unsubscribe yourself?

@1kastner
Copy link
Author

@jreback Now I struggle with Visual Studio... It might take some some time to properly check everything. But the main idea is to introduce the following change:

    def _parsed_string_to_bounds(self, reso, parsed):
        # ...
        if parsed.tzinfo is None:
            target_tz = self.tz
        else:
            target_tz = parsed.tzinfo

        if reso == 'year':
            return (Timestamp(datetime(parsed.year, 1, 1), tz=target_tz),
                    Timestamp(datetime(parsed.year, 12, 31, 23,
                                       59, 59, 999999), tz=target_tz))
    #...

Of course throughout all resolutions self.tz needs to be replaced with target_tz.

@jreback jreback modified the milestones: Next Major Release, 0.22.0 Nov 16, 2017
@jreback jreback modified the milestones: 0.23.0, Next Major Release Apr 14, 2018
@1kastner
Copy link
Author

At #24076 it seems like something has changed.
I will soon re-check whether this issue still exists or whether it can be closed.

@1kastner
Copy link
Author

1kastner commented Jan 1, 2019

I locally built the version 0.24.0.dev0+1448.gb9284a2.dirty and the problem continued to exist.

@1kastner
Copy link
Author

great that it is solved, thank you everybody!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Datetime Datetime data dtype Error Reporting Incorrect or improved errors from pandas Indexing Related to indexing on series/frames, not to indexes themselves Timezones Timezone data dtype
Projects
None yet
3 participants