ENH: read_csv parse_dates should use datetime64[us] instead of datetime64[ns] if an out-of-bounds date is detected. #31711
Comments
Other resolutions than ns are currently not supported in pandas (see #7307), so returning …
Closing as this is out of scope for now.
Seriously? The issue you mentioned is from 2014 and there are 15 other linked issues with practically the same problem. How can such an often-requested feature be out of scope? I understand that most Python users are scientists, but in enterprise applications (e.g. data warehouse systems) the reality is different, so there are special requirements that could easily be fulfilled. There is absolutely no reason to use nanoseconds for dates at all. Fixing the problem looks quite simple, and if need be we will fork pandas to make that change ourselves.
Your immediate question (support for other resolutions in read_csv) is out of scope for now, as long as we don't have support for other resolutions in general. #7307 is still an open, in-scope issue, and contributions are welcome (but it is not an easy task). That is already covered there, though, so there is no need to keep this open.
These two statements are very opinionated and wrong. There is an amazing diversity of Python users, and the ns representation of dates has provided a pretty good balance between the range of dates that can be represented and usability. Sure, supporting other resolutions of dates natively is desirable, but as @jorisvandenbossche said, it is not trivial. You can use Period if this is an actual issue for you, meaning you need a larger date range than ns provides and a performant type. Using an out-of-range marker, e.g. 2999-01-01, while common in SQL systems, is easily transformed to a NaT-based missing value: you get performance and good representation.
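A minimal sketch of the sentinel-to-NaT approach described above, assuming a hypothetical file warehouse_export.csv with a valid_to column that uses 2999-12-31 23:59:59 as its end-of-time marker:

```python
import pandas as pd

# Read the sentinel-bearing column as plain strings first (hypothetical
# file and column names), then parse it separately: errors="coerce" turns
# out-of-bounds values such as 2999-12-31 23:59:59 into NaT instead of
# leaving the whole column as object.
df = pd.read_csv("warehouse_export.csv", dtype={"valid_to": str})
df["valid_to"] = pd.to_datetime(df["valid_to"], errors="coerce")
```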
@jorisvandenbossche can we reopen this now that #7307 is done? 😁 I'd love e.g. a …
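As context for the request above, a minimal sketch of what the non-nanosecond support tracked in #7307 enables, assuming pandas 2.0 or later (this is plain Series construction, not read_csv behaviour):

```python
import numpy as np
import pandas as pd  # pandas >= 2.0 for non-nanosecond datetime support

# A numpy datetime64[us] array keeps its microsecond resolution when wrapped
# in a Series, so sentinel dates far outside the datetime64[ns] range fit.
values = np.array(["0001-01-01T00:00:00", "9999-12-31T23:59:59"],
                  dtype="datetime64[us]")
s = pd.Series(values)
print(s.dtype)  # datetime64[us]
```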
That would be great. We use a modified version of Pandas 0.24.1 where …
Code Sample, a copy-pastable example if possible
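A minimal reproduction sketch, assuming a small inline CSV with a 9999-12-31 sentinel date (the data and column names are made up):

```python
import pandas as pd
from io import StringIO

csv_data = "id,valid_to\n1,2020-01-01 00:00:00\n2,9999-12-31 23:59:59\n"

# 9999-12-31 does not fit into datetime64[ns], so parse_dates gives up and
# the column comes back with dtype object instead of a datetime64 dtype.
df = pd.read_csv(StringIO(csv_data), parse_dates=["valid_to"])
print(df["valid_to"].dtype)  # object (on the pandas version reported below)
```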
Problem description

`read_csv` with the parameter `parse_dates` uses `datetime64[ns]` as the default. That datatype, however, only covers dates between the years 1678 and 2262. Many systems use hardcoded special dates for `-Inf` or `Inf`, including `0001-01-01 00:00:00`, `2999-12-31 23:59:59` or `9999-12-31 23:59:59`.

The current behavior of `read_csv` is that an `object` column is returned instead of a `datetime64[us]` one. Using the `date_parser` argument leads to a huge performance drop, and even workarounds like `na_values=['2999-12-31 23:59:59']` do not work.

Instead of an `object` column, a `datetime64[us]` column should be returned when an out-of-bounds date is found. Possible approaches to solving the issue include autodetection by `csv.Sniffer`, a separate parameter `datetime_unit`, or backtracking.

Expected Output
Output of pd.show_versions()
INSTALLED VERSIONS
commit: None
python: 3.7.4.final.0
python-bits: 64
OS: Linux
OS-release: 4.15.0-1052-gcp
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8
pandas: 0.24.1
pytest: None
pip: 19.2.2
setuptools: 41.0.1
Cython: 0.29.14
numpy: 1.16.4
scipy: 1.3.1
pyarrow: 0.15.1
xarray: None
IPython: 7.8.0
sphinx: 2.2.0
patsy: 0.5.0
dateutil: 2.8.0
pytz: 2019.2
blosc: None
bottleneck: 1.2.1
tables: 3.4.4
numexpr: 2.6.8
feather: None
matplotlib: 3.1.1
openpyxl: None
xlrd: 1.2.0
xlwt: None
xlsxwriter: None
lxml.etree: None
bs4: None
html5lib: None
sqlalchemy: 1.3.10
pymysql: None
psycopg2: None
jinja2: 2.10.1
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None
gcsfs: None