[BUG] Unable to write timedelta64[s] type correctly with parquet writer #13409

Closed
galipremsagar opened this issue May 22, 2023 · 6 comments
Labels: 0 - Backlog (in queue waiting for assignment), bug (something isn't working), cuIO (cuIO issue), libcudf (affects libcudf C++/CUDA code)

Comments

@galipremsagar
Contributor

Describe the bug
When a column has the timedelta64[s] dtype, the parquet writer appears to write it as a timedelta64[ms] column, which causes both the cudf and pyarrow parquet readers to pick up the wrong column type. Other timedelta resolutions are not affected.

Steps/Code to reproduce bug

In [1]: import cudf

In [3]: df = cudf.DataFrame({"seconds": cudf.Series([1234, 3456, 32442], dtype='timedelta64[s]')})

In [4]: df
Out[4]: 
          seconds
0 0 days 00:20:34
1 0 days 00:57:36
2 0 days 09:00:42

In [5]: df.dtypes
Out[5]: 
seconds    timedelta64[s]
dtype: object

In [6]: df.to_parquet("a")

In [7]: cudf.read_parquet("a")
Out[7]: 
          seconds
0 0 days 00:20:34
1 0 days 00:57:36
2 0 days 09:00:42

In [8]: cudf.read_parquet("a").dtypes
Out[8]: 
seconds    timedelta64[ms]               # Should be timedelta64[s]
dtype: object

In [9]: import pyarrow as pa

In [10]: pa.parquet.read_table("a")
Out[10]: 
pyarrow.Table
seconds: time32[ms] not null           # Should be time32[s]
----
seconds: [[00:20:34.000,00:57:36.000,09:00:42.000]]


# If we now try to write & read using pyarrow the dtype stays intact:

In [11]: pa_table = df.to_arrow()

In [12]: pa_table
Out[12]: 
pyarrow.Table
seconds: duration[s]
----
seconds: [[1234,3456,32442]]

In [13]: pa.parquet.write_table(pa_table, "a")

In [15]: pa.parquet.read_table("a")
Out[15]: 
pyarrow.Table
seconds: duration[s]
----
seconds: [[1234,3456,32442]]

In [17]: import pandas as pd

In [18]: pd.read_parquet("a")
Out[18]: 
          seconds
0 0 days 00:20:34
1 0 days 00:57:36
2 0 days 09:00:42

In [19]: pd.read_parquet("a").dtypes
Out[19]: 
seconds    timedelta64[ns]
dtype: object

In [21]: pa.parquet.read_metadata("a").schema
Out[21]: 
<pyarrow._parquet.ParquetSchema object at 0x7fc8e622b0c0>
required group field_id=-1 schema {
  optional int64 field_id=-1 seconds;
}

Expected behavior
All other timedelta resolutions (ns, us, ms) are written correctly; the problem is seen only with s. If the writer stores this type correctly, it should round-trip like the others.
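
A possible contributing factor, offered as an assumption rather than a confirmed diagnosis: the Parquet TIME logical type defines only millisecond, microsecond, and nanosecond units, so a seconds-resolution column has no direct counterpart. The `time32[ms]` that pyarrow reports for the cudf-written file is consistent with a writer that scales seconds to milliseconds. The sketch below is plain Python, not cudf or libcudf code; `PARQUET_TIME_UNITS` and `to_parquet_time` are hypothetical names used only to illustrate why the values survive while the resolution label does not.

```python
from datetime import timedelta

# Units available to the Parquet TIME logical type; note there is no "s".
PARQUET_TIME_UNITS = ("ms", "us", "ns")


def to_parquet_time(values, unit):
    """Hypothetical helper: map a duration column to a unit Parquet TIME can store.

    Returns (stored_values, stored_unit). Seconds are scaled to milliseconds,
    which preserves the durations but drops the original resolution label.
    """
    if unit in PARQUET_TIME_UNITS:
        return list(values), unit
    if unit == "s":
        return [v * 1000 for v in values], "ms"
    raise ValueError(f"unsupported unit: {unit!r}")


seconds = [1234, 3456, 32442]  # values from the report above
stored, stored_unit = to_parquet_time(seconds, "s")

# The durations themselves are unchanged by the conversion...
assert [timedelta(milliseconds=v) for v in stored] == [
    timedelta(seconds=v) for v in seconds
]
# ...but a reader that sees only the stored unit reports "ms", not "s".
print(stored_unit, stored)
```

Under this assumption, restoring the original resolution on read would require the writer to record it somewhere the reader can find it. The plain `int64` column in the pyarrow-written schema above suggests pyarrow keeps the type information outside the Parquet logical type, though this report does not confirm the mechanism.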

Environment overview (please complete the following information)

  • Environment location: [Bare-metal]
  • Method of cuDF install: [from source]

Environment details

 ***git***
 commit 9b1496df64b9ae9bd7b44a30cfaa42a2f7e2db3f (HEAD -> branch-23.06)
 Author: Ashwin Srinath <[email protected]>
 Date:   Mon May 22 13:52:36 2023 -0400
 
 Fix groupby head/tail for empty dataframe (#13398)
 
 Closes #13397
 
 Authors:
 - Ashwin Srinath (https://github.com/shwina)
 
 Approvers:
 - GALI PREM SAGAR (https://github.com/galipremsagar)
 - Bradley Dice (https://github.com/bdice)
 
 URL: https://github.com/rapidsai/cudf/pull/13398
 ***git submodules***
 
 ***OS Information***
 DISTRIB_ID=Ubuntu
 DISTRIB_RELEASE=18.04
 DISTRIB_CODENAME=bionic
 DISTRIB_DESCRIPTION="Ubuntu 18.04.4 LTS"
 NAME="Ubuntu"
 VERSION="18.04.4 LTS (Bionic Beaver)"
 ID=ubuntu
 ID_LIKE=debian
 PRETTY_NAME="Ubuntu 18.04.4 LTS"
 VERSION_ID="18.04"
 HOME_URL="https://www.ubuntu.com/"
 SUPPORT_URL="https://help.ubuntu.com/"
 BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
 PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
 VERSION_CODENAME=bionic
 UBUNTU_CODENAME=bionic
 Linux dt07 4.15.0-76-generic #86-Ubuntu SMP Fri Jan 17 17:24:28 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux
 
 ***GPU Information***
 Mon May 22 13:53:56 2023
 +---------------------------------------------------------------------------------------+
 | NVIDIA-SMI 530.30.02              Driver Version: 530.30.02    CUDA Version: 12.1     |
 |-----------------------------------------+----------------------+----------------------+
 | GPU  Name                  Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
 | Fan  Temp  Perf            Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
 |                                         |                      |               MIG M. |
 |=========================================+======================+======================|
 |   0  Tesla T4                        On | 00000000:3B:00.0 Off |                    0 |
 | N/A   45C    P8               10W /  70W|      2MiB / 15360MiB |      0%      Default |
 |                                         |                      |                  N/A |
 +-----------------------------------------+----------------------+----------------------+
 |   1  Tesla T4                        On | 00000000:5E:00.0 Off |                    0 |
 | N/A   34C    P8                9W /  70W|      2MiB / 15360MiB |      0%      Default |
 |                                         |                      |                  N/A |
 +-----------------------------------------+----------------------+----------------------+
 |   2  Tesla T4                        On | 00000000:AF:00.0 Off |                    0 |
 | N/A   29C    P8               10W /  70W|      2MiB / 15360MiB |      0%      Default |
 |                                         |                      |                  N/A |
 +-----------------------------------------+----------------------+----------------------+
 |   3  Tesla T4                        On | 00000000:D8:00.0 Off |                    0 |
 | N/A   29C    P8               10W /  70W|      2MiB / 15360MiB |      0%      Default |
 |                                         |                      |                  N/A |
 +-----------------------------------------+----------------------+----------------------+
 
 +---------------------------------------------------------------------------------------+
 | Processes:                                                                            |
 |  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
 |        ID   ID                                                             Usage      |
 |=======================================================================================|
 |  No running processes found                                                           |
 +---------------------------------------------------------------------------------------+
 
 ***CPU***
 Architecture:        x86_64
 CPU op-mode(s):      32-bit, 64-bit
 Byte Order:          Little Endian
 CPU(s):              64
 On-line CPU(s) list: 0-63
 Thread(s) per core:  2
 Core(s) per socket:  16
 Socket(s):           2
 NUMA node(s):        2
 Vendor ID:           GenuineIntel
 CPU family:          6
 Model:               85
 Model name:          Intel(R) Xeon(R) Gold 6130 CPU @ 2.10GHz
 Stepping:            4
 CPU MHz:             1412.660
 BogoMIPS:            4200.00
 Virtualization:      VT-x
 L1d cache:           32K
 L1i cache:           32K
 L2 cache:            1024K
 L3 cache:            22528K
 NUMA node0 CPU(s):   0,2,4,6,8,10,12,14,16,18,20,22,24,26,28,30,32,34,36,38,40,42,44,46,48,50,52,54,56,58,60,62
 NUMA node1 CPU(s):   1,3,5,7,9,11,13,15,17,19,21,23,25,27,29,31,33,35,37,39,41,43,45,47,49,51,53,55,57,59,61,63
 Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb cat_l3 cdp_l3 invpcid_single pti intel_ppin ssbd mba ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm cqm mpx rdt_a avx512f avx512dq rdseed adx smap clflushopt clwb intel_pt avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local dtherm ida arat pln pts pku ospke md_clear flush_l1d arch_capabilities
 
 ***CMake***
 /nvme/0/pgali/envs/cudfdev/bin/cmake
 cmake version 3.26.4
 
 CMake suite maintained and supported by Kitware (kitware.com/cmake).
 
 ***g++***
 /nvme/0/pgali/envs/cudfdev/bin/g++
 g++ (conda-forge gcc 11.3.0-19) 11.3.0
 Copyright (C) 2021 Free Software Foundation, Inc.
 This is free software; see the source for copying conditions.  There is NO
 warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
 
 
 ***nvcc***
 /nvme/0/pgali/envs/cudfdev/bin/nvcc
 nvcc: NVIDIA (R) Cuda compiler driver
 Copyright (c) 2005-2022 NVIDIA Corporation
 Built on Wed_Sep_21_10:33:58_PDT_2022
 Cuda compilation tools, release 11.8, V11.8.89
 Build cuda_11.8.r11.8/compiler.31833905_0
 
 ***Python***
 /nvme/0/pgali/envs/cudfdev/bin/python
 Python 3.10.11
 
 ***Environment Variables***
 PATH                            : /nvme/0/pgali/envs/cudfdev/bin:/nvme/0/pgali/envs/cudfdev/bin:/nvme/0/pgali/.cargo/bin:/home/nfs/pgali/.vscode-server/bin/b3e4e68a0bc097f0ae7907b217c1119af9e03435/bin/remote-cli:/nvme/0/pgali/.cargo/bin:/nvme/0/pgali/anaconda3/bin:/nvme/0/pgali/anaconda3/condabin:/nvme/0/pgali/.cargo/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/snap/bin:/usr/local/cuda/bin
 LD_LIBRARY_PATH                 : /usr/local/cuda/lib64::/usr/local/cuda/lib64:/usr/local/cuda/extras/CUPTI/lib64
 NUMBAPRO_NVVM                   :
 NUMBAPRO_LIBDEVICE              :
 CONDA_PREFIX                    : /nvme/0/pgali/envs/cudfdev
 PYTHON_PATH                     :
 
 ***conda packages***
 /nvme/0/pgali/anaconda3/bin/conda
 # packages in environment at /nvme/0/pgali/envs/cudfdev:
 #
 # Name                    Version                   Build  Channel
 _libgcc_mutex             0.1                 conda_forge    conda-forge
 _openmp_mutex             4.5                       2_gnu    conda-forge
 _sysroot_linux-64_curr_repodata_hack 3                   h69a702a_13    conda-forge
 accessible-pygments       0.0.4              pyhd8ed1ab_0    conda-forge
 aiobotocore               2.5.0              pyhd8ed1ab_0    conda-forge
 aiohttp                   3.8.4           py310h1fa729e_0    conda-forge
 aioitertools              0.11.0             pyhd8ed1ab_0    conda-forge
 aiosignal                 1.3.1              pyhd8ed1ab_0    conda-forge
 alabaster                 0.7.13             pyhd8ed1ab_0    conda-forge
 anyio                     3.6.2              pyhd8ed1ab_0    conda-forge
 argon2-cffi               21.3.0             pyhd8ed1ab_0    conda-forge
 argon2-cffi-bindings      21.2.0          py310h5764c6d_3    conda-forge
 arrow-cpp                 11.0.0          ha770c72_20_cpu    conda-forge
 asttokens                 2.2.1              pyhd8ed1ab_0    conda-forge
 async-timeout             4.0.2              pyhd8ed1ab_0    conda-forge
 attrs                     23.1.0             pyh71513ae_1    conda-forge
 aws-c-auth                0.6.27               he072965_1    conda-forge
 aws-c-cal                 0.5.26               hf677bf3_1    conda-forge
 aws-c-common              0.8.19               hd590300_0    conda-forge
 aws-c-compression         0.2.16               hbad4bc6_7    conda-forge
 aws-c-event-stream        0.2.20               hb4b372c_7    conda-forge
 aws-c-http                0.7.7                h2632f9a_4    conda-forge
 aws-c-io                  0.13.21              h9fef7b8_5    conda-forge
 aws-c-mqtt                0.8.11               h2282364_1    conda-forge
 aws-c-s3                  0.3.0                hcb5a9b2_2    conda-forge
 aws-c-sdkutils            0.1.9                hbad4bc6_2    conda-forge
 aws-checksums             0.1.14               hbad4bc6_7    conda-forge
 aws-crt-cpp               0.20.1               he0fdcb3_3    conda-forge
 aws-sam-translator        1.55.0             pyhd8ed1ab_0    conda-forge
 aws-sdk-cpp               1.10.57             hb0b1f3a_12    conda-forge
 aws-xray-sdk              2.12.0             pyhd8ed1ab_0    conda-forge
 babel                     2.12.1             pyhd8ed1ab_1    conda-forge
 backcall                  0.2.0              pyh9f0ad1d_0    conda-forge
 backports                 1.0                pyhd8ed1ab_3    conda-forge
 backports.functools_lru_cache 1.6.4              pyhd8ed1ab_0    conda-forge
 backports.zoneinfo        0.2.1           py310hff52083_7    conda-forge
 bcrypt                    3.2.2           py310h5764c6d_1    conda-forge
 beautifulsoup4            4.12.2             pyha770c72_0    conda-forge
 binutils                  2.39                 hdd6e379_1    conda-forge
 binutils_impl_linux-64    2.39                 he00db2b_1    conda-forge
 binutils_linux-64         2.39                h5fc0e48_13    conda-forge
 blas                      1.0                         mkl    conda-forge
 bleach                    6.0.0              pyhd8ed1ab_0    conda-forge
 blinker                   1.6.2              pyhd8ed1ab_0    conda-forge
 bokeh                     2.4.3              pyhd8ed1ab_3    conda-forge
 boto3                     1.26.76            pyhd8ed1ab_0    conda-forge
 botocore                  1.29.76            pyhd8ed1ab_0    conda-forge
 brotlipy                  0.7.0           py310h5764c6d_1005    conda-forge
 bzip2                     1.0.8                h7f98852_4    conda-forge
 c-ares                    1.19.0               hd590300_0    conda-forge
 c-compiler                1.5.2                h0b41bf4_0    conda-forge
 ca-certificates           2023.5.7             hbcca054_0    conda-forge
 cachetools                5.3.0              pyhd8ed1ab_0    conda-forge
 certifi                   2023.5.7           pyhd8ed1ab_0    conda-forge
 cffi                      1.15.1          py310h255011f_3    conda-forge
 cfgv                      3.3.1              pyhd8ed1ab_0    conda-forge
 cfn-lint                  0.75.1             pyhd8ed1ab_0    conda-forge
 charset-normalizer        2.1.1              pyhd8ed1ab_0    conda-forge
 click                     8.1.3           unix_pyhd8ed1ab_2    conda-forge
 cloudpickle               2.2.1              pyhd8ed1ab_0    conda-forge
 cmake                     3.26.4               hcfe8598_0    conda-forge
 colorama                  0.4.6              pyhd8ed1ab_0    conda-forge
 comm                      0.1.3              pyhd8ed1ab_0    conda-forge
 commonmark                0.9.1                      py_0    conda-forge
 coverage                  7.2.5           py310h2372a71_0    conda-forge
 cryptography              40.0.2          py310h34c0648_0    conda-forge
 cubinlinker               0.2.2           py310hf09951c_0    rapidsai
 cuda-python               11.8.1          py310h01a121a_2    conda-forge
 cuda-sanitizer-api        11.8.86                       0    nvidia
 cudatoolkit               11.8.0              h37601d7_11    conda-forge
 cudf                      23.6.0                   pypi_0    pypi
 cupy                      12.0.0          py310h9216885_1    conda-forge
 cxx-compiler              1.5.2                hf52228f_0    conda-forge
 cyrus-sasl                2.1.27               h9033bb2_6    conda-forge
 cython                    0.29.34         py310heca2aa9_0    conda-forge
 cytoolz                   0.12.0          py310h5764c6d_1    conda-forge
 dask                      2023.3.2           pyhd8ed1ab_0    conda-forge
 dask-core                 2023.3.2           pyhd8ed1ab_0    conda-forge
 dask-cuda                 23.06.00a       py310_230522_gcf6e9fb_24    rapidsai-nightly
 dask-cudf                 23.6.0                   pypi_0    pypi
 dataclasses               0.8                pyhc8e2a94_3    conda-forge
 datasets                  2.12.0             pyhd8ed1ab_0    conda-forge
 debugpy                   1.6.7           py310heca2aa9_0    conda-forge
 decopatch                 1.4.10             pyhd8ed1ab_0    conda-forge
 decorator                 5.1.1              pyhd8ed1ab_0    conda-forge
 defusedxml                0.7.1              pyhd8ed1ab_0    conda-forge
 dill                      0.3.6              pyhd8ed1ab_1    conda-forge
 distlib                   0.3.6              pyhd8ed1ab_0    conda-forge
 distributed               2023.3.2.1         pyhd8ed1ab_0    conda-forge
 distro                    1.8.0              pyhd8ed1ab_0    conda-forge
 dlpack                    0.5                  h9c3ff4c_0    conda-forge
 docker-py                 6.1.0              pyhd8ed1ab_0    conda-forge
 docutils                  0.19            py310hff52083_1    conda-forge
 doxygen                   1.8.20               had0d8f1_0    conda-forge
 ecdsa                     0.18.0             pyhd8ed1ab_1    conda-forge
 entrypoints               0.4                pyhd8ed1ab_0    conda-forge
 exceptiongroup            1.1.1              pyhd8ed1ab_0    conda-forge
 execnet                   1.9.0              pyhd8ed1ab_0    conda-forge
 executing                 1.2.0              pyhd8ed1ab_0    conda-forge
 expat                     2.5.0                hcb278e6_1    conda-forge
 fastavro                  1.7.4           py310h2372a71_0    conda-forge
 fastrlock                 0.8             py310hd8f1fbe_3    conda-forge
 filelock                  3.12.0             pyhd8ed1ab_0    conda-forge
 flask                     2.3.2              pyhd8ed1ab_0    conda-forge
 flask_cors                3.0.10             pyhd3deb0d_0    conda-forge
 flit-core                 3.9.0              pyhd8ed1ab_0    conda-forge
 fmt                       9.1.0                h924138e_0    conda-forge
 freetype                  2.12.1               hca18f0e_1    conda-forge
 frozenlist                1.3.3           py310h5764c6d_0    conda-forge
 fsspec                    2023.5.0           pyh1a96a4e_0    conda-forge
 future                    0.18.3             pyhd8ed1ab_0    conda-forge
 gcc                       11.3.0              h02d0930_13    conda-forge
 gcc_impl_linux-64         11.3.0              hab1b70f_19    conda-forge
 gcc_linux-64              11.3.0              he6f903b_13    conda-forge
 gflags                    2.2.2             he1b5a44_1004    conda-forge
 glog                      0.6.0                h6f12383_0    conda-forge
 gmock                     1.13.0               ha770c72_1    conda-forge
 gmp                       6.2.1                h58526e2_0    conda-forge
 gmpy2                     2.1.2           py310h3ec546c_1    conda-forge
 graphql-core              3.2.3              pyhd8ed1ab_0    conda-forge
 greenlet                  2.0.2           py310hc6cd4ac_1    conda-forge
 gtest                     1.13.0               h00ab1b0_1    conda-forge
 gxx                       11.3.0              h02d0930_13    conda-forge
 gxx_impl_linux-64         11.3.0              hab1b70f_19    conda-forge
 gxx_linux-64              11.3.0              hc203a17_13    conda-forge
 huggingface_hub           0.14.1             pyhd8ed1ab_0    conda-forge
 hypothesis                6.75.3             pyha770c72_0    conda-forge
 identify                  2.5.24             pyhd8ed1ab_0    conda-forge
 idna                      3.4                pyhd8ed1ab_0    conda-forge
 imagesize                 1.4.1              pyhd8ed1ab_0    conda-forge
 importlib-metadata        6.6.0              pyha770c72_0    conda-forge
 importlib_metadata        6.6.0                hd8ed1ab_0    conda-forge
 iniconfig                 2.0.0              pyhd8ed1ab_0    conda-forge
 intel-openmp              2022.1.0          h9e868ea_3769
 ipykernel                 6.23.1             pyh210e3f2_0    conda-forge
 ipython                   8.13.2             pyh41d4057_0    conda-forge
 ipython_genutils          0.2.0                      py_1    conda-forge
 itsdangerous              2.1.2              pyhd8ed1ab_0    conda-forge
 jedi                      0.18.2             pyhd8ed1ab_0    conda-forge
 jinja2                    3.1.2              pyhd8ed1ab_1    conda-forge
 jmespath                  1.0.1              pyhd8ed1ab_0    conda-forge
 joblib                    1.2.0              pyhd8ed1ab_0    conda-forge
 jschema-to-python         1.2.3              pyhd8ed1ab_0    conda-forge
 jsondiff                  2.0.0              pyhd8ed1ab_0    conda-forge
 jsonpatch                 1.32               pyhd8ed1ab_0    conda-forge
 jsonpickle                2.2.0              pyhd8ed1ab_0    conda-forge
 jsonpointer               2.0                        py_0    conda-forge
 jsonschema                3.2.0              pyhd8ed1ab_3    conda-forge
 junit-xml                 1.9                pyh9f0ad1d_0    conda-forge
 jupyter-cache             0.6.1              pyhd8ed1ab_0    conda-forge
 jupyter_client            8.2.0              pyhd8ed1ab_0    conda-forge
 jupyter_core              5.3.0           py310hff52083_0    conda-forge
 jupyter_events            0.6.3              pyhd8ed1ab_0    conda-forge
 jupyter_server            2.5.0              pyhd8ed1ab_0    conda-forge
 jupyter_server_terminals  0.4.4              pyhd8ed1ab_1    conda-forge
 jupyterlab_pygments       0.2.2              pyhd8ed1ab_0    conda-forge
 kernel-headers_linux-64   3.10.0              h4a8ded7_13    conda-forge
 keyutils                  1.6.1                h166bdaf_0    conda-forge
 krb5                      1.20.1               h81ceb04_0    conda-forge
 lcms2                     2.15                 haa2dc70_1    conda-forge
 ld_impl_linux-64          2.39                 hcc3a1bd_1    conda-forge
 lerc                      4.0.0                h27087fc_0    conda-forge
 libabseil                 20230125.2      cxx17_h59595ed_2    conda-forge
 libarrow                  11.0.0          h6564b11_20_cpu    conda-forge
 libblas                   3.9.0            16_linux64_mkl    conda-forge
 libbrotlicommon           1.0.9                h166bdaf_8    conda-forge
 libbrotlidec              1.0.9                h166bdaf_8    conda-forge
 libbrotlienc              1.0.9                h166bdaf_8    conda-forge
 libcblas                  3.9.0            16_linux64_mkl    conda-forge
 libcrc32c                 1.1.2                h9c3ff4c_0    conda-forge
 libcufile                 1.4.0.31                      0    nvidia
 libcufile-dev             1.4.0.31                      0    nvidia
 libcurand                 10.3.0.86                     0    nvidia
 libcurand-dev             10.3.0.86                     0    nvidia
 libcurl                   8.1.0                h409715c_0    conda-forge
 libdeflate                1.18                 h0b41bf4_0    conda-forge
 libedit                   3.1.20191231         he28a2e2_2    conda-forge
 libev                     4.33                 h516909a_1    conda-forge
 libevent                  2.1.12               h3358134_0    conda-forge
 libexpat                  2.5.0                hcb278e6_1    conda-forge
 libffi                    3.4.2                h7f98852_5    conda-forge
 libgcc-devel_linux-64     11.3.0              h210ce93_19    conda-forge
 libgcc-ng                 12.2.0              h65d4601_19    conda-forge
 libgfortran-ng            12.2.0              h69a702a_19    conda-forge
 libgfortran5              12.2.0              h337968e_19    conda-forge
 libgomp                   12.2.0              h65d4601_19    conda-forge
 libgoogle-cloud           2.10.1               hac9eb74_1    conda-forge
 libgrpc                   1.54.2               hb20ce57_2    conda-forge
 libiconv                  1.17                 h166bdaf_0    conda-forge
 libjpeg-turbo             2.1.5.1              h0b41bf4_0    conda-forge
 libkvikio                 23.06.00a       cuda11_230522_g2fbcd33_26    rapidsai-nightly
 liblapack                 3.9.0            16_linux64_mkl    conda-forge
 libllvm11                 11.1.0               he0ac6c6_5    conda-forge
 libnghttp2                1.52.0               h61bc06f_0    conda-forge
 libnsl                    2.0.0                h7f98852_0    conda-forge
 libntlm                   1.4               h7f98852_1002    conda-forge
 libnuma                   2.0.16               h0b41bf4_1    conda-forge
 libpng                    1.6.39               h753d276_0    conda-forge
 libprotobuf               3.21.12              h3eb15da_0    conda-forge
 librdkafka                1.9.2                ha5a0de0_2    conda-forge
 librmm                    23.06.00a       cuda11_230522_gc11ea8a5_19    rapidsai-nightly
 libsanitizer              11.3.0              h239ccf8_19    conda-forge
 libsodium                 1.0.18               h36c2ea0_1    conda-forge
 libsqlite                 3.42.0               h2797004_0    conda-forge
 libssh2                   1.10.0               hf14f497_3    conda-forge
 libstdcxx-devel_linux-64  11.3.0              h210ce93_19    conda-forge
 libstdcxx-ng              12.2.0              h46fd767_19    conda-forge
 libthrift                 0.18.1               h8fd135c_1    conda-forge
 libtiff                   4.5.0                ha587672_6    conda-forge
 libutf8proc               2.8.0                h166bdaf_0    conda-forge
 libuuid                   2.38.1               h0b41bf4_0    conda-forge
 libuv                     1.44.2               h166bdaf_0    conda-forge
 libwebp-base              1.3.0                h0b41bf4_0    conda-forge
 libxcb                    1.15                 h0b41bf4_0    conda-forge
 libzlib                   1.2.13               h166bdaf_4    conda-forge
 livereload                2.6.3              pyh9f0ad1d_0    conda-forge
 llvmlite                  0.39.1          py310h58363a5_1    conda-forge
 locket                    1.0.0              pyhd8ed1ab_0    conda-forge
 lz4                       4.3.2           py310h0cfdcf0_0    conda-forge
 lz4-c                     1.9.4                hcb278e6_0    conda-forge
 makefun                   1.15.1             pyhd8ed1ab_0    conda-forge
 markdown                  3.4.3              pyhd8ed1ab_0    conda-forge
 markdown-it-py            2.2.0              pyhd8ed1ab_0    conda-forge
 markupsafe                2.1.2           py310h1fa729e_0    conda-forge
 matplotlib-inline         0.1.6              pyhd8ed1ab_0    conda-forge
 mdit-py-plugins           0.3.5              pyhd8ed1ab_0    conda-forge
 mdurl                     0.1.0              pyhd8ed1ab_0    conda-forge
 mimesis                   10.0.0             pyhd8ed1ab_0    conda-forge
 mistune                   2.0.5              pyhd8ed1ab_0    conda-forge
 mkl                       2022.1.0           hc2b9512_224
 moto                      4.1.10             pyhd8ed1ab_0    conda-forge
 mpc                       1.3.1                hfe3b2da_0    conda-forge
 mpfr                      4.2.0                hb012696_0    conda-forge
 msgpack-python            1.0.5           py310hdf3cbec_0    conda-forge
 multidict                 6.0.4           py310h1fa729e_0    conda-forge
 multiprocess              0.70.14         py310h5764c6d_3    conda-forge
 myst-nb                   0.17.2             pyhd8ed1ab_0    conda-forge
 myst-parser               0.18.1             pyhd8ed1ab_0    conda-forge
 nbclassic                 1.0.0              pyhb4ecaf3_1    conda-forge
 nbclient                  0.7.4              pyhd8ed1ab_0    conda-forge
 nbconvert                 7.2.9              pyhd8ed1ab_0    conda-forge
 nbconvert-core            7.2.9              pyhd8ed1ab_0    conda-forge
 nbconvert-pandoc          7.2.9              pyhd8ed1ab_0    conda-forge
 nbformat                  5.8.0              pyhd8ed1ab_0    conda-forge
 nbsphinx                  0.9.1              pyhd8ed1ab_0    conda-forge
 ncurses                   6.3                  h27087fc_1    conda-forge
 nest-asyncio              1.5.6              pyhd8ed1ab_0    conda-forge
 networkx                  2.8.8              pyhd8ed1ab_0    conda-forge
 ninja                     1.11.1               h924138e_0    conda-forge
 nodeenv                   1.8.0              pyhd8ed1ab_0    conda-forge
 notebook                  6.5.4              pyha770c72_0    conda-forge
 notebook-shim             0.2.3              pyhd8ed1ab_0    conda-forge
 numba                     0.56.4          py310h0e39c9b_1    conda-forge
 numpy                     1.23.5          py310h53a5b5f_0    conda-forge
 numpydoc                  1.5.0              pyhd8ed1ab_0    conda-forge
 nvcc_linux-64             11.8                h41dc85b_22    conda-forge
 nvtx                      0.2.5           py310h1fa729e_0    conda-forge
 openapi-schema-validator  0.2.3              pyhd8ed1ab_0    conda-forge
 openapi-spec-validator    0.4.0              pyhd8ed1ab_1    conda-forge
 openjpeg                  2.5.0                hfec8fc6_2    conda-forge
 openssl                   3.1.0                hd590300_3    conda-forge
 orc                       1.8.3                hfdbbad2_0    conda-forge
 packaging                 23.1               pyhd8ed1ab_0    conda-forge
 pandas                    1.5.3                    pypi_0    pypi
 pandoc                    3.1.2                h32600fe_1    conda-forge
 pandocfilters             1.5.0              pyhd8ed1ab_0    conda-forge
 paramiko                  3.1.0              pyhd8ed1ab_0    conda-forge
 parquet-cpp               1.5.1                         2    conda-forge
 parso                     0.8.3              pyhd8ed1ab_0    conda-forge
 partd                     1.4.0              pyhd8ed1ab_0    conda-forge
 pbr                       5.11.1             pyhd8ed1ab_0    conda-forge
 pexpect                   4.8.0              pyh1a96a4e_2    conda-forge
 pickleshare               0.7.5                   py_1003    conda-forge
 pillow                    9.5.0           py310h582fbeb_1    conda-forge
 pip                       23.1.2             pyhd8ed1ab_0    conda-forge
 platformdirs              3.5.1              pyhd8ed1ab_0    conda-forge
 pluggy                    1.0.0              pyhd8ed1ab_5    conda-forge
 pooch                     1.7.0              pyha770c72_3    conda-forge
 pre-commit                3.3.2              pyha770c72_0    conda-forge
 prometheus_client         0.16.0             pyhd8ed1ab_0    conda-forge
 prompt-toolkit            3.0.38             pyha770c72_0    conda-forge
 prompt_toolkit            3.0.38               hd8ed1ab_0    conda-forge
 protobuf                  4.21.12         py310heca2aa9_0    conda-forge
 psutil                    5.9.5           py310h1fa729e_0    conda-forge
 pthread-stubs             0.4               h36c2ea0_1001    conda-forge
 ptxcompiler               0.8.1           py310h01a121a_0    conda-forge
 ptyprocess                0.7.0              pyhd3deb0d_0    conda-forge
 pure_eval                 0.2.2              pyhd8ed1ab_0    conda-forge
 py-cpuinfo                9.0.0              pyhd8ed1ab_0    conda-forge
 pyarrow                   11.0.0          py310he6bfd7f_20_cpu    conda-forge
 pyasn1                    0.4.8                      py_0    conda-forge
 pycparser                 2.21               pyhd8ed1ab_0    conda-forge
 pydata-sphinx-theme       0.13.3             pyhd8ed1ab_0    conda-forge
 pygments                  2.15.1             pyhd8ed1ab_0    conda-forge
 pynacl                    1.5.0           py310h5764c6d_2    conda-forge
 pynvml                    11.4.1             pyhd8ed1ab_0    conda-forge
 pyopenssl                 23.1.1             pyhd8ed1ab_0    conda-forge
 pyorc                     0.8.0           py310hd52fb3e_4    conda-forge
 pyparsing                 3.0.9              pyhd8ed1ab_0    conda-forge
 pyrsistent                0.19.3          py310h1fa729e_0    conda-forge
 pysocks                   1.7.1              pyha2e5f31_6    conda-forge
 pytest                    7.3.1              pyhd8ed1ab_0    conda-forge
 pytest-benchmark          4.0.0              pyhd8ed1ab_0    conda-forge
 pytest-cases              3.6.14             pyhd8ed1ab_0    conda-forge
 pytest-cov                4.0.0              pyhd8ed1ab_0    conda-forge
 pytest-xdist              3.3.1              pyhd8ed1ab_0    conda-forge
 python                    3.10.11         he550d4f_0_cpython    conda-forge
 python-confluent-kafka    1.9.2           py310h5764c6d_2    conda-forge
 python-dateutil           2.8.2              pyhd8ed1ab_0    conda-forge
 python-fastjsonschema     2.17.1             pyhd8ed1ab_0    conda-forge
 python-jose               3.3.0              pyh6c4a22f_1    conda-forge
 python-json-logger        2.0.7              pyhd8ed1ab_0    conda-forge
 python-snappy             0.6.1           py310hcee4d7c_0    conda-forge
 python-xxhash             3.2.0           py310h1fa729e_0    conda-forge
 python_abi                3.10                    3_cp310    conda-forge
 pytorch                   1.11.0             py3.10_cpu_0    pytorch
 pytorch-mutex             1.0                         cpu    pytorch
 pytz                      2023.3             pyhd8ed1ab_0    conda-forge
 pywin32-on-windows        0.1.0              pyh1179c8e_3    conda-forge
 pyyaml                    6.0             py310h5764c6d_5    conda-forge
 pyzmq                     25.0.2          py310h059b190_0    conda-forge
 re2                       2023.03.02           h8c504da_0    conda-forge
 readline                  8.2                  h8228510_1    conda-forge
 recommonmark              0.7.1              pyhd8ed1ab_0    conda-forge
 regex                     2023.5.5        py310h2372a71_0    conda-forge
 requests                  2.31.0             pyhd8ed1ab_0    conda-forge
 responses                 0.18.0             pyhd8ed1ab_0    conda-forge
 rfc3339-validator         0.1.4              pyhd8ed1ab_0    conda-forge
 rfc3986-validator         0.1.1              pyh9f0ad1d_0    conda-forge
 rhash                     1.4.3                h166bdaf_0    conda-forge
 rmm                       23.06.00a       cuda11_py310_230522_gc11ea8a5_19    rapidsai-nightly
 rsa                       4.9                pyhd8ed1ab_0    conda-forge
 s2n                       1.3.44               h06160fa_0    conda-forge
 s3fs                      2023.5.0           pyhd8ed1ab_0    conda-forge
 s3transfer                0.6.1              pyhd8ed1ab_0    conda-forge
 sacremoses                0.0.53             pyhd8ed1ab_0    conda-forge
 sarif-om                  1.0.4              pyhd8ed1ab_0    conda-forge
 scikit-build              0.17.1             pyh56297ac_0    conda-forge
 scipy                     1.10.1          py310ha4c1d20_3    conda-forge
 sed                       4.8                  he412f7d_0    conda-forge
 send2trash                1.8.2              pyh41d4057_0    conda-forge
 setuptools                67.7.2             pyhd8ed1ab_0    conda-forge
 six                       1.16.0             pyh6c4a22f_0    conda-forge
 snappy                    1.1.10               h9fff704_0    conda-forge
 sniffio                   1.3.0              pyhd8ed1ab_0    conda-forge
 snowballstemmer           2.2.0              pyhd8ed1ab_0    conda-forge
 sortedcontainers          2.4.0              pyhd8ed1ab_0    conda-forge
 soupsieve                 2.3.2.post1        pyhd8ed1ab_0    conda-forge
 spdlog                    1.11.0               h9b3ece8_1    conda-forge
 sphinx                    5.3.0              pyhd8ed1ab_0    conda-forge
 sphinx-autobuild          2021.3.14          pyhd8ed1ab_0    conda-forge
 sphinx-copybutton         0.5.2              pyhd8ed1ab_0    conda-forge
 sphinx-markdown-tables    0.0.17             pyh6c4a22f_0    conda-forge
 sphinxcontrib-applehelp   1.0.4              pyhd8ed1ab_0    conda-forge
 sphinxcontrib-devhelp     1.0.2                      py_0    conda-forge
 sphinxcontrib-htmlhelp    2.0.1              pyhd8ed1ab_0    conda-forge
 sphinxcontrib-jsmath      1.0.1                      py_0    conda-forge
 sphinxcontrib-qthelp      1.0.3                      py_0    conda-forge
 sphinxcontrib-serializinghtml 1.1.5              pyhd8ed1ab_2    conda-forge
 sphinxcontrib-websupport  1.2.4              pyhd8ed1ab_1    conda-forge
 sqlalchemy                2.0.15          py310h2372a71_0    conda-forge
 sshpubkeys                3.3.1              pyhd8ed1ab_0    conda-forge
 stack_data                0.6.2              pyhd8ed1ab_0    conda-forge
 streamz                   0.6.4              pyh6c4a22f_0    conda-forge
 sysroot_linux-64          2.17                h4a8ded7_13    conda-forge
 tabulate                  0.9.0              pyhd8ed1ab_1    conda-forge
 tblib                     1.7.0              pyhd8ed1ab_0    conda-forge
 terminado                 0.17.1             pyh41d4057_0    conda-forge
 tinycss2                  1.2.1              pyhd8ed1ab_0    conda-forge
 tk                        8.6.12               h27826a3_0    conda-forge
 tokenizers                0.13.1          py310h633acb5_2    conda-forge
 toml                      0.10.2             pyhd8ed1ab_0    conda-forge
 tomli                     2.0.1              pyhd8ed1ab_0    conda-forge
 toolz                     0.12.0             pyhd8ed1ab_0    conda-forge
 tornado                   6.3.2           py310h2372a71_0    conda-forge
 tqdm                      4.65.0             pyhd8ed1ab_1    conda-forge
 traitlets                 5.9.0              pyhd8ed1ab_0    conda-forge
 transformers              4.24.0             pyhd8ed1ab_0    conda-forge
 typing-extensions         4.5.0                hd8ed1ab_0    conda-forge
 typing_extensions         4.5.0              pyha770c72_0    conda-forge
 tzdata                    2023.3                   pypi_0    pypi
 ucx                       1.14.1               h8c404fb_0    conda-forge
 ukkonen                   1.0.1           py310hbf28c38_3    conda-forge
 urllib3                   1.26.15            pyhd8ed1ab_0    conda-forge
 virtualenv                20.23.0            pyhd8ed1ab_0    conda-forge
 wcwidth                   0.2.6              pyhd8ed1ab_0    conda-forge
 webencodings              0.5.1                      py_1    conda-forge
 websocket-client          1.5.2              pyhd8ed1ab_0    conda-forge
 werkzeug                  2.3.4              pyhd8ed1ab_0    conda-forge
 wheel                     0.40.0             pyhd8ed1ab_0    conda-forge
 wrapt                     1.15.0          py310h1fa729e_0    conda-forge
 xmltodict                 0.13.0             pyhd8ed1ab_0    conda-forge
 xorg-libxau               1.0.11               hd590300_0    conda-forge
 xorg-libxdmcp             1.1.3                h7f98852_0    conda-forge
 xxhash                    0.8.1                h0b41bf4_0    conda-forge
 xz                        5.2.6                h166bdaf_0    conda-forge
 yaml                      0.2.5                h7f98852_2    conda-forge
 yarl                      1.9.1           py310h2372a71_0    conda-forge
 zeromq                    4.3.4                h9c3ff4c_1    conda-forge
 zict                      3.0.0              pyhd8ed1ab_0    conda-forge
 zipp                      3.15.0             pyhd8ed1ab_0    conda-forge
 zlib                      1.2.13               h166bdaf_4    conda-forge
 zstd                      1.5.2                h3eb15da_6    conda-forge

@galipremsagar galipremsagar added bug Something isn't working Needs Triage Need team to review and classify cuIO cuIO issue labels May 22, 2023
@galipremsagar
Contributor Author

galipremsagar commented May 22, 2023

A possibly related PR previously addressed a similar issue: #11854

@galipremsagar
Contributor Author

a_cudf.zip
a_pyarrow.zip

@GregoryKimball
Contributor

GregoryKimball commented Jun 7, 2023

This seems to be a problem where libcudf is not writing datetime64[s] or timedelta64[s] correctly. My testing shows that libcudf also does not round-trip these types faithfully:

import cudf
import pyarrow.parquet as pq  # import the submodule explicitly

for dtype in [
    'timedelta64[s]',
    'timedelta64[ms]',
    'timedelta64[us]',
    'timedelta64[ns]',
    'datetime64[s]',
    'datetime64[ms]',
    'datetime64[us]',
    'datetime64[ns]',
]:
    df = cudf.DataFrame({"s": cudf.Series([1234, 3456, 32442], dtype=dtype)})
    df.to_parquet("a")
    df2 = cudf.read_parquet("a")  # read back with cudf
    df3 = pq.read_table("a")      # read back with pyarrow

    # original dtype, cudf round-trip dtype, pyarrow type
    print(df['s'].dtype, df2['s'].dtype, df3['s'].type)

Output:
timedelta64[s] timedelta64[ms] time32[ms]
timedelta64[ms] timedelta64[ms] time32[ms]
timedelta64[us] timedelta64[us] time64[us]
timedelta64[ns] timedelta64[ns] time64[ns]
datetime64[s] datetime64[ms] timestamp[ms]
datetime64[ms] datetime64[ms] timestamp[ms]
datetime64[us] datetime64[us] timestamp[us]
datetime64[ns] datetime64[ns] timestamp[ns]

@GregoryKimball GregoryKimball added libcudf Affects libcudf (C++/CUDA) code. and removed Needs Triage Need team to review and classify labels Jun 7, 2023
@GregoryKimball GregoryKimball added the 0 - Backlog In queue waiting for assignment label Jun 7, 2023
@mhaseeb123
Member

mhaseeb123 commented Mar 22, 2024

Investigation Notes:

  1. SECONDS is not a valid TimeUnit in Parquet, so it is converted to milliseconds by both cudf and Arrow.
  2. I have been able to locally update this behavior and add SECONDS to our TimeUnit enum class. It round-trips correctly with cudf but produces an error when read with pyarrow's parquet reader (invalid unit).
  3. cudf's timedelta actually corresponds to Arrow's duration type rather than its time type, as seen with cudf's to_arrow and from_arrow functions. However, it is not yet possible to convert between timedelta64 and duration using only the Parquet spec.
  4. This is because Arrow encodes duration as int64 in Parquet instead of as a TimeType, and recovers the type by also writing the serialized Arrow schema into the Parquet file. See "[C++][Parquet] Support DurationType in writing/reading parquet" (apache/arrow#23117) and "ARROW-6780: [C++][Parquet] Support DurationType in writing/reading parquet (written as int64)" (apache/arrow#12449).
  5. Arrow types and Parquet types are different sets, mapped onto each other as needed using the Arrow schema stored as part of the Parquet file.
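
The unit coercion described in these notes can be sketched in plain Python. This is only an illustration under stated assumptions: the names `PARQUET_TIME_UNITS`, `to_parquet_unit` and `restore_unit` are hypothetical and are not cuDF, Arrow or Parquet APIs, and the "unit hint" stands in loosely for the serialized Arrow schema that is stored alongside the data.

```python
# Hypothetical sketch: none of these names are cuDF, Arrow or Parquet APIs.

# Units a Parquet TimeUnit annotation can express (there is no SECONDS).
PARQUET_TIME_UNITS = {"ms": "MILLIS", "us": "MICROS", "ns": "NANOS"}


def to_parquet_unit(values, unit):
    """Return (values, unit) expressed in a unit that Parquet can annotate."""
    if unit in PARQUET_TIME_UNITS:
        return list(values), unit
    if unit == "s":
        # Seconds cannot be annotated, so scale the values to milliseconds.
        return [v * 1000 for v in values], "ms"
    raise ValueError(f"unsupported unit: {unit}")


def restore_unit(values, stored_unit, original_unit):
    """Undo the coercion using a unit hint saved next to the data."""
    if original_unit == stored_unit:
        return list(values), stored_unit
    if original_unit == "s" and stored_unit == "ms":
        return [v // 1000 for v in values], "s"
    raise ValueError(f"cannot restore {original_unit} from {stored_unit}")


stored, stored_unit = to_parquet_unit([1234, 3456, 32442], "s")
print(stored, stored_unit)      # [1234000, 3456000, 32442000] ms

restored, restored_unit = restore_unit(stored, stored_unit, "s")
print(restored, restored_unit)  # [1234, 3456, 32442] s
```

Without the hint, a reader sees only the stored unit, which matches the timedelta64[ms] result reported above for a column written as timedelta64[s].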

@mhaseeb123
Member

mhaseeb123 commented Jun 25, 2024

Update:

Support for duration[s]/timedelta64[s] types has been added via arrow:schema support in the cuDF Parquet reader and writer, and these types now round-trip faithfully.

For datetime64[s]/timestamp[s], both cuDF and Arrow convert [s] units to [ms] when writing Parquet, and they interoperate and round-trip the values faithfully regardless of the unit. Although Arrow does not use arrow:schema to restore the unit, we can do so in cuDF if needed.

The question is whether we should do so or leave it as is: the unit of a timestamp column seems arbitrary (in both cuDF and Arrow), because the data are treated, displayed and interpreted as absolute values since the epoch (e.g. 1970-01-01 00:00:01.234) regardless of the unit. Example:

from io import BytesIO

import cudf
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq


def datetime_interop():
    for unit_type in [
        "timestamp[s]",
        "timestamp[ms]",
        "timestamp[us]",
    ]:
        times = pa.array(
            [1234, 3456, 32442], type=unit_type
        )
        names = ["d"]
        pa_table = pa.Table.from_arrays([times], names=names)
        buf = BytesIO()

        pq.write_table(pa_table, buf)
        df2 = cudf.read_parquet(buf)
        df3 = pq.read_table(buf)

        # prints the same values (ignore units)
        print("Original table (pa)\n", pa_table)
        print("cudf read parquet\n", df2)
        print("pyarrow read parquet\n", df3)

        # convert all to pd.Timestamp without caring about column units
        value1 = pd.Timestamp(pa_table["d"][0].as_py())
        value2 = pd.Timestamp(df2["d"][0])
        value3 = pd.Timestamp(df3["d"][0].as_py())

        # check equality
        assert value1 == value2
        assert value1 == value3
        # redundant but anyway
        assert value2 == value3
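
The same point can be checked without cuDF or pyarrow. The following is a minimal standard-library sketch, where `to_instant` and `UNIT_TO_MICROSECONDS` are hypothetical helper names introduced only for illustration: a value in seconds and the same value scaled to milliseconds denote the same instant since the epoch.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical helpers for illustration; not part of cuDF or Arrow.
EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)
UNIT_TO_MICROSECONDS = {"s": 1_000_000, "ms": 1_000, "us": 1}


def to_instant(value, unit):
    """Interpret an integer count of `unit` since the epoch as a datetime."""
    return EPOCH + timedelta(microseconds=value * UNIT_TO_MICROSECONDS[unit])


original = to_instant(1234, "s")        # value as written with unit [s]
coerced = to_instant(1234 * 1000, "ms")  # value after [s] -> [ms] coercion

# The unit changed, but the absolute instant did not.
assert original == coerced
print(original.isoformat())  # 1970-01-01T00:20:34+00:00
```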

@mhaseeb123
Member

mhaseeb123 commented Jun 27, 2024

Closing this issue for now: the unit carries no meaning for timestamp types, since they are treated and displayed as absolute values. Please see the previous comment for updates.
