[BUG] GPU Parquet output for TIMESTAMP_MICROS is misinterpreted by fastparquet as nanos #8778

Closed
gerashegalov opened this issue Jul 22, 2023 · 10 comments
Labels: bug (Something isn't working)

@gerashegalov
Collaborator

Describe the bug
A GPU-written Parquet timestamp is misinterpreted as nanoseconds by fastparquet. Output produced by identical code on the CPU is interpreted by fastparquet correctly.

Steps/Code to reproduce bug

import datetime
import glob
import tempfile
import fastparquet
df = spark.createDataFrame([(datetime.datetime(3023, 7, 14, 7, 38, 45, 418688),)], 'ts timestamp')
cpu_path = tempfile.mkdtemp("cpu_ts")
gpu_path = tempfile.mkdtemp("gpu_ts")
spark.conf.set('spark.sql.parquet.outputTimestampType', 'TIMESTAMP_MICROS')

On CPU, Spark and fastparquet are consistent

spark.conf.set('spark.rapids.sql.enabled', False)
df.write.mode('overwrite').parquet(cpu_path)
cpu_file, = glob.glob(f"{cpu_path}/*.parquet")
spark.read.parquet(cpu_path).show(truncate = False)
fastparquet.ParquetFile(cpu_file).head(1)

Out

+--------------------------+
|ts                        |
+--------------------------+
|3023-07-14 07:38:45.418688|
+--------------------------+

                          ts
0 3023-07-14 07:38:45.418688

GPU's output appears corrupt when read by fastparquet

spark.conf.set('spark.rapids.sql.enabled', True)
df.write.mode('overwrite').parquet(gpu_path)
gpu_file, = glob.glob(f"{gpu_path}/*.parquet")
spark.read.parquet(gpu_path).show(truncate = False)
fastparquet.ParquetFile(gpu_file).head(1)

Out

+--------------------------+
|ts                        |
+--------------------------+
|3023-07-14 07:38:45.418688|
+--------------------------+

                             ts
0 1854-06-04 08:29:37.999584768

The issue appears to be that, in the GPU case, fastparquet assumes the logical time unit is nanoseconds

OrderedDict([('max_ts', dtype('<M8[ns]')), ('max_big_ts', dtype('<M8[ns]'))])

because, unlike in the CPU case, the GPU output does not have the logicalType metadata

fastparquet.ParquetFile(cpu_file).fmd
...
 logicalType:
    BSON: null
    DATE: null
    DECIMAL: null
    ENUM: null
    INTEGER: null
    JSON: null
    LIST: null
    MAP: null
    STRING: null
    TIME: null
    TIMESTAMP:
      isAdjustedToUTC: true
      unit:
        MICROS: {}
        MILLIS: null
        NANOS: null
    UNKNOWN: null
    UUID: null

Expected behavior
spark-rapids should be interoperable with non-Spark Parquet readers, at least with the ones that work with upstream Spark.

Environment details (please complete the following information)

  • Environment location: any
  • Spark configuration settings related to the issue: spark.sql.parquet.outputTimestampType='TIMESTAMP_MICROS'

Additional context
Encountered while working on #8625.

gerashegalov added the bug (Something isn't working) and ? - Needs Triage (Need team to review and classify) labels on Jul 22, 2023
@revans2
Collaborator

revans2 commented Jul 24, 2023

The LogicalType was added more recently and the CUDF parquet writer does not support it. We should be tagging the column with TIMESTAMP_MILLIS or TIMESTAMP_MICROS if it is not in nanoseconds, which is the default.

https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#deprecated-timestamp-convertedtype

https://github.com/apache/parquet-format/blob/1603152f8991809e8ad29659dffa224b4284f31b/src/main/thrift/parquet.thrift#L106-L120
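
To illustrate the fallback described in this comment, here is a minimal sketch of how a reader could resolve the time unit of an INT64 timestamp column: prefer the newer logicalType annotation and fall back to the deprecated ConvertedType when it is absent. The helper name `resolve_timestamp_unit` and its arguments are hypothetical and are not part of fastparquet, cuDF, or Spark. The enum values 9 and 10 are those of TIMESTAMP_MILLIS and TIMESTAMP_MICROS in parquet.thrift.

```python
from typing import Optional

# ConvertedType enum values from parquet.thrift (the deprecated annotation).
TIMESTAMP_MILLIS = 9
TIMESTAMP_MICROS = 10


def resolve_timestamp_unit(logical_unit: Optional[str],
                           converted_type: Optional[int]) -> Optional[str]:
    """Hypothetical helper: return 'ms', 'us', 'ns', or None if unannotated."""
    logical_units = {"MILLIS": "ms", "MICROS": "us", "NANOS": "ns"}
    if logical_unit is not None:
        # The newer logicalType annotation wins when it is present.
        return logical_units[logical_unit]
    # Otherwise fall back to the deprecated ConvertedType, which has no NANOS.
    converted_units = {TIMESTAMP_MILLIS: "ms", TIMESTAMP_MICROS: "us"}
    return converted_units.get(converted_type)


print(resolve_timestamp_unit("MICROS", None))          # logicalType only
print(resolve_timestamp_unit(None, TIMESTAMP_MICROS))  # ConvertedType only
print(resolve_timestamp_unit(None, None))              # no annotation at all
```

With this resolution order, a file that carries only the deprecated TIMESTAMP_MICROS tag would still be read as microseconds.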

@gerashegalov
Collaborator Author

mattahrens removed the ? - Needs Triage (Need team to review and classify) label on Jul 25, 2023
@revans2
Collaborator

revans2 commented Jul 25, 2023

@gerashegalov what version of fastparquet, numpy and pandas are you using?

When I try to read the CPU file with fastparquet I get an error:

>>> fp_file_cpu.head(1)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "python3.7/site-packages/fastparquet/api.py", line 298, in head
    return self[:i+1].to_pandas(**kwargs).head(nrows)
  File "python3.7/site-packages/fastparquet/api.py", line 753, in to_pandas
    row_filter=sel, infile=infile)
  File "python3.7/site-packages/fastparquet/api.py", line 365, in read_row_group_file
    row_filter=row_filter
  File "python3.7/site-packages/fastparquet/core.py", line 609, in read_row_group
    cats, selfmade, assign=assign, row_filter=row_filter)
  File "python3.7/site-packages/fastparquet/core.py", line 583, in read_row_group_arrays
    row_filter=row_filter)
  File "python3.7/site-packages/fastparquet/core.py", line 551, in read_col
    piece[:] = convert(val, se)
  File "python3.7/site-packages/pandas/core/arrays/datetimelike.py", line 373, in __setitem__
    super().__setitem__(key, value)
  File "python3.7/site-packages/pandas/core/arrays/_mixins.py", line 182, in __setitem__
    value = self._validate_setitem_value(value)
  File "python3.7/site-packages/pandas/core/arrays/datetimelike.py", line 745, in _validate_setitem_value
    return self._unbox(value, setitem=True)
  File "python3.7/site-packages/pandas/core/arrays/datetimelike.py", line 757, in _unbox
    self._check_compatible_with(other, setitem=setitem)
  File "python3.7/site-packages/pandas/core/arrays/datetimes.py", line 505, in _check_compatible_with
    if not timezones.tz_compare(self.tz, other.tz):
AttributeError: 'numpy.ndarray' object has no attribute 'tz'

I am running with

fastparquet 0.8.1
numpy 1.21.6
pandas 1.3.5

I also get different results for the GPU file: fastparquet gives me an overflow error

>>> fp_file.head(1)
OverflowError: value too large
Exception ignored in: 'fastparquet.cencoding.time_shift'
OverflowError: value too large
                             ts
0 1971-01-20 19:04:07.125418688
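
For reference, the 1971 date printed here is consistent with the stored microsecond count being used directly as a nanosecond count, with no unit conversion. The standard-library check below assumes the stored INT64 value is 33246247125418688, the number reported in the pyarrow error messages later in this comment. The helper `as_nanos` is hypothetical and only formats the result; it says nothing about which code path fastparquet actually takes.

```python
from datetime import datetime, timedelta

EPOCH = datetime(1970, 1, 1)
raw = 33246247125418688  # assumed stored value, taken from the pyarrow errors


def as_nanos(value: int) -> str:
    """Format an integer count of nanoseconds since the Unix epoch (UTC)."""
    seconds, nanos = divmod(value, 1_000_000_000)
    return f"{EPOCH + timedelta(seconds=seconds):%Y-%m-%d %H:%M:%S}.{nanos:09d}"


# Treat the microsecond count as if it were already nanoseconds.
print(as_nanos(raw))  # 1971-01-20 19:04:07.125418688
```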

Not sure what is happening here. As far as I can see, the footers are tagged equivalently, but fastparquet is clearly taking a different path to parse the GPU file than the CPU file, because the GPU file does not hit the error that the CPU file does. When I try to read them using pandas I get a very similar overflow error for both, but it looks like the CPU version has set isAdjustedToUTC to true, and that might be the difference between them.

GPU Error:

pyarrow.lib.ArrowInvalid: Casting from timestamp[us] to timestamp[ns] would result in out of bounds timestamp: 33246247125418688

CPU Error:

pyarrow.lib.ArrowInvalid: Casting from timestamp[us, tz=UTC] to timestamp[ns] would result in out of bounds timestamp: 33246247125418688
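
Both errors report the same raw value, and the out-of-bounds condition can be checked with plain integer arithmetic. The sketch below uses only the standard library and assumes the value in the error messages above is a count of microseconds since the Unix epoch in UTC. It shows the instant that value denotes, the latest instant a signed 64-bit nanosecond count can hold (truncated to microseconds), and that multiplying by 1000 exceeds the int64 range.

```python
from datetime import datetime, timedelta

EPOCH = datetime(1970, 1, 1)
INT64_MAX = 2**63 - 1
micros = 33246247125418688  # value from the error messages above

# The stored value read as microseconds since the epoch (UTC).
print(EPOCH + timedelta(microseconds=micros))             # 3023-07-14 12:38:45.418688

# Upper bound of an int64 nanosecond timestamp, truncated to microseconds.
print(EPOCH + timedelta(microseconds=INT64_MAX // 1000))  # 2262-04-11 23:47:16.854775

# Converting microseconds to nanoseconds does not fit in int64.
print(micros * 1000 > INT64_MAX)                          # True
```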

@gerashegalov
Collaborator Author

gerashegalov commented Jul 25, 2023

@revans2 good point, should have listed versions in my venv

for p in [fastparquet, numpy, pandas]:
    print(f"name={p.__name__} version={p.__version__}\n")
name=fastparquet version=2023.7.0

name=numpy version=1.25.1

name=pandas version=2.0.3

added pip list to https://github.com/gerashegalov/rapids-shell/blob/557a96c450a307a206330410b335d346d3cc4170/src/jupyter/timestamp_micros.ipynb

@revans2
Collaborator

revans2 commented Jul 26, 2023

I have reproduced the issue and gone through it several times. It appears to be a bug in how fastparquet computes a large timestamp from a V1 file. CUDF is still spitting out Parquet files with a V1 footer. When I use Spark 3.1.1 to write the file (which also writes it out with a V1 footer), I get the exact same result: fastparquet thinks it is from 1854.

>>> import fastparquet
>>> cpu_file = fastparquet.ParquetFile("TMP_CPU_311/part-00000-9fcbe985-36aa-4765-86a0-47a2c6cc4926-c000.snappy.parquet")
>>> cpu_file.head(1)
                             ts
0 1854-06-04 13:29:37.999584768
>>> gpu_file = fastparquet.ParquetFile("TMP_GPU/part-00000-98db0b25-66e1-48c2-91bd-ac78f2ac30ee-c000.snappy.parquet")
>>> gpu_file.head(1)
                             ts
0 1854-06-04 13:29:37.999584768
>>> newer_cpu_file = fastparquet.ParquetFile("TMP_CPU_330/part-00000-7aaa467a-aa1b-43db-8102-c604b9c04862-c000.snappy.parquet")
>>> newer_cpu_file.head(1)
                          ts
0 3023-07-14 12:38:45.418688
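
The 1854 value can be reproduced arithmetically: it is what comes out when the microsecond count is multiplied by 1000 and the product wraps around as a signed 64-bit integer. The sketch below uses only the standard library. It assumes the stored value is 33246247125418688 microseconds (the number from the pyarrow errors earlier in the thread), and `wrap_int64` is a hypothetical helper that emulates the overflow. It reproduces the printed value but does not claim to be the code path fastparquet takes.

```python
from datetime import datetime, timedelta

EPOCH = datetime(1970, 1, 1)
micros = 33246247125418688  # 3023-07-14 12:38:45.418688 UTC in microseconds


def wrap_int64(value: int) -> int:
    """Emulate two's-complement overflow of a signed 64-bit integer."""
    return (value + 2**63) % 2**64 - 2**63


# Microseconds to nanoseconds, with silent int64 wraparound.
nanos = wrap_int64(micros * 1000)
seconds, frac = divmod(nanos, 1_000_000_000)  # floor division keeps frac >= 0
result = f"{EPOCH + timedelta(seconds=seconds):%Y-%m-%d %H:%M:%S}.{frac:09d}"

print(nanos)   # -3647241022000415232
print(result)  # 1854-06-04 13:29:37.999584768
```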

@gerashegalov do you want me to file an issue against fastparquet?

Just FYI CUDF is in the process of going to V2 for writes, eventually. rapidsai/cudf#13501

@gerashegalov
Collaborator Author

do you want me to file an issue against fastparquet?

yes feel free to file a fastparquet issue @revans2

@revans2
Collaborator

revans2 commented Jul 26, 2023

Done. dask/fastparquet#872

@revans2
Collaborator

revans2 commented Jul 26, 2023

@sameerz should we document this, or do we just close this issue because it is a bug in fastparquet?

@sameerz
Collaborator

sameerz commented Jul 28, 2023

I am inclined to close this as it is a bug in fastparquet.

@gerashegalov
Collaborator Author

Superseded by the issue dask/fastparquet#872. Thanks @revans2 for investigating.
