[BUG] "Schemas are inconsistent" error for parquet files which have the same dtype, but are different in not null setting #429
Comments
+1 I've seen this issue many times, but only when loading NVTabular's parquet output.
Let's get these best practices into the data prep doc that's being prepared in:
Thanks for the nice reproducer @gabrielspmoreira! I cannot reproduce the error. I think Dask, Dask-CuDF, and NVTabular could all use better documentation on the subject of parquet metadata handling and preparation. I am in the process of working out a simple set of utilities to generate a global `_metadata` file from a list of parquet files.

Note that if each of your files can fit comfortably in GPU memory (one at a time), you can always generate a clean dataset by round-tripping the data with dask_cudf (without parsing metadata on the read side):

```python
import dask_cudf

PATH = '/.../gfn_problematic_columns'
PATH_FIX = '/.../gfn_problematic_columns_fixed'

ddf = dask_cudf.read_parquet(PATH, gather_statistics=False)
ddf.to_parquet(PATH_FIX)
```
Thanks for the investigation @rjzamora. It would be very helpful if we could at least provide an error message that either highlights the full schemas (so that users can inspect the internal metadata differences themselves, as in the output of your checker script) or describes which columns do not match among the schemas.
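As an illustration, here is a minimal sketch of such a per-column schema comparison using pyarrow (an assumed reconstruction for illustration, not the actual checker script referenced above; the glob path is a placeholder):

```python
import glob
import pyarrow.parquet as pq

# Compare each file's schema against the first file, field by field,
# so that nullability mismatches are reported per column.
paths = sorted(glob.glob("/my/dataset/*.parquet"))  # illustrative path
reference = pq.read_schema(paths[0])
for path in paths[1:]:
    schema = pq.read_schema(path)
    for ref_field, field in zip(reference, schema):
        if not ref_field.equals(field):
            print(f"{path}: {field} does not match {ref_field}")
```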
@rjzamora I like your suggestion of using
That makes sense. In
Good question: it is supported upstream, but I haven't tried it yet. I'll update this comment after I test :) EDIT: It does seem that NVTabular will handle hive-partitioned datasets just fine. For example, I am able to do
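A rough sketch of such a hive-partitioned read through NVTabular (the path and directory layout are illustrative assumptions, not the exact snippet from the comment):

```python
# Rough sketch: load a hive-partitioned dataset (e.g. written with
# partition_on, giving paths like /data/day=0/part.0.parquet) through
# NVTabular. The path "/data" is an illustrative assumption.
import nvtabular as nvt

dataset = nvt.Dataset("/data", engine="parquet")
print(dataset.to_ddf().head())
```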
@rjzamora There might be scenarios where my original parquet is partitioned by a column and I'd like to keep the same column partitioning in the output parquet dataset preprocessed by NVTabular.
NVTabular doesn't really support this right now, but it seems very doable. The easiest case to support is unshuffled output, where we can just use
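Presumably this refers to the `partition_on=` option of `to_parquet`; a minimal sketch under that assumption (the column name and paths are illustrative):

```python
import dask_cudf

# Dask (and dask_cudf) can already write hive-partitioned output by
# passing partition_on; "day" and the paths are illustrative assumptions.
ddf = dask_cudf.read_parquet("/my/dataset")
ddf.to_parquet("/my/output", partition_on=["day"])
```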
Depends on [dask#6851](dask/dask#6851)

[dask#6851](dask/dask#6851) introduces a new `create_metadata_file` utility which can generate a global `_metadata` file from a list of parquet files. Since Dask's `read_parquet` code is optimized to leverage this shared metadata source, it makes a lot of sense to make this file easy to generate.

**Why have this utility in dask_cudf?** Although I originally planned to keep this entirely upstream, it eventually became clear that cudf's **schema-agnostic** mechanism for aggregating metadata is advantageous when the dataset in question comprises files with inconsistent schema information. For example, a pyarrow-written dataset may have an inconsistent schema if only a few partitions contain null elements. In this case, the upstream version of `create_metadata_file` will fail with an "inconsistent schema" error, while the `dask_cudf` version will not. This means the user can use the dask_cudf version in lieu of rewriting the entire dataset, because once the `_metadata` file is created, the schemas will no longer be validated at read time.

**Use Example**

```python
import glob
import dask_cudf

# Specify the list of parquet files to collect metadata from
paths = glob.glob("/my/dataset/*.parquet")

# Use `create_metadata_file` to generate `_metadata`
dask_cudf.io.parquet.create_metadata_file(paths)
...
```

Addresses [nvtabular#429](NVIDIA-Merlin/NVTabular#429)
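As a follow-up to the use example above: once the `_metadata` file exists alongside the data files, a plain read should pick it up; a short sketch continuing the same (illustrative) dataset path:

```python
# With _metadata in place, read_parquet uses it as the single metadata
# source, so the per-file schemas are no longer validated at read time.
ddf = dask_cudf.read_parquet("/my/dataset")
print(ddf.head())
```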
@gabrielspmoreira @rjzamora - closing this, re-open if there is still an issue
Describe the bug
This bug did not occur with NVT 0.2, but it now occurs with the main branch (the future NVT 0.3).
The error "Schemas are inconsistent" is raised when parquet files in the same folder share the same columns and dtypes, but some column has null values in one of the parquet files and not in the corresponding column of another parquet file.
But an error is raised when an NVT dataset is instantiated and we try to head() its first elements, like:
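A hedged reconstruction of that step (the path is an illustrative assumption):

```python
# Hypothetical repro: the directory contains the two parquet files
# whose columns differ only in nullability.
import nvtabular as nvt

dataset = nvt.Dataset("/path/to/parquet_dir", engine="parquet")
dataset.to_ddf().head()
```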
The error raised then is:
By using this script from @rjzamora, it was possible to check that the metadata of the two parquet files differs: the columns are marked not null in one file and nullable in the other, which contains nulls.
BTW, the two parquet files can be loaded individually using dask_cudf, but when they are loaded together (e.g. by pointing to the directory containing both files), the following error is raised by dask_cudf:
Steps/Code to reproduce bug
Here is a folder with a minimal notebook and two small parquet files to reproduce the issue (internal access, for the NVT team only).
Expected behavior
NVT should be able to load a dataset whose parquet files share the same dtypes, even if a column is not null in some files and nullable in others.
Environment details (please complete the following information):
nvtabular in the main branch (future 0.3)
cudf==0.16
dask_cudf==0.16
pyarrow==1.0.1