Migrate JSON reader to pylibcudf #15966

lithomas1 · 2024-06-10T23:49:36Z

Description

Switches the JSON reader to use pylibcudf.
xref #15162

Checklist

I am familiar with the Contributing Guidelines.
New or existing tests cover these changes.
The documentation is up to date with these changes.

…f-io-json

commit 60287e1 Author: Thomas Li <[email protected]> Date: Mon Jul 1 17:56:34 2024 +0000 address more comments commit 25c25d4 Merge: 7806ce4 51fb873 Author: Thomas Li <[email protected]> Date: Mon Jul 1 17:31:44 2024 +0000 Merge branch 'branch-24.08' of github.com:rapidsai/cudf into pylibcudf-io-writers commit 51fb873 Merge: 599ce95 e932fbd Author: gpuCI <[email protected]> Date: Mon Jul 1 12:17:38 2024 -0400 Merge pull request rapidsai#16145 from rapidsai/branch-24.06 Forward-merge branch-24.06 into branch-24.08 commit e932fbd Author: Vyas Ramasubramani <[email protected]> Date: Mon Jul 1 09:17:32 2024 -0700 Add patch for incorrect cuco noexcept clauses (rapidsai#16077) [cuco previously marked a number of methods as noexcept that can in fact throw exceptions](NVIDIA/cuCollections#510). This causes problems for cudf functions that call these methods. The issue [was fixed in cuco upstream](NVIDIA/cuCollections#511), but we cannot easily update to the latest commit of cuco, especially in a patch fix for 24.06. This PR instead adds a rapids-cmake patch for the cuco clone to address this issue. The patch may be removed once we update to a commit of cuco that contains the necessary fix. Resolves rapidsai#16059 commit 599ce95 Author: Lawrence Mitchell <[email protected]> Date: Mon Jul 1 09:35:35 2024 +0100 Implement handlers for series literal in cudf-polars (rapidsai#16113) A query plan can contain a "literal" polars Series. Often, for example, when calling a contains-like function. To translate these, introduce a new `LiteralColumn` node to capture the concept and add an evaluation rule (converting from arrow). Since list-dtype Series need the same casting treatment as in dataframe scan case, factor the casting out into a utility, and take the opportunity to handled casting of nested lists correctly. 
Authors: - Lawrence Mitchell (https://github.com/wence-) Approvers: - Thomas Li (https://github.com/lithomas1) - Vyas Ramasubramani (https://github.com/vyasr) URL: rapidsai#16113 commit 7806ce4 Author: Thomas Li <[email protected]> Date: Sat Jun 29 00:47:53 2024 +0000 simplify again commit e57a677 Merge: e940e30 3c3edfe Author: Thomas Li <[email protected]> Date: Sat Jun 29 00:26:03 2024 +0000 Merge branch 'branch-24.08' of github.com:rapidsai/cudf into pylibcudf-io-writers commit 3c3edfe Author: Yunsong Wang <[email protected]> Date: Fri Jun 28 13:58:22 2024 -0700 Update implementations to build with the latest cuco (rapidsai#15938) This PR updates existing libcudf to accommodate a cuco breaking change introduced in NVIDIA/cuCollections#479. It helps avoid breaking cudf when bumping the cuco version in `rapids-cmake`. Redundant equal/hash overloads will be removed once the version bump is done on the `rapids-cmake` end. Authors: - Yunsong Wang (https://github.com/PointKernel) Approvers: - David Wendt (https://github.com/davidwendt) - Nghia Truong (https://github.com/ttnghia) URL: rapidsai#15938 commit df88cf5 Author: Bradley Dice <[email protected]> Date: Fri Jun 28 15:40:52 2024 -0500 Use size_t to allow large conditional joins (rapidsai#16127) The conditional join kernels were using `cudf::size_type` where `std::size_t` was needed. This PR fixes that bug, which caused `cudaErrorIllegalAddress` as shown in rapidsai#16115. This closes rapidsai#16115. I did not add tests because we typically do not test very large workloads. However, I committed the test and reverted it in this PR, so there is a record of my validation code. 
Authors: - Bradley Dice (https://github.com/bdice) Approvers: - Vyas Ramasubramani (https://github.com/vyasr) - https://github.com/nvdbaranec - Yunsong Wang (https://github.com/PointKernel) URL: rapidsai#16127 commit fb12d98 Author: Robert Maynard <[email protected]> Date: Fri Jun 28 12:14:58 2024 -0400 Installed cudf header use cudf::allocate_like (rapidsai#16087) Remove usage of non public cudf::allocate_like from implementations in headers we install Authors: - Robert Maynard (https://github.com/robertmaynard) Approvers: - Yunsong Wang (https://github.com/PointKernel) - Nghia Truong (https://github.com/ttnghia) URL: rapidsai#16087 commit 78f4a8a Author: Robert Maynard <[email protected]> Date: Fri Jun 28 11:26:27 2024 -0400 Move common string utilities to public api (rapidsai#16070) As part of rapidsai#15982 a subset of the strings utility functions have been identified as being worth expsosing as part of the cudf public API. The `create_string_vector_from_column`, `get_offset64_threshold`, and `is_large_strings_enabled` are now made part of the public `cudf::strings` api. Authors: - Robert Maynard (https://github.com/robertmaynard) Approvers: - MithunR (https://github.com/mythrocks) - David Wendt (https://github.com/davidwendt) - Jayjeet Chakraborty (https://github.com/JayjeetAtGithub) - Lawrence Mitchell (https://github.com/wence-) URL: rapidsai#16070 commit a4b951a Author: nvdbaranec <[email protected]> Date: Fri Jun 28 10:20:42 2024 -0500 Templatization of fixed-width parquet decoding kernels. (rapidsai#15911) This PR merges all of the fixed-width parquet decoding kernels into a single templatized kernel that can be selectively instantiated with desired features (dictionary/no-dictionary, nested/non-nested, etc). It also adds support for (non-list) nested columns in this path. So structs do not have to use the much slower general decode kernel any more. A new benchmark was added specific to structs containing only fixed width columns. 
I added this because the performance improvement is fairly high (+20%) but we don't see it in the normal struct benchmarks because they include (and are dominated by) string decode times. The new benchmark shows: Before this PR: ``` | data_type | io_type | cardinality | run_length | bytes_per_second | peak_memory_usage | encoded_file_size | |-----------|---------------|-------------|------------|------------------|-------------------|-------------------| | STRUCT | DEVICE_BUFFER | 0 | 1 | 21071216823 | 1.047 GiB | 511.675 MiB | | STRUCT | DEVICE_BUFFER | 1000 | 1 | 18974392387 | 821.312 MiB | 128.884 MiB | | STRUCT | DEVICE_BUFFER | 0 | 32 | 20429356824 | 621.787 MiB | 28.141 MiB | | STRUCT | DEVICE_BUFFER | 1000 | 32 | 20572327813 | 598.421 MiB | 16.475 MiB | ``` After this PR: ``` | data_type | io_type | cardinality | run_length | bytes_per_second | peak_memory_usage | encoded_file_size | |-----------|---------------|-------------|------------|------------------|-------------------|-------------------| | STRUCT | DEVICE_BUFFER | 0 | 1 | 25805996399 | 1.047 GiB | 511.675 MiB | | STRUCT | DEVICE_BUFFER | 1000 | 1 | 22422306660 | 821.312 MiB | 128.884 MiB | | STRUCT | DEVICE_BUFFER | 0 | 32 | 24460694014 | 621.787 MiB | 28.141 MiB | | STRUCT | DEVICE_BUFFER | 1000 | 32 | 24674861214 | 598.421 MiB | 16.475 MiB | ``` Split-page decoding for fixed-width types + structs are also going through this new path. New test added. This brings us closer to eliminating the "general" kernel. The only things left that run through it are lists and booleans. This is PR 1 of 2, with the followup moving a lot of code around. At this point, I think it makes sense to start consolidating our files a bit. I also left some breadcrumbs (a few small commented out code blocks) in the core kernel `gpuDecodePageDataGeneric` for the next step of adding list support. They can be removed if people don't like them. 
Authors: - https://github.com/nvdbaranec Approvers: - Mike Wilson (https://github.com/hyperbolic2346) - Vukasin Milovanovic (https://github.com/vuule) - Muhammad Haseeb (https://github.com/mhaseeb123) URL: rapidsai#15911 commit e434fdb Author: David Wendt <[email protected]> Date: Fri Jun 28 10:57:01 2024 -0400 Update libcudf compiler requirements in contributing doc (rapidsai#16103) Updates the compiler requirements in the contributing document. Authors: - David Wendt (https://github.com/davidwendt) Approvers: - Bradley Dice (https://github.com/bdice) - Karthikeyan (https://github.com/karthikeyann) URL: rapidsai#16103 commit 565c0d1 Author: Matthew Murray <[email protected]> Date: Fri Jun 28 10:16:55 2024 -0400 Migrate lists/contains to pylibcudf (rapidsai#15981) Part of rapidsai#15162. Authors: - Matthew Murray (https://github.com/Matt711) Approvers: - Vyas Ramasubramani (https://github.com/vyasr) URL: rapidsai#15981 commit c40e0cc Author: Matthew Murray <[email protected]> Date: Fri Jun 28 10:10:31 2024 -0400 Add support for proxy `np.flatiter` objects (rapidsai#16107) Closes rapidsai#15388 Authors: - Matthew Murray (https://github.com/Matt711) Approvers: - Matthew Roeschke (https://github.com/mroeschke) URL: rapidsai#16107 commit 673d766 Author: Paul Mattione <[email protected]> Date: Fri Jun 28 09:38:57 2024 -0400 Make binary operators work between fixed-point and floating args (rapidsai#16116) Some of the binary operators in cuDF don't work between fixed_point and floating-point numbers after [this earlier PR](rapidsai#15438) removed the ability to construct and implicitly cast fixed_point numbers from floating point numbers. This PR restores that functionality by detecting and performing the necessary explicit casts, and adds tests for the supported operators. Note that the `binary_op_has_common_type` code is modeled after `has_common_type` found in traits.hpp. 
This closes [issue 16090](rapidsai#16090) Authors: - Paul Mattione (https://github.com/pmattione-nvidia) Approvers: - Jayjeet Chakraborty (https://github.com/JayjeetAtGithub) - Karthikeyan (https://github.com/karthikeyann) URL: rapidsai#16116 commit 224ac5b Author: David Wendt <[email protected]> Date: Fri Jun 28 09:26:37 2024 -0400 Add libcudf public/detail API pattern to developer guide (rapidsai#16086) Adds specific description for the public API to detail API function pattern to the libcudf developer guide. Also fixes some formatting issues and broken link. Authors: - David Wendt (https://github.com/davidwendt) Approvers: - Shruti Shivakumar (https://github.com/shrshi) - Karthikeyan (https://github.com/karthikeyann) URL: rapidsai#16086 commit 2b547dc Author: Matthew Roeschke <[email protected]> Date: Fri Jun 28 03:11:01 2024 -1000 Add ensure_index to not unnecessarily shallow copy cudf.Index (rapidsai#16117) The `cudf.Index` constructor will shallow copy a `cudf.Index` input. Sometimes, we just need to make sure an input is a `cudf.Index`, so created `ensure_index` (pandas has something similar) so we don't shallow copy these inputs unnecessarily Authors: - Matthew Roeschke (https://github.com/mroeschke) Approvers: - GALI PREM SAGAR (https://github.com/galipremsagar) URL: rapidsai#16117 commit 57862a3 Author: Robert Maynard <[email protected]> Date: Fri Jun 28 08:43:12 2024 -0400 stable_distinct public api now has a stream parameter (rapidsai#16068) As part of rapidsai#15982 we determined that the cudf `stable_distinct` public API needs to be updated so that a user provided stream can be provided. Authors: - Robert Maynard (https://github.com/robertmaynard) Approvers: - Nghia Truong (https://github.com/ttnghia) - Srinivas Yadav (https://github.com/srinivasyadav18) - Bradley Dice (https://github.com/bdice) URL: rapidsai#16068 commit 6b04fd3 Author: Mads R. B. 
Kristensen <[email protected]> Date: Fri Jun 28 12:31:18 2024 +0200 Memory Profiling (rapidsai#15866) Use [RMM's new memory profiler](rapidsai/rmm#1563) to profile all functions already decorated with `_cudf_nvtx_annotate`. Example ```python import cudf from cudf.utils.performance_tracking import print_memory_report cudf.set_option("memory_profiling", True) df1 = cudf.DataFrame({"a": [1, 2, 3]}) df2 = cudf.DataFrame({"a": [2, 2, 3]}) df3 = df1.merge(df2) print_memory_report() ``` Output: ``` Memory Profiling ================ Ordered by: memory_peak ncalls memory_peak memory_total filename:lineno(function) 1 272 688 /home/mkristensen/apps/miniforge3/envs/rmm-cudf-0527/lib/python3.11/site-packages/cudf/core/dataframe.py:4072(DataFrame.merge) 2 32 64 /home/mkristensen/apps/miniforge3/envs/rmm-cudf-0527/lib/python3.11/site-packages/cudf/core/dataframe.py:1043(DataFrame._init_from_dict_like) 2 32 64 /home/mkristensen/apps/miniforge3/envs/rmm-cudf-0527/lib/python3.11/site-packages/cudf/core/dataframe.py:690(DataFrame.__init__) 2 0 0 /home/mkristensen/apps/miniforge3/envs/rmm-cudf-0527/lib/python3.11/site-packages/cudf/core/dataframe.py:1131(DataFrame._align_input_series_indices) 7 0 0 /home/mkristensen/apps/miniforge3/envs/rmm-cudf-0527/lib/python3.11/site-packages/cudf/core/index.py:214(RangeIndex.__init__) 6 0 0 /home/mkristensen/apps/miniforge3/envs/rmm-cudf-0527/lib/python3.11/site-packages/cudf/core/index.py:424(RangeIndex.__len__) 4 0 0 /home/mkristensen/apps/miniforge3/envs/rmm-cudf-0527/lib/python3.11/site-packages/cudf/core/frame.py:271(Frame.__len__) 2 0 0 /home/mkristensen/apps/miniforge3/envs/rmm-cudf-0527/lib/python3.11/site-packages/cudf/core/dataframe.py:3195(DataFrame._insert) 2 0 0 /home/mkristensen/apps/miniforge3/envs/rmm-cudf-0527/lib/python3.11/site-packages/cudf/core/index.py:270(RangeIndex.name) 2 0 0 /home/mkristensen/apps/miniforge3/envs/rmm-cudf-0527/lib/python3.11/site-packages/cudf/core/index.py:369(RangeIndex.copy) 5 0 0 
/home/mkristensen/apps/miniforge3/envs/rmm-cudf-0527/lib/python3.11/site-packages/cudf/core/frame.py:134(Frame._from_data) 2 0 0 /home/mkristensen/apps/miniforge3/envs/rmm-cudf-0527/lib/python3.11/site-packages/cudf/core/frame.py:1039(Frame._copy_type_metadata) 2 0 0 /home/mkristensen/apps/miniforge3/envs/rmm-cudf-0527/lib/python3.11/site-packages/cudf/core/indexed_frame.py:315(IndexedFrame._from_columns_like_self) ``` Authors: - Mads R. B. Kristensen (https://github.com/madsbk) Approvers: - Mark Harris (https://github.com/harrism) - Lawrence Mitchell (https://github.com/wence-) - Vyas Ramasubramani (https://github.com/vyasr) URL: rapidsai#15866 commit e35da6b Author: Lawrence Mitchell <[email protected]> Date: Fri Jun 28 09:54:03 2024 +0100 Implement Ternary copy_if_else (rapidsai#16114) A straightforward evaluation using `copy_if_else`. Authors: - Lawrence Mitchell (https://github.com/wence-) Approvers: - https://github.com/brandon-b-miller URL: rapidsai#16114 commit e940e30 Author: Thomas Li <[email protected]> Date: Thu Jun 27 21:44:41 2024 +0000 Address code review Co-authored-by: Vyas Ramasubramani <[email protected]> commit c847b98 Author: Lawrence Mitchell <[email protected]> Date: Thu Jun 27 21:33:29 2024 +0100 Finish implementation of cudf-polars boolean function handlers (rapidsai#16098) The missing nodes were `is_in`, `not` (both easy), `is_finite` and `is_infinite` (obtained by translating to `contains` calls). While here, remove the implementation of `IsBetween` and just translate to an expression with binary operations. This removes the need for special-casing scalar arguments to `IsBetween` and reproducing the code for binop evaluation. 
Authors: - Lawrence Mitchell (https://github.com/wence-) Approvers: - Vyas Ramasubramani (https://github.com/vyasr) URL: rapidsai#16098 commit 2ed69c9 Author: Matthew Roeschke <[email protected]> Date: Thu Jun 27 10:11:09 2024 -1000 Ensure MultiIndex.to_frame deep copies columns (rapidsai#16110) Additionally, this allows simplification in `MultiIndex.__repr__` which avoids a shallow copy and also caught a bug where `NaT` was not supposed to be quoted Authors: - Matthew Roeschke (https://github.com/mroeschke) Approvers: - Vyas Ramasubramani (https://github.com/vyasr) URL: rapidsai#16110 commit a71c249 Author: GALI PREM SAGAR <[email protected]> Date: Thu Jun 27 14:29:31 2024 -0500 Fix dtype errors in `StringArrays` (rapidsai#16111) This PR adds proxy classes for `ArrowStringArray` and `ArrowStringArrayNumpySemantics` that will increase the pandas test pass rate by 1%. Authors: - GALI PREM SAGAR (https://github.com/galipremsagar) Approvers: - Matthew Roeschke (https://github.com/mroeschke) URL: rapidsai#16111 commit 8fc139f Merge: 79c1dfd f7cd9e6 Author: Thomas Li <[email protected]> Date: Thu Jun 27 18:33:52 2024 +0000 Merge branch 'pylibcudf-io-writers' of github.com:lithomas1/cudf into pylibcudf-io-writers commit 79c1dfd Author: Thomas Li <[email protected]> Date: Thu Jun 27 18:33:40 2024 +0000 clean source_or_sink commit c5a3fbe Merge: aff6178 5d49fe6 Author: Thomas Li <[email protected]> Date: Thu Jun 27 18:25:42 2024 +0000 Merge branch 'branch-24.08' of github.com:rapidsai/cudf into pylibcudf-io-writers commit f7cd9e6 Author: Thomas Li <[email protected]> Date: Wed Jun 26 09:15:50 2024 -0700 cleanup utils commit aff6178 Author: Thomas Li <[email protected]> Date: Tue Jun 25 20:45:47 2024 +0000 small test fixes commit 0ed9af6 Author: Thomas Li <[email protected]> Date: Tue Jun 25 19:27:14 2024 +0000 Fix error in testing utils Co-authored-by: Lawrence Mitchell <[email protected]> commit 9a6a896 Merge: 186a2fb cdfb550 Author: Thomas Li <[email protected]> Date: 
Tue Jun 25 19:12:37 2024 +0000 Merge branch 'branch-24.08' of github.com:rapidsai/cudf into pylibcudf-io-writers commit 186a2fb Merge: 53b821c 0c6b828 Author: Thomas Li <[email protected]> Date: Mon Jun 24 17:19:39 2024 +0000 Merge branch 'branch-24.08' of github.com:rapidsai/cudf into pylibcudf-io-writers commit 53b821c Merge: 624d444 604c16d Author: Thomas Li <[email protected]> Date: Mon Jun 24 17:19:12 2024 +0000 Merge branch 'pylibcudf-io-writers' of github.com:lithomas1/cudf into pylibcudf-io-writers commit 624d444 Author: Thomas Li <[email protected]> Date: Mon Jun 24 17:17:27 2024 +0000 fix all nested struct cases commit e6c3ec7 Author: Thomas Li <[email protected]> Date: Mon Jun 24 16:57:29 2024 +0000 address more comments commit 604c16d Author: Thomas Li <[email protected]> Date: Mon Jun 24 16:57:29 2024 +0000 address more comments commit d22953f Merge: e0901dd dcc153b Author: Thomas Li <[email protected]> Date: Tue Jun 18 10:19:24 2024 -0700 Merge branch 'branch-24.08' into pylibcudf-io-writers commit e0901dd Author: Thomas Li <[email protected]> Date: Mon Jun 17 09:45:19 2024 -0700 fix bad merge commit 564358f Merge: e242182 87f6a7e Author: Thomas Li <[email protected]> Date: Mon Jun 17 09:44:11 2024 -0700 Merge branch 'branch-24.08' into pylibcudf-io-writers commit e242182 Author: Thomas Li <[email protected]> Date: Thu Jun 13 20:52:23 2024 +0000 address more comments commit 699efd3 Author: Thomas Li <[email protected]> Date: Thu Jun 13 20:09:43 2024 +0000 cleanup tests commit 1228569 Author: Thomas Li <[email protected]> Date: Thu Jun 13 18:20:03 2024 +0000 update following feedback commit b1951d0 Author: Thomas Li <[email protected]> Date: Thu Jun 13 03:01:19 2024 +0000 try fix commit 9150a6c Author: Thomas Li <[email protected]> Date: Wed Jun 12 23:48:18 2024 +0000 try something else commit 63358e9 Merge: 8c4c4e4 b35991c Author: Thomas Li <[email protected]> Date: Wed Jun 12 23:30:56 2024 +0000 Merge branch 'branch-24.08' of github.com:rapidsai/cudf 
into pylibcudf-io-writers commit 8c4c4e4 Author: Thomas Li <[email protected]> Date: Wed Jun 12 18:31:54 2024 +0000 address comments commit dc93356 Merge: c54316e 0891c5d Author: Thomas Li <[email protected]> Date: Wed Jun 12 17:49:26 2024 +0000 Merge branch 'branch-24.08' of github.com:rapidsai/cudf into pylibcudf-io-writers commit c54316e Author: Thomas Li <[email protected]> Date: Tue Jun 11 20:41:18 2024 +0000 update commit cd6df5e Merge: 2b3853f 8efa64e Author: Thomas Li <[email protected]> Date: Tue Jun 11 17:00:05 2024 +0000 Merge branch 'branch-24.08' of github.com:rapidsai/cudf into pylibcudf-io-writers commit 2b3853f Author: Thomas Li <[email protected]> Date: Tue Jun 11 16:49:14 2024 +0000 add some tests commit 8c88c7c Merge: c24664c 719a8a6 Author: Thomas Li <[email protected]> Date: Tue Jun 11 00:19:28 2024 +0000 Merge branch 'branch-24.08' of github.com:rapidsai/cudf into pylibcudf-io-writers commit c24664c Author: Thomas Li <[email protected]> Date: Fri Jun 7 18:25:06 2024 +0000 update and start writing tests commit 72204f1 Merge: 15daaaa 9bd16bb Author: Thomas Li <[email protected]> Date: Fri Jun 7 16:02:25 2024 +0000 Merge branch 'branch-24.08' of github.com:rapidsai/cudf into pylibcudf-io-writers commit 15daaaa Author: Thomas Li <[email protected]> Date: Fri Jun 7 16:02:10 2024 +0000 update docs commit 591cdd2 Author: Thomas Li <[email protected]> Date: Thu Jun 6 23:54:58 2024 +0000 Start migrating I/O writers to pylibcudf (starting with JSON)

…f-io-json

lithomas1 · 2024-07-02T21:24:59Z

python/cudf/cudf/pylibcudf_tests/io/test_json.py

+# TODO: Add tests for these!
+# Tests were not added in the initial PR porting the JSON reader to pylibcudf
+# to save time (and since there are no existing tests for these in Python cuDF)
+# mixed_types_as_string = mixed_types_as_string,


In the interest of time, I'm not testing these (there are no tests for these in cudf Python anyway).
I can't get rid of them since they are exposed in Python through read_json, but to me these feel more like options added for Spark than for Python.
The prune_columns API also feels a bit unnatural and inconsistent to me (I would prefer a usecols-style API like the one libcudf read_csv has).

Can I defer testing on these options for now (maybe marking them as experimental) while we figure out what to do with them?

Yes, I think that's OK. In general I'm fine deprioritizing libcudf features that are not used by either our polars or pandas front ends yet. Those are the long tail that we can add later.

lithomas1 · 2024-07-02T21:25:40Z

python/cudf/cudf/pylibcudf_tests/io/test_json.py

+    exp = pd.read_json(source, orient="records", lines=True)
+
+    # TODO: can do this operation using pylibcudf
+    tbls = []


I did this operation with arrow (so we don't end up testing pylibcudf concat too), but let me know if you want to avoid the GPU-CPU transfers and rely on pylibcudf concat.

lithomas1 · 2024-07-02T21:26:45Z

python/cudf/cudf/pylibcudf_tests/io/test_json.py

+                and compression_type
+                not in {CompressionType.NONE, CompressionType.AUTO}
+            ),
+            # note: wasn't able to narrow down the specific types that were failing


I'll open up a followup issue later after this gets merged.

lithomas1 · 2024-07-02T21:28:12Z

python/cudf/cudf/pylibcudf_tests/common/utils.py

@@ -65,6 +82,33 @@ def assert_column_eq(
    if isinstance(rhs, pa.ChunkedArray):
        rhs = rhs.combine_chunks()

+    def _make_fields_nullable(typ):


This was kinda annoying to deal with.
(basically one field ends up as and the other is just a regular )

I could not test this - but this could be interesting for stuff like parquet (where we should be able to preserve this info)

python/cudf/cudf/pylibcudf_tests/io/test_json.py

vyasr

Looks good, I'm happy with this largely as-is.

vyasr · 2024-07-05T16:37:59Z

python/cudf/cudf/_lib/pylibcudf/libcudf/io/types.pyx

Were we just not using any of the enums before? I don't see any changes to types.pxd that would suddenly merit this file, but I agree that we definitely need it.

Yeah, not sure how it worked before, but I definitely got some sort of compile-time error/runtime error without this.
(can't remember the specifics, though, sorry)

vyasr · 2024-07-05T18:50:41Z

python/cudf/cudf/_lib/pylibcudf/io/json.pyx

+        s_elem.child_types = _generate_schema_map(child_dtypes)
+
+        schema_map[c_name] = s_elem
+    return schema_map


We don't need to change anything, but just so you're aware since we discussed this a bit in your last PR: this is a case where if we were writing C++ you definitely wouldn't need a move, but in Cython you might. In C++, if you have a temporary object within a function (in this case schema_map), then C++ is usually smart enough to avoid making a copy and will automatically just return the same object allocated within the function. However, Cython can break this because of the way that it generates temporary variables, which can confuse the underlying C++ compiler optimizer. In this particular case I think it'll be fine, and regardless it's small enough that the perf change will be negligible, so I'm just telling you this for future reference.
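For readers who have not opened the diff, the recursion in `_generate_schema_map` can be sketched in plain Python. This is only an illustration under assumptions: `SchemaElement` is a hypothetical stand-in for the C++ schema element, dtypes are plain strings, and the input is assumed to be a list of `(name, dtype, child_dtypes)` tuples. The copy-versus-move question raised above does not come up in pure Python, since returning `schema_map` only returns a reference; the sketch mirrors the shape of the recursion and nothing more.

```python
from dataclasses import dataclass, field


@dataclass
class SchemaElement:
    """Hypothetical stand-in for the C++ schema element used in the diff."""
    type: str
    child_types: dict = field(default_factory=dict)


def generate_schema_map(dtypes):
    """Build a nested mapping of column name -> SchemaElement.

    `dtypes` is assumed to be a list of (name, dtype, child_dtypes) tuples,
    where child_dtypes has the same shape and is empty for leaf columns.
    """
    schema_map = {}
    for name, dtype, child_dtypes in dtypes:
        elem = SchemaElement(type=dtype)
        # Recurse so that nested children get their own sub-map.
        elem.child_types = generate_schema_map(child_dtypes)
        schema_map[name] = elem
    return schema_map


# One flat column and one struct column with two children.
dtypes = [
    ("a", "int64", []),
    ("b", "struct", [("x", "float64", []), ("y", "string", [])]),
]
schema = generate_schema_map(dtypes)
```

Leaf columns end up with an empty `child_types` mapping, which is what terminates the recursion.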

vyasr · 2024-07-05T19:05:28Z

python/cudf/cudf/pylibcudf_tests/io/test_json.py

+# TODO: Add tests for these!
+# Tests were not added in the initial PR porting the JSON reader to pylibcudf
+# to save time (and since there are no existing tests for these in Python cuDF)
+# mixed_types_as_string = mixed_types_as_string,


Yes, I think that's OK. In general I'm fine deprioritizing libcudf features that are not used by either our polars or pandas front ends yet. Those are the long tail that we can add later.

…f-io-json

lithomas1 · 2024-07-08T17:06:10Z

/merge

lithomas1 · 2024-07-08T17:06:52Z

Self-merging so we can get JSON reader support into cudf-polars.

(but happy to address any comments in a followup).

Migrate JSON reader to pylibcudf

627a426

lithomas1 added feature request New feature or request non-breaking Non-breaking change labels Jun 10, 2024

github-actions bot added Python Affects Python cuDF API. CMake CMake build issue pylibcudf Issues specific to the pylibcudf package labels Jun 10, 2024

lithomas1 added 2 commits June 11, 2024 16:57

Merge branch 'branch-24.08' of github.com:rapidsai/cudf into pylibcud…

e44aa81

…f-io-json

stub out tests

cdd9ff3

lithomas1 mentioned this pull request Jun 11, 2024

[FEA] Implement all libcudf modules required by cuDF Python in pylibcudf #15162

Closed

lithomas1 added 7 commits June 25, 2024 17:41

Merge branch 'branch-24.08' of github.com:rapidsai/cudf into pylibcud…

9afc860

…f-io-json

Merge branch 'branch-24.08' of github.com:rapidsai/cudf into pylibcud…

5e2e6f0

…f-io-json

flesh out more tests

e00fc72

Merge branch 'branch-24.08' of github.com:rapidsai/cudf into pylibcud…

2e93a95

…f-io-json

round out testing

5f09344

Merge branch 'branch-24.08' of github.com:rapidsai/cudf into pylibcud…

307e243

…f-io-json

github-actions bot added libcudf Affects libcudf (C++/CUDA) code. cudf.pandas Issues specific to cudf.pandas cudf.polars Issues specific to cudf.polars labels Jul 2, 2024

lithomas1 closed this Jul 2, 2024

lithomas1 reopened this Jul 2, 2024

Merge branch 'branch-24.08' of github.com:rapidsai/cudf into pylibcud…

1751238

…f-io-json

lithomas1 force-pushed the pylibcudf-io-json branch from 314a6f9 to 1751238 Compare July 2, 2024 16:29

github-actions bot removed libcudf Affects libcudf (C++/CUDA) code. cudf.pandas Issues specific to cudf.pandas labels Jul 2, 2024

revert accidental cudf_polars changes

33adf67

github-actions bot removed the cudf.polars Issues specific to cudf.polars label Jul 2, 2024

simplify more

bddbb81

lithomas1 marked this pull request as ready for review July 2, 2024 21:19

lithomas1 requested a review from a team as a code owner July 2, 2024 21:19

lithomas1 requested review from vyasr and mroeschke July 2, 2024 21:19

lithomas1 commented Jul 2, 2024

vyasr approved these changes Jul 5, 2024

lithomas1 added 2 commits July 5, 2024 20:28

Merge branch 'branch-24.08' of github.com:rapidsai/cudf into pylibcud…

929f59f

…f-io-json

fix failure

b1ffc38

rapids-bot bot merged commit 036e0ef into rapidsai:branch-24.08 Jul 8, 2024
86 checks passed

lithomas1 deleted the pylibcudf-io-json branch July 8, 2024 17:06
