[BUG] dd.from_map with cudf read_json not working when pool is enabled #13019

Closed
ayushdg opened this issue Mar 27, 2023 · 4 comments · Fixed by rapidsai/kvikio#189
Labels
bug Something isn't working

Comments

@ayushdg
Member

ayushdg commented Mar 27, 2023

Describe the bug
When attempting to use dd.from_map with cudf.read_json while an RMM pool is enabled, I run into the following error:
Exception: "RuntimeError('The current CUDA context must own the given device memory')".

Steps/Code to reproduce bug

import os
os.environ["RAPIDS_NO_INITIALIZE"] = "True"
from dask_cuda import LocalCUDACluster
from distributed import Client
from dask import dataframe as dd
import cudf

if __name__ == "__main__":
    cluster = LocalCUDACluster(rmm_pool_size="5GiB")
    client = Client(cluster)

    files = ["./file_1.jsonl"]
    df = dd.from_map(cudf.read_json, files, lines=True)
    print(len(df))

Expected behavior
Reads in the data as expected. This was working in the 23.02 stable release but fails in the 23.04 nightly.

Environment overview (please complete the following information)

  • Environment location: Bare-metal
  • Method of cuDF install: conda nightly

Environment details
N/A
Additional context
File to reproduce jsonfile.tar.gz

@ayushdg added the bug (Something isn't working) and Needs Triage (Need team to review and classify) labels Mar 27, 2023
@wence-
Contributor

wence- commented Mar 28, 2023

compute-sanitizer says:

========= Program hit CUDA_ERROR_INVALID_CONTEXT (error 201) due to "invalid device context" on CUDA API call to cuPointerGetAttribute.
=========     Saved host backtrace up to driver entry point at error
=========     Host Frame: [0x2c6396]
=========                in /usr/lib/x86_64-linux-gnu/libcuda.so.1
=========     Host Frame:kvikio::get_current_context(void const*) [0x16c443a]
=========                in /home/wence/Documents/src/rapids/compose/etc/conda/cuda_11.8/envs/rapids/lib/libcudf.so
=========     Host Frame:cudf::io::(anonymous namespace)::file_source::device_read(unsigned long, unsigned long, unsigned char*, rmm::cuda_stream_view) [0x16d6230]
=========                in /home/wence/Documents/src/rapids/compose/etc/conda/cuda_11.8/envs/rapids/lib/libcudf.so
=========     Host Frame:/home/wence/Documents/src/rapids/cudf/cpp/src/io/json/experimental/read_json.cpp:61:cudf::io::detail::json::experimental::ingest_raw_input(cudf::host_span<std::unique_ptr<cudf::io::datasource, std::default_delete<cudf::io::datasource> >, 18446744073709551615ul> const&, cudf::io::compression_type, unsigned long, unsigned long, rmm::cuda_stream_view) [0x15ed91d]
=========                in /home/wence/Documents/src/rapids/compose/etc/conda/cuda_11.8/envs/rapids/lib/libcudf.so
=========     Host Frame:/home/wence/Documents/src/rapids/cudf/cpp/src/io/json/experimental/read_json.cpp:129:cudf::io::detail::json::experimental::get_record_range_raw_input(cudf::host_span<std::unique_ptr<cudf::io::datasource, std::default_delete<cudf::io::datasource> >, 18446744073709551615ul>, cudf::io::json_reader_options const&, rmm::cuda_stream_view) [0x15ede56]
=========                in /home/wence/Documents/src/rapids/compose/etc/conda/cuda_11.8/envs/rapids/lib/libcudf.so
=========     Host Frame:/home/wence/Documents/src/rapids/cudf/cpp/src/io/json/experimental/read_json.cpp:184:cudf::io::detail::json::experimental::read_json(cudf::host_span<std::unique_ptr<cudf::io::datasource, std::default_delete<cudf::io::datasource> >, 18446744073709551615ul>, cudf::io::json_reader_options const&, rmm::cuda_stream_view, rmm::mr::device_memory_resource*) [0x15ee3e4]
=========                in /home/wence/Documents/src/rapids/compose/etc/conda/cuda_11.8/envs/rapids/lib/libcudf.so
=========     Host Frame:/home/wence/Documents/src/rapids/cudf/cpp/src/io/json/reader_impl.cu:599:cudf::io::json::detail::read_json(std::vector<std::unique_ptr<cudf::io::datasource, std::default_delete<cudf::io::datasource> >, std::allocator<std::unique_ptr<cudf::io::datasource, std::default_delete<cudf::io::datasource> > > >&, cudf::io::json_reader_options const&, rmm::cuda_stream_view, rmm::mr::device_memory_resource*) [0x15cf826]
=========                in /home/wence/Documents/src/rapids/compose/etc/conda/cuda_11.8/envs/rapids/lib/libcudf.so
=========     Host Frame:cudf::io::read_json(cudf::io::json_reader_options, rmm::mr::device_memory_resource*) [0x14fac1f]
=========                in /home/wence/Documents/src/rapids/compose/etc/conda/cuda_11.8/envs/rapids/lib/libcudf.so
=========     Host Frame:__pyx_f_4cudf_4_lib_4json_read_json(_object*, _object*, bool, _object*, _object*, bool, bool, int) [clone .constprop.0] [0x51278]

This makes me suspicious of rapidsai/kvikio#181; @madsbk, does that look likely? Certainly that error message is coming from kvikio.

@ayushdg
Member Author

ayushdg commented Mar 28, 2023

Can confirm that setting LIBCUDF_CUFILE_POLICY to something other than KvikIO works around the issue for now.
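
For reference, a minimal sketch of that workaround (assuming "OFF" is one of the values libcudf accepts for this policy; the variable also has to be set before cudf is imported, and early enough that worker processes inherit it):

import os

# Hedged sketch: steer cuIO away from the KvikIO path before cudf is imported.
# "OFF" is assumed here to be an accepted LIBCUDF_CUFILE_POLICY value.
os.environ["LIBCUDF_CUFILE_POLICY"] = "OFF"

import cudf  # imported only after the policy has been set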

@wence-
Contributor

wence- commented Mar 29, 2023

A bit of debugging...

If I check the state of the context before calling into libcudf from read_json:

diff --git a/python/cudf/cudf/io/json.py b/python/cudf/cudf/io/json.py
index 4de9a92a06..38bd127647 100644
--- a/python/cudf/cudf/io/json.py
+++ b/python/cudf/cudf/io/json.py
@@ -108,6 +108,9 @@ def read_json(
             else:
                 filepaths_or_buffers.append(tmp_source)
 
+        from cuda import cuda
+
+        print(cuda.cuCtxGetCurrent())
         df = libjson.read_json(
             filepaths_or_buffers,
             dtype,

I see something bad:

In [1]: len(df)
(<CUresult.CUDA_SUCCESS: 0>, <CUcontext 0x0>)

So there's no active context when dropping into libcudf. Later, when kvikio does its checking, it runs:

if (check_owning_devPtr != nullptr) {
  CUdeviceptr current_ctx_devPtr{};
  CUdeviceptr dev_ptr = convert_void2deviceptr(check_owning_devPtr);

  CUresult const err = cudaAPI::instance().PointerGetAttribute(
    &current_ctx_devPtr, CU_POINTER_ATTRIBUTE_DEVICE_POINTER, dev_ptr);
  if (err != CUDA_SUCCESS || current_ctx_devPtr != dev_ptr) {
    throw CUfileException("The current CUDA context must own the given device memory");
  }
}

That attribute is documented as returning:

Returns in *data the device pointer value through which ptr may be accessed by kernels running in the current CUcontext. The type of data must be CUdeviceptr *.

If there exists no device pointer value through which kernels running in the current CUcontext may access ptr then CUDA_ERROR_INVALID_VALUE is returned.

If there is no current CUcontext then CUDA_ERROR_INVALID_CONTEXT is returned.

And indeed, this call returns CUDA_ERROR_INVALID_CONTEXT in this setting.

In contrast, asking for the old CU_POINTER_ATTRIBUTE_CONTEXT attribute returns a plausible-looking context:

(gdb) p cudaAPI::instance().PointerGetAttribute(&ctx, CU_POINTER_ATTRIBUTE_CONTEXT, dev_ptr)
$2 = CUDA_SUCCESS
(gdb) p ctx
$3 = (CUcontext) 0x55cc5bf33c60
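
For completeness, the same two queries can be made from Python with cuda-python. This is a hypothetical sketch (not output from the failing worker) just to spell out which attribute kvikio is asking for:

from cuda import cuda
import rmm

# Hypothetical sketch: allocate device memory (as the worker's RMM pool would)
# and query the two pointer attributes discussed above. In this standalone run
# a context is current on the calling thread, so both queries succeed.
buf = rmm.DeviceBuffer(size=8)
dev_ptr = cuda.CUdeviceptr(buf.ptr)

# The attribute kvikio checks: the pointer as seen by the *current* context.
# In the failing case (no context current on the calling thread) this is the
# call that comes back with CUDA_ERROR_INVALID_CONTEXT.
err, same_ptr = cuda.cuPointerGetAttribute(
    cuda.CUpointer_attribute.CU_POINTER_ATTRIBUTE_DEVICE_POINTER, dev_ptr
)
print(err, same_ptr)

# The older attribute: the context that owns the allocation, which can be
# resolved even when that context is not current on this thread.
err, owning_ctx = cuda.cuPointerGetAttribute(
    cuda.CUpointer_attribute.CU_POINTER_ATTRIBUTE_CONTEXT, dev_ptr
)
print(err, owning_ctx)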

Something weird is going on though: if I just do client.run to determine what the worker thinks the current context is (new process, so addresses are different):

    def foo():
        from cuda import cuda
        return repr(cuda.cuCtxGetCurrent())
    print(client.run(foo))
    df = dd.from_map(cudf.read_json, files, lines=True, engine="cudf", meta=cudf.DataFrame({"text": ["foo"]}))
    print(len(df))

Then I see

{'tcp://127.0.0.1:36443': '(<CUresult.CUDA_SUCCESS: 0>, <CUcontext 0x55cf11dc3840>)'}

But the print from within read_json returns (<CUresult.CUDA_SUCCESS: 0>, <CUcontext 0x0>)

So something is on a different process/thread?
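
One hypothetical way to check (not something run as part of this report, and assuming the client from the reproducer above): report the thread id and current context both from client.run, which as far as I understand executes outside the worker's task thread pool, and from a scheduled task, which runs where the from_map partitions do:

import threading
from cuda import cuda

def report():
    # Which OS thread is this, and what does it consider the current context?
    return threading.get_ident(), repr(cuda.cuCtxGetCurrent())

print(client.run(report))                          # outside the task system
print(client.submit(report, pure=False).result())  # on a worker task thread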

@madsbk
Member

madsbk commented Mar 29, 2023

Yes, the old behavior that uses CU_POINTER_ATTRIBUTE_CONTEXT will work.
I am working on a more robust approach in KvikIO.

rapids-bot bot pushed a commit to rapidsai/kvikio that referenced this issue Apr 3, 2023
@bdice removed the Needs Triage (Need team to review and classify) label Mar 4, 2024