[BUG] Limit size of buffer read by batched multi-source JSON lines reader to be at most INT_MAX bytes #17058

Open
shrshi opened this issue Oct 11, 2024 · 0 comments
Labels
bug Something isn't working

Comments

@shrshi (Contributor) commented on Oct 11, 2024

Describe the bug
With the reallocate-and-retry logic introduced for byte range reading when the initial buffer size estimate is too small (PR #16687), the total buffer size read per batch can exceed INT_MAX bytes (~2 GiB).
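
For reference, INT_MAX is 2,147,483,647 bytes, i.e. one byte under 2 GiB, and the invariant in the issue title is that no single buffer handed to the JSON lines parser may exceed that size. A minimal, self-contained sketch of that bound check, using a hypothetical helper for illustration only (not libcudf code):

#include <cstddef>
#include <cstdint>
#include <iostream>
#include <limits>

// Hypothetical guard for illustration only (not libcudf code): a buffer handed
// to the JSON lines parser must be at most INT_MAX bytes.
bool batch_size_ok(std::size_t batch_bytes)
{
  return batch_bytes <= static_cast<std::size_t>(std::numeric_limits<std::int32_t>::max());
}

int main()
{
  std::size_t const two_gib = std::size_t{2} << 30;   // 2,147,483,648 bytes
  std::cout << std::boolalpha;
  std::cout << batch_size_ok(two_gib - 1) << '\n';    // true  (exactly INT_MAX bytes)
  std::cout << batch_size_ok(two_gib) << '\n';        // false (one byte over)
  return 0;
}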

Steps/Code to reproduce bug

TEST_F(JsonLargeReaderTest, MultiBatchFailing)
{
  std::string json_string = R"(
    { "a": { "y" : 6}, "b" : [1, 2, 3], "c": 11 }
    { "a": { "y" : 6}, "b" : [4, 5   ], "c": 12 }
    { "a": { "y" : 6}, "b" : [6      ], "c": 13 }
    { "a": { "y" : 6}, "b" : [7      ], "c": 14 }
    )";

  std::size_t const batch_size_upper_bound = std::numeric_limits<int32_t>::max() / 16;
  // set smaller batch_size to reduce file size and execution time
  setenv("LIBCUDF_JSON_BATCH_SIZE", std::to_string(batch_size_upper_bound).c_str(), 1);

  constexpr std::size_t expected_file_size = 1.0 * static_cast<double>(batch_size_upper_bound);
  std::size_t log_repetitions =
    static_cast<std::size_t>(std::floor(std::log2(expected_file_size / json_string.size())));

  json_string.reserve(json_string.size() * (1UL << log_repetitions));
  for (std::size_t i = 0; i < log_repetitions; i++) {
    json_string += json_string;
  }

  // Build a very long string to append as the final record of the input
  std::string really_long_string = R"(haha)";
  log_repetitions += 4;
  really_long_string.reserve(really_long_string.size() * (1UL << log_repetitions));
  for (std::size_t i = 0; i < log_repetitions; i++) {
    really_long_string += really_long_string;
  }
  std::string last_line = "{ \"lol\" : \"" + really_long_string + "\" }";
  json_string += last_line;

  constexpr int num_sources = 1;
  std::vector<cudf::host_span<std::byte>> hostbufs(
    num_sources,
    cudf::host_span<std::byte>(reinterpret_cast<std::byte*>(json_string.data()),
                               json_string.size()));

  // Initialize parsing options (reading json lines)
  cudf::io::json_reader_options json_lines_options =
    cudf::io::json_reader_options::builder(
      cudf::io::source_info{
        cudf::host_span<cudf::host_span<std::byte>>(hostbufs.data(), hostbufs.size())})
      .lines(true)
      .compression(cudf::io::compression_type::NONE)
      .recovery_mode(cudf::io::json_recovery_mode_t::FAIL);

  // Read full test data via existing, nested JSON lines reader
  cudf::io::table_with_metadata current_reader_table = cudf::io::read_json(json_lines_options);

  std::vector<std::unique_ptr<cudf::io::datasource>> datasources;
  for (auto& hb : hostbufs) {
    datasources.emplace_back(cudf::io::datasource::create(hb));
  }
  // Test for different chunk sizes
  std::vector<std::size_t> chunk_sizes{batch_size_upper_bound / 4,
                                       batch_size_upper_bound / 2,
                                       batch_size_upper_bound,
                                       static_cast<std::size_t>(batch_size_upper_bound * 2)};

  for (auto chunk_size : chunk_sizes) {
    auto const tables =
      split_byte_range_reading<std::int64_t>(datasources,
                                             json_lines_options,
                                             chunk_size,
                                             cudf::get_default_stream(),
                                             cudf::get_current_device_resource_ref());

    auto table_views = std::vector<cudf::table_view>(tables.size());
    std::transform(tables.begin(), tables.end(), table_views.begin(), [](auto& table) {
      return table.tbl->view();
    });
    auto result = cudf::concatenate(table_views);

    // Verify that the data read via chunked reader matches the data read via nested JSON reader
    // cannot use EQUAL due to concatenate removing null mask
    CUDF_TEST_EXPECT_TABLES_EQUIVALENT(current_reader_table.tbl->view(), result->view());
  }

  // go back to normal batch_size
  unsetenv("LIBCUDF_JSON_BATCH_SIZE");
}
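
A note on how the repro is scaled down: LIBCUDF_JSON_BATCH_SIZE lowers the batch size upper bound to INT_MAX / 16 (about 128 MiB), so the failure reproduces with an input of roughly that size instead of one larger than 2 GiB. The doubled json_string roughly fills one batch, and the oversized final record is what pushes the buffer returned by get_record_range_raw_input past the batch bound; the diff below turns that into a hard failure.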

Diff adding a check in get_record_range_raw_input that demonstrates the oversized buffer:

diff --git a/cpp/src/io/json/read_json.cu b/cpp/src/io/json/read_json.cu
index 50f0e5cf33..9aeeb2a0f1 100644
--- a/cpp/src/io/json/read_json.cu
+++ b/cpp/src/io/json/read_json.cu
@@ -204,6 +204,7 @@ datasource::owning_buffer<rmm::device_buffer> get_record_range_raw_input(
       }
     }
 
+    CUDF_EXPECTS(static_cast<size_t>(next_delim_pos - first_delim_pos - shift_for_nonzero_offset) < get_batch_size_upper_bound(), "oh no!");
     return datasource::owning_buffer<rmm::device_buffer>(
       std::move(buffer),
       reinterpret_cast<uint8_t*>(buffer.data()) + first_delim_pos + shift_for_nonzero_offset,
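
With this check in place, running the MultiBatchFailing test above trips the CUDF_EXPECTS, showing that the record range returned for a batch exceeds the batch size upper bound.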

Expected behavior
The size of the buffer returned by get_record_range_raw_input should not exceed INT_MAX bytes (just under 2 GiB).
Proposed solution: stop at the last line end before the limit and adjust the batch offsets for the remaining batches, as sketched below.
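
A minimal sketch of that batching scheme, assuming a hypothetical last_newline_before() helper that returns the offset just past the last '\n' at or before a given position (an illustration of the proposal, not the libcudf implementation):

#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <limits>
#include <vector>

// Sketch of the proposed batching: cap each batch at the last record delimiter
// before the INT_MAX-byte limit and carry the remainder into the next batch.
// last_newline_before() is a hypothetical helper supplied by the caller.
std::vector<std::size_t> compute_batch_offsets(
  std::size_t total_bytes, std::function<std::size_t(std::size_t)> last_newline_before)
{
  constexpr std::size_t size_limit = std::numeric_limits<std::int32_t>::max();
  std::vector<std::size_t> offsets{0};
  while (offsets.back() < total_bytes) {
    auto const tentative_end = std::min(offsets.back() + size_limit, total_bytes);
    // Stop at the line end before the limit so no batch exceeds INT_MAX bytes
    // and no record is split across batches.
    auto const batch_end =
      (tentative_end == total_bytes) ? total_bytes : last_newline_before(tentative_end);
    // A single record longer than the limit cannot fit in any batch; the real
    // reader would have to report that as an error rather than loop forever.
    if (batch_end <= offsets.back()) { break; }
    offsets.push_back(batch_end);
  }
  return offsets;
}

Each subsequent batch then starts at the previous batch's adjusted end offset rather than at a fixed multiple of the batch size, which is the "adjust the batch offsets for the remaining batches" part of the proposal.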

@shrshi added the bug (Something isn't working) label on Oct 11, 2024
@karthikeyann added this to the Nested JSON reader milestone on Nov 12, 2024