[BUG] Limit size of buffer read by batched multi-source JSON lines reader to be at most INT_MAX bytes #17058

Open
shrshi opened this issue Oct 11, 2024 · 0 comments
Labels
bug Something isn't working

Comments

@shrshi (Contributor) commented on Oct 11, 2024

Describe the bug
With the reallocate-and-retry logic introduced for byte range reading when the initial buffer size estimate is too small (PR #16687), the total buffer size read per batch can exceed INT_MAX bytes (~2 GiB).
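
For reference, INT_MAX is 2,147,483,647 bytes, i.e. one byte under 2 GiB, and the invariant in the issue title is that no single buffer handed to the JSON lines parser may exceed that size. A minimal, self-contained sketch of that bound check, using a hypothetical helper for illustration only (not libcudf code):

#include <cstddef>
#include <cstdint>
#include <iostream>
#include <limits>

// Hypothetical guard for illustration only (not libcudf code): a buffer handed
// to the JSON lines parser must be at most INT_MAX bytes.
bool batch_size_ok(std::size_t batch_bytes)
{
  return batch_bytes <= static_cast<std::size_t>(std::numeric_limits<std::int32_t>::max());
}

int main()
{
  std::size_t const two_gib = std::size_t{2} << 30;   // 2,147,483,648 bytes
  std::cout << std::boolalpha;
  std::cout << batch_size_ok(two_gib - 1) << '\n';    // true  (exactly INT_MAX bytes)
  std::cout << batch_size_ok(two_gib) << '\n';        // false (one byte over)
  return 0;
}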

Steps/Code to reproduce bug

TEST_F(JsonLargeReaderTest, MultiBatchFailing)
{
  std::string json_string = R"(
    { "a": { "y" : 6}, "b" : [1, 2, 3], "c": 11 }
    { "a": { "y" : 6}, "b" : [4, 5   ], "c": 12 }
    { "a": { "y" : 6}, "b" : [6      ], "c": 13 }
    { "a": { "y" : 6}, "b" : [7      ], "c": 14 }
    )";

  std::size_t const batch_size_upper_bound = std::numeric_limits<int32_t>::max() / 16;
  // set smaller batch_size to reduce file size and execution time
  setenv("LIBCUDF_JSON_BATCH_SIZE", std::to_string(batch_size_upper_bound).c_str(), 1);

  constexpr std::size_t expected_file_size = 1.0 * static_cast<double>(batch_size_upper_bound);
  std::size_t log_repetitions =
    static_cast<std::size_t>(std::floor(std::log2(expected_file_size / json_string.size())));

  json_string.reserve(json_string.size() * (1UL << log_repetitions));
  for (std::size_t i = 0; i < log_repetitions; i++) {
    json_string += json_string;
  }

  // Build a very long string to append as the final record of the input
  std::string really_long_string = R"(haha)";
  log_repetitions += 4;
  really_long_string.reserve(really_long_string.size() * (1UL << log_repetitions));
  for (std::size_t i = 0; i < log_repetitions; i++) {
    really_long_string += really_long_string;
  }
  std::string last_line = "{ \"lol\" : \"" + really_long_string + "\" }";
  json_string += last_line;

  constexpr int num_sources = 1;
  std::vector<cudf::host_span<std::byte>> hostbufs(
    num_sources,
    cudf::host_span<std::byte>(reinterpret_cast<std::byte*>(json_string.data()),
                               json_string.size()));

  // Initialize parsing options (reading json lines)
  cudf::io::json_reader_options json_lines_options =
    cudf::io::json_reader_options::builder(
      cudf::io::source_info{
        cudf::host_span<cudf::host_span<std::byte>>(hostbufs.data(), hostbufs.size())})
      .lines(true)
      .compression(cudf::io::compression_type::NONE)
      .recovery_mode(cudf::io::json_recovery_mode_t::FAIL);

  // Read full test data via existing, nested JSON lines reader
  cudf::io::table_with_metadata current_reader_table = cudf::io::read_json(json_lines_options);

  std::vector<std::unique_ptr<cudf::io::datasource>> datasources;
  for (auto& hb : hostbufs) {
    datasources.emplace_back(cudf::io::datasource::create(hb));
  }
  // Test for different chunk sizes
  std::vector<std::size_t> chunk_sizes{batch_size_upper_bound / 4,
                                       batch_size_upper_bound / 2,
                                       batch_size_upper_bound,
                                       static_cast<std::size_t>(batch_size_upper_bound * 2)};

  for (auto chunk_size : chunk_sizes) {
    auto const tables =
      split_byte_range_reading<std::int64_t>(datasources,
                                             json_lines_options,
                                             chunk_size,
                                             cudf::get_default_stream(),
                                             cudf::get_current_device_resource_ref());

    auto table_views = std::vector<cudf::table_view>(tables.size());
    std::transform(tables.begin(), tables.end(), table_views.begin(), [](auto& table) {
      return table.tbl->view();
    });
    auto result = cudf::concatenate(table_views);

    // Verify that the data read via chunked reader matches the data read via nested JSON reader
    // cannot use EQUAL due to concatenate removing null mask
    CUDF_TEST_EXPECT_TABLES_EQUIVALENT(current_reader_table.tbl->view(), result->view());
  }

  // go back to normal batch_size
  unsetenv("LIBCUDF_JSON_BATCH_SIZE");
}
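
A note on how the repro is scaled down: LIBCUDF_JSON_BATCH_SIZE lowers the batch size upper bound to INT_MAX / 16 (about 128 MiB), so the failure reproduces with an input of roughly that size instead of one larger than 2 GiB. The doubled json_string roughly fills one batch, and the oversized final record is what pushes the buffer returned by get_record_range_raw_input past the batch bound; the diff below turns that into a hard failure.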

Diff adding a check in get_record_range_raw_input that demonstrates the oversized buffer:

diff --git a/cpp/src/io/json/read_json.cu b/cpp/src/io/json/read_json.cu
index 50f0e5cf33..9aeeb2a0f1 100644
--- a/cpp/src/io/json/read_json.cu
+++ b/cpp/src/io/json/read_json.cu
@@ -204,6 +204,7 @@ datasource::owning_buffer<rmm::device_buffer> get_record_range_raw_input(
       }
     }
 
+    CUDF_EXPECTS(static_cast<size_t>(next_delim_pos - first_delim_pos - shift_for_nonzero_offset) < get_batch_size_upper_bound(), "oh no!");
     return datasource::owning_buffer<rmm::device_buffer>(
       std::move(buffer),
       reinterpret_cast<uint8_t*>(buffer.data()) + first_delim_pos + shift_for_nonzero_offset,
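
With this check in place, running the MultiBatchFailing test above trips the CUDF_EXPECTS, showing that the record range returned for a batch exceeds the batch size upper bound.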

Expected behavior
The size of the buffer returned by get_record_range_raw_input should not exceed INT_MAX bytes (just under 2 GiB).
Proposed solution: stop at the last line end before the limit and adjust the batch offsets for the remaining batches, as sketched below.
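
A minimal sketch of that batching scheme, assuming a hypothetical last_newline_before() helper that returns the offset just past the last '\n' at or before a given position (an illustration of the proposal, not the libcudf implementation):

#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <limits>
#include <vector>

// Sketch of the proposed batching: cap each batch at the last record delimiter
// before the INT_MAX-byte limit and carry the remainder into the next batch.
// last_newline_before() is a hypothetical helper supplied by the caller.
std::vector<std::size_t> compute_batch_offsets(
  std::size_t total_bytes, std::function<std::size_t(std::size_t)> last_newline_before)
{
  constexpr std::size_t size_limit = std::numeric_limits<std::int32_t>::max();
  std::vector<std::size_t> offsets{0};
  while (offsets.back() < total_bytes) {
    auto const tentative_end = std::min(offsets.back() + size_limit, total_bytes);
    // Stop at the line end before the limit so no batch exceeds INT_MAX bytes
    // and no record is split across batches.
    auto const batch_end =
      (tentative_end == total_bytes) ? total_bytes : last_newline_before(tentative_end);
    // A single record longer than the limit cannot fit in any batch; the real
    // reader would have to report that as an error rather than loop forever.
    if (batch_end <= offsets.back()) { break; }
    offsets.push_back(batch_end);
  }
  return offsets;
}

Each subsequent batch then starts at the previous batch's adjusted end offset rather than at a fixed multiple of the batch size, which is the "adjust the batch offsets for the remaining batches" part of the proposal.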

@shrshi added the bug (Something isn't working) label on Oct 11, 2024
@karthikeyann added this to the Nested JSON reader milestone on Nov 12, 2024