Reduce memory usage in nested JSON parser - tree generation #11864
Conversation
reduces memory usage by 35% (1 GB JSON takes 10.951 GiB instead of 16.957 GiB)
reduce peak memory usage (not total memory used): reorder node_range and node_cat scopes, limit token_levels scope (10.957 GiB -> 9.91 GiB -> 9.774 GiB -> 9.403 GiB)
9.403 GiB to 8.487 GiB (for 1GB json input)
These NVTX range macros might be useful. Commenting here for a wider audience and as a searchable reference. These variables are not static because the macros can be used inside a loop, with the loop index as the label.

```cpp
#define _CONCAT_(x, y) x##y
#define CONCAT(x, y) _CONCAT_(x, y)

#define NVTX3_PUSH_RANGE_IN(D, tag)                                                              \
  ::nvtx3::registered_message<D> const CONCAT(nvtx3_range_name__, __LINE__){std::string(tag)};   \
  ::nvtx3::event_attributes const CONCAT(nvtx3_range_attr__,                                     \
                                         __LINE__){CONCAT(nvtx3_range_name__, __LINE__)};        \
  nvtxDomainRangePushEx(::nvtx3::domain::get<D>(), CONCAT(nvtx3_range_attr__, __LINE__).get());

#define NVTX3_POP_RANGE(D) nvtxDomainRangePop(::nvtx3::domain::get<D>());

#define CUDF_PUSH_RANGE(tag) NVTX3_PUSH_RANGE_IN(cudf::libcudf_domain, tag)
#define CUDF_POP_RANGE()     NVTX3_POP_RANGE(cudf::libcudf_domain)

#define NVTX3_SCOPED_RANGE_IN(D, tag)                                                            \
  ::nvtx3::registered_message<D> const CONCAT(nvtx3_scope_name__,                                \
                                              __LINE__){std::string(__func__) + "::" + tag};     \
  ::nvtx3::event_attributes const CONCAT(nvtx3_scope_attr__,                                     \
                                         __LINE__){CONCAT(nvtx3_scope_name__, __LINE__)};        \
  ::nvtx3::domain_thread_range<D> const CONCAT(nvtx3_range__,                                    \
                                               __LINE__){CONCAT(nvtx3_scope_attr__, __LINE__)};

#define CUDF_SCOPED_RANGE(tag) NVTX3_SCOPED_RANGE_IN(cudf::libcudf_domain, tag)
```

Update:
This reverts commit 5eefd64.
Codecov Report. Base: 87.40% // Head: 88.11% // Increases project coverage by +0.70%.
Additional details and impacted files:

```diff
@@            Coverage Diff             @@
##           branch-22.12   #11864      +/-   ##
================================================
+ Coverage         87.40%   88.11%   +0.70%
================================================
  Files               133      133
  Lines             21833    21881      +48
================================================
+ Hits              19084    19281     +197
+ Misses             2749     2600     -149
```

☔ View full report at Codecov.
rerun tests
First pass
```cpp
auto pid = first_childs_parent_token_id(tid);
return pid < 0
         ? parent_node_sentinel
         : thrust::lower_bound(thrust::seq, node_ids_gpu, node_ids_gpu + num_nodes, pid) -
```
This will probably be a good point to do some precomputations (hashmap or sparse bitmap) if this kernel becomes a performance issue.
This one takes ~20% of the parent_node_ids computation time. It takes significant time, but it is not the critical part.
A hash map takes extra memory. I tried it now: memory increases from 7.97 GiB to 9.271 GiB. It's also much slower than lower_bound: 12 ms for lower_bound vs 133 ms for the hash map.
I am interested to learn about the "sparse bitmap" approach.
It's an approach I often use to speed up binary-search-like lookups where the indices are unique. The fundamental idea is a bitvector with rank support:
You store a bitvector, in groups of 32-bit words, containing a 1 if the corresponding token is a node and 0 otherwise, and compute an exclusive prefix sum over the popcount of the words. Computing the lower bound is then `prefix_sum[i / 32] + popcnt(prefix_mask(i % 32) & bitvector[i / 32])`,
where `prefix_mask(i) = (1u << i) - 1`
has all bits smaller than its parameter set. Overall, this uses 2 bits of storage per token. If the number of tokens is much larger than the number of nodes, you can make the data structure even sparser by storing only the 32-bit words that are not all 0 (basically a reduce_by_key over the tokens) and using a normal bitvector to store a bit for each 32-bit word denoting whether it was non-zero. Then you have a two-level lookup (though the second level could also be another lookup data structure, like a hashmap). The data structure has pretty good caching properties, since locality in indices translates to locality in memory, which hashmaps purposefully don't have.
But with the number of nodes not being smaller than the number of tokens by a huge factor, I don't think this would be worth the effort.
LGTM! Nice work!
Partial review -- initial comments. I'll submit additional comments shortly.
I finished my first pass of review (this is the second half). Comments attached.
Co-authored-by: Bradley Dice <[email protected]>
@gpucibot merge
Description
Reduces memory usage by 53% in the nested JSON parser tree generation algorithm.
1 GB JSON takes 8.469 GiB instead of 16.957 GiB. All values below are for 1 GB JSON text input.
This PR employs the following optimisations to reduce memory usage:
- thrust::stable_sort_by_key (9.403 GiB -> 8.487 GiB)
- cub::DoubleBuffer, which eliminates a copy of the order (8.487 GiB -> 7.97 GiB)

The peak memory is reduced by 53%; parsing bandwidth remains the same (1.6 GB/s on GV100 for 1 GB JSON).
Since get_stack_context in the JSON parser now has the highest memory usage (8.469 GiB), peak memory is no longer influenced by the JSON tree generation step. Peak memory is now 50% of that of the earlier code.

Checklist