
Reduce memory usage in nested JSON parser - tree generation #11864

Merged
22 commits merged into rapidsai:branch-22.12 on Oct 14, 2022

Conversation

Contributor

@karthikeyann karthikeyann commented Oct 5, 2022

Description

Reduces memory usage by 53% in the nested JSON parser tree generation algorithm (16.957 GiB -> 7.97 GiB).
End-to-end, parsing 1 GB of JSON text now peaks at 8.469 GiB instead of 16.957 GiB. All values below are for a 1 GB JSON text input.

This PR employs the following optimisations to reduce memory usage:

  • Generate parent node ids from nodes instead of tokens (16.957 GiB -> 10.957 GiB).
  • Reorder node_range and node_categories generation to the end (10.957 GiB -> 9.774 GiB).
  • Scope-limit token_levels (9.774 GiB -> 9.403 GiB).
  • Use a CUB sort instead of thrust::stable_sort_by_key (9.403 GiB -> 8.487 GiB).
  • Use cub::DoubleBuffer, which eliminates the copy of the order array (8.487 GiB -> 7.97 GiB).

Tree generation memory is reduced by 53%, while parsing bandwidth remains the same (1.6 GB/s on a GV100 for 1 GB of JSON).

Since get_stack_context in the JSON parser now has the highest memory usage (8.469 GiB), peak memory is no longer influenced by the JSON tree generation step. Peak memory is now 50% of that of the earlier code.
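The cub::DoubleBuffer optimisation in the list above avoids copying the sort result back into a fixed output buffer: the sort ping-pongs between two buffers, and a selector records which one holds the result. A minimal host-side C++ sketch of that ping-pong idea (not the actual CUB API; double_buffer and sort_into_alternate are illustrative names):

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Host-side analogue of cub::DoubleBuffer: a "sort" may leave its result in
// either buffer; the selector records which one is current, so no copy-back
// into a fixed destination buffer is needed.
template <typename T>
struct double_buffer {
  T* bufs[2];
  int selector = 0;
  T* current() const { return bufs[selector]; }
  T* alternate() const { return bufs[1 - selector]; }
};

// Toy out-of-place sort: writes sorted data into the alternate buffer and
// flips the selector instead of copying back (as a radix-sort pass would).
template <typename T>
void sort_into_alternate(double_buffer<T>& db, std::size_t n) {
  std::copy(db.current(), db.current() + n, db.alternate());
  std::sort(db.alternate(), db.alternate() + n);
  db.selector = 1 - db.selector;  // result now lives in the other buffer
}
```

After the call, db.current() points at the sorted data regardless of which physical buffer it landed in, which is what makes the extra copy unnecessary.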

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@karthikeyann karthikeyann added 2 - In Progress Currently a work in progress libcudf Affects libcudf (C++/CUDA) code. Performance Performance related issue improvement Improvement / enhancement to an existing function non-breaking Non-breaking change labels Oct 5, 2022
@karthikeyann karthikeyann added this to the Nested JSON reader milestone Oct 5, 2022
@karthikeyann
Contributor Author

karthikeyann commented Oct 7, 2022

These NVTX range macros might be useful. Commenting here for a wider audience and as a searchable reference.

CUDF_PUSH_RANGE("tag") may have slightly more runtime impact than CUDF_FUNC_RANGE() because its registered_message and event_attributes are not static. Use it only for debugging, tracing, and performance optimization; don't use it in production code.

These variables are not static so that the macro can also be used inside a loop with the loop index as the label,
e.g. CUDF_PUSH_RANGE(std::to_string(i)), CUDF_POP_RANGE().

#define _CONCAT_(x, y) x##y
#define CONCAT(x, y)   _CONCAT_(x, y)

#define NVTX3_PUSH_RANGE_IN(D, tag)                                                            \
  ::nvtx3::registered_message<D> const CONCAT(nvtx3_range_name__, __LINE__){std::string(tag)}; \
  ::nvtx3::event_attributes const CONCAT(nvtx3_range_attr__,                                   \
                                         __LINE__){CONCAT(nvtx3_range_name__, __LINE__)};      \
  nvtxDomainRangePushEx(::nvtx3::domain::get<D>(), CONCAT(nvtx3_range_attr__, __LINE__).get());

#define NVTX3_POP_RANGE(D) nvtxDomainRangePop(::nvtx3::domain::get<D>());

#define CUDF_PUSH_RANGE(tag) NVTX3_PUSH_RANGE_IN(cudf::libcudf_domain, tag)
#define CUDF_POP_RANGE()     NVTX3_POP_RANGE(cudf::libcudf_domain)

CUDF_SCOPED_RANGE("tag") is useful for scope-limited ranges.

#define NVTX3_SCOPED_RANGE_IN(D, tag)                                                        \
  ::nvtx3::registered_message<D> const CONCAT(nvtx3_scope_name__,                            \
                                              __LINE__){std::string(__func__) + "::" + tag}; \
  ::nvtx3::event_attributes const CONCAT(nvtx3_scope_attr__,                                 \
                                         __LINE__){CONCAT(nvtx3_scope_name__, __LINE__)};    \
  ::nvtx3::domain_thread_range<D> const CONCAT(nvtx3_range__,                                \
                                               __LINE__){CONCAT(nvtx3_scope_attr__, __LINE__)};

#define CUDF_SCOPED_RANGE(tag) NVTX3_SCOPED_RANGE_IN(cudf::libcudf_domain, tag)
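In contrast to the push/pop macros, the scoped variant relies on RAII: nvtx3::domain_thread_range pushes a range on construction and pops it when the enclosing scope ends. A stripped-down host-only sketch of that pattern with no NVTX dependency (scoped_range and g_range_log are illustrative names, not cuDF or NVTX symbols; real NVTX ranges are recorded by the profiler, not by the application):

```cpp
#include <cassert>
#include <string>
#include <vector>

// Illustrative stand-in for the profiler's range stack.
std::vector<std::string> g_range_log;

// RAII range: push on construction, pop on destruction, like
// nvtx3::domain_thread_range wrapping nvtxDomainRangePushEx/Pop.
struct scoped_range {
  explicit scoped_range(std::string tag) { g_range_log.push_back("push:" + tag); }
  ~scoped_range() { g_range_log.push_back("pop"); }
};

int depth_example() {
  scoped_range outer{"parse"};
  {
    scoped_range inner{"tree_gen"};  // popped automatically at end of this block
  }
  // outer is still open here: log holds push:parse, push:tree_gen, pop
  return static_cast<int>(g_range_log.size());
}
```

Because the pop happens in a destructor, the range is closed on every exit path (including early returns), which is the main advantage over manual push/pop pairs.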

Update:
After #6476 is addressed,

@karthikeyann karthikeyann added 3 - Ready for Review Ready for review by team 4 - Needs Review Waiting for reviewer to review or respond 4 - Needs cuIO Reviewer and removed 2 - In Progress Currently a work in progress labels Oct 7, 2022
@karthikeyann karthikeyann marked this pull request as ready for review October 7, 2022 12:30
@karthikeyann karthikeyann requested a review from a team as a code owner October 7, 2022 12:30
@codecov

codecov bot commented Oct 7, 2022

Codecov Report

Base: 87.40% // Head: 88.11% // Increases project coverage by +0.70% 🎉

Coverage data is based on head (8e0c85f) compared to base (f72c4ce).
Patch coverage: 85.47% of modified lines in pull request are covered.

Additional details and impacted files
@@               Coverage Diff                @@
##           branch-22.12   #11864      +/-   ##
================================================
+ Coverage         87.40%   88.11%   +0.70%     
================================================
  Files               133      133              
  Lines             21833    21881      +48     
================================================
+ Hits              19084    19281     +197     
+ Misses             2749     2600     -149     
Impacted Files Coverage Δ
python/cudf/cudf/core/udf/__init__.py 97.05% <ø> (+47.05%) ⬆️
python/cudf/cudf/io/orc.py 92.94% <ø> (-0.09%) ⬇️
...thon/dask_cudf/dask_cudf/tests/test_distributed.py 18.86% <ø> (+4.94%) ⬆️
python/cudf/cudf/core/_base_index.py 82.20% <43.75%> (-3.35%) ⬇️
python/cudf/cudf/io/text.py 91.66% <66.66%> (-8.34%) ⬇️
python/strings_udf/strings_udf/__init__.py 86.27% <76.00%> (-10.61%) ⬇️
python/cudf/cudf/core/index.py 92.91% <98.24%> (+0.28%) ⬆️
python/cudf/cudf/__init__.py 90.69% <100.00%> (ø)
python/cudf/cudf/core/column/categorical.py 89.34% <100.00%> (ø)
python/cudf/cudf/core/scalar.py 90.52% <100.00%> (+1.25%) ⬆️
... and 13 more


@karthikeyann
Contributor Author

rerun tests

@upsj upsj self-requested a review October 10, 2022 07:56
Contributor

@upsj upsj left a comment


First pass

Review comments on cpp/src/io/json/json_tree.cu (outdated, resolved)
auto pid = first_childs_parent_token_id(tid);
return pid < 0
  ? parent_node_sentinel
  : thrust::lower_bound(thrust::seq, node_ids_gpu, node_ids_gpu + num_nodes, pid) -
      node_ids_gpu;
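The snippet above maps a parent token id back to a node index by binary-searching the sorted node token ids. A host-side sketch of the same lookup using std::lower_bound (toy data; parent_node_sentinel is assumed to be -1 here):

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

constexpr int parent_node_sentinel = -1;  // assumed value for illustration

// node_ids: sorted token ids of the tokens that became nodes.
// Given a parent *token* id, return the corresponding *node* index.
int token_to_node(std::vector<int> const& node_ids, int parent_token_id) {
  if (parent_token_id < 0) return parent_node_sentinel;
  auto it = std::lower_bound(node_ids.begin(), node_ids.end(), parent_token_id);
  // Subtracting the base iterator turns the found position into an index.
  return static_cast<int>(it - node_ids.begin());
}
```

This is the lookup the review thread below discusses: each call is an O(log n) binary search, which the reviewer suggests could be replaced by a precomputed structure if it ever dominated the kernel time.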
Contributor

This will probably be a good point to do some precomputations (hashmap or sparse bitmap) if this kernel becomes a performance issue.

Contributor Author

This one takes ~20% of the parent_node_ids computation time: significant, but not the critical part.
A hash map takes extra memory. I tried it now: memory increases from 7.97 GiB to 9.271 GiB, and it's much slower than lower_bound (12 ms for lower_bound vs 133 ms for the hash map).

I am interested to learn about the "sparse bitmap" approach.

Contributor

It's an approach I often use to speed up binary-search-like lookups where the indices are unique. The fundamental idea is a bitvector with rank support:

You store a bitvector containing a 1 if the corresponding token is a node and a 0 otherwise, in groups of 32-bit words, and compute an exclusive prefix sum over the popcount of the words. Computing the lower bound is then prefix_sum[i / 32] + popcnt(prefix_mask(i % 32) & bitvector[i / 32]), where prefix_mask(i) = (1u << i) - 1 has all bits smaller than its parameter set. Overall, this uses 2 bits of storage per token.

If the number of tokens is much larger than the number of nodes, you can make the data structure even sparser: only store 32-bit words that are not all 0 (basically a reduce_by_key over the tokens) and use a normal bitvector to store a bit for each 32-bit word denoting whether it was non-zero. Then you have a two-level lookup (though the second level could also be another lookup data structure, like a hashmap). The data structure has pretty good caching properties, since locality in indices translates to locality in memory, which hashmaps purposefully don't have.

But with the number of nodes not being smaller than the number of tokens by a huge factor, I don't think this would be worth the effort.
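The rank structure described above can be sketched on the host in a few lines (bit_rank and its members are our names, not from the PR): one bit per token marking node tokens, packed into 32-bit words, plus an exclusive prefix sum of word popcounts; rank(i) then evaluates prefix_sum[i / 32] + popcnt(prefix_mask(i % 32) & bitvector[i / 32]).

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Rank-support bitvector: bit i is 1 iff token i is a node.
struct bit_rank {
  std::vector<uint32_t> words;   // packed bits, 32 tokens per word
  std::vector<uint32_t> prefix;  // exclusive prefix sum of word popcounts

  static uint32_t popcnt(uint32_t x) {
    uint32_t c = 0;
    for (; x; x &= x - 1) ++c;  // Kernighan's bit-count
    return c;
  }

  explicit bit_rank(std::vector<bool> const& is_node) {
    words.assign((is_node.size() + 31) / 32, 0);
    for (std::size_t i = 0; i < is_node.size(); ++i)
      if (is_node[i]) words[i / 32] |= 1u << (i % 32);
    prefix.resize(words.size());
    uint32_t running = 0;
    for (std::size_t w = 0; w < words.size(); ++w) {
      prefix[w] = running;
      running += popcnt(words[w]);
    }
  }

  // Number of node tokens with token id < i; equivalent to a lower_bound
  // over the sorted node token ids. prefix_mask(i % 32) = (1u << (i%32)) - 1.
  uint32_t rank(uint32_t i) const {
    uint32_t mask = (1u << (i % 32)) - 1;
    return prefix[i / 32] + popcnt(words[i / 32] & mask);
  }
};
```

Each query touches one prefix-sum entry and one word, so locality in token ids translates directly to locality in memory, as noted above.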

Review comment on cpp/src/io/json/json_tree.cu (outdated, resolved)
@karthikeyann karthikeyann requested a review from upsj October 11, 2022 08:39
Contributor

@upsj upsj left a comment

LGTM! Nice work!

@karthikeyann karthikeyann requested a review from a team October 12, 2022 04:18
Contributor

@bdice bdice left a comment

Partial review -- initial comments. I'll submit additional comments shortly.

Review comments on cpp/src/io/json/json_tree.cu (outdated, resolved)
Contributor

@bdice bdice left a comment

I finished my first pass of review (this is the second half). Comments attached.

Review comments on cpp/src/io/json/json_tree.cu (resolved)
@karthikeyann karthikeyann requested a review from bdice October 14, 2022 02:53
@karthikeyann karthikeyann added 5 - Ready to Merge Testing and reviews complete, ready to merge and removed 3 - Ready for Review Ready for review by team 4 - Needs Review Waiting for reviewer to review or respond 4 - Needs cuIO Reviewer labels Oct 14, 2022
@karthikeyann
Contributor Author

@gpucibot merge

@rapids-bot rapids-bot bot merged commit e91d7d9 into rapidsai:branch-22.12 Oct 14, 2022
@karthikeyann karthikeyann mentioned this pull request Feb 19, 2024