[feature-wip] (memory tracker) (step2) Hook TCMalloc new/delete automatically counts to MemTracker #8476

Merged
merged 5 commits into from
Mar 20, 2022

Conversation

xinyiZzz
Contributor

@xinyiZzz xinyiZzz commented Mar 14, 2022

Proposed changes

Issue Number: close #7196 (step 2/3)

Problem Summary:

Early Design Documentation: https://shimo.im/docs/DT6JXDRkdTvdyV3G

Implement a new memory-statistics mechanism based on the TCMalloc new/delete hook, MemTracker, and TLS (thread-local storage). The goal is to count every new/delete/malloc/free in the BE process.

Checklist(Required)

  1. Does it affect the original behavior: (Yes)
  2. Has unit tests been added: (Yes)
  3. Has document been added or modified: (Yes)
  4. Does it need to update dependencies: (No)
  5. Are there any changes that cannot be rolled back: (No)

Further comments

Specific purposes:

1. Accurate query memory limit.

  • Before, exec_mem_limit was actually enforced at the Fragment Instance level, not per query. Now all Fragment Instance mem trackers of a query share a common ancestor, the query mem tracker, which acts as a real memory limit for the query.
  • After a thread attaches to a query through TLS, the queryID, instanceID, and query mem tracker are saved in TLS. If the limit is exceeded when the query mem tracker is consumed in the new/delete hook, the query is canceled.
  • Every thread involved in running a query should attach to the query through TLS at startup and detach from it when the thread exits, using the preset macros.
  • Note that some queries that previously succeeded may now fail with OOM; such queries need to be rewritten with a hint that sets exec_mem_limit.
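The attach/detach flow described above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: `QueryTracker`, `ThreadContext`, `AttachQuery`, `on_alloc`, and `on_free` are hypothetical names, and `on_alloc`/`on_free` stand in for the TCMalloc new/delete hooks.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>
#include <memory>
#include <string>
#include <utility>

// Hypothetical query-level tracker shared by all fragment instances of a query.
struct QueryTracker {
    std::string query_id;
    int64_t limit;                        // bytes; plays the role of exec_mem_limit
    std::atomic<int64_t> used{0};
    std::atomic<bool> cancelled{false};

    QueryTracker(std::string id, int64_t limit_bytes)
            : query_id(std::move(id)), limit(limit_bytes) {}
};

// Hypothetical per-thread context: the query the current thread is working for.
struct ThreadContext {
    std::shared_ptr<QueryTracker> query;
};
thread_local ThreadContext tls_ctx;

// RAII guard: attach on construction, detach when the scope ends.
class AttachQuery {
public:
    explicit AttachQuery(std::shared_ptr<QueryTracker> q) { tls_ctx.query = std::move(q); }
    ~AttachQuery() { tls_ctx.query.reset(); }
    AttachQuery(const AttachQuery&) = delete;
    AttachQuery& operator=(const AttachQuery&) = delete;
};

// Stand-in for the new hook: charge the attached query, cancel it when over the limit.
inline void on_alloc(int64_t bytes) {
    assert(bytes >= 0);
    QueryTracker* q = tls_ctx.query.get();
    if (q == nullptr) return;             // thread is not attached to any query
    const int64_t now = q->used.fetch_add(bytes) + bytes;
    if (now > q->limit) q->cancelled.store(true);
}

// Stand-in for the delete hook.
inline void on_free(int64_t bytes) {
    assert(bytes >= 0);
    QueryTracker* q = tls_ctx.query.get();
    if (q != nullptr) q->used.fetch_sub(bytes);
}
```

Because the guard detaches in its destructor, allocations made after the scope ends are no longer charged to the query, which mirrors the "attach at startup, detach at exit" rule.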

2. Clearer tree structure of mem tracker.

  • From top to bottom, the mem tracker tree is clearly divided into: process - query(task) pool - query(task) - fragment instance - exec node - exprs/hash table and others.
  • When a tracker is created, the current tracker in TLS is used as its parent by default, so the mem tracker hierarchy mirrors the code hierarchy, which avoids complicated mem tracker parameter passing.
  • Previously, to record the memory consumption of a code location in a specific mem tracker, the mem tracker had to be passed layer by layer down to the function at that location, which looked messy.
  • Now you only need to attach the query or switch the mem tracker at the outer level, and the mem tracker can be fetched from TLS at any inner location.
  • This is especially useful for RowBatch and MemPool: the memory in RowBlock2 and RowBatch is recorded directly in the TLS tracker, which avoids passing a large number of mem trackers as parameters and creating a new temporary tracker in every Block/Batch (a huge number of trackers with little use).
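The "current TLS tracker becomes the default parent" idea can be sketched as below. All names (`Tracker`, `tls_current`, `make_tracker`, `SwitchTracker`) are hypothetical and chosen for illustration; the real MemTracker interface may differ.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <utility>

// Hypothetical tracker node; consume() propagates up to the root.
struct Tracker {
    std::string label;
    Tracker* parent;
    int64_t consumed = 0;

    Tracker(std::string l, Tracker* p) : label(std::move(l)), parent(p) {}

    void consume(int64_t bytes) {
        for (Tracker* t = this; t != nullptr; t = t->parent) t->consumed += bytes;
    }
};

// The tracker that the current thread is charging right now.
thread_local Tracker* tls_current = nullptr;

// With no explicit parent, the new tracker hangs under the current TLS tracker.
inline Tracker make_tracker(std::string label, Tracker* parent = nullptr) {
    return Tracker(std::move(label), parent != nullptr ? parent : tls_current);
}

// RAII switch: make `t` the current tracker, restore the previous one on scope exit.
class SwitchTracker {
public:
    explicit SwitchTracker(Tracker* t) : old_(tls_current) { tls_current = t; }
    ~SwitchTracker() { tls_current = old_; }
    SwitchTracker(const SwitchTracker&) = delete;
    SwitchTracker& operator=(const SwitchTracker&) = delete;

private:
    Tracker* old_;
};
```

With this shape, nested scopes in the code produce nested trackers without any tracker being passed through function parameters.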

3. Complete and detailed BE memory statistics.

  • Automatic statistics based on hooking TCMalloc new/delete replace the previous manual consume/release calls on mem trackers, so in theory no allocation is missed.
  • The logic that manually calls consume/release for specific memory is still kept, but those mem trackers are now created as virtual trackers whose consume/release does not sync to the parent; they are only used to improve the observability of runtime details.
    Note: this is independent of what the TCMalloc hook records in the thread-local tracker, so the same block of memory is recorded separately in both trackers.
    Note: almost all of the previous manual accounting sites are kept for now; some may be unnecessary and will be revised in the future (TODO).
  • If a third-party library uses its own memory allocator, special handling is required. (TODO)
  • Cache handling: in the LruCache insert phase, ownership of the memory is transferred from the tracker that holds it to the LruCache tracker. In the LruCache find phase, the transfer is reversed. Other caches are handled the same way. We need a way to locate all caches; implementing a memory-coloring detection mode modeled on ASAN may be required. (TODO)

The difference between virtual and non-virtual trackers:

  • non-virtual tracker
    To guarantee that the statistics of the non-virtual mem tracker tree are accurate, memory is counted in only two ways: by changing the TLS mem tracker through attach or switch and counting in the TCMalloc new/delete hook, or by transferring memory ownership between non-virtual trackers.

  • virtual tracker
    Consume/release is called manually, as before. There are two reasons for designing the virtual tracker. First, transferring memory ownership between two trackers releases first and then consumes, which is slower than calling consume/release directly on a virtual tracker. Second, once virtual trackers are blocked through a parameter, the mem tracker tree is kept from becoming messier, and adding or deleting a virtual tracker is safer.

  • The non-virtual tracker is similar to the INFO log level, and the virtual tracker is similar to the DEBUG log level.
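The behavioral difference can be illustrated with a small sketch. `MemTrackerSketch` is a hypothetical class, not the MemTracker in this PR, and it is deliberately not thread-safe. Here `consume` on a non-virtual tracker stands for the hook-driven path, while `consume` on a virtual tracker stands for a manual call.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <utility>

// Hypothetical tracker: a virtual tracker records locally and never touches its parent.
class MemTrackerSketch {
public:
    MemTrackerSketch(std::string label, MemTrackerSketch* parent, bool is_virtual)
            : label_(std::move(label)), parent_(parent), is_virtual_(is_virtual) {}

    void consume(int64_t bytes) {
        consumption_ += bytes;
        // Only non-virtual trackers sync to the parent chain.
        if (!is_virtual_ && parent_ != nullptr) parent_->consume(bytes);
    }

    void release(int64_t bytes) { consume(-bytes); }

    int64_t consumption() const { return consumption_; }
    const std::string& label() const { return label_; }

private:
    std::string label_;
    MemTrackerSketch* parent_;
    bool is_virtual_;
    int64_t consumption_ = 0;
};
```

In this sketch a virtual tracker still knows its parent (useful for display), but its numbers never change the totals of the non-virtual tree, so adding or removing one cannot skew the process or query statistics.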

Existing performance problems and solutions:

  • Problem: a mem tracker shared between threads performs poorly when it is consumed/released frequently in the new/delete hook.
    Solution: the memory consumption of the current thread is cached in TLS, and the mem tracker's consume/release is called only once the accumulated consumption reaches 2M, which avoids frequent calls.

  • Problem: at this stage, std::shared_ptr is used to hold the mem tracker in TLS. When a thread switches mem trackers frequently, the use count of the std::shared_ptr changes frequently, which hurts performance.
    Solution: while a query is attached, TLS caches every mem tracker that has been switched to, together with its uncommitted memory consumption, so the pointer does not need to be reset when switching back to the same mem tracker. In the future, the mem tracker in TLS should be changed to a raw pointer to fix this problem at the source (TODO)
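A minimal sketch of the first solution, batching per-thread consumption before touching the shared tracker. The names (`SharedTracker`, `ThreadMemCache`, `kUntrackedLimit`) are hypothetical, `on_new`/`on_delete` stand in for the TCMalloc hooks, and the 2 MiB threshold is taken from the description above.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Hypothetical tracker shared by many threads; every update is an atomic operation.
struct SharedTracker {
    std::atomic<int64_t> consumption{0};
    std::atomic<int64_t> updates{0};      // counts how often the shared state is touched

    void consume(int64_t bytes) {
        consumption.fetch_add(bytes, std::memory_order_relaxed);
        updates.fetch_add(1, std::memory_order_relaxed);
    }
};

constexpr int64_t kUntrackedLimit = 2 * 1024 * 1024;  // 2 MiB

// Per-thread cache: accumulate locally, flush once the backlog is large enough.
// One instance per thread is assumed (for example, held in a thread_local).
class ThreadMemCache {
public:
    explicit ThreadMemCache(SharedTracker* tracker) : tracker_(tracker) {}

    void on_new(int64_t size) { untracked_ += size; maybe_flush(); }
    void on_delete(int64_t size) { untracked_ -= size; maybe_flush(); }

    // Push whatever is pending to the shared tracker (e.g. on detach).
    void flush() {
        if (untracked_ != 0) {
            tracker_->consume(untracked_);
            untracked_ = 0;
        }
    }

    int64_t untracked() const { return untracked_; }

private:
    void maybe_flush() {
        if (untracked_ >= kUntrackedLimit || untracked_ <= -kUntrackedLimit) flush();
    }

    SharedTracker* tracker_;
    int64_t untracked_ = 0;
};
```

The trade-off is accuracy versus contention: the shared tracker can lag behind each thread by up to the threshold, which is why the benchmark later in this description compares 2M against 512k.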

In the next PR

More detailed memory statistics, such as for exec nodes, exprs, and hash tables, will be implemented by switching the mem tracker while a thread is attached to a query.

Accuracy verification

  1. Accurate process statistics
    If the following three values agree, the overall statistics are accurate:
  • be_ip:webserver_port/mem_tracker - Process tracker - Current Consumption
  • be_ip:webserver_port/memz - Mem Consumption
  • top -p be_process_id
  2. Query memory can be successfully limited
    Set the session variable with set exec_mem_limit = 2147 and submit a query; it returns the full OOM details: Memory exceed limit. fragment=xxx details=xxx., on backend=xxx. Memory left in process limit=xxx. current tracker <label =xxx, used=xxx, limit=xxx, failed alloc size=xxx>.

  3. View more detailed memory statistics
    Set the session variable with set mem_tracker_level = 2 to see INSTANCE-level statistics; see mem_tracker_level.

Performance Testing

Performance drops by about 1%~2% after enabling the TCMalloc new/delete hook.
Test set: SSB LINEORDER, 600w (6 million) rows
Result:

  1. jmeter thread=1
  • master, Avg: 559 (ms)
  • track_new_delete=false, Avg: 558 (ms)
  • track_new_delete=true, untracked_mem_limit_mbytes=2M, Avg: 566 (ms)
  • track_new_delete=true, untracked_mem_limit_mbytes=512k, Avg: 565 (ms)
  2. jmeter thread=10
  • master, Avg: 3853 (ms)
  • track_new_delete=false, Avg: 3834 (ms)
  • track_new_delete=true, untracked_mem_limit_mbytes=2M, Avg: 3929 (ms)
  • track_new_delete=true, untracked_mem_limit_mbytes=512k, Avg: 3912 (ms)

Test SQL:

select k1, count(1) from  ( select LO_ORDERKEY as k1, count(1), max(LO_CUSTKEY)  from LINEORDER2 group by LO_ORDERKEY union all  select LO_CUSTKEY as k1, count(1), max(LO_ORDERKEY) from LINEORDER2 group by LO_CUSTKEY union all  select LO_ORDERKEY as k1, count(1), max(LO_CUSTKEY)  from LINEORDER2 group by LO_ORDERKEY union all  select LO_ORDERKEY as k1, count(1), max(LO_CUSTKEY)  from LINEORDER2 group by LO_ORDERKEY union all  select LO_CUSTKEY as k1, count(1), max(LO_ORDERKEY) from LINEORDER2 group by LO_CUSTKEY union all  select LO_ORDERKEY as k1, count(1), max(LO_CUSTKEY)  from LINEORDER2 group by LO_ORDERKEY union all  select LO_CUSTKEY as k1, count(1), max(LO_ORDERKEY) from LINEORDER2 group by LO_CUSTKEY union all  select LO_ORDERKEY as k1, count(1), max(LO_CUSTKEY)  from LINEORDER2 group by LO_ORDERKEY union all  select LO_CUSTKEY as k1, count(1), max(LO_ORDERKEY) from LINEORDER2 group by LO_CUSTKEY union all  select LO_CUSTKEY as k1, count(1), max(LO_ORDERKEY) from LINEORDER2 group by LO_CUSTKEY union all  select LO_ORDERKEY as k1, count(1), max(LO_CUSTKEY)  from LINEORDER2 group by LO_ORDERKEY union all  select LO_CUSTKEY as k1, count(1), max(LO_ORDERKEY) from LINEORDER2 group by LO_CUSTKEY union all  select LO_ORDERKEY as k1, count(1), max(LO_CUSTKEY)  from LINEORDER2 group by LO_ORDERKEY union all  select LO_ORDERKEY as k1, count(1), max(LO_CUSTKEY)  from LINEORDER2 group by LO_ORDERKEY union all  select LO_ORDERKEY as k1, count(1), max(LO_CUSTKEY)  from LINEORDER2 group by LO_ORDERKEY union all  select LO_CUSTKEY as k1, count(1), max(LO_ORDERKEY) from LINEORDER2 group by LO_CUSTKEY ) a group by k1 limit 10;

@xinyiZzz xinyiZzz force-pushed the hook_tcmalloc_count_mem_tracker branch 2 times, most recently from 91e918c to d56bac7 Compare March 15, 2022 15:27
Contributor

@dataroaring dataroaring left a comment

This patch is hard to review because it changes so many files. I looked at about 100 files and added a few comments.

You'd better split the changes into smaller parts, e.g.: 1. implement the new memory tracker by hooking new/delete, but do not enable it; 2. enable the new memory tracker by removing the old usage; 3. handle null pointers for old-fashioned allocation; 4. add a memory tracker for untracked memory usage.

This way, we can review the 2nd patch quickly and the others carefully. Most of the changes can go into the 2nd part, right?

return "COMPACTION";
default:
return "UNKNOWN";
}
Contributor

We can use array instead of switch case.

Contributor Author

@xinyiZzz xinyiZzz Mar 17, 2022

I found that a switch case is faster than a map. Would a map be more readable, or does it have other benefits?
Reference: https://stackoverflow.com/questions/6860525/c-what-is-faster-lookup-in-hashmap-or-switch-statement

Contributor

I mean an array, not a map, e.g. const char * xx_desc[TYPE_NUM]. But maybe the compiler already translates such a switch case into array-like code?

Contributor Author

@xinyiZzz xinyiZzz Mar 18, 2022

I understand. It seems that compilers do compile switch constructs into an array of code pointers, as the link above says.

I changed it to an array as you suggested, which is more concise. thx~ @dataroaring

Also, I ran a simple benchmark, and the switch construct appears to be faster. I didn't analyze the assembly code carefully, and performance is not the bottleneck in this case.
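For readers following this thread, the two variants being compared look roughly like the sketch below. Only the "COMPACTION" and "UNKNOWN" strings come from the diff above; the enum name `TaskType` and the other enumerators are assumptions made for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical enum; TYPE_NUM doubles as the element count of the lookup table.
enum class TaskType { QUERY = 0, LOAD = 1, COMPACTION = 2, TYPE_NUM = 3 };

// Array variant: one description per enumerator, indexed by the enum value.
constexpr const char* kTaskTypeDesc[] = {"QUERY", "LOAD", "COMPACTION"};
constexpr std::size_t kNumTaskTypes = sizeof(kTaskTypeDesc) / sizeof(kTaskTypeDesc[0]);
static_assert(kNumTaskTypes == static_cast<std::size_t>(TaskType::TYPE_NUM),
              "table must have one entry per TaskType");

inline const char* task_type_string(TaskType type) {
    const auto index = static_cast<std::size_t>(type);
    // The bounds check replaces the switch's default branch.
    return index < kNumTaskTypes ? kTaskTypeDesc[index] : "UNKNOWN";
}

// Switch variant, shown for comparison.
inline const char* task_type_string_switch(TaskType type) {
    switch (type) {
    case TaskType::QUERY:
        return "QUERY";
    case TaskType::LOAD:
        return "LOAD";
    case TaskType::COMPACTION:
        return "COMPACTION";
    default:
        return "UNKNOWN";
    }
}
```

The `static_assert` is what the array form needs in order to stay as safe as the switch: adding an enumerator without a description then fails at compile time instead of reading past the table.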


@xinyiZzz xinyiZzz force-pushed the hook_tcmalloc_count_mem_tracker branch from 8252b7a to 1a387d5 Compare March 16, 2022 14:46
@morningman
Contributor

You'd better split the changes into smaller parts, e.g.: 1. implement the new memory tracker by hooking new/delete, but do not enable it; 2. enable the new memory tracker by removing the old usage; 3. handle null pointers for old-fashioned allocation; 4. add a memory tracker for untracked memory usage.

This is indeed not a good way to submit a PR, but for the current PR I would suggest leaving it as is, to avoid introducing more uncontrollable factors by splitting it.
For subsequent PRs, we will gradually ask for them to be split as much as possible.
But a DSIP is needed.

@xinyiZzz
Contributor Author

xinyiZzz commented Mar 17, 2022

This patch is hard to review because it changes so many files. I looked at about 100 files and added a few comments.

You'd better split the changes into smaller parts, e.g.: 1. implement the new memory tracker by hooking new/delete, but do not enable it; 2. enable the new memory tracker by removing the old usage; 3. handle null pointers for old-fashioned allocation; 4. add a memory tracker for untracked memory usage.

This way, we can review the 2nd patch quickly and the others carefully. Most of the changes can go into the 2nd part, right?

Thanks for the suggestion. I will keep PR sizes under control in the future,
and I am considering moving some minor changes to a later PR.
In fact, I have already split one big feature into three PRs, and this is the second one...

@@ -416,7 +416,7 @@ VOlapTablePartitionParam::VOlapTablePartitionParam(std::shared_ptr<OlapTableSche
: _schema(schema),
_t_param(t_param),
_slots(_schema->tuple_desc()->slots()),
- _mem_tracker(MemTracker::create_tracker(-1, "OlapTablePartitionParam")) {
+ _mem_tracker(MemTracker::create_virtual_tracker(-1, "OlapTablePartitionParam")) {
Contributor

When should I create a virtual memtracker?

Contributor Author

@xinyiZzz xinyiZzz Mar 17, 2022

Because this tracker needs manual consume/release calls.
The difference between virtual and non-virtual trackers:

  • non-virtual tracker
    To guarantee that the statistics of the non-virtual mem tracker tree are accurate, memory is counted in only two ways: by changing the TLS mem tracker through attach or switch and counting in the TCMalloc new/delete hook, or by transferring memory ownership between non-virtual trackers.

  • virtual tracker
    Consume/release is called manually, as before. There are two reasons for designing the virtual tracker. First, transferring memory ownership between two trackers releases first and then consumes, which is slower than calling consume/release directly on a virtual tracker. Second, once virtual trackers are blocked through a parameter, the mem tracker tree is kept from becoming messier, and adding or deleting a virtual tracker is safer.

The non-virtual tracker is similar to the INFO log level, and the virtual tracker is similar to the DEBUG log level.

@@ -48,6 +49,7 @@ NodeChannel::NodeChannel(OlapTableSink* parent, IndexChannel* index_channel, int
if (_parent->_transfer_data_by_brpc_attachment) {
_tuple_data_buffer_ptr = &_tuple_data_buffer;
}
_node_channel_tracker = MemTracker::create_tracker(-1, "NodeChannel");
Contributor

Add BE id to the tracker's name

Contributor Author

All mem trackers within one BE share the same BE id = _ =, so I added the thread id instead.

@@ -154,7 +154,7 @@ Status AggFnEvaluator::prepare(RuntimeState* state, const RowDescriptor& desc, M
_intermediate_slot_desc = intermediate_slot_desc;

_string_buffer_len = 0;
- _mem_tracker = mem_tracker;
+ _mem_tracker = MemTracker::create_virtual_tracker(-1, "AggFnEvaluator", mem_tracker);
Contributor

The memtracker param of AggFnEvaluator::prepare can be removed.

Contributor Author

The memtracker param is used in Expr::prepare on the next line, which I modified.
In addition, it is also the parent of the virtual tracker _mem_tracker.

@@ -99,6 +102,9 @@ class SnapshotManager {
// snapshot
Mutex _snapshot_mutex;
uint64_t _snapshot_base_id;

// TODO(zxy) used after
std::shared_ptr<MemTracker> _mem_tracker = nullptr;
Contributor

What is this for?

Contributor Author

@xinyiZzz xinyiZzz Mar 17, 2022

In the next PR, this tracker will be switched into the TLS mem tracker in the public functions of SnapshotManager.

In this PR, I created all the trackers that will be used in the future and built a complete mem tracker tree. (Perhaps it would have been better to do this in a separate PR... Do you think it should be removed for now? = =||| )

- RowBatch::RowBatch(const RowDescriptor& row_desc, int capacity, MemTracker* mem_tracker)
- : _mem_tracker(mem_tracker),
+ RowBatch::RowBatch(const RowDescriptor& row_desc, int capacity)
+ : _mem_tracker(thread_local_ctx.get()->_thread_mem_tracker_mgr->mem_tracker()),
Contributor

Is it possible for this RowBatch to be transferred from one thread to another?
If so, does the _mem_tracker also need to be changed?

Contributor Author

@xinyiZzz xinyiZzz Mar 17, 2022

As I understand it, which mem tracker a RowBatch uses is not directly tied to the thread that uses it.

I read the question as: can a RowBatch consume TLS mem tracker A on new and release TLS mem tracker B on delete, so that neither tracker is 0 when it is finally destructed?

The current code avoids this problem. RowBatch::acquire_state and RowBatch::transfer_resource_ownership update the mem_tracker and transfer memory ownership of the buffers between two row batches, so the buffers of one RowBatch are never new'ed and deleted on different trackers.

For example, in OlapScanNode::get_next, the RowBatch created by the Scanner is transferred to the external row_batch parameter through RowBatch::acquire_state, and the mem tracker of its buffers is updated. Ownership is transferred between the two row batches via update_mem_tracker.

However, I'm not sure the buffer mem_tracker is updated in every similar place; keeping the new and delete of a buffer on the same tracker requires manual maintenance.

  • I will test this further.

MemPool has a similar situation and provides MemPool::acquire_data and MemPool::exchange_data to transfer chunks. However, I previously recorded the TLS mem tracker in each chunk at allocation time, and found that the chunk tracker differed from the TLS mem tracker when the MemPool was destructed.
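The invariant discussed here (a buffer is always released on the tracker that currently owns it) can be sketched as follows. `Tracker` and `Batch` are hypothetical stand-ins, and `Batch::acquire_state` only imitates the role that RowBatch::acquire_state plays in the description above; it is not the real implementation.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// Hypothetical tracker with an ownership-transfer primitive.
struct Tracker {
    int64_t consumption = 0;

    // Move `bytes` of accounted memory from this tracker to `dst`.
    void transfer_to(Tracker* dst, int64_t bytes) {
        consumption -= bytes;
        dst->consumption += bytes;
    }
};

// Hypothetical batch that owns buffers and charges them to one tracker.
class Batch {
public:
    explicit Batch(Tracker* tracker) : tracker_(tracker) {}
    ~Batch() { tracker_->consumption -= bytes_; }  // release on the current owner
    Batch(const Batch&) = delete;
    Batch& operator=(const Batch&) = delete;

    void add_buffer(std::size_t size) {
        buffers_.emplace_back(size);
        bytes_ += static_cast<int64_t>(size);
        tracker_->consumption += static_cast<int64_t>(size);
    }

    // Take over the buffers of `src` and move their accounting along with them.
    void acquire_state(Batch* src) {
        for (auto& buffer : src->buffers_) buffers_.push_back(std::move(buffer));
        src->buffers_.clear();
        src->tracker_->transfer_to(tracker_, src->bytes_);
        bytes_ += src->bytes_;
        src->bytes_ = 0;
    }

private:
    Tracker* tracker_;
    std::vector<std::vector<char>> buffers_;
    int64_t bytes_ = 0;
};
```

If `acquire_state` moved the buffers without calling `transfer_to`, the source tracker would stay positive and the destination tracker would go negative on destruction, which is exactly the mismatch the comment above describes for MemPool chunks.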

@morningman morningman added the dev/1.0.1-deprecated should be merged into dev-1.0.1 branch label Mar 17, 2022
@xinyiZzz xinyiZzz force-pushed the hook_tcmalloc_count_mem_tracker branch from 1a387d5 to d10fc82 Compare March 17, 2022 13:06
morningman
morningman previously approved these changes Mar 18, 2022
Contributor

@morningman morningman left a comment

LGTM

@github-actions github-actions bot added the approved Indicates a PR has been approved by one committer. label Mar 18, 2022
@github-actions
Contributor

PR approved by at least one committer and no changes requested.

@github-actions
Contributor

PR approved by anyone and no changes requested.

@github-actions github-actions bot removed the approved Indicates a PR has been approved by one committer. label Mar 18, 2022
@xinyiZzz xinyiZzz force-pushed the hook_tcmalloc_count_mem_tracker branch from 151a29a to 21c861e Compare March 18, 2022 08:47
@xinyiZzz xinyiZzz force-pushed the hook_tcmalloc_count_mem_tracker branch from 21c861e to 90d03c6 Compare March 18, 2022 11:54
Contributor

@morningman morningman left a comment

LGTM

@github-actions github-actions bot added the approved Indicates a PR has been approved by one committer. label Mar 19, 2022
@github-actions
Contributor

PR approved by at least one committer and no changes requested.

@morningman morningman merged commit eeae516 into apache:master Mar 20, 2022
@BiteTheDDDDt
Contributor

This PR seems to break the ASAN build.

@morningman
Contributor

This PR seems to break the ASAN build.

@yangzhg PTAL

@xinyiZzz
Contributor Author

xinyiZzz commented Mar 21, 2022

This PR seems to break the ASAN build.

I reproduced the problem under ASAN, and it has been fixed in the following PR:
#8569

@morningman morningman added dev/backlog waiting to be merged in future dev branch and removed dev/1.0.1-deprecated should be merged into dev-1.0.1 branch labels Mar 23, 2022
morningman pushed a commit that referenced this pull request Mar 24, 2022
…emory usage (#8605)

In pr #8476, all memory usage of a process is recorded in the process mem tracker,
and all memory usage of a query is recorded in the query mem tracker,
and it is still necessary to manually call `transfer to` to track the cached memory size.

We hope to separate out more detailed memory usage based on Hook TCMalloc new/delete + TLS mem tracker.

In this pr, the more detailed mem tracker is switched to TLS, which automatically and accurately
counts more detailed memory usage than before.
@xinyiZzz xinyiZzz changed the title [Feature] (Memory) Hook TCMalloc new/delete automatically counts to MemTracker [feature-wip] (memory) Hook TCMalloc new/delete automatically counts to MemTracker Apr 1, 2022
@xinyiZzz xinyiZzz changed the title [feature-wip] (memory) Hook TCMalloc new/delete automatically counts to MemTracker [feature-wip] (memory tracker) (step2) Hook TCMalloc new/delete automatically counts to MemTracker Apr 7, 2022
Labels
approved Indicates a PR has been approved by one committer. area/memory-consumption dev/backlog waiting to be merged in future dev branch kind/improvement reviewed
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[Feature] Refactored memory statistics framework MemTracker
4 participants