[feature-wip] (memory tracker) (step2) Hook TCMalloc new/delete automatically counts to MemTracker #8476

xinyiZzz · 2022-03-14T18:14:06Z

Proposed changes

Issue Number: close #7196 (step 2/3)

Problem Summary:

Early Design Documentation: https://shimo.im/docs/DT6JXDRkdTvdyV3G

Implement a new method of memory accounting based on the TCMalloc new/delete hook, MemTracker, and TLS. It is expected to count all memory that the BE process allocates and frees through new/delete/malloc/free.

Checklist(Required)

Does it affect the original behavior: (Yes)
Has unit tests been added: (Yes)
Has document been added or modified: (Yes)
Does it need to update dependencies: (No)
Are there any changes that cannot be rolled back: (No)

Further comments

specific purpose:

1. Accurate query memory limit.

Previously, exec_mem_limit was actually a memory limit at the fragment instance level, not at the query level. Now, all fragment instance mem trackers of a query share a common ancestor, the query mem tracker, which acts as a real memory limit for the query.
After a thread attaches to a query through TLS, the query ID, instance ID, and query mem tracker are saved in TLS. If the limit is exceeded when the new/delete hook consumes the query mem tracker, the query is canceled.
Every thread involved in running a query should attach to the query through TLS at startup and detach from it when the thread exits, using the preset macros.
Note that this may cause some previously successful queries to fail with OOM; such queries need to be rewritten with a hint that sets exec_mem_limit.
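
A minimal sketch of the shared-ancestor limit described above. `SketchTracker` and `try_consume` are hypothetical names chosen for illustration, not the MemTracker API in this PR; the sketch only shows how two instance trackers without their own limit can be bounded by one query tracker.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>
#include <memory>
#include <string>
#include <utility>

// Hypothetical, simplified tracker for illustration; not the Doris MemTracker API.
class SketchTracker {
public:
    // limit < 0 means "no limit", mirroring the -1 used in the diffs of this PR.
    SketchTracker(std::string label, int64_t limit,
                  std::shared_ptr<SketchTracker> parent = nullptr)
            : _label(std::move(label)), _limit(limit), _parent(std::move(parent)) {}

    // Adds `bytes` to this tracker and every ancestor. If any tracker on the
    // path would exceed its limit, the partial update is rolled back and
    // false is returned; a real system would cancel the query at that point.
    bool try_consume(int64_t bytes) {
        assert(bytes >= 0);
        for (SketchTracker* t = this; t != nullptr; t = t->_parent.get()) {
            const int64_t now = t->_consumption.fetch_add(bytes) + bytes;
            if (t->_limit >= 0 && now > t->_limit) {
                // Undo the additions made so far, from `this` up to and including `t`.
                for (SketchTracker* u = this;; u = u->_parent.get()) {
                    u->_consumption.fetch_sub(bytes);
                    if (u == t) break;
                }
                return false;
            }
        }
        return true;
    }

    // Subtracts `bytes` from this tracker and every ancestor.
    void release(int64_t bytes) {
        for (SketchTracker* t = this; t != nullptr; t = t->_parent.get()) {
            t->_consumption.fetch_sub(bytes);
        }
    }

    int64_t consumption() const { return _consumption.load(); }
    const std::string& label() const { return _label; }

private:
    std::string _label;
    int64_t _limit;
    std::shared_ptr<SketchTracker> _parent;
    std::atomic<int64_t> _consumption{0};
};
```

With a query tracker limited to 100 bytes and two unlimited instance trackers under it, a 60-byte consume on the first instance succeeds, and a following 50-byte consume on the second instance fails because the shared query tracker would reach 110.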

2. Clearer tree structure of mem tracker.

The mem tracker tree is clearly divided, from top to bottom, into: process - query(task) pool - query(task) - fragment instance - exec node - exprs/hash table and others.
When a tracker is created, the current tracker in TLS is used as the parent by default, so the mem tracker hierarchy mirrors the call hierarchy of the code, which avoids complicated mem tracker parameter passing.
Previously, to record the memory consumption of a location in a specific mem tracker, the tracker had to be passed layer by layer down to the function at that location, which was messy.
Now, you only need to attach the query or switch the mem tracker at the outer level, and the tracker can be obtained from TLS at any inner location.
This is especially useful for RowBatch and MemPool: the memory in RowBlock2 and RowBatch is recorded directly in the TLS tracker. This avoids passing a large number of mem trackers as parameters and creating a new temporary tracker in each Block/Batch, which produced a huge number of useless trackers.
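
The default-parent and switch behavior described above can be sketched as follows. This is an illustration under assumptions: `ScopeTracker`, `current_tracker`, `ScopedSwitch`, and `record_allocation` are hypothetical names, and the sketch records bytes only on the current tracker without propagating to ancestors.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <utility>

struct ScopeTracker;

// Per-thread "current tracker" slot (hypothetical stand-in for the TLS context).
inline ScopeTracker*& current_tracker() {
    thread_local ScopeTracker* current = nullptr;
    return current;
}

struct ScopeTracker {
    std::string label;
    ScopeTracker* parent;  // defaults to whatever tracker is current in TLS
    int64_t consumption = 0;

    explicit ScopeTracker(std::string l)
            : label(std::move(l)), parent(current_tracker()) {}
};

// RAII guard: makes `t` the current tracker for this scope and restores the
// previous one on exit, so callers never pass trackers as parameters.
class ScopedSwitch {
public:
    explicit ScopedSwitch(ScopeTracker* t) : _saved(current_tracker()) {
        assert(t != nullptr);
        current_tracker() = t;
    }
    ~ScopedSwitch() { current_tracker() = _saved; }
    ScopedSwitch(const ScopedSwitch&) = delete;
    ScopedSwitch& operator=(const ScopedSwitch&) = delete;

private:
    ScopeTracker* _saved;
};

// Code deep in the call stack: it looks up the tracker from TLS.
inline void record_allocation(int64_t bytes) {
    if (ScopeTracker* t = current_tracker()) t->consumption += bytes;
}
```

A tracker constructed while a `ScopedSwitch` is active gets the switched-in tracker as its parent, which is how the tracker tree ends up following the structure of the code.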

3. Complete and detailed BE memory statistics.

Automatic memory statistics based on hooking TCMalloc new/delete replace the previous manual consume/release calls on mem trackers, so in theory no allocation is missed.
The logic that manually calls consume/release for specific memory is still kept, but these mem trackers are created as virtual trackers whose consume/release does not sync to the parent; they are only used to improve the observability of runtime details.
Note: this is independent of what the TCMalloc hook records in the thread-local tracker, so the same block of memory is recorded independently in both trackers.
Note: almost all previous manual accounting locations are kept for now; some may be unnecessary and will be revised in the future. (TODO)
If a third-party library uses its own memory allocator, special handling is required. (TODO)
How caches are handled: in the LruCache insert phase, ownership of the memory is transferred from the tracker that holds it to the LruCache tracker. In the LruCache find phase, the transfer is reversed. Other caches are handled the same way. We need a way to find where all the caches are; implementing a memory-coloring detection mode modeled on ASAN may be required. (TODO)

The difference between virtual tracker and non-virtual tracker:

Non-virtual tracker
To ensure the absolute accuracy of the non-virtual mem tracker tree statistics, there are only two ways to count: one is to change the TLS mem tracker through attach or switch and count in the TCMalloc new/delete hook; the other is to transfer memory ownership between non-virtual trackers.
Virtual tracker
Consume/release is called manually, as before. There are two reasons for designing the virtual tracker. First, transferring memory ownership between two trackers releases first and then consumes, which is slower than calling consume/release directly on a virtual tracker. Second, virtual trackers can be blocked through a parameter, which keeps the mem tracker tree from becoming messier and makes it safer to add or delete virtual trackers.
The non-virtual tracker is similar to the INFO log level, and the virtual tracker is similar to the DEBUG log level.
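
As a rough illustration of the distinction, the sketch below uses a hypothetical `LevelTracker` with an `is_virtual` flag (both names are assumptions, not identifiers from this PR). A non-virtual tracker forwards every change to its parent; a virtual tracker keeps the number to itself, so it can be added or removed without disturbing the totals in the tree.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical tracker used only to contrast the two kinds.
class LevelTracker {
public:
    LevelTracker(LevelTracker* parent, bool is_virtual)
            : _parent(parent), _is_virtual(is_virtual) {}

    void consume(int64_t bytes) {
        _consumption += bytes;
        // A virtual tracker records locally only; nothing reaches the parent.
        if (!_is_virtual && _parent != nullptr) _parent->consume(bytes);
    }

    void release(int64_t bytes) {
        assert(bytes >= 0);
        consume(-bytes);
    }

    int64_t consumption() const { return _consumption; }

private:
    LevelTracker* _parent;
    bool _is_virtual;
    int64_t _consumption = 0;
};
```

If the same 100 bytes are recorded once on a non-virtual child and once on a virtual child of the same parent, the parent shows 100 rather than 200, which matches the statement that the two recordings are independent.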

Existing performance problems and solutions:

Problem: a mem tracker shared between threads performs poorly when it is frequently consumed/released in the new/delete hook.
Solution: the memory consumption of the current thread is cached in TLS, and the mem tracker's consume/release is called only once the cumulative consumption reaches 2M, which avoids frequent calls.
Problem: at this stage, std::shared_ptr is used to hold the mem tracker in TLS. When a thread switches mem trackers frequently, the use count of the std::shared_ptr changes frequently and performance is low.
Solution: during an attached query, TLS caches all mem trackers that have been switched to, along with their uncommitted memory consumption, so the pointer does not need to be reset when switching to the same mem tracker again. In the future, the mem tracker in TLS should be changed to a raw pointer to solve this problem at the source. (TODO)
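
The batching idea in the first solution can be sketched like this. `SharedCounter` and `UntrackedMemCache` are hypothetical names; in the design described here there would be one such cache per thread (held in TLS), and the threshold would come from configuration rather than a constructor argument.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Stand-in for a mem tracker shared by many threads (hypothetical).
struct SharedCounter {
    std::atomic<int64_t> consumption{0};
    std::atomic<int64_t> updates{0};  // how often the shared state was touched

    void consume(int64_t bytes) {
        consumption.fetch_add(bytes);
        updates.fetch_add(1);
    }
};

// Per-thread buffer of bytes not yet reported to the shared tracker.
// The hook only touches this local counter on most calls.
class UntrackedMemCache {
public:
    UntrackedMemCache(SharedCounter* shared, int64_t flush_threshold)
            : _shared(shared), _threshold(flush_threshold) {
        assert(shared != nullptr && flush_threshold > 0);
    }

    void on_alloc(int64_t bytes) { add(bytes); }
    void on_free(int64_t bytes) { add(-bytes); }

    // Push the accumulated delta to the shared tracker in one call.
    void flush() {
        if (_untracked != 0) {
            _shared->consume(_untracked);
            _untracked = 0;
        }
    }

    int64_t untracked() const { return _untracked; }

private:
    void add(int64_t delta) {
        _untracked += delta;
        // Flush once the pending amount reaches the threshold in either direction.
        if (_untracked >= _threshold || _untracked <= -_threshold) flush();
    }

    SharedCounter* _shared;
    int64_t _threshold;
    int64_t _untracked = 0;
};
```

The trade-off is that the shared tracker can lag behind real usage by up to the threshold per thread, so a smaller threshold gives tighter accounting at the cost of more shared updates.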

In the next pr

More detailed memory statistics, such as for exec nodes, exprs, and hash tables, will be implemented by switching the mem tracker while the thread is attached to a query.

Accuracy verification

Accurate process statistics
If the following three values are the same, the overall statistics are accurate:

be_ip:webserver_port/mem_tracker - Process tracker - Current Consumption
be_ip:webserver_port/memz - Mem Consumption
top -p be_process_id

Query memory can be successfully limited
Set the session variable with set exec_mem_limit = 2147 and submit a query. It returns the full OOM details: Memory exceed limit. fragment=xxx details=xxx., on backend=xxx. Memory left in process limit=xxx. current tracker <label =xxx, used=xxx, limit=xxx, failed alloc size=xxx>.
View more detailed memory statistics
Set the session variable with set mem_tracker_level = 2 to see INSTANCE-level statistics; ref. mem_tracker_level.

Performance Testing

Performance drops by about 1%~2% after enabling the TCMalloc new/delete hook.
Test set: SSB LINEORDER, 600w (6 million) rows
Result:

jmeter thread=1

master, Avg: 559 (ms)
track_new_delete=false, Avg: 558 (ms)
track_new_delete=true, untracked_mem_limit_mbytes=2M, Avg: 566 (ms)
track_new_delete=true, untracked_mem_limit_mbytes=512k, Avg: 565 (ms)

jmeter thread=10

master, Avg: 3853 (ms)
track_new_delete=false, Avg: 3834 (ms)
track_new_delete=true, untracked_mem_limit_mbytes=2M, Avg: 3929 (ms)
track_new_delete=true, untracked_mem_limit_mbytes=512k, Avg: 3912 (ms)

Test SQL:

select k1, count(1) from  ( select LO_ORDERKEY as k1, count(1), max(LO_CUSTKEY)  from LINEORDER2 group by LO_ORDERKEY union all  select LO_CUSTKEY as k1, count(1), max(LO_ORDERKEY) from LINEORDER2 group by LO_CUSTKEY union all  select LO_ORDERKEY as k1, count(1), max(LO_CUSTKEY)  from LINEORDER2 group by LO_ORDERKEY union all  select LO_ORDERKEY as k1, count(1), max(LO_CUSTKEY)  from LINEORDER2 group by LO_ORDERKEY union all  select LO_CUSTKEY as k1, count(1), max(LO_ORDERKEY) from LINEORDER2 group by LO_CUSTKEY union all  select LO_ORDERKEY as k1, count(1), max(LO_CUSTKEY)  from LINEORDER2 group by LO_ORDERKEY union all  select LO_CUSTKEY as k1, count(1), max(LO_ORDERKEY) from LINEORDER2 group by LO_CUSTKEY union all  select LO_ORDERKEY as k1, count(1), max(LO_CUSTKEY)  from LINEORDER2 group by LO_ORDERKEY union all  select LO_CUSTKEY as k1, count(1), max(LO_ORDERKEY) from LINEORDER2 group by LO_CUSTKEY union all  select LO_CUSTKEY as k1, count(1), max(LO_ORDERKEY) from LINEORDER2 group by LO_CUSTKEY union all  select LO_ORDERKEY as k1, count(1), max(LO_CUSTKEY)  from LINEORDER2 group by LO_ORDERKEY union all  select LO_CUSTKEY as k1, count(1), max(LO_ORDERKEY) from LINEORDER2 group by LO_CUSTKEY union all  select LO_ORDERKEY as k1, count(1), max(LO_CUSTKEY)  from LINEORDER2 group by LO_ORDERKEY union all  select LO_ORDERKEY as k1, count(1), max(LO_CUSTKEY)  from LINEORDER2 group by LO_ORDERKEY union all  select LO_ORDERKEY as k1, count(1), max(LO_CUSTKEY)  from LINEORDER2 group by LO_ORDERKEY union all  select LO_CUSTKEY as k1, count(1), max(LO_ORDERKEY) from LINEORDER2 group by LO_CUSTKEY ) a group by k1 limit 10;

dataroaring

The patch is hard to review because it changes so many files. I looked at about 100 files and added a few comments.

It would be better to split the changes into smaller parts, e.g. 1. implement the new memory tracker by hooking new/delete, but do not enable it; 2. enable the new memory tracker by removing the old usage; 3. handle null pointers for old-fashioned allocation; 4. add memory trackers for untracked memory usage.

This way, we can review the 2nd patch quickly and review the others carefully. The majority can be put into the 2nd part, right?

dataroaring · 2022-03-16T01:26:59Z

be/src/runtime/thread_context.h

+            return "COMPACTION";
+        default:
+            return "UNKNOWN";
+        }


We can use an array instead of a switch-case.

I found switch-case to be faster than using a map. Would a map be more readable, or does it have other benefits?
Reference: https://stackoverflow.com/questions/6860525/c-what-is-faster-lookup-in-hashmap-or-switch-statement

I meant an array, not a map, e.g. const char * xx_desc[TYPE_NUM]. But maybe the compiler translates such a switch-case into array-like code anyway?

I understand. It seems that compilers do compile switch constructs into an array of code pointers, as the link above says.

I changed it to an array as you suggested, which seems more concise. thx~ @dataroaring

Also, I ran a simple benchmark and switch constructs seem to be faster. I didn't analyze the assembly carefully, but performance is not the bottleneck in this case.
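
For readers following this thread, here is a small sketch of the array-based lookup being discussed. The enum `TaskType` and all of its values except the "COMPACTION" and "UNKNOWN" strings visible in the diff are assumptions made for illustration; the actual enum in thread_context.h may differ.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical enum; only "COMPACTION" and "UNKNOWN" appear in the diff above.
enum class TaskType { UNKNOWN = 0, QUERY = 1, LOAD = 2, COMPACTION = 3, TYPE_NUM = 4 };

constexpr std::size_t kTaskTypeCount = static_cast<std::size_t>(TaskType::TYPE_NUM);

// One name per enum value, indexed by the enum's integer value.
// Too many initializers fail to compile; too few leave trailing nullptr slots.
constexpr const char* kTaskTypeNames[kTaskTypeCount] = {"UNKNOWN", "QUERY", "LOAD",
                                                        "COMPACTION"};

inline const char* task_type_name(TaskType t) {
    const auto i = static_cast<std::size_t>(t);
    if (i >= kTaskTypeCount) return "UNKNOWN";  // same fallback as the switch default
    assert(kTaskTypeNames[i] != nullptr);       // catches a slot left unfilled
    return kTaskTypeNames[i];
}
```

The array version keeps names and values in one table, at the cost of having to keep the table order in sync with the enum by hand.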

be/src/runtime/thread_context.h

be/src/olap/task/engine_alter_tablet_task.cpp

morningman · 2022-03-17T02:46:26Z

It would be better to split the changes into smaller parts, e.g. 1. implement the new memory tracker by hooking new/delete, but do not enable it; 2. enable the new memory tracker by removing the old usage; 3. handle null pointers for old-fashioned allocation; 4. add memory trackers for untracked memory usage.

This is indeed not a good way to submit a PR, but for the current PR I would suggest leaving it as is, to avoid introducing more uncontrollable factors by splitting it.
For subsequent PRs, we will gradually ask for them to be split as much as possible.
But a DSIP is needed.

xinyiZzz · 2022-03-17T03:35:00Z

The patch is hard to review because it changes so many files. I looked at about 100 files and added a few comments.

It would be better to split the changes into smaller parts, e.g. 1. implement the new memory tracker by hooking new/delete, but do not enable it; 2. enable the new memory tracker by removing the old usage; 3. handle null pointers for old-fashioned allocation; 4. add memory trackers for untracked memory usage.

This way, we can review the 2nd patch quickly and review the others carefully. The majority can be put into the 2nd part, right?

Thanks for your suggestion. I will pay attention to controlling the size of PRs in the future.
I am considering moving some minor changes to a later PR.
In fact, I have already split a big feature into three PRs, and this is the second one...

be/src/exec/base_scanner.cpp

be/src/exec/es_http_scanner.cpp

morningman · 2022-03-15T13:43:29Z

be/src/exec/tablet_info.cpp

@@ -416,7 +416,7 @@ VOlapTablePartitionParam::VOlapTablePartitionParam(std::shared_ptr<OlapTableSche
        : _schema(schema),
          _t_param(t_param),
          _slots(_schema->tuple_desc()->slots()),
-          _mem_tracker(MemTracker::create_tracker(-1, "OlapTablePartitionParam")) {
+          _mem_tracker(MemTracker::create_virtual_tracker(-1, "OlapTablePartitionParam")) {


When should I create a virtual memtracker?

Because this tracker needs manual consume/release calls.
The difference between virtual tracker and non-virtual tracker:

Non-virtual tracker
To ensure the absolute accuracy of the non-virtual mem tracker tree statistics, there are only two ways to count: one is to change the TLS mem tracker through attach or switch and count in the TCMalloc new/delete hook; the other is to transfer memory ownership between non-virtual trackers.

Virtual tracker
Consume/release is called manually, as before. There are two reasons for designing the virtual tracker. First, transferring memory ownership between two trackers releases first and then consumes, which is slower than calling consume/release directly on a virtual tracker. Second, virtual trackers can be blocked through a parameter, which keeps the mem tracker tree from becoming messier and makes it safer to add or delete virtual trackers.

The non-virtual tracker is similar to the INFO log level, and the virtual tracker is similar to the DEBUG log level.

morningman · 2022-03-15T13:46:32Z

be/src/exec/tablet_sink.cpp

@@ -48,6 +49,7 @@ NodeChannel::NodeChannel(OlapTableSink* parent, IndexChannel* index_channel, int
    if (_parent->_transfer_data_by_brpc_attachment) {
        _tuple_data_buffer_ptr = &_tuple_data_buffer;
    }
+    _node_channel_tracker = MemTracker::create_tracker(-1, "NodeChannel");


Add BE id to the tracker's name

The BE id is the same for every mem tracker within a BE = _ =, so I added the thread id instead.

morningman · 2022-03-15T13:48:32Z

be/src/exprs/agg_fn_evaluator.cpp

@@ -154,7 +154,7 @@ Status AggFnEvaluator::prepare(RuntimeState* state, const RowDescriptor& desc, M
    _intermediate_slot_desc = intermediate_slot_desc;

    _string_buffer_len = 0;
-    _mem_tracker = mem_tracker;
+    _mem_tracker = MemTracker::create_virtual_tracker(-1, "AggFnEvaluator", mem_tracker);


The memtracker param of AggFnEvaluator::prepare can be removed.

The memtracker param is used in Expr::prepare on the next line, which I modified.
In addition, it is also the parent of the virtual tracker _mem_tracker.

morningman · 2022-03-15T14:45:12Z

be/src/olap/snapshot_manager.h

@@ -99,6 +102,9 @@ class SnapshotManager {
    // snapshot
    Mutex _snapshot_mutex;
    uint64_t _snapshot_base_id;
+
+    // TODO(zxy) used after
+    std::shared_ptr<MemTracker> _mem_tracker = nullptr;


What is this for?

In the next PR, this tracker will be switched in as the TLS mem tracker in the public functions of SnapshotManager.

In this PR, I created all the trackers that will be used in the future and built a complete mem tracker tree. (Perhaps it would have been better to do this in a separate PR... Do you think it should be removed for now? = =||| )

be/src/olap/tablet_manager.cpp

be/src/olap/task/engine_alter_tablet_task.cpp

be/src/runtime/dpp_sink.cpp

morningman · 2022-03-17T03:13:29Z

be/src/runtime/row_batch.cpp

-RowBatch::RowBatch(const RowDescriptor& row_desc, int capacity, MemTracker* mem_tracker)
-        : _mem_tracker(mem_tracker),
+RowBatch::RowBatch(const RowDescriptor& row_desc, int capacity)
+        : _mem_tracker(thread_local_ctx.get()->_thread_mem_tracker_mgr->mem_tracker()),


Is it possible for this RowBatch to be transferred from one thread to another?
If so, does the _mem_tracker also need to be changed?

My understanding is that modifying the mem tracker of a RowBatch is not directly related to which thread uses it.

As I understand the question, it asks whether a RowBatch could consume TLS mem tracker A on new and release TLS mem tracker B on delete, so that neither tracker is 0 when it is finally destructed.

The current code avoids this problem. RowBatch::acquire_state and RowBatch::transfer_resource_ownership update the mem_tracker and transfer memory ownership of the buffers between two row batches, so that the buffers of a RowBatch are never allocated and freed on different trackers.

For example, in OlapScanNode::get_next, the RowBatch created by the Scanner is transferred to the external row_batch parameter through RowBatch::acquire_state, and the mem tracker of the buffer is modified. Ownership is transferred between the two row batches via update_mem_tracker.

But I'm not sure whether the buffer mem_tracker is updated in all similar places; manual maintenance is required to ensure that the new and delete of a buffer happen on the same tracker.

I will test this further.

MemPool has a similar situation: it provides MemPool::acquire_data and MemPool::exchange_data to transfer chunks. However, I previously recorded the TLS mem tracker in each chunk at allocation time, and found that the chunk tracker differed from the TLS mem tracker when the MemPool was destructed.
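
The invariant discussed in this thread, that a buffer is released on the same tracker that currently owns it, can be sketched as below. `OwnerTracker`, `TrackedBuffer`, and the three free functions are hypothetical and much simpler than RowBatch::acquire_state or MemPool::acquire_data; they only show the release-then-consume transfer.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical tracker holding a plain byte count.
class OwnerTracker {
public:
    void consume(int64_t bytes) { _consumption += bytes; }
    void release(int64_t bytes) { _consumption -= bytes; }

    // Ownership transfer: release on the source, then consume on the destination.
    void transfer_to(OwnerTracker* dst, int64_t bytes) {
        assert(dst != nullptr);
        release(bytes);
        dst->consume(bytes);
    }

    int64_t consumption() const { return _consumption; }

private:
    int64_t _consumption = 0;
};

// A buffer remembers which tracker currently accounts for it.
struct TrackedBuffer {
    int64_t size;
    OwnerTracker* tracker;
};

inline TrackedBuffer allocate_buffer(OwnerTracker* owner, int64_t size) {
    owner->consume(size);
    return TrackedBuffer{size, owner};
}

// Hand the buffer to a new owner and move the accounted bytes along with it.
inline void acquire_buffer(TrackedBuffer& buf, OwnerTracker* new_owner) {
    buf.tracker->transfer_to(new_owner, buf.size);
    buf.tracker = new_owner;
}

// Free on whichever tracker owns the buffer now, so both trackers end at zero.
inline void free_buffer(TrackedBuffer& buf) {
    buf.tracker->release(buf.size);
    buf.size = 0;
}
```

If `acquire_buffer` were skipped before a hand-off, the source tracker would stay positive and the destination would go negative on free, which is the imbalance the question above is worried about.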

morningman

LGTM

github-actions · 2022-03-18T02:20:18Z

PR approved by at least one committer and no changes requested.

github-actions · 2022-03-18T02:20:20Z

PR approved by anyone and no changes requested.

morningman

LGTM

github-actions · 2022-03-19T09:11:22Z

PR approved by at least one committer and no changes requested.

BiteTheDDDDt · 2022-03-21T07:13:59Z

This PR seems to make the ASAN build fail.

morningman · 2022-03-21T07:18:59Z

This PR seems to make the ASAN build fail.

@yangzhg PTAL

xinyiZzz · 2022-03-21T14:35:19Z

This PR seems to make the ASAN build fail.

I reproduced the problem under ASAN; it has been fixed in the following PR:
#8569

…emory usage (#8605) In pr #8476, all memory usage of a process is recorded in the process mem tracker, and all memory usage of a query is recorded in the query mem tracker, and it is still necessary to manually call `transfer to` to track the cached memory size. We hope to separate out more detailed memory usage based on Hook TCMalloc new/delete + TLS mem tracker. In this pr, the more detailed mem tracker is switched to TLS, which automatically and accurately counts more detailed memory usage than before.

morningman added kind/improvement area/memory-consumption labels Mar 15, 2022

xinyiZzz force-pushed the hook_tcmalloc_count_mem_tracker branch 2 times, most recently from 91e918c to d56bac7 Compare March 15, 2022 15:27

dataroaring reviewed Mar 16, 2022

xinyiZzz force-pushed the hook_tcmalloc_count_mem_tracker branch from 8252b7a to 1a387d5 Compare March 16, 2022 14:46

morningman reviewed Mar 17, 2022

morningman added the dev/1.0.1-deprecated should be merged into dev-1.0.1 branch label Mar 17, 2022

xinyiZzz added 2 commits March 17, 2022 20:13

hook_tcmalloc_count_mem_tracker

9f5eebd

fix cache tracker

e9d20cc

xinyiZzz force-pushed the hook_tcmalloc_count_mem_tracker branch from 1a387d5 to d10fc82 Compare March 17, 2022 13:06

fix comment

7ae4be8

morningman previously approved these changes Mar 18, 2022

github-actions bot added the approved Indicates a PR has been approved by one committer. label Mar 18, 2022

github-actions bot added the reviewed label Mar 18, 2022

xinyiZzz dismissed morningman’s stale review via 151a29a March 18, 2022 05:16

github-actions bot removed the approved Indicates a PR has been approved by one committer. label Mar 18, 2022

xinyiZzz force-pushed the hook_tcmalloc_count_mem_tracker branch from 151a29a to 21c861e Compare March 18, 2022 08:47

last review

90d03c6

xinyiZzz force-pushed the hook_tcmalloc_count_mem_tracker branch from 21c861e to 90d03c6 Compare March 18, 2022 11:54

fix ut debug

a53631d

morningman approved these changes Mar 19, 2022

github-actions bot added the approved Indicates a PR has been approved by one committer. label Mar 19, 2022

morningman merged commit eeae516 into apache:master Mar 20, 2022

adonis0147 mentioned this pull request Mar 21, 2022

[feature-wip][array-type] Fix compilation error. #8556

Merged

adonis0147 mentioned this pull request Mar 22, 2022

[feature-wip](array-type) Fix compilation error. #8591

Merged

xinyiZzz mentioned this pull request Mar 22, 2022

[feature-wip] (memory tracker) (step3) Switch TLS mem tracker to separate more detailed memory usage #8605

Merged

morningman added dev/backlog waiting to be merged in future dev branch and removed dev/1.0.1-deprecated should be merged into dev-1.0.1 branch labels Mar 23, 2022

xinyiZzz mentioned this pull request Mar 25, 2022

[feature-wip] (memory tracker) (step4) Switch TLS mem tracker to separate more detailed memory usage #8669

Merged

xinyiZzz changed the title ~~[Feature] (Memory) Hook TCMalloc new/delete automatically counts to MemTracker~~ [feature-wip] (memory) Hook TCMalloc new/delete automatically counts to MemTracker Apr 1, 2022

xinyiZzz changed the title ~~[feature-wip] (memory) Hook TCMalloc new/delete automatically counts to MemTracker~~ [feature-wip] (memory tracker) (step2) Hook TCMalloc new/delete automatically counts to MemTracker Apr 7, 2022
