Add GC heap hard limit for 32 bit #101024

Open
wants to merge 1 commit into base: main

Conversation

gbalykov
Member

This change enables heap hard limit on 32 bit.

The approach is similar to DOTNET_GCSegmentSize (GCConfig::GetSegmentSize), which sets the segment size for SOH/LOH/POH and guarantees that there is no overflow during computations (for example, in size_t initial_heap_size = soh_segment_size + loh_segment_size + poh_segment_size;). When DOTNET_GCSegmentSize is set on 32bit, it's rounded down to a power of 2, so the largest possible SOH segment is 2 Gb (4Mb<=soh_segment_size<=2Gb). For LOH/POH the same value is divided by 2 and then also rounded down to a power of 2, so the largest possible LOH/POH segment is 1 Gb (4Mb<=loh_segment_size<=1Gb, 0<=poh_segment_size<=1Gb). So the segment sizes for SOH/LOH/POH never overflow, and neither does initial_heap_size (except for overflow of initial_heap_size to 0, which leads to a failed allocation later anyway).
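To make the arithmetic above concrete, here is a minimal sketch of the rounding and of the 32-bit sum. It is not the runtime's code: size32_t, round_down_power2, soh_segment_from_config, uoh_segment_from_config and initial_heap_size are hypothetical names used only for illustration, and the 4Mb lower bound is left out.

```cpp
#include <cassert>
#include <cstdint>

// Model a 32-bit size_t explicitly so the sketch behaves the same on any host.
using size32_t = uint32_t;

constexpr size32_t MB = 1024u * 1024u;
constexpr size32_t GB = 1024u * MB;

// Largest power of two that is <= value (0 stays 0).
// Hypothetical helper; the runtime has its own rounding routine.
inline size32_t round_down_power2(size32_t value)
{
    for (size32_t bit = 0x80000000u; bit != 0; bit >>= 1)
    {
        if (value & bit)
            return bit;
    }
    return 0;
}

// SOH segment: configured value rounded down to a power of two.
inline size32_t soh_segment_from_config(size32_t configured)
{
    return round_down_power2(configured);
}

// LOH/POH segment: half of the configured value, then rounded down.
inline size32_t uoh_segment_from_config(size32_t configured)
{
    return round_down_power2(configured / 2);
}

// Sum of the three segments in 32-bit arithmetic; wraps modulo 2^32.
inline size32_t initial_heap_size(size32_t soh, size32_t loh, size32_t poh)
{
    return soh + loh + poh;
}
```

With the largest 32-bit input, the SOH segment comes out as 2 Gb and each of LOH/POH as 1 Gb, so the only way the sum can wrap is the exact 4 Gb case, which wraps to 0.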

Similar thing happens when DOTNET_GCHeapHardLimit or DOTNET_GCHeapHardLimitSOH/DOTNET_GCHeapHardLimitLOH/DOTNET_GCHeapHardLimitPOH are set on 32bit. There're limits on these values:

  1. for heap-specific limits:
    0 <= (heap_hard_limit = heap_hard_limit_oh[soh] + heap_hard_limit_oh[loh] + heap_hard_limit_oh[poh]) < 4Gb
    a) 0 <= heap_hard_limit_oh[soh] < 2Gb, 0 <= heap_hard_limit_oh[loh] <= 1Gb, 0 <= heap_hard_limit_oh[poh] <= 1Gb
    b) 0 <= heap_hard_limit_oh[soh] <= 1Gb, 0 <= heap_hard_limit_oh[loh] < 2Gb, 0 <= heap_hard_limit_oh[poh] <= 1Gb
    c) 0 <= heap_hard_limit_oh[soh] <= 1Gb, 0 <= heap_hard_limit_oh[loh] <= 1Gb, 0 <= heap_hard_limit_oh[poh] < 2Gb
  2. for same limit for all heaps:
    0 <= heap_hard_limit <= 1Gb

These ranges guarantee that calculation of soh_segment_size, loh_segment_size and poh_segment_size with alignment and round up won't overflow, as well as calculation of sum of them for allocation (overflow to 0 is allowed, same as for DOTNET_GCSegmentSize). When values specified by user with env variables don't meet the requirements above, runtime exits with CLR_E_GC_BAD_HARD_LIMIT. When allocation (with mmap on Linux) fails, runtime exits with same error as for large segment size specified with DOTNET_GCSegmentSize.
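The ranges listed above could be checked roughly as follows. This is a sketch under assumptions, not the code from this PR: limit_check, check_total_limit_32bit and check_per_heap_limits_32bit are hypothetical names, and the inputs are taken as 64-bit values so the check itself cannot overflow. Cases a), b) and c) together mean "every limit is below 2Gb and at most one of them is above 1Gb".

```cpp
#include <cassert>
#include <cstdint>

constexpr uint64_t GB = 1024ull * 1024ull * 1024ull;

// Hypothetical result type; bad_hard_limit stands in for the
// CLR_E_GC_BAD_HARD_LIMIT exit mentioned in the description.
enum class limit_check { ok, bad_hard_limit };

// Case 2: one limit shared by all heaps, 0 <= heap_hard_limit <= 1Gb.
inline limit_check check_total_limit_32bit(uint64_t heap_hard_limit)
{
    return heap_hard_limit <= GB ? limit_check::ok : limit_check::bad_hard_limit;
}

// Case 1: per-object-heap limits (SOH, LOH, POH).
inline limit_check check_per_heap_limits_32bit(uint64_t soh, uint64_t loh, uint64_t poh)
{
    const uint64_t limits[3] = { soh, loh, poh };
    int above_1gb = 0;
    for (uint64_t limit : limits)
    {
        if (limit >= 2 * GB)            // no single limit may reach 2Gb
            return limit_check::bad_hard_limit;
        if (limit > GB)                 // count the limits above 1Gb
            ++above_1gb;
    }
    if (above_1gb > 1)                  // only one heap may exceed 1Gb
        return limit_check::bad_hard_limit;

    // Implied by the checks above, kept explicit to mirror the stated range.
    if (soh + loh + poh >= 4 * GB)
        return limit_check::bad_hard_limit;

    return limit_check::ok;
}
```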

This patch doesn't enable heap hard limit on 32bit in containers (is_restricted_physical_mem), because the current heap hard limit approach is to reserve one large GC heap segment with size equal to the specified heap hard limit, and no new segments are reserved in the future. On 64 bit in containers the heap hard limit is set by default to 75% of total physical memory available, and this is fine, because the virtual address space on 64 bit is much larger than the actual physical size on devices. In contrast, on 32 bit the virtual address space size might be the same as the available physical size on the device (e.g. 4 Gb for both). This means that reserving 75% of total physical memory would reserve 75% of the whole virtual address space, which is undesirable (e.g. the process later expects more available memory), and mmap on Linux will probably fail anyway.

I've run release CLR tests on armel Linux with this patch on top of f84d33 with different limits; there seem to be no issues with 4Mb/8Mb/16Mb and 512Mb heap hard limits (some tests fail with "Out of memory"/"OOM" and "System.OutOfMemoryException", as expected, or "System.InvalidOperationException: NoGCRegion mode must be set" when NoGC region is larger than limit). Same for GCStress 0xc and 0x3 with 4 Mb heap hard limit, and same for debug CLR tests on armel Linux with 4Mb heap hard limit.

@Maoni0 @cshung please share what you think. Thank you.

@dotnet-policy-service bot added the community-contribution label Apr 14, 2024
Contributor

Tagging subscribers to this area: @dotnet/gc
See info in area-owners.md if you want to be subscribed.

@gbalykov gbalykov marked this pull request as ready for review April 14, 2024 14:32
@cshung
Member

cshung commented Apr 14, 2024

@gbalykov, I am wondering why you would want a heap_hard_limit for 32 bit platforms?

I am working on #100380 (forgive the misleading title) that enabled the computation of committed_by* value at all times independent of heap_hard_limit.

To make heap_hard_limit work on 32 bits, we need to have an accurate way of maintaining how many bytes are currently committed, I don't think I get the calculation right there yet, and that is perhaps why the CI is failing on 32 bit platforms for now.

The failure is an assertion, so you would need to run it under CHECK or DEBUG to reproduce it. The assertion is there to prevent an underflow in subtraction. In release, it will run fine (and fail later because of wrong values).

I was thinking about just limiting my work to just 64 bits, but if there is a reason to, we can figure how to get the calculation right on 32 bit platforms as well.

@gbalykov
Member Author

> I am wondering why you would want a heap_hard_limit for 32 bit platforms?

It's the same as for 64bit: to be able to limit memory consumption of the process and, specifically, GC heap size. Setting the limit specifically on managed heap size allows for much better granularity than using limits for the whole process with cgroups, etc. Even though this change doesn't set a default value for containerized environments (is_restricted_physical_mem), it allows setting the heap limit manually, which is also very useful.

Also here's related discussion about heap limits for 32 bit: #78128.

> To make heap_hard_limit work on 32 bits, we need to have an accurate way of maintaining how many bytes are currently committed

Can you share more details about this? During my analysis of the code (f84d33c) I didn't find any other related places that depend on bitness, and it seems that with this change the behavior on 64 and 32 bit is pretty much the same. For example, committed_by_oh_per_heap is used only under _DEBUG && MULTIPLE_HEAPS on 64 bit even without this change, and committed_by_oh is updated the same way on both 64 and 32 bit with this change. Please correct me if I'm wrong here.

I was also able to test this change with more complex apps, and physical size of heap (Private_Dirty of GC vmas) always remained in allowed limit as expected (of course, assuming that heap is able to fit in that limit with more frequent GC, for example).

@cshung
Member

cshung commented Apr 15, 2024

It does make sense to use heap_hard_limit to control memory usage under 32 bit platforms, and I can work with you to get this done.

To start with, the key difference between 32 bits and 64 bits is USE_REGIONS.

Here is where USE_REGIONS is defined in gcpriv.h

#if defined (HOST_64BIT) && !defined (BUILD_AS_STANDALONE) && !defined(__APPLE__)
#define USE_REGIONS
#endif //HOST_64BIT && BUILD_AS_STANDALONE

Forgive the inaccurate comments please, sorry.

USE_REGIONS is not enabled in 32 bits, and therefore the behavior on x86 or arm32 is significantly different from the 64 bit versions.

Historical Context (Good for knowledge, but feel free to skip on first read)

With respect to commit accounting, let's go back in time and see why we are doing it.

By the time we introduced heap_hard_limit, we needed to know how much memory was committed, and therefore we had a simple counter current_total_committed that is incremented during commit and decremented during virtual_decommit. That was pretty simple.

Then I introduced per object heap hard limit (#36731) by checking against committed_by_oh, it was meant to solve the problem related to initial commit for large pages support. In that change, I proposed a testing scheme by validating the number against the actual heap. The idea is that, we can count the committed bytes during commit/decommit operations, but we can also count the committed bytes by walking the heap data structures, and they should be identical. With that, we can both validate the numbers are correct and that we didn't mess up with the data structures.

By that time, the verification logic was unnecessarily complicated, and therefore we didn't merge those in.

That was before USE_REGIONS. At some point we enabled USE_REGIONS; the verification work can be made simpler under USE_REGIONS, so we did it. As a side effect of that change, I introduced committed_by_oh_per_heap, which makes per-heap validation possible without taking locks.

Much of the verification logic was written with only USE_REGIONS in mind. I wouldn't be surprised that it does not work with segments (i.e. the code without USE_REGIONS, 32 bits in particular).

Keeping track of these numbers is non-trivial; at the very least, we need to take the check_commit_cs lock. Therefore, to make sure we don't regress performance, we chose to keep those values only if heap_hard_limit is enabled.

Then came RefreshMemoryLimit, a feature that lets the user specify new memory limits or ask the system to redetect them. This allows the user to update container limits and let the GC know.

A technical challenge at that point of time is that if heap_hard_limit was not specified before the RefreshMemoryLimit call, and we establish a new one, we will need to have the various commit counter values. In USE_REGIONS, we know how to do it, but in the non-USE_REGIONS case, we don't have a good way to do it, so we just don't support it at the moment.

One simple way to get around it is to always keep the counters, which is what I am trying to do right now, we have to be careful with respect to performance (which I haven't checked yet). I am hoping it is okay.

The known problem - overlapped commit

While it appears that it is easy to just add when we commit and subtract when we decommit, it is not that simple. The problem is overlapped commits.

Under USE_REGIONS, we never call virtual_commit with a memory range that contains committed memory. Therefore the add and subtract always works.

Thanks to the forgiving OS memory management, we can call virtual_commit with a memory range that contains committed memory, and the OS will just ignore the committed part.

Under non-USE_REGIONS, however, it is known that we sometimes call virtual_commit with a memory range that contains committed memory. The simple add will double count.

The double count on its own is wrong, but not fatal. When it happens repeatedly, however, we leak those bytes (while memory is fine), and so we run out of bytes when we try to commit (while in fact we do have memory).

One of the known cases of overlapped commit is the mark array. We don't keep track of the exact bounds where the mark array was committed, so any time we need to ensure the mark array is available and it isn't fully committed, we commit the whole mark array range again.
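The double count and the resulting leak can be reproduced with a toy model. This is only a sketch: toy_commit_tracker is a hypothetical type that tracks pages in a set as the ground truth, next to a naive "add on commit, subtract on decommit" counter. It does not reflect how the GC stores commit information.

```cpp
#include <cassert>
#include <cstddef>
#include <set>

// Toy model of a reserved range tracked at page granularity.
struct toy_commit_tracker
{
    std::set<size_t> committed_pages;  // ground truth: what the OS would hold
    size_t naive_counter = 0;          // add on commit, subtract on decommit

    // Commit pages [first, first + count). Pages that are already committed
    // are ignored by the "OS", but the naive counter still adds all of them.
    void commit(size_t first, size_t count)
    {
        for (size_t page = first; page < first + count; ++page)
            committed_pages.insert(page);
        naive_counter += count;
    }

    // Decommit pages [first, first + count) and subtract the same count.
    void decommit(size_t first, size_t count)
    {
        for (size_t page = first; page < first + count; ++page)
            committed_pages.erase(page);
        naive_counter -= count;
    }

    size_t actually_committed() const { return committed_pages.size(); }
};

// Scenario: commit 4 pages, then recommit the whole 8-page range
// (as described for the mark array), then decommit everything.
inline toy_commit_tracker run_overlap_scenario()
{
    toy_commit_tracker tracker;
    tracker.commit(0, 4);
    tracker.commit(0, 8);    // overlaps the first 4 pages
    tracker.decommit(0, 8);
    return tracker;
}
```

After the scenario no page is committed, yet the naive counter still reports 4 pages. Repeating it makes the counter drift further from the real value each time.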

And there could be other cases that we just don't know about at the moment.

The unknown problem - overflow?

Beyond just overlapped commit, I am seeing other problems as well.

In the case where a heap hard limit is not provided, I suspect the counter overflowed. When I decommit, I am hitting an assert that we are decommitting more memory than we have, and the counter is suspiciously low at that point.

Of course, that's just a guess. There could be other things going on as well. The only thing that I am sure is the number must be wrong and we can't decommit memory that wasn't committed.
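For reference, this is how a 32-bit unsigned counter behaves when it wraps in either direction, and what a guard like the assertion checks. A minimal sketch with hypothetical names (size32_t, add_wrapping, subtract_wrapping, try_subtract_committed); it is not the GC's counter code.

```cpp
#include <cassert>
#include <cstdint>

// Model a 32-bit size_t explicitly so the sketch behaves the same on any host.
using size32_t = uint32_t;

// Unchecked add: wraps modulo 2^32, so a large counter can become small.
inline size32_t add_wrapping(size32_t counter, size32_t bytes)
{
    return counter + bytes;
}

// Unchecked subtract: wraps modulo 2^32, so a small counter can become huge.
inline size32_t subtract_wrapping(size32_t counter, size32_t bytes)
{
    return counter - bytes;
}

// Guarded subtract: refuses to decommit more than is recorded and leaves the
// counter untouched, which is the condition the assertion protects.
inline bool try_subtract_committed(size32_t& counter, size32_t bytes)
{
    if (bytes > counter)
        return false;
    counter -= bytes;
    return true;
}
```

A counter that has wrapped on the add side looks suspiciously low afterwards, which would be consistent with the symptom described above, although that remains a guess.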

Plan of attack

Fundamentally, it is a calculation error. Unfortunately, we don't get any symptoms at the moment the error is introduced, and failing at the end is too late. To make this easier, I think we can

1.) Track mark array usage, and make sure we don't do overlapped commit.
2.) Check the counter against the heap data structures, and make sure they are consistent, and
3.) Use logging to figure out what went wrong.

Notice the dprintf(3, ("commit-accounting ... lines. They are designed so that we know exactly what range of memory is committed/decommitted/transferred between buckets. We can use that to figure out what went wrong.
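Step 2 of the plan, checking the counter against the heap data structures, can be sketched like this. toy_segment, committed_by_walking and verify_committed are hypothetical and heavily simplified; the real heap_segment has more fields and the real check also has to cover bookkeeping memory.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical, simplified segment: only the fields the check needs.
struct toy_segment
{
    size_t mem;        // start of the segment's memory
    size_t committed;  // end of the committed part (mem <= committed)
};

// Recompute the committed bytes by walking every segment.
inline size_t committed_by_walking(const std::vector<toy_segment>& segments)
{
    size_t total = 0;
    for (const toy_segment& segment : segments)
        total += segment.committed - segment.mem;
    return total;
}

// The incrementally maintained counter must match the walked value.
inline bool verify_committed(size_t counter, const std::vector<toy_segment>& segments)
{
    return counter == committed_by_walking(segments);
}
```

Running such a check at a well-defined point (for example after every blocking GC) narrows a wrong counter down to the operations that happened since the last passing check.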

Final words

As you can see, it is not a trivial problem to solve. With effort, I think we can nail it. It will be worth the effort to make 32-bit apps work better.

@gbalykov
Member Author

Thanks for such detailed description!

> Under non-USE_REGIONS, however, it is known that we sometimes call virtual_commit with a memory range that contains committed memory. The simple add will double count.
>
> The double count on its own is wrong, but not fatal. When it happens repeatedly, however, we leak those bytes (while memory is fine), and so we run out of bytes when we try to commit (while in fact we do have memory).

Does this mean that heap hard limit doesn't work correctly with segments even on 64bit on current main? Original PR to add heap hard limit support (dotnet/coreclr#22180) was merged back in 2019, and it seems that GC regions support (#59283) was enabled only in 2022, so during my work on this patch I thought that segments on 64 bit fully support heap hard limit.

If heap hard limit with segment is not fully correct on 64 bit on main now, is it safe to assume that it was correct before GC regions support was enabled by default (e.g. in .net 6)? Or is this possible leak in bytes accounting (e.g. for mark array) a known bug of heap hard limit with segments that was present since 2019?

> The idea is that, we can count the committed bytes during commit/decommit operations, but we can also count the committed bytes by walking the heap data structures, and they should be identical.
>
> Much of the verification logic was written with only USE_REGIONS in mind. I wouldn't be surprised that it does not work with segments (i.e. the code without USE_REGIONS, 32 bits in particular).

If I understand correctly, this logic for counting committed bytes by walking the heap data structures is needed for two things:

  • verification of GC counters and data structures (which is checked under DEBUG && (COMMITTED_BYTES_SHADOW || heap_hard_limit), supported only for 64bit with GC regions)
  • RefreshMemoryLimit (because there might be no limit before it's called, supported only for 64bit with GC regions)

So, if both are disabled for 64 bit, it seems it should be the same as it was before GC regions were enabled by default. Is it correct?
If yes, then it seems that limited 32bit support (same level as 64bit support before GC regions were enabled by default, even though there's a possible leak in accounting) can be enabled with this PR without verification logic and RefreshMemoryLimit, and then we can further work on full support of accounting with segments.

> One simple way to get around it is to always keep the counters, which is what I am trying to do right now

Can you share why is it needed to always keep counters on 64bit with regions if there's already a way to update them by walking the heap data structures? Is it needed for better validation of counters and heap structures?

If there's no way to update GC counters by walking the heap data structures with segments, then always keeping the counters seems the only way to fully support limits on 32 bit (including verification of counters and RefreshMemoryLimit, which I'm also interested in). I'm very interested in your patch from #100380, let me try it with this PR to check the behavior. Can you share the C# program that you use for testing?

Thanks again for your interest in this!

@cshung
Member

cshung commented Apr 22, 2024

Sorry for the late reply, I took the last few days trying to figure out these.

> Does this mean that heap hard limit doesn't work correctly with segments even on 64bit on current main?

I confirm this is the case, to prove that, I have found a couple of issues, I plan to get these fixed.

Bug 1

When we get a new segment by committing memory for large object heap or pinned object heap, it is incorrectly accounted for as Gen 2, this will impact the committed_by_oh values, but should be fine with the overall current_total_committed value.

gc_heap::get_segment (size_t size, gc_oh_num oh)
{
...
result = make_heap_segment ((uint8_t*)mem, size, __this, (uoh_p ? max_generation : 0));
...
}

Bug 2

When we release a segment, we use virtual_free; note that it does not decrease the current_total_committed value.

void gc_heap::virtual_free (void* add, size_t allocated_size, heap_segment* sg)
{
    bool release_succeeded_p = GCToOSInterface::VirtualRelease (add, allocated_size);
    if (release_succeeded_p)
    {
        reserved_memory -= allocated_size;
        dprintf (2, ("Virtual Free size %zd: [%zx, %zx[",
                    allocated_size, (size_t)add, (size_t)((uint8_t*)add + allocated_size)));
    }
}

But when we call it in release_segment, the segment contained committed memory. It is okay from a memory perspective because virtual_free will decommit, but the bytes are wrong.

This will impact both committed_by_oh and current_total_committed, sometimes the error can be quite large there.
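The effect of Bug 2 on the counters can be shown with a toy model. This is a sketch only, not the fix that went into #100380: toy_heap_counters, release_without_accounting and release_with_accounting are hypothetical, and the real code also has to update the per-object-heap values.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical stand-in for the two global counters involved.
struct toy_heap_counters
{
    size_t reserved_memory;
    size_t current_total_committed;
};

// Same shape as the quoted virtual_free: only reserved bytes are adjusted,
// even though releasing the range also drops its committed part.
inline void release_without_accounting(toy_heap_counters& counters,
                                       size_t reserved_size,
                                       size_t /*still_committed*/)
{
    counters.reserved_memory -= reserved_size;
}

// Variant that also removes the still-committed bytes from the counter,
// so the counter keeps matching what is really committed.
inline void release_with_accounting(toy_heap_counters& counters,
                                    size_t reserved_size,
                                    size_t still_committed)
{
    counters.reserved_memory -= reserved_size;
    counters.current_total_committed -= still_committed;
}
```

With the first variant every released segment leaves its committed bytes behind in current_total_committed, which matches the observation that the error can be quite large.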

> If heap hard limit with segment is not fully correct on 64 bit on main now, is it safe to assume that it was correct before GC regions support was enabled by default (e.g. in .net 6)?

With the bugs I have just shown above, it is pretty clear the answer is no. The bugs I have found above have nothing to do with USE_REGIONS.

> Can you share why is it needed to always keep counters on 64bit with regions if there's already a way to update them by walking the heap data structures? Is it needed for better validation of counters and heap structures?

When I built it, it was meant to be a stepping stone. As you probably know, the logic is quite complicated and easy to make mistakes. Building this logic helped me a lot in validating that I am doing the right thing.

With the PR, my goal was to get refresh_memory_limit to work on segments, always keeping the counter is really meant to be a shortcut so that I don't have to make sure the current_total_committed is good. With the same reasoning as yours, I assumed it was good enough (i.e. I knew there were bugs, but if it was enabled back in 2019, the bug can't be too bad). And so if I keep it, it should be as good as it was.

Except my investigations above proved otherwise, the bug is actually quite bad, the current_total_committed value can be off by a few MB per GC, it is not difficult to see the counter will overflow way faster than memory does.

> Can you share the C# program that you use for testing?

It is this failing test case in #100380.

With that test case, I can:

  1. Reproduce the assertion pretty consistently on x86, check, workstation, on top of 100380 bits, on a DevBox.
  2. Added a validation to check the committed_by_oh values for only the heap memory, and
  3. With a tentative fix for the two issues found above, I am able to get the test to pass on x86, check, workstation, but and server.
  4. The test case is still failing on server GC when validating the committed_by_oh values; something else is still wrong, even without the mark array issue.
  5. The CI is still reporting wks failure on osx and 32 bits, workstation. That means something else is still wrong with the numbers in segments. (we use segments for OSX as well)
  6. The CI is passing with the committed counter validated for the heap segments!

Unfortunately, getting these committed counters correct for segments isn't a priority right now. It would be great if you could help with the debugging. Essentially, you want to find where in the code that is probably MULTIPLE_HEAPS specific that does not consistently update the counter and the heap. You will probably want to keep a history of what happened using logging or otherwise, discard the history before the last successful verify_committed_bytes_per_heap check because you knew everything before it was correct, and track the changes to see which one is wrong.
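The "keep a history and discard it after each successful check" idea can be sketched as below. commit_event and commit_history are hypothetical helpers; in the runtime the same information would more likely come from the existing dprintf logging.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// One recorded accounting operation (hypothetical layout).
struct commit_event
{
    std::string what;  // e.g. "commit", "decommit", "release"
    size_t start;
    size_t size;
};

class commit_history
{
public:
    // Record every operation that touches the counters or the heap.
    void record(std::string what, size_t start, size_t size)
    {
        events_.push_back({ std::move(what), start, size });
    }

    // Call with the result of each verification. A passing check proves that
    // everything so far was consistent, so the history can be dropped.
    bool on_verify(bool passed)
    {
        if (passed)
            events_.clear();
        return passed;
    }

    // After a failing check, these are the only candidates for the bug.
    const std::vector<commit_event>& suspects() const { return events_; }

private:
    std::vector<commit_event> events_;
};
```

When a check fails, suspects() contains only the operations since the last passing check, which keeps the search space small.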

Now you probably see why building up these verification routines is a very useful stepping stone for finding bugs.

I know, it is easier said than done, it took me quite a few days to find the couple of issues above. Let me know if there are things I could help with. I am more than happy to help you with this.

@cshung
Member

cshung commented Apr 28, 2024

@gbalykov

With another week of debugging, I figured the bug related to server GC. It isn't related to threading, it is just that I missed one call site to virtual_free that I need to change to make sure the committed byte counters are decremented. With that fix, I am able to get the CI passing (See #100380) with the heap_segment related counters verified for every blocking GC.

With that, we can focus on the recorded_committed_bookkeeping_bucket next. The recorded_committed_free_bucket should always be empty in segments, so if we complete the bookkeeping bucket, we are done with the numbers.

@gbalykov
Member Author

gbalykov commented May 2, 2024

@cshung Great work! I've been on vacation and am busy with another task right now, but I plan to switch back to this starting next week.

@cshung
Member

cshung commented May 9, 2024

@gbalykov, welcome back from your vacation, hope you had a good time.

Another update, I have got the bookkeeping numbers as well (sadly I gave up the mark array bytes for simplicity). With that, I have gone through a round of review with @Maoni0 with my change. I think the change is more or less good to go and I will merge it whenever I can. It would be great if you can rebase this change on top of that, test it on some 32 bit platforms and see if that still works?

To stress test the calculation, you can turn on the COMMITTED_BYTES_SHADOW, that will stress the size computation logic a lot more.

@gbalykov
Member Author

@cshung great! I'll test with your change on 32bit this week

@gbalykov
Member Author

gbalykov commented May 26, 2024

Sorry for late response, took some time for additional testing. I've rebased your changes (#100380) before they were merged on top of 5474ab5 and then applied my patch.

What I've tested on 32 bit:

  1. debug coreclr tests; debug runtime built with COMMITTED_BYTES_SHADOW
  2. debug coreclr tests with DOTNET_GCHeapHardLimit=400000; debug runtime built with COMMITTED_BYTES_SHADOW, Commit Accounting related changes #100380 and this PR
  3. debug coreclr tests with DOTNET_GCHeapHardLimit=400000; debug runtime built with Commit Accounting related changes #100380 and this PR
  4. debug coreclr tests; debug runtime built with COMMITTED_BYTES_SHADOW, Commit Accounting related changes #100380 and this PR
  5. debug coreclr tests with DOTNET_GCHeapHardLimit=20000000; debug runtime built with COMMITTED_BYTES_SHADOW, Commit Accounting related changes #100380 and this PR
  6. debug coreclr tests with DOTNET_gcServer=1 and DOTNET_GCHeapCount=2; debug runtime built with COMMITTED_BYTES_SHADOW
  7. debug coreclr tests with DOTNET_GCHeapHardLimit=800000, DOTNET_gcServer=1 and DOTNET_GCHeapCount=2; debug runtime built with COMMITTED_BYTES_SHADOW, Commit Accounting related changes #100380 and this PR
  8. debug coreclr tests with DOTNET_gcServer=1 and DOTNET_GCHeapCount=2; debug runtime built with COMMITTED_BYTES_SHADOW, Commit Accounting related changes #100380 and this PR

In all cases there're no regressions and no asserts related to heap limit fire (with enabled heap hard limit some tests fail with OOM as expected, some tests work too long in debug and fail with 10min timeout). All of the above include HeapExpansion/plug test that you've mentioned previously (https://github.com/dotnet/runtime/blob/main/src/tests/GC/Features/HeapExpansion/plug.cs), it also passes without issues.

I've not tested with server gc (with multiple heaps) the same way yet, and also didn't run any perf tests with/without COMMITTED_BYTES_SHADOW enabled, and didn't run any release clr tests.

Thanks again for your fixes for segments. I've rebased this PR on latest main, please take a look.

@gbalykov
Member Author

gbalykov commented Jun 4, 2024

@cshung can you please review this? It seems that mark arrays are the only known bug in accounting with segments, so what do you think about reviewing and merging this first, and then focusing on mark arrays?

@cshung
Member

cshung commented Jun 5, 2024

My apologies with not responding promptly.

For the mark array, I think we are planning to just give it up, like what we have right now; the worst impact of that is just inaccurate numbers. By making sure these numbers don't go into the accounting, it won't lead to accounted byte leaks.

The accounting is still not perfect just yet, after that change is merged, we are starting to get random assertions (e.g. #102706 (comment)). I will be taking a look at that next, right now I am concentrated on #97316.

For your change, I did skim through the code, the idea looks good to me.

There are places where you need to handle potential overflow, these are good checks for 64 bits as well (64 bit number also overflows, it is just less likely). I wonder if we can just use the same logic for both 32 and 64 bits.

Eventually, this needs to be signed off by @Maoni0.

Having supporting data is likely to get this done more smoothly. Do you have a particularly motivating scenario, maybe we can run it with and without the hard limit set, and show us the behavior difference?

@gbalykov
Member Author

Thanks for your feedback! Sorry for delay in response, I was busy with other tasks.

> There are places where you need to handle potential overflow, these are good checks for 64 bits as well (64 bit number also overflows, it is just less likely). I wonder if we can just use the same logic for both 32 and 64 bits.

I wanted to minimize impact on 64bit to make this change simpler, and kept the checks for 32bit only since they were essential there. These checks might be useful on 64bit, though I'm not sure whether anyone will try to set such high limits on 64bit.

There's also additional assumption on 64bit, that physical size is smaller than virtual size in containerized environments, where by default heap hard limit is set to 75% of total physical memory available. This is fine, because virtual address space on 64 bit is much larger than actual physical size on devices. So when there comes a time when overflow checks are needed on 64bit, then this 75% setup logic will probably need to be updated too. Note that for 32bit I didn't enable this default value, only manual heap hard limit is added.

> Having supporting data is likely to get this done more smoothly. Do you have a particularly motivating scenario, maybe we can run it with and without the hard limit set, and show us the behavior difference?

The motivating scenario is Tizen applications, where an app can limit its own memory consumption, track it and refresh it (in the future I want to add refresh of memory limit for 32bit too). In the tests that I've done, on some apps I've seen a significant reduction in used GC heap size (Private_Dirty), around 33%. This, of course, implies that the live objects actually in use fit in the specified limits (otherwise, there'll be OOM).

@Maoni0 can you please take a look?

@mangod9
Member

mangod9 commented Jul 29, 2024

@gbalykov, just checking whether this change is ready? Given that we are now past preview7 is this required for 9?

@gbalykov
Member Author

> just checking whether this change is ready?

@mangod9 yes, waiting for review from GC maintainers

> Given that we are now past preview7 is this required for 9?

It would be good to get this to .net 9

@Maoni0
Member

Maoni0 commented Jul 29, 2024

I didn't know this needed my attention till now. so I'll take a look today. sorry for the delay!

@Maoni0
Member

Maoni0 commented Jul 29, 2024

I took a look at the changes and also chatted with @cshung about them. we are a little surprised by some of the changes, for example why the per heap limit is divided up this way (that it has to be a combination of 2, 1 and 1), also the change in init_static_data is on unconditionally even for 64-bit. @cshung will continue to work with you to evaluate whether we can get this into .net 9 or not.

@gbalykov
Member Author

@Maoni0 Thanks for taking a look!

> why the per heap limit is divided up this way (that it has to be a combination of 2, 1 and 1)

This is needed to guarantee that there's no overflow during computations on 32bit, for example, during
size_t initial_heap_size = soh_segment_size + loh_segment_size + poh_segment_size;.

Besides, this mimics what DOTNET_GCSegmentSize does: for SOH on 32bit it's rounded down to a power of 2, so the largest possible SOH segment is 2 Gb (4Mb<=soh_segment_size<=2Gb). For LOH/POH the same value is divided by 2 and then also rounded down to a power of 2, so the largest possible LOH/POH segment is 1 Gb (4Mb<=loh_segment_size<=1Gb, 0<=poh_segment_size<=1Gb). So the segment sizes for SOH/LOH/POH never overflow, and neither does initial_heap_size (except for overflow of initial_heap_size to 0, which leads to a failed allocation later anyway). This change does the same for hard limits.

> the change in init_static_data is on unconditionally even for 64-bit

This change is needed to add upper boundary on gen1 max size with enabled heap hard limit, which can be smaller than what was set for gen1 max size before that (for 32bit soh_segment_size can be 4 Mb). For 64bit this change should not make any difference, because soh_segment_size is always >= 16 Mb and gen1 max size can be 6Mb or soh_segment_size/2, so new limit min (gen1_max_size, soh_segment_size / 2); from this change won't affect it.
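The effect of the new bound can be checked with a few numbers. A minimal sketch: clamp_gen1_max_size is a hypothetical helper that only mirrors the min (gen1_max_size, soh_segment_size / 2) expression quoted above, not the surrounding init_static_data logic.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>

constexpr size_t MB = 1024 * 1024;

// Upper bound on gen1 max size: never more than half of the SOH segment.
inline size_t clamp_gen1_max_size(size_t gen1_max_size, size_t soh_segment_size)
{
    return std::min(gen1_max_size, soh_segment_size / 2);
}
```

With soh_segment_size at 16 Mb or more, a 6 Mb gen1 max size is left unchanged, while a 4 Mb segment brings it down to 2 Mb.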

@Maoni0
Member

Maoni0 commented Aug 1, 2024

> this mimics what DOTNET_GCSegmentSize does: for SOH on 32bit it's rounded down to power of 2, so largest possible value of provided segment (SOH) is 2 Gb (4Mb<=soh_segment_size<=2Gb). For LOH/POH same value is divided by 2 and then also rounded down to power of 2,

are you referring to this code in get_valid_segment_size?

static size_t get_valid_segment_size (BOOL large_seg=FALSE)
{
    size_t seg_size, initial_seg_size;

    if (!large_seg)
    {
        initial_seg_size = INITIAL_ALLOC;
        seg_size = static_cast<size_t>(GCConfig::GetSegmentSize());
    }
    else
    {
        initial_seg_size = LHEAP_ALLOC;
        seg_size = static_cast<size_t>(GCConfig::GetSegmentSize()) / 2;  // <--- we are taking SOH seg size and divide it by 2 if large_seg is TRUE
    }

but this is only used when hardlimit isn't specified. when hardlimit is specified we don't actually call get_valid_segment_size, in GCHeap::Initialize -

    if (gc_heap::heap_hard_limit)
    {
        if (gc_heap::heap_hard_limit_oh[soh])
        {
            large_seg_size = max (gc_heap::adjust_segment_size_hard_limit (gc_heap::heap_hard_limit_oh[loh], nhp), seg_size_from_config);
            pin_seg_size = max (gc_heap::adjust_segment_size_hard_limit (gc_heap::heap_hard_limit_oh[poh], nhp), seg_size_from_config);
        }
        else
        {
            large_seg_size = gc_heap::use_large_pages_p ? gc_heap::soh_segment_size : gc_heap::soh_segment_size * 2;
            pin_seg_size = large_seg_size;
        }
        if (gc_heap::use_large_pages_p)
            gc_heap::min_segment_size = min_segment_size_hard_limit;
    }
    else
    {
        large_seg_size = get_valid_segment_size (TRUE);
        pin_seg_size = large_seg_size;
    }

this is saying if you only specify the total hardlimit, we will divide this limit based on the number of heaps (and not let it fall below the min seg size). and for LOH/POH we'll double it unless we are using large pages. if you did specify the per-object-heap limits they will just be taken as is, with the adjustment for number of heaps. it's very common that folks would specify a small value for POH.

the change in init_static_data is on unconditionally even for 64-bit

This change is needed to add upper boundary on gen1 max size with enabled heap hard limit, which can be smaller than what was set for gen1 max size before that (for 32bit soh_segment_size can be 4 Mb). For 64bit this change should not make any difference, because soh_segment_size is always >= 16 Mb and gen1 max size can be 6Mb or soh_segment_size/2, so new limit min (gen1_max_size, soh_segment_size / 2); from this change won't affect it.

we shouldn't have code enabled just because it doesn't produce a noticeable effect, since later changes may cause a code path to have an effect inadvertently. so I would suggest to only have this code for 32-bit. this is not needed for functionality because this size specifies the max size for gen1 budget. even if it was bigger than half a seg size we wouldn't be able to physically fit that much onto a segment anyway. I do think it's a good perf change for 32-bit to limit this.
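The arithmetic behind the gen1 cap can be checked with a small sketch. `cap_gen1_max_size` is a hypothetical helper that applies only the `min (gen1_max_size, soh_segment_size / 2)` expression quoted above; the 6 Mb, 16 Mb and 4 Mb inputs are the values mentioned in the discussion, not values read from the GC.

```cpp
#include <algorithm>
#include <cstdint>

constexpr uint64_t MB = 1024ull * 1024;

// Hypothetical helper: caps the gen1 max size at half of the SOH segment size.
constexpr uint64_t cap_gen1_max_size(uint64_t gen1_max_size, uint64_t soh_segment_size)
{
    return std::min(gen1_max_size, soh_segment_size / 2);
}

// 64-bit case from the discussion: a segment of at least 16 Mb leaves 6 Mb untouched.
static_assert(cap_gen1_max_size(6 * MB, 16 * MB) == 6 * MB, "cap is a no-op");

// 32-bit case from the discussion: a 4 Mb segment lowers the cap to 2 Mb.
static_assert(cap_gen1_max_size(6 * MB, 4 * MB) == 2 * MB, "cap takes effect");
```

The cap only changes the result once the segment is smaller than twice the gen1 max size, which in this discussion happens only on 32-bit.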

@gbalykov
Member Author

gbalykov commented Aug 1, 2024

are you referring to this code in get_valid_segment_size?

but this is only used when hardlimit isn't specified. when hardlimit is specified we don't actually call get_valid_segment_size, in GCHeap::Initialize

Yes. The idea is that we need to cap the segment sizes from above so that they don't overflow when the heap hard limit is enabled on 32bit. GetSegmentSize does a similar thing on 32bit, fitting the SOH/LOH/POH segments into 4Gb without overflow (except for overflow to 0).

  • In the case of a total limit, this PR caps each segment size for SOH/LOH/POH at 1Gb, i.e. 4Gb/3 rounded down to a power of 2, which gives 3 Gb for SOH+LOH+POH. I decided not to double LOH/POH because, otherwise, the limit for the SOH segment size would have to be 512 Mb (4Gb/5 rounded down to a power of 2), giving 2.5 Gb for SOH+LOH+POH. This is not ideal, because mmap will most probably fail later when trying to reserve 2.5 Gb, and the actual working max size for SOH will probably be just 256 Mb (1.25 Gb for SOH+LOH+POH). However, with the current approach, a 512 Mb limit for SOH/LOH/POH works on the tests that I've tried, so a larger limit can be set this way.
  • In the case of heap-specific limits, this change tries to fit as much as possible into 4 Gb, doing the same thing as GetSegmentSize, except that it checks three possible ways to cap SOH/LOH/POH from above with the max powers of 2 that fit into 4 Gb. This allows any user-specified limit that doesn't lead to overflow, for example a small limit for POH as you mentioned. Similar to the case with a total limit, the 2Gb/1Gb/1Gb upper limits in different combinations for SOH/LOH/POH are enough. For example, this covers the case where someone wants 1.5 Gb for SOH and very small limits for LOH/POH.
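The numbers in the two bullets above can be reproduced with a short sketch. `round_down_power2` and `fits_32bit_limits` are hypothetical helpers written for illustration and are not the functions used in the PR. The three combinations follow the bounds listed in the PR description (one heap strictly below 2 Gb, the other two at most 1 Gb each).

```cpp
#include <cstdint>

constexpr uint64_t MB = 1024ull * 1024;
constexpr uint64_t GB = 1024ull * MB;

// Largest power of 2 that is <= value (returns 0 for value == 0).
constexpr uint64_t round_down_power2(uint64_t value)
{
    uint64_t result = 0;
    for (uint64_t p = 1; p != 0 && p <= value; p <<= 1)
    {
        result = p;
    }
    return result;
}

// Total limit: three equal segments vs. SOH plus doubled LOH/POH (five shares).
static_assert(round_down_power2(4 * GB / 3) == 1 * GB, "4Gb/3 rounds down to 1 Gb");
static_assert(round_down_power2(4 * GB / 5) == 512 * MB, "4Gb/5 rounds down to 512 Mb");

// Heap-specific limits: accept the triple if it fits one of the three
// 2/1/1 combinations, so that soh + loh + poh stays below 4 Gb.
constexpr bool fits_32bit_limits(uint64_t soh, uint64_t loh, uint64_t poh)
{
    return (soh < 2 * GB && loh <= GB && poh <= GB)
        || (soh <= GB && loh < 2 * GB && poh <= GB)
        || (soh <= GB && loh <= GB && poh < 2 * GB);
}
```

For example, `fits_32bit_limits(1536 * MB, 16 * MB, 16 * MB)` accepts 1.5 Gb for SOH with small LOH/POH limits, while 1.5 Gb for both SOH and LOH is rejected because no combination allows two heaps above 1 Gb.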

we shouldn't have code enabled just because it doesn't produce a noticeable effect, since later changes may cause a code path to have an effect inadvertently. so I would suggest to only have this code for 32-bit.

Ok, I agree that it's better to keep this under an ifdef if you have concerns about the future. I'll update this.

this is not needed for functionality because this size specifies the max size for gen1 budget. even if it was bigger than half a seg size we wouldn't be able to physically fit that much onto a segment anyway. I do think it's a good perf change for 32-bit to limit this.

During development I wasn't sure whether it's correct to set the gen1 max size larger than the actual segment size, which is why this change appeared. Thanks for the clarification, it's good that this can also have a positive effect on performance.

@Maoni0
Member

Maoni0 commented Aug 1, 2024

ahh you are right about the combination. 2/1/1 is the only combination to fit into 4gb. I was too used to thinking in 64-bit address space.

there are some code styling issues that make the changes not consistent with the GC code. I'll make some suggestions. I'll do another pass on the changes later today.

@Maoni0
Member

Maoni0 commented Aug 1, 2024

actually.... I don't think these sizes have to be power of 2 (I was just thinking about this a bit more since it seemed unfortunate; and it had been quite some time since I thought about seg size stuff). we just need these sizes to be multiples of a power of 2 number. I need to go to sleep but will give more detail later. this is how use_large_pages_p handles seg sizes.

@Maoni0
Member

Maoni0 commented Aug 8, 2024

sorry I had a lot of distractions. not sure if you had a chance to look at how the segment size is set when use_large_pages_p is set but seg sizes are computed by adjust_segment_size_hard_limit which calls this -

size_t gc_heap::adjust_segment_size_hard_limit_va (size_t seg_size)
{
    return (use_large_pages_p ?
            align_on_segment_hard_limit (seg_size) :
            round_up_power2 (seg_size));
}

when use_large_pages_p is true it just needs to align on min_segment_size_hard_limit and after seg sizes for soh/loh/poh are computed we do this -

        if (gc_heap::use_large_pages_p)
            gc_heap::min_segment_size = min_segment_size_hard_limit;

you could do the same for 32-bit.
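The difference between the two branches of `adjust_segment_size_hard_limit_va` can be illustrated with a sketch. `round_up_power2_sketch` and `align_up_sketch` are hypothetical re-implementations, not the GC's own `round_up_power2` and `align_on_segment_hard_limit`, and the 16 Mb alignment is an assumed value used only for illustration.

```cpp
#include <cstdint>

constexpr uint64_t MB = 1024ull * 1024;

// Smallest power of 2 that is >= value (returns 0 if that would overflow 64 bits).
constexpr uint64_t round_up_power2_sketch(uint64_t value)
{
    uint64_t p = 1;
    while (p != 0 && p < value)
    {
        p <<= 1;
    }
    return p;
}

// Rounds value up to a multiple of alignment; alignment must be a power of 2.
constexpr uint64_t align_up_sketch(uint64_t value, uint64_t alignment)
{
    return (value + alignment - 1) & ~(alignment - 1);
}

// Assumed alignment for illustration only.
constexpr uint64_t assumed_min_seg_size = 16 * MB;

// A 780 Mb limit grows to 1 Gb when rounded up to a power of 2 ...
static_assert(round_up_power2_sketch(780 * MB) == 1024 * MB, "power-of-2 rounding");

// ... but only to 784 Mb when aligned to a multiple of 16 Mb.
static_assert(align_up_sketch(780 * MB, assumed_min_seg_size) == 784 * MB, "alignment");
```

Aligning to a multiple of a power of 2 keeps the segment size close to what the user asked for, which matters most when the virtual address space is as tight as it is on 32-bit.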

@gbalykov
Member Author

gbalykov commented Aug 9, 2024

@Maoni0 thanks for your feedback

I don't think these sizes have to be power of 2 (I was just thinking about this a bit more since it seemed unfortunate

Can you please describe why powers of 2 are not ok in this case? You mean that they might differ significantly from what users specify? (i.e. user sets 780Mb for heap limit, but with this patch it'll round up to 1 Gb on both 32 and 64 bit)

In terms of 2/1/1, mmap will probably fail anyway at something a bit higher than 1.5 Gb for SOH+LOH+POH, so the upper bound for the SOH/LOH/POH heap-specific limits is just there to prevent overflow of SOH+LOH+POH. So, 2/1/1 in different combinations should cover this case (e.g. 1.5 Gb for SOH and something small for LOH/POH), and no one will actually be able to use the full 2/1/1 anyway. The same goes for a single limit shared by all heaps: 1Gb*3 will lead to a failing mmap anyway.

seg sizes are computed by adjust_segment_size_hard_limit which calls this

It seems that this can benefit both 32bit and 64bit, so maybe it's better to do it separately for both after finishing this one. What do you think about this? Besides, the same can be done in get_valid_segment_size and other places that deal with GetSegmentSize, so it seems better to do it separately.

@gbalykov
Member Author

@Maoni0 @mangod9 can this still get to .NET 9?

@Maoni0
Member

Maoni0 commented Aug 29, 2024

sorry again to reply late - I've just been swamped with urgent items and also was OOF for a while.

Can you please describe why powers of 2 are not ok in this case?

because it allows too little flexibility. same as large pages since for large pages we don't allow nearly as much virtual memory space, so we do this align on min seg size thing instead of requiring it to be a power of 2.

It seems that this can benefit both 32bit and 64bit,

for 64-bit we allow 5x the hard limit if it's not using large pages. also for regions (since that's what 64-bit uses) it's not an issue since we don't reserve all segments on initialization for regions (that's only for segments).

can this still get to .NET 9?

I'll defer to @mangod9, in case someone else can work on this but from my side I've just been swamped with really urgent items and I imagine that will continue for the next few months too.

@gbalykov
Member Author

@Maoni0 thanks for your feedback

because it allows too little flexibility

I agree that there's still a lot of room for improvement in the future, including removing the dependency on powers of 2 for both regions and segments, a refresh for segments, etc. But this PR is an initial step that brings the hard limit on 32bit with segments to the same level as on 64bit with segments. Segments, of course, are disabled by default on 64bit, but the runtime can be built with them enabled. So, in this sense, this PR is a finished change that focuses on a specific problem to keep the patch smaller, and further work will improve segments on both 32 and 64 bit.

Labels
area-GC-coreclr community-contribution Indicates that the PR has been added by a community member

4 participants