[WIP] String dedup PoC #15135

Rattenkrieg · 2017-11-21T02:20:32Z

PoC for #14208

Currently able to deduplicate strings in non-concurrent workstation mode.

Algorithm in brief:

GC cycle N: mark phase (STW only at the moment):
- During the mark phase of the rootset and heap, check whether the current object pointer is of interest:
  the pointer is a string && it is being evacuated from gen1 && it has at least sizeof ptr characters;
- Enqueue each interesting pointer in StringDedupQueue;
- If stats suggest compacting, we need to adjust the freshly enqueued pointers to their relocation addresses - look at the is_string_and_about_to_be_promoted_to_gen2 call sites. That is double work, but I see no way to know in advance whether we are going to compact (with the exception of rare cases when the user asks for compaction via GCCollectionMode.Forced).
After GC: in parallel with mutators, StringDedupThread populates StringDedupTable with strings from StringDedupQueue:
- Currently StringDedupThread runs synchronously after the STW GC to ease my debugging.
GC cycle N + x, where x is the number of GC cycles before the next gen2 compaction: relocate phase (STW only at the moment):
- adjust_string_dups_reloc_pointers is called right before relocating pointers - for each group of duplicates in StringDedupTable we look for the first live string in the heap and destructively adjust the other strings by marking their object headers with 0xFF... and writing a pointer to the original into their first sizeof ptr characters;
- During the reloc phase of the rootset and heap, try_relocate_duplicate_string is called to redirect pointers to duplicates to the original string.
- Everything is opaque to the compact phase - it just memmoves plugs, so duplicates become floating garbage until the next gen2 cycle sweeps or compacts their memory.
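
The candidate filter and the destructive forwarding step described above can be sketched roughly as follows. This is a minimal illustration and not code from the PR: ToyString, kDuplicateMark, is_dedup_candidate, forward_duplicate and try_relocate_duplicate are hypothetical names, and the object layout is a toy stand-in for the real string object.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Toy stand-in for a heap string; the real layout lives in the runtime (assumption).
struct ToyString {
    std::uintptr_t header;     // stand-in for the object header word
    std::uint32_t  length;     // number of UTF-16 code units
    char16_t       chars[16];  // inline character storage, fixed size for the sketch
};

// The "0xFF..." pattern written into the header of a duplicate.
constexpr std::uintptr_t kDuplicateMark = ~std::uintptr_t{0};

// Mark-phase filter: a string leaving gen1 that has at least sizeof(void*) characters.
bool is_dedup_candidate(bool is_string, int generation, bool promoted, std::uint32_t length) {
    return is_string && generation == 1 && promoted && length >= sizeof(void*);
}

// Destructively turn `dup` into a forwarding record that points at `original`.
void forward_duplicate(ToyString& dup, const ToyString& original) {
    assert(dup.length >= sizeof(void*));  // guaranteed by the mark-phase filter
    const ToyString* target = &original;
    dup.header = kDuplicateMark;
    std::memcpy(dup.chars, &target, sizeof(target));  // overwrite the first characters
}

// Reloc-phase helper: follow the forwarding record if `ref` points at a marked duplicate.
const ToyString* try_relocate_duplicate(const ToyString* ref) {
    if (ref->header != kDuplicateMark) return ref;
    const ToyString* target = nullptr;
    std::memcpy(&target, ref->chars, sizeof(target));
    return target;
}
```

Once a duplicate has been forwarded its character data is gone, which is why the step is only safe while mutators are stopped and every reference to the duplicate is redirected in the same relocate phase.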

Things to do:

Fix code according to style, fix types like WCHAR vs wchar_t, BOOL vs bool etc - I need help here;
Insert contracts into methods - I know about the document, but I have issues with managing headers and stuff;
Make checked runs pass on non-concurrent workstation;
Run StringDedupThread concurrently, handle related routines like cooperative join;
Hack concurrent and server gc.

Rattenkrieg · 2017-11-21T02:23:31Z

src/gc/gc.h

@@ -105,6 +106,7 @@ extern "C" uint8_t* g_gc_highest_address;
 extern "C" GCHeapType g_gc_heap_type;
 extern "C" uint32_t g_max_generation;
 extern "C" MethodTable* g_gc_pFreeObjectMethodTable;
+extern "C" MethodTable** g_gc_pStringClass;


It's lame to have a nested pointer, but the GC is initialized before the EE sets pointers to method tables for such types.

perhaps we should set it lazily, then?

swgillespie · 2017-11-21T03:06:12Z

src/gc/gc.cpp

+    {
+        return false;
+    }
+    // it's sad we can't leverage mark bit to indicate that string is duplicate


quick thought: the lowest two bits of the method table pointer are usable and we only use one of them. We could in theory use the other bit for this.
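
A minimal sketch of the spare-bit idea, assuming method tables are at least 4-byte aligned so the two lowest pointer bits are always zero. MethodTableStub, ObjectHeaderSketch and the bit assignments are hypothetical; which of the two bits the GC already uses is an assumption here.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for a method table; only its alignment matters here.
struct alignas(8) MethodTableStub { int placeholder; };

constexpr std::uintptr_t kMarkBit      = 0x1;  // assumed to be the bit already in use
constexpr std::uintptr_t kDuplicateBit = 0x2;  // the spare bit, repurposed for "is duplicate"
constexpr std::uintptr_t kTagMask      = kMarkBit | kDuplicateBit;

class ObjectHeaderSketch {
public:
    explicit ObjectHeaderSketch(const MethodTableStub* mt)
        : raw_(reinterpret_cast<std::uintptr_t>(mt)) {
        assert((raw_ & kTagMask) == 0);  // alignment leaves both low bits free
    }

    // Strip the tag bits before using the value as a pointer.
    const MethodTableStub* method_table() const {
        return reinterpret_cast<const MethodTableStub*>(raw_ & ~kTagMask);
    }

    void set_marked()          { raw_ |= kMarkBit; }
    bool is_marked() const     { return (raw_ & kMarkBit) != 0; }
    void set_duplicate()       { raw_ |= kDuplicateBit; }
    bool is_duplicate() const  { return (raw_ & kDuplicateBit) != 0; }

private:
    std::uintptr_t raw_;  // method table pointer plus tag bits
};
```

Every read of the method table pointer has to mask out both bits, so any code path that dereferences the raw word directly would need auditing.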

swgillespie · 2017-11-21T03:10:03Z

src/gc/stringdedup.cpp

+    size_t number_of_heaps
+)
+{
+    table = new StringDedupTable;


Beware of allocation failures here. All of these will throw on failure and the GC is not allowed to throw. You should use the nothrow operator new and return a bool or something to indicate whether or not the initialization was successful.

It is a huge pain that we can't use std::unique_ptr here, and I'm sorry about that. I think that's something we definitely should address in the GC codebase as a whole going forward.
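
A minimal sketch of the suggested pattern, using `new (std::nothrow)` and a bool result. StringDedupSketch and StringDedupTableSketch are hypothetical stand-ins for the PR's types, and the bucket sizing is made up for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <new>

// Hypothetical stand-in for the PR's StringDedupTable.
struct StringDedupTableSketch {
    std::size_t bucket_count = 0;
};

class StringDedupSketch {
public:
    StringDedupSketch() = default;
    StringDedupSketch(const StringDedupSketch&) = delete;
    StringDedupSketch& operator=(const StringDedupSketch&) = delete;
    ~StringDedupSketch() { delete table_; }

    // Returns false instead of throwing when the allocation fails.
    bool initialize(std::size_t number_of_heaps) noexcept {
        assert(number_of_heaps > 0);
        if (table_ != nullptr) return true;  // already initialized
        table_ = new (std::nothrow) StringDedupTableSketch;
        if (table_ == nullptr) return false;
        table_->bucket_count = number_of_heaps * 1024;  // arbitrary sizing for the sketch
        return true;
    }

    bool ready() const noexcept { return table_ != nullptr; }

private:
    StringDedupTableSketch* table_ = nullptr;
};
```

The caller can then treat a false result as "deduplication disabled for this process" rather than as a fatal error.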

swgillespie · 2017-11-21T03:16:45Z

src/gc/stringdeduptable.cpp

+    }
+    if (*data)
+    {
+        return (*data)->Write(item);


what happens if this string is interned? does this still work?

swgillespie · 2017-11-21T03:18:16Z

src/gc/stringobjhashtable.cpp

+
+uint32_t GCHashTableBase::GetHash(GCStringData* key)
+{
+    uint32_t hash = 5381;


I'd imagine we'd want to use a cryptographic hash if we want to avoid potential denial-of-service problems here.

The string intern pool routines use this algorithm.
https://github.com/dotnet/coreclr/blob/master/src/vm/eehash.cpp#L289
https://github.com/dotnet/coreclr/blob/master/src/inc/utilcode.h#L3020
I simply copied it.

My intuition is that this hash function is OK for string interning. Strings that get interned are only attacker-controlled when passed verbatim to string.Intern (assuming our threat model doesn't consider string literals in metadata to be attacker controlled). In this case, though, any string on the heap (and therefore passing through this code path) is potentially attacker controlled and large (<85kb since LOH strings aren't eligible for this). I'd be worried about attackers repeatedly colliding strings and forcing the slow string comparison path on potentially large numbers of large strings.

@swgillespie could you please explain this:

<85kb since LOH strings aren't eligible for this

I agree with your concerns, but keep in mind that this code is meant to run concurrently with mutators, i.e. without affecting the STW pause. Also, a stronger hash function may hurt performance more than collision-induced long chains of entries. And finally, such malicious strings probably have to survive several gen0 collections and the following gen1 collections.
I'm not against your suggestion; I just want all trade-offs to be considered.
And last, we can try different structures, like trie-based ones, and trade hashing issues for cache thrashing issues 🤣. Of course this is too exotic, just for the record.

collision-induced long chains of entries

You can give up adding strings to the hashtable when you get a collision-induced long chain.

@Rattenkrieg Correct me if I'm wrong, but my understanding of your approach is that strings only get deduped when they are promoted into gen 2, correct? In that case, LOH strings won't ever get promoted into gen 2 (since they are allocated in Gen2/LOH) and won't trigger this logic. Is that correct?

@swgillespie that's right! I completely missed how large strings instantly get promoted. Though I don't think there are going to be duplicates among large strings, towards the end of this PR we may consider knobs to probe the large heap too.

Creating collisions with this algorithm and a hash table with power-of-two buckets is trivial -- you just need to have the last few characters of the string be the same. We should assume an attacker does have the ability to create a bunch of strings and have them last for a little while. That's not particularly far-fetched when they can submit things like a json document with lots of strings in it to a server.

We've taken two approaches to this that would be acceptable:

1. Always use a randomly seeded strong hash. In the VM, we use https://github.com/dotnet/coreclr/blob/32f0f9721afb584b4a14d69135bea7ddc129f755/src/vm/marvin32.cpp . Marvin is about half the speed of algorithms like HashBytes (though it may make up some of that with a better distribution of results, especially in the power-of-two hash table). I'd recommend testing with this to see if it's good enough to always use (it's how we implement String.GetHashCode).

2. Detect attacks and change strategies. We do that in Dictionary by counting collisions and switching to Marvin if we hit a threshold. You might also be able to detect attacks and just skip putting the colliding strings in the table, but that would have a couple of issues:
   a. Strings would still have to be hashed/checked to see that they are more collisions. That might still add some load, especially if they're long.
   b. If a service depends on the improved performance of deduplication, an attack that disables it might still act as a denial-of-service.
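
The "give up on long chains" variant discussed above can be sketched as a table that refuses to grow a bucket past a fixed chain length. This is only an illustration under stated assumptions: BoundedChainTable, InsertResult and simple_hash are hypothetical names, simple_hash is assumed to resemble the 5381-seeded hash the PR copied, and the sketch uses throwing standard containers that real GC code could not use.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Assumed shape of the simple hash (initial value 5381); not verified against the PR.
std::uint32_t simple_hash(const std::u16string& s) {
    std::uint32_t hash = 5381;
    for (char16_t c : s) hash = ((hash << 5) + hash) ^ c;
    return hash;
}

enum class InsertResult { Inserted, AlreadyPresent, ChainTooLong };

class BoundedChainTable {
public:
    BoundedChainTable(std::size_t bucket_count, std::size_t max_chain)
        : buckets_(bucket_count), max_chain_(max_chain) {
        // Power-of-two bucket count, as in the table under discussion.
        assert(bucket_count != 0 && (bucket_count & (bucket_count - 1)) == 0);
    }

    InsertResult insert(const std::u16string& s) {
        auto& chain = buckets_[simple_hash(s) & (buckets_.size() - 1)];
        for (const auto& existing : chain) {
            if (existing == s) return InsertResult::AlreadyPresent;  // full comparison
        }
        // Give up instead of letting an attacker grow this chain without bound.
        if (chain.size() >= max_chain_) return InsertResult::ChainTooLong;
        chain.push_back(s);
        return InsertResult::Inserted;
    }

private:
    std::vector<std::vector<std::u16string>> buckets_;
    std::size_t max_chain_;
};
```

As point (a) notes, a rejected string has still been hashed and compared against up to max_chain entries, so the cap bounds the cost per insert rather than removing it.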

swgillespie · 2017-11-21T03:21:06Z

src/gc/stringobjhashtable.cpp

+    if (entryKey->GetCharCount() != key->GetCharCount())
+        return false;
+
+    return !memcmp(entryKey->GetStringBuffer(), key->GetStringBuffer(), entryKey->GetCharCount() * sizeof(wchar_t));


sizeof(wchar_t) is platform dependent, so this looks scary to me.

swgillespie · 2017-11-21T03:25:52Z

src/gc/env/gcenv.object.h

+{
+private:
+    uint32_t m_StringLength;
+    wchar_t m_Characters[1];


wchar_t is 32 bits on Linux, which wouldn't be correct for this definition.
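
One way to make the width explicit is to use char16_t, which is 16 bits on mainstream platforms, instead of wchar_t. A minimal sketch, with GCStringDataSketch and same_contents as hypothetical stand-ins for the PR's GCStringData and its comparison routine:

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Document the assumption the comparison relies on: 2-byte UTF-16 code units.
static_assert(sizeof(char16_t) == 2, "UTF-16 code units are expected to be 2 bytes");

// Hypothetical stand-in for the PR's GCStringData.
struct GCStringDataSketch {
    std::uint32_t   length;  // number of UTF-16 code units
    const char16_t* chars;   // not necessarily null-terminated
};

// Byte-wise equality with a size that no longer depends on the platform's wchar_t.
bool same_contents(const GCStringDataSketch& a, const GCStringDataSketch& b) {
    assert(a.chars != nullptr && b.chars != nullptr);
    if (a.length != b.length) return false;
    return std::memcmp(a.chars, b.chars, a.length * sizeof(char16_t)) == 0;
}
```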

swgillespie · 2017-11-21T03:32:04Z

Thanks for the PR, this is exciting! 😄 There's a lot to unpack in this one so I'll make another few review passes tomorrow.

cc @jkotas @sergiy-k @Maoni0

Rattenkrieg · 2017-11-26T20:05:48Z

On interference with intern pool:

Right now the algorithm doesn't give special treatment to interned strings, i.e. it can relocate a pointer to an interned string to a regular string or vice versa. That may break reference-counting invariants. These are the places where I've found dependencies on a correct RC:
StringLiteralMap destructor, which I guess is called only on appdomain unload;
Contended path during insertion of a new literal into the pool - happens conditionally and immediately after entry creation.
So while it's tempting to ignore the invariants and deduplicate strings chaotically with respect to the intern pool, let's explore a path to harmony with it:
Allocations of the entries holding the reference counter and the pointer to the string happen on the large heap, hence we can apply a dedicated policy to string pointers that come from the large heap during tracing - somewhere inside gc_heap::mark_through_cards_for_large_objects. However, it's undesirable to blindly ignore such pointers, since we could, for example, miss a large string array that way.

On interference with strings under lock:

At first I completely ignored this issue, since locking on marshal-by-bleed instances is conventionally forbidden. But then I found this: "FxCop: System.Uri locks on a string." dotnet/corefx#13107 So since even the BCL contains evil code like that, I came to the conclusion that we should examine the object header and potentially the sync-block entries, and reschedule locked strings (i.e. StringDedup::EnqPromoted them) to be deduped in the next cycle.

Rattenkrieg · 2017-11-27T22:51:28Z

src/gc/gc.cpp

@@ -15357,6 +15377,11 @@ void gc_heap::gc1()
    assert (g_gc_card_bundle_table == card_bundle_table);
 #endif    

+    if (!g_gc_pStringClass)
+    {
+        g_gc_pStringClass = GCToEEInterface::GetStringMethodTable();


This is the logically closest place before the actual usage of g_gc_pStringClass.
But we could put it higher on the call stack, i.e. closer to the beginning of the collection.

Rattenkrieg · 2017-11-27T23:04:49Z

I have tried @swgillespie 's idea

quick thought: the lowest two bits of the method table pointer are usable and we only use one of them. We could in theory use the other bit for this.

... and it worked! つ ◕_◕ ༽つ
つ ◕_◕ ༽つ That feel when you managed to take the last unused bit つ ◕_◕ ༽つ

Fun fact: there are a number of duplicated strings in the CLR, like "0", "1", "2", "%", etc.

Rattenkrieg · 2017-11-28T22:57:14Z

src/gc/stringobjhashtable.h

+#else
+
+#ifndef _INC_WINDOWS
+typedef wchar_t TCHAR;


That is copied from gcenv.structs.h, which is included in gcenv.h; it's OK to use gcenv.h in .cpp files but not in headers.

typedef char TCHAR for PLATFORM_UNIX is super confusing. Isn't sizeof(char) == 1 everywhere according to the standard?
Personally I'd like to use char16_t, but it raises type errors when interoperating with StringObject methods, which are WCHAR-typed.

TCHAR is a Windowsism: https://msdn.microsoft.com/en-us/library/office/cc842072.aspx

Rattenkrieg · 2017-11-29T21:10:51Z

src/gc/env/gcenv.structs.h

 #else

 #ifndef _INC_WINDOWS
-typedef wchar_t TCHAR;


These changes were introduced in 0462be1
None of the usages of TCHAR have survived to this day, so I'm renaming TCHAR to WCHAR and replacing char with char16_t for PLATFORM_UNIX.

Maoni0 · 2017-11-30T07:01:40Z

sorry to jump in this late - I was mostly out for the past few weeks.

this really shouldn't be done as a GC feature - there's no reason to - you can do this outside GC so you don't affect GC STW pauses and don't affect critical code paths in GC penalizing everything else.

the algorithm is this -

go through the heap for old generations (gen2/3) linearly as we only care about those objects; objects in gen0/1 might die quickly and even if they point to strings in the old generations we don't care about deduping those.

if we see an object contains a reference to a string object -

if the string obj is in old gen, check to see if its hash already exists in the table
- if not, we insert the hash of the string and its address into the table;
- otherwise we replace this ref with the address of the string already inserted (obviously check for collision). of course another thread might be changing this ref too so you need to do an InterlockedExchange (you could do a loop with InterlockedExchange but no loop is fine too...since this feature is just opportunistic).
otherwise skip (we don't dedup to young strings)

if an object doesn't contain any references to string objects, simply skip (this includes skipping string objects themselves).
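
A minimal sketch of the scan described above, assuming a toy heap where each object holds a single string reference. HeapString, HeapObject and dedup_pass are hypothetical names; a compare-and-swap stands in for the InterlockedExchange mentioned above, and std::unordered_map keyed by the string contents stands in for the hash table plus collision check.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <string>
#include <unordered_map>

// Toy heap model (assumption): a string knows whether it lives in an old generation.
struct HeapString {
    std::u16string value;
    bool in_old_gen;
};

// Toy object with one string reference that mutators may update concurrently.
struct HeapObject {
    std::atomic<HeapString*> ref{nullptr};
};

// One linear pass over old-generation objects; returns how many refs were redirected.
std::size_t dedup_pass(HeapObject* objects, std::size_t count) {
    assert(objects != nullptr || count == 0);
    std::unordered_map<std::u16string, HeapString*> canonical;
    std::size_t redirected = 0;
    for (std::size_t i = 0; i < count; ++i) {
        HeapString* seen = objects[i].ref.load();
        if (seen == nullptr || !seen->in_old_gen) continue;  // don't dedup to young strings
        auto result = canonical.emplace(seen->value, seen);  // insert if first of its kind
        HeapString* first = result.first->second;
        if (result.second || first == seen) continue;        // nothing to redirect
        // Opportunistic update: if a mutator changed the ref meanwhile, just move on.
        if (objects[i].ref.compare_exchange_strong(seen, first)) ++redirected;
    }
    return redirected;
}
```

The single attempt without a retry loop matches the "opportunistic" framing: a lost race only means one reference stays un-deduplicated until a later pass.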

note that going through the gen3 heap (LOH) requires cooperation with the LOH allocator. this is the same way BGC does it (with bgc_mark_set/bgc_mark_done). you can just leave this part out for now as you mentioned - we know that objects on LOH usually aren't nearly as reference rich as the ones on SOH.

most of this could actually be done in managed code. we do need to expose some FCalls to facilitate this - we need to expose the relevant refs to managed code. the benefit of this is you don't need to write the kind of "manually managed code" that you see in the runtime (eg, you don't need to manually relocate the addresses in a gen2 compacting GC if you just store a ref to the string object; you don't need to probe for GC to allow it to happen).

I am fine if you want to implement this in native code first and convert most of it to managed code later; or if you want to try the managed implementation first. again, sorry to comment so late...

jkotas · 2017-11-30T16:47:46Z

of course another thread might be changing this ref too so you need to do an InterlockedExchange

This would work only if all updates of the ref in the app were done using interlocked operations as well. It is hard to do, and even if it was doable - the performance impact would be prohibitive. Updating of the string references to deduped copy has to be done during STW. I do not see any other way.

Rattenkrieg · 2017-12-01T21:38:40Z

@Maoni0

sorry to jump in this late

I'm fine; even if this gets rejected completely, digging through the GC internals was fun for me and I learned a couple of things from the sources.

you can do this outside GC so you don't affect GC STW pauses and don't affect critical code paths in GC penalizing everything else.

From the very beginning I thought this was planned to be opt-in via configuration knobs. Though it's obvious that additional work during gen1 and gen2 would hurt the highest percentiles, there are several types of throughput-oriented applications (compilers, test runners, everything CI-related, etc.) which should benefit from such a feature. The number of collections may also be reduced due to the smaller number of live objects.
So I'd really like to test this with, for example, Roslyn compiling itself or running its test suite.

As a variation of your algorithm, we can try to perform reference relocation during STW while keeping the scanning phase running concurrently with mutators.

Maoni0 · 2017-12-04T08:41:25Z

of course another thread might be changing this ref too so you need to do an InterlockedExchange

This would work only if all updates of the ref in the app were done using interlocked operations as well.

why do you think that? on Intel "lock cmpxchg" will prevent other processors from accessing the memory location that's being locked; and loading/storing a ptr sized word is an atomic operation itself; this means if there's an update from another processor it will either be completely observed by lock cmpxchg before or after. on arm ldrex/strex or ldxr/stxr also provide exclusive access which is reset by normal stores so it achieves the same purpose.

Rattenkrieg · 2017-12-04T09:18:02Z

@Maoni0 given that mutators can pin or lock on a string being processed by the dedup thread, does that imply that we need an atomic conditional update of two adjacent pointer-sized words?

jkotas · 2017-12-04T21:48:05Z

@Maoni0 You are right that doing all updates using interlocked operations won't make difference. We would actually need read barriers to address the concern I was worried about.

If string reference updates were done outside STW, the programs and libraries out there would need to be prepared to handle the possibility of one string object becoming two string objects. It is even more breaking in very subtle ways than simple dedupping where two string objects become one string object. It is an interesting option to experiment with, but I would think the less breaking option is the more useful one.

In other words, is the following test guaranteed to pass with string deduping turned on?

class Test
{
    string _field;

    Test()
    {
        _field = new string('a', 1);

        string a1 = _field;
        Thread.Sleep(1000);
        string a2 = _field;

        Debug.Assert(Object.ReferenceEquals(a1,a2));
    }
}

jkotas · 2017-12-04T21:56:11Z

Maybe we can structure this such that the string deduping logic is outside the GC, and the GC just provides methods to fold one object into another one atomically during Gen2 GC? Such callback can be then also used e.g. to fold all empty arrays together, etc. There seems to be a lot out there written about deduplicating garbage collection: https://www.bing.com/search?q=deduplicating+GC.

Rattenkrieg · 2017-12-04T22:28:03Z

@jkotas

Maybe we can structure this such that the string deduping logic is outside the GC, and the GC just provides methods to fold one object into another one atomically during Gen2 GC?

I like that idea; actually, I offered a similar technique in response to @Maoni0's first comment.

karelz · 2019-05-04T03:16:49Z

What's the plan here? 1+ year no update ... should we close the PR?

jkotas · 2019-05-04T03:46:43Z

I agree that the PR should be closed until somebody is interested in picking this up again. @Rattenkrieg Thank you anyway!

string dedup PoC

3736b9d

swgillespie requested review from swgillespie and Maoni0 November 21, 2017 03:03

lazy g_gc_pStringClass init; duplicate mark logic rework

5c6c22d

stephentoub mentioned this pull request Nov 28, 2017

Revert "Minor change to avoid an allocation in Uri" dotnet/corefx#25560

Merged

nothrow allocation; wchar_t -> TCHAR

ca644a3

TCHAR -> WCHAR

23de669

well, marvin32, simple copypaste is not enough

8764ab8

jkotas added the area-GC label Dec 28, 2017

jkotas closed this May 4, 2019

Rattenkrieg mentioned this pull request Jan 31, 2020

Rationale behind EEHashTableBase implementation dotnet/runtime#9329

Closed

jkotas mentioned this pull request Feb 8, 2020

String Deduplication Design Doc dotnet/runtime#31971

Merged


