Skip to content
This repository has been archived by the owner on Jan 23, 2023. It is now read-only.

[WIP] String dedup PoC #15135

Closed
wants to merge 5 commits into from
Closed

Conversation

Rattenkrieg
Copy link

@Rattenkrieg Rattenkrieg commented Nov 21, 2017

PoC for #14208

Currently able to deduplicate strings in non-concurrent workstation mode.

Algorithm in brief:

  • GC cycle N: Mark phase (only STW at the moment):
    • During mark phase of rootset and heap check whether we are interested in current object pointer:
      pointer is string && is evacuating from gen1 && at least has sizeof ptr characters;
    • Enqueue interesting pointer in StringDedupQueue;
    • If stats suggest to compact we need to adjust freshly enqueued pointers to their relocation addresses - look at is_string_and_about_to_be_promoted_to_gen2 call sites. That is double work, but I see no way to know in advance whether we are going to compact (with exception of rare cases when user ask for compaction by GCCollectionMode.Forced).
  • After GC: In parallel with mutators StringDedupThread populating StringDedupTable with strings from StringDedupQueue:
    • Currently StringDedupThread is running synchronously after STW gc to ease my debugging.
  • GC cycle N + x, where x is number of GC cycles before next gen2 compaction: Relocate phase (only STW at the moment):
    • adjust_string_dups_reloc_pointers is called right before relocating pointers - for each group of duplicates from StringDedupTable we seek for first live string in the heap and destructively adjusting other strings by marking their object headers with 0xFF... and writing pointer to original into its first sizeof ptr characters;
    • During reloc phase of rootset and heap call to try_relocate_duplicate_string is made to relocate pointers to duplicates to their original string.
    • Everything is opaque for compact phase - it just memmoving plugs, so duplicates become floating garbage until next gen2 cycle sweep or compact their memory.

Things to do:

  • Fix code according to style, fix types like WCHAR vs wchar_t, BOOL vs bool etc - I need help here;
  • Insert contracts into methods - I know about the document, but I have issues with managing headers and stuff;
  • Make checked runs passing on non-concurrent workstation;
  • Run StringDedupThread concurrently, handle related routines like cooperative join;
  • Hack concurrent and server gc.

src/gc/gc.h Outdated
@@ -105,6 +106,7 @@ extern "C" uint8_t* g_gc_highest_address;
extern "C" GCHeapType g_gc_heap_type;
extern "C" uint32_t g_max_generation;
extern "C" MethodTable* g_gc_pFreeObjectMethodTable;
extern "C" MethodTable** g_gc_pStringClass;
Copy link
Author

@Rattenkrieg Rattenkrieg Nov 21, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thats lame to have nested pointer, but gc is initialized before EE sets pointers to method tables for such types.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

perhaps we should set it lazily, then?

src/gc/gc.cpp Outdated
{
return false;
}
// it's sad we can't leverage mark bit to indicate that string is duplicate

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

quick thought: the lowest two bits of the method table pointer are usable and we only use one of them. We could in theory use the other bit for this.

size_t number_of_heaps
)
{
table = new StringDedupTable;

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Beware of allocation failures here. All of these will throw on failure and the GC is not allowed to throw. You should use the nothrow operator new and return a bool or something to indicate whether or not the initialization was successful.

It is a huge pain that we can't use std::unique_ptr here, and I'm sorry about that. I think that's something we definitely should address in the GC codebase as a whole going forward.

}
if (*data)
{
return (*data)->Write(item);

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what happens if this string is interned? does this still work?


uint32_t GCHashTableBase::GetHash(GCStringData* key)
{
uint32_t hash = 5381;

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd imagine we'd want to use a cryptographic hash if we want to avoid potential denial-of-service problems here.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link

@swgillespie swgillespie Nov 26, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My intuition feels like this hash function is OK for string interning. Strings that get interned are only attacker-controlled when passed verbatim to string.Intern (assuming our threat model doesn't consider string literals in metadata to be attacker controlled). In this case, though, any string on the heap (and therefore passing through this code path) is potentially attacker controlled and large (<85kb since LOH strings aren't eligible for this). I'd be worried of attackers repeatedly colliding strings and forcing the slow string comparison path on potentially large numbers of large strings.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@swgillespie could you please explain this:

<85kb since LOH strings aren't eligible for this

I'm agree with your concerns, but keep in mind that this code meant to happen concurrent with mutators, i.e. not affecting STW pause. Also stronger hash function may affect performance more than collision-induced long chains of entries. And finally such malicious strings have to survive probably several gen0 and following gen1 collections.
I'm not against your suggestion, just want all trade-offs to be considered.
And last, we can try different structures like trie-based ones and trade hashing issues for cache thrashing issues 🤣 . Of course this too exotic, just for the record.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

collision-induced long chains of entries

You can give up adding strings to the hashtable when you get collision-induced long chain.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Rattenkrieg Correct me if I'm wrong, but my understanding of your approach is that strings only get deduped when they are promoted into gen 2, correct? In that case, LOH strings won't ever get promoted into gen 2 (since they are allocated in Gen2/LOH) and won't trigger this logic. Is that correct?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@swgillespie that's right! I've completely missed the way how large strings instantly get promoted. Though I don't think there are going to be duplicates among large strings, towards the finish of this PR we may consider knobs to probe large heap too.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Creating collisions with this algorithm and a hash table with power-of-two buckets is trivial -- you just need to have the last few characters of the string be the same. We should assume an attacker does have the ability to create a bunch of strings and have them last for a little while. That's not particularly far-fetched when they can submit things like a json document with lots of strings in it to a server.

We've taken two approaches to this that would be acceptable:

  1. Always use a randomly seeded strong hash. In the VM, we use https://github.com/dotnet/coreclr/blob/32f0f9721afb584b4a14d69135bea7ddc129f755/src/vm/marvin32.cpp . Marvin is about half the speed of algorithms like HashBytes (though it may make up some of that with a better distribution of results, especially in the power-of-two hash table). I'd recommend testing with this to see if it's good enough to always use (it's how we implement String.GetHashCode).
  2. Detect attacks and change strategies. We do that in Dictionary by counting collisions and switching to Marvin if we hit a threshold. You might also be able to detect attacks and just skip putting the colliding strings in the table, but that would have a couple of issues:
    a. Strings would still have to be hashed/checked to see that they are more collisions. That might still add some load, especially if they're long.
    b. If a service depends on the improved performance of deduplication, an attack that disables it might still act as a denial-of-service.

if (entryKey->GetCharCount() != key->GetCharCount())
return false;

return !memcmp(entryKey->GetStringBuffer(), key->GetStringBuffer(), entryKey->GetCharCount() * sizeof(wchar_t));

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sizeof(wchar_t) is platform dependent, so this looks scary to me.

{
private:
uint32_t m_StringLength;
wchar_t m_Characters[1];

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

wchar_t is 32-bits on Linux, which wouldn't be correct for this definition.

@swgillespie
Copy link

Thanks for the PR, this is exciting! 😄 There's a lot to unpack in this one so I'll make another few review passes tomorrow.

cc @jkotas @sergiy-k @Maoni0

@Rattenkrieg
Copy link
Author

On interference with intern pool:

Right now the algorithm doesn't give special treatment to interned strings, i.e it can relocate pointer to interned string to regular string or vice-versa. That may broke reference counting invariants. These are the places where I've found dependencies on correct RC:
StringLiteralMap destructor, which I guess called only on appdomain unload;
Contended path during insertion of new literal into pool - happens conditionally and instantly after entry creation.
So while it's seductive to ignore invariants and deduplicate strings chaotically wrt intern pool let's explore path to harmony with intern pool:
Allocations of entries holding reference counter and pointer to string happens on the large heap, hence we can apply dedicated policy to string pointers came from large heap during tracing - somewhere inside gc_heap::mark_through_cards_for_large_objects. However it's undesirable to unadvisedly ignore such pointers since for example we could miss large string array that way.

On interference with strings under lock:

At first I completely ignored such issue since locking on marshal-by-bleed instances is conventially forbidden. But then I've found this: "FxCop: System.Uri locks on a string." dotnet/corefx#13107 So since even BCL contains evil code like that I came with conclusion to examine object header and potentially sync-block entries and reschedule i.e StringDedup::EnqPromoted locked strings to be deduped next cycle.

@@ -15357,6 +15377,11 @@ void gc_heap::gc1()
assert (g_gc_card_bundle_table == card_bundle_table);
#endif

if (!g_gc_pStringClass)
{
g_gc_pStringClass = GCToEEInterface::GetStringMethodTable();
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the logically closest place before actual usage of g_gc_pStringClass.
But we may put it higher on the callstack ie closer to collection begining.

@Rattenkrieg
Copy link
Author

I have tried @swgillespie 's idea

quick thought: the lowest two bits of the method table pointer are usable and we only use one of them. We could in theory use the other bit for this.

... and it worked! つ ◕_◕ ༽つ
つ ◕_◕ ༽つ That feel when you managed to take the last unused bit つ ◕_◕ ༽つ

Fun fact: there are some amount of duplicated strings in CLR like "0", "1", "2", "%" etc

#else

#ifndef _INC_WINDOWS
typedef wchar_t TCHAR;
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That copied from gcenv.structs.h, which is included in gcenv.h and it's ok to use gcenv.h in .cpp's but not in headers :rage1:

typedef char TCHAR for PLATFORM_UNIX is super confusing. Isn't sizeof(char) == 1 everywhere according to standard?
Personally I'd like to use char16_t, but it raising type errors when interoping with StringObject methods which are WCHAR typed.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

#else

#ifndef _INC_WINDOWS
typedef wchar_t TCHAR;
Copy link
Author

@Rattenkrieg Rattenkrieg Nov 29, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These changes were introduced in 0462be1
None of usages of TCHAR have survived to this day. So I'm renaming TCHAR to WCHAR and replacing char with char16_t for PLATFORM_UNIX.

@Maoni0
Copy link
Member

Maoni0 commented Nov 30, 2017

sorry to jump in this late - I was mostly out for the past few weeks.

this really shouldn't be done as a GC feature - there's no reason to - you can do this outside GC so you don't affect GC STW pauses and don't affect critical code paths in GC penalizing everything else.

the algorithm is this -

go through the heap for old generations (gen2/3) linearly as we only care about those objects; objects in gen0/1 might die quickly and even if they point to strings in the old generations we don't care about deduping those.

if we see an object contains a reference to a string object -

  • if the string obj is in old gen, check to see if its hash already exists in the table

    • if not, we insert the hash of the string and its address into the table;
    • otherwise we replace this ref with the address of the string already inserted (obviously check for collision). of course another thread might be changing this ref too so you need to do an InterlockedExchange (you could do a loop with InterlockedExchange but no loop is fine too...since this feature is just opportunistic).
  • otherwise skip (we don't dedup to young strings)

if an object doesn't contain any references to string objects, simply skip (this includes skipping string objects themselves).

note that going through the gen3 heap (LOH) requires cooperation with the LOH allocator. this is the same way BGC does it (with bgc_mark_set/bgc_mark_done). you can just leave this part out for now as you mentioned - we know that objects on LOH usually aren't nearly as reference rich as the ones on SOH.

most of this could actually be done in managed code. we do need to expose some FCalls to facilitate this - we need to expose the relevant refs to managed code. the benefit of this is you don't need to write the kind of "manually managed code" that you see in the runtime (eg, you don't need to manually relocate the addresses in a gen2 compacting GC if you just store a ref to the string object; you don't need to probe for GC to allow it to happen).

I am fine if you want to implement this in native code first and convert most of it to managed code later; or if you want to try the managed implementation first. again, sorry to comment so late...

@jkotas
Copy link
Member

jkotas commented Nov 30, 2017

of course another thread might be changing this ref too so you need to do an InterlockedExchange

This would work only if all updates of the ref in the app were done using interlocked operations as well. It is hard to do, and even if it was doable - the performance impact would be prohibitive. Updating of the string references to deduped copy has to be done during STW. I do not see any other way.

@Rattenkrieg
Copy link
Author

Rattenkrieg commented Dec 1, 2017

@Maoni0

sorry to jump in this late

I'm fine, if even this will be rejected completely, digging through gc internals was fun for me and I learned couple of things from sources.

you can do this outside GC so you don't affect GC STW pauses and don't affect critical code paths in GC penalizing everything else.

From the very begining I thought this planned to be opt-in via configuration knobs. Though it's obvious that additional work during gen1 and gen2 would impact on highest percentiles, there are several types of throughput oriented applications, (compilers, test runners, everything CI related etc) which should benefit from such feature. Also amount of collections may be reduced as well due to smaller number of live objects.
So I'd really like to test this with, for example Roslyn compiling itself or running test suite for it.

As variation of your algorithm we can try to perform references relocation during STW while keep scanning phase running concurrently with mutators.

@Maoni0
Copy link
Member

Maoni0 commented Dec 4, 2017

of course another thread might be changing this ref too so you need to do an InterlockedExchange

This would work only if all updates of the ref in the app were done using interlocked operations as well.

why do you think that? on Intel "lock cmpxchg" will prevent other processors from accessing the memory location that's being locked; and loading/storing a ptr sized word is an atomic operation itself; this means if there's an update from another processor it will either be completely observed by lock cmpxchg before or after. on arm ldrex/strex or ldxr/stxr also provide exclusive access which is reset by normal stores so it achieves the same purpose.

@Rattenkrieg
Copy link
Author

@Maoni0 given that mutators can pin or lock on string being processed by dedup thread, does that imply that we need atomic conditional update of two adjacent pointer sized words?

@jkotas
Copy link
Member

jkotas commented Dec 4, 2017

@Maoni0 You are right that doing all updates using interlocked operations won't make difference. We would actually need read barriers to address the concern I was worried about.

If string reference updates were done outside STW, the programs and libraries out there would need to be prepared to handle the possibility of one string object becoming two string objects. It is even more breaking in very subtle ways than simple dedupping where two string objects become one string object. It is an interesting option to experiment with, but I would think the less breaking option is the more useful one.

In other words, is the following test guaranteed to pass with string deduping turned on?

class Test
{
    string _field;

    Test()
    {
        _field = new string('a', 1);

        string a1 = _field;
        Thread.Sleep(1000);
        string a2 = _field;

        Debug.Assert(Object.ReferenceEquals(a1,a2));
    }
}

@jkotas
Copy link
Member

jkotas commented Dec 4, 2017

Maybe we can structure this such that the string deduping logic is outside the GC, and the GC just provides methods to fold one object into another one atomically during Gen2 GC? Such callback can be then also used e.g. to fold all empty arrays together, etc. There seems to be a lot out there written about deduplicating garbage collection: https://www.bing.com/search?q=deduplicating+GC.

@Rattenkrieg
Copy link
Author

@jkotas

Maybe we can structure this such that the string deduping logic is outside the GC, and the GC just provides methods to fold one object into another one atomically during Gen2 GC?

I like that idea, actually I offered similar technique in response to first @Maoni0 comment.

@jkotas jkotas added the area-GC label Dec 28, 2017
@karelz
Copy link
Member

karelz commented May 4, 2019

What's the plan here? 1+ year no update ... should we close the PR?

@jkotas
Copy link
Member

jkotas commented May 4, 2019

I agree that the PR should be closed until somebody is interested in picking this up again. @Rattenkrieg Thank you anyway!

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants