Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

collection.clear() fails to decref objects used as keys, or delete the reverse-mapping vatstore entries #8756

Closed
warner opened this issue Jan 16, 2024 · 5 comments
Assignees
Labels
bug Something isn't working liveslots requires vat-upgrade to deploy changes SwingSet package: SwingSet

Comments

@warner
Copy link
Member

warner commented Jan 16, 2024

While working on rate-limited GC, I discovered that calling .clear() method on a ScalarBigMapStore or ScalarBigSetStore would fail to delete any vref-bearing objects used as keys of the collection.

Background: How Virtual Collections work

Our .clear() method is fancy: it accepts a "key pattern" and "value pattern" to limit the scope of what gets cleared. When these are used, clear() must iterate through each entry, compare the key/value against the pattern, and only delete the ones that match.

When we omit those patterns, or when we're deleting the collection entirely (because nobody is referencing it anymore), we use a specialized function named clearInternalFull(), which avoids deserializing the keys/values first. This is faster: we need to know what the keys/values point to (so we can decrement their refcounts), but we don't need to construct any new Representatives.

Collections are encoded as one vatstore record per collection entry. The vatstore key is an encoded form of the collection key, and the vatstore value is the JSON-serialized marshalling of the collection value. We use the encodePassable from @endo/marshal to encode the collection key into a string, which can then be further encoded into a vatstore key.

The collection key might be an object, with a pass-by-reference identity. These objects all have vrefs: values like o-12 for the Presence you get when importing an object from the kernel, o+13 for the ephemeral Remotable you create with Far() and might export into the kernel, or o+d14/2 for a Representative of one instance of a virtual/durable Kind that might also be exported into the kernel.

We need virtual collections to enumerate in a useful and stable order, and we use the vatstore ordering directly as the collection's ordering. We rely upon encodePassable to provide encoded keys that maintain this order. For example, all BigInts are encoded into a form that has a length prefix on the MSB side, so that 19 sorts later than 2, and it uses a n or p prefix for negative and positive numbers to ensure 15 sorts later than -15.

When encodePassable is given a pass-by-reference object (with a vref), we cannot use the vref as the sort order. Vrefs are local to the vat, so different vats would observe different orderings. In addition, we keep vrefs private to prevent an information leak (you could learn how many objects have been allocated, which would enable both eavesdropping and a covert communication channel).

Instead, we assign each object-used-as-key an "ordinal" integer. This gives us insertion order, just like JavaScript's native Map and Set collections. For as long as the object is a key of that collection, it will retain the same ordinal. We use a second vatstore record to map from the vref to its ordinal (the key is prefixed with |, which is not used by encodePassable). And the main entry's key is a combination (safe concatenation) of the ordinal plus the vref, so we don't need a third entry to remember the ordinal->vref mapping.

Suppose collection is collection number 5 (whose data all uses a vatstore prefix of vc.5.), and userspace defines an add method like:

Far('iface', {
  add: presence => collection.init(presence, 'data'),
})

If add(p) is invoked with a Presence whose vref is o-12:

  • we'll assign ordinal 1 to o-12 (if it is the first one added)
  • encodePassable tells us to use an encodedKey of r0000000001:o+v10/1
    • that combines the r prefix (marking this key type as an object, rather than a string or Number or BigInt, etc), the fixed-length ordinal (providing insertion order), and the original vref (for recoverability during .keys() or .entries())
  • we build a dbKey of vc.5.r0000000001:o+v10/1
  • we marshal 'data' into something like { body: '"data"', slots: [] }
  • we do vatstoreSet('vc.5.r0000000001:o+v10/1', data)
  • we also do vatstoreSet('vc.5.|o+v10/1', '1'), for the reverse record

How clearInternalFull Works

Our clearInternalFull uses a loop like this:

      for (const dbKey of enumerateKeysStartEnd(syscall, start, end)) {
        const value = JSON.parse(syscall.vatstoreGet(dbKey));
        doMoreGC =
          value.slots.map(vrm.removeReachableVref).some(b => b) || doMoreGC;
        syscall.vatstoreDelete(dbKey);
        if (isEncodedRemotable(dbKey)) {   ///// <--- BUG HERE
          const keyVref = vrefFromDBKey(dbKey); // <- 2nd bug
          if (hasWeakKeys) {
            vrm.removeRecognizableVref(keyVref, `${collectionID}`, true);
          } else {
            doMoreGC = vrm.removeReachableVref(keyVref) || doMoreGC;
          }
          syscall.vatstoreDelete(prefix(`|${keyVref}`));
        }
      }

We iterate through the DB with enumerateKeysStartEnd to find all the forward-mapping keys (like vc.5.r0000000001:o+v10/1). For each one, we fetch its value, and run all the value's slots through vrm.removeReachableVref(), to decrement refcounts on anything pointed to by the value. Then we delete the forward mapping with vatstoreDelete.

Then we use isEncodedRemotable(dbKey) to decide whether the key was a pass-by-reference value. If so, we use vrefFromDBKey to convert vc.5.r0000000001:o+v10/1 into o+v10/1, and decrement either its reference count or its recognizability count (depending upon whether the collection was strong or weak). Finally, we delete the reverse mapping with another vatstoreDelete.

The Bug

The problem is that isEncodedRemotable(dbKey) is being passed vc.5.r0000000001:o+v10/1, but is expecting r0000000001. This function is coming from the encode-passble library of @endo/marshal, and doesn't know anything about our database keys. It will always answer false, because liveslots is always using dbKeys that start with the virtual-collection prefix (vc. something), and the only encode-passable prefix that is used for "encoded remotables" is r.

As a result, clearInternalFull() will never perform the key-deletion code for passables used as keys, with two consequences:

  • the key object will not have its refcount reduced, so it will stay around forever
  • the reverse-mapping vatstore entry will not be removed, leaving DB space used forever

The vatstore entry will be removed if/when the collection is deleted (not merely cleared), because there is extra code there to delete all the vc.5.|-prefixed metadata keys, and that happens to include the reverse-mapping keys. But the only other way to delete that DB space (or to drop the refcounts of the key objects) is to delete the entire vat.

The if clause being skipped had a second bug: it calls vrefFromDBKey(dbKey) to get the vref to use, but that function is:

    const vrefFromDBKey = dbKey => dbKey.substring(BIGINT_TAG_LEN + 2);

The dbKey actually starts with the collection prefix, like vc.5., but this function is only given the job of stripping off the r0000000001: encode-passable tag. So it will return corrupted data.

The only other caller of vrefFromDBKey is in the very next function:

    const decodeRemotable = encodedKey =>
      convertSlotToVal(vrefFromDBKey(encodedKey));

where it does get a prefix-stripped value. The core problem is naming: we need separate functions, with correct names and argument types:

    const encodedKeyFromDBKey = dbKey => dbKey.substring(dbKeyPrefix.length);
    const vrefFromEncodedKey = encodedKey => encodedKey.substring(BIGINT_TAG_LEN + 2);
    const vrefFromDBKey = dbKey => vrefFromEncodedKey(encodedKeyFromDBKey(dbKey));

Incidentally, we already have an encodedKeyFromDBKey, it is named dbKeyToKey:

    function dbKeyToKey(dbKey) {
      const dbEntryKey = dbKey.substring(dbKeyPrefix.length);
      return decodeKey(dbEntryKey);
    }

The Fix

  • add vrefFromEncodedKey and fix vrefFromDBKey
  • change isEncodedRemotable(dbKey) to be isEncodedRemotable(dbKeyToKey(dbKey))

To deploy the fix into any given vat, we must perform a liveslots-bumping vat upgrade.

Remediation

Fortunately, our production code does not delete or clear entire collections very frequently.

Unfortunately, it is not easy to search the code for places that might delete a collection. We can grep for clear(), or calls to clear(keyPatt, valuePatt) (and filter on cases where both patterns are empty). But that won't reveal places where a collection is dropped and then gets deleted when liveslots notices all its pillars have fallen.

We can scan the slogfiles for places where two encoded-passable keys are deleted in quick succession: eyeballing the vatstore calls should make it pretty obvious if we're looking at a clearInternalFull().

I think it's entirely possible that we haven't run into this problem in production yet, which would be lovely.

But if we do have lost keys, we'll need to come up with a remediation plan. Maybe we should change deleteCollection to look for the reverse-mapping keys as it does it's final metadata sweep, and instead of merely deleting the key, it should also decrement it's reachability/recognizability count. If the entry was deleted properly (by the non-full form of clearInternal), the refcount was changed, and the reverse-mapping key was deleted. So the only case where the key is still present should also mean the refcount was not changed.

Then the remediation process is to somehow get userspace to delete the affected collections.

We might also change clear() to scan for leftover reverse-mapping keys, after it's processed and deleted everything else, and convince userspace to merely .clear() the affected collections.

These would only work if the collection were still reachable by userspace. A coding pattern that re-uses the same collection but periodically clears it would probably fix itself the next time it gets cleared. But if the collection is deleted by GC, then we've lost the data that shows where refcounts came from, and remediation will have to wait for liveslots to acquire a full mark-and-sweep garbage collector (which would discover the discrepancy between the refcount and the set of references, and could correctly mark the object for deletion).

And of course, we must upgrade the vats in question to a new version of liveslots, before any of these remediations could be applied, just to get the new code into place.

Deeper Fixes

Code-coverage would probably have spotted this: I don't think there's any way we could reach that if clause, and we certainly have tests that use objects as keys. So improving our coverage infrastructure would be a big help.

I'm writing more tests that examine the syscalls triggered by deletion, which is how I discovered this one. Maybe more tests of that flavor.

Stronger types might have also helped: the core confusions are that isEncodedRemotable takes an encodedKey but is given a dbKey, and that vrefFromDBKey is written to expect an encodedKey but its name and signature claim that it takes a dbKey. However, since these are all strings, I don't think the "structural types" that TypeScript offers would be sufficient to catch the bug; it would need "nominal types" to distinguish between the dbKey kind of string, the encodedKey kind of string, and the vref kind of string.

@warner warner added bug Something isn't working SwingSet package: SwingSet liveslots requires vat-upgrade to deploy changes labels Jan 16, 2024
@warner warner self-assigned this Jan 16, 2024
@warner
Copy link
Member Author

warner commented Jan 16, 2024

We could check the need for remediation by building an external refcount checker: a tool that walks the DB and looks at each vat's vatstore, finds references from virtual data, adds them up, and compares against the rc count in the DB. We can do the same for ir recognizer records and the keys of weak virtual collections. This bug would manifest as a difference between the two numbers.

@mhofman
Copy link
Member

mhofman commented Jan 17, 2024

This is faster: we need to know what the keys/values point to (so we can decrement their refcounts), but we don't need to construct any new Representatives.

Why does the version accepting a pattern need to construct representatives? Is it because the pattern may require of a specific instance instead of a generic "remotable"? I'm wondering if we could optimize the clear with pattern to avoid creating representative for each keys as that would cause GC pressure (representatives are only collected at the next BOYD)

  • the fixed-length ordinal (providing insertion order)

This effectively puts a cap on the number of object references in a collection ever (not at a given point in time, but over time), right? The value might be large enough, but there are ways to encode that keep sorting, and increase the size as needed. cc @gibson042 who's a specialist in these encoding tricks.

If add(p) is invoked with a Presence whose vref is o-12:

This seems to lack the reference counting syscalls made when inserting the object in the collection?

Deeper Fixes

One thing that concerns me with this whole scheme is that it leaves no space for composite keys. This will effectively allow the same vref in multiple collection entries. Either we re-use the remotable reference create for it (unlikely) and need to count how many times that remotable reference is used, or we need to accept that a single vref may have multiple object references in the collection. I suppose the latter could be done by transforming the reverse mapping into a list of ordinals.

warner added a commit that referenced this issue Aug 26, 2024
Fix collection deletion. Previously, objects (passables/vrefs) used as
collection keys were not de-referenced when the collection was
dropped/deleted or unconditionally cleared (`collection.clear()`
without keyPatt or valuePatt). Strong collections would leak a
reachable-refcount to the key object, and weak collections would leak
a recognizable-refcount, leading to the object being kept alive (or
kept unretired) forever. In addition, each entry would leak a vatstore
key until the collection was finally deleted.

The code in `clearInternalFull()` confused dbKeys, encodedKeys, and
vrefs, and gave the wrong kind of key to `isEncodedRemotable()`, so
the answer was always "no".

fixes #8756
warner added a commit that referenced this issue Aug 26, 2024
Fix collection deletion. Previously, objects (passables/vrefs) used as
collection keys were not de-referenced when the collection was
dropped/deleted or unconditionally cleared (`collection.clear()`
without keyPatt or valuePatt). Strong collections would leak a
reachable-refcount to the key object, and weak collections would leak
a recognizable-refcount, leading to the object being kept alive (or
kept unretired) forever. In addition, each entry would leak a vatstore
key until the collection was finally deleted.

The code in `clearInternalFull()` confused dbKeys, encodedKeys, and
vrefs, and gave the wrong kind of key to `isEncodedRemotable()`, so
the answer was always "no".

fixes #8756
@warner
Copy link
Member Author

warner commented Aug 26, 2024

New notes from my extensive-but-redundant writeup in #9959 :

The deeper improvements are:

  • define the types ("passable key" vs "vatstore key") at the top of the file, with explanations
  • include the name of the type in the variables we used to hold those values
  • improve the unit tests to exercise .clear() and look critically at the vatstore to ensure the data is removed as we expect, plus the refcount changes. I suspect we have tests that call .clear() and assert that the normal collection APIs return sensible values (.length = 0 and .keys() returning nothing), but nothing to look at the backing store or the refcount of vref-based keys.

Measurement

The walkthrough above shows that isEncodedRemotable failures mean we won't decrement the key's vref (or remove it as a recognizer), nor will we delete the matching ordinal assignment. This means we can look for r0000000001:${vref} keys without a matching |${vref} key to count how many of these have been mishandled.

We could/should also build a tool to scan the whole vatstore and extract the durable reference graph, then compare it against the refcounts to see if they match. Entries which have been mishandled this way will have had their r0000000001:${vref} key removed, but their refcount will not have been decremented. This would appear as a surplus of refcounts (e.g. refCount=1 but no actual reference edges found in the graph). (#8759)

Remediation

Messy but it could be worse. We could take advantage of the leftover ordinal records to get a list of the problematic vrefs, decref them, and delete the ordinal record. We'd do this with a vatstore version bump, so we know if/when it has been done, to avoid doing the remediation more than once.

warner added a commit that referenced this issue Aug 27, 2024
Fix collection deletion. Previously, objects (passables/vrefs) used as
collection keys were not de-referenced when the collection was
dropped/deleted or unconditionally cleared (`collection.clear()`
without keyPatt or valuePatt). Strong collections would leak a
reachable-refcount to the key object, and weak collections would leak
a recognizable-refcount, leading to the object being kept alive (or
kept unretired) forever. In addition, each entry would leak a vatstore
key until the collection was finally deleted.

The code in `clearInternalFull()` confused dbKeys, encodedKeys, and
vrefs, and gave the wrong kind of key to `isEncodedRemotable()`, so
the answer was always "no".

fixes #8756

move test, new naming convention

more test, needs refactor

refactor test

finish weak-collection test

fix: retire objects from weakstore recognizers

add test of voAwareWeakSet deletion and retirement

simplify add/removeRecognizableVref

These functions have two purposes:

* track "recognition records" for keys of weak collections
* tell BOYD about Presence vrefs (addToPossiblyRetiredSet) when a
  record is removed

Virtual/durable weak stores keep their recognition records in vatstore
`vom.ir.${vref}|${collectionID}` keys. voAwareWeakMap/Set keeps them
in RAM, in a Map named vrefRecognizers, where the key is the vref
being recognized, and the value is a Set of virtualObjectMaps (one per
recognizing WeakMap/Set).

Adding a record means adding a `vom.ir.` key, or adding to the
Set (which might require adding a Map entry). Removing a record means
removing a `vom.ir.` key, or removing from the Set (and maybe the Map
entry).

We need to add the vref to possiblyRetiredSet when both 1: it is a
Presence-style vref (o-NN), and 2: it might be the last
recognizer. For `vom.ir.` keys, we do this for every Presence-style
vref, whether or not we're the last one, because searching for other
records would be expensive. For voAwareWeakMap/Set, we have to delete
the Map entry if the Set became empty anyways, so we only
addToPossiblyRetiredSet if it was the last one.

The previous version only updated records for Presence-style and
Representative-style (virtual/durable object) vrefs, and avoided
updating records for Remotable-style vrefs. This changes the logic to
update records for all vrefs, so that retiring a key can remove the
weakstore entry (and its associated data, for WeakMapStores).

This causes two behavioral changes:
* add vom.ir. keys for Remotable vrefs
* only addToPossiblyRetiredSet for presence-style vrefs in weakstores

make virtualReferences right

more

more

more
warner added a commit that referenced this issue Aug 28, 2024
warner added a commit that referenced this issue Aug 28, 2024
Liveslots was suffering from the following bugs:

* vrefs weren't properly encoded into DB keys and "encodePassable"
  keys when deleting collections (such that `isEncodedRemotable()`
  always answered false)

* recognition records (vom.ir. keys) weren't created or removed for
  Remotable-style keys in weak collections

* Remotable-style vrefs were not submitted for retirement processing
  when deleting a weak collection

As a result:

* Invoking the `.clear()` method on a virtual/durable collection, or
  dereferencing the collection itself (and allowing it to be garbage
  collected), did not free any Remotables, virtual/durable
  objects (Representatives), or imported Presences used as keys. This
  could cause data in other weak collections (or other vats) to be
  retained longer than necessary.

* Allowing a weak virtual/durable collection to be garbage collected
  did not inform the kernel that the vat can no longer recognize the
  Remotable/Representative/Presence-style keys, consuming c-list
  entries longer than necessary.

* `collection.clear()` left the "ordinal-assignment"` records in the
  vatstore, causing subsequent .has() to incorrectly return true, and
  .init() to throw a "already registered" error, as well as consuming
  DB space longer than necessary.

* retiring (deleting) a Remotable used as a key in a weak
  virtual/durable collection did not free the corresponding value,
  causing data (in local and/or remote vats) to be retained longer
  than necessary

This commit fixes those bugs, which fixes the immediate cause of
issues #8756 , #7355 , and #9956 . As a change to liveslots, full
deployment requires both restarting the kernel with this new
code, *and* triggering a vat upgrade of all vats. Once that is done,
no new collection entries will suffer the problems listed above.

However, this commit makes no attempt to remediate any existing data
corruption. See #8759 for plans to build a tool that can audit the
virtual-data reference graph and detect the consequences of these bugs
in pre-existing kernel databases, and see the issues listed above for
notes on efforts to build remediation tools.
warner added a commit that referenced this issue Aug 28, 2024
Liveslots was suffering from the following bugs:

* vrefs weren't properly encoded into DB keys and "encodePassable"
  keys when deleting collections (such that `isEncodedRemotable()`
  always answered false, which skipped the refcount processing of
  keys, and removal of the "ordinal-assignment" record)

* recognition records (vom.ir. keys) weren't created or removed for
  Remotable-style keys in weak collections

* Presence-style vrefs were not submitted for retirement processing
  when deleting a weak collection

As a result:

* `collection.clear()` left the "ordinal-assignment"` records in the
  vatstore, causing subsequent .has(oldkey) to incorrectly return
  true, and .init(oldkey, newvalue) to throw a "already registered"
  error, as well as consuming DB space longer than necessary.

* Invoking the `.clear()` method on a virtual/durable collection, or
  dereferencing the collection itself (and allowing it to be garbage
  collected), did not free any Remotables, virtual/durable
  objects (Representatives), or imported Presences used as keys. This
  could cause data in other weak collections (or other vats) to be
  retained longer than necessary.

* retiring (deleting) a Remotable used as a key in a weak
  virtual/durable collection did not free the corresponding value,
  causing data (in local and/or remote vats) to be retained longer
  than necessary

* Allowing a weak virtual/durable collection to be garbage collected
  did not inform the kernel that the vat can no longer recognize the
  Presence-style keys, consuming c-list entries longer than necessary.

This commit fixes those bugs, which fixes the immediate cause of
issues #8756 , #7355 , and #9956 . As a change to liveslots, full
deployment requires both restarting the kernel with this new
code, *and* triggering a vat upgrade of all vats. Once that is done,
no new collection entries will suffer the problems listed above.

However, this commit makes no attempt to remediate any existing data
corruption. See #8759 for plans to build a tool that can audit the
virtual-data reference graph and detect the consequences of these bugs
in pre-existing kernel databases, and see the issues listed above for
notes on efforts to build remediation tools.
@warner
Copy link
Member Author

warner commented Aug 28, 2024

PR #9961 will fix this. Calls to collection.clear() will no longer leave references to keys, and will clean up their ordinal-assignment records properly, so subsequent calls to collection.has(oldkey) will return the right value, and .init(oldkey, newvalue) will not fail.

As a change to liveslots, this requires two steps to fix any given (deployed) vat. First, the kernel must be restarted with a version of @agoric/swingset-vat that references the new version of @agoric/swingset-liveslots. Second, any existing vats must be upgraded (e.g. with E(vatAdminNode).upgrade(), or controller.upgradeStaticVat()). These upgrades can be "null-upgrades", in which the same vat bundle (and vatParameters) are used as before, because even a null-upgrade will cause the new incarnation to use the current version of liveslots.

Note that any collections which were .clear()ed before the liveslots upgrade will still be corrupted (they will contain leftover ordinal-assignment records), so .has(oldkey) will incorrectly return true. The PR makes no attempt to remediate that problem (sorry, I feel bad about that, but 1: it's a non-trivial amount of work, and 2: I don't believe there are any vats in production which contain this corruption; I've checked the mainnet DB and each ordinal-assignment record matches a forward record).

The #8759 tool will check for this kind of corruption, so if you've got a DB and are worried about it, run that tool against your DB (once it exists). You could manually remediate this by deleting the ordinal-assignment records that don't match, which would fix the correctness-breaking corruption (even though there might still be refcount corruption, which causes objects to be retained incorrectly, but that's just a space leak).

I'll close this once the PR lands.

warner added a commit that referenced this issue Aug 29, 2024
This was broken because of #8756 . The modified tests are marked as
failing, until the fix is landed.
warner added a commit that referenced this issue Aug 29, 2024
Liveslots was suffering from the following bugs:

* vrefs weren't properly encoded into DB keys and "encodePassable"
  keys when deleting collections (such that `isEncodedRemotable()`
  always answered false, which skipped the refcount processing of
  keys, and removal of the "ordinal-assignment" record)

* recognition records (vom.ir. keys) weren't created or removed for
  Remotable-style keys in weak collections

* Presence-style vrefs were not submitted for retirement processing
  when deleting a weak collection

As a result:

* `collection.clear()` left the "ordinal-assignment"` records in the
  vatstore, causing subsequent .has(oldkey) to incorrectly return
  true, and .init(oldkey, newvalue) to throw a "already registered"
  error, as well as consuming DB space longer than necessary.

* Invoking the `.clear()` method on a virtual/durable collection, or
  dereferencing the collection itself (and allowing it to be garbage
  collected), did not free any Remotables, virtual/durable
  objects (Representatives), or imported Presences used as keys. This
  could cause data in other weak collections (or other vats) to be
  retained longer than necessary.

* retiring (deleting) a Remotable used as a key in a weak
  virtual/durable collection did not free the corresponding value,
  causing data (in local and/or remote vats) to be retained longer
  than necessary

* Allowing a weak virtual/durable collection to be garbage collected
  did not inform the kernel that the vat can no longer recognize the
  Presence-style keys, consuming c-list entries longer than necessary.

This commit fixes those bugs, which fixes the immediate cause of:

* fixes #8756
* fixes #7355
* fixes #9956

As a change to liveslots, full deployment requires both restarting the
kernel with this new code, *and* triggering a vat upgrade of all
vats. Once that is done, no new collection entries will suffer the
problems listed above.

However, this commit makes no attempt to remediate any existing data
corruption. See #8759 for plans to build a tool that can audit the
virtual-data reference graph and detect the consequences of these bugs
in pre-existing kernel databases, and see the issues listed above for
notes on efforts to build remediation tools.

This commit marks the new tests as expected to pass again. It adds one
new (failing) test to demonstrate the lack of remediation code.

Thanks to @gibson042 for recommendations.
warner added a commit that referenced this issue Aug 30, 2024
This was broken because of #8756 . The modified tests are marked as
failing, until the fix is landed.
warner added a commit that referenced this issue Aug 30, 2024
Liveslots was suffering from the following bugs:

* vrefs weren't properly encoded into DB keys and "encodePassable"
  keys when deleting collections (such that `isEncodedRemotable()`
  always answered false, which skipped the refcount processing of
  keys, and removal of the "ordinal-assignment" record)

* recognition records (vom.ir. keys) weren't created or removed for
  Remotable-style keys in weak collections

* Presence-style vrefs were not submitted for retirement processing
  when deleting a weak collection

As a result:

* `collection.clear()` left the "ordinal-assignment"` records in the
  vatstore, causing subsequent .has(oldkey) to incorrectly return
  true, and .init(oldkey, newvalue) to throw a "already registered"
  error, as well as consuming DB space longer than necessary.

* Invoking the `.clear()` method on a virtual/durable collection, or
  dereferencing the collection itself (and allowing it to be garbage
  collected), did not free any Remotables, virtual/durable
  objects (Representatives), or imported Presences used as keys. This
  could cause data in other weak collections (or other vats) to be
  retained longer than necessary.

* retiring (deleting) a Remotable used as a key in a weak
  virtual/durable collection did not free the corresponding value,
  causing data (in local and/or remote vats) to be retained longer
  than necessary

* Allowing a weak virtual/durable collection to be garbage collected
  did not inform the kernel that the vat can no longer recognize the
  Presence-style keys, consuming c-list entries longer than necessary.

This commit fixes those bugs, which fixes the immediate cause of:

* fixes #8756
* fixes #7355
* fixes #9956

As a change to liveslots, full deployment requires both restarting the
kernel with this new code, *and* triggering a vat upgrade of all
vats. Once that is done, no new collection entries will suffer the
problems listed above.

However, this commit makes no attempt to remediate any existing data
corruption. See #8759 for plans to build a tool that can audit the
virtual-data reference graph and detect the consequences of these bugs
in pre-existing kernel databases, and see the issues listed above for
notes on efforts to build remediation tools.

This commit marks the new tests as expected to pass again. It adds one
new (failing) test to demonstrate the lack of remediation code.

Thanks to @gibson042 for recommendations.
warner added a commit that referenced this issue Aug 30, 2024
This was broken because of #8756 . The modified tests are marked as
failing, until the fix is landed.
warner added a commit that referenced this issue Aug 30, 2024
Liveslots was suffering from the following bugs:

* vrefs weren't properly encoded into DB keys and "encodePassable"
  keys when deleting collections (such that `isEncodedRemotable()`
  always answered false, which skipped the refcount processing of
  keys, and removal of the "ordinal-assignment" record)

* recognition records (vom.ir. keys) weren't created or removed for
  Remotable-style keys in weak collections

* Presence-style vrefs were not submitted for retirement processing
  when deleting a weak collection

As a result:

* `collection.clear()` left the "ordinal-assignment"` records in the
  vatstore, causing subsequent .has(oldkey) to incorrectly return
  true, and .init(oldkey, newvalue) to throw a "already registered"
  error, as well as consuming DB space longer than necessary.

* Invoking the `.clear()` method on a virtual/durable collection, or
  dereferencing the collection itself (and allowing it to be garbage
  collected), did not free any Remotables, virtual/durable
  objects (Representatives), or imported Presences used as keys. This
  could cause data in other weak collections (or other vats) to be
  retained longer than necessary.

* retiring (deleting) a Remotable used as a key in a weak
  virtual/durable collection did not free the corresponding value,
  causing data (in local and/or remote vats) to be retained longer
  than necessary

* Allowing a weak virtual/durable collection to be garbage collected
  did not inform the kernel that the vat can no longer recognize the
  Presence-style keys, consuming c-list entries longer than necessary.

This commit fixes those bugs, which fixes the immediate cause of:

* fixes #8756
* fixes #7355
* fixes #9956

As a change to liveslots, full deployment requires both restarting the
kernel with this new code, *and* triggering a vat upgrade of all
vats. Once that is done, no new collection entries will suffer the
problems listed above.

However, this commit makes no attempt to remediate any existing data
corruption. See #8759 for plans to build a tool that can audit the
virtual-data reference graph and detect the consequences of these bugs
in pre-existing kernel databases, and see the issues listed above for
notes on efforts to build remediation tools.

This commit marks the new tests as expected to pass again. It adds one
new (failing) test to demonstrate the lack of remediation code.

Thanks to @gibson042 for recommendations.
mergify bot pushed a commit that referenced this issue Aug 30, 2024
These tests will fail without the fixes in the next commit

One new test is disabled because of additional pending bugs that
interfere with the test setup (a combo of #9956 and #9959), which will
be re-enabled in PR #9961 (to address #8756 and others)
warner added a commit that referenced this issue Aug 31, 2024
This was broken because of #8756 . The modified tests are marked as
failing, until the fix is landed.
warner added a commit that referenced this issue Aug 31, 2024
Liveslots was suffering from the following bugs:

* vrefs weren't properly encoded into DB keys and "encodePassable"
  keys when deleting collections (such that `isEncodedRemotable()`
  always answered false, which skipped the refcount processing of
  keys, and removal of the "ordinal-assignment" record)

* recognition records (vom.ir. keys) weren't created or removed for
  Remotable-style keys in weak collections

* Presence-style vrefs were not submitted for retirement processing
  when deleting a weak collection

As a result:

* `collection.clear()` left the "ordinal-assignment"` records in the
  vatstore, causing subsequent .has(oldkey) to incorrectly return
  true, and .init(oldkey, newvalue) to throw a "already registered"
  error, as well as consuming DB space longer than necessary.

* Invoking the `.clear()` method on a virtual/durable collection, or
  dereferencing the collection itself (and allowing it to be garbage
  collected), did not free any Remotables, virtual/durable
  objects (Representatives), or imported Presences used as keys. This
  could cause data in other weak collections (or other vats) to be
  retained longer than necessary.

* retiring (deleting) a Remotable used as a key in a weak
  virtual/durable collection did not free the corresponding value,
  causing data (in local and/or remote vats) to be retained longer
  than necessary

* Allowing a weak virtual/durable collection to be garbage collected
  did not inform the kernel that the vat can no longer recognize the
  Presence-style keys, consuming c-list entries longer than necessary.

This commit fixes those bugs, which fixes the immediate cause of:

* fixes #8756
* fixes #7355
* fixes #9956

As a change to liveslots, full deployment requires both restarting the
kernel with this new code, *and* triggering a vat upgrade of all
vats. Once that is done, no new collection entries will suffer the
problems listed above.

However, this commit makes no attempt to remediate any existing data
corruption. See #8759 for plans to build a tool that can audit the
virtual-data reference graph and detect the consequences of these bugs
in pre-existing kernel databases, and see the issues listed above for
notes on efforts to build remediation tools.

This commit marks the new tests as expected to pass again. It adds one
new (failing) test to demonstrate the lack of remediation code.

Thanks to @mhofman and @gibson042 for recommendations.
warner added a commit that referenced this issue Aug 31, 2024
This was broken because of #8756 . The modified tests are marked as
failing, until the fix is landed.
@mergify mergify bot closed this as completed in 97e81f1 Aug 31, 2024
mergify bot added a commit that referenced this issue Aug 31, 2024
Liveslots was suffering from the following bugs:

* vrefs weren't properly encoded into DB keys and "encodePassable"
  keys when deleting collections (such that `isEncodedRemotable()`
  always answered false, which skipped the refcount processing of
  keys, and removal of the "ordinal-assignment" record)

* recognition records (vom.ir. keys) weren't created or removed for
  Remotable-style keys in weak collections

* Presence-style vrefs were not submitted for retirement processing
  when deleting a weak collection

As a result:

* `collection.clear()` left the "ordinal-assignment"` records in the
  vatstore, causing subsequent .has(oldkey) to incorrectly return
  true, and .init(oldkey, newvalue) to throw a "already registered"
  error, as well as consuming DB space longer than necessary.

* Invoking the `.clear()` method on a virtual/durable collection, or
  dereferencing the collection itself (and allowing it to be garbage
  collected), did not free any Remotables, virtual/durable
  objects (Representatives), or imported Presences used as keys. This
  could cause data in other weak collections (or other vats) to be
  retained longer than necessary.

* retiring (deleting) a Remotable used as a key in a weak
  virtual/durable collection did not free the corresponding value,
  causing data (in local and/or remote vats) to be retained longer
  than necessary

* Allowing a weak virtual/durable collection to be garbage collected
  did not inform the kernel that the vat can no longer recognize the
  Presence-style keys, consuming c-list entries longer than necessary.

This commit fixes those bugs, which fixes the immediate cause of
issues #8756 , #7355 , and #9956 . As a change to liveslots, full
deployment requires both restarting the kernel with this new
code, *and* triggering a vat upgrade of all vats. Once that is done,
no new collection entries will suffer the problems listed above.

However, this commit makes no attempt to remediate any existing data
corruption. See #8759 for plans to build a tool that can audit the
virtual-data reference graph and detect the consequences of these bugs
in pre-existing kernel databases, and see the issues listed above for
notes on efforts to build remediation tools.
@warner
Copy link
Member Author

warner commented Aug 31, 2024

This is faster: we need to know what the keys/values point to (so we can decrement their refcounts), but we don't need to construct any new Representatives.

Why does the version accepting a pattern need to construct representatives? Is it because the pattern may require of a specific instance instead of a generic "remotable"?

Yep, exactly.

I'm wondering if we could optimize the clear with pattern to avoid creating representative for each keys as that would cause GC pressure (representatives are only collected at the next BOYD)

I guess, if we made some sort of "virtual pattern" or "virtual matcher" that could remember its vref without actually holding a Representative.

  • the fixed-length ordinal (providing insertion order)

This effectively puts a cap on the number of object references in a collection ever (not at a given point in time, but over time), right? The value might be large enough, but there are ways to encode that keep sorting, and increase the size as needed. cc @gibson042 who's a specialist in these encoding tricks.

Yeah, this week we talked about his "unary length encoding" scheme that would remove that arbitrary limit.

If add(p) is invoked with a Presence whose vref is o-12:

This seems to lack the reference counting syscalls made when inserting the object in the collection?

I omitted that from the writeup. The actual doInit() function increments the key's refcount:

      if (passStyleOf(key) === 'remotable') {
        ...
        if (hasWeakKeys) {
          vrm.addRecognizableValue(key, `${collectionID}`, true);
        } else {
          vrm.addReachableVref(vref);
        }

One thing that concerns me with this whole scheme is that it leaves no space for composite keys. This will effectively allow the same vref in multiple collection entries. Either we re-use the remotable reference create for it (unlikely) and need to count how many times that remotable reference is used, or we need to accept that a single vref may have multiple object references in the collection. I suppose the latter could be done by transforming the reverse mapping into a list of ordinals.

Yeah, I don't know how composite keys are going to work. Maybe we assign an ordinal to each composite key, have a secondary table that tracks the relationship between ordinals and the vrefs that go into them (and vice versa, probably many-to-many).

It will probably become easier if/when we can take better advantage of SQLite, and use real column indexes.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working liveslots requires vat-upgrade to deploy changes SwingSet package: SwingSet
Projects
None yet
Development

No branches or pull requests

2 participants