PoRep security policy (FIP-0047) #415

anorth · 2022-07-22T00:05:38Z

anorth
Jul 22, 2022
Maintainer

This is a proposal for the Filecoin network to ratify a policy to be adopted in case an insecurity is discovered in the theory or implementation of proof-of-replication. The discussion is prompted by #386 which, by proposing to change the maximum sector commitment duration, changes the implicit policy that exists today. But this is a discussion that we should have in any case.

Background

Proof-of-replication and network security

The Filecoin network uses cryptographic techniques to provide assurance of the physical uniqueness of sectors of data (proof of replication, or PoRep) and their ongoing availability (proof of space-time, PoSt). These mechanisms provide the proof of work underlying the blockchain’s security (in addition to the security offered by pledge collateral stake).

Some of the cryptography involved has been developed relatively recently. It is possible that there are errors in either the theory or implementation, or that errors may be introduced one day, that undermine the desired assurances. The result of such an error would most likely be that storage providers could “cheat” the network to claim they were maintaining more committed storage than they in fact possessed. This would reduce network security as consensus power could be gained without the expected physical infrastructure commitment. It is also unfair to non-cheating providers, assuming knowledge of the flaw was limited. In the case of an error in PoRep, it is likely that there would be no possible protocol change that could detect the cheating sectors after commitment.

This situation has already arisen once in the life of the Filecoin network. The v1.1 PoRep algorithm patched a bug in the v1 PoRep implementation that weakened the security assurances of sectors. The bug was responsibly reported to the Filecoin team and it is unknown if it was ever exploited by a provider.

We are not aware of any similar bugs at this time. Filecoin storage is secure as far as we can ascertain.

Current policy

The network today has an implicit policy on what to do if another such bug is detected:

Implement a new, fixed PoRep algorithm
In a network upgrade, the old algorithm is disallowed and the new one mandated for new sectors
Old sectors are prohibited extension beyond the 1.5-year maximum commitment
The power attributed to old sectors decreases as they expire. In the worst case of many insecure sectors being committed (or extended) immediately prior to the fix being deployed, the insecure power is half diminished after 9 months, and extinct after 18 months. As the network grows, the insecure proportion diminishes even faster.

The policy might thus be summarised as: put up with the potentially-insecure power for a limited period of time, but retain existing commitments of providers to the network and vice-versa. The 1.5-year window was selected as a compromise: a longer maximum commitment would be beneficial for storage stability, but a shorter bound improves the response time in case of a bug.

Pre-genesis network designers may have considered this policy a placeholder, to be replaced on the fly in the event of a bug. But without alternative ideas expressed clearly in FIPs or code, participants might reasonably assume that the current code is a stable policy.

Changing the maximum sector commitment duration

The sector commitment duration (currently 1.5 years) is the period of availability a provider commits to when first proving a sector. When a sector is close to expiration, a provider can extend the sector for up to the same duration again. This can be repeated until the sector maximum lifetime (currently 5 years), after which further extensions are prohibited.

The sector duration multiplier proposal proposes increasing the maximum sector commitment duration to match the maximum lifetime of 5 years, in order to most simply provide increased rewards for longer commitments.

This demands a new analysis and policy for how the network should respond to an error in current or future PoRep implementations. We should express such a policy in any case, but the proposed extension of commitment duration makes it more urgent.

Proposal

This proposal aims to establish a concrete and widely-accepted policy for how the network should respond to possible future bugs that compromise the security of storage-based consensus.

Goals

Document a network policy for how to respond to errors that affect security of PoRep as the basis for consensus power.
Ratify the policy in a FIP, binding participants to align with it should such a situation arise.
Give storage providers some confidence of network policy for use in their own risk and business modelling.
Establish a foundation for a smooth network upgrade implementing a pre-agreed policy in the event of discovery of an insecure PoRep.

It may not be necessary to implement code embodying the agreed policy until necessitated by discovery of an insecurity. It’s quite likely that the details of the flaw would inform the implementation. A FIP-level ratification of a concrete policy, but without requiring code, provides a good balance between community consensus and future adaptability.

Ideas

We identify two basic classes of policy, some of which have parameters and/or multiple mechanisms that could realise them.

Note that these are policies for disaster response. We should not expect the outcomes to be desirable, as compared with PoRep continuing to be secure. But it’s important to identify what trade-offs network participants would prefer in the event of such a discovery.

Option A: status quo

One possible policy is to continue today’s implicit policy, but allowing insecure power to persist for >3x the current policy, up to 5 years.

Storage providers in this scenario could be prohibited from extending the life of any sectors initially committed for less than 5 years, but given the very large proposed power multipliers offered for long commitments, it’s likely that 5-year-committed sectors dominate the network power table at most times. Such sectors might be part-way through their life at the time a flaw is detected.

This is the easiest to implement (no change from today) but may be unpalatable from a network security point of view. While new sectors would gradually reduce the power attributable to possibly-cheating ones, significant insecure power would persist for years.

Option B: re-sealing

Another possible policy is to allow/require storage providers to re-seal their committed sectors with a new PoRep algorithm within some fixed time window in order to maintain power. The time window for re-sealing is a policy parameter. We might assume that 1.5 years is an appropriate value for minor weaknesses in security, but insufficient to address a major flaw that retro-actively weakens all power. In such cases, the re-sealing deadline might have to be much shorter.

Another parameter of mandated re-sealing is any penalty to be extracted for sectors which are not re-sealed. Failure to re-seal before the deadline could be considered an early termination of the sector, where the provider would pay the termination penalty (at present, approximately 90 days worth of projected reward at the epoch of commitment, but may differ in the future). Alternatively, failure to reseal before the deadline (which was unknowable when the sector was committed) could be considered a “normal” expiration, effectively forgiving the commitments in case of a PoRep flaw. Or something in between: a penalty that differs from the usual early-termination fee.

We might not be certain about the appropriate re-sealing window in advance of identifying a concrete flaw, but we should strive to align on a default for minor flaws with similar impact to the V1 PoRep. Similarly, it may not be possible to lock in an appropriate failure-to-reseal penalty, but we should strive for alignment on a value given a moderate re-sealing window. It might be easier to set a tight re-sealing deadline if failure to meet it is not penalised too hard.

Discussion

Informing storage provider business operations

Storage providers need some level of certainty over network behaviour in order to appropriately fund and structure the risk they take. For example, if Option B were ratified, a provider could choose to commit only short-duration sectors and effectively avoid most unexpected obligations to re-seal, or seek the higher returns from longer commitments but factor in or insure against the potential costs of re-sealing.

Without clarity on the network’s response to such a possibility, providers might assume the status quo and be ill-prepared for re-sealing were it subsequently mandated. Destabilising storage provider business and operations by invalidating assumptions would likely make a bad situation even worse.

Analysis

The options above, and any new ones proposed in this discussion, require some analysis to understand their impacts. We need to answer questions like:

What is the expected sealing capacity of the network?
How would we expect network power and growth rates to change, over what timeframes?
How would token supply flows change as pledges are released and any new sectors sealed?
What is the distribution of provider profit margins and how would the increased cost of re-sealing and/or a penalty affect their viability? How does this in turn affect the questions above?

We should understand what we are trading-off between various policy options and parameters, in response to a severe network distress.

Implementation feasibility

The options around re-sealing require significant implementation work to realise on-chain. At the scale of the Filecoin network even today, traversing all sectors to compute new deadlines or terminate faulty sectors is a large amount of on-chain processing that would have to be spread over many weeks or months. An algorithm to prove that a new PoRep commits to the same data as the original proof must also be described.

Before adopting a policy, we should gain a reasonable idea of its implementation feasibility.

Limits

There are limits to the scope of events we can plan for. An insecure proof-of-replication is one network risk which has been considered since Filecoin’s early development, and the current 1.5-year maximum commitment provides a mitigation to some class of flaws.

However, there is always the possibility of some class of flaw that researchers and implementers have not predicted. The policy we arrive at here may not help in all possible situations, and we should avoid applying it if it doesn’t fit whatever flaw we might discover. Nevertheless, for this class of well-understood possible flaw, a clear policy agreed up-front should greatly aid the network in navigating that situation should it arise.

anorth · 2022-07-28T23:49:26Z

anorth
Jul 28, 2022
Maintainer Author

I think there are (at least) two different parts to the policy to be decided here.

The action to be taken on SPs' existing (now-suspect) sectors, i.e. the storage itself
The economic consequences of [in-]action, penalties etc

For part 1, on the surface it would seem desirable to have SPs re-seal the "same" sectors they have already committed. That is, perform a new proof of replication for the same underlying data (unsealed sector CID, aka CommD). If they did this, then any parties relying on the data sealed into a sector could continue to do so. Storage markets and the FIL+ program are two obvious such parties, but as we enable more programmability there could be many more on- and off-chain entities that rely on the assurances offered by PoRep.

I point this out because, as far as I know today, we don't know how to do this. We don't have a specific technique for proving that a re-seal of a sector commits to the same data as the original proof-of-replication. The unsealed sector CID is not stored in chain state. We don't yet know that it's impossible, but developing a mechanism (or proving we can't) could represent some weeks of work. We'd be far better off to work this out in advance of discovering a flaw.

The unsealed sector CIDs do exist in the blockchain message history, but that is not immediately accessible to actors. A possible technique would be to build an off-chain database of them, and SPs would prove inclusion of the appropriate (Sector ID, CommD) when re-sealing.

If we can't find a practical re-sealing mechanism, we should also prepare for the case that we can't require SPs to re-commit to the same data. This means that providers would recover power by committing new sectors. The default behaviour of the built-in storage market and FIL+ would be to consider the old sectors terminated when the proof validity deadline expires. This is an unattractive position for clients, but it turns out we are already working towards the mitigation we'd need. Proposals like #313 work towards the ability to transfer FIL+ verified pieces of data from an expiring sector into a new one, and #298 lays the foundations for the same capability in any storage market. These mechanisms could allow a provider to transfer client data into new sectors, maintain deal-related commitments, and let their old sectors terminate "empty" (with any penalties correspondingly reduced to CC sectors). These mechanisms' utility in responding to a PoRep flaw might motivate implementation of those specific capabilities sooner rather than later (FYI @ZenGround0).

For part 2, I think this should primarily be driven by analysis of effects on network power, collateral, token supply, SP profitability etc to find which policy would be least disruptive.

0 replies

Kubuxu · 2022-08-22T14:02:53Z

Kubuxu
Aug 22, 2022
Collaborator

I would like to propose a modification to changes proposed in #366, dealing with increased MaxSectorDuration.

The current SDM proposal increases the maximum sector commitment length from 1.5 to 5 years. It does it by increasing the MaxSectorDuration parameter from 1.5 to 5 years. This parameter tweak seems very simple, but it has an unaccounted externality of making mitigation of a possible PoRep bug much more difficult.

Proposal

The core of the proposal is to separate the period of validity of a proof from the period of commitment for a sector. This leaves the 1.5-year sector extension process in place to maintain proof validity, but allows SPs to commit to longer periods.

The existing sector Expiration property would be renamed to CommitmentExpiration with a duration range from MinSectorLifetime to MaxSectorLifetime.

A new sector property is introduced called ProofExpiration. The initial value of ProofExpiration always falls into the range (CurrentEpoch + MaxProofDuration - ProofRefreshWindow, CurrentEpoch + MaxProofDuration], or is CommitmentExpiration, whichever is smaller.

MaxProofDuration: maximum period from last commitment or refresh for which a PoRep is valid, set to 1.5y to match current behaviour
ProofRefreshWindow: a window of time before ProofExpiration when the proof can be extended, set to 0.5y

The proof refresh window creates a trade-off: the larger the window, the more refreshes can happen in a single batch, but the more frequently each proof must be refreshed.
For example, with a MaxProofDuration of 1.5y and a ProofRefreshWindow of 0.5y, each extension will be at least 1y and at most 1.5y. SPs will have to refresh sectors every MaxProofDuration - ProofRefreshWindow epochs.

The proof expiration is not freely chosen by the SP, but takes a value that is derived and quantised from the sector’s activation epoch.

A Storage Provider can at any point call RefreshProofExpiration(SectorSelector) to request a refreshed proof expiration. This call only results in an actual refresh of the ProofExpiration if called within ProofRefreshWindow of the ProofExpiration.
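The refresh rule can be sketched as follows. This is only an illustration, not the FIP's implementation: the function name, the 2880-epochs-per-day constant, the 365-day year, and the choice to advance the expiration by exactly MaxProofDuration - ProofRefreshWindow on each refresh are all assumptions made here to reproduce the "at least 1y, at most 1.5y" behaviour described above. The cap at CommitmentExpiration is omitted for brevity.

```python
# Hypothetical sketch of the proof refresh rule; names and constants are assumptions.
EPOCHS_PER_DAY = 2880                      # assumes 30-second epochs
EPOCHS_PER_YEAR = 365 * EPOCHS_PER_DAY

MAX_PROOF_DURATION = EPOCHS_PER_YEAR * 3 // 2    # 1.5y
PROOF_REFRESH_WINDOW = EPOCHS_PER_YEAR // 2      # 0.5y
# Assumed quantisation step: one refresh moves the expiration forward by 1y.
REFRESH_STEP = MAX_PROOF_DURATION - PROOF_REFRESH_WINDOW


def refresh_proof_expiration(proof_expiration: int, current_epoch: int) -> int:
    """Return the proof expiration after a refresh call at current_epoch."""
    if current_epoch >= proof_expiration:
        raise ValueError("proof has already expired")
    if current_epoch < proof_expiration - PROOF_REFRESH_WINDOW:
        # Called too early: the call succeeds but changes nothing.
        return proof_expiration
    # Inside the refresh window: advance by the fixed step, so the new
    # expiration does not depend on when in the window the call was made.
    return proof_expiration + REFRESH_STEP
```

Under these assumptions, a call made at the very start of the window leaves 1.5y of validity, and a call made just before expiry leaves slightly more than 1y, which is within the range given for the initial value.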

In case of a PoRep bug, the RefreshProofExpiration method would be disallowed for sectors created with insecure PoRep.
Other policies are possible too, such as requiring re-sealing or some other computation. The mechanism can support other policies developed in response to a concrete security flaw.

It also has the following benefits:

The bug response upgrade does not need to iterate over all sectors within one epoch. The mechanism to iterate over all sectors exists in form of an expiration schedule.
Sectors’ ProofExpirations are spread out uniformly (according to onboarding) and are not possible to manipulate on a short timescale, which could be an issue otherwise leading to non-even distribution and terminations of sectors.
Storage Providers must take action to avoid early sector termination when a Bug Response Policy is implemented, instead of the network having to create a new mechanism to enforce the Bug Response Policy.

Limitations of this mechanism include:

The response period must be specified now but can be modified at cost of some additional complexity in future.
Relative to the naive approach of just increasing the maximum sector duration, this mechanism requires more SP operational costs and on-chain activity.
But only a little less than the amount required today to extend sectors. (Less because we won’t recompute pledge, etc).
Relative to the naive approach, this mechanism introduces operational risk to SPs, who will pay a termination fee for failing to refresh expiring proofs.
Since the sector duration multiplier proposal greatly increases rewards for long commitments, this operational cost might be considered just part of the risk taken in pursuit of those rewards.
On the other hand, SPs are given a generous window of time to refresh their sectors.

For more technical details please see this document.

0 replies

anorth · 2022-08-31T04:34:26Z

anorth
Aug 31, 2022
Maintainer Author

I've posted a draft FIP at #446. The proposal introduces a proof expiration mechanism which supports the orderly processing of all sectors in the network. This mechanism would need to be implemented before committing any sectors with longer durations – it can't be deferred until we discover a PoRep bug. However, the policy about what to do in case of a bug does not need to be implemented yet (or, hopefully, ever).

1 reply

anorth Aug 31, 2022
Maintainer Author

This proposal is valuable independent of FIP-0036, though (in my view) also required for it.

With this proposal, we could immediately increase the maximum sector extension period from 1.5y to some larger value like 3 or 5y. Without FIP-0036 there would be no immediate reward incentive for a provider to do so, but it would allow an SP to take deals (including FIL+ deals) with terms longer than 1.5 years. Probably, most sectors we would see with long commitments would be those hosting deals. Coupled with FIP-0045, this would let clients express long deal terms, and deal-taking providers lock in a full 10x reward for up to 3 or 5 years. This has a very nice effect of making it easier for deal-oriented providers to earn higher returns, despite being a neutral change.

FYI @jennijuju @dkkapur

fjoianginaiwsnggw · 2022-10-08T03:30:38Z

fjoianginaiwsnggw
Oct 8, 2022

@anorth Hi would like to know if this proposal is introduced for Committed Capacity (CC) sectors or sectors containing storage deals. Thanks~

2 replies

jennijuju Oct 8, 2022
Maintainer

Shall apply to all sectors, at least new ones.

fjoianginaiwsnggw Oct 8, 2022

Hi Is it possible that the day before the sector expires the storage provider extends the lifetime for 1.5 year again, the power will now be 5x or even less in terms of sectors containing storage deals?

fjoianginaiwsnggw · 2022-10-11T00:59:11Z

fjoianginaiwsnggw
Oct 11, 2022

Hi @anorth @jennijuju Is it possible that the day before the sector expires the storage provider extends the lifetime for 1.5 year again, the power will now be 5x or even less in terms of sectors containing storage deals?

2 replies

jennijuju Oct 11, 2022
Maintainer

Please check out FIP-0045, which will be finalized in the upcoming upgrade.

anorth Oct 17, 2022
Maintainer Author

It is possible for a sector to be extended right before it expires. The new power depends on whether the sector is SimpleQAPower as specified in FIP-0045.

fjoianginaiwsnggw · 2022-10-11T03:24:47Z

fjoianginaiwsnggw
Oct 11, 2022

@jennijuju Thanks a lot! May I know when this proposal will come into effect please?

0 replies

tmellan · 2022-11-16T17:52:46Z

tmellan
Nov 16, 2022

The policy for this FIP stated here is

The RefreshProofExpiration method is disallowed for old sectors (ie, created with the vulnerable PoRep algorithm), in particular, all old sectors will terminate at ProofExpiration epoch;
RefreshProofExpiration method is disallowed for old sectors
A new method ReplaceSector(OldSector, NewSector) can be called by a provider to replace an old sector with a newly-sealed one.
This requires a Storage Provider to run the sealing procedure with the new PoRep algorithm on the new sector (ie, successful call to PreCommitSector and ProveCommitSector)
The sealing procedure and replacement must be completed before the ProofExpiration epoch of the old sector
If the replacement sealing procedure is successful, the old sector is removed with no termination fee
If the replacement sealing is not performed for an old sector before its proof expires, that sector is terminated at the ProofExpiration.

Is point 5 correct? If yes that seems like it places an immediate and substantial burden on SPs in the case of a PoRep bug. Unlucky sectors will have arbitrarily small durations to reseal (such as 1 day).

More precisely, simulating the stated policy, assuming a max 5 year sector commitment, gives a distribution of ‘time-to-reseal’ in the event of bug discovery that’s shown in the image attached. You can see a substantial proportion have short reseal times (e.g. < 2 weeks). While this is a disaster response policy, such a short time for any proportion seems unnecessarily disruptive to SPs.

If this interpretation of the policy is correct it raises some questions on the balance of tradeoffs and potentially the policy should be improved:

What additional structure in FIP-0047 can avoid any sectors getting very short reseal times?
Why is the FIP-0047 policy better than a simpler one like ‘a bug is found, everyone has X days to reseal’? The main argument seems to be the naive policy has potential for congestion. FIP-0047 however appears to have the certainty of some unlucky sectors getting very short reseal times. And is congestion even likely to be a problem given batching? Another argument against the naive is simultaneous expiration, but this seems like it could be avoided with a simpler staggering.

But first, I’d like to find out if the above policy interpretation/statement is what’s planned to be implemented. 🙏

11 replies

SHSRPL Nov 22, 2022

Thanks all for the fruitful discussion surrounding the policy.

I feel that it is worth mentioning one specific edge case where SPs would in practice have very little time to reseal their sectors - one where the bug is discovered right before the commitment expiry of the sector. In such a case, giving a grace period would not help as the PoRep proof cannot be extended beyond the commitment expiry epoch.

While in practice this case would be a very rare encounter it is still worth thinking about the fairness of imposing a termination penalty for such sectors which have very short reseal times.

Would it be feasible to think about a policy where the termination penalty is waived for sectors with very short reseal times?
Would it be feasible to otherwise think about a policy for the termination fee that makes its value a function of the time the sector has to reseal?

geoff-vball Nov 22, 2022

PoRep proof cannot be extended beyond the commitment expiry epoch

@SHSRPL I don't think this is part of the proposal. ProofExpiry does not have to be bound by the commitment expiry epoch. If the proof expires shortly before the sector commitment, we should extend the proof for the full amount. This way if a miner extends that sector's commitment, they don't also have to extend the proof at the same time if it should already be valid.

SHSRPL Nov 23, 2022

Thank you @geoff-vball for your reply! My question now is this: In the scenario where the commitment expiration of a sector comes before the proof expiration, when the network upgrade is rolled out and the RefreshProofExpiration method is disallowed, wouldn't the time to reseal be computed as the min(CommitmentExpiration, ProofExpiration) - CurrentEpoch? Or can the SP do replacement sealing for their sectors even after it has reached its commitment expiration epoch?

geoff-vball Nov 23, 2022

Miners do not have to reseal sectors where CommitmentExpiration < ProofExpiration, they can just let them expire. The miner could choose to extend such a sector to make CommitmentExpiration > ProofExpiration, but I think in this scenario it would almost always be better to seal a new sector instead of resealing the original one.

anorth Nov 23, 2022
Maintainer Author

We would disallow extension of old-proof sectors in this case – it must expire. The FIP does say this is part of the existing policy, and I think it should say it's part of the new policy too. @Kubuxu

SHSRPL · 2022-12-11T05:43:31Z

SHSRPL
Dec 11, 2022

Hi @geoff-vball, with regards to the most recent commit to the FIP locking down the exact parameters for the policy, we at the CryptoEcon Lab have some analysis that we would like to share on specifically this. We are currently working on consolidating all our research and intend to share it publicly by Monday/Tuesday latest.

1 reply

geoff-vball Dec 11, 2022

Sure thing. Currently we've just specified what is in the current implementation. I'm looking forward to hearing your analysis this week.

SHSRPL · 2023-01-09T09:27:06Z

SHSRPL
Jan 9, 2023

We recently conducted an analysis of the incentives in FIP-0047 and created a report detailing some of our findings. We welcome all suggestions and feedback and wish to open a discussion on locking down the exact parameters for the policy, specifically the ProofRefreshWindow and the termination fee, and how they affect the economic viability of conducting replacement sealing every time a bug is discovered in the PoRep algorithm.

Our key findings can be summarized as follows:

A ProofRefreshWindow of approximately 30-60 days is optimal for making resealing more economically viable than not resealing, considering both typical scenarios and across a wide range of possible scenarios.
A termination fee of 50-100 days of block reward per sector is sufficient to drive storage providers towards the intended behaviour of resealing their sectors in the case of a PoRep bug. We suggest no change from the current termination fee right now. But note that in the near future the termination fee will likely need to be revised upwards: the current termination fees were set in the context of 6-month sectors, and revision soon after sector duration multiplier implementation (if accepted) is high priority.

2 replies

anorth Jan 11, 2023
Maintainer Author

Thanks for this. I have read the report.

The range 30-60 days is rather wide (a 2x). It's not clear if there is any significant difference between the values of 30 and 60 days for the proof refresh window. If there is little difference, we will select 60 days, since a larger value will be more operationally convenient for storage providers during normal operations. Indeed - this convenience is what drove the current value to be about twice that. cc @Kubuxu

SHSRPL Jan 15, 2023

Thanks a lot @anorth. As you mentioned the effect of having a 60 vs 30 day PRF window is marginal. The reason we gave such a high range is so that the implementers can choose a value within this range keeping in mind the operational constraints that SPs face while refreshing their proofs and conducting replacement sealing if the need arises.

anorth · 2023-02-22T22:45:04Z

anorth
Feb 22, 2023
Maintainer Author

Background

FIP-0047 aims to provide a security mechanism for the network in case a flaw is discovered in the PoRep algorithm used to commit sectors. It establishes a rolling 1.5 year schedule for expiring the validity of the PoRep proof for every sector, requiring each to be replaced or terminated. This schedule must be implemented prior to any sector being committed for >1.5 years and any such flaw being discovered, as (a) doing so afterwards would involve a large and complicated state migration in the middle of a presumably fragile network and intense time pressure to fix the actual flaw, and (b) implementing the schedule now clearly communicates to SPs how such a flaw would be handled, so they can mitigate risk as they see fit.

The mechanism specified in FIP-0047 requires a storage provider to send a message to refresh the proof for each sector in a 60-day window before the end of each 1.5-year validity period. If the message is not sent, the sector is immediately terminated. This ensures that every sector’s proof has been checked within the past 1.5 years, and presents a simple mechanism for the network to deny subsequent refreshes if a flaw is discovered.

Motivation for change

Despite this mechanism being approved and implemented (but not yet active on the network), there are two reasons motivating a change:

Product impact: terminating a sector is a very harsh penalty for a storage provider who, through neglect or operational error, fails to send the message. Not only is power lost and the termination penalty applied, but any data stored is removed from the network, so clients’ deals will fail too. And this is all to an SP which hasn’t really done anything wrong – they’re still proving the sector and nothing is known to be wrong with their PoRep. Good automation and tooling should make this error rare, but it will happen.
Complexity: the implementation of FIP-0047 in terms of the miner actor’s existing expiration queue mechanism was much more complex than expected. This code was already among the most complex areas of the built-in actors. Any change here brings risk. It is difficult to be highly confident of correctness in all cases, even after testing (of which we have done only a moderate amount).

Proposal

We (@Kubuxu, @ZenGround0 and I) have the concept for an alternative implementation of FIP-0047 which will solve both problems: a much simpler mechanism for a schedule of proof expirations with no need for a message from storage providers to update proof validity.

In brief:

No change to the miner actor expiration queue, so it processes only commitment expirations and faults as today.
Add a fixed-size array (AMT) of 540 buckets of sector numbers to each miner actor, each representing sectors assigned to one day in the 1.5-year schedule. Assign each sector to bucket (ActivationEpoch / 2880) % 540 when proven.
When a sector is terminated, remove it from the bucket (which can be calculated from the ActivationEpoch loaded from SectorOnChainInfo during termination processing).
In the event of a PoRep flaw, in a network upgrade:
1. a mechanism for replacement sealing would be added. One of the possible proposals for replacement sealing adds a new sector and performs a non-fault termination of the old sector.
2. after some grace period (probably the 60-day proof refresh window of the original proposal), the miner actor’s deadline maintenance will start processing the entries from this AMT, terminating the sectors in bucket (CurrentEpoch / 2880) % 540. This AMT becomes a 540-day queue of sectors to terminate.
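The bucket schedule above can be sketched as follows. This is a rough illustration, not the miner actor's code: the class and method names are hypothetical, and plain Python sets stand in for the AMT of sector-number collections. The integer division by 2880 groups every epoch of one day into the same bucket (the list above describes each bucket as one day), so the bucket index cycles once per 540 days rather than once per 540 epochs.

```python
# Hypothetical sketch of the 540-day proof schedule; names are assumptions.
EPOCHS_PER_DAY = 2880   # one bucket per day, per the proposal
SCHEDULE_DAYS = 540     # roughly 1.5 years


def bucket_for(epoch: int) -> int:
    """Map an epoch to its day bucket in the 540-day cycle."""
    return (epoch // EPOCHS_PER_DAY) % SCHEDULE_DAYS


class ProofSchedule:
    """Per-miner schedule; sets stand in for the on-chain AMT of sector numbers."""

    def __init__(self) -> None:
        self.buckets = [set() for _ in range(SCHEDULE_DAYS)]

    def on_proven(self, sector_number: int, activation_epoch: int) -> None:
        # Assign the sector to the bucket for its activation day.
        self.buckets[bucket_for(activation_epoch)].add(sector_number)

    def on_terminated(self, sector_number: int, activation_epoch: int) -> None:
        # The bucket is recomputed from the activation epoch, so no extra
        # per-sector state is needed to find it.
        self.buckets[bucket_for(activation_epoch)].discard(sector_number)

    def due_for_termination(self, current_epoch: int) -> set:
        # After a PoRep flaw and grace period, deadline maintenance would
        # terminate the sectors in the bucket for the current day.
        return set(self.buckets[bucket_for(current_epoch)])
```

In this sketch nothing reads the buckets during normal operation. Only a future upgrade responding to a flaw would start calling something like `due_for_termination`, which is what keeps the mechanism isolated from the existing expiration queues.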

Compared with the original implementation, this proposal:

Requires no action from storage providers at all, unless a PoRep flaw is actually found
Involves no potential additional penalty for SPs or data loss for clients
Is much simpler overall, a smaller net code change
Is an easier migration, as it does not write SectorOnChainInfos
Makes no change to the already-complex expiration queues
Is isolated from existing mechanisms, so that any implementation error is likely to have zero impact on normal operations
Increases the implementation work to be done if a PoRep flaw is actually found, in order to trigger the terminations (unless we choose to do that work sooner)

The most difficult part of this proposal is likely to be a data structure for an arbitrary-cardinality collection of sector numbers. A single bitfield has bounded capacity, so we need a collection of them with appropriate indexing and splitting logic.
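
One possible shape for such a collection, sketched in Python under stated assumptions: sector numbers are partitioned by range into fixed-width chunks, each chunk being one bounded bitfield. The name `ChunkedSectorSet` and the capacity `CHUNK_BITS` are hypothetical, and a plain integer bitmask stands in for the actors' real bitfield type. This shows only the indexing idea, not any splitting or rebalancing logic.

```python
CHUNK_BITS = 1024  # hypothetical capacity of a single bitfield


class ChunkedSectorSet:
    """A set of sector numbers stored as a map of chunk index -> bitmask."""

    def __init__(self) -> None:
        self.chunks = {}

    def add(self, sector_number: int) -> None:
        idx, bit = divmod(sector_number, CHUNK_BITS)
        self.chunks[idx] = self.chunks.get(idx, 0) | (1 << bit)

    def remove(self, sector_number: int) -> None:
        idx, bit = divmod(sector_number, CHUNK_BITS)
        mask = self.chunks.get(idx, 0) & ~(1 << bit)
        if mask:
            self.chunks[idx] = mask
        else:
            # Drop empty chunks so storage tracks the live sectors only.
            self.chunks.pop(idx, None)

    def __contains__(self, sector_number: int) -> bool:
        idx, bit = divmod(sector_number, CHUNK_BITS)
        return bool((self.chunks.get(idx, 0) >> bit) & 1)

    def __len__(self) -> int:
        return sum(bin(mask).count("1") for mask in self.chunks.values())
```

Partitioning by sector-number range keeps lookups to a single chunk; whether that is the right indexing for on-chain storage is exactly the open design question.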

Migration

One-time migration to populate the schedule with existing sectors. Requires reading sector infos, but not writing them.

Constraints

FIP-0047 is already approved, implemented, and scheduled for activation in nv19. It’s scheduled there in order to unblock extensions to the maximum sector commitment duration, such as FIP-0052 (approved) or FIP-0056. We don’t intend to impact the timeline of those changes being activated. Thus any alteration to FIP-0047 needs to be implemented and tested in a very short amount of time.

We have reason to believe that this proposal is simple enough to be implemented on such a tight timeline, and still result in a mechanism that is better for SPs and simpler and safer than the existing mechanism.

In the worst case, we can just ship the existing FIP-0047 implementation if necessary.

2 replies

arajasek Feb 23, 2023
Maintainer

Thanks for the write-up! I like this proposal, and it's SO much simpler (and so much nicer for miner UX). I feel comfortable saying this is better than (the current) FIP-0047, and should be preferred. Some thoughts / questions:

Assign each sector to bucket (ActivationEpoch / 2880) % 540 when proven.

What do we achieve by dividing by 2880 here? Is it any different to just use ActivationEpoch % 540?

In the event of a PoRep flaw, in a network upgrade:

a mechanism for replacement sealing would be added. One of the possible proposals for replacement sealing adds a new sector and performs a non-fault termination of the old sector.

after some grace period (probably the 60-day proof refresh window of the original proposal), the miner actor’s deadline maintenance will start processing the entries from this AMT, terminating the sectors in bucket (CurrentEpoch / 2880) % 540. This AMT becomes a 540-day queue of sectors to terminate.

I think this makes sense, and I'm okay delaying the details (both spec and impl) of the replacement mechanism. I would like the proposal to be specific about point ii, though (eg. firmly landing on the 60-day window). I'd like to see that logic implemented too as part of this, but am okay deferring that in the interest of time, so long as we have an agreed-to process in the spec.
There's an interesting question about how to go about this from a governance perspective. I would personally treat this as a brand-new FIP that supersedes FIP-0047, even though that causes an Accepted FIP to go into Deferred. I think we should be comfortable making that change if a "better" idea comes along.

Kubuxu Feb 23, 2023
Collaborator

What do we achieve by dividing by 2880 here? Is it any different to just use ActivationEpoch % 540?

It preserves the design of FIP-0047, which buckets terminations in daily buckets based on the activation period. Dividing by 2880 guarantees that the terminations follow the same pattern as sector onboarding modulo 1.5 years. It replicates the onboarding pattern, which is globally quite smooth and predictable.
Taking just the modulo 540 would be equivalent to taking onboarding modulo 4.5h, which IMO (I don't have solid data on it yet) is much less predictable. I'm playing with ideas for alternative cheap mapping functions from the sector activation to the termination bucket.
The most solid approach would be a hash function but, in my view, the cost outweighs the benefit.
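
The 4.5h figure can be checked with a few lines of Python. This is a worked example rather than proposed code: the function names are made up for illustration, and it relies only on the 30-second epoch (2880 epochs per day).

```python
EPOCH_SECONDS = 30  # Filecoin epoch duration
BUCKETS = 540


def bucket_daily(activation_epoch: int) -> int:
    # Mapping from the proposal: one bucket per day of activation.
    return (activation_epoch // 2880) % BUCKETS


def bucket_plain(activation_epoch: int) -> int:
    # Alternative from the question: one bucket per epoch.
    return activation_epoch % BUCKETS


def cycle_hours(epochs_per_bucket: int) -> float:
    # Time for the mapping to wrap around all 540 buckets.
    return epochs_per_bucket * BUCKETS * EPOCH_SECONDS / 3600


print(cycle_hours(1))          # plain modulo wraps every 4.5 hours
print(cycle_hours(2880) / 24)  # daily buckets wrap every 540 days
```

With the plain modulo, sectors activated in consecutive epochs land in different buckets and the pattern repeats every 4.5 hours, so the termination schedule would reflect time-of-day effects rather than the long-run onboarding curve.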

anorth · 2023-03-02T01:34:19Z

anorth
Mar 2, 2023
Maintainer Author

@Kubuxu and I have now realised an even simpler solution: we can do nothing after all. We have convinced ourselves that in fact we can make no code or state changes now, and still establish an orderly off-boarding of sectors over a defined period of time in the event that becomes necessary. The new mechanism for doing so is even more friendly to SPs, giving them more control over which sectors to terminate or replace and in what sequence.

Background

FIP-0047 aims to provide a security mechanism for the network in case a flaw is discovered in the PoRep algorithm used to commit sectors. It establishes a rolling 1.5 year schedule for expiring the validity of the PoRep proof for every sector, requiring each to be replaced or terminated. We thought that this schedule must be implemented prior to any sector being committed for >1.5 years and any such flaw being discovered, as (a) doing so afterwards would involve a large and complicated state migration in the middle of a presumably fragile network and intense time pressure to fix the actual flaw, and (b) implementing the schedule now clearly communicates to SPs how such a flaw would be handled, so they can mitigate risk as they see fit.

However, we have since realised that an actual schedule (i.e. sequence of sectors and epochs) is not required. We need a scheme for orderly replacement or termination of the sectors with old proofs, but we do not need the specific schedule to be established ahead of time. Instead, we can simply require that all SPs terminate or replace some fraction of their old sectors per proving period, such that all of them are terminated after 540 days. The network doesn't care which ones happen when, and the SP can choose those sectors that are most valuable (e.g. those with deals first). We just need to count them.

Motivation for change

Product impact: All schemes which establish a schedule for termination remove freedom for the SP in arranging their re-sealing operations. While "in order of ActivationEpoch % 1.5y" is a natural sequence, it might not match the SP's utility function at all, and is far from evenly distributed network-wide. An even distribution can be obtained with "SectorNumber % 540", but this essentially randomises the sequence for the SP, also unaligned with utility. There's no clear reason for the network to prefer one sequence to another (both the proposals were arbitrary), except to prefer whatever the SP prefers to keep their operation as intact as possible in the circumstances.

Complexity: Doing nothing is much simpler! Also the code to enforce this scheme will be much simpler than either prior proposal. This does postpone all of the work to enforce the off-boarding until such time as we might discover and respond to a PoRep flaw. We can choose to do some of the implementation work ahead of time and leave it dormant.

Proposal

In the event of a PoRep flaw, a network upgrade will introduce a new proof type and associated code, and specify a start epoch ResealStartEpoch and period ResealPeriod during which to require mandatory off-boarding/replacing of existing sectors. In a migration associated with that upgrade:

Add a BadProofSectorCount field to the miner info, initialised to the number of activated sectors (calculated from deadline infos).
Add a TerminatedBadProofSectorCount field to the miner info, set to 0.

Then the following changes are made to miner actor operational code:

Whenever a sector expires or is terminated (natural expiration, unrecovered fault, manual termination), if it has an old seal proof then increment TerminatedBadProofSectorCount.
- [Probably] Give a gas rebate for manual termination of such sectors
During deadline cron, if TerminatedBadProofSectorCount/BadProofSectorCount < (CurrentEpoch - ResealStartEpoch) / ResealPeriod then mark all sectors in the deadline as faulty.
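
The deadline cron condition can be sketched as follows, in Python for brevity. This is an assumption-laden illustration, not actor code: `MinerInfo` and `behind_schedule` are hypothetical names, the fraction comparison is cross-multiplied to stay in integer arithmetic, and the clamp of elapsed time to `ResealPeriod` is an addition of this sketch (without it, a miner that had off-boarded every old sector would still compare as behind once the period had passed).

```python
from dataclasses import dataclass


@dataclass
class MinerInfo:
    bad_proof_sector_count: int                 # set once, in the migration
    terminated_bad_proof_sector_count: int = 0  # incremented on expiry/termination


def behind_schedule(info: MinerInfo, current_epoch: int,
                    reseal_start_epoch: int, reseal_period: int) -> bool:
    """True if the miner's deadline should be marked faulty in cron."""
    if info.bad_proof_sector_count == 0:
        return False  # nothing to off-board
    # Assumption: clamp elapsed time to [0, reseal_period].
    elapsed = min(max(current_epoch - reseal_start_epoch, 0), reseal_period)
    # terminated / bad < elapsed / period, cross-multiplied to avoid division.
    return (info.terminated_bad_proof_sector_count * reseal_period
            < elapsed * info.bad_proof_sector_count)
```

For example, a miner with 1000 old sectors and a 540-day period must have off-boarded at least 500 of them by day 270 to avoid faults.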

This scheme lets the SP choose to terminate/replace sectors in whatever sequence they wish, so long as they at least keep up with a rate of uniform progress that would replace all their sectors within the specified period. If the SP lacks sealing capacity they will need to manually terminate, rather than replace, some sectors, but they can choose which ones. The inducement to an SP to keep up is that their deadlines will be faulted if they don't. This puts those sectors on a fixed timeline to forced termination anyway, but costs more in fees than just terminating them up front.

One notable difference of this scheme from the previous ones is that the forced termination is an explicit SP action (unless they let deadlines fault), which means they pay gas. Prior schemes did the forced termination of unreplaced sectors in cron. This may be a case where we consider a gas rebate for part of the gas costs of manual termination to be appropriate. Such a rebate is a network subsidy, but in this case where the network is requiring the SPs to re-seal, in order to recover from a network flaw, extracting additional payment seems undesirable (the gas would represent a transfer from SPs to token holders). We must charge at least some gas, though, to keep from overloading block validation too much.

Next steps

While FIP-0047 is already approved and scheduled for activation, it should be either re-written or replaced. The motivation of clearly communicating to SPs how such a flaw is expected to be handled remains important. FIP-0047 could be reduced to an informational one describing this policy.

8 replies

arajasek Mar 2, 2023
Maintainer

I think this makes sense, and meets its goals. I think there are 2 questions to be asked about whether we can in fact "do nothing" now:

Will it be safe to migrate? I think the answer's yes. Calculating BadProofSectorCount for all miners isn't trivial, but is easier than a lot of what we've done in the past.
Will it take us long to implement this? Probably not, but we might want to consider prototyping it ahead of time anyway.

anorth Mar 2, 2023
Maintainer Author

@arajasek I think we probably should implement and ship this code (but with the parameters set appropriately to no-op), probably as part of a new FIP.

jennijuju Mar 2, 2023
Maintainer

@anorth two qq

whats the resealing period?
any mitigation in the case where a good portion of SPs refuse to upgrade(retire the bad sectors)? Especially we are seeing fips propose extending sector life time to 3.5, 5, 10 years

arajasek Mar 2, 2023
Maintainer

Hmm, that's interesting. I think that would cause us to have 2 miner-wide migrations (one setting every BadProofSectorCount to 0, and a second at the actual "event" upgrade). But I think that's fine.

anorth Mar 3, 2023
Maintainer Author

whats the resealing period

We can't decide this until we know the severity of any flaw that's being mitigated

any mitigation in the case where a good portion of SPs refuse to upgrade(retire the bad sectors)

If you're talking about SPs refusing to perform the network upgrade that activates this policy, I'm afraid that's a question that's beyond scope here, and not new. If you're talking about SPs that refuse to manually retire sectors, they will all be faulted and then forcibly terminated.
