
rfc: remove the UCAN stream #39

Open · alanshaw wants to merge 5 commits into main
Conversation

@alanshaw (Member) commented Jan 23, 2025

@hannahhoward (Member) left a comment

+1 in general. I am all for simplification.

I do want to suggest some alternate ways to think about this before we simply blame the queue, because we could find ourselves not resolving the real problems.

I think it's worth thinking about when synchronous and asynchronous operations make sense in general.

Generally I would say it's best to default to synchronous operations for simplicity.

However, synchronous operations do have a limit, and they do create coupling. In the worst cases they lead to really complex, big handlers that are hard for newcomers to decipher, because you have to onboard all the logic to determine what's happening. They also imply transactionality -- the whole operation succeeds or the whole operation fails -- which, as stated in the RFC, is important in many cases. But it can again add to complexity if a single handler is required to be able to roll back anywhere in a long series of steps. And as mentioned there are speed issues. For everything I'm aware of us doing, I don't think speed is an issue yet. But there are contexts where you are more constrained -- such as if we ever put this on a blockchain.

Secondly, there are points in between pure synchronous operations and pure pub/sub message queues. I tend to dislike pure pub/sub message queues because they obscure coupling by treating every message as an open-ended broadcast, as noted in the RFC. However, there are job queues and/or asynchronous commands -- these keep the coupling explicit: you design messages to say exactly what you expect to happen, but it can still happen asynchronously. The Filecoin pipeline is a good example of this, and uses simpler services like SQS and SNS. The difference here is that messages are commands or information sent to specific asynchronous actors, as opposed to open broadcasts that can be picked up by anyone. (When in doubt, the actor pattern never fails.)

For the future, one thing I think would help us make smarter decisions is to consider all of the issues we've had with the queue whenever we decide we have to go asynchronous. These issues aren't so much impossible or inherent as they require more consideration. There are lots of ways to propagate errors asynchronously -- pushing errors via websockets for example, or polling. You just have to run through a tight checklist of "how am I going to address each concern of an asynchronous system?"

Ultimately, taken in sum, on a base level I agree: too much premature complexity here, let's lose it. But let's be intentional about our designs -- understanding the tradeoffs of synchronous and asynchronous designs, as they each have pitfalls.

Also, one thing I think it is super important not to lose is keeping a record of all of our receipts in S3, and eventually in user space. This would be my one blocking change.
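
For illustration, a minimal sketch of the "asynchronous command" shape described above, assuming a hypothetical `usage/delta` message and queue (neither the queue URL nor the message shape exists in the codebase): the message names exactly what the single consumer is expected to do, rather than broadcasting a receipt for anyone to interpret.

```ts
import { SQSClient, SendMessageCommand } from '@aws-sdk/client-sqs'

const sqs = new SQSClient({})

/**
 * Enqueue an explicit command for one specific consumer.
 * The queue URL and message shape here are illustrative assumptions.
 */
export async function enqueueUsageDelta (
  queueUrl: string,
  delta: { space: string; cause: string; bytes: number }
) {
  await sqs.send(new SendMessageCommand({
    QueueUrl: queueUrl,
    // The body says exactly what should happen, not "here is a receipt, do what you like".
    MessageBody: JSON.stringify({ type: 'usage/delta', ...delta })
  }))
}
```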

* Async post-processing tasks, i.e. tasks that don't have a direct effect on the outcome of the original invocation, can be easily lost since there is no record of them being scheduled, succeeding or failing.
* Async tasks that depend on the receipt of a different invocation are brittle to changes in the receipt. The tasks are often "far away" from changes to the receipt they depend on and changes to the receipt may not be known to break the task until much later (or worst case at runtime in production).
* Unless the async tasks account strictly for receipts they have seen, they are brittle to duplication.
* Tasks that depend on the receipt of a different invocation can exert influence over its content. For example, in billing we discovered that storing the same blob multiple times was causing double counts. This is undesirable - a user should not be charged multiple times for the same item of data. At the time the _easiest_ solution was to alter the receipt the billing service relied upon to include the number of allocated bytes. Essentially if the blob already exists in the space then the allocated bytes are 0, but if the blob does not then the allocated bytes are equal to the size of the blob. It sounds sensible, but the change to the receipt is not needed by the `upload-api` sub-system at all. It is only relevant to the `billing` sub-system and worse, there is no way to know this is the case. It is a loose coupling between `billing` and `upload-api` sub-systems which is brittle for reasons stated above. A better solution might be to define specific usage invocations for billing and to invoke them at the appropriate point when an addition is known to affect space size.
Member

Just to point out: is this truly a coupling? We're talking about a receipt for what the upload-api did. In fact, it did something different in the two different cases, and I think it makes sense that, to the extent it was trying to communicate to any arbitrary party about what it did, it should be transparent about whether it allocated new data or just found existing data.

Member Author

Yes, agreed. I can see how that info is useful/transparent, but we don't currently surface it anywhere on the client. It was originally added for billing purposes though, that much is true.

We've observed a few issues with this:

* Stream processing errors lead to incorrect counts that are never rectified.
* In the specific case of billing, we've discovered that some receipts (that are required for billing) are added onto the stream multiple times, due to the fact that they are embedded in multiple responses. Specifically the `web3.storage/blob/allocate` receipt is added onto the stream when a blob is allocated, and then again when it is included in the response for `space/blob/add`. Since [this change](https://github.com/storacha/w3infra/pull/380), we have been double counting space usage additions.
Member

To be clear, this isn't a queue problem so much as a bad message/event design problem.

Generally if I'm publishing events, a consumer should have a pretty good idea that either:

  1. events won't get published twice, or
  2. as a consumer I will have a mechanism for detecting duplicates or otherwise processing them idempotently
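
For the second case, a minimal sketch of idempotent receipt handling, assuming a hypothetical `processed-receipts` DynamoDB table keyed by receipt CID (the table name and key shape are illustrative, not the actual schema): a conditional write turns duplicate deliveries into no-ops.

```ts
import {
  DynamoDBClient,
  PutItemCommand,
  ConditionalCheckFailedException
} from '@aws-sdk/client-dynamodb'

const dynamo = new DynamoDBClient({})

/**
 * Process a receipt at most once. The conditional write fails if the receipt
 * CID has already been recorded, so a duplicate delivery is simply skipped.
 * In a real system the marker write and the side effect would need to be
 * ordered or made atomic; this only illustrates the dedupe check.
 */
export async function processReceiptOnce (receiptCid: string, handle: () => Promise<void>) {
  try {
    await dynamo.send(new PutItemCommand({
      TableName: 'processed-receipts',
      Item: { receipt: { S: receiptCid } },
      ConditionExpression: 'attribute_not_exists(receipt)'
    }))
  } catch (err) {
    if (err instanceof ConditionalCheckFailedException) return // already seen this receipt
    throw err
  }
  await handle()
}
```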

@alanshaw (Member Author) Jan 24, 2025

Yeah, the problem in this case was that events were never published twice before, because we didn't include receipts for other completed async tasks in responses. Duplication was simply not an issue (your option 1), but then it unfortunately became an issue and we didn't consider the consequences.

Member Author

As noted above, the linked PR actually introduced the change where ALL invocations and receipts in the CAR were added to the stream, not just the ones linked from the AgentMessage block. Two reviewers (myself included) missed this subtle change in the PR and its consequences.


* Async post-processing tasks i.e. tasks that don't have a direct effect on the outcome of the original invocation, can be easily lost since there is no record of them being scheduled, succeeding or failing.
Member

not an inherent limitation of queues, but definitely not something we are handling or surfacing properly.

@alanshaw (Member Author) Jan 24, 2025

Yes, but I think your suggestion is explicitly placing a task on a queue and processing it. There is visibility that this has been done, and you can ensure its completion. This is different from placing arbitrary data onto a queue and attaching multiple tasks to process it, i.e. generic queue, specific task(s) vs. specific queue, specific task...

* Async tasks that depend on the receipt of a different invocation are brittle to changes in the receipt. The tasks are often "far away" from changes to the receipt they depend on and changes to the receipt may not be known to break the task until much later (or worst case at runtime in production).
Member

There are a couple of things here:

  • code proximity is real -- on a team this size with people floating everywhere, I agree it makes sense for everyone to see everything, and asynchronous events make it natural to not be aware of logic
  • "receipt not known to break a task till later" -- there are ways to mitigate this, perhaps via types/testing, at least with certain types of queues (perhaps not Kinesis) -- but yes, it's an inherent problem.

@alanshaw (Member Author)

> However, synchronous operations do have a limit, and they do create coupling. In the worst cases they lead to really complex, big handlers that are hard for newcomers to decipher, because you have to onboard all the logic to determine what's happening.

👍 yes, but I think in this case it's not going to blow out the complexity level.

> They also imply transactionality -- the whole operation succeeds or the whole operation fails -- which, as stated in the RFC, is important in many cases. But it can again add to complexity if a single handler is required to be able to roll back anywhere in a long series of steps. And as mentioned there are speed issues. For everything I'm aware of us doing, I don't think speed is an issue yet. But there are contexts where you are more constrained -- such as if we ever put this on a blockchain.

> Secondly, there are points in between pure synchronous operations and pure pub/sub message queues. I tend to dislike pure pub/sub message queues because they obscure coupling by treating every message as an open-ended broadcast, as noted in the RFC. However, there are job queues and/or asynchronous commands -- these keep the coupling explicit: you design messages to say exactly what you expect to happen, but it can still happen asynchronously. The Filecoin pipeline is a good example of this, and uses simpler services like SQS and SNS. The difference here is that messages are commands or information sent to specific asynchronous actors, as opposed to open broadcasts that can be picked up by anyone. (When in doubt, the actor pattern never fails.)

Yeah, I think we're on the same page. The tasks we're talking about here simply insert/update a record in the DB - there's little time saving to be had by putting this on a queue and then processing. Some milliseconds perhaps. Either way, explicitly queuing or inserting is more desirable than those tasks being triggered from a generic receipt queue, especially when the tasks must be performed. Granted, for metrics it's not the end of the world if that doesn't happen, but I think a whole Kinesis stream just for 2 metrics is overkill.

> For the future, one thing I think would help us make smarter decisions is to consider all of the issues we've had with the queue whenever we decide we have to go asynchronous. These issues aren't so much impossible or inherent as they require more consideration. There are lots of ways to propagate errors asynchronously -- pushing errors via websockets for example, or polling. You just have to run through a tight checklist of "how am I going to address each concern of an asynchronous system?"

> Ultimately, taken in sum, on a base level I agree: too much premature complexity here, let's lose it. But let's be intentional about our designs -- understanding the tradeoffs of synchronous and asynchronous designs, as they each have pitfalls.

Yes, in general I try to do that. I'm not trying to make mistakes, but they do happen and will happen again. I think there were good intentions with the UCAN stream (which I tried to enumerate in the intro) and it maps pretty well to blockchain events. However, in hindsight I think we used it wrong (for billing): we should have been explicitly registering space size changes at the point they happen.

Again, billing was put together pretty hastily and inspecting the receipts felt like the right thing to do at the time.

> Also, one thing I think it is super important not to lose is keeping a record of all of our receipts in S3, and eventually in user space. This would be my one blocking change.

👍 yeah, I'll add a note to clarify, but the proposal is to remove the stream, not the agent message store itself (i.e. we keep the bucket stores).
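
To illustrate what "explicitly registering space size changes at the point they happen" might look like, here is a minimal sketch that writes a usage delta directly to DynamoDB when a blob is registered or deregistered. The table name and record shape are assumptions for illustration, not the actual schema.

```ts
import { DynamoDBClient } from '@aws-sdk/client-dynamodb'
import { DynamoDBDocumentClient, PutCommand } from '@aws-sdk/lib-dynamodb'

const dynamo = DynamoDBDocumentClient.from(new DynamoDBClient({}))

/**
 * Write a space usage delta at the point the change happens, instead of
 * deriving it later from receipts on a stream. Table name and attributes
 * are illustrative assumptions.
 */
export async function recordSpaceUsageDelta (space: string, cause: string, delta: number) {
  await dynamo.send(new PutCommand({
    TableName: 'space-usage-delta',
    Item: {
      space,                                // DID of the space the delta applies to
      cause,                                // CID of the invocation that caused it
      delta,                                // positive on register, negative on deregister
      insertedAt: new Date().toISOString()
    }
  }))
}
```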


### Mid-term

1. Insert space usage deltas directly into the DynamoDB table from the `BlobRegistry` at the point where items are added or removed.


Just to confirm, the BlobRegistry will be available after storacha/w3infra#453 is merged, right? Also, are the register and deregister methods only used in blob/accept and blob/remove, respectively?
About the error handler, what’s currently being done with failed requests? Are they stored somewhere so we can reprocess them later? Are we satisfied with how things are, or should it also be discussed?

Member Author

> Just to confirm, the BlobRegistry will be available after storacha/w3infra#453 is merged, right?

Correct.

> Also, are the register and deregister methods only used in blob/accept and blob/remove, respectively?

Correct.

> About the error handler, what's currently being done with failed requests? Are they stored somewhere so we can reprocess them later? Are we satisfied with how things are, or should it also be discussed?

What requests are you referring to?

Failed invocations are surfaced by the client, so it is up to the caller to retry. We use p-retry internally when it makes sense.
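
A minimal sketch of how `p-retry` is typically used around a call like this; `invoke` is a hypothetical placeholder, not an actual function in the codebase.

```ts
import pRetry from 'p-retry'

// Placeholder for whatever performs the UCAN invocation; it should throw on failure.
async function invoke (): Promise<{ ok: boolean }> {
  return { ok: true }
}

// Retry a few times (exponential backoff by default) before surfacing the error to the caller.
const receipt = await pRetry(() => invoke(), {
  retries: 3,
  onFailedAttempt: err => console.warn(`attempt ${err.attemptNumber} failed`)
})
console.log(receipt)
```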
