rfc: egress tracking #37

Open
wants to merge 2 commits into
base: main
200 changes: 200 additions & 0 deletions rfc/egress-traffic.md
@@ -0,0 +1,200 @@
# RFC: Egress Traffic Tracking

## Authors

- Felipe Forbeck

## Status

- Proposed

## Context

Storacha Network aims to implement a scalable and automated mechanism to track egress traffic for customer billing purposes. The objective is to update Stripe’s API with relevant egress data for each customer, ensuring accurate billing. This RFC outlines two potential approaches to achieve this goal, each with its own set of trade-offs.

## Problem

Currently, there is no established automated process to track and associate egress traffic per customer and report this usage to Stripe for billing. A solution is required to automate this process for seamless, scalable, and accurate billing based on egress traffic.

## Proposal

We propose two alternatives for tracking and updating egress billing traffic:

### Alternative 1: SQS, Lambda, DynamoDB, and Stripe Integration

1. Introduce an SQS queue to receive egress event data from the Cloudflare Freeway Worker.
2. Use a Lambda function to process these events and store the egress data in a DynamoDB table, mapping the data to the respective customer.
3. The Lambda function will then update Stripe’s API with the corresponding egress data size for the customer.

```mermaid
graph TD
CF[Cloudflare Freeway Worker] --> SQS[SQS Queue]
SQS --> Lambda[Lambda Function]
Lambda --> DynamoDB[DynamoDB - Egress Data Storage]
Lambda --> Stripe[Stripe API]
```
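
The processing step in Alternative 1 could look roughly like the sketch below. It assumes each SQS message body is JSON with the hypothetical fields `customer` and `bytes`; the RFC does not define a message format. The DynamoDB write and the Stripe update are stand-ins (`deps.store`, `deps.report`), passed in as a second argument for illustration, whereas a real Lambda handler receives `(event, context)`.

```js
// Hypothetical message shape: { customer, bytes }. Not defined by the RFC.

/** Parse one SQS record body into an egress event, or return null if invalid. */
function parseEgressRecord(record) {
  try {
    const event = JSON.parse(record.body)
    if (typeof event.customer !== 'string') return null
    if (!Number.isSafeInteger(event.bytes) || event.bytes < 0) return null
    return event
  } catch {
    return null
  }
}

/** Sum egress bytes per customer across one SQS batch. */
function aggregateByCustomer(records) {
  const totals = new Map()
  const rejected = []
  for (const record of records) {
    const event = parseEgressRecord(record)
    if (event === null) {
      rejected.push(record.messageId)
      continue
    }
    totals.set(event.customer, (totals.get(event.customer) ?? 0) + event.bytes)
  }
  return { totals, rejected }
}

/**
 * Entry point for an SQS-triggered Lambda. `deps.store` and `deps.report`
 * are placeholders for the DynamoDB write and the Stripe update.
 */
async function handler(sqsEvent, deps) {
  const { totals, rejected } = aggregateByCustomer(sqsEvent.Records)
  for (const [customer, bytes] of totals) {
    await deps.store(customer, bytes)
    await deps.report(customer, bytes)
  }
  // Returning the IDs of unparseable messages lets SQS retry them or route
  // them to a dead-letter queue (requires ReportBatchItemFailures on the
  // event source mapping).
  return { batchItemFailures: rejected.map((id) => ({ itemIdentifier: id })) }
}
```

Aggregating per batch before calling Stripe keeps the number of outbound API calls proportional to the number of distinct customers in the batch rather than to the number of served resources.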

**Advantages:**

- Decoupled, event-driven architecture using SQS and Lambda, providing clear separation of responsibilities.
- Fine-grained control over event handling and processing.

**Disadvantages:**

- Requires managing additional infrastructure (SQS queue, Lambda function, DynamoDB table) solely for this purpose.
- Cloudflare Freeway Worker must be authorized to publish events to the SQS queue, adding complexity to authentication and authorization mechanisms.
- Increased operational overhead due to the need to monitor and maintain new infrastructure components.

### Alternative 2: Extend `w3up/upload-api` with `usage/record` Capability

1. Add a new capability, `usage/record`, to the `w3up/upload-api`.
2. Implement this capability in the `w3infra` repository to directly update Stripe’s API with the egress data size for each customer.
3. The Cloudflare Freeway Worker would invoke the `w3up/upload-api` every time a resource is served, passing the egress data size and customer details to trigger the Stripe update.
4. Ensure the Cloudflare Freeway Worker is authorized to invoke this API via delegation.
5. Each invocation will be logged, providing signed receipts of egress traffic.

```mermaid
graph TD
CF[Cloudflare Freeway Worker] --> w3up[w3up/upload API]
w3up --> Stripe[Stripe API]
w3up --> Logs[Logs - Egress Data Receipt]
```
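
To make the shape of a `usage/record` invocation more concrete, here is a plain-object sketch of what the Freeway Worker might send. The caveat names under `nb` (`customer`, `resource`, `bytes`, `servedAt`) and the choice of the space DID as the `with` resource are assumptions for illustration only; the RFC does not specify them.

```js
/**
 * Build a hypothetical `usage/record` capability.
 * All field names under `nb` are illustrative, not part of the RFC.
 */
function buildUsageRecord({ space, customer, resource, bytes, servedAt }) {
  if (!Number.isSafeInteger(bytes) || bytes < 0) {
    throw new RangeError('bytes must be a non-negative integer')
  }
  return {
    can: 'usage/record',
    with: space,
    nb: { customer, resource, bytes, servedAt },
  }
}

// Example values are placeholders.
const capability = buildUsageRecord({
  space: 'did:key:zSpacePlaceholder',
  customer: 'did:mailto:example.com:alice',
  resource: 'bafyPlaceholderCid',
  bytes: 5014,
  servedAt: '2024-10-16T12:00:00.000Z',
})
```

Validating `bytes` before the invocation is issued keeps malformed usage out of the signed receipts, which would otherwise be hard to correct after the fact.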

**Advantages:**

- No need to maintain additional infrastructure components such as SQS queues, Lambdas, or DynamoDB tables, as existing infrastructure already handles similar UCAN invocations.
- Simplified architecture by relying on current logs and receipts to record traffic data.
- Reduces operational complexity with fewer components to manage.

**Disadvantages:**

- Cloudflare Freeway Worker will need to be updated to interact with the `w3up/upload-api` for each served resource.
- Delegation mechanisms need to be set up and secured to ensure the worker is authorized to invoke the new API.

## Stripe Integration

To implement usage-based billing for Storacha Network, we have two options for integrating with Stripe: **Usage Records (Legacy)** and **Billing Meters (Recommended)**.

### 1. Usage Records (Legacy)

**Usage Records** allow manual tracking of customer usage and submitting it to Stripe for billing. However, Stripe is deprecating this feature, meaning it will soon be unsupported. Additionally, **it requires customers to have an active subscription** to start tracking their usage, adding complexity when handling pre-subscription events.

### 2. Billing Meters (Recommended)

**Billing Meters** are Stripe's modern solution for usage-based billing, designed to handle high-throughput event reporting and provide more flexibility. Key advantages include:

- **No Subscription Requirement:** Billing Meters allow usage tracking for customers without needing an active subscription, simplifying pre-subscription billing.
- **Real-Time Reporting:** Automatically tracks and aggregates usage events, ensuring accurate and timely reporting.
- **Scalability:** Optimized for high-volume usage reporting, making it ideal for Storacha Network’s needs.
- **No Need for Immediate Migration:** We do not have to migrate existing customers to a new price. We can begin tracking egress traffic with the Billing Meter and create new usage-based prices later.
- **Adding New Prices for Each Product:** At a future point, we will create new prices for each of our three products (**Starter, Lite, and Business**), based on tiers of egress data usage. Once these new prices are associated with the Billing Meter, Stripe will automatically track usage for billing.
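
As a rough sketch of what reporting egress to a Billing Meter could look like with `stripe-node`: the event name `egress_bytes` is hypothetical, and the payload keys `stripe_customer_id` and `value` assume the meter was created with Stripe's default key mapping.

```js
// Assumptions: the meter's event name is 'egress_bytes' (hypothetical) and the
// meter uses Stripe's default payload keys `stripe_customer_id` and `value`.
function buildMeterEvent({ stripeCustomerId, bytes, servedAt, identifier }) {
  return {
    event_name: 'egress_bytes',
    // Stripe enforces identifier uniqueness within a rolling window, which
    // guards against double-reporting when a call is retried.
    identifier,
    // Unix timestamp in seconds.
    timestamp: Math.floor(servedAt.getTime() / 1000),
    payload: {
      stripe_customer_id: stripeCustomerId,
      // Payload values are sent as strings.
      value: String(bytes),
    },
  }
}

/** `stripe` is an initialized stripe-node client. */
async function recordEgress(stripe, usage) {
  return stripe.billing.meterEvents.create(buildMeterEvent(usage))
}
```

Deriving `identifier` from something stable, such as the CID of the invocation receipt, would make the report idempotent across retries; that choice is a suggestion, not something the RFC prescribes.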

### Example Pricing Tiers:

1. **Egress Starter:**

- **0 to 5 GB:** Free
- **5 GB to 20 GB:** $0.10 per GB
- **20 GB+:** $0.07 per GB

2. **Egress Lite:**

- **0 to 10 GB:** Free
- **10 GB to 50 GB:** $0.08 per GB
- **50 GB+:** $0.06 per GB

3. **Egress Business:**

- **0 to 20 GB:** Free
- **20 GB to 100 GB:** $0.05 per GB
- **100 GB+:** $0.03 per GB
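
The sketch below works through the arithmetic for these example tiers. It assumes graduated pricing (each GB is charged at the rate of the tier it falls into), which the RFC does not state explicitly; under volume pricing the totals would differ. Prices are held in US cents to avoid floating-point rounding.

```js
// Tier boundaries in GB and prices in US cents per GB, taken from the example
// tiers. `upTo: Infinity` marks the open-ended last tier.
const EGRESS_TIERS = {
  starter: [
    { upTo: 5, centsPerGB: 0 },
    { upTo: 20, centsPerGB: 10 },
    { upTo: Infinity, centsPerGB: 7 },
  ],
  lite: [
    { upTo: 10, centsPerGB: 0 },
    { upTo: 50, centsPerGB: 8 },
    { upTo: Infinity, centsPerGB: 6 },
  ],
  business: [
    { upTo: 20, centsPerGB: 0 },
    { upTo: 100, centsPerGB: 5 },
    { upTo: Infinity, centsPerGB: 3 },
  ],
}

/** Graduated pricing: each GB is charged at the rate of the tier it falls in. */
function egressCostCents(plan, usageGB) {
  let cost = 0
  let lower = 0
  for (const { upTo, centsPerGB } of EGRESS_TIERS[plan]) {
    if (usageGB <= lower) break
    cost += (Math.min(usageGB, upTo) - lower) * centsPerGB
    lower = upTo
  }
  return cost
}
```

Under these assumptions, 30 GB on Egress Starter costs 5 GB free + 15 GB × $0.10 + 10 GB × $0.07 = $2.20, so `egressCostCents('starter', 30)` returns `220`.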

### Possible Solution

Since **Usage Records** are being deprecated by Stripe and **Billing Meters** offer greater flexibility and scalability, adopting **Billing Meters** for Storacha Network’s usage-based billing is recommended.

### Benefits of Billing Meters:

- **No Customer Disruption:** Existing billing remains intact while usage is tracked in the background through the Billing Meter.
- **Flexible Tiers:** New prices can be created for each product tier and associated with the Billing Meter for automated usage tracking.
- **Scalability:** This approach ensures that billing for egress traffic is accurate and scalable with minimal operational overhead.

## Questions & Answers

### 1. **Operational Complexity:**

- **Question:** Which approach requires less effort to maintain in the long run? Can the additional infrastructure of Alternative 1 be justified given its decoupled architecture?
- **Answer:** Alternative 2 requires less effort to maintain in the long run since it leverages existing infrastructure (`w3up/upload-api`), avoiding the need to set up new infrastructure components (SQS, Lambda, DynamoDB).

### 2. **Security:**

- **Question:** How should delegation and authentication mechanisms be handled, particularly in Alternative 2? What security implications arise from allowing the Cloudflare Freeway Worker to directly interact with the `w3up/upload-api`?
- **Answer:** Delegation can be handled by granting the Cloudflare Freeway Worker permission to invoke the required capabilities. The Gateway will need to become a `w3up` client to claim the delegation and execute the invocation securely.
- **Question:** Who will create the delegation to the Cloudflare Freeway Worker?
- **Answer:** ???

### 3. **Performance Impact:**

- **Question:** Will the real-time nature of API updates in Alternative 2 introduce latency concerns for the Cloudflare Freeway Worker? How does this compare to the event-driven model in Alternative 1?
- **Answer:** By using Cloudflare’s "fire and forget" context calls, the Freeway Worker can avoid blocking while serving the resource. This minimizes latency, allowing the request to complete without waiting for the billing events, making the performance impact negligible.
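
A minimal sketch of the "fire and forget" pattern, assuming it refers to the Workers `ctx.waitUntil()` call. `deps.serve` and `deps.recordEgress` are hypothetical helpers standing in for the existing Freeway serving logic and the new billing call; in a real Worker this function would be invoked from the `fetch` handler.

```js
/**
 * Serve a resource and record its egress without delaying the response.
 * `deps.serve` and `deps.recordEgress` are hypothetical helpers.
 */
async function handleRequest(request, env, ctx, deps) {
  const response = await deps.serve(request, env)
  const bytes = Number(response.headers.get('content-length') ?? 0)

  // waitUntil() keeps the Worker alive until the promise settles, but the
  // response below is returned without awaiting it.
  ctx.waitUntil(
    deps.recordEgress(env, { resource: request.url, bytes }).catch((err) => {
      // A failed billing call must not fail the request; log it instead.
      console.error('egress record failed', err)
    })
  )

  return response
}
```

Reading the size from `content-length` is a simplification: streamed or range responses may not carry that header, in which case the bytes actually written would need to be counted another way.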

### 4. **Logging and Monitoring:**

- **Question:** How can logs and monitoring be effectively implemented in both alternatives? Does Alternative 2’s reliance on signed receipts offer sufficient logging capabilities? Do we really need these signed receipts?
- **Answer:** ???

@travis (Member) commented on Oct 21, 2024:

I do think it's at least moderately important to have those signed receipts - I suspect customers are going to want as much information as possible about what they're being charged for. We might be able to just aggregate metrics around them and store in Glacier for the sake of minimizing costs, but having signed receipts for each request feels like something that will eventually be fairly useful?

@alanshaw is right that there will be a LOT of them and they'll probably have different retention policies than most other UCAN invocation receipts, though for the sake of getting this shipped we might want to use the upload API service for now and add a new service later?

It definitely might be worth adding a new service that uses a similar invocation processing pipeline and different retention policies - ideally we'd move egress tracking receipts from S3 hot storage to Glacier pretty quickly for these, faster than for most other receipts we store.


### 5. **Stripe Integration:**

- **Question:** How does the Stripe integration impact our current billing if we migrate prices?
- **Answer:** It won't disrupt current customer billing since we are adding a new price flow for egress traffic, separate from existing services. The existing customers will continue using their current plans without changes.
- **Question:** What are the downsides of integrating with the Billing Meters API?
- **Answer:** Billing Meters don’t support certain aggregations (`max`, `last_ever`, and `last_during_period`) and lack reporting capabilities directly in the Stripe Dashboard. However, these limitations can be managed through external monitoring tools.
- **Question:** How can we see a customer's egress usage in Stripe if there is no reporting dashboard for that customer?
- **Answer:** We will need to query the API and filter by meter ID, customer, and date range, e.g.:

```js
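// Assumes `stripe` is an initialized stripe-node client.
const meterEventSummaries = await stripe.billing.meters.listEventSummaries(
  '{{METER_ID}}',
  {
    customer: '{{CUSTOMER_ID}}',
    // Unix timestamps in seconds; start_time must precede end_time.
    start_time: 1729080000,
    end_time: 1729083600,
    // Either 'day' or 'hour'.
    value_grouping_window: 'hour',
  }
)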
```

Sample response (`aggregated_value` is the sum of egress data in bytes for the given grouping window):

```json
{
  "object": "list",
  "data": [
    {
      "id": "...",
      "object": "billing.meter_event_summary",
      "aggregated_value": 5014,
      "end_time": 1729083600,
      "livemode": false,
      "meter": "{{METER_ID}}",
      "start_time": 1729080000
    }
  ],
  "has_more": false,
  "url": "/v1/billing/meters/:id/event_summaries"
}
```
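
To turn such summaries into a single figure for a customer, the `aggregated_value` fields can be summed. The helper below is a sketch: it accepts either a plain array of summaries or the async-iterable list object that `stripe-node` returns, which pages through results when `has_more` is true. The name `totalEgressBytes` is ours, not Stripe's.

```js
/**
 * Sum `aggregated_value` (egress bytes) across meter event summaries.
 * `summaries` may be an array or any (async) iterable of summary objects.
 */
async function totalEgressBytes(summaries) {
  let total = 0
  for await (const summary of summaries) {
    total += summary.aggregated_value
  }
  return total
}
```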

We can also retrieve the upcoming invoice to find the total egress usage for the current subscription period, as long as the customer is subscribed to the new usage-based price. Stripe reflects the usage as the quantity of the invoice item for a given subscription item.

```js
const upcomingInvoice = await stripe.invoices.retrieveUpcoming({
  customer: "{{CUSTOMER_ID}}",
});
```

Stripe processes meter events asynchronously, so upcoming invoices might not immediately reflect recently received meter events.

## Recommendation

**Alternative 2** offers a more streamlined architecture by leveraging the existing `w3up/upload-api` and avoids introducing additional infrastructure components like SQS, Lambda, and DynamoDB. While it requires updates to the Freeway Worker and additional delegation controls, the reduced operational overhead makes this approach more sustainable.