Proposal: object storage building block #18
# Object storage building block

* Author(s): @ItalyPaleAle
* State: Draft
* Updated: 2022-12-14
* Original proposals / discussions: dapr/dapr#4808, dapr/dapr#4934

## Overview

This is a design proposal for a new "object storage" building block which allows Dapr users to store and retrieve unstructured data of arbitrary size.

We define **objects** as sequences of unstructured data that should be assumed to be binary and possibly large (many MBs or even GBs in size). Examples include images and videos, Office documents, etc.

## Background

Dapr currently offers the state store building block, which allows storing (mostly) unstructured data and is backed by services that include object storage services (e.g. AWS S3, Azure Blob Storage), in addition to databases of various kinds.

However, as its name implies, the Dapr state store building block is optimized for storing state, such as KV pairs and small payloads. Due to design and implementation decisions made over the years, including the need to support a variety of backends, Dapr state stores are not suitable for working with large blobs of (opaque) data: they buffer the entire payload in memory multiple times and, depending on the component, can perform various kinds of transformations on the data. With the current Dapr state stores, users trying to store "large" blobs (many MBs to GBs) have a very poor experience, ranging from bad performance all the way to exhausting the memory of the host system running Dapr.

This building block aims to let users store data of arbitrary size, treated in a completely opaque way by Dapr, in a way that is performant and scalable. This will be guaranteed by certain design decisions, such as:

- Supporting only backends that are optimized for storing objects, such as Azure Blob Storage, AWS S3, S3-compatible endpoints, and perhaps the local filesystem.
- All APIs are streaming-first, optimized to work with data as a stream, so Dapr never needs to buffer the entire payload in memory.
## Implementation Details

### Building block interface

The proposal involves creating a new `objectstore` building block with the following interface:

```go
type ObjectStore interface {
	// GetObject retrieves an object with name key and writes it to the out stream.
	// If the object doesn't exist, err will be ErrNotFound.
	// Return value tagOut contains an etag or a similar concurrency-control tag.
	GetObject(ctx context.Context, key string, out io.Writer) (md ObjectMetadata, tagOut any, err error)
	// SetObject stores an object with name key, reading data from the in stream.
	// Parameter tag is optional and can be used to pass an etag or any similar concurrency-control tag.
	// In case of conflict (e.g. etag mismatch), the method returns ErrConflict.
	SetObject(ctx context.Context, key string, in io.Reader, tag any, md ObjectMetadata) (tagOut any, err error)
	// DeleteObject deletes an object with name key.
	// Parameter tag is optional and can be used to pass an etag or any similar concurrency-control tag.
	// In case of conflict (e.g. etag mismatch), the method returns ErrConflict.
	DeleteObject(ctx context.Context, key string, tag any) (err error)
}

// Metadata associated with an object.
type ObjectMetadata map[string]string

// Error constants:
var ErrNotFound = errors.New("...")
var ErrConflict = errors.New("...")
```

> **Review discussion (listing objects):**
>
> What about listing objects?
>
> I am unsure about listing objects at this point. The problem is that it's a very hard API to build in a scalable, performant, and consistent way. There could be hundreds of thousands or millions of files in a storage account / bucket, and returning them all is not practical, both for Dapr and for the service. Users will try to list all objects in a storage account or bucket (it's inevitable!) and that will cause problems. The "obvious" solution here is to add pagination, which means we limit the number of objects we return in each list response, but that's actually pretty complicated in practice, because some backends have no native support for pagination (local filesystems), and those that do are inconsistent in how it's implemented. It's also very hard to do pagination in a consistent way (in the sense of data consistency), because if files are modified between pagination requests, the result is undefined. My preference for now would be to move forward without listing objects, so this (already very important) proposal isn't delayed due to complications in the design and implementation of listing. It is something we could add later on, if we can figure out the correct way to do it.
>
> "Very important" based on what? That's a very subjective statement; I can't relate to it as a reviewer (and in this case, a voter), and any sort of urgency expressed here is redundant to the proposal. I don't feel too strongly about listing objects and merely wanted to know if we view listing objects as an integral part of the overall experience.
>
> Alright, it wasn't clear to me that proposals did not have to be accepted and implemented in toto. I am still unsure how to correctly implement, and even design, an API for listing; I will need to let this sit with me for a bit and perhaps even make some POCs. But I'm happy to accept suggestions here, and I can update this proposal if someone has good ideas and/or experience with designing such an API. Also, no particular urgency (in fact, the initial proposal was made in June); just hoping the entire proposal doesn't get stalled because of lack of clarity on the listing API.
>
> That should not happen, at least from my perspective.
>
> Regarding the list operation, I agree that this particular operation for most any object store is very costly and inefficient (and therefore largely unused, except by individuals / tools doing administrative operations such as cleaning up stale data). A common solution is keeping a secondary index stored elsewhere in something that is queryable, so I think it's okay not to add it to the building block.
>
> Listing is super important, but perhaps consider it an "optional" feature of the underlying storage? For example, local storage on disk can list files very quickly and cheaply, while it may be a burden for S3/Azure. To avoid paging, why not make the listing a stream that you can consume at your convenience?

> **Review discussion (pre-signed URLs):**
>
> What about getting a public URL or reference to the object? For example, S3/Azure/(probably GCP) allow getting a pre-signed URL that can be accessed for some time. That's incredibly useful when you want to "share" objects with other systems/clients in a secure way.
>
> I think that could be a good suggestion, but I would consider it an optional method, as I'm not sure all blob storage services have a concept of pre-signed URLs.
The `key` parameter is the name of the object. Unlike with state stores, there are no key prefixes when operating with objects: applications have full access to all data in the storage service.

Each object can have metadata associated with it, which is normally passed to clients as headers (more on that below). In particular, `Content-Type` is just another metadata key, treated no differently from anything else.

> Option: we could support auto-detection of the content type if users pass `"auto"` as the value, using a library such as [ItalyPaleAle/file-type-stream-go](https://github.com/ItalyPaleAle/file-type-stream-go) (which I'd be very happy to transfer to the Dapr org).

The only exception is the `tag`, which is passed separately from the metadata object. The reason is that in the `SetObject` method, the tag is both an (optional) parameter and a return value. Note, however, that support for tags / ETags is not a requirement if the underlying storage service doesn't support them.
When invoking `GetObject` or `SetObject`, the output and input streams (respectively) are created outside of the state store and are owned by the caller. The `GetObject` and `SetObject` methods return synchronously only after all data has been written or read (respectively); after that, the caller can call `Close()` on the streams (if necessary/appropriate).
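To illustrate this calling convention, here is a minimal, hypothetical sketch of how a caller (the Dapr runtime or a test harness) might drive a component implementing this interface. The function name, file paths, and the `store` variable are assumptions for the sketch, not part of the proposal:

```go
package example

import (
	"context"
	"log"
	"os"
)

// copyObjectToFile illustrates the calling convention: the caller creates and owns
// the output stream and passes it to GetObject, which writes to it as data arrives
// from the backing service. ObjectStore and ObjectMetadata are the types defined
// above; "store" stands for any concrete component implementing them.
func copyObjectToFile(ctx context.Context, store ObjectStore, key, path string) error {
	f, err := os.Create(path)
	if err != nil {
		return err
	}
	defer f.Close()

	// GetObject returns only once the whole object has been written to f.
	md, tag, err := store.GetObject(ctx, key, f)
	if err != nil {
		return err // ErrNotFound if the object doesn't exist
	}
	log.Printf("retrieved %q: Content-Type=%s tag=%v", key, md["Content-Type"], tag)
	return nil
}
```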
Notes:

- This building block is stream-first. It works with input and output data as binary streams and does not perform any transformation or encoding on the data. Thanks to using streams, there are no limits on the size of the input/output data, bypassing `MaxBodySize`.
- Dapr only supports the "last-write-wins" concurrency pattern for object storage (unlike with state stores). This makes sense considering the intended usage of object storage, which is storing documents that usually do not change. When documents do change, using last-write-wins is consistent with how regular filesystems work, for example.
- Dapr will not calculate and store the checksum of objects. This is because checksums must be transmitted as a header, but Dapr sends data to the store in a streaming way, so it can't compute the full checksum until the end of the stream. Apps that want to store checksums (such as the `Content-MD5` header) should compute them beforehand and submit them as a metadata header, as in the sketch below.
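For example, an app could compute the `Content-MD5` value with a first pass over the data before uploading it. A minimal sketch (the function name is illustrative; whether the backing service verifies or stores the header depends on the component):

```go
package example

import (
	"crypto/md5"
	"encoding/base64"
	"io"
	"os"
)

// contentMD5 computes the base64-encoded MD5 digest of a file, the format
// conventionally used for the Content-MD5 header. The file then has to be read
// again (or reopened) for the actual upload.
func contentMD5(path string) (string, error) {
	f, err := os.Open(path)
	if err != nil {
		return "", err
	}
	defer f.Close()

	h := md5.New()
	if _, err := io.Copy(h, f); err != nil {
		return "", err
	}
	return base64.StdEncoding.EncodeToString(h.Sum(nil)), nil
}
```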
> **Review discussion (checksums and trailing metadata):**
>
> Can metadata also be submitted via a trailer? Imagine that I have a massive file that I want to compute the checksum for: if I can only send it as a header, I must traverse the file twice, once to compute the checksum and again to submit it to the object store. I think it would make sense to allow trailing metadata and/or being able to submit metadata separately.
>
> How about adding a method to replace metadata on a blob? Using trailers is a less-ideal option because it means you can't use streaming and need to buffer the entire blob in memory, because SDKs need to have the metadata when the call is made.
### HTTP APIs

Applications that interact with object storage using HTTP can leverage these APIs, which follow a more REST-like pattern than the typical state store APIs.

> These APIs have been modeled after the APIs supported by AWS S3 and Azure Blob Storage.

#### Retrieving an object

To retrieve an object, clients make a GET request:
> **Review discussion (HEAD requests):**
>
> I would also recommend allowing a HEAD request if I just want to get the metadata.
>
> I think that's a good suggestion.
```text
GET http://localhost:<port>/v1.0-alpha1/objects/<component>/<key>
```

Where `<component>` is the name of the object store component, and `<key>` is the key (name) of the object.

> **Review discussion (naming: "objects" vs. "blob"):**
>
> So, we would call it the "objects" API and not "blob"? I think "blob" is clear about the unstructured nature of it, while "object" reminds me of JSON. It might just be me, but food for thought.
>
> I had the same doubt. I landed on "object" because the industry-standard term seems to be "object storage", but I agree that "blob" does sound better. I'm fine either way.
>
> It would be great for people to vote on the naming for this here.
>
> I think the points you two just mentioned should be part of the overview: both "blob" and "object" are commonly used to refer to the concept at hand, and the industry usually uses "object". I will say that "blob" makes it non-ambiguous to me what you are targeting with this building block.
>
> Arguably "object storage" contains more than just a binary blob: there's often metadata attached to it (AWS and Google), whereas "blob storage" usually refers to an opaque large binary object, so I think naming matters. Are we purely offering the opaque data, or something that requires metadata to be associated with it?
>
> Metadata can be added (as described in the doc) but isn't required.
>
> I'd go for "object storage" here, since metadata, while optional, can play a major part in how the API behaves, and this matches the object storage notion in AWS and GCP, which also treats metadata as an optional construct.
>
> Then I would agree that it feels more like an object store than a blob store, which is fine because objects can be blobs (i.e. no metadata associated), but blobs can't really be objects by their accepted definitions.

A successful response will be:

- Status: `200 OK`
- Headers: each metadata key is a header; see below for a comment on headers. Additionally, the `ETag` is passed as a response header too.
- Body: the body of the response is the raw bytes of the object, which are sent to the client in a streamed way as soon as they are retrieved from the store (that is: Dapr does not buffer the data in memory before sending it to the client).

An error response will have:

- Status:
  - `404 Not Found` for objects not found
  - `400 Bad Request` if the state store doesn't exist or the key parameter is missing
  - `500 Internal Server Error` for all other errors
- Headers: the response will have `Content-Type: application/json`
- Body: a JSON-encoded payload including more details on the error
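To make the streaming behavior concrete, here is a short, hypothetical Go client sketch that downloads an object straight to a file over this API. The port, component name, key, and function name are placeholders, not part of the proposal:

```go
package example

import (
	"fmt"
	"io"
	"net/http"
	"os"
)

// downloadObject streams an object from the Dapr sidecar straight into a local
// file, so the payload is never held fully in memory. It returns the ETag from
// the response headers, which can be used later for concurrency control.
func downloadObject(daprPort int, component, key, path string) (etag string, err error) {
	url := fmt.Sprintf("http://localhost:%d/v1.0-alpha1/objects/%s/%s", daprPort, component, key)
	resp, err := http.Get(url)
	if err != nil {
		return "", err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return "", fmt.Errorf("unexpected status: %s", resp.Status)
	}

	f, err := os.Create(path)
	if err != nil {
		return "", err
	}
	defer f.Close()

	// Metadata is returned as response headers; the body is copied chunk by chunk.
	if _, err := io.Copy(f, resp.Body); err != nil {
		return "", err
	}
	return resp.Header.Get("ETag"), nil
}
```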
#### Storing an object

To store an object, clients make a PUT request:

```text
PUT http://localhost:<port>/v1.0-alpha1/objects/<component>/<key>
```

Where `<component>` is the name of the object store component, and `<key>` is the key (name) of the object.

The request body contains the raw bytes of the object, which are read by Dapr in a streamed way and sent to the object store as soon as they are available (once again, Dapr does not buffer the data in memory before storing it).

> **Review discussion (sharing a bucket between the state and object APIs):**
>
> If a user has a state store component that is used both for KV and blob, will a key from one API override the other? If so, this might create inconsistencies that are very hard to reason about, especially during runtime.
>
> We should recommend that users not use the same storage accounts / buckets with multiple building blocks. I don't see this as much different from someone using, for example, Postgres as a Dapr state store and also interacting with the database directly using the Postgres APIs.
>
> I see a great difference, because in a "Dapr state Postgres + direct Postgres" scenario, saving the same key would actually work as expected by default, whereas with Dapr state on S3 and Dapr objects on S3 you get values being overridden by default.
>
> I am not following, sorry. What do you mean?
>
> If you're calling a Dapr S3 state store for saving state with key …
>
> Right, that's the same as using a Dapr Postgres state store to save state with key … The documentation should not encourage users to use multiple building blocks to access the same resources.
>
> Save state with Dapr for key … So if a user saves two different values, with and without Dapr, and performs a GET call with and without Dapr, the values would be taken from two different keys, as expected. That's not the case with Dapr state and object storage, where both operations target …
>
> Best to incorporate into the proposal as a note, then.
> **Review discussion (object size and multipart uploads):**
>
> It may be a good idea to include the object size in this request. I am not certain about other platforms, but S3, for example, requires you to use the multipart upload mechanism if the object to be stored exceeds the 5GB put operation maximum, which would require knowledge of the expected size so that the component can use the right API.
>
> I'm afraid that's not possible, because knowing the object size ahead of time would require the client to have all the data already assembled in memory. This won't work for data that is generated on-the-fly and streamed to the object storage service. When implementing this, we can err on the side of caution and always leverage multipart upload if that's the most universal solution.
>
> I think it is possible to stream files to S3 without knowing the size of the file: https://stackoverflow.com/questions/8653146/can-i-stream-a-file-upload-to-s3-without-a-content-length-header?rq=1
>
> Yup, I have done it before (with the SDK), confirming it's possible.
>
> Yes, it's possible in S3 using multipart, but that requires separate permissions and a different implementation than a simple put operation. If the requirement is that the size is explicitly not known ahead of time, then that should be clear to component developers so that, for example, the S3 implementation can default to using multipart (or at least make it obvious to the developer what limitations it will have).
>
> Since metadata can only be submitted as a header, if you want to compute a checksum, you'll need to traverse the file once before you send it anyway.
>
> We could make object size an optional parameter. The caller can pass it using a … header.
Additionally, metadata keys are passed as header values; see below for a comment on headers. An `ETag` can be passed as a request header too for concurrency control; it is ignored if the object doesn't already exist. If an ETag is not specified, the write will always succeed (unless other errors occur).
A response will be:

- Status:
  - `200 OK` for successful operations
  - `400 Bad Request` if the state store doesn't exist or the key parameter is missing
  - `409 Conflict` if the object already exists and there's an ETag mismatch
  - `500 Internal Server Error` for all other errors
- Headers: the response will have `Content-Type: application/json`
- Body:
  - For successful responses: a JSON-encoded payload containing the ETag and the number of bytes stored.
  - In case of errors: a JSON-encoded payload including more details on the error
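As with retrieval, uploads can be driven without buffering the payload. The following hypothetical Go sketch streams a local file to the PUT endpoint, passing a common header in canonical form and a custom metadata header; the port, content type, and the `x-dapr-category` key are illustrative assumptions:

```go
package example

import (
	"fmt"
	"net/http"
	"os"
)

// uploadObject streams a local file to the object storage API. The file handle is
// used directly as the request body, so the payload is never fully loaded in memory.
func uploadObject(daprPort int, component, key, path string) error {
	f, err := os.Open(path)
	if err != nil {
		return err
	}
	defer f.Close()

	url := fmt.Sprintf("http://localhost:%d/v1.0-alpha1/objects/%s/%s", daprPort, component, key)
	req, err := http.NewRequest(http.MethodPut, url, f)
	if err != nil {
		return err
	}
	// Common headers are passed in canonical form; custom metadata uses the x-dapr- prefix.
	req.Header.Set("Content-Type", "application/pdf")
	req.Header.Set("x-dapr-category", "invoices") // hypothetical custom metadata key

	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return fmt.Errorf("unexpected status: %s", resp.Status)
	}
	// The JSON body contains the new ETag and the number of bytes stored (not parsed here).
	return nil
}
```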
#### Deleting an object

To delete an object, clients make a DELETE request:

```text
DELETE http://localhost:<port>/v1.0-alpha1/objects/<component>/<key>
```

Where `<component>` is the name of the object store component, and `<key>` is the key (name) of the object.

An `ETag` can be passed as a request header too for concurrency control. If an ETag is specified, the operation will fail if the ETag doesn't match; the ETag is optional, and when it's missing there's no concurrency control.
A response will be:

- Status:
  - `204 No Content` for successful operations
  - `400 Bad Request` if the state store doesn't exist or the key parameter is missing
  - `409 Conflict` if the object already exists and there's an ETag mismatch
  - `500 Internal Server Error` for all other errors
- Headers: error responses will have `Content-Type: application/json`
- Body:
  - For successful responses, the response body is empty
  - In case of errors: a JSON-encoded payload including more details on the error

> **Review discussion (read-only backends and additional status codes):**
>
> What about write-only object storage?
>
> What do you mean?
>
> For example, an HSM might be backing this that allows storing new key material and reading public keys, but not deletions. Or maybe the object is backed by a filesystem but the file is marked read-only. I'd suggest a `403 Forbidden` to handle the case where deletion isn't allowed. I'd also recommend a `410 Gone` if the content is already deleted and no longer exists.
>
> Support for HSMs is out of scope for this building block; it would more likely be aligned with the crypto and/or secret stores building blocks. But in general, I support adding more status codes to indicate various scenarios like the ones you describe.
>
> I was illustrating the error, but yeah, I agree.
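A sketch of a conditional delete that uses the concurrency tag returned by an earlier read; per the proposal text the tag travels in the `ETag` request header (the header name, port, and function name are placeholders, as above):

```go
package example

import (
	"fmt"
	"net/http"
)

// deleteObject deletes an object, optionally guarding the operation with an ETag
// obtained from a previous GET or PUT. A mismatched tag yields 409 Conflict.
func deleteObject(daprPort int, component, key, etag string) error {
	url := fmt.Sprintf("http://localhost:%d/v1.0-alpha1/objects/%s/%s", daprPort, component, key)
	req, err := http.NewRequest(http.MethodDelete, url, nil)
	if err != nil {
		return err
	}
	if etag != "" {
		req.Header.Set("ETag", etag) // omit for an unconditional delete
	}

	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return err
	}
	defer resp.Body.Close()

	switch resp.StatusCode {
	case http.StatusNoContent:
		return nil
	case http.StatusConflict:
		return fmt.Errorf("etag mismatch for %q", key)
	default:
		return fmt.Errorf("unexpected status: %s", resp.Status)
	}
}
```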
### gRPC APIs

For applications using gRPC to interact with Dapr, the `Dapr` service is expanded to include:
```proto3
// The "Dapr" service already exists
service Dapr {
  // GetObject retrieves an object.
  rpc GetObject(GetObjectRequest) returns (stream GetObjectResponse) {}

  // SetObject stores an object with name key.
  rpc SetObject(stream SetObjectRequest) returns (SetObjectResponse) {}

  // DeleteObject deletes an object with name key.
  rpc DeleteObject(DeleteObjectRequest) returns (DeleteObjectResponse) {}
}

// GetObjectRequest is the message to get an object from a specific object store.
message GetObjectRequest {
  // The name of the object store.
  string store_name = 1;

  // The key of the desired object.
  string key = 2;
}

// GetObjectResponse contains the retrieved object (this is used in a streamed response).
message GetObjectResponse {
  // The tag for concurrency control.
  string tag = 1;

  // The metadata stored with the object.
  map<string, string> metadata = 2;

  // Chunk of data.
  StreamPayload payload = 10;
}

// SetObjectRequest is the message to store an object in a specific object store (this is used in a streamed request).
message SetObjectRequest {
  // The name of the object store.
  string store_name = 1;

  // The key of the desired object.
  string key = 2;

  // The tag for concurrency control.
  string tag = 3;

  // The metadata which will be stored with the object.
  map<string, string> metadata = 4;

  // Chunk of data.
  StreamPayload payload = 10;
}

// SetObjectResponse contains the result of storing the object.
message SetObjectResponse {
  // The updated tag for concurrency control.
  string tag = 1;

  // The number of bytes written.
  int64 bytes = 2;
}

// DeleteObjectRequest is the message to delete an object in a specific object store.
message DeleteObjectRequest {
  // The name of the object store.
  string store_name = 1;

  // The key of the desired object.
  string key = 2;
}

// DeleteObjectResponse contains the result of deleting the object.
message DeleteObjectResponse {
  // Currently empty but allowing for future expansion.
}
```
#### Handling streams

`StreamPayload` is first introduced with dapr/dapr#4903 and corresponds to:

```proto3
// Chunk of data sent in a streaming request or response.
message StreamPayload {
  // Data sent in the chunk.
  google.protobuf.Any data = 1;

  // Set to true if this is the last chunk.
  bool complete = 2;
}
```
Data is sent from the application to Dapr (in `SetObject` RPCs) or from Dapr to the application (in `GetObject` RPCs) in a stream. Each message in the stream contains a `GetObjectResponse` or `SetObjectRequest` where:

- The first message in the stream MUST contain all other required keys.
- The first message in the stream MAY contain a `payload`, but that is not required.
- Subsequent messages (any message except the first in the stream) MUST contain a `payload` and MUST NOT contain any other property.
- The last message in the stream MUST contain a `payload` with `complete` set to `true`. That message is assumed to be the last one from the sender, and no more messages are to be sent in the stream after that.
> **Review discussion (trailing metadata in the stream):**
>
> See above, but I'd also recommend being able to send any other metadata at the end. For example, content-length, content-hash, or whatever else is important to the application that can only be computed once the entire file is consumed.
>
> I will let the server populate content-hash and/or content-length. For checksumming during upload, if the service and SDK support that, it can be done transparently by the component. Because components will most likely need to perform chunked uploads, the checksum should be included in each chunk anyway.
The amount of data contained in `payload.data` is variable and it's up to the discretion of the sender. In service invocation calls, as implemented by dapr/dapr#4903, chunks are at most 4KB in size, although senders may send smaller chunks if they wish. Receivers must not assume that messages will contain any specific number of bytes in the payload.

Note that it's possible for senders to send a single message in the stream. If the data is small and could fit in a single chunk, senders MAY choose to include a `payload` with `complete=true` in the first message. Receivers should assume that single message to be the entire communication from the sender.
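As an illustration of these rules, here is a hypothetical client-side sketch of a `SetObject` call using Go bindings generated from the proto above. The import path and package alias (`runtimev1pb`), the way raw chunk bytes are carried in the `Any` value, and the helper name are assumptions made for the sketch, not part of the proposal:

```go
package example

import (
	"context"
	"io"

	"google.golang.org/protobuf/types/known/anypb"

	runtimev1pb "example.com/generated/dapr/runtime/v1" // hypothetical generated bindings
)

// setObjectStream sends an object to Dapr following the rules above: the first
// message carries store_name, key, and metadata; every later message carries only
// a payload; the final message has complete = true. It returns the new tag.
func setObjectStream(ctx context.Context, client runtimev1pb.DaprClient, store, key string, md map[string]string, in io.Reader) (string, error) {
	stream, err := client.SetObject(ctx)
	if err != nil {
		return "", err
	}

	// First message: all required keys, no payload.
	if err := stream.Send(&runtimev1pb.SetObjectRequest{
		StoreName: store,
		Key:       key,
		Metadata:  md,
	}); err != nil {
		return "", err
	}

	buf := make([]byte, 4<<10) // 4KB chunks, matching service invocation streaming
	for {
		n, readErr := in.Read(buf)
		if n > 0 || readErr == io.EOF {
			if err := stream.Send(&runtimev1pb.SetObjectRequest{
				Payload: &runtimev1pb.StreamPayload{
					// Assumption: raw chunk bytes are carried in the Any's value.
					Data:     &anypb.Any{Value: append([]byte(nil), buf[:n]...)},
					Complete: readErr == io.EOF,
				},
			}); err != nil {
				return "", err
			}
		}
		if readErr == io.EOF {
			break
		}
		if readErr != nil {
			return "", readErr
		}
	}

	resp, err := stream.CloseAndRecv()
	if err != nil {
		return "", err
	}
	return resp.Tag, nil
}
```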
### Metadata and headers

Metadata properties are passed between the client and server as headers.

There can be two types of headers:

- Custom ones have the `x-dapr-` prefix in requests sent between Dapr and the app. When they are stored, they generally do not have the `x-dapr-` prefix, which is removed (although this could depend on the implementation of the state store).
- Certain common headers are exchanged between apps and Dapr in their canonical form. The way these are stored in the state store depends on the service and implementation:
  - `Last-Modified`
  - `Content-Length`
  - `Content-Type`
  - `Content-MD5`
  - `Content-Encoding`
  - `Content-Language`
  - `Cache-Control`
  - `Origin`
  - `Range`
- Additionally, the `ETag` header is exchanged in the request/response headers, although it's not included in the metadata object.

> **Review note (on the `x-dapr-` prefix):** IMHO, I don't have an answer here, just something to consider.
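To make the header conventions concrete, here is a sketch of one way a component or the runtime might derive stored metadata from request headers. This is illustrative only; the function name, the lowercasing of custom keys, and the exact handling of common headers are assumptions left to the implementation:

```go
package example

import (
	"net/http"
	"strings"
)

// commonHeaders lists the headers that are exchanged in their canonical form.
var commonHeaders = []string{
	"Last-Modified", "Content-Length", "Content-Type", "Content-MD5",
	"Content-Encoding", "Content-Language", "Cache-Control", "Origin", "Range",
}

// metadataFromHeaders sketches how incoming request headers could be turned into
// stored object metadata: x-dapr-* headers become custom metadata keys with the
// prefix stripped, while the common headers above are kept as-is.
func metadataFromHeaders(h http.Header) map[string]string {
	md := make(map[string]string)
	for name, values := range h {
		if len(values) == 0 {
			continue
		}
		lower := strings.ToLower(name)
		if strings.HasPrefix(lower, "x-dapr-") {
			// e.g. "X-Dapr-Category: invoices" is stored as metadata key "category".
			md[strings.TrimPrefix(lower, "x-dapr-")] = values[0]
			continue
		}
		for _, c := range commonHeaders {
			if strings.EqualFold(name, c) {
				md[c] = values[0]
				break
			}
		}
	}
	return md
}
```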
> **Review discussion (motivation and alternatives):**
>
> What is really missing from this proposal is the why (why users would care about this), the actual problems it helps solve, and a mention of the alternatives. All these sections are indicated in the proposed template and need to be added here for clarity.
>
> Added.
>
> Thanks, can you mention the alternatives? I have additional comments that would best go under that section.
>
> Besides Dapr state stores (which I sort of mentioned), what other alternatives do you think are worth including? Are non-Dapr solutions (like libraries that abstract various object storage services) something I should include?
>
> Exactly. What a user would do without Dapr (for example, going with a native SDK) is important. It was included in other proposals, and it serves as a really great way to know whether the value added for the user is worth both the maintenance effort for the project and the cost for the users who need to run it. (Dapr has overhead, and we need to make sure APIs/features we add have considerable added value.)