This repository has been archived by the owner on Aug 28, 2020. It is now read-only.
Currently a `Fetch` request must match all the qualifiers of a previously `Push`ed blob for that blob to be found. This doesn't line up with how BuildStream plans to use the Remote Asset API. In https://gitlab.com/BuildStream/buildstream/-/issues/1274 it's suggested that BuildStream will use different sets of qualifiers for Push and Fetch requests, with the fetch qualifiers being a subset of those pushed. As a result, sources Pushed by BuildStream will never be found when BuildStream Fetches them.
Our implementation currently uses the list of all qualifiers as part of the input to a hash, which is in turn used as the key for a BlobAccess. Because any difference in the qualifiers produces a different key, matching on just a subset of qualifiers is a non-trivial change. I'll outline a few possibilities for squaring this circle that have come to mind:
1. Allow a fetcher to set which qualifiers it deems important enough to hash.
As BuildStream presumably (going by the linked discussion) plans to have the qualifiers it Fetches with well defined per source type, we may be able to add some API to specify which qualifiers a given fetcher finds important. Given we'll likely need a whole bunch of different fetchers adapted to specific fetching cases anyway, this isn't so bad, although it does potentially focus on this one use case to the detriment of generality.
We could expose this as configuration for server operators to maintain, at the risk of ballooning configuration.
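As a rough illustration of option 1, here is a minimal Python sketch, not the project's actual code. The `HASHED_QUALIFIERS` table, the `asset_key` function, and the source type names are all hypothetical, and the qualifier names are used purely as examples. It shows how hashing only a configured subset of qualifiers lets a Fetch with fewer qualifiers land on the same key as the original Push.

```python
import hashlib

# Hypothetical per-source-type configuration: which qualifier names
# contribute to the lookup key. Everything here is illustrative.
HASHED_QUALIFIERS = {
    "git": {"vcs.commit"},
    "tar": {"checksum.sri"},
}


def asset_key(uri, qualifiers, source_type):
    """Hash the URI plus only the qualifiers selected for this source type."""
    selected = HASHED_QUALIFIERS.get(source_type, set())
    hasher = hashlib.sha256()
    hasher.update(uri.encode())
    # Iterate in sorted order so the key does not depend on qualifier ordering.
    for name in sorted(qualifiers):
        if name in selected:
            hasher.update(b"\0" + name.encode() + b"\0" + qualifiers[name].encode())
    return hasher.hexdigest()


uri = "https://example.com/repo.git"
pushed = {"vcs.commit": "abc123", "vcs.branch": "main"}
fetched = {"vcs.commit": "abc123"}

# The extra pushed qualifier is ignored, so both requests share a key.
same_key = asset_key(uri, pushed, "git") == asset_key(uri, fetched, "git")
```

The cost is visible in the table at the top: every source type needs an entry, which is the configuration growth mentioned above.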
2. Hash only the URI, and mitigate the collisions
This way we retrieve assets based only on the URI, and add logic to handle matching the qualifiers. This keeps things general, and means that we match qualifiers consistently in all cases. However, it will cost some performance, since every asset pushed for the same URI now shares a key and the qualifiers have to be compared on each lookup.
To mitigate collisions, I see two possibilities: use a form of cuckoo hashing, or modify the `AssetStore` to store a list of `Asset`s corresponding to the URI, along with their qualifiers. Cuckoo hashing has the benefit of allowing us to expire references automatically as part of the `Put`, but it is more complex and means more I/O, as we have to read from the underlying blob store more often. Making the `AssetStore` take a list of `Asset`s will allow us to load only once, but may cause the entries to grow a lot in size, and will also mean that we have to modify the content stored in the blob store under a single digest.
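The list-based variant could look roughly like the following Python sketch. This is an assumption-laden toy, not the real `AssetStore`: the class below is an in-memory stand-in where a dict plays the role of the blob store, and the `push`/`fetch` method names, the "most recent match wins" rule, and the digest strings are all hypothetical.

```python
import hashlib


class AssetStore:
    """Toy in-memory stand-in: one entry list per URI key (hypothetical API)."""

    def __init__(self):
        self._entries = {}  # URI hash -> list of (qualifiers, digest)

    @staticmethod
    def _key(uri):
        # Only the URI is hashed; qualifiers are matched after loading.
        return hashlib.sha256(uri.encode()).hexdigest()

    def push(self, uri, qualifiers, digest):
        entries = self._entries.setdefault(self._key(uri), [])
        # Replace an entry with identical qualifiers instead of duplicating it.
        entries[:] = [e for e in entries if e[0] != qualifiers]
        entries.append((dict(qualifiers), digest))

    def fetch(self, uri, qualifiers):
        wanted = qualifiers.items()
        # Walk newest first; a stored asset matches when the requested
        # qualifiers are a subset of the qualifiers it was pushed with.
        for stored, digest in reversed(self._entries.get(self._key(uri), [])):
            if wanted <= stored.items():
                return digest
        return None


store = AssetStore()
uri = "https://example.com/repo.git"
store.push(uri, {"vcs.commit": "abc123", "vcs.branch": "main"}, "digest-1")
found = store.fetch(uri, {"vcs.commit": "abc123"})
missing = store.fetch(uri, {"vcs.commit": "def456"})
```

Note that `push` rewrites the whole list stored under the URI key, which is the "modify the content stored under a single digest" concern raised above, and that the list grows with every distinct qualifier set pushed for a URI.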
Of these possibilities, I'm leaning towards extending the `AssetStore` to store a list of `Asset`s.