CartoDB Surrogate Keys
The surrogate key concept comes from the database world. From Wikipedia: A surrogate key in a database is a unique identifier for either an entity in the modeled world or an object in the database.
With that concept in mind, we can tag and control resources with a surrogate key: any request that has associated resources will be tagged with their surrogate keys, and those surrogate keys allow us to invalidate the cached requests in our cache layers.
For the built-in Varnish cache layer, we invalidate manually from Windshaft and the SQL API using a regular expression.
Surrogate-Keys, however, are designed with other platforms in mind: those that support individual hash keys, such as Fastly or Varnish with the libvmod-xkey plugin.
Currently we are using another header for cache invalidation, the X-Cache-Channel header.
It has the format: ${DB_NAME}:${TABLE_NAME}[,${TABLE_NAME}...]
That enables us to tag and invalidate resources based on a user database name and a list of user tables associated with the resources. However, not all resources are associated with those entities (database names + database tables), so we are restricted to caching and invalidating resources that we can associate with them.
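To make the X-Cache-Channel format concrete, here is a minimal sketch that splits a header value into its database name and table list. The function name parse_cache_channel and the sample value are hypothetical and only serve as illustration; this is not taken from Windshaft or the SQL API.

```python
from typing import List, Tuple


def parse_cache_channel(value: str) -> Tuple[str, List[str]]:
    """Split an X-Cache-Channel value into (db_name, [table_name, ...])."""
    # The database name is everything before the first colon.
    db_name, _, tables = value.partition(":")
    # Table names are comma-separated; empty entries are ignored.
    table_names = [t.strip() for t in tables.split(",") if t.strip()]
    return db_name, table_names


# Made-up header value, for illustration only.
print(parse_cache_channel("user_db:table_a,table_b"))
# prints ('user_db', ['table_a', 'table_b'])
```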
We could integrate the surrogate key concept inside that header, but for the sake of simplicity (and to be compatible with solutions other than Varnish) we've decided to migrate all caching to a Surrogate-Key header which follows the individual key format.
Fastly's Surrogate-Key header has a fixed length limitation of 16,384 bytes, so we are going to set our limit to the same value for compatibility.
In order not to waste space, surrogate keys should be as short as possible. Really short keys make collisions more likely, though, so keys have to be long enough to keep collisions rare while staying as short as possible. Check git's abbreviated hash changes.
We use namespaces to differentiate between different Surrogate-Key types to mitigate collisions and to add visibility (so we can determine which kind of object is behind a Surrogate-Key).
Surrogate Keys have the format:
N:KEY
where N is one or two letters specifying the namespace the object belongs to, and KEY is obtained by hashing the input of the cache pointer as:
substring(base64(sha256(input)), 0, 6)
For instance, for a named map resource we will use OWNER_NAME:NAMED_MAP_NAME as the input.
So for a named map with owner=foo and name=bar the surrogate key will be n:p2Wovq. With the n namespace it is easier to see that the key has a named map resource associated, and collisions are limited to other named maps rather than other resource types.
For an implementation example, check Windshaft-cartodb's cache/model/named_maps_entry.js.
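The recipe above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the Windshaft-cartodb code: the function name surrogate_key is hypothetical, the input is assumed to be encoded as UTF-8, and base64 is assumed to be applied to the raw SHA-256 digest bytes rather than to its hex representation.

```python
import base64
import hashlib


def surrogate_key(namespace: str, cache_input: str, length: int = 6) -> str:
    """Build N:KEY as substring(base64(sha256(input)), 0, length)."""
    # Assumption: hash the UTF-8 bytes of the input.
    digest = hashlib.sha256(cache_input.encode("utf-8")).digest()
    # Assumption: base64-encode the raw digest, not its hex string.
    encoded = base64.b64encode(digest).decode("ascii")
    return f"{namespace}:{encoded[:length]}"


# Named map owned by "foo" and called "bar", in the "n" namespace.
print(surrogate_key("n", "foo:bar"))
```

The key is deterministic, so any service that knows the namespace and the input can recompute it when it needs to invalidate.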
It's very important to take into account that results we want to be cached can be evicted at any moment. This possibility already existed when just the LRU cache was used, and is slightly exacerbated by Surrogate-Keys, as there can be collisions.
Because of this, you should generally not rely on requests always being cached, and you should avoid one-off or variant responses that cannot be regenerated.
For instance, in the named maps example, although far from ideal, if we invalidate a layergroup instance from user foo because a named map from user bar results in the same surrogate key for the layergroup request, we can live with it because the tiles can be generated again.
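To get a rough feel for how likely such collisions are, the birthday-bound approximation can be applied to the 6-character base64 key. The sketch below assumes keys are uniformly distributed over the 64^6 possible values within a single namespace; the function name collision_probability and the sample sizes are illustrative only.

```python
import math

# 6 base64 characters carry 6 bits each, so there are 2**36 possible keys.
KEY_SPACE = 64 ** 6


def collision_probability(num_keys: int, key_space: int = KEY_SPACE) -> float:
    """Approximate chance that at least two of num_keys uniformly
    distributed keys collide: 1 - exp(-n(n-1) / (2N))."""
    return -math.expm1(-num_keys * (num_keys - 1) / (2 * key_space))


# Illustrative sizes for the number of objects in one namespace.
for n in (1_000, 100_000, 1_000_000):
    print(f"{n:>9} keys -> {collision_probability(n):.4%}")
```

The estimate applies per namespace, since keys in different namespaces cannot collide with each other.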
In order to transition from the current X-Cache-Channel header we would have to start tagging with several surrogate keys, one for each table associated with the request. Choosing a key with low collisions is very important here because table names will be very similar across many users. So it will probably require using more than one namespace.
The current usages of cache keys are:
- n namespace: named map, having as input USERNAME:NAMED_MAP_ID (see cache/model/named_maps_entry.js).
- t namespace: table, obtained as DATABASE_NAME:SCHEMA_NAME.TABLE_NAME (see database_tables.js).
  - Note that SCHEMA_NAME and TABLE_NAME are to be escaped as PostgreSQL identifiers: this means being wrapped in quotes if they contain a special symbol (such as -) or if the object has a special name.
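A key for the t namespace could be derived along these lines. This is a sketch, not the logic in database_tables.js: the quoting rule (plain lowercase identifiers stay bare, everything else is double-quoted), the tiny RESERVED_WORDS sample, and the names quote_identifier and table_key are all assumptions made for illustration.

```python
import base64
import hashlib
import re

# Hypothetical, deliberately incomplete sample of "special names".
RESERVED_WORDS = {"select", "table", "user", "order", "group"}
# Assumption: only lowercase letters, digits and underscores stay unquoted.
PLAIN_IDENTIFIER = re.compile(r"[a-z_][a-z0-9_]*")


def quote_identifier(name: str) -> str:
    """Wrap a PostgreSQL identifier in double quotes when it needs them."""
    if PLAIN_IDENTIFIER.fullmatch(name) and name not in RESERVED_WORDS:
        return name
    # Embedded double quotes are escaped by doubling them.
    return '"' + name.replace('"', '""') + '"'


def table_key(database: str, schema: str, table: str) -> str:
    """Build the t-namespace key from DATABASE_NAME:SCHEMA_NAME.TABLE_NAME."""
    cache_input = f"{database}:{quote_identifier(schema)}.{quote_identifier(table)}"
    digest = hashlib.sha256(cache_input.encode("utf-8")).digest()
    return "t:" + base64.b64encode(digest).decode("ascii")[:6]


print(quote_identifier("my_table"))  # prints my_table
print(quote_identifier("my-table"))  # prints "my-table"
print(table_key("user_db", "public", "my-table"))
```

Whatever rule is used, every service that tags or invalidates must apply the same escaping, otherwise the hashed inputs differ and the keys will not match.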
We use the following specific namespaces for objects:
- rv: namespace for referring to a visualization. The hash of the visualization ID has to be appended to it.
Generic namespaces:
- rj: namespace for viz.json pages.
- rp: namespace for cacheable public pages (embeddable maps/public map pages).
So, for example, a viz.json for a visualization with ID "foo" will have the Surrogate-Key: rj rv:LCa0a2
This way we could:
- invalidate everything related to a visualization by knowing its ID.
- or invalidate all public pages or all viz.json outputs.
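Putting several keys on one response could look like the sketch below, which joins keys with spaces (as in the rj rv:LCa0a2 example) and enforces the 16,384-byte limit mentioned earlier. The function name surrogate_key_header and the choice to raise an error when the limit is exceeded are assumptions for illustration; a real service might prefer to skip caching instead.

```python
from typing import List

# Limit adopted for compatibility with Fastly's Surrogate-Key header.
MAX_HEADER_BYTES = 16_384


def surrogate_key_header(keys: List[str]) -> str:
    """Join surrogate keys into a single space-separated header value."""
    # Drop duplicates while preserving the original order.
    unique_keys = list(dict.fromkeys(keys))
    value = " ".join(unique_keys)
    if len(value.encode("utf-8")) > MAX_HEADER_BYTES:
        raise ValueError("Surrogate-Key header exceeds 16,384 bytes")
    return value


# viz.json response for the visualization from the example above.
print(surrogate_key_header(["rj", "rv:LCa0a2"]))  # prints rj rv:LCa0a2
```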