Prevent a possible race condition in blocking requests #3656
Labels
area/api
Issue or PR related to the Infra API.
kind/improvement
A report of a quality problem, or a change that addresses a quality problem.
status/never-stale
Indicates to actions/stale that the issue or PR should never be marked stale.
The new blocking requests (aka requests used for long polling) have (I believe) a possible race condition between retrieving the next ID from the
seq_update_index
sequence, and committing the transaction that uses that ID. Note I haven't actually reproduced this race, so it's possible that it's unlikely in practice. We should try to reproduce before attempting to fix the problem.Problem
There is no synchronization between
nextval('seq_update_index')
and committing of the transaction that uses the value. This means that if twoCreateGrant
orDeleteGrant
requests are made at the same time, it's possible for them to be ordered like this:nextval('seq_update_index')
to get value 1000nextval('seq_update_index')
to get value 1001notify
, and the blocking query reads value 1001notify
, and the blocking query reads value 1001 (which means it has missed the update with id 1000)This is only a problem if both requests (CreateGrant1 and CreateGrant2) are for the same organization and same destination (or in the future same org same user may also be a problem). If either the org or destination are different, then the race doesn't break anything.
Given that the query has to receive the notification and then repeat a query, it seems like this race is very unlikely, because even if it happens, as long as the second commit completes before the blocking request does its query, the problem will be hidden. None the less, I think it's important we have some way to fix it.
Proposed Solution
One way I've found to fix this is by using locks. https://www.postgresql.org/docs/current/explicit-locking.html says this:
Which sounds like exactly the scenario we are in.
I'm not sure yet exactly which locking mode we need, but I think what we want is a lock that synchronizes the
nextval('seq_update_index')
with the commit of the transaction. Since locks are automatically released on commit, all we need is to acquire the right lock before doing aCreateGrant
orDeleteGrant
.The lock does not need to be global, it should be per-organization, so one org will never block other orgs. We could get even more fine grained locks by destination and user, but I think it's unlikely we need that complexity.
I don't think we want to lock the
grants
tables, and we can't lock individual rows in thegrants
table, because inserts will not be on an existing row.One idea is to either use the
organizations
table (since the lock is scoped to an org), or to create a new table used exclusively for this purpose. The table would have at least one row per org (but if we end up with differentupdateIndex
sequences maybe we add more).The solution would be for every
CreateGrant
andDeleteGrant
to do this:I believe that will prevent any races, and also ensure that one org can never block another.
If an org starts running into problems, we can direct them to the new bulk grants edit endpoint. That endpoint allows multiple CREATE/DELETE in the same transaction, which allows an org to get a lot more throughput without lock contention.
The text was updated successfully, but these errors were encountered: