sql: large latency spikes when creating a large storing index while running tpcc #45888
Investigation and discussion with @ajwerner seem to have led to some understanding of this. The SQL latency spikes observed can be directly correlated with spikes in transaction restarts. The intuition is the following: when we create a second primary key, we introduce a large amount of new artificial contention, because every txn that updates a row in the primary index also has to update the corresponding row in the index being built. Because of this, it seems that running tpcc with 2k warehouses while building a large storing index will overload a 3-node cluster. I see fewer latency spikes at 1k warehouses, and only 1 spike with 500 warehouses. However, it is unclear what users should do if they have a workload running and need to change their primary key.
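For concreteness, a minimal sketch of the scenario being described, using a hypothetical `accounts` table rather than the actual tpcc schema or the statements from this investigation:

```sql
-- Hypothetical table, standing in for a tpcc table with many columns.
CREATE TABLE accounts (
    id INT NOT NULL,
    region STRING NOT NULL,
    balance DECIMAL NOT NULL DEFAULT 0,
    PRIMARY KEY (id)
);

-- Session 1: change the primary key; under the hood this builds a second,
-- new primary index that stores every column of the table.
ALTER TABLE accounts ALTER PRIMARY KEY USING COLUMNS (region, id);

-- Session 2 (concurrently): while the new index is being backfilled, every
-- row update must write both the old primary index and the index under
-- construction, roughly doubling its write and contention footprint and
-- making transaction restarts more likely under load.
UPDATE accounts SET balance = balance + 100 WHERE id = 1;
```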
Interestingly, this relates to some other discussions about the inner workings of schema changes that perform backfills to new indexes. Here are some words that relate to several recent conversations.

Today we have two forms of backfill, the "index backfill" and the "column backfill". When we perform a column backfill we do that work in normal transactions, because we're writing to a live index. For an "index backfill" we're writing to an index that hasn't been written to (sort of); rather, the new index is filled in out of band, separately from foreground traffic. Unfortunately there's no access to this fanciness for the "column backfill".

The upshot of the column backfill is that it doesn't create a second index which foreground traffic needs to write to. This means that transactions which used to hit the 1PC optimization still do, and for transactions which are contended, the contention footprint is not doubled. As we can see in this specific issue, that contention increase can be a big deal. For the upcoming work on #9851 (changing data types), we've got what seems like two bad options.
The concrete proposal here is to buy into #36850 and give ourselves a third option.
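To make the two backfill paths described above concrete, a minimal hedged sketch on a hypothetical table (the names `kv`, `kv_v_idx`, and `note` are made up for illustration):

```sql
-- Hypothetical table, standing in for a tpcc table.
CREATE TABLE kv (k INT PRIMARY KEY, v STRING NOT NULL, extra STRING);

-- "Index backfill": CREATE INDEX fills the new index in out of band, but while
-- it is being built, foreground writes must also maintain it, which is the
-- doubled contention footprint discussed above.
CREATE INDEX kv_v_idx ON kv (v) STORING (extra);

-- "Column backfill": ADD COLUMN ... DEFAULT rewrites existing rows in ordinary
-- transactions against the live primary index; no second index is created, so
-- 1PC-eligible transactions stay 1PC and the contention footprint is unchanged.
ALTER TABLE kv ADD COLUMN note STRING DEFAULT 'n/a';
```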
I'm having trouble squaring these in my understanding: if it were just the fact that we have a secondary index, then wouldn't the latency linger post-backfill? My understanding from #36925 was that it was only _during_ the backfill that we see issues?
I mentioned this in person last week, but I think this investigation would be well served by a trace of a query which sees the latency spike, run before, during, and after the backfill, that shows exactly where we are spending the extra time.
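For reference, one way to capture such a trace is session tracing; the `UPDATE` below is just a stand-in for whichever tpcc query shows the spike:

```sql
-- Trace one affected statement; comparing traces taken before, during, and
-- after the backfill should show where the extra time goes.
SET tracing = on;
UPDATE warehouse SET w_ytd = w_ytd + 100.0 WHERE w_id = 1;  -- stand-in statement
SET tracing = off;
SHOW TRACE FOR SESSION;
```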
That's reasonable. You're right that if adding the index is going to cause problems, keeping the backfill out of the way is not going to solve anything. When I entered the conversation my mindset had been on the primary key changes, where we're going to remove the old index after the backfill.
I was having trouble collecting traces with Jaeger due to memory issues. At least for primary key changes the old primary index is dropped, so after the backfill everything calms down. I'll try again while keeping the storing index around.
What's our thinking now that we've completed additional investigations here?
We have marked this issue as stale because it has been inactive for |
Investigation into #44501 and #44504 has uncovered that the core of the problem seems to be creating the index that stores all the columns.
To reproduce:
After the ramp period, run in another shell
After some time, large p99 latency spikes can be witnessed, sometimes going up to multiple seconds.
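A hedged sketch of the kind of schema change involved, issued while something like `cockroach workload run tpcc --warehouses=2000` is driving the cluster; the choice of the `district` table and the exact STORING column list are assumptions for illustration, not the original repro's statement:

```sql
-- A secondary index that stores (covers) the table's remaining columns,
-- similar to what a primary key change builds under the hood.
CREATE INDEX district_covering_idx
    ON district (d_w_id, d_id)
    STORING (d_name, d_street_1, d_street_2, d_city, d_state, d_zip,
             d_tax, d_ytd, d_next_o_id);
```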
Epic CRDB-8816
Jira issue: CRDB-5120