
Using index for NULL values is slower than full table scan (not using index) #62963

Closed
kjlubick opened this issue Apr 1, 2021 · 11 comments

Labels: C-bug (Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior.), O-community (Originated from the community), X-blathers-triaged (blathers was able to find an owner)

Comments


kjlubick commented Apr 1, 2021

Describe the problem

I have a table (TiledTraceDigests) with ~20 million rows. I realized the table needed another column (grouping_id) and an index on the new column, so I added them with ALTER TABLE and then CREATE INDEX.

I then needed to fill in this column with data by looking it up in another table (Traces), so I ran several updates like:

UPDATE TiledTraceDigests SET (grouping_id) = (
  SELECT grouping_id FROM Traces WHERE Traces.trace_id = TiledTraceDigests.trace_id)
WHERE grouping_id is NULL limit 1000000;

The LIMIT of 1 million was to keep each update from taking too long or having to be retried if new data came in.
This went fine for the first 16 or so updates, each taking around 60 seconds. Then an update suddenly stalled, running for over 15 minutes before I killed it from the UI.

I tried running the same update command with LIMIT 5; instead of completing in tens or hundreds of milliseconds, it took over 10 seconds to update 5 rows.

I used EXPLAIN ANALYZE to see where the time was going. It was blocked on getting rows from TiledTraceDigests.

Here are some interesting queries:

  • select * From TiledTraceDigests@grouping_digest_idx where grouping_id is null limit 10; took 6.7 seconds
  • select * From TiledTraceDigests@primary where grouping_id is null limit 10; took 5.3 seconds
  • select * From TiledTraceDigests@grouping_digest_idx where grouping_id = x'a181394e13962c65455837cbdd3a8da8' limit 10; took 4 milliseconds (as I would expect).

It appears that querying the NULL portion of this index is very slow.
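For reference, the EXPLAIN ANALYZE mentioned above was run against the slow scan, roughly like this (a sketch; the full output is in the statement bundle attached below):

EXPLAIN ANALYZE
  SELECT * FROM TiledTraceDigests@grouping_digest_idx
  WHERE grouping_id IS NULL
  LIMIT 10;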

I first noticed this on v20.2.3, but the problem appears to persist after updating to v20.2.7.

Expected behavior
I expect querying a few rows from an index to be fast (milliseconds), not slower than a full table scan that avoids the index.

Additional data / screenshots
SQL Schemas:

CREATE TABLE IF NOT EXISTS TiledTraceDigests (
  trace_id BYTES,
  tile_id INT4,
  digest BYTES NOT NULL,
  -- The following column was added with an ALTER TABLE
  -- grouping_id BYTES NOT NULL,
  PRIMARY KEY (trace_id, tile_id, digest)
  -- The following index was added after the alter table
  -- INDEX grouping_digest_idx (grouping_id, digest)
);
CREATE TABLE IF NOT EXISTS Traces (
  trace_id BYTES PRIMARY KEY,
  corpus STRING AS (keys->>'source_type') STORED NOT NULL,
  grouping_id BYTES NOT NULL,
  keys JSONB NOT NULL,
  matches_any_ignore_rule BOOL,
  INDEX grouping_ignored_idx (grouping_id, matches_any_ignore_rule),
  INDEX ignored_grouping_idx (matches_any_ignore_rule, grouping_id),
  INVERTED INDEX keys_idx (keys)
);
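The schema change described above would have been roughly the following (a sketch; the exact statements aren't shown here, and note that the column was actually added without the NOT NULL constraint, as clarified later in the thread):

ALTER TABLE TiledTraceDigests ADD COLUMN grouping_id BYTES;
CREATE INDEX grouping_digest_idx ON TiledTraceDigests (grouping_id, digest);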

I've attached the zip file taken from Statement Diagnostics in the UI.
stmt-bundle-646243008118882309.zip

Environment:

  • CockroachDB version: 20.2.3 (where I first noticed this; the problem persists after updating to 20.2.7)
  • Server OS: Debian/Kubernetes
  • Client app: cockroach sql

Additional context
What was the impact?

My new column is only partially filled out, and I'm not sure how long it will take me to finish.

@kjlubick kjlubick added the C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. label Apr 1, 2021
@blathers-crl blathers-crl bot added O-community Originated from the community X-blathers-triaged blathers was able to find an owner labels Apr 1, 2021

kjlubick commented Apr 1, 2021

FWIW: Deleting and recreating the index appears to have made the "WHERE grouping_id IS NULL" queries fast again. Not a great solution, but a solution.
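Roughly what that workaround looks like (a sketch, assuming the index definition from the schema above):

DROP INDEX TiledTraceDigests@grouping_digest_idx;
CREATE INDEX grouping_digest_idx ON TiledTraceDigests (grouping_id, digest);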

@mgartner mgartner self-assigned this Apr 6, 2021
RaduBerinde (Member) commented:

> select * From TiledTraceDigests@grouping_digest_idx where grouping_id is null limit 10; took 6.7 seconds
> select * From TiledTraceDigests@primary where grouping_id is null limit 10; took 5.3 seconds

These queries should result in a contradiction when we generate index constraints (since the column is defined as NOT NULL). We may have a bug where the contradiction becomes a full table scan because there are no spans.


kjlubick commented Apr 6, 2021

Oh, sorry, I should clarify something: the grouping_id BYTES column was added, but without the NOT NULL constraint. I planned to add that constraint after I had filled in all the data.
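A sketch of that follow-up constraint, to be added once the backfill is complete:

ALTER TABLE TiledTraceDigests ALTER COLUMN grouping_id SET NOT NULL;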


mgartner commented Apr 6, 2021

@kjlubick Are you running multiple UPDATEs concurrently? When you timed those three "interesting" queries, were there any ongoing UPDATEs?


kjlubick commented Apr 7, 2021

Yes, there would have been multiple INSERTs to that table in parallel. I noticed that if an INSERT happened during my query, it would take approximately 2 to 3 times as long (retries, I presume). The data I provided was captured when no INSERTs were being executed.


kjlubick commented Apr 7, 2021

FWIW, those INSERTs were something like
INSERT INTO TiledTraceDigests (trace_id, tile_id, digest) VALUES ($1, $2, $3), ($4, $5, $6)... ON CONFLICT DO NOTHING, executed via crdbpgx.ExecuteTx to retry retryable errors. The batch size was up to 200 rows per INSERT.
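Written out, that insert shape is roughly the following (placeholders illustrative; each statement was batched and retried via crdbpgx.ExecuteTx):

INSERT INTO TiledTraceDigests (trace_id, tile_id, digest)
VALUES ($1, $2, $3), ($4, $5, $6)  -- ... up to ~200 rows per statement
ON CONFLICT DO NOTHING;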


mgartner commented Apr 7, 2021

Contention may be the culprit: scanning for rows where grouping_id is null would contend with the inserts you describe. However, I don't see how dropping and recreating the index would have improved performance in this case. Has performance degraded at all since recreating the index?


kjlubick commented Apr 7, 2021

Performance appears normal after recreating the index:

> select * From TiledTraceDigests@primary where grouping_id is null limit 10;
  trace_id | tile_id | digest | grouping_id
-----------+---------+--------+--------------
(0 rows)

Time: 21.636s total (execution 21.535s / network 0.101s)

> select * From TiledTraceDigests@grouping_digest_idx where grouping_id is null limit 10;
  trace_id | tile_id | digest | grouping_id
-----------+---------+--------+--------------
(0 rows)

Time: 103ms total (execution 102ms / network 1ms)

There are currently about 22 million rows in TiledTraceDigests as I write this.


mgartner commented Apr 7, 2021

Another possibility is that you ran into #54029. If so, that first UPDATE of 1 million rows could have created general slowness for the table.

Even if it's unrelated to #54029, I'd suggest reducing the number of rows updated in each batch to around 10k. You should get more consistent performance by doing so.
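A sketch of the original UPDATE with that smaller batch size, run repeatedly until it reports 0 rows updated:

UPDATE TiledTraceDigests SET (grouping_id) = (
  SELECT grouping_id FROM Traces WHERE Traces.trace_id = TiledTraceDigests.trace_id)
WHERE grouping_id IS NULL LIMIT 10000;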


mgartner commented Apr 9, 2021

@kjlubick I'm going to close this issue for now because it's unlikely we'll get to the bottom of this unless you encounter it again. Our best guess is that it's related to #54029, and reducing the batch size of the updates to ~10k should mitigate it. Please leave a comment if you see it again, and we can try to diagnose it.

@mgartner mgartner closed this as completed Apr 9, 2021
@mgartner mgartner moved this to Done in SQL Queries Jul 24, 2023