
The --exact-rowcount option is buggy on large tables #830

Closed
zhxnlai opened this issue Mar 10, 2020 · 4 comments · Fixed by #846

Comments

zhxnlai commented Mar 10, 2020

Context

During a schema change, gh-ost copies every row from the original table to a ghost table. This results in a lock on the original table and a long-running query similar to the following:

mysql> show open tables where in_use>0;
+------------+----------------------+--------+-------------+
| Database   | Table                | In_use | Name_locked |
+------------+----------------------+--------+-------------+
| my_service | my_table             |      1 |           0 |
+------------+----------------------+--------+-------------+
1 row in set (0.00 sec)

mysql> show full processlist;
...
| 1594199 | my_service_ddl | 10.137.166.91:35626  | my_service         | Query       |    0 | Sending data                                                  | select  /* gh-ost `my_service`.`my_table` iteration:108 */
						`id`
					from
						`my_service`.`my_table`
					where ((`id` > _binary'1131891')) and ((`id` < _binary'310834793') or ((`id` = _binary'310834793')))
					order by
						`id` asc
					limit 1
					offset 9999 |

It appears that when --exact-rowcount is enabled, gh-ost grabs another lock on the original table and starts an additional query to calculate a more accurate progress ETA. By default, this row counting happens in parallel with the row copy.

mysql> show open tables where in_use>0;
+------------+----------------------+--------+-------------+
| Database   | Table                | In_use | Name_locked |
+------------+----------------------+--------+-------------+
| my_service | my_table             |      2 |           0 |
+------------+----------------------+--------+-------------+
1 row in set (0.00 sec)

mysql> show full processlist;
...
| 1594046 | my_service_ddl | 10.137.166.91:57666  | my_service         | Query       |   71 | Sending data                  | select /* gh-ost */ count(*) as rows from `my_service`.`my_table`                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                |
...

Problem

After all the rows have been copied to the ghost table, gh-ost attempts to cut over the original table.

Usually, the row-counting query finishes before the row copy does. At cut-over time, both queries have finished, so no lock remains on the original table and the schema change succeeds.

However, we repeatedly found that on large tables (> 400 GB) the row-counting query takes longer than the row copy. At cut-over time, the row-counting query is still running and still holds a lock on the original table.

This causes the schema change to fail with "Error 1205: Lock wait timeout exceeded; try restarting transaction", as described in #773.

Proposals

If you run into this issue, try disabling --exact-rowcount.

Improvements that we could make

  • Document this bug; add a warning.
  • Teach gh-ost to terminate the row counting query before cutting over the original table.

shlomi-noach commented Mar 11, 2020

the row counting query takes longer than row copying. At the time of cut-over, the row counting query is still running and holding a lock on the original table.

Very interesting -- thank you for this analysis! I've never seen this scenario happen, and once you point it out, it makes sense.

I'm not the owner of this repository at this time, but would you consider a pull request to update the docs? If not, someone else would hopefully do it, and hopefully also fix the code.

ajm188 added a commit to ajm188/gh-ost that referenced this issue May 18, 2020
Closes github#830. Switches from using `QueryRow` to `QueryRowContext`, and
stores a context.CancelFunc in the migration context, which is called to
halt any running row count query before beginning the cut over.

timvaillancourt commented Jun 17, 2021

@zhxnlai / @shlomi-noach: I'm curious whether having support for cancelling the row-count query would solve this issue?

Some ideas that come to mind:

  • Passing a cancellable context.Context to the row-count query
    • Cancel the row-count context before cut-over
  • Running a KILL <query id> on the row-count query (feels more hacky)

For now I think documenting this limitation sounds good; this issue doesn't come up often.

@shlomi-noach

Cancel the row-count context before cut-over

This will not kill the query. The only way, as far as I can tell, is to KILL <query-id>, which is indisputably hacky.

@timvaillancourt

Cancel the row-count context before cut-over

This will not kill the query. The only way, as far as I can tell, is to KILL <query-id>, which is indisputably hacky.

Ahh yes, the query won't be killed on the server 🤔👍. Sounds like documentation is the best approach.

timvaillancourt added a commit that referenced this issue Jul 6, 2022
* Cancel any row count queries before attempting to cut over

Closes #830. Switches from using `QueryRow` to `QueryRowContext`, and
stores a context.CancelFunc in the migration context, which is called to
halt any running row count query before beginning the cut over.

* Make it threadsafe

* Kill the count query on the database side as well

* Explicitly grab a connection to run the count, store its connection id
* When the query context is canceled, run a `KILL QUERY ?` on that connection id

* Rewrite these to use the threadsafe functions, stop exporting the cancel func

* Update logger

* Update logger

Co-authored-by: Tim Vaillancourt <[email protected]>
Co-authored-by: Tim Vaillancourt <[email protected]>
Co-authored-by: dm-2 <[email protected]>
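The sequence described in these commit bullets can be sketched as below. This is an illustrative reconstruction, not gh-ost's actual code: countRowsKillable and killStatement are hypothetical names, and a live MySQL server is assumed for everything except killStatement itself.

```go
package main

import (
	"context"
	"database/sql"
	"fmt"
)

// killStatement renders the server-side kill for a given connection id
// (the merged fix runs the equivalent of `KILL QUERY ?`).
func killStatement(connID int64) string {
	return fmt.Sprintf("KILL QUERY %d", connID)
}

// countRowsKillable pins one connection for the COUNT(*), records its
// CONNECTION_ID(), and kills the query server-side if ctx is canceled.
func countRowsKillable(ctx context.Context, db *sql.DB, table string) (int64, error) {
	conn, err := db.Conn(ctx)
	if err != nil {
		return 0, err
	}
	defer conn.Close()

	var connID int64
	if err := conn.QueryRowContext(ctx, "SELECT CONNECTION_ID()").Scan(&connID); err != nil {
		return 0, err
	}

	// On cancellation, issue the KILL from the pool, i.e. a *different*
	// connection: the pinned one is busy running the count.
	done := make(chan struct{})
	defer close(done)
	go func() {
		select {
		case <-ctx.Done():
			db.Exec(killStatement(connID))
		case <-done: // count finished normally; nothing to kill
		}
	}()

	var n int64
	query := fmt.Sprintf("select /* gh-ost */ count(*) as rows from %s", table)
	err = conn.QueryRowContext(ctx, query).Scan(&n)
	return n, err
}

func main() {
	fmt.Println(killStatement(1594046)) // KILL QUERY 1594046
}
```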
dm-2 added a commit that referenced this issue Jul 7, 2022
RainbowDashy pushed a commit to RainbowDashy/gh-ost that referenced this issue Jul 13, 2022
d-bytebase pushed a commit to bytebase/gh-ost that referenced this issue Jul 13, 2022