
The --exact-rowcount option is buggy on large tables #830

Closed
zhxnlai opened this issue Mar 10, 2020 · 4 comments · Fixed by #846

Comments

zhxnlai commented Mar 10, 2020

Context

During a schema change, gh-ost copies every row from the original table to a ghost table. This results in a lock on the original table and a long-running query similar to the following:

mysql> show open tables where in_use>0;
+------------+----------------------+--------+-------------+
| Database   | Table                | In_use | Name_locked |
+------------+----------------------+--------+-------------+
| my_service | my_table             |      1 |           0 |
+------------+----------------------+--------+-------------+
1 row in set (0.00 sec)

mysql> show full processlist;
...
| 1594199 | my_service_ddl | 10.137.166.91:35626  | my_service         | Query       |    0 | Sending data                                                  | select  /* gh-ost `my_service`.`my_table` iteration:108 */
						`id`
					from
						`my_service`.`my_table`
					where ((`id` > _binary'1131891')) and ((`id` < _binary'310834793') or ((`id` = _binary'310834793')))
					order by
						`id` asc
					limit 1
					offset 9999 |

It appears that when --exact-rowcount is enabled, gh-ost grabs another lock on the original table and starts an additional query to calculate a more accurate progress ETA. By default, this row counting happens in parallel with the row copy.

mysql> show open tables where in_use>0;
+------------+----------------------+--------+-------------+
| Database   | Table                | In_use | Name_locked |
+------------+----------------------+--------+-------------+
| my_service | my_table             |      2 |           0 |
+------------+----------------------+--------+-------------+
1 row in set (0.00 sec)

mysql> show full processlist;
...
| 1594046 | my_service_ddl | 10.137.166.91:57666  | my_service         | Query       |   71 | Sending data                  | select /* gh-ost */ count(*) as rows from `my_service`.`my_table`                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                |
...

Problem

After all the rows have been copied to the ghost table, gh-ost attempts to cut over the original table.

Usually, the row-counting query finishes before the row copy does. At cut-over time, both queries have finished, so no lock remains on the original table and the schema change succeeds.

However, we repeatedly found that on large tables (> 400 GB) the row-counting query takes longer than the row copy. At cut-over time, the row-counting query is still running and still holds a lock on the original table.

This causes the schema change to fail with "Error 1205: Lock wait timeout exceeded; try restarting transaction", as described in #773.

Proposals

If you run into this issue, try disabling --exact-rowcount.

Improvements that we could make

  • Document this bug; add a warning.
  • Teach gh-ost to terminate the row counting query before cutting over the original table.

shlomi-noach commented Mar 11, 2020

the row counting query takes longer than row copying. At the time of cut-over, the row counting query is still running and holding a lock on the original table.

Very interesting -- thank you for this analysis! I've never seen this scenario happen, and once you point it out, it makes sense.

I'm not the owner of this repository at this time, but would you consider a pull request to update the docs? If not, someone else would hopefully do it, and hopefully also fix the code.

ajm188 added a commit to ajm188/gh-ost that referenced this issue May 18, 2020
Closes github#830. Switches from using `QueryRow` to `QueryRowContext`, and
stores a context.CancelFunc in the migration context, which is called to
halt any running row count query before beginning the cut over.

timvaillancourt commented Jun 17, 2021

@zhxnlai / @shlomi-noach: I'm curious whether having support for cancelling the row-count query would solve this issue?

Some ideas that come to mind:

  • Passing a cancellable context.Context to the row-count query
    • Cancel the row-count context before cut-over
  • Running a KILL <query id> on the row-count query (feels more hacky)

For now I think documenting this limitation sounds good; this issue doesn't come up often.

@shlomi-noach

Cancel the row-count context before cut-over

This will not kill the query. The only way, as far as I can tell, is to KILL <query-id>, which is indisputably hacky.

@timvaillancourt

Cancel the row-count context before cut-over

This will not kill the query. The only way, as far as I can tell, is to KILL <query-id>, which is indisputably hacky.

Ahh yes, the query won't be killed on the server 🤔👍. Sounds like documentation is the best approach.

timvaillancourt added a commit that referenced this issue Jul 6, 2022
* Cancel any row count queries before attempting to cut over

Closes #830. Switches from using `QueryRow` to `QueryRowContext`, and
stores a context.CancelFunc in the migration context, which is called to
halt any running row count query before beginning the cut over.

* Make it threadsafe

* Kill the count query on the database side as well

* Explicitly grab a connection to run the count, store its connection id
* When the query context is canceled, run a `KILL QUERY ?` on that connection id

* Rewrite these to use the threadsafe functions, stop exporting the cancel func

* Update logger

* Update logger

Co-authored-by: Tim Vaillancourt <[email protected]>
Co-authored-by: Tim Vaillancourt <[email protected]>
Co-authored-by: dm-2 <[email protected]>
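The sequence described in these commit bullets can be sketched as below. This is an illustrative reconstruction, not gh-ost's actual code: countRowsKillable and killStatement are hypothetical names, and a live MySQL server is assumed for everything except killStatement itself.

```go
package main

import (
	"context"
	"database/sql"
	"fmt"
)

// killStatement renders the server-side kill for a given connection id
// (the merged fix runs the equivalent of `KILL QUERY ?`).
func killStatement(connID int64) string {
	return fmt.Sprintf("KILL QUERY %d", connID)
}

// countRowsKillable pins one connection for the COUNT(*), records its
// CONNECTION_ID(), and kills the query server-side if ctx is canceled.
func countRowsKillable(ctx context.Context, db *sql.DB, table string) (int64, error) {
	conn, err := db.Conn(ctx)
	if err != nil {
		return 0, err
	}
	defer conn.Close()

	var connID int64
	if err := conn.QueryRowContext(ctx, "SELECT CONNECTION_ID()").Scan(&connID); err != nil {
		return 0, err
	}

	// On cancellation, issue the KILL from the pool, i.e. a *different*
	// connection: the pinned one is busy running the count.
	done := make(chan struct{})
	defer close(done)
	go func() {
		select {
		case <-ctx.Done():
			db.Exec(killStatement(connID))
		case <-done: // count finished normally; nothing to kill
		}
	}()

	var n int64
	query := fmt.Sprintf("select /* gh-ost */ count(*) as rows from %s", table)
	err = conn.QueryRowContext(ctx, query).Scan(&n)
	return n, err
}

func main() {
	fmt.Println(killStatement(1594046)) // KILL QUERY 1594046
}
```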
dm-2 added a commit that referenced this issue Jul 7, 2022
RainbowDashy pushed a commit to RainbowDashy/gh-ost that referenced this issue Jul 13, 2022
d-bytebase pushed a commit to bytebase/gh-ost that referenced this issue Jul 13, 2022