Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

flaky onlineddl tests: reduce -online_ddl_check_interval #7847

Merged
merged 31 commits into from
Apr 20, 2021

Conversation

shlomi-noach
Copy link
Contributor

Description

Solves #7834

Not sure how this happened, will later look in git blame and history, but in the tests suite, vtctld should run with a very short -online_ddl_check_interval. For some reason the flag didn't appear in any of the online ddl tests. It must have changes somehow over time, looking into. anyway, this should resolve the current CI failures seen mostly in onlineddl_ghost but also in other onlineddl_* tests.

Related Issue(s)

Checklist

  • Should this PR be backported?
  • Tests were added or are not required
  • Documentation was added or is not required

Deployment Notes

Impacted Areas in Vitess

Components that this PR will affect:

  • Query Serving
  • VReplication
  • Cluster Management
  • Build/CI
  • VTAdmin

@shlomi-noach
Copy link
Contributor Author

Hmmm actually the problem is more specific to onlineddl_ghost and the fact vttablet (not vtctld) was missing some important flags:

https://github.com/vitessio/vitess/pull/7847/files#diff-2bb5e78bd4cd6cdb8f66836e671f20da41eb0fd6664f550e395cc71657e86b76R155-R159

Signed-off-by: Shlomi Noach <[email protected]>
@shlomi-noach shlomi-noach requested a review from a team April 13, 2021 09:02
@shlomi-noach
Copy link
Contributor Author

Copy link
Contributor

@rohit-nayak-ps rohit-nayak-ps left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm

@shlomi-noach
Copy link
Contributor Author

ugh, now failing: https://github.com/vitessio/vitess/pull/7847/checks?check_run_id=2331837611 . I'll keep looking into.

@shlomi-noach
Copy link
Contributor Author

The nature of this error is unclear: Error querying information_schema.tables for schema vt_ks: dial tcp 10.1.0.5:16009: i/o timeout, and I feel like it's similar to other flaky errors we're seeing with network issues.

@shlomi-noach shlomi-noach marked this pull request as draft April 13, 2021 11:16
@shlomi-noach
Copy link
Contributor Author

onlineddl_ghost continues to frequently fail (seldomly passes). I can't figure it out.

Now, it fails on a nonsensical issue: the very first gh-ost test fails not because gh-ost has errors. It fails because the test, 20sec after submittion, is still in ready state (you'd expect it to be complete, or failed or even running by that time!).

This does not reproduce locally, ever. All onlineddl_* tests consistently pass locally.

Also, once in a while, another onlineddl_* test joins the party and fails, but with a different type of error:

Error querying information_schema.tables for schema vt_ks: dial tcp 10.1.0.5:16009: i/o timeout

This feels like resource exhaustion. I have no other explanation to this behavior.

@rohit-nayak-ps rohit-nayak-ps force-pushed the flaky-tests-online-ddl branch from bc1ffd6 to d7f0e8d Compare April 14, 2021 23:09
@rohit-nayak-ps
Copy link
Contributor

rebased and removed the os tune up commit since we have a different fix for CI now

@rohit-nayak-ps
Copy link
Contributor

@shlomi-noach: tests are passing. I see you moved it to Draft: is it ready to be merged?

@shlomi-noach
Copy link
Contributor Author

I moved it to Draft because tests went crazy the other day and I wasn't sure anymore whether this PR helps at all. I'm gonna merge master and if tests pass again, I'll merge the PR.

Signed-off-by: Shlomi Noach <[email protected]>
Signed-off-by: Shlomi Noach <[email protected]>
Signed-off-by: Shlomi Noach <[email protected]>
Signed-off-by: Shlomi Noach <[email protected]>
Signed-off-by: Shlomi Noach <[email protected]>
@shlomi-noach
Copy link
Contributor Author

Some update on where onlineddl_ghost test fails: it gets as far as running gh-ost in dry-run mode and then goes blank.

In gh-ost migration there's three distinct gh-ost executions:

  • gh-ost --version, to vlaidate the binary itself is good. This passes.
  • gh-ost --alter=... in dry-run mode. This never completes.
  • gh-ost --alter=... --execute, This is never reached because the previous run fails.

I invested time to verify/deby that the problem is with gh-ost hooks. Those call tablet server via HTTP API. So I thought this correlates to problems we saw with local_example. It isn't the issue. I completely disabled hooks and the behavior is unchanged.
So we know gh-ost begins to run, but never completes. Why? What makes it block? No information yet, but that's where I'll look into next.

Reiterating that all these tests pass just fine on multiple environemtns, just no in GitHub CI. BTW, the test passes in #7850. So I'm still looking for a platform problem. Sockets/files/other resources.

Signed-off-by: Shlomi Noach <[email protected]>
Signed-off-by: Shlomi Noach <[email protected]>
@shlomi-noach
Copy link
Contributor Author

catting the gh-ost logs show that the problem is, after all, the hooks. Specifically, gh-ost runs hooks that report back to vttablet via curl, ie accessing vttablet's HTTP API:

2021-04-19 06:21:05 INFO starting gh-ost 33516f4955c3df5c06e975854a0c3e00f6b52112
2021-04-19 06:21:05 INFO Migrating `vt_ks`.`vt_onlineddl_test_03`
2021-04-19 06:21:05 INFO executing gh-ost-on-startup hook: /tmp/online-ddl-69529787_a0d7_11eb_b97f_000d3a02b579-192278704/gh-ost-on-startup

The gh-ost-on-startup hook runs this (template):

#!/bin/bash

curl --max-time 3 -s 'http://localhost:%d/schema-migration/report-status?uuid=%s&status=%s&dryrun='"$GH_OST_DRY_RUN"'&progress='"$GH_OST_PROGRESS"'&eta='"$GH_OST_ETA_SECONDS"

Running even more experiments to double validate this.

Signed-off-by: Shlomi Noach <[email protected]>
@shlomi-noach
Copy link
Contributor Author

If I remove hooks, then the next thing gh-ost gets blocked on is opening a connection to the underlying database.

It's all about ports.

@shlomi-noach shlomi-noach marked this pull request as ready for review April 19, 2021 15:11
Copy link
Member

@deepthi deepthi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  • Was it intentional to checkin a new gh-ost binary?
  • When we merge this, we should squash-merge since so many of the commits were experimental.

@@ -12,6 +12,16 @@ jobs:
with:
go-version: 1.15

- name: Tune the OS
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we still need the port range fix? Or is the /etc/hosts fix sufficient?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This had immediate positive effect last week on local_example tests. I'll experiment without it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed, and tests are green!

@shlomi-noach
Copy link
Contributor Author

Was it intentional to checkin a new gh-ost binary?

No! It was experimental. Reverting.

When we merge this, we should squash-merge since so many of the commits were experimental.

Cool

@shlomi-noach
Copy link
Contributor Author

Removing the port range patch, 8f4444c, seems to look good. Tests are still passing without the patch.

@shlomi-noach shlomi-noach merged commit 2179d0c into vitessio:master Apr 20, 2021
@shlomi-noach shlomi-noach deleted the flaky-tests-online-ddl branch April 20, 2021 06:09
@shlomi-noach
Copy link
Contributor Author

Suqash commit: 2179d0c

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants